<?xml version="1.0" encoding="utf-8"?>
<feed xml:lang="en-us" xmlns="http://www.w3.org/2005/Atom"><title>Simon Willison's Weblog: git-history</title><link href="http://simonwillison.net/" rel="alternate"/><link href="http://simonwillison.net/tags/git-history.atom" rel="self"/><id>http://simonwillison.net/</id><updated>2022-11-20T07:00:54+00:00</updated><author><name>Simon Willison</name></author><entry><title>Tracking Mastodon user numbers over time with a bucket of tricks</title><link href="https://simonwillison.net/2022/Nov/20/tracking-mastodon/#atom-tag" rel="alternate"/><published>2022-11-20T07:00:54+00:00</published><updated>2022-11-20T07:00:54+00:00</updated><id>https://simonwillison.net/2022/Nov/20/tracking-mastodon/#atom-tag</id><summary type="html">
    &lt;p&gt;&lt;a href="https://joinmastodon.org/"&gt;Mastodon&lt;/a&gt; is definitely having a moment. User growth is skyrocketing as more and more people migrate over from Twitter.&lt;/p&gt;
&lt;p&gt;I've set up a new &lt;a href="https://simonwillison.net/2020/Oct/9/git-scraping/"&gt;git scraper&lt;/a&gt; to track the number of registered user accounts on known Mastodon instances over time.&lt;/p&gt;
&lt;p&gt;It's only been running for a few hours, but it's already collected enough data to &lt;a href="https://observablehq.com/@simonw/mastodon-users-and-statuses-over-time"&gt;render this chart&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2022/mastodon-users-few-hours.png" alt="The chart starts at around 1am with 4,694,000 users - it climbs to 4,716,000 users by 6am in a relatively straight line" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;I'm looking forward to seeing how this trend continues to develop over the next days and weeks.&lt;/p&gt;
&lt;h4&gt;Scraping the data&lt;/h4&gt;
&lt;p&gt;My scraper works by tracking &lt;a href="https://instances.social/"&gt;https://instances.social/&lt;/a&gt; - a website that lists a large number (but not all) of the Mastodon instances that are out there.&lt;/p&gt;
&lt;p&gt;That site publishes an &lt;a href="https://instances.social/instances.json"&gt;instances.json&lt;/a&gt; array which currently contains 1,830 objects representing Mastodon instances. Each of those objects looks something like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-json"&gt;&lt;pre&gt;{
    &lt;span class="pl-ent"&gt;"name"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;pleroma.otter.sh&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"title"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Otterland&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"short_description"&lt;/span&gt;: &lt;span class="pl-c1"&gt;null&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"description"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Otters does squeak squeak&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"uptime"&lt;/span&gt;: &lt;span class="pl-c1"&gt;0.944757&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"up"&lt;/span&gt;: &lt;span class="pl-c1"&gt;true&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"https_score"&lt;/span&gt;: &lt;span class="pl-c1"&gt;null&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"https_rank"&lt;/span&gt;: &lt;span class="pl-c1"&gt;null&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"ipv6"&lt;/span&gt;: &lt;span class="pl-c1"&gt;true&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"openRegistrations"&lt;/span&gt;: &lt;span class="pl-c1"&gt;false&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"users"&lt;/span&gt;: &lt;span class="pl-c1"&gt;5&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"statuses"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;54870&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"connections"&lt;/span&gt;: &lt;span class="pl-c1"&gt;9821&lt;/span&gt;,
}&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;I have &lt;a href="https://github.com/simonw/scrape-instances-social/blob/main/.github/workflows/scrape.yml"&gt;a GitHub Actions workflow&lt;/a&gt; running approximately every 20 minutes that fetches a copy of that file and commits it back to this repository:&lt;/p&gt;
&lt;p&gt;&lt;a href="https://github.com/simonw/scrape-instances-social"&gt;https://github.com/simonw/scrape-instances-social&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Since each instance includes a &lt;code&gt;users&lt;/code&gt; count, the commit history of my &lt;code&gt;instances.json&lt;/code&gt; file tells the story of Mastodon's growth over time.&lt;/p&gt;
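&lt;p&gt;That headline number is just a sum over the array. Here's a minimal sketch of the calculation, using the field shape shown above (the second instance is invented, to show how a &lt;code&gt;null&lt;/code&gt; users count is handled):&lt;/p&gt;

```python
# Sketch: compute the headline number from one instances.json snapshot.
# Field shape follows the instances.social example above; "users" can be null.
import json

snapshot = json.dumps([
    {"name": "pleroma.otter.sh", "users": 5, "statuses": "54870"},
    {"name": "example.social", "users": None, "statuses": "120"},
])

instances = json.loads(snapshot)
# Treat a null users count as zero
total_users = sum(d["users"] or 0 for d in instances)
print(total_users)  # 5
```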
&lt;h4&gt;Building a database&lt;/h4&gt;
&lt;p&gt;A commit log of a JSON file is interesting, but the next step is to turn that into actionable information.&lt;/p&gt;
&lt;p&gt;My &lt;a href="https://simonwillison.net/2021/Dec/7/git-history/"&gt;git-history tool&lt;/a&gt; is designed to do exactly that.&lt;/p&gt;
&lt;p&gt;For the chart up above, the only number I care about is the total number of users listed in each snapshot of the file - the sum of that &lt;code&gt;users&lt;/code&gt; field for each instance.&lt;/p&gt;
&lt;p&gt;Here's how to run &lt;code&gt;git-history&lt;/code&gt; against that file's commit history to generate tables showing how that count has changed over time:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;git-history file counts.db instances.json \
  --convert &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;return [&lt;/span&gt;
&lt;span class="pl-s"&gt;    {&lt;/span&gt;
&lt;span class="pl-s"&gt;        'id': 'all',&lt;/span&gt;
&lt;span class="pl-s"&gt;        'users': sum(d['users'] or 0 for d in json.loads(content)),&lt;/span&gt;
&lt;span class="pl-s"&gt;        'statuses': sum(int(d['statuses'] or 0) for d in json.loads(content)),&lt;/span&gt;
&lt;span class="pl-s"&gt;    }&lt;/span&gt;
&lt;span class="pl-s"&gt;  ]&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; --id id&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;I'm creating a file called &lt;code&gt;counts.db&lt;/code&gt; that shows the history of the &lt;code&gt;instances.json&lt;/code&gt; file.&lt;/p&gt;
&lt;p&gt;The real trick here though is that &lt;code&gt;--convert&lt;/code&gt; argument. I'm using that to compress each snapshot down to a single row that looks like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-json"&gt;&lt;pre&gt;{
    &lt;span class="pl-ent"&gt;"id"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;all&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"users"&lt;/span&gt;: &lt;span class="pl-c1"&gt;4717781&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"statuses"&lt;/span&gt;: &lt;span class="pl-c1"&gt;374217860&lt;/span&gt;
}&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Normally &lt;code&gt;git-history&lt;/code&gt; expects to work against an array of objects, tracking the history of changes to each one based on their &lt;code&gt;id&lt;/code&gt; property.&lt;/p&gt;
&lt;p&gt;Here I'm tricking it a bit - I only return a single object with the ID of &lt;code&gt;all&lt;/code&gt;. This means that &lt;code&gt;git-history&lt;/code&gt; will only track the history of changes to that single object.&lt;/p&gt;
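&lt;p&gt;Written out as a plain Python function, the &lt;code&gt;--convert&lt;/code&gt; snippet does something like this - &lt;code&gt;content&lt;/code&gt; is the variable &lt;code&gt;git-history&lt;/code&gt; exposes holding each version of the file, and the sample data here is invented:&lt;/p&gt;

```python
# Sketch: the --convert snippet above as a standalone function.
import json

def convert(content):
    # Collapse a whole instances.json snapshot into one row with id "all".
    # Note "statuses" arrives as a string in the API, hence the int() call.
    return [
        {
            "id": "all",
            "users": sum(d["users"] or 0 for d in json.loads(content)),
            "statuses": sum(int(d["statuses"] or 0) for d in json.loads(content)),
        }
    ]

sample = json.dumps([
    {"users": 5, "statuses": "54870"},
    {"users": None, "statuses": "100"},
])
print(convert(sample))  # [{'id': 'all', 'users': 5, 'statuses': 54970}]
```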
&lt;p&gt;It works though! The result is a &lt;code&gt;counts.db&lt;/code&gt; file which is currently 52KB and has the following schema (truncated to the most interesting bits):&lt;/p&gt;
&lt;div class="highlight highlight-source-sql"&gt;&lt;pre&gt;CREATE TABLE [commits] (
   [id] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt; &lt;span class="pl-k"&gt;PRIMARY KEY&lt;/span&gt;,
   [namespace] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt; &lt;span class="pl-k"&gt;REFERENCES&lt;/span&gt; [namespaces]([id]),
   [hash] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt;,
   [commit_at] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt;
);
CREATE TABLE [item_version] (
   [_id] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt; &lt;span class="pl-k"&gt;PRIMARY KEY&lt;/span&gt;,
   [_item] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt; &lt;span class="pl-k"&gt;REFERENCES&lt;/span&gt; [item]([_id]),
   [_version] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt;,
   [_commit] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt; &lt;span class="pl-k"&gt;REFERENCES&lt;/span&gt; [commits]([id]),
   [id] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt;,
   [users] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt;,
   [statuses] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt;,
   [_item_full_hash] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt;
);&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Each &lt;code&gt;item_version&lt;/code&gt; row will tell us the number of users and statuses at a particular point in time, based on a join against that &lt;code&gt;commits&lt;/code&gt; table to find the &lt;code&gt;commit_at&lt;/code&gt; date.&lt;/p&gt;
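&lt;p&gt;Here's a sketch of that join, run with Python's &lt;code&gt;sqlite3&lt;/code&gt; module against a tiny in-memory copy of the schema - the table and column names come from the schema above, but the rows are made up:&lt;/p&gt;

```python
# Sketch: join item_version against commits to get users-over-time,
# using a stripped-down in-memory copy of the counts.db schema.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE commits (id INTEGER PRIMARY KEY, hash TEXT, commit_at TEXT);
CREATE TABLE item_version (
    _id INTEGER PRIMARY KEY, _commit INTEGER REFERENCES commits(id),
    id TEXT, users INTEGER, statuses INTEGER);
INSERT INTO commits VALUES (1, 'abc', '2022-11-20T01:00:00'),
                           (2, 'def', '2022-11-20T06:00:00');
INSERT INTO item_version (_commit, id, users, statuses)
    VALUES (1, 'all', 4694000, 374000000), (2, 'all', 4716000, 374200000);
""")
rows = db.execute("""
    SELECT commits.commit_at, item_version.users
    FROM item_version JOIN commits ON item_version._commit = commits.id
    ORDER BY commits.commit_at
""").fetchall()
print(rows)
```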
&lt;h4&gt;Publishing the database&lt;/h4&gt;
&lt;p&gt;For this project, I decided to publish the SQLite database to an S3 bucket. I considered pushing the binary SQLite file directly to the GitHub repository but this felt rude, since a binary file that changes every 20 minutes would bloat the repository.&lt;/p&gt;
&lt;p&gt;I wanted to serve the file with open CORS headers so I could load it into Datasette Lite and Observable notebooks.&lt;/p&gt;
&lt;p&gt;I used my &lt;a href="https://s3-credentials.readthedocs.io/"&gt;s3-credentials&lt;/a&gt; tool to create a bucket for this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;~ % s3-credentials create scrape-instances-social --public --website --create-bucket
Created bucket: scrape-instances-social
Attached bucket policy allowing public access
Configured website: IndexDocument=index.html, ErrorDocument=error.html
Created user: 's3.read-write.scrape-instances-social' with permissions boundary: 'arn:aws:iam::aws:policy/AmazonS3FullAccess'
Attached policy s3.read-write.scrape-instances-social to user s3.read-write.scrape-instances-social
Created access key for user: s3.read-write.scrape-instances-social
{
    "UserName": "s3.read-write.scrape-instances-social",
    "AccessKeyId": "AKIAWXFXAIOZI5NUS6VU",
    "Status": "Active",
    "SecretAccessKey": "...",
    "CreateDate": "2022-11-20 05:52:22+00:00"
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This created a new bucket called &lt;code&gt;scrape-instances-social&lt;/code&gt; configured to work as a website and allow public access.&lt;/p&gt;
&lt;p&gt;It also generated an access key and a secret access key with access to just that bucket. I saved these in GitHub Actions secrets called &lt;code&gt;AWS_ACCESS_KEY_ID&lt;/code&gt; and &lt;code&gt;AWS_SECRET_ACCESS_KEY&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;I enabled a CORS policy on the bucket like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;s3-credentials set-cors-policy scrape-instances-social
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then I added the following to my GitHub Actions workflow to build and upload the database after each run of the scraper:&lt;/p&gt;
&lt;div class="highlight highlight-source-yaml"&gt;&lt;pre&gt;    - &lt;span class="pl-ent"&gt;name&lt;/span&gt;: &lt;span class="pl-s"&gt;Build and publish database using git-history&lt;/span&gt;
      &lt;span class="pl-ent"&gt;env&lt;/span&gt;:
        &lt;span class="pl-ent"&gt;AWS_ACCESS_KEY_ID&lt;/span&gt;: &lt;span class="pl-s"&gt;${{ secrets.AWS_ACCESS_KEY_ID }}&lt;/span&gt;
        &lt;span class="pl-ent"&gt;AWS_SECRET_ACCESS_KEY&lt;/span&gt;: &lt;span class="pl-s"&gt;${{ secrets.AWS_SECRET_ACCESS_KEY }}&lt;/span&gt;
      &lt;span class="pl-ent"&gt;run&lt;/span&gt;: &lt;span class="pl-s"&gt;|-&lt;/span&gt;
&lt;span class="pl-s"&gt;        # First download previous database to save some time&lt;/span&gt;
&lt;span class="pl-s"&gt;        wget https://scrape-instances-social.s3.amazonaws.com/counts.db&lt;/span&gt;
&lt;span class="pl-s"&gt;        # Update with latest commits&lt;/span&gt;
&lt;span class="pl-s"&gt;        ./build-count-history.sh&lt;/span&gt;
&lt;span class="pl-s"&gt;        # Upload to S3&lt;/span&gt;
&lt;span class="pl-s"&gt;        s3-credentials put-object scrape-instances-social counts.db counts.db \&lt;/span&gt;
&lt;span class="pl-s"&gt;          --access-key $AWS_ACCESS_KEY_ID \&lt;/span&gt;
&lt;span class="pl-s"&gt;          --secret-key $AWS_SECRET_ACCESS_KEY&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;code&gt;git-history&lt;/code&gt; knows how to only process commits since the last time the database was built, so downloading the previous copy saves a lot of time.&lt;/p&gt;
&lt;h4&gt;Exploring the data&lt;/h4&gt;
&lt;p&gt;Now that I have a SQLite database that's being served over CORS-enabled HTTPS I can open it in &lt;a href="https://simonwillison.net/2022/May/4/datasette-lite/"&gt;Datasette Lite&lt;/a&gt; - my implementation of Datasette compiled to WebAssembly that runs entirely in a browser.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://lite.datasette.io/?url=https://scrape-instances-social.s3.amazonaws.com/counts.db"&gt;https://lite.datasette.io/?url=https://scrape-instances-social.s3.amazonaws.com/counts.db&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Any time anyone follows this link their browser will fetch the latest copy of the &lt;code&gt;counts.db&lt;/code&gt; file directly from S3.&lt;/p&gt;
&lt;p&gt;The most interesting page in there is the &lt;code&gt;item_version_detail&lt;/code&gt; SQL view, which joins against the commits table to show the date of each change:&lt;/p&gt;
&lt;p&gt;&lt;a href="https://lite.datasette.io/?url=https://scrape-instances-social.s3.amazonaws.com/counts.db#/counts/item_version_detail"&gt;https://lite.datasette.io/?url=https://scrape-instances-social.s3.amazonaws.com/counts.db#/counts/item_version_detail&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;(Datasette Lite lets you link directly to pages within Datasette itself via a &lt;code&gt;#hash&lt;/code&gt;.)&lt;/p&gt;
&lt;h4&gt;Plotting a chart&lt;/h4&gt;
&lt;p&gt;Datasette Lite doesn't have charting yet, so I decided to turn to my favourite visualization tool, an &lt;a href="https://observablehq.com/"&gt;Observable&lt;/a&gt; notebook.&lt;/p&gt;
&lt;p&gt;Observable has the ability to query SQLite databases (that are served via CORS) directly these days!&lt;/p&gt;
&lt;p&gt;Here's my notebook:&lt;/p&gt;
&lt;p&gt;&lt;a href="https://observablehq.com/@simonw/mastodon-users-and-statuses-over-time"&gt;https://observablehq.com/@simonw/mastodon-users-and-statuses-over-time&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;There are only four cells needed to create the chart shown above.&lt;/p&gt;
&lt;p&gt;First, we need to open the SQLite database from the remote URL:&lt;/p&gt;
&lt;div class="highlight highlight-source-js"&gt;&lt;pre&gt;&lt;span class="pl-s1"&gt;database&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-v"&gt;SQLiteDatabaseClient&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;open&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;
  &lt;span class="pl-s"&gt;"https://scrape-instances-social.s3.amazonaws.com/counts.db"&lt;/span&gt;
&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Next we need to use an Observable Database query cell to execute SQL against that database and pull out the data we want to plot - and store it in a &lt;code&gt;query&lt;/code&gt; variable:&lt;/p&gt;
&lt;div class="highlight highlight-source-sql"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;SELECT&lt;/span&gt; _commit_at &lt;span class="pl-k"&gt;as&lt;/span&gt; &lt;span class="pl-k"&gt;date&lt;/span&gt;, users, statuses
&lt;span class="pl-k"&gt;FROM&lt;/span&gt; item_version_detail&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;One change is needed to that data: the &lt;code&gt;date&lt;/code&gt; column must be converted from a string into a JavaScript &lt;code&gt;Date&lt;/code&gt; object:&lt;/p&gt;
&lt;div class="highlight highlight-source-js"&gt;&lt;pre&gt;&lt;span class="pl-s1"&gt;points&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;query&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;map&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s1"&gt;d&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt; &lt;span class="pl-c1"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-kos"&gt;{&lt;/span&gt;
  &lt;span class="pl-c1"&gt;date&lt;/span&gt;: &lt;span class="pl-k"&gt;new&lt;/span&gt; &lt;span class="pl-v"&gt;Date&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s1"&gt;d&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;date&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt;
  &lt;span class="pl-c1"&gt;users&lt;/span&gt;: &lt;span class="pl-s1"&gt;d&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;users&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt;
  &lt;span class="pl-c1"&gt;statuses&lt;/span&gt;: &lt;span class="pl-s1"&gt;d&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;statuses&lt;/span&gt;
&lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Finally, we can plot the data using the &lt;a href="https://observablehq.com/@observablehq/plot"&gt;Observable Plot&lt;/a&gt; charting library like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-js"&gt;&lt;pre&gt;&lt;span class="pl-v"&gt;Plot&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;plot&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-kos"&gt;{&lt;/span&gt;
  &lt;span class="pl-c1"&gt;y&lt;/span&gt;: &lt;span class="pl-kos"&gt;{&lt;/span&gt;
    &lt;span class="pl-c1"&gt;grid&lt;/span&gt;: &lt;span class="pl-c1"&gt;true&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt;
    &lt;span class="pl-c1"&gt;label&lt;/span&gt;: &lt;span class="pl-s"&gt;"Total users over time across all tracked instances"&lt;/span&gt;
  &lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt;
  &lt;span class="pl-c1"&gt;marks&lt;/span&gt;: &lt;span class="pl-kos"&gt;[&lt;/span&gt;&lt;span class="pl-v"&gt;Plot&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;line&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s1"&gt;points&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-kos"&gt;{&lt;/span&gt; &lt;span class="pl-c1"&gt;x&lt;/span&gt;: &lt;span class="pl-s"&gt;"date"&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-c1"&gt;y&lt;/span&gt;: &lt;span class="pl-s"&gt;"users"&lt;/span&gt; &lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;]&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt;
  &lt;span class="pl-c1"&gt;marginLeft&lt;/span&gt;: &lt;span class="pl-c1"&gt;100&lt;/span&gt;
&lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;I added 100px of margin to the left of the chart to ensure there was space for the large (4,696,000 and up) labels on the y-axis.&lt;/p&gt;
&lt;h4&gt;A bunch of tricks combined&lt;/h4&gt;
&lt;p&gt;This project combines a whole bunch of tricks I've been pulling together over the past few years:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://simonwillison.net/2020/Oct/9/git-scraping/"&gt;Git scraping&lt;/a&gt; is the technique I use to gather the initial data, turning a static listing of instances into a record of changes over time&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://datasette.io/tools/git-history"&gt;git-history&lt;/a&gt; is my tool for turning a scraped Git history into a SQLite database that's easier to work with&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://s3-credentials.readthedocs.io/"&gt;s3-credentials&lt;/a&gt; makes working with S3 buckets - in particular creating credentials that are restricted to just one bucket - much less frustrating&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://simonwillison.net/2022/May/4/datasette-lite/"&gt;Datasette Lite&lt;/a&gt; means that once you have a SQLite database online somewhere you can explore it in your browser - without having to run my full server-side &lt;a href="https://datasette.io/"&gt;Datasette&lt;/a&gt; Python application on a machine somewhere&lt;/li&gt;
&lt;li&gt;And finally, combining the above means I can take advantage of &lt;a href="https://observablehq.com/"&gt;Observable notebooks&lt;/a&gt; for ad-hoc visualization of data that's hosted online, in this case as a static SQLite database file served from S3&lt;/li&gt;
&lt;/ul&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/github"&gt;github&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/observable"&gt;observable&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/github-actions"&gt;github-actions&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git-scraping"&gt;git-scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git-history"&gt;git-history&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/s3-credentials"&gt;s3-credentials&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette-lite"&gt;datasette-lite&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/mastodon"&gt;mastodon&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/cors"&gt;cors&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="github"/><category term="projects"/><category term="datasette"/><category term="observable"/><category term="github-actions"/><category term="git-scraping"/><category term="git-history"/><category term="s3-credentials"/><category term="datasette-lite"/><category term="mastodon"/><category term="cors"/></entry><entry><title>Measuring traffic during the Half Moon Bay Pumpkin Festival</title><link href="https://simonwillison.net/2022/Oct/19/measuring-traffic/#atom-tag" rel="alternate"/><published>2022-10-19T15:41:09+00:00</published><updated>2022-10-19T15:41:09+00:00</updated><id>https://simonwillison.net/2022/Oct/19/measuring-traffic/#atom-tag</id><summary type="html">
    &lt;p&gt;This weekend was the &lt;a href="https://pumpkinfest.miramarevents.com/" rel="nofollow"&gt;50th annual Half Moon Bay Pumpkin Festival&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;We live in El Granada, a tiny town an 8-minute drive from Half Moon Bay. There is a single road (the coastal Highway 1) between the two towns, and the festival is locally notorious for its impact on traffic.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://twitter.com/natbat" rel="nofollow"&gt;Natalie&lt;/a&gt; suggested that we measure the traffic and try and see the impact for ourselves!&lt;/p&gt;
&lt;p&gt;Here's the end result for Saturday. Read on for details on how we created it.&lt;/p&gt;
&lt;p&gt;&lt;img alt="A chart showing the two lines over time" src="https://static.simonwillison.net/static/2022/pumpkin-saturday-smooth.png" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;h4&gt;&lt;a id="user-content-collecting-the-data" class="anchor" aria-hidden="true" href="#collecting-the-data"&gt;&lt;span aria-hidden="true" class="octicon octicon-link"&gt;&lt;/span&gt;&lt;/a&gt;Collecting the data&lt;/h4&gt;
&lt;p&gt;I built a &lt;a href="https://simonwillison.net/2020/Oct/9/git-scraping/" rel="nofollow"&gt;git scraper&lt;/a&gt; to gather data from the Google Maps &lt;a href="https://developers.google.com/maps/documentation/directions/overview" rel="nofollow"&gt;Directions API&lt;/a&gt;. It turns out if you pass &lt;code&gt;departure_time=now&lt;/code&gt; to that API it returns the current estimated time in traffic as part of the response.&lt;/p&gt;
&lt;p&gt;I picked a location in Half Moon Bay and a location in El Granada, then constructed the following URL (pretty-printed):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;https://maps.googleapis.com/maps/api/directions/json?
  origin=GG49%2BCH,%20Half%20Moon%20Bay%20CA
  &amp;amp;destination=FH78%2BQJ,%20Half%20Moon%20Bay,%20CA
  &amp;amp;departure_time=now
  &amp;amp;key=$GOOGLE_MAPS_KEY
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The two locations here are defined using Google Plus codes. Here they are on Google Maps:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.google.com/maps/search/FH78%2BQJ+Half+Moon+Bay,+CA,+USA" rel="nofollow"&gt;FH78+QJ Half Moon Bay, CA, USA&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.google.com/maps/search/GG49%2BCH+El+Granada+CA,+USA" rel="nofollow"&gt;GG49+CH El Granada CA, USA&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I constructed the reverse of the URL too, to track traffic in the other direction. Then I rigged up a scheduled GitHub Actions workflow in &lt;a href="https://github.com/simonw/scrape-hmb-traffic"&gt;this repository&lt;/a&gt; to fetch this API data, pretty-print it with &lt;code&gt;jq&lt;/code&gt; and write it to the repository:&lt;/p&gt;
&lt;div class="highlight highlight-source-yaml"&gt;&lt;pre&gt;&lt;span class="pl-ent"&gt;name&lt;/span&gt;: &lt;span class="pl-s"&gt;Scrape traffic&lt;/span&gt;

&lt;span class="pl-ent"&gt;on&lt;/span&gt;:
  &lt;span class="pl-ent"&gt;push&lt;/span&gt;:
  &lt;span class="pl-ent"&gt;workflow_dispatch&lt;/span&gt;:
  &lt;span class="pl-ent"&gt;schedule&lt;/span&gt;:
  - &lt;span class="pl-ent"&gt;cron&lt;/span&gt;:  &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;*/5 * * * *&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;

&lt;span class="pl-ent"&gt;jobs&lt;/span&gt;:
  &lt;span class="pl-ent"&gt;shot-scraper&lt;/span&gt;:
    &lt;span class="pl-ent"&gt;runs-on&lt;/span&gt;: &lt;span class="pl-s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="pl-ent"&gt;steps&lt;/span&gt;:
    - &lt;span class="pl-ent"&gt;uses&lt;/span&gt;: &lt;span class="pl-s"&gt;actions/checkout@v2&lt;/span&gt;
    - &lt;span class="pl-ent"&gt;name&lt;/span&gt;: &lt;span class="pl-s"&gt;Scrape&lt;/span&gt;
      &lt;span class="pl-ent"&gt;env&lt;/span&gt;:
        &lt;span class="pl-ent"&gt;GOOGLE_MAPS_KEY&lt;/span&gt;: &lt;span class="pl-s"&gt;${{ secrets.GOOGLE_MAPS_KEY }}&lt;/span&gt;
      &lt;span class="pl-ent"&gt;run&lt;/span&gt;: &lt;span class="pl-s"&gt;|        &lt;/span&gt;
&lt;span class="pl-s"&gt;        curl "https://maps.googleapis.com/maps/api/directions/json?origin=GG49%2BCH,%20Half%20Moon%20Bay%20CA&amp;amp;destination=FH78%2BQJ,%20Half%20Moon%20Bay,%20California&amp;amp;departure_time=now&amp;amp;key=$GOOGLE_MAPS_KEY" | jq &amp;gt; one.json&lt;/span&gt;
&lt;span class="pl-s"&gt;        sleep 3&lt;/span&gt;
&lt;span class="pl-s"&gt;        curl "https://maps.googleapis.com/maps/api/directions/json?origin=FH78%2BQJ,%20Half%20Moon%20Bay%20CA&amp;amp;destination=GG49%2BCH,%20Half%20Moon%20Bay,%20California&amp;amp;departure_time=now&amp;amp;key=$GOOGLE_MAPS_KEY" | jq &amp;gt; two.json&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;/span&gt;    - &lt;span class="pl-ent"&gt;name&lt;/span&gt;: &lt;span class="pl-s"&gt;Commit and push&lt;/span&gt;
      &lt;span class="pl-ent"&gt;run&lt;/span&gt;: &lt;span class="pl-s"&gt;|-&lt;/span&gt;
&lt;span class="pl-s"&gt;        git config user.name "Automated"&lt;/span&gt;
&lt;span class="pl-s"&gt;        git config user.email "actions@users.noreply.github.com"&lt;/span&gt;
&lt;span class="pl-s"&gt;        git add -A&lt;/span&gt;
&lt;span class="pl-s"&gt;        timestamp=$(date -u)&lt;/span&gt;
&lt;span class="pl-s"&gt;        git commit -m "${timestamp}" || exit 0&lt;/span&gt;
&lt;span class="pl-s"&gt;        git pull --rebase&lt;/span&gt;
&lt;span class="pl-s"&gt;        git push&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;I'm using a GitHub Actions secret called &lt;code&gt;GOOGLE_MAPS_KEY&lt;/code&gt; to store the Google Maps API key.&lt;/p&gt;
&lt;p&gt;This workflow runs every 5 minutes (more-or-less - GitHub Actions doesn't necessarily stick to the schedule). It fetches the two JSON results and writes them to files called &lt;code&gt;one.json&lt;/code&gt; and &lt;code&gt;two.json&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;... and that was the initial setup for the project. This took me about fifteen minutes to put in place, because I've built systems like this so many times before. I launched it at about 10am on Saturday and left it to collect data.&lt;/p&gt;
&lt;h4&gt;&lt;a id="user-content-analyzing-the-data-and-drawing-some-charts" class="anchor" aria-hidden="true" href="#analyzing-the-data-and-drawing-some-charts"&gt;&lt;span aria-hidden="true" class="octicon octicon-link"&gt;&lt;/span&gt;&lt;/a&gt;Analyzing the data and drawing some charts&lt;/h4&gt;
&lt;p&gt;The trick with git scraping is that the data you care about ends up captured in &lt;a href="https://github.com/simonw/scrape-hmb-traffic/commits/main"&gt;the git commit log&lt;/a&gt;. The challenge is how to extract that back out again and turn it into something useful.&lt;/p&gt;
&lt;p&gt;My &lt;a href="https://simonwillison.net/2021/Dec/7/git-history/" rel="nofollow"&gt;git-history tool&lt;/a&gt; is designed to solve this. It's a command-line utility which can iterate through every version of a file stored in a git repository, extracting information from that file out into a SQLite database table and creating a new row for every commit.&lt;/p&gt;
&lt;p&gt;Normally I run it against CSV or JSON files containing an array of rows - effectively tabular data already, where I just want to record what has changed in between commits.&lt;/p&gt;
&lt;p&gt;For this project, I was storing the raw JSON output by the Google Maps API. I didn't care about most of the information in there: I really just wanted the &lt;code&gt;duration_in_traffic&lt;/code&gt; value.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;git-history&lt;/code&gt; can accept a snippet of Python code that will be run against each stored copy of a file. The snippet should return a list of JSON objects (as Python dictionaries) which the rest of the tool can then use to figure out what has changed.&lt;/p&gt;
&lt;p&gt;To cut a long story short, here's the incantation that worked:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;git-history file hmb.db one.json \
--convert '
try:
    duration_in_traffic = json.loads(content)["routes"][0]["legs"][0]["duration_in_traffic"]["value"]
    return [{"id": "one", "duration_in_traffic": duration_in_traffic}]
except Exception as ex:
    return []
' \
  --full-versions \
  --id id
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;git-history file&lt;/code&gt; command is used to load the history for a specific file - in this case it's the file &lt;code&gt;one.json&lt;/code&gt;, which will be loaded into a new SQLite database file called &lt;code&gt;hmb.db&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;--convert&lt;/code&gt; code uses &lt;code&gt;json.loads(content)&lt;/code&gt; to load the JSON for the current file version, then pulls out the &lt;code&gt;["routes"][0]["legs"][0]["duration_in_traffic"]["value"]&lt;/code&gt; nested value from it.&lt;/p&gt;
&lt;p&gt;If that's missing (e.g. in an earlier commit, when I hadn't yet added the &lt;code&gt;departure_time=now&lt;/code&gt; parameter to the URL) an exception will be caught and the function will return an empty list.&lt;/p&gt;
&lt;p&gt;If the &lt;code&gt;duration_in_traffic&lt;/code&gt; value is present, the function returns the following:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;[{"id": "one", "duration_in_traffic": duration_in_traffic}]
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;code&gt;git-history&lt;/code&gt; likes lists of dictionaries. It's usually being run against files that contain many different rows, where the &lt;code&gt;id&lt;/code&gt; column can be used to de-dupe rows across commits and spot what has changed.&lt;/p&gt;
&lt;p&gt;In this case, each file only has a single interesting value.&lt;/p&gt;
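&lt;p&gt;Outside of &lt;code&gt;git-history&lt;/code&gt; that conversion logic can be exercised as a plain Python function - here's a sketch using a made-up payload shaped like the Directions API response:&lt;/p&gt;

```python
import json


def convert(content):
    # Mirror of the --convert snippet: pull the nested duration_in_traffic
    # value out of the response, returning an empty list for file versions
    # that don't have it (e.g. earlier commits without departure_time=now).
    try:
        duration_in_traffic = json.loads(content)["routes"][0]["legs"][0]["duration_in_traffic"]["value"]
        return [{"id": "one", "duration_in_traffic": duration_in_traffic}]
    except Exception:
        return []


# A made-up payload with the same nested shape as the real API response:
sample = json.dumps({
    "routes": [{"legs": [{"duration_in_traffic": {"value": 1110}}]}]
})
print(convert(sample))            # [{'id': 'one', 'duration_in_traffic': 1110}]
print(convert('{"routes": []}'))  # [] - the value is missing, exception caught
```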
&lt;p&gt;Two more options are used here:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;--full-versions&lt;/code&gt; - tells &lt;code&gt;git-history&lt;/code&gt; to store all of the columns, not just columns that have changed since the last run. The default behaviour here is to store a &lt;code&gt;null&lt;/code&gt; if a value has not changed in order to save space, but our data is tiny here so we don't need any clever optimizations.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;--id id&lt;/code&gt; specifies the ID column that should be used to de-dupe changes. Again, not really important for this tiny project.&lt;/li&gt;
&lt;/ul&gt;
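&lt;p&gt;To illustrate the de-duping idea, here's a hypothetical sketch (not &lt;code&gt;git-history&lt;/code&gt;'s actual implementation): a new version is only recorded for an &lt;code&gt;id&lt;/code&gt; when its values differ from the previously seen version:&lt;/p&gt;

```python
def versions(snapshots):
    """Record a new version of an item only when its values change.

    snapshots: one list of dicts per commit, each dict carrying an "id" key.
    Illustrative sketch of the de-duping idea only.
    """
    latest = {}  # id -> last seen values (excluding the id itself)
    out = []
    for commit, items in enumerate(snapshots):
        for item in items:
            values = {k: v for k, v in item.items() if k != "id"}
            if latest.get(item["id"]) != values:
                out.append({"_commit": commit, **item})
                latest[item["id"]] = values
    return out


snapshots = [
    [{"id": "one", "duration_in_traffic": 1110}],
    [{"id": "one", "duration_in_traffic": 1110}],  # unchanged - no new version
    [{"id": "one", "duration_in_traffic": 1016}],  # changed - new version
]
print(versions(snapshots))
```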
&lt;p&gt;After running the above command, the resulting schema includes these tables:&lt;/p&gt;
&lt;div class="highlight highlight-source-sql"&gt;&lt;pre&gt;CREATE TABLE [commits] (
   [id] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt; &lt;span class="pl-k"&gt;PRIMARY KEY&lt;/span&gt;,
   [namespace] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt; &lt;span class="pl-k"&gt;REFERENCES&lt;/span&gt; [namespaces]([id]),
   [hash] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt;,
   [commit_at] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt;
);
CREATE TABLE [item_version] (
   [_id] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt; &lt;span class="pl-k"&gt;PRIMARY KEY&lt;/span&gt;,
   [_item] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt; &lt;span class="pl-k"&gt;REFERENCES&lt;/span&gt; [item]([_id]),
   [_version] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt;,
   [_commit] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt; &lt;span class="pl-k"&gt;REFERENCES&lt;/span&gt; [commits]([id]),
   [id] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt;,
   [duration_in_traffic] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt;
);&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The &lt;code&gt;commits&lt;/code&gt; table includes the date of the commit - &lt;code&gt;commit_at&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;item_version&lt;/code&gt; table has that &lt;code&gt;duration_in_traffic&lt;/code&gt; value.&lt;/p&gt;
&lt;p&gt;So... to get back the duration in traffic at different times of day I can run this SQL query to join those two tables together:&lt;/p&gt;
&lt;div class="highlight highlight-source-sql"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;select&lt;/span&gt;
    &lt;span class="pl-c1"&gt;commits&lt;/span&gt;.&lt;span class="pl-c1"&gt;commit_at&lt;/span&gt;,
    duration_in_traffic
&lt;span class="pl-k"&gt;from&lt;/span&gt;
    item_version
&lt;span class="pl-k"&gt;join&lt;/span&gt;
    commits &lt;span class="pl-k"&gt;on&lt;/span&gt; &lt;span class="pl-c1"&gt;item_version&lt;/span&gt;.&lt;span class="pl-c1"&gt;_commit&lt;/span&gt; &lt;span class="pl-k"&gt;=&lt;/span&gt; &lt;span class="pl-c1"&gt;commits&lt;/span&gt;.&lt;span class="pl-c1"&gt;id&lt;/span&gt;
&lt;span class="pl-k"&gt;order by&lt;/span&gt;
    &lt;span class="pl-c1"&gt;commits&lt;/span&gt;.&lt;span class="pl-c1"&gt;commit_at&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;That query returns data that looks like this:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;commit_at&lt;/th&gt;
&lt;th&gt;duration_in_traffic&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;2022-10-15T17:09:06+00:00&lt;/td&gt;
&lt;td&gt;1110&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2022-10-15T17:17:38+00:00&lt;/td&gt;
&lt;td&gt;1016&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2022-10-15T17:30:06+00:00&lt;/td&gt;
&lt;td&gt;1391&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;A couple of problems here. First, the &lt;code&gt;commit_at&lt;/code&gt; column is in UTC, not local time. Second, &lt;code&gt;duration_in_traffic&lt;/code&gt; is in seconds, which aren't particularly easy to read.&lt;/p&gt;
&lt;p&gt;Here's a SQLite fix for these two issues:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;select
    time(datetime(commits.commit_at, '-7 hours')) as t,
    duration_in_traffic / 60 as mins_in_traffic
from
    item_version
join
    commits on item_version._commit = commits.id
order by
    commits.commit_at
&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;t&lt;/th&gt;
&lt;th&gt;mins_in_traffic&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;10:09:06&lt;/td&gt;
&lt;td&gt;18&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10:17:38&lt;/td&gt;
&lt;td&gt;16&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10:30:06&lt;/td&gt;
&lt;td&gt;23&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;&lt;code&gt;datetime(commits.commit_at, '-7 hours')&lt;/code&gt; parses the UTC string as a datetime, then subtracts 7 hours from it to convert from UTC to local California time.&lt;/p&gt;
&lt;p&gt;I wrap that in &lt;code&gt;time()&lt;/code&gt; here because for the chart I want to render I know everything will be on the same day.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;mins_in_traffic&lt;/code&gt; now shows minutes, not seconds.&lt;/p&gt;
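&lt;p&gt;Those two conversions are easy to sanity-check using Python's built-in &lt;code&gt;sqlite3&lt;/code&gt; module:&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite's datetime() parses the ISO 8601 timestamp (including the +00:00
# suffix) and applies the '-7 hours' modifier; time() keeps just the
# time-of-day portion of the result.
(t,) = conn.execute(
    "select time(datetime('2022-10-15T17:09:06+00:00', '-7 hours'))"
).fetchone()
print(t)           # 10:09:06
print(1110 // 60)  # 18 - integer division, matching SQLite's / on integers
```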
&lt;p&gt;We now have enough data to render a chart!&lt;/p&gt;
&lt;p&gt;But... we only have one of the two directions of traffic here. To process the numbers from &lt;code&gt;two.json&lt;/code&gt; as well I ran this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;git-history file hmb.db two.json \
--convert '
try:
    duration_in_traffic = json.loads(content)["routes"][0]["legs"][0]["duration_in_traffic"]["value"]
    return [{"id": "two", "duration_in_traffic": duration_in_traffic}]
except Exception as ex:
    return []
' \
  --full-versions \
  --id id --namespace item2
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This is &lt;em&gt;almost&lt;/em&gt; the same as the previous command. It's running against &lt;code&gt;two.json&lt;/code&gt; instead of &lt;code&gt;one.json&lt;/code&gt;, and it's using the &lt;code&gt;--namespace item2&lt;/code&gt; option.&lt;/p&gt;
&lt;p&gt;This causes it to populate a new table called &lt;code&gt;item2_version&lt;/code&gt; instead of &lt;code&gt;item_version&lt;/code&gt;, which is a cheap trick to avoid having to figure out how to load both files into the same table.&lt;/p&gt;
&lt;h2&gt;&lt;a id="user-content-two-lines-on-one-chart" class="anchor" aria-hidden="true" href="#two-lines-on-one-chart"&gt;&lt;span aria-hidden="true" class="octicon octicon-link"&gt;&lt;/span&gt;&lt;/a&gt;Two lines on one chart&lt;/h2&gt;
&lt;p&gt;I rendered an initial single line chart using &lt;a href="https://datasette.io/plugins/datasette-vega" rel="nofollow"&gt;datasette-vega&lt;/a&gt;, but Natalie suggested that putting lines on the same chart for the two directions of traffic would be more interesting.&lt;/p&gt;
&lt;p&gt;Since I now had one table for each direction of traffic (&lt;code&gt;item_version&lt;/code&gt; and &lt;code&gt;item2_version&lt;/code&gt;) I decided to combine those into a single table, suitable for pasting into Google Sheets.&lt;/p&gt;
&lt;p&gt;Here's the SQL I came up with to do that:&lt;/p&gt;
&lt;div class="highlight highlight-source-sql"&gt;&lt;pre&gt;with item1 &lt;span class="pl-k"&gt;as&lt;/span&gt; (
  &lt;span class="pl-k"&gt;select&lt;/span&gt;
    &lt;span class="pl-k"&gt;time&lt;/span&gt;(datetime(&lt;span class="pl-c1"&gt;commits&lt;/span&gt;.&lt;span class="pl-c1"&gt;commit_at&lt;/span&gt;, &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;-7 hours&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;)) &lt;span class="pl-k"&gt;as&lt;/span&gt; t,
    duration_in_traffic &lt;span class="pl-k"&gt;/&lt;/span&gt; &lt;span class="pl-c1"&gt;60&lt;/span&gt; &lt;span class="pl-k"&gt;as&lt;/span&gt; mins_in_traffic
  &lt;span class="pl-k"&gt;from&lt;/span&gt;
    item_version
    &lt;span class="pl-k"&gt;join&lt;/span&gt; commits &lt;span class="pl-k"&gt;on&lt;/span&gt; &lt;span class="pl-c1"&gt;item_version&lt;/span&gt;.&lt;span class="pl-c1"&gt;_commit&lt;/span&gt; &lt;span class="pl-k"&gt;=&lt;/span&gt; &lt;span class="pl-c1"&gt;commits&lt;/span&gt;.&lt;span class="pl-c1"&gt;id&lt;/span&gt;
  &lt;span class="pl-k"&gt;order by&lt;/span&gt;
    &lt;span class="pl-c1"&gt;commits&lt;/span&gt;.&lt;span class="pl-c1"&gt;commit_at&lt;/span&gt;
),
item2 &lt;span class="pl-k"&gt;as&lt;/span&gt; (
  &lt;span class="pl-k"&gt;select&lt;/span&gt;
    &lt;span class="pl-k"&gt;time&lt;/span&gt;(datetime(&lt;span class="pl-c1"&gt;commits&lt;/span&gt;.&lt;span class="pl-c1"&gt;commit_at&lt;/span&gt;, &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;-7 hours&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;)) &lt;span class="pl-k"&gt;as&lt;/span&gt; t,
    duration_in_traffic &lt;span class="pl-k"&gt;/&lt;/span&gt; &lt;span class="pl-c1"&gt;60&lt;/span&gt; &lt;span class="pl-k"&gt;as&lt;/span&gt; mins_in_traffic
  &lt;span class="pl-k"&gt;from&lt;/span&gt;
    item2_version
    &lt;span class="pl-k"&gt;join&lt;/span&gt; commits &lt;span class="pl-k"&gt;on&lt;/span&gt; &lt;span class="pl-c1"&gt;item2_version&lt;/span&gt;.&lt;span class="pl-c1"&gt;_commit&lt;/span&gt; &lt;span class="pl-k"&gt;=&lt;/span&gt; &lt;span class="pl-c1"&gt;commits&lt;/span&gt;.&lt;span class="pl-c1"&gt;id&lt;/span&gt;
  &lt;span class="pl-k"&gt;order by&lt;/span&gt;
    &lt;span class="pl-c1"&gt;commits&lt;/span&gt;.&lt;span class="pl-c1"&gt;commit_at&lt;/span&gt;
)
&lt;span class="pl-k"&gt;select&lt;/span&gt;
  item1.&lt;span class="pl-k"&gt;*&lt;/span&gt;,
  &lt;span class="pl-c1"&gt;item2&lt;/span&gt;.&lt;span class="pl-c1"&gt;mins_in_traffic&lt;/span&gt; &lt;span class="pl-k"&gt;as&lt;/span&gt; mins_in_traffic_other_way
&lt;span class="pl-k"&gt;from&lt;/span&gt;
  item1
  &lt;span class="pl-k"&gt;join&lt;/span&gt; item2 &lt;span class="pl-k"&gt;on&lt;/span&gt; &lt;span class="pl-c1"&gt;item1&lt;/span&gt;.&lt;span class="pl-c1"&gt;t&lt;/span&gt; &lt;span class="pl-k"&gt;=&lt;/span&gt; &lt;span class="pl-c1"&gt;item2&lt;/span&gt;.&lt;span class="pl-c1"&gt;t&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This uses two CTEs (Common Table Expressions - the &lt;code&gt;with X as&lt;/code&gt; pieces) using the pattern I explained earlier - now called &lt;code&gt;item1&lt;/code&gt; and &lt;code&gt;item2&lt;/code&gt;. Having defined these two CTEs, I can join them together on the &lt;code&gt;t&lt;/code&gt; column, which is the time of day.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://lite.datasette.io/?url=https://github.com/simonw/scrape-hmb-traffic/blob/main/hmb.db?&amp;amp;install=datasette-copyable#/hmb?sql=with+item1+as+(%0A++select%0A++++time(datetime(commits.commit_at%2C+'-7+hours'))+as+t%2C%0A++++duration_in_traffic+%2F+60+as+mins_in_traffic%0A++from%0A++++item_version%0A++++join+commits+on+item_version._commit+%3D+commits.id%0A++order+by%0A++++commits.commit_at%0A)%2C%0Aitem2+as+(%0A++select%0A++++time(datetime(commits.commit_at%2C+'-7+hours'))+as+t%2C%0A++++duration_in_traffic+%2F+60+as+mins_in_traffic%0A++from%0A++++item2_version%0A++++join+commits+on+item2_version._commit+%3D+commits.id%0A++order+by%0A++++commits.commit_at%0A)%0Aselect%0A++item1.*%2C%0A++item2.mins_in_traffic+as+mins_in_traffic_other_way%0Afrom%0A++item1%0A++join+item2+on+item1.t+%3D+item2.t" rel="nofollow"&gt;Try running this query&lt;/a&gt; in Datasette Lite.&lt;/p&gt;
&lt;p&gt;Here's the output of that query for Saturday (10am to 8pm):&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;t&lt;/th&gt;
&lt;th&gt;mins_in_traffic&lt;/th&gt;
&lt;th&gt;mins_in_traffic_other_way&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;10:09:06&lt;/td&gt;
&lt;td&gt;18&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10:17:38&lt;/td&gt;
&lt;td&gt;16&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10:30:06&lt;/td&gt;
&lt;td&gt;23&lt;/td&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10:47:38&lt;/td&gt;
&lt;td&gt;23&lt;/td&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10:57:37&lt;/td&gt;
&lt;td&gt;23&lt;/td&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;11:08:20&lt;/td&gt;
&lt;td&gt;26&lt;/td&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;11:22:27&lt;/td&gt;
&lt;td&gt;26&lt;/td&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;11:38:42&lt;/td&gt;
&lt;td&gt;26&lt;/td&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;11:52:35&lt;/td&gt;
&lt;td&gt;25&lt;/td&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;12:03:23&lt;/td&gt;
&lt;td&gt;24&lt;/td&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;12:15:16&lt;/td&gt;
&lt;td&gt;21&lt;/td&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;12:27:51&lt;/td&gt;
&lt;td&gt;22&lt;/td&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;12:37:48&lt;/td&gt;
&lt;td&gt;22&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;12:46:41&lt;/td&gt;
&lt;td&gt;21&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;12:55:03&lt;/td&gt;
&lt;td&gt;21&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;13:05:10&lt;/td&gt;
&lt;td&gt;21&lt;/td&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;13:17:57&lt;/td&gt;
&lt;td&gt;21&lt;/td&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;13:32:55&lt;/td&gt;
&lt;td&gt;21&lt;/td&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;13:44:53&lt;/td&gt;
&lt;td&gt;19&lt;/td&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;13:55:22&lt;/td&gt;
&lt;td&gt;21&lt;/td&gt;
&lt;td&gt;14&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;14:05:21&lt;/td&gt;
&lt;td&gt;22&lt;/td&gt;
&lt;td&gt;14&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;14:17:48&lt;/td&gt;
&lt;td&gt;23&lt;/td&gt;
&lt;td&gt;15&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;14:31:04&lt;/td&gt;
&lt;td&gt;22&lt;/td&gt;
&lt;td&gt;15&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;14:41:59&lt;/td&gt;
&lt;td&gt;21&lt;/td&gt;
&lt;td&gt;14&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;14:51:48&lt;/td&gt;
&lt;td&gt;18&lt;/td&gt;
&lt;td&gt;14&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;15:00:09&lt;/td&gt;
&lt;td&gt;18&lt;/td&gt;
&lt;td&gt;15&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;15:11:17&lt;/td&gt;
&lt;td&gt;15&lt;/td&gt;
&lt;td&gt;14&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;15:25:48&lt;/td&gt;
&lt;td&gt;14&lt;/td&gt;
&lt;td&gt;15&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;15:39:41&lt;/td&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;td&gt;14&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;15:51:11&lt;/td&gt;
&lt;td&gt;14&lt;/td&gt;
&lt;td&gt;15&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;15:59:34&lt;/td&gt;
&lt;td&gt;15&lt;/td&gt;
&lt;td&gt;15&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;16:10:50&lt;/td&gt;
&lt;td&gt;19&lt;/td&gt;
&lt;td&gt;16&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;16:25:43&lt;/td&gt;
&lt;td&gt;19&lt;/td&gt;
&lt;td&gt;18&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;16:53:06&lt;/td&gt;
&lt;td&gt;19&lt;/td&gt;
&lt;td&gt;18&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;17:11:34&lt;/td&gt;
&lt;td&gt;18&lt;/td&gt;
&lt;td&gt;16&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;17:40:29&lt;/td&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;18:12:07&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;18:58:17&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;20:05:13&lt;/td&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;I copied and pasted this table into Google Sheets and messed around with the charting tools there until I had the following chart:&lt;/p&gt;
&lt;p&gt;&lt;img alt="A chart showing the two lines over time" src="https://static.simonwillison.net/static/2022/pumpkin-saturday-smooth.png" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;Here's the same chart for Sunday:&lt;/p&gt;
&lt;p&gt;&lt;img alt="This chart shows the same thing but for Sunday" src="https://static.simonwillison.net/static/2022/pumpkin-sunday-smooth.png" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;Our &lt;a href="https://docs.google.com/spreadsheets/d/1JOimtkugZBF_YQxqn0Gn6NiIhNz-OMH2rpOZtmECAY4/edit#gid=0" rel="nofollow"&gt;Google Sheet is here&lt;/a&gt; - the two days have two separate tabs within the sheet.&lt;/p&gt;
&lt;h4&gt;&lt;a id="user-content-building-the-sqlite-database-in-github-actions" class="anchor" aria-hidden="true" href="#building-the-sqlite-database-in-github-actions"&gt;&lt;span aria-hidden="true" class="octicon octicon-link"&gt;&lt;/span&gt;&lt;/a&gt;Building the SQLite database in GitHub Actions&lt;/h4&gt;
&lt;p&gt;I did most of the development work for this project on my laptop, running &lt;code&gt;git-history&lt;/code&gt; and &lt;code&gt;datasette&lt;/code&gt; locally for speed of iteration.&lt;/p&gt;
&lt;p&gt;Once I had everything working, I decided to automate the process of building the SQLite database as well.&lt;/p&gt;
&lt;p&gt;I made the following changes to my GitHub Actions workflow:&lt;/p&gt;
&lt;div class="highlight highlight-source-yaml"&gt;&lt;pre&gt;&lt;span class="pl-ent"&gt;jobs&lt;/span&gt;:
  &lt;span class="pl-ent"&gt;shot-scraper&lt;/span&gt;:
    &lt;span class="pl-ent"&gt;runs-on&lt;/span&gt;: &lt;span class="pl-s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="pl-ent"&gt;steps&lt;/span&gt;:
    - &lt;span class="pl-ent"&gt;uses&lt;/span&gt;: &lt;span class="pl-s"&gt;actions/checkout@v3&lt;/span&gt;
      &lt;span class="pl-ent"&gt;with&lt;/span&gt;:
        &lt;span class="pl-ent"&gt;fetch-depth&lt;/span&gt;: &lt;span class="pl-c1"&gt;0&lt;/span&gt; &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; Needed by git-history&lt;/span&gt;
    - &lt;span class="pl-ent"&gt;name&lt;/span&gt;: &lt;span class="pl-s"&gt;Set up Python 3.10&lt;/span&gt;
      &lt;span class="pl-ent"&gt;uses&lt;/span&gt;: &lt;span class="pl-s"&gt;actions/setup-python@v4&lt;/span&gt;
      &lt;span class="pl-ent"&gt;with&lt;/span&gt;:
        &lt;span class="pl-ent"&gt;python-version&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;3.10&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;
        &lt;span class="pl-ent"&gt;cache&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;pip&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;
    - &lt;span class="pl-ent"&gt;run&lt;/span&gt;: &lt;span class="pl-s"&gt;pip install -r requirements.txt&lt;/span&gt;
    - &lt;span class="pl-ent"&gt;name&lt;/span&gt;: &lt;span class="pl-s"&gt;Scrape&lt;/span&gt;
      &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; Same as before...&lt;/span&gt;
      &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; env:&lt;/span&gt;
      &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; run&lt;/span&gt;
    - &lt;span class="pl-ent"&gt;name&lt;/span&gt;: &lt;span class="pl-s"&gt;Build SQLite database&lt;/span&gt;
      &lt;span class="pl-ent"&gt;run&lt;/span&gt;: &lt;span class="pl-s"&gt;|&lt;/span&gt;
&lt;span class="pl-s"&gt;        rm -f hmb.db # Recreate from scratch each time&lt;/span&gt;
&lt;span class="pl-s"&gt;        git-history file hmb.db one.json \&lt;/span&gt;
&lt;span class="pl-s"&gt;        --convert '&lt;/span&gt;
&lt;span class="pl-s"&gt;        try:&lt;/span&gt;
&lt;span class="pl-s"&gt;            duration_in_traffic = json.loads(content)["routes"][0]["legs"][0]["duration_in_traffic"]["value"]&lt;/span&gt;
&lt;span class="pl-s"&gt;            return [{"id": "one", "duration_in_traffic": duration_in_traffic}]&lt;/span&gt;
&lt;span class="pl-s"&gt;        except Exception as ex:&lt;/span&gt;
&lt;span class="pl-s"&gt;            return []&lt;/span&gt;
&lt;span class="pl-s"&gt;        ' \&lt;/span&gt;
&lt;span class="pl-s"&gt;          --full-versions \&lt;/span&gt;
&lt;span class="pl-s"&gt;          --id id&lt;/span&gt;
&lt;span class="pl-s"&gt;        git-history file hmb.db two.json \&lt;/span&gt;
&lt;span class="pl-s"&gt;        --convert '&lt;/span&gt;
&lt;span class="pl-s"&gt;        try:&lt;/span&gt;
&lt;span class="pl-s"&gt;            duration_in_traffic = json.loads(content)["routes"][0]["legs"][0]["duration_in_traffic"]["value"]&lt;/span&gt;
&lt;span class="pl-s"&gt;            return [{"id": "two", "duration_in_traffic": duration_in_traffic}]&lt;/span&gt;
&lt;span class="pl-s"&gt;        except Exception as ex:&lt;/span&gt;
&lt;span class="pl-s"&gt;            return []&lt;/span&gt;
&lt;span class="pl-s"&gt;        ' \&lt;/span&gt;
&lt;span class="pl-s"&gt;          --full-versions \&lt;/span&gt;
&lt;span class="pl-s"&gt;          --id id --namespace item2&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;/span&gt;    - &lt;span class="pl-ent"&gt;name&lt;/span&gt;: &lt;span class="pl-s"&gt;Commit and push&lt;/span&gt;
      &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; Same as before...&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;I also added a &lt;code&gt;requirements.txt&lt;/code&gt; file containing just &lt;code&gt;git-history&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Note how the &lt;code&gt;actions/checkout@v3&lt;/code&gt; step now has &lt;code&gt;fetch-depth: 0&lt;/code&gt; - this is necessary because &lt;code&gt;git-history&lt;/code&gt; needs to loop through the entire repository history, but &lt;code&gt;actions/checkout@v3&lt;/code&gt; defaults to only fetching the most recent commit.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;setup-python&lt;/code&gt; step uses &lt;code&gt;cache: "pip"&lt;/code&gt;, which causes it to cache installed dependencies from &lt;code&gt;requirements.txt&lt;/code&gt; between runs.&lt;/p&gt;
&lt;p&gt;Because that big &lt;code&gt;git-history&lt;/code&gt; step creates a &lt;code&gt;hmb.db&lt;/code&gt; SQLite database, the "Commit and push" step now includes that file in the push to the repository. So every time the workflow runs a new binary SQLite database file is committed.&lt;/p&gt;
&lt;p&gt;Normally I wouldn't do this, because Git isn't a great place to keep constantly changing binary files... but in this case the SQLite database is only 100KB and won't continue to be updated beyond the end of the pumpkin festival.&lt;/p&gt;
&lt;p&gt;End result: &lt;a href="https://github.com/simonw/scrape-hmb-traffic/blob/main/hmb.db"&gt;hmb.db is available&lt;/a&gt; in the GitHub repository.&lt;/p&gt;
&lt;h4&gt;&lt;a id="user-content-querying-it-using-datasette-lite" class="anchor" aria-hidden="true" href="#querying-it-using-datasette-lite"&gt;&lt;span aria-hidden="true" class="octicon octicon-link"&gt;&lt;/span&gt;&lt;/a&gt;Querying it using Datasette Lite&lt;/h4&gt;
&lt;p&gt;&lt;a href="https://simonwillison.net/2022/May/4/datasette-lite/" rel="nofollow"&gt;Datasette Lite&lt;/a&gt; is my repackaged version of my Datasette server-side Python application which runs entirely in the user's browser, using WebAssembly.&lt;/p&gt;
&lt;p&gt;A neat feature of Datasette Lite is that you can pass it the URL to a SQLite database file and it will load that database in your browser and let you run queries against it.&lt;/p&gt;
&lt;p&gt;These database files need to be served with CORS headers. Every file served by GitHub includes these headers!&lt;/p&gt;
&lt;p&gt;Which means the following URL can be used to open up the latest &lt;code&gt;hmb.db&lt;/code&gt; file directly in Datasette in the browser:&lt;/p&gt;
&lt;p&gt;&lt;a href="https://lite.datasette.io/?url=https://github.com/simonw/scrape-hmb-traffic/blob/main/hmb.db" rel="nofollow"&gt;https://lite.datasette.io/?url=https://github.com/simonw/scrape-hmb-traffic/blob/main/hmb.db&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;(This takes advantage of a &lt;a href="https://simonwillison.net/2022/Sep/16/weeknotes/" rel="nofollow"&gt;feature I added&lt;/a&gt; to Datasette Lite where it knows how to convert the URL to the HTML page about a file on GitHub to the URL to the raw file itself.)&lt;/p&gt;
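&lt;p&gt;That rewrite follows GitHub's standard URL convention. Here's an illustrative sketch of the idea (the function name is mine, not Datasette Lite's actual code):&lt;/p&gt;

```python
def github_blob_to_raw(url):
    # github.com/{owner}/{repo}/blob/{ref}/{path}
    #   becomes raw.githubusercontent.com/{owner}/{repo}/{ref}/{path}
    prefix = "https://github.com/"
    if url.startswith(prefix) and "/blob/" in url:
        owner_repo, path = url[len(prefix):].split("/blob/", 1)
        return f"https://raw.githubusercontent.com/{owner_repo}/{path}"
    return url  # already a raw (or non-GitHub) URL - leave it untouched


print(github_blob_to_raw(
    "https://github.com/simonw/scrape-hmb-traffic/blob/main/hmb.db"
))
# https://raw.githubusercontent.com/simonw/scrape-hmb-traffic/main/hmb.db
```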
&lt;p&gt;URLs to SQL queries work too. This URL will open Datasette Lite, load the SQLite database AND execute the query I constructed above:&lt;/p&gt;
&lt;p&gt;&lt;a href="https://lite.datasette.io/?url=https://github.com/simonw/scrape-hmb-traffic/blob/main/hmb.db#/hmb?sql=with+item1+as+(%0A++select%0A++++time(datetime(commits.commit_at%2C+'-7+hours'))+as+t%2C%0A++++duration_in_traffic+%2F+60+as+mins_in_traffic%0A++from%0A++++item_version%0A++++join+commits+on+item_version._commit+%3D+commits.id%0A++order+by%0A++++commits.commit_at%0A)%2C%0Aitem2+as+(%0A++select%0A++++time(datetime(commits.commit_at%2C+'-7+hours'))+as+t%2C%0A++++duration_in_traffic+%2F+60+as+mins_in_traffic%0A++from%0A++++item2_version%0A++++join+commits+on+item2_version._commit+%3D+commits.id%0A++order+by%0A++++commits.commit_at%0A)%0Aselect%0A++item1.*%2C%0A++item2.mins_in_traffic+as+mins_in_traffic_other_way%0Afrom%0A++item1%0A++join+item2+on+item1.t+%3D+item2.t" rel="nofollow"&gt;https://lite.datasette.io/?url=https://github.com/simonw/scrape-hmb-traffic/blob/main/hmb.db#/hmb?sql=with+item1+as+(%0A++select%0A++++time(datetime(commits.commit_at%2C+'-7+hours'))+as+t%2C%0A++++duration_in_traffic+%2F+60+as+mins_in_traffic%0A++from%0A++++item_version%0A++++join+commits+on+item_version._commit+%3D+commits.id%0A++order+by%0A++++commits.commit_at%0A)%2C%0Aitem2+as+(%0A++select%0A++++time(datetime(commits.commit_at%2C+'-7+hours'))+as+t%2C%0A++++duration_in_traffic+%2F+60+as+mins_in_traffic%0A++from%0A++++item2_version%0A++++join+commits+on+item2_version._commit+%3D+commits.id%0A++order+by%0A++++commits.commit_at%0A)%0Aselect%0A++item1.*%2C%0A++item2.mins_in_traffic+as+mins_in_traffic_other_way%0Afrom%0A++item1%0A++join+item2+on+item1.t+%3D+item2.t&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;And finally... Datasette Lite &lt;a href="https://simonwillison.net/2022/Aug/17/datasette-lite-plugins/" rel="nofollow"&gt;has plugin support&lt;/a&gt;. Adding &lt;code&gt;&amp;amp;install=datasette-copyable&lt;/code&gt; to the URL adds the &lt;a href="https://datasette.io/plugins/datasette-copyable" rel="nofollow"&gt;datasette-copyable&lt;/a&gt; plugin, which adds a page for easily copying out the query results as TSV (useful for pasting into a spreadsheet) or even as GitHub-flavored Markdown (which I used to add results to this blog post).&lt;/p&gt;
&lt;p&gt;Here's &lt;a href="https://lite.datasette.io/?url=https://github.com/simonw/scrape-hmb-traffic/blob/main/hmb.db&amp;amp;install=datasette-copyable#/hmb.copyable?sql=with+item1+as+%28%0A++select%0A++++time%28datetime%28commits.commit_at%2C+%27-7+hours%27%29%29+as+t%2C%0A++++duration_in_traffic+%2F+60+as+mins_in_traffic%0A++from%0A++++item_version%0A++++join+commits+on+item_version._commit+%3D+commits.id%0A++order+by%0A++++commits.commit_at%0A%29%2C%0Aitem2+as+%28%0A++select%0A++++time%28datetime%28commits.commit_at%2C+%27-7+hours%27%29%29+as+t%2C%0A++++duration_in_traffic+%2F+60+as+mins_in_traffic%0A++from%0A++++item2_version%0A++++join+commits+on+item2_version._commit+%3D+commits.id%0A++order+by%0A++++commits.commit_at%0A%29%0Aselect%0A++item1.%2A%2C%0A++item2.mins_in_traffic+as+mins_in_traffic_other_way%0Afrom%0A++item1%0A++join+item2+on+item1.t+%3D+item2.t&amp;amp;_table_format=github" rel="nofollow"&gt;an example&lt;/a&gt; of that plugin in action.&lt;/p&gt;
&lt;p&gt;This was a fun little project that brought together a whole bunch of things I've been working on over the past few years. Here's some more of my writing on these different techniques and tools:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://simonwillison.net/series/git-scraping/" rel="nofollow"&gt;Git scraping&lt;/a&gt; is the key technique I'm using here to collect the data&lt;/li&gt;
&lt;li&gt;I've written a lot about &lt;a href="https://simonwillison.net/tags/githubactions/" rel="nofollow"&gt;GitHub Actions&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;These are my notes about &lt;a href="https://simonwillison.net/tags/githistory/" rel="nofollow"&gt;git-history&lt;/a&gt;, the tool I used to turn a commit history into a SQLite database&lt;/li&gt;
&lt;li&gt;Here's my series of posts about &lt;a href="https://simonwillison.net/series/datasette-lite/" rel="nofollow"&gt;Datasette Lite&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/data-journalism"&gt;data-journalism&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/natalie-downe"&gt;natalie-downe&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sqlite"&gt;sqlite&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git-scraping"&gt;git-scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git-history"&gt;git-history&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette-lite"&gt;datasette-lite&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/half-moon-bay"&gt;half-moon-bay&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="data-journalism"/><category term="natalie-downe"/><category term="projects"/><category term="sqlite"/><category term="datasette"/><category term="git-scraping"/><category term="git-history"/><category term="datasette-lite"/><category term="half-moon-bay"/></entry><entry><title>Half Moon Bay Pumpkin Festival traffic on Saturday 15th October 2022</title><link href="https://simonwillison.net/2022/Oct/16/half-moon-bay-pumpkin-festival-traffic/#atom-tag" rel="alternate"/><published>2022-10-16T03:56:51+00:00</published><updated>2022-10-16T03:56:51+00:00</updated><id>https://simonwillison.net/2022/Oct/16/half-moon-bay-pumpkin-festival-traffic/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/simonw/scrape-hmb-traffic"&gt;Half Moon Bay Pumpkin Festival traffic on Saturday 15th October 2022&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
It’s the Half Moon Bay Pumpkin Festival this weekend... and its impact on the traffic between our little town of El Granada and Half Moon Bay—8 minutes drive away—is notorious. So I built a git scraper that archives estimated driving times from the Google Maps Navigation API, and used git-history to turn that scraped data into a SQLite database and visualize it on a chart.

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/simonw/status/1581493679738363904"&gt;@simonw&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git-scraping"&gt;git-scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git-history"&gt;git-history&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/half-moon-bay"&gt;half-moon-bay&lt;/a&gt;&lt;/p&gt;



</summary><category term="projects"/><category term="git-scraping"/><category term="git-history"/><category term="half-moon-bay"/></entry><entry><title>Weeknotes: git-history, bug magnets and s3-credentials --public</title><link href="https://simonwillison.net/2021/Dec/8/weeknotes/#atom-tag" rel="alternate"/><published>2021-12-08T21:34:12+00:00</published><updated>2021-12-08T21:34:12+00:00</updated><id>https://simonwillison.net/2021/Dec/8/weeknotes/#atom-tag</id><summary type="html">
    &lt;p&gt;I've stopped considering my projects "shipped" until I've written a proper blog entry about them, so yesterday I finally &lt;a href="https://simonwillison.net/2021/Dec/7/git-history/"&gt;shipped git-history&lt;/a&gt;, coinciding with the release of &lt;a href="https://github.com/simonw/git-history/releases/tag/0.6"&gt;version 0.6&lt;/a&gt; - a full 27 days after the first &lt;a href="https://github.com/simonw/git-history/releases/tag/0.1"&gt;0.1&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;It took way more work than I was expecting to get to this point!&lt;/p&gt;
&lt;p&gt;I wrote the first version of &lt;code&gt;git-history&lt;/code&gt; in an afternoon, as a tool &lt;a href="https://simonwillison.net/2021/Nov/15/weeknotes-git-history/"&gt;for a workshop I was presenting&lt;/a&gt; on Git scraping and Datasette.&lt;/p&gt;
&lt;p&gt;Before promoting it more widely, I wanted to make some improvements to the schema. In particular, I wanted to record only the updated values in the &lt;code&gt;item_version&lt;/code&gt; table - which otherwise could end up duplicating a full copy of each item in the database hundreds or even thousands of times.&lt;/p&gt;
&lt;p&gt;Getting this right took a lot of work, and I kept on getting stumped by weird bugs and edge-cases. &lt;a href="https://github.com/simonw/git-history/issues/33"&gt;This bug&lt;/a&gt; in particular added a couple of days to the project.&lt;/p&gt;
&lt;p&gt;The whole project turned out to be something of a bug magnet, partly because of a design decision I made concerning column names.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;git-history&lt;/code&gt; creates tables with columns that correspond to the underlying data. Since it also needs its own columns for tracking things like commits and incremental versions, I decided to use underscore prefixes for reserved columns such as &lt;code&gt;_item&lt;/code&gt; and &lt;code&gt;_version&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Datasette uses underscore prefixes for its own purposes - special table arguments such as &lt;code&gt;?_facet=column-name&lt;/code&gt;. It's supposed to work with existing columns that use underscores by converting query string arguments like &lt;code&gt;?_item=3&lt;/code&gt; into &lt;code&gt;?_item__exact=3&lt;/code&gt; - but &lt;code&gt;git-history&lt;/code&gt; was the first of my projects to really exercise this, and I kept on finding bugs. Datasette &lt;a href="https://docs.datasette.io/en/stable/changelog.html#v0-59-2"&gt;0.59.2&lt;/a&gt; and &lt;a href="https://docs.datasette.io/en/stable/changelog.html#v0-59-4"&gt;0.59.4&lt;/a&gt; both have related bug fixes, and there's &lt;a href="https://github.com/simonw/datasette/issues/1527"&gt;a re-opened bug&lt;/a&gt; that I have yet to resolve.&lt;/p&gt;
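&lt;p&gt;The rewrite rule is easy to sketch in Python. This hypothetical &lt;code&gt;disambiguate()&lt;/code&gt; helper (not Datasette's actual code) shows the idea: a reserved-looking query string argument that actually names a table column gets the unambiguous &lt;code&gt;__exact&lt;/code&gt; suffix, while genuinely special arguments pass through untouched:&lt;/p&gt;

```python
def disambiguate(args, table_columns):
    # Sketch of the rule described above, not Datasette's real code:
    # an underscore-prefixed argument that collides with an actual
    # column name is rewritten to the explicit "__exact" filter form.
    out = {}
    for key, value in args.items():
        if key.startswith("_") and key in table_columns and "__" not in key:
            out[key + "__exact"] = value
        else:
            out[key] = value
    return out

# "_item" is a real column here, "_facet" is a Datasette special argument
rewritten = disambiguate({"_item": "3", "_facet": "Type"}, {"_item", "_id"})
```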
&lt;p&gt;Building the &lt;a href="https://git-history-demos.datasette.io/ca-fires/incident"&gt;ca-fires demo&lt;/a&gt; also revealed a &lt;a href="https://github.com/simonw/datasette-cluster-map/issues/38"&gt;bug in datasette-cluster-map&lt;/a&gt; which I fixed in &lt;a href="https://github.com/simonw/datasette-cluster-map/releases/tag/0.17.2"&gt;version 0.17.2&lt;/a&gt;.&lt;/p&gt;
&lt;h4&gt;s3-credentials --public&lt;/h4&gt;
&lt;p&gt;The &lt;code&gt;git-history&lt;/code&gt; live demos are built and deployed by &lt;a href="https://github.com/simonw/git-history/blob/main/.github/workflows/deploy-demos.yml"&gt;this GitHub Actions workflow&lt;/a&gt;. The workflow checks out three separate repos and runs &lt;code&gt;git-history&lt;/code&gt; against each of them. To run faster, it takes advantage of the tool's ability to add just the new commits to an existing database - which means it needs to persist database files in between runs.&lt;/p&gt;
&lt;p&gt;Since these files can be several hundred MBs, I decided to persist them in an S3 bucket.&lt;/p&gt;
&lt;p&gt;My &lt;a href="https://simonwillison.net/2021/Nov/3/s3-credentials/"&gt;s3-credentials tool&lt;/a&gt; provides the ability to create a new S3 bucket along with restricted read-write credentials just for that bucket, ideal for use in a GitHub Actions workflow.&lt;/p&gt;
&lt;p&gt;I decided to make the bucket public such that anyone can download files from it, since there was no reason to keep it private. I've been wanting to add this ability to &lt;code&gt;s3-credentials&lt;/code&gt; for a while now, so this was the impetus I needed to finally ship that feature.&lt;/p&gt;
&lt;p&gt;It's surprisingly hard to figure out how to make an S3 bucket public these days! It turned out the magic recipe was adding a JSON bucket policy document to the bucket granting &lt;code&gt;s3:GetObject&lt;/code&gt; permission to principal &lt;code&gt;*&lt;/code&gt; - here's &lt;a href="https://github.com/simonw/s3-credentials/blob/0.8/README.md#public-bucket-policy"&gt;that policy in full&lt;/a&gt;.&lt;/p&gt;
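&lt;p&gt;For illustration, here is a sketch of that policy built as Python data - a hypothetical &lt;code&gt;public_read_policy()&lt;/code&gt; helper, with the field names following the standard AWS policy grammar. With boto3 the policy would be attached using &lt;code&gt;put_bucket_policy&lt;/code&gt;:&lt;/p&gt;

```python
import json

def public_read_policy(bucket_name):
    # Bucket policy granting s3:GetObject on every key in the bucket
    # to any principal ("*") - a sketch of the policy linked above.
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowAllGetObject",
                "Effect": "Allow",
                "Principal": "*",
                "Action": ["s3:GetObject"],
                "Resource": [f"arn:aws:s3:::{bucket_name}/*"],
            }
        ],
    }

# Attaching it would look something like:
#   boto3.client("s3").put_bucket_policy(
#       Bucket="my-bucket",
#       Policy=json.dumps(public_read_policy("my-bucket")),
#   )
policy = public_read_policy("demo-bucket")
```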
&lt;p&gt;I released &lt;a href="https://github.com/simonw/s3-credentials/releases/tag/0.8"&gt;s3-credentials 0.8&lt;/a&gt; with a new &lt;code&gt;--public&lt;/code&gt; option for creating public buckets - here are the release notes in full:&lt;/p&gt;
&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;s3-credentials create my-bucket --public&lt;/code&gt; option for creating public buckets, which allow anyone with knowledge of a filename to download that file. This works by attaching &lt;a href="https://github.com/simonw/s3-credentials/blob/0.8/README.md#public-bucket-policy"&gt;this public bucket policy&lt;/a&gt; to the bucket after it is created. &lt;a href="https://github.com/simonw/s3-credentials/issues/42"&gt;#42&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;s3-credentials put-object&lt;/code&gt; now sets the &lt;code&gt;Content-Type&lt;/code&gt; header on the uploaded object. The type is detected based on the filename, or can be specified using the new &lt;code&gt;--content-type&lt;/code&gt; option. &lt;a href="https://github.com/simonw/s3-credentials/issues/43"&gt;#43&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;s3-credentials policy my-bucket --public-bucket&lt;/code&gt; outputs the public bucket policy that would be attached to a bucket of that name. &lt;a href="https://github.com/simonw/s3-credentials/issues/44"&gt;#44&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;I wrote up this TIL which doubles as a mini-tutorial on using &lt;code&gt;s3-credentials&lt;/code&gt;: &lt;a href="https://til.simonwillison.net/github-actions/s3-bucket-github-actions"&gt;Storing files in an S3 bucket between GitHub Actions runs&lt;/a&gt;.&lt;/p&gt;
&lt;h4&gt;datasette-hovercards&lt;/h4&gt;
&lt;p&gt;This was a quick experiment which turned into a prototype Datasette plugin. I really like how GitHub show hover card previews of links to issues in their interface:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2021/github-hovercard.gif" alt="Animation showing a hover card displayed when the mouse cursor touches a link to a GitHub Issue" style="max-width:100%;" /&gt;&lt;/p&gt;
&lt;p&gt;I decided to see if I could build something similar for links within Datasette, specifically the links that show up when a column is a foreign key to another record.&lt;/p&gt;
&lt;p&gt;Here's what I've got so far:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2021/datasette-hovercards.gif" alt="Animation showing a hover card displayed in GitHub for a link to another record" style="max-width:100%;" /&gt;&lt;/p&gt;
&lt;p&gt;There's an interactive demo running on &lt;a href="https://latest-with-plugins.datasette.io/github/issues"&gt;this table page&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;It still needs a bunch of work - in particular I need to think harder about when the card is shown, where it displays relative to the mouse pointer, what causes it to be hidden again and how it should handle different page widths. Ideally I'd like to figure out a useful mobile / touch-screen variant, but I'm not sure how that could work.&lt;/p&gt;
&lt;p&gt;The prototype plugin is called &lt;a href="https://github.com/simonw/datasette-hovercards"&gt;datasette-hovercards&lt;/a&gt; - I'd like to eventually merge this back into Datasette core once I'm happy with how it works.&lt;/p&gt;
&lt;h4&gt;Releases this week&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/git-history"&gt;git-history&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/git-history/releases/tag/0.6.1"&gt;0.6.1&lt;/a&gt; - (&lt;a href="https://github.com/simonw/git-history/releases"&gt;9 releases total&lt;/a&gt;) - 2021-12-08
&lt;br /&gt;Tools for analyzing Git history using SQLite&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette-cluster-map"&gt;datasette-cluster-map&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/datasette-cluster-map/releases/tag/0.17.2"&gt;0.17.2&lt;/a&gt; - (&lt;a href="https://github.com/simonw/datasette-cluster-map/releases"&gt;20 releases total&lt;/a&gt;) - 2021-12-07
&lt;br /&gt;Datasette plugin that shows a map for any data with latitude/longitude columns&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/s3-credentials"&gt;s3-credentials&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/s3-credentials/releases/tag/0.8"&gt;0.8&lt;/a&gt; - (&lt;a href="https://github.com/simonw/s3-credentials/releases"&gt;8 releases total&lt;/a&gt;) - 2021-12-07
&lt;br /&gt;A tool for creating credentials for accessing S3 buckets&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/asyncinject"&gt;asyncinject&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/asyncinject/releases/tag/0.2a1"&gt;0.2a1&lt;/a&gt; - (&lt;a href="https://github.com/simonw/asyncinject/releases"&gt;3 releases total&lt;/a&gt;) - 2021-12-03
&lt;br /&gt;Run async workflows using pytest-fixtures-style dependency injection&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette-hovercards"&gt;datasette-hovercards&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/datasette-hovercards/releases/tag/0.1a0"&gt;0.1a0&lt;/a&gt; - 2021-12-02
&lt;br /&gt;Add preview hovercards to links in Datasette&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/dogsheep/github-to-sqlite"&gt;github-to-sqlite&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/dogsheep/github-to-sqlite/releases/tag/2.8.3"&gt;2.8.3&lt;/a&gt; - (&lt;a href="https://github.com/dogsheep/github-to-sqlite/releases"&gt;22 releases total&lt;/a&gt;) - 2021-12-01
&lt;br /&gt;Save data from GitHub to a SQLite database&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;TIL this week&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/python/init-subclass"&gt;__init_subclass__&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/github-actions/s3-bucket-github-actions"&gt;Storing files in an S3 bucket between GitHub Actions runs&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/weeknotes"&gt;weeknotes&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git-history"&gt;git-history&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/s3-credentials"&gt;s3-credentials&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="datasette"/><category term="weeknotes"/><category term="git-history"/><category term="s3-credentials"/></entry><entry><title>git-history: a tool for analyzing scraped data collected using Git and SQLite</title><link href="https://simonwillison.net/2021/Dec/7/git-history/#atom-tag" rel="alternate"/><published>2021-12-07T22:32:55+00:00</published><updated>2021-12-07T22:32:55+00:00</updated><id>https://simonwillison.net/2021/Dec/7/git-history/#atom-tag</id><summary type="html">
    &lt;p&gt;I described &lt;a href="https://simonwillison.net/2020/Oct/9/git-scraping/"&gt;Git scraping&lt;/a&gt; last year: a technique for writing scrapers where you periodically snapshot a source of data to a Git repository in order to record changes to that source over time.&lt;/p&gt;
&lt;p&gt;The open challenge was how to analyze that data once it was collected. &lt;a href="https://datasette.io/tools/git-history"&gt;git-history&lt;/a&gt; is my new tool designed to tackle that problem.&lt;/p&gt;
&lt;h4&gt;Git scraping, a refresher&lt;/h4&gt;
&lt;p&gt;A neat thing about scraping to a Git repository is that the scrapers themselves can be really simple. I demonstrated how to run scrapers for free using GitHub Actions in this &lt;a href="https://simonwillison.net/2021/Mar/5/git-scraping/"&gt;five minute lightning talk&lt;/a&gt; back in March.&lt;/p&gt;
&lt;p&gt;Here's a concrete example: California's state fire department, Cal Fire, maintain an incident map at &lt;a href="https://www.fire.ca.gov/incidents/"&gt;fire.ca.gov/incidents&lt;/a&gt; showing the status of current large fires in the state.&lt;/p&gt;
&lt;p&gt;I found the underlying data here:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;curl https://www.fire.ca.gov/umbraco/Api/IncidentApi/GetIncidents
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then I built &lt;a href="https://github.com/simonw/ca-fires-history/blob/main/.github/workflows/scrape.yml"&gt;a simple scraper&lt;/a&gt; that grabs a copy of that every 20 minutes and commits it to Git. I've been running that for 14 months now, and it's collected &lt;a href="https://github.com/simonw/ca-fires-history"&gt;1,559 commits&lt;/a&gt;!&lt;/p&gt;
&lt;p&gt;The thing that excites me most about Git scraping is that it can create truly unique datasets. It's common for organizations not to keep detailed archives of what changed and where, so by scraping their data into a Git repository you can often end up with a more detailed history than they maintain themselves.&lt;/p&gt;
&lt;p&gt;There's one big challenge though: having collected that data, how can you best analyze it? Reading through thousands of commit differences and eyeballing changes to JSON or CSV files isn't a great way of finding the interesting stories that have been captured.&lt;/p&gt;
&lt;h4&gt;git-history&lt;/h4&gt;
&lt;p&gt;&lt;a href="https://datasette.io/tools/git-history"&gt;git-history&lt;/a&gt; is the new CLI tool I've built to answer that question. It reads through the entire history of a file and generates a SQLite database reflecting changes to that file over time. You can then use &lt;a href="https://datasette.io/"&gt;Datasette&lt;/a&gt; to explore the resulting data.&lt;/p&gt;
&lt;p&gt;Here's &lt;a href="https://git-history-demos.datasette.io/ca-fires"&gt;an example database&lt;/a&gt; created by running the tool against my &lt;code&gt;ca-fires-history&lt;/code&gt; repository. I created the SQLite database by running this in the repository directory:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;git-history file ca-fires.db incidents.json \
  --namespace incident \
  --id UniqueId \
  --convert &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;json.loads(content)["Incidents"]&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2021/ca-fires-progress.gif" alt="Animated gif showing the progress bar" style="max-width:100%; border-top: 5px solid black;" /&gt;&lt;/p&gt;
&lt;p&gt;In this example we are processing the history of a single file called &lt;code&gt;incidents.json&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;We use the &lt;code&gt;UniqueId&lt;/code&gt; column to identify which records are changed over time as opposed to newly created.&lt;/p&gt;
&lt;p&gt;Specifying &lt;code&gt;--namespace incident&lt;/code&gt; causes the created database tables to be called &lt;code&gt;incident&lt;/code&gt; and &lt;code&gt;incident_version&lt;/code&gt; rather than the default of &lt;code&gt;item&lt;/code&gt; and &lt;code&gt;item_version&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;And we have a fragment of Python code that knows how to turn each version stored in that commit history into a list of objects compatible with the tool, see &lt;a href="https://github.com/simonw/git-history/blob/0.6/README.md#custom-conversions-using---convert"&gt;--convert in the documentation&lt;/a&gt; for details.&lt;/p&gt;
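&lt;p&gt;As a rough sketch, here is what that inline &lt;code&gt;--convert&lt;/code&gt; expression does when wrapped in a function and run against a hypothetical snapshot of &lt;code&gt;incidents.json&lt;/code&gt; (the sample record below is invented for illustration):&lt;/p&gt;

```python
import json

def convert(content):
    # Equivalent of the inline expression passed to --convert:
    # parse the raw file contents from one commit and return the list
    # of incident objects nested under the "Incidents" key.
    return json.loads(content)["Incidents"]

# Hypothetical snapshot of incidents.json from a single commit:
snapshot = json.dumps({
    "Incidents": [
        {"UniqueId": "abc123", "Name": "Example Fire", "AcresBurned": 500}
    ]
})
incidents = convert(snapshot)
```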
&lt;p&gt;Let's use the database to answer some questions about fires in California over the past 14 months.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;incident&lt;/code&gt; table contains a copy of the latest record for every incident. We can use that to see &lt;a href="https://git-history-demos.datasette.io/ca-fires/incident"&gt;a map of every fire&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2021/ca-fires-map.png" alt="A map showing 250 fires in California" style="max-width:100%;" /&gt;&lt;/p&gt;
&lt;p&gt;This uses the &lt;a href="https://datasette.io/plugins/datasette-cluster-map"&gt;datasette-cluster-map&lt;/a&gt; plugin, which draws a map of every row with a valid latitude and longitude column.&lt;/p&gt;
&lt;p&gt;Where things get interesting is the &lt;a href="https://git-history-demos.datasette.io/ca-fires/incident_version"&gt;incident_version&lt;/a&gt; table. This is where changes between different scraped versions of each item are recorded.&lt;/p&gt;
&lt;p&gt;Those 250 fires have 2,060 recorded versions. If we &lt;a href="https://git-history-demos.datasette.io/ca-fires/incident_version?_facet=_item"&gt;facet by _item&lt;/a&gt; we can see which fires had the most versions recorded. Here are the top ten:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://git-history-demos.datasette.io/ca-fires/incident_version?_facet=_item&amp;amp;_item__exact=174"&gt;Dixie Fire&lt;/a&gt; 268&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://git-history-demos.datasette.io/ca-fires/incident_version?_facet=_item&amp;amp;_item__exact=209"&gt;Caldor Fire&lt;/a&gt; 153&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://git-history-demos.datasette.io/ca-fires/incident_version?_facet=_item&amp;amp;_item__exact=197"&gt;Monument Fire&lt;/a&gt; 65&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://git-history-demos.datasette.io/ca-fires/incident_version?_facet=_item&amp;amp;_item__exact=1"&gt;August Complex (includes Doe Fire)&lt;/a&gt; 64&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://git-history-demos.datasette.io/ca-fires/incident_version?_facet=_item&amp;amp;_item__exact=2"&gt;Creek Fire&lt;/a&gt; 56&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://git-history-demos.datasette.io/ca-fires/incident_version?_facet=_item&amp;amp;_item__exact=213"&gt;French Fire&lt;/a&gt; 53&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://git-history-demos.datasette.io/ca-fires/incident_version?_facet=_item&amp;amp;_item__exact=32"&gt;Silverado Fire&lt;/a&gt; 52&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://git-history-demos.datasette.io/ca-fires/incident_version?_facet=_item&amp;amp;_item__exact=240"&gt;Fawn Fire&lt;/a&gt; 45&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://git-history-demos.datasette.io/ca-fires/incident_version?_facet=_item&amp;amp;_item__exact=34"&gt;Blue Ridge Fire&lt;/a&gt; 39&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://git-history-demos.datasette.io/ca-fires/incident_version?_facet=_item&amp;amp;_item__exact=190"&gt;McFarland Fire&lt;/a&gt; 34&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This looks about right - the larger the number of versions the longer the fire must have been burning. The Dixie Fire &lt;a href="https://en.wikipedia.org/wiki/Dixie_Fire"&gt;has its own Wikipedia page&lt;/a&gt;!&lt;/p&gt;
&lt;p&gt;Clicking through to &lt;a href="https://git-history-demos.datasette.io/ca-fires/incident_version?_facet=_item&amp;amp;_item__exact=174"&gt;the Dixie Fire&lt;/a&gt; lands us on a page showing every "version" that we captured, ordered by version number.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;git-history&lt;/code&gt; only writes values to this table that have changed since the previous version. This means you can glance at the table grid and get a feel for which pieces of information were updated over time:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2021/ca-fires-incident-versions.png" alt="The table showing changes to that fire over time" style="max-width:100%;" /&gt;&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;ConditionStatement&lt;/code&gt; is a text description that changes frequently, but the other two interesting columns look to be &lt;code&gt;AcresBurned&lt;/code&gt; and &lt;code&gt;PercentContained&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;That &lt;code&gt;_commit&lt;/code&gt; column is a foreign key to the &lt;a href="https://git-history-demos.datasette.io/ca-fires/commits"&gt;commits&lt;/a&gt; table, which records the commits that have been processed by the tool - mainly so that when you run it a second time it can pick up where it finished last time.&lt;/p&gt;
&lt;p&gt;We can join against &lt;code&gt;commits&lt;/code&gt; to see the date that each version was created. Or we can use the &lt;a href="https://git-history-demos.datasette.io/ca-fires/incident_version_detail"&gt;incident_version_detail&lt;/a&gt; view which performs that join for us.&lt;/p&gt;
&lt;p&gt;Using that view, we can filter for just rows where &lt;code&gt;_item&lt;/code&gt; is 174 and &lt;code&gt;AcresBurned&lt;/code&gt; is not blank, then use the &lt;a href="https://datasette.io/plugins/datasette-vega"&gt;datasette-vega&lt;/a&gt; plugin to visualize the &lt;code&gt;_commit_at&lt;/code&gt; date column against the &lt;code&gt;AcresBurned&lt;/code&gt; numeric column... and we get a graph of &lt;a href="https://git-history-demos.datasette.io/ca-fires/incident_version_detail?_item__exact=174&amp;amp;AcresBurned__notblank=1#g.mark=line&amp;amp;g.x_column=_commit_at&amp;amp;g.x_type=temporal&amp;amp;g.y_column=AcresBurned&amp;amp;g.y_type=quantitative"&gt;the growth of the Dixie Fire over time&lt;/a&gt;!&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2021/ca-fires-chart.png" alt="The chart plugin showing a line chart" style="max-width:100%;" /&gt;&lt;/p&gt;
&lt;p&gt;To review: we started out with a GitHub Actions scheduled workflow grabbing a copy of a JSON API endpoint every 20 minutes. Thanks to &lt;code&gt;git-history&lt;/code&gt;, Datasette and &lt;code&gt;datasette-vega&lt;/code&gt; we now have a chart showing the growth of the longest-lived California wildfire of the last 14 months over time.&lt;/p&gt;
&lt;h4&gt;A note on schema design&lt;/h4&gt;
&lt;p&gt;One of the hardest problems in designing &lt;code&gt;git-history&lt;/code&gt; was deciding on an appropriate schema for storing version changes over time.&lt;/p&gt;
&lt;p&gt;I ended up with the following (edited for clarity):&lt;/p&gt;
&lt;div class="highlight highlight-source-sql"&gt;&lt;pre&gt;CREATE TABLE [commits] (
   [id] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt; &lt;span class="pl-k"&gt;PRIMARY KEY&lt;/span&gt;,
   [hash] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt;,
   [commit_at] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt;
);
CREATE TABLE [item] (
   [_id] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt; &lt;span class="pl-k"&gt;PRIMARY KEY&lt;/span&gt;,
   [_item_id] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt;,
   [IncidentID] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt;,
   [Location] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt;,
   [Type] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt;,
   [_commit] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt;
);
CREATE TABLE [item_version] (
   [_id] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt; &lt;span class="pl-k"&gt;PRIMARY KEY&lt;/span&gt;,
   [_item] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt; &lt;span class="pl-k"&gt;REFERENCES&lt;/span&gt; [item]([_id]),
   [_version] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt;,
   [_commit] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt; &lt;span class="pl-k"&gt;REFERENCES&lt;/span&gt; [commits]([id]),
   [IncidentID] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt;,
   [Location] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt;,
   [Type] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt;
);
CREATE TABLE [columns] (
   [id] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt; &lt;span class="pl-k"&gt;PRIMARY KEY&lt;/span&gt;,
   [namespace] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt; &lt;span class="pl-k"&gt;REFERENCES&lt;/span&gt; [namespaces]([id]),
   [name] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt;
);
CREATE TABLE [item_changed] (
   [item_version] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt; &lt;span class="pl-k"&gt;REFERENCES&lt;/span&gt; [item_version]([_id]),
   [column] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt; &lt;span class="pl-k"&gt;REFERENCES&lt;/span&gt; [columns]([id]),
   &lt;span class="pl-k"&gt;PRIMARY KEY&lt;/span&gt; ([item_version], [column])
);&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;As shown earlier, records in the &lt;code&gt;item_version&lt;/code&gt; table represent snapshots over time - but to save on database space and provide a neater interface for browsing versions, they only record columns that had changed since their previous version. Any unchanged columns are stored as &lt;code&gt;null&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;There's one catch with this schema: what do we do if a new version of an item sets one of the columns to &lt;code&gt;null&lt;/code&gt;? How can we tell the difference between that and a column that didn't change?&lt;/p&gt;
&lt;p&gt;I ended up solving that with an &lt;code&gt;item_changed&lt;/code&gt; many-to-many table, which uses pairs of integers (hopefully taking up as little space as possible) to record exactly which columns were modified in which &lt;code&gt;item_version&lt;/code&gt; records.&lt;/p&gt;
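&lt;p&gt;The delta logic can be sketched in a few lines of Python. This hypothetical &lt;code&gt;version_delta()&lt;/code&gt; helper (not git-history's actual code) records only the columns that changed, nulling the rest, and returns the explicit changed-column list that the &lt;code&gt;item_changed&lt;/code&gt; table would store:&lt;/p&gt;

```python
def version_delta(previous, current):
    # Compare two snapshots of the same item. Returns (delta, changed):
    # delta keeps only values that differ from the previous version,
    # with unchanged columns stored as None - mirroring the nulls in
    # item_version. The changed list records exactly which columns
    # moved, so a column genuinely set to null stays distinguishable
    # from one that simply did not change.
    changed = [
        col for col in current
        if col not in previous or previous[col] != current[col]
    ]
    delta = {col: (current[col] if col in changed else None) for col in current}
    return delta, changed

v1 = {"AcresBurned": 100, "PercentContained": 10, "Type": "Wildfire"}
v2 = {"AcresBurned": 250, "PercentContained": 10, "Type": "Wildfire"}
delta, changed = version_delta(v1, v2)
```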
&lt;p&gt;The &lt;code&gt;item_version_detail&lt;/code&gt; view displays columns from that many-to-many table as JSON - here's &lt;a href="https://git-history-demos.datasette.io/ca-fires/incident_version_detail?_version__gt=1&amp;amp;_col=_changed_columns&amp;amp;_col=_item&amp;amp;_col=_version"&gt;a filtered example&lt;/a&gt; showing which columns were changed in which versions of which items:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2021/ca-fires-changed-columns.png" alt="This table shows a JSON list of column names against items and versions" style="max-width:100%;" /&gt;&lt;/p&gt;
&lt;p&gt;Here's &lt;a href="https://git-history-demos.datasette.io/ca-fires?sql=select+columns.name%2C+count%28*%29%0D%0Afrom+incident_changed%0D%0A++join+incident_version+on+incident_changed.item_version+%3D+incident_version._id%0D%0A++join+columns+on+incident_changed.column+%3D+columns.id%0D%0Awhere+incident_version._version+%3E+1%0D%0Agroup+by+columns.name%0D%0Aorder+by+count%28*%29+desc"&gt;a SQL query&lt;/a&gt; that shows, for &lt;code&gt;ca-fires&lt;/code&gt;, which columns were updated most often:&lt;/p&gt;
&lt;div class="highlight highlight-source-sql"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;select&lt;/span&gt; &lt;span class="pl-c1"&gt;columns&lt;/span&gt;.&lt;span class="pl-c1"&gt;name&lt;/span&gt;, &lt;span class="pl-c1"&gt;count&lt;/span&gt;(&lt;span class="pl-k"&gt;*&lt;/span&gt;)
&lt;span class="pl-k"&gt;from&lt;/span&gt; incident_changed
  &lt;span class="pl-k"&gt;join&lt;/span&gt; incident_version &lt;span class="pl-k"&gt;on&lt;/span&gt; &lt;span class="pl-c1"&gt;incident_changed&lt;/span&gt;.&lt;span class="pl-c1"&gt;item_version&lt;/span&gt; &lt;span class="pl-k"&gt;=&lt;/span&gt; &lt;span class="pl-c1"&gt;incident_version&lt;/span&gt;.&lt;span class="pl-c1"&gt;_id&lt;/span&gt;
  &lt;span class="pl-k"&gt;join&lt;/span&gt; columns &lt;span class="pl-k"&gt;on&lt;/span&gt; &lt;span class="pl-c1"&gt;incident_changed&lt;/span&gt;.&lt;span class="pl-c1"&gt;column&lt;/span&gt; &lt;span class="pl-k"&gt;=&lt;/span&gt; &lt;span class="pl-c1"&gt;columns&lt;/span&gt;.&lt;span class="pl-c1"&gt;id&lt;/span&gt;
&lt;span class="pl-k"&gt;where&lt;/span&gt; &lt;span class="pl-c1"&gt;incident_version&lt;/span&gt;.&lt;span class="pl-c1"&gt;_version&lt;/span&gt; &lt;span class="pl-k"&gt;&amp;gt;&lt;/span&gt; &lt;span class="pl-c1"&gt;1&lt;/span&gt;
&lt;span class="pl-k"&gt;group by&lt;/span&gt; &lt;span class="pl-c1"&gt;columns&lt;/span&gt;.&lt;span class="pl-c1"&gt;name&lt;/span&gt;
&lt;span class="pl-k"&gt;order by&lt;/span&gt; &lt;span class="pl-c1"&gt;count&lt;/span&gt;(&lt;span class="pl-k"&gt;*&lt;/span&gt;) &lt;span class="pl-k"&gt;desc&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;ul&gt;
&lt;li&gt;Updated: 1785&lt;/li&gt;
&lt;li&gt;PercentContained: 740&lt;/li&gt;
&lt;li&gt;ConditionStatement: 734&lt;/li&gt;
&lt;li&gt;AcresBurned: 616&lt;/li&gt;
&lt;li&gt;Started: 327&lt;/li&gt;
&lt;li&gt;PersonnelInvolved: 286&lt;/li&gt;
&lt;li&gt;Engines: 274&lt;/li&gt;
&lt;li&gt;CrewsInvolved: 256&lt;/li&gt;
&lt;li&gt;WaterTenders: 225&lt;/li&gt;
&lt;li&gt;Dozers: 211&lt;/li&gt;
&lt;li&gt;AirTankers: 181&lt;/li&gt;
&lt;li&gt;StructuresDestroyed: 125&lt;/li&gt;
&lt;li&gt;Helicopters: 122&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Helicopters are exciting! Let's find all of the fires which had at least one record where the number of helicopters changed (after the first version). We'll use a nested SQL query:&lt;/p&gt;
&lt;div class="highlight highlight-source-sql"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;select&lt;/span&gt; &lt;span class="pl-k"&gt;*&lt;/span&gt; &lt;span class="pl-k"&gt;from&lt;/span&gt; incident
&lt;span class="pl-k"&gt;where&lt;/span&gt; _id &lt;span class="pl-k"&gt;in&lt;/span&gt; (
  &lt;span class="pl-k"&gt;select&lt;/span&gt; _item &lt;span class="pl-k"&gt;from&lt;/span&gt; incident_version
  &lt;span class="pl-k"&gt;where&lt;/span&gt; _id &lt;span class="pl-k"&gt;in&lt;/span&gt; (
    &lt;span class="pl-k"&gt;select&lt;/span&gt; item_version &lt;span class="pl-k"&gt;from&lt;/span&gt; incident_changed &lt;span class="pl-k"&gt;where&lt;/span&gt; column &lt;span class="pl-k"&gt;=&lt;/span&gt; &lt;span class="pl-c1"&gt;15&lt;/span&gt;
  )
  &lt;span class="pl-k"&gt;and&lt;/span&gt; _version &lt;span class="pl-k"&gt;&amp;gt;&lt;/span&gt; &lt;span class="pl-c1"&gt;1&lt;/span&gt;
)&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;That returned 19 fires that were significant enough to involve helicopters - &lt;a href="https://git-history-demos.datasette.io/ca-fires?sql=select+*+from+incident%0D%0Awhere+_id+in+%28%0D%0A++select+_item+from+incident_version%0D%0A++where+_id+in+%28%0D%0A++++select+item_version+from+incident_changed+where+column+%3D+15%0D%0A++%29%0D%0A++and+_version+%3E+1%0D%0A%29"&gt;here they are on a map&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2021/ca-fire-helicopter-map.png" alt="A map of 19 fires that involved helicopters" style="max-width:100%;" /&gt;&lt;/p&gt;
&lt;h4&gt;Advanced usage of --convert&lt;/h4&gt;
&lt;p&gt;Drew Breunig has been running a Git scraper for the past 8 months in &lt;a href="https://github.com/dbreunig/511-events-history"&gt;dbreunig/511-events-history&lt;/a&gt; against &lt;a href="https://511.org/"&gt;511.org&lt;/a&gt;, a site showing traffic incidents in the San Francisco Bay Area. I loaded his data into this example &lt;a href="https://git-history-demos.datasette.io/sf-bay-511"&gt;sf-bay-511 database&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;sf-bay-511&lt;/code&gt; example is useful for digging more into the &lt;code&gt;--convert&lt;/code&gt; option to &lt;code&gt;git-history&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;git-history&lt;/code&gt; requires recorded data to be in a specific shape: it needs a JSON list of JSON objects, where each object has a column that can be treated as a unique ID for purposes of tracking changes to that specific record over time.&lt;/p&gt;
&lt;p&gt;The ideal tracked JSON file would look something like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-json"&gt;&lt;pre&gt;[
  {
    &lt;span class="pl-ent"&gt;"IncidentID"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;abc123&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"Location"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Corner of 4th and Vermont&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"Type"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;fire&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;
  },
  {
    &lt;span class="pl-ent"&gt;"IncidentID"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;cde448&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"Location"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;555 West Example Drive&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"Type"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;medical&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;
  }
]&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;It's common for scraped data not to fit this ideal shape.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;511.org&lt;/code&gt; JSON feed &lt;a href="https://backend-prod.511.org/api-proxy/api/v1/traffic/events/?extended=true"&gt;can be found here&lt;/a&gt; - it's a pretty complicated nested set of objects, and a lot of the data in there is noisy without adding much to the overall analysis - things like an &lt;code&gt;updated&lt;/code&gt; timestamp field that changes in every version even when nothing else has changed, or a deeply nested &lt;code&gt;"extension"&lt;/code&gt; object full of duplicate data.&lt;/p&gt;
&lt;p&gt;I wrote a snippet of Python to transform each of those recorded snapshots into a simpler structure, and then passed that Python code to the script using the &lt;code&gt;--convert&lt;/code&gt; option:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;#!/bin/bash
git-history file sf-bay-511.db 511-events-history/events.json \
  --repo 511-events-history \
  --id id \
  --convert '
data = json.loads(content)
if data.get("error"):
    # {"code": 500, "error": "Error accessing remote data..."}
    return
for event in data["Events"]:
    event["id"] = event["extension"]["event-reference"]["event-identifier"]
    # Remove noisy updated timestamp
    del event["updated"]
    # Drop extension block entirely
    del event["extension"]
    # "schedule" block is noisy but not interesting
    del event["schedule"]
    # Flatten nested subtypes
    event["event_subtypes"] = event["event_subtypes"]["event_subtype"]
    if not isinstance(event["event_subtypes"], list):
        event["event_subtypes"] = [event["event_subtypes"]]
    yield event
'
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The single-quoted string passed to &lt;code&gt;--convert&lt;/code&gt; is compiled into a Python function and run against each Git version in turn. My code loops through the nested &lt;code&gt;Events&lt;/code&gt; list, modifying each record and then outputting them as an iterable sequence using &lt;code&gt;yield&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;A few of the records in the history were server 500 errors, so the code block knows how to identify and skip those as well.&lt;/p&gt;
&lt;p&gt;When working with &lt;code&gt;git-history&lt;/code&gt; I find myself spending most of my time iterating on these conversion scripts. Passing strings of Python code to tools like this is a pretty fun pattern - I also used it &lt;a href="https://simonwillison.net/2021/Aug/6/sqlite-utils-convert/"&gt;for sqlite-utils convert&lt;/a&gt; earlier this year.&lt;/p&gt;
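&lt;p&gt;The general technique is straightforward to replicate in your own tools: wrap the user's snippet in a function definition, &lt;code&gt;exec&lt;/code&gt; it, and pull the resulting function out of the namespace. Here's a minimal sketch of the pattern - an illustration of the general idea, not git-history's actual implementation:&lt;/p&gt;

```python
import json
import textwrap

def compile_convert(code):
    # Indent the user-supplied snippet into a generator function body.
    # The snippet can reference "content" and use "yield", just like
    # the --convert option described above. (A sketch of the pattern,
    # not git-history's actual implementation.)
    body = textwrap.indent(textwrap.dedent(code), "    ")
    source = "def _convert(content):\n" + body
    namespace = {"json": json}  # pre-import json for the snippet's use
    exec(source, namespace)
    return namespace["_convert"]

convert = compile_convert("""
data = json.loads(content)
for item in data:
    item["converted"] = True
    yield item
""")

rows = list(convert('[{"id": 1}, {"id": 2}]'))
```

&lt;p&gt;Because the snippet uses &lt;code&gt;yield&lt;/code&gt;, the compiled function is a generator, so one input version can produce any number of output records.&lt;/p&gt;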
&lt;h4&gt;Trying this out yourself&lt;/h4&gt;
&lt;p&gt;If you want to try this out for yourself the &lt;code&gt;git-history&lt;/code&gt; tool has &lt;a href="https://github.com/simonw/git-history/blob/main/README.md"&gt;an extensive README&lt;/a&gt; describing the other options, and the scripts used to create these demos can be found in the &lt;a href="https://github.com/simonw/git-history/tree/main/demos"&gt;demos folder&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The &lt;a href="https://github.com/topics/git-scraping"&gt;git-scraping topic&lt;/a&gt; on GitHub now has over 200 repos now built by dozens of different people - that's a lot of interesting scraped data sat there waiting to be explored!&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/cli"&gt;cli&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/data-journalism"&gt;data-journalism&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git"&gt;git&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/scraping"&gt;scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sqlite"&gt;sqlite&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git-history"&gt;git-history&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="cli"/><category term="data-journalism"/><category term="git"/><category term="projects"/><category term="scraping"/><category term="sqlite"/><category term="datasette"/><category term="git-history"/></entry><entry><title>Weeknotes: git-history, created for a Git scraping workshop</title><link href="https://simonwillison.net/2021/Nov/15/weeknotes-git-history/#atom-tag" rel="alternate"/><published>2021-11-15T04:10:50+00:00</published><updated>2021-11-15T04:10:50+00:00</updated><id>https://simonwillison.net/2021/Nov/15/weeknotes-git-history/#atom-tag</id><summary type="html">
    &lt;p&gt;My main project this week was a 90 minute workshop I delivered about Git scraping at &lt;a href="https://escoladedados.org/coda2021/"&gt;Coda.Br 2021&lt;/a&gt;, a Brazilian data journalism conference, on Friday. This inspired the creation of a brand new tool, &lt;strong&gt;git-history&lt;/strong&gt;, plus smaller improvements to a range of other projects.&lt;/p&gt;
&lt;h4&gt;git-history&lt;/h4&gt;
&lt;p&gt;I still need to do a detailed write-up of this one (update: &lt;a href="https://simonwillison.net/2021/Dec/7/git-history/"&gt;git-history: a tool for analyzing scraped data collected using Git and SQLite&lt;/a&gt;), but on Thursday I released a brand new tool called &lt;a href="https://datasette.io/tools/git-history"&gt;git-history&lt;/a&gt;, which I describe as "tools for analyzing Git history using SQLite".&lt;/p&gt;
&lt;p&gt;This tool is the missing link in the &lt;a href="https://simonwillison.net/2020/Oct/9/git-scraping/"&gt;Git scraping pattern&lt;/a&gt; I described here last October.&lt;/p&gt;
&lt;p&gt;Git scraping is the technique of regularly scraping an online source of information and writing the results to a file in a Git repository... which automatically gives you a full revision history of changes made to that data source over time.&lt;/p&gt;
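&lt;p&gt;The scraping half of the pattern is typically just a scheduled GitHub Actions workflow that fetches a URL and commits the result if it changed. A minimal sketch - the URL, filename and cron schedule here are illustrative placeholders:&lt;/p&gt;

```yaml
name: Scrape latest data

on:
  push:
  schedule:
    - cron: '6,26,46 * * * *'  # three times an hour

jobs:
  scheduled:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Fetch latest data
        run: curl https://example.com/data.json -o data.json
      - name: Commit and push if it changed
        run: |-
          git config user.name "Automated"
          git config user.email "actions@users.noreply.github.com"
          git add -A
          timestamp=$(date -u)
          git commit -m "Latest data: ${timestamp}" || exit 0
          git push
```

&lt;p&gt;The &lt;code&gt;|| exit 0&lt;/code&gt; means the workflow succeeds quietly when nothing changed, so only genuine changes end up as commits.&lt;/p&gt;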
&lt;p&gt;The missing piece has always been what to do next: how do you turn a commit history of changes to a JSON or CSV file into a data source that can be used to answer questions about how that file changed over time?&lt;/p&gt;
&lt;p&gt;I've written one-off Python scripts for this a few times (here's &lt;a href="https://github.com/simonw/cdc-vaccination-history/blob/6f6bcb9437c0d44c4bcf94c111c631cc50bc2744/build_database.py"&gt;my CDC vaccinations one&lt;/a&gt;, for example), but giving an interactive workshop about the technique finally inspired me to build a tool to help.&lt;/p&gt;
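&lt;p&gt;The core of every one of those scripts is the same loop over the file's Git history. A minimal sketch of that loop (the function name is my own, and this assumes the file contains JSON):&lt;/p&gt;

```python
import json
import subprocess

def iterate_file_versions(repo_path, filepath):
    """Yield (commit_sha, parsed JSON) for every commit that touched filepath."""
    # List the commits that changed the file, oldest first
    shas = subprocess.check_output(
        ["git", "log", "--format=%H", "--reverse", "--", filepath],
        cwd=repo_path, text=True,
    ).split()
    for sha in shas:
        # Read the file's content as of that commit
        content = subprocess.check_output(
            ["git", "show", f"{sha}:{filepath}"],
            cwd=repo_path, text=True,
        )
        yield sha, json.loads(content)
```

&lt;p&gt;Each yielded snapshot can then be diffed against the previous one and written to SQLite - which is essentially the job git-history now automates.&lt;/p&gt;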
&lt;p&gt;The tool has &lt;a href="https://datasette.io/tools/git-history"&gt;a comprehensive README&lt;/a&gt;, but the short version is that you can take a JSON (or CSV) file in a repository that has been tracking changes to some items over time and run the following to load all of the different versions into a SQLite database file for analysis with &lt;a href="https://datasette.io/"&gt;Datasette&lt;/a&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;git-history file incidents.db incidents.json --id IncidentID
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This assumes that &lt;code&gt;incidents.json&lt;/code&gt; contains a JSON array of incidents (reported fires for example) and that each incident has an &lt;code&gt;IncidentID&lt;/code&gt; identifier key. It will then loop through the Git history of that file right from the start, creating an &lt;code&gt;item_versions&lt;/code&gt; table that tracks every change made to each of those items - using &lt;code&gt;IncidentID&lt;/code&gt; to decide if a row represents a new incident or an update to a previous one.&lt;/p&gt;
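&lt;p&gt;The ID-based change detection at the heart of this can be sketched in a few lines of Python - note that the &lt;code&gt;_item_id&lt;/code&gt; and &lt;code&gt;_commit&lt;/code&gt; column names here are illustrative, not the tool's actual schema:&lt;/p&gt;

```python
def track_versions(snapshots, id_key):
    # snapshots: iterable of (commit_sha, list-of-dicts), oldest first.
    # Returns one row per change: a row is emitted only when an item
    # is new or differs from its most recently seen version.
    last_seen = {}  # item id -> most recent version of that item
    versions = []
    for sha, items in snapshots:
        for item in items:
            item_id = item[id_key]
            if last_seen.get(item_id) != item:
                versions.append({"_item_id": item_id, "_commit": sha, **item})
                last_seen[item_id] = item
    return versions

rows = track_versions(
    [
        ("sha1", [{"IncidentID": "abc123", "Type": "fire"}]),
        ("sha2", [{"IncidentID": "abc123", "Type": "fire"},
                  {"IncidentID": "cde448", "Type": "medical"}]),
        ("sha3", [{"IncidentID": "abc123", "Type": "contained"},
                  {"IncidentID": "cde448", "Type": "medical"}]),
    ],
    "IncidentID",
)
```

&lt;p&gt;Three snapshots here produce just three version rows: the unchanged items in the second and third snapshots generate nothing, which is what keeps the resulting table compact.&lt;/p&gt;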
&lt;p&gt;I have a few more improvements I want to make before I start more widely promoting this, but it's already really useful. I've had a lot of fun running it against example repos from the &lt;a href="https://github.com/topics/git-scraping"&gt;git-scraping GitHub topic&lt;/a&gt; (now at 202 repos and counting).&lt;/p&gt;
&lt;h4&gt;Workshop: Raspando dados com o GitHub Actions e analisando com Datasette&lt;/h4&gt;
&lt;p&gt;The workshop I gave at the conference was live-translated into Portuguese, which is really exciting! I'm looking forward to watching the video when it comes out and seeing how well that worked.&lt;/p&gt;
&lt;p&gt;The title translates to "Scraping data with GitHub Actions and analyzing with Datasette", and it was the first time I've given a workshop that combines Git scraping and Datasette - hence the development of the new git-history tool to help tie the two together.&lt;/p&gt;
&lt;p&gt;I think it went really well. I put together four detailed exercises for the attendees, and then worked through each one live with the goal of attendees working through them at the same time - a method I learned from the Carpentries training course I took &lt;a href="https://simonwillison.net/2020/Sep/26/weeknotes-software-carpentry-sqlite/"&gt;last year&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Four exercises turns out to be exactly right for 90 minutes, with reasonable time for an introduction and some extra material and questions at the end.&lt;/p&gt;
&lt;p&gt;The worst part of running a workshop is inevitably the part where you try and get everyone set up with a functional development environment on their own machines (see &lt;a href="https://xkcd.com/1987/"&gt;XKCD 1987&lt;/a&gt;). This time round I skipped that entirely by encouraging my students to use &lt;strong&gt;&lt;a href="https://gitpod.io/"&gt;GitPod&lt;/a&gt;&lt;/strong&gt;, which provides free browser-based cloud development environments running Linux, with a browser-embedded VS Code editor and terminal running on top.&lt;/p&gt;

&lt;p&gt;&lt;img style="max-width: 100%" src="https://static.simonwillison.net/static/2021/start-datasette-gitpod.gif" alt="Animated demo of GitPod showing how to run Datasette and have it proxy a port" /&gt;&lt;/p&gt;

&lt;p&gt;(It's similar to &lt;a href="https://github.com/features/codespaces"&gt;GitHub Codespaces&lt;/a&gt;, but Codespaces is not yet available to free customers outside of the beta.)&lt;/p&gt;
&lt;p&gt;I demonstrated all of the exercises using GitPod myself during the workshop, and ensured that they could be entirely completed through that environment, with no laptop software needed at all.&lt;/p&gt;
&lt;p&gt;This worked &lt;strong&gt;so well&lt;/strong&gt;. Not having to worry about development environments makes workshops massively more productive. I will absolutely be doing this again in the future.&lt;/p&gt;
&lt;p&gt;The workshop exercises are available &lt;a href="https://docs.google.com/document/d/1TCatZP5gQNfFjZJ5M77wMlf9u_05Z3BZnjp6t1SA6UU/edit"&gt;in this Google Doc&lt;/a&gt;, and I hope to extract some of them out into official tutorials for various tools later on.&lt;/p&gt;
&lt;h4&gt;Datasette 0.59.2&lt;/h4&gt;
&lt;p&gt;Yesterday was Datasette's fourth birthday - the four year anniversary of &lt;a href="https://simonwillison.net/2017/Nov/13/datasette/"&gt;the initial release announcement&lt;/a&gt;! I celebrated by releasing a minor bug-fix, &lt;a href="https://github.com/simonw/datasette/releases/tag/0.59.2"&gt;Datasette 0.59.2&lt;/a&gt;, the release notes for which are quoted below:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Column names with a leading underscore now work correctly when used as a facet. (&lt;a href="https://github.com/simonw/datasette/issues/1506"&gt;#1506&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Applying &lt;code&gt;?_nocol=&lt;/code&gt; to a column no longer removes that column from the filtering interface. (&lt;a href="https://github.com/simonw/datasette/issues/1503"&gt;#1503&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Official Datasette Docker container now uses Debian Bullseye as the base image. (&lt;a href="https://github.com/simonw/datasette/issues/1497"&gt;#1497&lt;/a&gt;)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;That first change was inspired by ongoing work on &lt;code&gt;git-history&lt;/code&gt;, where I decided to use an &lt;code&gt;_id&lt;/code&gt; underscore prefix pattern for columns reserved for use by that tool, in order &lt;a href="https://github.com/simonw/git-history/issues/14"&gt;to avoid clashing with column names&lt;/a&gt; in the provided source data.&lt;/p&gt;
&lt;h4&gt;sqlite-utils 3.18&lt;/h4&gt;
&lt;p&gt;Today I released &lt;a href="https://sqlite-utils.datasette.io/en/stable/changelog.html#v3-18"&gt;sqlite-utils 3.18&lt;/a&gt; - initially also to provide a feature I wanted for &lt;code&gt;git-history&lt;/code&gt; (a way to &lt;a href="https://github.com/simonw/sqlite-utils/issues/339"&gt;populate additional columns&lt;/a&gt; when creating a row using &lt;code&gt;table.lookup()&lt;/code&gt;) but I also closed some bug reports and landed some small pull requests that had come in since 3.17.&lt;/p&gt;
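&lt;p&gt;The idea behind &lt;code&gt;table.lookup()&lt;/code&gt; is a find-or-create: return the existing row's ID if the lookup values match, otherwise insert a new row - and the new feature populates extra columns only at creation time. A rough sketch of that pattern in plain &lt;code&gt;sqlite3&lt;/code&gt; (an illustration of the idea, not sqlite-utils' implementation):&lt;/p&gt;

```python
import sqlite3

def lookup(db, table, lookup_values, extra_values=None):
    # Find-or-create: return the rowid matching lookup_values,
    # inserting a new row (with extra_values populated) only if no
    # matching row exists yet.
    cols = list(lookup_values)
    where = " AND ".join(f"[{c}] = ?" for c in cols)
    row = db.execute(
        f"SELECT rowid FROM [{table}] WHERE {where}",
        [lookup_values[c] for c in cols],
    ).fetchone()
    if row:
        return row[0]
    insert = {**lookup_values, **(extra_values or {})}
    names = ", ".join(f"[{c}]" for c in insert)
    placeholders = ", ".join("?" for _ in insert)
    cur = db.execute(
        f"INSERT INTO [{table}] ({names}) VALUES ({placeholders})",
        list(insert.values()),
    )
    return cur.lastrowid

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE species (name TEXT, first_seen TEXT)")
id1 = lookup(db, "species", {"name": "Palm"}, {"first_seen": "2021-11-15"})
id2 = lookup(db, "species", {"name": "Palm"}, {"first_seen": "2021-11-16"})
```

&lt;p&gt;Both calls return the same ID, and &lt;code&gt;first_seen&lt;/code&gt; keeps its original value - the extra columns are only written when the row is first created.&lt;/p&gt;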
&lt;h4&gt;s3-credentials 0.5&lt;/h4&gt;
&lt;p&gt;Earlier in the week I released &lt;a href="https://github.com/simonw/s3-credentials/releases/tag/0.5"&gt;version 0.5&lt;/a&gt; of &lt;a href="https://github.com/simonw/s3-credentials"&gt;s3-credentials&lt;/a&gt; - my CLI tool for creating read-only, read-write or write-only AWS credentials for a specific S3 bucket.&lt;/p&gt;
&lt;p&gt;The biggest new feature is the ability to create temporary credentials that expire after a given time limit.&lt;/p&gt;
&lt;p&gt;This is achieved using &lt;code&gt;STS.assume_role()&lt;/code&gt;, where STS is the &lt;a href="https://docs.aws.amazon.com/STS/latest/APIReference/welcome.html"&gt;Security Token Service&lt;/a&gt;. I've been wanting to learn this API for quite a while now.&lt;/p&gt;
&lt;p&gt;Assume role comes with some limitations: tokens must live between 15 minutes and 12 hours, and you need to first create a role that you can assume. In creating those credentials you can define an additional policy document, which is how I scope down the token I'm creating to only allow a specific level of access to a specific S3 bucket.&lt;/p&gt;
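&lt;p&gt;A scoped-down policy document for a read-only grant against a single bucket might look roughly like this - the bucket name and action list here are illustrative:&lt;/p&gt;

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::my-example-bucket",
        "arn:aws:s3:::my-example-bucket/*"
      ]
    }
  ]
}
```

&lt;p&gt;Note the two &lt;code&gt;Resource&lt;/code&gt; ARNs: the bare bucket ARN covers listing, while the &lt;code&gt;/*&lt;/code&gt; variant covers the objects inside it.&lt;/p&gt;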
&lt;p&gt;I've learned a huge amount about AWS, IAM and S3 through developing this project. I think I'm finally overcoming my multi-year phobia of anything involving IAM!&lt;/p&gt;
&lt;h4&gt;Releases this week&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/sqlite-utils"&gt;sqlite-utils&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/sqlite-utils/releases/tag/3.18"&gt;3.18&lt;/a&gt; - (&lt;a href="https://github.com/simonw/sqlite-utils/releases"&gt;88 releases total&lt;/a&gt;) - 2021-11-15
&lt;br /&gt;Python CLI utility and library for manipulating SQLite databases&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette"&gt;datasette&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/datasette/releases/tag/0.59.2"&gt;0.59.2&lt;/a&gt; - (&lt;a href="https://github.com/simonw/datasette/releases"&gt;100 releases total&lt;/a&gt;) - 2021-11-14
&lt;br /&gt;An open source multi-tool for exploring and publishing data&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette-hello-world"&gt;datasette-hello-world&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/datasette-hello-world/releases/tag/0.1.1"&gt;0.1.1&lt;/a&gt; - (&lt;a href="https://github.com/simonw/datasette-hello-world/releases"&gt;2 releases total&lt;/a&gt;) - 2021-11-14
&lt;br /&gt;The hello world of Datasette plugins&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/git-history"&gt;git-history&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/git-history/releases/tag/0.3.1"&gt;0.3.1&lt;/a&gt; - (&lt;a href="https://github.com/simonw/git-history/releases"&gt;5 releases total&lt;/a&gt;) - 2021-11-12
&lt;br /&gt;Tools for analyzing Git history using SQLite&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/s3-credentials"&gt;s3-credentials&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/s3-credentials/releases/tag/0.5"&gt;0.5&lt;/a&gt; - (&lt;a href="https://github.com/simonw/s3-credentials/releases"&gt;5 releases total&lt;/a&gt;) - 2021-11-11
&lt;br /&gt;A tool for creating credentials for accessing S3 buckets&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;TIL this week&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/kubernetes/basic-datasette-in-kubernetes"&gt;Basic Datasette in Kubernetes&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/deno/annotated-deno-deploy-demo"&gt;Annotated code for a demo of WebSocket chat in Deno Deploy&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/javascript/tesseract-ocr-javascript"&gt;Using Tesseract.js to OCR every image on a page&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/aws"&gt;aws&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/s3"&gt;s3&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/my-talks"&gt;my-talks&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/teaching"&gt;teaching&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/weeknotes"&gt;weeknotes&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git-scraping"&gt;git-scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sqlite-utils"&gt;sqlite-utils&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git-history"&gt;git-history&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/s3-credentials"&gt;s3-credentials&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="aws"/><category term="projects"/><category term="s3"/><category term="my-talks"/><category term="teaching"/><category term="datasette"/><category term="weeknotes"/><category term="git-scraping"/><category term="sqlite-utils"/><category term="git-history"/><category term="s3-credentials"/></entry></feed>