<?xml version="1.0" encoding="utf-8"?>
<feed xml:lang="en-us" xmlns="http://www.w3.org/2005/Atom"><title>Simon Willison's Weblog: git-history</title><link href="http://simonwillison.net/" rel="alternate"/><link href="http://simonwillison.net/tags/git-history.atom" rel="self"/><id>http://simonwillison.net/</id><updated>2022-11-20T07:00:54+00:00</updated><author><name>Simon Willison</name></author><entry><title>Tracking Mastodon user numbers over time with a bucket of tricks</title><link href="https://simonwillison.net/2022/Nov/20/tracking-mastodon/#atom-tag" rel="alternate"/><published>2022-11-20T07:00:54+00:00</published><updated>2022-11-20T07:00:54+00:00</updated><id>https://simonwillison.net/2022/Nov/20/tracking-mastodon/#atom-tag</id><summary type="html">
    &lt;p&gt;&lt;a href="https://joinmastodon.org/"&gt;Mastodon&lt;/a&gt; is definitely having a moment. User growth is skyrocketing as more and more people migrate over from Twitter.&lt;/p&gt;
&lt;p&gt;I've set up a new &lt;a href="https://simonwillison.net/2020/Oct/9/git-scraping/"&gt;git scraper&lt;/a&gt; to track the number of registered user accounts on known Mastodon instances over time.&lt;/p&gt;
&lt;p&gt;It's only been running for a few hours, but it's already collected enough data to &lt;a href="https://observablehq.com/@simonw/mastodon-users-and-statuses-over-time"&gt;render this chart&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2022/mastodon-users-few-hours.png" alt="The chart starts at around 1am with 4,694,000 users - it climbs to 4,716,000 users by 6am in a relatively straight line" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;I'm looking forward to seeing how this trend continues to develop over the next days and weeks.&lt;/p&gt;
&lt;h4&gt;Scraping the data&lt;/h4&gt;
&lt;p&gt;My scraper works by tracking &lt;a href="https://instances.social/"&gt;https://instances.social/&lt;/a&gt; - a website that lists a large number (but not all) of the Mastodon instances that are out there.&lt;/p&gt;
&lt;p&gt;That site publishes an &lt;a href="https://instances.social/instances.json"&gt;instances.json&lt;/a&gt; array which currently contains 1,830 objects representing Mastodon instances. Each of those objects looks something like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-json"&gt;&lt;pre&gt;{
    &lt;span class="pl-ent"&gt;"name"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;pleroma.otter.sh&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"title"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Otterland&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"short_description"&lt;/span&gt;: &lt;span class="pl-c1"&gt;null&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"description"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Otters does squeak squeak&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"uptime"&lt;/span&gt;: &lt;span class="pl-c1"&gt;0.944757&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"up"&lt;/span&gt;: &lt;span class="pl-c1"&gt;true&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"https_score"&lt;/span&gt;: &lt;span class="pl-c1"&gt;null&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"https_rank"&lt;/span&gt;: &lt;span class="pl-c1"&gt;null&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"ipv6"&lt;/span&gt;: &lt;span class="pl-c1"&gt;true&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"openRegistrations"&lt;/span&gt;: &lt;span class="pl-c1"&gt;false&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"users"&lt;/span&gt;: &lt;span class="pl-c1"&gt;5&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"statuses"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;54870&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"connections"&lt;/span&gt;: &lt;span class="pl-c1"&gt;9821&lt;/span&gt;,
}&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;I have &lt;a href="https://github.com/simonw/scrape-instances-social/blob/main/.github/workflows/scrape.yml"&gt;a GitHub Actions workflow&lt;/a&gt; running approximately every 20 minutes that fetches a copy of that file and commits it back to this repository:&lt;/p&gt;
&lt;p&gt;&lt;a href="https://github.com/simonw/scrape-instances-social"&gt;https://github.com/simonw/scrape-instances-social&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Since each instance includes a &lt;code&gt;users&lt;/code&gt; count, the commit history of my &lt;code&gt;instances.json&lt;/code&gt; file tells the story of Mastodon's growth over time.&lt;/p&gt;
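&lt;p&gt;That headline number is just a sum over the array. Here's a minimal sketch of the calculation, using the field shape shown above (the second instance is invented, to show how a &lt;code&gt;null&lt;/code&gt; users count is handled):&lt;/p&gt;

```python
# Sketch: compute the headline number from one instances.json snapshot.
# Field shape follows the instances.social example above; "users" can be null.
import json

snapshot = json.dumps([
    {"name": "pleroma.otter.sh", "users": 5, "statuses": "54870"},
    {"name": "example.social", "users": None, "statuses": "120"},
])

instances = json.loads(snapshot)
# Treat a null users count as zero
total_users = sum(d["users"] or 0 for d in instances)
print(total_users)  # 5
```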
&lt;h4&gt;Building a database&lt;/h4&gt;
&lt;p&gt;A commit log of a JSON file is interesting, but the next step is to turn that into actionable information.&lt;/p&gt;
&lt;p&gt;My &lt;a href="https://simonwillison.net/2021/Dec/7/git-history/"&gt;git-history tool&lt;/a&gt; is designed to do exactly that.&lt;/p&gt;
&lt;p&gt;For the chart up above, the only number I care about is the total number of users listed in each snapshot of the file - the sum of that &lt;code&gt;users&lt;/code&gt; field for each instance.&lt;/p&gt;
&lt;p&gt;Here's how to run &lt;code&gt;git-history&lt;/code&gt; against that file's commit history to generate tables showing how that count has changed over time:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;git-history file counts.db instances.json \
  --convert &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;return [&lt;/span&gt;
&lt;span class="pl-s"&gt;    {&lt;/span&gt;
&lt;span class="pl-s"&gt;        'id': 'all',&lt;/span&gt;
&lt;span class="pl-s"&gt;        'users': sum(d['users'] or 0 for d in json.loads(content)),&lt;/span&gt;
&lt;span class="pl-s"&gt;        'statuses': sum(int(d['statuses'] or 0) for d in json.loads(content)),&lt;/span&gt;
&lt;span class="pl-s"&gt;    }&lt;/span&gt;
&lt;span class="pl-s"&gt;  ]&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; --id id&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;I'm creating a file called &lt;code&gt;counts.db&lt;/code&gt; that shows the history of the &lt;code&gt;instances.json&lt;/code&gt; file.&lt;/p&gt;
&lt;p&gt;The real trick here though is that &lt;code&gt;--convert&lt;/code&gt; argument. I'm using that to compress each snapshot down to a single row that looks like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-json"&gt;&lt;pre&gt;{
    &lt;span class="pl-ent"&gt;"id"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;all&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"users"&lt;/span&gt;: &lt;span class="pl-c1"&gt;4717781&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"statuses"&lt;/span&gt;: &lt;span class="pl-c1"&gt;374217860&lt;/span&gt;
}&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Normally &lt;code&gt;git-history&lt;/code&gt; expects to work against an array of objects, tracking the history of changes to each one based on their &lt;code&gt;id&lt;/code&gt; property.&lt;/p&gt;
&lt;p&gt;Here I'm tricking it a bit - I only return a single object with the ID of &lt;code&gt;all&lt;/code&gt;. This means that &lt;code&gt;git-history&lt;/code&gt; will only track the history of changes to that single object.&lt;/p&gt;
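&lt;p&gt;Written out as a plain Python function, the &lt;code&gt;--convert&lt;/code&gt; snippet does something like this - &lt;code&gt;content&lt;/code&gt; is the variable &lt;code&gt;git-history&lt;/code&gt; exposes holding each version of the file, and the sample data here is invented:&lt;/p&gt;

```python
# Sketch: the --convert snippet above as a standalone function.
import json

def convert(content):
    # Collapse a whole instances.json snapshot into one row with id "all".
    # Note "statuses" arrives as a string in the API, hence the int() call.
    return [
        {
            "id": "all",
            "users": sum(d["users"] or 0 for d in json.loads(content)),
            "statuses": sum(int(d["statuses"] or 0) for d in json.loads(content)),
        }
    ]

sample = json.dumps([
    {"users": 5, "statuses": "54870"},
    {"users": None, "statuses": "100"},
])
print(convert(sample))  # [{'id': 'all', 'users': 5, 'statuses': 54970}]
```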
&lt;p&gt;It works though! The result is a &lt;code&gt;counts.db&lt;/code&gt; file which is currently 52KB and has the following schema (truncated to the most interesting bits):&lt;/p&gt;
&lt;div class="highlight highlight-source-sql"&gt;&lt;pre&gt;CREATE TABLE [commits] (
   [id] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt; &lt;span class="pl-k"&gt;PRIMARY KEY&lt;/span&gt;,
   [namespace] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt; &lt;span class="pl-k"&gt;REFERENCES&lt;/span&gt; [namespaces]([id]),
   [hash] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt;,
   [commit_at] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt;
);
CREATE TABLE [item_version] (
   [_id] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt; &lt;span class="pl-k"&gt;PRIMARY KEY&lt;/span&gt;,
   [_item] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt; &lt;span class="pl-k"&gt;REFERENCES&lt;/span&gt; [item]([_id]),
   [_version] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt;,
   [_commit] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt; &lt;span class="pl-k"&gt;REFERENCES&lt;/span&gt; [commits]([id]),
   [id] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt;,
   [users] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt;,
   [statuses] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt;,
   [_item_full_hash] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt;
);&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Each &lt;code&gt;item_version&lt;/code&gt; row will tell us the number of users and statuses at a particular point in time, based on a join against that &lt;code&gt;commits&lt;/code&gt; table to find the &lt;code&gt;commit_at&lt;/code&gt; date.&lt;/p&gt;
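&lt;p&gt;Here's a sketch of that join, run with Python's &lt;code&gt;sqlite3&lt;/code&gt; module against a tiny in-memory copy of the schema - the table and column names come from the schema above, but the rows are made up:&lt;/p&gt;

```python
# Sketch: join item_version against commits to get users-over-time,
# using a stripped-down in-memory copy of the counts.db schema.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE commits (id INTEGER PRIMARY KEY, hash TEXT, commit_at TEXT);
CREATE TABLE item_version (
    _id INTEGER PRIMARY KEY, _commit INTEGER REFERENCES commits(id),
    id TEXT, users INTEGER, statuses INTEGER);
INSERT INTO commits VALUES (1, 'abc', '2022-11-20T01:00:00'),
                           (2, 'def', '2022-11-20T06:00:00');
INSERT INTO item_version (_commit, id, users, statuses)
    VALUES (1, 'all', 4694000, 374000000), (2, 'all', 4716000, 374200000);
""")
rows = db.execute("""
    SELECT commits.commit_at, item_version.users
    FROM item_version JOIN commits ON item_version._commit = commits.id
    ORDER BY commits.commit_at
""").fetchall()
print(rows)
```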
&lt;h4&gt;Publishing the database&lt;/h4&gt;
&lt;p&gt;For this project, I decided to publish the SQLite database to an S3 bucket. I considered pushing the binary SQLite file directly to the GitHub repository but this felt rude, since a binary file that changes every 20 minutes would bloat the repository.&lt;/p&gt;
&lt;p&gt;I wanted to serve the file with open CORS headers so I could load it into Datasette Lite and Observable notebooks.&lt;/p&gt;
&lt;p&gt;I used my &lt;a href="https://s3-credentials.readthedocs.io/"&gt;s3-credentials&lt;/a&gt; tool to create a bucket for this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;~ % s3-credentials create scrape-instances-social --public --website --create-bucket
Created bucket: scrape-instances-social
Attached bucket policy allowing public access
Configured website: IndexDocument=index.html, ErrorDocument=error.html
Created user: 's3.read-write.scrape-instances-social' with permissions boundary: 'arn:aws:iam::aws:policy/AmazonS3FullAccess'
Attached policy s3.read-write.scrape-instances-social to user s3.read-write.scrape-instances-social
Created access key for user: s3.read-write.scrape-instances-social
{
    "UserName": "s3.read-write.scrape-instances-social",
    "AccessKeyId": "AKIAWXFXAIOZI5NUS6VU",
    "Status": "Active",
    "SecretAccessKey": "...",
    "CreateDate": "2022-11-20 05:52:22+00:00"
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This created a new bucket called &lt;code&gt;scrape-instances-social&lt;/code&gt; configured to work as a website and allow public access.&lt;/p&gt;
&lt;p&gt;It also generated an access key and a secret access key with access to just that bucket. I saved these in GitHub Actions secrets called &lt;code&gt;AWS_ACCESS_KEY_ID&lt;/code&gt; and &lt;code&gt;AWS_SECRET_ACCESS_KEY&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;I enabled a CORS policy on the bucket like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;s3-credentials set-cors-policy scrape-instances-social
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then I added the following to my GitHub Actions workflow to build and upload the database after each run of the scraper:&lt;/p&gt;
&lt;div class="highlight highlight-source-yaml"&gt;&lt;pre&gt;    - &lt;span class="pl-ent"&gt;name&lt;/span&gt;: &lt;span class="pl-s"&gt;Build and publish database using git-history&lt;/span&gt;
      &lt;span class="pl-ent"&gt;env&lt;/span&gt;:
        &lt;span class="pl-ent"&gt;AWS_ACCESS_KEY_ID&lt;/span&gt;: &lt;span class="pl-s"&gt;${{ secrets.AWS_ACCESS_KEY_ID }}&lt;/span&gt;
        &lt;span class="pl-ent"&gt;AWS_SECRET_ACCESS_KEY&lt;/span&gt;: &lt;span class="pl-s"&gt;${{ secrets.AWS_SECRET_ACCESS_KEY }}&lt;/span&gt;
      &lt;span class="pl-ent"&gt;run&lt;/span&gt;: &lt;span class="pl-s"&gt;|-&lt;/span&gt;
&lt;span class="pl-s"&gt;        # First download previous database to save some time&lt;/span&gt;
&lt;span class="pl-s"&gt;        wget https://scrape-instances-social.s3.amazonaws.com/counts.db&lt;/span&gt;
&lt;span class="pl-s"&gt;        # Update with latest commits&lt;/span&gt;
&lt;span class="pl-s"&gt;        ./build-count-history.sh&lt;/span&gt;
&lt;span class="pl-s"&gt;        # Upload to S3&lt;/span&gt;
&lt;span class="pl-s"&gt;        s3-credentials put-object scrape-instances-social counts.db counts.db \&lt;/span&gt;
&lt;span class="pl-s"&gt;          --access-key $AWS_ACCESS_KEY_ID \&lt;/span&gt;
&lt;span class="pl-s"&gt;          --secret-key $AWS_SECRET_ACCESS_KEY&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;code&gt;git-history&lt;/code&gt; knows how to only process commits since the last time the database was built, so downloading the previous copy saves a lot of time.&lt;/p&gt;
&lt;h4&gt;Exploring the data&lt;/h4&gt;
&lt;p&gt;Now that I have a SQLite database that's being served over CORS-enabled HTTPS I can open it in &lt;a href="https://simonwillison.net/2022/May/4/datasette-lite/"&gt;Datasette Lite&lt;/a&gt; - my implementation of Datasette compiled to WebAssembly that runs entirely in a browser.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://lite.datasette.io/?url=https://scrape-instances-social.s3.amazonaws.com/counts.db"&gt;https://lite.datasette.io/?url=https://scrape-instances-social.s3.amazonaws.com/counts.db&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Any time anyone follows this link their browser will fetch the latest copy of the &lt;code&gt;counts.db&lt;/code&gt; file directly from S3.&lt;/p&gt;
&lt;p&gt;The most interesting page in there is the &lt;code&gt;item_version_detail&lt;/code&gt; SQL view, which joins against the commits table to show the date of each change:&lt;/p&gt;
&lt;p&gt;&lt;a href="https://lite.datasette.io/?url=https://scrape-instances-social.s3.amazonaws.com/counts.db#/counts/item_version_detail"&gt;https://lite.datasette.io/?url=https://scrape-instances-social.s3.amazonaws.com/counts.db#/counts/item_version_detail&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;(Datasette Lite lets you link directly to pages within Datasette itself via a &lt;code&gt;#hash&lt;/code&gt;.)&lt;/p&gt;
&lt;h4&gt;Plotting a chart&lt;/h4&gt;
&lt;p&gt;Datasette Lite doesn't have charting yet, so I decided to turn to my favourite visualization tool, an &lt;a href="https://observablehq.com/"&gt;Observable&lt;/a&gt; notebook.&lt;/p&gt;
&lt;p&gt;Observable has the ability to query SQLite databases (that are served via CORS) directly these days!&lt;/p&gt;
&lt;p&gt;Here's my notebook:&lt;/p&gt;
&lt;p&gt;&lt;a href="https://observablehq.com/@simonw/mastodon-users-and-statuses-over-time"&gt;https://observablehq.com/@simonw/mastodon-users-and-statuses-over-time&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;There are only four cells needed to create the chart shown above.&lt;/p&gt;
&lt;p&gt;First, we need to open the SQLite database from the remote URL:&lt;/p&gt;
&lt;div class="highlight highlight-source-js"&gt;&lt;pre&gt;&lt;span class="pl-s1"&gt;database&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-v"&gt;SQLiteDatabaseClient&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;open&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;
  &lt;span class="pl-s"&gt;"https://scrape-instances-social.s3.amazonaws.com/counts.db"&lt;/span&gt;
&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Next we need to use an Observable Database query cell to execute SQL against that database and pull out the data we want to plot - and store it in a &lt;code&gt;query&lt;/code&gt; variable:&lt;/p&gt;
&lt;div class="highlight highlight-source-sql"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;SELECT&lt;/span&gt; _commit_at &lt;span class="pl-k"&gt;as&lt;/span&gt; &lt;span class="pl-k"&gt;date&lt;/span&gt;, users, statuses
&lt;span class="pl-k"&gt;FROM&lt;/span&gt; item_version_detail&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;One change is needed to that data: the &lt;code&gt;date&lt;/code&gt; column must be converted from a string into a JavaScript &lt;code&gt;Date&lt;/code&gt; object:&lt;/p&gt;
&lt;div class="highlight highlight-source-js"&gt;&lt;pre&gt;&lt;span class="pl-s1"&gt;points&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;query&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;map&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s1"&gt;d&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt; &lt;span class="pl-c1"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-kos"&gt;{&lt;/span&gt;
  &lt;span class="pl-c1"&gt;date&lt;/span&gt;: &lt;span class="pl-k"&gt;new&lt;/span&gt; &lt;span class="pl-v"&gt;Date&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s1"&gt;d&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;date&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt;
  &lt;span class="pl-c1"&gt;users&lt;/span&gt;: &lt;span class="pl-s1"&gt;d&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;users&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt;
  &lt;span class="pl-c1"&gt;statuses&lt;/span&gt;: &lt;span class="pl-s1"&gt;d&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;statuses&lt;/span&gt;
&lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Finally, we can plot the data using the &lt;a href="https://observablehq.com/@observablehq/plot"&gt;Observable Plot&lt;/a&gt; charting library like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-js"&gt;&lt;pre&gt;&lt;span class="pl-v"&gt;Plot&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;plot&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-kos"&gt;{&lt;/span&gt;
  &lt;span class="pl-c1"&gt;y&lt;/span&gt;: &lt;span class="pl-kos"&gt;{&lt;/span&gt;
    &lt;span class="pl-c1"&gt;grid&lt;/span&gt;: &lt;span class="pl-c1"&gt;true&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt;
    &lt;span class="pl-c1"&gt;label&lt;/span&gt;: &lt;span class="pl-s"&gt;"Total users over time across all tracked instances"&lt;/span&gt;
  &lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt;
  &lt;span class="pl-c1"&gt;marks&lt;/span&gt;: &lt;span class="pl-kos"&gt;[&lt;/span&gt;&lt;span class="pl-v"&gt;Plot&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;line&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s1"&gt;points&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-kos"&gt;{&lt;/span&gt; &lt;span class="pl-c1"&gt;x&lt;/span&gt;: &lt;span class="pl-s"&gt;"date"&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-c1"&gt;y&lt;/span&gt;: &lt;span class="pl-s"&gt;"users"&lt;/span&gt; &lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;]&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt;
  &lt;span class="pl-c1"&gt;marginLeft&lt;/span&gt;: &lt;span class="pl-c1"&gt;100&lt;/span&gt;
&lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;I added 100px of margin to the left of the chart to ensure there was space for the large (4,696,000 and up) labels on the y-axis.&lt;/p&gt;
&lt;h4&gt;A bunch of tricks combined&lt;/h4&gt;
&lt;p&gt;This project combines a whole bunch of tricks I've been pulling together over the past few years:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://simonwillison.net/2020/Oct/9/git-scraping/"&gt;Git scraping&lt;/a&gt; is the technique I use to gather the initial data, turning a static listing of instances into a record of changes over time&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://datasette.io/tools/git-history"&gt;git-history&lt;/a&gt; is my tool for turning a scraped Git history into a SQLite database that's easier to work with&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://s3-credentials.readthedocs.io/"&gt;s3-credentials&lt;/a&gt; makes working with S3 buckets - in particular creating credentials that are restricted to just one bucket - much less frustrating&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://simonwillison.net/2022/May/4/datasette-lite/"&gt;Datasette Lite&lt;/a&gt; means that once you have a SQLite database online somewhere you can explore it in your browser - without having to run my full server-side &lt;a href="https://datasette.io/"&gt;Datasette&lt;/a&gt; Python application on a machine somewhere&lt;/li&gt;
&lt;li&gt;And finally, combining the above means I can take advantage of &lt;a href="https://observablehq.com/"&gt;Observable notebooks&lt;/a&gt; for ad-hoc visualization of data that's hosted online, in this case as a static SQLite database file served from S3&lt;/li&gt;
&lt;/ul&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/github"&gt;github&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/observable"&gt;observable&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/github-actions"&gt;github-actions&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git-scraping"&gt;git-scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git-history"&gt;git-history&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/s3-credentials"&gt;s3-credentials&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette-lite"&gt;datasette-lite&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/mastodon"&gt;mastodon&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/cors"&gt;cors&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="github"/><category term="projects"/><category term="datasette"/><category term="observable"/><category term="github-actions"/><category term="git-scraping"/><category term="git-history"/><category term="s3-credentials"/><category term="datasette-lite"/><category term="mastodon"/><category term="cors"/></entry><entry><title>Measuring traffic during the Half Moon Bay Pumpkin Festival</title><link href="https://simonwillison.net/2022/Oct/19/measuring-traffic/#atom-tag" rel="alternate"/><published>2022-10-19T15:41:09+00:00</published><updated>2022-10-19T15:41:09+00:00</updated><id>https://simonwillison.net/2022/Oct/19/measuring-traffic/#atom-tag</id><summary type="html">
    &lt;p&gt;This weekend was the &lt;a href="https://pumpkinfest.miramarevents.com/" rel="nofollow"&gt;50th annual Half Moon Bay Pumpkin Festival&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;We live in El Granada, a tiny town an 8-minute drive from Half Moon Bay. There is a single road (the coastal Highway 1) between the two towns, and the festival is locally notorious for its impact on traffic.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://twitter.com/natbat" rel="nofollow"&gt;Natalie&lt;/a&gt; suggested that we measure the traffic and try and see the impact for ourselves!&lt;/p&gt;
&lt;p&gt;Here's the end result for Saturday. Read on for details on how we created it.&lt;/p&gt;
&lt;p&gt;&lt;img alt="A chart showing the two lines over time" src="https://static.simonwillison.net/static/2022/pumpkin-saturday-smooth.png" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;h4&gt;&lt;a id="user-content-collecting-the-data" class="anchor" aria-hidden="true" href="#collecting-the-data"&gt;&lt;span aria-hidden="true" class="octicon octicon-link"&gt;&lt;/span&gt;&lt;/a&gt;Collecting the data&lt;/h4&gt;
&lt;p&gt;I built a &lt;a href="https://simonwillison.net/2020/Oct/9/git-scraping/" rel="nofollow"&gt;git scraper&lt;/a&gt; to gather data from the Google Maps &lt;a href="https://developers.google.com/maps/documentation/directions/overview" rel="nofollow"&gt;Directions API&lt;/a&gt;. It turns out if you pass &lt;code&gt;departure_time=now&lt;/code&gt; to that API it returns the current estimated time in traffic as part of the response.&lt;/p&gt;
&lt;p&gt;I picked a location in Half Moon Bay and a location in El Granada, then constructed the following URL (pretty-printed):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;https://maps.googleapis.com/maps/api/directions/json?
  origin=GG49%2BCH,%20Half%20Moon%20Bay%20CA
  &amp;amp;destination=FH78%2BQJ,%20Half%20Moon%20Bay,%20CA
  &amp;amp;departure_time=now
  &amp;amp;key=$GOOGLE_MAPS_KEY
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The two locations here are defined using Google Plus codes. Here they are on Google Maps:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.google.com/maps/search/FH78%2BQJ+Half+Moon+Bay,+CA,+USA" rel="nofollow"&gt;FH78+QJ Half Moon Bay, CA, USA&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.google.com/maps/search/GG49%2BCH+El+Granada+CA,+USA" rel="nofollow"&gt;GG49+CH El Granada CA, USA&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I constructed the reverse of the URL too, to track traffic in the other direction. Then I rigged up a scheduled GitHub Actions workflow in &lt;a href="https://github.com/simonw/scrape-hmb-traffic"&gt;this repository&lt;/a&gt; to fetch this API data, pretty-print it with &lt;code&gt;jq&lt;/code&gt; and write it to the repository:&lt;/p&gt;
&lt;div class="highlight highlight-source-yaml"&gt;&lt;pre&gt;&lt;span class="pl-ent"&gt;name&lt;/span&gt;: &lt;span class="pl-s"&gt;Scrape traffic&lt;/span&gt;

&lt;span class="pl-ent"&gt;on&lt;/span&gt;:
  &lt;span class="pl-ent"&gt;push&lt;/span&gt;:
  &lt;span class="pl-ent"&gt;workflow_dispatch&lt;/span&gt;:
  &lt;span class="pl-ent"&gt;schedule&lt;/span&gt;:
  - &lt;span class="pl-ent"&gt;cron&lt;/span&gt;:  &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;*/5 * * * *&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;

&lt;span class="pl-ent"&gt;jobs&lt;/span&gt;:
  &lt;span class="pl-ent"&gt;shot-scraper&lt;/span&gt;:
    &lt;span class="pl-ent"&gt;runs-on&lt;/span&gt;: &lt;span class="pl-s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="pl-ent"&gt;steps&lt;/span&gt;:
    - &lt;span class="pl-ent"&gt;uses&lt;/span&gt;: &lt;span class="pl-s"&gt;actions/checkout@v2&lt;/span&gt;
    - &lt;span class="pl-ent"&gt;name&lt;/span&gt;: &lt;span class="pl-s"&gt;Scrape&lt;/span&gt;
      &lt;span class="pl-ent"&gt;env&lt;/span&gt;:
        &lt;span class="pl-ent"&gt;GOOGLE_MAPS_KEY&lt;/span&gt;: &lt;span class="pl-s"&gt;${{ secrets.GOOGLE_MAPS_KEY }}&lt;/span&gt;
      &lt;span class="pl-ent"&gt;run&lt;/span&gt;: &lt;span class="pl-s"&gt;|        &lt;/span&gt;
&lt;span class="pl-s"&gt;        curl "https://maps.googleapis.com/maps/api/directions/json?origin=GG49%2BCH,%20Half%20Moon%20Bay%20CA&amp;amp;destination=FH78%2BQJ,%20Half%20Moon%20Bay,%20California&amp;amp;departure_time=now&amp;amp;key=$GOOGLE_MAPS_KEY" | jq &amp;gt; one.json&lt;/span&gt;
&lt;span class="pl-s"&gt;        sleep 3&lt;/span&gt;
&lt;span class="pl-s"&gt;        curl "https://maps.googleapis.com/maps/api/directions/json?origin=FH78%2BQJ,%20Half%20Moon%20Bay%20CA&amp;amp;destination=GG49%2BCH,%20Half%20Moon%20Bay,%20California&amp;amp;departure_time=now&amp;amp;key=$GOOGLE_MAPS_KEY" | jq &amp;gt; two.json&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;/span&gt;    - &lt;span class="pl-ent"&gt;name&lt;/span&gt;: &lt;span class="pl-s"&gt;Commit and push&lt;/span&gt;
      &lt;span class="pl-ent"&gt;run&lt;/span&gt;: &lt;span class="pl-s"&gt;|-&lt;/span&gt;
&lt;span class="pl-s"&gt;        git config user.name "Automated"&lt;/span&gt;
&lt;span class="pl-s"&gt;        git config user.email "actions@users.noreply.github.com"&lt;/span&gt;
&lt;span class="pl-s"&gt;        git add -A&lt;/span&gt;
&lt;span class="pl-s"&gt;        timestamp=$(date -u)&lt;/span&gt;
&lt;span class="pl-s"&gt;        git commit -m "${timestamp}" || exit 0&lt;/span&gt;
&lt;span class="pl-s"&gt;        git pull --rebase&lt;/span&gt;
&lt;span class="pl-s"&gt;        git push&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;I'm using a GitHub Actions secret called &lt;code&gt;GOOGLE_MAPS_KEY&lt;/code&gt; to store the Google Maps API key.&lt;/p&gt;
&lt;p&gt;This workflow runs every 5 minutes (more-or-less - GitHub Actions doesn't necessarily stick to the schedule). It fetches the two JSON results and writes them to files called &lt;code&gt;one.json&lt;/code&gt; and &lt;code&gt;two.json&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;... and that was the initial setup for the project. This took me about fifteen minutes to put in place, because I've built systems like this so many times before. I launched it at about 10am on Saturday and left it to collect data.&lt;/p&gt;
&lt;h4&gt;&lt;a id="user-content-analyzing-the-data-and-drawing-some-charts" class="anchor" aria-hidden="true" href="#analyzing-the-data-and-drawing-some-charts"&gt;&lt;span aria-hidden="true" class="octicon octicon-link"&gt;&lt;/span&gt;&lt;/a&gt;Analyzing the data and drawing some charts&lt;/h4&gt;
&lt;p&gt;The trick with git scraping is that the data you care about ends up captured in &lt;a href="https://github.com/simonw/scrape-hmb-traffic/commits/main"&gt;the git commit log&lt;/a&gt;. The challenge is how to extract that back out again and turn it into something useful.&lt;/p&gt;
&lt;p&gt;My &lt;a href="https://simonwillison.net/2021/Dec/7/git-history/" rel="nofollow"&gt;git-history tool&lt;/a&gt; is designed to solve this. It's a command-line utility which can iterate through every version of a file stored in a git repository, extracting information from that file out into a SQLite database table and creating a new row for every commit.&lt;/p&gt;
&lt;p&gt;Normally I run it against CSV or JSON files containing an array of rows - effectively tabular data already, where I just want to record what has changed in between commits.&lt;/p&gt;
&lt;p&gt;For this project, I was storing the raw JSON output by the Google Maps API. I didn't care about most of the information in there: I really just wanted the &lt;code&gt;duration_in_traffic&lt;/code&gt; value.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;git-history&lt;/code&gt; can accept a snippet of Python code that will be run against each stored copy of a file. The snippet should return a list of JSON objects (as Python dictionaries) which the rest of the tool can then use to figure out what has changed.&lt;/p&gt;
&lt;p&gt;To cut a long story short, here's the incantation that worked:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;git-history file hmb.db one.json \
--convert '
try:
    duration_in_traffic = json.loads(content)["routes"][0]["legs"][0]["duration_in_traffic"]["value"]
    return [{"id": "one", "duration_in_traffic": duration_in_traffic}]
except Exception as ex:
    return []
' \
  --full-versions \
  --id id
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;git-history file&lt;/code&gt; command is used to load the history for a specific file - in this case it's the file &lt;code&gt;one.json&lt;/code&gt;, which will be loaded into a new SQLite database file called &lt;code&gt;hmb.db&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;--convert&lt;/code&gt; code uses &lt;code&gt;json.loads(content)&lt;/code&gt; to load the JSON for the current file version, then pulls out the &lt;code&gt;["routes"][0]["legs"][0]["duration_in_traffic"]["value"]&lt;/code&gt; nested value from it.&lt;/p&gt;
&lt;p&gt;If that's missing (e.g. in an earlier commit, when I hadn't yet added the &lt;code&gt;departure_time=now&lt;/code&gt; parameter to the URL) an exception will be caught and the function will return an empty list.&lt;/p&gt;
&lt;p&gt;If the &lt;code&gt;duration_in_traffic&lt;/code&gt; value is present, the function returns the following:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;[{"id": "one", "duration_in_traffic": duration_in_traffic}]
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;code&gt;git-history&lt;/code&gt; likes lists of dictionaries. It's usually being run against files that contain many different rows, where the &lt;code&gt;id&lt;/code&gt; column can be used to de-dupe rows across commits and spot what has changed.&lt;/p&gt;
&lt;p&gt;In this case, each file only has a single interesting value.&lt;/p&gt;
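&lt;p&gt;Outside of &lt;code&gt;git-history&lt;/code&gt; that conversion logic can be exercised as a plain Python function - here's a sketch using a made-up payload shaped like the Directions API response:&lt;/p&gt;

```python
import json


def convert(content):
    # Mirror of the --convert snippet: pull the nested duration_in_traffic
    # value out of the response, returning an empty list for file versions
    # that don't have it (e.g. earlier commits without departure_time=now).
    try:
        duration_in_traffic = json.loads(content)["routes"][0]["legs"][0]["duration_in_traffic"]["value"]
        return [{"id": "one", "duration_in_traffic": duration_in_traffic}]
    except Exception:
        return []


# A made-up payload with the same nested shape as the real API response:
sample = json.dumps({
    "routes": [{"legs": [{"duration_in_traffic": {"value": 1110}}]}]
})
print(convert(sample))            # [{'id': 'one', 'duration_in_traffic': 1110}]
print(convert('{"routes": []}'))  # [] - the value is missing, exception caught
```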
&lt;p&gt;Two more options are used here:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;--full-versions&lt;/code&gt; - tells &lt;code&gt;git-history&lt;/code&gt; to store all of the columns, not just columns that have changed since the last run. The default behaviour here is to store a &lt;code&gt;null&lt;/code&gt; if a value has not changed in order to save space, but our data is tiny here so we don't need any clever optimizations.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;--id id&lt;/code&gt; specifies the ID column that should be used to de-dupe changes. Again, not really important for this tiny project.&lt;/li&gt;
&lt;/ul&gt;
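&lt;p&gt;To illustrate the de-duping idea, here's a hypothetical sketch (not &lt;code&gt;git-history&lt;/code&gt;'s actual implementation): a new version is only recorded for an &lt;code&gt;id&lt;/code&gt; when its values differ from the previously seen version:&lt;/p&gt;

```python
def versions(snapshots):
    """Record a new version of an item only when its values change.

    snapshots: one list of dicts per commit, each dict carrying an "id" key.
    Illustrative sketch of the de-duping idea only.
    """
    latest = {}  # id -> last seen values (excluding the id itself)
    out = []
    for commit, items in enumerate(snapshots):
        for item in items:
            values = {k: v for k, v in item.items() if k != "id"}
            if latest.get(item["id"]) != values:
                out.append({"_commit": commit, **item})
                latest[item["id"]] = values
    return out


snapshots = [
    [{"id": "one", "duration_in_traffic": 1110}],
    [{"id": "one", "duration_in_traffic": 1110}],  # unchanged - no new version
    [{"id": "one", "duration_in_traffic": 1016}],  # changed - new version
]
print(versions(snapshots))
```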
&lt;p&gt;After running the above command, the resulting schema includes these tables:&lt;/p&gt;
&lt;div class="highlight highlight-source-sql"&gt;&lt;pre&gt;CREATE TABLE [commits] (
   [id] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt; &lt;span class="pl-k"&gt;PRIMARY KEY&lt;/span&gt;,
   [namespace] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt; &lt;span class="pl-k"&gt;REFERENCES&lt;/span&gt; [namespaces]([id]),
   [hash] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt;,
   [commit_at] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt;
);
CREATE TABLE [item_version] (
   [_id] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt; &lt;span class="pl-k"&gt;PRIMARY KEY&lt;/span&gt;,
   [_item] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt; &lt;span class="pl-k"&gt;REFERENCES&lt;/span&gt; [item]([_id]),
   [_version] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt;,
   [_commit] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt; &lt;span class="pl-k"&gt;REFERENCES&lt;/span&gt; [commits]([id]),
   [id] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt;,
   [duration_in_traffic] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt;
);&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The &lt;code&gt;commits&lt;/code&gt; table includes the date of the commit - &lt;code&gt;commit_at&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;item_version&lt;/code&gt; table has that &lt;code&gt;duration_in_traffic&lt;/code&gt; value.&lt;/p&gt;
&lt;p&gt;So... to get back the duration in traffic at different times of day I can run this SQL query to join those two tables together:&lt;/p&gt;
&lt;div class="highlight highlight-source-sql"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;select&lt;/span&gt;
    &lt;span class="pl-c1"&gt;commits&lt;/span&gt;.&lt;span class="pl-c1"&gt;commit_at&lt;/span&gt;,
    duration_in_traffic
&lt;span class="pl-k"&gt;from&lt;/span&gt;
    item_version
&lt;span class="pl-k"&gt;join&lt;/span&gt;
    commits &lt;span class="pl-k"&gt;on&lt;/span&gt; &lt;span class="pl-c1"&gt;item_version&lt;/span&gt;.&lt;span class="pl-c1"&gt;_commit&lt;/span&gt; &lt;span class="pl-k"&gt;=&lt;/span&gt; &lt;span class="pl-c1"&gt;commits&lt;/span&gt;.&lt;span class="pl-c1"&gt;id&lt;/span&gt;
&lt;span class="pl-k"&gt;order by&lt;/span&gt;
    &lt;span class="pl-c1"&gt;commits&lt;/span&gt;.&lt;span class="pl-c1"&gt;commit_at&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;That query returns data that looks like this:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;commit_at&lt;/th&gt;
&lt;th&gt;duration_in_traffic&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;2022-10-15T17:09:06+00:00&lt;/td&gt;
&lt;td&gt;1110&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2022-10-15T17:17:38+00:00&lt;/td&gt;
&lt;td&gt;1016&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2022-10-15T17:30:06+00:00&lt;/td&gt;
&lt;td&gt;1391&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;A couple of problems here. First, the &lt;code&gt;commit_at&lt;/code&gt; column is in UTC, not local time. Second, &lt;code&gt;duration_in_traffic&lt;/code&gt; is in seconds, which aren't particularly easy to read.&lt;/p&gt;
&lt;p&gt;Here's a SQLite fix for these two issues:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;select
    time(datetime(commits.commit_at, '-7 hours')) as t,
    duration_in_traffic / 60 as mins_in_traffic
from
    item_version
join
    commits on item_version._commit = commits.id
order by
    commits.commit_at
&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;t&lt;/th&gt;
&lt;th&gt;mins_in_traffic&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;10:09:06&lt;/td&gt;
&lt;td&gt;18&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10:17:38&lt;/td&gt;
&lt;td&gt;16&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10:30:06&lt;/td&gt;
&lt;td&gt;23&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;&lt;code&gt;datetime(commits.commit_at, '-7 hours')&lt;/code&gt; parses the UTC string as a datetime, then subtracts 7 hours from it to convert from UTC to local California time.&lt;/p&gt;
&lt;p&gt;I wrap that in &lt;code&gt;time()&lt;/code&gt; here because for the chart I want to render I know everything will be on the same day.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;mins_in_traffic&lt;/code&gt; now shows minutes, not seconds.&lt;/p&gt;
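&lt;p&gt;Those two conversions are easy to sanity-check using Python's built-in &lt;code&gt;sqlite3&lt;/code&gt; module:&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite's datetime() parses the ISO 8601 timestamp (including the +00:00
# suffix) and applies the '-7 hours' modifier; time() keeps just the
# time-of-day portion of the result.
(t,) = conn.execute(
    "select time(datetime('2022-10-15T17:09:06+00:00', '-7 hours'))"
).fetchone()
print(t)           # 10:09:06
print(1110 // 60)  # 18 - integer division, matching SQLite's / on integers
```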
&lt;p&gt;We now have enough data to render a chart!&lt;/p&gt;
&lt;p&gt;But... we only have one of the two directions of traffic here. To process the numbers from &lt;code&gt;two.json&lt;/code&gt; as well I ran this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;git-history file hmb.db two.json \
--convert '
try:
    duration_in_traffic = json.loads(content)["routes"][0]["legs"][0]["duration_in_traffic"]["value"]
    return [{"id": "two", "duration_in_traffic": duration_in_traffic}]
except Exception as ex:
    return []
' \
  --full-versions \
  --id id --namespace item2
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This is &lt;em&gt;almost&lt;/em&gt; the same as the previous command. It's running against &lt;code&gt;two.json&lt;/code&gt; instead of &lt;code&gt;one.json&lt;/code&gt;, and it's using the &lt;code&gt;--namespace item2&lt;/code&gt; option.&lt;/p&gt;
&lt;p&gt;This causes it to populate a new table called &lt;code&gt;item2_version&lt;/code&gt; instead of &lt;code&gt;item_version&lt;/code&gt;, which is a cheap trick to avoid having to figure out how to load both files into the same table.&lt;/p&gt;
&lt;h2&gt;&lt;a id="user-content-two-lines-on-one-chart" class="anchor" aria-hidden="true" href="#two-lines-on-one-chart"&gt;&lt;span aria-hidden="true" class="octicon octicon-link"&gt;&lt;/span&gt;&lt;/a&gt;Two lines on one chart&lt;/h2&gt;
&lt;p&gt;I rendered an initial single line chart using &lt;a href="https://datasette.io/plugins/datasette-vega" rel="nofollow"&gt;datasette-vega&lt;/a&gt;, but Natalie suggested that putting lines on the same chart for the two directions of traffic would be more interesting.&lt;/p&gt;
&lt;p&gt;Since I now had one table for each direction of traffic (&lt;code&gt;item_version&lt;/code&gt; and &lt;code&gt;item2_version&lt;/code&gt;) I decided to combine those into a single table, suitable for pasting into Google Sheets.&lt;/p&gt;
&lt;p&gt;Here's the SQL I came up with to do that:&lt;/p&gt;
&lt;div class="highlight highlight-source-sql"&gt;&lt;pre&gt;with item1 &lt;span class="pl-k"&gt;as&lt;/span&gt; (
  &lt;span class="pl-k"&gt;select&lt;/span&gt;
    &lt;span class="pl-k"&gt;time&lt;/span&gt;(datetime(&lt;span class="pl-c1"&gt;commits&lt;/span&gt;.&lt;span class="pl-c1"&gt;commit_at&lt;/span&gt;, &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;-7 hours&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;)) &lt;span class="pl-k"&gt;as&lt;/span&gt; t,
    duration_in_traffic &lt;span class="pl-k"&gt;/&lt;/span&gt; &lt;span class="pl-c1"&gt;60&lt;/span&gt; &lt;span class="pl-k"&gt;as&lt;/span&gt; mins_in_traffic
  &lt;span class="pl-k"&gt;from&lt;/span&gt;
    item_version
    &lt;span class="pl-k"&gt;join&lt;/span&gt; commits &lt;span class="pl-k"&gt;on&lt;/span&gt; &lt;span class="pl-c1"&gt;item_version&lt;/span&gt;.&lt;span class="pl-c1"&gt;_commit&lt;/span&gt; &lt;span class="pl-k"&gt;=&lt;/span&gt; &lt;span class="pl-c1"&gt;commits&lt;/span&gt;.&lt;span class="pl-c1"&gt;id&lt;/span&gt;
  &lt;span class="pl-k"&gt;order by&lt;/span&gt;
    &lt;span class="pl-c1"&gt;commits&lt;/span&gt;.&lt;span class="pl-c1"&gt;commit_at&lt;/span&gt;
),
item2 &lt;span class="pl-k"&gt;as&lt;/span&gt; (
  &lt;span class="pl-k"&gt;select&lt;/span&gt;
    &lt;span class="pl-k"&gt;time&lt;/span&gt;(datetime(&lt;span class="pl-c1"&gt;commits&lt;/span&gt;.&lt;span class="pl-c1"&gt;commit_at&lt;/span&gt;, &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;-7 hours&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;)) &lt;span class="pl-k"&gt;as&lt;/span&gt; t,
    duration_in_traffic &lt;span class="pl-k"&gt;/&lt;/span&gt; &lt;span class="pl-c1"&gt;60&lt;/span&gt; &lt;span class="pl-k"&gt;as&lt;/span&gt; mins_in_traffic
  &lt;span class="pl-k"&gt;from&lt;/span&gt;
    item2_version
    &lt;span class="pl-k"&gt;join&lt;/span&gt; commits &lt;span class="pl-k"&gt;on&lt;/span&gt; &lt;span class="pl-c1"&gt;item2_version&lt;/span&gt;.&lt;span class="pl-c1"&gt;_commit&lt;/span&gt; &lt;span class="pl-k"&gt;=&lt;/span&gt; &lt;span class="pl-c1"&gt;commits&lt;/span&gt;.&lt;span class="pl-c1"&gt;id&lt;/span&gt;
  &lt;span class="pl-k"&gt;order by&lt;/span&gt;
    &lt;span class="pl-c1"&gt;commits&lt;/span&gt;.&lt;span class="pl-c1"&gt;commit_at&lt;/span&gt;
)
&lt;span class="pl-k"&gt;select&lt;/span&gt;
  item1.&lt;span class="pl-k"&gt;*&lt;/span&gt;,
  &lt;span class="pl-c1"&gt;item2&lt;/span&gt;.&lt;span class="pl-c1"&gt;mins_in_traffic&lt;/span&gt; &lt;span class="pl-k"&gt;as&lt;/span&gt; mins_in_traffic_other_way
&lt;span class="pl-k"&gt;from&lt;/span&gt;
  item1
  &lt;span class="pl-k"&gt;join&lt;/span&gt; item2 &lt;span class="pl-k"&gt;on&lt;/span&gt; &lt;span class="pl-c1"&gt;item1&lt;/span&gt;.&lt;span class="pl-c1"&gt;t&lt;/span&gt; &lt;span class="pl-k"&gt;=&lt;/span&gt; &lt;span class="pl-c1"&gt;item2&lt;/span&gt;.&lt;span class="pl-c1"&gt;t&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This uses two CTEs (Common Table Expressions - the &lt;code&gt;with X as&lt;/code&gt; pieces) using the pattern I explained earlier - now called &lt;code&gt;item1&lt;/code&gt; and &lt;code&gt;item2&lt;/code&gt;. Having defined these two CTEs, I can join them together on the &lt;code&gt;t&lt;/code&gt; column, which is the time of day.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://lite.datasette.io/?url=https://github.com/simonw/scrape-hmb-traffic/blob/main/hmb.db?&amp;amp;install=datasette-copyable#/hmb?sql=with+item1+as+(%0A++select%0A++++time(datetime(commits.commit_at%2C+'-7+hours'))+as+t%2C%0A++++duration_in_traffic+%2F+60+as+mins_in_traffic%0A++from%0A++++item_version%0A++++join+commits+on+item_version._commit+%3D+commits.id%0A++order+by%0A++++commits.commit_at%0A)%2C%0Aitem2+as+(%0A++select%0A++++time(datetime(commits.commit_at%2C+'-7+hours'))+as+t%2C%0A++++duration_in_traffic+%2F+60+as+mins_in_traffic%0A++from%0A++++item2_version%0A++++join+commits+on+item2_version._commit+%3D+commits.id%0A++order+by%0A++++commits.commit_at%0A)%0Aselect%0A++item1.*%2C%0A++item2.mins_in_traffic+as+mins_in_traffic_other_way%0Afrom%0A++item1%0A++join+item2+on+item1.t+%3D+item2.t" rel="nofollow"&gt;Try running this query&lt;/a&gt; in Datasette Lite.&lt;/p&gt;
&lt;p&gt;Here's the output of that query for Saturday (10am to 8pm):&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;t&lt;/th&gt;
&lt;th&gt;mins_in_traffic&lt;/th&gt;
&lt;th&gt;mins_in_traffic_other_way&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;10:09:06&lt;/td&gt;
&lt;td&gt;18&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10:17:38&lt;/td&gt;
&lt;td&gt;16&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10:30:06&lt;/td&gt;
&lt;td&gt;23&lt;/td&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10:47:38&lt;/td&gt;
&lt;td&gt;23&lt;/td&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10:57:37&lt;/td&gt;
&lt;td&gt;23&lt;/td&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;11:08:20&lt;/td&gt;
&lt;td&gt;26&lt;/td&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;11:22:27&lt;/td&gt;
&lt;td&gt;26&lt;/td&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;11:38:42&lt;/td&gt;
&lt;td&gt;26&lt;/td&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;11:52:35&lt;/td&gt;
&lt;td&gt;25&lt;/td&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;12:03:23&lt;/td&gt;
&lt;td&gt;24&lt;/td&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;12:15:16&lt;/td&gt;
&lt;td&gt;21&lt;/td&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;12:27:51&lt;/td&gt;
&lt;td&gt;22&lt;/td&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;12:37:48&lt;/td&gt;
&lt;td&gt;22&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;12:46:41&lt;/td&gt;
&lt;td&gt;21&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;12:55:03&lt;/td&gt;
&lt;td&gt;21&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;13:05:10&lt;/td&gt;
&lt;td&gt;21&lt;/td&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;13:17:57&lt;/td&gt;
&lt;td&gt;21&lt;/td&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;13:32:55&lt;/td&gt;
&lt;td&gt;21&lt;/td&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;13:44:53&lt;/td&gt;
&lt;td&gt;19&lt;/td&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;13:55:22&lt;/td&gt;
&lt;td&gt;21&lt;/td&gt;
&lt;td&gt;14&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;14:05:21&lt;/td&gt;
&lt;td&gt;22&lt;/td&gt;
&lt;td&gt;14&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;14:17:48&lt;/td&gt;
&lt;td&gt;23&lt;/td&gt;
&lt;td&gt;15&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;14:31:04&lt;/td&gt;
&lt;td&gt;22&lt;/td&gt;
&lt;td&gt;15&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;14:41:59&lt;/td&gt;
&lt;td&gt;21&lt;/td&gt;
&lt;td&gt;14&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;14:51:48&lt;/td&gt;
&lt;td&gt;18&lt;/td&gt;
&lt;td&gt;14&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;15:00:09&lt;/td&gt;
&lt;td&gt;18&lt;/td&gt;
&lt;td&gt;15&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;15:11:17&lt;/td&gt;
&lt;td&gt;15&lt;/td&gt;
&lt;td&gt;14&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;15:25:48&lt;/td&gt;
&lt;td&gt;14&lt;/td&gt;
&lt;td&gt;15&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;15:39:41&lt;/td&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;td&gt;14&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;15:51:11&lt;/td&gt;
&lt;td&gt;14&lt;/td&gt;
&lt;td&gt;15&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;15:59:34&lt;/td&gt;
&lt;td&gt;15&lt;/td&gt;
&lt;td&gt;15&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;16:10:50&lt;/td&gt;
&lt;td&gt;19&lt;/td&gt;
&lt;td&gt;16&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;16:25:43&lt;/td&gt;
&lt;td&gt;19&lt;/td&gt;
&lt;td&gt;18&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;16:53:06&lt;/td&gt;
&lt;td&gt;19&lt;/td&gt;
&lt;td&gt;18&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;17:11:34&lt;/td&gt;
&lt;td&gt;18&lt;/td&gt;
&lt;td&gt;16&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;17:40:29&lt;/td&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;18:12:07&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;18:58:17&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;20:05:13&lt;/td&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;I copied and pasted this table into Google Sheets and messed around with the charting tools there until I had the following chart:&lt;/p&gt;
&lt;p&gt;&lt;img alt="A chart showing the two lines over time" src="https://static.simonwillison.net/static/2022/pumpkin-saturday-smooth.png" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;Here's the same chart for Sunday:&lt;/p&gt;
&lt;p&gt;&lt;img alt="This chart shows the same thing but for Sunday" src="https://static.simonwillison.net/static/2022/pumpkin-sunday-smooth.png" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;Our &lt;a href="https://docs.google.com/spreadsheets/d/1JOimtkugZBF_YQxqn0Gn6NiIhNz-OMH2rpOZtmECAY4/edit#gid=0" rel="nofollow"&gt;Google Sheet is here&lt;/a&gt; - the two days have two separate tabs within the sheet.&lt;/p&gt;
&lt;h4&gt;&lt;a id="user-content-building-the-sqlite-database-in-github-actions" class="anchor" aria-hidden="true" href="#building-the-sqlite-database-in-github-actions"&gt;&lt;span aria-hidden="true" class="octicon octicon-link"&gt;&lt;/span&gt;&lt;/a&gt;Building the SQLite database in GitHub Actions&lt;/h4&gt;
&lt;p&gt;I did most of the development work for this project on my laptop, running &lt;code&gt;git-history&lt;/code&gt; and &lt;code&gt;datasette&lt;/code&gt; locally for speed of iteration.&lt;/p&gt;
&lt;p&gt;Once I had everything working, I decided to automate the process of building the SQLite database as well.&lt;/p&gt;
&lt;p&gt;I made the following changes to my GitHub Actions workflow:&lt;/p&gt;
&lt;div class="highlight highlight-source-yaml"&gt;&lt;pre&gt;&lt;span class="pl-ent"&gt;jobs&lt;/span&gt;:
  &lt;span class="pl-ent"&gt;shot-scraper&lt;/span&gt;:
    &lt;span class="pl-ent"&gt;runs-on&lt;/span&gt;: &lt;span class="pl-s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="pl-ent"&gt;steps&lt;/span&gt;:
    - &lt;span class="pl-ent"&gt;uses&lt;/span&gt;: &lt;span class="pl-s"&gt;actions/checkout@v3&lt;/span&gt;
      &lt;span class="pl-ent"&gt;with&lt;/span&gt;:
        &lt;span class="pl-ent"&gt;fetch-depth&lt;/span&gt;: &lt;span class="pl-c1"&gt;0&lt;/span&gt; &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; Needed by git-history&lt;/span&gt;
    - &lt;span class="pl-ent"&gt;name&lt;/span&gt;: &lt;span class="pl-s"&gt;Set up Python 3.10&lt;/span&gt;
      &lt;span class="pl-ent"&gt;uses&lt;/span&gt;: &lt;span class="pl-s"&gt;actions/setup-python@v4&lt;/span&gt;
      &lt;span class="pl-ent"&gt;with&lt;/span&gt;:
        &lt;span class="pl-ent"&gt;python-version&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;3.10&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;
        &lt;span class="pl-ent"&gt;cache&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;pip&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;
    - &lt;span class="pl-ent"&gt;run&lt;/span&gt;: &lt;span class="pl-s"&gt;pip install -r requirements.txt&lt;/span&gt;
    - &lt;span class="pl-ent"&gt;name&lt;/span&gt;: &lt;span class="pl-s"&gt;Scrape&lt;/span&gt;
      &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; Same as before...&lt;/span&gt;
      &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; env:&lt;/span&gt;
      &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; run&lt;/span&gt;
    - &lt;span class="pl-ent"&gt;name&lt;/span&gt;: &lt;span class="pl-s"&gt;Build SQLite database&lt;/span&gt;
      &lt;span class="pl-ent"&gt;run&lt;/span&gt;: &lt;span class="pl-s"&gt;|&lt;/span&gt;
&lt;span class="pl-s"&gt;        rm -f hmb.db # Recreate from scratch each time&lt;/span&gt;
&lt;span class="pl-s"&gt;        git-history file hmb.db one.json \&lt;/span&gt;
&lt;span class="pl-s"&gt;        --convert '&lt;/span&gt;
&lt;span class="pl-s"&gt;        try:&lt;/span&gt;
&lt;span class="pl-s"&gt;            duration_in_traffic = json.loads(content)["routes"][0]["legs"][0]["duration_in_traffic"]["value"]&lt;/span&gt;
&lt;span class="pl-s"&gt;            return [{"id": "one", "duration_in_traffic": duration_in_traffic}]&lt;/span&gt;
&lt;span class="pl-s"&gt;        except Exception as ex:&lt;/span&gt;
&lt;span class="pl-s"&gt;            return []&lt;/span&gt;
&lt;span class="pl-s"&gt;        ' \&lt;/span&gt;
&lt;span class="pl-s"&gt;          --full-versions \&lt;/span&gt;
&lt;span class="pl-s"&gt;          --id id&lt;/span&gt;
&lt;span class="pl-s"&gt;        git-history file hmb.db two.json \&lt;/span&gt;
&lt;span class="pl-s"&gt;        --convert '&lt;/span&gt;
&lt;span class="pl-s"&gt;        try:&lt;/span&gt;
&lt;span class="pl-s"&gt;            duration_in_traffic = json.loads(content)["routes"][0]["legs"][0]["duration_in_traffic"]["value"]&lt;/span&gt;
&lt;span class="pl-s"&gt;            return [{"id": "two", "duration_in_traffic": duration_in_traffic}]&lt;/span&gt;
&lt;span class="pl-s"&gt;        except Exception as ex:&lt;/span&gt;
&lt;span class="pl-s"&gt;            return []&lt;/span&gt;
&lt;span class="pl-s"&gt;        ' \&lt;/span&gt;
&lt;span class="pl-s"&gt;          --full-versions \&lt;/span&gt;
&lt;span class="pl-s"&gt;          --id id --namespace item2&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;/span&gt;    - &lt;span class="pl-ent"&gt;name&lt;/span&gt;: &lt;span class="pl-s"&gt;Commit and push&lt;/span&gt;
      &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; Same as before...&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;I also added a &lt;code&gt;requirements.txt&lt;/code&gt; file containing just &lt;code&gt;git-history&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Note how the &lt;code&gt;actions/checkout@v3&lt;/code&gt; step now has &lt;code&gt;fetch-depth: 0&lt;/code&gt; - this is necessary because &lt;code&gt;git-history&lt;/code&gt; needs to loop through the entire repository history, but &lt;code&gt;actions/checkout@v3&lt;/code&gt; defaults to only fetching the most recent commit.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;setup-python&lt;/code&gt; step uses &lt;code&gt;cache: "pip"&lt;/code&gt;, which causes it to cache installed dependencies from &lt;code&gt;requirements.txt&lt;/code&gt; between runs.&lt;/p&gt;
&lt;p&gt;Because that big &lt;code&gt;git-history&lt;/code&gt; step creates a &lt;code&gt;hmb.db&lt;/code&gt; SQLite database, the "Commit and push" step now includes that file in the push to the repository. So every time the workflow runs a new binary SQLite database file is committed.&lt;/p&gt;
&lt;p&gt;Normally I wouldn't do this, because Git isn't a great place to keep constantly changing binary files... but in this case the SQLite database is only 100KB and won't continue to be updated beyond the end of the pumpkin festival.&lt;/p&gt;
&lt;p&gt;End result: &lt;a href="https://github.com/simonw/scrape-hmb-traffic/blob/main/hmb.db"&gt;hmb.db is available&lt;/a&gt; in the GitHub repository.&lt;/p&gt;
&lt;h4&gt;&lt;a id="user-content-querying-it-using-datasette-lite" class="anchor" aria-hidden="true" href="#querying-it-using-datasette-lite"&gt;&lt;span aria-hidden="true" class="octicon octicon-link"&gt;&lt;/span&gt;&lt;/a&gt;Querying it using Datasette Lite&lt;/h4&gt;
&lt;p&gt;&lt;a href="https://simonwillison.net/2022/May/4/datasette-lite/" rel="nofollow"&gt;Datasette Lite&lt;/a&gt; is my repackaged version of my Datasette server-side Python application which runs entirely in the user's browser, using WebAssembly.&lt;/p&gt;
&lt;p&gt;A neat feature of Datasette Lite is that you can pass it the URL to a SQLite database file and it will load that database in your browser and let you run queries against it.&lt;/p&gt;
&lt;p&gt;These database files need to be served with CORS headers. Every file served by GitHub includes these headers!&lt;/p&gt;
&lt;p&gt;Which means the following URL can be used to open up the latest &lt;code&gt;hmb.db&lt;/code&gt; file directly in Datasette in the browser:&lt;/p&gt;
&lt;p&gt;&lt;a href="https://lite.datasette.io/?url=https://github.com/simonw/scrape-hmb-traffic/blob/main/hmb.db" rel="nofollow"&gt;https://lite.datasette.io/?url=https://github.com/simonw/scrape-hmb-traffic/blob/main/hmb.db&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;(This takes advantage of a &lt;a href="https://simonwillison.net/2022/Sep/16/weeknotes/" rel="nofollow"&gt;feature I added&lt;/a&gt; to Datasette Lite where it knows how to convert the URL to the HTML page about a file on GitHub to the URL to the raw file itself.)&lt;/p&gt;
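&lt;p&gt;That rewrite follows GitHub's standard URL convention. Here's an illustrative sketch of the idea (the function name is mine, not Datasette Lite's actual code):&lt;/p&gt;

```python
def github_blob_to_raw(url):
    # github.com/{owner}/{repo}/blob/{ref}/{path}
    #   becomes raw.githubusercontent.com/{owner}/{repo}/{ref}/{path}
    prefix = "https://github.com/"
    if url.startswith(prefix) and "/blob/" in url:
        owner_repo, path = url[len(prefix):].split("/blob/", 1)
        return f"https://raw.githubusercontent.com/{owner_repo}/{path}"
    return url  # already a raw (or non-GitHub) URL - leave it untouched


print(github_blob_to_raw(
    "https://github.com/simonw/scrape-hmb-traffic/blob/main/hmb.db"
))
# https://raw.githubusercontent.com/simonw/scrape-hmb-traffic/main/hmb.db
```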
&lt;p&gt;URLs to SQL queries work too. This URL will open Datasette Lite, load the SQLite database AND execute the query I constructed above:&lt;/p&gt;
&lt;p&gt;&lt;a href="https://lite.datasette.io/?url=https://github.com/simonw/scrape-hmb-traffic/blob/main/hmb.db#/hmb?sql=with+item1+as+(%0A++select%0A++++time(datetime(commits.commit_at%2C+'-7+hours'))+as+t%2C%0A++++duration_in_traffic+%2F+60+as+mins_in_traffic%0A++from%0A++++item_version%0A++++join+commits+on+item_version._commit+%3D+commits.id%0A++order+by%0A++++commits.commit_at%0A)%2C%0Aitem2+as+(%0A++select%0A++++time(datetime(commits.commit_at%2C+'-7+hours'))+as+t%2C%0A++++duration_in_traffic+%2F+60+as+mins_in_traffic%0A++from%0A++++item2_version%0A++++join+commits+on+item2_version._commit+%3D+commits.id%0A++order+by%0A++++commits.commit_at%0A)%0Aselect%0A++item1.*%2C%0A++item2.mins_in_traffic+as+mins_in_traffic_other_way%0Afrom%0A++item1%0A++join+item2+on+item1.t+%3D+item2.t" rel="nofollow"&gt;https://lite.datasette.io/?url=https://github.com/simonw/scrape-hmb-traffic/blob/main/hmb.db#/hmb?sql=with+item1+as+(%0A++select%0A++++time(datetime(commits.commit_at%2C+'-7+hours'))+as+t%2C%0A++++duration_in_traffic+%2F+60+as+mins_in_traffic%0A++from%0A++++item_version%0A++++join+commits+on+item_version._commit+%3D+commits.id%0A++order+by%0A++++commits.commit_at%0A)%2C%0Aitem2+as+(%0A++select%0A++++time(datetime(commits.commit_at%2C+'-7+hours'))+as+t%2C%0A++++duration_in_traffic+%2F+60+as+mins_in_traffic%0A++from%0A++++item2_version%0A++++join+commits+on+item2_version._commit+%3D+commits.id%0A++order+by%0A++++commits.commit_at%0A)%0Aselect%0A++item1.*%2C%0A++item2.mins_in_traffic+as+mins_in_traffic_other_way%0Afrom%0A++item1%0A++join+item2+on+item1.t+%3D+item2.t&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;And finally... Datasette Lite &lt;a href="https://simonwillison.net/2022/Aug/17/datasette-lite-plugins/" rel="nofollow"&gt;has plugin support&lt;/a&gt;. Adding &lt;code&gt;&amp;amp;install=datasette-copyable&lt;/code&gt; to the URL adds the &lt;a href="https://datasette.io/plugins/datasette-copyable" rel="nofollow"&gt;datasette-copyable&lt;/a&gt; plugin, which adds a page for easily copying out the query results as TSV (useful for pasting into a spreadsheet) or even as GitHub-flavored Markdown (which I used to add results to this blog post).&lt;/p&gt;
&lt;p&gt;Here's &lt;a href="https://lite.datasette.io/?url=https://github.com/simonw/scrape-hmb-traffic/blob/main/hmb.db&amp;amp;install=datasette-copyable#/hmb.copyable?sql=with+item1+as+%28%0A++select%0A++++time%28datetime%28commits.commit_at%2C+%27-7+hours%27%29%29+as+t%2C%0A++++duration_in_traffic+%2F+60+as+mins_in_traffic%0A++from%0A++++item_version%0A++++join+commits+on+item_version._commit+%3D+commits.id%0A++order+by%0A++++commits.commit_at%0A%29%2C%0Aitem2+as+%28%0A++select%0A++++time%28datetime%28commits.commit_at%2C+%27-7+hours%27%29%29+as+t%2C%0A++++duration_in_traffic+%2F+60+as+mins_in_traffic%0A++from%0A++++item2_version%0A++++join+commits+on+item2_version._commit+%3D+commits.id%0A++order+by%0A++++commits.commit_at%0A%29%0Aselect%0A++item1.%2A%2C%0A++item2.mins_in_traffic+as+mins_in_traffic_other_way%0Afrom%0A++item1%0A++join+item2+on+item1.t+%3D+item2.t&amp;amp;_table_format=github" rel="nofollow"&gt;an example&lt;/a&gt; of that plugin in action.&lt;/p&gt;
&lt;p&gt;This was a fun little project that brought together a whole bunch of things I've been working on over the past few years. Here's some more of my writing on these different techniques and tools:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://simonwillison.net/series/git-scraping/" rel="nofollow"&gt;Git scraping&lt;/a&gt; is the key technique I'm using here to collect the data&lt;/li&gt;
&lt;li&gt;I've written a lot about &lt;a href="https://simonwillison.net/tags/githubactions/" rel="nofollow"&gt;GitHub Actions&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;These are my notes about &lt;a href="https://simonwillison.net/tags/githistory/" rel="nofollow"&gt;git-history&lt;/a&gt;, the tool I used to turn a commit history into a SQLite database&lt;/li&gt;
&lt;li&gt;Here's my series of posts about &lt;a href="https://simonwillison.net/series/datasette-lite/" rel="nofollow"&gt;Datasette Lite&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/data-journalism"&gt;data-journalism&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/natalie-downe"&gt;natalie-downe&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sqlite"&gt;sqlite&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git-scraping"&gt;git-scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git-history"&gt;git-history&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette-lite"&gt;datasette-lite&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/half-moon-bay"&gt;half-moon-bay&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="data-journalism"/><category term="natalie-downe"/><category term="projects"/><category term="sqlite"/><category term="datasette"/><category term="git-scraping"/><category term="git-history"/><category term="datasette-lite"/><category term="half-moon-bay"/></entry><entry><title>Half Moon Bay Pumpkin Festival traffic on Saturday 15th October 2022</title><link href="https://simonwillison.net/2022/Oct/16/half-moon-bay-pumpkin-festival-traffic/#atom-tag" rel="alternate"/><published>2022-10-16T03:56:51+00:00</published><updated>2022-10-16T03:56:51+00:00</updated><id>https://simonwillison.net/2022/Oct/16/half-moon-bay-pumpkin-festival-traffic/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/simonw/scrape-hmb-traffic"&gt;Half Moon Bay Pumpkin Festival traffic on Saturday 15th October 2022&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
It’s the Half Moon Bay Pumpkin Festival this weekend... and its impact on the traffic between our little town of El Granada and Half Moon Bay—8 minutes drive away—is notorious. So I built a git scraper that archives estimated driving times from the Google Maps Navigation API, and used git-history to turn that scraped data into a SQLite database and visualize it on a chart.

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/simonw/status/1581493679738363904"&gt;@simonw&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git-scraping"&gt;git-scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git-history"&gt;git-history&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/half-moon-bay"&gt;half-moon-bay&lt;/a&gt;&lt;/p&gt;



</summary><category term="projects"/><category term="git-scraping"/><category term="git-history"/><category term="half-moon-bay"/></entry><entry><title>Weeknotes: git-history, bug magnets and s3-credentials --public</title><link href="https://simonwillison.net/2021/Dec/8/weeknotes/#atom-tag" rel="alternate"/><published>2021-12-08T21:34:12+00:00</published><updated>2021-12-08T21:34:12+00:00</updated><id>https://simonwillison.net/2021/Dec/8/weeknotes/#atom-tag</id><summary type="html">
    &lt;p&gt;I've stopped considering my projects "shipped" until I've written a proper blog entry about them, so yesterday I finally &lt;a href="https://simonwillison.net/2021/Dec/7/git-history/"&gt;shipped git-history&lt;/a&gt;, coinciding with the release of &lt;a href="https://github.com/simonw/git-history/releases/tag/0.6"&gt;version 0.6&lt;/a&gt; - a full 27 days after the first &lt;a href="https://github.com/simonw/git-history/releases/tag/0.1"&gt;0.1&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;It took way more work than I was expecting to get to this point!&lt;/p&gt;
&lt;p&gt;I wrote the first version of &lt;code&gt;git-history&lt;/code&gt; in an afternoon, as a tool &lt;a href="https://simonwillison.net/2021/Nov/15/weeknotes-git-history/"&gt;for a workshop I was presenting&lt;/a&gt; on Git scraping and Datasette.&lt;/p&gt;
&lt;p&gt;Before promoting it more widely, I wanted to make some improvements to the schema. In particular, I wanted to record only the updated values in the &lt;code&gt;item_version&lt;/code&gt; table - which otherwise could end up duplicating a full copy of each item in the database hundreds or even thousands of times.&lt;/p&gt;
&lt;p&gt;Getting this right took a lot of work, and I kept on getting stumped by weird bugs and edge-cases. &lt;a href="https://github.com/simonw/git-history/issues/33"&gt;This bug&lt;/a&gt; in particular added a couple of days to the project.&lt;/p&gt;
&lt;p&gt;The whole project turned out to be something of a bug magnet, partly because of a design decision I made concerning column names.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;git-history&lt;/code&gt; creates tables with columns that correspond to the underlying data. Since it also needs its own columns for tracking things like commits and incremental versions, I decided to use underscore prefixes for reserved columns such as &lt;code&gt;_item&lt;/code&gt; and &lt;code&gt;_version&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Datasette uses underscore prefixes for its own purposes - special table arguments such as &lt;code&gt;?_facet=column-name&lt;/code&gt;. It's supposed to work with existing columns that use underscores by converting query string arguments like &lt;code&gt;?_item=3&lt;/code&gt; into &lt;code&gt;?_item__exact=3&lt;/code&gt; - but &lt;code&gt;git-history&lt;/code&gt; was the first of my projects to really exercise this, and I kept on finding bugs. Datasette &lt;a href="https://docs.datasette.io/en/stable/changelog.html#v0-59-2"&gt;0.59.2&lt;/a&gt; and &lt;a href="https://docs.datasette.io/en/stable/changelog.html#v0-59-4"&gt;0.59.4&lt;/a&gt; both have related bug fixes, and there's &lt;a href="https://github.com/simonw/datasette/issues/1527"&gt;a re-opened bug&lt;/a&gt; that I have yet to resolve.&lt;/p&gt;
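&lt;p&gt;The rewrite rule is easy to sketch in Python. This hypothetical &lt;code&gt;disambiguate()&lt;/code&gt; helper (not Datasette's actual code) shows the idea: a reserved-looking query string argument that actually names a table column gets the unambiguous &lt;code&gt;__exact&lt;/code&gt; suffix, while genuinely special arguments pass through untouched:&lt;/p&gt;

```python
def disambiguate(args, table_columns):
    # Sketch of the rule described above, not Datasette's real code:
    # an underscore-prefixed argument that collides with an actual
    # column name is rewritten to the explicit "__exact" filter form.
    out = {}
    for key, value in args.items():
        if key.startswith("_") and key in table_columns and "__" not in key:
            out[key + "__exact"] = value
        else:
            out[key] = value
    return out

# "_item" is a real column here, "_facet" is a Datasette special argument
rewritten = disambiguate({"_item": "3", "_facet": "Type"}, {"_item", "_id"})
```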
&lt;p&gt;Building the &lt;a href="https://git-history-demos.datasette.io/ca-fires/incident"&gt;ca-fires demo&lt;/a&gt; also revealed a &lt;a href="https://github.com/simonw/datasette-cluster-map/issues/38"&gt;bug in datasette-cluster-map&lt;/a&gt; which I fixed in &lt;a href="https://github.com/simonw/datasette-cluster-map/releases/tag/0.17.2"&gt;version 0.17.2&lt;/a&gt;.&lt;/p&gt;
&lt;h4&gt;s3-credentials --public&lt;/h4&gt;
&lt;p&gt;The &lt;code&gt;git-history&lt;/code&gt; live demos are built and deployed by &lt;a href="https://github.com/simonw/git-history/blob/main/.github/workflows/deploy-demos.yml"&gt;this GitHub Actions workflow&lt;/a&gt;. The workflow checks out three separate repos and runs &lt;code&gt;git-history&lt;/code&gt; against each of them. To run faster, it takes advantage of the tool's ability to add just the new commits to an existing database - which means it needs to persist database files in between runs.&lt;/p&gt;
&lt;p&gt;Since these files can be several hundred MBs, I decided to persist them in an S3 bucket.&lt;/p&gt;
&lt;p&gt;My &lt;a href="https://simonwillison.net/2021/Nov/3/s3-credentials/"&gt;s3-credentials tool&lt;/a&gt; provides the ability to create a new S3 bucket along with restricted read-write credentials just for that bucket, ideal for use in a GitHub Actions workflow.&lt;/p&gt;
&lt;p&gt;I decided to make the bucket public such that anyone can download files from it, since there was no reason to keep it private. I've been wanting to add this ability to &lt;code&gt;s3-credentials&lt;/code&gt; for a while now, so this was the impetus I needed to finally ship that feature.&lt;/p&gt;
&lt;p&gt;It's surprisingly hard to figure out how to make an S3 bucket public these days! It turned out the magic recipe was adding a JSON bucket policy document to the bucket granting &lt;code&gt;s3:GetObject&lt;/code&gt; permission to principal &lt;code&gt;*&lt;/code&gt; - here's &lt;a href="https://github.com/simonw/s3-credentials/blob/0.8/README.md#public-bucket-policy"&gt;that policy in full&lt;/a&gt;.&lt;/p&gt;
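&lt;p&gt;For illustration, here is a sketch of that policy built as Python data - a hypothetical &lt;code&gt;public_read_policy()&lt;/code&gt; helper, with the field names following the standard AWS policy grammar. With boto3 the policy would be attached using &lt;code&gt;put_bucket_policy&lt;/code&gt;:&lt;/p&gt;

```python
import json

def public_read_policy(bucket_name):
    # Bucket policy granting s3:GetObject on every key in the bucket
    # to any principal ("*") - a sketch of the policy linked above.
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowAllGetObject",
                "Effect": "Allow",
                "Principal": "*",
                "Action": ["s3:GetObject"],
                "Resource": [f"arn:aws:s3:::{bucket_name}/*"],
            }
        ],
    }

# Attaching it would look something like:
#   boto3.client("s3").put_bucket_policy(
#       Bucket="my-bucket",
#       Policy=json.dumps(public_read_policy("my-bucket")),
#   )
policy = public_read_policy("demo-bucket")
```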
&lt;p&gt;I released &lt;a href="https://github.com/simonw/s3-credentials/releases/tag/0.8"&gt;s3-credentials 0.8&lt;/a&gt; with a new &lt;code&gt;--public&lt;/code&gt; option for creating public buckets - here are the release notes in full:&lt;/p&gt;
&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;s3-credentials create my-bucket --public&lt;/code&gt; option for creating public buckets, which allow anyone with knowledge of a filename to download that file. This works by attaching &lt;a href="https://github.com/simonw/s3-credentials/blob/0.8/README.md#public-bucket-policy"&gt;this public bucket policy&lt;/a&gt; to the bucket after it is created. &lt;a href="https://github.com/simonw/s3-credentials/issues/42"&gt;#42&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;s3-credentials put-object&lt;/code&gt; now sets the &lt;code&gt;Content-Type&lt;/code&gt; header on the uploaded object. The type is detected based on the filename, or can be specified using the new &lt;code&gt;--content-type&lt;/code&gt; option. &lt;a href="https://github.com/simonw/s3-credentials/issues/43"&gt;#43&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;s3-credentials policy my-bucket --public-bucket&lt;/code&gt; outputs the public bucket policy that would be attached to a bucket of that name. &lt;a href="https://github.com/simonw/s3-credentials/issues/44"&gt;#44&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;I wrote up this TIL which doubles as a mini-tutorial on using &lt;code&gt;s3-credentials&lt;/code&gt;: &lt;a href="https://til.simonwillison.net/github-actions/s3-bucket-github-actions"&gt;Storing files in an S3 bucket between GitHub Actions runs&lt;/a&gt;.&lt;/p&gt;
&lt;h4&gt;datasette-hovercards&lt;/h4&gt;
&lt;p&gt;This was a quick experiment which turned into a prototype Datasette plugin. I really like how GitHub show hover card previews of links to issues in their interface:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2021/github-hovercard.gif" alt="Animation showing a hover card displayed when the mouse cursor touches a link to a GitHub Issue" style="max-width:100%;" /&gt;&lt;/p&gt;
&lt;p&gt;I decided to see if I could build something similar for links within Datasette, specifically the links that show up when a column is a foreign key to another record.&lt;/p&gt;
&lt;p&gt;Here's what I've got so far:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2021/datasette-hovercards.gif" alt="Animation showing a hover card displayed in GitHub for a link to another record" style="max-width:100%;" /&gt;&lt;/p&gt;
&lt;p&gt;There's an interactive demo running on &lt;a href="https://latest-with-plugins.datasette.io/github/issues"&gt;this table page&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;It still needs a bunch of work - in particular I need to think harder about when the card is shown, where it displays relative to the mouse pointer, what causes it to be hidden again and how it should handle different page widths. Ideally I'd like to figure out a useful mobile / touch-screen variant, but I'm not sure how that could work.&lt;/p&gt;
&lt;p&gt;The prototype plugin is called &lt;a href="https://github.com/simonw/datasette-hovercards"&gt;datasette-hovercards&lt;/a&gt; - I'd like to eventually merge this back into Datasette core once I'm happy with how it works.&lt;/p&gt;
&lt;h4&gt;Releases this week&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/git-history"&gt;git-history&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/git-history/releases/tag/0.6.1"&gt;0.6.1&lt;/a&gt; - (&lt;a href="https://github.com/simonw/git-history/releases"&gt;9 releases total&lt;/a&gt;) - 2021-12-08
&lt;br /&gt;Tools for analyzing Git history using SQLite&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette-cluster-map"&gt;datasette-cluster-map&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/datasette-cluster-map/releases/tag/0.17.2"&gt;0.17.2&lt;/a&gt; - (&lt;a href="https://github.com/simonw/datasette-cluster-map/releases"&gt;20 releases total&lt;/a&gt;) - 2021-12-07
&lt;br /&gt;Datasette plugin that shows a map for any data with latitude/longitude columns&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/s3-credentials"&gt;s3-credentials&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/s3-credentials/releases/tag/0.8"&gt;0.8&lt;/a&gt; - (&lt;a href="https://github.com/simonw/s3-credentials/releases"&gt;8 releases total&lt;/a&gt;) - 2021-12-07
&lt;br /&gt;A tool for creating credentials for accessing S3 buckets&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/asyncinject"&gt;asyncinject&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/asyncinject/releases/tag/0.2a1"&gt;0.2a1&lt;/a&gt; - (&lt;a href="https://github.com/simonw/asyncinject/releases"&gt;3 releases total&lt;/a&gt;) - 2021-12-03
&lt;br /&gt;Run async workflows using pytest-fixtures-style dependency injection&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette-hovercards"&gt;datasette-hovercards&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/datasette-hovercards/releases/tag/0.1a0"&gt;0.1a0&lt;/a&gt; - 2021-12-02
&lt;br /&gt;Add preview hovercards to links in Datasette&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/dogsheep/github-to-sqlite"&gt;github-to-sqlite&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/dogsheep/github-to-sqlite/releases/tag/2.8.3"&gt;2.8.3&lt;/a&gt; - (&lt;a href="https://github.com/dogsheep/github-to-sqlite/releases"&gt;22 releases total&lt;/a&gt;) - 2021-12-01
&lt;br /&gt;Save data from GitHub to a SQLite database&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;TIL this week&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/python/init-subclass"&gt;__init_subclass__&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/github-actions/s3-bucket-github-actions"&gt;Storing files in an S3 bucket between GitHub Actions runs&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/weeknotes"&gt;weeknotes&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git-history"&gt;git-history&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/s3-credentials"&gt;s3-credentials&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="datasette"/><category term="weeknotes"/><category term="git-history"/><category term="s3-credentials"/></entry><entry><title>git-history: a tool for analyzing scraped data collected using Git and SQLite</title><link href="https://simonwillison.net/2021/Dec/7/git-history/#atom-tag" rel="alternate"/><published>2021-12-07T22:32:55+00:00</published><updated>2021-12-07T22:32:55+00:00</updated><id>https://simonwillison.net/2021/Dec/7/git-history/#atom-tag</id><summary type="html">
    &lt;p&gt;I described &lt;a href="https://simonwillison.net/2020/Oct/9/git-scraping/"&gt;Git scraping&lt;/a&gt; last year: a technique for writing scrapers where you periodically snapshot a source of data to a Git repository in order to record changes to that source over time.&lt;/p&gt;
&lt;p&gt;The open challenge was how to analyze that data once it was collected. &lt;a href="https://datasette.io/tools/git-history"&gt;git-history&lt;/a&gt; is my new tool designed to tackle that problem.&lt;/p&gt;
&lt;h4&gt;Git scraping, a refresher&lt;/h4&gt;
&lt;p&gt;A neat thing about scraping to a Git repository is that the scrapers themselves can be really simple. I demonstrated how to run scrapers for free using GitHub Actions in this &lt;a href="https://simonwillison.net/2021/Mar/5/git-scraping/"&gt;five minute lightning talk&lt;/a&gt; back in March.&lt;/p&gt;
&lt;p&gt;Here's a concrete example: California's state fire department, Cal Fire, maintain an incident map at &lt;a href="https://www.fire.ca.gov/incidents/"&gt;fire.ca.gov/incidents&lt;/a&gt; showing the status of current large fires in the state.&lt;/p&gt;
&lt;p&gt;I found the underlying data here:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;curl https://www.fire.ca.gov/umbraco/Api/IncidentApi/GetIncidents
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then I built &lt;a href="https://github.com/simonw/ca-fires-history/blob/main/.github/workflows/scrape.yml"&gt;a simple scraper&lt;/a&gt; that grabs a copy of that every 20 minutes and commits it to Git. I've been running that for 14 months now, and it's collected &lt;a href="https://github.com/simonw/ca-fires-history"&gt;1,559 commits&lt;/a&gt;!&lt;/p&gt;
&lt;p&gt;The thing that excites me most about Git scraping is that it can create truly unique datasets. It's common for organizations not to keep detailed archives of what changed and where, so by scraping their data into a Git repository you can often end up with a more detailed history than they maintain themselves.&lt;/p&gt;
&lt;p&gt;There's one big challenge though: having collected that data, how can you best analyze it? Reading through thousands of commit differences and eyeballing changes to JSON or CSV files isn't a great way of finding the interesting stories that have been captured.&lt;/p&gt;
&lt;h4&gt;git-history&lt;/h4&gt;
&lt;p&gt;&lt;a href="https://datasette.io/tools/git-history"&gt;git-history&lt;/a&gt; is the new CLI tool I've built to answer that question. It reads through the entire history of a file and generates a SQLite database reflecting changes to that file over time. You can then use &lt;a href="https://datasette.io/"&gt;Datasette&lt;/a&gt; to explore the resulting data.&lt;/p&gt;
&lt;p&gt;Here's &lt;a href="https://git-history-demos.datasette.io/ca-fires"&gt;an example database&lt;/a&gt; created by running the tool against my &lt;code&gt;ca-fires-history&lt;/code&gt; repository. I created the SQLite database by running this in the repository directory:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;git-history file ca-fires.db incidents.json \
  --namespace incident \
  --id UniqueId \
  --convert &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;json.loads(content)["Incidents"]&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2021/ca-fires-progress.gif" alt="Animated gif showing the progress bar" style="max-width:100%; border-top: 5px solid black;" /&gt;&lt;/p&gt;
&lt;p&gt;In this example we are processing the history of a single file called &lt;code&gt;incidents.json&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;We use the &lt;code&gt;UniqueId&lt;/code&gt; column to identify which records are changed over time as opposed to newly created.&lt;/p&gt;
&lt;p&gt;Specifying &lt;code&gt;--namespace incident&lt;/code&gt; causes the created database tables to be called &lt;code&gt;incident&lt;/code&gt; and &lt;code&gt;incident_version&lt;/code&gt; rather than the default of &lt;code&gt;item&lt;/code&gt; and &lt;code&gt;item_version&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;And we have a fragment of Python code that knows how to turn each version stored in that commit history into a list of objects compatible with the tool, see &lt;a href="https://github.com/simonw/git-history/blob/0.6/README.md#custom-conversions-using---convert"&gt;--convert in the documentation&lt;/a&gt; for details.&lt;/p&gt;
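&lt;p&gt;As a rough sketch, here is what that inline &lt;code&gt;--convert&lt;/code&gt; expression does when wrapped in a function and run against a hypothetical snapshot of &lt;code&gt;incidents.json&lt;/code&gt; (the sample record below is invented for illustration):&lt;/p&gt;

```python
import json

def convert(content):
    # Equivalent of the inline expression passed to --convert:
    # parse the raw file contents from one commit and return the list
    # of incident objects nested under the "Incidents" key.
    return json.loads(content)["Incidents"]

# Hypothetical snapshot of incidents.json from a single commit:
snapshot = json.dumps({
    "Incidents": [
        {"UniqueId": "abc123", "Name": "Example Fire", "AcresBurned": 500}
    ]
})
incidents = convert(snapshot)
```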
&lt;p&gt;Let's use the database to answer some questions about fires in California over the past 14 months.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;incident&lt;/code&gt; table contains a copy of the latest record for every incident. We can use that to see &lt;a href="https://git-history-demos.datasette.io/ca-fires/incident"&gt;a map of every fire&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2021/ca-fires-map.png" alt="A map showing 250 fires in California" style="max-width:100%;" /&gt;&lt;/p&gt;
&lt;p&gt;This uses the &lt;a href="https://datasette.io/plugins/datasette-cluster-map"&gt;datasette-cluster-map&lt;/a&gt; plugin, which draws a map of every row with a valid latitude and longitude column.&lt;/p&gt;
&lt;p&gt;Where things get interesting is the &lt;a href="https://git-history-demos.datasette.io/ca-fires/incident_version"&gt;incident_version&lt;/a&gt; table. This is where changes between different scraped versions of each item are recorded.&lt;/p&gt;
&lt;p&gt;Those 250 fires have 2,060 recorded versions. If we &lt;a href="https://git-history-demos.datasette.io/ca-fires/incident_version?_facet=_item"&gt;facet by _item&lt;/a&gt; we can see which fires had the most versions recorded. Here are the top ten:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://git-history-demos.datasette.io/ca-fires/incident_version?_facet=_item&amp;amp;_item__exact=174"&gt;Dixie Fire&lt;/a&gt; 268&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://git-history-demos.datasette.io/ca-fires/incident_version?_facet=_item&amp;amp;_item__exact=209"&gt;Caldor Fire&lt;/a&gt; 153&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://git-history-demos.datasette.io/ca-fires/incident_version?_facet=_item&amp;amp;_item__exact=197"&gt;Monument Fire&lt;/a&gt; 65&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://git-history-demos.datasette.io/ca-fires/incident_version?_facet=_item&amp;amp;_item__exact=1"&gt;August Complex (includes Doe Fire)&lt;/a&gt; 64&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://git-history-demos.datasette.io/ca-fires/incident_version?_facet=_item&amp;amp;_item__exact=2"&gt;Creek Fire&lt;/a&gt; 56&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://git-history-demos.datasette.io/ca-fires/incident_version?_facet=_item&amp;amp;_item__exact=213"&gt;French Fire&lt;/a&gt; 53&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://git-history-demos.datasette.io/ca-fires/incident_version?_facet=_item&amp;amp;_item__exact=32"&gt;Silverado Fire&lt;/a&gt; 52&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://git-history-demos.datasette.io/ca-fires/incident_version?_facet=_item&amp;amp;_item__exact=240"&gt;Fawn Fire&lt;/a&gt; 45&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://git-history-demos.datasette.io/ca-fires/incident_version?_facet=_item&amp;amp;_item__exact=34"&gt;Blue Ridge Fire&lt;/a&gt; 39&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://git-history-demos.datasette.io/ca-fires/incident_version?_facet=_item&amp;amp;_item__exact=190"&gt;McFarland Fire&lt;/a&gt; 34&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This looks about right - the larger the number of versions the longer the fire must have been burning. The Dixie Fire &lt;a href="https://en.wikipedia.org/wiki/Dixie_Fire"&gt;has its own Wikipedia page&lt;/a&gt;!&lt;/p&gt;
&lt;p&gt;Clicking through to &lt;a href="https://git-history-demos.datasette.io/ca-fires/incident_version?_facet=_item&amp;amp;_item__exact=174"&gt;the Dixie Fire&lt;/a&gt; lands us on a page showing every "version" that we captured, ordered by version number.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;git-history&lt;/code&gt; only writes values to this table that have changed since the previous version. This means you can glance at the table grid and get a feel for which pieces of information were updated over time:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2021/ca-fires-incident-versions.png" alt="The table showing changes to that fire over time" style="max-width:100%;" /&gt;&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;ConditionStatement&lt;/code&gt; is a text description that changes frequently, but the other two interesting columns look to be &lt;code&gt;AcresBurned&lt;/code&gt; and &lt;code&gt;PercentContained&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;That &lt;code&gt;_commit&lt;/code&gt; column is a foreign key to the &lt;a href="https://git-history-demos.datasette.io/ca-fires/commits"&gt;commits&lt;/a&gt; table, which records the commits that have been processed by the tool - mainly so that when you run it a second time it can pick up where it finished last time.&lt;/p&gt;
&lt;p&gt;We can join against &lt;code&gt;commits&lt;/code&gt; to see the date that each version was created. Or we can use the &lt;a href="https://git-history-demos.datasette.io/ca-fires/incident_version_detail"&gt;incident_version_detail&lt;/a&gt; view which performs that join for us.&lt;/p&gt;
&lt;p&gt;Using that view, we can filter for just rows where &lt;code&gt;_item&lt;/code&gt; is 174 and &lt;code&gt;AcresBurned&lt;/code&gt; is not blank, then use the &lt;a href="https://datasette.io/plugins/datasette-vega"&gt;datasette-vega&lt;/a&gt; plugin to visualize the &lt;code&gt;_commit_at&lt;/code&gt; date column against the &lt;code&gt;AcresBurned&lt;/code&gt; numeric column... and we get a graph of &lt;a href="https://git-history-demos.datasette.io/ca-fires/incident_version_detail?_item__exact=174&amp;amp;AcresBurned__notblank=1#g.mark=line&amp;amp;g.x_column=_commit_at&amp;amp;g.x_type=temporal&amp;amp;g.y_column=AcresBurned&amp;amp;g.y_type=quantitative"&gt;the growth of the Dixie Fire over time&lt;/a&gt;!&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2021/ca-fires-chart.png" alt="The chart plugin showing a line chart" style="max-width:100%;" /&gt;&lt;/p&gt;
&lt;p&gt;To review: we started out with a GitHub Actions scheduled workflow grabbing a copy of a JSON API endpoint every 20 minutes. Thanks to &lt;code&gt;git-history&lt;/code&gt;, Datasette and &lt;code&gt;datasette-vega&lt;/code&gt; we now have a chart showing the growth of the longest-lived California wildfire of the last 14 months over time.&lt;/p&gt;
&lt;h4&gt;A note on schema design&lt;/h4&gt;
&lt;p&gt;One of the hardest problems in designing &lt;code&gt;git-history&lt;/code&gt; was deciding on an appropriate schema for storing version changes over time.&lt;/p&gt;
&lt;p&gt;I ended up with the following (edited for clarity):&lt;/p&gt;
&lt;div class="highlight highlight-source-sql"&gt;&lt;pre&gt;CREATE TABLE [commits] (
   [id] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt; &lt;span class="pl-k"&gt;PRIMARY KEY&lt;/span&gt;,
   [hash] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt;,
   [commit_at] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt;
);
CREATE TABLE [item] (
   [_id] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt; &lt;span class="pl-k"&gt;PRIMARY KEY&lt;/span&gt;,
   [_item_id] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt;,
   [IncidentID] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt;,
   [Location] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt;,
   [Type] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt;,
   [_commit] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt;
);
CREATE TABLE [item_version] (
   [_id] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt; &lt;span class="pl-k"&gt;PRIMARY KEY&lt;/span&gt;,
   [_item] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt; &lt;span class="pl-k"&gt;REFERENCES&lt;/span&gt; [item]([_id]),
   [_version] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt;,
   [_commit] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt; &lt;span class="pl-k"&gt;REFERENCES&lt;/span&gt; [commits]([id]),
   [IncidentID] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt;,
   [Location] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt;,
   [Type] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt;
);
CREATE TABLE [columns] (
   [id] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt; &lt;span class="pl-k"&gt;PRIMARY KEY&lt;/span&gt;,
   [namespace] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt; &lt;span class="pl-k"&gt;REFERENCES&lt;/span&gt; [namespaces]([id]),
   [name] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt;
);
CREATE TABLE [item_changed] (
   [item_version] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt; &lt;span class="pl-k"&gt;REFERENCES&lt;/span&gt; [item_version]([_id]),
   [column] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt; &lt;span class="pl-k"&gt;REFERENCES&lt;/span&gt; [columns]([id]),
   &lt;span class="pl-k"&gt;PRIMARY KEY&lt;/span&gt; ([item_version], [column])
);&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;As shown earlier, records in the &lt;code&gt;item_version&lt;/code&gt; table represent snapshots over time - but to save on database space and provide a neater interface for browsing versions, they only record columns that had changed since their previous version. Any unchanged columns are stored as &lt;code&gt;null&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;There's one catch with this schema: what do we do if a new version of an item sets one of the columns to &lt;code&gt;null&lt;/code&gt;? How can we tell the difference between that and a column that didn't change?&lt;/p&gt;
&lt;p&gt;I ended up solving that with an &lt;code&gt;item_changed&lt;/code&gt; many-to-many table, which uses pairs of integers (hopefully taking up as little space as possible) to record exactly which columns were modified in which &lt;code&gt;item_version&lt;/code&gt; records.&lt;/p&gt;
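&lt;p&gt;The delta logic can be sketched in a few lines of Python. This hypothetical &lt;code&gt;version_delta()&lt;/code&gt; helper (not git-history's actual code) records only the columns that changed, nulling the rest, and returns the explicit changed-column list that the &lt;code&gt;item_changed&lt;/code&gt; table would store:&lt;/p&gt;

```python
def version_delta(previous, current):
    # Compare two snapshots of the same item. Returns (delta, changed):
    # delta keeps only values that differ from the previous version,
    # with unchanged columns stored as None - mirroring the nulls in
    # item_version. The changed list records exactly which columns
    # moved, so a column genuinely set to null stays distinguishable
    # from one that simply did not change.
    changed = [
        col for col in current
        if col not in previous or previous[col] != current[col]
    ]
    delta = {col: (current[col] if col in changed else None) for col in current}
    return delta, changed

v1 = {"AcresBurned": 100, "PercentContained": 10, "Type": "Wildfire"}
v2 = {"AcresBurned": 250, "PercentContained": 10, "Type": "Wildfire"}
delta, changed = version_delta(v1, v2)
```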
&lt;p&gt;The &lt;code&gt;item_version_detail&lt;/code&gt; view displays columns from that many-to-many table as JSON - here's &lt;a href="https://git-history-demos.datasette.io/ca-fires/incident_version_detail?_version__gt=1&amp;amp;_col=_changed_columns&amp;amp;_col=_item&amp;amp;_col=_version"&gt;a filtered example&lt;/a&gt; showing which columns were changed in which versions of which items:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2021/ca-fires-changed-columns.png" alt="This table shows a JSON list of column names against items and versions" style="max-width:100%;" /&gt;&lt;/p&gt;
&lt;p&gt;Here's &lt;a href="https://git-history-demos.datasette.io/ca-fires?sql=select+columns.name%2C+count%28*%29%0D%0Afrom+incident_changed%0D%0A++join+incident_version+on+incident_changed.item_version+%3D+incident_version._id%0D%0A++join+columns+on+incident_changed.column+%3D+columns.id%0D%0Awhere+incident_version._version+%3E+1%0D%0Agroup+by+columns.name%0D%0Aorder+by+count%28*%29+desc"&gt;a SQL query&lt;/a&gt; that shows, for &lt;code&gt;ca-fires&lt;/code&gt;, which columns were updated most often:&lt;/p&gt;
&lt;div class="highlight highlight-source-sql"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;select&lt;/span&gt; &lt;span class="pl-c1"&gt;columns&lt;/span&gt;.&lt;span class="pl-c1"&gt;name&lt;/span&gt;, &lt;span class="pl-c1"&gt;count&lt;/span&gt;(&lt;span class="pl-k"&gt;*&lt;/span&gt;)
&lt;span class="pl-k"&gt;from&lt;/span&gt; incident_changed
  &lt;span class="pl-k"&gt;join&lt;/span&gt; incident_version &lt;span class="pl-k"&gt;on&lt;/span&gt; &lt;span class="pl-c1"&gt;incident_changed&lt;/span&gt;.&lt;span class="pl-c1"&gt;item_version&lt;/span&gt; &lt;span class="pl-k"&gt;=&lt;/span&gt; &lt;span class="pl-c1"&gt;incident_version&lt;/span&gt;.&lt;span class="pl-c1"&gt;_id&lt;/span&gt;
  &lt;span class="pl-k"&gt;join&lt;/span&gt; columns &lt;span class="pl-k"&gt;on&lt;/span&gt; &lt;span class="pl-c1"&gt;incident_changed&lt;/span&gt;.&lt;span class="pl-c1"&gt;column&lt;/span&gt; &lt;span class="pl-k"&gt;=&lt;/span&gt; &lt;span class="pl-c1"&gt;columns&lt;/span&gt;.&lt;span class="pl-c1"&gt;id&lt;/span&gt;
&lt;span class="pl-k"&gt;where&lt;/span&gt; &lt;span class="pl-c1"&gt;incident_version&lt;/span&gt;.&lt;span class="pl-c1"&gt;_version&lt;/span&gt; &lt;span class="pl-k"&gt;&amp;gt;&lt;/span&gt; &lt;span class="pl-c1"&gt;1&lt;/span&gt;
&lt;span class="pl-k"&gt;group by&lt;/span&gt; &lt;span class="pl-c1"&gt;columns&lt;/span&gt;.&lt;span class="pl-c1"&gt;name&lt;/span&gt;
&lt;span class="pl-k"&gt;order by&lt;/span&gt; &lt;span class="pl-c1"&gt;count&lt;/span&gt;(&lt;span class="pl-k"&gt;*&lt;/span&gt;) &lt;span class="pl-k"&gt;desc&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;ul&gt;
&lt;li&gt;Updated: 1785&lt;/li&gt;
&lt;li&gt;PercentContained: 740&lt;/li&gt;
&lt;li&gt;ConditionStatement: 734&lt;/li&gt;
&lt;li&gt;AcresBurned: 616&lt;/li&gt;
&lt;li&gt;Started: 327&lt;/li&gt;
&lt;li&gt;PersonnelInvolved: 286&lt;/li&gt;
&lt;li&gt;Engines: 274&lt;/li&gt;
&lt;li&gt;CrewsInvolved: 256&lt;/li&gt;
&lt;li&gt;WaterTenders: 225&lt;/li&gt;
&lt;li&gt;Dozers: 211&lt;/li&gt;
&lt;li&gt;AirTankers: 181&lt;/li&gt;
&lt;li&gt;StructuresDestroyed: 125&lt;/li&gt;
&lt;li&gt;Helicopters: 122&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Helicopters are exciting! Let's find all of the fires which had at least one record where the number of helicopters changed (after the first version). We'll use a nested SQL query:&lt;/p&gt;
&lt;div class="highlight highlight-source-sql"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;select&lt;/span&gt; &lt;span class="pl-k"&gt;*&lt;/span&gt; &lt;span class="pl-k"&gt;from&lt;/span&gt; incident
&lt;span class="pl-k"&gt;where&lt;/span&gt; _id &lt;span class="pl-k"&gt;in&lt;/span&gt; (
  &lt;span class="pl-k"&gt;select&lt;/span&gt; _item &lt;span class="pl-k"&gt;from&lt;/span&gt; incident_version
  &lt;span class="pl-k"&gt;where&lt;/span&gt; _id &lt;span class="pl-k"&gt;in&lt;/span&gt; (
    &lt;span class="pl-k"&gt;select&lt;/span&gt; item_version &lt;span class="pl-k"&gt;from&lt;/span&gt; incident_changed &lt;span class="pl-k"&gt;where&lt;/span&gt; column &lt;span class="pl-k"&gt;=&lt;/span&gt; &lt;span class="pl-c1"&gt;15&lt;/span&gt;
  )
  &lt;span class="pl-k"&gt;and&lt;/span&gt; _version &lt;span class="pl-k"&gt;&amp;gt;&lt;/span&gt; &lt;span class="pl-c1"&gt;1&lt;/span&gt;
)&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;That returned 19 fires that were significant enough to involve helicopters - &lt;a href="https://git-history-demos.datasette.io/ca-fires?sql=select+*+from+incident%0D%0Awhere+_id+in+%28%0D%0A++select+_item+from+incident_version%0D%0A++where+_id+in+%28%0D%0A++++select+item_version+from+incident_changed+where+column+%3D+15%0D%0A++%29%0D%0A++and+_version+%3E+1%0D%0A%29"&gt;here they are on a map&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2021/ca-fire-helicopter-map.png" alt="A map of 19 fires that involved helicopters" style="max-width:100%;" /&gt;&lt;/p&gt;
&lt;h4&gt;Advanced usage of --convert&lt;/h4&gt;
&lt;p&gt;Drew Breunig has been running a Git scraper for the past 8 months in &lt;a href="https://github.com/dbreunig/511-events-history"&gt;dbreunig/511-events-history&lt;/a&gt; against &lt;a href="https://511.org/"&gt;511.org&lt;/a&gt;, a site showing traffic incidents in the San Francisco Bay Area. I loaded his data into this example &lt;a href="https://git-history-demos.datasette.io/sf-bay-511"&gt;sf-bay-511 database&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;sf-bay-511&lt;/code&gt; example is useful for digging more into the &lt;code&gt;--convert&lt;/code&gt; option to &lt;code&gt;git-history&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;git-history&lt;/code&gt; requires recorded data to be in a specific shape: it needs a JSON list of JSON objects, where each object has a column that can be treated as a unique ID for purposes of tracking changes to that specific record over time.&lt;/p&gt;
&lt;p&gt;The ideal tracked JSON file would look something like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-json"&gt;&lt;pre&gt;[
  {
    &lt;span class="pl-ent"&gt;"IncidentID"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;abc123&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"Location"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Corner of 4th and Vermont&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"Type"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;fire&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;
  },
  {
    &lt;span class="pl-ent"&gt;"IncidentID"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;cde448&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"Location"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;555 West Example Drive&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"Type"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;medical&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;
  }
]&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;It's common for scraped data not to fit this ideal shape.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;511.org&lt;/code&gt; JSON feed &lt;a href="https://backend-prod.511.org/api-proxy/api/v1/traffic/events/?extended=true"&gt;can be found here&lt;/a&gt; - it's a pretty complicated nested set of objects, and a lot of the data in there is noisy without adding much to the overall analysis - things like an &lt;code&gt;updated&lt;/code&gt; timestamp field that changes in every version even when nothing else has changed, or a deeply nested &lt;code&gt;"extension"&lt;/code&gt; object full of duplicate data.&lt;/p&gt;
&lt;p&gt;I wrote a snippet of Python to transform each of those recorded snapshots into a simpler structure, and then passed that Python code to the script using the &lt;code&gt;--convert&lt;/code&gt; option:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;#!/bin/bash
git-history file sf-bay-511.db 511-events-history/events.json \
  --repo 511-events-history \
  --id id \
  --convert '
data = json.loads(content)
if data.get("error"):
    # {"code": 500, "error": "Error accessing remote data..."}
    return
for event in data["Events"]:
    event["id"] = event["extension"]["event-reference"]["event-identifier"]
    # Remove noisy updated timestamp
    del event["updated"]
    # Drop extension block entirely
    del event["extension"]
    # "schedule" block is noisy but not interesting
    del event["schedule"]
    # Flatten nested subtypes
    event["event_subtypes"] = event["event_subtypes"]["event_subtype"]
    if not isinstance(event["event_subtypes"], list):
        event["event_subtypes"] = [event["event_subtypes"]]
    yield event
'
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The single-quoted string passed to &lt;code&gt;--convert&lt;/code&gt; is compiled into a Python function and run against each Git version in turn. My code loops through the nested &lt;code&gt;Events&lt;/code&gt; list, modifying each record and then outputting them as an iterable sequence using &lt;code&gt;yield&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;A few of the records in the history were server 500 errors, so the code block knows how to identify and skip those as well.&lt;/p&gt;
&lt;p&gt;When working with &lt;code&gt;git-history&lt;/code&gt; I find myself spending most of my time iterating on these conversion scripts. Passing strings of Python code to tools like this is a pretty fun pattern - I also used it &lt;a href="https://simonwillison.net/2021/Aug/6/sqlite-utils-convert/"&gt;for sqlite-utils convert&lt;/a&gt; earlier this year.&lt;/p&gt;
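&lt;p&gt;The general technique is straightforward to replicate in your own tools: wrap the user's snippet in a function definition, &lt;code&gt;exec&lt;/code&gt; it, and pull the resulting function out of the namespace. Here's a minimal sketch of the pattern - an illustration of the general idea, not git-history's actual implementation:&lt;/p&gt;

```python
import json
import textwrap

def compile_convert(code):
    # Indent the user-supplied snippet into a generator function body.
    # The snippet can reference "content" and use "yield", just like
    # the --convert option described above. (A sketch of the pattern,
    # not git-history's actual implementation.)
    body = textwrap.indent(textwrap.dedent(code), "    ")
    source = "def _convert(content):\n" + body
    namespace = {"json": json}  # pre-import json for the snippet's use
    exec(source, namespace)
    return namespace["_convert"]

convert = compile_convert("""
data = json.loads(content)
for item in data:
    item["converted"] = True
    yield item
""")

rows = list(convert('[{"id": 1}, {"id": 2}]'))
```

&lt;p&gt;Because the snippet uses &lt;code&gt;yield&lt;/code&gt;, the compiled function is a generator, so one input version can produce any number of output records.&lt;/p&gt;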
&lt;h4&gt;Trying this out yourself&lt;/h4&gt;
&lt;p&gt;If you want to try this out for yourself the &lt;code&gt;git-history&lt;/code&gt; tool has &lt;a href="https://github.com/simonw/git-history/blob/main/README.md"&gt;an extensive README&lt;/a&gt; describing the other options, and the scripts used to create these demos can be found in the &lt;a href="https://github.com/simonw/git-history/tree/main/demos"&gt;demos folder&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The &lt;a href="https://github.com/topics/git-scraping"&gt;git-scraping topic&lt;/a&gt; on GitHub now has over 200 repos now built by dozens of different people - that's a lot of interesting scraped data sat there waiting to be explored!&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/cli"&gt;cli&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/data-journalism"&gt;data-journalism&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git"&gt;git&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/scraping"&gt;scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sqlite"&gt;sqlite&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git-history"&gt;git-history&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="cli"/><category term="data-journalism"/><category term="git"/><category term="projects"/><category term="scraping"/><category term="sqlite"/><category term="datasette"/><category term="git-history"/></entry><entry><title>Weeknotes: git-history, created for a Git scraping workshop</title><link href="https://simonwillison.net/2021/Nov/15/weeknotes-git-history/#atom-tag" rel="alternate"/><published>2021-11-15T04:10:50+00:00</published><updated>2021-11-15T04:10:50+00:00</updated><id>https://simonwillison.net/2021/Nov/15/weeknotes-git-history/#atom-tag</id><summary type="html">
    &lt;p&gt;My main project this week was a 90 minute workshop I delivered about Git scraping at &lt;a href="https://escoladedados.org/coda2021/"&gt;Coda.Br 2021&lt;/a&gt;, a Brazilian data journalism conference, on Friday. This inspired the creation of a brand new tool, &lt;strong&gt;git-history&lt;/strong&gt;, plus smaller improvements to a range of other projects.&lt;/p&gt;
&lt;h4&gt;git-history&lt;/h4&gt;
&lt;p&gt;I still need to do a detailed write-up of this one (update: &lt;a href="https://simonwillison.net/2021/Dec/7/git-history/"&gt;git-history: a tool for analyzing scraped data collected using Git and SQLite&lt;/a&gt;), but on Thursday I released a brand new tool called &lt;a href="https://datasette.io/tools/git-history"&gt;git-history&lt;/a&gt;, which I describe as "tools for analyzing Git history using SQLite".&lt;/p&gt;
&lt;p&gt;This tool is the missing link in the &lt;a href="https://simonwillison.net/2020/Oct/9/git-scraping/"&gt;Git scraping pattern&lt;/a&gt; I described here last October.&lt;/p&gt;
&lt;p&gt;Git scraping is the technique of regularly scraping an online source of information and writing the results to a file in a Git repository... which automatically gives you a full revision history of changes made to that data source over time.&lt;/p&gt;
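&lt;p&gt;The scraping half of the pattern is typically just a scheduled GitHub Actions workflow that fetches a URL and commits the result if it changed. A minimal sketch - the URL, filename and cron schedule here are illustrative placeholders:&lt;/p&gt;

```yaml
name: Scrape latest data

on:
  push:
  schedule:
    - cron: '6,26,46 * * * *'  # three times an hour

jobs:
  scheduled:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Fetch latest data
        run: curl https://example.com/data.json -o data.json
      - name: Commit and push if it changed
        run: |-
          git config user.name "Automated"
          git config user.email "actions@users.noreply.github.com"
          git add -A
          timestamp=$(date -u)
          git commit -m "Latest data: ${timestamp}" || exit 0
          git push
```

&lt;p&gt;The &lt;code&gt;|| exit 0&lt;/code&gt; means the workflow succeeds quietly when nothing changed, so only genuine changes end up as commits.&lt;/p&gt;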
&lt;p&gt;The missing piece has always been what to do next: how do you turn a commit history of changes to a JSON or CSV file into a data source that can be used to answer questions about how that file changed over time?&lt;/p&gt;
&lt;p&gt;I've written one-off Python scripts for this a few times (here's &lt;a href="https://github.com/simonw/cdc-vaccination-history/blob/6f6bcb9437c0d44c4bcf94c111c631cc50bc2744/build_database.py"&gt;my CDC vaccinations one&lt;/a&gt;, for example), but giving an interactive workshop about the technique finally inspired me to build a tool to help.&lt;/p&gt;
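&lt;p&gt;The core of every one of those scripts is the same loop over the file's Git history. A minimal sketch of that loop (the function name is my own, and this assumes the file contains JSON):&lt;/p&gt;

```python
import json
import subprocess

def iterate_file_versions(repo_path, filepath):
    """Yield (commit_sha, parsed JSON) for every commit that touched filepath."""
    # List the commits that changed the file, oldest first
    shas = subprocess.check_output(
        ["git", "log", "--format=%H", "--reverse", "--", filepath],
        cwd=repo_path, text=True,
    ).split()
    for sha in shas:
        # Read the file's content as of that commit
        content = subprocess.check_output(
            ["git", "show", f"{sha}:{filepath}"],
            cwd=repo_path, text=True,
        )
        yield sha, json.loads(content)
```

&lt;p&gt;Each yielded snapshot can then be diffed against the previous one and written to SQLite - which is essentially the job git-history now automates.&lt;/p&gt;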
&lt;p&gt;The tool has &lt;a href="https://datasette.io/tools/git-history"&gt;a comprehensive README&lt;/a&gt;, but the short version is that you can take a JSON (or CSV) file in a repository that has been tracking changes to some items over time and run the following to load all of the different versions into a SQLite database file for analysis with &lt;a href="https://datasette.io/"&gt;Datasette&lt;/a&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;git-history file incidents.db incidents.json --id IncidentID
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This assumes that &lt;code&gt;incidents.json&lt;/code&gt; contains a JSON array of incidents (reported fires for example) and that each incident has an &lt;code&gt;IncidentID&lt;/code&gt; identifier key. It will then loop through the Git history of that file right from the start, creating an &lt;code&gt;item_versions&lt;/code&gt; table that tracks every change made to each of those items - using &lt;code&gt;IncidentID&lt;/code&gt; to decide if a row represents a new incident or an update to a previous one.&lt;/p&gt;
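&lt;p&gt;The ID-based change detection at the heart of this can be sketched in a few lines of Python - note that the &lt;code&gt;_item_id&lt;/code&gt; and &lt;code&gt;_commit&lt;/code&gt; column names here are illustrative, not the tool's actual schema:&lt;/p&gt;

```python
def track_versions(snapshots, id_key):
    # snapshots: iterable of (commit_sha, list-of-dicts), oldest first.
    # Returns one row per change: a row is emitted only when an item
    # is new or differs from its most recently seen version.
    last_seen = {}  # item id -> most recent version of that item
    versions = []
    for sha, items in snapshots:
        for item in items:
            item_id = item[id_key]
            if last_seen.get(item_id) != item:
                versions.append({"_item_id": item_id, "_commit": sha, **item})
                last_seen[item_id] = item
    return versions

rows = track_versions(
    [
        ("sha1", [{"IncidentID": "abc123", "Type": "fire"}]),
        ("sha2", [{"IncidentID": "abc123", "Type": "fire"},
                  {"IncidentID": "cde448", "Type": "medical"}]),
        ("sha3", [{"IncidentID": "abc123", "Type": "contained"},
                  {"IncidentID": "cde448", "Type": "medical"}]),
    ],
    "IncidentID",
)
```

&lt;p&gt;Three snapshots here produce just three version rows: the unchanged items in the second and third snapshots generate nothing, which is what keeps the resulting table compact.&lt;/p&gt;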
&lt;p&gt;I have a few more improvements I want to make before I start more widely promoting this, but it's already really useful. I've had a lot of fun running it against example repos from the &lt;a href="https://github.com/topics/git-scraping"&gt;git-scraping GitHub topic&lt;/a&gt; (now at 202 repos and counting).&lt;/p&gt;
&lt;h4&gt;Workshop: Raspando dados com o GitHub Actions e analisando com Datasette&lt;/h4&gt;
&lt;p&gt;The workshop I gave at the conference was live-translated into Portuguese, which is really exciting! I'm looking forward to watching the video when it comes out and seeing how well that worked.&lt;/p&gt;
&lt;p&gt;The title translates to "Scraping data with GitHub Actions and analyzing with Datasette", and it was the first time I've given a workshop that combines Git scraping and Datasette - hence the development of the new git-history tool to help tie the two together.&lt;/p&gt;
&lt;p&gt;I think it went really well. I put together four detailed exercises for the attendees, and then worked through each one live with the goal of attendees working through them at the same time - a method I learned from the Carpentries training course I took &lt;a href="https://simonwillison.net/2020/Sep/26/weeknotes-software-carpentry-sqlite/"&gt;last year&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Four exercises turns out to be exactly right for 90 minutes, with reasonable time for an introduction and some extra material and questions at the end.&lt;/p&gt;
&lt;p&gt;The worst part of running a workshop is inevitably the part where you try and get everyone set up with a functional development environment on their own machines (see &lt;a href="https://xkcd.com/1987/"&gt;XKCD 1987&lt;/a&gt;). This time round I skipped that entirely by encouraging my students to use &lt;strong&gt;&lt;a href="https://gitpod.io/"&gt;GitPod&lt;/a&gt;&lt;/strong&gt;, which provides free browser-based cloud development environments running Linux, with a browser-embedded VS Code editor and terminal running on top.&lt;/p&gt;

&lt;p&gt;&lt;img style="max-width: 100%" src="https://static.simonwillison.net/static/2021/start-datasette-gitpod.gif" alt="Animated demo of GitPod showing how to run Datasette and have it proxy a port" /&gt;&lt;/p&gt;

&lt;p&gt;(It's similar to &lt;a href="https://github.com/features/codespaces"&gt;GitHub Codespaces&lt;/a&gt;, but Codespaces is not yet available to free customers outside of the beta.)&lt;/p&gt;
&lt;p&gt;I demonstrated all of the exercises using GitPod myself during the workshop, and ensured that they could be entirely completed through that environment, with no laptop software needed at all.&lt;/p&gt;
&lt;p&gt;This worked &lt;strong&gt;so well&lt;/strong&gt;. Not having to worry about development environments makes workshops massively more productive. I will absolutely be doing this again in the future.&lt;/p&gt;
&lt;p&gt;The workshop exercises are available &lt;a href="https://docs.google.com/document/d/1TCatZP5gQNfFjZJ5M77wMlf9u_05Z3BZnjp6t1SA6UU/edit"&gt;in this Google Doc&lt;/a&gt;, and I hope to extract some of them out into official tutorials for various tools later on.&lt;/p&gt;
&lt;h4&gt;Datasette 0.59.2&lt;/h4&gt;
&lt;p&gt;Yesterday was Datasette's fourth birthday - the four year anniversary of &lt;a href="https://simonwillison.net/2017/Nov/13/datasette/"&gt;the initial release announcement&lt;/a&gt;! I celebrated by releasing a minor bug-fix, &lt;a href="https://github.com/simonw/datasette/releases/tag/0.59.2"&gt;Datasette 0.59.2&lt;/a&gt;, the release notes for which are quoted below:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Column names with a leading underscore now work correctly when used as a facet. (&lt;a href="https://github.com/simonw/datasette/issues/1506"&gt;#1506&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Applying &lt;code&gt;?_nocol=&lt;/code&gt; to a column no longer removes that column from the filtering interface. (&lt;a href="https://github.com/simonw/datasette/issues/1503"&gt;#1503&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Official Datasette Docker container now uses Debian Bullseye as the base image. (&lt;a href="https://github.com/simonw/datasette/issues/1497"&gt;#1497&lt;/a&gt;)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;That first change was inspired by ongoing work on &lt;code&gt;git-history&lt;/code&gt;, where I decided to use an &lt;code&gt;_id&lt;/code&gt; underscore prefix pattern for columns reserved for use by that tool, in order &lt;a href="https://github.com/simonw/git-history/issues/14"&gt;to avoid clashing with column names&lt;/a&gt; in the provided source data.&lt;/p&gt;
&lt;h4&gt;sqlite-utils 3.18&lt;/h4&gt;
&lt;p&gt;Today I released &lt;a href="https://sqlite-utils.datasette.io/en/stable/changelog.html#v3-18"&gt;sqlite-utils 3.18&lt;/a&gt; - initially also to provide a feature I wanted for &lt;code&gt;git-history&lt;/code&gt; (a way to &lt;a href="https://github.com/simonw/sqlite-utils/issues/339"&gt;populate additional columns&lt;/a&gt; when creating a row using &lt;code&gt;table.lookup()&lt;/code&gt;) but I also closed some bug reports and landed some small pull requests that had come in since 3.17.&lt;/p&gt;
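&lt;p&gt;The idea behind &lt;code&gt;table.lookup()&lt;/code&gt; is a find-or-create: return the existing row's ID if the lookup values match, otherwise insert a new row - and the new feature populates extra columns only at creation time. A rough sketch of that pattern in plain &lt;code&gt;sqlite3&lt;/code&gt; (an illustration of the idea, not sqlite-utils' implementation):&lt;/p&gt;

```python
import sqlite3

def lookup(db, table, lookup_values, extra_values=None):
    # Find-or-create: return the rowid matching lookup_values,
    # inserting a new row (with extra_values populated) only if no
    # matching row exists yet.
    cols = list(lookup_values)
    where = " AND ".join(f"[{c}] = ?" for c in cols)
    row = db.execute(
        f"SELECT rowid FROM [{table}] WHERE {where}",
        [lookup_values[c] for c in cols],
    ).fetchone()
    if row:
        return row[0]
    insert = {**lookup_values, **(extra_values or {})}
    names = ", ".join(f"[{c}]" for c in insert)
    placeholders = ", ".join("?" for _ in insert)
    cur = db.execute(
        f"INSERT INTO [{table}] ({names}) VALUES ({placeholders})",
        list(insert.values()),
    )
    return cur.lastrowid

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE species (name TEXT, first_seen TEXT)")
id1 = lookup(db, "species", {"name": "Palm"}, {"first_seen": "2021-11-15"})
id2 = lookup(db, "species", {"name": "Palm"}, {"first_seen": "2021-11-16"})
```

&lt;p&gt;Both calls return the same ID, and &lt;code&gt;first_seen&lt;/code&gt; keeps its original value - the extra columns are only written when the row is first created.&lt;/p&gt;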
&lt;h4&gt;s3-credentials 0.5&lt;/h4&gt;
&lt;p&gt;Earlier in the week I released &lt;a href="https://github.com/simonw/s3-credentials/releases/tag/0.5"&gt;version 0.5&lt;/a&gt; of &lt;a href="https://github.com/simonw/s3-credentials"&gt;s3-credentials&lt;/a&gt; - my CLI tool for creating read-only, read-write or write-only AWS credentials for a specific S3 bucket.&lt;/p&gt;
&lt;p&gt;The biggest new feature is the ability to create temporary credentials that expire after a given time limit.&lt;/p&gt;
&lt;p&gt;This is achieved using &lt;code&gt;STS.assume_role()&lt;/code&gt;, where STS is the &lt;a href="https://docs.aws.amazon.com/STS/latest/APIReference/welcome.html"&gt;Security Token Service&lt;/a&gt;. I've been wanting to learn this API for quite a while now.&lt;/p&gt;
&lt;p&gt;Assume role comes with some limitations: tokens must live between 15 minutes and 12 hours, and you need to first create a role that you can assume. In creating those credentials you can define an additional policy document, which is how I scope down the token I'm creating to only allow a specific level of access to a specific S3 bucket.&lt;/p&gt;
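&lt;p&gt;A scoped-down policy document for a read-only grant against a single bucket might look roughly like this - the bucket name and action list here are illustrative:&lt;/p&gt;

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::my-example-bucket",
        "arn:aws:s3:::my-example-bucket/*"
      ]
    }
  ]
}
```

&lt;p&gt;Note the two &lt;code&gt;Resource&lt;/code&gt; ARNs: the bare bucket ARN covers listing, while the &lt;code&gt;/*&lt;/code&gt; variant covers the objects inside it.&lt;/p&gt;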
&lt;p&gt;I've learned a huge amount about AWS, IAM and S3 through developing this project. I think I'm finally overcoming my multi-year phobia of anything involving IAM!&lt;/p&gt;
&lt;h4&gt;Releases this week&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/sqlite-utils"&gt;sqlite-utils&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/sqlite-utils/releases/tag/3.18"&gt;3.18&lt;/a&gt; - (&lt;a href="https://github.com/simonw/sqlite-utils/releases"&gt;88 releases total&lt;/a&gt;) - 2021-11-15
&lt;br /&gt;Python CLI utility and library for manipulating SQLite databases&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette"&gt;datasette&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/datasette/releases/tag/0.59.2"&gt;0.59.2&lt;/a&gt; - (&lt;a href="https://github.com/simonw/datasette/releases"&gt;100 releases total&lt;/a&gt;) - 2021-11-14
&lt;br /&gt;An open source multi-tool for exploring and publishing data&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette-hello-world"&gt;datasette-hello-world&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/datasette-hello-world/releases/tag/0.1.1"&gt;0.1.1&lt;/a&gt; - (&lt;a href="https://github.com/simonw/datasette-hello-world/releases"&gt;2 releases total&lt;/a&gt;) - 2021-11-14
&lt;br /&gt;The hello world of Datasette plugins&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/git-history"&gt;git-history&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/git-history/releases/tag/0.3.1"&gt;0.3.1&lt;/a&gt; - (&lt;a href="https://github.com/simonw/git-history/releases"&gt;5 releases total&lt;/a&gt;) - 2021-11-12
&lt;br /&gt;Tools for analyzing Git history using SQLite&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/s3-credentials"&gt;s3-credentials&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/s3-credentials/releases/tag/0.5"&gt;0.5&lt;/a&gt; - (&lt;a href="https://github.com/simonw/s3-credentials/releases"&gt;5 releases total&lt;/a&gt;) - 2021-11-11
&lt;br /&gt;A tool for creating credentials for accessing S3 buckets&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;TIL this week&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/kubernetes/basic-datasette-in-kubernetes"&gt;Basic Datasette in Kubernetes&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/deno/annotated-deno-deploy-demo"&gt;Annotated code for a demo of WebSocket chat in Deno Deploy&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/javascript/tesseract-ocr-javascript"&gt;Using Tesseract.js to OCR every image on a page&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/aws"&gt;aws&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/s3"&gt;s3&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/my-talks"&gt;my-talks&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/teaching"&gt;teaching&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/weeknotes"&gt;weeknotes&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git-scraping"&gt;git-scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sqlite-utils"&gt;sqlite-utils&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git-history"&gt;git-history&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/s3-credentials"&gt;s3-credentials&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="aws"/><category term="projects"/><category term="s3"/><category term="my-talks"/><category term="teaching"/><category term="datasette"/><category term="weeknotes"/><category term="git-scraping"/><category term="sqlite-utils"/><category term="git-history"/><category term="s3-credentials"/></entry></feed>