Simon Willison's Weblog: dogsheep

apple-notes-to-sqlite

2023-03-09T06:04:28+00:00

With the help of ChatGPT I finally figured out just enough AppleScript to automate the export of my notes to a SQLite database. AppleScript is a notoriously read-only language, which it turns out makes it a killer app for LLM-assisted coding.

Via Using ChatGPT to write AppleScript

Tags: apple, projects, sqlite, ai, dogsheep, generative-ai, chatgpt, applescript

Google Drive to SQLite

2022-02-20T22:47:01+00:00

I released a new tool this week: google-drive-to-sqlite. It's a CLI utility for fetching metadata about files in your Google Drive and writing them to a local SQLite database.

It's pretty fun!

Here's how to create a SQLite database of every file you've starred in your Google Drive, including both files created in Google Docs/Sheets and files you've uploaded to your drive:

% pip install google-drive-to-sqlite
% google-drive-to-sqlite auth
Visit the following URL to authenticate with Google Drive

https://accounts.google.com/o/oauth2/v2/auth?access_type=offline&...

Then return here and paste in the resulting code:
Paste code here: 
# Authentication is now complete, so run:
% google-drive-to-sqlite files starred.db --starred
% ls -lah starred.db
-rw-r--r--@ 1 simon  staff    40K Feb 20 14:14 starred.db

The OAuth client ID it is using hasn't been verified by Google yet, which I think means that only the first 100 people to use it will be able to authenticate. If you need to you can work around that by creating your own client ID, as described in the README.

Having created that starred.db file you can explore the resulting database using Datasette or Datasette Desktop:

datasette starred.db

# or if you have the Datasette Desktop macOS app installed:
open starred.db

Here's Datasette running against one of my larger metadata collections:

Why build this?

I recently got involved with a participatory journalism project, where a team of reporters have used FOIA requests to gather a huge corpus of thousands of files. The files are in a complex folder hierarchy in a Google Drive. I wanted to start getting a feel for what's in there.

Pulling the metadata - file names, sizes, file types, file owners, creation dates - into a SQLite database felt like a great way to start understanding the size and scope of what had been collected so far.

Outside of that project, there's something very exciting to me about being able to use Google Drive to collate all kinds of different data and then tie it into the larger Datasette and Dogsheep ecosystems. I think there's a lot of potential here for all kinds of interesting projects.

How it works

The tool is written in Python using Click (based on my click-app template) and sqlite-utils. It works by calling the Google Drive API.

The auth command needs to get hold of an OAuth access token scoped to make read-only calls to the user's Google Drive contents.

This took a bit of figuring out. I wrote up what I learned in this TIL: Google OAuth for a CLI application

Notably, the end result of that flow is a JSON response containing both an access_token and a refresh_token.

The access token can be used to make authenticated API calls, but it expires after an hour and that expiration cannot be extended.

The refresh token lasts forever, and can be used at any time to obtain a fresh access token.

So the auth command writes the refresh token to a file called auth.json, then future calls to other commands use that token to retrieve a fresh access token on every run.
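Here's a minimal sketch of that refresh step in Python. It assumes a dictionary holding the refresh token and client credentials under the same key names the tool's auth file uses. The function names and the token endpoint URL are assumptions for illustration, not necessarily what the tool does internally.

```python
import json
import urllib.parse
import urllib.request

# Assumption: Google's standard OAuth 2.0 token endpoint.
TOKEN_URL = "https://oauth2.googleapis.com/token"


def build_refresh_payload(auth):
    """Form fields for exchanging a refresh token for a new access token."""
    return {
        "grant_type": "refresh_token",
        "refresh_token": auth["refresh_token"],
        "client_id": auth["google_client_id"],
        "client_secret": auth["google_client_secret"],
    }


def fetch_access_token(auth):
    """POST the payload and return the access_token from the JSON response."""
    data = urllib.parse.urlencode(build_refresh_payload(auth)).encode("utf-8")
    request = urllib.request.Request(TOKEN_URL, data=data, method="POST")
    with urllib.request.urlopen(request) as response:
        return json.load(response)["access_token"]
```

Splitting the payload builder from the network call keeps the part that can be checked offline separate from the part that needs real credentials.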

The most useful command is google-drive-to-sqlite files, which retrieves file metadata based on various criteria, then either writes that to a SQLite database or dumps it out as JSON or newline-delimited JSON. It does this by paginating through results from the Google Drive files list API.

The files --folder ID option is a special case. It retrieves every nested file and subfolder starting at the specified folder. The Google Drive API doesn't support this operation directly, so the tool instead has to recursively call directory listings on every folder until it has pulled back all of the data. See my TIL Recursively fetching metadata for all files in a Google Drive folder for more details.
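The traversal can be sketched roughly like this. list_children is a hypothetical callable standing in for one directory-listing API call. The sketch uses an explicit stack rather than recursion so that deep hierarchies can't hit Python's recursion limit. It is not the tool's actual implementation.

```python
def walk_folder(folder_id, list_children):
    """Yield every file and subfolder beneath folder_id.

    list_children is a hypothetical callable that returns the direct
    children of a folder as dicts with "id" and "mimeType" keys.
    """
    # MIME type that Google Drive uses to mark folders.
    folder_mime = "application/vnd.google-apps.folder"
    pending = [folder_id]
    while pending:
        current = pending.pop()
        for item in list_children(current):
            yield item
            if item["mimeType"] == folder_mime:
                # Queue the subfolder so its children are listed later.
                pending.append(item["id"])
```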

This operation took over an hour for the largest folder I tested it against! So long that the access token it was using expired and I had to implement code to refresh the token in the middle of the operation.
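One way to handle that is to wrap the token in a small object that fetches a replacement shortly before the old one expires. This is a sketch under assumptions: fetch_token is a hypothetical callable, and the one-hour lifetime and 60-second safety margin are illustrative values, not the tool's real code.

```python
import time


class AccessToken:
    """Caches an access token and replaces it shortly before it expires."""

    def __init__(self, fetch_token, lifetime=3600, margin=60, clock=time.monotonic):
        self._fetch_token = fetch_token  # hypothetical callable returning a new token
        self._lifetime = lifetime        # seconds a token is assumed to stay valid
        self._margin = margin            # refresh this many seconds early
        self._clock = clock
        self._token = None
        self._fetched_at = None

    def get(self):
        """Return a usable token, fetching a new one if the old one is stale."""
        now = self._clock()
        expired = (
            self._fetched_at is None
            or now - self._fetched_at >= self._lifetime - self._margin
        )
        if expired:
            self._token = self._fetch_token()
            self._fetched_at = now
        return self._token
```

A long-running loop would call get() before each API request instead of holding on to a single token string.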

Some other neat tricks

The download command downloads the specified file to disk:

google-drive-to-sqlite download \
  0B32uDVNZfiEKLUtIT1gzYWN2NDI4SzVQYTFWWWxCWUtvVGNB

It detects the file type and uses that as the extension - in the above example, it saves the file as 0B32uDVNZfiEKLUtIT1gzYWN2NDI4SzVQYTFWWWxCWUtvVGNB.pdf.

The export command only works against the file IDs for docs, sheets and presentations created using Google Apps. It can export to a variety of different formats:

google-drive-to-sqlite export html \
  10BOHGDUYa7lBjUSo26YFCHTpgEmtXabdVFaopCTh1vU

This writes to 10BOHGDUYa7lBjUSo26YFCHTpgEmtXabdVFaopCTh1vU-export.html.

The get command takes a URL to a Google Drive API endpoint and fetches it using a valid access token. This is a great tool for debugging and API exploration - my github-to-sqlite tool has this too.

google-drive-to-sqlite get 'https://www.googleapis.com/drive/v3/about?fields=*'

It also knows how to paginate! Adding --paginate files will cause it to fetch all of the subsequent pages of the API and return just the items from the "files" key combined into a single JSON array, for example:

google-drive-to-sqlite get \
  https://www.googleapis.com/drive/v3/files \
  --paginate files
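The pagination pattern looks roughly like this in Python. Google's list endpoints return a nextPageToken while more results remain. fetch_page is a hypothetical callable standing in for the authenticated HTTP request, so this is a sketch of the idea rather than the tool's code.

```python
def paginate(fetch_page, key):
    """Collect the items under `key` from every page of a list endpoint.

    fetch_page is a hypothetical callable: it takes a page token (or None
    for the first page) and returns the decoded JSON response as a dict.
    """
    items = []
    page_token = None
    while True:
        page = fetch_page(page_token)
        items.extend(page.get(key, []))
        page_token = page.get("nextPageToken")
        if not page_token:
            # No token means this was the last page.
            return items
```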

Exploring other APIs with the same tools

While I was building this, I realized that with just a little extra work the auth and get commands could be used to explore other Google APIs too.

If you are a developer, you can create your own OAuth credentials and enable access to other APIs using the Google Cloud console. You can then take the resulting client ID and secret, pick a scope and run the following:

google-drive-to-sqlite auth -a calendar-auth.json \
  --scope 'https://www.googleapis.com/auth/calendar.readonly' \
  --google-client-id '184325416553-nu5ci563v36rmj9opdl7mah786anbkrq.apps.googleusercontent.com' \
  --google-client-secret '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'

calendar-auth.json will now be a JSON file that looks something like this:

{
  "google-drive-to-sqlite": {
    "refresh_token": "1//...",
    "google_client_id": "184325416553-nu5ci563v36rmj9opdl7mah786anbkrq.apps.googleusercontent.com",
    "google_client_secret": "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",
    "scope": "https://www.googleapis.com/auth/calendar.readonly"
  }
}

You can now fetch your Google Calendar items by adding your email address to the following:

google-drive-to-sqlite get \
  https://www.googleapis.com/calendar/v3/calendars/...@gmail.com/events \
  --auth calendar-auth.json

This will output JSON to the console. For newline-delimited JSON, add --nl.

Since we can paginate with --paginate items, this means we can pipe the results to sqlite-utils insert and create a SQLite database of our calendar items!

google-drive-to-sqlite get \
  https://www.googleapis.com/calendar/v3/calendars/...@gmail.com/events \
  --auth calendar-auth.json \
  --paginate items --nl \
  | sqlite-utils insert calendar.db events \
    - --pk id --nl --alter --replace

Maybe google-drive-to-sqlite wasn't the right name for this after all!

What's next?

Google severely tightened their policies on apps that can access Google Drive a few years ago. I'm currently waiting to see if my app will make it through their verification process, see issue #15.

If it doesn't, the tool will still be usable, but users will have to jump through some extra hoops to set up their own client ID. I don't see this as a huge concern.

I've started thinking about ways to import additional data from the Google Drive APIs. I'm particularly interested in the idea of creating a full-text search index in SQLite based on plain text exports of documents created in Google Docs, see issue #28.

For other short-term future plans, take a look at the project's open issues.

Tags: google-docs, projects, datasette, dogsheep, weeknotes, sqlite-utils

Notes on Notes.app

2021-12-09T22:39:26+00:00


Apple's Notes app keeps its data in a SQLite database at ~/Library/Group\ Containers/group.com.apple.notes/NoteStore.sqlite - but it's pretty difficult to extract data from. It turns out the note text is stored as a gzipped protocol buffers object in the ZICNOTEDATA.ZDATA column. Steve Dunham did the hard work of figuring out how it all works - the complexity stems from Apple's use of CRDTs to support seamless multiple edits from different devices.
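Here's a minimal sketch of the first step, undoing the gzip layer. It stops at the raw protocol buffers bytes and does not attempt to decode them. The Z_PK primary key column is an assumption based on the usual Core Data naming convention, and it is safest to run this against a copy of the database.

```python
import gzip
import sqlite3


def read_note_blobs(db_path):
    """Yield (primary key, decompressed bytes) for each stored note body."""
    conn = sqlite3.connect(db_path)
    try:
        # Assumption: Z_PK is the table's primary key column.
        rows = conn.execute(
            "SELECT Z_PK, ZDATA FROM ZICNOTEDATA WHERE ZDATA IS NOT NULL"
        )
        for pk, blob in rows:
            # The result is a serialized protocol buffers message, not text.
            yield pk, gzip.decompress(blob)
    finally:
        conn.close()
```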

Tags: apple, sqlite, crdt, dogsheep

Weeknotes: Getting my personal Dogsheep up and running again

2021-08-22T04:46:13+00:00

I gave a talk about Dogsheep at Noisebridge's Five Minutes of Fame on Thursday. Just one problem: my regular Dogsheep demo was broken, so I ended up building it from scratch again. In doing so I fixed a few bugs in some Dogsheep tools.

Dogsheep on a Digital Ocean droplet

The latest iteration of my personal Dogsheep runs on a $20/month 4GB/2CPU Digital Ocean Droplet running Ubuntu 20.04 LTS.

It runs a private Datasette instance and a bunch of cron jobs to fetch data from Twitter, GitHub, Foursquare Swarm, Pocket and Hacker News.

It also has copies of my Apple Photos and Apple HealthKit data which I upload manually - plus a copy of my genome for good measure.

Some abbreviated notes on how I set it up, copied from a private GitHub Issues thread:

Create a new Ubuntu droplet, and configure its IP address as the A record for dogsheep.simonwillison.net
Install Python 3 and NGINX and SQLite: apt-get install python3 python3-venv nginx sqlite -y
Use letsencrypt to get an HTTPS certificate for it: apt-get update and then apt install certbot python3-certbot-nginx -y, then certbot --nginx -d dogsheep.simonwillison.net
I had to remove the ipv6only=on; bit from the NGINX configuration due to this bug
Created a dogsheep user, useradd -s /bin/bash -d /home/dogsheep/ -m -G
As that user, created a virtual environment: python3 -mvenv datasette-venv and then datasette-venv/bin/pip install wheel and datasette-venv/bin/pip install datasette datasette-auth-passwords
Created a /etc/systemd/system/datasette.service file with these contents

Created a set of blank SQLite database files in WAL mode in /home/dogsheep using the following:

for f in beta.db twitter.db healthkit.db github.db \
  swarm.db photos.db genome.db simonwillisonblog.db \
  pocket.db hacker-news.db memories.db
do
  sqlite3 $f vacuum
  # And enable WAL mode:
  sqlite3 $f 'PRAGMA journal_mode=WAL;'
done

Started the Datasette service: service datasette start
Configured NGINX to proxy to localhost port 8001, using this configuration
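The step above that creates blank WAL-mode databases can also be done from Python with the standard library. This is a sketch that mirrors the shell loop, and the function name is made up for illustration.

```python
import sqlite3


def create_wal_database(path):
    """Create (or open) a SQLite file and switch it to WAL journaling."""
    conn = sqlite3.connect(path)
    try:
        # VACUUM forces the empty database file to be written to disk.
        conn.execute("VACUUM")
        # The pragma returns the journal mode now in effect.
        mode = conn.execute("PRAGMA journal_mode=WAL;").fetchone()[0]
    finally:
        conn.close()
    return mode
```

The journal mode is stored in the database file itself, so it only needs setting once per file.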

It's a few more steps than I'd like, but the end result was a password-protected Datasette instance running against a bunch of SQLite database files on my new server.

With Datasette up and running, the next step was to start loading in data.

Importing my tweets

I started with Twitter. I dropped my Twitter API access credentials into an auth.json file (as described here) and ran the following:

source /home/dogsheep/datasette-venv/bin/activate
pip install twitter-to-sqlite
twitter-to-sqlite user-timeline /home/dogsheep/twitter.db \
  -a /home/dogsheep/auth.json
@simonw [###############################-----] 26299/29684 00:02:06

That pulled in all 29,684 of my personal tweets.

(Actually, first it broke with an error, exposing a bug that had already been reported. I shipped a fix for that and tried again and it worked.)

Favourited tweets were a little harder - I have 39,904 favourited tweets, but the Twitter API only returns the most recent 3,200. I grabbed those more recent ones with:

twitter-to-sqlite favorites /home/dogsheep/twitter.db \
  -a /home/dogsheep/auth.json

Then I requested my Twitter archive, waited 24 hours and uploaded the resulting like.js file to the server, then ran:

twitter-to-sqlite import twitter.db /tmp/like.js

This gave me an archive_like table with the data from that file - but it wasn't the full tweet representation, just the subset that Twitter expose in the archive export.

The README shows how to inflate those into full tweets:

twitter-to-sqlite statuses-lookup twitter.db \
  --sql='select tweetId from archive_like' \
  --skip-existing
Importing 33,382 tweets [------------------------------------] 0% 00:18:28

Once that was done I wrote additional records into the favorited_by table like so:

sqlite3 twitter.db '
  INSERT OR IGNORE INTO favorited_by (tweet, user)
  SELECT tweetId, 12497 FROM archive_like
'

(12497 is my Twitter user ID.)
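INSERT OR IGNORE matters here because some of those favourites were already recorded by the earlier favorites command. Here's a small self-contained demonstration. The schemas are simplified assumptions for illustration, in particular the compound primary key on favorited_by.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Assumed, simplified schemas for illustration only.
    CREATE TABLE archive_like (tweetId INTEGER);
    CREATE TABLE favorited_by (
        tweet INTEGER,
        user INTEGER,
        PRIMARY KEY (tweet, user)
    );
    INSERT INTO archive_like VALUES (1), (2), (3);
    INSERT INTO favorited_by VALUES (2, 12497);  -- already fetched via the API
    """
)
# Rows that would violate the primary key are skipped instead of raising.
conn.execute(
    "INSERT OR IGNORE INTO favorited_by (tweet, user) "
    "SELECT tweetId, 12497 FROM archive_like"
)
rows = conn.execute("SELECT tweet, user FROM favorited_by ORDER BY tweet").fetchall()
```

After this runs, favorited_by holds one row per tweet, with no duplicate for tweet 2.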

I also came up with a SQL view that lets me see just media attached to tweets:

sqlite-utils create-view twitter.db media_details "
select
  json_object('img_src', media_url_https, 'width', 400) as img,
  tweets.full_text,
  tweets.created_at,
  tweets.id as tweet_id,
  users.screen_name,
  'https://twitter.com/' || users.screen_name || '/status/' || tweets.id as tweet_url
from
  media
  join media_tweets on media.id = media_tweets.media_id
  join tweets on media_tweets.tweets_id = tweets.id
  join users on tweets.user = users.id
order by
  tweets.id desc
"

Now I can visit /twitter/media_details?_where=tweet_id+in+(select+tweet+from+favorited_by+where+user+=+12497) to see the most recent media tweets that I've favourited!
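A URL like that can be built programmatically by URL-encoding the SQL fragment passed to _where. Here's a short sketch using only the standard library, with the path taken from the example above.

```python
from urllib.parse import urlencode

where = "tweet_id in (select tweet from favorited_by where user = 12497)"
# urlencode turns spaces into "+" and percent-encodes the parentheses and "=".
path = "/twitter/media_details?" + urlencode({"_where": where})
```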

Swarm checkins

Swarm checkins were a lot easier. I needed my previously-created Foursquare API token, and swarm-to-sqlite:

pip install swarm-to-sqlite
swarm-to-sqlite /home/dogsheep/swarm.db --token=...

This gave me a full table of my Swarm checkins, which I can visualize using datasette-cluster-map:

Apple HealthKit

I don't yet have full automation for my Apple HealthKit data (collected by my Apple Watch) or my Apple Photos - both require me to run scripts on my laptop to create the SQLite database file and then copy the result to the server via scp.

healthkit-to-sqlite runs against the export.zip that is produced by the Apple Health app on the iPhone's export data button - for me that was a 158MB zip file which I AirDropped to my laptop and converted (after fixing a new bug) like so:

healthkit-to-sqlite ~/Downloads/export.zip healthkit.db
Importing from HealthKit  [-----------------------------]    2%  00:02:25

I uploaded the resulting 1.5GB healthkit.db file and now I can do things like visualize my 2017 San Francisco Half Marathon run on a map:

Apple Photos

For my photos I use dogsheep-photos, which I described last year in Using SQL to find my best photo of a pelican according to Apple Photos. The short version: I run this script on my laptop:

# Upload original photos to my S3 bucket
dogsheep-photos upload photos.db \
  ~/Pictures/Photos\ Library.photoslibrary/originals
dogsheep-photos apple-photos photos.db \
  --image-url-prefix "https://photos.simonwillison.net/i/" \
  --image-url-suffix "?w=600"
scp photos.db dogsheep:/home/dogsheep/photos.db

photos.db is only 171MB - it contains the metadata, including the machine learning labels, but not the photos themselves.

And now I can run queries for things like photos of food I've taken in 2021:

Automation via cron

I'm still working through the last step, which involves setting up cron tasks to refresh my data periodically from various sources. My crontab currently looks like this:

# Twitter
1,11,21,31,41,51 * * * * /home/dogsheep/datasette-venv/bin/twitter-to-sqlite user-timeline /home/dogsheep/twitter.db -a /home/dogsheep/auth.json --since
4,14,24,34,44,54 * * * * run-one /home/dogsheep/datasette-venv/bin/twitter-to-sqlite mentions-timeline /home/dogsheep/twitter.db -a /home/dogsheep/auth.json --since
11 * * * * run-one /home/dogsheep/datasette-venv/bin/twitter-to-sqlite user-timeline /home/dogsheep/twitter.db cleopaws -a /home/dogsheep/auth.json --since
6,16,26,36,46,56 * * * * run-one /home/dogsheep/datasette-venv/bin/twitter-to-sqlite favorites /home/dogsheep/twitter.db -a /home/dogsheep/auth.json --stop_after=50

# Swarm
25 */2 * * * /home/dogsheep/datasette-venv/bin/swarm-to-sqlite /home/dogsheep/swarm.db --token=... --since=2w

# Hacker News data every six hours
35 0,6,12,18 * * * /home/dogsheep/datasette-venv/bin/hacker-news-to-sqlite user /home/dogsheep/hacker-news.db simonw

# Re-build dogsheep-beta search index once an hour
32 * * * * /home/dogsheep/datasette-venv/bin/dogsheep-beta index /home/dogsheep/beta.db /home/dogsheep/dogsheep-beta.yml

I'll be expanding this out as I configure more of the Dogsheep tools for my personal instance.

TIL this week

Releases this week

healthkit-to-sqlite: 1.0.1 - (9 releases total) - 2021-08-20
Convert an Apple Healthkit export zip to a SQLite database
twitter-to-sqlite: 0.21.4 - (27 releases total) - 2021-08-20
Save data from Twitter to a SQLite database
datasette-block-robots: 1.0 - (5 releases total) - 2021-08-19
Datasette plugin that blocks robots and crawlers using robots.txt
sqlite-utils: 3.16 - (85 releases total) - 2021-08-18
Python CLI utility and library for manipulating SQLite databases
datasette-debug-asgi: 1.1 - (3 releases total) - 2021-08-17
Datasette plugin for dumping out the ASGI scope

Tags: linux, nginx, datasette, dogsheep, weeknotes

Datasette - an ecosystem of tools for working with small data

2021-07-22T18:13:40+00:00

This is the transcript and video from a talk I gave at PyGotham 2020 about using SQLite, Datasette and Dogsheep to work with small data.

I really like the term "small data", as a counterpoint to big data. I define small data as anything that fits on a mobile phone, and since my iPhone has half a terabyte of storage these days that ends up covering a lot of ground!

I think SQLite is the ideal tool for working with data that fits this definition.

My PyGotham talk demonstrates how to find SQLite databases that already exist on your laptop, how to use Datasette to explore them, and then expands to cover Datasette as a tool for publishing data online and my Dogsheep family of tools for personal analytics.

Here's the talk video on YouTube. I've included a full transcript below with additional links and embedded code samples.

Transcript and notes

I'm going to spend the next 20 minutes trying to convince you that almost every data problem you have should be solved using SQLite.

This is also an excuse for me to promote an open source project I've been working on for the last three years, called Datasette.

Datasette is a tool for exploring and publishing data that's stored in SQLite databases. So the more people I can convince to use SQLite to solve things the better!

So let's talk about SQLite, which claims to be the most widely deployed and used database in the world, and I think that claim holds up.

You may not have realised it, but you're using SQLite every single day. It's in phones and laptops. It's in embedded devices. I have an Apple Watch, so I'm wearing SQLite databases on my wrist right now.

Let's take a look on my laptop and see what we can find.

So this right here is a magic incantation that on a Mac will search for the largest SQLite databases across the entire system.

mdfind "kMDItemDisplayName == *.sqlite" -0 | \
  xargs -0 stat "-f%z %N" | sort -nr | head -n 20

I'm gonna run this right now to get my top 20. And here we are. Evernote is using SQLite, Firefox uses it, but the one at the top, this one is 857 megabytes for a SQLite file. It's something to do with Apple Photos, so let's see what's in there.
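The mdfind incantation relies on the macOS Spotlight index. On other systems a similar list can be sketched in Python by walking a directory tree. This is a slower, hypothetical equivalent, not what was run in the talk.

```python
import os


def largest_sqlite_files(root, limit=20):
    """Return up to `limit` (size_in_bytes, path) pairs, largest first."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(".sqlite"):
                path = os.path.join(dirpath, name)
                try:
                    found.append((os.path.getsize(path), path))
                except OSError:
                    # Skip files that vanish or cannot be read.
                    continue
    return sorted(found, reverse=True)[:limit]
```

Calling largest_sqlite_files(os.path.expanduser("~")) would scan a home directory, which can take a while on a large disk.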

Datasette can be installed as a command-line application, you can run it against the path to a SQLite database, and it starts up a little local web server, which you can then use to explore that data.

So there are 67 tables in this Apple Photos database, with detected face prints and face crops and albums and all sorts of things.

But the one that I happen to know is super interesting is this one here. It's called the Z Cloud Master Media Meta Data. It has 44,000 rows in it and I happen to know I've taken about 44,000 photographs - so it looks like there's some interesting stuff here.

This is kind of garbled, but there's a clue. The binary column here starts with BPlist, and I happen to know that Apple have a format called a binary plist, which is probably what this is.

So I'm gonna install a plugin for Datasette called datasette-bplist that knows what to do with that data. This is being installed from PyPI.

datasette install datasette-bplist

And now I can start Datasette up again. Hit refresh on this page. And now that Z Data has been decoded into something that's a little bit more readable.
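Outside of Datasette, the same kind of decoding can be sketched with Python's standard plistlib module. The helper name is made up for illustration, and this is not necessarily how the plugin itself is implemented.

```python
import plistlib


def decode_bplist(blob):
    """Decode a binary property list, or return None if it isn't one."""
    # Binary plists start with the magic bytes "bplist".
    if not blob.startswith(b"bplist"):
        return None
    return plistlib.loads(blob)
```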

And straightaway we can see that OK, this is classic photo metadata. We've got the make and model of the device. We've got EXIF data. The really exciting stuff is we've got GPS data. So this is a database that includes the latitude and longitude of every photograph I've ever taken, and we're absolutely going to dig into that and see if we can do some interesting stuff with it later on.

So SQLite is clearly a pretty great database - if Apple are using it this extensively it must have some good things going on for it. But then why don't we use it for web applications?

Well, the classic answer to that is that the one thing SQLite is not so good at is accepting concurrent writes. If you've got a high traffic web site with lots of people writing to your database, you're better off with something like MySQL or Postgres.

But I realised that I have a whole bunch of use cases that don't require any writes at all. Any time I want to publish data online that isn't going to change, SQLite is actually a really good fit for that.

So that was the original inspiration for Datasette. I was thinking about how if I want to publish data online, is this a way I can do it that's really cheap and flexible and interesting?

So that's what I've been doing. I'll show you a few examples.

This is Covid-19 case data, which is published online right now by Johns Hopkins University, The New York Times, the L.A. Times and The Economist.

https://covid-19.datasettes.com/

They're all publishing their data as CSV files in GitHub repositories, which is pretty great because you get a commit history of changes to that CSV data. But it's not necessarily something you can dive straight into and start using.

That's where Datasette comes in. This is a Datasette instance which is automatically updated every day with the latest data from these different publications and I can drill through and see things like this New York Times data for example.

The New York Times publishes county level data of cases and deaths from Covid-19 and they've got over half a million rows in this spreadsheet. And so when we look at this data, it's pretty unexciting in a sort of table form. We've got the date, the county, the state, the FIPS code, which is a unique identifier for each county, the number of cases and the number of deaths.

But Datasette kicks in and does some really useful things for us.

Firstly, Datasette has a concept of faceting where it looks for columns in the database which have a small number of unique values - like the State column for example, which only has 50 values in it, and it makes those available for navigation at the top along with these summary counts. So I can look down at this and say, Oh, look, Tennessee. Let's drill into just the rows in this table that mention Tennessee. And then Tennessee has 16,000 rows of which each county has a whole bunch. So let's look at Rutherford County in Tennessee. That's 181 rows. And now that we've drilled down these two levels, we've got back a table that is just showing us the number of cases and number of deaths ordered by date.

So each of these rows says on the sixth of September, there have been 80 deaths reported and 8,100 cases.
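The summary counts that faceting shows are, in essence, a GROUP BY query. Here's a rough sketch of the idea using Python's sqlite3 module. The function is hypothetical and much simpler than whatever Datasette does internally.

```python
import sqlite3


def facet_counts(conn, table, column, limit=30):
    """Return (value, count) pairs for a column, most common first."""
    # Identifiers are quoted; this sketch assumes trusted table/column names.
    sql = (
        'SELECT "{col}", COUNT(*) AS n FROM "{table}" '
        'GROUP BY "{col}" ORDER BY n DESC, "{col}" LIMIT ?'
    ).format(col=column, table=table)
    return conn.execute(sql, (limit,)).fetchall()
```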

You saw a Datasette plugin earlier when I installed the bplist plugin. Another plugin I wrote is called Datasette Vega. It gives you simple charting options, so we can say I want a bar chart that plots the date on the X axis against the number of cases on the Y axis. And now we've got a chart of cases over time in this particular county in the state of Tennessee.

A couple of other things you can do: You can export the data back out again so I could get this data as JSON if I want to load this programmatically into something, and get it out as CSV if I want to load it into something like Microsoft Excel.

I can actually export it as YAML because there's an experimental plugin I've been building that does a YAML export. And then, more importantly, you can view and edit the SQL query that was used for this page.

Now this is one of the interesting things about publishing read-only data: because the data is read only and it's opened in a special read only mode, people executing SQL queries can't cause any damage to that database. So SQL injection, which in most web applications is a serious security vulnerability, for us becomes a feature.
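SQLite's read-only mode can be tried from Python by opening the file with a mode=ro URI. This is a sketch of the general mechanism. The helper name is made up, and Datasette's own connection handling may differ in its details.

```python
import pathlib
import sqlite3


def open_read_only(path):
    """Open an existing SQLite file so that writes are rejected."""
    # as_uri() produces a correctly escaped file: URI for the absolute path.
    uri = pathlib.Path(path).resolve().as_uri() + "?mode=ro"
    return sqlite3.connect(uri, uri=True)
```

Any INSERT, UPDATE or DELETE on a connection opened this way raises sqlite3.OperationalError, while SELECT queries work as normal.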

We could say I just want back the data on the cases and the deaths columns - I'm gonna hit "run SQL" and now I'm getting back just that data that I requested, and I could export that data back out again as JSON or CSV. So this becomes a really powerful way of remixing and slicing and dicing data and then getting it back out again in a format that I can use somewhere else.

So that's all well and good. But the obvious question is, how do you get this data into a SQLite file in the first place? So I'm gonna show you a demo using one of my favourite pieces of data of all time. This is the results of the 2018 New York Central Park squirrel census, where a bunch of volunteers went out and tried to answer the crucial question: "How many squirrels are there in Central Park?".

And this is published as a CSV file, it's got 3,023 rows. Each row is a squirrel. And we can export that out and get back CSV. So I've got the file here. I'm gonna turn it into a SQLite database.

I have a tool I wrote called csvs-to-sqlite, which takes a CSV file - or multiple CSV files - and turns them into, or loads them into SQLite. So here we go, I've run that command and now I have squirrels.db, which is a SQLite database.

csvs-to-sqlite 2018_Central_Park_Squirrel_Census_-_Squirrel_Data.csv squirrels.db

If I run sqlite3 squirrels.db .schema, it'll show me the schema - and it is indeed a table with columns for squirrels with unique squirrel IDs and which hectare they're in and all of that kind of stuff.

datasette squirrels.db

And so now that we've done that, I can run datasette squirrels.db and start up Datasette and I can see that data in my browser. And here we go here, here are 3,000 rows of squirrel data.

Faceting is kind of fun: you can see things like primary fur colour, showing up as options there. And, actually, if we facet by that, we can say you know what? For the 392 cinnamon squirrels, let's see what the highlight fur colours are for those. Show me the 10 squirrels that are black and have cinnamon hair.

Another thing that stands out about this data is that there are these two columns here, X and Y, which look suspiciously like they might be latitudes and longitudes. We can do something really interesting with that.

I'm gonna load up the SQLite command-line tool for squirrels.db, and I'm gonna rename the X column to longitude.

sqlite3 squirrels.db
alter table "2018_Central_Park_Squirrel_Census_-_Squirrel_Data"
rename column X to longitude;
alter table "2018_Central_Park_Squirrel_Census_-_Squirrel_Data"
rename column Y to latitude;

So "alter table" - um, quite a long table name this one - gonna alter this table here. Rename column X to longitude. And I'm gonna rename column Y to latitude. And now I'm going to check my datasette plugins and make sure I've got a plugin installed called datasette-cluster-map.

There it is. So now if I run Datasette against squirrels again and refresh this page, the plugin here notices the latitude and longitude columns and it draws them on the map. So this is a map of everywhere that they saw a cinnamon squirrel with black highlight fur.

And if I close off these filters you can see that here's a map of all 3,000 squirrels that were spotted in Central Park. And I can zoom in and explore the data that way.

That's pretty cool. Wouldn't it be cool if we could share this data on the Internet somehow?

So we've got a SQLite database. And this is where another key feature of Datasette comes in, which is the datasette publish command.

Datasette Publish lets you publish databases online, using Google Cloud Run or Heroku or Vercel. I'm going to use Vercel right now, so if I type datasette publish vercel squirrels.db - I need to give it a project name. I'm going to call it pygotham-squirrels. And I know that I want to install that datasette-cluster-map plugin as well.

datasette publish vercel squirrels.db \
  --project=pygotham-squirrels \
  --install=datasette-cluster-map

So I run this command, and it packages up that database with the Datasette application itself, uploads it and starts running a build on Vercel. It gives me a URL here. I can watch it doing its thing - so it's installing runtimes, it's installing those required dependencies, that additional plugin. And normally, this only takes about 20 seconds to run. And at the end of that 20 seconds, I will have a Web application that is running online with that data.

https://pygotham-squirrels.vercel.app/

So anyone can visit this URL right now, click on that map and they get that Datasette of squirrels with the map plugin, with the ability to facet by primary fur colour. It's all up there. It's running online and it's ready for other people to visit.

I designed this initially for newspapers based on my experience working at the Guardian newspaper, where we wanted to publish some of the data behind our stories, but as a way of productively getting structured data onto the web in a way you can start using it, I think this is really, really exciting as a capability.

And remember there are JSON APIs for this. So we have just published an API online for squirrel data that anyone who can write JavaScript or Python can start using to build their own squirrel maps.

Another thing that I've been doing with this is figuring out how to turn these into much more custom experiences.

Datasette allows you to provide custom templates, it lets you provide custom CSS, which means you can actually use it to build entire websites powered under the hood by Datasette.

One of the first I built was this one here. This is called niche-museums.com - it's a website for my hobby of trying to track down weird and niche museums around the world. I can click "Use my location" and see just the weird museums near me.

There's one just around the corner from our house called The Comic Book Rockstars Toilet Seat Museum, where a local comic book store has been collecting toilet seats illustrated by famous comic artists. And if you're ever in San Francisco, you should absolutely pop by Hayes Valley and check this place out.

But this right here is just a heavily styled Datasette instance. If you go to /browse, it will give you access to the data. You can see that each museum is a row in a SQLite database table. I've got that plugin, so I get a free map showing you where everything is. Then the actual museum pages are just templated versions of the underlying Datasette pages.

I really like this as a way of publishing websites because it's gives you all of the benefits of static publishing where you it scales magically by starting up new instances, you don't have to worry about security holes and stuff because it's all read only data, but because there's a relational database under it, you can build features like search - so I can search for Bigfoot and get back the Bigfoot Discovery Museum.

Or you can build things like this little "use my location" button at the top of the screen. But again, this is all built using Datasette and Datasette plugins. I have a plugin that knows how to turn a SQL query into an Atom feed so that people can subscribe to new updates to the website as well.

But to finish, I want to talk about a totally different use-case for all of this stuff, and that's personal analytics.

This is an idea that I had had a year and a half ago, inspired by an essay written by Stephen Wolfram. Stephen wrote this absolutely fascinating essay about his approach to personal productivity, and the main feature of this essay is the scroll bar down the side. This thing just goes on and on and on, and he talks about how he had a standing desk, but then his heart rate monitor showed him that he gets better health benefits from walking around outside. So he built himself a little like popcorn tray for his laptop. He scanned every document he's ever written from the age of 11. He's got all of this amazing infrastructure set up for being productive, most of which I feel is way over the top for how I want to live my life.

He has a green screen setup in his basement so he can give talks from home.

But there was one thing in here that kind of caught my eye, and that's this idea that he has of a metasearcher. It's a personal search engine that he built that can search every e-mail, every file, every paper he's written and all of the people he knows for any search term.

And I saw this and I thought, you know, that's something I'd really like. I would love to have a personal search engine for all of my stuff.

But if I'm gonna build that and it's inspired by Stephen Wolfram, I need to come up with a decent name for it.

And so since it is inspired by Wolf Ram, but it's not really aiming to be as good or comprehensive as the thing that he built, I decided to call it Dog Sheep. So I have a project called Dogsheep. I love that pun so much I committed myself to actually building the software.

And so Dogsheep is a collection of tools for personal analytics: tools for pulling in data about yourself from lots of different sources, turning those into SQLite databases so that you can start working with them and then you can load them up in Datasette and start doing interesting things against them.

I'll give you a demonstration of my personal Dogsheep instance, where all of the data from these tools ends up.

This is my personal Dogsheep. It has data from a whole bunch of different places. I've got data from Twitter and HealthKit - my Apple Watch tracks my health over time. GitHub data, Foursquare Swarm, I pulled in data from LinkedIn and Goodreads and Hacker News and all of these different places, and I can do all sorts of fun things with this.

So I'll start by showing you a couple of things from Twitter. I index all of my tweets, anytime somebody mentions me, and all of the tweets that I've favorited. Because I've indexed my favourited tweets I can run searches against them.

So I can search for PyGotham and see all of the tweets that I've favorited about PyGotham.

Twitter does not offer this feature themselves and having this feature suddenly makes Twitter favourites so, so much more useful.

I can search my followers so if I decide I want to have a conversation with a VC, I can type in VC and see anyone who follows me on Twitter who mentions VC in their bio, and hence is somebody that I might be able to contact via direct messages.

But a really useful example is that my dog, Cleo, has a Twitter account. And every time she goes to the vet, she tweets a selfie and with her selfie she tweets how much she weighs. So she'll tweet "I weigh 42.5 pounds. I grew a little bit more dog". Because she tweets this and I've got this coming into Dogsheep I now have these in a SQL database - so I can construct a SQL query that looks for tweets that mentioned her weight. And I can use a regular expression to extract that weight value out as the thing before the LB. And then I can use my charting plugin to plot that as a chart. So I now have a chart of Cleo's weight over time, as reported by her posting selfies at the vet. It's like having the kind of data warehouse you get at a giant corporation, but just against your own data, and costing about $5 a month in hosting fees.
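The weight query described above can be sketched with Python's built-in sqlite3 module. This is a minimal illustration under assumptions: the tweets table, its column names, the extract_weight() helper and all but one of the sample rows are invented here, and the regular expression looks for a number followed by "pounds" or "lb" rather than reproducing the exact pattern used in the demo.

```python
import re
import sqlite3


def extract_weight(text):
    """Return the number that precedes 'pounds' or 'lb' in a tweet, or None."""
    match = re.search(r"(\d+(?:\.\d+)?)\s*(?:pounds|lbs?)\b", text, re.IGNORECASE)
    return float(match.group(1)) if match else None


db = sqlite3.connect(":memory:")
# Expose the Python helper to SQL as a one-argument function
db.create_function("extract_weight", 1, extract_weight)

# Hypothetical schema and sample rows, for illustration only
db.execute("create table tweets (created_at text, full_text text)")
db.executemany(
    "insert into tweets values (?, ?)",
    [
        ("2019-06-01", "I weigh 40.1 pounds. Good girl!"),
        ("2020-01-15", "I weigh 42.5 pounds. I grew a little bit more dog"),
        ("2020-02-01", "Look at my pirate costume"),
    ],
)

# Keep only tweets that mention a weight, ordered so they can be charted
weights = db.execute(
    """
    select created_at, extract_weight(full_text) as weight
    from tweets
    where extract_weight(full_text) is not null
    order by created_at
    """
).fetchall()
print(weights)
```

The resulting (date, weight) pairs are the shape of data a charting plugin would need for a weight-over-time line.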

So that's tweets. But there was so much more stuff in here. I'm running dozens and dozens of projects on GitHub, and I pull all of the issues, issue comments, commits and releases into my one database.

So here I have a table of all 7,000 commits that I've made on GitHub across all of these different projects, so I can search these in one place, I can evaluate them, I can compare commits across different projects and I can also do charting. So here's a graph that's constructed by a custom SQL query, which pulls out just the date of each commit, groups by them and sticks on a count. So these are my commits over time as a chart.
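A query of the kind described here, pulling out the date of each commit, grouping by it and counting, might look roughly like the following. The commits table, the author_date column and the three sample rows are assumptions made for the sake of a runnable sketch.

```python
import sqlite3

db = sqlite3.connect(":memory:")
# Hypothetical, simplified commits table with made-up rows
db.execute("create table commits (sha text primary key, author_date text)")
db.executemany(
    "insert into commits values (?, ?)",
    [
        ("a1", "2020-11-01T09:15:00"),
        ("b2", "2020-11-01T17:40:00"),
        ("c3", "2020-11-03T12:00:00"),
    ],
)

# date() trims the timestamp to YYYY-MM-DD so commits group by day
commits_per_day = db.execute(
    """
    select date(author_date) as day, count(*) as commits
    from commits
    group by day
    order by day
    """
).fetchall()
print(commits_per_day)
```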

A key philosophy of Datasette is everything should have a URL. So once I've constructed this query and put it on a chart, I can bookmark that page right there and that becomes, essentially, the application encoded in a URL that I can bookmark and link to and visit later on.

I've got a copy of my genome. I did 23AndMe a few years ago, and I found out recently that they have an export button and you can export your genome out as a CSV file. So I did that. I now have a copy of the 600,000 rows of my genome that 23AndMe keep track of. Which means I can use SQL queries to analyse my own genome. I have a query here that tells me what colour my eyes are based on running this query against my genome - apparently my eyes are blue 99% of the time because I have a GG genotype in this particular location on my genome.

This took an entire weekend to figure out. I went to Science Hack Day and borrowed a geneticist so I could see if I could figure out some homebrew genetic explorations.

I mentioned the Apple Watch earlier. Apple record an incredible amount of data about me and they don't upload it to the cloud - an interesting thing about Apple is they keep that stuff on your device. So if you dig around in the HealthKit app on the iPhone, there's an export button that will give you that data back out again. And I wrote code to turn that into SQLite and there was an enormous quantity of data in here. I've got standing time and basal energy burned, my body mass over time reported by my Withings scales, headphone audio exposure is something that showed up last year, the number of flights of steps I climbed, all sorts of stuff.

But the most interesting one is workouts because every time you track a workout using your Apple Watch, it goes in the database. And if it's an outdoor workout - where you're, say, going for a run - it records your GPS location every few seconds. I ran the Bay To Breakers race in San Francisco a couple of years ago and because of that, I've got a database table with my location tracked route. It's 3,800 points, and I can plot that on a map and see the exact, finely grained route I took through San Francisco when I ran Bay To Breakers just over three years ago.

It's kind of amazing to me that this stuff is just sat there on our devices waiting for us to liberate it and start using it to build these visualisations and learn interesting things about ourselves.

I also mentioned earlier on Apple Photos. Apple Photos databases are absolutely fascinating because it turns out Apple run machine learning models on your phone to figure out what you took photographs of. And this means I can do things like run a SQL query that shows me all of the photographs that I've taken of a pelican.

My personal Dogsheep right here shows me these pictures - these are Apple Photos I took where Apple itself has labelled them as pelican.

And even more fun than that is Apple calculates scores for your photos. These are hidden away in the SQLite database. And they have names like Z Overall Aesthetic Score or Z Harmonious Colour Score or Z Pleasant Camera Tilt Score. So this query here isn't just showing me photographs I've taken of pelicans - it's sorting them by the most aesthetically pleasing according to Apple's classification algorithm that they're running on my device.

I can show you the Pelican photo with the most pleasant camera tilt - there's so much fun stuff you can do digging into all of this.

So I've been building this set of Dogsheep tools for about a year now, but the thing I always wanted to do was to tie them all together. I wanted to build that personal search engine.

And a couple of weeks ago, I finally got it working.

I've called it Dogsheep Beta because Stephen Wolfram's search engine is called Wolfram Alpha. So I figured Dogsheep Beta was definitely the appropriate name for it.

And so this right here is Dogsheep Beta, and it lets me run searches across my tweets, my photos, my GitHub commits, issue comments, releases, comments I made on Hacker News - all in one place.

And so if I search for example, for "cleopaws costume", I will see all of the tweets by my dog, whose Twitter handle is @Cleopaws where she's tweeting about her costumes. I think she makes a particularly fetching pirate.

Or again, I can search for PyGotham and I get back not just tweets, but also issue comments from my personal issues where I've been tracking the fact that I need to submit a talk.

Again, this is a custom search interface, but really all this is under the hood is SQLite - it's a SQLite database with a table that acts as a search index and I'm running queries using SQLite's built-in full text search mechanism.

If any of this has caught your interest, all of this is aggressively open source. So if you go to dogsheep.github.io this is an index page for all of the Dogsheep project. Click through to the GitHub Datasette repository and I've got very extensive documentation covering Datasette, what it can do, the plugins available and how to write plugins of your own.

And really, my goal with this project is I want to build a growing ecosystem of not just Datasette plugins, but tools for getting data from all of these different sources and loading them into SQLite databases so that we can join them together and use them to find out interesting things about the world and about our own lives.

If this has piqued your interest, please drop by the project. We have a GitHub discussions board which you're welcome to join.

[I hadn't launched the full website when I gave this talk, but today I'd encourage you to explore datasette.io]

Tags: sqlite, my-talks, datasette, dogsheep, small-data, annotated-talks

How to secure an Ubuntu server using Tailscale and UFW

2021-02-26T20:31:52+00:00

This is the Tailscale tutorial I’ve always wanted: it explains in detail how you can run an Ubuntu server (from any cloud provider) such that only devices on your personal Tailscale network can access it.

Tags: security, ubuntu, dogsheep, tailscale

Building a search engine for datasette.io

2020-12-19T18:12:31+00:00

This week I added a search engine to datasette.io, using the search indexing tool I've been building for Dogsheep.

Project search for Datasette

The Datasette project has a lot of constituent parts. There's the project itself and its documentation - 171 pages when exported to PDF and counting. Then there are the 48 plugins, sqlite-utils and 21 more tools for creating SQLite databases, the Dogsheep collection and over three years of content I've written about the project on my blog.

The new datasette.io search engine provides a faceted search interface to all of this material in one place. It currently searches across:

Every section of the latest documentation (415 total)
48 plugin READMEs
22 tool READMEs
63 news items posted on the Datasette website
212 items from my blog
Release notes from 557 package releases

I plan to extend it with more data sources in the future.

How it works: Dogsheep Beta

I'm reusing the search engine I originally built for my Dogsheep personal analytics project (see Personal Data Warehouses: Reclaiming Your Data). I call that search engine Dogsheep Beta. The name is a pun.

SQLite has great full-text search built in, and I make extensive use of that in Datasette projects already. But out of the box it's not quite right for this kind of search engine that spans multiple different content types.

The problem is relevance calculation. I wrote about this in Exploring search relevance algorithms with SQLite - short version: query relevance is calculated using statistics against the whole corpus, so search terms that occur rarely in the overall corpus contribute a higher score than more common terms.

This means that full-text ranking scores calculated against one table of data cannot be meaningfully compared to scores calculated independently against a separate table, as the corpus statistics used to calculate the rank will differ.

To get usable scores, you need everything in a single table. That's what Dogsheep Beta does: it creates a new table, called search_index, and copies searchable content from the other tables into that new table.
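As a rough sketch of that idea, the following copies rows from two source tables into one full-text table so that every result is ranked against the same corpus. It assumes SQLite was compiled with FTS5, and the source tables, their rows and the direct use of an FTS5 virtual table named search_index are simplifications for illustration, not Dogsheep Beta's actual schema.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript(
    """
    -- Two hypothetical source tables holding different content types
    create table releases (id integer primary key, title text, body text);
    create table blog_entries (id integer primary key, title text, body text);
    insert into releases values (1, 'datasette 0.52', 'Renamed config to settings');
    insert into blog_entries values (1, 'Weeknotes', 'Notes on search and settings');
    insert into blog_entries values (2, 'Pelicans', 'Photos of pelicans');

    -- One shared index: type and key identify the source row and are not searched
    create virtual table search_index using fts5(
        type unindexed, key unindexed, title, search_1
    );
    insert into search_index
        select 'releases', id, title, body from releases;
    insert into search_index
        select 'blog', id, title, body from blog_entries;
    """
)

# A single query ranks matches from both content types together
results = db.execute(
    """
    select type, title
    from search_index
    where search_index match ?
    order by rank
    """,
    ("settings",),
).fetchall()
print(results)
```

Because both content types live in one table, the rank values come from the same term statistics and can be sorted against each other.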

This is analogous to how an external search index like Elasticsearch works: you store your data in the main database, then periodically update an index in Elasticsearch. It's the denormalized query engine design pattern in action.

Configuring Dogsheep Beta

There are two components to Dogsheep Beta: a command-line tool for building a search index, and a Datasette plugin for providing an interface for running searches.

Both of these run off a YAML configuration file, which defines the tables that should be indexed and also defines how those search results should be displayed.

(Having one configuration file handle both indexing and display feels a little inelegant, but it's extremely productive for iterating on so I'm letting that slide.)

Here's the full Dogsheep configuration for datasette.io. An annotated extract:

# Index material in the content.db SQLite file
content.db:
  # Define a search type called 'releases'
  releases:
    # Populate that search type by executing this SQL
    sql: |-
      select
        releases.id as key,
        repos.name || ' ' || releases.tag_name as title,
        releases.published_at as timestamp,
        releases.body as search_1,
        1 as is_public
      from
        releases
        join repos on releases.repo = repos.id
    # When displaying a search result, use this SQL to
    # return extra details about the item
    display_sql: |-
      select
        -- highlight() is a custom SQL function
        highlight(render_markdown(releases.body), :q) as snippet,
        html_url
      from releases where id = :key
    # Jinja template fragment to display the result
    display: |-
      <h3>Release: <a href="{{ display.html_url }}">{{ title }}</a></h3>
      <p>{{ display.snippet|safe }}</p>
      <p><small>Released {{ timestamp }}</small></p>

The core pattern here is the sql: key, which defines a SQL query that must return the following columns:

key - a unique identifier for this search item
title - a title for this indexed document
timestamp - a timestamp for when it was created. May be null.
search_1 - text to be searched. I may add support for search_2 and search_3 later on to store text that will be treated with a lower relevance score.
is_public - should this be considered "public" data. This is a holdover from Dogsheep Beta's application for personal analytics, I don't actually need it for datasette.io.
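One way to check that a candidate sql: query satisfies this contract is to inspect the column names it returns. This is a hypothetical helper, not part of Dogsheep Beta; the news table and the missing_columns() name are made up for the example.

```python
import sqlite3

# The five columns listed above
REQUIRED_COLUMNS = {"key", "title", "timestamp", "search_1", "is_public"}


def missing_columns(db, sql):
    """Return the required column names that the query does not produce."""
    cursor = db.execute(sql)
    # cursor.description is populated even when the query returns no rows
    returned = {column[0] for column in cursor.description}
    return REQUIRED_COLUMNS - returned


db = sqlite3.connect(":memory:")
db.execute(
    "create table news (id integer primary key, headline text, body text, date text)"
)

good_sql = """
    select id as key, headline as title, date as timestamp,
           body as search_1, 1 as is_public
    from news
"""
bad_sql = "select id as key, headline as title from news"

print(missing_columns(db, good_sql))
print(missing_columns(db, bad_sql))
```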

To create an index, run the following:

dogsheep-beta index dogsheep-index.db dogsheep-config.yml

The index command will loop through every configured search type in the YAML file, execute the SQL query and use it to populate a search_index table in the dogsheep-index.db SQLite database file.

Here's the search_index table for datasette.io.

When you run a search, the plugin queries that table and gets back results sorted by relevance (or other sort criteria, if specified).

To display the results, it loops through each one and uses the Jinja template fragment from the configuration file to turn it into HTML.

If a display_sql: query is defined, that query will be executed for each result to populate the {{ display }} object made available to the template. Many Small Queries Are Efficient In SQLite.
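The per-result pattern described above can be sketched as follows. To stay within the standard library this uses str.format() as a stand-in for the Jinja template fragment, and the releases table, the example.com URLs, the query and the render_results() helper are all assumptions rather than the plugin's real code.

```python
import html
import sqlite3

db = sqlite3.connect(":memory:")
db.row_factory = sqlite3.Row
db.executescript(
    """
    create table releases (id integer primary key, body text, html_url text);
    insert into releases values (1, 'Renamed config to setting', 'https://example.com/releases/1');
    insert into releases values (2, 'Bug fixes', 'https://example.com/releases/2');
    """
)

# Hypothetical stand-ins for the display_sql: and display: configuration keys
DISPLAY_SQL = "select body as snippet, html_url from releases where id = :key"
TEMPLATE = '<h3>Release: <a href="{html_url}">{title}</a></h3><p>{snippet}</p>'


def render_results(db, results):
    """Run one small display query per search result and render it to HTML."""
    rendered = []
    for result in results:
        display = db.execute(DISPLAY_SQL, {"key": result["key"]}).fetchone()
        rendered.append(
            TEMPLATE.format(
                html_url=html.escape(display["html_url"]),
                title=html.escape(result["title"]),
                snippet=html.escape(display["snippet"]),
            )
        )
    return rendered


search_results = [
    {"key": 1, "title": "datasette 0.52"},
    {"key": 2, "title": "datasette 0.52.1"},
]
rendered = render_results(db, search_results)
print("\n".join(rendered))
```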

Search term highlighting

I spent a bit of time thinking about search highlighting. SQLite has an implementation of highlighting built in - the snippet() function - but it's not designed to be HTML-aware so there's a risk it might mangle HTML by adding highlighting marks in the middle of a tag or attribute.

I ended up borrowing a BSD licensed highlighting class from the django-haystack project. It deals with HTML by stripping tags, which seems to be more-or-less what Google do for their own search results so I figured that's good enough for me.

I used this one-off site plugin to wrap the highlighting code in a custom SQLite function. This meant I could call it from the display_sql: query in the Dogsheep Beta YAML configuration.
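Registering a Python function as a custom SQLite function looks roughly like this with the standard sqlite3 module. The highlight() below is a deliberately simplified stand-in written for this sketch: it strips tags and wraps matching terms in bold, and it is not the django-haystack class or the site plugin's actual code.

```python
import html
import re
import sqlite3


def highlight(text, query):
    """Strip HTML tags, escape what remains and wrap each query term in <b>."""
    plain = html.escape(re.sub(r"<[^>]+>", "", text or ""))
    terms = [re.escape(html.escape(term)) for term in (query or "").split()]
    if not terms:
        return plain
    pattern = re.compile("(" + "|".join(terms) + ")", re.IGNORECASE)
    return pattern.sub(r"<b>\1</b>", plain)


db = sqlite3.connect(":memory:")
# After this call, SQL queries on this connection can use highlight(text, query)
db.create_function("highlight", 2, highlight)

snippet = db.execute(
    "select highlight(?, ?)",
    ('<p>New <a href="/settings">settings</a> docs</p>', "settings"),
).fetchone()[0]
print(snippet)
```

Stripping tags before inserting the highlight marks avoids the risk of marking up text inside a tag or attribute.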

A custom template tag would be more elegant, but I don't yet have a mechanism to expose custom template tags in the Dogsheep Beta rendering mechanism.

Build, index, deploy

The Datasette website implements the Baked Data pattern, where the content is compiled into SQLite database files and bundled with the application code itself as part of the deploy.

Building the index is just another step of that process.

Here's the deploy.yml GitHub workflow used by the site. It roughly does the following:

Downloads the current version of the content.db database file. This is so it doesn't have to re-fetch release and README content that was previously stored there.
Downloads the current version of blog.db, with entries from my blog. This means I don't have to fetch all entries, just the new ones.
Runs build_directory.py, the script which fetches data for the plugins and tools pages.
- This hits the GitHub GraphQL API to find new repositories tagged datasette-io and datasette-plugin and datasette-tool.
- That GraphQL query also returns the most recent release. The script then checks to see if those releases have previously been fetched and, if not, uses github-to-sqlite to fetch them.
Imports the data from news.yaml into a news table using yaml-to-sqlite
Imports the latest PyPI download statistics for my packages from my simonw/package-stats repository, which implements git scraping against the most excellent pypistats.org.
Runs the dogsheep-beta index command to build a dogsheep-index.db search index.
Runs some soundness checks, e.g. datasette . --get "/plugins", to verify that Datasette is likely to at least return 200 results for some critical pages once published.
Uses datasette publish cloudrun to deploy the results to Google Cloud Run, which hosts the website.

I love building websites this way. You can have as much complexity as you like in the build script (my TIL website build script generates screenshots using Puppeteer) but the end result is some simple database files running on inexpensive, immutable, scalable hosting.

Tags: projects, search, sqlite, datasette, dogsheep, weeknotes, cloudrun, baked-data

The case against client certificates

2020-12-09T14:41:46+00:00

Colm MacCárthaigh provides a passionately argued Twitter thread about client certificates and why they should be avoided. I tried using them as an extra layer of protection for my personal Dogsheep server and ended up abandoning them - certificate management across my devices was too fiddly.

Via Thomas Ptacek

Tags: certificates, dogsheep

Weeknotes: github-to-sqlite workflows, datasette-ripgrep enhancements, Datasette 0.52

2020-12-06T05:46:11+00:00

This week: Improvements to datasette-ripgrep, github-to-sqlite and datasette-graphql, plus Datasette 0.52 and a flurry of dot-releases.

datasette-ripgrep 0.5 and 0.6

datasette-ripgrep (introduced last week) landed on Hacker News, and the comments there inspired me to build a few new features. The interface looks like this now:

I added options for literal searches (as opposed to a regex match) and ignoring case, and a field that lets you filter to just a specific file pattern, for example test*.py.

These are already features of ripgrep so adding them was a case of hooking up the interface and using it to modify the command-line arguments passed to the underlying tool.

github-to-sqlite workflows

github-to-sqlite is my command-line tool for importing data from the GitHub API into a SQLite database, for analysis with Datasette.

I released github-to-sqlite 2.8 this week with two new commands: github-to-sqlite pull-requests, contributed by Adam Jonas, and github-to-sqlite workflows.

The new workflows command can be run against one or more repositories and will fetch their GitHub Actions workflow YAML files, parse them and use them to populate new database tables called workflows, jobs and steps.

If you run workflows across a bunch of different repositories this means you can analyze your workflow usage using SQL!

My github-to-sqlite demo now includes workflows from my core Datasette and Dogsheep projects. Some example queries:

My most commonly used action steps - the top two are actions/checkout@v2 and actions/cache@v2.
All steps using actions/cache@v1, which need to be upgraded to v2 (this link will likely soon stop returning any results as I apply those updates).
My workflows that use the workflow_dispatch trigger.
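To give a feel for this kind of analysis, here is a self-contained sketch that loads already-parsed workflow data into workflows, jobs and steps tables and counts the most commonly used action steps. The column names, the sample workflows and the loading loop are assumptions for illustration; the real tool fetches and parses the YAML itself and its schema may differ.

```python
import sqlite3

# Hypothetical workflow files, shown as the dictionaries a YAML parser would return
parsed_workflows = {
    "repo-a/.github/workflows/test.yml": {
        "name": "Test",
        "jobs": {
            "test": {
                "steps": [
                    {"uses": "actions/checkout@v2"},
                    {"uses": "actions/cache@v2"},
                    {"run": "pytest"},
                ]
            }
        },
    },
    "repo-b/.github/workflows/publish.yml": {
        "name": "Publish",
        "jobs": {
            "deploy": {
                "steps": [
                    {"uses": "actions/checkout@v2"},
                    {"run": "python setup.py sdist"},
                ]
            }
        },
    },
}

db = sqlite3.connect(":memory:")
db.executescript(
    """
    create table workflows (id integer primary key, path text, name text);
    create table jobs (
        id integer primary key, workflow integer references workflows(id), name text
    );
    create table steps (
        id integer primary key, job integer references jobs(id),
        seq integer, uses text, run text
    );
    """
)

for path, workflow in parsed_workflows.items():
    workflow_id = db.execute(
        "insert into workflows (path, name) values (?, ?)", (path, workflow["name"])
    ).lastrowid
    for job_name, job in workflow["jobs"].items():
        job_id = db.execute(
            "insert into jobs (workflow, name) values (?, ?)", (workflow_id, job_name)
        ).lastrowid
        for seq, step in enumerate(job["steps"], start=1):
            db.execute(
                "insert into steps (job, seq, uses, run) values (?, ?, ?, ?)",
                (job_id, seq, step.get("uses"), step.get("run")),
            )

# Most commonly used action steps across every workflow
most_used = db.execute(
    """
    select uses, count(*) as n
    from steps
    where uses is not null
    group by uses
    order by n desc, uses
    """
).fetchall()
print(most_used)
```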

The implementation is a good example of my sqlite-utils library in action - I pass the extracted YAML data straight to the .insert(data, alter=True) method which creates the correct table schema automatically, altering it if there are any missing columns.

datasette-graphql 1.3

datasette-graphql 1.3 has one tiny feature which I find enormously satisfying.

The plugin provides a GraphQL interface to any table in Datasette. The latest versions use the new "table actions" menu (accessible through a cog icon in the page heading) to provide a link to an example query for that table.

I added the example queries in 1.2, but in 1.3 the example has been expanded to include examples of foreign key references. For a table like this one of GitHub commits the example query now looks like this:

{
  commits {
    totalCount
    pageInfo {
      hasNextPage
      endCursor
    }
    nodes {
      sha
      message
      author_date
      committer_date
      raw_author {
        id
        name
      }
      raw_committer {
        id
        name
      }
      repo {
        id
        name
      }
      author {
        id
        login
      }
      committer {
        id
        login
      }
    }
  }
}

The new foreign key references include the repo, author and committer fields. The example query now demonstrates the most interesting feature of datasette-graphql - its ability to automatically convert foreign key relationships in your database into nested GraphQL fields.
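The underlying idea, following foreign keys to turn a flat row into a nested structure, can be illustrated without GraphQL at all. This sketch uses SQLite's foreign_key_list pragma on a tiny made-up schema; nested_rows() is a hypothetical helper and does not reflect how datasette-graphql is implemented.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.row_factory = sqlite3.Row
db.executescript(
    """
    create table repos (id integer primary key, name text);
    create table commits (
        sha text primary key,
        message text,
        repo integer references repos(id)
    );
    insert into repos values (1, 'datasette');
    insert into commits values ('abc123', 'Fix typo', 1);
    """
)


def nested_rows(db, table):
    """Return rows as dicts, replacing each foreign key value with the referenced row."""
    # Map each foreign key column to the (table, column) it points at
    foreign_keys = {
        fk["from"]: (fk["table"], fk["to"])
        for fk in db.execute(f"pragma foreign_key_list([{table}])")
    }
    rows = []
    for row in db.execute(f"select * from [{table}]"):
        record = dict(row)
        for column, (other_table, other_column) in foreign_keys.items():
            related = db.execute(
                f"select * from [{other_table}] where [{other_column}] = ?",
                (record[column],),
            ).fetchone()
            record[column] = dict(related) if related else None
        rows.append(record)
    return rows


print(nested_rows(db, "commits"))
```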

It's a small change, but it makes me really happy.

Datasette 0.52

I shipped Datasette 0.52, a relatively minor release which mainly kicked off an effort to rebrand "configuration" as "settings".

I'm doing this as part of my effort to fix Datasette's "metadata" concept. The metadata.json file started out as a way to add metadata - title, description, license and source information. Over time the file expanded to cover things like default facet displays and sort orders... and then when plugins came along it grew to cover plugin configuration as well.

This is really confusing. Editing metadata.json to configure a plugin doesn't make a great deal of sense.

For Datasette 1.0 I want to clean this up. I'm planning on splitting metadata and configuration into separate mechanisms.

There's just one problem: Datasette already has a "configuration" concept in the form of the --config command-line option which can be used to set some very fundamental options for the Datasette server - things like the SQL time limit and the maximum allowed CSV download size.

I want to call plugin configuration settings "configuration", so I've renamed --config to --setting - see the new settings documentation for details.

This also gave me the chance to clean up a weird design decision. Datasette's configuration options looked like this:

datasette data.db --config sql_time_limit_ms:1000

The new --setting replacement instead looks like this:

datasette data.db --setting sql_time_limit_ms 1000

Note the lack of a colon here - having an option take two arguments is a perfectly cromulent way of using Click, but it's one I wasn't aware of when I first released Datasette.
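For readers who haven't seen an option that consumes two arguments, here is the same shape expressed with the standard library's argparse rather than Click, so the sketch runs without extra dependencies. The datasette-sketch program name is made up and this is not Datasette's actual command-line definition.

```python
import argparse

parser = argparse.ArgumentParser(prog="datasette-sketch")
parser.add_argument("database")
parser.add_argument(
    "--setting",
    nargs=2,  # each use of the option consumes a name and a value
    metavar=("NAME", "VALUE"),
    action="append",  # allow the option to be repeated
    default=[],
    help="Set a named setting, e.g. --setting sql_time_limit_ms 1000",
)

args = parser.parse_args(["data.db", "--setting", "sql_time_limit_ms", "1000"])
# Each occurrence arrives as a [name, value] pair
settings = dict(args.setting)
print(settings)
```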

The old --config mechanism continues to work, but it now displays a deprecation warning - it will be removed in Datasette 1.0.

Datasette dot-releases

0.52 has already had more dot-releases than any other version of Datasette. These are all pure bug fixes, mostly for obscure bugs that are unlikely to have affected anyone. To summarize the release notes:

0.52.1 updated the testing plugins documentation to promote datasette.client, fixed a bug with the display of compound foreign keys and improved the locations searched by the datasette --load-module=spatialite shortcut.
0.52.2 fixed support for the generated columns feature added in SQLite 3.31.0, fixed a 500 error on OPTIONS requests, added support for >32MB database file downloads on Cloud Run and shipped a CSS fix to the cog menus contributed by Abdussamet Koçak.
0.52.3 fixed a fun bug with Datasette installed on Amazon Linux running on ARM where static assets would 404. I eventually tracked that down to an unexpected symlink in the site-packages directory.
0.52.4 now writes errors logged by Datasette to stderr, not stdout. It also fixes a startup error on Windows, another contribution from Abdussamet Koçak.

Broken Dogsheep

My personal Dogsheep broke this week. I've been running it on an Amazon Lightsail instance, and this week I learned that Lightsail has a baseline CPU mechanism which grants your instance burst capacity but shuts it down if it exceeds that capacity too often!

So I'm moving it to a DigitalOcean droplet which won't do that, and trying to figure out enough Ansible to completely automate the process.

My ideal server is one that is configured entirely from files in source control, and updates itself by pulling new configuration from that repository. I plan to use ansible-pull for this, once I've put together the necessary playbooks.

Tags: projects, graphql, datasette, dogsheep, weeknotes, ripgrep

Personal Data Warehouses: Reclaiming Your Data

2020-11-14T03:53:55+00:00

I gave a talk yesterday about personal data warehouses for GitHub's OCTO Speaker Series, focusing on my Datasette and Dogsheep projects. The video of the talk is now available, and I'm presenting that here along with an annotated summary of the talk, including links to demos and further information.

There's a short technical glitch with the screen sharing in the first couple of minutes of the talk - I've added screenshots to the notes which show what you would have seen if my screen had been correctly shared.

I'm going to be talking about personal data warehouses, what they are, why you want one, how to build them and some of the interesting things you can do once you've set one up.

I'm going to start with a demo.

This is my dog, Cleo - when she won first place in a dog costume competition here, dressed as the Golden Gate Bridge!

So the question I want to answer is: How much of a San Francisco hipster is Cleo?

I can answer it using my personal data warehouse.

I have a database of ten years' worth of my checkins on Foursquare Swarm - generated using my swarm-to-sqlite tool. Every time I check in somewhere with Cleo I use the Wolf emoji in the checkin message.

I can filter for just checkins where the checkin message includes the wolf emoji.

Which means I can see just her checkins - all 280 of them.

If I facet by venue category, I can see she's checked in at 57 parks, 32 dog runs, 19 coffee shops and 12 organic groceries.

Then I can facet by venue category and filter down to just her 19 checkins at coffee shops.

Turns out she's a Blue Bottle girl at heart.

Being able to build a map of the coffee shops that your dog likes is obviously a very valuable reason to build your own personal data warehouse.

Let's take a step back and talk about how this demo works.

The key to this demo is this web application I'm running called Datasette. I've been working on this project for three years now, and the goal is to make it as easy and cheap as possible to explore data in all sorts of shapes and sizes.

Ten years ago I was working for the Guardian newspaper in London. One of the things I realized when I joined the organization is that newspapers collect enormous amounts of data. Any time they publish a chart or map in the newspaper someone has to collect the underlying information.

There was a journalist there called Simon Rogers who was a wizard at collecting any data you could think to ask for. He knew exactly where to get it from, and had collected a huge number of brilliant spreadsheets on his desktop computer.

We decided we wanted to publish the data behind the stories. We started something called the Data Blog, and aimed to accompany our stories with the raw data behind them.

We ended up using Google Sheets to publish the data. It worked, but I always felt like there should be a better way to publish this kind of structured data in a way that was as useful and flexible as possible for our audience.

Fast forward to 2017, when I was looking into this new thing called "serverless" hosting - in particular one called Zeit Now, which has since rebranded as Vercel.

My favourite aspect of Serverless is "Scale to zero" - the idea that you only pay for hosting when your project is receiving traffic.

If you're like me, and you love building side-projects but you don't like paying $5/month for them for the rest of your life, this is perfect.

The catch is that serverless providers tend to charge you extra for databases, or require you to buy a hosted database from another provider.

But what if your database doesn't change? Can you bundle your database in the same container as your code?

This was the initial inspiration behind creating Datasette.

Here's another demo. The World Resources Institute maintain a CSV file of every power plant in the world.

Like many groups, they publish that data on GitHub.

I have a script that grabs their most recent data and publishes it using Datasette.

Here's the contents of their CSV file published using Datasette

Datasette supports plugins. You've already seen this plugin in my demo of Cleo's coffee shops - it's called datasette-cluster-map and it works by looking for tables with a latitude and longitude column and plotting the data on a map.
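The detection step, finding tables that have both a latitude and a longitude column, can be approximated in a few lines. This is an illustrative sketch with a made-up tables_with_coordinates() helper and schema, not the plugin's code.

```python
import sqlite3


def tables_with_coordinates(db):
    """Return the names of tables that have both latitude and longitude columns."""
    matches = []
    tables = db.execute(
        "select name from sqlite_master where type = 'table' order by name"
    ).fetchall()
    for (name,) in tables:
        # The second field of each table_info row is the column name
        columns = {row[1].lower() for row in db.execute(f"pragma table_info([{name}])")}
        if {"latitude", "longitude"} <= columns:
            matches.append(name)
    return matches


db = sqlite3.connect(":memory:")
db.execute("create table power_plants (name text, latitude real, longitude real)")
db.execute("create table fuel_types (name text)")
print(tables_with_coordinates(db))
```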

Straight away looking at this data you notice that there's a couple of power plants down here in Antarctica. This is McMurdo station, and it has a 6.6MW oil generator.

And oh look, there's a wind farm down there too on Ross Island knocking out 1MW of electricity.

But this is also a demonstration of faceting. I can slice down to just the nuclear power plants in France and see those on a map.
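Conceptually, a facet is a count of rows for each distinct value of a column, optionally within a filter. The sketch below shows that with plain SQL; the table, the placeholder plant names and the facet_counts() helper are invented for the example and say nothing about how Datasette computes facets internally.

```python
import sqlite3

db = sqlite3.connect(":memory:")
# Made-up rows standing in for the power plant data
db.execute("create table power_plants (name text, country text, primary_fuel text)")
db.executemany(
    "insert into power_plants values (?, ?, ?)",
    [
        ("Plant A", "France", "Nuclear"),
        ("Plant B", "France", "Nuclear"),
        ("Plant C", "France", "Wind"),
        ("Plant D", "Antarctica", "Oil"),
    ],
)


def facet_counts(db, column, where="1 = 1", params=()):
    """Count rows for each distinct value of a column, most common first."""
    sql = (
        f"select [{column}], count(*) as n from power_plants "
        f"where {where} group by [{column}] order by n desc, [{column}]"
    )
    return db.execute(sql, params).fetchall()


by_country = facet_counts(db, "country")
# Slice down to one country, then facet by fuel within that slice
french_fuels = facet_counts(db, "primary_fuel", "country = ?", ("France",))
print(by_country, french_fuels)
```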

And anything I can see in the interface, I can get out as JSON. Here's a JSON file showing all of those nuclear power plants in France.

And here's a CSV export which I can use to pull the data into Excel or other CSV-compatible software.

I can click "view and edit SQL" to get back the SQL query that was used to generate the page - and I can edit and re-execute that query.

I can get those custom results back as CSV or JSON as well!

In most web applications this would be seen as a terrifying security hole - it's a SQL injection attack, as a documented feature!

A couple of reasons this isn't a problem here:

Firstly, this is set up as a read-only database: INSERT and UPDATE statements that would modify it are not allowed. There's a one second time limit on queries as well.

Secondly, everything in this database is designed to be published. There are no password hashes or private user data that could be exposed here.

This also means we have a JSON API that lets JavaScript execute SQL queries against a backend! This turns out to be really useful for rapid prototyping.
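A small sketch of what calling that API can look like: the SQL goes in a sql= query string parameter on the database's .json URL, and _shape=array asks for a plain array of objects. The host, database and table names here are placeholders, not a real deployment.

```python
from urllib.parse import urlencode


def sql_api_url(base_url, database, sql, shape="array"):
    """Build a Datasette JSON API URL that runs a SQL query."""
    query = urlencode({"sql": sql, "_shape": shape})
    return f"{base_url}/{database}.json?{query}"


# example.com and the database/table names are placeholders
url = sql_api_url(
    "https://example.com",
    "power-plants",
    "select name from plants where country = 'France'",
)
print(url)
```

The resulting URL can be passed to fetch() in JavaScript or to any HTTP client.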

It's worth talking about the secret sauce that makes this all possible.

This is all built on top of SQLite. Everyone watching this talk uses SQLite every day, even if you don't know it.

Most iPhone apps use SQLite, many desktop apps do, it's even running inside my Apple Watch.

One of my favourite features is that a SQLite database is a single file on disk. This makes it easy to copy and send around, and it also means I can bundle data up in that single file, include it in a Docker image and deploy it to serverless hosts to serve it on the internet.

Here's another demo that helps show how GitHub fits into all of this.

Last year PG&E - the power company that covers much of California - turned off the power to large swathes of the state.

I got lucky: six months earlier I had started scraping their outage map and recording the history to a GitHub repository.

simonw/pge-outages is a git repository with 34,000 commits tracking the history of outages that PG&E had published on their outage map.

You can see that two minutes ago they added 35 new outages.

I'm using this data to publish a Datasette instance with details of their historic outages. Here's a page showing their current outages ordered by the most customers affected by the outage.

Read Tracking PG&E outages by scraping to a git repo for more details on this project.

I recently decided to give this technique a name. I'm calling it Git scraping - the idea is to take any data source on the web that represents a point-in-time and commit it to a git repository that tells the story of the history of that particular thing.

Here's my article describing the pattern in more detail: Git scraping: track changes over time by scraping to a Git repository.
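The core loop of the pattern can be sketched in a few lines of Python: save the latest snapshot, and commit only when it differs from the previous one. The function names and file name below are hypothetical, and commit_snapshot() assumes it is pointed at an existing git checkout with a configured user.

```python
import subprocess
import tempfile
from pathlib import Path


def save_if_changed(path, content):
    """Write content to path, returning True only if it differs from what is on disk."""
    path = Path(path)
    if path.exists() and path.read_bytes() == content:
        return False
    path.write_bytes(content)
    return True


def commit_snapshot(repo_dir, filename, content, message="Latest data"):
    """Save a scraped snapshot and commit it if anything changed.

    Assumes repo_dir is an existing git checkout with user.name/email configured.
    """
    if not save_if_changed(Path(repo_dir) / filename, content):
        return False
    subprocess.run(["git", "add", filename], cwd=repo_dir, check=True)
    subprocess.run(["git", "commit", "-m", message], cwd=repo_dir, check=True)
    return True


# Demonstrate the change detection without touching git or the network
demo_dir = Path(tempfile.mkdtemp())
first = save_if_changed(demo_dir / "outages.json", b'{"outages": 1}')
repeat = save_if_changed(demo_dir / "outages.json", b'{"outages": 1}')
changed = save_if_changed(demo_dir / "outages.json", b'{"outages": 2}')
print(first, repeat, changed)
```

Run on a schedule, each commit then records one point in the history of the data source.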

This technique really stood out just last week during the US election.

This is the New York Times election scraper website, built by Alex Gaynor and a growing team of contributors. It scrapes the New York Times election results and uses the data over time to show how the results are trending.

It uses a GitHub Actions script that runs on a schedule, plus a really clever Python script that turns it into a useful web page.

You can find more examples of Git scraping under the git-scraping topic on GitHub.

I'm going to do a bit of live coding to show you how this stuff works.

This is the incidents page from the state of California CAL FIRE website.

Any time I see a map like this, my first instinct is to open up the browser developer tools and try to figure out how it works.

I open the network tab, refresh the page and then filter to just XHR requests.

A neat trick is to order by size - because inevitably the thing at the top of the list is the most interesting data on the page.

This appears to be a JSON file telling me about all of the current fires in the state of California!

(I set up a Git scraper for this a while ago.)

Now I'm going to take this a step further and turn it into a Datasette instance.

It looks like the AllYearIncidents key is the most interesting bit here.

I'm going to use curl to fetch that data, then pipe it through jq to filter for just that AllYearIncidents array.

curl 'https://www.fire.ca.gov/umbraco/Api/IncidentApi/GetIncidents' \
        | jq .AllYearIncidents

Now I have a list of incidents for this year.

Next I'm going to pipe it into a tool I've been building called sqlite-utils - it's a suite of tools for manipulating SQLite databases.

I'm going to use the "insert" command and insert the data into a ca-fires.db in an incidents table.

curl 'https://www.fire.ca.gov/umbraco/Api/IncidentApi/GetIncidents' \
        | jq .AllYearIncidents \
        | sqlite-utils insert ca-fires.db incidents -
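For a sense of what that insert step involves, here is a rough standard-library equivalent: create a table whose columns are the keys found in the JSON objects, then insert one row per object. This is a simplified stand-in, not how sqlite-utils is implemented, and the sample incidents and their field names are made up.

```python
import json
import sqlite3


def insert_records(conn, table, records):
    """Create a table from the keys of the records (if needed) and insert them."""
    columns = []
    for record in records:
        for key in record:
            if key not in columns:
                columns.append(key)
    column_sql = ", ".join(f'"{c}"' for c in columns)
    conn.execute(f'create table if not exists "{table}" ({column_sql})')
    placeholders = ", ".join("?" for _ in columns)
    conn.executemany(
        f'insert into "{table}" ({column_sql}) values ({placeholders})',
        [[record.get(c) for c in columns] for record in records],
    )
    conn.commit()


# Made-up incidents standing in for the JSON array produced by jq
incidents = json.loads(
    '[{"Name": "Example Fire", "County": "Riverside", "AcresBurned": 120},'
    ' {"Name": "Sample Fire", "County": "Napa", "AcresBurned": 45}]'
)
conn = sqlite3.connect(":memory:")
insert_records(conn, "incidents", incidents)
print(conn.execute("select Name, County from incidents").fetchall())
```

This sketch only handles flat objects with simple values; nested lists or objects would need to be serialized first.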

Now I've got a ca-fires.db file. I can open that in Datasette:

datasette ca-fires.db -o

And here it is - a brand new database.

You can straight away see that one of the rows has a bad location, hence it appears in Antarctica.

But 258 of them look like they are in the right place.

I can also facet by county, to see which county had the most fires in 2020 - Riverside had 21.

I'm going to take this a step further and put it on the internet, using a command called datasette publish.

Datasette publish supports a number of different hosting providers. I'm going to use Vercel.

I'm going to tell it to publish that database to a project called "ca-fires" - and tell it to install the datasette-cluster-map plugin.

datasette publish vercel ca-fires.db \
        --project ca-fires \
        --install datasette-cluster-map

This then takes that database file, bundles it up with the Datasette application and deploys it to Vercel.

Vercel gives me a URL where I can watch the progress of the deploy.

The goal here is to have as few steps as possible between finding some interesting data, turning it into a SQLite database you can use with Datasette and then publishing it online.

And this here is that database I just created - available for anyone on the internet to visit and build against.

https://ca-fires.vercel.app/ca-fires/incidents

I've given you a whistle-stop tour of Datasette for the purposes of publishing data, and hopefully doing some serious data journalism.

So what does this all have to do with personal data warehouses?

Last year, I read this essay by Stephen Wolfram: Seeking the Productive Life: Some Details of My Personal Infrastructure. It's an incredible exploration of forty years of productivity hacks that Stephen Wolfram has applied to become the CEO of a 1,000 person company that works remotely. He's optimized every aspect of his professional and personal life.

It's a lot.

But there was one part of this that really caught my eye. He talks about a thing he calls a "metasearcher" - a search engine on his personal homepage that searches his emails, journals, files, everything he's ever done - all in one place.

And I thought to myself, I really want THAT. I love this idea of a personal portal to my own stuff.

And because it was inspired by Stephen Wolfram, but I was planning on building a much less impressive version, I decided to call it Dogsheep.

Wolf, ram. Dog, sheep.

I've been building this over the past year.

So essentially this is my personal data warehouse. It pulls in my personal data from as many sources as I can find and gives me an interface to browse that data and run queries against it.

I've got data from Twitter, Apple HealthKit, GitHub, Swarm, Hacker News, Photos, a copy of my genome... all sorts of things.

I'll show a few more demos.

Here's another one about Cleo. Cleo has a Twitter account, and every time she goes to the vet she posts a selfie and says how much she weighs.

Here's a SQL query that finds every tweet that mentions her weight, pulls out her weight in pounds using a regular expression, then uses the datasette-vega charting plugin to show a self-reported chart of her weight over time.

select
    created_at,
    regexp_match('.*?(\d+(\.\d+))lb.*', full_text, 1) as lbs,
    full_text,
    case
        when (media_url_https is not null)
        then json_object('img_src', media_url_https, 'width', 300)
    end as photo
from
    tweets
    left join media_tweets on tweets.id = media_tweets.tweets_id
    left join media on media.id = media_tweets.media_id
where
    full_text like '%lb%'
    and user = 3166449535
    and lbs is not null
group by
    tweets.id
order by
    created_at desc
limit
    101
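regexp_match() is not one of SQLite's built-in functions, so something has to register it as a custom SQL function. A minimal sketch of how a function with that shape could be provided from Python follows; this is an assumption about the mechanism rather than the code used in the demo, and the tweet text is made up.

```python
import re
import sqlite3


def regexp_match(pattern, text, index=1):
    """Return the requested capture group, or None if the pattern does not match."""
    if text is None:
        return None
    match = re.match(pattern, text)
    return match.group(index) if match else None


conn = sqlite3.connect(":memory:")
# Register the Python function as a three-argument SQL function
conn.create_function("regexp_match", 3, regexp_match)

pattern = r".*?(\d+(\.\d+))lb.*"
# Made-up tweet text for illustration
rows = conn.execute(
    "select regexp_match(?, ?, 1), regexp_match(?, ?, 1)",
    (pattern, "Weighed in at 13.1lb today", pattern, "No weight here"),
).fetchall()
print(rows)
```

Note that the pattern as written requires a decimal point, so a whole-number weight such as 12lb would not match.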

I did 23AndMe a few years ago, so I have a copy of my genome in Dogsheep. This SQL query tells me what colour my eyes are.

Apparently they are blue, 99% of the time.

select rsid, genotype, case genotype
    when 'AA' then 'brown eye color, 80% of the time'
    when 'AG' then 'brown eye color'
    when 'GG' then 'blue eye color, 99% of the time'
    end as interpretation from genome where rsid = 'rs12913832'

I have HealthKit data from my Apple Watch.

Something I really like about Apple's approach to this stuff is that they don't just upload all of your data to the cloud.

This data lives on your watch and on your phone, and there's an option in the Health app on your phone to export it - as a zip file full of XML.

I wrote a script called healthkit-to-sqlite that converts that zip file into a SQLite database, and now I have tables for things like my basal energy burned, my body fat percentage, flights of stairs I've climbed.

But the really fun part is that it turns out any time you track an outdoor workout on your Apple Watch it records your exact location every few seconds, and you can get that data back out again!

This is a map of my exact route for the San Francisco Half Marathon three years ago.

I've started tracking an "outdoor walk" every time I go on a walk now, just so I can get the GPS data out again later.

I have a lot of data from GitHub about my projects - all of my commits, issues, issue comments and releases - everything I can get out of the GitHub API using my github-to-sqlite tool.

So I can do things like see all of my commits across all of my projects, search and facet them.

I have a public demo of a subset of this data at github-to-sqlite.dogsheep.net.

I can search my commits for any commit that mentions "pytest".

I have all of my releases, which is useful for when I write my weeknotes and want to figure out what I've been working on.

Apple Photos is a particularly interesting source of data.

It turns out the Apple Photos app uses a SQLite database, and if you know what you're doing you can extract photo metadata from it.

They actually run machine learning models on your own device to figure out what your photos are of!

You can use the machine learning labels to see all of the photos you have taken of pelicans. Here are all of the photos I have taken that Apple Photos has identified as pelicans.

It also turns out they have columns called things like ZOVERALLAESTHETICSCORE, ZHARMONIOUSCOLORSCORE, ZPLEASANTCAMERATILTSCORE and more.

So I can sort my pelican photos with the most aesthetically pleasing first!

I wrote more about this on my blog: Using SQL to find my best photo of a pelican according to Apple Photos.

And a few weeks ago I finally got around to building the thing I'd always wanted: the search engine.

I called it Dogsheep Beta, because Stephen Wolfram has a search engine called Wolfram Alpha.

This is pun-driven development: I came up with this pun a while ago and liked it so much I committed to building the software.

I wanted to know when the last time I had eaten a waffle-fish ice cream was. I knew it was in Cupertino, so I searched Dogsheep Beta for Cupertino and found this photo.

I hope this illustrates how much you can do if you pull all of your personal data into one place!

The GDPR law that passed in Europe a few years ago really helps with this stuff.

Companies have to provide you with access to the data that they store about you.

Many big internet companies have responded to this by providing a self-service export feature, usually buried somewhere in the settings.

You can also request data directly from companies, but the self-service option helps them keep their customer support costs down.

This stuff becomes easier over time as more companies build out these features.

The other challenge is how we democratize access to this.

Everything I've shown you today is open source: you can install this software and use it yourself, for free.

But there's a lot of assembly required. You need to figure out authentication tokens, find somewhere to host it, set up cron jobs and authentication.

But this should be accessible to regular non-uber-nerd humans!

Expecting regular humans to run a secure web server somewhere is pretty terrifying. I've been looking at WireGuard and Tailscale to help make secure access between devices easier, but that's still very much for super-users only.

Running this as a hosted service doesn't appeal: taking responsibility for people's personal data is scary, and it's probably not a great business.

I think the best option is to run on people's own personal devices - their mobile phones and their laptops. I think it's feasible to get Datasette running in those environments, and I really like the idea of users being able to import their personal data onto a device that they control and analyze it there.

I invite you to try this all out for yourself!

datasette.io for Datasette

github.com/dogsheep and dogsheep.github.io for Dogsheep

simonwillison.net is my personal blog

twitter.com/simonw is my Twitter account

The Dogsheep GitHub organization has most of the tools that I've used to build out my personal Dogsheep warehouse - many of them using the naming convention of something-to-sqlite.

Q&A, from this Google Doc

Q: Is there/will there be a Datasette hosted service that I can pay $ for? I would like to pay $5/month to get access to the latest version of Dogsheep with all the latest plugins!

I don’t want to build a hosting site for personal private data because I think people should stay in control of that themselves, plus I don’t think there’s a particularly good business model for that.

Instead, I’m building a hosted service for Datasette (called Datasette Cloud) which is aimed at companies and organizations. I want to be able to provide newsrooms and other groups with a private, secure, hosted environment where they can share data with each other and run analysis.

Q: How do you sync your data from your phone/watch to the data warehouse? Is it a manual process?

The health data is manual: the iOS Health app has an export button which generates a zip file of XML which you can then AirDrop to a laptop. I then run my healthkit-to-sqlite script against it to generate the DB file and SCP that to my Dogsheep server.

Many of my other Dogsheep tools use APIs and can run on cron, to fetch the most recent data from Swarm and Twitter and GitHub and so on.

Q: When accessing Github/Twitter etc do you run queries against their API or you periodically sync (retrieve mostly I guess) the data to the warehouse first and then query locally?

I always try to get ALL the data so I can query it locally. The problem with APIs that let you run queries is that inevitably there’s something I want to do that can’t be done with the API - so I’d much rather suck everything down into my own database so I can write my own SQL queries.

Here's an example of my swarm-to-sqlite script, pulling in just checkins from the past two weeks (using authentication credentials from an environment variable).

swarm-to-sqlite swarm.db --since=2w

Here's a redacted copy of my Dogsheep crontab.

Q: Have you explored doing this as a single page app so that it is possible to deploy this as a static site? What are the constraints there?

It’s actually possible to query SQLite databases entirely within client-side JavaScript using SQL.js (SQLite compiled to WebAssembly).

This Observable notebook is an example that uses this to run SQL queries against a SQLite database file loaded from a URL.

Datasette’s JSON and GraphQL APIs mean it can easily act as an API backend to SPAs.

I built this site to offer a search engine for trees in San Francisco. View source to see how it hits a Datasette API in the background: https://sf-trees.com/?q=palm

You can use the network pane to see that it's running queries against a Datasette backend.

Here's the JavaScript code which calls the API.

This demo shows Datasette’s GraphQL plugin in action.

Q: What possibilities for data entry tools do the writable canned queries open up?

Writable canned queries are a relatively recent Datasette feature that allows administrators to configure an UPDATE/INSERT/DELETE query that can be called by users filling in forms or accessed via a JSON API.

The idea is to make it easy to build backends that handle simple data entry in addition to serving read-only queries. It’s a feature with a lot of potential but so far I’ve not used it for anything significant.

Currently it can generate a VERY basic form (with single-line input values, similar to this search example) but I hope to expand it in the future to support custom form widgets via plugins for things like dates, map locations or autocomplete against other tables.
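As a hedged sketch of what configuring a writable canned query looks like: at the time of this talk it was a named query in the metadata file with "write" set to true. The database, table and query names below are hypothetical, and newer Datasette versions may expect this configuration somewhere else.

```python
import json

# Hypothetical database, table and query names, for illustration only
metadata = {
    "databases": {
        "mydatabase": {
            "queries": {
                "add_name": {
                    # :name becomes a field in the generated form
                    "sql": "insert into names (name) values (:name)",
                    "write": True,
                }
            }
        }
    }
}

metadata_json = json.dumps(metadata, indent=4)
print(metadata_json)
```

Saved as metadata.json, this could be loaded with datasette mydatabase.db --metadata metadata.json.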

Q: For the local version where you had a 1-line push to deploy a new datasette: how do you handle updates? Is there a similar 1-line update to update an existing deployed datasette?

I deploy a brand new installation every time the data changes! This works great for data that only changes a few times a day. If I have a project that changes multiple times an hour I’ll run it on a regular VPS rather than use a serverless hosting provider.

Tags: github, speaking, my-talks, datasette, dogsheep, weeknotes, sqlite-utils, annotated-talks

OCTO Speaker Series: Simon Willison - Personal Data Warehouses: Reclaiming Your Data

2020-10-23T03:00:42+00:00

I’m giving a talk in the GitHub OCTO (Office of the CTO) speaker series about Datasette and my Dogsheep personal analytics project. You can register for free here—the stream will be on Thursday November 12, 2020 at 8:30am PST (4:30pm GMT).

Tags: github, speaking, datasette, dogsheep

Dogsheep: Personal analytics with Datasette

2020-10-19T16:38:23+00:00

The second edition of my new Datasette Weekly newsletter talks about Dogsheep, Dogsheep Beta and Datasette 1.0, and features datasette-cluster-map as the plugin of the week.

Tags: datasette, dogsheep

Building an Evernote to SQLite exporter

2020-10-16T20:12:07+00:00

I've been using Evernote for over a decade, and I've long wanted to export my data from it so I can do interesting things with it.

Ideally I'd use their API for that, so I could continually export new notes via a cron. Unfortunately the API is implemented using Thrift (the only public API I've seen to use that) and my attempts to speak to it from Python have so far failed to get anywhere.

Last weekend I decided to try using their "Export notes" feature instead, and its ENEX XML format.

ENEX - the Evernote XML export format

Select a note - or multiple notes - in the Evernote macOS desktop app, use the File -> Export Notes menu item and Evernote will create a Notes.enex exported XML file.

These files can get BIG, because any images attached to your notes will be embedded as base64 encoded binary data within the XML. My export is 2.8GB!

When faced with unknown XML like this it's useful to get a high level overview of what tags are present. In 2009 I wrote a Python script for this - as part of this project I updated it for Python 3 and pushed a release to PyPI.

Here's a subset of the output when run against my Notes.enex file (full output is here).

{
    "note": {
        "attr_counts": {},
        "child_counts": {
            "content": 2126,
            "created": 2126,
            "note-attributes": 2126,
            "resource": 2605,
            "tag": 45,
            "title": 2126,
            "updated": 2126
        },
        "count": 2126,
        "parent_counts": {
            "en-export": 2126
        }
    },
    "note-attributes": {
        "attr_counts": {},
        "child_counts": {
            "altitude": 1466,
            "application-data": 449,
            "author": 998,
            "classifications": 51,
            "content-class": 387,
            "latitude": 1480,
            "longitude": 1480,
            "reminder-done-time": 2,
            "reminder-order": 2126,
            "reminder-time": 1,
            "source": 1664,
            "source-application": 423,
            "source-url": 85,
            "subject-date": 10
        },
        "count": 2126,
        "parent_counts": {
            "note": 2126
        }
    }
}

This shows me that every note is represented as a <note> element, and crucial metadata lives in children of a <note-attributes> child element.

(I thought I'd need to tweak the script for performance since it works by loading the entire file into memory, but my laptop has 32GB of RAM so it didn't even blink.)

ENEX limitations

I ran into some significant limitations while working with my ENEX export.

Notebooks are not represented in the file at all - you just get the notes. So I can't tell which recipes were in my "cooking" notebook. It's possible to work around this by manually exporting the notes from each notebook one at a time and storing them in separate export files. I didn't bother.
Some of the data in the file - the note content itself for example - consists of further blocks of XML embedded in CDATA. This means you have to run a nested XML parser for every note you process.
The notes XML <en-note> format is mostly XHTML, but includes custom <en-media hash="..."> tags where inline images should be displayed
Those inline images are CDATA encoded base64 strings. They don't include a hash, but I figured out that decoding the base64 string and then running it through MD5 generates the hash that is used in the corresponding <en-media> tag.
Notes in the export don't have any form of unique ID!

That last limitation - the lack of unique IDs - is a huge pain. It means that any export is by necessity a one-time operation - if you edit a note and attempt a re-export you will get a brand new record that can't be automatically used to update the previous one.

This is particularly frustrating because I know that Evernote assigns a GUID to each note - they just don't include them in the export file.
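The MD5 trick from the list above is short enough to sketch directly: decode the base64 body of a resource and take the MD5 hex digest of the resulting bytes. The function name is hypothetical and the payload here is a tiny stand-in, not a real Evernote image.

```python
import base64
import hashlib


def resource_hash(b64_data):
    """Decode a base64 resource body and return the MD5 hex digest of its bytes."""
    # Exported base64 may contain line breaks; b64decode discards them by default
    raw = base64.b64decode(b64_data)
    return hashlib.md5(raw).hexdigest()


# A tiny stand-in payload (the bytes b"hello"), split across two lines
print(resource_hash("aGVs\nbG8=\n"))
```

The digest can then be compared against the hash attribute of each <en-media> tag to work out which image belongs where.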

Writing an exporter

My Dogsheep family of tools aims to liberate personal data from all kinds of different sources and convert it into SQLite, which means I can explore, query and visualize it using Datasette.

dogsheep/evernote-to-sqlite is my latest entry in that series.

I based it on healthkit-to-sqlite because that, too, has to deal with a multiple GB XML file. Both tools use the Python standard library's XMLPullParser, which makes XML elements available as a stream without needing to load the entire file into memory at once.

The conversion code is here. It ended up being relatively straightforward, using ElementTree to extract data from the XML and sqlite-utils to write it to a database.

It implements a progress bar by tracking the number of bytes that have been read from the underlying XML file.
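Here's a minimal sketch of that streaming approach, assuming a file opened in binary mode: feed the XML to XMLPullParser in chunks, yield each <note> as soon as its closing tag arrives, and report the running byte count to a callback that could drive a progress bar. The function and parameter names are illustrative, not the tool's actual code.

```python
import io
from xml.etree.ElementTree import XMLPullParser


def iter_notes(fp, chunk_size=1024 * 1024, progress=None):
    """Yield each <note> element from a file-like object opened in binary mode."""
    parser = XMLPullParser(events=("start", "end"))
    root = None
    bytes_read = 0
    while True:
        chunk = fp.read(chunk_size)
        if not chunk:
            break
        bytes_read += len(chunk)
        parser.feed(chunk)
        for event, el in parser.read_events():
            if event == "start" and root is None:
                root = el
            elif event == "end" and el.tag == "note":
                yield el
                # Detach processed notes from the root so memory use stays flat
                root.clear()
        if progress is not None:
            # Report the total number of bytes consumed so far
            progress(bytes_read)
    parser.close()


# Tiny made-up document, read 16 bytes at a time to exercise the chunking
sample = (
    b"<en-export><note><title>One</title></note>"
    b"<note><title>Two</title></note></en-export>"
)
seen = []
titles = [
    note.findtext("title")
    for note in iter_notes(io.BytesIO(sample), chunk_size=16, progress=seen.append)
]
print(titles, seen[-1] == len(sample))
```

Comparing the byte count against the file size on disk gives a percentage complete without needing to know how many notes the file contains.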

One optimization: originally I created a single resources table with the resource metadata and a BLOB column containing the binary image contents.

This table was huge - over a GB - and had very poor performance for operations such as querying and faceting across unindexed columns - because any table scan had to work through MBs of binary data.

My personal Evernote notebook has around 2,000 resources. 2,000 rows of metadata should usually be very fast to query.

So... I split the binary data out into a two column resources_data table - md5 primary key and a BLOB for data. This massively sped up queries against that resources table.
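The shape of that split can be sketched with Python's sqlite3 module. Only the md5 and data columns come from the description above; the other column names and the example row are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Small rows: cheap to scan, filter and facet
    create table resources (
        md5 text primary key,
        mime text,
        filename text
    );
    -- Large rows: only touched when the bytes are actually needed
    create table resources_data (
        md5 text primary key,
        data blob
    );
    """
)

# Made-up example row; column names other than md5 and data are assumptions
conn.execute(
    "insert into resources values (?, ?, ?)", ("abc123", "image/png", "cat.png")
)
conn.execute("insert into resources_data values (?, ?)", ("abc123", b"\x89PNG..."))

# Metadata queries only need the small table
print(conn.execute("select filename from resources where mime = 'image/png'").fetchall())

# Fetch the bytes for one resource by joining on the shared md5 key
row = conn.execute(
    "select resources.filename, resources_data.data from resources "
    "join resources_data on resources.md5 = resources_data.md5 "
    "where resources.md5 = ?",
    ("abc123",),
).fetchone()
print(row[0], len(row[1]))
```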

And it works! Running the following command produces a SQLite database file containing all of my notes, note metadata and embedded images:

evernote-to-sqlite enex evernote.db MyNotes.enex

Serving up formatted notes

Figuring out the best way to serve up the note content in Datasette is still a work in progress - but I've hacked together a delightfully terrible way of doing this using a one-off plugin.

Datasette plugins are usually packaged and installed via PyPI, but there's an alternative option for plugins that don't warrant distribution: create a plugins/ directory, drop in one or more Python files and start Datasette pointing --plugins-dir at that directory.

This is great for hacking on terrible ideas. Here's the evernote.py plugin I wrote in full:

from datasette import hookimpl
import jinja2

START = "<en-note"
END = "</en-note>"
TEMPLATE = """
<div style="max-width: 500px; white-space: normal; overflow-wrap: break-word;">{}</div>
""".strip()

EN_MEDIA_SCRIPT = """
Array.from(document.querySelectorAll('en-media')).forEach(el => {
    let hash = el.getAttribute('hash');
    let type = el.getAttribute('type');
    let path = `/evernote/resources_data/${hash}.json?_shape=array`;
    fetch(path).then(r => r.json()).then(rows => {
        let b64 = rows[0].data.encoded;
        let data = `data:${type};base64,${b64}`;
        el.innerHTML = `<img style="max-width: 300px" src="${data}">`;
    });
});
"""

@hookimpl
def render_cell(value, table):
    if not table:
        # Don't render content from arbitrary SQL queries, could be XSS hole
        return
    if not value or not isinstance(value, str):
        return
    value = value.strip()
    if value.startswith(START) and value.endswith(END):
        trimmed = value[len(START) : -len(END)]
        trimmed = trimmed.split(">", 1)[1]
        # Replace those horrible double newlines
        trimmed = trimmed.replace("<div><br /></div>", "<br>")
        return jinja2.Markup(TEMPLATE.format(trimmed))

@hookimpl
def extra_body_script():
    return EN_MEDIA_SCRIPT

This uses two Datasette plugin hooks.

render_cell() is called every time Datasette's interface renders the value of a column. Here I'm looking for the <en-note> and </en-note> tags and, if they are present, stripping them off and marking their inner content as safe to display using jinja2.Markup() (without this they would be automatically HTML escaped).

extra_body_script() can be used to inject additional JavaScript at the bottom of the page. I'm injecting some particularly devious JavaScript which scans the page for Evernote's <en-media> tags and, if it finds any, runs a fetch() to get the base64-encoded data from Datasette and then injects that into the page as a 300px wide image element using a data: URI.

Like I said, this is a pretty terrifying hack! But it works just fine, and my notes are now visible inside my personal Datasette instance:

Bonus feature: search by OCR

An interesting feature of Evernote is that it runs cloud-based OCR against images in your notes, enabling you to search across the text contained within them.

It turns out the output of this OCR process is included in the export XML. It looks like this:

<recoIndex
  docType="unknown"
  objType="image" objID="76dd28b07797cc9f3f129c4871c5293c" 
  engineVersion="7.0.24.1" 
  recoType="service" 
  lang="en" 
  objWidth="670" 
  objHeight="128">
  <item x="26" y="52" w="81" h="29">
    <t w="76">This</t>
  </item>
  <item x="134" y="52" w="35" h="29">
    <t w="79">is</t>
  </item>
  <item x="196" y="60" w="37" h="21">
    <t w="73">so</t>
  </item>
  <item x="300" y="60" w="57" h="21">
    <t w="71">can</t>
  </item>
  <item x="382" y="54" w="79" h="27">
    <t w="77">test</t>
  </item>
  <item x="486" y="52" w="59" h="29">
    <t w="82">the</t>
  </item>
  <item x="570" y="54" w="59" h="25">
    <t w="74">OCR</t>
    <t w="33">DeR</t>
    <t w="15">OCR!</t>
    <t w="14">OCR]</t>
  </item>
</recoIndex>

As far as I can tell an <item> gets multiple <t> elements only if the OCR wasn't 100% sure.

evernote-to-sqlite creates a searchable ocr text field using values from that XML, so I can search my images in Datasette.
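One plausible way to turn that XML into searchable text is to take a single word per <item>, preferring the <t> candidate with the highest w value. Treating w as a confidence weight is an assumption here, and this sketch is not necessarily how evernote-to-sqlite chooses its words.

```python
from xml.etree import ElementTree as ET


def ocr_text(reco_xml):
    """Return one word per <item>, choosing the <t> candidate with the highest w value."""
    root = ET.fromstring(reco_xml.strip())
    words = []
    for item in root.iter("item"):
        candidates = item.findall("t")
        if not candidates:
            continue
        # Assumption: a larger w means the OCR engine was more confident
        best = max(candidates, key=lambda t: int(t.get("w", "0")))
        words.append(best.text or "")
    return " ".join(words)


# Trimmed copy of the last two items from the example above
sample = """
<recoIndex objType="image" lang="en">
  <item x="486" y="52" w="59" h="29">
    <t w="82">the</t>
  </item>
  <item x="570" y="54" w="59" h="25">
    <t w="74">OCR</t>
    <t w="33">DeR</t>
    <t w="15">OCR!</t>
    <t w="14">OCR]</t>
  </item>
</recoIndex>
"""
print(ocr_text(sample))
```

An alternative would be to index every candidate, which trades some noise for a better chance of matching a misread word.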

Tags: projects, sqlite, xml, datasette, dogsheep, sqlite-utils

evernote-to-sqlite

2020-10-12T00:38:32+00:00

The latest tool in my Dogsheep series of utilities for personal analytics: evernote-to-sqlite takes Evernote note exports in their ENEX XML format and loads them into a SQLite database. Embedded images are loaded into a BLOB column and the output of their cloud-based OCR system is added to a full-text search index. Notes have a latitude and longitude which means you can visualize your notes on a map using Datasette and datasette-cluster-map.

Tags: projects, sqlite, datasette, dogsheep

Weeknotes: airtable-export, generating screenshots in GitHub Actions, Dogsheep!

2020-09-03T23:28:29+00:00

This week I figured out how to populate Datasette from Airtable, wrote code to generate social media preview card page screenshots using Puppeteer, and made a big breakthrough with my Dogsheep project.

airtable-export

I wrote about Rocky Beaches in my weeknotes two weeks ago. It's a new website built by Natalie Downe that showcases great places to go rockpooling (tidepooling in American English), mixing in tide data from NOAA and species sighting data from iNaturalist.

Rocky Beaches is powered by Datasette, using a GitHub Actions workflow that builds the site's underlying SQLite database using API calls and YAML data stored in the GitHub repository.

Natalie wanted to use Airtable to maintain the structured data for the site, rather than hand-editing a YAML file. So I built airtable-export, a command-line script for sucking down all of the data from an Airtable instance and writing it to disk as YAML or JSON.

You run it like this:

airtable-export out/ mybaseid table1 table2 --key=key

This will create a folder called out/ with a .yml file for each of the tables.

Sadly the Airtable API doesn't yet provide a mechanism to list all of the tables in a database (a long-running feature request) so you have to list the tables yourself.

We're now running that command as part of the Rocky Beaches build script, and committing the latest version of the YAML file back to the GitHub repo (thus gaining a full change history for that data).

I really like social media cards - og:image HTML meta attributes for Facebook and twitter:image for Twitter. I wanted them for articles on my TIL website since I often share those via Twitter.

One catch: my TILs aren't very image heavy. So I decided to generate screenshots of the pages and use those as the 2x1 social media card images.

The best way I know of programmatically generating screenshots is to use Puppeteer, a Node.js library for automating a headless instance of the Chrome browser that is maintained by the Chrome DevTools team.

My first attempt was to run Puppeteer in an AWS Lambda function on Vercel. I remembered seeing an example of how to do this in the Vercel documentation a few years ago. The example isn't there any more, but I found the original pull request that introduced it.

Since the example was MIT licensed I created my own fork at simonw/puppeteer-screenshot and updated it to work with the latest Chrome.

It's pretty resource intensive, so I also added a secret ?key= mechanism so only my own automation code could call my instance running on Vercel.

I needed to store the generated screenshots somewhere. They're pretty small - on the order of 60KB each - so I decided to store them in my SQLite database itself and use my datasette-media plugin (see Fun with binary data and SQLite) to serve them up.

This worked! Until it didn't... I ran into a showstopper bug when I realized that the screenshot process relies on the page being live on the site... but when a new article is added it's not live when the build process works, so the generated screenshot is of the 404 page.

So I reworked it to generate the screenshots inside the GitHub Action as part of the build script, using puppeteer-cli.

My generate_screenshots.py script handles this, by first shelling out to datasette --get to render the HTML for the page, then running puppeteer to generate the screenshot. Relevant code:

def png_for_path(path):
    # Path is e.g. /til/til/python_debug-click-with-pdb.md
    page_html = str(TMP_PATH / "generate-screenshots-page.html")
    # Use datasette to generate HTML
    proc = subprocess.run(["datasette", ".", "--get", path], capture_output=True)
    with open(page_html, "wb") as fp:
        fp.write(proc.stdout)
    # Now use puppeteer screenshot to generate a PNG
    proc2 = subprocess.run(
        [
            "puppeteer",
            "screenshot",
            page_html,
            "--viewport",
            "800x400",
            "--full-page=false",
        ],
        capture_output=True,
    )
    png_bytes = proc2.stdout
    return png_bytes

This worked great! Except for one thing... the site is hosted on Vercel, and Vercel has a 5MB response size limit.

Every time my GitHub build script runs it downloads the previous SQLite database file, so it can avoid regenerating screenshots and HTML for pages that haven't changed.

The addition of the binary screenshots drove the size of the SQLite database over 5MB, so the part of my script that retrieved the previous database no longer worked.

I needed a reliable way to store that 5MB (and probably eventually 10-50MB) database file in between runs of my action.

The best place to put this would be an S3 bucket, but I find the process of setting up IAM permissions for access to a new bucket so infuriating that I couldn't bring myself to do it.

So... I created a new dedicated GitHub repository, simonw/til-db, and updated my action to store the binary file in that repo - using a force push so the repo doesn't need to maintain unnecessary version history of the binary asset.

This is an abomination of a hack, and it made me cackle a lot. I tweeted about it and got the suggestion to try Git LFS instead, which would definitely be a more appropriate way to solve this problem.

Rendering Markdown

I write my blog entries in Markdown and transform them into HTML before I post them on my blog. Some day I'll teach my blog to render Markdown itself, but so far I've got by through copying and pasting into Markdown tools.

My favourite Markdown flavour is GitHub's, which adds a bunch of useful capabilities - most notably the ability to apply syntax highlighting. GitHub expose an API that applies their Markdown formatter and returns the resulting HTML.

I built myself a quick and scrappy tool in JavaScript that sends Markdown through their API and then applies a few DOM manipulations to clean up what comes back. It was a nice opportunity to write some modern vanilla JavaScript using fetch():

async function render(markdown) {
    return (await fetch('https://api.github.com/markdown', {
        method: 'POST',
        headers: {
            'Content-Type': 'application/json'
        },
        body: JSON.stringify({'mode': 'markdown', 'text': markdown})
    })).text();
}

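const button = document.getElementsByTagName('button')[0];
// Look up the input element explicitly instead of relying on the implicit window.input global
const input = document.getElementById('input');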
const output = document.getElementById('output');
const preview = document.getElementById('preview');

button.addEventListener('click', async function() {
    const rendered = await render(input.value);
    output.value = rendered;
    preview.innerHTML = rendered;
});
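The same API call can be sketched in Python using only the standard library. This is an illustrative equivalent of the fetch() call above, not code from my tool; the function names are my own, and unauthenticated requests to the GitHub API are rate limited.

```python
import json
import urllib.request


def build_markdown_request(markdown):
    # Same endpoint and JSON body as the fetch() call above
    body = json.dumps({"mode": "markdown", "text": markdown}).encode("utf-8")
    return urllib.request.Request(
        "https://api.github.com/markdown",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def render(markdown):
    # Performs the network request and returns the rendered HTML as a string
    with urllib.request.urlopen(build_markdown_request(markdown)) as response:
        return response.read().decode("utf-8")
```

Splitting request construction from the network call makes the payload easy to inspect without hitting the API.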

Dogsheep Beta

My most exciting project this week was getting out the first working version of Dogsheep Beta - the search engine that ties together results from my Dogsheep family of tools for personal analytics.

I'm giving a talk about this tonight at PyCon Australia: Build your own data warehouse for personal analytics with SQLite and Datasette. I'll be writing up detailed notes in the next few days, so watch this space.

Releases this week

dogsheep-beta 0.4.1 - 2020-09-03
dogsheep-beta 0.4 - 2020-09-03
dogsheep-beta 0.4a1 - 2020-09-03
dogsheep-beta 0.4a0 - 2020-09-03
dogsheep-beta 0.3 - 2020-09-02
dogsheep-beta 0.2 - 2020-09-01
dogsheep-beta 0.1 - 2020-09-01
dogsheep-beta 0.1a2 - 2020-09-01
dogsheep-beta 0.1a - 2020-09-01
airtable-export 0.4 - 2020-08-30
datasette-yaml 0.1a - 2020-08-29
airtable-export 0.3.1 - 2020-08-29
airtable-export 0.3 - 2020-08-29
airtable-export 0.2 - 2020-08-29
airtable-export 0.1.1 - 2020-08-29
airtable-export 0.1 - 2020-08-29
datasette 0.49a0 - 2020-08-28
sqlite-utils 2.16.1 - 2020-08-28

Tags: projects, yaml, markdown, dogsheep, weeknotes, github-actions, airtable, puppeteer

Serving photos locally with datasette-media

2020-05-26T15:53:08+00:00

datasette-media is a new Datasette plugin which can serve static files from disk in response to a configured SQL query that maps incoming URL parameters to a path to a file. I built it so I could run dogsheep-photos locally on my laptop and serve up thumbnails of images that match particular queries. I’ve added documentation to the dogsheep-photos README explaining how to use datasette-media, datasette-json-html and datasette-template-sql to create custom interfaces onto Apple Photos data on your machine.

Tags: plugins, projects, datasette, dogsheep, apple-photos

Using SQL to find my best photo of a pelican according to Apple Photos

2020-05-21T19:16:38+00:00

According to the Apple Photos internal SQLite database, this is the most aesthetically pleasing photograph I have ever taken of a pelican:

Here's the SQL query that found me my best ten pelican photos:

select
  sha256,
  ext,
  uuid,
  date,
  ZOVERALLAESTHETICSCORE
from
  photos_with_apple_metadata
where
  uuid in (
    select
      uuid
    from
      labels
    where
      normalized_string = 'pelican'
  )
order by
  ZOVERALLAESTHETICSCORE desc
limit
  10

You can try it out here (with some extra datasette-json-html magic to display the actual photos). Or try lemur or seal.
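To experiment with the shape of that query without any real photo data, here's a small sketch that runs it from Python against an in-memory database. The table and column names come from the query above; the rows are invented, and the label is passed as a parameter so the same function works for lemur or seal.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table photos_with_apple_metadata (
    sha256 text, ext text, uuid text, date text, ZOVERALLAESTHETICSCORE real
);
create table labels (uuid text, normalized_string text);
-- Invented sample rows, purely for illustration
insert into photos_with_apple_metadata values
    ('aaa', 'jpeg', 'U1', '2020-01-01', 0.4),
    ('bbb', 'heic', 'U2', '2020-02-01', 0.9),
    ('ccc', 'jpeg', 'U3', '2020-03-01', 0.7);
insert into labels values ('U1', 'pelican'), ('U2', 'pelican'), ('U3', 'seal');
""")


def best_photos(conn, label, limit=10):
    # Photos carrying the given label, highest aesthetic score first
    sql = """
    select sha256, ext, uuid, date, ZOVERALLAESTHETICSCORE
    from photos_with_apple_metadata
    where uuid in (select uuid from labels where normalized_string = ?)
    order by ZOVERALLAESTHETICSCORE desc
    limit ?
    """
    return conn.execute(sql, (label, limit)).fetchall()


top = best_photos(conn, "pelican")
```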

I actually think this is my best pelican photo, but Apple Photos rated it fifth:

How this works

Apple Photos keeps photo metadata in a SQLite database. It runs machine learning models to identify the contents of every photo, and separate machine learning models to calculate quality scores for those photographs. All of this data lives in SQLite files on my laptop. The trick is knowing where to look.

I'm not running queries directly against the Apple Photos SQLite file - it's a little hard to work with, and the label metadata is stored in a separate database file. Instead, this query runs against a combined database created by my new dogsheep-photos tool.

An aside: Why I love Apple Photos

The Apple Photos app - on both macOS and iOS - is in my opinion Apple's most underappreciated piece of software. In my experience most people who use it are missing some of the most valuable features. A few highlights:

It can show you ALL of your photos on a map. On iOS go to the "Albums" tab, scroll half way down and then click on "Places" (no wonder people miss this feature!) - on macOS Photos it's the "Library -> Places" sidebar item. It still baffles me that Google Photos doesn't do this (I have conspiracy theories about it). This is my most common way for finding a photo I've taken - I remember where it was, then zoom in on that area of the map.
It runs machine learning models on your phone (or laptop) to identify the subject of your photos, and makes them searchable. Try searching for "dog" and you'll see all of the photos you've taken of dogs! I love that this runs on-device: it's much less creepy than uploading your photos to the cloud in order to do this.
It has a really great faceted search implementation - particularly in the phone app. Try searching for "dog", then add "selfie" and the name of a city to see all of the selfies you've taken with dogs in that place!
It has facial recognition, again running on device, which you can use to teach it who your friends are (autocompleting against your contacts). A little bit of effort spent training this and you can see photos you've taken of specific friends in specific places and with specific animals!

As with most Apple software, Photos uses SQLite under the hood. The underlying database is undocumented and clearly not intended as a public API, but it exists. And I've wanted to gain access to what's in it for years.

Querying the Apple Photos SQLite database

If you run Apple Photos on a Mac (which will synchronize with your phone via iCloud) then most of your photo metadata can be found in a database file that lives here:

~/Pictures/Photos\ Library.photoslibrary/database/Photos.sqlite

Mine is 752MB, for around 40,000 photos. There's a lot of detailed metadata in there!

Querying the database isn't straightforward. Firstly it's almost always locked by some other process - the workaround for that is to create a copy of the file. Secondly, it uses some custom undocumented Apple SQLite extensions. I've not figured out a way to load these, and without them a lot of my queries ended up throwing errors.
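The copy-first workaround can be sketched like this. The helper names are hypothetical, and this simple version copies only the main database file - it assumes any companion write-ahead log files can be ignored, so very recent changes may be missing from the copy.

```python
import shutil
import sqlite3
import tempfile
from pathlib import Path


def open_copy(db_path):
    # Copy into a fresh temporary directory so the original is never touched
    tmp_dir = Path(tempfile.mkdtemp())
    copy_path = tmp_dir / Path(db_path).name
    shutil.copy(db_path, copy_path)
    return sqlite3.connect(copy_path)


def list_tables(conn):
    # Names of all tables in the database, in alphabetical order
    rows = conn.execute(
        "select name from sqlite_master where type = 'table' order by name"
    )
    return [row[0] for row in rows]
```

Calling open_copy() with the Photos.sqlite path shown earlier and then list_tables() on the result is a quick way to see what's in there.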

osxphotos to the rescue! I ran a GitHub code search for one of the tables in that database (searching for RKPerson in Python code) and was delighted to stumble across the osxphotos project by Rhet Turnbull. It's a well designed and extremely actively maintained Python tool for accessing the Apple Photos database, including code to handle several iterations of the underlying database structure.

Thanks to osxphotos the first iteration of my own code for accessing the Apple Photos metadata was less than 100 lines of code. This gave me locations, people, albums and places (human names of geographical areas) almost for free!

Quality scores

Apple Photos has a fascinating database table called ZCOMPUTEDASSETATTRIBUTES, with a bewildering collection of columns. Each one is a floating point number calculated presumably by some kind of machine learning model. Here's a full list, each one linking to my public photos sorted by that score:

I'm not enormously impressed with the results I get from these. They're clearly not intended for end-user visibility, and sorting them might not even be something that makes sense.

The ZGENERICASSET table provides four more scores, which seem to provide much more useful results:

My guess is that these overall scores are derived from the ZCOMPUTEDASSETATTRIBUTES ones. I've seen the best results from ZOVERALLAESTHETICSCORE, so that's the one I used in my "show me my best photo of a pelican" query.

A note about the demo

The demo I'm running at dogsheep-photos.dogsheep.net currently only contains 496 photos. My private instance of this has over 40,000, but I decided to just publish a subset of that in the demo so I wouldn't have to carefully filter out private screenshots and photos with sensitive locations and suchlike. Details of how the demo works (using the dogsheep-photos create-subset command to create a subset database containing just photos in my Public album) can be found in this issue.

Automatic labeling of photo contents

Even more impressive than the quality scores are the machine learning labels.

Automatically labeling the content of a photo is surprisingly easy these days, thanks to convolutional neural networks. I wrote a bit about these in Automatically playing science communication games with transfer learning and fastai.

Apple download a machine learning model to your device and do the label classification there. After quite a bit of hunting (I ended up using Activity Monitor's Inspect -> Open Files and Ports option against the photoanalysisd process) I finally figured out where the results go: the ~/Pictures/Photos\ Library.photoslibrary/database/search/psi.sqlite database file.

(Inspecting photoanalysisd also led me to the /System/Library/Frameworks/Vision.framework/Versions/A/Resources/ folder, which solved another mystery: where do Apple keep the models? There are some fascinating files in there.)

It took some work to figure out how to match those labels with their corresponding photos, mainly because the psi.sqlite database stores photo UUIDs as a pair of signed integers whereas the Photos.sqlite database stores a UUID string.
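A sketch of that kind of conversion is below. The byte layout is an assumption made for illustration - each signed integer is treated as holding eight little-endian bytes of the UUID - so check it against real data before relying on it.

```python
import uuid


def ints_to_uuid(first, second):
    # ASSUMPTION: each signed 64-bit integer holds 8 little-endian UUID bytes
    raw = first.to_bytes(8, "little", signed=True) + second.to_bytes(
        8, "little", signed=True
    )
    return str(uuid.UUID(bytes=raw)).upper()


def uuid_to_ints(uuid_string):
    # Inverse of ints_to_uuid, under the same byte-order assumption
    raw = uuid.UUID(uuid_string).bytes
    return (
        int.from_bytes(raw[:8], "little", signed=True),
        int.from_bytes(raw[8:], "little", signed=True),
    )
```

The signed=True argument matters: SQLite integers are signed 64-bit values, so half of all possible UUID halves show up as negative numbers.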

I'm now pulling the labels out into a separate labels table. You can browse that in the demo to see how it is structured. Labels belong to numeric categories - here are some of my guesses as to what those mean:

Category 2024 appears to be actual content labels - Seal, Water Body, Pelican etc.
Category 2027 is more contextual: Entertainment, Trip, Travel, Museum, Beach Activity etc.
Category 1014 is simply the month the photo was taken. 1015 is the year, and 2030 is the season.
Category 2056 is the original filename.
Category 12 is the country the photo was taken in.

Here's a query that shows the labels (from every category) next to each photo.

Geography

Photos taken on an iPhone have embedded latitudes and longitudes... which means I can display them on a map!

Apple also perform reverse-geocoding on those photos, resolving them to cities, regions and countries. This is great for faceted browse: here are my photos faceted by country, city and state/province.

Hosting and serving the images

My least favourite thing about Apple Photos is how hard it is to get images from it onto the internet. If you enable iCloud sharing your images are accessible through icloud.com - but they aren't given publicly accessible URLs, so you can't embed them in blog entries or do other webby things with them.

I also really want to "own" my images. I want them in a place that I control.

Amazon S3 is ideal for image storage. It's incredibly inexpensive and essentially infinite.

The dogsheep-photos upload command takes ANY directory as input, scans through that directory for image files and then uploads them to the configured S3 bucket.

I designed this to work independently of Apple Photos, mainly to preserve my ability to switch to alternative image solutions in the future.

I'm using the content addressable storage pattern to store the images. Their filename is the sha256 hash of the file contents. The idea is that since sensible photo management software leaves the original files unmodified I should be able to de-duplicate my photo files no matter where they are from and store everything in the one bucket.
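The content addressable naming scheme can be sketched in a few lines. The content_key function name is hypothetical, and appending the lowercased original extension is an assumption based on the shape of the example URL further down.

```python
import hashlib
from pathlib import Path


def content_key(path):
    # Hash the file in chunks so large originals need not fit in memory
    digest = hashlib.sha256()
    with open(path, "rb") as fp:
        for chunk in iter(lambda: fp.read(1024 * 1024), b""):
            digest.update(chunk)
    # ASSUMPTION: keep the original extension, lowercased, after the hash
    return digest.hexdigest() + Path(path).suffix.lower()
```

Two files with identical bytes get the same key no matter what they were called or where they came from, which is what makes de-duplication fall out for free.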

Original image files come with privacy concerns: they embed accurate latitude and longitude data in the EXIF data, so they can be used to reconstruct your exact location history and even figure out your address. This is why systems like Google Photos make it difficult to export images with location data intact.

I've addressed this by making the content in my S3 bucket private. Access to the images takes place through s3-image-proxy - a proxy server I wrote and deployed on Vercel (previously Zeit Now). The proxy strips EXIF data and can optionally resize images based on querystring parameters. It also serves them with far-future cache expire headers, which means they sit in Vercel's CDN cache rather than being resized every time they are accessed.

iPhones default to saving photos in HEIC format, which fails to display with the <img src=""> tag in the browsers I tested. The proxy uses pyheif to convert those into JPEGs.

Here's an example HEIC image, resized by the proxy and converted to JPEG: https://photos.simonwillison.net/i/59854a70f125154cdf8dad89a4c730e6afde06466d4a6de24689439539c2d863.heic?w=600

Next steps

This project is a little daunting in that there are so many possibilities for where to take it next!

In the short term:

Import EXIF data from the images into a table. The Apple Photos tables give me some of this already (particularly GPS data) but I want things like ISO, aperture, what lens I used.
Load the labels into SQLite full-text search.
I'd like other people to be able to play with this easily. Getting it all up and running right now is a fair amount of work - I think I can improve this with usability improvements and better documentation.
The system only handles static images at the moment. I'd like to get my movies and more importantly my live photos in there as well.

And in the longer term:

Only iPhone photos have location data at the moment - I'd like to derive approximate latitude/longitude points for my DSLR images by matching against images from my phone based on date.
Running my photos through other computer vision systems like Google's Cloud Vision APIs could be really interesting.
For better spotting of duplicate images I'm interested in exploring image content hashing.
The UI for all of this right now is just regular Datasette. Building a custom UI (running against the Datasette JSON API) could be a lot of fun.

Tags: photography, photos, projects, sql, sqlite, datasette, dogsheep, weeknotes, apple-photos

Weeknotes: Datasette 0.41, photos breakthroughs

2020-05-07T05:02:51+00:00

Shorter weeknotes this week, because my main project for the week warrants a detailed write-up on its own (coming soon... update 21st May: here it is).

Datasette 0.41

I released Datasette 0.41 this morning. Highlights include:

Custom pages and configuration directory mode, both described here last week
A new NOT LIKE table filter expression, ?colname__notlike=X - #750
A CSS pattern portfolio! I've been meaning to ship one of these for over two years now - you can see it at latest.datasette.io/-/patterns - #151
Various minor fixes and documentation improvements
PRAGMA function support - #761

A late addition to the release - thought of and implemented just this morning, as the result of this TIL, is the ability to use various SQLite PRAGMA functions.

Datasette blocks user-provided SQL queries containing the term PRAGMA because it can be used to dynamically update various database configuration settings, and Datasette likes to keep queries read-only.

But... SQLite provides access to a number of useful introspection capabilities through a collection of PRAGMA functions. Things like this:

select * from pragma_foreign_key_list('complex_foreign_keys')

These are harmless - the SQLite documentation specifically states that "Table-valued functions exist only for PRAGMAs that return results and that have no side-effects".

So I enabled them using a negative lookahead assertion in the regular expression that prevents PRAGMA from being used:

allowed_pragmas = (
    "database_list",
    "foreign_key_list",
    "function_list",
    "index_info",
    "index_list",
    "index_xinfo",
    "page_count",
    "max_page_count",
    "page_size",
    "schema_version",
    "table_info",
    "table_xinfo",
)
disallawed_sql_res = [
    (
        re.compile("pragma(?!_({}))".format("|".join(allowed_pragmas))),
        "Statement may not contain PRAGMA",
    )
]

The regular expression here is dynamically composed from the list of allowed PRAGMA functions - it looks something like this:

re.compile("pragma(?!_(database_list|foreign_key_list|...))")

In Python regular expressions pragma(?!_foo) means "pragma but only if not followed by _foo".
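Here's a small standalone sketch of the negative lookahead in action, using a shortened allow-list. The is_blocked helper and the choice to lowercase the SQL before matching are part of this sketch, not a copy of Datasette's validation code.

```python
import re

# Shortened allow-list, for illustration
allowed_pragmas = ("foreign_key_list", "table_info", "index_list")
blocked = re.compile("pragma(?!_({}))".format("|".join(allowed_pragmas)))


def is_blocked(sql):
    # Lowercase first so PRAGMA and pragma are treated alike
    return blocked.search(sql.lower()) is not None
```

With this pattern is_blocked("select * from pragma_table_info('t')") is False, while is_blocked("PRAGMA journal_mode = wal") and is_blocked("select * from pragma_compile_options") are both True.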

Plotting foreign key relationships with Observable and D3

With PRAGMA functions support shipped in Datasette 0.41 it's now possible to make use of them in queries. Here's an example query which finds all foreign key relationships in a database:

select
    sqlite_master.name as table_from,
    fk_info.[from] as column_from,
    fk_info.[table] as table_to,
    fk_info.[to] as column_to
from
    sqlite_master
join
    pragma_foreign_key_list(sqlite_master.name) as fk_info
order by
    sqlite_master.name

Here's that query run against my github-to-sqlite demo database. Exported as JSON, the results look like this:

[
    {
        "table_from": "commits",
        "column_from": "committer",
        "table_to": "users",
        "column_to": "id"
    },
    {
        "table_from": "commits",
        "column_from": "author",
        "table_to": "users",
        "column_to": "id"
    }
    ...
]

I used this to put together an Observable Notebook which takes this data (from any public Datasette instance if you provide it with a URL) and renders those relationships as a force-directed graph using D3.
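The foreign key query can also be tried locally with Python's sqlite3 module. This sketch builds a tiny two-table schema (invented here, loosely echoing the commits and users tables in the JSON above) and runs the same SQL against it. It needs a SQLite build recent enough to support PRAGMA table-valued functions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Invented schema, for illustration only
create table users (id integer primary key, login text);
create table commits (
    sha text primary key,
    author integer references users(id),
    committer integer references users(id)
);
""")

FOREIGN_KEYS_SQL = """
select
    sqlite_master.name as table_from,
    fk_info.[from] as column_from,
    fk_info.[table] as table_to,
    fk_info.[to] as column_to
from
    sqlite_master
join
    pragma_foreign_key_list(sqlite_master.name) as fk_info
order by
    sqlite_master.name
"""

# sqlite3.Row lets each result be turned into a dictionary keyed by column name
conn.row_factory = sqlite3.Row
relationships = [dict(row) for row in conn.execute(FOREIGN_KEYS_SQL)]
```

Each dictionary in relationships has the same four keys as the JSON objects shown earlier, which is the shape a graph-drawing tool needs for its links.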

Apple Photos breakthrough

I'll write about this more in a separate post, but I had a huge breakthrough with my photos-to-sqlite project this week.

The goal of the project is to give me SQL access to the photos I've taken. I use Apple Photos for this, and my holy grail has always been the metadata that Apple generate around my photos using machine learning.

Rhet Turnbull's fantastic osxphotos project gave me my first breakthrough, allowing me to easily start importing albums, people (from on-device facial recognition) and places.

Apple run sophisticated label classification against my photos on device - so if I search for "dog" they show me all of the photos I've taken of dogs. I was certain that this information was in a SQLite database somewhere, and this week I finally figured out where: the ~/Pictures/Photos\ Library.photoslibrary/database/search/psi.sqlite file.

After a whole lot of digging around I worked out how to join it against my photos database, and as a result I can now run SQL queries that select all of my photos of dogs! Or seals! Or pelicans!

Combine that with some intriguing automated photo quality scores I found (order by ZPLEASANTCOMPOSITIONSCORE is a thing) and my Dogsheep photos solution is beginning to take shape. I'll write it up in more detail over the next few days.

Other projects

conditional-get 0.2a now streams large downloads to disk and shows a progress bar in -v mode.
datasette-atom 0.5 adds support for the Atom author element, so you can add author information to feed entries.
sqlite-utils went through versions 2.7, 2.7.1, 2.7.2 and 2.8 - see release notes.
github-to-sqlite 2.1 and 2.2 introduced two new commands:
- github-to-sqlite scrape-dependents scrapes the GitHub dependents page (which isn't yet in an API) to gather details of repos that depend on your repos
- github-to-sqlite stargazers fetches the users who have starred a specific list of repositories
- I also added several new SQL views for better exploring the retrieved data

My github-to-sqlite demo is updated daily via a scheduled GitHub Action and retrieves all kinds of data about my various Dogsheep projects plus sqlite-utils and Datasette. Being able to see things like the most recent dependent repos across my projects in one place is extremely useful.

Tags: projects, sqlite, datasette, observable, dogsheep, weeknotes, apple-photos

github-to-sqlite 2.2 highlights thread

2020-05-02T22:16:16+00:00

I released github-to-sqlite 2.2 today with a new “stargazers” command for importing users who have starred one or more specific repositories. This Twitter thread lists highlights of recent releases and links to a live Datasette demo that shows what the tool can do.

Tags: github, projects, datasette, dogsheep

Weeknotes: Datasette 0.40, various projects, Dogsheep photos

2020-04-22T23:09:10+00:00

A new release of Datasette, two new projects and progress towards a Dogsheep photos solution.

Datasette 0.40

I released Datasette 0.40 last night. Full release notes are here, but the key feature in this release is the ability to provide metadata in a metadata.yaml file as an alternative to metadata.json. This is particularly useful for embedding multi-line SQL queries: I've upgraded simonw/museums and simonw/til to take advantage of this, since they both use their metadata to define SQL queries that power their search pages and Atom feeds.

A JSK fellows directory and twitter-to-sqlite 0.21

My JSK Fellowship at Stanford ends in a few months. JSK has extremely talented and influential alumni, and one of the benefits of the fellowship is becoming part of that network afterwards.

The @JSKStanford Twitter account maintains lists of fellows on Twitter - journalists love Twitter! - so I decided to use my twitter-to-sqlite tool to build a Datasette-powered search engine of them.

That search engine is now running at jsk-fellows.datasettes.com. It's updated daily by a GitHub Action to capture any bio changes or new list entrants.

It's a neat example of taking advantage of SQLite views to build faceted search across a subset of data. A script constructs the jsk_fellows view at build time, then metadata.json configures that view to run full-text search and facet by the derived fellowship column.

I shipped twitter-to-sqlite 0.21 with a new twitter-to-sqlite lists username command as part of this project.

TILs and datasette-template-sql 1.0

I described my new TILs project on Monday. I've published 15 so far - the format is working really well for me.

Hacking on simonw/til reminded me of a feature gap in my datasette-template-sql plugin: it didn't have a solution for safely escaping parameters in SQL queries, leading to nasty string concatenated SQL queries.

datasette-template-sql 1.0 fixes that issue, at the cost of backwards compatibility with previous releases. I'm using it for both til and museums now.
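To illustrate the difference between string concatenation and bound parameters, here's a sketch using Python's sqlite3 module directly. This is not the plugin's API - the til table and search_titles function are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Invented table and row, for illustration only
conn.execute("create table til (topic text, title text)")
conn.execute("insert into til values ('sqlite', 'It''s a PRAGMA trick')")


def search_titles(conn, text):
    # The value travels separately from the SQL, so quotes need no escaping
    sql = "select title from til where title like '%' || :text || '%'"
    return [row[0] for row in conn.execute(sql, {"text": text})]
```

A search term containing a single quote, such as It's, works correctly here, and a hostile value like ' or 1=1 -- is treated as plain text to search for rather than as SQL.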

github-to-sqlite 2.0

I released github-to-sqlite 2.0 with a small backwards incompatible change to the database schema (hence the major version increment). It builds on 1.1 from a few days ago which added a new github-to-sqlite contributors command for fetching statistics on contributors to repositories.

More importantly, I improved the live demo running at github-to-sqlite.dogsheep.net.

The demo now updates once a day using GitHub Actions and pulls in releases, commits, issues, issue comments and contributors for all of my Dogsheep projects plus datasette and sqlite-utils.

This means I can browse and execute SQL queries across 929 issues, 1,505 commits and 132 releases across 14 repositories!

Want to see which of my projects have had the most releases? Facet releases by repo.

I've also installed the datasette-search-all plugin there, so you can search across all commits, releases, issues etc for "zeit now" for example.

Bringing all of my different project data together in one place like this is really powerful.

I think it's a great illustration of the Datasette/Dogsheep philosophy of pulling down a complete SQLite-powered copy of data from external services so you can query and join across your data without being limited to the functionality that those services provide through their own interfaces or APIs.

photos-to-sqlite alpha

Dogsheep is about bringing all of my interesting personal and social data into a single, private place.

The biggest thing missing at the moment is photos. I want to be able to query my photos with SQL, and eventually combine them with tweets, checkins etc in a unified timeline.

Last week I took a step towards this goal with heic-to-jpeg, a proxy to let me display my iPhone's HEIC photos online.

This week I started work on photos-to-sqlite - the set of tools which I'll use to turn my photos into something I can run queries against.

So far I've mainly been figuring out how to get them into an S3 bucket that I control. Once configured, running photos-to-sqlite upload photos.db ~/Pictures/Photos\ Library.photoslibrary/originals will start uploading every photo it can find in that directory to the S3 bucket.

The filename it uses is the sha256 hash of the photo file contents, which I'm hoping will let me de-dupe photos from multiple sources in the future. It also writes basic metadata on the photos to that photos.db SQLite database.

This is going to be a big project. I'm investigating osxphotos to liberate the metadata from Apple Photos, and various Python libraries for extracting EXIF data from the files themselves.

Once I've got that working, I can experiment with things like piping photos through Google Cloud Vision to label them based on their contents.

This is all a very, very early alpha at the moment, but I'm cautiously optimistic about progress so far.

Tags: github, projects, twitter, datasette, jsk, dogsheep, weeknotes, sqlite-utils

Weeknotes: Hacking on 23 different projects

2020-04-16T05:03:11+00:00

I wrote a lot of code this week: 184 commits over 23 repositories! I've also started falling for Zeit Now v2, having found workarounds for some of my biggest problems with it.

Better Datasette on Zeit Now v2

Last week I bemoaned the loss of Zeit Now v1 and documented my initial explorations of Zeit Now v2 with respect to Datasette.

My favourite thing about Now v1 was that it ran from Dockerfiles, which gave me complete control over the versions of everything in my deployment environment.

Now v2 runs on AWS Lambda, which means you are mostly stuck with what Zeit's flavour of Lambda gives you. This currently means Python 3.6 (not too terrible - Datasette fully supports it) and a positively ancient SQLite - 3.7.17 from May 2013.

Lambda runs on Amazon Linux. Charles Leifer maintains a package called pysqlite3 which bundles the latest version of SQLite3 as a standalone Python package, and includes a pysqlite3-binary package precompiled for Linux. Could it work on Amazon Linux...?

It turns out it does! A one-line change (not including tests) to my datasette-publish-now and it now deploys Datasette on Now v2 with SQLite 3.31.1 - the latest release from January this year, with window functions and all kinds of other goodness.
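The general pattern for preferring pysqlite3 when it is installed looks something like the sketch below. I'm not claiming this is the exact change made to datasette-publish-now; the two helper functions are hypothetical and exist just to show which SQLite version ended up being used.

```python
try:
    # pysqlite3-binary bundles its own, newer SQLite build
    import pysqlite3 as sqlite3
except ImportError:
    # Fall back to the standard library module when it is not installed
    import sqlite3


def sqlite_version_info():
    # e.g. "3.31.1" becomes (3, 31, 1)
    return tuple(int(part) for part in sqlite3.sqlite_version.split("."))


def supports_window_functions():
    # Window functions arrived in SQLite 3.25.0
    return sqlite_version_info() >= (3, 25, 0)
```

Because both modules expose the same interface, the rest of the code can keep using the name sqlite3 without caring which one was imported.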

This means that Now v2 is back to being a really solid option for hosting Datasette instances. You get scale-to-zero, crazily low prices and really fast cold-boot times. It can only take databases up to around 50MB - if you need more space than that you're better off with Cloud Run - but it's a great option for smaller data.

I released a few versions of datasette-publish-now as a result of this research. I plan to release the first non-alpha version at the same time as Datasette 0.40.

Various projects ported to Now v2 or Cloud Run

I had over 100 projects running on Now v1 that needed updating or deleting in time for that platform's shutdown in August. I've been porting some of them very quickly using datasette-publish-now, but a few have been more work. Some highlights from this week:

ftfy.now.sh, my web app that takes a string of broken unicode and figures out the sequence of transformations you can use to make sense of it (built on the incredible FTFY Python library by Robyn Speer) has been upgraded to Now v2 - repo here.
gzthermal.now.sh offers a web interface to the gzthermal gzip visualization tool, released by caveman on the encode.ru (now encode.su) forum. My repo is here.
My crowdsourced directory of range maps of cryptozoological creatures is now running on Cloud Run (I haven't figured out a way to run SpatiaLite on Now v2 yet).
The datasette-sqlite-fts4.datasette.io demo instance I used for explanations in Exploring search relevance algorithms with SQLite.
The demo instance used for datasette-jellyfish is on Now v2.
The demo for datasette-jq had to move to Cloud Run, because I couldn't install pyjq on Now v2.

big-local-datasette

I've been collaborating with the Big Local team at Stanford on a number of projects related to the Covid-19 situation. It's not quite open to the public yet but I've been building a Datasette instance which shares data from the "open projects" maintained by that team.

The implementation fits a common pattern for me: a scheduled GitHub Action which fetches project data from a GraphQL API, seeks out CSV files which have changed (using HTTP HEAD requests to check their ETags), loads the CSV into SQLite tables and publishes the resulting database using datasette publish cloudrun.

There's one interesting new twist: I'm fetching the existing database files on every run using my new datasette-clone tool (written for this project), applying changes to them and then only publishing if the resulting MD5 sums have changed since last time.

It seems to work well, and I'm excited about this technique as a way of incrementally updating existing databases using stateless code running in a GitHub Action.
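The "only publish if the MD5 sums changed" check can be sketched as follows. The function names and the dictionary of previously recorded checksums are assumptions for illustration, not the actual big-local-datasette code.

```python
import hashlib


def md5_of_file(path):
    # Read in chunks so large database files need not fit in memory
    digest = hashlib.md5()
    with open(path, "rb") as fp:
        for chunk in iter(lambda: fp.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def changed_databases(paths, previous_sums):
    # previous_sums maps path -> MD5 hex digest recorded before the update
    return [p for p in paths if md5_of_file(p) != previous_sums.get(p)]
```

A build script would record the checksums right after fetching the existing files, apply its updates, then skip the publish step entirely if changed_databases() returns an empty list.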

Datasette Cloud

I continue to work on the invite-only alpha of my SaaS Datasette platform, Datasette Cloud. This week I ported the CI and deployment scripts from GitLab to GitHub Actions, mainly to try and reduce the variety of CI systems I'm working with (I now have projects live on three: Travis, Circle CI and GitHub Actions).

I've also been figuring out ways of supporting API tokens for making requests to authentication-protected Datasette instances. I shipped small releases of datasette-auth-github and datasette-auth-existing-cookies to support this.

In tinkering with Datasette Cloud I also shipped an upgrade to datasette-mask-columns, which now shows visible REDACTED text on redacted columns in table view.

Miscellaneous

My covid-19.datasettes.com project now also imports data from the LA Times.
I added .rows_where(..., order_by="column") in release 2.6 of sqlite-utils.
I shipped a new release of paginate-json, a tool I built primarily for paginating through the GitHub API and piping the results to sqlite-utils.
I fixed a minor bug with Datasette's --plugin-secret mechanism and added a CSS customization hook for the canned query page.
I built a HEIC to JPEG converting proxy as part of my ongoing mission to eventually liberate my photos from Apple Photos and make them available to Dogsheep. In doing so I contributed usage documentation to the pyheif Python library.

Tags: projects, zeit-now, datasette, dogsheep, weeknotes, datasette-cloud

Weeknotes: Covid-19, First Python Notebook, more Dogsheep, Tailscale

2020-04-01T20:29:59+00:00

My covid-19.datasettes.com project publishes information on COVID-19 cases around the world. The project started out using data from Johns Hopkins CSSE, but last week the New York Times started publishing high quality USA county- and state-level daily numbers to their own repository. Here's the change that added the NY Times data.

It's very easy to use this data to accidentally build misleading things. I've been updating the README with links about this - my current favourite is Why It’s So Freaking Hard To Make A Good COVID-19 Model by Maggie Koerth, Laura Bronner and Jasmine Mithani at FiveThirtyEight.

First Python Notebook

Ben Welsh from the LA Times teaches a course called First Python Notebook at journalism conferences such as NICAR. He ran a free online version of the course last weekend, and I offered to help out as a TA.

Most of the help I provided came before the course: Ben asked attendees to confirm that they had working installations of Python 3 and pipenv, and if they didn't, volunteers such as myself would step in to help. I had Zoom and email conversations with at least ten people to help them get their environments into shape.

This XKCD neatly summarizes the problem:

One of the most common problems I had to debug was PATH issues: people had installed the software, but due to various environmental differences python3 and pipenv weren't available on the PATH. Talking people through the obscurities of creating a ~/.bashrc file and using it to define a PATH over-ride really helps emphasize how arcane this kind of knowledge is.
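A quick way to diagnose this kind of problem is to ask Python which commands it can find on the PATH. This is a small sketch rather than anything used in the course; the function names are my own.

```python
import os
import shutil


def missing_commands(commands, path=None):
    """Return the commands that cannot be found on the given PATH.

    path=None means "use the PATH of the current process".
    """
    return [name for name in commands if shutil.which(name, path=path) is None]


def path_entries(path=None):
    """Split a PATH string into its directories, in lookup order."""
    raw = os.environ.get("PATH", "") if path is None else path
    return [entry for entry in raw.split(os.pathsep) if entry]
```

Calling missing_commands(["python3", "pipenv"]) lists whichever of the two the shell would fail to find, and path_entries() shows the directories that are actually being searched.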

I enjoyed this comment:

"Welcome to intro to Tennis. In the first two weeks, we'll discuss how to rig a net and resurface a court." - Claus Wilke

Ben's course itself is hands down the best introduction to Python from a Data Journalism perspective I have ever seen. Within an hour of starting the students are using Pandas in a Jupyter notebook to find interesting discrepancies in California campaign finance data.

If you want to check it out yourself, the entire four hour workshop is now on YouTube and closely follows the material on firstpythonnotebook.org.

Coronavirus Diary

We are clearly living through a notable and very painful period of history right now. On the 19th of March (just under two weeks ago, but time is moving both really fast and incredibly slowly right now) I started a personal diary - something I've never done before. It lives in an Apple Note and I'm adding around a dozen paragraphs to it every day. I think it's helping. I'm sure it will be interesting to look back on in a few years' time.

Dogsheep

Much of my development work this past week has gone into my Dogsheep suite of tools for personal analytics.

I upgraded the entire family of tools for compatibility with sqlite-utils 2.x.
pocket-to-sqlite got a major upgrade: it now fetches items using Pocket's API pagination (previously it just tried to pull in 5,000 items in one go) and has the ability to only fetch new items. As a result I'm now running it from cron in my personal Dogsheep instance, so "Save to Pocket" is now my preferred Dogsheep-compatible way of bookmarking content.
twitter-to-sqlite got a couple of important new features in release 0.20. I fixed a nasty bug in the --since flag where retweets from other accounts could cause new tweets from an account to be ignored. I also added a new count_history table which automatically tracks changes to a Twitter user's friends, follower and listed counts over time (#40).
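To illustrate the idea behind a count_history table, here's a sketch that records a new row only when a count has actually changed. The schema and function name are simplified assumptions for illustration, not the real twitter-to-sqlite implementation.

```python
import sqlite3


def record_counts(conn, user_id, counts, now):
    """Append rows to count_history for any count that has changed.

    counts is a dict such as {"followers": 10, "friends": 5} and now is
    a timestamp string. Unchanged counts produce no new row, so the
    table stays a compact log of changes over time.
    """
    conn.execute(
        "create table if not exists count_history "
        "(user integer, type text, count integer, datetime text)"
    )
    for count_type, value in counts.items():
        latest = conn.execute(
            "select count from count_history "
            "where user = ? and type = ? order by rowid desc limit 1",
            (user_id, count_type),
        ).fetchone()
        if latest is None or latest[0] != value:
            conn.execute(
                "insert into count_history (user, type, count, datetime) "
                "values (?, ?, ?, ?)",
                (user_id, count_type, value, now),
            )
```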

I'm also now using Dogsheep for some journalism! I'm working with the Big Local News team at Stanford to help track and archive tweets by a number of different US politicians and health departments relating to the ongoing pandemic. This collaboration resulted in the above improvements to twitter-to-sqlite.

Tailscale

My personal Dogsheep is currently protected by client certificates, so only my personal laptop and iPhone (with the right certificates installed) can connect to the web server it is running on.

I spent a bit of time this week playing with Tailscale, and I'm really impressed by it.

Tailscale is a commercial company built on top of WireGuard, the new approach to VPN tunnels which just got merged into the Linux 5.6 kernel. Tailscale first caught my attention in January when they hired Brad Fitzpatrick.

WireGuard lets you form a private network by having individual hosts exchange public/private keys with each other. Tailscale provides software which manages those keys for you, making it trivial to set up a private network between different nodes.

How trivial? It took me less than ten minutes to get a three-node private network running between my iPhone, laptop and a Linux server. I installed the iPhone app, the Ubuntu package, the OS X app, signed them all into my Google account and I was done.

Each of those devices now has an additional IP address in the 100.x range which they can use to talk to each other. Tailscale guarantees that the IP address will stay constant for each of them.
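Those 100.x addresses come from the carrier-grade NAT block, 100.64.0.0/10. Here's a small sketch for checking whether an address falls inside that block; the function name is made up, and a match only tells you the address is in the range, not that Tailscale assigned it.

```python
import ipaddress

# The shared address space (RFC 6598) that the 100.x addresses come from
CGNAT_RANGE = ipaddress.ip_network("100.64.0.0/10")


def in_tailscale_range(address):
    """Return True if the address is inside 100.64.0.0/10."""
    return ipaddress.ip_address(address) in CGNAT_RANGE
```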

Since the network is public/private key encrypted between the nodes, Tailscale can't see any of my traffic - they're purely acting as a key management mechanism. And it's free: Tailscale charge for networks with multiple users, but a personal network like this is free of charge.

I'm not running my own personal Dogsheep on it yet, but I'm tempted to switch over. I'd love other people to start running their own personal Dogsheep instances but I'm paranoid about encouraging this when securing them is so important. Tailscale looks like it might be a great solution for making secure personal infrastructure more easily and widely available.

Tags: brad-fitzpatrick, data-journalism, projects, python, teaching, datasette, dogsheep, weeknotes, tailscale, covid19, ben-welsh

Weeknotes: Datasette 0.39 and many other projects

2020-03-25T05:33:19+00:00

This week's theme: Well, I'm not going anywhere. So a ton of progress to report on various projects.

Datasette 0.39

This evening I shipped Datasette 0.39. The two big features are a mechanism for setting the default sort order for tables and a new base_url configuration setting.

You can see the new default sort order in action on my Covid-19 project - the daily reports now default to sort by day descending so the most recent figures show up first. Here's the metadata that makes it happen, and here's the new documentation.
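As a rough illustration, the metadata for a default descending sort has this shape. The sort and sort_desc keys are the table-level settings Datasette reads; the database, table and column names used in the example are placeholders rather than the ones from the Covid-19 project.

```python
import json


def default_sort_metadata(database, table, column, descending=True):
    """Build a Datasette metadata dict that sets a table's default sort."""
    key = "sort_desc" if descending else "sort"
    return {"databases": {database: {"tables": {table: {key: column}}}}}


# Placeholder names, written out as the contents of a metadata.json file
metadata_json = json.dumps(
    default_sort_metadata("covid", "daily_reports", "day"), indent=2
)
```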

I had to do some extra work on that project this morning when the underlying data changed its CSV column headings without warning.

The base_url feature has been an open issue since January 2019. It lets you run Datasette behind a proxy on a different URL prefix - /tools/datasette/ for example. The trigger for finally getting this solved was a Twitter conversation about running Datasette on Binder in coordination with a Jupyter notebook.

Tony Hirst did some work on this last year, but was stumped by the lack of a base_url equivalent. Terry Jones shared an implementation in December. I finally found the inspiration to pull it all together, and ended up with a working fork of Tony's project which does indeed launch Datasette on Binder - try launching your own here.

github-to-sqlite

I've not done much work on my Dogsheep family of tools in a while. That changed this week: in particular, I shipped a 1.0 of github-to-sqlite.

As you might expect, it's a tool for importing GitHub data into a SQLite database. Today it can handle repositories, releases, release assets, commits, issues and issue comments. You can see a live demo built from Dogsheep organization data at github-to-sqlite.dogsheep.net (deployed by this GitHub action).

I built this tool primarily to help me better keep track of all of my projects. Pulling the issues into a single database means I can run queries against all open issues across all of my repositories, and importing commits and releases is handy for when I want to write my weeknotes and need to figure out what I've worked on lately.
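A query across all open issues might look something like this. The schema is a deliberately simplified assumption (a repos table with id and full_name, and an issues table with repo, number, title and state), so the real tables created by the tool will have more columns than this.

```python
import sqlite3

OPEN_ISSUES_SQL = """
select repos.full_name, issues.number, issues.title
from issues
join repos on issues.repo = repos.id
where issues.state = 'open'
order by repos.full_name, issues.number
"""


def open_issues(conn):
    """Return (repo, issue number, title) for every open issue."""
    return conn.execute(OPEN_ISSUES_SQL).fetchall()
```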

datasette-render-markdown

GitHub issues use Markdown. To correctly display them it's useful to be able to render that Markdown. I built datasette-render-markdown back in November, but this week I made some substantial upgrades: you can now configure which columns should be rendered, and it includes support for Markdown extensions including GitHub-Flavored Markdown.

You can see it in action on the github-to-sqlite demo.

I also upgraded datasette-render-timestamps with the same explicit column configuration pattern.

datasette-publish-fly

Fly is a relatively new hosting provider which lets you host applications bundled as Docker containers in load-balanced data centers geographically close to your users.

It has a couple of characteristics that make it a really good fit for Datasette.

Firstly, the pricing model: Fly will currently host a tiny (128MB of RAM) container for $2.67/month - and they give you $10/month of free service credit, enough for 3 containers.

It turns out Datasette runs just fine in 128MB of RAM, so that's three always-on Datasette containers! (Unlike Heroku and Cloud Run, Fly keeps your containers running rather than scaling them to zero).

Secondly, it works by shipping it a Dockerfile. This means building datasette publish support for it is really easy.

I added the publish_subcommand plugin hook to Datasette all the way back in 0.25 in September 2018, but I've never actually built anything with it. That's now changed: datasette-publish-fly uses the hook to add a datasette publish fly command for publishing databases directly to your Fly account.

hacker-news-to-sqlite

It turns out I created my Hacker News account in 2007, and I've posted 2,167 comments and submitted 131 stories since then. Since my personal Dogsheep project is about pulling my data from multiple sources into a single place it made sense to build a tool for importing from Hacker News.

hacker-news-to-sqlite uses the official Hacker News API to import every comment and story posted by a specific user. It can also use one or more item IDs to suck the entire discussion tree around those items.
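Fetching a whole discussion tree comes down to following the kids list on each item. Here's a minimal sketch of that traversal against the public item endpoint; it's an illustration rather than the tool's own code, and the fetch function is a parameter so the walk can be tried out with canned data.

```python
import json
import urllib.request

ITEM_URL = "https://hacker-news.firebaseio.com/v0/item/{}.json"


def fetch_item(item_id):
    """Fetch a single story or comment from the Hacker News API."""
    with urllib.request.urlopen(ITEM_URL.format(item_id)) as response:
        return json.load(response)


def fetch_tree(item_id, fetch=fetch_item):
    """Return the item and all of its descendants, depth-first."""
    items = []
    stack = [item_id]
    while stack:
        item = fetch(stack.pop())
        if item is None:
            # Missing items come back as null, so skip them
            continue
        items.append(item)
        # reversed() so children come off the stack in their listed order
        stack.extend(reversed(item.get("kids", [])))
    return items
```

Each dictionary in the returned list could then be inserted as a row in a SQLite table.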

The README includes detailed documentation on how to best browse your data using Datasette once you have imported it.

Other projects

sqlite-utils gained some improvements to the way it suggests types for existing columns.
twitter-to-sqlite now offers --sql and --attach for more of its subcommands.
datasette-show-errors is a new plugin which exposes 500 errors as tracebacks, like Django does with DEBUG=True. It's built on top of Starlette's ServerErrorMiddleware.
I upgraded inaturalist-to-sqlite to work with sqlite-utils 2.x.

Tags: github, projects, sqlite, markdown, jupyter, datasette, dogsheep, weeknotes, fly

hacker-news-to-sqlite

2020-03-21T04:27:04+00:00

hacker-news-to-sqlite

The latest in my Dogsheep series of tools: hacker-news-to-sqlite uses the Hacker News API to fetch your comments and submissions from Hacker News and save them to a SQLite database.

Via @simonw

Tags: hacker-news, projects, sqlite, dogsheep

pinboard-to-sqlite

2019-11-07T20:46:38+00:00

pinboard-to-sqlite

Jacob Kaplan-Moss just released the second Dogsheep tool that wasn’t written by me (after goodreads-to-sqlite by Tobias Kunze). This one imports your Pinboard bookmarks. The repo includes a really clean minimal example of how to use GitHub actions to run tests and release packages to PyPI.

Via @jacobian

Tags: github, jacob-kaplan-moss, pypi, pinboard, dogsheep

Weeknotes: PG&E outages, and Open Source works!

2019-10-14T17:00:28+00:00

My big focus this week was the PG&E outages project. I'm really pleased with how this turned out: the San Francisco Chronicle used data from it for their excellent PG&E outage interactive (mixing in data on wind conditions) and it earned a bunch of interest on Twitter and some discussion on Hacker News.

I gave a talk about the project on Thursday for the Online News Association Silicon Valley meetup and made connections with a bunch of interesting journalists from around the Bay Area.

Open Source works!

A highlight of last week was that a bunch of my projects gained new functionality through open source contributions!

I landed three pull requests to Datasette from Tobias Kunze - a metadata fix, smarter database ordering and a long-wanted button to auto-indent SQL queries.
datasette-leaflet-json is a plugin that I built last year but never promoted because it had a frustrating bug in it. Chris Shaw spotted and fixed the bug, and bumped the Leaflet version dependency too. These changes are now shipped as version 0.3.
I quietly shipped a new plugin, datasette-render-timestamps, last week which identifies columns containing Unix-style timestamp integers and renders them as a readable string. Chris spotted that too, and added a feature to let you configure the formatting in a metadata.json plugin setting. It was the best kind of pull request, incorporating the fix, the tests and the documentation update all in a single commit.
Ishan Anand plans to use datasette-auth-github to hook into nginx authentication. He submitted a PR to refactor the GithubAuth class to support building that as a separate project.
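To give a feel for what rendering Unix timestamps involves, here's a sketch of a detection heuristic and a formatter. Both are assumptions for illustration: the plugin's real detection rules and default format may well differ.

```python
from datetime import datetime, timezone

# Assumed plausible range: the years 2000 to 2100, as seconds since the epoch
EARLIEST = int(datetime(2000, 1, 1, tzinfo=timezone.utc).timestamp())
LATEST = int(datetime(2100, 1, 1, tzinfo=timezone.utc).timestamp())


def looks_like_timestamp_column(values):
    """Guess whether a column holds Unix timestamps.

    True when there is at least one non-null value and every non-null
    value is an integer inside the plausible range.
    """
    present = [value for value in values if value is not None]
    if not present:
        return False
    return all(
        isinstance(value, int)
        and not isinstance(value, bool)
        and EARLIEST <= value <= LATEST
        for value in present
    )


def render_timestamp(value, fmt="%B %d, %Y - %H:%M:%S UTC"):
    """Format a Unix timestamp as a readable UTC string."""
    return datetime.fromtimestamp(value, tz=timezone.utc).strftime(fmt)
```

The fmt argument plays the role of the configurable formatting option described above.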

Dogsheep

I started a very basic website for my Dogsheep personal analytics project.

I also started running various Dogsheep tools via cron on my personal Dogsheep server, to keep that set of databases automatically updated with my latest activity on various services.

Most excitingly, Tobias Kunze built the first independent Dogsheep-style tool: goodreads-to-sqlite!

As the name suggests, it imports your data from Goodreads into a SQLite database. It inspired me to create a Goodreads account which I will be using to track my book reading activity from now on.

Tobias wrote a fantastic blog post introducing the tool which includes some neat example queries and graphs.

In other Dogsheep news, I added an issue-comments command to github-to-sqlite for fetching all issue comments in a repo. My goal is to evolve that tool to the point where it can import all relevant data from all of my repositories and give me a single Datasette-powered dashboard for keeping track of everything in one place.

Tags: open-source, projects, datasette, dogsheep, weeknotes, leaflet

goodreads-to-sqlite

2019-10-14T04:07:53+00:00

goodreads-to-sqlite

This is so cool! Tobias Kunze built a Python CLI tool to import your Goodreads data into a SQLite database, inspired by github-to-sqlite and my various other Dogsheep tools. It’s the first Dogsheep style tool I’ve seen that wasn’t built by me—and Tobias’ write-up includes some neat examples of queries you can run against your Goodreads data. I’ve now started using Goodreads and I’m importing my books into my own private Dogsheep Datasette instance.

Via @rixxtr

Tags: books, cli, sqlite, datasette, dogsheep

Weeknotes: Dogsheep

2019-10-07T19:33:14+00:00

Having figured out my Stanford schedule, this week I started getting back into the habit of writing some code.

Dogsheep

Dogsheep is the collective name I've given to a suite of tools I'm building around the concept of personal analytics.

I generate a lot of data, and while much of it ends up in the silos of the internet giants, thanks to the GDPR most of those silos now feature an "export a copy of your data" button.

Wouldn't it be cool if you could convert that data into a SQLite database and then use Datasette to run queries against it?

So that's what I'm doing! The tools I've built so far include:

healthkit-to-sqlite for my Apple HealthKit data (mostly collected by my Apple Watch).
twitter-to-sqlite, by far the most developed tool. For Dogsheep purposes it lets me import my tweets and the tweets I have favourited, but it's growing all kinds of other useful features for retrieving and analyzing data from Twitter. More on this in my previous weeknotes.
swarm-to-sqlite for my Foursquare Swarm checkins.
inaturalist-to-sqlite for my iNaturalist observations.
google-takeout-to-sqlite for the wealth of data available from Google Takeout. I've barely scratched the surface with this one but it does have the ability to export my location history from Google Maps - 215,000 latitude/longitude/timestamp records dating back to July 2015 (according to select min(timestamp) from location_history)!
github-to-sqlite for importing my GitHub repositories and the repositories I have starred.
pocket-to-sqlite (built over this weekend) for articles I have saved to Pocket.
genome-to-sqlite for my 23andMe genome, because being able to run SQL queries against my genetic code is really funny.

My biggest achievement this week is that my own personal Dogsheep is no longer vaporware! I got a Datasette instance running on an AWS Lightsail VPS (using the systemd recipe from this ticket) and started loading in copies of my data.

Securing this instance well is important. In addition to running it behind datasette-auth-github I've followed up on a long-term ambition of deploying something protected by client certificates. If your browser doesn't have the required certificate installed it gets rejected by nginx before it's even proxied through to Datasette.

I have the certificate successfully installed on my iPhone and my laptop browsers. Setting this all up wasn't particularly straight-forward - I ended up following this excellent tutorial by Nathan Wittstock for the nginx part of it, then randomly Googling for tips on installing the certificate in Firefox and on my iPhone.

I think there's a great opportunity for tooling that makes this much easier. Adding that to my maybe-someday pile.

datasette-auth-github asset caching

Dogsheep is a great opportunity for eating my own dogfood. I've been having a lot of fun using it to explore my data... but I noticed that there was a suspicious delay every time I clicked a link. Datasette is mostly JavaScript-free but usually it responds so fast (~100ms or less) that clicking around feels "app-like" - but here I was really feeling the delay, often of more than a second.

I fired up the Firefox Network inspector and quickly spotted what was happening. I have the datasette-vega plugin installed so I can create charts, and that weighs in at nearly 1MB of JavaScript. According to Firefox my browser was loading that entire 1MB file from scratch for every page load!

The culprit there turned out to be my datasette-auth-github plugin.

As detailed in issue #6, the ASGI middleware adds a cache-control: private header to every HTTP response - to make sure authentication-required content isn't accidentally cached by any intermediary proxies (I originally designed Datasette to run effectively behind Varnish and Cloudflare).

This was applied to everything... including the static JavaScript and CSS served up by the plugin!

To fix this, I shipped a new release of datasette-auth-github with a cacheable_prefixes setting, which defaults to /-/static/ and /-/static-plugins/ when run as a Datasette plugin. See issue 47 for details.
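The shape of that fix can be sketched as a small ASGI middleware. This is an illustration of the technique, not the plugin's actual source: the function names are my own, and demo_app and response_headers exist only to exercise the wrapper.

```python
import asyncio


def private_cache_middleware(
    app, cacheable_prefixes=("/-/static/", "/-/static-plugins/")
):
    """Wrap an ASGI app so responses get cache-control: private,
    except for paths that start with one of cacheable_prefixes."""
    prefixes = tuple(cacheable_prefixes)

    async def wrapped(scope, receive, send):
        if scope["type"] != "http" or scope["path"].startswith(prefixes):
            # Static assets (and non-HTTP traffic) pass through untouched
            await app(scope, receive, send)
            return

        async def send_private(event):
            if event["type"] == "http.response.start":
                headers = [
                    (name, value)
                    for name, value in event.get("headers", [])
                    if name.lower() != b"cache-control"
                ]
                headers.append((b"cache-control", b"private"))
                event = {**event, "headers": headers}
            await send(event)

        await app(scope, receive, send_private)

    return wrapped


async def demo_app(scope, receive, send):
    """Tiny stand-in application used only to exercise the middleware."""
    await send({"type": "http.response.start", "status": 200,
                "headers": [(b"content-type", b"text/plain")]})
    await send({"type": "http.response.body", "body": b"ok"})


def response_headers(path):
    """Run demo_app behind the middleware and return its response headers."""
    events = []

    async def send(event):
        events.append(event)

    async def receive():
        return {"type": "http.request", "body": b"", "more_body": False}

    scope = {"type": "http", "path": path}
    asyncio.run(private_cache_middleware(demo_app)(scope, receive, send))
    return dict(events[0]["headers"])
```

With this in place a request for a regular page picks up the private header, while anything under the static prefixes is left alone and so remains cacheable by the browser.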

Areas of research

Connection pooling

My most significant ongoing project for Datasette right now concerns database connection pooling.

For Datasette Library I want to be able to support potentially hundreds of attached databases, and Datasette Edit calls for the ability to use a writable (as opposed to read-only) connection.

This means I need to get smarter about database connections, and last week I finally made some solid progress on figuring out more advanced connection pooling. Still more work to go but I'm finally feeling unblocked on this after sweating over it for a couple of months without writing any code.

I also added a /-/threads page to Datasette (demo) to make it easier to understand the threads being run by the application.

Recursive CTEs in SQLite

twitter-to-sqlite generates a tweets table with both id and in_reply_to_status_id (nullable) columns.

SQLite supports recursive CTEs, documented here. In theory, this means it should be possible to write SQL queries that analyze Twitter threads - showing me the tweets that are at the bottom of the longest reply chain, for example.

I was feeling lazy and asked if anyone knew how to do that on Twitter. Robin Houston, Matthew Somerville and Piers Cawley all chipped in with useful tips, which I've collected in this gist.

I particularly enjoyed learning how Piers uses recursive SQL queries to help run his bakery.
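For reference, here's a minimal sketch of a recursive CTE that finds the tweet at the bottom of the longest reply chain. It is my own illustration rather than one of the queries from the gist, and it assumes a tweets table with just the id and in_reply_to_status_id columns. Replies whose parent tweet is not in the table are never reached by this query.

```python
import sqlite3

DEEPEST_REPLY_SQL = """
with recursive thread(id, depth) as (
    -- Start from tweets that are not replies to anything
    select id, 0 from tweets where in_reply_to_status_id is null
    union all
    -- Then repeatedly add the replies to tweets already in the thread
    select tweets.id, thread.depth + 1
    from tweets
    join thread on tweets.in_reply_to_status_id = thread.id
)
select id, depth from thread order by depth desc, id limit 1
"""


def deepest_reply(conn):
    """Return (tweet id, depth) for the end of the longest reply chain."""
    return conn.execute(DEEPEST_REPLY_SQL).fetchone()
```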

logging

I finally made some progress on getting my head around the Python standard library logging module, a long-time nemesis.

The breakthrough was to sit down and actually read the source code - which helped me understand that the entire point of the library is to be a globally configured logging system that makes careful use of threading primitives to ensure you can log without worrying about concurrency.

I've been worrying about that aspect of logging for years, so it's reassuring to see that Vinay Sajip has that comprehensively covered already.
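A small experiment makes the point: several threads can share one logger and handler without any extra locking, because each handler serializes its writes with its own lock. The logger name and function below are made up for the demonstration.

```python
import io
import logging
import threading


def log_from_threads(thread_count=8, messages_per_thread=100):
    """Log from several threads at once and return the captured lines."""
    stream = io.StringIO()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(threadName)s %(message)s"))

    logger = logging.getLogger("concurrency_demo")
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep the demo output out of the root logger
    logger.addHandler(handler)

    def worker():
        for number in range(messages_per_thread):
            logger.info("message %d", number)

    threads = [threading.Thread(target=worker) for _ in range(thread_count)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()

    logger.removeHandler(handler)
    return stream.getvalue().splitlines()
```

Every message arrives as one intact line, with no interleaving inside a line, even though no locking appears anywhere in the calling code.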

Easier installation for Datasette

I want journalists to be able to install Datasette. I don't want to have to talk them through installing xcode to install homebrew to install Python 3 first!

Raffaele Messuti did some great work investigating pyinstaller for this a while ago. I started a conversation on Twitter a few days ago after investigating pex (neat but not quite what I want, since users still have to install Python). From that thread I got some interesting further leads:

Freezing your code in The Hitchhiker's Guide to Python describes the problem space in general.
Briefcase is the part of the BeeWare suite that deals with converting a Python project into a standalone native application (for a bunch of different platforms).
PyOxidizer is a relatively new tool for building standalone executables that bundle a Python interpreter, on top of the Rust and Cargo build ecosystem.
Conda Constructor helps you construct an installer for a group of Conda packages (I need to figure out Conda).

Next step: actively try some of these out and see what I can do with them.

My longer term goal is to package Datasette up with Electron and make it available as a regular Windows and OS X application, for people who aren't comfortable using the command-line directly.

Books

I've been mostly working through required reading for my Stanford courses. Strategic Communication in particular has some excellent text books:

Speaking Up without Freaking Out: 50 Techniques for Confident and Compelling Presenting by Matthew Abrahams - despite the clickbait listicle title, this book is really good. I'm not a nervous public speaker, but I'm finding all kinds of actionable advice here on giving higher quality talks. And it's nice and short.
Resonate: Present Visual Stories that Transform Audiences by Nancy Duarte. This digs deep into the Hero's journey as a methodology for presenting, but with an interesting twist: it encourages you to think about your audience's journey and how you can take them from unaware, to skeptical, to tentatively committed, to excited and convinced.

Tags: projects, sqlite, dogsheep, weeknotes

twitter-to-sqlite 0.6, with track and follow

2019-10-06T04:54:33+00:00

twitter-to-sqlite 0.6, with track and follow

I shipped a new release of my twitter-to-sqlite command-line tool this evening. It now includes experimental features for subscribing to the Twitter streaming API: you can track keywords or follow users and matching Tweets will be written to a SQLite database in real-time as they come in through the API. Since Datasette supports mutable databases now you can run Datasette against the database and run queries against the tweets as they are inserted into the tables.

Tags: projects, realtime, twitter, dogsheep

Client-Side Certificate Authentication with nginx

2019-10-05T17:26:35+00:00

Client-Side Certificate Authentication with nginx

I’m intrigued by client-side browser certificates, which allow you to lock down a website such that only browsers with a specific certificate installed can access them. They work on both laptops and mobile phones. I followed the steps in this tutorial and managed to get an nginx instance running which only allows connections from my personal laptop and iPhone.

Tags: certificates, nginx, security, dogsheep
