<?xml version="1.0" encoding="utf-8"?>
<feed xml:lang="en-us" xmlns="http://www.w3.org/2005/Atom"><title>Simon Willison's Weblog: dogsheep</title><link href="http://simonwillison.net/" rel="alternate"/><link href="http://simonwillison.net/tags/dogsheep.atom" rel="self"/><id>http://simonwillison.net/</id><updated>2023-03-09T06:04:28+00:00</updated><author><name>Simon Willison</name></author><entry><title>apple-notes-to-sqlite</title><link href="https://simonwillison.net/2023/Mar/9/apple-notes-to-sqlite/#atom-tag" rel="alternate"/><published>2023-03-09T06:04:28+00:00</published><updated>2023-03-09T06:04:28+00:00</updated><id>https://simonwillison.net/2023/Mar/9/apple-notes-to-sqlite/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/dogsheep/apple-notes-to-sqlite"&gt;apple-notes-to-sqlite&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
With the help of ChatGPT I finally figured out just enough AppleScript to automate the export of my notes to a SQLite database. AppleScript is a notoriously read-only language, which it turns out makes it a killer app for LLM-assisted coding.

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://til.simonwillison.net/gpt3/chatgpt-applescript"&gt;Using ChatGPT to write AppleScript&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/apple"&gt;apple&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sqlite"&gt;sqlite&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/dogsheep"&gt;dogsheep&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/chatgpt"&gt;chatgpt&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/applescript"&gt;applescript&lt;/a&gt;&lt;/p&gt;



</summary><category term="apple"/><category term="projects"/><category term="sqlite"/><category term="ai"/><category term="dogsheep"/><category term="generative-ai"/><category term="chatgpt"/><category term="applescript"/></entry><entry><title>Google Drive to SQLite</title><link href="https://simonwillison.net/2022/Feb/20/google-drive-to-sqlite/#atom-tag" rel="alternate"/><published>2022-02-20T22:47:01+00:00</published><updated>2022-02-20T22:47:01+00:00</updated><id>https://simonwillison.net/2022/Feb/20/google-drive-to-sqlite/#atom-tag</id><summary type="html">
    &lt;p&gt;I released a new tool this week: &lt;a href="https://datasette.io/tools/google-drive-to-sqlite"&gt;google-drive-to-sqlite&lt;/a&gt;. It's a CLI utility for fetching metadata about files in your &lt;a href="https://drive.google.com/"&gt;Google Drive&lt;/a&gt; and writing them to a local SQLite database.&lt;/p&gt;
&lt;p&gt;It's pretty fun!&lt;/p&gt;
&lt;p&gt;Here's how to create a SQLite database of every file you've starred in your Google Drive, including both files created in Google Docs/Sheets and files you've uploaded to your drive:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;% pip install google-drive-to-sqlite
% google-drive-to-sqlite auth
Visit the following URL to authenticate with Google Drive

https://accounts.google.com/o/oauth2/v2/auth?access_type=offline&amp;amp;...

Then return here and paste in the resulting code:
Paste code here: 
# Authentication is now complete, so run:
% google-drive-to-sqlite files starred.db --starred
% ls -lah starred.db
-rw-r--r--@ 1 simon  staff    40K Feb 20 14:14 starred.db
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The OAuth client ID it is using hasn't been verified by Google yet, which I think means that only the first 100 people to use it will be able to authenticate. If you need to, you can work around that by creating your own client ID, as &lt;a href="https://datasette.io/tools/google-drive-to-sqlite#user-content-authentication"&gt;described in the README&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Having created that &lt;code&gt;starred.db&lt;/code&gt; file you can explore the resulting database using &lt;a href="https://datasette.io/"&gt;Datasette&lt;/a&gt; or &lt;a href="https://datasette.io/desktop"&gt;Datasette Desktop&lt;/a&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;datasette starred.db

# or if you have the Datasette Desktop macOS app installed:
open starred.db
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here's Datasette running against one of my larger metadata collections:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2022/google-drive-to-sqlite.png" alt="Screenshot showing the drive_files, drive_folders and drive_users tables" style="max-width:100%;" /&gt;&lt;/p&gt;
&lt;h4&gt;Why build this?&lt;/h4&gt;
&lt;p&gt;I recently got involved with a participatory journalism project, where a team of reporters have used FOIA requests to gather a huge corpus of thousands of files. The files are in a complex folder hierarchy in a Google Drive. I wanted to start getting a feel for what's in there.&lt;/p&gt;
&lt;p&gt;Pulling the metadata - file names, sizes, file types, file owners, creation dates - into a SQLite database felt like a great way to start understanding the size and scope of what had been collected so far.&lt;/p&gt;
&lt;p&gt;Outside of that project, there's something very exciting to me about being able to use Google Drive to collate all kinds of different data and then tie it into the larger Datasette and &lt;a href="https://dogsheep.github.io/"&gt;Dogsheep&lt;/a&gt; ecosystems. I think there's a lot of potential here for all kinds of interesting projects.&lt;/p&gt;
&lt;h4&gt;How it works&lt;/h4&gt;
&lt;p&gt;The tool is written in Python using &lt;a href="https://click.palletsprojects.com/"&gt;Click&lt;/a&gt; (based on my &lt;a href="https://github.com/simonw/click-app"&gt;click-app template&lt;/a&gt;) and &lt;a href="https://sqlite-utils.datasette.io/"&gt;sqlite-utils&lt;/a&gt;. It works by calling the &lt;a href="https://developers.google.com/drive/api"&gt;Google Drive API&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;auth&lt;/code&gt; command needs to get hold of an OAuth access token scoped to make read-only calls to the user's Google Drive contents.&lt;/p&gt;
&lt;p&gt;This took a bit of figuring out. I wrote up what I learned in this TIL: &lt;a href="https://til.simonwillison.net/googlecloud/google-oauth-cli-application"&gt;Google OAuth for a CLI application&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Notably, the end result of that flow is a JSON response containing both an &lt;code&gt;access_token&lt;/code&gt; and a &lt;code&gt;refresh_token&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;The access token can be used to make authenticated API calls, but it expires after an hour and that expiration cannot be extended.&lt;/p&gt;
&lt;p&gt;The refresh token lasts forever, and can be used at any time to obtain a fresh access token.&lt;/p&gt;
&lt;p&gt;So the &lt;code&gt;auth&lt;/code&gt; command writes the refresh token to a file called &lt;code&gt;auth.json&lt;/code&gt;, then future calls to other commands use that token to retrieve a fresh access token on every run.&lt;/p&gt;
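&lt;p&gt;As a rough illustration of that refresh step, here is a minimal Python sketch of trading a stored refresh token for a new access token. It assumes Google's standard OAuth 2.0 token endpoint, and the helper names &lt;code&gt;build_refresh_request&lt;/code&gt; and &lt;code&gt;fetch_access_token&lt;/code&gt; are hypothetical - this is not the tool's actual implementation:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import json
import urllib.parse
import urllib.request

# Assumption: Google's standard OAuth 2.0 token endpoint
TOKEN_URL = "https://oauth2.googleapis.com/token"


def build_refresh_request(refresh_token, client_id, client_secret):
    """Build the POST request that trades a refresh token for an access token."""
    body = urllib.parse.urlencode(
        {
            "grant_type": "refresh_token",
            "refresh_token": refresh_token,
            "client_id": client_id,
            "client_secret": client_secret,
        }
    ).encode("utf-8")
    return urllib.request.Request(TOKEN_URL, data=body, method="POST")


def fetch_access_token(refresh_token, client_id, client_secret):
    """Send the request and return the short-lived access token."""
    request = build_refresh_request(refresh_token, client_id, client_secret)
    with urllib.request.urlopen(request) as response:
        return json.load(response)["access_token"]
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Splitting request construction from the network call keeps the first half easy to inspect without touching Google's servers.&lt;/p&gt;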
&lt;p&gt;The most useful command is &lt;a href="https://datasette.io/tools/google-drive-to-sqlite#user-content-google-drive-to-sqlite-files"&gt;google-drive-to-sqlite files&lt;/a&gt;, which retrieves file metadata based on various criteria, then either writes that to a SQLite database or dumps it out as JSON or newline-delimited JSON. It does this by paginating through results from the Google Drive &lt;a href="https://developers.google.com/drive/api/v3/reference/files/list"&gt;files list API&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;files --folder ID&lt;/code&gt; option is a special case. It retrieves every nested file and subfolder starting at the specified folder. The Google Drive API doesn't support this operation directly, so the tool instead has to recursively call directory listings on every folder until it has pulled back all of the data. See my TIL &lt;a href="https://til.simonwillison.net/googlecloud/recursive-fetch-google-drive"&gt;Recursively fetching metadata for all files in a Google Drive folder&lt;/a&gt; for more details.&lt;/p&gt;
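&lt;p&gt;The general shape of that traversal can be sketched in a few lines of Python. This version uses an explicit stack rather than recursive function calls, and &lt;code&gt;list_children&lt;/code&gt; is a hypothetical callable standing in for one directory listing request - the linked TIL describes the approach the tool really takes:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# MIME type that the Google Drive API uses to mark folders
FOLDER_MIME_TYPE = "application/vnd.google-apps.folder"


def walk_folder(folder_id, list_children):
    """Yield every file and subfolder nested beneath folder_id.

    list_children is a hypothetical callable returning the direct children
    of one folder as a list of dicts with "id" and "mimeType" keys.
    """
    pending = [folder_id]
    while pending:
        current = pending.pop()
        for item in list_children(current):
            yield item
            # Subfolders get queued so their contents are listed too
            if item["mimeType"] == FOLDER_MIME_TYPE:
                pending.append(item["id"])
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Each folder costs at least one API call, which goes some way to explaining why large hierarchies are slow to fetch.&lt;/p&gt;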
&lt;p&gt;This operation took over an hour for the largest folder I tested it against! So long that the access token it was using expired and I had to &lt;a href="https://github.com/simonw/google-drive-to-sqlite/issues/11"&gt;implement code&lt;/a&gt; to refresh the token in the middle of the operation.&lt;/p&gt;
&lt;h4&gt;Some other neat tricks&lt;/h4&gt;
&lt;p&gt;The &lt;a href="https://datasette.io/tools/google-drive-to-sqlite#user-content-google-drive-to-sqlite-download-file_id"&gt;download command&lt;/a&gt; downloads the specified file to disk:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;google-drive-to-sqlite download \
  0B32uDVNZfiEKLUtIT1gzYWN2NDI4SzVQYTFWWWxCWUtvVGNB
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;It detects the file type and uses that as the extension - in the above example, it saves the file as &lt;code&gt;0B32uDVNZfiEKLUtIT1gzYWN2NDI4SzVQYTFWWWxCWUtvVGNB.pdf&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;The &lt;a href="https://datasette.io/tools/google-drive-to-sqlite#user-content-google-drive-to-sqlite-export-format-file_id"&gt;export command&lt;/a&gt; only works against the file IDs for docs, sheets and presentations created using Google Apps. It can export to a variety of different formats:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;google-drive-to-sqlite export html \
  10BOHGDUYa7lBjUSo26YFCHTpgEmtXabdVFaopCTh1vU
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This writes to &lt;code&gt;10BOHGDUYa7lBjUSo26YFCHTpgEmtXabdVFaopCTh1vU-export.html&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;The &lt;a href="https://datasette.io/tools/google-drive-to-sqlite#user-content-google-drive-to-sqlite-get-url"&gt;get command&lt;/a&gt; takes a URL to a Google Drive API endpoint and fetches it using a valid access token. This is a great tool for debugging and API exploration - my &lt;code&gt;github-to-sqlite&lt;/code&gt; tool &lt;a href="https://datasette.io/tools/github-to-sqlite#user-content-making-authenticated-api-calls"&gt;has this too&lt;/a&gt;.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;google-drive-to-sqlite get 'https://www.googleapis.com/drive/v3/about?fields=*'
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;It also knows how to paginate! Adding &lt;code&gt;--paginate files&lt;/code&gt; will cause it to fetch all of the subsequent pages of the API and return just the items from the &lt;code&gt;"files"&lt;/code&gt; key combined into a single JSON array, for example:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;google-drive-to-sqlite get \
  https://www.googleapis.com/drive/v3/files \
  --paginate files
&lt;/code&gt;&lt;/pre&gt;
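&lt;p&gt;Pagination along these lines can be sketched in Python as follows. The sketch assumes responses carry a &lt;code&gt;nextPageToken&lt;/code&gt; field, as the Drive files list API does, and &lt;code&gt;fetch_page&lt;/code&gt; is a hypothetical callable that performs one request:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;def paginate(fetch_page, key):
    """Collect the items stored under key from every page of results.

    fetch_page is a hypothetical callable: given a page token (or None for
    the first page) it returns one decoded JSON response as a dict.
    """
    items = []
    page_token = None
    while True:
        page = fetch_page(page_token)
        items.extend(page.get(key, []))
        # A missing nextPageToken means this was the last page
        page_token = page.get("nextPageToken")
        if not page_token:
            return items
&lt;/code&gt;&lt;/pre&gt;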
&lt;h4&gt;Exploring other APIs with the same tools&lt;/h4&gt;
&lt;p&gt;While I was building this, I realized that with just a little extra work the auth and get commands could be used to explore other Google APIs too.&lt;/p&gt;
&lt;p&gt;If you are a developer, you can create your own OAuth credentials and enable access to other APIs using &lt;a href="https://console.cloud.google.com/apis/credentials"&gt;the Google Cloud console&lt;/a&gt;. You can then take the resulting client ID and secret, pick a scope and run the following:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;google-drive-to-sqlite auth -a calendar-auth.json \
  --scope 'https://www.googleapis.com/auth/calendar.readonly' \
  --google-client-id '184325416553-nu5ci563v36rmj9opdl7mah786anbkrq.apps.googleusercontent.com' \
  --google-client-secret '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'
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;code&gt;calendar-auth.json&lt;/code&gt; will now be a JSON file that looks something like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-json"&gt;&lt;pre&gt;{
  &lt;span class="pl-ent"&gt;"google-drive-to-sqlite"&lt;/span&gt;: {
    &lt;span class="pl-ent"&gt;"refresh_token"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;1//...&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"google_client_id"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;184325416553-nu5ci563v36rmj9opdl7mah786anbkrq.apps.googleusercontent.com&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"google_client_secret"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;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&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"scope"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;https://www.googleapis.com/auth/calendar.readonly&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;
  }
}&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;You can now fetch your Google Calendar items by adding your email address to the following:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;google-drive-to-sqlite get \
  https://www.googleapis.com/calendar/v3/calendars/...@gmail.com/events \
  --auth calendar-auth.json
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This will output JSON to the console. For newline-delimited JSON, add &lt;code&gt;--nl&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Since we can paginate with &lt;code&gt;--paginate items&lt;/code&gt;, this means we can pipe the results to &lt;a href="https://sqlite-utils.datasette.io/en/stable/cli-reference.html#insert"&gt;sqlite-utils insert&lt;/a&gt; and create a SQLite database of our calendar items!&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;google-drive-to-sqlite get \
  https://www.googleapis.com/calendar/v3/calendars/...@gmail.com/events \
  --auth calendar-auth.json \
  --paginate items --nl \
  | sqlite-utils insert calendar.db events \
    - --pk id --nl --alter --replace
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Maybe &lt;code&gt;google-drive-to-sqlite&lt;/code&gt; wasn't the right name for this after all!&lt;/p&gt;
&lt;h4&gt;What's next?&lt;/h4&gt;
&lt;p&gt;Google severely &lt;a href="https://cloud.google.com/blog/products/identity-security/enhancing-security-controls-for-google-drive-third-party-apps"&gt;tightened their policies&lt;/a&gt; on apps that can access Google Drive a few years ago. I'm currently waiting to see if my app will make it through their verification process, see &lt;a href="https://github.com/simonw/google-drive-to-sqlite/issues/15"&gt;issue #15&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;If it doesn't, the tool will still be usable, but users will have to jump through some extra hoops to set up their own client ID. I don't see this as a huge concern.&lt;/p&gt;
&lt;p&gt;I've started thinking about ways to import additional data from the Google Drive APIs. I'm particularly interested in the idea of creating a full-text search index in SQLite based on plain text exports of documents created in Google Docs, see &lt;a href="https://github.com/simonw/google-drive-to-sqlite/issues/28"&gt;issue #28&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;For other short-term future plans, take a look at the project's &lt;a href="https://github.com/simonw/google-drive-to-sqlite/issues"&gt;open issues&lt;/a&gt;.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/google-docs"&gt;google-docs&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/dogsheep"&gt;dogsheep&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/weeknotes"&gt;weeknotes&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sqlite-utils"&gt;sqlite-utils&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="google-docs"/><category term="projects"/><category term="datasette"/><category term="dogsheep"/><category term="weeknotes"/><category term="sqlite-utils"/></entry><entry><title>Notes on Notes.app</title><link href="https://simonwillison.net/2021/Dec/9/notes-on-notesapp/#atom-tag" rel="alternate"/><published>2021-12-09T22:39:26+00:00</published><updated>2021-12-09T22:39:26+00:00</updated><id>https://simonwillison.net/2021/Dec/9/notes-on-notesapp/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/dunhamsteve/notesutils/blob/master/notes.md"&gt;Notes on Notes.app&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Apple's Notes app keeps its data in a SQLite database at &lt;code&gt;~/Library/Group\ Containers/group.com.apple.notes/NoteStore.sqlite&lt;/code&gt; - but it's pretty difficult to extract data from. It turns out the note text is stored as a gzipped protocol buffers object in the &lt;code&gt;ZICNOTEDATA.ZDATA&lt;/code&gt; column. Steve Dunham did the hard work of figuring out how it all works - the complexity stems from Apple's use of CRDTs to support seamless multiple edits from different devices.
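&lt;p&gt;A minimal Python sketch of the first layer of that extraction, assuming only the table and column named above (the helper name &lt;code&gt;read_note_blobs&lt;/code&gt; is hypothetical). The bytes it yields are still protocol buffers data, and decoding them needs the message layout from Steve Dunham's notes:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import gzip
import sqlite3


def read_note_blobs(db_path):
    """Yield the decompressed protobuf bytes for each row in ZICNOTEDATA."""
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(
            "select ZDATA from ZICNOTEDATA where ZDATA is not null"
        )
        for (compressed,) in rows:
            # Undo the gzip layer only; the result is an undecoded protobuf
            yield gzip.decompress(compressed)
    finally:
        conn.close()
&lt;/code&gt;&lt;/pre&gt;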


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/apple"&gt;apple&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sqlite"&gt;sqlite&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/crdt"&gt;crdt&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/dogsheep"&gt;dogsheep&lt;/a&gt;&lt;/p&gt;



</summary><category term="apple"/><category term="sqlite"/><category term="crdt"/><category term="dogsheep"/></entry><entry><title>Weeknotes: Getting my personal Dogsheep up and running again</title><link href="https://simonwillison.net/2021/Aug/22/weeknotes-dogsheep/#atom-tag" rel="alternate"/><published>2021-08-22T04:46:13+00:00</published><updated>2021-08-22T04:46:13+00:00</updated><id>https://simonwillison.net/2021/Aug/22/weeknotes-dogsheep/#atom-tag</id><summary type="html">
    &lt;p&gt;I gave a talk about &lt;a href="https://dogsheep.github.io/"&gt;Dogsheep&lt;/a&gt; at Noisebridge's &lt;a href="https://www.noisebridge.net/wiki/Five_Minutes_of_Fame_2021_08_19"&gt;Five Minutes of Fame&lt;/a&gt; on Thursday. Just one problem: my regular Dogsheep demo was broken, so I ended up building it from scratch again. In doing so I fixed a few bugs in some Dogsheep tools.&lt;/p&gt;
&lt;h4&gt;Dogsheep on a Digital Ocean droplet&lt;/h4&gt;
&lt;p&gt;The latest iteration of my personal Dogsheep runs on a $20/month 4GB/2CPU &lt;a href="https://www.digitalocean.com/"&gt;Digital Ocean&lt;/a&gt; Droplet running Ubuntu 20.04 LTS.&lt;/p&gt;
&lt;p&gt;It runs a private Datasette instance and a bunch of cron jobs to fetch data from Twitter, GitHub, Foursquare Swarm, Pocket and Hacker News.&lt;/p&gt;
&lt;p&gt;It also has copies of my Apple Photos and Apple HealthKit data which I upload manually - plus a &lt;a href="https://simonwillison.net/2019/Sep/20/weeknotes-design-thinking-genome-sqlite/#genome-to-sqlite"&gt;copy of my genome&lt;/a&gt; for good measure.&lt;/p&gt;
&lt;p&gt;Some abbreviated notes on how I set it up, copied from a private GitHub Issues thread:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Create a new Ubuntu droplet, and configure its IP address as the A record for &lt;code&gt;dogsheep.simonwillison.net&lt;/code&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Install Python 3 and NGINX and SQLite: &lt;code&gt;apt-get install python3 python3-venv nginx sqlite -y&lt;/code&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Use &lt;a href="https://letsencrypt.org/"&gt;letsencrypt&lt;/a&gt; to get an HTTPS certificate for it: &lt;code&gt;apt-get update&lt;/code&gt; and then
&lt;code&gt;apt install certbot python3-certbot-nginx -y&lt;/code&gt;, then &lt;code&gt;certbot --nginx -d dogsheep.simonwillison.net&lt;/code&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;I had to remove the &lt;code&gt;ipv6only=on;&lt;/code&gt; bit from the NGINX configuration due to &lt;a href="https://github.com/certbot/certbot/issues/5550#issuecomment-367971137"&gt;this bug&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Created a &lt;code&gt;dogsheep&lt;/code&gt; user, &lt;code&gt;useradd -s /bin/bash -d /home/dogsheep/ -m -G&lt;/code&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;As that user, created a virtual environment: &lt;code&gt;python3 -mvenv datasette-venv&lt;/code&gt; and then &lt;code&gt;datasette-venv/bin/pip install wheel&lt;/code&gt; and &lt;code&gt;datasette-venv/bin/pip install datasette datasette-auth-passwords&lt;/code&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Created a &lt;code&gt;/etc/systemd/system/datasette.service&lt;/code&gt; file with &lt;a href="https://gist.github.com/simonw/0653b6177c6f12caa16530da4c56646f"&gt;these contents&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Created a set of blank SQLite database files &lt;a href="https://til.simonwillison.net/sqlite/enabling-wal-mode"&gt;in WAL mode&lt;/a&gt; in &lt;code&gt;/home/dogsheep&lt;/code&gt; using the following:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;for f in beta.db twitter.db healthkit.db github.db \
  swarm.db photos.db genome.db simonwillisonblog.db \
  pocket.db hacker-news.db memories.db
do
  sqlite3 $f vacuum
  # And enable WAL mode:
  sqlite3 $f 'PRAGMA journal_mode=WAL;'
done
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Started the Datasette service: &lt;code&gt;service datasette start&lt;/code&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Configured NGINX to proxy to localhost port 8001, using &lt;a href="https://docs.datasette.io/en/stable/deploying.html#nginx-proxy-configuration"&gt;this configuration&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
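&lt;p&gt;The WAL mode step from the list above can also be done with Python's standard library. This is an equivalent sketch rather than what I actually ran, and &lt;code&gt;create_wal_database&lt;/code&gt; is a hypothetical helper name:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import sqlite3


def create_wal_database(path):
    """Create (or open) a SQLite file and switch it to WAL journal mode."""
    conn = sqlite3.connect(path)
    try:
        # The pragma returns the journal mode that is now in effect
        mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
    finally:
        conn.close()
    return mode
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Calling it once per filename mirrors the shell loop, and checking that the return value is &lt;code&gt;wal&lt;/code&gt; confirms the change took effect.&lt;/p&gt;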
&lt;p&gt;It's a few more steps than I'd like, but the end result was a &lt;a href="https://datasette.io/plugins/datasette-auth-passwords"&gt;password-protected&lt;/a&gt; Datasette instance running against a bunch of SQLite database files on my new server.&lt;/p&gt;
&lt;p&gt;With Datasette up and running, the next step was to start loading in data.&lt;/p&gt;
&lt;h4&gt;Importing my tweets&lt;/h4&gt;
&lt;p&gt;I started with Twitter. I dropped my Twitter API access credentials into an &lt;code&gt;auth.json&lt;/code&gt; file (as &lt;a href="https://datasette.io/tools/twitter-to-sqlite#user-content-authentication"&gt;described here&lt;/a&gt;) and ran the following:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;source /home/dogsheep/datasette-venv/bin/activate
pip install twitter-to-sqlite
twitter-to-sqlite user-timeline /home/dogsheep/twitter.db \
  -a /home/dogsheep/auth.json
@simonw [###############################-----] 26299/29684 00:02:06
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;That pulled in all 29,684 of my personal tweets.&lt;/p&gt;
&lt;p&gt;(Actually, first it broke with an error, exposing a &lt;a href="https://github.com/dogsheep/twitter-to-sqlite/issues/57"&gt;bug that had already been reported&lt;/a&gt;. I shipped &lt;a href="https://github.com/dogsheep/twitter-to-sqlite/releases/tag/0.21.4"&gt;a fix for that&lt;/a&gt; and tried again and it worked.)&lt;/p&gt;
&lt;p&gt;Favourited tweets were a little harder - I have 39,904 favourited tweets, but the Twitter API only returns the most recent 3,200. I grabbed those more recent ones with:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;twitter-to-sqlite favorites /home/dogsheep/twitter.db \
  -a /home/dogsheep/auth.json
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then I requested &lt;a href="https://help.twitter.com/en/managing-your-account/how-to-download-your-twitter-archive"&gt;my Twitter archive&lt;/a&gt;, waited 24 hours and uploaded the resulting &lt;code&gt;like.js&lt;/code&gt; file to the server, then ran:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;twitter-to-sqlite import twitter.db /tmp/like.js
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This gave me an &lt;code&gt;archive_like&lt;/code&gt; table with the data from that file - but it wasn't the full tweet representation, just the subset that Twitter expose in the archive export.&lt;/p&gt;
&lt;p&gt;The &lt;a href="https://datasette.io/tools/twitter-to-sqlite#user-content-importing-data-from-your-twitter-archive"&gt;README&lt;/a&gt; shows how to inflate those into full tweets:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;twitter-to-sqlite statuses-lookup twitter.db \
  --sql='select tweetId from archive_like' \
  --skip-existing
Importing 33,382 tweets [------------------------------------] 0% 00:18:28
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Once that was done I wrote additional records into the &lt;code&gt;favorited_by&lt;/code&gt; table like so:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;sqlite3 twitter.db '
  INSERT OR IGNORE INTO favorited_by (tweet, user)
  SELECT tweetId, 12497 FROM archive_like
'
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;(12497 is my Twitter user ID.)&lt;/p&gt;
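&lt;p&gt;The reason &lt;code&gt;INSERT OR IGNORE&lt;/code&gt; is safe to re-run can be shown with a small in-memory example. The schema here is an assumption made for illustration - in particular the compound primary key on &lt;code&gt;favorited_by&lt;/code&gt;, which is what lets SQLite skip rows that already exist:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed minimal schema, not copied from twitter-to-sqlite
conn.executescript(
    """
    create table archive_like (tweetId integer);
    create table favorited_by (
        tweet integer, user integer, primary key (tweet, user)
    );
    insert into archive_like values (1), (2);
    insert into favorited_by values (1, 12497);
    """
)
sql = """
    insert or ignore into favorited_by (tweet, user)
    select tweetId, 12497 from archive_like
"""
conn.execute(sql)
conn.execute(sql)  # Running it a second time adds no duplicates
rows = conn.execute(
    "select tweet, user from favorited_by order by tweet"
).fetchall()
print(rows)
&lt;/code&gt;&lt;/pre&gt;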
&lt;p&gt;I also came up with a SQL view that lets me see just media attached to tweets:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;sqlite-utils create-view twitter.db media_details "
select
  json_object('img_src', media_url_https, 'width', 400) as img,
  tweets.full_text,
  tweets.created_at,
  tweets.id as tweet_id,
  users.screen_name,
  'https://twitter.com/' || users.screen_name || '/status/' || tweets.id as tweet_url
from
  media
  join media_tweets on media.id = media_tweets.media_id
  join tweets on media_tweets.tweets_id = tweets.id
  join users on tweets.user = users.id
order by
  tweets.id desc
"
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now I can visit &lt;code&gt;/twitter/media_details?_where=tweet_id+in+(select+tweet+from+favorited_by+where+user+=+12497)&lt;/code&gt; to see the most recent media tweets that I've favourited!&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2021/liked-media-details.jpg" alt="media_details view showing Twitter media I have liked" data-canonical-src="https://static.simonwillison.net/static/2021/liked-media-details.jpg" style="max-width:100%;" /&gt;&lt;/p&gt;
&lt;h4&gt;Swarm checkins&lt;/h4&gt;
&lt;p&gt;Swarm checkins were a lot easier. I needed my previously-created Foursquare API token, and &lt;a href="https://datasette.io/tools/swarm-to-sqlite"&gt;swarm-to-sqlite&lt;/a&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;pip install swarm-to-sqlite
swarm-to-sqlite /home/dogsheep/swarm.db --token=...
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This gave me a full table of my Swarm checkins, which I can visualize using &lt;a href="https://datasette.io/plugins/datasette-cluster-map"&gt;datasette-cluster-map&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2021/dogsheep-swarm-map.jpg" alt="Map of my 2021 Swarm checkins" data-canonical-src="https://static.simonwillison.net/static/2021/dogsheep-swarm-map.jpg" style="max-width:100%;" /&gt;&lt;/p&gt;
&lt;h4&gt;Apple HealthKit&lt;/h4&gt;
&lt;p&gt;I don't yet have full automation for my Apple HealthKit data (collected by my Apple Watch) or my Apple Photos - both require me to run scripts on my laptop to create the SQLite database file and then copy the result to the server via &lt;code&gt;scp&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://datasette.io/tools/healthkit-to-sqlite"&gt;healthkit-to-sqlite&lt;/a&gt; runs against the &lt;code&gt;export.zip&lt;/code&gt; that is produced by the export data button in the Apple Health app on the iPhone - for me that was a 158MB zip file which I AirDropped to my laptop and converted (after &lt;a href="https://github.com/dogsheep/healthkit-to-sqlite/issues/19"&gt;fixing a new bug&lt;/a&gt;) like so:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;healthkit-to-sqlite ~/Downloads/export.zip healthkit.db
Importing from HealthKit  [-----------------------------]    2%  00:02:25
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I uploaded the resulting 1.5GB &lt;code&gt;healthkit.db&lt;/code&gt; file and now I can do things like visualize my 2017 San Francisco Half Marathon run on a map:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2021/dogsheep-marathon.jpg" alt="A map of the half marathon I ran" data-canonical-src="https://static.simonwillison.net/static/2021/dogsheep-marathon.jpg" style="max-width:100%;" /&gt;&lt;/p&gt;
&lt;h4&gt;Apple Photos&lt;/h4&gt;
&lt;p&gt;For my photos I use &lt;a href="https://datasette.io/tools/dogsheep-photos"&gt;dogsheep-photos&lt;/a&gt;, which I described last year in &lt;a href="https://simonwillison.net/2020/May/21/dogsheep-photos/"&gt;Using SQL to find my best photo of a pelican according to Apple Photos&lt;/a&gt;. The short version: I run this script on my laptop:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Upload original photos to my S3 bucket
dogsheep-photos upload photos.db \
  ~/Pictures/Photos\ Library.photoslibrary/originals
dogsheep-photos apple-photos photos.db \
  --image-url-prefix "https://photos.simonwillison.net/i/" \
  --image-url-suffix "?w=600"
scp photos.db dogsheep:/home/dogsheep/photos.db
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;code&gt;photos.db&lt;/code&gt; is only 171MB - it contains the metadata, including the machine learning labels, but not the photos themselves.&lt;/p&gt;
&lt;p&gt;And now I can run queries for things like photos of food I've taken in 2021:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2021/dogsheep-photos-of-food.jpg" alt="Some photos of food" data-canonical-src="https://static.simonwillison.net/static/2021/dogsheep-photos-of-food.jpg" style="max-width:100%;" /&gt;&lt;/p&gt;
&lt;h4&gt;Automation via cron&lt;/h4&gt;
&lt;p&gt;I'm still working through the last step, which involves setting up cron tasks to refresh my data periodically from various sources. My &lt;code&gt;crontab&lt;/code&gt; currently looks like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Twitter
1,11,21,31,41,51 * * * * /home/dogsheep/datasette-venv/bin/twitter-to-sqlite user-timeline /home/dogsheep/twitter.db -a /home/dogsheep/auth.json --since
4,14,24,34,44,54 * * * * run-one /home/dogsheep/datasette-venv/bin/twitter-to-sqlite mentions-timeline /home/dogsheep/twitter.db -a /home/dogsheep/auth.json --since
11 * * * * run-one /home/dogsheep/datasette-venv/bin/twitter-to-sqlite user-timeline /home/dogsheep/twitter.db cleopaws -a /home/dogsheep/auth.json --since
6,16,26,36,46,56 * * * * run-one /home/dogsheep/datasette-venv/bin/twitter-to-sqlite favorites /home/dogsheep/twitter.db -a /home/dogsheep/auth.json --stop_after=50

# Swarm
25 */2 * * * /home/dogsheep/datasette-venv/bin/swarm-to-sqlite /home/dogsheep/swarm.db --token=... --since=2w

# Hacker News data every six hours
35 0,6,12,18 * * * /home/dogsheep/datasette-venv/bin/hacker-news-to-sqlite user /home/dogsheep/hacker-news.db simonw

# Re-build dogsheep-beta search index once an hour
32 * * * * /home/dogsheep/datasette-venv/bin/dogsheep-beta index /home/dogsheep/beta.db /home/dogsheep/dogsheep-beta.yml
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I'll be expanding this out as I configure more of the &lt;a href="https://dogsheep.github.io/"&gt;Dogsheep tools&lt;/a&gt; for my personal instance.&lt;/p&gt;
&lt;h4&gt;TIL this week&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/sqlite/build-specific-sqlite-pysqlite-macos"&gt;Building a specific version of SQLite with pysqlite on macOS/Linux&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/sqlite/track-timestamped-changes-to-a-table"&gt;Track timestamped changes to a SQLite table using triggers&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/observable-plot/histogram-with-tooltips"&gt;Histogram with tooltips in Observable Plot&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;Releases this week&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/dogsheep/healthkit-to-sqlite"&gt;healthkit-to-sqlite&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/dogsheep/healthkit-to-sqlite/releases/tag/1.0.1"&gt;1.0.1&lt;/a&gt; - (&lt;a href="https://github.com/dogsheep/healthkit-to-sqlite/releases"&gt;9 releases total&lt;/a&gt;) - 2021-08-20
&lt;br /&gt;Convert an Apple Healthkit export zip to a SQLite database&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/dogsheep/twitter-to-sqlite"&gt;twitter-to-sqlite&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/dogsheep/twitter-to-sqlite/releases/tag/0.21.4"&gt;0.21.4&lt;/a&gt; - (&lt;a href="https://github.com/dogsheep/twitter-to-sqlite/releases"&gt;27 releases total&lt;/a&gt;) - 2021-08-20
&lt;br /&gt;Save data from Twitter to a SQLite database&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette-block-robots"&gt;datasette-block-robots&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/datasette-block-robots/releases/tag/1.0"&gt;1.0&lt;/a&gt; - (&lt;a href="https://github.com/simonw/datasette-block-robots/releases"&gt;5 releases total&lt;/a&gt;) - 2021-08-19
&lt;br /&gt;Datasette plugin that blocks robots and crawlers using robots.txt&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/sqlite-utils"&gt;sqlite-utils&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/sqlite-utils/releases/tag/3.16"&gt;3.16&lt;/a&gt; - (&lt;a href="https://github.com/simonw/sqlite-utils/releases"&gt;85 releases total&lt;/a&gt;) - 2021-08-18
&lt;br /&gt;Python CLI utility and library for manipulating SQLite databases&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette-debug-asgi"&gt;datasette-debug-asgi&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/datasette-debug-asgi/releases/tag/1.1"&gt;1.1&lt;/a&gt; - (&lt;a href="https://github.com/simonw/datasette-debug-asgi/releases"&gt;3 releases total&lt;/a&gt;) - 2021-08-17
&lt;br /&gt;Datasette plugin for dumping out the ASGI scope&lt;/li&gt;
&lt;/ul&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/linux"&gt;linux&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/nginx"&gt;nginx&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/dogsheep"&gt;dogsheep&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/weeknotes"&gt;weeknotes&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="linux"/><category term="nginx"/><category term="datasette"/><category term="dogsheep"/><category term="weeknotes"/></entry><entry><title>Datasette - an ecosystem of tools for working with small data</title><link href="https://simonwillison.net/2021/Jul/22/small-data/#atom-tag" rel="alternate"/><published>2021-07-22T18:13:40+00:00</published><updated>2021-07-22T18:13:40+00:00</updated><id>https://simonwillison.net/2021/Jul/22/small-data/#atom-tag</id><summary type="html">
    &lt;p&gt;This is the transcript and video from a talk I gave at PyGotham 2020 about using SQLite, Datasette and Dogsheep to work with small data.&lt;/p&gt;
&lt;p&gt;I really like the term "small data", as a counterpoint to big data. I define small data as anything that fits on a mobile phone, and since my iPhone has half a terabyte of storage these days that ends up covering a lot of ground!&lt;/p&gt;
&lt;p&gt;I think SQLite is the ideal tool for working with data that fits this definition.&lt;/p&gt;
&lt;p&gt;My PyGotham talk demonstrates how to find SQLite databases that already exist on your laptop, how to use Datasette to explore them, and then expands to cover Datasette as a tool for publishing data online and my Dogsheep family of tools for personal analytics.&lt;/p&gt;
&lt;p&gt;Here's &lt;a href="https://www.youtube.com/watch?v=Lig2gxPEZPo"&gt;the talk video on YouTube&lt;/a&gt;. I've included a full transcript below with additional links and embedded code samples.&lt;/p&gt;

&lt;div&gt;&lt;iframe allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen="allowfullscreen" frameborder="0" height="315" src="https://www.youtube-nocookie.com/embed/Lig2gxPEZPo" width="560"&gt; &lt;/iframe&gt;&lt;/div&gt;

&lt;h4&gt;Transcript and notes&lt;/h4&gt;
&lt;p&gt;I'm going to spend the next 20 minutes trying to convince you that almost every data problem you have should be solved using &lt;a href="https://www.sqlite.org/"&gt;SQLite&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;This is also an excuse for me to promote an open source project I've been working on for the last three years, called &lt;a href="https://datasette.io/"&gt;Datasette&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Datasette is a tool for exploring and publishing data that's stored in SQLite databases. So the more people I can convince to use SQLite to solve things the better!&lt;/p&gt;
&lt;p&gt;So let's talk about SQLite, which &lt;a href="https://www.sqlite.org/mostdeployed.html"&gt;claims to be&lt;/a&gt; the most widely deployed and used database in the world, and I think that claim holds up.&lt;/p&gt;
&lt;p&gt;You may not have realised it, but you're using SQLite every single day. It's in phones and laptops. It's in embedded devices. I have an Apple Watch, so I'm wearing SQLite databases on my wrist right now.&lt;/p&gt;
&lt;p&gt;Let's take a look on my laptop and see what we can find.&lt;/p&gt;
&lt;p&gt;So &lt;a href="https://til.simonwillison.net/macos/find-largest-sqlite"&gt;this right here&lt;/a&gt; is a magic incantation that on a Mac will search for the largest SQLite databases across the entire system.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;mdfind "kMDItemDisplayName == *.sqlite" -0 | \
  xargs -0 stat "-f%z %N" | sort -nr | head -n 20
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I'm gonna run this right now to get my top 20. And here we are. Evernote is using SQLite, Firefox uses it, but the one at the top, this one is an 857 megabyte SQLite file. It's something to do with Apple Photos, so let's see what's in there.&lt;/p&gt;
&lt;p&gt;Datasette can be &lt;a href="https://docs.datasette.io/en/stable/installation.html"&gt;installed&lt;/a&gt; as a command-line application, you can run it against the path to a SQLite database, and it starts up a little local web server, which you can then use to explore that data.&lt;/p&gt;
&lt;p&gt;So there are 67 tables in this Apple Photos database, with detected face prints and face crops and albums and all sorts of things.&lt;/p&gt;
&lt;p&gt;But the one that I happen to know is super interesting is this one here. It's called the Z Cloud Master Media Meta Data. It has 44,000 rows in it and I happen to know I've taken about 44,000 photographs - so it looks like there's some interesting stuff here.&lt;/p&gt;
&lt;p&gt;This is kind of garbled, but there's a clue. The binary column here starts with BPlist, and I happen to know that Apple have a format called a binary plist, which is probably what this is.&lt;/p&gt;
&lt;p&gt;So I'm gonna install a plugin for Datasette called &lt;a href="https://datasette.io/plugins/datasette-bplist"&gt;datasette-bplist&lt;/a&gt; that knows what to do with that data. This is being installed from PyPI.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;datasette install datasette-bplist
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;And now I can start Datasette up again. Hit refresh on this page. And now that Z Data has been decoded into something that's a little bit more readable.&lt;/p&gt;
&lt;p&gt;And straightaway we can see that OK, this is classic photo metadata. We've got the make and model of the device. We've got EXIF data. The really exciting stuff is we've got GPS data. So this is a database that includes the latitude and longitude of every photograph I've ever taken, and we're absolutely going to dig into that and see if we can do some interesting stuff with it later on.&lt;/p&gt;
&lt;p&gt;So SQLite is clearly a pretty great database - if Apple are using it this extensively it must have some good things going on for it. But then why don't we use it for web applications?&lt;/p&gt;
&lt;p&gt;Well, the classic answer to that is that the one thing SQLite is not so good at is accepting concurrent writes. If you've got a high traffic web site with lots of people writing to your database, you're better off with something like MySQL or Postgres.&lt;/p&gt;
&lt;p&gt;But I realised that I have a whole bunch of use cases that don't require any writes at all. Any time I want to publish data online that isn't going to change, SQLite is actually a really good fit for that.&lt;/p&gt;
&lt;p&gt;So that was the original inspiration for Datasette. I was thinking about how if I want to publish data online, is this a way I can do it that's really cheap and flexible and interesting?&lt;/p&gt;
&lt;p&gt;So that's what I've been doing. I'll show you a few examples.&lt;/p&gt;
&lt;p&gt;This is Covid-19 case data, which is published online right now by Johns Hopkins University, The New York Times, the L.A. Times and The Economist.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://covid-19.datasettes.com/"&gt;https://covid-19.datasettes.com/&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;They're all publishing their data as CSV files in GitHub repositories, which is pretty great because you get a commit history of changes to that CSV data. But it's not necessarily something you can dive straight into and start using.&lt;/p&gt;
&lt;p&gt;That's where Datasette comes in. This is a Datasette instance which is automatically updated every day with the latest data from these different publications and I can drill through and see things like &lt;a href="https://covid-19.datasettes.com/covid/ny_times_us_counties"&gt;this New York Times data&lt;/a&gt; for example.&lt;/p&gt;
&lt;p&gt;The New York Times publishes county level data of cases and deaths from Covid-19 and they've got over half a million rows in this spreadsheet. And so when we look at this data, it's pretty unexciting in a sort of table form. We've got the date, the county, the state, the FIPS code, which is a unique identifier for each county, the number of cases and the number of deaths.&lt;/p&gt;
&lt;p&gt;But Datasette kicks in and does some really useful things for us.&lt;/p&gt;
&lt;p&gt;Firstly, Datasette has a concept of faceting where it looks for columns in the database which have a small number of unique values - like the State column for example, which only has 50 values in it, and it makes those available for navigation at the top along with these summary counts. So I can look down at this and say, Oh, look, Tennessee. Let's drill into &lt;a href="https://covid-19.datasettes.com/covid/ny_times_us_counties?state=Tennessee"&gt;just the rows in this table that mention Tennessee&lt;/a&gt;. And then Tennessee has 16,000 rows of which each county has a whole bunch. So let's look &lt;a href="https://covid-19.datasettes.com/covid/ny_times_us_counties?state=Tennessee&amp;amp;county=Rutherford"&gt;at Rutherford County&lt;/a&gt; in Tennessee. That's 181 rows. And now that we've drilled down these two levels, we've got back a table that is just showing us the number of cases and number of deaths ordered by date.&lt;/p&gt;
&lt;p&gt;So each of these rows says that on the sixth of September, there had been 80 deaths reported and 8,100 cases.&lt;/p&gt;
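&lt;p&gt;The idea behind faceting can be sketched with a plain &lt;code&gt;group by&lt;/code&gt; query. This is a minimal illustration using Python's built-in &lt;code&gt;sqlite3&lt;/code&gt; module, not Datasette's actual implementation; the &lt;code&gt;counties&lt;/code&gt; table, its sample rows and the &lt;code&gt;facet_counts&lt;/code&gt; helper are all hypothetical.&lt;/p&gt;

```python
import sqlite3

# Hypothetical sample rows standing in for the county-level table.
SAMPLE_ROWS = [
    ("2020-09-05", "Rutherford", "Tennessee"),
    ("2020-09-06", "Rutherford", "Tennessee"),
    ("2020-09-06", "Davidson", "Tennessee"),
    ("2020-09-06", "Kings", "New York"),
]


def facet_counts(conn, table, column, limit=30):
    """Return (value, count) pairs for one column, most common first."""
    # Identifiers cannot be bound as parameters, so they are quoted by hand.
    sql = (
        'select "{col}" as value, count(*) as n from "{tbl}" '
        'group by "{col}" order by n desc, value limit ?'
    ).format(col=column, tbl=table)
    return conn.execute(sql, (limit,)).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("create table counties (date text, county text, state text)")
conn.executemany("insert into counties values (?, ?, ?)", SAMPLE_ROWS)

# Each pair is a facet value and the number of rows that carry it.
print(facet_counts(conn, "counties", "state"))
```

&lt;p&gt;Clicking a facet value then corresponds to adding a &lt;code&gt;where&lt;/code&gt; clause for that column and running the same count again on the narrower set of rows.&lt;/p&gt;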
&lt;p&gt;You saw a Datasette plugin earlier when I installed the bplist plugin. Another plugin I wrote is called &lt;a href="https://datasette.io/plugins/datasette-vega"&gt;Datasette Vega&lt;/a&gt;. It gives you simple charting options, so we can say I want a bar chart that plots the date on the X axis against the number of cases on the Y axis. And now we've got &lt;a href="https://covid-19.datasettes.com/covid/ny_times_us_counties?state=Tennessee&amp;amp;county=Rutherford#g.mark=bar&amp;amp;g.x_column=date&amp;amp;g.x_type=ordinal&amp;amp;g.y_column=cases&amp;amp;g.y_type=quantitative"&gt;a chart of cases over time&lt;/a&gt; in this particular county in the state of Tennessee.&lt;/p&gt;
&lt;p&gt;A couple of other things you can do: You can export the data back out again so I could get this data &lt;a href="https://covid-19.datasettes.com/covid/ny_times_us_counties.json?state=Tennessee&amp;amp;county=Rutherford"&gt;as JSON&lt;/a&gt; if I want to load this programmatically into something, and get it out &lt;a href="https://covid-19.datasettes.com/covid/ny_times_us_counties.csv?state=Tennessee&amp;amp;county=Rutherford"&gt;as CSV&lt;/a&gt; if I want to load it into something like Microsoft Excel.&lt;/p&gt;
&lt;p&gt;I can actually export it &lt;a href="https://covid-19.datasettes.com/covid/ny_times_us_counties.yaml?state=Tennessee&amp;amp;county=Rutherford"&gt;as YAML&lt;/a&gt; because there's an &lt;a href="https://datasette.io/plugins/datasette-yaml"&gt;experimental plugin&lt;/a&gt; I've been building that does a YAML export. And then, more importantly, you can &lt;a href="https://covid-19.datasettes.com/covid?sql=select+rowid%2C+date%2C+county%2C+state%2C+fips%2C+cases%2C+deaths+from+ny_times_us_counties+where+%22county%22+%3D+%3Ap0+and+%22state%22+%3D+%3Ap1+order+by+date+desc+limit+101&amp;amp;p0=Rutherford&amp;amp;p1=Tennessee"&gt;view and edit the SQL query&lt;/a&gt; that was used for this page.&lt;/p&gt;
&lt;p&gt;Now this is one of the interesting things about publishing read-only data: because the data is read only and it's opened in a special read only mode, people executing SQL queries can't cause any damage to that database. So SQL injection, which in most web applications is a serious security vulnerability, for us becomes a feature.&lt;/p&gt;
&lt;p&gt;We could say I just want back the data on the cases and the deaths columns - I'm gonna hit "run SQL" and now I'm getting back &lt;a href="https://covid-19.datasettes.com/covid?sql=select+date%2C+cases%2C+deaths+from+ny_times_us_counties+where+%22county%22+%3D+%3Ap0+and+%22state%22+%3D+%3Ap1+order+by+date+desc+limit+101&amp;amp;p0=Rutherford&amp;amp;p1=Tennessee"&gt;just that data that I requested&lt;/a&gt;, and I could export that data back out again as JSON or CSV. So this becomes a really powerful way of remixing and slicing and dicing data and then getting it back out again in a format that I can use somewhere else.&lt;/p&gt;
&lt;p&gt;So that's all well and good. But the obvious question is, how do you get this data into a SQLite file in the first place? So I'm gonna show you a demo using one of my favourite pieces of data of all time. This is the results of the 2018 &lt;a href="https://www.thesquirrelcensus.com/"&gt;New York Central Park squirrel census&lt;/a&gt;, where a bunch of volunteers went out and tried to answer the crucial question: "How many squirrels are there in Central Park?".&lt;/p&gt;
&lt;p&gt;And this is published &lt;a href="https://data.cityofnewyork.us/Environment/2018-Central-Park-Squirrel-Census-Squirrel-Data/vfnx-vebw"&gt;as a CSV file&lt;/a&gt;, it's got 3,023 rows. Each row is a squirrel. And we can export that out and get back CSV. So I've got the file here. I'm gonna turn it into a SQLite database.&lt;/p&gt;
&lt;p&gt;I have a tool I wrote called &lt;a href="https://datasette.io/tools/csvs-to-sqlite"&gt;csvs-to-sqlite&lt;/a&gt;, which takes a CSV file - or multiple CSV files - and turns them into, or loads them into SQLite. So here we go, I've run that command and now I have &lt;code&gt;squirrels.db&lt;/code&gt;, which is a SQLite database.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;csvs-to-sqlite 2018_Central_Park_Squirrel_Census_-_Squirrel_Data.csv squirrels.db
&lt;/code&gt;&lt;/pre&gt;
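&lt;p&gt;For a rough idea of what loading a CSV into SQLite involves, here is a minimal sketch using only the Python standard library. It is not how &lt;code&gt;csvs-to-sqlite&lt;/code&gt; is implemented: it stores every column as text, and the &lt;code&gt;load_csv&lt;/code&gt; helper and the two sample rows are made up for illustration.&lt;/p&gt;

```python
import csv
import io
import sqlite3

# Hypothetical sample standing in for the real CSV file; values are invented.
SAMPLE_CSV = """X,Y,Unique Squirrel ID,Primary Fur Color
-73.9561,40.7941,37F-PM-1014-03,Gray
-73.9570,40.7949,21B-AM-1019-04,Cinnamon
"""


def load_csv(conn, table, csv_file):
    """Create `table` with one text column per CSV header and insert every row."""
    reader = csv.reader(csv_file)
    headers = next(reader)
    columns = ", ".join('"{}" text'.format(h) for h in headers)
    conn.execute('create table "{}" ({})'.format(table, columns))
    placeholders = ", ".join("?" for _ in headers)
    conn.executemany(
        'insert into "{}" values ({})'.format(table, placeholders), reader
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
load_csv(conn, "squirrels", io.StringIO(SAMPLE_CSV))
print(conn.execute("select count(*) from squirrels").fetchone()[0])
```

&lt;p&gt;To load a file from disk instead, pass &lt;code&gt;open(path, newline="")&lt;/code&gt; as the third argument and connect to a filename rather than &lt;code&gt;:memory:&lt;/code&gt;.&lt;/p&gt;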
&lt;p&gt;If I run &lt;code&gt;sqlite3 squirrels.db .schema&lt;/code&gt;, it'll show me the schema - and it is indeed a table with columns for squirrels with unique squirrel IDs and which hectare they're in and all of that kind of stuff.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;datasette squirrels.db
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;And so now that we've done that, I can run &lt;code&gt;datasette squirrels.db&lt;/code&gt; and start up Datasette and I can see that data in my browser. And here we go here, here are 3,000 rows of squirrel data.&lt;/p&gt;
&lt;p&gt;Faceting is kind of fun: you can see things like primary fur colour, showing up as options there. And, actually, if we facet by that, we can say you know what? For the 392 cinnamon squirrels, let's see what the highlight fur colours are for those. Show me the 10 squirrels that are black and have cinnamon hair.&lt;/p&gt;
&lt;p&gt;Another thing that stands out about this data is that there are these two columns here, X and Y, which look suspiciously like they might be latitudes and longitudes. We can do something really interesting with that.&lt;/p&gt;
&lt;p&gt;I'm gonna load up the SQLite command-line tool for squirrels.db, and I'm gonna rename the X column to longitude.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;sqlite3 squirrels.db
alter table "2018_Central_Park_Squirrel_Census_-_Squirrel_Data"
rename column X to longitude;
alter table "2018_Central_Park_Squirrel_Census_-_Squirrel_Data"
rename column Y to latitude;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;So "alter table" - um, quite a long table name this one - gonna alter this table here. Rename column X to longitude. And I'm gonna rename column Y to latitude. And now I'm going to check my datasette plugins and make sure I've got a plugin installed called &lt;a href="https://datasette.io/plugins/datasette-cluster-map"&gt;datasette-cluster-map&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;There it is. So now if I run Datasette against squirrels again and refresh this page, the plugin here notices the latitude and longitude columns and it draws them on the map. So this is a map of everywhere that they saw a cinnamon squirrel with black highlight fur.&lt;/p&gt;
&lt;p&gt;And if I close off these filters you can see that here's a map of all 3,000 squirrels that were spotted in Central Park. And I can zoom in and explore the data that way.&lt;/p&gt;
&lt;p&gt;That's pretty cool. Wouldn't it be cool if we could share this data on the Internet somehow?&lt;/p&gt;
&lt;p&gt;So we've got a SQLite database. And this is where another key feature of Datasette comes in, which is the &lt;a href="https://docs.datasette.io/en/stable/publish.html"&gt;datasette publish&lt;/a&gt; command.&lt;/p&gt;
&lt;p id="datasette-publish"&gt;Datasette Publish lets you publish databases online, using &lt;a href="https://cloud.google.com/run"&gt;Google Cloud Run&lt;/a&gt; or &lt;a href="https://heroku.com/"&gt;Heroku&lt;/a&gt; or &lt;a href="https://vercel.com/"&gt;Vercel&lt;/a&gt;. I'm going to use Vercel right now, so if I type &lt;code&gt;datasette publish vercel squirrels.db&lt;/code&gt; - I need to give it a project name. I'm going to call it &lt;code&gt;pygotham-squirrels&lt;/code&gt;. And I know that I want to install that &lt;code&gt;datasette-cluster-map&lt;/code&gt; plugin as well.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;datasette publish vercel squirrels.db \
  --project=pygotham-squirrels \
  --install=datasette-cluster-map
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;So I run this command, and it packages up that database with the Datasette application itself, uploads it and starts running a build on Vercel. It gives me a URL here. I can watch it doing its thing - so it's installing runtimes, it's installing those required dependencies, that additional plugin. And normally, this only takes about 20 seconds to run. And at the end of that 20 seconds, I will have a Web application that is running online with that data.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://pygotham-squirrels.vercel.app/"&gt;https://pygotham-squirrels.vercel.app/&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;So anyone can visit this URL right now, click on that map and they get &lt;a href="https://pygotham-squirrels.vercel.app/squirrels/2018_Central_Park_Squirrel_Census_-_Squirrel_Data"&gt;that Datasette of squirrels with the map plugin&lt;/a&gt;, with the ability to facet by primary fur colour. It's all up there. It's running online and it's ready for other people to visit.&lt;/p&gt;
&lt;p&gt;I designed this initially for newspapers based on my experience working at the Guardian newspaper, where we wanted to publish some of the data behind our stories, but as a way of productively getting structured data onto the web in a way you can start using it I think this is really, really exciting as a capability.&lt;/p&gt;
&lt;p&gt;And remember there are JSON APIs for this. So we have just published an API online for squirrel data that anyone who can write JavaScript or Python can start using to build their own squirrel maps.&lt;/p&gt;
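&lt;p&gt;As a rough sketch of what consuming that API from Python might look like: the table URL below is the one from the talk (the instance may no longer be online), &lt;code&gt;_shape=array&lt;/code&gt; is Datasette's option for returning a plain JSON list of row objects, and the &lt;code&gt;json_url&lt;/code&gt; and &lt;code&gt;fetch_rows&lt;/code&gt; helpers are hypothetical names introduced here.&lt;/p&gt;

```python
import json
import urllib.parse
import urllib.request

# Table URL taken from the talk; the instance may no longer be online.
TABLE_URL = (
    "https://pygotham-squirrels.vercel.app/squirrels/"
    "2018_Central_Park_Squirrel_Census_-_Squirrel_Data"
)


def json_url(table_url, **filters):
    """Build the .json URL for a table, with optional column=value filters."""
    params = dict(filters)
    params["_shape"] = "array"  # ask for a plain JSON list of row objects
    return table_url + ".json?" + urllib.parse.urlencode(params)


def fetch_rows(table_url, **filters):
    """Fetch and decode rows; needs network access and a live instance."""
    with urllib.request.urlopen(json_url(table_url, **filters)) as response:
        return json.load(response)


# Only the URL is built here, so this runs without network access.
print(json_url(TABLE_URL))
```

&lt;p&gt;Calling &lt;code&gt;fetch_rows(TABLE_URL)&lt;/code&gt; against a live instance would return a list of dictionaries, one per squirrel, ready to feed into a mapping library.&lt;/p&gt;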
&lt;p&gt;Another thing that I've been doing with this is figuring out how to turn these into much more custom experiences.&lt;/p&gt;
&lt;p&gt;Datasette allows you to provide &lt;a href="https://docs.datasette.io/en/stable/custom_templates.html"&gt;custom templates&lt;/a&gt;, it lets you provide custom CSS, which means you can actually use it to build entire websites powered under the hood by Datasette.&lt;/p&gt;
&lt;p&gt;One of the first I built was this one here. This is called &lt;a href="https://www.niche-museums.com/"&gt;niche-museums.com&lt;/a&gt; - it's a website for my hobby of trying to track down weird and niche museums around the world, I can click "Use my location" and see just the weird museums near me.&lt;/p&gt;
&lt;p&gt;There's one just around the corner from our house called &lt;a href="https://www.niche-museums.com/47"&gt;The Comic Book Rockstars Toilet Seat Museum&lt;/a&gt;, where a local comic book store has been collecting toilet seats illustrated by famous comic artists. And if you're ever in San Francisco, you should absolutely pop by Hayes Valley and check this place out.&lt;/p&gt;
&lt;p&gt;But this right here is just a heavily styled Datasette instance. If you go to &lt;a href="https://www.niche-museums.com/browse"&gt;/browse&lt;/a&gt;, it will give you access to the data. You can see that each museum is a row in a SQLite database table. I've got that plugin, so I get a free map showing you where everything is. Then the actual museum pages are just templated versions of the underlying Datasette pages.&lt;/p&gt;
&lt;p&gt;I really like this as a way of publishing websites because it gives you all of the benefits of static publishing, where it scales magically by starting up new instances and you don't have to worry about security holes and stuff because it's all read only data, but because there's a relational database under it, you can build features like search - so I can search for &lt;a href="https://www.niche-museums.com/browse/search?q=bigfoot"&gt;Bigfoot&lt;/a&gt; and get back the &lt;a href="https://www.niche-museums.com/2"&gt;Bigfoot Discovery Museum&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Or you can build things like this little "use my location" button at the top of the screen. But again, this is all built using Datasette and Datasette plugins. I have a plugin that knows how to turn a SQL query &lt;a href="https://datasette.io/plugins/datasette-atom"&gt;into an Atom feed&lt;/a&gt; so that people can subscribe to new updates to the website as well.&lt;/p&gt;
&lt;p&gt;But to finish, I want to talk about a totally different use-case for all of this stuff, and that's personal analytics.&lt;/p&gt;
&lt;p&gt;This is an idea that I had a year and a half ago, inspired by &lt;a href="https://writings.stephenwolfram.com/2019/02/seeking-the-productive-life-some-details-of-my-personal-infrastructure/"&gt;an essay written by Stephen Wolfram&lt;/a&gt;. Stephen wrote this absolutely fascinating essay about his approach to personal productivity, and the main feature of this essay is the scroll bar down the side. This thing just goes on and on and on, and he talks about how he had a standing desk, but then his heart rate monitor showed him that he gets better health benefits from walking around outside. So he built himself a little like popcorn tray for his laptop. He scanned every document he's ever written from the age of 11. He's got all of this amazing infrastructure set up for being productive, most of which I feel is way over the top for how I want to live my life.&lt;/p&gt;
&lt;p&gt;He has a green screen setup in his basement so he can give talks from home.&lt;/p&gt;
&lt;p&gt;But there was one thing in here that kind of caught my eye, and that's this idea that he has of a metasearcher. It's a personal search engine that he built that can search every e-mail, every file, every paper he's written, all of the people he knows for any search term.&lt;/p&gt;
&lt;p&gt;And I saw this and I thought, you know, that's something I'd really like. I would love to have a personal search engine for all of my stuff.&lt;/p&gt;
&lt;p&gt;But if I'm gonna build that and it's inspired by Stephen Wolfram, I need to come up with a decent name for it.&lt;/p&gt;
&lt;p&gt;And so since it is inspired by Wolf Ram, but it's not really aiming to be as good or comprehensive as the thing that he built, I decided to call it Dog Sheep. So I have a project called &lt;a href="https://dogsheep.github.io/"&gt;Dogsheep&lt;/a&gt;. I love that pun so much I committed myself to actually building the software.&lt;/p&gt;
&lt;p&gt;And so Dogsheep is a collection of tools for personal analytics: tools for pulling in data about yourself from lots of different sources, turning those into SQLite databases so that you can start working with them and then you can load them up in Datasette and start doing interesting things against them.&lt;/p&gt;
&lt;p&gt;I'll give you a demonstration of my personal Dogsheep instance, where all of the data from these tools ends up.&lt;/p&gt;
&lt;p&gt;This is my personal Dogsheep. It has data from a whole bunch of different places. I've got data from Twitter and HealthKit - my Apple Watch tracks my health over time. GitHub data, Foursquare Swarm, I pulled in data from LinkedIn and Goodreads and Hacker News and all of these different places, and I can do all sorts of fun things with this.&lt;/p&gt;
&lt;p&gt;So I'll start by showing you a couple of things from Twitter. So I index all of my tweets, anytime somebody mentions me, and all of the tweets that I've favorited and because I've indexed my favourited tweets I can run searches against them.&lt;/p&gt;
&lt;p&gt;So I can search for PyGotham and see all of the tweets that I've favorited about PyGotham.&lt;/p&gt;
&lt;p&gt;Twitter does not offer this feature themselves and having this feature suddenly makes Twitter favourites so, so much more useful.&lt;/p&gt;
&lt;p&gt;I can search my followers so if I decide I want to have a conversation with a VC, I can type in VC and see anyone who follows me on Twitter who mentions VC in their bio, and hence is somebody that I might be able to contact via direct messages.&lt;/p&gt;
&lt;p&gt;But a really useful example is that my dog, Cleo, has a Twitter account. And every time she goes to the vet, she tweets a selfie and with her selfie she tweets how much she weighs. So she'll tweet "I weigh 42.5 pounds. I grew a little bit more dog". Because she tweets this and I've got this coming into Dogsheep I now have these in a SQL database - so I can construct a SQL query that looks for tweets that mentioned her weight. And I can use &lt;a href="https://datasette.io/plugins/datasette-rure"&gt;a regular expression&lt;/a&gt; to extract that weight value out as the thing before the LB. And then I can use my charting plugin to plot that as a chart. So I now have a chart of Cleo's weight over time, as reported by her posting selfies at the vet. It's like having the kind of data warehouse you get at a giant corporation, but just against your own data, and costing about $5 a month in hosting fees.&lt;/p&gt;
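&lt;p&gt;The weight extraction can be sketched in plain Python by registering a custom SQL function on a &lt;code&gt;sqlite3&lt;/code&gt; connection. This stands in for the regular expression plugin used in the talk rather than reproducing it; the &lt;code&gt;extract_weight&lt;/code&gt; function, the &lt;code&gt;tweets&lt;/code&gt; table layout, the dates and the second and third sample tweets are all hypothetical, as is the assumption that the unit is written as "lb", "lbs" or "pounds".&lt;/p&gt;

```python
import re
import sqlite3

# A number followed by "lb", "lbs" or "pounds" (unit spellings are an assumption).
WEIGHT_RE = re.compile(r"(\d+(?:\.\d+)?)\s*(?:lbs?|pounds?)\b", re.IGNORECASE)


def extract_weight(text):
    """Return the first weight found in text as a float, or None."""
    match = WEIGHT_RE.search(text or "")
    return float(match.group(1)) if match else None


conn = sqlite3.connect(":memory:")
# Make the Python function callable from SQL.
conn.create_function("extract_weight", 1, extract_weight)
conn.execute("create table tweets (created_at text, full_text text)")
conn.executemany(
    "insert into tweets values (?, ?)",
    [
        ("2019-01-10", "I weigh 42.5 pounds. I grew a little bit more dog"),
        ("2019-06-02", "Vet day! 44.1lb of good girl"),
        ("2019-07-04", "Happy 4th of July"),
    ],
)

# Keep only tweets where a weight was found, oldest first, ready for charting.
rows = conn.execute(
    "select created_at, extract_weight(full_text) as weight from tweets "
    "where extract_weight(full_text) is not null order by created_at"
).fetchall()
print(rows)
```

&lt;p&gt;Each resulting pair is a date and a weight, which is the shape a charting plugin needs for a line or bar chart over time.&lt;/p&gt;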
&lt;p&gt;So that's tweets. But there was so much more stuff in here. I'm running dozens and dozens of projects on GitHub, and I pull all of the issues, issue comments, commits and releases into my one database.&lt;/p&gt;
&lt;p&gt;So here I have a table of all 7,000 commits that I've made on GitHub across all of these different projects so I can search these in one place, I can evaluate them, I can compare commits across different projects and I can also do charting. So here's a graph that's constructed by a custom SQL query, which pulls out just the date of each commit, groups by them and sticks on a count. So these are my commits over time as a chart.&lt;/p&gt;
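&lt;p&gt;A query of that shape might look something like the following sketch. The &lt;code&gt;commits&lt;/code&gt; table, its &lt;code&gt;committer_date&lt;/code&gt; column and the sample rows are hypothetical; the real schema is not shown in the talk.&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table: one row per commit with an ISO 8601 timestamp.
conn.execute("create table commits (sha text, committer_date text)")
conn.executemany(
    "insert into commits values (?, ?)",
    [
        ("a1", "2020-09-01T10:15:00"),
        ("b2", "2020-09-01T18:40:00"),
        ("c3", "2020-09-03T09:05:00"),
    ],
)

# Reduce each timestamp to its date, group by that date and count the rows.
rows = conn.execute(
    "select date(committer_date) as day, count(*) as commits "
    "from commits group by day order by day"
).fetchall()
print(rows)
```

&lt;p&gt;Plotting &lt;code&gt;day&lt;/code&gt; on the X axis against &lt;code&gt;commits&lt;/code&gt; on the Y axis gives the commits-over-time chart.&lt;/p&gt;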
&lt;p&gt;A key philosophy of Datasette is everything should have a URL. So once I've constructed this query and put it on a chart, I can bookmark that page right there and that becomes, essentially, the application encoded in a URL that I can bookmark and link to and visit later on.&lt;/p&gt;
&lt;p&gt;I've got a copy of my genome. I did &lt;a href="https://www.23andme.com/"&gt;23AndMe&lt;/a&gt; a few years ago, and I found out recently that they have an export button and you can export your genome out as a CSV file. So I did that. I now have a copy of the 600,000 rows of my genome that 23AndMe keep track of. Which means I can use SQL queries to analyse my own genome. I have a query here that tells me what colour my eyes are based on running this query against my genome - apparently my eyes are blue 99% of the time because I have a GG genotype in this particular location on my genome.&lt;/p&gt;
&lt;p&gt;This took an entire weekend to figure out. I &lt;a href="https://simonwillison.net/2019/Oct/21/weeknotes-squirrels-genome/"&gt;went to Science Hack Day&lt;/a&gt; and borrowed a geneticist so I could see if I could figure out some homebrew genetic explorations.&lt;/p&gt;
&lt;p&gt;I mentioned the Apple Watch earlier. Apple record an incredible amount of data about me and they don't upload it to the cloud - an interesting thing about Apple is they keep that stuff on your device. So if you dig around in the HealthKit app on the iPhone, there's an export button that will give you that data back out again. And I &lt;a href="https://datasette.io/tools/healthkit-to-sqlite"&gt;wrote code to turn that into SQLite&lt;/a&gt; and there was an enormous quantity of data in here. I've got standing time and basal energy burned, my body mass over time reported by my Withings scales, headphone audio exposure is something that showed up last year, the number of flights of steps I climbed, all sorts of stuff.&lt;/p&gt;
&lt;p&gt;But the most interesting one is workouts because every time you track a workout using your Apple Watch, it goes in the database. And if it's an outdoor workout - where you're, say, going for a run, it records your GPS location every few seconds. I ran the Bay To Breakers race in San Francisco a couple of years ago and because of that, I've got a database table with my location-tracked route. It's 3,800 points, and I can plot that on a map and see the exact, finely grained route I took through San Francisco when I ran Bay To Breakers just over three years ago.&lt;/p&gt;
&lt;p&gt;It's kind of amazing to me that this stuff is just sat there on our devices waiting for us to liberate it and start using it to build these visualisations and learn interesting things about ourselves.&lt;/p&gt;
&lt;p&gt;I also mentioned earlier on Apple Photos. Apple Photos databases are absolutely fascinating because it turns out Apple run machine learning models on your phone to figure out what you took photographs of. And this means I can do things like run a SQL query that shows me &lt;a href="https://simonwillison.net/2020/May/21/dogsheep-photos/"&gt;all of the photographs that I've taken of a pelican&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;My personal Dogsheep right here shows me these pictures - these are Apple Photos I took where Apple itself has labelled them as pelican.&lt;/p&gt;
&lt;p&gt;And even more fun than that is Apple calculates scores for your photos. These are hidden away in the SQLite database. And they have names like Z Overall Aesthetic Score or Z Harmonious Colour Score or Z Pleasant Camera Tilt Score. So this query here isn't just showing me photographs I've taken of pelicans - it's sorting them by the most aesthetically pleasing according to Apple's classification algorithm that they're running on my device.&lt;/p&gt;
&lt;p&gt;I can show you the Pelican photo with the most pleasant camera tilt - there's so much fun stuff you can do digging into all of this.&lt;/p&gt;
&lt;p&gt;So I've been building this set of Dogsheep tools for about a year now, but the thing I always wanted to do was to tie them all together. I wanted to build that personal search engine.&lt;/p&gt;
&lt;p&gt;And a couple of weeks ago, I finally got it working.&lt;/p&gt;
&lt;p&gt;I've called it &lt;a href="https://datasette.io/plugins/dogsheep-beta"&gt;Dogsheep Beta&lt;/a&gt; because Stephen Wolfram's search engine is called &lt;a href="https://www.wolframalpha.com/"&gt;Wolfram Alpha&lt;/a&gt;. So I figured Dogsheep Beta was definitely the appropriate name for it.&lt;/p&gt;
&lt;p&gt;And so this right here is Dogsheep Beta, and it lets me run searches across my tweets, my photos, my GitHub commits, issue comments, releases, comments I made on Hacker News - all in one place.&lt;/p&gt;
&lt;p&gt;And so if I search for example, for "cleopaws costume", I will see all of the tweets by my dog, whose Twitter handle is @Cleopaws where she's tweeting about her costumes. I think she makes &lt;a href="https://twitter.com/Cleopaws/status/894322673525837824"&gt;a particularly fetching pirate&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Or again, I can search for PyGotham and I get back not just tweets, but also issue comments from my personal issues where I've been tracking the fact that I need to submit a talk.&lt;/p&gt;
&lt;p&gt;Again, this is a custom search interface, but really all this is under the hood is SQLite - it's a SQLite database with a table that acts as a search index and I'm running queries using SQLite's built-in full text search mechanism.&lt;/p&gt;
&lt;p&gt;If any of this has caught your interest, all of this is aggressively open source. So if you go to &lt;a href="https://dogsheep.github.io/"&gt;dogsheep.github.io&lt;/a&gt; this is an index page for all of the Dogsheep project. Click through to &lt;a href="https://github.com/simonw/datasette"&gt;the GitHub Datasette repository&lt;/a&gt; and I've got very &lt;a href="https://docs.datasette.io/"&gt;extensive documentation&lt;/a&gt; covering Datasette, what it can do, the plugins available and how to write plugins of your own.&lt;/p&gt;
&lt;p&gt;And really, my goal with this project is I want to build a growing ecosystem of not just Datasette plugins, but tools for getting data from all of these different sources and loading them into SQLite databases so that we can join them together and use them to find out interesting things about the world and about our own lives.&lt;/p&gt;
&lt;p&gt;If this has piqued your interest, please drop by the project. We have &lt;a href="https://github.com/simonw/datasette/discussions"&gt;a GitHub discussions board&lt;/a&gt; which you're welcome to join.&lt;/p&gt;
&lt;p&gt;[I hadn't launched the full website when I gave this talk, but today I'd encourage you to explore &lt;a href="https://datasette.io/"&gt;datasette.io&lt;/a&gt;]&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/sqlite"&gt;sqlite&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/my-talks"&gt;my-talks&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/dogsheep"&gt;dogsheep&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/small-data"&gt;small-data&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/annotated-talks"&gt;annotated-talks&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="sqlite"/><category term="my-talks"/><category term="datasette"/><category term="dogsheep"/><category term="small-data"/><category term="annotated-talks"/></entry><entry><title>How to secure an Ubuntu server using Tailscale and UFW</title><link href="https://simonwillison.net/2021/Feb/26/tailscale/#atom-tag" rel="alternate"/><published>2021-02-26T20:31:52+00:00</published><updated>2021-02-26T20:31:52+00:00</updated><id>https://simonwillison.net/2021/Feb/26/tailscale/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://tailscale.com/kb/1077/secure-server-ubuntu-18-04"&gt;How to secure an Ubuntu server using Tailscale and UFW&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
This is the Tailscale tutorial I’ve always wanted: it explains in detail how you can run an Ubuntu server (from any cloud provider) such that only devices on your personal Tailscale network can access it.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/security"&gt;security&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ubuntu"&gt;ubuntu&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/dogsheep"&gt;dogsheep&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/tailscale"&gt;tailscale&lt;/a&gt;&lt;/p&gt;



</summary><category term="security"/><category term="ubuntu"/><category term="dogsheep"/><category term="tailscale"/></entry><entry><title>Building a search engine for datasette.io</title><link href="https://simonwillison.net/2020/Dec/19/dogsheep-beta/#atom-tag" rel="alternate"/><published>2020-12-19T18:12:31+00:00</published><updated>2020-12-19T18:12:31+00:00</updated><id>https://simonwillison.net/2020/Dec/19/dogsheep-beta/#atom-tag</id><summary type="html">
    &lt;p&gt;This week I added &lt;a href="https://datasette.io/-/beta"&gt;a search engine&lt;/a&gt; to &lt;a href="https://datasette.io/"&gt;datasette.io&lt;/a&gt;, using the search indexing tool I've been building for &lt;a href="https://datasette.substack.com/p/dogsheep-personal-analytics-with"&gt;Dogsheep&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="A screenshot of dogsheep.io search results for ripgrep" src="https://static.simonwillison.net/static/2020/dogsheep-beta-ripgrep.jpg" style="max-width:100%;" /&gt;&lt;/p&gt;
&lt;h4&gt;Project search for Datasette&lt;/h4&gt;
&lt;p&gt;The Datasette project has a &lt;em&gt;lot&lt;/em&gt; of constituent parts. There's the project itself and its &lt;a href="https://docs.datasette.io/"&gt;documentation&lt;/a&gt; - 171 pages when exported to PDF and counting. Then there are the &lt;a href="https://datasette.io/plugins"&gt;48 plugins&lt;/a&gt;, &lt;a href="https://datasette.io/tools/sqlite-utils"&gt;sqlite-utils&lt;/a&gt; and &lt;a href="https://datasette.io/tools"&gt;21 more tools&lt;/a&gt; for creating SQLite databases, the &lt;a href="https://dogsheep.github.io/"&gt;Dogsheep&lt;/a&gt; collection and over three years of content I've written about the project &lt;a href="https://simonwillison.net/tags/datasette/"&gt;on my blog&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The new &lt;a href="https://datasette.io/-/beta"&gt;datasette.io search engine&lt;/a&gt; provides a faceted search interface to all of this material in one place. It currently searches across:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Every section of the latest documentation (415 total)&lt;/li&gt;
&lt;li&gt;48 plugin READMEs&lt;/li&gt;
&lt;li&gt;22 tool READMEs&lt;/li&gt;
&lt;li&gt;63 news items posted on the Datasette website&lt;/li&gt;
&lt;li&gt;212 items from my blog&lt;/li&gt;
&lt;li&gt;Release notes from 557 package releases&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I plan to extend it with more data sources in the future.&lt;/p&gt;
&lt;h4&gt;How it works: Dogsheep Beta&lt;/h4&gt;
&lt;p&gt;I'm reusing the search engine I originally built for my Dogsheep personal analytics project (see &lt;a href="https://simonwillison.net/2020/Nov/14/personal-data-warehouses/"&gt;Personal Data Warehouses: Reclaiming Your Data&lt;/a&gt;). I call that search engine &lt;a href="https://github.com/dogsheep/beta"&gt;Dogsheep Beta&lt;/a&gt;. The name is &lt;a href="https://datasette.substack.com/p/dogsheep-personal-analytics-with"&gt;a pun&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;SQLite has great &lt;a href="https://sqlite.org/fts5.html"&gt;full-text search&lt;/a&gt; built in, and I make extensive use of that in Datasette projects already. But out of the box it's not quite right for this kind of search engine that spans multiple different content types.&lt;/p&gt;
&lt;p&gt;The problem is relevance calculation. I wrote about this in &lt;a href="https://simonwillison.net/2019/Jan/7/exploring-search-relevance-algorithms-sqlite/"&gt;Exploring search relevance algorithms with SQLite&lt;/a&gt; - short version: query relevance is calculated using statistics against the whole corpus, so search terms that occur rarely in the overall corpus contribute a higher score than more common terms.&lt;/p&gt;
&lt;p&gt;This means that full-text ranking scores calculated against one table of data cannot be meaningfully compared to scores calculated independently against a separate table, as the corpus statistics used to calculate the rank will differ.&lt;/p&gt;
&lt;p&gt;To get usable scores, you need everything in a single table. That's what Dogsheep Beta does: it creates a new table, called &lt;code&gt;search_index&lt;/code&gt;, and copies searchable content from the other tables into that new table.&lt;/p&gt;
&lt;p&gt;This is analogous to how an external search index like Elasticsearch works: you store your data in the main database, then periodically update an index in Elasticsearch. It's the &lt;a href="https://2017.djangocon.us/talks/the-denormalized-query-engine-design-pattern/"&gt;denormalized query engine&lt;/a&gt; design pattern in action.&lt;/p&gt;
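&lt;p&gt;A minimal sketch of the single-table idea, using Python's built-in &lt;code&gt;sqlite3&lt;/code&gt; module. This is not Dogsheep Beta's actual schema: the column layout and sample rows are assumptions for illustration, and it assumes your SQLite build includes FTS5. Because every content type lives in one FTS5 table, the rank values share the same corpus statistics and can be sorted together:&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# One FTS5 table holds every content type, so rank is computed
# against a single shared corpus. Column names are illustrative.
conn.execute("create virtual table search_index using fts5(type, title, search_1)")
conn.executemany(
    "insert into search_index (type, title, search_1) values (?, ?, ?)",
    [
        ("release", "datasette 0.52", "Renamed config to settings"),
        ("blog", "Weeknotes", "Notes on settings and plugin configuration"),
        ("plugin", "datasette-graphql", "GraphQL interface for tables"),
    ],
)
# FTS5 exposes a hidden rank column; lower values are better matches.
results = conn.execute(
    "select type, title from search_index where search_index match ? order by rank",
    ("settings",),
).fetchall()
for row in results:
    print(row)
```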
&lt;h4&gt;Configuring Dogsheep Beta&lt;/h4&gt;
&lt;p&gt;There are two components to Dogsheep Beta: a command-line tool for building a search index, and a Datasette plugin for providing an interface for running searches.&lt;/p&gt;
&lt;p&gt;Both of these run off a YAML configuration file, which defines the tables that should be indexed and also defines how those search results should be displayed.&lt;/p&gt;
&lt;p&gt;(Having one configuration file handle both indexing and display feels a little inelegant, but it's extremely productive for iterating on so I'm letting that slide.)&lt;/p&gt;
&lt;p&gt;Here's the full &lt;a href="https://github.com/simonw/datasette.io/blob/0e3e839ca1efb19cd84100291789521c0a1a3561/templates/dogsheep-beta.yml"&gt;Dogsheep configuration for datasette.io&lt;/a&gt;. An annotated extract:&lt;/p&gt;
&lt;div class="highlight highlight-source-yaml"&gt;&lt;pre&gt;&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; Index material in the content.db SQLite file&lt;/span&gt;
&lt;span class="pl-ent"&gt;content.db&lt;/span&gt;:
  &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; Define a search type called 'releases'&lt;/span&gt;
  &lt;span class="pl-ent"&gt;releases&lt;/span&gt;:
    &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; Populate that search type by executing this SQL&lt;/span&gt;
    &lt;span class="pl-ent"&gt;sql&lt;/span&gt;: &lt;span class="pl-s"&gt;|-&lt;/span&gt;
&lt;span class="pl-s"&gt;      select&lt;/span&gt;
&lt;span class="pl-s"&gt;        releases.id as key,&lt;/span&gt;
&lt;span class="pl-s"&gt;        repos.name || ' ' || releases.tag_name as title,&lt;/span&gt;
&lt;span class="pl-s"&gt;        releases.published_at as timestamp,&lt;/span&gt;
&lt;span class="pl-s"&gt;        releases.body as search_1,&lt;/span&gt;
&lt;span class="pl-s"&gt;        1 as is_public&lt;/span&gt;
&lt;span class="pl-s"&gt;      from&lt;/span&gt;
&lt;span class="pl-s"&gt;        releases&lt;/span&gt;
&lt;span class="pl-s"&gt;        join repos on releases.repo = repos.id&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;/span&gt;    &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; When displaying a search result, use this SQL to&lt;/span&gt;
    &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; return extra details about the item&lt;/span&gt;
    &lt;span class="pl-ent"&gt;display_sql&lt;/span&gt;: &lt;span class="pl-s"&gt;|-&lt;/span&gt;
&lt;span class="pl-s"&gt;      select&lt;/span&gt;
&lt;span class="pl-s"&gt;        -- highlight() is a custom SQL function&lt;/span&gt;
&lt;span class="pl-s"&gt;        highlight(render_markdown(releases.body), :q) as snippet,&lt;/span&gt;
&lt;span class="pl-s"&gt;        html_url&lt;/span&gt;
&lt;span class="pl-s"&gt;      from releases where id = :key&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;/span&gt;    &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; Jinja template fragment to display the result&lt;/span&gt;
    &lt;span class="pl-ent"&gt;display&lt;/span&gt;: &lt;span class="pl-s"&gt;|-&lt;/span&gt;
&lt;span class="pl-s"&gt;      &amp;lt;h3&amp;gt;Release: &amp;lt;a href="{{ display.html_url }}"&amp;gt;{{ title }}&amp;lt;/a&amp;gt;&amp;lt;/h3&amp;gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;      &amp;lt;p&amp;gt;{{ display.snippet|safe }}&amp;lt;/p&amp;gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;      &amp;lt;p&amp;gt;&amp;lt;small&amp;gt;Released {{ timestamp }}&amp;lt;/small&amp;gt;&amp;lt;/p&amp;gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The core pattern here is the &lt;code&gt;sql:&lt;/code&gt; key, which defines a SQL query that must return the following columns:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;key&lt;/code&gt; - a unique identifier for this search item&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;title&lt;/code&gt; - a title for this indexed document&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;timestamp&lt;/code&gt; - a timestamp for when it was created. May be null.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;search_1&lt;/code&gt; - text to be searched. I may add support for &lt;code&gt;search_2&lt;/code&gt; and &lt;code&gt;search_3&lt;/code&gt; later on to store text that will be treated with a lower relevance score.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;is_public&lt;/code&gt; - should this be considered "public" data. This is a holdover from Dogsheep Beta's application for personal analytics; I don't actually need it for datasette.io.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;To create an index, run the following:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;dogsheep-beta index dogsheep-index.db dogsheep-config.yml
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;index&lt;/code&gt; command will loop through every configured search type in the YAML file, execute the SQL query and use it to populate a &lt;code&gt;search_index&lt;/code&gt; table in the &lt;code&gt;dogsheep-index.db&lt;/code&gt; SQLite database file.&lt;/p&gt;
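&lt;p&gt;The indexing loop described above can be sketched roughly like this. It is an illustration under stated assumptions, not the tool's real implementation: the &lt;code&gt;config&lt;/code&gt; dictionary stands in for the parsed YAML file, the &lt;code&gt;build_index()&lt;/code&gt; helper is hypothetical, and the &lt;code&gt;search_index&lt;/code&gt; schema is a guess based on the columns listed earlier.&lt;/p&gt;

```python
import sqlite3

# Hypothetical stand-in for the YAML config: search type name mapped to SQL.
config = {
    "releases": "select id as key, name as title, published_at as timestamp, "
                "body as search_1, 1 as is_public from releases",
    "news": "select id as key, headline as title, null as timestamp, "
            "body as search_1, 1 as is_public from news",
}

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row
conn.executescript("""
create table releases (id integer primary key, name text, published_at text, body text);
create table news (id integer primary key, headline text, body text);
insert into releases values (1, 'datasette 0.52', '2020-11-28', 'Settings rename');
insert into news values (1, 'Search launched', 'datasette.io now has search');
create table search_index (
    type text, key text, title text, timestamp text,
    search_1 text, is_public integer,
    primary key (type, key)
);
""")

def build_index(conn, config):
    # Run each configured query and copy its rows into search_index.
    for search_type, sql in config.items():
        for row in conn.execute(sql).fetchall():
            conn.execute(
                "insert or replace into search_index values (?, ?, ?, ?, ?, ?)",
                (search_type, str(row["key"]), row["title"],
                 row["timestamp"], row["search_1"], row["is_public"]),
            )

build_index(conn, config)
indexed = [tuple(r) for r in conn.execute(
    "select type, key, title from search_index order by type")]
print(indexed)
```

&lt;p&gt;Keying the table on the pair of type and key means re-running the loop replaces existing rows rather than duplicating them.&lt;/p&gt;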
&lt;p&gt;Here's &lt;a href="https://datasette.io/dogsheep-index/search_index"&gt;the search_index table&lt;/a&gt; for &lt;a href="https://datasette.io/"&gt;datasette.io&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;When you run a search, the plugin queries that table and gets back results sorted by relevance (or other sort criteria, if specified).&lt;/p&gt;
&lt;p&gt;To display the results, it loops through each one and uses the Jinja template fragment from the configuration file to turn it into HTML.&lt;/p&gt;
&lt;p&gt;If a &lt;code&gt;display_sql:&lt;/code&gt; query is defined, that query will be executed for each result to populate the &lt;code&gt;{{ display }}&lt;/code&gt; object made available to the template. &lt;a href="https://www.sqlite.org/np1queryprob.html"&gt;Many Small Queries Are Efficient In SQLite&lt;/a&gt;.&lt;/p&gt;
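&lt;p&gt;A minimal sketch of that per-result lookup. To stay within the standard library it uses &lt;code&gt;string.Template&lt;/code&gt; and plain text in place of the Jinja fragment, so it is a simplification rather than what the plugin does; the sample tables and search results are made up.&lt;/p&gt;

```python
import sqlite3
from string import Template

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row
conn.executescript("""
create table releases (id integer primary key, body text, html_url text);
insert into releases values (1, 'Bug fixes', 'https://example.com/r/1');
insert into releases values (2, 'New settings', 'https://example.com/r/2');
""")

# Hypothetical search results, as the search_index query might return them.
results = [{"key": 1, "title": "tool 1.0"}, {"key": 2, "title": "tool 1.1"}]
display_sql = "select body as snippet, html_url from releases where id = :key"
# string.Template stands in for the Jinja fragment; plain text, not HTML.
display = Template("Release: $title | $snippet | $html_url")

rendered = []
for result in results:
    # One small query per result fills in the extra display fields.
    extra = conn.execute(display_sql, {"key": result["key"]}).fetchone()
    rendered.append(display.substitute(title=result["title"], **dict(extra)))

print("\n".join(rendered))
```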
&lt;h4&gt;Search term highlighting&lt;/h4&gt;
&lt;p&gt;I spent &lt;a href="https://github.com/simonw/datasette.io/issues/49"&gt;a bit of time&lt;/a&gt; thinking about search highlighting. SQLite has an implementation of highlighting built in - &lt;a href="https://sqlite.org/fts5.html#the_snippet_function"&gt;the snippet() function&lt;/a&gt; - but it's not designed to be HTML-aware so there's a risk it might mangle HTML by adding highlighting marks in the middle of a tag or attribute.&lt;/p&gt;
&lt;p&gt;I ended up borrowing a BSD licensed &lt;a href="https://github.com/django-haystack/django-haystack/blob/v3.0/haystack/utils/highlighting.py"&gt;highlighting class&lt;/a&gt; from the &lt;a href="https://github.com/django-haystack/django-haystack"&gt;django-haystack&lt;/a&gt; project. It deals with HTML by stripping tags, which seems to be more-or-less what Google do for their own search results so I figured that's good enough for me.&lt;/p&gt;
&lt;p&gt;I used this &lt;a href="https://github.com/simonw/datasette.io/blob/0e3e839ca1efb19cd84100291789521c0a1a3561/plugins/sql_functions.py"&gt;one-off site plugin&lt;/a&gt; to wrap the highlighting code in a custom SQLite function. This meant I could call it from the &lt;code&gt;display_sql:&lt;/code&gt; query in the Dogsheep Beta YAML configuration.&lt;/p&gt;
&lt;p&gt;A custom template tag would be more elegant, but I don't yet have a mechanism to expose custom template tags in the Dogsheep Beta rendering mechanism.&lt;/p&gt;
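&lt;p&gt;To illustrate the general technique of exposing Python code as a SQL function, here is a small sketch using &lt;code&gt;sqlite3.Connection.create_function()&lt;/code&gt; directly rather than a Datasette plugin. The &lt;code&gt;highlight()&lt;/code&gt; function below is a simplified plain-text stand-in written for this example: it marks matches with asterisks, whereas the borrowed class strips tags and emits HTML.&lt;/p&gt;

```python
import re
import sqlite3

def highlight(text, query):
    # Wrap each case-insensitive match of a query word in ** markers.
    # A plain-text stand-in: the real class strips tags and emits HTML.
    words = [re.escape(w) for w in query.split() if w]
    if not words:
        return text
    pattern = re.compile("(" + "|".join(words) + ")", re.IGNORECASE)
    return pattern.sub(r"**\1**", text)

conn = sqlite3.connect(":memory:")
# Register the Python function so SQL can call highlight(text, q).
conn.create_function("highlight", 2, highlight)
snippet = conn.execute(
    "select highlight(:body, :q)",
    {"body": "Fixed a Windows startup error", "q": "windows"},
).fetchone()[0]
print(snippet)
```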
&lt;h4&gt;Build, index, deploy&lt;/h4&gt;
&lt;p&gt;The Datasette website implements the Baked Data pattern, where the content is compiled into SQLite database files and bundled with the application code itself as part of the deploy.&lt;/p&gt;
&lt;p&gt;Building the index is just another step of that process.&lt;/p&gt;
&lt;p&gt;Here's &lt;a href="https://github.com/simonw/datasette.io/blob/0e3e839ca1efb19cd84100291789521c0a1a3561/.github/workflows/deploy.yml"&gt;the deploy.yml&lt;/a&gt; GitHub workflow used by the site. It roughly does the following:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Download the current version of the &lt;a href="https://datasette.io/content"&gt;content.db&lt;/a&gt; database file. This is so it doesn't have to re-fetch release and README content that was previously stored there.&lt;/li&gt;
&lt;li&gt;Download the current version of &lt;a href="https://datasette.io/blog"&gt;blog.db&lt;/a&gt;, with entries from my blog. This means I don't have to fetch all entries, just the new ones.&lt;/li&gt;
&lt;li&gt;Run &lt;a href="https://github.com/simonw/datasette.io/blob/0e3e839ca1efb19cd84100291789521c0a1a3561/build_directory.py"&gt;build_directory.py&lt;/a&gt;, the script which fetches data for the plugins and tools pages.
&lt;ul&gt;
&lt;li&gt;This hits the GitHub GraphQL API to find new repositories tagged &lt;code&gt;datasette-io&lt;/code&gt; and &lt;code&gt;datasette-plugin&lt;/code&gt; and &lt;code&gt;datasette-tool&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;That GraphQL query also returns the most recent release. The script then checks to see if those releases have previously been fetched and, if not, uses &lt;a href="https://datasette.io/tools/github-to-sqlite"&gt;github-to-sqlite&lt;/a&gt; to fetch them.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Imports the data from &lt;a href="https://github.com/simonw/datasette.io/blob/0e3e839ca1efb19cd84100291789521c0a1a3561/news.yaml"&gt;news.yaml&lt;/a&gt; into a &lt;code&gt;news&lt;/code&gt; table using &lt;a href="https://datasette.io/tools/yaml-to-sqlite"&gt;yaml-to-sqlite&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Imports the latest PyPI download statistics for my packages from my &lt;a href="https://github.com/simonw/package-stats"&gt;simonw/package-stats&lt;/a&gt; repository, which implements &lt;a href="https://simonwillison.net/2020/Oct/9/git-scraping/"&gt;git scraping&lt;/a&gt; against the most excellent &lt;a href="https://pypistats.org/"&gt;pypistats.org&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Runs the &lt;code&gt;dogsheep-beta index&lt;/code&gt; command to build a &lt;code&gt;dogsheep-index.db&lt;/code&gt; search index.&lt;/li&gt;
&lt;li&gt;Runs some soundness checks, e.g. &lt;code&gt;datasette . --get "/plugins"&lt;/code&gt;, to verify that Datasette is likely to at least return 200 results for some critical pages once published.&lt;/li&gt;
&lt;li&gt;Uses &lt;code&gt;datasette publish cloudrun&lt;/code&gt; to deploy the results to Google Cloud Run, which hosts the website.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I love building websites this way. You can have as much complexity as you like in the build script (my TIL website build script &lt;a href="https://simonwillison.net/2020/Sep/3/weeknotes-airtable-screenshots-dogsheep/#weeknotes-2020-09-03-social-media-cards-tils"&gt;generates screenshots using Puppeteer&lt;/a&gt;) but the end result is some simple database files running on inexpensive, immutable, scalable hosting.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/search"&gt;search&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sqlite"&gt;sqlite&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/dogsheep"&gt;dogsheep&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/weeknotes"&gt;weeknotes&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/cloudrun"&gt;cloudrun&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/baked-data"&gt;baked-data&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="projects"/><category term="search"/><category term="sqlite"/><category term="datasette"/><category term="dogsheep"/><category term="weeknotes"/><category term="cloudrun"/><category term="baked-data"/></entry><entry><title>The case against client certificates</title><link href="https://simonwillison.net/2020/Dec/9/case-against-client-certificates/#atom-tag" rel="alternate"/><published>2020-12-09T14:41:46+00:00</published><updated>2020-12-09T14:41:46+00:00</updated><id>https://simonwillison.net/2020/Dec/9/case-against-client-certificates/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://twitter.com/colmmacc/status/1057017343438540801"&gt;The case against client certificates&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Colm MacCárthaigh provides a passionately argued Twitter thread about client certificates and why they should be avoided. I tried using them as an extra layer of protection for my personal Dogsheep server and ended up abandoning them: certificate management across my devices was too fiddly.

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://fly.io/blog/incoming-6pn-private-networks/"&gt;Thomas Ptacek&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/certificates"&gt;certificates&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/dogsheep"&gt;dogsheep&lt;/a&gt;&lt;/p&gt;



</summary><category term="certificates"/><category term="dogsheep"/></entry><entry><title>Weeknotes: github-to-sqlite workflows, datasette-ripgrep enhancements, Datasette 0.52</title><link href="https://simonwillison.net/2020/Dec/6/weeknotes/#atom-tag" rel="alternate"/><published>2020-12-06T05:46:11+00:00</published><updated>2020-12-06T05:46:11+00:00</updated><id>https://simonwillison.net/2020/Dec/6/weeknotes/#atom-tag</id><summary type="html">
    &lt;p&gt;This week: Improvements to &lt;code&gt;datasette-ripgrep&lt;/code&gt;, &lt;code&gt;github-to-sqlite&lt;/code&gt; and &lt;code&gt;datasette-graphql&lt;/code&gt;, plus Datasette 0.52 and a flurry of dot-releases.&lt;/p&gt;
&lt;h4&gt;datasette-ripgrep 0.5 and 0.6&lt;/h4&gt;
&lt;p&gt;&lt;a href="https://github.com/simonw/datasette-ripgrep"&gt;datasette-ripgrep&lt;/a&gt; (introduced &lt;a href="https://simonwillison.net/2020/Nov/28/datasette-ripgrep/"&gt;last week&lt;/a&gt;) landed &lt;a href="https://news.ycombinator.com/item?id=25236636"&gt;on Hacker News&lt;/a&gt;, and the comments there inspired me to build a few new features. The interface looks like this now:&lt;/p&gt;
&lt;p&gt;&lt;a href="https://ripgrep.datasette.io/-/ripgrep?pattern=client.*cookies%3D&amp;amp;glob=test*.py"&gt;&lt;img src="https://static.simonwillison.net/static/2020/ripgrep__client__cookies_.png" alt="Screenshot showing search results for a client.cookies= against files matching test.py" style="max-width:100%;" /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;I added options for literal searches (as opposed to a regex match) and ignoring case, and a field that lets you filter to just a specific file pattern, for example &lt;code&gt;test*.py&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;These are already features of &lt;a href="https://github.com/BurntSushi/ripgrep/blob/master/GUIDE.md"&gt;ripgrep&lt;/a&gt; so adding them was a case of hooking up the interface and using it to modify the command-line arguments passed to the underlying tool.&lt;/p&gt;
&lt;h4&gt;github-to-sqlite workflows&lt;/h4&gt;
&lt;p&gt;&lt;a href="https://github.com/dogsheep/github-to-sqlite"&gt;github-to-sqlite&lt;/a&gt; is my command-line tool for importing data from the GitHub API into a SQLite database, for analysis with Datasette.&lt;/p&gt;
&lt;p&gt;I released &lt;a href="https://github.com/dogsheep/github-to-sqlite/releases/tag/2.8"&gt;github-to-sqlite 2.8&lt;/a&gt; this week with two new commands: &lt;code&gt;github-to-sqlite pull-requests&lt;/code&gt;, contributed by &lt;a href="https://github.com/adamjonas"&gt;Adam Jonas&lt;/a&gt;, and &lt;code&gt;github-to-sqlite workflows&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;The new &lt;code&gt;workflows&lt;/code&gt; command can be run against one or more repositories and will fetch their GitHub Actions workflow YAML files, parse them and use them to populate new database tables called &lt;code&gt;workflows&lt;/code&gt;, &lt;code&gt;jobs&lt;/code&gt; and &lt;code&gt;steps&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;If you run workflows across a bunch of different repositories this means you can analyze your workflow usage using SQL!&lt;/p&gt;
&lt;p&gt;My &lt;a href="https://github-to-sqlite.dogsheep.net/github"&gt;github-to-sqlite demo&lt;/a&gt; now includes workflows from my core Datasette and Dogsheep projects. Some example queries:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;My &lt;a href="https://github-to-sqlite.dogsheep.net/github/steps?_facet=uses"&gt;most commonly used action steps&lt;/a&gt; - the top two are &lt;code&gt;actions/checkout@v2&lt;/code&gt; and &lt;code&gt;actions/cache@v2&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;All &lt;a href="https://github-to-sqlite.dogsheep.net/github/steps?_facet=uses&amp;amp;uses=actions%2Fcache%40v1"&gt;steps using actions/cache@v1&lt;/a&gt;, which need to be upgraded to &lt;code&gt;v2&lt;/code&gt; (this link will likely soon stop returning any results as I apply those updates).&lt;/li&gt;
&lt;li&gt;My workflows that &lt;a href="https://github-to-sqlite.dogsheep.net/github/workflows?_sort=id&amp;amp;on__contains=workflow_dispatch"&gt;use the workflow_dispatch&lt;/a&gt; trigger.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The &lt;a href="https://github.com/dogsheep/github-to-sqlite/blob/main/github_to_sqlite/utils.py#L787-L858"&gt;implementation&lt;/a&gt; is a good example of my &lt;a href="https://sqlite-utils.readthedocs.io/en/stable/python-api.html"&gt;sqlite-utils&lt;/a&gt; library in action - I pass the extracted YAML data straight to the &lt;code&gt;.insert(data, alter=True)&lt;/code&gt; method which creates the correct table schema automatically, altering it if there are any missing columns.&lt;/p&gt;
&lt;h4&gt;datasette-graphql 1.3&lt;/h4&gt;
&lt;p&gt;&lt;a href="https://github.com/simonw/datasette-graphql/releases/tag/1.3"&gt;datasette-graphql 1.3&lt;/a&gt; has one tiny feature which I find enormously satisfying.&lt;/p&gt;
&lt;p&gt;The plugin provides a GraphQL interface to any table in Datasette. The latest versions use the new "table actions" menu (accessible through a cog icon in the page heading) to provide a link to an example query for that table.&lt;/p&gt;
&lt;p&gt;I added the example queries in &lt;a href="https://github.com/simonw/datasette-graphql/releases/tag/1.3"&gt;1.2&lt;/a&gt;, but in 1.3 the example has been expanded to include examples of foreign key references. For a table like this one of &lt;a href="https://github-to-sqlite.dogsheep.net/github/commits"&gt;GitHub commits&lt;/a&gt; the example query now looks &lt;a href="https://github-to-sqlite.dogsheep.net/graphql?query=%7B%0A%20%20commits%20%7B%0A%20%20%20%20totalCount%0A%20%20%20%20pageInfo%20%7B%0A%20%20%20%20%20%20hasNextPage%0A%20%20%20%20%20%20endCursor%0A%20%20%20%20%7D%0A%20%20%20%20nodes%20%7B%0A%20%20%20%20%20%20sha%0A%20%20%20%20%20%20message%0A%20%20%20%20%20%20author_date%0A%20%20%20%20%20%20committer_date%0A%20%20%20%20%20%20raw_author%20%7B%0A%20%20%20%20%20%20%20%20id%0A%20%20%20%20%20%20%20%20name%0A%20%20%20%20%20%20%7D%0A%20%20%20%20%20%20raw_committer%20%7B%0A%20%20%20%20%20%20%20%20id%0A%20%20%20%20%20%20%20%20name%0A%20%20%20%20%20%20%7D%0A%20%20%20%20%20%20repo%20%7B%0A%20%20%20%20%20%20%20%20id%0A%20%20%20%20%20%20%20%20name%0A%20%20%20%20%20%20%7D%0A%20%20%20%20%20%20author%20%7B%0A%20%20%20%20%20%20%20%20id%0A%20%20%20%20%20%20%20%20login%0A%20%20%20%20%20%20%7D%0A%20%20%20%20%20%20committer%20%7B%0A%20%20%20%20%20%20%20%20id%0A%20%20%20%20%20%20%20%20login%0A%20%20%20%20%20%20%7D%0A%20%20%20%20%7D%0A%20%20%7D%0A%7D"&gt;like this&lt;/a&gt;:&lt;/p&gt;
&lt;div class="highlight highlight-source-graphql"&gt;&lt;pre&gt;{
  &lt;span class="pl-v"&gt;commits&lt;/span&gt; {
    &lt;span class="pl-v"&gt;totalCount&lt;/span&gt;
    &lt;span class="pl-v"&gt;pageInfo&lt;/span&gt; {
      &lt;span class="pl-v"&gt;hasNextPage&lt;/span&gt;
      &lt;span class="pl-v"&gt;endCursor&lt;/span&gt;
    }
    &lt;span class="pl-v"&gt;nodes&lt;/span&gt; {
      &lt;span class="pl-v"&gt;sha&lt;/span&gt;
      &lt;span class="pl-v"&gt;message&lt;/span&gt;
      &lt;span class="pl-v"&gt;author_date&lt;/span&gt;
      &lt;span class="pl-v"&gt;committer_date&lt;/span&gt;
      &lt;span class="pl-v"&gt;raw_author&lt;/span&gt; {
        &lt;span class="pl-v"&gt;id&lt;/span&gt;
        &lt;span class="pl-v"&gt;name&lt;/span&gt;
      }
      &lt;span class="pl-v"&gt;raw_committer&lt;/span&gt; {
        &lt;span class="pl-v"&gt;id&lt;/span&gt;
        &lt;span class="pl-v"&gt;name&lt;/span&gt;
      }
      &lt;span class="pl-v"&gt;repo&lt;/span&gt; {
        &lt;span class="pl-v"&gt;id&lt;/span&gt;
        &lt;span class="pl-v"&gt;name&lt;/span&gt;
      }
      &lt;span class="pl-v"&gt;author&lt;/span&gt; {
        &lt;span class="pl-v"&gt;id&lt;/span&gt;
        &lt;span class="pl-v"&gt;login&lt;/span&gt;
      }
      &lt;span class="pl-v"&gt;committer&lt;/span&gt; {
        &lt;span class="pl-v"&gt;id&lt;/span&gt;
        &lt;span class="pl-v"&gt;login&lt;/span&gt;
      }
    }
  }
}&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The new foreign key references include the repo, author and committer fields. The example query now demonstrates the most interesting feature of &lt;code&gt;datasette-graphql&lt;/code&gt; - its ability to automatically &lt;a href="https://github.com/simonw/datasette-graphql#accessing-nested-objects"&gt;convert foreign key relationships&lt;/a&gt; in your database into nested GraphQL fields.&lt;/p&gt;
&lt;p&gt;It's a small change, but it makes me really happy.&lt;/p&gt;
&lt;h4&gt;Datasette 0.52&lt;/h4&gt;
&lt;p&gt;I shipped &lt;a href=""&gt;Datasette 0.52&lt;/a&gt;, a relatively minor release which mainly kicked off an effort to rebrand "configuration" as "settings".&lt;/p&gt;
&lt;p&gt;I'm doing this as part of my effort to fix Datasette's "metadata" concept. The &lt;code&gt;metadata.json&lt;/code&gt; file started out as a way to add metadata - title, description, license and source information. Over time the file expanded to cover things like default facet displays and sort orders... and then when plugins came along it grew to cover plugin configuration as well.&lt;/p&gt;
&lt;p&gt;This is really confusing. Editing &lt;code&gt;metadata.json&lt;/code&gt; to configure a plugin doesn't make a great deal of sense.&lt;/p&gt;
&lt;p&gt;For Datasette 1.0 I want to clean this up. I'm planning on splitting metadata and configuration into separate mechanisms.&lt;/p&gt;
&lt;p&gt;There's just one problem: Datasette already has a "configuration" concept in the form of the &lt;code&gt;--config&lt;/code&gt; command-line option which can be used to set some very fundamental options for the Datasette server - things like the SQL time limit and the maximum allowed CSV download size.&lt;/p&gt;
&lt;p&gt;I want to call plugin configuration settings "configuration", so I've renamed &lt;code&gt;--config&lt;/code&gt; to &lt;code&gt;--setting&lt;/code&gt; - see the new &lt;a href="https://docs.datasette.io/en/stable/settings.html"&gt;settings documentation&lt;/a&gt; for details.&lt;/p&gt;
&lt;p&gt;This also gave me the chance to clean up a weird design decision. Datasette's configuration options looked like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;datasette data.db --config sql_time_limit_ms:1000
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The new &lt;code&gt;--setting&lt;/code&gt; replacement instead looks like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;datasette data.db --setting sql_time_limit_ms 1000
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Note the lack of a colon here - having an option take two arguments is a perfectly cromulent way of using &lt;a href="https://click.palletsprojects.com/en/7.x/"&gt;Click&lt;/a&gt;, but it's one I wasn't aware of when I first released Datasette.&lt;/p&gt;
&lt;p&gt;The old &lt;code&gt;--config&lt;/code&gt; mechanism continues to work, but it now displays a deprecation warning - it will be removed in Datasette 1.0.&lt;/p&gt;
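&lt;p&gt;Here's a minimal sketch of the two option shapes side by side. It uses Python's standard library &lt;code&gt;argparse&lt;/code&gt; rather than Click so that it runs without extra dependencies, and the &lt;code&gt;parse_settings()&lt;/code&gt; function is a hypothetical illustration, not how Datasette implements it:&lt;/p&gt;

```python
import argparse
import warnings


def parse_settings(argv):
    """Collect settings from both the old and the new option style."""
    parser = argparse.ArgumentParser(prog="demo")
    # New style: one option that consumes two arguments, NAME then VALUE.
    parser.add_argument(
        "--setting",
        nargs=2,
        action="append",
        default=[],
        metavar=("NAME", "VALUE"),
    )
    # Old style: a single "name:value" argument, kept for compatibility.
    parser.add_argument("--config", action="append", default=[])
    args = parser.parse_args(argv)

    settings = {}
    for item in args.config:
        warnings.warn("--config is deprecated, use --setting", DeprecationWarning)
        # Split on the first colon only, so values may contain colons.
        name, _, value = item.partition(":")
        settings[name] = value
    for name, value in args.setting:
        settings[name] = value
    return settings


print(parse_settings(["--setting", "sql_time_limit_ms", "1000"]))
```

&lt;p&gt;Both forms end up in the same dictionary, which is roughly what lets the old option keep working while it warns.&lt;/p&gt;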
&lt;h4&gt;Datasette dot-releases&lt;/h4&gt;
&lt;p&gt;0.52 has already had more dot-releases than any other version of Datasette. These are all pure bug fixes, mostly for obscure bugs that are unlikely to have affected anyone. To summarize the &lt;a href="https://docs.datasette.io/en/stable/changelog.html#v0-52-4"&gt;release notes&lt;/a&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://docs.datasette.io/en/stable/changelog.html#v0-52-1"&gt;0.52.1&lt;/a&gt; updated the &lt;a href="https://docs.datasette.io/en/stable/testing_plugins.html#testing-plugins"&gt;testing plugins documentation&lt;/a&gt; to promote &lt;code&gt;datasette.client&lt;/code&gt;, fixed a bug with the display of compound foreign keys and improved the locations searched by the &lt;code&gt;datasette --load-module=spatialite&lt;/code&gt; shortcut.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://docs.datasette.io/en/stable/changelog.html#v0-52-2"&gt;0.52.2&lt;/a&gt; fixed support for the generated columns feature added in &lt;a href="https://www.sqlite.org/releaselog/3_31_1.html"&gt;SQLite 3.31.0&lt;/a&gt;, fixed a 500 error on &lt;code&gt;OPTIONS&lt;/code&gt; requests, added support for &amp;gt;32MB database file downloads on Cloud Run and shipped a CSS fix to the cog menus contributed by Abdussamet Koçak.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://docs.datasette.io/en/stable/changelog.html#v0-52-3"&gt;0.52.3&lt;/a&gt; fixed a fun bug with Datasette installed on Amazon Linux running on ARM where static assets would 404. I eventually &lt;a href="https://github.com/simonw/datasette/issues/1124"&gt;tracked that down&lt;/a&gt; to an unexpected symlink in the site-packages directory.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://docs.datasette.io/en/stable/changelog.html#v0-52-4"&gt;0.52.4&lt;/a&gt; now writes errors logged by Datasette to &lt;code&gt;stderr&lt;/code&gt;, not &lt;code&gt;stdout&lt;/code&gt;. It also fixes a startup error on Windows, another contribution from Abdussamet Koçak.&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;Broken Dogsheep&lt;/h4&gt;
&lt;p&gt;My personal &lt;a href="https://simonwillison.net/2020/Nov/14/personal-data-warehouses/"&gt;Dogsheep&lt;/a&gt; broke this week. I've been running it on an Amazon Lightsail instance, and this week I learned that Lightsail has a &lt;a href="https://lightsail.aws.amazon.com/ls/docs/en_us/articles/amazon-lightsail-viewing-instance-burst-capacity"&gt;baseline CPU&lt;/a&gt; mechanism which grants your instance burst capacity but shuts it down if it exceeds that capacity too often!&lt;/p&gt;
&lt;p&gt;So I'm moving it to a DigitalOcean droplet which won't do that, and trying to figure out enough &lt;a href="https://docs.ansible.com/ansible/latest/index.html"&gt;Ansible&lt;/a&gt; to completely automate the process.&lt;/p&gt;
&lt;p&gt;My ideal server is one that is configured entirely from files in source control, and updates itself by pulling new configuration from that repository. I plan to use &lt;a href="https://docs.ansible.com/ansible/latest/cli/ansible-pull.html"&gt;ansible-pull&lt;/a&gt; for this, once I've put together the necessary playbooks.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/graphql"&gt;graphql&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/dogsheep"&gt;dogsheep&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/weeknotes"&gt;weeknotes&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ripgrep"&gt;ripgrep&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="projects"/><category term="graphql"/><category term="datasette"/><category term="dogsheep"/><category term="weeknotes"/><category term="ripgrep"/></entry><entry><title>Personal Data Warehouses: Reclaiming Your Data</title><link href="https://simonwillison.net/2020/Nov/14/personal-data-warehouses/#atom-tag" rel="alternate"/><published>2020-11-14T03:53:55+00:00</published><updated>2020-11-14T03:53:55+00:00</updated><id>https://simonwillison.net/2020/Nov/14/personal-data-warehouses/#atom-tag</id><summary type="html">
    &lt;p&gt;I gave a talk yesterday about personal data warehouses for &lt;a href="https://octo.github.com/speakerseries/SimonWillison/"&gt;GitHub's OCTO Speaker Series&lt;/a&gt;, focusing on my &lt;a href="https://datasette.io/"&gt;Datasette&lt;/a&gt; and &lt;a href="https://dogsheep.github.io/"&gt;Dogsheep&lt;/a&gt; projects. The video of the talk is now available, and I'm presenting that here along with &lt;a href="/2020/Nov/14/personal-data-warehouses/"&gt;an annotated summary of the talk&lt;/a&gt;, including links to demos and further information.&lt;/p&gt;

&lt;p&gt;There's a short technical glitch with the screen sharing in the first couple of minutes of the talk - I've added screenshots to the notes which show what you would have seen if my screen had been correctly shared.&lt;/p&gt;

    &lt;div style="padding: 1em; text-align: center; position: static; top: 1em; background: black; padding-bottom: 1em; box-shadow: 0px 0px 4px 2px #000000;"&gt;
        &lt;iframe src="https://www.youtube-nocookie.com/embed/l1EFThsAFgs" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen="" width="560" height="315" frameborder="0"&gt; &lt;/iframe&gt;
        &lt;div class="hide-on-homepage" style="text-align: left"&gt;&lt;a style="color: white" href="#" onclick="document.getElementsByTagName('iframe')[0].parentNode.style.position=this.innerHTML == 'Unstick' ? 'static' : 'sticky'; this.innerHTML = this.innerHTML == 'Unstick' ? 'Stick while scrolling' : 'Unstick'; return false;"&gt;Stick while scrolling&lt;/a&gt;&lt;/div&gt;
    &lt;/div&gt;
&lt;!-- cutoff --&gt;
    &lt;div class="slide"&gt;
        &lt;img src="https://static.simonwillison.net/static/2020/octo-github/shot-1.jpg" alt="Simon Willison - FOSS Developer and Consultant, Python, Django, Datasette" /&gt;
        &lt;p&gt;I'm going to be talking about personal data warehouses, what they are, why you want one, how to build them and some of the interesting things you can do once you've set one up.&lt;/p&gt;
        &lt;p&gt;I'm going to start with a demo.&lt;/p&gt;
    &lt;/div&gt;
    &lt;div class="slide"&gt;
        &lt;img src="https://static.simonwillison.net/static/2020/octo-github/cleo-golden-gate-bridge.jpg" alt="Cleo wearing a very fine Golden Gate Bridge costume with a prize rosette attached to it" /&gt;
        &lt;p&gt;This is my dog, Cleo - when she won first place in a dog costume competition here, dressed as the Golden Gate Bridge!&lt;/p&gt;
    &lt;/div&gt;
    &lt;div class="slide"&gt;
        &lt;img src="https://static.simonwillison.net/static/2020/octo-github/all-checkins.png" alt="All of my checkins on a map" /&gt;
        &lt;p&gt;So the question I want to answer is: How much of a San Francisco hipster is Cleo?&lt;/p&gt;
        &lt;p&gt;I can answer it using my personal data warehouse. &lt;/p&gt;
        &lt;p&gt;I have a database of ten years' worth of my checkins on Foursquare Swarm - generated using my &lt;a href="https://github.com/dogsheep/swarm-to-sqlite"&gt;swarm-to-sqlite&lt;/a&gt; tool. Every time I check in somewhere with Cleo I use the Wolf emoji in the checkin message.&lt;/p&gt;
    &lt;/div&gt;
    &lt;div class="slide"&gt;
        &lt;img src="https://static.simonwillison.net/static/2020/octo-github/dog-checkins.png" alt="All of Cleo's checkins on a map" /&gt;
        &lt;p&gt;I can filter for just checkins where the checkin message includes the wolf emoji.&lt;/p&gt;
        &lt;p&gt;Which means I can see just her checkins - all 280 of them.&lt;/p&gt;
    &lt;/div&gt;
    &lt;div class="slide"&gt;
        &lt;img src="https://static.simonwillison.net/static/2020/octo-github/dog-categories.png" alt="Cleo's top categories" /&gt;
        &lt;p&gt;If I facet by venue category, I can see she's checked in at 57 parks, 32 dog runs, 19 coffee shops and 12 organic groceries.&lt;/p&gt;
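        &lt;p&gt;As a rough sketch of what that filter and facet amount to in SQL, here's a self-contained example using Python's &lt;code&gt;sqlite3&lt;/code&gt; module. The &lt;code&gt;checkins&lt;/code&gt; table, its &lt;code&gt;shout&lt;/code&gt; and &lt;code&gt;venue_category&lt;/code&gt; columns and the sample rows are all assumptions made up for illustration, not the real schema or real checkin data:&lt;/p&gt;

```python
import sqlite3

WOLF = "\N{WOLF FACE}"

conn = sqlite3.connect(":memory:")
# Hypothetical, simplified table: one row per checkin.
conn.execute(
    "create table checkins (id integer primary key, shout text, venue_category text)"
)
conn.executemany(
    "insert into checkins (shout, venue_category) values (?, ?)",
    [
        ("Morning walk " + WOLF, "Park"),
        ("Fetch " + WOLF, "Park"),
        ("Flat white " + WOLF, "Coffee Shop"),
        ("Team lunch", "Restaurant"),
    ],
)


def facet_by_category(conn, emoji):
    """Count checkins whose message contains emoji, grouped by venue category."""
    sql = """
        select venue_category, count(*) as n
        from checkins
        where instr(shout, ?) > 0
        group by venue_category
        order by n desc, venue_category
    """
    return conn.execute(sql, [emoji]).fetchall()


print(facet_by_category(conn, WOLF))
```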
    &lt;/div&gt;
    &lt;div class="slide"&gt;
        &lt;img src="https://static.simonwillison.net/static/2020/octo-github/shot-2.jpg" alt="A map of coffee shops that Cleo has been to" /&gt;
        &lt;p&gt;Then I can facet by venue category and filter down to just her 19 checkins at coffee shops.&lt;/p&gt;
        &lt;p&gt;Turns out she's a Blue Bottle girl at heart.&lt;/p&gt;
        &lt;p&gt;Being able to build a map of the coffee shops that your dog likes is obviously a very valuable reason to build your own personal data warehouse.&lt;/p&gt;
    &lt;/div&gt;
    &lt;div class="slide"&gt;
        &lt;img src="https://static.simonwillison.net/static/2020/octo-github/shot-4.jpg" alt="The Datasette website" /&gt;
        &lt;p&gt;Let's take a step back and talk about how this demo works.&lt;/p&gt;
        &lt;p&gt;The key to this demo is this web application I'm running called &lt;a href="https://datasette.io/"&gt;Datasette&lt;/a&gt;. I've been working on this project for three years now, and the goal is to make it as easy and cheap as possible to explore data in all sorts of shapes and sizes.&lt;/p&gt;
    &lt;/div&gt;
    &lt;div class="slide"&gt;
        &lt;img src="https://static.simonwillison.net/static/2020/octo-github/shot-5.jpg" alt="A screenshot of the Guardian Data Blog" /&gt;
        &lt;p&gt;Ten years ago I was working for the Guardian newspaper in London. One of the things I realized when I joined the organization is that newspapers collect enormous amounts of data. Any time they publish a chart or map in the newspaper someone has to collect the underlying information.&lt;/p&gt;
        &lt;p&gt;There was a journalist there called Simon Rogers who was a wizard at collecting any data you could think to ask for. He knew exactly where to get it from, and had collected a huge number of brilliant spreadsheets on his desktop computer.&lt;/p&gt;
        &lt;p&gt;We decided we wanted to publish the data behind the stories. We started something called &lt;a href="https://www.theguardian.com/news/datablog/2011/jan/27/data-store-office-for-national-statistics"&gt;the Data Blog&lt;/a&gt;, and aimed to accompany our stories with the raw data behind them.&lt;/p&gt;
    &lt;/div&gt;
    &lt;div class="slide"&gt;
        &lt;img src="https://static.simonwillison.net/static/2020/octo-github/shot-6.jpg" alt="A Google Sheet containing US public debt figures since 2001" /&gt;
        &lt;p&gt;We ended up using Google Sheets to publish the data. It worked, but I always felt like there should be a better way to publish this kind of structured data in a way that was as useful and flexible as possible for our audience.&lt;/p&gt;
    &lt;/div&gt;
    &lt;div class="slide"&gt;
        &lt;img src="https://static.simonwillison.net/static/2020/octo-github/shot-7.jpg" alt="Serverless hosting? Scale to Zero. ... but databases cost extra!" /&gt;
        &lt;p&gt;Fast forward to 2017, when I was looking into this new thing called "serverless" hosting - in particular one called Zeit Now, which has since rebranded as &lt;a href="https://vercel.com/"&gt;Vercel&lt;/a&gt;.&lt;/p&gt;
        &lt;p&gt;My favourite aspect of Serverless is "Scale to zero" - the idea that you only pay for hosting when your project is receiving traffic.&lt;/p&gt;
        &lt;p&gt;If you're like me, and you love building side-projects but you don't like paying $5/month for them for the rest of your life, this is perfect.&lt;/p&gt;
        &lt;p&gt;The catch is that serverless providers tend to charge you extra for databases, or require you to buy a hosted database from another provider.&lt;/p&gt;
        &lt;p&gt;But what if your database doesn't change? Can you bundle your database in the same container as your code?&lt;/p&gt;
        &lt;p&gt;This was the initial inspiration behind creating Datasette.&lt;/p&gt;
    &lt;/div&gt;
    &lt;div class="slide"&gt;
        &lt;img src="https://static.simonwillison.net/static/2020/octo-github/shot-8.jpg" alt="A Global Database of Power Plants on the World Resources Institute website" /&gt;
        &lt;p&gt;Here's another demo. The &lt;a href="https://www.wri.org/publication/global-power-plant-database"&gt;World Resources Institute&lt;/a&gt; maintain a CSV file of every power plant in the world.&lt;/p&gt;
    &lt;/div&gt;
    &lt;div class="slide"&gt;
        &lt;img src="https://static.simonwillison.net/static/2020/octo-github/shot-9.jpg" alt="A GitHub repository containing the Global Power Plant Database" /&gt;
        &lt;p&gt;Like many groups, they publish &lt;a href="https://github.com/wri/global-power-plant-database"&gt;that data&lt;/a&gt; on GitHub.&lt;/p&gt;
    &lt;/div&gt;
    &lt;div class="slide"&gt;
        &lt;img src="https://static.simonwillison.net/static/2020/octo-github/shot-10.jpg" alt="A Datasette instance showing power plants faceted by country and primary fuel" /&gt;
        &lt;p&gt;I have &lt;a href="https://github.com/simonw/global-power-plants-datasette/blob/main/.github/workflows/scheduled.yml"&gt;a script&lt;/a&gt; that grabs their most recent data and publishes it using Datasette.&lt;/p&gt;
        &lt;p&gt;&lt;a href="https://global-power-plants.datasettes.com/global-power-plants/global-power-plants"&gt;Here are the contents of their CSV file&lt;/a&gt;, published using Datasette.&lt;/p&gt;
        &lt;p&gt;Datasette supports plugins. You've already seen this plugin in my demo of Cleo's coffee shops - it's called &lt;a href="https://github.com/simonw/datasette-cluster-map"&gt;datasette-cluster-map&lt;/a&gt; and it works by looking for tables with a latitude and longitude column and plotting the data on a map.&lt;/p&gt;
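        &lt;p&gt;The column detection idea can be sketched in a few lines of Python. This is only an illustration of the approach described above, assuming the columns are named exactly &lt;code&gt;latitude&lt;/code&gt; and &lt;code&gt;longitude&lt;/code&gt;; the &lt;code&gt;tables_with_coordinates()&lt;/code&gt; helper is hypothetical and is not the plugin's actual code:&lt;/p&gt;

```python
import sqlite3


def tables_with_coordinates(conn):
    """Return names of tables that have both a latitude and a longitude column."""
    matches = []
    tables = conn.execute(
        "select name from sqlite_master where type = 'table' order by name"
    ).fetchall()
    for (table,) in tables:
        # PRAGMA table_info returns one row per column; index 1 is the column name.
        rows = conn.execute(f'pragma table_info("{table}")').fetchall()
        columns = {row[1].lower() for row in rows}
        if {"latitude", "longitude"}.issubset(columns):
            matches.append(table)
    return matches


conn = sqlite3.connect(":memory:")
conn.execute("create table plants (name text, latitude real, longitude real)")
conn.execute("create table owners (name text)")
print(tables_with_coordinates(conn))
```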
    &lt;/div&gt;
    &lt;div class="slide"&gt;
        &lt;img src="https://static.simonwillison.net/static/2020/octo-github/shot-11.jpg" alt="A zoomed in map showing two power plants in Antarctica" /&gt;
        &lt;p&gt;Straight away looking at this data you notice that there's a couple of power plants down here in Antarctica. This is McMurdo station, and it has a 6.6MW oil generator.&lt;/p&gt;
        &lt;p&gt;And oh look, there's a wind farm down there too on Ross Island knocking out 1MW of electricity.&lt;/p&gt;
    &lt;/div&gt;
    &lt;div class="slide"&gt;
        &lt;img src="https://static.simonwillison.net/static/2020/octo-github/shot-12.jpg" alt="A map of all of the nuclear power plants in France" /&gt;
        &lt;p&gt;But this is also a demonstration of faceting. I can slice down to just the &lt;a href="https://global-power-plants.datasettes.com/global-power-plants/global-power-plants?primary_fuel=Nuclear&amp;amp;country_long=France"&gt;nuclear power plants in France&lt;/a&gt; and see those on a map.&lt;/p&gt;
    &lt;/div&gt;
    &lt;div class="slide"&gt;
        &lt;img src="https://static.simonwillison.net/static/2020/octo-github/shot-13.jpg" alt="a screen full of JSON" /&gt;
        &lt;p&gt;And anything I can see in the interface, I can get out as JSON. Here's &lt;a href="https://global-power-plants.datasettes.com/global-power-plants/global-power-plants.json?primary_fuel=Nuclear&amp;amp;country_long=France"&gt;a JSON file&lt;/a&gt; showing all of those nuclear power plants in France.&lt;/p&gt;
    &lt;/div&gt;
    &lt;div class="slide"&gt;
        &lt;img src="https://static.simonwillison.net/static/2020/octo-github/shot-14.jpg" alt="A screen full of CSV" /&gt;
        &lt;p&gt;And here's &lt;a href="https://global-power-plants.datasettes.com/global-power-plants/global-power-plants.csv?primary_fuel=Nuclear&amp;amp;country_long=France"&gt;a CSV export&lt;/a&gt; which I can use to pull the data into Excel or other CSV-compatible software.&lt;/p&gt;
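        &lt;p&gt;The URL pattern in those two links is the table URL, then a format extension, then the same filters as the HTML page. A small sketch of building them, where &lt;code&gt;export_url()&lt;/code&gt; is a hypothetical helper written for this example:&lt;/p&gt;

```python
from urllib.parse import urlencode

TABLE_URL = (
    "https://global-power-plants.datasettes.com"
    "/global-power-plants/global-power-plants"
)


def export_url(table_url, fmt, filters):
    """Build the URL for a filtered table in the given format ("json" or "csv")."""
    return f"{table_url}.{fmt}?{urlencode(filters)}"


filters = {"primary_fuel": "Nuclear", "country_long": "France"}
print(export_url(TABLE_URL, "json", filters))
print(export_url(TABLE_URL, "csv", filters))
```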
    &lt;/div&gt;
    &lt;div class="slide"&gt;
        &lt;img src="https://static.simonwillison.net/static/2020/octo-github/shot-15.jpg" alt="An interface for editing a SQL query" /&gt;
        &lt;p&gt;I can click &lt;a href="https://global-power-plants.datasettes.com/global-power-plants?sql=select+rowid%2C+country%2C+country_long%2C+name%2C+gppd_idnr%2C+capacity_mw%2C+latitude%2C+longitude%2C+primary_fuel%2C+other_fuel1%2C+other_fuel2%2C+other_fuel3%2C+commissioning_year%2C+owner%2C+source%2C+url%2C+geolocation_source%2C+wepp_id%2C+year_of_capacity_data%2C+generation_gwh_2013%2C+generation_gwh_2014%2C+generation_gwh_2015%2C+generation_gwh_2016%2C+generation_gwh_2017%2C+generation_data_source%2C+estimated_generation_gwh+from+%5Bglobal-power-plants%5D+where+%22country_long%22+%3D+%3Ap0+and+%22primary_fuel%22+%3D+%3Ap1+order+by+rowid+limit+101&amp;amp;p0=France&amp;amp;p1=Nuclear"&gt;"view and edit SQL"&lt;/a&gt; to get back the SQL query that was used to generate the page - and I can then edit and re-execute that query.&lt;/p&gt;
        &lt;p&gt;I can get those custom results back as CSV or JSON as well!&lt;/p&gt;
    &lt;/div&gt;
    &lt;div class="slide"&gt;
        &lt;img src="https://static.simonwillison.net/static/2020/octo-github/shot-16.jpg" alt="Results of a custom SQL query" /&gt;
        &lt;p&gt;In most web applications this would be seen as a terrifying security hole - it's a SQL injection attack, as a documented feature!&lt;/p&gt;
        &lt;p&gt;A couple of reasons this isn't a problem here:&lt;/p&gt;
        &lt;p&gt;Firstly, this is set up as a read-only database: INSERT and UPDATE statements that would modify it are not allowed. There's a one second time limit on queries as well.&lt;/p&gt;
        &lt;p&gt;Secondly, everything in this database is designed to be published. There are no password hashes or private user data that could be exposed here.&lt;/p&gt;
        &lt;p&gt;This also means we have a JSON API that lets JavaScript execute SQL queries against a backend! This turns out to be really useful for rapid prototyping.&lt;/p&gt;
    &lt;/div&gt;
    &lt;div class="slide"&gt;
        &lt;img src="https://static.simonwillison.net/static/2020/octo-github/shot-17.jpg" alt="The SQLite home page" /&gt;
        &lt;p&gt;It's worth talking about the secret sauce that makes this all possible.&lt;/p&gt;
        &lt;p&gt;This is all built on top of &lt;a href="https://www.sqlite.org/"&gt;SQLite&lt;/a&gt;. Everyone watching this talk uses SQLite every day, even if you don't know it.&lt;/p&gt;
        &lt;p&gt;Most iPhone apps use SQLite, many desktop apps do, it's even running inside my Apple Watch.&lt;/p&gt;
        &lt;p&gt;One of my favourite features is that a SQLite database is a single file on disk. This makes it easy to copy and send around, and also means I can bundle data up in that single file, include it in a Docker image and deploy it to serverless hosts to serve it on the internet.&lt;/p&gt;
    &lt;/div&gt;
    &lt;div class="slide"&gt;
        &lt;img src="https://static.simonwillison.net/static/2020/octo-github/shot-19.jpg" alt="A Datasette map of power outages" /&gt;
        &lt;p&gt;Here's another demo that helps show how GitHub fits into all of this.&lt;/p&gt;
        &lt;p&gt;Last year PG&amp;amp;E - the power company that covers much of California - turned off the power to large swathes of the state.&lt;/p&gt;
        &lt;p&gt;I got lucky: six months earlier I had started scraping &lt;a href="https://m.pge.com/#outages"&gt;their outage map&lt;/a&gt; and recording the history to a GitHub repository.&lt;/p&gt;
    &lt;/div&gt;
    &lt;div class="slide"&gt;
        &lt;img src="https://static.simonwillison.net/static/2020/octo-github/shot-20.jpg" alt="A list of recent commits to the pge-outages GitHub repository, each one with a commit message showing the number of incidents added, removed or updated" /&gt;
        &lt;p&gt;&lt;a href="https://github.com/simonw/pge-outages"&gt;simonw/pge-outages&lt;/a&gt; is a git repository with 34,000 commits tracking the history of outages that PG&amp;amp;E had published on their outage map.&lt;/p&gt;
        &lt;p&gt;You can see that two minutes ago they added 35 new outages.&lt;/p&gt;
        &lt;p&gt;I'm using this data to publish a Datasette instance with details of their historic outages. Here's a page &lt;a href="https://pge-outages.simonwillison.net/pge-outages/most_recent_snapshot?_sort_desc=estCustAffected"&gt;showing their current outages&lt;/a&gt; ordered by the most customers affected by the outage.&lt;/p&gt;
        &lt;p&gt;Read &lt;a href="https://simonwillison.net/2019/Oct/10/pge-outages/"&gt;Tracking PG&amp;amp;E outages by scraping to a git repo&lt;/a&gt; for more details on this project.&lt;/p&gt;
    &lt;/div&gt;
    &lt;div class="slide"&gt;
        &lt;img src="https://static.simonwillison.net/static/2020/octo-github/shot-21.jpg" alt="A screenshot of my blog entry about Git scraping" /&gt;
        &lt;p&gt;I recently decided to give this technique a name. I'm calling it &lt;strong&gt;Git scraping&lt;/strong&gt; - the idea is to take any data source on the web that represents a point-in-time and commit it to a git repository that tells the story of the history of that particular thing.&lt;/p&gt;
        &lt;p&gt;Here's my article describing the pattern in more detail: &lt;a href="https://simonwillison.net/2020/Oct/9/git-scraping/"&gt;Git scraping: track changes over time by scraping to a Git repository&lt;/a&gt;.&lt;/p&gt;
    &lt;/div&gt;
    &lt;div class="slide"&gt;
        &lt;img src="https://static.simonwillison.net/static/2020/octo-github/shot-22.jpg" alt="A screenshot of the NYT scraped election results page" /&gt;
        &lt;p&gt;This technique really stood out just last week during the US election.&lt;/p&gt;
        &lt;p&gt;This is the &lt;a href="https://alex.github.io/nyt-2020-election-scraper/battleground-state-changes.html#"&gt;New York Times election scraper website&lt;/a&gt;, built by Alex Gaynor and a growing team of contributors. It scrapes the New York Times election results and uses the data over time to show how the results are trending.&lt;/p&gt;
    &lt;/div&gt;
    &lt;div class="slide"&gt;
        &lt;img src="https://static.simonwillison.net/static/2020/octo-github/shot-23.jpg" alt="The nyt-2020-election-scraper GitHub repository page" /&gt;
        &lt;p&gt;It uses a &lt;a href="https://github.com/alex/nyt-2020-election-scraper"&gt;GitHub Actions script&lt;/a&gt; that runs on a schedule, plus a really clever Python script that turns it into a useful web page.&lt;/p&gt;
        &lt;p&gt;You can find more examples of Git scraping under the &lt;a href="https://github.com/topics/git-scraping"&gt;git-scraping topic&lt;/a&gt; on GitHub.&lt;/p&gt;
    &lt;/div&gt;
    &lt;div class="slide"&gt;
        &lt;img src="https://static.simonwillison.net/static/2020/octo-github/shot-24.jpg" alt="A screenshot of the incident map on fire.ca.gov" /&gt;
        &lt;p&gt;I'm going to do a bit of live coding to show you how this stuff works.&lt;/p&gt;
        &lt;p&gt;This is the &lt;a href="https://www.fire.ca.gov/incidents/"&gt;incidents page&lt;/a&gt; from the state of California &lt;a href="https://www.fire.ca.gov/"&gt;CAL FIRE&lt;/a&gt; website.&lt;/p&gt;
        &lt;p&gt;Any time I see a map like this, my first instinct is to open up the browser developer tools and try to figure out how it works.&lt;/p&gt;
    &lt;/div&gt;
    &lt;div class="slide"&gt;
        &lt;img src="https://static.simonwillison.net/static/2020/octo-github/shot-25.jpg" alt="The incident map with an open developer tools network console showing XHR requests ordered by size, largest first" /&gt;
        &lt;p&gt;I open the network tab, refresh the page and then filter to just XHR requests.&lt;/p&gt;
        &lt;p&gt;A neat trick is to order by size - because inevitably the thing at the top of the list is the most interesting data on the page.&lt;/p&gt;
    &lt;/div&gt;
    &lt;div class="slide"&gt;
        &lt;img src="https://static.simonwillison.net/static/2020/octo-github/shot-26.jpg" alt="a JSON list of incidents" /&gt;
        &lt;p&gt;This appears to be &lt;a href="https://www.fire.ca.gov/umbraco/Api/IncidentApi/GetIncidents"&gt;a JSON file&lt;/a&gt; telling me about all of the current fires in the state of California!&lt;/p&gt;
        &lt;p&gt;(I set up &lt;a href="https://github.com/simonw/ca-fires-history"&gt;a Git scraper for this&lt;/a&gt; a while ago.)&lt;/p&gt;
        &lt;p&gt;Now I'm going to take this a step further and turn it into a Datasette instance.&lt;/p&gt;
    &lt;/div&gt;
    &lt;div class="slide"&gt;
        &lt;img src="https://static.simonwillison.net/static/2020/octo-github/shot-27.jpg" alt="The AllYearIncidents section of the JSON" /&gt;
        &lt;p&gt;It looks like the &lt;code&gt;AllYearIncidents&lt;/code&gt; key is the most interesting bit here.&lt;/p&gt;
    &lt;/div&gt;
    &lt;div class="slide"&gt;
        &lt;img src="https://static.simonwillison.net/static/2020/octo-github/shot-28.jpg" alt="A screenshot showing the output of curl" /&gt;
        &lt;p&gt;I'm going to use &lt;a href="https://curl.se/"&gt;curl&lt;/a&gt; to fetch that data, then pipe it through &lt;a href="https://stedolan.github.io/jq/"&gt;jq&lt;/a&gt; to filter for just that &lt;code&gt;AllYearIncidents&lt;/code&gt; array.&lt;/p&gt;
        &lt;pre&gt;&lt;code&gt;curl 'https://www.fire.ca.gov/umbraco/Api/IncidentApi/GetIncidents' \
        | jq .AllYearIncidents&lt;/code&gt;&lt;/pre&gt;
    &lt;/div&gt;
    &lt;div class="slide"&gt;
        &lt;img src="https://static.simonwillison.net/static/2020/octo-github/shot-29.jpg" alt="Pretty-printed JSON produced by piping to jq" /&gt;
        &lt;p&gt;Now I have a list of incidents for this year.&lt;/p&gt;
    &lt;/div&gt;
    &lt;div class="slide"&gt;
        &lt;img src="https://static.simonwillison.net/static/2020/octo-github/shot-30.jpg" alt="A terminal running a command that inserts the data into a SQLite database" /&gt;
        &lt;p&gt;Next I'm going to pipe it into a tool I've been building called &lt;a href="https://sqlite-utils.readthedocs.io/"&gt;sqlite-utils&lt;/a&gt; - it's a suite of tools for manipulating SQLite databases.&lt;/p&gt;
        &lt;p&gt;I'm going to use the "insert" command and insert the data into a &lt;code&gt;ca-fires.db&lt;/code&gt; in an &lt;code&gt;incidents&lt;/code&gt; table.&lt;/p&gt;
        &lt;pre&gt;&lt;code&gt;curl 'https://www.fire.ca.gov/umbraco/Api/IncidentApi/GetIncidents' \
        | jq .AllYearIncidents \
        | sqlite-utils insert ca-fires.db incidents -&lt;/code&gt;&lt;/pre&gt;
        &lt;p&gt;Now I've got a &lt;code&gt;ca-fires.db&lt;/code&gt; file. I can open that in Datasette:&lt;/p&gt;
        &lt;pre&gt;&lt;code&gt;datasette ca-fires.db -o&lt;/code&gt;&lt;/pre&gt;
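        &lt;p&gt;For anyone curious what that pipeline is doing, here's a rough Python equivalent of the &lt;code&gt;jq&lt;/code&gt; and insert steps using only the standard library. The sample records and their field names are invented for illustration, and &lt;code&gt;insert_records()&lt;/code&gt; is a hypothetical helper that is far simpler than what sqlite-utils really does:&lt;/p&gt;

```python
import json
import sqlite3

# Hypothetical sample shaped like the API response described above.
payload = json.loads("""
{
  "AllYearIncidents": [
    {"Name": "Example Fire", "County": "Riverside", "AcresBurned": 120},
    {"Name": "Sample Fire", "County": "Napa", "AcresBurned": 45}
  ]
}
""")


def insert_records(conn, table, records):
    """Create the table from the keys of the first record, then insert every record."""
    columns = list(records[0].keys())
    column_sql = ", ".join(f'"{name}"' for name in columns)
    placeholders = ", ".join("?" for _ in columns)
    conn.execute(f'create table if not exists "{table}" ({column_sql})')
    conn.executemany(
        f'insert into "{table}" ({column_sql}) values ({placeholders})',
        [[record.get(name) for name in columns] for record in records],
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
# Equivalent of `jq .AllYearIncidents`: keep only that key.
insert_records(conn, "incidents", payload["AllYearIncidents"])
print(
    conn.execute(
        "select County, count(*) from incidents group by County order by County"
    ).fetchall()
)
```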
    &lt;/div&gt;
    &lt;div class="slide"&gt;
        &lt;img src="https://static.simonwillison.net/static/2020/octo-github/shot-31.jpg" alt="A map of incidents, where one of them is located at the very bottom of the map in Antarctica" /&gt;
        &lt;p&gt;And here it is - a brand new database.&lt;/p&gt;
        &lt;p&gt;You can straight away see that one of the rows has a bad location, hence it appears in Antarctica.&lt;/p&gt;
        &lt;p&gt;But 258 of them look like they are in the right place.&lt;/p&gt;
    &lt;/div&gt;
    &lt;div class="slide"&gt;
        &lt;img src="https://static.simonwillison.net/static/2020/octo-github/shot-32.jpg" alt="A list of faceted counties, showing the count of fires for each one" /&gt;
        &lt;p&gt;I can also facet by county, to see which county had the most fires in 2020 - Riverside had 21.&lt;/p&gt;
    &lt;/div&gt;
    &lt;div class="slide"&gt;
        &lt;img src="https://static.simonwillison.net/static/2020/octo-github/shot-33.jpg" alt="datasette publish --help shows a list of hosting providers - cloudrun, heroku and vercel" /&gt;
        &lt;p&gt;I'm going to take this a step further and put it on the internet, using a command called &lt;a href="https://docs.datasette.io/en/stable/publish.html#datasette-publish"&gt;datasette publish&lt;/a&gt;.&lt;/p&gt;
        &lt;p&gt;Datasette publish supports a number of different hosting providers. I'm going to use &lt;a href="https://vercel.com/"&gt;Vercel&lt;/a&gt;.&lt;/p&gt;
    &lt;/div&gt;
    &lt;div class="slide"&gt;
        &lt;img src="https://static.simonwillison.net/static/2020/octo-github/shot-34.jpg" alt="A terminal running datasette publish" /&gt;
        &lt;p&gt;I'm going to tell it to publish that database to a project called "ca-fires" - and tell it to install the &lt;code&gt;datasette-cluster-map&lt;/code&gt; plugin.&lt;/p&gt;
        &lt;pre&gt;&lt;code&gt;datasette publish vercel ca-fires.db \
        --project ca-fires \
        --install datasette-cluster-map&lt;/code&gt;&lt;/pre&gt;
        &lt;p&gt;This then takes that database file, bundles it up with the Datasette application and deploys it to Vercel.&lt;/p&gt;
    &lt;/div&gt;
    &lt;div class="slide"&gt;
        &lt;img src="https://static.simonwillison.net/static/2020/octo-github/shot-35.jpg" alt="A page on Vercel.com showing a deployment in process" /&gt;
        &lt;p&gt;Vercel gives me a URL where I can watch the progress of the deploy.&lt;/p&gt;
        &lt;p&gt;The goal here is to have as few steps as possible between finding some interesting data, turning it into a SQLite database you can use with Datasette and then publishing it online.&lt;/p&gt;
    &lt;/div&gt;
    &lt;div class="slide"&gt;
        &lt;img src="https://static.simonwillison.net/static/2020/octo-github/shot-36.jpg" alt="The incident map, hosted online at ca-fires.vercel.com" /&gt;
        &lt;p&gt;And this here is that database I just created - available for anyone on the internet to visit and build against.&lt;/p&gt;
        &lt;p&gt;&lt;a href="https://ca-fires.vercel.app/ca-fires/incidents"&gt;https://ca-fires.vercel.app/ca-fires/incidents&lt;/a&gt;&lt;/p&gt;
    &lt;/div&gt;
    &lt;div class="slide"&gt;
        &lt;img src="https://static.simonwillison.net/static/2020/octo-github/shot-37.jpg" alt="Screenshot of Stephen Wolfram's essay Seeking the Productive Life: Some Details of My Personal Infrastructure" /&gt;
        &lt;p&gt;I've given you a whistle-stop tour of Datasette for the purposes of publishing data, and hopefully doing some serious data journalism.&lt;/p&gt;
        &lt;p&gt;So what does this all have to do with personal data warehouses?&lt;/p&gt;
        &lt;p&gt;Last year, I read this essay by Stephen Wolfram: &lt;a href="https://writings.stephenwolfram.com/2019/02/seeking-the-productive-life-some-details-of-my-personal-infrastructure/"&gt;Seeking the Productive Life: Some Details of My Personal Infrastructure&lt;/a&gt;. It's an incredible exploration of forty years of productivity hacks that Stephen Wolfram has applied to become the CEO of a 1,000 person company that works remotely. He's optimized every aspect of his professional and personal life.&lt;/p&gt;
    &lt;/div&gt;
    &lt;div class="slide"&gt;
        &lt;img src="https://static.simonwillison.net/static/2020/octo-github/shot-38.jpg" alt="A screenshot showing the section where he talks about his metasearcher" /&gt;
        &lt;p&gt;It's a lot. &lt;/p&gt;
        &lt;p&gt;But there was one part of this that really caught my eye. He talks about a thing he calls a "metasearcher" - a search engine on his personal homepage that searches all of his email, journals and files, everything he's ever done - all in one place.&lt;/p&gt;
        &lt;p&gt;And I thought to myself, I really want THAT. I love this idea of a personal portal to my own stuff.&lt;/p&gt;
        &lt;p&gt;And because it was inspired by Stephen Wolfram, but I was planning on building a much less impressive version, I decided to call it Dogsheep.&lt;/p&gt;
        &lt;p&gt;Wolf, ram. Dog, sheep.&lt;/p&gt;
        &lt;p&gt;I've been building this over the past year.&lt;/p&gt;
    &lt;/div&gt;
    &lt;div class="slide"&gt;
        &lt;img src="https://static.simonwillison.net/static/2020/octo-github/shot-39.jpg" alt="A screenshot of my personal Dogsheep homepage, showing a list of data sources and saved queries" /&gt;
        &lt;p&gt;So essentially this is my personal data warehouse. It pulls in my personal data from as many sources as I can find and gives me an interface to browse that data and run queries against it.&lt;/p&gt;
        &lt;p&gt;I've got data from Twitter, Apple HealthKit, GitHub, Swarm, Hacker News, Photos, a copy of my genome... all sorts of things.&lt;/p&gt;
        &lt;p&gt;I'll show a few more demos.&lt;/p&gt;
    &lt;/div&gt;
    &lt;div class="slide"&gt;
        &lt;img src="https://static.simonwillison.net/static/2020/octo-github/shot-40.jpg" alt="Tweets with selfies by Cleo" /&gt;
        &lt;p&gt;Here's another one about Cleo. Cleo has &lt;a href="https://twitter.com/cleopaws"&gt;a Twitter account&lt;/a&gt;, and every time she goes to the vet she posts a selfie and says how much she weighs.&lt;/p&gt;
    &lt;/div&gt;
    &lt;div class="slide"&gt;
        &lt;img src="https://static.simonwillison.net/static/2020/octo-github/shot-42.jpg" alt="A graph showing Cleo's weight over time" /&gt;
        &lt;p&gt;Here's a SQL query that finds every tweet that mentions her weight, pulls out her weight in pounds using a regular expression, then uses the &lt;a href="https://github.com/simonw/datasette-vega"&gt;datasette-vega&lt;/a&gt; charting plugin to show a self-reported chart of her weight over time.&lt;/p&gt;
        &lt;pre&gt;&lt;code&gt;select
    created_at,
    regexp_match('.*?(\d+(\.\d+))lb.*', full_text, 1) as lbs,
    full_text,
    case
        when (media_url_https is not null)
        then json_object('img_src', media_url_https, 'width', 300)
    end as photo
    from
    tweets
    left join media_tweets on tweets.id = media_tweets.tweets_id
    left join media on media.id = media_tweets.media_id
    where
    full_text like '%lb%'
    and user = 3166449535
    and lbs is not null
    group by
    tweets.id
    order by
    created_at desc
    limit
    101&lt;/code&gt;&lt;/pre&gt;
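        &lt;p&gt;regexp_match() is not built into SQLite, so something has to provide it. As a rough illustration (not the code behind this demo), here is how a function with that name could be registered using Python's sqlite3 module. The two-column tweets table and the cleo_weights() helper are hypothetical stand-ins for the real schema.&lt;/p&gt;

```python
import re
import sqlite3


def regexp_match(pattern, text, group=0):
    """Return the requested capture group, or None when nothing matches."""
    if text is None:
        return None
    match = re.match(pattern, text)
    return match.group(group) if match else None


def cleo_weights(rows):
    """Load (created_at, full_text) rows and pull out weights in pounds."""
    db = sqlite3.connect(":memory:")
    # Expose the Python function to SQL under the name used in the query
    db.create_function("regexp_match", 3, regexp_match)
    # Hypothetical, simplified table: the real one has many more columns
    db.execute("create table tweets (created_at text, full_text text)")
    db.executemany("insert into tweets values (?, ?)", rows)
    sql = r"""
        select
            created_at,
            regexp_match('.*?(\d+(\.\d+))lb.*', full_text, 1) as lbs
        from tweets
        where full_text like '%lb%' and lbs is not null
        order by created_at desc
    """
    return db.execute(sql).fetchall()
```

        &lt;p&gt;With this pattern only weights written with a decimal point (such as 51.3lb) are captured; a tweet that mentions lb without a matching number yields null and is filtered out.&lt;/p&gt;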
    &lt;/div&gt;
    &lt;div class="slide"&gt;
        &lt;img src="https://static.simonwillison.net/static/2020/octo-github/shot-43.jpg" alt="A screenshot showing the result of running a SQL query against my genome" /&gt;
        &lt;p&gt;I did &lt;a href="https://www.23andme.com/"&gt;23AndMe&lt;/a&gt; a few years ago, so I have a copy of my genome in Dogsheep. This SQL query tells me what colour my eyes are.&lt;/p&gt;
        &lt;p&gt;Apparently they are blue, 99% of the time.&lt;/p&gt;
        &lt;pre&gt;&lt;code&gt;select rsid, genotype, case genotype
    when 'AA' then 'brown eye color, 80% of the time'
    when 'AG' then 'brown eye color'
    when 'GG' then 'blue eye color, 99% of the time'
    end as interpretation from genome where rsid = 'rs12913832'&lt;/code&gt;&lt;/pre&gt;
    &lt;/div&gt;
    &lt;div class="slide"&gt;
        &lt;img src="https://static.simonwillison.net/static/2020/octo-github/shot-44.jpg" alt="A list of tables in my HealthKit database" /&gt;
        &lt;p&gt;I have HealthKit data from my Apple Watch.&lt;/p&gt;
        &lt;p&gt;Something I really like about Apple's approach to this stuff is that they don't just upload all of your data to the cloud.&lt;/p&gt;
        &lt;p&gt;This data lives on your watch and on your phone, and there's an option in the Health app on your phone to export it - as a zip file full of XML.&lt;/p&gt;
        &lt;p&gt;I wrote a script called &lt;a href="https://github.com/dogsheep/healthkit-to-sqlite"&gt;healthkit-to-sqlite&lt;/a&gt; that converts that zip file into a SQLite database, and now I have tables for things like my basal energy burned, my body fat percentage, flights of stairs I've climbed.&lt;/p&gt;
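        &lt;p&gt;The general shape of that conversion can be sketched in Python. This is not the healthkit-to-sqlite implementation: the member path apple_health_export/export.xml, the Record element and its type, unit, value, startDate and endDate attributes are assumptions about the export format, and the single records table is a hypothetical simplification.&lt;/p&gt;

```python
import sqlite3
import xml.etree.ElementTree as ET
import zipfile


def import_health_records(zip_file, db, member="apple_health_export/export.xml"):
    """Stream Record elements out of the zipped XML into a records table."""
    db.execute(
        "create table if not exists records "
        "(type text, unit text, value text, start_date text, end_date text)"
    )
    count = 0
    with zipfile.ZipFile(zip_file) as archive:
        with archive.open(member) as xml_stream:
            # iterparse yields each element as its closing tag is read,
            # so the whole document is never held as one parsed tree
            for _event, element in ET.iterparse(xml_stream, events=("end",)):
                if element.tag == "Record":
                    db.execute(
                        "insert into records values (?, ?, ?, ?, ?)",
                        (
                            element.get("type"),
                            element.get("unit"),
                            element.get("value"),
                            element.get("startDate"),
                            element.get("endDate"),
                        ),
                    )
                    count += 1
                element.clear()  # discard attributes and children already handled
    db.commit()
    return count
```

        &lt;p&gt;Calling import_health_records("export.zip", sqlite3.connect("healthkit.db")) would then leave a database file that Datasette can open; both file names are placeholders.&lt;/p&gt;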
    &lt;/div&gt;
    &lt;div class="slide"&gt;
        &lt;img src="https://static.simonwillison.net/static/2020/octo-github/shot-46.jpg" alt="Screenshot showing a Datasette map of my San Francisco Half Marathon route" /&gt;
        &lt;p&gt;But the really fun part is that it turns out any time you track an outdoor workout on your Apple Watch it records your exact location every few seconds, and you can get that data back out again!&lt;/p&gt;
        &lt;p&gt;This is a map of my exact route for the San Francisco Half Marathon three years ago.&lt;/p&gt;
        &lt;p&gt;I've started tracking an "outdoor walk" every time I go on a walk now, just so I can get the GPS data out again later.&lt;/p&gt;
    &lt;/div&gt;
    &lt;div class="slide"&gt;
        &lt;img src="https://static.simonwillison.net/static/2020/octo-github/shot-47.jpg" alt="Screenshot showing a list of commits to my projects, faceted by repository" /&gt;
        &lt;p&gt;I have a lot of data from GitHub about my projects - all of my commits, issues, issue comments and releases - everything I can get out of the GitHub API using my &lt;a href="https://github.com/dogsheep/github-to-sqlite"&gt;github-to-sqlite&lt;/a&gt; tool.&lt;/p&gt;
        &lt;p&gt;So I can do things like see all of my commits across all of my projects, search and facet them.&lt;/p&gt;
        &lt;p&gt;I have a public demo of a subset of this data at &lt;a href="https://github-to-sqlite.dogsheep.net/"&gt;github-to-sqlite.dogsheep.net&lt;/a&gt;.&lt;/p&gt;
    &lt;/div&gt;
    &lt;div class="slide"&gt;
        &lt;img src="https://static.simonwillison.net/static/2020/octo-github/shot-48.jpg" alt="Commits filtered by a search for pytest" /&gt;
        &lt;p&gt;I can search my commits for &lt;a href="https://github-to-sqlite.dogsheep.net/github/commits?_search=pytest&amp;amp;_sort_desc=author_date"&gt;any commit that mentions "pytest"&lt;/a&gt;.&lt;/p&gt;
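        &lt;p&gt;The same search is available as JSON by adding .json to the table path. A small sketch of building that URL in Python follows; the datasette_table_url() helper is hypothetical, while _search, _sort_desc and _shape are Datasette query string arguments.&lt;/p&gt;

```python
from urllib.parse import urlencode


def datasette_table_url(base_url, database, table, **params):
    """Build the JSON URL for a Datasette table, plus query string arguments."""
    path = "{}/{}/{}.json".format(base_url.rstrip("/"), database, table)
    query = urlencode(params)
    return path + "?" + query if query else path


url = datasette_table_url(
    "https://github-to-sqlite.dogsheep.net",
    "github",
    "commits",
    _search="pytest",  # full-text search term
    _sort_desc="author_date",  # newest commits first
    _shape="array",  # ask for a plain JSON array of row objects
)
```

        &lt;p&gt;Fetching url with any HTTP client should return the matching commits, assuming the demo instance is still online.&lt;/p&gt;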
    &lt;/div&gt;
    &lt;div class="slide"&gt;
        &lt;img src="https://static.simonwillison.net/static/2020/octo-github/shot-49.jpg" alt="A list of all of my recent project releases" /&gt;
        &lt;p&gt;I have &lt;a href="https://github-to-sqlite.dogsheep.net/github/recent_releases"&gt;all of my releases&lt;/a&gt;, which is useful for when I write &lt;a href="https://simonwillison.net/tags/weeknotes/"&gt;my weeknotes&lt;/a&gt; and want to figure out what I've been working on.&lt;/p&gt;
    &lt;/div&gt;
    &lt;div class="slide"&gt;
        &lt;img src="https://static.simonwillison.net/static/2020/octo-github/shot-50.jpg" alt="A faceted interface showing my photos, faceted by city, country and whether they are a favourite" /&gt;
        &lt;p&gt;Apple Photos is a particularly interesting source of data.&lt;/p&gt;
        &lt;p&gt;It turns out the Apple Photos app uses a SQLite database, and if you know what you're doing you can extract photo metadata from it.&lt;/p&gt;
        &lt;p&gt;They actually run machine learning models on your own device to figure out what your photos are of!&lt;/p&gt;
    &lt;/div&gt;
    &lt;div class="slide"&gt;
        &lt;img src="https://static.simonwillison.net/static/2020/octo-github/shot-51.jpg" alt="Some photos I have taken of pelicans, inside Datasette" /&gt;
        &lt;p&gt;You can use the machine learning labels to see all of the photos you have taken of pelicans. Here are all of the photos I have taken that Apple Photos has identified as pelicans.&lt;/p&gt;
    &lt;/div&gt;
    &lt;div class="slide"&gt;
        &lt;img src="https://static.simonwillison.net/static/2020/octo-github/shot-52.jpg" alt="Screenshot showing some of the columns in my photos table" /&gt;
        &lt;p&gt;It also turns out they have columns called things like ZOVERALLAESTHETICSCORE, ZHARMONIOUSCOLORSCORE, ZPLEASANTCAMERATILTSCORE and more.&lt;/p&gt;
        &lt;p&gt;So I can sort my pelican photos with the most aesthetically pleasing first!&lt;/p&gt;
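        &lt;p&gt;The sort itself is an ordinary ORDER BY. A minimal sketch, assuming a hypothetical pair of tables (photos and labels) rather than the real Apple Photos schema, with only the ZOVERALLAESTHETICSCORE column name taken from the screenshot:&lt;/p&gt;

```python
import sqlite3


def create_demo_db():
    """Build a tiny stand-in for the photo metadata (hypothetical schema)."""
    db = sqlite3.connect(":memory:")
    db.executescript(
        """
        create table photos (id integer primary key, ZOVERALLAESTHETICSCORE real);
        create table labels (photo_id integer, label text);
        insert into photos values (1, 0.21), (2, 0.87), (3, 0.55);
        insert into labels values (1, 'Pelican'), (2, 'Pelican'), (3, 'Seagull');
        """
    )
    return db


def best_photos(db, label, limit=10):
    """Return (id, score) for photos carrying a label, highest score first."""
    sql = """
        select photos.id, photos.ZOVERALLAESTHETICSCORE
        from photos
        join labels on labels.photo_id = photos.id
        where labels.label = ?
        order by photos.ZOVERALLAESTHETICSCORE desc
        limit ?
    """
    return db.execute(sql, (label, limit)).fetchall()
```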
    &lt;/div&gt;
    &lt;div class="slide"&gt;
        &lt;img src="https://static.simonwillison.net/static/2020/octo-github/shot-53.jpg" alt="Screenshot of my blog entry about Using SQL to find my best photo of a pelican according to Apple Photos" /&gt;
        &lt;p&gt;I wrote more about this on my blog: &lt;a href="https://simonwillison.net/2020/May/21/dogsheep-photos/"&gt;Using SQL to find my best photo of a pelican according to Apple Photos&lt;/a&gt;.&lt;/p&gt;
    &lt;/div&gt;
    &lt;div class="slide"&gt;
        &lt;img src="https://static.simonwillison.net/static/2020/octo-github/shot-54.jpg" alt="Screenshot of my Dogsheep Beta faceted search interface" /&gt;
        &lt;p&gt;And a few weeks ago I finally got around to building the thing I'd always wanted: the search engine.&lt;/p&gt;
        &lt;p&gt;I called it &lt;a href="https://github.com/dogsheep/dogsheep-beta"&gt;Dogsheep Beta&lt;/a&gt;, because Stephen Wolfram has a search engine called &lt;a href="https://www.wolframalpha.com/"&gt;Wolfram Alpha&lt;/a&gt;.&lt;/p&gt;
        &lt;p&gt;This is pun-driven development: I came up with this pun a while ago and liked it so much I committed to building the software.&lt;/p&gt;
    &lt;/div&gt;
    &lt;div class="slide"&gt;
        &lt;img src="https://static.simonwillison.net/static/2020/octo-github/shot-55.jpg" alt="Search results for Cupertino, showing photos with maps" /&gt;
        &lt;p&gt;I wanted to know when the last time I had eaten a waffle-fish ice cream was. I knew it was in Cupertino, so I searched Dogsheep Beta for Cupertino and found this photo.&lt;/p&gt;
        &lt;p&gt;I hope this illustrates how much you can do if you pull all of your personal data into one place!&lt;/p&gt;
    &lt;/div&gt;
    &lt;div class="slide"&gt;
        &lt;img src="https://static.simonwillison.net/static/2020/octo-github/shot-56.jpg" alt="GDPR really helps" /&gt;
        &lt;p&gt;The GDPR law that passed in Europe a few years ago really helps with this stuff.&lt;/p&gt;
        &lt;p&gt;Companies have to provide you with access to the data that they store about you.&lt;/p&gt;
        &lt;p&gt;Many big internet companies have responded to this by providing a self-service export feature, usually buried somewhere in the settings.&lt;/p&gt;
        &lt;p&gt;You can also request data directly from companies, but the self-service option helps them keep their customer support costs down.&lt;/p&gt;
        &lt;p&gt;This stuff becomes easier over time as more companies build out these features.&lt;/p&gt;
    &lt;/div&gt;
    &lt;div class="slide"&gt;
        &lt;img src="https://static.simonwillison.net/static/2020/octo-github/shot-57.jpg" alt="Democratizing access. The future is already here, it's just not evenly distributed - William Gibson" /&gt;
        &lt;p&gt;The other challenge is how we democratize access to this.&lt;/p&gt;
        &lt;p&gt;Everything I've shown you today is open source: you can install this software and use it yourself, for free.&lt;/p&gt;
        &lt;p&gt;But there's a lot of assembly required. You need to figure out authentication tokens, find somewhere to host it, set up cron jobs and authentication.&lt;/p&gt;
        &lt;p&gt;But this should be accessible to regular non-uber-nerd humans!&lt;/p&gt;
    &lt;/div&gt;
    &lt;div class="slide"&gt;
        &lt;img src="https://static.simonwillison.net/static/2020/octo-github/shot-57-2.jpg" alt="Democratizing access. Should users run their own online Dogsheep? So hard and risky! Tailscale and WireGuard are interesting here. Vendors to provide hosted Dogsheep? Not a great business, risky!. Better options: Desktop app, mobile app." /&gt;
        &lt;p&gt;Expecting regular humans to run a secure web server somewhere is pretty terrifying. I've been looking at &lt;a href="https://www.wireguard.com/"&gt;WireGuard&lt;/a&gt; and &lt;a href="https://tailscale.com/"&gt;Tailscale&lt;/a&gt; to help make secure access between devices easier, but that's still very much for super-users only.&lt;/p&gt;
        &lt;p&gt;Running this as a hosted service doesn't appeal: taking responsibility for people's personal data is scary, and it's probably not a great business.&lt;/p&gt;
        &lt;p&gt;I think the best options are to run on people's own personal devices - their mobile phones and their laptops. I think it's feasible to get Datasette running in those environments, and I really like the idea of users being able to import their personal data onto a device that they control and analyze it there.&lt;/p&gt;
    &lt;/div&gt;
    &lt;div class="slide"&gt;
        &lt;img src="https://static.simonwillison.net/static/2020/octo-github/shot-58.jpg" alt="Try it yourself! Everything I’ve shown you today is open source" /&gt;
        &lt;p&gt;I invite you to try this all out for yourself!&lt;/p&gt;
        &lt;p&gt;&lt;a href="https://datasette.io"&gt;datasette.io&lt;/a&gt; for Datasette&lt;/p&gt;
        &lt;p&gt;&lt;a href="https://github.com/dogsheep"&gt;github.com/dogsheep&lt;/a&gt; and &lt;a href="https://dogsheep.github.io"&gt;dogsheep.github.io&lt;/a&gt; for Dogsheep&lt;/p&gt;
        &lt;p&gt;&lt;a href="https://simonwillison.net"&gt;simonwillison.net&lt;/a&gt; is my personal blog&lt;/p&gt;
        &lt;p&gt;&lt;a href="https://twitter.com/simonw"&gt;twitter.com/simonw&lt;/a&gt; is my Twitter account&lt;/p&gt;
    &lt;/div&gt;
    &lt;div class="slide"&gt;
        &lt;img src="https://static.simonwillison.net/static/2020/octo-github/shot-59.jpg" alt="Screenshot of Dogsheep on GitHub" /&gt;
        &lt;p&gt;The &lt;a href="https://github.com/dogsheep"&gt;Dogsheep GitHub organization&lt;/a&gt; has most of the tools that I've used to build out my personal Dogsheep warehouse - many of them using the naming convention of something-to-sqlite.&lt;/p&gt;
    &lt;/div&gt;
    &lt;h4 style="clear: both"&gt;Q&amp;amp;A, from &lt;a href="https://docs.google.com/document/d/1rFp2tXLvaCK5khbPbSfV8nfVjZLGTX3KkMZ2FMpgi-k/edit"&gt;this Google Doc&lt;/a&gt;&lt;/h4&gt;
    &lt;div class="slide"&gt;
        &lt;img src="https://static.simonwillison.net/static/2020/octo-github/shot-60.jpg" alt="Screenshot of the Google Doc" /&gt;
        &lt;p&gt;Q: Is there/will there be a Datasette hosted service that I can pay $ for? I would like to pay $5/month to get access to the latest version of Dogsheep with all the latest plugins!&lt;/p&gt;
        &lt;p&gt;I don’t want to build a hosting site for personal private data because I think people should stay in control of that themselves, plus I don’t think there’s a particularly good business model for that.&lt;/p&gt;
        &lt;p&gt;Instead, I’m building a hosted service for Datasette (called Datasette Cloud) which is aimed at companies and organizations. I want to be able to provide newsrooms and other groups with a private, secure, hosted environment where they can share data with each other and run analysis.&lt;/p&gt;
    &lt;/div&gt;
    &lt;div class="slide"&gt;
        &lt;img src="https://static.simonwillison.net/static/2020/octo-github/shot-61.jpg" alt="Screenshot showing an export running on an iPhone in the Health app" /&gt;
        &lt;p&gt;Q: How do you sync your data from your phone/watch to the data warehouse? Is it a manual process?&lt;/p&gt;
        &lt;p&gt;The health data is manual: the iOS Health app has an export button which generates a zip file of XML which you can then AirDrop to a laptop. I then run my &lt;a href="https://github.com/dogsheep/healthkit-to-sqlite"&gt;healthkit-to-sqlite&lt;/a&gt; script against it to generate the DB file and SCP that to my Dogsheep server.&lt;/p&gt;
        &lt;p&gt;Many of my other Dogsheep tools use APIs and can run on cron, to fetch the most recent data from Swarm and Twitter and GitHub and so on.&lt;/p&gt;
        &lt;p&gt;Q: When accessing GitHub/Twitter etc do you run queries against their API or do you periodically sync (retrieve mostly I guess) the data to the warehouse first and then query locally?&lt;/p&gt;
        &lt;p&gt;I always try to get ALL the data so I can query it locally. The problem with APIs that let you run queries is that inevitably there’s something I want to do that can’t be done with the API - so I’d much rather suck everything down into my own database so I can write my own SQL queries.&lt;/p&gt;
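        &lt;p&gt;The sync-then-query pattern can be sketched in a few lines of Python. This is an illustration under stated assumptions, not how any of the Dogsheep tools are written: fetch_checkins stands in for whatever API client supplies the records, and the checkins table layout is hypothetical.&lt;/p&gt;

```python
import json
import sqlite3


def sync_checkins(db, fetch_checkins):
    """Upsert every record returned by fetch_checkins() into a local table."""
    db.execute(
        "create table if not exists checkins "
        "(id text primary key, created_at text, raw text)"
    )
    saved = 0
    for checkin in fetch_checkins():
        # insert or replace keyed on the primary key makes repeated runs
        # safe: records fetched twice overwrite themselves, not duplicate
        db.execute(
            "insert or replace into checkins (id, created_at, raw) values (?, ?, ?)",
            (checkin["id"], checkin["created_at"], json.dumps(checkin)),
        )
        saved += 1
    db.commit()
    return saved
```

        &lt;p&gt;Because each run is idempotent, a job like this can be scheduled on cron with an overlapping time window and the local table will still hold one row per record.&lt;/p&gt;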
    &lt;/div&gt;
    &lt;div class="slide"&gt;
        &lt;img src="https://static.simonwillison.net/static/2020/octo-github/shot-62.jpg" alt="Screenshot showing how to run swarm-to-sqlite in a terminal" /&gt;
        &lt;p&gt;Here's an example of my &lt;a href="https://github.com/dogsheep/swarm-to-sqlite"&gt;swarm-to-sqlite&lt;/a&gt; script, pulling in just checkins from the past two weeks (using authentication credentials from an environment variable).&lt;/p&gt;
        &lt;pre&gt;&lt;code&gt;swarm-to-sqlite swarm.db --since=2w&lt;/code&gt;&lt;/pre&gt;
        &lt;p&gt;Here's &lt;a href="https://gist.github.com/simonw/1299d61d17637d1145955ebc019ea3c4"&gt;a redacted copy&lt;/a&gt; of my Dogsheep crontab.&lt;/p&gt;
    &lt;/div&gt;
    &lt;div class="slide"&gt;
        &lt;img src="https://static.simonwillison.net/static/2020/octo-github/shot-63.jpg" alt="Screenshot of the SQL.js GitHub page" /&gt;
        &lt;p&gt;Q: Have you explored doing this as a single page app so that it is possible to deploy this as a static site? What are the constraints there?&lt;/p&gt;
        &lt;p&gt;It’s actually possible to query SQLite databases entirely within client-side JavaScript using &lt;a href="https://github.com/sql-js/sql.js"&gt;SQL.js&lt;/a&gt; (SQLite compiled to WebAssembly).&lt;/p&gt;
    &lt;/div&gt;
    &lt;div class="slide"&gt;
        &lt;img src="https://static.simonwillison.net/static/2020/octo-github/shot-64.jpg" alt="Screenshot of an Observable notebook running SQL.js" /&gt;
        &lt;p&gt;&lt;a href="https://observablehq.com/@mbostock/sqlite"&gt;This Observable notebook&lt;/a&gt; is an example that uses this to run SQL queries against a SQLite database file loaded from a URL.&lt;/p&gt;
    &lt;/div&gt;
    &lt;div class="slide"&gt;
        &lt;img src="https://static.simonwillison.net/static/2020/octo-github/shot-65.jpg" alt="Screenshot of a search for cherry trees on sf-trees.com" /&gt;
        &lt;p&gt;Datasette’s JSON and GraphQL APIs mean it can easily act as an API backend to SPAs.&lt;/p&gt;
        &lt;p&gt;I built this site to offer a search engine for trees in San Francisco. View source to see how it hits a Datasette API in the background: &lt;a href="https://sf-trees.com/?q=palm"&gt;https://sf-trees.com/?q=palm&lt;/a&gt;&lt;/p&gt;
    &lt;/div&gt;
    &lt;div class="slide"&gt;
        &lt;img src="https://static.simonwillison.net/static/2020/octo-github/shot-66.jpg" alt="The network pane running against sf-trees.com" /&gt;
        &lt;p&gt;You can use the network pane to see that it's running queries against a Datasette backend.&lt;/p&gt;
    &lt;/div&gt;
    &lt;div class="slide"&gt;
        &lt;img src="https://static.simonwillison.net/static/2020/octo-github/shot-67.jpg" alt="Screenshot of view-source on sf-trees.com" /&gt;
        &lt;p&gt;Here's the JavaScript code which calls the API.&lt;/p&gt;
    &lt;/div&gt;
    &lt;div class="slide"&gt;
        &lt;img src="https://static.simonwillison.net/static/2020/octo-github/shot-68.jpg" alt="Screenshot showing the GraphiQL explorer tool running a GraphQL query against Datasette" /&gt;
        &lt;p&gt;&lt;a href="https://github-to-sqlite.dogsheep.net/graphql?query=%7B%0A%20%20repos%20%7B%0A%20%20%20%20totalCount%0A%20%20%20%20nodes%20%7B%0A%20%20%20%20%20%20full_name%0A%20%20%20%20%20%20owner%20%7B%0A%20%20%20%20%20%20%20%20id%0A%20%20%20%20%20%20%20%20name%0A%20%20%20%20%20%20%7D%20%0A%20%20%20%20%7D%0A%20%20%7D%0A%7D"&gt;This demo&lt;/a&gt; shows Datasette’s &lt;a href="https://github.com/simonw/datasette-graphql"&gt;GraphQL plugin&lt;/a&gt; in action.&lt;/p&gt;
    &lt;/div&gt;
    &lt;div class="slide"&gt;
        &lt;img src="https://static.simonwillison.net/static/2020/octo-github/shot-69.jpg" alt="Screenshot of Datasette Canned Query documentation" /&gt;
        &lt;p&gt;Q: What possibilities for data entry tools do the writable canned queries open up?&lt;/p&gt;
        &lt;p&gt;&lt;a href="https://docs.datasette.io/en/stable/sql_queries.html#writable-canned-queries"&gt;Writable canned queries&lt;/a&gt; are a relatively recent Datasette feature that allows administrators to configure an UPDATE/INSERT/DELETE query that can be called by users filling in forms or accessed via a JSON API.&lt;/p&gt;
        &lt;p&gt;The idea is to make it easy to build backends that handle simple data entry in addition to serving read-only queries. It’s a feature with a lot of potential but so far I’ve not used it for anything significant.&lt;/p&gt;
        &lt;p&gt;Currently it can generate a VERY basic form (with single-line input values, similar to &lt;a href="https://latest.datasette.io/fixtures/neighborhood_search"&gt;this search example&lt;/a&gt;) but I hope to expand it in the future to support &lt;a href="https://github.com/simonw/datasette/issues/1090"&gt;custom form widgets&lt;/a&gt; via plugins for things like dates, map locations or autocomplete against other tables.&lt;/p&gt;
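        &lt;p&gt;As a sketch of what such a configuration looks like: Datasette's documentation describes canned queries as entries under a database's queries key in the metadata, with write set to true for writable ones. The database name demo, the names table and the run_canned_query() helper below are hypothetical; the helper only imitates locally what a form submission would trigger.&lt;/p&gt;

```python
import json
import sqlite3

# Metadata describing one writable canned query called add_name
metadata = {
    "databases": {
        "demo": {
            "queries": {
                "add_name": {
                    "sql": "insert into names (name) values (:name)",
                    "write": True,
                }
            }
        }
    }
}


def run_canned_query(db, name, params):
    """Execute a configured query with named parameters, as a form post might."""
    query = metadata["databases"]["demo"]["queries"][name]
    with db:  # commit on success, roll back if the statement fails
        db.execute(query["sql"], params)


# Serialized form, suitable for saving as a metadata.json file
metadata_json = json.dumps(metadata, indent=4)
```

        &lt;p&gt;Each :name style placeholder in the SQL corresponds to one single-line input in the generated form.&lt;/p&gt;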
        &lt;p&gt;Q: For the local version where you had a 1-line push to deploy a new datasette: how do you handle updates? Is there a similar 1-line update to update an existing deployed datasette?&lt;/p&gt;
        &lt;p&gt;I deploy a brand new installation every time the data changes! This works great for data that only changes a few times a day. If I have a project that changes multiple times an hour I’ll run it on a regular VPS rather than use a serverless hosting provider.&lt;/p&gt;
    &lt;/div&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/github"&gt;github&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/speaking"&gt;speaking&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/my-talks"&gt;my-talks&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/dogsheep"&gt;dogsheep&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/weeknotes"&gt;weeknotes&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sqlite-utils"&gt;sqlite-utils&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/annotated-talks"&gt;annotated-talks&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="github"/><category term="speaking"/><category term="my-talks"/><category term="datasette"/><category term="dogsheep"/><category term="weeknotes"/><category term="sqlite-utils"/><category term="annotated-talks"/></entry><entry><title>OCTO Speaker Series: Simon Willison - Personal Data Warehouses: Reclaiming Your Data</title><link href="https://simonwillison.net/2020/Oct/23/octo-speaker-series/#atom-tag" rel="alternate"/><published>2020-10-23T03:00:42+00:00</published><updated>2020-10-23T03:00:42+00:00</updated><id>https://simonwillison.net/2020/Oct/23/octo-speaker-series/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://octo.github.com/speakerseries/SimonWillison/"&gt;OCTO Speaker Series: Simon Willison - Personal Data Warehouses: Reclaiming Your Data&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
I’m giving a talk in the GitHub OCTO (Office of the CTO) speaker series about Datasette and my Dogsheep personal analytics project. You can register for free here—the stream will be on Thursday November 12, 2020 at 8:30am PST (4:30pm GMT).


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/github"&gt;github&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/speaking"&gt;speaking&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/dogsheep"&gt;dogsheep&lt;/a&gt;&lt;/p&gt;



</summary><category term="github"/><category term="speaking"/><category term="datasette"/><category term="dogsheep"/></entry><entry><title>Dogsheep: Personal analytics with Datasette</title><link href="https://simonwillison.net/2020/Oct/19/dogsheep-personal-analytics-datasette/#atom-tag" rel="alternate"/><published>2020-10-19T16:38:23+00:00</published><updated>2020-10-19T16:38:23+00:00</updated><id>https://simonwillison.net/2020/Oct/19/dogsheep-personal-analytics-datasette/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://datasette.substack.com/p/dogsheep-personal-analytics-with"&gt;Dogsheep: Personal analytics with Datasette&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
The second edition of my new Datasette Weekly newsletter talks about Dogsheep, Dogsheep Beta and Datasette 1.0, and features datasette-cluster-map as the plugin of the week.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/dogsheep"&gt;dogsheep&lt;/a&gt;&lt;/p&gt;



</summary><category term="datasette"/><category term="dogsheep"/></entry><entry><title>Building an Evernote to SQLite exporter</title><link href="https://simonwillison.net/2020/Oct/16/building-evernote-sqlite-exporter/#atom-tag" rel="alternate"/><published>2020-10-16T20:12:07+00:00</published><updated>2020-10-16T20:12:07+00:00</updated><id>https://simonwillison.net/2020/Oct/16/building-evernote-sqlite-exporter/#atom-tag</id><summary type="html">
    &lt;p&gt;I've been using Evernote for over a decade, and I've long wanted to export my data from it so I can do interesting things with it.&lt;/p&gt;
&lt;p&gt;Ideally I'd use their API for that, so I could continually export new notes via a cron. Unfortunately the API is implemented &lt;a href="https://evernote.com/blog/so-api-together-evernote-and-thrift/"&gt;using Thrift&lt;/a&gt; (the only public API I've seen to use that) and my attempts to speak to it from Python have so far &lt;a href="https://twitter.com/simonw/status/1257743727264595968"&gt;failed to get anywhere&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Last weekend I decided to try using their "Export notes" feature instead, and its &lt;a href="https://evernote.com/blog/how-evernotes-xml-export-format-works/"&gt;ENEX XML format&lt;/a&gt;.&lt;/p&gt;
&lt;h4&gt;ENEX - the Evernote XML export format&lt;/h4&gt;
&lt;p&gt;Select a note - or multiple notes - in the Evernote macOS desktop app, use the &lt;code&gt;File -&amp;gt; Export Notes&lt;/code&gt; menu item and Evernote will create a &lt;code&gt;Notes.enex&lt;/code&gt; exported XML file.&lt;/p&gt;
&lt;p&gt;These files can get BIG, because any images attached to your notes will be embedded as base64 encoded binary data within the XML. My export is 2.8GB!&lt;/p&gt;
&lt;p&gt;When faced with unknown XML like this it's useful to get a high-level overview of what tags are present. In 2009 I wrote &lt;a href="https://github.com/simonw/xml-analyser"&gt;a Python script&lt;/a&gt; for this - as part of this project I updated it for Python 3 and pushed &lt;a href="https://pypi.org/project/xml-analyser/"&gt;a release to PyPI&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Here's a subset of the output when run against my &lt;code&gt;Notes.enex&lt;/code&gt; file (&lt;a href="https://gist.github.com/simonw/eb86bbdd953c2c1dba2007f43e689f2f"&gt;full output is here&lt;/a&gt;).&lt;/p&gt;
&lt;div class="highlight highlight-source-json"&gt;&lt;pre&gt;{
    &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;note&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;: {
        &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;attr_counts&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;: {},
        &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;child_counts&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;: {
            &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;content&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;: &lt;span class="pl-c1"&gt;2126&lt;/span&gt;,
            &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;created&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;: &lt;span class="pl-c1"&gt;2126&lt;/span&gt;,
            &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;note-attributes&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;: &lt;span class="pl-c1"&gt;2126&lt;/span&gt;,
            &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;resource&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;: &lt;span class="pl-c1"&gt;2605&lt;/span&gt;,
            &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;tag&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;: &lt;span class="pl-c1"&gt;45&lt;/span&gt;,
            &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;title&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;: &lt;span class="pl-c1"&gt;2126&lt;/span&gt;,
            &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;updated&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;: &lt;span class="pl-c1"&gt;2126&lt;/span&gt;
        },
        &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;count&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;: &lt;span class="pl-c1"&gt;2126&lt;/span&gt;,
        &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;parent_counts&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;: {
            &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;en-export&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;: &lt;span class="pl-c1"&gt;2126&lt;/span&gt;
        }
    },
    &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;note-attributes&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;: {
        &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;attr_counts&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;: {},
        &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;child_counts&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;: {
            &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;altitude&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;: &lt;span class="pl-c1"&gt;1466&lt;/span&gt;,
            &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;application-data&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;: &lt;span class="pl-c1"&gt;449&lt;/span&gt;,
            &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;author&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;: &lt;span class="pl-c1"&gt;998&lt;/span&gt;,
            &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;classifications&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;: &lt;span class="pl-c1"&gt;51&lt;/span&gt;,
            &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;content-class&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;: &lt;span class="pl-c1"&gt;387&lt;/span&gt;,
            &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;latitude&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;: &lt;span class="pl-c1"&gt;1480&lt;/span&gt;,
            &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;longitude&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;: &lt;span class="pl-c1"&gt;1480&lt;/span&gt;,
            &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;reminder-done-time&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;: &lt;span class="pl-c1"&gt;2&lt;/span&gt;,
            &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;reminder-order&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;: &lt;span class="pl-c1"&gt;2126&lt;/span&gt;,
            &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;reminder-time&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;: &lt;span class="pl-c1"&gt;1&lt;/span&gt;,
            &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;source&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;: &lt;span class="pl-c1"&gt;1664&lt;/span&gt;,
            &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;source-application&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;: &lt;span class="pl-c1"&gt;423&lt;/span&gt;,
            &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;source-url&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;: &lt;span class="pl-c1"&gt;85&lt;/span&gt;,
            &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;subject-date&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;: &lt;span class="pl-c1"&gt;10&lt;/span&gt;
        },
        &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;count&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;: &lt;span class="pl-c1"&gt;2126&lt;/span&gt;,
        &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;parent_counts&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;: {
            &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;note&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;: &lt;span class="pl-c1"&gt;2126&lt;/span&gt;
        }
    }
}&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This shows me that every note is represented as a &lt;code&gt;&amp;lt;note&amp;gt;&lt;/code&gt; element, and crucial metadata lives in children of a &lt;code&gt;&amp;lt;note-attributes&amp;gt;&lt;/code&gt; child element.&lt;/p&gt;
&lt;p&gt;(I thought I'd need to tweak the script for performance since it works by loading the entire file into memory, but my laptop has 32GB of RAM so it didn't even blink.)&lt;/p&gt;
&lt;h4&gt;ENEX limitations&lt;/h4&gt;
&lt;p&gt;I ran into some significant limitations while working with my ENEX export.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Notebooks are not represented in the file at all - you just get the notes. So I can't tell which recipes were in my "cooking" notebook. It's possible to work around this by manually exporting the notes from each notebook one at a time and storing them in separate export files. I didn't bother.&lt;/li&gt;
&lt;li&gt;Some of the data in the file - the note content itself for example - consists of further blocks of XML embedded in CDATA. This means you have to run a nested XML parser for every note you process.&lt;/li&gt;
&lt;li&gt;The notes XML &lt;code&gt;&amp;lt;en-note&amp;gt;&lt;/code&gt; format is mostly XHTML, but includes custom &lt;code&gt;&amp;lt;en-media hash="..."&amp;gt;&lt;/code&gt; tags where inline images should be displayed.&lt;/li&gt;
&lt;li&gt;Those inline images are CDATA encoded base64 strings. They don't include a hash, but I figured out that decoding the base64 string and then running it through MD5 generates the hash that is used in the corresponding &lt;code&gt;&amp;lt;en-media&amp;gt;&lt;/code&gt; tag.&lt;/li&gt;
&lt;li&gt;Notes in the export don't have any form of unique ID!&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;That last limitation - the lack of unique IDs - is a huge pain. It means that any export is by necessity a one-time operation - if you edit a note and attempt a re-export you will get a brand new record that can't be automatically used to update the previous one.&lt;/p&gt;
&lt;p&gt;This is particularly frustrating because I know that Evernote assigns a GUID to each note - they just don't include them in the export file.&lt;/p&gt;
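&lt;p&gt;The hash matching described in the list above can be sketched in a few lines of Python. This is a minimal illustration rather than code from the exporter: the &lt;code&gt;resource_md5()&lt;/code&gt; helper name and the sample payload are made up for the example.&lt;/p&gt;

```python
import base64
import hashlib


def resource_md5(b64_data):
    """Return the hex MD5 digest of the bytes encoded in a base64 string."""
    # b64decode discards newlines by default, so a base64 string that has
    # been wrapped across several lines decodes to the same bytes
    raw = base64.b64decode(b64_data)
    return hashlib.md5(raw).hexdigest()


# Hypothetical payload standing in for the base64 text of one embedded image
payload = base64.b64encode(b"not really an image").decode("ascii")
print(resource_md5(payload))
```

&lt;p&gt;The resulting hex digest is the value I would expect to compare against the &lt;code&gt;hash&lt;/code&gt; attribute of an &lt;code&gt;&amp;lt;en-media&amp;gt;&lt;/code&gt; tag.&lt;/p&gt;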
&lt;h3&gt;Writing an exporter&lt;/h3&gt;
&lt;p&gt;My &lt;a href="https://dogsheep.github.io/"&gt;Dogsheep&lt;/a&gt; family of tools aims to liberate personal data from all kinds of different sources and convert it into SQLite, which means I can explore, query and visualize it using &lt;a href="https://datasette.io/"&gt;Datasette&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://github.com/dogsheep/evernote-to-sqlite"&gt;dogsheep/evernote-to-sqlite&lt;/a&gt; is my latest entry in that series.&lt;/p&gt;
&lt;p&gt;I based it on &lt;a href="https://github.com/dogsheep/healthkit-to-sqlite"&gt;healthkit-to-sqlite&lt;/a&gt; because that, too, has to deal with a multi-gigabyte XML file. Both tools use the Python standard library's &lt;a href="https://docs.python.org/3/library/xml.etree.elementtree.html#pull-api-for-non-blocking-parsing"&gt;XMLPullParser&lt;/a&gt;, which makes XML elements available as a stream without needing to load the entire file into memory at once.&lt;/p&gt;
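&lt;p&gt;To give a feel for the streaming approach, here is a rough sketch of &lt;code&gt;XMLPullParser&lt;/code&gt; consuming a document in small chunks. It is not the code from either tool: the &lt;code&gt;iter_note_titles()&lt;/code&gt; helper is hypothetical, and it assumes each &lt;code&gt;&amp;lt;note&amp;gt;&lt;/code&gt; has a &lt;code&gt;&amp;lt;title&amp;gt;&lt;/code&gt; child.&lt;/p&gt;

```python
import xml.etree.ElementTree as ET


def iter_note_titles(chunks):
    """Yield the title of each note as soon as its closing tag has been parsed."""
    parser = ET.XMLPullParser(events=("end",))
    for chunk in chunks:
        parser.feed(chunk)
        for _event, element in parser.read_events():
            if element.tag == "note":
                title = element.findtext("title")
                # Drop the note's children so they can be garbage collected
                element.clear()
                yield title


# Build a tiny stand-in document and feed it in 16 byte pieces
root = ET.Element("en-export")
for name in ("First note", "Second note"):
    note = ET.SubElement(root, "note")
    ET.SubElement(note, "title").text = name
data = ET.tostring(root)
chunks = [data[i : i + 16] for i in range(0, len(data), 16)]
print(list(iter_note_titles(chunks)))
```

&lt;p&gt;With a real export the chunks would come from repeated calls to &lt;code&gt;read()&lt;/code&gt; on the open file, so only one chunk plus the note currently being parsed needs to be held at a time.&lt;/p&gt;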
&lt;p&gt;The conversion code &lt;a href="https://github.com/dogsheep/evernote-to-sqlite/blob/0.2/evernote_to_sqlite/utils.py"&gt;is here&lt;/a&gt;. It ended up being relatively straightforward, using &lt;code&gt;ElementTree&lt;/code&gt; to extract data from the XML and &lt;a href="https://sqlite-utils.readthedocs.io/"&gt;sqlite-utils&lt;/a&gt; to write it to a database.&lt;/p&gt;
&lt;p&gt;It implements a progress bar by tracking the number of bytes that have been read from the underlying XML file.&lt;/p&gt;
&lt;p&gt;One optimization: originally I created a single &lt;code&gt;resources&lt;/code&gt; table with the resource metadata and a BLOB column containing the binary image contents.&lt;/p&gt;
&lt;p&gt;This table was huge - over a GB - and had very poor performance for operations such as querying and faceting across unindexed columns - because any table scan had to work through MBs of binary data.&lt;/p&gt;
&lt;p&gt;My personal Evernote notebook has around 2,000 resources. 2,000 rows of metadata should usually be very fast to query.&lt;/p&gt;
&lt;p&gt;So... I split the binary data out into a two column &lt;code&gt;resources_data&lt;/code&gt; table - &lt;code&gt;md5&lt;/code&gt; primary key and a BLOB for &lt;code&gt;data&lt;/code&gt;. This massively sped up queries against that &lt;code&gt;resources&lt;/code&gt; table.&lt;/p&gt;
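&lt;p&gt;Here is a small sketch of that two-table layout using the standard library's &lt;code&gt;sqlite3&lt;/code&gt; module. The &lt;code&gt;md5&lt;/code&gt; and &lt;code&gt;data&lt;/code&gt; columns match the description above; the &lt;code&gt;mime&lt;/code&gt; and &lt;code&gt;file_name&lt;/code&gt; metadata columns and the sample row are assumptions made for illustration, not the tool's actual schema.&lt;/p&gt;

```python
import sqlite3

db = sqlite3.connect(":memory:")
# Narrow metadata table (these metadata columns are illustrative)
db.execute("create table resources (md5 text primary key, mime text, file_name text)")
# Two column table holding the bulky binary content
db.execute("create table resources_data (md5 text primary key, data blob)")

# Hypothetical resource: small metadata row, binary payload stored separately
md5 = "0123456789abcdef0123456789abcdef"
db.execute("insert into resources values (?, ?, ?)", (md5, "image/png", "scan.png"))
db.execute("insert into resources_data values (?, ?)", (md5, b"\x89PNG" + b"\x00" * 1024))

# Faceting on metadata only has to scan the narrow table
mime_counts = db.execute("select mime, count(*) from resources group by mime").fetchall()
# The binary content is still one primary key lookup away
(blob,) = db.execute("select data from resources_data where md5 = ?", (md5,)).fetchone()
print(mime_counts, len(blob))
```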
&lt;p&gt;And it works! Running the following command produces a SQLite database file containing all of my notes, note metadata and embedded images:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;evernote-to-sqlite enex evernote.db MyNotes.enex
&lt;/code&gt;&lt;/pre&gt;
&lt;h4&gt;Serving up formatted notes&lt;/h4&gt;
&lt;p&gt;Figuring out the best way to serve up the note content in Datasette is still a work in progress - but I've hacked together a delightfully terrible way of doing this using a one-off plugin.&lt;/p&gt;
&lt;p&gt;Datasette plugins are usually packaged and installed via PyPI, but there's &lt;a href="https://docs.datasette.io/en/stable/writing_plugins.html#writing-one-off-plugins"&gt;an alternative option&lt;/a&gt; for plugins that don't warrant distribution: create a &lt;code&gt;plugins/&lt;/code&gt; directory, drop in one or more Python files and start Datasette pointing &lt;code&gt;--plugins-dir&lt;/code&gt; at that directory.&lt;/p&gt;
&lt;p&gt;This is great for hacking on terrible ideas. Here's the &lt;code&gt;evernote.py&lt;/code&gt; plugin I wrote in full:&lt;/p&gt;
&lt;pre&gt;&lt;span class="pl-k"&gt;from&lt;/span&gt; &lt;span class="pl-s1"&gt;datasette&lt;/span&gt; &lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-s1"&gt;hookimpl&lt;/span&gt;
&lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-s1"&gt;jinja2&lt;/span&gt;

&lt;span class="pl-v"&gt;START&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s"&gt;"&amp;lt;en-note"&lt;/span&gt;
&lt;span class="pl-v"&gt;END&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s"&gt;"&amp;lt;/en-note&amp;gt;"&lt;/span&gt;
&lt;span class="pl-v"&gt;TEMPLATE&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s"&gt;"""&lt;/span&gt;
&lt;span class="pl-s"&gt;&amp;lt;div style="max-width: 500px; white-space: normal; overflow-wrap: break-word;"&amp;gt;{}&amp;lt;/div&amp;gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;"""&lt;/span&gt;.&lt;span class="pl-en"&gt;strip&lt;/span&gt;()

&lt;span class="pl-v"&gt;EN_MEDIA_SCRIPT&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s"&gt;"""&lt;/span&gt;
&lt;span class="pl-s"&gt;Array.from(document.querySelectorAll('en-media')).forEach(el =&amp;gt; {&lt;/span&gt;
&lt;span class="pl-s"&gt;    let hash = el.getAttribute('hash');&lt;/span&gt;
&lt;span class="pl-s"&gt;    let type = el.getAttribute('type');&lt;/span&gt;
&lt;span class="pl-s"&gt;    let path = `/evernote/resources_data/${hash}.json?_shape=array`;&lt;/span&gt;
&lt;span class="pl-s"&gt;    fetch(path).then(r =&amp;gt; r.json()).then(rows =&amp;gt; {&lt;/span&gt;
&lt;span class="pl-s"&gt;        let b64 = rows[0].data.encoded;&lt;/span&gt;
&lt;span class="pl-s"&gt;        let data = `data:${type};base64,${b64}`;&lt;/span&gt;
&lt;span class="pl-s"&gt;        el.innerHTML = `&amp;lt;img style="max-width: 300px" src="${data}"&amp;gt;`;&lt;/span&gt;
&lt;span class="pl-s"&gt;    });&lt;/span&gt;
&lt;span class="pl-s"&gt;});&lt;/span&gt;
&lt;span class="pl-s"&gt;"""&lt;/span&gt;

&lt;span class="pl-en"&gt;@&lt;span class="pl-s1"&gt;hookimpl&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-k"&gt;def&lt;/span&gt; &lt;span class="pl-en"&gt;render_cell&lt;/span&gt;(&lt;span class="pl-s1"&gt;value&lt;/span&gt;, &lt;span class="pl-s1"&gt;table&lt;/span&gt;):
    &lt;span class="pl-k"&gt;if&lt;/span&gt; &lt;span class="pl-c1"&gt;not&lt;/span&gt; &lt;span class="pl-s1"&gt;table&lt;/span&gt;:
        &lt;span class="pl-c"&gt;# Don't render content from arbitrary SQL queries, could be XSS hole&lt;/span&gt;
        &lt;span class="pl-k"&gt;return&lt;/span&gt;
    &lt;span class="pl-k"&gt;if&lt;/span&gt; &lt;span class="pl-c1"&gt;not&lt;/span&gt; &lt;span class="pl-s1"&gt;value&lt;/span&gt; &lt;span class="pl-c1"&gt;or&lt;/span&gt; &lt;span class="pl-c1"&gt;not&lt;/span&gt; &lt;span class="pl-en"&gt;isinstance&lt;/span&gt;(&lt;span class="pl-s1"&gt;value&lt;/span&gt;, &lt;span class="pl-s1"&gt;str&lt;/span&gt;):
        &lt;span class="pl-k"&gt;return&lt;/span&gt;
    &lt;span class="pl-s1"&gt;value&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;value&lt;/span&gt;.&lt;span class="pl-en"&gt;strip&lt;/span&gt;()
    &lt;span class="pl-k"&gt;if&lt;/span&gt; &lt;span class="pl-s1"&gt;value&lt;/span&gt;.&lt;span class="pl-en"&gt;startswith&lt;/span&gt;(&lt;span class="pl-v"&gt;START&lt;/span&gt;) &lt;span class="pl-c1"&gt;and&lt;/span&gt; &lt;span class="pl-s1"&gt;value&lt;/span&gt;.&lt;span class="pl-en"&gt;endswith&lt;/span&gt;(&lt;span class="pl-v"&gt;END&lt;/span&gt;):
        &lt;span class="pl-s1"&gt;trimmed&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;value&lt;/span&gt;[&lt;span class="pl-en"&gt;len&lt;/span&gt;(&lt;span class="pl-v"&gt;START&lt;/span&gt;) : &lt;span class="pl-c1"&gt;-&lt;/span&gt;&lt;span class="pl-en"&gt;len&lt;/span&gt;(&lt;span class="pl-v"&gt;END&lt;/span&gt;)]
        &lt;span class="pl-s1"&gt;trimmed&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;trimmed&lt;/span&gt;.&lt;span class="pl-en"&gt;split&lt;/span&gt;(&lt;span class="pl-s"&gt;"&amp;gt;"&lt;/span&gt;, &lt;span class="pl-c1"&gt;1&lt;/span&gt;)[&lt;span class="pl-c1"&gt;1&lt;/span&gt;]
        &lt;span class="pl-c"&gt;# Replace those horrible double newlines&lt;/span&gt;
        &lt;span class="pl-s1"&gt;trimmed&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;trimmed&lt;/span&gt;.&lt;span class="pl-en"&gt;replace&lt;/span&gt;(&lt;span class="pl-s"&gt;"&amp;lt;div&amp;gt;&amp;lt;br /&amp;gt;&amp;lt;/div&amp;gt;"&lt;/span&gt;, &lt;span class="pl-s"&gt;"&amp;lt;br&amp;gt;"&lt;/span&gt;)
        &lt;span class="pl-k"&gt;return&lt;/span&gt; &lt;span class="pl-s1"&gt;jinja2&lt;/span&gt;.&lt;span class="pl-v"&gt;Markup&lt;/span&gt;(&lt;span class="pl-v"&gt;TEMPLATE&lt;/span&gt;.&lt;span class="pl-en"&gt;format&lt;/span&gt;(&lt;span class="pl-s1"&gt;trimmed&lt;/span&gt;))

&lt;span class="pl-en"&gt;@&lt;span class="pl-s1"&gt;hookimpl&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-k"&gt;def&lt;/span&gt; &lt;span class="pl-en"&gt;extra_body_script&lt;/span&gt;():
    &lt;span class="pl-k"&gt;return&lt;/span&gt; &lt;span class="pl-v"&gt;EN_MEDIA_SCRIPT&lt;/span&gt;&lt;/pre&gt;
&lt;p&gt;This uses two &lt;a href="https://docs.datasette.io/en/stable/plugin_hooks.html"&gt;Datasette plugin hooks&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;render_cell()&lt;/code&gt; is called every time Datasette's interface renders the value of a column. Here I'm looking for the &lt;code&gt;&amp;lt;en-note&amp;gt;&lt;/code&gt; and &lt;code&gt;&amp;lt;/en-note&amp;gt;&lt;/code&gt; tags and, if they are present, stripping them off and marking their inner content as safe to display using &lt;code&gt;jinja2.Markup()&lt;/code&gt; (without this they would be automatically HTML escaped).&lt;/p&gt;
&lt;p&gt;&lt;code&gt;extra_body_script()&lt;/code&gt; can be used to inject additional JavaScript at the bottom of the page. I'm injecting some particularly devious JavaScript which scans the page for Evernote's &lt;code&gt;&amp;lt;en-media&amp;gt;&lt;/code&gt; tags and, if it finds any, runs a &lt;code&gt;fetch()&lt;/code&gt; to get the base64-encoded data from Datasette and then injects that into the page as a 300px wide image element using a &lt;code&gt;data:&lt;/code&gt; URI.&lt;/p&gt;
&lt;p&gt;Like I said, this is a pretty terrifying hack! But it works just fine, and my notes are now visible inside my personal Datasette instance:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2020/evernote-datasette.png" alt="One of my notes displayed in Datasette, with an embedded image" style="max-width:100%;" /&gt;&lt;/p&gt;
&lt;h4&gt;Bonus feature: search by OCR&lt;/h4&gt;
&lt;p&gt;An interesting feature of Evernote is that it runs cloud-based OCR against images in your notes, enabling you to search across the text contained within them.&lt;/p&gt;
&lt;p&gt;It turns out the output of this OCR process is included in the export XML. It looks like this:&lt;/p&gt;
&lt;div class="highlight highlight-text-xml"&gt;&lt;pre&gt;&amp;lt;&lt;span class="pl-ent"&gt;recoIndex&lt;/span&gt;
  &lt;span class="pl-e"&gt;docType&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;unknown&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;
  &lt;span class="pl-e"&gt;objType&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;image&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;objID&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;76dd28b07797cc9f3f129c4871c5293c&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; 
  &lt;span class="pl-e"&gt;engineVersion&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;7.0.24.1&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; 
  &lt;span class="pl-e"&gt;recoType&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;service&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; 
  &lt;span class="pl-e"&gt;lang&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;en&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; 
  &lt;span class="pl-e"&gt;objWidth&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;670&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; 
  &lt;span class="pl-e"&gt;objHeight&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;128&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;&amp;gt;
  &amp;lt;&lt;span class="pl-ent"&gt;item&lt;/span&gt; &lt;span class="pl-e"&gt;x&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;26&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;y&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;52&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;w&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;81&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;h&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;29&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;&amp;gt;
    &amp;lt;&lt;span class="pl-ent"&gt;t&lt;/span&gt; &lt;span class="pl-e"&gt;w&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;76&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;&amp;gt;This&amp;lt;/&lt;span class="pl-ent"&gt;t&lt;/span&gt;&amp;gt;
  &amp;lt;/&lt;span class="pl-ent"&gt;item&lt;/span&gt;&amp;gt;
  &amp;lt;&lt;span class="pl-ent"&gt;item&lt;/span&gt; &lt;span class="pl-e"&gt;x&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;134&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;y&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;52&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;w&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;35&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;h&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;29&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;&amp;gt;
    &amp;lt;&lt;span class="pl-ent"&gt;t&lt;/span&gt; &lt;span class="pl-e"&gt;w&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;79&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;&amp;gt;is&amp;lt;/&lt;span class="pl-ent"&gt;t&lt;/span&gt;&amp;gt;
  &amp;lt;/&lt;span class="pl-ent"&gt;item&lt;/span&gt;&amp;gt;
  &amp;lt;&lt;span class="pl-ent"&gt;item&lt;/span&gt; &lt;span class="pl-e"&gt;x&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;196&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;y&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;60&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;w&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;37&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;h&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;21&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;&amp;gt;
    &amp;lt;&lt;span class="pl-ent"&gt;t&lt;/span&gt; &lt;span class="pl-e"&gt;w&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;73&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;&amp;gt;so&amp;lt;/&lt;span class="pl-ent"&gt;t&lt;/span&gt;&amp;gt;
  &amp;lt;/&lt;span class="pl-ent"&gt;item&lt;/span&gt;&amp;gt;
  &amp;lt;&lt;span class="pl-ent"&gt;item&lt;/span&gt; &lt;span class="pl-e"&gt;x&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;300&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;y&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;60&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;w&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;57&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;h&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;21&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;&amp;gt;
    &amp;lt;&lt;span class="pl-ent"&gt;t&lt;/span&gt; &lt;span class="pl-e"&gt;w&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;71&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;&amp;gt;can&amp;lt;/&lt;span class="pl-ent"&gt;t&lt;/span&gt;&amp;gt;
  &amp;lt;/&lt;span class="pl-ent"&gt;item&lt;/span&gt;&amp;gt;
  &amp;lt;&lt;span class="pl-ent"&gt;item&lt;/span&gt; &lt;span class="pl-e"&gt;x&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;382&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;y&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;54&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;w&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;79&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;h&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;27&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;&amp;gt;
    &amp;lt;&lt;span class="pl-ent"&gt;t&lt;/span&gt; &lt;span class="pl-e"&gt;w&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;77&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;&amp;gt;test&amp;lt;/&lt;span class="pl-ent"&gt;t&lt;/span&gt;&amp;gt;
  &amp;lt;/&lt;span class="pl-ent"&gt;item&lt;/span&gt;&amp;gt;
  &amp;lt;&lt;span class="pl-ent"&gt;item&lt;/span&gt; &lt;span class="pl-e"&gt;x&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;486&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;y&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;52&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;w&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;59&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;h&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;29&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;&amp;gt;
    &amp;lt;&lt;span class="pl-ent"&gt;t&lt;/span&gt; &lt;span class="pl-e"&gt;w&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;82&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;&amp;gt;the&amp;lt;/&lt;span class="pl-ent"&gt;t&lt;/span&gt;&amp;gt;
  &amp;lt;/&lt;span class="pl-ent"&gt;item&lt;/span&gt;&amp;gt;
  &amp;lt;&lt;span class="pl-ent"&gt;item&lt;/span&gt; &lt;span class="pl-e"&gt;x&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;570&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;y&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;54&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;w&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;59&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;h&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;25&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;&amp;gt;
    &amp;lt;&lt;span class="pl-ent"&gt;t&lt;/span&gt; &lt;span class="pl-e"&gt;w&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;74&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;&amp;gt;OCR&amp;lt;/&lt;span class="pl-ent"&gt;t&lt;/span&gt;&amp;gt;
    &amp;lt;&lt;span class="pl-ent"&gt;t&lt;/span&gt; &lt;span class="pl-e"&gt;w&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;33&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;&amp;gt;DeR&amp;lt;/&lt;span class="pl-ent"&gt;t&lt;/span&gt;&amp;gt;
    &amp;lt;&lt;span class="pl-ent"&gt;t&lt;/span&gt; &lt;span class="pl-e"&gt;w&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;15&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;&amp;gt;OCR!&amp;lt;/&lt;span class="pl-ent"&gt;t&lt;/span&gt;&amp;gt;
    &amp;lt;&lt;span class="pl-ent"&gt;t&lt;/span&gt; &lt;span class="pl-e"&gt;w&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;14&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;&amp;gt;OCR]&amp;lt;/&lt;span class="pl-ent"&gt;t&lt;/span&gt;&amp;gt;
  &amp;lt;/&lt;span class="pl-ent"&gt;item&lt;/span&gt;&amp;gt;
&amp;lt;/&lt;span class="pl-ent"&gt;recoIndex&lt;/span&gt;&amp;gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;As far as I can tell an &lt;code&gt;&amp;lt;item&amp;gt;&lt;/code&gt; gets multiple &lt;code&gt;&amp;lt;t&amp;gt;&lt;/code&gt; elements only if the OCR wasn't 100% sure.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;evernote-to-sqlite&lt;/code&gt; creates a searchable &lt;code&gt;ocr&lt;/code&gt; text field using values from that XML, so I can search my images in Datasette.&lt;/p&gt;
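&lt;p&gt;One way to turn that XML into searchable text is sketched below. It rests on two guesses of mine that I haven't confirmed: that the &lt;code&gt;w&lt;/code&gt; attribute is a confidence weight, and that keeping only the highest weighted candidate per &lt;code&gt;&amp;lt;item&amp;gt;&lt;/code&gt; is a sensible choice. The &lt;code&gt;ocr_text()&lt;/code&gt; helper is hypothetical and the tool itself may build its field differently.&lt;/p&gt;

```python
import xml.etree.ElementTree as ET


def ocr_text(reco_index):
    """Join the highest weighted candidate from each item into one string."""
    words = []
    for item in reco_index.iter("item"):
        candidates = item.findall("t")
        if candidates:
            # Assumption: a larger w value means the OCR engine was more confident
            best = max(candidates, key=lambda t: int(t.get("w", "0")))
            words.append(best.text or "")
    return " ".join(words)


# Rebuild two items from the sample above as (candidate text, weight) pairs
reco = ET.Element("recoIndex")
for candidates in ([("the", "82")], [("OCR", "74"), ("DeR", "33"), ("OCR!", "15")]):
    item = ET.SubElement(reco, "item")
    for text, weight in candidates:
        ET.SubElement(item, "t", w=weight).text = text
print(ocr_text(reco))
```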
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sqlite"&gt;sqlite&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/xml"&gt;xml&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/dogsheep"&gt;dogsheep&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sqlite-utils"&gt;sqlite-utils&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="projects"/><category term="sqlite"/><category term="xml"/><category term="datasette"/><category term="dogsheep"/><category term="sqlite-utils"/></entry><entry><title>evernote-to-sqlite</title><link href="https://simonwillison.net/2020/Oct/12/evernote-to-sqlite/#atom-tag" rel="alternate"/><published>2020-10-12T00:38:32+00:00</published><updated>2020-10-12T00:38:32+00:00</updated><id>https://simonwillison.net/2020/Oct/12/evernote-to-sqlite/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/dogsheep/evernote-to-sqlite"&gt;evernote-to-sqlite&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
The latest tool in my Dogsheep series of utilities for personal analytics: evernote-to-sqlite takes Evernote note exports in their ENEX XML format and loads them into a SQLite database. Embedded images are loaded into a BLOB column and the output of their cloud-based OCR system is added to a full-text search index. Notes have a latitude and longitude which means you can visualize your notes on a map using Datasette and datasette-cluster-map.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sqlite"&gt;sqlite&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/dogsheep"&gt;dogsheep&lt;/a&gt;&lt;/p&gt;



</summary><category term="projects"/><category term="sqlite"/><category term="datasette"/><category term="dogsheep"/></entry><entry><title>Weeknotes: airtable-export, generating screenshots in GitHub Actions, Dogsheep!</title><link href="https://simonwillison.net/2020/Sep/3/weeknotes-airtable-screenshots-dogsheep/#atom-tag" rel="alternate"/><published>2020-09-03T23:28:29+00:00</published><updated>2020-09-03T23:28:29+00:00</updated><id>https://simonwillison.net/2020/Sep/3/weeknotes-airtable-screenshots-dogsheep/#atom-tag</id><summary type="html">
    &lt;p&gt;This week I figured out how to populate Datasette from Airtable, wrote code to generate social media preview card page screenshots using Puppeteer, and made a big breakthrough with my Dogsheep project.&lt;/p&gt;
&lt;h4 id="weeknotes-2020-09-03-airtable-export"&gt;airtable-export&lt;/h4&gt;
&lt;p&gt;I wrote about &lt;a href="https://www.rockybeaches.com/"&gt;Rocky Beaches&lt;/a&gt; in my weeknotes &lt;a href="https://simonwillison.net/2020/Aug/21/weeknotes-rocky-beaches/"&gt;two weeks ago&lt;/a&gt;. It's a new website built by Natalie Downe that showcases great places to go rockpooling (tidepooling in American English), mixing in tide data from NOAA and species sighting data from iNaturalist.&lt;/p&gt;
&lt;p&gt;Rocky Beaches is powered by Datasette, using a GitHub Actions workflow that builds the site's underlying SQLite database using API calls and YAML data stored in the GitHub repository.&lt;/p&gt;
&lt;p&gt;Natalie wanted to use Airtable to maintain the structured data for the site, rather than hand-editing a YAML file. So I built &lt;a href="https://github.com/simonw/airtable-export"&gt;airtable-export&lt;/a&gt;, a command-line script for sucking down all of the data from an Airtable instance and writing it to disk as YAML or JSON.&lt;/p&gt;
&lt;p&gt;You run it like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;airtable-export out/ mybaseid table1 table2 --key=key
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This will create a folder called &lt;code&gt;out/&lt;/code&gt; with a &lt;code&gt;.yml&lt;/code&gt; file for each of the tables.&lt;/p&gt;
&lt;p&gt;Sadly the Airtable API doesn't yet provide a mechanism to list all of the tables in a database (a &lt;a href="https://community.airtable.com/t/list-tables-given-api-key-and-baseid/1173"&gt;long-running feature request&lt;/a&gt;) so you have to list the tables yourself.&lt;/p&gt;
&lt;p&gt;We're now &lt;a href="https://github.com/natbat/rockybeaches/blob/32a010292e7c1ba47db1a86523a61c666d977074/.github/workflows/deploy.yml#L31-L44"&gt;running that command&lt;/a&gt; as part of the Rocky Beaches build script, and committing the latest version of the YAML file back to the GitHub repo (thus gaining a &lt;a href="https://github.com/natbat/rockybeaches/commits/main/airtable"&gt;full change history&lt;/a&gt; for that data).&lt;/p&gt;
&lt;h4 id="weeknotes-2020-09-03-social-media-cards-tils"&gt;Social media cards for my TILs&lt;/h4&gt;
&lt;p&gt;I really like social media cards - &lt;code&gt;og:image&lt;/code&gt; HTML meta attributes for Facebook and &lt;code&gt;twitter:image&lt;/code&gt; for Twitter. I wanted them for articles on my &lt;a href="https://til.simonwillison.net/"&gt;TIL website&lt;/a&gt; since I often share those via Twitter.&lt;/p&gt;
&lt;p&gt;One catch: my TILs aren't very image heavy. So I decided to generate screenshots of the pages and use those as the 2x1 social media card images.&lt;/p&gt;
&lt;p&gt;The best way I know of programmatically generating screenshots is to use &lt;a href="https://developers.google.com/web/tools/puppeteer"&gt;Puppeteer&lt;/a&gt;, a Node.js library for automating a headless instance of the Chrome browser that is maintained by the Chrome DevTools team.&lt;/p&gt;
&lt;p&gt;My first attempt was to run Puppeteer in an AWS Lambda function on &lt;a href="https://vercel.com/"&gt;Vercel&lt;/a&gt;. I remembered seeing an example of how to do this in the Vercel documentation a few years ago. The example isn't there any more, but I found the &lt;a href="https://github.com/vercel/now-examples/pull/207"&gt;original pull request&lt;/a&gt; that introduced it.&lt;/p&gt;
&lt;p&gt;Since the example was MIT licensed I created my own fork at &lt;a href="https://github.com/simonw/puppeteer-screenshot"&gt;simonw/puppeteer-screenshot&lt;/a&gt; and updated it to work with the latest Chrome.&lt;/p&gt;
&lt;p&gt;It's pretty resource intensive, so I also added a secret &lt;code&gt;?key=&lt;/code&gt; mechanism so only my own automation code could call my instance running on Vercel.&lt;/p&gt;
&lt;p&gt;I needed to store the generated screenshots somewhere. They're pretty small - on the order of 60KB each - so I decided to store them in my SQLite database itself and use my &lt;a href="https://github.com/simonw/datasette-media"&gt;datasette-media&lt;/a&gt; plugin (see &lt;a href="https://simonwillison.net/2020/Jul/30/fun-binary-data-and-sqlite/"&gt;Fun with binary data and SQLite&lt;/a&gt;) to serve them up.&lt;/p&gt;
&lt;p&gt;This worked! Until it didn't... I ran into a showstopper bug when I realized that the screenshot process relies on the page being live on the site... but when a new article is added it's not yet live when the build process runs, so the generated screenshot &lt;a href="https://github.com/simonw/til/issues/23"&gt;is of the 404 page&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;So I reworked it to generate the screenshots inside the GitHub Action as part of the build script, using &lt;a href="https://github.com/JarvusInnovations/puppeteer-cli"&gt;puppeteer-cli&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;My &lt;a href="https://github.com/simonw/til/blob/3fca996228ad54ee433b25840fcd3682e9f7bbfd/generate_screenshots.py"&gt;generate_screenshots.py&lt;/a&gt; script handles this, by first shelling out to &lt;code&gt;datasette --get&lt;/code&gt; to render the HTML for the page, then running &lt;code&gt;puppeteer&lt;/code&gt; to generate the screenshot. Relevant code:&lt;/p&gt;
&lt;pre&gt;&lt;span class="pl-k"&gt;def&lt;/span&gt; &lt;span class="pl-en"&gt;png_for_path&lt;/span&gt;(&lt;span class="pl-s1"&gt;path&lt;/span&gt;):
    &lt;span class="pl-c"&gt;# Path is e.g. /til/til/python_debug-click-with-pdb.md&lt;/span&gt;
    &lt;span class="pl-s1"&gt;page_html&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-en"&gt;str&lt;/span&gt;(&lt;span class="pl-v"&gt;TMP_PATH&lt;/span&gt; &lt;span class="pl-c1"&gt;/&lt;/span&gt; &lt;span class="pl-s"&gt;"generate-screenshots-page.html"&lt;/span&gt;)
    &lt;span class="pl-c"&gt;# Use datasette to generate HTML&lt;/span&gt;
    &lt;span class="pl-s1"&gt;proc&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;subprocess&lt;/span&gt;.&lt;span class="pl-en"&gt;run&lt;/span&gt;([&lt;span class="pl-s"&gt;"datasette"&lt;/span&gt;, &lt;span class="pl-s"&gt;"."&lt;/span&gt;, &lt;span class="pl-s"&gt;"--get"&lt;/span&gt;, &lt;span class="pl-s1"&gt;path&lt;/span&gt;], &lt;span class="pl-s1"&gt;capture_output&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-c1"&gt;True&lt;/span&gt;)
    &lt;span class="pl-en"&gt;open&lt;/span&gt;(&lt;span class="pl-s1"&gt;page_html&lt;/span&gt;, &lt;span class="pl-s"&gt;"wb"&lt;/span&gt;).&lt;span class="pl-en"&gt;write&lt;/span&gt;(&lt;span class="pl-s1"&gt;proc&lt;/span&gt;.&lt;span class="pl-s1"&gt;stdout&lt;/span&gt;)
    &lt;span class="pl-c"&gt;# Now use puppeteer screenshot to generate a PNG&lt;/span&gt;
    &lt;span class="pl-s1"&gt;proc2&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;subprocess&lt;/span&gt;.&lt;span class="pl-en"&gt;run&lt;/span&gt;(
        [
            &lt;span class="pl-s"&gt;"puppeteer"&lt;/span&gt;,
            &lt;span class="pl-s"&gt;"screenshot"&lt;/span&gt;,
            &lt;span class="pl-s1"&gt;page_html&lt;/span&gt;,
            &lt;span class="pl-s"&gt;"--viewport"&lt;/span&gt;,
            &lt;span class="pl-s"&gt;"800x400"&lt;/span&gt;,
            &lt;span class="pl-s"&gt;"--full-page=false"&lt;/span&gt;,
        ],
        &lt;span class="pl-s1"&gt;capture_output&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-c1"&gt;True&lt;/span&gt;,
    )
    &lt;span class="pl-s1"&gt;png_bytes&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;proc2&lt;/span&gt;.&lt;span class="pl-s1"&gt;stdout&lt;/span&gt;
    &lt;span class="pl-k"&gt;return&lt;/span&gt; &lt;span class="pl-s1"&gt;png_bytes&lt;/span&gt;&lt;/pre&gt;
&lt;p&gt;This worked great! Except for one thing... the site is hosted on Vercel, and Vercel has a 5MB &lt;a href="https://vercel.com/docs/platform/limits#serverless-function-payload-size-limit"&gt;response size limit&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Every time my GitHub build script runs it downloads the previous SQLite database file, so it can avoid regenerating screenshots and HTML for pages that haven't changed.&lt;/p&gt;
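&lt;p&gt;One way to decide whether a page has changed is to compare a hash of its current content against the hash recorded last time. The sketch below assumes that approach; &lt;code&gt;previous_hashes&lt;/code&gt; is a hypothetical dictionary loaded from the previous database, and the real build script may detect changes differently.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import hashlib


def content_hash(text):
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


def needs_regenerating(previous_hashes, path, text):
    # previous_hashes: hypothetical dict mapping a page path to the
    # hash recorded for it in the previous database
    return previous_hashes.get(path) != content_hash(text)


previous = {"/til/til/a.md": content_hash("# Unchanged")}
print(needs_regenerating(previous, "/til/til/a.md", "# Unchanged"))  # False
print(needs_regenerating(previous, "/til/til/a.md", "# Edited"))  # True
print(needs_regenerating(previous, "/til/til/b.md", "# New"))  # True&lt;/code&gt;&lt;/pre&gt;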
&lt;p&gt;The addition of the binary screenshots drove the size of the SQLite database over 5MB, so the part of my script that retrieved the previous database &lt;a href="https://github.com/simonw/til/issues/25"&gt;no longer worked&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I needed a reliable way to store that 5MB (and probably eventually 10-50MB) database file in between runs of my action.&lt;/p&gt;
&lt;p&gt;The best place to put this would be an S3 bucket, but I find the process of setting up IAM permissions for access to a new bucket so infuriating that I couldn't bring myself to do it.&lt;/p&gt;
&lt;p&gt;So... I created a new dedicated GitHub repository, &lt;a href="https://github.com/simonw/til-db"&gt;simonw/til-db&lt;/a&gt;, and updated my action to store the binary file in that repo - using &lt;a href="https://github.com/simonw/til/blob/1e29c3fe5e90c29b0e71d87dba805484ceb4393c/.github/workflows/build.yml#L80-L86"&gt;a force push&lt;/a&gt; so the repo doesn't need to maintain unnecessary version history of the binary asset.&lt;/p&gt;
&lt;p&gt;This is an abomination of a hack, and it made me cackle a lot. I &lt;a href="https://twitter.com/simonw/status/1301029346614718465"&gt;tweeted about it&lt;/a&gt; and got the suggestion to try &lt;a href="https://git-lfs.github.com/"&gt;Git LFS&lt;/a&gt; instead, which would definitely be a more appropriate way to solve this problem.&lt;/p&gt;
&lt;h4 id="weeknotes-2020-09-03-rendering-markdown"&gt;Rendering Markdown&lt;/h4&gt;
&lt;p&gt;I write my blog entries in Markdown and transform them into HTML before I post them on my blog. Some day I'll teach my blog to render Markdown itself, but so far I've got by through copying and pasting into Markdown tools.&lt;/p&gt;
&lt;p&gt;My favourite Markdown flavour is GitHub's, which adds a bunch of useful capabilities - most notably the ability to apply syntax highlighting. GitHub &lt;a href="https://docs.github.com/en/rest/reference/markdown"&gt;expose an API&lt;/a&gt; that applies their Markdown formatter and returns the resulting HTML.&lt;/p&gt;
&lt;p&gt;I built myself &lt;a href="https://til.simonwillison.net/tools/render-markdown"&gt;a quick and scrappy tool&lt;/a&gt; in JavaScript that sends Markdown through their API and then applies a few DOM manipulations to clean up what comes back. It was a nice opportunity to write some modern vanilla JavaScript using &lt;code&gt;fetch()&lt;/code&gt;:&lt;/p&gt;
&lt;div class="highlight highlight-source-js"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;async&lt;/span&gt; &lt;span class="pl-k"&gt;function&lt;/span&gt; &lt;span class="pl-en"&gt;render&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s1"&gt;markdown&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt; &lt;span class="pl-kos"&gt;{&lt;/span&gt;
    &lt;span class="pl-k"&gt;return&lt;/span&gt; &lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-k"&gt;await&lt;/span&gt; &lt;span class="pl-en"&gt;fetch&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s"&gt;'https://api.github.com/markdown'&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-kos"&gt;{&lt;/span&gt;
        &lt;span class="pl-c1"&gt;method&lt;/span&gt;: &lt;span class="pl-s"&gt;'POST'&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt;
        &lt;span class="pl-c1"&gt;headers&lt;/span&gt;: &lt;span class="pl-kos"&gt;{&lt;/span&gt;
            &lt;span class="pl-s"&gt;'Content-Type'&lt;/span&gt;: &lt;span class="pl-s"&gt;'application/json'&lt;/span&gt;
        &lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt;
        &lt;span class="pl-c1"&gt;body&lt;/span&gt;: &lt;span class="pl-c1"&gt;JSON&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;stringify&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-kos"&gt;{&lt;/span&gt;&lt;span class="pl-s"&gt;'mode'&lt;/span&gt;: &lt;span class="pl-s"&gt;'markdown'&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-s"&gt;'text'&lt;/span&gt;: &lt;span class="pl-s1"&gt;markdown&lt;/span&gt;&lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;
    &lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;text&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
&lt;span class="pl-kos"&gt;}&lt;/span&gt;

&lt;span class="pl-k"&gt;const&lt;/span&gt; &lt;span class="pl-s1"&gt;button&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-smi"&gt;document&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;getElementsByTagName&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s"&gt;'button'&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;[&lt;/span&gt;&lt;span class="pl-c1"&gt;0&lt;/span&gt;&lt;span class="pl-kos"&gt;]&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
&lt;span class="pl-k"&gt;const&lt;/span&gt; &lt;span class="pl-s1"&gt;output&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-smi"&gt;document&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;getElementById&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s"&gt;'output'&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
&lt;span class="pl-k"&gt;const&lt;/span&gt; &lt;span class="pl-s1"&gt;preview&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-smi"&gt;document&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;getElementById&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s"&gt;'preview'&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;

&lt;span class="pl-s1"&gt;button&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;addEventListener&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s"&gt;'click'&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-k"&gt;async&lt;/span&gt; &lt;span class="pl-k"&gt;function&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt; &lt;span class="pl-kos"&gt;{&lt;/span&gt;
    &lt;span class="pl-k"&gt;const&lt;/span&gt; &lt;span class="pl-s1"&gt;rendered&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-k"&gt;await&lt;/span&gt; &lt;span class="pl-en"&gt;render&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s1"&gt;input&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;value&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
    &lt;span class="pl-s1"&gt;output&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;value&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;rendered&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
    &lt;span class="pl-s1"&gt;preview&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;innerHTML&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;rendered&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
&lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
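&lt;p&gt;The same API call can be sketched in Python using only the standard library. &lt;code&gt;build_markdown_request&lt;/code&gt; is a hypothetical helper name; the endpoint and JSON body mirror the JavaScript above.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import json
import urllib.request


def build_markdown_request(markdown):
    # Same endpoint and JSON body as the JavaScript version
    body = json.dumps({"mode": "markdown", "text": markdown}).encode("utf-8")
    return urllib.request.Request(
        "https://api.github.com/markdown",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def render(markdown):
    # Makes the network call and returns the rendered HTML as a string
    with urllib.request.urlopen(build_markdown_request(markdown)) as response:
        return response.read().decode("utf-8")&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Splitting request construction from the network call makes it possible to inspect the request without hitting the API.&lt;/p&gt;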
&lt;h4 id="weeknotes-2020-09-03-dogsheep-beta"&gt;Dogsheep Beta&lt;/h4&gt;
&lt;p&gt;My most exciting project this week was getting out the first working version of &lt;a href="https://github.com/dogsheep/beta"&gt;Dogsheep Beta&lt;/a&gt; - the search engine that ties together results from my &lt;a href="https://dogsheep.github.io/"&gt;Dogsheep&lt;/a&gt; family of tools for personal analytics.&lt;/p&gt;
&lt;p&gt;I'm giving a talk about this tonight at PyCon Australia: &lt;a href="https://2020.pycon.org.au/program/73uk8x/"&gt;Build your own data warehouse for personal analytics with SQLite and Datasette&lt;/a&gt;. I'll be writing up detailed notes in the next few days, so watch this space.&lt;/p&gt;
&lt;h4 id="weeknotes-2020-09-03-til-this-week"&gt;TIL this week&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/til/til/jq_reformatting-airtable-json.md"&gt;Converting Airtable JSON for use with sqlite-utils using jq&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/til/til/javascript_minifying-uglify-npx.md"&gt;Minifying JavaScript with npx uglify-js&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/til/til/pytest_subprocess-server.md"&gt;Start a server in a subprocess during a pytest session&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/til/til/bash_loop-over-csv.md"&gt;Looping over comma-separated values in Bash&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/til/til/cloudrun_gcloud-run-services-list.md"&gt;Using the gcloud run services list command&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/til/til/python_debug-click-with-pdb.md"&gt;Debugging a Click application using pdb&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 id="weeknotes-2020-09-03-releases-this-week"&gt;Releases this week&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/dogsheep/dogsheep-beta/releases/tag/0.4.1"&gt;dogsheep-beta 0.4.1&lt;/a&gt; - 2020-09-03&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/dogsheep/dogsheep-beta/releases/tag/0.4"&gt;dogsheep-beta 0.4&lt;/a&gt; - 2020-09-03&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/dogsheep/dogsheep-beta/releases/tag/0.4a1"&gt;dogsheep-beta 0.4a1&lt;/a&gt; - 2020-09-03&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/dogsheep/dogsheep-beta/releases/tag/0.4a0"&gt;dogsheep-beta 0.4a0&lt;/a&gt; - 2020-09-03&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/dogsheep/dogsheep-beta/releases/tag/0.3"&gt;dogsheep-beta 0.3&lt;/a&gt; - 2020-09-02&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/dogsheep/dogsheep-beta/releases/tag/0.2"&gt;dogsheep-beta 0.2&lt;/a&gt; - 2020-09-01&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/dogsheep/dogsheep-beta/releases/tag/0.1"&gt;dogsheep-beta 0.1&lt;/a&gt; - 2020-09-01&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/dogsheep/dogsheep-beta/releases/tag/0.1a2"&gt;dogsheep-beta 0.1a2&lt;/a&gt; - 2020-09-01&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/dogsheep/dogsheep-beta/releases/tag/0.1a"&gt;dogsheep-beta 0.1a&lt;/a&gt; - 2020-09-01&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/simonw/airtable-export/releases/tag/0.4"&gt;airtable-export 0.4&lt;/a&gt; - 2020-08-30&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/simonw/datasette-yaml/releases/tag/0.1a"&gt;datasette-yaml 0.1a&lt;/a&gt; - 2020-08-29&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/simonw/airtable-export/releases/tag/0.3.1"&gt;airtable-export 0.3.1&lt;/a&gt; - 2020-08-29&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/simonw/airtable-export/releases/tag/0.3"&gt;airtable-export 0.3&lt;/a&gt; - 2020-08-29&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/simonw/airtable-export/releases/tag/0.2"&gt;airtable-export 0.2&lt;/a&gt; - 2020-08-29&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/simonw/airtable-export/releases/tag/0.1.1"&gt;airtable-export 0.1.1&lt;/a&gt; - 2020-08-29&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/simonw/airtable-export/releases/tag/0.1"&gt;airtable-export 0.1&lt;/a&gt; - 2020-08-29&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/simonw/datasette/releases/tag/0.49a0"&gt;datasette 0.49a0&lt;/a&gt; - 2020-08-28&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/simonw/sqlite-utils/releases/tag/2.16.1"&gt;sqlite-utils 2.16.1&lt;/a&gt; - 2020-08-28&lt;/li&gt;
&lt;/ul&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/yaml"&gt;yaml&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/markdown"&gt;markdown&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/dogsheep"&gt;dogsheep&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/weeknotes"&gt;weeknotes&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/github-actions"&gt;github-actions&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/airtable"&gt;airtable&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/puppeteer"&gt;puppeteer&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="projects"/><category term="yaml"/><category term="markdown"/><category term="dogsheep"/><category term="weeknotes"/><category term="github-actions"/><category term="airtable"/><category term="puppeteer"/></entry><entry><title>Serving photos locally with datasette-media</title><link href="https://simonwillison.net/2020/May/26/serving-photos-locally-datasette-media/#atom-tag" rel="alternate"/><published>2020-05-26T15:53:08+00:00</published><updated>2020-05-26T15:53:08+00:00</updated><id>https://simonwillison.net/2020/May/26/serving-photos-locally-datasette-media/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/dogsheep/dogsheep-photos/tree/0.4.1#serving-photos-locally-with-datasette-media"&gt;Serving photos locally with datasette-media&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
datasette-media is a new Datasette plugin which can serve static files from disk in response to a configured SQL query that maps incoming URL parameters to a path to a file. I built it so I could run dogsheep-photos locally on my laptop and serve up thumbnails of images that match particular queries. I’ve added documentation to the dogsheep-photos README explaining how to use datasette-media, datasette-json-html and datasette-template-sql to create custom interfaces onto Apple Photos data on your machine.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/plugins"&gt;plugins&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/dogsheep"&gt;dogsheep&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/apple-photos"&gt;apple-photos&lt;/a&gt;&lt;/p&gt;



</summary><category term="plugins"/><category term="projects"/><category term="datasette"/><category term="dogsheep"/><category term="apple-photos"/></entry><entry><title>Using SQL to find my best photo of a pelican according to Apple Photos</title><link href="https://simonwillison.net/2020/May/21/dogsheep-photos/#atom-tag" rel="alternate"/><published>2020-05-21T19:16:38+00:00</published><updated>2020-05-21T19:16:38+00:00</updated><id>https://simonwillison.net/2020/May/21/dogsheep-photos/#atom-tag</id><summary type="html">
    &lt;p&gt;According to the Apple Photos internal SQLite database, this is the most aesthetically pleasing photograph I have ever taken of a pelican:&lt;/p&gt;

&lt;p&gt;&lt;img src="https://photos.simonwillison.net/i/cbfe463f1a67e37a1d36c5db44f0159ef6f86a0d64a987b129b63b52e555f1af.jpeg?w=800" alt="A pelican" style="max-width: 100%" /&gt;&lt;/p&gt;

&lt;p&gt;Here's the SQL query that found me my best ten pelican photos:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;select
  sha256,
  ext,
  uuid,
  date,
  ZOVERALLAESTHETICSCORE
from
  photos_with_apple_metadata
where
  uuid in (
    select
      uuid
    from
      labels
    where
      normalized_string = 'pelican'
  )
order by
  ZOVERALLAESTHETICSCORE desc
limit
  10&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;You can &lt;a href="https://dogsheep-photos.dogsheep.net/public?sql=select%0D%0A++json_object%28%0D%0A++++%27img_src%27%2C%0D%0A++++%27https%3A%2F%2Fphotos.simonwillison.net%2Fi%2F%27+%7C%7C+sha256+%7C%7C+%27.%27+%7C%7C+ext+%7C%7C+%27%3Fw%3D600%27%0D%0A++%29+as+photo%2C%0D%0A++sha256%2C%0D%0A++ext%2C%0D%0A++uuid%2C%0D%0A++date%2C%0D%0A++ZOVERALLAESTHETICSCORE%0D%0Afrom%0D%0A++photos_with_apple_metadata%0D%0Awhere%0D%0A++uuid+in+%28%0D%0A++++select%0D%0A++++++uuid%0D%0A++++from%0D%0A++++++labels%0D%0A++++where%0D%0A++++++normalized_string+%3D+%3Alabel%0D%0A++%29%0D%0Aorder+by%0D%0A++ZOVERALLAESTHETICSCORE+desc%0D%0Alimit%0D%0A++10&amp;amp;label=pelican"&gt;try it out here&lt;/a&gt; (with some extra &lt;a href="https://github.com/simonw/datasette-json-html/blob/master/README.md#images"&gt;datasette-json-html&lt;/a&gt; magic to display the actual photos). Or try &lt;a href="https://dogsheep-photos.dogsheep.net/public?sql=select%0D%0A++json_object%28%0D%0A++++%27img_src%27%2C%0D%0A++++%27https%3A%2F%2Fphotos.simonwillison.net%2Fi%2F%27+%7C%7C+sha256+%7C%7C+%27.%27+%7C%7C+ext+%7C%7C+%27%3Fw%3D600%27%0D%0A++%29+as+photo%2C%0D%0A++sha256%2C%0D%0A++ext%2C%0D%0A++uuid%2C%0D%0A++date%2C%0D%0A++ZOVERALLAESTHETICSCORE%0D%0Afrom%0D%0A++photos_with_apple_metadata%0D%0Awhere%0D%0A++uuid+in+%28%0D%0A++++select%0D%0A++++++uuid%0D%0A++++from%0D%0A++++++labels%0D%0A++++where%0D%0A++++++normalized_string+%3D+%3Alabel%0D%0A++%29%0D%0Aorder+by%0D%0A++ZOVERALLAESTHETICSCORE+desc%0D%0Alimit%0D%0A++10&amp;amp;label=lemur"&gt;lemur&lt;/a&gt; or &lt;a 
href="https://dogsheep-photos.dogsheep.net/public?sql=select%0D%0A++json_object%28%0D%0A++++%27img_src%27%2C%0D%0A++++%27https%3A%2F%2Fphotos.simonwillison.net%2Fi%2F%27+%7C%7C+sha256+%7C%7C+%27.%27+%7C%7C+ext+%7C%7C+%27%3Fw%3D600%27%0D%0A++%29+as+photo%2C%0D%0A++sha256%2C%0D%0A++ext%2C%0D%0A++uuid%2C%0D%0A++date%2C%0D%0A++ZOVERALLAESTHETICSCORE%0D%0Afrom%0D%0A++photos_with_apple_metadata%0D%0Awhere%0D%0A++uuid+in+%28%0D%0A++++select%0D%0A++++++uuid%0D%0A++++from%0D%0A++++++labels%0D%0A++++where%0D%0A++++++normalized_string+%3D+%3Alabel%0D%0A++%29%0D%0Aorder+by%0D%0A++ZOVERALLAESTHETICSCORE+desc%0D%0Alimit%0D%0A++10&amp;amp;label=seal"&gt;seal&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;I actually think this is my best pelican photo, but Apple Photos rated it fifth:&lt;/p&gt;

&lt;p&gt;&lt;img src="https://photos.simonwillison.net/i/a444857c4ac71ceae6af5192c8acc5ac35934ed589259136df0ed11295dbb085.jpeg?w=800" alt="A pelican" style="max-width: 100%" /&gt;&lt;/p&gt;

&lt;h3&gt;How this works&lt;/h3&gt;

&lt;p&gt;Apple Photos keeps photo metadata in a SQLite database. It runs machine learning models to identify the contents of every photo, and separate machine learning models to calculate quality scores for those photographs. All of this data lives in SQLite files on my laptop. The trick is knowing where to look.&lt;/p&gt;

&lt;p&gt;I'm not running queries directly against the Apple Photos SQLite file - it's a little hard to work with, and the label metadata is stored in a separate database file. Instead, this query runs against a combined database created by my new &lt;a href="https://github.com/dogsheep/dogsheep-photos"&gt;dogsheep-photos&lt;/a&gt; tool.&lt;/p&gt;

&lt;h3&gt;An aside: Why I love Apple Photos&lt;/h3&gt;

&lt;p&gt;The Apple Photos app - on both macOS and iOS - is in my opinion Apple's most underappreciated piece of software. In my experience most people who use it are missing some of the most valuable features. A few highlights:&lt;/p&gt;

&lt;ul&gt;&lt;li&gt;It can show you ALL of your photos on a map. On iOS go to the "Albums" tab, scroll half way down and then click on "Places" (no wonder people miss this feature!) - on macOS Photos it's the "Library -&amp;gt; Places" sidebar item.  It still baffles me that Google Photos doesn't do this (I have &lt;a href="https://twitter.com/simonw/status/1227060020694503425"&gt;conspiracy theories&lt;/a&gt; about it). This is my most common way for finding a photo I've taken - I remember where it was, then zoom in on that area of the map.&lt;/li&gt;&lt;li&gt;It runs machine learning models &lt;em&gt;on your phone&lt;/em&gt; (or laptop) to identify the subject of your photos, and makes them searchable. Try searching for "dog" and you'll see all of the photos you've taken of dogs! I love that this runs on-device: it's much less creepy than uploading your photos to the cloud in order to do this.&lt;/li&gt;&lt;li&gt;It has a really great faceted search implementation - particularly in the phone app. Try searching for "dog", then add "selfie" and the name of a city to see all of the selfies you've taken with dogs in that place!&lt;/li&gt;&lt;li&gt;It has facial recognition, again running on device, which you can use to teach it who your friends are (autocompleting against your contacts). A little bit of effort spent training this and you can see photos you've taken of specific friends in specific places and with specific animals!&lt;/li&gt;&lt;/ul&gt;

&lt;p&gt;As with most Apple software, Photos uses SQLite under the hood. The underlying database is undocumented and clearly not intended as a public API, but it exists. And I've wanted to gain access to what's in it for years.&lt;/p&gt;

&lt;h3&gt;Querying the Apple Photos SQLite database&lt;/h3&gt;

&lt;p&gt;If you run Apple Photos on a Mac (which will synchronize with your phone via iCloud) then most of your photo metadata can be found in a database file that lives here:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;~/Pictures/Photos\ Library.photoslibrary/database/Photos.sqlite&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Mine is 752MB, for around 40,000 photos. There's a lot of detailed metadata in there!&lt;/p&gt;

&lt;p&gt;Querying the database isn't straightforward. Firstly it's almost always locked by some other process - the workaround for that is to create a copy of the file. Secondly, it uses some custom undocumented Apple SQLite extensions. I've not figured out a way to load these, and without them a lot of my queries ended up throwing errors.&lt;/p&gt;
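
&lt;p&gt;The copy-then-open workaround can be sketched like this. &lt;code&gt;open_copy&lt;/code&gt; is a hypothetical helper, not part of any library. One caveat: if the database uses SQLite's write-ahead log, very recent changes may sit in sidecar files that a plain copy of the main file won't include.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import shutil
import sqlite3
import tempfile
from pathlib import Path


def open_copy(db_path):
    # Copy the (possibly locked) database into a fresh temporary
    # directory, then open the copy instead of the original
    tmp_dir = Path(tempfile.mkdtemp())
    copy_path = tmp_dir / Path(db_path).name
    shutil.copy(db_path, copy_path)
    return sqlite3.connect(str(copy_path))


# Example usage, assuming the default library location:
# photos_db = (
#     Path.home() / "Pictures" / "Photos Library.photoslibrary"
#     / "database" / "Photos.sqlite"
# )
# conn = open_copy(photos_db)&lt;/code&gt;&lt;/pre&gt;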

&lt;p&gt;&lt;a href="https://github.com/RhetTbull/osxphotos"&gt;osxphotos&lt;/a&gt; to the rescue! I ran a GitHub code search for one of the tables in that database (searching for &lt;a href="https://github.com/search?l=Python&amp;amp;q=RKPerson&amp;amp;type=Code"&gt;RKPerson in Python code&lt;/a&gt;) and was delighted to stumble across the &lt;code&gt;osxphotos&lt;/code&gt; project by Rhet Turnbull. It's a well designed and extremely actively maintained Python tool for accessing the Apple Photos database, including code to handle several iterations of the underlying database structure.&lt;/p&gt;

&lt;p&gt;Thanks to &lt;code&gt;osxphotos&lt;/code&gt; the first iteration of my own code for accessing the Apple Photos metadata was &lt;a href="https://github.com/dogsheep/dogsheep-photos/commit/b3c20e08b1a99c8898f13cc0266e1c5c012cf23c"&gt;less than 100 lines of code&lt;/a&gt;. This gave me locations, people, albums and places (human names of geographical areas) almost for free!&lt;/p&gt;

&lt;h3&gt;Quality scores&lt;/h3&gt;

&lt;p&gt;Apple Photos has a fascinating database table called &lt;code&gt;ZCOMPUTEDASSETATTRIBUTES&lt;/code&gt;, with a bewildering collection of columns. Each one is a floating point number calculated presumably by some kind of machine learning model. Here's a full list, each one linking to my public photos sorted by that score:&lt;/p&gt;

&lt;ul&gt;&lt;li&gt;&lt;a href="https://dogsheep-photos.dogsheep.net/public/photos_with_apple_metadata?_sort_desc=ZBEHAVIORALSCORE"&gt;ZBEHAVIORALSCORE&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="https://dogsheep-photos.dogsheep.net/public/photos_with_apple_metadata?_sort_desc=ZFAILURESCORE"&gt;ZFAILURESCORE&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="https://dogsheep-photos.dogsheep.net/public/photos_with_apple_metadata?_sort_desc=ZHARMONIOUSCOLORSCORE"&gt;ZHARMONIOUSCOLORSCORE&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="https://dogsheep-photos.dogsheep.net/public/photos_with_apple_metadata?_sort_desc=ZIMMERSIVENESSSCORE"&gt;ZIMMERSIVENESSSCORE&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="https://dogsheep-photos.dogsheep.net/public/photos_with_apple_metadata?_sort_desc=ZINTERACTIONSCORE"&gt;ZINTERACTIONSCORE&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="https://dogsheep-photos.dogsheep.net/public/photos_with_apple_metadata?_sort_desc=ZINTERESTINGSUBJECTSCORE"&gt;ZINTERESTINGSUBJECTSCORE&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="https://dogsheep-photos.dogsheep.net/public/photos_with_apple_metadata?_sort_desc=ZINTRUSIVEOBJECTPRESENCESCORE"&gt;ZINTRUSIVEOBJECTPRESENCESCORE&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="https://dogsheep-photos.dogsheep.net/public/photos_with_apple_metadata?_sort_desc=ZLIVELYCOLORSCORE"&gt;ZLIVELYCOLORSCORE&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="https://dogsheep-photos.dogsheep.net/public/photos_with_apple_metadata?_sort_desc=ZLOWLIGHT"&gt;ZLOWLIGHT&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="https://dogsheep-photos.dogsheep.net/public/photos_with_apple_metadata?_sort_desc=ZNOISESCORE"&gt;ZNOISESCORE&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="https://dogsheep-photos.dogsheep.net/public/photos_with_apple_metadata?_sort_desc=ZPLEASANTCAMERATILTSCORE"&gt;ZPLEASANTCAMERATILTSCORE&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a 
href="https://dogsheep-photos.dogsheep.net/public/photos_with_apple_metadata?_sort_desc=ZPLEASANTCOMPOSITIONSCORE"&gt;ZPLEASANTCOMPOSITIONSCORE&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="https://dogsheep-photos.dogsheep.net/public/photos_with_apple_metadata?_sort_desc=ZPLEASANTLIGHTINGSCORE"&gt;ZPLEASANTLIGHTINGSCORE&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="https://dogsheep-photos.dogsheep.net/public/photos_with_apple_metadata?_sort_desc=ZPLEASANTPATTERNSCORE"&gt;ZPLEASANTPATTERNSCORE&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="https://dogsheep-photos.dogsheep.net/public/photos_with_apple_metadata?_sort_desc=ZPLEASANTPERSPECTIVESCORE"&gt;ZPLEASANTPERSPECTIVESCORE&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="https://dogsheep-photos.dogsheep.net/public/photos_with_apple_metadata?_sort_desc=ZPLEASANTPOSTPROCESSINGSCORE"&gt;ZPLEASANTPOSTPROCESSINGSCORE&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="https://dogsheep-photos.dogsheep.net/public/photos_with_apple_metadata?_sort_desc=ZPLEASANTREFLECTIONSSCORE"&gt;ZPLEASANTREFLECTIONSSCORE&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="https://dogsheep-photos.dogsheep.net/public/photos_with_apple_metadata?_sort_desc=ZPLEASANTSYMMETRYSCORE"&gt;ZPLEASANTSYMMETRYSCORE&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="https://dogsheep-photos.dogsheep.net/public/photos_with_apple_metadata?_sort_desc=ZSHARPLYFOCUSEDSUBJECTSCORE"&gt;ZSHARPLYFOCUSEDSUBJECTSCORE&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="https://dogsheep-photos.dogsheep.net/public/photos_with_apple_metadata?_sort_desc=ZTASTEFULLYBLURREDSCORE"&gt;ZTASTEFULLYBLURREDSCORE&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="https://dogsheep-photos.dogsheep.net/public/photos_with_apple_metadata?_sort_desc=ZWELLCHOSENSUBJECTSCORE"&gt;ZWELLCHOSENSUBJECTSCORE&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="https://dogsheep-photos.dogsheep.net/public/photos_with_apple_metadata?_sort_desc=ZWELLFRAMEDSUBJECTSCORE"&gt;ZWELLFRAMEDSUBJECTSCORE&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a 
href="https://dogsheep-photos.dogsheep.net/public/photos_with_apple_metadata?_sort_desc=ZWELLTIMEDSHOTSCORE"&gt;ZWELLTIMEDSHOTSCORE&lt;/a&gt;&lt;/li&gt;&lt;/ul&gt;

&lt;p&gt;I'm not enormously impressed with the results I get from these. They're clearly not intended for end-user visibility, and sorting them might not even be something that makes sense.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;ZGENERICASSET&lt;/code&gt; table provides four more scores, which seem to provide much more useful results:&lt;/p&gt;

&lt;ul&gt;&lt;li&gt;&lt;a href="https://dogsheep-photos.dogsheep.net/public/photos_with_apple_metadata?_sort_desc=ZOVERALLAESTHETICSCORE"&gt;ZOVERALLAESTHETICSCORE&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="https://dogsheep-photos.dogsheep.net/public/photos_with_apple_metadata?_sort_desc=ZCURATIONSCORE"&gt;ZCURATIONSCORE&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="https://dogsheep-photos.dogsheep.net/public/photos_with_apple_metadata?_sort_desc=ZPROMOTIONSCORE"&gt;ZPROMOTIONSCORE&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="https://dogsheep-photos.dogsheep.net/public/photos_with_apple_metadata?_sort_desc=ZHIGHLIGHTVISIBILITYSCORE"&gt;ZHIGHLIGHTVISIBILITYSCORE&lt;/a&gt;&lt;/li&gt;&lt;/ul&gt;

&lt;p&gt;My guess is that these overall scores are derived from the &lt;code&gt;ZCOMPUTEDASSETATTRIBUTES&lt;/code&gt; ones. I've seen the best results from &lt;code&gt;ZOVERALLAESTHETICSCORE&lt;/code&gt;, so that's the one I used in my "show me my best photo of a pelican" query.&lt;/p&gt;

&lt;h3&gt;A note about the demo&lt;/h3&gt;

&lt;p&gt;The demo I'm running at &lt;a href="https://dogsheep-photos.dogsheep.net/public/photos_with_apple_metadata"&gt;dogsheep-photos.dogsheep.net&lt;/a&gt; currently only contains 496 photos. My private instance of this has over 40,000, but I decided to just publish a subset of that in the demo so I wouldn't have to carefully filter out private screenshots and photos with sensitive locations and suchlike. Details of how the demo works (using the &lt;code&gt;dogsheep-photos create-subset&lt;/code&gt; command to create a subset database containing just photos in my Public album) can be found &lt;a href="https://github.com/dogsheep/dogsheep-photos/issues/25"&gt;in this issue&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;Automatic labeling of photo contents&lt;/h3&gt;

&lt;p&gt;Even more impressive than the quality scores are the machine learning labels.&lt;/p&gt;

&lt;p&gt;Automatically labeling the content of a photo is surprisingly easy these days, thanks to &lt;a href="https://en.wikipedia.org/wiki/Convolutional_neural_network"&gt;convolutional neural networks&lt;/a&gt;. I wrote a bit about these in &lt;a href="https://simonwillison.net/2018/Oct/29/transfer-learning/"&gt;Automatically playing science communication games with transfer learning and fastai&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Apple download a machine learning model to your device and do the label classification there. After quite a bit of hunting (I ended up using Activity Monitor's Inspect -&amp;gt; Open Files and Ports option against the &lt;code&gt;photoanalysisd&lt;/code&gt; process) I finally figured out where the results go: the &lt;code&gt;~/Pictures/Photos\ Library.photoslibrary/database/search/psi.sqlite&lt;/code&gt; database file.&lt;/p&gt;

&lt;p&gt;(Inspecting &lt;code&gt;photoanalysisd&lt;/code&gt; also led me to the &lt;code&gt;/System/Library/Frameworks/Vision.framework/Versions/A/Resources/&lt;/code&gt; folder, which solved another mystery: where do Apple keep the models? There are &lt;a href="https://gist.github.com/simonw/6ce25981931e3c99f51f2ff0c8bcb0b1"&gt;some fascinating files&lt;/a&gt; in there.)&lt;/p&gt;

&lt;p&gt;It took &lt;a href="https://github.com/dogsheep/dogsheep-photos/issues/16"&gt;some work&lt;/a&gt; to figure out how to match those labels with their corresponding photos, mainly because the &lt;code&gt;psi.sqlite&lt;/code&gt; database stores photo UUIDs as a pair of signed integers whereas the &lt;code&gt;Photos.sqlite&lt;/code&gt; database stores a UUID string.&lt;/p&gt;
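
&lt;p&gt;As a rough illustration of that conversion, here is a minimal Python sketch. The function name &lt;code&gt;uuid_from_signed_pair&lt;/code&gt; is hypothetical, and the byte order (each integer packed little-endian, first integer first) is an assumption rather than something established above - the linked issue is the place to check what actually matches.&lt;/p&gt;

```python
import uuid


def uuid_from_signed_pair(uuid_0, uuid_1):
    # Each half is a signed 64-bit integer, so it packs into exactly 8 bytes.
    # ASSUMPTION: little-endian halves, with uuid_0 supplying the first 8 bytes.
    raw = uuid_0.to_bytes(8, "little", signed=True) + uuid_1.to_bytes(
        8, "little", signed=True
    )
    # ASSUMPTION: the string form being matched against is upper case.
    return str(uuid.UUID(bytes=raw)).upper()


print(uuid_from_signed_pair(1, -1))
```

&lt;p&gt;If the joined rows come back empty, swapping the two halves or the byte order would be the first things to try.&lt;/p&gt;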

&lt;p&gt;I'm now pulling the labels out into a separate &lt;code&gt;labels&lt;/code&gt; table. You can &lt;a href="https://dogsheep-photos.dogsheep.net/public/labels?_facet=category"&gt;browse that in the demo&lt;/a&gt; to see how it is structured. Labels belong to numeric categories - here are some of my guesses as to what those mean:&lt;/p&gt;

&lt;ul&gt;&lt;li&gt;&lt;a href="https://dogsheep-photos.dogsheep.net/public/labels?_facet=category&amp;amp;category=2024"&gt;Category 2024&lt;/a&gt; appears to be actual content labels - Seal, Water Body, Pelican etc.&lt;/li&gt;&lt;li&gt;&lt;a href="https://dogsheep-photos.dogsheep.net/public/labels?_facet=category&amp;amp;category=2027"&gt;Category 2027&lt;/a&gt; is more contextual: Entertainment, Trip, Travel, Museum, Beach Activity etc.&lt;/li&gt;&lt;li&gt;&lt;a href="https://dogsheep-photos.dogsheep.net/public/labels?_facet=category&amp;amp;category=1014"&gt;Category 1014&lt;/a&gt; is simply the month the photo was taken. &lt;a href="https://dogsheep-photos.dogsheep.net/public/labels?_facet=category&amp;amp;category=1015"&gt;1015&lt;/a&gt; is the year, and &lt;a href="https://dogsheep-photos.dogsheep.net/public/labels?_facet=category&amp;amp;category=2030"&gt;2030&lt;/a&gt; is the season.&lt;/li&gt;&lt;li&gt;&lt;a href="https://dogsheep-photos.dogsheep.net/public/labels?_facet=category&amp;amp;category=2056"&gt;Category 2056&lt;/a&gt; is the original filename.&lt;/li&gt;&lt;li&gt;&lt;a href="https://dogsheep-photos.dogsheep.net/public/labels?_facet=category&amp;amp;category=12"&gt;Category 12&lt;/a&gt; is the country the photo was taken in.&lt;/li&gt;&lt;/ul&gt;

&lt;p&gt;Here's &lt;a href="https://dogsheep-photos.dogsheep.net/public?sql=select%0D%0A++photo%2C%0D%0A++%28%0D%0A++++select%0D%0A++++++json_group_array%28%0D%0A++++++++normalized_string%0D%0A++++++%29%0D%0A++++from%0D%0A++++++labels%0D%0A++++where%0D%0A++++++labels.uuid+%3D+photos_with_apple_metadata.uuid%0D%0A++%29+as+labels%2C%0D%0A++date%2C%0D%0A++albums%2C%0D%0A++persons%2C%0D%0A++ZOVERALLAESTHETICSCORE%0D%0Afrom%0D%0A++photos_with_apple_metadata"&gt;a query&lt;/a&gt; that shows the labels (from every category) next to each photo.&lt;/p&gt;
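
&lt;p&gt;To show how labels and scores combine, here is a small self-contained sketch using Python's &lt;code&gt;sqlite3&lt;/code&gt; module. The table and column names match the ones used in the demo, but the rows are invented toy data and the &lt;code&gt;best_photos&lt;/code&gt; helper is hypothetical - it is not part of dogsheep-photos.&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Toy stand-ins for the real tables: only the columns this query needs.
conn.executescript(
    """
    create table photos_with_apple_metadata (
        uuid text primary key, photo text, ZOVERALLAESTHETICSCORE real
    );
    create table labels (uuid text, normalized_string text);
    insert into photos_with_apple_metadata values
        ('A', 'a.jpeg', 0.41), ('B', 'b.jpeg', 0.87), ('C', 'c.jpeg', 0.95);
    insert into labels values
        ('A', 'pelican'), ('B', 'pelican'), ('B', 'water body'), ('C', 'seal');
    """
)

BEST_BY_LABEL = """
select p.photo, p.ZOVERALLAESTHETICSCORE
from photos_with_apple_metadata as p
join labels on labels.uuid = p.uuid
where labels.normalized_string = :label
order by p.ZOVERALLAESTHETICSCORE desc
"""


def best_photos(label):
    # Highest aesthetic score first, restricted to photos carrying the label.
    return conn.execute(BEST_BY_LABEL, {"label": label}).fetchall()


print(best_photos("pelican"))
```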

&lt;h3&gt;Geography&lt;/h3&gt;

&lt;p&gt;Photos taken on an iPhone have embedded latitudes and longitudes... which means I can &lt;a href="https://dogsheep-photos.dogsheep.net/public/photos_on_a_map"&gt;display them on a map&lt;/a&gt;!&lt;/p&gt;

&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2020/photos-on-a-map.png" alt="My photos on a map" style="max-width: 100%" /&gt;&lt;/p&gt;

&lt;p&gt;Apple also perform reverse-geocoding on those photos, resolving them to cities, regions and countries. This is great for faceted browse: here are my photos &lt;a href="https://dogsheep-photos.dogsheep.net/public/photos_with_apple_metadata?_facet=place_state_province&amp;amp;_facet=place_country&amp;amp;_facet=place_city"&gt;faceted by country, city and state/province&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;Hosting and serving the images&lt;/h3&gt;

&lt;p&gt;My least favourite thing about Apple Photos is how hard it is to get images from it onto the internet. If you enable iCloud sharing your images are accessible through &lt;a href="https://www.icloud.com/"&gt;icloud.com&lt;/a&gt; - but they aren't given publicly accessible URLs, so you can't embed them in blog entries or do other webby things with them.&lt;/p&gt;

&lt;p&gt;I also really want to "own" my images. I want them in a place that I control.&lt;/p&gt;

&lt;p&gt;Amazon S3 is ideal for image storage. It's incredibly inexpensive and essentially infinite.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;dogsheep-photos upload&lt;/code&gt; command takes ANY directory as input, scans through that directory for image files and then uploads them to the configured S3 bucket.&lt;/p&gt;

&lt;p&gt;I designed this to work independently of Apple Photos, mainly to preserve my ability to switch to alternative image solutions in the future.&lt;/p&gt;

&lt;p&gt;I'm using the &lt;a href="https://en.wikipedia.org/wiki/Content-addressable_storage"&gt;content addressable storage&lt;/a&gt; pattern to store the images. Their filename is the sha256 hash of the file contents. The idea is that since sensible photo management software leaves the original files unmodified I should be able to de-duplicate my photo files no matter where they are from and store everything in the one bucket.&lt;/p&gt;
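
&lt;p&gt;A minimal sketch of that naming scheme, assuming nothing beyond the Python standard library. The &lt;code&gt;content_key&lt;/code&gt; function is hypothetical (it is not the code in dogsheep-photos), and keeping a lower-cased file extension on the end of the hash is my own assumption for the example.&lt;/p&gt;

```python
import hashlib
import pathlib
import tempfile


def content_key(path, chunk_size=1024 * 1024):
    path = pathlib.Path(path)
    digest = hashlib.sha256()
    with path.open("rb") as f:
        # Read in chunks so large originals never have to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    # ASSUMPTION: the key is the hex digest plus the lower-cased extension.
    return digest.hexdigest() + path.suffix.lower()


# Two files with identical bytes get the same key, whatever they are called.
with tempfile.TemporaryDirectory() as tmp:
    first = pathlib.Path(tmp) / "IMG_0001.HEIC"
    second = pathlib.Path(tmp) / "copy-from-backup.heic"
    first.write_bytes(b"same bytes")
    second.write_bytes(b"same bytes")
    key_first = content_key(first)
    key_second = content_key(second)

print(key_first == key_second)
```

&lt;p&gt;Because the key depends only on the bytes, uploading the same original twice targets the same object, which is what makes the de-duplication fall out for free.&lt;/p&gt;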

&lt;p&gt;Original image files come with privacy concerns: they embed accurate latitude and longitude data in the EXIF data, so they can be used to reconstruct your exact location history and even figure out your address. This is why systems like Google Photos &lt;a href="https://issuetracker.google.com/issues/80379228"&gt;make it difficult&lt;/a&gt; to export images with location data intact.&lt;/p&gt;

&lt;p&gt;I've addressed this by making the content in my S3 bucket private. Access to the images takes place through &lt;a href="https://github.com/simonw/s3-image-proxy"&gt;s3-image-proxy&lt;/a&gt; - a proxy server I wrote and deployed on &lt;a href="https://vercel.com/"&gt;Vercel&lt;/a&gt; (previously Zeit Now). The proxy strips EXIF data and can optionally resize images based on querystring parameters. It also serves them with far-future cache expire headers, which means they sit in Vercel's CDN cache rather than being resized every time they are accessed.&lt;/p&gt;

&lt;p&gt;iPhones default to saving photos in HEIC format, which fails to display using the &lt;code&gt;&amp;lt;img src=""&amp;gt;&lt;/code&gt; tag in the browsers I tested. The proxy uses &lt;a href="https://pypi.org/project/pyheif/"&gt;pyheif&lt;/a&gt; to convert those into JPEGs.&lt;/p&gt;

&lt;p&gt;Here's an example HEIC image, resized by the proxy and converted to JPEG:
&lt;a href="https://photos.simonwillison.net/i/59854a70f125154cdf8dad89a4c730e6afde06466d4a6de24689439539c2d863.heic?w=600"&gt;https://photos.simonwillison.net/i/59854a70f125154cdf8dad89a4c730e6afde06466d4a6de24689439539c2d863.heic?w=600&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;Next steps&lt;/h3&gt;

&lt;p&gt;This project is a little daunting in that there are so many possibilities for where to take it next!&lt;/p&gt;

&lt;p&gt;In the short term:&lt;/p&gt;

&lt;ul&gt;&lt;li&gt;&lt;a href="https://github.com/dogsheep/dogsheep-photos/issues/3"&gt;Import EXIF data&lt;/a&gt; from the images into a table. The Apple Photos tables give me some of this already (particularly GPS data) but I want things like ISO, aperture, what lens I used.&lt;/li&gt;&lt;li&gt;Load the labels into SQLite full-text search.&lt;/li&gt;&lt;li&gt;I'd like other people to be able to play with this easily. Getting it all up and running right now is a fair amount of work - I think I can improve this with usability improvements and better documentation.&lt;/li&gt;&lt;li&gt;The system only handles static images at the moment. I'd like to &lt;a href="https://github.com/dogsheep/dogsheep-photos/issues/13"&gt;get my movies&lt;/a&gt; and more importantly my live photos in there as well.&lt;/li&gt;&lt;/ul&gt;

&lt;p&gt;And in the longer term:&lt;/p&gt;

&lt;ul&gt;&lt;li&gt;Only iPhone photos have location data at the moment - I'd like to derive approximate latitude/longitude points for my DSLR images by matching against images from my phone based on date.&lt;/li&gt;&lt;li&gt;Running my photos through other computer vision systems like Google's Cloud Vision APIs &lt;a href="https://github.com/dogsheep/dogsheep-photos/issues/14"&gt;could be really interesting&lt;/a&gt;.&lt;/li&gt;&lt;li&gt;For better spotting of duplicate images I'm interested in exploring &lt;a href="https://github.com/dogsheep/dogsheep-photos/issues/7"&gt;image content hashing&lt;/a&gt;.&lt;/li&gt;&lt;li&gt;The UI for all of this right now is just regular Datasette. Building a custom UI (running against the Datasette JSON API) could be a lot of fun.&lt;/li&gt;&lt;/ul&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/photography"&gt;photography&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/photos"&gt;photos&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sql"&gt;sql&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sqlite"&gt;sqlite&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/dogsheep"&gt;dogsheep&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/weeknotes"&gt;weeknotes&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/apple-photos"&gt;apple-photos&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="photography"/><category term="photos"/><category term="projects"/><category term="sql"/><category term="sqlite"/><category term="datasette"/><category term="dogsheep"/><category term="weeknotes"/><category term="apple-photos"/></entry><entry><title>Weeknotes: Datasette 0.41, photos breakthroughs</title><link href="https://simonwillison.net/2020/May/7/weeknotes/#atom-tag" rel="alternate"/><published>2020-05-07T05:02:51+00:00</published><updated>2020-05-07T05:02:51+00:00</updated><id>https://simonwillison.net/2020/May/7/weeknotes/#atom-tag</id><summary type="html">
    &lt;p&gt;Shorter weeknotes this week, because my main project for the week warrants a detailed write-up on its own (coming soon... update 21st May &lt;a href="https://simonwillison.net/2020/May/21/dogsheep-photos/"&gt;here it is&lt;/a&gt;).&lt;/p&gt;

&lt;h3&gt;Datasette 0.41&lt;/h3&gt;

&lt;p&gt;I released &lt;a href="https://datasette.readthedocs.io/en/stable/changelog.html#v0-41"&gt;Datasette 0.41&lt;/a&gt; this morning. Highlights include:&lt;/p&gt;

&lt;ul&gt;&lt;li&gt;Custom pages and configuration directory mode, both &lt;a href="https://simonwillison.net/2020/Apr/29/weeknotes/#new-datasette-features"&gt;described here last week&lt;/a&gt;&lt;/li&gt;&lt;li&gt;A new NOT LIKE table filter expression, &lt;code&gt;?colname__notlike=X&lt;/code&gt; - &lt;a href="https://github.com/simonw/datasette/issues/750"&gt;#750&lt;/a&gt;&lt;/li&gt;&lt;li&gt;A CSS pattern portfolio! I've been meaning to ship one of these for over two years now - you can see it at &lt;a href="https://latest.datasette.io/-/patterns"&gt;latest.datasette.io/-/patterns&lt;/a&gt; - &lt;a href="https://github.com/simonw/datasette/issues/151"&gt;#151&lt;/a&gt;&lt;/li&gt;&lt;li&gt;Various minor fixes and documentation improvements&lt;/li&gt;&lt;li&gt;PRAGMA function support - &lt;a href="https://github.com/simonw/datasette/issues/761"&gt;#761&lt;/a&gt;&lt;/li&gt;&lt;/ul&gt;

&lt;p&gt;A late addition to the release - thought of and implemented just this morning, as the result of &lt;a href="https://github.com/simonw/til/blob/master/sqlite/list-all-columns-in-a-database.md"&gt;this TIL&lt;/a&gt;, is the ability to use various &lt;a href="https://www.sqlite.org/pragma.html#pragfunc"&gt;SQLite PRAGMA functions&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Datasette blocks user-provided SQL queries containing the term &lt;code&gt;PRAGMA&lt;/code&gt; because it can be used to dynamically update various database configuration settings, and Datasette likes to keep queries read-only.&lt;/p&gt;

&lt;p&gt;But... SQLite provides access to a number of useful introspection capabilities through a collection of PRAGMA functions. Things like this:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;select * from pragma_foreign_key_list('complex_foreign_keys')&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;These are harmless - the SQLite documentation specifically states that "Table-valued functions exist only for PRAGMAs that return results and that have no side-effects".&lt;/p&gt;

&lt;p&gt;So I enabled them using a negative lookahead assertion in the regular expression that &lt;a href="https://github.com/simonw/datasette/blob/0784f2ef9d3ff6dd9df05f54cb51de29a6d11764/datasette/utils/__init__.py#L174-L193"&gt;prevents PRAGMA from being used&lt;/a&gt;:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;allowed_pragmas = (
    "database_list",
    "foreign_key_list",
    "function_list",
    "index_info",
    "index_list",
    "index_xinfo",
    "page_count",
    "max_page_count",
    "page_size",
    "schema_version",
    "table_info",
    "table_xinfo",
)
disallawed_sql_res = [
    (
        re.compile("pragma(?!_({}))".format("|".join(allowed_pragmas))),
        "Statement may not contain PRAGMA",
    )
]&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The regular expression here is dynamically composed from the list of allowed PRAGMA functions - it looks something like this:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;re.compile("pragma(?!_(database_list|foreign_key_list|...))")&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;In Python regular expressions &lt;code&gt;pragma(?!_foo)&lt;/code&gt; means "pragma but only if not followed by &lt;code&gt;_foo&lt;/code&gt;".&lt;/p&gt;
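
&lt;p&gt;Here's a small runnable sketch of how that lookahead behaves. It uses a shortened allow-list, and the &lt;code&gt;is_blocked&lt;/code&gt; helper (including its choice to lower-case the SQL before matching) is a hypothetical stand-in for illustration, not Datasette's actual validation function.&lt;/p&gt;

```python
import re

# Shortened allow-list, purely for illustration.
allowed_pragmas = ("foreign_key_list", "table_info")
pragma_re = re.compile("pragma(?!_({}))".format("|".join(allowed_pragmas)))


def is_blocked(sql):
    # ASSUMPTION: SQL is lower-cased first so PRAGMA and pragma match alike.
    return pragma_re.search(sql.lower()) is not None


# A bare PRAGMA statement is rejected.
print(is_blocked("PRAGMA journal_mode = wal"))
# An allow-listed table-valued function gets through.
print(is_blocked("select * from pragma_table_info('commits')"))
# A PRAGMA function that is not on the list is still rejected.
print(is_blocked("select * from pragma_compile_options()"))
```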

&lt;h4&gt;Plotting foreign key relationships with Observable and D3&lt;/h4&gt;

&lt;p&gt;With PRAGMA functions support shipped in Datasette 0.41 it's now possible to make use of them in queries. Here's an example query which finds all foreign key relationships in a database:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;select
    sqlite_master.name as table_from,
    fk_info.[from] as column_from,
    fk_info.[table] as table_to,
    fk_info.[to] as column_to
from
    sqlite_master
join
    pragma_foreign_key_list(sqlite_master.name) as fk_info
order by
    sqlite_master.name&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Here's that query &lt;a href="https://github-to-sqlite.dogsheep.net/github?sql=select%0D%0A++sqlite_master.name+as+table_from%2C%0D%0A++fk_info.%5Bfrom%5D+as+column_from%2C%0D%0A++fk_info.%5Btable%5D+as+table_to%2C%0D%0A++fk_info.%5Bto%5D+as+column_to%0D%0Afrom%0D%0A++sqlite_master%0D%0A++join+pragma_foreign_key_list%28sqlite_master.name%29+as+fk_info%0D%0Aorder+by%0D%0A++sqlite_master.name"&gt;run against my github-to-sqlite demo database&lt;/a&gt;. Exported as JSON, the results look like this:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;[
    {
        "table_from": "commits",
        "column_from": "committer",
        "table_to": "users",
        "column_to": "id"
    },
    {
        "table_from": "commits",
        "column_from": "author",
        "table_to": "users",
        "column_to": "id"
    }
    ...
]&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;I used this to put together &lt;a href="https://observablehq.com/@simonw/datasette-table-diagram"&gt;an Observable Notebook&lt;/a&gt; which takes this data (from any public Datasette instance if you provide it with a URL) and renders those relationships as a force-directed graph using D3.&lt;/p&gt;

&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2020/observable-foreign-keys.png" style="max-width: 100%" alt="Force-directed D3 graph of foreign keys" /&gt;&lt;/p&gt;
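
&lt;p&gt;The foreign key query from this section can also be tried locally without Datasette. This sketch builds a tiny in-memory database (the two-table schema is invented for the example, loosely echoing the JSON above) and runs the same SQL through Python's &lt;code&gt;sqlite3&lt;/code&gt; module. It assumes a SQLite build recent enough to support PRAGMA table-valued functions.&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: two foreign keys from commits to users.
conn.executescript(
    """
    create table users (id integer primary key, login text);
    create table commits (
        sha text primary key,
        author integer references users(id),
        committer integer references users(id)
    );
    """
)

FOREIGN_KEYS_SQL = """
select
    sqlite_master.name as table_from,
    fk_info.[from] as column_from,
    fk_info.[table] as table_to,
    fk_info.[to] as column_to
from
    sqlite_master
join
    pragma_foreign_key_list(sqlite_master.name) as fk_info
order by
    sqlite_master.name
"""

# Each row is (table_from, column_from, table_to, column_to).
foreign_keys = conn.execute(FOREIGN_KEYS_SQL).fetchall()
for row in foreign_keys:
    print(row)
```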

&lt;h3&gt;Apple Photos breakthrough&lt;/h3&gt;

&lt;p&gt;I'll write about this more in a separate post, but I had a &lt;em&gt;huge&lt;/em&gt; breakthrough with my &lt;a href="https://github.com/dogsheep/photos-to-sqlite"&gt;photos-to-sqlite&lt;/a&gt; project this week.&lt;/p&gt;

&lt;p&gt;The goal of the project is to give me SQL access to the photos I've taken. I use Apple Photos for this, and my holy grail has always been the metadata that Apple generate around my photos using machine learning.&lt;/p&gt;

&lt;p&gt;Rhet Turnbull's fantastic &lt;a href="https://github.com/RhetTbull/osxphotos"&gt;osxphotos&lt;/a&gt; project gave me my first breakthrough, allowing me to easily start importing albums, people (from on-device facial recognition) and places.&lt;/p&gt;

&lt;p&gt;Apple run sophisticated label classification against my photos &lt;em&gt;on device&lt;/em&gt; - so if I search for "dog" they show me all of the photos I've taken of dogs. I was certain that this information was in a SQLite database somewhere, and this week I finally figured out where: the &lt;code&gt;~/Pictures/Photos\ Library.photoslibrary/database/search/psi.sqlite&lt;/code&gt; file.&lt;/p&gt;

&lt;p&gt;After a &lt;a href="https://github.com/dogsheep/photos-to-sqlite/issues/16"&gt;whole lot of digging around&lt;/a&gt; I worked out how to join it against my photos database, and as a result I can now run SQL queries that select all of my photos of dogs! Or seals! Or pelicans!&lt;/p&gt;

&lt;p&gt;Combine that with some &lt;a href="https://github.com/dogsheep/photos-to-sqlite/issues/15"&gt;intriguing automated photo quality scores&lt;/a&gt; I found (&lt;code&gt;order by ZPLEASANTCOMPOSITIONSCORE&lt;/code&gt; is a thing) and my Dogsheep photos solution is beginning to take shape. I'll write it up in more detail over the next few days.&lt;/p&gt;

&lt;h3&gt;Other projects&lt;/h3&gt;

&lt;ul&gt;&lt;li&gt;&lt;a href="https://github.com/simonw/conditional-get/releases/tag/0.2a"&gt;conditional-get 0.2a&lt;/a&gt; now streams large downloads to disk and shows a progress bar in &lt;code&gt;-v&lt;/code&gt; mode.&lt;/li&gt;&lt;li&gt;&lt;a href="https://github.com/simonw/datasette-atom/releases/tag/0.5"&gt;datasette-atom 0.5&lt;/a&gt; adds support for the Atom author element, so you can add author information to feed entries.&lt;/li&gt;&lt;li&gt;&lt;a href="https://github.com/simonw/sqlite-utils"&gt;sqlite-utils&lt;/a&gt; went through versions 2.7, 2.7.1, 2.7.2 and 2.8 - see &lt;a href="https://sqlite-utils.readthedocs.io/en/stable/changelog.html#v2-8"&gt;release notes&lt;/a&gt;.&lt;/li&gt;&lt;li&gt;&lt;a href="https://github.com/dogsheep/github-to-sqlite/releases"&gt;github-to-sqlite&lt;/a&gt; 2.1 and 2.2 introduced two new commands:&lt;ul&gt;&lt;li&gt;&lt;code&gt;github-to-sqlite scrape-dependents&lt;/code&gt; scrapes the GitHub &lt;a href="https://github.com/simonw/datasette/network/dependents"&gt;dependents page&lt;/a&gt; (which isn't yet in an API) to gather details of repos that depend on your repos&lt;/li&gt;&lt;li&gt;&lt;code&gt;github-to-sqlite stargazers&lt;/code&gt; fetches the users who have starred a specific list of repositories&lt;/li&gt;&lt;li&gt;I also added several new SQL views for better exploring the retrieved data&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;/ul&gt;

&lt;p&gt;My &lt;a href="https://github-to-sqlite.dogsheep.net/"&gt;github-to-sqlite demo&lt;/a&gt; is updated daily via a &lt;a href="https://github.com/dogsheep/github-to-sqlite/blob/master/.github/workflows/deploy-demo.yml"&gt;scheduled GitHub Action&lt;/a&gt; and retrieves all kinds of data about my various &lt;a href="https://github.com/dogsheep"&gt;Dogsheep&lt;/a&gt; projects plus sqlite-utils and Datasette. Being able to see things like the &lt;a href="https://github-to-sqlite.dogsheep.net/github/dependent_repos"&gt;most recent dependent repos&lt;/a&gt; across my projects in one place is &lt;em&gt;extremely&lt;/em&gt; useful.&lt;/p&gt;

&lt;h3&gt;TIL this week&lt;/h3&gt;

&lt;ul&gt;&lt;li&gt;&lt;a href="https://github.com/simonw/til/blob/master/node/constant-time-compare-strings.md"&gt;Constant-time comparison of strings in Node&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="https://github.com/simonw/til/blob/master/sqlite/list-all-columns-in-a-database.md"&gt;List all columns in a SQLite database&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="https://github.com/simonw/til/blob/master/sqlite/compile-sqlite3-ubuntu.md"&gt;Compile a new sqlite3 binary on Ubuntu&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="https://github.com/simonw/til/blob/master/firefox/search-across-all-resources.md"&gt;Search across all loaded resources in Firefox&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="https://github.com/simonw/til/blob/master/github/dependencies-graphql-api.md"&gt;Accessing repository dependencies in the GitHub GraphQL API&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="https://github.com/simonw/til/blob/master/python/platform-specific-dependencies.md"&gt;Use setup.py to install platform-specific dependencies&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="https://github.com/simonw/til/blob/master/python/installing-upgrading-plugins-with-pipx.md"&gt;Installing and upgrading Datasette plugins with pipx&lt;/a&gt;&lt;/li&gt;&lt;/ul&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sqlite"&gt;sqlite&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/observable"&gt;observable&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/dogsheep"&gt;dogsheep&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/weeknotes"&gt;weeknotes&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/apple-photos"&gt;apple-photos&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="projects"/><category term="sqlite"/><category term="datasette"/><category term="observable"/><category term="dogsheep"/><category term="weeknotes"/><category term="apple-photos"/></entry><entry><title>github-to-sqlite 2.2 highlights thread</title><link href="https://simonwillison.net/2020/May/2/github-sqlite-22-highlights-thread/#atom-tag" rel="alternate"/><published>2020-05-02T22:16:16+00:00</published><updated>2020-05-02T22:16:16+00:00</updated><id>https://simonwillison.net/2020/May/2/github-sqlite-22-highlights-thread/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://twitter.com/simonw/status/1256699210000289792"&gt;github-to-sqlite 2.2 highlights thread&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
I released github-to-sqlite 2.2 today with a new “stargazers” command for importing users who have starred one or more specific repositories. This Twitter thread lists highlights of recent releases and links to a live Datasette demo that shows what the tool can do.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/github"&gt;github&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/dogsheep"&gt;dogsheep&lt;/a&gt;&lt;/p&gt;



</summary><category term="github"/><category term="projects"/><category term="datasette"/><category term="dogsheep"/></entry><entry><title>Weeknotes: Datasette 0.40, various projects, Dogsheep photos</title><link href="https://simonwillison.net/2020/Apr/22/weeknotes/#atom-tag" rel="alternate"/><published>2020-04-22T23:09:10+00:00</published><updated>2020-04-22T23:09:10+00:00</updated><id>https://simonwillison.net/2020/Apr/22/weeknotes/#atom-tag</id><summary type="html">
    &lt;p&gt;A new release of Datasette, two new projects and progress towards a Dogsheep photos solution.&lt;/p&gt;

&lt;h3 id="datasette-0-40"&gt;Datasette 0.40&lt;/h3&gt;

&lt;p&gt;I released Datasette 0.40 last night. Full &lt;a href="https://datasette.readthedocs.io/en/stable/changelog.html#v0-40"&gt;release notes are here&lt;/a&gt;, but the key feature in this release is the ability to provide metadata in a &lt;code&gt;metadata.yaml&lt;/code&gt; file as an alternative to &lt;code&gt;metadata.json&lt;/code&gt;. This is particularly useful for embedded multi-line SQL queries: I've &lt;a href="https://github.com/simonw/museums/commit/81f7735bf8b3d8e98217bb5e2383d565cfa17799"&gt;upgraded simonw/museums&lt;/a&gt; and &lt;a href="https://github.com/simonw/til/commit/e55f0d3221a0838354c893507559d67727999bfd"&gt;simonw/til&lt;/a&gt; to take advantage of this, since they both use their metadata to define SQL queries that power their search pages and Atom feeds.&lt;/p&gt;

&lt;h3 id="jsk-fellows-directory"&gt;A JSK fellows directory and twitter-to-sqlite 0.21&lt;/h3&gt;

&lt;p&gt;My &lt;a href="https://jsk.stanford.edu/"&gt;JSK Fellowship&lt;/a&gt; at Stanford ends in a few months. JSK has extremely talented and influential &lt;a href="https://jsk.stanford.edu/fellows/alumni/"&gt;alumni&lt;/a&gt;, and one of the benefits of the fellowship is becoming part of that network afterwards.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://twitter.com/JSKstanford"&gt;@JSKStanford&lt;/a&gt; Twitter account maintains &lt;a href="https://twitter.com/JSKstanford/lists"&gt;lists&lt;/a&gt; of fellows on Twitter - journalists love Twitter! - so I decided to use my &lt;a href="https://github.com/dogsheep/twitter-to-sqlite"&gt;twitter-to-sqlite&lt;/a&gt; tool to build a Datasette-powered search engine of them.&lt;/p&gt;

&lt;p&gt;That search engine is now running at &lt;a href="https://jsk-fellows.datasettes.com/"&gt;jsk-fellows.datasettes.com&lt;/a&gt;. It's updated daily by &lt;a href="https://github.com/simonw/jsk-fellows-on-twitter/blob/26e0868fede4c26d0ae24faea20ee67d5ae13bab/.github/workflows/scheduled.yml"&gt;a GitHub Action&lt;/a&gt; to capture any bio changes or new list entrants.&lt;/p&gt;

&lt;p&gt;It's a neat example of taking advantage of SQLite views to build faceted search across a subset of data. A script &lt;a href="https://github.com/simonw/jsk-fellows-on-twitter/blob/26e0868fede4c26d0ae24faea20ee67d5ae13bab/create-view.sh"&gt;constructs the jsk_fellows view&lt;/a&gt; at build time, then &lt;a href="https://github.com/simonw/jsk-fellows-on-twitter/blob/26e0868fede4c26d0ae24faea20ee67d5ae13bab/metadata.json"&gt;metadata.json&lt;/a&gt; configures that view to &lt;a href="https://datasette.readthedocs.io/en/stable/full_text_search.html#configuring-full-text-search-for-a-table-or-view"&gt;run full-text search&lt;/a&gt; and facet by the derived fellowship column.&lt;/p&gt;

&lt;p&gt;I shipped &lt;a href="https://github.com/dogsheep/twitter-to-sqlite/releases/tag/0.21"&gt;twitter-to-sqlite 0.21&lt;/a&gt; with a new &lt;code&gt;twitter-to-sqlite lists username&lt;/code&gt; command as part of this project.&lt;/p&gt;

&lt;h3 id="tils-datasette-template-sql"&gt;TILs and datasette-template-sql 1.0&lt;/h3&gt;

&lt;p&gt;I described &lt;a href="https://simonwillison.net/2020/Apr/20/self-rewriting-readme/"&gt;my new TILs project&lt;/a&gt; on Monday. I've published 15 so far - the format is working really well for me.&lt;/p&gt;

&lt;p&gt;Hacking on &lt;a href="https://github.com/simonw/tils"&gt;simonw/tils&lt;/a&gt; reminded me of a feature gap in my &lt;a href="https://github.com/simonw/datasette-template-sql"&gt;datasette-template-sql&lt;/a&gt; plugin: it didn't have a solution for safely escaping parameters in SQL queries, leading to nasty string concatenated SQL queries.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/simonw/datasette-template-sql/releases/tag/1.0"&gt;datasette-template-sql 1.0&lt;/a&gt; fixes that issue, at the cost of backwards compatibility with previous releases. I'm using it for both &lt;a href="https://github.com/simonw/til/commit/f8642850c617c3e58327d0109a20d58490d6f098"&gt;til&lt;/a&gt; and &lt;a href="https://github.com/simonw/museums/commit/b08640bd4c69fa10a3bd3ea2cb291cfb764f5617"&gt;museums&lt;/a&gt; now.&lt;/p&gt;
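
&lt;p&gt;For anyone unfamiliar with the underlying idea, here is a sketch of parameterized queries using Python's built-in &lt;code&gt;sqlite3&lt;/code&gt; module. This is not the plugin's template API - the table, the data and the &lt;code&gt;titles_for_topic&lt;/code&gt; helper are all made up for the example - it just shows why passing values separately beats string concatenation.&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table til (topic text, title text)")
conn.executemany(
    "insert into til values (?, ?)",
    [("sqlite", "List all columns"), ("python", "pipx tips")],
)


def titles_for_topic(topic):
    # The value travels separately from the SQL text, so any quotes in it
    # are treated as data rather than as part of the query.
    rows = conn.execute(
        "select title from til where topic = :topic order by title",
        {"topic": topic},
    )
    return [row[0] for row in rows]


print(titles_for_topic("sqlite"))
# An injection attempt is compared as a literal string and matches nothing.
print(titles_for_topic("sqlite' or '1'='1"))
```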


&lt;h3 id="github-to-sqlite-2-0"&gt;github-to-sqlite 2.0&lt;/h3&gt;

&lt;p&gt;I released &lt;a href="https://github.com/dogsheep/github-to-sqlite/releases/tag/2.0"&gt;github-to-sqlite 2.0&lt;/a&gt; with a small backwards incompatible change to the database schema (hence the major version increment). It builds on 1.1 from a few days ago which added a new &lt;code&gt;github-to-sqlite contributors&lt;/code&gt; command for fetching statistics on contributors to repositories.&lt;/p&gt;

&lt;p&gt;More importantly, I improved the live demo running at &lt;a href="https://github-to-sqlite.dogsheep.net/"&gt;github-to-sqlite.dogsheep.net&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The demo now updates &lt;a href="https://github.com/dogsheep/github-to-sqlite/blob/master/.github/workflows/deploy-demo.yml"&gt;once a day&lt;/a&gt; using GitHub Actions and pulls in releases, commits, issues, issue comments and contributors for all of my &lt;a href="https://github.com/dogsheep"&gt;Dogsheep&lt;/a&gt; projects plus &lt;a href="https://github.com/simonw/datasette"&gt;datasette&lt;/a&gt; and &lt;a href="https://github.com/simonw/sqlite-utils"&gt;sqlite-utils&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;This means I can browse and execute SQL queries across &lt;a href="https://github-to-sqlite.dogsheep.net/github/issues"&gt;929 issues&lt;/a&gt;, &lt;a href="https://github-to-sqlite.dogsheep.net/github/commits"&gt;1,505 commits&lt;/a&gt; and &lt;a href="https://github-to-sqlite.dogsheep.net/github/releases"&gt;132 releases&lt;/a&gt; across &lt;a href="https://github-to-sqlite.dogsheep.net/github/repos"&gt;14 repositories&lt;/a&gt;!&lt;/p&gt;

&lt;p&gt;Want to see which of my projects have had the most releases? &lt;a href="https://github-to-sqlite.dogsheep.net/github/releases?_facet=repo"&gt;Facet releases by repo&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;I've also installed &lt;a href="https://simonwillison.net/2020/Mar/9/datasette-search-all/"&gt;the datasette-search-all plugin&lt;/a&gt; there, so you can search across all commits, releases, issues etc &lt;a href="https://github-to-sqlite.dogsheep.net/-/search?q=zeit+now"&gt;for "zeit now"&lt;/a&gt; for example.&lt;/p&gt;

&lt;p&gt;Bringing all of my different project data together in one place like this is really powerful.&lt;/p&gt;

&lt;p&gt;I think it's a great illustration of the Datasette/Dogsheep philosophy of pulling down a complete SQLite-powered copy of data from external services so you can query and join across your data without being limited to the functionality that those services provide through their own interfaces or APIs.&lt;/p&gt;

&lt;h3 id="photos-to-sqlite-alpha"&gt;photos-to-sqlite alpha&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://dogsheep.github.io/"&gt;Dogsheep&lt;/a&gt; is about bringing all of my interesting personal and social data into a single, private place.&lt;/p&gt;

&lt;p&gt;The biggest thing missing at the moment is photos. I want to be able to query my photos with SQL, and eventually combine them with tweets, checkins etc in a unified timeline.&lt;/p&gt;

&lt;p&gt;Last week I took a step towards this goal with &lt;a href="https://github.com/simonw/heic-to-jpeg"&gt;heic-to-jpeg&lt;/a&gt;, a proxy to let me display my iPhone's HEIC photos online.&lt;/p&gt;

&lt;p&gt;This week I started work on &lt;a href="https://github.com/dogsheep/photos-to-sqlite/"&gt;photos-to-sqlite&lt;/a&gt; - the set of tools which I'll use to turn my photos into something I can run queries against.&lt;/p&gt;

&lt;p&gt;So far I've mainly been figuring out how to get them into an S3 bucket that I control. Once configured, running &lt;code&gt;photos-to-sqlite upload photos.db ~/Pictures/Photos\ Library.photoslibrary/originals&lt;/code&gt; will start uploading every photo it can find in that directory to the S3 bucket.&lt;/p&gt;

&lt;p&gt;The filename it uses is the sha256 hash of the photo file contents, which I'm hoping will let me de-dupe photos from multiple sources in the future. It also writes basic metadata on the photos to that &lt;code&gt;photos.db&lt;/code&gt; SQLite database.&lt;/p&gt;
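
&lt;p&gt;Content-addressed naming like this can be sketched in a few lines of Python. This is a minimal illustration rather than the actual photos-to-sqlite code: the function names are hypothetical, and keeping the lowercased file extension after the hash is an assumption.&lt;/p&gt;

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fp:
        # Read fixed-size chunks so a large photo never sits fully in memory.
        for chunk in iter(lambda: fp.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def storage_key(path):
    """Build a bucket key from the file contents (hypothetical helper)."""
    # Assumption: keep the lowercased extension so the key hints at the format.
    suffix = Path(path).suffix.lower()
    return sha256_of_file(path) + suffix
```

&lt;p&gt;Two files with identical bytes produce the same key regardless of their original names, which is the property that makes de-duplication across sources possible.&lt;/p&gt;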

&lt;p&gt;This is going to be a big project. I'm investigating &lt;a href="https://github.com/RhetTbull/osxphotos"&gt;osxphotos&lt;/a&gt; to liberate the metadata from Apple Photos, and various Python libraries for extracting EXIF data from the files themselves.&lt;/p&gt;

&lt;p&gt;Once I've got that working, I can experiment with things like piping photos through &lt;a href="https://cloud.google.com/vision"&gt;Google Cloud Vision&lt;/a&gt; to label them based on their contents.&lt;/p&gt;

&lt;p&gt;This is all a very, &lt;em&gt;very&lt;/em&gt; early alpha at the moment, but I'm cautiously optimistic about progress so far.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/github"&gt;github&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/twitter"&gt;twitter&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/jsk"&gt;jsk&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/dogsheep"&gt;dogsheep&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/weeknotes"&gt;weeknotes&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sqlite-utils"&gt;sqlite-utils&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="github"/><category term="projects"/><category term="twitter"/><category term="datasette"/><category term="jsk"/><category term="dogsheep"/><category term="weeknotes"/><category term="sqlite-utils"/></entry><entry><title>Weeknotes: Hacking on 23 different projects</title><link href="https://simonwillison.net/2020/Apr/16/weeknotes-hacking-23-different-projects/#atom-tag" rel="alternate"/><published>2020-04-16T05:03:11+00:00</published><updated>2020-04-16T05:03:11+00:00</updated><id>https://simonwillison.net/2020/Apr/16/weeknotes-hacking-23-different-projects/#atom-tag</id><summary type="html">
    &lt;p&gt;I wrote a lot of code this week: 184 commits over 23 repositories! I've also started falling for Zeit Now v2, having found workarounds for some of my biggest problems with it.&lt;/p&gt;

&lt;h3&gt;Better Datasette on Zeit Now v2&lt;/h3&gt;

&lt;p&gt;Last week I &lt;a href="https://simonwillison.net/2020/Apr/8/weeknotes-zeit-now-v2/"&gt;bemoaned the loss of Zeit Now v1&lt;/a&gt; and documented &lt;a href="https://simonwillison.net/2020/Apr/8/weeknotes-zeit-now-v2/#hello-zeit-now-v2"&gt;my initial explorations&lt;/a&gt; of Zeit Now v2 with respect to Datasette.&lt;/p&gt;

&lt;p&gt;My favourite thing about Now v1 was that it ran from Dockerfiles, which gave me complete control over the versions of everything in my deployment environment.&lt;/p&gt;

&lt;p&gt;Now v2 runs on AWS Lambda, which means you are mostly stuck with what Zeit's flavour of Lambda gives you. This currently means Python 3.6 (not too terrible - Datasette fully supports it) and a positively ancient SQLite -  3.7.17 from May 2013.&lt;/p&gt;

&lt;p&gt;Lambda runs on Amazon Linux. Charles Leifer maintains a package called &lt;a href="https://github.com/coleifer/pysqlite3/"&gt;pysqlite3&lt;/a&gt; which bundles the latest version of SQLite3 as a standalone Python package, and includes a &lt;code&gt;pysqlite3-binary&lt;/code&gt; package precompiled for Linux. Could it work on Amazon Linux...?&lt;/p&gt;

&lt;p&gt;It turns out it does! A &lt;a href="https://github.com/simonw/datasette-publish-now/commit/529f978beeccbb45240d398a3bf24ed9d77ebd55"&gt;one-line change&lt;/a&gt; (not including tests) to my &lt;a href="https://github.com/simonw/datasette-publish-now"&gt;datasette-publish-now&lt;/a&gt; and it now deploys Datasette on Now v2 &lt;a href="https://datasette-public.now.sh/-/versions"&gt;with SQLite 3.31.1&lt;/a&gt; - the &lt;a href="https://www.sqlite.org/changes.html#version_3_31_0"&gt;latest release&lt;/a&gt; from January this year, with window functions and all kinds of other goodness.&lt;/p&gt;

&lt;p&gt;This means that Now v2 is back to being a really solid option for hosting Datasette instances. You get scale-to-zero, crazily low prices and really fast cold-boot times. It can only take databases up to around 50MB - if you need more space than that you're better off with &lt;a href="https://datasette.readthedocs.io/en/stable/publish.html#publishing-to-google-cloud-run"&gt;Cloud Run&lt;/a&gt; - but it's a great option for smaller data.&lt;/p&gt;

&lt;p&gt;I released &lt;a href="https://github.com/simonw/datasette-publish-now/releases"&gt;a few versions of datasette-publish-now&lt;/a&gt; as a result of this research. I plan to release the first non-alpha version at the same time as Datasette 0.40.&lt;/p&gt;

&lt;h3&gt;Various projects ported to Now v2 or Cloud Run&lt;/h3&gt;

&lt;p&gt;I had over 100 projects running on Now v1 that needed updating or deleting in time for that platform's shutdown in August. I've been porting some of them very quickly using &lt;code&gt;datasette-publish-now&lt;/code&gt;, but a few have been more work. Some highlights from this week:&lt;/p&gt;

&lt;ul&gt;&lt;li&gt;&lt;a href="https://ftfy.now.sh/"&gt;ftfy.now.sh&lt;/a&gt;, my web app that takes a string of broken unicode and figures out the sequence of transformations you can use to make sense of it (built on the incredible &lt;a href="https://github.com/LuminosoInsight/python-ftfy"&gt;FTFY Python library&lt;/a&gt; by Robyn Speer) has been upgraded to Now v2 - &lt;a href="https://github.com/simonw/ftfy-web"&gt;repo here&lt;/a&gt;.&lt;/li&gt;&lt;li&gt;&lt;a href="https://gzthermal.now.sh"&gt;gzthermal.now.sh&lt;/a&gt; offers a web interface to the &lt;code&gt;gzthermal&lt;/code&gt; gzip visualization tool, released by caveman &lt;a href="https://encode.su/threads/1889-gzthermal-pseudo-thermal-view-of-Gzip-Deflate-compression-efficiency"&gt;on the encode.ru (now encode.su) forum&lt;/a&gt;. My &lt;a href="https://github.com/simonw/gzthermal-web"&gt;repo is here&lt;/a&gt;.&lt;/li&gt;&lt;li&gt;My &lt;a href="https://github.com/simonw/cryptozoology"&gt;crowdsourced directory of range maps of cryptozoological creatures&lt;/a&gt; is now running on Cloud Run (I haven't figured out a way to run SpatiaLite on Now v2 yet).&lt;/li&gt;&lt;li&gt;The &lt;a href="https://datasette-sqlite-fts4.datasette.io/24ways-fts4-52e8a02?sql=select%0D%0A++++title%2C+author%2C%0D%0A++++decode_matchinfo%28matchinfo%28articles_fts%2C+%22pcx%22%29%29%2C%0D%0A++++json_object%28%22pre%22%2C+annotate_matchinfo%28matchinfo%28articles_fts%2C+%22pcxnalyb%22%29%2C+%22pcxnalyb%22%29%29%0D%0Afrom%0D%0A++++articles_fts%0D%0Awhere%0D%0A++++articles_fts+match+%3Asearch&amp;amp;search=jquery+maps"&gt;datasette-sqlite-fts4.datasette.io&lt;/a&gt; demo instance I used for explanations in &lt;a href="https://simonwillison.net/2019/Jan/7/exploring-search-relevance-algorithms-sqlite/"&gt;Exploring search relevance algorithms with SQLite&lt;/a&gt;.&lt;/li&gt;&lt;li&gt;The demo instance used for &lt;a href="https://github.com/simonw/datasette-jellyfish"&gt;datasette-jellyfish&lt;/a&gt; is on Now 
v2.&lt;/li&gt;&lt;li&gt;The demo for &lt;a href="https://github.com/simonw/datasette-jq"&gt;datasette-jq&lt;/a&gt; had to move to Cloud Run, because I couldn't install &lt;a href="https://github.com/doloopwhile/pyjq"&gt;pyjq&lt;/a&gt; on Now v2.&lt;/li&gt;&lt;/ul&gt;

&lt;h3&gt;big-local-datasette&lt;/h3&gt;

&lt;p&gt;I've been collaborating with the &lt;a href="https://biglocalnews.org/"&gt;Big Local&lt;/a&gt; team at Stanford on a number of projects related to the Covid-19 situation. It's not quite open to the public yet but I've been building a Datasette instance which shares data from the "open projects" maintained by that team.&lt;/p&gt;

&lt;p&gt;The implementation fits &lt;a href="https://simonwillison.net/2020/Jan/21/github-actions-cloud-run/"&gt;a common pattern&lt;/a&gt; for me: a &lt;a href="https://github.com/simonw/big-local-datasette/blob/afcb885b3e746d6380f4ad6bab899190b461975d/.github/workflows/deploy.yml"&gt;scheduled GitHub Action&lt;/a&gt; which fetches project data from a GraphQL API, seeks out CSV files which have changed (using HTTP HEAD requests to check their ETags), loads the CSV into SQLite tables and publishes the resulting database using &lt;code&gt;datasette publish cloudrun&lt;/code&gt;.&lt;/p&gt;
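
&lt;p&gt;The ETag check can be sketched with the Python standard library. This is an illustration under assumptions, not the code from the linked workflow: &lt;code&gt;fetch_etag&lt;/code&gt; and &lt;code&gt;changed_urls&lt;/code&gt; are hypothetical names, and treating a missing ETag as "changed" is a design choice made here.&lt;/p&gt;

```python
import urllib.request


def fetch_etag(url):
    """Issue an HTTP HEAD request and return the ETag header, or None."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request) as response:
        return response.headers.get("ETag")


def changed_urls(urls, known_etags, fetch=fetch_etag):
    """Return a dict of url to new ETag for URLs that differ from known_etags."""
    changed = {}
    for url in urls:
        etag = fetch(url)
        # A missing ETag counts as changed, to err on the side of refetching.
        if etag is None or known_etags.get(url) != etag:
            changed[url] = etag
    return changed
```

&lt;p&gt;Passing the fetcher in as an argument keeps the comparison logic testable without network access.&lt;/p&gt;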

&lt;p&gt;There's one interesting new twist: I'm fetching the existing database files on every run using my new &lt;a href="https://simonwillison.net/2020/Apr/14/datasette-clone/"&gt;datasette-clone tool&lt;/a&gt; (written for this project), applying changes to them and then only publishing if the resulting MD5 sums have changed since last time.&lt;/p&gt;

&lt;p&gt;It seems to work well, and I'm excited about this technique as a way of incrementally updating existing databases using stateless code running in a GitHub Action.&lt;/p&gt;
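
&lt;p&gt;The publish-only-if-changed step might look something like the following. It is a rough sketch, not the project's real implementation: the function names and the idea of keeping the previous sums in a plain dictionary are assumptions.&lt;/p&gt;

```python
import hashlib


def md5_of_file(path):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fp:
        for chunk in iter(lambda: fp.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def should_publish(db_paths, previous_sums):
    """Compare current MD5 sums with the recorded ones.

    Returns (changed, current_sums); changed is True when any file was
    added, removed or modified since previous_sums was recorded.
    """
    current_sums = {path: md5_of_file(path) for path in db_paths}
    return current_sums != previous_sums, current_sums
```

&lt;p&gt;The caller would record &lt;code&gt;current_sums&lt;/code&gt; after a successful publish and pass it back in on the next run.&lt;/p&gt;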

&lt;h3&gt;Datasette Cloud&lt;/h3&gt;

&lt;p&gt;I continue to work on the invite-only alpha of my SaaS Datasette platform, Datasette Cloud. This week I ported the CI and deployment scripts from GitLab to GitHub Actions, mainly to try and reduce the variety of CI systems I'm working with (I now have projects live on three: Travis, Circle CI and GitHub Actions).&lt;/p&gt;

&lt;p&gt;I've also been figuring out ways of supporting API tokens for making requests to authentication-protected Datasette instances. I shipped small releases of &lt;a href="https://github.com/simonw/datasette-auth-github/releases/tag/0.12"&gt;datasette-auth-github&lt;/a&gt; and &lt;a href="https://github.com/simonw/datasette-auth-existing-cookies/releases/tag/0.7"&gt;datasette-auth-existing-cookies&lt;/a&gt; to support this.&lt;/p&gt;

&lt;p&gt;In tinkering with Datasette Cloud I also shipped an upgrade to &lt;a href="https://github.com/simonw/datasette-mask-columns"&gt;datasette-mask-columns&lt;/a&gt;, which now shows visible REDACTED text on redacted columns in table view.&lt;/p&gt;

&lt;h3&gt;Miscellaneous&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;My &lt;a href="https://covid-19.datasettes.com/"&gt;covid-19.datasettes.com&lt;/a&gt; project now also imports data &lt;a href="https://github.com/simonw/covid-19-datasette/issues/11"&gt;from the LA Times&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;I added &lt;code&gt;.rows_where(..., order_by="column")&lt;/code&gt; in &lt;a href="https://sqlite-utils.readthedocs.io/en/stable/changelog.html#v2-6"&gt;release 2.6 of sqlite-utils&lt;/a&gt;.&lt;/li&gt;&lt;li&gt;I shipped a &lt;a href="https://github.com/simonw/paginate-json/releases/tag/0.3"&gt;new release&lt;/a&gt; of &lt;a href="https://github.com/simonw/paginate-json"&gt;paginate-json&lt;/a&gt;, a tool I built primarily for paginating through the GitHub API and piping the results to &lt;code&gt;sqlite-utils&lt;/code&gt;.&lt;/li&gt;&lt;li&gt;I fixed a minor bug &lt;a href="https://github.com/simonw/datasette/issues/724"&gt;with Datasette's --plugin-secret mechanism&lt;/a&gt; and added &lt;a href="https://github.com/simonw/datasette/issues/727"&gt;a CSS customization hook&lt;/a&gt; for the canned query page.&lt;/li&gt;&lt;li&gt;I built a &lt;a href="https://github.com/simonw/heic-to-jpeg"&gt;HEIC to JPEG converting proxy&lt;/a&gt; as part of my ongoing mission to eventually liberate my photos from Apple Photos and make them available to &lt;a href="https://simonwillison.net/tags/dogsheep/"&gt;Dogsheep&lt;/a&gt;. In doing so I &lt;a href="https://github.com/david-poirier-csn/pyheif/commit/8d03e0bf6dde6aa0317471792d698a30502f9e1d?short_path=04c6e90#diff-04c6e90faac2675aa89e2176d2eec7d8"&gt;contributed usage documentation&lt;/a&gt; to the pyheif Python library.&lt;/li&gt;&lt;/ul&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/zeit-now"&gt;zeit-now&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/dogsheep"&gt;dogsheep&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/weeknotes"&gt;weeknotes&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette-cloud"&gt;datasette-cloud&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="projects"/><category term="zeit-now"/><category term="datasette"/><category term="dogsheep"/><category term="weeknotes"/><category term="datasette-cloud"/></entry><entry><title>Weeknotes: Covid-19, First Python Notebook, more Dogsheep, Tailscale</title><link href="https://simonwillison.net/2020/Apr/1/weeknotes/#atom-tag" rel="alternate"/><published>2020-04-01T20:29:59+00:00</published><updated>2020-04-01T20:29:59+00:00</updated><id>https://simonwillison.net/2020/Apr/1/weeknotes/#atom-tag</id><summary type="html">
    &lt;p&gt;My &lt;a href="https://covid-19.datasettes.com/"&gt;covid-19.datasettes.com&lt;/a&gt; project publishes information on COVID-19 cases around the world. The project started out using data &lt;a href="https://github.com/CSSEGISandData/COVID-19"&gt;from Johns Hopkins CSSE&lt;/a&gt;, but last week the New York Times &lt;a href="https://www.nytimes.com/article/coronavirus-county-data-us.html"&gt;started publishing&lt;/a&gt; high quality USA county- and state-level daily numbers to their &lt;a href="https://github.com/nytimes/covid-19-data"&gt;own repository&lt;/a&gt;. Here's &lt;a href="https://github.com/simonw/covid-19-datasette/commit/56e1644390e5d01ff67c61d6c165749093675632"&gt;the change&lt;/a&gt; that added the NY Times data.&lt;/p&gt;

&lt;p&gt;It's very easy to use this data to accidentally build misleading things. I've been &lt;a href="https://github.com/simonw/covid-19-datasette/blob/master/README.md"&gt;updating the README&lt;/a&gt; with links about this - my current favourite is &lt;a href="https://fivethirtyeight.com/features/why-its-so-freaking-hard-to-make-a-good-covid-19-model/"&gt;Why It’s So Freaking Hard To Make A Good COVID-19 Model&lt;/a&gt; by  Maggie Koerth, Laura Bronner and Jasmine Mithani at FiveThirtyEight.&lt;/p&gt;

&lt;h3 id="weeknotes-first-python-notebook"&gt;First Python Notebook&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://twitter.com/palewire"&gt;Ben Welsh&lt;/a&gt; from the LA Times teaches a course called &lt;a href="https://www.firstpythonnotebook.org/"&gt;First Python Notebook&lt;/a&gt; at journalism conferences such as NICAR. He ran a free online version of the course last weekend, and I offered to help out as a TA.&lt;/p&gt;

&lt;p&gt;Most of the help I provided came before the course: Ben asked attendees to confirm that they had working installations of Python 3 and pipenv, and if they didn't volunteers such as myself would step in to help. I had Zoom and email conversations with at least ten people to help them get their environments into shape.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://xkcd.com/1987/"&gt;This XKCD&lt;/a&gt; neatly summarizes the problem:&lt;/p&gt;

&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2020/python_environment_2x.png" alt="XKCD Python Environments" style="max-width: 100%" /&gt;&lt;/p&gt;

&lt;p&gt;One of the most common problems I had to debug was PATH issues: people had installed the software, but due to various environmental differences &lt;code&gt;python3&lt;/code&gt; and &lt;code&gt;pipenv&lt;/code&gt; weren't available on the PATH. Talking people through the obscurities of creating a &lt;code&gt;~/.bashrc&lt;/code&gt; file and using it to define a PATH override really helps emphasize how arcane this kind of knowledge is.&lt;/p&gt;

&lt;p&gt;I enjoyed this comment:&lt;/p&gt;

&lt;blockquote&gt;&lt;p&gt;"Welcome to intro to Tennis. In the first two weeks, we'll discuss how to rig a net and resurface a court." - &lt;a href="https://twitter.com/ClausWilke/status/1234941405883138048"&gt;Claus Wilke&lt;/a&gt;&lt;/p&gt;&lt;/blockquote&gt;

&lt;p&gt;Ben's course itself is hands down the best introduction to Python from a Data Journalism perspective I have ever seen. Within an hour of starting the students are using Pandas in a Jupyter notebook to find interesting discrepancies in California campaign finance data.&lt;/p&gt;

&lt;p&gt;If you want to check it out yourself, the entire four hour workshop &lt;a href="https://twitter.com/palewire/status/1244410903279177728"&gt;is now on YouTube&lt;/a&gt; and closely follows the material on &lt;a href="https://www.firstpythonnotebook.org/"&gt;firstpythonnotebook.org&lt;/a&gt;.&lt;/p&gt;

&lt;h3 id="weeknotes-coronavirus-diary"&gt;Coronavirus Diary&lt;/h3&gt;

&lt;p&gt;We are clearly living through a notable and very painful period of history right now. On the 19th of March (just under two weeks ago, but time is moving both really fast and incredibly slowly right now) I started a personal diary - something I've never done before. It lives in an Apple Note and I'm adding around a dozen paragraphs to it every day. I think it's helping. I'm sure it will be interesting to look back on in a few years' time.&lt;/p&gt;

&lt;h3 id="weeknotes-dogsheep"&gt;Dogsheep&lt;/h3&gt;

&lt;p&gt;Much of my development work this past week has gone into my &lt;a href="https://github.com/dogsheep"&gt;Dogsheep&lt;/a&gt; suite of tools for personal analytics.&lt;/p&gt;

&lt;ul&gt;&lt;li&gt;I upgraded the entire family of tools for compatibility with &lt;a href="https://sqlite-utils.readthedocs.io/en/stable/changelog.html#v2"&gt;sqlite-utils 2.x&lt;/a&gt;.&lt;/li&gt;&lt;li&gt;&lt;a href="https://github.com/dogsheep/pocket-to-sqlite"&gt;pocket-to-sqlite&lt;/a&gt; got a major upgrade: it now fetches items using Pocket's API pagination (previously it just tried to pull in 5,000 items in one go) and has the ability to only fetch new items. As a result I'm now running it from cron in my personal Dogsheep instance, so "Save to Pocket" is now my preferred Dogsheep-compatible way of bookmarking content.&lt;/li&gt;&lt;li&gt;&lt;a href="https://github.com/dogsheep/twitter-to-sqlite"&gt;twitter-to-sqlite&lt;/a&gt; got a couple of important new features in &lt;a href="https://github.com/dogsheep/twitter-to-sqlite/releases/tag/0.20"&gt;release 0.20&lt;/a&gt;. I fixed &lt;a href="https://github.com/dogsheep/twitter-to-sqlite/issues/39"&gt;a nasty bug&lt;/a&gt; in the &lt;code&gt;--since&lt;/code&gt; flag where retweets from other accounts could cause new tweets from an account to be ignored. I also added a new &lt;code&gt;count_history&lt;/code&gt; table which automatically tracks changes to a Twitter user's friends, follower and listed counts over time (&lt;a href="https://github.com/dogsheep/twitter-to-sqlite/issues/40"&gt;#40&lt;/a&gt;).&lt;/li&gt;&lt;/ul&gt;

&lt;p&gt;I'm also now using Dogsheep for some journalism! I'm working with the &lt;a href="https://biglocalnews.org/"&gt;Big Local News&lt;/a&gt; team at Stanford to help track and archive tweets by a number of different US politicians and health departments relating to the ongoing pandemic. This collaboration resulted in the above improvements to &lt;code&gt;twitter-to-sqlite&lt;/code&gt;.&lt;/p&gt;

&lt;h3 id="weeknotes-tailscale"&gt;Tailscale&lt;/h3&gt;

&lt;p&gt;My personal Dogsheep is currently protected by &lt;a href="https://simonwillison.net/2019/Oct/5/client-side-certificate-authentication-nginx/"&gt;client certificates&lt;/a&gt;, so only my personal laptop and iPhone (with the right certificates installed) can connect to the web server it is running on.&lt;/p&gt;

&lt;p&gt;I spent a bit of time this week playing with &lt;a href="https://tailscale.com/"&gt;Tailscale&lt;/a&gt;, and I'm &lt;em&gt;really&lt;/em&gt; impressed by it.&lt;/p&gt;

&lt;p&gt;Tailscale is a commercial company built on top of &lt;a href="https://www.wireguard.com/"&gt;WireGuard&lt;/a&gt;, the new approach to VPN tunnels which just &lt;a href="https://arstechnica.com/gadgets/2020/03/wireguard-vpn-makes-it-to-1-0-0-and-into-the-next-linux-kernel/"&gt;got merged&lt;/a&gt; into the Linux 5.6 kernel. Tailscale first caught my attention in January when they &lt;a href="https://bradfitz.com/2020/01/30/joining-tailscale"&gt;hired Brad Fitzpatrick&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;WireGuard lets you form a private network by having individual hosts exchange public/private keys with each other. Tailscale provides software which manages those keys for you, making it trivial to set up a private network between different nodes.&lt;/p&gt;

&lt;p&gt;How trivial? It took me less than ten minutes to get a three-node private network running between my iPhone, laptop and a Linux server. I installed the &lt;a href="https://apps.apple.com/us/app/tailscale/id1470499037?ls=1"&gt;iPhone app&lt;/a&gt;, the &lt;a href="https://tailscale.com/kb/1037/install-ubuntu-1804"&gt;Ubuntu package&lt;/a&gt;, the &lt;a href="https://apps.apple.com/ca/app/tailscale/id1475387142?mt=12"&gt;OS X app&lt;/a&gt;, signed them all into my Google account and I was done.&lt;/p&gt;

&lt;p&gt;Each of those devices now has an additional IP address in the 100.x range which they can use to talk to each other. Tailscale guarantees that the IP address will stay constant for each of them.&lt;/p&gt;

&lt;p&gt;Since the network is public/private key encrypted between the nodes, Tailscale can't see any of my traffic - they're purely acting as a key management mechanism. And it's free: Tailscale charge for networks with multiple users, but a personal network like this is free of charge.&lt;/p&gt;

&lt;p&gt;I'm not running my own personal Dogsheep on it yet, but I'm tempted to switch over. I'd love other people to start running their own personal Dogsheep instances but I'm paranoid about encouraging this when securing them is so important. Tailscale looks like it might be a great solution for making secure personal infrastructure more easily and widely available.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/brad-fitzpatrick"&gt;brad-fitzpatrick&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/data-journalism"&gt;data-journalism&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/teaching"&gt;teaching&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/dogsheep"&gt;dogsheep&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/weeknotes"&gt;weeknotes&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/tailscale"&gt;tailscale&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/covid19"&gt;covid19&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ben-welsh"&gt;ben-welsh&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="brad-fitzpatrick"/><category term="data-journalism"/><category term="projects"/><category term="python"/><category term="teaching"/><category term="datasette"/><category term="dogsheep"/><category term="weeknotes"/><category term="tailscale"/><category term="covid19"/><category term="ben-welsh"/></entry><entry><title>Weeknotes: Datasette 0.39 and many other projects</title><link href="https://simonwillison.net/2020/Mar/25/weeknotes/#atom-tag" rel="alternate"/><published>2020-03-25T05:33:19+00:00</published><updated>2020-03-25T05:33:19+00:00</updated><id>https://simonwillison.net/2020/Mar/25/weeknotes/#atom-tag</id><summary type="html">
    &lt;p&gt;This week's theme: Well, I'm not going anywhere. So a ton of progress to report on various projects.&lt;/p&gt;

&lt;h3 id="weeknotes-datasette-0-39"&gt;Datasette 0.39&lt;/h3&gt;

&lt;p&gt;This evening I shipped &lt;a href="https://datasette.readthedocs.io/en/stable/changelog.html#v0-39"&gt;Datasette 0.39&lt;/a&gt;. The two big features are a mechanism for setting the default sort order for tables and a new &lt;code&gt;base_url&lt;/code&gt; configuration setting.&lt;/p&gt;

&lt;p&gt;You can see the new default sort order in action &lt;a href="https://covid-19.datasettes.com/covid/daily_reports"&gt;on my Covid-19 project&lt;/a&gt; - the daily reports now default to sort by day descending so the most recent figures show up first. Here's &lt;a href="https://covid-19.datasettes.com/-/metadata.json"&gt;the metadata&lt;/a&gt; that makes it happen, and here's the &lt;a href="https://datasette.readthedocs.io/en/stable/metadata.html#setting-a-default-sort-order"&gt;new documentation&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;I had to do some extra work on that project this morning when the underlying data &lt;a href="https://github.com/simonw/covid-19-datasette/issues/4"&gt;changed its CSV column headings&lt;/a&gt; without warning.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;base_url&lt;/code&gt; feature has been &lt;a href="https://github.com/simonw/datasette/issues/394"&gt;an open issue&lt;/a&gt; since January 2019. It lets you run Datasette behind a proxy on a different URL prefix - &lt;code&gt;/tools/datasette/&lt;/code&gt; for example. The trigger for finally getting this solved was &lt;a href="https://twitter.com/betatim/status/1242217777282285572"&gt;a Twitter conversation&lt;/a&gt; about running Datasette on Binder in coordination with a Jupyter notebook.&lt;/p&gt;

&lt;p&gt;Tony Hirst &lt;a href="https://github.com/psychemedia/jupyterserverproxy-datasette-demo"&gt;did some work on this&lt;/a&gt; last year, but was stumped by the lack of a &lt;code&gt;base_url&lt;/code&gt; equivalent. Terry Jones &lt;a href="https://github.com/simonw/datasette/pull/652"&gt;shared an implementation&lt;/a&gt; in December. I finally found the inspiration to pull it all together, and ended up with &lt;a href="https://github.com/simonw/jupyterserverproxy-datasette-demo"&gt;a working fork&lt;/a&gt; of Tony's project which does indeed launch Datasette on Binder - &lt;a href="https://mybinder.org/v2/gh/simonw/jupyterserverproxy-datasette-demo/master?urlpath=datasette"&gt;try launching your own here&lt;/a&gt;.&lt;/p&gt;

&lt;h3 id="weeknotes-github-to-sqlite"&gt;github-to-sqlite&lt;/h3&gt;

&lt;p&gt;I've not done much work on my &lt;a href="https://simonwillison.net/tags/dogsheep/"&gt;Dogsheep&lt;/a&gt; family of tools in a while. That changed this week: in particular, I shipped a &lt;a href="https://github.com/dogsheep/github-to-sqlite/releases/tag/1.0"&gt;1.0&lt;/a&gt; of &lt;a href="https://github.com/dogsheep/github-to-sqlite"&gt;github-to-sqlite&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;As you might expect, it's a tool for importing GitHub data into a SQLite database. Today it can handle repositories, releases, release assets, commits, issues and issue comments. You can see a live demo built from &lt;a href="https://github.com/dogsheep"&gt;Dogsheep organization&lt;/a&gt; data at &lt;a href="https://github-to-sqlite.dogsheep.net/"&gt;github-to-sqlite.dogsheep.net&lt;/a&gt; (deployed by &lt;a href="https://github.com/dogsheep/github-to-sqlite/blob/master/.github/workflows/deploy-demo.yml"&gt;this GitHub action&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;I built this tool primarily to help me better keep track of all of my projects. Pulling the issues into a single database means I can run queries against all open issues across all of my repositories, and importing commits and releases is handy for when I want to write my weeknotes and need to figure out what I've worked on lately.&lt;/p&gt;

&lt;h3 id="weeknotes-datasette-render-markdown"&gt;datasette-render-markdown&lt;/h3&gt;

&lt;p&gt;GitHub issues use Markdown. To correctly display them it's useful to be able to render that Markdown. I built &lt;a href="https://github.com/simonw/datasette-render-markdown"&gt;datasette-render-markdown&lt;/a&gt; back &lt;a href="https://simonwillison.net/2019/Nov/11/weeknotes-8/#datasetterendermarkdown_81"&gt;in November&lt;/a&gt;, but this week I made some substantial upgrades: you can now &lt;a href="https://github.com/simonw/datasette-render-markdown/blob/1.1.1/README.md#usage"&gt;configure which columns should be rendered&lt;/a&gt;, and it includes &lt;a href="https://github.com/simonw/datasette-render-markdown/blob/1.1.1/README.md#markdown-extensions"&gt;support for Markdown extensions&lt;/a&gt; including GitHub-Flavored Markdown.&lt;/p&gt;

&lt;p&gt;You can see it in action on &lt;a href="https://github-to-sqlite.dogsheep.net/github/issues"&gt;the github-to-sqlite demo&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;I also upgraded &lt;a href="https://github.com/simonw/datasette-render-timestamps"&gt;datasette-render-timestamps&lt;/a&gt; with the same explicit column configuration pattern.&lt;/p&gt;

&lt;h3 id="weeknotes-datasette-publish-fly"&gt;datasette-publish-fly&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://fly.io/"&gt;Fly&lt;/a&gt; is a relatively new hosting provider which lets you host applications bundled as Docker containers in load-balanced data centers geographically close to your users.&lt;/p&gt;

&lt;p&gt;It has a couple of characteristics that make it a really good fit for Datasette.&lt;/p&gt;

&lt;p&gt;Firstly, the &lt;a href="https://fly.io/docs/pricing/"&gt;pricing model&lt;/a&gt;: Fly will currently host a tiny (128MB of RAM) container for $2.67/month - and they give you $10/month of free service credit, enough for 3 containers.&lt;/p&gt;

&lt;p&gt;It turns out Datasette runs just fine in 128MB of RAM, so that's three always-on Datasette containers! (Unlike Heroku and Cloud Run, Fly keeps your containers running rather than scaling them to zero).&lt;/p&gt;

&lt;p&gt;Secondly, it works by shipping it a Dockerfile. This means building &lt;a href="https://datasette.readthedocs.io/en/stable/publish.html"&gt;datasette publish&lt;/a&gt; support for it is really easy.&lt;/p&gt;

&lt;p&gt;I added the &lt;a href="https://datasette.readthedocs.io/en/stable/plugins.html#publish-subcommand-publish"&gt;publish_subcommand&lt;/a&gt; plugin hook to Datasette all the way back in &lt;a href="https://datasette.readthedocs.io/en/stable/changelog.html#v0-25"&gt;0.25&lt;/a&gt; in September 2018, but I've never actually built anything with it. That's now changed: &lt;a href="https://github.com/simonw/datasette-publish-fly"&gt;datasette-publish-fly&lt;/a&gt; uses the hook to add a &lt;code&gt;datasette publish fly&lt;/code&gt; command for publishing databases directly to your Fly account.&lt;/p&gt;

&lt;h3 id="weeknotes-hacker-news-to-sqlite"&gt;hacker-news-to-sqlite&lt;/h3&gt;

&lt;p&gt;It turns out I created my &lt;a href="https://news.ycombinator.com/"&gt;Hacker News&lt;/a&gt; account in 2007, and I've posted 2,167 comments and submitted 131 stories since then. Since my personal Dogsheep project is about pulling my data from multiple sources into a single place it made sense to build a tool for importing from Hacker News.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/dogsheep/hacker-news-to-sqlite"&gt;hacker-news-to-sqlite&lt;/a&gt; uses the official &lt;a href="https://github.com/HackerNews/API"&gt;Hacker News API&lt;/a&gt; to import every comment and story posted by a specific user. It can also use one or more item IDs to suck the entire discussion tree around those items.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://github.com/dogsheep/hacker-news-to-sqlite/blob/c8697b3e4ef044412209b52c70548fedbcb346c7/README.md#browsing-your-data-with-datasette"&gt;README&lt;/a&gt; includes detailed documentation on how to best browse your data using Datasette once you have imported it.&lt;/p&gt;

&lt;h3 id="weeknotes-other-projects"&gt;Other projects&lt;/h3&gt;

&lt;ul&gt;&lt;li&gt;&lt;a href="https://github.com/simonw/sqlite-utils"&gt;sqlite-utils&lt;/a&gt; gained some improvements to the way it suggests types for existing columns.&lt;/li&gt;&lt;li&gt;&lt;a href="https://github.com/dogsheep/twitter-to-sqlite"&gt;twitter-to-sqlite&lt;/a&gt; now offers &lt;code&gt;--sql&lt;/code&gt; and &lt;code&gt;--attach&lt;/code&gt; for more of its subcommands.&lt;/li&gt;&lt;li&gt;&lt;a href="https://github.com/simonw/datasette-show-errors"&gt;datasette-show-errors&lt;/a&gt; is a new plugin which exposes 500 errors as tracebacks, like Django does with &lt;code&gt;DEBUG=True&lt;/code&gt;. It's built on top of &lt;a href="https://www.starlette.io/middleware/"&gt;Starlette's ServerErrorMiddleware&lt;/a&gt;.&lt;/li&gt;&lt;li&gt;I upgraded &lt;a href="https://github.com/dogsheep/inaturalist-to-sqlite"&gt;inaturalist-to-sqlite&lt;/a&gt; to work with &lt;code&gt;sqlite-utils&lt;/code&gt; 2.x.&lt;/li&gt;&lt;/ul&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/github"&gt;github&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sqlite"&gt;sqlite&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/markdown"&gt;markdown&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/jupyter"&gt;jupyter&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/dogsheep"&gt;dogsheep&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/weeknotes"&gt;weeknotes&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/fly"&gt;fly&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="github"/><category term="projects"/><category term="sqlite"/><category term="markdown"/><category term="jupyter"/><category term="datasette"/><category term="dogsheep"/><category term="weeknotes"/><category term="fly"/></entry><entry><title>hacker-news-to-sqlite</title><link href="https://simonwillison.net/2020/Mar/21/hacker-news-to-sqlite/#atom-tag" rel="alternate"/><published>2020-03-21T04:27:04+00:00</published><updated>2020-03-21T04:27:04+00:00</updated><id>https://simonwillison.net/2020/Mar/21/hacker-news-to-sqlite/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/dogsheep/hacker-news-to-sqlite"&gt;hacker-news-to-sqlite&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
The latest in my Dogsheep series of tools: hacker-news-to-sqlite uses the Hacker News API to fetch your comments and submissions from Hacker News and save them to a SQLite database.

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/simonw/status/1241220276223791107"&gt;@simonw&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/hacker-news"&gt;hacker-news&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sqlite"&gt;sqlite&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/dogsheep"&gt;dogsheep&lt;/a&gt;&lt;/p&gt;



</summary><category term="hacker-news"/><category term="projects"/><category term="sqlite"/><category term="dogsheep"/></entry><entry><title>pinboard-to-sqlite</title><link href="https://simonwillison.net/2019/Nov/7/pinboard-to-sqlite/#atom-tag" rel="alternate"/><published>2019-11-07T20:46:38+00:00</published><updated>2019-11-07T20:46:38+00:00</updated><id>https://simonwillison.net/2019/Nov/7/pinboard-to-sqlite/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/jacobian/pinboard-to-sqlite"&gt;pinboard-to-sqlite&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Jacob Kaplan-Moss just released the second Dogsheep tool that wasn’t written by me (after goodreads-to-sqlite by Tobias Kunze). This one imports your Pinboard bookmarks. The repo includes a really clean minimal example of how to use GitHub Actions to run tests and release packages to PyPI.

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/jacobian/status/1192500064163389442"&gt;@jacobian&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/github"&gt;github&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/jacob-kaplan-moss"&gt;jacob-kaplan-moss&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pypi"&gt;pypi&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pinboard"&gt;pinboard&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/dogsheep"&gt;dogsheep&lt;/a&gt;&lt;/p&gt;



</summary><category term="github"/><category term="jacob-kaplan-moss"/><category term="pypi"/><category term="pinboard"/><category term="dogsheep"/></entry><entry><title>Weeknotes: PG&amp;E outages, and Open Source works!</title><link href="https://simonwillison.net/2019/Oct/14/weeknotes/#atom-tag" rel="alternate"/><published>2019-10-14T17:00:28+00:00</published><updated>2019-10-14T17:00:28+00:00</updated><id>https://simonwillison.net/2019/Oct/14/weeknotes/#atom-tag</id><summary type="html">
    &lt;p&gt;My big focus this week was the &lt;a href="https://simonwillison.net/2019/Oct/10/pge-outages/"&gt;PG&amp;amp;E outages project&lt;/a&gt;. I'm really pleased with how this turned out: the San Francisco Chronicle used data from it for &lt;a href="https://projects.sfchronicle.com/2019/wind-outage-map/"&gt;their excellent PG&amp;amp;E outage interactive&lt;/a&gt; (mixing in data on wind conditions) and it earned a bunch of interest on &lt;a href="https://twitter.com/simonw/status/1182440312590848001"&gt;Twitter&lt;/a&gt; and some discussion on &lt;a href="https://news.ycombinator.com/item?id=21224378"&gt;Hacker News&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;I gave a talk about the project on Thursday for the &lt;a href="https://www.meetup.com/ONA-Silicon-Valley/events/265317251/"&gt;Online News Association Silicon Valley meetup&lt;/a&gt; and made connections with a bunch of interesting journalists from around the Bay Area.&lt;/p&gt;

&lt;h3 id="weeknotes-opensourceworks"&gt;Open Source works!&lt;/h3&gt;

&lt;p&gt;A highlight of last week was that a bunch of my projects gained new functionality through open source contributions!&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;I landed &lt;a href="https://github.com/simonw/datasette/commits/12cec411cae73ba7211429da12cd32c551fe17b1"&gt;three pull requests&lt;/a&gt; to Datasette from Tobias Kunze - a metadata fix, smarter database ordering and &lt;a href="https://github.com/simonw/datasette/issues/136"&gt;a long-wanted button&lt;/a&gt; to auto-indent SQL queries.&lt;/li&gt;

&lt;li&gt;&lt;a href="https://github.com/simonw/datasette-leaflet-geojson"&gt;datasette-leaflet-geojson&lt;/a&gt; is a plugin that I built last year but never promoted because it had a frustrating bug in it. Chris Shaw &lt;a href="https://github.com/simonw/datasette-leaflet-geojson/pull/4"&gt;spotted and fixed the bug&lt;/a&gt;, and &lt;a href="https://github.com/simonw/datasette-leaflet-geojson/pull/5"&gt;bumped the Leaflet version dependency&lt;/a&gt; too. These changes are now shipped as &lt;a href="https://github.com/simonw/datasette-leaflet-geojson/releases/tag/0.3"&gt;version 0.3&lt;/a&gt;.&lt;/li&gt;

&lt;li&gt;I quietly shipped a new plugin, &lt;a href="https://github.com/simonw/datasette-render-timestamps"&gt;datasette-render-timestamps&lt;/a&gt;, last week which identifies columns containing Unix-style timestamp integers and renders them as a readable string. Chris spotted that too, and added a feature &lt;a href="https://github.com/simonw/datasette-render-timestamps/pull/1"&gt;to let you configure the formatting&lt;/a&gt; in a &lt;code&gt;metadata.json&lt;/code&gt; plugin setting. It was the best kind of pull request, incorporating the fix, the tests and the documentation update all in a single commit.&lt;/li&gt;

&lt;li&gt;Ishan Anand plans to use &lt;a href="https://github.com/simonw/datasette-auth-github"&gt;datasette-auth-github&lt;/a&gt; to &lt;a href="https://github.com/simonw/datasette-auth-github/issues/45"&gt;hook into nginx authentication&lt;/a&gt;. He &lt;a href="https://github.com/simonw/datasette-auth-github/pull/49"&gt;submitted a PR&lt;/a&gt; to refactor the &lt;code&gt;GithubAuth&lt;/code&gt; class to support building that as a separate project.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id="weeknotes-more-dogsheep"&gt;Dogsheep&lt;/h3&gt;

&lt;p&gt;I started &lt;a href="https://dogsheep.github.io/"&gt;a very basic website&lt;/a&gt; for my Dogsheep personal analytics project.&lt;/p&gt;

&lt;p&gt;I also started running various Dogsheep tools via cron on my personal Dogsheep server, to keep that set of databases automatically updated with my latest activity on various services.&lt;/p&gt;

&lt;p&gt;Most excitingly, Tobias Kunze built the first independent Dogsheep-style tool: &lt;a href="https://github.com/rixx/goodreads-to-sqlite"&gt;goodreads-to-sqlite&lt;/a&gt;!&lt;/p&gt;

&lt;p&gt;As the name suggests, it imports your data from &lt;a href="https://goodreads.com/"&gt;Goodreads&lt;/a&gt; into a SQLite database. It inspired me to create a Goodreads account which I will be using to track my book reading activity from now on.&lt;/p&gt;

&lt;p&gt;Tobias wrote &lt;a href="https://rixx.de/blog/goodreads-sqlite/"&gt;a fantastic blog post introducing the tool&lt;/a&gt; which includes some neat example queries and graphs.&lt;/p&gt;

&lt;p&gt;In other Dogsheep news, I added &lt;a href="https://github.com/dogsheep/github-to-sqlite/issues/7"&gt;an issue-comments command to github-to-sqlite&lt;/a&gt; for fetching all issue comments in a repo. My goal is to evolve that tool to the point where it can import all relevant data from all of my repositories and give me a single Datasette-powered dashboard for keeping track of everything in one place.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/open-source"&gt;open-source&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/dogsheep"&gt;dogsheep&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/weeknotes"&gt;weeknotes&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/leaflet"&gt;leaflet&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="open-source"/><category term="projects"/><category term="datasette"/><category term="dogsheep"/><category term="weeknotes"/><category term="leaflet"/></entry><entry><title>goodreads-to-sqlite</title><link href="https://simonwillison.net/2019/Oct/14/goodreads-sqlite/#atom-tag" rel="alternate"/><published>2019-10-14T04:07:53+00:00</published><updated>2019-10-14T04:07:53+00:00</updated><id>https://simonwillison.net/2019/Oct/14/goodreads-sqlite/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://rixx.de/blog/goodreads-sqlite/"&gt;goodreads-to-sqlite&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
This is so cool! Tobias Kunze built a Python CLI tool to import your Goodreads data into a SQLite database, inspired by github-to-sqlite and my various other Dogsheep tools. It’s the first Dogsheep style tool I’ve seen that wasn’t built by me—and Tobias’ write-up includes some neat examples of queries you can run against your Goodreads data. I’ve now started using Goodreads and I’m importing my books into my own private Dogsheep Datasette instance.

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/rixxtr/status/1183451595805249543"&gt;@rixxtr&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/books"&gt;books&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/cli"&gt;cli&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sqlite"&gt;sqlite&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/dogsheep"&gt;dogsheep&lt;/a&gt;&lt;/p&gt;



</summary><category term="books"/><category term="cli"/><category term="sqlite"/><category term="datasette"/><category term="dogsheep"/></entry><entry><title>Weeknotes: Dogsheep</title><link href="https://simonwillison.net/2019/Oct/7/dogsheep/#atom-tag" rel="alternate"/><published>2019-10-07T19:33:14+00:00</published><updated>2019-10-07T19:33:14+00:00</updated><id>https://simonwillison.net/2019/Oct/7/dogsheep/#atom-tag</id><summary type="html">
    &lt;p&gt;Having figured out &lt;a href="https://simonwillison.net/2019/Sep/30/weeknotes-first-week-stanford/"&gt;my Stanford schedule&lt;/a&gt;, this week I started getting back into the habit of writing some code.&lt;/p&gt;

&lt;h3 id="weeknotes-dogsheep"&gt;Dogsheep&lt;/h3&gt;

&lt;p&gt;Dogsheep is the collective name I've given to a &lt;a href="https://github.com/dogsheep"&gt;suite of tools&lt;/a&gt; I'm building around the concept of &lt;em&gt;personal analytics&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;I generate a lot of data, and while much of it ends up in the silos of the internet giants, thanks to the GDPR most of those silos now feature an "export a copy of your data" button.&lt;/p&gt;

&lt;p&gt;Wouldn't it be cool if you could convert that data into a SQLite database and then use Datasette to run queries against it?&lt;/p&gt;

&lt;p&gt;So that's what I'm doing! The tools I've built so far include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/dogsheep/healthkit-to-sqlite"&gt;healthkit-to-sqlite&lt;/a&gt; for my Apple HealthKit data (mostly collected by my Apple Watch).&lt;/li&gt;

&lt;li&gt;&lt;a href="https://github.com/dogsheep/twitter-to-sqlite"&gt;twitter-to-sqlite&lt;/a&gt;, by far the most developed tool. For Dogsheep purposes it lets me import my tweets and the tweets I have favourited, but it's growing all kinds of other useful features for retrieving and analyzing data from Twitter. More on this &lt;a href="https://simonwillison.net/2019/Sep/13/weeknotestwitter-sqlite-datasette-rure/"&gt;in my previous weeknotes&lt;/a&gt;.&lt;/li&gt;

&lt;li&gt;&lt;a href="https://github.com/dogsheep/swarm-to-sqlite"&gt;swarm-to-sqlite&lt;/a&gt; for my Foursquare Swarm checkins.&lt;/li&gt;

&lt;li&gt;&lt;a href="https://github.com/dogsheep/inaturalist-to-sqlite"&gt;inaturalist-to-sqlite&lt;/a&gt; for my &lt;a href="https://www.inaturalist.org/observations/simonw"&gt;iNaturalist observations&lt;/a&gt;.&lt;/li&gt;

&lt;li&gt;&lt;a href=""&gt;google-takeout-to-sqlite&lt;/a&gt; for the wealth of data available from &lt;a href="https://takeout.google.com/"&gt;Google Takeout&lt;/a&gt;. I've barely scratched the surface with this one but it does have the ability to export my location history from Google Maps - 215,000 latitude/longitude/timestamp records dating back to &lt;code&gt;select min(timestamp) from location_history&lt;/code&gt; July 2015!&lt;/li&gt;

&lt;li&gt;&lt;a href="https://github.com/dogsheep/github-to-sqlite"&gt;github-to-sqlite&lt;/a&gt; for importing my GitHub repositories and the repositories I have starred.&lt;/li&gt;

&lt;li&gt;&lt;a href="https://github.com/dogsheep/pocket-to-sqlite"&gt;pocket-to-sqlite&lt;/a&gt; (built over this weekend) for articles I have saved to &lt;a href="https://getpocket.com/"&gt;Pocket&lt;/a&gt;.&lt;/li&gt;

&lt;li&gt;&lt;a href="https://github.com/dogsheep/genome-to-sqlite"&gt;genome-to-sqlite&lt;/a&gt; for my 23andMe genome, because being able to run SQL queries against my genetic code is really funny.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;My biggest achievement this week is that my own personal Dogsheep is no longer vaporware! I got a Datasette instance running on an &lt;a href="https://aws.amazon.com/lightsail/"&gt;AWS Lightsail&lt;/a&gt; VPS (using the systemd recipe from &lt;a href="https://github.com/simonw/datasette/issues/514"&gt;this ticket&lt;/a&gt;) and started loading in copies of my data.&lt;/p&gt;

&lt;p&gt;Securing this instance well is important. In addition to running it behind &lt;a href="https://github.com/simonw/datasette-auth-github"&gt;datasette-auth-github&lt;/a&gt; I've followed up on a long-term ambition of deploying something protected by client certificates. If your browser doesn't have the required certificate installed it gets rejected by nginx before it's even proxied through to Datasette.&lt;/p&gt;

&lt;p&gt;I have the certificate successfully installed on my iPhone and my laptop browsers. Setting this all up wasn't particularly straightforward - I ended up following &lt;a href="https://fardog.io/blog/2017/12/30/client-side-certificate-authentication-with-nginx/"&gt;this excellent tutorial&lt;/a&gt; by Nathan Wittstock for the nginx part of it, then randomly Googling for tips on installing the certificate in Firefox and on my iPhone.&lt;/p&gt;

&lt;p&gt;I think there's a great opportunity for tooling that makes this much easier. Adding that to my maybe-someday pile.&lt;/p&gt;

&lt;h3 id="datasetteauthgithubassetcaching"&gt;datasette-auth-github asset caching&lt;/h3&gt;

&lt;p&gt;Dogsheep is a great opportunity for &lt;a href="https://en.wikipedia.org/wiki/Eating_your_own_dog_food"&gt;eating my own dogfood&lt;/a&gt;. I've been having a lot of fun using it to explore my data... but I noticed that there was a suspicious delay every time I clicked a link. Datasette is mostly JavaScript-free but usually it responds so fast (~100ms or less) that clicking around feels "app-like" - but here I was really feeling the delay, often of more than a second.&lt;/p&gt;

&lt;p&gt;I fired up the Firefox Network inspector and quickly spotted what was happening. I have the &lt;a href="https://github.com/simonw/datasette-vega"&gt;datasette-vega&lt;/a&gt; plugin installed so I can create charts, and that weighs in at nearly 1MB of JavaScript. According to Firefox my browser was loading that entire 1MB file from scratch for every page load!&lt;/p&gt;

&lt;p&gt;The culprit there turned out to be my &lt;a href="https://github.com/simonw/datasette-auth-github"&gt;datasette-auth-github&lt;/a&gt; plugin.&lt;/p&gt;

&lt;p&gt;As detailed in &lt;a href="https://github.com/simonw/datasette-auth-github/issues/6"&gt;issue #6&lt;/a&gt;, the ASGI middleware adds a &lt;code&gt;cache-control: private&lt;/code&gt; header to every HTTP response - to make sure authentication-required content isn't accidentally cached by any intermediary proxies (I originally &lt;a href="https://datasette.readthedocs.io/en/stable/performance.html"&gt;designed Datasette&lt;/a&gt; to run effectively behind Varnish and Cloudflare).&lt;/p&gt;

&lt;p&gt;This was applied to everything... including the static JavaScript and CSS served up by the plugin!&lt;/p&gt;

&lt;p&gt;To fix this, I shipped a new release of &lt;code&gt;datasette-auth-github&lt;/code&gt; with a &lt;code&gt;cacheable_prefixes&lt;/code&gt; setting, which defaults to &lt;code&gt;/-/static/&lt;/code&gt; and &lt;code&gt;/-/static-plugins/&lt;/code&gt; when run as a Datasette plugin. See &lt;a href="https://github.com/simonw/datasette-auth-github/issues/47"&gt;issue 47&lt;/a&gt; for details.&lt;/p&gt;
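&lt;p&gt;The idea can be sketched as a small ASGI middleware. This is a minimal illustration, not the plugin's actual implementation: the &lt;code&gt;PrivateCacheControl&lt;/code&gt; class and the &lt;code&gt;response_headers&lt;/code&gt; demo helper are hypothetical names, and only the &lt;code&gt;cacheable_prefixes&lt;/code&gt; setting and its two default values come from the release described above.&lt;/p&gt;

```python
import asyncio


class PrivateCacheControl:
    """ASGI middleware sketch: mark responses private unless the path is cacheable."""

    def __init__(self, app, cacheable_prefixes=("/-/static/", "/-/static-plugins/")):
        self.app = app
        self.cacheable_prefixes = tuple(cacheable_prefixes)

    async def __call__(self, scope, receive, send):
        if scope["type"] != "http" or scope["path"].startswith(self.cacheable_prefixes):
            # Non-HTTP traffic and static assets pass through untouched.
            await self.app(scope, receive, send)
            return

        async def send_private(message):
            if message["type"] == "http.response.start":
                # Drop any existing cache-control header, then add the private one.
                headers = [
                    (name, value)
                    for name, value in message.get("headers", [])
                    if name.lower() != b"cache-control"
                ]
                headers.append((b"cache-control", b"private"))
                message = {**message, "headers": headers}
            await send(message)

        await self.app(scope, receive, send_private)


def response_headers(path):
    """Run a tiny demo app through the middleware and return its response headers."""

    async def demo_app(scope, receive, send):
        await send({"type": "http.response.start", "status": 200, "headers": []})
        await send({"type": "http.response.body", "body": b"ok"})

    async def receive():
        return {"type": "http.request"}

    sent = []

    async def send(message):
        sent.append(message)

    scope = {"type": "http", "path": path}
    asyncio.run(PrivateCacheControl(demo_app)(scope, receive, send))
    return dict(sent[0]["headers"])
```

&lt;p&gt;With this sketch, &lt;code&gt;response_headers("/db/table")&lt;/code&gt; includes &lt;code&gt;cache-control: private&lt;/code&gt;, while &lt;code&gt;response_headers("/-/static/app.js")&lt;/code&gt; leaves the headers alone so the browser is free to cache the asset.&lt;/p&gt;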

&lt;h3 id="areasofresearch"&gt;Areas of research&lt;/h3&gt;

&lt;h4 id="connectionpooling"&gt;Connection pooling&lt;/h4&gt;

&lt;p&gt;My most significant ongoing project for Datasette right now concerns database connection pooling.&lt;/p&gt;

&lt;p&gt;For &lt;a href="https://github.com/simonw/datasette/issues/417"&gt;Datasette Library&lt;/a&gt; I want to be able to support potentially hundreds of attached databases, and &lt;a href="https://github.com/simonw/datasette/issues/567"&gt;Datasette Edit&lt;/a&gt; calls for the ability to use a writable (as opposed to read-only) connection.&lt;/p&gt;

&lt;p&gt;This means I need to get smarter about database connections, and last week I finally made some solid progress on figuring out &lt;a href="https://github.com/simonw/datasette/issues/569"&gt;more advanced connection pooling&lt;/a&gt;. Still more work to go but I'm finally feeling unblocked on this after sweating over it for a couple of months without writing any code.&lt;/p&gt;

&lt;p&gt;I also added a &lt;code&gt;/-/threads&lt;/code&gt; page to Datasette (&lt;a href="https://latest.datasette.io/-/threads"&gt;demo&lt;/a&gt;) to make it easier to understand the threads being run by the application.&lt;/p&gt;

&lt;h4 id="recursivectesinsqlite"&gt;Recursive CTEs in SQLite&lt;/h4&gt;

&lt;p&gt;&lt;code&gt;twitter-to-sqlite&lt;/code&gt; generates a &lt;code&gt;tweets&lt;/code&gt; table with both &lt;code&gt;id&lt;/code&gt; and &lt;code&gt;in_reply_to_status_id&lt;/code&gt; (nullable) columns.&lt;/p&gt;

&lt;p&gt;SQLite supports recursive CTEs, &lt;a href="https://www.sqlite.org/lang_with.html"&gt;documented here&lt;/a&gt;. In theory, this means it should be possible to write SQL queries that analyze Twitter threads - showing me the tweets that are at the bottom of the longest reply chain, for example.&lt;/p&gt;

&lt;p&gt;I was feeling lazy and &lt;a href="https://twitter.com/simonw/status/1180884814771081216"&gt;asked if anyone knew how to do that&lt;/a&gt; on Twitter. Robin Houston, Matthew Somerville and Piers Cawley all chipped in with useful tips, which I've &lt;a href="https://gist.github.com/simonw/656a8c6e4688f720773c474080abe1b0"&gt;collected in this gist&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;I particularly enjoyed learning how Piers uses recursive SQL queries to &lt;a href="https://bofh.org.uk/2019/03/04/recursive-sql-recipes/"&gt;help run his bakery&lt;/a&gt;.&lt;/p&gt;
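&lt;p&gt;For illustration, here is a minimal sketch of a recursive CTE that finds the tweet at the bottom of the longest reply chain. It is not taken from the gist. It assumes a &lt;code&gt;tweets&lt;/code&gt; table with just the &lt;code&gt;id&lt;/code&gt; and &lt;code&gt;in_reply_to_status_id&lt;/code&gt; columns, and it treats only tweets with a null &lt;code&gt;in_reply_to_status_id&lt;/code&gt; as thread starts; the helper names and sample rows are made up.&lt;/p&gt;

```python
import sqlite3

LONGEST_CHAIN_SQL = """
with recursive chain(id, root_id, depth) as (
    -- Anchor rows: tweets that are not replies start a chain at depth 0.
    select id, id, 0 from tweets where in_reply_to_status_id is null
    union all
    -- Recursive step: each reply extends the chain of the tweet it answers.
    select tweets.id, chain.root_id, chain.depth + 1
    from tweets join chain on tweets.in_reply_to_status_id = chain.id
)
select id, root_id, depth from chain order by depth desc limit 1
"""


def deepest_reply(db):
    """Return (tweet id, thread root id, depth) for the deepest reply found."""
    return db.execute(LONGEST_CHAIN_SQL).fetchone()


def demo_db():
    """Build an in-memory database with a few made-up tweets."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "create table tweets (id integer primary key, in_reply_to_status_id integer)"
    )
    db.executemany(
        "insert into tweets values (?, ?)",
        [(1, None), (2, 1), (3, 2), (4, 1), (5, None), (6, 5)],
    )
    return db
```

&lt;p&gt;Replies whose parent tweet is missing from the table are never reached by this query, so a real archive would probably need a different anchor condition.&lt;/p&gt;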

&lt;h4 id="weeknotes-logging"&gt;logging&lt;/h4&gt;

&lt;p&gt;I finally made some progress on getting my head around the Python standard library logging module, a long-time nemesis.&lt;/p&gt;

&lt;p&gt;The breakthrough was to sit down and actually &lt;a href="https://github.com/python/cpython/blob/c5a7e0ce194c0eafe82eb3e431881012398e7d46/Lib/logging/__init__.py"&gt;read the source code&lt;/a&gt; - which helped me understand that the entire point of the library is to be a globally configured logging system that makes careful use of threading primitives to ensure you can log without worrying about concurrency.&lt;/p&gt;

&lt;p&gt;I've been worrying about that aspect of logging for years, so it's reassuring to see that Vinay Sajip has that comprehensively covered already.&lt;/p&gt;
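&lt;p&gt;A small sketch of what that means in practice: several threads log through one shared handler with no locking of their own. The logger name, thread names and message counts are arbitrary choices for the demo.&lt;/p&gt;

```python
import io
import logging
import threading


def log_from_threads(thread_count=4, messages_per_thread=50):
    """Log from several threads through one handler and return the lines written."""
    stream = io.StringIO()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(threadName)s %(message)s"))

    logger = logging.getLogger("dogsheep.demo")  # hypothetical logger name
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep the demo output out of the root logger
    logger.addHandler(handler)

    def worker():
        # No explicit locking here: the handler serializes emit() calls itself.
        for number in range(messages_per_thread):
            logger.info("message %d", number)

    threads = [
        threading.Thread(target=worker, name=f"worker-{index}")
        for index in range(thread_count)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()

    logger.removeHandler(handler)
    return stream.getvalue().splitlines()
```

&lt;p&gt;Every call to &lt;code&gt;logger.info&lt;/code&gt; ends up as one complete line in the stream, because the handler acquires its own lock around each write.&lt;/p&gt;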

&lt;h3 id="easierinstallationfordatasette"&gt;Easier installation for Datasette&lt;/h3&gt;

&lt;p&gt;I &lt;a href="https://simonwillison.net/2019/Sep/10/jsk-fellowship/"&gt;want journalists&lt;/a&gt; to be able to install Datasette. I don't want to have to talk them through installing Xcode to install Homebrew to install Python 3 first!&lt;/p&gt;

&lt;p&gt;Raffaele Messuti did &lt;a href="https://github.com/simonw/datasette/issues/93"&gt;some great work&lt;/a&gt; investigating pyinstaller for this a while ago. I started &lt;a href="https://twitter.com/simonw/status/1180866651962560512"&gt;a conversation on Twitter&lt;/a&gt; a few days ago after investigating &lt;a href="https://pex.readthedocs.io/en/stable/whatispex.html"&gt;pex&lt;/a&gt; (neat but not quite what I want, since users still have to install Python). From that thread I got some interesting further leads:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://docs.python-guide.org/shipping/freezing/"&gt;Freezing your code&lt;/a&gt; in The Hitchhiker's Guide to Python describes the problem space in general.&lt;/li&gt;

&lt;li&gt;&lt;a href="https://briefcase.readthedocs.io/"&gt;Briefcase&lt;/a&gt; is the part of the BeeWare suite that deals with converting a Python project into a standalone native application (for a bunch of different platforms).&lt;/li&gt;

&lt;li&gt;&lt;a href="https://github.com/indygreg/PyOxidizer"&gt;PyOxidizer&lt;/a&gt; is a relatively new tool for building standalone executables that bundle a Python interpreter, on top of the Rust and Cargo build ecosystem.&lt;/li&gt;

&lt;li&gt;&lt;a href="https://github.com/conda/constructor"&gt;Conda Constructor&lt;/a&gt; helps you construct an installer for a group of Conda packages (I need to figure out Conda).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Next step: actively try some of these out and see what I can do with them.&lt;/p&gt;

&lt;p&gt;My longer term goal is to package Datasette up with &lt;a href="https://electronjs.org/"&gt;Electron&lt;/a&gt; and make it available as a regular Windows and OS X application, for people who aren't comfortable using the command-line directly.&lt;/p&gt;

&lt;h3 id="weeknotes-3-books"&gt;Books&lt;/h3&gt;

&lt;p&gt;I've been mostly working through required reading for my Stanford courses. Strategic Communication in particular has some excellent text books:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.amazon.com/Speaking-without-Freaking-Out-Techniques/dp/1465290478"&gt;Speaking Up without Freaking Out: 50 Techniques for Confident and Compelling Presenting&lt;/a&gt; by Matthew Abrahams - despite the clickbait listicle title, this book is really good. I'm not a nervous public speaker, but I'm finding all kinds of actionable advice here on giving higher quality talks. And it's nice and short.&lt;/li&gt;

&lt;li&gt;&lt;a href="https://www.amazon.com/Resonate-Present-Stories-Transform-Audiences/dp/0470632011"&gt;Resonate: Present Visual Stories that Transform Audiences&lt;/a&gt; by Nancy Duarte. This digs deep into the &lt;a href="https://en.wikipedia.org/wiki/Hero%27s_journey"&gt;Hero's journey&lt;/a&gt; as a methodology for presenting, but with an interesting twist: it encourages you to think about your audience's journey and how you can take them from unaware, to skeptical, to tentatively committed, to excited and convinced.&lt;/li&gt;
&lt;/ul&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sqlite"&gt;sqlite&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/dogsheep"&gt;dogsheep&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/weeknotes"&gt;weeknotes&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="projects"/><category term="sqlite"/><category term="dogsheep"/><category term="weeknotes"/></entry><entry><title>twitter-to-sqlite 0.6, with track and follow</title><link href="https://simonwillison.net/2019/Oct/6/twitter-to-sqlite/#atom-tag" rel="alternate"/><published>2019-10-06T04:54:33+00:00</published><updated>2019-10-06T04:54:33+00:00</updated><id>https://simonwillison.net/2019/Oct/6/twitter-to-sqlite/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/dogsheep/twitter-to-sqlite/releases/tag/0.6"&gt;twitter-to-sqlite 0.6, with track and follow&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
I shipped a new release of my twitter-to-sqlite command-line tool this evening. It now includes experimental features for subscribing to the Twitter streaming API: you can track keywords or follow users and matching Tweets will be written to a SQLite database in real-time as they come in through the API. Since Datasette supports mutable databases now you can run Datasette against the database and run queries against the tweets as they are inserted into the tables.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/realtime"&gt;realtime&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/twitter"&gt;twitter&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/dogsheep"&gt;dogsheep&lt;/a&gt;&lt;/p&gt;



</summary><category term="projects"/><category term="realtime"/><category term="twitter"/><category term="dogsheep"/></entry><entry><title>Client-Side Certificate Authentication with nginx</title><link href="https://simonwillison.net/2019/Oct/5/client-side-certificate-authentication-nginx/#atom-tag" rel="alternate"/><published>2019-10-05T17:26:35+00:00</published><updated>2019-10-05T17:26:35+00:00</updated><id>https://simonwillison.net/2019/Oct/5/client-side-certificate-authentication-nginx/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://fardog.io/blog/2017/12/30/client-side-certificate-authentication-with-nginx/"&gt;Client-Side Certificate Authentication with nginx&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
I’m intrigued by client-side browser certificates, which allow you to lock down a website such that only browsers with a specific certificate installed can access it. They work on both laptops and mobile phones. I followed the steps in this tutorial and managed to get an nginx instance running which only allows connections from my personal laptop and iPhone.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/certificates"&gt;certificates&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/nginx"&gt;nginx&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/security"&gt;security&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/dogsheep"&gt;dogsheep&lt;/a&gt;&lt;/p&gt;



</summary><category term="certificates"/><category term="nginx"/><category term="security"/><category term="dogsheep"/></entry></feed>