<?xml version="1.0" encoding="utf-8"?>
<feed xml:lang="en-us" xmlns="http://www.w3.org/2005/Atom"><title>Simon Willison's Weblog: digitalocean</title><link href="http://simonwillison.net/" rel="alternate"/><link href="http://simonwillison.net/tags/digitalocean.atom" rel="self"/><id>http://simonwillison.net/</id><updated>2020-10-07T02:52:54+00:00</updated><author><name>Simon Willison</name></author><entry><title>Running Datasette on DigitalOcean App Platform</title><link href="https://simonwillison.net/2020/Oct/7/datasette-digitalocean-app-platform/#atom-tag" rel="alternate"/><published>2020-10-07T02:52:54+00:00</published><updated>2020-10-07T02:52:54+00:00</updated><id>https://simonwillison.net/2020/Oct/7/datasette-digitalocean-app-platform/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://til.simonwillison.net/til/til/digitalocean_datasette-on-digitalocean-app-platform.md"&gt;Running Datasette on DigitalOcean App Platform&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
I spent some time with DigitalOcean’s new App Platform today, which is a Heroku-style PaaS that starts at $5/month. It looks like it could be a really good fit for Datasette. Disk is ephemeral, but if you’re publishing read-only data that doesn’t matter since you can build the SQLite database as part of the deployment and bundle it up in the Docker/Kubernetes container.

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/simonw/status/1313673393690959872"&gt;@simonw&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/deployment"&gt;deployment&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/heroku"&gt;heroku&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/digitalocean"&gt;digitalocean&lt;/a&gt;&lt;/p&gt;



</summary><category term="deployment"/><category term="heroku"/><category term="datasette"/><category term="digitalocean"/></entry><entry><title>Weeknotes: Datasette Cloud and zero downtime deployments</title><link href="https://simonwillison.net/2020/Jan/21/weeknotes-datasette-cloud-and-zero-downtime-deployments/#atom-tag" rel="alternate"/><published>2020-01-21T20:56:46+00:00</published><updated>2020-01-21T20:56:46+00:00</updated><id>https://simonwillison.net/2020/Jan/21/weeknotes-datasette-cloud-and-zero-downtime-deployments/#atom-tag</id><summary type="html">
    &lt;p&gt;Yesterday's piece on &lt;a href="https://simonwillison.net/2020/Jan/21/github-actions-cloud-run/"&gt;Tracking FARA by deploying a data API using GitHub Actions and Cloud Run&lt;/a&gt; was originally intended to be my weeknotes, but ended up getting a bit too involved.&lt;/p&gt;

&lt;p&gt;Aside from playing with GitHub Actions and Cloud Run, my focus over the past week has been working on Datasette Cloud. Datasette Cloud is the current name I'm using for my hosted &lt;a href="https://datasette.readthedocs.io/"&gt;Datasette&lt;/a&gt; product - the idea being that I'll find it &lt;em&gt;a lot&lt;/em&gt; easier to get &lt;a href="https://simonwillison.net/2019/Sep/10/jsk-fellowship/"&gt;feedback on Datasette from journalists&lt;/a&gt; if they can use it without having to install anything!&lt;/p&gt;

&lt;p&gt;My MVP for Datasette Cloud is that I can use it to instantly provision a new, private Datasette instance for a journalist (or team of journalists) that they can then sign into, start playing with and start uploading their data to (initially as CSV files).&lt;/p&gt;

&lt;p&gt;I have to solve quite a few problems to get there:&lt;/p&gt;

&lt;ul&gt;&lt;li&gt;Secure, isolated instances of Datasette. A team or user should only be able to see their own files. I plan to solve this using Docker containers that are mounted such that they can only see their own dedicated volumes.&lt;/li&gt;&lt;li&gt;The ability to provision new instances as easily as possible - and give each one its own HTTPS subdomain.&lt;/li&gt;&lt;li&gt;Authentication: users need to be able to register and sign in to accounts. I could use &lt;a href="https://github.com/simonw/datasette-auth-github"&gt;datasette-auth-github&lt;/a&gt; for this but I'd like to be able to support regular email/password accounts too.&lt;/li&gt;&lt;li&gt;Users need to be able to upload CSV files and have them converted into a SQLite database compatible with Datasette.&lt;/li&gt;&lt;/ul&gt;

&lt;h3&gt;Zero downtime deployments&lt;/h3&gt;

&lt;p&gt;I have a stretch goal which I'm taking pretty seriously: I want to have a mechanism in place for zero-downtime deployments of new versions of the software.&lt;/p&gt;

&lt;p&gt;Arguably this is an unnecessary complication for an MVP. I may not fully implement it, but I do want to at least know that the path I've taken is compatible with zero downtime deployments.&lt;/p&gt;

&lt;p&gt;Why do zero downtime deployments matter so much to me? Because they are desirable for rapid iteration, and crucial for setting up continuous deployment. Even a couple of seconds of downtime during a deployment creates a psychological incentive not to deploy too often. I've seen the productivity boost that deploying fearlessly multiple times a day brings, and I want it.&lt;/p&gt;

&lt;p&gt;So I've been doing a bunch of research into zero downtime deployment options (thanks to some &lt;a href="https://twitter.com/simonw/status/1217599189921628160"&gt;great help on Twitter&lt;/a&gt;) and I think I have something that's going to work for me.&lt;/p&gt;

&lt;p&gt;The first ingredient is &lt;a href="https://docs.traefik.io/"&gt;Traefik&lt;/a&gt; - a new-to-me edge router (similar to nginx) which has a delightful focus on runtime configuration based on automatic discovery.&lt;/p&gt;

&lt;p&gt;It works with a bunch of different technology stacks, but I'm going to be using it with regular Docker. Traefik watches for new Docker containers, reads their labels and uses that to reroute traffic to them.&lt;/p&gt;

&lt;p&gt;So I can launch a new Docker container, apply the Docker label &lt;code&gt;"traefik.frontend.rule": "Host:subdomain.mydomain.com"&lt;/code&gt; and Traefik will start proxying traffic to that subdomain directly to that container.&lt;/p&gt;

&lt;p&gt;Traefik also has extremely robust built-in support for Let's Encrypt to issue certificates. I managed to &lt;a href="https://docs.traefik.io/https/acme/#wildcard-domains"&gt;issue a wildcard TLS certificate&lt;/a&gt; for my entire domain, so new subdomains are encrypted straight away. This did require me to give Traefik API access to modify DNS entries - I'm running DNS for this project on Digital Ocean and thankfully Traefik knows how to do this by talking to their API.&lt;/p&gt;

&lt;p&gt;That solves provisioning: when I create a new account I can call the Docker API (from Python) to start up a new, labelled container on a subdomain protected by a TLS certificate.&lt;/p&gt;
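&lt;p&gt;The label-building half of that is trivially sketchable - the domain, image name and the commented-out &lt;code&gt;docker&lt;/code&gt; SDK call below are illustrative assumptions, not my actual code:&lt;/p&gt;

```python
def traefik_labels(subdomain, domain="mydomain.com"):
    # Build the Docker labels Traefik watches for. The frontend.rule
    # syntax is Traefik 1.x style, matching the label mentioned above;
    # the domain is a placeholder.
    return {
        "traefik.enable": "true",
        "traefik.frontend.rule": f"Host:{subdomain}.{domain}",
    }

# With the docker SDK (pip install docker) the provisioning call would
# then be roughly:
#
#   import docker
#   client = docker.from_env()
#   client.containers.run(
#       "my-datasette-image",  # illustrative image name
#       detach=True,
#       labels=traefik_labels("newaccount"),
#   )
```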

&lt;p&gt;I still needed a way to run a zero-downtime deployment of a new container (for example when I release a new version of Datasette and want to upgrade everyone). After quite a bit of research (during which I discovered you can't modify the labels on a Docker container without restarting it) I settled on the approach described in &lt;a href="https://coderbook.com/@marcus/how-to-do-zero-downtime-deployments-of-docker-containers/"&gt;this article&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Essentially you configure Traefik to retry failed requests, start a new, updated container with the same routing information as the existing one (causing Traefik to load balance HTTP requests across both), then shut down the old container and trust Traefik to retry in-flight requests against the one that's still running.&lt;/p&gt;
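&lt;p&gt;The sequence is easier to see as code. This is a pure-Python simulation of the rollover - the dicts stand in for real Docker API calls, so it's a sketch of the choreography rather than a working deploy script:&lt;/p&gt;

```python
def rolling_deploy(running, new_container):
    # Start the new container with the same routing labels - Traefik
    # now load balances across both - then stop the old one and rely
    # on Traefik's retry to re-send in-flight requests to the survivor.
    new_container["labels"] = dict(running[-1]["labels"])  # same routing rule
    new_container["state"] = "running"
    running.append(new_container)   # both containers serve traffic briefly
    old = running.pop(0)
    old["state"] = "stopped"        # Traefik retries against the new one
    return running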

&lt;p&gt;Rudimentary testing with &lt;code&gt;ab&lt;/code&gt; suggested that this is working as desired.&lt;/p&gt;

&lt;p&gt;One remaining problem: if Traefik is running in a Docker container and proxying all of my traffic, how can I upgrade Traefik itself without any downtime?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://twitter.com/simonw/status/1218604019033100288"&gt;Consensus on Twitter&lt;/a&gt; seems to be that Docker on its own doesn't have a great mechanism for this (I was hoping I could re-route port 80 traffic to the host to a different container in an atomic way). But... &lt;code&gt;iptables&lt;/code&gt; has mechanisms that can re-route traffic from one port to another - so I should be able to run a new Traefik container on a different port and re-route to it at the operating system level.&lt;/p&gt;

&lt;p&gt;That's quite enough yak shaving around zero downtime deployments for now!&lt;/p&gt;

&lt;h3 id="datasette-upload-csvs"&gt;datasette-upload-csvs&lt;/h3&gt;

&lt;p&gt;A big problem I'm seeing with the current Datasette ecosystem is that while Datasette offers a web-based user interface for querying and accessing data, the &lt;a href="https://datasette.readthedocs.io/en/0.33/ecosystem.html#tools-for-creating-sqlite-databases"&gt;tools I've written for actually creating those databases&lt;/a&gt; are decidedly command-line only.&lt;/p&gt;

&lt;p&gt;Telling journalists they have to learn to install and run software on the command-line is way too high a barrier to entry.&lt;/p&gt;

&lt;p&gt;I've always intended to have Datasette plugins that can handle uploading and converting data. It's time to actually build one!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/simonw/datasette-upload-csvs"&gt;datasette-upload-csvs&lt;/a&gt; is what I've got so far. It has a big warning not to use it in the README - it's &lt;em&gt;very&lt;/em&gt; alpha sofware at the moment - but it does prove that the concept can work.&lt;/p&gt;

&lt;p&gt;It uses the &lt;a href="https://datasette.readthedocs.io/en/stable/plugins.html#asgi-wrapper-datasette"&gt;asgi_wrapper&lt;/a&gt; plugin hook to intercept requests to the path &lt;code&gt;/-/upload-csv&lt;/code&gt; and forward them on to another ASGI app, written using Starlette, which provides a basic upload form and then handles the upload.&lt;/p&gt;

&lt;p&gt;Uploaded CSVs are converted to SQLite using &lt;a href="https://sqlite-utils.readthedocs.io/"&gt;sqlite-utils&lt;/a&gt; and written to the first mutable database attached to Datasette.&lt;/p&gt;
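&lt;p&gt;sqlite-utils does the conversion with type detection and more, but the core of it can be sketched with just the standard library - the table and column names here are illustrative:&lt;/p&gt;

```python
import csv
import io
import sqlite3

def csv_to_sqlite(csv_text, db, table):
    # Bare stdlib version of what sqlite-utils does for the plugin:
    # one TEXT column per CSV header, one row per CSV record.
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    cols = ", ".join(f'"{c}" TEXT' for c in rows[0].keys())
    db.execute(f'CREATE TABLE "{table}" ({cols})')
    placeholders = ", ".join("?" for _ in rows[0])
    db.executemany(
        f'INSERT INTO "{table}" VALUES ({placeholders})',
        [tuple(r.values()) for r in rows],
    )
    return len(rows)

db = sqlite3.connect(":memory:")
count = csv_to_sqlite("name,city\nBone Palace,San Francisco\n", db, "museums")
```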

&lt;p&gt;It needs a bunch more work (and tests) before I'm comfortable telling people to use it, but it does at least exist as a proof of concept for me to iterate on.&lt;/p&gt;

&lt;h3&gt;datasette-auth-django-cookies&lt;/h3&gt;

&lt;p&gt;No code for this yet, but I'm beginning to flesh it out as a concept.&lt;/p&gt;

&lt;p&gt;I don't particularly want to implement user registration and authentication and cookies and password hashing. I know how to do it, which means I know it's not something you should re-roll for every project.&lt;/p&gt;

&lt;p&gt;Django has a really well designed, robust authentication system. Can't I just use that?&lt;/p&gt;

&lt;p&gt;Since all of my applications will be running on subdomains of a single domain, my current plan is to have a regular Django application which handles registration and logins. Each subdomain will then run a custom piece of Datasette ASGI middleware which knows how to read and validate the Django authentication cookie.&lt;/p&gt;
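&lt;p&gt;Django signs its cookies with HMAC under the hood, so the Datasette-side middleware only needs a read-only verification step. Here's a stdlib sketch of that idea - a deliberately simplified signing scheme, &lt;em&gt;not&lt;/em&gt; compatible with Django's actual &lt;code&gt;django.core.signing&lt;/code&gt; format:&lt;/p&gt;

```python
import hashlib
import hmac

SECRET = b"shared-secret"  # in reality Django's SECRET_KEY, shared with the middleware

def sign(value):
    # What the Django side would do when issuing the cookie
    sig = hmac.new(SECRET, value.encode(), hashlib.sha256).hexdigest()
    return f"{value}:{sig}"

def validate(cookie):
    # What the Datasette middleware would do: verify, never issue.
    value, _, sig = cookie.rpartition(":")
    expected = hmac.new(SECRET, value.encode(), hashlib.sha256).hexdigest()
    return value if hmac.compare_digest(sig, expected) else None
```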

&lt;p&gt;This should give me single sign-on with a single, audited codebase for registration and login with (hopefully) the least amount of work needed to integrate it with Datasette.&lt;/p&gt;

&lt;p&gt;Code for this will hopefully follow over the next week.&lt;/p&gt;

&lt;h3&gt;Niche Museums - now publishing weekly&lt;/h3&gt;

&lt;p&gt;I hit a milestone with my &lt;a href="https://www.niche-museums.com/"&gt;Niche Museums&lt;/a&gt; project: the site now lists details of 100 museums!&lt;/p&gt;

&lt;p&gt;For the 100th entry I decided to celebrate with by far the most rewarding (and exclusive) niche museum experience I've ever had: &lt;a href="https://www.niche-museums.com/browse/museums/100"&gt;Ray Bandar's Bone Palace&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;You should read the entry. The short version is that Ray Bandar collected 7,000 animal skulls over a sixty-year period, and Natalie managed to score us a tour of his incredible basement mere weeks before the collection was donated to the California Academy of Sciences.&lt;/p&gt;

&lt;img src="https://niche-museums.imgix.net/ray-bandar.jpeg?w=1600&amp;amp;h=800&amp;amp;fit=crop&amp;amp;auto=compress" alt="The basement full of skulls" style="max-width: 100%" /&gt;

&lt;p&gt;Posting one museum a day was taking up an increasing amount of my time, as I had to delve into the depths of my museums-I-have-visited backlog and do ever more research. Now that I've hit 100 I'm going to switch to publishing one a week, which should also help me visit new ones quickly enough to keep the backlog full!&lt;/p&gt;

&lt;p&gt;So I only posted four this week:&lt;/p&gt;

&lt;ul&gt;&lt;li&gt;&lt;a href="https://www.niche-museums.com/browse/museums/97"&gt;The ruins of Llano del Rio&lt;/a&gt; in Los Angeles County&lt;/li&gt;&lt;li&gt;&lt;a href="https://www.niche-museums.com/browse/museums/98"&gt;Cleveland Hungarian Museum&lt;/a&gt; in Cleveland&lt;/li&gt;&lt;li&gt;&lt;a href="https://www.niche-museums.com/browse/museums/99"&gt;New Orleans Historic Voodoo Museum&lt;/a&gt; in New Orleans&lt;/li&gt;&lt;li&gt;&lt;a href="https://www.niche-museums.com/browse/museums/100"&gt;Ray Bandar's Bone Palace&lt;/a&gt; in San Francisco&lt;/li&gt;&lt;/ul&gt;

&lt;p&gt;I also &lt;a href="https://github.com/simonw/museums/commits/842dfb96"&gt;built a simple JavaScript image gallery&lt;/a&gt; to better display the 54 photos I published from our trip to Ray Bandar's basement.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/csv"&gt;csv&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/deployment"&gt;deployment&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/museums"&gt;museums&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/zero-downtime"&gt;zero-downtime&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/docker"&gt;docker&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/weeknotes"&gt;weeknotes&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/traefik"&gt;traefik&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette-cloud"&gt;datasette-cloud&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/digitalocean"&gt;digitalocean&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="csv"/><category term="deployment"/><category term="museums"/><category term="projects"/><category term="zero-downtime"/><category term="docker"/><category term="datasette"/><category term="weeknotes"/><category term="traefik"/><category term="datasette-cloud"/><category term="digitalocean"/></entry><entry><title>Weeknotes: Improv at Stanford, planning Datasette Cloud</title><link href="https://simonwillison.net/2020/Jan/14/stanford-planning-datasette-cloud/#atom-tag" rel="alternate"/><published>2020-01-14T00:22:18+00:00</published><updated>2020-01-14T00:22:18+00:00</updated><id>https://simonwillison.net/2020/Jan/14/stanford-planning-datasette-cloud/#atom-tag</id><summary type="html">
    &lt;p&gt;Last week was the first week of the quarter at Stanford - which is called "shopping week" here because students are expected to try different classes to see which ones they are going to stick with.&lt;/p&gt;

&lt;p&gt;I've settled on three classes this quarter: &lt;a href="https://explorecourses.stanford.edu/search?q=taps+103"&gt;Beginning Improvising&lt;/a&gt;, &lt;a href="https://explorecourses.stanford.edu/search?q=DESINST%20240"&gt;Designing Machine Learning&lt;/a&gt; and &lt;a href="https://explorecourses.stanford.edu/search?q=STRAMGT%20353"&gt;Entrepreneurship: Formation of New Ventures&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Beginning Improvising&lt;/strong&gt; is the Stanford improv theater course. It's a big time commitment: three two-hour sessions a week for ten weeks is nearly 60 hours of improv!&lt;/p&gt;

&lt;p&gt;It's already proving to be really interesting though: it turns out the course is a thinly disguised applied psychology course.&lt;/p&gt;

&lt;p&gt;Improv is about creating a creative space for other people to shine. The applications to professional teamwork are obvious and fascinating to me. I'll probably write more about this as the course continues.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Designing Machine Learning&lt;/strong&gt; is a class at the Stanford d.School taught by &lt;a href="https://twitter.com/michellercarney"&gt;Michelle Carney&lt;/a&gt; and &lt;a href="https://www.linkedin.com/in/emilykathryn/"&gt;Emily Callaghan&lt;/a&gt;. It focuses on multidisciplinary applications of machine learning, mixing together students from many different disciplines around Stanford.&lt;/p&gt;

&lt;p&gt;I took a &lt;a href="https://simonwillison.net/2018/Oct/29/transfer-learning/"&gt;fast.ai deep learning course&lt;/a&gt; last year which gave me a basic understanding of the code side of neural networks, but I'm much more interested in figuring out applications so this seems like a much more interesting option than a more code-focused course.&lt;/p&gt;

&lt;p&gt;The class started out building some initial models using &lt;a href="https://teachablemachine.withgoogle.com/"&gt;Google's Teachable Machine tool&lt;/a&gt;, which is &lt;em&gt;fascinating&lt;/em&gt;. It lets you train transfer learning models for image, audio and posture recognition entirely in your browser - no data is transferred to Google's servers at all. You can then export those models and use them with a variety of different libraries - I've got them to work with both JavaScript and Python already.&lt;/p&gt;

&lt;p&gt;I'm taking &lt;strong&gt;Entrepreneurship: Formation of New Ventures&lt;/strong&gt; because of the rave reviews I heard from other JSK fellows who took it last quarter. It's a classic case-study business school class: each session features a guest speaker who is a successful entrepreneur, and the class discusses their case for the first two thirds of the session while they listen in - then finds out how well the discussion matched to what actually happened.&lt;/p&gt;

&lt;h3&gt;Planning Datasette Cloud&lt;/h3&gt;

&lt;p&gt;Shopping week kept me pretty busy so I've not done much actual development over the past week, but I have started planning out and researching my next major project, which I'm currently calling &lt;em&gt;Datasette Cloud&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Datasette Cloud will be an invite-only hosted SaaS version of &lt;a href="https://datasette.readthedocs.io/"&gt;Datasette&lt;/a&gt;. It's designed to help get news organizations on board with the software without having to talk them through figuring out their own hosting, so I can help them solve real problems and learn more about how the ecosystem should evolve to support them.&lt;/p&gt;

&lt;p&gt;I'd love to be able to run this on serverless hosting platforms like Google Cloud Run or Heroku, but sadly those tools aren't an option to me due to a key problem: I'm trying to build a &lt;em&gt;stateful&lt;/em&gt; service (SQLite databases need to live on a local disk) in 2020.&lt;/p&gt;

&lt;p&gt;I posed this challenge &lt;a href="https://twitter.com/simonw/status/1182077259839991808"&gt;on Twitter&lt;/a&gt; back in October:&lt;/p&gt;

&lt;blockquote class="twitter-tweet"&gt;&lt;p lang="en" dir="ltr"&gt;What&amp;#39;s the easiest way of running a stateful web application these days?&lt;br /&gt;&lt;br /&gt;Stateful as in it supports a process which can accept web requests and is allowed to write to a durable disk&lt;br /&gt;&lt;br /&gt;So not Heroku/Zeit Now/Cloud Run etc&lt;/p&gt;- Simon Willison (@simonw) &lt;a href="https://twitter.com/simonw/status/1182077259839991808?ref_src=twsrc%5Etfw"&gt;October 9, 2019&lt;/a&gt;&lt;/blockquote&gt;

&lt;p&gt;I've been exploring my options since then, and I think I've settled on a decidedly 2010-era way of doing this: I'm going to run my own instances! So I've been exploring hosting Datasette on both AWS Lightsail and Digital Ocean Droplets over the past few months.&lt;/p&gt;

&lt;p&gt;My current plan is to have each Datasette Cloud account run as a Datasette instance in its own Docker container, primarily to ensure filesystem isolation: different accounts must not be able to see each other's database files.&lt;/p&gt;

&lt;p&gt;I started &lt;a href="https://twitter.com/simonw/status/1216468790508015616"&gt;another discussion about this&lt;/a&gt; on Twitter and had several recommendations for &lt;a href="https://docs.traefik.io/"&gt;Traefik&lt;/a&gt; as a load balancer for assigning hostnames to different Docker containers, which is exactly what I need to do.&lt;/p&gt;

&lt;p&gt;So this afternoon I made my way through Digital Ocean's outstanding tutorial &lt;a href="https://www.digitalocean.com/community/tutorials/how-to-use-traefik-as-a-reverse-proxy-for-docker-containers-on-ubuntu-18-04"&gt;How To Use Traefik as a Reverse Proxy for Docker Containers on Ubuntu 18.04&lt;/a&gt; and I think I've convinced myself that this is a smart way forward.&lt;/p&gt;

&lt;p&gt;So, mostly a research week but I've got a solid plan for my next steps.&lt;/p&gt;

&lt;h3&gt;This week's Niche Museums&lt;/h3&gt;

&lt;ul&gt;&lt;li&gt;&lt;a href="https://www.niche-museums.com/browse/museums/90"&gt;Jelly Belly Factory&lt;/a&gt; in Fairfield, CA&lt;/li&gt;&lt;li&gt;&lt;a href="https://www.niche-museums.com/browse/museums/91"&gt;Bevolo Gas Light Museum&lt;/a&gt; in New Orleans, LA&lt;/li&gt;&lt;li&gt;&lt;a href="https://www.niche-museums.com/browse/museums/92"&gt;Museo de las Misiones de Baja California&lt;/a&gt; in Loreto&lt;/li&gt;&lt;li&gt;&lt;a href="https://www.niche-museums.com/browse/museums/93"&gt;Fort Point&lt;/a&gt; in San Francisco, CA&lt;/li&gt;&lt;li&gt;&lt;a href="https://www.niche-museums.com/browse/museums/94"&gt;Donner Memorial State Park Visitor Center&lt;/a&gt; in Nevada County, CA&lt;/li&gt;&lt;li&gt;&lt;a href="https://www.niche-museums.com/browse/museums/95"&gt;Anja Community Reserve&lt;/a&gt; in Madagascar&lt;/li&gt;&lt;li&gt;&lt;a href="https://www.niche-museums.com/browse/museums/96"&gt;Palace of Fine Arts&lt;/a&gt; in San Francisco, CA&lt;/li&gt;&lt;/ul&gt;

&lt;p&gt;I also finally got around to &lt;a href="https://www.niche-museums.com/map"&gt;implementing a map&lt;/a&gt;.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/stanford"&gt;stanford&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/docker"&gt;docker&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/jsk"&gt;jsk&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/weeknotes"&gt;weeknotes&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette-cloud"&gt;datasette-cloud&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/digitalocean"&gt;digitalocean&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="stanford"/><category term="docker"/><category term="jsk"/><category term="weeknotes"/><category term="datasette-cloud"/><category term="digitalocean"/></entry><entry><title>Tracking PG&amp;E outages by scraping to a git repo</title><link href="https://simonwillison.net/2019/Oct/10/pge-outages/#atom-tag" rel="alternate"/><published>2019-10-10T23:32:14+00:00</published><updated>2019-10-10T23:32:14+00:00</updated><id>https://simonwillison.net/2019/Oct/10/pge-outages/#atom-tag</id><summary type="html">
    &lt;p&gt;PG&amp;amp;E have &lt;a href="https://twitter.com/bedwardstiek/status/1182047040932470784"&gt;cut off power&lt;/a&gt; to several million people in northern California, supposedly as a precaution against wildfires.&lt;/p&gt;

&lt;p&gt;As it happens, I've been scraping and recording PG&amp;amp;E's outage data every 10 minutes for the past 4+ months. This data got really interesting over the past two days!&lt;/p&gt;

&lt;p&gt;The original data lives in &lt;a href="https://github.com/simonw/pge-outages"&gt;a GitHub repo&lt;/a&gt; (more importantly in &lt;a href="https://github.com/simonw/pge-outages/commits/master"&gt;the commit history&lt;/a&gt; of that repo).&lt;/p&gt;

&lt;p&gt;Reading JSON in a Git repo isn't particularly productive, so this afternoon I figured out how to transform that data into a SQLite database and publish it with &lt;a href="https://github.com/simonw/datasette"&gt;Datasette&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The result is &lt;code&gt;https://pge-outages.simonwillison.net/&lt;/code&gt; (no longer available)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Update from 27th October 2019&lt;/strong&gt;: I also used the data to create this animation (first shared &lt;a href="https://twitter.com/simonw/status/1188612004572880896"&gt;on Twitter&lt;/a&gt;):&lt;/p&gt;

&lt;video style="max-width: 100%" src="https://static.simonwillison.net/static/2019/outages.mp4" controls="controls"&gt;
  Your browser does not support the video tag.
&lt;/video&gt;

&lt;h3 id="thedatamodeloutagesandsnapshots"&gt;The data model: outages and snapshots&lt;/h3&gt;

&lt;p&gt;The three key tables to understand are &lt;code&gt;outages&lt;/code&gt;, &lt;code&gt;snapshots&lt;/code&gt; and &lt;code&gt;outage_snapshots&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;PG&amp;amp;E assign an outage ID to every outage - where an outage is usually something that affects a few dozen customers. I store these in the &lt;a href="https://pge-outages.simonwillison.net/pge-outages/outages?_sort_desc=outageStartTime"&gt;outages table&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Every 10 minutes I grab a snapshot of their full JSON file, which reports every single outage that is currently ongoing. I store a record of when I grabbed that snapshot in the &lt;a href="https://pge-outages.simonwillison.net/pge-outages/snapshots?_sort_desc=id"&gt;snapshots table&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The most interesting table is &lt;code&gt;outage_snapshots&lt;/code&gt;. Every time I see an outage in the JSON feed, I record a new copy of its data as an &lt;code&gt;outage_snapshot&lt;/code&gt; row. This allows me to reconstruct the full history of any outage, in 10 minute increments.&lt;/p&gt;
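&lt;p&gt;That data model is small enough to sketch in plain SQL. The table and column names follow the description above; every other detail (the extra columns, the sample rows) is illustrative:&lt;/p&gt;

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE outages (id INTEGER PRIMARY KEY, outageStartTime INTEGER);
CREATE TABLE snapshots (id INTEGER PRIMARY KEY, title TEXT);
-- One row per (outage, snapshot) pair: that outage's state at that moment
CREATE TABLE outage_snapshots (
    snapshot INTEGER REFERENCES snapshots(id),
    outage INTEGER REFERENCES outages(id),
    estCustAffected INTEGER
);
""")

# Illustrative sample data: one outage, captured in two snapshots
db.execute("INSERT INTO outages VALUES (1, 1570000000)")
db.executemany("INSERT INTO snapshots VALUES (?, ?)",
               [(1268, "4:00pm"), (1269, "4:10pm")])
db.executemany("INSERT INTO outage_snapshots VALUES (?, ?, ?)",
               [(1268, 1, 120), (1269, 1, 95)])

# Reconstructing an outage's history is just its outage_snapshots rows
history = db.execute(
    "select snapshot, estCustAffected from outage_snapshots "
    "where outage = ? order by snapshot", (1,)
).fetchall()
```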

&lt;p&gt;Here are &lt;a href="https://pge-outages.simonwillison.net/pge-outages/outage_snapshots?snapshot=1269"&gt;all of the outages&lt;/a&gt; that were represented in &lt;a href="https://pge-outages.simonwillison.net/pge-outages/snapshots/1269"&gt;snapshot 1269&lt;/a&gt; - captured at 4:10pm Pacific Time today.&lt;/p&gt;

&lt;p&gt;I can run &lt;code&gt;select sum(estCustAffected) from outage_snapshots where snapshot = 1269&lt;/code&gt; (&lt;a href="https://pge-outages.simonwillison.net/pge-outages?sql=select+sum%28estCustAffected%29+from+outage_snapshots+where+snapshot+%3D+%3Aid&amp;amp;id=1269"&gt;try it here&lt;/a&gt;) to count up the total PG&amp;amp;E estimate of the number of affected customers - it's 545,706!&lt;/p&gt;

&lt;p&gt;I've installed &lt;a href="https://github.com/simonw/datasette-vega"&gt;datasette-vega&lt;/a&gt; which means I can render graphs. Here's my first attempt at a graph showing &lt;a href="https://pge-outages.simonwillison.net/pge-outages?sql=select+snapshots.id%2C+title+as+snapshotTime%2C+hash%2C+sum%28outage_snapshots.estCustAffected%29+as+totalEstCustAffected%0D%0Afrom+snapshots+join+outage_snapshots+on+snapshots.id+%3D+outage_snapshots.snapshot%0D%0Agroup+by+snapshots.id+order+by+snapshots.id+desc+limit+150#g.mark=line&amp;amp;g.x_column=snapshotTime&amp;amp;g.x_type=ordinal&amp;amp;g.y_column=totalEstCustAffected&amp;amp;g.y_type=quantitative"&gt;the number of estimated customers affected over time&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://static.simonwillison.net/static/2019/pge-outages-graph.png" style="text-decoration: none; border: none;"&gt;&lt;img src="https://static.simonwillison.net/static/2019/pge-outages-graph.png" style="max-width: 100%" /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;(I don't know why there's a dip towards the end of the graph).&lt;/p&gt;

&lt;p&gt;I also defined &lt;a href="https://pge-outages.simonwillison.net/pge-outages/most_recent_snapshot"&gt;a SQL view&lt;/a&gt; which shows all of the outages from the most recently captured snapshot (usually within the past 10 minutes if the PG&amp;amp;E website hasn't gone down) and renders them using &lt;a href="https://github.com/simonw/datasette-cluster-map"&gt;datasette-cluster-map&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://static.simonwillison.net/static/2019/pge-map.jpg" style="text-decoration: none; border: none;"&gt;&lt;img src="https://static.simonwillison.net/static/2019/pge-map.jpg" style="max-width: 100%" /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3 id="thingstobeawareof"&gt;Things to be aware of&lt;/h3&gt;

&lt;p&gt;There are a huge number of unanswered questions about this data. I've just been looking at PG&amp;amp;E's JSON and making guesses about what things like &lt;code&gt;estCustAffected&lt;/code&gt; mean. Without official documentation we can only guess at how accurate this data is, or how it should be interpreted.&lt;/p&gt;

&lt;p&gt;Some things to question:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What's the quality of this data? Does it reflect accurately on what's actually going on out there?&lt;/li&gt;

&lt;li&gt;What's the exact meaning of the different columns - &lt;code&gt;estCustAffected&lt;/code&gt;, &lt;code&gt;currentEtor&lt;/code&gt;, &lt;code&gt;autoEtor&lt;/code&gt;, &lt;code&gt;hazardFlag&lt;/code&gt; etc?&lt;/li&gt;

&lt;li&gt;Various columns (&lt;code&gt;lastUpdateTime&lt;/code&gt;, &lt;code&gt;currentEtor&lt;/code&gt;, &lt;code&gt;autoEtor&lt;/code&gt;) appear to be integer &lt;a href="https://en.wikipedia.org/wiki/Unix_time"&gt;unix timestamps&lt;/a&gt;. What timezone were they recorded in? Do they include DST etc?&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id="howitworks"&gt;How it works&lt;/h3&gt;

&lt;p&gt;I originally wrote the scraper &lt;a href="https://simonwillison.net/2017/Oct/10/fires-in-the-north-bay/"&gt;back in October 2017&lt;/a&gt; during the North Bay fires, and moved it to run on Circle CI based on my work building &lt;a href="https://simonwillison.net/2019/Mar/13/tree-history/"&gt;a commit history of San Francisco's trees&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;It's pretty simple: every 10 minutes &lt;a href="https://circleci.com/gh/simonw/disaster-scrapers"&gt;a Circle CI job&lt;/a&gt; runs which scrapes &lt;a href="https://apim.pge.com/cocoutage/outages/getOutagesRegions?regionType=city&amp;amp;expand=true"&gt;the JSON feed&lt;/a&gt; that powers the PG&amp;amp;E website's &lt;a href="https://www.pge.com/myhome/outages/outage/index.shtml"&gt;outage map&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The JSON is then committed to my &lt;a href="https://github.com/simonw/pge-outages"&gt;pge-outages GitHub repository&lt;/a&gt;, over-writing the existing &lt;a href="https://github.com/simonw/pge-outages/blob/master/pge-outages.json"&gt;pge-outages.json file&lt;/a&gt;. There's some code that attempts to generate a human-readable commit message, but the historic data itself is saved in the commit history of that single file.&lt;/p&gt;
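&lt;p&gt;The human-readable commit message part boils down to diffing consecutive snapshots keyed on outage ID. This is a guess at the shape of that code, not the actual scraper:&lt;/p&gt;

```python
def commit_message(previous, current):
    # previous/current: dicts mapping outage ID to its JSON record,
    # for the last committed snapshot and the freshly scraped one.
    started = [i for i in current if i not in previous]
    ended = [i for i in previous if i not in current]
    return f"{len(started)} new outages, {len(ended)} ended, {len(current)} total"
```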

&lt;h3 id="buildingthedatasette"&gt;Building the Datasette&lt;/h3&gt;

&lt;p&gt;The hardest part of this project was figuring out how to turn a GitHub commit history of changes to a JSON file into a SQLite database for use with Datasette.&lt;/p&gt;

&lt;p&gt;After a bunch of prototyping in a Jupyter notebook, I ended up with the schema described above.&lt;/p&gt;

&lt;p&gt;The code that generates the database can be found in &lt;a href="https://github.com/simonw/pge-outages/blob/master/build_database.py"&gt;build_database.py&lt;/a&gt;. I used &lt;a href="https://gitpython.readthedocs.io/en/stable/"&gt;GitPython&lt;/a&gt; to read data from the git repository and my &lt;a href="https://sqlite-utils.readthedocs.io/en/stable/python-api.html"&gt;sqlite-utils library&lt;/a&gt; to create and update the database.&lt;/p&gt;
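&lt;p&gt;Stripped of the GitPython plumbing, the transformation at the heart of build_database.py boils down to flattening (snapshot, JSON) pairs into &lt;code&gt;outage_snapshots&lt;/code&gt; rows. A sketch under those assumptions - the &lt;code&gt;outageNumber&lt;/code&gt; key name is illustrative:&lt;/p&gt;

```python
import json

def snapshots_to_rows(snapshots):
    # snapshots: list of (snapshot_id, json_text) pairs, one per commit
    # to pge-outages.json; the real code reads these pairs via GitPython.
    rows = []
    for snapshot_id, json_text in snapshots:
        for outage in json.loads(json_text):
            rows.append({
                "snapshot": snapshot_id,
                "outage": outage["outageNumber"],  # illustrative key name
                "estCustAffected": outage.get("estCustAffected"),
            })
    return rows
```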

&lt;h3 id="deployment"&gt;Deployment&lt;/h3&gt;

&lt;p&gt;Since this is a large database that changes every ten minutes, I couldn't use the usual &lt;a href="https://datasette.readthedocs.io/en/stable/publish.html"&gt;datasette publish&lt;/a&gt; trick of packaging it up and re-deploying it to a serverless host (Cloud Run or Heroku or Zeit Now) every time it updates.&lt;/p&gt;

&lt;p&gt;Instead, I'm running it on a VPS instance. I ended up trying out Digital Ocean for this, after &lt;a href="https://twitter.com/simonw/status/1182077259839991808"&gt;an enjoyable Twitter conversation&lt;/a&gt; about good options for stateful (as opposed to stateless) hosting.&lt;/p&gt;

&lt;h3 id="nextsteps"&gt;Next steps&lt;/h3&gt;

&lt;p&gt;I'm putting this out there and sharing it with the California News Nerd community in the hope that people can find interesting stories in there and help firm up my methodology - or take what I've done and spin up much more interesting forks of it.&lt;/p&gt;

&lt;p&gt;If you build something interesting with this please let me know, via email (swillison is my Gmail) or &lt;a href="https://twitter.com/simonw"&gt;on Twitter&lt;/a&gt;.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/data-journalism"&gt;data-journalism&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/scraping"&gt;scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sqlite"&gt;sqlite&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git-scraping"&gt;git-scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/digitalocean"&gt;digitalocean&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sqlite-utils"&gt;sqlite-utils&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="data-journalism"/><category term="projects"/><category term="scraping"/><category term="sqlite"/><category term="datasette"/><category term="git-scraping"/><category term="digitalocean"/><category term="sqlite-utils"/></entry></feed>