Weeknotes: DjangoCon, SQLite in Django, datasette-gunicorn

2022-10-23T19:58:00+00:00

I spent most of this week at DjangoCon in San Diego - my first outside-of-the-Bay-Area conference since the before-times.

It was a most excellent event. I spent a lot of time in the corridor track - actually the sitting-outside-in-the-sunshine track, catching up with people I haven't seen in several years.

I gave a talk titled "Massively increase your productivity on personal projects with comprehensive documentation and automated tests", with the alternative title "Coping strategies for the serial project hoarder". I'll do a full write-up of this once the video is made available in a few weeks' time, but in the meantime the talk materials can be found here:

Supporting notes and links
Slides on Speaker Deck
Video for paying DjangoCon attendees (public video coming soon)

I also gave a lightning talk about AI and magic, which was effectively the five minute oral version of my recent blog post Is the AI spell-casting metaphor harmful or helpful?

Benchmarking SQLite in Django

I also hung around for the first day of the DjangoCon sprints.

For over a decade, the Django documentation has warned against using SQLite in production - recommending PostgreSQL or MySQL instead.

I asked Django Fellow Carlton Gibson what it would take to update that advice for 2022. He suggested that what we really needed was a solid idea of how well modern SQLite performs with Django, against a variety of different settings.

So I spent some time running benchmarks, using my new django_sqlite_benchmark repository.

You can follow the full details of my experiments in these issues:

#2: Locust test to exercise /counter/xxx endpoint which runs benchmarks against a simple Django view that increments a counter stored in a SQLite table
#3: Load test for larger writes runs a benchmark using a script that inserts larger JSON objects into a database table. I also tried this against PostgreSQL, getting very similar numbers to SQLite.
#4: Benchmark endpoint that doesn't interact with database benchmarks a simple "hello world" view that doesn't use SQLite at all - as a baseline for comparison

I used Locust for all of these tests, and wrote up a TIL about using it as well.

Here's the TLDR version of the results: SQLite in its default "journal" mode starts returning "database locked" errors pretty quickly as the write load increases. But... if you switch to "wal" mode (here's how) those errors straight up vanish!
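As a minimal sketch of what switching to WAL mode involves, here is the change made with Python's standard `sqlite3` module rather than through Django. The `enable_wal` helper and the temporary database path are hypothetical names used only for illustration. The setting is stored in the database file, so it normally only needs to be applied once.

```python
import os
import sqlite3
import tempfile


def enable_wal(db_path):
    """Switch the SQLite database at db_path to WAL mode.

    Returns the journal mode reported by SQLite after the change.
    """
    conn = sqlite3.connect(db_path)
    try:
        # The PRAGMA returns a single row holding the resulting journal mode.
        mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
    finally:
        conn.close()
    return mode


# Hypothetical database path, created only for this demonstration.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.sqlite3")
print(enable_wal(demo_path))
```

The same `PRAGMA journal_mode=WAL` statement can be run from the `sqlite3` command-line tool against an existing Django database file.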

I was expecting WAL mode to improve things, but I thought I'd still be able to hit errors even with it enabled. No - it turns out that, at least for the amount of traffic I could generate on my laptop, WAL mode proved easily capable of handling the load.

Even without WAL mode, bumping the SQLite "timeout" option up to 20s solved most of the errors.
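For reference, a sketch of how that timeout can be set in a Django settings module, using the `OPTIONS` key that Django passes through to the underlying SQLite connection. The database file name is a placeholder, not the one used in the benchmark repository.

```python
# Fragment of a Django settings.py (sketch only).
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.sqlite3",
        "NAME": "db.sqlite3",  # hypothetical file name
        "OPTIONS": {
            # Seconds to wait for a lock to clear before SQLite
            # gives up with a "database is locked" error.
            "timeout": 20,
        },
    }
}
```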

Even more interestingly: I tried using Gunicorn (and Uvicorn) to run multiple Django workers at once. I was certain this would lead to problems, as SQLite isn't designed to handle writes from multiple processes at once... or so I thought. It turned out SQLite's use of file locking meant everything worked far better than I expected - and upping the number of worker processes from 1 to 4 resulted in approximately a 4x increase in throughput.
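The multi-process behaviour can be illustrated without Django or Gunicorn. The sketch below is not the benchmark itself: it uses only the standard library, and the `counter` table, `WORKER_CODE` and `run_writers` are hypothetical names. It starts several separate Python processes that all increment the same row, with WAL mode enabled and a 20 second timeout, then reads back the final value. It checks that no writes are lost; it does not measure throughput.

```python
import sqlite3
import subprocess
import sys

# Code run by each worker process: open its own connection and bump the counter.
WORKER_CODE = """
import sqlite3, sys
path, n = sys.argv[1], int(sys.argv[2])
conn = sqlite3.connect(path, timeout=20)
for _ in range(n):
    with conn:  # one transaction per increment
        conn.execute("UPDATE counter SET value = value + 1 WHERE id = 1")
conn.close()
"""


def run_writers(db_path, workers=4, increments=50):
    """Run several processes writing to one SQLite file; return the final count."""
    conn = sqlite3.connect(db_path)
    conn.execute("PRAGMA journal_mode=WAL")
    with conn:
        conn.execute("CREATE TABLE counter (id INTEGER PRIMARY KEY, value INTEGER)")
        conn.execute("INSERT INTO counter (id, value) VALUES (1, 0)")
    conn.close()

    # Start all workers first so that they genuinely run at the same time.
    procs = [
        subprocess.Popen([sys.executable, "-c", WORKER_CODE, db_path, str(increments)])
        for _ in range(workers)
    ]
    if any(proc.wait() != 0 for proc in procs):
        raise RuntimeError("at least one writer process failed")

    conn = sqlite3.connect(db_path)
    total = conn.execute("SELECT value FROM counter WHERE id = 1").fetchone()[0]
    conn.close()
    return total
```

Calling `run_writers("/tmp/example.sqlite3")` on a path that does not exist yet should return `workers * increments` if every write succeeded.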

I shouldn't be surprised by this, if only because every time I've tried to push SQLite in a new direction it's impressed me with how much more capable it is than I expected.

But still, these results are very exciting. This problem still needs more thorough testing and more eyes than just mine, but I think this indicates that SQLite should absolutely be considered a viable option for running Django in production in 2022.

datasette-gunicorn

Datasette has always run as a single process. It uses Uvicorn to serve requests, but it hard-codes Uvicorn to a single worker (here).

Based on my experiments with SQLite and Django - in particular how running multiple worker processes gave me an increase in how much traffic I could handle - I decided to try the same thing with Datasette itself.

Gunicorn remains one of the most well regarded options for deploying Python web applications. It acts as a process monitor, balancing requests between different workers and restarting anything that fails with an error.

I decided to experiment with this through the medium of a Datasette plugin. So I built datasette-gunicorn, a plugin that adds an extra command to Datasette that lets you start it like this:

datasette gunicorn my.db --workers 4

It takes most of the same arguments as Datasette's regular datasette serve command, plus that new -w/--workers option for setting the number of workers.

Initial benchmarks were very positive: 21 requests a second with a single worker, increasing to 75 requests a second with four! Not bad for an initial experiment. I also tested it serving a static page through Datasette and got up to over 500 requests a second, along with a warning that Locust needed to be moved to a separate machine for a full load test.

In writing the plugin I had to figure out how to build a new command that mostly copied parameters from the existing datasette serve Click command - I wrote a TIL about how I ended up doing that.

shot-scraper 1.0

Also this week: I released shot-scraper 1.0.

Despite the exciting version number this actually only has two small new features. Here's the full changelog:

New shot-scraper html URL command (documented here) for outputting the final HTML of a page, after JavaScript has been executed. #96

shot-scraper javascript has a new -r/--raw option for outputting the result of the JavaScript expression as a raw string rather than JSON encoded (shot-scraper javascript documentation). #95

Tutorial: Automating screenshots for the Datasette documentation using shot-scraper.

I bumped it to 1.0 because shot-scraper is mature enough now that I'm ready to commit to not breaking existing features (at least without shipping a 2.0, which I hope to avoid for as long as possible).

I'm always trying to be braver when it comes to stamping a 1.0 release on my main projects.

(I really, really need to get Datasette 1.0 shipped soon.)

Releases this week

datasette-gunicorn: 0.1 - 2022-10-22
Plugin for running Datasette using Gunicorn
shot-scraper: 1.0 - (23 releases total) - 2022-10-15
A command-line utility for taking automated screenshots of websites
asgi-gzip: 0.2 - (2 releases total) - 2022-10-13
gzip middleware for ASGI applications, extracted from Starlette

TIL this week

Tags: django, djangocon, projects, sqlite, my-talks, gunicorn, datasette, weeknotes, shot-scraper, carlton-gibson

Running gunicorn behind nginx on Heroku for buffering and logging

2017-10-02T01:57:20+00:00

Heroku's default setup for Django uses the gunicorn application server. Each Heroku dyno can only run a limited number of gunicorn workers, which means a limited number of requests can be served in parallel (around 4 per dyno is a good rule of thumb).

Where things get nasty is when you have devices on slow connections - like mobile phones. Heroku's router buffers headers but it does not buffer response bodies, so a slow device could hold up a gunicorn worker for several seconds. Too many slow devices at once and the site will become unavailable to other users.

This issue is explained and discussed here: Don't use Gunicorn to host your Django sites on Heroku

That article recommends using waitress as an alternative to gunicorn, but in the comments at the bottom of the article people suggest putting nginx in front of gunicorn instead, using a Heroku nginx-buildpack.

Here is a slightly out-of-date tutorial on getting this all set up: https://koed00.github.io/Heroku_setups/

I used the following commands to set up the buildpacks:

heroku stack:set cedar-14
heroku buildpacks:clear
heroku buildpacks:add https://github.com/beanieboi/nginx-buildpack.git
heroku buildpacks:add https://github.com/heroku/heroku-buildpack-python.git

Unfortunately the nginx buildpack is not yet compatible with the new heroku-16 stack, so until the nginx buildpack has been updated it's necessary to run the application on the older cedar-14 stack. See this discussion for details: ryandotsmith/nginx-buildpack#68

Adding nginx in this way also gives us the opportunity to fix another limitation of Heroku: its default logging configuration. By default, log lines produced by Heroku (visible using heroku logs --tail or with a logging addon such as Papertrail) look like this:

    Oct 01 18:01:06 simonwillisonblog heroku/router: at=info
        method=GET path="/2017/Oct/1/ship/" host=simonwillison.net
        request_id=bb22f67e-6924-4e81-b6ad-74d1f465cda7
        fwd="2001:8003:74c5:8b00:79e4:80ed:fa85:7b37,108.162.249.198"
        dyno=web.1 connect=0ms service=338ms status=200 bytes=4523 protocol=http

Notably missing here are both the user-agent string and the referrer header sent by the browser! If you're a fan of tailing log files these omissions are pretty disappointing.

The nginx buildpack I'm using loads a default configuration file at config/nginx.conf.erb. By including my own copy of this file I can override the original and define my own custom log format.

Having applied this change, the new log lines look like this:

    2017-10-02T01:44:38.762845+00:00 app[web.1]:
        measure#nginx.service=0.133 request="GET / HTTP/1.1" status_code=200
        request_id=8b6402de-d072-42c4-9854-0f71697b30e5 remote_addr="10.16.227.159"
        forwarded_for="199.188.193.220" forwarded_proto="http" via="1.1 vegur"
        body_bytes_sent=12666 referer="-" user_agent="Mozilla/5.0 (Macintosh;
        Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko)
        Chrome/61.0.3163.100 Safari/537.36"
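Because the new format is a series of `key=value` pairs with quoted values, lines like this are straightforward to pick apart in a script. Here is a rough sketch using Python's `shlex` module. The `parse_log_fields` helper is a hypothetical name, and the sample line is a shortened version of the one above. It assumes values never contain unbalanced quote characters, which real user-agent strings may not guarantee.

```python
import shlex


def parse_log_fields(line):
    """Split a key=value log line into a dict, honouring double-quoted values."""
    fields = {}
    for token in shlex.split(line):
        key, sep, value = token.partition("=")
        if sep:  # skip tokens without "=", such as the timestamp prefix
            fields[key] = value
    return fields


# Shortened sample in the same shape as the log line shown above.
sample = (
    'measure#nginx.service=0.133 request="GET / HTTP/1.1" status_code=200 '
    'referer="-" user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6)"'
)
fields = parse_log_fields(sample)
print(fields["user_agent"], fields["referer"])
```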

This blog entry started life as a commit message.

Tags: django, logging, nginx, user-agents, heroku, gunicorn