<?xml version="1.0" encoding="utf-8"?>
<feed xml:lang="en-us" xmlns="http://www.w3.org/2005/Atom"><title>Simon Willison's Weblog: continuous-integration</title><link href="http://simonwillison.net/" rel="alternate"/><link href="http://simonwillison.net/tags/continuous-integration.atom" rel="self"/><id>http://simonwillison.net/</id><updated>2025-06-27T23:31:11+00:00</updated><author><name>Simon Willison</name></author><entry><title>Continuous AI</title><link href="https://simonwillison.net/2025/Jun/27/continuous-ai/#atom-tag" rel="alternate"/><published>2025-06-27T23:31:11+00:00</published><updated>2025-06-27T23:31:11+00:00</updated><id>https://simonwillison.net/2025/Jun/27/continuous-ai/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://githubnext.com/projects/continuous-ai"&gt;Continuous AI&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;GitHub Next have coined the term "Continuous AI" to describe "all uses of automated AI to support software collaboration on any platform". It's intended as an echo of Continuous Integration and Continuous Deployment:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;We've chosen the term "Continuous AI” to align with the established concept of Continuous Integration/Continuous Deployment (CI/CD). Just as CI/CD transformed software development by automating integration and deployment, Continuous AI covers the ways in which AI can be used to automate and enhance collaboration workflows.&lt;/p&gt;
&lt;p&gt;“Continuous AI” is not a term GitHub owns, nor a technology GitHub builds: it's a term we use to focus our minds, and which we're introducing to the industry. This means Continuous AI is an open-ended set of activities, workloads, examples, recipes, technologies and capabilities; a category, rather than any single tool.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I was thrilled to bits to see LLM get a mention as a tool that can be used to implement some of these patterns inside of GitHub Actions:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;You can also use the &lt;a href="https://llm.datasette.io/en/stable/"&gt;llm framework&lt;/a&gt; in combination with the &lt;a href="https://github.com/tonybaloney/llm-github-models"&gt;llm-github-models extension&lt;/a&gt; to create LLM-powered GitHub Actions which use GitHub Models using Unix shell scripting.&lt;/p&gt;
&lt;/blockquote&gt;
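&lt;p&gt;A workflow along those lines might look something like this - a sketch only, where the trigger, model id and token wiring are my assumptions rather than anything from GitHub's post:&lt;/p&gt;

```yaml
# Hypothetical GitHub Actions workflow: summarize newly opened issues
# with the llm CLI and llm-github-models. The model id, permissions
# block and GITHUB_MODELS_KEY variable are assumptions.
name: summarize-issue
on:
  issues:
    types: [opened]
jobs:
  summarize:
    runs-on: ubuntu-latest
    permissions:
      models: read
    steps:
      - name: Install llm and the GitHub Models plugin
        run: pip install llm llm-github-models
      - name: Summarize the issue body
        env:
          GITHUB_MODELS_KEY: ${{ secrets.GITHUB_TOKEN }}
          ISSUE_BODY: ${{ github.event.issue.body }}
        run: |
          echo "$ISSUE_BODY" | llm -m github/gpt-4o-mini \
            "Summarize this issue in one paragraph"
```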
&lt;p&gt;The GitHub Next team have started maintaining an &lt;a href="https://github.com/githubnext/awesome-continuous-ai"&gt;Awesome Continuous AI&lt;/a&gt; list with links to projects that fit under this new umbrella term.&lt;/p&gt;
&lt;p&gt;I'm particularly interested in the idea of having CI jobs (I guess CAI jobs?) that check proposed changes to see if there's documentation that needs to be updated and that might have been missed - a much more powerful variant of my &lt;a href="https://simonwillison.net/2018/Jul/28/documentation-unit-tests/"&gt;documentation unit tests&lt;/a&gt; pattern.&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/continuous-integration"&gt;continuous-integration&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/github"&gt;github&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/github-actions"&gt;github-actions&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;&lt;/p&gt;



</summary><category term="continuous-integration"/><category term="github"/><category term="ai"/><category term="github-actions"/><category term="generative-ai"/><category term="llms"/><category term="llm"/></entry><entry><title>cibuildwheel 2.20.0 now builds Python 3.13 wheels by default</title><link href="https://simonwillison.net/2024/Aug/6/cibuildwheel/#atom-tag" rel="alternate"/><published>2024-08-06T22:54:44+00:00</published><updated>2024-08-06T22:54:44+00:00</updated><id>https://simonwillison.net/2024/Aug/6/cibuildwheel/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/pypa/cibuildwheel/releases/tag/v2.20.0"&gt;cibuildwheel 2.20.0 now builds Python 3.13 wheels by default&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;CPython 3.13 wheels are now built by default […] This release includes CPython 3.13.0rc1, which is guaranteed to be ABI compatible with the final release.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;a href="https://cibuildwheel.pypa.io/"&gt;cibuildwheel&lt;/a&gt; is an underrated but crucial piece of the overall Python ecosystem.&lt;/p&gt;
&lt;p&gt;Python wheel packages that include binary compiled components - packages with C extensions for example - need to be built multiple times, once for each combination of Python version, operating system and architecture.&lt;/p&gt;
&lt;p&gt;A package like Adam Johnson’s &lt;a href="https://github.com/adamchainz/time-machine"&gt;time-machine&lt;/a&gt; - which bundles a &lt;a href="https://github.com/adamchainz/time-machine/blob/main/src/_time_machine.c"&gt;500 line C extension&lt;/a&gt; - can end up with &lt;a href="https://pypi.org/project/time-machine/#files"&gt;55 different wheel files&lt;/a&gt; with names like &lt;code&gt;time_machine-2.15.0-cp313-cp313-win_arm64.whl&lt;/code&gt; and &lt;code&gt;time_machine-2.15.0-cp38-cp38-musllinux_1_2_x86_64.whl&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Without these wheels, anyone who runs &lt;code&gt;pip install time-machine&lt;/code&gt; will need to have a working C compiler toolchain on their machine for the command to work.&lt;/p&gt;
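&lt;p&gt;Those filenames follow the &lt;code&gt;{distribution}-{version}-{python tag}-{abi tag}-{platform tag}.whl&lt;/code&gt; convention from PEP 427. As a rough sketch (not part of any packaging tool), the pieces can be pulled apart like this:&lt;/p&gt;

```python
def parse_wheel_filename(filename):
    """Split a wheel filename into its PEP 427 components.

    Format: {distribution}-{version}(-{build})?-{python}-{abi}-{platform}.whl
    The last three hyphen-separated pieces are always the tags; platform
    tags like musllinux_1_2_x86_64 use underscores, never hyphens.
    """
    stem = filename.removesuffix(".whl")
    parts = stem.split("-")
    python_tag, abi_tag, platform_tag = parts[-3:]
    return {
        "distribution": parts[0],
        "version": parts[1],
        "python_tag": python_tag,
        "abi_tag": abi_tag,
        "platform_tag": platform_tag,
    }

for name in (
    "time_machine-2.15.0-cp313-cp313-win_arm64.whl",
    "time_machine-2.15.0-cp38-cp38-musllinux_1_2_x86_64.whl",
):
    print(parse_wheel_filename(name))
```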
&lt;p&gt;&lt;code&gt;cibuildwheel&lt;/code&gt; solves the problem of building all of those wheels for all of those different platforms on the CI provider of your choice. Adam is using it in GitHub Actions for &lt;code&gt;time-machine&lt;/code&gt;, and his &lt;a href="https://github.com/adamchainz/time-machine/blob/2.15.0/.github/workflows/build.yml"&gt;.github/workflows/build.yml&lt;/a&gt; file neatly demonstrates how concise the configuration can be once you figure out how to use it.&lt;/p&gt;
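&lt;p&gt;For comparison, a minimal configuration using the official &lt;code&gt;pypa/cibuildwheel&lt;/code&gt; action might look something like this - a sketch, not Adam's actual workflow, with illustrative version pins:&lt;/p&gt;

```yaml
name: Build wheels
on: [push, pull_request]
jobs:
  build_wheels:
    strategy:
      matrix:
        os: [ubuntu-latest, windows-latest, macos-latest]
    runs-on: ${{ matrix.os }}
    steps:
      - uses: actions/checkout@v4
      # Builds wheels for every supported CPython on this runner's OS
      - uses: pypa/cibuildwheel@v2.20.0
      - uses: actions/upload-artifact@v4
        with:
          name: wheels-${{ matrix.os }}
          path: wheelhouse/*.whl
```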
&lt;p&gt;The &lt;a href="https://www.python.org/downloads/release/python-3130rc1/"&gt;first release candidate of Python 3.13&lt;/a&gt; hit its target release date of August 1st, and the final version looks on schedule for release on the 1st of October. Since this rc should be binary compatible with the final build, now is the time to start shipping those wheels to PyPI.&lt;/p&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/adamchainz/status/1820821199850476004"&gt;@adamchainz&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/continuous-integration"&gt;continuous-integration&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/packaging"&gt;packaging&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pypi"&gt;pypi&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/adam-johnson"&gt;adam-johnson&lt;/a&gt;&lt;/p&gt;



</summary><category term="continuous-integration"/><category term="packaging"/><category term="pypi"/><category term="python"/><category term="adam-johnson"/></entry><entry><title>Quoting jbreckmckye</title><link href="https://simonwillison.net/2023/Jul/10/jbreckmckye/#atom-tag" rel="alternate"/><published>2023-07-10T18:53:41+00:00</published><updated>2023-07-10T18:53:41+00:00</updated><id>https://simonwillison.net/2023/Jul/10/jbreckmckye/#atom-tag</id><summary type="html">
    &lt;blockquote cite="https://news.ycombinator.com/item?id=36667469#36669622"&gt;&lt;p&gt;At The Guardian we had a pretty direct way to fix this [the problem of zombie feature flags]: experiments were associated with expiry dates, and if your team's experiments expired the build system simply wouldn't process your jobs without outside intervention. Seems harsh, but I've found with many orgs the only way to fix negative externalities in a shared codebase is a tool that says "you broke your promises, now we break your builds".&lt;/p&gt;&lt;/blockquote&gt;
&lt;p class="cite"&gt;&amp;mdash; &lt;a href="https://news.ycombinator.com/item?id=36667469#36669622"&gt;jbreckmckye&lt;/a&gt;&lt;/p&gt;

    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/continuous-integration"&gt;continuous-integration&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/feature-flags"&gt;feature-flags&lt;/a&gt;&lt;/p&gt;



</summary><category term="continuous-integration"/><category term="feature-flags"/></entry><entry><title>A tiny CI system</title><link href="https://simonwillison.net/2022/Apr/26/a-tiny-ci-system/#atom-tag" rel="alternate"/><published>2022-04-26T15:39:27+00:00</published><updated>2022-04-26T15:39:27+00:00</updated><id>https://simonwillison.net/2022/Apr/26/a-tiny-ci-system/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.0chris.com/tiny-ci-system.html"&gt;A tiny CI system&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Christian Ştefănescu shares a recipe for building a tiny self-hosted CI system using Git and Redis. A post-receive hook runs when a commit is pushed to the repository and uses &lt;code&gt;redis-cli&lt;/code&gt; to push jobs onto a list. A separate Bash script then runs a loop around a blocking &lt;code&gt;redis-cli blpop jobs&lt;/code&gt; call, which waits for new jobs and executes each CI job as a shell script.&lt;/p&gt;
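&lt;p&gt;The whole pattern fits in a few lines of shell. This is a sketch of the idea rather than Christian's exact scripts - the queue name, paths and CI command are all assumptions:&lt;/p&gt;

```shell
# .git/hooks/post-receive (sketch): enqueue the pushed commit as a CI job
read oldrev newrev refname
redis-cli rpush jobs "$newrev"
```

```shell
# Worker loop (sketch): block until a job arrives, then run the CI steps
while true; do
  # blpop prints the list name and then the popped value; keep the value
  rev=$(redis-cli blpop jobs 0 | tail -n 1)
  git -C /path/to/checkout checkout "$rev" &&
    (cd /path/to/checkout && ./run-tests.sh) ||
    echo "CI failed for $rev"
done
```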

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/stchris_/status/1518977088723861505"&gt;@stchris_&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/bash"&gt;bash&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/continuous-integration"&gt;continuous-integration&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git"&gt;git&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/redis"&gt;redis&lt;/a&gt;&lt;/p&gt;



</summary><category term="bash"/><category term="continuous-integration"/><category term="git"/><category term="redis"/></entry><entry><title>Pillar Point Stewards, pypi-to-sqlite, improvements to shot-scraper and appreciating datasette-dashboards</title><link href="https://simonwillison.net/2022/Apr/8/weeknotes/#atom-tag" rel="alternate"/><published>2022-04-08T20:26:36+00:00</published><updated>2022-04-08T20:26:36+00:00</updated><id>https://simonwillison.net/2022/Apr/8/weeknotes/#atom-tag</id><summary type="html">
    &lt;p&gt;This week I helped Natalie launch the Pillar Point Stewards website and built a new tool for loading PyPI package data into SQLite, in order to help promote the excellent datasette-dashboards plugin by Romain Clement.&lt;/p&gt;
&lt;h4 id="pillar-point-stewards"&gt;Pillar Point Stewards&lt;/h4&gt;
&lt;p&gt;I've been helping my wife Natalie Downe build the website for the &lt;a href="https://www.pillarpointstewards.com/"&gt;Pillar Point Stewards&lt;/a&gt; initiative that she is organizing on behalf of the San Mateo MPA Collaborative and California Academy of Sciences.&lt;/p&gt;
&lt;p&gt;We live in El Granada, CA - home to the Pillar Point reef.&lt;/p&gt;
&lt;p&gt;The reef has always been mixed-use, with harvesting of sea life such as mussels and clams legal provided the harvesters have an inexpensive fishing license.&lt;/p&gt;
&lt;p&gt;Unfortunately, during the pandemic the number of people harvesting the reef rose by an order of magnitude - to over a thousand people in a single weekend. This had a major impact on the biodiversity of the reef, as described in &lt;a href="https://baynature.org/2021/01/12/packed-at-pillar-point/"&gt;Packed at Pillar Point by Anne Marshall-Chalmers for Bay Nature&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Pillar Point Stewards is an initiative to recruit volunteer stewards to go out on the reef during low tides, talking to people and trying to inspire curiosity and discourage unsustainable harvesting practices.&lt;/p&gt;
&lt;p&gt;A very small part of the project is the website to support it, which helps volunteers sign up for shifts at low tides.&lt;/p&gt;
&lt;p&gt;We re-used some of the work we had previously done &lt;a href="https://simonwillison.net/2020/Aug/21/weeknotes-rocky-beaches/"&gt;for Rocky Beaches&lt;/a&gt;, in particular the logic for working with tide times &lt;a href="https://tidesandcurrents.noaa.gov/web_services_info.html"&gt;from NOAA&lt;/a&gt; to decide when the shifts should be.&lt;/p&gt;
&lt;p&gt;Natalie designed the site and built the front-end. I implemented the Django backend and integrated with &lt;a href="https://auth0.com/"&gt;Auth0&lt;/a&gt; in order to avoid running our own signup and registration flow. This was the inspiration for the &lt;a href="https://datasette.io/plugins/datasette-auth0"&gt;datasette-auth0&lt;/a&gt; plugin I &lt;a href="https://simonwillison.net/2022/Mar/28/datasette-auth0/"&gt;released last week&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Signed in volunteers can select their shift times from a calendar:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2022/www-pillarpointstewards-com.jpg" alt="The signed in homepage, showing a list of upcoming shifts and a calendar view." style="max-width:100%;" /&gt;&lt;/p&gt;
&lt;p&gt;We also included an SVG tide chart on each shift page using the tide data from NOAA, which looks like this:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2022/www-pillarpointstewards-com-shifts-182.png" alt="The tide chart shows the tide level throughout the day, highlighting the low tide and showing which portion of the day is covered by the shift" style="max-width:100%;" /&gt;&lt;/p&gt;
&lt;p&gt;We've been building the site in public. You can see how everything works in the &lt;a href="https://github.com/natbat/pillarpointstewards"&gt;natbat/pillarpointstewards&lt;/a&gt; GitHub repository, including how the site uses &lt;a href="https://github.com/natbat/pillarpointstewards/blob/165cdcfe8b87cb15742e2729d0077202102fc751/.github/workflows/test.yml#L39-L45"&gt;continuous deployment&lt;/a&gt; against &lt;a href="https://fly.io/"&gt;Fly&lt;/a&gt;.&lt;/p&gt;
&lt;h4 id="datasette-dashboards"&gt;datasette-dashboards&lt;/h4&gt;
&lt;p&gt;This is not my project, but I'm writing about it here because I only just found out about it and it's really cool.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://romain-clement.net/"&gt;Romain Clement&lt;/a&gt; built a plugin for Datasette called &lt;a href="https://github.com/rclement/datasette-dashboards"&gt;datasette-dashboards&lt;/a&gt;. It's best explained by checking out his &lt;a href="https://datasette-dashboards-demo.vercel.app/-/dashboards/job-offers-stats"&gt;live demo&lt;/a&gt;, which looks like this:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2022/datasette-dashboards-with-filters.png" alt="A dashboard, showing Job offers statistics - with a line chart, a big number, a donut chart, a nested bar chart and a choropleth map. The elements are arranged in a visually pleasing grid, with the line chart taking up two columns while everything else takes up one." style="max-width:100%;" /&gt;&lt;/p&gt;
&lt;p&gt;There are a bunch of clever ideas in this plugin.&lt;/p&gt;
&lt;p&gt;It uses YAML syntax to define the different dashboard panels, outsourcing the actual visualization elements to Vega. You can see &lt;a href="https://github.com/rclement/datasette-dashboards/blob/202fb2fcbc9efe4848fe940fae435eff75bb4f59/demo/metadata.yml"&gt;the YAML for the demo here&lt;/a&gt;. Here's an edited subset of the YAML illustrating some interesting points:&lt;/p&gt;
&lt;div class="highlight highlight-source-yaml"&gt;&lt;pre&gt;&lt;span class="pl-ent"&gt;plugins&lt;/span&gt;:
  &lt;span class="pl-ent"&gt;datasette-dashboards&lt;/span&gt;:
    &lt;span class="pl-ent"&gt;job-offers-stats&lt;/span&gt;:
      &lt;span class="pl-ent"&gt;title&lt;/span&gt;: &lt;span class="pl-s"&gt;Job offers statistics&lt;/span&gt;
      &lt;span class="pl-ent"&gt;description&lt;/span&gt;: &lt;span class="pl-s"&gt;Gather metrics about job offers&lt;/span&gt;
      &lt;span class="pl-ent"&gt;layout&lt;/span&gt;:
        - &lt;span class="pl-s"&gt;[analysis-note, offers-day, offers-day, offers-count]&lt;/span&gt;
        - &lt;span class="pl-s"&gt;[analysis-note, offers-source, offers-day-source, offers-region]&lt;/span&gt;
      &lt;span class="pl-ent"&gt;filters&lt;/span&gt;:
        &lt;span class="pl-ent"&gt;date_start&lt;/span&gt;:
          &lt;span class="pl-ent"&gt;name&lt;/span&gt;: &lt;span class="pl-s"&gt;Date Start&lt;/span&gt;
          &lt;span class="pl-ent"&gt;type&lt;/span&gt;: &lt;span class="pl-s"&gt;date&lt;/span&gt;
          &lt;span class="pl-ent"&gt;default&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;2021-01-01&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;
        &lt;span class="pl-ent"&gt;date_end&lt;/span&gt;:
          &lt;span class="pl-ent"&gt;name&lt;/span&gt;: &lt;span class="pl-s"&gt;Date End&lt;/span&gt;
          &lt;span class="pl-ent"&gt;type&lt;/span&gt;: &lt;span class="pl-s"&gt;date&lt;/span&gt;
      &lt;span class="pl-ent"&gt;charts&lt;/span&gt;:
        &lt;span class="pl-ent"&gt;analysis-note&lt;/span&gt;:
          &lt;span class="pl-ent"&gt;library&lt;/span&gt;: &lt;span class="pl-s"&gt;markdown&lt;/span&gt;
          &lt;span class="pl-ent"&gt;display&lt;/span&gt;: &lt;span class="pl-s"&gt;|-&lt;/span&gt;
&lt;span class="pl-s"&gt;            # Analysis details&lt;/span&gt;
&lt;span class="pl-s"&gt;            ...&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;/span&gt;        &lt;span class="pl-ent"&gt;offers-count&lt;/span&gt;:
          &lt;span class="pl-ent"&gt;title&lt;/span&gt;: &lt;span class="pl-s"&gt;Total number of offers&lt;/span&gt;
          &lt;span class="pl-ent"&gt;db&lt;/span&gt;: &lt;span class="pl-s"&gt;jobs&lt;/span&gt;
          &lt;span class="pl-ent"&gt;query&lt;/span&gt;: &lt;span class="pl-s"&gt;SELECT count(*) as count FROM offers_view WHERE TRUE [[ AND date &amp;gt;= date(:date_start) ]] [[ AND date &amp;lt;= date(:date_end) ]];&lt;/span&gt;
          &lt;span class="pl-ent"&gt;library&lt;/span&gt;: &lt;span class="pl-s"&gt;metric&lt;/span&gt;
          &lt;span class="pl-ent"&gt;display&lt;/span&gt;:
            &lt;span class="pl-ent"&gt;field&lt;/span&gt;: &lt;span class="pl-s"&gt;count&lt;/span&gt;
            &lt;span class="pl-ent"&gt;prefix&lt;/span&gt;:
            &lt;span class="pl-ent"&gt;suffix&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt; offers&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;
        &lt;span class="pl-ent"&gt;offers-day&lt;/span&gt;:
          &lt;span class="pl-ent"&gt;title&lt;/span&gt;: &lt;span class="pl-s"&gt;Number of offers by day&lt;/span&gt;
          &lt;span class="pl-ent"&gt;db&lt;/span&gt;: &lt;span class="pl-s"&gt;jobs&lt;/span&gt;
          &lt;span class="pl-ent"&gt;query&lt;/span&gt;: &lt;span class="pl-s"&gt;SELECT date(date) as day, count(*) as count FROM offers_view WHERE TRUE [[ AND date &amp;gt;= date(:date_start) ]] [[ AND date &amp;lt;= date(:date_end) ]] GROUP BY day ORDER BY day&lt;/span&gt;
          &lt;span class="pl-ent"&gt;library&lt;/span&gt;: &lt;span class="pl-s"&gt;vega&lt;/span&gt;
          &lt;span class="pl-ent"&gt;display&lt;/span&gt;:
            &lt;span class="pl-ent"&gt;mark&lt;/span&gt;: &lt;span class="pl-s"&gt;{ type: line, tooltip: true }&lt;/span&gt;
            &lt;span class="pl-ent"&gt;encoding&lt;/span&gt;:
              &lt;span class="pl-ent"&gt;x&lt;/span&gt;: &lt;span class="pl-s"&gt;{ field: day, type: temporal }&lt;/span&gt;
              &lt;span class="pl-ent"&gt;y&lt;/span&gt;: &lt;span class="pl-s"&gt;{ field: count, type: quantitative }&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The SQL query for each panel is defined as &lt;code&gt;query:&lt;/code&gt; - and can take parameters such as &lt;code&gt;:date_end&lt;/code&gt; which are defined by the &lt;code&gt;filters:&lt;/code&gt; section. Note that here one of the filters has a type of &lt;code&gt;date&lt;/code&gt;, which turns into a &lt;code&gt;&amp;lt;input type="date"&amp;gt;&lt;/code&gt; in the filter interface.&lt;/p&gt;
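&lt;p&gt;The &lt;code&gt;[[ ... ]]&lt;/code&gt; brackets mark optional clauses (similar to Metabase's optional filter syntax): a bracketed clause is dropped entirely when its parameter is missing. With &lt;code&gt;date_start&lt;/code&gt; supplied but &lt;code&gt;date_end&lt;/code&gt; left blank, my understanding is the &lt;code&gt;offers-count&lt;/code&gt; query above effectively becomes:&lt;/p&gt;

```sql
-- Illustrative expansion: the date_end clause has been dropped
SELECT count(*) as count FROM offers_view
WHERE TRUE AND date >= date(:date_start);
```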
&lt;p&gt;For &lt;code&gt;library: vega&lt;/code&gt; panels the &lt;code&gt;display:&lt;/code&gt; key holds the raw &lt;a href="https://vega.github.io/vega/docs/specification/"&gt;Vega specification&lt;/a&gt;, so anything the Vega visualization library can do is available to the plugin.&lt;/p&gt;
&lt;p&gt;I didn't know Vega could render choropleth maps! That map there is defined by this YAML, which loads a GeoJSON file of the regions in France from the &lt;a href="https://github.com/gregoiredavid/france-geojson"&gt;gregoiredavid/france-geojson&lt;/a&gt; GitHub repository.&lt;/p&gt;
&lt;div class="highlight highlight-source-yaml"&gt;&lt;pre&gt;&lt;span class="pl-ent"&gt;display&lt;/span&gt;:
  &lt;span class="pl-ent"&gt;mark&lt;/span&gt;: &lt;span class="pl-s"&gt;geoshape&lt;/span&gt;
  &lt;span class="pl-ent"&gt;projection&lt;/span&gt;: &lt;span class="pl-s"&gt;{ type: mercator }&lt;/span&gt;
  &lt;span class="pl-ent"&gt;transform&lt;/span&gt;:
    - &lt;span class="pl-ent"&gt;lookup&lt;/span&gt;: &lt;span class="pl-s"&gt;region&lt;/span&gt;
      &lt;span class="pl-ent"&gt;from&lt;/span&gt;:
        &lt;span class="pl-ent"&gt;data&lt;/span&gt;:
          &lt;span class="pl-ent"&gt;url&lt;/span&gt;: &lt;span class="pl-s"&gt;https://raw.githubusercontent.com/gregoiredavid/france-geojson/master/regions.geojson&lt;/span&gt;
          &lt;span class="pl-ent"&gt;format&lt;/span&gt;: &lt;span class="pl-s"&gt;{ type: json, property: features }&lt;/span&gt;
        &lt;span class="pl-ent"&gt;key&lt;/span&gt;: &lt;span class="pl-s"&gt;properties.nom&lt;/span&gt;
        &lt;span class="pl-ent"&gt;fields&lt;/span&gt;: &lt;span class="pl-s"&gt;[type, geometry]&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;I think my favourite trick though is the way it handles layout. The layout for the demo is defined thus:&lt;/p&gt;
&lt;div class="highlight highlight-source-yaml"&gt;&lt;pre&gt;&lt;span class="pl-ent"&gt;layout&lt;/span&gt;:
  - &lt;span class="pl-s"&gt;[analysis-note, offers-day, offers-day, offers-count]&lt;/span&gt;
  - &lt;span class="pl-s"&gt;[analysis-note, offers-source, offers-day-source, offers-region]&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This is then implemented &lt;a href="https://github.com/rclement/datasette-dashboards/blob/202fb2fcbc9efe4848fe940fae435eff75bb4f59/datasette_dashboards/templates/dashboard_view.html#L8-L26"&gt;using CSS grids&lt;/a&gt;! Here's the template fragment that does the work:&lt;/p&gt;
&lt;div class="highlight highlight-text-html-django"&gt;&lt;pre&gt;&amp;lt;&lt;span class="pl-ent"&gt;style&lt;/span&gt;&amp;gt;&lt;span class="pl-s1"&gt;&lt;/span&gt;
&lt;span class="pl-s1"&gt;  &lt;span class="pl-k"&gt;@media&lt;/span&gt; (&lt;span class="pl-c1"&gt;min-width&lt;/span&gt;: &lt;span class="pl-c1"&gt;800&lt;span class="pl-k"&gt;px&lt;/span&gt;&lt;/span&gt;) {&lt;/span&gt;
&lt;span class="pl-s1"&gt;    &lt;span class="pl-e"&gt;.dashboard-grid&lt;/span&gt; {&lt;/span&gt;
&lt;span class="pl-s1"&gt;      {% &lt;span class="pl-c1"&gt;if&lt;/span&gt; &lt;span class="pl-c1"&gt;dashboard&lt;/span&gt;.&lt;span class="pl-c1"&gt;layout&lt;/span&gt; %}&lt;/span&gt;
&lt;span class="pl-s1"&gt;      &lt;span class="pl-ent"&gt;grid-template-areas&lt;/span&gt;: {% &lt;span class="pl-c1"&gt;for&lt;/span&gt; &lt;span class="pl-c1"&gt;row&lt;/span&gt; &lt;span class="pl-c1"&gt;in&lt;/span&gt; &lt;span class="pl-c1"&gt;dashboard&lt;/span&gt;.&lt;span class="pl-c1"&gt;layout&lt;/span&gt; %}"{% &lt;span class="pl-c1"&gt;for&lt;/span&gt; &lt;span class="pl-c1"&gt;col&lt;/span&gt; &lt;span class="pl-c1"&gt;in&lt;/span&gt; &lt;span class="pl-c1"&gt;row&lt;/span&gt; %}{{ &lt;span class="pl-c1"&gt;col&lt;/span&gt; }} {% &lt;span class="pl-c1"&gt;endfor&lt;/span&gt; %}" {% &lt;span class="pl-c1"&gt;endfor&lt;/span&gt; %};&lt;/span&gt;
&lt;span class="pl-s1"&gt;      {% &lt;span class="pl-c1"&gt;else&lt;/span&gt; %}&lt;/span&gt;
&lt;span class="pl-s1"&gt;      &lt;span class="pl-ent"&gt;grid-template-columns&lt;/span&gt;: repeat(2, 1fr);&lt;/span&gt;
&lt;span class="pl-s1"&gt;      {% &lt;span class="pl-c1"&gt;endif&lt;/span&gt; %}&lt;/span&gt;
&lt;span class="pl-s1"&gt;    }&lt;/span&gt;
&lt;span class="pl-s1"&gt;&lt;/span&gt;
&lt;span class="pl-s1"&gt;    {% &lt;span class="pl-c1"&gt;if&lt;/span&gt; &lt;span class="pl-c1"&gt;dashboard&lt;/span&gt;.&lt;span class="pl-c1"&gt;layout&lt;/span&gt; %}&lt;/span&gt;
&lt;span class="pl-s1"&gt;    {% &lt;span class="pl-c1"&gt;for&lt;/span&gt; &lt;span class="pl-c1"&gt;chart&lt;/span&gt;_&lt;span class="pl-c1"&gt;slug&lt;/span&gt;, &lt;span class="pl-c1"&gt;chart&lt;/span&gt; &lt;span class="pl-c1"&gt;in&lt;/span&gt; &lt;span class="pl-c1"&gt;dashboard&lt;/span&gt;.&lt;span class="pl-c1"&gt;charts&lt;/span&gt;.&lt;span class="pl-c1"&gt;items&lt;/span&gt;() %}&lt;/span&gt;
&lt;span class="pl-s1"&gt;    &lt;span class="pl-e"&gt;#card-&lt;/span&gt;{{ &lt;span class="pl-c1"&gt;chart&lt;/span&gt;_&lt;span class="pl-c1"&gt;slug&lt;/span&gt; }} {&lt;/span&gt;
&lt;span class="pl-s1"&gt;      &lt;span class="pl-c1"&gt;&lt;span class="pl-c1"&gt;grid-area&lt;/span&gt;&lt;/span&gt;: {{ chart_slug }};&lt;/span&gt;
&lt;span class="pl-s1"&gt;    }&lt;/span&gt;
&lt;span class="pl-s1"&gt;    {% &lt;span class="pl-c1"&gt;endfor&lt;/span&gt; %}&lt;/span&gt;
&lt;span class="pl-s1"&gt;    {% &lt;span class="pl-c1"&gt;endif&lt;/span&gt; %}&lt;/span&gt;
&lt;span class="pl-s1"&gt;  }&lt;/span&gt;
&lt;span class="pl-s1"&gt;&lt;/span&gt;&amp;lt;/&lt;span class="pl-ent"&gt;style&lt;/span&gt;&amp;gt;&lt;/pre&gt;&lt;/div&gt;
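&lt;p&gt;For the demo's two-row layout, that template renders to CSS along these lines (abbreviated to two of the chart rules):&lt;/p&gt;

```css
@media (min-width: 800px) {
  .dashboard-grid {
    grid-template-areas:
      "analysis-note offers-day offers-day offers-count"
      "analysis-note offers-source offers-day-source offers-region";
  }
  #card-analysis-note { grid-area: analysis-note; }
  #card-offers-day { grid-area: offers-day; }
  /* ...one grid-area rule per chart... */
}
```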
&lt;p&gt;Such a clever and elegant trick.&lt;/p&gt;
&lt;h4 id="pypi-to-sqlite"&gt;pypi-to-sqlite&lt;/h4&gt;
&lt;p&gt;I wanted to add &lt;code&gt;datasette-dashboards&lt;/code&gt; to the official Datasette &lt;a href="https://datasette.io/plugins"&gt;plugins directory&lt;/a&gt;, but there was a catch: since most of the plugins listed there are written by me, the site has some baked in expectations: in particular, it expects that plugins will all be using the GitHub releases feature (&lt;a href="https://github.com/simonw/datasette-graphql/releases"&gt;for example&lt;/a&gt;) to announce their releases.&lt;/p&gt;
&lt;p&gt;Romain's plugin wasn't using that feature, instead maintaining &lt;a href="https://github.com/rclement/datasette-dashboards/blob/master/CHANGELOG.md"&gt;its own changelog file&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I've been meaning to make the plugin directory more forgiving for a while. I decided to switch from using GitHub releases as the definitive source of release information to using releases published to &lt;a href="https://pypi.org/"&gt;PyPI (the Python package index)&lt;/a&gt; instead.&lt;/p&gt;
&lt;p&gt;PyPI offers a stable JSON API: &lt;a href="https://pypi.org/pypi/datasette-dashboards/json"&gt;https://pypi.org/pypi/datasette-dashboards/json&lt;/a&gt; - which includes information on the package and all of its releases.&lt;/p&gt;
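&lt;p&gt;The top-level shape of that response is an &lt;code&gt;info&lt;/code&gt; object plus a &lt;code&gt;releases&lt;/code&gt; mapping of version number to uploaded files. Here's a sketch of pulling out the fields a tool like this cares about - the &lt;code&gt;info&lt;/code&gt; and &lt;code&gt;releases&lt;/code&gt; keys are real, but the values below are illustrative, not actual &lt;code&gt;datasette-dashboards&lt;/code&gt; data:&lt;/p&gt;

```python
import json

# Trimmed, illustrative sample of the PyPI JSON API response shape
sample = json.loads("""
{
  "info": {"name": "datasette-dashboards", "version": "0.2.1"},
  "releases": {
    "0.1.0": [{"upload_time": "2021-05-25T18:00:00",
               "filename": "datasette_dashboards-0.1.0-py3-none-any.whl"}],
    "0.2.1": [{"upload_time": "2022-03-01T12:00:00",
               "filename": "datasette_dashboards-0.2.1-py3-none-any.whl"}]
  }
}
""")

name = sample["info"]["name"]
latest = sample["info"]["version"]
# One (version, upload_time) pair per release - roughly the rows that
# would end up in a versions table
versions = [
    (version, files[0]["upload_time"])
    for version, files in sorted(sample["releases"].items())
]
print(name, latest, versions)
```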
&lt;p&gt;To reliably pull that information into &lt;a href="https://datasette.io"&gt;datasette.io&lt;/a&gt; I decided on a two-step process. First, I set up a &lt;a href="https://simonwillison.net/2020/Oct/9/git-scraping/"&gt;Git scraper&lt;/a&gt; to archive the data that I cared about into a new repository called &lt;a href="https://github.com/simonw/pypi-datasette-packages"&gt;pypi-datasette-packages&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;That repo stores the current PyPI JSON for every package listed on the Datasette website. This means I can see changes made to those files over time by browsing the commit history. It also means that if PyPI is unavailable I can still &lt;a href="https://simonwillison.net/2020/Dec/13/datasette-io/"&gt;build and deploy&lt;/a&gt; the site.&lt;/p&gt;
&lt;p&gt;Then I wrote a new tool called &lt;a href="https://datasette.io/tools/pypi-to-sqlite"&gt;pypi-to-sqlite&lt;/a&gt; to load that data into SQLite database tables. You can try that out like so:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;pip install pypi-to-sqlite
pypi-to-sqlite pypi.db datasette-dashboards pypi-to-sqlite --prefix pypi_
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;That &lt;code&gt;--prefix&lt;/code&gt; option causes the tables to be created with the specified prefix in their names.&lt;/p&gt;
&lt;p&gt;Here are the three tables generated by that command:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://datasette.io/content/pypi_packages"&gt;pypi_packages&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://datasette.io/content/pypi_versions"&gt;pypi_versions&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://datasette.io/content/pypi_releases"&gt;pypi_releases&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Using data from these tables I was able to &lt;a href="https://github.com/simonw/datasette.io/issues/98#issuecomment-1093144133"&gt;rework the SQL view&lt;/a&gt; that powers the plugins and tools directories on the site, and now &lt;a href="https://datasette.io/plugins/datasette-dashboards"&gt;datasette-dashboards has its own page there&lt;/a&gt;.&lt;/p&gt;
&lt;h4 id="shot-scraper-10-11"&gt;shot-scraper 0.10 and 0.11&lt;/h4&gt;
&lt;p&gt;&lt;a href="https://github.com/simonw/shot-scraper"&gt;shot-scraper&lt;/a&gt; is my tool for taking automated screenshots of web pages, built on top of &lt;a href="https://playwright.dev/"&gt;Playwright&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://twitter.com/palewire"&gt;Ben Welsh&lt;/a&gt; has been a key early adopter of &lt;code&gt;shot-scraper&lt;/code&gt;, using it to power his &lt;a href="https://github.com/palewire/news-homepages"&gt;news-homepages&lt;/a&gt; project which takes screenshots of various news websites and then both &lt;a href="https://twitter.com/newshomepages"&gt;tweets the results&lt;/a&gt; and uploads them to the &lt;a href="https://archive.org/details/news-homepages"&gt;News Homepages collection&lt;/a&gt; on the Internet Archive.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://github.com/simonw/shot-scraper/releases/tag/0.10"&gt;shot-scraper 0.10&lt;/a&gt; is mostly Ben's work: he contributed both a &lt;code&gt;--timeout&lt;/code&gt; option and a &lt;code&gt;--browser&lt;/code&gt; option to let you install and use browsers other than the Chromium default!&lt;/p&gt;
&lt;p&gt;(Ben needed this because some news homepages were embedding videos in a format that &lt;a href="https://github.com/microsoft/playwright/issues/13093"&gt;wasn't supported by Chromium&lt;/a&gt; but did work fine in regular Chrome.)&lt;/p&gt;
&lt;p&gt;&lt;a href="https://github.com/ryancheley"&gt;Ryan Cheley&lt;/a&gt; also contributed to 0.10 - thanks to Ryan, the &lt;code&gt;shot-scraper multi&lt;/code&gt; command now continues taking shots even if one of them fails, unless you pass the &lt;code&gt;--fail-on-error&lt;/code&gt; flag.&lt;/p&gt;
&lt;p&gt;In writing my weeknotes, I decided to use &lt;code&gt;shot-scraper&lt;/code&gt; to take a screenshot of the signed in homepage of the &lt;a href="https://www.pillarpointstewards.com/"&gt;www.pillarpointstewards.com&lt;/a&gt; site.&lt;/p&gt;
&lt;p&gt;In doing so, I found out that Google SSO &lt;a href="https://github.com/simonw/shot-scraper/issues/61"&gt;refuses to work&lt;/a&gt; with the default Playwright Chromium! But it does continue to work with Firefox, so I fixed the &lt;code&gt;shot-scraper auth&lt;/code&gt; command to support the &lt;code&gt;--browser&lt;/code&gt; option.&lt;/p&gt;
&lt;p&gt;I took the screenshot like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;shot-scraper auth https://www.pillarpointstewards.com/ -b firefox auth.json
# Now manually sign in with Auth0 and Google
shot-scraper https://www.pillarpointstewards.com/ -b firefox -a auth.json \
  --javascript "
    Array.from(
      document.querySelectorAll('[href^=tel]')
    ).forEach(el =&amp;gt; el.innerHTML = '(xxx) xxx-xxxx')"
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;That &lt;code&gt;--javascript&lt;/code&gt; line there redacts the phone numbers that are displayed on the page to signed in volunteers.&lt;/p&gt;
&lt;p&gt;I created the second screenshot of just the tide times chart using this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;shot-scraper https://www.pillarpointstewards.com/shifts/182/ \
  -b firefox -a auth.json \
  --selector '.primary h2:nth-child(8)' \
  --selector .day-alone --padding 15
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;a href="https://github.com/simonw/shot-scraper/releases/tag/0.11"&gt;shot-scraper 0.11&lt;/a&gt;, released a few minutes ago, contains the new &lt;code&gt;auth --browser&lt;/code&gt; feature plus some additional contributions from Ben Welsh, Ryan Murphy and Ian Wootten:&lt;/p&gt;
&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;New &lt;code&gt;shot-scraper accessibility --timeout&lt;/code&gt; option, thanks &lt;a href="https://github.com/palewire"&gt;Ben Welsh&lt;/a&gt;. &lt;a href="https://github.com/simonw/shot-scraper/pull/59"&gt;#59&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;shot-scraper auth --browser&lt;/code&gt; option for authentication using a browser other than Chromium. &lt;a href="https://github.com/simonw/shot-scraper/issues/61"&gt;#61&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Using &lt;code&gt;--quality&lt;/code&gt; now results in a JPEG file with the correct &lt;code&gt;.jpg&lt;/code&gt; extension. Thanks, &lt;a href="https://github.com/iwootten"&gt;Ian Wootten&lt;/a&gt;. &lt;a href="https://github.com/simonw/shot-scraper/pull/58"&gt;#58&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;New &lt;code&gt;--reduced-motion&lt;/code&gt; flag for emulating the "prefers-reduced-motion" media feature. Thanks, &lt;a href="https://github.com/rdmurphy"&gt;Ryan Murphy&lt;/a&gt;. &lt;a href="https://github.com/simonw/shot-scraper/issues/49"&gt;#49&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;h4&gt;Releases this week&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/shot-scraper"&gt;shot-scraper&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/shot-scraper/releases/tag/0.11"&gt;0.11&lt;/a&gt; - (&lt;a href="https://github.com/simonw/shot-scraper/releases"&gt;12 releases total&lt;/a&gt;) - 2022-04-08
&lt;br /&gt;Tools for taking automated screenshots of websites&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/pypi-to-sqlite"&gt;pypi-to-sqlite&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/pypi-to-sqlite/releases/tag/0.2.2"&gt;0.2.2&lt;/a&gt; - (&lt;a href="https://github.com/simonw/pypi-to-sqlite/releases"&gt;3 releases total&lt;/a&gt;) - 2022-04-08
&lt;br /&gt;Load data about Python packages from PyPI into SQLite&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/shot-scraper"&gt;shot-scraper&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/shot-scraper/releases/tag/0.10"&gt;0.10&lt;/a&gt; - (&lt;a href="https://github.com/simonw/shot-scraper/releases"&gt;12 releases total&lt;/a&gt;) - 2022-03-29
&lt;br /&gt;Tools for taking automated screenshots of websites&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;TIL this week&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/webassembly/compile-to-wasm-llvm-macos"&gt;Compiling to WASM with llvm on macOS&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/python/calendar-weeks"&gt;Generating a calendar week grid with the Python Calendar module&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/docker/docker-for-mac-container-to-postgresql-on-host"&gt;Allowing a container in Docker Desktop for Mac to talk to a PostgreSQL server on the host machine&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/pytest/treat-warnings-as-errors"&gt;Treating warnings as errors in pytest&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/auth0/auth0-logout"&gt;Logging users out of Auth0&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/bash/use-awk-to-add-a-prefix"&gt;Using awk to add a prefix&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/continuous-deployment"&gt;continuous-deployment&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/continuous-integration"&gt;continuous-integration&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/natalie-downe"&gt;natalie-downe&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/plugins"&gt;plugins&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pypi"&gt;pypi&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/weeknotes"&gt;weeknotes&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/shot-scraper"&gt;shot-scraper&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ben-welsh"&gt;ben-welsh&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="continuous-deployment"/><category term="continuous-integration"/><category term="natalie-downe"/><category term="plugins"/><category term="projects"/><category term="pypi"/><category term="datasette"/><category term="weeknotes"/><category term="shot-scraper"/><category term="ben-welsh"/></entry><entry><title>Running GitHub on Rails 6.0</title><link href="https://simonwillison.net/2021/Aug/6/running-github-on-rails-60/#atom-tag" rel="alternate"/><published>2021-08-06T16:30:59+00:00</published><updated>2021-08-06T16:30:59+00:00</updated><id>https://simonwillison.net/2021/Aug/6/running-github-on-rails-60/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.blog/2019-09-09-running-github-on-rails-6-0/"&gt;Running GitHub on Rails 6.0&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Back in 2019 Eileen M. Uchitelle explained how GitHub upgraded everything in production to Rails 6.0 within 1.5 weeks of the stable release. There’s a trick in here I really like: they have an automated weekly job which fetches the latest Rails main branch and runs the full GitHub test suite against it, giving them super-early warnings about anything that might break and letting them provide feedback to upstream about unintended regressions.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/continuous-integration"&gt;continuous-integration&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/github"&gt;github&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/rails"&gt;rails&lt;/a&gt;&lt;/p&gt;



</summary><category term="continuous-integration"/><category term="github"/><category term="rails"/></entry><entry><title>PAGNIs: Probably Are Gonna Need Its</title><link href="https://simonwillison.net/2021/Jul/1/pagnis/#atom-tag" rel="alternate"/><published>2021-07-01T19:13:58+00:00</published><updated>2021-07-01T19:13:58+00:00</updated><id>https://simonwillison.net/2021/Jul/1/pagnis/#atom-tag</id><summary type="html">
    &lt;p&gt;Luke Plant has a great post up with &lt;a href="https://lukeplant.me.uk/blog/posts/yagni-exceptions/"&gt;his list of YAGNI exceptions&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;YAGNI - You Ain't Gonna Need It - is a rule that says you shouldn't add a feature just because it might be useful in the future - only write code when it solves a direct problem.&lt;/p&gt;
&lt;p&gt;When should you override YAGNI? When the cost of adding something later is so dramatically expensive compared with the cost of adding it early on that it's worth taking the risk. Or when you know from experience that an initial investment will pay off many times over.&lt;/p&gt;
&lt;p&gt;Luke's exceptions to YAGNI are well chosen: things like logging, API versioning, created_at timestamps and a bias towards "store multiple X for a user" (a many-to-many relationship) if there's any inkling that the system may need to support more than one.&lt;/p&gt;
&lt;p&gt;Because I like attempting to coin phrases, I propose we call these &lt;strong&gt;PAGNIs&lt;/strong&gt; - short for &lt;strong&gt;Probably Are Gonna Need Its&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;Here are some of mine.&lt;/p&gt;
&lt;h4&gt;A kill-switch for your mobile apps&lt;/h4&gt;
&lt;p&gt;If you're building a mobile app that talks to your API, make sure to ship a kill-switch: a mechanism by which you can cause older versions of the application to show a "you must upgrade to continue using this application" screen when the app starts up.&lt;/p&gt;
&lt;p&gt;In an ideal world, you'll never use this ability: you'll continue to build new features to the app and make backwards-compatible changes to the API forever, such that ancient app versions keep working and new app versions get to do new things.&lt;/p&gt;
&lt;p&gt;But... sometimes that simply isn't possible. You might discover a security hole in the design of the application or API that can only be fixed by breaking backwards-compatibility - or maybe you're still maintaining a v1 API from five years ago to support a mobile application version that's only still installed by 30 users, and you'd like to not have to maintain double the amount of API code.&lt;/p&gt;
&lt;p&gt;You can't add a kill-switch retroactively to apps that have already been deployed!&lt;/p&gt;
&lt;p&gt;&lt;a href="https://twitter.com/myunderpants/status/1410655652867809281"&gt;Apparently Firebase offers this&lt;/a&gt; to many Android apps, but if you're writing for iOS you need to provide this yourself.&lt;/p&gt;
&lt;h4&gt;Automated deploys&lt;/h4&gt;
&lt;p&gt;Nothing kills a side project like coming back to it in six months time and having to figure out how to deploy it again. Thanks to &lt;a href="https://simonwillison.net/tags/githubactions/"&gt;GitHub Actions&lt;/a&gt; and hosting providers like Google Cloud Run, Vercel, Heroku and Netlify, setting up automated deployments is way easier now than it used to be. I have enough examples now that getting automated deployments working for a new project usually only takes a few minutes, and it pays off instantly.&lt;/p&gt;
&lt;h4&gt;Continuous Integration (and a test framework)&lt;/h4&gt;
&lt;p&gt;Similar to automated deployment in that GitHub Actions (and Circle CI and Travis before it) make this much less painful to set up than it used to be.&lt;/p&gt;
&lt;p&gt;Introducing a test framework to an existing project can be extremely painful. Introducing it at the very start is easy - and it sets a precedent that code should be tested from day one.&lt;/p&gt;
&lt;p&gt;These days I'm all about &lt;a href="https://simonwillison.net/tags/pytest/"&gt;pytest&lt;/a&gt;, and I have various cookiecutter templates (&lt;a href="https://github.com/simonw/datasette-plugin"&gt;datasette-plugin&lt;/a&gt;, &lt;a href="https://github.com/simonw/click-app"&gt;click-app&lt;/a&gt;, &lt;a href="https://github.com/simonw/python-lib"&gt;python-lib&lt;/a&gt;) that configure it on my new projects (with a passing test) out of the box.&lt;/p&gt;
&lt;p&gt;(Honestly, at this point in my career I consider continuous integration a DAGNI - Definitely Are Gonna Need It.)&lt;/p&gt;
&lt;p&gt;One particularly worthwhile trick is making sure the tests can spin up their own isolated test databases - another thing which is pretty easy to setup early (Django does this for you) and harder to add later on. I extend that to other external data stores - I once put a significant amount of effort into setting up a mechanism for running tests against Elasticsearch and clearing out the data again afterwards, and it paid off multiple times over.&lt;/p&gt;
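&lt;p&gt;As a rough sketch of that pattern (not the Elasticsearch setup itself - this uses an in-memory SQLite database for illustration), a pytest yield-fixture can hand every test its own throwaway database and tear it down afterwards. You'd normally decorate this with &lt;code&gt;@pytest.fixture&lt;/code&gt;; it's shown as a plain generator so it runs anywhere:&lt;/p&gt;

```python
# Hedged sketch: an isolated, throwaway database per test. With Django you get
# this for free; under pytest you would wrap this generator with @pytest.fixture
# and accept `db` as a test argument.
import sqlite3

def isolated_db():
    """Yield a fresh database with the schema applied, then tear it down."""
    conn = sqlite3.connect(":memory:")  # each call gets a brand new database
    conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT)")
    try:
        yield conn
    finally:
        conn.close()  # the in-memory database vanishes with the connection

# Driven manually here; pytest would do this setup/teardown around each test:
gen = isolated_db()
db = next(gen)
db.execute("INSERT INTO items (name) VALUES ('example')")
print(db.execute("SELECT count(*) FROM items").fetchone()[0])
```

&lt;p&gt;Because every test starts from an empty store, tests can't leak state into each other - the same property you want from the Elasticsearch version, just with more setup effort.&lt;/p&gt;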
&lt;p&gt;Even better: &lt;strong&gt;continuous deployment&lt;/strong&gt;! When the tests pass, deploy. If you have automated deployment set up already, adding this is pretty easy, and doing it from the very start of a project sets a strong cultural expectation that no-one will land code to the &lt;code&gt;main&lt;/code&gt; branch until it's in a production-ready state and covered by unit tests.&lt;/p&gt;
&lt;p&gt;(If continuous deployment to production is too scary for your project, a valuable middle-ground is continuous deployment to a staging environment. Having everyone on your team able to interact with a live demo of your current main branch is a huge group productivity boost.)&lt;/p&gt;
&lt;h4&gt;API pagination&lt;/h4&gt;
&lt;p&gt;Never build an API endpoint that isn't paginated. Any time you think "there will never be enough items in this list for it to be worth pagination" one of your users will prove you wrong.&lt;/p&gt;
&lt;p&gt;This can be as simple as shipping an API which, even though it only returns a single page, has hard-coded JSON that looks like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-json"&gt;&lt;pre&gt;{
  &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;results&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;: [
    {&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;id&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;: &lt;span class="pl-c1"&gt;1&lt;/span&gt;, &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;name&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;One&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;},
    {&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;id&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;: &lt;span class="pl-c1"&gt;2&lt;/span&gt;, &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;name&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Two&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;},
    {&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;id&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;: &lt;span class="pl-c1"&gt;3&lt;/span&gt;, &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;name&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Three&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;}
  ],
  &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;next_url&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;: &lt;span class="pl-c1"&gt;null&lt;/span&gt;
}&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;But make sure you leave space for the pagination information! You'll regret it if you don't.&lt;/p&gt;
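&lt;p&gt;A sketch of the server side, with hypothetical helper names: even when everything fits on one page, the response envelope reserves room for &lt;code&gt;next_url&lt;/code&gt; so clients learn to follow it from day one:&lt;/p&gt;

```python
# Minimal pagination-envelope sketch (paginate, base_url are illustrative).
# Returns one page of results plus next_url, which is None on the last page -
# matching the hard-coded JSON shape above.

def paginate(items, page=1, page_size=100, base_url="/api/items"):
    """Return one page of results and the URL of the next page, if any."""
    start = (page - 1) * page_size
    chunk = items[start:start + page_size]
    has_more = start + page_size < len(items)
    next_url = f"{base_url}?page={page + 1}" if has_more else None
    return {"results": chunk, "next_url": next_url}

# With only three items the single page comes back with next_url of None,
# but the envelope is already in place for when the list grows.
print(paginate([{"id": i, "name": str(i)} for i in range(1, 4)]))
```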
&lt;h4&gt;Detailed API logs&lt;/h4&gt;
&lt;p&gt;This is a trick I learned &lt;a href="https://simonwillison.net/2021/Apr/12/porting-vaccinateca-to-django/#value-of-api-logs"&gt;while porting VaccinateCA to Django&lt;/a&gt;. If you are building an API, having a mechanism that provides detailed logs - including the POST bodies passed to the API - is invaluable.&lt;/p&gt;
&lt;p&gt;It's an inexpensive way of maintaining a complete record of what happened with your application - invaluable for debugging, but also for tricks like replaying past API traffic against a new implementation under test.&lt;/p&gt;
&lt;p&gt;Logs like these may become infeasible at scale, but for a new project they'll probably add up to just a few MBs a day - and they're easy to prune or switch off later on if you need to.&lt;/p&gt;
&lt;p&gt;VIAL uses &lt;a href="https://github.com/CAVaccineInventory/vial/blob/a0780e27c39018b66f95278ce18eda5968c325f8/vaccinate/api/utils.py#L86"&gt;a Django view decorator&lt;/a&gt; to log these directly to a PostgreSQL table. We've been running this for a few months and it's now our largest table, but it's still only around 2GB - easily worth it for the productivity boost it gives us.&lt;/p&gt;
&lt;p&gt;(Don't log any sensitive data that you wouldn't want your development team having access to while debugging a problem. This may require clever redaction, or you can avoid logging specific endpoints entirely. Also: don't log authentication tokens that could be used to impersonate users: decode them and log the user identifier instead.)&lt;/p&gt;
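&lt;p&gt;The shape of the idea, stripped of Django and PostgreSQL, is just a decorator around the view. This is a hedged, framework-free sketch - the real VIAL decorator writes rows to a database table, while this one appends dicts to a list, and the redaction rule shown is purely illustrative:&lt;/p&gt;

```python
# Hedged sketch of an API-logging view decorator. The real implementation
# (linked above) logs to a PostgreSQL table; this records to an in-memory list
# and redacts a hypothetical "password" field before logging.
import functools

api_logs = []  # stand-in for the api_log database table

def log_api_calls(view):
    """Record method, path and POST body for every call to the wrapped view."""
    @functools.wraps(view)
    def wrapper(request):
        api_logs.append({
            "method": request["method"],
            "path": request["path"],
            # redact anything sensitive before it reaches the log
            "body": {k: v for k, v in request.get("body", {}).items()
                     if k != "password"},
        })
        return view(request)
    return wrapper

@log_api_calls
def submit_report(request):
    return {"ok": True}

submit_report({"method": "POST", "path": "/api/submitReport",
               "body": {"location": "sf-123", "password": "hunter2"}})
print(api_logs)
```

&lt;p&gt;Because the log entry is captured before the view runs, you keep a record even of requests that go on to fail - which is exactly when you want it.&lt;/p&gt;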
&lt;h4&gt;A bookmarkable interface for executing read-only SQL queries against your database&lt;/h4&gt;
&lt;p&gt;This one is very much exposing my biases (I just released &lt;a href="https://django-sql-dashboard.datasette.io/"&gt;Django SQL Dashboard 1.0&lt;/a&gt; which provides exactly this for Django+PostgreSQL projects) but having used this for the past few months I can't see myself going back. Using bookmarked SQL queries to inform the implementation of new features is an incredible productivity boost. Here's &lt;a href="https://github.com/CAVaccineInventory/vial/issues/528"&gt;an issue I worked on&lt;/a&gt; recently with 18 comments linking to illustrative SQL queries.&lt;/p&gt;
&lt;p&gt;(On further thought: this isn't actually a great example of a PAGNI because it's not particularly hard to add this to a project at a later date.)&lt;/p&gt;
&lt;h4&gt;Driving down the cost&lt;/h4&gt;
&lt;p&gt;One trick with all of these things is that while they may seem quite expensive to implement, they get dramatically cheaper as you gain experience and gather more tools for helping put them into practice.&lt;/p&gt;
&lt;p&gt;Any of the ideas I've shown here could take an engineering team weeks (if not months) to add to an existing project - but with the right tooling they can represent just an hour (or less) work at the start of a project. And they'll pay themselves off many, many times over in the future.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/continuous-deployment"&gt;continuous-deployment&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/continuous-integration"&gt;continuous-integration&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/definitions"&gt;definitions&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/software-engineering"&gt;software-engineering&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/testing"&gt;testing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pytest"&gt;pytest&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/github-actions"&gt;github-actions&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/django-sql-dashboard"&gt;django-sql-dashboard&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/yagni"&gt;yagni&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pagni"&gt;pagni&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="continuous-deployment"/><category term="continuous-integration"/><category term="definitions"/><category term="software-engineering"/><category term="testing"/><category term="pytest"/><category term="github-actions"/><category term="django-sql-dashboard"/><category term="yagni"/><category term="pagni"/></entry><entry><title>Quoting Drew DeVault</title><link href="https://simonwillison.net/2021/Apr/26/drew-devault/#atom-tag" rel="alternate"/><published>2021-04-26T23:52:32+00:00</published><updated>2021-04-26T23:52:32+00:00</updated><id>https://simonwillison.net/2021/Apr/26/drew-devault/#atom-tag</id><summary type="html">
    &lt;blockquote cite="https://drewdevault.com/2021/04/26/Cryptocurrency-is-a-disaster.html"&gt;&lt;p&gt;Over the past several months, everyone in the industry who provides any kind of free CPU resources has been dealing with a massive outbreak of abuse for cryptocurrency mining. The industry has been setting up informal working groups to pool knowledge of mitigations, communicate when our platforms are being leveraged against one another, and cumulatively wasting thousands of hours of engineering time implementing measures to deal with this abuse, and responding as attackers find new ways to circumvent them.&lt;/p&gt;&lt;/blockquote&gt;
&lt;p class="cite"&gt;&amp;mdash; &lt;a href="https://drewdevault.com/2021/04/26/Cryptocurrency-is-a-disaster.html"&gt;Drew DeVault&lt;/a&gt;, SourceHut&lt;/p&gt;

    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/continuous-integration"&gt;continuous-integration&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/security"&gt;security&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/bitcoin"&gt;bitcoin&lt;/a&gt;&lt;/p&gt;



</summary><category term="continuous-integration"/><category term="security"/><category term="bitcoin"/></entry><entry><title>Spinning up a new Django app to act as a backend for VaccinateCA</title><link href="https://simonwillison.net/2021/Feb/23/vaccinateca-2021-02-23/#atom-tag" rel="alternate"/><published>2021-02-23T17:00:00+00:00</published><updated>2021-02-23T17:00:00+00:00</updated><id>https://simonwillison.net/2021/Feb/23/vaccinateca-2021-02-23/#atom-tag</id><summary type="html">
    &lt;p class="context"&gt;&lt;em&gt;Originally posted to my internal blog at VaccinateCA&lt;/em&gt;&lt;/p&gt;&lt;p&gt;My goal by the end of this week is to have a working proof of concept for a Django + PostgreSQL app that can replace Airtable as the principle backend for the &lt;a href="https://www.vaccinateca.com/" rel="nofollow"&gt;https://www.vaccinateca.com/&lt;/a&gt; site. This proof of concept will allow us to make a go or no-go decision and figure out what else needs to be implemented before we can start using it to track calls.&lt;/p&gt;
&lt;p&gt;I'm calling it a "prototype" and a "proof of concept", but my career has taught me that prototypes often end up going into production - so I'm building it with that in mind.&lt;/p&gt;
&lt;p&gt;Today I started building that app. The repo is currently &lt;a href="https://github.com/CAVaccineInventory/django.vaccinate"&gt;https://github.com/CAVaccineInventory/django.vaccinate&lt;/a&gt; though we are likely to rename it soon - possibly to VIAL (for Vaccine Information Archive and Library) - Jesse is good at actually relevant codenames!&lt;/p&gt;
&lt;p&gt;Here's what I have so far:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Proof of concept for Auth0 SSO - mostly from following &lt;a href="https://auth0.com/docs/quickstart/webapp/django/01-login" rel="nofollow"&gt;their tutorial&lt;/a&gt;. You can try that out on the staging site homepage at &lt;a href="https://vaccinateca-preview.herokuapp.com/" rel="nofollow"&gt;https://vaccinateca-preview.herokuapp.com/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;The very beginnings of a test suite - built using &lt;code&gt;pytest-django&lt;/code&gt;. The tests run against PostgreSQL and I had to figure out how to do that inside GitHub Actions - here's &lt;a href="https://til.simonwillison.net/github-actions/postgresq-service-container" rel="nofollow"&gt;my TIL&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;The tests run in GitHub Actions! Continuous Integration - it should work against pull requests too.&lt;/li&gt;
&lt;li&gt;... which means we can take the next step and go for Continuous Deployment. Every commit that passes the tests is now deployed instantly to the staging environment.&lt;/li&gt;
&lt;li&gt;The staging environment itself is currently on Heroku, because they make it ridiculously easy to set up Continuous Deployment - it's literally a checkbox in their admin panel. I wrote about this a few years ago: &lt;a href="https://simonwillison.net/2017/Oct/17/free-continuous-deployment/" rel="nofollow"&gt;How to set up world-class continuous deployment using free hosted tools&lt;/a&gt;. It's likely we'll move this to Google Cloud at some point since other VaccinateCA stuff is running there. I know how to run Continuous Deployment using Google Cloud Run so that could be a good option here.&lt;/li&gt;
&lt;li&gt;Errors now get logged to &lt;a href="https://sentry.io/organizations/vaccinateca/issues/?project=5649843" rel="nofollow"&gt;a new project&lt;/a&gt; in the VaccinateCA Sentry instance.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;And the biggest thing: I've implemented &lt;a href="https://github.com/CAVaccineInventory/django.vaccinate/blob/9a6fd676be8907e67ba13ada924f5656b22357e3/vaccinate/core/models.py"&gt;a set of Django models&lt;/a&gt; against the SQL schema that has been coming together in &lt;a href="https://github.com/CAVaccineInventory/data-engineering/pull/2"&gt;this pull request&lt;/a&gt;. These are exposed in the Django Admin (just with default settings, no customization yet) in the staging environment.&lt;/p&gt;
&lt;p&gt;You can try those out by visiting &lt;a href="https://vaccinateca-preview.herokuapp.com/admin/" rel="nofollow"&gt;https://vaccinateca-preview.herokuapp.com/admin/&lt;/a&gt; and signing in with username &lt;code&gt;demo&lt;/code&gt; and password &lt;code&gt;demo&lt;/code&gt; (this account will be deleted the second we have any real data in the prototype).&lt;/p&gt;
&lt;p&gt;I wrote data migrations to &lt;a href="https://github.com/CAVaccineInventory/django.vaccinate/blob/9a6fd676be8907e67ba13ada924f5656b22357e3/vaccinate/core/migrations/0002_populate_states.py"&gt;insert states&lt;/a&gt; and &lt;a href="https://github.com/CAVaccineInventory/django.vaccinate/blob/9a6fd676be8907e67ba13ada924f5656b22357e3/vaccinate/core/migrations/0003_populate_ca_counties.py"&gt;insert counties&lt;/a&gt; - you can see the results in the admin &lt;a href="https://vaccinateca-preview.herokuapp.com/admin/core/county/" rel="nofollow"&gt;here (counties)&lt;/a&gt; and &lt;a href="https://vaccinateca-preview.herokuapp.com/admin/core/state/" rel="nofollow"&gt;here (states)&lt;/a&gt;.&lt;/p&gt;
&lt;h4&gt;
Some engineering principles&lt;/h4&gt;
&lt;p&gt;I've invested a lot of effort today in getting some fundamental things set up: a test suite, continuous integration, continuous deployment, and a detailed and up-to-date &lt;a href="https://github.com/CAVaccineInventory/django.vaccinate/blob/9a6fd676be8907e67ba13ada924f5656b22357e3/README.md"&gt;README&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The reason I'm investing that effort now is that I know from experience that these things are incredibly valuable, and very easy to implement at the start of a project... but become massively harder the longer you delay them. Adding comprehensive tests, documentation and CI to a six month old project can take weeks. Adding them to a project that is just starting takes just a few hours.&lt;/p&gt;
&lt;p&gt;I also plan to lean &lt;strong&gt;very&lt;/strong&gt; heavily on the Django migrations system.&lt;/p&gt;
&lt;p&gt;I've worked at companies in the past where database migrations - any kind of schema change - are slow, rare and exciting. This has horrible knock-on effects: engineers will go to great lengths to avoid adding a column to a table, which can lead to a rapid accretion of technical debt.&lt;/p&gt;
&lt;p&gt;I want schema changes to be quick, common and boring. Django's migration system - especially against PostgreSQL, which can execute schema changes inside transactions - is ideally suited to this. I want to start using it aggressively as early as possible, to ensure we have a culture that says "yes" to schema changes and executes them promptly and frequently.&lt;/p&gt;
&lt;h4&gt;
Next steps&lt;/h4&gt;
&lt;p&gt;I want to get some real data into the system! I'm going to lock down the security a bit more, then take some exports from Airtable, convert them to the new schema and load them into the prototype. This will allow us to really start kicking the tires on it.&lt;/p&gt;
&lt;p&gt;I'm tracking all of the work on the Django app in &lt;a href="https://github.com/CAVaccineInventory/django.vaccinate/issues"&gt;the issues&lt;/a&gt; for that repository.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/continuous-deployment"&gt;continuous-deployment&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/continuous-integration"&gt;continuous-integration&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/django"&gt;django&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/django-admin"&gt;django-admin&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/postgresql"&gt;postgresql&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/vaccinate-ca"&gt;vaccinate-ca&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/vaccinate-ca-blog"&gt;vaccinate-ca-blog&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="continuous-deployment"/><category term="continuous-integration"/><category term="django"/><category term="django-admin"/><category term="postgresql"/><category term="vaccinate-ca"/><category term="vaccinate-ca-blog"/></entry><entry><title>Quoting Vincent Driessen</title><link href="https://simonwillison.net/2020/May/14/vincent-driessen/#atom-tag" rel="alternate"/><published>2020-05-14T13:49:55+00:00</published><updated>2020-05-14T13:49:55+00:00</updated><id>https://simonwillison.net/2020/May/14/vincent-driessen/#atom-tag</id><summary type="html">
    &lt;blockquote cite="https://nvie.com/posts/a-successful-git-branching-model/"&gt;&lt;p&gt;Web apps are typically continuously delivered, not rolled back, and you don't have to support multiple versions of the software running in the wild.&lt;/p&gt;
&lt;p&gt;This is not the class of software that I had in mind when I wrote the blog post 10 years ago. If your team is doing continuous delivery of software, I would suggest to adopt a much simpler workflow (like GitHub flow) instead of trying to shoehorn git-flow into your team.&lt;/p&gt;&lt;/blockquote&gt;
&lt;p class="cite"&gt;&amp;mdash; &lt;a href="https://nvie.com/posts/a-successful-git-branching-model/"&gt;Vincent Driessen&lt;/a&gt;&lt;/p&gt;

    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/continuous-deployment"&gt;continuous-deployment&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/continuous-integration"&gt;continuous-integration&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git"&gt;git&lt;/a&gt;&lt;/p&gt;



</summary><category term="continuous-deployment"/><category term="continuous-integration"/><category term="git"/></entry><entry><title>Tracking FARA by deploying a data API using GitHub Actions and Cloud Run</title><link href="https://simonwillison.net/2020/Jan/21/github-actions-cloud-run/#atom-tag" rel="alternate"/><published>2020-01-21T07:51:11+00:00</published><updated>2020-01-21T07:51:11+00:00</updated><id>https://simonwillison.net/2020/Jan/21/github-actions-cloud-run/#atom-tag</id><summary type="html">
    &lt;p&gt;I'm using the combination of GitHub Actions and Google Cloud Run to retrieve data from the U.S. Department of Justice FARA website and deploy it as a queryable API using Datasette.&lt;/p&gt;

&lt;h3&gt;FARA background&lt;/h3&gt;

&lt;p&gt;The &lt;a href="https://www.justice.gov/nsd-fara"&gt;Foreign Agents Registration Act (FARA)&lt;/a&gt; law that requires "certain agents of foreign principals who are engaged in political activities or other activities specified under the statute to make periodic public disclosure of their relationship with the foreign principal, as well as activities, receipts and disbursements in support of those activities".&lt;/p&gt;

&lt;p&gt;The law was introduced in 1938 in response to the large number of German propaganda agents that were operating in the U.S. prior to the war.&lt;/p&gt;

&lt;p&gt;Basically, if you are in the United States as a lobbyist for a foreign government you need to register under FARA. It was used in 23 criminal cases during World War II, but hasn't had much use since it was amended in 1966. Although... if you consult the &lt;a href="https://www.justice.gov/nsd-fara/recent-cases"&gt;list of recent cases&lt;/a&gt; you'll see some very interesting recent activity involving Russia and Ukraine.&lt;/p&gt;

&lt;p&gt;It's also for spies! Quoting &lt;a href="https://www.justice.gov/nsd-fara/general-fara-frequently-asked-questions"&gt;the FARA FAQ&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;&lt;p&gt;Finally, 50 U.S.C. § 851, requires registration of persons who have knowledge of or have received instruction or assignment in espionage, counterespionage or sabotage service or tactics of a foreign country or political party.&lt;/p&gt;&lt;/blockquote&gt;

&lt;p&gt;I imagine most spies operate in violation of this particular law and don't take steps to register themselves.&lt;/p&gt;

&lt;p&gt;It's all still pretty fascinating though, in part because it gets updated. A lot. Almost every business day in fact.&lt;/p&gt;

&lt;h3&gt;Tracking FARA history&lt;/h3&gt;

&lt;p&gt;I know this because seven months ago I set up a scraper for it. Every twelve hours I have code which downloads the &lt;a href="https://efile.fara.gov/ords/f?p=API:BULKDATA"&gt;four bulk CSVs&lt;/a&gt; published by the Justice department and saves them to &lt;a href="https://github.com/simonw/fara-history"&gt;a git repository&lt;/a&gt;. It's the same trick I've been using &lt;a href="https://simonwillison.net/2019/Mar/13/tree-history/"&gt;to track San Francisco's database of trees&lt;/a&gt; and &lt;a href="https://simonwillison.net/2019/Oct/10/pge-outages/"&gt;PG&amp;amp;E's outage map&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;I've been running the scraper using Circle CI, but this weekend I decided to switch it over to &lt;a href="https://github.com/features/actions"&gt;GitHub Actions&lt;/a&gt; to get a better idea of how they work.&lt;/p&gt;

&lt;h3&gt;Deploying it as an API&lt;/h3&gt;

&lt;p&gt;I also wanted to upgrade my script to deploy a fresh &lt;a href="https://datasette.readthedocs.io/"&gt;Datasette&lt;/a&gt; instance of the data using &lt;a href="https://cloud.google.com/run/"&gt;Google Cloud Run&lt;/a&gt;. I wrote &lt;a href="https://github.com/simonw/fara-datasette"&gt;a script&lt;/a&gt; to do this on a manual basis last year, but I never combined it with the daily scraper. Combining the two means I can offer a Datasette-powered API directly against the latest data.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://fara.datasettes.com/"&gt;https://fara.datasettes.com&lt;/a&gt; is that API - it now updates twice a day, assuming there are some changes to the underlying data.&lt;/p&gt;

&lt;h3&gt;Putting it all together&lt;/h3&gt;

&lt;p&gt;The final GitHub Actions workflow can be &lt;a href="https://github.com/simonw/fara-history/blob/7e33f2fc4619247e77d9b3b725ace6584228b601/.github/workflows/scheduled.yml"&gt;seen here&lt;/a&gt;. I'll walk through an annotated version of it below.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;on:
  repository_dispatch:
  schedule:
    - cron:  '0 0,12 * * *'&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This sets when the workflow should be triggered. I'm running it twice a day - at midnight and noon UTC (the 0,12 cron syntax).&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;repository_dispatch&lt;/code&gt; key means I can also &lt;a href="https://help.github.com/en/actions/automating-your-workflow-with-github-actions/events-that-trigger-workflows#external-events-repository_dispatch"&gt;trigger it manually&lt;/a&gt; by running the following &lt;code&gt;curl&lt;/code&gt; command - useful for testing:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;curl -XPOST https://api.github.com/repos/simonw/fara-history/dispatches \
    -H 'Authorization: token MY_PERSONAL_TOKEN_HERE' \
    -d '{"event_type": "trigger_action"}' \
    -H 'Accept: application/vnd.github.everest-preview+json'&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Next comes the job itself, which I called &lt;code&gt;scheduled&lt;/code&gt; and set to run on the latest Ubuntu:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;jobs:
  scheduled:
    runs-on: ubuntu-latest
    steps:&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Then come the steps. Each step is run in turn, in an isolated process (presumably a container) but with access to the current working directory.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;- uses: actions/checkout@v2
  name: Check out repo
- name: Set up Python
  uses: actions/setup-python@v1
  with:
    python-version: 3.8&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The first two steps check out the &lt;code&gt;fara-history&lt;/code&gt; repository and install Python 3.8.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;- uses: actions/cache@v1
  name: Configure pip caching
  with:
    path: ~/.cache/pip
    key: ${{ runner.os }}-pip-${{ hashFiles('**/requirements.txt') }}
    restore-keys: |
      ${{ runner.os }}-pip-&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This step &lt;em&gt;should&lt;/em&gt; set up a cache so that &lt;code&gt;pip&lt;/code&gt; doesn't have to download fresh dependencies on every run. Unfortunately it doesn't seem to actually work - it only works for &lt;code&gt;push&lt;/code&gt; and &lt;code&gt;pull_request&lt;/code&gt; events, but my workflow is triggered by &lt;code&gt;schedule&lt;/code&gt; and &lt;code&gt;repository_dispatch&lt;/code&gt;. There's &lt;a href="https://github.com/actions/cache/issues/63"&gt;an open issue about this&lt;/a&gt;.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;- name: Install Python dependencies
  run: |
    python -m pip install --upgrade pip
    pip install -r requirements.txt&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This step installs my dependencies &lt;a href="https://github.com/simonw/fara-history/blob/7e33f2fc4619247e77d9b3b725ace6584228b601/requirements.txt"&gt;from requirements.txt&lt;/a&gt;.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;- name: Fetch, update and commit FARA data
  run: . update_and_commit_all.sh
- name: Build fara.db database
  run: python build_database.py&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Now we're getting to the fun stuff. My &lt;a href="https://github.com/simonw/fara-history/blob/7e33f2fc4619247e77d9b3b725ace6584228b601/update_and_commit_all.sh"&gt;update_and_commit_all.sh&lt;/a&gt; script downloads the four zip files &lt;a href="https://efile.fara.gov/ords/f?p=API:BULKDATA"&gt;from the FARA.gov site&lt;/a&gt;, unzips them, sorts them, diffs them against the previously stored files and commits the new copy to GitHub if they have changed. See &lt;a href="https://simonwillison.net/2019/Mar/13/tree-history/#csvdiff_18"&gt;my explanation of csv-diff&lt;/a&gt; for more on this - though sadly only one of the files has a reliable row identifier so I can't generate great commit messages for most of them.&lt;/p&gt;
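&lt;p&gt;The core of that trick is sorting the rows before diffing and committing, so a re-download with shuffled row order produces no git diff at all. Here's a simplified sketch of that normalization step - the function name and sample data are hypothetical, not taken from the real script:&lt;/p&gt;

```python
# Sort CSV data rows (keeping the header first) so that downloads which
# merely re-order rows don't show up as changes in git.
import csv
import io

def normalize_csv(text):
    """Return CSV text with the header first and the data rows sorted."""
    rows = list(csv.reader(io.StringIO(text)))
    header, data = rows[0], rows[1:]
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(header)
    writer.writerows(sorted(data))
    return out.getvalue()

print(normalize_csv("id,name\r\n2,b\r\n1,a\r\n"))
```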

&lt;p&gt;My &lt;a href="https://github.com/simonw/fara-history/blob/7e33f2fc4619247e77d9b3b725ace6584228b601/build_database.py"&gt;build_database.py&lt;/a&gt; script uses &lt;a href="https://sqlite-utils.readthedocs.io/"&gt;sqlite-utils&lt;/a&gt; to convert the CSV files into a SQLite database.&lt;/p&gt;

&lt;p&gt;Now that we've got a SQLite database, we can &lt;a href="https://datasette.readthedocs.io/en/stable/publish.html#publishing-to-google-cloud-run"&gt;deploy it to Google Cloud Run&lt;/a&gt; using Datasette.&lt;/p&gt;

&lt;p&gt;But should we run a deploy at all? If the database hasn't changed, there's no point in deploying it. How can we tell if the database file has changed from the last one that was published?&lt;/p&gt;

&lt;p&gt;Datasette has a mechanism for deriving a content hash of a database, part of &lt;a href="https://datasette.readthedocs.io/en/stable/performance.html"&gt;a performance optimization&lt;/a&gt; which is no longer turned on by default and may be removed in the future.&lt;/p&gt;

&lt;p&gt;You can generate JSON that includes that hash using the &lt;code&gt;datasette inspect&lt;/code&gt; command. The jq tool can then be used to extract just the hash:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ datasette inspect fara.db | jq '.fara.hash' -r
fbc9cbaca6de1e232fc14494faa06cc8d4cb9f379d0d568e4711e9a218800906&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The &lt;code&gt;-r&lt;/code&gt; option to &lt;code&gt;jq&lt;/code&gt; causes it to return just the raw string, without quote marks.&lt;/p&gt;

&lt;p&gt;Datasette's &lt;code&gt;/-/databases.json&lt;/code&gt; introspection URL reveals the hashes of the currently deployed database. Here's how to pull the currently deployed hash:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ curl -s https://fara.datasettes.com/-/databases.json | jq '.[0].hash' -r
a6c0ab26589bde0d225c5a45044e0adbfa3840b95fbb263d01fd8fb0d2460ed5&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;If those two hashes differ then we should deploy the new database.&lt;/p&gt;
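&lt;p&gt;The decision itself boils down to a string comparison. A minimal sketch of that logic - the helper names are mine and the JSON payload is abbreviated:&lt;/p&gt;

```python
# Compare the hash of the freshly built database against the hash
# Datasette reports for the currently deployed one.
import json

def deployed_hash(databases_json):
    """Pull the first database's hash out of /-/databases.json output."""
    return json.loads(databases_json)[0]["hash"]

def should_deploy(latest, deployed):
    """Deploy only if the fresh database differs from production."""
    return latest != deployed

payload = '[{"name": "fara", "hash": "a6c0ab"}]'
print(should_deploy("fbc9cb", deployed_hash(payload)))
```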

&lt;p&gt;GitHub Actions have a &lt;a href="https://help.github.com/en/actions/automating-your-workflow-with-github-actions/development-tools-for-github-actions#set-an-output-parameter-set-output"&gt;slightly bizarre mechanism&lt;/a&gt; for defining "output variables" for steps, which can then be used to conditionally run further steps.&lt;/p&gt;

&lt;p&gt;Here's the step that sets those variables, followed by the step that conditionally installs the Google Cloud CLI tools using &lt;a href="https://github.com/GoogleCloudPlatform/github-actions/tree/master/setup-gcloud"&gt;their official action&lt;/a&gt;:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;- name: Set variables to decide if we should deploy
  id: decide_variables
  run: |-
    echo "##[set-output name=latest;]$(datasette inspect fara.db | jq '.fara.hash' -r)"
    echo "##[set-output name=deployed;]$(curl -s https://fara.datasettes.com/-/databases.json | jq '.[0].hash' -r)"
- name: Set up Cloud Run
  if: steps.decide_variables.outputs.latest != steps.decide_variables.outputs.deployed
  uses: GoogleCloudPlatform/github-actions/setup-gcloud@master
  with:
    version: '275.0.0'
    service_account_email: ${{ secrets.GCP_SA_EMAIL }}
    service_account_key: ${{ secrets.GCP_SA_KEY }}&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Having installed the Google Cloud tools, I can deploy my database using Datasette:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;- name: Deploy to Cloud Run
  if: steps.decide_variables.outputs.latest != steps.decide_variables.outputs.deployed
  run: |-
    gcloud components install beta
    gcloud config set run/region us-central1
    gcloud config set project datasette-222320
    datasette publish cloudrun fara.db --service fara-history -m metadata.json&lt;/code&gt;&lt;/pre&gt;

&lt;p id="google-cloud-service-key"&gt;This was by far the hardest part to figure out.&lt;/p&gt;

&lt;p&gt;First, I needed to create a Google Cloud &lt;a href="https://cloud.google.com/iam/docs/service-accounts"&gt;service account&lt;/a&gt; with an accompanying service key.&lt;/p&gt;

&lt;p&gt;I tried and failed to do this using the CLI, so I switched to their web console following &lt;a href="https://cloud.google.com/iam/docs/creating-managing-service-accounts"&gt;these&lt;/a&gt; and then &lt;a href="https://cloud.google.com/iam/docs/creating-managing-service-account-keys"&gt;these&lt;/a&gt; instructions.&lt;/p&gt;

&lt;p&gt;Having downloaded the key JSON file, I converted it to base64 and pasted it into a GitHub Actions secret (hidden away in the repository settings area) called &lt;code&gt;GCP_SA_KEY&lt;/code&gt;.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;cat ~/Downloads/datasette-222320-2ad02afe6d82.json \
    | base64 | pbcopy&lt;/code&gt;&lt;/pre&gt;
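&lt;p&gt;A sketch of that base64 round trip - the dummy JSON here stands in for the real downloaded key file:&lt;/p&gt;

```python
# Encode a service account key file to base64 (for pasting into a GitHub
# secret) and decode it back to the original JSON.
import base64
import json

key_json = '{"type": "service_account", "project_id": "datasette-example"}'
encoded = base64.b64encode(key_json.encode("utf-8")).decode("ascii")
decoded = json.loads(base64.b64decode(encoded))
print(decoded["project_id"])
```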

&lt;p&gt;The service account needed permissions in order to run a build through Cloud Build and then deploy the result through Cloud Run. I spent a bunch of time trying out different combinations and eventually gave up and gave the account "Editor" permissions across my entire project. This is bad. I am hoping someone can help me understand what the correct narrow set of permissions are, and how to apply them.&lt;/p&gt;

&lt;p&gt;It also took me a while to figure out that I needed to run these three commands before I could deploy to my project. The first installs the Cloud Run tools; the second and third set some required configuration:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;gcloud components install beta
gcloud config set run/region us-central1
gcloud config set project datasette-222320&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;But... having done all of the above, the following command run from an action successfully deploys the site!&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;datasette publish cloudrun fara.db \
    --service fara-history -m metadata.json&lt;/code&gt;&lt;/pre&gt;

&lt;h3&gt;DNS&lt;/h3&gt;

&lt;p&gt;Google Cloud Run deployments come with extremely ugly default URLs. For this project, that URL is &lt;code&gt;https://fara-history-j7hipcg4aq-uc.a.run.app/&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;I wanted something nicer. I own &lt;code&gt;datasettes.com&lt;/code&gt; and manage the DNS via Cloudflare, which means I can point subdomains at Cloud Run instances.&lt;/p&gt;

&lt;p&gt;This is a two-step process:&lt;/p&gt;

&lt;ol&gt;&lt;li&gt;I set &lt;code&gt;fara.datasettes.com&lt;/code&gt; as a DNS-only (no proxying) CNAME for &lt;code&gt;ghs.googlehosted.com&lt;/code&gt;.&lt;/li&gt;&lt;li&gt;In the Google Cloud Console I used Cloud Run -&amp;gt; Manage Custom Domains (a button in the header) -&amp;gt; Add Mapping to specify that &lt;code&gt;fara.datasettes.com&lt;/code&gt; should map to my &lt;code&gt;fara-history&lt;/code&gt; service (the &lt;code&gt;--service&lt;/code&gt; argument from &lt;code&gt;datasette publish&lt;/code&gt; earlier).&lt;/li&gt;&lt;/ol&gt;

&lt;p&gt;I had previously &lt;a href="https://support.google.com/webmasters/answer/9008080?hl=en"&gt;verified my domain ownership&lt;/a&gt; - I forget quite how I did it. Domains purchased through &lt;a href="https://domains.google/"&gt;Google Domains&lt;/a&gt; get to skip this step.&lt;/p&gt;

&lt;h3&gt;Next steps&lt;/h3&gt;

&lt;p&gt;This was a lot of fiddling around. I'm hoping that by writing this up in detail I'll be able to get this working much faster next time.&lt;/p&gt;

&lt;p&gt;I think this model - GitHub Actions that pull data, build a database and deploy to Cloud Run using &lt;code&gt;datasette publish&lt;/code&gt; - is incredibly promising. The end result should be an API that costs cents-to-dollars a month to operate thanks to Cloud Run's scale-to-zero architecture. And hopefully by publishing this all on GitHub it will be as easy as possible for other people to duplicate it for their own projects.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/continuous-deployment"&gt;continuous-deployment&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/continuous-integration"&gt;continuous-integration&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/data-journalism"&gt;data-journalism&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/github"&gt;github&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/cloudrun"&gt;cloudrun&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/github-actions"&gt;github-actions&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git-scraping"&gt;git-scraping&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="continuous-deployment"/><category term="continuous-integration"/><category term="data-journalism"/><category term="github"/><category term="projects"/><category term="datasette"/><category term="cloudrun"/><category term="github-actions"/><category term="git-scraping"/></entry><entry><title>GitHub Actions ci.yml for deno</title><link href="https://simonwillison.net/2019/Dec/18/github-actions-ciyml-deno/#atom-tag" rel="alternate"/><published>2019-12-18T08:49:40+00:00</published><updated>2019-12-18T08:49:40+00:00</updated><id>https://simonwillison.net/2019/Dec/18/github-actions-ciyml-deno/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/denoland/deno/blob/c93ae0b05a4c4fe5b43a9bd2b6430637b17979d0/.github/workflows/ci.yml"&gt;GitHub Actions ci.yml for deno&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Spotted this today: it’s one of the cleanest examples I’ve seen of a complex CI configuration for GitHub Actions, testing, linting, benchmarking and building Ryan Dahl’s Deno JavaScript runtime.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/continuous-integration"&gt;continuous-integration&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/github"&gt;github&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ryan-dahl"&gt;ryan-dahl&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/github-actions"&gt;github-actions&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/deno"&gt;deno&lt;/a&gt;&lt;/p&gt;



</summary><category term="continuous-integration"/><category term="github"/><category term="ryan-dahl"/><category term="github-actions"/><category term="deno"/></entry><entry><title>Continuous Integration with Travis CI - ZEIT Documentation</title><link href="https://simonwillison.net/2018/Jun/1/zeit-with-travis-ci/#atom-tag" rel="alternate"/><published>2018-06-01T17:21:50+00:00</published><updated>2018-06-01T17:21:50+00:00</updated><id>https://simonwillison.net/2018/Jun/1/zeit-with-travis-ci/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://zeit.co/docs/continuous-integration/travis?utm_source=twitter&amp;amp;utm_medium=social&amp;amp;utm_campaign=travis_ci_guide"&gt;Continuous Integration with Travis CI - ZEIT Documentation&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
One of the neat things about Zeit Now is that since deployments are unlimited and are automatically assigned a unique URL you can set up a continuous integration system like Travis to deploy a brand new copy of every commit or every pull request. This documentation also shows how to have commits to master automatically aliased to a known URL. I have quite a few Datasette projects that are deployed automatically to Now by Travis and the pattern seems to be working great so far.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/continuous-deployment"&gt;continuous-deployment&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/continuous-integration"&gt;continuous-integration&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/zeit-now"&gt;zeit-now&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/travis"&gt;travis&lt;/a&gt;&lt;/p&gt;



</summary><category term="continuous-deployment"/><category term="continuous-integration"/><category term="zeit-now"/><category term="travis"/></entry><entry><title>Porting my blog to Python 3</title><link href="https://simonwillison.net/2017/Oct/21/python3/#atom-tag" rel="alternate"/><published>2017-10-21T22:22:40+00:00</published><updated>2017-10-21T22:22:40+00:00</updated><id>https://simonwillison.net/2017/Oct/21/python3/#atom-tag</id><summary type="html">
    &lt;p&gt;This blog is now running on Python 3! Admittedly this is nearly nine years after &lt;a href="https://www.python.org/download/releases/3.0/"&gt;the first release of Python 3.0&lt;/a&gt;, but it’s the first Python 3 project I’ve deployed myself so I’m pretty excited about it.&lt;/p&gt;
&lt;p&gt;Library authors like to use &lt;a href="https://pypi.python.org/pypi/six"&gt;six&lt;/a&gt; to allow them to write code that supports both Python 2 and Python 3 at the same time… but my blog isn’t a library, so I used the &lt;a href="https://docs.python.org/3/library/2to3.html"&gt;2to3 conversion tool&lt;/a&gt; that ships with Python instead.&lt;/p&gt;
&lt;p&gt;And… it worked pretty well! I ran the following command from my project’s root directory:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;2to3 -w -n blog/ config/ redirects/ feedstats/
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;-w&lt;/code&gt; option causes the files to be over-written in place. Since everything is already in git, there was no reason to have 2to3 show me a diff without applying it. Likewise, the &lt;code&gt;-n&lt;/code&gt; option tells 2to3 not to bother saving backups of the files it modifies.&lt;/p&gt;
&lt;p&gt;Here’s &lt;a href="https://github.com/simonw/simonwillisonblog/commit/615efeba55c0c32a8147bda49e207a7a52ddb674"&gt;the initial commit&lt;/a&gt; containing mostly the 2to3 changes.&lt;/p&gt;
&lt;p&gt;Next step: run the tests! My test suite may be very thin, but it does at least check that the app can run its migrations, start up and serve a few basic pages without errors. One of my migrations was failing due to rogue bytestrings but that was &lt;a href="https://github.com/simonw/simonwillisonblog/commit/f00224e7375098bb500b56b35c6f40dbc4955abc#diff-15a9a47c9ee1f93f3a07a4dbe0cf4214"&gt;an easy fix&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;At this point I started to lean heavily on my &lt;a href="https://simonwillison.net/2017/Oct/17/free-continuous-deployment/"&gt;continuous integration setup built on Travis CI&lt;/a&gt;. All of my Python 3 work took place &lt;a href="https://github.com/simonw/simonwillisonblog/tree/python3"&gt;in a branch&lt;/a&gt;, and all it took was a &lt;a href="https://github.com/simonw/simonwillisonblog/commit/f00224e7375098bb500b56b35c6f40dbc4955abc#diff-354f30a63fb0907d4ad57269548329e3"&gt;one line change to my .travis.yml&lt;/a&gt; for Travis to start running the tests for that branch using Python 3.&lt;/p&gt;
&lt;p&gt;With the basic tests working, I made my first deploy to my Heroku staging instance - after first &lt;a href="https://github.com/simonw/simonwillisonblog/commit/54b31e98e35031afbbf8c18f3c2446a0af8b5c65"&gt;modifying my Heroku runtime.txt&lt;/a&gt; to tell it to use Python 3.6.2. My staging environment allowed me to sanity check that everything would work OK when deployed to Heroku.&lt;/p&gt;
&lt;p&gt;At this point I got a bit lazy. The responsible thing to do would have been extensive manual testing plus systematic unit test coverage of core functionality. My blog is hardly a critical piece of infrastructure though, so I went with the faster option: put it all live and &lt;a href="https://simonwillison.net/2017/Oct/17/free-continuous-deployment/#Step_4_Monitor_errors_with_Sentry_75"&gt;use Sentry to see if anything breaks&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;This is where Heroku’s ability to deploy a specific branch came in handy: one click to deploy my python3 branch, keep an eye on Sentry (via push notifications from &lt;a href="https://simonwillison.net/2017/Oct/17/free-continuous-deployment/#Step_5_Hook_it_all_together_with_Slack_97"&gt;my private slack channel&lt;/a&gt;) and then one click to deploy my master branch again for an instant rollback in case of errors. Which I had to do almost immediately, because it turned out I had stored some data in Django’s cache using Python 2 pickle and was trying to read it back out again using Python 3.&lt;/p&gt;
&lt;p&gt;I fixed that by &lt;a href="https://github.com/simonw/simonwillisonblog/commit/41f7a112721ec5772ad189c4293da081291a604a"&gt;bumping my cache VERSION setting&lt;/a&gt; and deployed again. This deploy lasted a few minutes longer before Sentry started to fill up with encoding errors and I rolled it back again.&lt;/p&gt;
&lt;p&gt;The single biggest difference between Python 2 and Python 3 is &lt;a href="https://docs.python.org/3.0/whatsnew/3.0.html#text-vs-data-instead-of-unicode-vs-8-bit"&gt;how strings are handled&lt;/a&gt;. Python 3 strings are unicode sequences. Learning to live in a world where strings are all unicode and byte strings are the rare, deliberate exceptions takes some getting used to.&lt;/p&gt;
&lt;p&gt;The key challenge for my blog actually came from &lt;a href="https://github.com/simonw/simonwillisonblog/blob/45d7acd56af475119d2738e736d9b4cb19a9e8eb/blog/templatetags/entry_tags.py"&gt;my custom markup handling template tags&lt;/a&gt;. 15 years ago &lt;a href="https://simonwillison.net/2002/Jun/16/myFirstXhtmlMindBomb/"&gt;I made the decision&lt;/a&gt; to &lt;a href="https://simonwillison.net/2003/Jan/6/xhtmlIsJustFine/"&gt;store all of my blog entries&lt;/a&gt; as valid XHTML fragments. This meant I could use XML processors - back then in PHP, today &lt;a href="https://docs.python.org/3/library/xml.etree.elementtree.html"&gt;Python’s ElementTree&lt;/a&gt; - to perform various transformations on my content.&lt;/p&gt;
&lt;p&gt;ElementTree in Python 2 can only consume bytestrings. In Python 3 it expects unicode strings. Cleaning this up took a while, eventually inspiring me to &lt;a href="https://github.com/simonw/simonwillisonblog/commit/7295cddd1a6ab2c7bc6fcf3da410ab6ea0954791"&gt;refactor my custom template tags completely&lt;/a&gt;. In the process I realized that my blog templates were mostly written back before Django’s template language implemented autoescape (&lt;a href="https://simonwillison.net/2008/Jul/22/alpha/"&gt;in Django 1.0&lt;/a&gt;), so my code was littered with unnecessary &lt;code&gt;|escape&lt;/code&gt; and &lt;code&gt;|safe&lt;/code&gt; filters. Those are all gone now.&lt;/p&gt;
&lt;p&gt;Sentry lets you mark an exception as “resolved” when you think you’ve fixed it - if it occurs again after that it will be re-reported to your Slack channel and added back to the Sentry list of unresolved issues. Once Sentry was clear (especially given Googlebot had crawled my older pages) I could be pretty confident there were no more critical 500-causing errors.&lt;/p&gt;
&lt;p&gt;That left logic errors, of which only one has cropped up so far: the “zero years ago” bug. Entries on my homepage include a relative date representation, e.g. “three days ago”. Python 3 &lt;a href="https://www.python.org/dev/peps/pep-0238/"&gt;changed how the division operator works on integers&lt;/a&gt; - &lt;code&gt;3 / 2 == 1.5&lt;/code&gt; where in Python 2 it gets truncated to &lt;code&gt;1&lt;/code&gt;. As a result, every entry on my homepage showed “zero years ago”. Thankfully this was &lt;a href="https://github.com/simonw/simonwillisonblog/commit/d6e6eeb93ac02aa33c59b151f2a20e26d41f18b1"&gt;a one-line fix&lt;/a&gt;.&lt;/p&gt;
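&lt;p&gt;The change in miniature - this is a sketch of the semantics, not the actual template tag code:&lt;/p&gt;

```python
# In Python 3 the / operator always performs true division and returns a
# float; // is the truncating (floor) division that Python 2's / used to
# perform on integers.
days_since_posted = 1000

true_division = days_since_posted / 365    # 2.739... in Python 3
floor_division = days_since_posted // 365  # 2 - the old Python 2 behavior

print(true_division, floor_division)
```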
&lt;p&gt;All in all this process was much less painful than I expected. It took me longer to write this blog entry than it did to actually make the conversion (thanks to 2to3 doing most of the tedious work), and the combination of Travis CI, Sentry and Heroku allowed me to ship aggressively with the knowledge that I could promptly identify and resolve any issues that came up.&lt;/p&gt;
&lt;p&gt;Next upgrade: &lt;a href="https://www.djangoproject.com/weblog/2017/oct/16/django-20-beta-1-released/"&gt;Django 2.0&lt;/a&gt;!&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/continuous-integration"&gt;continuous-integration&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/python3"&gt;python3&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/heroku"&gt;heroku&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/travis"&gt;travis&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sentry"&gt;sentry&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="continuous-integration"/><category term="python"/><category term="python3"/><category term="heroku"/><category term="travis"/><category term="sentry"/></entry><entry><title>How to set up world-class continuous deployment using free hosted tools</title><link href="https://simonwillison.net/2017/Oct/17/free-continuous-deployment/#atom-tag" rel="alternate"/><published>2017-10-17T13:32:49+00:00</published><updated>2017-10-17T13:32:49+00:00</updated><id>https://simonwillison.net/2017/Oct/17/free-continuous-deployment/#atom-tag</id><summary type="html">
    &lt;p&gt;I’m going to describe a way to put together a world-class continuous deployment infrastructure for your side-project without spending any money.&lt;/p&gt;
&lt;p&gt;With &lt;a href="https://puppet.com/blog/continuous-delivery-vs-continuous-deployment-what-s-diff"&gt;continuous deployment&lt;/a&gt; every code commit is tested against an automated test suite. If the tests pass it gets deployed directly to the production environment! How’s that for an incentive to write comprehensive tests?&lt;/p&gt;
&lt;p&gt;Each of the tools I’m using offers a free tier which is easily enough to handle most side-projects. And once you outgrow those free plans, you can solve those limitations in exchange for money!&lt;/p&gt;
&lt;p&gt;Here’s the magic combination:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/"&gt;GitHub&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://travis-ci.org/"&gt;Travis CI&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://heroku.com/"&gt;Heroku&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://sentry.io/"&gt;Sentry&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://slack.com/"&gt;Slack&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;&lt;a id="Step_one_Publish_some_code_to_GitHub_with_some_tests_16"&gt;&lt;/a&gt;Step one: Publish some code to GitHub with some tests&lt;/h2&gt;
&lt;p&gt;I’ll be using the &lt;a href="https://github.com/simonw/simonwillisonblog"&gt;code for my blog&lt;/a&gt; as an example. It’s a classic Django application, with a small (OK, tiny) suite of unit tests. The tests are run using the standard Django &lt;code&gt;./manage.py test&lt;/code&gt; command.&lt;/p&gt;
&lt;p&gt;Writing a Django application with tests is outside the scope of this article. Thankfully the official Django tutorial &lt;a href="https://docs.djangoproject.com/en/1.11/intro/tutorial05/"&gt;covers testing in some detail&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;&lt;a id="Step_two_Hook_up_Travis_CI_22"&gt;&lt;/a&gt;Step two: Hook up Travis CI&lt;/h2&gt;
&lt;p&gt;&lt;a href="https://travis-ci.org/"&gt;Travis CI&lt;/a&gt; is an outstanding hosted platform for continuous integration. Given a small configuration file it can check out code from GitHub, set up an isolated test environment (including hefty dependencies like a PostgreSQL database server, Elasticsearch, Redis etc), run your test suite and report the resulting pass/fail grade back to GitHub.&lt;/p&gt;
&lt;p&gt;It’s free for publicly hosted GitHub projects. If you want to test code in a private repository you’ll have to pay them some money.&lt;/p&gt;
&lt;p&gt;Here’s &lt;a href="https://github.com/simonw/simonwillisonblog/blob/a5c2d2549f26dd2d75cbf863c8b36d617092c2a1/.travis.yml"&gt;my .travis.yml configuration file&lt;/a&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;language: python

python:
  - 2.7

services: postgresql

addons:
  postgresql: &amp;quot;9.6&amp;quot;

install:
  - pip install -r requirements.txt

before_script:
  - psql -c &amp;quot;CREATE DATABASE travisci;&amp;quot; -U postgres
  - python manage.py migrate --noinput
  - python manage.py collectstatic

script:
  - python manage.py test
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;And here’s the resulting &lt;a href="https://travis-ci.org/simonw/simonwillisonblog"&gt;Travis CI dashboard&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The integration of Travis with GitHub runs &lt;em&gt;deep&lt;/em&gt;. Once you’ve set up Travis, it will automatically test every push to every branch - driven by GitHub webhooks, so test runs are set off almost instantly. Travis will then report the test results back to GitHub, where they’ll show up in a bunch of different places -  including these pleasing green ticks on &lt;a href="https://github.com/simonw/simonwillisonblog/branches"&gt;the branches page&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img style="width: 100%" src="https://static.simonwillison.net/static/2017/github-branches-with-ci-small.png" alt="GitHub branches page showing CI results" /&gt;&lt;/p&gt;
&lt;p&gt;Travis will also run tests against any &lt;a href="https://github.com/simonw/simonwillisonblog/pull/3"&gt;open pull requests&lt;/a&gt;. This is a great incentive to build new features in a pull request even if you aren’t using them for code review:&lt;/p&gt;
&lt;p&gt;&lt;img style="width: 100%" src="https://static.simonwillison.net/static/2017/github-pull-request-with-ci-small.png" alt="GitHub pull request showing CI results" /&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://circleci.com/"&gt;Circle CI&lt;/a&gt; deserves a mention as an alternative to Travis. The two are close competitors and offer very similar feature sets, and Circle CI's free plan allows up to 1,500 build minutes of private repositories per month.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Update 25th July 2020&lt;/strong&gt;: I've started using GitHub Actions for most of my projects now - see my &lt;a href="https://simonwillison.net/tags/githubactions/"&gt;githubactions&lt;/a&gt; tag.&lt;/p&gt;

&lt;h2&gt;&lt;a id="Step_3_Deploy_to_Heroku_and_turn_on_continuous_deployment_61"&gt;&lt;/a&gt;Step 3: Deploy to Heroku and turn on continuous deployment&lt;/h2&gt;
&lt;p&gt;I’m a big fan of &lt;a href="https://heroku.com/"&gt;Heroku&lt;/a&gt; for side projects, because it means not having to worry about ongoing server-maintenance. I’ve lost several side-projects to &lt;a href="https://blog.heroku.com/archives/2011/6/28/the_new_heroku_4_erosion_resistance_explicit_contracts/"&gt;entropy and software erosion&lt;/a&gt; - getting an initial VPS set up may be pretty simple, but a year later security patches need applying and the OS needs upgrading and the log files have filled up the disk and you’ve forgotten how you set everything up in the first place…&lt;/p&gt;
&lt;p&gt;It turns out Heroku has basic support for continuous deployment baked in, and it’s trivially easy to set up. You can tell Heroku to deploy on every commit to GitHub, and then if you’ve attached a CI service like Travis that reports build health back you can check the box for “Wait for CI to pass before deploy”:&lt;/p&gt;
&lt;p&gt;&lt;img style="width: 100%" src="https://static.simonwillison.net/static/2017/heroku-deploy-settings-small.png" alt="Heroku deployment settings for continuous deployment" /&gt;&lt;/p&gt;
&lt;p&gt;Since small dynos on Heroku are free, you can even set up a separate Heroku app as a staging environment. I started my continuous integration adventure just deploying automatically to my staging instance, then switched over to deploying to production once I gained some confidence in how it all fitted together.&lt;/p&gt;
&lt;p&gt;If you’re using continuous deployment with Heroku and Django, it’s a good idea to set up Heroku to automatically run your migrations for every deploy - otherwise you might merge a pull request with a model change and forget to run the migrations before the deploy goes out. You can do that using Heroku’s &lt;a href="https://devcenter.heroku.com/articles/release-phase"&gt;release phase&lt;/a&gt; feature, by adding the line &lt;code&gt;release: python manage.py migrate --noinput&lt;/code&gt; to your Heroku &lt;code&gt;Procfile&lt;/code&gt; (&lt;a href="https://github.com/simonw/simonwillisonblog/blob/81f7e2ba19b84f572e8a546bcc28bbfb1e211eb6/Procfile"&gt;here’s mine&lt;/a&gt;).&lt;/p&gt;
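&lt;p&gt;As a sketch, a complete &lt;code&gt;Procfile&lt;/code&gt; using the release phase might look something like this - &lt;code&gt;myproject&lt;/code&gt; and gunicorn here are placeholder assumptions, so substitute whatever WSGI setup your app already uses:&lt;/p&gt;

```
web: gunicorn myproject.wsgi --log-file -
release: python manage.py migrate --noinput
```

&lt;p&gt;Heroku runs the &lt;code&gt;release&lt;/code&gt; command after each build completes but before the new code is routed any traffic, so a failing migration will abort the deploy.&lt;/p&gt;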
&lt;p&gt;Once you go beyond Heroku’s free tier things get much more powerful: &lt;a href="https://www.heroku.com/flow"&gt;Heroku Flow&lt;/a&gt; combines pipelines, review apps and their own CI solution to provide a comprehensive solution for much larger teams.&lt;/p&gt;
&lt;h2&gt;&lt;a id="Step_4_Monitor_errors_with_Sentry_75"&gt;&lt;/a&gt;Step 4: Monitor errors with Sentry&lt;/h2&gt;
&lt;p&gt;If you’re going to move fast and break things, you need to know when things have broken. &lt;a href="https://sentry.io/"&gt;Sentry&lt;/a&gt; is a fantastic tool for collecting exceptions, aggregating them and spotting when something new crops up. It’s open source so you can host it yourself, but they also offer a robust hosted version with a free plan that can track up to 10,000 errors a month.&lt;/p&gt;
&lt;p&gt;My favourite feature of Sentry is that it gives each exception it sees a “signature” based on an MD5 hash of its traceback. This means it can tell if errors are the same underlying issue or something different, and can hence de-dupe them and only alert you the first time it spots an error it has not seen before.&lt;/p&gt;
&lt;p&gt;&lt;img style="width: 100%" src="https://static.simonwillison.net/static/2017/sentry-small.png" alt="Notifications from Travis CI and GitHub in Slack" /&gt;&lt;/p&gt;
&lt;p&gt;Sentry has integrations for most modern languages, but it’s particularly easy to use with Django. Just install &lt;a href="https://pypi.python.org/pypi/raven"&gt;raven&lt;/a&gt; and add a few extra lines to your &lt;code&gt;settings.py&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;SENTRY_DSN = os.environ.get('SENTRY_DSN')
if SENTRY_DSN:
    INSTALLED_APPS += (
        'raven.contrib.django.raven_compat',
    )
    RAVEN_CONFIG = {
        'dsn': SENTRY_DSN,
        'release': os.environ.get('HEROKU_SLUG_COMMIT', ''),
    }
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here I’m using the Heroku pattern of &lt;a href="https://devcenter.heroku.com/articles/config-vars"&gt;keeping configuration in environment variables&lt;/a&gt;. &lt;code&gt;SENTRY_DSN&lt;/code&gt; is provided by Sentry when you create your project there - you just have to add it as a Heroku config variable.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;HEROKU_SLUG_COMMIT&lt;/code&gt; line causes the currently deployed git commit hash to be fed to Sentry so that it knows what version of your code was running when it reports an error. To enable that variable, you’ll need to &lt;a href="https://devcenter.heroku.com/articles/dyno-metadata"&gt;enable Dyno Metadata&lt;/a&gt; by running &lt;code&gt;heroku labs:enable runtime-dyno-metadata&lt;/code&gt; against your application.&lt;/p&gt;
&lt;h2&gt;&lt;a id="Step_5_Hook_it_all_together_with_Slack_97"&gt;&lt;/a&gt;Step 5: Hook it all together with Slack&lt;/h2&gt;
&lt;p&gt;Would you like a push notification to your phone every time your site gets code committed / the tests pass or fail / a deploy goes out / a new error is detected? All of the above tools can report such things to &lt;a href="https://slack.com/"&gt;Slack&lt;/a&gt;, and Slack’s free plan is easily enough to collect all of these notifications and push them to your phone via the free Slack &lt;a href="https://slack.com/downloads/ios"&gt;iOS&lt;/a&gt; or &lt;a href="https://slack.com/downloads/android"&gt;Android&lt;/a&gt; apps.&lt;/p&gt;
&lt;p&gt;&lt;img style="width: 100%" src="https://static.simonwillison.net/static/2017/slack-github-ci-small.png" alt="Notifications from Travis CI and GitHub in Slack" /&gt;&lt;/p&gt;
&lt;p&gt;Here are instructions for setting up Slack with &lt;a href="https://get.slack.help/hc/en-us/articles/232289568-Use-GitHub-with-Slack"&gt;GitHub&lt;/a&gt;, &lt;a href="https://docs.travis-ci.com/user/notifications/#Configuring-slack-notifications"&gt;Travis CI&lt;/a&gt;, &lt;a href="https://slack.com/apps/A0F7VRF7E-heroku"&gt;Heroku&lt;/a&gt; and &lt;a href="https://slack.com/apps/A0F814BEV-sentry"&gt;Sentry&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;&lt;a id="Need_more_Pay_for_it_105"&gt;&lt;/a&gt;Need more? Pay for it!&lt;/h2&gt;
&lt;p&gt;Having run much of this kind of infrastructure myself in the past I for one am delighted by the idea of outsourcing it, especially when the hosted options are of such high quality.&lt;/p&gt;
&lt;p&gt;Each of these tools offers a free tier which is generous enough to work great for small side projects. As you start scaling up, you can start paying for them - that’s why they gave you a free tier in the first place.&lt;/p&gt;

&lt;p&gt;Comments or suggestions? Join &lt;a href="https://news.ycombinator.com/item?id=15490935"&gt;this thread on Hacker News&lt;/a&gt;.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/continuous-deployment"&gt;continuous-deployment&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/continuous-integration"&gt;continuous-integration&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/django"&gt;django&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/github"&gt;github&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/postgresql"&gt;postgresql&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/testing"&gt;testing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/heroku"&gt;heroku&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/slack"&gt;slack&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/travis"&gt;travis&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sentry"&gt;sentry&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="continuous-deployment"/><category term="continuous-integration"/><category term="django"/><category term="github"/><category term="postgresql"/><category term="testing"/><category term="heroku"/><category term="slack"/><category term="travis"/><category term="sentry"/></entry><entry><title>What are good and easy practices for frequent web deployments?</title><link href="https://simonwillison.net/2013/Jan/8/what-are-good-and/#atom-tag" rel="alternate"/><published>2013-01-08T10:32:00+00:00</published><updated>2013-01-08T10:32:00+00:00</updated><id>https://simonwillison.net/2013/Jan/8/what-are-good-and/#atom-tag</id><summary type="html">
    &lt;p&gt;&lt;em&gt;My answer to &lt;a href="https://www.quora.com/What-are-good-and-easy-practices-for-frequent-web-deployments/answer/Simon-Willison"&gt;What are good and easy practices for frequent web deployments?&lt;/a&gt; on Quora&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;At Lanyrd we use a combination of Fabric to drive our deploy scripts, git to get the code on to the servers, puppet for configuration management and Jenkins to run continuous integration tests and provide a "deploy the site" button.&lt;/p&gt;

&lt;p&gt;Here are a few important techniques I've learned:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use symlink switching to keep the previous version of the code around, so you can switch back in the case of problems (that said, we've never actually used this capability - but it's nice for atomic deploys as well)&lt;/li&gt;
&lt;li&gt;Have your build script rename your static asset files (CSS/JS/etc) to include part of the md5 hash of the file contents in their filename. This means you can upload them to your static host provider (we use S3) before you run a deploy, guaranteeing that freshly deployed templates will point to the right files. It also keeps the older versions around in case you need to roll back.&lt;/li&gt;
&lt;li&gt;Having one button that deploys the site is invaluable&lt;/li&gt;
&lt;li&gt;Deploys need to be almost "free" in terms of impact on site performance - if it doesn't cost anything to deploy the site people will be freely able to deploy often and push out small fixes, which is good for the health of your codebase&lt;/li&gt;
&lt;li&gt;Get new engineers to deploy on the first day! Doing so forces you/them to get a full development and deployment environment up and running for them on day one, which means that they can start doing real work on day two.&lt;/li&gt;
&lt;/ul&gt;
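&lt;p&gt;The md5-renaming trick above can be sketched in a few lines of Python - a hypothetical helper, not Lanyrd’s actual build script:&lt;/p&gt;

```python
import hashlib
import shutil
from pathlib import Path


def hash_rename(path: Path, digest_len: int = 8) -> Path:
    """Copy an asset to a name embedding part of the MD5 of its contents,
    e.g. style.css -> style.3f2a9c1b.css. Templates can then reference an
    exact version, and older copies stay around for rollbacks."""
    digest = hashlib.md5(path.read_bytes()).hexdigest()[:digest_len]
    hashed = path.with_name(f"{path.stem}.{digest}{path.suffix}")
    shutil.copy2(path, hashed)
    return hashed
```

&lt;p&gt;Because the name changes whenever the contents change, the hashed files can be served with far-future cache headers and uploaded to S3 ahead of the deploy itself.&lt;/p&gt;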
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/continuous-integration"&gt;continuous-integration&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/startups"&gt;startups&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/quora"&gt;quora&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/lanyrd"&gt;lanyrd&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/devops"&gt;devops&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="continuous-integration"/><category term="startups"/><category term="quora"/><category term="lanyrd"/><category term="devops"/></entry><entry><title>Fabric factory</title><link href="https://simonwillison.net/2009/Sep/21/fabricfactory/#atom-tag" rel="alternate"/><published>2009-09-21T18:35:12+00:00</published><updated>2009-09-21T18:35:12+00:00</updated><id>https://simonwillison.net/2009/Sep/21/fabricfactory/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="http://yml-blog.blogspot.com/2009/09/fabric-factory.html"&gt;Fabric factory&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Promising-looking continuous integration server written in Django, which uses Fabric scripts to define actions.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/continuous-integration"&gt;continuous-integration&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/django"&gt;django&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/fabric"&gt;fabric&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/fabricfactory"&gt;fabricfactory&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/testing"&gt;testing&lt;/a&gt;&lt;/p&gt;



</summary><category term="continuous-integration"/><category term="django"/><category term="fabric"/><category term="fabricfactory"/><category term="python"/><category term="testing"/></entry><entry><title>Localbuilder</title><link href="https://simonwillison.net/2009/Jan/14/localbuilder/#atom-tag" rel="alternate"/><published>2009-01-14T22:57:59+00:00</published><updated>2009-01-14T22:57:59+00:00</updated><id>https://simonwillison.net/2009/Jan/14/localbuilder/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="http://morethanseven.net/2009/01/14/localbuilder-github/"&gt;Localbuilder&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Gareth Rushgrove’s neat little Python continuous integration tool—it watches a directory for changes, then runs a command when it spots any.
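&lt;p&gt;A polling watcher in that spirit can be sketched in a few lines of Python - an illustrative sketch, not Localbuilder’s actual code:&lt;/p&gt;

```python
import subprocess
import time
from pathlib import Path


def snapshot(root: Path) -> dict:
    """Map every file under root to its last-modified time."""
    return {p: p.stat().st_mtime for p in root.rglob("*") if p.is_file()}


def watch(root: Path, command: list, interval: float = 1.0) -> None:
    """Poll root every `interval` seconds; run `command` whenever any
    file is added, removed, or modified."""
    before = snapshot(root)
    while True:
        time.sleep(interval)
        after = snapshot(root)
        if after != before:
            subprocess.run(command)
            before = after
```

&lt;p&gt;Polling mtimes is the simplest portable approach; a real tool would likely use OS-level file-change notifications instead.&lt;/p&gt;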


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/continuous-integration"&gt;continuous-integration&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gareth-rushgrove"&gt;gareth-rushgrove&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/localbuilder"&gt;localbuilder&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/testing"&gt;testing&lt;/a&gt;&lt;/p&gt;



</summary><category term="continuous-integration"/><category term="gareth-rushgrove"/><category term="localbuilder"/><category term="python"/><category term="testing"/></entry></feed>