<?xml version="1.0" encoding="utf-8"?>
<feed xml:lang="en-us" xmlns="http://www.w3.org/2005/Atom"><title>Simon Willison's Weblog: cloudrun</title><link href="http://simonwillison.net/" rel="alternate"/><link href="http://simonwillison.net/tags/cloudrun.atom" rel="self"/><id>http://simonwillison.net/</id><updated>2025-11-05T23:11:17+00:00</updated><author><name>Simon Willison</name></author><entry><title>Open redirect endpoint in Datasette prior to 0.65.2 and 1.0a21</title><link href="https://simonwillison.net/2025/Nov/5/open-redirect-datasette/#atom-tag" rel="alternate"/><published>2025-11-05T23:11:17+00:00</published><updated>2025-11-05T23:11:17+00:00</updated><id>https://simonwillison.net/2025/Nov/5/open-redirect-datasette/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette/security/advisories/GHSA-w832-gg5g-x44m"&gt;Open redirect endpoint in Datasette prior to 0.65.2 and 1.0a21&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
This GitHub security advisory covers two new releases of Datasette that I shipped today, both addressing &lt;a href="https://github.com/simonw/datasette/issues/2429"&gt;the same open redirect issue&lt;/a&gt; with a fix by &lt;a href="https://github.com/jamesjefferies"&gt;James Jefferies&lt;/a&gt;.
&lt;p&gt;&lt;strong&gt;&lt;a href="https://docs.datasette.io/en/stable/changelog.html#v0-65-2"&gt;Datasette 0.65.2&lt;/a&gt;&lt;/strong&gt; fixes the bug and also adds Python 3.14 support and a &lt;code&gt;datasette publish cloudrun&lt;/code&gt; fix.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&lt;a href="https://docs.datasette.io/en/latest/changelog.html#a21-2025-11-05"&gt;Datasette 1.0a21&lt;/a&gt;&lt;/strong&gt; also has that Cloud Run fix and two other small new features:&lt;/p&gt;
&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;New &lt;code&gt;datasette --get /path --headers&lt;/code&gt; option for inspecting the headers returned by a path. (&lt;a href="https://github.com/simonw/datasette/issues/2578"&gt;#2578&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;New &lt;code&gt;datasette.client.get(..., skip_permission_checks=True)&lt;/code&gt; parameter to bypass permission checks when making requests using the internal client. (&lt;a href="https://github.com/simonw/datasette/issues/2583"&gt;#2583&lt;/a&gt;)&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;I decided to include the Cloud Run deployment fix so anyone with Datasette instances deployed to Cloud Run can update them to the new patched versions.&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/security"&gt;security&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/cloudrun"&gt;cloudrun&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/annotated-release-notes"&gt;annotated-release-notes&lt;/a&gt;&lt;/p&gt;



</summary><category term="security"/><category term="datasette"/><category term="cloudrun"/><category term="annotated-release-notes"/></entry><entry><title>Clickhouse on Cloud Run</title><link href="https://simonwillison.net/2021/Jul/29/clickhouse-on-cloud-run/#atom-tag" rel="alternate"/><published>2021-07-29T06:07:51+00:00</published><updated>2021-07-29T06:07:51+00:00</updated><id>https://simonwillison.net/2021/Jul/29/clickhouse-on-cloud-run/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://alexjreid.dev/posts/clickhouse-on-cloud-run/"&gt;Clickhouse on Cloud Run&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Alex Reid figured out how to run Clickhouse against read-only baked data on Cloud Run last year, and wrote up some comprehensive notes.

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/alexjreid/status/1420625467384737797"&gt;@alexjreid&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/cloudrun"&gt;cloudrun&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/baked-data"&gt;baked-data&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/clickhouse"&gt;clickhouse&lt;/a&gt;&lt;/p&gt;



</summary><category term="cloudrun"/><category term="baked-data"/><category term="clickhouse"/></entry><entry><title>Building a search engine for datasette.io</title><link href="https://simonwillison.net/2020/Dec/19/dogsheep-beta/#atom-tag" rel="alternate"/><published>2020-12-19T18:12:31+00:00</published><updated>2020-12-19T18:12:31+00:00</updated><id>https://simonwillison.net/2020/Dec/19/dogsheep-beta/#atom-tag</id><summary type="html">
    &lt;p&gt;This week I added &lt;a href="https://datasette.io/-/beta"&gt;a search engine&lt;/a&gt; to &lt;a href="https://datasette.io/"&gt;datasette.io&lt;/a&gt;, using the search indexing tool I've been building for &lt;a href="https://datasette.substack.com/p/dogsheep-personal-analytics-with"&gt;Dogsheep&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="A screenshot of dogsheep.io search results for ripgrep" src="https://static.simonwillison.net/static/2020/dogsheep-beta-ripgrep.jpg" style="max-width:100%;" /&gt;&lt;/p&gt;
&lt;h4&gt;Project search for Datasette&lt;/h4&gt;
&lt;p&gt;The Datasette project has a &lt;em&gt;lot&lt;/em&gt; of constituent parts. There's the project itself and its &lt;a href="https://docs.datasette.io/"&gt;documentation&lt;/a&gt; - 171 pages when exported to PDF and counting. Then there are the &lt;a href="https://datasette.io/plugins"&gt;48 plugins&lt;/a&gt;, &lt;a href="https://datasette.io/tools/sqlite-utils"&gt;sqlite-utils&lt;/a&gt; and &lt;a href="https://datasette.io/tools"&gt;21 more tools&lt;/a&gt; for creating SQLite databases, the &lt;a href="https://dogsheep.github.io/"&gt;Dogsheep&lt;/a&gt; collection and over three years of content I've written about the project &lt;a href="https://simonwillison.net/tags/datasette/"&gt;on my blog&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The new &lt;a href="https://datasette.io/-/beta"&gt;datasette.io search engine&lt;/a&gt; provides a faceted search interface to all of this material in one place. It currently searches across:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Every section of the latest documentation (415 total)&lt;/li&gt;
&lt;li&gt;48 plugin READMEs&lt;/li&gt;
&lt;li&gt;22 tool READMEs&lt;/li&gt;
&lt;li&gt;63 news items posted on the Datasette website&lt;/li&gt;
&lt;li&gt;212 items from my blog&lt;/li&gt;
&lt;li&gt;Release notes from 557 package releases&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I plan to extend it with more data sources in the future.&lt;/p&gt;
&lt;h4&gt;How it works: Dogsheep Beta&lt;/h4&gt;
&lt;p&gt;I'm reusing the search engine I originally built for my Dogsheep personal analytics project (see &lt;a href="https://simonwillison.net/2020/Nov/14/personal-data-warehouses/"&gt;Personal Data Warehouses: Reclaiming Your Data&lt;/a&gt;). I call that search engine &lt;a href="https://github.com/dogsheep/beta"&gt;Dogsheep Beta&lt;/a&gt;. The name is &lt;a href="https://datasette.substack.com/p/dogsheep-personal-analytics-with"&gt;a pun&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;SQLite has great &lt;a href="https://sqlite.org/fts5.html"&gt;full-text search&lt;/a&gt; built in, and I make extensive use of that in Datasette projects already. But out of the box it's not quite right for this kind of search engine that spans multiple different content types.&lt;/p&gt;
&lt;p&gt;The problem is relevance calculation. I wrote about this in &lt;a href="https://simonwillison.net/2019/Jan/7/exploring-search-relevance-algorithms-sqlite/"&gt;Exploring search relevance algorithms with SQLite&lt;/a&gt; - short version: query relevance is calculated using statistics against the whole corpus, so search terms that occur rarely in the overall corpus contribute a higher score than more common terms.&lt;/p&gt;
&lt;p&gt;This means that full-text ranking scores calculated against one table of data cannot be meaningfully compared to scores calculated independently against a separate table, as the corpus statistics used to calculate the rank will differ.&lt;/p&gt;
&lt;p&gt;To get usable scores, you need everything in a single table. That's what Dogsheep Beta does: it creates a new table, called &lt;code&gt;search_index&lt;/code&gt;, and copies searchable content from the other tables into that new table.&lt;/p&gt;
&lt;p&gt;This is analogous to how an external search index like Elasticsearch works: you store your data in the main database, then periodically update an index in Elasticsearch. It's the &lt;a href="https://2017.djangocon.us/talks/the-denormalized-query-engine-design-pattern/"&gt;denormalized query engine&lt;/a&gt; design pattern in action.&lt;/p&gt;
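&lt;p&gt;Here's a minimal sketch of that single-table pattern using Python's built-in &lt;code&gt;sqlite3&lt;/code&gt; module and FTS5 (simplified illustrative code, not the actual Dogsheep Beta implementation):&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# One FTS5 table holds every content type, so bm25() scores share
# a single set of corpus statistics and can be compared directly
conn.execute("CREATE VIRTUAL TABLE search_index USING fts5(type, title, search_1)")
conn.executemany(
    "INSERT INTO search_index VALUES (?, ?, ?)",
    [
        ("release", "datasette 0.52", "adds a datasette publish cloudrun option"),
        ("blog", "Building a search engine", "faceted search across plugins and docs"),
        ("plugin", "datasette-ripgrep", "regular expression search for source code"),
    ],
)
# bm25() returns lower-is-better scores, so order ascending
results = conn.execute(
    "SELECT type, title FROM search_index "
    "WHERE search_index MATCH ? ORDER BY bm25(search_index)",
    ("cloudrun",),
).fetchall()
print(results)  # [('release', 'datasette 0.52')]
```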
&lt;h4&gt;Configuring Dogsheep Beta&lt;/h4&gt;
&lt;p&gt;There are two components to Dogsheep Beta: a command-line tool for building a search index, and a Datasette plugin for providing an interface for running searches.&lt;/p&gt;
&lt;p&gt;Both of these run off a YAML configuration file, which defines the tables that should be indexed and also defines how those search results should be displayed.&lt;/p&gt;
&lt;p&gt;(Having one configuration file handle both indexing and display feels a little inelegant, but it's extremely productive for iterating on so I'm letting that slide.)&lt;/p&gt;
&lt;p&gt;Here's the full &lt;a href="https://github.com/simonw/datasette.io/blob/0e3e839ca1efb19cd84100291789521c0a1a3561/templates/dogsheep-beta.yml"&gt;Dogsheep configuration for datasette.io&lt;/a&gt;. An annotated extract:&lt;/p&gt;
&lt;div class="highlight highlight-source-yaml"&gt;&lt;pre&gt;&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; Index material in the content.db SQLite file&lt;/span&gt;
&lt;span class="pl-ent"&gt;content.db&lt;/span&gt;:
  &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; Define a search type called 'releases'&lt;/span&gt;
  &lt;span class="pl-ent"&gt;releases&lt;/span&gt;:
    &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; Populate that search type by executing this SQL&lt;/span&gt;
    &lt;span class="pl-ent"&gt;sql&lt;/span&gt;: &lt;span class="pl-s"&gt;|-&lt;/span&gt;
&lt;span class="pl-s"&gt;      select&lt;/span&gt;
&lt;span class="pl-s"&gt;        releases.id as key,&lt;/span&gt;
&lt;span class="pl-s"&gt;        repos.name || ' ' || releases.tag_name as title,&lt;/span&gt;
&lt;span class="pl-s"&gt;        releases.published_at as timestamp,&lt;/span&gt;
&lt;span class="pl-s"&gt;        releases.body as search_1,&lt;/span&gt;
&lt;span class="pl-s"&gt;        1 as is_public&lt;/span&gt;
&lt;span class="pl-s"&gt;      from&lt;/span&gt;
&lt;span class="pl-s"&gt;        releases&lt;/span&gt;
&lt;span class="pl-s"&gt;        join repos on releases.repo = repos.id&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;/span&gt;    &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; When displaying a search result, use this SQL to&lt;/span&gt;
    &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; return extra details about the item&lt;/span&gt;
    &lt;span class="pl-ent"&gt;display_sql&lt;/span&gt;: &lt;span class="pl-s"&gt;|-&lt;/span&gt;
&lt;span class="pl-s"&gt;      select&lt;/span&gt;
&lt;span class="pl-s"&gt;        -- highlight() is a custom SQL function&lt;/span&gt;
&lt;span class="pl-s"&gt;        highlight(render_markdown(releases.body), :q) as snippet,&lt;/span&gt;
&lt;span class="pl-s"&gt;        html_url&lt;/span&gt;
&lt;span class="pl-s"&gt;      from releases where id = :key&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;/span&gt;    &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; Jinja template fragment to display the result&lt;/span&gt;
    &lt;span class="pl-ent"&gt;display&lt;/span&gt;: &lt;span class="pl-s"&gt;|-&lt;/span&gt;
&lt;span class="pl-s"&gt;      &amp;lt;h3&amp;gt;Release: &amp;lt;a href="{{ display.html_url }}"&amp;gt;{{ title }}&amp;lt;/a&amp;gt;&amp;lt;/h3&amp;gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;      &amp;lt;p&amp;gt;{{ display.snippet|safe }}&amp;lt;/p&amp;gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;      &amp;lt;p&amp;gt;&amp;lt;small&amp;gt;Released {{ timestamp }}&amp;lt;/small&amp;gt;&amp;lt;/p&amp;gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The core pattern here is the &lt;code&gt;sql:&lt;/code&gt; key, which defines a SQL query that must return the following columns:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;key&lt;/code&gt; - a unique identifier for this search item&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;title&lt;/code&gt; - a title for this indexed document&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;timestamp&lt;/code&gt; - a timestamp for when it was created. May be null.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;search_1&lt;/code&gt; - text to be searched. I may add support for &lt;code&gt;search_2&lt;/code&gt; and &lt;code&gt;search_3&lt;/code&gt; later on to store text that will be treated with a lower relevance score.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;is_public&lt;/code&gt; - whether this should be considered "public" data. This is a holdover from Dogsheep Beta's origins in personal analytics; I don't actually need it for datasette.io.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;To create an index, run the following:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;dogsheep-beta index dogsheep-index.db dogsheep-config.yml
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;index&lt;/code&gt; command will loop through every configured search type in the YAML file, execute the SQL query and use it to populate a &lt;code&gt;search_index&lt;/code&gt; table in the &lt;code&gt;dogsheep-index.db&lt;/code&gt; SQLite database file.&lt;/p&gt;
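&lt;p&gt;The core of that loop can be sketched like this (a simplified stand-in for the real command, using a plain dictionary and in-memory databases in place of the YAML file and the real content):&lt;/p&gt;

```python
import sqlite3

# Each configured search type maps to a SQL query whose rows are
# copied into the shared search_index table
config = {
    "releases": "SELECT id AS key, tag_name AS title, body AS search_1 FROM releases",
    "news": "SELECT rowid AS key, headline AS title, headline AS search_1 FROM news",
}

content = sqlite3.connect(":memory:")
content.executescript("""
CREATE TABLE releases (id INTEGER PRIMARY KEY, tag_name TEXT, body TEXT);
CREATE TABLE news (headline TEXT);
INSERT INTO releases VALUES (1, '0.52', 'Adds --apt-get-install');
INSERT INTO news (headline) VALUES ('New search engine launched');
""")

index = sqlite3.connect(":memory:")
index.execute("CREATE VIRTUAL TABLE search_index USING fts5(type, key, title, search_1)")
for search_type, sql in config.items():
    for row in content.execute(sql):
        index.execute(
            "INSERT INTO search_index VALUES (?, ?, ?, ?)",
            (search_type, *map(str, row)),
        )
count = index.execute("SELECT count(*) FROM search_index").fetchone()[0]
print(count)  # 2
```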
&lt;p&gt;Here's &lt;a href="https://datasette.io/dogsheep-index/search_index"&gt;the search_index table&lt;/a&gt; for &lt;a href="https://datasette.io/"&gt;datasette.io&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;When you run a search, the plugin queries that table and gets back results sorted by relevance (or other sort criteria, if specified).&lt;/p&gt;
&lt;p&gt;To display the results, it loops through each one and uses the Jinja template fragment from the configuration file to turn it into HTML.&lt;/p&gt;
&lt;p&gt;If a &lt;code&gt;display_sql:&lt;/code&gt; query is defined, that query will be executed for each result to populate the &lt;code&gt;{{ display }}&lt;/code&gt; object made available to the template. &lt;a href="https://www.sqlite.org/np1queryprob.html"&gt;Many Small Queries Are Efficient In SQLite&lt;/a&gt;.&lt;/p&gt;
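&lt;p&gt;A rough sketch of that per-result lookup, with hypothetical table and column names for illustration:&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE releases (id INTEGER PRIMARY KEY, body TEXT, html_url TEXT)")
conn.execute(
    "INSERT INTO releases VALUES (1, 'Adds a new plugin hook', 'https://example.com/r/1')"
)

display_sql = "SELECT body AS snippet, html_url FROM releases WHERE id = :key"
results = [{"key": 1, "title": "datasette 0.52"}]
for result in results:
    # One small query per search result - cheap in SQLite
    row = conn.execute(display_sql, {"key": result["key"]}).fetchone()
    result["display"] = {"snippet": row[0], "html_url": row[1]}
print(results[0]["display"]["snippet"])  # Adds a new plugin hook
```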
&lt;h4&gt;Search term highlighting&lt;/h4&gt;
&lt;p&gt;I spent &lt;a href="https://github.com/simonw/datasette.io/issues/49"&gt;a bit of time&lt;/a&gt; thinking about search highlighting. SQLite has an implementation of highlighting built in - &lt;a href="https://sqlite.org/fts5.html#the_snippet_function"&gt;the snippet() function&lt;/a&gt; - but it's not designed to be HTML-aware so there's a risk it might mangle HTML by adding highlighting marks in the middle of a tag or attribute.&lt;/p&gt;
&lt;p&gt;I ended up borrowing a BSD-licensed &lt;a href="https://github.com/django-haystack/django-haystack/blob/v3.0/haystack/utils/highlighting.py"&gt;highlighting class&lt;/a&gt; from the &lt;a href="https://github.com/django-haystack/django-haystack"&gt;django-haystack&lt;/a&gt; project. It deals with HTML by stripping tags, which seems to be more-or-less what Google do for their own search results, so I figured that's good enough for me.&lt;/p&gt;
&lt;p&gt;I used this &lt;a href="https://github.com/simonw/datasette.io/blob/0e3e839ca1efb19cd84100291789521c0a1a3561/plugins/sql_functions.py"&gt;one-off site plugin&lt;/a&gt; to wrap the highlighting code in a custom SQLite function. This meant I could call it from the &lt;code&gt;display_sql:&lt;/code&gt; query in the Dogsheep Beta YAML configuration.&lt;/p&gt;
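&lt;p&gt;Registering a Python function as a custom SQLite function is a one-liner with &lt;code&gt;sqlite3&lt;/code&gt;. Here's a much-simplified stand-in for the real highlighter - it wraps matches in asterisks rather than HTML, purely for illustration:&lt;/p&gt;

```python
import re
import sqlite3

def highlight(text, query):
    # Much-simplified stand-in for the real highlighter: wrap
    # case-insensitive matches of the query in asterisks
    return re.sub("({})".format(re.escape(query)), r"**\1**", text or "", flags=re.I)

conn = sqlite3.connect(":memory:")
# Register the Python function so it can be called from SQL queries
conn.create_function("highlight", 2, highlight)
row = conn.execute(
    "SELECT highlight(?, ?)", ("Deploy with Cloud Run", "cloud run")
).fetchone()
print(row[0])  # Deploy with **Cloud Run**
```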
&lt;p&gt;A custom template tag would be more elegant, but I don't yet have a mechanism to expose custom template tags in the Dogsheep Beta rendering mechanism.&lt;/p&gt;
&lt;h4&gt;Build, index, deploy&lt;/h4&gt;
&lt;p&gt;The Datasette website implements the Baked Data pattern, where the content is compiled into SQLite database files and bundled with the application code itself as part of the deploy.&lt;/p&gt;
&lt;p&gt;Building the index is just another step of that process.&lt;/p&gt;
&lt;p&gt;Here's &lt;a href="https://github.com/simonw/datasette.io/blob/0e3e839ca1efb19cd84100291789521c0a1a3561/.github/workflows/deploy.yml"&gt;the deploy.yml&lt;/a&gt; GitHub workflow used by the site. It roughly does the following:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Download the current version of the &lt;a href="https://datasette.io/content"&gt;content.db&lt;/a&gt; database file. This is so it doesn't have to re-fetch release and README content that was previously stored there.&lt;/li&gt;
&lt;li&gt;Download the current version of &lt;a href="https://datasette.io/blog"&gt;blog.db&lt;/a&gt;, with entries from my blog. This means I don't have to fetch all entries, just the new ones.&lt;/li&gt;
&lt;li&gt;Run &lt;a href="https://github.com/simonw/datasette.io/blob/0e3e839ca1efb19cd84100291789521c0a1a3561/build_directory.py"&gt;build_directory.py&lt;/a&gt;, the script which fetches data for the plugins and tools pages.
&lt;ul&gt;
&lt;li&gt;This hits the GitHub GraphQL API to find new repositories tagged &lt;code&gt;datasette-io&lt;/code&gt; and &lt;code&gt;datasette-plugin&lt;/code&gt; and &lt;code&gt;datasette-tool&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;That GraphQL query also returns the most recent release. The script then checks to see if those releases have previously been fetched and, if not, uses &lt;a href="https://datasette.io/tools/github-to-sqlite"&gt;github-to-sqlite&lt;/a&gt; to fetch them.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Imports the data from &lt;a href="https://github.com/simonw/datasette.io/blob/0e3e839ca1efb19cd84100291789521c0a1a3561/news.yaml"&gt;news.yaml&lt;/a&gt; into a &lt;code&gt;news&lt;/code&gt; table using &lt;a href="https://datasette.io/tools/yaml-to-sqlite"&gt;yaml-to-sqlite&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Imports the latest PyPI download statistics for my packages from my &lt;a href="https://github.com/simonw/package-stats"&gt;simonw/package-stats&lt;/a&gt; repository, which implements &lt;a href="https://simonwillison.net/2020/Oct/9/git-scraping/"&gt;git scraping&lt;/a&gt; against the most excellent &lt;a href="https://pypistats.org/"&gt;pypistats.org&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Runs the &lt;code&gt;dogsheep-beta index&lt;/code&gt; command to build a &lt;code&gt;dogsheep-index.db&lt;/code&gt; search index.&lt;/li&gt;
&lt;li&gt;Runs some soundness checks, e.g. &lt;code&gt;datasette . --get "/plugins"&lt;/code&gt;, to verify that Datasette is likely to return an HTTP 200 for critical pages once published.&lt;/li&gt;
&lt;li&gt;Uses &lt;code&gt;datasette publish cloudrun&lt;/code&gt; to deploy the results to Google Cloud Run, which hosts the website.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I love building websites this way. You can have as much complexity as you like in the build script (my TIL website build script &lt;a href="https://simonwillison.net/2020/Sep/3/weeknotes-airtable-screenshots-dogsheep/#weeknotes-2020-09-03-social-media-cards-tils"&gt;generates screenshots using Puppeteer&lt;/a&gt;) but the end result is some simple database files running on inexpensive, immutable, scalable hosting.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/search"&gt;search&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sqlite"&gt;sqlite&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/dogsheep"&gt;dogsheep&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/weeknotes"&gt;weeknotes&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/cloudrun"&gt;cloudrun&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/baked-data"&gt;baked-data&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="projects"/><category term="search"/><category term="sqlite"/><category term="datasette"/><category term="dogsheep"/><category term="weeknotes"/><category term="cloudrun"/><category term="baked-data"/></entry><entry><title>Datasette 0.52</title><link href="https://simonwillison.net/2020/Nov/29/datasette-052/#atom-tag" rel="alternate"/><published>2020-11-29T00:56:15+00:00</published><updated>2020-11-29T00:56:15+00:00</updated><id>https://simonwillison.net/2020/Nov/29/datasette-052/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://docs.datasette.io/en/stable/changelog.html#v0-52"&gt;Datasette 0.52&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
A relatively small release—it has a new plugin hook (database_actions(), for adding links to a new database actions menu), renames the --config option to --setting and adds a new “datasette publish cloudrun --apt-get-install” option.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/cloudrun"&gt;cloudrun&lt;/a&gt;&lt;/p&gt;



</summary><category term="projects"/><category term="datasette"/><category term="cloudrun"/></entry><entry><title>datasette-ripgrep: deploy a regular expression search engine for your source code</title><link href="https://simonwillison.net/2020/Nov/28/datasette-ripgrep/#atom-tag" rel="alternate"/><published>2020-11-28T06:51:06+00:00</published><updated>2020-11-28T06:51:06+00:00</updated><id>https://simonwillison.net/2020/Nov/28/datasette-ripgrep/#atom-tag</id><summary type="html">
    &lt;p&gt;This week I built &lt;a href="https://github.com/simonw/datasette-ripgrep"&gt;datasette-ripgrep&lt;/a&gt; - a web application for running regular expression searches against source code, built on top of the amazing &lt;a href="https://github.com/BurntSushi/ripgrep"&gt;ripgrep&lt;/a&gt; command-line tool.&lt;/p&gt;
&lt;h4&gt;datasette-ripgrep demo&lt;/h4&gt;
&lt;p&gt;I've deployed a demo version of the application here:&lt;/p&gt;
&lt;p&gt;&lt;a href="https://ripgrep.datasette.io/-/ripgrep?pattern=pytest"&gt;ripgrep.datasette.io/-/ripgrep?pattern=pytest&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;The demo runs searches against the source code of every one of my GitHub repositories that start with &lt;code&gt;datasette&lt;/code&gt; - &lt;a href="https://github-to-sqlite.dogsheep.net/github/repos?name__startswith=datasette&amp;amp;owner__exact=9599"&gt;61 repos&lt;/a&gt; right now - so it should include all of my Datasette plugins plus the core Datasette repository itself.&lt;/p&gt;
&lt;p&gt;Since it's running on top of &lt;code&gt;ripgrep&lt;/code&gt;, it supports regular expressions. This is absurdly useful. Some examples:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Every usage of the &lt;code&gt;.plugin_config(&lt;/code&gt; method: &lt;a href="https://ripgrep.datasette.io/-/ripgrep?pattern=%5C.plugin_config%5C%28"&gt;plugin_config\(&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Everywhere I use &lt;code&gt;async with httpx.AsyncClient&lt;/code&gt; (usually in tests): &lt;a href="https://ripgrep.datasette.io/-/ripgrep?pattern=async+with.*AsyncClient"&gt;async with.*AsyncClient&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;All places where I use a Jinja &lt;code&gt;|&lt;/code&gt; filter inside a variable: &lt;a href="https://ripgrep.datasette.io/-/ripgrep?pattern=%5C%7B%5C%7B.*%5C%7C.*%5C%7D%5C%7D"&gt;\{\{.*\|.*\}\}&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I usually run ripgrep as &lt;code&gt;rg&lt;/code&gt; on the command-line, or use it within Visual Studio Code (&lt;a href="https://twitter.com/simonw/status/1331381448171929600"&gt;fun fact&lt;/a&gt;: the reason VS Code's "Find in Files" is so good is it's running ripgrep under the hood).&lt;/p&gt;
&lt;p&gt;So why have it as a web application? Because this means I can link to it, bookmark it and use it on my phone.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2020/datasette-ripgrep.png" alt="A screenshot of datasette-ripgrep in action" style="max-width: 100%" /&gt;&lt;/p&gt;
&lt;h4&gt;Why build this?&lt;/h4&gt;
&lt;p&gt;There are plenty of great existing code search tools out there already: I've heard great things about &lt;a href="https://github.com/livegrep/livegrep"&gt;livegrep&lt;/a&gt;, and a quick Google search shows a bunch of other options.&lt;/p&gt;
&lt;p&gt;Aside from being a fun project, &lt;code&gt;datasette-ripgrep&lt;/code&gt; has one key advantage: it gets to benefit from Datasette's publishing mechanism, which means it's really easy to deploy.&lt;/p&gt;
&lt;p&gt;That &lt;a href="https://ripgrep.datasette.io/"&gt;ripgrep.datasette.io&lt;/a&gt; demo is deployed by checking out the source code to be searched into a &lt;code&gt;all&lt;/code&gt; directory and then using the following command:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;datasette publish cloudrun \
    --metadata metadata.json \
    --static all:all \
    --install=datasette-ripgrep \
    --service datasette-ripgrep \
    --apt-get-install ripgrep
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;code&gt;all&lt;/code&gt; is a folder containing the source code to be searched. &lt;code&gt;metadata.json&lt;/code&gt; contains this:&lt;/p&gt;
&lt;div class="highlight highlight-source-json"&gt;&lt;pre&gt;{
    &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;plugins&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;: {
        &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;datasette-ripgrep&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;: {
            &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;path&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;/app/all&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
            &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;time_limit&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;: &lt;span class="pl-c1"&gt;3.0&lt;/span&gt;
        }
    }
}&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;That's all there is to it! The result is a deployed code search engine, running on Google Cloud Run.&lt;/p&gt;
&lt;p&gt;(If you want to try this yourself you'll need to be using the just-released Datasette 0.52.)&lt;/p&gt;
&lt;p&gt;The &lt;a href="https://github.com/simonw/datasette-ripgrep/blob/main/.github/workflows/deploy_demo.yml"&gt;GitHub Action workflow&lt;/a&gt; that deploys the demo also uses my &lt;a href="https://github.com/dogsheep/github-to-sqlite"&gt;github-to-sqlite&lt;/a&gt; tool to fetch my repos and then shallow-clones the ones that begin with &lt;code&gt;datasette&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;If you have &lt;a href="https://docs.datasette.io/en/stable/publish.html#publishing-to-google-cloud-run"&gt;your own Google Cloud Run credentials&lt;/a&gt;, you can run your own copy of that workflow against your own repositories.&lt;/p&gt;
&lt;h4&gt;A different kind of Datasette plugin&lt;/h4&gt;
&lt;p&gt;Datasette is a tool for publishing SQLite databases, so most Datasette plugins integrate with SQLite in some way.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;datasette-ripgrep&lt;/code&gt; is different: it makes no use of SQLite at all, but instead takes advantage of Datasette's URL routing, &lt;code&gt;datasette publish&lt;/code&gt; deployments and permissions system.&lt;/p&gt;
&lt;p&gt;The plugin implementation is currently &lt;a href="https://github.com/simonw/datasette-ripgrep/blob/07b9ced2935b0b6080c1c42fcaf6ab9e8003d186/datasette_ripgrep/__init__.py"&gt;134 lines of code&lt;/a&gt;, excluding tests and templates.&lt;/p&gt;
&lt;p&gt;While the plugin doesn't use SQLite, it does share a common philosophy with Datasette: the plugin bundles the source code that it is going to search as part of the deployed application, in a similar way to how Datasette usually bundles one or more SQLite database files.&lt;/p&gt;
&lt;p&gt;As such, it's extremely inexpensive to run and can be deployed to serverless hosting. If you need to scale it, you can run more copies.&lt;/p&gt;
&lt;p&gt;This does mean that the application needs to be re-deployed to pick up changes to the searchable code. I'll probably set my demo to do this on a daily basis.&lt;/p&gt;
&lt;h4&gt;Controlling processes from asyncio&lt;/h4&gt;
&lt;p&gt;The trickiest part of the implementation was figuring out how to use Python's &lt;code&gt;asyncio.create_subprocess_exec()&lt;/code&gt; method to safely run the &lt;code&gt;rg&lt;/code&gt; process in response to incoming requests.&lt;/p&gt;
&lt;p&gt;I don't want expensive searches to tie up the server, so I implemented two limits here. The first is a time limit: by default, searches have one second to run, after which the &lt;code&gt;rg&lt;/code&gt; process will be terminated and only the results received so far will be returned. This is achieved using the &lt;a href="https://docs.python.org/3/library/asyncio-task.html#asyncio.wait_for"&gt;asyncio.wait_for()&lt;/a&gt; function.&lt;/p&gt;
&lt;p&gt;I also implemented a limit on the number of matching lines that can be returned, defaulting to 2,000. Any more than that and the process is terminated early.&lt;/p&gt;
&lt;p&gt;Both of these limits can be customized using plugin settings (documented in &lt;a href="https://github.com/simonw/datasette-ripgrep/blob/main/README.md"&gt;the README&lt;/a&gt;). You can see how they are implemented in the &lt;a href="https://github.com/simonw/datasette-ripgrep/blob/0.2/datasette_ripgrep/__init__.py#L9-L55"&gt;async def run_ripgrep(pattern, path, time_limit=1.0, max_lines=2000)&lt;/a&gt; function.&lt;/p&gt;
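&lt;p&gt;The general pattern looks something like this - a simplified sketch of the same idea, not the actual plugin code, shown here running &lt;code&gt;echo&lt;/code&gt; instead of &lt;code&gt;rg&lt;/code&gt;:&lt;/p&gt;

```python
import asyncio

async def run_with_limit(args, time_limit=1.0, max_lines=2000):
    # Start the subprocess and collect output lines, stopping early if
    # either the time limit or the line limit is hit
    proc = await asyncio.create_subprocess_exec(*args, stdout=asyncio.subprocess.PIPE)
    lines = []

    async def read():
        async for line in proc.stdout:
            lines.append(line.decode())
            if len(lines) >= max_lines:
                break

    try:
        await asyncio.wait_for(read(), timeout=time_limit)
    except asyncio.TimeoutError:
        pass  # keep whatever output arrived before the deadline
    if proc.returncode is None:
        try:
            proc.terminate()
        except ProcessLookupError:
            pass  # process already exited
    await proc.wait()
    return lines

lines = asyncio.run(run_with_limit(["echo", "hello"]))
print(lines)  # ['hello\n']
```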
&lt;h4&gt;Highlighted linkable line numbers&lt;/h4&gt;
&lt;p&gt;The other fun implementation detail is the way the source code listings are displayed. I'm using CSS to display the line numbers in a way that makes them visible without them breaking copy-and-paste (inspired by &lt;a href="https://www.sylvaindurand.org/using-css-to-add-line-numbering/"&gt;this article by Sylvain Durand&lt;/a&gt;).&lt;/p&gt;
&lt;div class="highlight highlight-source-css"&gt;&lt;pre&gt;&lt;span class="pl-ent"&gt;code&lt;/span&gt;:&lt;span class="pl-c1"&gt;before&lt;/span&gt; {
    &lt;span class="pl-c1"&gt;content&lt;/span&gt;: &lt;span class="pl-en"&gt;attr&lt;/span&gt;(data-line);
    &lt;span class="pl-c1"&gt;display&lt;/span&gt;: inline-block;
    &lt;span class="pl-c1"&gt;width&lt;/span&gt;: &lt;span class="pl-c1"&gt;3.5&lt;span class="pl-smi"&gt;ch&lt;/span&gt;&lt;/span&gt;;
    &lt;span class="pl-c1"&gt;-webkit-user-select&lt;/span&gt;: none;
    &lt;span class="pl-c1"&gt;color&lt;/span&gt;: &lt;span class="pl-pds"&gt;&lt;span class="pl-kos"&gt;#&lt;/span&gt;666&lt;/span&gt;;
}&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The HTML looks like this:&lt;/p&gt;
&lt;div class="highlight highlight-text-html-basic"&gt;&lt;pre&gt;&lt;span class="pl-kos"&gt;&amp;lt;&lt;/span&gt;&lt;span class="pl-ent"&gt;pre&lt;/span&gt;&lt;span class="pl-kos"&gt;&amp;gt;&lt;/span&gt;&lt;span class="pl-kos"&gt;&amp;lt;&lt;/span&gt;&lt;span class="pl-ent"&gt;code&lt;/span&gt; &lt;span class="pl-c1"&gt;id&lt;/span&gt;="&lt;span class="pl-s"&gt;L1&lt;/span&gt;" &lt;span class="pl-c1"&gt;data-line&lt;/span&gt;="&lt;span class="pl-s"&gt;1&lt;/span&gt;"&lt;span class="pl-kos"&gt;&amp;gt;&lt;/span&gt;from setuptools import setup&lt;span class="pl-kos"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="pl-ent"&gt;code&lt;/span&gt;&lt;span class="pl-kos"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="pl-kos"&gt;&amp;lt;&lt;/span&gt;&lt;span class="pl-ent"&gt;code&lt;/span&gt; &lt;span class="pl-c1"&gt;id&lt;/span&gt;="&lt;span class="pl-s"&gt;L2&lt;/span&gt;" &lt;span class="pl-c1"&gt;data-line&lt;/span&gt;="&lt;span class="pl-s"&gt;2&lt;/span&gt;"&lt;span class="pl-kos"&gt;&amp;gt;&lt;/span&gt;import os&lt;span class="pl-kos"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="pl-ent"&gt;code&lt;/span&gt;&lt;span class="pl-kos"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="pl-kos"&gt;&amp;lt;&lt;/span&gt;&lt;span class="pl-ent"&gt;code&lt;/span&gt; &lt;span class="pl-c1"&gt;id&lt;/span&gt;="&lt;span class="pl-s"&gt;L3&lt;/span&gt;" &lt;span class="pl-c1"&gt;data-line&lt;/span&gt;="&lt;span class="pl-s"&gt;3&lt;/span&gt;"&lt;span class="pl-kos"&gt;&amp;gt;&lt;/span&gt;&amp;amp;nbsp;&lt;span class="pl-kos"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="pl-ent"&gt;code&lt;/span&gt;&lt;span class="pl-kos"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="pl-kos"&gt;&amp;lt;&lt;/span&gt;&lt;span class="pl-ent"&gt;code&lt;/span&gt; &lt;span class="pl-c1"&gt;id&lt;/span&gt;="&lt;span class="pl-s"&gt;L4&lt;/span&gt;" &lt;span class="pl-c1"&gt;data-line&lt;/span&gt;="&lt;span class="pl-s"&gt;4&lt;/span&gt;"&lt;span class="pl-kos"&gt;&amp;gt;&lt;/span&gt;VERSION = &amp;amp;#34;0.1&amp;amp;#34;&lt;span class="pl-kos"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="pl-ent"&gt;code&lt;/span&gt;&lt;span class="pl-kos"&gt;&amp;gt;&lt;/span&gt;
...&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;I wanted to imitate GitHub's handling of line links, where adding &lt;code&gt;#L23&lt;/code&gt; to the URL both jumps to that line and causes the line to be highlighted. Here's &lt;a href="https://ripgrep.datasette.io/-/ripgrep/view/datasette-allow-permissions-debug/setup.py#L23"&gt;a demo of that&lt;/a&gt; - I use the following JavaScript to update the contents of a &lt;code&gt;&amp;lt;style id="highlightStyle"&amp;gt;&amp;lt;/style&amp;gt;&lt;/code&gt; element in the document head any time the URL fragment changes:&lt;/p&gt;
&lt;div class="highlight highlight-source-js"&gt;&lt;pre&gt;&lt;span class="pl-c1"&gt;&amp;lt;&lt;/span&gt;&lt;span class="pl-ent"&gt;script&lt;/span&gt;&lt;span class="pl-c1"&gt;&amp;gt;&lt;/span&gt;
var highlightStyle = document.getElementById('highlightStyle');
function highlightLineFromFragment() &lt;span class="pl-kos"&gt;{&lt;/span&gt;
    &lt;span class="pl-en"&gt;if&lt;/span&gt; &lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-pds"&gt;/&lt;span class="pl-cce"&gt;^&lt;/span&gt;#L&lt;span class="pl-cce"&gt;\d&lt;/span&gt;&lt;span class="pl-c1"&gt;+&lt;/span&gt;&lt;span class="pl-cce"&gt;$&lt;/span&gt;/&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;exec&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s1"&gt;location&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;hash&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;&lt;/span&gt; &lt;span class="pl-kos"&gt;{&lt;/span&gt;
        &lt;span class="pl-s1"&gt;highlightStyle&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;innerText&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s"&gt;`&lt;span class="pl-s1"&gt;&lt;span class="pl-kos"&gt;${&lt;/span&gt;&lt;span class="pl-s1"&gt;location&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;hash&lt;/span&gt;&lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;/span&gt; { background-color: yellow; }`&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
    &lt;span class="pl-kos"&gt;}&lt;/span&gt;
&lt;span class="pl-kos"&gt;}&lt;/span&gt;
&lt;span class="pl-en"&gt;highlightLineFromFragment&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
&lt;span class="pl-smi"&gt;window&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;addEventListener&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s"&gt;"hashchange"&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-s1"&gt;highlightLineFromFragment&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
&lt;span class="pl-c1"&gt;&amp;lt;&lt;/span&gt;/&lt;span class="pl-ent"&gt;script&lt;/span&gt;&lt;span class="pl-c1"&gt;&amp;gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;It's the simplest way I could think of to achieve this effect.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Update 28th November 2020&lt;/strong&gt;: Louis Lévêque on Twitter suggested using the CSS &lt;a href="https://developer.mozilla.org/en-US/docs/Web/CSS/:target"&gt;:target selector&lt;/a&gt; instead, which is indeed MUCH simpler - I deleted the above JavaScript and replaced it with this CSS:&lt;/p&gt;
&lt;div class="highlight highlight-source-css"&gt;&lt;pre&gt;:&lt;span class="pl-c1"&gt;target&lt;/span&gt; {
    &lt;span class="pl-c1"&gt;background-color&lt;/span&gt;: &lt;span class="pl-pds"&gt;&lt;span class="pl-kos"&gt;#&lt;/span&gt;FFFF99&lt;/span&gt;;
}&lt;/pre&gt;&lt;/div&gt;
&lt;h4&gt;Next steps for this project&lt;/h4&gt;
&lt;p&gt;I'm pleased to have got &lt;a href="https://github.com/simonw/datasette-ripgrep"&gt;datasette-ripgrep&lt;/a&gt; to a workable state, and I'm looking forward to using it to answer questions about the growing Datasette ecosystem. I don't know how much more time I'll invest in this - if it proves useful then I may well expand it.&lt;/p&gt;
&lt;p&gt;I do think there's something really interesting about being able to spin up this kind of code search engine on demand using &lt;code&gt;datasette publish&lt;/code&gt;. It feels like a very useful trick to have access to.&lt;/p&gt;
&lt;h4&gt;Better URLs for my TILs&lt;/h4&gt;
&lt;p&gt;My other project this week was an upgrade to &lt;a href="https://til.simonwillison.net/"&gt;til.simonwillison.net&lt;/a&gt;: I finally spent the time to &lt;a href="https://github.com/simonw/til/issues/34"&gt;design nicer URLs&lt;/a&gt; for the site.&lt;/p&gt;
&lt;p&gt;Before:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;til.simonwillison.net/til/til/javascript_manipulating-query-params.md&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;After:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;til.simonwillison.net/javascript/manipulating-query-params&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;The implementation for this takes advantage of a feature I sneaked into Datasette 0.49: &lt;a href="https://simonwillison.net/2020/Sep/15/datasette-0-49#path-parameters-custom-page-templates"&gt;Path parameters for custom page templates&lt;/a&gt;. I can create a template file called &lt;code&gt;pages/{topic}/{slug}.html&lt;/code&gt; and Datasette will use that template to handle 404 errors that match that pattern.&lt;/p&gt;
&lt;p&gt;Here's &lt;a href="https://github.com/simonw/til/blob/main/templates/pages/%7Btopic%7D/%7Bslug%7D.html"&gt;the new pages/{topic}/{slug}.html&lt;/a&gt; template for my TIL site. It uses the &lt;code&gt;sql()&lt;/code&gt; template function from the &lt;a href="https://github.com/simonw/datasette-template-sql"&gt;datasette-template-sql&lt;/a&gt; plugin to retrieve and render the matching TIL, or raises a 404 if no TIL can be found.&lt;/p&gt;
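&lt;p&gt;The overall shape of that template is worth showing. What follows is a simplified sketch of the pattern rather than the file verbatim - the query and column names are invented for illustration - using the &lt;code&gt;sql()&lt;/code&gt; function from datasette-template-sql plus Datasette's &lt;code&gt;raise_404()&lt;/code&gt; template function:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;{% set tils = sql(
    "select title, html from til where topic = ? and slug = ?",
    [topic, slug]
) %}
{% if not tils %}{{ raise_404("TIL not found") }}{% endif %}
&amp;lt;h1&amp;gt;{{ tils[0]["title"] }}&amp;lt;/h1&amp;gt;
{{ tils[0]["html"]|safe }}&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;topic&lt;/code&gt; and &lt;code&gt;slug&lt;/code&gt; variables are populated automatically from the matching parts of the URL path.&lt;/p&gt;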
&lt;p&gt;I also needed to set up redirects from the old pages to the new ones. I wrote a &lt;a href="https://til.simonwillison.net/til/til/datasette_redirects-for-datasette.md"&gt;TIL on redirects for Datasette&lt;/a&gt; explaining how I did that.&lt;/p&gt;
&lt;h4&gt;TIL this week&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/til/til/datasette_redirects-for-datasette.md"&gt;Redirects for Datasette&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;Releases this week&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/simonw/datasette-ripgrep/releases/tag/0.2"&gt;datasette-ripgrep 0.2&lt;/a&gt; - 2020-11-27&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/simonw/datasette-ripgrep/releases/tag/0.1"&gt;datasette-ripgrep 0.1&lt;/a&gt; - 2020-11-26&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/simonw/datasette-atom/releases/tag/0.8.1"&gt;datasette-atom 0.8.1&lt;/a&gt; - 2020-11-25&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/simonw/datasette-ripgrep/releases/tag/0.1a1"&gt;datasette-ripgrep 0.1a1&lt;/a&gt; - 2020-11-25&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/simonw/datasette-ripgrep/releases/tag/0.1a0"&gt;datasette-ripgrep 0.1a0&lt;/a&gt; - 2020-11-25&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/simonw/datasette-graphql/releases/tag/1.2.1"&gt;datasette-graphql 1.2.1&lt;/a&gt; - 2020-11-24&lt;/li&gt;
&lt;/ul&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/async"&gt;async&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/css"&gt;css&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/weeknotes"&gt;weeknotes&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/cloudrun"&gt;cloudrun&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ripgrep"&gt;ripgrep&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/baked-data"&gt;baked-data&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="async"/><category term="css"/><category term="projects"/><category term="python"/><category term="datasette"/><category term="weeknotes"/><category term="cloudrun"/><category term="ripgrep"/><category term="baked-data"/></entry><entry><title>The unofficial Google Cloud Run FAQ</title><link href="https://simonwillison.net/2020/Jul/22/unofficial-google-cloud-run-faq/#atom-tag" rel="alternate"/><published>2020-07-22T17:20:20+00:00</published><updated>2020-07-22T17:20:20+00:00</updated><id>https://simonwillison.net/2020/Jul/22/unofficial-google-cloud-run-faq/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/ahmetb/cloud-run-faq"&gt;The unofficial Google Cloud Run FAQ&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
This is really useful: a no-fluff, content-rich explanation of Google Cloud Run hosted as a GitHub repo that actively accepts pull requests from the community. It’s maintained by Ahmet Alp Balkan, a Cloud Run engineer who states “Googlers: If you find this repo useful, you should recognize the work internally, as I actively fight for alternative forms of content like this”. One of the hardest parts of working with AWS and GCP is digging through the marketing materials to figure out what the product actually does, so the more alternative forms of documentation like this the better.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/documentation"&gt;documentation&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/google"&gt;google&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/cloudrun"&gt;cloudrun&lt;/a&gt;&lt;/p&gt;



</summary><category term="documentation"/><category term="google"/><category term="cloudrun"/></entry><entry><title>Advice on specifying more granular permissions with Google Cloud IAM</title><link href="https://simonwillison.net/2020/May/28/advice-specifying-more-granular-permissions-google-cloud-iam/#atom-tag" rel="alternate"/><published>2020-05-28T22:44:24+00:00</published><updated>2020-05-28T22:44:24+00:00</updated><id>https://simonwillison.net/2020/May/28/advice-specifying-more-granular-permissions-google-cloud-iam/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://dev.to/googlecloud/advice-on-specifying-more-granular-permissions-with-google-cloud-iam-4b70"&gt;Advice on specifying more granular permissions with Google Cloud IAM&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
My single biggest frustration working with both Google Cloud and AWS is permissions: more specifically, figuring out what the smallest set of permissions are that I need to assign in order to achieve different goals. Katie McLaughlin’s new series aims to address exactly that problem. I learned a ton from this that I’ve previously missed, and there’s plenty of actionable advice on tooling that can be used to help figure this stuff out.

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/glasnt/status/1266123072580358144"&gt;Katie McLaughlin&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/permissions"&gt;permissions&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/cloudrun"&gt;cloudrun&lt;/a&gt;&lt;/p&gt;



</summary><category term="permissions"/><category term="cloudrun"/></entry><entry><title>html-to-svg</title><link href="https://simonwillison.net/2020/May/7/html-svg/#atom-tag" rel="alternate"/><published>2020-05-07T06:01:44+00:00</published><updated>2020-05-07T06:01:44+00:00</updated><id>https://simonwillison.net/2020/May/7/html-svg/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/as-a-service/html-to-svg"&gt;html-to-svg&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
This is absolutely ingenious: 50 lines of JavaScript which uses Puppeteer to get headless Chrome to grab a PDF screenshot of a page, then shells out to Inkscape to convert the PDF to SVG. Wraps the whole thing up in a Docker container and ships it to Cloud Run as a web service you can call by passing it a URL.

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/steren/status/1258273345843290118"&gt;@steren&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/chrome"&gt;chrome&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/svg"&gt;svg&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/cloudrun"&gt;cloudrun&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/puppeteer"&gt;puppeteer&lt;/a&gt;&lt;/p&gt;



</summary><category term="chrome"/><category term="svg"/><category term="cloudrun"/><category term="puppeteer"/></entry><entry><title>Tracking FARA by deploying a data API using GitHub Actions and Cloud Run</title><link href="https://simonwillison.net/2020/Jan/21/github-actions-cloud-run/#atom-tag" rel="alternate"/><published>2020-01-21T07:51:11+00:00</published><updated>2020-01-21T07:51:11+00:00</updated><id>https://simonwillison.net/2020/Jan/21/github-actions-cloud-run/#atom-tag</id><summary type="html">
    &lt;p&gt;I'm using the combination of GitHub Actions and Google Cloud Run to retrieve data from the U.S. Department of Justice FARA website and deploy it as a queryable API using Datasette.&lt;/p&gt;

&lt;h3&gt;FARA background&lt;/h3&gt;

&lt;p&gt;The &lt;a href="https://www.justice.gov/nsd-fara"&gt;Foreign Agents Registration Act (FARA)&lt;/a&gt; is a law that requires "certain agents of foreign principals who are engaged in political activities or other activities specified under the statute to make periodic public disclosure of their relationship with the foreign principal, as well as activities, receipts and disbursements in support of those activities".&lt;/p&gt;

&lt;p&gt;The law was introduced in 1938 in response to the large number of German propaganda agents that were operating in the U.S. prior to the war.&lt;/p&gt;

&lt;p&gt;Basically, if you are in the United States as a lobbyist for a foreign government you need to register under FARA. It was used in 23 criminal cases during World War II, but hasn't had much use since it was amended in 1966. Although... if you consult the &lt;a href="https://www.justice.gov/nsd-fara/recent-cases"&gt;list of recent cases&lt;/a&gt; you'll see some very interesting recent activity involving Russia and Ukraine.&lt;/p&gt;

&lt;p&gt;It's also for spies! Quoting &lt;a href="https://www.justice.gov/nsd-fara/general-fara-frequently-asked-questions"&gt;the FARA FAQ&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;&lt;p&gt;Finally, 50 U.S.C. § 851, requires registration of persons who have knowledge of or have received instruction or assignment in espionage, counterespionage or sabotage service or tactics of a foreign country or political party.&lt;/p&gt;&lt;/blockquote&gt;

&lt;p&gt;I imagine most spies operate in violation of this particular law and don't take steps to register themselves.&lt;/p&gt;

&lt;p&gt;It's all still pretty fascinating though, in part because it gets updated. A lot. Almost every business day in fact.&lt;/p&gt;

&lt;h3&gt;Tracking FARA history&lt;/h3&gt;

&lt;p&gt;I know this because seven months ago I set up a scraper for it. Every twelve hours I have code which downloads the &lt;a href="https://efile.fara.gov/ords/f?p=API:BULKDATA"&gt;four bulk CSVs&lt;/a&gt; published by the Justice department and saves them to &lt;a href="https://github.com/simonw/fara-history"&gt;a git repository&lt;/a&gt;. It's the same trick I've been using &lt;a href="https://simonwillison.net/2019/Mar/13/tree-history/"&gt;to track San Francisco's database of trees&lt;/a&gt; and &lt;a href="https://simonwillison.net/2019/Oct/10/pge-outages/"&gt;PG&amp;amp;E's outage map&lt;/a&gt;.&lt;/p&gt;
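&lt;p&gt;The trick itself is tiny: fetch the file, then only write it (and commit) when the contents have actually changed, so the git history becomes a change log. A minimal sketch of that idea in Python - the function name and the use of a content hash are my own illustration, not the actual scraper code:&lt;/p&gt;

```python
import hashlib
from pathlib import Path


def update_if_changed(path, new_content):
    """Write new_content to path only when it differs from what is
    already on disk. Returns True if a change was written, so the
    caller knows whether a git commit is needed."""
    target = Path(path)
    new_digest = hashlib.sha256(new_content.encode("utf-8")).hexdigest()
    if target.exists():
        old_digest = hashlib.sha256(target.read_bytes()).hexdigest()
        if old_digest == new_digest:
            return False  # identical content, nothing to commit
    target.write_text(new_content)
    return True
```

&lt;p&gt;In the real repository the files are the sorted bulk CSVs, and the commit itself is performed with ordinary git commands inside the workflow.&lt;/p&gt;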

&lt;p&gt;I've been running the scraper using Circle CI, but this weekend I decided to switch it over to &lt;a href="https://github.com/features/actions"&gt;GitHub Actions&lt;/a&gt; to get a better idea for how they work.&lt;/p&gt;

&lt;h3&gt;Deploying it as an API&lt;/h3&gt;

&lt;p&gt;I also wanted to upgrade my script to also deploy a fresh &lt;a href="https://datasette.readthedocs.io/"&gt;Datasette&lt;/a&gt; instance of the data using &lt;a href="https://cloud.google.com/run/"&gt;Google Cloud Run&lt;/a&gt;. I wrote &lt;a href="https://github.com/simonw/fara-datasette"&gt;a script&lt;/a&gt; to do this on a manual basis last year, but I never combined it with the daily scraper. Combining the two means I can offer a Datasette-powered API directly against the latest data.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://fara.datasettes.com/"&gt;https://fara.datasettes.com&lt;/a&gt; is that API - it now updates twice a day, assuming there are some changes to the underlying data.&lt;/p&gt;

&lt;h3&gt;Putting it all together&lt;/h3&gt;

&lt;p&gt;The final GitHub action workflow can be &lt;a href="https://github.com/simonw/fara-history/blob/7e33f2fc4619247e77d9b3b725ace6584228b601/.github/workflows/scheduled.yml"&gt;seen here&lt;/a&gt;. I'm going to present an annotated version here.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;on:
  repository_dispatch:
  schedule:
    - cron:  '0 0,12 * * *'&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This sets when the workflow should be triggered. I'm running it twice a day - at midnight and noon UTC (the 0,12 cron syntax).&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;repository_dispatch&lt;/code&gt; key means I can also &lt;a href="https://help.github.com/en/actions/automating-your-workflow-with-github-actions/events-that-trigger-workflows#external-events-repository_dispatch"&gt;trigger it manually&lt;/a&gt; by running the following &lt;code&gt;curl&lt;/code&gt; command - useful for testing:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;curl -XPOST https://api.github.com/repos/simonw/fara-history/dispatches \
    -H 'Authorization: token MY_PERSONAL_TOKEN_HERE' \
    -d '{"event_type": "trigger_action"}' \
    -H 'Accept: application/vnd.github.everest-preview+json'&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Next comes the job itself, which I called &lt;code&gt;scheduled&lt;/code&gt; and set to run on the latest Ubuntu:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;jobs:
  scheduled:
    runs-on: ubuntu-latest
    steps:&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Next comes the steps. Each step is run in turn, in an isolated process (presumably a container) but with access to the current working directory.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;- uses: actions/checkout@v2
  name: Check out repo
- name: Set up Python
  uses: actions/setup-python@v1
  with:
    python-version: 3.8&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The first two steps check out the &lt;code&gt;fara-history&lt;/code&gt; repository and install Python 3.8.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;- uses: actions/cache@v1
  name: Configure pip caching
  with:
    path: ~/.cache/pip
    key: ${{ runner.os }}-pip-${{ hashFiles('**/requirements.txt') }}
    restore-keys: |
      ${{ runner.os }}-pip-&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This step &lt;em&gt;should&lt;/em&gt; set up a cache so that &lt;code&gt;pip&lt;/code&gt; doesn't have to download fresh dependencies on every run. Unfortunately it doesn't seem to actually work - it only works for &lt;code&gt;push&lt;/code&gt; and &lt;code&gt;pull_request&lt;/code&gt; events, but my workflow is triggered by &lt;code&gt;schedule&lt;/code&gt; and &lt;code&gt;repository_dispatch&lt;/code&gt;. There's &lt;a href="https://github.com/actions/cache/issues/63"&gt;an open issue about this&lt;/a&gt;.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;- name: Install Python dependencies
  run: |
    python -m pip install --upgrade pip
    pip install -r requirements.txt&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This step installs my dependencies &lt;a href="https://github.com/simonw/fara-history/blob/7e33f2fc4619247e77d9b3b725ace6584228b601/requirements.txt"&gt;from requirements.txt&lt;/a&gt;.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;- name: Fetch, update and commit FARA data
  run: . update_and_commit_all.sh
- name: Build fara.db database
  run: python build_database.py&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Now we're getting to the fun stuff. My &lt;a href="https://github.com/simonw/fara-history/blob/7e33f2fc4619247e77d9b3b725ace6584228b601/update_and_commit_all.sh"&gt;update_and_commit_all.sh&lt;/a&gt; script downloads the four zip files &lt;a href="https://efile.fara.gov/ords/f?p=API:BULKDATA"&gt;from the FARA.gov site&lt;/a&gt;, unzips them, sorts them, diffs them against the previously stored files and commits the new copy to GitHub if they have changed. See &lt;a href="https://simonwillison.net/2019/Mar/13/tree-history/#csvdiff_18"&gt;my explanation of csv-diff&lt;/a&gt; for more on this - though sadly only one of the files has a reliable row identifier so I can't generate great commit messages for most of them.&lt;/p&gt;

&lt;p&gt;My &lt;a href="https://github.com/simonw/fara-history/blob/7e33f2fc4619247e77d9b3b725ace6584228b601/build_database.py"&gt;build_database.py&lt;/a&gt; script uses &lt;a href="https://sqlite-utils.readthedocs.io/"&gt;sqlite-utils&lt;/a&gt; to convert the CSV files into a SQLite database.&lt;/p&gt;
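&lt;p&gt;The shape of that script is easy to sketch. Here's a hedged, dependency-free version of the same idea using only the Python standard library instead of sqlite-utils (the function and table names are my own, for illustration):&lt;/p&gt;

```python
import csv
import sqlite3


def load_csv_into_sqlite(csv_path, db_path, table):
    """Create (or replace) a SQLite table from a CSV file, with one
    TEXT column per CSV header, then bulk-insert every row."""
    with open(csv_path, newline="") as f:
        reader = csv.reader(f)
        headers = next(reader)
        rows = list(reader)
    columns = ", ".join('"%s" TEXT' % h for h in headers)
    placeholders = ", ".join("?" for _ in headers)
    conn = sqlite3.connect(db_path)
    conn.execute('DROP TABLE IF EXISTS "%s"' % table)
    conn.execute('CREATE TABLE "%s" (%s)' % (table, columns))
    conn.executemany(
        'INSERT INTO "%s" VALUES (%s)' % (table, placeholders), rows
    )
    conn.commit()
    conn.close()
```

&lt;p&gt;sqlite-utils does the same job with column type detection and a much nicer API, which is why the real &lt;code&gt;build_database.py&lt;/code&gt; uses it.&lt;/p&gt;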

&lt;p&gt;Now that we've got a SQLite database, we can &lt;a href="https://datasette.readthedocs.io/en/stable/publish.html#publishing-to-google-cloud-run"&gt;deploy it to Google Cloud Run&lt;/a&gt; using Datasette.&lt;/p&gt;

&lt;p&gt;But should we run a deploy at all? If the database hasn't changed, there's no point in deploying it. How can we tell if the database file has changed from the last one that was published?&lt;/p&gt;

&lt;p&gt;Datasette has a mechanism for deriving a content hash of a database, part of &lt;a href="https://datasette.readthedocs.io/en/stable/performance.html"&gt;a performance optimization&lt;/a&gt; which is no longer turned on by default and may be removed in the future.&lt;/p&gt;

&lt;p&gt;You can generate JSON that includes that hash using the &lt;code&gt;datasette inspect&lt;/code&gt; command. The &lt;code&gt;jq&lt;/code&gt; tool can then be used to extract out just the hash:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ datasette inspect fara.db | jq '.fara.hash' -r
fbc9cbaca6de1e232fc14494faa06cc8d4cb9f379d0d568e4711e9a218800906&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The &lt;code&gt;-r&lt;/code&gt; option to &lt;code&gt;jq&lt;/code&gt; causes it to return just the raw string, without quote marks.&lt;/p&gt;

&lt;p&gt;Datasette's &lt;code&gt;/-/databases.json&lt;/code&gt; introspection URL reveals the hashes of the currently deployed database. Here's how to pull the currently deployed hash:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ curl -s https://fara.datasettes.com/-/databases.json | jq '.[0].hash' -r
a6c0ab26589bde0d225c5a45044e0adbfa3840b95fbb263d01fd8fb0d2460ed5&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;If those two hashes differ then we should deploy the new database.&lt;/p&gt;
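&lt;p&gt;The decision itself is just two JSON lookups and a comparison. A sketch in Python, operating on already-fetched response bodies (the helper name is mine; the key paths match the &lt;code&gt;jq&lt;/code&gt; expressions above):&lt;/p&gt;

```python
import json


def should_deploy(inspect_output, databases_json):
    """Compare the hash reported by `datasette inspect fara.db` with
    the hash of the currently deployed database from /-/databases.json.
    Returns True when they differ, i.e. a fresh deploy is needed."""
    latest = json.loads(inspect_output)["fara"]["hash"]
    deployed = json.loads(databases_json)[0]["hash"]
    return latest != deployed
```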

&lt;p&gt;GitHub Actions have a &lt;a href="https://help.github.com/en/actions/automating-your-workflow-with-github-actions/development-tools-for-github-actions#set-an-output-parameter-set-output"&gt;slightly bizarre mechanism&lt;/a&gt; for defining "output variables" for steps, which can then be used to conditionally run further steps.&lt;/p&gt;

&lt;p&gt;Here's the step that sets those variables, followed by the step that conditionally installs the Google Cloud CLI tools using &lt;a href="https://github.com/GoogleCloudPlatform/github-actions/tree/master/setup-gcloud"&gt;their official action&lt;/a&gt;:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;- name: Set variables to decide if we should deploy
  id: decide_variables
  run: |-
    echo "##[set-output name=latest;]$(datasette inspect fara.db | jq '.fara.hash' -r)"
    echo "##[set-output name=deployed;]$(curl -s https://fara.datasettes.com/-/databases.json | jq '.[0].hash' -r)"
- name: Set up Cloud Run
  if: steps.decide_variables.outputs.latest != steps.decide_variables.outputs.deployed
  uses: GoogleCloudPlatform/github-actions/setup-gcloud@master
  with:
    version: '275.0.0'
    service_account_email: ${{ secrets.GCP_SA_EMAIL }}
    service_account_key: ${{ secrets.GCP_SA_KEY }}&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Having installed the Google Cloud tools, I can deploy my database using Datasette:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;- name: Deploy to Cloud Run
  if: steps.decide_variables.outputs.latest != steps.decide_variables.outputs.deployed
  run: |-
    gcloud components install beta
    gcloud config set run/region us-central1
    gcloud config set project datasette-222320
    datasette publish cloudrun fara.db --service fara-history -m metadata.json&lt;/code&gt;&lt;/pre&gt;

&lt;p id="google-cloud-service-key"&gt;This was by far the hardest part to figure out.&lt;/p&gt;

&lt;p&gt;First, I needed to create a Google Cloud &lt;a href="https://cloud.google.com/iam/docs/service-accounts"&gt;service account&lt;/a&gt; with an accompanying service key.&lt;/p&gt;

&lt;p&gt;I tried and failed to do this using the CLI, so I switched to their web console following &lt;a href="https://cloud.google.com/iam/docs/creating-managing-service-accounts"&gt;these&lt;/a&gt; and then &lt;a href="https://cloud.google.com/iam/docs/creating-managing-service-account-keys"&gt;these&lt;/a&gt; instructions.&lt;/p&gt;

&lt;p&gt;Having downloaded the key JSON file, I converted it to base64 and pasted it into a GitHub Actions secret (hidden away in the repository settings area) called &lt;code&gt;GCP_SA_KEY&lt;/code&gt;.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;cat ~/Downloads/datasette-222320-2ad02afe6d82.json \
    | base64 | pbcopy&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The service account needed permissions in order to run a build through Cloud Build and then deploy the result through Cloud Run. I spent a bunch of time trying out different combinations and eventually gave up and gave the account "Editor" permissions across my entire project. This is bad. I am hoping someone can help me understand what the correct narrow set of permissions are, and how to apply them.&lt;/p&gt;

&lt;p&gt;It also took me a while to figure out that I needed to run these three commands before I could deploy to my project. The first installs the Cloud Run beta components; the other two set required configuration:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;gcloud components install beta
gcloud config set run/region us-central1
gcloud config set project datasette-222320&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;But... having done all of the above, the following command run from an action successfully deploys the site!&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;datasette publish cloudrun fara.db \
    --service fara-history -m metadata.json&lt;/code&gt;&lt;/pre&gt;

&lt;h3&gt;DNS&lt;/h3&gt;

&lt;p&gt;Google Cloud Run deployments come with extremely ugly default URLs. For this project, that URL is &lt;code&gt;https://fara-history-j7hipcg4aq-uc.a.run.app/&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;I wanted something nicer. I own &lt;code&gt;datasettes.com&lt;/code&gt; and manage the DNS via Cloudflare, which means I can point subdomains at Cloud Run instances.&lt;/p&gt;

&lt;p&gt;This is a two-step process:&lt;/p&gt;

&lt;ol&gt;&lt;li&gt;I set &lt;code&gt;fara.datasettes.com&lt;/code&gt; as a DNS-only (no proxying) CNAME for &lt;code&gt;ghs.googlehosted.com&lt;/code&gt;.&lt;/li&gt;&lt;li&gt;In the Google Cloud Console I used Cloud Run -&amp;gt; Manage Custom Domains (a button in the header) -&amp;gt; Add Mapping to specify that &lt;code&gt;fara.datasettes.com&lt;/code&gt; should map to my &lt;code&gt;fara-history&lt;/code&gt; service (the &lt;code&gt;--service&lt;/code&gt; argument from &lt;code&gt;datasette publish&lt;/code&gt; earlier).&lt;/li&gt;&lt;/ol&gt;

&lt;p&gt;I had previously &lt;a href="https://support.google.com/webmasters/answer/9008080?hl=en"&gt;verified my domain ownership&lt;/a&gt; - I forget quite how I did it. Domains purchased through &lt;a href="https://domains.google/"&gt;Google Domains&lt;/a&gt; get to skip this step.&lt;/p&gt;

&lt;h3&gt;Next steps&lt;/h3&gt;

&lt;p&gt;This was a lot of fiddling around. I'm hoping that by writing this up in detail I'll be able to get this working much faster next time.&lt;/p&gt;

&lt;p&gt;I think this model - GitHub Actions that pull data, build a database and deploy to Cloud Run using &lt;code&gt;datasette publish&lt;/code&gt; - is incredibly promising. The end result should be an API that costs cents-to-dollars a month to operate thanks to Cloud Run's scale-to-zero architecture. And hopefully by publishing this all on GitHub it will be as easy as possible for other people to duplicate it for their own projects.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/continuous-deployment"&gt;continuous-deployment&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/continuous-integration"&gt;continuous-integration&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/data-journalism"&gt;data-journalism&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/github"&gt;github&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/cloudrun"&gt;cloudrun&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/github-actions"&gt;github-actions&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git-scraping"&gt;git-scraping&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="continuous-deployment"/><category term="continuous-integration"/><category term="data-journalism"/><category term="github"/><category term="projects"/><category term="datasette"/><category term="cloudrun"/><category term="github-actions"/><category term="git-scraping"/></entry><entry><title>Cloud Run Button: Click-to-deploy your git repos to Google Cloud</title><link href="https://simonwillison.net/2019/Nov/4/cloud-run-button/#atom-tag" rel="alternate"/><published>2019-11-04T04:57:33+00:00</published><updated>2019-11-04T04:57:33+00:00</updated><id>https://simonwillison.net/2019/Nov/4/cloud-run-button/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://cloud.google.com/blog/products/serverless/introducing-cloud-run-button-click-to-deploy-your-git-repos-to-google-cloud"&gt;Cloud Run Button: Click-to-deploy your git repos to Google Cloud&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Google Cloud Run now has its own version of the Heroku deploy button: you can add a button to a GitHub repository which, when clicked, will provide an interface for deploying your repo to the user’s own Google Cloud account using Cloud Run.

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/bitworking/status/1190823729925906432"&gt;Joe Gregorio&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/github"&gt;github&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/google"&gt;google&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/cloudrun"&gt;cloudrun&lt;/a&gt;&lt;/p&gt;



</summary><category term="github"/><category term="google"/><category term="cloudrun"/></entry><entry><title>Datasette 0.28 - and why master should always be releasable</title><link href="https://simonwillison.net/2019/May/19/datasette-0-28/#atom-tag" rel="alternate"/><published>2019-05-19T22:15:56+00:00</published><updated>2019-05-19T22:15:56+00:00</updated><id>https://simonwillison.net/2019/May/19/datasette-0-28/#atom-tag</id><summary type="html">
    &lt;p&gt;It's been quite a while since the last substantial release of Datasette. &lt;a href="https://datasette.readthedocs.io/en/stable/changelog.html#v0-27"&gt;Datasette 0.27&lt;/a&gt; came out all the way back in January.&lt;/p&gt;

&lt;p&gt;This isn't because development has slowed down. In fact, the project has had &lt;a href="https://github.com/simonw/datasette/compare/0.27...0.28"&gt;131 commits&lt;/a&gt; since then, covering a bewildering array of new functionality and with some significant contributions from developers who aren't me - Russ Garrett and Romain Primet deserve special recognition here.&lt;/p&gt;

&lt;p&gt;The problem has been one of discipline. I'm a big fan of the idea of keeping master shippable at all times in my professional work, but I hadn't quite adopted this policy for my open-source side projects. A couple of months ago I found myself in a situation where I had two major refactorings (of faceting and of Datasette's treatment of immutable files) going on in master at the same time, and untangling them turned out to take way longer than I had expected.&lt;/p&gt;

&lt;p&gt;So I've updated Datasette's &lt;a href="https://datasette.readthedocs.io/en/stable/contributing.html#general-guidelines"&gt;contribution guidelines&lt;/a&gt; to specify that &lt;strong&gt;master should always be releasable&lt;/strong&gt;, almost entirely as a reminder to myself.&lt;/p&gt;

&lt;p&gt;All of that said, I'm finally back out of the weeds and I'm excited to announce today's release of &lt;a href="https://pypi.org/project/datasette/0.28/"&gt;Datasette 0.28&lt;/a&gt;. It features a &lt;a href="https://adamj.eu/tech/2019/01/18/a-salmagundi-of-django-alpha-announcements/"&gt;salmagundi&lt;/a&gt; of new features! I'm replicating &lt;a href="https://datasette.readthedocs.io/en/stable/changelog.html#v0-28"&gt;the release notes&lt;/a&gt; below.&lt;/p&gt;

&lt;h3&gt;Supporting databases that change&lt;/h3&gt;

&lt;p&gt;From the beginning of the project, Datasette has been designed with read-only databases in mind. If a database is guaranteed not to change it opens up all kinds of interesting opportunities - from taking advantage of SQLite immutable mode and HTTP caching to bundling static copies of the database directly in a Docker container. &lt;a href="https://simonwillison.net/2018/Oct/4/datasette-ideas/"&gt;The interesting ideas in Datasette&lt;/a&gt; explores this idea in detail.&lt;/p&gt;

&lt;p&gt;As my goals for the project have developed, I realized that read-only databases are no longer the right default. SQLite actually supports concurrent access very well, provided only one thread attempts to write to a database at a time, and I keep encountering sensible use-cases for running Datasette on top of a database that is processing inserts and updates.&lt;/p&gt;

&lt;p&gt;So, as of version 0.28 Datasette no longer assumes that a database file will not change. It is now safe to point Datasette at a SQLite database which is being updated by another process.&lt;/p&gt;
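&lt;p&gt;A rough sketch of the scenario, using two plain &lt;code&gt;sqlite3&lt;/code&gt; connections (this is not Datasette's own connection handling): a reader connection picks up rows committed by a separate writer connection after the fact.&lt;/p&gt;

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demo.db")

# One process (here: one connection) writes to the database...
writer = sqlite3.connect(path)
writer.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, name TEXT)")
writer.execute("INSERT INTO events (name) VALUES ('first')")
writer.commit()

# ...while a second, independent connection reads from the same file.
reader = sqlite3.connect(path)
assert reader.execute("SELECT count(*) FROM events").fetchone()[0] == 1

# Rows committed by the writer later are visible to subsequent reads.
writer.execute("INSERT INTO events (name) VALUES ('second')")
writer.commit()
count = reader.execute("SELECT count(*) FROM events").fetchone()[0]
print(count)  # 2
```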

&lt;p&gt;Making this change was a lot of work - see tracking tickets &lt;a href="https://github.com/simonw/datasette/issues/418"&gt;#418&lt;/a&gt;, &lt;a href="https://github.com/simonw/datasette/issues/419"&gt;#419&lt;/a&gt; and &lt;a href="https://github.com/simonw/datasette/issues/420"&gt;#420&lt;/a&gt;. It required new thinking around how Datasette should calculate table counts (an expensive operation against a large, changing database) and also meant reconsidering the “content hash” URLs Datasette has used in the past to optimize the performance of HTTP caches.&lt;/p&gt;

&lt;p&gt;Datasette can still run against immutable files and gains numerous performance benefits from doing so, but this is no longer the default behaviour. Take a look at the new &lt;a href="https://datasette.readthedocs.io/en/stable/performance.html#performance"&gt;Performance and caching&lt;/a&gt; documentation section for details on how to make the most of Datasette against data that you know will be staying read-only and immutable.&lt;/p&gt;

&lt;h3&gt;Faceting improvements, and faceting plugins&lt;/h3&gt;

&lt;p&gt;Datasette &lt;a href="https://datasette.readthedocs.io/en/stable/facets.html#facets"&gt;Facets&lt;/a&gt; provide an intuitive way to quickly summarize and interact with data. Previously the only supported faceting technique was column faceting, but 0.28 introduces two powerful new capabilities: facet-by-JSON-array and the ability to define further facet types using plugins.&lt;/p&gt;
&lt;p&gt;Facet by array (&lt;a href="https://github.com/simonw/datasette/issues/359"&gt;#359&lt;/a&gt;) is only available if your SQLite installation provides the &lt;code&gt;json1&lt;/code&gt; extension. Datasette will automatically detect columns that contain JSON arrays of values and offer a faceting interface against those columns - useful for modelling things like tags without needing to break them out into a new table. See &lt;a href="https://datasette.readthedocs.io/en/stable/facets.html#facet-by-json-array"&gt;Facet by JSON array&lt;/a&gt; for more.&lt;/p&gt;
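&lt;p&gt;The underlying SQL technique relies on the &lt;code&gt;json1&lt;/code&gt; extension's &lt;code&gt;json_each()&lt;/code&gt; table-valued function. A rough illustration of how array values can be counted for faceting (not Datasette's exact query):&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, tags TEXT)")
conn.executemany(
    "INSERT INTO posts (tags) VALUES (?)",
    [('["python", "sqlite"]',), ('["python"]',), ('["docker"]',)],
)
# json_each() expands each JSON array into rows, so distinct tags
# can be counted across the whole table - the core of facet-by-array
rows = conn.execute(
    """
    SELECT j.value AS tag, count(*) AS n
    FROM posts, json_each(posts.tags) AS j
    GROUP BY tag ORDER BY n DESC, tag
    """
).fetchall()
print(rows)  # [('python', 2), ('docker', 1), ('sqlite', 1)]
```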
&lt;p&gt;The new &lt;a href="https://datasette.readthedocs.io/en/stable/plugins.html#plugin-register-facet-classes"&gt;register_facet_classes()&lt;/a&gt; plugin hook (&lt;a href="https://github.com/simonw/datasette/pull/445"&gt;#445&lt;/a&gt;) can be used to register additional custom facet classes. Each facet class should provide two methods: &lt;code&gt;suggest()&lt;/code&gt; which suggests facet selections that might be appropriate for a provided SQL query, and &lt;code&gt;facet_results()&lt;/code&gt; which executes a facet operation and returns results. Datasette’s own faceting implementations have been refactored to use the same API as these plugins.&lt;/p&gt;
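&lt;p&gt;An illustrative sketch of the two-method contract in plain Python - the class and method names here mirror the shape described above, but this is &lt;em&gt;not&lt;/em&gt; Datasette's actual plugin API:&lt;/p&gt;

```python
import sqlite3

class DateFacet:
    """Hypothetical facet: suggests date-like columns, counts rows per date."""

    def __init__(self, conn, table):
        self.conn = conn
        self.table = table

    def suggest(self):
        # Propose columns whose values look like ISO dates (YYYY-...)
        cols = [r[1] for r in self.conn.execute(f"PRAGMA table_info({self.table})")]
        suggestions = []
        for col in cols:
            n = self.conn.execute(
                f"SELECT count(*) FROM {self.table} "
                f"WHERE {col} GLOB '[0-9][0-9][0-9][0-9]-*'"
            ).fetchone()[0]
            if n:
                suggestions.append(col)
        return suggestions

    def facet_results(self, column):
        # Execute the facet operation: count rows per distinct date
        return self.conn.execute(
            f"SELECT date({column}) AS d, count(*) FROM {self.table} "
            "GROUP BY d ORDER BY d"
        ).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE logs (ts TEXT)")
conn.executemany(
    "INSERT INTO logs VALUES (?)",
    [("2019-05-19",), ("2019-05-19",), ("2019-05-20",)],
)
facet = DateFacet(conn, "logs")
print(facet.suggest())            # ['ts']
print(facet.facet_results("ts"))  # [('2019-05-19', 2), ('2019-05-20', 1)]
```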

&lt;h3&gt;datasette publish cloudrun&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://cloud.google.com/run/"&gt;Google Cloud Run&lt;/a&gt; is a brand new serverless hosting platform from Google, which allows you to build a Docker container which will run only when HTTP traffic is received and will shut down (and hence cost you nothing) the rest of the time. It’s similar to Zeit’s Now v1 Docker hosting platform which sadly is &lt;a href="https://hyperion.alpha.spectrum.chat/zeit/now/cannot-create-now-v1-deployments~d206a0d4-5835-4af5-bb5c-a17f0171fb25?m=MTU0Njk2NzgwODM3OA=="&gt;no longer accepting signups&lt;/a&gt; from new users.&lt;/p&gt;

&lt;p&gt;The new &lt;code&gt;datasette publish cloudrun&lt;/code&gt; command was contributed by Romain Primet (&lt;a href="https://github.com/simonw/datasette/pull/434"&gt;#434&lt;/a&gt;) and publishes selected databases to a new Datasette instance running on Google Cloud Run.&lt;/p&gt;
&lt;p&gt;See &lt;a href="https://datasette.readthedocs.io/en/stable/publish.html#publish-cloud-run"&gt;Publishing to Google Cloud Run&lt;/a&gt; for full documentation.&lt;/p&gt;

&lt;h3&gt;register_output_renderer plugins&lt;/h3&gt;

&lt;p&gt;Russ Garrett implemented a new Datasette plugin hook called &lt;a href="https://datasette.readthedocs.io/en/stable/plugins.html#plugin-register-output-renderer"&gt;register_output_renderer&lt;/a&gt; (&lt;a href="https://github.com/simonw/datasette/pull/441"&gt;#441&lt;/a&gt;) which allows plugins to create additional output renderers in addition to Datasette’s default &lt;code&gt;.json&lt;/code&gt; and &lt;code&gt;.csv&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Russ’s in-development &lt;a href="https://github.com/russss/datasette-geo"&gt;datasette-geo&lt;/a&gt; plugin includes &lt;a href="https://github.com/russss/datasette-geo/blob/d4cecc020848bbde91e9e17bf352f7c70bc3dccf/datasette_plugin_geo/geojson.py"&gt;an example&lt;/a&gt; of this hook being used to output &lt;code&gt;.geojson&lt;/code&gt; automatically converted from SpatiaLite.&lt;/p&gt;


&lt;h3&gt;Medium changes&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Datasette now conforms to the &lt;a href="https://github.com/python/black"&gt;Black coding style&lt;/a&gt; (&lt;a href="https://github.com/simonw/datasette/pull/449"&gt;#449&lt;/a&gt;) - and has a unit test to enforce this in the future&lt;/li&gt;
&lt;li&gt;New &lt;a href="https://datasette.readthedocs.io/en/stable/json_api.html#json-api-table-arguments"&gt;Special table arguments&lt;/a&gt;:
    &lt;ul&gt;
    &lt;li&gt;&lt;code&gt;?columnname__in=value1,value2,value3&lt;/code&gt; filter for executing SQL IN queries against a table, see &lt;a href="https://datasette.readthedocs.io/en/stable/json_api.html#table-arguments"&gt;Table arguments&lt;/a&gt; (&lt;a href="https://github.com/simonw/datasette/issues/433"&gt;#433&lt;/a&gt;)&lt;/li&gt;
    &lt;li&gt;&lt;code&gt;?columnname__date=yyyy-mm-dd&lt;/code&gt; filter which returns rows where the specified datetime column falls on the specified date (&lt;a href="https://github.com/simonw/datasette/commit/583b22aa28e26c318de0189312350ab2688c90b1"&gt;583b22a&lt;/a&gt;)&lt;/li&gt;
    &lt;li&gt;&lt;code&gt;?tags__arraycontains=tag&lt;/code&gt; filter which acts against a JSON array contained in a column (&lt;a href="https://github.com/simonw/datasette/commit/78e45ead4d771007c57b307edf8fc920101f8733"&gt;78e45ea&lt;/a&gt;)&lt;/li&gt;
    &lt;li&gt;&lt;code&gt;?_where=sql-fragment&lt;/code&gt; filter for the table view (&lt;a href="https://github.com/simonw/datasette/issues/429"&gt;#429&lt;/a&gt;)&lt;/li&gt;
    &lt;li&gt;&lt;code&gt;?_fts_table=mytable&lt;/code&gt; and &lt;code&gt;?_fts_pk=mycolumn&lt;/code&gt; querystring options can be used to specify which FTS table to use for a search query - see &lt;a href="https://datasette.readthedocs.io/en/stable/full_text_search.html#full-text-search-table-or-view"&gt;Configuring full-text search for a table or view&lt;/a&gt; (&lt;a href="https://github.com/simonw/datasette/issues/428"&gt;#428&lt;/a&gt;)&lt;/li&gt;
    &lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;You can now pass the same table filter multiple times - for example, &lt;code&gt;?content__not=world&amp;amp;content__not=hello&lt;/code&gt; will return all rows where the content column is neither &lt;code&gt;hello&lt;/code&gt; nor &lt;code&gt;world&lt;/code&gt; (&lt;a href="https://github.com/simonw/datasette/issues/288"&gt;#288&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;You can now specify &lt;code&gt;about&lt;/code&gt; and &lt;code&gt;about_url&lt;/code&gt; metadata (in addition to &lt;code&gt;source&lt;/code&gt; and &lt;code&gt;license&lt;/code&gt;) linking to further information about a project - see &lt;a href="https://datasette.readthedocs.io/en/stable/metadata.html#metadata-source-license-about"&gt;Source, license and about&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;New &lt;code&gt;?_trace=1&lt;/code&gt; parameter now adds debug information showing every SQL query that was executed while constructing the page (&lt;a href="https://github.com/simonw/datasette/issues/435"&gt;#435&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;&lt;code&gt;datasette inspect&lt;/code&gt; now just calculates table counts, and does not introspect other database metadata (&lt;a href="https://github.com/simonw/datasette/issues/462"&gt;#462&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Removed &lt;code&gt;/-/inspect&lt;/code&gt; page entirely - this will be replaced by something similar in the future, see &lt;a href="https://github.com/simonw/datasette/issues/465"&gt;#465&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Datasette can now run against an in-memory SQLite database. You can do this by starting it without passing any files or by using the new &lt;code&gt;--memory&lt;/code&gt; option to &lt;code&gt;datasette serve&lt;/code&gt;. This can be useful for experimenting with SQLite queries that do not access any data, such as &lt;code&gt;SELECT 1+1&lt;/code&gt; or &lt;code&gt;SELECT sqlite_version()&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
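&lt;p&gt;A rough sketch of how filters like &lt;code&gt;__in&lt;/code&gt; and repeated &lt;code&gt;__not&lt;/code&gt; translate to parameterised SQL (illustrative only, not Datasette's actual implementation):&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (content TEXT)")
conn.executemany("INSERT INTO t VALUES (?)", [("hello",), ("world",), ("foo",)])

# ?content__in=hello,world - comma-separated values become a
# parameterised IN clause with one placeholder per value
values = "hello,world".split(",")
placeholders = ",".join("?" for _ in values)
rows_in = conn.execute(
    f"SELECT content FROM t WHERE content IN ({placeholders})", values
).fetchall()
print(rows_in)  # [('hello',), ('world',)]

# Repeating ?content__not=world and ?content__not=hello - each
# occurrence of the filter contributes one ANDed clause
rows_not = conn.execute(
    "SELECT content FROM t WHERE content != ? AND content != ?",
    ["world", "hello"],
).fetchall()
print(rows_not)  # [('foo',)]
```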


&lt;h3&gt;Small changes&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;We now show the size of the database file next to the download link (&lt;a href="https://github.com/simonw/datasette/issues/172"&gt;#172&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;New &lt;code&gt;/-/databases&lt;/code&gt; introspection page shows currently connected databases (&lt;a href="https://github.com/simonw/datasette/issues/470"&gt;#470&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Binary data is no longer displayed on the table and row pages (&lt;a href="https://github.com/simonw/datasette/pull/442"&gt;#442&lt;/a&gt; - thanks, Russ Garrett)&lt;/li&gt;
&lt;li&gt;New show/hide SQL links on custom query pages (&lt;a href="https://github.com/simonw/datasette/issues/415"&gt;#415&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;The &lt;a href="https://datasette.readthedocs.io/en/stable/plugins.html#plugin-hook-extra-body-script"&gt;extra_body_script&lt;/a&gt; plugin hook now accepts an optional &lt;code&gt;view_name&lt;/code&gt; argument (&lt;a href="https://github.com/simonw/datasette/pull/443"&gt;#443&lt;/a&gt; - thanks, Russ Garrett)&lt;/li&gt;
&lt;li&gt;Bumped Jinja2 dependency to 2.10.1 (&lt;a href="https://github.com/simonw/datasette/pull/426"&gt;#426&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;All table filters are now documented, and documentation is enforced via unit tests (&lt;a href="https://github.com/simonw/datasette/commit/2c19a27d15a913e5f3dd443f04067169a6f24634"&gt;2c19a27&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;New project guideline: master should stay shippable at all times! (&lt;a href="https://github.com/simonw/datasette/commit/31f36e1b97ccc3f4387c80698d018a69798b6228"&gt;31f36e1&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Fixed a bug where &lt;code&gt;sqlite_timelimit()&lt;/code&gt; occasionally failed to clean up after itself (&lt;a href="https://github.com/simonw/datasette/commit/bac4e01f40ae7bd19d1eab1fb9349452c18de8f5"&gt;bac4e01&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;We no longer load additional plugins when executing pytest (&lt;a href="https://github.com/simonw/datasette/issues/438"&gt;#438&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Homepage now links to database views if there are fewer than five tables in a database (&lt;a href="https://github.com/simonw/datasette/issues/373"&gt;#373&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;--cors&lt;/code&gt; option is now respected by error pages (&lt;a href="https://github.com/simonw/datasette/issues/453"&gt;#453&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;&lt;code&gt;datasette publish heroku&lt;/code&gt; now uses the &lt;code&gt;--include-vcs-ignore&lt;/code&gt; option, which means it works under Travis CI (&lt;a href="https://github.com/simonw/datasette/pull/407"&gt;#407&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;&lt;code&gt;datasette publish heroku&lt;/code&gt; now publishes using Python 3.6.8 (&lt;a href="https://github.com/simonw/datasette/commit/666c37415a898949fae0437099d62a35b1e9c430"&gt;666c374&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Renamed &lt;code&gt;datasette publish now&lt;/code&gt; to &lt;code&gt;datasette publish nowv1&lt;/code&gt; (&lt;a href="https://github.com/simonw/datasette/issues/472"&gt;#472&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;&lt;code&gt;datasette publish nowv1&lt;/code&gt; now accepts multiple &lt;code&gt;--alias&lt;/code&gt; parameters (&lt;a href="https://github.com/simonw/datasette/commit/09ef305c687399384fe38487c075e8669682deb4"&gt;09ef305&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Removed the &lt;code&gt;datasette skeleton&lt;/code&gt; command (&lt;a href="https://github.com/simonw/datasette/issues/476"&gt;#476&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;The &lt;a href="https://datasette.readthedocs.io/en/stable/contributing.html#contributing-documentation"&gt;documentation on how to build the documentation&lt;/a&gt; now recommends &lt;code&gt;sphinx-autobuild&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/open-source"&gt;open-source&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/cloudrun"&gt;cloudrun&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="open-source"/><category term="projects"/><category term="datasette"/><category term="cloudrun"/></entry><entry><title>Ministry of Silly Runtimes: Vintage Python on Cloud Run</title><link href="https://simonwillison.net/2019/Apr/9/vintage-python-on-cloud-run/#atom-tag" rel="alternate"/><published>2019-04-09T17:33:47+00:00</published><updated>2019-04-09T17:33:47+00:00</updated><id>https://simonwillison.net/2019/Apr/9/vintage-python-on-cloud-run/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://dev.to/di/ministry-of-silly-runtimes-vintage-python-on-cloud-run-3b9d"&gt;Ministry of Silly Runtimes: Vintage Python on Cloud Run&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Cloud Run is an exciting new hosting service from Google that lets you define a container using a Dockerfile and then run that container in a “scale to zero” environment, so you only pay for time spent serving traffic. It’s similar to the now-deprecated Zeit Now 1.0 which inspired me to create Datasette. Here Dustin Ingram demonstrates how powerful Docker can be as the underlying abstraction by deploying a web app using a 25-year-old version of Python 1.x.

&lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/jacobian/status/1115665262215352320"&gt;@jacobian&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/cloud"&gt;cloud&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/zeit-now"&gt;zeit-now&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/docker"&gt;docker&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/cloudrun"&gt;cloudrun&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/dustin-ingram"&gt;dustin-ingram&lt;/a&gt;&lt;/p&gt;



</summary><category term="cloud"/><category term="python"/><category term="zeit-now"/><category term="docker"/><category term="datasette"/><category term="cloudrun"/><category term="dustin-ingram"/></entry></feed>