<?xml version="1.0" encoding="utf-8"?>
<feed xml:lang="en-us" xmlns="http://www.w3.org/2005/Atom"><title>Simon Willison's Weblog: cloudrun</title><link href="http://simonwillison.net/" rel="alternate"/><link href="http://simonwillison.net/tags/cloudrun.atom" rel="self"/><id>http://simonwillison.net/</id><updated>2025-11-05T23:11:17+00:00</updated><author><name>Simon Willison</name></author><entry><title>Open redirect endpoint in Datasette prior to 0.65.2 and 1.0a21</title><link href="https://simonwillison.net/2025/Nov/5/open-redirect-datasette/#atom-tag" rel="alternate"/><published>2025-11-05T23:11:17+00:00</published><updated>2025-11-05T23:11:17+00:00</updated><id>https://simonwillison.net/2025/Nov/5/open-redirect-datasette/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette/security/advisories/GHSA-w832-gg5g-x44m"&gt;Open redirect endpoint in Datasette prior to 0.65.2 and 1.0a21&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
This GitHub security advisory covers two new releases of Datasette that I shipped today, both addressing &lt;a href="https://github.com/simonw/datasette/issues/2429"&gt;the same open redirect issue&lt;/a&gt; with a fix by &lt;a href="https://github.com/jamesjefferies"&gt;James Jefferies&lt;/a&gt;.
&lt;p&gt;&lt;strong&gt;&lt;a href="https://docs.datasette.io/en/stable/changelog.html#v0-65-2"&gt;Datasette 0.65.2&lt;/a&gt;&lt;/strong&gt; fixes the bug and also adds Python 3.14 support and a &lt;code&gt;datasette publish cloudrun&lt;/code&gt; fix.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&lt;a href="https://docs.datasette.io/en/latest/changelog.html#a21-2025-11-05"&gt;Datasette 1.0a21&lt;/a&gt;&lt;/strong&gt; also has that Cloud Run fix and two other small new features:&lt;/p&gt;
&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;New &lt;code&gt;datasette --get /path --headers&lt;/code&gt; option for inspecting the headers returned by a path. (&lt;a href="https://github.com/simonw/datasette/issues/2578"&gt;#2578&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;New &lt;code&gt;datasette.client.get(..., skip_permission_checks=True)&lt;/code&gt; parameter to bypass permission checks when making requests using the internal client. (&lt;a href="https://github.com/simonw/datasette/issues/2583"&gt;#2583&lt;/a&gt;)&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;I decided to include the Cloud Run deployment fix so anyone with Datasette instances deployed to Cloud Run can update them to the new patched versions.&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/security"&gt;security&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/cloudrun"&gt;cloudrun&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/annotated-release-notes"&gt;annotated-release-notes&lt;/a&gt;&lt;/p&gt;



</summary><category term="security"/><category term="datasette"/><category term="cloudrun"/><category term="annotated-release-notes"/></entry><entry><title>Clickhouse on Cloud Run</title><link href="https://simonwillison.net/2021/Jul/29/clickhouse-on-cloud-run/#atom-tag" rel="alternate"/><published>2021-07-29T06:07:51+00:00</published><updated>2021-07-29T06:07:51+00:00</updated><id>https://simonwillison.net/2021/Jul/29/clickhouse-on-cloud-run/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://alexjreid.dev/posts/clickhouse-on-cloud-run/"&gt;Clickhouse on Cloud Run&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Alex Reid figured out how to run Clickhouse against read-only baked data on Cloud Run last year, and wrote up some comprehensive notes.

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/alexjreid/status/1420625467384737797"&gt;@alexjreid&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/cloudrun"&gt;cloudrun&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/baked-data"&gt;baked-data&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/clickhouse"&gt;clickhouse&lt;/a&gt;&lt;/p&gt;



</summary><category term="cloudrun"/><category term="baked-data"/><category term="clickhouse"/></entry><entry><title>Building a search engine for datasette.io</title><link href="https://simonwillison.net/2020/Dec/19/dogsheep-beta/#atom-tag" rel="alternate"/><published>2020-12-19T18:12:31+00:00</published><updated>2020-12-19T18:12:31+00:00</updated><id>https://simonwillison.net/2020/Dec/19/dogsheep-beta/#atom-tag</id><summary type="html">
    &lt;p&gt;This week I added &lt;a href="https://datasette.io/-/beta"&gt;a search engine&lt;/a&gt; to &lt;a href="https://datasette.io/"&gt;datasette.io&lt;/a&gt;, using the search indexing tool I've been building for &lt;a href="https://datasette.substack.com/p/dogsheep-personal-analytics-with"&gt;Dogsheep&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="A screenshot of dogsheep.io search results for ripgrep" src="https://static.simonwillison.net/static/2020/dogsheep-beta-ripgrep.jpg" style="max-width:100%;" /&gt;&lt;/p&gt;
&lt;h4&gt;Project search for Datasette&lt;/h4&gt;
&lt;p&gt;The Datasette project has a &lt;em&gt;lot&lt;/em&gt; of constituent parts. There's the project itself and its &lt;a href="https://docs.datasette.io/"&gt;documentation&lt;/a&gt; - 171 pages when exported to PDF and counting. Then there are the &lt;a href="https://datasette.io/plugins"&gt;48 plugins&lt;/a&gt;, &lt;a href="https://datasette.io/tools/sqlite-utils"&gt;sqlite-utils&lt;/a&gt; and &lt;a href="https://datasette.io/tools"&gt;21 more tools&lt;/a&gt; for creating SQLite databases, the &lt;a href="https://dogsheep.github.io/"&gt;Dogsheep&lt;/a&gt; collection and over three years of content I've written about the project &lt;a href="https://simonwillison.net/tags/datasette/"&gt;on my blog&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The new &lt;a href="https://datasette.io/-/beta"&gt;datasette.io search engine&lt;/a&gt; provides a faceted search interface to all of this material in one place. It currently searches across:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Every section of the latest documentation (415 total)&lt;/li&gt;
&lt;li&gt;48 plugin READMEs&lt;/li&gt;
&lt;li&gt;22 tool READMEs&lt;/li&gt;
&lt;li&gt;63 news items posted on the Datasette website&lt;/li&gt;
&lt;li&gt;212 items from my blog&lt;/li&gt;
&lt;li&gt;Release notes from 557 package releases&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I plan to extend it with more data sources in the future.&lt;/p&gt;
&lt;h4&gt;How it works: Dogsheep Beta&lt;/h4&gt;
&lt;p&gt;I'm reusing the search engine I originally built for my Dogsheep personal analytics project (see &lt;a href="https://simonwillison.net/2020/Nov/14/personal-data-warehouses/"&gt;Personal Data Warehouses: Reclaiming Your Data&lt;/a&gt;). I call that search engine &lt;a href="https://github.com/dogsheep/beta"&gt;Dogsheep Beta&lt;/a&gt;. The name is &lt;a href="https://datasette.substack.com/p/dogsheep-personal-analytics-with"&gt;a pun&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;SQLite has great &lt;a href="https://sqlite.org/fts5.html"&gt;full-text search&lt;/a&gt; built in, and I make extensive use of that in Datasette projects already. But out of the box it's not quite right for this kind of search engine that spans multiple different content types.&lt;/p&gt;
&lt;p&gt;The problem is relevance calculation. I wrote about this in &lt;a href="https://simonwillison.net/2019/Jan/7/exploring-search-relevance-algorithms-sqlite/"&gt;Exploring search relevance algorithms with SQLite&lt;/a&gt; - short version: query relevance is calculated using statistics against the whole corpus, so search terms that occur rarely in the overall corpus contribute a higher score than more common terms.&lt;/p&gt;
&lt;p&gt;This means that full-text ranking scores calculated against one table of data cannot be meaningfully compared to scores calculated independently against a separate table, as the corpus statistics used to calculate the rank will differ.&lt;/p&gt;
&lt;p&gt;To get usable scores, you need everything in a single table. That's what Dogsheep Beta does: it creates a new table, called &lt;code&gt;search_index&lt;/code&gt;, and copies searchable content from the other tables into that new table.&lt;/p&gt;
&lt;p&gt;This is analogous to how an external search index like Elasticsearch works: you store your data in the main database, then periodically update an index in Elasticsearch. It's the &lt;a href="https://2017.djangocon.us/talks/the-denormalized-query-engine-design-pattern/"&gt;denormalized query engine&lt;/a&gt; design pattern in action.&lt;/p&gt;
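&lt;p&gt;Here's a minimal sketch of that single-table pattern using Python's built-in &lt;code&gt;sqlite3&lt;/code&gt; module and FTS5 (simplified illustrative code, not the actual Dogsheep Beta implementation):&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# One FTS5 table holds every content type, so bm25() scores share
# a single set of corpus statistics and can be compared directly
conn.execute("CREATE VIRTUAL TABLE search_index USING fts5(type, title, search_1)")
conn.executemany(
    "INSERT INTO search_index VALUES (?, ?, ?)",
    [
        ("release", "datasette 0.52", "adds a datasette publish cloudrun option"),
        ("blog", "Building a search engine", "faceted search across plugins and docs"),
        ("plugin", "datasette-ripgrep", "regular expression search for source code"),
    ],
)
# bm25() returns lower-is-better scores, so order ascending
results = conn.execute(
    "SELECT type, title FROM search_index "
    "WHERE search_index MATCH ? ORDER BY bm25(search_index)",
    ("cloudrun",),
).fetchall()
print(results)  # [('release', 'datasette 0.52')]
```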
&lt;h4&gt;Configuring Dogsheep Beta&lt;/h4&gt;
&lt;p&gt;There are two components to Dogsheep Beta: a command-line tool for building a search index, and a Datasette plugin for providing an interface for running searches.&lt;/p&gt;
&lt;p&gt;Both of these run off a YAML configuration file, which defines the tables that should be indexed and also defines how those search results should be displayed.&lt;/p&gt;
&lt;p&gt;(Having one configuration file handle both indexing and display feels a little inelegant, but it's extremely productive for iterating on so I'm letting that slide.)&lt;/p&gt;
&lt;p&gt;Here's the full &lt;a href="https://github.com/simonw/datasette.io/blob/0e3e839ca1efb19cd84100291789521c0a1a3561/templates/dogsheep-beta.yml"&gt;Dogsheep configuration for datasette.io&lt;/a&gt;. An annotated extract:&lt;/p&gt;
&lt;div class="highlight highlight-source-yaml"&gt;&lt;pre&gt;&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; Index material in the content.db SQLite file&lt;/span&gt;
&lt;span class="pl-ent"&gt;content.db&lt;/span&gt;:
  &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; Define a search type called 'releases'&lt;/span&gt;
  &lt;span class="pl-ent"&gt;releases&lt;/span&gt;:
    &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; Populate that search type by executing this SQL&lt;/span&gt;
    &lt;span class="pl-ent"&gt;sql&lt;/span&gt;: &lt;span class="pl-s"&gt;|-&lt;/span&gt;
&lt;span class="pl-s"&gt;      select&lt;/span&gt;
&lt;span class="pl-s"&gt;        releases.id as key,&lt;/span&gt;
&lt;span class="pl-s"&gt;        repos.name || ' ' || releases.tag_name as title,&lt;/span&gt;
&lt;span class="pl-s"&gt;        releases.published_at as timestamp,&lt;/span&gt;
&lt;span class="pl-s"&gt;        releases.body as search_1,&lt;/span&gt;
&lt;span class="pl-s"&gt;        1 as is_public&lt;/span&gt;
&lt;span class="pl-s"&gt;      from&lt;/span&gt;
&lt;span class="pl-s"&gt;        releases&lt;/span&gt;
&lt;span class="pl-s"&gt;        join repos on releases.repo = repos.id&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;/span&gt;    &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; When displaying a search result, use this SQL to&lt;/span&gt;
    &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; return extra details about the item&lt;/span&gt;
    &lt;span class="pl-ent"&gt;display_sql&lt;/span&gt;: &lt;span class="pl-s"&gt;|-&lt;/span&gt;
&lt;span class="pl-s"&gt;      select&lt;/span&gt;
&lt;span class="pl-s"&gt;        -- highlight() is a custom SQL function&lt;/span&gt;
&lt;span class="pl-s"&gt;        highlight(render_markdown(releases.body), :q) as snippet,&lt;/span&gt;
&lt;span class="pl-s"&gt;        html_url&lt;/span&gt;
&lt;span class="pl-s"&gt;      from releases where id = :key&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;/span&gt;    &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; Jinja template fragment to display the result&lt;/span&gt;
    &lt;span class="pl-ent"&gt;display&lt;/span&gt;: &lt;span class="pl-s"&gt;|-&lt;/span&gt;
&lt;span class="pl-s"&gt;      &amp;lt;h3&amp;gt;Release: &amp;lt;a href="{{ display.html_url }}"&amp;gt;{{ title }}&amp;lt;/a&amp;gt;&amp;lt;/h3&amp;gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;      &amp;lt;p&amp;gt;{{ display.snippet|safe }}&amp;lt;/p&amp;gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;      &amp;lt;p&amp;gt;&amp;lt;small&amp;gt;Released {{ timestamp }}&amp;lt;/small&amp;gt;&amp;lt;/p&amp;gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The core pattern here is the &lt;code&gt;sql:&lt;/code&gt; key, which defines a SQL query that must return the following columns:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;key&lt;/code&gt; - a unique identifier for this search item&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;title&lt;/code&gt; - a title for this indexed document&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;timestamp&lt;/code&gt; - a timestamp for when it was created. May be null.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;search_1&lt;/code&gt; - text to be searched. I may add support for &lt;code&gt;search_2&lt;/code&gt; and &lt;code&gt;search_3&lt;/code&gt; later on to store text that will be treated with a lower relevance score.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;is_public&lt;/code&gt; - whether this should be considered "public" data. This is a holdover from Dogsheep Beta's origins in personal analytics; I don't actually need it for datasette.io.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;To create an index, run the following:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;dogsheep-beta index dogsheep-index.db dogsheep-config.yml
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;index&lt;/code&gt; command will loop through every configured search type in the YAML file, execute the SQL query and use it to populate a &lt;code&gt;search_index&lt;/code&gt; table in the &lt;code&gt;dogsheep-index.db&lt;/code&gt; SQLite database file.&lt;/p&gt;
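&lt;p&gt;The core of that loop can be sketched like this (a simplified stand-in for the real command, using a plain dictionary and in-memory databases in place of the YAML file and the real content):&lt;/p&gt;

```python
import sqlite3

# Each configured search type maps to a SQL query whose rows are
# copied into the shared search_index table
config = {
    "releases": "SELECT id AS key, tag_name AS title, body AS search_1 FROM releases",
    "news": "SELECT rowid AS key, headline AS title, headline AS search_1 FROM news",
}

content = sqlite3.connect(":memory:")
content.executescript("""
CREATE TABLE releases (id INTEGER PRIMARY KEY, tag_name TEXT, body TEXT);
CREATE TABLE news (headline TEXT);
INSERT INTO releases VALUES (1, '0.52', 'Adds --apt-get-install');
INSERT INTO news (headline) VALUES ('New search engine launched');
""")

index = sqlite3.connect(":memory:")
index.execute("CREATE VIRTUAL TABLE search_index USING fts5(type, key, title, search_1)")
for search_type, sql in config.items():
    for row in content.execute(sql):
        index.execute(
            "INSERT INTO search_index VALUES (?, ?, ?, ?)",
            (search_type, *map(str, row)),
        )
count = index.execute("SELECT count(*) FROM search_index").fetchone()[0]
print(count)  # 2
```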
&lt;p&gt;Here's &lt;a href="https://datasette.io/dogsheep-index/search_index"&gt;the search_index table&lt;/a&gt; for &lt;a href="https://datasette.io/"&gt;datasette.io&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;When you run a search, the plugin queries that table and gets back results sorted by relevance (or other sort criteria, if specified).&lt;/p&gt;
&lt;p&gt;To display the results, it loops through each one and uses the Jinja template fragment from the configuration file to turn it into HTML.&lt;/p&gt;
&lt;p&gt;If a &lt;code&gt;display_sql:&lt;/code&gt; query is defined, that query will be executed for each result to populate the &lt;code&gt;{{ display }}&lt;/code&gt; object made available to the template. &lt;a href="https://www.sqlite.org/np1queryprob.html"&gt;Many Small Queries Are Efficient In SQLite&lt;/a&gt;.&lt;/p&gt;
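&lt;p&gt;A rough sketch of that per-result lookup, with hypothetical table and column names for illustration:&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE releases (id INTEGER PRIMARY KEY, body TEXT, html_url TEXT)")
conn.execute(
    "INSERT INTO releases VALUES (1, 'Adds a new plugin hook', 'https://example.com/r/1')"
)

display_sql = "SELECT body AS snippet, html_url FROM releases WHERE id = :key"
results = [{"key": 1, "title": "datasette 0.52"}]
for result in results:
    # One small query per search result - cheap in SQLite
    row = conn.execute(display_sql, {"key": result["key"]}).fetchone()
    result["display"] = {"snippet": row[0], "html_url": row[1]}
print(results[0]["display"]["snippet"])  # Adds a new plugin hook
```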
&lt;h4&gt;Search term highlighting&lt;/h4&gt;
&lt;p&gt;I spent &lt;a href="https://github.com/simonw/datasette.io/issues/49"&gt;a bit of time&lt;/a&gt; thinking about search highlighting. SQLite has an implementation of highlighting built in - &lt;a href="https://sqlite.org/fts5.html#the_snippet_function"&gt;the snippet() function&lt;/a&gt; - but it's not designed to be HTML-aware so there's a risk it might mangle HTML by adding highlighting marks in the middle of a tag or attribute.&lt;/p&gt;
&lt;p&gt;I ended up borrowing a BSD-licensed &lt;a href="https://github.com/django-haystack/django-haystack/blob/v3.0/haystack/utils/highlighting.py"&gt;highlighting class&lt;/a&gt; from the &lt;a href="https://github.com/django-haystack/django-haystack"&gt;django-haystack&lt;/a&gt; project. It deals with HTML by stripping tags, which seems to be more-or-less what Google do for their own search results, so I figured that's good enough for me.&lt;/p&gt;
&lt;p&gt;I used this &lt;a href="https://github.com/simonw/datasette.io/blob/0e3e839ca1efb19cd84100291789521c0a1a3561/plugins/sql_functions.py"&gt;one-off site plugin&lt;/a&gt; to wrap the highlighting code in a custom SQLite function. This meant I could call it from the &lt;code&gt;display_sql:&lt;/code&gt; query in the Dogsheep Beta YAML configuration.&lt;/p&gt;
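&lt;p&gt;Registering a Python function as a custom SQLite function is a one-liner with &lt;code&gt;sqlite3&lt;/code&gt;. Here's a much-simplified stand-in for the real highlighter - it wraps matches in asterisks rather than HTML, purely for illustration:&lt;/p&gt;

```python
import re
import sqlite3

def highlight(text, query):
    # Much-simplified stand-in for the real highlighter: wrap
    # case-insensitive matches of the query in asterisks
    return re.sub("({})".format(re.escape(query)), r"**\1**", text or "", flags=re.I)

conn = sqlite3.connect(":memory:")
# Register the Python function so it can be called from SQL queries
conn.create_function("highlight", 2, highlight)
row = conn.execute(
    "SELECT highlight(?, ?)", ("Deploy with Cloud Run", "cloud run")
).fetchone()
print(row[0])  # Deploy with **Cloud Run**
```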
&lt;p&gt;A custom template tag would be more elegant, but I don't yet have a mechanism to expose custom template tags in the Dogsheep Beta rendering mechanism.&lt;/p&gt;
&lt;h4&gt;Build, index, deploy&lt;/h4&gt;
&lt;p&gt;The Datasette website implements the Baked Data pattern, where the content is compiled into SQLite database files and bundled with the application code itself as part of the deploy.&lt;/p&gt;
&lt;p&gt;Building the index is just another step of that process.&lt;/p&gt;
&lt;p&gt;Here's &lt;a href="https://github.com/simonw/datasette.io/blob/0e3e839ca1efb19cd84100291789521c0a1a3561/.github/workflows/deploy.yml"&gt;the deploy.yml&lt;/a&gt; GitHub workflow used by the site. It roughly does the following:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Download the current version of the &lt;a href="https://datasette.io/content"&gt;content.db&lt;/a&gt; database file. This is so it doesn't have to re-fetch release and README content that was previously stored there.&lt;/li&gt;
&lt;li&gt;Download the current version of &lt;a href="https://datasette.io/blog"&gt;blog.db&lt;/a&gt;, with entries from my blog. This means I don't have to fetch all entries, just the new ones.&lt;/li&gt;
&lt;li&gt;Run &lt;a href="https://github.com/simonw/datasette.io/blob/0e3e839ca1efb19cd84100291789521c0a1a3561/build_directory.py"&gt;build_directory.py&lt;/a&gt;, the script which fetches data for the plugins and tools pages.
&lt;ul&gt;
&lt;li&gt;This hits the GitHub GraphQL API to find new repositories tagged &lt;code&gt;datasette-io&lt;/code&gt; and &lt;code&gt;datasette-plugin&lt;/code&gt; and &lt;code&gt;datasette-tool&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;That GraphQL query also returns the most recent release. The script then checks to see if those releases have previously been fetched and, if not, uses &lt;a href="https://datasette.io/tools/github-to-sqlite"&gt;github-to-sqlite&lt;/a&gt; to fetch them.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Imports the data from &lt;a href="https://github.com/simonw/datasette.io/blob/0e3e839ca1efb19cd84100291789521c0a1a3561/news.yaml"&gt;news.yaml&lt;/a&gt; into a &lt;code&gt;news&lt;/code&gt; table using &lt;a href="https://datasette.io/tools/yaml-to-sqlite"&gt;yaml-to-sqlite&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Imports the latest PyPI download statistics for my packages from my &lt;a href="https://github.com/simonw/package-stats"&gt;simonw/package-stats&lt;/a&gt; repository, which implements &lt;a href="https://simonwillison.net/2020/Oct/9/git-scraping/"&gt;git scraping&lt;/a&gt; against the most excellent &lt;a href="https://pypistats.org/"&gt;pypistats.org&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Runs the &lt;code&gt;dogsheep-beta index&lt;/code&gt; command to build a &lt;code&gt;dogsheep-index.db&lt;/code&gt; search index.&lt;/li&gt;
&lt;li&gt;Runs some soundness checks, e.g. &lt;code&gt;datasette . --get "/plugins"&lt;/code&gt;, to verify that Datasette is likely to return an HTTP 200 for critical pages once published.&lt;/li&gt;
&lt;li&gt;Uses &lt;code&gt;datasette publish cloudrun&lt;/code&gt; to deploy the results to Google Cloud Run, which hosts the website.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I love building websites this way. You can have as much complexity as you like in the build script (my TIL website build script &lt;a href="https://simonwillison.net/2020/Sep/3/weeknotes-airtable-screenshots-dogsheep/#weeknotes-2020-09-03-social-media-cards-tils"&gt;generates screenshots using Puppeteer&lt;/a&gt;) but the end result is some simple database files running on inexpensive, immutable, scalable hosting.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/search"&gt;search&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sqlite"&gt;sqlite&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/dogsheep"&gt;dogsheep&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/weeknotes"&gt;weeknotes&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/cloudrun"&gt;cloudrun&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/baked-data"&gt;baked-data&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="projects"/><category term="search"/><category term="sqlite"/><category term="datasette"/><category term="dogsheep"/><category term="weeknotes"/><category term="cloudrun"/><category term="baked-data"/></entry><entry><title>Datasette 0.52</title><link href="https://simonwillison.net/2020/Nov/29/datasette-052/#atom-tag" rel="alternate"/><published>2020-11-29T00:56:15+00:00</published><updated>2020-11-29T00:56:15+00:00</updated><id>https://simonwillison.net/2020/Nov/29/datasette-052/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://docs.datasette.io/en/stable/changelog.html#v0-52"&gt;Datasette 0.52&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
A relatively small release—it has a new plugin hook (database_actions(), for adding links to a new database actions menu), renames the --config option to --setting and adds a new “datasette publish cloudrun --apt-get-install” option.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/cloudrun"&gt;cloudrun&lt;/a&gt;&lt;/p&gt;



</summary><category term="projects"/><category term="datasette"/><category term="cloudrun"/></entry><entry><title>datasette-ripgrep: deploy a regular expression search engine for your source code</title><link href="https://simonwillison.net/2020/Nov/28/datasette-ripgrep/#atom-tag" rel="alternate"/><published>2020-11-28T06:51:06+00:00</published><updated>2020-11-28T06:51:06+00:00</updated><id>https://simonwillison.net/2020/Nov/28/datasette-ripgrep/#atom-tag</id><summary type="html">
    &lt;p&gt;This week I built &lt;a href="https://github.com/simonw/datasette-ripgrep"&gt;datasette-ripgrep&lt;/a&gt; - a web application for running regular expression searches against source code, built on top of the amazing &lt;a href="https://github.com/BurntSushi/ripgrep"&gt;ripgrep&lt;/a&gt; command-line tool.&lt;/p&gt;
&lt;h4&gt;datasette-ripgrep demo&lt;/h4&gt;
&lt;p&gt;I've deployed a demo version of the application here:&lt;/p&gt;
&lt;p&gt;&lt;a href="https://ripgrep.datasette.io/-/ripgrep?pattern=pytest"&gt;ripgrep.datasette.io/-/ripgrep?pattern=pytest&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;The demo runs searches against the source code of every one of my GitHub repositories that start with &lt;code&gt;datasette&lt;/code&gt; - &lt;a href="https://github-to-sqlite.dogsheep.net/github/repos?name__startswith=datasette&amp;amp;owner__exact=9599"&gt;61 repos&lt;/a&gt; right now - so it should include all of my Datasette plugins plus the core Datasette repository itself.&lt;/p&gt;
&lt;p&gt;Since it's running on top of &lt;code&gt;ripgrep&lt;/code&gt;, it supports regular expressions. This is absurdly useful. Some examples:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Every usage of the &lt;code&gt;.plugin_config(&lt;/code&gt; method: &lt;a href="https://ripgrep.datasette.io/-/ripgrep?pattern=%5C.plugin_config%5C%28"&gt;plugin_config\(&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Everywhere I use &lt;code&gt;async with httpx.AsyncClient&lt;/code&gt; (usually in tests): &lt;a href="https://ripgrep.datasette.io/-/ripgrep?pattern=async+with.*AsyncClient"&gt;async with.*AsyncClient&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;All places where I use a Jinja &lt;code&gt;|&lt;/code&gt; filter inside a variable: &lt;a href="https://ripgrep.datasette.io/-/ripgrep?pattern=%5C%7B%5C%7B.*%5C%7C.*%5C%7D%5C%7D"&gt;\{\{.*\|.*\}\}&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I usually run ripgrep as &lt;code&gt;rg&lt;/code&gt; on the command-line, or use it within Visual Studio Code (&lt;a href="https://twitter.com/simonw/status/1331381448171929600"&gt;fun fact&lt;/a&gt;: the reason VS Code's "Find in Files" is so good is it's running ripgrep under the hood).&lt;/p&gt;
&lt;p&gt;So why have it as a web application? Because this means I can link to it, bookmark it and use it on my phone.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2020/datasette-ripgrep.png" alt="A screenshot of datasette-ripgrep in action" style="max-width: 100%" /&gt;&lt;/p&gt;
&lt;h4&gt;Why build this?&lt;/h4&gt;
&lt;p&gt;There are plenty of great existing code search tools out there already: I've heard great things about &lt;a href="https://github.com/livegrep/livegrep"&gt;livegrep&lt;/a&gt;, and a quick Google search shows a bunch of other options.&lt;/p&gt;
&lt;p&gt;Aside from being a fun project, &lt;code&gt;datasette-ripgrep&lt;/code&gt; has one key advantage: it gets to benefit from Datasette's publishing mechanism, which means it's really easy to deploy.&lt;/p&gt;
&lt;p&gt;That &lt;a href="https://ripgrep.datasette.io/"&gt;ripgrep.datasette.io&lt;/a&gt; demo is deployed by checking out the source code to be searched into a &lt;code&gt;all&lt;/code&gt; directory and then using the following command:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;datasette publish cloudrun \
    --metadata metadata.json \
    --static all:all \
    --install=datasette-ripgrep \
    --service datasette-ripgrep \
    --apt-get-install ripgrep
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;code&gt;all&lt;/code&gt; is a folder containing the source code to be searched. &lt;code&gt;metadata.json&lt;/code&gt; contains this:&lt;/p&gt;
&lt;div class="highlight highlight-source-json"&gt;&lt;pre&gt;{
    &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;plugins&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;: {
        &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;datasette-ripgrep&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;: {
            &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;path&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;/app/all&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
            &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;time_limit&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;: &lt;span class="pl-c1"&gt;3.0&lt;/span&gt;
        }
    }
}&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;That's all there is to it! The result is a deployed code search engine, running on Google Cloud Run.&lt;/p&gt;
&lt;p&gt;(If you want to try this yourself you'll need to be using the just-released Datasette 0.52.)&lt;/p&gt;
&lt;p&gt;The &lt;a href="https://github.com/simonw/datasette-ripgrep/blob/main/.github/workflows/deploy_demo.yml"&gt;GitHub Action workflow&lt;/a&gt; that deploys the demo also uses my &lt;a href="https://github.com/dogsheep/github-to-sqlite"&gt;github-to-sqlite&lt;/a&gt; tool to fetch my repos and then shallow-clones the ones that begin with &lt;code&gt;datasette&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;If you have &lt;a href="https://docs.datasette.io/en/stable/publish.html#publishing-to-google-cloud-run"&gt;your own Google Cloud Run credentials&lt;/a&gt;, you can run your own copy of that workflow against your own repositories.&lt;/p&gt;
&lt;h4&gt;A different kind of Datasette plugin&lt;/h4&gt;
&lt;p&gt;Datasette is a tool for publishing SQLite databases, so most Datasette plugins integrate with SQLite in some way.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;datasette-ripgrep&lt;/code&gt; is different: it makes no use of SQLite at all, but instead takes advantage of Datasette's URL routing, &lt;code&gt;datasette publish&lt;/code&gt; deployments and permissions system.&lt;/p&gt;
&lt;p&gt;The plugin implementation is currently &lt;a href="https://github.com/simonw/datasette-ripgrep/blob/07b9ced2935b0b6080c1c42fcaf6ab9e8003d186/datasette_ripgrep/__init__.py"&gt;134 lines of code&lt;/a&gt;, excluding tests and templates.&lt;/p&gt;
&lt;p&gt;While the plugin doesn't use SQLite, it does share a common philosophy with Datasette: the plugin bundles the source code that it is going to search as part of the deployed application, in a similar way to how Datasette usually bundles one or more SQLite database files.&lt;/p&gt;
&lt;p&gt;As such, it's extremely inexpensive to run and can be deployed to serverless hosting. If you need to scale it, you can run more copies.&lt;/p&gt;
&lt;p&gt;This does mean that the application needs to be re-deployed to pick up changes to the searchable code. I'll probably set my demo to do this on a daily basis.&lt;/p&gt;
&lt;h4&gt;Controlling processes from asyncio&lt;/h4&gt;
&lt;p&gt;The trickiest part of the implementation was figuring out how to use Python's &lt;code&gt;asyncio.create_subprocess_exec()&lt;/code&gt; method to safely run the &lt;code&gt;rg&lt;/code&gt; process in response to incoming requests.&lt;/p&gt;
&lt;p&gt;I don't want expensive searches to tie up the server, so I implemented two limits here. The first is a time limit: by default, searches have one second to run, after which the &lt;code&gt;rg&lt;/code&gt; process will be terminated and only the results received so far will be returned. This is achieved using the &lt;a href="https://docs.python.org/3/library/asyncio-task.html#asyncio.wait_for"&gt;asyncio.wait_for()&lt;/a&gt; function.&lt;/p&gt;
&lt;p&gt;I also implemented a limit on the number of matching lines that can be returned, defaulting to 2,000. Any more than that and the process is terminated early.&lt;/p&gt;
&lt;p&gt;Both of these limits can be customized using plugin settings (documented in &lt;a href="https://github.com/simonw/datasette-ripgrep/blob/main/README.md"&gt;the README&lt;/a&gt;). You can see how they are implemented in the &lt;a href="https://github.com/simonw/datasette-ripgrep/blob/0.2/datasette_ripgrep/__init__.py#L9-L55"&gt;async def run_ripgrep(pattern, path, time_limit=1.0, max_lines=2000)&lt;/a&gt; function.&lt;/p&gt;
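&lt;p&gt;The general pattern looks something like this - a simplified sketch of the same idea, not the actual plugin code, shown here running &lt;code&gt;echo&lt;/code&gt; instead of &lt;code&gt;rg&lt;/code&gt;:&lt;/p&gt;

```python
import asyncio

async def run_with_limit(args, time_limit=1.0, max_lines=2000):
    # Start the subprocess and collect output lines, stopping early if
    # either the time limit or the line limit is hit
    proc = await asyncio.create_subprocess_exec(*args, stdout=asyncio.subprocess.PIPE)
    lines = []

    async def read():
        async for line in proc.stdout:
            lines.append(line.decode())
            if len(lines) >= max_lines:
                break

    try:
        await asyncio.wait_for(read(), timeout=time_limit)
    except asyncio.TimeoutError:
        pass  # keep whatever output arrived before the deadline
    if proc.returncode is None:
        try:
            proc.terminate()
        except ProcessLookupError:
            pass  # process already exited
    await proc.wait()
    return lines

lines = asyncio.run(run_with_limit(["echo", "hello"]))
print(lines)  # ['hello\n']
```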
&lt;h4&gt;Highlighted linkable line numbers&lt;/h4&gt;
&lt;p&gt;The other fun implementation detail is the way the source code listings are displayed. I'm using CSS to display the line numbers in a way that makes them visible without them breaking copy-and-paste (inspired by &lt;a href="https://www.sylvaindurand.org/using-css-to-add-line-numbering/"&gt;this article by Sylvain Durand&lt;/a&gt;).&lt;/p&gt;
&lt;div class="highlight highlight-source-css"&gt;&lt;pre&gt;&lt;span class="pl-ent"&gt;code&lt;/span&gt;:&lt;span class="pl-c1"&gt;before&lt;/span&gt; {
    &lt;span class="pl-c1"&gt;content&lt;/span&gt;: &lt;span class="pl-en"&gt;attr&lt;/span&gt;(data-line);
    &lt;span class="pl-c1"&gt;display&lt;/span&gt;: inline-block;
    &lt;span class="pl-c1"&gt;width&lt;/span&gt;: &lt;span class="pl-c1"&gt;3.5&lt;span class="pl-smi"&gt;ch&lt;/span&gt;&lt;/span&gt;;
    &lt;span class="pl-c1"&gt;-webkit-user-select&lt;/span&gt;: none;
    &lt;span class="pl-c1"&gt;color&lt;/span&gt;: &lt;span class="pl-pds"&gt;&lt;span class="pl-kos"&gt;#&lt;/span&gt;666&lt;/span&gt;;
}&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The HTML looks like this:&lt;/p&gt;
&lt;div class="highlight highlight-text-html-basic"&gt;&lt;pre&gt;&lt;span class="pl-kos"&gt;&amp;lt;&lt;/span&gt;&lt;span class="pl-ent"&gt;pre&lt;/span&gt;&lt;span class="pl-kos"&gt;&amp;gt;&lt;/span&gt;&lt;span class="pl-kos"&gt;&amp;lt;&lt;/span&gt;&lt;span class="pl-ent"&gt;code&lt;/span&gt; &lt;span class="pl-c1"&gt;id&lt;/span&gt;="&lt;span class="pl-s"&gt;L1&lt;/span&gt;" &lt;span class="pl-c1"&gt;data-line&lt;/span&gt;="&lt;span class="pl-s"&gt;1&lt;/span&gt;"&lt;span class="pl-kos"&gt;&amp;gt;&lt;/span&gt;from setuptools import setup&lt;span class="pl-kos"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="pl-ent"&gt;code&lt;/span&gt;&lt;span class="pl-kos"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="pl-kos"&gt;&amp;lt;&lt;/span&gt;&lt;span class="pl-ent"&gt;code&lt;/span&gt; &lt;span class="pl-c1"&gt;id&lt;/span&gt;="&lt;span class="pl-s"&gt;L2&lt;/span&gt;" &lt;span class="pl-c1"&gt;data-line&lt;/span&gt;="&lt;span class="pl-s"&gt;2&lt;/span&gt;"&lt;span class="pl-kos"&gt;&amp;gt;&lt;/span&gt;import os&lt;span class="pl-kos"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="pl-ent"&gt;code&lt;/span&gt;&lt;span class="pl-kos"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="pl-kos"&gt;&amp;lt;&lt;/span&gt;&lt;span class="pl-ent"&gt;code&lt;/span&gt; &lt;span class="pl-c1"&gt;id&lt;/span&gt;="&lt;span class="pl-s"&gt;L3&lt;/span&gt;" &lt;span class="pl-c1"&gt;data-line&lt;/span&gt;="&lt;span class="pl-s"&gt;3&lt;/span&gt;"&lt;span class="pl-kos"&gt;&amp;gt;&lt;/span&gt;&amp;amp;nbsp;&lt;span class="pl-kos"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="pl-ent"&gt;code&lt;/span&gt;&lt;span class="pl-kos"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="pl-kos"&gt;&amp;lt;&lt;/span&gt;&lt;span class="pl-ent"&gt;code&lt;/span&gt; &lt;span class="pl-c1"&gt;id&lt;/span&gt;="&lt;span class="pl-s"&gt;L4&lt;/span&gt;" &lt;span class="pl-c1"&gt;data-line&lt;/span&gt;="&lt;span class="pl-s"&gt;4&lt;/span&gt;"&lt;span class="pl-kos"&gt;&amp;gt;&lt;/span&gt;VERSION = &amp;amp;#34;0.1&amp;amp;#34;&lt;span class="pl-kos"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="pl-ent"&gt;code&lt;/span&gt;&lt;span class="pl-kos"&gt;&amp;gt;&lt;/span&gt;
...&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;I wanted to imitate GitHub's handling of line links, where adding &lt;code&gt;#L23&lt;/code&gt; to the URL both jumps to that line and causes the line to be highlighted. Here's &lt;a href="https://ripgrep.datasette.io/-/ripgrep/view/datasette-allow-permissions-debug/setup.py#L23"&gt;a demo of that&lt;/a&gt; - I use the following JavaScript to update the contents of a &lt;code&gt;&amp;lt;style id="highlightStyle"&amp;gt;&amp;lt;/style&amp;gt;&lt;/code&gt; element in the document head any time the URL fragment changes:&lt;/p&gt;
&lt;div class="highlight highlight-source-js"&gt;&lt;pre&gt;&lt;span class="pl-c1"&gt;&amp;lt;&lt;/span&gt;&lt;span class="pl-ent"&gt;script&lt;/span&gt;&lt;span class="pl-c1"&gt;&amp;gt;&lt;/span&gt;
var highlightStyle = document.getElementById('highlightStyle');
function highlightLineFromFragment() &lt;span class="pl-kos"&gt;{&lt;/span&gt;
    &lt;span class="pl-en"&gt;if&lt;/span&gt; &lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-pds"&gt;/&lt;span class="pl-cce"&gt;^&lt;/span&gt;#L&lt;span class="pl-cce"&gt;\d&lt;/span&gt;&lt;span class="pl-c1"&gt;+&lt;/span&gt;&lt;span class="pl-cce"&gt;$&lt;/span&gt;/&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;exec&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s1"&gt;location&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;hash&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;&lt;/span&gt; &lt;span class="pl-kos"&gt;{&lt;/span&gt;
        &lt;span class="pl-s1"&gt;highlightStyle&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;innerText&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s"&gt;`&lt;span class="pl-s1"&gt;&lt;span class="pl-kos"&gt;${&lt;/span&gt;&lt;span class="pl-s1"&gt;location&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;hash&lt;/span&gt;&lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;/span&gt; { background-color: yellow; }`&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
    &lt;span class="pl-kos"&gt;}&lt;/span&gt;
&lt;span class="pl-kos"&gt;}&lt;/span&gt;
&lt;span class="pl-en"&gt;highlightLineFromFragment&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
&lt;span class="pl-smi"&gt;window&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;addEventListener&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s"&gt;"hashchange"&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-s1"&gt;highlightLineFromFragment&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
&lt;span class="pl-c1"&gt;&amp;lt;&lt;/span&gt;/&lt;span class="pl-ent"&gt;script&lt;/span&gt;&lt;span class="pl-c1"&gt;&amp;gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;It's the simplest way I could think of to achieve this effect.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Update 28th November 2020&lt;/strong&gt;: Louis Lévêque on Twitter suggested using the CSS &lt;a href="https://developer.mozilla.org/en-US/docs/Web/CSS/:target"&gt;:target selector&lt;/a&gt; instead, which is indeed MUCH simpler - I deleted the above JavaScript and replaced it with this CSS:&lt;/p&gt;
&lt;div class="highlight highlight-source-css"&gt;&lt;pre&gt;:&lt;span class="pl-c1"&gt;target&lt;/span&gt; {
    &lt;span class="pl-c1"&gt;background-color&lt;/span&gt;: &lt;span class="pl-pds"&gt;&lt;span class="pl-kos"&gt;#&lt;/span&gt;FFFF99&lt;/span&gt;;
}&lt;/pre&gt;&lt;/div&gt;
&lt;h4&gt;Next steps for this project&lt;/h4&gt;
&lt;p&gt;I'm pleased to have got &lt;a href="https://github.com/simonw/datasette-ripgrep"&gt;datasette-ripgrep&lt;/a&gt; to a workable state, and I'm looking forward to using it to answer questions about the growing Datasette ecosystem. I don't know how much more time I'll invest in this - if it proves useful then I may well expand it.&lt;/p&gt;
&lt;p&gt;I do think there's something really interesting about being able to spin up this kind of code search engine on demand using &lt;code&gt;datasette publish&lt;/code&gt;. It feels like a very useful trick to have access to.&lt;/p&gt;
&lt;h4&gt;Better URLs for my TILs&lt;/h4&gt;
&lt;p&gt;My other project this week was an upgrade to &lt;a href="https://til.simonwillison.net/"&gt;til.simonwillison.net&lt;/a&gt;: I finally spent the time to &lt;a href="https://github.com/simonw/til/issues/34"&gt;design nicer URLs&lt;/a&gt; for the site.&lt;/p&gt;
&lt;p&gt;Before:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;til.simonwillison.net/til/til/javascript_manipulating-query-params.md&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;After:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;til.simonwillison.net/javascript/manipulating-query-params&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;The implementation for this takes advantage of a feature I sneaked into Datasette 0.49: &lt;a href="https://simonwillison.net/2020/Sep/15/datasette-0-49#path-parameters-custom-page-templates"&gt;Path parameters for custom page templates&lt;/a&gt;. I can create a template file called &lt;code&gt;pages/{topic}/{slug}.html&lt;/code&gt; and Datasette will use that template to handle 404 errors that match that pattern.&lt;/p&gt;
&lt;p&gt;Here's &lt;a href="https://github.com/simonw/til/blob/main/templates/pages/%7Btopic%7D/%7Bslug%7D.html"&gt;the new pages/{topic}/{slug}.html&lt;/a&gt; template for my TIL site. It uses the &lt;code&gt;sql()&lt;/code&gt; template function from the &lt;a href="https://github.com/simonw/datasette-template-sql"&gt;datasette-template-sql&lt;/a&gt; plugin to retrieve and render the matching TIL, or raises a 404 if no TIL can be found.&lt;/p&gt;
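&lt;p&gt;The overall shape of that template is worth showing. What follows is a simplified sketch of the pattern rather than the file verbatim - the query and column names are invented for illustration - using the &lt;code&gt;sql()&lt;/code&gt; function from datasette-template-sql plus Datasette's &lt;code&gt;raise_404()&lt;/code&gt; template function:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;{% set tils = sql(
    "select title, html from til where topic = ? and slug = ?",
    [topic, slug]
) %}
{% if not tils %}{{ raise_404("TIL not found") }}{% endif %}
&amp;lt;h1&amp;gt;{{ tils[0]["title"] }}&amp;lt;/h1&amp;gt;
{{ tils[0]["html"]|safe }}&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;topic&lt;/code&gt; and &lt;code&gt;slug&lt;/code&gt; variables are populated automatically from the matching parts of the URL path.&lt;/p&gt;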
&lt;p&gt;I also needed to set up redirects from the old pages to the new ones. I wrote a &lt;a href="https://til.simonwillison.net/til/til/datasette_redirects-for-datasette.md"&gt;TIL on redirects for Datasette&lt;/a&gt; explaining how I did that.&lt;/p&gt;
&lt;h4&gt;TIL this week&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/til/til/datasette_redirects-for-datasette.md"&gt;Redirects for Datasette&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;Releases this week&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/simonw/datasette-ripgrep/releases/tag/0.2"&gt;datasette-ripgrep 0.2&lt;/a&gt; - 2020-11-27&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/simonw/datasette-ripgrep/releases/tag/0.1"&gt;datasette-ripgrep 0.1&lt;/a&gt; - 2020-11-26&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/simonw/datasette-atom/releases/tag/0.8.1"&gt;datasette-atom 0.8.1&lt;/a&gt; - 2020-11-25&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/simonw/datasette-ripgrep/releases/tag/0.1a1"&gt;datasette-ripgrep 0.1a1&lt;/a&gt; - 2020-11-25&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/simonw/datasette-ripgrep/releases/tag/0.1a0"&gt;datasette-ripgrep 0.1a0&lt;/a&gt; - 2020-11-25&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/simonw/datasette-graphql/releases/tag/1.2.1"&gt;datasette-graphql 1.2.1&lt;/a&gt; - 2020-11-24&lt;/li&gt;
&lt;/ul&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/async"&gt;async&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/css"&gt;css&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/weeknotes"&gt;weeknotes&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/cloudrun"&gt;cloudrun&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ripgrep"&gt;ripgrep&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/baked-data"&gt;baked-data&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="async"/><category term="css"/><category term="projects"/><category term="python"/><category term="datasette"/><category term="weeknotes"/><category term="cloudrun"/><category term="ripgrep"/><category term="baked-data"/></entry><entry><title>The unofficial Google Cloud Run FAQ</title><link href="https://simonwillison.net/2020/Jul/22/unofficial-google-cloud-run-faq/#atom-tag" rel="alternate"/><published>2020-07-22T17:20:20+00:00</published><updated>2020-07-22T17:20:20+00:00</updated><id>https://simonwillison.net/2020/Jul/22/unofficial-google-cloud-run-faq/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/ahmetb/cloud-run-faq"&gt;The unofficial Google Cloud Run FAQ&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
This is really useful: a no-fluff, content-rich explanation of Google Cloud Run hosted as a GitHub repo that actively accepts pull requests from the community. It’s maintained by Ahmet Alp Balkan, a Cloud Run engineer who states “Googlers: If you find this repo useful, you should recognize the work internally, as I actively fight for alternative forms of content like this”. One of the hardest parts of working with AWS and GCP is digging through the marketing materials to figure out what the product actually does, so the more alternative forms of documentation like this the better.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/documentation"&gt;documentation&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/google"&gt;google&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/cloudrun"&gt;cloudrun&lt;/a&gt;&lt;/p&gt;



</summary><category term="documentation"/><category term="google"/><category term="cloudrun"/></entry><entry><title>Advice on specifying more granular permissions with Google Cloud IAM</title><link href="https://simonwillison.net/2020/May/28/advice-specifying-more-granular-permissions-google-cloud-iam/#atom-tag" rel="alternate"/><published>2020-05-28T22:44:24+00:00</published><updated>2020-05-28T22:44:24+00:00</updated><id>https://simonwillison.net/2020/May/28/advice-specifying-more-granular-permissions-google-cloud-iam/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://dev.to/googlecloud/advice-on-specifying-more-granular-permissions-with-google-cloud-iam-4b70"&gt;Advice on specifying more granular permissions with Google Cloud IAM&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
My single biggest frustration working with both Google Cloud and AWS is permissions: more specifically, figuring out what the smallest set of permissions are that I need to assign in order to achieve different goals. Katie McLaughlin’s new series aims to address exactly that problem. I learned a ton from this that I’ve previously missed, and there’s plenty of actionable advice on tooling that can be used to help figure this stuff out.

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/glasnt/status/1266123072580358144"&gt;Katie McLaughlin&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/permissions"&gt;permissions&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/cloudrun"&gt;cloudrun&lt;/a&gt;&lt;/p&gt;



</summary><category term="permissions"/><category term="cloudrun"/></entry><entry><title>html-to-svg</title><link href="https://simonwillison.net/2020/May/7/html-svg/#atom-tag" rel="alternate"/><published>2020-05-07T06:01:44+00:00</published><updated>2020-05-07T06:01:44+00:00</updated><id>https://simonwillison.net/2020/May/7/html-svg/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/as-a-service/html-to-svg"&gt;html-to-svg&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
This is absolutely ingenious: 50 lines of JavaScript which uses Puppeteer to get headless Chrome to grab a PDF screenshot of a page, then shells out to Inkscape to convert the PDF to SVG. Wraps the whole thing up in a Docker container and ships it to Cloud Run as a web service you can call by passing it a URL.

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/steren/status/1258273345843290118"&gt;@steren&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/chrome"&gt;chrome&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/svg"&gt;svg&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/cloudrun"&gt;cloudrun&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/puppeteer"&gt;puppeteer&lt;/a&gt;&lt;/p&gt;



</summary><category term="chrome"/><category term="svg"/><category term="cloudrun"/><category term="puppeteer"/></entry><entry><title>Tracking FARA by deploying a data API using GitHub Actions and Cloud Run</title><link href="https://simonwillison.net/2020/Jan/21/github-actions-cloud-run/#atom-tag" rel="alternate"/><published>2020-01-21T07:51:11+00:00</published><updated>2020-01-21T07:51:11+00:00</updated><id>https://simonwillison.net/2020/Jan/21/github-actions-cloud-run/#atom-tag</id><summary type="html">
    &lt;p&gt;I'm using the combination of GitHub Actions and Google Cloud Run to retrieve data from the U.S. Department of Justice FARA website and deploy it as a queryable API using Datasette.&lt;/p&gt;

&lt;h3&gt;FARA background&lt;/h3&gt;

&lt;p&gt;The &lt;a href="https://www.justice.gov/nsd-fara"&gt;Foreign Agents Registration Act (FARA)&lt;/a&gt; is a law that requires "certain agents of foreign principals who are engaged in political activities or other activities specified under the statute to make periodic public disclosure of their relationship with the foreign principal, as well as activities, receipts and disbursements in support of those activities".&lt;/p&gt;

&lt;p&gt;The law was introduced in 1938 in response to the large number of German propaganda agents that were operating in the U.S. prior to the war.&lt;/p&gt;

&lt;p&gt;Basically, if you are in the United States as a lobbyist for a foreign government you need to register under FARA. It was used in 23 criminal cases during World War II, but hasn't had much use since it was amended in 1966. Although... if you consult the &lt;a href="https://www.justice.gov/nsd-fara/recent-cases"&gt;list of recent cases&lt;/a&gt; you'll see some very interesting recent activity involving Russia and Ukraine.&lt;/p&gt;

&lt;p&gt;It's also for spies! Quoting &lt;a href="https://www.justice.gov/nsd-fara/general-fara-frequently-asked-questions"&gt;the FARA FAQ&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;&lt;p&gt;Finally, 50 U.S.C. § 851, requires registration of persons who have knowledge of or have received instruction or assignment in espionage, counterespionage or sabotage service or tactics of a foreign country or political party.&lt;/p&gt;&lt;/blockquote&gt;

&lt;p&gt;I imagine most spies operate in violation of this particular law and don't take steps to register themselves.&lt;/p&gt;

&lt;p&gt;It's all still pretty fascinating though, in part because it gets updated. A lot. Almost every business day in fact.&lt;/p&gt;

&lt;h3&gt;Tracking FARA history&lt;/h3&gt;

&lt;p&gt;I know this because seven months ago I set up a scraper for it. Every twelve hours I have code which downloads the &lt;a href="https://efile.fara.gov/ords/f?p=API:BULKDATA"&gt;four bulk CSVs&lt;/a&gt; published by the Justice department and saves them to &lt;a href="https://github.com/simonw/fara-history"&gt;a git repository&lt;/a&gt;. It's the same trick I've been using &lt;a href="https://simonwillison.net/2019/Mar/13/tree-history/"&gt;to track San Francisco's database of trees&lt;/a&gt; and &lt;a href="https://simonwillison.net/2019/Oct/10/pge-outages/"&gt;PG&amp;amp;E's outage map&lt;/a&gt;.&lt;/p&gt;
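&lt;p&gt;The trick itself is tiny: fetch the file, then only write it (and commit) when the contents have actually changed, so the git history becomes a change log. A minimal sketch of that idea in Python - the function name and the use of a content hash are my own illustration, not the actual scraper code:&lt;/p&gt;

```python
import hashlib
from pathlib import Path


def update_if_changed(path, new_content):
    """Write new_content to path only when it differs from what is
    already on disk. Returns True if a change was written, so the
    caller knows whether a git commit is needed."""
    target = Path(path)
    new_digest = hashlib.sha256(new_content.encode("utf-8")).hexdigest()
    if target.exists():
        old_digest = hashlib.sha256(target.read_bytes()).hexdigest()
        if old_digest == new_digest:
            return False  # identical content, nothing to commit
    target.write_text(new_content)
    return True
```

&lt;p&gt;In the real repository the files are the sorted bulk CSVs, and the commit itself is performed with ordinary git commands inside the workflow.&lt;/p&gt;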

&lt;p&gt;I've been running the scraper using Circle CI, but this weekend I decided to switch it over to &lt;a href="https://github.com/features/actions"&gt;GitHub Actions&lt;/a&gt; to get a better idea for how they work.&lt;/p&gt;

&lt;h3&gt;Deploying it as an API&lt;/h3&gt;

&lt;p&gt;I also wanted to upgrade my script to also deploy a fresh &lt;a href="https://datasette.readthedocs.io/"&gt;Datasette&lt;/a&gt; instance of the data using &lt;a href="https://cloud.google.com/run/"&gt;Google Cloud Run&lt;/a&gt;. I wrote &lt;a href="https://github.com/simonw/fara-datasette"&gt;a script&lt;/a&gt; to do this on a manual basis last year, but I never combined it with the daily scraper. Combining the two means I can offer a Datasette-powered API directly against the latest data.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://fara.datasettes.com/"&gt;https://fara.datasettes.com&lt;/a&gt; is that API - it now updates twice a day, assuming there are some changes to the underlying data.&lt;/p&gt;

&lt;h3&gt;Putting it all together&lt;/h3&gt;

&lt;p&gt;The final GitHub action workflow can be &lt;a href="https://github.com/simonw/fara-history/blob/7e33f2fc4619247e77d9b3b725ace6584228b601/.github/workflows/scheduled.yml"&gt;seen here&lt;/a&gt;. I'm going to present an annotated version here.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;on:
  repository_dispatch:
  schedule:
    - cron:  '0 0,12 * * *'&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This sets when the workflow should be triggered. I'm running it twice a day - at midnight and noon UTC (the 0,12 cron syntax).&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;repository_dispatch&lt;/code&gt; key means I can also &lt;a href="https://help.github.com/en/actions/automating-your-workflow-with-github-actions/events-that-trigger-workflows#external-events-repository_dispatch"&gt;trigger it manually&lt;/a&gt; by running the following &lt;code&gt;curl&lt;/code&gt; command - useful for testing:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;curl -XPOST https://api.github.com/repos/simonw/fara-history/dispatches \
    -H 'Authorization: token MY_PERSONAL_TOKEN_HERE' \
    -d '{"event_type": "trigger_action"}' \
    -H 'Accept: application/vnd.github.everest-preview+json'&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Next comes the job itself, which I called &lt;code&gt;scheduled&lt;/code&gt; and set to run on the latest Ubuntu:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;jobs:
  scheduled:
    runs-on: ubuntu-latest
    steps:&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Next comes the steps. Each step is run in turn, in an isolated process (presumably a container) but with access to the current working directory.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;- uses: actions/checkout@v2
  name: Check out repo
- name: Set up Python
  uses: actions/setup-python@v1
  with:
    python-version: 3.8&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The first two steps check out the &lt;code&gt;fara-history&lt;/code&gt; repository and install Python 3.8.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;- uses: actions/cache@v1
  name: Configure pip caching
  with:
    path: ~/.cache/pip
    key: ${{ runner.os }}-pip-${{ hashFiles('**/requirements.txt') }}
    restore-keys: |
      ${{ runner.os }}-pip-&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This step &lt;em&gt;should&lt;/em&gt; set up a cache so that &lt;code&gt;pip&lt;/code&gt; doesn't have to download fresh dependencies on every run. Unfortunately it doesn't seem to actually work - it only works for &lt;code&gt;push&lt;/code&gt; and &lt;code&gt;pull_request&lt;/code&gt; events, but my workflow is triggered by &lt;code&gt;schedule&lt;/code&gt; and &lt;code&gt;repository_dispatch&lt;/code&gt;. There's &lt;a href="https://github.com/actions/cache/issues/63"&gt;an open issue about this&lt;/a&gt;.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;- name: Install Python dependencies
  run: |
    python -m pip install --upgrade pip
    pip install -r requirements.txt&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This step installs my dependencies &lt;a href="https://github.com/simonw/fara-history/blob/7e33f2fc4619247e77d9b3b725ace6584228b601/requirements.txt"&gt;from requirements.txt&lt;/a&gt;.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;- name: Fetch, update and commit FARA data
  run: . update_and_commit_all.sh
- name: Build fara.db database
  run: python build_database.py&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Now we're getting to the fun stuff. My &lt;a href="https://github.com/simonw/fara-history/blob/7e33f2fc4619247e77d9b3b725ace6584228b601/update_and_commit_all.sh"&gt;update_and_commit_all.sh&lt;/a&gt; script downloads the four zip files &lt;a href="https://efile.fara.gov/ords/f?p=API:BULKDATA"&gt;from the FARA.gov site&lt;/a&gt;, unzips them, sorts them, diffs them against the previously stored files and commits the new copy to GitHub if they have changed. See &lt;a href="https://simonwillison.net/2019/Mar/13/tree-history/#csvdiff_18"&gt;my explanation of csv-diff&lt;/a&gt; for more on this - though sadly only one of the files has a reliable row identifier so I can't generate great commit messages for most of them.&lt;/p&gt;

&lt;p&gt;My &lt;a href="https://github.com/simonw/fara-history/blob/7e33f2fc4619247e77d9b3b725ace6584228b601/build_database.py"&gt;build_database.py&lt;/a&gt; script uses &lt;a href="https://sqlite-utils.readthedocs.io/"&gt;sqlite-utils&lt;/a&gt; to convert the CSV files into a SQLite database.&lt;/p&gt;
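&lt;p&gt;The shape of that script is easy to sketch. Here's a hedged, dependency-free version of the same idea using only the Python standard library instead of sqlite-utils (the function and table names are my own, for illustration):&lt;/p&gt;

```python
import csv
import sqlite3


def load_csv_into_sqlite(csv_path, db_path, table):
    """Create (or replace) a SQLite table from a CSV file, with one
    TEXT column per CSV header, then bulk-insert every row."""
    with open(csv_path, newline="") as f:
        reader = csv.reader(f)
        headers = next(reader)
        rows = list(reader)
    columns = ", ".join('"%s" TEXT' % h for h in headers)
    placeholders = ", ".join("?" for _ in headers)
    conn = sqlite3.connect(db_path)
    conn.execute('DROP TABLE IF EXISTS "%s"' % table)
    conn.execute('CREATE TABLE "%s" (%s)' % (table, columns))
    conn.executemany(
        'INSERT INTO "%s" VALUES (%s)' % (table, placeholders), rows
    )
    conn.commit()
    conn.close()
```

&lt;p&gt;sqlite-utils does the same job with column type detection and a much nicer API, which is why the real &lt;code&gt;build_database.py&lt;/code&gt; uses it.&lt;/p&gt;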

&lt;p&gt;Now that we've got a SQLite database, we can &lt;a href="https://datasette.readthedocs.io/en/stable/publish.html#publishing-to-google-cloud-run"&gt;deploy it to Google Cloud Run&lt;/a&gt; using Datasette.&lt;/p&gt;

&lt;p&gt;But should we run a deploy at all? If the database hasn't changed, there's no point in deploying it. How can we tell if the database file has changed from the last one that was published?&lt;/p&gt;

&lt;p&gt;Datasette has a mechanism for deriving a content hash of a database, part of &lt;a href="https://datasette.readthedocs.io/en/stable/performance.html"&gt;a performance optimization&lt;/a&gt; which is no longer turned on by default and may be removed in the future.&lt;/p&gt;

&lt;p&gt;You can generate JSON that includes that hash using the &lt;code&gt;datasette inspect&lt;/code&gt; command. The &lt;code&gt;jq&lt;/code&gt; tool can then be used to extract out just the hash:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ datasette inspect fara.db | jq '.fara.hash' -r
fbc9cbaca6de1e232fc14494faa06cc8d4cb9f379d0d568e4711e9a218800906&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The &lt;code&gt;-r&lt;/code&gt; option to &lt;code&gt;jq&lt;/code&gt; causes it to return just the raw string, without quote marks.&lt;/p&gt;

&lt;p&gt;Datasette's &lt;code&gt;/-/databases.json&lt;/code&gt; introspection URL reveals the hashes of the currently deployed database. Here's how to pull the currently deployed hash:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ curl -s https://fara.datasettes.com/-/databases.json | jq '.[0].hash' -r
a6c0ab26589bde0d225c5a45044e0adbfa3840b95fbb263d01fd8fb0d2460ed5&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;If those two hashes differ then we should deploy the new database.&lt;/p&gt;
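&lt;p&gt;The decision itself is just two JSON lookups and a comparison. A sketch in Python, operating on already-fetched response bodies (the helper name is mine; the key paths match the &lt;code&gt;jq&lt;/code&gt; expressions above):&lt;/p&gt;

```python
import json


def should_deploy(inspect_output, databases_json):
    """Compare the hash reported by `datasette inspect fara.db` with
    the hash of the currently deployed database from /-/databases.json.
    Returns True when they differ, i.e. a fresh deploy is needed."""
    latest = json.loads(inspect_output)["fara"]["hash"]
    deployed = json.loads(databases_json)[0]["hash"]
    return latest != deployed
```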

&lt;p&gt;GitHub Actions have a &lt;a href="https://help.github.com/en/actions/automating-your-workflow-with-github-actions/development-tools-for-github-actions#set-an-output-parameter-set-output"&gt;slightly bizarre mechanism&lt;/a&gt; for defining "output variables" for steps, which can then be used to conditionally run further steps.&lt;/p&gt;

&lt;p&gt;Here's the step that sets those variables, followed by the step that conditionally installs the Google Cloud CLI tools using &lt;a href="https://github.com/GoogleCloudPlatform/github-actions/tree/master/setup-gcloud"&gt;their official action&lt;/a&gt;:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;- name: Set variables to decide if we should deploy
  id: decide_variables
  run: |-
    echo "##[set-output name=latest;]$(datasette inspect fara.db | jq '.fara.hash' -r)"
    echo "##[set-output name=deployed;]$(curl -s https://fara.datasettes.com/-/databases.json | jq '.[0].hash' -r)"
- name: Set up Cloud Run
  if: steps.decide_variables.outputs.latest != steps.decide_variables.outputs.deployed
  uses: GoogleCloudPlatform/github-actions/setup-gcloud@master
  with:
    version: '275.0.0'
    service_account_email: ${{ secrets.GCP_SA_EMAIL }}
    service_account_key: ${{ secrets.GCP_SA_KEY }}&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Having installed the Google Cloud tools, I can deploy my database using Datasette:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;- name: Deploy to Cloud Run
  if: steps.decide_variables.outputs.latest != steps.decide_variables.outputs.deployed
  run: |-
    gcloud components install beta
    gcloud config set run/region us-central1
    gcloud config set project datasette-222320
    datasette publish cloudrun fara.db --service fara-history -m metadata.json&lt;/code&gt;&lt;/pre&gt;

&lt;p id="google-cloud-service-key"&gt;This was by far the hardest part to figure out.&lt;/p&gt;

&lt;p&gt;First, I needed to create a Google Cloud &lt;a href="https://cloud.google.com/iam/docs/service-accounts"&gt;service account&lt;/a&gt; with an accompanying service key.&lt;/p&gt;

&lt;p&gt;I tried and failed to do this using the CLI, so I switched to their web console following &lt;a href="https://cloud.google.com/iam/docs/creating-managing-service-accounts"&gt;these&lt;/a&gt; and then &lt;a href="https://cloud.google.com/iam/docs/creating-managing-service-account-keys"&gt;these&lt;/a&gt; instructions.&lt;/p&gt;

&lt;p&gt;Having downloaded the key JSON file, I converted it to base64 and pasted it into a GitHub Actions secret (hidden away in the repository settings area) called &lt;code&gt;GCP_SA_KEY&lt;/code&gt;.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;cat ~/Downloads/datasette-222320-2ad02afe6d82.json \
    | base64 | pbcopy&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The service account needed permissions in order to run a build through Cloud Build and then deploy the result through Cloud Run. I spent a bunch of time trying out different combinations and eventually gave up and gave the account "Editor" permissions across my entire project. This is bad. I am hoping someone can help me understand what the correct narrow set of permissions are, and how to apply them.&lt;/p&gt;

&lt;p&gt;It also took me a while to figure out that I needed to run these three commands before I could deploy to my project. The first installs the Cloud Run beta components; the other two set required configuration:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;gcloud components install beta
gcloud config set run/region us-central1
gcloud config set project datasette-222320&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;But... having done all of the above, the following command run from an action successfully deploys the site!&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;datasette publish cloudrun fara.db \
    --service fara-history -m metadata.json&lt;/code&gt;&lt;/pre&gt;

&lt;h3&gt;DNS&lt;/h3&gt;

&lt;p&gt;Google Cloud Run deployments come with extremely ugly default URLs. For this project, that URL is &lt;code&gt;https://fara-history-j7hipcg4aq-uc.a.run.app/&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;I wanted something nicer. I own &lt;code&gt;datasettes.com&lt;/code&gt; and manage the DNS via Cloudflare, which means I can point subdomains at Cloud Run instances.&lt;/p&gt;

&lt;p&gt;This is a two-step process:&lt;/p&gt;

&lt;ol&gt;&lt;li&gt;I set &lt;code&gt;fara.datasettes.com&lt;/code&gt; as a DNS-only (no proxying) CNAME for &lt;code&gt;ghs.googlehosted.com&lt;/code&gt;.&lt;/li&gt;&lt;li&gt;In the Google Cloud Console I used Cloud Run -&amp;gt; Manage Custom Domains (a button in the header) -&amp;gt; Add Mapping to specify that &lt;code&gt;fara.datasettes.com&lt;/code&gt; should map to my &lt;code&gt;fara-history&lt;/code&gt; service (the &lt;code&gt;--service&lt;/code&gt; argument from &lt;code&gt;datasette publish&lt;/code&gt; earlier).&lt;/li&gt;&lt;/ol&gt;

&lt;p&gt;I had previously &lt;a href="https://support.google.com/webmasters/answer/9008080?hl=en"&gt;verified my domain ownership&lt;/a&gt; - I forget quite how I did it. Domains purchased through &lt;a href="https://domains.google/"&gt;Google Domains&lt;/a&gt; get to skip this step.&lt;/p&gt;

&lt;h3&gt;Next steps&lt;/h3&gt;

&lt;p&gt;This was a lot of fiddling around. I'm hoping that by writing this up in detail I'll be able to get this working much faster next time.&lt;/p&gt;

&lt;p&gt;I think this model - GitHub Actions that pull data, build a database and deploy to Cloud Run using &lt;code&gt;datasette publish&lt;/code&gt; - is incredibly promising. The end result should be an API that costs cents-to-dollars a month to operate thanks to Cloud Run's scale-to-zero architecture. And hopefully by publishing this all on GitHub it will be as easy as possible for other people to duplicate it for their own projects.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/continuous-deployment"&gt;continuous-deployment&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/continuous-integration"&gt;continuous-integration&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/data-journalism"&gt;data-journalism&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/github"&gt;github&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/cloudrun"&gt;cloudrun&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/github-actions"&gt;github-actions&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git-scraping"&gt;git-scraping&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="continuous-deployment"/><category term="continuous-integration"/><category term="data-journalism"/><category term="github"/><category term="projects"/><category term="datasette"/><category term="cloudrun"/><category term="github-actions"/><category term="git-scraping"/></entry><entry><title>Cloud Run Button: Click-to-deploy your git repos to Google Cloud</title><link href="https://simonwillison.net/2019/Nov/4/cloud-run-button/#atom-tag" rel="alternate"/><published>2019-11-04T04:57:33+00:00</published><updated>2019-11-04T04:57:33+00:00</updated><id>https://simonwillison.net/2019/Nov/4/cloud-run-button/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://cloud.google.com/blog/products/serverless/introducing-cloud-run-button-click-to-deploy-your-git-repos-to-google-cloud"&gt;Cloud Run Button: Click-to-deploy your git repos to Google Cloud&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Google Cloud Run now has its own version of the Heroku deploy button: you can add a button to a GitHub repository which, when clicked, will provide an interface for deploying your repo to the user’s own Google Cloud account using Cloud Run.

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/bitworking/status/1190823729925906432"&gt;Joe Gregorio&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/github"&gt;github&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/google"&gt;google&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/cloudrun"&gt;cloudrun&lt;/a&gt;&lt;/p&gt;



</summary><category term="github"/><category term="google"/><category term="cloudrun"/></entry><entry><title>Datasette 0.28 - and why master should always be releasable</title><link href="https://simonwillison.net/2019/May/19/datasette-0-28/#atom-tag" rel="alternate"/><published>2019-05-19T22:15:56+00:00</published><updated>2019-05-19T22:15:56+00:00</updated><id>https://simonwillison.net/2019/May/19/datasette-0-28/#atom-tag</id><summary type="html">
    &lt;p&gt;It's been quite a while since the last substantial release of Datasette. &lt;a href="https://datasette.readthedocs.io/en/stable/changelog.html#v0-27"&gt;Datasette 0.27&lt;/a&gt; came out all the way back in January.&lt;/p&gt;

&lt;p&gt;This isn't because development has slowed down. In fact, the project has had &lt;a href="https://github.com/simonw/datasette/compare/0.27...0.28"&gt;131 commits&lt;/a&gt; since then, covering a bewildering array of new functionality and with some significant contributions from developers who aren't me - Russ Garrett and Romain Primet deserve special recognition here.&lt;/p&gt;

&lt;p&gt;The problem has been one of discipline. I'm a big fan of the idea of keeping master shippable at all times in my professional work, but I hadn't quite adopted this policy for my open-source side projects. A couple of months ago I found myself in a situation where I had two major refactorings (of faceting and of Datasette's treatment of immutable files) going on in master at the same time, and untangling them turned out to take way longer than I had expected.&lt;/p&gt;

&lt;p&gt;So I've updated Datasette's &lt;a href="https://datasette.readthedocs.io/en/stable/contributing.html#general-guidelines"&gt;contribution guidelines&lt;/a&gt; to specify that &lt;strong&gt;master should always be releasable&lt;/strong&gt;, almost entirely as a reminder to myself.&lt;/p&gt;

&lt;p&gt;All of that said, I'm finally back out of the weeds and I'm excited to announce today's release of &lt;a href="https://pypi.org/project/datasette/0.28/"&gt;Datasette 0.28&lt;/a&gt;. It features a &lt;a href="https://adamj.eu/tech/2019/01/18/a-salmagundi-of-django-alpha-announcements/"&gt;salmagundi&lt;/a&gt; of new features! I'm replicating &lt;a href="https://datasette.readthedocs.io/en/stable/changelog.html#v0-28"&gt;the release notes&lt;/a&gt; below.&lt;/p&gt;

&lt;h3&gt;Supporting databases that change&lt;/h3&gt;

&lt;p&gt;From the beginning of the project, Datasette has been designed with read-only databases in mind. If a database is guaranteed not to change it opens up all kinds of interesting opportunities - from taking advantage of SQLite immutable mode and HTTP caching to bundling static copies of the database directly in a Docker container. &lt;a href="https://simonwillison.net/2018/Oct/4/datasette-ideas/"&gt;The interesting ideas in Datasette&lt;/a&gt; explores this idea in detail.&lt;/p&gt;

&lt;p&gt;As my goals for the project have developed, I realized that read-only databases are no longer the right default. SQLite actually supports concurrent access very well, provided only one thread attempts to write to a database at a time, and I keep encountering sensible use-cases for running Datasette on top of a database that is processing inserts and updates.&lt;/p&gt;

&lt;p&gt;So, as of version 0.28 Datasette no longer assumes that a database file will not change. It is now safe to point Datasette at a SQLite database which is being updated by another process.&lt;/p&gt;
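&lt;p&gt;A rough sketch of the scenario, using two plain &lt;code&gt;sqlite3&lt;/code&gt; connections (this is not Datasette's own connection handling): a reader connection picks up rows committed by a separate writer connection after the fact.&lt;/p&gt;

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demo.db")

# One process (here: one connection) writes to the database...
writer = sqlite3.connect(path)
writer.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, name TEXT)")
writer.execute("INSERT INTO events (name) VALUES ('first')")
writer.commit()

# ...while a second, independent connection reads from the same file.
reader = sqlite3.connect(path)
assert reader.execute("SELECT count(*) FROM events").fetchone()[0] == 1

# Rows committed by the writer later are visible to subsequent reads.
writer.execute("INSERT INTO events (name) VALUES ('second')")
writer.commit()
count = reader.execute("SELECT count(*) FROM events").fetchone()[0]
print(count)  # 2
```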

&lt;p&gt;Making this change was a lot of work - see tracking tickets &lt;a href="https://github.com/simonw/datasette/issues/418"&gt;#418&lt;/a&gt;, &lt;a href="https://github.com/simonw/datasette/issues/419"&gt;#419&lt;/a&gt; and &lt;a href="https://github.com/simonw/datasette/issues/420"&gt;#420&lt;/a&gt;. It required new thinking around how Datasette should calculate table counts (an expensive operation against a large, changing database) and also meant reconsidering the “content hash” URLs Datasette has used in the past to optimize the performance of HTTP caches.&lt;/p&gt;

&lt;p&gt;Datasette can still run against immutable files and gains numerous performance benefits from doing so, but this is no longer the default behaviour. Take a look at the new &lt;a href="https://datasette.readthedocs.io/en/stable/performance.html#performance"&gt;Performance and caching&lt;/a&gt; documentation section for details on how to make the most of Datasette against data that you know will be staying read-only and immutable.&lt;/p&gt;

&lt;h3&gt;Faceting improvements, and faceting plugins&lt;/h3&gt;

&lt;p&gt;Datasette &lt;a href="https://datasette.readthedocs.io/en/stable/facets.html#facets"&gt;Facets&lt;/a&gt; provide an intuitive way to quickly summarize and interact with data. Previously the only supported faceting technique was column faceting, but 0.28 introduces two powerful new capabilities: facet-by-JSON-array and the ability to define further facet types using plugins.&lt;/p&gt;
&lt;p&gt;Facet by array (&lt;a href="https://github.com/simonw/datasette/issues/359"&gt;#359&lt;/a&gt;) is only available if your SQLite installation provides the &lt;code&gt;json1&lt;/code&gt; extension. Datasette will automatically detect columns that contain JSON arrays of values and offer a faceting interface against those columns - useful for modelling things like tags without needing to break them out into a new table. See &lt;a href="https://datasette.readthedocs.io/en/stable/facets.html#facet-by-json-array"&gt;Facet by JSON array&lt;/a&gt; for more.&lt;/p&gt;
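&lt;p&gt;The underlying SQL technique relies on the &lt;code&gt;json1&lt;/code&gt; extension's &lt;code&gt;json_each()&lt;/code&gt; table-valued function. A rough illustration of how array values can be counted for faceting (not Datasette's exact query):&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, tags TEXT)")
conn.executemany(
    "INSERT INTO posts (tags) VALUES (?)",
    [('["python", "sqlite"]',), ('["python"]',), ('["docker"]',)],
)
# json_each() expands each JSON array into rows, so distinct tags
# can be counted across the whole table - the core of facet-by-array
rows = conn.execute(
    """
    SELECT j.value AS tag, count(*) AS n
    FROM posts, json_each(posts.tags) AS j
    GROUP BY tag ORDER BY n DESC, tag
    """
).fetchall()
print(rows)  # [('python', 2), ('docker', 1), ('sqlite', 1)]
```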
&lt;p&gt;The new &lt;a href="https://datasette.readthedocs.io/en/stable/plugins.html#plugin-register-facet-classes"&gt;register_facet_classes()&lt;/a&gt; plugin hook (&lt;a href="https://github.com/simonw/datasette/pull/445"&gt;#445&lt;/a&gt;) can be used to register additional custom facet classes. Each facet class should provide two methods: &lt;code&gt;suggest()&lt;/code&gt; which suggests facet selections that might be appropriate for a provided SQL query, and &lt;code&gt;facet_results()&lt;/code&gt; which executes a facet operation and returns results. Datasette’s own faceting implementations have been refactored to use the same API as these plugins.&lt;/p&gt;
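&lt;p&gt;An illustrative sketch of the two-method contract in plain Python - the class and method names here mirror the shape described above, but this is &lt;em&gt;not&lt;/em&gt; Datasette's actual plugin API:&lt;/p&gt;

```python
import sqlite3

class DateFacet:
    """Hypothetical facet: suggests date-like columns, counts rows per date."""

    def __init__(self, conn, table):
        self.conn = conn
        self.table = table

    def suggest(self):
        # Propose columns whose values look like ISO dates (YYYY-...)
        cols = [r[1] for r in self.conn.execute(f"PRAGMA table_info({self.table})")]
        suggestions = []
        for col in cols:
            n = self.conn.execute(
                f"SELECT count(*) FROM {self.table} "
                f"WHERE {col} GLOB '[0-9][0-9][0-9][0-9]-*'"
            ).fetchone()[0]
            if n:
                suggestions.append(col)
        return suggestions

    def facet_results(self, column):
        # Execute the facet operation: count rows per distinct date
        return self.conn.execute(
            f"SELECT date({column}) AS d, count(*) FROM {self.table} "
            "GROUP BY d ORDER BY d"
        ).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE logs (ts TEXT)")
conn.executemany(
    "INSERT INTO logs VALUES (?)",
    [("2019-05-19",), ("2019-05-19",), ("2019-05-20",)],
)
facet = DateFacet(conn, "logs")
print(facet.suggest())            # ['ts']
print(facet.facet_results("ts"))  # [('2019-05-19', 2), ('2019-05-20', 1)]
```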

&lt;h3&gt;datasette publish cloudrun&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://cloud.google.com/run/"&gt;Google Cloud Run&lt;/a&gt; is a brand new serverless hosting platform from Google, which allows you to build a Docker container which will run only when HTTP traffic is received and will shut down (and hence cost you nothing) the rest of the time. It’s similar to Zeit’s Now v1 Docker hosting platform which sadly is &lt;a href="https://hyperion.alpha.spectrum.chat/zeit/now/cannot-create-now-v1-deployments~d206a0d4-5835-4af5-bb5c-a17f0171fb25?m=MTU0Njk2NzgwODM3OA=="&gt;no longer accepting signups&lt;/a&gt; from new users.&lt;/p&gt;

&lt;p&gt;The new &lt;code&gt;datasette publish cloudrun&lt;/code&gt; command was contributed by Romain Primet (&lt;a href="https://github.com/simonw/datasette/pull/434"&gt;#434&lt;/a&gt;) and publishes selected databases to a new Datasette instance running on Google Cloud Run.&lt;/p&gt;
&lt;p&gt;See &lt;a href="https://datasette.readthedocs.io/en/stable/publish.html#publish-cloud-run"&gt;Publishing to Google Cloud Run&lt;/a&gt; for full documentation.&lt;/p&gt;

&lt;h3&gt;register_output_renderer plugins&lt;/h3&gt;

&lt;p&gt;Russ Garrett implemented a new Datasette plugin hook called &lt;a href="https://datasette.readthedocs.io/en/stable/plugins.html#plugin-register-output-renderer"&gt;register_output_renderer&lt;/a&gt; (&lt;a href="https://github.com/simonw/datasette/pull/441"&gt;#441&lt;/a&gt;) which allows plugins to create additional output renderers in addition to Datasette’s default &lt;code&gt;.json&lt;/code&gt; and &lt;code&gt;.csv&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Russ’s in-development &lt;a href="https://github.com/russss/datasette-geo"&gt;datasette-geo&lt;/a&gt; plugin includes &lt;a href="https://github.com/russss/datasette-geo/blob/d4cecc020848bbde91e9e17bf352f7c70bc3dccf/datasette_plugin_geo/geojson.py"&gt;an example&lt;/a&gt; of this hook being used to output &lt;code&gt;.geojson&lt;/code&gt; automatically converted from SpatiaLite.&lt;/p&gt;


&lt;h3&gt;Medium changes&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Datasette now conforms to the &lt;a href="https://github.com/python/black"&gt;Black coding style&lt;/a&gt; (&lt;a href="https://github.com/simonw/datasette/pull/449"&gt;#449&lt;/a&gt;) - and has a unit test to enforce this in the future&lt;/li&gt;
&lt;li&gt;New &lt;a href="https://datasette.readthedocs.io/en/stable/json_api.html#json-api-table-arguments"&gt;Special table arguments&lt;/a&gt;:
    &lt;ul&gt;
    &lt;li&gt;&lt;code&gt;?columnname__in=value1,value2,value3&lt;/code&gt; filter for executing SQL IN queries against a table, see &lt;a href="https://datasette.readthedocs.io/en/stable/json_api.html#table-arguments"&gt;Table arguments&lt;/a&gt; (&lt;a href="https://github.com/simonw/datasette/issues/433"&gt;#433&lt;/a&gt;)&lt;/li&gt;
    &lt;li&gt;&lt;code&gt;?columnname__date=yyyy-mm-dd&lt;/code&gt; filter which returns rows where the specified datetime column falls on the specified date (&lt;a href="https://github.com/simonw/datasette/commit/583b22aa28e26c318de0189312350ab2688c90b1"&gt;583b22a&lt;/a&gt;)&lt;/li&gt;
    &lt;li&gt;&lt;code&gt;?tags__arraycontains=tag&lt;/code&gt; filter which acts against a JSON array contained in a column (&lt;a href="https://github.com/simonw/datasette/commit/78e45ead4d771007c57b307edf8fc920101f8733"&gt;78e45ea&lt;/a&gt;)&lt;/li&gt;
    &lt;li&gt;&lt;code&gt;?_where=sql-fragment&lt;/code&gt; filter for the table view (&lt;a href="https://github.com/simonw/datasette/issues/429"&gt;#429&lt;/a&gt;)&lt;/li&gt;
    &lt;li&gt;&lt;code&gt;?_fts_table=mytable&lt;/code&gt; and &lt;code&gt;?_fts_pk=mycolumn&lt;/code&gt; querystring options can be used to specify which FTS table to use for a search query - see &lt;a href="https://datasette.readthedocs.io/en/stable/full_text_search.html#full-text-search-table-or-view"&gt;Configuring full-text search for a table or view&lt;/a&gt; (&lt;a href="https://github.com/simonw/datasette/issues/428"&gt;#428&lt;/a&gt;)&lt;/li&gt;
    &lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;You can now pass the same table filter multiple times - for example, &lt;code&gt;?content__not=world&amp;amp;content__not=hello&lt;/code&gt; will return all rows where the content column is neither &lt;code&gt;hello&lt;/code&gt; nor &lt;code&gt;world&lt;/code&gt; (&lt;a href="https://github.com/simonw/datasette/issues/288"&gt;#288&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;You can now specify &lt;code&gt;about&lt;/code&gt; and &lt;code&gt;about_url&lt;/code&gt; metadata (in addition to &lt;code&gt;source&lt;/code&gt; and &lt;code&gt;license&lt;/code&gt;) linking to further information about a project - see &lt;a href="https://datasette.readthedocs.io/en/stable/metadata.html#metadata-source-license-about"&gt;Source, license and about&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;New &lt;code&gt;?_trace=1&lt;/code&gt; parameter now adds debug information showing every SQL query that was executed while constructing the page (&lt;a href="https://github.com/simonw/datasette/issues/435"&gt;#435&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;&lt;code&gt;datasette inspect&lt;/code&gt; now just calculates table counts, and does not introspect other database metadata (&lt;a href="https://github.com/simonw/datasette/issues/462"&gt;#462&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Removed &lt;code&gt;/-/inspect&lt;/code&gt; page entirely - this will be replaced by something similar in the future, see &lt;a href="https://github.com/simonw/datasette/issues/465"&gt;#465&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Datasette can now run against an in-memory SQLite database. You can do this by starting it without passing any files or by using the new &lt;code&gt;--memory&lt;/code&gt; option to &lt;code&gt;datasette serve&lt;/code&gt;. This can be useful for experimenting with SQLite queries that do not access any data, such as &lt;code&gt;SELECT 1+1&lt;/code&gt; or &lt;code&gt;SELECT sqlite_version()&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
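&lt;p&gt;A rough sketch of how filters like &lt;code&gt;__in&lt;/code&gt; and repeated &lt;code&gt;__not&lt;/code&gt; translate to parameterised SQL (illustrative only, not Datasette's actual implementation):&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (content TEXT)")
conn.executemany("INSERT INTO t VALUES (?)", [("hello",), ("world",), ("foo",)])

# ?content__in=hello,world - comma-separated values become a
# parameterised IN clause with one placeholder per value
values = "hello,world".split(",")
placeholders = ",".join("?" for _ in values)
rows_in = conn.execute(
    f"SELECT content FROM t WHERE content IN ({placeholders})", values
).fetchall()
print(rows_in)  # [('hello',), ('world',)]

# Repeating ?content__not=world and ?content__not=hello - each
# occurrence of the filter contributes one ANDed clause
rows_not = conn.execute(
    "SELECT content FROM t WHERE content != ? AND content != ?",
    ["world", "hello"],
).fetchall()
print(rows_not)  # [('foo',)]
```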


&lt;h3&gt;Small changes&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;We now show the size of the database file next to the download link (&lt;a href="https://github.com/simonw/datasette/issues/172"&gt;#172&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;New &lt;code&gt;/-/databases&lt;/code&gt; introspection page shows currently connected databases (&lt;a href="https://github.com/simonw/datasette/issues/470"&gt;#470&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Binary data is no longer displayed on the table and row pages (&lt;a href="https://github.com/simonw/datasette/pull/442"&gt;#442&lt;/a&gt; - thanks, Russ Garrett)&lt;/li&gt;
&lt;li&gt;New show/hide SQL links on custom query pages (&lt;a href="https://github.com/simonw/datasette/issues/415"&gt;#415&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;The &lt;a href="https://datasette.readthedocs.io/en/stable/plugins.html#plugin-hook-extra-body-script"&gt;extra_body_script&lt;/a&gt; plugin hook now accepts an optional &lt;code&gt;view_name&lt;/code&gt; argument (&lt;a href="https://github.com/simonw/datasette/pull/443"&gt;#443&lt;/a&gt; - thanks, Russ Garrett)&lt;/li&gt;
&lt;li&gt;Bumped Jinja2 dependency to 2.10.1 (&lt;a href="https://github.com/simonw/datasette/pull/426"&gt;#426&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;All table filters are now documented, and documentation is enforced via unit tests (&lt;a href="https://github.com/simonw/datasette/commit/2c19a27d15a913e5f3dd443f04067169a6f24634"&gt;2c19a27&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;New project guideline: master should stay shippable at all times! (&lt;a href="https://github.com/simonw/datasette/commit/31f36e1b97ccc3f4387c80698d018a69798b6228"&gt;31f36e1&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Fixed a bug where &lt;code&gt;sqlite_timelimit()&lt;/code&gt; occasionally failed to clean up after itself (&lt;a href="https://github.com/simonw/datasette/commit/bac4e01f40ae7bd19d1eab1fb9349452c18de8f5"&gt;bac4e01&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;We no longer load additional plugins when executing pytest (&lt;a href="https://github.com/simonw/datasette/issues/438"&gt;#438&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Homepage now links to database views if there are fewer than five tables in a database (&lt;a href="https://github.com/simonw/datasette/issues/373"&gt;#373&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;--cors&lt;/code&gt; option is now respected by error pages (&lt;a href="https://github.com/simonw/datasette/issues/453"&gt;#453&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;&lt;code&gt;datasette publish heroku&lt;/code&gt; now uses the &lt;code&gt;--include-vcs-ignore&lt;/code&gt; option, which means it works under Travis CI (&lt;a href="https://github.com/simonw/datasette/pull/407"&gt;#407&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;&lt;code&gt;datasette publish heroku&lt;/code&gt; now publishes using Python 3.6.8 (&lt;a href="https://github.com/simonw/datasette/commit/666c37415a898949fae0437099d62a35b1e9c430"&gt;666c374&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Renamed &lt;code&gt;datasette publish now&lt;/code&gt; to &lt;code&gt;datasette publish nowv1&lt;/code&gt; (&lt;a href="https://github.com/simonw/datasette/issues/472"&gt;#472&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;&lt;code&gt;datasette publish nowv1&lt;/code&gt; now accepts multiple &lt;code&gt;--alias&lt;/code&gt; parameters (&lt;a href="https://github.com/simonw/datasette/commit/09ef305c687399384fe38487c075e8669682deb4"&gt;09ef305&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Removed the &lt;code&gt;datasette skeleton&lt;/code&gt; command (&lt;a href="https://github.com/simonw/datasette/issues/476"&gt;#476&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;The &lt;a href="https://datasette.readthedocs.io/en/stable/contributing.html#contributing-documentation"&gt;documentation on how to build the documentation&lt;/a&gt; now recommends &lt;code&gt;sphinx-autobuild&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/open-source"&gt;open-source&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/cloudrun"&gt;cloudrun&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="open-source"/><category term="projects"/><category term="datasette"/><category term="cloudrun"/></entry><entry><title>Ministry of Silly Runtimes: Vintage Python on Cloud Run</title><link href="https://simonwillison.net/2019/Apr/9/vintage-python-on-cloud-run/#atom-tag" rel="alternate"/><published>2019-04-09T17:33:47+00:00</published><updated>2019-04-09T17:33:47+00:00</updated><id>https://simonwillison.net/2019/Apr/9/vintage-python-on-cloud-run/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://dev.to/di/ministry-of-silly-runtimes-vintage-python-on-cloud-run-3b9d"&gt;Ministry of Silly Runtimes: Vintage Python on Cloud Run&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Cloud Run is an exciting new hosting service from Google that lets you define a container using a Dockerfile and then run that container in a “scale to zero” environment, so you only pay for time spent serving traffic. It’s similar to the now-deprecated Zeit Now 1.0 which inspired me to create Datasette. Here Dustin Ingram demonstrates how powerful Docker can be as the underlying abstraction by deploying a web app using a 25-year-old version of Python 1.x.

&lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/jacobian/status/1115665262215352320"&gt;@jacobian&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/cloud"&gt;cloud&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/zeit-now"&gt;zeit-now&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/docker"&gt;docker&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/cloudrun"&gt;cloudrun&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/dustin-ingram"&gt;dustin-ingram&lt;/a&gt;&lt;/p&gt;



</summary><category term="cloud"/><category term="python"/><category term="zeit-now"/><category term="docker"/><category term="datasette"/><category term="cloudrun"/><category term="dustin-ingram"/></entry></feed>