<?xml version="1.0" encoding="utf-8"?>
<feed xml:lang="en-us" xmlns="http://www.w3.org/2005/Atom"><title>Simon Willison's Weblog: s3</title><link href="http://simonwillison.net/" rel="alternate"/><link href="http://simonwillison.net/tags/s3.atom" rel="self"/><id>http://simonwillison.net/</id><updated>2026-03-25T21:57:05+00:00</updated><author><name>Simon Willison</name></author><entry><title>datasette-files-s3 0.1a1</title><link href="https://simonwillison.net/2026/Mar/25/datasette-files-s3/#atom-tag" rel="alternate"/><published>2026-03-25T21:57:05+00:00</published><updated>2026-03-25T21:57:05+00:00</updated><id>https://simonwillison.net/2026/Mar/25/datasette-files-s3/#atom-tag</id><summary type="html">
    
        &lt;p&gt;&lt;strong&gt;Release:&lt;/strong&gt; &lt;a href="https://github.com/datasette/datasette-files-s3/releases/tag/0.1a1"&gt;datasette-files-s3 0.1a1&lt;/a&gt;&lt;/p&gt;
        &lt;p&gt;A backend for &lt;a href="https://github.com/datasette/datasette-files"&gt;datasette-files&lt;/a&gt; that adds the ability to store and retrieve files using an S3 bucket. This release adds &lt;a href="https://github.com/datasette/datasette-files-s3/blob/main/README.md#credentials-broker-response"&gt;a mechanism&lt;/a&gt; for fetching S3 configuration periodically from a URL, which means we can use time-limited IAM credentials that are restricted to a prefix within a bucket.&lt;/p&gt;
    
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/s3"&gt;s3&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="s3"/><category term="datasette"/></entry><entry><title>s3-credentials 0.17</title><link href="https://simonwillison.net/2025/Dec/16/s3-credentials/#atom-tag" rel="alternate"/><published>2025-12-16T23:40:31+00:00</published><updated>2025-12-16T23:40:31+00:00</updated><id>https://simonwillison.net/2025/Dec/16/s3-credentials/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/simonw/s3-credentials/releases/tag/0.17"&gt;s3-credentials 0.17&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;New release of my &lt;a href="https://s3-credentials.readthedocs.io/"&gt;s3-credentials&lt;/a&gt; CLI tool for managing credentials needed to access just one S3 bucket. Here are the release notes in full:&lt;/p&gt;
&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;New commands &lt;code&gt;get-bucket-policy&lt;/code&gt; and &lt;code&gt;set-bucket-policy&lt;/code&gt;. &lt;a href="https://github.com/simonw/s3-credentials/issues/91"&gt;#91&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;New commands &lt;code&gt;get-public-access-block&lt;/code&gt; and &lt;code&gt;set-public-access-block&lt;/code&gt;. &lt;a href="https://github.com/simonw/s3-credentials/issues/92"&gt;#92&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;New &lt;code&gt;localserver&lt;/code&gt; command for starting a web server that makes time limited credentials accessible via a JSON API. &lt;a href="https://github.com/simonw/s3-credentials/pull/93"&gt;#93&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;That &lt;code&gt;s3-credentials localserver&lt;/code&gt; command (&lt;a href="https://s3-credentials.readthedocs.io/en/stable/localserver.html"&gt;documented here&lt;/a&gt;) is a little obscure, but I found myself wanting something like that to help me test out a new feature I'm building to help create temporary Litestream credentials using Amazon STS.&lt;/p&gt;
&lt;p&gt;Most of that new feature was &lt;a href="https://gistpreview.github.io/?500add71f397874ebadb8e04e8a33b53"&gt;built by Claude Code&lt;/a&gt; from the following starting prompt:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;Add a feature s3-credentials localserver which starts a localhost weberver running (using the Python standard library stuff) on port 8094 by default but -p/--port can set a different port and otherwise takes an option that names a bucket and then takes the same options for read--write/read-only etc as other commands. It also takes a required --refresh-interval option which can be set as 5m or 10h or 30s. All this thing does is reply on / to a GET request with the IAM expiring credentials that allow access to that bucket with that policy for that specified amount of time. It caches internally the credentials it generates and will return the exact same data up until they expire (it also tracks expected expiry time) after which it will generate new credentials (avoiding dog pile effects if multiple requests ask at the same time) and return and cache those instead.&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
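&lt;p&gt;The caching behaviour that prompt asks for can be sketched roughly as follows. This is a minimal illustration and not the code that shipped in s3-credentials: the &lt;code&gt;CredentialCache&lt;/code&gt; class, its &lt;code&gt;generate&lt;/code&gt; callback and the injectable &lt;code&gt;clock&lt;/code&gt; are hypothetical names, and a stub stands in for the call to Amazon STS.&lt;/p&gt;

```python
import threading
import time


class CredentialCache:
    """Cache one set of credentials until an expected expiry time (hypothetical helper)."""

    def __init__(self, generate, refresh_interval, clock=time.monotonic):
        self._generate = generate  # callable returning fresh credentials
        self._refresh_interval = refresh_interval  # seconds
        self._clock = clock
        self._lock = threading.Lock()
        self._cached = None
        self._expires_at = 0.0

    def get(self):
        # Holding the lock while regenerating means concurrent callers wait
        # for a single refresh rather than each triggering their own.
        with self._lock:
            now = self._clock()
            if self._cached is None or now >= self._expires_at:
                self._cached = self._generate()
                self._expires_at = now + self._refresh_interval
            return self._cached


calls = []


def fake_generate():
    # Stand-in for a call to STS; records how often it was invoked.
    calls.append(1)
    return {"AccessKeyId": "EXAMPLE-{}".format(len(calls))}


fake_now = [0.0]
cache = CredentialCache(fake_generate, refresh_interval=300, clock=lambda: fake_now[0])
first = cache.get()
second = cache.get()   # served from the cache
fake_now[0] = 301.0    # pretend just over five minutes have passed
third = cache.get()    # regenerated after expiry
```

&lt;p&gt;A single lock is the simplest way to avoid the dog pile: requests that arrive during a refresh block briefly and then receive the same freshly cached value.&lt;/p&gt;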


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/aws"&gt;aws&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/s3"&gt;s3&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/annotated-release-notes"&gt;annotated-release-notes&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/s3-credentials"&gt;s3-credentials&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/prompt-engineering"&gt;prompt-engineering&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/coding-agents"&gt;coding-agents&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude-code"&gt;claude-code&lt;/a&gt;&lt;/p&gt;



</summary><category term="aws"/><category term="projects"/><category term="s3"/><category term="ai"/><category term="annotated-release-notes"/><category term="s3-credentials"/><category term="prompt-engineering"/><category term="generative-ai"/><category term="llms"/><category term="coding-agents"/><category term="claude-code"/></entry><entry><title>An MVCC-like columnar table on S3 with constant-time deletes</title><link href="https://simonwillison.net/2025/Oct/11/mvcc-s3/#atom-tag" rel="alternate"/><published>2025-10-11T03:47:16+00:00</published><updated>2025-10-11T03:47:16+00:00</updated><id>https://simonwillison.net/2025/Oct/11/mvcc-s3/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.shayon.dev/post/2025/277/an-mvcc-like-columnar-table-on-s3-with-constant-time-deletes/"&gt;An MVCC-like columnar table on S3 with constant-time deletes&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;S3's support for conditional writes (&lt;a href="https://simonwillison.net/2024/Nov/26/s3-conditional-writes/"&gt;previously&lt;/a&gt;) makes it an interesting, scalable and often inexpensive platform for all kinds of database patterns.&lt;/p&gt;
&lt;p&gt;Shayon Mukherjee presents an ingenious design for a Parquet-backed database in S3 which accepts concurrent writes, presents a single atomic view for readers and even supports reliable row deletion despite Parquet requiring a complete file rewrite in order to remove data.&lt;/p&gt;
&lt;p&gt;The key to the design is a &lt;code&gt;_latest_manifest&lt;/code&gt; JSON file at the top of the bucket, containing an integer version number. Clients use compare-and-swap to increment that version - only one client can succeed at this, so the incremented version they get back is guaranteed unique to them.&lt;/p&gt;
&lt;p&gt;Having reserved a version number the client can write a unique manifest file for that version - &lt;code&gt;manifest/v00000123.json&lt;/code&gt; - with a more complex data structure referencing the current versions of every persisted file, including the one they just uploaded.&lt;/p&gt;
&lt;p&gt;Deleted rows are written to tombstone files as either a list of primary keys or a list of ranges. Clients consult these when executing reads, filtering out deleted rows as part of resolving a query.&lt;/p&gt;
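&lt;p&gt;The version-reservation step can be sketched like this. It is an illustration under stated assumptions rather than Shayon's implementation: &lt;code&gt;FakeBucket&lt;/code&gt; is an in-memory stand-in for S3 conditional writes, the &lt;code&gt;{"version": N}&lt;/code&gt; JSON shape is a guess at what the pointer file holds, and &lt;code&gt;reserve_version&lt;/code&gt; and &lt;code&gt;manifest_key&lt;/code&gt; are hypothetical helper names.&lt;/p&gt;

```python
import json


class FakeBucket:
    """In-memory stand-in for a bucket with conditional writes (not a real S3 client)."""

    def __init__(self):
        self._objects = {}  # key -> (etag, body)
        self._counter = 0

    def get(self, key):
        return self._objects[key]

    def put_if_match(self, key, body, expected_etag):
        # Succeeds only if the stored ETag is still the one the caller read.
        current_etag = self._objects.get(key, (None, None))[0]
        if current_etag != expected_etag:
            return False
        self._counter += 1
        self._objects[key] = ("etag-{}".format(self._counter), body)
        return True


def reserve_version(bucket, pointer_key="_latest_manifest"):
    # Retry the compare-and-swap until this client wins an increment.
    while True:
        etag, body = bucket.get(pointer_key)
        next_version = json.loads(body)["version"] + 1
        new_body = json.dumps({"version": next_version})
        if bucket.put_if_match(pointer_key, new_body, etag):
            return next_version


def manifest_key(version):
    # Zero-padded to eight digits, matching manifest/v00000123.json.
    return "manifest/v{:08d}.json".format(version)


bucket = FakeBucket()
bucket.put_if_match("_latest_manifest", json.dumps({"version": 122}), None)
version = reserve_version(bucket)
print(version, manifest_key(version))
```

&lt;p&gt;A client that loses the race sees its conditional write rejected, re-reads the pointer and tries again, so no two clients can ever come away with the same version number.&lt;/p&gt;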
&lt;p&gt;The pricing estimates are especially noteworthy:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;For a workload ingesting 6 TB/day with 2 TB of deletes and 50K queries/day:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;PUT requests: ~380K/day (≈4 req/s) = $1.88/day&lt;/li&gt;
&lt;li&gt;GET requests: highly variable, depends on partitioning effectiveness&lt;ul&gt;
&lt;li&gt;Best case (good time-based partitioning): ~100K-200K/day = $0.04-$0.08/day&lt;/li&gt;
&lt;li&gt;Worst case (poor partitioning, scanning many files): ~2M/day = $0.80/day&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;~$3/day for ingesting 6TB of data is pretty fantastic!&lt;/p&gt;
&lt;p&gt;Watch out for storage costs though - each new TB of data at $0.023/GB/month adds $23.55 to the ongoing monthly bill.&lt;/p&gt;
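&lt;p&gt;The arithmetic behind those figures is easy to check. The request costs below are copied from the quoted estimate; the final line is an extra illustration that assumes every TB ingested in a day is kept, ignoring deletes and compression.&lt;/p&gt;

```python
# Daily request costs in dollars, taken from the quoted estimate.
put_cost = 1.88
get_cost_best = 0.04
get_cost_worst = 0.80

daily_best = round(put_cost + get_cost_best, 2)
daily_worst = round(put_cost + get_cost_worst, 2)

# Storage: $0.023 per GB per month, counting 1 TB as 1,024 GB.
storage_per_tb = round(1024 * 0.023, 2)

# Assumption for illustration: all 6 TB ingested in a day is retained.
storage_per_day_of_ingest = round(6 * 1024 * 0.023, 2)

print(daily_best, daily_worst, storage_per_tb, storage_per_day_of_ingest)
```

&lt;p&gt;The worst case comes to $2.68/day in request charges, which is where the ~$3 figure comes from. Under the retention assumption, storage quickly becomes the larger cost.&lt;/p&gt;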

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=45493158"&gt;Hacker News&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/aws"&gt;aws&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/databases"&gt;databases&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/s3"&gt;s3&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/parquet"&gt;parquet&lt;/a&gt;&lt;/p&gt;



</summary><category term="aws"/><category term="databases"/><category term="s3"/><category term="parquet"/></entry><entry><title>AWS in 2025: The Stuff You Think You Know That’s Now Wrong</title><link href="https://simonwillison.net/2025/Aug/20/aws-in-2025/#atom-tag" rel="alternate"/><published>2025-08-20T16:29:56+00:00</published><updated>2025-08-20T16:29:56+00:00</updated><id>https://simonwillison.net/2025/Aug/20/aws-in-2025/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.lastweekinaws.com/blog/aws-in-2025-the-stuff-you-think-you-know-thats-now-wrong/"&gt;AWS in 2025: The Stuff You Think You Know That’s Now Wrong&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Absurdly useful roundup from Corey Quinn of AWS changes you may have missed that can materially affect your architectural decisions about how you use their services.&lt;/p&gt;
&lt;p&gt;A few that stood out to me:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;EC2 instances can now live-migrate between physical hosts, and can have their security groups, IAM roles and EBS volumes modified without a restart. They now charge by the second; they used to round up to the hour.&lt;/li&gt;
&lt;li&gt;S3 Glacier restores are now fast and predictably priced.&lt;/li&gt;
&lt;li&gt;AWS Lambdas can now run containers, execute for up to 15 minutes, use up to 10GB of RAM and request 10GB of /tmp storage.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Also this note on AWS's previously legendary resistance to shutting things down:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;While deprecations remain rare, they’re definitely on the rise; if an AWS service sounds relatively niche or goofy, consider your exodus plan before building atop it.&lt;/p&gt;
&lt;/blockquote&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=44962844"&gt;Hacker News&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/amazon"&gt;amazon&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/aws"&gt;aws&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ec2"&gt;ec2&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/lambda"&gt;lambda&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/s3"&gt;s3&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/corey-quinn"&gt;corey-quinn&lt;/a&gt;&lt;/p&gt;



</summary><category term="amazon"/><category term="aws"/><category term="ec2"/><category term="lambda"/><category term="s3"/><category term="corey-quinn"/></entry><entry><title>Quoting James Cowling</title><link href="https://simonwillison.net/2025/May/14/james-cowling/#atom-tag" rel="alternate"/><published>2025-05-14T03:49:37+00:00</published><updated>2025-05-14T03:49:37+00:00</updated><id>https://simonwillison.net/2025/May/14/james-cowling/#atom-tag</id><summary type="html">
    &lt;blockquote cite="https://twitter.com/jamesacowling/status/1922428807136608380"&gt;&lt;p&gt;I designed Dropbox's storage system and modeled its durability. Durability numbers (11 9's etc) are meaningless because competent providers don't lose data because of disk failures, they lose data because of bugs and operator error. [...]&lt;/p&gt;
&lt;p&gt;The best thing you can do for your own durability is to choose a competent provider and then ensure you don't accidentally delete or corrupt own data on it:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Ideally never mutate an object in S3, add a new version instead.&lt;/li&gt;
&lt;li&gt;Never live-delete any data. Mark it for deletion and then use a lifecycle policy to clean it up after a week.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This way you have time to react to a bug in your own stack.&lt;/p&gt;&lt;/blockquote&gt;
&lt;p class="cite"&gt;&amp;mdash; &lt;a href="https://twitter.com/jamesacowling/status/1922428807136608380"&gt;James Cowling&lt;/a&gt;&lt;/p&gt;

    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ops"&gt;ops&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/s3"&gt;s3&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/software-architecture"&gt;software-architecture&lt;/a&gt;&lt;/p&gt;



</summary><category term="ops"/><category term="s3"/><category term="software-architecture"/></entry><entry><title>suitenumerique/docs</title><link href="https://simonwillison.net/2025/Mar/17/docs/#atom-tag" rel="alternate"/><published>2025-03-17T18:51:50+00:00</published><updated>2025-03-17T18:51:50+00:00</updated><id>https://simonwillison.net/2025/Mar/17/docs/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/suitenumerique/docs"&gt;suitenumerique/docs&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;New open source (MIT licensed) collaborative text editing web application, similar to Google Docs or Notion, notable because it's a joint effort funded by the French and German governments and "currently onboarding the Netherlands".&lt;/p&gt;
&lt;p&gt;It's built using Django and React:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Docs is built on top of &lt;a href="https://www.django-rest-framework.org/"&gt;Django Rest Framework&lt;/a&gt;, &lt;a href="https://nextjs.org/"&gt;Next.js&lt;/a&gt;, &lt;a href="https://www.blocknotejs.org/"&gt;BlockNote.js&lt;/a&gt;, &lt;a href="https://tiptap.dev/docs/hocuspocus/introduction"&gt;HocusPocus&lt;/a&gt; and &lt;a href="https://yjs.dev/"&gt;Yjs&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Deployments currently &lt;a href="https://github.com/suitenumerique/docs/blob/main/docs/installation.md"&gt;require&lt;/a&gt; Kubernetes, PostgreSQL, memcached, an S3 bucket (or compatible) and an OIDC provider.&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/django"&gt;django&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/open-source"&gt;open-source&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/postgresql"&gt;postgresql&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/s3"&gt;s3&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/react"&gt;react&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/kubernetes"&gt;kubernetes&lt;/a&gt;&lt;/p&gt;



</summary><category term="django"/><category term="open-source"/><category term="postgresql"/><category term="s3"/><category term="react"/><category term="kubernetes"/></entry><entry><title>Using S3 triggers to maintain a list of files in DynamoDB</title><link href="https://simonwillison.net/2025/Feb/19/s3-triggers/#atom-tag" rel="alternate"/><published>2025-02-19T22:07:32+00:00</published><updated>2025-02-19T22:07:32+00:00</updated><id>https://simonwillison.net/2025/Feb/19/s3-triggers/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://til.simonwillison.net/aws/s3-triggers-dynamodb"&gt;Using S3 triggers to maintain a list of files in DynamoDB&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;I built an experimental prototype this morning of a system for efficiently tracking files that have been added to a large S3 bucket by maintaining a parallel DynamoDB table using S3 triggers and AWS Lambda.&lt;/p&gt;
&lt;p&gt;I got 80% of the way there with this single prompt (complete with typos) to my &lt;a href="https://simonwillison.net/2024/Dec/19/one-shot-python-tools/#writing-these-with-the-help-of-a-claude-project"&gt;custom Claude Project&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;Python CLI app using boto3 with commands for creating a new S3 bucket which it also configures to have S3 lambada event triggers which moantian a dynamodb table containing metadata about all of the files in that bucket. Include these commands&lt;/code&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;create_bucket - create a bucket and sets up the associated triggers and dynamo tables&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;list_files - shows me a list of files based purely on querying dynamo&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;ChatGPT then took me to the 95% point. The code Claude produced included an obvious bug, so I pasted the code into o3-mini-high on the basis that "reasoning" is often a great way to fix those kinds of errors:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;Identify, explain and then fix any bugs in this code:&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;&lt;em&gt;code from Claude pasted here&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;... and aside from adding a couple of &lt;code&gt;time.sleep()&lt;/code&gt; calls to work around timing errors with IAM policy distribution, &lt;a href="https://til.simonwillison.net/aws/s3-triggers-dynamodb#user-content-trying-it-out"&gt;everything worked&lt;/a&gt;!&lt;/p&gt;
&lt;p&gt;Getting from a rough idea to a working proof of concept of something like this with less than 15 minutes of prompting is extraordinarily valuable.&lt;/p&gt;
&lt;p&gt;This is exactly the kind of project I've avoided in the past because of my almost irrational intolerance of the frustration involved in figuring out the individual details of each call to S3, IAM, AWS Lambda and DynamoDB.&lt;/p&gt;
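&lt;p&gt;The heart of a trigger like this is a small function that turns S3 event notifications into table writes. The sketch below is not the code from the TIL: &lt;code&gt;operations_for_event&lt;/code&gt; is a hypothetical helper, the item shape is invented for illustration, and it returns the operations as plain data where a real Lambda handler would go on to apply them to a DynamoDB table.&lt;/p&gt;

```python
from urllib.parse import unquote_plus


def operations_for_event(event):
    """Translate an S3 notification event into (action, item) pairs."""
    operations = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        obj = record["s3"]["object"]
        # Keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(obj["key"])
        if record["eventName"].startswith("ObjectCreated:"):
            item = {"bucket": bucket, "key": key, "size": obj.get("size", 0)}
            operations.append(("put", item))
        elif record["eventName"].startswith("ObjectRemoved:"):
            operations.append(("delete", {"bucket": bucket, "key": key}))
    return operations


sample_event = {
    "Records": [
        {
            "eventName": "ObjectCreated:Put",
            "s3": {
                "bucket": {"name": "example-bucket"},
                "object": {"key": "photos/my+cat.jpg", "size": 71683},
            },
        },
        {
            "eventName": "ObjectRemoved:Delete",
            "s3": {
                "bucket": {"name": "example-bucket"},
                "object": {"key": "old.txt"},
            },
        },
    ]
}
ops = operations_for_event(sample_event)
```

&lt;p&gt;Keeping the translation separate from the DynamoDB calls makes it possible to test the logic locally with a sample event, without any AWS credentials.&lt;/p&gt;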
&lt;p&gt;(Update: I just found out about &lt;a href="https://aws.amazon.com/about-aws/whats-new/2025/01/amazon-s3-metadata-generally-available/"&gt;the new S3 Metadata system&lt;/a&gt; which launched a few weeks ago and might solve this exact problem!)&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/aws"&gt;aws&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/lambda"&gt;lambda&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/nosql"&gt;nosql&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/prototyping"&gt;prototyping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/s3"&gt;s3&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/chatgpt"&gt;chatgpt&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-programming"&gt;ai-assisted-programming&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude"&gt;claude&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/uv"&gt;uv&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/o3"&gt;o3&lt;/a&gt;&lt;/p&gt;



</summary><category term="aws"/><category term="lambda"/><category term="nosql"/><category term="prototyping"/><category term="s3"/><category term="ai"/><category term="generative-ai"/><category term="chatgpt"/><category term="llms"/><category term="ai-assisted-programming"/><category term="claude"/><category term="uv"/><category term="o3"/></entry><entry><title>sqlite-s3vfs</title><link href="https://simonwillison.net/2025/Feb/7/sqlite-s3vfs/#atom-tag" rel="alternate"/><published>2025-02-07T02:22:43+00:00</published><updated>2025-02-07T02:22:43+00:00</updated><id>https://simonwillison.net/2025/Feb/7/sqlite-s3vfs/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/uktrade/sqlite-s3vfs"&gt;sqlite-s3vfs&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Neat open source project on the GitHub organisation for the UK government's Department for Business and Trade: a "Python virtual filesystem for SQLite to read from and write to S3."&lt;/p&gt;
&lt;p&gt;I tried out &lt;a href="https://github.com/uktrade/sqlite-s3vfs/blob/main/README.md#usage"&gt;their usage example&lt;/a&gt; by running it in a Python REPL with all of the dependencies, like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;uv run --python 3.13 --with apsw --with sqlite-s3vfs --with boto3 python
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;It worked as advertised. When I listed my S3 bucket I found it had created two files - one called &lt;code&gt;demo.sqlite/0000000000&lt;/code&gt; and another called &lt;code&gt;demo.sqlite/0000000001&lt;/code&gt;, both 4096 bytes because each one represented a SQLite page.&lt;/p&gt;
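&lt;p&gt;That layout suggests one S3 object per page, keyed by a zero-padded page index. The sketch below only illustrates that idea: the key format is inferred from the two object names listed above, and &lt;code&gt;page_objects&lt;/code&gt; is a hypothetical helper, not a function from sqlite-s3vfs.&lt;/p&gt;

```python
PAGE_SIZE = 4096  # bytes, matching the object sizes observed above


def page_objects(name, data, page_size=PAGE_SIZE):
    """Split a database file into (key, bytes) pairs, one per page."""
    pages = []
    for offset in range(0, len(data), page_size):
        index = offset // page_size
        key = "{}/{:010d}".format(name, index)
        pages.append((key, data[offset:offset + page_size]))
    return pages


# Two pages of zero bytes stand in for a small database file.
objects = page_objects("demo.sqlite", bytes(2 * PAGE_SIZE))
print([key for key, _ in objects])
```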
&lt;p&gt;The implementation is just &lt;a href="https://github.com/uktrade/sqlite-s3vfs/blob/main/sqlite_s3vfs.py"&gt;200 lines of Python&lt;/a&gt;, implementing a new SQLite Virtual Filesystem on top of &lt;a href="https://rogerbinns.github.io/apsw/vfs.html#vfs-class"&gt;apsw.VFS&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The README includes this warning:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;No locking is performed, so client code &lt;em&gt;must&lt;/em&gt; ensure that writes do not overlap with other writes or reads. If multiple writes happen at the same time, the database will probably become corrupt and data be lost.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I wonder if the &lt;a href="https://simonwillison.net/2024/Nov/26/s3-conditional-writes/"&gt;conditional writes&lt;/a&gt; feature added to S3 back in November could be used to protect against that happening. Tricky as there are multiple files involved, but maybe it (or a &lt;a href="https://simonwillison.net/2024/Aug/30/leader-election-with-s3-conditional-writes/"&gt;trick like this one&lt;/a&gt;) could be used to implement some kind of exclusive lock between multiple processes?&lt;/p&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=42965198#42966961"&gt;Hacker News comment&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/s3"&gt;s3&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sqlite"&gt;sqlite&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/uv"&gt;uv&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/apsw"&gt;apsw&lt;/a&gt;&lt;/p&gt;



</summary><category term="python"/><category term="s3"/><category term="sqlite"/><category term="uv"/><category term="apsw"/></entry><entry><title>Building Python tools with a one-shot prompt using uv run and Claude Projects</title><link href="https://simonwillison.net/2024/Dec/19/one-shot-python-tools/#atom-tag" rel="alternate"/><published>2024-12-19T07:00:37+00:00</published><updated>2024-12-19T07:00:37+00:00</updated><id>https://simonwillison.net/2024/Dec/19/one-shot-python-tools/#atom-tag</id><summary type="html">
    &lt;p&gt;I've written a lot about how I've been using Claude to build one-shot HTML+JavaScript applications &lt;a href="https://simonwillison.net/tags/claude-artifacts/"&gt;via Claude Artifacts&lt;/a&gt;. I recently started using a similar pattern to create one-shot Python utilities, using a custom Claude Project combined with the dependency management capabilities of &lt;a href="https://github.com/astral-sh/uv"&gt;uv&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;(In LLM jargon a "one-shot" prompt is a prompt that produces the complete desired result on the first attempt. Confusingly it also sometimes means a prompt that includes a single example of the desired output format. Here I'm using the first of those two definitions.)&lt;/p&gt;
&lt;p&gt;I'll start with an example of a tool I built that way.&lt;/p&gt;
&lt;p&gt;I had another round of battle with Amazon S3 today trying to figure out why a file in one of my buckets couldn't be accessed via a public URL.&lt;/p&gt;
&lt;p&gt;Out of frustration I prompted Claude with a variant of the following (&lt;a href="https://gist.github.com/simonw/9f69cf35889b0445b80eeed691d44504"&gt;full transcript here&lt;/a&gt;):&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;I can't access the file at EXAMPLE_S3_URL. Write me a Python CLI tool using Click and boto3 which takes a URL of that form and then uses EVERY single boto3 trick in the book to try and debug why the file is returning a 404&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;It wrote me &lt;a href="https://github.com/simonw/tools/blob/main/python/debug_s3_access.py"&gt;this script&lt;/a&gt;, which gave me exactly what I needed. I ran it like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;uv run debug_s3_access.py \
  https://test-public-bucket-simonw.s3.us-east-1.amazonaws.com/0f550b7b28264d7ea2b3d360e3381a95.jpg&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2024/debug-s3.jpg" alt="Terminal screenshot showing S3 access analysis results. Command: '$ uv run http://tools.simonwillison.net/python/debug_s3_access.py url-to-image' followed by detailed output showing bucket exists (Yes), region (default), key exists (Yes), bucket policy (AllowAllGetObject), bucket owner (swillison), versioning (Not enabled), content type (image/jpeg), size (71683 bytes), last modified (2024-12-19 03:43:30+00:00) and public access settings (all False)" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;You can &lt;a href="https://github.com/simonw/tools/tree/main/python#debug_s3_accesspy"&gt;see the text output here&lt;/a&gt;.&lt;/p&gt;
&lt;h4 id="inline-dependencies-and-uv-run"&gt;Inline dependencies and uv run&lt;/h4&gt;
&lt;p&gt;Crucially, I didn't have to take any extra steps to install any of the dependencies that the script needed. That's because the script starts with this magic comment:&lt;/p&gt;
&lt;pre&gt;&lt;span class="pl-c"&gt;# /// script&lt;/span&gt;
&lt;span class="pl-c"&gt;# requires-python = "&amp;gt;=3.12"&lt;/span&gt;
&lt;span class="pl-c"&gt;# dependencies = [&lt;/span&gt;
&lt;span class="pl-c"&gt;#     "click",&lt;/span&gt;
&lt;span class="pl-c"&gt;#     "boto3",&lt;/span&gt;
&lt;span class="pl-c"&gt;#     "urllib3",&lt;/span&gt;
&lt;span class="pl-c"&gt;#     "rich",&lt;/span&gt;
&lt;span class="pl-c"&gt;# ]&lt;/span&gt;
&lt;span class="pl-c"&gt;# ///&lt;/span&gt;&lt;/pre&gt;
&lt;p&gt;This is an example of &lt;a href="https://docs.astral.sh/uv/guides/scripts/#declaring-script-dependencies"&gt;inline script dependencies&lt;/a&gt;, a feature described in &lt;a href="https://peps.python.org/pep-0723/"&gt;PEP 723&lt;/a&gt; and implemented by &lt;code&gt;uv run&lt;/code&gt;. Running the script causes &lt;code&gt;uv&lt;/code&gt; to create a temporary virtual environment with those dependencies installed, a process that takes just a few milliseconds once the &lt;code&gt;uv&lt;/code&gt; cache has been populated.&lt;/p&gt;
&lt;p&gt;This even works if the script is specified by a URL! Anyone with &lt;code&gt;uv&lt;/code&gt; installed can run the following command (provided you trust me not to have replaced the script with something malicious) to debug one of their own S3 buckets:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;uv run http://tools.simonwillison.net/python/debug_s3_access.py \
  https://test-public-bucket-simonw.s3.us-east-1.amazonaws.com/0f550b7b28264d7ea2b3d360e3381a95.jpg&lt;/pre&gt;&lt;/div&gt;
&lt;h4 id="writing-these-with-the-help-of-a-claude-project"&gt;Writing these with the help of a Claude Project&lt;/h4&gt;
&lt;p&gt;The reason I can one-shot scripts like this now is that I've set up a &lt;a href="https://www.anthropic.com/news/projects"&gt;Claude Project&lt;/a&gt; called "Python app". Projects can have custom instructions, and I used those to "teach" Claude how to take advantage of inline script dependencies:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;You write Python tools as single files. They always start with this comment:&lt;/p&gt;
&lt;pre&gt;&lt;span&gt;# /// script&lt;/span&gt;
&lt;span&gt;# requires-python = "&amp;gt;=3.12"&lt;/span&gt;
&lt;span&gt;# ///&lt;/span&gt;&lt;/pre&gt;
&lt;p&gt;These files can include dependencies on libraries such as Click. If they do, those dependencies are included in a list like this one in that same comment (here showing two dependencies):&lt;/p&gt;
&lt;pre&gt;&lt;span&gt;# /// script&lt;/span&gt;
&lt;span&gt;# requires-python = "&amp;gt;=3.12"&lt;/span&gt;
&lt;span&gt;# dependencies = [&lt;/span&gt;
&lt;span&gt;#     "click",&lt;/span&gt;
&lt;span&gt;#     "sqlite-utils",&lt;/span&gt;
&lt;span&gt;# ]&lt;/span&gt;
&lt;span&gt;# ///&lt;/span&gt;&lt;/pre&gt;
&lt;/blockquote&gt;
&lt;p&gt;That's everything Claude needs to reliably knock out full-featured Python tools as single scripts which can be run directly using whatever dependencies Claude chose to include.&lt;/p&gt;
&lt;p&gt;I didn't suggest that Claude use &lt;a href="https://github.com/Textualize/rich"&gt;rich&lt;/a&gt; for the &lt;code&gt;debug_s3_access.py&lt;/code&gt; script earlier but it decided to use it anyway!&lt;/p&gt;
&lt;p&gt;I've only recently started experimenting with this pattern but it seems to work &lt;em&gt;really&lt;/em&gt; well. Here's another example - my prompt was:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;Starlette web app that provides an API where you pass in ?url= and it strips all HTML tags and returns just the text, using beautifulsoup&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Here's &lt;a href="https://gist.github.com/simonw/08957a1490ebde1ea38b4a8374989cf8"&gt;the chat transcript&lt;/a&gt; and &lt;a href="https://gist.githubusercontent.com/simonw/08957a1490ebde1ea38b4a8374989cf8/raw/143ee24dc65ca109b094b72e8b8c494369e763d6/strip_html.py"&gt;the raw code it produced&lt;/a&gt;. You can run that server directly on your machine (it uses port 8000) like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;uv run https://gist.githubusercontent.com/simonw/08957a1490ebde1ea38b4a8374989cf8/raw/143ee24dc65ca109b094b72e8b8c494369e763d6/strip_html.py&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Then visit &lt;code&gt;http://127.0.0.1:8000/?url=https://simonwillison.net/&lt;/code&gt; to see it in action.&lt;/p&gt;
&lt;h4 id="custom-instructions"&gt;Custom instructions&lt;/h4&gt;
&lt;p&gt;The pattern here that's most interesting to me is using custom instructions or system prompts to show LLMs how to implement new patterns that may not exist in their training data. &lt;code&gt;uv run&lt;/code&gt; is less than a year old, but providing just a short example is enough to get the models to write code that takes advantage of its capabilities.&lt;/p&gt;
&lt;p&gt;I have a similar set of custom instructions I use for creating single page HTML and JavaScript tools, again running in a Claude Project:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Never use React in artifacts - always plain HTML and vanilla JavaScript and CSS with minimal dependencies.&lt;/p&gt;
&lt;p&gt;CSS should be indented with two spaces and should start like this:&lt;/p&gt;
&lt;div class="highlight highlight-text-html-basic"&gt;&lt;pre&gt;&lt;span class="pl-kos"&gt;&amp;lt;&lt;/span&gt;&lt;span class="pl-ent"&gt;style&lt;/span&gt;&lt;span class="pl-kos"&gt;&amp;gt;&lt;/span&gt;
* {
  box-sizing: border-box;
}&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Inputs and textareas should be font size 16px. Font should always prefer Helvetica.&lt;/p&gt;
&lt;p&gt;JavaScript should be two space indents and start like this:&lt;/p&gt;
&lt;div class="highlight highlight-text-html-basic"&gt;&lt;pre&gt;&lt;span class="pl-kos"&gt;&amp;lt;&lt;/span&gt;&lt;span class="pl-ent"&gt;script&lt;/span&gt; &lt;span class="pl-c1"&gt;type&lt;/span&gt;="&lt;span class="pl-s"&gt;module&lt;/span&gt;"&lt;span class="pl-kos"&gt;&amp;gt;&lt;/span&gt;
// code in here should not be indented at the first level&lt;/pre&gt;&lt;/div&gt;
&lt;/blockquote&gt;
&lt;p&gt;Most of the tools on my &lt;a href="https://tools.simonwillison.net/"&gt;tools.simonwillison.net&lt;/a&gt; site were created using versions of this custom instructions prompt.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/aws"&gt;aws&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/cli"&gt;cli&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/s3"&gt;s3&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/prompt-engineering"&gt;prompt-engineering&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-programming"&gt;ai-assisted-programming&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude"&gt;claude&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude-artifacts"&gt;claude-artifacts&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/uv"&gt;uv&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/rich"&gt;rich&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/prompt-to-app"&gt;prompt-to-app&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/starlette"&gt;starlette&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="aws"/><category term="cli"/><category term="python"/><category term="s3"/><category term="ai"/><category term="prompt-engineering"/><category term="generative-ai"/><category term="llms"/><category term="ai-assisted-programming"/><category term="claude"/><category term="claude-artifacts"/><category term="uv"/><category term="rich"/><category term="prompt-to-app"/><category term="starlette"/></entry><entry><title>DSQL Vignette: Reads and Compute</title><link href="https://simonwillison.net/2024/Dec/6/dsql-vignette-reads-and-compute/#atom-tag" rel="alternate"/><published>2024-12-06T17:12:10+00:00</published><updated>2024-12-06T17:12:10+00:00</updated><id>https://simonwillison.net/2024/Dec/6/dsql-vignette-reads-and-compute/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://brooker.co.za/blog/2024/12/04/inside-dsql.html"&gt;DSQL Vignette: Reads and Compute&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Marc Brooker is one of the engineers behind AWS's new &lt;a href="https://simonwillison.net/2024/Dec/3/amazon-aurora-dsql/"&gt;Aurora DSQL&lt;/a&gt; horizontally scalable database. Here he shares all sorts of interesting details about how it works under the hood.&lt;/p&gt;
&lt;p&gt;The system is built around the principle of separating storage from compute: storage uses S3, while compute runs in Firecracker:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Each transaction inside DSQL runs in a customized Postgres engine inside a Firecracker MicroVM, dedicated to your database. When you connect to DSQL, we make sure there are enough of these MicroVMs to serve your load, and scale up dynamically if needed. We add MicroVMs in the AZs and regions your connections are coming from, keeping your SQL query processor engine as close to your client as possible to optimize for latency.&lt;/p&gt;
&lt;p&gt;We opted to use PostgreSQL here because of its pedigree, modularity, extensibility, and performance. We’re not using any of the storage or transaction processing parts of PostgreSQL, but are using the SQL engine, an adapted version of the planner and optimizer, and the client protocol implementation.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The system then provides strong repeatable-read transaction isolation using MVCC and EC2's high precision clocks, enabling reads "as of time X" including against nearby read replicas.&lt;/p&gt;
&lt;p&gt;The storage layer supports index scans, which means the compute layer can push down some operations allowing it to load a subset of the rows it needs, reducing round-trips that are affected by speed-of-light latency.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The overall approach here is &lt;em&gt;disaggregation&lt;/em&gt;: we’ve taken each of the critical components of an OLTP database and made it a dedicated service. Each of those services is independently horizontally scalable, most of them are shared-nothing, and each can make the design choices that is most optimal in its domain.&lt;/p&gt;
&lt;/blockquote&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/aws"&gt;aws&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/databases"&gt;databases&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ec2"&gt;ec2&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/postgresql"&gt;postgresql&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/s3"&gt;s3&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/scaling"&gt;scaling&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/software-architecture"&gt;software-architecture&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/firecracker"&gt;firecracker&lt;/a&gt;&lt;/p&gt;



</summary><category term="aws"/><category term="databases"/><category term="ec2"/><category term="postgresql"/><category term="s3"/><category term="scaling"/><category term="software-architecture"/><category term="firecracker"/></entry><entry><title>Amazon S3 adds new functionality for conditional writes</title><link href="https://simonwillison.net/2024/Nov/26/s3-conditional-writes/#atom-tag" rel="alternate"/><published>2024-11-26T01:14:29+00:00</published><updated>2024-11-26T01:14:29+00:00</updated><id>https://simonwillison.net/2024/Nov/26/s3-conditional-writes/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://aws.amazon.com/about-aws/whats-new/2024/11/amazon-s3-functionality-conditional-writes/"&gt;Amazon S3 adds new functionality for conditional writes&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Amazon S3 can now perform conditional writes that evaluate if an object is unmodified before updating it. This helps you coordinate simultaneous writes to the same object and prevents multiple concurrent writers from unintentionally overwriting the object without knowing the state of its content. You can use this capability by providing the ETag of an object [...]&lt;/p&gt;
&lt;p&gt;This new conditional header can help improve the efficiency of your large-scale analytics, distributed machine learning, and other highly parallelized workloads by reliably offloading compare and swap operations to S3.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;(Both &lt;a href="https://learn.microsoft.com/en-us/rest/api/storageservices/specifying-conditional-headers-for-blob-service-operations#Subheading1"&gt;Azure Blob Storage&lt;/a&gt; and &lt;a href="https://cloud.google.com/storage/docs/request-preconditions#precondition_criteria"&gt;Google Cloud&lt;/a&gt; have this feature already.)&lt;/p&gt;
&lt;p&gt;Back in August, when AWS added conditional write support that only checked whether an object with that key already existed, I &lt;a href="https://simonwillison.net/2024/Aug/30/leader-election-with-s3-conditional-writes/"&gt;wrote about&lt;/a&gt; Gunnar Morling's trick for &lt;a href="https://www.morling.dev/blog/leader-election-with-s3-conditional-writes/"&gt;Leader Election With S3 Conditional Writes&lt;/a&gt;. This new capability opens up a whole set of new patterns for implementing distributed locking systems along those lines.&lt;/p&gt;
&lt;p&gt;Here's a useful illustrative example &lt;a href="https://news.ycombinator.com/item?id=42240678#42241577"&gt;by lxgr on Hacker News&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;As a (horribly inefficient, in case of non-trivial write contention) toy example, you could use S3 as a lock-free concurrent SQLite storage backend: Reads work as expected by fetching the entire database and satisfying the operation locally; writes work like this:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Download the current database copy&lt;/li&gt;
&lt;li&gt;Perform your write locally&lt;/li&gt;
&lt;li&gt;Upload it back using "Put-If-Match" and the pre-edit copy as the matched object.&lt;/li&gt;
&lt;li&gt;If you get success, consider the transaction successful.&lt;/li&gt;
&lt;li&gt;If you get failure, go back to step 1 and try again.&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
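&lt;p&gt;Here is a minimal Python sketch of that retry loop. It uses a hypothetical in-memory &lt;code&gt;FakeBucket&lt;/code&gt; class in place of S3 so the compare-and-swap logic can run locally. Against real S3, &lt;code&gt;put_if_match()&lt;/code&gt; would be a PutObject request carrying an &lt;code&gt;If-Match&lt;/code&gt; header with the object's ETag. Every name here is an illustration, not lxgr's code.&lt;/p&gt;

```python
import hashlib


class PreconditionFailed(Exception):
    """Raised when the stored ETag does not match the expected one."""


class FakeBucket:
    """Hypothetical in-memory stand-in for a single S3 object."""

    def __init__(self, data):
        self.data = data

    def etag(self):
        # Assumption: any content hash works for the sketch
        return hashlib.sha256(self.data).hexdigest()

    def get(self):
        # Return the content together with its current ETag
        return self.data, self.etag()

    def put_if_match(self, new_data, expected_etag):
        # Compare-and-swap: only overwrite if nobody else wrote first
        if self.etag() != expected_etag:
            raise PreconditionFailed()
        self.data = new_data


def write_with_retry(bucket, modify, max_attempts=5):
    """Apply modify() to the object, retrying on conflicting writes."""
    for attempt in range(1, max_attempts + 1):
        data, etag = bucket.get()  # 1. download the current copy
        new_data = modify(data)  # 2. perform the write locally
        try:
            bucket.put_if_match(new_data, etag)  # 3. conditional upload
            return attempt  # 4. success
        except PreconditionFailed:
            continue  # 5. someone else wrote first, so start over
    raise RuntimeError("too much write contention")
```

&lt;p&gt;&lt;code&gt;write_with_retry()&lt;/code&gt; returns the number of attempts it took, which makes the cost of write contention easy to observe.&lt;/p&gt;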
&lt;p&gt;AWS also just added the ability to &lt;a href="https://aws.amazon.com/about-aws/whats-new/2024/11/amazon-s3-enforcement-conditional-write-operations-general-purpose-buckets/"&gt;enforce conditional writes in bucket policies&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;To enforce conditional write operations, you can now use s3:if-none-match or s3:if-match condition keys to write a bucket policy that mandates the use of HTTP if-none-match or HTTP if-match conditional headers in S3 PutObject and CompleteMultipartUpload API requests. With this bucket policy in place, any attempt to write an object to your bucket without the required conditional header will be rejected.&lt;/p&gt;
&lt;/blockquote&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=42240678"&gt;Hacker News&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/aws"&gt;aws&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/s3"&gt;s3&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/scaling"&gt;scaling&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/software-architecture"&gt;software-architecture&lt;/a&gt;&lt;/p&gt;



</summary><category term="aws"/><category term="s3"/><category term="scaling"/><category term="software-architecture"/></entry><entry><title>Amazon S3 Express One Zone now supports the ability to append data to an object</title><link href="https://simonwillison.net/2024/Nov/22/amazon-s3-append-data/#atom-tag" rel="alternate"/><published>2024-11-22T04:39:35+00:00</published><updated>2024-11-22T04:39:35+00:00</updated><id>https://simonwillison.net/2024/Nov/22/amazon-s3-append-data/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://aws.amazon.com/about-aws/whats-new/2024/11/amazon-s3-express-one-zone-append-data-object/"&gt;Amazon S3 Express One Zone now supports the ability to append data to an object&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;This is a first for Amazon S3: it is now possible to append data to an existing object in a bucket, where previously the only supported operation was to atomically replace the object with an updated version.&lt;/p&gt;
&lt;p&gt;This is only available for S3 Express One Zone, a bucket class introduced &lt;a href="https://aws.amazon.com/blogs/aws/new-amazon-s3-express-one-zone-high-performance-storage-class/"&gt;a year ago&lt;/a&gt; which keeps data in just a single availability zone, offering significantly lower latency at the cost of reduced redundancy and a much higher price (16c/GB/month compared to 2.3c for S3 standard tier).&lt;/p&gt;
&lt;p&gt;The fact that appends have never been supported for multi-availability zone S3 provides an interesting clue as to the underlying architecture. Guaranteeing that every copy of an object has received and applied an append is significantly harder than doing a distributed atomic swap to a new version.&lt;/p&gt;
&lt;p&gt;More details from &lt;a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/directory-buckets-objects-append.html"&gt;the documentation&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;There is no minimum size requirement for the data you can append to an object. However, the maximum size of the data that you can append to an object in a single request is 5GB. This is the same limit as the largest request size when uploading data using any Amazon S3 API.&lt;/p&gt;
&lt;p&gt;With each successful append operation, you create a part of the object and each object can have up to 10,000 parts. This means you can append data to an object up to 10,000 times. If an object is created using S3 multipart upload, each uploaded part is counted towards the total maximum of 10,000 parts. For example, you can append up to 9,000 times to an object created by multipart upload comprising of 1,000 parts.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;That 10,000 limit means this won't quite work for constantly appending to a log file in a bucket.&lt;/p&gt;
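&lt;p&gt;A quick back-of-the-envelope calculation puts that limit in perspective. The append rates below are made-up illustrations, not figures from AWS:&lt;/p&gt;

```python
MAX_PARTS = 10_000  # per-object part limit quoted from the documentation


def hours_until_full(appends_per_minute, existing_parts=0):
    """Hours before an object runs out of parts at a steady append rate."""
    remaining = MAX_PARTS - existing_parts
    return remaining / appends_per_minute / 60


for rate in (1, 60):
    print(f"{rate} appends/minute: {hours_until_full(rate):.1f} hours")
```

&lt;p&gt;At one append per second the object is full in under three hours. At one append per minute it lasts just under a week.&lt;/p&gt;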
&lt;p&gt;Presumably it will be possible to "tail" an object that is receiving appended updates using the HTTP Range header.&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/aws"&gt;aws&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/s3"&gt;s3&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/scaling"&gt;scaling&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/software-architecture"&gt;software-architecture&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/http-range-requests"&gt;http-range-requests&lt;/a&gt;&lt;/p&gt;



</summary><category term="aws"/><category term="s3"/><category term="scaling"/><category term="software-architecture"/><category term="http-range-requests"/></entry><entry><title>Leader Election With S3 Conditional Writes</title><link href="https://simonwillison.net/2024/Aug/30/leader-election-with-s3-conditional-writes/#atom-tag" rel="alternate"/><published>2024-08-30T23:13:09+00:00</published><updated>2024-08-30T23:13:09+00:00</updated><id>https://simonwillison.net/2024/Aug/30/leader-election-with-s3-conditional-writes/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.morling.dev/blog/leader-election-with-s3-conditional-writes/"&gt;Leader Election With S3 Conditional Writes&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Amazon S3 added &lt;a href="https://aws.amazon.com/about-aws/whats-new/2024/08/amazon-s3-conditional-writes/"&gt;support for conditional writes&lt;/a&gt; last week, so you can now write a key to S3 with a reliable failure if someone else has already created it.&lt;/p&gt;
&lt;p&gt;This is a big deal. It reminds me of the time in 2020 when S3 &lt;a href="https://aws.amazon.com/about-aws/whats-new/2020/12/amazon-s3-now-delivers-strong-read-after-write-consistency-automatically-for-all-applications/"&gt;added read-after-write consistency&lt;/a&gt;, an astonishing piece of distributed systems engineering.&lt;/p&gt;
&lt;p&gt;Gunnar Morling demonstrates how this can be used to implement a distributed leader election system. The core flow looks like this:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Scan an S3 bucket for files matching &lt;code&gt;lock_*&lt;/code&gt; - like &lt;code&gt;lock_0000000001.json&lt;/code&gt;. If the highest number contains &lt;code&gt;{"expired": false}&lt;/code&gt; then that is the leader&lt;/li&gt;
&lt;li&gt;If the highest lock has expired, attempt to become the leader yourself: increment that lock ID and then attempt to create &lt;code&gt;lock_0000000002.json&lt;/code&gt; with a PUT request that includes the new &lt;code&gt;If-None-Match: *&lt;/code&gt; header - set the file content to &lt;code&gt;{"expired": false}&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;If that succeeds, you are the leader! If not then someone else beat you to it.&lt;/li&gt;
&lt;li&gt;To resign from leadership, update the file with &lt;code&gt;{"expired": true}&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;There's a bit more to it than that - Gunnar also describes how to implement lock validity timeouts such that a crashed leader doesn't leave the system leaderless.&lt;/p&gt;
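&lt;p&gt;The core flow in the list above can be sketched in Python with a hypothetical in-memory bucket standing in for S3. &lt;code&gt;FakeBucket&lt;/code&gt;, &lt;code&gt;try_become_leader()&lt;/code&gt; and &lt;code&gt;resign()&lt;/code&gt; are names invented for this illustration, starting the numbering at 1 when no lock exists is an assumption, and the validity timeouts are left out entirely:&lt;/p&gt;

```python
import json


class KeyExists(Exception):
    """Stand-in for the failure S3 reports when the key already exists."""


class FakeBucket:
    """Hypothetical in-memory bucket supporting If-None-Match: * writes."""

    def __init__(self):
        self.objects = {}

    def put_if_none_match(self, key, body):
        # Only create the object if no object with that key exists yet
        if key in self.objects:
            raise KeyExists(key)
        self.objects[key] = body


def lock_key(number):
    return f"lock_{number:010d}.json"


def try_become_leader(bucket):
    """Return the new lock number if we became leader, else None."""
    locks = sorted(k for k in bucket.objects if k.startswith("lock_"))
    current = 0  # assumption: numbering starts at 1 when no lock exists
    if locks:
        latest = locks[-1]
        if not json.loads(bucket.objects[latest])["expired"]:
            return None  # an unexpired lock means someone else is leading
        current = int(latest[len("lock_"):-len(".json")])
    try:
        bucket.put_if_none_match(
            lock_key(current + 1), json.dumps({"expired": False})
        )
    except KeyExists:
        return None  # another node created the lock first
    return current + 1


def resign(bucket, number):
    # Mark our own lock as expired so another node can take over
    bucket.objects[lock_key(number)] = json.dumps({"expired": True})
```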

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=41357123"&gt;Hacker News&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/aws"&gt;aws&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/s3"&gt;s3&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/software-architecture"&gt;software-architecture&lt;/a&gt;&lt;/p&gt;



</summary><category term="aws"/><category term="s3"/><category term="software-architecture"/></entry><entry><title>Quoting Jeff Barr</title><link href="https://simonwillison.net/2024/Jul/31/jeff-barr/#atom-tag" rel="alternate"/><published>2024-07-31T12:59:57+00:00</published><updated>2024-07-31T12:59:57+00:00</updated><id>https://simonwillison.net/2024/Jul/31/jeff-barr/#atom-tag</id><summary type="html">
    &lt;blockquote cite="https://twitter.com/jeffbarr/status/1818461689920344321"&gt;&lt;p&gt;After giving it a lot of thought, we made the decision to discontinue new access to a small number of services, including AWS CodeCommit. &lt;/p&gt;
&lt;p&gt;While we are no longer onboarding new customers to these services, there are no plans to change the features or experience you get today, including keeping them secure and reliable. [...]&lt;/p&gt;
&lt;p&gt;The services I'm referring to are: S3 Select, CloudSearch, Cloud9, SimpleDB, Forecast, Data Pipeline, and CodeCommit.&lt;/p&gt;&lt;/blockquote&gt;
&lt;p class="cite"&gt;&amp;mdash; &lt;a href="https://twitter.com/jeffbarr/status/1818461689920344321"&gt;Jeff Barr&lt;/a&gt;&lt;/p&gt;

    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/aws"&gt;aws&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/jeff-barr"&gt;jeff-barr&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/s3"&gt;s3&lt;/a&gt;&lt;/p&gt;



</summary><category term="aws"/><category term="jeff-barr"/><category term="s3"/></entry><entry><title>How an empty S3 bucket can make your AWS bill explode</title><link href="https://simonwillison.net/2024/Apr/30/how-an-empty-s3-bucket-can-make-your-aws-bill-explode/#atom-tag" rel="alternate"/><published>2024-04-30T11:19:21+00:00</published><updated>2024-04-30T11:19:21+00:00</updated><id>https://simonwillison.net/2024/Apr/30/how-an-empty-s3-bucket-can-make-your-aws-bill-explode/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://medium.com/@maciej.pocwierz/how-an-empty-s3-bucket-can-make-your-aws-bill-explode-934a383cb8b1"&gt;How an empty S3 bucket can make your AWS bill explode&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Maciej Pocwierz accidentally created an S3 bucket with a name that was already used as a placeholder value in a widely used piece of software. They saw 100 million PUT requests to their new bucket in a single day, racking up a big bill since AWS charges $5/million PUTs.&lt;/p&gt;
&lt;p&gt;It turns out AWS charge that same amount for PUTs that result in a 403 authentication error, a policy &lt;a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/RequesterPaysBuckets.html#ChargeDetails"&gt;that extends&lt;/a&gt; even to "requester pays" buckets!&lt;/p&gt;
&lt;p&gt;So, if you know someone's S3 bucket name you can DDoS their AWS bill just by flooding them with meaningless unauthenticated PUT requests.&lt;/p&gt;
&lt;p&gt;AWS support refunded Maciej's bill as an exception here, but I'd like to see them reconsider this broken policy entirely.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Update&lt;/strong&gt; from &lt;a href="https://twitter.com/jeffbarr/status/1785386554372042890"&gt;Jeff Barr&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;We agree that customers should not have to pay for unauthorized requests that they did not initiate. We’ll have more to share on exactly how we’ll help prevent these charges shortly.&lt;/p&gt;
&lt;/blockquote&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://lobste.rs/s/cy9i87/how_empty_s3_bucket_can_make_your_aws_bill"&gt;Lobste.rs&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/aws"&gt;aws&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/jeff-barr"&gt;jeff-barr&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/s3"&gt;s3&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/security"&gt;security&lt;/a&gt;&lt;/p&gt;



</summary><category term="aws"/><category term="jeff-barr"/><category term="s3"/><category term="security"/></entry><entry><title>s3-credentials 0.16</title><link href="https://simonwillison.net/2024/Apr/5/s3-credentials-016/#atom-tag" rel="alternate"/><published>2024-04-05T05:35:57+00:00</published><updated>2024-04-05T05:35:57+00:00</updated><id>https://simonwillison.net/2024/Apr/5/s3-credentials-016/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/simonw/s3-credentials/releases/tag/0.16"&gt;s3-credentials 0.16&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;I spent entirely too long this evening trying to figure out why files in my new supposedly public S3 bucket were unavailable to view. It turns out these days you need to set a &lt;code&gt;PublicAccessBlockConfiguration&lt;/code&gt; of &lt;code&gt;{"BlockPublicAcls": false, "IgnorePublicAcls": false, "BlockPublicPolicy": false, "RestrictPublicBuckets": false}&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;s3-credentials --create-bucket --public&lt;/code&gt; option now does that for you. I also added a &lt;code&gt;s3-credentials debug-bucket name-of-bucket&lt;/code&gt; command to help figure out why a bucket isn't working as expected.&lt;/p&gt;
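&lt;p&gt;For anyone applying that configuration by hand, here is a minimal sketch using the boto3 &lt;code&gt;put_public_access_block()&lt;/code&gt; call. The helper name &lt;code&gt;allow_public_access()&lt;/code&gt; is made up for this example, and this is not necessarily how &lt;code&gt;s3-credentials&lt;/code&gt; does it internally:&lt;/p&gt;

```python
PUBLIC_ACCESS_CONFIG = {
    "BlockPublicAcls": False,
    "IgnorePublicAcls": False,
    "BlockPublicPolicy": False,
    "RestrictPublicBuckets": False,
}


def allow_public_access(s3_client, bucket):
    """Switch off all four public access blocks for a bucket.

    s3_client is expected to be a boto3 S3 client, e.g. boto3.client("s3").
    """
    s3_client.put_public_access_block(
        Bucket=bucket,
        PublicAccessBlockConfiguration=PUBLIC_ACCESS_CONFIG,
    )
```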


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/aws"&gt;aws&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/s3"&gt;s3&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/s3-credentials"&gt;s3-credentials&lt;/a&gt;&lt;/p&gt;



</summary><category term="aws"/><category term="projects"/><category term="s3"/><category term="s3-credentials"/></entry><entry><title>S3 is files, but not a filesystem</title><link href="https://simonwillison.net/2024/Mar/10/s3-is-not-a-filesystem/#atom-tag" rel="alternate"/><published>2024-03-10T11:47:34+00:00</published><updated>2024-03-10T11:47:34+00:00</updated><id>https://simonwillison.net/2024/Mar/10/s3-is-not-a-filesystem/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://calpaterson.com/s3.html"&gt;S3 is files, but not a filesystem&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Cal Paterson helps some concepts click into place for me: S3 imitates a file system but has a number of critical missing features, the most important of which is the lack of partial updates. Any time you want to modify even a few bytes in a file you have to upload and overwrite the entire thing. Almost every database system is dependent on partial updates to function, which is why there are so few databases that can use S3 directly as a backend storage mechanism.&lt;/p&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://lobste.rs/s/t9d5z4/s3_is_files_not_filesystem"&gt;Lobste.rs&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/aws"&gt;aws&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/databases"&gt;databases&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/s3"&gt;s3&lt;/a&gt;&lt;/p&gt;



</summary><category term="aws"/><category term="databases"/><category term="s3"/></entry><entry><title>Slashing Data Transfer Costs in AWS by 99%</title><link href="https://simonwillison.net/2024/Jan/15/slashing-data-transfer-costs-in-aws-by-99/#atom-tag" rel="alternate"/><published>2024-01-15T22:22:10+00:00</published><updated>2024-01-15T22:22:10+00:00</updated><id>https://simonwillison.net/2024/Jan/15/slashing-data-transfer-costs-in-aws-by-99/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.bitsand.cloud/posts/slashing-data-transfer-costs/"&gt;Slashing Data Transfer Costs in AWS by 99%&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Brilliant trick by Daniel Kleinstein. If you have data in two availability zones in the same AWS region, transferring a TB will cost you $10 in ingress and $10 in egress at the inter-zone rates charged by AWS.&lt;/p&gt;

&lt;p&gt;But... transferring data to an S3 bucket in that same region is free (aside from S3 storage costs). And buckets are available with free transfer to all availability zones in their region, which means that TB of data can be transferred between availability zones for mere cents of S3 storage costs provided you delete the data as soon as it’s transferred.&lt;/p&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=38998516"&gt;Hacker News&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/aws"&gt;aws&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/s3"&gt;s3&lt;/a&gt;&lt;/p&gt;



</summary><category term="aws"/><category term="s3"/></entry><entry><title>How ima.ge.cx works</title><link href="https://simonwillison.net/2023/Dec/31/how-imagecx-works/#atom-tag" rel="alternate"/><published>2023-12-31T04:32:41+00:00</published><updated>2023-12-31T04:32:41+00:00</updated><id>https://simonwillison.net/2023/Dec/31/how-imagecx-works/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://awsteele.com/blog/2023/12/29/how-ima-ge-cx-works.html"&gt;How ima.ge.cx works&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;ima.ge.cx is Aidan Steele’s web tool for browsing the contents of Docker images hosted on Docker Hub. The architecture is really interesting: it’s a set of AWS Lambda functions, written in Go, that fetch metadata about the images using Step Functions and then cache it in DynamoDB and S3. It uses S3 Select to serve directory listings from newline-delimited JSON in S3 without retrieving the whole file.&lt;/p&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/__steele/status/1740597225913458938"&gt;@__steele&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/aws"&gt;aws&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/go"&gt;go&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/s3"&gt;s3&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/docker"&gt;docker&lt;/a&gt;&lt;/p&gt;



</summary><category term="aws"/><category term="go"/><category term="s3"/><category term="docker"/></entry><entry><title>Litestream backups for Datasette Cloud (and weeknotes)</title><link href="https://simonwillison.net/2022/Aug/11/litestream-backups/#atom-tag" rel="alternate"/><published>2022-08-11T17:19:52+00:00</published><updated>2022-08-11T17:19:52+00:00</updated><id>https://simonwillison.net/2022/Aug/11/litestream-backups/#atom-tag</id><summary type="html">
    &lt;p&gt;My main focus this week has been adding robust backups to the forthcoming Datasette Cloud.&lt;/p&gt;
&lt;p&gt;Datasette Cloud is a SaaS service for &lt;a href="https://datasette.io"&gt;Datasette&lt;/a&gt;. It allows people to create a private Datasette instance where they can upload data, visualize and transform it and share it with other members of their team. You can join the waiting list to try it out using &lt;a href="https://docs.google.com/forms/d/e/1FAIpQLSeMRYHBHXlWGDkjCqhAOinDrKEVwqNR5GfPs3iEXn8LzDT2Qg/viewform?embedded=true"&gt;this form&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I'm building Datasette Cloud on &lt;a href="https://fly.io/"&gt;Fly&lt;/a&gt;, specifically on &lt;a href="https://fly.io/blog/fly-machines/"&gt;Fly Machines&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Security is a big concern for Datasette Cloud. Teams should only be able to access their own data - bugs where users accidentally (or maliciously) access data for another team should be protected against as much as possible.&lt;/p&gt;
&lt;p&gt;To help guarantee that, I've designed Datasette Cloud so that each team gets their own, dedicated instance, running in a &lt;a href="https://firecracker-microvm.github.io/"&gt;Firecracker VM&lt;/a&gt; managed by Fly. Their data lives in a dedicated volume.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://fly.io/docs/reference/volumes/"&gt;Fly volumes&lt;/a&gt; already implement snapshot backups, but I'm interested in defence in depth. This is where &lt;a href="https://litestream.io/"&gt;Litestream&lt;/a&gt; comes in (coincidentally now &lt;a href="https://fly.io/blog/all-in-on-sqlite-litestream/"&gt;part of Fly&lt;/a&gt;, although it wasn't when I first selected it as my backup strategy).&lt;/p&gt;
&lt;p&gt;I'm using Litestream to continuously back up the data for each Datasette Cloud team to an S3 bucket. In the case of a complete failure of a volume, I can restore data from a backup that should be at most a few seconds out of date. Litestream also gives me point-in-time backups, such that I can recover a previous version of the data within a configurable retention window.&lt;/p&gt;
&lt;h4&gt;Keeping backups isolated&lt;/h4&gt;
&lt;p&gt;Litestream &lt;a href="https://litestream.io/how-it-works/"&gt;works&lt;/a&gt; by writing a constant stream of pages from SQLite's WAL (Write-Ahead Log) up to an S3 bucket. It needs the ability to both read from and write to S3.&lt;/p&gt;
&lt;p&gt;This requires making S3 credentials available within the containers that run Datasette and Litestream for each team account.&lt;/p&gt;
&lt;p&gt;Credentials in those containers are not visible to the users of the software, but I still wanted to be confident that if the credentials leaked in some way the isolation between teams would be maintained.&lt;/p&gt;
&lt;p&gt;Initially I thought about having a separate S3 bucket for each team, but it turns out AWS has a default limit of 100 buckets per account, and a hard limit of 1,000. I aspire to have more than 1,000 customers, so this limit makes a bucket-per-team seem like the wrong solution.&lt;/p&gt;
&lt;p&gt;I've learned an absolute ton about S3 and AWS permissions building my &lt;a href="https://github.com/simonw/s3-credentials"&gt;s3-credentials&lt;/a&gt; tool for creating credentials for accessing S3.&lt;/p&gt;
&lt;p&gt;One of the tricks I've learned is that it's possible to create temporary, time-limited credentials that &lt;a href="https://simonwillison.net/2022/Jan/18/weeknotes/"&gt;only work for a prefix&lt;/a&gt; (effectively a folder) within an S3 bucket.&lt;/p&gt;
&lt;p&gt;This means I can run Litestream with credentials that are specific to the team - that can read and write only from the &lt;code&gt;team-ID/&lt;/code&gt; prefix in the S3 bucket I am using to store the backups.&lt;/p&gt;
&lt;h4&gt;Obtaining temporary credentials&lt;/h4&gt;
&lt;p&gt;My &lt;code&gt;s3-credentials&lt;/code&gt; tool can create credentials for a prefix within an S3 bucket like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;s3-credentials create my-bucket-for-backups \
  --duration 12h \
  --prefix team-56/
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This command uses the &lt;a href="https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sts.html#STS.Client.assume_role"&gt;sts.assume_role()&lt;/a&gt; AWS method to create credentials that allow access to that bucket, attaching &lt;a href="https://github.com/simonw/s3-credentials/blob/0.12.1/README.md#--prefix-my-prefix"&gt;this generated JSON policy&lt;/a&gt; to it in order to restrict access to the provided prefix.&lt;/p&gt;
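&lt;p&gt;To give a feel for what a prefix-restricted policy looks like, here is a rough Python sketch that builds one. The function name &lt;code&gt;prefix_policy()&lt;/code&gt; and the exact list of actions are assumptions for illustration. The policy that &lt;code&gt;s3-credentials&lt;/code&gt; actually generates is the one in the linked README and may differ. A policy like this is serialized to JSON and passed as the &lt;code&gt;Policy&lt;/code&gt; argument to &lt;code&gt;assume_role()&lt;/code&gt;:&lt;/p&gt;

```python
import json


def prefix_policy(bucket, prefix):
    """Build an IAM policy limiting read/write access to one prefix."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing is granted on the bucket, limited to the prefix
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [f"arn:aws:s3:::{bucket}"],
                "Condition": {"StringLike": {"s3:prefix": [f"{prefix}*"]}},
            },
            {
                # Object reads and writes are granted on keys under the prefix
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
                "Resource": [f"arn:aws:s3:::{bucket}/{prefix}*"],
            },
        ],
    }


print(json.dumps(prefix_policy("my-bucket", "team-56/"), indent=2))
```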
&lt;p&gt;I extracted the relevant Python code from &lt;code&gt;s3-credentials&lt;/code&gt; and used it to create a private API endpoint in my Datasette Cloud management server which could return the temporary credentials needed by the team container.&lt;/p&gt;
&lt;p&gt;With the endpoint in place, my code for launching a team container can do this:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Create the volume and machine for that team (if they do not yet exist)&lt;/li&gt;
&lt;li&gt;Generate a signed secret token that the machine container can exchange for its S3 credentials&lt;/li&gt;
&lt;li&gt;Launch the machine container, passing it the secret token&lt;/li&gt;
&lt;li&gt;On launch, the container runs a script which exchanges that secret token for its 12 hour S3 credentials, using the private API endpoint I created&lt;/li&gt;
&lt;li&gt;Those credentials are used to populate the &lt;code&gt;AWS_ACCESS_KEY_ID&lt;/code&gt;, &lt;code&gt;AWS_SECRET_ACCESS_KEY&lt;/code&gt; and &lt;code&gt;AWS_SESSION_TOKEN&lt;/code&gt; environment variables used by Litestream&lt;/li&gt;
&lt;li&gt;Start Litestream, which then starts Datasette&lt;/li&gt;
&lt;/ul&gt;
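&lt;p&gt;The token exchange step in that list could look roughly like this. This is a sketch under stated assumptions, not the actual Datasette Cloud code: the endpoint URL, the bearer-token header and the response keys (which I have modelled on the shape STS returns) are all hypothetical.&lt;/p&gt;

```python
import json
import urllib.request


def fetch_credentials(endpoint_url, secret_token):
    """POST the long-lived token to a (hypothetical) endpoint and return its JSON."""
    request = urllib.request.Request(
        endpoint_url,
        data=b"",
        headers={"Authorization": "Bearer " + secret_token},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def credentials_to_env(credentials):
    """Map STS-style credential keys to the environment variables Litestream reads."""
    return {
        "AWS_ACCESS_KEY_ID": credentials["AccessKeyId"],
        "AWS_SECRET_ACCESS_KEY": credentials["SecretAccessKey"],
        "AWS_SESSION_TOKEN": credentials["SessionToken"],
    }


# Made-up response values, so this example runs without any network access.
example = {
    "AccessKeyId": "AKIAEXAMPLE",
    "SecretAccessKey": "example-secret",
    "SessionToken": "example-token",
}
env = credentials_to_env(example)
```

&lt;p&gt;A startup script would then merge &lt;code&gt;env&lt;/code&gt; into &lt;code&gt;os.environ&lt;/code&gt; before launching Litestream.&lt;/p&gt;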
&lt;h4&gt;Restarting every 12 hours&lt;/h4&gt;
&lt;p&gt;You may be wondering why I bothered with that initial secret token - why not just pass the temporary AWS credentials to the container when I launch it?&lt;/p&gt;
&lt;p&gt;The reason for this is that I need to be able to obtain fresh credentials every 12 hours.&lt;/p&gt;
&lt;p&gt;A really neat feature of Fly Machines is that they support scale-to-zero. You can stop them, and Fly will automatically restart them the next time they receive traffic.&lt;/p&gt;
&lt;p&gt;All you need to do is call &lt;code&gt;sys.exit(0)&lt;/code&gt; in your Python code (or the equivalent in any other language) and Fly will stop your container... and then restart it again with a couple of seconds of cold start time the next time an HTTP request for your container hits the Fly router.&lt;/p&gt;
&lt;p&gt;So far I'm mainly using this to avoid the cost of running containers when they aren't actually in use. But there's a neat benefit when it comes to Litestream too.&lt;/p&gt;
&lt;p&gt;I'm using S3 credentials which expire after 12 hours. This means I need to periodically refresh the credentials and restart Litestream or it will stop being able to write to the S3 bucket.&lt;/p&gt;
&lt;p&gt;After considering a few ways of doing this, I selected the simplest to implement: have Datasette call &lt;code&gt;sys.exit(0)&lt;/code&gt; after ten hours, and let Fly restart the container causing my startup script to fetch freshly generated 12 hour credentials and pass them to Litestream.&lt;/p&gt;
&lt;p&gt;I implemented this by adding it as a new setting to my existing &lt;a href="https://datasette.io/plugins/datasette-scale-to-zero"&gt;datasette-scale-to-zero&lt;/a&gt; plugin. You can now configure that with &lt;code&gt; "max-age": "10h"&lt;/code&gt; and it will shut down Datasette once the server has been running for that long.&lt;/p&gt;
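&lt;p&gt;The logic behind a setting like this is small. Here is a rough sketch of how a &lt;code&gt;"10h"&lt;/code&gt; value could be parsed and checked. It is not the plugin's actual code: the helper names and the supported unit suffixes are assumptions made for illustration.&lt;/p&gt;

```python
import time

# Assumed unit suffixes; the real plugin may accept a different set.
UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}


def parse_duration(value):
    """Convert a string such as "10h" or "30m" into a number of seconds."""
    number, unit = value[:-1], value[-1]
    return int(number) * UNIT_SECONDS[unit]


def should_exit(started_at, now, max_age_seconds):
    """True once the server has been running for at least max_age_seconds."""
    return now - started_at >= max_age_seconds


max_age = parse_duration("10h")
started_at = time.monotonic()
# A real server would run this check periodically and call sys.exit(0)
# when it returns True. Here the clock is simulated by adding max_age.
expired = should_exit(started_at, started_at + max_age, max_age)
```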
&lt;p&gt;Why does this require my own secret token system? Because when the container is restarted, it needs to make an authenticated call to my endpoint to retrieve those fresh S3 credentials. Fly persists environment variable secrets between restarts to the container, so that secret can be long-lived even while it is exchanged for short-term S3 credentials.&lt;/p&gt;
&lt;p&gt;I only just put the new backup system in place, so I'm exercising it a bit before I open things up to trial users - but so far it's looking like a very robust solution to the problem.&lt;/p&gt;
&lt;h4&gt;s3-ocr improvements&lt;/h4&gt;
&lt;p&gt;I released a &lt;a href="https://datasette.io/tools/s3-ocr"&gt;few new versions&lt;/a&gt; of &lt;a href="https://simonwillison.net/2022/Jun/30/s3-ocr/"&gt;s3-ocr&lt;/a&gt; this week, as part of my ongoing project working with the San Francisco Microscopical Society team to release a searchable version of their scanned document archives.&lt;/p&gt;
&lt;p&gt;The two main improvements are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A new &lt;code&gt;--dry-run&lt;/code&gt; option to &lt;code&gt;s3-ocr start&lt;/code&gt; which shows you what the tool will do without making any changes to your S3 bucket, or triggering any OCR jobs. &lt;a href="https://github.com/simonw/s3-ocr/issues/22"&gt;#22&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;s3-ocr start&lt;/code&gt; used to fail with an error if running it would create more than 100 (or 600 depending on your region) concurrent OCR jobs. The tool now knows how to identify that error and pause and retry starting the jobs instead. &lt;a href="https://github.com/simonw/s3-ocr/issues/21"&gt;#21&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;
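&lt;p&gt;The pause-and-retry behaviour can be sketched like this. It is a simplified illustration rather than the code in &lt;code&gt;s3-ocr&lt;/code&gt;: the &lt;code&gt;LimitExceeded&lt;/code&gt; class stands in for the real AWS error, and the function names, pause length and attempt count are assumptions.&lt;/p&gt;

```python
import time


class LimitExceeded(Exception):
    """Stand-in for the error raised when too many OCR jobs are running."""


def start_with_retry(start_job, key, pause_seconds=30, max_attempts=5, sleep=time.sleep):
    """Call start_job(key), pausing and retrying while the concurrency limit is hit."""
    for attempt in range(1, max_attempts + 1):
        try:
            return start_job(key)
        except LimitExceeded:
            # Give up and re-raise once the final attempt has failed.
            if attempt == max_attempts:
                raise
            sleep(pause_seconds)


# Fake job starter that hits the limit twice before succeeding.
calls = []


def fake_start_job(key):
    calls.append(key)
    if len(calls) == 3:
        return "job-for-" + key
    raise LimitExceeded()


# The sleep function is replaced so the example finishes instantly.
job_id = start_with_retry(fake_start_job, "example.pdf", sleep=lambda seconds: None)
```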
&lt;p&gt;The fix that took the most time is this: installations of the tool no longer arbitrarily fail to work depending on the environment you install them into!&lt;/p&gt;
&lt;p&gt;Solving this took me the best part of a day. The short version is this: Click 8.1.0 introduced a new feature that lets you use &lt;code&gt;@cli.command&lt;/code&gt; as a decorator instead of &lt;code&gt;@cli.command()&lt;/code&gt;. This meant that installing &lt;code&gt;s3-ocr&lt;/code&gt; in an environment that already had a previous version of Click would result in silent errors.&lt;/p&gt;
&lt;p&gt;The solution is simple: pin to &lt;code&gt;click&amp;gt;=8.1.0&lt;/code&gt; in the project dependencies if you plan to use this new syntax.&lt;/p&gt;
&lt;p&gt;If I'd read the Click &lt;a href="https://click.palletsprojects.com/en/8.1.x/changes/#version-8-1-0"&gt;changelog more closely&lt;/a&gt; I would have saved myself a whole lot of time.&lt;/p&gt;
&lt;p&gt;Issues &lt;a href="https://github.com/simonw/s3-ocr/issues/25"&gt;#25&lt;/a&gt; and &lt;a href="https://github.com/simonw/s3-ocr/issues/26"&gt;#26&lt;/a&gt; detail the many false turns I took trying to figure this out.&lt;/p&gt;
&lt;h4&gt;More fun with GPT-3 and DALL-E&lt;/h4&gt;
&lt;p&gt;This tweet scored over a million impressions on Twitter:&lt;/p&gt;

&lt;blockquote class="twitter-tweet"&gt;&lt;p lang="en" dir="ltr"&gt;New hobby: prototyping video games in 60 seconds using a combination of GPT-3 and DALL-E&lt;br /&gt;&lt;br /&gt;Here&amp;#39;s &amp;quot;Raccoon Heist&amp;quot; &lt;a href="https://t.co/xQ3Vm8p2XW"&gt;pic.twitter.com/xQ3Vm8p2XW&lt;/a&gt;&lt;/p&gt;- Simon Willison (@simonw) &lt;a href="https://twitter.com/simonw/status/1555626060384911360?ref_src=twsrc%5Etfw"&gt;August 5, 2022&lt;/a&gt;&lt;/blockquote&gt;

&lt;p&gt;As this got retweeted outside of my usual circles it started confusing people who thought the "prototype" was a working game, as opposed to a fake screenshot and a paragraph of descriptive text! I wasn't kidding when I said I spent 60 seconds on this.&lt;/p&gt;
&lt;p&gt;I also figured out how to use GPT-3 to write &lt;code&gt;jq&lt;/code&gt; one-liners. I love &lt;code&gt;jq&lt;/code&gt; but I have to look up how to use it every time, so having GPT-3 do the work for me is a pretty neat time saver. More on that in this TIL: &lt;a href="https://til.simonwillison.net/gpt3/jq"&gt;Using GPT-3 to figure out jq recipes&lt;/a&gt;&lt;/p&gt;
&lt;h4&gt;Releases this week&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/s3-ocr"&gt;s3-ocr&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/s3-ocr/releases/tag/0.6.3"&gt;0.6.3&lt;/a&gt; - (&lt;a href="https://github.com/simonw/s3-ocr/releases"&gt;9 releases total&lt;/a&gt;) - 2022-08-10
&lt;br /&gt;Tools for running OCR against files stored in S3&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette-scale-to-zero"&gt;datasette-scale-to-zero&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/datasette-scale-to-zero/releases/tag/0.2"&gt;0.2&lt;/a&gt; - (&lt;a href="https://github.com/simonw/datasette-scale-to-zero/releases"&gt;4 releases total&lt;/a&gt;) - 2022-08-05
&lt;br /&gt;Quit Datasette if it has not received traffic for a specified time period&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/shot-scraper"&gt;shot-scraper&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/shot-scraper/releases/tag/0.14.3"&gt;0.14.3&lt;/a&gt; - (&lt;a href="https://github.com/simonw/shot-scraper/releases"&gt;18 releases total&lt;/a&gt;) - 2022-08-02
&lt;br /&gt;A command-line utility for taking automated screenshots of websites&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/s3-credentials"&gt;s3-credentials&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/s3-credentials/releases/tag/0.12.1"&gt;0.12.1&lt;/a&gt; - (&lt;a href="https://github.com/simonw/s3-credentials/releases"&gt;13 releases total&lt;/a&gt;) - 2022-08-01
&lt;br /&gt;A tool for creating credentials for accessing S3 buckets&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette-sqlite-fts4"&gt;datasette-sqlite-fts4&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/datasette-sqlite-fts4/releases/tag/0.3.2"&gt;0.3.2&lt;/a&gt; - (&lt;a href="https://github.com/simonw/datasette-sqlite-fts4/releases"&gt;2 releases total&lt;/a&gt;) - 2022-07-31&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;TIL this week&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/sqlite/related-content"&gt;Related content with SQLite FTS and a Datasette template function&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/aws/boto-command-line"&gt;Using boto3 from the command line&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/sqlite/trying-macos-extensions"&gt;Trying out SQLite extensions on macOS&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/pytest/mocking-boto"&gt;Mocking a Textract LimitExceededException with boto&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/gpt3/jq"&gt;Using GPT-3 to figure out jq recipes&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ocr"&gt;ocr&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/s3"&gt;s3&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/weeknotes"&gt;weeknotes&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette-cloud"&gt;datasette-cloud&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/fly"&gt;fly&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/litestream"&gt;litestream&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gpt-3"&gt;gpt-3&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/dalle"&gt;dalle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gpt"&gt;gpt&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="ocr"/><category term="s3"/><category term="datasette"/><category term="weeknotes"/><category term="datasette-cloud"/><category term="fly"/><category term="litestream"/><category term="gpt-3"/><category term="dalle"/><category term="gpt"/></entry><entry><title>s3-ocr: Extract text from PDF files stored in an S3 bucket</title><link href="https://simonwillison.net/2022/Jun/30/s3-ocr/#atom-tag" rel="alternate"/><published>2022-06-30T21:40:27+00:00</published><updated>2022-06-30T21:40:27+00:00</updated><id>https://simonwillison.net/2022/Jun/30/s3-ocr/#atom-tag</id><summary type="html">
    &lt;p&gt;I've released &lt;strong&gt;&lt;a href="https://datasette.io/tools/s3-ocr"&gt;s3-ocr&lt;/a&gt;&lt;/strong&gt;, a new tool that runs Amazon's &lt;a href="https://aws.amazon.com/textract/"&gt;Textract&lt;/a&gt; OCR text extraction against PDF files in an S3 bucket, then writes the resulting text out to a SQLite database with full-text search configured so you can run searches against the extracted data.&lt;/p&gt;
&lt;p&gt;You can search through a demo of 697 pages of OCRd text at &lt;a href="https://s3-ocr-demo.datasette.io/pages/pages"&gt;s3-ocr-demo.datasette.io/pages/pages&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Textract works extremely well: it handles dodgy scanned PDFs full of typewritten code and reads handwritten text better than I can! It &lt;a href="https://aws.amazon.com/textract/pricing/"&gt;charges&lt;/a&gt; $1.50 per thousand pages processed.&lt;/p&gt;
&lt;h4&gt;Why I built this&lt;/h4&gt;
&lt;p&gt;My initial need for this is a collaboration I have running with the &lt;a href="https://sfmicrosociety.org/"&gt;San Francisco Microscopical Society&lt;/a&gt;. They've been digitizing their archives - which stretch back to 1870! - and were looking for help turning the digital scans into something more useful.&lt;/p&gt;
&lt;p&gt;The archives are full of hand-written and type-written notes, scanned and stored as PDFs.&lt;/p&gt;
&lt;p&gt;I decided to wrap my work up as a tool because I'm sure there are a LOT of organizations out there with a giant bucket of PDF files that would benefit from being able to easily run OCR and turn the results into a searchable database.&lt;/p&gt;
&lt;p&gt;Running Textract directly against large numbers of files is somewhat inconvenient (here's my &lt;a href="https://til.simonwillison.net/aws/ocr-pdf-textract"&gt;earlier TIL about it&lt;/a&gt;). &lt;code&gt;s3-ocr&lt;/code&gt; is my attempt to make it easier.&lt;/p&gt;
&lt;h4&gt;Tutorial: How I built that demo&lt;/h4&gt;
&lt;p&gt;The demo instance uses three PDFs from the Library of Congress Harry Houdini Collection &lt;a href="https://archive.org/search.php?query=creator%3A%22Harry+Houdini+Collection+%28Library+of+Congress%29+DLC%22"&gt;on the Internet Archive&lt;/a&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://archive.org/details/unmaskingrobert00houdgoog"&gt;The unmasking of Robert-Houdin&lt;/a&gt; from 1908&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://archive.org/details/practicalmagicia00harr"&gt;The practical magician and ventriloquist's guide: a practical manual of fireside magic and conjuring illusions: containing also complete instructions for acquiring &amp;amp; practising the art of ventriloquism&lt;/a&gt; from 1876&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://archive.org/details/latestmagicbeing00hoff"&gt;Latest magic, being original conjuring tricks&lt;/a&gt; from 1918&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I started by downloading PDFs of those three files.&lt;/p&gt;
&lt;p&gt;Then I installed the two tools I needed:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;pip install s3-ocr s3-credentials
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I used my &lt;a href="https://datasette.io/tools/s3-credentials"&gt;s3-credentials&lt;/a&gt; tool to create a new S3 bucket and credentials with the ability to write files to it, with the new &lt;a href="https://github.com/simonw/s3-credentials/issues/72"&gt;--statement option&lt;/a&gt; (which I released today) to add &lt;code&gt;textract&lt;/code&gt; permissions to the generated credentials:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;s3-credentials create s3-ocr-demo --statement '{
  "Effect": "Allow",
  "Action": "textract:*",
  "Resource": "*"
}' --create-bucket &amp;gt; ocr.json
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;(Note that you don't need to use &lt;code&gt;s3-credentials&lt;/code&gt; at all if you have AWS credentials configured on your machine with root access to your account - just leave off the &lt;code&gt;-a ocr.json&lt;/code&gt; options in the following examples.)&lt;/p&gt;
&lt;p&gt;&lt;code&gt;s3-ocr-demo&lt;/code&gt; is now a bucket I can use for the demo. &lt;code&gt;ocr.json&lt;/code&gt; contains JSON with an access key and secret key for an IAM user account that can interact with that bucket, and also has permission to access the AWS Textract APIs.&lt;/p&gt;
&lt;p&gt;I uploaded my three PDFs to the bucket:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;s3-credentials put-object s3-ocr-demo latestmagicbeing00hoff.pdf latestmagicbeing00hoff.pdf -a ocr.json
s3-credentials put-object s3-ocr-demo practicalmagicia00harr.pdf practicalmagicia00harr.pdf -a ocr.json
s3-credentials put-object s3-ocr-demo unmaskingrobert00houdgoog.pdf unmaskingrobert00houdgoog.pdf -a ocr.json
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;(I often use &lt;a href="https://panic.com/transmit/"&gt;Transmit&lt;/a&gt; as a GUI for this kind of operation.)&lt;/p&gt;
&lt;p&gt;Then I kicked off OCR jobs against every PDF file in the bucket:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;% s3-ocr start s3-ocr-demo --all -a ocr.json 
Found 0 files with .s3-ocr.json out of 3 PDFs
Starting OCR for latestmagicbeing00hoff.pdf, Job ID: f66bc2d00fb75d1c42d1f829e5b6788891f9799fda404c4550580959f65a5402
Starting OCR for practicalmagicia00harr.pdf, Job ID: ef085728135d524a39bc037ad6f7253284b1fdbeb728dddcfbb260778d902b55
Starting OCR for unmaskingrobert00houdgoog.pdf, Job ID: 93bd46f02eb099eca369c41e384836d2bd3199b95d415c0257ef3fa3602cbef9
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;--all&lt;/code&gt; option scans for any file with a &lt;code&gt;.pdf&lt;/code&gt; extension. You can pass explicit file names instead if you just want to process one or two files at a time.&lt;/p&gt;
&lt;p&gt;This returns straight away, but the OCR process itself can take several minutes depending on the size of the files.&lt;/p&gt;
&lt;p&gt;The job IDs can be used to inspect the progress of each task like so:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;% s3-ocr inspect-job f66bc2d00fb75d1c42d1f829e5b6788891f9799fda404c4550580959f65a5402
{
  "DocumentMetadata": {
    "Pages": 244
  },
  "JobStatus": "SUCCEEDED",
  "DetectDocumentTextModelVersion": "1.0"
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Once the job completed, I could preview the text extracted from the PDF like so:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;% s3-ocr text s3-ocr-demo latestmagicbeing00hoff.pdf
111
.
116

LATEST MAGIC
BEING
ORIGINAL CONJURING TRICKS
INVENTED AND ARRANGED
BY
PROFESSOR HOFFMANN
(ANGELO LEWIS, M.A.)
Author of "Modern Magic," etc.
WITH NUMEROUS ILLUSTRATIONS
FIRST EDITION
NEW YORK
SPON &amp;amp; CHAMBERLAIN, 120 LIBERTY ST.
...
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;To create a SQLite database with a table containing rows for every page of scanned text, I ran this command:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;% s3-ocr index s3-ocr-demo pages.db -a ocr.json 
Fetching job details  [####################################]  100%
Populating pages table  [####--------------------------------]   13%  00:00:34
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I then published the resulting &lt;code&gt;pages.db&lt;/code&gt; SQLite database using Datasette - you can &lt;a href="https://s3-ocr-demo.datasette.io/pages"&gt;explore it here&lt;/a&gt;.&lt;/p&gt;
&lt;h4&gt;How s3-ocr works&lt;/h4&gt;
&lt;p&gt;&lt;code&gt;s3-ocr&lt;/code&gt; works by calling Amazon's S3 and Textract APIs.&lt;/p&gt;
&lt;p&gt;Textract only works against PDF files in &lt;a href="https://docs.aws.amazon.com/textract/latest/dg/api-async.html"&gt;asynchronous mode&lt;/a&gt;: you call an API endpoint to tell it "start running OCR against this PDF file in this S3 bucket", then wait for it to finish - which can take several minutes.&lt;/p&gt;
&lt;p&gt;It defaults to storing the OCR results in its own storage, expiring after seven days. You can instead tell it to store them in your own S3 bucket - I use that option in &lt;code&gt;s3-ocr&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;A design challenge I faced was that I wanted to make the command restartable and resumable: if the user cancelled the task, I wanted to be able to pick up from where it had got to. I also wanted to be able to run it again after adding more PDFs to the bucket without repeating work for the previously processed files.&lt;/p&gt;
&lt;p&gt;I also needed to persist those job IDs: Textract writes the OCR results to keys in the bucket called &lt;code&gt;textract-output/JOB_ID/1-?&lt;/code&gt; - but there's no indication as to which PDF file the results correspond to.&lt;/p&gt;
&lt;p&gt;My solution is to write tiny extra JSON files to the bucket when the OCR job is first started.&lt;/p&gt;
&lt;p&gt;If you have a file called &lt;code&gt;latestmagicbeing00hoff.pdf&lt;/code&gt; the &lt;code&gt;start&lt;/code&gt; command will create a new file called &lt;code&gt;latestmagicbeing00hoff.pdf.s3-ocr.json&lt;/code&gt; with the following content:&lt;/p&gt;
&lt;div class="highlight highlight-source-json"&gt;&lt;pre&gt;{
  &lt;span class="pl-ent"&gt;"job_id"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;f66bc2d00fb75d1c42d1f829e5b6788891f9799fda404c4550580959f65a5402&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
  &lt;span class="pl-ent"&gt;"etag"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;span class="pl-cce"&gt;\"&lt;/span&gt;d79af487579dcbbef26c9b3be763eb5e-2&lt;span class="pl-cce"&gt;\"&lt;/span&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;
}&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This associates the job ID with the PDF file. It also records the original ETag of the PDF file - this is so in the future I can implement a system that can re-run OCR if the PDF has been updated.&lt;/p&gt;
&lt;p&gt;The existence of these files lets me do two things:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;If you run &lt;code&gt;s3-ocr start s3-ocr-demo --all&lt;/code&gt; it can avoid re-submitting PDF files that have already been sent for OCR, by checking for the existence of the &lt;code&gt;.s3-ocr.json&lt;/code&gt; file.&lt;/li&gt;
&lt;li&gt;When you later ask for the results of the OCR it can use these files to associate the PDF with the results.&lt;/li&gt;
&lt;/ul&gt;
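&lt;p&gt;The first of those checks boils down to a set lookup over the keys in the bucket. Here's a minimal sketch of the idea, working on a plain list of key names rather than a live bucket. The function name is hypothetical and this is not the tool's actual implementation.&lt;/p&gt;

```python
SUFFIX = ".s3-ocr.json"


def pdfs_needing_ocr(keys):
    """Return the PDF keys that have no matching .s3-ocr.json marker file."""
    existing = set(keys)
    pdfs = [key for key in keys if key.lower().endswith(".pdf")]
    return [key for key in pdfs if key + SUFFIX not in existing]


# One PDF already has a marker file, the other does not.
keys = [
    "latestmagicbeing00hoff.pdf",
    "latestmagicbeing00hoff.pdf.s3-ocr.json",
    "practicalmagicia00harr.pdf",
]
pending = pdfs_needing_ocr(keys)
```

&lt;p&gt;Here &lt;code&gt;pending&lt;/code&gt; contains only &lt;code&gt;practicalmagicia00harr.pdf&lt;/code&gt;, because the other PDF already has its marker file.&lt;/p&gt;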
&lt;p&gt;Scattering &lt;code&gt;.s3-ocr.json&lt;/code&gt; files all over the place feels a little messy, so I have an &lt;a href="https://github.com/simonw/s3-ocr/issues/14"&gt;open issue&lt;/a&gt; considering moving them all to a &lt;code&gt;s3-ocr/&lt;/code&gt; prefix in the bucket instead.&lt;/p&gt;
&lt;h4&gt;Try it and let me know what you think&lt;/h4&gt;
&lt;p&gt;This is a brand new project, but I think it's ready for other people to start trying it out.&lt;/p&gt;
&lt;p&gt;I ran it against around 7,000 pages from 531 PDF files in the San Francisco Microscopical Society archive and it seemed to work well!&lt;/p&gt;
&lt;p&gt;If you try this out and it works (or it doesn't work) please &lt;a href="https://twitter.com/simonw"&gt;let me know via Twitter&lt;/a&gt; or &lt;a href="https://github.com/simonw/s3-ocr"&gt;GitHub&lt;/a&gt;.&lt;/p&gt;
&lt;h4&gt;A challenging example page&lt;/h4&gt;
&lt;p&gt;Here's one of the more challenging pages I processed using Textract:&lt;/p&gt;
&lt;p&gt;&lt;img alt="A very old page of difficult to read handwriting" src="https://static.simonwillison.net/static/2022/s3-ocr-sample-handwriting.jpg" style="max-width: 100%"/&gt;&lt;/p&gt;
&lt;p&gt;Here's the result:&lt;/p&gt;
&lt;pre&gt;
In. In J a ... the Joe 14
162
Volxv
Lalpa spinosa, Eggt bud development. of
146
Farcomas spindle. cells in nested gowers 271
Fayigaga tridactylites, leaf glaur of ruce 33
staining &amp;amp; mounting
Stiles 133
tilica films, a structure of Diatoins morehouse 38
thile new microscopic
Broeck 22 /
Smith reproduction in the huntroom tribe
6
Trakes, develop mouht succession of the porsion tango/229
Soirce President of the Roy: truc: Soo
285
forby, Presidents address
105
pongida, difficulties of classification
238
tage, american adjustable concentric
150
ttlese staining &amp;amp; mountring wood sections 133
Stodder, Frustulia Iasconica, havicula
chomboides, &amp;amp; havi cula crassinervis 265
Vol XVI
falicylic acid u movorcopy
160
falpar enctry ology of
Brooke 9.97
Sanderson micros: characters If inflammation
43
tap, circulation of the
42
Jars, structure of the genus Brisinga
44
latter throvite connective substances 191- 241
Jehorey Cessification in birds, formation
of ed blood corpuseles during the
ossification process
by
&lt;/pre&gt;
&lt;h4&gt;Releases this week&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/s3-ocr"&gt;s3-ocr&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/s3-ocr/releases/tag/0.4"&gt;0.4&lt;/a&gt; - (&lt;a href="https://github.com/simonw/s3-ocr/releases"&gt;4 releases total&lt;/a&gt;) - 2022-06-30
&lt;br /&gt;Tools for running OCR against files stored in S3&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/s3-credentials"&gt;s3-credentials&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/s3-credentials/releases/tag/0.12"&gt;0.12&lt;/a&gt; - (&lt;a href="https://github.com/simonw/s3-credentials/releases"&gt;12 releases total&lt;/a&gt;) - 2022-06-30
&lt;br /&gt;A tool for creating credentials for accessing S3 buckets&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette-scale-to-zero"&gt;datasette-scale-to-zero&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/datasette-scale-to-zero/releases/tag/0.1.2"&gt;0.1.2&lt;/a&gt; - (&lt;a href="https://github.com/simonw/datasette-scale-to-zero/releases"&gt;3 releases total&lt;/a&gt;) - 2022-06-23
&lt;br /&gt;Quit Datasette if it has not received traffic for a specified time period&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;TIL this week&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/sqlite/one-line-csv-operations"&gt;One-liner for running queries against CSV files with SQLite&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/bash/ignore-errors"&gt;Ignoring errors in a section of a Bash script&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/aws/ocr-pdf-textract"&gt;Running OCR against a PDF file with AWS Textract&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/aws"&gt;aws&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ocr"&gt;ocr&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pdf"&gt;pdf&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/s3"&gt;s3&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/weeknotes"&gt;weeknotes&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/s3-credentials"&gt;s3-credentials&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="aws"/><category term="ocr"/><category term="pdf"/><category term="projects"/><category term="s3"/><category term="weeknotes"/><category term="s3-credentials"/></entry><entry><title>s3-credentials 0.8</title><link href="https://simonwillison.net/2021/Dec/7/s3-credentials/#atom-tag" rel="alternate"/><published>2021-12-07T07:04:35+00:00</published><updated>2021-12-07T07:04:35+00:00</updated><id>https://simonwillison.net/2021/Dec/7/s3-credentials/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/simonw/s3-credentials/releases/tag/0.8"&gt;s3-credentials 0.8&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
The latest release of my s3-credentials CLI tool for creating S3 buckets with credentials to access them (with read-write, read-only or write-only policies) adds a new --public option for creating buckets that allow public access, such that anyone who knows a filename can download a file. The s3-credentials put-object command also now sets the appropriate Content-Type header on the uploaded object.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/cli"&gt;cli&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/s3"&gt;s3&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/s3-credentials"&gt;s3-credentials&lt;/a&gt;&lt;/p&gt;



</summary><category term="cli"/><category term="projects"/><category term="s3"/><category term="s3-credentials"/></entry><entry><title>Weeknotes: git-history, created for a Git scraping workshop</title><link href="https://simonwillison.net/2021/Nov/15/weeknotes-git-history/#atom-tag" rel="alternate"/><published>2021-11-15T04:10:50+00:00</published><updated>2021-11-15T04:10:50+00:00</updated><id>https://simonwillison.net/2021/Nov/15/weeknotes-git-history/#atom-tag</id><summary type="html">
    &lt;p&gt;My main project this week was a 90 minute workshop I delivered about Git scraping at &lt;a href="https://escoladedados.org/coda2021/"&gt;Coda.Br 2021&lt;/a&gt;, a Brazilian data journalism conference, on Friday. This inspired the creation of a brand new tool, &lt;strong&gt;git-history&lt;/strong&gt;, plus smaller improvements to a range of other projects.&lt;/p&gt;
&lt;h4&gt;git-history&lt;/h4&gt;
&lt;p&gt;I still need to do a detailed write-up of this one (update: &lt;a href="https://simonwillison.net/2021/Dec/7/git-history/"&gt;git-history: a tool for analyzing scraped data collected using Git and SQLite&lt;/a&gt;), but on Thursday I released a brand new tool called &lt;a href="https://datasette.io/tools/git-history"&gt;git-history&lt;/a&gt;, which I describe as "tools for analyzing Git history using SQLite".&lt;/p&gt;
&lt;p&gt;This tool is the missing link in the &lt;a href="https://simonwillison.net/2020/Oct/9/git-scraping/"&gt;Git scraping pattern&lt;/a&gt; I described here last October.&lt;/p&gt;
&lt;p&gt;Git scraping is the technique of regularly scraping an online source of information and writing the results to a file in a Git repository... which automatically gives you a full revision history of changes made to that data source over time.&lt;/p&gt;
&lt;p&gt;The missing piece has always been what to do next: how do you turn a commit history of changes to a JSON or CSV file into a data source that can be used to answer questions about how that file changed over time?&lt;/p&gt;
&lt;p&gt;I've written one-off Python scripts for this a few times (here's &lt;a href="https://github.com/simonw/cdc-vaccination-history/blob/6f6bcb9437c0d44c4bcf94c111c631cc50bc2744/build_database.py"&gt;my CDC vaccinations one&lt;/a&gt;, for example), but giving an interactive workshop about the technique finally inspired me to build a tool to help.&lt;/p&gt;
&lt;p&gt;The tool has &lt;a href="https://datasette.io/tools/git-history"&gt;a comprehensive README&lt;/a&gt;, but the short version is that you can take a JSON (or CSV) file in a repository that has been tracking changes to some items over time and run the following to load all of the different versions into a SQLite database file for analysis with &lt;a href="https://datasette.io/"&gt;Datasette&lt;/a&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;git-history file incidents.db incidents.json --id IncidentID
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This assumes that &lt;code&gt;incidents.json&lt;/code&gt; contains a JSON array of incidents (reported fires for example) and that each incident has an &lt;code&gt;IncidentID&lt;/code&gt; identifier key. It will then loop through the Git history of that file right from the start, creating an &lt;code&gt;item_versions&lt;/code&gt; table that tracks every change made to each of those items - using &lt;code&gt;IncidentID&lt;/code&gt; to decide if a row represents a new incident or an update to a previous one.&lt;/p&gt;
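&lt;p&gt;The core idea can be sketched in a few lines of Python. This is a simplified illustration and not how the tool itself is implemented: it takes a list of already-parsed snapshots instead of reading Git history, and the function name and sample data are made up.&lt;/p&gt;

```python
def item_versions(snapshots, id_key):
    """Return (commit, item) pairs for each item that is new or changed in a commit."""
    latest = {}
    versions = []
    for commit, items in snapshots:
        for item in items:
            item_id = item[id_key]
            # Record a version only when the item differs from the last one seen.
            if latest.get(item_id) != item:
                versions.append((commit, item))
                latest[item_id] = item
    return versions


# Hypothetical snapshots of incidents.json at three commits, oldest first.
snapshots = [
    ("commit-1", [{"IncidentID": 1, "Acres": 10}]),
    ("commit-2", [{"IncidentID": 1, "Acres": 25}, {"IncidentID": 2, "Acres": 5}]),
    ("commit-3", [{"IncidentID": 1, "Acres": 25}, {"IncidentID": 2, "Acres": 5}]),
]
versions = item_versions(snapshots, "IncidentID")
```

&lt;p&gt;The third commit changes nothing, so it contributes no rows: the result has one row from the first commit and two from the second.&lt;/p&gt;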
&lt;p&gt;I have a few more improvements I want to make before I start more widely promoting this, but it's already really useful. I've had a lot of fun running it against example repos from the &lt;a href="https://github.com/topics/git-scraping"&gt;git-scraping GitHub topic&lt;/a&gt; (now at 202 repos and counting).&lt;/p&gt;
&lt;h4&gt;Workshop: Raspando dados com o GitHub Actions e analisando com Datasette&lt;/h4&gt;
&lt;p&gt;The workshop I gave at the conference was live-translated into Portuguese, which is really exciting! I'm looking forward to watching the video when it comes out and seeing how well that worked.&lt;/p&gt;
&lt;p&gt;The title translates to "Scraping data with GitHub Actions and analyzing with Datasette", and it was the first time I've given a workshop that combines Git scraping and Datasette - hence the development of the new git-history tool to help tie the two together.&lt;/p&gt;
&lt;p&gt;I think it went really well. I put together four detailed exercises for the attendees, and then worked through each one live with the goal of attendees working through them at the same time - a method I learned from the Carpentries training course I took &lt;a href="https://simonwillison.net/2020/Sep/26/weeknotes-software-carpentry-sqlite/"&gt;last year&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Four exercises turns out to be exactly right for 90 minutes, with reasonable time for an introduction and some extra material and questions at the end.&lt;/p&gt;
&lt;p&gt;The worst part of running a workshop is inevitably the part where you try and get everyone set up with a functional development environment on their own machines (see &lt;a href="https://xkcd.com/1987/"&gt;XKCD 1987&lt;/a&gt;). This time round I skipped that entirely by encouraging my students to use &lt;strong&gt;&lt;a href="https://gitpod.io/"&gt;GitPod&lt;/a&gt;&lt;/strong&gt;, which provides free browser-based cloud development environments running Linux, with a browser-embedded VS Code editor and terminal running on top.&lt;/p&gt;

&lt;p&gt;&lt;img style="max-width: 100%" src="https://static.simonwillison.net/static/2021/start-datasette-gitpod.gif" alt="Animated demo of GitPod showing how to run Datasette and have it proxy a port" /&gt;&lt;/p&gt;

&lt;p&gt;(It's similar to &lt;a href="https://github.com/features/codespaces"&gt;GitHub Codespaces&lt;/a&gt;, but Codespaces is not yet available to free customers outside of the beta.)&lt;/p&gt;
&lt;p&gt;I demonstrated all of the exercises using GitPod myself during the workshop, and ensured that they could be entirely completed through that environment, with no laptop software needed at all.&lt;/p&gt;
&lt;p&gt;This worked &lt;strong&gt;so well&lt;/strong&gt;. Not having to worry about development environments makes workshops massively more productive. I will absolutely be doing this again in the future.&lt;/p&gt;
&lt;p&gt;The workshop exercises are available &lt;a href="https://docs.google.com/document/d/1TCatZP5gQNfFjZJ5M77wMlf9u_05Z3BZnjp6t1SA6UU/edit"&gt;in this Google Doc&lt;/a&gt;, and I hope to extract some of them out into official tutorials for various tools later on.&lt;/p&gt;
&lt;h4&gt;Datasette 0.59.2&lt;/h4&gt;
&lt;p&gt;Yesterday was Datasette's fourth birthday - the four year anniversary of &lt;a href="https://simonwillison.net/2017/Nov/13/datasette/"&gt;the initial release announcement&lt;/a&gt;! I celebrated by releasing a minor bug-fix, &lt;a href="https://github.com/simonw/datasette/releases/tag/0.59.2"&gt;Datasette 0.59.2&lt;/a&gt;, the release notes for which are quoted below:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Column names with a leading underscore now work correctly when used as a facet. (&lt;a href="https://github.com/simonw/datasette/issues/1506"&gt;#1506&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Applying &lt;code&gt;?_nocol=&lt;/code&gt; to a column no longer removes that column from the filtering interface. (&lt;a href="https://github.com/simonw/datasette/issues/1503"&gt;#1503&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Official Datasette Docker container now uses Debian Bullseye as the base image. (&lt;a href="https://github.com/simonw/datasette/issues/1497"&gt;#1497&lt;/a&gt;)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;That first change was inspired by ongoing work on &lt;code&gt;git-history&lt;/code&gt;, where I decided to use a &lt;code&gt;_id&lt;/code&gt; underscore prefix pattern for columns that were reserved for use by that tool in order &lt;a href="https://github.com/simonw/git-history/issues/14"&gt;to avoid clashing with column names&lt;/a&gt; in the provided source data.&lt;/p&gt;
&lt;h4&gt;sqlite-utils 3.18&lt;/h4&gt;
&lt;p&gt;Today I released &lt;a href="https://sqlite-utils.datasette.io/en/stable/changelog.html#v3-18"&gt;sqlite-utils 3.18&lt;/a&gt; - initially also to provide a feature I wanted for &lt;code&gt;git-history&lt;/code&gt; (a way to &lt;a href="https://github.com/simonw/sqlite-utils/issues/339"&gt;populate additional columns&lt;/a&gt; when creating a row using &lt;code&gt;table.lookup()&lt;/code&gt;) but I also closed some bug reports and landed some small pull requests that had come in since 3.17.&lt;/p&gt;
&lt;h4&gt;s3-credentials 0.5&lt;/h4&gt;
&lt;p&gt;Earlier in the week I released &lt;a href="https://github.com/simonw/s3-credentials/releases/tag/0.5"&gt;version 0.5&lt;/a&gt; of &lt;a href="https://github.com/simonw/s3-credentials"&gt;s3-credentials&lt;/a&gt; - my CLI tool for creating read-only, read-write or write-only AWS credentials for a specific S3 bucket.&lt;/p&gt;
&lt;p&gt;The biggest new feature is the ability to create temporary credentials, that expire after a given time limit.&lt;/p&gt;
&lt;p&gt;This is achieved using &lt;code&gt;STS.assume_role()&lt;/code&gt;, where STS is &lt;a href="https://docs.aws.amazon.com/STS/latest/APIReference/welcome.html"&gt;Security Token Service&lt;/a&gt;. I've been wanting to learn this API for quite a while now.&lt;/p&gt;
&lt;p&gt;Assume role comes with some limitations: tokens must live between 15 minutes and 12 hours, and you need to first create a role that you can assume. In creating those credentials you can define an additional policy document, which is how I scope down the token I'm creating to only allow a specific level of access to a specific S3 bucket.&lt;/p&gt;
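&lt;p&gt;Here is a minimal sketch of that scoping-down step, assuming a role that already exists. It is not the code &lt;code&gt;s3-credentials&lt;/code&gt; itself uses: the function name, the session name and the optional &lt;code&gt;sts&lt;/code&gt; argument are hypothetical, and the role's own maximum session duration may cap the lifetime further than the limits checked here.&lt;/p&gt;

```python
import json

# Limits quoted above: tokens must live between 15 minutes and 12 hours
MIN_SECONDS = 15 * 60
MAX_SECONDS = 12 * 60 * 60


def temporary_credentials(role_arn, policy_document, duration_seconds=3600, sts=None):
    """Assume role_arn, scoped down by an additional inline policy document."""
    if duration_seconds not in range(MIN_SECONDS, MAX_SECONDS + 1):
        raise ValueError("duration_seconds must be between 900 and 43200")
    if sts is None:
        # Imported lazily so the validation above can run without boto3
        import boto3

        sts = boto3.client("sts")
    response = sts.assume_role(
        RoleArn=role_arn,
        RoleSessionName="s3-temporary-session",  # hypothetical session name
        Policy=json.dumps(policy_document),  # the extra scoping-down policy
        DurationSeconds=duration_seconds,
    )
    # Holds AccessKeyId, SecretAccessKey, SessionToken and Expiration
    return response["Credentials"]
```

&lt;p&gt;The returned dictionary includes a &lt;code&gt;SessionToken&lt;/code&gt;, which has to be sent along with the access key and secret key when the temporary credentials are used.&lt;/p&gt;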
&lt;p&gt;I've learned a huge amount about AWS, IAM and S3 through developing this project. I think I'm finally overcoming my multi-year phobia of anything involving IAM!&lt;/p&gt;
&lt;h4&gt;Releases this week&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/sqlite-utils"&gt;sqlite-utils&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/sqlite-utils/releases/tag/3.18"&gt;3.18&lt;/a&gt; - (&lt;a href="https://github.com/simonw/sqlite-utils/releases"&gt;88 releases total&lt;/a&gt;) - 2021-11-15
&lt;br /&gt;Python CLI utility and library for manipulating SQLite databases&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette"&gt;datasette&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/datasette/releases/tag/0.59.2"&gt;0.59.2&lt;/a&gt; - (&lt;a href="https://github.com/simonw/datasette/releases"&gt;100 releases total&lt;/a&gt;) - 2021-11-14
&lt;br /&gt;An open source multi-tool for exploring and publishing data&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette-hello-world"&gt;datasette-hello-world&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/datasette-hello-world/releases/tag/0.1.1"&gt;0.1.1&lt;/a&gt; - (&lt;a href="https://github.com/simonw/datasette-hello-world/releases"&gt;2 releases total&lt;/a&gt;) - 2021-11-14
&lt;br /&gt;The hello world of Datasette plugins&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/git-history"&gt;git-history&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/git-history/releases/tag/0.3.1"&gt;0.3.1&lt;/a&gt; - (&lt;a href="https://github.com/simonw/git-history/releases"&gt;5 releases total&lt;/a&gt;) - 2021-11-12
&lt;br /&gt;Tools for analyzing Git history using SQLite&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/s3-credentials"&gt;s3-credentials&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/s3-credentials/releases/tag/0.5"&gt;0.5&lt;/a&gt; - (&lt;a href="https://github.com/simonw/s3-credentials/releases"&gt;5 releases total&lt;/a&gt;) - 2021-11-11
&lt;br /&gt;A tool for creating credentials for accessing S3 buckets&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;TIL this week&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/kubernetes/basic-datasette-in-kubernetes"&gt;Basic Datasette in Kubernetes&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/deno/annotated-deno-deploy-demo"&gt;Annotated code for a demo of WebSocket chat in Deno Deploy&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/javascript/tesseract-ocr-javascript"&gt;Using Tesseract.js to OCR every image on a page&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/aws"&gt;aws&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/s3"&gt;s3&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/my-talks"&gt;my-talks&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/teaching"&gt;teaching&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/weeknotes"&gt;weeknotes&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git-scraping"&gt;git-scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sqlite-utils"&gt;sqlite-utils&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/git-history"&gt;git-history&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/s3-credentials"&gt;s3-credentials&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="aws"/><category term="projects"/><category term="s3"/><category term="my-talks"/><category term="teaching"/><category term="datasette"/><category term="weeknotes"/><category term="git-scraping"/><category term="sqlite-utils"/><category term="git-history"/><category term="s3-credentials"/></entry><entry><title>s3-credentials: a tool for creating credentials for S3 buckets</title><link href="https://simonwillison.net/2021/Nov/3/s3-credentials/#atom-tag" rel="alternate"/><published>2021-11-03T04:02:04+00:00</published><updated>2021-11-03T04:02:04+00:00</updated><id>https://simonwillison.net/2021/Nov/3/s3-credentials/#atom-tag</id><summary type="html">
    &lt;p&gt;I've built a command-line tool called &lt;a href="https://github.com/simonw/s3-credentials"&gt;s3-credentials&lt;/a&gt; to solve a problem that's been frustrating me for ages: how to quickly and easily create AWS credentials (an access key and secret key) that have permission to read or write from just a single S3 bucket.&lt;/p&gt;
&lt;h4&gt;The TLDR version&lt;/h4&gt;
&lt;p&gt;To create a new S3 bucket and generate credentials for reading and writing to it:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;% pip install s3-credentials
% s3-credentials create demo-bucket-for-simonwillison-blog-post --create-bucket
Created bucket: demo-bucket-for-simonwillison-blog-post
Created  user: 's3.read-write.demo-bucket-for-simonwillison-blog-post' with permissions boundary: 'arn:aws:iam::aws:policy/AmazonS3FullAccess'
Attached policy s3.read-write.demo-bucket-for-simonwillison-blog-post to user s3.read-write.demo-bucket-for-simonwillison-blog-post
Created access key for user: s3.read-write.demo-bucket-for-simonwillison-blog-post
{
    "UserName": "s3.read-write.demo-bucket-for-simonwillison-blog-post",
    "AccessKeyId": "AKIAWXFXAIOZHY6WAJSF",
    "Status": "Active",
    "SecretAccessKey": "...",
    "CreateDate": "2021-12-06 23:54:08+00:00"
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;You can now use that &lt;code&gt;AccessKeyId&lt;/code&gt; and &lt;code&gt;SecretAccessKey&lt;/code&gt; to read and write files in that bucket.&lt;/p&gt;
&lt;h4&gt;The need for bucket credentials for S3&lt;/h4&gt;
&lt;p&gt;I'm an enormous fan of &lt;a href="https://aws.amazon.com/s3/"&gt;Amazon S3&lt;/a&gt;: I've been using it &lt;a href="https://simonwillison.net/tags/s3/?page=last"&gt;for fifteen years&lt;/a&gt; now (since the launch in 2006) and it's my all-time favourite cloud service: it's cheap, reliable and basically indestructible.&lt;/p&gt;
&lt;p&gt;You need two credentials to make API calls to S3: an &lt;code&gt;AWS_ACCESS_KEY_ID&lt;/code&gt; and an &lt;code&gt;AWS_SECRET_ACCESS_KEY&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Since I often end up adding these credentials to projects hosted in different environments, I'm not at all keen on using my root-level credentials here: usually a project works against just one dedicated S3 bucket, so ideally I would like to create dedicated credentials that are limited to just that bucket.&lt;/p&gt;
&lt;p&gt;Creating those credentials is surprisingly difficult!&lt;/p&gt;
&lt;h4&gt;Dogsheep Photos&lt;/h4&gt;
&lt;p&gt;The last time I solved this problem was for my &lt;a href="https://datasette.io/tools/dogsheep-photos"&gt;Dogsheep Photos&lt;/a&gt; project. I built a tool that uploads all of my photos from Apple Photos to my own dedicated S3 bucket, and extracts the photo metadata into a SQLite database. This means I can do some really cool tricks using SQL to analyze my photos, as described in &lt;a href="https://simonwillison.net/2020/May/21/dogsheep-photos/"&gt;Using SQL to find my best photo of a pelican according to Apple Photos&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The photos are stored in a private S3 bucket, with &lt;a href="https://github.com/simonw/s3-image-proxy"&gt;a custom proxy&lt;/a&gt; in front of them that I can use to grant access to specific photographs via a signed URL.&lt;/p&gt;
&lt;p&gt;For the proxy, I decided to create dedicated credentials that were allowed to make read-only requests to my private S3 bucket.&lt;/p&gt;
&lt;p&gt;I made &lt;a href="https://github.com/dogsheep/dogsheep-photos/issues/4"&gt;detailed notes&lt;/a&gt; along the way as I figured out how to do that. It was really hard! There's one step where you literally have to hand-edit a JSON policy document that looks like this (replace &lt;code&gt;dogsheep-photos-simon&lt;/code&gt; with your own bucket name) and paste that into the AWS web console:&lt;/p&gt;
&lt;div class="highlight highlight-source-json"&gt;&lt;pre&gt;{
  &lt;span class="pl-ent"&gt;"Version"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;2012-10-17&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
  &lt;span class="pl-ent"&gt;"Statement"&lt;/span&gt;: [
    {
      &lt;span class="pl-ent"&gt;"Effect"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Allow&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
      &lt;span class="pl-ent"&gt;"Action"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;s3:*&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
      &lt;span class="pl-ent"&gt;"Resource"&lt;/span&gt;: [
        &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;arn:aws:s3:::dogsheep-photos-simon/*&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;
      ]
    }
  ]
}&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;I set myself an ambition to try and fix this at some point in the future (that was in April 2020).&lt;/p&gt;
&lt;p&gt;Today I found myself wanting new bucket credentials, so I could play with &lt;a href="https://litestream.io/"&gt;Litestream&lt;/a&gt;. I decided to solve this problem once and for all.&lt;/p&gt;
&lt;p&gt;I've also been meaning to really get my head around Amazon's IAM permission model for years, and this felt like a great excuse to figure it out through writing code.&lt;/p&gt;
&lt;h4&gt;The process in full&lt;/h4&gt;
&lt;p&gt;Here are the steps you need to take in order to get long-lasting credentials for accessing a specific S3 bucket.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Create an S3 bucket&lt;/li&gt;
&lt;li&gt;Create a new, dedicated user. You need a user and not a role because long-lasting AWS credentials cannot be created for roles - and we want credentials we can use in a project without constantly needing to update them.&lt;/li&gt;
&lt;li&gt;Assign an "inline policy" to that user granting them read-only or read-write access to the specific S3 bucket - this is the JSON format shown above.&lt;/li&gt;
&lt;li&gt;Create AWS credentials for that user.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;There are plenty of other ways you can achieve this: you can add permissions to a group and assign that user to a group, or you can create a named "managed policy" and attach that to the user. But using an inline policy seems to be the simplest of the available options.&lt;/p&gt;
&lt;p&gt;Using the &lt;a href="https://aws.amazon.com/sdk-for-python/"&gt;boto3&lt;/a&gt; Python client library for AWS this sequence converts to the following API calls:&lt;/p&gt;
&lt;pre&gt;&lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-s1"&gt;boto3&lt;/span&gt;
&lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-s1"&gt;json&lt;/span&gt;

&lt;span class="pl-s1"&gt;s3&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;boto3&lt;/span&gt;.&lt;span class="pl-en"&gt;client&lt;/span&gt;(&lt;span class="pl-s"&gt;"s3"&lt;/span&gt;)
&lt;span class="pl-s1"&gt;iam&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;boto3&lt;/span&gt;.&lt;span class="pl-en"&gt;client&lt;/span&gt;(&lt;span class="pl-s"&gt;"iam"&lt;/span&gt;)

&lt;span class="pl-s1"&gt;username&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s"&gt;"my-new-user"&lt;/span&gt;
&lt;span class="pl-s1"&gt;bucket_name&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s"&gt;"my-new-bucket"&lt;/span&gt;
&lt;span class="pl-s1"&gt;policy_name&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s"&gt;"user-can-access-bucket"&lt;/span&gt;

&lt;span class="pl-s1"&gt;policy_document&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; {
    &lt;span class="pl-s"&gt;"... that big JSON document ..."&lt;/span&gt;: &lt;span class="pl-s"&gt;""&lt;/span&gt;
}

&lt;span class="pl-c"&gt;# Create the bucket&lt;/span&gt;
&lt;span class="pl-s1"&gt;s3&lt;/span&gt;.&lt;span class="pl-en"&gt;create_bucket&lt;/span&gt;(&lt;span class="pl-v"&gt;Bucket&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-s1"&gt;bucket_name&lt;/span&gt;)

&lt;span class="pl-c"&gt;# Create the user&lt;/span&gt;
&lt;span class="pl-s1"&gt;iam&lt;/span&gt;.&lt;span class="pl-en"&gt;create_user&lt;/span&gt;(&lt;span class="pl-v"&gt;UserName&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-s1"&gt;username&lt;/span&gt;)

&lt;span class="pl-c"&gt;# Assign the policy to the user&lt;/span&gt;
&lt;span class="pl-s1"&gt;iam&lt;/span&gt;.&lt;span class="pl-en"&gt;put_user_policy&lt;/span&gt;(
    &lt;span class="pl-v"&gt;PolicyDocument&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-s1"&gt;json&lt;/span&gt;.&lt;span class="pl-en"&gt;dumps&lt;/span&gt;(&lt;span class="pl-s1"&gt;policy_document&lt;/span&gt;),
    &lt;span class="pl-v"&gt;PolicyName&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-s1"&gt;policy_name&lt;/span&gt;,
    &lt;span class="pl-v"&gt;UserName&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-s1"&gt;username&lt;/span&gt;,
)

&lt;span class="pl-c"&gt;# Retrieve and print the credentials&lt;/span&gt;
&lt;span class="pl-s1"&gt;response&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;iam&lt;/span&gt;.&lt;span class="pl-en"&gt;create_access_key&lt;/span&gt;(
    &lt;span class="pl-v"&gt;UserName&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-s1"&gt;username&lt;/span&gt;,
)
&lt;span class="pl-en"&gt;print&lt;/span&gt;(&lt;span class="pl-s1"&gt;response&lt;/span&gt;[&lt;span class="pl-s"&gt;"AccessKey"&lt;/span&gt;])&lt;/pre&gt;
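&lt;p&gt;The &lt;code&gt;policy_document&lt;/code&gt; placeholder above could be filled in with something like the following sketch. The &lt;code&gt;bucket_policy()&lt;/code&gt; helper and the action lists it picks are assumptions for illustration, not the policies that ship with &lt;code&gt;s3-credentials&lt;/code&gt;, so review them before using them against a real account.&lt;/p&gt;

```python
import json


def bucket_policy(bucket, access="read-write"):
    """Return an inline policy document for one bucket (illustrative actions)."""
    bucket_arn = "arn:aws:s3:::{}".format(bucket)
    objects_arn = bucket_arn + "/*"
    object_actions = {
        "read-only": ["s3:GetObject"],
        "write-only": ["s3:PutObject"],
        "read-write": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
    }
    if access not in object_actions:
        raise ValueError("access must be one of: " + ", ".join(sorted(object_actions)))
    statements = [
        {"Effect": "Allow", "Action": object_actions[access], "Resource": [objects_arn]}
    ]
    if access != "write-only":
        # Listing keys is a bucket-level action, so it targets the bucket ARN itself
        statements.insert(
            0, {"Effect": "Allow", "Action": ["s3:ListBucket"], "Resource": [bucket_arn]}
        )
    return {"Version": "2012-10-17", "Statement": statements}


policy_document = bucket_policy("my-new-bucket", "read-only")
print(json.dumps(policy_document, indent=2))
```

&lt;p&gt;Object-level actions are granted on the &lt;code&gt;bucket/*&lt;/code&gt; resource while listing is granted on the bucket ARN, which is why the sketch emits two statements for the readable variants.&lt;/p&gt;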
&lt;h4&gt;Turning it into a CLI tool&lt;/h4&gt;
&lt;p&gt;I never want to have to figure out how to do this again, so I decided to build a tool around it.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://github.com/simonw/s3-credentials"&gt;s3-credentials&lt;/a&gt; is a Python CLI utility built on top of &lt;a href="https://click.palletsprojects.com/"&gt;Click&lt;/a&gt; using my &lt;a href="https://github.com/simonw/click-app"&gt;click-app&lt;/a&gt; cookiecutter template.&lt;/p&gt;
&lt;p&gt;It's available through PyPI, so you can install it using:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;% pip install s3-credentials&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The main command is &lt;code&gt;s3-credentials create&lt;/code&gt;, which runs through the above sequence of steps.&lt;/p&gt;
&lt;p&gt;To create read-only credentials for my existing &lt;code&gt;static.niche-museums.com&lt;/code&gt; bucket I can run the following:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;% s3-credentials create static.niche-museums.com --read-only

Created user: s3.read-only.static.niche-museums.com with permissions boundary: arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess
Attached policy s3.read-only.static.niche-museums.com to user s3.read-only.static.niche-museums.com
Created access key for user: s3.read-only.static.niche-museums.com
{
    "UserName": "s3.read-only.static.niche-museums.com",
    "AccessKeyId": "AKIAWXFXAIOZJ26NEGBN",
    "Status": "Active",
    "SecretAccessKey": "...",
    "CreateDate": "2021-11-03 03:21:12+00:00"
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The command shows each step as it executes, and at the end it outputs the newly created access key and secret key.&lt;/p&gt;
&lt;p&gt;It defaults to creating a user with a username that reflects what it will be able to do: &lt;code&gt;s3.read-only.static.niche-museums.com&lt;/code&gt;. You can pass &lt;code&gt;--username something&lt;/code&gt; to specify a custom username instead.&lt;/p&gt;
&lt;p&gt;If you omit the &lt;code&gt;--read-only&lt;/code&gt; flag it will create a user with read and write access to the bucket. There's also a &lt;code&gt;--write-only&lt;/code&gt; flag which creates a user that can write to but not read from the bucket - useful for use-cases like logging or backup scripts.&lt;/p&gt;
&lt;p&gt;The &lt;a href="https://github.com/simonw/s3-credentials/blob/main/README.md"&gt;README&lt;/a&gt; has full documentation on the various other options, plus details of the other &lt;code&gt;s3-credentials&lt;/code&gt; utility commands &lt;code&gt;list-users&lt;/code&gt;, &lt;code&gt;list-buckets&lt;/code&gt;, &lt;code&gt;list-user-policies&lt;/code&gt; and &lt;code&gt;whoami&lt;/code&gt;.&lt;/p&gt;
&lt;h4&gt;Learned along the way&lt;/h4&gt;
&lt;p&gt;This really was a fantastic project for deepening my understanding of S3, IAM and how it all fits together. A few extra points I picked up:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;AWS users can be created with something called a &lt;a href="https://docs.aws.amazon.com/IAM/latest/UserGuide/access_policies_boundaries.html"&gt;permissions boundary&lt;/a&gt;. This is an advanced security feature which lets a user be restricted to a set of maximum permissions - for example, only allowed to interact with S3, not any other AWS service.&lt;/p&gt;
&lt;p&gt;Permissions boundaries do not themselves grant permissions - a user will not be able to do anything until extra policies are added to their account. They instead act as defense in depth, setting an upper limit to what a user can do no matter what other policies are applied to them.&lt;/p&gt;
&lt;p&gt;There's one big catch: the value you set for a permissions boundary is a very weakly documented ARN string - the &lt;code&gt;boto3&lt;/code&gt; documentation simply calls it "The ARN of the policy that is used to set the permissions boundary for the user". I used &lt;a href="https://github.com/search?l=Python&amp;amp;q=iam+PermissionsBoundary&amp;amp;type=Code"&gt;GitHub code search&lt;/a&gt; to dig up some examples, and found &lt;code&gt;arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess&lt;/code&gt; and &lt;code&gt;arn:aws:iam::aws:policy/AmazonS3FullAccess&lt;/code&gt; to be the ones most relevant to my project. &lt;a href="https://github.com/daviddawha/ArchivesSpaceDevUNR/blob/488b5b83f9ac66a6013e9a0a02d25734886dee02/gems/gems/fog-aws-2.0.0/lib/fog/aws/iam/default_policy_versions.json"&gt;This random file&lt;/a&gt; appears to contain more.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Those JSON policy documents really are the dark secret magic that holds AWS together. Finding trustworthy examples of read-only, read-write and write-only policies for specific S3 buckets was not at all easy. I made &lt;a href="https://github.com/simonw/s3-credentials/issues/3#issuecomment-958401364"&gt;detailed notes in this comment thread&lt;/a&gt; - the policies I went with are baked into the &lt;a href="https://github.com/simonw/s3-credentials/blob/0.2/s3_credentials/policies.py"&gt;policies.py&lt;/a&gt; file in the &lt;code&gt;s3-credentials&lt;/code&gt; repository. If you know your way around IAM I would love to hear your feedback on the policies I ended up using!&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Writing automated tests for code that makes extensive use of &lt;code&gt;boto3&lt;/code&gt; - such that those tests don't make any real HTTP requests to the API - is a bit fiddly. I &lt;a href="https://github.com/simonw/s3-credentials/pull/6#issuecomment-958522457"&gt;explored a few options&lt;/a&gt; for this - potential candidates included the &lt;a href="https://botocore.amazonaws.com/v1/documentation/api/latest/reference/stubber.html"&gt;botocore.stub.Stubber&lt;/a&gt; class and the &lt;a href="https://vcrpy.readthedocs.io/"&gt;VCR.py&lt;/a&gt; library for saving and replaying HTTP traffic (see &lt;a href="https://til.simonwillison.net/pytest/pytest-recording-vcr"&gt;this TIL&lt;/a&gt;). I ended up going with Python's &lt;code&gt;Mock&lt;/code&gt; class, via &lt;a href="https://github.com/pytest-dev/pytest-mock"&gt;pytest-mock&lt;/a&gt; - here's &lt;a href="https://til.simonwillison.net/pytest/pytest-mock-calls"&gt;another TIL&lt;/a&gt; on the pattern I used for that. (Update: Jeff Triplett &lt;a href="https://twitter.com/webology/status/1455749203595087872"&gt;pointed me&lt;/a&gt; to &lt;a href="https://github.com/spulec/moto"&gt;moto&lt;/a&gt; which looks like a really great solution for this.)&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
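&lt;p&gt;To make the first and last of those points concrete, here is a rough sketch that creates a user with a permissions boundary and checks the call against a &lt;code&gt;Mock&lt;/code&gt; rather than a real IAM client. The &lt;code&gt;create_bounded_user()&lt;/code&gt; function and the example username are hypothetical; the two ARNs are the ones mentioned above, and this uses the standard library &lt;code&gt;unittest.mock&lt;/code&gt; directly rather than the pytest-mock fixture.&lt;/p&gt;

```python
from unittest.mock import Mock

# The two AWS-managed policy ARNs mentioned above, keyed by read_only
BOUNDARIES = {
    True: "arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess",
    False: "arn:aws:iam::aws:policy/AmazonS3FullAccess",
}


def create_bounded_user(iam, username, read_only=False):
    """Create an IAM user capped by an S3 permissions boundary."""
    boundary = BOUNDARIES[read_only]
    # The boundary grants nothing by itself; it only limits later policies
    iam.create_user(UserName=username, PermissionsBoundary=boundary)
    return boundary


# Exercise the function against a Mock standing in for boto3.client("iam")
iam = Mock()
boundary = create_bounded_user(iam, "s3.read-only.example-bucket", read_only=True)
iam.create_user.assert_called_once_with(
    UserName="s3.read-only.example-bucket",
    PermissionsBoundary="arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess",
)
```

&lt;p&gt;Passing the client in as an argument is what makes the swap possible: the test never touches the network, and the assertion records exactly which keyword arguments would have been sent to AWS.&lt;/p&gt;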
&lt;h4&gt;Feedback from AWS experts wanted&lt;/h4&gt;
&lt;p&gt;The tool I've built solves my specific problem pretty well. I'm nervous about it though: I am by no means an IAM expert, and I'm somewhat paranoid that I may have made a dumb mistake and baked it into the tooling.&lt;/p&gt;
&lt;p&gt;As such, the README currently &lt;a href="https://github.com/simonw/s3-credentials/tree/0.2#%EF%B8%8F-warning"&gt;carries a warning&lt;/a&gt; that you should review what the tool is doing carefully before trusting it against your own AWS account!&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Update 20 February 2022:&lt;/strong&gt; I removed that warning, since I've now spent long enough working on this tool that I'm comfortable with how it works.&lt;/p&gt;
&lt;p&gt;If you are an AWS expert, you can help: I have &lt;a href="https://github.com/simonw/s3-credentials/issues/7"&gt;an open issue&lt;/a&gt; requesting expert feedback, and I'd love to hear from people with deep experience who can either validate that my approach is sound or help explain what I'm doing wrong and how the process can be fixed.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/cli"&gt;cli&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/s3"&gt;s3&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/security"&gt;security&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/s3-credentials"&gt;s3-credentials&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="cli"/><category term="projects"/><category term="python"/><category term="s3"/><category term="security"/><category term="s3-credentials"/></entry><entry><title>Abusing Terraform to Upload Static Websites to S3</title><link href="https://simonwillison.net/2021/Oct/6/abusing-terraform/#atom-tag" rel="alternate"/><published>2021-10-06T18:51:27+00:00</published><updated>2021-10-06T18:51:27+00:00</updated><id>https://simonwillison.net/2021/Oct/6/abusing-terraform/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.tangramvision.com/blog/abusing-terraform-to-upload-static-websites-to-s3"&gt;Abusing Terraform to Upload Static Websites to S3&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
I found this really interesting. Terraform is infrastructure as code software which mostly handles creating and updating infrastructure resources, so it’s a poor fit for uploading files to S3 and setting the correct Content-Type headers for them. But... in figuring out how to do that, this article taught me a ton about how Terraform works. I wonder if that’s a useful general pattern? Get a tool to do something that it’s poorly designed to handle and see how much you learn about that tool along the way.

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=28776287"&gt;Hacker News&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/s3"&gt;s3&lt;/a&gt;&lt;/p&gt;



</summary><category term="s3"/></entry><entry><title>Quoting Brian LeRoux</title><link href="https://simonwillison.net/2021/May/13/brian-leroux/#atom-tag" rel="alternate"/><published>2021-05-13T15:01:26+00:00</published><updated>2021-05-13T15:01:26+00:00</updated><id>https://simonwillison.net/2021/May/13/brian-leroux/#atom-tag</id><summary type="html">
    &lt;blockquote cite="https://twitter.com/brianleroux/status/1392837707093536775"&gt;&lt;p&gt;Folks think s3 is static assets hosting but really it's a consistent and highly available key value store with first class blob support&lt;/p&gt;&lt;/blockquote&gt;
&lt;p class="cite"&gt;&amp;mdash; &lt;a href="https://twitter.com/brianleroux/status/1392837707093536775"&gt;Brian LeRoux&lt;/a&gt;&lt;/p&gt;

    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/s3"&gt;s3&lt;/a&gt;&lt;/p&gt;



</summary><category term="s3"/></entry><entry><title>logpaste</title><link href="https://simonwillison.net/2021/Mar/17/logpaste/#atom-tag" rel="alternate"/><published>2021-03-17T15:48:45+00:00</published><updated>2021-03-17T15:48:45+00:00</updated><id>https://simonwillison.net/2021/Mar/17/logpaste/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/mtlynch/logpaste/tree/fe53779aef25cdc9dfaabd9404ff801951730ddb"&gt;logpaste&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Useful example of how to use the Litestream SQLite replication tool in a Dockerized application: S3 credentials are passed to the container on startup, it then attempts to restore the SQLite database from S3 and starts a Litestream process in the same container to periodically synchronize changes back up to the S3 bucket.

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/deliberatecoder/status/1371993772549103621"&gt;@deliberatecoder&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/replication"&gt;replication&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/s3"&gt;s3&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sqlite"&gt;sqlite&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/docker"&gt;docker&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/litestream"&gt;litestream&lt;/a&gt;&lt;/p&gt;



</summary><category term="replication"/><category term="s3"/><category term="sqlite"/><category term="docker"/><category term="litestream"/></entry><entry><title>athena-sqlite</title><link href="https://simonwillison.net/2019/Dec/18/athena-sqlite/#atom-tag" rel="alternate"/><published>2019-12-18T09:05:40+00:00</published><updated>2019-12-18T09:05:40+00:00</updated><id>https://simonwillison.net/2019/Dec/18/athena-sqlite/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/dacort/athena-sqlite"&gt;athena-sqlite&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Amazon Athena is the AWS tool for querying data stored in S3—as CSV, JSON or Apache Parquet files—using SQL. It’s an interesting way of building a very cheap data warehouse on top of S3 without having to run any additional services. Athena recently added a query federation SDK which lets you define additional custom data sources using Lambda functions. Damon Cortesi used this to write a custom connector for SQLite, which lets you run queries against data stored in SQLite files that you have uploaded to S3. You can then run joins between that data and other Athena sources.

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/dacort/status/1206059672513306626"&gt;@dacort&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/athena"&gt;athena&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/aws"&gt;aws&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/s3"&gt;s3&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sql"&gt;sql&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sqlite"&gt;sqlite&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/parquet"&gt;parquet&lt;/a&gt;&lt;/p&gt;



</summary><category term="athena"/><category term="aws"/><category term="s3"/><category term="sql"/><category term="sqlite"/><category term="parquet"/></entry><entry><title>Client-side instrumentation for under $1 per month. No servers necessary.</title><link href="https://simonwillison.net/2019/Mar/15/client-side-instrumentation-under-1-month-no-servers-necessary/#atom-tag" rel="alternate"/><published>2019-03-15T16:03:48+00:00</published><updated>2019-03-15T16:03:48+00:00</updated><id>https://simonwillison.net/2019/Mar/15/client-side-instrumentation-under-1-month-no-servers-necessary/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://bostata.com/post/client-side-instrumentation-for-under-one-dollar/"&gt;Client-side instrumentation for under $1 per month. No servers necessary.&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Rolling your own analytics used to be too complex and expensive to be worth the effort. Thanks to cloud technologies like CloudFront, Athena, S3 and Lambda you can now inexpensively implement client-side analytics (via requests to a tracking pixel) that store detailed logs on S3, then use Amazon Athena to run queries against those logs ($5/TB scanned) to get detailed reporting. This post also introduced me to Snowplow, an open source JavaScript analytics script (released by a commercial analytics platform) which looks very neat: it’s based on piwik.js, the tracker from the open-source Piwik analytics tool.

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=19388489"&gt;Hacker News&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/analytics"&gt;analytics&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/athena"&gt;athena&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/cloudfront"&gt;cloudfront&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/lambda"&gt;lambda&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/s3"&gt;s3&lt;/a&gt;&lt;/p&gt;



</summary><category term="analytics"/><category term="athena"/><category term="cloudfront"/><category term="lambda"/><category term="s3"/></entry><entry><title>Django Bakery</title><link href="https://simonwillison.net/2018/Jun/16/django-bakery/#atom-tag" rel="alternate"/><published>2018-06-16T01:49:35+00:00</published><updated>2018-06-16T01:49:35+00:00</updated><id>https://simonwillison.net/2018/Jun/16/django-bakery/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="http://django-bakery.readthedocs.io/en/latest/"&gt;Django Bakery&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
“A set of helpers for baking your Django site out as flat files”. Released by the LA Times Data Desk, who use it for a large number of projects from election results to data journalism interactives. Statically publishing these projects to S3 lets them handle huge traffic spikes at a very low cost.

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/palewire/status/1007721981179883520"&gt;@palewire&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/data-journalism"&gt;data-journalism&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/django"&gt;django&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/s3"&gt;s3&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/static-generator"&gt;static-generator&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ben-welsh"&gt;ben-welsh&lt;/a&gt;&lt;/p&gt;



</summary><category term="data-journalism"/><category term="django"/><category term="s3"/><category term="static-generator"/><category term="ben-welsh"/></entry></feed>