Simon Willison's Weblog: s3

datasette-files-s3 0.1a1

2026-03-25T21:57:05+00:00

A backend for datasette-files that adds the ability to store and retrieve files using an S3 bucket. This release added a mechanism for fetching S3 configuration periodically from a URL, which means we can use time limited IAM credentials that are restricted to a prefix within a bucket.

Tags: s3, datasette

s3-credentials 0.17

2025-12-16T23:40:31+00:00

New release of my s3-credentials CLI tool for managing credentials needed to access just one S3 bucket. Here are the release notes in full:

New commands get-bucket-policy and set-bucket-policy. #91

New commands get-public-access-block and set-public-access-block. #92

New localserver command for starting a web server that makes time limited credentials accessible via a JSON API. #93

That s3-credentials localserver command (documented here) is a little obscure, but I found myself wanting something like that to help me test out a new feature I'm building to help create temporary Litestream credentials using Amazon STS.

Most of that new feature was built by Claude Code from the following starting prompt:

Add a feature s3-credentials localserver which starts a localhost weberver running (using the Python standard library stuff) on port 8094 by default but -p/--port can set a different port and otherwise takes an option that names a bucket and then takes the same options for read--write/read-only etc as other commands. It also takes a required --refresh-interval option which can be set as 5m or 10h or 30s. All this thing does is reply on / to a GET request with the IAM expiring credentials that allow access to that bucket with that policy for that specified amount of time. It caches internally the credentials it generates and will return the exact same data up until they expire (it also tracks expected expiry time) after which it will generate new credentials (avoiding dog pile effects if multiple requests ask at the same time) and return and cache those instead.
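
Here's a minimal sketch of the caching behaviour that prompt asks for. It is not the code that ended up in s3-credentials: the CredentialCache class, the generate callable and the clock parameter are all hypothetical names I'm using for illustration.

```python
import threading
import time


class CredentialCache:
    """Cache one set of expiring credentials and regenerate them on expiry."""

    def __init__(self, generate, refresh_interval, clock=time.monotonic):
        self._generate = generate  # hypothetical callable returning a dict
        self._refresh_interval = refresh_interval  # in seconds
        self._clock = clock
        self._lock = threading.Lock()
        self._cached = None
        self._expires_at = 0.0

    def get(self):
        # Holding the lock while generating means concurrent callers wait for
        # a single refresh instead of each triggering their own (no dog pile).
        with self._lock:
            now = self._clock()
            if self._cached is None or now >= self._expires_at:
                self._cached = self._generate()
                self._expires_at = now + self._refresh_interval
            return self._cached
```

A GET handler for / would then just serialize cache.get() as JSON. Using one lock for both the check and the refresh is the simplest way I know to avoid the dog pile effect, at the cost of making readers wait during a refresh.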

Tags: aws, projects, s3, ai, annotated-release-notes, s3-credentials, prompt-engineering, generative-ai, llms, coding-agents, claude-code

An MVCC-like columnar table on S3 with constant-time deletes

2025-10-11T03:47:16+00:00

S3's support for conditional writes (previously) makes it an interesting, scalable and often inexpensive platform for all kinds of database patterns.

Shayon Mukherjee presents an ingenious design for a Parquet-backed database in S3 which accepts concurrent writes, presents a single atomic view for readers and even supports reliable row deletion despite Parquet requiring a complete file rewrite in order to remove data.

The key to the design is a _latest_manifest JSON file at the top of the bucket, containing an integer version number. Clients use compare-and-swap to increment that version - only one client can succeed at this, so the incremented version they get back is guaranteed unique to them.

Having reserved a version number the client can write a unique manifest file for that version - manifest/v00000123.json - with a more complex data structure referencing the current versions of every persisted file, including the one they just uploaded.

Deleted rows are written to tombstone files as either a list of primary keys or a list of ranges. Clients consult these when executing reads, filtering out deleted rows as part of resolving a query.
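
The version reservation step can be sketched with an in-memory stand-in for the bucket. Everything here is an assumption for illustration: FakeBucket, its integer version token and the {"version": N} JSON shape are mine, not taken from Shayon's design, and a real client would use S3 conditional writes instead.

```python
import json


class FakeBucket:
    """In-memory stand-in for a bucket that supports compare-and-swap writes."""

    def __init__(self):
        self.objects = {}  # key -> (version_token, body)

    def get(self, key):
        return self.objects.get(key, (0, None))

    def put_if_match(self, key, body, expected_token):
        # Succeeds only if nobody has written the key since we read it
        current_token, _ = self.get(key)
        if current_token != expected_token:
            return False
        self.objects[key] = (current_token + 1, body)
        return True


def reserve_version(bucket):
    """Increment the version in _latest_manifest, retrying on lost races."""
    while True:
        token, body = bucket.get("_latest_manifest")
        version = json.loads(body)["version"] if body else 0
        new_body = json.dumps({"version": version + 1})
        if bucket.put_if_match("_latest_manifest", new_body, token):
            return version + 1


def manifest_key(version):
    # Eight zero-padded digits, like manifest/v00000123.json
    return f"manifest/v{version:08d}.json"
```

Only the client whose put_if_match() succeeds gets a given version number, which is what makes it safe for that client to then write its own manifest file under manifest_key(version).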

The pricing estimates are especially noteworthy:

For a workload ingesting 6 TB/day with 2 TB of deletes and 50K queries/day:

PUT requests: ~380K/day (≈4 req/s) = $1.88/day

GET requests: highly variable, depends on partitioning effectiveness

Best case (good time-based partitioning): ~100K-200K/day = $0.04-$0.08/day

Worst case (poor partitioning, scanning many files): ~2M/day = $0.80/day

~$3/day for ingesting 6TB of data is pretty fantastic!

Watch out for storage costs though - each new TB of data at $0.023/GB/month adds $23.55 to the ongoing monthly bill.
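
Here's the arithmetic behind those numbers as a small sketch. The $5/million PUT price and the 1,024 GB per TB conversion are my assumptions; the function names are just for illustration.

```python
PUT_PRICE_PER_MILLION = 5.00  # USD, assumed S3 standard PUT price
STORAGE_PRICE_PER_GB_MONTH = 0.023  # USD, S3 standard tier
GB_PER_TB = 1024  # assumption: binary terabytes


def put_cost(requests):
    """Daily cost in USD for a given number of PUT requests."""
    return requests / 1_000_000 * PUT_PRICE_PER_MILLION


def monthly_storage_cost(terabytes):
    """Ongoing monthly cost in USD for storing this many terabytes."""
    return terabytes * GB_PER_TB * STORAGE_PRICE_PER_GB_MONTH


print(round(put_cost(380_000), 2))
print(round(monthly_storage_cost(1), 2))
```

The first figure lands close to the quoted $1.88/day (the 380K is itself approximate) and the second is the $23.55 per TB per month.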

Via Hacker News

Tags: aws, databases, s3, parquet

AWS in 2025: The Stuff You Think You Know That’s Now Wrong

2025-08-20T16:29:56+00:00

Absurdly useful roundup from Corey Quinn of AWS changes you may have missed that can materially affect your architectural decisions about how you use their services.

A few that stood out to me:

EC2 instances can now live-migrate between physical hosts, and can have their security groups, IAM roles and EBS volumes modified without a restart. They now charge by the second; they used to round up to the hour.
S3 Glacier restores are now fast and predictably priced.
AWS Lambdas can now run containers, execute for up to 15 minutes, use up to 10GB of RAM and request 10GB of /tmp storage.

Also this note on AWS's previously legendary resistance to shutting things down:

While deprecations remain rare, they’re definitely on the rise; if an AWS service sounds relatively niche or goofy, consider your exodus plan before building atop it.

Via Hacker News

Tags: amazon, aws, ec2, lambda, s3, corey-quinn

Quoting James Cowling

2025-05-14T03:49:37+00:00

I designed Dropbox's storage system and modeled its durability. Durability numbers (11 9's etc) are meaningless because competent providers don't lose data because of disk failures, they lose data because of bugs and operator error. [...]

The best thing you can do for your own durability is to choose a competent provider and then ensure you don't accidentally delete or corrupt own data on it:

Ideally never mutate an object in S3, add a new version instead.

Never live-delete any data. Mark it for deletion and then use a lifecycle policy to clean it up after a week.

This way you have time to react to a bug in your own stack.

— James Cowling
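
One possible way to get that delayed cleanup, sketched under my own assumptions rather than anything James describes: turn on bucket versioning, so that a delete only makes the old version noncurrent, then add a lifecycle rule that expires noncurrent versions after seven days. The rule ID and function name are hypothetical.

```python
ONE_WEEK_RULE = {
    "ID": "expire-noncurrent-after-a-week",  # hypothetical rule name
    "Status": "Enabled",
    "Filter": {"Prefix": ""},  # apply to every object in the bucket
    # Counted from the moment a version stops being the current one
    "NoncurrentVersionExpiration": {"NoncurrentDays": 7},
}


def apply_one_week_cleanup(s3_client, bucket):
    """Attach the lifecycle rule to a bucket that already has versioning on."""
    s3_client.put_bucket_lifecycle_configuration(
        Bucket=bucket,
        LifecycleConfiguration={"Rules": [ONE_WEEK_RULE]},
    )
```

Call it as apply_one_week_cleanup(boto3.client("s3"), "my-bucket"). Note that this replaces any lifecycle configuration already on the bucket.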

Tags: ops, s3, software-architecture

suitenumerique/docs

2025-03-17T18:51:50+00:00

New open source (MIT licensed) collaborative text editing web application, similar to Google Docs or Notion, notable because it's a joint effort funded by the French and German governments and "currently onboarding the Netherlands".

It's built using Django and React:

Docs is built on top of Django Rest Framework, Next.js, BlockNote.js, HocusPocus and Yjs.

Deployments currently require Kubernetes, PostgreSQL, memcached, an S3 bucket (or compatible) and an OIDC provider.

Tags: django, open-source, postgresql, s3, react, kubernetes

Using S3 triggers to maintain a list of files in DynamoDB

2025-02-19T22:07:32+00:00

I built an experimental prototype this morning of a system for efficiently tracking files that have been added to a large S3 bucket by maintaining a parallel DynamoDB table using S3 triggers and AWS Lambda.

I got 80% of the way there with this single prompt (complete with typos) to my custom Claude Project:

Python CLI app using boto3 with commands for creating a new S3 bucket which it also configures to have S3 lambada event triggers which moantian a dynamodb table containing metadata about all of the files in that bucket. Include these commands

create_bucket - create a bucket and sets up the associated triggers and dynamo tables

list_files - shows me a list of files based purely on querying dynamo

ChatGPT then took me to the 95% point. The code Claude produced included an obvious bug, so I pasted the code into o3-mini-high on the basis that "reasoning" is often a great way to fix those kinds of errors:

Identify, explain and then fix any bugs in this code:

code from Claude pasted here

... and aside from adding a couple of time.sleep() calls to work around timing errors with IAM policy distribution, everything worked!
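
The heart of the Lambda side is turning an S3 event notification into DynamoDB operations. This is my own rough sketch of that step, not the code the models wrote: the item fields and the "put"/"delete" labels are hypothetical, and a real handler would pass each item to the table's put_item() or delete_item().

```python
from urllib.parse import unquote_plus


def items_from_s3_event(event):
    """Turn an S3 event notification into (action, item) pairs for DynamoDB."""
    results = []
    for record in event.get("Records", []):
        s3 = record["s3"]
        item = {
            "bucket": s3["bucket"]["name"],
            # Object keys arrive URL-encoded in event notifications
            "key": unquote_plus(s3["object"]["key"]),
        }
        if record["eventName"].startswith("ObjectRemoved"):
            results.append(("delete", item))
        else:
            item["size"] = s3["object"].get("size", 0)
            results.append(("put", item))
    return results
```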

Getting from a rough idea to a working proof of concept of something like this with less than 15 minutes of prompting is extraordinarily valuable.

This is exactly the kind of project I've avoided in the past because of my almost irrational intolerance of the frustration involved in figuring out the individual details of each call to S3, IAM, AWS Lambda and DynamoDB.

(Update: I just found out about the new S3 Metadata system which launched a few weeks ago and might solve this exact problem!)

Tags: aws, lambda, nosql, prototyping, s3, ai, generative-ai, chatgpt, llms, ai-assisted-programming, claude, uv, o3

sqlite-s3vfs

2025-02-07T02:22:43+00:00

Neat open source project on the GitHub organisation for the UK government's Department for Business and Trade: a "Python virtual filesystem for SQLite to read from and write to S3."

I tried out their usage example by running it in a Python REPL with all of the dependencies:

uv run --python 3.13 --with apsw --with sqlite-s3vfs --with boto3 python

It worked as advertised. When I listed my S3 bucket I found it had created two files - one called demo.sqlite/0000000000 and another called demo.sqlite/0000000001, both 4096 bytes because each one represented a SQLite page.
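
To illustrate the layout I saw in the bucket, here's a small sketch that maps a byte range of the database file to the objects that would hold it. This is not code from sqlite-s3vfs; the function names are hypothetical and the 4096 byte page size is just what I observed.

```python
PAGE_SIZE = 4096  # bytes, matching the object sizes I saw in my bucket


def block_key(prefix, page_index):
    # Ten zero-padded digits, like demo.sqlite/0000000001
    return f"{prefix}/{page_index:010d}"


def blocks_for_range(prefix, offset, length):
    """List the object keys touched by reading `length` bytes at `offset`."""
    first = offset // PAGE_SIZE
    last = (offset + length - 1) // PAGE_SIZE
    return [block_key(prefix, i) for i in range(first, last + 1)]
```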

The implementation is just 200 lines of Python, implementing a new SQLite Virtual Filesystem on top of apsw.VFS.

The README includes this warning:

No locking is performed, so client code must ensure that writes do not overlap with other writes or reads. If multiple writes happen at the same time, the database will probably become corrupt and data be lost.

I wonder if the conditional writes feature added to S3 back in November could be used to protect against that happening. Tricky as there are multiple files involved, but maybe it (or a trick like this one) could be used to implement some kind of exclusive lock between multiple processes?

Via Hacker News comment

Tags: python, s3, sqlite, uv, apsw

Building Python tools with a one-shot prompt using uv run and Claude Projects

2024-12-19T07:00:37+00:00

I've written a lot about how I've been using Claude to build one-shot HTML+JavaScript applications via Claude Artifacts. I recently started using a similar pattern to create one-shot Python utilities, using a custom Claude Project combined with the dependency management capabilities of uv.

(In LLM jargon a "one-shot" prompt is a prompt that produces the complete desired result on the first attempt. Confusingly it also sometimes means a prompt that includes a single example of the desired output format. Here I'm using the first of those two definitions.)

I'll start with an example of a tool I built that way.

I had another round of battle with Amazon S3 today trying to figure out why a file in one of my buckets couldn't be accessed via a public URL.

Out of frustration I prompted Claude with a variant of the following (full transcript here):

I can't access the file at EXAMPLE_S3_URL. Write me a Python CLI tool using Click and boto3 which takes a URL of that form and then uses EVERY single boto3 trick in the book to try and debug why the file is returning a 404

It wrote me this script, which gave me exactly what I needed. I ran it like this:

uv run debug_s3_access.py \
  https://test-public-bucket-simonw.s3.us-east-1.amazonaws.com/0f550b7b28264d7ea2b3d360e3381a95.jpg

You can see the text output here.

Inline dependencies and uv run

Crucially, I didn't have to take any extra steps to install any of the dependencies that the script needed. That's because the script starts with this magic comment:

# /// script
# requires-python = ">=3.12"
# dependencies = [
#     "click",
#     "boto3",
#     "urllib3",
#     "rich",
# ]
# ///

This is an example of inline script dependencies, a feature described in PEP 723 and implemented by uv run. Running the script causes uv to create a temporary virtual environment with those dependencies installed, a process that takes just a few milliseconds once the uv cache has been populated.
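
To show what a tool has to do with that comment, here's a sketch that pulls the TOML out of a script. The regular expression is the reference one from PEP 723; the script_metadata_toml() function name is mine, and this is not how uv itself is implemented.

```python
import re

# Reference regular expression from PEP 723
REGEX = r"(?m)^# /// (?P<type>[a-zA-Z0-9-]+)$\s(?P<content>(^#(| .*)$\s)+)^# ///$"


def script_metadata_toml(script_text):
    """Return the TOML inside the `# /// script` block, or None if absent."""
    for match in re.finditer(REGEX, script_text):
        if match.group("type") == "script":
            # Strip the leading "# " (or bare "#") from every line
            return "".join(
                line[2:] if line.startswith("# ") else line[1:]
                for line in match.group("content").splitlines(keepends=True)
            )
    return None
```

On Python 3.11 or later the returned string can be passed to tomllib.loads() to get the requires-python and dependencies values as a dictionary.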

This even works if the script is specified by a URL! Anyone with uv installed can run the following command (provided you trust me not to have replaced the script with something malicious) to debug one of their own S3 buckets:

uv run http://tools.simonwillison.net/python/debug_s3_access.py \
  https://test-public-bucket-simonw.s3.us-east-1.amazonaws.com/0f550b7b28264d7ea2b3d360e3381a95.jpg

Writing these with the help of a Claude Project

The reason I can one-shot scripts like this now is that I've set up a Claude Project called "Python app". Projects can have custom instructions, and I used those to "teach" Claude how to take advantage of inline script dependencies:

You write Python tools as single files. They always start with this comment:
# /// script
# requires-python = ">=3.12"
# ///
These files can include dependencies on libraries such as Click. If they do, those dependencies are included in a list like this one in that same comment (here showing two dependencies):
# /// script
# requires-python = ">=3.12"
# dependencies = [
#     "click",
#     "sqlite-utils",
# ]
# ///

That's everything Claude needs to reliably knock out full-featured Python tools as single scripts which can be run directly using whatever dependencies Claude chose to include.

I didn't suggest that Claude use rich for the debug_s3_access.py script earlier but it decided to use it anyway!

I've only recently started experimenting with this pattern but it seems to work really well. Here's another example - my prompt was:

Starlette web app that provides an API where you pass in ?url= and it strips all HTML tags and returns just the text, using beautifulsoup

Here's the chat transcript and the raw code it produced. You can run that server directly on your machine (it uses port 8000) like this:

uv run https://gist.githubusercontent.com/simonw/08957a1490ebde1ea38b4a8374989cf8/raw/143ee24dc65ca109b094b72e8b8c494369e763d6/strip_html.py

Then visit http://127.0.0.1:8000/?url=https://simonwillison.net/ to see it in action.

Custom instructions

The pattern here that's most interesting to me is using custom instructions or system prompts to show LLMs how to implement new patterns that may not exist in their training data. uv run is less than a year old, but providing just a short example is enough to get the models to write code that takes advantage of its capabilities.

I have a similar set of custom instructions I use for creating single page HTML and JavaScript tools, again running in a Claude Project:

Never use React in artifacts - always plain HTML and vanilla JavaScript and CSS with minimal dependencies.

CSS should be indented with two spaces and should start like this:
<style>
* {
  box-sizing: border-box;
}
Inputs and textareas should be font size 16px. Font should always prefer Helvetica.

JavaScript should be two space indents and start like this:
<script type="module">
// code in here should not be indented at the first level

Most of the tools on my tools.simonwillison.net site were created using versions of this custom instructions prompt.

Tags: aws, cli, python, s3, ai, prompt-engineering, generative-ai, llms, ai-assisted-programming, claude, claude-artifacts, uv, rich, prompt-to-app, starlette

DSQL Vignette: Reads and Compute

2024-12-06T17:12:10+00:00

Marc Brooker is one of the engineers behind AWS's new Aurora DSQL horizontally scalable database. Here he shares all sorts of interesting details about how it works under the hood.

The system is built around the principle of separating storage from compute: storage uses S3, while compute runs in Firecracker:

Each transaction inside DSQL runs in a customized Postgres engine inside a Firecracker MicroVM, dedicated to your database. When you connect to DSQL, we make sure there are enough of these MicroVMs to serve your load, and scale up dynamically if needed. We add MicroVMs in the AZs and regions your connections are coming from, keeping your SQL query processor engine as close to your client as possible to optimize for latency.

We opted to use PostgreSQL here because of its pedigree, modularity, extensibility, and performance. We’re not using any of the storage or transaction processing parts of PostgreSQL, but are using the SQL engine, an adapted version of the planner and optimizer, and the client protocol implementation.

The system then provides strong repeatable-read transaction isolation using MVCC and EC2's high precision clocks, enabling reads "as of time X" including against nearby read replicas.

The storage layer supports index scans, which means the compute layer can push down some operations allowing it to load a subset of the rows it needs, reducing round-trips that are affected by speed-of-light latency.

The overall approach here is disaggregation: we’ve taken each of the critical components of an OLTP database and made it a dedicated service. Each of those services is independently horizontally scalable, most of them are shared-nothing, and each can make the design choices that is most optimal in its domain.

Tags: aws, databases, ec2, postgresql, s3, scaling, software-architecture, firecracker

Amazon S3 adds new functionality for conditional writes

2024-11-26T01:14:29+00:00

Amazon S3 can now perform conditional writes that evaluate if an object is unmodified before updating it. This helps you coordinate simultaneous writes to the same object and prevents multiple concurrent writers from unintentionally overwriting the object without knowing the state of its content. You can use this capability by providing the ETag of an object [...]

This new conditional header can help improve the efficiency of your large-scale analytics, distributed machine learning, and other highly parallelized workloads by reliably offloading compare and swap operations to S3.

(Both Azure Blob Storage and Google Cloud have this feature already.)

When AWS added conditional write support just for if an object with that key exists or not back in August I wrote about Gunnar Morling's trick for Leader Election With S3 Conditional Writes. This new capability opens up a whole set of new patterns for implementing distributed locking systems along those lines.

Here's a useful illustrative example by lxgr on Hacker News:

As a (horribly inefficient, in case of non-trivial write contention) toy example, you could use S3 as a lock-free concurrent SQLite storage backend: Reads work as expected by fetching the entire database and satisfying the operation locally; writes work like this:

Download the current database copy

Perform your write locally

Upload it back using "Put-If-Match" and the pre-edit copy as the matched object.

If you get success, consider the transaction successful.

If you get failure, go back to step 1 and try again.
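
Those five steps can be sketched as a retry loop against an in-memory stand-in for S3. FakeObjectStore and update_with_retry() are hypothetical names, and the SHA-256 "ETag" is a simplification: real S3 ETags are computed differently.

```python
import hashlib


class FakeObjectStore:
    """In-memory stand-in for one S3 object with an If-Match style PUT."""

    def __init__(self, body=b""):
        self.body = body

    def etag(self):
        return hashlib.sha256(self.body).hexdigest()

    def get(self):
        return self.body, self.etag()

    def put_if_match(self, body, etag):
        if etag != self.etag():
            return False  # S3 would answer 412 Precondition Failed
        self.body = body
        return True


def update_with_retry(store, apply_write, max_attempts=10):
    for _ in range(max_attempts):
        body, etag = store.get()  # 1. download the current copy
        new_body = apply_write(body)  # 2. perform the write locally
        if store.put_if_match(new_body, etag):  # 3. conditional upload
            return new_body  # 4. success
        # 5. someone else won the race, so loop and try again
    raise RuntimeError("too much write contention")
```

The max_attempts cap is my addition so that a heavily contended object fails loudly instead of looping forever.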

AWS also just added the ability to enforce conditional writes in bucket policies:

To enforce conditional write operations, you can now use s3:if-none-match or s3:if-match condition keys to write a bucket policy that mandates the use of HTTP if-none-match or HTTP if-match conditional headers in S3 PutObject and CompleteMultipartUpload API requests. With this bucket policy in place, any attempt to write an object to your bucket without the required conditional header will be rejected.

Via Hacker News

Tags: aws, s3, scaling, software-architecture

Amazon S3 Express One Zone now supports the ability to append data to an object

2024-11-22T04:39:35+00:00

This is a first for Amazon S3: it is now possible to append data to an existing object in a bucket, where previously the only supported operation was to atomically replace the object with an updated version.

This is only available for S3 Express One Zone, a bucket class introduced a year ago which provides storage in just a single availability zone, providing significantly lower latency at the cost of reduced redundancy and a much higher price (16c/GB/month compared to 2.3c for S3 standard tier).

The fact that appends have never been supported for multi-availability zone S3 provides an interesting clue as to the underlying architecture. Guaranteeing that every copy of an object has received and applied an append is significantly harder than doing a distributed atomic swap to a new version.

Leader Election With S3 Conditional Writes

2024-08-30T23:13:09+00:00

Amazon S3 added support for conditional writes last week, so you can now write a key to S3 with a reliable failure if someone else has already created it.

This is a big deal. It reminds me of the time in 2020 when S3 added read-after-write consistency, an astonishing piece of distributed systems engineering.

Gunnar Morling demonstrates how this can be used to implement a distributed leader election system. The core flow looks like this:

Scan an S3 bucket for files matching lock_* - like lock_0000000001.json. If the highest number contains {"expired": false} then that is the leader
If the highest lock has expired, attempt to become the leader yourself: increment that lock ID and then attempt to create lock_0000000002.json with a PUT request that includes the new If-None-Match: * header - set the file content to {"expired": false}
If that succeeds, you are the leader! If not then someone else beat you to it.
To resign from leadership, update the file with {"expired": true}
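
Here's that flow as a sketch against an in-memory bucket. This is my own illustration, not Gunnar's code: FakeLockBucket stands in for S3, put_if_none_match() mimics a PUT with If-None-Match: *, and the lock validity timeouts are left out.

```python
import json


class FakeLockBucket:
    """In-memory stand-in for a bucket supporting If-None-Match: * writes."""

    def __init__(self):
        self.objects = {}

    def list_keys(self, prefix):
        return sorted(k for k in self.objects if k.startswith(prefix))

    def put_if_none_match(self, key, body):
        # Fails if the key already exists, like the conditional PUT would
        if key in self.objects:
            return False
        self.objects[key] = body
        return True


def lock_key(number):
    return f"lock_{number:010d}.json"


def try_to_lead(bucket):
    """Return the lock key we now hold, or None if there is a leader already."""
    keys = bucket.list_keys("lock_")
    highest = 0
    if keys:
        latest = keys[-1]
        if not json.loads(bucket.objects[latest])["expired"]:
            return None  # the current leader is still active
        highest = int(latest[len("lock_"):-len(".json")])
    candidate = lock_key(highest + 1)
    if bucket.put_if_none_match(candidate, json.dumps({"expired": False})):
        return candidate
    return None  # someone else beat us to it


def resign(bucket, key):
    bucket.objects[key] = json.dumps({"expired": True})
```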

There's a bit more to it than that - Gunnar also describes how to implement lock validity timeouts such that a crashed leader doesn't leave the system leaderless.

Via Hacker News

Tags: aws, s3, software-architecture

Quoting Jeff Barr

2024-07-31T12:59:57+00:00

After giving it a lot of thought, we made the decision to discontinue new access to a small number of services, including AWS CodeCommit.

While we are no longer onboarding new customers to these services, there are no plans to change the features or experience you get today, including keeping them secure and reliable. [...]

The services I'm referring to are: S3 Select, CloudSearch, Cloud9, SimpleDB, Forecast, Data Pipeline, and CodeCommit.

— Jeff Barr

Tags: aws, jeff-barr, s3

How an empty S3 bucket can make your AWS bill explode

2024-04-30T11:19:21+00:00

Maciej Pocwierz accidentally created an S3 bucket with a name that was already used as a placeholder value in a widely used piece of software. They saw 100 million PUT requests to their new bucket in a single day, racking up a big bill since AWS charges $5/million PUTs.

It turns out AWS charge that same amount for PUTs that result in a 403 authentication error, a policy that extends even to "requester pays" buckets!

So, if you know someone's S3 bucket name you can DDoS their AWS bill just by flooding them with meaningless unauthenticated PUT requests.

AWS support refunded Maciej's bill as an exception here, but I'd like to see them reconsider this broken policy entirely.

Update from Jeff Barr:

We agree that customers should not have to pay for unauthorized requests that they did not initiate. We’ll have more to share on exactly how we’ll help prevent these charges shortly.

Via Lobste.rs

Tags: aws, jeff-barr, s3, security

s3-credentials 0.16

2024-04-05T05:35:57+00:00

I spent entirely too long this evening trying to figure out why files in my new supposedly public S3 bucket were unavailable to view. It turns out these days you need to set a PublicAccessBlockConfiguration of {"BlockPublicAcls": false, "IgnorePublicAcls": false, "BlockPublicPolicy": false, "RestrictPublicBuckets": false}.

The s3-credentials --create-bucket --public option now does that for you. I also added a s3-credentials debug-bucket name-of-bucket command to help figure out why a bucket isn't working as expected.
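
If you want to apply that setting yourself with boto3, a minimal sketch looks like this. The allow_public_access() function name is hypothetical and this is not the exact code inside s3-credentials.

```python
PUBLIC_ACCESS_BLOCK_OFF = {
    "BlockPublicAcls": False,
    "IgnorePublicAcls": False,
    "BlockPublicPolicy": False,
    "RestrictPublicBuckets": False,
}


def allow_public_access(s3_client, bucket):
    """Switch off all four public access block settings for one bucket."""
    s3_client.put_public_access_block(
        Bucket=bucket,
        PublicAccessBlockConfiguration=PUBLIC_ACCESS_BLOCK_OFF,
    )
```

Call it as allow_public_access(boto3.client("s3"), "name-of-bucket"). This only removes the block: the bucket still needs a policy that grants public read access before files become visible.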

Tags: aws, projects, s3, s3-credentials

S3 is files, but not a filesystem

2024-03-10T11:47:34+00:00

Cal Paterson helps some concepts click into place for me: S3 imitates a file system but has a number of critical missing features, the most important of which is the lack of partial updates. Any time you want to modify even a few bytes in a file you have to upload and overwrite the entire thing. Almost every database system is dependent on partial updates to function, which is why there are so few databases that can use S3 directly as a backend storage mechanism.

Via Lobste.rs

Tags: aws, databases, s3

Slashing Data Transfer Costs in AWS by 99%

2024-01-15T22:22:10+00:00

Brilliant trick by Daniel Kleinstein. If you have data in two availability zones in the same AWS region, transferring a TB will cost you $10 in ingress and $10 in egress at the inter-zone rates charged by AWS.

But... transferring data to an S3 bucket in that same region is free (aside from S3 storage costs). And buckets are available with free transfer to all availability zones in their region, which means that TB of data can be transferred between availability zones for mere cents of S3 storage costs provided you delete the data as soon as it’s transferred.
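
A rough sketch of the comparison, under my own assumptions: $0.01/GB in each direction for inter-zone traffic, S3 standard storage at $0.023/GB/month prorated by the hour, a 730 hour month, and request charges ignored.

```python
GB_PER_TB = 1000
INTER_AZ_PRICE_PER_GB = 0.01  # USD, charged on both ingress and egress
S3_STORAGE_PRICE_PER_GB_MONTH = 0.023  # USD, S3 standard tier
HOURS_PER_MONTH = 730  # assumption: average month length


def direct_transfer_cost(terabytes):
    """Cost in USD of moving data directly between availability zones."""
    return terabytes * GB_PER_TB * INTER_AZ_PRICE_PER_GB * 2


def via_s3_storage_cost(terabytes, hours_stored):
    """Storage cost in USD of parking the data in S3 for a few hours."""
    monthly = terabytes * GB_PER_TB * S3_STORAGE_PRICE_PER_GB_MONTH
    return monthly * hours_stored / HOURS_PER_MONTH
```

Under those assumptions direct_transfer_cost(1) is $20, while via_s3_storage_cost(1, 1) comes to roughly three cents for a TB that sits in S3 for an hour.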

Via Hacker News

Tags: aws, s3

How ima.ge.cx works

2023-12-31T04:32:41+00:00

ima.ge.cx is Aidan Steele’s web tool for browsing the contents of Docker images hosted on Docker Hub. The architecture is really interesting: it’s a set of AWS Lambda functions, written in Go, that fetch metadata about the images using Step Functions and then cache it in DynamoDB and S3. It uses S3 Select to serve directory listings from newline-delimited JSON in S3 without retrieving the whole file.

Via @__steele

Tags: aws, go, s3, docker

Litestream backups for Datasette Cloud (and weeknotes)

2022-08-11T17:19:52+00:00

My main focus this week has been adding robust backups to the forthcoming Datasette Cloud.

Datasette Cloud is a SaaS service for Datasette. It allows people to create a private Datasette instance where they can upload data, visualize and transform it and share it with other members of their team. You can join the waiting list to try it out using this form.

I'm building Datasette Cloud on Fly, specifically on Fly Machines.

Security is a big concern for Datasette Cloud. Teams should only be able to access their own data - bugs where users accidentally (or maliciously) access data for another team should be protected against as much as possible.

To help guarantee that, I've designed Datasette Cloud so that each team gets their own, dedicated instance, running in a Firecracker VM managed by Fly. Their data lives in a dedicated volume.

Fly volumes already implement snapshot backups, but I'm interested in defence in depth. This is where Litestream comes in (coincidentally now part of Fly, although it wasn't when I first selected it as my backup strategy).

I'm using Litestream to constantly back up the data for each Datasette Cloud team to an S3 bucket. In the case of a complete failure of a volume, I can restore data from a backup that should be at most a few seconds out of date. Litestream also gives me point-in-time backups, such that I can recover a previous version of the data within a configurable retention window.

Keeping backups isolated

Litestream works by writing a constant stream of pages from SQLite's WAL (Write-Ahead Log) up to an S3 bucket. It needs the ability to both read and write from S3.

This requires making S3 credentials available within the containers that run Datasette and Litestream for each team account.

Credentials in those containers are not visible to the users of the software, but I still wanted to be confident that if the credentials leaked in some way the isolation between teams would be maintained.

Initially I thought about having a separate S3 bucket for each team, but it turns out AWS has a default limit of 100 buckets per account, and a hard limit of 1,000. I aspire to have more than 1,000 customers, so this limit makes a bucket-per-team seem like the wrong solution.

I've learned an absolute ton about S3 and AWS permissions building my s3-credentials tool for creating credentials for accessing S3.

One of the tricks I've learned is that it's possible to create temporary, time-limited credentials that only work for a prefix (effectively a folder) within an S3 bucket.

This means I can run Litestream with credentials that are specific to the team - that can read and write only from the team-ID/ prefix in the S3 bucket I am using to store the backups.

Obtaining temporary credentials

My s3-credentials tool can create credentials for a prefix within an S3 bucket like this:

s3-credentials create my-bucket-for-backups \
  --duration 12h \
  --prefix team-56/

This command uses the sts.assume_role() AWS method to create credentials that allow access to that bucket, attaching this generated JSON policy to it in order to restrict access to the provided prefix.
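
Here's a simplified sketch of that approach. The policy below is my approximation of a prefix-restricted read-write policy, not the exact JSON that s3-credentials generates, and the function names, role ARN argument and session name are all hypothetical.

```python
import json


def prefix_policy(bucket, prefix):
    """Read-write policy limited to keys under `prefix` in `bucket`."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing is only allowed for keys under the prefix
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [f"arn:aws:s3:::{bucket}"],
                "Condition": {"StringLike": {"s3:prefix": [f"{prefix}*"]}},
            },
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
                "Resource": [f"arn:aws:s3:::{bucket}/{prefix}*"],
            },
        ],
    }


def temporary_credentials(sts_client, role_arn, bucket, prefix, seconds):
    """Assume a role with the prefix policy attached as a session policy."""
    response = sts_client.assume_role(
        RoleArn=role_arn,
        RoleSessionName="s3-prefix-session",  # hypothetical session name
        Policy=json.dumps(prefix_policy(bucket, prefix)),
        DurationSeconds=seconds,
    )
    return response["Credentials"]
```

The returned dictionary includes the AccessKeyId, SecretAccessKey and SessionToken values, along with an Expiration timestamp.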

I extracted the relevant Python code from s3-credentials and used it to create a private API endpoint in my Datasette Cloud management server which could return the temporary credentials needed by the team container.

With the endpoint in place, my code for launching a team container can do this:

Create the volume and machine for that team (if they do not yet exist)
Generate a signed secret token that the machine container can exchange for its S3 credentials
Launch the machine container, passing it the secret token
On launch, the container runs a script which exchanges that secret token for its 12 hour S3 credentials, using the private API endpoint I created
Those credentials are used to populate the AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY and AWS_SESSION_TOKEN environment variables used by Litestream
Start Litestream, which then starts Datasette
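
The signed secret token in that flow could be as simple as an HMAC over the team ID. This is a sketch of one way to do it, not the scheme Datasette Cloud actually uses: the token format and both function names are assumptions.

```python
import hashlib
import hmac


def sign_token(signing_key, team_id):
    """Create a token of the form '<team_id>.<hex signature>'."""
    signature = hmac.new(signing_key, team_id.encode(), hashlib.sha256).hexdigest()
    return f"{team_id}.{signature}"


def verify_token(signing_key, token):
    """Return the team ID if the signature is valid, otherwise None."""
    team_id, _, signature = token.rpartition(".")
    expected = hmac.new(signing_key, team_id.encode(), hashlib.sha256).hexdigest()
    # compare_digest avoids leaking information through timing differences
    if team_id and hmac.compare_digest(signature, expected):
        return team_id
    return None
```

The private API endpoint would call verify_token() and then only issue credentials for the prefix belonging to the team ID it gets back.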

Restarting every 12 hours

You may be wondering why I bothered with that initial secret token - why not just pass the temporary AWS credentials to the container when I launch it?

The reason for this is that I need to be able to obtain fresh credentials every 12 hours.

A really neat feature of Fly Machines is that they support scale-to-zero. You can stop them, and Fly will automatically restart them the next time they receive traffic.

All you need to do is call sys.exit(0) in your Python code (or the equivalent in any other language) and Fly will stop your container... and then restart it again with a couple of seconds of cold start time the next time an HTTP request for your container hits the Fly router.

So far I'm mainly using this to avoid the cost of running containers when they aren't actually in use. But there's a neat benefit when it comes to Litestream too.

I'm using S3 credentials which expire after 12 hours. This means I need to periodically refresh the credentials and restart Litestream or it will stop being able to write to the S3 bucket.

After considering a few ways of doing this, I selected the simplest to implement: have Datasette call sys.exit(0) after ten hours, and let Fly restart the container causing my startup script to fetch freshly generated 12 hour credentials and pass them to Litestream.

I implemented this by adding it as a new setting to my existing datasette-scale-to-zero plugin. You can now configure that with "max-age": "10h" and it will shut down Datasette once the server has been running for that long.
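
The logic behind a max-age setting like that can be sketched in a few lines. This is not the plugin's actual code: parse_duration() and should_shut_down() are hypothetical names, and I'm assuming only s, m and h suffixes.

```python
import re
import time

UNITS = {"s": 1, "m": 60, "h": 3600}


def parse_duration(value):
    """Convert strings such as '30s', '5m' or '10h' into seconds."""
    match = re.fullmatch(r"(\d+)([smh])", value)
    if match is None:
        raise ValueError(f"Invalid duration: {value!r}")
    return int(match.group(1)) * UNITS[match.group(2)]


def should_shut_down(started_at, max_age, now=None):
    """True once the server has been running for at least max_age."""
    now = time.monotonic() if now is None else now
    return now - started_at >= parse_duration(max_age)
```

A periodic check would call should_shut_down(started_at, "10h") and then sys.exit(0) when it returns True, leaving Fly to restart the container on the next request.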

Why does this require my own secret token system? Because when the container is restarted, it needs to make an authenticated call to my endpoint to retrieve those fresh S3 credentials. Fly persists environment variable secrets between restarts to the container, so that secret can be long-lived even while it is exchanged for short-term S3 credentials.

I only just put the new backup system in place, so I'm exercising it a bit before I open things up to trial users - but so far it's looking like a very robust solution to the problem.

s3-ocr improvements

I released a few new versions of s3-ocr this week, as part of my ongoing project working with the San Francisco Microscopical Society team to release a searchable version of their scanned document archives.

The two main improvements are:

A new --dry-run option to s3-ocr start which shows you what the tool will do without making any changes to your S3 bucket, or triggering any OCR jobs. #22
s3-ocr start used to fail with an error if running it would create more than 100 (or 600 depending on your region) concurrent OCR jobs. The tool now knows how to identify that error and pause and retry starting the jobs instead. #21
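The pause-and-retry behaviour can be sketched roughly like this. It's an illustration of the pattern rather than the code s3-ocr uses: `TooManyJobsError` is a hypothetical stand-in for whatever error signals the concurrent job limit, and the pause length and attempt count are arbitrary assumptions.

```python
import time


class TooManyJobsError(Exception):
    """Hypothetical stand-in for the 'too many concurrent jobs' error."""


def start_with_retry(start_job, key, pause_seconds=30, max_attempts=10,
                     sleep=time.sleep):
    """Call start_job(key), pausing and retrying while the job limit is hit."""
    for attempt in range(1, max_attempts + 1):
        try:
            return start_job(key)
        except TooManyJobsError:
            if attempt == max_attempts:
                # Give up and let the caller see the error
                raise
            sleep(pause_seconds)
```

Passing `sleep` in as an argument means a test can swap in a fake and run instantly.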

The fix that took the most time is this: installations of the tool no longer arbitrarily fail to work depending on the environment you install them into!

Solving this took me the best part of a day. The short version is this: Click 8.1.0 introduced a new feature that lets you use @cli.command as a decorator instead of @cli.command(). This meant that installing s3-ocr in an environment that already had a previous version of Click would result in silent errors.

The solution is simple: pin to click>=8.1.0 in the project dependencies if you plan to use this new syntax.
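To show why the two decorator spellings can behave differently, here's a small standalone sketch of a decorator that accepts both forms. This is not Click's implementation; `registry` and `command` are names invented for the example.

```python
registry = {}


def command(func=None, *, name=None):
    """Register a function as a command.

    Works both as @command and as @command() / @command(name="...").
    """
    def register(f):
        registry[name or f.__name__] = f
        return f

    if func is not None:
        # Bare usage: @command passed the function in directly
        return register(func)
    # Called usage: @command(...) needs to return the real decorator
    return register


@command
def start():
    return "start"


@command()
def index():
    return "index"


@command(name="inspect-job")
def inspect_job():
    return "inspect"
```

Without the `func is not None` branch, the bare `@command` form would return the inner `register` function without registering anything and without raising an error, which is the kind of silent failure that is hard to track down.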

If I'd read the Click changelog more closely I would have saved myself a whole lot of time.

Issues #25 and #26 detail the many false turns I took trying to figure this out.

More fun with GPT-3 and DALL-E

This tweet scored over a million impressions on Twitter:

New hobby: prototyping video games in 60 seconds using a combination of GPT-3 and DALL-E

Here's "Raccoon Heist" pic.twitter.com/xQ3Vm8p2XW
- Simon Willison (@simonw) August 5, 2022

As this got retweeted outside of my usual circles it started confusing people who thought the "prototype" was a working game, as opposed to a fake screenshot and a paragraph of descriptive text! I wasn't kidding when I said I spent 60 seconds on this.

I also figured out how to use GPT-3 to write jq one-liners. I love jq but I have to look up how to use it every time, so having GPT-3 do the work for me is a pretty neat time saver. More on that in this TIL: Using GPT-3 to figure out jq recipes

Releases this week

s3-ocr: 0.6.3 - (9 releases total) - 2022-08-10
Tools for running OCR against files stored in S3
datasette-scale-to-zero: 0.2 - (4 releases total) - 2022-08-05
Quit Datasette if it has not received traffic for a specified time period
shot-scraper: 0.14.3 - (18 releases total) - 2022-08-02
A command-line utility for taking automated screenshots of websites
s3-credentials: 0.12.1 - (13 releases total) - 2022-08-01
A tool for creating credentials for accessing S3 buckets
datasette-sqlite-fts4: 0.3.2 - (2 releases total) - 2022-07-31

TIL this week

Tags: ocr, s3, datasette, weeknotes, datasette-cloud, fly, litestream, gpt-3, dalle, gpt

s3-ocr: Extract text from PDF files stored in an S3 bucket

2022-06-30T21:40:27+00:00

I've released s3-ocr, a new tool that runs Amazon's Textract OCR text extraction against PDF files in an S3 bucket, then writes the resulting text out to a SQLite database with full-text search configured so you can run searches against the extracted data.

You can search through a demo of 697 pages of OCRd text at s3-ocr-demo.datasette.io/pages/pages.

Textract works extremely well: it handles dodgy scanned PDFs full of typewritten code and reads handwritten text better than I can! It charges $1.50 per thousand pages processed.

Why I built this

My initial need for this is a collaboration I have running with the San Francisco Microscopical Society. They've been digitizing their archives - which stretch back to 1870! - and were looking for help turning the digital scans into something more useful.

The archives are full of hand-written and type-written notes, scanned and stored as PDFs.

I decided to wrap my work up as a tool because I'm sure there are a LOT of organizations out there with a giant bucket of PDF files that would benefit from being able to easily run OCR and turn the results into a searchable database.

Running Textract directly against large numbers of files is somewhat inconvenient (here's my earlier TIL about it). s3-ocr is my attempt to make it easier.

Tutorial: How I built that demo

The demo instance uses three PDFs from the Library of Congress Harry Houdini Collection on the Internet Archive:

I started by downloading PDFs of those three files.

Then I installed the two tools I needed:

pip install s3-ocr s3-credentials

I used my s3-credentials tool to create a new S3 bucket and credentials with the ability to write files to it, with the new --statement option (which I released today) to add textract permissions to the generated credentials:

s3-credentials create s3-ocr-demo --statement '{
  "Effect": "Allow",
  "Action": "textract:*",
  "Resource": "*"
}' --create-bucket > ocr.json

(Note that you don't need to use s3-credentials at all if you have AWS credentials configured on your machine with root access to your account - just leave off the -a ocr.json options in the following examples.)

s3-ocr-demo is now a bucket I can use for the demo. ocr.json contains JSON with an access key and secret key for an IAM user account that can interact with that bucket, and also has permission to access the AWS Textract APIs.

I uploaded my three PDFs to the bucket:

s3-credentials put-object s3-ocr-demo latestmagicbeing00hoff.pdf latestmagicbeing00hoff.pdf -a ocr.json
s3-credentials put-object s3-ocr-demo practicalmagicia00harr.pdf practicalmagicia00harr.pdf -a ocr.json
s3-credentials put-object s3-ocr-demo unmaskingrobert00houdgoog.pdf unmaskingrobert00houdgoog.pdf -a ocr.json

(I often use Transmit as a GUI for this kind of operation.)

Then I kicked off OCR jobs against every PDF file in the bucket:

% s3-ocr start s3-ocr-demo --all -a ocr.json 
Found 0 files with .s3-ocr.json out of 3 PDFs
Starting OCR for latestmagicbeing00hoff.pdf, Job ID: f66bc2d00fb75d1c42d1f829e5b6788891f9799fda404c4550580959f65a5402
Starting OCR for practicalmagicia00harr.pdf, Job ID: ef085728135d524a39bc037ad6f7253284b1fdbeb728dddcfbb260778d902b55
Starting OCR for unmaskingrobert00houdgoog.pdf, Job ID: 93bd46f02eb099eca369c41e384836d2bd3199b95d415c0257ef3fa3602cbef9

The --all option scans for any file with a .pdf extension. You can pass explicit file names instead if you just want to process one or two files at a time.

This returns straight away, but the OCR process itself can take several minutes depending on the size of the files.

The job IDs can be used to inspect the progress of each task like so:

% s3-ocr inspect-job f66bc2d00fb75d1c42d1f829e5b6788891f9799fda404c4550580959f65a5402
{
  "DocumentMetadata": {
    "Pages": 244
  },
  "JobStatus": "SUCCEEDED",
  "DetectDocumentTextModelVersion": "1.0"
}
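If you wanted to wait for a job from Python instead, a polling loop along these lines should work. This is a sketch that assumes the job was started with Textract's asynchronous text detection API; `wait_for_job` is a made-up helper name, and `textract` is expected to be a `boto3.client("textract")` or anything with the same method.

```python
import time


def wait_for_job(textract, job_id, poll_seconds=10, sleep=time.sleep):
    """Poll a Textract text detection job until it leaves IN_PROGRESS.

    Returns a (status, page_count) tuple; page_count may be None.
    """
    while True:
        response = textract.get_document_text_detection(JobId=job_id)
        status = response["JobStatus"]
        if status != "IN_PROGRESS":
            pages = response.get("DocumentMetadata", {}).get("Pages")
            return status, pages
        sleep(poll_seconds)
```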

Once the job completed, I could preview the text extracted from the PDF like so:

% s3-ocr text s3-ocr-demo latestmagicbeing00hoff.pdf
111
.
116

LATEST MAGIC
BEING
ORIGINAL CONJURING TRICKS
INVENTED AND ARRANGED
BY
PROFESSOR HOFFMANN
(ANGELO LEWIS, M.A.)
Author of "Modern Magic," etc.
WITH NUMEROUS ILLUSTRATIONS
FIRST EDITION
NEW YORK
SPON & CHAMBERLAIN, 120 LIBERTY ST.
...

To create a SQLite database with a table containing rows for every page of scanned text, I ran this command:

% s3-ocr index s3-ocr-demo pages.db -a ocr.json 
Fetching job details  [####################################]  100%
Populating pages table  [####--------------------------------]   13%  00:00:34

I then published the resulting pages.db SQLite database using Datasette - you can explore it here.
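For a sense of what "a pages table with full-text search" can look like, here's a minimal sketch using Python's sqlite3 module. The schema is an assumption for illustration and may not match what s3-ocr index creates; it also assumes your SQLite build includes FTS5.

```python
import sqlite3


def build_pages_db(path, pages):
    """Create a pages table plus an FTS5 index over its text column.

    pages is an iterable of (path, page_number, text) tuples.
    """
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE pages (path TEXT, page INTEGER, text TEXT, "
        "PRIMARY KEY (path, page))"
    )
    # External-content FTS table: the index points back at pages by rowid
    db.execute(
        "CREATE VIRTUAL TABLE pages_fts USING fts5("
        "text, content='pages', content_rowid='rowid')"
    )
    db.executemany(
        "INSERT INTO pages (path, page, text) VALUES (?, ?, ?)", pages
    )
    db.execute("INSERT INTO pages_fts (rowid, text) SELECT rowid, text FROM pages")
    db.commit()
    return db


def search(db, query):
    """Return (path, page) pairs for pages whose text matches the query."""
    return db.execute(
        "SELECT pages.path, pages.page FROM pages_fts "
        "JOIN pages ON pages.rowid = pages_fts.rowid "
        "WHERE pages_fts MATCH ? ORDER BY pages.path, pages.page",
        (query,),
    ).fetchall()
```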

How s3-ocr works

s3-ocr works by calling Amazon's S3 and Textract APIs.

Textract only works against PDF files in asynchronous mode: you call an API endpoint to tell it "start running OCR against this PDF file in this S3 bucket", then wait for it to finish - which can take several minutes.

It defaults to storing the OCR results in its own storage, expiring after seven days. You can instead tell it to store them in your own S3 bucket - I use that option in s3-ocr.
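Starting one of those asynchronous jobs with boto3 looks roughly like this. It's a sketch rather than the exact call s3-ocr makes: `start_ocr` is a hypothetical helper, and the `textract-output` prefix is assumed from the key names Textract writes to the bucket.

```python
def start_ocr(textract, bucket, key, output_prefix="textract-output"):
    """Start an asynchronous Textract job for one PDF and return its job ID.

    textract is expected to be a boto3.client("textract").
    """
    response = textract.start_document_text_detection(
        # The PDF to process, which must already be in S3
        DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}},
        # Ask Textract to write results to our own bucket
        OutputConfig={"S3Bucket": bucket, "S3Prefix": output_prefix},
    )
    return response["JobId"]
```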

A design challenge I faced was that I wanted to make the command restartable and resumable: if the user cancelled the task, I wanted to be able to pick up from where it had got to. I also wanted to be able to run it again after adding more PDFs to the bucket without repeating work for the previously processed files.

I also needed to persist those job IDs: Textract writes the OCR results to keys in the bucket called textract-output/JOB_ID/1-? - but there's no indication as to which PDF file the results correspond to.

My solution is to write tiny extra JSON files to the bucket when the OCR job is first started.

If you have a file called latestmagicbeing00hoff.pdf the start command will create a new file called latestmagicbeing00hoff.pdf.s3-ocr.json with the following content:

{
  "job_id": "f66bc2d00fb75d1c42d1f829e5b6788891f9799fda404c4550580959f65a5402",
  "etag": "\"d79af487579dcbbef26c9b3be763eb5e-2\""
}

This associates the job ID with the PDF file. It also records the original ETag of the PDF file - this is so in the future I can implement a system that can re-run OCR if the PDF has been updated.

The existence of these files lets me do two things:

If you run s3-ocr start s3-ocr-demo --all it can avoid re-submitting PDF files that have already been sent for OCR, by checking for the existence of the .s3-ocr.json file.
When you later ask for the results of the OCR it can use these files to associate the PDF with the results.
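The first of those checks boils down to a comparison across the list of keys in the bucket. Here's a sketch of that logic under the naming scheme described above; the `pdfs_needing_ocr` and `sidecar_body` helper names are invented for this example.

```python
import json

SUFFIX = ".s3-ocr.json"


def pdfs_needing_ocr(keys):
    """Return the PDF keys that have no matching .s3-ocr.json file yet."""
    keys = set(keys)
    return sorted(
        key for key in keys
        if key.endswith(".pdf") and key + SUFFIX not in keys
    )


def sidecar_body(job_id, etag):
    """Build the JSON stored alongside a PDF once its OCR job has started."""
    return json.dumps({"job_id": job_id, "etag": etag}, indent=2)
```

Because the state lives in the bucket itself, running the command a second time needs nothing more than a fresh listing of keys.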

Scattering .s3-ocr.json files all over the place feels a little messy, so I have an open issue considering moving them all to a s3-ocr/ prefix in the bucket instead.

Try it and let me know what you think

This is a brand new project, but I think it's ready for other people to start trying it out.

I ran it against around 7,000 pages from 531 PDF files in the San Francisco Microscopical Society archive and it seemed to work well!

If you try this out and it works (or it doesn't work) please let me know via Twitter or GitHub.

A challenging example page

Here's one of the more challenging pages I processed using Textract:

Here's the result:

In. In J a ... the Joe 14
162
Volxv
Lalpa spinosa, Eggt bud development. of
146
Farcomas spindle. cells in nested gowers 271
Fayigaga tridactylites, leaf glaur of ruce 33
staining & mounting
Stiles 133
tilica films, a structure of Diatoins morehouse 38
thile new microscopic
Broeck 22 /
Smith reproduction in the huntroom tribe
6
Trakes, develop mouht succession of the porsion tango/229
Soirce President of the Roy: truc: Soo
285
forby, Presidents address
105
pongida, difficulties of classification
238
tage, american adjustable concentric
150
ttlese staining & mountring wood sections 133
Stodder, Frustulia Iasconica, havicula
chomboides, & havi cula crassinervis 265
Vol XVI
falicylic acid u movorcopy
160
falpar enctry ology of
Brooke 9.97
Sanderson micros: characters If inflammation
43
tap, circulation of the
42
Jars, structure of the genus Brisinga
44
latter throvite connective substances 191- 241
Jehorey Cessification in birds, formation
of ed blood corpuseles during the
ossification process
by

Releases this week

s3-ocr: 0.4 - (4 releases total) - 2022-06-30
Tools for running OCR against files stored in S3
s3-credentials: 0.12 - (12 releases total) - 2022-06-30
A tool for creating credentials for accessing S3 buckets
datasette-scale-to-zero: 0.1.2 - (3 releases total) - 2022-06-23
Quit Datasette if it has not received traffic for a specified time period

TIL this week

Tags: aws, ocr, pdf, projects, s3, weeknotes, s3-credentials

s3-credentials 0.8

2021-12-07T07:04:35+00:00

s3-credentials 0.8

The latest release of my s3-credentials CLI tool for creating S3 buckets with credentials to access them (with read-write, read-only or write-only policies) adds a new --public option for creating buckets that allow public access, such that anyone who knows a filename can download a file. The s3-credentials put-object command also now sets the appropriate Content-Type header on the uploaded object.
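One plausible way to pick that Content-Type is to guess it from the filename using Python's mimetypes module. This is a sketch, not necessarily how put-object does it: the helper name and the `application/octet-stream` fallback are assumptions.

```python
import mimetypes


def put_object_with_content_type(s3, bucket, key, body):
    """Upload body to S3, setting Content-Type based on the key's extension.

    s3 is expected to be a boto3.client("s3").
    """
    guessed, _encoding = mimetypes.guess_type(key)
    # Fallback for unknown extensions (an assumption for this sketch)
    content_type = guessed or "application/octet-stream"
    s3.put_object(Bucket=bucket, Key=key, Body=body, ContentType=content_type)
    return content_type
```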

Tags: cli, projects, s3, s3-credentials

Weeknotes: git-history, created for a Git scraping workshop

2021-11-15T04:10:50+00:00

My main project this week was a 90 minute workshop I delivered about Git scraping at Coda.Br 2021, a Brazilian data journalism conference, on Friday. This inspired the creation of a brand new tool, git-history, plus smaller improvements to a range of other projects.

git-history

I still need to do a detailed write-up of this one (update: git-history: a tool for analyzing scraped data collected using Git and SQLite), but on Thursday I released a brand new tool called git-history, which I describe as "tools for analyzing Git history using SQLite".

This tool is the missing link in the Git scraping pattern I described here last October.

Git scraping is the technique of regularly scraping an online source of information and writing the results to a file in a Git repository... which automatically gives you a full revision history of changes made to that data source over time.

The missing piece has always been what to do next: how do you turn a commit history of changes to a JSON or CSV file into a data source that can be used to answer questions about how that file changed over time?

I've written one-off Python scripts for this a few times (here's my CDC vaccinations one, for example), but giving an interactive workshop about the technique finally inspired me to build a tool to help.

The tool has a comprehensive README, but the short version is that you can take a JSON (or CSV) file in a repository that has been tracking changes to some items over time and run the following to load all of the different versions into a SQLite database file for analysis with Datasette:

git-history file incidents.db incidents.json --id IncidentID

This assumes that incidents.json contains a JSON array of incidents (reported fires for example) and that each incident has an IncidentID identifier key. It will then loop through the Git history of that file right from the start, creating an item_versions table that tracks every change made to each of those items - using IncidentID to decide if a row represents a new incident or an update to a previous one.
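The core of that loop can be sketched in plain Python. This illustrates the idea rather than how git-history is implemented: it takes the file contents from each commit as a list of JSON strings (oldest first) instead of reading a real Git repository, and the `item_versions` function name and row layout are assumptions.

```python
import json


def item_versions(snapshots, id_key):
    """Track new and changed items across successive versions of a JSON file.

    snapshots is a list of JSON strings, oldest first, each a JSON array.
    Returns rows of (snapshot_index, item_id, "new" or "updated", item).
    """
    latest = {}
    rows = []
    for index, snapshot in enumerate(snapshots):
        for item in json.loads(snapshot):
            item_id = item[id_key]
            if item_id not in latest:
                rows.append((index, item_id, "new", item))
            elif latest[item_id] != item:
                rows.append((index, item_id, "updated", item))
            else:
                # Unchanged since the previous version, nothing to record
                continue
            latest[item_id] = item
    return rows
```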

I have a few more improvements I want to make before I start more widely promoting this, but it's already really useful. I've had a lot of fun running it against example repos from the git-scraping GitHub topic (now at 202 repos and counting).

Workshop: Raspando dados com o GitHub Actions e analisando com Datasette

The workshop I gave at the conference was live-translated into Portuguese, which is really exciting! I'm looking forward to watching the video when it comes out and seeing how well that worked.

The title translates to "Scraping data with GitHub Actions and analyzing with Datasette", and it was the first time I've given a workshop that combines Git scraping and Datasette - hence the development of the new git-history tool to help tie the two together.

I think it went really well. I put together four detailed exercises for the attendees, and then worked through each one live with the goal of attendees working through them at the same time - a method I learned from the Carpentries training course I took last year.

Four exercises turns out to be exactly right for 90 minutes, with reasonable time for an introduction and some extra material and questions at the end.

The worst part of running a workshop is inevitably the part where you try and get everyone set up with a functional development environment on their own machines (see XKCD 1987). This time round I skipped that entirely by encouraging my students to use GitPod, which provides free browser-based cloud development environments running Linux, with a browser-embedded VS Code editor and terminal running on top.

(It's similar to GitHub Codespaces, but Codespaces is not yet available to free customers outside of the beta.)

I demonstrated all of the exercises using GitPod myself during the workshop, and ensured that they could be entirely completed through that environment, with no laptop software needed at all.

This worked so well. Not having to worry about development environments makes workshops massively more productive. I will absolutely be doing this again in the future.

The workshop exercises are available in this Google Doc, and I hope to extract some of them out into official tutorials for various tools later on.

Datasette 0.59.2

Yesterday was Datasette's fourth birthday - the four year anniversary of the initial release announcement! I celebrated by releasing a minor bug-fix, Datasette 0.59.2, the release notes for which are quoted below:

Column names with a leading underscore now work correctly when used as a facet. (#1506)
Applying ?_nocol= to a column no longer removes that column from the filtering interface. (#1503)
Official Datasette Docker container now uses Debian Bullseye as the base image. (#1497)

That first change was inspired by ongoing work on git-history, where I decided to use a _id underscore prefix pattern for columns that were reserved for use by that tool in order to avoid clashing with column names in the provided source data.

sqlite-utils 3.18

Today I released sqlite-utils 3.18 - initially also to provide a feature I wanted for git-history (a way to populate additional columns when creating a row using table.lookup()) but I also closed some bug reports and landed some small pull requests that had come in since 3.17.

s3-credentials 0.5

Earlier in the week I released version 0.5 of s3-credentials - my CLI tool for creating read-only, read-write or write-only AWS credentials for a specific S3 bucket.

The biggest new feature is the ability to create temporary credentials, that expire after a given time limit.

This is achieved using STS.assume_role(), where STS is Security Token Service. I've been wanting to learn this API for quite a while now.

Assume role comes with some limitations: tokens must live between 15 minutes and 12 hours, and you need to first create a role that you can assume. In creating those credentials you can define an additional policy document, which is how I scope down the token I'm creating to only allow a specific level of access to a specific S3 bucket.
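Here's a rough sketch of what scoping down an assumed role looks like with boto3. It is not the code from s3-credentials: the `temporary_credentials` helper, the session name and the read-only policy shown here are assumptions for illustration, and the role ARN has to refer to a role you have already created.

```python
import json


def temporary_credentials(sts, role_arn, bucket, duration_seconds=3600):
    """Return temporary credentials limited to reading objects in one bucket.

    sts is expected to be a boto3.client("sts").
    """
    # assume_role only accepts durations between 15 minutes and 12 hours
    if not 900 <= duration_seconds <= 43200:
        raise ValueError("duration must be between 15 minutes and 12 hours")
    # Illustrative inline policy: further restricts whatever the role allows
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": [f"arn:aws:s3:::{bucket}/*"],
            }
        ],
    }
    response = sts.assume_role(
        RoleArn=role_arn,
        RoleSessionName="s3-credentials-example",
        Policy=json.dumps(policy),
        DurationSeconds=duration_seconds,
    )
    # Contains AccessKeyId, SecretAccessKey, SessionToken and Expiration
    return response["Credentials"]
```

The extra policy can only narrow what the role already permits. It never grants anything the role itself lacks.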

I've learned a huge amount about AWS, IAM and S3 through developing this project. I think I'm finally overcoming my multi-year phobia of anything involving IAM!

Releases this week

sqlite-utils: 3.18 - (88 releases total) - 2021-11-15
Python CLI utility and library for manipulating SQLite databases
datasette: 0.59.2 - (100 releases total) - 2021-11-14
An open source multi-tool for exploring and publishing data
datasette-hello-world: 0.1.1 - (2 releases total) - 2021-11-14
The hello world of Datasette plugins
git-history: 0.3.1 - (5 releases total) - 2021-11-12
Tools for analyzing Git history using SQLite
s3-credentials: 0.5 - (5 releases total) - 2021-11-11
A tool for creating credentials for accessing S3 buckets

TIL this week

Tags: aws, projects, s3, my-talks, teaching, datasette, weeknotes, git-scraping, sqlite-utils, git-history, s3-credentials

s3-credentials: a tool for creating credentials for S3 buckets

2021-11-03T04:02:04+00:00

I've built a command-line tool called s3-credentials to solve a problem that's been frustrating me for ages: how to quickly and easily create AWS credentials (an access key and secret key) that have permission to read or write from just a single S3 bucket.

The TLDR version

To create a new S3 bucket and generate credentials for reading and writing to it:

% pip install s3-credentials
% s3-credentials create demo-bucket-for-simonwillison-blog-post --create-bucket
Created bucket: demo-bucket-for-simonwillison-blog-post
Created  user: 's3.read-write.demo-bucket-for-simonwillison-blog-post' with permissions boundary: 'arn:aws:iam::aws:policy/AmazonS3FullAccess'
Attached policy s3.read-write.demo-bucket-for-simonwillison-blog-post to user s3.read-write.demo-bucket-for-simonwillison-blog-post
Created access key for user: s3.read-write.demo-bucket-for-simonwillison-blog-post
{
    "UserName": "s3.read-write.demo-bucket-for-simonwillison-blog-post",
    "AccessKeyId": "AKIAWXFXAIOZHY6WAJSF",
    "Status": "Active",
    "SecretAccessKey": "...",
    "CreateDate": "2021-12-06 23:54:08+00:00"
}

You can now use that AccessKeyId and SecretAccessKey to read and write files in that bucket.

The need for bucket credentials for S3

I'm an enormous fan of Amazon S3: I've been using it for fifteen years now (since the launch in 2006) and it's my all-time favourite cloud service: it's cheap, reliable and basically indestructible.

You need two credentials to make API calls to S3: an AWS_ACCESS_KEY_ID and an AWS_SECRET_ACCESS_KEY.

Since I often end up adding these credentials to projects hosted in different environments, I'm not at all keen on using my root-level credentials here: usually a project works against just one dedicated S3 bucket, so ideally I would like to create dedicated credentials that are limited to just that bucket.

Creating those credentials is surprisingly difficult!

Dogsheep Photos

The last time I solved this problem was for my Dogsheep Photos project. I built a tool that uploads all of my photos from Apple Photos to my own dedicated S3 bucket, and extracts the photo metadata into a SQLite database. This means I can do some really cool tricks using SQL to analyze my photos, as described in Using SQL to find my best photo of a pelican according to Apple Photos.

The photos are stored in a private S3 bucket, with a custom proxy in front of them that I can use to grant access to specific photographs via a signed URL.

For the proxy, I decided to create dedicated credentials that were allowed to make read-only requests to my private S3 bucket.

I made detailed notes along the way as I figured out how to do that. It was really hard! There's one step where you literally have to hand-edit a JSON policy document that looks like this (replace dogsheep-photos-simon with your own bucket name) and paste that into the AWS web console:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "s3:*",
      "Resource": [
        "arn:aws:s3:::dogsheep-photos-simon/*"
      ]
    }
  ]
}

I set myself an ambition to try and fix this at some point in the future (that was in April 2020).

Today I found myself wanting new bucket credentials, so I could play with Litestream. I decided to solve this problem once and for all.

I've also been meaning to really get my head around Amazon's IAM permission model for years, and this felt like a great excuse to figure it out through writing code.

The process in full

Here are the steps you need to take in order to get long-lasting credentials for accessing a specific S3 bucket.

Create an S3 bucket
Create a new, dedicated user. You need a user and not a role because long-lasting AWS credentials cannot be created for roles - and we want credentials we can use in a project without constantly needing to update them.
Assign an "inline policy" to that user granting them read-only or read-write access to the specific S3 bucket - this is the JSON format shown above.
Create AWS credentials for that user.

There are plenty of other ways you can achieve this: you can add permissions to a group and assign that user to a group, or you can create a named "managed policy" and attach that to the user. But using an inline policy seems to be the simplest of the available options.

Using the boto3 Python client library for AWS this sequence converts to the following API calls:

import boto3
import json

s3 = boto3.client("s3")
iam = boto3.client("iam")

username = "my-new-user"
bucket_name = "my-new-bucket"
policy_name = "user-can-access-bucket"

policy_document = {
    "... that big JSON document ...": ""
}

# Create the bucket
s3.create_bucket(Bucket=bucket_name)

# Create the user
iam.create_user(UserName=username)

# Assign the policy to the user
iam.put_user_policy(
    PolicyDocument=json.dumps(policy_document),
    PolicyName=policy_name,
    UserName=username,
)

# Retrieve and print the credentials
response = iam.create_access_key(
    UserName=username,
)
print(response["AccessKey"])

Turning it into a CLI tool

I never want to have to figure out how to do this again, so I decided to build a tool around it.

s3-credentials is a Python CLI utility built on top of Click using my click-app cookiecutter template.

It's available through PyPI, so you can install it using:

% pip install s3-credentials

The main command is s3-credentials create, which runs through the above sequence of steps.

To create read-only credentials for my existing static.niche-museums.com bucket I can run the following:

% s3-credentials create static.niche-museums.com --read-only

Created user: s3.read-only.static.niche-museums.com with permissions boundary: arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess
Attached policy s3.read-only.static.niche-museums.com to user s3.read-only.static.niche-museums.com
Created access key for user: s3.read-only.static.niche-museums.com
{
    "UserName": "s3.read-only.static.niche-museums.com",
    "AccessKeyId": "AKIAWXFXAIOZJ26NEGBN",
    "Status": "Active",
    "SecretAccessKey": "...",
    "CreateDate": "2021-11-03 03:21:12+00:00"
}

The command shows each step as it executes, and at the end it outputs the newly created access key and secret key.

It defaults to creating a user with a username that reflects what it will be able to do: s3.read-only.static.niche-museums.com. You can pass --username something to specify a custom username instead.

If you omit the --read-only flag it will create a user with read and write access to the bucket. There's also a --write-only flag which creates a user that can write to but not read from the bucket - useful for use-cases like logging or backup scripts.
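To give a feel for how the three modes differ, here's a simplified sketch of generating a policy document for each one. These are not the policies that ship with s3-credentials (see its README and source for those); the `bucket_policy` helper and the specific actions chosen are assumptions for illustration.

```python
def bucket_policy(bucket, mode="read-write"):
    """Build a simplified IAM policy document for a single S3 bucket.

    mode is one of "read-only", "write-only" or "read-write".
    """
    object_actions = {
        "read-only": ["s3:GetObject"],
        "write-only": ["s3:PutObject"],
        "read-write": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
    }[mode]
    statements = [
        {
            "Effect": "Allow",
            "Action": object_actions,
            # Object-level actions apply to keys inside the bucket
            "Resource": [f"arn:aws:s3:::{bucket}/*"],
        }
    ]
    if mode != "write-only":
        # Listing is a bucket-level action, so the ARN has no /* suffix
        statements.insert(0, {
            "Effect": "Allow",
            "Action": ["s3:ListBucket"],
            "Resource": [f"arn:aws:s3:::{bucket}"],
        })
    return {"Version": "2012-10-17", "Statement": statements}
```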

The README has full documentation on the various other options, plus details of the other s3-credentials utility commands list-users, list-buckets, list-user-policies and whoami.

Learned along the way

This really was a fantastic project for deepening my understanding of S3, IAM and how it all fits together. A few extra points I picked up:

AWS users can be created with something called a permissions boundary. This is an advanced security feature which lets a user be restricted to a set of maximum permissions - for example, only allowed to interact with S3, not any other AWS service.

Permissions boundaries do not themselves grant permissions - a user will not be able to do anything until extra policies are added to their account. They instead act as defense in depth, setting an upper limit to what a user can do no matter what other policies are applied to them.

There's one big catch: the value you set for a permissions boundary is a very weakly documented ARN string - the boto3 documentation simply calls it "The ARN of the policy that is used to set the permissions boundary for the user". I used GitHub code search to dig up some examples, and found arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess and arn:aws:iam::aws:policy/AmazonS3FullAccess to be the ones most relevant to my project. This random file appears to contain more.
Those JSON policy documents really are the dark secret magic that holds AWS together. Finding trustworthy examples of read-only, read-write and write-only policies for specific S3 buckets was not at all easy. I made detailed notes in this comment thread - the policies I went with are baked into the policies.py file in the s3-credentials repository. If you know your way around IAM I would love to hear your feedback on the policies I ended up using!
Writing automated tests for code that makes extensive use of boto3 - such that those tests don't make any real HTTP requests to the API - is a bit fiddly. I explored a few options for this - potential candidates included the botocore.stub.Stubber class and the VCR.py library for saving and replaying HTTP traffic (see this TIL). I ended up going with Python's Mock class, via pytest-mock - here's another TIL on the pattern I used for that. (Update: Jeff Triplett pointed me to moto which looks like a really great solution for this.)
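Two of the points above, permissions boundaries and Mock-based testing, can be sketched together. This is a simplified illustration and not the test suite from the repository: `create_bounded_user` is a hypothetical helper, and the test uses the standard library's unittest.mock directly instead of pytest-mock.

```python
from unittest.mock import Mock

READ_ONLY_BOUNDARY = "arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess"
FULL_ACCESS_BOUNDARY = "arn:aws:iam::aws:policy/AmazonS3FullAccess"


def create_bounded_user(iam, username, read_only=False):
    """Create an IAM user whose maximum permissions are capped at S3.

    iam is expected to be a boto3.client("iam").
    """
    boundary = READ_ONLY_BOUNDARY if read_only else FULL_ACCESS_BOUNDARY
    iam.create_user(UserName=username, PermissionsBoundary=boundary)
    return boundary


def test_create_bounded_user():
    # A Mock stands in for the boto3 client, so no HTTP requests are made
    iam = Mock()
    create_bounded_user(iam, "s3.read-only.example", read_only=True)
    iam.create_user.assert_called_once_with(
        UserName="s3.read-only.example",
        PermissionsBoundary=READ_ONLY_BOUNDARY,
    )
```

Passing the client in as an argument is what makes the Mock easy to substitute. Code that calls boto3.client() internally would need patching instead.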

Feedback from AWS experts wanted

The tool I've built solves my specific problem pretty well. I'm nervous about it though: I am by no means an IAM expert, and I'm somewhat paranoid that I may have made a dumb mistake and baked it into the tooling.

As such, the README currently carries a warning that you should review what the tool is doing carefully before trusting it against your own AWS account!

Update 20 February 2022: I removed that warning, since I've now spent long enough working on this tool that I'm comfortable with how it works.

If you are an AWS expert, you can help: I have an open issue requesting expert feedback, and I'd love to hear from people with deep experience who can either validate that my approach is sound or help explain what I'm doing wrong and how the process can be fixed.

Tags: cli, projects, python, s3, security, s3-credentials

Abusing Terraform to Upload Static Websites to S3

2021-10-06T18:51:27+00:00

Abusing Terraform to Upload Static Websites to S3

I found this really interesting. Terraform is infrastructure as code software which mostly handles creating and updating infrastructure resources, so it’s a poor fit for uploading files to S3 and setting the correct Content-Type headers for them. But... in figuring out how to do that, this article taught me a ton about how Terraform works. I wonder if that’s a useful general pattern? Get a tool to do something that it’s poorly designed to handle and see how much you learn about that tool along the way.

Via Hacker News

Tags: s3

Quoting Brian LeRoux

2021-05-13T15:01:26+00:00

Folks think s3 is static assets hosting but really it's a consistent and highly available key value store with first class blob support

— Brian LeRoux

Tags: s3

logpaste

2021-03-17T15:48:45+00:00

logpaste

Useful example of how to use the Litestream SQLite replication tool in a Dockerized application: S3 credentials are passed to the container on startup, it then attempts to restore the SQLite database from S3 and starts a Litestream process in the same container to periodically synchronize changes back up to the S3 bucket.

Via @deliberatecoder

Tags: replication, s3, sqlite, docker, litestream

athena-sqlite

2019-12-18T09:05:40+00:00

athena-sqlite

Amazon Athena is the AWS tool for querying data stored in S3—as CSV, JSON or Apache Parquet files—using SQL. It’s an interesting way of building a very cheap data warehouse on top of S3 without having to run any additional services. Athena recently added a query federation SDK which lets you define additional custom data sources using Lambda functions. Damon Cortesi used this to write a custom connector for SQLite, which lets you run queries against data stored in SQLite files that you have uploaded to S3. You can then run joins between that data and other Athena sources.

Via @dacort

Tags: athena, aws, s3, sql, sqlite, parquet

Client-side instrumentation for under $1 per month. No servers necessary.

2019-03-15T16:03:48+00:00

Client-side instrumentation for under $1 per month. No servers necessary.

Rolling your own analytics used to be too complex and expensive to be worth the effort. Thanks to cloud technologies like Cloudfront, Athena, S3 and Lambda you can now inexpensively implement client-side analytics (via requests to a tracking pixel) that stores detailed logs on S3, then use Amazon Athena to run queries against those logs ($5/TB scanned) to get detailed reporting. This post also introduced me to Snowplow, an open source JavaScript analytics script (released by a commercial analytics platform) which looks very neat—it’s based on piwik.js, the tracker from the open-source Piwik analytics tool.

Via Hacker News

Tags: analytics, athena, cloudfront, lambda, s3

Django Bakery

2018-06-16T01:49:35+00:00

Django Bakery

“A set of helpers for baking your Django site out as flat files”. Released by the LA Times Data Desk, who use it for a large number of projects from election results to data journalism interactives. Statically publishing these projects to S3 lets them handle huge traffic spikes at a very low cost.

Via @palewire

Tags: data-journalism, django, s3, static-generator, ben-welsh