Simon Willison's Weblog: scaling

Serving the For You feed

2026-04-24T01:08:17+00:00

One of Bluesky's most interesting features is that anyone can run their own custom "feed" implementation and make it available to other users - effectively enabling custom algorithms that can use any mechanism they like to recommend posts.

spacecowboy runs the For You Feed, used by around 72,000 people. This guest post on the AT Protocol blog explains how it works.

The architecture is fascinating. The feed is served by a single Go process using SQLite on a "gaming" PC in spacecowboy's living room - 16 cores, 96GB of RAM and 4TB of attached NVMe storage.

Recommendations are based on likes: what else are the people who like the same things as you liking on the platform?

That Go server consumes the Bluesky firehose and stores the relevant details in SQLite, keeping the last 90 days of relevant data, which currently uses around 419GB of SQLite storage.

Public internet traffic is handled by a $7/month VPS on OVH, which talks to the living room server via Tailscale.

Total cost is now $30/month: $20 in electricity, $7 in VPS and $3 for the two domain names. spacecowboy estimates that the existing system could handle all ~1 million daily active Bluesky users if they were to switch to the cheapest algorithm they have found to work.

Tags: go, scaling, sqlite, software-architecture, tailscale, bluesky

Quoting Matthew Prince

2025-11-19T08:02:36+00:00

Cloudflare's network began experiencing significant failures to deliver core network traffic [...] triggered by a change to one of our database systems' permissions which caused the database to output multiple entries into a “feature file” used by our Bot Management system. That feature file, in turn, doubled in size. The larger-than-expected feature file was then propagated to all the machines that make up our network. [...] The software had a limit on the size of the feature file that was below its doubled size. That caused the software to fail. [...]

This resulted in the following panic which in turn resulted in a 5xx error:

thread fl2_worker_thread panicked: called Result::unwrap() on an Err value

— Matthew Prince, Cloudflare outage on November 18, 2025, see also this comment

Tags: scaling, rust, cloudflare, postmortem

The case against pgvector

2025-11-03T20:26:10+00:00

The case against pgvector

I wasn't keen on the title of this piece but the content is great: Alex Jacobs talks through lessons learned trying to run the popular pgvector PostgreSQL vector indexing extension at scale, in particular the challenges involved in maintaining a large index with close-to-realtime updates using the IVFFlat or HNSW index types.

The section on pre-v.s.-post filtering is particularly useful:

Okay but let's say you solve your index and insert problems. Now you have a document search system with millions of vectors. Documents have metadata---maybe they're marked as draft, published, or archived. A user searches for something, and you only want to return published documents.

[...] should Postgres filter on status first (pre-filter) or do the vector search first and then filter (post-filter)?

This seems like an implementation detail. It’s not. It’s the difference between queries that take 50ms and queries that take 5 seconds. It’s also the difference between returning the most relevant results and… not.

The Hacker News thread for this article attracted a robust discussion, including some fascinating comments by Discourse developer Rafael dos Santos Silva (xfalcox) about how they are using pgvector at scale:

We [run pgvector in production] at Discourse, in thousands of databases, and it's leveraged in most of the billions of page views we serve. [...]

Also worth mentioning that we use quantization extensively:

halfvec (16bit float) for storage - bit (binary vectors) for indexes

Which makes the storage cost and on-going performance good enough that we could enable this in all our hosting. [...]

In Discourse embeddings power:

Related Topics, a list of topics to read next, which uses embeddings of the current topic as the key to search for similar ones

Suggesting tags and categories when composing a new topic

Augmented search

RAG for uploaded files

Via Hacker News

Tags: postgresql, scaling, vector-search, embeddings

Quoting AWS

2025-10-23T04:49:59+00:00

For resiliency, the DNS Enactor operates redundantly and fully independently in three different Availability Zones (AZs). [...] When the second Enactor (applying the newest plan) completed its endpoint updates, it then invoked the plan clean-up process, which identifies plans that are significantly older than the one it just applied and deletes them. At the same time that this clean-up process was invoked, the first Enactor (which had been unusually delayed) applied its much older plan to the regional DDB endpoint, overwriting the newer plan. [...] The second Enactor's clean-up process then deleted this older plan because it was many generations older than the plan it had just applied. As this plan was deleted, all IP addresses for the regional endpoint were immediately removed.

— AWS, Amazon DynamoDB Service Disruption in Northern Virginia (US-EAST-1) Region (14.5 hours long!)

Tags: aws, dns, scaling, postmortem

Quoting Kumar Aditya

2025-09-11T03:07:16+00:00

In Python 3.14, I have implemented several changes to fix thread safety of asyncio and enable it to scale effectively on the free-threaded build of CPython. It is now implemented using lock-free data structures and per-thread state, allowing for highly efficient task management and execution across multiple threads. In the general case of multiple event loops running in parallel, there is no lock contention and performance scales linearly with the number of threads. [...]

For a deeper dive into the implementation, check out the internal docs for asyncio.

— Kumar Aditya, Scaling asyncio on Free-Threaded Python

Tags: async, gil, python, scaling, threads

Announcing PlanetScale for Postgres

2025-07-01T18:16:12+00:00

Announcing PlanetScale for Postgres

PlanetScale formed in 2018 to build a commercial offering on top of the Vitess MySQL sharding open source project, which was originally released by YouTube in 2012. The PlanetScale founders were the co-creators and maintainers of Vitess.

Today PlanetScale are announcing a private preview of their new horizontally sharded PostgreSQL solution, due to "overwhelming" demand.

Notably, it doesn't use Vitess under the hood:

Vitess is one of PlanetScale’s greatest strengths [...] We have made explicit sharding accessible to hundreds of thousands of users and it is time to bring this power to Postgres. We will not however be using Vitess to do this.

Vitess’ achievements are enabled by leveraging MySQL’s strengths and engineering around its weaknesses. To achieve Vitess’ power for Postgres we are architecting from first principles.

Meanwhile, on June 10th Supabase announced that they had hired Vitess co-creator Sugu Sougoumarane to help them build "Multigres: Vitess for Postgres". Sugu said:

For some time, I've been considering a Vitess adaptation for Postgres, and this feeling had been gradually intensifying. The recent explosion in the popularity of Postgres has fueled this into a full-blown obsession. [...]

The project to address this problem must begin now, and I'm convinced that Vitess provides the most promising foundation.

I remember when MySQL was an order of magnitude more popular than PostgreSQL, and Heroku's decision to only offer PostgreSQL back in 2007 was a surprising move. The vibes have certainly shifted.

Tags: databases, mysql, postgresql, scaling, sharding, vitess

GitHub Issues search now supports nested queries and boolean operators: Here’s how we (re)built it

2025-05-26T07:23:17+00:00

GitHub Issues search now supports nested queries and boolean operators: Here’s how we (re)built it

GitHub Issues got a significant search upgrade back in January. Deborah Digges provides some behind the scene details about how it works and how they rolled it out.

The signature new feature is complex boolean logic: you can now search for things like is:issue state:open author:rileybroughten (type:Bug OR type:Epic), up to five levels of nesting deep.

Queries are parsed into an AST using the Ruby parslet PEG grammar library. The AST is then compiled into a nested Elasticsearch bool JSON query.

GitHub Issues search deals with around 2,000 queries a second so robust testing is extremely important! The team rolled it out invisibly to 1% of live traffic, running the new implementation via a queue and competing the number of results returned to try and spot any degradations compared to the old production code.

Tags: elasticsearch, github, ops, parsing, ruby, scaling, search, github-issues

Building, launching, and scaling ChatGPT Images

2025-05-13T23:52:22+00:00

Building, launching, and scaling ChatGPT Images

Gergely Orosz landed a fantastic deep dive interview with OpenAI's Sulman Choudhry (head of engineering, ChatGPT) and Srinivas Narayanan (VP of engineering, OpenAI) to talk about the launch back in March of ChatGPT images - their new image generation mode built on top of multi-modal GPT-4o.

The feature kept on having new viral spikes, including one that added one million new users in a single hour. They signed up 100 million new users in the first week after the feature's launch.

When this vertical growth spike started, most of our engineering teams didn't believe it. They assumed there must be something wrong with the metrics.

Under the hood the infrastructure is mostly Python and FastAPI! I hope they're sponsoring those projects (and Starlette, which is used by FastAPI under the hood.)

They're also using some C, and Temporal as a workflow engine. They addressed the early scaling challenge by adding an asynchronous queue to defer the load for their free users (resulting in longer generation times) at peak demand.

There are plenty more details tucked away behind the firewall, including an exclusive I've not been able to find anywhere else: OpenAI's core engineering principles.

Ship relentlessly - move quickly and continuously improve, without waiting for perfect conditions

Own the outcome - take full responsibility for products, end-to-end

Follow through - finish what is started and ensure the work lands fully

I tried getting o4-mini-high to track down a copy of those principles online and was delighted to see it either leak or hallucinate the URL to OpenAI's internal engineering handbook!

Gergely has a whole series of posts like this called Real World Engineering Challenges, including another one on ChatGPT a year ago.

Via @GergelyOrosz

Tags: python, scaling, ai, openai, generative-ai, chatgpt, llms, gergely-orosz, starlette

Merklemap runs a 16TB PostgreSQL

2025-03-14T20:13:41+00:00

Merklemap runs a 16TB PostgreSQL

Interesting thread on Hacker News where Pierre Barre describes the database architecture behind Merklemap, a certificate transparency search engine.

I run a 100 billion+ rows Postgres database [0], that is around 16TB, it's pretty painless!

There are a few tricks that make it run well (PostgreSQL compiled with a non-standard block size, ZFS, careful VACUUM planning). But nothing too out of the ordinary.

ATM, I insert about 150,000 rows a second, run 40,000 transactions a second, and read 4 million rows a second.

[...]

It's self-hosted on bare metal, with standby replication, normal settings, nothing "weird" there.

6 NVMe drives in raidz-1, 1024GB of memory, a 96 core AMD EPYC cpu.

[...]

About 28K euros of hardware per replica [one-time cost] IIRC + [ongoing] colo costs.

Tags: postgresql, scaling

Quoting Jeremy Edberg

2024-12-24T07:13:01+00:00

[On Reddit] we had to look up every single comment on the page to see if you had voted on it [...]

But with a bloom filter, we could very quickly look up all the comments and get back a list of all the ones you voted on (with a couple of false positives in there). Then we could go to the cache and see if your actual vote was there (and if it was an upvote or a downvote). It was only after a failed cache hit did we have to actually go to the database.

But that bloom filter saved us from doing sometimes 1000s of cache lookups.

— Jeremy Edberg

Tags: bloom-filters, reddit, scaling

DSQL Vignette: Reads and Compute

2024-12-06T17:12:10+00:00

DSQL Vignette: Reads and Compute

Marc Brooker is one of the engineers behind AWS's new Aurora DSQL horizontally scalable database. Here he shares all sorts of interesting details about how it works under the hood.

The system is built around the principle of separating storage from compute: storage uses S3, while compute runs in Firecracker:

Each transaction inside DSQL runs in a customized Postgres engine inside a Firecracker MicroVM, dedicated to your database. When you connect to DSQL, we make sure there are enough of these MicroVMs to serve your load, and scale up dynamically if needed. We add MicroVMs in the AZs and regions your connections are coming from, keeping your SQL query processor engine as close to your client as possible to optimize for latency.

We opted to use PostgreSQL here because of its pedigree, modularity, extensibility, and performance. We’re not using any of the storage or transaction processing parts of PostgreSQL, but are using the SQL engine, an adapted version of the planner and optimizer, and the client protocol implementation.

The system then provides strong repeatable-read transaction isolation using MVCC and EC2's high precision clocks, enabling reads "as of time X" including against nearby read replicas.

The storage layer supports index scans, which means the compute layer can push down some operations allowing it to load a subset of the rows it needs, reducing round-trips that are affected by speed-of-light latency.

The overall approach here is disaggregation: we’ve taken each of the critical components of an OLTP database and made it a dedicated service. Each of those services is independently horizontally scalable, most of them are shared-nothing, and each can make the design choices that is most optimal in its domain.

Tags: aws, databases, ec2, postgresql, s3, scaling, software-architecture, firecracker

Quoting Javi Santana

2024-12-01T05:02:42+00:00

Most people don’t have an intuition about what current hardware can and can’t do. There is a simple math that can help you with that: “you can process about 500MB in one second on a single machine”. I know it’s not a universal truth and there are a lot of details that can change that but believe me, this estimation is a pretty good tool to have under your belt.

— Javi Santana

Tags: scaling, big-data

Amazon S3 adds new functionality for conditional writes

2024-11-26T01:14:29+00:00

Amazon S3 adds new functionality for conditional writes

Amazon S3 can now perform conditional writes that evaluate if an object is unmodified before updating it. This helps you coordinate simultaneous writes to the same object and prevents multiple concurrent writers from unintentionally overwriting the object without knowing the state of its content. You can use this capability by providing the ETag of an object [...]

This new conditional header can help improve the efficiency of your large-scale analytics, distributed machine learning, and other highly parallelized workloads by reliably offloading compare and swap operations to S3.

(Both Azure Blob Storage and Google Cloud have this feature already.)

When AWS added conditional write support just for if an object with that key exists or not back in August I wrote about Gunnar Morling's trick for Leader Election With S3 Conditional Writes. This new capability opens up a whole set of new patterns for implementing distributed locking systems along those lines.

Here's a useful illustrative example by lxgr on Hacker News:

As a (horribly inefficient, in case of non-trivial write contention) toy example, you could use S3 as a lock-free concurrent SQLite storage backend: Reads work as expected by fetching the entire database and satisfying the operation locally; writes work like this:

Download the current database copy

Perform your write locally

Upload it back using "Put-If-Match" and the pre-edit copy as the matched object.

If you get success, consider the transaction successful.

If you get failure, go back to step 1 and try again.

AWS also just added the ability to enforce conditional writes in bucket policies:

To enforce conditional write operations, you can now use s3:if-none-match or s3:if-match condition keys to write a bucket policy that mandates the use of HTTP if-none-match or HTTP if-match conditional headers in S3 PutObject and CompleteMultipartUpload API requests. With this bucket policy in place, any attempt to write an object to your bucket without the required conditional header will be rejected.

Via Hacker News

Tags: aws, s3, scaling, software-architecture

Amazon S3 Express One Zone now supports the ability to append data to an object

2024-11-22T04:39:35+00:00

Amazon S3 Express One Zone now supports the ability to append data to an object

This is a first for Amazon S3: it is now possible to append data to an existing object in a bucket, where previously the only supported operation was to atomically replace the object with an updated version.

This is only available for S3 Express One Zone, a bucket class introduced a year ago which provides storage in just a single availability zone, providing significantly lower latency at the cost of reduced redundancy and a much higher price (16c/GB/month compared to 2.3c for S3 standard tier).

The fact that appends have never been supported for multi-availability zone S3 provides an interesting clue as to the underlying architecture. Guaranteeing that every copy of an object has received and applied an append is significantly harder than doing a distributed atomic swap to a new version.

Using Rust in non-Rust servers to improve performance

2024-10-23T15:45:42+00:00

Using Rust in non-Rust servers to improve performance

Deep dive into different strategies for optimizing part of a web server application - in this case written in Node.js, but the same strategies should work for Python as well - by integrating with Rust in different ways.

The example app renders QR codes, initially using the pure JavaScript qrcode package. That ran at 1,464 req/sec, but switching it to calling a tiny Rust CLI wrapper around the qrcode crate using Node.js spawn() increased that to 2,572 req/sec.

This is yet another reminder to me that I need to get over my cgi-bin era bias that says that shelling out to another process during a web request is a bad idea. It turns out modern computers can quite happily spawn and terminate 2,500+ processes a second!

The article optimizes further first through a Rust library compiled to WebAssembly (2,978 req/sec) and then through a Rust function exposed to Node.js as a native library (5,490 req/sec), then finishes with a full Rust rewrite of the server that replaces Node.js entirely, running at 7,212 req/sec.

Full source code to accompany the article is available in the using-rust-in-non-rust-servers repository.

Via lobste.rs

Tags: javascript, nodejs, performance, scaling, rust

Supercharge the One Person Framework with SQLite: Rails World 2024

2024-10-16T22:24:45+00:00

Supercharge the One Person Framework with SQLite: Rails World 2024

Stephen Margheim shares an annotated transcript of the YouTube video of his recent talk at this year's Rails World conference in Toronto.

The Rails community is leaning hard into SQLite right now. Stephen's talk is some of the most effective evangelism I've seen anywhere for SQLite as a production database for web applications, highlighting several new changes in Rails 8:

... there are two additions coming with Rails 8 that merit closer consideration. Because these changes make Rails 8 the first version of Rails (and, as far as I know, the first version of any web framework) that provides a fully production-ready SQLite experience out-of-the-box.

Those changes: Ensure SQLite transaction default to IMMEDIATE mode to avoid "database is locked" errors when a deferred transaction attempts to upgrade itself with a write lock (discussed here previously, and added to Datasette 1.0a14 in August) and SQLite non-GVL-blocking, fair retry interval busy handler - a lower-level change that ensures SQLite's busy handler doesn't hold Ruby's Global VM Lock (the Ruby version of Python's GIL) while a thread is waiting on a SQLite lock.

The rest of the talk makes a passionate and convincing case for SQLite as an option for production deployments, in line with the Rails goal of being a One Person Framework - "a toolkit so powerful that it allows a single individual to create modern applications upon which they might build a competitive business".

Back in April Stephen published SQLite on Rails: The how and why of optimal performance describing some of these challenges in more detail (including the best explanation I've seen anywhere of BEGIN IMMEDIATE TRANSACTION) and promising:

Unfortunately, running SQLite on Rails out-of-the-box isn’t viable today. But, with a bit of tweaking and fine-tuning, you can ship a very performant, resilient Rails application with SQLite. And my personal goal for Rails 8 is to make the out-of-the-box experience fully production-ready.

It looks like he achieved that goal!

Via Hacker News

Tags: gil, rails, ruby, scaling, sqlite, sqlite-busy

Zero-latency SQLite storage in every Durable Object

2024-10-13T22:26:49+00:00

Zero-latency SQLite storage in every Durable Object

Kenton Varda introduces the next iteration of Cloudflare's Durable Object platform, which recently upgraded from a key/value store to a full relational system based on SQLite.

For useful background on the first version of Durable Objects take a look at Cloudflare's durable multiplayer moat by Paul Butler, who digs into its popularity for building WebSocket-based realtime collaborative applications.

The new SQLite-backed Durable Objects is a fascinating piece of distributed system design, which advocates for a really interesting way to architect a large scale application.

The key idea behind Durable Objects is to colocate application logic with the data it operates on. A Durable Object comprises code that executes on the same physical host as the SQLite database that it uses, resulting in blazingly fast read and write performance.

How could this work at scale?

A single object is inherently limited in throughput since it runs on a single thread of a single machine. To handle more traffic, you create more objects. This is easiest when different objects can handle different logical units of state (like different documents, different users, or different "shards" of a database), where each unit of state has low enough traffic to be handled by a single object

Kenton presents the example of a flight booking system, where each flight can map to a dedicated Durable Object with its own SQLite database - thousands of fresh databases per airline per day.

Each DO has a unique name, and Cloudflare's network then handles routing requests to that object wherever it might live on their global network.

The technical details are fascinating. Inspired by Litestream, each DO constantly streams a sequence of WAL entries to object storage - batched every 16MB or every ten seconds. This also enables point-in-time recovery for up to 30 days through replaying those logged transactions.

To ensure durability within that ten second window, writes are also forwarded to five replicas in separate nearby data centers as soon as they commit, and the write is only acknowledged once three of them have confirmed it.

The JavaScript API design is interesting too: it's blocking rather than async, because the whole point of the design is to provide fast single threaded persistence operations:

let docs = sql.exec(`
  SELECT title, authorId FROM documents
  ORDER BY lastModified DESC
  LIMIT 100
`).toArray();

for (let doc of docs) {
  doc.authorName = sql.exec(
    "SELECT name FROM users WHERE id = ?",
    doc.authorId).one().name;
}

This one of their examples deliberately exhibits the N+1 query pattern, because that's something SQLite is uniquely well suited to handling.

The system underlying Durable Objects is called Storage Relay Service, and it's been powering Cloudflare's existing-but-different D1 SQLite system for over a year.

I was curious as to where the objects are created. According to this (via Hacker News):

Durable Objects do not currently change locations after they are created. By default, a Durable Object is instantiated in a data center close to where the initial get() request is made. [...] To manually create Durable Objects in another location, provide an optional locationHint parameter to get().

And in a footnote:

Dynamic relocation of existing Durable Objects is planned for the future.

where.durableobjects.live is a neat site that tracks where in the Cloudflare network DOs are created - I just visited it and it said:

This page tracks where new Durable Objects are created; for example, when you loaded this page from Half Moon Bay, a worker in San Jose, California, United States (SJC) created a durable object in San Jose, California, United States (SJC).

Via lobste.rs

Tags: scaling, sqlite, websockets, software-architecture, cloudflare, litestream, kenton-varda

Serving a billion web requests with boring code

2024-06-28T16:22:45+00:00

Serving a billion web requests with boring code

Bill Mill provides a deep retrospective from his work helping build a relaunch of the medicare.gov/plan-compare site.

It's a fascinating case study of the choose boring technology mantra put into action. The "boring" choices here were PostgreSQL, Go and React, all three of which are so widely used and understood at this point that you're very unlikely to stumble into surprises with them.

Key goals for the site were accessibility, in terms of users, devices and performance. Despite best efforts:

The result fell prey after a few years to a common failure mode of react apps, and became quite heavy and loaded somewhat slowly.

I've seen this pattern myself many times over, and I'd love to understand why. React itself isn't a particularly large dependency but somehow it always seems to lead to architectural bloat over time. Maybe that's more of an SPA thing than something that's specific to React.

Loads of other interesting details in here. The ETL details - where brand new read-only RDS databases were spun up every morning after a four hour build process - are particularly notable.

Via Lobste.rs

Tags: accessibility, go, postgresql, scaling, react, boring-technology

Quoting Richard Schneeman

2024-05-16T05:44:39+00:00

[...] by default Heroku will spin up multiple dynos in different availability zones. It also has multiple routers in different zones so if one zone should go completely offline, having a second dyno will mean that your app can still serve traffic.

— Richard Schneeman

Tags: scaling, heroku

How Figma’s databases team lived to tell the scale

2024-03-14T21:23:37+00:00

How Figma’s databases team lived to tell the scale

The best kind of scaling war story:

"Figma’s database stack has grown almost 100x since 2020. [...] In 2020, we were running a single Postgres database hosted on AWS’s largest physical instance, and by the end of 2022, we had built out a distributed architecture with caching, read replicas, and a dozen vertically partitioned databases."

I like the concept of "colos", their internal name for sharded groups of related tables arranged such that those tables can be queried using joins.

Also smart: separating the migration into "logical sharding" - where queries all still run against a single database, even though they are logically routed as if the database was already sharded - followed by "physical sharding" where the data is actually copied to and served from the new database servers.

Logical sharding was implemented using PostgreSQL views, which can accept both reads and writes:

CREATE VIEW table_shard1 AS SELECT * FROM table WHERE hash(shard_key) >= min_shard_range AND hash(shard_key) < max_shard_range)

The final piece of the puzzle was DBProxy, a custom PostgreSQL query proxy written in Go that can parse the query to an AST and use that to decide which shard the query should be sent to. Impressively it also has a scatter-gather mechanism, so select * from table can be sent to all shards at once and the results combined back together again.

Via Hacker News

Tags: databases, postgresql, scaling, sharding

The power of two random choices, visualized

2024-02-06T22:21:47+00:00

The power of two random choices, visualized

Grant Slatton shares a visualization illustrating “a favorite load balancing technique at AWS”: pick two nodes at random and then send the task to whichever of those two has the lowest current load score.

Why just two nodes? “The function grows logarithmically, so it’s a big jump from 1 to 2 and then tapers off *real* quick.”

Tags: aws, load-balancing, scaling

Database “sharding” came from UO?

2023-04-07T13:56:17+00:00

Database “sharding” came from UO?

Raph Koster coined the term “shard” back in 1996 in a design document proposing a way of scaling Ultima Online: “[...] we realized we would need to run multiple whole copies of Ultima Online for users to connect to, we needed to come up with a fiction for it. [...] the evil wizard Mondain had attempted to gain control over Sosaria by trapping its essence in a crystal. When the Stranger at the end of Ultima I defeated Mondain and shattered the crystal, the crystal shards each held a refracted copy of Sosaria.”

Via Hacker News

Tags: scaling, sharding

How Discord Stores Trillions of Messages

2023-03-08T19:07:08+00:00

How Discord Stores Trillions of Messages

This is a really interesting case-study. Discord migrated from MongoDB to Cassandra back in 2016 to handle billions of messages. Today they're handling trillions, and they completed a migration from Cassandra to Scylla, a Cassandra-like data store written in C++ (as opposed to Cassandra's Java) to help avoid problems like GC pauses. In addition to being a really good scaling war story this has some interesting details about their increased usage of Rust. As a fan of request coalescing (which I've previously referred to as dogpile prevention) I particularly liked this bit:

Our data services sit between the API and our ScyllaDB clusters. They contain roughly one gRPC endpoint per database query and intentionally contain no business logic. The big feature our data services provide is request coalescing. If multiple users are requesting the same row at the same time, we’ll only query the database once. The first user that makes a request causes a worker task to spin up in the service. Subsequent requests will check for the existence of that task and subscribe to it. That worker task will query the database and return the row to all subscribers.

Via lobste.rs

Tags: cassandra, dogpile, nosql, scaling, rust, discord

Scaling Mastodon: The Compendium

2022-11-29T05:46:03+00:00

Scaling Mastodon: The Compendium

Hazel Weakly’s collection of notes on scaling Mastodon, covering PostgreSQL, Sidekiq, Redis, object storage and more.

Via hachyderm.io/@nova

Tags: postgresql, redis, scaling, mastodon, sidekiq

Transactionally Staged Job Drains in Postgres

2021-12-18T01:34:13+00:00

Transactionally Staged Job Drains in Postgres

Any time I see people argue that relational databases shouldn’t be used to implement job queues I think of this post by Brandur from 2017. If you write to a queue before committing a transaction you run the risk of a queue consumer trying to read from the database before the new row becomes visible. If you write to the queue after the transaction there’s a risk an error might result in your message never being written. So: write to a relational staging table as part of the transaction, then have a separate process read from that table and write to the queue.

Tags: postgresql, queues, scaling, brandur-leach

Centrifuge: a reliable system for delivering billions of events per day

2021-12-06T01:41:54+00:00

Centrifuge: a reliable system for delivering billions of events per day

From 2018, a write-up from Segment explaining how they solved the problem of delivering webhooks from thousands of different producers to hundreds of potentially unreliable endpoints. They started with Kafka and ended up on a custom system written in Go against RDS MySQL that was specifically tuned to their write-heavy requirements.

Via @jkakar

Tags: scaling, webhooks, segment

How Discord Stores Billions of Messages

2021-08-24T21:31:36+00:00

How Discord Stores Billions of Messages

Fascinating article from 2017 describing how Discord migrated their primary message store to Cassandra (from MongoDB, but I could easily see them making the same decision if they had started with PostgreSQL or MySQL).

The trick with scalable NoSQL databases like Cassandra is that you need to have a very deep understanding of the kinds of queries you will need to answer - and Discord had exactly that.

In the article they talk about their desire to eventually migrate to Scylla (a compatible Cassandra alternative written in C++) - in the Hacker News comments they confirm that in 2021 they are using Scylla for a few things but they still have their core messages in Cassandra.

Via Hacker News

Tags: cassandra, nosql, scaling, discord

Notes on streaming large API responses

2021-06-25T16:26:49+00:00

I started a Twitter conversation last week about API endpoints that stream large amounts of data as an alternative to APIs that return 100 results at a time and require clients to paginate through all of the pages in order to retrieve all of the data:

Any unexpected downsides to offering streaming HTTP API endpoints that serve up eg 100,000 JSON objects in a go rather than asking users to paginate 100 at a time over 1,000 requests, assuming efficient implementation of that streaming endpoint?
— Simon Willison (@simonw) June 17, 2021

I got a ton of great replies. I tried to tie them together in a thread attached to the tweet, but I'm also going to synthesize them into some thoughts here.

Bulk exporting data

The more time I spend with APIs, especially with regard to my Datasette and Dogsheep projects, the more I realize that my favourite APIs are the ones that let you extract all of your data as quickly and easily as possible.

There are generally three ways an API might provide this:

Click an "export everything" button, then wait for a while for an email to show up with a link to a downloadable zip file. This isn't really an API, in particular since it's usually hard if not impossible to automate that initial "click", but it's still better than nothing. Google's Takeout is one notable implementation of this pattern.
Provide a JSON API which allows users to paginate through their data. This is a very common pattern, although it can run into difficulties: what happens if new data is added while you are paginating through the original data, for example? Some systems only allow access to the first N pages too, for performance reasons.
Providing a single HTTP endpoint you can hit that will return ALL of your data - potentially dozens or hundreds of MBs of it - in one go.

It's that last option that I'm interested in talking about today.

Efficiently streaming data

It used to be that most web engineers would quickly discount the idea of an API endpoint that streams out an unlimited number of rows. HTTP requests should be served as quickly as possible! Anything more than a couple of seconds spent processing a request is a red flag that something should be reconsidered.

Almost everything in the web stack is optimized for quickly serving small requests. But over the past decade the tide has turned somewhat: Node.js made async web servers commonplace, WebSockets taught us to handle long-running connections and in the Python world asyncio and ASGI provided a firm foundation for handling long-running requests using smaller amounts of RAM and CPU.

I've been experimenting in this area for a few years now.

Datasette has the ability to use ASGI trickery to stream all rows from a table (or filtered table) as CSV, potentially returning hundreds of MBs of data.

Django SQL Dashboard can export the full results of a SQL query as CSV or TSV, this time using Django's StreamingHttpResponse (which does tie up a full worker process, but that's OK if you restrict it to a controlled number of authenticated users).

VIAL implements streaming responses to offer an "export from the admin" feature. It also has an API-key-protected search API which can stream out all matching rows in JSON or GeoJSON.

Implementation notes

The key thing to watch out for when implementing this pattern is memory usage: if your server buffers 100MB+ of data any time it needs to serve an export request you're going to run into trouble.

Some export formats are friendlier for streaming than others. CSV and TSV are pretty easy to stream, as is newline-delimited JSON.

Regular JSON requires a bit more thought: you can output a [ character, then output each row in a stream with a comma suffix, then skip the comma for the last row and output a ]. Doing that requires peeking ahead (looping two at a time) to verify that you haven't yet reached the end.

Or... Martin De Wulf pointed out that you can output the first row, then output every other row with a preceeding comma - which avoids the whole "iterate two at a time" problem entirely.

The next challenge is efficiently looping through every database result without first pulling them all into memory.

PostgreSQL (and the psycopg2 Python module) offers server-side cursors, which means you can stream results through your code without loading them all at once. I use these in Django SQL Dashboard.

Server-side cursors make me nervous though, because they seem like they likely tie up resources in the database itself. So the other technique I would consider here is keyset pagination.

Keyset pagination works against any data that is ordered by a unique column - it works especially well against a primary key (or other indexed column). Each page of data is retrieved using a query something like this:

select * from items order by id limit 21

Note the limit 21 - if we are retrieving pages of 20 items we ask for 21, since then we can use the last returned item to tell if there is a next page or not.

Then for subsequent pages take the 20th id value and ask for things greater than that:

select * from items where id > 20 limit 21

Each of these queries is fast to respond (since it's against an ordered index) and uses a predictable, fixed amount of memory. Using keyset pagination we can loop through an abitrarily large table of data, streaming each page out one at a time, without exhausting any resources.

And since each query is small and fast, we don't need to worry about huge queries tying up database resources either.

What can go wrong?

I really like these patterns. They haven't bitten me yet, though I've not deployed them for anything truly huge scale. So I asked Twitter what kind of problems I should look for.

Based on the Twitter conversation, here are some of the challenges that this approach faces.

Challenge: restarting servers

If the stream takes a significantly long time to finish then rolling out updates becomes a problem. You don't want to interrupt a download but also don't want to wait forever for it to finish to spin down the server.
— Adam Lowry (@robotadam) June 17, 2021

This came up a few times, and is something I hadn't considered. If your deployment process involves restarting your servers (and it's hard to imagine one that doesn't) you need to take long-running connections into account when you do that. If there's a user half way through a 500MB stream you can either truncate their connection or wait for them to finish.

Challenge: how to return errors

If you're streaming a response, you start with an HTTP 200 code... but then what happens if an error occurs half-way through, potentially while paginating through the database?

You've already started sending the request, so you can't change the status code to a 500. Instead, you need to write some kind of error to the stream that's being produced.

If you're serving up a huge JSON document, you can at least make that JSON become invalid, which should indicate to your client that something went wrong.

Formats like CSV are harder. How do you let your user know that their CSV data is incomplete?

And what if someone's connection drops - are they definitely going to notice that they are missing something, or will they assume that the truncated file is all of the data?

Challenge: resumable downloads

If a user is paginating through your API, they get resumability for free: if something goes wrong they can start again at the last page that they fetched.

Resuming a single stream is a lot harder.

The HTTP range mechanism can be used to provide resumable downloads against large files, but it only works if you generate the entire file in advance.

There is a way to design APIs to support this, provided the data in the stream is in a predictable order (which it has to be if you're using keyset pagination, described above).

Have the endpoint that triggers the download take an optional ?since= parameter, like this:

GET /stream-everything?since=b24ou34
[
    {"id": "m442ecc", "name": "..."},
    {"id": "c663qo2", "name": "..."},
    {"id": "z434hh3", "name": "..."},
]

Here the b24ou34 is an identifier - it can be a deliberately opaque token, but it needs to be served up as part of the response.

If the user is disconnected for any reason, they can start back where they left off by passing in the last ID that they successfully retrieved:

GET /stream-everything?since=z434hh3

This still requires some level of intelligence from the client application, but it's a reasonably simple pattern both to implement on the server and as a client.

Easiest solution: generate and return from cloud storage

It seems the most robust way to implement this kind of API is the least technically exciting: spin off a background task that generates the large response and pushes it to cloud storage (S3 or GCS), then redirect the user to a signed URL to download the resulting file.

This is easy to scale, gives users complete files with content-length headers that they know they can download (and even resume-downloading, since range headers are supported by S3 and GCS). It also avoids any issues with server restarts caused by long connections.

This is how Mixpanel handle their export feature, and it's the solution Sean Coates came to when trying to find a workaround for the AWS Lambda/API Gate response size limit.

If your goal is to provide your users a robust, reliable bulk-export mechanism for their data, export to cloud storage is probably the way to go.

But streaming dynamic responses are a really neat trick, and I plan to keep exploring them!

Tags: apis, http, scaling, streaming, asgi, http-range-requests

Multi-region PostgreSQL on Fly

2021-06-17T18:39:06+00:00

Multi-region PostgreSQL on Fly

Really interesting piece of architectural design from Fly here. Fly can run your application (as a Docker container run using Firecracker) in multiple regions around the world, and they’ve now quietly added PostgreSQL multi-region support. The way it works is that all-but-one region can have a read-only replica, and requests sent to application servers can perform read-only queries against their local region’s replica. If a request needs to execute a SQL update your application code can return a “fly-replay: region=scl” HTTP header and the Fly CDN will transparently replay the request against the region containing the leader database. This also means you can implement tricks like setting a 10s expiring cookie every time the user performs a write, such that their requests in the next 10s will go straight to the leader and avoid them experiencing any replication lag that hasn’t caught up with their latest update.

Via @mrkurt

Tags: postgresql, replication, scaling, fly

Why I Built Litestream

2021-02-11T19:25:26+00:00

Why I Built Litestream

Litestream is a really exciting new piece of technology by Ben Johnson, who previously built BoltDB, the key-value store written in Go that is used by etcd. It adds replication to SQLite by running a process that converts the SQLite WAL log into a stream that can be saved to another folder or pushed to S3. The S3 option is particularly exciting—Ben estimates that keeping a full point-in-time recovery log of a high write SQLite database should cost in the order of a few dollars a month. I think this could greatly expand the set of use-cases for which SQLite is sensible choice.

Tags: replication, scaling, sqlite, ben-johnson