Simon Willison's Weblog: mysql

Announcing PlanetScale for Postgres

2025-07-01T18:16:12+00:00

PlanetScale formed in 2018 to build a commercial offering on top of the Vitess MySQL sharding open source project, which was originally released by YouTube in 2012. The PlanetScale founders were the co-creators and maintainers of Vitess.

Today PlanetScale are announcing a private preview of their new horizontally sharded PostgreSQL solution, due to "overwhelming" demand.

Notably, it doesn't use Vitess under the hood:

Vitess is one of PlanetScale’s greatest strengths [...] We have made explicit sharding accessible to hundreds of thousands of users and it is time to bring this power to Postgres. We will not however be using Vitess to do this.

Vitess’ achievements are enabled by leveraging MySQL’s strengths and engineering around its weaknesses. To achieve Vitess’ power for Postgres we are architecting from first principles.

Meanwhile, on June 10th Supabase announced that they had hired Vitess co-creator Sugu Sougoumarane to help them build "Multigres: Vitess for Postgres". Sugu said:

For some time, I've been considering a Vitess adaptation for Postgres, and this feeling had been gradually intensifying. The recent explosion in the popularity of Postgres has fueled this into a full-blown obsession. [...]

The project to address this problem must begin now, and I'm convinced that Vitess provides the most promising foundation.

I remember when MySQL was an order of magnitude more popular than PostgreSQL, and Heroku's decision to only offer PostgreSQL back in 2007 was a surprising move. The vibes have certainly shifted.

Tags: databases, mysql, postgresql, scaling, sharding, vitess

Announcing DuckDB 0.10.0

2024-02-13T17:57:17+00:00

Somewhat buried in this announcement: DuckDB has Fixed-Length Arrays now, along with array_cross_product(a1, a2), array_cosine_similarity(a1, a2) and array_inner_product(a1, a2) functions.

This means you can now use DuckDB to find related content (and other tricks) using vector embeddings!
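As a rough illustration of what array_inner_product and array_cosine_similarity compute, here is a pure-Python sketch of the same arithmetic. It does not use DuckDB itself, and the document names and three-dimensional vectors are made up for the example.

```python
import math

def inner_product(a, b):
    # Sum of element-wise products of two equal-length vectors.
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    # Inner product divided by the product of the two vector lengths.
    return inner_product(a, b) / (
        math.sqrt(inner_product(a, a)) * math.sqrt(inner_product(b, b))
    )

# Hypothetical embeddings; real ones would have hundreds of dimensions.
embeddings = {
    "sqlite-intro": [1.0, 0.0, 0.0],
    "sqlite-tips": [0.9, 0.1, 0.0],
    "cooking": [0.0, 0.0, 1.0],
}

def related(name, k=1):
    # Rank every other document by similarity to the named one.
    target = embeddings[name]
    scored = [
        (cosine_similarity(target, vec), other)
        for other, vec in embeddings.items()
        if other != name
    ]
    return [other for _, other in sorted(scored, reverse=True)[:k]]
```

Finding related content is then a matter of sorting by similarity and taking the top few results, which is the kind of query the new array functions make possible in SQL.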

Also notable:

DuckDB can now attach MySQL, Postgres, and SQLite databases in addition to databases stored in its own format. This allows data to be read into DuckDB and moved between these systems in a convenient manner, as attached databases are fully functional, appear just as regular tables, and can be updated in a safe, transactional manner.

Tags: databases, mysql, postgresql, sql, sqlite, duckdb, embeddings

Upgrading GitHub.com to MySQL 8.0

2023-12-10T20:36:23+00:00

I love a good zero-downtime upgrade story, and this is a fine example of the genre. GitHub spent a year upgrading MySQL from 5.7 to 8 across 1200+ hosts, covering 300+ TB that was serving 5.5 million queries per second. The key technique was extremely carefully managed replication, plus tricks like leaving enough 5.7 replicas available to handle a rollback should one be needed.

Via lobste.rs

Tags: github, mysql, ops, replication, zero-downtime

Scaling Datastores at Slack with Vitess

2020-12-01T21:30:26+00:00

Slack spent three years migrating 99% of their MySQL query load to run against Vitess, the open source MySQL sharding system originally built by YouTube. “Today, we serve 2.3 million QPS at peak. 2M of those queries are reads and 300K are writes. Our median query latency is 2 ms, and our p99 query latency is 11 ms.”

Via Maggie Zhou

Tags: mysql, scaling, sharding, youtube, slack, vitess

Generated Columns in SQLite

2020-01-24T04:20:35+00:00

SQLite 3.31.0 released today, and generated columns are the single most notable new feature. PostgreSQL 12 added these in October 2019, and MySQL has had them since 5.7 in October 2015. MySQL and SQLite both offer either “stored” or “virtual” generated columns, with virtual columns being calculated at runtime. PostgreSQL currently only supports stored columns.
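Here is a minimal sketch of both kinds of generated column using Python's sqlite3 module. It assumes the bundled SQLite library is 3.31.0 or newer; the items table and its columns are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE items (
        price_cents INTEGER,
        quantity INTEGER,
        -- VIRTUAL: computed each time the column is read
        total_virtual INTEGER GENERATED ALWAYS AS (price_cents * quantity) VIRTUAL,
        -- STORED: computed on write and saved in the row
        total_stored INTEGER GENERATED ALWAYS AS (price_cents * quantity) STORED
    )
    """
)
# Only the base columns are inserted; the generated ones follow from them.
conn.execute("INSERT INTO items (price_cents, quantity) VALUES (250, 4)")
row = conn.execute("SELECT total_virtual, total_stored FROM items").fetchone()
```

Both columns return the same value; the difference is whether the cost is paid on read or on write.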

Via SQLite Release 3.31.0

Tags: mysql, postgresql, sql, sqlite

db-to-sqlite 1.0 release

2019-07-01T01:35:40+00:00

I’ve released version 1.0 of my db-to-sqlite tool, which lets you create a SQLite database copy of any database supported by SQLAlchemy (I’ve tested it against MySQL and PostgreSQL). The tool has a bunch of new features: you can use --redact to redact specific columns, specify --table multiple times to copy a subset of tables, and the --all option now efficiently adds all foreign keys at the end of the import. The project now has unit tests which run against MySQL and PostgreSQL in Travis CI. Also included in the README: a shell one-liner for creating a local SQLite copy of a remote Heroku Postgres database based on extracting the connection string from a Heroku config environment variable.

Tags: mysql, postgresql, projects, sqlite, heroku, datasette

MySQL: How to get the top N rows for each group

2019-03-04T23:38:24+00:00

MySQL doesn’t support the row_number() window function that’s available in PostgreSQL (and recent SQLite), which means it can’t easily answer questions like “for each of these authors, give me the most recent three blog entries they have written” in a single query. Only it turns out it can, if you abuse MySQL session variables in a devious way. This isn’t a new feature: MySQL has had this for over a decade, and in my rough testing it works quickly even on tables with millions of rows.
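For contrast, this is roughly what the row_number() approach looks like where window functions are available. The sketch runs against SQLite (3.25 or newer) through Python's sqlite3 module rather than MySQL, it is not the session-variable trick from the linked article, and the entries table is made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE entries (author TEXT, title TEXT, created TEXT)")
conn.executemany(
    "INSERT INTO entries VALUES (?, ?, ?)",
    [
        ("alice", "a1", "2019-01-01"),
        ("alice", "a2", "2019-01-02"),
        ("alice", "a3", "2019-01-03"),
        ("bob", "b1", "2019-02-01"),
    ],
)
# Number each author's entries from newest to oldest, then keep the first two.
top_two = conn.execute(
    """
    SELECT author, title FROM (
        SELECT author, title,
               row_number() OVER (
                   PARTITION BY author ORDER BY created DESC
               ) AS rn
        FROM entries
    )
    WHERE rn <= 2
    ORDER BY author, rn
    """
).fetchall()
```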

Tags: mysql

Vitess

2019-02-14T05:35:41+00:00

I remember looking at Vitess when it was first released by YouTube in 2012. The idea of a proven horizontally scalable sharding mechanism for MySQL was exciting, but I was put off by the need for a custom Go or Java client library. Apparently that changed with Vitess 2.1 in April 2017, the first version to introduce a MySQL protocol compatible proxy which can be connected to by existing code written in any language. Vitess 3.0 came out last December so now the MySQL proxy layer is much more stable. Vitess is used in production by a bunch of other companies now (including Slack and Square) so it’s definitely worth a closer look.

Via Baron Schwartz

Tags: mysql, scaling, sharding, youtube, slack, vitess

Migrating Messenger storage to optimize performance

2018-06-27T15:05:36+00:00

Fascinating case-study of a truly gargantuan migration. Messenger has over a billion users, and Facebook successfully migrated its backend storage from HBase to their MyRocks database (a fork of MySQL with a storage engine built on their SSD-optimized RocksDB key/value library) without any user-visible downtime. They ended up using two migration paths: one for the 99.9% of regular accounts, and a separate path for extremely high volume accounts (businesses with very active chat bots or support systems).

Via Hacker News

Tags: facebook, migration, mysql, scaling, zero-downtime

MySQL High Availability at GitHub

2018-06-20T23:05:29+00:00

Cutting edge high availability case-study: GitHub are now using Consul, raft, their own custom load balancer and their own custom orchestrator replication management toolkit to achieve cross-datacenter failover for their MySQL master/replica clusters.

Tags: github, highavailability, mysql, scaling, shlominoach

github/gh-ost: Thoughts on Foreign Keys?

2018-06-19T16:12:42+00:00

The biggest challenge I’ve seen with foreign key constraints at scale (at least with MySQL) is how they conflict with online schema migrations using tools like pt-online-schema-change or GitHub’s gh-ost. This is a good explanation of the issue by Shlomi Noach, one of the gh-ost maintainers.

Tags: databases, mysql, scaling, sql, shlominoach

mycli

2018-06-11T19:08:00+00:00

Really neat auto-complete enabled MySQL terminal client, built using the excellent python-prompt-toolkit. Has a sister-project for PostgreSQL called pgcli.

Tags: mysql, postgresql, python

Showdown: MySQL 8 vs PostgreSQL 10

2018-05-23T17:02:40+00:00

MySQL 8 makes comparisons between PostgreSQL and MySQL far more interesting, as it closes some of the key feature gaps. Meanwhile the PostgreSQL replication story (long one of MySQL’s key advantages) has improved dramatically in recent versions. This article offers a useful overview of the current differences, including diving into some of the less obvious implementation details that differ between the two.

Via Hacker News

Tags: databases, mysql, postgresql

How to number rows in MySQL

2018-05-16T21:06:57+00:00

MySQL’s user variables can be used to add a “rank” or “row_number” column to a database query that shows the ranking of a row against a specific unique value. This means you can return the first N rows for any given column—for example, given a list of articles return just the first three tags for each article. I’ve recently found myself using this trick for a few different things—once you know it, chances to use it crop up surprisingly often.
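The logic behind the trick is a counter that resets whenever the group changes while walking rows in sorted order. This Python sketch emulates that idea outside the database; it is not the MySQL syntax from the linked article, and the tag data is invented.

```python
def number_rows(rows, group_key):
    """Yield (rank, row) pairs; rows must already be sorted by group."""
    current_group = object()  # sentinel that never equals a real key
    rank = 0
    for row in rows:
        key = group_key(row)
        if key != current_group:
            # New group: remember it and restart the counter.
            current_group = key
            rank = 0
        rank += 1
        yield rank, row

# Hypothetical (article, tag) rows, already ordered by article.
tags = [
    ("article-1", "python"),
    ("article-1", "mysql"),
    ("article-1", "sql"),
    ("article-1", "django"),
    ("article-2", "sqlite"),
]
first_three = [row for rank, row in number_rows(tags, lambda r: r[0]) if rank <= 3]
```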

Tags: mysql

What’s New in MySQL 8.0

2018-04-19T16:03:49+00:00

MySQL 8 has lots of exciting improvements: Window functions, SRS aware spatial types for GIS, utf8mb4 by default, a ton of JSON improvements and atomic DDL. I no longer feel at a significant disadvantage when I have to use MySQL in place of PostgreSQL.

Tags: mysql

What are the key insights in mastering SQL queries?

2013-11-08T09:16:00+00:00

My answer to What are the key insights in mastering SQL queries? on Quora

You may find this article useful (despite the list-o-matic name): 10 Easy Steps to a Complete Understanding of SQL - I've been using SQL for years but I found that some of the concepts explained there helped firm up my fundamental understanding of how to use it effectively.

Tags: databases, mysql, oracle, sql, quora

How was FriendFeed's schema less db faster than pure MySQL?

2013-10-30T16:27:00+00:00

My answer to How was FriendFeed's schema less db faster than pure MySQL? on Quora

The principal reason they switched to a schemaless DB was to work around the challenges of having to make schema changes in MySQL, which can lock the table and take hours if not days to complete on large tables.

The performance improvement shown in the graph is almost certainly because they almost entirely eliminated joins and complex queries when they switched to the new mechanism. This meant that all of their database traffic was now simple queries, which have much more predictable performance characteristics. MySQL (in fact all databases) are extremely fast at primary key lookups and index scans.

Tags: databases, friendfeed, mysql, nosql, quora, rdbms

Is there a maximum number of records one can fetch with a MySQL query?

2013-09-23T12:48:00+00:00

My answer to Is there a maximum number of records one can fetch with a MySQL query? on Quora

To my knowledge there is no upper limit - that's why good database libraries provide abstractions that let you iterate over large queries without loading the entire result set into memory at once.
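A minimal sketch of that kind of abstraction, using the DB-API fetchmany() method. It uses sqlite3 as a stand-in so it runs anywhere; with a MySQL driver you may also need an unbuffered, server-side cursor (for example MySQLdb's SSCursor) for the memory saving to be real. The iter_rows helper and the numbers table are invented for the example.

```python
import sqlite3

def iter_rows(cursor, batch_size=1000):
    # Pull rows from the cursor in fixed-size batches instead of fetchall().
    while True:
        batch = cursor.fetchmany(batch_size)
        if not batch:
            return
        yield from batch

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE numbers (n INTEGER)")
conn.executemany("INSERT INTO numbers VALUES (?)", ((i,) for i in range(10000)))

cursor = conn.execute("SELECT n FROM numbers ORDER BY n")
# Only one batch of 500 rows is held in Python memory at a time.
total = sum(n for (n,) in iter_rows(cursor, batch_size=500))
```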

Tags: mysql, quora

Should I store markdown instead of HTML into database fields?

2013-09-08T15:57:00+00:00

My answer to Should I store markdown instead of HTML into database fields? on Quora

You should store the exact format that was entered by the user.

- This lets you offer an "edit" feature without round-tripping between two formats.
- This makes debugging much easier
- Related: if you need to investigate a security bug, having the original input is essential.

If you're worried about performance, you can cache the transformed HTML somewhere - or even denormalize it to an extra table column. Just make sure you always have the original input available.
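One way to sketch the "extra table column" idea: keep the original input in one column and the rendered HTML in another. The render function here is a hypothetical stand-in for a real Markdown renderer, and the posts table is made up.

```python
import html
import sqlite3

def render(source):
    # Hypothetical stand-in for a real Markdown renderer: escape and wrap.
    return "<p>" + html.escape(source) + "</p>"

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE posts ("
    "id INTEGER PRIMARY KEY, "
    "body_source TEXT NOT NULL, "  # exactly what the user typed
    "body_html TEXT)"              # cached render, safe to regenerate
)

def save_post(source):
    # Store the original input and the cached HTML side by side.
    cur = conn.execute(
        "INSERT INTO posts (body_source, body_html) VALUES (?, ?)",
        (source, render(source)),
    )
    return cur.lastrowid

post_id = save_post("Hello *world* & friends")
source, cached = conn.execute(
    "SELECT body_source, body_html FROM posts WHERE id = ?", (post_id,)
).fetchone()
```

Because body_source is never modified, the cached column can be thrown away and rebuilt whenever the renderer changes.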

Tags: cms, databases, html, mysql, quora, markdown

How can I detect manual record insert from mysql cansole into my code in django .?

2012-12-28T14:31:00+00:00

My answer to How can I detect manual record insert from mysql cansole into my code in django .? on Quora

You can't. The best you can do is have Django periodically poll MySQL to see if anything has changed (maybe with a custom management command run by cron) - having a TIMESTAMP field on every table which will be automatically set to the current time when a record is inserted will help you spot things that have changed.
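A rough sketch of the polling idea, using sqlite3 as a stand-in for MySQL. The records table, its created column and the poll_new_rows helper are all invented; in MySQL the column would be a TIMESTAMP that defaults to the current time, whereas here the timestamps are written explicitly to keep the example deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, name TEXT, created TEXT)")

def poll_new_rows(conn, last_seen):
    # Return rows created after last_seen, plus the new high-water mark.
    rows = conn.execute(
        "SELECT id, name, created FROM records WHERE created > ? ORDER BY created",
        (last_seen,),
    ).fetchall()
    new_last_seen = rows[-1][2] if rows else last_seen
    return rows, new_last_seen

# Simulate two manual inserts with a poll after each one.
conn.execute("INSERT INTO records (name, created) VALUES ('first', '2012-12-28 10:00:00')")
rows, last_seen = poll_new_rows(conn, "2012-12-28 00:00:00")
conn.execute("INSERT INTO records (name, created) VALUES ('second', '2012-12-28 11:00:00')")
newer, last_seen = poll_new_rows(conn, last_seen)
```

A strict greater-than comparison can miss rows that share a timestamp with the last one seen, so a real poller would probably also track the highest id it has processed.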

Tags: django, mysql, web-development, quora

What tools and techniques are used for relational database version control (structure and data)?

2012-12-24T12:29:00+00:00

My answer to What tools and techniques are used for relational database version control (structure and data)? on Quora

The term you are looking for is database migrations (sometimes called database change scripts).

The basic concept is pretty straightforward: you set up a table in the database that records which change scripts have already been applied. When you need to make a change (adding a column, adding a table, denormalising some data for performance reasons, adding an index etc) you write a change script that applies the change - in raw SQL or in another programming language, depending on how your migration system is set up.

These change scripts (let's call them migrations from here) are numbered so they can be applied in the correct order. Then you run a command which checks for scripts that have not yet been applied and runs them in the correct order - then records that they have been run to the relevant database table.
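The scheme described so far fits in a few lines. This is a minimal sketch against SQLite, not how South or any particular tool does it; the applied_migrations table name and the three example migrations are made up.

```python
import sqlite3

# Numbered change scripts, applied in ascending order.
MIGRATIONS = [
    (1, "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)"),
    (2, "ALTER TABLE users ADD COLUMN email TEXT"),
    (3, "CREATE INDEX idx_users_email ON users (email)"),
]

def migrate(conn):
    # The bookkeeping table records which scripts have already run.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS applied_migrations (number INTEGER PRIMARY KEY)"
    )
    applied = {n for (n,) in conn.execute("SELECT number FROM applied_migrations")}
    ran = []
    for number, sql in sorted(MIGRATIONS):
        if number in applied:
            continue
        conn.execute(sql)
        conn.execute("INSERT INTO applied_migrations (number) VALUES (?)", (number,))
        conn.commit()
        ran.append(number)
    return ran

conn = sqlite3.connect(":memory:")
first_run = migrate(conn)   # applies everything
second_run = migrate(conn)  # nothing left to apply
```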

The setup I've described above is a pretty good start. Some systems let you have reversible migrations: each migration includes instructions for reversing its effect (removing the index that was added, moving data back to its old location) which lets you run a command to revert back to a previous database state. In practise this is a nice-to-have but not essential: many migrations are by their nature irreversible, but it can make development faster if you can easily try out and then revert a database structure change within your development environment.

Really clever migration systems can even introspect your database, figure out what has changed and attempt to generate the migration scripts automatically! South, the most popular migration system for Django, does this with surprisingly good results for many cases.

If you're interested in learning more, it's worth reading through the South documentation: http://south.readthedocs.org/en/...

Tags: databases, mysql, oracle, postgresql, sql, quora, rdbms

Any source available to download sample data (in 10+ GB) for testing?

2012-10-15T13:21:00+00:00

My answer to Any source available to download sample data (in 10+ GB) for testing? on Quora

Wikipedia has some pretty interesting dumps, in both XML and SQL format: http://meta.wikimedia.org/wiki/I...

It's pretty easy to generate 10GB of random data for testing though, which may be a better option as you could better approximate the kind of data your application will be dealing with. There's a neat Ruby module for doing this called Faker (itself a port of the Perl module of the same name): http://faker.rubyforge.org/ - and here's a Python port of the Ruby one: https://github.com/threadsafelab...

Tags: mysql, nosql, programming, web-development, quora, rdbms

What is the most efficient way to lookup an object (e.g. a user) by only a string?

2012-05-31T17:27:00+00:00

My answer to What is the most efficient way to lookup an object (e.g. a user) by only a string? on Quora

Yes - an index on a varchar column is exactly how you would implement this.
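A small sketch of that setup, using sqlite3 as a stand-in for MySQL. The users table, the index name and the choice of a unique index are assumptions for the example. EXPLAIN QUERY PLAN is SQLite's way of checking that a lookup uses the index; MySQL has its own EXPLAIN.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, username VARCHAR(50))")
# The index is what turns the string lookup into a fast search.
conn.execute("CREATE UNIQUE INDEX idx_users_username ON users (username)")
conn.executemany(
    "INSERT INTO users (username) VALUES (?)", [("simonw",), ("natalie",)]
)

user_id = conn.execute(
    "SELECT id FROM users WHERE username = ?", ("simonw",)
).fetchone()[0]

# The last column of each plan row describes how the query will run.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT id FROM users WHERE username = ?", ("simonw",)
).fetchall()
plan_detail = plan[0][-1]
```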

Tags: mysql, twitter, urls, quora

Python Django load MySQL database from csv files performance issue?

2012-02-12T11:02:00+00:00

My answer to Python Django load MySQL database from csv files performance issue? on Quora

Don't use the Django ORM for bulk imports - the performance overhead is pretty small for regular web page stuff, but it adds up if you are running millions of inserts.

Either write your import code to call MySQLdb directly or use the mysqlimport command line tool.
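As a sketch of the first option, this loads CSV rows with a single executemany() call inside one transaction. It uses sqlite3 so it runs without a MySQL server; MySQLdb exposes the same DB-API method on its cursors, but with %s placeholders instead of ?. The CSV data and the people table are made up.

```python
import csv
import io
import sqlite3

# Stand-in for a CSV file on disk.
CSV_DATA = "name,city\nalice,London\nbob,Paris\ncarol,Berlin\n"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, city TEXT)")

reader = csv.reader(io.StringIO(CSV_DATA))
next(reader)  # skip the header row

# One transaction and one executemany call, with no per-row ORM objects.
with conn:
    conn.executemany("INSERT INTO people (name, city) VALUES (?, ?)", reader)

count = conn.execute("SELECT count(*) FROM people").fetchone()[0]
```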

Tags: django, mysql, python, quora

How can you build a search engine for a website built in PHP/MySQL?

2012-02-11T18:39:00+00:00

My answer to How can you build a search engine for a website built in PHP/MySQL? on Quora

There are a bunch of options.

The easiest to implement is to build search on top of MySQL LIKE queries - performance will be pretty terrible (since every search will require a full table scan) but provided your tables only have a few thousand records on them and your site doesn't have to cope with more than a dozen or so hits a second it should work fine.
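A minimal sketch of the LIKE approach, written in Python against SQLite rather than PHP and MySQL. The pages table and the search helper are invented. The one detail worth getting right is escaping % and _ in user input so they are not treated as wildcards.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pages (title TEXT, body TEXT)")
conn.executemany(
    "INSERT INTO pages VALUES (?, ?)",
    [
        ("Home", "Welcome to the site"),
        ("Search tips", "Use quotes for 100% exact phrases"),
        ("About", "Contact us"),
    ],
)

def search(term):
    # Escape LIKE wildcards in the user's input, then wrap in % for "contains".
    escaped = term.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")
    pattern = "%" + escaped + "%"
    sql = (
        "SELECT title FROM pages "
        "WHERE title LIKE ? ESCAPE '\\' OR body LIKE ? ESCAPE '\\' "
        "ORDER BY title"
    )
    return [title for (title,) in conn.execute(sql, (pattern, pattern))]
```

A leading % in the pattern means no index can help, which is where the full table scan comes from.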

Next easiest: use MySQL's built-in full text indexing feature. It's not particularly good, and it requires you to use MyISAM tables (InnoDB is much more reliable, but doesn't support full text indexing) - but it will do the job. You could always keep your main site data in InnoDB and denormalise into a MyISAM table just for search - or you could use the trick Flickr used to use, which is to set up MySQL replication and run MyISAM on one of the slaves purely to support fulltext search.

Past that, you're looking at adding another component to the stack. Sphinx can integrate directly with MySQL and lets you run SQL-style queries against a proper full text index. Personally I'm a big fan of Solr, which runs as a separate (Java) server and requires you to index documents over HTTP. The great thing about Solr is that you can talk to it from any language that has an HTTP client library.

The last option is to go for a hosted solution. Google Custom Search is free, but not particularly flexible. IndexTank was a good option here but they were acquired by LinkedIn and are shutting down the hosted service - they've since open sourced their software and other companies such as http://www.searchify.com/ are starting to offer it as a hosted solution.

Tags: mysql, php, search-engines, sphinx-search, quora

What are XML feed best practices?

2012-01-31T14:29:00+00:00

My answer to What are XML feed best practices? on Quora

It sounds like you're pretty much screwed already, if you're dealing with companies that still think FTPing XML around is a sensible thing to do.

I would suggest focusing on what you can control. Assume that you will be passed bad data - weird formats, not-well-formed XML, duplicate entries etc. Your job is to handle all of this without going mad, and without your codebase turning into an unmanageable ball of mud.

So, start by figuring out your own core data model / abstraction. It will need to be VERY loose - as few required fields as possible, since you can be sure some of the feeds you are consuming will come in with stuff missing at some point or another.

Separate your feed consumers from the rest of your code. Having your own good internal Web API (which could consume JSON rather than XML since you control it) might be smart, since that will provide a solid separation and you can then write all of your feed consumers as separate pieces of code that just POST new items to the API.

Learn to love, respect and cherish unique identifiers... but be very wary of supposedly unique identifiers from external sources unless you can be absolutely sure they won't change on you. Create your own unique IDs at the first available opportunity, treat them properly within your own system and map external identifiers to them whenever you can.
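One way to sketch that mapping: key each external identifier by the feed it came from and hand out your own ID the first time you see it. The IdentifierMap class and the feed names are hypothetical, and a real system would keep the mapping in its datastore rather than in memory.

```python
import uuid

class IdentifierMap:
    """Maps (source, external_id) pairs to internally generated IDs."""

    def __init__(self):
        self._by_external = {}

    def internal_id(self, source, external_id):
        # Namespace by source: two feeds may reuse the same external ID.
        key = (source, str(external_id))
        if key not in self._by_external:
            self._by_external[key] = uuid.uuid4().hex
        return self._by_external[key]

ids = IdentifierMap()
first = ids.internal_id("feed-a", 42)
again = ids.internal_id("feed-a", "42")  # same item, ID arrived as a string
other = ids.internal_id("feed-b", 42)    # different feed, different item
```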

Write your consumers in a dynamic language with a solid interactive prompt, like Python or Ruby. This will make them much easier to write and debug. Use whatever you like for your core data storage / API.

Since your incoming data will come in all shapes and sizes, consider a document store such as MongoDB or Riak over a SQL database. Avoiding SQL migrations will help you out a lot.

Log and store absolutely everything. Ideally you should be able to re-execute every import that the system has ever executed, in order, to make debugging and fixing errors non-terrifying. That will almost certainly prove impossible, but it's a nice thought.

Good luck!

Tags: databases, mysql, php, xml, quora

Has anyone implemented a message queue with mysql and many workers?

2012-01-03T11:11:00+00:00

My answer to Has anyone implemented a message queue with mysql and many workers? on Quora

Flickr built their own message queue on top of MySQL: http://code.flickr.com/blog/2008...

Tags: flickr, mysql, quora

Is a relational database with many-to-many relationships difficult to develop into a web app?

2011-02-08T18:28:00+00:00

My answer to Is a relational database with many-to-many relationships difficult to develop into a web app? on Quora

Many to Many tables can be a bit of a pain to deal with using regular SQL, but a good ORM can abstract away any potential complexity almost entirely. I find using the Django ORM means I'm much less likely to shy away from a design that involves a many-to-many relationship because I know it won't increase the complexity of the application. I imagine the Rails ORM has the same effect.

Tags: databases, google, mysql, webapps, quora, rdbms

What are the pros and cons of switching from MySQL to one of the NoSQL databases?

2011-01-06T16:48:00+00:00

My answer to What are the pros and cons of switching from MySQL to one of the NoSQL databases? on Quora

Pro: If your own benchmarks tell you you need to switch to a specific NoSQL solution, you'll know exactly what the pro is.

Pro: If you're doing something that's hard to model in a regular schema you might find it easier to use a document database such as CouchDB or MongoDB.

Pro: Depending on how you approach the problem, you may find NoSQL makes schema modifications a LOT less painful than using a relational database.

Con: For many projects, losing out on the relational model is a big disadvantage. Most NoSQL solutions require you to design your data storage with your queries in mind. When you are building a product you often don't know what kind of queries you are going to run. This has bitten me with AppEngine projects in the past. See also Edmond Lau's answer to What did Marissa Mayer mean when she said that Orkut failed because of "infrastructure issues"?

For my money, the smart way of taking advantage of NoSQL is in conjunction with a relational engine. Use a regular database for your core data, but take advantage of Redis or MongoDB for things like counters, smart caches, rolling log storage etc. Polyglot persistence is the way to go.

Tags: mysql, nosql, quora

Using MySQL as a NoSQL - A story for exceeding 750,000 qps on a commodity server

2010-10-27T23:10:00+00:00

Very interesting approach: much of the speed difference between MySQL/InnoDB and memcached is due to the overhead involved in parsing and processing SQL, so the team at DeNA wrote their own MySQL plugin, HandlerSocket, which exposes a NoSQL-style network protocol for directly calling the low level MySQL storage engine APIs—resulting in a 7.5x performance increase.

Tags: mysql, nosql, scaling, recovered