<?xml version="1.0" encoding="utf-8"?>
<feed xml:lang="en-us" xmlns="http://www.w3.org/2005/Atom"><title>Simon Willison's Weblog: How it's trained</title><link href="http://simonwillison.net/" rel="alternate"/><link href="http://simonwillison.net/series/how-its-trained.atom" rel="self"/><id>http://simonwillison.net/</id><updated>2023-04-17T18:57:42+00:00</updated><author><name>Simon Willison</name></author><entry><title>What's in the RedPajama-Data-1T LLM training set</title><link href="https://simonwillison.net/2023/Apr/17/redpajama-data/#atom-series" rel="alternate"/><published>2023-04-17T18:57:42+00:00</published><updated>2023-04-17T18:57:42+00:00</updated><id>https://simonwillison.net/2023/Apr/17/redpajama-data/#atom-series</id><summary type="html">
    &lt;p&gt;&lt;a href="https://www.together.xyz/blog/redpajama"&gt;RedPajama&lt;/a&gt; is "a project to create leading open-source models, starts by reproducing LLaMA training dataset of over 1.2 trillion tokens". It's a collaboration between Together, Ontocord.ai, ETH DS3Lab, Stanford CRFM, Hazy Research, and MILA Québec AI Institute.&lt;/p&gt;
&lt;p&gt;They just announced their first release: &lt;a href="https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T"&gt;RedPajama-Data-1T&lt;/a&gt;, a 1.2 trillion token dataset modelled on the training data described in &lt;a href="https://www.arxiv-vanity.com/papers/2302.13971/"&gt;the original LLaMA paper&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The full dataset is 2.67TB, so I decided not to try and download the whole thing! Here's what I've figured out about it so far.&lt;/p&gt;
&lt;h4&gt;How to download it&lt;/h4&gt;
&lt;p&gt;The data is split across 2,084 different files. These are listed in a plain text file here:&lt;/p&gt;
&lt;p&gt;&lt;a href="https://data.together.xyz/redpajama-data-1T/v1.0.0/urls.txt"&gt;https://data.together.xyz/redpajama-data-1T/v1.0.0/urls.txt&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;The dataset card suggests you could download them all like this - assuming you have 2.67TB of disk space and bandwith to spare:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;wget -i https://data.together.xyz/redpajama-data-1T/v1.0.0/urls.txt
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I prompted GPT-4 a few times to write a quick Python script to run a &lt;code&gt;HEAD&lt;/code&gt; request against each URL in that file instead, in order to collect the &lt;code&gt;Content-Length&lt;/code&gt; and calculate the total size of the data. My script is at the bottom of this post.&lt;/p&gt;
&lt;p&gt;I then processed the size data into &lt;a href="https://gist.github.com/simonw/73d15c0dd1025d1196829740bacf4464"&gt;a format&lt;/a&gt; suitable for loading into &lt;a href="https://github.com/simonw/datasette-lite"&gt;Datasette Lite&lt;/a&gt;.&lt;/p&gt;
&lt;h4&gt;Exploring the size data&lt;/h4&gt;
&lt;p&gt;&lt;a href="https://lite.datasette.io/?json=https://gist.github.com/simonw/73d15c0dd1025d1196829740bacf4464#/data/raw?_facet=top_folder&amp;amp;_facet=top_folders&amp;amp;_sort_desc=size_gb"&gt;Here's a link&lt;/a&gt; to a Datasette Lite page showing all 2,084 files, sorted by size and with some useful facets.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2023/redpajama-sizes.jpg" alt="Datasette showing the rows, faceted by top_folder and top_folders. The largest file is wikipedia/wiki.jsonl at 111GB, then book/book.jsonl at 100GB, then stackexchange/stackexchange.jsonl at 74GB, then various filtered GitHub files" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;This is already revealing a lot about the data.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;top_folders&lt;/code&gt; facet inspired me to &lt;a href="https://lite.datasette.io/?install=datasette-copyable&amp;amp;json=https://gist.github.com/simonw/73d15c0dd1025d1196829740bacf4464#/data?sql=select%0A++top_folders%2C%0A++cast+%28sum%28size_gb%29+as+integer%29+as+total_gb%2C%0A++count%28*%29+as+num_files%0Afrom+raw%0Agroup+by+top_folders%0Aorder+by+sum%28size_gb%29+desc"&gt;run this SQL query&lt;/a&gt;:&lt;/p&gt;
&lt;div class="highlight highlight-source-sql"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;select&lt;/span&gt;
  top_folders,
  cast (&lt;span class="pl-c1"&gt;sum&lt;/span&gt;(size_gb) &lt;span class="pl-k"&gt;as&lt;/span&gt; &lt;span class="pl-k"&gt;integer&lt;/span&gt;) &lt;span class="pl-k"&gt;as&lt;/span&gt; total_gb,
  &lt;span class="pl-c1"&gt;count&lt;/span&gt;(&lt;span class="pl-k"&gt;*&lt;/span&gt;) &lt;span class="pl-k"&gt;as&lt;/span&gt; num_files
&lt;span class="pl-k"&gt;from&lt;/span&gt; raw
&lt;span class="pl-k"&gt;group by&lt;/span&gt; top_folders
&lt;span class="pl-k"&gt;order by&lt;/span&gt; &lt;span class="pl-c1"&gt;sum&lt;/span&gt;(size_gb) &lt;span class="pl-k"&gt;desc&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Here are the results:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;top_folders&lt;/th&gt;
&lt;th&gt;total_gb&lt;/th&gt;
&lt;th&gt;num_files&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;c4&lt;/td&gt;
&lt;td&gt;806&lt;/td&gt;
&lt;td&gt;1024&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;common_crawl/2023-06&lt;/td&gt;
&lt;td&gt;288&lt;/td&gt;
&lt;td&gt;175&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;common_crawl/2020-05&lt;/td&gt;
&lt;td&gt;286&lt;/td&gt;
&lt;td&gt;198&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;common_crawl/2021-04&lt;/td&gt;
&lt;td&gt;276&lt;/td&gt;
&lt;td&gt;176&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;common_crawl/2022-05&lt;/td&gt;
&lt;td&gt;251&lt;/td&gt;
&lt;td&gt;157&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;common_crawl/2019-30&lt;/td&gt;
&lt;td&gt;237&lt;/td&gt;
&lt;td&gt;153&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;github&lt;/td&gt;
&lt;td&gt;212&lt;/td&gt;
&lt;td&gt;98&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;wikipedia&lt;/td&gt;
&lt;td&gt;111&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;book&lt;/td&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;arxiv&lt;/td&gt;
&lt;td&gt;87&lt;/td&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;stackexchange&lt;/td&gt;
&lt;td&gt;74&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;There's a lot of Common Crawl data in there!&lt;/p&gt;
&lt;p&gt;The &lt;a href="https://www.together.xyz/blog/redpajama"&gt;RedPajama announcement&lt;/a&gt; says:&lt;/p&gt;
&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;CommonCrawl: Five dumps of CommonCrawl, processed using the CCNet pipeline, and filtered via several quality filters including a linear classifier that selects for Wikipedia-like pages.&lt;/li&gt;
&lt;li&gt;C4: Standard C4 dataset&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;It looks like they used &lt;a href="https://commoncrawl.org/"&gt;CommonCrawl&lt;/a&gt; from 5 different dates, from 2019-30 (30? That's not a valid month - looks like &lt;a href="https://hachyderm.io/@xek/110215763306634784"&gt;it's a week number&lt;/a&gt;) to 2022-05. I wonder if they de-duplicated content within those different crawls?&lt;/p&gt;
&lt;p&gt;&lt;a href="https://paperswithcode.com/dataset/c4"&gt;C4&lt;/a&gt; is "a colossal, cleaned version of Common Crawl's web crawl corpus" - so yet another copy of Common Crawl, cleaned in a different way.&lt;/p&gt;
&lt;p&gt;I downloaded the first 100MB of that 100GB &lt;code&gt;book.jsonl&lt;/code&gt; file - the first 300 rows in it are all full-text books from Project Gutenberg, starting with &lt;a href="https://www.gutenberg.org/ebooks/10"&gt;The Bible Both Testaments King James Version&lt;/a&gt; from 1611.&lt;/p&gt;
&lt;p&gt;The data all appears to be in JSONL format - newline-delimited JSON. Different files I looked at had different shapes, though a common pattern was a &lt;code&gt;"text"&lt;/code&gt; key containing the text and a &lt;code&gt;"meta"&lt;/code&gt; key containing a dictionary of metadata.&lt;/p&gt;
&lt;p&gt;For example, the first line of &lt;code&gt;books.jsonl&lt;/code&gt; looks like this (after pretty-printing using &lt;code&gt;jq&lt;/code&gt;):&lt;/p&gt;
&lt;div class="highlight highlight-source-json"&gt;&lt;pre&gt;{
  &lt;span class="pl-ent"&gt;"meta"&lt;/span&gt;: {
    &lt;span class="pl-ent"&gt;"short_book_title"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;The Bible Both Testaments King James Version&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"publication_date"&lt;/span&gt;: &lt;span class="pl-c1"&gt;1611&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"url"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;http://www.gutenberg.org/ebooks/10&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;
  },
  &lt;span class="pl-ent"&gt;"text"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;span class="pl-cce"&gt;\n\n&lt;/span&gt;The Old Testament of the King James Version of the Bible&lt;span class="pl-cce"&gt;\n&lt;/span&gt;...&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;
}&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;There are more details on the composition of the dataset in &lt;a href="https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T#dataset-creation"&gt;the dataset card&lt;/a&gt;.&lt;/p&gt;
&lt;h4&gt;My Python script&lt;/h4&gt;
&lt;p&gt;I wrote a quick Python script to do the next best thing: run a &lt;code&gt;HEAD&lt;/code&gt; request against each URL to figure out the total size of the data.&lt;/p&gt;
&lt;p&gt;I &lt;a href="https://gist.github.com/simonw/38246d2f230bd1d5cf8b4907e8871ed1"&gt;prompted GPT-4 a few times&lt;/a&gt;, and came up with this:&lt;/p&gt;
&lt;pre&gt;&lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-s1"&gt;httpx&lt;/span&gt;
&lt;span class="pl-k"&gt;from&lt;/span&gt; &lt;span class="pl-s1"&gt;tqdm&lt;/span&gt; &lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-s1"&gt;tqdm&lt;/span&gt;

&lt;span class="pl-k"&gt;async&lt;/span&gt; &lt;span class="pl-k"&gt;def&lt;/span&gt; &lt;span class="pl-en"&gt;get_sizes&lt;/span&gt;(&lt;span class="pl-s1"&gt;urls&lt;/span&gt;):
    &lt;span class="pl-s1"&gt;sizes&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; {}
    &lt;span class="pl-k"&gt;async&lt;/span&gt; &lt;span class="pl-k"&gt;def&lt;/span&gt; &lt;span class="pl-en"&gt;fetch_size&lt;/span&gt;(&lt;span class="pl-s1"&gt;url&lt;/span&gt;):
        &lt;span class="pl-k"&gt;try&lt;/span&gt;:
            &lt;span class="pl-s1"&gt;response&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-k"&gt;await&lt;/span&gt; &lt;span class="pl-s1"&gt;client&lt;/span&gt;.&lt;span class="pl-en"&gt;head&lt;/span&gt;(&lt;span class="pl-s1"&gt;url&lt;/span&gt;)
            &lt;span class="pl-s1"&gt;content_length&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;response&lt;/span&gt;.&lt;span class="pl-s1"&gt;headers&lt;/span&gt;.&lt;span class="pl-en"&gt;get&lt;/span&gt;(&lt;span class="pl-s"&gt;'Content-Length'&lt;/span&gt;)
            &lt;span class="pl-k"&gt;if&lt;/span&gt; &lt;span class="pl-s1"&gt;content_length&lt;/span&gt; &lt;span class="pl-c1"&gt;is&lt;/span&gt; &lt;span class="pl-c1"&gt;not&lt;/span&gt; &lt;span class="pl-c1"&gt;None&lt;/span&gt;:
                &lt;span class="pl-k"&gt;return&lt;/span&gt; &lt;span class="pl-s1"&gt;url&lt;/span&gt;, &lt;span class="pl-en"&gt;int&lt;/span&gt;(&lt;span class="pl-s1"&gt;content_length&lt;/span&gt;)
        &lt;span class="pl-k"&gt;except&lt;/span&gt; &lt;span class="pl-v"&gt;Exception&lt;/span&gt; &lt;span class="pl-k"&gt;as&lt;/span&gt; &lt;span class="pl-s1"&gt;e&lt;/span&gt;:
            &lt;span class="pl-en"&gt;print&lt;/span&gt;(&lt;span class="pl-s"&gt;f"Error while processing URL '&lt;span class="pl-s1"&gt;&lt;span class="pl-kos"&gt;{&lt;/span&gt;&lt;span class="pl-s1"&gt;url&lt;/span&gt;&lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;/span&gt;': &lt;span class="pl-s1"&gt;&lt;span class="pl-kos"&gt;{&lt;/span&gt;&lt;span class="pl-s1"&gt;e&lt;/span&gt;&lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;/span&gt;"&lt;/span&gt;)
        &lt;span class="pl-k"&gt;return&lt;/span&gt; &lt;span class="pl-s1"&gt;url&lt;/span&gt;, &lt;span class="pl-c1"&gt;0&lt;/span&gt;
    &lt;span class="pl-k"&gt;async&lt;/span&gt; &lt;span class="pl-k"&gt;with&lt;/span&gt; &lt;span class="pl-s1"&gt;httpx&lt;/span&gt;.&lt;span class="pl-v"&gt;AsyncClient&lt;/span&gt;() &lt;span class="pl-k"&gt;as&lt;/span&gt; &lt;span class="pl-s1"&gt;client&lt;/span&gt;:
        &lt;span class="pl-c"&gt;# Create a progress bar using tqdm&lt;/span&gt;
        &lt;span class="pl-k"&gt;with&lt;/span&gt; &lt;span class="pl-en"&gt;tqdm&lt;/span&gt;(&lt;span class="pl-s1"&gt;total&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-en"&gt;len&lt;/span&gt;(&lt;span class="pl-s1"&gt;urls&lt;/span&gt;), &lt;span class="pl-s1"&gt;desc&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-s"&gt;"Fetching sizes"&lt;/span&gt;, &lt;span class="pl-s1"&gt;unit&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-s"&gt;"url"&lt;/span&gt;) &lt;span class="pl-k"&gt;as&lt;/span&gt; &lt;span class="pl-s1"&gt;pbar&lt;/span&gt;:
            &lt;span class="pl-c"&gt;# Use asyncio.as_completed to process results as they arrive&lt;/span&gt;
            &lt;span class="pl-s1"&gt;coros&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; [&lt;span class="pl-en"&gt;fetch_size&lt;/span&gt;(&lt;span class="pl-s1"&gt;url&lt;/span&gt;) &lt;span class="pl-k"&gt;for&lt;/span&gt; &lt;span class="pl-s1"&gt;url&lt;/span&gt; &lt;span class="pl-c1"&gt;in&lt;/span&gt; &lt;span class="pl-s1"&gt;urls&lt;/span&gt;]
            &lt;span class="pl-k"&gt;for&lt;/span&gt; &lt;span class="pl-s1"&gt;coro&lt;/span&gt; &lt;span class="pl-c1"&gt;in&lt;/span&gt; &lt;span class="pl-s1"&gt;asyncio&lt;/span&gt;.&lt;span class="pl-en"&gt;as_completed&lt;/span&gt;(&lt;span class="pl-s1"&gt;coros&lt;/span&gt;):
                &lt;span class="pl-s1"&gt;url&lt;/span&gt;, &lt;span class="pl-s1"&gt;size&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-k"&gt;await&lt;/span&gt; &lt;span class="pl-s1"&gt;coro&lt;/span&gt;
                &lt;span class="pl-s1"&gt;sizes&lt;/span&gt;[&lt;span class="pl-s1"&gt;url&lt;/span&gt;] &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;size&lt;/span&gt;
                &lt;span class="pl-c"&gt;# Update the progress bar&lt;/span&gt;
                &lt;span class="pl-s1"&gt;pbar&lt;/span&gt;.&lt;span class="pl-en"&gt;update&lt;/span&gt;(&lt;span class="pl-c1"&gt;1&lt;/span&gt;)
    &lt;span class="pl-k"&gt;return&lt;/span&gt; &lt;span class="pl-s1"&gt;sizes&lt;/span&gt;&lt;/pre&gt;
&lt;p&gt;I pasted this into &lt;code&gt;python3 -m asyncio&lt;/code&gt; - the &lt;code&gt;-m asyncio&lt;/code&gt; flag ensures the &lt;code&gt;await&lt;/code&gt; statement can be used in the interactive interpreter - and ran the following:&lt;/p&gt;
&lt;div class="highlight highlight-text-python-console"&gt;&lt;pre&gt;&amp;gt;&amp;gt;&amp;gt; urls &lt;span class="pl-k"&gt;=&lt;/span&gt; httpx.get(&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;https://data.together.xyz/redpajama-data-1T/v1.0.0/urls.txt&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;).text.splitlines()
&amp;gt;&amp;gt;&amp;gt; sizes &lt;span class="pl-k"&gt;=&lt;/span&gt; &lt;span class="pl-k"&gt;await&lt;/span&gt; get_sizes(urls)
Fetching sizes: 100%|██████████████████████████████████████| 2084/2084 [00:08&amp;lt;00:00, 256.60url/s]
&amp;gt;&amp;gt;&amp;gt; &lt;span class="pl-c1"&gt;sum&lt;/span&gt;(sizes.values())
2936454998167&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Then I added the following to turn the data into something that would work with Datasette Lite:&lt;/p&gt;
&lt;pre&gt;&lt;span class="pl-s1"&gt;output&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; []
&lt;span class="pl-k"&gt;for&lt;/span&gt; &lt;span class="pl-s1"&gt;url&lt;/span&gt;, &lt;span class="pl-s1"&gt;size&lt;/span&gt; &lt;span class="pl-c1"&gt;in&lt;/span&gt; &lt;span class="pl-s1"&gt;sizes&lt;/span&gt;.&lt;span class="pl-en"&gt;items&lt;/span&gt;():
    &lt;span class="pl-s1"&gt;path&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;url&lt;/span&gt;.&lt;span class="pl-en"&gt;split&lt;/span&gt;(&lt;span class="pl-s"&gt;'/redpajama-data-1T/v1.0.0/'&lt;/span&gt;)[&lt;span class="pl-c1"&gt;1&lt;/span&gt;]
    &lt;span class="pl-s1"&gt;output&lt;/span&gt;.&lt;span class="pl-en"&gt;append&lt;/span&gt;({
        &lt;span class="pl-s"&gt;"url"&lt;/span&gt;: &lt;span class="pl-s1"&gt;url&lt;/span&gt;,
        &lt;span class="pl-s"&gt;"size"&lt;/span&gt;: &lt;span class="pl-s1"&gt;size&lt;/span&gt;,
        &lt;span class="pl-s"&gt;"size_mb"&lt;/span&gt;: &lt;span class="pl-s1"&gt;size&lt;/span&gt; &lt;span class="pl-c1"&gt;/&lt;/span&gt; &lt;span class="pl-c1"&gt;1024&lt;/span&gt; &lt;span class="pl-c1"&gt;/&lt;/span&gt; &lt;span class="pl-c1"&gt;1024&lt;/span&gt;,
        &lt;span class="pl-s"&gt;"size_gb"&lt;/span&gt;: &lt;span class="pl-s1"&gt;size&lt;/span&gt; &lt;span class="pl-c1"&gt;/&lt;/span&gt; &lt;span class="pl-c1"&gt;1024&lt;/span&gt; &lt;span class="pl-c1"&gt;/&lt;/span&gt; &lt;span class="pl-c1"&gt;1024&lt;/span&gt; &lt;span class="pl-c1"&gt;/&lt;/span&gt; &lt;span class="pl-c1"&gt;1024&lt;/span&gt;,
        &lt;span class="pl-s"&gt;"path"&lt;/span&gt;: &lt;span class="pl-s1"&gt;path&lt;/span&gt;,
        &lt;span class="pl-s"&gt;"top_folder"&lt;/span&gt;: &lt;span class="pl-s1"&gt;path&lt;/span&gt;.&lt;span class="pl-en"&gt;split&lt;/span&gt;(&lt;span class="pl-s"&gt;"/"&lt;/span&gt;)[&lt;span class="pl-c1"&gt;0&lt;/span&gt;],
        &lt;span class="pl-s"&gt;"top_folders"&lt;/span&gt;: &lt;span class="pl-s1"&gt;path&lt;/span&gt;.&lt;span class="pl-en"&gt;rsplit&lt;/span&gt;(&lt;span class="pl-s"&gt;"/"&lt;/span&gt;, &lt;span class="pl-c1"&gt;1&lt;/span&gt;)[&lt;span class="pl-c1"&gt;0&lt;/span&gt;],
    })
&lt;span class="pl-en"&gt;open&lt;/span&gt;(&lt;span class="pl-s"&gt;"/tmp/sizes.json"&lt;/span&gt;, &lt;span class="pl-s"&gt;"w"&lt;/span&gt;).&lt;span class="pl-en"&gt;write&lt;/span&gt;(&lt;span class="pl-s1"&gt;json&lt;/span&gt;.&lt;span class="pl-en"&gt;dumps&lt;/span&gt;(&lt;span class="pl-s1"&gt;output&lt;/span&gt;, &lt;span class="pl-s1"&gt;indent&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-c1"&gt;2&lt;/span&gt;))&lt;/pre&gt;
&lt;p&gt;I pasted the result &lt;a href="https://gist.github.com/simonw/73d15c0dd1025d1196829740bacf4464"&gt;into a Gist&lt;/a&gt;.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette-lite"&gt;datasette-lite&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llama"&gt;llama&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-programming"&gt;ai-assisted-programming&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/redpajama"&gt;redpajama&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/training-data"&gt;training-data&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="ai"/><category term="datasette"/><category term="datasette-lite"/><category term="generative-ai"/><category term="llama"/><category term="local-llms"/><category term="llms"/><category term="ai-assisted-programming"/><category term="redpajama"/><category term="training-data"/></entry><entry><title>Exploring MusicCaps, the evaluation data released to accompany Google's MusicLM text-to-music model</title><link href="https://simonwillison.net/2023/Jan/27/exploring-musiccaps/#atom-series" rel="alternate"/><published>2023-01-27T21:34:56+00:00</published><updated>2023-01-27T21:34:56+00:00</updated><id>https://simonwillison.net/2023/Jan/27/exploring-musiccaps/#atom-series</id><summary type="html">
    &lt;p&gt;Google Research just released &lt;a href="https://google-research.github.io/seanet/musiclm/examples/"&gt;MusicLM: Generating Music From Text&lt;/a&gt;. It's a new generative AI model that takes a descriptive prompt and produces a "high-fidelity" music track. Here's &lt;a href="https://arxiv.org/abs/2301.11325"&gt;the paper&lt;/a&gt; (and &lt;a href="https://www.arxiv-vanity.com/papers/2301.11325/"&gt;a more readable version&lt;/a&gt; using arXiv Vanity).&lt;/p&gt;
&lt;p&gt;There's no interactive demo yet, but there are dozens of examples &lt;a href="https://google-research.github.io/seanet/musiclm/examples/"&gt;on the site&lt;/a&gt;. The prompts are things like this:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;A fusion of reggaeton and electronic dance music, with a spacey, otherworldly sound. Induces the experience of being lost in space, and the music would be designed to evoke a sense of wonder and awe, while being danceable.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Included are examples of opera, jazz, peruvian punk, berlin 90s house and many more. It's a really fun page to explore.&lt;/p&gt;
&lt;h4&gt;The MusicCaps dataset&lt;/h4&gt;
&lt;p&gt;The paper abstract includes this line:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;To support future research, we publicly release MusicCaps, a dataset composed of 5.5k music-text pairs, with rich text descriptions provided by human experts.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Later in the paper they provide a little more detail:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;To evaluate MusicLM, we prepare MusicCaps, a high-quality music caption dataset, which we make publicly available. This dataset includes 5.5k music clips from AudioSet (Gemmeke et al., 2017), each paired with corresponding text descriptions in English, written by ten professional musicians. For each 10-second music clip, MusicCaps provides: (1) a free-text caption consisting of four sentences on average, describing the music and (2) a list of music aspects, describing genre, mood, tempo, singer voices, instrumentation, dissonances, rhythm, etc. On average, the dataset includes eleven aspects per clip.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Here's where they published &lt;a href="https://www.kaggle.com/datasets/googleai/musiccaps"&gt;that data on Kaggle&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I &lt;em&gt;love&lt;/em&gt; &lt;a href="https://simonwillison.net/series/how-its-trained/"&gt;digging into these training datasets&lt;/a&gt; - and this one is pretty tiny. I decided to take a look and see what I could learn.&lt;/p&gt;
&lt;p&gt;I built &lt;strong&gt;&lt;a href="https://musiccaps.datasette.io/"&gt;musiccaps.datasette.io&lt;/a&gt;&lt;/strong&gt; to support exploring and searching the data.&lt;/p&gt;
&lt;p&gt;&lt;img 
    alt="A Datasette table. Title is &amp;quot;musiccaps_details (view)&amp;quot;,
    subheading is &amp;quot;3 rows where search matches &amp;quot;ballad&amp;quot; and musiccaps_names contains &amp;quot;Electric piano&amp;quot;&amp;quot;
    The first result is a YouTube video from the channel &amp;quot;Marks Piano - PGN Piano&amp;quot;, with a musiccaps_caption of &amp;quot;This audio contains someone playing a ballad on a piano then a male voice starts talking and the piano-playing ends. This song may be playing at home practicing piano.&amp;quot;"
    src="https://static.simonwillison.net/static/2023/musiccaps-datasette-2.jpg" style="max-width: 100%" /&gt;&lt;/p&gt;
&lt;p&gt;The dataset itself turns out to not have any audio clips in it at all - instead, each row of the data includes a YouTube video ID and a start and end time for a clip within it.&lt;/p&gt;
&lt;p&gt;This is similar to &lt;a href="https://simonwillison.net/2022/Sep/5/laion-aesthetics-weeknotes/"&gt;how the LAION dataset used for Stable Diffusion works&lt;/a&gt; - that dataset contains URLs to images, but not the images themselves.&lt;/p&gt;
&lt;p&gt;A YouTube video ID is just a string like &lt;code&gt;zYM0gtd_PRo&lt;/code&gt; - the dataset has no further information.&lt;/p&gt;
&lt;p&gt;But... information on those videos is available via the YouTube API. So I fetched back full details of all 5,500 videos and included them in the database as well.&lt;/p&gt;
&lt;p&gt;This means we can start answering all sorts of interesting questions:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://musiccaps.datasette.io/musiccaps/musiccaps_details?_facet=youtube_channel&amp;amp;_fts_table=musiccaps_fts&amp;amp;_fts_pk=musiccaps_rowid"&gt;Which YouTube channels are most represented in the dataset?&lt;/a&gt; - the answer is &lt;a href="https://www.youtube.com/channel/UCXT6G8O3its4RLZ4kzjMr_w"&gt;ProGuitarShopDemos&lt;/a&gt; with 12, &lt;a href="https://www.youtube.com/@berlinphil"&gt;Berliner Philharmoniker&lt;/a&gt; with 8, &lt;a href="https://www.youtube.com/@PrymaxeDemos"&gt;Prymaxe&lt;/a&gt; with 8 - generally it's a pretty wide distribution though, with 5,163 channels represented.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://musiccaps.datasette.io/musiccaps/musiccaps_details?youtube_id__isnull=1"&gt;How many videos are no longer on YouTube?&lt;/a&gt; - 18 of the videos referenced in the dataset no longer exist according to the YouTube API.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://musiccaps.datasette.io/musiccaps/musiccaps_details?_facet_array=musiccaps_names&amp;amp;_facet_array=musiccaps_aspects&amp;amp;_fts_table=musiccaps_fts&amp;amp;_facet=musiccaps_author&amp;amp;_fts_pk=musiccaps_rowid&amp;amp;youtube_description__contains=No+copyright+infringement+intended"&gt;How many videos include the magic YouTube description phrase "No copyright infringement intended"?&lt;/a&gt; - 31 of them do. See &lt;a href="https://waxy.org/2011/12/no_copyright_intended/"&gt;No Copyright Intended&lt;/a&gt; by Andy Baio for background on this superstition.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The search feature is configured to run against the human-written descriptions provided as the key feature of the MusicCaps dataset - try some searches like         &lt;a href="/musiccaps/musiccaps_details?_search=opera&amp;amp;_facet_array=musiccaps_names&amp;amp;_facet_array=musiccaps_aspects&amp;amp;_facet=musiccaps_author&amp;amp;_fts_table=musiccaps_fts&amp;amp;_fts_pk=musiccaps_rowid"&gt;opera&lt;/a&gt;,
&lt;a href="/musiccaps/musiccaps_details?_search=ballad&amp;amp;_facet_array=musiccaps_names&amp;amp;_facet_array=musiccaps_aspects&amp;amp;_facet=musiccaps_author&amp;amp;_fts_table=musiccaps_fts&amp;amp;_fts_pk=musiccaps_rowid"&gt;ballad&lt;/a&gt;,
&lt;a href="/musiccaps/musiccaps_details?_search=guitar&amp;amp;_facet_array=musiccaps_names&amp;amp;_facet_array=musiccaps_aspects&amp;amp;_facet=musiccaps_author&amp;amp;_fts_table=musiccaps_fts&amp;amp;_fts_pk=musiccaps_rowid"&gt;guitar&lt;/a&gt;, or
&lt;a href="/musiccaps/musiccaps_details?_search=whimsical&amp;amp;_facet_array=musiccaps_names&amp;amp;_facet_array=musiccaps_aspects&amp;amp;_facet=musiccaps_author&amp;amp;_fts_table=musiccaps_fts&amp;amp;_fts_pk=musiccaps_rowid"&gt;whimsical&lt;/a&gt;.&lt;/p&gt;
&lt;h4&gt;How I built this&lt;/h4&gt;
&lt;p&gt;The dataset is &lt;a href="https://www.kaggle.com/datasets/googleai/musiccaps"&gt;available on Kaggle&lt;/a&gt;, but since it's licensed CC BY-SA 4.0 I grabbed a copy of it and dropped the CSV &lt;a href="https://github.com/simonw/musiccaps-datasette/blob/main/data/musiccaps-public.csv"&gt;into this GitHub repo&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;You can explore that using &lt;a href="https://simonwillison.net/2022/May/4/datasette-lite/"&gt;Datasette Lite&lt;/a&gt; at this URL:&lt;/p&gt;
&lt;p&gt;&lt;a href="https://lite.datasette.io/?csv=https://github.com/simonw/musiccaps-datasette/blob/main/data/musiccaps-public.csv"&gt;https://lite.datasette.io/?csv=https://github.com/simonw/musiccaps-datasette/blob/main/data/musiccaps-public.csv&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;This was a decent starting point, but it felt really important to be able to click "play" and listen to that audio.&lt;/p&gt;
&lt;p&gt;I built a new Datasette plugin for this: &lt;a href="datasette.io/plugins/datasette-youtube-embed"&gt;datasette-youtube-embed&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The plugin works by looking out for YouTube URLs of the following format:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;https://www.youtube.com/watch?v=-U16iKiXGuY&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;https://www.youtube.com/watch?v=-U16iKiXGuY&amp;amp;start=30&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;https://www.youtube.com/watch?v=-U16iKiXGuY&amp;amp;start=30&amp;amp;end=40&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If it finds one of those, it replaces it with a YouTube embed that passes through the &lt;code&gt;start&lt;/code&gt; and &lt;code&gt;end&lt;/code&gt; parameters, if present.&lt;/p&gt;
&lt;p&gt;This means it can play the exact clip that was referenced by the MusicCaps dataset.&lt;/p&gt;
&lt;p&gt;My first attempt at this plugin used regular YouTube embeds, but Datasette defaults to returning up to 100 rows on a page, and 100 YouTube iframe embeds is pretty heavy!&lt;/p&gt;
&lt;p&gt;Instead, I switched the plugin to use the &lt;a href="https://github.com/paulirish/lite-youtube-embed"&gt;Lite YouTube Embed&lt;/a&gt; Web Component by Paul Irish.&lt;/p&gt;
&lt;p&gt;Sadly this means the plugin doesn't work with Datasette Lite, so I switched to deploying a full Datasette instance to Vercel instead.&lt;/p&gt;
&lt;h4&gt;Adding video details from the YouTube API&lt;/h4&gt;
&lt;p&gt;I wanted to add more context about each of the videos. The YouTube Data API has a &lt;a href="https://developers.google.com/youtube/v3/docs/videos/list"&gt;videos&lt;/a&gt; endpoint which accepts a comma-separated list of video IDs (up to 50 at a time, not mentioned in the documentation) and returns details about each video.&lt;/p&gt;
&lt;p&gt;After some experimentation, this turned out to be the recipe that gave me the key data I wanted:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;https://www.googleapis.com/youtube/v3/videos
  ?part=snippet,statistics
  &amp;amp;id=video_id1,video_id2,video_id3
  &amp;amp;key=youtube-api-key
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I built &lt;a href="https://github.com/simonw/musiccaps-datasette/blob/main/MusicCaps%20to%20Datasette.ipynb"&gt;a Jupyter notebook&lt;/a&gt; that batched up all of the IDs into groups of 50, fetched the data and wrote it into my SQLite database using &lt;a href="https://sqlite-utils.datasette.io/"&gt;sqlite-utils&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;audioset_positive_labels&lt;/code&gt; column in the initial CSV had values like &lt;code&gt;/m/0140xf,/m/02cjck,/m/04rlf&lt;/code&gt; - these turned out to be match IDs in &lt;a href="https://github.com/tensorflow/models/blob/master/research/audioset/yamnet/yamnet_class_map.csv"&gt;this CSV file&lt;/a&gt; of AudioSet categories in the &lt;a href="https://github.com/tensorflow/models/tree/master/research/audioset"&gt;tensorflow/models&lt;/a&gt; GitHub repo, so I fetched and transformed those as well.&lt;/p&gt;
&lt;p&gt;I had to do a little bit of extra cleanup to get everything working how I wanted. The final result was two tables, with the following schema:&lt;/p&gt;
&lt;div class="highlight highlight-source-sql"&gt;&lt;pre&gt;CREATE TABLE [musiccaps] (
   [ytid] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt; &lt;span class="pl-k"&gt;PRIMARY KEY&lt;/span&gt;,
   [url] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt;,
   [caption] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt;,
   [aspect_list] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt;,
   [audioset_names] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt;,
   [author_id] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt;,
   [start_s] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt;,
   [end_s] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt;,
   [is_balanced_subset] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt;,
   [is_audioset_eval] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt;,
   [audioset_ids] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt;
);
CREATE TABLE [videos] (
   [id] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt; &lt;span class="pl-k"&gt;PRIMARY KEY&lt;/span&gt;,
   [publishedAt] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt;,
   [channelId] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt;,
   [title] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt;,
   [description] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt;,
   [thumbnails] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt;,
   [channelTitle] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt;,
   [tags] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt;,
   [categoryId] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt;,
   [liveBroadcastContent] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt;,
   [localized] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt;,
   [viewCount] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt;,
   [likeCount] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt;,
   [favoriteCount] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt;,
   [commentCount] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt;,
   [defaultAudioLanguage] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt;,
   [defaultLanguage] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt;
);&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;I &lt;a href="https://sqlite-utils.datasette.io/en/stable/python-api.html#full-text-search"&gt;configured SQLite full-text search&lt;/a&gt; against the &lt;code&gt;musiccaps.caption&lt;/code&gt; column.&lt;/p&gt;
&lt;p&gt;The last step was to create a SQL view that combined the key data from the two tables. After some more iteration I came up with this one:&lt;/p&gt;
&lt;div class="highlight highlight-source-sql"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;CREATE&lt;/span&gt; &lt;span class="pl-k"&gt;VIEW&lt;/span&gt; &lt;span class="pl-en"&gt;musiccaps_details&lt;/span&gt; &lt;span class="pl-k"&gt;AS&lt;/span&gt; &lt;span class="pl-k"&gt;select&lt;/span&gt;
  &lt;span class="pl-c1"&gt;musiccaps&lt;/span&gt;.&lt;span class="pl-c1"&gt;url&lt;/span&gt; &lt;span class="pl-k"&gt;as&lt;/span&gt; video,
  json_object(
    &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;label&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;,
    coalesce(&lt;span class="pl-c1"&gt;videos&lt;/span&gt;.&lt;span class="pl-c1"&gt;title&lt;/span&gt;, &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;Missing from YouTube&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;),
    &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;href&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-c1"&gt;musiccaps&lt;/span&gt;.&lt;span class="pl-c1"&gt;url&lt;/span&gt;
  ) &lt;span class="pl-k"&gt;as&lt;/span&gt; youtube_link,
  &lt;span class="pl-c1"&gt;musiccaps&lt;/span&gt;.&lt;span class="pl-c1"&gt;caption&lt;/span&gt; &lt;span class="pl-k"&gt;as&lt;/span&gt; musiccaps_caption,
  &lt;span class="pl-c1"&gt;videos&lt;/span&gt;.&lt;span class="pl-c1"&gt;publishedAt&lt;/span&gt; &lt;span class="pl-k"&gt;as&lt;/span&gt; youtube_published,
  &lt;span class="pl-c1"&gt;videos&lt;/span&gt;.&lt;span class="pl-c1"&gt;channelTitle&lt;/span&gt; &lt;span class="pl-k"&gt;as&lt;/span&gt; youtube_channel,
  &lt;span class="pl-c1"&gt;videos&lt;/span&gt;.&lt;span class="pl-c1"&gt;description&lt;/span&gt; &lt;span class="pl-k"&gt;as&lt;/span&gt; youtube_description,
  &lt;span class="pl-c1"&gt;musiccaps&lt;/span&gt;.&lt;span class="pl-c1"&gt;audioset_names&lt;/span&gt; &lt;span class="pl-k"&gt;as&lt;/span&gt; musiccaps_names,
  &lt;span class="pl-c1"&gt;musiccaps&lt;/span&gt;.&lt;span class="pl-c1"&gt;aspect_list&lt;/span&gt; &lt;span class="pl-k"&gt;as&lt;/span&gt; musiccaps_aspects,
  &lt;span class="pl-c1"&gt;musiccaps&lt;/span&gt;.&lt;span class="pl-c1"&gt;author_id&lt;/span&gt; &lt;span class="pl-k"&gt;as&lt;/span&gt; musiccaps_author,
  &lt;span class="pl-c1"&gt;videos&lt;/span&gt;.&lt;span class="pl-c1"&gt;id&lt;/span&gt; &lt;span class="pl-k"&gt;as&lt;/span&gt; youtube_id,
  &lt;span class="pl-c1"&gt;musiccaps&lt;/span&gt;.&lt;span class="pl-c1"&gt;rowid&lt;/span&gt; &lt;span class="pl-k"&gt;as&lt;/span&gt; musiccaps_rowid
&lt;span class="pl-k"&gt;from&lt;/span&gt;
  musiccaps
  &lt;span class="pl-k"&gt;left join&lt;/span&gt; videos &lt;span class="pl-k"&gt;on&lt;/span&gt; &lt;span class="pl-c1"&gt;musiccaps&lt;/span&gt;.&lt;span class="pl-c1"&gt;ytid&lt;/span&gt; &lt;span class="pl-k"&gt;=&lt;/span&gt; &lt;span class="pl-c1"&gt;videos&lt;/span&gt;.&lt;span class="pl-c1"&gt;id&lt;/span&gt;;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;I built a &lt;a href="https://github.com/simonw/musiccaps-datasette/blob/main/templates/index.html"&gt;custom template&lt;/a&gt; for the instance homepage to add a search box, then shipped the whole thing to Vercel using the &lt;a href="https://datasette.io/plugins/datasette-publish-vercel"&gt;datasette-publish-vercel&lt;/a&gt; plugin.&lt;/p&gt;
&lt;h4&gt;Let me know what you find&lt;/h4&gt;
&lt;p&gt;Digging around in this data is a lot of fun. I'd love to hear what you find. Hit me up &lt;a href="https://fedi.simonwillison.net/@simon"&gt;on Mastodon&lt;/a&gt; if you find anything interesting!&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ethics"&gt;ethics&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/google"&gt;google&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/youtube"&gt;youtube&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/training-data"&gt;training-data&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-ethics"&gt;ai-ethics&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="ethics"/><category term="google"/><category term="projects"/><category term="youtube"/><category term="ai"/><category term="datasette"/><category term="generative-ai"/><category term="training-data"/><category term="ai-ethics"/></entry><entry><title>Exploring 10m scraped Shutterstock videos used to train Meta's Make-A-Video text-to-video model</title><link href="https://simonwillison.net/2022/Sep/29/webvid/#atom-series" rel="alternate"/><published>2022-09-29T19:31:24+00:00</published><updated>2022-09-29T19:31:24+00:00</updated><id>https://simonwillison.net/2022/Sep/29/webvid/#atom-series</id><summary type="html">
    &lt;p&gt;&lt;a href="https://makeavideo.studio/"&gt;Make-A-Video&lt;/a&gt; is a new "state-of-the-art AI system that generates videos from text" from Meta AI. It looks incredible - it really is DALL-E / Stable Diffusion for video. And it appears to have been trained on 10m video preview clips scraped from Shutterstock.&lt;/p&gt;
&lt;p&gt;I built a new search engine to explore those ten million clips:&lt;/p&gt;
&lt;p&gt;&lt;a href="https://webvid.datasette.io/webvid/videos"&gt;https://webvid.datasette.io/webvid/videos&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2022/webvid-datasette.jpg" alt="A search for mars rocks returns 33 videos, each shown with a video player" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;This is similar to &lt;a href="https://simonwillison.net/2022/Sep/5/laion-aesthetics-weeknotes/"&gt;the system I built with Andy Baio&lt;/a&gt; a few weeks ago to explore the LAION data used to train Stable Diffusion.&lt;/p&gt;
&lt;h4&gt;Make-A-Video training data&lt;/h4&gt;
&lt;p&gt;Meta AI's &lt;a href="https://makeavideo.studio/Make-A-Video.pdf"&gt;paper describing the model&lt;/a&gt; includes this section about the training data:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Datasets.&lt;/strong&gt; To train the image models, we use a 2.3B subset of the dataset from (Schuhmann et al.) where the text is English. We filter out sample pairs with NSFW images 2, toxic words in the text, or images with a watermark probability larger than 0.5.&lt;/p&gt;
&lt;p&gt;We use WebVid-10M (Bain et al., 2021) and a 10M subset from HD-VILA-100M (Xue et al., 2022) 3 to train our video generation models. Note that only the videos (no aligned text) are used.&lt;/p&gt;
&lt;p&gt;The decoder Dt and the interpolation model is trained on WebVid-10M. SRt l is trained on both WebVid-10M and HD-VILA-10M. While prior work (Hong et al., 2022; Ho et al., 2022) have collected private text-video pairs for T2V generation, we use only public datasets (and no paired text for videos). We conduct automatic evaluation on UCF-101 (Soomro et al., 2012) and MSR-VTT (Xu et al., 2016) in a zero-shot setting.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;That 2.3B subset of images is the same LAION data &lt;a href="https://simonwillison.net/2022/Sep/5/laion-aesthetics-weeknotes/"&gt;I explored previously&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://github.com/microsoft/XPretrain/tree/main/hd-vila-100m"&gt;HD-VILA-100M&lt;/a&gt; was collected by Microsoft Research Asia - &lt;a href="https://twitter.com/waxpancake/status/1575564723546103808"&gt;Andy Baio notes&lt;/a&gt; that these were scraped from YouTube.&lt;/p&gt;
&lt;p&gt;I decided to take a look at the &lt;a href="https://m-bain.github.io/webvid-dataset/"&gt;WebVid-10M&lt;/a&gt; data.&lt;/p&gt;
&lt;h4&gt;WebVid-10M&lt;/h4&gt;
&lt;p&gt;The &lt;a href="https://m-bain.github.io/webvid-dataset/"&gt;WebVid-10M site&lt;/a&gt; describes the data like this:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;WebVid-10M is a large-scale dataset of short videos with textual descriptions sourced from the web. The videos are diverse and rich in their content.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The &lt;a href="https://www.arxiv-vanity.com/papers/2104.00650/"&gt;accompanying paper&lt;/a&gt; provides a little bit more detail:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;We scrape the web for a new dataset of videos with textual description annotations, called WebVid-2M. Our dataset consists of 2.5M video-text pairs, which is an order of magnitude larger than existing video captioning datasets (see Table 1).&lt;/p&gt;
&lt;p&gt;The data was scraped from the web following a similar procedure to Google Conceptual Captions [55] (CC3M). We note that more than 10% of CC3M images are in fact thumbnails from videos, which motivates us to use such video sources to scrape a total of 2.5M text-video pairs. The use of data collected for this study is authorised via the &lt;a href="https://www.gov.uk/guidance/exceptions-to-copyright/"&gt;Intellectual Property Office’s Exceptions to Copyright for Non-Commercial Research and Private Study&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I'm presuming that Web-10M is a larger version of the WebVid-2M dataset described in the paper.&lt;/p&gt;
&lt;p&gt;Most importantly though, the website includes a link to a 2.7GB CSV file - &lt;code&gt;results_10M_train.csv&lt;/code&gt; - containing the full WebVid-10M dataset. The CSV file looks like this:&lt;/p&gt;
&lt;pre lang="csv"&gt;&lt;code&gt;videoid,contentUrl,duration,page_dir,name
21179416,https://ak.picdn.net/shutterstock/videos/21179416/preview/stock-footage-aerial-shot-winter-forest.mp4,PT00H00M11S,006001_006050,Aerial shot winter forest
5629184,https://ak.picdn.net/shutterstock/videos/5629184/preview/stock-footage-senior-couple-looking-through-binoculars-on-sailboat-together-shot-on-red-epic-for-high-quality-k.mp4,PT00H00M29S,071501_071550,"Senior couple looking through binoculars on sailboat together. shot on red epic for high quality 4k, uhd, ultra hd resolution."
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I &lt;a href="https://til.simonwillison.net/sqlite/import-csv"&gt;loaded it into SQLite&lt;/a&gt; and started digging around.&lt;/p&gt;
&lt;h4&gt;It's all from Shutterstock!&lt;/h4&gt;
&lt;p&gt;The big surprise for me when I started exploring the data was this: every single one of the 10,727,582 videos linked in the Datasette started with the same URL prefix:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;https://ak.picdn.net/shutterstock/videos/&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;They're all from Shutterstock. The paper talks about "scraping the web", but it turns out there was only one scraped website involved.&lt;/p&gt;
&lt;p&gt;Here's that first row from the CSV file on Shutterstock itself:&lt;/p&gt;
&lt;p&gt;&lt;a href="https://www.shutterstock.com/video/clip-21179416-aerial-shot-winter-forest"&gt;https://www.shutterstock.com/video/clip-21179416-aerial-shot-winter-forest&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;As far as I can tell, the training set used here isn't even full Shutterstock videos: it's the free, watermarked preview clips that Shutterstock makes available.&lt;/p&gt;
&lt;p&gt;I guess Shutterstock have really high quality captions for their videos, perfect for training a model on.&lt;/p&gt;
&lt;h4&gt;Implementation notes&lt;/h4&gt;
&lt;p&gt;My &lt;a href="https://github.com/simonw/webvid-datasette"&gt;simonw/webvid-datasette&lt;/a&gt; repository contains the code I used to build the Datasette instance.&lt;/p&gt;
&lt;p&gt;I built a SQLite database with full-text search enabled using &lt;a href="https://sqlite-utils.datasette.io/"&gt;sqlite-utils&lt;/a&gt;. I deployed it directly to Fly by building a Docker image that bundled the 2.5G SQLite database, taking advantage of the &lt;a href="https://simonwillison.net/2021/Jul/28/baked-data/"&gt;Baked Data architectural pattern&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The most interesting custom piece of implementation is the plugin I wrote to add a video player to each result. Here's the &lt;a href="https://github.com/simonw/webvid-datasette/blob/8b8b56b9a8fa4913520fe11c11f45920f6737a86/plugins/render_cell.py"&gt;implementation of that plugin&lt;/a&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;span class="pl-k"&gt;from&lt;/span&gt; &lt;span class="pl-s1"&gt;datasette&lt;/span&gt; &lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-s1"&gt;hookimpl&lt;/span&gt;
&lt;span class="pl-k"&gt;from&lt;/span&gt; &lt;span class="pl-s1"&gt;markupsafe&lt;/span&gt; &lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-v"&gt;Markup&lt;/span&gt;

&lt;span class="pl-v"&gt;TEMPLATE&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s"&gt;"""&lt;/span&gt;
&lt;span class="pl-s"&gt;&amp;lt;video controls width="400" preload="none" poster="{poster}"&amp;gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;  &amp;lt;source src="{url}" type="video/mp4"&amp;gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;&amp;lt;/video&amp;gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;&amp;lt;p&amp;gt;{filename}&amp;lt;br&amp;gt;On &amp;lt;a href="https://www.shutterstock.com/video/clip-{id}"&amp;gt;Shutterstock&amp;lt;/a&amp;gt;&amp;lt;/p&amp;gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;"""&lt;/span&gt;.&lt;span class="pl-en"&gt;strip&lt;/span&gt;()
&lt;span class="pl-v"&gt;VIDEO_URL&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s"&gt;"https://ak.picdn.net/shutterstock/videos/{id}/preview/{filename}"&lt;/span&gt;
&lt;span class="pl-v"&gt;POSTER_URL&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s"&gt;"https://ak.picdn.net/shutterstock/videos/{id}/thumb/1.jpg?ip=x480"&lt;/span&gt;


&lt;span class="pl-en"&gt;@&lt;span class="pl-s1"&gt;hookimpl&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-k"&gt;def&lt;/span&gt; &lt;span class="pl-en"&gt;render_cell&lt;/span&gt;(&lt;span class="pl-s1"&gt;row&lt;/span&gt;, &lt;span class="pl-s1"&gt;column&lt;/span&gt;, &lt;span class="pl-s1"&gt;value&lt;/span&gt;):
    &lt;span class="pl-k"&gt;if&lt;/span&gt; &lt;span class="pl-s1"&gt;column&lt;/span&gt; &lt;span class="pl-c1"&gt;!=&lt;/span&gt; &lt;span class="pl-s"&gt;"filename"&lt;/span&gt;:
        &lt;span class="pl-k"&gt;return&lt;/span&gt;
    &lt;span class="pl-s1"&gt;id&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;row&lt;/span&gt;[&lt;span class="pl-s"&gt;"id"&lt;/span&gt;]
    &lt;span class="pl-s1"&gt;url&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-v"&gt;VIDEO_URL&lt;/span&gt;.&lt;span class="pl-en"&gt;format&lt;/span&gt;(&lt;span class="pl-s1"&gt;id&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-s1"&gt;id&lt;/span&gt;, &lt;span class="pl-s1"&gt;filename&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-s1"&gt;value&lt;/span&gt;)
    &lt;span class="pl-s1"&gt;poster&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-v"&gt;POSTER_URL&lt;/span&gt;.&lt;span class="pl-en"&gt;format&lt;/span&gt;(&lt;span class="pl-s1"&gt;id&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-s1"&gt;id&lt;/span&gt;)
    &lt;span class="pl-k"&gt;return&lt;/span&gt; &lt;span class="pl-v"&gt;Markup&lt;/span&gt;(&lt;span class="pl-v"&gt;TEMPLATE&lt;/span&gt;.&lt;span class="pl-en"&gt;format&lt;/span&gt;(&lt;span class="pl-s1"&gt;url&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-s1"&gt;url&lt;/span&gt;, &lt;span class="pl-s1"&gt;poster&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-s1"&gt;poster&lt;/span&gt;, &lt;span class="pl-s1"&gt;filename&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-s1"&gt;value&lt;/span&gt;, &lt;span class="pl-s1"&gt;id&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-s1"&gt;id&lt;/span&gt;))&lt;/pre&gt;
&lt;p&gt;I'm using the new &lt;code&gt;render_cell(row)&lt;/code&gt; argument added &lt;a href="https://docs.datasette.io/en/stable/changelog.html#v0-62"&gt;in Datasette 0.62&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The plugin outputs a &lt;code&gt;&amp;lt;video&amp;gt;&lt;/code&gt; element with &lt;code&gt;preload="none"&lt;/code&gt; to avoid the browser downloading the video until the user clicks play (see &lt;a href="https://til.simonwillison.net/html/video-preload-none"&gt;this TIL&lt;/a&gt;). I set the &lt;code&gt;poster&lt;/code&gt; attribute to a thumbnail image from Shutterstock.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ethics"&gt;ethics&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/facebook"&gt;facebook&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/machine-learning"&gt;machine-learning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/training-data"&gt;training-data&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/text-to-video"&gt;text-to-video&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-ethics"&gt;ai-ethics&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="ethics"/><category term="facebook"/><category term="machine-learning"/><category term="projects"/><category term="ai"/><category term="datasette"/><category term="generative-ai"/><category term="training-data"/><category term="text-to-video"/><category term="ai-ethics"/></entry><entry><title>Exploring the training data behind Stable Diffusion</title><link href="https://simonwillison.net/2022/Sep/5/laion-aesthetics-weeknotes/#atom-series" rel="alternate"/><published>2022-09-05T00:18:42+00:00</published><updated>2022-09-05T00:18:42+00:00</updated><id>https://simonwillison.net/2022/Sep/5/laion-aesthetics-weeknotes/#atom-series</id><summary type="html">
    &lt;p&gt;Two weeks ago, the Stable Diffusion image generation model was &lt;a href="https://stability.ai/blog/stable-diffusion-public-release"&gt;released to the public&lt;/a&gt;. I wrote about this last week, in &lt;a href="https://simonwillison.net/2022/Aug/29/stable-diffusion/"&gt;Stable Diffusion is a really big deal&lt;/a&gt; - a post which has since become one of the top ten results for "stable diffusion" on Google and shown up in all sorts of different places online.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://waxy.org"&gt;Andy Baio&lt;/a&gt; pinged me a week ago on Friday and asked if I'd be interested in collaborating with him on digging into the training data. The Stable Diffusion &lt;a href="https://github.com/CompVis/stable-diffusion/blob/main/Stable_Diffusion_v1_Model_Card.md"&gt;Model Card&lt;/a&gt; provides a detailed description of how the model was trained - primarily on the &lt;a href="https://huggingface.co/datasets/laion/laion2B-en"&gt;LAION 2B-en&lt;/a&gt;) dataset (a subset of &lt;a href="https://laion.ai/blog/laion-5b/"&gt;LAION 5B&lt;/a&gt;), with further emphasis given to images with higher calculated aesthetic scores.&lt;/p&gt;
&lt;p&gt;We ended up deciding to dig into the &lt;a href="https://huggingface.co/datasets/ChristophSchuhmann/improved_aesthetics_6plus"&gt;improved_aesthetics_6plus&lt;/a&gt; subset, which consists of 12 million images with an aesthetics score of 6 or higher.&lt;/p&gt;
&lt;p&gt;This isn't the full training set used for the model, but it's small enough that it fits comfortably in a SQLite database on inexpensive hosting...&lt;/p&gt;
&lt;p&gt;So I built a search engine, powered by &lt;a href="https://datasette.io/"&gt;Datasette&lt;/a&gt;!&lt;/p&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;Update, 20th December 2023:&lt;/strong&gt; This search tool is no longer available.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;You can search for images by keyword using the following interface:&lt;/p&gt;
&lt;p&gt;&lt;a href="https://laion-aesthetic.datasette.io/laion-aesthetic-6pls/images"&gt;laion-aesthetic.datasette.io/laion-aesthetic-6pls/images&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2022/laion-lemur.jpg" alt="Screenshot of the search interface, showing the results for lemur" style="max-width: 100%" /&gt;&lt;/p&gt;

&lt;p&gt;Or see a breakdown of image counts by the domain they were scraped from &lt;a href="https://laion-aesthetic.datasette.io/laion-aesthetic-6pls/domain?_sort_desc=image_counts"&gt;on this page&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The search engine provides access to 12,096,835 rows, and uses SQLite full-text search to power search across their text descriptions.&lt;/p&gt;
&lt;p&gt;Andy used this Datasette instance to conduct a thorough analysis of the underlying training data, which he wrote about in &lt;a href="https://waxy.org/2022/08/exploring-12-million-of-the-images-used-to-train-stable-diffusions-image-generator/"&gt;Exploring 12 Million of the 2.3 Billion Images Used to Train Stable Diffusion’s Image Generator&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;This analysis has had a really huge impact! Stories mentioning it made the front page of the websites of both the New York Times and the Washington Post on the same day:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://www.washingtonpost.com/technology/2022/09/02/midjourney-artificial-intelligence-state-fair-colorado/"&gt;He used AI art from Midjourney to win a fine-arts prize. Did he cheat?&lt;/a&gt; - The Washington Post&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.nytimes.com/2022/09/02/technology/ai-artificial-intelligence-artists.html"&gt;An A.I.-Generated Picture Won an Art Prize. Artists Aren’t Happy.&lt;/a&gt; - The New York Times&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Further afield, we spotted coverage from publications that included:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://www.spiegel.de/netzwelt/web/stable-diffusion-verstoerend-gute-ki-kunst-und-jeder-kann-mitmischen-a-0bde407d-c0d5-474a-92fc-de2a99c01774"&gt;Diese Software macht Sie zum KI-Künstler&lt;/a&gt; - Der Spiegel. I get quoted in this one (a translated snippet from my blog at least).&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://gigazine.net/news/20220831-exploring-stable-diffusions/"&gt;23億枚もの画像で構成された画像生成AI「Stable Diffusion」のデータセットのうち1200万枚がどこから入手した画像かを調査した結果が公開される&lt;/a&gt; - Gigazine, a long-running (22 years old) Japanese online news magazine.&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 id="how-i-built-the-database"&gt;How I built the database&lt;/h4&gt;
&lt;p&gt;The code for the Datasette instance can be found in &lt;a href="https://github.com/simonw/laion-aesthetic-datasette"&gt;this GitHub repository&lt;/a&gt;. The &lt;a href="https://github.com/simonw/laion-aesthetic-datasette/issues"&gt;issues&lt;/a&gt; in that repo contain a detailed record of the various steps I took to build the database.&lt;/p&gt;
&lt;p&gt;The data subset I loaded into the search engine is &lt;a href="https://huggingface.co/datasets/ChristophSchuhmann/improved_aesthetics_6plus"&gt;published on Hugging Face&lt;/a&gt; by Christoph Schuhmann. It consists of 7 parquet files, each of which are 325MB and stored in a GitHub repo using Git LFS.&lt;/p&gt;
&lt;p&gt;The first step was to fetch that data.&lt;/p&gt;
&lt;p&gt;This was my first time running &lt;code&gt;git lfs&lt;/code&gt; - I had to install it first using:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;brew install git-lfs
git lfs install
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then I cloned the repo and fetched the data like this. Note that to make the actual files available in the directory you need to run both &lt;code&gt;git lfs fetch&lt;/code&gt; and &lt;code&gt;git lfs checkout&lt;/code&gt;:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;git clone https://huggingface.co/datasets/ChristophSchuhmann/improved_aesthetics_6plus
&lt;span class="pl-c1"&gt;cd&lt;/span&gt; improved_aesthetics_6plus
git lfs fetch
git lfs checkout&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The result is 7 parquet files. I wanted to load these into SQLite.&lt;/p&gt;
&lt;p&gt;The first solution I found that worked was to use the &lt;a href="https://pypi.org/project/parquet-tools/"&gt; parquet-tools&lt;/a&gt; Python package:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;pipx install parquet-tools
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I could then convert the parquet data to CSV like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;parquet-tools csv train-00002-of-00007-709151a2715d894d.parquet
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This outputs the contents of the file as CSV.&lt;/p&gt;
&lt;p&gt;Since this is a lot of data it made sense to create an empty SQLite table first (with columns with the correct column types) before inserting the data. I did that like so:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;sqlite3 laion-aesthetic-6pls.db &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;CREATE TABLE IF NOT EXISTS images (&lt;/span&gt;
&lt;span class="pl-s"&gt;   [url] TEXT,&lt;/span&gt;
&lt;span class="pl-s"&gt;   [text] TEXT,&lt;/span&gt;
&lt;span class="pl-s"&gt;   [width] INTEGER,&lt;/span&gt;
&lt;span class="pl-s"&gt;   [height] INTEGER,&lt;/span&gt;
&lt;span class="pl-s"&gt;   [similarity] FLOAT,&lt;/span&gt;
&lt;span class="pl-s"&gt;   [punsafe] FLOAT,&lt;/span&gt;
&lt;span class="pl-s"&gt;   [pwatermark] FLOAT,&lt;/span&gt;
&lt;span class="pl-s"&gt;   [aesthetic] FLOAT,&lt;/span&gt;
&lt;span class="pl-s"&gt;   [hash] TEXT,&lt;/span&gt;
&lt;span class="pl-s"&gt;   [__index_level_0__] INTEGER&lt;/span&gt;
&lt;span class="pl-s"&gt;);&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Then I used a &lt;code&gt;bash&lt;/code&gt; loop to insert all of the data:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;for&lt;/span&gt; &lt;span class="pl-smi"&gt;filename&lt;/span&gt; &lt;span class="pl-k"&gt;in&lt;/span&gt; &lt;span class="pl-k"&gt;*&lt;/span&gt;.parquet&lt;span class="pl-k"&gt;;&lt;/span&gt; &lt;span class="pl-k"&gt;do&lt;/span&gt;
    parquet-tools csv &lt;span class="pl-smi"&gt;$filename&lt;/span&gt; &lt;span class="pl-k"&gt;|&lt;/span&gt; sqlite3 -csv laion-aesthetic-6pls.db &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;.import --skip 1 '|cat -' images&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-k"&gt;done&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This uses the &lt;code&gt;sqlite3&lt;/code&gt; tool's &lt;code&gt;.import&lt;/code&gt; mechanism, because it's &lt;a href="https://til.simonwillison.net/sqlite/import-csv"&gt;really fast&lt;/a&gt;. The &lt;code&gt;--skip 1&lt;/code&gt; option is necessary to skip the first line, which is the CSV column names. The &lt;code&gt;'|cat -'&lt;/code&gt; is the idiom used to tell SQLite to read from standard input.&lt;/p&gt;
&lt;p&gt;This did the job! The result was a SQLite database file, about 3.5GB in size.&lt;/p&gt;
&lt;h4&gt;Enabling search&lt;/h4&gt;
&lt;p&gt;To enable SQLite full-text search against the images, I used &lt;a href="https://sqlite-utils.datasette.io/en/stable/cli.html#configuring-full-text-search"&gt;sqlite-utils enable-fts&lt;/a&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;sqlite-utils enable-fts laion-aesthetic-6pls.db images text
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This took about a minute and a half to run. The resulting database file was around 3.9GB in size - the full text index didn't add as much to the file size as I had expected.&lt;/p&gt;
&lt;p&gt;Best of all, the search was fast! Most search queries took in the order of 20ms to run. My opinion of SQLite FTS keeps improving the more I use it.&lt;/p&gt;
&lt;h4&gt;Extracting domains with sqlite-utils --functions&lt;/h4&gt;
&lt;p&gt;We knew we wanted to count how many images had been scraped from each domain - but we currently only had the full image URLs:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;https://cdn.idahopotato.com/cache/4075b86c99bc2c46f927f3be5949d161_w310.jpg&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;While walking &lt;a href="https://twitter.com/cleopaws"&gt;Cleo&lt;/a&gt; I had an idea: what if &lt;code&gt;sqlite-utils&lt;/code&gt; made it really easy to register custom SQL functions and use them from the command-line? Then I could use a Python function to extract the domain names.&lt;/p&gt;
&lt;p&gt;This became the impetus for releasing &lt;a href="https://sqlite-utils.datasette.io/en/stable/changelog.html#v3-29"&gt;sqlite-utils 3.29&lt;/a&gt; with a brand new feature: &lt;a href="https://sqlite-utils.datasette.io/en/stable/cli.html#cli-query-functions"&gt;sqlite-utils --functions&lt;/a&gt;, which lets you do exactly that.&lt;/p&gt;
&lt;p&gt;Here's how I used that to extract the domain names from the URLs:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; First, add an empty 'domain' column to the table&lt;/span&gt;
sqlite-utils add-column data.db images domain

&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; Now populate it using a custom SQL function:&lt;/span&gt;
sqlite-utils laion-aesthetic-6pls.db &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;update images set domain = domain(url)&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt; \
--functions &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;from urllib.parse import urlparse&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;def domain(url):&lt;/span&gt;
&lt;span class="pl-s"&gt;    return urlparse(url).netloc&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Here we are executing this SQL query against the database:&lt;/p&gt;
&lt;div class="highlight highlight-source-sql"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;update&lt;/span&gt; images &lt;span class="pl-k"&gt;set&lt;/span&gt; domain &lt;span class="pl-k"&gt;=&lt;/span&gt; domain(url)&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Where that &lt;code&gt;domain(url)&lt;/code&gt; function is defined in the Python snippet passed to the &lt;code&gt;--functions&lt;/code&gt; option:&lt;/p&gt;
&lt;pre&gt;&lt;span class="pl-k"&gt;from&lt;/span&gt; &lt;span class="pl-s1"&gt;urllib&lt;/span&gt;.&lt;span class="pl-s1"&gt;parse&lt;/span&gt; &lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-s1"&gt;urlparse&lt;/span&gt;

&lt;span class="pl-k"&gt;def&lt;/span&gt; &lt;span class="pl-en"&gt;domain&lt;/span&gt;(&lt;span class="pl-s1"&gt;url&lt;/span&gt;):
    &lt;span class="pl-k"&gt;return&lt;/span&gt; &lt;span class="pl-en"&gt;urlparse&lt;/span&gt;(&lt;span class="pl-s1"&gt;url&lt;/span&gt;).&lt;span class="pl-s1"&gt;netloc&lt;/span&gt;&lt;/pre&gt;
&lt;p&gt;&lt;code&gt;sqlite-utils&lt;/code&gt; runs &lt;code&gt;eval()&lt;/code&gt; against the code in that block, then loops through any callable objects defined by that code (skipping them if their name starts with an underscore) and registers those as custom SQL functions with SQLite.&lt;/p&gt;
&lt;p&gt;I'm really excited about this pattern. I think it makes &lt;code&gt;sqlite-utils&lt;/code&gt; an even more useful tool for running ad-hoc data cleanup and enrichment tasks.&lt;/p&gt;
&lt;h4&gt;Populating the domains table&lt;/h4&gt;
&lt;p&gt;The &lt;code&gt;domain&lt;/code&gt; column in the &lt;code&gt;images&lt;/code&gt; table was now populated, but it was a bit of a verbose column: it duplicated a chunk of text from the existing &lt;code&gt;url&lt;/code&gt;, and was repeated for over 12 million rows.&lt;/p&gt;
&lt;p&gt;The &lt;a href="https://sqlite-utils.datasette.io/en/stable/cli.html#extracting-columns-into-a-separate-table"&gt;sqlite-utils extract&lt;/a&gt; command is designed for this exact use-case. It can extract a column from an existing table out into a separate lookup table, reducing the database size by swapping those duplicate text fields for a much smaller integer foreign key column instead.&lt;/p&gt;
&lt;p&gt;I ran that like so:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;sqlite-utils extract laion-aesthetic-6pls.db images domain&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The result was a new &lt;code&gt;domains&lt;/code&gt; table, and a &lt;code&gt;domain_id&lt;/code&gt; column in the &lt;code&gt;images&lt;/code&gt; table that pointed to records there.&lt;/p&gt;
&lt;p&gt;One more step: I didn't want people visiting the site to have to run an expensive group by/count query to see which domains had the most images. So I denormalized that data into the &lt;code&gt;domains&lt;/code&gt; table.&lt;/p&gt;
&lt;p&gt;First I added a new integer column to it, called &lt;code&gt;image_counts&lt;/code&gt;:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;sqlite-utils add-column laion-aesthetic-6pls.db domain image_counts integer&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Then I populated it with a query like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-sql"&gt;&lt;pre&gt;sqlite&lt;span class="pl-k"&gt;-&lt;/span&gt;utils laion&lt;span class="pl-k"&gt;-&lt;/span&gt;aesthetic&lt;span class="pl-k"&gt;-&lt;/span&gt;&lt;span class="pl-c1"&gt;6pls&lt;/span&gt;.&lt;span class="pl-c1"&gt;db&lt;/span&gt; &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;with counts as (&lt;/span&gt;
&lt;span class="pl-s"&gt;  select domain_id, count(*) as c from images group by domain_id&lt;/span&gt;
&lt;span class="pl-s"&gt;)&lt;/span&gt;
&lt;span class="pl-s"&gt;update domain&lt;/span&gt;
&lt;span class="pl-s"&gt;  set image_counts = counts.c&lt;/span&gt;
&lt;span class="pl-s"&gt;  from counts&lt;/span&gt;
&lt;span class="pl-s"&gt;  where id = counts.domain_id&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;I first learned to combine CTEs and SQL updates while working &lt;a href="https://til.simonwillison.net/django/migration-using-cte"&gt;with Django migrations&lt;/a&gt; - I was delighted to see the same trick works for SQLite as well.&lt;/p&gt;
&lt;p&gt;You can see the result of this query in the &lt;a href="https://laion-aesthetic.datasette.io/laion-aesthetic-6pls/domain?_sort_desc=image_counts"&gt;domain&lt;/a&gt; table. The first five rows look like this:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;id&lt;/th&gt;
&lt;th&gt;domain&lt;/th&gt;
&lt;th&gt;image_counts&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;24&lt;/td&gt;
&lt;td&gt;i.pinimg.com&lt;/td&gt;
&lt;td&gt;1043949&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;render.fineartamerica.com&lt;/td&gt;
&lt;td&gt;601106&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;16&lt;/td&gt;
&lt;td&gt;us.123rf.com&lt;/td&gt;
&lt;td&gt;497244&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;cdn.shopify.com&lt;/td&gt;
&lt;td&gt;241632&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;136&lt;/td&gt;
&lt;td&gt;photos.smugmug.com&lt;/td&gt;
&lt;td&gt;225582&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h4&gt;Doing the same for celebrities, artists, characters&lt;/h4&gt;
&lt;p&gt;We also wanted to provide pre-calculated counts for searches against a number of celebrities, artists and fictional characters - to help give a sense of the kinds of images that were included in the data.&lt;/p&gt;
&lt;p&gt;Andy gathered the ones we wanted to track in &lt;a href="https://docs.google.com/spreadsheets/d/1JLQQ3U6P0d4vDkAGuB8avmXOPIDPeDUdAEZsWTRYpng/edit"&gt;this Google Sheet&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I recently learned how to use the &lt;code&gt;/export?format=csv&lt;/code&gt; endpoint to export a Google Sheet as CSV. I found out that you can use &lt;code&gt;/export?format=csv&amp;amp;gid=1037423923&lt;/code&gt; to target a specific tab in a multi-tabbed sheet.&lt;/p&gt;
&lt;p&gt;So I imported Andy's data into SQLite using the following:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;curl -L &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;https://docs.google.com/spreadsheets/d/1JLQQ3U6P0d4vDkAGuB8avmXOPIDPeDUdAEZsWTRYpng/export?format=csv&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt; \
  &lt;span class="pl-k"&gt;|&lt;/span&gt; sqlite-utils insert laion-aesthetic-6pls.db artists - --csv
curl -L &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;https://docs.google.com/spreadsheets/d/1JLQQ3U6P0d4vDkAGuB8avmXOPIDPeDUdAEZsWTRYpng/export?format=csv&amp;amp;gid=1037423923&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt; \
  &lt;span class="pl-k"&gt;|&lt;/span&gt; sqlite-utils insert laion-aesthetic-6pls.db celebrities - --csv
curl -L &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;https://docs.google.com/spreadsheets/d/1JLQQ3U6P0d4vDkAGuB8avmXOPIDPeDUdAEZsWTRYpng/export?format=csv&amp;amp;gid=480391249&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt; \
  &lt;span class="pl-k"&gt;|&lt;/span&gt; sqlite-utils insert laion-aesthetic-6pls.db characters - --csv&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This gave me &lt;code&gt;artists&lt;/code&gt;, &lt;code&gt;celebrities&lt;/code&gt; and &lt;code&gt;characters&lt;/code&gt; tables.&lt;/p&gt;
&lt;p&gt;The next challenge was to run a search query for each row in each of those tables and return the count of results. After some experimentation I found that this one worked:&lt;/p&gt;
&lt;div class="highlight highlight-source-sql"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;select&lt;/span&gt; name, (
  &lt;span class="pl-k"&gt;select&lt;/span&gt; &lt;span class="pl-c1"&gt;count&lt;/span&gt;(&lt;span class="pl-k"&gt;*&lt;/span&gt;) &lt;span class="pl-k"&gt;from&lt;/span&gt; images_fts &lt;span class="pl-k"&gt;where&lt;/span&gt; images_fts match &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;"&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-k"&gt;||&lt;/span&gt; name &lt;span class="pl-k"&gt;||&lt;/span&gt; &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;"&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;
) &lt;span class="pl-k"&gt;as&lt;/span&gt; search_count &lt;span class="pl-k"&gt;from&lt;/span&gt; celebrities &lt;span class="pl-k"&gt;order by&lt;/span&gt; search_count &lt;span class="pl-k"&gt;desc&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Note the &lt;code&gt;match '"' || name || '"'&lt;/code&gt; part - this was necessary to ensure the name was correctly quoted in a way that would avoid names like &lt;code&gt;Dwayne 'The Rock' Johnson&lt;/code&gt; from breaking the search query.&lt;/p&gt;
&lt;p&gt;Now that I had the query I could use that same CTE update trick to populate a counts column in the tables:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;sqlite-utils add-column laion-aesthetic-6pls.db celebrities image_counts integer

sqlite-utils laion-aesthetic-6pls.db &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;$(&lt;/span&gt;cat &lt;span class="pl-s"&gt;&lt;span class="pl-k"&gt;&amp;lt;&amp;lt;&lt;/span&gt;&lt;span class="pl-k"&gt;EOF&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;with counts as (&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;  select name,&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;    (&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;      select count(*) from images_fts where images_fts match '"' || name || '"'&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;    ) as search_count&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;    from celebrities&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;update celebrities&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;  set image_counts = counts.search_count&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;  from counts&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;  where celebrities.name = counts.name&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;&lt;span class="pl-k"&gt;EOF&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;I'm using the &lt;code&gt;cat &amp;lt;&amp;lt;EOF&lt;/code&gt; trick here to avoid having to use shell escaping for the single and double quotes, as described in this TIL: &lt;a href="https://til.simonwillison.net/zsh/argument-heredoc"&gt;Passing command arguments using heredoc syntax&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Here are the finished tables: &lt;a href="https://laion-aesthetic.datasette.io/laion-aesthetic-6pls/characters"&gt;characters&lt;/a&gt;, &lt;a href="https://laion-aesthetic.datasette.io/laion-aesthetic-6pls/celebrities"&gt;celebrities&lt;/a&gt;, &lt;a href="https://laion-aesthetic.datasette.io/laion-aesthetic-6pls/artists"&gt;artists&lt;/a&gt;.&lt;/p&gt;
&lt;h4&gt;Deploying it to Fly&lt;/h4&gt;
&lt;p&gt;At just under 4GB the resulting SQLite database was an awkward size. I often deploy ~1GB databases to Google Cloud Run, but this was a bit too large for me to feel comfortable with that. Cloud Run can also get expensive for projects that attract a great deal of traffic.&lt;/p&gt;
&lt;p&gt;I decided to use &lt;a href="https://fly.io/"&gt;Fly&lt;/a&gt; instead. Fly includes support for mountable volumes, which means it's a great fit for these larger database files.&lt;/p&gt;
&lt;p&gt;I wrote about &lt;a href="https://simonwillison.net/2022/Feb/15/fly-volumes/"&gt;Using SQLite and Datasette with Fly Volumes&lt;/a&gt; back in February, when I added support to volumes to the &lt;a href="https://datasette.io/plugins/datasette-publish-fly"&gt;datasette-publish-fly&lt;/a&gt; Datasette plugin.&lt;/p&gt;
&lt;p&gt;This was still the largest database I had ever deployed to Fly, and it took a little bit of work to figure out the best way to handle it.&lt;/p&gt;
&lt;p&gt;In the end, I used the following recipe:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;datasette publish fly \
  --app laion-aesthetic \
  --volume-name datasette \
  --install datasette-json-html \
  --extra-options &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;-i /data/laion-aesthetic-6pls.db --inspect-file /data/inspect.json --setting sql_time_limit_ms 10000 --setting suggest_facets 0 --setting allow_download 0&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; \
  -m metadata.yml&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The first time I ran this I used &lt;code&gt;--create-volume 20&lt;/code&gt; to create a 20GB volume called &lt;code&gt;datasette&lt;/code&gt;. I over-provisioned this so I could run commands like &lt;code&gt;sqlite-utils vacuum&lt;/code&gt;, which need twice the amount of space as is taken up by the database file itself.&lt;/p&gt;
&lt;p&gt;I uploaded the database file itself &lt;a href="https://til.simonwillison.net/fly/scp"&gt;using scp&lt;/a&gt;, and ran &lt;code&gt;fly ssh console -a laion-aesthetic&lt;/code&gt; to SSH in and execute other commands such as &lt;code&gt;datasette inspect  laion-aesthetic-6pls.db &gt; inspect.json&lt;/code&gt; to create the inspect JSON file.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;--extra-options&lt;/code&gt; deserve explanation.&lt;/p&gt;
&lt;p&gt;Normally when you run &lt;code&gt;datasette publish&lt;/code&gt; the file you pass to the command is automatically deployed using &lt;a href="https://docs.datasette.io/en/stable/performance.html?highlight=immutable#immutable-mode"&gt;immutable mode&lt;/a&gt;. This mode is specifically designed for running read-only databases, and uses optimizations like only counting the rows in the table once on startup (or loading the counts from a pre-prepared &lt;code&gt;inspect.json&lt;/code&gt; file).&lt;/p&gt;
&lt;p&gt;I wanted those optimizations for this project. But &lt;code&gt;datasette publish fly&lt;/code&gt; is currently designed with the assumption that any databases you put in the &lt;code&gt;/data&lt;/code&gt; volume are designed to accept writes, and hence shouldn't be opened in immutable mode.&lt;/p&gt;
&lt;p&gt;I ended up coming up with a horrible hack. I add &lt;code&gt;-i /data/laion-aesthetic-6pls.db&lt;/code&gt; to the &lt;code&gt;--extra-options&lt;/code&gt; command to tell Datasette to open the file in immutable mode.&lt;/p&gt;
&lt;p&gt;But this wasn't enough! &lt;code&gt;datasette publish fly&lt;/code&gt; also configures Datasette to automatically open any databases in &lt;code&gt;/data&lt;/code&gt; in read-only mode, so that newly saved database files will be served correctly.&lt;/p&gt;
&lt;p&gt;This meant my instance was loading the same database twice - once in read-only mode and once in immutable mode.&lt;/p&gt;
&lt;p&gt;Rather than fixing the design of &lt;code&gt;datasette-publish-fly&lt;/code&gt;, I went for a cheap workaround. I start Datasette with the following &lt;code&gt;metadata.yml&lt;/code&gt; configuration (simplified):&lt;/p&gt;
&lt;div class="highlight highlight-source-yaml"&gt;&lt;pre&gt;&lt;span class="pl-ent"&gt;databases&lt;/span&gt;:
  &lt;span class="pl-ent"&gt;laion-aesthetic-6pls&lt;/span&gt;:
    &lt;span class="pl-ent"&gt;tables&lt;/span&gt;:
      &lt;span class="pl-ent"&gt;domain&lt;/span&gt;:
        &lt;span class="pl-ent"&gt;label_column&lt;/span&gt;: &lt;span class="pl-s"&gt;domain&lt;/span&gt;
  &lt;span class="pl-ent"&gt;laion-aesthetic-6pls_2&lt;/span&gt;:
    &lt;span class="pl-ent"&gt;allow&lt;/span&gt;: &lt;span class="pl-c1"&gt;false&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This ensures that the &lt;code&gt;laion-aesthetic-6pls&lt;/code&gt; database - the immutable one - is served correctly, and has a label column set for the &lt;code&gt;domain&lt;/code&gt; table too.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;laion-aesthetic-6pls_2&lt;/code&gt; is the second copy of that database, loaded because Datasette spotted it in the &lt;code&gt;/data&lt;/code&gt; directory. Setting &lt;code&gt;allow: false&lt;/code&gt; on it uses Datasette's &lt;a href="https://docs.datasette.io/en/stable/authentication.html#defining-permissions-with-allow-blocks"&gt;permissions framework&lt;/a&gt; to hide that duplicate database from view.&lt;/p&gt;
&lt;p&gt;I'm not proud of these workarounds, and I hope to fix them in the future - but for the moment this is what it took to deploy the project.&lt;/p&gt;
&lt;h4&gt;Scaling it to meet demand&lt;/h4&gt;
&lt;p&gt;I launched the first version of the application on Fly's cheapest instance - 256MB of RAM, costing $1.87/month.&lt;/p&gt;
&lt;p&gt;This worked fine when it was just me and Andy playing with the site, but it started to struggle as traffic started to increase.&lt;/p&gt;
&lt;p&gt;Fly have a "scale app" button which lets you upgrade your instance. I hadn't actually used it before, but I was delighted to find that it worked exactly as expected: I bumped the RAM up to 4GB (not coincidentally the size of the SQLite database file) and the instance restarted within a few seconds with upgraded capacity.&lt;/p&gt;
&lt;p&gt;Fly provide a preconfigured Grafana interface for watching your instances, and it helped me feel confident that the resized instance was happily dealing with the traffic.&lt;/p&gt;
&lt;p&gt;I plan to dial back down to a cheaper instance once interest in the project starts to fade.&lt;/p&gt;
&lt;h4&gt;Got a problem? Throw a search engine at it&lt;/h4&gt;
&lt;p&gt;This is the third time I've used Datasette to build a search engine in the past three weeks! My other two recent projects are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://scotrail.datasette.io/"&gt;scotrail.datasette.io&lt;/a&gt;, described in &lt;a href="https://simonwillison.net/2022/Aug/21/scotrail/"&gt;Analyzing ScotRail audio announcements with Datasette—from prototype to production&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://archive.sfmicrosociety.org"&gt;archive.sfmicrosociety.org&lt;/a&gt;, described in &lt;a href="https://simonwillison.net/2022/Aug/25/sfms-archive/"&gt;Building a searchable archive for the San Francisco Microscopical Society&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The ability to spin up a full search engine for anything that you can stuff into a SQLite database table (which it turns out is almost everything) is a really powerful ability. I plan to write a &lt;a href="https://datasette.io/tutorials"&gt;Datasette tutorial&lt;/a&gt; about this in the future.&lt;/p&gt;
&lt;h4&gt;Releases this week&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette-render-image-tags"&gt;datasette-render-image-tags&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/datasette-render-image-tags/releases/tag/0.1"&gt;0.1&lt;/a&gt; - 2022-09-04
&lt;br /&gt;Turn any URLs ending in .jpg/.png/.gif into img tags with width 200&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette-sitemap"&gt;datasette-sitemap&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/datasette-sitemap/releases/tag/1.0"&gt;1.0&lt;/a&gt; - (&lt;a href="https://github.com/simonw/datasette-sitemap/releases"&gt;3 releases total&lt;/a&gt;) - 2022-08-30
&lt;br /&gt;Generate sitemap.xml for Datasette sites&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette-block-robots"&gt;datasette-block-robots&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/datasette-block-robots/releases/tag/1.1"&gt;1.1&lt;/a&gt; - (&lt;a href="https://github.com/simonw/datasette-block-robots/releases"&gt;6 releases total&lt;/a&gt;) - 2022-08-30
&lt;br /&gt;Datasette plugin that blocks robots and crawlers using robots.txt&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/sqlite-utils"&gt;sqlite-utils&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/sqlite-utils/releases/tag/3.29"&gt;3.29&lt;/a&gt; - (&lt;a href="https://github.com/simonw/sqlite-utils/releases"&gt;103 releases total&lt;/a&gt;) - 2022-08-28
&lt;br /&gt;Python CLI utility and library for manipulating SQLite databases&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;TIL this week&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/sqlite/vacum-disk-full"&gt;SQLite VACUUM: database or disk is full&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/fly/scp"&gt;How to scp files to and from Fly&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/twitter/birdwatch-sqlite"&gt;Loading Twitter Birdwatch into SQLite for analysis with Datasette&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/andy-baio"&gt;andy-baio&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ethics"&gt;ethics&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/fly"&gt;fly&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/search"&gt;search&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/parquet"&gt;parquet&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/weeknotes"&gt;weeknotes&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/laion"&gt;laion&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/stable-diffusion"&gt;stable-diffusion&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sqlite-utils"&gt;sqlite-utils&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/training-data"&gt;training-data&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/text-to-image"&gt;text-to-image&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-ethics"&gt;ai-ethics&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="andy-baio"/><category term="ethics"/><category term="fly"/><category term="search"/><category term="parquet"/><category term="weeknotes"/><category term="laion"/><category term="datasette"/><category term="stable-diffusion"/><category term="sqlite-utils"/><category term="training-data"/><category term="text-to-image"/><category term="ai-ethics"/><category term="ai"/><category term="generative-ai"/></entry></feed>