<?xml version="1.0" encoding="utf-8"?>
<feed xml:lang="en-us" xmlns="http://www.w3.org/2005/Atom"><title>Simon Willison's Weblog: laion</title><link href="http://simonwillison.net/" rel="alternate"/><link href="http://simonwillison.net/tags/laion.atom" rel="self"/><id>http://simonwillison.net/</id><updated>2023-03-28T21:59:27+00:00</updated><author><name>Simon Willison</name></author><entry><title>Announcing Open Flamingo</title><link href="https://simonwillison.net/2023/Mar/28/announcing-open-flamingo/#atom-tag" rel="alternate"/><published>2023-03-28T21:59:27+00:00</published><updated>2023-03-28T21:59:27+00:00</updated><id>https://simonwillison.net/2023/Mar/28/announcing-open-flamingo/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://laion.ai/blog/open-flamingo/"&gt;Announcing Open Flamingo&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
New from LAION: “OpenFlamingo is a framework that enables training and evaluation of large multimodal models (LMMs)”. Multimodal here means it can answer questions about images—their interactive demo includes tools for image captioning, animal recognition, counting objects and visual question answering. They’ve released the OpenFlamingo-9B model built on top of LLaMA 7B and CLIP ViT/L-14—the model checkpoint is a 5.24 GB download from Hugging Face, and is available under a non-commercial research license.

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=35347588"&gt;Hacker News&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llama"&gt;llama&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/laion"&gt;laion&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/clip"&gt;clip&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="generative-ai"/><category term="llama"/><category term="laion"/><category term="llms"/><category term="clip"/></entry><entry><title>Exploring the training data behind Stable Diffusion</title><link href="https://simonwillison.net/2022/Sep/5/laion-aesthetics-weeknotes/#atom-tag" rel="alternate"/><published>2022-09-05T00:18:42+00:00</published><updated>2022-09-05T00:18:42+00:00</updated><id>https://simonwillison.net/2022/Sep/5/laion-aesthetics-weeknotes/#atom-tag</id><summary type="html">
    &lt;p&gt;Two weeks ago, the Stable Diffusion image generation model was &lt;a href="https://stability.ai/blog/stable-diffusion-public-release"&gt;released to the public&lt;/a&gt;. I wrote about this last week, in &lt;a href="https://simonwillison.net/2022/Aug/29/stable-diffusion/"&gt;Stable Diffusion is a really big deal&lt;/a&gt; - a post which has since become one of the top ten results for "stable diffusion" on Google and shown up in all sorts of different places online.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://waxy.org"&gt;Andy Baio&lt;/a&gt; pinged me a week ago on Friday and asked if I'd be interested in collaborating with him on digging into the training data. The Stable Diffusion &lt;a href="https://github.com/CompVis/stable-diffusion/blob/main/Stable_Diffusion_v1_Model_Card.md"&gt;Model Card&lt;/a&gt; provides a detailed description of how the model was trained - primarily on the &lt;a href="https://huggingface.co/datasets/laion/laion2B-en"&gt;LAION 2B-en&lt;/a&gt;) dataset (a subset of &lt;a href="https://laion.ai/blog/laion-5b/"&gt;LAION 5B&lt;/a&gt;), with further emphasis given to images with higher calculated aesthetic scores.&lt;/p&gt;
&lt;p&gt;We ended up deciding to dig into the &lt;a href="https://huggingface.co/datasets/ChristophSchuhmann/improved_aesthetics_6plus"&gt;improved_aesthetics_6plus&lt;/a&gt; subset, which consists of 12 million images with an aesthetics score of 6 or higher.&lt;/p&gt;
&lt;p&gt;This isn't the full training set used for the model, but it's small enough that it fits comfortably in a SQLite database on inexpensive hosting...&lt;/p&gt;
&lt;p&gt;So I built a search engine, powered by &lt;a href="https://datasette.io/"&gt;Datasette&lt;/a&gt;!&lt;/p&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;Update, 20th December 2023:&lt;/strong&gt; This search tool is no longer available.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;You can search for images by keyword using the following interface:&lt;/p&gt;
&lt;p&gt;&lt;a href="https://laion-aesthetic.datasette.io/laion-aesthetic-6pls/images"&gt;laion-aesthetic.datasette.io/laion-aesthetic-6pls/images&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2022/laion-lemur.jpg" alt="Screenshot of the search interface, showing the results for lemur" style="max-width: 100%" /&gt;&lt;/p&gt;

&lt;p&gt;Or see a breakdown of image counts by the domain they were scraped from &lt;a href="https://laion-aesthetic.datasette.io/laion-aesthetic-6pls/domain?_sort_desc=image_counts"&gt;on this page&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The search engine provides access to 12,096,835 rows, and uses SQLite full-text search to power search across their text descriptions.&lt;/p&gt;
&lt;p&gt;Andy used this Datasette instance to conduct a thorough analysis of the underlying training data, which he wrote about in &lt;a href="https://waxy.org/2022/08/exploring-12-million-of-the-images-used-to-train-stable-diffusions-image-generator/"&gt;Exploring 12 Million of the 2.3 Billion Images Used to Train Stable Diffusion’s Image Generator&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;This analysis has had a huge impact! Stories mentioning it made the front page of the websites of both the New York Times and the Washington Post on the same day:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://www.washingtonpost.com/technology/2022/09/02/midjourney-artificial-intelligence-state-fair-colorado/"&gt;He used AI art from Midjourney to win a fine-arts prize. Did he cheat?&lt;/a&gt; - The Washington Post&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.nytimes.com/2022/09/02/technology/ai-artificial-intelligence-artists.html"&gt;An A.I.-Generated Picture Won an Art Prize. Artists Aren’t Happy.&lt;/a&gt; - The New York Times&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Further afield, we spotted coverage from publications that included:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://www.spiegel.de/netzwelt/web/stable-diffusion-verstoerend-gute-ki-kunst-und-jeder-kann-mitmischen-a-0bde407d-c0d5-474a-92fc-de2a99c01774"&gt;Diese Software macht Sie zum KI-Künstler&lt;/a&gt; - Der Spiegel. I get quoted in this one (a translated snippet from my blog at least).&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://gigazine.net/news/20220831-exploring-stable-diffusions/"&gt;23億枚もの画像で構成された画像生成AI「Stable Diffusion」のデータセットのうち1200万枚がどこから入手した画像かを調査した結果が公開される&lt;/a&gt; - Gigazine, a long-running (22 years old) Japanese online news magazine.&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 id="how-i-built-the-database"&gt;How I built the database&lt;/h4&gt;
&lt;p&gt;The code for the Datasette instance can be found in &lt;a href="https://github.com/simonw/laion-aesthetic-datasette"&gt;this GitHub repository&lt;/a&gt;. The &lt;a href="https://github.com/simonw/laion-aesthetic-datasette/issues"&gt;issues&lt;/a&gt; in that repo contain a detailed record of the various steps I took to build the database.&lt;/p&gt;
&lt;p&gt;The data subset I loaded into the search engine is &lt;a href="https://huggingface.co/datasets/ChristophSchuhmann/improved_aesthetics_6plus"&gt;published on Hugging Face&lt;/a&gt; by Christoph Schuhmann. It consists of 7 parquet files, each of which is 325MB and stored in the Hugging Face repository using Git LFS.&lt;/p&gt;
&lt;p&gt;The first step was to fetch that data.&lt;/p&gt;
&lt;p&gt;This was my first time running &lt;code&gt;git lfs&lt;/code&gt; - I had to install it first using:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;brew install git-lfs
git lfs install
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then I cloned the repo and fetched the data like this. Note that to make the actual files available in the directory you need to run both &lt;code&gt;git lfs fetch&lt;/code&gt; and &lt;code&gt;git lfs checkout&lt;/code&gt;:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;git clone https://huggingface.co/datasets/ChristophSchuhmann/improved_aesthetics_6plus
&lt;span class="pl-c1"&gt;cd&lt;/span&gt; improved_aesthetics_6plus
git lfs fetch
git lfs checkout&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The result is 7 parquet files. I wanted to load these into SQLite.&lt;/p&gt;
&lt;p&gt;The first solution I found that worked was to use the &lt;a href="https://pypi.org/project/parquet-tools/"&gt;parquet-tools&lt;/a&gt; Python package:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;pipx install parquet-tools
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I could then convert the parquet data to CSV like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;parquet-tools csv train-00002-of-00007-709151a2715d894d.parquet
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This outputs the contents of the file as CSV.&lt;/p&gt;
&lt;p&gt;Since this is a lot of data it made sense to create an empty SQLite table first (with columns with the correct column types) before inserting the data. I did that like so:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;sqlite3 laion-aesthetic-6pls.db &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;CREATE TABLE IF NOT EXISTS images (&lt;/span&gt;
&lt;span class="pl-s"&gt;   [url] TEXT,&lt;/span&gt;
&lt;span class="pl-s"&gt;   [text] TEXT,&lt;/span&gt;
&lt;span class="pl-s"&gt;   [width] INTEGER,&lt;/span&gt;
&lt;span class="pl-s"&gt;   [height] INTEGER,&lt;/span&gt;
&lt;span class="pl-s"&gt;   [similarity] FLOAT,&lt;/span&gt;
&lt;span class="pl-s"&gt;   [punsafe] FLOAT,&lt;/span&gt;
&lt;span class="pl-s"&gt;   [pwatermark] FLOAT,&lt;/span&gt;
&lt;span class="pl-s"&gt;   [aesthetic] FLOAT,&lt;/span&gt;
&lt;span class="pl-s"&gt;   [hash] TEXT,&lt;/span&gt;
&lt;span class="pl-s"&gt;   [__index_level_0__] INTEGER&lt;/span&gt;
&lt;span class="pl-s"&gt;);&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Then I used a &lt;code&gt;bash&lt;/code&gt; loop to insert all of the data:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;for&lt;/span&gt; &lt;span class="pl-smi"&gt;filename&lt;/span&gt; &lt;span class="pl-k"&gt;in&lt;/span&gt; &lt;span class="pl-k"&gt;*&lt;/span&gt;.parquet&lt;span class="pl-k"&gt;;&lt;/span&gt; &lt;span class="pl-k"&gt;do&lt;/span&gt;
    parquet-tools csv &lt;span class="pl-smi"&gt;$filename&lt;/span&gt; &lt;span class="pl-k"&gt;|&lt;/span&gt; sqlite3 -csv laion-aesthetic-6pls.db &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;.import --skip 1 '|cat -' images&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-k"&gt;done&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This uses the &lt;code&gt;sqlite3&lt;/code&gt; tool's &lt;code&gt;.import&lt;/code&gt; mechanism, because it's &lt;a href="https://til.simonwillison.net/sqlite/import-csv"&gt;really fast&lt;/a&gt;. The &lt;code&gt;--skip 1&lt;/code&gt; option is necessary to skip the first line, which is the CSV column names. The &lt;code&gt;'|cat -'&lt;/code&gt; is the idiom used to tell SQLite to read from standard input.&lt;/p&gt;
&lt;p&gt;This did the job! The result was a SQLite database file, about 3.5GB in size.&lt;/p&gt;
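&lt;p&gt;For anyone who prefers to stay in Python, the same load can be sketched with just the standard library's &lt;code&gt;sqlite3&lt;/code&gt; and &lt;code&gt;csv&lt;/code&gt; modules - the inline sample rows here are invented stand-ins for the parquet-tools CSV output:&lt;/p&gt;

```python
import csv
import io
import sqlite3

# Invented sample rows standing in for "parquet-tools csv ..." output
SAMPLE_CSV = """url,text,width,height,similarity,punsafe,pwatermark,aesthetic,hash,__index_level_0__
https://example.com/a.jpg,A ring tailed lemur,310,310,0.31,0.01,0.02,6.2,abc123,0
https://example.com/b.jpg,A blue ceramic teapot,640,480,0.28,0.0,0.1,6.5,def456,1
"""

conn = sqlite3.connect(":memory:")
# Create the typed table up front, matching the schema above
conn.execute("""
    CREATE TABLE IF NOT EXISTS images (
        [url] TEXT, [text] TEXT, [width] INTEGER, [height] INTEGER,
        [similarity] FLOAT, [punsafe] FLOAT, [pwatermark] FLOAT,
        [aesthetic] FLOAT, [hash] TEXT, [__index_level_0__] INTEGER
    )
""")
reader = csv.reader(io.StringIO(SAMPLE_CSV))
next(reader)  # skip the header row, like --skip 1
conn.executemany(
    "INSERT INTO images VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)", reader
)
conn.commit()
print(conn.execute("SELECT count(*) FROM images").fetchone()[0])
```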
&lt;h4&gt;Enabling search&lt;/h4&gt;
&lt;p&gt;To enable SQLite full-text search against the images, I used &lt;a href="https://sqlite-utils.datasette.io/en/stable/cli.html#configuring-full-text-search"&gt;sqlite-utils enable-fts&lt;/a&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;sqlite-utils enable-fts laion-aesthetic-6pls.db images text
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This took about a minute and a half to run. The resulting database file was around 3.9GB in size - the full text index didn't add as much to the file size as I had expected.&lt;/p&gt;
&lt;p&gt;Best of all, the search was fast! Most search queries took on the order of 20ms to run. My opinion of SQLite FTS keeps improving the more I use it.&lt;/p&gt;
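&lt;p&gt;Roughly speaking, &lt;code&gt;enable-fts&lt;/code&gt; creates an external-content FTS table next to the original one. Here's a minimal sketch of that setup using Python's built-in &lt;code&gt;sqlite3&lt;/code&gt; module, with invented sample rows (the real command handles configuration details and optional triggers for you):&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE images ([url] TEXT, [text] TEXT)")
conn.executemany("INSERT INTO images VALUES (?, ?)", [
    ("https://example.com/a.jpg", "ring tailed lemur sitting on a branch"),
    ("https://example.com/b.jpg", "a blue ceramic teapot"),
])
# enable-fts creates an external-content FTS table alongside the original -
# roughly equivalent to this:
conn.execute("CREATE VIRTUAL TABLE images_fts USING fts5(text, content='images')")
conn.execute("INSERT INTO images_fts (rowid, text) SELECT rowid, text FROM images")
# Full-text query joined back to the source table
urls = [row[0] for row in conn.execute(
    "SELECT images.url FROM images"
    " JOIN images_fts ON images.rowid = images_fts.rowid"
    " WHERE images_fts MATCH ?",
    ("lemur",),
)]
print(urls)
```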
&lt;h4&gt;Extracting domains with sqlite-utils --functions&lt;/h4&gt;
&lt;p&gt;We knew we wanted to count how many images had been scraped from each domain - but we only had the full image URLs:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;https://cdn.idahopotato.com/cache/4075b86c99bc2c46f927f3be5949d161_w310.jpg&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;While walking &lt;a href="https://twitter.com/cleopaws"&gt;Cleo&lt;/a&gt; I had an idea: what if &lt;code&gt;sqlite-utils&lt;/code&gt; made it really easy to register custom SQL functions and use them from the command-line? Then I could use a Python function to extract the domain names.&lt;/p&gt;
&lt;p&gt;This became the impetus for releasing &lt;a href="https://sqlite-utils.datasette.io/en/stable/changelog.html#v3-29"&gt;sqlite-utils 3.29&lt;/a&gt; with a brand new feature: &lt;a href="https://sqlite-utils.datasette.io/en/stable/cli.html#cli-query-functions"&gt;sqlite-utils --functions&lt;/a&gt;, which lets you do exactly that.&lt;/p&gt;
&lt;p&gt;Here's how I used that to extract the domain names from the URLs:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; First, add an empty 'domain' column to the table&lt;/span&gt;
sqlite-utils add-column laion-aesthetic-6pls.db images domain

&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; Now populate it using a custom SQL function:&lt;/span&gt;
sqlite-utils laion-aesthetic-6pls.db &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;update images set domain = domain(url)&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt; \
--functions &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;from urllib.parse import urlparse&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;def domain(url):&lt;/span&gt;
&lt;span class="pl-s"&gt;    return urlparse(url).netloc&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Here we are executing this SQL query against the database:&lt;/p&gt;
&lt;div class="highlight highlight-source-sql"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;update&lt;/span&gt; images &lt;span class="pl-k"&gt;set&lt;/span&gt; domain &lt;span class="pl-k"&gt;=&lt;/span&gt; domain(url)&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Where that &lt;code&gt;domain(url)&lt;/code&gt; function is defined in the Python snippet passed to the &lt;code&gt;--functions&lt;/code&gt; option:&lt;/p&gt;
&lt;pre&gt;&lt;span class="pl-k"&gt;from&lt;/span&gt; &lt;span class="pl-s1"&gt;urllib&lt;/span&gt;.&lt;span class="pl-s1"&gt;parse&lt;/span&gt; &lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-s1"&gt;urlparse&lt;/span&gt;

&lt;span class="pl-k"&gt;def&lt;/span&gt; &lt;span class="pl-en"&gt;domain&lt;/span&gt;(&lt;span class="pl-s1"&gt;url&lt;/span&gt;):
    &lt;span class="pl-k"&gt;return&lt;/span&gt; &lt;span class="pl-en"&gt;urlparse&lt;/span&gt;(&lt;span class="pl-s1"&gt;url&lt;/span&gt;).&lt;span class="pl-s1"&gt;netloc&lt;/span&gt;&lt;/pre&gt;
&lt;p&gt;&lt;code&gt;sqlite-utils&lt;/code&gt; runs &lt;code&gt;exec()&lt;/code&gt; against the code in that block, then loops through any callable objects defined by that code (skipping them if their name starts with an underscore) and registers those as custom SQL functions with SQLite.&lt;/p&gt;
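&lt;p&gt;Here's a simplified sketch of that pattern - not the actual &lt;code&gt;sqlite-utils&lt;/code&gt; implementation - using &lt;code&gt;exec()&lt;/code&gt; and the &lt;code&gt;sqlite3&lt;/code&gt; module's &lt;code&gt;create_function()&lt;/code&gt;:&lt;/p&gt;

```python
import sqlite3

code = """
from urllib.parse import urlparse

def domain(url):
    return urlparse(url).netloc
"""

# A sketch of the pattern (not sqlite-utils' actual implementation): execute
# the snippet, then register every callable it defined - skipping names that
# start with an underscore - as a custom SQL function. Note this simple
# version also registers imported callables like urlparse.
namespace = {}
exec(code, namespace)
conn = sqlite3.connect(":memory:")
for name, value in namespace.items():
    if callable(value) and not name.startswith("_"):
        conn.create_function(name, 1, value)

result = conn.execute(
    "select domain(?)", ("https://cdn.idahopotato.com/cache/pic_w310.jpg",)
).fetchone()[0]
print(result)
```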
&lt;p&gt;I'm really excited about this pattern. I think it makes &lt;code&gt;sqlite-utils&lt;/code&gt; an even more useful tool for running ad-hoc data cleanup and enrichment tasks.&lt;/p&gt;
&lt;h4&gt;Populating the domains table&lt;/h4&gt;
&lt;p&gt;The &lt;code&gt;domain&lt;/code&gt; column in the &lt;code&gt;images&lt;/code&gt; table was now populated, but it was a bit of a verbose column: it duplicated a chunk of text from the existing &lt;code&gt;url&lt;/code&gt;, and was repeated for over 12 million rows.&lt;/p&gt;
&lt;p&gt;The &lt;a href="https://sqlite-utils.datasette.io/en/stable/cli.html#extracting-columns-into-a-separate-table"&gt;sqlite-utils extract&lt;/a&gt; command is designed for this exact use-case. It can extract a column from an existing table out into a separate lookup table, reducing the database size by swapping those duplicate text fields for a much smaller integer foreign key column instead.&lt;/p&gt;
&lt;p&gt;I ran that like so:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;sqlite-utils extract laion-aesthetic-6pls.db images domain&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The result was a new &lt;code&gt;domain&lt;/code&gt; table, and a &lt;code&gt;domain_id&lt;/code&gt; column in the &lt;code&gt;images&lt;/code&gt; table that pointed to records there.&lt;/p&gt;
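&lt;p&gt;Under the hood, the extract operation does something roughly like this sketch (the real command also drops the old text column and handles many more details):&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE images (url TEXT, domain TEXT)")
conn.executemany("INSERT INTO images VALUES (?, ?)", [
    ("https://i.pinimg.com/a.jpg", "i.pinimg.com"),
    ("https://i.pinimg.com/b.jpg", "i.pinimg.com"),
    ("https://cdn.shopify.com/c.jpg", "cdn.shopify.com"),
])
# Approximately what extract does: build a lookup table, then swap the
# duplicated text column for a much smaller integer foreign key
conn.executescript("""
    CREATE TABLE domain (id INTEGER PRIMARY KEY, domain TEXT UNIQUE);
    INSERT INTO domain (domain) SELECT DISTINCT domain FROM images;
    ALTER TABLE images ADD COLUMN domain_id INTEGER REFERENCES domain(id);
    UPDATE images SET domain_id = (
        SELECT id FROM domain WHERE domain.domain = images.domain
    );
""")
print(conn.execute("SELECT count(*) FROM domain").fetchone()[0])
```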
&lt;p&gt;One more step: I didn't want people visiting the site to have to run an expensive group by/count query to see which domains had the most images. So I denormalized that data into the &lt;code&gt;domain&lt;/code&gt; table.&lt;/p&gt;
&lt;p&gt;First I added a new integer column to it, called &lt;code&gt;image_counts&lt;/code&gt;:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;sqlite-utils add-column laion-aesthetic-6pls.db domain image_counts integer&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Then I populated it with a query like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-sql"&gt;&lt;pre&gt;sqlite&lt;span class="pl-k"&gt;-&lt;/span&gt;utils laion&lt;span class="pl-k"&gt;-&lt;/span&gt;aesthetic&lt;span class="pl-k"&gt;-&lt;/span&gt;&lt;span class="pl-c1"&gt;6pls&lt;/span&gt;.&lt;span class="pl-c1"&gt;db&lt;/span&gt; &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;with counts as (&lt;/span&gt;
&lt;span class="pl-s"&gt;  select domain_id, count(*) as c from images group by domain_id&lt;/span&gt;
&lt;span class="pl-s"&gt;)&lt;/span&gt;
&lt;span class="pl-s"&gt;update domain&lt;/span&gt;
&lt;span class="pl-s"&gt;  set image_counts = counts.c&lt;/span&gt;
&lt;span class="pl-s"&gt;  from counts&lt;/span&gt;
&lt;span class="pl-s"&gt;  where id = counts.domain_id&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;I first learned to combine CTEs and SQL updates while working &lt;a href="https://til.simonwillison.net/django/migration-using-cte"&gt;with Django migrations&lt;/a&gt; - I was delighted to see the same trick works for SQLite as well.&lt;/p&gt;
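&lt;p&gt;Here's a runnable miniature of that CTE-plus-update trick, using invented sample data - note that &lt;code&gt;UPDATE ... FROM&lt;/code&gt; needs SQLite 3.33 or later:&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE domain (id INTEGER PRIMARY KEY, domain TEXT, image_counts INTEGER);
    CREATE TABLE images (url TEXT, domain_id INTEGER);
    INSERT INTO domain (id, domain) VALUES (1, 'i.pinimg.com'), (2, 'cdn.shopify.com');
    INSERT INTO images VALUES ('a.jpg', 1), ('b.jpg', 1), ('c.jpg', 2);
""")
# The CTE computes per-domain counts; UPDATE ... FROM joins against it
conn.execute("""
    WITH counts AS (
        SELECT domain_id, count(*) AS c FROM images GROUP BY domain_id
    )
    UPDATE domain
        SET image_counts = counts.c
        FROM counts
        WHERE domain.id = counts.domain_id
""")
print(conn.execute("SELECT domain, image_counts FROM domain ORDER BY id").fetchall())
```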
&lt;p&gt;You can see the result of this query in the &lt;a href="https://laion-aesthetic.datasette.io/laion-aesthetic-6pls/domain?_sort_desc=image_counts"&gt;domain&lt;/a&gt; table. The first five rows look like this:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;id&lt;/th&gt;
&lt;th&gt;domain&lt;/th&gt;
&lt;th&gt;image_counts&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;24&lt;/td&gt;
&lt;td&gt;i.pinimg.com&lt;/td&gt;
&lt;td&gt;1043949&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;render.fineartamerica.com&lt;/td&gt;
&lt;td&gt;601106&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;16&lt;/td&gt;
&lt;td&gt;us.123rf.com&lt;/td&gt;
&lt;td&gt;497244&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;cdn.shopify.com&lt;/td&gt;
&lt;td&gt;241632&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;136&lt;/td&gt;
&lt;td&gt;photos.smugmug.com&lt;/td&gt;
&lt;td&gt;225582&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h4&gt;Doing the same for celebrities, artists, characters&lt;/h4&gt;
&lt;p&gt;We also wanted to provide pre-calculated counts for searches against a number of celebrities, artists and fictional characters - to help give a sense of the kinds of images that were included in the data.&lt;/p&gt;
&lt;p&gt;Andy gathered the ones we wanted to track in &lt;a href="https://docs.google.com/spreadsheets/d/1JLQQ3U6P0d4vDkAGuB8avmXOPIDPeDUdAEZsWTRYpng/edit"&gt;this Google Sheet&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I recently learned that you can export a Google Sheet as CSV using the &lt;code&gt;/export?format=csv&lt;/code&gt; endpoint - and that &lt;code&gt;/export?format=csv&amp;amp;gid=1037423923&lt;/code&gt; targets a specific tab in a multi-tabbed sheet.&lt;/p&gt;
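&lt;p&gt;A tiny helper function (my own, not part of any library) makes that URL pattern explicit:&lt;/p&gt;

```python
from typing import Optional

def sheet_csv_url(sheet_id: str, gid: Optional[str] = None) -> str:
    # Build the CSV export URL for a Google Sheet; gid targets a specific tab
    url = "https://docs.google.com/spreadsheets/d/{}/export?format=csv".format(sheet_id)
    if gid is not None:
        url += "&gid=" + gid
    return url

print(sheet_csv_url("1JLQQ3U6P0d4vDkAGuB8avmXOPIDPeDUdAEZsWTRYpng", "1037423923"))
```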
&lt;p&gt;So I imported Andy's data into SQLite using the following:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;curl -L &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;https://docs.google.com/spreadsheets/d/1JLQQ3U6P0d4vDkAGuB8avmXOPIDPeDUdAEZsWTRYpng/export?format=csv&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt; \
  &lt;span class="pl-k"&gt;|&lt;/span&gt; sqlite-utils insert laion-aesthetic-6pls.db artists - --csv
curl -L &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;https://docs.google.com/spreadsheets/d/1JLQQ3U6P0d4vDkAGuB8avmXOPIDPeDUdAEZsWTRYpng/export?format=csv&amp;amp;gid=1037423923&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt; \
  &lt;span class="pl-k"&gt;|&lt;/span&gt; sqlite-utils insert laion-aesthetic-6pls.db celebrities - --csv
curl -L &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;https://docs.google.com/spreadsheets/d/1JLQQ3U6P0d4vDkAGuB8avmXOPIDPeDUdAEZsWTRYpng/export?format=csv&amp;amp;gid=480391249&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt; \
  &lt;span class="pl-k"&gt;|&lt;/span&gt; sqlite-utils insert laion-aesthetic-6pls.db characters - --csv&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This gave me &lt;code&gt;artists&lt;/code&gt;, &lt;code&gt;celebrities&lt;/code&gt; and &lt;code&gt;characters&lt;/code&gt; tables.&lt;/p&gt;
&lt;p&gt;The next challenge was to run a search query for each row in each of those tables and return the count of results. After some experimentation I found that this one worked:&lt;/p&gt;
&lt;div class="highlight highlight-source-sql"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;select&lt;/span&gt; name, (
  &lt;span class="pl-k"&gt;select&lt;/span&gt; &lt;span class="pl-c1"&gt;count&lt;/span&gt;(&lt;span class="pl-k"&gt;*&lt;/span&gt;) &lt;span class="pl-k"&gt;from&lt;/span&gt; images_fts &lt;span class="pl-k"&gt;where&lt;/span&gt; images_fts match &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;"&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-k"&gt;||&lt;/span&gt; name &lt;span class="pl-k"&gt;||&lt;/span&gt; &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;"&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;
) &lt;span class="pl-k"&gt;as&lt;/span&gt; search_count &lt;span class="pl-k"&gt;from&lt;/span&gt; celebrities &lt;span class="pl-k"&gt;order by&lt;/span&gt; search_count &lt;span class="pl-k"&gt;desc&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Note the &lt;code&gt;match '"' || name || '"'&lt;/code&gt; part - this was necessary to ensure the name was correctly quoted in a way that would avoid names like &lt;code&gt;Dwayne 'The Rock' Johnson&lt;/code&gt; from breaking the search query.&lt;/p&gt;
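&lt;p&gt;Here's a small demonstration of why the quoting matters, using an invented example row - the raw name is an FTS syntax error, while the double-quoted version runs as a phrase query:&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE images_fts USING fts5(text)")
conn.execute(
    "INSERT INTO images_fts VALUES (?)",
    ("Dwayne 'The Rock' Johnson at the premiere",),
)

name = "Dwayne 'The Rock' Johnson"

# Passing the raw name errors out: single quotes are not valid FTS query syntax
try:
    conn.execute("SELECT count(*) FROM images_fts WHERE images_fts MATCH ?", (name,))
    failed = False
except sqlite3.OperationalError:
    failed = True

# Wrapping it in double quotes turns it into a single phrase query
count = conn.execute(
    "SELECT count(*) FROM images_fts WHERE images_fts MATCH ?",
    ('"' + name + '"',),
).fetchone()[0]
print(failed, count)
```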
&lt;p&gt;Now that I had the query I could use that same CTE update trick to populate a counts column in the tables:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;sqlite-utils add-column laion-aesthetic-6pls.db celebrities image_counts integer

sqlite-utils laion-aesthetic-6pls.db &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;$(&lt;/span&gt;cat &lt;span class="pl-s"&gt;&lt;span class="pl-k"&gt;&amp;lt;&amp;lt;&lt;/span&gt;&lt;span class="pl-k"&gt;EOF&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;with counts as (&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;  select name,&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;    (&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;      select count(*) from images_fts where images_fts match '"' || name || '"'&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;    ) as search_count&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;    from celebrities&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;update celebrities&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;  set image_counts = counts.search_count&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;  from counts&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;  where celebrities.name = counts.name&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;&lt;span class="pl-k"&gt;EOF&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;I'm using the &lt;code&gt;cat &amp;lt;&amp;lt;EOF&lt;/code&gt; trick here to avoid having to use shell escaping for the single and double quotes, as described in this TIL: &lt;a href="https://til.simonwillison.net/zsh/argument-heredoc"&gt;Passing command arguments using heredoc syntax&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Here are the finished tables: &lt;a href="https://laion-aesthetic.datasette.io/laion-aesthetic-6pls/characters"&gt;characters&lt;/a&gt;, &lt;a href="https://laion-aesthetic.datasette.io/laion-aesthetic-6pls/celebrities"&gt;celebrities&lt;/a&gt;, &lt;a href="https://laion-aesthetic.datasette.io/laion-aesthetic-6pls/artists"&gt;artists&lt;/a&gt;.&lt;/p&gt;
&lt;h4&gt;Deploying it to Fly&lt;/h4&gt;
&lt;p&gt;At just under 4GB the resulting SQLite database was an awkward size. I often deploy ~1GB databases to Google Cloud Run, but this one was a bit too large for me to feel comfortable with that approach. Cloud Run can also get expensive for projects that attract a great deal of traffic.&lt;/p&gt;
&lt;p&gt;I decided to use &lt;a href="https://fly.io/"&gt;Fly&lt;/a&gt; instead. Fly includes support for mountable volumes, which means it's a great fit for these larger database files.&lt;/p&gt;
&lt;p&gt;I wrote about &lt;a href="https://simonwillison.net/2022/Feb/15/fly-volumes/"&gt;Using SQLite and Datasette with Fly Volumes&lt;/a&gt; back in February, when I added volume support to the &lt;a href="https://datasette.io/plugins/datasette-publish-fly"&gt;datasette-publish-fly&lt;/a&gt; Datasette plugin.&lt;/p&gt;
&lt;p&gt;This was still the largest database I had ever deployed to Fly, and it took a little bit of work to figure out the best way to handle it.&lt;/p&gt;
&lt;p&gt;In the end, I used the following recipe:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;datasette publish fly \
  --app laion-aesthetic \
  --volume-name datasette \
  --install datasette-json-html \
  --extra-options &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;-i /data/laion-aesthetic-6pls.db --inspect-file /data/inspect.json --setting sql_time_limit_ms 10000 --setting suggest_facets 0 --setting allow_download 0&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; \
  -m metadata.yml&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The first time I ran this I used &lt;code&gt;--create-volume 20&lt;/code&gt; to create a 20GB volume called &lt;code&gt;datasette&lt;/code&gt;. I over-provisioned this so I could run commands like &lt;code&gt;sqlite-utils vacuum&lt;/code&gt;, which need twice the amount of space as is taken up by the database file itself.&lt;/p&gt;
&lt;p&gt;I uploaded the database file itself &lt;a href="https://til.simonwillison.net/fly/scp"&gt;using scp&lt;/a&gt;, and ran &lt;code&gt;fly ssh console -a laion-aesthetic&lt;/code&gt; to SSH in and execute other commands such as &lt;code&gt;datasette inspect laion-aesthetic-6pls.db &gt; inspect.json&lt;/code&gt; to create the inspect JSON file.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;--extra-options&lt;/code&gt; deserve explanation.&lt;/p&gt;
&lt;p&gt;Normally when you run &lt;code&gt;datasette publish&lt;/code&gt; the file you pass to the command is automatically deployed using &lt;a href="https://docs.datasette.io/en/stable/performance.html?highlight=immutable#immutable-mode"&gt;immutable mode&lt;/a&gt;. This mode is specifically designed for running read-only databases, and uses optimizations like only counting the rows in the table once on startup (or loading the counts from a pre-prepared &lt;code&gt;inspect.json&lt;/code&gt; file).&lt;/p&gt;
&lt;p&gt;I wanted those optimizations for this project. But &lt;code&gt;datasette publish fly&lt;/code&gt; is currently designed with the assumption that any databases you put in the &lt;code&gt;/data&lt;/code&gt; volume are designed to accept writes, and hence shouldn't be opened in immutable mode.&lt;/p&gt;
&lt;p&gt;I ended up coming up with a horrible hack: I added &lt;code&gt;-i /data/laion-aesthetic-6pls.db&lt;/code&gt; to &lt;code&gt;--extra-options&lt;/code&gt; to tell Datasette to open the file in immutable mode.&lt;/p&gt;
&lt;p&gt;But this wasn't enough! &lt;code&gt;datasette publish fly&lt;/code&gt; also configures Datasette to automatically open any databases in &lt;code&gt;/data&lt;/code&gt; in read-only mode, so that newly saved database files will be served correctly.&lt;/p&gt;
&lt;p&gt;This meant my instance was loading the same database twice - once in read-only mode and once in immutable mode.&lt;/p&gt;
&lt;p&gt;Rather than fixing the design of &lt;code&gt;datasette-publish-fly&lt;/code&gt;, I went for a cheap workaround. I start Datasette with the following &lt;code&gt;metadata.yml&lt;/code&gt; configuration (simplified):&lt;/p&gt;
&lt;div class="highlight highlight-source-yaml"&gt;&lt;pre&gt;&lt;span class="pl-ent"&gt;databases&lt;/span&gt;:
  &lt;span class="pl-ent"&gt;laion-aesthetic-6pls&lt;/span&gt;:
    &lt;span class="pl-ent"&gt;tables&lt;/span&gt;:
      &lt;span class="pl-ent"&gt;domain&lt;/span&gt;:
        &lt;span class="pl-ent"&gt;label_column&lt;/span&gt;: &lt;span class="pl-s"&gt;domain&lt;/span&gt;
  &lt;span class="pl-ent"&gt;laion-aesthetic-6pls_2&lt;/span&gt;:
    &lt;span class="pl-ent"&gt;allow&lt;/span&gt;: &lt;span class="pl-c1"&gt;false&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This ensures that the &lt;code&gt;laion-aesthetic-6pls&lt;/code&gt; database - the immutable one - is served correctly, and has a label column set for the &lt;code&gt;domain&lt;/code&gt; table too.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;laion-aesthetic-6pls_2&lt;/code&gt; is the second copy of that database, loaded because Datasette spotted it in the &lt;code&gt;/data&lt;/code&gt; directory. Setting &lt;code&gt;allow: false&lt;/code&gt; on it uses Datasette's &lt;a href="https://docs.datasette.io/en/stable/authentication.html#defining-permissions-with-allow-blocks"&gt;permissions framework&lt;/a&gt; to hide that duplicate database from view.&lt;/p&gt;
&lt;p&gt;I'm not proud of these workarounds, and I hope to fix them in the future - but for the moment this is what it took to deploy the project.&lt;/p&gt;
&lt;h4&gt;Scaling it to meet demand&lt;/h4&gt;
&lt;p&gt;I launched the first version of the application on Fly's cheapest instance - 256MB of RAM, costing $1.87/month.&lt;/p&gt;
&lt;p&gt;This worked fine when it was just me and Andy playing with the site, but it began to struggle as traffic increased.&lt;/p&gt;
&lt;p&gt;Fly have a "scale app" button which lets you upgrade your instance. I hadn't actually used it before, but I was delighted to find that it worked exactly as expected: I bumped the RAM up to 4GB (not coincidentally the size of the SQLite database file) and the instance restarted within a few seconds with upgraded capacity.&lt;/p&gt;
&lt;p&gt;Fly provide a preconfigured Grafana interface for watching your instances, and it helped me feel confident that the resized instance was happily dealing with the traffic.&lt;/p&gt;
&lt;p&gt;I plan to dial back down to a cheaper instance once interest in the project starts to fade.&lt;/p&gt;
&lt;h4&gt;Got a problem? Throw a search engine at it&lt;/h4&gt;
&lt;p&gt;This is the third time I've used Datasette to build a search engine in the past three weeks! My other two recent projects are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://scotrail.datasette.io/"&gt;scotrail.datasette.io&lt;/a&gt;, described in &lt;a href="https://simonwillison.net/2022/Aug/21/scotrail/"&gt;Analyzing ScotRail audio announcements with Datasette—from prototype to production&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://archive.sfmicrosociety.org"&gt;archive.sfmicrosociety.org&lt;/a&gt;, described in &lt;a href="https://simonwillison.net/2022/Aug/25/sfms-archive/"&gt;Building a searchable archive for the San Francisco Microscopical Society&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The ability to spin up a full search engine for anything that you can stuff into a SQLite database table (which it turns out is almost everything) is remarkably powerful. I plan to write a &lt;a href="https://datasette.io/tutorials"&gt;Datasette tutorial&lt;/a&gt; about this in the future.&lt;/p&gt;
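&lt;p&gt;The core trick is SQLite's built-in FTS5 full-text index - the same mechanism that sqlite-utils and Datasette build on. A minimal sketch (table and rows invented for the example; requires SQLite compiled with FTS5, which is standard in most Python builds):&lt;/p&gt;

```python
import sqlite3

# Any rows you can stuff into a SQLite table can be made searchable
# by loading them into an FTS5 virtual table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE docs USING fts5(title, body)")
conn.executemany(
    "INSERT INTO docs VALUES (?, ?)",
    [
        ("ScotRail announcements", "audio clips from station announcements"),
        ("LAION-Aesthetics", "image and caption records used for training"),
    ],
)
# Query with full-text MATCH, ordered by relevance:
rows = conn.execute(
    "SELECT title FROM docs WHERE docs MATCH ? ORDER BY rank", ("caption",)
).fetchall()
print(rows)  # [('LAION-Aesthetics',)]
```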
&lt;h4&gt;Releases this week&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette-render-image-tags"&gt;datasette-render-image-tags&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/datasette-render-image-tags/releases/tag/0.1"&gt;0.1&lt;/a&gt; - 2022-09-04
&lt;br /&gt;Turn any URLs ending in .jpg/.png/.gif into img tags with width 200&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette-sitemap"&gt;datasette-sitemap&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/datasette-sitemap/releases/tag/1.0"&gt;1.0&lt;/a&gt; - (&lt;a href="https://github.com/simonw/datasette-sitemap/releases"&gt;3 releases total&lt;/a&gt;) - 2022-08-30
&lt;br /&gt;Generate sitemap.xml for Datasette sites&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette-block-robots"&gt;datasette-block-robots&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/datasette-block-robots/releases/tag/1.1"&gt;1.1&lt;/a&gt; - (&lt;a href="https://github.com/simonw/datasette-block-robots/releases"&gt;6 releases total&lt;/a&gt;) - 2022-08-30
&lt;br /&gt;Datasette plugin that blocks robots and crawlers using robots.txt&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/sqlite-utils"&gt;sqlite-utils&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/sqlite-utils/releases/tag/3.29"&gt;3.29&lt;/a&gt; - (&lt;a href="https://github.com/simonw/sqlite-utils/releases"&gt;103 releases total&lt;/a&gt;) - 2022-08-28
&lt;br /&gt;Python CLI utility and library for manipulating SQLite databases&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;TIL this week&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/sqlite/vacum-disk-full"&gt;SQLite VACUUM: database or disk is full&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/fly/scp"&gt;How to scp files to and from Fly&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/twitter/birdwatch-sqlite"&gt;Loading Twitter Birdwatch into SQLite for analysis with Datasette&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/andy-baio"&gt;andy-baio&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ethics"&gt;ethics&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/search"&gt;search&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/parquet"&gt;parquet&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/weeknotes"&gt;weeknotes&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/fly"&gt;fly&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sqlite-utils"&gt;sqlite-utils&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/stable-diffusion"&gt;stable-diffusion&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/laion"&gt;laion&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/training-data"&gt;training-data&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/text-to-image"&gt;text-to-image&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-ethics"&gt;ai-ethics&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="andy-baio"/><category term="ethics"/><category term="search"/><category term="ai"/><category term="datasette"/><category term="parquet"/><category term="weeknotes"/><category term="fly"/><category term="sqlite-utils"/><category term="stable-diffusion"/><category term="generative-ai"/><category term="laion"/><category term="training-data"/><category term="text-to-image"/><category term="ai-ethics"/></entry><entry><title>Exploring 12 Million of the 2.3 Billion Images Used to Train Stable Diffusion’s Image Generator</title><link href="https://simonwillison.net/2022/Aug/31/laion-aesthetics/#atom-tag" rel="alternate"/><published>2022-08-31T02:10:26+00:00</published><updated>2022-08-31T02:10:26+00:00</updated><id>https://simonwillison.net/2022/Aug/31/laion-aesthetics/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://waxy.org/2022/08/exploring-12-million-of-the-images-used-to-train-stable-diffusions-image-generator/"&gt;Exploring 12 Million of the 2.3 Billion Images Used to Train Stable Diffusion’s Image Generator&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Andy Baio and I collaborated on an investigation into the training set used for Stable Diffusion. I built a Datasette instance with 12m image records sourced from the LAION-Aesthetics v2 6+ aesthetic score data used as part of the training process, and built a tool so people could run searches and explore the data. Andy did some extensive analysis of things like the domains scraped for the images and names of celebrities and artists represented in the data. His write-up here explains our project in detail and some of the patterns we’ve uncovered so far.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/machine-learning"&gt;machine-learning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/stable-diffusion"&gt;stable-diffusion&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/laion"&gt;laion&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/training-data"&gt;training-data&lt;/a&gt;&lt;/p&gt;



</summary><category term="machine-learning"/><category term="ai"/><category term="stable-diffusion"/><category term="generative-ai"/><category term="laion"/><category term="training-data"/></entry><entry><title>Stable Diffusion is a really big deal</title><link href="https://simonwillison.net/2022/Aug/29/stable-diffusion/#atom-tag" rel="alternate"/><published>2022-08-29T01:09:04+00:00</published><updated>2022-08-29T01:09:04+00:00</updated><id>https://simonwillison.net/2022/Aug/29/stable-diffusion/#atom-tag</id><summary type="html">
    &lt;p&gt;If you haven't been paying attention to what's going on with Stable Diffusion, you really should be.&lt;/p&gt;
&lt;p&gt;Stable Diffusion is a new "text-to-image diffusion model" that was &lt;a href="https://stability.ai/blog/stable-diffusion-public-release"&gt;released to the public&lt;/a&gt; by &lt;a href="https://stability.ai/"&gt;Stability.ai&lt;/a&gt; six days ago, on August 22nd.&lt;/p&gt;
&lt;p&gt;It's similar to models like OpenAI's DALL-E, but with one crucial difference: they released the whole thing.&lt;/p&gt;
&lt;p&gt;You can try it out online at &lt;a href="https://beta.dreamstudio.ai/"&gt;beta.dreamstudio.ai&lt;/a&gt; (currently for free). Type in a text prompt and the model will generate an image.&lt;/p&gt;
&lt;p&gt;You can download and run the model on your own computer (if you have a powerful enough graphics card). Here's &lt;a href="https://www.reddit.com/r/StableDiffusion/comments/wuyu2u/how_do_i_run_stable_diffusion_and_sharing_faqs/"&gt;an FAQ&lt;/a&gt; on how to do that.&lt;/p&gt;
&lt;p&gt;You can use it for commercial and non-commercial purposes, under the terms of the &lt;a href="https://huggingface.co/spaces/CompVis/stable-diffusion-license"&gt;Creative ML OpenRAIL-M license&lt;/a&gt; - which lists some usage restrictions that include avoiding using it to break applicable laws, generate false information, discriminate against individuals or provide medical advice.&lt;/p&gt;
&lt;p&gt;In just a few days, there has been an &lt;strong&gt;explosion&lt;/strong&gt; of innovation around it. The things people are building are absolutely astonishing.&lt;/p&gt;
&lt;p&gt;I've been tracking the &lt;a href="https://www.reddit.com/r/StableDiffusion/"&gt;r/StableDiffusion&lt;/a&gt; subreddit and following Stability.ai founder &lt;a href="https://twitter.com/EMostaque"&gt;Emad Mostaque&lt;/a&gt; on Twitter.&lt;/p&gt;
&lt;h4&gt;img2img&lt;/h4&gt;
&lt;p&gt;Generating images from text is one thing, but generating images from other images is a whole new ballgame.&lt;/p&gt;
&lt;p&gt;My favourite example so far comes &lt;a href="https://www.reddit.com/r/StableDiffusion/comments/wzlmty/its_some_kind_of_black_magic_i_swear/"&gt;from Reddit user argaman123&lt;/a&gt;. They created this image:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2022/stable-diffusion-img2img.png" alt="A simple looking Microsoft Paint style image made of flat colours: a sky blue background, a rough yellow desert in the foreground, a semi-circle black line representing a half dome over five shapes in two shades of grey representing buildings inside the dome. A yellow circle represents the sun in the top right of the image, above the dome." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;And added this prompt (or "&lt;a href="https://www.reddit.com/r/StableDiffusion/comments/wzlmty/comment/im35vwf/?utm_source=reddit&amp;amp;utm_medium=web2x&amp;amp;context=3"&gt;something along those lines&lt;/a&gt;"):&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;A distant futuristic city full of tall buildings inside a huge transparent glass dome, In the middle of a barren desert full of large dunes, Sun rays, Artstation, Dark sky full of stars with a shiny sun, Massive scale, Fog, Highly detailed, Cinematic, Colorful&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The model produced the following two images:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2022/stable-diffusion-1.jpg" alt="A gorgeous image of a futuristic city under a glass domb, in front of a wind-swept desert. The composition matches the Microsoft Paint input, but everything is rendered in great detail." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2022/stable-diffusion-2.jpg" alt="A second image, similar to the first but this time the domb is composed of triangle sheets of glass. The composition remains the same." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;These are amazing. In my &lt;a href="https://simonwillison.net/2022/Jun/23/dall-e/"&gt;previous experiments with DALL-E&lt;/a&gt; I've tried to recreate photographs I have taken, but getting the exact composition I wanted has always proved impossible using just text. With this new capability I feel like I could get the AI to do pretty much exactly what I have in my mind.&lt;/p&gt;
&lt;p&gt;Imagine having an on-demand concept artist that can generate anything you can imagine, and can iterate with you towards your ideal result. For free (or at least very cheap).&lt;/p&gt;
&lt;p&gt;You can run this today on your own computer, if you can figure out how to set it up. You can &lt;a href="https://replicate.com/stability-ai/stable-diffusion"&gt;try it in your browser&lt;/a&gt; using Replicate, &lt;a href="https://huggingface.co/spaces/huggingface/diffuse-the-rest"&gt;or Hugging Face&lt;/a&gt;. This capability is apparently coming to the DreamStudio interface &lt;a href="https://twitter.com/EMostaque/status/1563632874091421697"&gt;next week&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;There's so much more going on.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://github.com/hlky/stable-diffusion-webui"&gt;stable-diffusion-webui&lt;/a&gt; is an open source UI you can run on your own machine providing a powerful interface to the model. &lt;a href="https://twitter.com/altryne/status/1563452692399214594"&gt;Here's a Twitter thread&lt;/a&gt; showing what it can do.&lt;/p&gt;
&lt;p&gt;Reddit user alpacaAI shared &lt;a href="https://old.reddit.com/r/StableDiffusion/comments/wyduk1/show_rstablediffusion_integrating_sd_in_photoshop/"&gt;a video demo&lt;/a&gt; of a Photoshop plugin they are developing which has to be seen to be believed. They have a registration form up on &lt;a href="https://www.getalpaca.io/"&gt;getalpaca.io&lt;/a&gt; for people who want to try it out once it's ready.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2022/stable-diffusion-alpaca.jpg" alt="A screenshot of Photoshop - a complex image on multiple layers is shown in the background. The user has open a dialog where they have entered the prompt &amp;quot;a dog seating on a path going up in a hill&amp;quot; - with modifiers of &amp;quot;studio ghibli::3&amp;quot;, &amp;quot;highly detailed::1&amp;quot;, &amp;quot;mang anime::1&amp;quot;, &amp;quot;cel-shading::1&amp;quot; and &amp;quot;game characters::1&amp;quot;." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;Reddit user Hoppss ran a 2D animated clip from Disney's Aladdin through &lt;code&gt;img2img&lt;/code&gt; frame-by-frame, using the following parameters:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;--prompt "3D render" --strength 0.15 --seed 82345912 --n_samples 1 --ddim_steps 100 --n_iter 1 --scale 30.0 --skip_grid&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;The result was &lt;a href="https://www.reddit.com/r/StableDiffusion/comments/wys3w5/applying_img2img_to_video_3d_aladdin/"&gt;a 3D animated video&lt;/a&gt;. Not a great quality one, but pretty stunning for a shell script and a two-word prompt!&lt;/p&gt;
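&lt;p&gt;The frame-by-frame approach is easy to sketch: run the same img2img command over every extracted frame, with a fixed seed and a low strength so consecutive frames stay consistent. The script path and frame filenames below are hypothetical - only the flags come from the parameters above:&lt;/p&gt;

```python
# Build one img2img command per frame. A fixed --seed and a low
# --strength (0.15) keep the output consistent between frames.
# "scripts/img2img.py" and the frame paths are made up for the example.
frames = ["frames/0001.png", "frames/0002.png", "frames/0003.png"]

commands = [
    [
        "python", "scripts/img2img.py",
        "--prompt", "3D render",
        "--strength", "0.15",
        "--seed", "82345912",
        "--init-img", frame,
    ]
    for frame in frames
]

for cmd in commands:
    print(" ".join(cmd))
```

Each command could then be handed to `subprocess.run`, and the resulting frames reassembled into a video.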
&lt;p&gt;The best description I've seen so far of an iterative process to build up an image using Stable Diffusion comes from Andy Salerno: &lt;a href="https://andys.page/posts/how-to-draw/"&gt;4.2 Gigabytes, or: How to Draw Anything&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Ben Firshman has published detailed instructions on how to &lt;a href="https://replicate.com/blog/run-stable-diffusion-on-m1-mac"&gt;Run Stable Diffusion on your M1 Mac’s GPU&lt;/a&gt;.&lt;/p&gt;
&lt;h4&gt;And there's so much more to come&lt;/h4&gt;
&lt;p&gt;All of this happened in just six days since the model release. Emad Mostaque &lt;a href="https://twitter.com/EMostaque/status/1564011883572633600"&gt;on Twitter&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;We use as much compute as stable diffusion used every 36 hours for our upcoming open source models&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This made me think of &lt;a href="https://parti.research.google/"&gt;Google's Parti paper&lt;/a&gt;, which included a demonstration that showed that once the model was trained to 200bn parameters it could generate images with correctly spelled text!&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2022/stable-diffusion-parti.jpg" alt="Four images of a kangaroo holding a sign generated by Parti. In the 350M and 750M parameter images the text on the sign is garbage symbols. At 3B parameters it does at least look like words, but is still not correct. At 20B parametecs the sign reads &amp;quot;Welcome friends&amp;quot;." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;h4 id="ai-vegan"&gt;Ethics: will you be an AI vegan?&lt;/h4&gt;
&lt;p&gt;I'm finding the ethics of all of this extremely difficult.&lt;/p&gt;
&lt;p&gt;Stable Diffusion has been trained on millions of copyrighted images scraped from the web.&lt;/p&gt;
&lt;p&gt;The &lt;a href="https://github.com/CompVis/stable-diffusion/blob/69ae4b35e0a0f6ee1af8bb9a5d0016ccb27e36dc/Stable_Diffusion_v1_Model_Card.md"&gt;Stable Diffusion v1 Model Card&lt;/a&gt; has the full details, but the short version is that it uses &lt;a href="https://laion.ai/blog/laion-5b/"&gt;LAION-5B&lt;/a&gt; (5.85 billion image-text pairs) and its &lt;a href="https://laion.ai/blog/laion-aesthetics/"&gt;laion-aesthetics v2 5+&lt;/a&gt; subset (which I think is ~600M pairs filtered for aesthetics). These images were scraped from the web.&lt;/p&gt;
&lt;p&gt;I'm not qualified to speak to the legality of this. I'm personally more concerned with the morality.&lt;/p&gt;
&lt;p&gt;The final model is, I believe, around 4.2GB of data - a binary blob of floating point numbers. The fact that it can compress such an enormous quantity of visual information into such a small space is itself a fascinating detail.&lt;/p&gt;
&lt;p&gt;As such, each image in the training set contributes only a tiny amount of information - a few tweaks to some numeric weights spread across the entire network.&lt;/p&gt;
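&lt;p&gt;A rough back-of-envelope calculation makes the point, using the approximate figures above - a ~4.2GB checkpoint and the ~2.3 billion training images from the headline number:&lt;/p&gt;

```python
# Back-of-envelope: how much model weight does each training image
# account for? Both figures are approximate.
model_bytes = 4.2e9        # ~4.2GB checkpoint
training_images = 2.3e9    # ~2.3 billion images

per_image = model_bytes / training_images
print(round(per_image, 2), "bytes per training image")  # roughly 1.83
```

Under two bytes per image - far too little to store the images themselves.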
&lt;p&gt;But... the people who created these images did not give their consent. And the model can be seen as a direct threat to their livelihoods. No-one expected creative AIs to come for the artist jobs first, but here we are!&lt;/p&gt;
&lt;p&gt;I'm still thinking through this, and I'm eager to consume more commentary about it. But my current mental model is to think about this in terms of veganism, as an analogy for people making their own personal ethical decisions.&lt;/p&gt;
&lt;p&gt;I know many vegans. They have access to the same information as I do about the treatment of animals, and they have made informed decisions about their lifestyle, which I fully respect.&lt;/p&gt;
&lt;p&gt;I myself remain a meat-eater.&lt;/p&gt;
&lt;p&gt;There will be many people who will decide that AI models trained on copyrighted images are incompatible with their values. I understand and respect that decision.&lt;/p&gt;
&lt;p&gt;But when I look at that img2img example of the futuristic city in the dome, I can't resist imagining what I could do with that capability.&lt;/p&gt;
&lt;p&gt;If someone were to create &lt;a href="https://twitter.com/simonw/status/1563201333821288452"&gt;a vegan model&lt;/a&gt;, trained entirely on out-of-copyright images, I would be delighted to promote it and try it out. If its results were good enough, I might even switch to it entirely.&lt;/p&gt;
&lt;h4&gt;Understanding the training data&lt;/h4&gt;
&lt;p&gt;&lt;strong&gt;Update&lt;/strong&gt;: 30th August 2022. Andy Baio and I worked together on a deep dive into the training data behind Stable Diffusion. Andy wrote up some of our findings in &lt;a href="https://waxy.org/2022/08/exploring-12-million-of-the-images-used-to-train-stable-diffusions-image-generator/"&gt;Exploring 12 Million of the 2.3 Billion Images Used to Train Stable Diffusion’s Image Generator&lt;/a&gt;.&lt;/p&gt;
&lt;h4&gt;Indistinguishable from magic&lt;/h4&gt;
&lt;p&gt;Just a few months ago, if I'd seen someone on a fictional TV show using an interface like that Photoshop plugin I'd have grumbled about how that was a step too far even by the standards of American network TV dramas.&lt;/p&gt;
&lt;p&gt;Science fiction is real now. Machine learning generative models are here, and the rate at which they are improving is unreal. It's worth paying real attention to what they can do and how they are developing.&lt;/p&gt;
&lt;p&gt;I'm tweeting about this stuff a lot these days. &lt;a href="https://twitter.com/simonw"&gt;Follow @simonw on Twitter&lt;/a&gt; for more.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ethics"&gt;ethics&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/machine-learning"&gt;machine-learning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/dalle"&gt;dalle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/stable-diffusion"&gt;stable-diffusion&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/prompt-engineering"&gt;prompt-engineering&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/laion"&gt;laion&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/text-to-image"&gt;text-to-image&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-ethics"&gt;ai-ethics&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="ethics"/><category term="machine-learning"/><category term="ai"/><category term="dalle"/><category term="stable-diffusion"/><category term="prompt-engineering"/><category term="generative-ai"/><category term="laion"/><category term="text-to-image"/><category term="ai-ethics"/></entry></feed>