<?xml version="1.0" encoding="utf-8"?>
<feed xml:lang="en-us" xmlns="http://www.w3.org/2005/Atom"><title>Simon Willison's Weblog: Entries</title><link href="http://simonwillison.net/" rel="alternate"/><link href="http://simonwillison.net/atom/entries/" rel="self"/><id>http://simonwillison.net/</id><updated>2026-06-11T23:35:17+00:00</updated><author><name>Simon Willison</name></author><entry><title>Claude Fable is relentlessly proactive</title><link href="https://simonwillison.net/2026/Jun/11/fable-is-relentlessly-proactive/#atom-entries" rel="alternate"/><published>2026-06-11T23:35:17+00:00</published><updated>2026-06-11T23:35:17+00:00</updated><id>https://simonwillison.net/2026/Jun/11/fable-is-relentlessly-proactive/#atom-entries</id><summary type="html">&lt;p&gt;After two days of experience with &lt;a href="https://simonwillison.net/2026/Jun/9/claude-fable-5/"&gt;Claude Fable 5&lt;/a&gt; I think the best way to describe it is &lt;strong&gt;relentlessly proactive&lt;/strong&gt;. It knows a whole lot of tricks and it will deploy pretty much any of them to get to its goal.&lt;/p&gt;
&lt;p&gt;I'll illustrate this with an example. I was hacking on &lt;a href="https://agent.datasette.io/"&gt;Datasette Agent&lt;/a&gt; today when I noticed a glitch: a horizontal scrollbar that shouldn't be there in the jump menu chat prompt. I snapped this screenshot:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2026/jump-to-bug.jpg" alt="Screenshot of a modal dialog demonstrating a scrollbar bug. At the top is a focused search input with blue outline and placeholder &amp;quot;Jump to...&amp;quot;, with an X close button to its right. Below, a heading reads &amp;quot;Start a new agent chat&amp;quot; above a textarea with the placeholder &amp;quot;Ask a question about your data...&amp;quot; — the bug: a thick gray horizontal scrollbar is incorrectly displayed along the bottom edge of the empty textarea, spanning nearly its full width, next to the resize handle. Below the textarea: &amp;quot;Press Enter to start. Shift+Enter adds a new line.&amp;quot; followed by a blue &amp;quot;Start chat&amp;quot; button." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;Then I started a fresh &lt;code&gt;claude&lt;/code&gt; session in my &lt;code&gt;datasette-agent&lt;/code&gt; checkout, dragged in the screenshot and told it:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;Look at dependencies to help figure out why there is a horizontal scrollbar here&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I had a hunch the cause was in a dependency of Datasette Agent (likely Datasette itself) and I knew Fable was good at digging into dependency code, either by inspecting installed files in its own virtual environment's &lt;code&gt;site-packages&lt;/code&gt; directory or by referencing a local checkout on disk. Telling it to start with dependencies felt like a good bet.&lt;/p&gt;
&lt;p&gt;I got distracted by a domestic task and wandered away from my computer.&lt;/p&gt;
&lt;p&gt;When I came back a few minutes later I saw my machine &lt;em&gt;open a browser window&lt;/em&gt; in my regular Firefox and then &lt;em&gt;navigate to the dialog in question&lt;/em&gt;. I had not told Claude Code to use any browser automation, and I was pretty sure it wasn't possible for it to trigger mouse movements or keyboard shortcuts within a window, so how was it doing that?&lt;/p&gt;
&lt;p&gt;I watched in fascination as it continued with its explorations, then saw it open a Safari window instead of Firefox. I also grabbed this snapshot from the Claude terminal:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2026/fable-bash-pyobjc.jpg" alt="Screenshot of two Bash tool calls in a dark terminal interface. First: Bash(open -a Safari /tmp/textarea-scrollbar-test.html &amp;amp;&amp;amp; sleep 4 &amp;amp;&amp;amp; uv run --with pyobjc-framework-Quartz python - &amp;lt;&amp;lt;'EOF' import Quartz wins = Quartz.CGWindowListCopyWindowInfo(Quartz.kCGWindowListOptionOnScreenOnly, Quartz.kCGNullWindowID) for w in wins: if (w.get('kCGWindowOwnerName') or '') == 'Safari' and 'textarea' in (w.get('kCGWindowName') or '').lower(): print(w.get('kCGWindowNumber')) EOF) with output 153551. Second: Bash(screencapture -x -o -l 153551 /tmp/safari-cases.png &amp;amp;&amp;amp; echo ok) with output ok." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;What was it doing there with &lt;code&gt;uv run --with pyobjc-framework-Quartz&lt;/code&gt;?&lt;/p&gt;
&lt;p&gt;It turns out Fable had hacked up its own pattern for taking screenshots of browser windows. It was using Python to iterate through all available windows on my machine, then filtering for Safari windows with expected strings such as &lt;code&gt;"textarea"&lt;/code&gt; in the window name. It used that to find their window number - an integer like 153551 - which it could then use with the &lt;code&gt;screencapture&lt;/code&gt; CLI tool to grab a PNG.&lt;/p&gt;
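&lt;p&gt;Here's a minimal sketch of that window-lookup pattern, reconstructed from the Bash call in the screenshot above. The helper names &lt;code&gt;find_window_numbers&lt;/code&gt; and &lt;code&gt;print_safari_textarea_windows&lt;/code&gt; are illustrative rather than anything Fable wrote, and the Quartz call only works on macOS with &lt;code&gt;pyobjc-framework-Quartz&lt;/code&gt; installed:&lt;/p&gt;
&lt;pre&gt;
```python
def find_window_numbers(windows, owner, title_fragment):
    """Return window numbers for windows owned by `owner` whose title contains `title_fragment`."""
    matches = []
    for w in windows:
        # Either key may be missing or None, so fall back to an empty string
        owner_name = w.get("kCGWindowOwnerName") or ""
        title = (w.get("kCGWindowName") or "").lower()
        if owner_name == owner and title_fragment.lower() in title:
            matches.append(w.get("kCGWindowNumber"))
    return matches


def print_safari_textarea_windows():
    # macOS only: the import is kept inside the function so the
    # filtering logic above can be used and tested on any platform
    import Quartz

    windows = Quartz.CGWindowListCopyWindowInfo(
        Quartz.kCGWindowListOptionOnScreenOnly, Quartz.kCGNullWindowID
    )
    for number in find_window_numbers(windows, "Safari", "textarea"):
        print(number)
```
&lt;/pre&gt;
&lt;p&gt;On macOS, calling &lt;code&gt;print_safari_textarea_windows()&lt;/code&gt; prints the matching window numbers, any of which can be passed to &lt;code&gt;screencapture -x -o -l&lt;/code&gt; as in the second Bash call.&lt;/p&gt;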
&lt;p&gt;OK fine, that's a neat way of taking screenshots. But what was it taking screenshots of?&lt;/p&gt;
&lt;p&gt;Turns out it had been writing its own scratch HTML pages to try and recreate the bug, then opening Safari and grabbing screenshots.&lt;/p&gt;
&lt;p&gt;Here's that &lt;a href="https://static.simonwillison.net/static/2026/textarea-scrollbar-test.html"&gt;/tmp/textarea-scrollbar-test.html&lt;/a&gt; page it created, and the screenshot it took with &lt;code&gt;screencapture -x -o -l 153551 /tmp/safari-cases.png&lt;/code&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2026/safari-cases.jpg" alt="Screenshot of a Safari browser window showing a textarea scrollbar test page at file:///private/tmp/textarea-scrollbar-test.html. Page text reads: scrollbar thickness: 17px | UA: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/26.4 Safari/605.1.15 | devicePixelRatio: 2. Four numbered test cases follow, each with a textarea containing the placeholder &amp;quot;Ask a question about your data...&amp;quot;: 1. Exact plugin CSS (resize: vertical, default overflow), 2. Plugin CSS + overflow-x: hidden, 3. Plugin CSS + resize: none, and 4. Bare default textarea, which is a much smaller box with the placeholder wrapping onto two lines." style="max-width: 100%;" /&gt;
(I have way too many open tabs!)&lt;/p&gt;
&lt;p&gt;OK, so I can see how it's opening test pages and taking screenshots, but how on earth was it triggering the modal dialog that was meant to be under test? That's only available via a click or a keyboard shortcut, and I couldn't see a mechanism for it to run those in Safari.&lt;/p&gt;
&lt;p&gt;I eventually figured out what it had done.&lt;/p&gt;
&lt;p&gt;Claude was running in a folder that contained the source code for the application. It knows enough about &lt;a href="https://datasette.io/"&gt;Datasette&lt;/a&gt; to be able to run a local development server. It turns out it was editing Datasette's own templates to add JavaScript that would trigger the correct keyboard shortcut as soon as the window opened, adding code like this:&lt;/p&gt;
&lt;div class="highlight highlight-text-html-basic"&gt;&lt;pre&gt;&lt;span class="pl-kos"&gt;&amp;lt;&lt;/span&gt;&lt;span class="pl-ent"&gt;script&lt;/span&gt;&lt;span class="pl-kos"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="pl-smi"&gt;window&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;addEventListener&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s"&gt;"load"&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-k"&gt;function&lt;/span&gt; &lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt; &lt;span class="pl-kos"&gt;{&lt;/span&gt;
  &lt;span class="pl-en"&gt;setTimeout&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-k"&gt;function&lt;/span&gt; &lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt; &lt;span class="pl-kos"&gt;{&lt;/span&gt;
    &lt;span class="pl-smi"&gt;document&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;dispatchEvent&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-k"&gt;new&lt;/span&gt; &lt;span class="pl-v"&gt;KeyboardEvent&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s"&gt;"keydown"&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-kos"&gt;{&lt;/span&gt;&lt;span class="pl-c1"&gt;key&lt;/span&gt;: &lt;span class="pl-s"&gt;"/"&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-c1"&gt;bubbles&lt;/span&gt;: &lt;span class="pl-c1"&gt;true&lt;/span&gt;&lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
  &lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-c1"&gt;1200&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
&lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
&lt;span class="pl-kos"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="pl-ent"&gt;script&lt;/span&gt;&lt;span class="pl-kos"&gt;&amp;gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;1.2 seconds after the window opens, this code triggers a simulated &lt;code&gt;/&lt;/code&gt; key, which is the keyboard shortcut for opening the modal dialog.&lt;/p&gt;
&lt;p&gt;There was one challenge left. In order to understand what was going on, Claude needed to run JavaScript on the page to take measurements for itself.&lt;/p&gt;
&lt;p&gt;It wrote its own custom web application to capture information via CORS, then ran that as a local server and opened a page with JavaScript that would POST directly to it!&lt;/p&gt;
&lt;p&gt;Here's the Python web app it wrote, using the standard library &lt;a href="https://docs.python.org/3/library/http.server.html"&gt;http.server&lt;/a&gt; package:&lt;/p&gt;
&lt;pre&gt;&lt;span class="pl-k"&gt;from&lt;/span&gt; &lt;span class="pl-s1"&gt;http&lt;/span&gt;.&lt;span class="pl-s1"&gt;server&lt;/span&gt; &lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-v"&gt;HTTPServer&lt;/span&gt;, &lt;span class="pl-v"&gt;BaseHTTPRequestHandler&lt;/span&gt;

&lt;span class="pl-k"&gt;class&lt;/span&gt; &lt;span class="pl-c1"&gt;H&lt;/span&gt;(&lt;span class="pl-v"&gt;BaseHTTPRequestHandler&lt;/span&gt;):
    &lt;span class="pl-k"&gt;def&lt;/span&gt; &lt;span class="pl-en"&gt;do_POST&lt;/span&gt;(&lt;span class="pl-s1"&gt;self&lt;/span&gt;):
        &lt;span class="pl-s1"&gt;n&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-en"&gt;int&lt;/span&gt;(&lt;span class="pl-s1"&gt;self&lt;/span&gt;.&lt;span class="pl-c1"&gt;headers&lt;/span&gt;.&lt;span class="pl-c1"&gt;get&lt;/span&gt;(&lt;span class="pl-s"&gt;"Content-Length"&lt;/span&gt;, &lt;span class="pl-c1"&gt;0&lt;/span&gt;))
        &lt;span class="pl-en"&gt;open&lt;/span&gt;(&lt;span class="pl-s"&gt;"/tmp/diag.json"&lt;/span&gt;, &lt;span class="pl-s"&gt;"w"&lt;/span&gt;).&lt;span class="pl-c1"&gt;write&lt;/span&gt;(&lt;span class="pl-s1"&gt;self&lt;/span&gt;.&lt;span class="pl-c1"&gt;rfile&lt;/span&gt;.&lt;span class="pl-c1"&gt;read&lt;/span&gt;(&lt;span class="pl-s1"&gt;n&lt;/span&gt;).&lt;span class="pl-c1"&gt;decode&lt;/span&gt;())
        &lt;span class="pl-s1"&gt;self&lt;/span&gt;.&lt;span class="pl-c1"&gt;send_response&lt;/span&gt;(&lt;span class="pl-c1"&gt;200&lt;/span&gt;)
        &lt;span class="pl-s1"&gt;self&lt;/span&gt;.&lt;span class="pl-c1"&gt;send_header&lt;/span&gt;(&lt;span class="pl-s"&gt;"Access-Control-Allow-Origin"&lt;/span&gt;, &lt;span class="pl-s"&gt;"*"&lt;/span&gt;)
        &lt;span class="pl-s1"&gt;self&lt;/span&gt;.&lt;span class="pl-c1"&gt;end_headers&lt;/span&gt;()
    &lt;span class="pl-k"&gt;def&lt;/span&gt; &lt;span class="pl-en"&gt;do_OPTIONS&lt;/span&gt;(&lt;span class="pl-s1"&gt;self&lt;/span&gt;):
        &lt;span class="pl-s1"&gt;self&lt;/span&gt;.&lt;span class="pl-c1"&gt;send_response&lt;/span&gt;(&lt;span class="pl-c1"&gt;200&lt;/span&gt;)
        &lt;span class="pl-s1"&gt;self&lt;/span&gt;.&lt;span class="pl-c1"&gt;send_header&lt;/span&gt;(&lt;span class="pl-s"&gt;"Access-Control-Allow-Origin"&lt;/span&gt;, &lt;span class="pl-s"&gt;"*"&lt;/span&gt;)
        &lt;span class="pl-s1"&gt;self&lt;/span&gt;.&lt;span class="pl-c1"&gt;send_header&lt;/span&gt;(&lt;span class="pl-s"&gt;"Access-Control-Allow-Headers"&lt;/span&gt;, &lt;span class="pl-s"&gt;"*"&lt;/span&gt;)
        &lt;span class="pl-s1"&gt;self&lt;/span&gt;.&lt;span class="pl-c1"&gt;end_headers&lt;/span&gt;()
    &lt;span class="pl-k"&gt;def&lt;/span&gt; &lt;span class="pl-en"&gt;log_message&lt;/span&gt;(&lt;span class="pl-s1"&gt;self&lt;/span&gt;, &lt;span class="pl-c1"&gt;*&lt;/span&gt;&lt;span class="pl-s1"&gt;a&lt;/span&gt;):  &lt;span class="pl-c"&gt;# quiet&lt;/span&gt;
        &lt;span class="pl-k"&gt;pass&lt;/span&gt;

&lt;span class="pl-en"&gt;HTTPServer&lt;/span&gt;((&lt;span class="pl-s"&gt;"127.0.0.1"&lt;/span&gt;, &lt;span class="pl-c1"&gt;9999&lt;/span&gt;), &lt;span class="pl-c1"&gt;H&lt;/span&gt;).&lt;span class="pl-c1"&gt;serve_forever&lt;/span&gt;()&lt;/pre&gt;
&lt;p&gt;All this does is accept a POST request full of JSON and write that to the &lt;code&gt;/tmp/diag.json&lt;/code&gt; file. It sends &lt;code&gt;Access-Control-Allow-Origin: *&lt;/code&gt; headers (including in response to &lt;code&gt;OPTIONS&lt;/code&gt; preflight requests) so that JavaScript running on a different origin, such as a page served by the local Datasette development server, can still communicate back to it.&lt;/p&gt;
&lt;p&gt;Then Claude injected this code into the template that it was loading in a browser:&lt;/p&gt;
&lt;div class="highlight highlight-source-js"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;const&lt;/span&gt; &lt;span class="pl-s1"&gt;host&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-smi"&gt;document&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;querySelector&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s"&gt;"navigation-search"&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
&lt;span class="pl-k"&gt;const&lt;/span&gt; &lt;span class="pl-s1"&gt;ta&lt;/span&gt;   &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;host&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;shadowRoot&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;querySelector&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s"&gt;"textarea"&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
&lt;span class="pl-k"&gt;const&lt;/span&gt; &lt;span class="pl-s1"&gt;cs&lt;/span&gt;   &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-en"&gt;getComputedStyle&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s1"&gt;ta&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
&lt;span class="pl-en"&gt;fetch&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s"&gt;"http://127.0.0.1:9999/diag"&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-kos"&gt;{&lt;/span&gt;
  &lt;span class="pl-c1"&gt;method&lt;/span&gt;: &lt;span class="pl-s"&gt;"POST"&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt;
  &lt;span class="pl-c1"&gt;body&lt;/span&gt;: &lt;span class="pl-c1"&gt;JSON&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;stringify&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-kos"&gt;{&lt;/span&gt;
    &lt;span class="pl-c1"&gt;dpr&lt;/span&gt;: &lt;span class="pl-smi"&gt;window&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;devicePixelRatio&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt;
    &lt;span class="pl-c1"&gt;scrollWidth&lt;/span&gt;: &lt;span class="pl-s1"&gt;ta&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;scrollWidth&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-c1"&gt;clientWidth&lt;/span&gt;: &lt;span class="pl-s1"&gt;ta&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;clientWidth&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt;
    &lt;span class="pl-c1"&gt;whiteSpace&lt;/span&gt;: &lt;span class="pl-s1"&gt;cs&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;whiteSpace&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-c1"&gt;width&lt;/span&gt;: &lt;span class="pl-s1"&gt;cs&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;width&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt;
  &lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt;
&lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This took measurements of the &lt;code&gt;&amp;lt;textarea&amp;gt;&lt;/code&gt; inside the &lt;code&gt;&amp;lt;navigation-search&amp;gt;&lt;/code&gt; Web Component and sent them to the server, which wrote them to a file on disk, which Claude could then read.&lt;/p&gt;
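&lt;p&gt;The read-back step could look something like the sketch below. It assumes the JSON keys from the snippet above. The &lt;code&gt;load_diag&lt;/code&gt; and &lt;code&gt;horizontal_overflow&lt;/code&gt; helper names are hypothetical, and comparing &lt;code&gt;scrollWidth&lt;/code&gt; against &lt;code&gt;clientWidth&lt;/code&gt; is just one plausible check for horizontally overflowing content, not necessarily the one Claude ran:&lt;/p&gt;
&lt;pre&gt;
```python
import json


def load_diag(path="/tmp/diag.json"):
    # The capture server writes the raw POST body to this file
    with open(path) as f:
        return json.load(f)


def horizontal_overflow(diag):
    """Return how many CSS pixels the content overflows the visible width (0 if none)."""
    return max(0, diag["scrollWidth"] - diag["clientWidth"])
```
&lt;/pre&gt;
&lt;p&gt;A non-zero result from &lt;code&gt;horizontal_overflow(load_diag())&lt;/code&gt; would mean the textarea's content is wider than its visible area, which is the condition under which a horizontal scrollbar can appear.&lt;/p&gt;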
&lt;p&gt;Having figured out all of these tricks Fable... hit some invisible guardrail and downgraded itself to Opus. Thankfully Opus had access to the full transcript and could continue using the tricks pioneered by Fable, and shortly afterwards found, tested and verified &lt;a href="https://github.com/datasette/datasette-agent/commit/a75a8b727b42c30ced1fc41dc8add7eb9f04fefe"&gt;the fix&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I prompted Opus to:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;Write a report in /tmp/automation-report.md where you note down all of the tricks you have used in this session to test against real browsers on my computer, include runnable code examples&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;That produced &lt;a href="https://gist.github.com/simonw/aef7f7db9ac992643110a74e43d6d42f"&gt;this report&lt;/a&gt;, which was invaluable for piecing together the details of what had happened for this post.&lt;/p&gt;
&lt;p&gt;I've shared &lt;a href="https://gisthost.github.io/?cc14774f6d37eb67bf089f3ac3925f8f"&gt;the full terminal transcript&lt;/a&gt; of the Claude Code session as well.&lt;/p&gt;
&lt;h4 id="a-review-of-everything-it-did"&gt;A review of everything it did&lt;/h4&gt;
&lt;p&gt;Based on a screenshot and a one-line prompt, Claude Fable 5 + Claude Code:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Figured out the recipe to run the local development server (with fake environment variables needed to get it running)&lt;/li&gt;
&lt;li&gt;Fired up a Playwright Chrome session&lt;/li&gt;
&lt;li&gt;Turned on the visible scrollbars setting for Chrome using &lt;code&gt;defaults write com.google.chrome.for.testing AppleShowScrollBars Always&lt;/code&gt; (it turned that off again later)&lt;/li&gt;
&lt;li&gt;Cycled through Firefox and WebKit in Playwright too, failing to recreate the bug&lt;/li&gt;
&lt;li&gt;Worked out my default browser was Safari&lt;/li&gt;
&lt;li&gt;Built a &lt;code&gt;textarea-scrollbar-test.html&lt;/code&gt; HTML document&lt;/li&gt;
&lt;li&gt;Opened that in real (not Playwright) Firefox&lt;/li&gt;
&lt;li&gt;Found that &lt;code&gt;osascript -e 'tell application "System Events" to tell process "firefox" to id of window 1'&lt;/code&gt; was blocked because "osascript is not allowed assistive access"&lt;/li&gt;
&lt;li&gt;Figured out that &lt;code&gt;uv run --with pyobjc-framework-Quartz python&lt;/code&gt; workaround, described above&lt;/li&gt;
&lt;li&gt;Added JavaScript to the site templates in order to trigger the &lt;code&gt;/&lt;/code&gt; key&lt;/li&gt;
&lt;li&gt;Built its own little Python CORS web server to capture JSON data&lt;/li&gt;
&lt;li&gt;Rewrote the template to capture that data and send it to the server&lt;/li&gt;
&lt;li&gt;Scripted its way through the Web Component shadow DOM to the information it needed&lt;/li&gt;
&lt;li&gt;Opened Safari to confirm the source of the bug&lt;/li&gt;
&lt;li&gt;Modified its custom template to hack in a potential fix&lt;/li&gt;
&lt;li&gt;Confirmed the hacked fix worked&lt;/li&gt;
&lt;li&gt;Reported back on how to fix the problem&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Like I said, relentlessly proactive!&lt;/p&gt;
&lt;h4 id="an-estimate-of-the-cost"&gt;An estimate of the cost&lt;/h4&gt;
&lt;p&gt;I'm currently on the $100/month Claude Max plan, which includes a generous allowance for Fable up until June 22nd, after which Anthropic say they'll start charging full API prices for it.&lt;/p&gt;
&lt;p&gt;I'm using &lt;a href="https://www.agentsview.io"&gt;AgentsView&lt;/a&gt; to track my spending (see &lt;a href="https://til.simonwillison.net/llms/agentsview-custom-model-price"&gt;this TIL&lt;/a&gt;). Here's what AgentsView says this session would have cost me if I were paying full price for it:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;~ % uvx agentsview session usage be8850a7-6119-46a0-b5d6-79c7fff5ae2b
Session:       be8850a7-6119-46a0-b5d6-79c7fff5ae2b
Agent:         claude
Output:        68606
Peak ctx:      113178
Cost:          ~$12.11 (claude-fable-5, claude-opus-4-8)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If you don't keep a close eye on it, Fable will quite happily burn $12 in tokens inventing new ways to debug your CSS.&lt;/p&gt;
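&lt;p&gt;For a rough sense of where that money goes: Fable's list price is $50/million output tokens (Opus costs half that), so the 68,606 output tokens can account for at most about $3.43 of the total. The remainder presumably comes from input-side tokens. Here's a quick sketch of that arithmetic, where treating every output token as Fable-priced is a deliberate over-estimate:&lt;/p&gt;
&lt;pre&gt;
```python
FABLE_OUTPUT_USD_PER_MILLION = 50  # Fable 5 list price for output tokens
OUTPUT_TOKENS = 68606              # "Output" figure reported by AgentsView
REPORTED_COST_USD = 12.11          # "Cost" figure reported by AgentsView

# Upper bound: assume every output token was billed at the Fable rate
max_output_cost = OUTPUT_TOKENS * FABLE_OUTPUT_USD_PER_MILLION / 1_000_000

# Whatever is left over must have come from the input side
min_input_cost = REPORTED_COST_USD - max_output_cost

print(f"output: at most ${max_output_cost:.2f}")
print(f"input:  at least ${min_input_cost:.2f}")
```
&lt;/pre&gt;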
&lt;h4 id="i-really-need-to-lock-this-thing-down"&gt;I really need to lock this thing down&lt;/h4&gt;
&lt;p&gt;On the one hand, watching Fable go to extreme lengths to get the information that it needed to debug what was, in the end, a two-line CSS fix, was &lt;em&gt;fascinating&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;But on the other hand... this is a robust reminder that coding agents can do anything &lt;em&gt;you&lt;/em&gt; can do by typing commands into a terminal - and frontier models know every trick in the book, and evidently a few that nobody has ever written down before.&lt;/p&gt;
&lt;p&gt;If Fable had been acting on malicious instructions - a prompt injection attack hidden in code or an issue thread, or something I'd carelessly pasted into my terminal - it's alarming to think quite how far it could go to exfiltrate data or cause other forms of mischief.&lt;/p&gt;
&lt;p&gt;Running coding agents outside of a sandbox has always been a bad idea - it's my top contender for &lt;a href="https://simonwillison.net/2026/Jan/8/llm-predictions-for-2026/#1-year-a-challenger-disaster-for-coding-agent-security"&gt;a Challenger disaster&lt;/a&gt; incident, as described by Johann Rehberger in &lt;a href="https://embracethered.com/blog/posts/2025/the-normalization-of-deviance-in-ai/"&gt;The Normalization of Deviance in AI&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Fable is arguably smarter and hence more suspicious of potentially malicious instructions. But that smartness is very much a two-edged sword: if it &lt;em&gt;does&lt;/em&gt; get subverted by instructions, the amount of damage it can do given its relentless proactivity is terrifying.&lt;/p&gt;&lt;p&gt;&lt;em&gt;You are only seeing the long-form articles from my blog. Subscribe to &lt;a href="https://simonwillison.net/atom/everything/"&gt;/atom/everything/&lt;/a&gt; to get all of my posts, or take a look at my &lt;a href="https://simonwillison.net/about/#subscribe"&gt;other subscription options&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;</summary><category term="ai"/><category term="prompt-injection"/><category term="generative-ai"/><category term="llms"/><category term="ai-assisted-programming"/><category term="coding-agents"/><category term="claude-code"/><category term="claude-mythos"/></entry><entry><title>Initial impressions of Claude Fable 5</title><link href="https://simonwillison.net/2026/Jun/9/claude-fable-5/#atom-entries" rel="alternate"/><published>2026-06-09T23:59:54+00:00</published><updated>2026-06-09T23:59:54+00:00</updated><id>https://simonwillison.net/2026/Jun/9/claude-fable-5/#atom-entries</id><summary type="html">&lt;p&gt;I didn't have early access to today's &lt;a href="https://www.anthropic.com/news/claude-fable-5-mythos-5"&gt;Claude Fable 5&lt;/a&gt; release, but I've spent the past ~5.5 hours putting it through its paces. My initial impressions are that this is something of a &lt;em&gt;beast&lt;/em&gt;. It's slow, expensive and has been quite happily churning through everything I've thrown at it so far. As is frequently the case with current frontier models the challenge is finding tasks that it can't do.&lt;/p&gt;
&lt;p&gt;First, let's review the key characteristics.&lt;/p&gt;
&lt;p&gt;Anthropic claim that &lt;a href="https://www.anthropic.com/news/claude-fable-5-mythos-5"&gt;Claude Fable 5&lt;/a&gt; offers the same performance as Claude Mythos 5, except with much stricter guardrails in place to prevent it being used for harmful things. Those guardrails trigger often enough that the Claude API has new mechanisms for letting you know when you hit them, and even has a &lt;a href="https://platform.claude.com/docs/en/build-with-claude/refusals-and-fallback"&gt;new option&lt;/a&gt; to request that it fall back to another model automatically if something gets rejected.&lt;/p&gt;
&lt;p&gt;Claude Mythos 5 is out today as well, &lt;a href="https://platform.claude.com/docs/en/about-claude/models/introducing-claude-fable-5-and-claude-mythos-5"&gt;Anthropic say it&lt;/a&gt; "Shares Claude Fable 5's capabilities without the safety classifiers".&lt;/p&gt;
&lt;p&gt;The models have a 1 million token context window, 128,000 maximum output tokens and a knowledge cut-off date of January 2026.&lt;/p&gt;
&lt;p&gt;They cost twice as much as Claude Opus 4.5/4.6/4.7/4.8: $10/million input tokens and $50/million output tokens. There's no increase in price for longer context usage.&lt;/p&gt;
&lt;p&gt;Other than that the &lt;a href="https://platform.claude.com/docs/en/about-claude/models/introducing-claude-fable-5-and-claude-mythos-5"&gt;upgrade guide&lt;/a&gt; is substantially thinner than &lt;a href="https://platform.claude.com/docs/en/about-claude/models/whats-new-claude-4-8"&gt;the similar guide for Opus 4.8&lt;/a&gt;.&lt;/p&gt;
&lt;h4 id="the-big-model-smell"&gt;The big model smell&lt;/h4&gt;
&lt;p&gt;The best way to describe Fable is that it feels &lt;em&gt;big&lt;/em&gt;. Not just in terms of speed and cost, but also in how much it knows.&lt;/p&gt;
&lt;p&gt;Here's an example prompt, which I used to compare Fable's knowledge with Opus 4.8 (typo included):&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;List all of Simon Willion's open source projects, most recent first, each with a rough date of when they were first released&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Opus 4.8's reply &lt;a href="https://gist.github.com/simonw/e77abf645156ade8c5c610f524226dcb#response-1"&gt;started like this&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;I don't have a reliable, comprehensive, and date-verified list of all of Simon Willison's open source projects, and I want to be honest about that rather than risk giving you inaccurate dates or fabricated entries.&lt;/p&gt;
&lt;p&gt;Here's what I can tell you with reasonable confidence:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Well-known projects by Simon Willison:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;LLM&lt;/strong&gt; (~2023) – A command-line tool and Python library for interacting with large language models.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Datasette&lt;/strong&gt; (~2017) – A tool for exploring and publishing data, probably his most prominent recent project.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;sqlite-utils&lt;/strong&gt; (~2018-2019) – A Python library and CLI tool for manipulating SQLite databases, often used alongside Datasette.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Django&lt;/strong&gt; (2005) – He is one of the co-creators of the Django web framework, along with Adrian Holovaty, originally developed at the Lawrence Journal-World newspaper.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;He also maintains a large ecosystem of &lt;strong&gt;Datasette plugins&lt;/strong&gt; and many smaller utilities (often with names like &lt;code&gt;shot-scraper&lt;/code&gt;, various &lt;code&gt;*-to-sqlite&lt;/code&gt; tools for importing data, etc.).&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;That's not bad. Here's &lt;a href="https://gist.github.com/simonw/e77abf645156ade8c5c610f524226dcb#response"&gt;the start of the response&lt;/a&gt; for the exact same prompt from Fable 5 - both models were running without access to search:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;I think you mean &lt;strong&gt;Simon Willison&lt;/strong&gt; — the prolific open source developer, co-creator of Django, and creator of Datasette.&lt;/p&gt;
&lt;p&gt;An important caveat upfront: a truly complete list isn't feasible. Simon Willison is famously prolific — he has published &lt;strong&gt;hundreds&lt;/strong&gt; of repositories (Datasette alone has 100+ plugins, many written by him), and my knowledge has a cutoff date, so I'll be missing his most recent work. Below are his &lt;strong&gt;notable&lt;/strong&gt; projects with approximate initial release dates, most recent first:&lt;/p&gt;
&lt;h2 id="notable-projects-approximate-dates-"&gt;Notable projects (approximate dates)&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;files-to-prompt&lt;/strong&gt; (April 2024) – concatenates files into a single prompt for LLMs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;datasette-extract&lt;/strong&gt; (2024) – plugin for extracting structured data using LLMs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LLM&lt;/strong&gt; (May–June 2023) – CLI tool and Python library for interacting with large language models, plus a large ecosystem of plugins (llm-gpt4all, llm-mistral, llm-claude, etc., 2023 onward)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;symbex&lt;/strong&gt; (June 2023) – search Python code for symbols&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ttok&lt;/strong&gt; and &lt;strong&gt;strip-tags&lt;/strong&gt; (May 2023) – CLI tools for token counting and HTML cleanup for LLM pipelines&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;datasette-lite&lt;/strong&gt; (May 2022) – Datasette running entirely in the browser via WebAssembly/Pyodide&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;shot-scraper&lt;/strong&gt; (March 2022) – automated website screenshots via Playwright&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;s3-credentials&lt;/strong&gt; (November 2021) – CLI for creating scoped AWS S3 credentials&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;django-sql-dashboard&lt;/strong&gt; (2021) – SQL reporting dashboards for Django&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dogsheep suite&lt;/strong&gt; (2019) – personal analytics tools: twitter-to-sqlite, github-to-sqlite, healthkit-to-sqlite, dogsheep-beta, etc.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;sqlite-utils&lt;/strong&gt; (2018) – CLI and Python library for manipulating SQLite databases&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Datasette&lt;/strong&gt; (November 2017) – his flagship project; tool for exploring and publishing data&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;csvs-to-sqlite&lt;/strong&gt; (2017) – convert CSV files to SQLite&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Various early tools&lt;/strong&gt; (~2007–2010) – soupselect, json-head, geocoders, and others&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Django&lt;/strong&gt; (developed 2003–2005, open-sourced July 2005) – co-created with Adrian Holovaty at the Lawrence Journal-World&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;(Here's &lt;a href="https://gist.github.com/simonw/8d05ce836d44bd8543055c9614f7c478"&gt;GPT-5.5 for good measure&lt;/a&gt;. It listed even more projects than Fable did!)&lt;/p&gt;
&lt;p&gt;In the past I've stated that I don't care about how much models &lt;em&gt;know&lt;/em&gt; - I want them to be able to manipulate text and code in useful ways and actively look up the information they need via search tools, not bake it into their weights.&lt;/p&gt;
&lt;p&gt;But knowledge like this is a reasonably good proxy for model size - you can cram a whole lot more details about the world into a larger number of parameters.&lt;/p&gt;
&lt;p&gt;Does &lt;em&gt;knowing more stuff&lt;/em&gt; mean the model is better at the tasks we pose to it? I can certainly imagine how a coding model with deeper knowledge of modern libraries and patterns could crunch through coding tasks more effectively.&lt;/p&gt;
&lt;p&gt;Is Fable really bigger than Opus? Anthropic haven't said anything about model size, so all we have are tea-leaves, but the speed, pricing and my own poking at its knowledge make me think that it's a large model. Maybe the largest yet from any vendor.&lt;/p&gt;
&lt;h4 id="using-fable-in-claude-ai"&gt;Using Fable in Claude.ai&lt;/h4&gt;
&lt;p&gt;Anthropic made Fable 5 available across all of their surfaces - the &lt;a href="https://claude.ai/"&gt;Claude.ai&lt;/a&gt; chat interface, Claude Code for web, Claude Code CLI and Claude Cowork as well. The model is available "until June 22nd" on the subscription plans (I'm on $100/month Max at the moment), after which it will be billed extra.&lt;/p&gt;
&lt;p&gt;Claude.ai is often underestimated. Since &lt;a href="https://simonwillison.net/2025/Sep/9/claude-code-interpreter/"&gt;September 2025&lt;/a&gt; every chat has had access to a full container environment to run code, including the ability to install additional packages and even clone repositories directly from GitHub.&lt;/p&gt;
&lt;p&gt;Last week &lt;a href="https://simonwillison.net/2026/Jun/6/micropython-in-a-sandbox/"&gt;I released micropython-wasm&lt;/a&gt;, a Python library that uses &lt;a href="https://wasmtime.dev/"&gt;wasmtime&lt;/a&gt; to run a custom build of MicroPython in WebAssembly to act as a sandbox for untrusted Python code.&lt;/p&gt;
&lt;p&gt;I decided to see if Fable could upgrade that to running full Python instead. I started with this prompt:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;Clone simonw/micropython-wasm from GitHub and research how this could use a full Python as opposed to MicroPython&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Fable identified that it could use Brett Cannon's &lt;a href="https://github.com/brettcannon/cpython-wasi-build/releases"&gt;cpython-wasi-build&lt;/a&gt; builds for this, but was unable to download them itself due to environment restrictions.&lt;/p&gt;
&lt;p&gt;So I grabbed the two zip files from that page and uploaded them to Claude:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;Here's the Brett Cannon builds&lt;/code&gt; (&lt;code&gt;python-3.zip&lt;/code&gt;, &lt;code&gt;_build-python-3.zip&lt;/code&gt; as attachments)&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;And that was that. It churned away for a few minutes and got the entire thing working.&lt;/p&gt;
&lt;p&gt;Part of the response included:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;I tried the cleaner single-zip-stdlib approach to shrink the filesystem surface, but CPython's &lt;code&gt;getpath&lt;/code&gt; bootstrap fails to find &lt;code&gt;encodings&lt;/code&gt; from inside a zip without more prefix finessing — the directory-preopen approach works reliably, so that's what the PoC uses. The zip path is solvable but needs &lt;code&gt;_PYTHONHOME&lt;/code&gt;/frozen-getpath work.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;So I said:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;Try a bit more at the single-zip-stdlib problem&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Then a little later:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;I want a wheel that has the whole system in it, the Python wrappers and the WASM files and the stdlibrary, so I can do uv run --with path-to-whl python -c "demo code"&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;... and it gave me &lt;a href="https://static.simonwillison.net/static/cors-allow/2026/cpython_wasm-0.1.0-py3-none-any.whl"&gt;this 13.9MB cpython_wasm-0.1.0-py3-none-any.whl&lt;/a&gt; file. You can try running Python code in a sandbox using that wheel URL and &lt;code&gt;uv&lt;/code&gt; like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;uv run --with https://static.simonwillison.net/static/cors-allow/2026/cpython_wasm-0.1.0-py3-none-any.whl \
  cpython-wasm -c &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;print(45 ** 56)&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Here's &lt;a href="https://claude.ai/share/a73b8b8b-8ebc-4fef-9e5c-7438e5e7ae35"&gt;the full chat transcript&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;This was a &lt;em&gt;very&lt;/em&gt; strong start.&lt;/p&gt;
&lt;h4 id="adding-features-to-datasette-agent-and-llm-using-claude-code"&gt;Adding features to Datasette Agent and LLM using Claude Code&lt;/h4&gt;
&lt;p&gt;Before I'd realized it was Fable day, my stretch goal for today was to add a new feature to &lt;a href="https://agent.datasette.io/"&gt;Datasette Agent&lt;/a&gt;: I wanted tool calls within that agent software to gain the ability to pause mid-execution and request approval directly from the user.&lt;/p&gt;
&lt;p&gt;This felt like a suitably meaty task to throw at the new model.&lt;/p&gt;
&lt;p&gt;Over the course of the day Fable not only &lt;a href="https://github.com/datasette/datasette-agent/pull/20"&gt;solved that problem&lt;/a&gt;, it also identified four improvements to my underlying LLM library that would help support this kind of advanced pause-resume mechanism in tool calls, and then implemented them.&lt;/p&gt;
&lt;p&gt;It got everything working first using somewhat gnarly hacks, but the moment I told it that changes to LLM itself were in scope it set to work unraveling the hacks and turning them into supported features of LLM instead.&lt;/p&gt;
&lt;p&gt;My stretch goal turned into &lt;a href="https://llm.datasette.io/en/latest/changelog.html#a3-2026-06-09"&gt;LLM 0.32a3&lt;/a&gt;, almost entirely written by Fable. Here are the release notes:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Driven by the needs of &lt;a href="https://github.com/datasette/datasette-agent"&gt;Datasette Agent&lt;/a&gt;'s human-in-the-loop &lt;code&gt;ask_user()&lt;/code&gt; feature, made the following improvements to how tool calls work:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Tool implementations can declare a parameter named &lt;code&gt;llm_tool_call&lt;/code&gt; in order to be passed the &lt;code&gt;llm.ToolCall&lt;/code&gt; object for the current invocation. This allows them to access the current &lt;code&gt;llm_tool_call.tool_call_id&lt;/code&gt;. See &lt;a href="https://llm.datasette.io/en/latest/python-api.html#python-api-tools-llm-tool-call"&gt;Accessing the tool call from inside a tool&lt;/a&gt;. &lt;a href="https://github.com/simonw/llm/pull/1480"&gt;#1480&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Every tool call is now guaranteed a unique &lt;code&gt;tool_call_id&lt;/code&gt; - providers that do not supply one get a synthesized &lt;code&gt;tc_&lt;/code&gt;-prefixed ULID. &lt;a href="https://github.com/simonw/llm/pull/1481"&gt;#1481&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Tools can raise a &lt;code&gt;llm.PauseChain&lt;/code&gt; exception to cleanly pause the tool chain, useful for things like waiting for human approval. The exception propagates to the caller with &lt;code&gt;.tool_call&lt;/code&gt; and &lt;code&gt;.tool_results&lt;/code&gt; (completed sibling results) attached, and no model call is made with a placeholder result. See &lt;a href="https://llm.datasette.io/en/latest/python-api.html#python-api-tools-pause"&gt;Pausing a chain from inside a tool&lt;/a&gt;. &lt;a href="https://github.com/simonw/llm/pull/1482"&gt;#1482&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Failure semantics for concurrent tool execution: async sibling tool calls always run to completion before a pause or hook exception propagates. &lt;a href="https://github.com/simonw/llm/pull/1482"&gt;#1482&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Chains can now resume from a &lt;code&gt;messages=&lt;/code&gt; history ending in unresolved tool calls: the calls are executed through the normal &lt;code&gt;before_call&lt;/code&gt;/&lt;code&gt;after_call&lt;/code&gt; machinery before the first model call, skipping any that already have results. The &lt;code&gt;execute_tool_calls()&lt;/code&gt; method also accepts a new optional &lt;code&gt;tool_calls_list=&lt;/code&gt; argument for executing an explicit list of &lt;code&gt;ToolCall&lt;/code&gt; objects in place of the calls requested by the response. See &lt;a href="https://llm.datasette.io/en/latest/python-api.html#python-api-tools-resume"&gt;Resuming a chain with pending tool calls&lt;/a&gt;. &lt;a href="https://github.com/simonw/llm/pull/1482"&gt;#1482&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Fixed a bug where the async tool executor silently dropped calls to tools not present in &lt;code&gt;tools=&lt;/code&gt; - these now return &lt;code&gt;Error: tool "..." does not exist&lt;/code&gt; results, matching the sync executor. &lt;a href="https://github.com/simonw/llm/pull/1483"&gt;#1483&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
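&lt;p&gt;The pause-by-exception idea in those notes can be illustrated with a small standalone sketch. This is not the LLM library's code or API: &lt;code&gt;PauseForApproval&lt;/code&gt;, &lt;code&gt;run_tool_calls()&lt;/code&gt; and the dictionary shapes are hypothetical names made up for illustration, and the calls run one after another rather than concurrently. It only shows the general shape: a tool raises, its siblings still finish, and the caller receives the paused call plus the completed results.&lt;/p&gt;

```python
class PauseForApproval(Exception):
    """Hypothetical stand-in exception; not a class from the llm library."""

    def __init__(self, message):
        super().__init__(message)
        self.tool_call = None      # the call that asked to pause
        self.tool_results = []     # results from sibling calls that completed


def run_tool_calls(tools, calls):
    """Run every call in order; if one pauses, finish the rest, then re-raise."""
    results = []
    paused = None
    for call in calls:
        try:
            output = tools[call["name"]](**call["arguments"])
            results.append({"tool_call_id": call["tool_call_id"], "output": output})
        except PauseForApproval as pause:
            if paused is None:  # remember only the first pause
                pause.tool_call = call
                paused = pause
    if paused is not None:
        paused.tool_results = results
        raise paused
    return results


def add(a, b):
    return a + b


def delete_rows(table):
    # A risky tool that refuses to run until a human has approved it.
    raise PauseForApproval("approval needed to delete from " + table)


if __name__ == "__main__":
    tools = {"add": add, "delete_rows": delete_rows}
    calls = [
        {"tool_call_id": "tc_1", "name": "delete_rows", "arguments": {"table": "logs"}},
        {"tool_call_id": "tc_2", "name": "add", "arguments": {"a": 2, "b": 3}},
    ]
    try:
        run_tool_calls(tools, calls)
    except PauseForApproval as pause:
        print("paused on", pause.tool_call["tool_call_id"])
        print("completed siblings:", pause.tool_results)
```

&lt;p&gt;Attaching the completed sibling results to the exception means the caller does not have to re-run them once the user has answered.&lt;/p&gt;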
&lt;p&gt;I'm really impressed with the quality of API design, tests, code and documentation that Fable put together for this. I spent several hours on it today, but it feels like several days' worth of work.&lt;/p&gt;
&lt;h4 id="how-much-i-ve-spent"&gt;How much I've spent&lt;/h4&gt;
&lt;p&gt;I recently started using &lt;a href="https://agentsview.io"&gt;AgentsView&lt;/a&gt; to help track my local LLM usage across all of the different coding agents. I published a &lt;a href="https://til.simonwillison.net/llms/agentsview-custom-model-price"&gt;TIL today&lt;/a&gt; about adding custom Fable pricing to that tool, which I expect will not be necessary in the very near future.&lt;/p&gt;
&lt;p&gt;After setting the price, I ran this command to start a localhost web server to explore my usage:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;uvx agentsview serve
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here's the treemap showing the breakdown of my Fable usage across various projects today:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2026/agentsview-fable-full-day.jpg" alt="Screenshot of a cost tracking dashboard with two panels. The first panel is titled &amp;quot;Cost Attribution&amp;quot; with toggle buttons for Project / Model / Agent and Treemap / List, with Project and Treemap selected. Italic text reads &amp;quot;Click to hide from chart&amp;quot;. A treemap shows a large red block labeled prod_datasette_agent $99.26 89.9%, with smaller blocks to its right labeled cloud (blue), datasette (teal), llm (red), and money (pink), plus a tiny orange sliver. A legend lists: 1 prod_datasette_agent $99.26, 2 cloud $3.98, 3 datasette $2.81, 4 llm $2.30, 5 money $1.92, 6 simon $0.15. The second panel is titled &amp;quot;Top Sessions by Cost&amp;quot; and lists nine sessions, each with a &amp;quot;Claude&amp;quot; badge, a prompt excerpt, a project name with a session UUID (omitted here), a token count, and a cost: 1. Review ./datasette-agent and ./datasette-apps - we are going to add a new feature to agent but you ... prod_datasette_agent, 78.2M, $99.26. 2. issues.db is a copy of the Datasette issues database. There are a LOT of notes in there relating to... datasette, 826.8k, $2.81. 3. Consult fly-docs and then look at datasette.cloud (which launches fly machines) and datasettecloud-... cloud, 924.7k, $2.61. 4. simonwillisonblog.db is a copy of my blog, plus all my software releases and other interesting thin... money, 542.9k, $1.92. 5. Look in datasette.cloud and figure out all remaining steps and decisions that need to be made in or... cloud, 455k, $1.37. 6. Review PRs and issues filed against this repo within the last 4 weeks and see if any deserve to be ... llm, 323.3k, $0.95. 7. run mypy, llm, 320.9k, $0.76. 8. [Image #1] fix this in github actions, llm, 183.9k, $0.59. 9. simon, simon, 26.4k, $0.15." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;I used $110.42 worth of tokens today, all as part of my $100/month subscription.&lt;/p&gt;
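&lt;p&gt;That total is just the sum of the per-project figures in the treemap legend. A quick check, using only the numbers visible in the screenshot (the variable names are mine):&lt;/p&gt;

```python
from decimal import Decimal

# Per-project costs in US dollars, copied from the treemap legend.
costs = {
    "prod_datasette_agent": Decimal("99.26"),
    "cloud": Decimal("3.98"),
    "datasette": Decimal("2.81"),
    "llm": Decimal("2.30"),
    "money": Decimal("1.92"),
    "simon": Decimal("0.15"),
}

total = sum(costs.values())
# Share of the day's spend taken by the largest project, as a percentage.
largest_share = round(costs["prod_datasette_agent"] / total * 100, 1)

if __name__ == "__main__":
    print("total:", total)
    print("prod_datasette_agent share:", largest_share)
```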
&lt;h4 id="and-some-pelicans"&gt;And some pelicans&lt;/h4&gt;
&lt;p&gt;I ran "Generate an SVG of a pelican riding a bicycle" against all five thinking effort levels with Fable.&lt;/p&gt;
&lt;p&gt;Here are &lt;a href="https://tools.simonwillison.net/markdown-svg-renderer#url=https%3A%2F%2Fgist.github.com%2Fsimonw%2F94fde31c34a0400c1d29f57e6a708e6b"&gt;the results&lt;/a&gt;, including the token cost for each one:&lt;/p&gt;

&lt;div style="display: flex; flex-wrap: wrap; gap: 10px; margin-bottom: 1em"&gt;
  &lt;figure style="margin: 0; flex: 1 1 30%;"&gt;
    &lt;img src="https://static.simonwillison.net/static/2026/fable-low.jpg" alt="low" style="width: 100%; height: auto;" /&gt;
    &lt;figcaption style="text-align: center;"&gt;low: &lt;a href="https://www.llm-prices.com/#it=25&amp;amp;ot=1929&amp;amp;sel=claude-fable-5"&gt;1,929 out, 9.67c&lt;/a&gt;&lt;/figcaption&gt;
  &lt;/figure&gt;
  &lt;figure style="margin: 0; flex: 1 1 30%;"&gt;
    &lt;img src="https://static.simonwillison.net/static/2026/fable-medium.jpg" alt="medium" style="width: 100%; height: auto;" /&gt;
    &lt;figcaption style="text-align: center;"&gt;medium: &lt;a href="https://www.llm-prices.com/#it=25&amp;amp;ot=2290&amp;amp;sel=claude-fable-5"&gt;2,290 out, 11.475c&lt;/a&gt;&lt;/figcaption&gt;
  &lt;/figure&gt;
  &lt;figure style="margin: 0; flex: 1 1 30%;"&gt;
    &lt;img src="https://static.simonwillison.net/static/2026/fable-high.jpg" alt="high" style="width: 100%; height: auto;" /&gt;
    &lt;figcaption style="text-align: center;"&gt;high: &lt;a href="https://www.llm-prices.com/#it=25&amp;amp;ot=2057&amp;amp;sel=claude-fable-5"&gt;2,057 out, 10.31c&lt;/a&gt;&lt;/figcaption&gt;
  &lt;/figure&gt;
  &lt;figure style="margin: 0; flex: 1 1 45%;"&gt;
    &lt;img src="https://static.simonwillison.net/static/2026/fable-xhigh.jpg" alt="xhigh" style="width: 100%; height: auto;" /&gt;
    &lt;figcaption style="text-align: center;"&gt;xhigh: &lt;a href="https://www.llm-prices.com/#it=25&amp;amp;ot=5992&amp;amp;sel=claude-fable-5"&gt;5,992 out, 29.985c&lt;/a&gt;&lt;/figcaption&gt;
  &lt;/figure&gt;
  &lt;figure style="margin: 0; flex: 1 1 45%;"&gt;
    &lt;img src="https://static.simonwillison.net/static/2026/fable-max.jpg" alt="max" style="width: 100%; height: auto;" /&gt;
    &lt;figcaption style="text-align: center;"&gt;max: &lt;a href="https://www.llm-prices.com/#it=25&amp;amp;ot=14430&amp;amp;sel=claude-fable-5"&gt;14,430 out, 72.175c&lt;/a&gt;&lt;/figcaption&gt;
  &lt;/figure&gt;
&lt;/div&gt;

&lt;p&gt;It's interesting that high ended up using fewer tokens than medium for this particular run.&lt;/p&gt;
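&lt;p&gt;The cost figures in those captions can be reproduced with a few lines of arithmetic. This sketch assumes rates of $10 per million input tokens and $50 per million output tokens. Those rates are inferred from the captions themselves rather than quoted from a price list, so treat them as an assumption. The 25 input tokens come from the &lt;code&gt;it=25&lt;/code&gt; value in each link.&lt;/p&gt;

```python
from decimal import Decimal

# Assumed rates in US dollars per million tokens (inferred, not official).
INPUT_RATE = Decimal("10")
OUTPUT_RATE = Decimal("50")
INPUT_TOKENS = 25  # the it=25 value in each llm-prices.com link

# Output token counts from the five captions.
OUTPUT_TOKENS = {
    "low": 1929,
    "medium": 2290,
    "high": 2057,
    "xhigh": 5992,
    "max": 14430,
}


def cost_in_cents(input_tokens, output_tokens):
    """Return the cost of one prompt in US cents as a Decimal."""
    dollars = (input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE) / Decimal(1_000_000)
    return dollars * 100


if __name__ == "__main__":
    for level, out in OUTPUT_TOKENS.items():
        print(level, cost_in_cents(INPUT_TOKENS, out))
```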

&lt;p&gt;Here are the &lt;a href="https://simonwillison.net/2026/May/28/claude-opus-4-8/#and-some-pelicans"&gt;Opus 4.8 pelicans&lt;/a&gt; for comparison.&lt;/p&gt;&lt;p&gt;&lt;em&gt;You are only seeing the long-form articles from my blog. Subscribe to &lt;a href="https://simonwillison.net/atom/everything/"&gt;/atom/everything/&lt;/a&gt; to get all of my posts, or take a look at my &lt;a href="https://simonwillison.net/about/#subscribe"&gt;other subscription options&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;</summary><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="anthropic"/><category term="claude"/><category term="llm-pricing"/><category term="pelican-riding-a-bicycle"/><category term="llm-release"/><category term="claude-mythos"/></entry><entry><title>Running Python code in a sandbox with MicroPython and WASM</title><link href="https://simonwillison.net/2026/Jun/6/micropython-in-a-sandbox/#atom-entries" rel="alternate"/><published>2026-06-06T03:53:34+00:00</published><updated>2026-06-06T03:53:34+00:00</updated><id>https://simonwillison.net/2026/Jun/6/micropython-in-a-sandbox/#atom-entries</id><summary type="html">&lt;p&gt;I've been experimenting with different approaches to running code in a sandbox for several years now, but my latest attempt feels like it might finally have all of the characteristics I've been looking for. I've released it as an alpha package called &lt;a href="https://github.com/simonw/micropython-wasm"&gt;micropython-wasm&lt;/a&gt;, and I'm using it for a code execution sandbox plugin for &lt;a href="https://github.com/datasette/datasette-agent"&gt;Datasette Agent&lt;/a&gt; called &lt;a href="https://github.com/datasette/datasette-agent-micropython"&gt;datasette-agent-micropython&lt;/a&gt;.&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2026/Jun/6/micropython-in-a-sandbox/#why-do-i-want-a-sandbox-"&gt;Why do I want a sandbox?&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2026/Jun/6/micropython-in-a-sandbox/#what-i-want-from-a-sandbox"&gt;What I want from a sandbox&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2026/Jun/6/micropython-in-a-sandbox/#webassembly-looks-really-promising-here"&gt;WebAssembly looks really promising here&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2026/Jun/6/micropython-in-a-sandbox/#micropython-in-webassembly"&gt;MicroPython in WebAssembly&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2026/Jun/6/micropython-in-a-sandbox/#building-the-first-version"&gt;Building the first version&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2026/Jun/6/micropython-in-a-sandbox/#try-it-yourself"&gt;Try it yourself&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2026/Jun/6/micropython-in-a-sandbox/#should-you-trust-my-vibe-coded-sandbox-"&gt;Should you trust my vibe-coded sandbox?&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 id="why-do-i-want-a-sandbox-"&gt;Why do I want a sandbox?&lt;/h4&gt;
&lt;p&gt;My key open source projects - &lt;a href="https://datasette.io/"&gt;Datasette&lt;/a&gt;, &lt;a href="https://llm.datasette.io/"&gt;LLM&lt;/a&gt;, even &lt;a href="https://sqlite-utils.datasette.io/"&gt;sqlite-utils&lt;/a&gt; - all support plugins.&lt;/p&gt;
&lt;p&gt;I absolutely love plugins as a mechanism for extending software. A carefully designed plugin system reduces the risk involved in trying new things to almost nothing - even the wildest ideas won't leave a lasting influence on the core application itself. My software can grow a new feature overnight and I don't even have to review a pull request!&lt;/p&gt;
&lt;p&gt;There's one major drawback: my plugin systems all use Python and &lt;a href="https://pluggy.readthedocs.io/en/latest/"&gt;Pluggy&lt;/a&gt;, and plugin code executes with full privileges within my applications. A buggy or malicious plugin could break everything or leak private data.&lt;/p&gt;
&lt;p&gt;I'd love to be able to run plugin-style code in an environment where it is unable to read unapproved files, connect to a network, or generally operate in a way that's risky or harmful to the rest of the application or the user's computer.&lt;/p&gt;
&lt;p&gt;My interest covers more than just plugins. For Datasette in particular there are many features I'd like to support where arbitrary code execution would be useful. I've already experimented with this for &lt;a href="https://enrichments.datasette.io/"&gt;Datasette Enrichments&lt;/a&gt;, where code can be used to transform values stored in a table. I'd love to build a mechanism where you can run code on a schedule that fetches JSON from an approved location, runs a tiny bit of code to reformat it into a list of dictionaries, then inserts those as rows in a SQLite database table.&lt;/p&gt;
&lt;h4 id="what-i-want-from-a-sandbox"&gt;What I want from a sandbox&lt;/h4&gt;
&lt;p&gt;My goal is to execute code safely within my own Python applications. Here's what I need:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Dependencies that &lt;strong&gt;cleanly install from PyPI&lt;/strong&gt;, including binary wheels across multiple platforms if necessary. I don't want people using my software to have to take any extra steps beyond directly installing my Python package.&lt;/li&gt;
&lt;li&gt;Executed code must be subject to both &lt;strong&gt;memory&lt;/strong&gt; and &lt;strong&gt;CPU&lt;/strong&gt; limits. I don't want &lt;code&gt;while True: s += "longer string"&lt;/code&gt; to crash my application or the user's computer.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;File access must be strictly controlled&lt;/strong&gt;. Either no filesystem access at all or I get to define exactly which files can be read and which files can be written to.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Network access is controlled as well&lt;/strong&gt;. Sandboxed code should not be able to communicate with anything without going through a layer I fully control.&lt;/li&gt;
&lt;li&gt;Support for interaction with &lt;strong&gt;host functions&lt;/strong&gt;. A sandbox isn't much use if I can't carefully expose selected platform features to the code that it's running.&lt;/li&gt;
&lt;li&gt;It has to be &lt;strong&gt;robust, supported, and clearly documented&lt;/strong&gt;. I've lost count of the number of sandbox projects I've seen in repos with warnings that they aren't actively maintained!&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 id="webassembly-looks-really-promising-here"&gt;WebAssembly looks really promising here&lt;/h4&gt;
&lt;p&gt;Web browsers operate in the most hostile environment imaginable when it comes to malicious code. Their job is to download &lt;em&gt;and execute&lt;/em&gt; untrusted code from the web on almost every page load.&lt;/p&gt;
&lt;p&gt;Given this, JavaScript engines should be excellent candidates for sandboxes. Sadly those engines are also extremely complicated, and are not designed for easy embedding in other projects. Most of the V8-in-Python projects I've seen are infrequently maintained and come with warnings not to use them with completely untrusted code.&lt;/p&gt;
&lt;p&gt;WebAssembly is a &lt;em&gt;much better&lt;/em&gt; candidate. It was designed from the start to support all of the characteristics I care about and has been tested in browsers for nearly a decade. The &lt;a href="https://pypi.org/project/wasmtime"&gt;wasmtime&lt;/a&gt; Python library brings WASM to Python, is actively maintained, and has binary wheels.&lt;/p&gt;
&lt;h4 id="micropython-in-webassembly"&gt;MicroPython in WebAssembly&lt;/h4&gt;
&lt;p&gt;WebAssembly engines like wasmtime run WebAssembly binaries. Some programming languages like Rust are easy to compile directly to WebAssembly. Dynamic languages like JavaScript and Python are harder - they support language primitives like &lt;code&gt;eval()&lt;/code&gt;, which means they need a full interpreter available at runtime.&lt;/p&gt;
&lt;p&gt;To run Python we need a full Python interpreter compiled to WebAssembly, wired up in a way that makes it easy to feed it code, hook up host functions and access the results.&lt;/p&gt;
&lt;p&gt;Pyodide offers an outstanding package for running Python using WebAssembly in the browser, but using Pyodide in server-side Python isn't supported. The most recent advice I could find was &lt;a href="https://github.com/pyodide/pyodide/discussions/5145"&gt;from October 2024&lt;/a&gt; stating "Pyodide is built by the Emscripten toolchain and can only run in a browser or Node.js".&lt;/p&gt;
&lt;p&gt;The other day I decided to take a look at &lt;a href="https://micropython.org"&gt;MicroPython&lt;/a&gt; as an option for this. The MicroPython site says:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;MicroPython is a lean and efficient implementation of the Python 3 programming language that includes a small subset of the Python standard library and is optimised to run on microcontrollers and in constrained environments.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;WebAssembly sure feels like a constrained environment to me!&lt;/p&gt;
&lt;h4 id="building-the-first-version"&gt;Building the first version&lt;/h4&gt;
&lt;p&gt;I had GPT-5.5 Pro &lt;a href="https://chatgpt.com/share/6a1e2a5c-58b8-8328-ba1c-0e6aadb0a051"&gt;do some research for me&lt;/a&gt;, which turned up &lt;a href="https://github.com/micropython/micropython/pull/13676"&gt;this PR against MicroPython&lt;/a&gt; by &lt;a href="https://github.com/yamt"&gt;YAMAMOTO Takashi&lt;/a&gt; titled "Experimental WASI support for ports/unix".&lt;/p&gt;
&lt;p&gt;It then produced this &lt;a href="https://github.com/simonw/micropython-wasm/blob/c08fbd2276b15dc8c9bdff82845f750971f45647/research.md"&gt;research.md document&lt;/a&gt;, so I let Codex Desktop and GPT-5.5 high &lt;a href="https://gist.github.com/simonw/27461a16d76f28f8619c609444d544fe"&gt;loose on it&lt;/a&gt; to see what would happen:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;read the research.md document and build this. You will probably need to write a script that compiles a custom WASM version of MicroPython as part of this project - fetch the MicroPython code to a /tmp directory for this as part of that script.&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;It worked. I now had a prototype Python library that could execute Python code inside a WebAssembly sandbox!&lt;/p&gt;
&lt;p&gt;The trickiest piece to solve was persistent interpreter state. The WASM build we are using here exposes a single entry point which starts the interpreter, runs the code and then stops the interpreter at the end.&lt;/p&gt;
&lt;p&gt;This works fine for one-off scripts, but for Datasette Agent I want variables and functions to stay resident in memory so I can reuse them across multiple code execution calls.&lt;/p&gt;
&lt;p&gt;A neat thing about working with coding agents is that you can get from an idea to a proof of concept quickly. I prompted:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;For keeping variables resident: what if we ran code inside micropython itself which called a host function get_next_python_code() and then passed that to eval() - and that host function blocked until new code was available, maybe by running in a thread with a queue? Could that or a similar idea help here?&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;After some iteration we got to a version of this that works! In Python code you can now do this:&lt;/p&gt;
&lt;pre&gt;&lt;span class="pl-k"&gt;from&lt;/span&gt; &lt;span class="pl-s1"&gt;micropython_wasm&lt;/span&gt; &lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-v"&gt;MicroPythonSession&lt;/span&gt;

&lt;span class="pl-k"&gt;with&lt;/span&gt; &lt;span class="pl-en"&gt;MicroPythonSession&lt;/span&gt;() &lt;span class="pl-k"&gt;as&lt;/span&gt; &lt;span class="pl-s1"&gt;session&lt;/span&gt;:
    &lt;span class="pl-en"&gt;print&lt;/span&gt;(&lt;span class="pl-s1"&gt;session&lt;/span&gt;.&lt;span class="pl-c1"&gt;run&lt;/span&gt;(&lt;span class="pl-s"&gt;"x = 10&lt;span class="pl-cce"&gt;\n&lt;/span&gt;print(x)"&lt;/span&gt;).&lt;span class="pl-c1"&gt;stdout&lt;/span&gt;)
    &lt;span class="pl-en"&gt;print&lt;/span&gt;(&lt;span class="pl-s1"&gt;session&lt;/span&gt;.&lt;span class="pl-c1"&gt;run&lt;/span&gt;(&lt;span class="pl-s"&gt;"x += 5&lt;span class="pl-cce"&gt;\n&lt;/span&gt;print(x)"&lt;/span&gt;).&lt;span class="pl-c1"&gt;stdout&lt;/span&gt;)
    &lt;span class="pl-en"&gt;print&lt;/span&gt;(&lt;span class="pl-s1"&gt;session&lt;/span&gt;.&lt;span class="pl-c1"&gt;run&lt;/span&gt;(&lt;span class="pl-s"&gt;"print(x * 2)"&lt;/span&gt;).&lt;span class="pl-c1"&gt;stdout&lt;/span&gt;)&lt;/pre&gt;
&lt;p&gt;Under the hood this starts a thread and sets up a request queue. Each &lt;code&gt;session.run()&lt;/code&gt; call sends a message to that queue, then waits on a reply queue for the result of that execution. Inside WASM the MicroPython interpreter blocks waiting for a &lt;code&gt;__session_next__()&lt;/code&gt; host function to return the next block of code, which it runs &lt;code&gt;eval()&lt;/code&gt; on before calling &lt;code&gt;__session_result__({"id": request_id, "ok": True})&lt;/code&gt; when each block has been successfully executed.&lt;/p&gt;
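&lt;p&gt;Here's a rough host-only sketch of that thread-and-queues pattern. It is not the code from &lt;code&gt;micropython-wasm&lt;/code&gt;: &lt;code&gt;ToySession&lt;/code&gt; is a hypothetical name, there is no WebAssembly involved, and it uses plain &lt;code&gt;exec()&lt;/code&gt; in the host process, so it provides no sandboxing at all. It only shows how a worker that blocks on a request queue can keep variables alive between calls.&lt;/p&gt;

```python
import queue
import threading


class ToySession:
    """Hypothetical illustration: one long-lived namespace on a worker thread."""

    def __init__(self):
        self._requests = queue.Queue()
        self._replies = queue.Queue()
        self._worker = threading.Thread(target=self._loop, daemon=True)
        self._worker.start()

    def _loop(self):
        output = []
        # Give the executed code a print() that records text instead of
        # writing to the real stdout.
        namespace = {"print": lambda *args: output.append(" ".join(map(str, args)))}
        while True:
            code = self._requests.get()  # blocks until run() supplies code
            if code is None:  # sentinel sent by close()
                break
            output.clear()
            try:
                exec(code, namespace)  # namespace persists across iterations
                self._replies.put({"ok": True, "stdout": "\n".join(output)})
            except Exception as ex:
                self._replies.put(
                    {"ok": False, "stdout": "\n".join(output), "error": repr(ex)}
                )

    def run(self, code):
        """Send one block of code to the worker and wait for its result."""
        self._requests.put(code)
        return self._replies.get()

    def close(self):
        self._requests.put(None)
        self._worker.join()


if __name__ == "__main__":
    session = ToySession()
    print(session.run("x = 10\nprint(x)")["stdout"])
    print(session.run("x += 5\nprint(x)")["stdout"])
    session.close()
```

&lt;p&gt;Because the loop never returns between requests, the namespace dictionary plays the role that the still-running interpreter plays in the real thing.&lt;/p&gt;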
&lt;p&gt;The other piece of complexity was supporting host functions, so my Python library could selectively expose functions that could then be called by code running in MicroPython.&lt;/p&gt;
&lt;p&gt;Codex ended up solving this with &lt;a href="https://github.com/simonw/micropython-wasm/blob/0.1a1/micropython_wasm/usercmodule/host/hostmodule.c"&gt;78 lines of C&lt;/a&gt;, which ends up compiled into the &lt;a href="https://github.com/simonw/micropython-wasm/blob/0.1a1/micropython_wasm/artifacts/micropython-wasi.wasm"&gt;362KB WebAssembly blob&lt;/a&gt; I'm distributing with the package.&lt;/p&gt;
&lt;p&gt;I am by no means a C programmer, but I've read the C and had two different models explain it to me (here's &lt;a href="https://claude.ai/share/62f74371-cc3c-44f2-b406-33d03513de9e"&gt;Claude's explanation&lt;/a&gt;) and I've subjected it to a barrage of tests.&lt;/p&gt;
&lt;p&gt;The great thing about working with WebAssembly is that if the C turns out to be fatally flawed the worst that can happen is the WebAssembly execution will fail with an exception. I can live with that risk.&lt;/p&gt;
&lt;p&gt;Memory limits are directly supported by wasmtime. CPU limits are a little harder: wasmtime offers a "fuel" concept to limit how many operations a WebAssembly call can execute, and that's the correct fit for this problem, but the units are hard to reason about. I'm experimenting with a 20 million default "fuel" setting now but I'm not confident that it's the most appropriate value.&lt;/p&gt;
&lt;h4 id="try-it-yourself"&gt;Try it yourself&lt;/h4&gt;
&lt;p&gt;The &lt;code&gt;micropython-wasm&lt;/code&gt; alpha is now &lt;a href="https://pypi.org/project/micropython-wasm"&gt;live on PyPI&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;You can try it from your own Python code as &lt;a href="https://github.com/simonw/micropython-wasm"&gt;described in the README&lt;/a&gt;. I've also added a simple CLI mode in &lt;a href="https://github.com/simonw/micropython-wasm/releases/tag/0.1a2"&gt;version 0.1a2&lt;/a&gt; which means you can try it using &lt;code&gt;uvx&lt;/code&gt; without first installing it like so:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;uvx micropython-wasm -c &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;print("Hello world")&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; To see it run out of fuel:&lt;/span&gt;
uvx micropython-wasm -c &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;s = ""; while True: s += "longer"&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; Outputs: micropython-wasm: guest exited with code 1&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;You can also try it in &lt;a href="https://agent.datasette.io/"&gt;Datasette Agent&lt;/a&gt; like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;uvx llm keys &lt;span class="pl-c1"&gt;set&lt;/span&gt; openai
&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; Paste in an OpenAI key, then:&lt;/span&gt;
uvx --with datasette-agent \
  --with datasette-agent-micropython \
  --prerelease allow \
  datasette --internal internal.db \
    -s plugins.datasette-llm.default_model gpt-5.5 \
    --root -o&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Then navigate to &lt;a href="http://127.0.0.1:8001/-/agent"&gt;http://127.0.0.1:8001/-/agent&lt;/a&gt; and run the prompt:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;show me some micropython&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2026/micropython-in-datasette-agent.jpg" style="max-width: 100%" alt="Screenshot of a chat application interface with a dark blue-grey header reading &amp;quot;home&amp;quot; on the left and &amp;quot;root&amp;quot; with a hamburger menu icon on the right. Below is a navigation row with &amp;quot;← Back&amp;quot; and &amp;quot;Chat&amp;quot; on the left and an &amp;quot;EXPORT&amp;quot; button on the right. A blue user message bubble reads &amp;quot;show me some micropython&amp;quot;. Below it a collapsed thinking section reads &amp;quot;▸Thinking: … to show the result clearly. After that, I can wrap up with a brief explanation!&amp;quot; followed by a &amp;quot;▶ Tool: execute_micropython&amp;quot; label. A code block follows: &amp;quot;# A tiny MicroPython example: blink-style logic + Fibonacci&amp;quot; / &amp;quot;def fib(n):&amp;quot; / &amp;quot;    a, b = 0, 1&amp;quot; / &amp;quot;    out = []&amp;quot; / &amp;quot;    for _ in range(n):&amp;quot; / &amp;quot;        out.append(a)&amp;quot; / &amp;quot;        a, b = b, a + b&amp;quot; / &amp;quot;    return out&amp;quot; / &amp;#39;print(&amp;quot;Hello from MicroPython!&amp;quot;)&amp;#39; / &amp;#39;print(&amp;quot;First 10 Fibonacci numbers:&amp;quot;, fib(10))&amp;#39; / &amp;quot;# MicroPython often runs on microcontrollers, e.g.:&amp;quot; / &amp;quot;# from machine import Pin&amp;quot; / &amp;quot;# led = Pin(2, Pin.OUT)&amp;quot; / &amp;quot;# led.value(1)  # turn LED on&amp;quot; / &amp;quot;# led.value(0)  # turn LED off&amp;quot;. Below a horizontal divider is the output: &amp;quot;Hello from MicroPython!&amp;quot; / &amp;quot;First 10 Fibonacci numbers: [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]&amp;quot;, followed by a &amp;quot;▶ Result: execute_micropython&amp;quot; label. At the bottom is a text input field with placeholder &amp;quot;Type a message...&amp;quot; and a blue &amp;quot;Send&amp;quot; button." /&gt;&lt;/p&gt;

&lt;p&gt;You can try a live demo of that plugin running in Datasette Agent by signing into &lt;a href="https://agent.datasette.io"&gt;agent.datasette.io&lt;/a&gt; with your GitHub account.&lt;/p&gt;

&lt;h4 id="should-you-trust-my-vibe-coded-sandbox-"&gt;Should you trust my vibe-coded sandbox?&lt;/h4&gt;
&lt;p&gt;Having complained about immature, loosely-maintained sandboxing libraries, it's deeply ironic that I've now built my own!&lt;/p&gt;
&lt;p&gt;I deliberately slapped an alpha release version on it, and I'm not ready to recommend it to anyone who isn't willing to take a significant risk.&lt;/p&gt;
&lt;p&gt;I've put it through enough testing that I'm OK using it myself. I've shipped my first plugin that uses it, &lt;a href="https://github.com/datasette/datasette-agent-micropython"&gt;datasette-agent-micropython&lt;/a&gt;. I've also locked GPT-5.5 xhigh in that Datasette Agent plugin and &lt;a href="https://gist.github.com/simonw/5de497c44d25f9fd459c8aa2c959fe4a"&gt;challenged it to break out of the sandbox&lt;/a&gt; and so far it has not managed to.&lt;/p&gt;
&lt;p&gt;I'm hoping this implementation can convince some companies with professional security teams and high-stakes problems to commit to using Python in WebAssembly as a sandboxing approach and open source their own solutions.&lt;/p&gt;&lt;p&gt;&lt;em&gt;You are only seeing the long-form articles from my blog. Subscribe to &lt;a href="https://simonwillison.net/atom/everything/"&gt;/atom/everything/&lt;/a&gt; to get all of my posts, or take a look at my &lt;a href="https://simonwillison.net/about/#subscribe"&gt;other subscription options&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;</summary><category term="python"/><category term="sandboxing"/><category term="ai"/><category term="datasette"/><category term="webassembly"/><category term="generative-ai"/><category term="llms"/><category term="ai-assisted-programming"/><category term="codex"/><category term="datasette-agent"/><category term="micropython"/></entry><entry><title>Claude Opus 4.8: "a modest but tangible improvement"</title><link href="https://simonwillison.net/2026/May/28/claude-opus-4-8/#atom-entries" rel="alternate"/><published>2026-05-28T23:59:50+00:00</published><updated>2026-05-28T23:59:50+00:00</updated><id>https://simonwillison.net/2026/May/28/claude-opus-4-8/#atom-entries</id><summary type="html">&lt;p&gt;Anthropic shipped &lt;a href="https://www.anthropic.com/news/claude-opus-4-8"&gt;Claude Opus 4.8&lt;/a&gt; today. My favourite thing about it is this note in the release announcement:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Users will find Opus 4.8 to be a modest but tangible improvement on its predecessor. There’s still more to be done: we’re working on developing and releasing models that provide many of the same capabilities as Opus at a lower cost.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;It's so refreshing to see an AI lab honestly describe a release as a minor incremental improvement over the previous model!&lt;/p&gt;
&lt;p&gt;Honesty seems to be a theme. Here's my other favourite note from that announcement:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;One of the most prominent improvements in Opus 4.8 is its &lt;em&gt;honesty&lt;/em&gt;. We train all our models to be honest - for instance, to avoid making claims that they can't support. But a general problem with AI models is that they sometimes jump to conclusions, confidently claiming to have made progress in their work despite the evidence being thin. Early testers report that Opus 4.8 is more likely to flag uncertainties about its work and less likely to make unsupported claims. This is borne out in &lt;a href="https://www.anthropic.com/claude-opus-4-8-system-card"&gt;our evaluations&lt;/a&gt;, which show that Opus 4.8 is around four times less likely than its predecessor to allow flaws in code it has written to pass unremarked.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;That linked system card includes the following:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Claude Opus 4.8 had the lowest incorrect-rate of the six models on every benchmark—the most direct measure of factual hallucination. It achieved this mainly by abstaining on questions about which it was uncertain rather than by answering more questions correctly.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h4 id="model-characteristics"&gt;Model characteristics&lt;/h4&gt;
&lt;p&gt;Not much has changed since 4.7.&lt;/p&gt;
&lt;p&gt;It's priced the same as Opus 4.5/4.6/4.7 - $5 per million input tokens and $25 per million output tokens. "Fast mode" is twice that price, which is a significant reduction from their previous models - fast mode on 4.6/4.7 remains at $30/$150. Note that &lt;a href="https://platform.claude.com/docs/en/build-with-claude/fast-mode"&gt;fast mode&lt;/a&gt; is only available to organizations that are part of the research preview: "Contact your account manager to request access".&lt;/p&gt;
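&lt;p&gt;To make those prices concrete, here is a small sketch that turns them into a per-request cost. The dictionary keys are labels made up for this example (they are not API model IDs), and the token counts are hypothetical:&lt;/p&gt;

```python
from decimal import Decimal

# USD per million tokens as (input, output), taken from the paragraph above.
PRICES = {
    "opus-4.8": (Decimal("5"), Decimal("25")),
    # Fast mode on 4.8 is described as twice the standard price.
    "opus-4.8-fast": (Decimal("10"), Decimal("50")),
    # Fast mode on 4.6/4.7 remains at $30/$150.
    "opus-4.7-fast": (Decimal("30"), Decimal("150")),
}

MILLION = Decimal("1000000")


def cost_usd(price_key, input_tokens, output_tokens):
    """Return the cost in dollars for one request at the listed prices."""
    input_price, output_price = PRICES[price_key]
    return (input_tokens * input_price + output_tokens * output_price) / MILLION


# Hypothetical request: 10,000 input tokens and 2,000 output tokens.
for key in PRICES:
    print(key, cost_usd(key, 10_000, 2_000))
```

&lt;p&gt;At these rates the older fast mode costs three times as much as fast mode on 4.8 for the same request.&lt;/p&gt;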
&lt;p&gt;Both the reliable knowledge cutoff and the training data cutoff are January 2026, the same as for 4.7.&lt;/p&gt;
&lt;p&gt;The context window is still 1,000,000 tokens, and the max output is 128,000 tokens.&lt;/p&gt;
&lt;p&gt;The &lt;a href="https://platform.claude.com/docs/en/about-claude/models/whats-new-claude-4-8"&gt;What's new in Claude Opus 4.8&lt;/a&gt; document has some of the more interesting details. These caught my eye:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Mid-conversation system messages&lt;/strong&gt;. Claude Opus 4.8 accepts &lt;code&gt;role: "system"&lt;/code&gt; messages immediately after a user turn in the &lt;code&gt;messages&lt;/code&gt; array (subject to &lt;a href="https://platform.claude.com/docs/en/build-with-claude/mid-conversation-system-messages#limitations"&gt;placement rules&lt;/a&gt;). This lets you append updated instructions later in a long-running conversation without restating the full system prompt, which preserves &lt;a href="https://platform.claude.com/docs/en/build-with-claude/prompt-caching"&gt;prompt cache&lt;/a&gt; hits on the earlier turns and reduces input cost on agentic loops.&lt;/p&gt;
&lt;/blockquote&gt;
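&lt;p&gt;Here is a rough sketch of what that &lt;code&gt;messages&lt;/code&gt; array might look like, as plain Python data with no SDK call. The conversation content is invented, &lt;code&gt;system_follows_user&lt;/code&gt; is a hypothetical helper, and it only checks the one rule stated in the quote. The linked placement rules may add constraints that are not modeled here:&lt;/p&gt;

```python
import json

# Hypothetical conversation; only the shape of the list matters here.
messages = [
    {"role": "user", "content": "Summarize the open issues in this repo."},
    {"role": "assistant", "content": "There are three open issues..."},
    {"role": "user", "content": "Now draft a fix for the first one."},
    # New instruction placed immediately after a user turn, as the quote describes.
    {"role": "system", "content": "From now on, only propose changes to files under src/."},
]


def system_follows_user(items):
    """Check that every system message in the list directly follows a user turn."""
    for index, item in enumerate(items):
        if item["role"] != "system":
            continue
        if index == 0 or items[index - 1]["role"] != "user":
            return False
    return True


print(system_follows_user(messages))
print(json.dumps(messages[-1], indent=2))
```

&lt;p&gt;Because the earlier turns are left untouched and the new instruction is appended at the end, the prefix of the conversation stays identical between requests, which is what allows the prompt cache to keep hitting.&lt;/p&gt;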
&lt;p&gt;See also &lt;a href="https://github.com/anthropics/anthropic-sdk-python/commit/2b826760101664ef89db42132932f53ba97c894d#diff-a947c9c02eab58e8ddbe799a11832d533836d242e07c7251997f8543f0981f2f"&gt;this update&lt;/a&gt; to the Anthropic Python SDK. Being able to steer the system prompt mid-conversation sounds really powerful. I was worried this would be incompatible with the abstraction provided by my own &lt;a href="https://llm.datasette.io/en/stable/python-api.html#system-prompts"&gt;LLM library&lt;/a&gt;, which expects a single system prompt per conversation... but it turns out my recent &lt;a href="https://simonwillison.net/2026/Apr/29/llm/"&gt;redesign&lt;/a&gt; should handle that &lt;a href="https://github.com/simonw/llm-anthropic/issues/73"&gt;just fine&lt;/a&gt;.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Lower prompt cache minimum&lt;/strong&gt;. The minimum cacheable prompt length on Claude Opus 4.8 is 1,024 tokens, lower than on Claude Opus 4.7.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I checked and 4.7's minimum &lt;a href="https://platform.claude.com/docs/en/build-with-claude/prompt-caching#cache-limitations"&gt;was 4,096&lt;/a&gt;.&lt;/p&gt;
&lt;h4 id="and-some-pelicans"&gt;And some pelicans&lt;/h4&gt;
&lt;p&gt;Here are &lt;a href="https://tools.simonwillison.net/markdown-svg-renderer#url=https%3A%2F%2Fgist.github.com%2Fsimonw%2Ffea4f7546626d627862dc241a4e3a86a"&gt;pelicans riding bicycles&lt;/a&gt; for all five thinking levels, &lt;code&gt;low&lt;/code&gt;, &lt;code&gt;medium&lt;/code&gt;, &lt;code&gt;high&lt;/code&gt;, &lt;code&gt;xhigh&lt;/code&gt;, and &lt;code&gt;max&lt;/code&gt;:&lt;/p&gt;

&lt;div style="display: grid; grid-template-columns: repeat(6, 1fr); gap: 1rem; max-width: 900px; margin: 0 auto;"&gt;
    &lt;figure style="grid-column: span 2; margin: 0; text-align: center;"&gt;
        &lt;img src="https://static.simonwillison.net/static/2026/claude-opus-4.8-low.png" alt="Flat-style cartoon illustration of a white duck with an orange beak and legs riding a black bicycle, its feet on the pedals, against a blue sky and green grass background." style="width: 100%; height: auto; border: 1px solid #ccc;" /&gt;
        &lt;figcaption style="margin-top: 0.5rem; font-family: system-ui, sans-serif; font-weight: bold;"&gt;
            &lt;a href="https://gist.github.com/simonw/fea4f7546626d627862dc241a4e3a86a#response"&gt;low&lt;/a&gt;
        &lt;/figcaption&gt;
    &lt;/figure&gt;
    &lt;figure style="grid-column: span 2; margin: 0; text-align: center;"&gt;
        &lt;img src="https://static.simonwillison.net/static/2026/claude-opus-4.8-medium.png" alt="Flat-style illustration of a white egret or heron with an orange beak and legs riding a black bicycle, against a blue sky and green grass background." style="width: 100%; height: auto; border: 1px solid #ccc;" /&gt;
        &lt;figcaption style="margin-top: 0.5rem; font-family: system-ui, sans-serif; font-weight: bold;"&gt;
            &lt;a href="https://gist.github.com/simonw/fea4f7546626d627862dc241a4e3a86a#response-1"&gt;medium&lt;/a&gt;
        &lt;/figcaption&gt;
    &lt;/figure&gt;
    &lt;figure style="grid-column: span 2; margin: 0; text-align: center;"&gt;
        &lt;img src="https://static.simonwillison.net/static/2026/claude-opus-4.8-high.png" alt="Cartoon illustration of a white duck with an orange beak riding a black bicycle, against a light blue sky with a pale yellow sun in the upper left and a green ground line at the bottom." style="width: 100%; height: auto; border: 1px solid #ccc;" /&gt;
        &lt;figcaption style="margin-top: 0.5rem; font-family: system-ui, sans-serif; font-weight: bold;"&gt;
            &lt;a href="https://gist.github.com/simonw/fea4f7546626d627862dc241a4e3a86a#response-2"&gt;high&lt;/a&gt;
        &lt;/figcaption&gt;
    &lt;/figure&gt;
    &lt;figure style="grid-column: span 3; margin: 0; text-align: center;"&gt;
        &lt;img src="https://static.simonwillison.net/static/2026/claude-opus-4.8-xhigh.png" alt="Cartoon illustration of a white pelican with an orange beak riding a black bicycle, its orange legs extending down to the pedals, against a blue sky with a yellow sun and green ground." style="width: 100%; height: auto; border: 1px solid #ccc;" /&gt;
        &lt;figcaption style="margin-top: 0.5rem; font-family: system-ui, sans-serif; font-weight: bold;"&gt;
            &lt;a href="https://gist.github.com/simonw/fea4f7546626d627862dc241a4e3a86a#response-3"&gt;xhigh&lt;/a&gt;
        &lt;/figcaption&gt;
    &lt;/figure&gt;
    &lt;figure style="grid-column: span 3; margin: 0; text-align: center;"&gt;
        &lt;img src="https://static.simonwillison.net/static/2026/claude-opus-4.8-max.png" alt="Cartoon illustration of a white pelican with an orange beak riding a red bicycle on green grass, against a light blue sky with a fluffy white cloud and a yellow sun." style="width: 100%; height: auto; border: 1px solid #ccc;" /&gt;
        &lt;figcaption style="margin-top: 0.5rem; font-family: system-ui, sans-serif; font-weight: bold;"&gt;&lt;a href="https://gist.github.com/simonw/fea4f7546626d627862dc241a4e3a86a#response-4"&gt;max&lt;/a&gt;&lt;/figcaption&gt;
    &lt;/figure&gt;
&lt;/div&gt;


&lt;p&gt;This time I ran them using the &lt;a href="https://llm.datasette.io/en/stable/usage.html"&gt;LLM CLI&lt;/a&gt;, exported the logs to Markdown and then had Claude Opus 4.8 &lt;a href="https://github.com/simonw/tools/commit/71e4944766b577a327ff048cc63b739ba4cbade9"&gt;build me&lt;/a&gt; an HTML tool that could render that Markdown with the &lt;code&gt;svg&lt;/code&gt; fenced code blocks displayed as SVGs on the page.&lt;/p&gt;

&lt;p&gt;(I later had GPT-5.5 xhigh in Codex &lt;a href="https://gist.github.com/simonw/bb5a267f8144dfe4e92e50a014e49e98"&gt;update that code&lt;/a&gt; to remove any XSS holes. I'm sure Claude could have done that if I'd asked, but GPT-5.5 is my code security blanket at the moment.)&lt;/p&gt;

&lt;p&gt;The max one was clearly the best, but it did take 25 input and 17,167 output tokens for a total cost of &lt;a href="https://www.llm-prices.com/#it=25&amp;amp;ot=17167&amp;amp;ic=5&amp;amp;oc=25&amp;amp;sel=claude-opus-4-5"&gt;43 cents&lt;/a&gt;!&lt;/p&gt;</summary><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="anthropic"/><category term="claude"/><category term="pelican-riding-a-bicycle"/><category term="llm-release"/></entry><entry><title>I think Anthropic and OpenAI have found product-market fit</title><link href="https://simonwillison.net/2026/May/27/product-market-fit/#atom-entries" rel="alternate"/><published>2026-05-27T16:38:35+00:00</published><updated>2026-05-27T16:38:35+00:00</updated><id>https://simonwillison.net/2026/May/27/product-market-fit/#atom-entries</id><summary type="html">&lt;p&gt;Anthropic are &lt;a href="https://techcrunch.com/2026/05/20/anthropic-says-its-about-to-have-its-first-profitable-quarter/"&gt;strongly rumored&lt;/a&gt; to be about to have their first profitable quarter. Stories &lt;a href="https://www.theinformation.com/newsletters/applied-ai/uber-cto-shows-claude-code-can-blow-ai-budgets"&gt;are circulating&lt;/a&gt; of companies surprised at how expensive their LLM bills are becoming from usage by their staff. I think this is because OpenAI and Anthropic have both found product-market fit.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2026/May/27/product-market-fit/#enterprise-customers-are-now-paying-api-prices"&gt;Enterprise customers are now paying API prices&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2026/May/27/product-market-fit/#i-think-they-ve-found-product-market-fit"&gt;I think they've found product-market fit&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2026/May/27/product-market-fit/#and-they-re-ramping-up"&gt;And they're ramping up&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2026/May/27/product-market-fit/#the-ai-failure-stories-around-this-are-pretty-thin"&gt;The AI-failure stories around this are pretty thin&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2026/May/27/product-market-fit/#we-also-know-the-labs-are-spending-a-lot"&gt;We also know the labs are spending a lot&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2026/May/27/product-market-fit/#api-revenue-is-becoming-less-important"&gt;API revenue is becoming less important&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2026/May/27/product-market-fit/#april-is-a-new-inflection-point"&gt;April is a new inflection point&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h4 id="enterprise-customers-are-now-paying-api-prices"&gt;Enterprise customers are now paying API prices&lt;/h4&gt;
&lt;p&gt;I currently subscribe to the $100/month Max plan from Anthropic and the $100/month Pro plan from OpenAI. If you are a heavy user of coding agents these plans are a fantastic deal. I just ran the &lt;a href="https://github.com/ryoppippi/ccusage"&gt;ccusage&lt;/a&gt; tool on my laptop to estimate how much I would have spent over the past 30 days had I paid for API tokens instead, and got:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;$1,199.79 for Anthropic Claude Code&lt;/li&gt;
&lt;li&gt;$980.37 for OpenAI Codex&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;That's $2,180.16 worth of tokens for $200 - not bad at all! I'm a moderately heavy user of these tools, but I'm certainly not running agents every hour of the day and night.&lt;/p&gt;
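&lt;p&gt;The arithmetic behind that comparison, as a quick sketch using the two ccusage figures and the two $100/month plans quoted above:&lt;/p&gt;

```python
from decimal import Decimal

# Estimated API-price usage over 30 days, as reported by ccusage above.
api_equivalent = {
    "Anthropic Claude Code": Decimal("1199.79"),
    "OpenAI Codex": Decimal("980.37"),
}
# Two $100/month subscriptions.
subscription_cost = Decimal("100") * 2

total = sum(api_equivalent.values())
# How many dollars of API-priced tokens each subscription dollar bought.
multiple = (total / subscription_cost).quantize(Decimal("0.1"))

print(f"API-equivalent usage: ${total:,}")
print(f"Subscription cost: ${subscription_cost:,}")
print(f"Tokens per subscription dollar: {multiple}x")
```

&lt;p&gt;That works out to a little under eleven dollars of API-priced tokens for every dollar of subscription.&lt;/p&gt;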
&lt;p&gt;I had assumed that companies making extensive use of agents were getting similar discounts. It turns out I &lt;em&gt;could not have been more wrong&lt;/em&gt; about that.&lt;/p&gt;
&lt;p&gt;I haven't been able to track down the exact date, but at some point in the last six months Anthropic switched their Enterprise plan (originally &lt;a href="https://www.anthropic.com/news/claude-code-on-team-and-enterprise"&gt;"Claude seats include enough usage for a typical workday" back in August 2025&lt;/a&gt;) to $20/seat/month plus API pricing for usage. This story about the change &lt;a href="https://www.theinformation.com/articles/anthropic-changes-pricing-bill-firms-based-ai-use-amid-compute-crunch"&gt;from The Information&lt;/a&gt; is dated Apr 14, 2026, but cites an Anthropic spokesperson claiming that the pricing change occurred in November 2025. Existing customers are finding out about the change as they renew their contracts.&lt;/p&gt;
&lt;p&gt;OpenAI made a similar pricing change in April. The &lt;a href="https://help.openai.com/en/articles/20001106-codex-rate-card"&gt;Codex rate card&lt;/a&gt; (&lt;a href="https://web.archive.org/web/20260519062438/https://help.openai.com/en/articles/20001106-codex-rate-card"&gt;Internet Archive copy&lt;/a&gt;) currently says:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: On April 2, 2026, we updated Codex pricing to align with API token usage, instead of per-message pricing. This change was applicable to new and existing Plus, Pro, ChatGPT Business and new ChatGPT Enterprise plans.&lt;/p&gt;
&lt;p&gt;On April 23, 2026, we made this update for all existing ChatGPT Enterprise plans as well, inclusive of Edu, Health, Gov, and ChatGPT for Teachers.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;It's a little harder to decode as they quote prices in "credits", but as far as I can tell those credit costs are an exact match for the API token costs listed for those models.&lt;/p&gt;
&lt;p&gt;All of which is to say that as of April 2026 the "Enterprise" cost for both OpenAI Codex and Anthropic Claude Code/Cowork is the same as the listed API price.&lt;/p&gt;
&lt;p&gt;GPT-5.5 (released April 23rd) is 2x the API price of GPT-5.4. Opus 4.7 (April 16th) is &lt;a href="https://simonwillison.net/2026/Apr/20/claude-token-counts/"&gt;around 1.4x&lt;/a&gt; the price of Opus 4.6 when you take their new tokenizer into account.&lt;/p&gt;
&lt;p&gt;So April saw both leading model companies release new frontier models with a higher API price, &lt;em&gt;and&lt;/em&gt; both companies now have measures to lock their enterprise customers (who tend to sign year-long deals) at those API prices, not the previous extreme discounts.&lt;/p&gt;
&lt;h4 id="i-think-they-ve-found-product-market-fit"&gt;I think they've found product-market fit&lt;/h4&gt;
&lt;p&gt;Why these sudden aggressive moves on pricing? Both Anthropic and OpenAI are planning to IPO, but I suspect there's a more important factor here: I think they've finally found product-market fit, with the coding/general-purpose agent products embodied by Claude Code/Cowork and Codex.&lt;/p&gt;
&lt;p&gt;Tools like ChatGPT are wildly popular, but that wild popularity has been difficult to turn into revenue. In February &lt;a href="https://finance.yahoo.com/news/chatgpt-almost-1-billion-weekly-212157499.html"&gt;OpenAI boasted&lt;/a&gt; more than 900 million weekly active users for ChatGPT, but only 50 million - 5.6% of that - were paying consumer subscribers.&lt;/p&gt;
&lt;p&gt;Charging $10-$20/month per user is an OK business, but you'd need 1-2 billion subscribers sticking around for four years to cover &lt;a href="https://openai.com/global-affairs/seizing-the-ai-opportunity/"&gt;$1 trillion in infrastructure&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Companies spending $200+/month/user will get you there a whole lot faster - and as noted above, as a power-user I'm at ~$1,000/month in API costs per vendor already.&lt;/p&gt;
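&lt;p&gt;Here is the back-of-envelope version of that calculation. It is a deliberately crude sketch: it assumes every payer sticks around for the full four years and ignores churn, serving costs and any other revenue:&lt;/p&gt;

```python
TARGET_USD = 1_000_000_000_000  # the $1 trillion infrastructure figure
MONTHS = 4 * 12  # four years of monthly payments


def payers_needed(monthly_usd):
    """Return how many users paying monthly_usd for MONTHS cover TARGET_USD."""
    return TARGET_USD / (monthly_usd * MONTHS)


# Consumer price points, then the $200+ and ~$1,000 per month figures.
for monthly_usd in (10, 20, 200, 1000):
    millions = payers_needed(monthly_usd) / 1_000_000
    print(f"${monthly_usd}/month: about {millions:,.0f} million payers")
```

&lt;p&gt;At $10-$20/month the answer lands between one and two billion payers. At $200/month it drops to around a hundred million, and at power-user spending of $1,000/month it is closer to twenty million.&lt;/p&gt;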
&lt;p&gt;Coding agents really did change everything. These are tools which burn &lt;em&gt;vastly&lt;/em&gt; more tokens, but are also quickly becoming daily drivers for the work carried out by extremely well-compensated professionals. Right now that's still mostly software engineers, but a coding agent is a tool that can automate anything you can do by typing commands into a computer... so they are clearly applicable to a much wider set of skilled knowledge workers.&lt;/p&gt;
&lt;p&gt;As I've &lt;a href="https://simonwillison.net/tags/november-2025-inflection/"&gt;discussed on this site at length&lt;/a&gt;, the models released in November 2025 elevated agents to being genuinely useful. We've had six months to get used to that idea now - it's no wonder companies are beginning to spend real money on this technology.&lt;/p&gt;
&lt;p&gt;You could argue that ChatGPT achieved product-market fit when it became the &lt;a href="https://www.reuters.com/technology/chatgpt-sets-record-fastest-growing-user-base-analyst-note-2023-02-01/"&gt;fastest-growing consumer app in history&lt;/a&gt; back in February 2023... but it certainly wasn't making any actual money back then. Coding agents plus enterprise pricing marks the point when these companies start making &lt;em&gt;very&lt;/em&gt; real revenue. Maybe even enough to start covering their costs!&lt;/p&gt;
&lt;h4 id="and-they-re-ramping-up"&gt;And they're ramping up&lt;/h4&gt;
&lt;p&gt;As further evidence that enterprise agents represent product-market fit for these companies, consider their open job listings.&lt;/p&gt;
&lt;p&gt;OpenAI have &lt;a href="https://openai.com/careers/search/"&gt;703 open jobs&lt;/a&gt; right now, of which I'd categorize 229 (32.6%) as relating to enterprise sales and support - account executives, "Go To Market", "Forward Deployed Engineers" and the like.&lt;/p&gt;
&lt;p&gt;Anthropic have &lt;a href="https://www.anthropic.com/careers/jobs"&gt;390 open jobs&lt;/a&gt;, 105 (26.9%) of which look enterprisey to me.&lt;/p&gt;
&lt;p&gt;It's pleasingly ironic that these AI labs have picked a business model with such a heavy demand on human labor - enterprise sales contracts don't close themselves without a whole lot of humans in the mix!&lt;/p&gt;
&lt;p&gt;&lt;small&gt;(I ran this analysis by scraping their job sites with Claude Code, then having it use Datasette's &lt;a href="https://docs.datasette.io/en/latest/json_api.html"&gt;JSON API&lt;/a&gt; to pipe that data into Datasette Cloud where I used &lt;a href="https://agent.datasette.io/"&gt;Datasette Agent&lt;/a&gt; for the analysis, &lt;a href="https://gist.github.com/simonw/5632d208d76b3c8b34f1fdbaf69eb1b8#agent-4"&gt;exported here&lt;/a&gt;. Dogfood!)&lt;/small&gt;&lt;/p&gt;
&lt;h4 id="the-ai-failure-stories-around-this-are-pretty-thin"&gt;The AI-failure stories around this are pretty thin&lt;/h4&gt;
&lt;p&gt;I started digging into this in response to &lt;a href="https://news.ycombinator.com/item?id=48287025#48287219"&gt;a growing volume&lt;/a&gt; of stories claiming that large companies were sounding the alarm because their AI usage costs had grown so large.&lt;/p&gt;
&lt;p&gt;The most widely cited of these stories appear quite overblown to me.&lt;/p&gt;
&lt;p&gt;The most discussed has been Uber, based on &lt;a href="https://www.theinformation.com/newsletters/applied-ai/uber-cto-shows-claude-code-can-blow-ai-budgets"&gt;this report&lt;/a&gt; where CTO Praveen Neppalli Naga indicated that Uber had "maxed out its full year AI budget just a few months into 2026", mostly thanks to Claude Code.&lt;/p&gt;
&lt;p&gt;Given that Claude Code only got &lt;em&gt;really&lt;/em&gt; good in November it's entirely unsurprising to me that a budget set in 2025 may have failed to predict demand for that tool in 2026!&lt;/p&gt;
&lt;p&gt;That Uber story was further fueled by comments made by Uber's COO, Andrew Macdonald, on the Rapid Response podcast. I tracked down &lt;a href="https://www.youtube.com/watch?v=y_mQ6xLcKyc&amp;amp;t=1616s"&gt;the segment&lt;/a&gt; and there really isn't much there. Here's what Andrew said:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;But then you sometimes go and talk to your senior engineering leaders and you're saying, OK, how many projects that were on the cutting room floor got moved above the line because of the productivity gains because 25% of our code commits were via Claude Code last quarter?&lt;/p&gt;
&lt;p&gt;That link is not there yet, right? I think maybe implicitly there's more that is getting shipped. But it's very hard to draw a line between one of those stats and, OK, now we're actually producing like 25% more useful consumer features, right? And that line is hard to draw.&lt;/p&gt;
&lt;p&gt;[...] And so if you're not actually able to draw a direct line to how much useful features and functionality you're shipping to your users, that trade becomes harder to justify.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Somehow this fragment turned into headlines like &lt;a href="https://www.businessinsider.com/uber-coo-andrew-macdonald-ai-token-spending-harder-justify-2026-5"&gt;Uber's COO says it's getting harder to justify the money spent on AI tokenmaxxing&lt;/a&gt;, because the market for stories about AI failures remains enormous.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;Update 29th May 2026&lt;/strong&gt;: I edited the above quote to add that last paragraph ending in "becomes harder to justify" on &lt;a href="https://x.com/MadisonMills22/status/2060343512936186240"&gt;the suggestion of Madison Mills&lt;/a&gt; - previously my quoted section stopped at "hard to draw". Here's the &lt;a href="https://gist.github.com/simonw/59096a338c82f6f95e40e3d7c7b5bad9"&gt;full unedited transcript&lt;/a&gt; from MacWhisper.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;The other popular story around this is &lt;a href="https://www.theverge.com/tech/930447/microsoft-claude-code-discontinued-notepad"&gt;Microsoft starts canceling Claude Code licenses&lt;/a&gt;, ostensibly to encourage their engineers to dogfood their own Copilot CLI agent instead - but The Verge reporter Tom Warren says "sources tell me the decision is also a financial one", triggered by the June 30th end of Microsoft's financial year.&lt;/p&gt;
&lt;p&gt;I think both of these stories support my "product-market fit" hypothesis. The best advice I ever heard on pricing a product was that your customer should &lt;em&gt;suck air through their teeth&lt;/em&gt; and then say yes. Uber's budget overrun and Microsoft's seat cancellations look like that effect playing out in practice.&lt;/p&gt;
&lt;h4 id="we-also-know-the-labs-are-spending-a-lot"&gt;We also know the labs are spending a lot&lt;/h4&gt;
&lt;p&gt;The big AI labs spend billions of dollars on both training and inference. Credible figures are hard to come by, but we did get one huge hint as to the figures involved from, oddly enough, the recent &lt;a href="https://www.sec.gov/Archives/edgar/data/1181412/000162828026036936/spaceexplorationtechnologi.htm"&gt;SpaceX S-1&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;[...] in May 2026, we entered into &lt;strong&gt;Cloud Services Agreements with Anthropic PBC&lt;/strong&gt; (“Anthropic”), an AI research and development public benefit corporation, with respect to access to &lt;strong&gt;compute capacity across COLOSSUS and COLOSSUS II&lt;/strong&gt;. Pursuant to these agreements, the customer &lt;strong&gt;has agreed to pay us $1.25 billion per month&lt;/strong&gt; through May 2029 [...]&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The &lt;a href="https://www.anthropic.com/news/higher-limits-spacex"&gt;Anthropic announcement&lt;/a&gt; said that this deal meant they could "increase our usage limits for Claude Code and the Claude API", heavily implying that Colossus is being used for inference, not model training.&lt;/p&gt;
&lt;p&gt;Anthropic already have vast amounts of compute from other providers. The fact that they're willing to spend $1.25 billion per month for extra capacity from just &lt;em&gt;one&lt;/em&gt; of their vendors hints at how big these inference budgets have become.&lt;/p&gt;
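&lt;p&gt;To put that monthly figure in context, here is a quick sketch of what it adds up to. The 36-payment count is an assumption: the quoted excerpt says "through May 2029" but does not spell out the exact first and last billing months:&lt;/p&gt;

```python
from decimal import Decimal

MONTHLY_USD_BILLIONS = Decimal("1.25")  # figure quoted from the S-1

annual = MONTHLY_USD_BILLIONS * 12

# Assumption: 36 monthly payments, roughly May 2026 through May 2029.
# The exact start and end months are not spelled out in the quoted excerpt.
ASSUMED_PAYMENTS = 36
total = MONTHLY_USD_BILLIONS * ASSUMED_PAYMENTS

print(f"Annualized: ${annual} billion")
print(f"Over {ASSUMED_PAYMENTS} payments: ${total} billion")
```

&lt;p&gt;That is $15 billion a year, and on the order of $45 billion over the life of the agreement if the 36-payment assumption holds.&lt;/p&gt;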
&lt;h4 id="api-revenue-is-becoming-less-important"&gt;API revenue is becoming less important&lt;/h4&gt;
&lt;p&gt;Over the past two years my impression has been that OpenAI made more of their income from subscription revenue while Anthropic made more from their API.&lt;/p&gt;
&lt;p&gt;Anthropic's API revenue was historically quite dependent on a small number of large API customers - &lt;a href="https://venturebeat.com/ai/anthropic-revenue-tied-to-two-customers-as-ai-pricing-war-threatens-margins"&gt;this VentureBeat story from August 2025&lt;/a&gt; quotes "sources familiar with the matter" suggesting that just Cursor and GitHub Copilot were responsible for $1.2 billion of the company's then-$4 billion revenue.&lt;/p&gt;
&lt;p&gt;Today Anthropic are rumored to hit &lt;a href="https://www.wsj.com/tech/ai/mind-blowing-growth-is-about-to-propel-anthropic-into-its-first-profitable-quarter-7edbf2f4"&gt;$10.9 billion in revenue in the second quarter&lt;/a&gt;, potentially even operating at a profit for the first time.&lt;/p&gt;
&lt;p&gt;This pivot-to-Enterprise suggests that the labs have realized that the real money lies in cutting out the middlemen. Anthropic's Claude Code directly competes with Cursor and Copilot. No wonder Cursor are &lt;a href="https://cursor.com/blog/composer-2"&gt;investing in their own models&lt;/a&gt;!&lt;/p&gt;
&lt;h4 id="april-is-a-new-inflection-point"&gt;April is a new inflection point&lt;/h4&gt;
&lt;p&gt;I've called November 2025 the &lt;a href="https://simonwillison.net/tags/november-2025-inflection/"&gt;November inflection point&lt;/a&gt; because that was when GPT-5.1 and Opus 4.5, combined with their respective coding agent harnesses, got &lt;em&gt;good&lt;/em&gt; - good enough that we've spent the last six months adapting to agent systems that can reliably get useful work done.&lt;/p&gt;
&lt;p&gt;I think April 2026 is a new inflection point where the revenue implications of this have started to land, to the benefit of the frontier AI labs and with material impacts on the budgets of large companies.&lt;/p&gt;
&lt;p&gt;We'll know for sure how real this moment is when the S-1 documents for the upcoming Anthropic and OpenAI IPOs give us some real, audited numbers to get our teeth into.&lt;/p&gt;&lt;p&gt;&lt;em&gt;You are only seeing the long-form articles from my blog. Subscribe to &lt;a href="https://simonwillison.net/atom/everything/"&gt;/atom/everything/&lt;/a&gt; to get all of my posts, or take a look at my &lt;a href="https://simonwillison.net/about/#subscribe"&gt;other subscription options&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;</summary><category term="ai"/><category term="datasette"/><category term="openai"/><category term="generative-ai"/><category term="llms"/><category term="anthropic"/><category term="llm-pricing"/><category term="coding-agents"/><category term="claude-code"/><category term="codex"/><category term="claude-cowork"/><category term="november-2025-inflection"/><category term="datasette-agent"/><category term="uber"/></entry><entry><title>Notes on Pope Leo XIV's encyclical on AI</title><link href="https://simonwillison.net/2026/May/25/encyclical-on-ai/#atom-entries" rel="alternate"/><published>2026-05-25T23:58:17+00:00</published><updated>2026-05-25T23:58:17+00:00</updated><id>https://simonwillison.net/2026/May/25/encyclical-on-ai/#atom-entries</id><summary type="html">&lt;p&gt;Dropped this morning by the Vatican: &lt;a href="https://www.vatican.va/content/leo-xiv/en/encyclicals/documents/20260515-magnifica-humanitas.html"&gt;Magnifica Humanitas of His Holiness Pope Leo XIV on Safeguarding the Human Person in the Time of Artificial Intelligence&lt;/a&gt;. This is a &lt;em&gt;very interesting&lt;/em&gt; document. It's some of the clearest writing I've seen on the ethics of integrating AI into modern society.&lt;/p&gt;
&lt;p&gt;Pope Leo XIV chose the name Leo in honor of Pope Leo XIII, who is known for his 1891 &lt;em&gt;&lt;a href="https://en.wikipedia.org/wiki/Rerum_novarum"&gt;Rerum novarum&lt;/a&gt;&lt;/em&gt; encyclical on "Rights and Duties of Capital and Labor".&lt;/p&gt;
&lt;p&gt;&lt;a href="https://www.vaticannews.va/en/church/news/2025-05/leo-xiii-s-times-and-our-own.html"&gt;This story&lt;/a&gt; on Vatican News further clarifies the significance of that decision:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Meeting with the College of Cardinals for their first formal encounter after his election, Pope Leo XIV explained part of the reason for the choice of his papal name. "There are different reasons for this," he said, before going on to explain that he chose the name Leo "mainly because Pope Leo XIII, in his historic encyclical &lt;em&gt;&lt;a href="https://www.vatican.va/content/leo-xiii/en/encyclicals/documents/hf_l-xiii_enc_15051891_rerum-novarum.html"&gt;Rerum novarum&lt;/a&gt;&lt;/em&gt; addressed the social question in the context of the first great industrial revolution."&lt;/p&gt;
&lt;p&gt;"In our own day," he continued, "the Church offers to everyone the treasury of her social teaching in response to another industrial revolution and to developments in the field of artificial intelligence that pose new challenges for the defence of human dignity, justice, and labour."&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;And now we get Pope Leo XIV's own encyclical on the AI revolution. There's a lot in here, but the writing style is very approachable, including to non-Catholics.&lt;/p&gt;
&lt;h4 id="a-few-of-my-highlights"&gt;A few of my highlights&lt;/h4&gt;
&lt;p&gt;&lt;small&gt;(I listened to most of the encyclical on a walk with our dog, my first time trying the &lt;a href="https://apps.apple.com/us/app/elevenreader-read-books-aloud/id6479373050"&gt;ElevenReader iPhone app&lt;/a&gt;. It worked very well: I pasted in a URL to the document and it read it to me in a very high quality voice, highlighting each paragraph as it went.)&lt;/small&gt;&lt;/p&gt;
&lt;p&gt;Here are some of my highlights. In each case below &lt;strong&gt;emphasis&lt;/strong&gt; is mine.&lt;/p&gt;
&lt;p&gt;Here's a useful description of the interpretability problem for LLMs in section 98:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;First, any statement regarding AI risks becoming quickly outdated, given the remarkable pace at which these systems are developing. Second, all of us, including those who design them, possess only a limited understanding of their actual functioning. Indeed, &lt;strong&gt;current AI systems are more “cultivated” than “built,” for developers do not directly design every detail, but instead create a framework within which the intelligence “grows.”&lt;/strong&gt; As a result, fundamental scientific aspects — such as the internal representations and computational processes of these systems — remain, at present, unknown.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I liked section 83's description of the relationship between development and dignity:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;For individuals as well as for nations, development is both a duty and a right. Minimum conditions are required for enabling every person and people to flourish in accord with their dignity, without being kept in a state of dependence or excluded from access to necessary goods. Development is truly human when it places people at the center instead of the accumulation of wealth, and when it concerns peoples as well as individuals. Justice demands the recognition of the rights of society and the rights of peoples, and includes a responsibility toward future generations. &lt;strong&gt;Development is not truly human if it increases consumption for some while shifting costs and burdens onto others, or relegates entire regions to subordinate roles, preventing them from realizing their full potential&lt;/strong&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Baked-in cultural biases and sycophancy get a mention in section 100:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;In personal use, three aspects in particular deserve careful consideration: the ease with which results are obtained, the impression of objectivity and the simulation of human communication. The speed and simplicity with which information, complex analyses, media content and practical assistance can be accessed undoubtedly makes life easier. Yet they can also encourage excessive reliance and the search for ready-made answers, and weaken personal creativity and judgment. &lt;strong&gt;The apparent objectivity of the responses and suggestions these systems provide can lead us to overlook the fact that they reflect the cultural assumptions of those who designed and trained them, with all their strengths and limitations&lt;/strong&gt;. The artificial imitation of positive human communication — words of advice, empathy, friendship and even love — can be engaging and at times genuinely helpful. &lt;strong&gt;However, for less discerning users, it can also be misleading, creating the illusion of a relationship with a real personal subject&lt;/strong&gt;. When words are simulated, they do not build genuine relationships, but only their appearance. The artificial imitation of care or support can become particularly risky when it enters contexts where real relationships and emotional bonds are lacking.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;101 touches on the environmental impact:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Current AI systems require enormous amounts of energy and water, significantly influencing carbon dioxide emissions, and place heavy demands on natural resources. &lt;strong&gt;As their complexity increases, especially in the case of large language models, the need for computing power and storage capacity grows too, which requires an extensive network of machines, cables, data centers and energy-intensive infrastructure&lt;/strong&gt;. For this reason, it is essential to develop more sustainable technological solutions that reduce environmental impact and help protect our common home.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;102 covers the risks of algorithmic systems making decisions that impact people's lives without "compassion, mercy, forgiveness":&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The use of AI is never a purely technical matter: &lt;strong&gt;when it enters processes that affect people’s lives, it touches on rights, opportunities, status and freedom&lt;/strong&gt;. Important and sensitive decisions — concerning employment, credit, access to public services or even a person’s reputation — &lt;strong&gt;risk being fully delegated to automated systems that do not know “compassion, mercy, forgiveness, and above all, the hope that people are able to change,”&lt;/strong&gt; and can therefore give rise to new forms of exclusion.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;105 emphasizes the need for human accountability in how these systems are applied:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;For AI to respect human dignity and truly serve the common good, responsibility must be clearly defined at every stage: &lt;strong&gt;from those who design and develop these systems to those who use them and rely on them for concrete decisions&lt;/strong&gt;. In many cases, however, the internal processes leading to a result remain opaque, making it harder to assign responsibility and correct errors. &lt;strong&gt;This is where accountability becomes crucial: the possibility of identifying who must “account” for decisions, justify them, monitor them, and, when necessary, challenge them and remedy any harm caused&lt;/strong&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;And 108 touches on the way AI amplifies the power of those with resources:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;In fact, as with every major technological shift, &lt;strong&gt;AI tends to amplify the power of those who already possess economic resources, expertise and access to data&lt;/strong&gt;. In light of the common good and the universal destination of goods, this raises serious concerns, since small but highly influential groups can shape information and consumption patterns, influence democratic processes and steer economic dynamics to their own advantage, undermining social justice and solidarity among peoples. For this reason, it is essential that the use of AI, especially when it touches on public goods and fundamental rights, be guided by clear criteria and effective oversight, grounded in participation and subsidiarity.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;That same section explicitly calls out data as something that should be thought of more as a public good:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;[...] Moreover, &lt;strong&gt;ownership of data cannot be left solely in private hands&lt;/strong&gt; but must be appropriately regulated. &lt;strong&gt;Data is the product of many contributors and should not be treated as something to be sold off or entrusted to a select few&lt;/strong&gt;. It is necessary to think creatively in order to manage data as a common or shared good, in a spirit of participation, as &lt;a href="https://www.vatican.va/content/john-paul-ii/en.html"&gt;Saint John Paul II&lt;/a&gt; already suggested regarding collective goods.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Given that Palantir takes its name from &lt;em&gt;The Lord of the Rings&lt;/em&gt;, I can't help but wonder if the J.R.R. Tolkien quote from &lt;em&gt;The Return of the King&lt;/em&gt; (section 213) was the Pope throwing a little shade at Peter Thiel.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The twentieth-century Catholic author J.R.R. Tolkien, in the words of a protagonist in one of his novels, described our responsibility in this way: “It is not our part to master all the tides of the world, but to do what is in us for the succour of those years wherein we are set, uprooting the evil in the fields that we know, so that those who live after may have clean earth to till.” The civilization of love will not arise from a single or spectacular gesture, but from the sum total of small and steadfast acts of fidelity that serve as a bulwark against dehumanization. For this reason, it is worthwhile pausing to reflect on some aspects of how we, each in our own way, can cooperate in building the civilization of love.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h4 id="another-2026-prediction-down"&gt;Another 2026 prediction down&lt;/h4&gt;
&lt;p&gt;On 6th January this year I joined the &lt;a href="https://oxide-and-friends.transistor.fm/episodes/predictions-2026"&gt;Oxide and Friends 2026 predictions&lt;/a&gt; podcast episode to talk about predictions for 2026, 2029 and 2032. I &lt;a href="https://simonwillison.net/2026/Jan/8/llm-predictions-for-2026/"&gt;wrote mine up here&lt;/a&gt;. With hindsight they weren't nearly ambitious enough - it's already undeniable that LLMs write good code, we've made huge advances in sandboxing and New Zealand kākāpō have indeed &lt;a href="https://news.mongabay.com/short-article/2026/03/critically-endangered-kakapo-parrot-has-standout-breeding-season/"&gt;had a truly excellent breeding season&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;There's one segment from the episode that I didn't bother to include in my write-up, but that I can't resist providing as a lightly-edited transcript here:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Bryan Cantrill:&lt;/strong&gt; &lt;a href="https://oxide-and-friends.transistor.fm/episodes/predictions-2026/transcript#t=37m13s"&gt;37:13&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;I think that AI has created some real public perception problems for itself. And I think that you are gonna have one of the frontier model companies, this year, have a white paper explaining how the proliferation of AI will mean prosperity for everybody. They will be trying to make some economic argument - because this is gonna be a 2026 election issue, how we think of these things and how they are regulated and it's a big mess. There's more heat than light in this debate.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Simon Willison:&lt;/strong&gt; &lt;a href="https://oxide-and-friends.transistor.fm/episodes/predictions-2026/transcript#t=38m5s"&gt;38:05&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;I'd like to tag something on to that one: I think that only works if they can sort of wash that through existing trusted experts. Sam Altman and Dario are constantly publishing essays about this stuff and nobody believes a word they say. Get Barack Obama's signature on one of these position papers and &lt;em&gt;maybe&lt;/em&gt; you've got something people might start to trust a little bit.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Adam Leventhal:&lt;/strong&gt; &lt;a href="https://oxide-and-friends.transistor.fm/episodes/predictions-2026/transcript#t=38m27s"&gt;38:27&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Otherwise, it's just like "leaded gas is good for you", says Exxon.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Bryan Cantrill:&lt;/strong&gt; &lt;a href="https://oxide-and-friends.transistor.fm/episodes/predictions-2026/transcript#t=38m31s"&gt;38:31&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;I mean, yeah. God. Obama... let's go with that, that's a great one because if it's like Bill Clinton everyone's gonna kind of roll their eyes, so it's gotta be someone who's got real credibility saying that this is gonna be broad-based... I'd say if they get that person to do it, it's gonna be revealed that that's also a bit crooked.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Simon Willison:&lt;/strong&gt; &lt;a href="https://oxide-and-friends.transistor.fm/episodes/predictions-2026/transcript#t=38m57s"&gt;38:57&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;How about the Pope?&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Bryan Cantrill:&lt;/strong&gt; &lt;a href="https://oxide-and-friends.transistor.fm/episodes/predictions-2026/transcript#t=39m1s"&gt;39:01&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;The Pope is very into this stuff! That's a great prediction. We've hit pay dirt. The Pope weighing in on LLMs and their economic impact on the world.&lt;/p&gt;
&lt;p&gt;Simon, I'm giving you full credit if the Pope weighs in believing that this is gonna be economic devastation.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;My prediction here looks a whole lot less insightful given the Leo XIV/Leo XIII relationship, which I was unaware of when we recorded the episode!&lt;/p&gt;&lt;p&gt;&lt;em&gt;You are only seeing the long-form articles from my blog. Subscribe to &lt;a href="https://simonwillison.net/atom/everything/"&gt;/atom/everything/&lt;/a&gt; to get all of my posts, or take a look at my &lt;a href="https://simonwillison.net/about/#subscribe"&gt;other subscription options&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;</summary><category term="predictions"/><category term="ai"/><category term="kakapo"/><category term="generative-ai"/><category term="llms"/><category term="bryan-cantrill"/><category term="ai-ethics"/></entry><entry><title>Datasette Agent</title><link href="https://simonwillison.net/2026/May/21/datasette-agent/#atom-entries" rel="alternate"/><published>2026-05-21T19:52:19+00:00</published><updated>2026-05-21T19:52:19+00:00</updated><id>https://simonwillison.net/2026/May/21/datasette-agent/#atom-entries</id><summary type="html">&lt;p&gt;We just &lt;a href="https://datasette.io/blog/2026/datasette-agent/"&gt;announced the first release of Datasette Agent&lt;/a&gt;, a new extensible AI assistant for Datasette. I've been working on my &lt;a href="https://llm.datasette.io/"&gt;LLM&lt;/a&gt; Python library for just over three years now, and Datasette Agent represents the moment that LLM and &lt;a href="https://datasette.io/"&gt;Datasette&lt;/a&gt; finally come together. I'm really excited about it!&lt;/p&gt;
&lt;p&gt;Datasette Agent provides a conversational interface for asking questions of the data you have stored in Datasette. Add the &lt;a href="https://github.com/datasette/datasette-agent-charts"&gt;datasette-agent-charts&lt;/a&gt; plugin and it can generate charts of your data as well.&lt;/p&gt;
&lt;h4 id="the-demo"&gt;The demo&lt;/h4&gt;
&lt;p&gt;The &lt;a href=""&gt;announcement post&lt;/a&gt; (on the new Datasette project blog) includes this &lt;a href="https://www.youtube.com/watch?v=AFZKp6hbFjI"&gt;demo video&lt;/a&gt;:&lt;/p&gt;

&lt;iframe style="margin-bottom: 1.5em;" width="560" height="315" src="https://www.youtube-nocookie.com/embed/AFZKp6hbFjI" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen="allowfullscreen"&gt; &lt;/iframe&gt;


&lt;p&gt;I recorded the video against the new &lt;a href="https://agent.datasette.io/"&gt;agent.datasette.io&lt;/a&gt; live demo instance, which runs Datasette Agent against example databases including the classic &lt;a href="https://datasette.io/global-power-plants"&gt;global-power-plants&lt;/a&gt; by &lt;a href="https://www.wri.org/research/global-database-power-plants"&gt;WRI&lt;/a&gt;, and a copy of the &lt;a href="https://datasette.simonwillison.net/"&gt;Datasette backup&lt;/a&gt; of my blog.&lt;/p&gt;
&lt;p&gt;The live demo runs on &lt;a href="https://ai.google.dev/gemini-api/docs/models/gemini-3.1-flash-lite"&gt;Gemini 3.1 Flash-Lite&lt;/a&gt; - it's cheap, fast and has no trouble writing SQLite queries.&lt;/p&gt;
&lt;p&gt;A question I asked in the demo was:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;when did Simon most recently see a pelican?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Which ran &lt;a href="https://datasette.simonwillison.net/simonwillisonblog?sql=SELECT+title%2C+commentary%2C+created+FROM+blog_beat+WHERE+beat_type+%3D+%27sighting%27+AND+%28title+LIKE+%27%25pelican%25%27+OR+commentary+LIKE+%27%25pelican%25%27%29+ORDER+BY+created+DESC+LIMIT+5"&gt;this SQL query&lt;/a&gt;:&lt;/p&gt;
&lt;div class="highlight highlight-source-sql"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;SELECT&lt;/span&gt;
  title, commentary, created
&lt;span class="pl-k"&gt;FROM&lt;/span&gt;
  blog_beat
&lt;span class="pl-k"&gt;WHERE&lt;/span&gt;
  beat_type &lt;span class="pl-k"&gt;=&lt;/span&gt; &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;sighting&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;
  &lt;span class="pl-k"&gt;AND&lt;/span&gt; (title &lt;span class="pl-k"&gt;LIKE&lt;/span&gt; &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;%pelican%&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-k"&gt;OR&lt;/span&gt; commentary &lt;span class="pl-k"&gt;LIKE&lt;/span&gt; &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;%pelican%&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;)
&lt;span class="pl-k"&gt;ORDER BY&lt;/span&gt;
  created &lt;span class="pl-k"&gt;DESC&lt;/span&gt;
&lt;span class="pl-k"&gt;LIMIT&lt;/span&gt;
  &lt;span class="pl-c1"&gt;5&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;And replied:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The most recent sighting of a pelican by Simon was recorded on &lt;strong&gt;May 20, 2026&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;The observation included a California Brown Pelican, along with a Common Loon, Canada Goose, Striped Shore Crab, and a California Sea Lion.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Here's &lt;a href="https://simonwillison.net/2026/May/20/sighting-363395265/"&gt;that sighting on my blog&lt;/a&gt;, and the &lt;a href="https://gist.github.com/simonw/a46d17b69659a4866adb1d868280091d"&gt;Markdown export&lt;/a&gt; of the full conversation transcript.&lt;/p&gt;
&lt;h4 id="the-plugins"&gt;The plugins&lt;/h4&gt;
&lt;p&gt;My favorite feature of Datasette Agent is that, like the rest of Datasette, it's extensible using plugins.&lt;/p&gt;
&lt;p&gt;We've shipped three plugins so far:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/datasette/datasette-agent-charts"&gt;datasette-agent-charts&lt;/a&gt;, shown in the video, adds charts to Datasette Agent, powered by &lt;a href="https://observablehq.com/plot/"&gt;Observable Plot&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/datasette/datasette-agent-openai-imagegen"&gt;datasette-agent-openai-imagegen&lt;/a&gt; adds an image generation tool to Datasette Agent using &lt;a href="https://openai.com/index/introducing-chatgpt-images-2-0/"&gt;ChatGPT Images 2.0&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/datasette/datasette-agent-sprites"&gt;datasette-agent-sprites&lt;/a&gt; provides tools for executing code in a &lt;a href="https://sprites.dev/"&gt;Fly Sprites&lt;/a&gt; persistent sandbox.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Building plugins is &lt;em&gt;really fun&lt;/em&gt;. I have a bunch more prototypes that aren't quite alpha-quality yet.&lt;/p&gt;
&lt;p&gt;Claude Code and OpenAI Codex are both proving excellent at writing plugins - just point them at a checkout of the &lt;a href="https://github.com/datasette/datasette-agent"&gt;datasette-agent repo&lt;/a&gt; for reference and tell them what you want to build!&lt;/p&gt;
&lt;h4 id="running-it-against-local-models"&gt;Running it against local models&lt;/h4&gt;
&lt;p&gt;I've also been having fun running the new plugin against local models. Here's a &lt;code&gt;uv&lt;/code&gt; one-liner to run the plugin against &lt;a href="https://huggingface.co/google/gemma-4-26B-A4B"&gt;gemma-4-26b-a4b&lt;/a&gt; in &lt;a href="https://lmstudio.ai"&gt;LM Studio&lt;/a&gt; on a Mac:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;uvx --prerelease=allow \
  --with datasette-agent --with llm-lmstudio \
  datasette --internal internal.db --root \
  -s plugins.datasette-llm.default_model lmstudio/google/gemma-4-26b-a4b \
  data.db&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Datasette Agent needs reliable tool calls and the ability for a model to produce SQL queries that run against SQLite. The open weight models released in the past six months are increasingly able to handle that.&lt;/p&gt;
&lt;h4 id="what-s-next"&gt;What's next&lt;/h4&gt;
&lt;p&gt;Datasette Agent opens up &lt;em&gt;so many&lt;/em&gt; opportunities for the LLM and Datasette ecosystem in general.&lt;/p&gt;
&lt;p&gt;It's already informed &lt;a href="https://simonwillison.net/2026/Apr/29/llm/"&gt;the major LLM 0.32a0 refactor&lt;/a&gt; which I'm nearly ready to roll into a stable release, maybe with some additional "LLM agent" abstractions extracted from Datasette Agent itself.&lt;/p&gt;
&lt;p&gt;I've been exploring my own take on Claude Artifacts, which is shaping up nicely as a plugin.&lt;/p&gt;
&lt;p&gt;I'm excited to use Datasette Agent to build my own &lt;a href="https://simonwillison.net/2026/May/19/5-minute-llms/#5-minutes-llms.013.jpeg"&gt;Claw&lt;/a&gt; - a personal AI assistant built around data imported from different parts of my digital life, which is a neat excuse to revisit my older &lt;a href="https://dogsheep.github.io"&gt;Dogsheep&lt;/a&gt; family of tools.&lt;/p&gt;
&lt;p&gt;We'll also be rolling out Datasette Agent for users of &lt;a href="https://datasette.cloud/"&gt;Datasette Cloud&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Join our &lt;a href="https://discord.gg/hdxyusUFv"&gt;#datasette-agent Discord channel&lt;/a&gt; if you'd like to talk about the project.&lt;/p&gt;&lt;p&gt;&lt;em&gt;You are only seeing the long-form articles from my blog. Subscribe to &lt;a href="https://simonwillison.net/atom/everything/"&gt;/atom/everything/&lt;/a&gt; to get all of my posts, or take a look at my &lt;a href="https://simonwillison.net/about/#subscribe"&gt;other subscription options&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;</summary><category term="projects"/><category term="sqlite"/><category term="ai"/><category term="datasette"/><category term="generative-ai"/><category term="llms"/><category term="llm"/><category term="uv"/><category term="datasette-agent"/></entry><entry><title>Gemini 3.5 Flash: more expensive, but Google plan to use it for everything</title><link href="https://simonwillison.net/2026/May/19/gemini-35-flash/#atom-entries" rel="alternate"/><published>2026-05-19T22:40:25+00:00</published><updated>2026-05-19T22:40:25+00:00</updated><id>https://simonwillison.net/2026/May/19/gemini-35-flash/#atom-entries</id><summary type="html">&lt;p&gt;Today at Google I/O, Google &lt;a href="https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-3-5/"&gt;released Gemini 3.5 Flash&lt;/a&gt;. This one skipped the &lt;code&gt;-preview&lt;/code&gt; modifier and went straight to general availability, and Google appear to be using it for a whole lot of their key products:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;3.5 Flash is available today to billions of people globally:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;For everyone via the Gemini app and AI Mode in &lt;a href="https://blog.google/products-and-platforms/products/search/search-io-2026"&gt;Google Search&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;For developers in our agent-first development platform Google Antigravity and Gemini API in Google AI Studio and Android Studio&lt;/li&gt;
&lt;li&gt;For enterprises in Gemini Enterprise Agent Platform and Gemini Enterprise.&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;As usual with Gemini, the most interesting details are tucked away in the &lt;a href="https://ai.google.dev/gemini-api/docs/whats-new-gemini-3.5"&gt;What's new in Gemini 3.5 Flash&lt;/a&gt; developer documentation. It mostly has the same set of platform features as the previous Gemini 3.x series, albeit with no &lt;a href="https://ai.google.dev/gemini-api/docs/computer-use"&gt;computer use&lt;/a&gt;. The model ID is &lt;code&gt;gemini-3.5-flash&lt;/code&gt;. The knowledge cut-off is January 2025, and it supports 1,048,576 input tokens and 65,536 maximum output tokens.&lt;/p&gt;
&lt;p&gt;Google are also pushing a new &lt;a href="https://ai.google.dev/gemini-api/docs/interactions"&gt;Interactions API&lt;/a&gt;, currently in beta, which looks to me like their version of the patterns introduced by &lt;a href="https://developers.openai.com/api/reference/responses/overview"&gt;OpenAI Responses&lt;/a&gt; - in particular server-side history management.&lt;/p&gt;
&lt;h4 id="the-price-has-gone-up"&gt;The price has gone up&lt;/h4&gt;
&lt;p&gt;Gemini 3.5 Flash is accompanied by a notable price bump. The previous models in the "Flash" family were &lt;a href="https://ai.google.dev/gemini-api/docs/models/gemini-3-flash-preview"&gt;Gemini 3 Flash Preview&lt;/a&gt; and &lt;a href="https://ai.google.dev/gemini-api/docs/models/gemini-3.1-flash-lite"&gt;Gemini 3.1 Flash-Lite&lt;/a&gt;. The new 3.5 Flash is 3x the price of 3 Flash Preview and 6x the price of 3.1 Flash-Lite (see &lt;a href="https://www.llm-prices.com/#sel=gemini-3-flash-preview%2Cgemini-3.5-flash%2Cgemini-3.1-flash-lite-preview"&gt;price comparison here&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;At $1.50/million input and $9/million output it's getting close in price to Google's Gemini 3.1 Pro, which is $2 and $12.&lt;/p&gt;
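&lt;p&gt;Here's a minimal Python sketch of that arithmetic, assuming flat per-million-token rates with no tiering, caching or batch discounts. The &lt;code&gt;PRICES&lt;/code&gt; table, its keys and the &lt;code&gt;prompt_cost()&lt;/code&gt; helper are hypothetical names for illustration, not part of any Google SDK. The token counts are the ones from the pelican prompt later in this post:&lt;/p&gt;
&lt;div class="highlight highlight-source-python"&gt;&lt;pre&gt;# Per-million-token prices in USD, as quoted in this post.
# The dictionary keys are illustrative labels, not confirmed model IDs.
PRICES = {
    "gemini-3.5-flash": {"input": 1.50, "output": 9.00},
    "gemini-3.1-pro": {"input": 2.00, "output": 12.00},
}


def prompt_cost(model, input_tokens, output_tokens):
    """Estimate the USD cost of one call at flat per-token rates."""
    price = PRICES[model]
    total = input_tokens * price["input"] + output_tokens * price["output"]
    return total / 1_000_000


# Token counts from the pelican prompt: 11 input, 14,403 output.
flash = prompt_cost("gemini-3.5-flash", 11, 14_403)
pro = prompt_cost("gemini-3.1-pro", 11, 14_403)
print(f"Flash: ${flash:.4f}  Pro: ${pro:.4f}  ratio: {flash / pro:.2f}")&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Under those assumptions 3.5 Flash works out to 75% of the 3.1 Pro price for the same token counts, since both its input and output prices are three quarters of Pro's.&lt;/p&gt;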
&lt;p&gt;The Gemini team promise that 3.5 Pro will roll out "next month" - presumably at an even higher price.&lt;/p&gt;
&lt;p&gt;This fits a trend: OpenAI's GPT-5.5 was 2x the price of GPT-5.4, and Claude Opus 4.7 is around 1.46x the price of 4.6 when you take the &lt;a href="https://simonwillison.net/2026/Apr/20/claude-token-counts/"&gt;new tokenizer into account&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Given the price increase it's interesting to see Google roll it out for so many of their own free-to-consumer products. It feels like all three of the major AI labs are starting to probe the price tolerance of their API customers.&lt;/p&gt;
&lt;p&gt;Artificial Analysis publish the cost to run their proprietary benchmark against models, which is a useful way to take things like tokenization and increased volume of reasoning tokens into account. Some numbers worth comparing:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://artificialanalysis.ai/models/gemini-3-5-flash"&gt;Gemini 3.5 Flash (high)&lt;/a&gt;: $1,551.60&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://artificialanalysis.ai/models/gemini-3-1-pro-preview"&gt;Gemini 3.1 Pro Preview&lt;/a&gt;: $892.28&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://artificialanalysis.ai/models/gemini-3-flash-reasoning"&gt;Gemini 3 Flash Preview (Reasoning)&lt;/a&gt;: $278.26&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://artificialanalysis.ai/models/gemini-3-1-flash-lite-preview"&gt;Gemini 3.1 Flash-Lite Preview&lt;/a&gt;: $93.60&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Running the benchmark for 3.5 Flash (high) cost significantly more than 3.1 Pro Preview!&lt;/p&gt;
&lt;p&gt;Here are some numbers from other vendors:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://artificialanalysis.ai/models/claude-opus-4-7"&gt;Claude Opus 4.7 (Adaptive Reasoning, Max Effort)&lt;/a&gt;: $5,117.14&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://artificialanalysis.ai/models/claude-opus-4-7-non-reasoning"&gt;Claude Opus 4.7 (Non-reasoning, High Effort)&lt;/a&gt;: $1,217.23&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://artificialanalysis.ai/models/gpt-5-5"&gt;GPT-5.5 (xhigh)&lt;/a&gt;: $3,357.00&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://artificialanalysis.ai/models/gpt-5-5-medium"&gt;GPT-5.5 (medium)&lt;/a&gt;: $1,199.14&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 id="a-pelican-on-a-bicycle"&gt;A pelican on a bicycle&lt;/h4&gt;
&lt;p&gt;I ran "Generate an SVG of a pelican riding a bicycle" &lt;a href="https://gist.github.com/simonw/09cc5a5545d7e75b33b75ffa92a34601"&gt;against the Gemini API&lt;/a&gt; and got back this pelican, which is a &lt;em&gt;lot&lt;/em&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2026/gemini-3.5-flash.png" alt="Black background, bats in the sky against a stylized moon. Pelican is funky looking. Very good beak. Bicycle frame is a bit twisted, and the bar from pedals to back wheel is missing. Bike lamp illuminates the road in front. Quite stylish." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;From the code comments: &lt;code&gt;&amp;lt;!-- Pelican Eye / Sunglasses (Cool Retro Aviators) --&amp;gt;&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href="https://news.ycombinator.com/item?id=48196570#48198275"&gt;hedgehog on Hacker News&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;That pelican looks like it's in Miami for a crypto conference.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;That one cost me 11 input tokens and 14,403 output tokens, for a total cost of &lt;a href="https://www.llm-prices.com/#it=11&amp;amp;ot=14403&amp;amp;sel=gemini-3.5-flash"&gt;just under 13 cents&lt;/a&gt;.&lt;/p&gt;&lt;p&gt;&lt;em&gt;You are only seeing the long-form articles from my blog. Subscribe to &lt;a href="https://simonwillison.net/atom/everything/"&gt;/atom/everything/&lt;/a&gt; to get all of my posts, or take a look at my &lt;a href="https://simonwillison.net/about/#subscribe"&gt;other subscription options&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;</summary><category term="google"/><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="gemini"/><category term="llm-pricing"/><category term="pelican-riding-a-bicycle"/><category term="llm-release"/></entry><entry><title>The last six months in LLMs in five minutes</title><link href="https://simonwillison.net/2026/May/19/5-minute-llms/#atom-entries" rel="alternate"/><published>2026-05-19T01:09:44+00:00</published><updated>2026-05-19T01:09:44+00:00</updated><id>https://simonwillison.net/2026/May/19/5-minute-llms/#atom-entries</id><summary type="html">&lt;p&gt;I put together these annotated slides from my five minute lightning talk at PyCon US 2026, using the &lt;a href="https://tools.simonwillison.net/annotated-presentations"&gt;latest iteration&lt;/a&gt; of my &lt;a href="https://simonwillison.net/2023/Aug/6/annotated-presentations/"&gt;annotated presentation tool&lt;/a&gt;.&lt;/p&gt;

&lt;div class="slide" id="5-minutes-llms.001.jpeg"&gt;
  &lt;img loading="lazy" src="https://static.simonwillison.net/static/2026/5-minutes-llms/5-minutes-llms.001.jpeg" alt="The last six months in LLMs in
five minutes

Simon Willison - simonwillison.net

PyCon US 2026 Lightning Talk
" style="max-width: 100%" /&gt;
  &lt;div&gt;&lt;a style="float: right; text-decoration: none; border-bottom: none; padding-left: 1em;" href="https://simonwillison.net/2026/May/19/5-minute-llms/#5-minutes-llms.001.jpeg"&gt;#&lt;/a&gt;
  &lt;p&gt;I presented this lightning talk at PyCon US 2026, attempting to summarize the last six months of developments in LLMs in five minutes.&lt;/p&gt;
  &lt;/div&gt;
&lt;/div&gt;
&lt;div class="slide" id="5-minutes-llms.002.jpeg"&gt;
  &lt;img loading="lazy" src="https://static.simonwillison.net/static/2026/5-minutes-llms/5-minutes-llms.002.jpeg" alt="The November inflection point
" style="max-width: 100%" /&gt;
  &lt;div&gt;&lt;a style="float: right; text-decoration: none; border-bottom: none; padding-left: 1em;" href="https://simonwillison.net/2026/May/19/5-minute-llms/#5-minutes-llms.002.jpeg"&gt;#&lt;/a&gt;
  &lt;p&gt;Six months is a pretty convenient time period to cover, because it captures what I've been calling the &lt;a href="https://simonwillison.net/tags/november-2025-inflection/"&gt;November 2025 inflection point&lt;/a&gt;. November was a critical month in LLMs, especially for coding.&lt;/p&gt;
  &lt;/div&gt;
&lt;/div&gt;
&lt;div class="slide" id="5-minutes-llms.003.jpeg"&gt;
  &lt;img loading="lazy" src="https://static.simonwillison.net/static/2026/5-minutes-llms/5-minutes-llms.003.jpeg" alt="The “best” model changed hands 5 times
between Anthropic, OpenAI and Google
" style="max-width: 100%" /&gt;
  &lt;div&gt;&lt;a style="float: right; text-decoration: none; border-bottom: none; padding-left: 1em;" href="https://simonwillison.net/2026/May/19/5-minute-llms/#5-minutes-llms.003.jpeg"&gt;#&lt;/a&gt;
  &lt;p&gt;For one thing, the supposedly "best" model (depending mostly on vibes) changed hands five times between the three big providers.&lt;/p&gt;
  &lt;/div&gt;
&lt;/div&gt;
&lt;div class="slide" id="5-minutes-llms.004.jpeg"&gt;
  &lt;img loading="lazy" src="https://static.simonwillison.net/static/2026/5-minutes-llms/5-minutes-llms.004.jpeg" alt="Generate an SVG of a
pelican riding a bicycle
" style="max-width: 100%" /&gt;
  &lt;div&gt;&lt;a style="float: right; text-decoration: none; border-bottom: none; padding-left: 1em;" href="https://simonwillison.net/2026/May/19/5-minute-llms/#5-minutes-llms.004.jpeg"&gt;#&lt;/a&gt;
  &lt;p&gt;As always, I'm using my &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle/"&gt;Generate an SVG of a pelican riding a bicycle&lt;/a&gt; test to help illustrate the differences between the models.&lt;/p&gt;
&lt;p&gt;Why this test? Because pelicans are hard to draw, bicycles are hard to draw, pelicans &lt;em&gt;can't ride bicycles&lt;/em&gt;... and there's zero chance any AI lab would train a model for such a ridiculous task.&lt;/p&gt;
  &lt;/div&gt;
&lt;/div&gt;
&lt;div class="slide" id="5-minutes-llms.005.jpeg"&gt;
  &lt;img loading="lazy" src="https://static.simonwillison.net/static/2026/5-minutes-llms/5-minutes-llms.005.jpeg" alt="Five pelicans, one for each of the following models. Varying qualities!" style="max-width: 100%" /&gt;
  &lt;div&gt;&lt;a style="float: right; text-decoration: none; border-bottom: none; padding-left: 1em;" href="https://simonwillison.net/2026/May/19/5-minute-llms/#5-minutes-llms.005.jpeg"&gt;#&lt;/a&gt;
  &lt;p&gt;At the start of November the widely acknowledged "best" model was Claude Sonnet 4.5, released on &lt;a href="https://simonwillison.net/2025/Sep/29/claude-sonnet-4-5/"&gt;29th September&lt;/a&gt;. It drew me this pelican.&lt;/p&gt;
&lt;p&gt;In November it was overtaken by &lt;a href="https://simonwillison.net/2025/Nov/13/gpt-51/"&gt;GPT-5.1&lt;/a&gt;, then &lt;a href="https://simonwillison.net/2025/Nov/18/gemini-3/"&gt;Gemini 3&lt;/a&gt;, then &lt;a href="https://simonwillison.net/2025/Nov/19/gpt-51-codex-max/"&gt;GPT-5.1 Codex Max&lt;/a&gt;, and then Anthropic took the crown back again with &lt;a href="https://simonwillison.net/2025/Nov/24/claude-opus/"&gt;Claude Opus 4.5&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I think Gemini 3 drew the best pelican out of this lot, but pelicans aren't everything. Most practitioners will agree that Opus 4.5 held the crown for the next couple of months.&lt;/p&gt;
  &lt;/div&gt;
&lt;/div&gt;
&lt;div class="slide" id="5-minutes-llms.006.jpeg"&gt;
  &lt;img loading="lazy" src="https://static.simonwillison.net/static/2026/5-minutes-llms/5-minutes-llms.006.jpeg" alt="The coding agents got good
" style="max-width: 100%" /&gt;
  &lt;div&gt;&lt;a style="float: right; text-decoration: none; border-bottom: none; padding-left: 1em;" href="https://simonwillison.net/2026/May/19/5-minute-llms/#5-minutes-llms.006.jpeg"&gt;#&lt;/a&gt;
  &lt;p&gt;It took a little while for this to become clear, but the real news from November was that the coding agents got &lt;em&gt;good&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;OpenAI and Anthropic had spent most of 2025 running &lt;a href="https://simonwillison.net/2025/Dec/19/andrej-karpathy/"&gt;Reinforcement Learning from Verifiable Rewards&lt;/a&gt; to increase the quality of code written by their models, especially when paired up with their Codex and Claude Code agent harnesses.&lt;/p&gt;
&lt;p&gt;In November the results of this work became apparent. Coding agents went from often-work to mostly-work, crossing a quality barrier where you could use them as a daily-driver to get real work done, without needing to spend most of your time fixing their stupid mistakes.&lt;/p&gt;
  &lt;/div&gt;
&lt;/div&gt;
&lt;div class="slide" id="5-minutes-llms.007.jpeg"&gt;
  &lt;img loading="lazy" src="https://static.simonwillison.net/static/2026/5-minutes-llms/5-minutes-llms.007.jpeg" alt="Screenshot of &amp;quot;Initial commit&amp;quot; on GitHub to steipete/Warelay, commit f6dd362, steipete authored on Nov 24, 2025

It&amp;#39;s a copy of the MIT license" style="max-width: 100%" /&gt;
  &lt;div&gt;&lt;a style="float: right; text-decoration: none; border-bottom: none; padding-left: 1em;" href="https://simonwillison.net/2026/May/19/5-minute-llms/#5-minutes-llms.007.jpeg"&gt;#&lt;/a&gt;
  &lt;p&gt;Also in November, this happened - the first commit to an obscure (back then) repo called "Warelay" by some guy called Pete.&lt;/p&gt;
  &lt;/div&gt;
&lt;/div&gt;
&lt;div class="slide" id="5-minutes-llms.008.jpeg"&gt;
  &lt;img loading="lazy" src="https://static.simonwillison.net/static/2026/5-minutes-llms/5-minutes-llms.008.jpeg" alt="December/January
(A little bit of LLM psychosis)
" style="max-width: 100%" /&gt;
  &lt;div&gt;&lt;a style="float: right; text-decoration: none; border-bottom: none; padding-left: 1em;" href="https://simonwillison.net/2026/May/19/5-minute-llms/#5-minutes-llms.008.jpeg"&gt;#&lt;/a&gt;
  &lt;p&gt;Over the holiday period, from December to January, a whole lot of us took advantage of the break to have a poke at these new models and coding agents and see what they could do.&lt;/p&gt;
&lt;p&gt;They could do a lot! Some of us got a little bit over-excited. I had my own short-lived bout of a form of LLM psychosis as I started spinning up wildly ambitious projects to see how far I could push them.&lt;/p&gt;
  &lt;/div&gt;
&lt;/div&gt;
&lt;div class="slide" id="5-minutes-llms.009.jpeg"&gt;
  &lt;img loading="lazy" src="https://static.simonwillison.net/static/2026/5-minutes-llms/5-minutes-llms.009.jpeg" alt="micro-javascript playground
Execute JavaScript code in a sandboxed micro-javascript environment powered by Pyodide

var numbers = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10];
var doubled = numbers.map(n =&amp;gt; n * 2);
console.log(&amp;#39;Doubled: &amp;#39;, doubled);
var evens = numbers.filter(n =&amp;gt; n % 2 === 0);
console.log(&amp;#39;Evens: &amp;#39;, evens);
var sum = numbers.reduce((a, b) =&amp;gt; a + b, 0);
console.log(&amp;#39;Sum: &amp;#39;, sum);

Output 27
Doubled: [2, 4, 6, 8, 10, 12, 14, 16, 18, 20]
Evens: [2, 4, 6, 8, 10]
Sum: 55
Execution time: 8.00ms
About: micro-javascript is a pure Python JavaScript interpreter with configurable memory and time limits. This playground runs entirely in your browser using
Pyodide (Python compiled to WebAssembly). View on GitHub" style="max-width: 100%" /&gt;
  &lt;div&gt;&lt;a style="float: right; text-decoration: none; border-bottom: none; padding-left: 1em;" href="https://simonwillison.net/2026/May/19/5-minute-llms/#5-minutes-llms.009.jpeg"&gt;#&lt;/a&gt;
  &lt;p&gt;One of my projects was a vibe-coded implementation of JavaScript in Python - a loose port of &lt;a href="https://github.com/bellard/mquickjs"&gt;MicroQuickJS&lt;/a&gt; - which I called &lt;a href="https://github.com/simonw/micro-javascript"&gt;micro-javascript&lt;/a&gt;. You can try it out in your browser in &lt;a href="https://simonw.github.io/micro-javascript/playground.html"&gt;this playground&lt;/a&gt;.&lt;/p&gt;
  &lt;/div&gt;
&lt;/div&gt;
&lt;div class="slide" id="5-minutes-llms.010.jpeg"&gt;
  &lt;img loading="lazy" src="https://static.simonwillison.net/static/2026/5-minutes-llms/5-minutes-llms.010.jpeg" alt="JavaScript running in Python running in Pyodide running in WebAssembly running in JavaScript" style="max-width: 100%" /&gt;
  &lt;div&gt;&lt;a style="float: right; text-decoration: none; border-bottom: none; padding-left: 1em;" href="https://simonwillison.net/2026/May/19/5-minute-llms/#5-minutes-llms.010.jpeg"&gt;#&lt;/a&gt;
  &lt;p&gt;That playground demo shows JavaScript code run using my micro-javascript library, in Python, running inside Pyodide, running in WebAssembly, running in JavaScript, running in a browser!&lt;/p&gt;
&lt;p&gt;It's pretty cool! But did anyone out there &lt;em&gt;need&lt;/em&gt; a buggy, slow, insecure half-baked implementation of JavaScript in Python?&lt;/p&gt;
&lt;p&gt;They did not. I have quite a few other projects from that holiday period that I have since quietly retired!&lt;/p&gt;
  &lt;/div&gt;
&lt;/div&gt;
&lt;div class="slide" id="5-minutes-llms.011.jpeg"&gt;
  &lt;img loading="lazy" src="https://static.simonwillison.net/static/2026/5-minutes-llms/5-minutes-llms.011.jpeg" alt="February 2026
" style="max-width: 100%" /&gt;
  &lt;div&gt;&lt;a style="float: right; text-decoration: none; border-bottom: none; padding-left: 1em;" href="https://simonwillison.net/2026/May/19/5-minute-llms/#5-minutes-llms.011.jpeg"&gt;#&lt;/a&gt;
  &lt;p&gt;On to February. Remember that Warelay project that had its first commit at the end of November?&lt;/p&gt;
  &lt;/div&gt;
&lt;/div&gt;
&lt;div class="slide" id="5-minutes-llms.012.jpeg"&gt;
  &lt;img loading="lazy" src="https://static.simonwillison.net/static/2026/5-minutes-llms/5-minutes-llms.012.jpeg" alt="Warelay → CLAWDIS → CLAWDBOT →
Clawdbot → Moltbot →🦞 OpenClaw" style="max-width: 100%" /&gt;
  &lt;div&gt;&lt;a style="float: right; text-decoration: none; border-bottom: none; padding-left: 1em;" href="https://simonwillison.net/2026/May/19/5-minute-llms/#5-minutes-llms.012.jpeg"&gt;#&lt;/a&gt;
  &lt;p&gt;In December and January it had gone through &lt;a href="https://simonwillison.net/2026/May/16/openclaw-names/"&gt;quite a few name changes&lt;/a&gt;... and by February it was taking the world by storm under its final name, &lt;a href="https://openclaw.ai/"&gt;OpenClaw&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The amount of attention it got is pretty astonishing for a project that was less than three months old.&lt;/p&gt;
  &lt;/div&gt;
&lt;/div&gt;
&lt;div class="slide" id="5-minutes-llms.013.jpeg"&gt;
  &lt;img loading="lazy" src="https://static.simonwillison.net/static/2026/5-minutes-llms/5-minutes-llms.013.jpeg" alt="Generic term: Claw
" style="max-width: 100%" /&gt;
  &lt;div&gt;&lt;a style="float: right; text-decoration: none; border-bottom: none; padding-left: 1em;" href="https://simonwillison.net/2026/May/19/5-minute-llms/#5-minutes-llms.013.jpeg"&gt;#&lt;/a&gt;
  &lt;p&gt;OpenClaw is a "personal AI assistant", and we actually got a generic term for these, based on NanoClaw and ZeroClaw and suchlike... they're called &lt;strong&gt;Claws&lt;/strong&gt;.&lt;/p&gt;
  &lt;/div&gt;
&lt;/div&gt;
&lt;div class="slide" id="5-minutes-llms.014.jpeg"&gt;
  &lt;img loading="lazy" src="https://static.simonwillison.net/static/2026/5-minutes-llms/5-minutes-llms.014.jpeg" alt="An aquarium for your Claw
" style="max-width: 100%" /&gt;
  &lt;div&gt;&lt;a style="float: right; text-decoration: none; border-bottom: none; padding-left: 1em;" href="https://simonwillison.net/2026/May/19/5-minute-llms/#5-minutes-llms.014.jpeg"&gt;#&lt;/a&gt;
  &lt;p&gt;Mac Minis started to sell out around Silicon Valley, because people were buying them to run their Claws.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://www.dbreunig.com/"&gt;Drew Breunig&lt;/a&gt; joked to me that this is because they're the new digital pets, and a Mac Mini is the perfect aquarium for your Claw.&lt;/p&gt;
  &lt;/div&gt;
&lt;/div&gt;
&lt;div class="slide" id="5-minutes-llms.015.jpeg"&gt;
  &lt;img loading="lazy" src="https://static.simonwillison.net/static/2026/5-minutes-llms/5-minutes-llms.015.jpeg" alt="Alfred Molina&amp;#39;s Doc Ock in Spider-Man 2, tearing apart a New York subway train with his four claws." style="max-width: 100%" /&gt;
  &lt;div&gt;&lt;a style="float: right; text-decoration: none; border-bottom: none; padding-left: 1em;" href="https://simonwillison.net/2026/May/19/5-minute-llms/#5-minutes-llms.015.jpeg"&gt;#&lt;/a&gt;
  &lt;p&gt;My favourite metaphor for Claws is Alfred Molina's Doc Ock in the 2004 movie Spider-Man 2. His claws were powered by AI, and were perfectly safe provided nothing damaged his inhibitor chip... after which they turned evil and took over.&lt;/p&gt;
  &lt;/div&gt;
&lt;/div&gt;
&lt;div class="slide" id="5-minutes-llms.016.jpeg"&gt;
  &lt;img loading="lazy" src="https://static.simonwillison.net/static/2026/5-minutes-llms/5-minutes-llms.016.jpeg" alt="Gemini 3.1 Pro

A really good illustration of a pelican riding a bicycle.
" style="max-width: 100%" /&gt;
  &lt;div&gt;&lt;a style="float: right; text-decoration: none; border-bottom: none; padding-left: 1em;" href="https://simonwillison.net/2026/May/19/5-minute-llms/#5-minutes-llms.016.jpeg"&gt;#&lt;/a&gt;
  &lt;p&gt;Also in February: Gemini 3.1 Pro came out, and drew me a &lt;em&gt;really good pelican riding a bicycle&lt;/em&gt;. Look at this! It's even got a fish in its basket.&lt;/p&gt;
  &lt;/div&gt;
&lt;/div&gt;
&lt;div class="slide" id="5-minutes-llms.017.jpeg"&gt;
  &lt;img loading="lazy" src="https://static.simonwillison.net/static/2026/5-minutes-llms/5-minutes-llms.017.jpeg" alt="Gemini 3 Pro pelican contrasted with Gemini 3.1 Pro, as animated SVGs" style="max-width: 100%" /&gt;
  &lt;div&gt;&lt;a style="float: right; text-decoration: none; border-bottom: none; padding-left: 1em;" href="https://simonwillison.net/2026/May/19/5-minute-llms/#5-minutes-llms.017.jpeg"&gt;#&lt;/a&gt;
  &lt;p&gt;And then Google's Jeff Dean &lt;a href="https://simonwillison.net/2026/Feb/19/gemini-31-pro/#jeff-dean"&gt;tweeted this video&lt;/a&gt; of an animated pelican riding a bicycle, plus a frog on a penny-farthing and a giraffe driving a tiny car and an ostrich on roller skates and a turtle kickflipping a skateboard and a dachshund driving a stretch limousine.&lt;/p&gt;
&lt;p&gt;So maybe the AI labs have been paying attention after all!&lt;/p&gt;
  &lt;/div&gt;
&lt;/div&gt;
&lt;div class="slide" id="5-minutes-llms.018.jpeg"&gt;
  &lt;img loading="lazy" src="https://static.simonwillison.net/static/2026/5-minutes-llms/5-minutes-llms.018.jpeg" alt="April 2026
" style="max-width: 100%" /&gt;
  &lt;div&gt;&lt;a style="float: right; text-decoration: none; border-bottom: none; padding-left: 1em;" href="https://simonwillison.net/2026/May/19/5-minute-llms/#5-minutes-llms.018.jpeg"&gt;#&lt;/a&gt;
  &lt;p&gt;A lot of stuff happened just in the past month.&lt;/p&gt;
  &lt;/div&gt;
&lt;/div&gt;
&lt;div class="slide" id="5-minutes-llms.019.jpeg"&gt;
  &lt;img loading="lazy" src="https://static.simonwillison.net/static/2026/5-minutes-llms/5-minutes-llms.019.jpeg" alt="Gemma 4 26B-A4B (17.99GB)

A pretty decent pelican riding a bicycle, though the bike is a bit mis-shapen." style="max-width: 100%" /&gt;
  &lt;div&gt;&lt;a style="float: right; text-decoration: none; border-bottom: none; padding-left: 1em;" href="https://simonwillison.net/2026/May/19/5-minute-llms/#5-minutes-llms.019.jpeg"&gt;#&lt;/a&gt;
  &lt;p&gt;Google released the &lt;a href="https://simonwillison.net/2026/Apr/2/gemma-4/"&gt;Gemma 4&lt;/a&gt; series of models, which are the most capable open weight models I've seen from a US company.&lt;/p&gt;
  &lt;/div&gt;
&lt;/div&gt;
&lt;div class="slide" id="5-minutes-llms.020.jpeg"&gt;
  &lt;img loading="lazy" src="https://static.simonwillison.net/static/2026/5-minutes-llms/5-minutes-llms.020.jpeg" alt="GLM-5.1
MIT, 754B parameter, 1.51TB!
" style="max-width: 100%" /&gt;
  &lt;div&gt;&lt;a style="float: right; text-decoration: none; border-bottom: none; padding-left: 1em;" href="https://simonwillison.net/2026/May/19/5-minute-llms/#5-minutes-llms.020.jpeg"&gt;#&lt;/a&gt;
  &lt;p&gt;Also last month, Chinese AI lab GLM came out with &lt;a href="https://simonwillison.net/2026/Apr/7/glm-51/"&gt;GLM-5.1&lt;/a&gt; - an open weight 1.5TB monster! This is a very effective model... if you can afford the hardware to run it.&lt;/p&gt;
  &lt;/div&gt;
&lt;/div&gt;
&lt;div class="slide" id="5-minutes-llms.021.jpeg"&gt;
  &lt;img loading="lazy" src="https://static.simonwillison.net/static/2026/5-minutes-llms/5-minutes-llms.021.jpeg" alt="" style="max-width: 100%" /&gt;
  &lt;div&gt;&lt;a style="float: right; text-decoration: none; border-bottom: none; padding-left: 1em;" href="https://simonwillison.net/2026/May/19/5-minute-llms/#5-minutes-llms.021.jpeg"&gt;#&lt;/a&gt;
  &lt;p&gt;GLM-5.1 drew me this very competent pelican on a bicycle.&lt;/p&gt;
  &lt;/div&gt;
&lt;/div&gt;
&lt;div class="slide" id="5-minutes-llms.022.jpeg"&gt;
  &lt;img loading="lazy" src="https://static.simonwillison.net/static/2026/5-minutes-llms/5-minutes-llms.022.jpeg" alt="The bike is wonky, the pelican is floating." style="max-width: 100%" /&gt;
  &lt;div&gt;&lt;a style="float: right; text-decoration: none; border-bottom: none; padding-left: 1em;" href="https://simonwillison.net/2026/May/19/5-minute-llms/#5-minutes-llms.022.jpeg"&gt;#&lt;/a&gt;
  &lt;p&gt;... though when it &lt;a href="https://gisthost.github.io/?73bb6808b18c2482f66e5f082c75f36e"&gt;tried to animate it&lt;/a&gt; the pelican bounced off toward the top and the bicycle got warped.&lt;/p&gt;
  &lt;/div&gt;
&lt;/div&gt;
&lt;div class="slide" id="5-minutes-llms.023.jpeg"&gt;
  &lt;img loading="lazy" src="https://static.simonwillison.net/static/2026/5-minutes-llms/5-minutes-llms.023.jpeg" alt="Screenshot of Bluesky

Charles
‪@charles.capps.me‬
I think you should pester it with another animal using another method of locomotion. 

Something tells me it was trained for this. I can&amp;#39;t quite put my finger on it. /s

NORTH VIRGINIA OPOSSUM ON AN E-SCOOTER!!" style="max-width: 100%" /&gt;
  &lt;div&gt;&lt;a style="float: right; text-decoration: none; border-bottom: none; padding-left: 1em;" href="https://simonwillison.net/2026/May/19/5-minute-llms/#5-minutes-llms.023.jpeg"&gt;#&lt;/a&gt;
  &lt;p&gt;Charles &lt;a href="https://bsky.app/profile/charles.capps.me/post/3miwrn42mjc2t"&gt;on Bluesky&lt;/a&gt; suggested I try it with a North Virginia Opossum on an E-scooter.&lt;/p&gt;
  &lt;/div&gt;
&lt;/div&gt;
&lt;div class="slide" id="5-minutes-llms.024.jpeg"&gt;
  &lt;img loading="lazy" src="https://static.simonwillison.net/static/2026/5-minutes-llms/5-minutes-llms.024.jpeg" alt="NORTH VIRGINIA OPOSSUM
CRUISING THE COMMONWEALTH SINCE DUSK

And a really cool illustration of a possum." style="max-width: 100%" /&gt;
  &lt;div&gt;&lt;a style="float: right; text-decoration: none; border-bottom: none; padding-left: 1em;" href="https://simonwillison.net/2026/May/19/5-minute-llms/#5-minutes-llms.024.jpeg"&gt;#&lt;/a&gt;
  &lt;p&gt;And it did this! I've tried this on other models and they don't even come close. "Cruising the commonwealth since dusk" is perfect. It's &lt;a href="https://static.simonwillison.net/static/2026/glm-possum-escooter.html"&gt;animated too&lt;/a&gt;.&lt;/p&gt;
  &lt;/div&gt;
&lt;/div&gt;
&lt;div class="slide" id="5-minutes-llms.025.jpeg"&gt;
  &lt;img loading="lazy" src="https://static.simonwillison.net/static/2026/5-minutes-llms/5-minutes-llms.025.jpeg" alt="Qwen3.6-35B-A3B is a 20.9GB file that runs on my laptop

It drew a better pelican on a bicycle than Opus 4.7, which messed up the bicycle frame." style="max-width: 100%" /&gt;
  &lt;div&gt;&lt;a style="float: right; text-decoration: none; border-bottom: none; padding-left: 1em;" href="https://simonwillison.net/2026/May/19/5-minute-llms/#5-minutes-llms.025.jpeg"&gt;#&lt;/a&gt;
  &lt;p&gt;The other neat Chinese open weight models in April came from Qwen. &lt;a href="https://simonwillison.net/2026/Apr/16/qwen-beats-opus/"&gt;Qwen3.6-35B-A3B on my laptop drew me a better pelican than Claude Opus 4.7&lt;/a&gt;. That's a 20.9GB open weights model that runs on my laptop!&lt;/p&gt;
&lt;p&gt;(I think this mainly demonstrates that the pelican on the bicycle has firmly exceeded its limits as a useful benchmark.)&lt;/p&gt;
  &lt;/div&gt;
&lt;/div&gt;
&lt;div class="slide" id="5-minutes-llms.026.jpeg"&gt;
  &lt;img loading="lazy" src="https://static.simonwillison.net/static/2026/5-minutes-llms/5-minutes-llms.026.jpeg" alt="Claude Sonnet 4.5 pelican for comparison." style="max-width: 100%" /&gt;
  &lt;div&gt;&lt;a style="float: right; text-decoration: none; border-bottom: none; padding-left: 1em;" href="https://simonwillison.net/2026/May/19/5-minute-llms/#5-minutes-llms.026.jpeg"&gt;#&lt;/a&gt;
  &lt;p&gt;Here's that Claude Sonnet 4.5 pelican from September for comparison.&lt;/p&gt;
  &lt;/div&gt;
&lt;/div&gt;
&lt;div class="slide" id="5-minutes-llms.027.jpeg"&gt;
  &lt;img loading="lazy" src="https://static.simonwillison.net/static/2026/5-minutes-llms/5-minutes-llms.027.jpeg" alt="The themes of the past 6 months:
Coding agents got really good
Local models wildly outperform expectations
" style="max-width: 100%" /&gt;
  &lt;div&gt;&lt;a style="float: right; text-decoration: none; border-bottom: none; padding-left: 1em;" href="https://simonwillison.net/2026/May/19/5-minute-llms/#5-minutes-llms.027.jpeg"&gt;#&lt;/a&gt;
  &lt;p&gt;So those were the two main themes of the past six months. The coding agents got really good... and the laptop-available models, while a lot weaker than the frontier, have started wildly outperforming expectations.&lt;/p&gt;
  &lt;/div&gt;
&lt;/div&gt;&lt;p&gt;&lt;em&gt;You are only seeing the long-form articles from my blog. Subscribe to &lt;a href="https://simonwillison.net/atom/everything/"&gt;/atom/everything/&lt;/a&gt; to get all of my posts, or take a look at my &lt;a href="https://simonwillison.net/about/#subscribe"&gt;other subscription options&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;</summary><category term="lightning-talks"/><category term="pycon"/><category term="speaking"/><category term="ai"/><category term="generative-ai"/><category term="local-llms"/><category term="llms"/><category term="annotated-talks"/><category term="pelican-riding-a-bicycle"/><category term="coding-agents"/></entry><entry><title>Notes on the xAI/Anthropic data center deal</title><link href="https://simonwillison.net/2026/May/7/xai-anthropic/#atom-entries" rel="alternate"/><published>2026-05-07T17:09:28+00:00</published><updated>2026-05-07T17:09:28+00:00</updated><id>https://simonwillison.net/2026/May/7/xai-anthropic/#atom-entries</id><summary type="html">&lt;p&gt;There weren't a lot of big new announcements from Anthropic at yesterday's Code w/ Claude event, but the biggest by far was the deal they've struck with SpaceX/xAI to use "all of the capacity of their Colossus data center".&lt;/p&gt;
&lt;p&gt;As I mentioned in my &lt;a href="https://simonwillison.net/2026/May/6/code-w-claude-2026/"&gt;live blog of the keynote&lt;/a&gt;, that's the one with the &lt;a href="https://www.politico.com/news/2025/05/06/elon-musk-xai-memphis-gas-turbines-air-pollution-permits-00317582"&gt;particularly bad environmental record&lt;/a&gt;. The gas turbines installed to power the facility initially ran without Clean Air Act permits or pollution control devices, which they got away with by classifying them as "temporary". Credible reports link it to increases in hospital admissions relating to low air quality.&lt;/p&gt;
&lt;p&gt;Andy Masley, one of the most prolific voices pushing back against misleading rhetoric about data centers (see &lt;a href="https://blog.andymasley.com/p/the-ai-water-issue-is-fake"&gt;The AI water issue is fake&lt;/a&gt; and &lt;a href="https://blog.andymasley.com/p/data-center-land-use-issues-are-fake"&gt;Data center land issues are fake&lt;/a&gt;), had &lt;a href="https://x.com/andymasley/status/2052070252930826384"&gt;this to say&lt;/a&gt; about Colossus:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;I would simply not run my computing out of this specific data center&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I get that Anthropic are severely compute-constrained, but in a world where the very existence of "AI data centers" is a red-hot political issue (see recent &lt;a href="https://kutv.com/news/local/amid-boos-box-elder-county-commission-unanimously-approves-plan-for-massive-data-center"&gt;news out of Utah&lt;/a&gt; for a fresh example), signing up with this particular data center is a really bad look.&lt;/p&gt;
&lt;p&gt;There was a lot of initial chatter about how this meant xAI were clearly giving up on their own Grok models, since all of their capacity would be sold to Anthropic instead. That was a misconception - Anthropic are getting Colossus 1, but xAI are keeping their larger Colossus 2 data center for their own work.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;Update 11th June&lt;/strong&gt;: Anthropic later turned out to have booked capacity &lt;a href="https://www.anthropic.com/news/series-h"&gt;in Colossus 2 as well&lt;/a&gt;, and &lt;a href="https://techcrunch.com/2026/06/05/google-will-pay-spacex-920m-per-month-for-compute/"&gt;Google also bought capacity&lt;/a&gt; from xAI.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;As an interesting side note, the night before the Anthropic announcement, xAI sent out a deprecation notice for Grok 4.1 Fast and several other models, giving less than two weeks' notice before shutdown, as reported &lt;a href="https://twitter.com/xlr8harder/status/2051901091906834439"&gt;by @xlr8harder&lt;/a&gt; from SpeechMap:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2026/grok-fast-shutdown.png" alt="Effective May 15, 2026 at 12:00pm PT, the following models will be retired from the xAI API: grok-4-1-fast-reasoning, grok-4-1-fast-non-reasoning, grok-4-fast-reasoning, grok-4-fast-non-reasoning, grok-4-0709, grok-code-fast-1, grok-3, grok-imagine-image-pro. After May 15, 2026, requests to these models will no longer work." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;This is terrible @xai. I just spent time and money to migrate to grok 4.1 fast, and you're disabling it with less than two weeks notice, after releasing it in November, with no migration path to a fast/cheap alternative.&lt;/p&gt;
&lt;p&gt;I will never depend on one of your products again.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Here's &lt;a href="https://speechmap.substack.com/p/speechmap-update-xai-loses-top-spot"&gt;SpeechMap's detailed explanation&lt;/a&gt; of how they selected Grok 4.1 Fast for their project in March.&lt;/p&gt;
&lt;p&gt;Were xAI serving those models out of Colossus 1?&lt;/p&gt;
&lt;p&gt;xAI owner Elon Musk (who previously delighted in calling Anthropic &lt;a href="https://twitter.com/search?q=from%3Aelonmusk+misanthropic&amp;amp;src=typed_query&amp;amp;f=live"&gt;"Misanthropic"&lt;/a&gt;) &lt;a href="https://twitter.com/elonmusk/status/2052069691372478511"&gt;tweeted&lt;/a&gt; the following:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;By way of background for those who care, I spent a lot of time last week with senior members of the Anthropic team to understand what they do to ensure Claude is good for humanity and was impressed. [...]&lt;/p&gt;
&lt;p&gt;After that, I was ok leasing Colossus 1 to Anthropic, as SpaceXAI had already moved training to Colossus 2.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;And then &lt;a href="https://twitter.com/elonmusk/status/2052076315306864756"&gt;shortly afterwards&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Just as SpaceX launches hundreds of satellites for competitors with fair terms and pricing, we will provide compute to AI companies that are taking the right steps to ensure it is good for humanity.&lt;/p&gt;
&lt;p&gt;We reserve the right to reclaim the compute if their AI engages in actions that harm humanity.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Presumably the criteria for "harm humanity" are decided by Elon himself. Sounds like a new form of supply chain risk for Anthropic to me!&lt;/p&gt;&lt;p&gt;&lt;em&gt;You are only seeing the long-form articles from my blog. Subscribe to &lt;a href="https://simonwillison.net/atom/everything/"&gt;/atom/everything/&lt;/a&gt; to get all of my posts, or take a look at my &lt;a href="https://simonwillison.net/about/#subscribe"&gt;other subscription options&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;</summary><category term="ai"/><category term="llms"/><category term="anthropic"/><category term="ai-ethics"/><category term="ai-energy-usage"/><category term="xai"/><category term="andy-masley"/></entry><entry><title>Live blog: Code w/ Claude 2026</title><link href="https://simonwillison.net/2026/May/6/code-w-claude-2026/#atom-entries" rel="alternate"/><published>2026-05-06T15:58:27+00:00</published><updated>2026-05-06T15:58:27+00:00</updated><id>https://simonwillison.net/2026/May/6/code-w-claude-2026/#atom-entries</id><summary type="html">&lt;p&gt;I'm at Anthropic's Code w/ Claude event today. Here's my live blog of the morning keynote sessions.&lt;/p&gt;&lt;p&gt;&lt;em&gt;You are only seeing the long-form articles from my blog. 
Subscribe to &lt;a href="https://simonwillison.net/atom/everything/"&gt;/atom/everything/&lt;/a&gt; to get all of my posts, or take a look at my &lt;a href="https://simonwillison.net/about/#subscribe"&gt;other subscription options&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;</summary><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="anthropic"/><category term="claude"/><category term="claude-code"/><category term="live-blog"/></entry><entry><title>Vibe coding and agentic engineering are getting closer than I'd like</title><link href="https://simonwillison.net/2026/May/6/vibe-coding-and-agentic-engineering/#atom-entries" rel="alternate"/><published>2026-05-06T14:24:08+00:00</published><updated>2026-05-06T14:24:08+00:00</updated><id>https://simonwillison.net/2026/May/6/vibe-coding-and-agentic-engineering/#atom-entries</id><summary type="html">&lt;p&gt;I recently talked with Joseph Ruscio about AI coding tools for Heavybit's High Leverage podcast: &lt;a href="https://www.heavybit.com/library/podcasts/high-leverage/ep-9-the-ai-coding-paradigm-shift-with-simon-willison"&gt;Ep. #9, The AI Coding Paradigm Shift with Simon Willison&lt;/a&gt;. Here are some of my highlights, including my disturbing realization that vibe coding and agentic engineering have started to converge in my own work.&lt;/p&gt;
&lt;p&gt;One thing I really enjoy about podcasts is that they sometimes push me to think out loud in a way that exposes an idea I've not previously been able to put into words.&lt;/p&gt;
&lt;h4 id="vibe-coding-and-agentic-engineering-are-starting-to-overlap"&gt;Vibe coding and agentic engineering are starting to overlap&lt;/h4&gt;
&lt;p&gt;A few weeks after vibe coding was first coined I published &lt;a href="https://simonwillison.net/2025/Mar/19/vibe-coding/"&gt;Not all AI-assisted programming is vibe coding (but vibe coding rocks)&lt;/a&gt;, where I firmly staked out my belief that "vibe coding" is a very different beast from responsible use of AI to write code, which I've since started to call &lt;a href="https://simonwillison.net/guides/agentic-engineering-patterns/what-is-agentic-engineering/"&gt;agentic engineering&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;When Joseph brought up the distinction between the two I had a sudden realization that they're not nearly as distinct for me as they used to be:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Weirdly though, those things have started to blur for me already, which is quite upsetting.&lt;/p&gt;
&lt;p&gt;I thought we had a very clear delineation where vibe coding is the thing where you're not looking at the code at all. You might not even know how to program. You might be a non-programmer who asks for a thing, and gets a thing, and if the thing works, then great! And if it doesn't, you tell it that it doesn't work and cross your fingers.&lt;/p&gt;
&lt;p&gt;But at no point are you really caring about the code quality or any of those additional constraints. And my take on vibe coding was that it's fantastic, provided you understand when it can be used and when it can't.&lt;/p&gt;
&lt;p&gt;A personal tool for you, where if there's a bug it hurts only you, go ahead!&lt;/p&gt;
&lt;p&gt;If you're building software for other people, vibe coding is grossly irresponsible because it's other people's information. Other people get hurt by your stupid bugs. You need to have a higher level than that.&lt;/p&gt;
&lt;p&gt;This contrasts with agentic engineering where you are a professional software engineer. You understand security and maintainability and operations and performance and so forth. You're using these tools to the highest of your own ability. I'm finding the scope of challenges I can take on has gone up by a significant amount because I've got the support of these tools.&lt;/p&gt;
&lt;p&gt;But I'm still leaning on my 25 years of experience as a software engineer.&lt;/p&gt;
&lt;p&gt;The goal is to build high quality production systems: if you're building lower quality stuff faster, I think that's bad. I want to build &lt;em&gt;higher&lt;/em&gt; quality stuff faster. I want everything I'm building to be better in every way than it was before.&lt;/p&gt;
&lt;p&gt;The problem is that as the coding agents get more reliable, I'm not reviewing every line of code that they write anymore, even for my production level stuff.&lt;/p&gt;
&lt;p&gt;I know full well that if you ask Claude Code to build a JSON API endpoint that runs a SQL query and outputs the results as JSON, it's just going to do it right. It's not going to mess that up. You have it add automated tests, you have it add documentation, you know it's going to be good.&lt;/p&gt;
&lt;p&gt;But I'm not reviewing that code. And now I've got that feeling of guilt: if I haven't reviewed the code, is it really responsible for me to use this in production?&lt;/p&gt;
&lt;p&gt;The thing that really helps me is thinking back to when I've worked at larger organizations where I've been an engineering manager. Other teams are building software that my team depends on.&lt;/p&gt;
&lt;p&gt;If another team hands over something and says, "hey, this is the image resize service, here's how to use it to resize your images"... I'm not going to go and read every line of code that they wrote.&lt;/p&gt;
&lt;p&gt;I'm going to look at their documentation and I'm going to use it to resize some images. And then I'm going to start shipping my own features. And if I start running into problems where the image resizer thing appears to have bugs or the performance isn't good, that's when I might dig into their Git repositories and see what's going on. But for the most part I treat that as a semi-black box that I don't look at until I need to.&lt;/p&gt;
&lt;p&gt;I'm starting to treat the agents in the same way. And it still feels uncomfortable, because human beings are accountable for what they do. A team can build a reputation. I can say "I trust that team over there. They built good software in the past. They're not going to build something rubbish because that affects their professional reputations."&lt;/p&gt;
&lt;p&gt;Claude Code does not have a professional reputation! It can't take accountability for what it's done. But it's been proving itself anyway - time and time again it's churning out straightforward things and doing them right in the style that I like.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;There's an element of &lt;a href="https://simonwillison.net/2025/Dec/10/normalization-of-deviance/"&gt;the normalization of deviance&lt;/a&gt; here - every time a model turns out to have written the right code without me monitoring it closely there's a risk that I'll trust it at the wrong moment in the future and get burned.&lt;/p&gt;
&lt;h4 id="the-new-challenge-of-evaluating-software"&gt;The new challenge of evaluating software&lt;/h4&gt;
&lt;blockquote&gt;
&lt;p&gt;It used to be if you found a GitHub repository with a hundred commits and a good readme and automated tests and stuff, you could be pretty sure that the person writing that had put a lot of care and attention into that project.&lt;/p&gt;
&lt;p&gt;And now I can knock out a git repository with a hundred commits and a beautiful readme and comprehensive tests of every line of code in half an hour! It looks identical to those projects that have had a great deal of care and attention. Maybe it is as good as them. I don't know. I can't tell from looking at it. Even for my &lt;em&gt;own&lt;/em&gt; projects, I can't tell.&lt;/p&gt;
&lt;p&gt;So I realized what I value more than the quality of the tests and documentation is that I want somebody to have &lt;em&gt;used&lt;/em&gt; the thing. If you've got a vibe coded thing which you have used every day for the past two weeks, that's much more valuable to me than something that you've just spat out and hardly even exercised.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h4 id="the-bottlenecks-have-shifted"&gt;The bottlenecks have shifted&lt;/h4&gt;
&lt;blockquote&gt;
&lt;p&gt;If you can go from producing 200 lines of code a day to 2,000 lines of code a day, what else breaks? The entire software development lifecycle was, it turns out, designed around the idea that it takes a day to produce a few hundred lines of code. And now it doesn't.&lt;/p&gt;
&lt;p&gt;It's not just the downstream stuff, it's the upstream stuff as well. I saw &lt;a href="https://simonwillison.net/2026/Jan/24/dont-trust-the-process/"&gt;a great talk by Jenny Wen&lt;/a&gt;, who's the design leader at Anthropic, where she said we have all of these design processes that are based around the idea that you need to get the design &lt;em&gt;right&lt;/em&gt; - because if you hand it off to the engineers and they spend three months building the wrong thing, that's catastrophic.&lt;/p&gt;
&lt;p&gt;There's this whole very extensive design process that you put in place because that design results in expensive work. But if it doesn't take three months to build, maybe the design process can be a whole lot riskier because cost, if you get something wrong, has been reduced so much.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h4 id="why-i-m-still-not-afraid-for-my-career"&gt;Why I'm still not afraid for my career&lt;/h4&gt;
&lt;blockquote&gt;
&lt;p&gt;When I look at my conversations with the agents, it's very clear to me that this is moon language for the vast majority of human beings.&lt;/p&gt;
&lt;p&gt;There are a whole bunch of reasons I'm not scared that my career as a software engineer is over now that computers can write their own code, partly because these things are amplifiers of existing experience. If you know what you're doing, you can run so much faster with them. [...]&lt;/p&gt;
&lt;p&gt;I'm constantly reminded as I work with these tools how hard the thing that we do is. Producing software is a &lt;em&gt;ferociously&lt;/em&gt; difficult thing to do. And you could give me all of the AI tools in the world and what we're trying to achieve here is still really difficult. [...]&lt;/p&gt;
&lt;p&gt;Matthew Yglesias, who's a political commentator, yesterday &lt;a href="https://twitter.com/mattyglesias/status/2049105745132585161"&gt;tweeted&lt;/a&gt;, "Five months in, I think I've decided that I don't want to vibecode — I want professionally managed software companies to use AI coding assistance to make more/better/cheaper software products that they sell to me for money." And that feels about right to me. I can plumb my house if I watch enough YouTube videos on plumbing. I would rather hire a plumber.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;On the threat to SaaS providers of companies rolling their own solutions instead:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;I just realized it's the thing I said earlier about how I only want to use your side project if you've used it for a few weeks. The enterprise version of that is I don't want a CRM unless at least two other giant enterprises have successfully used that CRM for six months. [...] You want solutions that are proven to work before you take a risk on them.&lt;/p&gt;
&lt;/blockquote&gt;&lt;p&gt;&lt;em&gt;You are only seeing the long-form articles from my blog. Subscribe to &lt;a href="https://simonwillison.net/atom/everything/"&gt;/atom/everything/&lt;/a&gt; to get all of my posts, or take a look at my &lt;a href="https://simonwillison.net/about/#subscribe"&gt;other subscription options&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;</summary><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="podcast-appearances"/><category term="vibe-coding"/><category term="coding-agents"/><category term="agentic-engineering"/></entry><entry><title>LLM 0.32a0 is a major backwards-compatible refactor</title><link href="https://simonwillison.net/2026/Apr/29/llm/#atom-entries" rel="alternate"/><published>2026-04-29T19:01:47+00:00</published><updated>2026-04-29T19:01:47+00:00</updated><id>https://simonwillison.net/2026/Apr/29/llm/#atom-entries</id><summary type="html">&lt;p&gt;I just released &lt;a href="https://llm.datasette.io/en/latest/changelog.html#a0-2026-04-28"&gt;LLM 0.32a0&lt;/a&gt;, an alpha release of my &lt;a href="https://llm.datasette.io/"&gt;LLM&lt;/a&gt; Python library and CLI tool for accessing LLMs, with some consequential changes that I've been working towards for quite a while.&lt;/p&gt;
&lt;p&gt;Previous versions of LLM modeled the world in terms of prompts and responses. Send the model a text prompt, get back a text response.&lt;/p&gt;
&lt;pre&gt;&lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-s1"&gt;llm&lt;/span&gt;

&lt;span class="pl-s1"&gt;model&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;llm&lt;/span&gt;.&lt;span class="pl-c1"&gt;get_model&lt;/span&gt;(&lt;span class="pl-s"&gt;"gpt-5.5"&lt;/span&gt;)
&lt;span class="pl-s1"&gt;response&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;model&lt;/span&gt;.&lt;span class="pl-c1"&gt;prompt&lt;/span&gt;(&lt;span class="pl-s"&gt;"Capital of France?"&lt;/span&gt;)
&lt;span class="pl-en"&gt;print&lt;/span&gt;(&lt;span class="pl-s1"&gt;response&lt;/span&gt;.&lt;span class="pl-c1"&gt;text&lt;/span&gt;())&lt;/pre&gt;
&lt;p&gt;This made sense when I started working on the library back in April 2023. A lot has changed since then!&lt;/p&gt;
&lt;p&gt;LLM provides an abstraction over thousands of different models via its &lt;a href="https://llm.datasette.io/en/stable/plugins/index.html"&gt;plugin system&lt;/a&gt;. The original abstraction - of text input that returns text output - was no longer able to represent everything I needed it to.&lt;/p&gt;
&lt;p&gt;Over time LLM itself has grown &lt;a href="https://simonwillison.net/2024/Oct/29/llm-multi-modal/"&gt;attachments&lt;/a&gt; to handle image, audio, and video input, then &lt;a href="https://simonwillison.net/2025/Feb/28/llm-schemas/"&gt;schemas&lt;/a&gt; for outputting structured JSON, then &lt;a href="https://simonwillison.net/2025/May/27/llm-tools/"&gt;tools&lt;/a&gt; for executing tool calls. Meanwhile LLMs kept evolving, adding reasoning support and the ability to return images and all kinds of other interesting capabilities.&lt;/p&gt;
&lt;p&gt;LLM needs to evolve to better handle the diversity of input and output types that can be processed by today's frontier models.&lt;/p&gt;
&lt;p&gt;The 0.32a0 alpha has two key changes: model inputs can be represented as a sequence of messages, and model responses can be composed of a stream of differently typed parts.&lt;/p&gt;
&lt;h4 id="prompts-as-a-sequence-of-messages"&gt;Prompts as a sequence of messages&lt;/h4&gt;
&lt;p&gt;LLMs accept input as text, but ever since ChatGPT demonstrated the value of a two-way conversational interface, the most common way to prompt them has been to treat that input as a sequence of conversational turns.&lt;/p&gt;
&lt;p&gt;The first turn might look like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;user: Capital of France?
assistant: 
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;(The model then gets to fill out the reply from the assistant.)&lt;/p&gt;
&lt;p&gt;But each subsequent turn needs to replay the entire conversation up to that point, as a sort of screenplay:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;user: Capital of France?
assistant: Paris
user: Germany?
assistant:
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Most of the JSON APIs from the major vendors follow this pattern. Here's what the above looks like using the OpenAI chat completions API, which has been widely imitated by other providers:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;curl https://api.openai.com/v1/chat/completions \
  -H &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Authorization: Bearer &lt;span class="pl-smi"&gt;$OPENAI_API_KEY&lt;/span&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; \
  -H &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Content-Type: application/json&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; \
  -d &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;{&lt;/span&gt;
&lt;span class="pl-s"&gt;    "model": "gpt-5.5",&lt;/span&gt;
&lt;span class="pl-s"&gt;    "messages": [&lt;/span&gt;
&lt;span class="pl-s"&gt;      {&lt;/span&gt;
&lt;span class="pl-s"&gt;        "role": "user",&lt;/span&gt;
&lt;span class="pl-s"&gt;        "content": "Capital of France?"&lt;/span&gt;
&lt;span class="pl-s"&gt;      },&lt;/span&gt;
&lt;span class="pl-s"&gt;      {&lt;/span&gt;
&lt;span class="pl-s"&gt;        "role": "assistant",&lt;/span&gt;
&lt;span class="pl-s"&gt;        "content": "Paris"&lt;/span&gt;
&lt;span class="pl-s"&gt;      },&lt;/span&gt;
&lt;span class="pl-s"&gt;      {&lt;/span&gt;
&lt;span class="pl-s"&gt;        "role": "user",&lt;/span&gt;
&lt;span class="pl-s"&gt;        "content": "Germany?"&lt;/span&gt;
&lt;span class="pl-s"&gt;      }&lt;/span&gt;
&lt;span class="pl-s"&gt;    ]&lt;/span&gt;
&lt;span class="pl-s"&gt;  }&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Prior to 0.32, LLM modeled these as conversations:&lt;/p&gt;
&lt;pre&gt;&lt;span class="pl-s1"&gt;model&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;llm&lt;/span&gt;.&lt;span class="pl-c1"&gt;get_model&lt;/span&gt;(&lt;span class="pl-s"&gt;"gpt-5.5"&lt;/span&gt;)

&lt;span class="pl-s1"&gt;conversation&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;model&lt;/span&gt;.&lt;span class="pl-c1"&gt;conversation&lt;/span&gt;()
&lt;span class="pl-s1"&gt;r1&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;conversation&lt;/span&gt;.&lt;span class="pl-c1"&gt;prompt&lt;/span&gt;(&lt;span class="pl-s"&gt;"Capital of France?"&lt;/span&gt;)
&lt;span class="pl-en"&gt;print&lt;/span&gt;(&lt;span class="pl-s1"&gt;r1&lt;/span&gt;.&lt;span class="pl-c1"&gt;text&lt;/span&gt;())
&lt;span class="pl-c"&gt;# Outputs "Paris"&lt;/span&gt;

&lt;span class="pl-s1"&gt;r2&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;conversation&lt;/span&gt;.&lt;span class="pl-c1"&gt;prompt&lt;/span&gt;(&lt;span class="pl-s"&gt;"Germany?"&lt;/span&gt;)
&lt;span class="pl-en"&gt;print&lt;/span&gt;(&lt;span class="pl-s1"&gt;r2&lt;/span&gt;.&lt;span class="pl-c1"&gt;text&lt;/span&gt;())
&lt;span class="pl-c"&gt;# Outputs "Berlin"&lt;/span&gt;&lt;/pre&gt;
&lt;p&gt;This worked if you were building a conversation with the model from scratch, but it didn't provide a way to feed in a previous conversation from the start. This made tasks like building an emulation of the OpenAI chat completions API much harder than they should have been.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;llm&lt;/code&gt; CLI tool worked around this through a custom mechanism for persisting and inflating conversations using SQLite, but that never became a stable part of the LLM API - and there are many places you might want to use the Python library without committing to SQLite as the storage layer.&lt;/p&gt;
&lt;p&gt;The new alpha supports feeding in a previous conversation directly:&lt;/p&gt;
&lt;pre&gt;&lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-s1"&gt;llm&lt;/span&gt;
&lt;span class="pl-k"&gt;from&lt;/span&gt; &lt;span class="pl-s1"&gt;llm&lt;/span&gt; &lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-s1"&gt;user&lt;/span&gt;, &lt;span class="pl-s1"&gt;assistant&lt;/span&gt;

&lt;span class="pl-s1"&gt;model&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;llm&lt;/span&gt;.&lt;span class="pl-c1"&gt;get_model&lt;/span&gt;(&lt;span class="pl-s"&gt;"gpt-5.5"&lt;/span&gt;)

&lt;span class="pl-s1"&gt;response&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;model&lt;/span&gt;.&lt;span class="pl-c1"&gt;prompt&lt;/span&gt;(&lt;span class="pl-s1"&gt;messages&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;[
    &lt;span class="pl-en"&gt;user&lt;/span&gt;(&lt;span class="pl-s"&gt;"Capital of France?"&lt;/span&gt;),
    &lt;span class="pl-en"&gt;assistant&lt;/span&gt;(&lt;span class="pl-s"&gt;"Paris"&lt;/span&gt;),
    &lt;span class="pl-en"&gt;user&lt;/span&gt;(&lt;span class="pl-s"&gt;"Germany?"&lt;/span&gt;),
])
&lt;span class="pl-en"&gt;print&lt;/span&gt;(&lt;span class="pl-s1"&gt;response&lt;/span&gt;.&lt;span class="pl-c1"&gt;text&lt;/span&gt;())&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;llm.user()&lt;/code&gt; and &lt;code&gt;llm.assistant()&lt;/code&gt; functions are new builder functions designed to be used within that &lt;code&gt;messages=[]&lt;/code&gt; array.&lt;/p&gt;
&lt;p&gt;The previous &lt;code&gt;prompt=&lt;/code&gt; option still works, but LLM upgrades it to a single-item messages array behind the scenes.&lt;/p&gt;
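&lt;p&gt;As a rough illustration of that kind of upgrade, here is a minimal sketch. It is not LLM's actual implementation: the &lt;code&gt;Message&lt;/code&gt; dataclass and &lt;code&gt;normalize_input()&lt;/code&gt; function are hypothetical, and rejecting calls that pass both arguments is a simplification made for this sketch only:&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass
class Message:
    # Hypothetical stand-in for one conversational turn; not LLM's real class
    role: str
    content: str


def normalize_input(prompt=None, messages=None):
    """Return a list of Message objects from either calling style."""
    if prompt is not None and messages is not None:
        # Simplification for this sketch only
        raise ValueError("Pass either prompt or messages, not both")
    if messages is not None:
        return list(messages)
    if prompt is None:
        raise ValueError("Pass a prompt or a list of messages")
    # A bare prompt becomes a single user turn
    return [Message(role="user", content=prompt)]


print(normalize_input(prompt="Capital of France?"))
```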
&lt;p&gt;You can also now &lt;em&gt;reply&lt;/em&gt; to a response, as an alternative to building a conversation:&lt;/p&gt;
&lt;pre&gt;&lt;span class="pl-s1"&gt;response2&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;response&lt;/span&gt;.&lt;span class="pl-c1"&gt;reply&lt;/span&gt;(&lt;span class="pl-s"&gt;"How about Hungary?"&lt;/span&gt;)
&lt;span class="pl-en"&gt;print&lt;/span&gt;(&lt;span class="pl-s1"&gt;response2&lt;/span&gt;) &lt;span class="pl-c"&gt;# Default __str__() calls .text()&lt;/span&gt;&lt;/pre&gt;
&lt;h4 id="streaming-parts"&gt;Streaming parts&lt;/h4&gt;
&lt;p&gt;The other major new interface in the alpha concerns streaming results back from a prompt.&lt;/p&gt;
&lt;p&gt;Previously, LLM supported streaming like this:&lt;/p&gt;
&lt;pre&gt;&lt;span class="pl-s1"&gt;response&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;model&lt;/span&gt;.&lt;span class="pl-c1"&gt;prompt&lt;/span&gt;(&lt;span class="pl-s"&gt;"Generate an SVG of a pelican riding a bicycle"&lt;/span&gt;)
&lt;span class="pl-k"&gt;for&lt;/span&gt; &lt;span class="pl-s1"&gt;chunk&lt;/span&gt; &lt;span class="pl-c1"&gt;in&lt;/span&gt; &lt;span class="pl-s1"&gt;response&lt;/span&gt;:
    &lt;span class="pl-en"&gt;print&lt;/span&gt;(&lt;span class="pl-s1"&gt;chunk&lt;/span&gt;, &lt;span class="pl-s1"&gt;end&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-s"&gt;""&lt;/span&gt;)&lt;/pre&gt;
&lt;p&gt;Or this async variant:&lt;/p&gt;
&lt;pre&gt;&lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-s1"&gt;asyncio&lt;/span&gt;
&lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-s1"&gt;llm&lt;/span&gt;

&lt;span class="pl-s1"&gt;model&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;llm&lt;/span&gt;.&lt;span class="pl-c1"&gt;get_async_model&lt;/span&gt;(&lt;span class="pl-s"&gt;"gpt-5.5"&lt;/span&gt;)
&lt;span class="pl-s1"&gt;response&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;model&lt;/span&gt;.&lt;span class="pl-c1"&gt;prompt&lt;/span&gt;(&lt;span class="pl-s"&gt;"Generate an SVG of a pelican riding a bicycle"&lt;/span&gt;)

&lt;span class="pl-k"&gt;async&lt;/span&gt; &lt;span class="pl-k"&gt;def&lt;/span&gt; &lt;span class="pl-en"&gt;run&lt;/span&gt;():
    &lt;span class="pl-k"&gt;async&lt;/span&gt; &lt;span class="pl-k"&gt;for&lt;/span&gt; &lt;span class="pl-s1"&gt;chunk&lt;/span&gt; &lt;span class="pl-c1"&gt;in&lt;/span&gt; &lt;span class="pl-s1"&gt;response&lt;/span&gt;:
        &lt;span class="pl-en"&gt;print&lt;/span&gt;(&lt;span class="pl-s1"&gt;chunk&lt;/span&gt;, &lt;span class="pl-s1"&gt;end&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-s"&gt;""&lt;/span&gt;, &lt;span class="pl-s1"&gt;flush&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-c1"&gt;True&lt;/span&gt;)

&lt;span class="pl-s1"&gt;asyncio&lt;/span&gt;.&lt;span class="pl-c1"&gt;run&lt;/span&gt;(&lt;span class="pl-en"&gt;run&lt;/span&gt;())&lt;/pre&gt;
&lt;p&gt;Many of today's models return mixed types of content. A prompt run against Claude might return reasoning output, then text, then a JSON request for a tool call, then more text content.&lt;/p&gt;
&lt;p&gt;Some models can even execute tools on the server-side, for example OpenAI's &lt;a href="https://developers.openai.com/api/docs/guides/tools-code-interpreter?lang=curl"&gt;code interpreter tool&lt;/a&gt; or Anthropic's &lt;a href="https://platform.claude.com/docs/en/agents-and-tools/tool-use/web-search-tool"&gt;web search&lt;/a&gt;. This means the results from the model can combine text, tool calls, tool outputs and other formats.&lt;/p&gt;
&lt;p&gt;Multi-modal output models are starting to emerge too, which can return images or even &lt;a href="https://developers.openai.com/api/docs/guides/audio#add-audio-to-your-existing-application"&gt;snippets of audio&lt;/a&gt; intermixed into that streaming response.&lt;/p&gt;
&lt;p&gt;The new LLM alpha models these as a stream of typed message parts. Here's what that looks like to a consumer of the Python API:&lt;/p&gt;
&lt;pre&gt;&lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-s1"&gt;asyncio&lt;/span&gt;
&lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-s1"&gt;llm&lt;/span&gt;

&lt;span class="pl-s1"&gt;model&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;llm&lt;/span&gt;.&lt;span class="pl-c1"&gt;get_model&lt;/span&gt;(&lt;span class="pl-s"&gt;"gpt-5.5"&lt;/span&gt;)
&lt;span class="pl-s1"&gt;prompt&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s"&gt;"invent 3 cool dogs, first talk about your motivations"&lt;/span&gt;

&lt;span class="pl-k"&gt;def&lt;/span&gt; &lt;span class="pl-en"&gt;describe_dog&lt;/span&gt;(&lt;span class="pl-s1"&gt;name&lt;/span&gt;: &lt;span class="pl-smi"&gt;str&lt;/span&gt;, &lt;span class="pl-s1"&gt;bio&lt;/span&gt;: &lt;span class="pl-smi"&gt;str&lt;/span&gt;) &lt;span class="pl-c1"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="pl-smi"&gt;str&lt;/span&gt;:
    &lt;span class="pl-s"&gt;"""Record the name and biography of a hypothetical dog."""&lt;/span&gt;
    &lt;span class="pl-k"&gt;return&lt;/span&gt; &lt;span class="pl-s"&gt;f"&lt;span class="pl-s1"&gt;&lt;span class="pl-kos"&gt;{&lt;/span&gt;&lt;span class="pl-s1"&gt;name&lt;/span&gt;&lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;/span&gt;: &lt;span class="pl-s1"&gt;&lt;span class="pl-kos"&gt;{&lt;/span&gt;&lt;span class="pl-s1"&gt;bio&lt;/span&gt;&lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;/span&gt;"&lt;/span&gt;

&lt;span class="pl-k"&gt;def&lt;/span&gt; &lt;span class="pl-en"&gt;sync_example&lt;/span&gt;():
    &lt;span class="pl-s1"&gt;response&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;model&lt;/span&gt;.&lt;span class="pl-c1"&gt;prompt&lt;/span&gt;(
        &lt;span class="pl-s1"&gt;prompt&lt;/span&gt;,
        &lt;span class="pl-s1"&gt;tools&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;[&lt;span class="pl-s1"&gt;describe_dog&lt;/span&gt;],
    )
    &lt;span class="pl-k"&gt;for&lt;/span&gt; &lt;span class="pl-s1"&gt;event&lt;/span&gt; &lt;span class="pl-c1"&gt;in&lt;/span&gt; &lt;span class="pl-s1"&gt;response&lt;/span&gt;.&lt;span class="pl-c1"&gt;stream_events&lt;/span&gt;():
        &lt;span class="pl-k"&gt;if&lt;/span&gt; &lt;span class="pl-s1"&gt;event&lt;/span&gt;.&lt;span class="pl-c1"&gt;type&lt;/span&gt; &lt;span class="pl-c1"&gt;==&lt;/span&gt; &lt;span class="pl-s"&gt;"text"&lt;/span&gt;:
            &lt;span class="pl-en"&gt;print&lt;/span&gt;(&lt;span class="pl-s1"&gt;event&lt;/span&gt;.&lt;span class="pl-c1"&gt;chunk&lt;/span&gt;, &lt;span class="pl-s1"&gt;end&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-s"&gt;""&lt;/span&gt;, &lt;span class="pl-s1"&gt;flush&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-c1"&gt;True&lt;/span&gt;)
        &lt;span class="pl-k"&gt;elif&lt;/span&gt; &lt;span class="pl-s1"&gt;event&lt;/span&gt;.&lt;span class="pl-c1"&gt;type&lt;/span&gt; &lt;span class="pl-c1"&gt;==&lt;/span&gt; &lt;span class="pl-s"&gt;"tool_call_name"&lt;/span&gt;:
            &lt;span class="pl-en"&gt;print&lt;/span&gt;(&lt;span class="pl-s"&gt;f"&lt;span class="pl-cce"&gt;\n&lt;/span&gt;Tool call: &lt;span class="pl-s1"&gt;&lt;span class="pl-kos"&gt;{&lt;/span&gt;&lt;span class="pl-s1"&gt;event&lt;/span&gt;.&lt;span class="pl-c1"&gt;chunk&lt;/span&gt;&lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;/span&gt;("&lt;/span&gt;, &lt;span class="pl-s1"&gt;end&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-s"&gt;""&lt;/span&gt;, &lt;span class="pl-s1"&gt;flush&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-c1"&gt;True&lt;/span&gt;)
        &lt;span class="pl-k"&gt;elif&lt;/span&gt; &lt;span class="pl-s1"&gt;event&lt;/span&gt;.&lt;span class="pl-c1"&gt;type&lt;/span&gt; &lt;span class="pl-c1"&gt;==&lt;/span&gt; &lt;span class="pl-s"&gt;"tool_call_args"&lt;/span&gt;:
            &lt;span class="pl-en"&gt;print&lt;/span&gt;(&lt;span class="pl-s1"&gt;event&lt;/span&gt;.&lt;span class="pl-c1"&gt;chunk&lt;/span&gt;, &lt;span class="pl-s1"&gt;end&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-s"&gt;""&lt;/span&gt;, &lt;span class="pl-s1"&gt;flush&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-c1"&gt;True&lt;/span&gt;)

&lt;span class="pl-k"&gt;async&lt;/span&gt; &lt;span class="pl-k"&gt;def&lt;/span&gt; &lt;span class="pl-en"&gt;async_example&lt;/span&gt;():
    &lt;span class="pl-s1"&gt;model&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;llm&lt;/span&gt;.&lt;span class="pl-c1"&gt;get_async_model&lt;/span&gt;(&lt;span class="pl-s"&gt;"gpt-5.5"&lt;/span&gt;)
    &lt;span class="pl-s1"&gt;response&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;model&lt;/span&gt;.&lt;span class="pl-c1"&gt;prompt&lt;/span&gt;(
        &lt;span class="pl-s1"&gt;prompt&lt;/span&gt;,
        &lt;span class="pl-s1"&gt;tools&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;[&lt;span class="pl-s1"&gt;describe_dog&lt;/span&gt;],
    )
    &lt;span class="pl-k"&gt;async&lt;/span&gt; &lt;span class="pl-k"&gt;for&lt;/span&gt; &lt;span class="pl-s1"&gt;event&lt;/span&gt; &lt;span class="pl-c1"&gt;in&lt;/span&gt; &lt;span class="pl-s1"&gt;response&lt;/span&gt;.&lt;span class="pl-c1"&gt;astream_events&lt;/span&gt;():
        &lt;span class="pl-k"&gt;if&lt;/span&gt; &lt;span class="pl-s1"&gt;event&lt;/span&gt;.&lt;span class="pl-c1"&gt;type&lt;/span&gt; &lt;span class="pl-c1"&gt;==&lt;/span&gt; &lt;span class="pl-s"&gt;"text"&lt;/span&gt;:
            &lt;span class="pl-en"&gt;print&lt;/span&gt;(&lt;span class="pl-s1"&gt;event&lt;/span&gt;.&lt;span class="pl-c1"&gt;chunk&lt;/span&gt;, &lt;span class="pl-s1"&gt;end&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-s"&gt;""&lt;/span&gt;, &lt;span class="pl-s1"&gt;flush&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-c1"&gt;True&lt;/span&gt;)
        &lt;span class="pl-k"&gt;elif&lt;/span&gt; &lt;span class="pl-s1"&gt;event&lt;/span&gt;.&lt;span class="pl-c1"&gt;type&lt;/span&gt; &lt;span class="pl-c1"&gt;==&lt;/span&gt; &lt;span class="pl-s"&gt;"tool_call_name"&lt;/span&gt;:
            &lt;span class="pl-en"&gt;print&lt;/span&gt;(&lt;span class="pl-s"&gt;f"&lt;span class="pl-cce"&gt;\n&lt;/span&gt;Tool call: &lt;span class="pl-s1"&gt;&lt;span class="pl-kos"&gt;{&lt;/span&gt;&lt;span class="pl-s1"&gt;event&lt;/span&gt;.&lt;span class="pl-c1"&gt;chunk&lt;/span&gt;&lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;/span&gt;("&lt;/span&gt;, &lt;span class="pl-s1"&gt;end&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-s"&gt;""&lt;/span&gt;, &lt;span class="pl-s1"&gt;flush&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-c1"&gt;True&lt;/span&gt;)
        &lt;span class="pl-k"&gt;elif&lt;/span&gt; &lt;span class="pl-s1"&gt;event&lt;/span&gt;.&lt;span class="pl-c1"&gt;type&lt;/span&gt; &lt;span class="pl-c1"&gt;==&lt;/span&gt; &lt;span class="pl-s"&gt;"tool_call_args"&lt;/span&gt;:
            &lt;span class="pl-en"&gt;print&lt;/span&gt;(&lt;span class="pl-s1"&gt;event&lt;/span&gt;.&lt;span class="pl-c1"&gt;chunk&lt;/span&gt;, &lt;span class="pl-s1"&gt;end&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-s"&gt;""&lt;/span&gt;, &lt;span class="pl-s1"&gt;flush&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-c1"&gt;True&lt;/span&gt;)

&lt;span class="pl-en"&gt;sync_example&lt;/span&gt;()
&lt;span class="pl-s1"&gt;asyncio&lt;/span&gt;.&lt;span class="pl-c1"&gt;run&lt;/span&gt;(&lt;span class="pl-en"&gt;async_example&lt;/span&gt;())&lt;/pre&gt;
&lt;p&gt;Sample output (from just the first sync example):&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;My motivation: create three memorable dogs with distinct “cool” styles—one cinematic, one adventurous, and one charmingly chaotic—so each feels like they could star in their own story.&lt;/code&gt;&lt;br /&gt;
&lt;code&gt;Tool call: describe_dog({"name": "Nova Jetpaw", "bio": "A sleek silver-gray whippet who wears tiny aviator goggles and loves sprinting along moonlit beaches. Nova is fearless, elegant, and rumored to outrun drones just for fun."}&lt;/code&gt;&lt;br /&gt;
&lt;code&gt;Tool call: describe_dog({"name": "Mochi Thunderbark", "bio": "A fluffy corgi with a dramatic black-and-gold bandana and the confidence of a rock star. Mochi is short, loud, loyal, and leads a neighborhood 'security patrol' made entirely of squirrels."}&lt;/code&gt;&lt;br /&gt;
&lt;code&gt;Tool call: describe_dog({"name": "Atlas Snowfang", "bio": "A massive white husky with ice-blue eyes and a backpack full of trail snacks. Atlas is calm, heroic, and always knows the way home—even during blizzards, fog, or confusing camping trips."}&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;At the end of the response you can call &lt;code&gt;response.execute_tool_calls()&lt;/code&gt; to actually run the functions that were requested, or send a &lt;code&gt;response.reply()&lt;/code&gt; to have those tools called and their return values sent back to the model:&lt;/p&gt;
&lt;pre&gt;&lt;span class="pl-en"&gt;print&lt;/span&gt;(&lt;span class="pl-s1"&gt;response&lt;/span&gt;.&lt;span class="pl-c1"&gt;reply&lt;/span&gt;(&lt;span class="pl-s"&gt;"Tell me about the dogs"&lt;/span&gt;))&lt;/pre&gt;
&lt;p&gt;This new mechanism for streaming different token types means the CLI tool can now display "thinking" text in a different color from the text in the final response. The thinking text goes to stderr so it won't affect results that are piped into other tools.&lt;/p&gt;
&lt;p&gt;This example uses Claude Sonnet 4.6 (with an updated streaming event version of the &lt;a href="https://github.com/simonw/llm-anthropic"&gt;llm-anthropic&lt;/a&gt; plugin) as Anthropic's models return their reasoning text as part of the response:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm -m claude-sonnet-4.6 &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;Think about 3 cool dogs then describe them&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt; \
  -o thinking_display 1&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2026/claude-thinking-llm.gif" alt="Animated demo. Starts with ~/dev/scratch/llm-anthropic % uv run llm -m claude-sonnet-4.6 'Think about 3 cool dogs then describe them' -o thinking_display 1 - the text then streams in grey: The user wants me to think about 3 cool dogs and then describe them. Let me come up with 3 interesting, cool dogs and describe them. Then switches to regular color text for the output that describes the dogs." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;You can suppress the output of reasoning tokens using the new &lt;code&gt;-R/--no-reasoning&lt;/code&gt; flag. Surprisingly that ended up being the only CLI-facing change in this release.&lt;/p&gt;
&lt;h4 id="a-mechanism-for-serializing-and-deserializing-responses"&gt;A mechanism for serializing and deserializing responses&lt;/h4&gt;
&lt;p&gt;As mentioned earlier, LLM has quite inflexible code at the moment for persisting conversations to SQLite. I've added a new mechanism in 0.32a0 that should provide Python API users a way to roll their own alternative:&lt;/p&gt;
&lt;pre&gt;&lt;span class="pl-s1"&gt;serializable&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;response&lt;/span&gt;.&lt;span class="pl-c1"&gt;to_dict&lt;/span&gt;()
&lt;span class="pl-c"&gt;# serializable is a JSON-style dictionary&lt;/span&gt;
&lt;span class="pl-c"&gt;# store it anywhere you like, then inflate it:&lt;/span&gt;
&lt;span class="pl-s1"&gt;response&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-v"&gt;Response&lt;/span&gt;.&lt;span class="pl-c1"&gt;from_dict&lt;/span&gt;(&lt;span class="pl-s1"&gt;serializable&lt;/span&gt;)&lt;/pre&gt;
&lt;p&gt;The dictionary this returns is actually a &lt;code&gt;TypedDict&lt;/code&gt; defined in the new &lt;a href="https://github.com/simonw/llm/blob/main/llm/serialization.py"&gt;llm/serialization.py&lt;/a&gt; module.&lt;/p&gt;
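&lt;p&gt;Here's a minimal sketch of one way to "store it anywhere you like", assuming the dictionary really is plain JSON-style data. The &lt;code&gt;saved_responses&lt;/code&gt; table and the &lt;code&gt;save_response()&lt;/code&gt; and &lt;code&gt;load_response()&lt;/code&gt; helpers are hypothetical names that are not part of LLM, and the placeholder dictionary stands in for the output of &lt;code&gt;response.to_dict()&lt;/code&gt; so the example runs without a model:&lt;/p&gt;

```python
import json
import sqlite3


def save_response(db, key, serializable):
    """Store a JSON-style dictionary under a caller-chosen key."""
    db.execute(
        "CREATE TABLE IF NOT EXISTS saved_responses (key TEXT PRIMARY KEY, body TEXT)"
    )
    db.execute(
        "INSERT OR REPLACE INTO saved_responses (key, body) VALUES (?, ?)",
        (key, json.dumps(serializable)),
    )
    db.commit()


def load_response(db, key):
    """Return the stored dictionary, or None if the key is unknown."""
    row = db.execute(
        "SELECT body FROM saved_responses WHERE key = ?", (key,)
    ).fetchone()
    return json.loads(row[0]) if row else None


if __name__ == "__main__":
    db = sqlite3.connect(":memory:")
    # Placeholder standing in for response.to_dict(); the real keys will differ.
    placeholder = {"example_key": "example value", "items": [1, 2, 3]}
    save_response(db, "demo", placeholder)
    restored = load_response(db, "demo")
    print(restored == placeholder)
    # With LLM installed, restored would then be passed to Response.from_dict().
```

&lt;p&gt;Storing the JSON as a single text column keeps the sketch independent of whatever keys the &lt;code&gt;TypedDict&lt;/code&gt; defines.&lt;/p&gt;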
&lt;h4 id="what-s-next-"&gt;What's next?&lt;/h4&gt;
&lt;p&gt;I'm releasing this as an alpha so I can upgrade various plugins and exercise the new design in real world environments for a few days. I expect the stable 0.32 release will be very similar to this alpha, unless alpha testing reveals some design flaw in the way I've put this all together.&lt;/p&gt;
&lt;p&gt;There's one remaining large task: I'd like to redesign the SQLite logging system to better capture the more finely grained details that are returned by this new abstraction.&lt;/p&gt;
&lt;p&gt;Ideally I'd like to model this as a graph, to best support situations like an OpenAI-style chat completions API where the same conversations are constantly extended and then repeated with every prompt. I want to be able to store those without duplicating them in the database.&lt;/p&gt;
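&lt;p&gt;To illustrate the general idea (this is not LLM's schema or a planned design, just a sketch under my own assumptions), each message can be stored as a node that points at its parent, with the node ID derived from a hash of the parent ID plus the message. Re-submitting an extended conversation then reuses the existing nodes. The &lt;code&gt;MessageGraph&lt;/code&gt; class and its methods are hypothetical names:&lt;/p&gt;

```python
import hashlib
import json


class MessageGraph:
    """Stores chat messages as nodes that point at their parent message."""

    def __init__(self):
        # Maps node id to a (parent id or None, role, content) tuple.
        self.nodes = {}

    def add_conversation(self, messages):
        """Store a list of (role, content) pairs and return the id of the last node."""
        parent_id = None
        for role, content in messages:
            payload = json.dumps([parent_id, role, content])
            node_id = hashlib.sha256(payload.encode("utf-8")).hexdigest()
            # setdefault leaves an existing identical node untouched.
            self.nodes.setdefault(node_id, (parent_id, role, content))
            parent_id = node_id
        return parent_id

    def conversation(self, node_id):
        """Walk parent pointers back to the root and return the messages in order."""
        messages = []
        while node_id is not None:
            parent_id, role, content = self.nodes[node_id]
            messages.append((role, content))
            node_id = parent_id
        return list(reversed(messages))


if __name__ == "__main__":
    graph = MessageGraph()
    first = [("user", "Hi"), ("assistant", "Hello!")]
    second = first + [("user", "Name a dog"), ("assistant", "Nova")]
    graph.add_conversation(first)
    tip = graph.add_conversation(second)
    # The two shared messages are stored once, so there are 4 nodes rather than 6.
    print(len(graph.nodes))
    print(graph.conversation(tip))
```

&lt;p&gt;Because the hash includes the parent ID, two conversations only share nodes for as long as their histories are identical.&lt;/p&gt;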
&lt;p&gt;I'm undecided as to whether that should be a feature in 0.32 or I should hold it for 0.33.&lt;/p&gt;&lt;p&gt;&lt;em&gt;You are only seeing the long-form articles from my blog. Subscribe to &lt;a href="https://simonwillison.net/atom/everything/"&gt;/atom/everything/&lt;/a&gt; to get all of my posts, or take a look at my &lt;a href="https://simonwillison.net/about/#subscribe"&gt;other subscription options&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;</summary><category term="projects"/><category term="python"/><category term="ai"/><category term="annotated-release-notes"/><category term="generative-ai"/><category term="llms"/><category term="llm"/></entry><entry><title>Tracking the history of the now-deceased OpenAI Microsoft AGI clause</title><link href="https://simonwillison.net/2026/Apr/27/now-deceased-agi-clause/#atom-entries" rel="alternate"/><published>2026-04-27T18:38:17+00:00</published><updated>2026-04-27T18:38:17+00:00</updated><id>https://simonwillison.net/2026/Apr/27/now-deceased-agi-clause/#atom-entries</id><summary type="html">&lt;p&gt;For many years, Microsoft and OpenAI's relationship has included a weird clause saying that, should AGI be achieved, Microsoft's commercial IP rights to OpenAI's technology would be null and void. That clause appeared to end today. I decided to try and track its expression over time on &lt;a href="https://openai.com/"&gt;openai.com&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;OpenAI, July 22nd 2019 in &lt;a href="https://openai.com/index/microsoft-invests-in-and-partners-with-openai/"&gt;Microsoft invests in and partners with OpenAI to support us building beneficial AGI&lt;/a&gt; (emphasis mine):&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;OpenAI is producing a sequence of increasingly powerful AI technologies, which requires a lot of capital for computational power. The most obvious way to cover costs is to build a product, but that would mean changing our focus. Instead, we intend to license &lt;strong&gt;some of our pre-AGI technologies&lt;/strong&gt;, with Microsoft becoming our preferred partner for commercializing them.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;But what &lt;em&gt;is&lt;/em&gt; AGI? The &lt;a href="https://openai.com/charter/"&gt;OpenAI Charter&lt;/a&gt; was first published in April 2018 and has remained unchanged at least since this &lt;a href="https://web.archive.org/web/20190311213352/https://openai.com/charter/"&gt;March 11th 2019 archive.org capture&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;OpenAI’s mission is to ensure that artificial general intelligence (AGI)—by which we mean highly autonomous systems that outperform humans at most economically valuable work—benefits all of humanity.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Here's the problem: if you're going to sign an agreement with Microsoft that is dependent on knowing when "AGI" has been achieved, you need something a little more concrete.&lt;/p&gt;
&lt;p&gt;In December 2024 &lt;a href="https://www.theinformation.com/articles/microsoft-and-openai-wrangle-over-terms-of-their-blockbuster-partnership"&gt;The Information reported the details&lt;/a&gt; (summarized here outside of their paywall &lt;a href="https://techcrunch.com/2024/12/26/microsoft-and-openai-have-a-financial-definition-of-agi-report/"&gt;by TechCrunch&lt;/a&gt;):&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Last year’s agreement between Microsoft and OpenAI, which hasn’t been disclosed, said AGI would be achieved only when OpenAI has developed systems that have the ability to generate the maximum total profits to which its earliest investors, including Microsoft, are entitled, according to documents OpenAI distributed to investors. Those profits total about $100 billion, the documents showed.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;So AGI is now whenever OpenAI's systems are capable of generating $100 billion in profit?&lt;/p&gt;
&lt;p&gt;In October 2025 the process changed to being judged by an "independent expert panel". In &lt;a href="https://openai.com/index/next-chapter-of-microsoft-openai-partnership/"&gt;The next chapter of the Microsoft–OpenAI partnership&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The agreement preserves key elements that have fueled this successful partnership—meaning OpenAI remains Microsoft’s frontier model partner and Microsoft continues to have exclusive IP rights and Azure API exclusivity until Artificial General Intelligence (AGI). [...]&lt;/p&gt;
&lt;p&gt;Once AGI is declared by OpenAI, that declaration will now be verified by an independent expert panel. [...]&lt;/p&gt;
&lt;p&gt;Microsoft’s IP rights to research, defined as the confidential methods used in the development of models and systems, will remain until either the expert panel verifies AGI or through 2030, whichever is first.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;OpenAI on February 27th, 2026 in &lt;a href="https://openai.com/index/continuing-microsoft-partnership/"&gt;Joint Statement from OpenAI and Microsoft&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;AGI definition and processes are unchanged&lt;/strong&gt;. The contractual definition of AGI and the process for determining if it has been achieved remains the same.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;OpenAI today, April 27th 2026 in &lt;a href="https://openai.com/index/next-phase-of-microsoft-partnership/"&gt;The next phase of the Microsoft OpenAI partnership&lt;/a&gt; (emphasis mine):&lt;/p&gt;
&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;Microsoft will continue to have a license to OpenAI IP for models and products through 2032.  Microsoft’s license will now be non-exclusive.&lt;/li&gt;
&lt;li&gt;Microsoft will no longer pay a revenue share to OpenAI.&lt;/li&gt;
&lt;li&gt;Revenue share payments from OpenAI to Microsoft continue through 2030, &lt;strong&gt;independent of OpenAI’s technology progress&lt;/strong&gt;, at the same percentage but subject to a total cap.&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;As far as I can tell "independent of OpenAI’s technology progress" is a declaration that the AGI clause is now dead. Here's The Verge coming to the same conclusion: &lt;a href="https://www.theverge.com/ai-artificial-intelligence/918981/openai-microsoft-renegotiate-contract"&gt;The AGI clause is dead&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;My all-time favorite commentary on OpenAI's approach to AGI remains this 2023 hypothetical &lt;a href="https://www.bloomberg.com/opinion/articles/2023-11-20/who-controls-openai"&gt;by Matt Levine&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;And the investors wailed and gnashed their teeth but it’s true, that is what they agreed to, and they had no legal recourse. And OpenAI’s new CEO, and its nonprofit board, cut them a check for their capped return and said “bye” and went back to running OpenAI for the benefit of humanity. It turned out that a benign, carefully governed artificial superintelligence is really good for humanity, and OpenAI quickly solved all of humanity’s problems and ushered in an age of peace and abundance in which nobody wanted for anything or needed any Microsoft products. And capitalism came to an end.&lt;/p&gt;
&lt;/blockquote&gt;&lt;p&gt;&lt;em&gt;You are only seeing the long-form articles from my blog. Subscribe to &lt;a href="https://simonwillison.net/atom/everything/"&gt;/atom/everything/&lt;/a&gt; to get all of my posts, or take a look at my &lt;a href="https://simonwillison.net/about/#subscribe"&gt;other subscription options&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;</summary><category term="computer-history"/><category term="microsoft"/><category term="ai"/><category term="openai"/></entry><entry><title>DeepSeek V4 - almost on the frontier, a fraction of the price</title><link href="https://simonwillison.net/2026/Apr/24/deepseek-v4/#atom-entries" rel="alternate"/><published>2026-04-24T06:01:04+00:00</published><updated>2026-04-24T06:01:04+00:00</updated><id>https://simonwillison.net/2026/Apr/24/deepseek-v4/#atom-entries</id><summary type="html">&lt;p&gt;Chinese AI lab DeepSeek's last model release was V3.2 (and V3.2 Speciale) &lt;a href="https://simonwillison.net/2025/Dec/1/deepseek-v32/"&gt;last December&lt;/a&gt;. They just dropped the first of their hotly anticipated V4 series in the shape of two preview models, &lt;a href="https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro"&gt;DeepSeek-V4-Pro&lt;/a&gt; and &lt;a href="https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash"&gt;DeepSeek-V4-Flash&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Both models are Mixture of Experts with a 1 million token context. Pro is 1.6T total parameters, 49B active. Flash is 284B total, 13B active. They're using the standard MIT license.&lt;/p&gt;
&lt;p&gt;I think this makes DeepSeek-V4-Pro the new largest open weights model. It's larger than Kimi K2.6 (1.1T) and GLM-5.1 (754B) and more than twice the size of DeepSeek V3.2 (685B).&lt;/p&gt;
&lt;p&gt;Pro is 865GB on Hugging Face, Flash is 160GB. I'm hoping that a lightly quantized Flash will run on my 128GB M5 MacBook Pro. It's &lt;em&gt;possible&lt;/em&gt; the Pro model may run on it if I can stream just the necessary active experts from disk.&lt;/p&gt;
&lt;p&gt;For the moment I tried the models out via &lt;a href="https://openrouter.ai/"&gt;OpenRouter&lt;/a&gt;, using &lt;a href="https://github.com/simonw/llm-openrouter"&gt;llm-openrouter&lt;/a&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm install llm-openrouter
llm openrouter refresh
llm -m openrouter/deepseek/deepseek-v4-pro 'Generate an SVG of a pelican riding a bicycle'
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here's the pelican &lt;a href="https://gist.github.com/simonw/4a7a9e75b666a58a0cf81495acddf529"&gt;for DeepSeek-V4-Flash&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2026/deepseek-v4-flash.png" alt="Excellent bicycle - good frame shape, nice chain, even has a reflector on the front wheel. Pelican has a mean looking expression but has its wings on the handlebars and feet on the pedals. Pouch is a little sharp." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;And &lt;a href="https://gist.github.com/simonw/9e8dfed68933ab752c9cf27a03250a7c"&gt;for DeepSeek-V4-Pro&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2026/deepseek-v4-pro.png" alt="Another solid bicycle, albeit the spokes are a little jagged and the frame is compressed a bit. Pelican has gone a bit wrong - it has a VERY large body, only one wing, a weirdly hairy backside and generally loos like it was drown be a different artist from the bicycle." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;For comparison, take a look at the pelicans I got from &lt;a href="https://simonwillison.net/2025/Dec/1/deepseek-v32/"&gt;DeepSeek V3.2 in December&lt;/a&gt;, &lt;a href="https://simonwillison.net/2025/Aug/22/deepseek-31/"&gt;V3.1 in August&lt;/a&gt;, and &lt;a href="https://simonwillison.net/2025/Mar/24/deepseek/"&gt;V3-0324 in March 2025&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;So the pelicans are pretty good, but what's really notable here is the &lt;em&gt;cost&lt;/em&gt;. DeepSeek V4 is a very, very inexpensive model.&lt;/p&gt;
&lt;p&gt;This is &lt;a href="https://api-docs.deepseek.com/quick_start/pricing"&gt;DeepSeek's pricing page&lt;/a&gt;. They're charging $0.14/million tokens input and $0.28/million tokens output for Flash, and $1.74/million input and $3.48/million output for Pro.&lt;/p&gt;
&lt;p&gt;Here's a comparison table with the frontier models from Gemini, OpenAI and Anthropic:&lt;/p&gt;
&lt;center&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Input ($/M)&lt;/th&gt;
&lt;th&gt;Output ($/M)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;DeepSeek V4 Flash&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$0.14&lt;/td&gt;
&lt;td&gt;$0.28&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT-5.4 Nano&lt;/td&gt;
&lt;td&gt;$0.20&lt;/td&gt;
&lt;td&gt;$1.25&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gemini 3.1 Flash-Lite&lt;/td&gt;
&lt;td&gt;$0.25&lt;/td&gt;
&lt;td&gt;$1.50&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gemini 3 Flash Preview&lt;/td&gt;
&lt;td&gt;$0.50&lt;/td&gt;
&lt;td&gt;$3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT-5.4 Mini&lt;/td&gt;
&lt;td&gt;$0.75&lt;/td&gt;
&lt;td&gt;$4.50&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude Haiku 4.5&lt;/td&gt;
&lt;td&gt;$1&lt;/td&gt;
&lt;td&gt;$5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;DeepSeek V4 Pro&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$1.74&lt;/td&gt;
&lt;td&gt;$3.48&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gemini 3.1 Pro&lt;/td&gt;
&lt;td&gt;$2&lt;/td&gt;
&lt;td&gt;$12&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT-5.4&lt;/td&gt;
&lt;td&gt;$2.50&lt;/td&gt;
&lt;td&gt;$15&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude Sonnet 4.6&lt;/td&gt;
&lt;td&gt;$3&lt;/td&gt;
&lt;td&gt;$15&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude Opus 4.7&lt;/td&gt;
&lt;td&gt;$5&lt;/td&gt;
&lt;td&gt;$25&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT-5.5&lt;/td&gt;
&lt;td&gt;$5&lt;/td&gt;
&lt;td&gt;$30&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/center&gt;
&lt;p&gt;DeepSeek-V4-Flash is the cheapest of the small models, beating even OpenAI's GPT-5.4 Nano. DeepSeek-V4-Pro is the cheapest of the larger frontier models.&lt;/p&gt;
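&lt;p&gt;To make those per-million prices more concrete, here's a small sketch that turns them into the cost of a workload. The prices are copied from the table above. The &lt;code&gt;workload_cost()&lt;/code&gt; helper and the example workload of 10 million input tokens and 2 million output tokens are made up for illustration, and the calculation ignores caching discounts and any other pricing tiers:&lt;/p&gt;

```python
# Prices in US dollars per million tokens (input, output), from the table above.
PRICES = {
    "DeepSeek V4 Flash": (0.14, 0.28),
    "GPT-5.4 Nano": (0.20, 1.25),
    "DeepSeek V4 Pro": (1.74, 3.48),
    "Gemini 3.1 Pro": (2.00, 12.00),
    "GPT-5.4": (2.50, 15.00),
    "Claude Sonnet 4.6": (3.00, 15.00),
    "Claude Opus 4.7": (5.00, 25.00),
    "GPT-5.5": (5.00, 30.00),
}


def workload_cost(model, input_tokens, output_tokens):
    """Return the cost in dollars for a given number of input and output tokens."""
    input_price, output_price = PRICES[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


if __name__ == "__main__":
    # Hypothetical workload: 10 million input tokens, 2 million output tokens.
    for model in PRICES:
        cost = workload_cost(model, 10_000_000, 2_000_000)
        print(f"{model}: ${cost:.2f}")
```

&lt;p&gt;For that workload Flash comes to $1.96 and Pro to $24.36, against $110.00 for GPT-5.5.&lt;/p&gt;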
&lt;p&gt;This note from &lt;a href="https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash/blob/main/DeepSeek_V4.pdf"&gt;the DeepSeek paper&lt;/a&gt; helps explain why they can price these models so low - they've focused a great deal on efficiency with this release, especially for longer context prompts:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;In the scenario of 1M-token context, even DeepSeek-V4-Pro, which has a larger number of activated parameters, attains only 27% of the single-token FLOPs (measured in equivalent FP8 FLOPs) and 10% of the KV cache size relative to DeepSeek-V3.2. Furthermore, DeepSeek-V4-Flash, with its smaller number of activated parameters, pushes efficiency even further: in the 1M-token context setting, it achieves only 10% of the single-token FLOPs and 7% of the KV cache size compared with DeepSeek-V3.2.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;DeepSeek's self-reported benchmarks &lt;a href="https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash/blob/main/DeepSeek_V4.pdf"&gt;in their paper&lt;/a&gt; show their Pro model competitive with those other frontier models, albeit with this note:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Through the expansion of reasoning tokens, DeepSeek-V4-Pro-Max demonstrates superior performance relative to GPT-5.2 and Gemini-3.0-Pro on standard reasoning benchmarks. Nevertheless, its performance falls marginally short of GPT-5.4 and Gemini-3.1-Pro, suggesting a developmental trajectory that trails state-of-the-art frontier models by approximately 3 to 6 months.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I'm keeping an eye on &lt;a href="https://huggingface.co/unsloth/models"&gt;huggingface.co/unsloth/models&lt;/a&gt; as I expect the Unsloth team will have a set of quantized versions out pretty soon. It's going to be very interesting to see how well that Flash model runs on my own machine.&lt;/p&gt;&lt;p&gt;&lt;em&gt;You are only seeing the long-form articles from my blog. Subscribe to &lt;a href="https://simonwillison.net/atom/everything/"&gt;/atom/everything/&lt;/a&gt; to get all of my posts, or take a look at my &lt;a href="https://simonwillison.net/about/#subscribe"&gt;other subscription options&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;</summary><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="llm"/><category term="llm-pricing"/><category term="pelican-riding-a-bicycle"/><category term="deepseek"/><category term="llm-release"/><category term="openrouter"/><category term="ai-in-china"/></entry></feed>