<?xml version="1.0" encoding="utf-8"?>
<feed xml:lang="en-us" xmlns="http://www.w3.org/2005/Atom"><title>Simon Willison's Weblog: llm-release</title><link href="http://simonwillison.net/" rel="alternate"/><link href="http://simonwillison.net/tags/llm-release.atom" rel="self"/><id>http://simonwillison.net/</id><updated>2026-06-10T20:00:54+00:00</updated><author><name>Simon Willison</name></author><entry><title>DiffusionGemma</title><link href="https://simonwillison.net/2026/Jun/10/diffusiongemma/#atom-tag" rel="alternate"/><published>2026-06-10T20:00:54+00:00</published><updated>2026-06-10T20:00:54+00:00</updated><id>https://simonwillison.net/2026/Jun/10/diffusiongemma/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://blog.google/innovation-and-ai/technology/developers-tools/diffusion-gemma-faster-text-generation/"&gt;DiffusionGemma&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Last May Google briefly released an experimental Gemini Diffusion model. I &lt;a href="https://simonwillison.net/2025/May/21/gemini-diffusion/"&gt;tried the preview at the time&lt;/a&gt; and recorded it running at 857 tokens/second. It was an exciting model, but Google made no further announcements about it.&lt;/p&gt;
&lt;p&gt;That research has returned in the best possible way: as a new open weight (Apache 2 licensed) Gemma model, &lt;a href="https://huggingface.co/google/diffusiongemma-26B-A4B-it"&gt;google/diffusiongemma-26B-A4B-it&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;NVIDIA are currently &lt;a href="https://build.nvidia.com/google/diffusiongemma-26b-a4b-it"&gt;hosting the model for free&lt;/a&gt; on their NIM cloud API. I used that API to &lt;a href="https://tools.simonwillison.net/markdown-svg-renderer#url=https%3A%2F%2Fgist.github.com%2Fsimonw%2Fe5e234a6dc6eef61e209ce1629620042"&gt;generate this pelican&lt;/a&gt;, which took 4.4s (according to &lt;code&gt;time uv run generate.py&lt;/code&gt;) to return 2,409 tokens - so at least 500 tokens/second.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Flat minimalist illustration of a white pelican with a large orange beak riding a red bicycle with black wheels, against a pale blue background with a green line representing the ground" src="https://static.simonwillison.net/static/2026/diffusiongemma-pelican.png" /&gt;

    &lt;p&gt;&lt;small&gt;&lt;/small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=48478471"&gt;Hacker News&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/google"&gt;google&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/nvidia"&gt;nvidia&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gemma"&gt;gemma&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-performance"&gt;llm-performance&lt;/a&gt;&lt;/p&gt;



</summary><category term="google"/><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="nvidia"/><category term="pelican-riding-a-bicycle"/><category term="gemma"/><category term="llm-release"/><category term="llm-performance"/></entry><entry><title>Initial impressions of Claude Fable 5</title><link href="https://simonwillison.net/2026/Jun/9/claude-fable-5/#atom-tag" rel="alternate"/><published>2026-06-09T23:59:54+00:00</published><updated>2026-06-09T23:59:54+00:00</updated><id>https://simonwillison.net/2026/Jun/9/claude-fable-5/#atom-tag</id><summary type="html">
    &lt;p&gt;I didn't have early access to today's &lt;a href="https://www.anthropic.com/news/claude-fable-5-mythos-5"&gt;Claude Fable 5&lt;/a&gt; release, but I've spent the past ~5.5 hours putting it through its paces. My initial impressions are that this is something of a &lt;em&gt;beast&lt;/em&gt;. It's slow, expensive and has been quite happily churning through everything I've thrown at it so far. As is frequently the case with current frontier models the challenge is finding tasks that it can't do.&lt;/p&gt;
&lt;p&gt;First, let's review the key characteristics.&lt;/p&gt;
&lt;p&gt;Anthropic claim that &lt;a href="https://www.anthropic.com/news/claude-fable-5-mythos-5"&gt;Claude Fable 5&lt;/a&gt; offers the same performance as Claude Mythos 5, except with much more strict guardrails in place to prevent it being used for harmful things. Those guardrails trigger often enough that the Claude API has new mechanisms for letting you know when you hit them, and even has a &lt;a href="https://platform.claude.com/docs/en/build-with-claude/refusals-and-fallback"&gt;new option&lt;/a&gt; to request it falls back to another model automatically if something gets rejected.&lt;/p&gt;
&lt;p&gt;Claude Mythos 5 is out today as well, &lt;a href="https://platform.claude.com/docs/en/about-claude/models/introducing-claude-fable-5-and-claude-mythos-5"&gt;Anthropic say it&lt;/a&gt; "Shares Claude Fable 5's capabilities without the safety classifiers".&lt;/p&gt;
&lt;p&gt;The models have a 1 million token context window, 128,000 maximum output tokens and a knowledge cut-off date of January 2026.&lt;/p&gt;
&lt;p&gt;They are priced at twice the price of Claude Opus 4.5/4.6/4.7/4.8: $10/million input tokens and $50/million output tokens. There's no increase in price for longer context usage.&lt;/p&gt;
&lt;p&gt;Other than that the &lt;a href="https://platform.claude.com/docs/en/about-claude/models/introducing-claude-fable-5-and-claude-mythos-5"&gt;upgrade guide&lt;/a&gt; is substantially thinner than &lt;a href="https://platform.claude.com/docs/en/about-claude/models/whats-new-claude-4-8"&gt;the similar guide for Opus 4.8&lt;/a&gt;.&lt;/p&gt;
&lt;h4 id="the-big-model-smell"&gt;The big model smell&lt;/h4&gt;
&lt;p&gt;The best way to describe Fable is that it feels &lt;em&gt;big&lt;/em&gt;. Not just in terms of speed and cost, but also in how much it knows.&lt;/p&gt;
&lt;p&gt;Here's an example prompt, which I used to compare Fable's knowledge with Opus 4.8 (typo included):&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;List all of Simon Willion's open source projects, most recent first, each with a rough date of when they were first released&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Opus 4.8's reply &lt;a href="https://gist.github.com/simonw/e77abf645156ade8c5c610f524226dcb#response-1"&gt;started like this&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;I don't have a reliable, comprehensive, and date-verified list of all of Simon Willison's open source projects, and I want to be honest about that rather than risk giving you inaccurate dates or fabricated entries.&lt;/p&gt;
&lt;p&gt;Here's what I can tell you with reasonable confidence:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Well-known projects by Simon Willison:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;LLM&lt;/strong&gt; (~2023) – A command-line tool and Python library for interacting with large language models.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Datasette&lt;/strong&gt; (~2017) – A tool for exploring and publishing data, probably his most prominent recent project.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;sqlite-utils&lt;/strong&gt; (~2018-2019) – A Python library and CLI tool for manipulating SQLite databases, often used alongside Datasette.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Django&lt;/strong&gt; (2005) – He is one of the co-creators of the Django web framework, along with Adrian Holovaty, originally developed at the Lawrence Journal-World newspaper.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;He also maintains a large ecosystem of &lt;strong&gt;Datasette plugins&lt;/strong&gt; and many smaller utilities (often with names like &lt;code&gt;shot-scraper&lt;/code&gt;, various &lt;code&gt;*-to-sqlite&lt;/code&gt; tools for importing data, etc.).&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;That's not bad. Here's &lt;a href="https://gist.github.com/simonw/e77abf645156ade8c5c610f524226dcb#response"&gt;the start of the response&lt;/a&gt; for the exact same prompt from Fable 5 - both models were running without access to search:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;I think you mean &lt;strong&gt;Simon Willison&lt;/strong&gt; — the prolific open source developer, co-creator of Django, and creator of Datasette.&lt;/p&gt;
&lt;p&gt;An important caveat upfront: a truly complete list isn't feasible. Simon Willison is famously prolific — he has published &lt;strong&gt;hundreds&lt;/strong&gt; of repositories (Datasette alone has 100+ plugins, many written by him), and my knowledge has a cutoff date, so I'll be missing his most recent work. Below are his &lt;strong&gt;notable&lt;/strong&gt; projects with approximate initial release dates, most recent first:&lt;/p&gt;
&lt;h2 id="notable-projects-approximate-dates-"&gt;Notable projects (approximate dates)&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;files-to-prompt&lt;/strong&gt; (April 2024) – concatenates files into a single prompt for LLMs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;datasette-extract&lt;/strong&gt; (2024) – plugin for extracting structured data using LLMs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LLM&lt;/strong&gt; (May–June 2023) – CLI tool and Python library for interacting with large language models, plus a large ecosystem of plugins (llm-gpt4all, llm-mistral, llm-claude, etc., 2023 onward)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;symbex&lt;/strong&gt; (June 2023) – search Python code for symbols&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ttok&lt;/strong&gt; and &lt;strong&gt;strip-tags&lt;/strong&gt; (May 2023) – CLI tools for token counting and HTML cleanup for LLM pipelines&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;datasette-lite&lt;/strong&gt; (May 2022) – Datasette running entirely in the browser via WebAssembly/Pyodide&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;shot-scraper&lt;/strong&gt; (March 2022) – automated website screenshots via Playwright&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;s3-credentials&lt;/strong&gt; (November 2021) – CLI for creating scoped AWS S3 credentials&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;django-sql-dashboard&lt;/strong&gt; (2021) – SQL reporting dashboards for Django&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dogsheep suite&lt;/strong&gt; (2019) – personal analytics tools: twitter-to-sqlite, github-to-sqlite, healthkit-to-sqlite, dogsheep-beta, etc.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;sqlite-utils&lt;/strong&gt; (2018) – CLI and Python library for manipulating SQLite databases&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Datasette&lt;/strong&gt; (November 2017) – his flagship project; tool for exploring and publishing data&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;csvs-to-sqlite&lt;/strong&gt; (2017) – convert CSV files to SQLite&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Various early tools&lt;/strong&gt; (~2007–2010) – soupselect, json-head, geocoders, and others&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Django&lt;/strong&gt; (developed 2003–2005, open-sourced July 2005) – co-created with Adrian Holovaty at the Lawrence Journal-World&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;(Here's &lt;a href="https://gist.github.com/simonw/8d05ce836d44bd8543055c9614f7c478"&gt;GPT-5.5 for good measure&lt;/a&gt;. It listed even more projects than Fable did!)&lt;/p&gt;
&lt;p&gt;In the past I've stated that I don't care about how much models &lt;em&gt;know&lt;/em&gt; - I want them to be able to manipulate text and code in useful ways and actively look up the information they need via search tools, not bake it into their weights.&lt;/p&gt;
&lt;p&gt;But knowledge like this is a reasonably good proxy for model size - you can cram a whole lot more details about the world into a larger number of parameters.&lt;/p&gt;
&lt;p&gt;Does &lt;em&gt;knowing more stuff&lt;/em&gt; mean the model is better at the tasks we pose to it? I can certainly imagine how a coding model with deeper knowledge of modern libraries and patterns could crunch through coding tasks more effectively.&lt;/p&gt;
&lt;p&gt;Is Fable really bigger than Opus? Anthropic haven't said anything about model size, so all we have are tea-leaves, but the speed, pricing and my own poking at its knowledge make me think that it's a large model. Maybe the largest yet from any vendor.&lt;/p&gt;
&lt;h4 id="using-fable-in-claude-ai"&gt;Using Fable in Claude.ai&lt;/h4&gt;
&lt;p&gt;Anthropic made Fable 5 available across all of their surfaces - the &lt;a href="https://claude.ai/"&gt;Claude.ai&lt;/a&gt; chat interface, Claude Code for web, Claude Code CLI and Claude Cowork as well. The model is available "until June 22nd" on the subscription plans (I'm on $100/month Max at the moment), after which it will be billed extra.&lt;/p&gt;
&lt;p&gt;Claude.ai is often under-estimated. Since &lt;a href="https://simonwillison.net/2025/Sep/9/claude-code-interpreter/"&gt;September 2025&lt;/a&gt; every chat has had access to a full container environment to run code, including the ability to install additional packages and even clone repositories directly from GitHub.&lt;/p&gt;
&lt;p&gt;Last week &lt;a href="https://simonwillison.net/2026/Jun/6/micropython-in-a-sandbox/"&gt;I released micropython-wasm&lt;/a&gt;, a Python library that uses &lt;a href="https://wasmtime.dev/"&gt;wasmtime&lt;/a&gt; to run a custom build of MicroPython in WebAssembly to act as a sandbox for untrusted Python code.&lt;/p&gt;
&lt;p&gt;I decided to see if Fable could upgrade that to running full Python instead. I started with this prompt:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;Clone simonw/micropython-wasm from GitHub and research how this could use a full Python as opposed to MicroPython&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Fable identified that it could use Brett Cannon's &lt;a href="https://github.com/brettcannon/cpython-wasi-build/releases"&gt;cpython-wasi-build&lt;/a&gt; builds for this, but was unable to download them itself due to environment restrictions.&lt;/p&gt;
&lt;p&gt;So I grabbed the two zip files from that page and uploaded them to Claude:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;Here's the Brett Cannon builds&lt;/code&gt; (&lt;code&gt;python-3.zip&lt;/code&gt;, &lt;code&gt;_build-python-3.zip&lt;/code&gt; as attachments)&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;And that was that. It churned away for a few minutes and got the entire thing working.&lt;/p&gt;
&lt;p&gt;Part of the response included:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;I tried the cleaner single-zip-stdlib approach to shrink the filesystem surface, but CPython's &lt;code&gt;getpath&lt;/code&gt; bootstrap fails to find &lt;code&gt;encodings&lt;/code&gt; from inside a zip without more prefix finessing — the directory-preopen approach works reliably, so that's what the PoC uses. The zip path is solvable but needs &lt;code&gt;_PYTHONHOME&lt;/code&gt;/frozen-getpath work.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;So I said:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;Try a bit more at the single-zip-stdlib problem&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Then a little later:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;I want a wheel that has the whole system in it, the Python wrappers and the WASM files and the stdlibrary, so I can do uv run --with path-to-whl python -c "demo code"&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;... and it gave me &lt;a href="https://static.simonwillison.net/static/cors-allow/2026/cpython_wasm-0.1.0-py3-none-any.whl"&gt;this 13.9MB cpython_wasm-0.1.0-py3-none-any.whl&lt;/a&gt; file. You can try running Python code in a sandbox using that wheel URL and &lt;code&gt;uv&lt;/code&gt; like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;uv run --with https://static.simonwillison.net/static/cors-allow/2026/cpython_wasm-0.1.0-py3-none-any.whl \
  cpython-wasm -c &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;print(45 ** 56)&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Here's &lt;a href="https://claude.ai/share/a73b8b8b-8ebc-4fef-9e5c-7438e5e7ae35"&gt;the full chat transcript&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;This was a &lt;em&gt;very&lt;/em&gt; strong start.&lt;/p&gt;
&lt;h4 id="adding-features-to-datasette-agent-and-llm-using-claude-code"&gt;Adding features to Datasette Agent and LLM using Claude Code&lt;/h4&gt;
&lt;p&gt;Before I'd realized it was Fable day, my stretch goal for today was to add a new feature to &lt;a href="https://agent.datasette.io/"&gt;Datasette Agent&lt;/a&gt;: I wanted tool calls within that agent software to gain the ability to pause mid-execution and request approval directly from the user.&lt;/p&gt;
&lt;p&gt;This felt like a suitably meaty task to throw at the new model.&lt;/p&gt;
&lt;p&gt;Over the course of the day Fable not only &lt;a href="https://github.com/datasette/datasette-agent/pull/20"&gt;solved that problem&lt;/a&gt;, it also identified and then implemented four issues in my underlying LLM library that would help support this kind of advanced pause-resume mechanism in tool calls.&lt;/p&gt;
&lt;p&gt;It got everything working first using somewhat gnarly hacks, but the moment I told it that changes to LLM itself were in scope it set to work unraveling the hacks and turning them into supported features of LLM instead.&lt;/p&gt;
&lt;p&gt;My stretch goal turned into &lt;a href="https://llm.datasette.io/en/latest/changelog.html#a3-2026-06-09"&gt;LLM 0.32a3&lt;/a&gt;, almost entirely written by Fable. Here are the release notes:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Driven by the needs of &lt;a href="https://github.com/datasette/datasette-agent"&gt;Datasette Agent&lt;/a&gt;'s human-in-the-loop &lt;code&gt;ask_user()&lt;/code&gt; feature, made the following improvements to how tool calls work:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Tool implementations can declare a parameter named &lt;code&gt;llm_tool_call&lt;/code&gt; in order to be passed the &lt;code&gt;llm.ToolCall&lt;/code&gt; object for the current invocation. This allows them to access the current &lt;code&gt;llm_tool_call.tool_call_id&lt;/code&gt;. See &lt;a href="https://llm.datasette.io/en/latest/python-api.html#python-api-tools-llm-tool-call"&gt;Accessing the tool call from inside a tool&lt;/a&gt;. &lt;a href="https://github.com/simonw/llm/pull/1480"&gt;#1480&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Every tool call is now guaranteed a unique &lt;code&gt;tool_call_id&lt;/code&gt; - providers that do not supply one get a synthesized &lt;code&gt;tc_&lt;/code&gt;-prefixed ULID. &lt;a href="https://github.com/simonw/llm/pull/1481"&gt;#1481&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Tools can raise a &lt;code&gt;llm.PauseChain&lt;/code&gt; exception to cleanly pause the tool chain, useful for things like waiting for human approval. The exception propagates to the caller with &lt;code&gt;.tool_call&lt;/code&gt; and &lt;code&gt;.tool_results&lt;/code&gt; (completed sibling results) attached, and no model call is made with a placeholder result. See &lt;a href="https://llm.datasette.io/en/latest/python-api.html#python-api-tools-pause"&gt;Pausing a chain from inside a tool&lt;/a&gt;. &lt;a href="https://github.com/simonw/llm/pull/1482"&gt;#1482&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Failure semantics for concurrent tool execution: async sibling tool calls always run to completion before a pause or hook exception propagates. &lt;a href="https://github.com/simonw/llm/pull/1482"&gt;#1482&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Chains can now resume from a &lt;code&gt;messages=&lt;/code&gt; history ending in unresolved tool calls: the calls are executed through the normal &lt;code&gt;before_call&lt;/code&gt;/&lt;code&gt;after_call&lt;/code&gt; machinery before the first model call, skipping any that already have results. The &lt;code&gt;execute_tool_calls()&lt;/code&gt; method also accepts a new optional &lt;code&gt;tool_calls_list=&lt;/code&gt; argument for executing an explicit list of &lt;code&gt;ToolCall&lt;/code&gt; objects in place of the calls requested by the response. See &lt;a href="https://llm.datasette.io/en/latest/python-api.html#python-api-tools-resume"&gt;Resuming a chain with pending tool calls&lt;/a&gt;. &lt;a href="https://github.com/simonw/llm/pull/1482"&gt;#1482&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Fixed a bug where the async tool executor silently dropped calls to tools not present in &lt;code&gt;tools=&lt;/code&gt; - these now return &lt;code&gt;Error: tool "..." does not exist&lt;/code&gt; results, matching the sync executor. &lt;a href="https://github.com/simonw/llm/pull/1483"&gt;#1483&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;I'm really impressed with the quality of API design, tests, code and documentation that Fable put together for this. I spent several hours on it today, but it feels like several days' worth of work.&lt;/p&gt;
&lt;h4 id="how-much-i-ve-spent"&gt;How much I've spent&lt;/h4&gt;
&lt;p&gt;I recently started using &lt;a href="https://agentsview.io"&gt;AgentsView&lt;/a&gt; to help track my local LLM usage across all of the different coding agents. I published a &lt;a href="https://til.simonwillison.net/llms/agentsview-custom-model-price"&gt;TIL today&lt;/a&gt; about adding custom Fable pricing to that tool, which I expect will not be necessary in the very near future.&lt;/p&gt;
&lt;p&gt;After setting the price, I ran this command to start a localhost web server to explore my usage:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;uvx agentsview serve
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here's the treemap showing the breakdown of my Fable usage across various projects today:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2026/agentsview-fable-full-day.jpg" alt="Screenshot of a cost tracking dashboard with two panels. The first panel is titled &amp;quot;Cost Attribution&amp;quot; with toggle buttons for Project / Model / Agent and Treemap / List, with Project and Treemap selected. Italic text reads &amp;quot;Click to hide from chart&amp;quot;. A treemap shows a large red block labeled prod_datasette_agent $99.26 89.9%, with smaller blocks to its right labeled cloud (blue), datasette (teal), llm (red), and money (pink), plus a tiny orange sliver. A legend lists: 1 prod_datasette_agent $99.26, 2 cloud $3.98, 3 datasette $2.81, 4 llm $2.30, 5 money $1.92, 6 simon $0.15. The second panel is titled &amp;quot;Top Sessions by Cost&amp;quot; and lists nine sessions, each with a &amp;quot;Claude&amp;quot; badge, a prompt excerpt, a project name with a session UUID (omitted here), a token count, and a cost: 1. Review ./datasette-agent and ./datasette-apps - we are going to add a new feature to agent but you ... prod_datasette_agent, 78.2M, $99.26. 2. issues.db is a copy of the Datasette issues database. There are a LOT of notes in there relating to... datasette, 826.8k, $2.81. 3. Consult fly-docs and then look at datasette.cloud (which launches fly machines) and datasettecloud-... cloud, 924.7k, $2.61. 4. simonwillisonblog.db is a copy of my blog, plus all my software releases and other interesting thin... money, 542.9k, $1.92. 5. Look in datasette.cloud and figure out all remaining steps and decisions that need to be made in or... cloud, 455k, $1.37. 6. Review PRs and issues filed against this repo within the last 4 weeks and see if any deserve to be ... llm, 323.3k, $0.95. 7. run mypy, llm, 320.9k, $0.76. 8. [Image #1] fix this in github actions, llm, 183.9k, $0.59. 9. simon, simon, 26.4k, $0.15." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;I used $110.42 worth of tokens today, all as part of my $100/month subscription.&lt;/p&gt;
&lt;h4 id="and-some-pelicans"&gt;And some pelicans&lt;/h4&gt;
&lt;p&gt;I ran "Generate an SVG of a pelican riding a bicycle" against all five thinking effort levels with Fable.&lt;/p&gt;
&lt;p&gt;Here are &lt;a href="https://tools.simonwillison.net/markdown-svg-renderer#url=https%3A%2F%2Fgist.github.com%2Fsimonw%2F94fde31c34a0400c1d29f57e6a708e6b"&gt;the results&lt;/a&gt;, including the token cost for each one:&lt;/p&gt;

&lt;div style="display: flex; flex-wrap: wrap; gap: 10px; margin-bottom: 1em"&gt;
  &lt;figure style="margin: 0; flex: 1 1 30%;"&gt;
    &lt;img src="https://static.simonwillison.net/static/2026/fable-low.jpg" alt="low" style="width: 100%; height: auto;" /&gt;
    &lt;figcaption style="text-align: center;"&gt;low: &lt;a href="https://www.llm-prices.com/#it=25&amp;amp;ot=1929&amp;amp;sel=claude-fable-5"&gt;1,929 out, 9.67c&lt;/a&gt;&lt;/figcaption&gt;
  &lt;/figure&gt;
  &lt;figure style="margin: 0; flex: 1 1 30%;"&gt;
    &lt;img src="https://static.simonwillison.net/static/2026/fable-medium.jpg" alt="medium" style="width: 100%; height: auto;" /&gt;
    &lt;figcaption style="text-align: center;"&gt;medium: &lt;a href="https://www.llm-prices.com/#it=25&amp;amp;ot=2290&amp;amp;sel=claude-fable-5"&gt;2,290 out, 11.475c&lt;/a&gt;&lt;/figcaption&gt;
  &lt;/figure&gt;
  &lt;figure style="margin: 0; flex: 1 1 30%;"&gt;
    &lt;img src="https://static.simonwillison.net/static/2026/fable-high.jpg" alt="high" style="width: 100%; height: auto;" /&gt;
    &lt;figcaption style="text-align: center;"&gt;high: &lt;a href="https://www.llm-prices.com/#it=25&amp;amp;ot=2057&amp;amp;sel=claude-fable-5"&gt;2,057 out, 10.31c&lt;/a&gt;&lt;/figcaption&gt;
  &lt;/figure&gt;
  &lt;figure style="margin: 0; flex: 1 1 45%;"&gt;
    &lt;img src="https://static.simonwillison.net/static/2026/fable-xhigh.jpg" alt="xhigh" style="width: 100%; height: auto;" /&gt;
    &lt;figcaption style="text-align: center;"&gt;xhigh: &lt;a href="https://www.llm-prices.com/#it=25&amp;amp;ot=5992&amp;amp;sel=claude-fable-5"&gt;5,992 out, 29.985c&lt;/a&gt;&lt;/figcaption&gt;
  &lt;/figure&gt;
  &lt;figure style="margin: 0; flex: 1 1 45%;"&gt;
    &lt;img src="https://static.simonwillison.net/static/2026/fable-max.jpg" alt="max" style="width: 100%; height: auto;" /&gt;
    &lt;figcaption style="text-align: center;"&gt;max: &lt;a href="https://www.llm-prices.com/#it=25&amp;amp;ot=14430&amp;amp;sel=claude-fable-5"&gt;14,430 out, 72.175c&lt;/a&gt;&lt;/figcaption&gt;
  &lt;/figure&gt;
&lt;/div&gt;

&lt;p&gt;It's interesting that high ended up using fewer tokens than medium for this particular run.&lt;/p&gt;

&lt;p&gt;Here are the &lt;a href="https://simonwillison.net/2026/May/28/claude-opus-4-8/#and-some-pelicans"&gt;Opus 4.8 pelicans&lt;/a&gt; for comparison.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/anthropic"&gt;anthropic&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude"&gt;claude&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-pricing"&gt;llm-pricing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude-mythos"&gt;claude-mythos&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="anthropic"/><category term="claude"/><category term="llm-pricing"/><category term="pelican-riding-a-bicycle"/><category term="llm-release"/><category term="claude-mythos"/></entry><entry><title>Microsoft's new MAI models</title><link href="https://simonwillison.net/2026/Jun/2/microsofts-new-models/#atom-tag" rel="alternate"/><published>2026-06-02T22:21:52+00:00</published><updated>2026-06-02T22:21:52+00:00</updated><id>https://simonwillison.net/2026/Jun/2/microsofts-new-models/#atom-tag</id><summary type="html">
    &lt;p&gt;Microsoft &lt;a href="https://microsoft.ai/news/building-a-hillclimbing-machine-launching-seven-new-mai-models/"&gt;announced two new text LLMs&lt;/a&gt; this morning - &lt;strong&gt;&lt;a href="https://microsoft.ai/news/introducing-mai-thinking-1/"&gt;MAI-Thinking-1&lt;/a&gt;&lt;/strong&gt; (reasoning, 1T parameters, 35B active, available to "select early partners") and &lt;strong&gt;&lt;a href="https://microsoft.ai/news/introducingmai-code-1-flash/"&gt;MAI-Code-1-Flash&lt;/a&gt;&lt;/strong&gt; (137B Parameters, 5B active, "purpose-built for GitHub Copilot and VS Code to deliver high performance and lower cost [...] rolling out to GitHub Copilot individual users in Visual Studio Code"). I've not been able to try either of them just yet.&lt;/p&gt;
&lt;p&gt;&lt;strike&gt;It's very interesting to see Microsoft releasing models with such low parameter counts, especially given how expensive larger models are to access right now. They claim MAI-Thinking-1 "is preferred to Sonnet 4.6 in our blind human side-by-side evaluations", which is impressive for a 35B model seeing as I frequently run models larger than that on my own laptop.&lt;/strike&gt; (UPDATE: I got this entirely wrong, see note below.)&lt;/p&gt;
&lt;p&gt;Also &lt;a href="https://microsoft.ai/news/introducing-mai-thinking-1/"&gt;of note&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;We trained [MAI-Thinking-1] from the ground up on enterprise grade, clean and commercially licensed data, without distillation from third-party models.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;And for &lt;a href="https://microsoft.ai/news/introducingmai-code-1-flash/"&gt;MAI-Code-1-Flash&lt;/a&gt; as well:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;It is built end-to-end by Microsoft using clean and appropriately licensed data.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I would &lt;em&gt;very much&lt;/em&gt; like to learn more about this "appropriately licensed" data! Could these be the first generally useful code-specialist models that didn't train on an unlicensed dump of the web? (&lt;strong&gt;Update&lt;/strong&gt;: the answer is no, see note below.)&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Update&lt;/strong&gt;: My initial published notes got the size of the models wrong. I misread Microsoft's announcements and interpreted the MoE active parameter count as the total parameter count, but the &lt;a href="https://microsoft.ai/pdf/MAI-Code-1-Flash-Model-Card.PDF"&gt;model card for MAI-Code-1-Flash&lt;/a&gt; lists it as 137B with 5B active and the &lt;a href="https://microsoft.ai/wp-content/uploads/2026/06/main_20260602_2.pdf"&gt;MAI-Thinking-1 technical paper&lt;/a&gt; reveals it to be a 1T model with 35B active.&lt;/p&gt;
&lt;p&gt;I deeply regret this error.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Update 2&lt;/strong&gt;: That technical paper describes the training data in some detail from page 80 onwards. It has the same licensing problems as all of the other major LLMs: it's trained on a crawl of the public web:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The majority of our web HTML corpus comes from a proprietary crawl. After initial page discovery and selection, approximately 1.2 trillion pages are crawled and parsed. [...] In addition to Microsoft standard policy Sec. 2.4, we apply UT1 block list (Prigent, 2026) to remove adult content and piracy-related domains. In all, this filtering reduces the corpus from 1.2 trillion pages to 794 billion pages. Given the prevalence of AI-generated content on the web, we also score pages with a proprietary AI-content detection model and use manual inspection to identify domains with extensive AI-generated content; those domains are filtered out of the training corpus.&lt;/p&gt;
&lt;p&gt;[...]&lt;/p&gt;
&lt;p&gt;We process Common Crawl with the same pipeline. [...] After filtering, deduplication, merging with the proprietary web corpus, and a final round of exact-URL and content-level fuzzy deduplication, the Common Crawl portion contains 24.2 billion pages.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I did not cover this one at all well, which is somewhat ironic since I was at the Microsoft Build conference when I wrote this up! I'm sorry for not digging deeper before publishing my initial notes.&lt;/p&gt;

    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/microsoft"&gt;microsoft&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/training-data"&gt;training-data&lt;/a&gt;&lt;/p&gt;



</summary><category term="llm-release"/><category term="generative-ai"/><category term="ai"/><category term="microsoft"/><category term="llms"/><category term="training-data"/></entry><entry><title>California Brown Pelican</title><link href="https://simonwillison.net/2026/Jun/2/sighting-367841339/#atom-tag" rel="alternate"/><published>2026-06-02T18:17:03+00:00</published><updated>2026-06-02T18:17:03+00:00</updated><id>https://simonwillison.net/2026/Jun/2/sighting-367841339/#atom-tag</id><summary type="html">
    
        &lt;p&gt;&lt;img src="https://static.inaturalist.org/photos/671786719/large.jpg" alt="California Brown Pelican"&gt;&lt;/p&gt;&lt;p&gt;California Brown Pelican, in Fort Mason, CA, US&lt;/p&gt;&lt;p&gt;I'm at the &lt;a href="https://build.microsoft.com/"&gt;Microsoft Build&lt;/a&gt; conference today, held at &lt;a href="https://en.wikipedia.org/wiki/Fort_Mason"&gt;Fort Mason&lt;/a&gt; in San Francisco. There are California Brown Pelicans diving into the water directly behind venue!&lt;/p&gt;
    
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/microsoft"&gt;microsoft&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="microsoft"/><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="llm-release"/></entry><entry><title>Claude Opus 4.8: "a modest but tangible improvement"</title><link href="https://simonwillison.net/2026/May/28/claude-opus-4-8/#atom-tag" rel="alternate"/><published>2026-05-28T23:59:50+00:00</published><updated>2026-05-28T23:59:50+00:00</updated><id>https://simonwillison.net/2026/May/28/claude-opus-4-8/#atom-tag</id><summary type="html">
    &lt;p&gt;Anthropic shipped &lt;a href="https://www.anthropic.com/news/claude-opus-4-8"&gt;Claude Opus 4.8&lt;/a&gt; today. My favourite thing about it is this note in the release announcement:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Users will find Opus 4.8 to be a modest but tangible improvement on its predecessor. There’s still more to be done: we’re working on developing and releasing models that provide many of the same capabilities as Opus at a lower cost.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;It's so refreshing to see an AI lab honestly describe a release as a minor incremental improvement over the previous model!&lt;/p&gt;
&lt;p&gt;Honesty seems to be a theme. Here's my other favorite note from that announcement:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;One of the most prominent improvements in Opus 4.8 is its &lt;em&gt;honesty&lt;/em&gt;. We train all our models to be honest---for instance, to avoid making claims that they can't support. But a general problem with AI models is that they sometimes jump to conclusions, confidently claiming to have made progress in their work despite the evidence being thin. Early testers report that Opus 4.8 is more likely to flag uncertainties about its work and less likely to make unsupported claims. This is borne out in &lt;a href="https://www.anthropic.com/claude-opus-4-8-system-card"&gt;our evaluations&lt;/a&gt;, which show that Opus 4.8 is around four times less likely than its predecessor to allow flaws in code it has written to pass unremarked.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;That linked system card includes the following:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Claude Opus 4.8 had the lowest incorrect-rate of the six models on every benchmark—the most direct measure of factual hallucination. It achieved this mainly by abstaining on questions about which it was uncertain rather than by answering more questions correctly.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h4 id="model-characteristics"&gt;Model characteristics&lt;/h4&gt;
&lt;p&gt;Not much has changed since 4.7.&lt;/p&gt;
&lt;p&gt;It's priced the same as Opus 4.5/4.6/4.7 - $5/million input and $25 per million output. "Fast mode" is twice that price, which is a significant reduction from their previous models - fast mode on 4.6/4.7 remains at $30/$150. Note that &lt;a href="https://platform.claude.com/docs/en/build-with-claude/fast-mode"&gt;fast mode&lt;/a&gt; is only available to organizations that are part of the research preview, "Contact your account manager to request access".&lt;/p&gt;
&lt;p&gt;Both the reliable knowledge cutoff and the training data cutoff are January 2026, the same as for 4.7.&lt;/p&gt;
&lt;p&gt;The context window is still 1,000,000 tokens, and the max output is 128,000 tokens.&lt;/p&gt;
&lt;p&gt;The &lt;a href="https://platform.claude.com/docs/en/about-claude/models/whats-new-claude-4-8"&gt;What's new in Claude Opus 4.8&lt;/a&gt; document has some of the more interesting details. These caught my eye:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Mid-conversation system messages&lt;/strong&gt;. Claude Opus 4.8 accepts &lt;code&gt;role: "system"&lt;/code&gt; messages immediately after a user turn in the &lt;code&gt;messages&lt;/code&gt; array (subject to &lt;a href="https://platform.claude.com/docs/en/build-with-claude/mid-conversation-system-messages#limitations"&gt;placement rules&lt;/a&gt;). This lets you append updated instructions later in a long-running conversation without restating the full system prompt, which preserves &lt;a href="https://platform.claude.com/docs/en/build-with-claude/prompt-caching"&gt;prompt cache&lt;/a&gt; hits on the earlier turns and reduces input cost on agentic loops.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;See also &lt;a href="https://github.com/anthropics/anthropic-sdk-python/commit/2b826760101664ef89db42132932f53ba97c894d#diff-a947c9c02eab58e8ddbe799a11832d533836d242e07c7251997f8543f0981f2f"&gt;this update&lt;/a&gt; to the Anthropic Python SDK. Being able to steer the system prompt mid-conversation sounds really powerful. I was worried this would be incompatible with the abstraction provided by my own &lt;a href="https://llm.datasette.io/en/stable/python-api.html#system-prompts"&gt;LLM library&lt;/a&gt;, which expects a single system prompt per conversation... but it turns out my recent &lt;a href="https://simonwillison.net/2026/Apr/29/llm/"&gt;redesign&lt;/a&gt; should handle that &lt;a href="https://github.com/simonw/llm-anthropic/issues/73"&gt;just fine&lt;/a&gt;.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Lower prompt cache minimum&lt;/strong&gt;. The minimum cacheable prompt length on Claude Opus 4.8 is 1,024 tokens, lower than on Claude Opus 4.7.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I checked and 4.7's minimum &lt;a href="https://platform.claude.com/docs/en/build-with-claude/prompt-caching#cache-limitations"&gt;was 4,096&lt;/a&gt;.&lt;/p&gt;
&lt;h4 id="and-some-pelicans"&gt;And some pelicans&lt;/h4&gt;
&lt;p&gt;Here are &lt;a href="https://tools.simonwillison.net/markdown-svg-renderer#url=https%3A%2F%2Fgist.github.com%2Fsimonw%2Ffea4f7546626d627862dc241a4e3a86a"&gt;pelicans riding bicycles&lt;/a&gt; for all five thinking levels, &lt;code&gt;low&lt;/code&gt;, &lt;code&gt;medium&lt;/code&gt;, &lt;code&gt;high&lt;/code&gt;, &lt;code&gt;xhigh&lt;/code&gt;, and &lt;code&gt;max&lt;/code&gt;:&lt;/p&gt;

&lt;div style="display: grid; grid-template-columns: repeat(6, 1fr); gap: 1rem; max-width: 900px; margin: 0 auto;"&gt;
    &lt;figure style="grid-column: span 2; margin: 0; text-align: center;"&gt;
        &lt;img src="https://static.simonwillison.net/static/2026/claude-opus-4.8-low.png" alt="Flat-style cartoon illustration of a white duck with an orange beak and legs riding a black bicycle, its feet on the pedals, against a blue sky and green grass background." style="width: 100%; height: auto; border: 1px solid #ccc;" /&gt;
        &lt;figcaption style="margin-top: 0.5rem; font-family: system-ui, sans-serif; font-weight: bold;"&gt;
            &lt;a href="https://gist.github.com/simonw/fea4f7546626d627862dc241a4e3a86a#response"&gt;low&lt;/a&gt;
        &lt;/figcaption&gt;
    &lt;/figure&gt;
    &lt;figure style="grid-column: span 2; margin: 0; text-align: center;"&gt;
        &lt;img src="https://static.simonwillison.net/static/2026/claude-opus-4.8-medium.png" alt="Flat-style illustration of a white egret or heron with an orange beak and legs riding a black bicycle, against a blue sky and green grass background." style="width: 100%; height: auto; border: 1px solid #ccc;" /&gt;
        &lt;figcaption style="margin-top: 0.5rem; font-family: system-ui, sans-serif; font-weight: bold;"&gt;
            &lt;a href="https://gist.github.com/simonw/fea4f7546626d627862dc241a4e3a86a#response-1"&gt;medium&lt;/a&gt;
        &lt;/figcaption&gt;
    &lt;/figure&gt;
    &lt;figure style="grid-column: span 2; margin: 0; text-align: center;"&gt;
        &lt;img src="https://static.simonwillison.net/static/2026/claude-opus-4.8-high.png" alt="Cartoon illustration of a white duck with an orange beak riding a black bicycle, against a light blue sky with a pale yellow sun in the upper left and a green ground line at the bottom." style="width: 100%; height: auto; border: 1px solid #ccc;" /&gt;
        &lt;figcaption style="margin-top: 0.5rem; font-family: system-ui, sans-serif; font-weight: bold;"&gt;
            &lt;a href="https://gist.github.com/simonw/fea4f7546626d627862dc241a4e3a86a#response-2"&gt;high&lt;/a&gt;
        &lt;/figcaption&gt;
    &lt;/figure&gt;
    &lt;figure style="grid-column: span 3; margin: 0; text-align: center;"&gt;
        &lt;img src="https://static.simonwillison.net/static/2026/claude-opus-4.8-xhigh.png" alt="Cartoon illustration of a white pelican with an orange beak riding a black bicycle, its orange legs extending down to the pedals, against a blue sky with a yellow sun and green ground." style="width: 100%; height: auto; border: 1px solid #ccc;" /&gt;
        &lt;figcaption style="margin-top: 0.5rem; font-family: system-ui, sans-serif; font-weight: bold;"&gt;
            &lt;a href="https://gist.github.com/simonw/fea4f7546626d627862dc241a4e3a86a#response-3"&gt;xhigh&lt;/a&gt;
        &lt;/figcaption&gt;
    &lt;/figure&gt;
    &lt;figure style="grid-column: span 3; margin: 0; text-align: center;"&gt;
        &lt;img src="https://static.simonwillison.net/static/2026/claude-opus-4.8-max.png" alt="Cartoon illustration of a white pelican with an orange beak riding a red bicycle on green grass, against a light blue sky with a fluffy white cloud and a yellow sun." style="width: 100%; height: auto; border: 1px solid #ccc;" /&gt;
        &lt;figcaption style="margin-top: 0.5rem; font-family: system-ui, sans-serif; font-weight: bold;"&gt;&lt;a href="https://gist.github.com/simonw/fea4f7546626d627862dc241a4e3a86a#response-4"&gt;max&lt;/a&gt;&lt;/figcaption&gt;
    &lt;/figure&gt;
&lt;/div&gt;


&lt;p&gt;This time I ran them using the &lt;a href="https://llm.datasette.io/en/stable/usage.html"&gt;LLM CLI&lt;/a&gt;, exported the logs to Markdown and then had Claude Opus 4.8 &lt;a href="https://github.com/simonw/tools/commit/71e4944766b577a327ff048cc63b739ba4cbade9"&gt;build me&lt;/a&gt; an HTML tool that could render that Markdown with the &lt;code&gt;svg&lt;/code&gt; fenced code blocks displayed as SVGs on the page.&lt;/p&gt;

&lt;p&gt;(I later had GPT-5.5 xhigh in Codex &lt;a href="https://gist.github.com/simonw/bb5a267f8144dfe4e92e50a014e49e98"&gt;update that code&lt;/a&gt; to remove any XSS holes. I'm sure Claude could have done that if I'd asked, but GPT-5.5 is my code security blanket at the moment.)&lt;/p&gt;

&lt;p&gt;The max one  was clearly the best, but it did take 25 input, 17,167 output tokens for a total cost of &lt;a href="https://www.llm-prices.com/#it=25&amp;amp;ot=17167&amp;amp;ic=5&amp;amp;oc=25&amp;amp;sel=claude-opus-4-5"&gt;43 cents&lt;/a&gt;!&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/anthropic"&gt;anthropic&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude"&gt;claude&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="anthropic"/><category term="claude"/><category term="pelican-riding-a-bicycle"/><category term="llm-release"/></entry><entry><title>Gemini 3.5 Flash: more expensive, but Google plan to use it for everything</title><link href="https://simonwillison.net/2026/May/19/gemini-35-flash/#atom-tag" rel="alternate"/><published>2026-05-19T22:40:25+00:00</published><updated>2026-05-19T22:40:25+00:00</updated><id>https://simonwillison.net/2026/May/19/gemini-35-flash/#atom-tag</id><summary type="html">
    &lt;p&gt;Today at Google I/O, Google &lt;a href="https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-3-5/"&gt;released Gemini 3.5 Flash&lt;/a&gt;. This one skipped the &lt;code&gt;-preview&lt;/code&gt; modifier and went straight to general availability, and Google appear to be using it for a whole lot of their key products:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;3.5 Flash is available today to billions of people globally:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;For everyone via the Gemini app and AI Mode in &lt;a href="https://blog.google/products-and-platforms/products/search/search-io-2026"&gt;Google Search&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;For developers in our agent-first development platform Google Antigravity and Gemini API in Google AI Studio and Android Studio&lt;/li&gt;
&lt;li&gt;For enterprises in Gemini Enterprise Agent Platform and Gemini Enterprise.&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;As usual with Gemini, the most interesting details are tucked away in the &lt;a href="https://ai.google.dev/gemini-api/docs/whats-new-gemini-3.5"&gt;What's new in Gemini 3.5 Flash&lt;/a&gt; developer documentation. It mostly has the same set of platform features as the previous Gemini 3.x series, albeit with no &lt;a href="https://ai.google.dev/gemini-api/docs/computer-use"&gt;computer use&lt;/a&gt;. The model ID is &lt;code&gt;gemini-3.5-flash&lt;/code&gt;. The knowledge cut-off is January 2025, and it supports 1,048,576 input tokens and 65,536 maximum output tokens.&lt;/p&gt;
&lt;p&gt;Google are also pushing a new &lt;a href="https://ai.google.dev/gemini-api/docs/interactions"&gt;Interactions API&lt;/a&gt;, currently in beta, which looks to me like their version of the patterns introduced by &lt;a href="https://developers.openai.com/api/reference/responses/overview"&gt;OpenAI Responses&lt;/a&gt; - in particular server-side history management.&lt;/p&gt;
&lt;h4 id="the-price-has-gone-up"&gt;The price has gone up&lt;/h4&gt;
&lt;p&gt;Gemini 3.5 Flash is accompanied by a notable price bump. The previous models in the "Flash" family were &lt;a href="https://ai.google.dev/gemini-api/docs/models/gemini-3-flash-preview"&gt;Gemini 3 Flash Preview&lt;/a&gt; and &lt;a href="https://ai.google.dev/gemini-api/docs/models/gemini-3.1-flash-lite"&gt;Gemini 3.1 Flash-Lite&lt;/a&gt;. The new 3.5 Flash is 3x the price of 3 Flash Preview and 6x the price of 3.1 Flash-Lite (see &lt;a href="https://www.llm-prices.com/#sel=gemini-3-flash-preview%2Cgemini-3.5-flash%2Cgemini-3.1-flash-lite-preview"&gt;price comparison here&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;At $1.50/million input and $9/million output it's getting close in price to Google's Gemini 3.1 Pro, which is $2 and $12.&lt;/p&gt;
&lt;p&gt;The Gemini team promise that 3.5 Pro will roll out "next month" - presumably at an even higher price.&lt;/p&gt;
&lt;p&gt;This fits a trend: OpenAI's GPT-5.5 was 2x the price of GPT-5.4, and Claude Opus 4.7 is around 1.46x the price of 4.6 when you take the &lt;a href="https://simonwillison.net/2026/Apr/20/claude-token-counts/"&gt;new tokenizer into account&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Given the price increase it's interesting to see Google roll it out for so many of their own free-to-consumer products. It feels like all three of the major AI labs are starting to probe the price tolerance of their API customers.&lt;/p&gt;
&lt;p&gt;Artificial Analysis publish the cost to run their proprietary benchmark against models, which is a useful way to take things like tokenization and increased volume of reasoning tokens into account. Some numbers worth comparing:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://artificialanalysis.ai/models/gemini-3-5-flash"&gt;Gemini 3.5 Flash (high)&lt;/a&gt;: $1,551.60&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://artificialanalysis.ai/models/gemini-3-1-pro-preview"&gt;Gemini 3.1 Pro Preview&lt;/a&gt;: $892.28&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://artificialanalysis.ai/models/gemini-3-flash-reasoning"&gt;Gemini 3 Flash Preview (Reasoning)&lt;/a&gt;: $278.26&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://artificialanalysis.ai/models/gemini-3-1-flash-lite-preview"&gt;Gemini 3.1 Flash-Lite Preview&lt;/a&gt;: $93.60&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Running the benchmark for 3.5 Flash (high) cost significantly more than 3.1 Pro Preview!&lt;/p&gt;
&lt;p&gt;Here are some numbers from other vendors:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://artificialanalysis.ai/models/claude-opus-4-7"&gt;Claude Opus 4.7 (Adaptive Reasoning, Max Effort)&lt;/a&gt;: $5,117.14&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://artificialanalysis.ai/models/claude-opus-4-7-non-reasoning"&gt;Claude Opus 4.7 (Non-reasoning, High Effort)&lt;/a&gt;: $1,217.23&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://artificialanalysis.ai/models/gpt-5-5"&gt;GPT-5.5 (xhigh)&lt;/a&gt;: $3,357.00&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://artificialanalysis.ai/models/gpt-5-5-medium"&gt;GPT-5.5 (medium)&lt;/a&gt;: $1,199.14&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 id="a-pelican-on-a-bicycle"&gt;A pelican on a bicycle&lt;/h4&gt;
&lt;p&gt;I ran "Generate an SVG of a pelican riding a bicycle" &lt;a href="https://gist.github.com/simonw/09cc5a5545d7e75b33b75ffa92a34601"&gt;against the Gemini API&lt;/a&gt; and got back this pelican, which is a &lt;em&gt;lot&lt;/em&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2026/gemini-3.5-flash.png" alt="Black background, bats in the sky against a stylized moon. Pelican is funky looking. Very good beak. Bicycle frame is a bit twisted, and the bar from pedals to back wheel is missing. Bike lamp illuminates the road in front. Quite stylish." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;From the code comments: &lt;code&gt;&amp;lt;!-- Pelican Eye / Sunglasses (Cool Retro Aviators) --&amp;gt;&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href="https://news.ycombinator.com/item?id=48196570#48198275"&gt;hedgehog on Hacker News&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;That pelican looks like it's in Miami for a crypto conference.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;That one cost me 11 input tokens and 14,403 output tokens, for a total cost of &lt;a href="https://www.llm-prices.com/#it=11&amp;amp;ot=14403&amp;amp;sel=gemini-3.5-flash"&gt;just under 13 cents&lt;/a&gt;.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/google"&gt;google&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gemini"&gt;gemini&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-pricing"&gt;llm-pricing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="google"/><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="gemini"/><category term="llm-pricing"/><category term="pelican-riding-a-bicycle"/><category term="llm-release"/></entry><entry><title>llm-gemini 0.31</title><link href="https://simonwillison.net/2026/May/7/llm-gemini/#atom-tag" rel="alternate"/><published>2026-05-07T19:57:06+00:00</published><updated>2026-05-07T19:57:06+00:00</updated><id>https://simonwillison.net/2026/May/7/llm-gemini/#atom-tag</id><summary type="html">
    
        &lt;p&gt;&lt;strong&gt;Release:&lt;/strong&gt; &lt;a href="https://github.com/simonw/llm-gemini/releases/tag/0.31"&gt;llm-gemini 0.31&lt;/a&gt;&lt;/p&gt;
        &lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;gemini-3.1-flash-lite&lt;/code&gt; is &lt;a href="https://cloud.google.com/blog/products/ai-machine-learning/gemini-3-1-flash-lite-is-now-generally-available"&gt;no longer a preview&lt;/a&gt;. &lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;Here's my write-up of the &lt;a href="https://simonwillison.net/2026/Mar/3/gemini-31-flash-lite/"&gt;Gemini 3.1 Flash-Lite Preview model&lt;/a&gt; back in March. I don't believe this new non-preview model has changed since then.&lt;/p&gt;
    
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/google"&gt;google&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gemini"&gt;gemini&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="google"/><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="llm"/><category term="gemini"/><category term="llm-release"/></entry><entry><title>Granite 4.1 3B SVG Pelican Gallery</title><link href="https://simonwillison.net/2026/May/4/granite-41-3b-svg-pelican-gallery/#atom-tag" rel="alternate"/><published>2026-05-04T23:49:24+00:00</published><updated>2026-05-04T23:49:24+00:00</updated><id>https://simonwillison.net/2026/May/4/granite-41-3b-svg-pelican-gallery/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://simonw.github.io/granite-4.1-3b-gguf-pelicans/"&gt;Granite 4.1 3B SVG Pelican Gallery&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
IBM released their &lt;a href="https://research.ibm.com/blog/granite-4-1-ai-foundation-models"&gt;Granite 4.1 family&lt;/a&gt; of LLMs a few days ago. They're Apache 2.0 licensed and come in 3B, 8B and 30B sizes.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://huggingface.co/blog/ibm-granite/granite-4-1"&gt;Granite 4.1 LLMs: How They’re Built&lt;/a&gt; by Granite team member Yousaf Shah describes the training process in detail.&lt;/p&gt;
&lt;p&gt;Unsloth released the &lt;a href="https://huggingface.co/unsloth/granite-4.1-3b-GGUF"&gt;unsloth/granite-4.1-3b-GGUF&lt;/a&gt; collection of GGUF encoded quantized variants of the 3B model - 21 different model files ranging in size from 1.2GB to 6.34GB.&lt;/p&gt;
&lt;p&gt;All 21 of those Unsloth files add up to 51.3GB, which inspired me to finally try an experiment I've been wanting to run for ages: prompting "Generate an SVG of a pelican riding a bicycle" against different sized quantized variants of the same model to see what the results would look like.&lt;/p&gt;
&lt;p&gt;Honestly, &lt;a href="https://simonw.github.io/granite-4.1-3b-gguf-pelicans/"&gt;the results&lt;/a&gt; are less interesting than I expected. There's no distinguishable pattern relating quality to size - they're all pretty terrible!&lt;/p&gt;
&lt;p&gt;&lt;img alt="Six different SVG images from models ranging in size from 1.67GB to 1.2GB. They are almost all an abstract collection of shapes - weirdly the smallest model had the best version of a bicycle, while the largest one had something that looked a tiny bit like a pelican." src="https://static.simonwillison.net/static/2026/granite-3B-pelicans.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;I'll likely try this again in the future with a model that's better at drawing pelicans.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ibm"&gt;ibm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;&lt;/p&gt;



</summary><category term="ibm"/><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="pelican-riding-a-bicycle"/><category term="llm-release"/></entry><entry><title>Introducing talkie: a 13B vintage language model from 1930</title><link href="https://simonwillison.net/2026/Apr/28/talkie/#atom-tag" rel="alternate"/><published>2026-04-28T02:47:42+00:00</published><updated>2026-04-28T02:47:42+00:00</updated><id>https://simonwillison.net/2026/Apr/28/talkie/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://talkie-lm.com/introducing-talkie"&gt;Introducing talkie: a 13B vintage language model from 1930&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
New project from &lt;a href="https://nlevine.org"&gt;Nick Levine&lt;/a&gt;, &lt;a href="http://www.cs.toronto.edu/~duvenaud/"&gt;David Duvenaud&lt;/a&gt;, and &lt;a href="https://en.wikipedia.org/wiki/Alec_Radford"&gt;Alec Radford&lt;/a&gt; (of GPT, GPT-2, Whisper fame).&lt;/p&gt;
&lt;p&gt;&lt;a href="https://huggingface.co/talkie-lm/talkie-1930-13b-base"&gt;talkie-1930-13b-base&lt;/a&gt; (53.1 GB) is a "13B language model trained on 260B tokens of historical pre-1931 English text". &lt;/p&gt;
&lt;p&gt;&lt;a href="https://huggingface.co/talkie-lm/talkie-1930-13b-it"&gt;talkie-1930-13b-it&lt;/a&gt; (26.6 GB) is a checkpoint "finetuned using a novel dataset of instruction-response pairs extracted from pre-1931 reference works", designed to power a chat interface. You can &lt;a href="https://talkie-lm.com/chat"&gt;try that out here&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Both models are Apache 2.0 licensed. Since the training data for the base model is entirely out of copyright (the USA copyright cutoff date is currently January 1, 1931), I'm hoping they later decide to release the training data as well.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Update&lt;/em&gt; on that: &lt;a href="https://twitter.com/status_effects/status/2049065134014726301"&gt;Nick Levine on Twitter&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Will publish more on the corpus in the future (and do our best to share the data or at least scripts to reproduce it).&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Their report suggests some fascinating research objectives for this class of model, including:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;How good are these models at predicting the future? "we calculated the surprisingness of short descriptions of historical events to a 13B model trained on pre-1931 text"&lt;/li&gt;
&lt;li&gt;Can these models invent things that are past their knowledge cutoffs? "As Demis Hassabis has asked, could a model trained up to 1911 independently discover General Relativity, as Einstein did in 1915?"&lt;/li&gt;
&lt;li&gt;Can they be taught to program? "Figure 3 (left-hand side) shows an early example of such a test, measuring how well models trained on pre-1931 text can, when given a few demonstration examples of &lt;a href="https://github.com/openai/human-eval"&gt;Python programs&lt;/a&gt;, write new correct programs."&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I have a long-running interest in what I call "vegan models" - LLMs that are trained entirely on licensed or out-of-copyright data. I think the base model here qualifies, but the chat model isn't entirely pure due to the reliance on non-vegan models to help with the fine-tuning - emphasis mine:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;First, we generated instruction-response pairs from historical texts with regular structure, such as etiquette manuals, letter-writing manuals, cookbooks, dictionaries, encyclopedias, and poetry and fable collections (see Figure 7), and fine-tuned our base model on them using a simple chat format.&lt;/p&gt;
&lt;p&gt;Next, to improve instruction-following abilities, we generated synthetic prompts covering different types of tasks, such as summarizing documents, responding to direct information requests, and continuing multi-turn conversations coherently. We then ran online direct preference optimization on rollouts generated from these prompts, &lt;strong&gt;using Claude Sonnet 4.6 as a judge&lt;/strong&gt;. [...]&lt;/p&gt;
&lt;p&gt;Finally, we did another round of supervised fine-tuning, this time on rejection-sampled &lt;strong&gt;multi-turn synthetic chats between Claude Opus 4.6 and talkie&lt;/strong&gt;, to smooth out persistent rough edges in its conversational abilities.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;A major challenge in training talkie was avoiding accidental contamination, either from post-1931 text that snuck into the training corpus or from the fine-tuning assistance from modern LLMs introducing anachronistic knowledge to the chat model. &lt;/p&gt;
&lt;p&gt;I recently &lt;a href="https://simonwillison.net/2026/Mar/30/mr-chatterbox/"&gt;wrote about Mr. Chatterbox&lt;/a&gt;, a similar vintage language model project which also had to turn to modern LLMs to create synthetic conversations to help train the chat model.&lt;/p&gt;
&lt;p&gt;The talkie team aspire to eventually move beyond this limitation:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;While we have tried to post-train talkie free from modern influence, reinforcement learning with AI feedback inevitably shapes talkie’s behavior anachronistically. (The 7B version of talkie emerged from RL speaking in listicles.) As we scale up, we hope to be able to use our vintage base models themselves as judges to enable a fully bootstrapped era-appropriate post-training pipeline.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I couldn't resist hitting the talkie demo with my classic test prompt:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;Generate an SVG of a pelican riding a bicycle&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;It replied:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;An SVG of a pelican riding a bicycle was generated in 1860. The pelican was represented as perched on the saddle, with its long bill pointed forward, and its legs astride the handles. The idea was suggested by the fact that pelicans are sometimes seen fishing on horseback, on the banks of the Rhine.&lt;/p&gt;
&lt;/blockquote&gt;

    &lt;p&gt;&lt;small&gt;&lt;/small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=47927903"&gt;Hacker News&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/training-data"&gt;training-data&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-ethics"&gt;ai-ethics&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="generative-ai"/><category term="local-llms"/><category term="llms"/><category term="training-data"/><category term="ai-ethics"/><category term="llm-release"/></entry><entry><title>DeepSeek V4 - almost on the frontier, a fraction of the price</title><link href="https://simonwillison.net/2026/Apr/24/deepseek-v4/#atom-tag" rel="alternate"/><published>2026-04-24T06:01:04+00:00</published><updated>2026-04-24T06:01:04+00:00</updated><id>https://simonwillison.net/2026/Apr/24/deepseek-v4/#atom-tag</id><summary type="html">
    &lt;p&gt;Chinese AI lab DeepSeek's last model release was V3.2 (and V3.2 Speciale) &lt;a href="https://simonwillison.net/2025/Dec/1/deepseek-v32/"&gt;last December&lt;/a&gt;. They just dropped the first of their hotly anticipated V4 series in the shape of two preview models, &lt;a href="https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro"&gt;DeepSeek-V4-Pro&lt;/a&gt; and &lt;a href="https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash"&gt;DeepSeek-V4-Flash&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Both models are 1 million token context Mixture of Experts. Pro is 1.6T total parameters, 49B active. Flash is 284B total, 13B active. They're using the standard MIT license.&lt;/p&gt;
&lt;p&gt;I think this makes DeepSeek-V4-Pro the new largest open weights model. It's larger than Kimi K2.6 (1.1T) and GLM-5.1 (754B) and more than twice the size of DeepSeek V3.2 (685B).&lt;/p&gt;
&lt;p&gt;Pro is 865GB on Hugging Face, Flash is 160GB. I'm hoping that a lightly quantized Flash will run on my 128GB M5 MacBook Pro. It's &lt;em&gt;possible&lt;/em&gt; the Pro model may run on it if I can stream just the necessary active experts from disk.&lt;/p&gt;
&lt;p&gt;For the moment I tried the models out via &lt;a href="https://openrouter.ai/"&gt;OpenRouter&lt;/a&gt;, using &lt;a href="https://github.com/simonw/llm-openrouter"&gt;llm-openrouter&lt;/a&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm install llm-openrouter
llm openrouter refresh
llm -m openrouter/deepseek/deepseek-v4-pro 'Generate an SVG of a pelican riding a bicycle'
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here's the pelican &lt;a href="https://gist.github.com/simonw/4a7a9e75b666a58a0cf81495acddf529"&gt;for DeepSeek-V4-Flash&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2026/deepseek-v4-flash.png" alt="Excellent bicycle - good frame shape, nice chain, even has a reflector on the front wheel. Pelican has a mean looking expression but has its wings on the handlebars and feet on the pedals. Pouch is a little sharp." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;And &lt;a href="https://gist.github.com/simonw/9e8dfed68933ab752c9cf27a03250a7c"&gt;for DeepSeek-V4-Pro&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2026/deepseek-v4-pro.png" alt="Another solid bicycle, albeit the spokes are a little jagged and the frame is compressed a bit. Pelican has gone a bit wrong - it has a VERY large body, only one wing, a weirdly hairy backside and generally loos like it was drown be a different artist from the bicycle." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;For comparison, take a look at the pelicans I got from &lt;a href="https://simonwillison.net/2025/Dec/1/deepseek-v32/"&gt;DeepSeek V3.2 in December&lt;/a&gt;, &lt;a href="https://simonwillison.net/2025/Aug/22/deepseek-31/"&gt;V3.1 in August&lt;/a&gt;, and &lt;a href="https://simonwillison.net/2025/Mar/24/deepseek/"&gt;V3-0324 in March 2025&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;So the pelicans are pretty good, but what's really notable here is the &lt;em&gt;cost&lt;/em&gt;. DeepSeek V4 is a very, very inexpensive model.&lt;/p&gt;
&lt;p&gt;This is &lt;a href="https://api-docs.deepseek.com/quick_start/pricing"&gt;DeepSeek's pricing page&lt;/a&gt;. They're charging $0.14/million tokens input and $0.28/million tokens output for Flash, and $1.74/million input and $3.48/million output for Pro.&lt;/p&gt;
&lt;p&gt;Here's a comparison table with the frontier models from Gemini, OpenAI and Anthropic:&lt;/p&gt;
&lt;center&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Input ($/M)&lt;/th&gt;
&lt;th&gt;Output ($/M)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;DeepSeek V4 Flash&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$0.14&lt;/td&gt;
&lt;td&gt;$0.28&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT-5.4 Nano&lt;/td&gt;
&lt;td&gt;$0.20&lt;/td&gt;
&lt;td&gt;$1.25&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gemini 3.1 Flash-Lite&lt;/td&gt;
&lt;td&gt;$0.25&lt;/td&gt;
&lt;td&gt;$1.50&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gemini 3 Flash Preview&lt;/td&gt;
&lt;td&gt;$0.50&lt;/td&gt;
&lt;td&gt;$3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT-5.4 Mini&lt;/td&gt;
&lt;td&gt;$0.75&lt;/td&gt;
&lt;td&gt;$4.50&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude Haiku 4.5&lt;/td&gt;
&lt;td&gt;$1&lt;/td&gt;
&lt;td&gt;$5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;DeepSeek V4 Pro&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$1.74&lt;/td&gt;
&lt;td&gt;$3.48&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gemini 3.1 Pro&lt;/td&gt;
&lt;td&gt;$2&lt;/td&gt;
&lt;td&gt;$12&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT-5.4&lt;/td&gt;
&lt;td&gt;$2.50&lt;/td&gt;
&lt;td&gt;$15&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude Sonnet 4.6&lt;/td&gt;
&lt;td&gt;$3&lt;/td&gt;
&lt;td&gt;$15&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude Opus 4.7&lt;/td&gt;
&lt;td&gt;$5&lt;/td&gt;
&lt;td&gt;$25&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT-5.5&lt;/td&gt;
&lt;td&gt;$5&lt;/td&gt;
&lt;td&gt;$30&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/center&gt;
&lt;p&gt;DeepSeek-V4-Flash is the cheapest of the small models, beating even OpenAI's GPT-5.4 Nano. DeepSeek-V4-Pro is the cheapest of the larger frontier models.&lt;/p&gt;
&lt;p&gt;This note from &lt;a href="https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash/blob/main/DeepSeek_V4.pdf"&gt;the DeepSeek paper&lt;/a&gt; helps explain why they can price these models so low - they've focused a great deal on efficiency with this release, especially for longer context prompts:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;In the scenario of 1M-token context, even DeepSeek-V4-Pro, which has a larger number of activated parameters, attains only 27% of the single-token FLOPs (measured in equivalent FP8 FLOPs) and 10% of the KV cache size relative to DeepSeek-V3.2. Furthermore, DeepSeek-V4-Flash, with its smaller number of activated parameters, pushes efficiency even further: in the 1M-token context setting, it achieves only 10% of the single-token FLOPs and 7% of the KV cache size compared with DeepSeek-V3.2.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;DeepSeek's self-reported benchmarks &lt;a href="https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash/blob/main/DeepSeek_V4.pdf"&gt;in their paper&lt;/a&gt; show their Pro model competitive with those other frontier models, albeit with this note:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Through the expansion of reasoning tokens, DeepSeek-V4-Pro-Max demonstrates superior performance relative to GPT-5.2 and Gemini-3.0-Pro on standard reasoning benchmarks. Nevertheless, its performance falls marginally short of GPT-5.4 and Gemini-3.1-Pro, suggesting a developmental trajectory that trails state-of-the-art frontier models by approximately 3 to 6 months.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I'm keeping an eye on &lt;a href="https://huggingface.co/unsloth/models"&gt;huggingface.co/unsloth/models&lt;/a&gt; as I expect the Unsloth team will have a set of quantized versions out pretty soon. It's going to be very interesting to see how well that Flash model runs on my own machine.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-pricing"&gt;llm-pricing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/deepseek"&gt;deepseek&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openrouter"&gt;openrouter&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-in-china"&gt;ai-in-china&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="llm"/><category term="llm-pricing"/><category term="pelican-riding-a-bicycle"/><category term="deepseek"/><category term="llm-release"/><category term="openrouter"/><category term="ai-in-china"/></entry><entry><title>A pelican for GPT-5.5 via the semi-official Codex backdoor API</title><link href="https://simonwillison.net/2026/Apr/23/gpt-5-5/#atom-tag" rel="alternate"/><published>2026-04-23T19:59:47+00:00</published><updated>2026-04-23T19:59:47+00:00</updated><id>https://simonwillison.net/2026/Apr/23/gpt-5-5/#atom-tag</id><summary type="html">
    &lt;p&gt;&lt;a href="https://openai.com/index/introducing-gpt-5-5/"&gt;GPT-5.5 is out&lt;/a&gt;. It's available in OpenAI Codex and is rolling out to paid ChatGPT subscribers. I've had some preview access and found it to be a fast, effective and highly capable model. As is usually the case these days, it's hard to put into words what's good about it - I ask it to build things and it builds exactly what I ask for!&lt;/p&gt;
&lt;p&gt;There's one notable omission from today's release - the API:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;API deployments require different safeguards and we are working closely with partners and customers on the safety and security requirements for serving it at scale. We'll bring GPT‑5.5 and GPT‑5.5 Pro to the API very soon.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;When I run my &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle/"&gt;pelican benchmark&lt;/a&gt; I always prefer to use an API, to avoid hidden system prompts in ChatGPT or other agent harnesses from impacting the results.&lt;/p&gt;
&lt;h4 id="the-openclaw-backdoor"&gt;The OpenClaw backdoor&lt;/h4&gt;
&lt;p&gt;One of the ongoing tension points in the AI world over the past few months has concerned how agent harnesses like OpenClaw and Pi interact with the APIs provided by the big providers.&lt;/p&gt;
&lt;p&gt;Both OpenAI and Anthropic offer popular monthly subscriptions which provide access to their models at a significant discount to their raw API.&lt;/p&gt;
&lt;p&gt;OpenClaw integrated directly with this mechanism, and was then &lt;a href="https://www.theverge.com/ai-artificial-intelligence/907074/anthropic-openclaw-claude-subscription-ban"&gt;blocked from doing so&lt;/a&gt; by Anthropic. This kicked off a whole thing. OpenAI - who recently hired OpenClaw creator Peter Steinberger - saw an opportunity for an easy karma win and announced that OpenClaw was welcome to continue integrating with OpenAI's subscriptions via the same mechanism used by their (open source) Codex CLI tool.&lt;/p&gt;
&lt;p&gt;Does this mean &lt;em&gt;anyone&lt;/em&gt; can write code that integrates with OpenAI's Codex-specific APIs to hook into those existing subscriptions?&lt;/p&gt;
&lt;p&gt;The other day &lt;a href="https://twitter.com/jeremyphoward/status/2046537816834965714"&gt;Jeremy Howard asked&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Anyone know whether OpenAI officially supports the use of the &lt;code&gt;/backend-api/codex/responses&lt;/code&gt; endpoint that Pi and Opencode (IIUC) uses?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;It turned out that on March 30th OpenAI's Romain Huet &lt;a href="https://twitter.com/romainhuet/status/2038699202834841962"&gt;had tweeted&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;We want people to be able to use Codex, and their ChatGPT subscription, wherever they like! That means in the app, in the terminal, but also in JetBrains, Xcode, OpenCode, Pi, and now Claude Code.&lt;/p&gt;
&lt;p&gt;That’s why Codex CLI and Codex app server are open source too! 🙂&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;And Peter Steinberger &lt;a href="https://twitter.com/steipete/status/2046775849769148838"&gt;replied to Jeremy&lt;/a&gt; that:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;OpenAI sub is officially supported.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h4 id="llm-openai-via-codex"&gt;llm-openai-via-codex&lt;/h4&gt;
&lt;p&gt;So... I had Claude Code reverse-engineer the &lt;a href="https://github.com/openai/codex"&gt;openai/codex&lt;/a&gt; repo, figure out how authentication tokens were stored and build me &lt;a href="https://github.com/simonw/llm-openai-via-codex"&gt;llm-openai-via-codex&lt;/a&gt;, a new plugin for &lt;a href="https://llm.datasette.io/"&gt;LLM&lt;/a&gt; which picks up your existing Codex subscription and uses it to run prompts!&lt;/p&gt;
&lt;p&gt;(With hindsight I wish I'd used GPT-5.4 or the GPT-5.5 preview, it would have been funnier. I genuinely considered rewriting the project from scratch using Codex and GPT-5.5 for the sake of the joke, but decided not to spend any more time on this!)&lt;/p&gt;
&lt;p&gt;Here's how to use it:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Install Codex CLI, buy an OpenAI plan, login to Codex&lt;/li&gt;
&lt;li&gt;Install LLM: &lt;code&gt;uv tool install llm&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Install the new plugin: &lt;code&gt;llm install llm-openai-via-codex&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Start prompting: &lt;code&gt;llm -m openai-codex/gpt-5.5 'Your prompt goes here'&lt;/code&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;All existing LLM features should also work - use &lt;code&gt;-a filepath.jpg/URL&lt;/code&gt; to attach an image, &lt;code&gt;llm chat -m openai-codex/gpt-5.5&lt;/code&gt; to start an ongoing chat, &lt;code&gt;llm logs&lt;/code&gt; to view logged conversations and &lt;code&gt;llm --tool ...&lt;/code&gt; to &lt;a href="https://llm.datasette.io/en/stable/tools.html"&gt;try it out with tool support&lt;/a&gt;.&lt;/p&gt;
&lt;h4 id="and-some-pelicans"&gt;And some pelicans&lt;/h4&gt;
&lt;p&gt;Let's generate a pelican!&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm install llm-openai-via-codex
llm -m openai-codex/gpt-5.5 &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;Generate an SVG of a pelican riding a bicycle&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Here's &lt;a href="https://gist.github.com/simonw/edda1d98f7ba07fd95eeff473cb16634"&gt;what I got back&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2026/gpt-5.5-pelican.png" alt="It is a bit mangled to be honest - good beak, pelican body shapes are slightly weird, legs do at least extend to the pedals, bicycle frame is not quite right." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;I've seen better &lt;a href="https://simonwillison.net/2026/Mar/17/mini-and-nano/#pelicans"&gt;from GPT-5.4&lt;/a&gt;, so I tagged on &lt;code&gt;-o reasoning_effort xhigh&lt;/code&gt; and &lt;a href="https://gist.github.com/simonw/a6168e4165a258e4d664aeae8e602cc5"&gt;tried again&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;That one took almost four minutes to generate, but I think it's a much better effort.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2026/gpt-5.5-pelican-xhigh.png" alt="Pelican has gradients now, body is much better put together, bicycle is nearly the right shape albeit with one extra bar between pedals and front wheel, clearly a better image overall." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;If you compare the SVG code (&lt;a href="https://gist.github.com/simonw/edda1d98f7ba07fd95eeff473cb16634#response"&gt;default&lt;/a&gt;, &lt;a href="https://gist.github.com/simonw/a6168e4165a258e4d664aeae8e602cc5#response"&gt;xhigh&lt;/a&gt;) the &lt;code&gt;xhigh&lt;/code&gt; one took a very different approach, which is much more CSS-heavy - as demonstrated by those gradients. &lt;code&gt;xhigh&lt;/code&gt; used 9,322 reasoning tokens where the default used just 39.&lt;/p&gt;
&lt;h4 id="a-few-more-notes-on-gpt-5-5"&gt;A few more notes on GPT-5.5&lt;/h4&gt;
&lt;p&gt;One of the most notable things about GPT-5.5 is the pricing. Once it goes live in the API it's &lt;a href="https://openai.com/index/introducing-gpt-5-5/#availability-and-pricing"&gt;going to be priced&lt;/a&gt; at &lt;em&gt;twice&lt;/em&gt; the cost of GPT-5.4 - $5 per 1M input tokens and $30 per 1M output tokens, where 5.4 is $2.5 and $15.&lt;/p&gt;
&lt;p&gt;GPT-5.5 Pro will be even more: $30 per 1M input tokens and $180 per 1M output tokens.&lt;/p&gt;
&lt;p&gt;GPT-5.4 will remain available. At half the price of 5.5 this feels like 5.4 is to 5.5 as Claude Sonnet is to Claude Opus.&lt;/p&gt;
&lt;p&gt;Ethan Mollick has a &lt;a href="https://www.oneusefulthing.org/p/sign-of-the-future-gpt-55"&gt;detailed review of GPT-5.5&lt;/a&gt; where he put it (and GPT-5.5 Pro) through an array of interesting challenges. His verdict: the jagged frontier continues to hold, with GPT-5.5 excellent at some things and challenged by others in a way that remains difficult to predict.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openai"&gt;openai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/chatgpt"&gt;chatgpt&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-pricing"&gt;llm-pricing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-reasoning"&gt;llm-reasoning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/codex"&gt;codex&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gpt"&gt;gpt&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="ai"/><category term="openai"/><category term="generative-ai"/><category term="chatgpt"/><category term="llms"/><category term="llm"/><category term="llm-pricing"/><category term="pelican-riding-a-bicycle"/><category term="llm-reasoning"/><category term="llm-release"/><category term="codex"/><category term="gpt"/></entry><entry><title>Qwen3.6-27B: Flagship-Level Coding in a 27B Dense Model</title><link href="https://simonwillison.net/2026/Apr/22/qwen36-27b/#atom-tag" rel="alternate"/><published>2026-04-22T16:45:23+00:00</published><updated>2026-04-22T16:45:23+00:00</updated><id>https://simonwillison.net/2026/Apr/22/qwen36-27b/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://qwen.ai/blog?id=qwen3.6-27b"&gt;Qwen3.6-27B: Flagship-Level Coding in a 27B Dense Model&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Big claims from Qwen about their latest open weight model:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Qwen3.6-27B delivers flagship-level agentic coding performance, surpassing the previous-generation open-source flagship Qwen3.5-397B-A17B (397B total / 17B active MoE) across all major coding benchmarks.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;On Hugging Face &lt;a href="https://huggingface.co/Qwen/Qwen3.5-397B-A17B/tree/main"&gt;Qwen3.5-397B-A17B&lt;/a&gt; is 807GB, this new &lt;a href="https://huggingface.co/Qwen/Qwen3.6-27B/tree/main"&gt;Qwen3.6-27B&lt;/a&gt; is 55.6GB.&lt;/p&gt;
&lt;p&gt;I tried it out with the 16.8GB Unsloth &lt;a href="https://huggingface.co/unsloth/Qwen3.6-27B-GGUF"&gt;Qwen3.6-27B-GGUF:Q4_K_M&lt;/a&gt; quantized version and &lt;code&gt;llama-server&lt;/code&gt; using this recipe by &lt;a href="https://news.ycombinator.com/item?id=47863217#47865140"&gt;benob on Hacker News&lt;/a&gt;, after first installing &lt;code&gt;llama-server&lt;/code&gt; using &lt;code&gt;brew install llama.cpp&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llama-server \
    -hf unsloth/Qwen3.6-27B-GGUF:Q4_K_M \
    --no-mmproj \
    --fit on \
    -np 1 \
    -c 65536 \
    --cache-ram 4096 -ctxcp 2 \
    --jinja \
    --temp 0.6 \
    --top-p 0.95 \
    --top-k 20 \
    --min-p 0.0 \
    --presence-penalty 0.0 \
    --repeat-penalty 1.0 \
    --reasoning on \
    --chat-template-kwargs '{"preserve_thinking": true}'
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;On first run that saved the ~17GB model to &lt;code&gt;~/.cache/huggingface/hub/models--unsloth--Qwen3.6-27B-GGUF&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Here's &lt;a href="https://gist.github.com/simonw/4d99d730c840df594096366db1d27281"&gt;the transcript&lt;/a&gt; for "Generate an SVG of a pelican riding a bicycle". This is an &lt;em&gt;outstanding&lt;/em&gt; result for a 16.8GB local model:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Bicycle has spokes, a chain and a correctly shaped frame. Handlebars are a bit detached. Pelican has wing on the handlebars, weirdly bent legs that touch the pedals and a good bill. Background details are pleasant - semi-transparent clouds, birds, grass, sun." src="https://static.simonwillison.net/static/2026/Qwen3.6-27B-GGUF-Q4_K_M.png" /&gt;&lt;/p&gt;
&lt;p&gt;Performance numbers reported by &lt;code&gt;llama-server&lt;/code&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Reading: 20 tokens, 0.4s, 54.32 tokens/s&lt;/li&gt;
&lt;li&gt;Generation: 4,444 tokens, 2min 53s, 25.57 tokens/s&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For good measure, here's &lt;a href="https://gist.github.com/simonw/95735fe5e76e6fdf1753e6dcce360699"&gt;Generate an SVG of a NORTH VIRGINIA OPOSSUM ON AN E-SCOOTER&lt;/a&gt; (run previously &lt;a href="https://simonwillison.net/2026/Apr/7/glm-51/"&gt;with GLM-5.1&lt;/a&gt;):&lt;/p&gt;
&lt;p&gt;&lt;img alt="Digital illustration in a neon Tron-inspired style of a grey cat-like creature wearing cyan visor goggles riding a glowing cyan futuristic motorcycle through a dark cityscape at night, with its long tail trailing behind, silhouetted buildings with yellow-lit windows in the background, and a glowing magenta moon on the right." src="https://static.simonwillison.net/static/2026/qwen3.6-27b-possum.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;That one took 6,575 tokens, 4min 25s, 24.74 t/s.

    &lt;p&gt;&lt;small&gt;&lt;/small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=47863217"&gt;Hacker News&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/qwen"&gt;qwen&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llama-cpp"&gt;llama-cpp&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-in-china"&gt;ai-in-china&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="generative-ai"/><category term="local-llms"/><category term="llms"/><category term="qwen"/><category term="pelican-riding-a-bicycle"/><category term="llama-cpp"/><category term="llm-release"/><category term="ai-in-china"/></entry><entry><title>Where's the raccoon with the ham radio? (ChatGPT Images 2.0)</title><link href="https://simonwillison.net/2026/Apr/21/gpt-image-2/#atom-tag" rel="alternate"/><published>2026-04-21T20:32:24+00:00</published><updated>2026-04-21T20:32:24+00:00</updated><id>https://simonwillison.net/2026/Apr/21/gpt-image-2/#atom-tag</id><summary type="html">
    &lt;p&gt;OpenAI &lt;a href="https://openai.com/index/introducing-chatgpt-images-2-0/"&gt;released ChatGPT Images 2.0 today&lt;/a&gt;, their latest image generation model. On &lt;a href="https://www.youtube.com/watch?v=sWkGomJ3TLI"&gt;the livestream&lt;/a&gt; Sam Altman said that the leap from gpt-image-1 to gpt-image-2 was equivalent to jumping from GPT-3 to GPT-5. Here's how I put it to the test.&lt;/p&gt;
&lt;p&gt;My prompt:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;Do a where's Waldo style image but it's where is the raccoon holding a ham radio&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h4 id="gpt-image-1"&gt;gpt-image-1&lt;/h4&gt;
&lt;p&gt;First as a baseline here's what I got from the older gpt-image-1 using ChatGPT directly:&lt;/p&gt;
&lt;p&gt;&lt;a href="https://static.simonwillison.net/static/2026/chatgpt-image-1-ham-radio.png"&gt;&lt;img loading="lazy" src="https://static.simonwillison.net/static/2026/image_crop_1402x1122_w1402_q0.3.jpg" alt="There's a lot going on, but I couldn't find a raccoon." style="max-width: 100%;" /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;I wasn't able to spot the raccoon - I quickly realized that testing image generation models on Where's Waldo style images (Where's Wally in the UK) can be pretty frustrating!&lt;/p&gt;
&lt;p&gt;I tried &lt;a href="https://claude.ai/share/bd6e9b88-29a9-420b-8ac1-3ac5cebac215"&gt;getting Claude Opus 4.7&lt;/a&gt; with its new higher resolution inputs to solve it but it was convinced there was a raccoon it couldn't find thanks to the instruction card at the top left of the image:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Yes — there's at least one raccoon in the picture, but it's very well hidden&lt;/strong&gt;. In my careful sweep through zoomed-in sections, honestly, I couldn't definitively spot a raccoon holding a ham radio. [...]&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h4 id="nano-banana-2-and-pro"&gt;Nano Banana 2 and Pro&lt;/h4&gt;
&lt;p&gt;Next I tried Google's Nano Banana 2, &lt;a href="https://gemini.google.com/share/3775db96c576"&gt;via Gemini&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;a href="https://static.simonwillison.net/static/2026/nano-banana-2-ham-radio.jpg"&gt;&lt;img loading="lazy" src="https://static.simonwillison.net/static/2026/gemini-ham-radio-small.jpg" alt="Busy Where's Waldo-style illustration of a park festival with crowds of people, tents labeled &amp;quot;FOOD &amp;amp; DRINK&amp;quot;, &amp;quot;CRAFT FAIR&amp;quot;, &amp;quot;BOOK NOOK&amp;quot;, &amp;quot;MUSIC FEST&amp;quot;, and &amp;quot;AMATEUR RADIO CLUB - W6HAM&amp;quot; (featuring a raccoon in a red hat at the radio table), plus a Ferris wheel, carousel, gazebo with band, pond with boats, fountain, food trucks, and striped circus tents" style="max-width: 100%;" /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;That one was pretty obvious, the raccoon is in the "Amateur Radio Club" booth in the center of the image!&lt;/p&gt;
&lt;p&gt;Claude said:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Honestly, this one wasn't really hiding — he's the star of the booth. Feels like the illustrator took pity on us after that last impossible scene. The little "W6HAM" callsign pun on the booth sign is a nice touch too.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I also tried Nano Banana Pro &lt;a href="https://aistudio.google.com/app/prompts?state=%7B%22ids%22:%5B%221sGU5A7mrngkfLfSEU84xaV1DhtOTnS--%22%5D,%22action%22:%22open%22,%22userId%22:%22106366615678321494423%22,%22resourceKeys%22:%7B%7D%7D&amp;amp;usp=sharing"&gt;in AI Studio&lt;/a&gt; and got this, by far the worst result from any model. Not sure what went wrong here!&lt;/p&gt;
&lt;p&gt;&lt;a href="https://static.simonwillison.net/static/2026/nano-banana-pro-ham-radio.jpg"&gt;&lt;img loading="lazy" src="https://static.simonwillison.net/static/2026/nano-banana-pro-ham-radio-small.jpg" alt="The raccoon is larger than everyone else, right in the middle of the image with an ugly white border around it." style="max-width: 100%;" /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h4 id="gpt-image-2"&gt;gpt-image-2&lt;/h4&gt;
&lt;p&gt;With the baseline established, let's try out the new model.&lt;/p&gt;
&lt;p&gt;I used an updated version of my &lt;a href="https://github.com/simonw/tools/blob/main/python/openai_image.py"&gt;openai_image.py&lt;/a&gt; script, which is a thin wrapper around the &lt;a href="https://github.com/openai/openai-python"&gt;OpenAI Python&lt;/a&gt; client library. Their client library hasn't yet been updated to include &lt;code&gt;gpt-image-2&lt;/code&gt; but thankfully it doesn't validate the model ID so you can use it anyway.&lt;/p&gt;
&lt;p&gt;Here's how I ran that:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;OPENAI_API_KEY=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;$(&lt;/span&gt;llm keys get openai&lt;span class="pl-pds"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; \
  uv run https://tools.simonwillison.net/python/openai_image.py \
  -m gpt-image-2 \
  &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Do a where's Waldo style image but it's where is the raccoon holding a ham radio&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Here's what I got back. I don't &lt;em&gt;think&lt;/em&gt; there's a raccoon in there - I couldn't spot one, and neither could Claude.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://static.simonwillison.net/static/2026/gpt-image-2-default.png"&gt;&lt;img loading="lazy" src="https://static.simonwillison.net/static/2026/gpt-image-2-default.jpg" alt="Lots of stuff, a ham radio booth, many many people, a lake, but maybe no raccoon?" style="max-width: 100%;" /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;The &lt;a href="https://github.com/openai/openai-cookbook/blob/main/examples/multimodal/image-gen-models-prompting-guide.ipynb"&gt;OpenAI image generation cookbook&lt;/a&gt; has been updated with notes on &lt;code&gt;gpt-image-2&lt;/code&gt;, including the &lt;code&gt;outputQuality&lt;/code&gt; setting and available sizes.&lt;/p&gt;
&lt;p&gt;I tried setting &lt;code&gt;outputQuality&lt;/code&gt; to &lt;code&gt;high&lt;/code&gt; and the dimensions to &lt;code&gt;3840x2160&lt;/code&gt; - I believe that's the maximum - and got this - a 17MB PNG which I converted to a 5MB WEBP:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;OPENAI_API_KEY=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;$(&lt;/span&gt;llm keys get openai&lt;span class="pl-pds"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; \
  uv run &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;https://raw.githubusercontent.com/simonw/tools/refs/heads/main/python/openai_image.py&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt; \
  -m gpt-image-2 &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Do a where's Waldo style image but it's where is the raccoon holding a ham radio&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; \
  --quality high --size 3840x2160&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;a href="https://static.simonwillison.net/static/2026/image-fc93bd-q100.webp"&gt;&lt;img loading="lazy" src="https://static.simonwillison.net/static/2026/image-fc93bd-q100.jpg" alt="Big complex image, lots of detail, good wording, there is indeed a raccoon with a ham radio." style="max-width: 100%;" /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;That's pretty great! There's a raccoon with a ham radio in there (bottom left, quite easy to spot).&lt;/p&gt;
&lt;p&gt;The image used 13,342 output tokens, which are charged at $30/million so a total cost of around &lt;a href="https://www.llm-prices.com/#ot=13342&amp;amp;ic=5&amp;amp;cic=1.25&amp;amp;oc=10&amp;amp;sel=gpt-image-2-image"&gt;40 cents&lt;/a&gt;.&lt;/p&gt;
&lt;h4 id="takeaways"&gt;Takeaways&lt;/h4&gt;
&lt;p&gt;I think this new ChatGPT image generation model takes the crown from Gemini, at least for the moment.&lt;/p&gt;
&lt;p&gt;Where's Waldo style images are an infuriating and somewhat foolish way to test these models, but they do help illustrate how good they are getting at complex illustrations combining both text and details.&lt;/p&gt;
&lt;h4 id="update-asking-models-to-solve-this-is-risky"&gt;Update: asking models to solve this is risky&lt;/h4&gt;
&lt;p&gt;rizaco &lt;a href="https://news.ycombinator.com/item?id=47852835#47853561"&gt;on Hacker News&lt;/a&gt; asked ChatGPT to draw a red circle around the raccoon in one of the images in which I had failed to find one. Here's an animated mix of their result and the original image:&lt;/p&gt;
&lt;p&gt;&lt;img loading="lazy" src="https://static.simonwillison.net/static/2026/ham-radio-cheat.gif" alt="The circle appears around a raccoon with a ham radio who is definitely not there in the original image!" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;Looks like we definitely can't trust these models to usefully solve their own puzzles!&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openai"&gt;openai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/chatgpt"&gt;chatgpt&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/text-to-image"&gt;text-to-image&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/nano-banana"&gt;nano-banana&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="ai"/><category term="openai"/><category term="generative-ai"/><category term="chatgpt"/><category term="llms"/><category term="text-to-image"/><category term="llm-release"/><category term="nano-banana"/></entry><entry><title>llm-openrouter 0.6</title><link href="https://simonwillison.net/2026/Apr/20/llm-openrouter/#atom-tag" rel="alternate"/><published>2026-04-20T18:00:26+00:00</published><updated>2026-04-20T18:00:26+00:00</updated><id>https://simonwillison.net/2026/Apr/20/llm-openrouter/#atom-tag</id><summary type="html">
    
        &lt;p&gt;&lt;strong&gt;Release:&lt;/strong&gt; &lt;a href="https://github.com/simonw/llm-openrouter/releases/tag/0.6"&gt;llm-openrouter 0.6&lt;/a&gt;&lt;/p&gt;
        &lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;llm openrouter refresh&lt;/code&gt; command for refreshing the list of available models without waiting for the cache to expire.&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;I added this feature so I could try &lt;a href="https://www.kimi.com/blog/kimi-k2-6"&gt;Kimi 2.6&lt;/a&gt; on OpenRouter as soon as it &lt;a href="https://openrouter.ai/moonshotai/kimi-k2.6"&gt;became available there&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Here's &lt;a href="https://gisthost.github.io/?ecaad98efe0f747e27bc0e0ebc669e94/pelican.html"&gt;its pelican&lt;/a&gt; - this time as an HTML page because Kimi chose to include an HTML and JavaScript UI to control the animation. &lt;a href="https://gist.github.com/simonw/ecaad98efe0f747e27bc0e0ebc669e94#2026-04-20t164936----conversation-01kpnwt8d2bt5qwkm60j9sbkbs-id-01kpnwra0prz6v822cct5b08kq"&gt;Transcript here&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="The bicycle is about right. The pelican is OK. It is pedaling furiously and flapping its wings a bit. Controls below the animation provide a pause button and sliders for controlling the speed and the wing flap." src="https://static.simonwillison.net/static/2026/kimi-k2-pelican-64-colors.gif" /&gt;&lt;/p&gt;
    
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openrouter"&gt;openrouter&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-in-china"&gt;ai-in-china&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/kimi"&gt;kimi&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="llm"/><category term="pelican-riding-a-bicycle"/><category term="llm-release"/><category term="openrouter"/><category term="ai-in-china"/><category term="kimi"/></entry><entry><title>Qwen3.6-35B-A3B on my laptop drew me a better pelican than Claude Opus 4.7</title><link href="https://simonwillison.net/2026/Apr/16/qwen-beats-opus/#atom-tag" rel="alternate"/><published>2026-04-16T17:16:52+00:00</published><updated>2026-04-16T17:16:52+00:00</updated><id>https://simonwillison.net/2026/Apr/16/qwen-beats-opus/#atom-tag</id><summary type="html">
    &lt;p&gt;For anyone who has been (inadvisably) taking my &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle/"&gt;pelican riding a bicycle benchmark&lt;/a&gt; seriously as a robust way to test models, here are pelicans from this morning's two big model releases - &lt;a href="https://qwen.ai/blog?id=qwen3.6-35b-a3b"&gt;Qwen3.6-35B-A3B from Alibaba&lt;/a&gt; and &lt;a href="https://www.anthropic.com/news/claude-opus-4-7"&gt;Claude Opus 4.7 from Anthropic&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Here's the Qwen 3.6 pelican, generated using &lt;a href="https://huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF/blob/main/Qwen3.6-35B-A3B-UD-Q4_K_S.gguf"&gt;this 20.9GB Qwen3.6-35B-A3B-UD-Q4_K_S.gguf&lt;/a&gt; quantized model by Unsloth, running on my MacBook Pro M5 via &lt;a href="https://lmstudio.ai/"&gt;LM Studio&lt;/a&gt; (and the &lt;a href="https://github.com/agustif/llm-lmstudio"&gt;llm-lmstudio&lt;/a&gt; plugin) - &lt;a href="https://gist.github.com/simonw/4389d355d8e162bc6e4547da214f7dd2"&gt;transcript here&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2026/Qwen3.6-35B-A3B-UD-Q4_K_S-pelican.png" alt="The bicycle frame is the correct shape. There are clouds in the sky. The pelican has a dorky looking pouch. A caption on the ground reads Pelican on a Bicycle!" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;And here's one I got from Anthropic's &lt;a href="https://www.anthropic.com/news/claude-opus-4-7"&gt;brand new Claude Opus 4.7&lt;/a&gt; (&lt;a href="https://gist.github.com/simonw/afcb19addf3f38eb1996e1ebe749c118"&gt;transcript&lt;/a&gt;):&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2026/opus-4.7-pelican.png" alt="The bicycle frame is entirely the wrong shape. No clouds, a yellow sun. The pelican is looking behind itself, and has a less pronounced pouch than I would like." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;I'm giving this one to Qwen 3.6. Opus managed to mess up the bicycle frame!&lt;/p&gt;
&lt;p&gt;I tried Opus a second time passing &lt;code&gt;thinking_level: max&lt;/code&gt;. It didn't do much better (&lt;a href="https://gist.github.com/simonw/7566e04a81accfb9affda83451c0f363"&gt;transcript&lt;/a&gt;):&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2026/opus-4.7-pelican-max.png" alt="The bicycle frame is entirely the wrong shape but in a different way. Lines are more bold. Pelican looks a bit more like a pelican." style="max-width: 100%;" /&gt;&lt;/p&gt;

&lt;h4 id="i-dont-think-qwen-are-cheating"&gt;I don't think Qwen are cheating&lt;/h4&gt;
&lt;p&gt;A lot of people are &lt;a href="https://simonwillison.net/2025/Nov/13/training-for-pelicans-riding-bicycles/"&gt;convinced that the labs train for my stupid benchmark&lt;/a&gt;. I don't think they do, but honestly this result did give me a little glint of suspicion. So I'm burning one of my secret backup tests - here's what I got from Qwen3.6-35B-A3B and Opus 4.7 for "Generate an SVG of a flamingo riding a unicycle":&lt;/p&gt;

&lt;div style="display: flex; gap: 4px;"&gt;
  &lt;figure style="flex: 1; text-align: center; margin: 0;"&gt;
    &lt;figcaption style="margin-bottom: 1em"&gt;Qwen3.6-35B-A3B&lt;br /&gt;(&lt;a href="https://gist.github.com/simonw/f1d1ff01c34dda5fdedf684cfc430d92"&gt;transcript&lt;/a&gt;)&lt;/figcaption&gt;
    &lt;img src="https://static.simonwillison.net/static/2026/qwen-flamingo.png" alt="The unicycle spokes are a too long. The pelican has sunglasses, a bowtie and appears to be smoking a cigarette. It has two heart emoji surrounding the caption Flamingo on a Unicycle. It has a lot of charisma." style="max-width: 100%; height: auto;" /&gt;
  &lt;/figure&gt;
  &lt;figure style="flex: 1; text-align: center; margin: 0;"&gt;
    &lt;figcaption style="margin-bottom: 1em"&gt;Opus 4.7&lt;br /&gt;(&lt;a href="https://gist.github.com/simonw/35121ad5dcf23bf860397a103ae88d50"&gt;transcript&lt;/a&gt;)&lt;/figcaption&gt;
    &lt;img src="https://static.simonwillison.net/static/2026/opus-flamingo.png" alt="The unicycle has a black wheel. The flamingo is a competent if slightly dull vector illustration of a flamingo. It has no flair." style="max-width: 100%; height: auto;" /&gt;
  &lt;/figure&gt;
&lt;/div&gt;


&lt;p&gt;I'm giving this one to Qwen too, partly for the excellent &lt;code&gt;&amp;lt;!-- Sunglasses on flamingo! --&amp;gt;&lt;/code&gt; SVG comment.&lt;/p&gt;

&lt;h4 id="what-can-we-learn-from-this-"&gt;What can we learn from this?&lt;/h4&gt;
&lt;p&gt;The pelican benchmark has always been meant as a joke - it's mainly a statement on how obtuse and absurd the task of comparing these models is.&lt;/p&gt;
&lt;p&gt;The weird thing about that joke is that, for the most part, there has been a direct correlation between the quality of the pelicans produced and the general usefulness of the models. Those &lt;a href="https://simonwillison.net/2024/Oct/25/pelicans-on-a-bicycle/"&gt;first pelicans from October 2024&lt;/a&gt; were junk. The &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle/"&gt;more recent entries&lt;/a&gt; have generally been much, much better - to the point that Gemini 3.1 Pro produces &lt;a href="https://simonwillison.net/2026/Feb/19/gemini-31-pro/"&gt;illustrations you could actually use somewhere&lt;/a&gt;, provided you had a pressing need to illustrate a pelican riding a bicycle.&lt;/p&gt;
&lt;p&gt;Today, even that loose connection to utility has been broken. I have enormous respect for Qwen, but I very much doubt that a 21GB quantized version of their latest model is more powerful or useful than Anthropic's latest proprietary release.&lt;/p&gt;
&lt;p&gt;If the thing you need is an SVG illustration of a pelican riding a bicycle though, right now Qwen3.6-35B-A3B running on a laptop is a better bet than Opus 4.7!&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/anthropic"&gt;anthropic&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude"&gt;claude&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/qwen"&gt;qwen&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/lm-studio"&gt;lm-studio&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="ai"/><category term="generative-ai"/><category term="local-llms"/><category term="llms"/><category term="anthropic"/><category term="claude"/><category term="qwen"/><category term="pelican-riding-a-bicycle"/><category term="llm-release"/><category term="lm-studio"/></entry><entry><title>Gemini 3.1 Flash TTS</title><link href="https://simonwillison.net/2026/Apr/15/gemini-31-flash-tts/#atom-tag" rel="alternate"/><published>2026-04-15T17:13:14+00:00</published><updated>2026-04-15T17:13:14+00:00</updated><id>https://simonwillison.net/2026/Apr/15/gemini-31-flash-tts/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-3-1-flash-tts/"&gt;Gemini 3.1 Flash TTS&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Google released Gemini 3.1 Flash TTS today, a new text-to-speech model that can be directed using prompts.&lt;/p&gt;
&lt;p&gt;It's presented via the standard Gemini API using &lt;code&gt;gemini-3.1-flash-tts-preview&lt;/code&gt; as the model ID, but can only output audio files.&lt;/p&gt;
&lt;p&gt;The &lt;a href="https://ai.google.dev/gemini-api/docs/speech-generation#transcript-tags"&gt;prompting guide&lt;/a&gt; is surprising, to say the least. Here's their example prompt to generate just a few short sentences of audio:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# AUDIO PROFILE: Jaz R.
## "The Morning Hype"

## THE SCENE: The London Studio
It is 10:00 PM in a glass-walled studio overlooking the moonlit London skyline, but inside, it is blindingly bright. The red "ON AIR" tally light is blazing. Jaz is standing up, not sitting, bouncing on the balls of their heels to the rhythm of a thumping backing track. Their hands fly across the faders on a massive mixing desk. It is a chaotic, caffeine-fueled cockpit designed to wake up an entire nation.

### DIRECTOR'S NOTES
Style:
* The "Vocal Smile": You must hear the grin in the audio. The soft palate is always raised to keep the tone bright, sunny, and explicitly inviting.
* Dynamics: High projection without shouting. Punchy consonants and elongated vowels on excitement words (e.g., "Beauuutiful morning").

Pace: Speaks at an energetic pace, keeping up with the fast music.  Speaks with A "bouncing" cadence. High-speed delivery with fluid transitions — no dead air, no gaps.

Accent: Jaz is from Brixton, London

### SAMPLE CONTEXT
Jaz is the industry standard for Top 40 radio, high-octane event promos, or any script that requires a charismatic Estuary accent and 11/10 infectious energy.

#### TRANSCRIPT
[excitedly] Yes, massive vibes in the studio! You are locked in and it is absolutely popping off in London right now. If you're stuck on the tube, or just sat there pretending to work... stop it. Seriously, I see you.
[shouting] Turn this up! We've got the project roadmap landing in three, two... let's go!
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here's what I got using that example prompt:&lt;/p&gt;
&lt;p&gt;&lt;audio controls style="width: 100%"&gt;
  &lt;source src="https://static.simonwillison.net/static/2026/gemini-flash-tts-london.wav" type="audio/wav"&gt;
  Your browser does not support the audio element.
&lt;/audio&gt;&lt;/p&gt;
&lt;p&gt;Then I modified it to say "Jaz is from Newcastle" and "... requires a charismatic Newcastle accent" and got this result:&lt;/p&gt;
&lt;p&gt;&lt;audio controls style="width: 100%"&gt;
  &lt;source src="https://static.simonwillison.net/static/2026/gemini-flash-tts-newcastle.wav" type="audio/wav"&gt;
  Your browser does not support the audio element.
&lt;/audio&gt;&lt;/p&gt;
&lt;p&gt;Here's Exeter, Devon for good measure:&lt;/p&gt;
&lt;p&gt;&lt;audio controls style="width: 100%"&gt;
  &lt;source src="https://static.simonwillison.net/static/2026/gemini-flash-tts-devon.wav" type="audio/wav"&gt;
  Your browser does not support the audio element.
&lt;/audio&gt;&lt;/p&gt;
&lt;p&gt;I &lt;a href="https://gemini.google.com/share/dd0fba5a83c4"&gt;had Gemini 3.1 Pro&lt;/a&gt; vibe code &lt;a href="https://tools.simonwillison.net/gemini-flash-tts"&gt;this UI for trying it out&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Screenshot of a &amp;quot;Gemini 3.1 Flash TTS&amp;quot; web application interface. At the top is an &amp;quot;API Key&amp;quot; field with a masked password. Below is a &amp;quot;TTS Mode&amp;quot; section with a dropdown set to &amp;quot;Multi-Speaker (Conversation)&amp;quot;. &amp;quot;Speaker 1 Name&amp;quot; is set to &amp;quot;Joe&amp;quot; with &amp;quot;Speaker 1 Voice&amp;quot; set to &amp;quot;Puck (Upbeat)&amp;quot;. &amp;quot;Speaker 2 Name&amp;quot; is set to &amp;quot;Jane&amp;quot; with &amp;quot;Speaker 2 Voice&amp;quot; set to &amp;quot;Kore (Firm)&amp;quot;. Under &amp;quot;Script / Prompt&amp;quot; is a tip reading &amp;quot;Tip: Format your text as a script using the Exact Speaker Names defined above.&amp;quot; The script text area contains &amp;quot;TTS the following conversation between Joe and Jane:\n\nJoe: How's it going today Jane?\nJane: [yawn] Not too bad, how about you?&amp;quot; A blue &amp;quot;Generate Audio&amp;quot; button is below. At the bottom is a &amp;quot;Success!&amp;quot; message with an audio player showing 00:00 / 00:06 and a &amp;quot;Download WAV&amp;quot; link." src="https://static.simonwillison.net/static/2026/gemini-flash-tts.jpg" /&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/google"&gt;google&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/text-to-speech"&gt;text-to-speech&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/tools"&gt;tools&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/prompt-engineering"&gt;prompt-engineering&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gemini"&gt;gemini&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/vibe-coding"&gt;vibe-coding&lt;/a&gt;&lt;/p&gt;



</summary><category term="google"/><category term="text-to-speech"/><category term="tools"/><category term="ai"/><category term="prompt-engineering"/><category term="generative-ai"/><category term="llms"/><category term="gemini"/><category term="llm-release"/><category term="vibe-coding"/></entry><entry><title>Meta's new model is Muse Spark, and meta.ai chat has some interesting tools</title><link href="https://simonwillison.net/2026/Apr/8/muse-spark/#atom-tag" rel="alternate"/><published>2026-04-08T23:07:44+00:00</published><updated>2026-04-08T23:07:44+00:00</updated><id>https://simonwillison.net/2026/Apr/8/muse-spark/#atom-tag</id><summary type="html">
    &lt;p&gt;Meta &lt;a href="https://ai.meta.com/blog/introducing-muse-spark-msl/"&gt;announced Muse Spark&lt;/a&gt; today, their first model release since Llama 4 &lt;a href="https://simonwillison.net/2025/Apr/5/llama-4-notes/"&gt;almost exactly a year ago&lt;/a&gt;. It's hosted, not open weights, and the API is currently "a private API preview to select users", but you can try it out today on &lt;a href="https://meta.ai/"&gt;meta.ai&lt;/a&gt; (Facebook or Instagram login required).&lt;/p&gt;
&lt;p&gt;Meta's self-reported benchmarks show it competitive with Opus 4.6, Gemini 3.1 Pro, and GPT 5.4 on selected benchmarks, though notably behind on Terminal-Bench 2.0. Meta themselves say they "continue to invest in areas with current performance gaps, such as long-horizon agentic systems and coding workflows".&lt;/p&gt;
&lt;p&gt;The model is exposed as two different modes on &lt;a href="https://meta.ai/"&gt;meta.ai&lt;/a&gt; - "Instant" and "Thinking". Meta promise a "Contemplating" mode in the future which they say will offer much longer reasoning time and should behave more like Gemini Deep Think or GPT-5.4 Pro.&lt;/p&gt;
&lt;h5 id="a-couple-of-pelicans"&gt;A couple of pelicans&lt;/h5&gt;
&lt;p&gt;I prefer to run &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle/"&gt;my pelican test&lt;/a&gt; via API to avoid being influenced by any invisible system prompts, but since that's not an option I ran it against the chat UI directly.&lt;/p&gt;
&lt;p&gt;Here's the pelican I got for "Instant":&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2026/muse-spark-instant-pelican.jpg" alt="This is a pretty basic pelican. The bicycle is mangled, the pelican itself has a rectangular beak albeit with a hint of pouch curve below it. Not a very good one." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;And this one for "Thinking":&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2026/muse-spark-thinking-pelican.png" alt="Much better. Clearly a pelican. Bicycle is the correct shape. Pelican is wearing a blue cycling helmet (albeit badly rendered). Not a bad job at all." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;Both SVGs were rendered inline by the Meta AI interface. Interestingly, the Instant model &lt;a href="https://gist.github.com/simonw/ea7466204f1001b7d67afcb5d0532f6f"&gt;output an SVG directly&lt;/a&gt; (with code comments) whereas the Thinking model &lt;a href="https://gist.github.com/simonw/bc911a56006ba44b0bf66abf0f872ab2"&gt;wrapped it in a thin HTML shell&lt;/a&gt; with some unused &lt;code&gt;Playables SDK v1.0.0&lt;/code&gt; JavaScript libraries.&lt;/p&gt;
&lt;p&gt;Which got me curious...&lt;/p&gt;
&lt;h5 id="poking-around-with-tools"&gt;Poking around with tools&lt;/h5&gt;
&lt;p&gt;Clearly Meta's chat harness has some tools wired up to it - at the very least it can render SVG and HTML as embedded frames, Claude Artifacts style.&lt;/p&gt;
&lt;p&gt;But what else can it do?&lt;/p&gt;
&lt;p&gt;I asked it:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;what tools do you have access to?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;And then:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;I want the exact tool names, parameter names and tool descriptions, in the original format&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;It spat out detailed descriptions of 16 different tools. You can see &lt;a href="https://gist.github.com/simonw/e1ce0acd70443f93dcd6481e716c4304#response-1"&gt;the full list I got back here&lt;/a&gt; - credit to Meta for not telling their bot to hide these, since it's far less frustrating if I can get them out without having to mess around with jailbreaks.&lt;/p&gt;
&lt;p&gt;Here are highlights derived from that response:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Browse and search&lt;/strong&gt;. &lt;code&gt;browser.search&lt;/code&gt; can run a web search through an undisclosed search engine, &lt;code&gt;browser.open&lt;/code&gt; can load the full page from one of those search results and &lt;code&gt;browser.find&lt;/code&gt; can run pattern matches against the returned page content.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Meta content search&lt;/strong&gt;. &lt;code&gt;meta_1p.content_search&lt;/code&gt; can run "Semantic search across Instagram, Threads, and Facebook posts" - but only for posts the user has access to view which were created since 2025-01-01. This tool has some powerful looking parameters, including &lt;code&gt;author_ids&lt;/code&gt;, &lt;code&gt;key_celebrities&lt;/code&gt;, &lt;code&gt;commented_by_user_ids&lt;/code&gt;, and &lt;code&gt;liked_by_user_ids&lt;/code&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;"Catalog search"&lt;/strong&gt; - &lt;code&gt;meta_1p.meta_catalog_search&lt;/code&gt; can "Search for products in Meta's product catalog", presumably for the "Shopping" option in the Meta AI model selector.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Image generation&lt;/strong&gt;. &lt;code&gt;media.image_gen&lt;/code&gt; generates images from prompts, and "returns a CDN URL and saves the image to the sandbox". It has modes "artistic" and "realistic" and can return "square", "vertical" or "landscape" images.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;container.python_execution&lt;/strong&gt; - yes! It's &lt;a href="https://simonwillison.net/tags/code-interpreter/"&gt;Code Interpreter&lt;/a&gt;, my favourite feature of both ChatGPT and Claude.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Execute Python code in a remote sandbox environment. Python 3.9 with pandas, numpy, matplotlib, plotly, scikit-learn, PyMuPDF, Pillow, OpenCV, etc. Files persist at &lt;code&gt;/mnt/data/&lt;/code&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Python 3.9 &lt;a href="https://devguide.python.org/versions/"&gt;is EOL&lt;/a&gt; these days but the library collection looks useful.&lt;/p&gt;
&lt;p&gt;I prompted "use python code to confirm sqlite version and python version" and got back Python 3.9.25 and SQLite 3.34.1 (from &lt;a href="https://sqlite.org/releaselog/3_34_1.html"&gt;January 2021&lt;/a&gt;).&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;container.create_web_artifact&lt;/strong&gt; - we saw this earlier with the HTML wrapper around the pelican: Meta AI can create HTML+JavaScript files in its container which can then be served up as secure sandboxed iframe interactives. "Set kind to &lt;code&gt;html&lt;/code&gt; for websites/apps or &lt;code&gt;svg&lt;/code&gt; for vector graphics."&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;container.download_meta_1p_media&lt;/strong&gt; is interesting: "Download media from Meta 1P sources into the sandbox. Use post_id for Instagram/Facebook/Threads posts, or &lt;code&gt;catalog_search_citation_id&lt;/code&gt; for catalog product images". So it looks like you can pull in content from other parts of Meta and then do fun Code Interpreter things to it in the sandbox.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;container.file_search&lt;/strong&gt; - "Search uploaded files in this conversation and return relevant excerpts" - I guess for digging through PDFs and similar?&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Tools for editing files in the container&lt;/strong&gt; - &lt;code&gt;container.view&lt;/code&gt;, &lt;code&gt;container.insert&lt;/code&gt; (with &lt;code&gt;new_str&lt;/code&gt; and &lt;code&gt;insert_line&lt;/code&gt;), &lt;code&gt;container.str_replace&lt;/code&gt;. These look similar to Claude's &lt;a href="https://platform.claude.com/docs/en/agents-and-tools/tool-use/text-editor-tool#text-editor-tool-commands"&gt;text editor tool commands&lt;/a&gt; - these are becoming a common pattern across any file-equipped agent harness.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;container.visual_grounding&lt;/strong&gt; - see below, this one is &lt;em&gt;fun&lt;/em&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;subagents.spawn_agent&lt;/strong&gt; - the &lt;a href="https://simonwillison.net/guides/agentic-engineering-patterns/subagents/"&gt;sub-agent as a tool&lt;/a&gt; pattern. "Spawn an independent sub-agent for research, analysis, or delegation. It returns its final text response."&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;third_party.link_third_party_account&lt;/strong&gt; - "Initiate account linking for a third-party service", currently Google Calendar, Outlook Calendar, Gmail, or Outlook.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 id="image-analysis-in-the-container"&gt;Image analysis in the container&lt;/h4&gt;
&lt;p&gt;Let's talk about that &lt;code&gt;container.visual_grounding&lt;/code&gt; one. Here's the description in full:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Description:&lt;/strong&gt; Visual grounding tool that analyzes the image, identifies and labels objects, locates regions, or counts objects.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Parameters:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;object_names&lt;/strong&gt; (array, required)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;image_path&lt;/strong&gt; (string, default "")&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;format_type&lt;/strong&gt; ("bbox"|"point"|"count"|null, default "bbox")&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;title&lt;/strong&gt; (string|null)[string]&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;So it can take an image path, optional object names and return things in format &lt;code&gt;bbox&lt;/code&gt;, &lt;code&gt;point&lt;/code&gt; or &lt;code&gt;count&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;This sounded to me like Meta's &lt;a href="https://github.com/facebookresearch/segment-anything"&gt;Segment Anything&lt;/a&gt;, (&lt;em&gt;I later heard that it's not, this is a native feature of the model itself via a tool call with a custom system prompt&lt;/em&gt;). I decided to take it for a spin.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Prompt&lt;/strong&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;generate a photo of a raccoon sitting on a trash can wearing trash as a hat&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;strong&gt;Response&lt;/strong&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Here you go — king of the alley, in full regalia:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2026/meta-raccoon-trash-hat.jpg" alt="It's a photo of a raccoon wearing a disposable coffee cup and some newspaper and banana peel as a hat. It's pretty good" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;Coffee cup crown, banana peel brim, newspaper feather. Peak raccoon fashion. Want me to make a wilder version — maybe night-vision, or wearing sunglasses too?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;My guess is that image generation in Meta AI is powered by their &lt;a href="https://arxiv.org/abs/2309.15807"&gt;Emu model&lt;/a&gt;, or an updated version of it. This is the same pattern as ChatGPT and Gemini image generation, where images are generated by a tool call (here that's &lt;code&gt;media.image_gen&lt;/code&gt;).&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;image_gen&lt;/code&gt; tool description said that it "saves the image to the sandbox", so my guess was that I'd then be able to run Python code against it. I tried:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;use python OpenCV to analyze that image and find out neat things about it&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Annoyingly it did &lt;em&gt;not&lt;/em&gt; show me the code it ran, but it did create this image:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2026/raccoon_analysis.png" alt="OpenCV image analysis dashboard titled &amp;quot;OpenCV Analysis: Raccoon Trash Hat&amp;quot; with four panels: top-left shows the original 1280x1280 photo of a raccoon wearing a trash-lid hat with a coffee cup and newspaper on top in an alley; top-right shows a Canny edge detection version with white edges on black background labeled &amp;quot;Canny Edges - 4.6% density&amp;quot;; bottom-left shows &amp;quot;Top 5 Dominant Colors&amp;quot; as swatches with values RGB(20, 19, 24): 35.3%, RGB(43, 43, 47): 34.2%, RGB(84, 80, 79): 18.3%, RGB(139, 135, 131): 6.3%, RGB(210, 213, 208): 5.8%; bottom-right shows &amp;quot;Color Histograms&amp;quot; line chart with Red, Green, and Blue channels plotted from 0-250 on x-axis and 0-35000 on y-axis, peaking around value 25-50." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;Pretty neat! We can both generate images and then use them with Python tools.&lt;/p&gt;
&lt;p&gt;Next, the &lt;code&gt;visual_grounding&lt;/code&gt; tool. Knowing the names of the tools makes it easy to refer to them in prompts:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Now try the container.visual_grounding tool&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The response (&lt;a href="https://gist.github.com/simonw/72d64b325d33e5882b9818d825551b44"&gt;full response here&lt;/a&gt;):&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Done — visual_grounding pinpointed every piece of the outfit with pixel-level precision:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2026/meta-grounding.jpg" alt="Visual grounding analysis of a raccoon photo titled &amp;quot;Visual Grounding: Found 8 objects&amp;quot; with colored dot annotations on the image marking each detected object: coffee cup (515,205) in yellow, banana peel (569,261) in light yellow, newspaper (399,279) in blue, right eye (461,396) in magenta, left eye (582,397) in pink, nose (514,472) in cyan, raccoon (553,548) in green, and trash can lid (550,892) in red. A legend at the bottom lists all objects with their normalized 0-1000 coordinates. Footer text reads &amp;quot;Coordinates are 0-1000 normalized. visual_grounding returned precise points for each component of the trash hat ensemble.&amp;quot; Watermark reads &amp;quot;Meta AI&amp;quot;." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;[...]&lt;/p&gt;
&lt;p&gt;Want me to run it in bbox mode next to draw boxes around the fur texture, or count mode to see how many individual whiskers it can detect?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Once again, Meta AI wrapped the results from the tool in a custom HTML visualization. That's clearly &lt;code&gt;point&lt;/code&gt; mode, but how about the others?&lt;/p&gt;
&lt;p&gt;I prompted:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;yes draw bounding boxes, can it do masks too?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Response:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Yes — bounding boxes work great. Masks? Not directly in &lt;code&gt;visual_grounding&lt;/code&gt;; it only returns three formats: point, bbox, and count. No pixel-level segmentation masks.&lt;/p&gt;
&lt;p&gt;Here's the bbox output:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2026/meta-bbox.jpg" alt="Bounding box object detection visualization titled &amp;quot;Bounding Boxes (visual_grounding)&amp;quot; with subtitle &amp;quot;8 objects detected — coordinates are 0-1000 normalized&amp;quot; showing a raccoon photo with colored rectangular bounding boxes around detected objects: coffee cup in yellow [424,115,610,273] 186×158, banana peel in yellow [472,215,667,291] 195×76, newspaper in blue [305,223,548,315] 243×92, raccoon in green [270,240,892,845] 622×605, raccoon's face in purple [305,283,750,537] 445×254, right eye in magenta [442,379,489,413] 47×34, left eye in magenta [565,379,605,416] 40×37, and trash can lid in red [107,800,1000,998] 893×198. A legend at the bottom shows each object's name, coordinates, and pixel dimensions in colored cards. Watermark reads &amp;quot;Meta AI&amp;quot;." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;What you’re seeing:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;raccoon&lt;/strong&gt;: big box from (270,240) to (892,845) — covers 62% of the width, basically the whole animal&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;coffee cup&lt;/strong&gt;: (424,115)-(610,273) — sits highest, 158px tall in the 0-1000 space&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;banana peel&lt;/strong&gt;: (472,215)-(667,291) — overlaps the cup's base, confirming it's draped over the side&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;newspaper&lt;/strong&gt;: (305,223)-(548,315) — left ear coverage&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;trash can lid&lt;/strong&gt;: (107,800)-(1000,998) — spans almost the full image width at the bottom&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;face&lt;/strong&gt;: (305,283)-(750,537) — tight crop around the mask&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;eyes&lt;/strong&gt;: both ~40×37 pixel boxes, perfectly level at y=379&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Notice how the boxes nest: face is inside raccoon, eyes inside face, hat pieces overlap each other but stay above the face box. The tool is doing real object localization, not just guessing centers.&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;&lt;strong&gt;About masks:&lt;/strong&gt; since visual_grounding doesn't output them, I can fake them with OpenCV in the container. Two quick options:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;GrabCut&lt;/strong&gt; — I give it the raccoon bbox, it iteratively segments fur from background and returns a binary mask&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SAM-style color threshold&lt;/strong&gt; — use the bbox plus k-means to isolate the coffee cup or banana peel&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Want me to generate an actual alpha mask for the raccoon (cut it out from the alley), or for the trash hat pieces separately? Takes about 10 seconds and I can give you a transparent PNG.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Here's &lt;a href="https://static.simonwillison.net/static/2026/meta-bbox.html"&gt;the HTML for that bbox visualization&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Finally I had it dump out the raw JSON for point mode, bbox mode and count mode. You can &lt;a href="https://gist.github.com/simonw/26bfdb22d7d73450c13f49abbf0c251a"&gt;see that all here&lt;/a&gt;. The count mode is particularly interesting - here's that in full:&lt;/p&gt;
&lt;div class="highlight highlight-source-json"&gt;&lt;pre&gt;[
  {
    &lt;span class="pl-ent"&gt;"object_name"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;raccoon whisker&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"points"&lt;/span&gt;: [
      {&lt;span class="pl-ent"&gt;"x"&lt;/span&gt;: &lt;span class="pl-c1"&gt;403&lt;/span&gt;, &lt;span class="pl-ent"&gt;"y"&lt;/span&gt;: &lt;span class="pl-c1"&gt;445&lt;/span&gt;},
      {&lt;span class="pl-ent"&gt;"x"&lt;/span&gt;: &lt;span class="pl-c1"&gt;397&lt;/span&gt;, &lt;span class="pl-ent"&gt;"y"&lt;/span&gt;: &lt;span class="pl-c1"&gt;458&lt;/span&gt;},
      {&lt;span class="pl-ent"&gt;"x"&lt;/span&gt;: &lt;span class="pl-c1"&gt;392&lt;/span&gt;, &lt;span class="pl-ent"&gt;"y"&lt;/span&gt;: &lt;span class="pl-c1"&gt;472&lt;/span&gt;},
      {&lt;span class="pl-ent"&gt;"x"&lt;/span&gt;: &lt;span class="pl-c1"&gt;391&lt;/span&gt;, &lt;span class="pl-ent"&gt;"y"&lt;/span&gt;: &lt;span class="pl-c1"&gt;485&lt;/span&gt;},
      {&lt;span class="pl-ent"&gt;"x"&lt;/span&gt;: &lt;span class="pl-c1"&gt;398&lt;/span&gt;, &lt;span class="pl-ent"&gt;"y"&lt;/span&gt;: &lt;span class="pl-c1"&gt;497&lt;/span&gt;},
      {&lt;span class="pl-ent"&gt;"x"&lt;/span&gt;: &lt;span class="pl-c1"&gt;415&lt;/span&gt;, &lt;span class="pl-ent"&gt;"y"&lt;/span&gt;: &lt;span class="pl-c1"&gt;512&lt;/span&gt;},
      {&lt;span class="pl-ent"&gt;"x"&lt;/span&gt;: &lt;span class="pl-c1"&gt;618&lt;/span&gt;, &lt;span class="pl-ent"&gt;"y"&lt;/span&gt;: &lt;span class="pl-c1"&gt;446&lt;/span&gt;},
      {&lt;span class="pl-ent"&gt;"x"&lt;/span&gt;: &lt;span class="pl-c1"&gt;622&lt;/span&gt;, &lt;span class="pl-ent"&gt;"y"&lt;/span&gt;: &lt;span class="pl-c1"&gt;461&lt;/span&gt;},
      {&lt;span class="pl-ent"&gt;"x"&lt;/span&gt;: &lt;span class="pl-c1"&gt;625&lt;/span&gt;, &lt;span class="pl-ent"&gt;"y"&lt;/span&gt;: &lt;span class="pl-c1"&gt;475&lt;/span&gt;},
      {&lt;span class="pl-ent"&gt;"x"&lt;/span&gt;: &lt;span class="pl-c1"&gt;628&lt;/span&gt;, &lt;span class="pl-ent"&gt;"y"&lt;/span&gt;: &lt;span class="pl-c1"&gt;488&lt;/span&gt;},
      {&lt;span class="pl-ent"&gt;"x"&lt;/span&gt;: &lt;span class="pl-c1"&gt;635&lt;/span&gt;, &lt;span class="pl-ent"&gt;"y"&lt;/span&gt;: &lt;span class="pl-c1"&gt;501&lt;/span&gt;},
      {&lt;span class="pl-ent"&gt;"x"&lt;/span&gt;: &lt;span class="pl-c1"&gt;651&lt;/span&gt;, &lt;span class="pl-ent"&gt;"y"&lt;/span&gt;: &lt;span class="pl-c1"&gt;518&lt;/span&gt;}
    ],
    &lt;span class="pl-ent"&gt;"count"&lt;/span&gt;: &lt;span class="pl-c1"&gt;12&lt;/span&gt;
  },
  {
    &lt;span class="pl-ent"&gt;"object_name"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;raccoon paw claw&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"points"&lt;/span&gt;: [
      {&lt;span class="pl-ent"&gt;"x"&lt;/span&gt;: &lt;span class="pl-c1"&gt;406&lt;/span&gt;, &lt;span class="pl-ent"&gt;"y"&lt;/span&gt;: &lt;span class="pl-c1"&gt;845&lt;/span&gt;},
      {&lt;span class="pl-ent"&gt;"x"&lt;/span&gt;: &lt;span class="pl-c1"&gt;424&lt;/span&gt;, &lt;span class="pl-ent"&gt;"y"&lt;/span&gt;: &lt;span class="pl-c1"&gt;847&lt;/span&gt;},
      {&lt;span class="pl-ent"&gt;"x"&lt;/span&gt;: &lt;span class="pl-c1"&gt;441&lt;/span&gt;, &lt;span class="pl-ent"&gt;"y"&lt;/span&gt;: &lt;span class="pl-c1"&gt;842&lt;/span&gt;},
      {&lt;span class="pl-ent"&gt;"x"&lt;/span&gt;: &lt;span class="pl-c1"&gt;473&lt;/span&gt;, &lt;span class="pl-ent"&gt;"y"&lt;/span&gt;: &lt;span class="pl-c1"&gt;833&lt;/span&gt;},
      {&lt;span class="pl-ent"&gt;"x"&lt;/span&gt;: &lt;span class="pl-c1"&gt;596&lt;/span&gt;, &lt;span class="pl-ent"&gt;"y"&lt;/span&gt;: &lt;span class="pl-c1"&gt;830&lt;/span&gt;},
      {&lt;span class="pl-ent"&gt;"x"&lt;/span&gt;: &lt;span class="pl-c1"&gt;635&lt;/span&gt;, &lt;span class="pl-ent"&gt;"y"&lt;/span&gt;: &lt;span class="pl-c1"&gt;844&lt;/span&gt;},
      {&lt;span class="pl-ent"&gt;"x"&lt;/span&gt;: &lt;span class="pl-c1"&gt;660&lt;/span&gt;, &lt;span class="pl-ent"&gt;"y"&lt;/span&gt;: &lt;span class="pl-c1"&gt;843&lt;/span&gt;},
      {&lt;span class="pl-ent"&gt;"x"&lt;/span&gt;: &lt;span class="pl-c1"&gt;679&lt;/span&gt;, &lt;span class="pl-ent"&gt;"y"&lt;/span&gt;: &lt;span class="pl-c1"&gt;838&lt;/span&gt;}
    ],
    &lt;span class="pl-ent"&gt;"count"&lt;/span&gt;: &lt;span class="pl-c1"&gt;8&lt;/span&gt;
  },
  {
    &lt;span class="pl-ent"&gt;"object_name"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;trash item on head&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"points"&lt;/span&gt;: [
      {&lt;span class="pl-ent"&gt;"x"&lt;/span&gt;: &lt;span class="pl-c1"&gt;514&lt;/span&gt;, &lt;span class="pl-ent"&gt;"y"&lt;/span&gt;: &lt;span class="pl-c1"&gt;191&lt;/span&gt;},
      {&lt;span class="pl-ent"&gt;"x"&lt;/span&gt;: &lt;span class="pl-c1"&gt;568&lt;/span&gt;, &lt;span class="pl-ent"&gt;"y"&lt;/span&gt;: &lt;span class="pl-c1"&gt;263&lt;/span&gt;},
      {&lt;span class="pl-ent"&gt;"x"&lt;/span&gt;: &lt;span class="pl-c1"&gt;392&lt;/span&gt;, &lt;span class="pl-ent"&gt;"y"&lt;/span&gt;: &lt;span class="pl-c1"&gt;274&lt;/span&gt;}
    ],
    &lt;span class="pl-ent"&gt;"count"&lt;/span&gt;: &lt;span class="pl-c1"&gt;3&lt;/span&gt;
  },
  {
    &lt;span class="pl-ent"&gt;"object_name"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;eye&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"points"&lt;/span&gt;: [
      {&lt;span class="pl-ent"&gt;"x"&lt;/span&gt;: &lt;span class="pl-c1"&gt;462&lt;/span&gt;, &lt;span class="pl-ent"&gt;"y"&lt;/span&gt;: &lt;span class="pl-c1"&gt;395&lt;/span&gt;},
      {&lt;span class="pl-ent"&gt;"x"&lt;/span&gt;: &lt;span class="pl-c1"&gt;582&lt;/span&gt;, &lt;span class="pl-ent"&gt;"y"&lt;/span&gt;: &lt;span class="pl-c1"&gt;398&lt;/span&gt;}
    ],
    &lt;span class="pl-ent"&gt;"count"&lt;/span&gt;: &lt;span class="pl-c1"&gt;2&lt;/span&gt;
  },
  {
    &lt;span class="pl-ent"&gt;"object_name"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;ear&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"points"&lt;/span&gt;: [
      {&lt;span class="pl-ent"&gt;"x"&lt;/span&gt;: &lt;span class="pl-c1"&gt;359&lt;/span&gt;, &lt;span class="pl-ent"&gt;"y"&lt;/span&gt;: &lt;span class="pl-c1"&gt;313&lt;/span&gt;},
      {&lt;span class="pl-ent"&gt;"x"&lt;/span&gt;: &lt;span class="pl-c1"&gt;680&lt;/span&gt;, &lt;span class="pl-ent"&gt;"y"&lt;/span&gt;: &lt;span class="pl-c1"&gt;294&lt;/span&gt;}
    ],
    &lt;span class="pl-ent"&gt;"count"&lt;/span&gt;: &lt;span class="pl-c1"&gt;2&lt;/span&gt;
  }
]&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;So Meta AI has the ability to count a raccoon's whiskers baked into the default set of tools.&lt;/p&gt;
&lt;p&gt;Which means... it can count pelicans too!&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2026/meta-count-pelicans.jpg" alt="Screenshot of a Meta AI conversation where the user uploaded a photo of pelicans on rocks and asked &amp;quot;Count the pelicans&amp;quot;. response reads: &amp;quot;That rocky haul-out is packed — I count 25 brown pelicans in your photo. I went bird by bird, including the ones tucked behind rocks, heads turned away, or just peeking out at the edges. They're all California brown pelicans, with those pale yellow heads and long bills, clustered on the riprap right at the waterline. To make it easy to verify, I've numbered each one on the image:&amp;quot; followed by an annotated version of the photo with red numbered circles (1–25) placed on each pelican. The annotated image has a downward chevron and an &amp;quot;Open&amp;quot; button at the bottom right." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;Here's that overlay &lt;a href="https://static.simonwillison.net/static/2026/meta-count-pelicans.html"&gt;exported as HTML&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;Update&lt;/strong&gt;: Meta's &lt;a href="https://twitter.com/jacktripleu/status/2042050863800447387"&gt;Jack Wu confirms&lt;/a&gt; that these tools are part of the new harness they launched alongside the new model.&lt;/em&gt;&lt;/p&gt;

&lt;h4 id="maybe-open-weights-in-the-future-"&gt;Maybe open weights in the future?&lt;/h4&gt;
&lt;p&gt;On Twitter &lt;a href="https://twitter.com/alexandr_wang/status/2041909388852748717"&gt;Alexandr Wang said&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;this is step one. bigger models are already in development with infrastructure scaling to match. private api preview open to select partners today, with plans to open-source future versions.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I really hope they do go back to open-sourcing their models. Llama 3.1/3.2/3.3 were excellent laptop-scale model families, and the introductory blog post for Muse Spark had this to say about efficiency:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;[...] we can reach the same capabilities with over an order of magnitude less compute than our previous model, Llama 4 Maverick. This improvement also makes Muse Spark significantly more efficient than the leading base models available for comparison.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;So are Meta back in the frontier model game? &lt;a href="https://twitter.com/ArtificialAnlys/status/2041913043379220801"&gt;Artificial Analysis&lt;/a&gt; think so - they scored Meta Spark at 52, "behind only Gemini 3.1 Pro, GPT-5.4, and Claude Opus 4.6". Last year's Llama 4 Maverick and Scout scored 18 and 13 respectively.&lt;/p&gt;
&lt;p&gt;I'm waiting for API access - while the tool collection on &lt;a href="https://meta.ai/"&gt;meta.ai&lt;/a&gt; is quite strong the real test of a model like this is still what we can build on top of it.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/facebook"&gt;facebook&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/code-interpreter"&gt;code-interpreter&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-tool-use"&gt;llm-tool-use&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/meta"&gt;meta&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-reasoning"&gt;llm-reasoning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="facebook"/><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="code-interpreter"/><category term="llm-tool-use"/><category term="meta"/><category term="pelican-riding-a-bicycle"/><category term="llm-reasoning"/><category term="llm-release"/></entry><entry><title>GLM-5.1: Towards Long-Horizon Tasks</title><link href="https://simonwillison.net/2026/Apr/7/glm-51/#atom-tag" rel="alternate"/><published>2026-04-07T21:25:14+00:00</published><updated>2026-04-07T21:25:14+00:00</updated><id>https://simonwillison.net/2026/Apr/7/glm-51/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://z.ai/blog/glm-5.1"&gt;GLM-5.1: Towards Long-Horizon Tasks&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Chinese AI lab Z.ai's latest model is a giant 754B parameter 1.51TB (on &lt;a href="https://huggingface.co/zai-org/GLM-5.1"&gt;Hugging Face&lt;/a&gt;) MIT-licensed monster - the same size as their previous GLM-5 release, and sharing the &lt;a href="https://huggingface.co/papers/2602.15763"&gt;same paper&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;It's available &lt;a href="https://openrouter.ai/z-ai/glm-5.1"&gt;via OpenRouter&lt;/a&gt; so I asked it to draw me a pelican:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm install llm-openrouter
llm -m openrouter/z-ai/glm-5.1 'Generate an SVG of a pelican on a bicycle'
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;And something new happened... unprompted, the model &lt;a href="https://gist.github.com/simonw/af7170f54256cc007ef28a8721564be8"&gt;decided to give me&lt;/a&gt; an HTML page that included both the SVG and a separate set of CSS animations!&lt;/p&gt;
&lt;p&gt;The SVG was excellent, and might be my new favorite from an open weights model:&lt;/p&gt;
&lt;p&gt;&lt;img alt="The bicycle is red and has a frame the correct shape and wheels with spokes. The pelican is a perky little fella." src="https://static.simonwillison.net/static/2026/glm-5.1-pelican.png" /&gt;&lt;/p&gt;
&lt;p&gt;But the animation &lt;a href="https://gisthost.github.io/?73bb6808b18c2482f66e5f082c75f36e"&gt;broke it&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Animation - the wheels and pedals rotate, the clouds move... and the pelican has vanished, but there is a little blob bobbing up and down in the top left corner." src="https://static.simonwillison.net/static/2026/glm-5.1-broken-light-lossy.gif" /&gt;&lt;/p&gt;
&lt;p&gt;That's the pelican, floating up in the top left corner.&lt;/p&gt;
&lt;p&gt;I usually don't do follow-up prompts for the pelican test, but in this case I made an exception:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm -c 'the animation is a bit broken, the pelican ends up positioned off the screen at the top right'
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;GLM 5.1 replied:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The issue is that CSS &lt;code&gt;transform&lt;/code&gt; animations on SVG elements override the SVG &lt;code&gt;transform&lt;/code&gt; attribute used for positioning, causing the pelican to lose its placement and fly off to the top-right. The fix is to separate positioning (SVG attribute) from animation (inner group) and use &lt;code&gt;&amp;lt;animateTransform&amp;gt;&lt;/code&gt; for SVG rotations since it handles coordinate systems correctly.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;And spat out &lt;a href="https://static.simonwillison.net/static/2026/glm-5.1-pelican-fixed.html"&gt;fresh HTML&lt;/a&gt; which fixed the problem!&lt;/p&gt;
&lt;p&gt;&lt;img alt="Now everything is right - the bicycle rotates correctly, the pelican sits on it and bobs up and down, and its lower beak moves slightly as well." src="https://static.simonwillison.net/static/2026/glm-5.1-pelican-fixed-medium-lossy.gif" /&gt;&lt;/p&gt;
&lt;p&gt;I particularly like the animation of the beak, which is described in the SVG comments like so:&lt;/p&gt;
&lt;div class="highlight highlight-text-xml-svg"&gt;&lt;pre&gt;&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;&amp;lt;!--&lt;/span&gt; Pouch (lower beak) with wobble &lt;span class="pl-c"&gt;--&amp;gt;&lt;/span&gt;&lt;/span&gt;
&amp;lt;&lt;span class="pl-ent"&gt;g&lt;/span&gt;&amp;gt;
    &amp;lt;&lt;span class="pl-ent"&gt;path&lt;/span&gt; &lt;span class="pl-e"&gt;d&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;M42,-58 Q43,-50 48,-42 Q55,-35 62,-38 Q70,-42 75,-60 L42,-58 Z&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;fill&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;url(#pouchGrad)&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;stroke&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;#b06008&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;stroke-width&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;1&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;opacity&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;0.9&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;/&amp;gt;
    &amp;lt;&lt;span class="pl-ent"&gt;path&lt;/span&gt; &lt;span class="pl-e"&gt;d&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;M48,-50 Q55,-46 60,-52&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;fill&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;none&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;stroke&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;#c06a08&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;stroke-width&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;0.8&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;opacity&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;0.6&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;/&amp;gt;
    &amp;lt;&lt;span class="pl-ent"&gt;animateTransform&lt;/span&gt; &lt;span class="pl-e"&gt;attributeName&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;transform&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;type&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;scale&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;
    &lt;span class="pl-e"&gt;values&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;1,1; 1.03,0.97; 1,1&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;dur&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;0.75s&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;repeatCount&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;indefinite&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;
    &lt;span class="pl-e"&gt;additive&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;sum&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;/&amp;gt;
&amp;lt;/&lt;span class="pl-ent"&gt;g&lt;/span&gt;&amp;gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong id="opossum"&gt;Update&lt;/strong&gt;: On Bluesky &lt;a href="https://bsky.app/profile/charles.capps.me/post/3miwrn42mjc2t"&gt;@charles.capps.me suggested&lt;/a&gt; a "NORTH VIRGINIA OPOSSUM ON AN E-SCOOTER" and...&lt;/p&gt;
&lt;p&gt;&lt;img alt="This is so great. It's dark, the possum is clearly a possum, it's riding an escooter, lovely animation, tail bobbing up and down, caption says NORTH VIRGINIA OPOSSUM, CRUISING THE COMMONWEALTH SINCE DUSK - only glitch is that it occasionally blinks and the eyes fall off the face" src="https://static.simonwillison.net/static/2026/glm-possum-escooter.gif.gif" /&gt;&lt;/p&gt;
&lt;p&gt;The HTML+SVG comments on that one include &lt;code&gt;/* Earring sparkle */, &amp;lt;!-- Opossum fur gradient --&amp;gt;, &amp;lt;!-- Distant treeline silhouette - Virginia pines --&amp;gt;,  &amp;lt;!-- Front paw on handlebar --&amp;gt;&lt;/code&gt; - here's &lt;a href="https://gist.github.com/simonw/1864b89f5304eba03c3ded4697e156c4"&gt;the transcript&lt;/a&gt; and the &lt;a href="https://static.simonwillison.net/static/2026/glm-possum-escooter.html"&gt;HTML result&lt;/a&gt;.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/css"&gt;css&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/svg"&gt;svg&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-in-china"&gt;ai-in-china&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/glm"&gt;glm&lt;/a&gt;&lt;/p&gt;



</summary><category term="css"/><category term="svg"/><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="pelican-riding-a-bicycle"/><category term="llm-release"/><category term="ai-in-china"/><category term="glm"/></entry><entry><title>Anthropic's Project Glasswing - restricting Claude Mythos to security researchers - sounds necessary to me</title><link href="https://simonwillison.net/2026/Apr/7/project-glasswing/#atom-tag" rel="alternate"/><published>2026-04-07T20:52:54+00:00</published><updated>2026-04-07T20:52:54+00:00</updated><id>https://simonwillison.net/2026/Apr/7/project-glasswing/#atom-tag</id><summary type="html">
    &lt;p&gt;Anthropic &lt;em&gt;didn't&lt;/em&gt; release their latest model, Claude Mythos (&lt;a href="https://www-cdn.anthropic.com/53566bf5440a10affd749724787c8913a2ae0841.pdf"&gt;system card PDF&lt;/a&gt;), today. They have instead made it available to a very restricted set of preview partners under their newly announced &lt;a href="https://www.anthropic.com/glasswing"&gt;Project Glasswing&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The model is a general purpose model, similar to Claude Opus 4.6, but Anthropic claim that its cyber-security research abilities are strong enough that they need to give the software industry as a whole time to prepare.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Mythos Preview has already found thousands of high-severity vulnerabilities, including some in &lt;em&gt;every major operating system and web browser&lt;/em&gt;. Given the rate of AI progress, it will not be long before such capabilities proliferate, potentially beyond actors who are committed to deploying them safely.&lt;/p&gt;
&lt;p&gt;[...]&lt;/p&gt;
&lt;p&gt;Project Glasswing partners will receive access to Claude Mythos Preview to find and fix vulnerabilities or weaknesses in their foundational systems—systems that represent a very large portion of the world’s shared cyberattack surface. We anticipate this work will focus on tasks like local vulnerability detection, black box testing of binaries, securing endpoints, and penetration testing of systems.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;There's a great deal more technical detail in &lt;a href="https://red.anthropic.com/2026/mythos-preview/"&gt; Assessing Claude Mythos Preview’s cybersecurity capabilities&lt;/a&gt; on the Anthropic Red Team blog:&lt;/p&gt;

&lt;blockquote&gt;&lt;p&gt;In one case, Mythos Preview wrote a web browser exploit that chained together four vulnerabilities, writing a complex &lt;a href="https://en.wikipedia.org/wiki/JIT_spraying "&gt;JIT heap spray&lt;/a&gt; that escaped both renderer and OS sandboxes. It autonomously obtained local privilege escalation exploits on Linux and other operating systems by exploiting subtle race conditions and KASLR-bypasses. And it autonomously wrote a remote code execution exploit on FreeBSD's NFS server that granted full root access to unauthenticated users by splitting a 20-gadget ROP chain over multiple packets.&lt;/p&gt;&lt;/blockquote&gt;
&lt;p&gt;Plus this comparison with Claude 4.6 Opus:&lt;/p&gt;
&lt;blockquote&gt;&lt;p&gt;Our internal evaluations showed that Opus 4.6 generally had a near-0% success rate at autonomous exploit development. But Mythos Preview is in a different league. For example, Opus 4.6 turned the vulnerabilities it had found in Mozilla’s Firefox 147 JavaScript engine—all patched in Firefox 148—into JavaScript shell exploits only two times out of several hundred attempts. We re-ran this experiment as a benchmark for Mythos Preview, which developed working exploits 181 times, and achieved register control on 29 more.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Saying "our model is too dangerous to release" is a great way to build buzz around a new model, but in this case I expect their caution is warranted.&lt;/p&gt;
&lt;p&gt;Just a few days (&lt;a href="https://simonwillison.net/2026/Apr/3/"&gt;last Friday&lt;/a&gt;) ago I started a new &lt;a href="https://simonwillison.net/tags/ai-security-research/"&gt;ai-security-research&lt;/a&gt; tag on this blog to acknowledge an uptick in credible security professionals pulling the alarm on how good modern LLMs have got at vulnerability research.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://www.theregister.com/2026/03/26/greg_kroahhartman_ai_kernel/"&gt;Greg Kroah-Hartman&lt;/a&gt; of the Linux kernel:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Months ago, we were getting what we called 'AI slop,' AI-generated security reports that were obviously wrong or low quality. It was kind of funny. It didn't really worry us.&lt;/p&gt;
&lt;p&gt;Something happened a month ago, and the world switched. Now we have real reports. All open source projects have real reports that are made with AI, but they're good, and they're real.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;a href="https://mastodon.social/@bagder/116336957584445742"&gt;Daniel Stenberg&lt;/a&gt; of &lt;code&gt;curl&lt;/code&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The challenge with AI in open source security has transitioned from an AI slop tsunami into more of a ... plain security report tsunami. Less slop but lots of reports. Many of them really good.&lt;/p&gt;
&lt;p&gt;I'm spending hours per day on this now. It's intense.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;And Thomas Ptacek published &lt;a href="https://sockpuppet.org/blog/2026/03/30/vulnerability-research-is-cooked/"&gt;Vulnerability Research Is Cooked&lt;/a&gt;, a post inspired by his &lt;a href="https://securitycryptographywhatever.com/2026/03/25/ai-bug-finding/"&gt;podcast conversation&lt;/a&gt; with Anthropic's Nicholas Carlini.&lt;/p&gt;
&lt;p&gt;Anthropic have a 5 minute &lt;a href="https://www.youtube.com/watch?v=INGOC6-LLv0"&gt;talking heads video&lt;/a&gt; describing the Glasswing project. Nicholas Carlini appears as one of those talking heads, where he said (highlights mine):&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;It has the ability to chain together vulnerabilities. So what this means is you find two vulnerabilities, either of which doesn't really get you very much independently. But this model is able to create exploits out of three, four, or sometimes five vulnerabilities that in sequence give you some kind of very sophisticated end outcome. [...]&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;I've found more bugs in the last couple of weeks than I found in the rest of my life combined&lt;/strong&gt;. We've used the model to scan a bunch of open source code, and the thing that we went for first was operating systems, because this is the code that underlies the entire internet infrastructure. &lt;strong&gt;For OpenBSD, we found a bug that's been present for 27 years, where I can send a couple of pieces of data to any OpenBSD server and crash it&lt;/strong&gt;. On Linux, we found a number of vulnerabilities where as a user with no permissions, I can elevate myself to the administrator by just running some binary on my machine. For each of these bugs, we told the maintainers who actually run the software about them, and they went and fixed them and have deployed the patches  patches so that anyone who runs the software is no longer vulnerable to these attacks.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I found this on the &lt;a href="https://www.openbsd.org/errata78.html"&gt;OpenBSD 7.8 errata page&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;025: RELIABILITY FIX: March 25, 2026&lt;/strong&gt;  &lt;em&gt;All architectures&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;TCP packets with invalid SACK options could crash the kernel.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://ftp.openbsd.org/pub/OpenBSD/patches/7.8/common/025_sack.patch.sig"&gt;A source code patch exists which remedies this problem.&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I tracked that change down in the &lt;a href="https://github.com/openbsd/src"&gt;GitHub mirror&lt;/a&gt; of the OpenBSD CVS repo (apparently they still use CVS!) and found it &lt;a href="https://github.com/openbsd/src/blame/master/sys/netinet/tcp_input.c#L2461"&gt;using git blame&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2026/openbsd-27-years.jpg" alt="Screenshot of a Git blame view of C source code around line 2455 showing TCP SACK hole validation logic. Code includes checks using SEQ_GT, SEQ_LT macros on fields like th-&amp;gt;th_ack, tp-&amp;gt;snd_una, sack.start, sack.end, tp-&amp;gt;snd_max, and tp-&amp;gt;snd_holes. Most commits are from 25–27 years ago with messages like &amp;quot;more SACK hole validity testin...&amp;quot; and &amp;quot;knf&amp;quot;, while one recent commit from 3 weeks ago (&amp;quot;Ignore TCP SACK packets wit...&amp;quot;) is highlighted with an orange left border, adding a new guard &amp;quot;if (SEQ_LT(sack.start, tp-&amp;gt;snd_una)) continue;&amp;quot;" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;Sure enough, the surrounding code is from 27 years ago.&lt;/p&gt;
&lt;p&gt;I'm not sure which Linux vulnerability Nicholas was describing, but it may have been &lt;a href="https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=5133b61aaf437e5f25b1b396b14242a6bb0508e2"&gt;this NFS one&lt;/a&gt; recently covered &lt;a href="https://mtlynch.io/claude-code-found-linux-vulnerability/"&gt;by Michael Lynch
&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;There's enough smoke here that I believe there's a fire. It's not surprising to find vulnerabilities in decades-old software, especially given that they're mostly written in C, but what's new is that coding agents run by the latest frontier LLMs are proving tirelessly capable at digging up these issues.&lt;/p&gt;
&lt;p&gt;I actually thought to myself on Friday that this sounded like an industry-wide reckoning in the making, and that it might warrant a huge investment of time and money to get ahead of the inevitable barrage of vulnerabilities. Project Glasswing incorporates "$100M in usage credits ... as well as $4M in direct donations to open-source security organizations". Partners include AWS, Apple, Microsoft, Google, and the Linux Foundation. It would be great to see OpenAI involved as well - GPT-5.4 already has a strong reputation for finding security vulnerabilities and they have stronger models on the near horizon.&lt;/p&gt;
&lt;p&gt;The bad news for those of us who are &lt;em&gt;not&lt;/em&gt; trusted partners is this:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;We do not plan to make Claude Mythos Preview generally available, but our eventual goal is to enable our users to safely deploy Mythos-class models at scale—for cybersecurity purposes, but also for the myriad other benefits that such highly capable models will bring. To do so, we need to make progress in developing cybersecurity (and other) safeguards that detect and block the model’s most dangerous outputs. We plan to launch new safeguards with an upcoming Claude Opus model, allowing us to improve and refine them with a model that does not pose the same level of risk as Mythos Preview.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I can live with that. I think the security risks really are credible here, and having extra time for trusted teams to get ahead of them is a reasonable trade-off.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/security"&gt;security&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/thomas-ptacek"&gt;thomas-ptacek&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/anthropic"&gt;anthropic&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/nicholas-carlini"&gt;nicholas-carlini&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-ethics"&gt;ai-ethics&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-security-research"&gt;ai-security-research&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude-mythos"&gt;claude-mythos&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="security"/><category term="thomas-ptacek"/><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="anthropic"/><category term="nicholas-carlini"/><category term="ai-ethics"/><category term="llm-release"/><category term="ai-security-research"/><category term="claude-mythos"/></entry><entry><title>Gemma 4: Byte for byte, the most capable open models</title><link href="https://simonwillison.net/2026/Apr/2/gemma-4/#atom-tag" rel="alternate"/><published>2026-04-02T18:28:54+00:00</published><updated>2026-04-02T18:28:54+00:00</updated><id>https://simonwillison.net/2026/Apr/2/gemma-4/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://blog.google/innovation-and-ai/technology/developers-tools/gemma-4/"&gt;Gemma 4: Byte for byte, the most capable open models&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Four new vision-capable Apache 2.0 licensed reasoning LLMs from Google DeepMind, sized at 2B, 4B, 31B, plus a 26B-A4B Mixture-of-Experts.&lt;/p&gt;
&lt;p&gt;Google emphasize "unprecedented level of intelligence-per-parameter", providing yet more evidence that creating small useful models is one of the hottest areas of research right now.&lt;/p&gt;
&lt;p&gt;They actually label the two smaller models as E2B and E4B for "Effective" parameter size. The system card explains:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The smaller models incorporate Per-Layer Embeddings (PLE) to maximize parameter efficiency in on-device deployments. Rather than adding more layers or parameters to the model, PLE gives each decoder layer its own small embedding for every token. These embedding tables are large but are only used for quick lookups, which is why the effective parameter count is much smaller than the total.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I don't entirely understand that, but apparently that's what the "E" in E2B means!&lt;/p&gt;
&lt;p&gt;One particularly exciting feature of these models is that they are multi-modal beyond just images:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Vision and audio&lt;/strong&gt;: All models natively process video and images, supporting variable resolutions, and excelling at visual tasks like OCR and chart understanding. Additionally, the E2B and E4B models feature native audio input for speech recognition and understanding.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I've not figured out a way to run audio input locally - I don't think that feature is in LM Studio or Ollama yet.&lt;/p&gt;
&lt;p&gt;I tried them out using the GGUFs for &lt;a href="https://lmstudio.ai/models/gemma-4"&gt;LM Studio&lt;/a&gt;. The 2B (4.41GB), 4B (6.33GB) and 26B-A4B (17.99GB) models all worked perfectly, but the 31B (19.89GB) model was broken and spat out &lt;code&gt;"---\n"&lt;/code&gt; in a loop for every prompt I tried.&lt;/p&gt;
&lt;p&gt;The succession of &lt;a href="https://gist.github.com/simonw/12ae4711288637a722fd6bd4b4b56bdb"&gt;pelican quality&lt;/a&gt; from 2B to 4B to 26B-A4B is notable:&lt;/p&gt;
&lt;p&gt;E2B:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Two blue circles on a brown rectangle and a weird mess of orange blob and yellow triangle for the pelican" src="https://static.simonwillison.net/static/2026/gemma-4-2b-pelican.png" /&gt;&lt;/p&gt;
&lt;p&gt;E4B:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Two black wheels joined by a sort of grey surfboard, the pelican is semicircles and a blue blob floating above it" src="https://static.simonwillison.net/static/2026/gemma-4-4b-pelican.png" /&gt;&lt;/p&gt;
&lt;p&gt;26B-A4B:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Bicycle has the right pieces although the frame is wonky. Pelican is genuinely good, has a big triangle beak and a nice curved neck and is clearly a bird that is sitting on the bicycle" src="https://static.simonwillison.net/static/2026/gemma-4-26b-pelican.png" /&gt;&lt;/p&gt;
&lt;p&gt;(This one actually had an SVG error - "error on line 18 at column 88: Attribute x1 redefined" - but after &lt;a href="https://gist.github.com/simonw/12ae4711288637a722fd6bd4b4b56bdb?permalink_comment_id=6074105#gistcomment-6074105"&gt;fixing that&lt;/a&gt; I got probably the best pelican I've seen yet from a model that runs on my laptop.)&lt;/p&gt;
&lt;p&gt;Google are providing API access to the two larger Gemma models via their &lt;a href="https://aistudio.google.com/prompts/new_chat?model=gemma-4-31b-it"&gt;AI Studio&lt;/a&gt;. I added support to &lt;a href="https://github.com/simonw/llm-gemini"&gt;llm-gemini&lt;/a&gt; and then &lt;a href="https://gist.github.com/simonw/f9f9e9c34c7cc0ef5325a2876413e51e"&gt;ran a pelican&lt;/a&gt; through the 31B model using that:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm -m gemini/gemma-4-31b-it 'Generate an SVG of a pelican riding a bicycle'
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Pretty good, though it is missing the front part of the bicycle frame:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Motion blur lines, a mostly great bicycle albeit missing the front part of the frame. Pelican is decent. " src="https://static.simonwillison.net/static/2026/gemma-4-31b-pelican.png" /&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/google"&gt;google&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/vision-llms"&gt;vision-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-reasoning"&gt;llm-reasoning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gemma"&gt;gemma&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/lm-studio"&gt;lm-studio&lt;/a&gt;&lt;/p&gt;



</summary><category term="google"/><category term="ai"/><category term="generative-ai"/><category term="local-llms"/><category term="llms"/><category term="llm"/><category term="vision-llms"/><category term="pelican-riding-a-bicycle"/><category term="llm-reasoning"/><category term="gemma"/><category term="llm-release"/><category term="lm-studio"/></entry><entry><title>GPT-5.4 mini and GPT-5.4 nano, which can describe 76,000 photos for $52</title><link href="https://simonwillison.net/2026/Mar/17/mini-and-nano/#atom-tag" rel="alternate"/><published>2026-03-17T19:39:17+00:00</published><updated>2026-03-17T19:39:17+00:00</updated><id>https://simonwillison.net/2026/Mar/17/mini-and-nano/#atom-tag</id><summary type="html">
    &lt;p&gt;OpenAI today: &lt;a href="https://openai.com/index/introducing-gpt-5-4-mini-and-nano/"&gt;Introducing GPT‑5.4 mini and nano&lt;/a&gt;. These models join GPT-5.4 which was released &lt;a href="https://openai.com/index/introducing-gpt-5-4/"&gt;two weeks ago&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;OpenAI's self-reported benchmarks show the new 5.4-nano out-performing their previous GPT-5 mini model when run at maximum reasoning effort. The new mini is also 2x faster than the previous mini.&lt;/p&gt;
&lt;p&gt;Here's how the pricing looks - all prices are per million tokens. &lt;code&gt;gpt-5.4-nano&lt;/code&gt; is notably even cheaper than Google's Gemini 3.1 Flash-Lite:&lt;/p&gt;
&lt;center&gt;&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;Model&lt;/th&gt;
      &lt;th&gt;Input&lt;/th&gt;
      &lt;th&gt;Cached input&lt;/th&gt;
      &lt;th&gt;Output&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;gpt-5.4&lt;/td&gt;
      &lt;td&gt;$2.50&lt;/td&gt;
      &lt;td&gt;$0.25&lt;/td&gt;
      &lt;td&gt;$15.00&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;gpt-5.4-mini&lt;/td&gt;
      &lt;td&gt;$0.75&lt;/td&gt;
      &lt;td&gt;$0.075&lt;/td&gt;
      &lt;td&gt;$4.50&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;gpt-5.4-nano&lt;/td&gt;
      &lt;td&gt;$0.20&lt;/td&gt;
      &lt;td&gt;$0.02&lt;/td&gt;
      &lt;td&gt;$1.25&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;&lt;td colspan="4"&gt;&lt;center&gt;Other models for comparison&lt;/center&gt;&lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Claude Opus 4.6&lt;/td&gt;
      &lt;td&gt;$5.00&lt;/td&gt;
      &lt;td&gt;-&lt;/td&gt;
      &lt;td&gt;$25.00&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Claude Sonnet 4.6&lt;/td&gt;
      &lt;td&gt;$3.00&lt;/td&gt;
      &lt;td&gt;-&lt;/td&gt;
      &lt;td&gt;$15.00&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Gemini 3.1 Pro&lt;/td&gt;
      &lt;td&gt;$2.00&lt;/td&gt;
      &lt;td&gt;-&lt;/td&gt;
      &lt;td&gt;$12.00&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Claude Haiku 4.5&lt;/td&gt;
      &lt;td&gt;$1.00&lt;/td&gt;
      &lt;td&gt;-&lt;/td&gt;
      &lt;td&gt;$5.00&lt;/td&gt;
    &lt;/tr&gt;
&lt;tr&gt;
      &lt;td&gt;Gemini 3.1 Flash-Lite&lt;/td&gt;
      &lt;td&gt;$0.25&lt;/td&gt;
      &lt;td&gt;-&lt;/td&gt;
      &lt;td&gt;$1.50&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;&lt;/center&gt;
&lt;p&gt;I used GPT-5.4 nano to generate a description of this photo I took at the &lt;a href="https://www.niche-museums.com/118"&gt;John M. Mossman Lock Collection&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2026/IMG_2324.jpeg" alt="Description below" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm -m gpt-5.4-nano -a IMG_2324.jpeg 'describe image'
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here's the output:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The image shows the interior of a museum gallery with a long display wall. White-painted brick walls are covered with many framed portraits arranged in neat rows. Below the portraits, there are multiple glass display cases with dark wooden frames and glass tops/fronts, containing various old historical objects and equipment. The room has a polished wooden floor, hanging ceiling light fixtures/cords, and a few visible pipes near the top of the wall. In the foreground, glass cases run along the length of the room, reflecting items from other sections of the gallery.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;That took 2,751 input tokens and 112 output tokens, at a cost of &lt;a href="https://www.llm-prices.com/#it=2751&amp;amp;ot=112&amp;amp;sel=gpt-5.4-nano"&gt;0.069 cents&lt;/a&gt; (less than a tenth of a cent). That means describing every single photo in my 76,000 photo collection would cost around $52.44.&lt;/p&gt;
&lt;p&gt;I released &lt;a href="https://llm.datasette.io/en/stable/changelog.html#v0-29"&gt;llm 0.29&lt;/a&gt; with support for the new models.&lt;/p&gt;
&lt;h4 id="pelicans"&gt;Pelicans&lt;/h4&gt;
&lt;p&gt;Then I had OpenAI Codex loop through all five reasoning effort levels and all three models and produce this combined SVG grid of pelicans riding bicycles (&lt;a href="https://gist.github.com/simonw/f16292d9a5b90b28054cff3ba497a3ca"&gt;generation transcripts here&lt;/a&gt;). I do like the gpt-5.4 xhigh one the best, it has a good bicycle (with nice spokes) and the pelican has a fish in its beak!&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2026/gpt-5.4-pelican-family.svg" alt="Described by Claude Opus 4.6: A 5x3 comparison grid of AI-generated cartoon illustrations of a pelican riding a bicycle. Columns are labeled &amp;quot;gpt-5.4-nano&amp;quot;, &amp;quot;gpt-5.4-mini&amp;quot;, and &amp;quot;gpt-5.4&amp;quot; across the top, and rows are labeled &amp;quot;none&amp;quot;, &amp;quot;low&amp;quot;, &amp;quot;medium&amp;quot;, &amp;quot;high&amp;quot;, and &amp;quot;xhigh&amp;quot; down the left side, representing quality/detail settings. In the &amp;quot;none&amp;quot; row, gpt-5.4-nano shows a chaotic white bird with misplaced arrows and tangled wheels on grass, gpt-5.4-mini shows a duck-like brown bird awkwardly straddling a motorcycle-like bike, and gpt-5.4 shows a stiff gray-and-white pelican sitting atop a blue tandem bicycle with extra legs. In the &amp;quot;low&amp;quot; row, nano shows a chubby round white bird pedaling with small feet on grass, mini shows a cleaner white bird riding a blue bicycle with motion lines, and gpt-5.4 shows a pelican with a blue cap riding confidently but with slightly awkward proportions. In the &amp;quot;medium&amp;quot; row, nano regresses to a strange bird standing over bowling balls on ice, mini shows two plump white birds merged onto one yellow-wheeled bicycle, and gpt-5.4 shows a more recognizable gray-and-white pelican on a red bicycle but with tangled extra legs. In the &amp;quot;high&amp;quot; row, nano shows multiple small pelicans crowded around a broken green bicycle on grass with a sun overhead, mini shows a tandem bicycle with two white pelicans and clear blue sky, and gpt-5.4 shows two pelicans stacked on a red tandem bike with the most realistic proportions yet. In the &amp;quot;xhigh&amp;quot; row, nano shows the most detailed scene with a pelican on a detailed bicycle with grass and a large sun but still somewhat jumbled anatomy, mini produces the cleanest single pelican on a yellow-accented bicycle with a light blue sky, and gpt-5.4 shows a well-rendered gray pelican on a teal bicycle with the best overall coherence. Generally, quality improves moving right across models and down through quality tiers, though &amp;quot;medium&amp;quot; is inconsistently worse than &amp;quot;low&amp;quot; for some models, and all images maintain a lighthearted cartoon style with pastel skies and simple backgrounds." style="max-width: 100%;" /&gt;&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openai"&gt;openai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/vision-llms"&gt;vision-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-pricing"&gt;llm-pricing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="ai"/><category term="openai"/><category term="generative-ai"/><category term="llms"/><category term="llm"/><category term="vision-llms"/><category term="llm-pricing"/><category term="pelican-riding-a-bicycle"/><category term="llm-release"/></entry><entry><title>Introducing Mistral Small 4</title><link href="https://simonwillison.net/2026/Mar/16/mistral-small-4/#atom-tag" rel="alternate"/><published>2026-03-16T23:41:17+00:00</published><updated>2026-03-16T23:41:17+00:00</updated><id>https://simonwillison.net/2026/Mar/16/mistral-small-4/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://mistral.ai/news/mistral-small-4"&gt;Introducing Mistral Small 4&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Big new release from Mistral today (despite the name) - a new Apache 2 licensed 119B parameter (Mixture-of-Experts, 6B active) model which they describe like this:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Mistral Small 4 is the first Mistral model to unify the capabilities of our flagship models, Magistral for reasoning, Pixtral for multimodal, and Devstral for agentic coding, into a single, versatile model.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;It supports &lt;code&gt;reasoning_effort="none"&lt;/code&gt; or &lt;code&gt;reasoning_effort="high"&lt;/code&gt;, with the latter providing "equivalent verbosity to previous Magistral models". &lt;/p&gt;
&lt;p&gt;The new model is &lt;a href="https://huggingface.co/mistralai/Mistral-Small-4-119B-2603/tree/main"&gt;242GB on Hugging Face&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I &lt;a href="https://gist.github.com/simonw/3dec228577559f15f26204a3cc550583"&gt;tried it out&lt;/a&gt; via the Mistral API using &lt;a href="https://github.com/simonw/llm-mistral"&gt;llm-mistral&lt;/a&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm install llm-mistral
llm mistral refresh
llm -m mistral/mistral-small-2603 "Generate an SVG of a pelican riding a bicycle"
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img alt="The bicycle is upside down and mangled and the pelican is a series of grey curves with a triangular beak." src="https://static.simonwillison.net/static/2026/mistral-small-4.png" /&gt;&lt;/p&gt;
&lt;p&gt;I couldn't find a way to set the reasoning effort in their &lt;a href="https://docs.mistral.ai/api/endpoint/chat#operation-chat_completion_v1_chat_completions_post"&gt;API documentation&lt;/a&gt;, so hopefully that's a feature which will land soon.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;Update 23rd March&lt;/strong&gt;: Here's new documentation for the &lt;a href="https://docs.mistral.ai/capabilities/reasoning/adjustable"&gt;reasoning_effort parameter&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Also from Mistral today and fitting their -stral naming convention is &lt;a href="https://mistral.ai/news/leanstral"&gt;Leanstral&lt;/a&gt;, an open weight model that is specifically tuned to help output the &lt;a href="https://lean-lang.org/"&gt;Lean 4&lt;/a&gt; formally verifiable coding language. I haven't explored Lean at all so I have no way to credibly evaluate this, but it's interesting to see them target one specific language in this way.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/mistral"&gt;mistral&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-reasoning"&gt;llm-reasoning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="llm"/><category term="mistral"/><category term="pelican-riding-a-bicycle"/><category term="llm-reasoning"/><category term="llm-release"/></entry><entry><title>Introducing GPT‑5.4</title><link href="https://simonwillison.net/2026/Mar/5/introducing-gpt54/#atom-tag" rel="alternate"/><published>2026-03-05T23:56:09+00:00</published><updated>2026-03-05T23:56:09+00:00</updated><id>https://simonwillison.net/2026/Mar/5/introducing-gpt54/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://openai.com/index/introducing-gpt-5-4/"&gt;Introducing GPT‑5.4&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Two new API models: &lt;a href="https://developers.openai.com/api/docs/models/gpt-5.4"&gt;gpt-5.4&lt;/a&gt; and &lt;a href="https://developers.openai.com/api/docs/models/gpt-5.4-pro"&gt;gpt-5.4-pro&lt;/a&gt;, also available in ChatGPT and Codex CLI. August 31st 2025 knowledge cutoff, 1 million token context window. Priced &lt;a href="https://www.llm-prices.com/#sel=gpt-5.2%2Cgpt-5.2-pro%2Cgpt-5.4%2Cgpt-5.4-272k%2Cgpt-5.4-pro%2Cgpt-5.4-pro-272k"&gt;slightly higher&lt;/a&gt; than the GPT-5.2 family with a bump in price for both models if you go above 272,000 tokens.&lt;/p&gt;
&lt;p&gt;5.4 beats coding specialist GPT-5.3-Codex on all of the relevant benchmarks. I wonder if we'll get a 5.4 Codex or if that model line has now been merged into main?&lt;/p&gt;
&lt;p&gt;Given Claude's recent focus on business applications it's interesting to see OpenAI highlight this in their announcement of GPT-5.4:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;We put a particular focus on improving GPT‑5.4’s ability to create and edit spreadsheets, presentations, and documents. On an internal benchmark of spreadsheet modeling tasks that a junior investment banking analyst might do, GPT‑5.4 achieves a mean score of &lt;strong&gt;87.3%&lt;/strong&gt;, compared to &lt;strong&gt;68.4%&lt;/strong&gt; for GPT‑5.2.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Here's a pelican on a bicycle &lt;a href="https://gist.github.com/simonw/7fe75b8dab6ec9c2b6bd8fd1a5a640a6"&gt;drawn by GPT-5.4&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img alt="alt text by GPT-5.4: Illustration of a cartoon pelican riding a bicycle, with a light gray background, dark blue bike frame and wheels, orange beak and legs, and motion lines suggesting movement." src="https://static.simonwillison.net/static/2026/gpt-5.4-pelican.png" /&gt;&lt;/p&gt;
&lt;p&gt;And &lt;a href="https://gist.github.com/simonw/688c0d5d93a5539b93d3f549a0b733ad"&gt;here's one&lt;/a&gt; by GPT-5.4 Pro, which took 4m45s and cost me &lt;a href="https://www.llm-prices.com/#it=16&amp;amp;ot=8593&amp;amp;sel=gpt-5.4-pro"&gt;$1.55&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Described by GPT-5.4: Illustration of a cartoon pelican riding a blue bicycle on pale green grass against a light gray background, with a large orange beak, gray-and-white body, and orange legs posed on the pedals." src="https://static.simonwillison.net/static/2026/gpt-5.4-pro-pelican.png" /&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openai"&gt;openai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="openai"/><category term="generative-ai"/><category term="llms"/><category term="pelican-riding-a-bicycle"/><category term="llm-release"/></entry><entry><title>Gemini 3.1 Flash-Lite</title><link href="https://simonwillison.net/2026/Mar/3/gemini-31-flash-lite/#atom-tag" rel="alternate"/><published>2026-03-03T21:53:54+00:00</published><updated>2026-03-03T21:53:54+00:00</updated><id>https://simonwillison.net/2026/Mar/3/gemini-31-flash-lite/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-3-1-flash-lite/"&gt;Gemini 3.1 Flash-Lite&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Google's latest model is an update to their inexpensive Flash-Lite family. At $0.25/million tokens of input and $1.5/million output this is 1/8th the price of Gemini 3.1 Pro.&lt;/p&gt;
&lt;p&gt;It supports four different thinking levels, so I had it output &lt;a href="https://gist.github.com/simonw/99fb28dc11d0c24137d4ff8a33978a9e"&gt;four different pelicans&lt;/a&gt;:&lt;/p&gt;
&lt;div style="
    display: grid;
    grid-template-columns: repeat(2, 1fr);
    gap: 8px;
    margin: 0 auto;
  "&gt;
    &lt;div style="text-align: center;"&gt;
      &lt;div style="aspect-ratio: 1; overflow: hidden; border-radius: 4px;"&gt;
        &lt;img src="https://static.simonwillison.net/static/2026/gemini-3.1-flash-lite-minimal.png" alt="A minimalist vector-style illustration of a stylized bird riding a bicycle." style="width: 100%; height: 100%; object-fit: cover; display: block;"&gt;
      &lt;/div&gt;
      &lt;p style="margin: 4px 0 0; font-size: 16px; color: #333;"&gt;minimal&lt;/p&gt;
    &lt;/div&gt;
    &lt;div style="text-align: center;"&gt;
      &lt;div style="aspect-ratio: 1; overflow: hidden; border-radius: 4px;"&gt;
        &lt;img src="https://static.simonwillison.net/static/2026/gemini-3.1-flash-lite-low.png" alt="A minimalist graphic of a light blue round bird with a single black dot for an eye, wearing a yellow backpack and riding a black bicycle on a flat grey line." style="width: 100%; height: 100%; object-fit: cover; display: block;"&gt;
      &lt;/div&gt;
      &lt;p style="margin: 4px 0 0; font-size: 16px; color: #333;"&gt;low&lt;/p&gt;
    &lt;/div&gt;
    &lt;div style="text-align: center;"&gt;
      &lt;div style="aspect-ratio: 1; overflow: hidden; border-radius: 4px;"&gt;
        &lt;img src="https://static.simonwillison.net/static/2026/gemini-3.1-flash-lite-medium.png" alt="A minimalist digital illustration of a light blue bird wearing a yellow backpack while riding a bicycle." style="width: 100%; height: 100%; object-fit: cover; display: block;"&gt;
      &lt;/div&gt;
      &lt;p style="margin: 4px 0 0; font-size: 16px; color: #333;"&gt;medium&lt;/p&gt;
    &lt;/div&gt;
    &lt;div style="text-align: center;"&gt;
      &lt;div style="aspect-ratio: 1; overflow: hidden; border-radius: 4px;"&gt;
        &lt;img src="https://static.simonwillison.net/static/2026/gemini-3.1-flash-lite-high.png" alt="A minimal, stylized line drawing of a bird-like creature with a yellow beak riding a bicycle made of simple geometric lines." style="width: 100%; height: 100%; object-fit: cover; display: block;"&gt;
      &lt;/div&gt;
      &lt;p style="margin: 4px 0 0; font-size: 16px; color: #333;"&gt;high&lt;/p&gt;
    &lt;/div&gt;
&lt;/div&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/google"&gt;google&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gemini"&gt;gemini&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-pricing"&gt;llm-pricing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;&lt;/p&gt;



</summary><category term="google"/><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="llm"/><category term="gemini"/><category term="llm-pricing"/><category term="pelican-riding-a-bicycle"/><category term="llm-release"/></entry><entry><title>Gemini 3.1 Pro</title><link href="https://simonwillison.net/2026/Feb/19/gemini-31-pro/#atom-tag" rel="alternate"/><published>2026-02-19T17:58:37+00:00</published><updated>2026-02-19T17:58:37+00:00</updated><id>https://simonwillison.net/2026/Feb/19/gemini-31-pro/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-3-1-pro/"&gt;Gemini 3.1 Pro&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
The first in the Gemini 3.1 series, priced the same as Gemini 3 Pro ($2/million input, $12/million output under 200,000 tokens, $4/$18 for 200,000 to 1,000,000). That's less than half the price of Claude Opus 4.6 with very similar benchmark scores to that model.&lt;/p&gt;
&lt;p&gt;They boast about its improved SVG animation performance compared to Gemini 3 Pro in the announcement!&lt;/p&gt;
&lt;p&gt;I tried "Generate an SVG of a pelican riding a bicycle" &lt;a href="https://aistudio.google.com/app/prompts?state=%7B%22ids%22:%5B%221ugF9fBfLGxnNoe8_rLlluzo9NSPJDWuF%22%5D,%22action%22:%22open%22,%22userId%22:%22106366615678321494423%22,%22resourceKeys%22:%7B%7D%7D&amp;amp;usp=sharing"&gt;in Google AI Studio&lt;/a&gt; and it thought for 323.9 seconds (&lt;a href="https://gist.github.com/simonw/03a755865021739a3659943a22c125ba#thinking-trace"&gt;thinking trace here&lt;/a&gt;) before producing this one:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Whimsical flat-style illustration of a pelican wearing a blue and white baseball cap, riding a red bicycle with yellow-rimmed wheels along a road. The pelican has a large orange bill and a green scarf. A small fish peeks out of a brown basket on the handlebars. The background features a light blue sky with a yellow sun, white clouds, and green hills." src="https://static.simonwillison.net/static/2026/gemini-3.1-pro-pelican.png" /&gt;&lt;/p&gt;
&lt;p&gt;It's good to see the legs clearly depicted on both sides of the frame (should &lt;a href="https://twitter.com/elonmusk/status/2023833496804839808"&gt;satisfy Elon&lt;/a&gt;), the fish in the basket is a nice touch and I appreciated this comment in &lt;a href="https://gist.github.com/simonw/03a755865021739a3659943a22c125ba#response"&gt;the SVG code&lt;/a&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;lt;!-- Black Flight Feathers on Wing Tip --&amp;gt;
&amp;lt;path d="M 420 175 C 440 182, 460 187, 470 190 C 450 210, 430 208, 410 198 Z" fill="#374151" /&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I've &lt;a href="https://github.com/simonw/llm-gemini/issues/121"&gt;added&lt;/a&gt; the two new model IDs &lt;code&gt;gemini-3.1-pro-preview&lt;/code&gt; and &lt;code&gt;gemini-3.1-pro-preview-customtools&lt;/code&gt; to my &lt;a href="https://github.com/simonw/llm-gemini"&gt;llm-gemini plugin&lt;/a&gt; for &lt;a href="https://llm.datasette.io/"&gt;LLM&lt;/a&gt;. That "custom tools" one is &lt;a href="https://ai.google.dev/gemini-api/docs/models/gemini-3.1-pro-preview#gemini-31-pro-preview-customtools"&gt;described here&lt;/a&gt; - apparently it may provide better tool performance than the default model in some situations.&lt;/p&gt;
&lt;p&gt;The model appears to be &lt;em&gt;incredibly&lt;/em&gt; slow right now - it took 104s to respond to a simple "hi" and a few of my other tests met "Error: This model is currently experiencing high demand. Spikes in demand are usually temporary. Please try again later." or "Error: Deadline expired before operation could complete" errors. I'm assuming that's just teething problems on launch day.&lt;/p&gt;
&lt;p&gt;It sounds like last week's &lt;a href="https://simonwillison.net/2026/Feb/12/gemini-3-deep-think/"&gt;Deep Think release&lt;/a&gt; was our first exposure to the 3.1 family:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Last week, we released a major update to Gemini 3 Deep Think to solve modern challenges across science, research and engineering. Today, we’re releasing the upgraded core intelligence that makes those breakthroughs possible: Gemini 3.1 Pro.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;strong&gt;Update&lt;/strong&gt;: In &lt;a href="https://simonwillison.net/2025/nov/13/training-for-pelicans-riding-bicycles/"&gt;What happens if AI labs train for pelicans riding bicycles?&lt;/a&gt; last November I said:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;If a model finally comes out that produces an excellent SVG of a pelican riding a bicycle you can bet I’m going to test it on all manner of creatures riding all sorts of transportation devices.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p id="jeff-dean"&gt;Google's Gemini Lead Jeff Dean &lt;a href="https://x.com/JeffDean/status/2024525132266688757"&gt;tweeted this video&lt;/a&gt; featuring an animated pelican riding a bicycle, plus a frog on a penny-farthing and a giraffe driving a tiny car and an ostrich on roller skates and a turtle kickflipping a skateboard and a dachshund driving a stretch limousine.&lt;/p&gt;

&lt;video style="margin-bottom: 1em" poster="https://static.simonwillison.net/static/2026/gemini-animated-pelicans.jpg" muted controls preload="none" style="max-width: 100%"&gt;
  &lt;source src="https://static.simonwillison.net/static/2026/gemini-animated-pelicans.mp4" type="video/mp4"&gt;
&lt;/video&gt;

&lt;p&gt;I've been saying for a while that I wish AI labs would highlight things that their new models can do that their older models could not, so top marks to the Gemini team for this video.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Update 2&lt;/strong&gt;: I used &lt;code&gt;llm-gemini&lt;/code&gt; to run my &lt;a href="https://simonwillison.net/2025/Nov/18/gemini-3/#and-a-new-pelican-benchmark"&gt;more detailed Pelican prompt&lt;/a&gt;, with &lt;a href="https://gist.github.com/simonw/a3bdd4ec9476ba9e9ba7aa61b46d8296"&gt;this result&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Flat-style illustration of a brown pelican riding a teal bicycle with dark blue-rimmed wheels against a plain white background. Unlike the previous image's white cartoon pelican, this pelican has realistic brown plumage with detailed feather patterns, a dark maroon head, yellow eye, and a large pink-tinged pouch bill. The bicycle is a simpler design without a basket, and the scene lacks the colorful background elements like the sun, clouds, road, hills, cap, and scarf from the first illustration, giving it a more minimalist feel." src="https://static.simonwillison.net/static/2026/gemini-3.1-pro-pelican-2.png" /&gt;&lt;/p&gt;
&lt;p&gt;From the SVG comments:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;lt;!-- Pouch Gradient (Breeding Plumage: Red to Olive/Green) --&amp;gt;
...
&amp;lt;!-- Neck Gradient (Breeding Plumage: Chestnut Nape, White/Yellow Front) --&amp;gt;
&lt;/code&gt;&lt;/pre&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/google"&gt;google&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/svg"&gt;svg&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gemini"&gt;gemini&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;&lt;/p&gt;



</summary><category term="google"/><category term="svg"/><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="llm"/><category term="gemini"/><category term="pelican-riding-a-bicycle"/><category term="llm-release"/></entry><entry><title>Introducing Claude Sonnet 4.6</title><link href="https://simonwillison.net/2026/Feb/17/claude-sonnet-46/#atom-tag" rel="alternate"/><published>2026-02-17T23:58:58+00:00</published><updated>2026-02-17T23:58:58+00:00</updated><id>https://simonwillison.net/2026/Feb/17/claude-sonnet-46/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.anthropic.com/news/claude-sonnet-4-6"&gt;Introducing Claude Sonnet 4.6&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Sonnet 4.6 is out today, and Anthropic claim it offers similar performance to &lt;a href="https://simonwillison.net/2025/Nov/24/claude-opus/"&gt;November's Opus 4.5&lt;/a&gt; while maintaining the Sonnet pricing of $3/million input and $15/million output tokens (the Opus models are $5/$25). Here's &lt;a href="https://www-cdn.anthropic.com/78073f739564e986ff3e28522761a7a0b4484f84.pdf"&gt;the system card PDF&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Sonnet 4.6 has a "reliable knowledge cutoff" of August 2025, compared to Opus 4.6's May 2025 and Haiku 4.5's February 2025. Both Opus and Sonnet default to 200,000 max input tokens but can stretch to 1 million in beta and at a higher cost.&lt;/p&gt;
&lt;p&gt;I just released &lt;a href="https://github.com/simonw/llm-anthropic/releases/tag/0.24"&gt;llm-anthropic 0.24&lt;/a&gt; with support for both Sonnet 4.6 and Opus 4.6. Claude Code &lt;a href="https://github.com/simonw/llm-anthropic/pull/65"&gt;did most of the work&lt;/a&gt; - the new models had a fiddly amount of extra details around adaptive thinking and no longer supporting prefixes, as described &lt;a href="https://platform.claude.com/docs/en/about-claude/models/migration-guide"&gt;in Anthropic's migration guide&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Here's &lt;a href="https://gist.github.com/simonw/b185576a95e9321b441f0a4dfc0e297c"&gt;what I got&lt;/a&gt; from:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;uvx --with llm-anthropic llm 'Generate an SVG of a pelican riding a bicycle' -m claude-sonnet-4.6
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img alt="The pelican has a jaunty top hat with a red band. There is a string between the upper and lower beaks for some reason. The bicycle frame is warped in the wrong way." src="https://static.simonwillison.net/static/2026/pelican-sonnet-4.6.png" /&gt;&lt;/p&gt;
&lt;p&gt;The SVG comments include:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;lt;!-- Hat (fun accessory) --&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I tried a second time and also got a top hat. Sonnet 4.6 apparently loves top hats!&lt;/p&gt;
&lt;p&gt;For comparison, here's the pelican Opus 4.5 drew me &lt;a href="(https://simonwillison.net/2025/Nov/24/claude-opus/)"&gt;in November&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img alt="The pelican is cute and looks pretty good. The bicycle is not great - the frame is wrong and the pelican is facing backwards when the handlebars appear to be forwards.There is also something that looks a bit like an egg on the handlebars." src="https://static.simonwillison.net/static/2025/claude-opus-4.5-pelican.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;And here's Anthropic's current best pelican, drawn by Opus 4.6 &lt;a href="https://simonwillison.net/2026/Feb/5/two-new-models/"&gt;on February 5th&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Slightly wonky bicycle frame but an excellent pelican, very clear beak and pouch, nice feathers." src="https://static.simonwillison.net/static/2026/opus-4.6-pelican.png" /&gt;&lt;/p&gt;
&lt;p&gt;Opus 4.6 produces the best pelican beak/pouch. I do think the top hat from Sonnet 4.6 is a nice touch though.

    &lt;p&gt;&lt;small&gt;&lt;/small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=47050488"&gt;Hacker News&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/anthropic"&gt;anthropic&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude"&gt;claude&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-pricing"&gt;llm-pricing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude-code"&gt;claude-code&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="llm"/><category term="anthropic"/><category term="claude"/><category term="llm-pricing"/><category term="pelican-riding-a-bicycle"/><category term="llm-release"/><category term="claude-code"/></entry><entry><title>Qwen3.5: Towards Native Multimodal Agents</title><link href="https://simonwillison.net/2026/Feb/17/qwen35/#atom-tag" rel="alternate"/><published>2026-02-17T04:30:57+00:00</published><updated>2026-02-17T04:30:57+00:00</updated><id>https://simonwillison.net/2026/Feb/17/qwen35/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://qwen.ai/blog?id=qwen3.5"&gt;Qwen3.5: Towards Native Multimodal Agents&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Alibaba's Qwen just released the first two models in the Qwen 3.5 series - one open weights, one proprietary. Both are multi-modal for vision input.&lt;/p&gt;
&lt;p&gt;The open weight one is a Mixture of Experts model called Qwen3.5-397B-A17B. Interesting to see Qwen call out serving efficiency as a benefit of that architecture:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Built on an innovative hybrid architecture that fuses linear attention (via Gated Delta Networks) with a sparse mixture-of-experts, the model attains remarkable inference efficiency: although it comprises 397 billion total parameters, just 17 billion are activated per forward pass, optimizing both speed and cost without sacrificing capability.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;It's &lt;a href="https://huggingface.co/Qwen/Qwen3.5-397B-A17B"&gt;807GB on Hugging Face&lt;/a&gt;, and Unsloth have a &lt;a href="https://huggingface.co/unsloth/Qwen3.5-397B-A17B-GGUF"&gt;collection of smaller GGUFs&lt;/a&gt; ranging in size from 94.2GB 1-bit to 462GB Q8_K_XL.&lt;/p&gt;
&lt;p&gt;I got this &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle/"&gt;pelican&lt;/a&gt; from the &lt;a href="https://openrouter.ai/qwen/qwen3.5-397b-a17b"&gt;OpenRouter hosted model&lt;/a&gt; (&lt;a href="https://gist.github.com/simonw/625546cf6b371f9c0040e64492943b82"&gt;transcript&lt;/a&gt;):&lt;/p&gt;
&lt;p&gt;&lt;img alt="Pelican is quite good although the neck lacks an outline for some reason. Bicycle is very basic with an incomplete frame" src="https://static.simonwillison.net/static/2026/qwen3.5-397b-a17b.png" /&gt;&lt;/p&gt;
&lt;p&gt;The proprietary hosted model is called Qwen3.5 Plus 2026-02-15, and is a little confusing. Qwen researcher &lt;a href="https://twitter.com/JustinLin610/status/2023340126479569140"&gt;Junyang Lin  says&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Qwen3-Plus is a hosted API version of 397B. As the model natively supports 256K tokens, Qwen3.5-Plus supports 1M token context length. Additionally it supports search and code interpreter, which you can use on Qwen Chat with Auto mode.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Here's &lt;a href="https://gist.github.com/simonw/9507dd47483f78dc1195117735273e20"&gt;its pelican&lt;/a&gt;, which is similar in quality to the open weights model:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Similar quality pelican. The bicycle is taller and has a better frame shape. They are visually quite similar." src="https://static.simonwillison.net/static/2026/qwen3.5-plus-02-15.png" /&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/vision-llms"&gt;vision-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/qwen"&gt;qwen&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openrouter"&gt;openrouter&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-in-china"&gt;ai-in-china&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="vision-llms"/><category term="qwen"/><category term="pelican-riding-a-bicycle"/><category term="llm-release"/><category term="openrouter"/><category term="ai-in-china"/></entry><entry><title>Introducing GPT‑5.3‑Codex‑Spark</title><link href="https://simonwillison.net/2026/Feb/12/codex-spark/#atom-tag" rel="alternate"/><published>2026-02-12T21:16:07+00:00</published><updated>2026-02-12T21:16:07+00:00</updated><id>https://simonwillison.net/2026/Feb/12/codex-spark/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://openai.com/index/introducing-gpt-5-3-codex-spark/"&gt;Introducing GPT‑5.3‑Codex‑Spark&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
OpenAI announced a partnership with Cerebras &lt;a href="https://openai.com/index/cerebras-partnership/"&gt;on January 14th&lt;/a&gt;. Four weeks later they're already launching the first integration, "an ultra-fast model for real-time coding in Codex".&lt;/p&gt;
&lt;p&gt;Despite being named GPT-5.3-Codex-Spark it's not purely an accelerated alternative to GPT-5.3-Codex - the blog post calls it "a smaller version of GPT‑5.3-Codex" and clarifies that "at launch, Codex-Spark has a 128k context window and is text-only."&lt;/p&gt;
&lt;p&gt;I had some preview access to this model and I can confirm that it's significantly faster than their other models.&lt;/p&gt;
&lt;p&gt;Here's what that speed looks like running in Codex CLI:&lt;/p&gt;
&lt;div style="max-width: 100%;"&gt;
    &lt;video 
        controls 
        preload="none"
        poster="https://static.simonwillison.net/static/2026/gpt-5.3-codex-spark-medium-last.jpg"
        style="width: 100%; height: auto;"&gt;
        &lt;source src="https://static.simonwillison.net/static/2026/gpt-5.3-codex-spark-medium.mp4" type="video/mp4"&gt;
    &lt;/video&gt;
&lt;/div&gt;

&lt;p&gt;That was the "Generate an SVG of a pelican riding a bicycle" prompt - here's the rendered result:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Whimsical flat illustration of an orange duck merged with a bicycle, where the duck's body forms the seat and frame area while its head extends forward over the handlebars, set against a simple light blue sky and green grass background." src="https://static.simonwillison.net/static/2026/gpt-5.3-codex-spark-pelican.png" /&gt;&lt;/p&gt;
&lt;p&gt;Compare that to the speed of regular GPT-5.3 Codex medium:&lt;/p&gt;
&lt;div style="max-width: 100%;"&gt;
    &lt;video 
        controls 
        preload="none"
        poster="https://static.simonwillison.net/static/2026/gpt-5.3-codex-medium-last.jpg"
        style="width: 100%; height: auto;"&gt;
        &lt;source src="https://static.simonwillison.net/static/2026/gpt-5.3-codex-medium.mp4" type="video/mp4"&gt;
    &lt;/video&gt;
&lt;/div&gt;

&lt;p&gt;Significantly slower, but the pelican is a lot better:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Whimsical flat illustration of a white pelican riding a dark blue bicycle at speed, with motion lines behind it, its long orange beak streaming back in the wind, set against a light blue sky and green grass background." src="https://static.simonwillison.net/static/2026/gpt-5.3-codex-pelican.png" /&gt;&lt;/p&gt;
&lt;p&gt;What's interesting about this model isn't the quality though, it's the &lt;em&gt;speed&lt;/em&gt;. When a model responds this fast you can stay in flow state and iterate with the model much more productively.&lt;/p&gt;
&lt;p&gt;I showed a demo of Cerebras running Llama 3.1 70 B at 2,000 tokens/second against Val Town &lt;a href="https://simonwillison.net/2024/Oct/31/cerebras-coder/"&gt;back in October 2024&lt;/a&gt;. OpenAI claim 1,000 tokens/second for their new model, and I expect it will prove to be a ferociously useful partner for hands-on iterative coding sessions.&lt;/p&gt;
&lt;p&gt;It's not yet clear what the pricing will look like for this new model.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openai"&gt;openai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/cerebras"&gt;cerebras&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/codex"&gt;codex&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-performance"&gt;llm-performance&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="openai"/><category term="generative-ai"/><category term="llms"/><category term="cerebras"/><category term="pelican-riding-a-bicycle"/><category term="llm-release"/><category term="codex"/><category term="llm-performance"/></entry><entry><title>Gemini 3 Deep Think</title><link href="https://simonwillison.net/2026/Feb/12/gemini-3-deep-think/#atom-tag" rel="alternate"/><published>2026-02-12T18:12:17+00:00</published><updated>2026-02-12T18:12:17+00:00</updated><id>https://simonwillison.net/2026/Feb/12/gemini-3-deep-think/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-3-deep-think/"&gt;Gemini 3 Deep Think&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
New from Google. They say it's "built to push the frontier of intelligence and solve modern challenges across science, research, and engineering".&lt;/p&gt;
&lt;p&gt;It drew me a &lt;em&gt;really good&lt;/em&gt; &lt;a href="https://gist.github.com/simonw/7e317ebb5cf8e75b2fcec4d0694a8199"&gt;SVG of a pelican riding a bicycle&lt;/a&gt;! I think this is the best one I've seen so far - here's &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle/"&gt;my previous collection&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="This alt text also generated by Gemini 3 Deep Think: A highly detailed, colorful, flat vector illustration with thick dark blue outlines depicting a stylized white pelican riding a bright cyan blue bicycle from left to right across a sandy beige beach with white speed lines indicating forward motion. The pelican features a light blue eye, a pink cheek blush, a massive bill with a vertical gradient from yellow to orange, a backward magenta cap with a cyan brim and a small yellow top button, and a matching magenta scarf blowing backward in the wind. Its white wing, accented with a grey mid-section and dark blue feather tips, reaches forward to grip the handlebars, while its long tan leg and orange foot press down on an orange pedal. Attached to the front handlebars is a white wire basket carrying a bright blue cartoon fish that is pointing upwards and forwards. The bicycle itself has a cyan frame, dark blue tires, striking neon pink inner rims, cyan spokes, a white front chainring, and a dark blue chain. Behind the pelican, a grey trapezoidal pier extends from the sand toward a horizontal band of deep blue ocean water detailed with light cyan wavy lines. A massive, solid yellow-orange semi-circle sun sits on the horizon line, setting directly behind the bicycle frame. The background sky is a smooth vertical gradient transitioning from soft pink at the top to warm golden-yellow at the horizon, decorated with stylized pale peach fluffy clouds, thin white horizontal wind streaks, twinkling four-pointed white stars, and small brown v-shaped silhouettes of distant flying birds." src="https://static.simonwillison.net/static/2026/gemini-3-deep-think-pelican.png" /&gt;&lt;/p&gt;
&lt;p&gt;(And since it's an FAQ, here's my answer to &lt;a href="https://simonwillison.net/2025/Nov/13/training-for-pelicans-riding-bicycles/"&gt;What happens if AI labs train for pelicans riding bicycles?&lt;/a&gt;)&lt;/p&gt;
&lt;p&gt;Since it did so well on my basic &lt;code&gt;Generate an SVG of a pelican riding a bicycle&lt;/code&gt; I decided to try the &lt;a href="https://simonwillison.net/2025/Nov/18/gemini-3/#and-a-new-pelican-benchmark"&gt;more challenging version&lt;/a&gt; as well:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;Generate an SVG of a California brown pelican riding a bicycle. The bicycle must have spokes and a correctly shaped bicycle frame. The pelican must have its characteristic large pouch, and there should be a clear indication of feathers. The pelican must be clearly pedaling the bicycle. The image should show the full breeding plumage of the California brown pelican.&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Here's &lt;a href="https://gist.github.com/simonw/154c0cc7b4daed579f6a5e616250ecc8"&gt;what I got&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Also described by Gemini 3 Deep Think: A highly detailed, vibrant, and stylized vector illustration of a whimsical bird resembling a mix between a pelican and a frigatebird enthusiastically riding a bright cyan bicycle from left to right across a flat tan and brown surface. The bird leans horizontally over the frame in an aerodynamic racing posture, with thin, dark brown wing-like arms reaching forward to grip the silver handlebars and a single thick brown leg, patterned with white V-shapes, stretching down to press on a black pedal. The bird's most prominent and striking feature is an enormous, vividly bright red, inflated throat pouch hanging beneath a long, straight grey upper beak that ends in a small orange hook. Its head is mostly white with a small pink patch surrounding the eye, a dark brown stripe running down the back of its neck, and a distinctive curly pale yellow crest on the very top. The bird's round, dark brown body shares the same repeating white V-shaped feather pattern as its leg and is accented by a folded wing resting on its side, made up of cleanly layered light blue and grey feathers. A tail composed of four stiff, straight dark brown feathers extends directly backward. Thin white horizontal speed lines trail behind the back wheel and the bird's tail, emphasizing swift forward motion. The bicycle features a classic diamond frame, large wheels with thin black tires, grey rims, and detailed silver spokes, along with a clearly visible front chainring, silver chain, and rear cog. The whimsical scene is set against a clear light blue sky featuring two small, fluffy white clouds on the left and a large, pale yellow sun in the upper right corner that radiates soft, concentric, semi-transparent pastel green and yellow halos. A solid, darker brown shadow is cast directly beneath the bicycle's wheels on the minimalist two-toned brown ground." src="https://static.simonwillison.net/static/2026/gemini-3-deep-think-complex-pelican.png" /&gt;

    &lt;p&gt;&lt;small&gt;&lt;/small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=46991240"&gt;Hacker News&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/google"&gt;google&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gemini"&gt;gemini&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-reasoning"&gt;llm-reasoning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;&lt;/p&gt;



</summary><category term="google"/><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="gemini"/><category term="pelican-riding-a-bicycle"/><category term="llm-reasoning"/><category term="llm-release"/></entry><entry><title>GLM-5: From Vibe Coding to Agentic Engineering</title><link href="https://simonwillison.net/2026/Feb/11/glm-5/#atom-tag" rel="alternate"/><published>2026-02-11T18:56:14+00:00</published><updated>2026-02-11T18:56:14+00:00</updated><id>https://simonwillison.net/2026/Feb/11/glm-5/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://z.ai/blog/glm-5"&gt;GLM-5: From Vibe Coding to Agentic Engineering&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
This is a &lt;em&gt;huge&lt;/em&gt; new MIT-licensed model: 744B parameters and &lt;a href="https://huggingface.co/zai-org/GLM-5"&gt;1.51TB on Hugging Face&lt;/a&gt; twice the size of &lt;a href="https://huggingface.co/zai-org/GLM-4.7"&gt;GLM-4.7&lt;/a&gt; which was 368B and 717GB (4.5 and 4.6 were around that size too).&lt;/p&gt;
&lt;p&gt;It's interesting to see Z.ai take a position on what we should call professional software engineers building with LLMs - I've seen &lt;strong&gt;Agentic Engineering&lt;/strong&gt; show up in a few other places recently. most notable &lt;a href="https://twitter.com/karpathy/status/2019137879310836075"&gt;from Andrej Karpathy&lt;/a&gt; and &lt;a href="https://addyosmani.com/blog/agentic-engineering/"&gt;Addy Osmani&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I ran my "Generate an SVG of a pelican riding a bicycle" prompt through GLM-5 via &lt;a href="https://openrouter.ai/"&gt;OpenRouter&lt;/a&gt; and got back &lt;a href="https://gist.github.com/simonw/cc4ca7815ae82562e89a9fdd99f0725d"&gt;a very good pelican on a disappointing bicycle frame&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img alt="The pelican is good and has a well defined beak. The bicycle frame is a wonky red triangle. Nice sun and motion lines." src="https://static.simonwillison.net/static/2026/glm-5-pelican.png" /&gt;

    &lt;p&gt;&lt;small&gt;&lt;/small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=46977210"&gt;Hacker News&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/definitions"&gt;definitions&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-programming"&gt;ai-assisted-programming&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/vibe-coding"&gt;vibe-coding&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openrouter"&gt;openrouter&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-in-china"&gt;ai-in-china&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/glm"&gt;glm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/agentic-engineering"&gt;agentic-engineering&lt;/a&gt;&lt;/p&gt;



</summary><category term="definitions"/><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="ai-assisted-programming"/><category term="pelican-riding-a-bicycle"/><category term="llm-release"/><category term="vibe-coding"/><category term="openrouter"/><category term="ai-in-china"/><category term="glm"/><category term="agentic-engineering"/></entry></feed>