<?xml version="1.0" encoding="utf-8"?>
<feed xml:lang="en-us" xmlns="http://www.w3.org/2005/Atom"><title>Simon Willison's Weblog: gpt-oss</title><link href="http://simonwillison.net/" rel="alternate"/><link href="http://simonwillison.net/tags/gpt-oss.atom" rel="self"/><id>http://simonwillison.net/</id><updated>2025-10-10T22:37:21+00:00</updated><author><name>Simon Willison</name></author><entry><title>Video of GPT-OSS 20B running on a phone</title><link href="https://simonwillison.net/2025/Oct/10/gpt-oss-20b-snapdragon/#atom-tag" rel="alternate"/><published>2025-10-10T22:37:21+00:00</published><updated>2025-10-10T22:37:21+00:00</updated><id>https://simonwillison.net/2025/Oct/10/gpt-oss-20b-snapdragon/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://twitter.com/nexa_ai/status/1975232300985291008"&gt;Video of GPT-OSS 20B running on a phone&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;GPT-OSS 20B is a &lt;a href="https://simonwillison.net/2025/Aug/5/gpt-oss/"&gt;very good model&lt;/a&gt;. At launch OpenAI claimed:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The gpt-oss-20b model delivers similar results to OpenAI o3‑mini on common benchmarks and can run on edge devices with just 16 GB of memory&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;a href="https://nexa.ai/"&gt;Nexa AI&lt;/a&gt; just posted a video on Twitter demonstrating exactly that: the full GPT-OSS 20B running on a Snapdragon Gen 5 phone in their &lt;a href="https://play.google.com/store/apps/details?id=com.nexa.studio"&gt;Nexa Studio&lt;/a&gt; Android app. It requires at least 16GB of RAM, and benefits from Snapdragon using a similar trick to Apple Silicon where the system RAM is available to both the CPU and the GPU.&lt;/p&gt;
&lt;p&gt;The latest iPhone 17 Pro Max is still stuck at 12GB of RAM, presumably not enough to run this same model.&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/android"&gt;android&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openai"&gt;openai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gpt-oss"&gt;gpt-oss&lt;/a&gt;&lt;/p&gt;



</summary><category term="android"/><category term="ai"/><category term="openai"/><category term="generative-ai"/><category term="local-llms"/><category term="llms"/><category term="gpt-oss"/></entry><entry><title>llama.cpp guide: running gpt-oss with llama.cpp</title><link href="https://simonwillison.net/2025/Aug/19/gpt-oss-with-llama-cpp/#atom-tag" rel="alternate"/><published>2025-08-19T19:01:13+00:00</published><updated>2025-08-19T19:01:13+00:00</updated><id>https://simonwillison.net/2025/Aug/19/gpt-oss-with-llama-cpp/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/ggml-org/llama.cpp/discussions/15396"&gt;llama.cpp guide: running gpt-oss with llama.cpp&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Really useful official guide to running the OpenAI gpt-oss models using &lt;code&gt;llama-server&lt;/code&gt; from &lt;code&gt;llama.cpp&lt;/code&gt; - which provides an OpenAI-compatible localhost API and a neat web interface for interacting with the models.&lt;/p&gt;
&lt;p&gt;TLDR version for macOS to run the smaller &lt;code&gt;gpt-oss-20b&lt;/code&gt; model:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;brew install llama.cpp
llama-server -hf ggml-org/gpt-oss-20b-GGUF \
  --ctx-size 0 --jinja -ub 2048 -b 2048 -ngl 99 -fa
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This downloads a 12GB model file from &lt;a href="https://huggingface.co/ggml-org/gpt-oss-20b-GGUF/tree/main"&gt;ggml-org/gpt-oss-20b-GGUF&lt;/a&gt; on Hugging Face, stores it in &lt;code&gt;~/Library/Caches/llama.cpp/&lt;/code&gt; and starts it running on port 8080.&lt;/p&gt;
&lt;p&gt;You can then visit this URL to start interacting with the model:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;http://localhost:8080/
&lt;/code&gt;&lt;/pre&gt;
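&lt;p&gt;Since the server speaks the OpenAI chat completions protocol, you can also hit it programmatically. Here's a minimal sketch using only the Python standard library - I believe &lt;code&gt;llama-server&lt;/code&gt; ignores the &lt;code&gt;model&lt;/code&gt; field since it only serves one model, so the value here is illustrative:&lt;/p&gt;

```python
import json
from urllib import request

# Minimal OpenAI-style chat completion request against llama-server's
# compatible endpoint (assumes the server from the command above is
# running on the default port 8080).
body = json.dumps({
    "model": "gpt-oss-20b",
    "messages": [{"role": "user", "content": "Say hi"}],
}).encode()

req = request.Request(
    "http://localhost:8080/v1/chat/completions",
    data=body,
    headers={"Content-Type": "application/json"},
)
# response = request.urlopen(req)  # uncomment once the server is running
```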
&lt;p&gt;On my 64GB M2 MacBook Pro &lt;a href="https://gist.github.com/simonw/85ea67cba9fce0c7e63951dda5117268"&gt;it runs at around&lt;/a&gt; 82 tokens/second.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Screenshot of a chat interface with filename &amp;quot;llama.cpp&amp;quot; showing a conversation about creating an SVG of a pelican on a bicycle. The conversation includes detailed coordinates for drawing the pelican (body ellipse center at 250,140 with rx=35, ry=50, head circle at 260,110 with r=20, beak triangle points, wings, and tail specifications), implementation notes about layering bicycle elements then pelican, and ends with a code block showing the beginning of SVG code with XML declaration, svg tag with viewBox=&amp;quot;0 0 500 300&amp;quot;, style definitions for .bg, .wheel, .frame, .crossbar, .seat, .handlebar, .pedal, .pelican-body, and .pelican-head classes with various fill and stroke properties. Below the code is explanatory text: &amp;quot;Below is a compact, self-contained SVG that shows a stylised pelican perched on a bicycle. Copy the code into an .svg file or paste it directly into an HTML page to view it.&amp;quot; At the bottom is a message input field with &amp;quot;Type a message (Shift+Enter to add a new line)&amp;quot; placeholder text." src="https://static.simonwillison.net/static/2025/llama-cpp-screenshot.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;The guide also includes notes for running on NVIDIA and AMD hardware.&lt;/p&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/ggerganov/status/1957821440633282642"&gt;@ggerganov&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/macos"&gt;macos&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openai"&gt;openai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llama-cpp"&gt;llama-cpp&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gpt-oss"&gt;gpt-oss&lt;/a&gt;&lt;/p&gt;



</summary><category term="macos"/><category term="ai"/><category term="openai"/><category term="generative-ai"/><category term="local-llms"/><category term="llms"/><category term="llama-cpp"/><category term="gpt-oss"/></entry><entry><title>TIL: Running a gpt-oss eval suite against LM Studio on a Mac</title><link href="https://simonwillison.net/2025/Aug/17/gpt-oss-eval-suite/#atom-tag" rel="alternate"/><published>2025-08-17T03:46:21+00:00</published><updated>2025-08-17T03:46:21+00:00</updated><id>https://simonwillison.net/2025/Aug/17/gpt-oss-eval-suite/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://til.simonwillison.net/llms/gpt-oss-evals"&gt;TIL: Running a gpt-oss eval suite against LM Studio on a Mac&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The other day &lt;a href="https://simonwillison.net/2025/Aug/15/inconsistent-performance/#update"&gt;I learned&lt;/a&gt; that OpenAI published a set of evals as part of their gpt-oss model release, described in their cookbook on &lt;a href="https://cookbook.openai.com/articles/gpt-oss/verifying-implementations"&gt;Verifying gpt-oss implementations&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I decided to try and run that eval suite on my own MacBook Pro, against &lt;code&gt;gpt-oss-20b&lt;/code&gt; running inside of LM Studio.&lt;/p&gt;
&lt;p&gt;TLDR: once I had the model running inside LM Studio with a longer than default context limit, the following incantation ran an eval suite in around 3.5 hours:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;mkdir /tmp/aime25_openai
OPENAI_API_KEY=x \
  uv run --python 3.13 --with 'gpt-oss[eval]' \
  python -m gpt_oss.evals \
  --base-url http://localhost:1234/v1 \
  --eval aime25 \
  --sampler chat_completions \
  --model openai/gpt-oss-20b \
  --reasoning-effort low \
  --n-threads 2
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;My &lt;a href="https://til.simonwillison.net/llms/gpt-oss-evals"&gt;new TIL&lt;/a&gt; breaks that command down in detail and walks through the underlying eval - AIME 2025, which asks 30 questions (8 times each) that are defined using the following format:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;{"question": "Find the sum of all integer bases $b&amp;gt;9$ for which $17_{b}$ is a divisor of $97_{b}$.", "answer": "70"}&lt;/code&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/til"&gt;til&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openai"&gt;openai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/evals"&gt;evals&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/uv"&gt;uv&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/lm-studio"&gt;lm-studio&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gpt-oss"&gt;gpt-oss&lt;/a&gt;&lt;/p&gt;



</summary><category term="python"/><category term="ai"/><category term="til"/><category term="openai"/><category term="generative-ai"/><category term="local-llms"/><category term="llms"/><category term="evals"/><category term="uv"/><category term="lm-studio"/><category term="gpt-oss"/></entry><entry><title>Open weight LLMs exhibit inconsistent performance across providers</title><link href="https://simonwillison.net/2025/Aug/15/inconsistent-performance/#atom-tag" rel="alternate"/><published>2025-08-15T16:29:34+00:00</published><updated>2025-08-15T16:29:34+00:00</updated><id>https://simonwillison.net/2025/Aug/15/inconsistent-performance/#atom-tag</id><summary type="html">
    &lt;p&gt;Artificial Analysis published &lt;a href="https://artificialanalysis.ai/models/gpt-oss-120b/providers#aime25x32-performance-gpt-oss-120b"&gt;a new benchmark&lt;/a&gt; the other day, this time focusing on how an individual model - OpenAI’s gpt-oss-120b - performs across different hosted providers.&lt;/p&gt;
&lt;p&gt;The results showed some surprising differences. Here's the one with the greatest variance, a run of the 2025 AIME (American Invitational Mathematics Examination) averaging 32 runs against each model, using gpt-oss-120b with a reasoning effort of "high":&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/aim25x32-gpt-oss-120b.jpg" alt="Performance benchmark chart showing AIME25x32 Performance for gpt-oss-120B model across different AI frameworks. Chart displays box plots with percentile ranges (Min, 25th, Median, 75th, Max) for each framework. Title: &amp;quot;AIME25x32 Performance: gpt-oss-120B&amp;quot; with subtitle &amp;quot;AIME 2025 N=32 Runs: Minimum, 25th Percentile, Median, 75th Percentile, Maximum (Higher is Better)&amp;quot;. Legend indicates &amp;quot;Median; other points represent Min, 25th, 75th percentiles and Max respectively&amp;quot;. Y-axis ranges from 0 to 1.2. Frameworks shown from left to right: Cerebras (93.3%), Nebius Base (93.3%), Fireworks (93.3%), Deepinfra (93.3%), Novita (93.3%), Together.ai (93.3%), Parasail (90.0%), Groq (86.7%), Amazon (83.3%), Azure (80.0%), CompectAI (36.7%). Watermark shows &amp;quot;Artificial Analysis&amp;quot; logo." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;These are some varied results!&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;93.3%: Cerebras, Nebius Base, Fireworks, Deepinfra, Novita, Together.ai, vLLM 0.10.0&lt;/li&gt;
&lt;li&gt;90.0%: Parasail&lt;/li&gt;
&lt;li&gt;86.7%: Groq&lt;/li&gt;
&lt;li&gt;83.3%: Amazon&lt;/li&gt;
&lt;li&gt;80.0%: Azure&lt;/li&gt;
&lt;li&gt;36.7%: CompactifAI&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;It looks like most of the providers that scored 93.3% were running models using the latest &lt;a href="https://github.com/vllm-project/vllm"&gt;vLLM&lt;/a&gt; (with the exception of Cerebras who I believe have their own custom serving stack).&lt;/p&gt;
&lt;p&gt;I hadn't heard of CompactifAI before - I found &lt;a href="https://www.hpcwire.com/off-the-wire/multiverse-computing-closes-e189m-series-b-to-scale-compactifai-deployment/"&gt;this June 12th 2025 press release&lt;/a&gt; which says that "CompactifAI models are highly-compressed versions of leading open source LLMs that retain original accuracy, are 4x-12x faster and yield a 50%-80% reduction in inference costs" which helps explain their notably lower score!&lt;/p&gt;
&lt;p&gt;Microsoft Azure's Lucas Pickup &lt;a href="https://x.com/lupickup/status/1955620918086226223"&gt;confirmed&lt;/a&gt; that Azure's 80% score was caused by running an older vLLM, now fixed:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;This is exactly it, it’s been fixed as of yesterday afternoon across all serving instances (of the hosted 120b service). Old vLLM commits that didn’t respect reasoning_effort, so all requests defaulted to medium.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;No news yet on what went wrong with the AWS Bedrock version.&lt;/p&gt;
&lt;h4 id="the-challenge-for-customers-of-open-weight-models"&gt;The challenge for customers of open weight models&lt;/h4&gt;
&lt;p&gt;As a customer of open weight model providers, this really isn't something I wanted to have to think about!&lt;/p&gt;
&lt;p&gt;It's not really a surprise though. When running models myself I inevitably have to make choices - about which serving framework to use (I'm usually picking between GGUF/llama.cpp and MLX on my own Mac laptop) and the quantization size to use.&lt;/p&gt;
&lt;p&gt;I know that quantization has an impact, but it's difficult for me to quantify that effect.&lt;/p&gt;
&lt;p&gt;It looks like with hosted models even knowing the quantization they are using isn't necessarily enough information to be able to predict that model's performance.&lt;/p&gt;
&lt;p&gt;I see this situation as a general challenge for open weight models. They tend to be released as an opaque set of model weights plus loose instructions for running them on a single platform - if we are lucky! Most AI labs leave quantization and format conversions to the community and third-party providers.&lt;/p&gt;
&lt;p&gt;There's a lot that can go wrong. Tool calling is particularly vulnerable to these differences - models have been trained on specific tool-calling conventions, and if a provider doesn't get these exactly right the results can be unpredictable but difficult to diagnose.&lt;/p&gt;
&lt;p&gt;What would help &lt;em&gt;enormously&lt;/em&gt; here would be some kind of conformance suite. If models were reliably deterministic this would be easy: publish a set of test cases and let providers (or their customers) run those to check the model's implementation.&lt;/p&gt;
&lt;p&gt;Models aren't deterministic though, even at a temperature of 0. Maybe this new effort from Artificial Analysis is exactly what we need here, especially since running a full benchmark suite against a provider can be quite expensive in terms of token spend.&lt;/p&gt;
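&lt;p&gt;Absent determinism, a conformance check has to aggregate over repeated runs, which is what the AIME25x32 methodology does. A rough sketch of that kind of scoring - illustrative only, not Artificial Analysis's actual code:&lt;/p&gt;

```python
from statistics import median

def summarize_runs(accuracies):
    """Summarize repeated eval runs (each a fraction of questions correct),
    e.g. 32 repeats of the 30-question AIME 2025 set."""
    return {"min": min(accuracies),
            "median": median(accuracies),
            "max": max(accuracies)}

# Two hypothetical providers serving the same weights: one consistent,
# one with a misconfigured serving stack dragging some runs down.
stable = summarize_runs([28/30, 28/30, 27/30, 28/30])
flaky = summarize_runs([28/30, 22/30, 25/30, 17/30])
print(stable["median"], flaky["median"])
```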
&lt;p id="update"&gt;&lt;strong&gt;Update&lt;/strong&gt;: &lt;a href="https://x.com/DKundel/status/1956395988836368587"&gt;Via OpenAI's Dominik Kundel&lt;/a&gt; I learned that OpenAI now include a &lt;a href="https://github.com/openai/gpt-oss/tree/main/compatibility-test"&gt;compatibility test&lt;/a&gt; in the gpt-oss GitHub repository to help providers verify that they have implemented things like tool calling templates correctly, described in more detail in their &lt;a href="https://cookbook.openai.com/articles/gpt-oss/verifying-implementations"&gt;Verifying gpt-oss implementations&lt;/a&gt; cookbook.&lt;/p&gt;

&lt;p&gt;Here's &lt;a href="https://til.simonwillison.net/llms/gpt-oss-evals"&gt;my TIL&lt;/a&gt; on running part of that eval suite.&lt;/p&gt;

&lt;h4 id="update-aug-20"&gt;Update: August 20th 2025&lt;/h4&gt;

&lt;p&gt;Since I first wrote this article Artificial Analysis have updated the benchmark results to reflect fixes that vendors have made since their initial run. Here's what it looks like today:&lt;/p&gt;

&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/gpt-oss-eval-updated.jpg" alt="Performance benchmark chart showing AIME25x32 Performance for gpt-oss-120B model across different AI frameworks. Chart displays box plots with percentile ranges for each framework. Title: &amp;quot;AIME25x32 Performance: gpt-oss-120B&amp;quot; with subtitle &amp;quot;AIME 2025 N=32 Runs: Minimum, 25th Percentile, Median, 75th Percentile, Maximum (Higher is Better)&amp;quot;. Legend indicates &amp;quot;Median; other points represent Min, 25th, 75th percentiles and Max respectively&amp;quot;. Y-axis ranges from 0 to 1.2. Frameworks shown from left to right: Cerebras (93.3%), Nebius Base (93.3%), Azure (93.3%), Fireworks (93.3%), Deepinfra (93.3%), Novita (93.3%), Groq (93.3%), Together.ai (93.3%), Parasail (90.0%), Google Vertex (83.3%), Amazon (80.0%). Watermark shows &amp;quot;Artificial Analysis&amp;quot; logo." style="max-width: 100%" /&gt;&lt;/p&gt;
&lt;p&gt;Groq and Azure have both improved their scores to 93.3%. Google Vertex is new to the chart at 83.3%.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openai"&gt;openai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/evals"&gt;evals&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gpt-oss"&gt;gpt-oss&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/artificial-analysis"&gt;artificial-analysis&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-performance"&gt;llm-performance&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="ai"/><category term="openai"/><category term="generative-ai"/><category term="local-llms"/><category term="llms"/><category term="evals"/><category term="gpt-oss"/><category term="artificial-analysis"/><category term="llm-performance"/></entry><entry><title>Quoting Artificial Analysis</title><link href="https://simonwillison.net/2025/Aug/6/artificial-analysis/#atom-tag" rel="alternate"/><published>2025-08-06T12:48:32+00:00</published><updated>2025-08-06T12:48:32+00:00</updated><id>https://simonwillison.net/2025/Aug/6/artificial-analysis/#atom-tag</id><summary type="html">
    &lt;blockquote cite="https://x.com/artificialanlys/status/1952887733803991070"&gt;&lt;p&gt;&lt;strong&gt;gpt-oss-120b is the most intelligent American open weights model, comes behind DeepSeek R1 and Qwen3 235B in intelligence but offers efficiency benefits&lt;/strong&gt; [...]&lt;/p&gt;
&lt;p&gt;We’re seeing the 120B beat o3-mini but come in behind o4-mini and o3. The 120B is the most intelligent model that can be run on a single H100 and the 20B is the most intelligent model that can be run on a consumer GPU. [...]&lt;/p&gt;
&lt;p&gt;While the larger gpt-oss-120b does not come in above DeepSeek R1 0528’s score of 59 or Qwen3 235B 2507’s score of 64, it is notable that it is significantly smaller in both total and active parameters than both of those models.&lt;/p&gt;&lt;/blockquote&gt;
&lt;p class="cite"&gt;&amp;mdash; &lt;a href="https://x.com/artificialanlys/status/1952887733803991070"&gt;Artificial Analysis&lt;/a&gt;, see also their &lt;a href="https://artificialanalysis.ai/models/open-source"&gt;updated leaderboard&lt;/a&gt;&lt;/p&gt;

    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openai"&gt;openai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/evals"&gt;evals&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/qwen"&gt;qwen&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/deepseek"&gt;deepseek&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gpt-oss"&gt;gpt-oss&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/artificial-analysis"&gt;artificial-analysis&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="openai"/><category term="generative-ai"/><category term="llms"/><category term="evals"/><category term="qwen"/><category term="deepseek"/><category term="gpt-oss"/><category term="artificial-analysis"/></entry><entry><title>OpenAI's new open weight (Apache 2) models are really good</title><link href="https://simonwillison.net/2025/Aug/5/gpt-oss/#atom-tag" rel="alternate"/><published>2025-08-05T20:33:13+00:00</published><updated>2025-08-05T20:33:13+00:00</updated><id>https://simonwillison.net/2025/Aug/5/gpt-oss/#atom-tag</id><summary type="html">
    &lt;p&gt;The long promised &lt;a href="https://openai.com/index/introducing-gpt-oss/"&gt;OpenAI open weight models are here&lt;/a&gt;, and they are &lt;em&gt;very&lt;/em&gt; impressive. They're available under proper open source licenses - Apache 2.0 - and come in two sizes, 120B and 20B.&lt;/p&gt;
&lt;p&gt;OpenAI's own benchmarks are eyebrow-raising - emphasis mine:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The &lt;strong&gt;gpt-oss-120b&lt;/strong&gt; model achieves &lt;strong&gt;near-parity with OpenAI o4-mini&lt;/strong&gt; on core reasoning benchmarks, while running efficiently on a single 80 GB GPU. The &lt;strong&gt;gpt-oss-20b&lt;/strong&gt; model delivers &lt;strong&gt;similar results to OpenAI o3‑mini&lt;/strong&gt; on common benchmarks and can run on edge devices with just 16 GB of memory, making it ideal for on-device use cases, local inference, or rapid iteration without costly infrastructure.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;o4-mini and o3-mini are &lt;em&gt;really good&lt;/em&gt; proprietary models - I was not expecting the open weights releases to be anywhere near that class, especially given their small sizes. That gpt-oss-20b model should run quite comfortably on a Mac laptop with 32GB of RAM.&lt;/p&gt;
&lt;p&gt;Both models are mixture-of-experts:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;gpt-oss-120b activates 5.1B parameters per token, while gpt-oss-20b activates 3.6B. The models have 117b and 21b total parameters respectively.&lt;/p&gt;
&lt;/blockquote&gt;
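&lt;p&gt;Those figures mean only a small slice of the weights fires for each token - around 4.4% for the 120B and 17.1% for the 20B:&lt;/p&gt;

```python
# Active-parameter fraction per token, using the figures quoted above
# (billions of parameters).
models = {"gpt-oss-120b": (5.1, 117), "gpt-oss-20b": (3.6, 21)}
for name, (active, total) in models.items():
    print(f"{name}: {active / total:.1%} of weights active per token")
```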
&lt;p&gt;Something that surprised me even more about the benchmarks was the scores for general knowledge based challenges. I can just about believe they managed to train a strong reasoning model that fits in 20B parameters, but these models score highly on benchmarks like "GPQA Diamond (without tools) PhD-level science questions" too:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;o3 — 83.3%&lt;/li&gt;
&lt;li&gt;o4-mini — 81.4%&lt;/li&gt;
&lt;li&gt;gpt-oss-120b — 80.1%&lt;/li&gt;
&lt;li&gt;o3-mini — 77%&lt;/li&gt;
&lt;li&gt;gpt-oss-20b — 71.5%&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;A lot of these benchmarks are edging towards saturated.&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Aug/5/gpt-oss/#running-gpt-oss-20b-on-my-mac-with-lm-studio"&gt;Running gpt-oss-20b on my Mac with LM Studio&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Aug/5/gpt-oss/#pelican-on-reasoning-low"&gt;Pelican on reasoning=low&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Aug/5/gpt-oss/#pelican-on-reasoning-medium"&gt;Pelican on reasoning=medium&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Aug/5/gpt-oss/#pelican-on-reasoning-high"&gt;Pelican on reasoning=high&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Aug/5/gpt-oss/#space-invaders-with-gpt-oss-20b"&gt;Space invaders with gpt-oss-20b&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Aug/5/gpt-oss/#trying-gpt-oss-120b-via-api-providers"&gt;Trying gpt-oss-120b via API providers&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Aug/5/gpt-oss/#llama-cpp-is-coming-very-shortly"&gt;llama.cpp is coming very shortly&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Aug/5/gpt-oss/#gpt-oss-20b-in-ollama"&gt;gpt-oss:20b in Ollama&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Aug/5/gpt-oss/#the-model-card"&gt;Training details from the model card&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Aug/5/gpt-oss/#openai-harmony-a-new-format-for-prompt-templates"&gt;OpenAI Harmony, a new format for prompt templates&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Aug/5/gpt-oss/#the-open-question-for-me-how-good-is-tool-calling-"&gt;The open question for me: how good is tool calling?&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Aug/5/gpt-oss/#china"&gt;Competing with the Chinese open models&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 id="running-gpt-oss-20b-on-my-mac-with-lm-studio"&gt;Running gpt-oss-20b on my Mac with LM Studio&lt;/h4&gt;
&lt;p&gt;There are already a bunch of different ways to run these models - OpenAI partnered with numerous organizations in advance of the release.&lt;/p&gt;
&lt;p&gt;I decided to start with &lt;a href="https://lmstudio.ai/"&gt;LM Studio&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I had to update to the most recent version of the app, then install the new model from &lt;a href="https://lmstudio.ai/models/openai/gpt-oss-20b"&gt;their openai/gpt-oss-20b&lt;/a&gt; page.&lt;/p&gt;
&lt;p&gt;First impressions: this is a &lt;em&gt;really good&lt;/em&gt; model, and it somehow runs using just 11.72GB of my system RAM.&lt;/p&gt;
&lt;p&gt;The model supports three reasoning efforts: low, medium and high. LM Studio makes those available via a dropdown.&lt;/p&gt;
&lt;p&gt;Let's try "Generate an SVG of a pelican riding a bicycle":&lt;/p&gt;
&lt;h4 id="pelican-on-reasoning-low"&gt;Pelican on reasoning=low&lt;/h4&gt;
&lt;p&gt;I started &lt;a href="https://gist.github.com/simonw/b71394cc85fe0f048e376392e41586da"&gt;with low&lt;/a&gt;. It thought for 0.07 seconds and then output this (at 39 tokens a second):&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/gpt-20-low.png" alt="" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;Except... it output invalid SVG. One of the path elements looked like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;lt;!-- Frame --&amp;gt;
&amp;lt;path d="
    M150,250          &amp;lt;!-- rear wheel center --&amp;gt;
    L300,120          &amp;lt;!-- top tube to front --&amp;gt;
    L450,250          &amp;lt;!-- chain stays back to front --&amp;gt;
    L300,350          &amp;lt;!-- seat stays down --&amp;gt;
    Z"
    fill="#e0e0e0" stroke="#555" stroke-width="4"/&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;But you can't put comments inside attributes like that. I fixed this to get the above image.&lt;/p&gt;
&lt;h4 id="pelican-on-reasoning-medium"&gt;Pelican on reasoning=medium&lt;/h4&gt;
&lt;p&gt;I tried again &lt;a href="https://gist.github.com/simonw/642e9e371387fc59a5aad25dcd41b0cc"&gt;on medium&lt;/a&gt;. This time it thought for 4.44 seconds and gave me this, at 55 tokens/second:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/gpt-20-medium.png" alt="" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;h4 id="pelican-on-reasoning-high"&gt;Pelican on reasoning=high&lt;/h4&gt;
&lt;p&gt;And then &lt;a href="https://gist.github.com/simonw/164a3f1b4b1a971637af279a0ac49be3"&gt;on high&lt;/a&gt;. My first attempt here failed because it hit the default LM Studio 4096 context limit - so I reloaded the model and set the new limit to the maximum possible.&lt;/p&gt;
&lt;p&gt;It jumped up to using 17.35GB RAM (I imagine it might use a lot more with a longer prompt) and after thinking for 5 minutes and 50 seconds (!) gave me this:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/gpt-20-high.png" alt="" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;It thought &lt;em&gt;really hard&lt;/em&gt; to get there. Here's a snippet of &lt;a href="https://gist.github.com/simonw/164a3f1b4b1a971637af279a0ac49be3"&gt;that transcript&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;But maybe we want seat below body bottom? Actually seat should be between body and pedals, not above body bottom. Wait, typical bicycle: seat post goes from bottom bracket upward to the seat. The seat sits on top of the seat tube (vertical). So the seat is above bottom bracket. In our representation, the seat is at y=250-255. Body center at 245, body bottom at 265. So body bottom is below seat. That means the pelican's body is below the seat? Actually no: body bottom extends to 265; seat top is at 250; so seat sits above body bottom by about 15 units. That seems plausible because seat should be slightly above body, but body bottom is at 265 &amp;gt; seat top 255. [...]&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I've seen this from other local reasoning models too: they can really overthink this kind of problem! I don't know if there's any pelican SVG that's worth waiting nearly six minutes for.&lt;/p&gt;
&lt;h4 id="space-invaders-with-gpt-oss-20b"&gt;Space invaders with gpt-oss-20b&lt;/h4&gt;
&lt;p&gt;Given how long high took I switched back to medium for my next experiment:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;Write an HTML and JavaScript page implementing space invaders&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;It &lt;a href="https://gist.github.com/simonw/63d7d8c43ae2ac93c214325bd6d607e4"&gt;thought for 10.78 seconds&lt;/a&gt; and produced this:&lt;/p&gt;

&lt;div style="max-width: 100%; margin-bottom: 0.4em"&gt;
    &lt;video controls="controls" preload="none" aria-label="Space Invaders" poster="https://static.simonwillison.net/static/2025/space-invaders-gpt-20.jpg" loop="loop" style="width: 100%; height: auto;" muted="muted"&gt;
        &lt;source src="https://static.simonwillison.net/static/2025/space-invaders-gpt-20.mp4" type="video/mp4" /&gt;
    &lt;/video&gt;
&lt;/div&gt;

&lt;p&gt;You can &lt;a href="https://tools.simonwillison.net/space-invaders-gpt-oss-20b-mxfp4-medium"&gt;play that here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;It's not the best I've seen - I was more impressed &lt;a href="https://simonwillison.net/2025/Jul/29/space-invaders/"&gt;by GLM 4.5 Air&lt;/a&gt; - but it's very competent for a model that only uses 12GB of my RAM (GLM 4.5 Air used 47GB).&lt;/p&gt;
&lt;h4 id="trying-gpt-oss-120b-via-api-providers"&gt;Trying gpt-oss-120b via API providers&lt;/h4&gt;
&lt;p&gt;I don't quite have the resources on my laptop to run the larger model. Thankfully it's already being hosted by a number of different API providers.&lt;/p&gt;
&lt;p&gt;OpenRouter already &lt;a href="https://openrouter.ai/openai/gpt-oss-120b/providers"&gt;lists three&lt;/a&gt; - Fireworks, Groq and Cerebras. (Update: now also Parasail and Baseten.)&lt;/p&gt;
&lt;p&gt;Cerebras is &lt;em&gt;fast&lt;/em&gt;, so I decided to try them first.&lt;/p&gt;
&lt;p&gt;I installed the &lt;a href="https://github.com/irthomasthomas/llm-cerebras"&gt;llm-cerebras&lt;/a&gt; plugin and ran the &lt;code&gt;refresh&lt;/code&gt; command to ensure it had their latest models:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm install -U llm-cerebras jsonschema
llm cerebras refresh&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;(Installing jsonschema worked around a warning message.)&lt;/p&gt;
&lt;p&gt;Output:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Refreshed 10 Cerebras models:
  - cerebras-deepseek-r1-distill-llama-70b
  - cerebras-gpt-oss-120b
  - cerebras-llama-3.3-70b
  - cerebras-llama-4-maverick-17b-128e-instruct
  - cerebras-llama-4-scout-17b-16e-instruct
  - cerebras-llama3.1-8b
  - cerebras-qwen-3-235b-a22b-instruct-2507
  - cerebras-qwen-3-235b-a22b-thinking-2507
  - cerebras-qwen-3-32b
  - cerebras-qwen-3-coder-480b
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm -m cerebras-gpt-oss-120b \
  &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;Generate an SVG of a pelican riding a bicycle&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Cerebras runs the new model at between 2,000 and 4,000 tokens per second!&lt;/p&gt;
&lt;p&gt;To my surprise this one &lt;a href="https://gist.github.com/simonw/4c685f19f1a93b68eacb627125e36be4"&gt;had the same comments-in-attributes bug&lt;/a&gt; that we saw with oss-20b earlier. I fixed those and got this pelican:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/gpt-120-cerebras.jpg" alt="Yellow and not great pelican, quite a good bicycle if a bit sketchy." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;That bug appears intermittently - I've not seen it on some of my other runs of the same prompt.&lt;/p&gt;
&lt;p&gt;The &lt;a href="https://github.com/simonw/llm-openrouter"&gt;llm-openrouter&lt;/a&gt; plugin also provides access to the models, balanced across the underlying providers. You can use that like so:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm install llm-openrouter
llm keys &lt;span class="pl-c1"&gt;set&lt;/span&gt; openrouter
&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; Paste API key here&lt;/span&gt;
llm -m openrouter/openai/gpt-oss-120b &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Say hi&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;h4 id="llama-cpp-is-coming-very-shortly"&gt;llama.cpp is coming very shortly&lt;/h4&gt;
&lt;p&gt;The &lt;code&gt;llama.cpp&lt;/code&gt; &lt;a href="https://github.com/ggml-org/llama.cpp/pull/15091"&gt;pull request for gpt-oss&lt;/a&gt; landed less than an hour ago. It's worth browsing through the code - a &lt;em&gt;lot&lt;/em&gt; of work went into supporting this new model, spanning 48 commits to 83 different files. Hopefully this will land in the &lt;a href="https://formulae.brew.sh/formula/llama.cpp"&gt;llama.cpp Homebrew package&lt;/a&gt; within the next day or so, which should provide a convenient way to run the model via &lt;code&gt;llama-server&lt;/code&gt; and friends.&lt;/p&gt;
&lt;h4 id="gpt-oss-20b-in-ollama"&gt;gpt-oss:20b in Ollama&lt;/h4&gt;
&lt;p&gt;Ollama &lt;a href="https://ollama.com/library/gpt-oss"&gt;also have gpt-oss&lt;/a&gt;, requiring an update to their app.&lt;/p&gt;
&lt;p&gt;I fetched that 14GB model like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;ollama pull gpt-oss:20b&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Now I can use it with the new Ollama native app, or access it from &lt;a href="https://llm.datasette.io/"&gt;LLM&lt;/a&gt; like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm install llm-ollama
llm -m gpt-oss:20b &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;Hi&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This also appears to use around 13.26GB of system memory while running a prompt.&lt;/p&gt;
&lt;p&gt;Ollama also launched &lt;a href="https://ollama.com/turbo"&gt;Ollama Turbo&lt;/a&gt; today, offering the two OpenAI models as a paid hosted service:&lt;/p&gt;
&lt;blockquote&gt;&lt;p&gt;Turbo is a new way to run open models using datacenter-grade hardware. Many new models are too large to fit on widely available GPUs, or run very slowly. Ollama Turbo provides a way to run these models fast while using Ollama's App, CLI, and API. &lt;/p&gt;&lt;/blockquote&gt;
&lt;h4 id="the-model-card"&gt;Training details from the model card&lt;/h4&gt;
&lt;p&gt;Here are some interesting notes about how the models were trained from &lt;a href="https://cdn.openai.com/pdf/419b6906-9da6-406c-a19d-1bb078ac7637/oai_gpt-oss_model_card.pdf"&gt;the model card&lt;/a&gt; (PDF):&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Data&lt;/strong&gt;: We train the models on a text-only dataset with trillions of tokens, with a focus on STEM, coding, and general knowledge. To improve the safety of the model, we filtered the data for harmful content in pre-training, especially around hazardous biosecurity knowledge, by reusing the CBRN pre-training filters from GPT-4o. Our model has a knowledge cutoff of June 2024.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Training&lt;/strong&gt;: The gpt-oss models trained on NVIDIA H100 GPUs using the PyTorch framework with expert-optimized Triton kernels. The training run for gpt-oss-120b required 2.1 million H100-hours to complete, with gpt-oss-20b needing almost 10x fewer. [...]&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Thunder Compute's article &lt;a href="https://www.thundercompute.com/blog/nvidia-h100-pricing"&gt;NVIDIA H100 Pricing (August 2025): Cheapest On-Demand Cloud GPU Rates&lt;/a&gt; lists prices from around $2/hour to $11/hour, which would indicate a training cost of the 120b model between $4.2m and $23.1m and the 20b between $420,000 and $2.3m.&lt;/p&gt;
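&lt;p&gt;Those ranges are straightforward multiplication. Here's the arithmetic as a quick sketch - hour counts from the model card, hourly rates from the Thunder Compute article, and the 20b figure assumes exactly 10x fewer hours:&lt;/p&gt;

```python
# Rough training-cost estimate: H100-hours multiplied by hourly rental rate.
h100_hours_120b = 2_100_000            # from the model card
h100_hours_20b = h100_hours_120b / 10  # "almost 10x fewer" (assumed exactly 10x)

low_rate, high_rate = 2, 11  # $/H100-hour, Thunder Compute (August 2025)

for name, hours in [
    ("gpt-oss-120b", h100_hours_120b),
    ("gpt-oss-20b", h100_hours_20b),
]:
    print(f"{name}: ${hours * low_rate:,.0f} to ${hours * high_rate:,.0f}")
# gpt-oss-120b: $4,200,000 to $23,100,000
# gpt-oss-20b: $420,000 to $2,310,000
```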
&lt;blockquote&gt;
&lt;p&gt;After pre-training, we post-train the models using similar CoT RL techniques as OpenAI o3. This procedure teaches the models how to reason and solve problems using CoT and teaches the model how to use tools. Because of the similar RL techniques, these models have a personality similar to models served in our first-party products like ChatGPT. Our training dataset consists of a wide range of problems from coding, math, science, and more.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The models have additional special training to help them use web browser and Python (Jupyter notebook) tools more effectively:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;During post-training, we also teach the models to use different agentic tools:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A browsing tool, that allows the model to call search and open functions to interact with
the web. This aids factuality and allows the models to fetch info beyond their knowledge
cutoff.&lt;/li&gt;
&lt;li&gt;A python tool, which allows the model to run code in a stateful Jupyter notebook environment.&lt;/li&gt;
&lt;li&gt;Arbitrary developer functions, where one can specify function schemas in a &lt;code&gt;Developer&lt;/code&gt;
message similar to the OpenAI API. The definition of function is done within our harmony
format.&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;There's a corresponding &lt;a href="https://github.com/openai/gpt-oss?tab=readme-ov-file#python"&gt;section about Python tool usage&lt;/a&gt; in the &lt;code&gt;openai/gpt-oss&lt;/code&gt; repository README.&lt;/p&gt;


&lt;h4 id="openai-harmony-a-new-format-for-prompt-templates"&gt;OpenAI Harmony, a new format for prompt templates&lt;/h4&gt;
&lt;p&gt;One of the gnarliest parts of implementing harnesses for LLMs is handling the prompt template format.&lt;/p&gt;
&lt;p&gt;Modern prompts are complicated beasts. They need to model user vs. assistant conversation turns, tool calls, reasoning traces and an increasing number of other complex patterns.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://github.com/openai/harmony"&gt;openai/harmony&lt;/a&gt; is a brand new open source project from OpenAI (again, Apache 2) which implements a new response format that was created for the &lt;code&gt;gpt-oss&lt;/code&gt; models. It's clearly inspired by their new-ish &lt;a href="https://openai.com/index/new-tools-for-building-agents/"&gt;Responses API&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The format is described in the new &lt;a href="https://cookbook.openai.com/articles/openai-harmony"&gt;OpenAI Harmony Response Format&lt;/a&gt; cookbook document. It introduces some concepts that I've not seen in open weight models before:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;system&lt;/code&gt;, &lt;code&gt;developer&lt;/code&gt;, &lt;code&gt;user&lt;/code&gt;, &lt;code&gt;assistant&lt;/code&gt; and &lt;code&gt;tool&lt;/code&gt; roles - many other models only use user and assistant, and sometimes system and tool.&lt;/li&gt;
&lt;li&gt;Three different channels for output: &lt;code&gt;final&lt;/code&gt;, &lt;code&gt;analysis&lt;/code&gt; and &lt;code&gt;commentary&lt;/code&gt;. Only the &lt;code&gt;final&lt;/code&gt; channel is intended to be visible to users by default. &lt;code&gt;analysis&lt;/code&gt; is for chain of thought and &lt;code&gt;commentary&lt;/code&gt; is sometimes used for tools.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;That channels concept has been present in ChatGPT for a few months, starting with the release of o3.&lt;/p&gt;
&lt;p&gt;The details of the new tokens used by Harmony caught my eye:&lt;/p&gt;
&lt;center&gt;
&lt;table&gt;
  &lt;tbody&gt;&lt;tr&gt;
    &lt;th&gt;Token&lt;/th&gt;
    &lt;th&gt;Purpose&lt;/th&gt;
    &lt;th&gt;ID&lt;/th&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;&amp;lt;|start|&amp;gt;&lt;/td&gt;
    &lt;td&gt;Start of message header&lt;/td&gt;
    &lt;td&gt;200006&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;&amp;lt;|end|&amp;gt;&lt;/td&gt;
    &lt;td&gt;End of message&lt;/td&gt;
    &lt;td&gt;200007&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;&amp;lt;|message|&amp;gt;&lt;/td&gt;
    &lt;td&gt;Start of message content&lt;/td&gt;
    &lt;td&gt;200008&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;&amp;lt;|channel|&amp;gt;&lt;/td&gt;
    &lt;td&gt;Start of channel info&lt;/td&gt;
    &lt;td&gt;200005&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;&amp;lt;|constrain|&amp;gt;&lt;/td&gt;
    &lt;td&gt;Data type for tool call&lt;/td&gt;
    &lt;td&gt;200003&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;&amp;lt;|return|&amp;gt;&lt;/td&gt;
    &lt;td&gt;Stop after response&lt;/td&gt;
    &lt;td&gt;200002&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;&amp;lt;|call|&amp;gt;&lt;/td&gt;
    &lt;td&gt;Call a tool&lt;/td&gt;
    &lt;td&gt;200012&lt;/td&gt;
  &lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;
&lt;/center&gt;
&lt;p&gt;Those token IDs are particularly important. They are part of a new token vocabulary called &lt;code&gt;o200k_harmony&lt;/code&gt;, which landed in OpenAI's tiktoken tokenizer library &lt;a href="https://github.com/openai/tiktoken/commit/3591ff175d6a80efbe4fcc7f0e219ddd4b8c52f1"&gt;this morning&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;In the past I've seen models get confused by special tokens - try pasting &lt;code&gt;&amp;lt;|end|&amp;gt;&lt;/code&gt; into a model and see what happens.&lt;/p&gt;
&lt;p&gt;Having these special instruction tokens formally map to dedicated token IDs should hopefully be a whole lot more robust!&lt;/p&gt;
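&lt;p&gt;A toy sketch of why that helps - this is my own illustration, not the real &lt;code&gt;o200k_harmony&lt;/code&gt; tokenizer. The control IDs can only be emitted by trusted template code, so user text that happens to contain the literal string encodes as plain characters:&lt;/p&gt;

```python
# Illustrative sketch: special tokens are dedicated IDs that encoding
# ordinary text can never produce. The IDs are the real o200k_harmony
# values from the table above; the byte-level fallback is a stand-in
# for a real BPE tokenizer.
SPECIAL = {"<|start|>": 200006, "<|end|>": 200007, "<|message|>": 200008}

def encode(text: str, as_special: bool = False) -> list[int]:
    if as_special and text in SPECIAL:
        return [SPECIAL[text]]  # trusted template code emits the control ID
    return list(text.encode())  # user-supplied text: plain bytes, never an ID

assert encode("<|end|>", as_special=True) == [200007]
assert 200007 not in encode("<|end|>")  # pasted by a user: just characters
```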
&lt;p&gt;The Harmony repo itself includes a Rust library and a Python library (wrapping that Rust library) for working with the new format in a much more ergonomic way.&lt;/p&gt;
&lt;p&gt;I tried one of their demos using &lt;code&gt;uv run&lt;/code&gt; to turn it into a shell one-liner:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;uv run --python 3.12 --with openai-harmony python -c &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;from openai_harmony import *&lt;/span&gt;
&lt;span class="pl-s"&gt;from openai_harmony import DeveloperContent&lt;/span&gt;
&lt;span class="pl-s"&gt;enc = load_harmony_encoding(HarmonyEncodingName.HARMONY_GPT_OSS)&lt;/span&gt;
&lt;span class="pl-s"&gt;convo = Conversation.from_messages([&lt;/span&gt;
&lt;span class="pl-s"&gt;    Message.from_role_and_content(&lt;/span&gt;
&lt;span class="pl-s"&gt;        Role.SYSTEM,&lt;/span&gt;
&lt;span class="pl-s"&gt;        SystemContent.new(),&lt;/span&gt;
&lt;span class="pl-s"&gt;    ),&lt;/span&gt;
&lt;span class="pl-s"&gt;    Message.from_role_and_content(&lt;/span&gt;
&lt;span class="pl-s"&gt;        Role.DEVELOPER,&lt;/span&gt;
&lt;span class="pl-s"&gt;        DeveloperContent.new().with_instructions("Talk like a pirate!")&lt;/span&gt;
&lt;span class="pl-s"&gt;    ),&lt;/span&gt;
&lt;span class="pl-s"&gt;    Message.from_role_and_content(Role.USER, "Arrr, how be you?"),&lt;/span&gt;
&lt;span class="pl-s"&gt;])&lt;/span&gt;
&lt;span class="pl-s"&gt;tokens = enc.render_conversation_for_completion(convo, Role.ASSISTANT)&lt;/span&gt;
&lt;span class="pl-s"&gt;print(tokens)&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Which outputs:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;[200006, 17360, 200008, 3575, 553, 17554, 162016, 11, 261, 4410, 6439, 2359, 22203, 656, 7788, 17527, 558, 87447, 100594, 25, 220, 1323, 19, 12, 3218, 279, 30377, 289, 25, 14093, 279, 2, 13888, 18403, 25, 8450, 11, 49159, 11, 1721, 13, 21030, 2804, 413, 7360, 395, 1753, 3176, 13, 200007, 200006, 77944, 200008, 2, 68406, 279, 37992, 1299, 261, 96063, 0, 200007, 200006, 1428, 200008, 8977, 81, 11, 1495, 413, 481, 30, 200007, 200006, 173781]&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Note those token IDs like &lt;code&gt;200006&lt;/code&gt; corresponding to the special tokens listed above.&lt;/p&gt;
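&lt;p&gt;Putting the channel and message tokens together, here's a toy parser showing how a harness might split decoded model output into its channels. This is my own sketch of the documented delimiters - the &lt;code&gt;openai-harmony&lt;/code&gt; library provides proper parsing:&lt;/p&gt;

```python
import re

# Toy parser for Harmony-style assistant output: each message looks like
# <|channel|>NAME<|message|>CONTENT and is terminated by <|end|> or a
# stop token such as <|return|>. Illustration only - use openai-harmony
# for real parsing.
PATTERN = re.compile(
    r"<\|channel\|>(?P<channel>\w+)<\|message\|>"
    r"(?P<content>.*?)(?=<\|end\|>|<\|return\|>|$)",
    re.DOTALL,
)

def split_channels(raw: str) -> dict[str, str]:
    """Return {channel: content} for each channel message in raw output."""
    return {m["channel"]: m["content"] for m in PATTERN.finditer(raw)}

raw = (
    "<|channel|>analysis<|message|>User asked for a greeting."
    "<|end|><|channel|>final<|message|>Ahoy there!<|return|>"
)
parsed = split_channels(raw)
print(parsed["final"])     # only this channel is shown to the user
print(parsed["analysis"])  # chain of thought stays hidden
```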
&lt;h4 id="the-open-question-for-me-how-good-is-tool-calling-"&gt;The open question for me: how good is tool calling?&lt;/h4&gt;
&lt;p&gt;There's one aspect of these models that I haven't explored in detail yet: &lt;strong&gt;tool calling&lt;/strong&gt;. How these work is clearly a big part of the new Harmony format, but the packages I'm using myself (around my own &lt;a href="https://simonwillison.net/2025/May/27/llm-tools/"&gt;LLM tool calling&lt;/a&gt; support) need various tweaks and fixes to start working with that new mechanism.&lt;/p&gt;
&lt;p&gt;Tool calling currently represents my biggest disappointment with local models that I've run on my own machine. I've been able to get them to perform simple single calls, but the state of the art these days is wildly more ambitious than that.&lt;/p&gt;
&lt;p&gt;Systems like Claude Code can make dozens if not hundreds of tool calls over the course of a single session, each one adding more context and information to a single conversation with an underlying model.&lt;/p&gt;
&lt;p&gt;My experience to date has been that local models are unable to handle these lengthy conversations. I'm not sure if that's inherent to the limitations of my own machine, or if it's something that the right model architecture and training could overcome.&lt;/p&gt;
&lt;p&gt;OpenAI make big claims about the tool calling capabilities of these new models. I'm looking forward to seeing how well they perform in practice.&lt;/p&gt;

&lt;h4 id="china"&gt;Competing with the Chinese open models&lt;/h4&gt;

&lt;p&gt;I've been writing a &lt;em&gt;lot&lt;/em&gt; about the &lt;a href="https://simonwillison.net/tags/ai-in-china/"&gt;flurry of excellent open weight models&lt;/a&gt; released by Chinese AI labs over the past few months - all of them very capable and most of them under Apache 2 or MIT licenses.&lt;/p&gt;

&lt;p&gt;Just last week &lt;a href="https://simonwillison.net/2025/Jul/30/chinese-models/"&gt;I said&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Something that has become undeniable this month is that the best available open weight models now come from the Chinese AI labs.&lt;/p&gt;
&lt;p&gt;I continue to have a lot of love for Mistral, Gemma and Llama but my feeling is that Qwen, Moonshot and Z.ai have positively smoked them over the course of July. [...]&lt;/p&gt;
&lt;p&gt;I can't help but wonder if part of the reason for the delay in release of OpenAI's open weights model comes from a desire to be notably better than this truly impressive lineup of Chinese models.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;With the release of the gpt-oss models that statement no longer holds true. I'm waiting for the dust to settle and the independent benchmarks (that are more credible than my ridiculous pelicans) to roll out, but I think it's likely that OpenAI now offer the best available open weights models.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Update&lt;/strong&gt;: Independent evaluations are beginning to roll in. Here's &lt;a href="https://x.com/artificialanlys/status/1952887733803991070"&gt;Artificial Analysis&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;&lt;p&gt;gpt-oss-120b is the most intelligent American open weights model, comes behind DeepSeek R1 and Qwen3 235B in intelligence but offers efficiency benefits [...]&lt;/p&gt;
&lt;p&gt;While the larger gpt-oss-120b does not come in above DeepSeek R1 0528’s score of 59 or Qwen3 235B 2507s score of 64, it is notable that it is significantly smaller in both total and active parameters than both of those models.&lt;/p&gt;&lt;/blockquote&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/open-source"&gt;open-source&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openai"&gt;openai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-tool-use"&gt;llm-tool-use&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/cerebras"&gt;cerebras&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ollama"&gt;ollama&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-reasoning"&gt;llm-reasoning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/lm-studio"&gt;lm-studio&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/space-invaders"&gt;space-invaders&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gpt-oss"&gt;gpt-oss&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="open-source"/><category term="ai"/><category term="openai"/><category term="generative-ai"/><category term="local-llms"/><category term="llms"/><category term="llm"/><category term="llm-tool-use"/><category term="cerebras"/><category term="ollama"/><category term="pelican-riding-a-bicycle"/><category term="llm-reasoning"/><category term="llm-release"/><category term="lm-studio"/><category term="space-invaders"/><category term="gpt-oss"/></entry><entry><title>The best available open weight LLMs now come from China</title><link href="https://simonwillison.net/2025/Jul/30/chinese-models/#atom-tag" rel="alternate"/><published>2025-07-30T16:18:38+00:00</published><updated>2025-07-30T16:18:38+00:00</updated><id>https://simonwillison.net/2025/Jul/30/chinese-models/#atom-tag</id><summary type="html">
    &lt;p&gt;Something that has become undeniable this month is that the best available open weight models now come from the Chinese AI labs.&lt;/p&gt;
&lt;p&gt;I continue to have a lot of love for Mistral, Gemma and Llama but my feeling is that Qwen, Moonshot and Z.ai have positively &lt;em&gt;smoked them&lt;/em&gt; over the course of July.&lt;/p&gt;
&lt;p&gt;Here's what came out this month, with links to my notes on each one:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Moonshot &lt;a href="https://simonwillison.net/2025/Jul/11/kimi-k2/"&gt;Kimi-K2-Instruct&lt;/a&gt; - 11th July, 1 trillion parameters&lt;/li&gt;
&lt;li&gt;Qwen &lt;a href="https://simonwillison.net/2025/Jul/22/qwen3-235b-a22b-instruct-2507/"&gt;Qwen3-235B-A22B-Instruct-2507&lt;/a&gt; - 21st July, 235 billion&lt;/li&gt;
&lt;li&gt;Qwen &lt;a href="https://simonwillison.net/2025/Jul/22/qwen3-coder/"&gt;Qwen3-Coder-480B-A35B-Instruct&lt;/a&gt; - 22nd July, 480 billion&lt;/li&gt;
&lt;li&gt;Qwen &lt;a href="https://simonwillison.net/2025/Jul/25/qwen3-235b-a22b-thinking-2507/"&gt;Qwen3-235B-A22B-Thinking-2507&lt;/a&gt; - 25th July, 235 billion&lt;/li&gt;
&lt;li&gt;Z.ai &lt;a href="https://simonwillison.net/2025/Jul/28/glm-45/"&gt;GLM-4.5 and GLM-4.5 Air&lt;/a&gt; - 28th July, 355 and 106 billion&lt;/li&gt;
&lt;li&gt;Qwen &lt;a href="https://simonwillison.net/2025/Jul/29/qwen3-30b-a3b-instruct-2507/"&gt;Qwen3-30B-A3B-Instruct-2507&lt;/a&gt; - 29th July, 30 billion&lt;/li&gt;
&lt;li&gt;Qwen &lt;a href="https://simonwillison.net/2025/Jul/30/qwen3-30b-a3b-thinking-2507/"&gt;Qwen3-30B-A3B-Thinking-2507&lt;/a&gt; - 30th July, 30 billion&lt;/li&gt;
&lt;li&gt;Qwen &lt;a href="https://simonwillison.net/2025/Jul/31/qwen3-coder-flash/"&gt;Qwen3-Coder-30B-A3B-Instruct&lt;/a&gt; - 31st July, 30 billion (released after I first posted this note)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;small&gt;Notably absent from this list is DeepSeek, but that's only because their last model release was &lt;a href="https://huggingface.co/deepseek-ai/DeepSeek-R1-0528"&gt;DeepSeek-R1-0528&lt;/a&gt; back in May.&lt;/small&gt;&lt;/p&gt;
&lt;p&gt;The only janky license among them is Kimi K2, which uses a non-OSI-compliant modified MIT. Qwen's models are all Apache 2 and Z.ai's are MIT.&lt;/p&gt;
&lt;p&gt;The larger Chinese models all offer their own APIs and are increasingly available from other providers.  I've been able to run versions of the Qwen 30B and GLM-4.5 Air 106B models on my own laptop.&lt;/p&gt;
&lt;p&gt;I can't help but wonder if part of the reason for the delay in release of OpenAI's open weights model comes from a desire to be notably better than this truly impressive lineup of Chinese models.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Update August 5th 2025&lt;/strong&gt;: The OpenAI open weight models came out and &lt;a href="https://simonwillison.net/2025/Aug/5/gpt-oss/"&gt;they are very impressive&lt;/a&gt;.&lt;/p&gt;

    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/open-source"&gt;open-source&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openai"&gt;openai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/qwen"&gt;qwen&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-in-china"&gt;ai-in-china&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gpt-oss"&gt;gpt-oss&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/moonshot"&gt;moonshot&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/kimi"&gt;kimi&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/janky-licenses"&gt;janky-licenses&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/glm"&gt;glm&lt;/a&gt;&lt;/p&gt;



</summary><category term="open-source"/><category term="ai"/><category term="openai"/><category term="generative-ai"/><category term="local-llms"/><category term="llms"/><category term="qwen"/><category term="ai-in-china"/><category term="gpt-oss"/><category term="moonshot"/><category term="kimi"/><category term="janky-licenses"/><category term="glm"/></entry></feed>