<?xml version="1.0" encoding="utf-8"?>
<feed xml:lang="en-us" xmlns="http://www.w3.org/2005/Atom"><title>Simon Willison's Weblog: llm-performance</title><link href="http://simonwillison.net/" rel="alternate"/><link href="http://simonwillison.net/tags/llm-performance.atom" rel="self"/><id>http://simonwillison.net/</id><updated>2026-06-10T20:00:54+00:00</updated><author><name>Simon Willison</name></author><entry><title>DiffusionGemma</title><link href="https://simonwillison.net/2026/Jun/10/diffusiongemma/#atom-tag" rel="alternate"/><published>2026-06-10T20:00:54+00:00</published><updated>2026-06-10T20:00:54+00:00</updated><id>https://simonwillison.net/2026/Jun/10/diffusiongemma/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://blog.google/innovation-and-ai/technology/developers-tools/diffusion-gemma-faster-text-generation/"&gt;DiffusionGemma&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Last May Google briefly released an experimental Gemini Diffusion model. I &lt;a href="https://simonwillison.net/2025/May/21/gemini-diffusion/"&gt;tried the preview at the time&lt;/a&gt; and recorded it running at 857 tokens/second. It was an exciting model, but Google made no further announcements about it.&lt;/p&gt;
&lt;p&gt;That research has returned in the best possible way: as a new open weight (Apache 2 licensed) Gemma model, &lt;a href="https://huggingface.co/google/diffusiongemma-26B-A4B-it"&gt;google/diffusiongemma-26B-A4B-it&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;NVIDIA are currently &lt;a href="https://build.nvidia.com/google/diffusiongemma-26b-a4b-it"&gt;hosting the model for free&lt;/a&gt; on their NIM cloud API. I used that API to &lt;a href="https://tools.simonwillison.net/markdown-svg-renderer#url=https%3A%2F%2Fgist.github.com%2Fsimonw%2Fe5e234a6dc6eef61e209ce1629620042"&gt;generate this pelican&lt;/a&gt;, which took 4.4s (according to &lt;code&gt;time uv run generate.py&lt;/code&gt;) to return 2,409 tokens - so at least 500 tokens/second.&lt;/p&gt;
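&lt;p&gt;A minimal sketch of that throughput estimate, using the two figures quoted above. The helper name &lt;code&gt;tokens_per_second&lt;/code&gt; is hypothetical. Because the &lt;code&gt;time uv run generate.py&lt;/code&gt; measurement also covers process startup and network round trips, the result is best read as a lower bound on the model's actual generation speed:&lt;/p&gt;

```python
def tokens_per_second(token_count: int, elapsed_seconds: float) -> float:
    """Average throughput for one completed request (hypothetical helper)."""
    if not elapsed_seconds > 0:
        raise ValueError("elapsed_seconds must be positive")
    return token_count / elapsed_seconds


# Figures from the post: 2,409 tokens returned in 4.4s of wall-clock time.
rate = tokens_per_second(2409, 4.4)
print(f"{rate:.1f} tokens/second (lower bound, includes startup and network time)")
```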
&lt;p&gt;&lt;img alt="Flat minimalist illustration of a white pelican with a large orange beak riding a red bicycle with black wheels, against a pale blue background with a green line representing the ground" src="https://static.simonwillison.net/static/2026/diffusiongemma-pelican.png" /&gt;&lt;/p&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=48478471"&gt;Hacker News&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/google"&gt;google&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/nvidia"&gt;nvidia&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gemma"&gt;gemma&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-performance"&gt;llm-performance&lt;/a&gt;&lt;/p&gt;



</summary><category term="google"/><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="nvidia"/><category term="pelican-riding-a-bicycle"/><category term="gemma"/><category term="llm-release"/><category term="llm-performance"/></entry><entry><title>Quoting Thibault Sottiaux</title><link href="https://simonwillison.net/2026/Feb/21/thibault-sottiaux/#atom-tag" rel="alternate"/><published>2026-02-21T01:30:21+00:00</published><updated>2026-02-21T01:30:21+00:00</updated><id>https://simonwillison.net/2026/Feb/21/thibault-sottiaux/#atom-tag</id><summary type="html">
    &lt;blockquote cite="https://twitter.com/thsottiaux/status/2024947946849186064"&gt;&lt;p&gt;We’ve made GPT-5.3-Codex-Spark about 30% faster. It is now serving at over 1200 tokens per second.&lt;/p&gt;&lt;/blockquote&gt;
&lt;p class="cite"&gt;&amp;mdash; &lt;a href="https://twitter.com/thsottiaux/status/2024947946849186064"&gt;Thibault Sottiaux&lt;/a&gt;, OpenAI&lt;/p&gt;

    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openai"&gt;openai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-performance"&gt;llm-performance&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="openai"/><category term="generative-ai"/><category term="llms"/><category term="llm-performance"/></entry><entry><title>Taalas serves Llama 3.1 8B at 17,000 tokens/second</title><link href="https://simonwillison.net/2026/Feb/20/taalas/#atom-tag" rel="alternate"/><published>2026-02-20T22:10:04+00:00</published><updated>2026-02-20T22:10:04+00:00</updated><id>https://simonwillison.net/2026/Feb/20/taalas/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://taalas.com/the-path-to-ubiquitous-ai/"&gt;Taalas serves Llama 3.1 8B at 17,000 tokens/second&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;This new Canadian hardware startup just announced their first product - a custom hardware implementation of the Llama 3.1 8B model (from &lt;a href="https://simonwillison.net/2024/Jul/23/introducing-llama-31/"&gt;July 2024&lt;/a&gt;) that can run at a staggering 17,000 tokens/second.&lt;/p&gt;
&lt;p&gt;I was going to include a video of their demo but it's so fast it would look more like a screenshot. You can try it out at &lt;a href="https://chatjimmy.ai"&gt;chatjimmy.ai&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;They describe their Silicon Llama as “aggressively quantized, combining 3-bit and 6-bit parameters.” Their next generation will use 4-bit - presumably they have quite a long lead time for baking out new models!&lt;/p&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=47086181"&gt;Hacker News&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llama"&gt;llama&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-performance"&gt;llm-performance&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="generative-ai"/><category term="llama"/><category term="llms"/><category term="llm-performance"/></entry><entry><title>Introducing GPT‑5.3‑Codex‑Spark</title><link href="https://simonwillison.net/2026/Feb/12/codex-spark/#atom-tag" rel="alternate"/><published>2026-02-12T21:16:07+00:00</published><updated>2026-02-12T21:16:07+00:00</updated><id>https://simonwillison.net/2026/Feb/12/codex-spark/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://openai.com/index/introducing-gpt-5-3-codex-spark/"&gt;Introducing GPT‑5.3‑Codex‑Spark&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;OpenAI announced a partnership with Cerebras &lt;a href="https://openai.com/index/cerebras-partnership/"&gt;on January 14th&lt;/a&gt;. Four weeks later they're already launching the first integration, "an ultra-fast model for real-time coding in Codex".&lt;/p&gt;
&lt;p&gt;Despite being named GPT-5.3-Codex-Spark it's not purely an accelerated alternative to GPT-5.3-Codex - the blog post calls it "a smaller version of GPT‑5.3-Codex" and clarifies that "at launch, Codex-Spark has a 128k context window and is text-only."&lt;/p&gt;
&lt;p&gt;I had some preview access to this model and I can confirm that it's significantly faster than their other models.&lt;/p&gt;
&lt;p&gt;Here's what that speed looks like running in Codex CLI:&lt;/p&gt;
&lt;div style="max-width: 100%;"&gt;
    &lt;video 
        controls 
        preload="none"
        poster="https://static.simonwillison.net/static/2026/gpt-5.3-codex-spark-medium-last.jpg"
        style="width: 100%; height: auto;"&gt;
        &lt;source src="https://static.simonwillison.net/static/2026/gpt-5.3-codex-spark-medium.mp4" type="video/mp4"&gt;
    &lt;/video&gt;
&lt;/div&gt;

&lt;p&gt;That was the "Generate an SVG of a pelican riding a bicycle" prompt - here's the rendered result:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Whimsical flat illustration of an orange duck merged with a bicycle, where the duck's body forms the seat and frame area while its head extends forward over the handlebars, set against a simple light blue sky and green grass background." src="https://static.simonwillison.net/static/2026/gpt-5.3-codex-spark-pelican.png" /&gt;&lt;/p&gt;
&lt;p&gt;Compare that to the speed of regular GPT-5.3 Codex medium:&lt;/p&gt;
&lt;div style="max-width: 100%;"&gt;
    &lt;video 
        controls 
        preload="none"
        poster="https://static.simonwillison.net/static/2026/gpt-5.3-codex-medium-last.jpg"
        style="width: 100%; height: auto;"&gt;
        &lt;source src="https://static.simonwillison.net/static/2026/gpt-5.3-codex-medium.mp4" type="video/mp4"&gt;
    &lt;/video&gt;
&lt;/div&gt;

&lt;p&gt;Significantly slower, but the pelican is a lot better:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Whimsical flat illustration of a white pelican riding a dark blue bicycle at speed, with motion lines behind it, its long orange beak streaming back in the wind, set against a light blue sky and green grass background." src="https://static.simonwillison.net/static/2026/gpt-5.3-codex-pelican.png" /&gt;&lt;/p&gt;
&lt;p&gt;What's interesting about this model isn't the quality though, it's the &lt;em&gt;speed&lt;/em&gt;. When a model responds this fast you can stay in flow state and iterate with the model much more productively.&lt;/p&gt;
&lt;p&gt;I showed a demo of Cerebras running Llama 3.1 70B at 2,000 tokens/second against Val Town &lt;a href="https://simonwillison.net/2024/Oct/31/cerebras-coder/"&gt;back in October 2024&lt;/a&gt;. OpenAI claim 1,000 tokens/second for their new model, and I expect it will prove to be a ferociously useful partner for hands-on iterative coding sessions.&lt;/p&gt;
&lt;p&gt;It's not yet clear what the pricing will look like for this new model.&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openai"&gt;openai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/cerebras"&gt;cerebras&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/codex"&gt;codex&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-performance"&gt;llm-performance&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="openai"/><category term="generative-ai"/><category term="llms"/><category term="cerebras"/><category term="pelican-riding-a-bicycle"/><category term="llm-release"/><category term="codex"/><category term="llm-performance"/></entry><entry><title>Claude: Speed up responses with fast mode</title><link href="https://simonwillison.net/2026/Feb/7/claude-fast-mode/#atom-tag" rel="alternate"/><published>2026-02-07T23:10:33+00:00</published><updated>2026-02-07T23:10:33+00:00</updated><id>https://simonwillison.net/2026/Feb/7/claude-fast-mode/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://code.claude.com/docs/en/fast-mode"&gt;Claude: Speed up responses with fast mode&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;New "research preview" from Anthropic today: you can now access a faster version of their frontier model Claude Opus 4.6 by typing &lt;code&gt;/fast&lt;/code&gt; in Claude Code... but at a cost that's 6x the normal price.&lt;/p&gt;
&lt;p&gt;Opus is usually $5/million input and $25/million output. The new fast mode is $30/million input and $150/million output!&lt;/p&gt;
&lt;p&gt;There's a 50% discount until the end of February 16th, so only a 3x multiple (!) before then.&lt;/p&gt;
&lt;p&gt;How much faster is it? The linked documentation doesn't say, but &lt;a href="https://x.com/claudeai/status/2020207322124132504"&gt;on Twitter&lt;/a&gt; Claude say:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Our teams have been building with a 2.5x-faster version of Claude Opus 4.6.&lt;/p&gt;
&lt;p&gt;We’re now making it available as an early experiment via Claude Code and our API.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Claude Opus 4.5 had a context limit of 200,000 tokens. 4.6 has an option to increase that to 1,000,000 at 2x the input price ($10/m) and 1.5x the output price ($37.50/m) once your input exceeds 200,000 tokens. These multiples hold for fast mode too, so after Feb 16th you'll be able to pay a hefty $60/m input and $225/m output for Anthropic's fastest best model.&lt;/p&gt;
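&lt;p&gt;The pricing multiples above can be sketched as a small calculation. This is only an illustration built from the prices quoted in this post: the &lt;code&gt;price&lt;/code&gt; helper and its parameters are hypothetical, and modelling the launch discount as a flat 50% reduction on fast mode is an assumption based on the "3x multiple" figure.&lt;/p&gt;

```python
# Base Opus prices quoted in the post, in dollars per million tokens.
BASE_INPUT = 5.00
BASE_OUTPUT = 25.00

FAST_MULTIPLE = 6           # fast mode is 6x the normal price
LONG_INPUT_MULTIPLE = 2     # applies once input exceeds 200,000 tokens
LONG_OUTPUT_MULTIPLE = 1.5


def price(fast=False, long_context=False, discount=0.0):
    """Return (input, output) dollars per million tokens (hypothetical helper)."""
    input_price, output_price = BASE_INPUT, BASE_OUTPUT
    if fast:
        input_price *= FAST_MULTIPLE
        output_price *= FAST_MULTIPLE
    if long_context:
        input_price *= LONG_INPUT_MULTIPLE
        output_price *= LONG_OUTPUT_MULTIPLE
    return input_price * (1 - discount), output_price * (1 - discount)


print(price())                              # (5.0, 25.0)
print(price(fast=True))                     # (30.0, 150.0)
print(price(fast=True, discount=0.5))       # (15.0, 75.0), the 3x launch rate
print(price(long_context=True))             # (10.0, 37.5)
print(price(fast=True, long_context=True))  # (60.0, 225.0)
```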


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/anthropic"&gt;anthropic&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude"&gt;claude&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-pricing"&gt;llm-pricing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude-code"&gt;claude-code&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-performance"&gt;llm-performance&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="anthropic"/><category term="claude"/><category term="llm-pricing"/><category term="claude-code"/><category term="llm-performance"/></entry><entry><title>Introducing SWE-1.5: Our Fast Agent Model</title><link href="https://simonwillison.net/2025/Oct/29/swe-15/#atom-tag" rel="alternate"/><published>2025-10-29T23:59:20+00:00</published><updated>2025-10-29T23:59:20+00:00</updated><id>https://simonwillison.net/2025/Oct/29/swe-15/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://cognition.ai/blog/swe-1-5"&gt;Introducing SWE-1.5: Our Fast Agent Model&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Here's the second fast coding model released by a coding agent IDE on the same day - the first was &lt;a href="https://simonwillison.net/2025/Oct/29/cursor-composer/"&gt;Composer-1 by Cursor&lt;/a&gt;. This time it's Windsurf releasing SWE-1.5:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Today we’re releasing SWE-1.5, the latest in our family of models optimized for software engineering. It is a frontier-size model with hundreds of billions of parameters that achieves near-SOTA coding performance. It also sets a new standard for speed: we partnered with Cerebras to serve it at up to 950 tok/s – 6x faster than Haiku 4.5 and 13x faster than Sonnet 4.5.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Like Composer-1 it's only available via their editor, no separate API yet. Also like Composer-1 they don't appear willing to share details of the "leading open-source base model" they based their new model on.&lt;/p&gt;
&lt;p&gt;I asked it to generate an SVG of a pelican riding a bicycle and got this:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Bicycle has a red upside down Y shaped frame, pelican is a bit dumpy, it does at least have a long sharp beak." src="https://static.simonwillison.net/static/2025/swe-pelican.png" /&gt;&lt;/p&gt;
&lt;p&gt;This one felt &lt;em&gt;really fast&lt;/em&gt;. Partnering with Cerebras for inference is a very smart move.&lt;/p&gt;
&lt;p&gt;They share a lot of details about their training process in the post:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;SWE-1.5 is trained on our state-of-the-art cluster of thousands of GB200 NVL72 chips. We believe SWE-1.5 may be the first public production model trained on the new GB200 generation. [...]&lt;/p&gt;
&lt;p&gt;Our RL rollouts require high-fidelity environments with code execution and even web browsing. To achieve this, we leveraged our VM hypervisor &lt;code&gt;otterlink&lt;/code&gt; that  allows us to scale &lt;strong&gt;Devin&lt;/strong&gt; to tens of thousands of concurrent machines (learn more about &lt;a href="https://cognition.ai/blog/blockdiff#why-incremental-vm-snapshots"&gt;blockdiff&lt;/a&gt;). This enabled us to smoothly support very high concurrency and ensure the training environment is aligned with our Devin production environments.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;That's &lt;em&gt;another&lt;/em&gt; similarity to Cursor's Composer-1! Cursor talked about how they ran "hundreds of thousands of concurrent sandboxed coding environments in the cloud" in &lt;a href="https://cursor.com/blog/composer"&gt;their description of their RL training&lt;/a&gt; as well.&lt;/p&gt;
&lt;p&gt;This is a notable trend: if you want to build a really great agentic coding tool there's clearly a lot to be said for using reinforcement learning to fine-tune a model against your own custom set of tools using large numbers of sandboxed simulated coding environments as part of that process.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Update&lt;/strong&gt;: &lt;a href="https://x.com/zai_org/status/1984076614951420273"&gt;I think it's built on GLM&lt;/a&gt;.&lt;/p&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://x.com/cognition/status/1983662838955831372"&gt;@cognition&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-programming"&gt;ai-assisted-programming&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/coding-agents"&gt;coding-agents&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/glm"&gt;glm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-performance"&gt;llm-performance&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="ai-assisted-programming"/><category term="pelican-riding-a-bicycle"/><category term="llm-release"/><category term="coding-agents"/><category term="glm"/><category term="llm-performance"/></entry><entry><title>Composer: Building a fast frontier model with RL</title><link href="https://simonwillison.net/2025/Oct/29/cursor-composer/#atom-tag" rel="alternate"/><published>2025-10-29T20:45:53+00:00</published><updated>2025-10-29T20:45:53+00:00</updated><id>https://simonwillison.net/2025/Oct/29/cursor-composer/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://cursor.com/blog/composer"&gt;Composer: Building a fast frontier model with RL&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Cursor released &lt;a href="https://cursor.com/blog/2-0"&gt;Cursor 2.0 today&lt;/a&gt;, with a refreshed UI focused on agentic coding (and running agents in parallel) and a new model that's unique to Cursor called &lt;strong&gt;Composer&amp;nbsp;1&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;As far as I can tell there's no way to call the model directly via an API, so I fired up "Ask" mode in Cursor's chat side panel and asked it to "Generate an SVG of a pelican riding a bicycle":&lt;/p&gt;
&lt;p&gt;&lt;img alt="Screenshot of Cursor 2 - In the chat panel I have asked the question and it spat out a bunch of SVG." src="https://static.simonwillison.net/static/2025/cursor-2.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;Here's &lt;a href="https://gist.github.com/simonw/e5c9176f153ca718370055ecd256fe70"&gt;the result&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img alt="The bicycle is levitating against a blue sky. The pelican looks a little bit more like a baby chicken but does at least have a long beak." src="https://static.simonwillison.net/static/2025/cursor-1-pelican.png" /&gt;&lt;/p&gt;
&lt;p&gt;The notable thing about Composer-1 is that it is designed to be &lt;em&gt;fast&lt;/em&gt;. The pelican certainly came back quickly, and in their announcement they describe it as being "4x faster than similarly intelligent models".&lt;/p&gt;
&lt;p&gt;It's interesting to see Cursor investing resources in training their own code-specific model - similar to &lt;a href="https://openai.com/index/introducing-upgrades-to-codex/"&gt;GPT-5-Codex&lt;/a&gt; or &lt;a href="https://github.com/QwenLM/Qwen3-Coder"&gt;Qwen3-Coder&lt;/a&gt;. From their post:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Composer is a mixture-of-experts (MoE) language model supporting long-context generation and understanding. It is specialized for software engineering through reinforcement learning (RL) in a diverse range of development environments. [...]&lt;/p&gt;
&lt;p&gt;Efficient training of large MoE models requires significant investment into building infrastructure and systems research. We built custom training infrastructure leveraging PyTorch and Ray to power asynchronous reinforcement learning at scale. We natively train our models at low precision by combining our &lt;a href="https://cursor.com/blog/kernels"&gt;MXFP8 MoE kernels&lt;/a&gt; with expert parallelism and hybrid sharded data parallelism, allowing us to scale training to thousands of NVIDIA GPUs with minimal communication cost. [...]&lt;/p&gt;
&lt;p&gt;During RL, we want our model to be able to call any tool in the Cursor Agent harness. These tools allow editing code, using semantic search, grepping strings, and running terminal commands. At our scale, teaching the model to effectively call these tools requires running hundreds of thousands of concurrent sandboxed coding environments in the cloud.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;One detail that's notably absent from their description: did they train the model from scratch, or did they start with an existing open-weights model such as something from Qwen or GLM?&lt;/p&gt;
&lt;p&gt;Cursor researcher Sasha Rush has been answering questions &lt;a href="https://news.ycombinator.com/item?id=45748725"&gt;on Hacker News&lt;/a&gt;, but has so far been evasive about the base model. When directly asked "is Composer a fine tune of an existing open source base model?" they replied:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Our primary focus is on RL post-training. We think that is the best way to get the model to be a strong interactive agent.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Sasha &lt;a href="https://news.ycombinator.com/item?id=45748725#45750784"&gt;did confirm&lt;/a&gt; that rumors of an earlier Cursor preview model, Cheetah, being based on one of xAI's Grok models were "Straight up untrue."&lt;/p&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=45748725"&gt;Hacker News&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-programming"&gt;ai-assisted-programming&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/coding-agents"&gt;coding-agents&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/cursor"&gt;cursor&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/parallel-agents"&gt;parallel-agents&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-performance"&gt;llm-performance&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="ai-assisted-programming"/><category term="pelican-riding-a-bicycle"/><category term="llm-release"/><category term="coding-agents"/><category term="cursor"/><category term="parallel-agents"/><category term="llm-performance"/></entry><entry><title>Open weight LLMs exhibit inconsistent performance across providers</title><link href="https://simonwillison.net/2025/Aug/15/inconsistent-performance/#atom-tag" rel="alternate"/><published>2025-08-15T16:29:34+00:00</published><updated>2025-08-15T16:29:34+00:00</updated><id>https://simonwillison.net/2025/Aug/15/inconsistent-performance/#atom-tag</id><summary type="html">
    &lt;p&gt;Artificial Analysis published &lt;a href="https://artificialanalysis.ai/models/gpt-oss-120b/providers#aime25x32-performance-gpt-oss-120b"&gt;a new benchmark&lt;/a&gt; the other day, this time focusing on how an individual model - OpenAI’s gpt-oss-120b - performs across different hosted providers.&lt;/p&gt;
&lt;p&gt;The results showed some surprising differences. Here's the one with the greatest variance, a run of the 2025 AIME (American Invitational Mathematics Examination) averaging 32 runs against each provider, using gpt-oss-120b with a reasoning effort of "high":&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/aim25x32-gpt-oss-120b.jpg" alt="Performance benchmark chart showing AIME25x32 Performance for gpt-oss-120B model across different AI frameworks. Chart displays box plots with percentile ranges (Min, 25th, Median, 75th, Max) for each framework. Title: &amp;quot;AIME25x32 Performance: gpt-oss-120B&amp;quot; with subtitle &amp;quot;AIME 2025 N=32 Runs: Minimum, 25th Percentile, Median, 75th Percentile, Maximum (Higher is Better)&amp;quot;. Legend indicates &amp;quot;Median; other points represent Min, 25th, 75th percentiles and Max respectively&amp;quot;. Y-axis ranges from 0 to 1.2. Frameworks shown from left to right: Cerebras (93.3%), Nebius Base (93.3%), Fireworks (93.3%), Deepinfra (93.3%), Novita (93.3%), Together.ai (93.3%), Parasail (90.0%), Groq (86.7%), Amazon (83.3%), Azure (80.0%), CompactifAI (36.7%). Watermark shows &amp;quot;Artificial Analysis&amp;quot; logo." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;These are some varied results!&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;93.3%: Cerebras, Nebius Base, Fireworks, Deepinfra, Novita, Together.ai, vLLM 0.1.0&lt;/li&gt;
&lt;li&gt;90.0%: Parasail&lt;/li&gt;
&lt;li&gt;86.7%: Groq&lt;/li&gt;
&lt;li&gt;83.3%: Amazon&lt;/li&gt;
&lt;li&gt;80.0%: Azure&lt;/li&gt;
&lt;li&gt;36.7%: CompactifAI&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;It looks like most of the providers that scored 93.3% were running models using the latest &lt;a href="https://github.com/vllm-project/vllm"&gt;vLLM&lt;/a&gt; (with the exception of Cerebras who I believe have their own custom serving stack).&lt;/p&gt;
&lt;p&gt;I hadn't heard of CompactifAI before - I found &lt;a href="https://www.hpcwire.com/off-the-wire/multiverse-computing-closes-e189m-series-b-to-scale-compactifai-deployment/"&gt;this June 12th 2025 press release&lt;/a&gt; which says that "CompactifAI models are highly-compressed versions of leading open source LLMs that retain original accuracy, are 4x-12x faster and yield a 50%-80% reduction in inference costs" which helps explain their notably lower score!&lt;/p&gt;
&lt;p&gt;Microsoft Azure's Lucas Pickup &lt;a href="https://x.com/lupickup/status/1955620918086226223"&gt;confirmed&lt;/a&gt; that Azure's 80% score was caused by running an older vLLM, now fixed:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;This is exactly it, it’s been fixed as of yesterday afternoon across all serving instances (of the hosted 120b service). Old vLLM commits that didn’t respect reasoning_effort, so all requests defaulted to medium.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;No news yet on what went wrong with the AWS Bedrock version.&lt;/p&gt;
&lt;h4 id="the-challenge-for-customers-of-open-weight-models"&gt;The challenge for customers of open weight models&lt;/h4&gt;
&lt;p&gt;As a customer of open weight model providers, this really isn't something I wanted to have to think about!&lt;/p&gt;
&lt;p&gt;It's not really a surprise though. When running models myself I inevitably have to make choices - about which serving framework to use (I'm usually picking between GGUF/llama.cpp and MLX on my own Mac laptop) and the quantization size to use.&lt;/p&gt;
&lt;p&gt;I know that quantization has an impact, but it's difficult for me to quantify that effect.&lt;/p&gt;
&lt;p&gt;It looks like with hosted models even knowing the quantization they are using isn't necessarily enough information to be able to predict that model's performance.&lt;/p&gt;
&lt;p&gt;I see this situation as a general challenge for open weight models. They tend to be released as an opaque set of model weights plus loose instructions for running them on a single platform - if we are lucky! Most AI labs leave quantization and format conversions to the community and third-party providers.&lt;/p&gt;
&lt;p&gt;There's a lot that can go wrong. Tool calling is particularly vulnerable to these differences - models have been trained on specific tool-calling conventions, and if a provider doesn't get these exactly right the results can be unpredictable but difficult to diagnose.&lt;/p&gt;
&lt;p&gt;What would help &lt;em&gt;enormously&lt;/em&gt; here would be some kind of conformance suite. If models were reliably deterministic this would be easy: publish a set of test cases and let providers (or their customers) run those to check the model's implementation.&lt;/p&gt;
&lt;p&gt;Models aren't deterministic though, even at a temperature of 0. Maybe this new effort from Artificial Analysis is exactly what we need here, especially since running a full benchmark suite against a provider can be quite expensive in terms of token spend.&lt;/p&gt;
&lt;p id="update"&gt;&lt;strong&gt;Update&lt;/strong&gt;: &lt;a href="https://x.com/DKundel/status/1956395988836368587"&gt;Via OpenAI's Dominik Kundel&lt;/a&gt; I learned that OpenAI now include a &lt;a href="https://github.com/openai/gpt-oss/tree/main/compatibility-test"&gt;compatibility test&lt;/a&gt; in the gpt-oss GitHub repository to help providers verify that they have implemented things like tool calling templates correctly, described in more detail in their &lt;a href="https://cookbook.openai.com/articles/gpt-oss/verifying-implementations"&gt;Verifying gpt-oss implementations&lt;/a&gt; cookbook.&lt;/p&gt;

&lt;p&gt;Here's &lt;a href="https://til.simonwillison.net/llms/gpt-oss-evals"&gt;my TIL&lt;/a&gt; on running part of that eval suite.&lt;/p&gt;

&lt;h4 id="update-aug-20"&gt;Update: August 20th 2025&lt;/h4&gt;

&lt;p&gt;Since I first wrote this article Artificial Analysis have updated the benchmark results to reflect fixes that vendors have made since their initial run. Here's what it looks like today:&lt;/p&gt;

&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/gpt-oss-eval-updated.jpg" alt="Performance benchmark chart showing AIME25x32 Performance for gpt-oss-120B model across different AI frameworks. Chart displays box plots with percentile ranges for each framework. Title: &amp;quot;AIME25x32 Performance: gpt-oss-120B&amp;quot; with subtitle &amp;quot;AIME 2025 N=32 Runs: Minimum, 25th Percentile, Median, 75th Percentile, Maximum (Higher is Better)&amp;quot;. Legend indicates &amp;quot;Median; other points represent Min, 25th, 75th percentiles and Max respectively&amp;quot;. Y-axis ranges from 0 to 1.2. Frameworks shown from left to right: Cerebras (93.3%), Nebius Base (93.3%), Azure (93.3%), Fireworks (93.3%), Deepinfra (93.3%), Novita (93.3%), Groq (93.3%), Together.ai (93.3%), Parasail (90.0%), Google Vertex (83.3%), Amazon (80.0%). Watermark shows &amp;quot;Artificial Analysis&amp;quot; logo." style="max-width: 100%" /&gt;&lt;/p&gt;
&lt;p&gt;Groq and Azure have both improved their scores to 93.3%. Google Vertex is new to the chart at 83.3%.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openai"&gt;openai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/evals"&gt;evals&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gpt-oss"&gt;gpt-oss&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/artificial-analysis"&gt;artificial-analysis&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-performance"&gt;llm-performance&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="ai"/><category term="openai"/><category term="generative-ai"/><category term="local-llms"/><category term="llms"/><category term="evals"/><category term="gpt-oss"/><category term="artificial-analysis"/><category term="llm-performance"/></entry><entry><title>Faster inference</title><link href="https://simonwillison.net/2025/Aug/1/faster-inference/#atom-tag" rel="alternate"/><published>2025-08-01T23:28:26+00:00</published><updated>2025-08-01T23:28:26+00:00</updated><id>https://simonwillison.net/2025/Aug/1/faster-inference/#atom-tag</id><summary type="html">
    &lt;p&gt;Two interesting examples of inference speed as a flagship feature of LLM services today.&lt;/p&gt;
&lt;p&gt;First, Cerebras &lt;a href="https://www.cerebras.ai/blog/introducing-cerebras-code"&gt;announced two new monthly plans&lt;/a&gt; for their extremely high speed hosted model service: Cerebras Code Pro ($50/month, 1,000 messages a day) and Cerebras Code Max ($200/month, 5,000/day). The model they are selling here is Qwen's Qwen3-Coder-480B-A35B-Instruct, likely the best available open weights coding model right now and one that was released &lt;a href="https://simonwillison.net/2025/Jul/22/qwen3-coder/"&gt;just ten days ago&lt;/a&gt;. Ten days from model release to third-party subscription service feels like some kind of record.&lt;/p&gt;
&lt;p&gt;Cerebras claim they can serve the model at an astonishing 2,000 tokens per second - four times the speed of Claude Sonnet 4 in &lt;a href="https://x.com/cerebrassystems/status/1951340566077440464"&gt;their demo video&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Also today, Moonshot &lt;a href="https://x.com/kimi_moonshot/status/1951168907131355598"&gt;announced&lt;/a&gt; a new hosted version of their trillion parameter Kimi K2 model called &lt;code&gt;kimi-k2-turbo-preview&lt;/code&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;🆕 Say hello to kimi-k2-turbo-preview
Same model. Same context. NOW 4× FASTER.&lt;/p&gt;
&lt;p&gt;⚡️ From 10 tok/s to 40 tok/s.&lt;/p&gt;
&lt;p&gt;💰 Limited-Time Launch Price (50% off until Sept 1)&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;$0.30 / million input tokens (cache hit)&lt;/li&gt;
&lt;li&gt;$1.20 / million input tokens (cache miss)&lt;/li&gt;
&lt;li&gt;$5.00 / million output tokens&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;👉 Explore more: &lt;a href="https://platform.moonshot.ai"&gt;platform.moonshot.ai&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This is twice the price of their regular model for 4x the speed (increasing to 4x the price in September). No details yet on how they achieved the speed-up.&lt;/p&gt;
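&lt;p&gt;To make that arithmetic concrete, here is a rough Python sketch. The launch prices are the ones quoted in the announcement; the regular-model prices are inferred from the "twice the price" comparison rather than read from a price list, and the &lt;code&gt;cost()&lt;/code&gt; and &lt;code&gt;seconds()&lt;/code&gt; helpers are hypothetical names used only for illustration.&lt;/p&gt;

```python
# Launch prices in USD per million tokens, as quoted in the announcement.
LAUNCH = {"input_cache_hit": 0.30, "input_cache_miss": 1.20, "output": 5.00}

# The launch price is 50% off, so the list price after Sept 1 is double.
FULL = {name: price * 2 for name, price in LAUNCH.items()}

# Assumption: "twice the price of their regular model" means the regular
# model costs half the launch price. This is inferred, not a quoted figure.
REGULAR = {name: price / 2 for name, price in LAUNCH.items()}


def cost(prices, input_tokens, output_tokens):
    """USD cost of one request, assuming every input token is a cache miss."""
    total = input_tokens * prices["input_cache_miss"] + output_tokens * prices["output"]
    return total / 1_000_000


def seconds(output_tokens, tokens_per_second):
    """Time to stream the output at a steady generation rate."""
    return output_tokens / tokens_per_second


if __name__ == "__main__":
    # Hypothetical request: 10,000 uncached input tokens, 2,000 output tokens.
    for label, prices in [("regular", REGULAR), ("turbo launch", LAUNCH), ("turbo full", FULL)]:
        print(label, round(cost(prices, 10_000, 2_000), 4))
    print("regular seconds:", seconds(2_000, 10))
    print("turbo seconds:", seconds(2_000, 40))
```

&lt;p&gt;Under those assumptions a request with 10,000 uncached input tokens and 2,000 output tokens costs about $0.011 on the regular model, $0.022 at the turbo launch price and $0.044 at the full turbo price, while the output takes 50 seconds to stream instead of 200.&lt;/p&gt;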
&lt;p&gt;I am interested to see how much market demand there is for faster performance like this. I've &lt;a href="https://simonwillison.net/2024/Oct/31/cerebras-coder/"&gt;experimented with Cerebras in the past&lt;/a&gt; and found that the speed really does make iterating on code with live previews feel a whole lot more interactive.&lt;/p&gt;

    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/qwen"&gt;qwen&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/cerebras"&gt;cerebras&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-pricing"&gt;llm-pricing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-in-china"&gt;ai-in-china&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/moonshot"&gt;moonshot&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/kimi"&gt;kimi&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-performance"&gt;llm-performance&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="qwen"/><category term="cerebras"/><category term="llm-pricing"/><category term="ai-in-china"/><category term="moonshot"/><category term="kimi"/><category term="llm-performance"/></entry><entry><title>Gemini Diffusion</title><link href="https://simonwillison.net/2025/May/21/gemini-diffusion/#atom-tag" rel="alternate"/><published>2025-05-21T21:44:02+00:00</published><updated>2025-05-21T21:44:02+00:00</updated><id>https://simonwillison.net/2025/May/21/gemini-diffusion/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://deepmind.google/models/gemini-diffusion/"&gt;Gemini Diffusion&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Another of the announcements from Google I/O yesterday was Gemini Diffusion, Google's first LLM to use diffusion (similar to image models like Imagen and Stable Diffusion) in place of transformers.&lt;/p&gt;
&lt;p&gt;Google describe it like this:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Traditional autoregressive language models generate text one word – or token – at a time. This sequential process can be slow, and limit the quality and coherence of the output.&lt;/p&gt;
&lt;p&gt;Diffusion models work differently. Instead of predicting text directly, they learn to generate outputs by refining noise, step-by-step. This means they can iterate on a solution very quickly and error correct during the generation process. This helps them excel at tasks like editing, including in the context of math and code.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The key feature then is &lt;em&gt;speed&lt;/em&gt;. I made it through the waitlist and tried it out just now and &lt;em&gt;wow&lt;/em&gt;, they are not kidding about it being fast.&lt;/p&gt;
&lt;p&gt;In this video I prompt it with "Build a simulated chat app" and it responds at 857 tokens/second, resulting in an interactive HTML+JavaScript page (embedded in the chat tool, Claude Artifacts style) within single digit seconds.&lt;/p&gt;
&lt;div style="max-width: 100%;"&gt;
    &lt;video 
        controls 
        preload="none"
        aria-label="In the video I prompt Gemini Diffusion to create me an example chat app and it responds at 857 tokens a second, giving me a working app I can iterate on in just a few seconds."
        poster="https://static.simonwillison.net/static/2025/gemini-diffusion.jpg"
        style="width: 100%; height: auto;"&gt;
        &lt;source src="https://static.simonwillison.net/static/2025/gemini-diffusion.mp4" type="video/mp4"&gt;
    &lt;/video&gt;
&lt;/div&gt;

&lt;p&gt;The performance feels similar to &lt;a href="https://simonwillison.net/2024/Oct/31/cerebras-coder/"&gt;the Cerebras Coder tool&lt;/a&gt;, which used Cerebras to run Llama3.1-70b at around 2,000 tokens/second.&lt;/p&gt;
&lt;p&gt;How good is the model? I've not seen any independent benchmarks yet, but Google's landing page for it promises "the performance of Gemini 2.0 Flash-Lite at 5x the speed" so presumably they think it's comparable to Gemini 2.0 Flash-Lite, one of their least expensive models.&lt;/p&gt;
&lt;p&gt;Prior to this, the only commercial-grade diffusion model I had encountered was &lt;a href="https://www.inceptionlabs.ai/introducing-mercury"&gt;Inception Mercury&lt;/a&gt;, back in February this year.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Update&lt;/strong&gt;: a correction from &lt;a href="https://news.ycombinator.com/item?id=44057820#44057939"&gt;synapsomorphy on Hacker News&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Diffusion isn't in place of transformers, it's in place of autoregression. Prior diffusion LLMs like &lt;a href="https://www.inceptionlabs.ai/introducing-mercury"&gt;Mercury&lt;/a&gt; still use a transformer, but there's no causal masking, so the entire input is processed all at once and the output generation is obviously different. I very strongly suspect this is also using a transformer.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;nvtop &lt;a href="https://news.ycombinator.com/context?id=44059646"&gt;provided this explanation&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Despite the name, diffusion LMs have little to do with image diffusion and are much closer to BERT and old good masked language modeling. Recall how BERT is trained:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Take a full sentence ("the cat sat on the mat")&lt;/li&gt;
&lt;li&gt;Replace 15% of tokens with a [MASK] token ("the cat [MASK] on [MASK] mat")&lt;/li&gt;
&lt;li&gt;Make the Transformer predict tokens at masked positions. It does it in parallel, via a single inference step.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Now, diffusion LMs take this idea further. BERT can recover 15% of masked tokens ("noise"), but why stop here. Let's train a model to recover texts with 30%, 50%, 90%, 100% of masked tokens.&lt;/p&gt;
&lt;p&gt;Once you've trained that, in order to generate something from scratch, you start by feeding the model all [MASK]s. It will generate you mostly gibberish, but you can take some tokens (let's say, 10%) at random positions and assume that these tokens are generated ("final"). Next, you run another iteration of inference, this time input having 90% of masks and 10% of "final" tokens. Again, you mark 10% of new tokens as final. Continue, and in 10 steps you'll have generated a whole sequence. This is a core idea behind diffusion language models. [...]&lt;/p&gt;
&lt;/blockquote&gt;
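&lt;p&gt;As a rough sketch of the generation loop nvtop describes (this is not Google's implementation, and &lt;code&gt;toy_model&lt;/code&gt; is a hypothetical stand-in that just guesses random words where a trained model would go), the iterative unmasking could look like this in Python:&lt;/p&gt;

```python
import math
import random

MASK = "[MASK]"
VOCAB = ["the", "cat", "sat", "on", "mat"]


def toy_model(tokens):
    """Hypothetical stand-in for a trained denoiser.

    A real model would predict every position in parallel, conditioned on
    the tokens that are already final. This one just guesses random words.
    """
    return [random.choice(VOCAB) for _ in tokens]


def generate(length=10, steps=10, seed=0):
    """Start from all masks and finalize a fraction of positions per step."""
    random.seed(seed)
    tokens = [MASK] * length
    per_step = math.ceil(length / steps)  # 10% per step when steps == 10
    history = []
    for _ in range(steps):
        masked = [i for i, token in enumerate(tokens) if token == MASK]
        if not masked:
            break
        proposal = toy_model(tokens)  # one inference pass over the whole sequence
        # Pick some still-masked positions at random and mark them as final.
        for i in random.sample(masked, min(per_step, len(masked))):
            tokens[i] = proposal[i]
        history.append(list(tokens))
    return tokens, history


if __name__ == "__main__":
    final, history = generate()
    for step in history:
        print(" ".join(step))
```

&lt;p&gt;Each pass costs one call to the model regardless of how many tokens are finalized, which is where the speed advantage over one-token-at-a-time generation comes from. Real systems would presumably choose which positions to finalize based on model confidence rather than at random.&lt;/p&gt;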


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/google"&gt;google&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/google-io"&gt;google-io&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gemini"&gt;gemini&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-performance"&gt;llm-performance&lt;/a&gt;&lt;/p&gt;



</summary><category term="google"/><category term="google-io"/><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="gemini"/><category term="llm-release"/><category term="llm-performance"/></entry><entry><title>Cerebras brings instant inference to Mistral Le Chat</title><link href="https://simonwillison.net/2025/Feb/10/cerebras-mistral/#atom-tag" rel="alternate"/><published>2025-02-10T03:50:18+00:00</published><updated>2025-02-10T03:50:18+00:00</updated><id>https://simonwillison.net/2025/Feb/10/cerebras-mistral/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://cerebras.ai/blog/mistral-le-chat"&gt;Cerebras brings instant inference to Mistral Le Chat&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Mistral &lt;a href="https://mistral.ai/en/news/all-new-le-chat"&gt;announced a major upgrade&lt;/a&gt; to their &lt;a href="https://chat.mistral.ai/chat"&gt;Le Chat&lt;/a&gt; web UI (their version of ChatGPT) a few days ago, and one of the signature features was performance.&lt;/p&gt;
&lt;p&gt;It turns out that performance boost comes from hosting their model on Cerebras:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;We are excited to bring our technology to Mistral – specifically the flagship 123B parameter Mistral Large 2 model. Using our Wafer Scale Engine technology, we achieve over 1,100 tokens per second on text queries.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Given Cerebras's so far unrivaled inference performance, I'm surprised that no other AI lab has formed a partnership like this already.&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/mistral"&gt;mistral&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/cerebras"&gt;cerebras&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-performance"&gt;llm-performance&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="mistral"/><category term="cerebras"/><category term="llm-performance"/></entry><entry><title>Cerebras Coder</title><link href="https://simonwillison.net/2024/Oct/31/cerebras-coder/#atom-tag" rel="alternate"/><published>2024-10-31T22:39:15+00:00</published><updated>2024-10-31T22:39:15+00:00</updated><id>https://simonwillison.net/2024/Oct/31/cerebras-coder/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.val.town/v/stevekrouse/cerebras_coder"&gt;Cerebras Coder&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Val Town founder Steve Krouse has been building demos on top of the Cerebras API that runs Llama3.1-70b at 2,000 tokens/second.&lt;/p&gt;
&lt;p&gt;Having a capable LLM with that kind of performance turns out to be really interesting. Cerebras Coder is a demo that implements Claude Artifact-style on-demand JavaScript apps, and having it run at that speed means changes you request are visible within less than a second:&lt;/p&gt;
&lt;div style="max-width: 100%;"&gt;
    &lt;video 
        controls 
        preload="none"
        poster="https://static.simonwillison.net/static/2024/cascade-emoji.jpeg"
        style="width: 100%; height: auto;"&gt;
        &lt;source src="https://static.simonwillison.net/static/2024/cascade-emoji.mp4" type="video/mp4"&gt;
    &lt;/video&gt;
&lt;/div&gt;

&lt;p&gt;Steve's implementation (created with the help of &lt;a href="https://www.val.town/townie"&gt;Townie&lt;/a&gt;, the Val Town code assistant) demonstrates the simplest possible version of an iframe sandbox:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;lt;iframe
    srcDoc={code}
    sandbox="allow-scripts allow-modals allow-forms allow-popups allow-same-origin allow-top-navigation allow-downloads allow-presentation allow-pointer-lock"
/&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Where &lt;code&gt;code&lt;/code&gt; is populated by a &lt;code&gt;setCode(...)&lt;/code&gt; call inside a React component.&lt;/p&gt;
&lt;p&gt;The most interesting applications of LLMs continue to be where they operate in a tight loop with a human - inference this fast can make those review loops potentially much faster and more productive.&lt;/p&gt;

    &lt;p&gt;&lt;small&gt;&lt;/small&gt;Via &lt;a href="https://twitter.com/stevekrouse/status/1851995718514327848"&gt;@stevekrouse&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/iframes"&gt;iframes&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sandboxing"&gt;sandboxing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/react"&gt;react&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llama"&gt;llama&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-programming"&gt;ai-assisted-programming&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/val-town"&gt;val-town&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/steve-krouse"&gt;steve-krouse&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/cerebras"&gt;cerebras&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-performance"&gt;llm-performance&lt;/a&gt;&lt;/p&gt;



</summary><category term="iframes"/><category term="sandboxing"/><category term="ai"/><category term="react"/><category term="generative-ai"/><category term="llama"/><category term="llms"/><category term="ai-assisted-programming"/><category term="val-town"/><category term="steve-krouse"/><category term="cerebras"/><category term="llm-performance"/></entry><entry><title>llm-cerebras</title><link href="https://simonwillison.net/2024/Oct/25/llm-cerebras/#atom-tag" rel="alternate"/><published>2024-10-25T05:50:47+00:00</published><updated>2024-10-25T05:50:47+00:00</updated><id>https://simonwillison.net/2024/Oct/25/llm-cerebras/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/irthomasthomas/llm-cerebras"&gt;llm-cerebras&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;a href="https://cerebras.ai/"&gt;Cerebras&lt;/a&gt; (&lt;a href="https://simonwillison.net/2024/Aug/28/cerebras-inference/"&gt;previously&lt;/a&gt;) provides Llama LLMs hosted on custom hardware at ferociously high speeds.&lt;/p&gt;
&lt;p&gt;GitHub user &lt;a href="https://github.com/irthomasthomas"&gt;irthomasthomas&lt;/a&gt; built an &lt;a href="https://llm.datasette.io/"&gt;LLM&lt;/a&gt; plugin that works against &lt;a href="https://cloud.cerebras.ai/"&gt;their API&lt;/a&gt; - which is currently free, albeit with a rate limit of 30 requests per minute for their two models.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm install llm-cerebras
llm keys set cerebras
# paste key here
llm -m cerebras-llama3.1-70b 'an epic tail of a walrus pirate'
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here's &lt;a href="https://static.simonwillison.net/static/2024/cerebras-is-fast.mp4"&gt;a video&lt;/a&gt; showing the speed of that prompt:&lt;/p&gt;
&lt;div style="max-width: 100%;"&gt;
    &lt;video 
        controls 
        preload="none"
        poster="https://static.simonwillison.net/static/2024/cerebras-poster.jpg"
        style="width: 100%; height: auto;"&gt;
        &lt;source src="https://static.simonwillison.net/static/2024/cerebras-is-fast.mp4" type="video/mp4"&gt;
    &lt;/video&gt;
&lt;/div&gt;

&lt;p&gt;The other model is &lt;code&gt;cerebras-llama3.1-8b&lt;/code&gt;.&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/cerebras"&gt;cerebras&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-performance"&gt;llm-performance&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="llm"/><category term="cerebras"/><category term="llm-performance"/></entry><entry><title>Cerebras Inference: AI at Instant Speed</title><link href="https://simonwillison.net/2024/Aug/28/cerebras-inference/#atom-tag" rel="alternate"/><published>2024-08-28T04:14:00+00:00</published><updated>2024-08-28T04:14:00+00:00</updated><id>https://simonwillison.net/2024/Aug/28/cerebras-inference/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://cerebras.ai/blog/introducing-cerebras-inference-ai-at-instant-speed"&gt;Cerebras Inference: AI at Instant Speed&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;New hosted API for Llama running at absurdly high speeds: "1,800 tokens per second for Llama3.1 8B and 450 tokens per second for Llama3.1 70B".&lt;/p&gt;
&lt;p&gt;How are they running so fast? Custom hardware. Their &lt;a href="https://cerebras.ai/product-chip/"&gt;WSE-3&lt;/a&gt; is 57x &lt;em&gt;physically larger&lt;/em&gt; than an NVIDIA H100, and has 4 trillion transistors, 900,000 cores and 44GB of memory all on one enormous chip.&lt;/p&gt;
&lt;p&gt;Their &lt;a href="https://inference.cerebras.ai/"&gt;live chat demo&lt;/a&gt; just returned me a response at 1,833 tokens/second. Their API currently has a waitlist.

    &lt;p&gt;&lt;small&gt;&lt;/small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=41369705"&gt;Hacker News&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/performance"&gt;performance&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llama"&gt;llama&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/cerebras"&gt;cerebras&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-performance"&gt;llm-performance&lt;/a&gt;&lt;/p&gt;



</summary><category term="performance"/><category term="ai"/><category term="generative-ai"/><category term="llama"/><category term="llms"/><category term="cerebras"/><category term="llm-performance"/></entry><entry><title>Introducing Llama-3-Groq-Tool-Use Models</title><link href="https://simonwillison.net/2024/Jul/17/llama-3-groq-tool-use-models/#atom-tag" rel="alternate"/><published>2024-07-17T20:32:50+00:00</published><updated>2024-07-17T20:32:50+00:00</updated><id>https://simonwillison.net/2024/Jul/17/llama-3-groq-tool-use-models/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://wow.groq.com/introducing-llama-3-groq-tool-use-models/"&gt;Introducing Llama-3-Groq-Tool-Use Models&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
New from &lt;a href="https://groq.com/"&gt;Groq&lt;/a&gt;: two custom fine-tuned Llama 3 models specifically designed for tool use. Hugging Face model links:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://huggingface.co/Groq/Llama-3-Groq-8B-Tool-Use"&gt;Groq/Llama-3-Groq-8B-Tool-Use&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://huggingface.co/Groq/Llama-3-Groq-70B-Tool-Use"&gt;Groq/Llama-3-Groq-70B-Tool-Use&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Groq's own internal benchmarks put their 70B model at the top of the &lt;a href="https://gorilla.cs.berkeley.edu/leaderboard.html"&gt;Berkeley Function-Calling Leaderboard&lt;/a&gt; with a score of 90.76 (and 89.06 for their 8B model, which would put it at #3). For comparison, Claude 3.5 Sonnet scores 90.18 and GPT-4-0124 scores 88.29.&lt;/p&gt;
&lt;p&gt;The two new Groq models are also available through their screamingly-fast (fastest in the business?) API, running at 330 tokens/s for the 70B model and 1,050 tokens/s for the 8B model.&lt;/p&gt;
&lt;p&gt;Here's the documentation on &lt;a href="https://console.groq.com/docs/tool-use"&gt;how to use tools through their API&lt;/a&gt;.&lt;/p&gt;

    &lt;p&gt;&lt;small&gt;&lt;/small&gt;Via &lt;a href="https://twitter.com/RickLamers/status/1813341037198204962"&gt;Rick Lamers&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/groq"&gt;groq&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-tool-use"&gt;llm-tool-use&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-performance"&gt;llm-performance&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="groq"/><category term="llm-tool-use"/><category term="llm-performance"/></entry><entry><title>Fast groq-hosted LLMs vs browser jank</title><link href="https://simonwillison.net/2024/May/19/fast-groq-hosted-llms-vs-browser-jank/#atom-tag" rel="alternate"/><published>2024-05-19T13:35:47+00:00</published><updated>2024-05-19T13:35:47+00:00</updated><id>https://simonwillison.net/2024/May/19/fast-groq-hosted-llms-vs-browser-jank/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://taras.glek.net/post/groq-vs-html-reflows/"&gt;Fast groq-hosted LLMs vs browser jank&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;a href="https://groq.com/"&gt;Groq&lt;/a&gt; is now serving LLMs such as Llama 3 so quickly that JavaScript which attempts to render Markdown strings on every new token can cause performance issues in browsers.&lt;/p&gt;
&lt;p&gt;Taras Glek's &lt;a href="https://github.com/tarasglek/chatcraft.org/pull/640/files"&gt;solution&lt;/a&gt; was to move the rendering to a &lt;code&gt;requestAnimationFrame()&lt;/code&gt; callback, effectively buffering the rendering to the fastest rate the browser can support.

    &lt;p&gt;&lt;small&gt;&lt;/small&gt;Via &lt;a href="https://lobste.rs/s/5i2axx/fast_groq_hosted_llms_vs_browser_jank"&gt;lobste.rs&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/javascript"&gt;javascript&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/groq"&gt;groq&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-performance"&gt;llm-performance&lt;/a&gt;&lt;/p&gt;



</summary><category term="javascript"/><category term="llms"/><category term="groq"/><category term="llm-performance"/></entry></feed>