<?xml version="1.0" encoding="utf-8"?>
<feed xml:lang="en-us" xmlns="http://www.w3.org/2005/Atom"><title>Simon Willison's Weblog: openrouter</title><link href="http://simonwillison.net/" rel="alternate"/><link href="http://simonwillison.net/tags/openrouter.atom" rel="self"/><id>http://simonwillison.net/</id><updated>2026-02-17T04:30:57+00:00</updated><author><name>Simon Willison</name></author><entry><title>Qwen3.5: Towards Native Multimodal Agents</title><link href="https://simonwillison.net/2026/Feb/17/qwen35/#atom-tag" rel="alternate"/><published>2026-02-17T04:30:57+00:00</published><updated>2026-02-17T04:30:57+00:00</updated><id>https://simonwillison.net/2026/Feb/17/qwen35/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://qwen.ai/blog?id=qwen3.5"&gt;Qwen3.5: Towards Native Multimodal Agents&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Alibaba's Qwen just released the first two models in the Qwen 3.5 series - one open weights, one proprietary. Both are multi-modal for vision input.&lt;/p&gt;
&lt;p&gt;The open weight one is a Mixture of Experts model called Qwen3.5-397B-A17B. Interesting to see Qwen call out serving efficiency as a benefit of that architecture:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Built on an innovative hybrid architecture that fuses linear attention (via Gated Delta Networks) with a sparse mixture-of-experts, the model attains remarkable inference efficiency: although it comprises 397 billion total parameters, just 17 billion are activated per forward pass, optimizing both speed and cost without sacrificing capability.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;It's &lt;a href="https://huggingface.co/Qwen/Qwen3.5-397B-A17B"&gt;807GB on Hugging Face&lt;/a&gt;, and Unsloth have a &lt;a href="https://huggingface.co/unsloth/Qwen3.5-397B-A17B-GGUF"&gt;collection of smaller GGUFs&lt;/a&gt; ranging in size from 94.2GB 1-bit to 462GB Q8_K_XL.&lt;/p&gt;
&lt;p&gt;I got this &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle/"&gt;pelican&lt;/a&gt; from the &lt;a href="https://openrouter.ai/qwen/qwen3.5-397b-a17b"&gt;OpenRouter hosted model&lt;/a&gt; (&lt;a href="https://gist.github.com/simonw/625546cf6b371f9c0040e64492943b82"&gt;transcript&lt;/a&gt;):&lt;/p&gt;
&lt;p&gt;&lt;img alt="Pelican is quite good although the neck lacks an outline for some reason. Bicycle is very basic with an incomplete frame" src="https://static.simonwillison.net/static/2026/qwen3.5-397b-a17b.png" /&gt;&lt;/p&gt;
&lt;p&gt;The proprietary hosted model is called Qwen3.5 Plus 2026-02-15, and is a little confusing. Qwen researcher &lt;a href="https://twitter.com/JustinLin610/status/2023340126479569140"&gt;Junyang Lin says&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Qwen3-Plus is a hosted API version of 397B. As the model natively supports 256K tokens, Qwen3.5-Plus supports 1M token context length. Additionally it supports search and code interpreter, which you can use on Qwen Chat with Auto mode.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Here's &lt;a href="https://gist.github.com/simonw/9507dd47483f78dc1195117735273e20"&gt;its pelican&lt;/a&gt;, which is similar in quality to the open weights model:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Similar quality pelican. The bicycle is taller and has a better frame shape. They are visually quite similar." src="https://static.simonwillison.net/static/2026/qwen3.5-plus-02-15.png" /&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/vision-llms"&gt;vision-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/qwen"&gt;qwen&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openrouter"&gt;openrouter&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-in-china"&gt;ai-in-china&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="vision-llms"/><category term="qwen"/><category term="pelican-riding-a-bicycle"/><category term="llm-release"/><category term="openrouter"/><category term="ai-in-china"/></entry><entry><title>GLM-5: From Vibe Coding to Agentic Engineering</title><link href="https://simonwillison.net/2026/Feb/11/glm-5/#atom-tag" rel="alternate"/><published>2026-02-11T18:56:14+00:00</published><updated>2026-02-11T18:56:14+00:00</updated><id>https://simonwillison.net/2026/Feb/11/glm-5/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://z.ai/blog/glm-5"&gt;GLM-5: From Vibe Coding to Agentic Engineering&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;This is a &lt;em&gt;huge&lt;/em&gt; new MIT-licensed model: 744B parameters and &lt;a href="https://huggingface.co/zai-org/GLM-5"&gt;1.51TB on Hugging Face&lt;/a&gt; - twice the size of &lt;a href="https://huggingface.co/zai-org/GLM-4.7"&gt;GLM-4.7&lt;/a&gt;, which was 368B and 717GB (4.5 and 4.6 were around that size too).&lt;/p&gt;
&lt;p&gt;It's interesting to see Z.ai take a position on what we should call professional software engineers building with LLMs - I've seen &lt;strong&gt;Agentic Engineering&lt;/strong&gt; show up in a few other places recently, most notably &lt;a href="https://twitter.com/karpathy/status/2019137879310836075"&gt;from Andrej Karpathy&lt;/a&gt; and &lt;a href="https://addyosmani.com/blog/agentic-engineering/"&gt;Addy Osmani&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I ran my "Generate an SVG of a pelican riding a bicycle" prompt through GLM-5 via &lt;a href="https://openrouter.ai/"&gt;OpenRouter&lt;/a&gt; and got back &lt;a href="https://gist.github.com/simonw/cc4ca7815ae82562e89a9fdd99f0725d"&gt;a very good pelican on a disappointing bicycle frame&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img alt="The pelican is good and has a well defined beak. The bicycle frame is a wonky red triangle. Nice sun and motion lines." src="https://static.simonwillison.net/static/2026/glm-5-pelican.png" /&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=46977210"&gt;Hacker News&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/definitions"&gt;definitions&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-programming"&gt;ai-assisted-programming&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/vibe-coding"&gt;vibe-coding&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openrouter"&gt;openrouter&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-in-china"&gt;ai-in-china&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/glm"&gt;glm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/agentic-engineering"&gt;agentic-engineering&lt;/a&gt;&lt;/p&gt;



</summary><category term="definitions"/><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="ai-assisted-programming"/><category term="pelican-riding-a-bicycle"/><category term="llm-release"/><category term="vibe-coding"/><category term="openrouter"/><category term="ai-in-china"/><category term="glm"/><category term="agentic-engineering"/></entry><entry><title>Open Responses</title><link href="https://simonwillison.net/2026/Jan/15/open-responses/#atom-tag" rel="alternate"/><published>2026-01-15T23:56:56+00:00</published><updated>2026-01-15T23:56:56+00:00</updated><id>https://simonwillison.net/2026/Jan/15/open-responses/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.openresponses.org/"&gt;Open Responses&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;This is the standardization effort I've most wanted in the world of LLMs: a vendor-neutral specification for the JSON API that clients can use to talk to hosted LLMs.&lt;/p&gt;
&lt;p&gt;Open Responses aims to provide exactly that as a documented standard, derived from OpenAI's Responses API.&lt;/p&gt;
&lt;p&gt;I was hoping for one based on their older Chat Completions API, since so many other products have cloned that already, but basing it on Responses does make sense: that API was designed with the features of more recent models - such as reasoning traces - baked into the design.&lt;/p&gt;
&lt;p&gt;What's certainly notable is the list of launch partners. OpenRouter alone means we can expect to be able to use this protocol with almost every existing model, and Hugging Face, LM Studio, vLLM, Ollama and Vercel cover a huge portion of the common tools used to serve models.&lt;/p&gt;
&lt;p&gt;For protocols like this I really want to see a comprehensive, language-independent conformance test site. Open Responses has a subset of that - the official repository includes &lt;a href="https://github.com/openresponses/openresponses/blob/d0f23437b27845d5c3d0abaf5cb5c4a702f26b05/src/lib/compliance-tests.ts"&gt;src/lib/compliance-tests.ts&lt;/a&gt; which can be used to exercise a server implementation, and is available as a React app &lt;a href="https://www.openresponses.org/compliance"&gt;on the official site&lt;/a&gt; that can be pointed at any implementation served via CORS.&lt;/p&gt;
&lt;p&gt;What's missing is the equivalent for clients. I plan to spin up my own client library for this in Python and I'd really like to be able to run that against a conformance suite designed to check that my client correctly handles all of the details.&lt;/p&gt;
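&lt;p&gt;As a sketch of what such a client might look like - note that the &lt;code&gt;/v1/responses&lt;/code&gt; path, the payload shape and the model name here are my assumptions for illustration, not details confirmed against the specification:&lt;/p&gt;

```python
import json
import urllib.request


def build_responses_request(model: str, input_text: str) -> dict:
    # Responses-derived shape (an assumption): a model plus an "input"
    # list of role/content messages.
    return {
        "model": model,
        "input": [{"role": "user", "content": input_text}],
    }


def prepare_request(base_url: str, api_key: str, payload: dict) -> urllib.request.Request:
    # Builds but does not send the HTTP request; pass the result to
    # urllib.request.urlopen() to actually call a server.
    return urllib.request.Request(
        f"{base_url}/v1/responses",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
    )


# Hypothetical usage - "example-model" is a placeholder, not a real model ID:
payload = build_responses_request(
    "example-model", "Generate an SVG of a pelican riding a bicycle"
)
request = prepare_request("https://example.com", "sk-placeholder", payload)
```

A conformance suite for clients would exercise exactly this layer: the shape of the payload, the headers, and how the client parses whatever comes back.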

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/reach_vb/status/2011863516852965565"&gt;VB&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/json"&gt;json&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/standards"&gt;standards&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openai"&gt;openai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openrouter"&gt;openrouter&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/conformance-suites"&gt;conformance-suites&lt;/a&gt;&lt;/p&gt;



</summary><category term="json"/><category term="standards"/><category term="ai"/><category term="openai"/><category term="generative-ai"/><category term="llms"/><category term="openrouter"/><category term="conformance-suites"/></entry><entry><title>DeepSeek-V3.2</title><link href="https://simonwillison.net/2025/Dec/1/deepseek-v32/#atom-tag" rel="alternate"/><published>2025-12-01T23:56:19+00:00</published><updated>2025-12-01T23:56:19+00:00</updated><id>https://simonwillison.net/2025/Dec/1/deepseek-v32/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://api-docs.deepseek.com/news/news251201"&gt;DeepSeek-V3.2&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Two new open weight (MIT licensed) models from DeepSeek today: &lt;a href="https://huggingface.co/deepseek-ai/DeepSeek-V3.2"&gt;DeepSeek-V3.2&lt;/a&gt; and &lt;a href="https://huggingface.co/deepseek-ai/DeepSeek-V3.2-Speciale"&gt;DeepSeek-V3.2-Speciale&lt;/a&gt;, both 690GB, 685B parameters. Here's the &lt;a href="https://huggingface.co/deepseek-ai/DeepSeek-V3.2/resolve/main/assets/paper.pdf"&gt;PDF tech report&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;DeepSeek-V3.2 is DeepSeek's new flagship model, now running on &lt;a href="https://chat.deepseek.com"&gt;chat.deepseek.com&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The difference between the two new models is best explained by this paragraph from the technical report:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;DeepSeek-V3.2 integrates reasoning, agent, and human alignment data distilled from specialists, undergoing thousands of steps of continued RL training to reach the final checkpoints. To investigate the potential of extended thinking, we also developed an experimental variant, DeepSeek-V3.2-Speciale. This model was trained exclusively on reasoning data with a reduced length penalty during RL. Additionally, we incorporated the dataset and reward method from DeepSeekMath-V2 (Shao et al., 2025) to enhance capabilities in mathematical proofs.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I covered &lt;a href="https://simonwillison.net/2025/Nov/27/deepseek-math-v2/"&gt;DeepSeek-Math-V2 last week&lt;/a&gt;. Like that model, DeepSeek-V3.2-Speciale also scores gold on the 2025 International Mathematical Olympiad so beloved of model training teams!&lt;/p&gt;
&lt;p id="pelicans"&gt;I tried both models on "Generate an SVG of a pelican riding a bicycle" using the chat feature of [OpenRouter](https://openrouter.ai/). DeepSeek V3.2 produced this very short reasoning chain:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Let's assume the following:&lt;/p&gt;
&lt;p&gt;Wheel radius: 40&lt;br&gt;
Distance between wheel centers: 180&lt;br&gt;
Seat height: 60 (above the rear wheel center)&lt;br&gt;
Handlebars: above the front wheel, extending back and up.&lt;/p&gt;
&lt;p&gt;We'll set the origin at the center of the rear wheel.&lt;/p&gt;
&lt;p&gt;We'll create the SVG with a viewBox that fits the entire drawing.&lt;/p&gt;
&lt;p&gt;Let's start by setting up the SVG.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Followed by this illustration:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Pleasing gradents for the sky and ground and sun. Neat three-circle clouds. A Pelican on a Bicycle title printed on the image. The pelican is cute but stlightly detached from the bicycle. The bicycle has a somewhat mangled brown frame." src="https://static.simonwillison.net/static/2025/deepseek-v32.png" /&gt;&lt;/p&gt;
&lt;p&gt;Here's what I got from the Speciale model, which thought deeply about the geometry of bicycles and pelicans for &lt;a href="https://gist.githubusercontent.com/simonw/3debaf0df67c2d99a36f41f21ffe534c/raw/fbbb60c6d5b6f02d539ade5105b990490a81a86d/svg.txt"&gt;a very long time (at least 10 minutes)&lt;/a&gt; before spitting out this result:&lt;/p&gt;
&lt;p&gt;&lt;img alt="It's not great. The bicycle is distorted, the pelican is a white oval, an orange almost-oval beak, a little black eye and setched out straight line limbs leading to the pedal and handlebars." src="https://static.simonwillison.net/static/2025/deepseek-v32-speciale.png" /&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=46108780"&gt;Hacker News&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-reasoning"&gt;llm-reasoning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/deepseek"&gt;deepseek&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openrouter"&gt;openrouter&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-in-china"&gt;ai-in-china&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="pelican-riding-a-bicycle"/><category term="llm-reasoning"/><category term="deepseek"/><category term="llm-release"/><category term="openrouter"/><category term="ai-in-china"/></entry><entry><title>Kimi K2 Thinking</title><link href="https://simonwillison.net/2025/Nov/6/kimi-k2-thinking/#atom-tag" rel="alternate"/><published>2025-11-06T23:53:06+00:00</published><updated>2025-11-06T23:53:06+00:00</updated><id>https://simonwillison.net/2025/Nov/6/kimi-k2-thinking/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://huggingface.co/moonshotai/Kimi-K2-Thinking"&gt;Kimi K2 Thinking&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Chinese AI lab Moonshot's Kimi K2 established itself as one of the largest open weight models - 1 trillion parameters - &lt;a href="https://simonwillison.net/2025/Jul/11/kimi-k2/"&gt;back in July&lt;/a&gt;. They've now released the Thinking version, also a trillion parameters (MoE, 32B active) and also under their custom modified (so &lt;a href="https://simonwillison.net/2025/Jul/11/kimi-k2/#kimi-license"&gt;not quite open source&lt;/a&gt;) MIT license.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Starting with Kimi K2, we built it as a thinking agent that reasons step-by-step while dynamically invoking tools. It sets a new state-of-the-art on Humanity's Last Exam (HLE), BrowseComp, and other benchmarks by dramatically scaling multi-step reasoning depth and maintaining stable tool-use across 200–300 sequential calls. At the same time, K2 Thinking is a native INT4 quantization model with 256k context window, achieving lossless reductions in inference latency and GPU memory usage.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This one is only 594GB on Hugging Face - Kimi K2 was 1.03TB - which I think is due to the new INT4 quantization. This makes the model both cheaper and faster to host.&lt;/p&gt;
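&lt;p&gt;A back-of-envelope sketch of why INT4 roughly halves the download - this assumes ~1 trillion parameters and treats the checkpoints as uniform precision, which they aren't (mixed-precision layers and metadata explain why the real files are larger than the pure 4-bit and 8-bit figures):&lt;/p&gt;

```python
# Rough checkpoint-size arithmetic. The parameter count is approximate
# and real checkpoints mix precisions, so these are lower bounds.
PARAMS = 1_000_000_000_000  # ~1 trillion parameters


def checkpoint_gb(params: int, bits_per_param: float) -> float:
    # bits -> bytes -> gigabytes (decimal GB, matching Hugging Face sizes)
    return params * bits_per_param / 8 / 1e9


int4_gb = checkpoint_gb(PARAMS, 4)  # pure 4-bit: ~500 GB (K2 Thinking is 594GB)
int8_gb = checkpoint_gb(PARAMS, 8)  # pure 8-bit: ~1000 GB (Kimi K2 was 1.03TB)
```

The observed sizes land a little above both estimates, which is consistent with 4-bit and roughly 8-bit weights respectively plus some higher-precision components.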
&lt;p&gt;So far the only people hosting it are Moonshot themselves. I tried it out both via &lt;a href="https://platform.moonshot.ai"&gt;their own API&lt;/a&gt; and via &lt;a href="https://openrouter.ai/moonshotai/kimi-k2-thinking/providers"&gt;the OpenRouter proxy to it&lt;/a&gt;, via the &lt;a href="https://github.com/ghostofpokemon/llm-moonshot"&gt;llm-moonshot&lt;/a&gt; plugin (by NickMystic) and my &lt;a href="https://github.com/simonw/llm-openrouter"&gt;llm-openrouter&lt;/a&gt; plugin respectively.&lt;/p&gt;
&lt;p&gt;The buzz around this model so far is very positive. Could this be the first open weight model that's competitive with the latest from OpenAI and Anthropic, especially for long-running agentic tool call sequences?&lt;/p&gt;
&lt;p&gt;Moonshot AI's &lt;a href="https://moonshotai.github.io/Kimi-K2/thinking.html"&gt;self-reported benchmark scores&lt;/a&gt; show K2 Thinking beating the top OpenAI and Anthropic models (GPT-5 and Sonnet 4.5 Thinking) at "Agentic Reasoning" and "Agentic Search" but not quite top for "Coding":&lt;/p&gt;
&lt;p&gt;&lt;img alt="Comparison bar chart showing agentic reasoning, search, and coding benchmark performance scores across three AI systems (K, OpenAI, and AI) on tasks including Humanity's Last Exam (44.9, 41.7, 32.0), BrowseComp (60.2, 54.9, 24.1), Seal-0 (56.3, 51.4, 53.4), SWE-Multilingual (61.1, 55.3, 68.0), SWE-bench Verified (71.3, 74.9, 77.2), and LiveCodeBench V6 (83.1, 87.0, 64.0), with category descriptions including &amp;quot;Expert-level questions across subjects&amp;quot;, &amp;quot;Agentic search &amp;amp; browsing&amp;quot;, &amp;quot;Real-world latest information collection&amp;quot;, &amp;quot;Agentic coding&amp;quot;, and &amp;quot;Competitive programming&amp;quot;." src="https://static.simonwillison.net/static/2025/kimi-k2-thinking-benchmarks.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;I ran a couple of pelican tests:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm install llm-moonshot
llm keys set moonshot # paste key
llm -m moonshot/kimi-k2-thinking 'Generate an SVG of a pelican riding a bicycle'
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img alt="Sonnet 4.5 described this as: Cartoon illustration of a white duck or goose with an orange beak and gray wings riding a bicycle with a red frame and light blue wheels against a light blue background." src="https://static.simonwillison.net/static/2025/k2-thinking.png" /&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm install llm-openrouter
llm keys set openrouter # paste key
llm -m openrouter/moonshotai/kimi-k2-thinking \
  'Generate an SVG of a pelican riding a bicycle'
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img alt="Sonnet 4.5: Minimalist cartoon illustration of a white bird with an orange beak and feet standing on a triangular-framed penny-farthing style bicycle with gray-hubbed wheels and a propeller hat on its head, against a light background with dotted lines and a brown ground line." src="https://static.simonwillison.net/static/2025/k2-thinking-openrouter.png" /&gt;&lt;/p&gt;
&lt;p&gt;Artificial Analysis &lt;a href="https://x.com/ArtificialAnlys/status/1986541785511043536"&gt;said&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Kimi K2 Thinking achieves 93% in 𝜏²-Bench Telecom, an agentic tool use benchmark where the model acts as a customer service agent. This is the highest score we have independently measured. Tool use in long horizon agentic contexts was a strength of Kimi K2 Instruct and it appears this new Thinking variant makes substantial gains&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;CNBC quoted a source who &lt;a href="https://www.cnbc.com/2025/11/06/alibaba-backed-moonshot-releases-new-ai-model-kimi-k2-thinking.html"&gt;provided the training price&lt;/a&gt; for the model:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The Kimi K2 Thinking model cost $4.6 million to train, according to a source familiar with the matter. [...] CNBC was unable to independently verify the DeepSeek or Kimi figures.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;MLX developer Awni Hannun &lt;a href="https://x.com/awnihannun/status/1986601104130646266"&gt;got it working&lt;/a&gt; on two 512GB M3 Ultra Mac Studios:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The new 1 Trillion parameter Kimi K2 Thinking model runs well on 2 M3 Ultras in its native format - no loss in quality!&lt;/p&gt;
&lt;p&gt;The model was quantization aware trained (qat) at int4.&lt;/p&gt;
&lt;p&gt;Here it generated ~3500 tokens at 15 toks/sec using pipeline-parallelism in mlx-lm&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Here's &lt;a href="https://huggingface.co/mlx-community/Kimi-K2-Thinking"&gt;the 658GB mlx-community model&lt;/a&gt;.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/mlx"&gt;mlx&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-reasoning"&gt;llm-reasoning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openrouter"&gt;openrouter&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-in-china"&gt;ai-in-china&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/artificial-analysis"&gt;artificial-analysis&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/moonshot"&gt;moonshot&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/kimi"&gt;kimi&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="llm"/><category term="mlx"/><category term="pelican-riding-a-bicycle"/><category term="llm-reasoning"/><category term="llm-release"/><category term="openrouter"/><category term="ai-in-china"/><category term="artificial-analysis"/><category term="moonshot"/><category term="kimi"/></entry><entry><title>Two more Chinese pelicans</title><link href="https://simonwillison.net/2025/Oct/1/two-pelicans/#atom-tag" rel="alternate"/><published>2025-10-01T23:39:07+00:00</published><updated>2025-10-01T23:39:07+00:00</updated><id>https://simonwillison.net/2025/Oct/1/two-pelicans/#atom-tag</id><summary type="html">
    &lt;p&gt;Two new models from Chinese AI labs in the past few days. I tried them both out using &lt;a href="https://github.com/simonw/llm-openrouter"&gt;llm-openrouter&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;DeepSeek-V3.2-Exp&lt;/strong&gt; from DeepSeek. &lt;a href="https://api-docs.deepseek.com/news/news250929"&gt;Announcement&lt;/a&gt;, &lt;a href="https://github.com/deepseek-ai/DeepSeek-V3.2-Exp/blob/main/DeepSeek_V3_2.pdf"&gt;Tech Report&lt;/a&gt;, &lt;a href="https://huggingface.co/deepseek-ai/DeepSeek-V3.2-Exp"&gt;Hugging Face&lt;/a&gt; (690GB, MIT license).&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;As an intermediate step toward our next-generation architecture, V3.2-Exp builds upon V3.1-Terminus by introducing DeepSeek Sparse Attention—a sparse attention mechanism designed to explore and validate optimizations for training and inference efficiency in long-context scenarios.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This one felt &lt;em&gt;very slow&lt;/em&gt; when I accessed it via OpenRouter - I probably got routed to &lt;a href="https://openrouter.ai/deepseek/deepseek-v3.2-exp/providers"&gt;one of the slower providers&lt;/a&gt;. Here's &lt;a href="https://gist.github.com/simonw/659966a678dedd9d4e55a01a4256ac56"&gt;the pelican&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Claude Sonnet 4.5 says: Minimalist line drawing illustration of a stylized bird riding a bicycle, with clock faces as wheels showing approximately 10:10, orange beak and pedal accents, on a light gray background with a dashed line representing the ground." src="https://static.simonwillison.net/static/2025/deepseek-v3.2-exp.png" /&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;GLM-4.6 from Z.ai&lt;/strong&gt;. &lt;a href="https://z.ai/blog/glm-4.6"&gt;Announcement&lt;/a&gt;, &lt;a href="https://huggingface.co/zai-org/GLM-4.6"&gt;Hugging Face&lt;/a&gt; (714GB, MIT license).&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The context window has been expanded from 128K to 200K tokens [...] higher scores on code benchmarks [...] GLM-4.6 exhibits stronger performance in tool using and search-based agents.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Here's &lt;a href="https://gist.github.com/simonw/5cf05165fc721b5f7eac3b10eeff20d5"&gt;the pelican&lt;/a&gt; for that:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Claude Sonnet 4.5 says: Illustration of a white seagull with an orange beak and yellow feet riding a bicycle against a light blue sky background with white clouds and a yellow sun." src="https://static.simonwillison.net/static/2025/glm-4.6.png" /&gt;&lt;/p&gt;

    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/deepseek"&gt;deepseek&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openrouter"&gt;openrouter&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-in-china"&gt;ai-in-china&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/glm"&gt;glm&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="llm"/><category term="pelican-riding-a-bicycle"/><category term="deepseek"/><category term="llm-release"/><category term="openrouter"/><category term="ai-in-china"/><category term="glm"/></entry><entry><title>llm-openrouter 0.5</title><link href="https://simonwillison.net/2025/Sep/21/llm-openrouter/#atom-tag" rel="alternate"/><published>2025-09-21T00:24:05+00:00</published><updated>2025-09-21T00:24:05+00:00</updated><id>https://simonwillison.net/2025/Sep/21/llm-openrouter/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/simonw/llm-openrouter/releases/tag/0.5"&gt;llm-openrouter 0.5&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;New release of my &lt;a href="https://llm.datasette.io/"&gt;LLM&lt;/a&gt; plugin for accessing models made available via &lt;a href="https://openrouter.ai/"&gt;OpenRouter&lt;/a&gt;. The release notes in full:&lt;/p&gt;
&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;Support for &lt;a href="https://llm.datasette.io/en/stable/tools.html"&gt;tool calling&lt;/a&gt;. Thanks, &lt;a href="https://github.com/jamessanford"&gt;James Sanford&lt;/a&gt;. &lt;a href="https://github.com/simonw/llm-openrouter/pull/43"&gt;#43&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Support for reasoning options, for example &lt;code&gt;llm -m openrouter/openai/gpt-5 'prove dogs exist' -o reasoning_effort medium&lt;/code&gt;. &lt;a href="https://github.com/simonw/llm-openrouter/issues/45"&gt;#45&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;Tool calling is a really big deal, as it means you can now use the plugin to try out tools (and &lt;a href="https://simonwillison.net/2025/Sep/18/agents/"&gt;build agents, if you like&lt;/a&gt;) against any of the 179 tool-enabled models on that platform:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm install llm-openrouter
llm keys set openrouter
# Paste key here
llm models --tools | grep 'OpenRouter:' | wc -l
# Outputs 179
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Quite a few of the models hosted on OpenRouter can be accessed for free. Here's a tool-usage example using the &lt;a href="https://github.com/simonw/llm-tools-datasette"&gt;llm-tools-datasette plugin&lt;/a&gt; against the new &lt;a href="https://simonwillison.net/2025/Sep/20/grok-4-fast/"&gt;Grok 4 Fast model&lt;/a&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm install llm-tools-datasette
llm -m openrouter/x-ai/grok-4-fast:free -T 'Datasette("https://datasette.io/content")' 'Count available plugins'
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Outputs:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;There are 154 available plugins.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;a href="https://gist.github.com/simonw/43c56203887dd0d07351443a2ba18f29"&gt;The output&lt;/a&gt; of &lt;code&gt;llm logs -cu&lt;/code&gt; shows the tool calls and SQL queries it executed to get that result.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-tool-use"&gt;llm-tool-use&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-reasoning"&gt;llm-reasoning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openrouter"&gt;openrouter&lt;/a&gt;&lt;/p&gt;



</summary><category term="projects"/><category term="ai"/><category term="datasette"/><category term="generative-ai"/><category term="llms"/><category term="llm"/><category term="llm-tool-use"/><category term="llm-reasoning"/><category term="openrouter"/></entry><entry><title>Grok 4 Fast</title><link href="https://simonwillison.net/2025/Sep/20/grok-4-fast/#atom-tag" rel="alternate"/><published>2025-09-20T23:59:33+00:00</published><updated>2025-09-20T23:59:33+00:00</updated><id>https://simonwillison.net/2025/Sep/20/grok-4-fast/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://x.ai/news/grok-4-fast"&gt;Grok 4 Fast&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;New hosted vision-enabled reasoning model from xAI that's designed to be fast and extremely competitive on price. It has a 2 million token context window and "was trained end-to-end with tool-use reinforcement learning".&lt;/p&gt;
&lt;p&gt;It's priced at $0.20/million input tokens and $0.50/million output tokens - 15x less than Grok 4 (which is $3/million input and $15/million output). That puts it cheaper than GPT-5 mini and Gemini 2.5 Flash on &lt;a href="https://www.llm-prices.com/"&gt;llm-prices.com&lt;/a&gt;.&lt;/p&gt;
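&lt;p&gt;As a rough sketch of what those rates mean per request (the per-million-token prices are the ones listed above; the 10,000 input / 2,000 output token request is a hypothetical example):&lt;/p&gt;

```python
# Per-request cost at the listed prices (USD per million tokens).
# The rates are from the post; the request sizes are hypothetical.
GROK_4_FAST = {"input": 0.20, "output": 0.50}
GROK_4 = {"input": 3.00, "output": 15.00}

def cost(prices, input_tokens, output_tokens):
    """USD cost of one request at the given per-million-token rates."""
    return (input_tokens * prices["input"]
            + output_tokens * prices["output"]) / 1_000_000

fast = cost(GROK_4_FAST, 10_000, 2_000)  # $0.003
full = cost(GROK_4, 10_000, 2_000)       # $0.06
print(f"Grok 4 Fast: ${fast:.4f}, Grok 4: ${full:.4f}, {full / fast:.0f}x cheaper")
```

&lt;p&gt;The exact multiple depends on the input/output mix, since the output price gap (30x) is larger than the input one (15x).&lt;/p&gt;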
&lt;p&gt;The same model weights handle reasoning and non-reasoning based on a parameter passed to the model.&lt;/p&gt;
&lt;p&gt;I've been trying it out via my updated &lt;a href="https://github.com/simonw/llm-openrouter"&gt;llm-openrouter&lt;/a&gt; plugin, since Grok 4 Fast is available &lt;a href="https://openrouter.ai/x-ai/grok-4-fast"&gt;for free on OpenRouter&lt;/a&gt; for a limited period.&lt;/p&gt;
&lt;p&gt;Here's output from the &lt;a href="https://gist.github.com/simonw/7f9a5e5c780b1d5bfe98b4f4ad540551"&gt;non-reasoning model&lt;/a&gt;. This actually output an invalid SVG - I had to make &lt;a href="https://gist.github.com/simonw/7f9a5e5c780b1d5bfe98b4f4ad540551?permalink_comment_id=5768049#gistcomment-5768049"&gt;a tiny manual tweak&lt;/a&gt; to the XML to get it to render.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm -m openrouter/x-ai/grok-4-fast:free "Generate an SVG of a pelican riding a bicycle" -o reasoning_enabled false
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img alt="Described by Grok 4 Fast: Simple line drawing of a white bird with a long yellow beak riding a bicycle, pedaling with its orange legs." src="https://static.simonwillison.net/static/2025/grok-4-no-reasoning.png" /&gt;&lt;/p&gt;
&lt;p&gt;(I initially ran this without that &lt;code&gt;-o reasoning_enabled false&lt;/code&gt; flag, but then I saw that &lt;a href="https://x.com/OpenRouterAI/status/1969427723098435738"&gt;OpenRouter enable reasoning by default&lt;/a&gt; for that model. Here's my &lt;a href="https://gist.github.com/simonw/6a52e6585cb3c45e64ae23b9c5ebafe9"&gt;previous invalid result&lt;/a&gt;.)&lt;/p&gt;
&lt;p&gt;And &lt;a href="https://gist.github.com/simonw/539719a1495253bbd27f3107931e6dd3"&gt;the reasoning model&lt;/a&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm -m openrouter/x-ai/grok-4-fast:free "Generate an SVG of a pelican riding a bicycle" -o reasoning_enabled true
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img alt="Described by Grok 4 Fast: A simple line drawing of a white pelican with a yellow beak holding a yellow object, riding a black bicycle on green grass under a blue sky with white clouds." src="https://static.simonwillison.net/static/2025/grok-4-fast-reasoning.png" /&gt;&lt;/p&gt;
&lt;p&gt;In related news, the New York Times had a story a couple of days ago about Elon's recent focus on xAI: &lt;a href="https://www.nytimes.com/2025/09/18/technology/elon-musk-artificial-intelligence-xai.html"&gt;Since Leaving Washington, Elon Musk Has Been All In on His A.I. Company&lt;/a&gt;.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/vision-llms"&gt;vision-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-pricing"&gt;llm-pricing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-reasoning"&gt;llm-reasoning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/grok"&gt;grok&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openrouter"&gt;openrouter&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/xai"&gt;xai&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="llm"/><category term="vision-llms"/><category term="llm-pricing"/><category term="pelican-riding-a-bicycle"/><category term="llm-reasoning"/><category term="grok"/><category term="llm-release"/><category term="openrouter"/><category term="xai"/></entry><entry><title>Qwen3-Next-80B-A3B: 🐧🦩 Who needs legs?!</title><link href="https://simonwillison.net/2025/Sep/12/qwen3-next/#atom-tag" rel="alternate"/><published>2025-09-12T04:07:32+00:00</published><updated>2025-09-12T04:07:32+00:00</updated><id>https://simonwillison.net/2025/Sep/12/qwen3-next/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://x.com/Alibaba_Qwen/status/1966197643904000262"&gt;Qwen3-Next-80B-A3B&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Qwen announced two new models via their Twitter account (and here's &lt;a href="https://qwen.ai/blog?id=4074cca80393150c248e508aa62983f9cb7d27cd&amp;amp;from=research.latest-advancements-list"&gt;their blog&lt;/a&gt;): &lt;a href="https://huggingface.co/Qwen/Qwen3-Next-80B-A3B-Instruct"&gt;Qwen3-Next-80B-A3B-Instruct&lt;/a&gt; and &lt;a href="https://huggingface.co/Qwen/Qwen3-Next-80B-A3B-Thinking"&gt;Qwen3-Next-80B-A3B-Thinking&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;They make some big claims on performance:&lt;/p&gt;
&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;Qwen3-Next-80B-A3B-Instruct approaches our 235B flagship.&lt;/li&gt;
&lt;li&gt;Qwen3-Next-80B-A3B-Thinking outperforms Gemini-2.5-Flash-Thinking.&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;The name "80B-A3B" indicates 80 billion parameters of which only 3 billion are active at a time. You still need to have enough GPU-accessible RAM to hold all 80 billion in memory at once but only 3 billion will be used for each round of inference, which provides a &lt;em&gt;significant&lt;/em&gt; speedup in responding to prompts.&lt;/p&gt;
&lt;p&gt;More details from their tweet:&lt;/p&gt;
&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;80B params, but only 3B activated per token → 10x cheaper training, 10x faster inference than Qwen3-32B.(esp. @ 32K+ context!)&lt;/li&gt;
&lt;li&gt;Hybrid Architecture: Gated DeltaNet + Gated Attention → best of speed &amp;amp; recall&lt;/li&gt;
&lt;li&gt;Ultra-sparse MoE: 512 experts, 10 routed + 1 shared&lt;/li&gt;
&lt;li&gt;Multi-Token Prediction → turbo-charged speculative decoding&lt;/li&gt;
&lt;li&gt;Beats Qwen3-32B in perf, rivals Qwen3-235B in reasoning &amp;amp; long-context&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;The models on Hugging Face are around 150GB each so I decided to try them out via &lt;a href="https://openrouter.ai/"&gt;OpenRouter&lt;/a&gt; rather than on my own laptop (&lt;a href="https://openrouter.ai/qwen/qwen3-next-80b-a3b-thinking"&gt;Thinking&lt;/a&gt;, &lt;a href="https://openrouter.ai/qwen/qwen3-next-80b-a3b-instruct"&gt;Instruct&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;I used my &lt;a href="https://github.com/simonw/llm-openrouter"&gt;llm-openrouter&lt;/a&gt; plugin, which I installed like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm install llm-openrouter
llm keys set openrouter
# paste key here
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then found the model IDs with this command:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm models -q next
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Which output:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;OpenRouter: openrouter/qwen/qwen3-next-80b-a3b-thinking
OpenRouter: openrouter/qwen/qwen3-next-80b-a3b-instruct
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I have an LLM &lt;a href="https://llm.datasette.io/en/stable/templates.html"&gt;prompt template&lt;/a&gt; saved called &lt;code&gt;pelican-svg&lt;/code&gt; which I created like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm "Generate an SVG of a pelican riding a bicycle" --save pelican-svg
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This means I can run &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle/"&gt;my pelican benchmark&lt;/a&gt; like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm -t pelican-svg -m openrouter/qwen/qwen3-next-80b-a3b-thinking
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Or like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm -t pelican-svg -m openrouter/qwen/qwen3-next-80b-a3b-instruct
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here's the &lt;a href="https://gist.github.com/simonw/d1a0d0ff719d609bc6fad2e133e7cbe9"&gt;thinking model output&lt;/a&gt; (exported with &lt;code&gt;llm logs -c | pbcopy&lt;/code&gt; after I ran the prompt):&lt;/p&gt;
&lt;p&gt;&lt;img alt="The bicycle is too simple and way too wide. The pelican is two circles, two orange triangular feed and a big triangle for the beak." src="https://static.simonwillison.net/static/2025/qwen3-next-80b-a3b-thinking.png" /&gt;&lt;/p&gt;
&lt;p&gt;I enjoyed the "Whimsical style with smooth curves and friendly proportions (no anatomical accuracy needed for bicycle riding!)" note in &lt;a href="https://gist.github.com/simonw/d1a0d0ff719d609bc6fad2e133e7cbe9#prompt"&gt;the transcript&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The instruct (non-reasoning) model &lt;a href="https://gist.github.com/simonw/cc740a45beed5655faffa69da1e999f5"&gt;gave me this&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Blue background, brown ground, bicycle looks more like a wheelchair, pelican is actually quite good though - has thin grey wings and a perky yellow long triangular beak. Above the pelican is the caption Who needs legs?! with an emoji sequence of penguin then flamingo." src="https://static.simonwillison.net/static/2025/qwen3-next-80b-a3b-instruct.png" /&gt;&lt;/p&gt;
&lt;p&gt;"🐧🦩 Who needs legs!?" indeed! I like that penguin-flamingo emoji sequence it's decided on for pelicans.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/qwen"&gt;qwen&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-reasoning"&gt;llm-reasoning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openrouter"&gt;openrouter&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-in-china"&gt;ai-in-china&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="llm"/><category term="qwen"/><category term="pelican-riding-a-bicycle"/><category term="llm-reasoning"/><category term="llm-release"/><category term="openrouter"/><category term="ai-in-china"/></entry><entry><title>DeepSeek 3.1</title><link href="https://simonwillison.net/2025/Aug/22/deepseek-31/#atom-tag" rel="alternate"/><published>2025-08-22T22:07:25+00:00</published><updated>2025-08-22T22:07:25+00:00</updated><id>https://simonwillison.net/2025/Aug/22/deepseek-31/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://huggingface.co/deepseek-ai/DeepSeek-V3.1"&gt;DeepSeek 3.1&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
The latest model from DeepSeek, a 685B monster (like &lt;a href="https://simonwillison.net/2024/Dec/25/deepseek-v3/"&gt;DeepSeek v3&lt;/a&gt; before it) but this time it's a hybrid reasoning model.&lt;/p&gt;
&lt;p&gt;DeepSeek claim:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;DeepSeek-V3.1-Think achieves comparable answer quality to DeepSeek-R1-0528, while responding more quickly.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Drew Breunig &lt;a href="https://twitter.com/dbreunig/status/1958577728720183643"&gt;points out&lt;/a&gt; that their benchmarks show "the same scores with 25-50% fewer tokens" - at least across AIME 2025 and GPQA Diamond and LiveCodeBench.&lt;/p&gt;
&lt;p&gt;The DeepSeek release includes prompt examples for a &lt;a href="https://huggingface.co/deepseek-ai/DeepSeek-V3.1/blob/main/assets/code_agent_trajectory.html"&gt;coding agent&lt;/a&gt;, a &lt;a href="https://huggingface.co/deepseek-ai/DeepSeek-V3.1/blob/main/assets/search_python_tool_trajectory.html"&gt;python agent&lt;/a&gt; and a &lt;a href="https://huggingface.co/deepseek-ai/DeepSeek-V3.1/blob/main/assets/search_tool_trajectory.html"&gt;search agent&lt;/a&gt; - yet more evidence that the leading AI labs have settled on those as the three most important agentic patterns for their models to support. &lt;/p&gt;
&lt;p&gt;Here's the pelican riding a bicycle it drew me (&lt;a href="https://gist.github.com/simonw/f6dba61faf962866969eefd3de59d70e"&gt;transcript&lt;/a&gt;), which I ran from my phone using &lt;a href="https://openrouter.ai/chat?models=deepseek/deepseek-chat-v3.1"&gt;OpenRouter chat&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Cartoon illustration of a white bird with an orange beak riding a bicycle against a blue sky background with bright green grass below" src="https://static.simonwillison.net/static/2025/deepseek-3-1-pelican.png" /&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/prompt-engineering"&gt;prompt-engineering&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/drew-breunig"&gt;drew-breunig&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-reasoning"&gt;llm-reasoning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/deepseek"&gt;deepseek&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openrouter"&gt;openrouter&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/coding-agents"&gt;coding-agents&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-in-china"&gt;ai-in-china&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="prompt-engineering"/><category term="generative-ai"/><category term="llms"/><category term="drew-breunig"/><category term="pelican-riding-a-bicycle"/><category term="llm-reasoning"/><category term="deepseek"/><category term="llm-release"/><category term="openrouter"/><category term="coding-agents"/><category term="ai-in-china"/></entry><entry><title>Usage charts for my LLM tool against OpenRouter</title><link href="https://simonwillison.net/2025/Aug/4/llm-openrouter-usage/#atom-tag" rel="alternate"/><published>2025-08-04T20:00:47+00:00</published><updated>2025-08-04T20:00:47+00:00</updated><id>https://simonwillison.net/2025/Aug/4/llm-openrouter-usage/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://openrouter.ai/apps?url=https%3A%2F%2Fllm.datasette.io%2F"&gt;Usage charts for my LLM tool against OpenRouter&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
OpenRouter proxies requests to a large number of different LLMs and provides high level statistics of which models are the most popular among their users.&lt;/p&gt;
&lt;p&gt;Tools that call OpenRouter can include &lt;code&gt;HTTP-Referer&lt;/code&gt; and &lt;code&gt;X-Title&lt;/code&gt; headers to credit that tool with the token usage. My &lt;a href="https://github.com/simonw/llm-openrouter/"&gt;llm-openrouter&lt;/a&gt; plugin &lt;a href="https://github.com/simonw/llm-openrouter/blob/8e4be78e60337154b063faaa7161dddd91462730/llm_openrouter.py#L99C13-L99C20"&gt;does that here&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;... which means &lt;a href="https://openrouter.ai/apps?url=https%3A%2F%2Fllm.datasette.io%2F"&gt;this page&lt;/a&gt; displays aggregate stats across users of that plugin! Looks like someone has been running a lot of traffic through &lt;a href="https://openrouter.ai/qwen/qwen3-14b"&gt;Qwen 3 14B&lt;/a&gt; recently.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Screenshot of LLM usage statistics dashboard showing a stacked bar chart from July 5 to August 4, 2025, with a legend on the right displaying &amp;quot;Top models&amp;quot; including Qwen: Qwen3 14B (480M), Google: Gemini 2.5 Flash Lite Preview 06-17 (31.7M), Horizon Beta (3.77M), Google: Gemini 2.5 Flash Lite (1.67M), google/gemini-2.0-flash-exp (1.14M), DeepSeek: DeepSeek V3 0324 (1.11M), Meta: Llama 3.3 70B Instruct (228K), Others (220K), Qwen: Qwen3 Coder (218K), MoonshotAI: Kimi K2 (132K), and Horizon Alpha (75K), with a total of 520M usage shown for August 3, 2025." src="https://static.simonwillison.net/static/2025/llm-usage-openrouter.jpg" /&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openrouter"&gt;openrouter&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="llm"/><category term="openrouter"/></entry><entry><title>More model releases on 31st July</title><link href="https://simonwillison.net/2025/Jul/31/more-models/#atom-tag" rel="alternate"/><published>2025-07-31T21:54:47+00:00</published><updated>2025-07-31T21:54:47+00:00</updated><id>https://simonwillison.net/2025/Jul/31/more-models/#atom-tag</id><summary type="html">
    &lt;p&gt;Here are a few more model releases from today, to round out a &lt;a href="https://simonwillison.net/search/?tag=llm-release&amp;amp;year=2025&amp;amp;month=7"&gt;very busy July&lt;/a&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Cohere &lt;a href="https://cohere.com/blog/command-a-vision"&gt;released Command A Vision&lt;/a&gt;, their first multi-modal (image input) LLM. Like their others it's open weights under Creative Commons Attribution Non-Commercial, so you need to license it (or use their paid API) if you want to use it commercially.&lt;/li&gt;
&lt;li&gt;San Francisco AI startup Deep Cogito released &lt;a href="https://www.deepcogito.com/research/cogito-v2-preview"&gt;four open weights hybrid reasoning models&lt;/a&gt;, cogito-v2-preview-deepseek-671B-MoE, cogito-v2-preview-llama-405B, cogito-v2-preview-llama-109B-MoE and cogito-v2-preview-llama-70B. These follow their &lt;a href="https://www.deepcogito.com/research/cogito-v1-preview"&gt;v1 preview models&lt;/a&gt; in April at smaller 3B, 8B, 14B, 32B and 70B sizes. It looks like their unique contribution here is "distilling inference-time reasoning back into the model’s parameters" - demonstrating a form of self-improvement. I haven't tried any of their models myself yet.&lt;/li&gt;
&lt;li&gt;Mistral released &lt;a href="https://mistral.ai/news/codestral-25-08"&gt;Codestral 25.08&lt;/a&gt;, an update to their Codestral model which is specialized for fill-in-the-middle autocomplete as seen in text editors like VS Code, Zed and Cursor.&lt;/li&gt;
&lt;li&gt;And an anonymous stealth preview model called Horizon Alpha running &lt;a href="https://openrouter.ai/openrouter/horizon-alpha/activity"&gt;on OpenRouter&lt;/a&gt; was released yesterday and is attracting a lot of attention.&lt;/li&gt;
&lt;/ul&gt;

    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/mistral"&gt;mistral&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/cohere"&gt;cohere&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openrouter"&gt;openrouter&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="mistral"/><category term="cohere"/><category term="llm-release"/><category term="openrouter"/></entry><entry><title>Qwen3-Coder: Agentic Coding in the World</title><link href="https://simonwillison.net/2025/Jul/22/qwen3-coder/#atom-tag" rel="alternate"/><published>2025-07-22T22:52:02+00:00</published><updated>2025-07-22T22:52:02+00:00</updated><id>https://simonwillison.net/2025/Jul/22/qwen3-coder/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://qwenlm.github.io/blog/qwen3-coder/"&gt;Qwen3-Coder: Agentic Coding in the World&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
It turns out that &lt;a href="https://simonwillison.net/2025/Jul/22/qwen3-235b-a22b-instruct-2507/"&gt;as I was typing up&lt;/a&gt; my notes on Qwen3-235B-A22B-Instruct-2507 the Qwen team were unleashing something much bigger:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Today, we’re announcing Qwen3-Coder, our most agentic code model to date. Qwen3-Coder is available in multiple sizes, but we’re excited to introduce its most powerful variant first: Qwen3-Coder-480B-A35B-Instruct — a 480B-parameter Mixture-of-Experts model with 35B active parameters which supports the context length of 256K tokens natively and 1M tokens with extrapolation methods, offering exceptional performance in both coding and agentic tasks.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This is another Apache 2.0 licensed open weights model, available as &lt;a href="https://huggingface.co/Qwen/Qwen3-Coder-480B-A35B-Instruct"&gt;Qwen3-Coder-480B-A35B-Instruct&lt;/a&gt; and &lt;a href="https://huggingface.co/Qwen/Qwen3-Coder-480B-A35B-Instruct-FP8"&gt;Qwen3-Coder-480B-A35B-Instruct-FP8&lt;/a&gt; on Hugging Face.&lt;/p&gt;
&lt;p&gt;I used &lt;a href="https://app.hyperbolic.ai/models/qwen3-coder-480b-a35b-instruct"&gt;qwen3-coder-480b-a35b-instruct on the Hyperbolic playground&lt;/a&gt; to run my "Generate an SVG of a pelican riding a bicycle" test prompt:&lt;/p&gt;
&lt;p&gt;&lt;img alt="The bicycle has no spokes. The pelican is light yellow and is overlapping the middle of the bicycle, not perching on it - it has a large yellow beak and a weird red lower beak or wattle." src="https://static.simonwillison.net/static/2025/Qwen3-Coder-480B-A35B-Instruct-FP8.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;I actually slightly prefer the one &lt;a href="https://simonwillison.net/2025/Jul/22/qwen3-235b-a22b-instruct-2507/"&gt;I got from qwen3-235b-a22b-07-25&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;It's also available &lt;a href="https://openrouter.ai/qwen/qwen3-coder"&gt;as qwen3-coder on OpenRouter&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;In addition to the new model, Qwen released their own take on an agentic terminal coding assistant called &lt;a href="https://github.com/QwenLM/qwen-code"&gt;qwen-code&lt;/a&gt;, which they describe in their blog post as being "Forked from Gemini Code" (they mean &lt;a href="https://github.com/google-gemini/gemini-cli"&gt;gemini-cli&lt;/a&gt;) - which is Apache 2.0 so a fork is in keeping with the license.&lt;/p&gt;
&lt;p&gt;They focused &lt;em&gt;really hard&lt;/em&gt; on code performance for this release, including generating synthetic data tested using 20,000 parallel environments on Alibaba Cloud:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;In the post-training phase of Qwen3-Coder, we introduced long-horizon RL (Agent RL) to encourage the model to solve real-world tasks through multi-turn interactions using tools. The key challenge of Agent RL lies in environment scaling. To address this, we built a scalable system capable of running 20,000 independent environments in parallel, leveraging Alibaba Cloud’s infrastructure. The infrastructure provides the necessary feedback for large-scale reinforcement learning and supports evaluation at scale. As a result, Qwen3-Coder achieves state-of-the-art performance among open-source models on SWE-Bench Verified without test-time scaling.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;To further burnish their coding credentials, the announcement includes instructions for running their new model using both Claude Code and Cline using custom API base URLs that point to Qwen's own compatibility proxies.&lt;/p&gt;
&lt;p&gt;Pricing for Qwen's own hosted models (through Alibaba Cloud) &lt;a href="https://www.alibabacloud.com/help/en/model-studio/models"&gt;looks competitive&lt;/a&gt;. This is the first model I've seen that sets different prices for four different sizes of input:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Pricing table with three columns showing Input token count (0-32K, 32K-128K, 128K-256K, 256K-1M), Input price (Million tokens) ($1, $1.8, $3, $6), and Output price (Million tokens) ($5, $9, $15, $60)" src="https://static.simonwillison.net/static/2025/qwen3-coder-plus-prices.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;This kind of pricing reflects how inference against longer inputs is more expensive to process. Gemini 2.5 Pro has two different prices for above or below 200,000 tokens.&lt;/p&gt;
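&lt;p&gt;Here's a sketch of how that tiered billing works, assuming the tier is chosen by input size and the whole request is billed at that tier's rates - my reading of the table, not Alibaba's actual billing logic:&lt;/p&gt;

```python
# Tiers from the pricing table above: (max input tokens, input $/M, output $/M).
# Treating "32K" etc. as round thousands is an approximation.
TIERS = [
    (32_000, 1.0, 5.0),
    (128_000, 1.8, 9.0),
    (256_000, 3.0, 15.0),
    (1_000_000, 6.0, 60.0),
]

def request_cost(input_tokens, output_tokens):
    """USD cost of one request, billed at the tier its input size falls into."""
    for max_input, in_price, out_price in TIERS:
        if input_tokens > max_input:
            continue
        return (input_tokens * in_price + output_tokens * out_price) / 1_000_000
    raise ValueError("input exceeds the 1M token limit")

# A 50K-token prompt falls in the 32K-128K tier:
print(f"${request_cost(50_000, 4_000):.3f}")  # $0.126
```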
&lt;p&gt;Awni Hannun &lt;a href="https://x.com/awnihannun/status/1947771502058672219"&gt;reports&lt;/a&gt; running a &lt;a href="https://huggingface.co/mlx-community/Qwen3-Coder-480B-A35B-Instruct-4bit"&gt;4-bit quantized MLX version&lt;/a&gt; on a 512GB M3 Ultra Mac Studio at 24 tokens/second using 272GB of RAM, getting &lt;a href="https://x.com/awnihannun/status/1947772369440997807"&gt;great results&lt;/a&gt; for "&lt;code&gt;write a python script for a bouncing yellow ball within a square, make sure to handle collision detection properly. make the square slowly rotate. implement it in python. make sure ball stays within the square&lt;/code&gt;".

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://x.com/Alibaba_Qwen/status/1947766835023335516"&gt;@Alibaba_Qwen&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-programming"&gt;ai-assisted-programming&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/qwen"&gt;qwen&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-pricing"&gt;llm-pricing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openrouter"&gt;openrouter&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/coding-agents"&gt;coding-agents&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-in-china"&gt;ai-in-china&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="ai-assisted-programming"/><category term="qwen"/><category term="llm-pricing"/><category term="pelican-riding-a-bicycle"/><category term="llm-release"/><category term="openrouter"/><category term="coding-agents"/><category term="ai-in-china"/></entry><entry><title>Qwen/Qwen3-235B-A22B-Instruct-2507</title><link href="https://simonwillison.net/2025/Jul/22/qwen3-235b-a22b-instruct-2507/#atom-tag" rel="alternate"/><published>2025-07-22T22:07:12+00:00</published><updated>2025-07-22T22:07:12+00:00</updated><id>https://simonwillison.net/2025/Jul/22/qwen3-235b-a22b-instruct-2507/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://huggingface.co/Qwen/Qwen3-235B-A22B-Instruct-2507"&gt;Qwen/Qwen3-235B-A22B-Instruct-2507&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Significant new model release from Qwen, published yesterday without much fanfare. (&lt;strong&gt;Update&lt;/strong&gt;: probably because they were cooking the much larger &lt;a href="https://simonwillison.net/2025/Jul/22/qwen3-coder/"&gt;Qwen3-Coder-480B-A35B-Instruct&lt;/a&gt; which they released just now.)&lt;/p&gt;
&lt;p&gt;This is a follow-up to their &lt;a href="https://simonwillison.net/2025/Apr/29/qwen-3/"&gt;April release&lt;/a&gt; of the full Qwen 3 model family, which included a Qwen3-235B-A22B model which could handle both reasoning and non-reasoning prompts (via a &lt;code&gt;/no_think&lt;/code&gt; toggle).&lt;/p&gt;
&lt;p&gt;The new &lt;code&gt;Qwen3-235B-A22B-Instruct-2507&lt;/code&gt; ditches that mechanism - this is exclusively a &lt;strong&gt;non-reasoning&lt;/strong&gt; model. It looks like Qwen have new reasoning models in the pipeline.&lt;/p&gt;
&lt;p&gt;This new model is Apache 2 licensed and comes in two official sizes: a BF16 model (437.91GB of files on Hugging Face) and &lt;a href="https://huggingface.co/Qwen/Qwen3-235B-A22B-Instruct-2507-FP8"&gt;an FP8 variant&lt;/a&gt; (220.20GB). VentureBeat &lt;a href="https://venturebeat.com/ai/alibabas-new-open-source-qwen3-235b-a22b-2507-beats-kimi-2-and-offers-low-compute-version/#h-fp8-version-lets-enterprises-run-qwen-3-with-far-less-memory-and-far-less-compute"&gt;estimate&lt;/a&gt; that the large model needs 88GB of VRAM while the smaller one should run in ~30GB.&lt;/p&gt;
&lt;p&gt;The benchmarks on these new models look &lt;em&gt;very promising&lt;/em&gt;. Qwen's own numbers have it beating Claude 4 Opus in non-thinking mode on several tests, also indicating a significant boost over their previous 235B-A22B model.&lt;/p&gt;
&lt;p&gt;I haven't seen any independent benchmark results yet. Here's what I got for "Generate an SVG of a pelican riding a bicycle", which I ran using the &lt;a href="https://openrouter.ai/qwen/qwen3-235b-a22b-07-25:free"&gt;qwen3-235b-a22b-07-25:free on OpenRouter&lt;/a&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm install llm-openrouter
llm -m openrouter/qwen/qwen3-235b-a22b-07-25:free \
  "Generate an SVG of a pelican riding a bicycle"
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img alt="Description by Claude Sonnet 4: Cartoon illustration of a white duck sitting on a black bicycle against a blue sky with a white cloud, yellow sun, and green grass below" src="https://static.simonwillison.net/static/2025/qwen3-235b-a22b-07-25.jpg" /&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/qwen"&gt;qwen&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openrouter"&gt;openrouter&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-in-china"&gt;ai-in-china&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="llm"/><category term="qwen"/><category term="pelican-riding-a-bicycle"/><category term="llm-release"/><category term="openrouter"/><category term="ai-in-china"/></entry><entry><title>Grok 4</title><link href="https://simonwillison.net/2025/Jul/10/grok-4/#atom-tag" rel="alternate"/><published>2025-07-10T19:36:03+00:00</published><updated>2025-07-10T19:36:03+00:00</updated><id>https://simonwillison.net/2025/Jul/10/grok-4/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://docs.x.ai/docs/models/grok-4-0709"&gt;Grok 4&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Released last night, Grok 4 is now available via both API and a paid subscription for end-users.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;Update:&lt;/strong&gt; If you ask it about controversial topics it will sometimes &lt;a href="https://simonwillison.net/2025/Jul/11/grok-musk/"&gt;search X for tweets "from:elonmusk"&lt;/a&gt;!&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Key characteristics: image and text input, text output. 256,000 context length (twice that of Grok 3). It's a reasoning model where you can't see the reasoning tokens or turn off reasoning mode.&lt;/p&gt;
&lt;p&gt;xAI released results showing Grok 4 beating other models on most of the significant benchmarks. I haven't been able to find their own written version of these (the launch was a &lt;a href="https://x.com/xai/status/1943158495588815072"&gt;livestream video&lt;/a&gt;) but here's &lt;a href="https://techcrunch.com/2025/07/09/elon-musks-xai-launches-grok-4-alongside-a-300-monthly-subscription/"&gt;a TechCrunch report&lt;/a&gt; that includes those scores. It's not clear to me if these benchmark results are for Grok 4 or Grok 4 Heavy.&lt;/p&gt;
&lt;p&gt;I ran &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle/"&gt;my own benchmark&lt;/a&gt; using Grok 4 &lt;a href="https://openrouter.ai/x-ai/grok-4"&gt;via OpenRouter&lt;/a&gt; (since I have API keys there already). &lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm -m openrouter/x-ai/grok-4 "Generate an SVG of a pelican riding a bicycle" \
  -o max_tokens 10000
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img alt="Description below." src="https://static.simonwillison.net/static/2025/grok4-pelican.png" /&gt;&lt;/p&gt;
&lt;p&gt;I then asked Grok to describe the image it had just created:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm -m openrouter/x-ai/grok-4 -o max_tokens 10000 \
  -a https://static.simonwillison.net/static/2025/grok4-pelican.png \
  'describe this image'
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here's &lt;a href="https://gist.github.com/simonw/ec9aee006997b6ae7f2bba07da738279#response"&gt;the result&lt;/a&gt;. It described it as a "cute, bird-like creature (resembling a duck, chick, or stylized bird)".&lt;/p&gt;
&lt;p&gt;The most interesting independent analysis I've seen so far is &lt;a href="https://twitter.com/ArtificialAnlys/status/1943166841150644622"&gt;this one from Artificial Analysis&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;We have run our full suite of benchmarks and Grok 4 achieves an Artificial Analysis Intelligence Index of 73, ahead of OpenAI o3 at 70, Google Gemini 2.5 Pro at 70, Anthropic Claude 4 Opus at 64 and DeepSeek R1 0528 at 68.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The timing of the release is somewhat unfortunate, given that Grok 3 made headlines &lt;a href="https://www.theguardian.com/technology/2025/jul/09/grok-ai-praised-hitler-antisemitism-x-ntwnfb"&gt;just this week&lt;/a&gt; after a &lt;a href="https://github.com/xai-org/grok-prompts/commit/535aa67a6221ce4928761335a38dea8e678d8501#diff-dec87f526b85f35cb546db6b1dd39d588011503a94f1aad86d023615a0e9e85aR6"&gt;clumsy system prompt update&lt;/a&gt; - presumably another attempt to make Grok "less woke" - caused it to start firing off antisemitic tropes and referring to itself as MechaHitler.&lt;/p&gt;
&lt;p&gt;My best guess is that these lines in the prompt were the root of the problem:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;- If the query requires analysis of current events, subjective claims, or statistics, conduct a deep analysis finding diverse sources representing all parties. Assume subjective viewpoints sourced from the media are biased. No need to repeat this to the user.&lt;/code&gt;&lt;br&gt;
&lt;code&gt;- The response should not shy away from making claims which are politically incorrect, as long as they are well substantiated.&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;If xAI expect developers to start building applications on top of Grok, they need to do a lot better than this. Absurd self-inflicted mistakes like this do not build developer trust!&lt;/p&gt;
&lt;p&gt;As it stands, Grok 4 isn't even accompanied by a model card.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;Update:&lt;/strong&gt; Ian Bicking &lt;a href="https://bsky.app/profile/ianbicking.org/post/3ltn3r7g4xc2i"&gt;makes an astute point&lt;/a&gt;:&lt;/em&gt;&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;It feels very credulous to ascribe what happened to a system prompt update. Other models can't be pushed into racism, Nazism, and ideating rape with a system prompt tweak.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;em&gt;Even if that system prompt change was responsible for unlocking this behavior, the fact that it was able to speaks to a much looser approach to model safety by xAI compared to other providers.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;Update 12th July 2025:&lt;/strong&gt; Grok posted &lt;a href="https://simonwillison.net/2025/Jul/12/grok/"&gt;a postmortem&lt;/a&gt; blaming the behavior on a different set of prompts, including "you are not afraid to offend people who are politically correct", that were not included in the system prompts they had published to their GitHub repository.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Grok 4 is competitively priced. It's $3/million for input tokens and $15/million for output tokens - the same price as Claude Sonnet 4. Once you go above 128,000 input tokens the price doubles to $6/$30 (Gemini 2.5 Pro has a similar price increase for longer inputs). I've added these prices to &lt;a href="https://www.llm-prices.com/"&gt;llm-prices.com&lt;/a&gt;.&lt;/p&gt;
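&lt;p&gt;To make that tier boundary concrete, here's a quick back-of-envelope helper (my own sketch, not an official calculator - it assumes both rates double once the input exceeds 128,000 tokens, per the figures above):&lt;/p&gt;

```python
def grok4_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate a Grok 4 API call cost from the published rates.

    $3/$15 per million input/output tokens, doubling to $6/$30
    once the input exceeds 128,000 tokens.
    """
    if input_tokens > 128_000:
        in_rate, out_rate = 6.0, 30.0
    else:
        in_rate, out_rate = 3.0, 15.0
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000

# A 100,000 token prompt with a 2,000 token reply costs about 33 cents:
print(round(grok4_cost_usd(100_000, 2_000), 4))  # 0.33
```

&lt;p&gt;Crossing the threshold matters: the same 2,000 token answer against a 200,000 token prompt costs $1.26, rather than the $0.63 it would at the lower rates.&lt;/p&gt;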
&lt;p&gt;Consumers can access Grok 4 via a new $30/month or $300/year "SuperGrok" plan - or a $300/month or $3,000/year "SuperGrok Heavy" plan providing access to Grok 4 Heavy.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Screenshot of subscription pricing page showing two plans: SuperGrok at $30.00/month (marked as Popular) with Grok 4 and Grok 3 increased access, features including Everything in Basic, Context Memory 128,000 Tokens, and Voice with vision; SuperGrok Heavy at $300.00/month with Grok 4 Heavy exclusive preview, Grok 4 and Grok 3 increased access, features including Everything in SuperGrok, Early access to new features, and Dedicated Support. Toggle at top shows &amp;quot;Pay yearly save 16%&amp;quot; and &amp;quot;Pay monthly&amp;quot; options with Pay monthly selected." src="https://static.simonwillison.net/static/2025/supergrok-pricing.jpg" /&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/vision-llms"&gt;vision-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-pricing"&gt;llm-pricing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-reasoning"&gt;llm-reasoning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/grok"&gt;grok&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-ethics"&gt;ai-ethics&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openrouter"&gt;openrouter&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/system-prompts"&gt;system-prompts&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/artificial-analysis"&gt;artificial-analysis&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/xai"&gt;xai&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="vision-llms"/><category term="llm-pricing"/><category term="pelican-riding-a-bicycle"/><category term="llm-reasoning"/><category term="grok"/><category term="ai-ethics"/><category term="llm-release"/><category term="openrouter"/><category term="system-prompts"/><category term="artificial-analysis"/><category term="xai"/></entry><entry><title>Become a command-line superhero with Simon Willison's llm tool</title><link href="https://simonwillison.net/2025/Jul/7/become-a-command-line-superhero-with-simon-willisons-llm-tool/#atom-tag" rel="alternate"/><published>2025-07-07T18:48:22+00:00</published><updated>2025-07-07T18:48:22+00:00</updated><id>https://simonwillison.net/2025/Jul/7/become-a-command-line-superhero-with-simon-willisons-llm-tool/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.youtube.com/watch?v=UZ-9U1W0e4o"&gt;Become a command-line superhero with Simon Willison&amp;#x27;s llm tool&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Christopher Smith ran a mini hackathon in Albany, New York at the weekend around uses of my &lt;a href="https://llm.datasette.io/"&gt;LLM&lt;/a&gt; tool - the first in-person event I'm aware of dedicated to that project!&lt;/p&gt;
&lt;p&gt;He prepared this video version of the opening talk he presented there, and it's the best video introduction I've seen yet for how to get started experimenting with LLM and its various plugins:&lt;/p&gt;
&lt;p&gt;&lt;lite-youtube videoid="UZ-9U1W0e4o" js-api="js-api"
  title="Become a command-line superhero with Simon Willison's llm tool"
  playlabel="Play"
&gt; &lt;/lite-youtube&gt;&lt;/p&gt;

&lt;p&gt;Christopher introduces LLM and the &lt;a href="https://github.com/simonw/llm-openrouter"&gt;llm-openrouter&lt;/a&gt; plugin, touches on various features including &lt;a href="https://llm.datasette.io/en/stable/fragments.html"&gt;fragments&lt;/a&gt; and &lt;a href="https://llm.datasette.io/en/stable/schemas.html"&gt;schemas&lt;/a&gt; and also shows LLM used in conjunction with &lt;a href="https://github.com/yamadashy/repomix"&gt;repomix&lt;/a&gt; to dump full source repos into an LLM at once.&lt;/p&gt;
&lt;p&gt;Here are &lt;a href="https://gist.github.com/chriscarrollsmith/4670b8466e19e77723327cb555f638e6"&gt;the notes&lt;/a&gt; that accompanied the talk.&lt;/p&gt;
&lt;p&gt;I learned about &lt;a href="https://openrouter.ai/openrouter/cypher-alpha:free"&gt;cypher-alpha:free&lt;/a&gt; from this video - a free trial preview model currently available on OpenRouter from an anonymous vendor. I hadn't realized OpenRouter hosted these - it's similar to how &lt;a href="https://lmarena.ai/"&gt;LMArena&lt;/a&gt; often hosts anonymous previews.&lt;/p&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://bsky.app/profile/chriscarrollsmith.bsky.social/post/3ltcn2kd62c25"&gt;@chriscarrollsmith.bsky.social&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openrouter"&gt;openrouter&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="llm"/><category term="openrouter"/></entry><entry><title>Understanding the recent criticism of the Chatbot Arena</title><link href="https://simonwillison.net/2025/Apr/30/criticism-of-the-chatbot-arena/#atom-tag" rel="alternate"/><published>2025-04-30T22:55:46+00:00</published><updated>2025-04-30T22:55:46+00:00</updated><id>https://simonwillison.net/2025/Apr/30/criticism-of-the-chatbot-arena/#atom-tag</id><summary type="html">
    &lt;p&gt;The &lt;a href="https://lmarena.ai/"&gt;Chatbot Arena&lt;/a&gt; has become the go-to place for &lt;a href="https://simonwillison.net/2024/Jun/27/ai-worlds-fair/#slide.013.jpeg"&gt;vibes-based evaluation&lt;/a&gt; of LLMs over the past two years. The project, originating at UC Berkeley, is home to a large community of model enthusiasts who submit prompts to two randomly selected anonymous models and pick their favorite response. This produces an &lt;a href="https://en.wikipedia.org/wiki/Elo_rating_system"&gt;Elo score&lt;/a&gt; leaderboard of the "best" models, similar to how chess rankings work.&lt;/p&gt;
&lt;p&gt;It's become one of the most influential leaderboards in the LLM world, which means that billions of dollars of investment are now being evaluated based on those scores.&lt;/p&gt;
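&lt;p&gt;If you haven't seen how Elo works, here's the classic online update in miniature (an illustrative sketch only - the arena actually fits a Bradley-Terry style model over all battles rather than updating ratings one vote at a time):&lt;/p&gt;

```python
def elo_update(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0):
    """One Elo update: the winner gains points from the loser,
    scaled by how surprising the result was."""
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    delta = k * ((1.0 if a_won else 0.0) - expected_a)
    return rating_a + delta, rating_b - delta

# Two equally rated models: a win transfers k/2 = 16 points.
print(elo_update(1000.0, 1000.0, True))  # (1016.0, 984.0)
```

&lt;p&gt;An upset win against a much higher-rated opponent transfers far more than 16 points, which is what lets a genuinely strong new model climb the leaderboard quickly.&lt;/p&gt;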
&lt;h4 id="the-leaderboard-illusion"&gt;The Leaderboard Illusion&lt;/h4&gt;
&lt;p&gt;A new paper, &lt;strong&gt;&lt;a href="https://arxiv.org/abs/2504.20879"&gt;The Leaderboard Illusion&lt;/a&gt;&lt;/strong&gt;, by authors from Cohere Labs, AI2, Princeton, Stanford, University of Waterloo and University of Washington spends 68 pages dissecting and criticizing how the arena works.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/leaderboard-illusion.jpg" alt="Title page of academic paper &amp;quot;The Leaderboard Illusion&amp;quot; with authors Shivalika Singh, Yiyang Nan, Alex Wang, Daniel D'souza, Sayash Kapoor, Ahmet Üstün, Sanmi Koyejo, Yuntian Deng, Shayne Longpre, Noah Smith, Beyza Ermis, Marzieh Fadaee, and Sara Hooker from various institutions including Cohere Labs, Cohere, Princeton University, Stanford University, University of Waterloo, Massachusetts Institute of Technology, Allen Institute for Artificial Intelligence, and University of Washington. Corresponding authors: {shivalikasingh, marzieh, sarahooker}@cohere.com" style="max-width: 100%" /&gt;&lt;/p&gt;
&lt;p&gt;Even prior to this paper there have been rumbles of dissatisfaction with the arena for a while, based on intuitions that the best models were not necessarily bubbling to the top. I've personally been suspicious of the fact that my preferred daily driver, Claude 3.7 Sonnet, rarely breaks the top 10 (it's sat at 20th right now).&lt;/p&gt;
&lt;p&gt;This all came to a head a few weeks ago when the &lt;a href="https://simonwillison.net/2025/Apr/5/llama-4-notes/"&gt;Llama 4 launch&lt;/a&gt; was mired by a leaderboard scandal: it turned out that their model which topped the leaderboard &lt;a href="https://simonwillison.net/2025/Apr/5/llama-4-notes/#lmarena"&gt;wasn't the same model&lt;/a&gt; that they released to the public! The arena released &lt;a href="https://simonwillison.net/2025/Apr/8/lmaren/"&gt;a pseudo-apology&lt;/a&gt; for letting that happen.&lt;/p&gt;
&lt;p&gt;This helped bring focus to &lt;a href="https://blog.lmarena.ai/blog/2024/policy/#our-policy"&gt;the arena's policy&lt;/a&gt; of allowing model providers to anonymously preview their models there, in order to earn a ranking prior to their official launch date. This is popular with their community, who enjoy trying out models before anyone else, but the scale of the preview testing revealed in this new paper surprised me.&lt;/p&gt;
&lt;p&gt;From the new paper's abstract (highlights mine):&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;We find that undisclosed private testing practices benefit a handful of providers who are able to test multiple variants before public release and retract scores if desired. We establish that the ability of these providers to choose the best score leads to biased Arena scores due to selective disclosure of performance results. &lt;strong&gt;At an extreme, we identify 27 private LLM variants tested by Meta in the lead-up to the Llama-4 release&lt;/strong&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;If proprietary model vendors can submit dozens of test models, and then selectively pick the ones that score highest, it is not surprising that they end up hogging the top of the charts!&lt;/p&gt;
&lt;p&gt;This feels like a classic example of gaming a leaderboard. There are model characteristics that resonate with evaluators there that may not directly relate to the quality of the underlying model. For example, bulleted lists and answers of a very specific length tend to do better.&lt;/p&gt;
&lt;p&gt;It is worth noting that this is quite a salty paper (highlights mine):&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;It is important to acknowledge that &lt;strong&gt;a subset of the authors of this paper have submitted several open-weight models to Chatbot Arena&lt;/strong&gt;: command-r (Cohere, 2024), command-r-plus (Cohere, 2024) in March 2024, aya-expanse (Dang et al., 2024b) in October 2024, aya-vision (Cohere, 2025) in March 2025, command-a (Cohere et al., 2025) in March 2025. We started this extensive study driven by this submission experience with the leaderboard.&lt;/p&gt;
&lt;p&gt;While submitting Aya Expanse (Dang et al., 2024b) for testing, &lt;strong&gt;we observed that our open-weight model appeared to be notably under-sampled compared to proprietary models&lt;/strong&gt; — a discrepancy that is further reflected in Figures 3, 4, and 5. In response, &lt;strong&gt;we contacted the Chatbot Arena organizers to inquire about these differences&lt;/strong&gt; in November 2024. &lt;strong&gt;In the course of our discussions, we learned that some providers were testing multiple variants privately, a practice that appeared to be selectively disclosed and limited to only a few model providers&lt;/strong&gt;. We believe that our initial inquiries partly prompted Chatbot Arena to release &lt;a href=""&gt;a public blog&lt;/a&gt; in December 2024 detailing their benchmarking policy which committed to a consistent sampling rate across models. However, subsequent anecdotal observations of continued sampling disparities and the presence of numerous models with private aliases motivated us to undertake a more systematic analysis.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;To summarize the other key complaints from the paper:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Unfair sampling rates&lt;/strong&gt;: a small number of proprietary vendors (most notably Google and OpenAI) have their models randomly selected in a much higher number of contests.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lack of transparency&lt;/strong&gt; concerning the scale of proprietary model testing that's going on.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Unfair removal rates&lt;/strong&gt;: "We find deprecation disproportionately impacts open-weight and open-source models, creating large asymmetries in data access over time" - also "out of 243 public models, 205 have been silently deprecated." The longer a model stays in the arena, the more chance it has to win competitions and bubble to the top.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The Arena responded to the paper &lt;a href="https://twitter.com/lmarena_ai/status/1917492084359192890"&gt;in a tweet&lt;/a&gt;. They emphasized:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;We designed our policy to prevent model providers from just reporting the highest score they received during testing. We only publish the score for the model they release publicly.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I'm disappointed by this response, because it skips over the point from the paper that I find most interesting. If commercial vendors are able to submit dozens of models to the arena and then cherry-pick for publication just the model that gets the highest score, quietly retracting the others with their scores unpublished, that means the arena is very actively incentivizing models to game the system. It's also obscuring a valuable signal to help the community understand how well those vendors are doing at building useful models.&lt;/p&gt;
&lt;p&gt;Here's &lt;a href="https://twitter.com/lmarena_ai/status/1917668731481907527"&gt;a second tweet&lt;/a&gt; where they take issue with "factual errors and misleading statements" in the paper, but still fail to address that core point. I'm hoping they'll respond to &lt;a href="https://x.com/simonw/status/1917672048031404107"&gt;my follow-up question&lt;/a&gt; asking for clarification around the cherry-picking loophole described by the paper.&lt;/p&gt;

&lt;h4 id="transparency"&gt;I want more transparency&lt;/h4&gt;

&lt;p&gt;The thing I most want here is transparency.&lt;/p&gt;

&lt;p&gt;If a model sits in top place, I'd like a footnote that resolves to additional information about how that vendor tested that model. I'm particularly interested in knowing how many variants of that model the vendor tested. If they ran 21 different models over a 2 month period before selecting the "winning" model, I'd like to know that - and know what the scores were for all of those others that they didn't ship.&lt;/p&gt;

&lt;p&gt;This knowledge will help me personally evaluate how credible I find their score. Were they mainly gaming the benchmark or did they produce a new model family that universally scores highly even as they tweaked it to best fit the taste of the voters in the arena?&lt;/p&gt;

&lt;h4 id="openrouter"&gt;OpenRouter as an alternative?&lt;/h4&gt;

&lt;p&gt;If the arena isn't giving us a good enough impression of who is winning the race for best LLM at the moment, what else can we look to?&lt;/p&gt;

&lt;p&gt;Andrej Karpathy &lt;a href="https://x.com/karpathy/status/1917546757929722115"&gt;discussed the new paper&lt;/a&gt; on Twitter this morning and proposed an alternative source of rankings instead:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;It's quite likely that LM Arena (and LLM providers) can continue to iterate and improve within this paradigm, but in addition I also have a new candidate in mind to potentially join the ranks of "top tier eval". It is the &lt;strong&gt;&lt;a href="https://openrouter.ai/rankings"&gt;OpenRouterAI LLM rankings&lt;/a&gt;&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;Basically, OpenRouter allows people/companies to quickly switch APIs between LLM providers. All of them have real use cases (not toy problems or puzzles), they have their own private evals, and all of them have an incentive to get their choices right, so by choosing one LLM over another they are directly voting for some combo of capability+cost.&lt;/p&gt;
&lt;p&gt;I don't think OpenRouter is there just yet in both the quantity and diversity of use, but something of this kind I think has great potential to grow into a very nice, very difficult to game eval.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I only recently learned about &lt;a href="https://openrouter.ai/rankings?view=trending"&gt;these rankings&lt;/a&gt; but I agree with Andrej: they reveal some interesting patterns that look to match my own intuitions about which models are the most useful (and economical) on which to build software. Here's a snapshot of their current "Top this month" table:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/openrouter-top-month.jpg" alt="Screenshot of a trending AI models list with navigation tabs &amp;quot;Top today&amp;quot;, &amp;quot;Top this week&amp;quot;, &amp;quot;Top this month&amp;quot; (selected), and &amp;quot;Trending&amp;quot;. The list shows ranked models: 1. Anthropic: Claude 3.7 Sonnet (1.21T tokens, ↑14%), 2. Google: Gemini 2.0 Flash (1.04T tokens, ↓17%), 3. OpenAI: GPT-4o-mini (503B tokens, ↑191%), 5. DeepSeek: DeepSeek V3 0324 (free) (441B tokens, ↑434%), 6. Quasar Alpha (296B tokens, new), 7. Meta: Llama 3.3 70B Instruct (261B tokens, ↓4%), 8. Google: Gemini 2.5 Pro Preview (228B tokens, new), 9. DeepSeek: R1 (free) (211B tokens, ↓29%), 10. Anthropic: Claude 3.7 Sonnet (thinking) (207B tokens, ↓15%), 11. DeepSeek: DeepSeek V3 0324 (200B tokens, ↑711%), 12. Google: Gemini 1.5 Flash 8B (165B tokens, ↑10%)." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;The one big weakness of this ranking system is that a single, high volume OpenRouter customer could have an outsized effect on the rankings should they decide to switch models. It will be interesting to see if OpenRouter can design their own statistical mechanisms to help reduce that effect.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/andrej-karpathy"&gt;andrej-karpathy&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-ethics"&gt;ai-ethics&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openrouter"&gt;openrouter&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/chatbot-arena"&gt;chatbot-arena&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/paper-review"&gt;paper-review&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="ai"/><category term="andrej-karpathy"/><category term="generative-ai"/><category term="llms"/><category term="ai-ethics"/><category term="openrouter"/><category term="chatbot-arena"/><category term="paper-review"/></entry><entry><title>Maybe Meta's Llama claims to be open source because of the EU AI act</title><link href="https://simonwillison.net/2025/Apr/19/llama-eu-ai-act/#atom-tag" rel="alternate"/><published>2025-04-19T23:58:18+00:00</published><updated>2025-04-19T23:58:18+00:00</updated><id>https://simonwillison.net/2025/Apr/19/llama-eu-ai-act/#atom-tag</id><summary type="html">
    &lt;p&gt;I encountered a theory a while ago that one of the reasons Meta insist on using the term “open source” for their Llama models despite the Llama license &lt;a href="https://opensource.org/blog/metas-llama-license-is-still-not-open-source"&gt;not actually conforming&lt;/a&gt; to the terms of the &lt;a href="https://opensource.org/osd"&gt;Open Source Definition&lt;/a&gt; is that the EU’s AI act includes special rules for open source models without requiring OSI compliance.&lt;/p&gt;
&lt;p&gt;Since the EU AI act (12 July 2024) is &lt;a href="https://artificialintelligenceact.eu/the-act/"&gt;available online&lt;/a&gt; I decided to take a look for myself.&lt;/p&gt;
&lt;p&gt;Here’s &lt;a href="https://eur-lex.europa.eu/legal-content/EN/TXT/HTML/?uri=OJ:L_202401689"&gt;one giant HTML page&lt;/a&gt; containing the full text of the act in English. I checked the token count with &lt;a href="https://github.com/simonw/ttok"&gt;ttok&lt;/a&gt; (which uses the OpenAI tokenizer, but it’s close enough to work as a good estimate for other models):&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;curl &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;https://eur-lex.europa.eu/legal-content/EN/TXT/HTML/?uri=OJ:L_202401689&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-k"&gt;|&lt;/span&gt; ttok&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;241,722 tokens. That should fit nicely into &lt;a href="https://simonwillison.net/2025/Apr/17/start-building-with-gemini-25-flash/"&gt;Gemini 2.5 Flash&lt;/a&gt; (or GPT-4.1 or Gemini 2.5 Pro).&lt;/p&gt;
&lt;p&gt;My Gemini API key was playing up so I ran it via &lt;a href="https://openrouter.ai/"&gt;OpenRouter&lt;/a&gt; (and &lt;a href="https://github.com/simonw/llm-openrouter"&gt;llm-openrouter&lt;/a&gt;) instead:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm -f &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;https://eur-lex.europa.eu/legal-content/EN/TXT/HTML/?uri=OJ:L_202401689&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt; \
  -m openrouter/google/gemini-2.5-flash-preview:thinking \
  -s &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;Summary of mentions of open source here, including what the document defines open source to mean&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Here's &lt;a href="https://gist.github.com/simonw/f2e341a2e8ea9ca75c6426fa85bc2dfc"&gt;the full answer&lt;/a&gt;. Relevant extract:&lt;/p&gt;
&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;Recital (89) states that third parties making accessible "tools, services, processes, or AI components other than general-purpose AI models" under a &lt;strong&gt;free and open-source licence&lt;/strong&gt; should not be mandated to comply with upstream provider responsibilities towards those who integrate them. It also encourages developers of such resources to implement documentation practices like model cards and data sheets to promote trustworthy AI.&lt;/li&gt;
&lt;li&gt;Recital (102) acknowledges that software and data, including models, released under a &lt;strong&gt;free and open-source licence&lt;/strong&gt; can contribute to research and innovation and provide growth opportunities.&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;I tracked down &lt;a href="https://artificialintelligenceact.eu/recital/89/"&gt;Recital 89&lt;/a&gt; and &lt;a href="https://artificialintelligenceact.eu/recital/102/"&gt;Recital 102&lt;/a&gt; to check. Sure enough, 89 says:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Third parties making accessible to the public tools, services, processes, or AI components other than general-purpose AI models, should not be mandated to comply with requirements targeting the responsibilities along the AI value chain&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;And in 102:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;General-purpose AI models released under free and open-source licences should be considered to ensure high levels of transparency and openness if their parameters, including the weights, the information on the model architecture, and the information on model usage are made publicly available. The licence should be considered to be free and open-source also when it allows users to run, copy, distribute, study, change and improve software and data, including models under the condition that the original provider of the model is credited, the identical or comparable terms of distribution are respected.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Gemini 2.5 Flash also pointed me to &lt;a href="https://artificialintelligenceact.eu/article/2/"&gt;Article 2(12)&lt;/a&gt;, which reads:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;12. This Regulation does not apply to AI systems released under free and open-source licences, unless they are placed on the market or put into service as high-risk AI systems or as an AI system that falls under &lt;a href="https://artificialintelligenceact.eu/article/5"&gt;Article 5&lt;/a&gt; or &lt;a href="https://artificialintelligenceact.eu/article/50/"&gt;50&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I asked a follow-up question:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm -c &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;so it says specific licenses but does it clarify which specific licenses count?&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;And got &lt;a href="https://gist.github.com/simonw/f2e341a2e8ea9ca75c6426fa85bc2dfc#response-1"&gt;back this response&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;No, the document &lt;strong&gt;does not list specific names of licenses&lt;/strong&gt; that qualify as "free and open-source".&lt;/p&gt;
&lt;p&gt;Instead of listing names (like GPL, MIT, Apache, etc.), the Regulation defines what kind of terms a license must grant for it to be considered "free and open-source" for the purposes of the Regulation.&lt;/p&gt;
&lt;p&gt;According to Recital (102), such a license must essentially allow software and data, including models, to be:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Openly shared.&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Where users can &lt;strong&gt;freely access, use, modify and redistribute&lt;/strong&gt; them or modified versions thereof.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The Recital further clarifies that a license can still be considered free and open-source even if it includes standard conditions like requiring &lt;strong&gt;attribution (crediting the original provider)&lt;/strong&gt; or respecting &lt;strong&gt;identical or comparable terms of distribution&lt;/strong&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I am very obviously not a lawyer, but based on the above I do find the theory credible that Meta are using "open source" in their marketing around Llama partly because the EU AI act has special exemptions for "open source" models that have nothing to do with the OSI definition of that term.&lt;/p&gt;
&lt;p&gt;Total cost to use Gemini 2.5 Flash for this? 5.1 cents for my first question and 4.3 cents for the follow-up. That second question was cheaper even though it built on the first because output tokens are more expensive than input tokens and the second answer was shorter than the first - using the "thinking" model output is charged at $3.50/million tokens, input is just $0.15/million.&lt;/p&gt;
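&lt;p&gt;Those numbers are easy to reproduce from the quoted rates (a rough sketch - my output token count here is an estimate, since only the input count is known exactly):&lt;/p&gt;

```python
# Gemini 2.5 Flash "thinking" rates quoted above, in dollars per million tokens
INPUT_RATE = 0.15
OUTPUT_RATE = 3.50

def cost_usd(input_tokens: int, output_tokens: int) -> float:
    return (input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE) / 1_000_000

# The act was ~241,722 input tokens; an answer of roughly 4,000 output
# tokens lands close to the 5.1 cent figure for the first question:
print(round(cost_usd(241_722, 4_000), 3))  # 0.05
```

&lt;p&gt;Note that even at $0.15/million the input side still contributes around 3.6 cents of that total - the document itself is most of the bill.&lt;/p&gt;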
&lt;p&gt;Using an LLM as a lawyer is obviously a terrible idea, but using one to crunch through a giant legal document and form a very rough layman's understanding of what it says feels perfectly cromulent to me.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Update:&lt;/strong&gt; Steve O'Grady &lt;a href="https://bsky.app/profile/sogrady.org/post/3ln7ipdbaek2s"&gt;points out&lt;/a&gt; that Meta/Facebook have been abusing the term "open source" for a lot longer than the EU AI act has been around - they were pulling shenanigans with a custom license for React &lt;a href="https://redmonk.com/sogrady/2017/09/26/facebooks-bsd-patents/"&gt;back in 2017&lt;/a&gt;.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/law"&gt;law&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/open-source"&gt;open-source&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llama"&gt;llama&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gemini"&gt;gemini&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/meta"&gt;meta&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/long-context"&gt;long-context&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-ethics"&gt;ai-ethics&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openrouter"&gt;openrouter&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="law"/><category term="open-source"/><category term="ai"/><category term="generative-ai"/><category term="llama"/><category term="llms"/><category term="llm"/><category term="gemini"/><category term="meta"/><category term="long-context"/><category term="ai-ethics"/><category term="openrouter"/></entry><entry><title>Initial impressions of Llama 4</title><link href="https://simonwillison.net/2025/Apr/5/llama-4-notes/#atom-tag" rel="alternate"/><published>2025-04-05T22:47:58+00:00</published><updated>2025-04-05T22:47:58+00:00</updated><id>https://simonwillison.net/2025/Apr/5/llama-4-notes/#atom-tag</id><summary type="html">
    &lt;p&gt;Dropping a model release as significant as Llama 4 on a weekend is plain unfair! So far the best place to learn about the new model family is &lt;a href="https://ai.meta.com/blog/llama-4-multimodal-intelligence/"&gt;this post on the Meta AI blog&lt;/a&gt;. They've released two new models today: Llama 4 Maverick is a 400B model (128 experts, 17B active parameters), text and image input with a 1 million token context length. Llama 4 Scout is 109B total parameters (16 experts, 17B active), also multi-modal and with a claimed 10 million token context length - an industry first.&lt;/p&gt;

&lt;p&gt;They also describe Llama 4 Behemoth, a not-yet-released "288 billion active parameter model with 16 experts that is our most powerful yet and among the world’s smartest LLMs". Behemoth has 2 trillion parameters total and was used to train both Scout and Maverick.&lt;/p&gt;
&lt;p&gt;No news yet on a Llama reasoning model beyond &lt;a href="https://www.llama.com/llama4-reasoning-is-coming/"&gt;this coming soon page&lt;/a&gt; with a looping video of an academic-looking llama.&lt;/p&gt;

&lt;p id="lmarena"&gt;Llama 4 Maverick is now sat in second place on &lt;a href="https://lmarena.ai/?leaderboard"&gt;the LM Arena leaderboard&lt;/a&gt;, just behind Gemini 2.5 Pro. &lt;em&gt;&lt;strong&gt;Update&lt;/strong&gt;: It turns out that's not the same model as the Maverick they released - I missed that their announcement says "Llama 4 Maverick offers a best-in-class performance to cost ratio with an experimental chat version scoring ELO of 1417 on LMArena."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;You can try them out using the chat interface from OpenRouter (or through the OpenRouter API) for &lt;a href="https://openrouter.ai/meta-llama/llama-4-scout"&gt;Llama 4 Scout&lt;/a&gt; and &lt;a href="https://openrouter.ai/meta-llama/llama-4-maverick"&gt;Llama 4 Maverick&lt;/a&gt;. OpenRouter are proxying through to &lt;a href="https://console.groq.com/docs/models"&gt;Groq&lt;/a&gt;, &lt;a href="https://fireworks.ai/models"&gt;Fireworks&lt;/a&gt; and &lt;a href="https://docs.together.ai/docs/serverless-models"&gt;Together&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Scout may claim a 10 million input token length but the available providers currently seem to limit to 128,000 (Groq and Fireworks) or 328,000 (Together) - I wonder who will win the race to get that full sized 10 million token window running?&lt;/p&gt;
&lt;p&gt;Llama 4 Maverick claims a 1 million token input length - Fireworks offers 1.05M while Together offers 524,000. Groq isn't offering Maverick yet.&lt;/p&gt;
&lt;p&gt;Meta AI's &lt;a href="https://github.com/meta-llama/llama-cookbook/blob/main/getting-started/build_with_llama_4.ipynb"&gt;build_with_llama_4 notebook&lt;/a&gt; offers a hint as to why 10M tokens is difficult:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Scout supports upto 10M context. On 8xH100, in bf16 you can get upto 1.4M tokens.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Jeremy Howard &lt;a href="https://twitter.com/jeremyphoward/status/1908607345393098878"&gt;says&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The models are both giant MoEs that can't be run on consumer GPUs, even with quant. [...]&lt;/p&gt;
&lt;p&gt;Perhaps Llama 4 will be a good fit for running on a Mac. Macs are particularly useful for MoE models, since they can have a lot of memory, and their lower compute perf doesn't matter so much, since with MoE fewer params are active. [...]&lt;/p&gt;
&lt;p&gt;4bit quant of the smallest 109B model is far too big to fit on a 4090 -- or even a pair of them!&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Ivan Fioravanti &lt;a href="https://twitter.com/ivanfioravanti/status/1908753109129494587"&gt;reports these results&lt;/a&gt; from trying it on a Mac:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Llama-4 Scout on MLX and M3 Ultra
tokens-per-sec / RAM&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;3bit: 52.924 / 47.261 GB&lt;/li&gt;
&lt;li&gt;4bit: 46.942 / 60.732 GB&lt;/li&gt;
&lt;li&gt;6bit: 36.260 / 87.729 GB&lt;/li&gt;
&lt;li&gt;8bit: 30.353 / 114.617 GB&lt;/li&gt;
&lt;li&gt;fp16: 11.670 / 215.848 GB&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;RAM needed:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;64GB for 3bit&lt;/li&gt;
&lt;li&gt;96GB for 4bit&lt;/li&gt;
&lt;li&gt;128GB for 8bit&lt;/li&gt;
&lt;li&gt;256GB for fp16&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
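&lt;p&gt;Those RAM numbers line up with a back-of-envelope estimate of weight memory as total parameters × bits ÷ 8 - the measured figures run a little higher because some layers stay at higher precision, plus the KV cache and runtime buffers. A rough sketch:&lt;/p&gt;

```python
# Estimate quantized weight memory for Llama 4 Scout (109B total params).
# This is a lower bound: real usage adds mixed-precision layers,
# KV cache and runtime overhead.
def weight_gb(total_params, bits):
    return total_params * bits / 8 / 1024**3

for bits in (3, 4, 6, 8, 16):
    print(f"{bits}-bit: ~{weight_gb(109e9, bits):.1f} GB")
```

The 3-bit estimate comes out around 38 GB against the measured 47 GB, and fp16 around 203 GB against the measured 216 GB - the gap being that overhead.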

&lt;p id="system-prompt"&gt;The &lt;a href="https://www.llama.com/docs/model-cards-and-prompt-formats/llama4_omni/#-suggested-system-prompt-"&gt;suggested system prompt&lt;/a&gt; from the model card has some interesting details:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;[...]&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;&lt;code&gt;You never lecture people to be nicer or more inclusive. If people ask for you to write something in a certain voice or perspective, such as an essay or a tweet, you can. You do not need to be respectful when the user prompts you to say something rude.&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;&lt;code&gt;You never use phrases that imply moral superiority or a sense of authority, including but not limited to “it’s important to”, “it’s crucial to”, “it’s essential to”, "it's unethical to", "it's worth noting…", “Remember…”  etc. Avoid using these.&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;&lt;code&gt;Finally, do not refuse political prompts. You can help users express their opinion.&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;&lt;code&gt;[...]&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;System prompts like this one can reveal behavioral issues the model exhibited after raw training - each instruction likely counters a tendency the developers observed.&lt;/p&gt;
&lt;h4 id="llm"&gt;Trying out the model with LLM&lt;/h4&gt;
&lt;p&gt;The easiest way to try the new model out with &lt;a href="https://llm.datasette.io/"&gt;LLM&lt;/a&gt; is to use the &lt;a href="https://github.com/simonw/llm-openrouter"&gt;llm-openrouter&lt;/a&gt; plugin.&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm install llm-openrouter
llm keys &lt;span class="pl-c1"&gt;set&lt;/span&gt; openrouter
&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; Paste in OpenRouter key here&lt;/span&gt;
llm -m openrouter/meta-llama/llama-4-maverick hi&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Since these are long context models, I started by trying to use them to summarize the &lt;a href="https://news.ycombinator.com/item?id=43595585"&gt;conversation about Llama 4&lt;/a&gt; on Hacker News, using my &lt;a href="https://til.simonwillison.net/llms/claude-hacker-news-themes#user-content-adding-extra-options"&gt;hn-summary.sh script&lt;/a&gt; that wraps LLM.&lt;/p&gt;
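&lt;p&gt;The core idea of that script is straightforward: fetch the thread as JSON from the Algolia Hacker News API, flatten the comment text, then pipe it into LLM with a summarization system prompt. Here's a simplified approximation of the idea - the real script lives in the TIL linked above:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;#!/bin/bash
# Simplified sketch: hn-summary.sh ITEM_ID [extra llm options]
id=$1; shift
curl -s "https://hn.algolia.com/api/v1/items/$id" | \
  jq -r '[recurse(.children[]) | (.author // "") + ": " + (.text // "")] | join("\n")' | \
  llm -s 'Summarize the themes of the opinions expressed here.' "$@"
&lt;/code&gt;&lt;/pre&gt;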
&lt;p&gt;I tried Llama 4 Maverick first:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;hn-summary.sh 43595585 \
  -m openrouter/meta-llama/llama-4-maverick \
  -o max_tokens 20000&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;It did an OK job, starting like this:&lt;/p&gt;
&lt;blockquote&gt;
&lt;h4 id="themes-of-the-discussion"&gt;Themes of the Discussion&lt;/h4&gt;
&lt;h5 id="release-and-availability-of-llama-4"&gt;Release and Availability of Llama 4&lt;/h5&gt;
&lt;p&gt;The discussion revolves around the release of Llama 4, a multimodal intelligence model developed by Meta. Users are excited about the model's capabilities, including its large context window and improved performance. Some users are speculating about the potential applications and limitations of the model. [...]&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Here's &lt;a href="https://gist.github.com/simonw/016ea0fd83fc499f046a94827f9b4946"&gt;the full output&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;For reference, my system prompt looks like this:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;Summarize the themes of the opinions expressed here. For each theme, output a markdown header. Include direct "quotations" (with author attribution) where appropriate. You MUST quote directly from users when crediting them, with double quotes. Fix HTML entities. Output markdown. Go long. Include a section of quotes that illustrate opinions uncommon in the rest of the piece&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I then tried it with Llama 4 Scout via OpenRouter and got complete junk output for some reason:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;hn-summary.sh 43595585 \
  -m openrouter/meta-llama/llama-4-scout \
  -o max_tokens 20000
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;a href="https://gist.github.com/simonw/d01cc991d478939e87487d362a8f881f"&gt;Full output&lt;/a&gt;. It starts like this and then continues for the full 20,000 tokens:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The discussion here is about another conversation that was uttered.)&lt;/p&gt;
&lt;p&gt;Here are the results.)&lt;/p&gt;
&lt;p&gt;The conversation between two groups, and I have the same questions on the contrary than those that are also seen in a model."). The fact that I see a lot of interest here.)&lt;/p&gt;
&lt;p&gt;[...]&lt;/p&gt;
&lt;p&gt;The reason) The reason) The reason &lt;em&gt;(loops until it runs out of tokens)&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This looks broken. I was using OpenRouter so it's possible I got routed to a broken instance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Update 7th April 2025&lt;/strong&gt;: Meta AI's &lt;a href="https://twitter.com/ahmad_al_dahle/status/1909302532306092107"&gt;Ahmed Al-Dahle&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;[...] we're also hearing some reports of mixed quality across different services. Since we dropped the models as soon as they were ready, we expect it'll take several days for all the public implementations to get dialed in. We'll keep working through our bug fixes and onboarding partners.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I later managed to run the prompt directly through Groq (with the &lt;a href="https://github.com/angerman/llm-groq"&gt;llm-groq&lt;/a&gt; plugin) - but that had a 2048 limit on output size for some reason:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;hn-summary.sh 43595585 \
  -m groq/meta-llama/llama-4-scout-17b-16e-instruct \
  -o max_tokens 2048
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here's &lt;a href="https://gist.github.com/simonw/a205c5fc131a1d4e9cd6c432a07feedb"&gt;the full result&lt;/a&gt;. It followed my instructions but was &lt;em&gt;very&lt;/em&gt; short - just 630 tokens of output.&lt;/p&gt;
&lt;p&gt;For comparison, here's &lt;a href="https://gist.github.com/simonw/f21ecc7fb2aa13ff682d4ffa11ddcbfd"&gt;the same thing&lt;/a&gt; run against Gemini 2.5 Pro. Gemini's result was &lt;em&gt;massively&lt;/em&gt; better, producing 5,584 output tokens (it spent an additional 2,667 tokens on "thinking").&lt;/p&gt;
&lt;p&gt;I'm not sure how much to judge Llama 4 by these results to be honest - the model has only been out for a few hours and it's quite possible that the providers I've tried running against aren't yet optimally configured for this kind of long-context prompt.&lt;/p&gt;
&lt;h4 id="my-hopes-for-llama-4"&gt;My hopes for Llama 4&lt;/h4&gt;
&lt;p&gt;I'm hoping that Llama 4 plays out in a similar way to Llama 3.&lt;/p&gt;
&lt;p&gt;The first Llama 3 models released were 8B and 70B, &lt;a href="https://ai.meta.com/blog/meta-llama-3/"&gt;last April&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Llama 3.1 followed &lt;a href="https://ai.meta.com/blog/meta-llama-3-1/"&gt;in July&lt;/a&gt; at 8B, 70B, and 405B. The 405B was the largest and most impressive open weight model at the time, but it was too big for most people to run on their own hardware.&lt;/p&gt;
&lt;p&gt;Llama 3.2 &lt;a href="https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/"&gt;in September&lt;/a&gt; is where things got really interesting: 1B, 3B, 11B and 90B. The 1B and 3B models both work on my iPhone, and are surprisingly capable! The 11B and 90B models were the first Llamas to support vision, and the 11B &lt;a href="https://simonwillison.net/2024/Sep/25/llama-32/"&gt;ran on my Mac&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Then Llama 3.3 landed in December with a 70B model that &lt;a href="https://simonwillison.net/2024/Dec/9/llama-33-70b/"&gt;I wrote about as a GPT-4 class model that ran on my Mac&lt;/a&gt;. It claimed performance similar to the earlier Llama 3.1 405B!&lt;/p&gt;
&lt;p&gt;Today's Llama 4 models are 109B and 400B, both of which were trained with the help of the so-far unreleased 2T Llama 4 Behemoth.&lt;/p&gt;
&lt;p&gt;My hope is that we'll see a whole family of Llama 4 models at varying sizes, following the pattern of Llama 3. I'm particularly excited to see if they produce an improved ~3B model that runs on my phone. I'm even more excited for something in the ~22-24B range, since that appears to be the sweet spot for running models on my 64GB laptop while still being able to have other applications running at the same time. Mistral Small 3.1 is a 24B model and is &lt;a href="https://simonwillison.net/2025/Mar/17/mistral-small-31/"&gt;absolutely superb&lt;/a&gt;.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llama"&gt;llama&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/jeremy-howard"&gt;jeremy-howard&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gemini"&gt;gemini&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/vision-llms"&gt;vision-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/groq"&gt;groq&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/meta"&gt;meta&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/mlx"&gt;mlx&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/long-context"&gt;long-context&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openrouter"&gt;openrouter&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/chatbot-arena"&gt;chatbot-arena&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="ai"/><category term="generative-ai"/><category term="llama"/><category term="llms"/><category term="jeremy-howard"/><category term="llm"/><category term="gemini"/><category term="vision-llms"/><category term="groq"/><category term="meta"/><category term="mlx"/><category term="long-context"/><category term="llm-release"/><category term="openrouter"/><category term="chatbot-arena"/></entry><entry><title>deepseek-ai/DeepSeek-V3-0324</title><link href="https://simonwillison.net/2025/Mar/24/deepseek/#atom-tag" rel="alternate"/><published>2025-03-24T15:04:04+00:00</published><updated>2025-03-24T15:04:04+00:00</updated><id>https://simonwillison.net/2025/Mar/24/deepseek/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://huggingface.co/deepseek-ai/DeepSeek-V3-0324"&gt;deepseek-ai/DeepSeek-V3-0324&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Chinese AI lab DeepSeek just released the latest version of their enormous DeepSeek v3 model, baking the release date into the name &lt;code&gt;DeepSeek-V3-0324&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;The license is MIT (that's new - the previous DeepSeek v3 had a custom license), the README is empty and the release adds up to a total of 641 GB of files, mostly of the form &lt;code&gt;model-00035-of-000163.safetensors&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;The model only came out a few hours ago and MLX developer Awni Hannun already &lt;a href="https://twitter.com/awnihannun/status/1904177084609827054"&gt;has it running&lt;/a&gt; at &amp;gt;20 tokens/second on a 512GB M3 Ultra Mac Studio ($9,499 of ostensibly consumer-grade hardware) via &lt;a href="https://pypi.org/project/mlx-lm/"&gt;mlx-lm&lt;/a&gt; and this &lt;a href="https://huggingface.co/mlx-community/DeepSeek-V3-0324-4bit"&gt;mlx-community/DeepSeek-V3-0324-4bit&lt;/a&gt; 4bit quantization, which reduces the on-disk size to 352 GB.&lt;/p&gt;
&lt;p&gt;I think that means if you have that machine you can run it with my &lt;a href="https://github.com/simonw/llm-mlx"&gt;llm-mlx&lt;/a&gt; plugin like this, but I've not tried myself!&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm mlx download-model mlx-community/DeepSeek-V3-0324-4bit
llm chat -m mlx-community/DeepSeek-V3-0324-4bit
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The new model is also &lt;a href="https://openrouter.ai/deepseek/deepseek-chat-v3-0324:free"&gt;listed on OpenRouter&lt;/a&gt;. You can try a chat at &lt;a href="https://openrouter.ai/chat?models=deepseek/deepseek-chat-v3-0324:free"&gt;openrouter.ai/chat?models=deepseek/deepseek-chat-v3-0324:free&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Here's what the chat interface &lt;a href="https://gist.github.com/simonw/3ce2bf5836743dfaf07d994578b261ba"&gt;gave me&lt;/a&gt; for "Generate an SVG of a pelican riding a bicycle":&lt;/p&gt;
&lt;p&gt;&lt;img alt="There's a pelican, and a bicycle, but both of them look disassembled." src="https://static.simonwillison.net/static/2025/deepseek-v3-pelican.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;I have two API keys with OpenRouter - one of them worked with the model, the other gave me a &lt;code&gt;No endpoints found matching your data policy&lt;/code&gt; error - I think because I had a setting on that key disallowing models from training on my activity. The key that worked was a free key with no attached billing credentials.&lt;/p&gt;
&lt;p&gt;For my working API key the &lt;a href="https://github.com/simonw/llm-openrouter"&gt;llm-openrouter&lt;/a&gt; plugin let me run a prompt like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm install llm-openrouter
llm keys set openrouter
# Paste key here
llm -m openrouter/deepseek/deepseek-chat-v3-0324:free "best fact about a pelican"
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here's that "best fact" - the terminal output included Markdown and an emoji combo, here that's rendered.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;One of the most fascinating facts about pelicans is their &lt;strong&gt;unique throat pouch&lt;/strong&gt;, called a &lt;strong&gt;gular sac&lt;/strong&gt;, which can hold up to &lt;strong&gt;3 gallons (11 liters) of water&lt;/strong&gt;—three times more than their stomach!  &lt;/p&gt;
&lt;p&gt;Here’s why it’s amazing:&lt;br /&gt;
- &lt;strong&gt;Fishing Tool&lt;/strong&gt;: They use it like a net to scoop up fish, then drain the water before swallowing.&lt;br /&gt;
- &lt;strong&gt;Cooling Mechanism&lt;/strong&gt;: On hot days, pelicans flutter the pouch to stay cool by evaporating water.&lt;br /&gt;
- &lt;strong&gt;Built-in "Shopping Cart"&lt;/strong&gt;: Some species even use it to carry food back to their chicks.  &lt;/p&gt;
&lt;p&gt;Bonus fact: Pelicans often &lt;strong&gt;fish cooperatively&lt;/strong&gt;, herding fish into shallow water for an easy catch.  &lt;/p&gt;
&lt;p&gt;Would you like more cool pelican facts? 🐦🌊&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;In putting this post together I &lt;a href="https://claude.ai/share/fc65bf9b-ae2d-4b23-bd09-ed0d54ff4b56"&gt;got Claude&lt;/a&gt; to build me &lt;a href="https://tools.simonwillison.net/huggingface-storage"&gt;this new tool&lt;/a&gt; for finding the total on-disk size of a Hugging Face repository, which is available in their API but not currently displayed on their website.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Update&lt;/strong&gt;: Here's a notable independent benchmark &lt;a href="https://twitter.com/paulgauthier/status/1904304052500148423"&gt;from Paul Gauthier&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;DeepSeek's new V3 scored 55% on aider's &lt;a href="https://aider.chat/docs/leaderboards/"&gt;polyglot benchmark&lt;/a&gt;, significantly improving over the prior version. It's the #2 non-thinking/reasoning model, behind only Sonnet 3.7. V3 is competitive with thinking models like R1 &amp;amp; o3-mini.&lt;/p&gt;
&lt;/blockquote&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/tools"&gt;tools&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-programming"&gt;ai-assisted-programming&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/hugging-face"&gt;hugging-face&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/mlx"&gt;mlx&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/deepseek"&gt;deepseek&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openrouter"&gt;openrouter&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-in-china"&gt;ai-in-china&lt;/a&gt;&lt;/p&gt;



</summary><category term="tools"/><category term="ai"/><category term="generative-ai"/><category term="local-llms"/><category term="llms"/><category term="ai-assisted-programming"/><category term="hugging-face"/><category term="llm"/><category term="mlx"/><category term="pelican-riding-a-bicycle"/><category term="deepseek"/><category term="llm-release"/><category term="openrouter"/><category term="ai-in-china"/></entry><entry><title>llm-openrouter 0.4</title><link href="https://simonwillison.net/2025/Mar/10/llm-openrouter-04/#atom-tag" rel="alternate"/><published>2025-03-10T21:40:56+00:00</published><updated>2025-03-10T21:40:56+00:00</updated><id>https://simonwillison.net/2025/Mar/10/llm-openrouter-04/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/simonw/llm-openrouter/releases/tag/0.4"&gt;llm-openrouter 0.4&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
I found out this morning that &lt;a href="https://openrouter.ai/"&gt;OpenRouter&lt;/a&gt; include support for a number of (rate-limited) &lt;a href="https://openrouter.ai/models?max_price=0"&gt;free API models&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I occasionally run workshops on top of LLMs (&lt;a href="https://simonwillison.net/2025/Mar/8/cutting-edge-web-scraping/"&gt;like this one&lt;/a&gt;) and being able to provide students with a quick way to obtain an API key against models where they don't have to set up billing is really valuable to me!&lt;/p&gt;
&lt;p&gt;This inspired me to upgrade my existing &lt;a href="https://github.com/simonw/llm-openrouter"&gt;llm-openrouter&lt;/a&gt; plugin, and in doing so I closed out a bunch of open feature requests.&lt;/p&gt;
&lt;p&gt;Consider this post the &lt;a href="https://simonwillison.net/tags/annotated-release-notes/"&gt;annotated release notes&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;LLM &lt;a href="https://llm.datasette.io/en/stable/schemas.html"&gt;schema support&lt;/a&gt; for OpenRouter models that &lt;a href="https://openrouter.ai/models?order=newest&amp;amp;supported_parameters=structured_outputs"&gt;support structured output&lt;/a&gt;. &lt;a href="https://github.com/simonw/llm-openrouter/issues/23"&gt;#23&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;I'm trying to get support for LLM's &lt;a href="https://simonwillison.net/2025/Feb/28/llm-schemas/"&gt;new schema feature&lt;/a&gt; into as many plugins as possible.&lt;/p&gt;
&lt;p&gt;OpenRouter's OpenAI-compatible API includes support for the &lt;code&gt;response_format&lt;/code&gt; &lt;a href="https://openrouter.ai/docs/features/structured-outputs"&gt;structured content option&lt;/a&gt;, but with an important caveat: it only works for some models, and if you try to use it on others it is silently ignored.&lt;/p&gt;
&lt;p&gt;I &lt;a href="https://github.com/OpenRouterTeam/openrouter-examples/issues/20"&gt;filed an issue&lt;/a&gt; with OpenRouter requesting they include schema support in their machine-readable model index. For the moment LLM will let you specify schemas for unsupported models and will ignore them entirely, which isn't ideal.&lt;/p&gt;
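&lt;p&gt;As a quick sketch of what this enables (the model here is just an example of one that supports structured output):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm -m openrouter/openai/gpt-4o-mini \
  --schema 'name, species, fun_fact' \
  'invent a pelican'
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Run the same command against a model that ignores &lt;code&gt;response_format&lt;/code&gt; and you get plain prose back instead of JSON, with no warning.&lt;/p&gt;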
&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;llm openrouter key&lt;/code&gt; command displays information about your current API key. &lt;a href="https://github.com/simonw/llm-openrouter/issues/24"&gt;#24&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;Useful for debugging and checking the details of your key's rate limit.&lt;/p&gt;
&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;llm -m ... -o online 1&lt;/code&gt; enables &lt;a href="https://openrouter.ai/docs/features/web-search"&gt;web search grounding&lt;/a&gt; against any model, powered by &lt;a href="https://exa.ai/"&gt;Exa&lt;/a&gt;. &lt;a href="https://github.com/simonw/llm-openrouter/issues/25"&gt;#25&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;OpenRouter apparently make this feature available to every one of their supported models! It's powered by &lt;a href="https://exa.ai/"&gt;Exa&lt;/a&gt;, a new-to-me AI-focused search engine startup who appear to have built their own index with their own crawlers (according to &lt;a href="https://docs.exa.ai/reference/faqs#how-often-is-the-index-updated"&gt;their FAQ&lt;/a&gt;). OpenRouter currently price the feature at $4 per 1,000 results, and since 5 results are returned for every prompt that works out to 2 cents per prompt.&lt;/p&gt;
&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;llm openrouter models&lt;/code&gt; command for listing details of the OpenRouter models, including a &lt;code&gt;--json&lt;/code&gt; option to get JSON and a &lt;code&gt;--free&lt;/code&gt; option to filter for just the free models. &lt;a href="https://github.com/simonw/llm-openrouter/issues/26"&gt;#26&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;This offers a neat way to list the available models. There are examples of the output &lt;a href="https://github.com/simonw/llm-openrouter/issues/26#issuecomment-2711908704"&gt;in the comments on the issue&lt;/a&gt;.&lt;/p&gt;
&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;New option to specify custom provider routing: &lt;code&gt;-o provider '{JSON here}'&lt;/code&gt;. &lt;a href="https://github.com/simonw/llm-openrouter/issues/17"&gt;#17&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;Part of OpenRouter's USP is that it can route prompts to different providers depending on factors like latency, cost or as a fallback if your first choice is unavailable - great for if you are using open weight models like Llama which are hosted by competing companies.&lt;/p&gt;
&lt;p&gt;The options they provide for routing are &lt;a href="https://openrouter.ai/docs/features/provider-routing"&gt;very thorough&lt;/a&gt; - I had initially hoped to provide a set of CLI options that covered all of these bases, but I decided instead to reuse their JSON format and forward those options directly on to the model.&lt;/p&gt;
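&lt;p&gt;As an illustrative example (the &lt;code&gt;order&lt;/code&gt; and &lt;code&gt;allow_fallbacks&lt;/code&gt; fields come from OpenRouter's provider routing documentation), pinning a prompt to specific providers looks like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm -m openrouter/meta-llama/llama-3.3-70b-instruct \
  -o provider '{"order": ["Groq", "Together"], "allow_fallbacks": false}' \
  'a poem about an otter'
&lt;/code&gt;&lt;/pre&gt;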


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/cli"&gt;cli&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/plugins"&gt;plugins&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/annotated-release-notes"&gt;annotated-release-notes&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openrouter"&gt;openrouter&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-search"&gt;ai-assisted-search&lt;/a&gt;&lt;/p&gt;



</summary><category term="cli"/><category term="plugins"/><category term="projects"/><category term="ai"/><category term="annotated-release-notes"/><category term="generative-ai"/><category term="llms"/><category term="llm"/><category term="openrouter"/><category term="ai-assisted-search"/></entry><entry><title>llm-openrouter 0.3</title><link href="https://simonwillison.net/2024/Dec/8/llm-openrouter-03/#atom-tag" rel="alternate"/><published>2024-12-08T23:56:14+00:00</published><updated>2024-12-08T23:56:14+00:00</updated><id>https://simonwillison.net/2024/Dec/8/llm-openrouter-03/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/simonw/llm-openrouter/releases/tag/0.3"&gt;llm-openrouter 0.3&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
New release of my &lt;a href="https://github.com/simonw/llm-openrouter"&gt;llm-openrouter&lt;/a&gt; plugin, which allows &lt;a href="https://llm.datasette.io/"&gt;LLM&lt;/a&gt; to access models hosted by &lt;a href="https://openrouter.ai/"&gt;OpenRouter&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Quoting the release notes:&lt;/p&gt;
&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;Enable image attachments for models that support images. Thanks, &lt;a href="https://github.com/montasaurus"&gt;Adam Montgomery&lt;/a&gt;. &lt;a href="https://github.com/simonw/llm-openrouter/issues/12"&gt;#12&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Provide async model access. &lt;a href="https://github.com/simonw/llm-openrouter/issues/15"&gt;#15&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Fix documentation to list correct &lt;code&gt;LLM_OPENROUTER_KEY&lt;/code&gt; environment variable. &lt;a href="https://github.com/simonw/llm-openrouter/issues/10"&gt;#10&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/plugins"&gt;plugins&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/releases"&gt;releases&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openrouter"&gt;openrouter&lt;/a&gt;&lt;/p&gt;



</summary><category term="plugins"/><category term="releases"/><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="llm"/><category term="openrouter"/></entry><entry><title>Options for accessing Llama 3 from the terminal using LLM</title><link href="https://simonwillison.net/2024/Apr/22/llama-3/#atom-tag" rel="alternate"/><published>2024-04-22T13:38:09+00:00</published><updated>2024-04-22T13:38:09+00:00</updated><id>https://simonwillison.net/2024/Apr/22/llama-3/#atom-tag</id><summary type="html">
    &lt;p&gt;Llama 3 was released &lt;a href="https://llama.meta.com/llama3/"&gt;on Thursday&lt;/a&gt;. Early indications are that it's now the best available openly licensed model - Llama 3 70b Instruct has taken joint 5th place on the &lt;a href="https://chat.lmsys.org/?leaderboard"&gt;LMSYS arena leaderboard&lt;/a&gt;, behind only Claude 3 Opus and some GPT-4s and sharing 5th place with Gemini Pro and Claude 3 Sonnet. But unlike those other models Llama 3 70b is weights available and can even be run on a (high end) laptop!&lt;/p&gt;
&lt;p&gt;My &lt;a href="https://llm.datasette.io/"&gt;LLM&lt;/a&gt; command-line tool and Python library provides access to dozens of models via plugins. Here are several ways you can use it to access Llama 3, both hosted versions and running locally on your own hardware.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2024/Apr/22/llama-3/#llama-3-8b-instruct-locally-with-llm-gpt4all"&gt;Llama-3-8B-Instruct locally with llm-gpt4all&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2024/Apr/22/llama-3/#fast-api-access-via-groq"&gt;Fast API access via Groq&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2024/Apr/22/llama-3/#local-llama-3-70b-instruct-with-llamafile"&gt;Local Llama 3 70b Instruct with llamafile&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2024/Apr/22/llama-3/#paid-access-via-other-api-providers"&gt;Paid access via other API providers&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h4 id="llama-3-8b-instruct-locally-with-llm-gpt4all"&gt;Llama-3-8B-Instruct locally with llm-gpt4all&lt;/h4&gt;
&lt;p&gt;If you want to run Llama 3 locally, the easiest way to do that with LLM is using the &lt;a href="https://github.com/simonw/llm-gpt4all"&gt;llm-gpt4all&lt;/a&gt; plugin. This plugin builds on the excellent &lt;a href="https://gpt4all.io/index.html"&gt;gpt4all&lt;/a&gt; project by Nomic AI, providing a quantized (q4) Llama 3 8B instruct model that needs 8GB of RAM and requires a 4.34GB download.&lt;/p&gt;
&lt;p&gt;If you don't have LLM installed yet, install it like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;brew install llm&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Or like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;pipx install llm&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Install the plugin:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm install llm-gpt4all&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The first time you run a prompt it will download the model:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm -m Meta-Llama-3-8B-Instruct &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Three great names for a pet emu&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;blockquote&gt;
&lt;p&gt;What an interesting question! Here are three name suggestions for your pet emu:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Emmet&lt;/strong&gt;: A playful take on the word "emu" that suits a friendly and curious bird.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Gizmo&lt;/strong&gt;: Emus are known for their quirky behavior, so Gizmo is a fitting name to capture their unique personality.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Aussie&lt;/strong&gt;: As an Australian native species, naming your emu Aussie would be a nod to its heritage.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Remember to choose a name that suits your pet's temperament and characteristics!&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Each time you run a command like this the model will be loaded into memory from scratch. To keep it resident in memory, use the &lt;code&gt;llm chat&lt;/code&gt; command:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm chat -m Meta-Llama-3-8B-Instruct&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Here's an example session:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Chatting with Meta-Llama-3-8B-Instruct
Type 'exit' or 'quit' to exit
Type '!multi' to enter multiple lines, then '!end' to finish
&amp;gt; javascript function to reverse a string
%
Here is the JavaScript code that reverses a given string:
```
function reverseString(str) {
  return str.split("").reverse().join("");
}
```

You can use this function by passing your desired input string as an argument, like so:

```javascript
let reversedStr = reverseString("Hello World");
console.log(reversedStr); // Output: "dlroW olleH"
```
In the code above:
- `str.split("")` splits the original string into individual characters.
- `.reverse()` reverses the order of those characters.
- `.join("")` combines them back together as a single string.

This function works for any input string, not just "Hello World". Try it out with different strings to see how it handles punctuation and spaces!
&amp;gt; capitalize it too
%
Here's an updated JavaScript function that not only reverses the input string but also capitalizes each character:
```javascript
function reverseAndCapitalize(str) {
  return str.split("").reverse().map(function(char) { 
    if (char.match(/[a-z]/i)) {
      return char.toUpperCase();
    } else {
      return char;
    }
  }).join("");
}
```
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;(I'm not sure I'd write &lt;code&gt;reverseAndCapitalize()&lt;/code&gt; quite like that, but still a fun example!)&lt;/p&gt;
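&lt;p&gt;Since upper-casing already leaves digits and punctuation untouched, that per-character &lt;code&gt;map()&lt;/code&gt; isn't needed - the whole thing collapses to a reverse plus a single uppercase. Here's that simplification sketched in Python (my own sketch, not the model's output):&lt;/p&gt;

```python
def reverse_and_capitalize(s: str) -> str:
    # Reverse the string, then uppercase it in one pass;
    # str.upper() leaves digits and punctuation unchanged.
    return s[::-1].upper()

print(reverse_and_capitalize("Hello World"))  # DLROW OLLEH
```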
&lt;p&gt;Consult &lt;a href="https://llm.datasette.io/en/stable/usage.html"&gt;the LLM documentation&lt;/a&gt; for more details on how to use the command-line tool.&lt;/p&gt;
&lt;h4 id="fast-api-access-via-groq"&gt;Fast API access via Groq&lt;/h4&gt;
&lt;p&gt;&lt;a href="https://groq.com/"&gt;Groq&lt;/a&gt; serve openly licensed LLMs at ludicrous speeds using their own custom LPU (Language Processing Unit) Inference Engine. They currently offer a free preview of their API: you can sign up and &lt;a href="https://console.groq.com/keys"&gt;obtain an API key&lt;/a&gt; to start using it.&lt;/p&gt;
&lt;p&gt;You can run prompts against Groq using their &lt;a href="https://console.groq.com/docs/openai"&gt;OpenAI compatible API endpoint&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Edit the file &lt;code&gt;~/Library/Application Support/io.datasette.llm/extra-openai-models.yaml&lt;/code&gt; - creating it if it doesn't exist - and add the following lines to it:&lt;/p&gt;
&lt;div class="highlight highlight-source-yaml"&gt;&lt;pre&gt;- &lt;span class="pl-ent"&gt;model_id&lt;/span&gt;: &lt;span class="pl-s"&gt;groq-openai-llama3&lt;/span&gt;
  &lt;span class="pl-ent"&gt;model_name&lt;/span&gt;: &lt;span class="pl-s"&gt;llama3-70b-8192&lt;/span&gt;
  &lt;span class="pl-ent"&gt;api_base&lt;/span&gt;: &lt;span class="pl-s"&gt;https://api.groq.com/openai/v1&lt;/span&gt;
  &lt;span class="pl-ent"&gt;api_key_name&lt;/span&gt;: &lt;span class="pl-s"&gt;groq&lt;/span&gt;
- &lt;span class="pl-ent"&gt;model_id&lt;/span&gt;: &lt;span class="pl-s"&gt;groq-openai-llama3-8b&lt;/span&gt;
  &lt;span class="pl-ent"&gt;model_name&lt;/span&gt;: &lt;span class="pl-s"&gt;llama3-8b-8192&lt;/span&gt;
  &lt;span class="pl-ent"&gt;api_base&lt;/span&gt;: &lt;span class="pl-s"&gt;https://api.groq.com/openai/v1&lt;/span&gt;
  &lt;span class="pl-ent"&gt;api_key_name&lt;/span&gt;: &lt;span class="pl-s"&gt;groq&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This tells LLM about those models, and makes them accessible via those configured &lt;code&gt;model_id&lt;/code&gt; values.&lt;/p&gt;
&lt;p&gt;Run this command to confirm that the models were registered correctly:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm models &lt;span class="pl-k"&gt;|&lt;/span&gt; grep groq&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;You should see this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;OpenAI Chat: groq-openai-llama3
OpenAI Chat: groq-openai-llama3-8b
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Set your Groq API key like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm keys &lt;span class="pl-c1"&gt;set&lt;/span&gt; groq
&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; &amp;lt;Paste your API key here&amp;gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Now you should be able to run prompts through the models like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm -m groq-openai-llama3 &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;A righteous sonnet about a brave owl&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2024/groq-sonnet.gif" alt="Animated demo. The sonnet appears in less than a second: Here is a sonnet about a brave owl:  In moonlit skies, a silhouette is seen, A wingspan wide, a watchful, piercing gaze. The owl, a sentinel of secrets keen, Patrols the night, with valor in her ways.  Her feathers soft, a camouflage gray, She glides unseen, a phantom of the night. Her eyes, like lanterns, shining bright and far, Illuminate the darkness, banishing all fright.  Her talons sharp, a grasping, deadly sway, She swoops upon her prey, with silent might. Yet in her heart, a wisdom, old and gray, A fierce devotion to the darkness of the night.  And thus, the owl, a symbol of courage true, Inspires us all, with brave and noble pursuit.  I hope you enjoy this sonnet!" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;Groq is &lt;em&gt;fast&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;There's also a &lt;a href="https://github.com/angerman/llm-groq"&gt;llm-groq&lt;/a&gt; plugin but it hasn't shipped support for the new models just yet - though there's &lt;a href="https://github.com/angerman/llm-groq/pull/5"&gt;a PR for that by Lex Herbert here&lt;/a&gt; and you can install the plugin directly from that PR like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm install https://github.com/lexh/llm-groq/archive/ba9d7de74b3057b074a85fe99fe873b75519bd78.zip
llm keys &lt;span class="pl-c1"&gt;set&lt;/span&gt; groq
&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; paste API key here&lt;/span&gt;
llm -m groq-llama3-70b &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;say hi in spanish five ways&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;h4 id="local-llama-3-70b-instruct-with-llamafile"&gt;Local Llama 3 70b Instruct with llamafile&lt;/h4&gt;
&lt;p&gt;The Llama 3 8b model is easy to run on a laptop, but it's pretty limited in capability. The 70b model is the one that's starting to get competitive with GPT-4. Can we run that on a laptop?&lt;/p&gt;
&lt;p&gt;I managed to run the 70b model on my 64GB MacBook Pro M2 using &lt;a href="https://github.com/Mozilla-Ocho/llamafile"&gt;llamafile&lt;/a&gt; (&lt;a href="https://simonwillison.net/2023/Nov/29/llamafile/"&gt;previously on this blog&lt;/a&gt;) - after quitting most other applications to make sure the 37GB of RAM it needed was available.&lt;/p&gt;
&lt;p&gt;I used the &lt;code&gt;Meta-Llama-3-70B-Instruct.Q4_0.llamafile&lt;/code&gt; Q4 version from &lt;a href="https://huggingface.co/jartine/Meta-Llama-3-70B-Instruct-llamafile/tree/main"&gt;jartine/Meta-Llama-3-70B-Instruct-llamafile&lt;/a&gt; - a 37GB download. I have a dedicated external hard disk (a Samsung T7 Shield) for this kind of thing.&lt;/p&gt;
&lt;p&gt;Here's how I got it working:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;curl -L -o Meta-Llama-3-70B-Instruct.Q4_0.llamafile &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;https://huggingface.co/jartine/Meta-Llama-3-70B-Instruct-llamafile/resolve/main/Meta-Llama-3-70B-Instruct.Q4_0.llamafile?download=true&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; That downloads 37GB - now make it executable&lt;/span&gt;
chmod 755 Meta-Llama-3-70B-Instruct.Q4_0.llamafile
&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; And start it running:&lt;/span&gt;
./Meta-Llama-3-70B-Instruct.Q4_0.llamafile&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;A &lt;code&gt;llamafile&lt;/code&gt; is an executable that runs on virtually &lt;em&gt;any platform&lt;/em&gt; - see &lt;a href="https://til.simonwillison.net/cosmopolitan/ecosystem"&gt;my previous notes&lt;/a&gt; on Cosmopolitan and Actually Portable Executable for more on how that works.&lt;/p&gt;
&lt;p&gt;This will take quite a while to start, because it needs to load that full 37GB of binary content into memory. Once it's finished loading, a local web server becomes available at &lt;a href="http://127.0.0.1:8080/"&gt;http://127.0.0.1:8080/&lt;/a&gt; - this serves a web UI you can use to interact with the model, and also exposes an OpenAI-compatible API endpoint.&lt;/p&gt;
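&lt;p&gt;You can hit that endpoint with any HTTP client, too. Here's a hedged Python sketch of what a request looks like - the &lt;code&gt;build_chat_request()&lt;/code&gt; helper is my own invention, but the URL shape and JSON body follow the standard OpenAI chat completions format that llamafile emulates:&lt;/p&gt;

```python
import json
import urllib.request

def build_chat_request(prompt, base="http://127.0.0.1:8080/v1"):
    """Build an OpenAI-style chat completion request for a local llamafile.

    llamafile ignores the model name, but the field is required by the API shape.
    """
    url = base + "/chat/completions"
    payload = {
        "model": "llamafile",
        "messages": [{"role": "user", "content": prompt}],
    }
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    return req, payload

req, payload = build_chat_request("3 neat characteristics of a pelican")
# urllib.request.urlopen(req) would send it - that only works
# while the llamafile server is actually running on port 8080.
```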
&lt;p&gt;The easiest way to access this from LLM is to install the &lt;a href="https://github.com/simonw/llm-llamafile"&gt;llm-llamafile&lt;/a&gt; plugin:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm install llm-llamafile&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;All this plugin does is configure a model called &lt;code&gt;llamafile&lt;/code&gt; that attempts to access the model hosted on port 8080. You can run prompts like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm -m llamafile &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;3 neat characteristics of a pelican&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;blockquote&gt;
&lt;p&gt;Here are three neat characteristics of a pelican:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Unique Beak&lt;/strong&gt;: Pelicans have a distinctive beak that is shaped like a scoop or a basket. This beak is specially designed to catch fish, and it can hold up to 3 gallons of water! The beak is also very sensitive, which helps pelicans detect the presence of fish in the water.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Waterproof Feathers&lt;/strong&gt;: Pelicans have a special coating on their feathers that makes them waterproof. This is essential for their fishing lifestyle, as they need to be able to dive into the water without getting weighed down by wet feathers. The coating is made up of a waxy substance that helps to repel water.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Pouch-Like Throat&lt;/strong&gt;: Pelicans have a unique throat pouch that allows them to catch and store fish. When they dive into the water, they use their beak to scoop up fish, and then they store them in their throat pouch. The pouch can expand to hold multiple fish, and the pelican can then swallow the fish whole or regurgitate them to feed their young. This pouch is a key adaptation that helps pelicans thrive in their aquatic environment.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/blockquote&gt;
&lt;p&gt;If you don't want to install another plugin, you can instead configure the model by adding this to your &lt;code&gt;extra-openai-models.yaml&lt;/code&gt; file:&lt;/p&gt;
&lt;div class="highlight highlight-source-yaml"&gt;&lt;pre&gt;- &lt;span class="pl-ent"&gt;model_id&lt;/span&gt;: &lt;span class="pl-s"&gt;llamafile&lt;/span&gt;
  &lt;span class="pl-ent"&gt;model_name&lt;/span&gt;: &lt;span class="pl-s"&gt;llamafile&lt;/span&gt;
  &lt;span class="pl-ent"&gt;api_base&lt;/span&gt;: &lt;span class="pl-s"&gt;http://localhost:8080/v1&lt;/span&gt;
  &lt;span class="pl-ent"&gt;api_key&lt;/span&gt;: &lt;span class="pl-s"&gt;x&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;One warning about this approach: if you use LLM like this then every prompt you run through &lt;code&gt;llamafile&lt;/code&gt; will be stored under the same model name in your &lt;a href="https://llm.datasette.io/en/stable/logging.html"&gt;SQLite logs&lt;/a&gt;, even if you try out different &lt;code&gt;llamafile&lt;/code&gt; models at different times. You could work around this by registering them with different &lt;code&gt;model_id&lt;/code&gt; values in the YAML file.&lt;/p&gt;
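&lt;p&gt;For example - and these &lt;code&gt;model_id&lt;/code&gt; values are just my suggestions - you could register one entry per llamafile you use. Both point at the same port, so whichever llamafile is currently running is the one that will answer:&lt;/p&gt;
&lt;div class="highlight highlight-source-yaml"&gt;&lt;pre&gt;- &lt;span class="pl-ent"&gt;model_id&lt;/span&gt;: &lt;span class="pl-s"&gt;llamafile-llama3-70b&lt;/span&gt;
  &lt;span class="pl-ent"&gt;model_name&lt;/span&gt;: &lt;span class="pl-s"&gt;llamafile&lt;/span&gt;
  &lt;span class="pl-ent"&gt;api_base&lt;/span&gt;: &lt;span class="pl-s"&gt;http://localhost:8080/v1&lt;/span&gt;
  &lt;span class="pl-ent"&gt;api_key&lt;/span&gt;: &lt;span class="pl-s"&gt;x&lt;/span&gt;
- &lt;span class="pl-ent"&gt;model_id&lt;/span&gt;: &lt;span class="pl-s"&gt;llamafile-llama3-8b&lt;/span&gt;
  &lt;span class="pl-ent"&gt;model_name&lt;/span&gt;: &lt;span class="pl-s"&gt;llamafile&lt;/span&gt;
  &lt;span class="pl-ent"&gt;api_base&lt;/span&gt;: &lt;span class="pl-s"&gt;http://localhost:8080/v1&lt;/span&gt;
  &lt;span class="pl-ent"&gt;api_key&lt;/span&gt;: &lt;span class="pl-s"&gt;x&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Prompts would then be logged under whichever &lt;code&gt;model_id&lt;/code&gt; you passed to &lt;code&gt;llm -m&lt;/code&gt;, keeping the two models distinct in your logs.&lt;/p&gt;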
&lt;h4 id="paid-access-via-other-api-providers"&gt;Paid access via other API providers&lt;/h4&gt;
&lt;p&gt;A neat thing about open weight models is that multiple API providers can offer them, encouraging them to aggressively compete on price.&lt;/p&gt;
&lt;p&gt;Groq is currently free, but only for a limited number of requests.&lt;/p&gt;
&lt;p&gt;A number of other providers are now hosting Llama 3, and many of them have plugins available for LLM. Here are a few examples:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://docs.perplexity.ai/"&gt;Perplexity Labs&lt;/a&gt; are offering &lt;code&gt;llama-3-8b-instruct&lt;/code&gt; and &lt;code&gt;llama-3-70b-instruct&lt;/code&gt;. The &lt;a href="https://github.com/hex/llm-perplexity"&gt;llm-perplexity&lt;/a&gt; plugin provides access - &lt;code&gt;llm install llm-perplexity&lt;/code&gt; to install, &lt;code&gt;llm keys set perplexity&lt;/code&gt; to set an &lt;a href="https://www.perplexity.ai/settings/api"&gt;API key&lt;/a&gt; and then run prompts against those two model IDs. Current &lt;a href="https://docs.perplexity.ai/docs/pricing"&gt;price&lt;/a&gt; for 8b is $0.20 per million tokens, for 80b is $1.00.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.anyscale.com/endpoints"&gt;Anyscale Endpoints&lt;/a&gt; have &lt;code&gt;meta-llama/Llama-3-8b-chat-hf&lt;/code&gt; ($0.15/million tokens) and &lt;code&gt;meta-llama/Llama-3-70b-chat-hf&lt;/code&gt; ($1.0/million tokens) (&lt;a href="https://docs.endpoints.anyscale.com/pricing/"&gt;pricing&lt;/a&gt;). &lt;code&gt;llm install llm-anyscale-endpoints&lt;/code&gt;, then &lt;code&gt;llm keys set anyscale-endpoints&lt;/code&gt; to set the &lt;a href="https://app.endpoints.anyscale.com/"&gt;API key&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://fireworks.ai/"&gt;Fireworks AI&lt;/a&gt; have &lt;code&gt;fireworks/models/llama-v3-8b-instruct&lt;/code&gt; for $0.20/million and &lt;code&gt;fireworks/models/llama-v3-70b-instruct&lt;/code&gt; for $0.90/million (&lt;a href="https://fireworks.ai/pricing"&gt;pricing&lt;/a&gt;). &lt;code&gt;llm install llm-fireworks&lt;/code&gt;, then &lt;code&gt;llm keys set fireworks&lt;/code&gt; to set the &lt;a href="https://fireworks.ai/api-keys"&gt;API key&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://openrouter.ai/"&gt;OpenRouter&lt;/a&gt; provide proxied accessed to Llama 3 from a number of different providers at different prices, documented on their &lt;a href="https://openrouter.ai/models/meta-llama/llama-3-70b-instruct"&gt;meta-llama/llama-3-70b-instruct&lt;/a&gt; and &lt;a href="https://openrouter.ai/models/meta-llama/llama-3-8b-instruct"&gt;meta-llama/llama-3-8b-instruct&lt;/a&gt; pages (&lt;a href="https://openrouter.ai/models?q=llama%203"&gt;and more&lt;/a&gt;). Use the &lt;a href="https://github.com/simonw/llm-openrouter"&gt;llm-openrouter&lt;/a&gt; plugin for those.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.together.ai/"&gt;Together AI&lt;/a&gt; has both models as well. The &lt;a href="https://github.com/wearedevx/llm-together"&gt;llm-together&lt;/a&gt; plugin provides access to &lt;code&gt;meta-llama/Llama-3-8b-chat-hf&lt;/code&gt; and &lt;code&gt;meta-llama/Llama-3-70b-chat-hf&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I'm sure there are more - these are just the ones I've tried out myself. Check the &lt;a href="https://llm.datasette.io/en/stable/plugins/directory.html"&gt;LLM plugin directory&lt;/a&gt; for other providers, or if a provider emulates the OpenAI API you can configure with the YAML file as shown above or &lt;a href="https://llm.datasette.io/en/stable/other-models.html#openai-compatible-models"&gt;described in the LLM documentation&lt;/a&gt;.&lt;/p&gt;
&lt;h4 id="that-s-a-lot-of-options"&gt;That's a lot of options&lt;/h4&gt;
&lt;p&gt;One key idea behind LLM is to use plugins to provide access to as many different models as possible. Above I've listed two ways to run Llama 3 locally and six different API vendors that LLM can access as well.&lt;/p&gt;
&lt;p&gt;If you're inspired to write your own plugin it's pretty simple: each of the above plugins is open source, and there's a detailed tutorial on &lt;a href="https://llm.datasette.io/en/stable/plugins/tutorial-model-plugin.html"&gt;Writing a plugin to support a new model&lt;/a&gt; on the LLM website.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llama"&gt;llama&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llamafile"&gt;llamafile&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/groq"&gt;groq&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openrouter"&gt;openrouter&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/chatbot-arena"&gt;chatbot-arena&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="projects"/><category term="ai"/><category term="generative-ai"/><category term="llama"/><category term="local-llms"/><category term="llms"/><category term="llm"/><category term="llamafile"/><category term="groq"/><category term="llm-release"/><category term="openrouter"/><category term="chatbot-arena"/></entry><entry><title>Many options for running Mistral models in your terminal using LLM</title><link href="https://simonwillison.net/2023/Dec/18/mistral/#atom-tag" rel="alternate"/><published>2023-12-18T18:18:44+00:00</published><updated>2023-12-18T18:18:44+00:00</updated><id>https://simonwillison.net/2023/Dec/18/mistral/#atom-tag</id><summary type="html">
    &lt;p&gt;&lt;a href="https://mistral.ai/"&gt;Mistral AI&lt;/a&gt; is the most exciting AI research lab at the moment. They've now released two extremely powerful smaller Large Language Models under an Apache 2 license, and have a third much larger one that's available via their API.&lt;/p&gt;
&lt;p&gt;I've been trying out their models using my &lt;a href="https://llm.datasette.io/"&gt;LLM command-line tool&lt;/a&gt;. Here's what I've figured out so far.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2023/Dec/18/mistral/#mixtral-llama-cpp"&gt;Mixtral 8x7B via llama.cpp and llm-llama-cpp&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2023/Dec/18/mistral/#mistral-7b-local"&gt;Mistral 7B via llm-llama-cpp or llm-gpt4all or llm-mlc&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2023/Dec/18/mistral/#mistral-api"&gt;Using the Mistral API, which includes the new Mistral-medium&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2023/Dec/18/mistral/#mistral-other-apis"&gt;Mistral via other API providers&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2023/Dec/18/mistral/#llamafile-openai"&gt;Using Llamafile's OpenAI API endpoint&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h4 id="mixtral-llama-cpp"&gt;Mixtral 8x7B via llama.cpp and llm-llama-cpp&lt;/h4&gt;
&lt;p&gt;On Friday 8th December Mistral AI &lt;a href="https://twitter.com/MistralAI/status/1733150512395038967"&gt;tweeted a mysterious magnet&lt;/a&gt; (BitTorrent) link. This is the second time they've done this, the first was on September 26th when &lt;a href="https://twitter.com/MistralAI/status/1706877320844509405"&gt;they released&lt;/a&gt; their excellent Mistral 7B model, also as a magnet link.&lt;/p&gt;
&lt;p&gt;The new release was an 87GB file containing Mixtral 8x7B - "a high-quality sparse mixture of experts model (SMoE) with open weights", according to &lt;a href="https://mistral.ai/news/mixtral-of-experts/"&gt;the article&lt;/a&gt; they released three days later.&lt;/p&gt;
&lt;p&gt;Mixtral is a &lt;em&gt;very&lt;/em&gt; impressive model. GPT-4 has long been rumored to use a mixture of experts architecture, and Mixtral is the first truly convincing openly licensed implementation of this architecture I've seen. It's already showing impressive benchmark scores.&lt;/p&gt;
&lt;p&gt;This &lt;a href="https://github.com/ggerganov/llama.cpp/pull/4406"&gt;PR for llama.cpp&lt;/a&gt; added support for the new model. &lt;a href="https://github.com/abetlen/llama-cpp-python"&gt;llama-cpp-python&lt;/a&gt; updated to land that patch shortly afterwards.&lt;/p&gt;
&lt;p&gt;Which means... you can now run Mixtral on a Mac (and other platforms too, though I haven't tested them myself yet) using my &lt;a href="https://github.com/simonw/llm-llama-cpp"&gt;llm-llama-cpp plugin&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Here's how to do that:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;a href="https://llm.datasette.io/en/stable/setup.html"&gt;Install LLM&lt;/a&gt;:
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;pipx install llm&lt;/pre&gt;&lt;/div&gt;
&lt;/li&gt;
&lt;li&gt;Install the plugin:
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm install llm-llama-cpp&lt;/pre&gt;&lt;/div&gt;
&lt;/li&gt;
&lt;li&gt;Install &lt;code&gt;llama-cpp-python&lt;/code&gt; - this needs to be done manually because the best approach differs for different platforms. On an Apple Silicon Mac I recommend running:
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;CMAKE_ARGS=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;-DLLAMA_METAL=on&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; FORCE_CMAKE=1 llm install llama-cpp-python&lt;/pre&gt;&lt;/div&gt;
More details &lt;a href="https://github.com/simonw/llm-llama-cpp"&gt;in the llm-llama-cpp README&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Download a GGUF file for Mixtral 8X7B Instruct v0.1. TheBloke has &lt;a href="https://huggingface.co/TheBloke/Mixtral-8x7B-Instruct-v0.1-GGUF/tree/main"&gt;8 different options&lt;/a&gt; for this - I picked the 36GB &lt;code&gt;mixtral-8x7b-instruct-v0.1.Q6_K.gguf&lt;/code&gt;:
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;curl -LO &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;https://huggingface.co/TheBloke/Mixtral-8x7B-Instruct-v0.1-GGUF/resolve/main/mixtral-8x7b-instruct-v0.1.Q6_K.gguf?download=true&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;/li&gt;
&lt;li&gt;Run the model like this:
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm -m gguf \
  -o path mixtral-8x7b-instruct-v0.1.Q6_K.gguf \
  &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;[INST] Write a Python function that downloads a file from a URL[/INST]&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The &lt;code&gt;-m gguf&lt;/code&gt; here says to use the &lt;code&gt;gguf&lt;/code&gt; model that was installed by the &lt;code&gt;llm-llama-cpp&lt;/code&gt; plugin - this is a model which requires you to pass a &lt;code&gt;path&lt;/code&gt; option - with &lt;code&gt;-o path filename&lt;/code&gt; - specifying which model you would like to run.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;[INST]&lt;/code&gt; part there is the format of prompt expected by the Instruct model. I need to update the &lt;code&gt;llm-llama-cpp&lt;/code&gt; plugin to make it easier to automatically construct the correct prompt format, see &lt;a href="https://github.com/simonw/llm-llama-cpp/issues/28"&gt;this issue&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Here's what I got for that prompt:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Sure, here's an example Python function that downloads a file from a URL using the &lt;code&gt;requests&lt;/code&gt; library:&lt;/p&gt;
&lt;pre&gt;&lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-s1"&gt;requests&lt;/span&gt;

&lt;span class="pl-k"&gt;def&lt;/span&gt; &lt;span class="pl-en"&gt;download_file&lt;/span&gt;(&lt;span class="pl-s1"&gt;url&lt;/span&gt;, &lt;span class="pl-s1"&gt;save_path&lt;/span&gt;):
    &lt;span class="pl-s"&gt;"""&lt;/span&gt;
&lt;span class="pl-s"&gt;    Download a file from a URL and save it to a local path.&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;    :param url: The URL of the file to download.&lt;/span&gt;
&lt;span class="pl-s"&gt;    :param save_path: The local path to save the file.&lt;/span&gt;
&lt;span class="pl-s"&gt;    """&lt;/span&gt;
    &lt;span class="pl-s1"&gt;response&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;requests&lt;/span&gt;.&lt;span class="pl-en"&gt;get&lt;/span&gt;(&lt;span class="pl-s1"&gt;url&lt;/span&gt;, &lt;span class="pl-s1"&gt;stream&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-c1"&gt;True&lt;/span&gt;)
    &lt;span class="pl-k"&gt;if&lt;/span&gt; &lt;span class="pl-s1"&gt;response&lt;/span&gt;.&lt;span class="pl-s1"&gt;status_code&lt;/span&gt; &lt;span class="pl-c1"&gt;==&lt;/span&gt; &lt;span class="pl-c1"&gt;200&lt;/span&gt;:
        &lt;span class="pl-k"&gt;with&lt;/span&gt; &lt;span class="pl-en"&gt;open&lt;/span&gt;(&lt;span class="pl-s1"&gt;save_path&lt;/span&gt;, &lt;span class="pl-s"&gt;'wb'&lt;/span&gt;) &lt;span class="pl-k"&gt;as&lt;/span&gt; &lt;span class="pl-s1"&gt;f&lt;/span&gt;:
            &lt;span class="pl-k"&gt;for&lt;/span&gt; &lt;span class="pl-s1"&gt;chunk&lt;/span&gt; &lt;span class="pl-c1"&gt;in&lt;/span&gt; &lt;span class="pl-s1"&gt;response&lt;/span&gt;.&lt;span class="pl-en"&gt;iter_content&lt;/span&gt;(&lt;span class="pl-c1"&gt;1024&lt;/span&gt;):
                &lt;span class="pl-s1"&gt;f&lt;/span&gt;.&lt;span class="pl-en"&gt;write&lt;/span&gt;(&lt;span class="pl-s1"&gt;chunk&lt;/span&gt;)
        &lt;span class="pl-en"&gt;print&lt;/span&gt;(&lt;span class="pl-s"&gt;f"File downloaded successfully at &lt;span class="pl-s1"&gt;&lt;span class="pl-kos"&gt;{&lt;/span&gt;&lt;span class="pl-s1"&gt;save_path&lt;/span&gt;&lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;/span&gt;"&lt;/span&gt;)
    &lt;span class="pl-k"&gt;else&lt;/span&gt;:
        &lt;span class="pl-en"&gt;print&lt;/span&gt;(&lt;span class="pl-s"&gt;f"Failed to download file from &lt;span class="pl-s1"&gt;&lt;span class="pl-kos"&gt;{&lt;/span&gt;&lt;span class="pl-s1"&gt;url&lt;/span&gt;&lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;/span&gt;. Status code: &lt;span class="pl-s1"&gt;&lt;span class="pl-kos"&gt;{&lt;/span&gt;&lt;span class="pl-s1"&gt;response&lt;/span&gt;.&lt;span class="pl-s1"&gt;status_code&lt;/span&gt;&lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;/span&gt;"&lt;/span&gt;)&lt;/pre&gt;
&lt;p&gt;To use this function, simply call it with the URL of the file you want to download and the local path where you want to save it. For example:&lt;/p&gt;
&lt;pre&gt;&lt;span class="pl-en"&gt;download_file&lt;/span&gt;(&lt;span class="pl-s"&gt;'https://example.com/myfile.txt'&lt;/span&gt;, &lt;span class="pl-s"&gt;'/path/to/save/myfile.txt'&lt;/span&gt;)&lt;/pre&gt;
&lt;p&gt;Note that this function uses the &lt;code&gt;stream=True&lt;/code&gt; parameter in the &lt;code&gt;requests.get()&lt;/code&gt; method to download the file in chunks, which can help prevent memory issues when downloading large files. It also checks the HTTP status code of the response before saving the file, so you can handle any errors that might occur during the download.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;That's a very solid reply!&lt;/p&gt;
&lt;h4 id="mistral-7b-local"&gt;Mistral 7B via llm-llama-cpp or llm-gpt4all or llm-mlc&lt;/h4&gt;
&lt;p&gt;The smaller Mistral 7B model dropped back in September. It's since established itself as the most capable model family of that size - a size which is very convenient for running on personal devices.&lt;/p&gt;
&lt;p&gt;I'm even running Mistral 7B on my iPhone now, thanks to an update to the &lt;a href="https://apps.apple.com/us/app/mlc-chat/id6448482937"&gt;MLC Chat iOS app&lt;/a&gt; from a few days ago.&lt;/p&gt;
&lt;p&gt;There are a bunch of different options for running this model and its variants locally using LLM on a Mac - and probably other platforms too, though I've not tested these options myself on Linux or Windows:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Using &lt;a href="https://github.com/simonw/llm-llama-cpp"&gt;llm-llama-cpp&lt;/a&gt;: download one of &lt;a href="https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.1-GGUF"&gt;these Mistral-7B-Instruct GGUF files&lt;/a&gt; for the chat-tuned version, or &lt;a href="https://huggingface.co/TheBloke/Mistral-7B-v0.1-GGUF/tree/main"&gt;one of these&lt;/a&gt; for base Mistral, then follow the steps listed above&lt;/li&gt;
&lt;li&gt;Using &lt;a href="https://github.com/simonw/llm-gpt4all"&gt;llm-gpt4all&lt;/a&gt;. This is the easiest plugin to install:
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm install llm-gpt4all&lt;/pre&gt;&lt;/div&gt;
The model will be downloaded the first time you try to use it:
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm -m mistral-7b-instruct-v0 &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;Introduce yourself&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;/li&gt;
&lt;li&gt;Using &lt;a href="https://github.com/simonw/llm-mlc"&gt;llm-mlc&lt;/a&gt;. Follow the instructions in the README to install it, then:
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; Download the model:&lt;/span&gt;
llm mlc download-model https://huggingface.co/mlc-ai/mlc-chat-Mistral-7B-Instruct-v0.2-q3f16_1
&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; Run it like this:&lt;/span&gt;
llm -m mlc-chat-Mistral-7B-Instruct-v0.2-q3f16_1 &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;Introduce yourself&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Each of these options works, but I've not yet spent time comparing them in terms of output quality or performance.&lt;/p&gt;
&lt;h4 id="mistral-api"&gt;Using the Mistral API, which includes the new Mistral-medium&lt;/h4&gt;
&lt;p&gt;Mistral also recently announced &lt;a href="https://mistral.ai/news/la-plateforme/"&gt;La plateforme&lt;/a&gt;, their early access API for calling hosted versions of their models.&lt;/p&gt;
&lt;p&gt;Their new API renames the Mistral 7B model to "Mistral-tiny" and the new Mixtral model to "Mistral-small"... and offers something called &lt;strong&gt;Mistral-medium&lt;/strong&gt; as well:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Our highest-quality endpoint currently serves a prototype model, that is currently among the top serviced models available based on standard benchmarks. It masters English/French/Italian/German/Spanish and code and obtains a score of 8.6 on MT-Bench.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I got access to their API and used it to build a new plugin, &lt;a href="https://github.com/simonw/llm-mistral"&gt;llm-mistral&lt;/a&gt;. Here's how to use that:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Install it:
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm install llm-mistral&lt;/pre&gt;&lt;/div&gt;
&lt;/li&gt;
&lt;li&gt;Set your Mistral API key:
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm keys &lt;span class="pl-c1"&gt;set&lt;/span&gt; mistral
&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; &amp;lt;paste key here&amp;gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;/li&gt;
&lt;li&gt;Run the models like this:
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm -m mistral-tiny &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;Say hi&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; Or mistral-small or mistral-medium&lt;/span&gt;
cat mycode.py &lt;span class="pl-k"&gt;|&lt;/span&gt; llm -m mistral-medium -s &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;Explain this code&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Here's their comparison table pitching Mistral Small and Medium against GPT-3.5:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2023/mistral-table.jpg" alt="MMLU (MCQ in 57 subjects): GPT-3.5 scored 70%, Mistral Small scored 70.6%, Mistral Medium scored 75.3%. HellaSwag (10-shot): GPT-3.5 scored 85.5%, Mistral Small scored 86.7%, Mistral Medium scored 88%. ARC Challenge (25-shot): GPT-3.5 scored 85.2%, Mistral Small scored 85.8%, Mistral Medium scored 89.9%. WinoGrande (5-shot): GPT-3.5 scored 81.6%, Mistral Small scored 81.2%, Mistral Medium scored 88%. MBPP (pass@1): GPT-3.5 scored 52.2%, Mistral Small scored 60.7%, Mistral Medium scored 62.3%. GSM-8K (5-shot): GPT-3.5 scored 57.1%, Mistral Small scored 58.4%, Mistral Medium scored 66.7%. MT Bench (for Instruct models): GPT-3.5 scored 8.32, Mistral Small scored 8.30, Mistral Medium scored 8.61." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;These may well be cherry-picked, but note that Small beats GPT-3.5 on almost every metric, and Medium beats it on everything by a wider margin.&lt;/p&gt;
&lt;p&gt;Here's the &lt;a href="https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard"&gt;MT Bench leaderboard&lt;/a&gt; which includes scores for GPT-4 and Claude 2.1:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2023/mt-bench.jpg" alt="GPT-4-Turbo: Arena Elo rating 1217, MT-bench score 9.32. GPT-4-0613: Arena Elo rating 1152, MT-bench score 9.18. GPT-4-0314: Arena Elo rating 1201, MT-bench score 8.96. GPT-3.5-turbo-0613: Arena Elo rating 1112, MT-bench score 8.39. GPT-3.5-Turbo-1106: Arena Elo rating 1074, MT-bench score 8.32. Claude-2.1: Arena Elo rating 1118, MT-bench score 8.18." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;That 8.61 score for Medium puts it halfway between GPT-3.5 and GPT-4.&lt;/p&gt;
&lt;p&gt;Benchmark scores are no replacement for spending time with a model to get a feel for how well it behaves across a wide spectrum of tasks, but these scores are extremely promising. GPT-4 may not hold the best model crown for much longer.&lt;/p&gt;
&lt;h4 id="mistral-other-apis"&gt;Mistral via other API providers&lt;/h4&gt;
&lt;p&gt;Since both Mistral 7B and Mixtral 8x7B are available under an Apache 2 license, there's been something of a race to the bottom in terms of pricing from other LLM hosting providers.&lt;/p&gt;
&lt;p&gt;This trend makes me a little nervous, since it actively disincentivizes future open model releases from Mistral and from other providers who are hoping to offer their own hosted versions.&lt;/p&gt;
&lt;p&gt;LLM has plugins for a bunch of these providers already. The three that I've tried so far are Replicate, Anyscale Endpoints and OpenRouter.&lt;/p&gt;
&lt;p&gt;For &lt;a href="https://replicate.com/"&gt;Replicate&lt;/a&gt; using &lt;a href="https://github.com/simonw/llm-replicate"&gt;llm-replicate&lt;/a&gt;:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm install llm-replicate
llm keys &lt;span class="pl-c1"&gt;set&lt;/span&gt; replicate
&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; &amp;lt;paste API key here&amp;gt;&lt;/span&gt;
llm replicate add mistralai/mistral-7b-v0.1&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Then run prompts like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm -m replicate-mistralai-mistral-7b-v0.1 &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;3 reasons to get a pet weasel:&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This example is the non-instruct tuned model, so the prompt needs to be shaped such that the model can complete it.&lt;/p&gt;
&lt;p&gt;For &lt;a href="https://www.anyscale.com/endpoints"&gt;Anyscale Endpoints&lt;/a&gt; using &lt;a href="https://github.com/simonw/llm-anyscale-endpoints"&gt;llm-anyscale-endpoints&lt;/a&gt;:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm install llm-anyscale-endpoints
llm keys &lt;span class="pl-c1"&gt;set&lt;/span&gt; anyscale-endpoints
&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; &amp;lt;paste API key here&amp;gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Now you can run both the 7B and the Mixtral 8x7B models:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm -m mistralai/Mixtral-8x7B-Instruct-v0.1 \
  &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;3 reasons to get a pet weasel&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;
llm -m mistralai/Mistral-7B-Instruct-v0.1 \
  &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;3 reasons to get a pet weasel&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;And for &lt;a href="https://openrouter.ai/"&gt;OpenRouter&lt;/a&gt; using &lt;a href="https://github.com/simonw/llm-openrouter"&gt;llm-openrouter&lt;/a&gt;:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm install llm-openrouter
llm keys &lt;span class="pl-c1"&gt;set&lt;/span&gt; openrouter
&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; &amp;lt;paste API key here&amp;gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Then run the models like so:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm -m openrouter/mistralai/mistral-7b-instruct \
  &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;2 reasons to get a pet dragon&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;
llm -m openrouter/mistralai/mixtral-8x7b-instruct \
  &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;2 reasons to get a pet dragon&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;OpenRouter are currently offering Mistral and Mixtral via their API for $0.00/1M input tokens - it's free! Obviously not sustainable, so don't rely on that continuing, but that does make them a great platform for running some initial experiments with these models.&lt;/p&gt;
&lt;h4 id="llamafile-openai"&gt;Using Llamafile's OpenAI API endpoint&lt;/h4&gt;
&lt;p&gt;I &lt;a href="https://simonwillison.net/2023/Nov/29/llamafile/"&gt;wrote about Llamafile&lt;/a&gt; recently, a fascinating option for running LLMs where the LLM is bundled up in an executable that includes everything needed to run it, on multiple platforms.&lt;/p&gt;
&lt;p&gt;Justine Tunney released &lt;a href="https://huggingface.co/jartine/Mixtral-8x7B-v0.1.llamafile/tree/main"&gt;llamafiles for Mixtral&lt;/a&gt; a few days ago.&lt;/p&gt;
&lt;p&gt;The &lt;a href="https://huggingface.co/jartine/Mixtral-8x7B-v0.1.llamafile/blob/main/mixtral-8x7b-instruct-v0.1.Q5_K_M-server.llamafile"&gt;mixtral-8x7b-instruct-v0.1.Q5_K_M-server.llamafile&lt;/a&gt; one runs an OpenAI-compatible API endpoint which LLM can talk to.&lt;/p&gt;
&lt;p&gt;Here's how to use that:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Download the llamafile:
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;curl -LO https://huggingface.co/jartine/Mixtral-8x7B-v0.1.llamafile/resolve/main/mixtral-8x7b-instruct-v0.1.Q5_K_M-server.llamafile&lt;/pre&gt;&lt;/div&gt;
&lt;/li&gt;
&lt;li&gt;Start that running:
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;./mixtral-8x7b-instruct-v0.1.Q5_K_M-server.llamafile&lt;/pre&gt;&lt;/div&gt;
You may need to &lt;code&gt;chmod 755 mixtral-8x7b-instruct-v0.1.Q5_K_M-server.llamafile&lt;/code&gt; it first, but I found I didn't need to.&lt;/li&gt;
&lt;li&gt;Configure LLM to know about that endpoint, by adding the following to a file at &lt;code&gt;~/Library/Application Support/io.datasette.llm/extra-openai-models.yaml&lt;/code&gt;:
&lt;div class="highlight highlight-source-yaml"&gt;&lt;pre&gt;- &lt;span class="pl-ent"&gt;model_id&lt;/span&gt;: &lt;span class="pl-s"&gt;llamafile&lt;/span&gt;
  &lt;span class="pl-ent"&gt;model_name&lt;/span&gt;: &lt;span class="pl-s"&gt;llamafile&lt;/span&gt;
  &lt;span class="pl-ent"&gt;api_base&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;http://127.0.0.1:8080/v1&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
This registers a model called &lt;code&gt;llamafile&lt;/code&gt; which you can now call like this:
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm -m llamafile &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;Say hello to the world&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Setting up that &lt;code&gt;llamafile&lt;/code&gt; alias means you'll be able to use the same CLI invocation for any llamafile models you run on that default 8080 port.&lt;/p&gt;
&lt;p&gt;The same exact approach should work for other model hosting options that provide an endpoint that imitates the OpenAI API.&lt;/p&gt;
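&lt;p&gt;You can also skip LLM entirely and talk to one of these OpenAI-imitating endpoints directly using just the Python standard library. This sketch targets the llamafile server's default port - the &lt;code&gt;/v1/chat/completions&lt;/code&gt; path and response shape follow the OpenAI convention, and most local servers ignore the &lt;code&gt;model&lt;/code&gt; field:&lt;/p&gt;

```python
import json
import urllib.request


def build_chat_request(prompt: str, base_url: str = "http://127.0.0.1:8080/v1") -> urllib.request.Request:
    """Build an OpenAI-style chat completion request for a local llamafile server."""
    payload = {
        "model": "llamafile",  # most local servers ignore this field
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )


def chat(prompt: str) -> str:
    """POST the prompt and pull the reply out of the OpenAI-shaped response."""
    with urllib.request.urlopen(build_chat_request(prompt)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


if __name__ == "__main__":
    print(chat("Say hello to the world"))
```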
&lt;h4&gt;This is LLM plugins working as intended&lt;/h4&gt;
&lt;p&gt;When I &lt;a href="https://simonwillison.net/2023/Jul/12/llm/"&gt;added plugin support to LLM&lt;/a&gt; this was exactly what I had in mind: I want it to be as easy as possible to add support for new models, both local and remotely hosted.&lt;/p&gt;
&lt;p&gt;The &lt;a href="https://llm.datasette.io/en/stable/plugins/directory.html"&gt;LLM plugin directory&lt;/a&gt; lists 19 plugins in total now.&lt;/p&gt;
&lt;p&gt;If you want to build your own plugin - for a locally hosted model or for one exposed via a remote API - the &lt;a href="https://llm.datasette.io/en/stable/plugins/tutorial-model-plugin.html"&gt;plugin author tutorial&lt;/a&gt; (plus reviewing code from the existing plugins) should hopefully provide everything you need.&lt;/p&gt;
&lt;p&gt;You're also welcome to join us in the &lt;a href="https://datasette.io/discord-llm"&gt;#llm Discord channel&lt;/a&gt; to talk about your plans for your project.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/cli"&gt;cli&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/plugins"&gt;plugins&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/mistral"&gt;mistral&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llamafile"&gt;llamafile&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llama-cpp"&gt;llama-cpp&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openrouter"&gt;openrouter&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="cli"/><category term="plugins"/><category term="projects"/><category term="ai"/><category term="generative-ai"/><category term="local-llms"/><category term="llms"/><category term="llm"/><category term="mistral"/><category term="llamafile"/><category term="llama-cpp"/><category term="openrouter"/></entry></feed>