<?xml version="1.0" encoding="utf-8"?>
<feed xml:lang="en-us" xmlns="http://www.w3.org/2005/Atom"><title>Simon Willison's Weblog: openrouter</title><link href="http://simonwillison.net/" rel="alternate"/><link href="http://simonwillison.net/tags/openrouter.atom" rel="self"/><id>http://simonwillison.net/</id><updated>2026-02-17T04:30:57+00:00</updated><author><name>Simon Willison</name></author><entry><title>Qwen3.5: Towards Native Multimodal Agents</title><link href="https://simonwillison.net/2026/Feb/17/qwen35/#atom-tag" rel="alternate"/><published>2026-02-17T04:30:57+00:00</published><updated>2026-02-17T04:30:57+00:00</updated><id>https://simonwillison.net/2026/Feb/17/qwen35/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://qwen.ai/blog?id=qwen3.5"&gt;Qwen3.5: Towards Native Multimodal Agents&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Alibaba's Qwen just released the first two models in the Qwen 3.5 series - one open weights, one proprietary. Both are multi-modal for vision input.&lt;/p&gt;
&lt;p&gt;The open weight one is a Mixture of Experts model called Qwen3.5-397B-A17B. Interesting to see Qwen call out serving efficiency as a benefit of that architecture:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Built on an innovative hybrid architecture that fuses linear attention (via Gated Delta Networks) with a sparse mixture-of-experts, the model attains remarkable inference efficiency: although it comprises 397 billion total parameters, just 17 billion are activated per forward pass, optimizing both speed and cost without sacrificing capability.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;It's &lt;a href="https://huggingface.co/Qwen/Qwen3.5-397B-A17B"&gt;807GB on Hugging Face&lt;/a&gt;, and Unsloth have a &lt;a href="https://huggingface.co/unsloth/Qwen3.5-397B-A17B-GGUF"&gt;collection of smaller GGUFs&lt;/a&gt; ranging in size from 94.2GB 1-bit to 462GB Q8_K_XL.&lt;/p&gt;
&lt;p&gt;I got this &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle/"&gt;pelican&lt;/a&gt; from the &lt;a href="https://openrouter.ai/qwen/qwen3.5-397b-a17b"&gt;OpenRouter hosted model&lt;/a&gt; (&lt;a href="https://gist.github.com/simonw/625546cf6b371f9c0040e64492943b82"&gt;transcript&lt;/a&gt;):&lt;/p&gt;
&lt;p&gt;&lt;img alt="Pelican is quite good although the neck lacks an outline for some reason. Bicycle is very basic with an incomplete frame" src="https://static.simonwillison.net/static/2026/qwen3.5-397b-a17b.png" /&gt;&lt;/p&gt;
&lt;p&gt;The proprietary hosted model is called Qwen3.5 Plus 2026-02-15, and is a little confusing. Qwen researcher &lt;a href="https://twitter.com/JustinLin610/status/2023340126479569140"&gt;Junyang Lin says&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Qwen3-Plus is a hosted API version of 397B. As the model natively supports 256K tokens, Qwen3.5-Plus supports 1M token context length. Additionally it supports search and code interpreter, which you can use on Qwen Chat with Auto mode.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Here's &lt;a href="https://gist.github.com/simonw/9507dd47483f78dc1195117735273e20"&gt;its pelican&lt;/a&gt;, which is similar in quality to the open weights model:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Similar quality pelican. The bicycle is taller and has a better frame shape. They are visually quite similar." src="https://static.simonwillison.net/static/2026/qwen3.5-plus-02-15.png" /&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/vision-llms"&gt;vision-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/qwen"&gt;qwen&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openrouter"&gt;openrouter&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-in-china"&gt;ai-in-china&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="vision-llms"/><category term="qwen"/><category term="pelican-riding-a-bicycle"/><category term="llm-release"/><category term="openrouter"/><category term="ai-in-china"/></entry><entry><title>GLM-5: From Vibe Coding to Agentic Engineering</title><link href="https://simonwillison.net/2026/Feb/11/glm-5/#atom-tag" rel="alternate"/><published>2026-02-11T18:56:14+00:00</published><updated>2026-02-11T18:56:14+00:00</updated><id>https://simonwillison.net/2026/Feb/11/glm-5/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://z.ai/blog/glm-5"&gt;GLM-5: From Vibe Coding to Agentic Engineering&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;This is a &lt;em&gt;huge&lt;/em&gt; new MIT-licensed model: 744B parameters and &lt;a href="https://huggingface.co/zai-org/GLM-5"&gt;1.51TB on Hugging Face&lt;/a&gt; - twice the size of &lt;a href="https://huggingface.co/zai-org/GLM-4.7"&gt;GLM-4.7&lt;/a&gt;, which was 368B and 717GB (4.5 and 4.6 were around that size too).&lt;/p&gt;
&lt;p&gt;It's interesting to see Z.ai take a position on what we should call professional software engineers building with LLMs - I've seen &lt;strong&gt;Agentic Engineering&lt;/strong&gt; show up in a few other places recently, most notably &lt;a href="https://twitter.com/karpathy/status/2019137879310836075"&gt;from Andrej Karpathy&lt;/a&gt; and &lt;a href="https://addyosmani.com/blog/agentic-engineering/"&gt;Addy Osmani&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I ran my "Generate an SVG of a pelican riding a bicycle" prompt through GLM-5 via &lt;a href="https://openrouter.ai/"&gt;OpenRouter&lt;/a&gt; and got back &lt;a href="https://gist.github.com/simonw/cc4ca7815ae82562e89a9fdd99f0725d"&gt;a very good pelican on a disappointing bicycle frame&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img alt="The pelican is good and has a well defined beak. The bicycle frame is a wonky red triangle. Nice sun and motion lines." src="https://static.simonwillison.net/static/2026/glm-5-pelican.png" /&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=46977210"&gt;Hacker News&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/definitions"&gt;definitions&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-programming"&gt;ai-assisted-programming&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/vibe-coding"&gt;vibe-coding&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openrouter"&gt;openrouter&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-in-china"&gt;ai-in-china&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/glm"&gt;glm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/agentic-engineering"&gt;agentic-engineering&lt;/a&gt;&lt;/p&gt;



</summary><category term="definitions"/><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="ai-assisted-programming"/><category term="pelican-riding-a-bicycle"/><category term="llm-release"/><category term="vibe-coding"/><category term="openrouter"/><category term="ai-in-china"/><category term="glm"/><category term="agentic-engineering"/></entry><entry><title>Open Responses</title><link href="https://simonwillison.net/2026/Jan/15/open-responses/#atom-tag" rel="alternate"/><published>2026-01-15T23:56:56+00:00</published><updated>2026-01-15T23:56:56+00:00</updated><id>https://simonwillison.net/2026/Jan/15/open-responses/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.openresponses.org/"&gt;Open Responses&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;This is the standardization effort I've most wanted in the world of LLMs: a vendor-neutral specification for the JSON API that clients can use to talk to hosted LLMs.&lt;/p&gt;
&lt;p&gt;Open Responses aims to provide exactly that as a documented standard, derived from OpenAI's Responses API.&lt;/p&gt;
&lt;p&gt;I was hoping for one based on their older Chat Completions API, since so many other products have cloned that already, but basing it on Responses does make sense: that API was designed with the features of more recent models - such as reasoning traces - baked into the design.&lt;/p&gt;
&lt;p&gt;What's certainly notable is the list of launch partners. OpenRouter alone means we can expect to be able to use this protocol with almost every existing model, and Hugging Face, LM Studio, vLLM, Ollama and Vercel cover a huge portion of the common tools used to serve models.&lt;/p&gt;
&lt;p&gt;For protocols like this I really want to see a comprehensive, language-independent conformance test site. Open Responses has a subset of that - the official repository includes &lt;a href="https://github.com/openresponses/openresponses/blob/d0f23437b27845d5c3d0abaf5cb5c4a702f26b05/src/lib/compliance-tests.ts"&gt;src/lib/compliance-tests.ts&lt;/a&gt; which can be used to exercise a server implementation, and is available as a React app &lt;a href="https://www.openresponses.org/compliance"&gt;on the official site&lt;/a&gt; that can be pointed at any implementation served via CORS.&lt;/p&gt;
&lt;p&gt;What's missing is the equivalent for clients. I plan to spin up my own client library for this in Python and I'd really like to be able to run that against a conformance suite designed to check that my client correctly handles all of the details.&lt;/p&gt;
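&lt;p&gt;As a sketch of what such a client might look like - note that the &lt;code&gt;/v1/responses&lt;/code&gt; path, the payload shape and the model name here are my assumptions for illustration, not details confirmed against the specification:&lt;/p&gt;

```python
import json
import urllib.request


def build_responses_request(model: str, input_text: str) -> dict:
    # Responses-derived shape (an assumption): a model plus an "input"
    # list of role/content messages.
    return {
        "model": model,
        "input": [{"role": "user", "content": input_text}],
    }


def prepare_request(base_url: str, api_key: str, payload: dict) -> urllib.request.Request:
    # Builds but does not send the HTTP request; pass the result to
    # urllib.request.urlopen() to actually call a server.
    return urllib.request.Request(
        f"{base_url}/v1/responses",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
    )


# Hypothetical usage - "example-model" is a placeholder, not a real model ID:
payload = build_responses_request(
    "example-model", "Generate an SVG of a pelican riding a bicycle"
)
request = prepare_request("https://example.com", "sk-placeholder", payload)
```

A conformance suite for clients would exercise exactly this layer: the shape of the payload, the headers, and how the client parses whatever comes back.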

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/reach_vb/status/2011863516852965565"&gt;VB&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/json"&gt;json&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/standards"&gt;standards&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openai"&gt;openai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openrouter"&gt;openrouter&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/conformance-suites"&gt;conformance-suites&lt;/a&gt;&lt;/p&gt;



</summary><category term="json"/><category term="standards"/><category term="ai"/><category term="openai"/><category term="generative-ai"/><category term="llms"/><category term="openrouter"/><category term="conformance-suites"/></entry><entry><title>DeepSeek-V3.2</title><link href="https://simonwillison.net/2025/Dec/1/deepseek-v32/#atom-tag" rel="alternate"/><published>2025-12-01T23:56:19+00:00</published><updated>2025-12-01T23:56:19+00:00</updated><id>https://simonwillison.net/2025/Dec/1/deepseek-v32/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://api-docs.deepseek.com/news/news251201"&gt;DeepSeek-V3.2&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Two new open weight (MIT licensed) models from DeepSeek today: &lt;a href="https://huggingface.co/deepseek-ai/DeepSeek-V3.2"&gt;DeepSeek-V3.2&lt;/a&gt; and &lt;a href="https://huggingface.co/deepseek-ai/DeepSeek-V3.2-Speciale"&gt;DeepSeek-V3.2-Speciale&lt;/a&gt;, both 690GB, 685B parameters. Here's the &lt;a href="https://huggingface.co/deepseek-ai/DeepSeek-V3.2/resolve/main/assets/paper.pdf"&gt;PDF tech report&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;DeepSeek-V3.2 is DeepSeek's new flagship model, now running on &lt;a href="https://chat.deepseek.com"&gt;chat.deepseek.com&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The difference between the two new models is best explained by this paragraph from the technical report:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;DeepSeek-V3.2 integrates reasoning, agent, and human alignment data distilled from specialists, undergoing thousands of steps of continued RL training to reach the final checkpoints. To investigate the potential of extended thinking, we also developed an experimental variant, DeepSeek-V3.2-Speciale. This model was trained exclusively on reasoning data with a reduced length penalty during RL. Additionally, we incorporated the dataset and reward method from DeepSeekMath-V2 (Shao et al., 2025) to enhance capabilities in mathematical proofs.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I covered &lt;a href="https://simonwillison.net/2025/Nov/27/deepseek-math-v2/"&gt;DeepSeek-Math-V2 last week&lt;/a&gt;. Like that model, DeepSeek-V3.2-Speciale also scores gold on the 2025 International Mathematical Olympiad so beloved of model training teams!&lt;/p&gt;
&lt;p id="pelicans"&gt;I tried both models on "Generate an SVG of a pelican riding a bicycle" using the chat feature of [OpenRouter](https://openrouter.ai/). DeepSeek V3.2 produced this very short reasoning chain:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Let's assume the following:&lt;/p&gt;
&lt;p&gt;Wheel radius: 40&lt;br&gt;
Distance between wheel centers: 180&lt;br&gt;
Seat height: 60 (above the rear wheel center)&lt;br&gt;
Handlebars: above the front wheel, extending back and up.&lt;/p&gt;
&lt;p&gt;We'll set the origin at the center of the rear wheel.&lt;/p&gt;
&lt;p&gt;We'll create the SVG with a viewBox that fits the entire drawing.&lt;/p&gt;
&lt;p&gt;Let's start by setting up the SVG.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Followed by this illustration:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Pleasing gradents for the sky and ground and sun. Neat three-circle clouds. A Pelican on a Bicycle title printed on the image. The pelican is cute but stlightly detached from the bicycle. The bicycle has a somewhat mangled brown frame." src="https://static.simonwillison.net/static/2025/deepseek-v32.png" /&gt;&lt;/p&gt;
&lt;p&gt;Here's what I got from the Speciale model, which thought deeply about the geometry of bicycles and pelicans for &lt;a href="https://gist.githubusercontent.com/simonw/3debaf0df67c2d99a36f41f21ffe534c/raw/fbbb60c6d5b6f02d539ade5105b990490a81a86d/svg.txt"&gt;a very long time (at least 10 minutes)&lt;/a&gt; before spitting out this result:&lt;/p&gt;
&lt;p&gt;&lt;img alt="It's not great. The bicycle is distorted, the pelican is a white oval, an orange almost-oval beak, a little black eye and setched out straight line limbs leading to the pedal and handlebars." src="https://static.simonwillison.net/static/2025/deepseek-v32-speciale.png" /&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=46108780"&gt;Hacker News&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-reasoning"&gt;llm-reasoning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/deepseek"&gt;deepseek&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openrouter"&gt;openrouter&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-in-china"&gt;ai-in-china&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="pelican-riding-a-bicycle"/><category term="llm-reasoning"/><category term="deepseek"/><category term="llm-release"/><category term="openrouter"/><category term="ai-in-china"/></entry><entry><title>Kimi K2 Thinking</title><link href="https://simonwillison.net/2025/Nov/6/kimi-k2-thinking/#atom-tag" rel="alternate"/><published>2025-11-06T23:53:06+00:00</published><updated>2025-11-06T23:53:06+00:00</updated><id>https://simonwillison.net/2025/Nov/6/kimi-k2-thinking/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://huggingface.co/moonshotai/Kimi-K2-Thinking"&gt;Kimi K2 Thinking&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Chinese AI lab Moonshot's Kimi K2 established itself as one of the largest open weight models - 1 trillion parameters - &lt;a href="https://simonwillison.net/2025/Jul/11/kimi-k2/"&gt;back in July&lt;/a&gt;. They've now released the Thinking version, also a trillion parameters (MoE, 32B active) and also under their custom modified (so &lt;a href="https://simonwillison.net/2025/Jul/11/kimi-k2/#kimi-license"&gt;not quite open source&lt;/a&gt;) MIT license.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Starting with Kimi K2, we built it as a thinking agent that reasons step-by-step while dynamically invoking tools. It sets a new state-of-the-art on Humanity's Last Exam (HLE), BrowseComp, and other benchmarks by dramatically scaling multi-step reasoning depth and maintaining stable tool-use across 200–300 sequential calls. At the same time, K2 Thinking is a native INT4 quantization model with 256k context window, achieving lossless reductions in inference latency and GPU memory usage.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This one is only 594GB on Hugging Face - Kimi K2 was 1.03TB - which I think is due to the new INT4 quantization. This makes the model both cheaper and faster to host.&lt;/p&gt;
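&lt;p&gt;A back-of-envelope sketch of why INT4 roughly halves the download - this assumes ~1 trillion parameters and treats the checkpoints as uniform precision, which they aren't (mixed-precision layers and metadata explain why the real files are larger than the pure 4-bit and 8-bit figures):&lt;/p&gt;

```python
# Rough checkpoint-size arithmetic. The parameter count is approximate
# and real checkpoints mix precisions, so these are lower bounds.
PARAMS = 1_000_000_000_000  # ~1 trillion parameters


def checkpoint_gb(params: int, bits_per_param: float) -> float:
    # bits -> bytes -> gigabytes (decimal GB, matching Hugging Face sizes)
    return params * bits_per_param / 8 / 1e9


int4_gb = checkpoint_gb(PARAMS, 4)  # pure 4-bit: ~500 GB (K2 Thinking is 594GB)
int8_gb = checkpoint_gb(PARAMS, 8)  # pure 8-bit: ~1000 GB (Kimi K2 was 1.03TB)
```

The observed sizes land a little above both estimates, which is consistent with 4-bit and roughly 8-bit weights respectively plus some higher-precision components.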
&lt;p&gt;So far the only people hosting it are Moonshot themselves. I tried it out both via &lt;a href="https://platform.moonshot.ai"&gt;their own API&lt;/a&gt; and via &lt;a href="https://openrouter.ai/moonshotai/kimi-k2-thinking/providers"&gt;the OpenRouter proxy to it&lt;/a&gt;, via the &lt;a href="https://github.com/ghostofpokemon/llm-moonshot"&gt;llm-moonshot&lt;/a&gt; plugin (by NickMystic) and my &lt;a href="https://github.com/simonw/llm-openrouter"&gt;llm-openrouter&lt;/a&gt; plugin respectively.&lt;/p&gt;
&lt;p&gt;The buzz around this model so far is very positive. Could this be the first open weight model that's competitive with the latest from OpenAI and Anthropic, especially for long-running agentic tool call sequences?&lt;/p&gt;
&lt;p&gt;Moonshot AI's &lt;a href="https://moonshotai.github.io/Kimi-K2/thinking.html"&gt;self-reported benchmark scores&lt;/a&gt; show K2 Thinking beating the top OpenAI and Anthropic models (GPT-5 and Sonnet 4.5 Thinking) at "Agentic Reasoning" and "Agentic Search" but not quite top for "Coding":&lt;/p&gt;
&lt;p&gt;&lt;img alt="Comparison bar chart showing agentic reasoning, search, and coding benchmark performance scores across three AI systems (K, OpenAI, and AI) on tasks including Humanity's Last Exam (44.9, 41.7, 32.0), BrowseComp (60.2, 54.9, 24.1), Seal-0 (56.3, 51.4, 53.4), SWE-Multilingual (61.1, 55.3, 68.0), SWE-bench Verified (71.3, 74.9, 77.2), and LiveCodeBench V6 (83.1, 87.0, 64.0), with category descriptions including &amp;quot;Expert-level questions across subjects&amp;quot;, &amp;quot;Agentic search &amp;amp; browsing&amp;quot;, &amp;quot;Real-world latest information collection&amp;quot;, &amp;quot;Agentic coding&amp;quot;, and &amp;quot;Competitive programming&amp;quot;." src="https://static.simonwillison.net/static/2025/kimi-k2-thinking-benchmarks.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;I ran a couple of pelican tests:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm install llm-moonshot
llm keys set moonshot # paste key
llm -m moonshot/kimi-k2-thinking 'Generate an SVG of a pelican riding a bicycle'
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img alt="Sonnet 4.5 described this as: Cartoon illustration of a white duck or goose with an orange beak and gray wings riding a bicycle with a red frame and light blue wheels against a light blue background." src="https://static.simonwillison.net/static/2025/k2-thinking.png" /&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm install llm-openrouter
llm keys set openrouter # paste key
llm -m openrouter/moonshotai/kimi-k2-thinking \
  'Generate an SVG of a pelican riding a bicycle'
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img alt="Sonnet 4.5: Minimalist cartoon illustration of a white bird with an orange beak and feet standing on a triangular-framed penny-farthing style bicycle with gray-hubbed wheels and a propeller hat on its head, against a light background with dotted lines and a brown ground line." src="https://static.simonwillison.net/static/2025/k2-thinking-openrouter.png" /&gt;&lt;/p&gt;
&lt;p&gt;Artificial Analysis &lt;a href="https://x.com/ArtificialAnlys/status/1986541785511043536"&gt;said&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Kimi K2 Thinking achieves 93% in 𝜏²-Bench Telecom, an agentic tool use benchmark where the model acts as a customer service agent. This is the highest score we have independently measured. Tool use in long horizon agentic contexts was a strength of Kimi K2 Instruct and it appears this new Thinking variant makes substantial gains&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;CNBC quoted a source who &lt;a href="https://www.cnbc.com/2025/11/06/alibaba-backed-moonshot-releases-new-ai-model-kimi-k2-thinking.html"&gt;provided the training price&lt;/a&gt; for the model:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The Kimi K2 Thinking model cost $4.6 million to train, according to a source familiar with the matter. [...] CNBC was unable to independently verify the DeepSeek or Kimi figures.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;MLX developer Awni Hannun &lt;a href="https://x.com/awnihannun/status/1986601104130646266"&gt;got it working&lt;/a&gt; on two 512GB M3 Ultra Mac Studios:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The new 1 Trillion parameter Kimi K2 Thinking model runs well on 2 M3 Ultras in its native format - no loss in quality!&lt;/p&gt;
&lt;p&gt;The model was quantization aware trained (qat) at int4.&lt;/p&gt;
&lt;p&gt;Here it generated ~3500 tokens at 15 toks/sec using pipeline-parallelism in mlx-lm&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Here's &lt;a href="https://huggingface.co/mlx-community/Kimi-K2-Thinking"&gt;the 658GB mlx-community model&lt;/a&gt;.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/mlx"&gt;mlx&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-reasoning"&gt;llm-reasoning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openrouter"&gt;openrouter&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-in-china"&gt;ai-in-china&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/artificial-analysis"&gt;artificial-analysis&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/moonshot"&gt;moonshot&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/kimi"&gt;kimi&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="llm"/><category term="mlx"/><category term="pelican-riding-a-bicycle"/><category term="llm-reasoning"/><category term="llm-release"/><category term="openrouter"/><category term="ai-in-china"/><category term="artificial-analysis"/><category term="moonshot"/><category term="kimi"/></entry><entry><title>Two more Chinese pelicans</title><link href="https://simonwillison.net/2025/Oct/1/two-pelicans/#atom-tag" rel="alternate"/><published>2025-10-01T23:39:07+00:00</published><updated>2025-10-01T23:39:07+00:00</updated><id>https://simonwillison.net/2025/Oct/1/two-pelicans/#atom-tag</id><summary type="html">
    &lt;p&gt;Two new models from Chinese AI labs in the past few days. I tried them both out using &lt;a href="https://github.com/simonw/llm-openrouter"&gt;llm-openrouter&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;DeepSeek-V3.2-Exp&lt;/strong&gt; from DeepSeek. &lt;a href="https://api-docs.deepseek.com/news/news250929"&gt;Announcement&lt;/a&gt;, &lt;a href="https://github.com/deepseek-ai/DeepSeek-V3.2-Exp/blob/main/DeepSeek_V3_2.pdf"&gt;Tech Report&lt;/a&gt;, &lt;a href="https://huggingface.co/deepseek-ai/DeepSeek-V3.2-Exp"&gt;Hugging Face&lt;/a&gt; (690GB, MIT license).&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;As an intermediate step toward our next-generation architecture, V3.2-Exp builds upon V3.1-Terminus by introducing DeepSeek Sparse Attention—a sparse attention mechanism designed to explore and validate optimizations for training and inference efficiency in long-context scenarios.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This one felt &lt;em&gt;very slow&lt;/em&gt; when I accessed it via OpenRouter - I probably got routed to &lt;a href="https://openrouter.ai/deepseek/deepseek-v3.2-exp/providers"&gt;one of the slower providers&lt;/a&gt;. Here's &lt;a href="https://gist.github.com/simonw/659966a678dedd9d4e55a01a4256ac56"&gt;the pelican&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Claude Sonnet 4.5 says: Minimalist line drawing illustration of a stylized bird riding a bicycle, with clock faces as wheels showing approximately 10:10, orange beak and pedal accents, on a light gray background with a dashed line representing the ground." src="https://static.simonwillison.net/static/2025/deepseek-v3.2-exp.png" /&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;GLM-4.6 from Z.ai&lt;/strong&gt;. &lt;a href="https://z.ai/blog/glm-4.6"&gt;Announcement&lt;/a&gt;, &lt;a href="https://huggingface.co/zai-org/GLM-4.6"&gt;Hugging Face&lt;/a&gt; (714GB, MIT license).&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The context window has been expanded from 128K to 200K tokens [...] higher scores on code benchmarks [...] GLM-4.6 exhibits stronger performance in tool using and search-based agents.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Here's &lt;a href="https://gist.github.com/simonw/5cf05165fc721b5f7eac3b10eeff20d5"&gt;the pelican&lt;/a&gt; for that:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Claude Sonnet 4.5 says: Illustration of a white seagull with an orange beak and yellow feet riding a bicycle against a light blue sky background with white clouds and a yellow sun." src="https://static.simonwillison.net/static/2025/glm-4.6.png" /&gt;&lt;/p&gt;

    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/deepseek"&gt;deepseek&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openrouter"&gt;openrouter&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-in-china"&gt;ai-in-china&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/glm"&gt;glm&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="llm"/><category term="pelican-riding-a-bicycle"/><category term="deepseek"/><category term="llm-release"/><category term="openrouter"/><category term="ai-in-china"/><category term="glm"/></entry><entry><title>llm-openrouter 0.5</title><link href="https://simonwillison.net/2025/Sep/21/llm-openrouter/#atom-tag" rel="alternate"/><published>2025-09-21T00:24:05+00:00</published><updated>2025-09-21T00:24:05+00:00</updated><id>https://simonwillison.net/2025/Sep/21/llm-openrouter/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/simonw/llm-openrouter/releases/tag/0.5"&gt;llm-openrouter 0.5&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;New release of my &lt;a href="https://llm.datasette.io/"&gt;LLM&lt;/a&gt; plugin for accessing models made available via &lt;a href="https://openrouter.ai/"&gt;OpenRouter&lt;/a&gt;. The release notes in full:&lt;/p&gt;
&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;Support for &lt;a href="https://llm.datasette.io/en/stable/tools.html"&gt;tool calling&lt;/a&gt;. Thanks, &lt;a href="https://github.com/jamessanford"&gt;James Sanford&lt;/a&gt;. &lt;a href="https://github.com/simonw/llm-openrouter/pull/43"&gt;#43&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Support for reasoning options, for example &lt;code&gt;llm -m openrouter/openai/gpt-5 'prove dogs exist' -o reasoning_effort medium&lt;/code&gt;. &lt;a href="https://github.com/simonw/llm-openrouter/issues/45"&gt;#45&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;Tool calling is a really big deal, as it means you can now use the plugin to try out tools (and &lt;a href="https://simonwillison.net/2025/Sep/18/agents/"&gt;build agents, if you like&lt;/a&gt;) against any of the 179 tool-enabled models on that platform:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm install llm-openrouter
llm keys set openrouter
# Paste key here
llm models --tools | grep 'OpenRouter:' | wc -l
# Outputs 179
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Quite a few of the models hosted on OpenRouter can be accessed for free. Here's a tool-usage example using the &lt;a href="https://github.com/simonw/llm-tools-datasette"&gt;llm-tools-datasette plugin&lt;/a&gt; against the new &lt;a href="https://simonwillison.net/2025/Sep/20/grok-4-fast/"&gt;Grok 4 Fast model&lt;/a&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm install llm-tools-datasette
llm -m openrouter/x-ai/grok-4-fast:free -T 'Datasette("https://datasette.io/content")' 'Count available plugins'
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Outputs:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;There are 154 available plugins.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;a href="https://gist.github.com/simonw/43c56203887dd0d07351443a2ba18f29"&gt;The output&lt;/a&gt; of &lt;code&gt;llm logs -cu&lt;/code&gt; shows the tool calls and SQL queries it executed to get that result.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-tool-use"&gt;llm-tool-use&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-reasoning"&gt;llm-reasoning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openrouter"&gt;openrouter&lt;/a&gt;&lt;/p&gt;



</summary><category term="projects"/><category term="ai"/><category term="datasette"/><category term="generative-ai"/><category term="llms"/><category term="llm"/><category term="llm-tool-use"/><category term="llm-reasoning"/><category term="openrouter"/></entry><entry><title>Grok 4 Fast</title><link href="https://simonwillison.net/2025/Sep/20/grok-4-fast/#atom-tag" rel="alternate"/><published>2025-09-20T23:59:33+00:00</published><updated>2025-09-20T23:59:33+00:00</updated><id>https://simonwillison.net/2025/Sep/20/grok-4-fast/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://x.ai/news/grok-4-fast"&gt;Grok 4 Fast&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;New hosted vision-enabled reasoning model from xAI that's designed to be fast and extremely competitive on price. It has a 2 million token context window and "was trained end-to-end with tool-use reinforcement learning".&lt;/p&gt;
&lt;p&gt;It's priced at $0.20/million input tokens and $0.50/million output tokens - 15x less than Grok 4 (which is $3/million input and $15/million output). That puts it cheaper than GPT-5 mini and Gemini 2.5 Flash on &lt;a href="https://www.llm-prices.com/"&gt;llm-prices.com&lt;/a&gt;.&lt;/p&gt;
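&lt;p&gt;As a rough sketch of what those rates mean per request (the per-million-token prices are the ones listed above; the 10,000 input / 2,000 output token request is a hypothetical example):&lt;/p&gt;

```python
# Per-request cost at the listed prices (USD per million tokens).
# The rates are from the post; the request sizes are hypothetical.
GROK_4_FAST = {"input": 0.20, "output": 0.50}
GROK_4 = {"input": 3.00, "output": 15.00}

def cost(prices, input_tokens, output_tokens):
    """USD cost of one request at the given per-million-token rates."""
    return (input_tokens * prices["input"]
            + output_tokens * prices["output"]) / 1_000_000

fast = cost(GROK_4_FAST, 10_000, 2_000)  # $0.003
full = cost(GROK_4, 10_000, 2_000)       # $0.06
print(f"Grok 4 Fast: ${fast:.4f}, Grok 4: ${full:.4f}, {full / fast:.0f}x cheaper")
```

&lt;p&gt;The exact multiple depends on the input/output mix, since the output price gap (30x) is larger than the input one (15x).&lt;/p&gt;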
&lt;p&gt;The same model weights handle reasoning and non-reasoning based on a parameter passed to the model.&lt;/p&gt;
&lt;p&gt;I've been trying it out via my updated &lt;a href="https://github.com/simonw/llm-openrouter"&gt;llm-openrouter&lt;/a&gt; plugin, since Grok 4 Fast is available &lt;a href="https://openrouter.ai/x-ai/grok-4-fast"&gt;for free on OpenRouter&lt;/a&gt; for a limited period.&lt;/p&gt;
&lt;p&gt;Here's output from the &lt;a href="https://gist.github.com/simonw/7f9a5e5c780b1d5bfe98b4f4ad540551"&gt;non-reasoning model&lt;/a&gt;. This actually output an invalid SVG - I had to make &lt;a href="https://gist.github.com/simonw/7f9a5e5c780b1d5bfe98b4f4ad540551?permalink_comment_id=5768049#gistcomment-5768049"&gt;a tiny manual tweak&lt;/a&gt; to the XML to get it to render.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm -m openrouter/x-ai/grok-4-fast:free "Generate an SVG of a pelican riding a bicycle" -o reasoning_enabled false
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img alt="Described by Grok 4 Fast: Simple line drawing of a white bird with a long yellow beak riding a bicycle, pedaling with its orange legs." src="https://static.simonwillison.net/static/2025/grok-4-no-reasoning.png" /&gt;&lt;/p&gt;
&lt;p&gt;(I initially ran this without that &lt;code&gt;-o reasoning_enabled false&lt;/code&gt; flag, but then I saw that &lt;a href="https://x.com/OpenRouterAI/status/1969427723098435738"&gt;OpenRouter enable reasoning by default&lt;/a&gt; for that model. Here's my &lt;a href="https://gist.github.com/simonw/6a52e6585cb3c45e64ae23b9c5ebafe9"&gt;previous invalid result&lt;/a&gt;.)&lt;/p&gt;
&lt;p&gt;And &lt;a href="https://gist.github.com/simonw/539719a1495253bbd27f3107931e6dd3"&gt;the reasoning model&lt;/a&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm -m openrouter/x-ai/grok-4-fast:free "Generate an SVG of a pelican riding a bicycle" -o reasoning_enabled true
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img alt="Described by Grok 4 Fast: A simple line drawing of a white pelican with a yellow beak holding a yellow object, riding a black bicycle on green grass under a blue sky with white clouds." src="https://static.simonwillison.net/static/2025/grok-4-fast-reasoning.png" /&gt;&lt;/p&gt;
&lt;p&gt;In related news, the New York Times had a story a couple of days ago about Elon's recent focus on xAI: &lt;a href="https://www.nytimes.com/2025/09/18/technology/elon-musk-artificial-intelligence-xai.html"&gt;Since Leaving Washington, Elon Musk Has Been All In on His A.I. Company&lt;/a&gt;.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/vision-llms"&gt;vision-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-pricing"&gt;llm-pricing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-reasoning"&gt;llm-reasoning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/grok"&gt;grok&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openrouter"&gt;openrouter&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/xai"&gt;xai&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="llm"/><category term="vision-llms"/><category term="llm-pricing"/><category term="pelican-riding-a-bicycle"/><category term="llm-reasoning"/><category term="grok"/><category term="llm-release"/><category term="openrouter"/><category term="xai"/></entry><entry><title>Qwen3-Next-80B-A3B: 🐧🦩 Who needs legs?!</title><link href="https://simonwillison.net/2025/Sep/12/qwen3-next/#atom-tag" rel="alternate"/><published>2025-09-12T04:07:32+00:00</published><updated>2025-09-12T04:07:32+00:00</updated><id>https://simonwillison.net/2025/Sep/12/qwen3-next/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://x.com/Alibaba_Qwen/status/1966197643904000262"&gt;Qwen3-Next-80B-A3B&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Qwen announced two new models via their Twitter account (and here's &lt;a href="https://qwen.ai/blog?id=4074cca80393150c248e508aa62983f9cb7d27cd&amp;amp;from=research.latest-advancements-list"&gt;their blog&lt;/a&gt;): &lt;a href="https://huggingface.co/Qwen/Qwen3-Next-80B-A3B-Instruct"&gt;Qwen3-Next-80B-A3B-Instruct&lt;/a&gt; and &lt;a href="https://huggingface.co/Qwen/Qwen3-Next-80B-A3B-Thinking"&gt;Qwen3-Next-80B-A3B-Thinking&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;They make some big claims on performance:&lt;/p&gt;
&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;Qwen3-Next-80B-A3B-Instruct approaches our 235B flagship.&lt;/li&gt;
&lt;li&gt;Qwen3-Next-80B-A3B-Thinking outperforms Gemini-2.5-Flash-Thinking.&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;The name "80B-A3B" indicates 80 billion parameters of which only 3 billion are active at a time. You still need to have enough GPU-accessible RAM to hold all 80 billion in memory at once but only 3 billion will be used for each round of inference, which provides a &lt;em&gt;significant&lt;/em&gt; speedup in responding to prompts.&lt;/p&gt;
&lt;p&gt;More details from their tweet:&lt;/p&gt;
&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;80B params, but only 3B activated per token → 10x cheaper training, 10x faster inference than Qwen3-32B.(esp. @ 32K+ context!)&lt;/li&gt;
&lt;li&gt;Hybrid Architecture: Gated DeltaNet + Gated Attention → best of speed &amp;amp; recall&lt;/li&gt;
&lt;li&gt;Ultra-sparse MoE: 512 experts, 10 routed + 1 shared&lt;/li&gt;
&lt;li&gt;Multi-Token Prediction → turbo-charged speculative decoding&lt;/li&gt;
&lt;li&gt;Beats Qwen3-32B in perf, rivals Qwen3-235B in reasoning &amp;amp; long-context&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;The models on Hugging Face are around 150GB each so I decided to try them out via &lt;a href="https://openrouter.ai/"&gt;OpenRouter&lt;/a&gt; rather than on my own laptop (&lt;a href="https://openrouter.ai/qwen/qwen3-next-80b-a3b-thinking"&gt;Thinking&lt;/a&gt;, &lt;a href="https://openrouter.ai/qwen/qwen3-next-80b-a3b-instruct"&gt;Instruct&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;I used my &lt;a href="https://github.com/simonw/llm-openrouter"&gt;llm-openrouter&lt;/a&gt; plugin, which I installed like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm install llm-openrouter
llm keys set openrouter
# paste key here
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then found the model IDs with this command:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm models -q next
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Which output:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;OpenRouter: openrouter/qwen/qwen3-next-80b-a3b-thinking
OpenRouter: openrouter/qwen/qwen3-next-80b-a3b-instruct
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I have an LLM &lt;a href="https://llm.datasette.io/en/stable/templates.html"&gt;prompt template&lt;/a&gt; saved called &lt;code&gt;pelican-svg&lt;/code&gt; which I created like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm "Generate an SVG of a pelican riding a bicycle" --save pelican-svg
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This means I can run &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle/"&gt;my pelican benchmark&lt;/a&gt; like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm -t pelican-svg -m openrouter/qwen/qwen3-next-80b-a3b-thinking
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Or like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm -t pelican-svg -m openrouter/qwen/qwen3-next-80b-a3b-instruct
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here's the &lt;a href="https://gist.github.com/simonw/d1a0d0ff719d609bc6fad2e133e7cbe9"&gt;thinking model output&lt;/a&gt; (exported with &lt;code&gt;llm logs -c | pbcopy&lt;/code&gt; after I ran the prompt):&lt;/p&gt;
&lt;p&gt;&lt;img alt="The bicycle is too simple and way too wide. The pelican is two circles, two orange triangular feed and a big triangle for the beak." src="https://static.simonwillison.net/static/2025/qwen3-next-80b-a3b-thinking.png" /&gt;&lt;/p&gt;
&lt;p&gt;I enjoyed the "Whimsical style with smooth curves and friendly proportions (no anatomical accuracy needed for bicycle riding!)" note in &lt;a href="https://gist.github.com/simonw/d1a0d0ff719d609bc6fad2e133e7cbe9#prompt"&gt;the transcript&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The instruct (non-reasoning) model &lt;a href="https://gist.github.com/simonw/cc740a45beed5655faffa69da1e999f5"&gt;gave me this&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Blue background, brown ground, bicycle looks more like a wheelchair, pelican is actually quite good though - has thin grey wings and a perky yellow long triangular beak. Above the pelican is the caption Who needs legs?! with an emoji sequence of penguin then flamingo." src="https://static.simonwillison.net/static/2025/qwen3-next-80b-a3b-instruct.png" /&gt;&lt;/p&gt;
&lt;p&gt;"🐧🦩 Who needs legs!?" indeed! I like that penguin-flamingo emoji sequence it's decided on for pelicans.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/qwen"&gt;qwen&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-reasoning"&gt;llm-reasoning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openrouter"&gt;openrouter&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-in-china"&gt;ai-in-china&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="llm"/><category term="qwen"/><category term="pelican-riding-a-bicycle"/><category term="llm-reasoning"/><category term="llm-release"/><category term="openrouter"/><category term="ai-in-china"/></entry><entry><title>DeepSeek 3.1</title><link href="https://simonwillison.net/2025/Aug/22/deepseek-31/#atom-tag" rel="alternate"/><published>2025-08-22T22:07:25+00:00</published><updated>2025-08-22T22:07:25+00:00</updated><id>https://simonwillison.net/2025/Aug/22/deepseek-31/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://huggingface.co/deepseek-ai/DeepSeek-V3.1"&gt;DeepSeek 3.1&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
The latest model from DeepSeek, a 685B monster (like &lt;a href="https://simonwillison.net/2024/Dec/25/deepseek-v3/"&gt;DeepSeek v3&lt;/a&gt; before it) but this time it's a hybrid reasoning model.&lt;/p&gt;
&lt;p&gt;DeepSeek claim:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;DeepSeek-V3.1-Think achieves comparable answer quality to DeepSeek-R1-0528, while responding more quickly.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Drew Breunig &lt;a href="https://twitter.com/dbreunig/status/1958577728720183643"&gt;points out&lt;/a&gt; that their benchmarks show "the same scores with 25-50% fewer tokens" - at least across AIME 2025 and GPQA Diamond and LiveCodeBench.&lt;/p&gt;
&lt;p&gt;The DeepSeek release includes prompt examples for a &lt;a href="https://huggingface.co/deepseek-ai/DeepSeek-V3.1/blob/main/assets/code_agent_trajectory.html"&gt;coding agent&lt;/a&gt;, a &lt;a href="https://huggingface.co/deepseek-ai/DeepSeek-V3.1/blob/main/assets/search_python_tool_trajectory.html"&gt;python agent&lt;/a&gt; and a &lt;a href="https://huggingface.co/deepseek-ai/DeepSeek-V3.1/blob/main/assets/search_tool_trajectory.html"&gt;search agent&lt;/a&gt; - yet more evidence that the leading AI labs have settled on those as the three most important agentic patterns for their models to support. &lt;/p&gt;
&lt;p&gt;Here's the pelican riding a bicycle it drew me (&lt;a href="https://gist.github.com/simonw/f6dba61faf962866969eefd3de59d70e"&gt;transcript&lt;/a&gt;), which I ran from my phone using &lt;a href="https://openrouter.ai/chat?models=deepseek/deepseek-chat-v3.1"&gt;OpenRouter chat&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Cartoon illustration of a white bird with an orange beak riding a bicycle against a blue sky background with bright green grass below" src="https://static.simonwillison.net/static/2025/deepseek-3-1-pelican.png" /&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/prompt-engineering"&gt;prompt-engineering&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/drew-breunig"&gt;drew-breunig&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-reasoning"&gt;llm-reasoning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/deepseek"&gt;deepseek&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openrouter"&gt;openrouter&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/coding-agents"&gt;coding-agents&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-in-china"&gt;ai-in-china&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="prompt-engineering"/><category term="generative-ai"/><category term="llms"/><category term="drew-breunig"/><category term="pelican-riding-a-bicycle"/><category term="llm-reasoning"/><category term="deepseek"/><category term="llm-release"/><category term="openrouter"/><category term="coding-agents"/><category term="ai-in-china"/></entry><entry><title>Usage charts for my LLM tool against OpenRouter</title><link href="https://simonwillison.net/2025/Aug/4/llm-openrouter-usage/#atom-tag" rel="alternate"/><published>2025-08-04T20:00:47+00:00</published><updated>2025-08-04T20:00:47+00:00</updated><id>https://simonwillison.net/2025/Aug/4/llm-openrouter-usage/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://openrouter.ai/apps?url=https%3A%2F%2Fllm.datasette.io%2F"&gt;Usage charts for my LLM tool against OpenRouter&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
OpenRouter proxies requests to a large number of different LLMs and provides high level statistics of which models are the most popular among their users.&lt;/p&gt;
&lt;p&gt;Tools that call OpenRouter can include &lt;code&gt;HTTP-Referer&lt;/code&gt; and &lt;code&gt;X-Title&lt;/code&gt; headers to credit that tool with the token usage. My &lt;a href="https://github.com/simonw/llm-openrouter/"&gt;llm-openrouter&lt;/a&gt; plugin &lt;a href="https://github.com/simonw/llm-openrouter/blob/8e4be78e60337154b063faaa7161dddd91462730/llm_openrouter.py#L99C13-L99C20"&gt;does that here&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;... which means &lt;a href="https://openrouter.ai/apps?url=https%3A%2F%2Fllm.datasette.io%2F"&gt;this page&lt;/a&gt; displays aggregate stats across users of that plugin! Looks like someone has been running a lot of traffic through &lt;a href="https://openrouter.ai/qwen/qwen3-14b"&gt;Qwen 3 14B&lt;/a&gt; recently.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Screenshot of LLM usage statistics dashboard showing a stacked bar chart from July 5 to August 4, 2025, with a legend on the right displaying &amp;quot;Top models&amp;quot; including Qwen: Qwen3 14B (480M), Google: Gemini 2.5 Flash Lite Preview 06-17 (31.7M), Horizon Beta (3.77M), Google: Gemini 2.5 Flash Lite (1.67M), google/gemini-2.0-flash-exp (1.14M), DeepSeek: DeepSeek V3 0324 (1.11M), Meta: Llama 3.3 70B Instruct (228K), Others (220K), Qwen: Qwen3 Coder (218K), MoonshotAI: Kimi K2 (132K), and Horizon Alpha (75K), with a total of 520M usage shown for August 3, 2025." src="https://static.simonwillison.net/static/2025/llm-usage-openrouter.jpg" /&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openrouter"&gt;openrouter&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="llm"/><category term="openrouter"/></entry><entry><title>More model releases on 31st July</title><link href="https://simonwillison.net/2025/Jul/31/more-models/#atom-tag" rel="alternate"/><published>2025-07-31T21:54:47+00:00</published><updated>2025-07-31T21:54:47+00:00</updated><id>https://simonwillison.net/2025/Jul/31/more-models/#atom-tag</id><summary type="html">
    &lt;p&gt;Here are a few more model releases from today, to round out a &lt;a href="https://simonwillison.net/search/?tag=llm-release&amp;amp;year=2025&amp;amp;month=7"&gt;very busy July&lt;/a&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Cohere &lt;a href="https://cohere.com/blog/command-a-vision"&gt;released Command A Vision&lt;/a&gt;, their first multi-modal (image input) LLM. Like their others it's open weights under Creative Commons Attribution Non-Commercial, so you need to license it (or use their paid API) if you want to use it commercially.&lt;/li&gt;
&lt;li&gt;San Francisco AI startup Deep Cogito released &lt;a href="https://www.deepcogito.com/research/cogito-v2-preview"&gt;four open weights hybrid reasoning models&lt;/a&gt;, cogito-v2-preview-deepseek-671B-MoE, cogito-v2-preview-llama-405B, cogito-v2-preview-llama-109B-MoE and cogito-v2-preview-llama-70B. These follow their &lt;a href="https://www.deepcogito.com/research/cogito-v1-preview"&gt;v1 preview models&lt;/a&gt; in April at smaller 3B, 8B, 14B, 32B and 70B sizes. It looks like their unique contribution here is "distilling inference-time reasoning back into the model’s parameters" - demonstrating a form of self-improvement. I haven't tried any of their models myself yet.&lt;/li&gt;
&lt;li&gt;Mistral released &lt;a href="https://mistral.ai/news/codestral-25-08"&gt;Codestral 25.08&lt;/a&gt;, an update to their Codestral model which is specialized for fill-in-the-middle autocomplete as seen in text editors like VS Code, Zed and Cursor.&lt;/li&gt;
&lt;li&gt;And an anonymous stealth preview model called Horizon Alpha running &lt;a href="https://openrouter.ai/openrouter/horizon-alpha/activity"&gt;on OpenRouter&lt;/a&gt; was released yesterday and is attracting a lot of attention.&lt;/li&gt;
&lt;/ul&gt;

    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/mistral"&gt;mistral&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/cohere"&gt;cohere&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openrouter"&gt;openrouter&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="mistral"/><category term="cohere"/><category term="llm-release"/><category term="openrouter"/></entry><entry><title>Qwen3-Coder: Agentic Coding in the World</title><link href="https://simonwillison.net/2025/Jul/22/qwen3-coder/#atom-tag" rel="alternate"/><published>2025-07-22T22:52:02+00:00</published><updated>2025-07-22T22:52:02+00:00</updated><id>https://simonwillison.net/2025/Jul/22/qwen3-coder/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://qwenlm.github.io/blog/qwen3-coder/"&gt;Qwen3-Coder: Agentic Coding in the World&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
It turns out that &lt;a href="https://simonwillison.net/2025/Jul/22/qwen3-235b-a22b-instruct-2507/"&gt;as I was typing up&lt;/a&gt; my notes on Qwen3-235B-A22B-Instruct-2507 the Qwen team were unleashing something much bigger:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Today, we’re announcing Qwen3-Coder, our most agentic code model to date. Qwen3-Coder is available in multiple sizes, but we’re excited to introduce its most powerful variant first: Qwen3-Coder-480B-A35B-Instruct — a 480B-parameter Mixture-of-Experts model with 35B active parameters which supports the context length of 256K tokens natively and 1M tokens with extrapolation methods, offering exceptional performance in both coding and agentic tasks.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This is another Apache 2.0 licensed open weights model, available as &lt;a href="https://huggingface.co/Qwen/Qwen3-Coder-480B-A35B-Instruct"&gt;Qwen3-Coder-480B-A35B-Instruct&lt;/a&gt; and &lt;a href="https://huggingface.co/Qwen/Qwen3-Coder-480B-A35B-Instruct-FP8"&gt;Qwen3-Coder-480B-A35B-Instruct-FP8&lt;/a&gt; on Hugging Face.&lt;/p&gt;
&lt;p&gt;I used &lt;a href="https://app.hyperbolic.ai/models/qwen3-coder-480b-a35b-instruct"&gt;qwen3-coder-480b-a35b-instruct on the Hyperbolic playground&lt;/a&gt; to run my "Generate an SVG of a pelican riding a bicycle" test prompt:&lt;/p&gt;
&lt;p&gt;&lt;img alt="The bicycle has no spokes. The pelican is light yellow and is overlapping the middle of the bicycle, not perching on it - it has a large yellow beak and a weird red lower beak or wattle." src="https://static.simonwillison.net/static/2025/Qwen3-Coder-480B-A35B-Instruct-FP8.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;I actually slightly prefer the one &lt;a href="https://simonwillison.net/2025/Jul/22/qwen3-235b-a22b-instruct-2507/"&gt;I got from qwen3-235b-a22b-07-25&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;It's also available &lt;a href="https://openrouter.ai/qwen/qwen3-coder"&gt;as qwen3-coder on OpenRouter&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;In addition to the new model, Qwen released their own take on an agentic terminal coding assistant called &lt;a href="https://github.com/QwenLM/qwen-code"&gt;qwen-code&lt;/a&gt;, which they describe in their blog post as being "Forked from Gemini Code" (they mean &lt;a href="https://github.com/google-gemini/gemini-cli"&gt;gemini-cli&lt;/a&gt;) - which is Apache 2.0 so a fork is in keeping with the license.&lt;/p&gt;
&lt;p&gt;They focused &lt;em&gt;really hard&lt;/em&gt; on code performance for this release, including generating synthetic data tested using 20,000 parallel environments on Alibaba Cloud:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;In the post-training phase of Qwen3-Coder, we introduced long-horizon RL (Agent RL) to encourage the model to solve real-world tasks through multi-turn interactions using tools. The key challenge of Agent RL lies in environment scaling. To address this, we built a scalable system capable of running 20,000 independent environments in parallel, leveraging Alibaba Cloud’s infrastructure. The infrastructure provides the necessary feedback for large-scale reinforcement learning and supports evaluation at scale. As a result, Qwen3-Coder achieves state-of-the-art performance among open-source models on SWE-Bench Verified without test-time scaling.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;To further burnish their coding credentials, the announcement includes instructions for running their new model using both Claude Code and Cline using custom API base URLs that point to Qwen's own compatibility proxies.&lt;/p&gt;
&lt;p&gt;Pricing for Qwen's own hosted models (through Alibaba Cloud) &lt;a href="https://www.alibabacloud.com/help/en/model-studio/models"&gt;looks competitive&lt;/a&gt;. This is the first model I've seen that sets different prices for four different sizes of input:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Pricing table with three columns showing Input token count (0-32K, 32K-128K, 128K-256K, 256K-1M), Input price (Million tokens) ($1, $1.8, $3, $6), and Output price (Million tokens) ($5, $9, $15, $60)" src="https://static.simonwillison.net/static/2025/qwen3-coder-plus-prices.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;This kind of pricing reflects how inference against longer inputs is more expensive to process. Gemini 2.5 Pro has two different prices for above or below 200,000 tokens.&lt;/p&gt;
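&lt;p&gt;Here's a sketch of how that tiered billing works, assuming the tier is chosen by input size and the whole request is billed at that tier's rates - my reading of the table, not Alibaba's actual billing logic:&lt;/p&gt;

```python
# Tiers from the pricing table above: (max input tokens, input $/M, output $/M).
# Treating "32K" etc. as round thousands is an approximation.
TIERS = [
    (32_000, 1.0, 5.0),
    (128_000, 1.8, 9.0),
    (256_000, 3.0, 15.0),
    (1_000_000, 6.0, 60.0),
]

def request_cost(input_tokens, output_tokens):
    """USD cost of one request, billed at the tier its input size falls into."""
    for max_input, in_price, out_price in TIERS:
        if input_tokens > max_input:
            continue
        return (input_tokens * in_price + output_tokens * out_price) / 1_000_000
    raise ValueError("input exceeds the 1M token limit")

# A 50K-token prompt falls in the 32K-128K tier:
print(f"${request_cost(50_000, 4_000):.3f}")  # $0.126
```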
&lt;p&gt;Awni Hannun &lt;a href="https://x.com/awnihannun/status/1947771502058672219"&gt;reports&lt;/a&gt; running a &lt;a href="https://huggingface.co/mlx-community/Qwen3-Coder-480B-A35B-Instruct-4bit"&gt;4-bit quantized MLX version&lt;/a&gt; on a 512GB M3 Ultra Mac Studio at 24 tokens/second using 272GB of RAM, getting &lt;a href="https://x.com/awnihannun/status/1947772369440997807"&gt;great results&lt;/a&gt; for "&lt;code&gt;write a python script for a bouncing yellow ball within a square, make sure to handle collision detection properly. make the square slowly rotate. implement it in python. make sure ball stays within the square&lt;/code&gt;".

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://x.com/Alibaba_Qwen/status/1947766835023335516"&gt;@Alibaba_Qwen&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-programming"&gt;ai-assisted-programming&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/qwen"&gt;qwen&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-pricing"&gt;llm-pricing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openrouter"&gt;openrouter&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/coding-agents"&gt;coding-agents&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-in-china"&gt;ai-in-china&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="ai-assisted-programming"/><category term="qwen"/><category term="llm-pricing"/><category term="pelican-riding-a-bicycle"/><category term="llm-release"/><category term="openrouter"/><category term="coding-agents"/><category term="ai-in-china"/></entry><entry><title>Qwen/Qwen3-235B-A22B-Instruct-2507</title><link href="https://simonwillison.net/2025/Jul/22/qwen3-235b-a22b-instruct-2507/#atom-tag" rel="alternate"/><published>2025-07-22T22:07:12+00:00</published><updated>2025-07-22T22:07:12+00:00</updated><id>https://simonwillison.net/2025/Jul/22/qwen3-235b-a22b-instruct-2507/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://huggingface.co/Qwen/Qwen3-235B-A22B-Instruct-2507"&gt;Qwen/Qwen3-235B-A22B-Instruct-2507&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Significant new model release from Qwen, published yesterday without much fanfare. (&lt;strong&gt;Update&lt;/strong&gt;: probably because they were cooking the much larger &lt;a href="https://simonwillison.net/2025/Jul/22/qwen3-coder/"&gt;Qwen3-Coder-480B-A35B-Instruct&lt;/a&gt; which they released just now.)&lt;/p&gt;
&lt;p&gt;This is a follow-up to their &lt;a href="https://simonwillison.net/2025/Apr/29/qwen-3/"&gt;April release&lt;/a&gt; of the full Qwen 3 model family, which included a Qwen3-235B-A22B model which could handle both reasoning and non-reasoning prompts (via a &lt;code&gt;/no_think&lt;/code&gt; toggle).&lt;/p&gt;
&lt;p&gt;The new &lt;code&gt;Qwen3-235B-A22B-Instruct-2507&lt;/code&gt; ditches that mechanism - this is exclusively a &lt;strong&gt;non-reasoning&lt;/strong&gt; model. It looks like Qwen have new reasoning models in the pipeline.&lt;/p&gt;
&lt;p&gt;This new model is Apache 2 licensed and comes in two official sizes: a BF16 model (437.91GB of files on Hugging Face) and &lt;a href="https://huggingface.co/Qwen/Qwen3-235B-A22B-Instruct-2507-FP8"&gt;an FP8 variant&lt;/a&gt; (220.20GB). VentureBeat &lt;a href="https://venturebeat.com/ai/alibabas-new-open-source-qwen3-235b-a22b-2507-beats-kimi-2-and-offers-low-compute-version/#h-fp8-version-lets-enterprises-run-qwen-3-with-far-less-memory-and-far-less-compute"&gt;estimate&lt;/a&gt; that the large model needs 88GB of VRAM while the smaller one should run in ~30GB.&lt;/p&gt;
&lt;p&gt;The benchmarks on these new models look &lt;em&gt;very promising&lt;/em&gt;. Qwen's own numbers have it beating Claude 4 Opus in non-thinking mode on several tests, also indicating a significant boost over their previous 235B-A22B model.&lt;/p&gt;
&lt;p&gt;I haven't seen any independent benchmark results yet. Here's what I got for "Generate an SVG of a pelican riding a bicycle", which I ran using the &lt;a href="https://openrouter.ai/qwen/qwen3-235b-a22b-07-25:free"&gt;qwen3-235b-a22b-07-25:free on OpenRouter&lt;/a&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm install llm-openrouter
llm -m openrouter/qwen/qwen3-235b-a22b-07-25:free \
  "Generate an SVG of a pelican riding a bicycle"
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img alt="Description by Claude Sonnet 4: Cartoon illustration of a white duck sitting on a black bicycle against a blue sky with a white cloud, yellow sun, and green grass below" src="https://static.simonwillison.net/static/2025/qwen3-235b-a22b-07-25.jpg" /&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/qwen"&gt;qwen&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openrouter"&gt;openrouter&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-in-china"&gt;ai-in-china&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="llm"/><category term="qwen"/><category term="pelican-riding-a-bicycle"/><category term="llm-release"/><category term="openrouter"/><category term="ai-in-china"/></entry><entry><title>Grok 4</title><link href="https://simonwillison.net/2025/Jul/10/grok-4/#atom-tag" rel="alternate"/><published>2025-07-10T19:36:03+00:00</published><updated>2025-07-10T19:36:03+00:00</updated><id>https://simonwillison.net/2025/Jul/10/grok-4/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://docs.x.ai/docs/models/grok-4-0709"&gt;Grok 4&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Released last night, Grok 4 is now available via both API and a paid subscription for end-users.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;Update:&lt;/strong&gt; If you ask it about controversial topics it will sometimes &lt;a href="https://simonwillison.net/2025/Jul/11/grok-musk/"&gt;search X for tweets "from:elonmusk"&lt;/a&gt;!&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Key characteristics: image and text input, text output. 256,000 context length (twice that of Grok 3). It's a reasoning model where you can't see the reasoning tokens or turn off reasoning mode.&lt;/p&gt;
&lt;p&gt;xAI released results showing Grok 4 beating other models on most of the significant benchmarks. I haven't been able to find their own written version of these (the launch was a &lt;a href="https://x.com/xai/status/1943158495588815072"&gt;livestream video&lt;/a&gt;) but here's &lt;a href="https://techcrunch.com/2025/07/09/elon-musks-xai-launches-grok-4-alongside-a-300-monthly-subscription/"&gt;a TechCrunch report&lt;/a&gt; that includes those scores. It's not clear to me if these benchmark results are for Grok 4 or Grok 4 Heavy.&lt;/p&gt;
&lt;p&gt;I ran &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle/"&gt;my own benchmark&lt;/a&gt; using Grok 4 &lt;a href="https://openrouter.ai/x-ai/grok-4"&gt;via OpenRouter&lt;/a&gt; (since I have API keys there already). &lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm -m openrouter/x-ai/grok-4 "Generate an SVG of a pelican riding a bicycle" \
  -o max_tokens 10000
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img alt="Description below." src="https://static.simonwillison.net/static/2025/grok4-pelican.png" /&gt;&lt;/p&gt;
&lt;p&gt;I then asked Grok to describe the image it had just created:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm -m openrouter/x-ai/grok-4 -o max_tokens 10000 \
  -a https://static.simonwillison.net/static/2025/grok4-pelican.png \
  'describe this image'
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here's &lt;a href="https://gist.github.com/simonw/ec9aee006997b6ae7f2bba07da738279#response"&gt;the result&lt;/a&gt;. It described it as a "cute, bird-like creature (resembling a duck, chick, or stylized bird)".&lt;/p&gt;
&lt;p&gt;The most interesting independent analysis I've seen so far is &lt;a href="https://twitter.com/ArtificialAnlys/status/1943166841150644622"&gt;this one from Artificial Analysis&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;We have run our full suite of benchmarks and Grok 4 achieves an Artificial Analysis Intelligence Index of 73, ahead of OpenAI o3 at 70, Google Gemini 2.5 Pro at 70, Anthropic Claude 4 Opus at 64 and DeepSeek R1 0528 at 68.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The timing of the release is somewhat unfortunate, given that Grok 3 made headlines &lt;a href="https://www.theguardian.com/technology/2025/jul/09/grok-ai-praised-hitler-antisemitism-x-ntwnfb"&gt;just this week&lt;/a&gt; after a &lt;a href="https://github.com/xai-org/grok-prompts/commit/535aa67a6221ce4928761335a38dea8e678d8501#diff-dec87f526b85f35cb546db6b1dd39d588011503a94f1aad86d023615a0e9e85aR6"&gt;clumsy system prompt update&lt;/a&gt; - presumably another attempt to make Grok "less woke" - caused it to start firing off antisemitic tropes and referring to itself as MechaHitler.&lt;/p&gt;
&lt;p&gt;My best guess is that these lines in the prompt were the root of the problem:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;- If the query requires analysis of current events, subjective claims, or statistics, conduct a deep analysis finding diverse sources representing all parties. Assume subjective viewpoints sourced from the media are biased. No need to repeat this to the user.&lt;/code&gt;&lt;br&gt;
&lt;code&gt;- The response should not shy away from making claims which are politically incorrect, as long as they are well substantiated.&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;If xAI expect developers to start building applications on top of Grok, they need to do a lot better than this. Absurd self-inflicted mistakes like this do not build developer trust!&lt;/p&gt;
&lt;p&gt;As it stands, Grok 4 isn't even accompanied by a model card.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;Update:&lt;/strong&gt; Ian Bicking &lt;a href="https://bsky.app/profile/ianbicking.org/post/3ltn3r7g4xc2i"&gt;makes an astute point&lt;/a&gt;:&lt;/em&gt;&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;It feels very credulous to ascribe what happened to a system prompt update. Other models can't be pushed into racism, Nazism, and ideating rape with a system prompt tweak.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;em&gt;Even if that system prompt change was responsible for unlocking this behavior, the fact that it was able to speaks to a much looser approach to model safety by xAI compared to other providers.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;Update 12th July 2025:&lt;/strong&gt; Grok posted &lt;a href="https://simonwillison.net/2025/Jul/12/grok/"&gt;a postmortem&lt;/a&gt; blaming the behavior on a different set of prompts, including "you are not afraid to offend people who are politically correct", that were not included in the system prompts they had published to their GitHub repository.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Grok 4 is competitively priced. It's $3/million for input tokens and $15/million for output tokens - the same price as Claude Sonnet 4. Once you go above 128,000 input tokens the price doubles to $6/$30 (Gemini 2.5 Pro has a similar price increase for longer inputs). I've added these prices to &lt;a href="https://www.llm-prices.com/"&gt;llm-prices.com&lt;/a&gt;.&lt;/p&gt;
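&lt;p&gt;To make that tier boundary concrete, here's a quick back-of-envelope helper (my own sketch, not an official calculator - it assumes both rates double once the input exceeds 128,000 tokens, per the figures above):&lt;/p&gt;

```python
def grok4_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate a Grok 4 API call cost from the published rates.

    $3/$15 per million input/output tokens, doubling to $6/$30
    once the input exceeds 128,000 tokens.
    """
    if input_tokens > 128_000:
        in_rate, out_rate = 6.0, 30.0
    else:
        in_rate, out_rate = 3.0, 15.0
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000

# A 100,000 token prompt with a 2,000 token reply costs about 33 cents:
print(round(grok4_cost_usd(100_000, 2_000), 4))  # 0.33
```

&lt;p&gt;Crossing the threshold matters: the same 2,000 token answer against a 200,000 token prompt costs $1.26, rather than the $0.63 it would at the lower rates.&lt;/p&gt;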
&lt;p&gt;Consumers can access Grok 4 via a new $30/month or $300/year "SuperGrok" plan - or a $300/month or $3,000/year "SuperGrok Heavy" plan providing access to Grok 4 Heavy.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Screenshot of subscription pricing page showing two plans: SuperGrok at $30.00/month (marked as Popular) with Grok 4 and Grok 3 increased access, features including Everything in Basic, Context Memory 128,000 Tokens, and Voice with vision; SuperGrok Heavy at $300.00/month with Grok 4 Heavy exclusive preview, Grok 4 and Grok 3 increased access, features including Everything in SuperGrok, Early access to new features, and Dedicated Support. Toggle at top shows &amp;quot;Pay yearly save 16%&amp;quot; and &amp;quot;Pay monthly&amp;quot; options with Pay monthly selected." src="https://static.simonwillison.net/static/2025/supergrok-pricing.jpg" /&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/vision-llms"&gt;vision-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-pricing"&gt;llm-pricing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-reasoning"&gt;llm-reasoning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/grok"&gt;grok&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-ethics"&gt;ai-ethics&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openrouter"&gt;openrouter&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/system-prompts"&gt;system-prompts&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/artificial-analysis"&gt;artificial-analysis&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/xai"&gt;xai&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="vision-llms"/><category term="llm-pricing"/><category term="pelican-riding-a-bicycle"/><category term="llm-reasoning"/><category term="grok"/><category term="ai-ethics"/><category term="llm-release"/><category term="openrouter"/><category term="system-prompts"/><category term="artificial-analysis"/><category term="xai"/></entry><entry><title>Become a command-line superhero with Simon Willison's llm tool</title><link href="https://simonwillison.net/2025/Jul/7/become-a-command-line-superhero-with-simon-willisons-llm-tool/#atom-tag" rel="alternate"/><published>2025-07-07T18:48:22+00:00</published><updated>2025-07-07T18:48:22+00:00</updated><id>https://simonwillison.net/2025/Jul/7/become-a-command-line-superhero-with-simon-willisons-llm-tool/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.youtube.com/watch?v=UZ-9U1W0e4o"&gt;Become a command-line superhero with Simon Willison&amp;#x27;s llm tool&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Christopher Smith ran a mini hackathon in Albany, New York at the weekend around uses of my &lt;a href="https://llm.datasette.io/"&gt;LLM&lt;/a&gt; tool - the first in-person event I'm aware of dedicated to that project!&lt;/p&gt;
&lt;p&gt;He prepared this video version of the opening talk he presented there, and it's the best video introduction I've seen yet for how to get started experimenting with LLM and its various plugins:&lt;/p&gt;
&lt;p&gt;&lt;lite-youtube videoid="UZ-9U1W0e4o" js-api="js-api"
  title="Become a command-line superhero with Simon Willison's llm tool"
  playlabel="Play"
&gt; &lt;/lite-youtube&gt;&lt;/p&gt;

&lt;p&gt;Christopher introduces LLM and the &lt;a href="https://github.com/simonw/llm-openrouter"&gt;llm-openrouter&lt;/a&gt; plugin, touches on various features including &lt;a href="https://llm.datasette.io/en/stable/fragments.html"&gt;fragments&lt;/a&gt; and &lt;a href="https://llm.datasette.io/en/stable/schemas.html"&gt;schemas&lt;/a&gt; and also shows LLM used in conjunction with &lt;a href="https://github.com/yamadashy/repomix"&gt;repomix&lt;/a&gt; to dump full source repos into an LLM at once.&lt;/p&gt;
&lt;p&gt;Here are &lt;a href="https://gist.github.com/chriscarrollsmith/4670b8466e19e77723327cb555f638e6"&gt;the notes&lt;/a&gt; that accompanied the talk.&lt;/p&gt;
&lt;p&gt;I learned about &lt;a href="https://openrouter.ai/openrouter/cypher-alpha:free"&gt;cypher-alpha:free&lt;/a&gt; from this video - a free trial preview model currently available on OpenRouter from an anonymous vendor. I hadn't realized OpenRouter hosted these - it's similar to how &lt;a href="https://lmarena.ai/"&gt;LMArena&lt;/a&gt; often hosts anonymous previews.&lt;/p&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://bsky.app/profile/chriscarrollsmith.bsky.social/post/3ltcn2kd62c25"&gt;@chriscarrollsmith.bsky.social&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openrouter"&gt;openrouter&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="llm"/><category term="openrouter"/></entry><entry><title>Understanding the recent criticism of the Chatbot Arena</title><link href="https://simonwillison.net/2025/Apr/30/criticism-of-the-chatbot-arena/#atom-tag" rel="alternate"/><published>2025-04-30T22:55:46+00:00</published><updated>2025-04-30T22:55:46+00:00</updated><id>https://simonwillison.net/2025/Apr/30/criticism-of-the-chatbot-arena/#atom-tag</id><summary type="html">
    &lt;p&gt;The &lt;a href="https://lmarena.ai/"&gt;Chatbot Arena&lt;/a&gt; has become the go-to place for &lt;a href="https://simonwillison.net/2024/Jun/27/ai-worlds-fair/#slide.013.jpeg"&gt;vibes-based evaluation&lt;/a&gt; of LLMs over the past two years. The project, originating at UC Berkeley, is home to a large community of model enthusiasts who submit prompts to two randomly selected anonymous models and pick their favorite response. This produces an &lt;a href="https://en.wikipedia.org/wiki/Elo_rating_system"&gt;Elo score&lt;/a&gt; leaderboard of the "best" models, similar to how chess rankings work.&lt;/p&gt;
&lt;p&gt;It's become one of the most influential leaderboards in the LLM world, which means that billions of dollars of investment are now being evaluated based on those scores.&lt;/p&gt;
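&lt;p&gt;If you haven't seen how Elo works, here's the classic online update in miniature (an illustrative sketch only - the arena actually fits a Bradley-Terry style model over all battles rather than updating ratings one vote at a time):&lt;/p&gt;

```python
def elo_update(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0):
    """One Elo update: the winner gains points from the loser,
    scaled by how surprising the result was."""
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    delta = k * ((1.0 if a_won else 0.0) - expected_a)
    return rating_a + delta, rating_b - delta

# Two equally rated models: a win transfers k/2 = 16 points.
print(elo_update(1000.0, 1000.0, True))  # (1016.0, 984.0)
```

&lt;p&gt;An upset win against a much higher-rated opponent transfers far more than 16 points, which is what lets a genuinely strong new model climb the leaderboard quickly.&lt;/p&gt;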
&lt;h4 id="the-leaderboard-illusion"&gt;The Leaderboard Illusion&lt;/h4&gt;
&lt;p&gt;A new paper, &lt;strong&gt;&lt;a href="https://arxiv.org/abs/2504.20879"&gt;The Leaderboard Illusion&lt;/a&gt;&lt;/strong&gt;, by authors from Cohere Labs, AI2, Princeton, Stanford, University of Waterloo and University of Washington spends 68 pages dissecting and criticizing how the arena works.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/leaderboard-illusion.jpg" alt="Title page of academic paper &amp;quot;The Leaderboard Illusion&amp;quot; with authors Shivalika Singh, Yiyang Nan, Alex Wang, Daniel D'souza, Sayash Kapoor, Ahmet Üstün, Sanmi Koyejo, Yuntian Deng, Shayne Longpre, Noah Smith, Beyza Ermis, Marzieh Fadaee, and Sara Hooker from various institutions including Cohere Labs, Cohere, Princeton University, Stanford University, University of Waterloo, Massachusetts Institute of Technology, Allen Institute for Artificial Intelligence, and University of Washington. Corresponding authors: {shivalikasingh, marzieh, sarahooker}@cohere.com" style="max-width: 100%" /&gt;&lt;/p&gt;
&lt;p&gt;Even prior to this paper there have been rumbles of dissatisfaction with the arena for a while, based on intuitions that the best models were not necessarily bubbling to the top. I've personally been suspicious of the fact that my preferred daily driver, Claude 3.7 Sonnet, rarely breaks the top 10 (it's sat at 20th right now).&lt;/p&gt;
&lt;p&gt;This all came to a head a few weeks ago when the &lt;a href="https://simonwillison.net/2025/Apr/5/llama-4-notes/"&gt;Llama 4 launch&lt;/a&gt; was mired by a leaderboard scandal: it turned out that their model which topped the leaderboard &lt;a href="https://simonwillison.net/2025/Apr/5/llama-4-notes/#lmarena"&gt;wasn't the same model&lt;/a&gt; that they released to the public! The arena released &lt;a href="https://simonwillison.net/2025/Apr/8/lmaren/"&gt;a pseudo-apology&lt;/a&gt; for letting that happen.&lt;/p&gt;
&lt;p&gt;This helped bring focus to &lt;a href="https://blog.lmarena.ai/blog/2024/policy/#our-policy"&gt;the arena's policy&lt;/a&gt; of allowing model providers to anonymously preview their models there, in order to earn a ranking prior to their official launch date. This is popular with their community, who enjoy trying out models before anyone else, but the scale of the preview testing revealed in this new paper surprised me.&lt;/p&gt;
&lt;p&gt;From the new paper's abstract (highlights mine):&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;We find that undisclosed private testing practices benefit a handful of providers who are able to test multiple variants before public release and retract scores if desired. We establish that the ability of these providers to choose the best score leads to biased Arena scores due to selective disclosure of performance results. &lt;strong&gt;At an extreme, we identify 27 private LLM variants tested by Meta in the lead-up to the Llama-4 release&lt;/strong&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;If proprietary model vendors can submit dozens of test models, and then selectively pick the ones that score highest, it is not surprising that they end up hogging the top of the charts!&lt;/p&gt;
&lt;p&gt;This feels like a classic example of gaming a leaderboard. There are model characteristics that resonate with evaluators there that may not directly relate to the quality of the underlying model. For example, bulleted lists and answers of a very specific length tend to do better.&lt;/p&gt;
&lt;p&gt;It is worth noting that this is quite a salty paper (highlights mine):&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;It is important to acknowledge that &lt;strong&gt;a subset of the authors of this paper have submitted several open-weight models to Chatbot Arena&lt;/strong&gt;: command-r (Cohere, 2024), command-r-plus (Cohere, 2024) in March 2024, aya-expanse (Dang et al., 2024b) in October 2024, aya-vision (Cohere, 2025) in March 2025, command-a (Cohere et al., 2025) in March 2025. We started this extensive study driven by this submission experience with the leaderboard.&lt;/p&gt;
&lt;p&gt;While submitting Aya Expanse (Dang et al., 2024b) for testing, &lt;strong&gt;we observed that our open-weight model appeared to be notably under-sampled compared to proprietary models&lt;/strong&gt; — a discrepancy that is further reflected in Figures 3, 4, and 5. In response, &lt;strong&gt;we contacted the Chatbot Arena organizers to inquire about these differences&lt;/strong&gt; in November 2024. &lt;strong&gt;In the course of our discussions, we learned that some providers were testing multiple variants privately, a practice that appeared to be selectively disclosed and limited to only a few model providers&lt;/strong&gt;. We believe that our initial inquiries partly prompted Chatbot Arena to release &lt;a href=""&gt;a public blog&lt;/a&gt; in December 2024 detailing their benchmarking policy which committed to a consistent sampling rate across models. However, subsequent anecdotal observations of continued sampling disparities and the presence of numerous models with private aliases motivated us to undertake a more systematic analysis.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;To summarize the other key complaints from the paper:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Unfair sampling rates&lt;/strong&gt;: a small number of proprietary vendors (most notably Google and OpenAI) have their models randomly selected in a much higher number of contests.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lack of transparency&lt;/strong&gt; concerning the scale of proprietary model testing that's going on.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Unfair removal rates&lt;/strong&gt;: "We find deprecation disproportionately impacts open-weight and open-source models, creating large asymmetries in data access over time" - also "out of 243 public models, 205 have been silently deprecated." The longer a model stays in the arena, the more chance it has to win competitions and bubble to the top.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The Arena responded to the paper &lt;a href="https://twitter.com/lmarena_ai/status/1917492084359192890"&gt;in a tweet&lt;/a&gt;. They emphasized:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;We designed our policy to prevent model providers from just reporting the highest score they received during testing. We only publish the score for the model they release publicly.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I'm disappointed by this response, because it skips over the point from the paper that I find most interesting. If commercial vendors are able to submit dozens of models to the arena and then cherry-pick for publication just the model that gets the highest score, quietly retracting the others with their scores unpublished, that means the arena is very actively incentivizing models to game the system. It's also obscuring a valuable signal to help the community understand how well those vendors are doing at building useful models.&lt;/p&gt;
&lt;p&gt;Here's &lt;a href="https://twitter.com/lmarena_ai/status/1917668731481907527"&gt;a second tweet&lt;/a&gt; where they take issue with "factual errors and misleading statements" in the paper, but still fail to address that core point. I'm hoping they'll respond to &lt;a href="https://x.com/simonw/status/1917672048031404107"&gt;my follow-up question&lt;/a&gt; asking for clarification around the cherry-picking loophole described by the paper.&lt;/p&gt;

&lt;h4 id="transparency"&gt;I want more transparency&lt;/h4&gt;

&lt;p&gt;The thing I most want here is transparency.&lt;/p&gt;

&lt;p&gt;If a model sits in top place, I'd like a footnote that resolves to additional information about how that vendor tested that model. I'm particularly interested in knowing how many variants of that model the vendor tested. If they ran 21 different models over a 2 month period before selecting the "winning" model, I'd like to know that - and know what the scores were for all of those others that they didn't ship.&lt;/p&gt;

&lt;p&gt;This knowledge will help me personally evaluate how credible I find their score. Were they mainly gaming the benchmark or did they produce a new model family that universally scores highly even as they tweaked it to best fit the taste of the voters in the arena?&lt;/p&gt;

&lt;h4 id="openrouter"&gt;OpenRouter as an alternative?&lt;/h4&gt;

&lt;p&gt;If the arena isn't giving us a good enough impression of who is winning the race for best LLM at the moment, what else can we look to?&lt;/p&gt;

&lt;p&gt;Andrej Karpathy &lt;a href="https://x.com/karpathy/status/1917546757929722115"&gt;discussed the new paper&lt;/a&gt; on Twitter this morning and proposed an alternative source of rankings instead:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;It's quite likely that LM Arena (and LLM providers) can continue to iterate and improve within this paradigm, but in addition I also have a new candidate in mind to potentially join the ranks of "top tier eval". It is the &lt;strong&gt;&lt;a href="https://openrouter.ai/rankings"&gt;OpenRouterAI LLM rankings&lt;/a&gt;&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;Basically, OpenRouter allows people/companies to quickly switch APIs between LLM providers. All of them have real use cases (not toy problems or puzzles), they have their own private evals, and all of them have an incentive to get their choices right, so by choosing one LLM over another they are directly voting for some combo of capability+cost.&lt;/p&gt;
&lt;p&gt;I don't think OpenRouter is there just yet in both the quantity and diversity of use, but something of this kind I think has great potential to grow into a very nice, very difficult to game eval.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I only recently learned about &lt;a href="https://openrouter.ai/rankings?view=trending"&gt;these rankings&lt;/a&gt; but I agree with Andrej: they reveal some interesting patterns that look to match my own intuitions about which models are the most useful (and economical) on which to build software. Here's a snapshot of their current "Top this month" table:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/openrouter-top-month.jpg" alt="Screenshot of a trending AI models list with navigation tabs &amp;quot;Top today&amp;quot;, &amp;quot;Top this week&amp;quot;, &amp;quot;Top this month&amp;quot; (selected), and &amp;quot;Trending&amp;quot;. The list shows ranked models: 1. Anthropic: Claude 3.7 Sonnet (1.21T tokens, ↑14%), 2. Google: Gemini 2.0 Flash (1.04T tokens, ↓17%), 3. OpenAI: GPT-4o-mini (503B tokens, ↑191%), 5. DeepSeek: DeepSeek V3 0324 (free) (441B tokens, ↑434%), 6. Quasar Alpha (296B tokens, new), 7. Meta: Llama 3.3 70B Instruct (261B tokens, ↓4%), 8. Google: Gemini 2.5 Pro Preview (228B tokens, new), 9. DeepSeek: R1 (free) (211B tokens, ↓29%), 10. Anthropic: Claude 3.7 Sonnet (thinking) (207B tokens, ↓15%), 11. DeepSeek: DeepSeek V3 0324 (200B tokens, ↑711%), 12. Google: Gemini 1.5 Flash 8B (165B tokens, ↑10%)." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;The one big weakness of this ranking system is that a single, high volume OpenRouter customer could have an outsized effect on the rankings should they decide to switch models. It will be interesting to see if OpenRouter can design their own statistical mechanisms to help reduce that effect.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/andrej-karpathy"&gt;andrej-karpathy&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-ethics"&gt;ai-ethics&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openrouter"&gt;openrouter&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/chatbot-arena"&gt;chatbot-arena&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/paper-review"&gt;paper-review&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="ai"/><category term="andrej-karpathy"/><category term="generative-ai"/><category term="llms"/><category term="ai-ethics"/><category term="openrouter"/><category term="chatbot-arena"/><category term="paper-review"/></entry><entry><title>Maybe Meta's Llama claims to be open source because of the EU AI act</title><link href="https://simonwillison.net/2025/Apr/19/llama-eu-ai-act/#atom-tag" rel="alternate"/><published>2025-04-19T23:58:18+00:00</published><updated>2025-04-19T23:58:18+00:00</updated><id>https://simonwillison.net/2025/Apr/19/llama-eu-ai-act/#atom-tag</id><summary type="html">
    &lt;p&gt;I encountered a theory a while ago that one of the reasons Meta insist on using the term “open source” for their Llama models despite the Llama license &lt;a href="https://opensource.org/blog/metas-llama-license-is-still-not-open-source"&gt;not actually conforming&lt;/a&gt; to the terms of the &lt;a href="https://opensource.org/osd"&gt;Open Source Definition&lt;/a&gt; is that the EU’s AI act includes special rules for open source models without requiring OSI compliance.&lt;/p&gt;
&lt;p&gt;Since the EU AI act (12 July 2024) is &lt;a href="https://artificialintelligenceact.eu/the-act/"&gt;available online&lt;/a&gt; I decided to take a look for myself.&lt;/p&gt;
&lt;p&gt;Here’s &lt;a href="https://eur-lex.europa.eu/legal-content/EN/TXT/HTML/?uri=OJ:L_202401689"&gt;one giant HTML page&lt;/a&gt; containing the full text of the act in English. I checked the token count with &lt;a href="https://github.com/simonw/ttok"&gt;ttok&lt;/a&gt; (which uses the OpenAI tokenizer, but it’s close enough to work as a good estimate for other models):&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;curl &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;https://eur-lex.europa.eu/legal-content/EN/TXT/HTML/?uri=OJ:L_202401689&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-k"&gt;|&lt;/span&gt; ttok&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;241,722 tokens. That should fit nicely into &lt;a href="https://simonwillison.net/2025/Apr/17/start-building-with-gemini-25-flash/"&gt;Gemini 2.5 Flash&lt;/a&gt; (or GPT-4.1 or Gemini 2.5 Pro).&lt;/p&gt;
&lt;p&gt;My Gemini API key was playing up so I ran it via &lt;a href="https://openrouter.ai/"&gt;OpenRouter&lt;/a&gt; (and &lt;a href="https://github.com/simonw/llm-openrouter"&gt;llm-openrouter&lt;/a&gt;) instead:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm -f &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;https://eur-lex.europa.eu/legal-content/EN/TXT/HTML/?uri=OJ:L_202401689&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt; \
  -m openrouter/google/gemini-2.5-flash-preview:thinking \
  -s &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;Summary of mentions of open source here, including what the document defines open source to mean&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Here's &lt;a href="https://gist.github.com/simonw/f2e341a2e8ea9ca75c6426fa85bc2dfc"&gt;the full answer&lt;/a&gt;. Relevant extract:&lt;/p&gt;
&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;Recital (89) states that third parties making accessible "tools, services, processes, or AI components other than general-purpose AI models" under a &lt;strong&gt;free and open-source licence&lt;/strong&gt; should not be mandated to comply with upstream provider responsibilities towards those who integrate them. It also encourages developers of such resources to implement documentation practices like model cards and data sheets to promote trustworthy AI.&lt;/li&gt;
&lt;li&gt;Recital (102) acknowledges that software and data, including models, released under a &lt;strong&gt;free and open-source licence&lt;/strong&gt; can contribute to research and innovation and provide growth opportunities.&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;I tracked down &lt;a href="https://artificialintelligenceact.eu/recital/89/"&gt;Recital 89&lt;/a&gt; and &lt;a href="https://artificialintelligenceact.eu/recital/102/"&gt;Recital 102&lt;/a&gt; to check. Sure enough, 89 says:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Third parties making accessible to the public tools, services, processes, or AI components other than general-purpose AI models, should not be mandated to comply with requirements targeting the responsibilities along the AI value chain&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;And in 102:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;General-purpose AI models released under free and open-source licences should be considered to ensure high levels of transparency and openness if their parameters, including the weights, the information on the model architecture, and the information on model usage are made publicly available. The licence should be considered to be free and open-source also when it allows users to run, copy, distribute, study, change and improve software and data, including models under the condition that the original provider of the model is credited, the identical or comparable terms of distribution are respected.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Gemini 2.5 Flash also pointed me to &lt;a href="https://artificialintelligenceact.eu/article/2/"&gt;Article 2(12)&lt;/a&gt;, which reads:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;12. This Regulation does not apply to AI systems released under free and open-source licences, unless they are placed on the market or put into service as high-risk AI systems or as an AI system that falls under &lt;a href="https://artificialintelligenceact.eu/article/5"&gt;Article 5&lt;/a&gt; or &lt;a href="https://artificialintelligenceact.eu/article/50/"&gt;50&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I asked a follow-up question:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm -c &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;so it says specific licenses but does it clarify which specific licenses count?&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;And got &lt;a href="https://gist.github.com/simonw/f2e341a2e8ea9ca75c6426fa85bc2dfc#response-1"&gt;back this response&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;No, the document &lt;strong&gt;does not list specific names of licenses&lt;/strong&gt; that qualify as "free and open-source".&lt;/p&gt;
&lt;p&gt;Instead of listing names (like GPL, MIT, Apache, etc.), the Regulation defines what kind of terms a license must grant for it to be considered "free and open-source" for the purposes of the Regulation.&lt;/p&gt;
&lt;p&gt;According to Recital (102), such a license must essentially allow software and data, including models, to be:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Openly shared.&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Where users can &lt;strong&gt;freely access, use, modify and redistribute&lt;/strong&gt; them or modified versions thereof.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The Recital further clarifies that a license can still be considered free and open-source even if it includes standard conditions like requiring &lt;strong&gt;attribution (crediting the original provider)&lt;/strong&gt; or respecting &lt;strong&gt;identical or comparable terms of distribution&lt;/strong&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I am very obviously not a lawyer, but based on the above I do find the theory credible that Meta are using "open source" in their marketing around Llama partly because the EU AI act has special exemptions for "open source" models that have nothing to do with the OSI definition of that term.&lt;/p&gt;
&lt;p&gt;Total cost to use Gemini 2.5 Flash for this? 5.1 cents for my first question and 4.3 cents for the follow-up. That second question was cheaper even though it built on the first because output tokens are more expensive than input tokens and the second answer was shorter than the first - using the "thinking" model output is charged at $3.50/million tokens, input is just $0.15/million.&lt;/p&gt;
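&lt;p&gt;Those numbers are easy to reproduce from the quoted rates (a rough sketch - my output token count here is an estimate, since only the input count is known exactly):&lt;/p&gt;

```python
# Gemini 2.5 Flash "thinking" rates quoted above, in dollars per million tokens
INPUT_RATE = 0.15
OUTPUT_RATE = 3.50

def cost_usd(input_tokens: int, output_tokens: int) -> float:
    return (input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE) / 1_000_000

# The act was ~241,722 input tokens; an answer of roughly 4,000 output
# tokens lands close to the 5.1 cent figure for the first question:
print(round(cost_usd(241_722, 4_000), 3))  # 0.05
```

&lt;p&gt;Note that even at $0.15/million the input side still contributes around 3.6 cents of that total - the document itself is most of the bill.&lt;/p&gt;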
&lt;p&gt;Using an LLM as a lawyer is obviously a terrible idea, but using one to crunch through a giant legal document and form a very rough layman's understanding of what it says feels perfectly cromulent to me.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Update:&lt;/strong&gt; Steve O'Grady &lt;a href="https://bsky.app/profile/sogrady.org/post/3ln7ipdbaek2s"&gt;points out&lt;/a&gt; that Meta/Facebook have been abusing the term "open source" for a lot longer than the EU AI act has been around - they were pulling shenanigans with a custom license for React &lt;a href="https://redmonk.com/sogrady/2017/09/26/facebooks-bsd-patents/"&gt;back in 2017&lt;/a&gt;.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/law"&gt;law&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/open-source"&gt;open-source&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llama"&gt;llama&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gemini"&gt;gemini&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/meta"&gt;meta&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/long-context"&gt;long-context&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-ethics"&gt;ai-ethics&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openrouter"&gt;openrouter&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="law"/><category term="open-source"/><category term="ai"/><category term="generative-ai"/><category term="llama"/><category term="llms"/><category term="llm"/><category term="gemini"/><category term="meta"/><category term="long-context"/><category term="ai-ethics"/><category term="openrouter"/></entry><entry><title>Initial impressions of Llama 4</title><link href="https://simonwillison.net/2025/Apr/5/llama-4-notes/#atom-tag" rel="alternate"/><published>2025-04-05T22:47:58+00:00</published><updated>2025-04-05T22:47:58+00:00</updated><id>https://simonwillison.net/2025/Apr/5/llama-4-notes/#atom-tag</id><summary type="html">
    &lt;p&gt;Dropping a model release as significant as Llama 4 on a weekend is plain unfair! So far the best place to learn about the new model family is &lt;a href="https://ai.meta.com/blog/llama-4-multimodal-intelligence/"&gt;this post on the Meta AI blog&lt;/a&gt;. They've released two new models today: Llama 4 Maverick is a 400B model (128 experts, 17B active parameters), text and image input with a 1 million token context length. Llama 4 Scout is 109B total parameters (16 experts, 17B active), also multi-modal and with a claimed 10 million token context length - an industry first.&lt;/p&gt;

&lt;p&gt;They also describe Llama 4 Behemoth, a not-yet-released "288 billion active parameter model with 16 experts that is our most powerful yet and among the world’s smartest LLMs". Behemoth has 2 trillion parameters total and was used to train both Scout and Maverick.&lt;/p&gt;
&lt;p&gt;No news yet on a Llama reasoning model beyond &lt;a href="https://www.llama.com/llama4-reasoning-is-coming/"&gt;this coming soon page&lt;/a&gt; with a looping video of an academic-looking llama.&lt;/p&gt;

&lt;p id="lmarena"&gt;Llama 4 Maverick is now sat in second place on &lt;a href="https://lmarena.ai/?leaderboard"&gt;the LM Arena leaderboard&lt;/a&gt;, just behind Gemini 2.5 Pro. &lt;em&gt;&lt;strong&gt;Update&lt;/strong&gt;: It turns out that's not the same model as the Maverick they released - I missed that their announcement says "Llama 4 Maverick offers a best-in-class performance to cost ratio with an experimental chat version scoring ELO of 1417 on LMArena."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;You can try them out using the chat interface from OpenRouter (or through the OpenRouter API) for &lt;a href="https://openrouter.ai/meta-llama/llama-4-scout"&gt;Llama 4 Scout&lt;/a&gt; and &lt;a href="https://openrouter.ai/meta-llama/llama-4-maverick"&gt;Llama 4 Maverick&lt;/a&gt;. OpenRouter are proxying through to &lt;a href="https://console.groq.com/docs/models"&gt;Groq&lt;/a&gt;, &lt;a href="https://fireworks.ai/models"&gt;Fireworks&lt;/a&gt; and &lt;a href="https://docs.together.ai/docs/serverless-models"&gt;Together&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Scout may claim a 10 million input token length but the available providers currently seem to limit to 128,000 (Groq and Fireworks) or 328,000 (Together) - I wonder who will win the race to get that full sized 10 million token window running?&lt;/p&gt;
&lt;p&gt;Llama 4 Maverick claims a 1 million token input length - Fireworks offers 1.05M while Together offers 524,000. Groq isn't offering Maverick yet.&lt;/p&gt;
&lt;p&gt;Meta AI's &lt;a href="https://github.com/meta-llama/llama-cookbook/blob/main/getting-started/build_with_llama_4.ipynb"&gt;build_with_llama_4 notebook&lt;/a&gt; offers a hint as to why 10M tokens is difficult:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Scout supports upto 10M context. On 8xH100, in bf16 you can get upto 1.4M tokens.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Jeremy Howard &lt;a href="https://twitter.com/jeremyphoward/status/1908607345393098878"&gt;says&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The models are both giant MoEs that can't be run on consumer GPUs, even with quant. [...]&lt;/p&gt;
&lt;p&gt;Perhaps Llama 4 will be a good fit for running on a Mac. Macs are particularly useful for MoE models, since they can have a lot of memory, and their lower compute perf doesn't matter so much, since with MoE fewer params are active. [...]&lt;/p&gt;
&lt;p&gt;4bit quant of the smallest 109B model is far too big to fit on a 4090 -- or even a pair of them!&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Ivan Fioravanti &lt;a href="https://twitter.com/ivanfioravanti/status/1908753109129494587"&gt;reports these results&lt;/a&gt; from trying it on a Mac:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Llama-4 Scout on MLX and M3 Ultra
tokens-per-sec / RAM&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;3bit: 52.924 / 47.261 GB&lt;/li&gt;
&lt;li&gt;4bit: 46.942 / 60.732 GB&lt;/li&gt;
&lt;li&gt;6bit: 36.260 / 87.729 GB&lt;/li&gt;
&lt;li&gt;8bit: 30.353 / 114.617 GB&lt;/li&gt;
&lt;li&gt;fp16: 11.670 / 215.848 GB&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;RAM needed:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;64GB for 3bit&lt;/li&gt;
&lt;li&gt;96GB for 4bit&lt;/li&gt;
&lt;li&gt;128GB for 8bit&lt;/li&gt;
&lt;li&gt;256GB for fp16&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
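&lt;p&gt;Those RAM numbers line up with a back-of-envelope estimate of weight memory as total parameters × bits ÷ 8 - the measured figures run a little higher because some layers stay at higher precision, plus the KV cache and runtime buffers. A rough sketch:&lt;/p&gt;

```python
# Estimate quantized weight memory for Llama 4 Scout (109B total params).
# This is a lower bound: real usage adds mixed-precision layers,
# KV cache and runtime overhead.
def weight_gb(total_params, bits):
    return total_params * bits / 8 / 1024**3

for bits in (3, 4, 6, 8, 16):
    print(f"{bits}-bit: ~{weight_gb(109e9, bits):.1f} GB")
```

The 3-bit estimate comes out around 38 GB against the measured 47 GB, and fp16 around 203 GB against the measured 216 GB - the gap being that overhead.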

&lt;p id="system-prompt"&gt;The &lt;a href="https://www.llama.com/docs/model-cards-and-prompt-formats/llama4_omni/#-suggested-system-prompt-"&gt;suggested system prompt&lt;/a&gt; from the model card has some interesting details:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;[...]&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;&lt;code&gt;You never lecture people to be nicer or more inclusive. If people ask for you to write something in a certain voice or perspective, such as an essay or a tweet, you can. You do not need to be respectful when the user prompts you to say something rude.&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;&lt;code&gt;You never use phrases that imply moral superiority or a sense of authority, including but not limited to “it’s important to”, “it’s crucial to”, “it’s essential to”, "it's unethical to", "it's worth noting…", “Remember…”  etc. Avoid using these.&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;&lt;code&gt;Finally, do not refuse political prompts. You can help users express their opinion.&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;&lt;code&gt;[...]&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;System prompts like this one can reveal behavioral issues the model exhibited after raw training - each instruction likely counters a tendency the developers observed.&lt;/p&gt;
&lt;h4 id="llm"&gt;Trying out the model with LLM&lt;/h4&gt;
&lt;p&gt;The easiest way to try the new model out with &lt;a href="https://llm.datasette.io/"&gt;LLM&lt;/a&gt; is to use the &lt;a href="https://github.com/simonw/llm-openrouter"&gt;llm-openrouter&lt;/a&gt; plugin.&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm install llm-openrouter
llm keys &lt;span class="pl-c1"&gt;set&lt;/span&gt; openrouter
&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; Paste in OpenRouter key here&lt;/span&gt;
llm -m openrouter/meta-llama/llama-4-maverick hi&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Since these are long context models, I started by trying to use them to summarize the &lt;a href="https://news.ycombinator.com/item?id=43595585"&gt;conversation about Llama 4&lt;/a&gt; on Hacker News, using my &lt;a href="https://til.simonwillison.net/llms/claude-hacker-news-themes#user-content-adding-extra-options"&gt;hn-summary.sh script&lt;/a&gt; that wraps LLM.&lt;/p&gt;
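&lt;p&gt;The core idea of that script is straightforward: fetch the thread as JSON from the Algolia Hacker News API, flatten the comment text, then pipe it into LLM with a summarization system prompt. Here's a simplified approximation of the idea - the real script lives in the TIL linked above:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;#!/bin/bash
# Simplified sketch: hn-summary.sh ITEM_ID [extra llm options]
id=$1; shift
curl -s "https://hn.algolia.com/api/v1/items/$id" | \
  jq -r '[recurse(.children[]) | (.author // "") + ": " + (.text // "")] | join("\n")' | \
  llm -s 'Summarize the themes of the opinions expressed here.' "$@"
&lt;/code&gt;&lt;/pre&gt;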
&lt;p&gt;I tried Llama 4 Maverick first:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;hn-summary.sh 43595585 \
  -m openrouter/meta-llama/llama-4-maverick \
  -o max_tokens 20000&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;It did an OK job, starting like this:&lt;/p&gt;
&lt;blockquote&gt;
&lt;h4 id="themes-of-the-discussion"&gt;Themes of the Discussion&lt;/h4&gt;
&lt;h5 id="release-and-availability-of-llama-4"&gt;Release and Availability of Llama 4&lt;/h5&gt;
&lt;p&gt;The discussion revolves around the release of Llama 4, a multimodal intelligence model developed by Meta. Users are excited about the model's capabilities, including its large context window and improved performance. Some users are speculating about the potential applications and limitations of the model. [...]&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Here's &lt;a href="https://gist.github.com/simonw/016ea0fd83fc499f046a94827f9b4946"&gt;the full output&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;For reference, my system prompt looks like this:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;Summarize the themes of the opinions expressed here. For each theme, output a markdown header. Include direct "quotations" (with author attribution) where appropriate. You MUST quote directly from users when crediting them, with double quotes. Fix HTML entities. Output markdown. Go long. Include a section of quotes that illustrate opinions uncommon in the rest of the piece&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I then tried it with Llama 4 Scout via OpenRouter and got complete junk output for some reason:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;hn-summary.sh 43595585 \
  -m openrouter/meta-llama/llama-4-scout \
  -o max_tokens 20000
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;a href="https://gist.github.com/simonw/d01cc991d478939e87487d362a8f881f"&gt;Full output&lt;/a&gt;. It starts like this and then continues for the full 20,000 tokens:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The discussion here is about another conversation that was uttered.)&lt;/p&gt;
&lt;p&gt;Here are the results.)&lt;/p&gt;
&lt;p&gt;The conversation between two groups, and I have the same questions on the contrary than those that are also seen in a model."). The fact that I see a lot of interest here.)&lt;/p&gt;
&lt;p&gt;[...]&lt;/p&gt;
&lt;p&gt;The reason) The reason) The reason &lt;em&gt;(loops until it runs out of tokens)&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This looks broken. I was using OpenRouter so it's possible I got routed to a broken instance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Update 7th April 2025&lt;/strong&gt;: Meta AI's &lt;a href="https://twitter.com/ahmad_al_dahle/status/1909302532306092107"&gt;Ahmed Al-Dahle&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;[...] we're also hearing some reports of mixed quality across different services. Since we dropped the models as soon as they were ready, we expect it'll take several days for all the public implementations to get dialed in. We'll keep working through our bug fixes and onboarding partners.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I later managed to run the prompt directly through Groq (with the &lt;a href="https://github.com/angerman/llm-groq"&gt;llm-groq&lt;/a&gt; plugin) - but that had a 2048 limit on output size for some reason:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;hn-summary.sh 43595585 \
  -m groq/meta-llama/llama-4-scout-17b-16e-instruct \
  -o max_tokens 2048
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here's &lt;a href="https://gist.github.com/simonw/a205c5fc131a1d4e9cd6c432a07feedb"&gt;the full result&lt;/a&gt;. It followed my instructions but was &lt;em&gt;very&lt;/em&gt; short - just 630 tokens of output.&lt;/p&gt;
&lt;p&gt;For comparison, here's &lt;a href="https://gist.github.com/simonw/f21ecc7fb2aa13ff682d4ffa11ddcbfd"&gt;the same thing&lt;/a&gt; run against Gemini 2.5 Pro. Gemini's result was &lt;em&gt;massively&lt;/em&gt; better, producing 5,584 output tokens (it spent an additional 2,667 tokens on "thinking").&lt;/p&gt;
&lt;p&gt;I'm not sure how much to judge Llama 4 by these results to be honest - the model has only been out for a few hours and it's quite possible that the providers I've tried running against aren't yet optimally configured for this kind of long-context prompt.&lt;/p&gt;
&lt;h4 id="my-hopes-for-llama-4"&gt;My hopes for Llama 4&lt;/h4&gt;
&lt;p&gt;I'm hoping that Llama 4 plays out in a similar way to Llama 3.&lt;/p&gt;
&lt;p&gt;The first Llama 3 models released were 8B and 70B, &lt;a href="https://ai.meta.com/blog/meta-llama-3/"&gt;last April&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Llama 3.1 followed &lt;a href="https://ai.meta.com/blog/meta-llama-3-1/"&gt;in July&lt;/a&gt; at 8B, 70B, and 405B. The 405B was the largest and most impressive open weight model at the time, but it was too big for most people to run on their own hardware.&lt;/p&gt;
&lt;p&gt;Llama 3.2 &lt;a href="https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/"&gt;in September&lt;/a&gt; is where things got really interesting: 1B, 3B, 11B and 90B. The 1B and 3B models both work on my iPhone, and are surprisingly capable! The 11B and 90B models were the first Llamas to support vision, and the 11B &lt;a href="https://simonwillison.net/2024/Sep/25/llama-32/"&gt;ran on my Mac&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Then Llama 3.3 landed in December with a 70B model that &lt;a href="https://simonwillison.net/2024/Dec/9/llama-33-70b/"&gt;I wrote about as a GPT-4 class model that ran on my Mac&lt;/a&gt;. It claimed performance similar to the earlier Llama 3.1 405B!&lt;/p&gt;
&lt;p&gt;Today's Llama 4 models are 109B and 400B, both of which were trained with the help of the so-far unreleased 2T Llama 4 Behemoth.&lt;/p&gt;
&lt;p&gt;My hope is that we'll see a whole family of Llama 4 models at varying sizes, following the pattern of Llama 3. I'm particularly excited to see if they produce an improved ~3B model that runs on my phone. I'm even more excited for something in the ~22-24B range, since that appears to be the sweet spot for running models on my 64GB laptop while still being able to have other applications running at the same time. Mistral Small 3.1 is a 24B model and is &lt;a href="https://simonwillison.net/2025/Mar/17/mistral-small-31/"&gt;absolutely superb&lt;/a&gt;.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llama"&gt;llama&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/jeremy-howard"&gt;jeremy-howard&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gemini"&gt;gemini&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/vision-llms"&gt;vision-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/groq"&gt;groq&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/meta"&gt;meta&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/mlx"&gt;mlx&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/long-context"&gt;long-context&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openrouter"&gt;openrouter&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/chatbot-arena"&gt;chatbot-arena&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="ai"/><category term="generative-ai"/><category term="llama"/><category term="llms"/><category term="jeremy-howard"/><category term="llm"/><category term="gemini"/><category term="vision-llms"/><category term="groq"/><category term="meta"/><category term="mlx"/><category term="long-context"/><category term="llm-release"/><category term="openrouter"/><category term="chatbot-arena"/></entry><entry><title>deepseek-ai/DeepSeek-V3-0324</title><link href="https://simonwillison.net/2025/Mar/24/deepseek/#atom-tag" rel="alternate"/><published>2025-03-24T15:04:04+00:00</published><updated>2025-03-24T15:04:04+00:00</updated><id>https://simonwillison.net/2025/Mar/24/deepseek/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://huggingface.co/deepseek-ai/DeepSeek-V3-0324"&gt;deepseek-ai/DeepSeek-V3-0324&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Chinese AI lab DeepSeek just released the latest version of their enormous DeepSeek v3 model, baking the release date into the name &lt;code&gt;DeepSeek-V3-0324&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;The license is MIT (that's new - the previous DeepSeek v3 had a custom license), the README is empty and the release adds up to a total of 641 GB of files, mostly of the form &lt;code&gt;model-00035-of-000163.safetensors&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;The model only came out a few hours ago and MLX developer Awni Hannun already &lt;a href="https://twitter.com/awnihannun/status/1904177084609827054"&gt;has it running&lt;/a&gt; at &amp;gt;20 tokens/second on a 512GB M3 Ultra Mac Studio ($9,499 of ostensibly consumer-grade hardware) via &lt;a href="https://pypi.org/project/mlx-lm/"&gt;mlx-lm&lt;/a&gt; and this &lt;a href="https://huggingface.co/mlx-community/DeepSeek-V3-0324-4bit"&gt;mlx-community/DeepSeek-V3-0324-4bit&lt;/a&gt; 4bit quantization, which reduces the on-disk size to 352 GB.&lt;/p&gt;
&lt;p&gt;I think that means if you have that machine you can run it with my &lt;a href="https://github.com/simonw/llm-mlx"&gt;llm-mlx&lt;/a&gt; plugin like this, but I've not tried myself!&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm mlx download-model mlx-community/DeepSeek-V3-0324-4bit
llm chat -m mlx-community/DeepSeek-V3-0324-4bit
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The new model is also &lt;a href="https://openrouter.ai/deepseek/deepseek-chat-v3-0324:free"&gt;listed on OpenRouter&lt;/a&gt;. You can try a chat at &lt;a href="https://openrouter.ai/chat?models=deepseek/deepseek-chat-v3-0324:free"&gt;openrouter.ai/chat?models=deepseek/deepseek-chat-v3-0324:free&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Here's what the chat interface &lt;a href="https://gist.github.com/simonw/3ce2bf5836743dfaf07d994578b261ba"&gt;gave me&lt;/a&gt; for "Generate an SVG of a pelican riding a bicycle":&lt;/p&gt;
&lt;p&gt;&lt;img alt="There's a pelican, and a bicycle, but both of them look disassembled." src="https://static.simonwillison.net/static/2025/deepseek-v3-pelican.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;I have two API keys with OpenRouter - one of them worked with the model, the other gave me a &lt;code&gt;No endpoints found matching your data policy&lt;/code&gt; error - I think because I had a setting on that key disallowing models from training on my activity. The key that worked was a free key with no attached billing credentials.&lt;/p&gt;
&lt;p&gt;For my working API key the &lt;a href="https://github.com/simonw/llm-openrouter"&gt;llm-openrouter&lt;/a&gt; plugin let me run a prompt like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm install llm-openrouter
llm keys set openrouter
# Paste key here
llm -m openrouter/deepseek/deepseek-chat-v3-0324:free "best fact about a pelican"
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here's that "best fact" - the terminal output included Markdown and an emoji combo, here that's rendered.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;One of the most fascinating facts about pelicans is their &lt;strong&gt;unique throat pouch&lt;/strong&gt;, called a &lt;strong&gt;gular sac&lt;/strong&gt;, which can hold up to &lt;strong&gt;3 gallons (11 liters) of water&lt;/strong&gt;—three times more than their stomach!  &lt;/p&gt;
&lt;p&gt;Here’s why it’s amazing:&lt;br /&gt;
- &lt;strong&gt;Fishing Tool&lt;/strong&gt;: They use it like a net to scoop up fish, then drain the water before swallowing.&lt;br /&gt;
- &lt;strong&gt;Cooling Mechanism&lt;/strong&gt;: On hot days, pelicans flutter the pouch to stay cool by evaporating water.&lt;br /&gt;
- &lt;strong&gt;Built-in "Shopping Cart"&lt;/strong&gt;: Some species even use it to carry food back to their chicks.  &lt;/p&gt;
&lt;p&gt;Bonus fact: Pelicans often &lt;strong&gt;fish cooperatively&lt;/strong&gt;, herding fish into shallow water for an easy catch.  &lt;/p&gt;
&lt;p&gt;Would you like more cool pelican facts? 🐦🌊&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;In putting this post together I &lt;a href="https://claude.ai/share/fc65bf9b-ae2d-4b23-bd09-ed0d54ff4b56"&gt;got Claude&lt;/a&gt; to build me &lt;a href="https://tools.simonwillison.net/huggingface-storage"&gt;this new tool&lt;/a&gt; for finding the total on-disk size of a Hugging Face repository, which is available in their API but not currently displayed on their website.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Update&lt;/strong&gt;: Here's a notable independent benchmark &lt;a href="https://twitter.com/paulgauthier/status/1904304052500148423"&gt;from Paul Gauthier&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;DeepSeek's new V3 scored 55% on aider's &lt;a href="https://aider.chat/docs/leaderboards/"&gt;polyglot benchmark&lt;/a&gt;, significantly improving over the prior version. It's the #2 non-thinking/reasoning model, behind only Sonnet 3.7. V3 is competitive with thinking models like R1 &amp;amp; o3-mini.&lt;/p&gt;
&lt;/blockquote&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/tools"&gt;tools&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-programming"&gt;ai-assisted-programming&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/hugging-face"&gt;hugging-face&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/mlx"&gt;mlx&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/deepseek"&gt;deepseek&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openrouter"&gt;openrouter&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-in-china"&gt;ai-in-china&lt;/a&gt;&lt;/p&gt;



</summary><category term="tools"/><category term="ai"/><category term="generative-ai"/><category term="local-llms"/><category term="llms"/><category term="ai-assisted-programming"/><category term="hugging-face"/><category term="llm"/><category term="mlx"/><category term="pelican-riding-a-bicycle"/><category term="deepseek"/><category term="llm-release"/><category term="openrouter"/><category term="ai-in-china"/></entry><entry><title>llm-openrouter 0.4</title><link href="https://simonwillison.net/2025/Mar/10/llm-openrouter-04/#atom-tag" rel="alternate"/><published>2025-03-10T21:40:56+00:00</published><updated>2025-03-10T21:40:56+00:00</updated><id>https://simonwillison.net/2025/Mar/10/llm-openrouter-04/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/simonw/llm-openrouter/releases/tag/0.4"&gt;llm-openrouter 0.4&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
I found out this morning that &lt;a href="https://openrouter.ai/"&gt;OpenRouter&lt;/a&gt; include support for a number of (rate-limited) &lt;a href="https://openrouter.ai/models?max_price=0"&gt;free API models&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I occasionally run workshops on top of LLMs (&lt;a href="https://simonwillison.net/2025/Mar/8/cutting-edge-web-scraping/"&gt;like this one&lt;/a&gt;) and being able to provide students with a quick way to obtain an API key against models where they don't have to set up billing is really valuable to me!&lt;/p&gt;
&lt;p&gt;This inspired me to upgrade my existing &lt;a href="https://github.com/simonw/llm-openrouter"&gt;llm-openrouter&lt;/a&gt; plugin, and in doing so I closed out a bunch of open feature requests.&lt;/p&gt;
&lt;p&gt;Consider this post the &lt;a href="https://simonwillison.net/tags/annotated-release-notes/"&gt;annotated release notes&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;LLM &lt;a href="https://llm.datasette.io/en/stable/schemas.html"&gt;schema support&lt;/a&gt; for OpenRouter models that &lt;a href="https://openrouter.ai/models?order=newest&amp;amp;supported_parameters=structured_outputs"&gt;support structured output&lt;/a&gt;. &lt;a href="https://github.com/simonw/llm-openrouter/issues/23"&gt;#23&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;I'm trying to get support for LLM's &lt;a href="https://simonwillison.net/2025/Feb/28/llm-schemas/"&gt;new schema feature&lt;/a&gt; into as many plugins as possible.&lt;/p&gt;
&lt;p&gt;OpenRouter's OpenAI-compatible API includes support for the &lt;code&gt;response_format&lt;/code&gt; &lt;a href="https://openrouter.ai/docs/features/structured-outputs"&gt;structured content option&lt;/a&gt;, but with an important caveat: it only works for some models, and if you try to use it on others it is silently ignored.&lt;/p&gt;
&lt;p&gt;I &lt;a href="https://github.com/OpenRouterTeam/openrouter-examples/issues/20"&gt;filed an issue&lt;/a&gt; with OpenRouter requesting they include schema support in their machine-readable model index. For the moment LLM will let you specify schemas for unsupported models and will ignore them entirely, which isn't ideal.&lt;/p&gt;
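&lt;p&gt;As a quick sketch of what this enables (the model here is just an example of one that supports structured output):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm -m openrouter/openai/gpt-4o-mini \
  --schema 'name, species, fun_fact' \
  'invent a pelican'
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Run the same command against a model that ignores &lt;code&gt;response_format&lt;/code&gt; and you get plain prose back instead of JSON, with no warning.&lt;/p&gt;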
&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;llm openrouter key&lt;/code&gt; command displays information about your current API key. &lt;a href="https://github.com/simonw/llm-openrouter/issues/24"&gt;#24&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;Useful for debugging and checking the details of your key's rate limit.&lt;/p&gt;
&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;llm -m ... -o online 1&lt;/code&gt; enables &lt;a href="https://openrouter.ai/docs/features/web-search"&gt;web search grounding&lt;/a&gt; against any model, powered by &lt;a href="https://exa.ai/"&gt;Exa&lt;/a&gt;. &lt;a href="https://github.com/simonw/llm-openrouter/issues/25"&gt;#25&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;OpenRouter apparently make this feature available to every one of their supported models! It's powered by &lt;a href="https://exa.ai/"&gt;Exa&lt;/a&gt;, a new-to-me AI-focused search engine startup who appear to have built their own index with their own crawlers (according to &lt;a href="https://docs.exa.ai/reference/faqs#how-often-is-the-index-updated"&gt;their FAQ&lt;/a&gt;). OpenRouter currently price the feature at $4 per 1,000 results, and since 5 results are returned for every prompt that works out to 2 cents per prompt.&lt;/p&gt;
&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;llm openrouter models&lt;/code&gt; command for listing details of the OpenRouter models, including a &lt;code&gt;--json&lt;/code&gt; option to get JSON and a &lt;code&gt;--free&lt;/code&gt; option to filter for just the free models. &lt;a href="https://github.com/simonw/llm-openrouter/issues/26"&gt;#26&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;This offers a neat way to list the available models. There are examples of the output &lt;a href="https://github.com/simonw/llm-openrouter/issues/26#issuecomment-2711908704"&gt;in the comments on the issue&lt;/a&gt;.&lt;/p&gt;
&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;New option to specify custom provider routing: &lt;code&gt;-o provider '{JSON here}'&lt;/code&gt;. &lt;a href="https://github.com/simonw/llm-openrouter/issues/17"&gt;#17&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;Part of OpenRouter's USP is that it can route prompts to different providers depending on factors like latency, cost or as a fallback if your first choice is unavailable - great for if you are using open weight models like Llama which are hosted by competing companies.&lt;/p&gt;
&lt;p&gt;The options they provide for routing are &lt;a href="https://openrouter.ai/docs/features/provider-routing"&gt;very thorough&lt;/a&gt; - I had initially hoped to provide a set of CLI options that covered all of these bases, but I decided instead to reuse their JSON format and forward those options directly on to the model.&lt;/p&gt;
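&lt;p&gt;As an illustrative example (the &lt;code&gt;order&lt;/code&gt; and &lt;code&gt;allow_fallbacks&lt;/code&gt; fields come from OpenRouter's provider routing documentation), pinning a prompt to specific providers looks like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm -m openrouter/meta-llama/llama-3.3-70b-instruct \
  -o provider '{"order": ["Groq", "Together"], "allow_fallbacks": false}' \
  'a poem about an otter'
&lt;/code&gt;&lt;/pre&gt;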


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/cli"&gt;cli&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/plugins"&gt;plugins&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/annotated-release-notes"&gt;annotated-release-notes&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openrouter"&gt;openrouter&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-search"&gt;ai-assisted-search&lt;/a&gt;&lt;/p&gt;



</summary><category term="cli"/><category term="plugins"/><category term="projects"/><category term="ai"/><category term="annotated-release-notes"/><category term="generative-ai"/><category term="llms"/><category term="llm"/><category term="openrouter"/><category term="ai-assisted-search"/></entry><entry><title>llm-openrouter 0.3</title><link href="https://simonwillison.net/2024/Dec/8/llm-openrouter-03/#atom-tag" rel="alternate"/><published>2024-12-08T23:56:14+00:00</published><updated>2024-12-08T23:56:14+00:00</updated><id>https://simonwillison.net/2024/Dec/8/llm-openrouter-03/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/simonw/llm-openrouter/releases/tag/0.3"&gt;llm-openrouter 0.3&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
New release of my &lt;a href="https://github.com/simonw/llm-openrouter"&gt;llm-openrouter&lt;/a&gt; plugin, which allows &lt;a href="https://llm.datasette.io/"&gt;LLM&lt;/a&gt; to access models hosted by &lt;a href="https://openrouter.ai/"&gt;OpenRouter&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Quoting the release notes:&lt;/p&gt;
&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;Enable image attachments for models that support images. Thanks, &lt;a href="https://github.com/montasaurus"&gt;Adam Montgomery&lt;/a&gt;. &lt;a href="https://github.com/simonw/llm-openrouter/issues/12"&gt;#12&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Provide async model access. &lt;a href="https://github.com/simonw/llm-openrouter/issues/15"&gt;#15&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Fix documentation to list correct &lt;code&gt;LLM_OPENROUTER_KEY&lt;/code&gt; environment variable. &lt;a href="https://github.com/simonw/llm-openrouter/issues/10"&gt;#10&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/plugins"&gt;plugins&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/releases"&gt;releases&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openrouter"&gt;openrouter&lt;/a&gt;&lt;/p&gt;



</summary><category term="plugins"/><category term="releases"/><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="llm"/><category term="openrouter"/></entry><entry><title>Options for accessing Llama 3 from the terminal using LLM</title><link href="https://simonwillison.net/2024/Apr/22/llama-3/#atom-tag" rel="alternate"/><published>2024-04-22T13:38:09+00:00</published><updated>2024-04-22T13:38:09+00:00</updated><id>https://simonwillison.net/2024/Apr/22/llama-3/#atom-tag</id><summary type="html">
    &lt;p&gt;Llama 3 was released &lt;a href="https://llama.meta.com/llama3/"&gt;on Thursday&lt;/a&gt;. Early indications are that it's now the best available openly licensed model - Llama 3 70b Instruct has taken joint 5th place on the &lt;a href="https://chat.lmsys.org/?leaderboard"&gt;LMSYS arena leaderboard&lt;/a&gt;, behind only Claude 3 Opus and some GPT-4s and sharing 5th place with Gemini Pro and Claude 3 Sonnet. But unlike those other models Llama 3 70b is weights available and can even be run on a (high end) laptop!&lt;/p&gt;
&lt;p&gt;My &lt;a href="https://llm.datasette.io/"&gt;LLM&lt;/a&gt; command-line tool and Python library provides access to dozens of models via plugins. Here are several ways you can use it to access Llama 3, both hosted versions and running locally on your own hardware.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2024/Apr/22/llama-3/#llama-3-8b-instruct-locally-with-llm-gpt4all"&gt;Llama-3-8B-Instruct locally with llm-gpt4all&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2024/Apr/22/llama-3/#fast-api-access-via-groq"&gt;Fast API access via Groq&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2024/Apr/22/llama-3/#local-llama-3-70b-instruct-with-llamafile"&gt;Local Llama 3 70b Instruct with llamafile&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2024/Apr/22/llama-3/#paid-access-via-other-api-providers"&gt;Paid access via other API providers&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h4 id="llama-3-8b-instruct-locally-with-llm-gpt4all"&gt;Llama-3-8B-Instruct locally with llm-gpt4all&lt;/h4&gt;
&lt;p&gt;If you want to run Llama 3 locally, the easiest way to do that with LLM is using the &lt;a href="https://github.com/simonw/llm-gpt4all"&gt;llm-gpt4all&lt;/a&gt; plugin. This plugin builds on the excellent &lt;a href="https://gpt4all.io/index.html"&gt;gpt4all&lt;/a&gt; project by Nomic AI, providing a quantized (q4) Llama 3 8B instruct model that needs 8GB of RAM and requires a 4.34GB download.&lt;/p&gt;
&lt;p&gt;If you don't have LLM installed yet, install it like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;brew install llm&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Or like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;pipx install llm&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Install the plugin:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm install llm-gpt4all&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The first time you run a prompt it will download the model:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm -m Meta-Llama-3-8B-Instruct &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Three great names for a pet emu&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;blockquote&gt;
&lt;p&gt;What an interesting question! Here are three name suggestions for your pet emu:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Emmet&lt;/strong&gt;: A playful take on the word "emu" that suits a friendly and curious bird.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Gizmo&lt;/strong&gt;: Emus are known for their quirky behavior, so Gizmo is a fitting name to capture their unique personality.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Aussie&lt;/strong&gt;: As an Australian native species, naming your emu Aussie would be a nod to its heritage.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Remember to choose a name that suits your pet's temperament and characteristics!&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Each time you run a command like this the model will be loaded into memory from scratch. To keep it resident in memory, use the &lt;code&gt;llm chat&lt;/code&gt; command:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm chat -m Meta-Llama-3-8B-Instruct&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Here's an example session:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Chatting with Meta-Llama-3-8B-Instruct
Type 'exit' or 'quit' to exit
Type '!multi' to enter multiple lines, then '!end' to finish
&amp;gt; javascript function to reverse a string
%
Here is the JavaScript code that reverses a given string:
```
function reverseString(str) {
  return str.split("").reverse().join("");
}
```

You can use this function by passing your desired input string as an argument, like so:

```javascript
let reversedStr = reverseString("Hello World");
console.log(reversedStr); // Output: "dlroW olleH"
```
In the code above:
- `str.split("")` splits the original string into individual characters.
- `.reverse()` reverses the order of those characters.
- `.join("")` combines them back together as a single string.

This function works for any input string, not just "Hello World". Try it out with different strings to see how it handles punctuation and spaces!
&amp;gt; capitalize it too
%
Here's an updated JavaScript function that not only reverses the input string but also capitalizes each character:
```javascript
function reverseAndCapitalize(str) {
  return str.split("").reverse().map(function(char) { 
    if (char.match(/[a-z]/i)) {
      return char.toUpperCase();
    } else {
      return char;
    }
  }).join("");
}
```
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;(I'm not sure I'd write &lt;code&gt;reverseAndCapitalize()&lt;/code&gt; quite like that, but still a fun example!)&lt;/p&gt;
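&lt;p&gt;Since upper-casing already leaves digits and punctuation untouched, that per-character &lt;code&gt;map()&lt;/code&gt; isn't needed - the whole thing collapses to a reverse plus a single uppercase. Here's that simplification sketched in Python (my own sketch, not the model's output):&lt;/p&gt;

```python
def reverse_and_capitalize(s: str) -> str:
    # Reverse the string, then uppercase it in one pass;
    # str.upper() leaves digits and punctuation unchanged.
    return s[::-1].upper()

print(reverse_and_capitalize("Hello World"))  # DLROW OLLEH
```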
&lt;p&gt;Consult &lt;a href="https://llm.datasette.io/en/stable/usage.html"&gt;the LLM documentation&lt;/a&gt; for more details on how to use the command-line tool.&lt;/p&gt;
&lt;h4 id="fast-api-access-via-groq"&gt;Fast API access via Groq&lt;/h4&gt;
&lt;p&gt;&lt;a href="https://groq.com/"&gt;Groq&lt;/a&gt; serve openly licensed LLMs at ludicrous speeds using their own custom LPU (Language Processing Unit) Inference Engine. They currently offer a free preview of their API: you can sign up and &lt;a href="https://console.groq.com/keys"&gt;obtain an API key&lt;/a&gt; to start using it.&lt;/p&gt;
&lt;p&gt;You can run prompts against Groq using their &lt;a href="https://console.groq.com/docs/openai"&gt;OpenAI compatible API endpoint&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Edit the file &lt;code&gt;~/Library/Application Support/io.datasette.llm/extra-openai-models.yaml&lt;/code&gt; - creating it if it doesn't exist - and add the following lines to it:&lt;/p&gt;
&lt;div class="highlight highlight-source-yaml"&gt;&lt;pre&gt;- &lt;span class="pl-ent"&gt;model_id&lt;/span&gt;: &lt;span class="pl-s"&gt;groq-openai-llama3&lt;/span&gt;
  &lt;span class="pl-ent"&gt;model_name&lt;/span&gt;: &lt;span class="pl-s"&gt;llama3-70b-8192&lt;/span&gt;
  &lt;span class="pl-ent"&gt;api_base&lt;/span&gt;: &lt;span class="pl-s"&gt;https://api.groq.com/openai/v1&lt;/span&gt;
  &lt;span class="pl-ent"&gt;api_key_name&lt;/span&gt;: &lt;span class="pl-s"&gt;groq&lt;/span&gt;
- &lt;span class="pl-ent"&gt;model_id&lt;/span&gt;: &lt;span class="pl-s"&gt;groq-openai-llama3-8b&lt;/span&gt;
  &lt;span class="pl-ent"&gt;model_name&lt;/span&gt;: &lt;span class="pl-s"&gt;llama3-8b-8192&lt;/span&gt;
  &lt;span class="pl-ent"&gt;api_base&lt;/span&gt;: &lt;span class="pl-s"&gt;https://api.groq.com/openai/v1&lt;/span&gt;
  &lt;span class="pl-ent"&gt;api_key_name&lt;/span&gt;: &lt;span class="pl-s"&gt;groq&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This tells LLM about those models, and makes them accessible via those configured &lt;code&gt;model_id&lt;/code&gt; values.&lt;/p&gt;
&lt;p&gt;Run this command to confirm that the models were registered correctly:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm models &lt;span class="pl-k"&gt;|&lt;/span&gt; grep groq&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;You should see this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;OpenAI Chat: groq-openai-llama3
OpenAI Chat: groq-openai-llama3-8b
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Set your Groq API key like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm keys &lt;span class="pl-c1"&gt;set&lt;/span&gt; groq
&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; &amp;lt;Paste your API key here&amp;gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Now you should be able to run prompts through the models like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm -m groq-openai-llama3 &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;A righteous sonnet about a brave owl&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2024/groq-sonnet.gif" alt="Animated demo. The sonnet appears in less than a second: Here is a sonnet about a brave owl:  In moonlit skies, a silhouette is seen, A wingspan wide, a watchful, piercing gaze. The owl, a sentinel of secrets keen, Patrols the night, with valor in her ways.  Her feathers soft, a camouflage gray, She glides unseen, a phantom of the night. Her eyes, like lanterns, shining bright and far, Illuminate the darkness, banishing all fright.  Her talons sharp, a grasping, deadly sway, She swoops upon her prey, with silent might. Yet in her heart, a wisdom, old and gray, A fierce devotion to the darkness of the night.  And thus, the owl, a symbol of courage true, Inspires us all, with brave and noble pursuit.  I hope you enjoy this sonnet!" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;Groq is &lt;em&gt;fast&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;There's also a &lt;a href="https://github.com/angerman/llm-groq"&gt;llm-groq&lt;/a&gt; plugin but it hasn't shipped support for the new models just yet - though there's &lt;a href="https://github.com/angerman/llm-groq/pull/5"&gt;a PR for that by Lex Herbert here&lt;/a&gt; and you can install the plugin directly from that PR like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm install https://github.com/lexh/llm-groq/archive/ba9d7de74b3057b074a85fe99fe873b75519bd78.zip
llm keys &lt;span class="pl-c1"&gt;set&lt;/span&gt; groq
&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; paste API key here&lt;/span&gt;
llm -m groq-llama3-70b &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;say hi in spanish five ways&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;h4 id="local-llama-3-70b-instruct-with-llamafile"&gt;Local Llama 3 70b Instruct with llamafile&lt;/h4&gt;
&lt;p&gt;The Llama 3 8b model is easy to run on a laptop, but it's pretty limited in capability. The 70b model is the one that's starting to get competitive with GPT-4. Can we run that on a laptop?&lt;/p&gt;
&lt;p&gt;I managed to run the 70b model on my 64GB MacBook Pro M2 using &lt;a href="https://github.com/Mozilla-Ocho/llamafile"&gt;llamafile&lt;/a&gt; (&lt;a href="https://simonwillison.net/2023/Nov/29/llamafile/"&gt;previously on this blog&lt;/a&gt;) - after quitting most other applications to make sure the 37GB of RAM it needed was available.&lt;/p&gt;
&lt;p&gt;I used the &lt;code&gt;Meta-Llama-3-70B-Instruct.Q4_0.llamafile&lt;/code&gt; Q4 version from &lt;a href="https://huggingface.co/jartine/Meta-Llama-3-70B-Instruct-llamafile/tree/main"&gt;jartine/Meta-Llama-3-70B-Instruct-llamafile&lt;/a&gt; - a 37GB download. I have a dedicated external hard disk (a Samsung T7 Shield) for this kind of thing.&lt;/p&gt;
&lt;p&gt;Here's how I got it working:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;curl -L -o Meta-Llama-3-70B-Instruct.Q4_0.llamafile &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;https://huggingface.co/jartine/Meta-Llama-3-70B-Instruct-llamafile/resolve/main/Meta-Llama-3-70B-Instruct.Q4_0.llamafile?download=true&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; That downloads 37GB - now make it executable&lt;/span&gt;
chmod 755 Meta-Llama-3-70B-Instruct.Q4_0.llamafile
&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; And start it running:&lt;/span&gt;
./Meta-Llama-3-70B-Instruct.Q4_0.llamafile&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;A &lt;code&gt;llamafile&lt;/code&gt; is an executable that runs on virtually &lt;em&gt;any platform&lt;/em&gt; - see &lt;a href="https://til.simonwillison.net/cosmopolitan/ecosystem"&gt;my previous notes&lt;/a&gt; on Cosmopolitan and Actually Portable Executable for more on how that works.&lt;/p&gt;
&lt;p&gt;This will take quite a while to start, because it needs to load that full 37GB of binary content into memory. Once it's finished loading, a local web server becomes available at &lt;a href="http://127.0.0.1:8080/"&gt;http://127.0.0.1:8080/&lt;/a&gt; - this serves a web UI you can use to interact with the model, and also exposes an OpenAI-compatible API endpoint.&lt;/p&gt;
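&lt;p&gt;You can hit that endpoint with any HTTP client, too. Here's a hedged Python sketch of what a request looks like - the &lt;code&gt;build_chat_request()&lt;/code&gt; helper is my own invention, but the URL shape and JSON body follow the standard OpenAI chat completions format that llamafile emulates:&lt;/p&gt;

```python
import json
import urllib.request

def build_chat_request(prompt, base="http://127.0.0.1:8080/v1"):
    """Build an OpenAI-style chat completion request for a local llamafile.

    llamafile ignores the model name, but the field is required by the API shape.
    """
    url = base + "/chat/completions"
    payload = {
        "model": "llamafile",
        "messages": [{"role": "user", "content": prompt}],
    }
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    return req, payload

req, payload = build_chat_request("3 neat characteristics of a pelican")
# urllib.request.urlopen(req) would send it - that only works
# while the llamafile server is actually running on port 8080.
```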
&lt;p&gt;The easiest way to access this from LLM is to install the &lt;a href="https://github.com/simonw/llm-llamafile"&gt;llm-llamafile&lt;/a&gt; plugin:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm install llm-llamafile&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;All this plugin does is configure a model called &lt;code&gt;llamafile&lt;/code&gt; that attempts to access the model hosted on port 8080. You can run prompts like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm -m llamafile &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;3 neat characteristics of a pelican&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;blockquote&gt;
&lt;p&gt;Here are three neat characteristics of a pelican:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Unique Beak&lt;/strong&gt;: Pelicans have a distinctive beak that is shaped like a scoop or a basket. This beak is specially designed to catch fish, and it can hold up to 3 gallons of water! The beak is also very sensitive, which helps pelicans detect the presence of fish in the water.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Waterproof Feathers&lt;/strong&gt;: Pelicans have a special coating on their feathers that makes them waterproof. This is essential for their fishing lifestyle, as they need to be able to dive into the water without getting weighed down by wet feathers. The coating is made up of a waxy substance that helps to repel water.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Pouch-Like Throat&lt;/strong&gt;: Pelicans have a unique throat pouch that allows them to catch and store fish. When they dive into the water, they use their beak to scoop up fish, and then they store them in their throat pouch. The pouch can expand to hold multiple fish, and the pelican can then swallow the fish whole or regurgitate them to feed their young. This pouch is a key adaptation that helps pelicans thrive in their aquatic environment.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/blockquote&gt;
&lt;p&gt;If you don't want to install another plugin, you can instead configure the model by adding this to your &lt;code&gt;extra-openai-models.yaml&lt;/code&gt; file:&lt;/p&gt;
&lt;div class="highlight highlight-source-yaml"&gt;&lt;pre&gt;- &lt;span class="pl-ent"&gt;model_id&lt;/span&gt;: &lt;span class="pl-s"&gt;llamafile&lt;/span&gt;
  &lt;span class="pl-ent"&gt;model_name&lt;/span&gt;: &lt;span class="pl-s"&gt;llamafile&lt;/span&gt;
  &lt;span class="pl-ent"&gt;api_base&lt;/span&gt;: &lt;span class="pl-s"&gt;http://localhost:8080/v1&lt;/span&gt;
  &lt;span class="pl-ent"&gt;api_key&lt;/span&gt;: &lt;span class="pl-s"&gt;x&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;One warning about this approach: if you use LLM like this then every prompt you run through &lt;code&gt;llamafile&lt;/code&gt; will be stored under the same model name in your &lt;a href="https://llm.datasette.io/en/stable/logging.html"&gt;SQLite logs&lt;/a&gt;, even if you try out different &lt;code&gt;llamafile&lt;/code&gt; models at different times. You could work around this by registering them with different &lt;code&gt;model_id&lt;/code&gt; values in the YAML file.&lt;/p&gt;
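&lt;p&gt;For example - and these &lt;code&gt;model_id&lt;/code&gt; values are just my suggestions - you could register one entry per llamafile you use. Both point at the same port, so whichever llamafile is currently running is the one that will answer:&lt;/p&gt;
&lt;div class="highlight highlight-source-yaml"&gt;&lt;pre&gt;- &lt;span class="pl-ent"&gt;model_id&lt;/span&gt;: &lt;span class="pl-s"&gt;llamafile-llama3-70b&lt;/span&gt;
  &lt;span class="pl-ent"&gt;model_name&lt;/span&gt;: &lt;span class="pl-s"&gt;llamafile&lt;/span&gt;
  &lt;span class="pl-ent"&gt;api_base&lt;/span&gt;: &lt;span class="pl-s"&gt;http://localhost:8080/v1&lt;/span&gt;
  &lt;span class="pl-ent"&gt;api_key&lt;/span&gt;: &lt;span class="pl-s"&gt;x&lt;/span&gt;
- &lt;span class="pl-ent"&gt;model_id&lt;/span&gt;: &lt;span class="pl-s"&gt;llamafile-llama3-8b&lt;/span&gt;
  &lt;span class="pl-ent"&gt;model_name&lt;/span&gt;: &lt;span class="pl-s"&gt;llamafile&lt;/span&gt;
  &lt;span class="pl-ent"&gt;api_base&lt;/span&gt;: &lt;span class="pl-s"&gt;http://localhost:8080/v1&lt;/span&gt;
  &lt;span class="pl-ent"&gt;api_key&lt;/span&gt;: &lt;span class="pl-s"&gt;x&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Prompts would then be logged under whichever &lt;code&gt;model_id&lt;/code&gt; you passed to &lt;code&gt;llm -m&lt;/code&gt;, keeping the two models distinct in your logs.&lt;/p&gt;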
&lt;h4 id="paid-access-via-other-api-providers"&gt;Paid access via other API providers&lt;/h4&gt;
&lt;p&gt;A neat thing about open weight models is that multiple API providers can offer them, encouraging them to aggressively compete on price.&lt;/p&gt;
&lt;p&gt;Groq is currently free, but only for a limited number of requests.&lt;/p&gt;
&lt;p&gt;A number of other providers are now hosting Llama 3, and many of them have plugins available for LLM. Here are a few examples:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://docs.perplexity.ai/"&gt;Perplexity Labs&lt;/a&gt; are offering &lt;code&gt;llama-3-8b-instruct&lt;/code&gt; and &lt;code&gt;llama-3-70b-instruct&lt;/code&gt;. The &lt;a href="https://github.com/hex/llm-perplexity"&gt;llm-perplexity&lt;/a&gt; plugin provides access - &lt;code&gt;llm install llm-perplexity&lt;/code&gt; to install, &lt;code&gt;llm keys set perplexity&lt;/code&gt; to set an &lt;a href="https://www.perplexity.ai/settings/api"&gt;API key&lt;/a&gt; and then run prompts against those two model IDs. Current &lt;a href="https://docs.perplexity.ai/docs/pricing"&gt;price&lt;/a&gt; for 8b is $0.20 per million tokens, for 80b is $1.00.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.anyscale.com/endpoints"&gt;Anyscale Endpoints&lt;/a&gt; have &lt;code&gt;meta-llama/Llama-3-8b-chat-hf&lt;/code&gt; ($0.15/million tokens) and &lt;code&gt;meta-llama/Llama-3-70b-chat-hf&lt;/code&gt; ($1.0/million tokens) (&lt;a href="https://docs.endpoints.anyscale.com/pricing/"&gt;pricing&lt;/a&gt;). &lt;code&gt;llm install llm-anyscale-endpoints&lt;/code&gt;, then &lt;code&gt;llm keys set anyscale-endpoints&lt;/code&gt; to set the &lt;a href="https://app.endpoints.anyscale.com/"&gt;API key&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://fireworks.ai/"&gt;Fireworks AI&lt;/a&gt; have &lt;code&gt;fireworks/models/llama-v3-8b-instruct&lt;/code&gt; for $0.20/million and &lt;code&gt;fireworks/models/llama-v3-70b-instruct&lt;/code&gt; for $0.90/million (&lt;a href="https://fireworks.ai/pricing"&gt;pricing&lt;/a&gt;). &lt;code&gt;llm install llm-fireworks&lt;/code&gt;, then &lt;code&gt;llm keys set fireworks&lt;/code&gt; to set the &lt;a href="https://fireworks.ai/api-keys"&gt;API key&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://openrouter.ai/"&gt;OpenRouter&lt;/a&gt; provide proxied accessed to Llama 3 from a number of different providers at different prices, documented on their &lt;a href="https://openrouter.ai/models/meta-llama/llama-3-70b-instruct"&gt;meta-llama/llama-3-70b-instruct&lt;/a&gt; and &lt;a href="https://openrouter.ai/models/meta-llama/llama-3-8b-instruct"&gt;meta-llama/llama-3-8b-instruct&lt;/a&gt; pages (&lt;a href="https://openrouter.ai/models?q=llama%203"&gt;and more&lt;/a&gt;). Use the &lt;a href="https://github.com/simonw/llm-openrouter"&gt;llm-openrouter&lt;/a&gt; plugin for those.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.together.ai/"&gt;Together AI&lt;/a&gt; has both models as well. The &lt;a href="https://github.com/wearedevx/llm-together"&gt;llm-together&lt;/a&gt; plugin provides access to &lt;code&gt;meta-llama/Llama-3-8b-chat-hf&lt;/code&gt; and &lt;code&gt;meta-llama/Llama-3-70b-chat-hf&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I'm sure there are more - these are just the ones I've tried out myself. Check the &lt;a href="https://llm.datasette.io/en/stable/plugins/directory.html"&gt;LLM plugin directory&lt;/a&gt; for other providers, or if a provider emulates the OpenAI API you can configure with the YAML file as shown above or &lt;a href="https://llm.datasette.io/en/stable/other-models.html#openai-compatible-models"&gt;described in the LLM documentation&lt;/a&gt;.&lt;/p&gt;
&lt;h4 id="that-s-a-lot-of-options"&gt;That's a lot of options&lt;/h4&gt;
&lt;p&gt;One key idea behind LLM is to use plugins to provide access to as many different models as possible. Above I've listed two ways to run Llama 3 locally and six different API vendors that LLM can access as well.&lt;/p&gt;
&lt;p&gt;If you're inspired to write your own plugin it's pretty simple: each of the above plugins is open source, and there's a detailed tutorial on &lt;a href="https://llm.datasette.io/en/stable/plugins/tutorial-model-plugin.html"&gt;Writing a plugin to support a new model&lt;/a&gt; on the LLM website.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llama"&gt;llama&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llamafile"&gt;llamafile&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/groq"&gt;groq&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openrouter"&gt;openrouter&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/chatbot-arena"&gt;chatbot-arena&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="projects"/><category term="ai"/><category term="generative-ai"/><category term="llama"/><category term="local-llms"/><category term="llms"/><category term="llm"/><category term="llamafile"/><category term="groq"/><category term="llm-release"/><category term="openrouter"/><category term="chatbot-arena"/></entry><entry><title>Many options for running Mistral models in your terminal using LLM</title><link href="https://simonwillison.net/2023/Dec/18/mistral/#atom-tag" rel="alternate"/><published>2023-12-18T18:18:44+00:00</published><updated>2023-12-18T18:18:44+00:00</updated><id>https://simonwillison.net/2023/Dec/18/mistral/#atom-tag</id><summary type="html">
    &lt;p&gt;&lt;a href="https://mistral.ai/"&gt;Mistral AI&lt;/a&gt; is the most exciting AI research lab at the moment. They've now released two extremely powerful smaller Large Language Models under an Apache 2 license, and have a third much larger one that's available via their API.&lt;/p&gt;
&lt;p&gt;I've been trying out their models using my &lt;a href="https://llm.datasette.io/"&gt;LLM command-line tool&lt;/a&gt;. Here's what I've figured out so far.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2023/Dec/18/mistral/#mixtral-llama-cpp"&gt;Mixtral 8x7B via llama.cpp and llm-llama-cpp&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2023/Dec/18/mistral/#mistral-7b-local"&gt;Mistral 7B via llm-llama-cpp or llm-gpt4all or llm-mlc&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2023/Dec/18/mistral/#mistral-api"&gt;Using the Mistral API, which includes the new Mistral-medium&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2023/Dec/18/mistral/#mistral-other-apis"&gt;Mistral via other API providers&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2023/Dec/18/mistral/#llamafile-openai"&gt;Using Llamafile's OpenAI API endpoint&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h4 id="mixtral-llama-cpp"&gt;Mixtral 8x7B via llama.cpp and llm-llama-cpp&lt;/h4&gt;
&lt;p&gt;On Friday 8th December Mistral AI &lt;a href="https://twitter.com/MistralAI/status/1733150512395038967"&gt;tweeted a mysterious magnet&lt;/a&gt; (BitTorrent) link. This is the second time they've done this, the first was on September 26th when &lt;a href="https://twitter.com/MistralAI/status/1706877320844509405"&gt;they released&lt;/a&gt; their excellent Mistral 7B model, also as a magnet link.&lt;/p&gt;
&lt;p&gt;The new release was an 87GB file containing Mixtral 8x7B - "a high-quality sparse mixture of experts model (SMoE) with open weights", according to &lt;a href="https://mistral.ai/news/mixtral-of-experts/"&gt;the article&lt;/a&gt; they released three days later.&lt;/p&gt;
&lt;p&gt;Mixtral is a &lt;em&gt;very&lt;/em&gt; impressive model. GPT-4 has long been rumored to use a mixture of experts architecture, and Mixtral is the first truly convincing openly licensed implementation of this architecture I've seen. It's already showing impressive benchmark scores.&lt;/p&gt;
&lt;p&gt;This &lt;a href="https://github.com/ggerganov/llama.cpp/pull/4406"&gt;PR for llama.cpp&lt;/a&gt; added support for the new model. &lt;a href="https://github.com/abetlen/llama-cpp-python"&gt;llama-cpp-python&lt;/a&gt; updated to land that patch shortly afterwards.&lt;/p&gt;
&lt;p&gt;Which means... you can now run Mixtral on a Mac (and other platforms too, though I haven't tested them myself yet) using my &lt;a href="https://github.com/simonw/llm-llama-cpp"&gt;llm-llama-cpp plugin&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Here's how to do that:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;a href="https://llm.datasette.io/en/stable/setup.html"&gt;Install LLM&lt;/a&gt;:
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;pipx install llm&lt;/pre&gt;&lt;/div&gt;
&lt;/li&gt;
&lt;li&gt;Install the plugin:
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm install llm-llama-cpp&lt;/pre&gt;&lt;/div&gt;
&lt;/li&gt;
&lt;li&gt;Install &lt;code&gt;llama-cpp-python&lt;/code&gt; - this needs to be done manually because the best approach differs for different platforms. On an Apple Silicon Mac I recommend running:
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;CMAKE_ARGS=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;-DLLAMA_METAL=on&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; FORCE_CMAKE=1 llm install llama-cpp-python&lt;/pre&gt;&lt;/div&gt;
More details &lt;a href="https://github.com/simonw/llm-llama-cpp"&gt;in the llm-llama-cpp README&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Download a GGUF file for Mixtral 8X7B Instruct v0.1. TheBloke has &lt;a href="https://huggingface.co/TheBloke/Mixtral-8x7B-Instruct-v0.1-GGUF/tree/main"&gt;8 different options&lt;/a&gt; for this - I picked the 36GB &lt;code&gt;mixtral-8x7b-instruct-v0.1.Q6_K.gguf&lt;/code&gt;:
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;curl -LO &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;https://huggingface.co/TheBloke/Mixtral-8x7B-Instruct-v0.1-GGUF/resolve/main/mixtral-8x7b-instruct-v0.1.Q6_K.gguf?download=true&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;/li&gt;
&lt;li&gt;Run the model like this:
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm -m gguf \
  -o path mixtral-8x7b-instruct-v0.1.Q6_K.gguf \
  &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;[INST] Write a Python function that downloads a file from a URL[/INST]&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The &lt;code&gt;-m gguf&lt;/code&gt; here says to use the &lt;code&gt;gguf&lt;/code&gt; model that was installed by the &lt;code&gt;llm-llama-cpp&lt;/code&gt; plugin - this is a model which requires you to pass a &lt;code&gt;path&lt;/code&gt; option - with &lt;code&gt;-o path filename&lt;/code&gt; - specifying which model you would like to run.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;[INST]&lt;/code&gt; part there is the format of prompt expected by the Instruct model. I need to update the &lt;code&gt;llm-llama-cpp&lt;/code&gt; plugin to make it easier to automatically construct the correct prompt format, see &lt;a href="https://github.com/simonw/llm-llama-cpp/issues/28"&gt;this issue&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Here's what I got for that prompt:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Sure, here's an example Python function that downloads a file from a URL using the &lt;code&gt;requests&lt;/code&gt; library:&lt;/p&gt;
&lt;pre&gt;&lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-s1"&gt;requests&lt;/span&gt;

&lt;span class="pl-k"&gt;def&lt;/span&gt; &lt;span class="pl-en"&gt;download_file&lt;/span&gt;(&lt;span class="pl-s1"&gt;url&lt;/span&gt;, &lt;span class="pl-s1"&gt;save_path&lt;/span&gt;):
    &lt;span class="pl-s"&gt;"""&lt;/span&gt;
&lt;span class="pl-s"&gt;    Download a file from a URL and save it to a local path.&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;    :param url: The URL of the file to download.&lt;/span&gt;
&lt;span class="pl-s"&gt;    :param save_path: The local path to save the file.&lt;/span&gt;
&lt;span class="pl-s"&gt;    """&lt;/span&gt;
    &lt;span class="pl-s1"&gt;response&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;requests&lt;/span&gt;.&lt;span class="pl-en"&gt;get&lt;/span&gt;(&lt;span class="pl-s1"&gt;url&lt;/span&gt;, &lt;span class="pl-s1"&gt;stream&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-c1"&gt;True&lt;/span&gt;)
    &lt;span class="pl-k"&gt;if&lt;/span&gt; &lt;span class="pl-s1"&gt;response&lt;/span&gt;.&lt;span class="pl-s1"&gt;status_code&lt;/span&gt; &lt;span class="pl-c1"&gt;==&lt;/span&gt; &lt;span class="pl-c1"&gt;200&lt;/span&gt;:
        &lt;span class="pl-k"&gt;with&lt;/span&gt; &lt;span class="pl-en"&gt;open&lt;/span&gt;(&lt;span class="pl-s1"&gt;save_path&lt;/span&gt;, &lt;span class="pl-s"&gt;'wb'&lt;/span&gt;) &lt;span class="pl-k"&gt;as&lt;/span&gt; &lt;span class="pl-s1"&gt;f&lt;/span&gt;:
            &lt;span class="pl-k"&gt;for&lt;/span&gt; &lt;span class="pl-s1"&gt;chunk&lt;/span&gt; &lt;span class="pl-c1"&gt;in&lt;/span&gt; &lt;span class="pl-s1"&gt;response&lt;/span&gt;.&lt;span class="pl-en"&gt;iter_content&lt;/span&gt;(&lt;span class="pl-c1"&gt;1024&lt;/span&gt;):
                &lt;span class="pl-s1"&gt;f&lt;/span&gt;.&lt;span class="pl-en"&gt;write&lt;/span&gt;(&lt;span class="pl-s1"&gt;chunk&lt;/span&gt;)
        &lt;span class="pl-en"&gt;print&lt;/span&gt;(&lt;span class="pl-s"&gt;f"File downloaded successfully at &lt;span class="pl-s1"&gt;&lt;span class="pl-kos"&gt;{&lt;/span&gt;&lt;span class="pl-s1"&gt;save_path&lt;/span&gt;&lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;/span&gt;"&lt;/span&gt;)
    &lt;span class="pl-k"&gt;else&lt;/span&gt;:
        &lt;span class="pl-en"&gt;print&lt;/span&gt;(&lt;span class="pl-s"&gt;f"Failed to download file from &lt;span class="pl-s1"&gt;&lt;span class="pl-kos"&gt;{&lt;/span&gt;&lt;span class="pl-s1"&gt;url&lt;/span&gt;&lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;/span&gt;. Status code: &lt;span class="pl-s1"&gt;&lt;span class="pl-kos"&gt;{&lt;/span&gt;&lt;span class="pl-s1"&gt;response&lt;/span&gt;.&lt;span class="pl-s1"&gt;status_code&lt;/span&gt;&lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;/span&gt;"&lt;/span&gt;)&lt;/pre&gt;
&lt;p&gt;To use this function, simply call it with the URL of the file you want to download and the local path where you want to save it. For example:&lt;/p&gt;
&lt;pre&gt;&lt;span class="pl-en"&gt;download_file&lt;/span&gt;(&lt;span class="pl-s"&gt;'https://example.com/myfile.txt'&lt;/span&gt;, &lt;span class="pl-s"&gt;'/path/to/save/myfile.txt'&lt;/span&gt;)&lt;/pre&gt;
&lt;p&gt;Note that this function uses the &lt;code&gt;stream=True&lt;/code&gt; parameter in the &lt;code&gt;requests.get()&lt;/code&gt; method to download the file in chunks, which can help prevent memory issues when downloading large files. It also checks the HTTP status code of the response before saving the file, so you can handle any errors that might occur during the download.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;That's a very solid reply!&lt;/p&gt;
&lt;h4 id="mistral-7b-local"&gt;Mistral 7B via llm-llama-cpp or llm-gpt4all or llm-mlc&lt;/h4&gt;
&lt;p&gt;The smaller Mistral 7B model dropped back in September. It's since established itself as the most capable model family of that size - a size which is very convenient for running on personal devices.&lt;/p&gt;
&lt;p&gt;I'm even running Mistral 7B on my iPhone now, thanks to an update to the &lt;a href="https://apps.apple.com/us/app/mlc-chat/id6448482937"&gt;MLC Chat iOS app&lt;/a&gt; from a few days ago.&lt;/p&gt;
&lt;p&gt;There are a bunch of different options for running this model and its variants locally using LLM on a Mac - and probably other platforms too, though I've not tested these options myself on Linux or Windows:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Using &lt;a href="https://github.com/simonw/llm-llama-cpp"&gt;llm-llama-cpp&lt;/a&gt;: download one of &lt;a href="https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.1-GGUF"&gt;these Mistral-7B-Instruct GGUF files&lt;/a&gt; for the chat-tuned version, or &lt;a href="https://huggingface.co/TheBloke/Mistral-7B-v0.1-GGUF/tree/main"&gt;one of these&lt;/a&gt; for base Mistral, then follow the steps listed above&lt;/li&gt;
&lt;li&gt;Using &lt;a href="https://github.com/simonw/llm-gpt4all"&gt;llm-gpt4all&lt;/a&gt;. This is the easiest plugin to install:
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm install llm-gpt4all&lt;/pre&gt;&lt;/div&gt;
The model will be downloaded the first time you try to use it:
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm -m mistral-7b-instruct-v0 &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;Introduce yourself&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;/li&gt;
&lt;li&gt;Using &lt;a href="https://github.com/simonw/llm-mlc"&gt;llm-mlc&lt;/a&gt;. Follow the instructions in the README to install it, then:
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; Download the model:&lt;/span&gt;
llm mlc download-model https://huggingface.co/mlc-ai/mlc-chat-Mistral-7B-Instruct-v0.2-q3f16_1
&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; Run it like this:&lt;/span&gt;
llm -m mlc-chat-Mistral-7B-Instruct-v0.2-q3f16_1 &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;Introduce yourself&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Each of these options works, but I've not yet spent time comparing them in terms of output quality or performance.&lt;/p&gt;
&lt;h4 id="mistral-api"&gt;Using the Mistral API, which includes the new Mistral-medium&lt;/h4&gt;
&lt;p&gt;Mistral also recently announced &lt;a href="https://mistral.ai/news/la-plateforme/"&gt;La plateforme&lt;/a&gt;, their early access API for calling hosted versions of their models.&lt;/p&gt;
&lt;p&gt;Their new API renames the Mistral 7B model to "Mistral-tiny" and the new Mixtral model to "Mistral-small"... and offers something called &lt;strong&gt;Mistral-medium&lt;/strong&gt; as well:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Our highest-quality endpoint currently serves a prototype model, that is currently among the top serviced models available based on standard benchmarks. It masters English/French/Italian/German/Spanish and code and obtains a score of 8.6 on MT-Bench.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I got access to their API and used it to build a new plugin, &lt;a href="https://github.com/simonw/llm-mistral"&gt;llm-mistral&lt;/a&gt;. Here's how to use that:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Install it:
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm install llm-mistral&lt;/pre&gt;&lt;/div&gt;
&lt;/li&gt;
&lt;li&gt;Set your Mistral API key:
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm keys &lt;span class="pl-c1"&gt;set&lt;/span&gt; mistral
&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; &amp;lt;paste key here&amp;gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;/li&gt;
&lt;li&gt;Run the models like this:
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm -m mistral-tiny &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;Say hi&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; Or mistral-small or mistral-medium&lt;/span&gt;
cat mycode.py &lt;span class="pl-k"&gt;|&lt;/span&gt; llm -m mistral-medium -s &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;Explain this code&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Here's their comparison table pitching Mistral Small and Medium against GPT-3.5:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2023/mistral-table.jpg" alt="MMLU (MCQ in 57 subjects): GPT-3.5 scored 70%, Mistral Small scored 70.6%, Mistral Medium scored 75.3%. HellaSwag (10-shot): GPT-3.5 scored 85.5%, Mistral Small scored 86.7%, Mistral Medium scored 88%. ARC Challenge (25-shot): GPT-3.5 scored 85.2%, Mistral Small scored 85.8%, Mistral Medium scored 89.9%. WinoGrande (5-shot): GPT-3.5 scored 81.6%, Mistral Small scored 81.2%, Mistral Medium scored 88%. MBPP (pass@1): GPT-3.5 scored 52.2%, Mistral Small scored 60.7%, Mistral Medium scored 62.3%. GSM-8K (5-shot): GPT-3.5 scored 57.1%, Mistral Small scored 58.4%, Mistral Medium scored 66.7%. MT Bench (for Instruct models): GPT-3.5 scored 8.32, Mistral Small scored 8.30, Mistral Medium scored 8.61." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;These may well be cherry-picked, but note that Small beats GPT-3.5 on almost every metric, and Medium beats it on everything by a wider margin.&lt;/p&gt;
&lt;p&gt;Here's the &lt;a href="https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard"&gt;MT Bench leaderboard&lt;/a&gt; which includes scores for GPT-4 and Claude 2.1:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2023/mt-bench.jpg" alt="GPT-4-Turbo: Arena Elo rating 1217, MT-bench score 9.32. GPT-4-0613: Arena Elo rating 1152, MT-bench score 9.18. GPT-4-0314: Arena Elo rating 1201, MT-bench score 8.96. GPT-3.5-turbo-0613: Arena Elo rating 1112, MT-bench score 8.39. GPT-3.5-Turbo-1106: Arena Elo rating 1074, MT-bench score 8.32. Claude-2.1: Arena Elo rating 1118, MT-bench score 8.18." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;That 8.61 score for Medium puts it halfway between GPT-3.5 and GPT-4.&lt;/p&gt;
&lt;p&gt;Benchmark scores are no replacement for spending time with a model to get a feel for how well it behaves across a wide spectrum of tasks, but these scores are extremely promising. GPT-4 may not hold the best model crown for much longer.&lt;/p&gt;
&lt;h4 id="mistral-other-apis"&gt;Mistral via other API providers&lt;/h4&gt;
&lt;p&gt;Since both Mistral 7B and Mixtral 8x7B are available under an Apache 2 license, there's been something of a race to the bottom in terms of pricing from other LLM hosting providers.&lt;/p&gt;
&lt;p&gt;This trend makes me a little nervous, since it actively disincentivizes future open model releases from Mistral and from other providers who are hoping to offer their own hosted versions.&lt;/p&gt;
&lt;p&gt;LLM has plugins for a bunch of these providers already. The three that I've tried so far are Replicate, Anyscale Endpoints and OpenRouter.&lt;/p&gt;
&lt;p&gt;For &lt;a href="https://replicate.com/"&gt;Replicate&lt;/a&gt; using &lt;a href="https://github.com/simonw/llm-replicate"&gt;llm-replicate&lt;/a&gt;:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm install llm-replicate
llm keys &lt;span class="pl-c1"&gt;set&lt;/span&gt; replicate
&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; &amp;lt;paste API key here&amp;gt;&lt;/span&gt;
llm replicate add mistralai/mistral-7b-v0.1&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Then run prompts like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm -m replicate-mistralai-mistral-7b-v0.1 &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;3 reasons to get a pet weasel:&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This example is the non-instruct tuned model, so the prompt needs to be shaped such that the model can complete it.&lt;/p&gt;
&lt;p&gt;For &lt;a href="https://www.anyscale.com/endpoints"&gt;Anyscale Endpoints&lt;/a&gt; using &lt;a href="https://github.com/simonw/llm-anyscale-endpoints"&gt;llm-anyscale-endpoints&lt;/a&gt;:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm install llm-anyscale-endpoints
llm keys &lt;span class="pl-c1"&gt;set&lt;/span&gt; anyscale-endpoints
&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; &amp;lt;paste API key here&amp;gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Now you can run both the 7B and the Mixtral 8x7B models:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm -m mistralai/Mixtral-8x7B-Instruct-v0.1 \
  &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;3 reasons to get a pet weasel&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;
llm -m mistralai/Mistral-7B-Instruct-v0.1 \
  &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;3 reasons to get a pet weasel&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;And for &lt;a href="https://openrouter.ai/"&gt;OpenRouter&lt;/a&gt; using &lt;a href="https://github.com/simonw/llm-openrouter"&gt;llm-openrouter&lt;/a&gt;:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm install llm-openrouter
llm keys &lt;span class="pl-c1"&gt;set&lt;/span&gt; openrouter
&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; &amp;lt;paste API key here&amp;gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Then run the models like so:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm -m openrouter/mistralai/mistral-7b-instruct \
  &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;2 reasons to get a pet dragon&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;
llm -m openrouter/mistralai/mixtral-8x7b-instruct \
  &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;2 reasons to get a pet dragon&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;OpenRouter are currently offering Mistral and Mixtral via their API for $0.00/1M input tokens - it's free! Obviously not sustainable, so don't rely on that continuing, but that does make them a great platform for running some initial experiments with these models.&lt;/p&gt;
&lt;h4 id="llamafile-openai"&gt;Using Llamafile's OpenAI API endpoint&lt;/h4&gt;
&lt;p&gt;I &lt;a href="https://simonwillison.net/2023/Nov/29/llamafile/"&gt;wrote about Llamafile&lt;/a&gt; recently, a fascinating option for running LLMs where the LLM is bundled up in an executable that includes everything needed to run it, on multiple platforms.&lt;/p&gt;
&lt;p&gt;Justine Tunney released &lt;a href="https://huggingface.co/jartine/Mixtral-8x7B-v0.1.llamafile/tree/main"&gt;llamafiles for Mixtral&lt;/a&gt; a few days ago.&lt;/p&gt;
&lt;p&gt;The &lt;a href="https://huggingface.co/jartine/Mixtral-8x7B-v0.1.llamafile/blob/main/mixtral-8x7b-instruct-v0.1.Q5_K_M-server.llamafile"&gt;mixtral-8x7b-instruct-v0.1.Q5_K_M-server.llamafile&lt;/a&gt; one runs an OpenAI-compatible API endpoint which LLM can talk to.&lt;/p&gt;
&lt;p&gt;Here's how to use that:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Download the llamafile:
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;curl -LO https://huggingface.co/jartine/Mixtral-8x7B-v0.1.llamafile/resolve/main/mixtral-8x7b-instruct-v0.1.Q5_K_M-server.llamafile&lt;/pre&gt;&lt;/div&gt;
&lt;/li&gt;
&lt;li&gt;Start that running:
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;./mixtral-8x7b-instruct-v0.1.Q5_K_M-server.llamafile&lt;/pre&gt;&lt;/div&gt;
You may need to &lt;code&gt;chmod 755 mixtral-8x7b-instruct-v0.1.Q5_K_M-server.llamafile&lt;/code&gt; it first, but I found I didn't need to.&lt;/li&gt;
&lt;li&gt;Configure LLM to know about that endpoint, by adding the following to a file at &lt;code&gt;~/Library/Application Support/io.datasette.llm/extra-openai-models.yaml&lt;/code&gt;:
&lt;div class="highlight highlight-source-yaml"&gt;&lt;pre&gt;- &lt;span class="pl-ent"&gt;model_id&lt;/span&gt;: &lt;span class="pl-s"&gt;llamafile&lt;/span&gt;
  &lt;span class="pl-ent"&gt;model_name&lt;/span&gt;: &lt;span class="pl-s"&gt;llamafile&lt;/span&gt;
  &lt;span class="pl-ent"&gt;api_base&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;http://127.0.0.1:8080/v1&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
This registers a model called &lt;code&gt;llamafile&lt;/code&gt; which you can now call like this:
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm -m llamafile &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;Say hello to the world&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Setting up that &lt;code&gt;llamafile&lt;/code&gt; alias means you'll be able to use the same CLI invocation for any llamafile models you run on that default 8080 port.&lt;/p&gt;
&lt;p&gt;The same exact approach should work for other model hosting options that provide an endpoint that imitates the OpenAI API.&lt;/p&gt;
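&lt;p&gt;You can also skip LLM entirely and talk to one of these OpenAI-imitating endpoints directly using just the Python standard library. This sketch targets the llamafile server's default port - the &lt;code&gt;/v1/chat/completions&lt;/code&gt; path and response shape follow the OpenAI convention, and most local servers ignore the &lt;code&gt;model&lt;/code&gt; field:&lt;/p&gt;

```python
import json
import urllib.request


def build_chat_request(prompt: str, base_url: str = "http://127.0.0.1:8080/v1") -> urllib.request.Request:
    """Build an OpenAI-style chat completion request for a local llamafile server."""
    payload = {
        "model": "llamafile",  # most local servers ignore this field
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )


def chat(prompt: str) -> str:
    """POST the prompt and pull the reply out of the OpenAI-shaped response."""
    with urllib.request.urlopen(build_chat_request(prompt)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


if __name__ == "__main__":
    print(chat("Say hello to the world"))
```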
&lt;h4&gt;This is LLM plugins working as intended&lt;/h4&gt;
&lt;p&gt;When I &lt;a href="https://simonwillison.net/2023/Jul/12/llm/"&gt;added plugin support to LLM&lt;/a&gt; this was exactly what I had in mind: I want it to be as easy as possible to add support for new models, both local and remotely hosted.&lt;/p&gt;
&lt;p&gt;The &lt;a href="https://llm.datasette.io/en/stable/plugins/directory.html"&gt;LLM plugin directory&lt;/a&gt; lists 19 plugins in total now.&lt;/p&gt;
&lt;p&gt;If you want to build your own plugin - for a locally hosted model or for one exposed via a remote API - the &lt;a href="https://llm.datasette.io/en/stable/plugins/tutorial-model-plugin.html"&gt;plugin author tutorial&lt;/a&gt; (plus reviewing code from the existing plugins) should hopefully provide everything you need.&lt;/p&gt;
&lt;p&gt;You're also welcome to join us in the &lt;a href="https://datasette.io/discord-llm"&gt;#llm Discord channel&lt;/a&gt; to talk about your plans for your project.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/cli"&gt;cli&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/plugins"&gt;plugins&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/mistral"&gt;mistral&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llamafile"&gt;llamafile&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llama-cpp"&gt;llama-cpp&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openrouter"&gt;openrouter&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="cli"/><category term="plugins"/><category term="projects"/><category term="ai"/><category term="generative-ai"/><category term="local-llms"/><category term="llms"/><category term="llm"/><category term="mistral"/><category term="llamafile"/><category term="llama-cpp"/><category term="openrouter"/></entry></feed>