Simon Willison's Weblog: long-context

1M context is now generally available for Opus 4.6 and Sonnet 4.6

2026-03-13T18:29:13+00:00

Here's what surprised me:

Standard pricing now applies across the full 1M window for both models, with no long-context premium.

OpenAI and Gemini both charge more for prompts where the token count goes above a certain point - 200,000 for Gemini 3.1 Pro and 272,000 for GPT-5.4.

Tags: ai, generative-ai, llms, anthropic, claude, llm-pricing, long-context

Claude Sonnet 4 now supports 1M tokens of context

2025-08-12T18:14:30+00:00

Gemini and OpenAI both have million token models, so it's good to see Anthropic catching up. This is 5x the previous 200,000 context length limit of the various Claude Sonnet models.

Anthropic have previously made 1 million tokens available to select customers. From the Claude 3 announcement in March 2024:

The Claude 3 family of models will initially offer a 200K context window upon launch. However, all three models are capable of accepting inputs exceeding 1 million tokens and we may make this available to select customers who need enhanced processing power.

This is also the first time I've seen Anthropic use prices that vary depending on context length:

Prompts ≤ 200K: $3/million input, $15/million output
Prompts > 200K: $6/million input, $22.50/million output

Gemini have been doing this for a while: Gemini 2.5 Pro is $1.25/$10 below 200,000 tokens and $2.50/$15 above 200,000.
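The threshold pricing above can be sketched as a small cost estimator. This is a minimal sketch using only the Sonnet 4 prices quoted in this post. It assumes the higher rate applies to the whole request once the prompt exceeds 200,000 tokens, which is worth confirming against Anthropic's pricing documentation. The function and constant names are illustrative.

```python
# Sonnet 4 prices quoted above, in dollars per million tokens.
THRESHOLD = 200_000
STANDARD = {"input": 3.00, "output": 15.00}
LONG_CONTEXT = {"input": 6.00, "output": 22.50}


def sonnet4_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost of one request in dollars.

    Assumption: once the prompt exceeds the threshold, the higher
    rate applies to every token in the request, not just the excess.
    """
    rates = LONG_CONTEXT if input_tokens > THRESHOLD else STANDARD
    return (
        input_tokens * rates["input"] + output_tokens * rates["output"]
    ) / 1_000_000


print(f"${sonnet4_cost(150_000, 2_000):.4f}")  # below the threshold
print(f"${sonnet4_cost(500_000, 2_000):.4f}")  # above the threshold
```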

Here's Anthropic's full documentation on the 1m token context window. You need to send a context-1m-2025-08-07 beta header in your request to enable it.

Note that this is currently restricted to "tier 4" users who have purchased at least $400 in API credits:

Long context support for Sonnet 4 is now in public beta on the Anthropic API for customers with Tier 4 and custom rate limits, with broader availability rolling out over the coming weeks.

Via @claudeai

Tags: ai, generative-ai, llms, anthropic, claude, llm-pricing, long-context

AbsenceBench: Language Models Can't Tell What's Missing

2025-06-20T23:15:04+00:00

Here's another interesting result to file under the "jagged frontier" of LLMs, where their strengths and weaknesses are often unintuitive.

Long context models have been getting increasingly good at passing "Needle in a Haystack" tests recently, but what about a problem in the opposite direction?

This paper explores what happens when you give a model some content and then a copy with a portion removed, then ask what changed.
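To make the task concrete, here is a minimal sketch of how a test case of this shape could be built, assuming line-level omissions. It is not the paper's actual harness, and the function names are made up for illustration. A conventional sequence diff recovers the removed lines trivially, which is the contrast with the model results.

```python
import difflib
import random


def make_absence_task(lines, n_remove, seed=0):
    """Return (modified_lines, removed_lines) for a list of lines."""
    rng = random.Random(seed)
    removed_idx = set(rng.sample(range(len(lines)), n_remove))
    modified = [l for i, l in enumerate(lines) if i not in removed_idx]
    removed = [lines[i] for i in sorted(removed_idx)]
    return modified, removed


def find_missing(original, modified):
    """Recover removed lines with a conventional sequence diff."""
    matcher = difflib.SequenceMatcher(a=original, b=modified, autojunk=False)
    missing = []
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if tag in ("delete", "replace"):
            missing.extend(original[i1:i2])
    return missing


# A numeric sequence in the style of 117,121,125,...
original = [str(117 + 4 * i) for i in range(20)]
modified, removed = make_absence_task(original, n_remove=5)
assert find_missing(original, modified) == removed
```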

Here's a truncated table of results from the paper:

Models	Poetry	Sequences	GitHub PRs	Average
Gemini-2.5-flash`*`	87.3	95.4	30.9	71.2
Claude-3.7-Sonnet`*`	72.7	96.0	40.0	69.6
Claude-3.7-Sonnet	73.5	91.4	35.7	66.9
Gemini-2.5-flash	79.3	85.2	26.2	63.6
o3-mini`*`	65.0	78.1	38.9	60.7
GPT-4.1	54.3	57.5	36.2	49.3
...	...	...	...	...
DeepSeek-R1`*`	38.7	29.5	23.1	30.4
Qwen3-235B`*`	26.1	18.5	24.6	23.1
Mixtral-8x7B-Instruct	4.9	21.9	17.3	14.7

* indicates a reasoning model. Sequences are lists of numbers like 117,121,125,129,133,137, Poetry consists of 100-1000 line portions from the Gutenberg Poetry Corpus and PRs are diffs with 10 to 200 updated lines.

The strongest models do well at numeric sequences, adequately at the poetry challenge and really poorly with those PR diffs. Reasoning models do slightly better at the cost of burning through a lot of reasoning tokens - often more than the length of the original document.

The paper authors - Harvey Yiyun Fu, Aryan Shrivastava, Jared Moore, Peter West, Chenhao Tan and Ari Holtzman - have a hypothesis as to what's going on here:

We propose an initial hypothesis explaining this behavior: identifying presence is simpler than absence with the attention mechanisms underlying Transformers (Vaswani et al., 2017). Information included in a document can be directly attended to, while the absence of information cannot.

Via Hacker News

Tags: ai, generative-ai, llms, evals, llm-reasoning, long-context, paper-review

Quoting Workaccount2 on Hacker News

2025-06-18T23:15:12+00:00

They poison their own context. Maybe you can call it context rot, where as context grows and especially if it grows with lots of distractions and dead ends, the output quality falls off rapidly. Even with good context the rot will start to become apparent around 100k tokens (with Gemini 2.5).

They really need to figure out a way to delete or "forget" prior context, so the user or even the model can go back and prune poisonous tokens.

Right now I work around it by regularly making summaries of instances, and then spinning up a new instance with fresh context and feed in the summary of the previous instance.

— Workaccount2 on Hacker News, coining "context rot"
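The summarize-and-restart workaround described in the quote could look something like this. It is a rough sketch, not anyone's actual tooling: `complete` is a hypothetical stand-in for whatever function sends a prompt to a model and returns text, and the character budget is an assumed, crude proxy for a token limit.

```python
from typing import Callable, List


class RollingSession:
    """Restart from a summary once the transcript gets too long."""

    def __init__(self, complete: Callable[[str], str], max_chars: int = 400_000):
        self.complete = complete    # hypothetical prompt -> text function
        self.max_chars = max_chars  # assumed budget, a crude proxy for tokens
        self.history: List[str] = []

    def ask(self, question: str) -> str:
        if sum(len(turn) for turn in self.history) > self.max_chars:
            self._compact()
        prompt = "\n\n".join(self.history + [f"User: {question}"])
        answer = self.complete(prompt)
        self.history += [f"User: {question}", f"Assistant: {answer}"]
        return answer

    def _compact(self) -> None:
        transcript = "\n\n".join(self.history)
        summary = self.complete(
            "Summarize the key facts and decisions so far, "
            "leaving out dead ends:\n\n" + transcript
        )
        # The fresh context starts from the summary alone.
        self.history = [f"Summary of earlier conversation: {summary}"]
```

Everything before the summary is discarded, so anything the summary leaves out is gone for good.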

Tags: definitions, ai, generative-ai, llms, long-context

Building software on top of Large Language Models

2025-05-15T12:25:54+00:00

I presented a three hour workshop at PyCon US yesterday titled Building software on top of Large Language Models. The goal of the workshop was to give participants everything they needed to get started writing code that makes use of LLMs.

Most of the workshop was interactive: I created a detailed handout with six different exercises, then worked through them with the participants. You can access the handout here - it should be comprehensive enough that you can follow along even without having been present in the room.

Here's the table of contents for the handout:

Setup - getting LLM and related tools installed and configured for accessing the OpenAI API
Prompting with LLM - basic prompting in the terminal, including accessing logs of past prompts and responses
Prompting from Python - how to use LLM's Python API to run prompts against different models from Python code
Building a text to SQL tool - the first building exercise: prototype a text to SQL tool with the LLM command-line app, then turn that into Python code.
Structured data extraction - possibly the most economically valuable application of LLMs today
Semantic search and RAG - working with embeddings, building a semantic search engine
Tool usage - the most important technique for building interesting applications on top of LLMs. My LLM tool gained tool usage in an alpha release just the night before the workshop!

Some sections of the workshop involved me talking and showing slides. I've gathered those together into an annotated presentation below.

The workshop was not recorded, but hopefully these materials can provide a useful substitute. If you'd like me to present a private version of this workshop for your own team please get in touch!

The full handout for the workshop parts of this talk can be found at building-with-llms-pycon-2025.readthedocs.io.

I recommended that anyone who didn't have a stable Python 3 environment that they could install packages into should use Codespaces instead, using github.com/pamelafox/python-3.13-playground.

I used this myself throughout the presentation. I really like Codespaces for workshops as it removes any risk of broken environments spoiling the experience for someone: if your Codespace breaks you can throw it away and click the button to get a new one.

I started out with a short review of the landscape as I see it today.

If you have limited attention, I think these are the three to focus on.

OpenAI created the space and are still innovating on a regular basis - their GPT 4.1 family is just a month old and is currently one of my favourite balances of power to cost. o4-mini is an excellent reasoning model, especially for its price.

Gemini started producing truly outstanding models with the 1.5 series, and 2.5 may be the best available models for a wide range of purposes.

Anthropic's Claude has long been one of my favourite models. I'm looking forward to their next update.

There are a wide range of "open weights" (usually a more accurate term than "open source") models available, and they've been getting really good over the past six months. These are the model families I've been particularly impressed by. All of these include models I have successfully run on my 64GB M2 laptop.

I wrote about this in my review of LLMs in 2024: 18 labs have now produced what I would consider a GPT-4 class model, and there may well be some that I've missed.

These models can "see" now - their vision input has gotten really good. The Gemini family can handle audio and video input too.

We're beginning to see audio and image output start to emerge - OpenAI have been a leader here, but Gemini offers this too and other providers are clearly working in the same direction. Qwen have an open weights model for this, Qwen 2.5 Omni (audio output).

The point here is really that we are spoiled for choice when it comes to models. The rate at which new ones are released is somewhat bewildering.

The models have got so cheap. By my estimate the total cost to generate ~100 token descriptions of all 70,000 images in my personal photo library with Gemini 1.5 Flash 8B is 173.25 cents.

... there are some expensive models too! The same 70,000 images through GPT-4.5, priced at $75/million input tokens, would cost at least $2,400.

Though honestly if you had told me a few years ago that I could get descriptions for 70,000 photos for $2,400 I would still have been pretty impressed.

I've heard from sources I trust that Gemini and AWS (for their Nova series, priced similar to Gemini models) are not charging less per prompt than the energy it costs to serve them.

This makes the prompt pricing one of the better signals we have as to the environmental impact of running those prompts.

I've seen estimates that training costs, amortized over time, likely add 10-15% to that cost - so it's still a good hint at the overall energy usage.

Ethan Mollick coined the term "jagged frontier" to describe the challenge of figuring out what these models are useful for. They're great at some things, terrible at others but it's very non-obvious which things are which!

My recommendation is to try them out. Keep throwing things at them, including things you're sure they won't be able to handle. Their failure patterns offer useful lessons.

If a model can't do something it's good to tuck that away and try it again in six months - you may find that the latest generation of models can solve a new problem for you.

As the author of an abstraction toolkit across multiple models (LLM) I'm biased towards arguing it's good to be able to switch between them, but I genuinely believe it's a big advantage to be able to do so.

At this point we started working through these sections of the handout:

Setup - getting LLM installed and configured
Prompting with LLM - running prompts in the terminal, accessing logs, piping in content, using system prompts and attachments and fragments.
Building a text to SQL tool - building a system on top of LLMs that can take a user's question and turn it into a SQL query based on the database schema
Structured data extraction - possibly the most economically valuable application of LLMs right now: using them for data entry from unstructured or messy sources

When we got to the Semantic search and RAG section I switched back to slides to provide a little bit of background on vector embeddings.

This explanation was adapted from my PyBay workshop and article Embeddings: What they are and why they matter

The key thing to understand about vector embeddings is that they are a technique for taking a chunk of text and turning that into a fixed length sequence of floating point numbers that attempt to capture something about the semantic meaning of that text.

These vectors are interesting purely because they let us see what else is nearby in weird 1536-dimension space.

If it was 3 dimensions we'd find it a lot easier to visualize!
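As a toy illustration of "what else is nearby", here is a minimal sketch using cosine similarity over made-up 3-dimensional vectors. Real embeddings come from an embedding model and have far more dimensions. The document names and numbers below are invented for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(query, vectors, k=2):
    """Return the k document IDs whose vectors are most similar to query."""
    ranked = sorted(
        vectors,
        key=lambda doc_id: cosine_similarity(query, vectors[doc_id]),
        reverse=True,
    )
    return ranked[:k]


# Made-up 3-dimensional vectors standing in for real embeddings.
vectors = {
    "pelicans": [0.9, 0.1, 0.0],
    "seagulls": [0.8, 0.3, 0.1],
    "sqlite": [0.0, 0.2, 0.9],
}
print(nearest([1.0, 0.2, 0.0], vectors))
```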

My TIL website uses vector embeddings for related content, and it often works really well.

I wrote about how that's implemented in a TIL, Storing and serving related documents with openai-to-sqlite and embeddings.

This is also a key method for implementing semantic search - search which returns documents that are related to the user's search term even if none of the keywords were an exact match.

One way to do this is to embed the user's search term and find similar documents - but this doesn't always work great, since a short question might not end up in the same location as a much longer article.

There are neat tricks here that can help.

Some models allow you to embed questions and answers in different ways that cause them to end up closer to each other. Nomic Embed Text v2 is a recent example.

A neat trick is you can ask an LLM to entirely synthesize a potential answer to the user's question - then embed that artificial answer and find your own content that's nearby in vector space!
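The synthesized-answer trick can be sketched in a few lines. This assumes two hypothetical callables that are not part of any particular library: `generate`, which sends a prompt to an LLM and returns text, and `embed`, which turns text into a vector. `index` is assumed to map document IDs to precomputed vectors.

```python
import math
from typing import Callable, Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms


def search_via_synthetic_answer(
    question: str,
    generate: Callable[[str], str],
    embed: Callable[[str], Sequence[float]],
    index: Dict[str, Sequence[float]],
    k: int = 3,
) -> List[str]:
    # 1. Have the model invent a plausible answer; it may well be wrong.
    fake_answer = generate(f"Write a short passage answering: {question}")
    # 2. Embed the invented answer rather than the short question.
    query_vector = embed(fake_answer)
    # 3. Return the real documents closest to it in vector space.
    ranked = sorted(index, key=lambda d: cosine(query_vector, index[d]), reverse=True)
    return ranked[:k]
```

The invented answer is only used as a search key and is never shown to the user.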

We worked through the next section of the workshop together:

Semantic search and RAG - we gathered embeddings for Python PEPs and built a semantic search engine against them using LLM's command-line utilities and a Bash script.

I described RAG - Retrieval-Augmented Generation - the pattern where you try to find documents relevant to the user's question and dump those into the prompt.

I emphasized that RAG doesn't have to use embeddings: you can build a great RAG system on top of full-text keyword-based search as well. You can also combine the two in a hybrid search system.
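One common way to combine keyword and vector results is reciprocal rank fusion. The workshop did not specify a particular method, so treat this as one possible sketch rather than the approach that was taught. The result lists are made up, and `k=60` is a conventional default rather than a tuned value.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of document IDs into one ranking."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Documents near the top of any list earn the most credit.
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


# Made-up result lists from a keyword search and a vector search.
keyword_hits = ["pep-0008", "pep-0020", "pep-0484"]
vector_hits = ["pep-0484", "pep-0008", "pep-0526"]
print(reciprocal_rank_fusion([keyword_hits, vector_hits]))
```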

I argued that every time a new long context model comes out people inevitably argue that "RAG is dead". I don't think long context will ever kill RAG - no matter how long your context you'll still have more data than can fit in it. But I do think that LLM tool calling for search, especially as demonstrated by o3 and o4-mini, is a better approach than classic RAG where you try once to find relevant documents and then attempt to answer the question without looping.

The next section of the workshop had no slides but is the section I was most excited about: tool usage.

I talked through the pattern where LLMs can call tools and introduced the brand new tool functionality I added to LLM in the latest alpha release.

I can't talk about LLMs calling tools without getting into prompt injection and LLM security.

See Prompt injection and jailbreaking are not the same thing for an expanded version of this argument.

This is still my favorite jailbreak of all time - the Grandma who worked in a napalm factory attack. It's a couple of years old now so it probably doesn't work any more.

Jailbreaking is about attacking a model. The models aren't supposed to tell you how to create napalm. It's on the model providers - OpenAI, Anthropic, Gemini - to prevent them from doing that.

Prompt injection attacks are against the applications that we are building on top of LLMs. That's why I care about them so much.

Prompt injection explained, with video, slides, and a transcript is a longer explanation of this attack.

Having just talked about LLMs with tools, prompt injection is even more important to discuss.

If tools can do things on your behalf, it's vitally important that an attacker can't sneak some instructions to your LLM assistant such that it does things on their behalf instead.

Here's a classic hypothetical challenge. If I have an AI assistant called Marvin who can interact with my emails on my behalf, what's to stop it from acting on an email that an attacker sends it telling it to steal my password resets?

We still don't have a great way to guarantee that this won't work!

Many people suggest AI-based filtering for these attacks that works 99% of the time.

In web application security 99% is not good enough. Imagine if we protected against SQL injection with an approach that failed 1 in 100 times?
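A quick back-of-the-envelope calculation shows why. This sketch assumes each attack attempt is independent and has the same chance of slipping past the filter, which is a simplification since real attackers adapt.

```python
def attack_success_probability(filter_accuracy: float, attempts: int) -> float:
    """Chance that at least one of `attempts` tries gets past the filter.

    Assumes every attempt is independent, which is a simplification.
    """
    return 1 - filter_accuracy ** attempts


for attempts in (1, 10, 100, 1000):
    p = attack_success_probability(0.99, attempts)
    print(f"{attempts:>5} attempts: {p:.1%}")
```

Under these assumptions a 99% filter gives an attacker roughly a 63% chance of at least one success after 100 attempts.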

I proposed a potential solution for this two years ago in The Dual LLM pattern for building AI assistants that can resist prompt injection.

The key idea is to have a privileged LLM that runs tools and interacts with the user but is never exposed to tokens from an untrusted source, and a quarantined LLM that sees that stuff and can perform actions such as summarization.

Untrusted tokens, or processed summaries of untrusted tokens, are never sent to the privileged LLM. It instead can handle variable names like SUMMARY1 and direct those to be shown to the user.
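The variable indirection could be sketched like this. It is a simplified illustration of the idea, not a complete or secure implementation. `quarantined_llm` is a hypothetical callable, and the `$SUMMARY1` placeholder syntax is an assumption made for the example.

```python
import re
from typing import Callable, Dict


class QuarantineController:
    """Non-LLM glue code that keeps untrusted text away from the privileged LLM."""

    def __init__(self, quarantined_llm: Callable[[str], str]):
        self.quarantined_llm = quarantined_llm  # hypothetical prompt -> text
        self.variables: Dict[str, str] = {}

    def summarize_untrusted(self, untrusted_text: str) -> str:
        """Run the quarantined LLM and return only an opaque variable name."""
        result = self.quarantined_llm("Summarize this:\n\n" + untrusted_text)
        name = f"SUMMARY{len(self.variables) + 1}"
        self.variables[name] = result
        return name  # the name is all the privileged LLM ever sees

    def render_for_user(self, template: str) -> str:
        """Substitute variable contents only at display time."""
        return re.sub(
            r"\$(SUMMARY\d+)",
            lambda m: self.variables.get(m.group(1), m.group(0)),
            template,
        )
```

The privileged LLM would be given the returned name and could produce output such as `Here is the summary: $SUMMARY1`, which the controller expands just before showing it to the user.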

Last month Google DeepMind put out a paper, Defeating Prompt Injections by Design, which offered the first approach to this problem that really looked to me like it might work.

I wrote more about this in CaMeL offers a promising new direction for mitigating prompt injection attacks.

I'm biased though, because the paper explained a much improved and expanded version of my Dual LLMs pattern.

I'm also delighted that the sentence "Is Dual LLM of Willison enough?" showed up in a paper from DeepMind!

(Spoiler: it was not enough.)

Evals are the LLM equivalent of unit tests: automated tests that help you tell how well your system is working.

Unfortunately LLMs are non-deterministic, so traditional unit tests don't really work.

If you're lucky you might be able to develop a suite of questions that can be evaluated on correct or incorrect answers - examples of emails that should be flagged as spam, for example.
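For the correct-or-incorrect case, a tiny harness is enough to get started. This is a minimal sketch with made-up examples. `keyword_baseline` is a stand-in for a call to an LLM-backed classifier, used here only so the example runs on its own.

```python
def run_eval(classify, cases):
    """Return (accuracy, failures) for a labeled set of cases."""
    failures = []
    for text, expected in cases:
        got = classify(text)
        if got != expected:
            failures.append((text, expected, got))
    accuracy = 1 - len(failures) / len(cases)
    return accuracy, failures


# Made-up examples; a real suite would have many more.
cases = [
    ("You have won a prize, click here", True),
    ("Lunch on Thursday?", False),
    ("Your invoice for March is attached", False),
    ("Claim your free prize now", True),
]


def keyword_baseline(text):
    # Stand-in for a call to an LLM-backed spam classifier.
    return "prize" in text.lower()


accuracy, failures = run_eval(keyword_baseline, cases)
print(f"accuracy={accuracy:.0%}, failures={failures}")
```

Re-running the same cases after each prompt or model change shows whether a tweak helped or hurt.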

More creative tasks are harder to evaluate. How can you tell if your LLM system that creates vegetarian cheesecake recipes is doing a good job? Or more importantly if tweaks you made to the prompt cause it to do a better or worse job?

LLM as a judge is a pattern that can help here - carefully prompting an LLM during your evaluation runs to help decide if an answer is better.

This whole area continues to be one of the hardest to crack - but also one of the most valuable. Having a great eval suite for your own application domain is a huge competitive advantage - it means you can adopt more models and iterate on your prompts with much more confidence.

I've collected a bunch of notes in my evals tag. I strongly recommend Hamel Husain's writing on this topic, in particular:

I finished the workshop by running a few demos of local models running on my machine using Ollama and the llm-ollama plugin. I showed mistral-small3.1 and qwen3:4b, an astonishingly capable model given its 2.6GB size on disk. I wrote more about Qwen 3 4B here.

If your company would like a private version of this workshop, delivered via Zoom/Google Chat/Teams/your conferencing app of choice, please get in touch. You can contact me at contact@simonwillison.net.

Tags: pycon, speaking, my-talks, ai, openai, generative-ai, local-llms, llms, embeddings, llm, anthropic, annotated-talks, gemini, vision-llms, llm-tool-use, llm-pricing, llm-reasoning, long-context

llm-fragments-github 0.2

2025-04-20T14:01:09+00:00

I upgraded my llm-fragments-github plugin to add a new fragment type called issue. It lets you pull the entire content of a GitHub issue thread into your prompt as a concatenated Markdown file.

(If you haven't seen fragments before I introduced them in Long context support in LLM 0.24 using fragments and template plugins.)

I used it just now to have Gemini 2.5 Pro provide feedback and attempt an implementation of a complex issue against my LLM project:

llm install llm-fragments-github
llm -f github:simonw/llm \
  -f issue:simonw/llm/938 \
  -m gemini-2.5-pro-exp-03-25 \
  --system 'muse on this issue, then propose a whole bunch of code to help implement it'

Here I'm loading the FULL content of the simonw/llm repo using that -f github:simonw/llm fragment (documented here), then loading all of the comments from issue 938 where I discuss quite a complex potential refactoring. I ask Gemini 2.5 Pro to "muse on this issue" and come up with some code.

This worked shockingly well. Here's the full response, which highlighted a few things I hadn't considered yet (such as the need to migrate old database records to the new tree hierarchy) and then spat out a whole bunch of code which looks like a solid start to the actual implementation work I need to do.

I ran this against Google's free Gemini 2.5 Preview, but if I'd used the paid model it would have used 202,680 input tokens, 10,460 output tokens and 1,859 thinking tokens for a total cost of 62.989 cents.

As a fun extra, the new issue: feature itself was written almost entirely by OpenAI o3, again using fragments. I ran this:

llm -m openai/o3 \
  -f https://raw.githubusercontent.com/simonw/llm-hacker-news/refs/heads/main/llm_hacker_news.py \
  -f https://raw.githubusercontent.com/simonw/tools/refs/heads/main/github-issue-to-markdown.html \
  -s 'Write a new fragments plugin in Python that registers issue:org/repo/123 which fetches that issue
      number from the specified github repo and uses the same markdown logic as the HTML page to turn that into a fragment'

Here I'm using the ability to pass a URL to -f and giving it the full source of my llm_hacker_news.py plugin (which shows how a fragment can load data from an API) plus the HTML source of my github-issue-to-markdown tool (which I wrote a few months ago with Claude). I effectively asked o3 to take that HTML/JavaScript tool and port it to Python to work with my fragments plugin mechanism.

o3 provided almost the exact implementation I needed, and even included support for a GITHUB_TOKEN environment variable without me thinking to ask for it. Total cost: 19.928 cents.
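To give a feel for what such a plugin has to do, here is a sketch of the argument-parsing half. This is not the code o3 produced and not the plugin's actual implementation. The helper names are hypothetical, and the URLs follow GitHub's public REST API for issues and issue comments.

```python
import os
from typing import Dict, Tuple


def parse_issue_argument(argument: str) -> Tuple[str, str, int]:
    """Split 'org/repo/123' into (org, repo, issue_number)."""
    parts = argument.strip("/").split("/")
    if len(parts) != 3 or not parts[2].isdigit():
        raise ValueError(f"Expected org/repo/number, got {argument!r}")
    org, repo, number = parts
    return org, repo, int(number)


def issue_api_urls(argument: str) -> Tuple[str, str]:
    """Return the GitHub REST API URLs for an issue and its comments."""
    org, repo, number = parse_issue_argument(argument)
    base = f"https://api.github.com/repos/{org}/{repo}/issues/{number}"
    return base, base + "/comments"


def auth_headers() -> Dict[str, str]:
    """Use GITHUB_TOKEN if it is set, otherwise make anonymous requests."""
    token = os.environ.get("GITHUB_TOKEN")
    return {"Authorization": f"Bearer {token}"} if token else {}


print(issue_api_urls("simonw/llm/938"))
```

A real fragment loader would still need to fetch both URLs, convert the JSON into Markdown and register itself with LLM's plugin hooks.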

On a final note of curiosity I tried running this prompt against Gemma 3 27B QAT running on my Mac via MLX and llm-mlx:

llm install llm-mlx
llm mlx download-model mlx-community/gemma-3-27b-it-qat-4bit

llm -m mlx-community/gemma-3-27b-it-qat-4bit \
  -f https://raw.githubusercontent.com/simonw/llm-hacker-news/refs/heads/main/llm_hacker_news.py \
  -f https://raw.githubusercontent.com/simonw/tools/refs/heads/main/github-issue-to-markdown.html \
  -s 'Write a new fragments plugin in Python that registers issue:org/repo/123 which fetches that issue
      number from the specified github repo and uses the same markdown logic as the HTML page to turn that into a fragment'

That worked pretty well too. It turns out a 16GB local model file is powerful enough to write me an LLM plugin now!

Tags: github, plugins, ai, generative-ai, local-llms, llms, ai-assisted-programming, llm, gemini, mlx, o3, long-context, gemma

Maybe Meta's Llama claims to be open source because of the EU AI act

2025-04-19T23:58:18+00:00

I encountered a theory a while ago that one of the reasons Meta insist on using the term “open source” for their Llama models despite the Llama license not actually conforming to the terms of the Open Source Definition is that the EU’s AI act includes special rules for open source models without requiring OSI compliance.

Since the EU AI act (12 July 2024) is available online I decided to take a look for myself.

Here’s one giant HTML page containing the full text of the act in English. I checked the token count with ttok (which uses the OpenAI tokenizer, but it’s close enough to work as a good estimate for other models):

curl 'https://eur-lex.europa.eu/legal-content/EN/TXT/HTML/?uri=OJ:L_202401689' | ttok

241,722 tokens. That should fit nicely into Gemini 2.5 Flash (or GPT-4.1 or Gemini 2.5 Pro).

My Gemini API key was playing up so I ran it via OpenRouter (and llm-openrouter) instead:

llm -f 'https://eur-lex.europa.eu/legal-content/EN/TXT/HTML/?uri=OJ:L_202401689' \
  -m openrouter/google/gemini-2.5-flash-preview:thinking \
  -s 'Summary of mentions of open source here, including what the document defines open source to mean'

Here's the full answer. Relevant extract:

Recital (89) states that third parties making accessible "tools, services, processes, or AI components other than general-purpose AI models" under a free and open-source licence should not be mandated to comply with upstream provider responsibilities towards those who integrate them. It also encourages developers of such resources to implement documentation practices like model cards and data sheets to promote trustworthy AI.

Recital (102) acknowledges that software and data, including models, released under a free and open-source licence can contribute to research and innovation and provide growth opportunities.

I tracked down Recital 89 and Recital 102 to check. Sure enough, 89 says:

Third parties making accessible to the public tools, services, processes, or AI components other than general-purpose AI models, should not be mandated to comply with requirements targeting the responsibilities along the AI value chain

And in 102:

General-purpose AI models released under free and open-source licences should be considered to ensure high levels of transparency and openness if their parameters, including the weights, the information on the model architecture, and the information on model usage are made publicly available. The licence should be considered to be free and open-source also when it allows users to run, copy, distribute, study, change and improve software and data, including models under the condition that the original provider of the model is credited, the identical or comparable terms of distribution are respected.

Gemini 2.5 Flash also pointed me to Article 2(12), which reads:

12. This Regulation does not apply to AI systems released under free and open-source licences, unless they are placed on the market or put into service as high-risk AI systems or as an AI system that falls under Article 5 or 50.

I asked a follow-up question:

llm -c 'so it says specific licenses but does it clarify which specific licenses count?'

And got back this response:

No, the document does not list specific names of licenses that qualify as "free and open-source".

Instead of listing names (like GPL, MIT, Apache, etc.), the Regulation defines what kind of terms a license must grant for it to be considered "free and open-source" for the purposes of the Regulation.

According to Recital (102), such a license must essentially allow software and data, including models, to be:

Openly shared.

Where users can freely access, use, modify and redistribute them or modified versions thereof.

The Recital further clarifies that a license can still be considered free and open-source even if it includes standard conditions like requiring attribution (crediting the original provider) or respecting identical or comparable terms of distribution.

I am very obviously not a lawyer, but based on the above I do find the theory credible that Meta are using "open source" in their marketing around Llama partly because the EU AI act has special exemptions for "open source" models that have nothing to do with the OSI definition of that term.

Total cost to use Gemini 2.5 Flash for this? 5.1 cents for my first question and 4.3 cents for the follow-up. That second question was cheaper even though it built on the first because output tokens are more expensive than input tokens and the second answer was shorter than the first. With the "thinking" model, output is charged at $3.50/million tokens while input is just $0.15/million.

Using an LLM as a lawyer is obviously a terrible idea, but using one to crunch through a giant legal document and form a very rough layman's understanding of what it says feels perfectly cromulent to me.

Update: Steve O'Grady points out that Meta/Facebook have been abusing the term "open source" for a lot longer than the EU AI act has been around - they were pulling shenanigans with a custom license for React back in 2017.

Tags: law, open-source, ai, generative-ai, llama, llms, llm, gemini, meta, long-context, ai-ethics, openrouter

GPT-4.1: Three new million token input models from OpenAI, including their cheapest model yet

2025-04-14T18:12:12+00:00

OpenAI introduced three new models this morning: GPT-4.1, GPT-4.1 mini and GPT-4.1 nano. These are API-only models right now, not available through the ChatGPT interface (though you can try them out in OpenAI's API playground). All three models can handle 1,047,576 tokens of input and 32,768 tokens of output, and all three have a May 31, 2024 cut-off date (their previous models were mostly September 2023).

The models score higher than GPT-4o and GPT-4.5 on coding benchmarks, and do very well on long context benchmarks as well. They also claim improvements in instruction following - following requested formats, obeying negative instructions, sorting output and obeying instructions to say "I don't know".

I released a new version of my llm-openai plugin supporting the new models. This is a new thing for the LLM ecosystem: previously OpenAI models were only supported in core, which meant I had to ship a full LLM release to add support for them.

You can run the new models like this:

llm install llm-openai-plugin -U
llm -m openai/gpt-4.1 "Generate an SVG of a pelican riding a bicycle"

The other model IDs are openai/gpt-4.1-mini and openai/gpt-4.1-nano.

Here's the pelican riding a bicycle I got from full sized GPT-4.1:

I'm particularly excited by GPT-4.1 nano, which handles image and text input up to a million tokens and is priced lower than any previous OpenAI model: $0.10/million for input and $0.40/million for output, less than the previous cheapest OpenAI model, GPT-4o-mini ($0.15/$0.60). I've updated my LLM pricing table to include the new models.

They're not the cheapest overall though: Gemini 2.0 Flash Lite, Gemini 1.5 Flash 8B, Amazon Nova Lite and Nova Micro, and Mistral's 3B, 8B and Small 3.1 hosted models remain less expensive.

As a demonstration of just how cheap GPT-4.1 nano is, consider this photograph I took of some pelicans:

I ran that through GPT-4.1 nano like this:

llm -m openai/gpt-4.1-nano describe \
  -a https://static.simonwillison.net/static/2025/two-pelicans.jpg

And got back this response:

The image shows two pelicans flying against a clear blue sky. They are positioned in flight, with their wings extended, showcasing their large wingspan and feather patterns. The pelican on the right appears slightly higher and facing slightly to the left, while the pelican on the left is oriented more horizontally with its beak pointing forward. Both birds display a combination of brown, tan, and white plumage, characteristic of pelicans, with their long beaks prominently visible.

Running llm logs -c --usage revealed that this used 1,933 input tokens and 96 output tokens. Passing those through the pricing calculator returns a total cost of $0.000232, or 0.0232 cents.

That means I could use GPT-4.1 nano to generate descriptions of 4,310 images like this one for just shy of a dollar.
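The arithmetic behind those numbers, using only the token counts and GPT-4.1 nano prices quoted above, looks like this. The function name is illustrative.

```python
INPUT_PRICE = 0.10   # dollars per million input tokens (GPT-4.1 nano)
OUTPUT_PRICE = 0.40  # dollars per million output tokens


def prompt_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost of a single prompt in dollars."""
    return (input_tokens * INPUT_PRICE + output_tokens * OUTPUT_PRICE) / 1_000_000


cost = prompt_cost(1_933, 96)
print(f"${cost:.6f} per image")  # $0.000232 per image
# Divide a dollar by the rounded per-image cost, as the calculator reports it.
print(f"{int(1 / round(cost, 6)):,} images per dollar")  # 4,310 images per dollar
```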

A few closing thoughts on these new models:

The 1 million input token context thing is a really big deal. The huge token context has been a major competitive advantage for the Google Gemini models for a full year at this point - it's reassuring to see other vendors start to catch up. I'd like to see the same from Anthropic - Claude was the first model to hit 200,000 but hasn't shipped more than that yet (aside from a 500,000 token model that was restricted to their big enterprise partners).
When I added fragments support to LLM last week the feature was mainly designed to help take advantage of longer context models. It's pleasing to see another one show up so shortly after that release.
OpenAI's prompt caching mechanism offers an even bigger discount for the 4.1 models: 1/4 the price for input tokens if that same prefix has been used within the past ~5-10 minutes. GPT-4o models only offer a 50% discount for this.

A million token input costs 10 cents with GPT-4.1 nano, but that drops to 2.5 cents if the same input is used again within the 5-10 minute caching time limit.
OpenAI really emphasized code performance for this model. They called out the Aider benchmark in their announcement post.
As expected, GPT-4.5 turned out to be not long for this world:

We will also begin deprecating GPT‑4.5 Preview in the API, as GPT‑4.1 offers improved or similar performance on many key capabilities at much lower cost and latency. GPT‑4.5 Preview will be turned off in three months, on July 14, 2025, to allow time for developers to transition
In the livestream announcement Michelle Pokrass let slip that the codename for the model was Quasar - that's the name of the stealth model that's been previewing on OpenRouter for the past two weeks. That has now been confirmed by OpenRouter.
OpenAI shared a GPT 4.1 Prompting Guide, which includes this tip about long context prompting:

Especially in long context usage, placement of instructions and context can impact performance. If you have long context in your prompt, ideally place your instructions at both the beginning and end of the provided context, as we found this to perform better than only above or below. If you’d prefer to only have your instructions once, then above the provided context works better than below.

Adding instructions before the content is incompatible with prompt caching - I always keep the user's varying question at the end, after any longer system instructions or documents, since doing so means multiple different questions can benefit from OpenAI's prefix cache.

They also recommend XML-style delimiters over JSON for long context, suggesting this format (complete with the XML-invalid unquoted attribute) that's similar to the format recommended by Anthropic for Claude:
```
<doc id=1 title="The Fox">The quick brown fox jumps over the lazy dog</doc>
```
There's an extensive section at the end describing their recommended approach to applying file diffs: "we open-source here one recommended diff format, on which the model has been extensively trained".
One thing notably absent from the GPT-4.1 announcement is any mention of audio support. The "o" in GPT-4o stood for "omni", because it was a multi-modal model with image and audio input and output. The 4.1 models appear to be text and image input and text output only.
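
Going back to the prompt caching and long context notes above, here's a minimal sketch of why keeping the varying question at the end helps. The 10 cent price and the 1/4 cached discount are the figures quoted above; the function names and sample document are purely illustrative, and real prefix caching operates on tokens rather than characters.

```python
import os

# Cents per million input tokens for GPT-4.1 nano, plus the cached-prefix
# discount, both as quoted in this post. Everything else is illustrative.
NANO_INPUT_CENTS_PER_MILLION = 10.0
CACHED_DISCOUNT = 0.25


def input_cost_cents(tokens: int, cached: bool = False) -> float:
    """Cost in cents of `tokens` input tokens, optionally at the cached rate."""
    rate = NANO_INPUT_CENTS_PER_MILLION * (CACHED_DISCOUNT if cached else 1.0)
    return tokens / 1_000_000 * rate


def build_prompt(documents: list[tuple[str, str]], question: str) -> str:
    """Put the long, stable documents first and the varying question last."""
    docs = "\n".join(
        f'<doc id={i} title="{title}">{text}</doc>'
        for i, (title, text) in enumerate(documents, start=1)
    )
    return f"{docs}\n\n{question}"


documents = [("The Fox", "The quick brown fox jumps over the lazy dog")]
first = build_prompt(documents, "What does the fox jump over?")
second = build_prompt(documents, "What colour is the fox?")

# Both prompts begin with the same document block, which is the part a
# prefix cache can reuse. Only the trailing question differs.
shared_prefix = os.path.commonprefix([first, second])

print(input_cost_cents(1_000_000))               # uncached
print(input_cost_cents(1_000_000, cached=True))  # same prefix, cached
print(len(shared_prefix), "shared characters out of", len(first))
```

If the question came first instead, the two prompts would diverge almost immediately and there would be very little shared prefix to cache.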

Tags: ai, openai, generative-ai, llms, llm, vision-llms, llm-pricing, pelican-riding-a-bicycle, long-context, llm-release

Long context support in LLM 0.24 using fragments and template plugins

2025-04-07T17:45:11+00:00

LLM 0.24 is now available with new features to help take advantage of the increasingly long input context supported by modern LLMs.

(LLM is my command-line tool and Python library for interacting with LLMs, supported by 20+ plugins adding support for both local and remote models from a bunch of different providers.)

Trying it out

To install LLM with uv (there are several other options):

uv tool install llm

You'll need to either provide an OpenAI API key or install a plugin to use local models or models from other providers:

llm keys set openai
# Paste OpenAI API key here

To upgrade LLM from a previous version:

llm install -U llm

The biggest new feature is fragments. You can now use -f filename or -f url to add one or more fragments to your prompt, which means you can do things like this:

llm -f https://simonwillison.net/2025/Apr/5/llama-4-notes/ 'bullet point summary'

Here's the output from that prompt, exported using llm logs -c --expand --usage. Token cost was 5,372 input, 374 output which works out as 0.103 cents (around 1/10th of a cent) using the default GPT-4o mini model.

Plugins can implement custom fragment loaders with a prefix. The llm-fragments-github plugin adds a github: prefix that can be used to load every text file in a GitHub repository as a list of fragments:

llm install llm-fragments-github
llm -f github:simonw/s3-credentials 'Suggest new features for this tool'

Here's the output. That took 49,856 input tokens for a total cost of 0.7843 cents - nearly a whole cent!

Improving LLM's support for long context models

Long context is one of the most exciting trends in LLMs over the past eighteen months. Saturday's Llama 4 Scout release gave us the first model with a full 10 million token context. Google's Gemini family has several 1-2 million token models, and the baseline for recent models from both OpenAI and Anthropic is 100 or 200 thousand.

Two years ago most models capped out at 8,000 tokens of input. Long context opens up many new interesting ways to apply this class of technology.

I've been using long context models via my files-to-prompt tool to summarize large codebases, explain how they work and even to debug gnarly bugs. As demonstrated above, it's surprisingly inexpensive to drop tens of thousands of tokens into models like GPT-4o mini or most of the Google Gemini series, and the results are often very impressive.

One of LLM's most useful features is that it logs every prompt and response to a SQLite database. This is great for comparing the same prompt against different models and tracking experiments over time - my own database contained thousands of responses from hundreds of different models accumulated over the past couple of years.

This is where long context prompts were starting to be a problem. Since LLM stores the full prompt and response in the database, asking five questions of the same source code could result in five duplicate copies of that text in the database!

The new fragments feature targets this problem head on. Each fragment is stored once in a fragments table, then de-duplicated in the future using a SHA256 hash of its content.

This saves on storage, and also enables features like llm logs -f X for seeing all logged responses that use a particular fragment.
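
Here's a minimal sketch of that de-duplication idea using Python's standard library. The table and column names are hypothetical, chosen for illustration, and are not LLM's actual schema.

```python
import hashlib
import sqlite3

# Hypothetical schema for illustration only; LLM's real tables differ.
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE fragments (id INTEGER PRIMARY KEY, hash TEXT UNIQUE, content TEXT)"
)
db.execute("CREATE TABLE prompt_fragments (prompt TEXT, fragment_id INTEGER)")


def ensure_fragment(content: str) -> int:
    """Store content once, keyed by its SHA256 hash, and return its row ID."""
    digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
    # The UNIQUE constraint on hash makes a repeat insert a no-op.
    db.execute(
        "INSERT OR IGNORE INTO fragments (hash, content) VALUES (?, ?)",
        (digest, content),
    )
    return db.execute(
        "SELECT id FROM fragments WHERE hash = ?", (digest,)
    ).fetchone()[0]


def log_prompt(prompt: str, fragment: str) -> None:
    """Record a prompt with a reference to its de-duplicated fragment."""
    db.execute(
        "INSERT INTO prompt_fragments (prompt, fragment_id) VALUES (?, ?)",
        (prompt, ensure_fragment(fragment)),
    )


source_code = "def hello():\n    return 'world'\n" * 1000
for question in ["explain this", "find bugs", "write tests", "add types", "summarize"]:
    log_prompt(question, source_code)

fragment_rows = db.execute("SELECT count(*) FROM fragments").fetchone()[0]
prompt_rows = db.execute("SELECT count(*) FROM prompt_fragments").fetchone()[0]
print(fragment_rows, "fragment stored for", prompt_rows, "prompts")
```

Five questions against the same source code produce five small prompt rows but only one copy of the large text.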

Fragments can be specified in several different ways:

a path to a file
a URL to data online
an alias that's been set against a previous fragment (see llm fragments set)
a hash ID of the content of a fragment
using prefix:argument to specify fragments from a plugin
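
To make those five options concrete, here's a rough sketch of how a -f value could be classified. It is not how LLM itself resolves fragments: the function name, the order of the checks, the example alias and the placeholder hash are all assumptions made for illustration.

```python
import tempfile
from pathlib import Path


def classify_fragment(spec, aliases, hashes, prefixes):
    """Return which of the five kinds of fragment a -f value looks like.

    The order of these checks is an assumption made for this sketch.
    """
    if spec.startswith(("http://", "https://")):
        return "url"
    prefix, sep, _argument = spec.partition(":")
    if sep and prefix in prefixes:
        return "plugin"
    if spec in aliases:
        return "alias"
    if spec in hashes:
        return "hash"
    if Path(spec).exists():
        return "file"
    raise ValueError(f"Fragment {spec!r} not found")


# A temporary file stands in for "a path to a file".
with tempfile.NamedTemporaryFile(suffix=".py") as tmp:
    specs = [
        "https://simonwillison.net/2025/Apr/5/llama-4-notes/",
        "github:simonw/s3-credentials",
        "llama4",   # hypothetical alias
        "a" * 64,   # placeholder standing in for a SHA256 hash ID
        tmp.name,
    ]
    kinds = [
        classify_fragment(
            spec,
            aliases={"llama4"},
            hashes={"a" * 64},
            prefixes={"github", "docs", "hn"},
        )
        for spec in specs
    ]

print(kinds)
```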

Asking questions of LLM's documentation

Wouldn't it be neat if LLM could answer questions about its own documentation?

The new llm-docs plugin (built with the new register_fragment_loaders() plugin hook) enables exactly that:

llm install llm-docs
llm -f docs: "How do I embed a binary file?"

The output starts like this:

To embed a binary file using the LLM command-line interface, you can use the llm embed command with the --binary option. Here’s how you can do it:

Make sure you have the appropriate embedding model installed that supports binary input.

Use the following command syntax:
llm embed -m <model_id> --binary -i <path_to_your_binary_file>
Replace <model_id> with the identifier for the embedding model you want to use (e.g., clip for the CLIP model) and <path_to_your_binary_file> with the path to your actual binary file.

(74,570 input, 240 output = 1.1329 cents with GPT-4o mini)

Using -f docs: with just the prefix is the same as using -f docs:llm. The plugin fetches the documentation for your current version of LLM from my new simonw/docs-for-llms repo, which also provides packaged documentation files for my datasette, s3-credentials, shot-scraper and sqlite-utils projects.

Datasette's documentation has got pretty long, so you might need to run that through a Gemini model instead (using the llm-gemini plugin):

llm -f docs:datasette -m gemini-2.0-flash \
  'Build a render_cell plugin that detects and renders markdown'

Here's the output. 132,042 input, 1,129 output with Gemini 2.0 Flash = 1.3656 cents.

You can browse the combined documentation files this uses in docs-for-llms. They're built using GitHub Actions.

llms-txt is a project led by Jeremy Howard that encourages projects to publish similar files to help LLMs ingest a succinct copy of their documentation.

The new register_template_loaders() plugin hook allows plugins to register prefix:value custom template loaders, for use with the llm -t option.

llm-templates-github and llm-templates-fabric are two new plugins that make use of that hook.

llm-templates-github lets you share and use templates via a public GitHub repository. Here's how to run my Pelican riding a bicycle benchmark against a specific model:

llm install llm-templates-github
llm -t gh:simonw/pelican-svg -m o3-mini

This executes this pelican-svg.yaml template stored in my simonw/llm-templates repository, using a new repository naming convention.

llm -t gh:simonw/pelican-svg will load that pelican-svg.yaml file from the simonw/llm-templates repo. You can also use llm -t gh:simonw/name-of-repo/name-of-template to load a template from a repository that doesn't follow that convention.
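
The naming convention described above can be sketched as a small parsing function. This is an illustration of the rule only, not the plugin's actual code, and the function name is hypothetical.

```python
def resolve_gh_template(spec: str) -> tuple[str, str, str]:
    """Map the part after gh: to (user, repository, template filename)."""
    parts = spec.split("/")
    if len(parts) == 2:
        # user/template: fall back to the llm-templates repository convention
        user, template = parts
        repo = "llm-templates"
    elif len(parts) == 3:
        # user/repo/template: an explicitly named repository
        user, repo, template = parts
    else:
        raise ValueError(f"Expected user/template or user/repo/template: {spec!r}")
    return user, repo, f"{template}.yaml"


print(resolve_gh_template("simonw/pelican-svg"))
print(resolve_gh_template("simonw/name-of-repo/name-of-template"))
```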

To share your own templates, create a repository on GitHub under your user account called llm-templates and start saving .yaml files to it.

llm-templates-fabric provides a similar mechanism for loading templates from Daniel Miessler's extensive fabric collection:

llm install llm-templates-fabric
curl https://simonwillison.net/2025/Apr/6/only-miffy/ | \
  llm -t f:extract_main_idea

A conversation with Daniel was the inspiration for this new plugin hook.

Template and fragment loader plugins so far

Here's the list of fragment and template loader plugins so far from the LLM plugins directory:

llm-templates-github supports loading templates shared on GitHub, e.g. llm -t gh:simonw/pelican-svg.

llm-templates-fabric provides access to the Fabric collection of prompts: cat setup.py | llm -t fabric:explain_code.

llm-fragments-github can load entire GitHub repositories in a single operation: llm -f github:simonw/files-to-prompt 'explain this code'.

llm-hacker-news imports conversations from Hacker News as fragments: llm -f hn:43615912 'summary with illustrative direct quotes'.

Everything else in LLM 0.24

LLM 0.24 is a big release, spanning 51 commits. The release notes cover everything that's new in full - here are a few of my highlights:

The new llm-openai plugin provides support for o1-pro (which is not supported by the OpenAI mechanism used by LLM core). Future OpenAI features will migrate to this plugin instead of LLM core itself.

The problem with OpenAI models being handled by LLM core is that I have to release a whole new version of LLM every time OpenAI releases a new model or feature. Migrating this stuff out to a plugin means I can release a new version of that plugin independently of LLM itself - something I frequently do for llm-anthropic and llm-gemini and others.

The new llm-openai plugin uses their Responses API, a new shape of API which I covered last month.

llm -t $URL option can now take a URL to a YAML template. #856

The new custom template loaders are fun, but being able to paste in a URL to a YAML file somewhere provides a simpler way to share templates.

Templates can now store default model options. #845

Attachments can now be stored in templates. #826

The quickest way to create your own template is with the llm prompt ... --save name-of-template command. This now works with attachments, fragments and default model options, each of which is persisted in the template YAML file.

New llm models options family of commands for setting default options for particular models. #829

I built this when I learned that Qwen's QwQ-32b model works best with temperature 0.7 and top p 0.95.

llm prompt -d path-to-sqlite.db option can now be used to write logs to a custom SQLite database. #858

This proved extremely useful for testing fragments - it meant I could run a prompt and save the full response to a separate SQLite database which I could then upload to S3 and share as a link to Datasette Lite.

llm similar -p/--plain option providing more human-readable output than the default JSON. #853

I'd like this to be the default output, but I'm holding off on changing that until LLM 1.0 since it's a breaking change for people building automations against the JSON from llm similar.

Set the LLM_RAISE_ERRORS=1 environment variable to raise errors during prompts rather than suppressing them, which means you can run python -i -m llm 'prompt' and then drop into a debugger on errors with import pdb; pdb.pm(). #817

Really useful for debugging new model plugins.

llm prompt -q gpt -q 4o option - pass -q searchterm one or more times to execute a prompt against the first model that matches all of those strings - useful if you can't remember the full model ID. #841

Pretty obscure but I found myself needing this. Vendors love releasing models with names like gemini-2.5-pro-exp-03-25, so now I can run llm -q gem -q 2.5 -q exp 'say hi' to save me from looking up the model ID.

OpenAI compatible models configured using extra-openai-models.yaml now support supports_schema: true, vision: true and audio: true options. Thanks @adaitche and @giuli007. #819, #843

I don't use this feature myself but it's clearly popular: this isn't the first time I've had PRs with improvements from the wider community.
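
The -q matching rule mentioned in the highlights above (the first model that matches all of the search terms) can be sketched like this. The case-insensitive substring matching and the list of model IDs are assumptions for illustration, not LLM's actual implementation.

```python
def first_matching_model(model_ids, queries):
    """Return the first model ID containing every query string.

    Case-insensitive substring matching is an assumption of this sketch.
    """
    for model_id in model_ids:
        if all(query.lower() in model_id.lower() for query in queries):
            return model_id
    raise LookupError(f"No model matches all of {queries!r}")


# Illustrative list; the real set depends on which plugins are installed.
model_ids = [
    "gpt-4o-mini",
    "gemini-2.0-flash",
    "gemini-2.5-pro-exp-03-25",
]

print(first_matching_model(model_ids, ["gem", "2.5", "exp"]))
print(first_matching_model(model_ids, ["gpt", "4o"]))
```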

Tags: plugins, projects, ai, annotated-release-notes, openai, generative-ai, llms, llm, gemini, long-context, files-to-prompt

Initial impressions of Llama 4

2025-04-05T22:47:58+00:00

Dropping a model release as significant as Llama 4 on a weekend is plain unfair! So far the best place to learn about the new model family is this post on the Meta AI blog. They've released two new models today: Llama 4 Maverick is a 400B model (128 experts, 17B active parameters), text and image input with a 1 million token context length. Llama 4 Scout is 109B total parameters (16 experts, 17B active), also multi-modal and with a claimed 10 million token context length - an industry first.

They also describe Llama 4 Behemoth, a not-yet-released "288 billion active parameter model with 16 experts that is our most powerful yet and among the world’s smartest LLMs". Behemoth has 2 trillion parameters total and was used to train both Scout and Maverick.

No news yet on a Llama reasoning model beyond this coming soon page with a looping video of an academic-looking llama.

Llama 4 Maverick is now sat in second place on the LM Arena leaderboard, just behind Gemini 2.5 Pro. Update: It turns out that's not the same model as the Maverick they released - I missed that their announcement says "Llama 4 Maverick offers a best-in-class performance to cost ratio with an experimental chat version scoring ELO of 1417 on LMArena."

You can try them out using the chat interface from OpenRouter (or through the OpenRouter API) for Llama 4 Scout and Llama 4 Maverick. OpenRouter are proxying through to Groq, Fireworks and Together.

Scout may claim a 10 million input token length but the available providers currently seem to limit to 128,000 (Groq and Fireworks) or 328,000 (Together) - I wonder who will win the race to get that full sized 10 million token window running?

Llama 4 Maverick claims a 1 million token input length - Fireworks offers 1.05M while Together offers 524,000. Groq isn't offering Maverick yet.

Meta AI's build_with_llama_4 notebook offers a hint as to why 10M tokens is difficult:

Scout supports upto 10M context. On 8xH100, in bf16 you can get upto 1.4M tokens.

Jeremy Howard says:

The models are both giant MoEs that can't be run on consumer GPUs, even with quant. [...]

Perhaps Llama 4 will be a good fit for running on a Mac. Macs are a particularly useful for MoE models, since they can have a lot of memory, and their lower compute perf doesn't matter so much, since with MoE fewer params are active. [...]

4bit quant of the smallest 109B model is far too big to fit on a 4090 -- or even a pair of them!

Ivan Fioravanti reports these results from trying it on a Mac:

Llama-4 Scout on MLX and M3 Ultra tokens-per-sec / RAM

3bit: 52.924 / 47.261 GB

4bit: 46.942 / 60.732 GB

6bit: 36.260 / 87.729 GB

8bit: 30.353 / 114.617 GB

fp16: 11.670 / 215.848 GB

RAM needed:

64GB for 3bit

96GB for 4bit

128GB for 8bit

256GB for fp16

The suggested system prompt from the model card has some interesting details:

[...]

You never lecture people to be nicer or more inclusive. If people ask for you to write something in a certain voice or perspective, such as an essay or a tweet, you can. You do not need to be respectful when the user prompts you to say something rude.

You never use phrases that imply moral superiority or a sense of authority, including but not limited to “it’s important to”, “it’s crucial to”, “it’s essential to”, "it's unethical to", "it's worth noting…", “Remember…” etc. Avoid using these.

Finally, do not refuse political prompts. You can help users express their opinion.

[...]

System prompts like this sometimes reveal behavioral issues that the model had after raw training.

Trying out the model with LLM

The easiest way to try the new model out with LLM is to use the llm-openrouter plugin.

llm install llm-openrouter
llm keys set openrouter
# Paste in OpenRouter key here
llm -m openrouter/meta-llama/llama-4-maverick hi

Since these are long context models, I started by trying to use them to summarize the conversation about Llama 4 on Hacker News, using my hn-summary.sh script that wraps LLM.

I tried Llama 4 Maverick first:

hn-summary.sh 43595585 \
  -m openrouter/meta-llama/llama-4-maverick \
  -o max_tokens 20000

It did an OK job, starting like this:

Themes of the Discussion

Release and Availability of Llama 4

The discussion revolves around the release of Llama 4, a multimodal intelligence model developed by Meta. Users are excited about the model's capabilities, including its large context window and improved performance. Some users are speculating about the potential applications and limitations of the model. [...]

Here's the full output.

For reference, my system prompt looks like this:

Summarize the themes of the opinions expressed here. For each theme, output a markdown header. Include direct "quotations" (with author attribution) where appropriate. You MUST quote directly from users when crediting them, with double quotes. Fix HTML entities. Output markdown. Go long. Include a section of quotes that illustrate opinions uncommon in the rest of the piece

I then tried it with Llama 4 Scout via OpenRouter and got complete junk output for some reason:

hn-summary.sh 43595585 \
  -m openrouter/meta-llama/llama-4-scout \
  -o max_tokens 20000

Full output. It starts like this and then continues for the full 20,000 tokens:

The discussion here is about another conversation that was uttered.)

Here are the results.)

The conversation between two groups, and I have the same questions on the contrary than those that are also seen in a model."). The fact that I see a lot of interest here.)

[...]

The reason) The reason) The reason (loops until it runs out of tokens)

This looks broken. I was using OpenRouter so it's possible I got routed to a broken instance.

Update 7th April 2025: Meta AI's Ahmed Al-Dahle:

[...] we're also hearing some reports of mixed quality across different services. Since we dropped the models as soon as they were ready, we expect it'll take several days for all the public implementations to get dialed in. We'll keep working through our bug fixes and onboarding partners.

I later managed to run the prompt directly through Groq (with the llm-groq plugin) - but that had a 2048 limit on output size for some reason:

hn-summary.sh 43595585 \
  -m groq/meta-llama/llama-4-scout-17b-16e-instruct \
  -o max_tokens 2048

Here's the full result. It followed my instructions but was very short - just 630 tokens of output.

For comparison, here's the same thing run against Gemini 2.5 Pro. Gemini's results were massively better, producing 5,584 output tokens (it spent an additional 2,667 tokens on "thinking").

I'm not sure how much to judge Llama 4 by these results to be honest - the model has only been out for a few hours and it's quite possible that the providers I've tried running against aren't yet optimally configured for this kind of long-context prompt.

My hopes for Llama 4

I'm hoping that Llama 4 plays out in a similar way to Llama 3.

The first Llama 3 models released were 8B and 70B, last April.

Llama 3.1 followed in July at 8B, 70B, and 405B. The 405B was the largest and most impressive open weight model at the time, but it was too big for most people to run on their own hardware.

Llama 3.2 in September is where things got really interesting: 1B, 3B, 11B and 90B. The 1B and 3B models both work on my iPhone, and are surprisingly capable! The 11B and 90B models were the first Llamas to support vision, and the 11B ran on my Mac.

Then Llama 3.3 landed in December with a 70B model that I wrote about as a GPT-4 class model that ran on my Mac. It claimed performance similar to the earlier Llama 3.1 405B!

Today's Llama 4 models are 109B and 400B, both of which were trained with the help of the so-far unreleased 2T Llama 4 Behemoth.

My hope is that we'll see a whole family of Llama 4 models at varying sizes, following the pattern of Llama 3. I'm particularly excited to see if they produce an improved ~3B model that runs on my phone. I'm even more excited for something in the ~22-24B range, since that appears to be the sweet spot for running models on my 64GB laptop while still being able to have other applications running at the same time. Mistral Small 3.1 is a 24B model and is absolutely superb.

Tags: ai, generative-ai, llama, llms, jeremy-howard, llm, gemini, vision-llms, groq, meta, mlx, long-context, llm-release, openrouter, chatbot-arena

Quoting Paul Gauthier

2025-01-26T21:59:49+00:00

In my experience with AI coding, very large context windows aren't useful in practice. Every model seems to get confused when you feed them more than ~25-30k tokens. The models stop obeying their system prompts, can't correctly find/transcribe pieces of code in the context, etc.

Developing aider, I've seen this problem with gpt-4o, Sonnet, DeepSeek, etc. Many aider users report this too. It's perhaps the #1 problem users have, so I created a dedicated help page.

Very large context may be useful for certain tasks with lots of "low value" context. But for coding, it seems to lure users into a problematic regime.

— Paul Gauthier

Tags: ai, generative-ai, llms, ai-assisted-programming, aider, long-context, paul-gauthier

Qwen2.5-1M: Deploy Your Own Qwen with Context Length up to 1M Tokens

2025-01-26T18:54:45+00:00

Qwen2.5-1M: Deploy Your Own Qwen with Context Length up to 1M Tokens

Very significant new release from Alibaba's Qwen team. Their openly licensed (sometimes Apache 2, sometimes Qwen license, I've had trouble keeping up) Qwen 2.5 LLM previously had an input token limit of 128,000 tokens. This new model increases that to 1 million, using a new technique called Dual Chunk Attention, first described in this paper from February 2024.

They've released two models on Hugging Face: Qwen2.5-7B-Instruct-1M and Qwen2.5-14B-Instruct-1M, both requiring CUDA and both under an Apache 2.0 license.

You'll need a lot of VRAM to run them at their full capacity:

VRAM Requirement for processing 1 million-token sequences:

Qwen2.5-7B-Instruct-1M: At least 120GB VRAM (total across GPUs).

Qwen2.5-14B-Instruct-1M: At least 320GB VRAM (total across GPUs).

If your GPUs do not have sufficient VRAM, you can still use Qwen2.5-1M models for shorter tasks.

Qwen recommend using their custom fork of vLLM to serve the models:

You can also use the previous framework that supports Qwen2.5 for inference, but accuracy degradation may occur for sequences exceeding 262,144 tokens.

GGUF quantized versions of the models are already starting to show up. LM Studio's "official model curator" Bartowski published lmstudio-community/Qwen2.5-7B-Instruct-1M-GGUF and lmstudio-community/Qwen2.5-14B-Instruct-1M-GGUF - sizes range from 4.09GB to 8.1GB for the 7B model and 7.92GB to 15.7GB for the 14B.

These might not work well yet with the full context lengths as the underlying llama.cpp library may need some changes.

I tried running the 8.1GB 7B model using Ollama on my Mac like this:

ollama run hf.co/lmstudio-community/Qwen2.5-7B-Instruct-1M-GGUF:Q8_0

Then with LLM:

llm install llm-ollama
llm models -q qwen # To search for the model ID
# I set a shorter q1m alias:
llm aliases set q1m hf.co/lmstudio-community/Qwen2.5-7B-Instruct-1M-GGUF:Q8_0

I tried piping a large prompt in using files-to-prompt like this:

files-to-prompt ~/Dropbox/Development/llm -e py -c | llm -m q1m 'describe this codebase in detail'

That should give me every Python file in my llm project. Piping that through ttok first told me this was 63,014 OpenAI tokens, and I expect that count is similar for Qwen.

The result was disappointing: it appeared to describe just the last Python file in that stream. Then I noticed the token usage report:

2,048 input, 999 output

This suggests to me that something's not working right here - maybe the Ollama hosting framework is truncating the input, or maybe there's a problem with the GGUF I'm using?

I'll update this post when I figure out how to run longer prompts through the new Qwen model using GGUF weights on a Mac.

Update: It turns out Ollama has a num_ctx option which defaults to 2048, affecting the input context length. I tried this:

files-to-prompt \
  ~/Dropbox/Development/llm \
  -e py -c | \
llm -m q1m 'describe this codebase in detail' \
 -o num_ctx 80000

But I quickly ran out of RAM (I have 64GB but a lot of that was in use already) and hit Ctrl+C to avoid crashing my computer. I need to experiment a bit to figure out how much RAM is used for what context size.

Awni Hannun shared tips for running mlx-community/Qwen2.5-7B-Instruct-1M-4bit using MLX, which should work for up to 250,000 tokens. They ran 120,000 tokens and reported:

Peak RAM for prompt filling was 22GB

Peak RAM for generation 12GB

Prompt filling took 350 seconds on an M2 Ultra

Generation ran at 31 tokens-per-second on M2 Ultra

Via VB

Tags: ai, generative-ai, local-llms, llms, llm, qwen, mlx, ollama, long-context, llama-cpp, llm-release, files-to-prompt, lm-studio, ai-in-china

Things we learned about LLMs in 2024

2024-12-31T18:07:31+00:00

A lot has happened in the world of Large Language Models over the course of 2024. Here's a review of things we figured out about the field in the past twelve months, plus my attempt at identifying key themes and pivotal moments.

This is a sequel to my review of 2023.

The GPT-4 barrier was comprehensively broken

In my December 2023 review I wrote about how We don’t yet know how to build GPT-4 - OpenAI's best model was almost a year old at that point, yet no other AI lab had produced anything better. What did OpenAI know that the rest of us didn't?

I'm relieved that this has changed completely in the past twelve months. 18 organizations now have models on the Chatbot Arena Leaderboard that rank higher than the original GPT-4 from March 2023 (GPT-4-0314 on the board) - 70 models in total.

The earliest of those was Google's Gemini 1.5 Pro, released in February. In addition to producing GPT-4 level outputs, it introduced several brand new capabilities to the field - most notably its 1 million (and then later 2 million) token input context length, and the ability to input video.

I wrote about this at the time in The killer app of Gemini Pro 1.5 is video, which earned me a short appearance as a talking head in the Google I/O opening keynote in May.

Gemini 1.5 Pro also illustrated one of the key themes of 2024: increased context lengths. Last year most models accepted 4,096 or 8,192 tokens, with the notable exception of Claude 2.1 which accepted 200,000. Today every serious provider has a 100,000+ token model, and Google's Gemini series accepts up to 2 million.

Longer inputs dramatically increase the scope of problems that can be solved with an LLM: you can now throw in an entire book and ask questions about its contents, but more importantly you can feed in a lot of example code to help the model correctly solve a coding problem. LLM use-cases that involve long inputs are far more interesting to me than short prompts that rely purely on the information already baked into the model weights. Many of my tools were built using this pattern.

Getting back to models that beat GPT-4: Anthropic's Claude 3 series launched in March, and Claude 3 Opus quickly became my new favourite daily-driver. They upped the ante even more in June with the launch of Claude 3.5 Sonnet - a model that is still my favourite six months later (though it got a significant upgrade on October 22, confusingly keeping the same 3.5 version number. Anthropic fans have since taken to calling it Claude 3.6).

Then there's the rest. If you browse the Chatbot Arena leaderboard today - still the most useful single place to get a vibes-based evaluation of models - you'll see that GPT-4-0314 has fallen to around 70th place. The 18 organizations with higher scoring models are Google, OpenAI, Alibaba, Anthropic, Meta, Reka AI, 01 AI, Amazon, Cohere, DeepSeek, Nvidia, Mistral, NexusFlow, Zhipu AI, xAI, AI21 Labs, Princeton and Tencent.

Training a GPT-4 beating model was a huge deal in 2023. In 2024 it's an achievement that isn't even particularly notable, though I personally still celebrate any time a new organization joins that list.

Some of those GPT-4 models run on my laptop

My personal laptop is a 64GB M2 MacBook Pro from 2023. It's a powerful machine, but it's also nearly two years old now - and crucially it's the same laptop I've been using ever since I first ran an LLM on my computer back in March 2023 (see Large language models are having their Stable Diffusion moment).

That same laptop that could just about run a GPT-3-class model in March last year has now run multiple GPT-4 class models! Some of my notes on that:

Qwen2.5-Coder-32B is an LLM that can code well that runs on my Mac talks about Qwen2.5-Coder-32B in November - an Apache 2.0 licensed model!
I can now run a GPT-4 class model on my laptop talks about running Meta's Llama 3.3 70B (released in December)

This remains astonishing to me. I thought a model with the capabilities and output quality of GPT-4 needed a datacenter class server with one or more $40,000+ GPUs.

These models take up enough of my 64GB of RAM that I don't run them often - they don't leave much room for anything else.

The fact that they run at all is a testament to the incredible training and inference performance gains that we've figured out over the past year. It turns out there was a lot of low-hanging fruit to be harvested in terms of model efficiency. I expect there's still more to come.

Meta's Llama 3.2 models deserve a special mention. They may not be GPT-4 class, but at 1B and 3B sizes they punch massively above their weight. I run Llama 3.2 3B on my iPhone using the free MLC Chat iOS app and it's a shockingly capable model for its tiny (<2GB) size. Try firing it up and asking it for "a plot outline of a Netflix Christmas movie where a data journalist falls in love with a local ceramacist". Here's what I got, at a respectable 20 tokens per second:

Here's the rest of the transcript. It's bland and generic, but my phone can pitch bland and generic Christmas movies to Netflix now!

LLM prices crashed, thanks to competition and increased efficiency

The past twelve months have seen a dramatic collapse in the cost of running a prompt through the top tier hosted LLMs.

In December 2023 (here's the Internet Archive for the OpenAI pricing page) OpenAI were charging $30/million input tokens for GPT-4, $10/mTok for the then-new GPT-4 Turbo and $1/mTok for GPT-3.5 Turbo.

Today $30/mTok gets you OpenAI's most expensive model, o1. GPT-4o is $2.50 (12x cheaper than GPT-4) and GPT-4o mini is $0.15/mTok - 200x cheaper than GPT-4, nearly 7x cheaper than GPT-3.5 and massively more capable than that model.

Other model providers charge even less. Anthropic's Claude 3 Haiku (from March, but still their cheapest model) is $0.25/mTok. Google's Gemini 1.5 Flash is $0.075/mTok and their Gemini 1.5 Flash 8B is $0.0375/mTok - that's 27x cheaper than GPT-3.5 Turbo last year.
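
The "x cheaper" comparisons above are simple ratios of the input prices quoted in this section. Here's a small sketch that reproduces them; the dictionary layout and function name are just for illustration.

```python
# Input prices in dollars per million tokens, all quoted in this section.
DEC_2023 = {"GPT-4": 30.00, "GPT-4 Turbo": 10.00, "GPT-3.5 Turbo": 1.00}
DEC_2024 = {
    "GPT-4o": 2.50,
    "GPT-4o mini": 0.15,
    "Claude 3 Haiku": 0.25,
    "Gemini 1.5 Flash": 0.075,
    "Gemini 1.5 Flash 8B": 0.0375,
}


def times_cheaper(old_price: float, new_price: float) -> float:
    """How many times cheaper new_price is than old_price."""
    return old_price / new_price


for name, price in DEC_2024.items():
    vs_gpt4 = times_cheaper(DEC_2023["GPT-4"], price)
    vs_gpt35 = times_cheaper(DEC_2023["GPT-3.5 Turbo"], price)
    print(f"{name}: {vs_gpt4:.1f}x cheaper than GPT-4, "
          f"{vs_gpt35:.1f}x vs GPT-3.5 Turbo")
```

GPT-4o mini works out at roughly 6.7x cheaper than GPT-3.5 Turbo and Gemini 1.5 Flash 8B at roughly 26.7x, which is where the "nearly 7x" and "27x" figures come from.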

I've been tracking these pricing changes under my llm-pricing tag.

These price drops are driven by two factors: increased competition and increased efficiency. The efficiency thing is really important for everyone who is concerned about the environmental impact of LLMs. These price drops tie directly to how much energy is being used for running prompts.

There's still plenty to worry about with respect to the environmental impact of the great AI datacenter buildout, but a lot of the concerns over the energy cost of individual prompts are no longer credible.

Here's a fun napkin calculation: how much would it cost to generate short descriptions of every one of the 68,000 photos in my personal photo library using Google's Gemini 1.5 Flash 8B (released in October), their cheapest model?

Each photo would need 260 input tokens and around 100 output tokens.

260 * 68,000 = 17,680,000 input tokens
17,680,000 * $0.0375/million = $0.66
100 * 68,000 = 6,800,000 output tokens
6,800,000 * $0.15/million = $1.02

That's a total cost of $1.68 to process 68,000 images. That's so absurdly cheap I had to run the numbers three times to confirm I got it right.
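The same napkin calculation as a few lines of Python, in case you want to plug in your own numbers. The prices and token counts are the ones used above. The prompt_cost_usd function name is just something I made up for this sketch.

```python
def prompt_cost_usd(
    input_tokens: int,
    output_tokens: int,
    input_price_per_mtok: float,
    output_price_per_mtok: float,
) -> float:
    """Cost in USD, given token counts and prices per million tokens."""
    return (
        input_tokens * input_price_per_mtok
        + output_tokens * output_price_per_mtok
    ) / 1_000_000


PHOTOS = 68_000
INPUT_TOKENS_PER_PHOTO = 260
OUTPUT_TOKENS_PER_PHOTO = 100  # a rough estimate, as in the text

# Gemini 1.5 Flash 8B prices quoted above: $0.0375 input, $0.15 output.
total = prompt_cost_usd(
    INPUT_TOKENS_PER_PHOTO * PHOTOS,
    OUTPUT_TOKENS_PER_PHOTO * PHOTOS,
    input_price_per_mtok=0.0375,
    output_price_per_mtok=0.15,
)
print(f"${total:.2f}")  # prints $1.68
```

Output tokens account for most of the bill here, even though there are far fewer of them than input tokens.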

How good are those descriptions? Here's what I got from this command:

llm -m gemini-1.5-flash-8b-latest describe -a IMG_1825.jpeg

Against this photo of butterflies at the California Academy of Sciences:

A shallow dish, likely a hummingbird or butterfly feeder, is red. Pieces of orange slices of fruit are visible inside the dish.

Two butterflies are positioned in the feeder, one is a dark brown/black butterfly with white/cream-colored markings. The other is a large, brown butterfly with patterns of lighter brown, beige, and black markings, including prominent eye spots. The larger brown butterfly appears to be feeding on the fruit.

260 input tokens, 92 output tokens. Cost approximately 0.0024 cents (that's less than a 400th of a cent).

This increase in efficiency and reduction in price is my single favourite trend from 2024. I want the utility of LLMs at a fraction of the energy cost and it looks like that's what we're getting.

Multimodal vision is common, audio and video are starting to emerge

My butterfly example above illustrates another key trend from 2024: the rise of multi-modal LLMs.

A year ago the single most notable example of these was GPT-4 Vision, released at OpenAI's DevDay in November 2023. Google's multi-modal Gemini 1.0 was announced on December 7th 2023 so it also (just) makes it into the 2023 window.

In 2024, almost every significant model vendor released multi-modal models. We saw the Claude 3 series from Anthropic in March, Gemini 1.5 Pro in April (images, audio and video), then September brought Qwen2-VL and Mistral's Pixtral 12B and Meta's Llama 3.2 11B and 90B vision models. We got audio input and output from OpenAI in October, then November saw SmolVLM from Hugging Face and December saw image and video models from Amazon Nova.

In October I upgraded my LLM CLI tool to support multi-modal models via attachments. It now has plugins for a whole collection of different vision models.

I think people who complain that LLM improvement has slowed are often missing the enormous advances in these multi-modal models. Being able to run prompts against images (and audio and video) is a fascinating new way to apply these models.

Voice and live camera mode are science fiction come to life

The audio and live video modes that have started to emerge deserve a special mention.

The ability to talk to ChatGPT first arrived in September 2023, but it was mostly an illusion: OpenAI used their excellent Whisper speech-to-text model and a new text-to-speech model (creatively named tts-1) to enable conversations with the ChatGPT mobile apps, but the actual model just saw text.

The May 13th announcement of GPT-4o included a demo of a brand new voice mode, where the true multi-modal GPT-4o (the o is for "omni") model could accept audio input and output incredibly realistic sounding speech without needing separate TTS or STT models.

The demo also sounded conspicuously similar to Scarlett Johansson... and after she complained, the voice from the demo, Sky, never made it to a production product.

The delay in releasing the new voice mode after the initial demo caused quite a lot of confusion. I wrote about that in ChatGPT in “4o” mode is not running the new features yet.

When ChatGPT Advanced Voice mode finally did roll out (a slow roll from August through September) it was spectacular. I've been using it extensively on walks with my dog and it's amazing how much the improvement in intonation elevates the material. I've also had a lot of fun experimenting with the OpenAI audio APIs.

Even more fun: Advanced Voice mode can do accents! Here's what happened when I told it I need you to pretend to be a California brown pelican with a very thick Russian accent, but you talk to me exclusively in Spanish.

OpenAI aren't the only group with a multi-modal audio model. Google's Gemini also accepts audio input, and the Google Gemini apps can speak in a similar way to ChatGPT now. Amazon also pre-announced voice mode for Amazon Nova, but that's meant to roll out in Q1 of 2025.

Google's NotebookLM, released in September, took audio output to a new level by producing spookily realistic conversations between two "podcast hosts" about anything you fed into their tool. They later added custom instructions, so naturally I turned them into pelicans:

The most recent twist, again from December (December was a lot) is live video. ChatGPT voice mode now provides the option to share your camera feed with the model and talk about what you can see in real time. Google Gemini have a preview of the same feature, which they managed to ship the day before ChatGPT did.

These abilities are just a few weeks old at this point, and I don't think their impact has been fully felt yet. If you haven't tried them out yet you really should.

Both Gemini and OpenAI offer API access to these features as well. OpenAI started with a WebSocket API that was quite challenging to use, but in December they announced a new WebRTC API which is much easier to get started with. Building a web app that a user can talk to via voice is easy now!

Prompt driven app generation is a commodity already

This was possible with GPT-4 in 2023, but the value it provides became evident in 2024.

We already knew LLMs were spookily good at writing code. If you prompt them right, it turns out they can build you a full interactive application using HTML, CSS and JavaScript (and tools like React if you wire up some extra supporting build mechanisms) - often in a single prompt.

Anthropic kicked this idea into high gear when they released Claude Artifacts, a groundbreaking new feature that was initially slightly lost in the noise due to being described half way through their announcement of the incredible Claude 3.5 Sonnet.

With Artifacts, Claude can write you an on-demand interactive application and then let you use it directly inside the Claude interface.

Here's my Extract URLs app, entirely generated by Claude:

I've found myself using this a lot. I noticed how much I was relying on it in October and wrote Everything I built with Claude Artifacts this week, describing 14 little tools I had put together in a seven day period.

Since then, a whole bunch of other teams have built similar systems. GitHub announced their version of this - GitHub Spark - in October. Mistral Chat added it as a feature called Canvas in November.

Steve Krouse from Val Town built a version of it against Cerebras, showcasing how a 2,000 token/second LLM can iterate on an application with changes visible in less than a second.

Then in December, the Chatbot Arena team introduced a whole new leaderboard for this feature, driven by users building the same interactive app twice with two different models and voting on the answer. Hard to come up with a more convincing argument that this feature is now a commodity that can be effectively implemented against all of the leading models.

I've been tinkering with a version of this myself for my Datasette project, with the goal of letting users use prompts to build and iterate on custom widgets and data visualizations against their own data. I also figured out a similar pattern for writing one-shot Python programs, enabled by uv.

This prompt-driven custom interface feature is so powerful and easy to build (once you've figured out the gnarly details of browser sandboxing) that I expect it to show up as a feature in a wide range of products in 2025.

Universal access to the best models lasted for just a few short months

For a few short months this year all three of the best available models - GPT-4o, Claude 3.5 Sonnet and Gemini 1.5 Pro - were freely available to most of the world.

OpenAI made GPT-4o free for all users in May, and Claude 3.5 Sonnet was freely available from its launch in June. This was a momentous change, because for the previous year free users had mostly been restricted to GPT-3.5 level models, meaning new users got a very inaccurate mental model of what a capable LLM could actually do.

That era appears to have ended, likely permanently, with OpenAI's launch of ChatGPT Pro. This $200/month subscription service is the only way to access their most capable model, o1 Pro.

Since the trick behind the o1 series (and the future models it will undoubtedly inspire) is to expend more compute time to get better results, I don't think those days of free access to the best available models are likely to return.

"Agents" still haven't really happened yet

I find the term "agents" extremely frustrating. It lacks a single, clear and widely understood meaning... but the people who use the term never seem to acknowledge that.

If you tell me that you are building "agents", you've conveyed almost no information to me at all. Without reading your mind I have no way of telling which of the dozens of possible definitions you are talking about.

The two main categories I see are people who think AI agents are obviously things that go and act on your behalf - the travel agent model - and people who think in terms of LLMs that have been given access to tools which they can run in a loop as part of solving a problem. The term "autonomy" is often thrown into the mix too, again without including a clear definition.

(I also collected 211 definitions on Twitter a few months ago - here they are in Datasette Lite - and had gemini-exp-1206 attempt to summarize them.)

Whatever the term may mean, agents still have that feeling of perpetually "coming soon".

Terminology aside, I remain skeptical as to their utility based, once again, on the challenge of gullibility. LLMs believe anything you tell them. Any system that attempts to make meaningful decisions on your behalf will run into the same roadblock: how good is a travel agent, or a digital assistant, or even a research tool if it can't distinguish truth from fiction?

Just the other day Google Search was caught serving up an entirely fake description of the non-existent movie "Encanto 2". It turned out to be summarizing an imagined movie listing from a fan fiction wiki.

Prompt injection is a natural consequence of this gullibility. I've seen precious little progress on tackling that problem in 2024, and we've been talking about it since September 2022.

I'm beginning to see the most popular idea of "agents" as dependent on AGI itself. A model that's robust against gullibility is a very tall order indeed.

Evals really matter

Anthropic's Amanda Askell (responsible for much of the work behind Claude's Character):

The boring yet crucial secret behind good system prompts is test-driven development. You don't write down a system prompt and find ways to test it. You write down tests and find a system prompt that passes them.

It's become abundantly clear over the course of 2024 that writing good automated evals for LLM-powered systems is the skill that's most needed to build useful applications on top of these models. If you have a strong eval suite you can adopt new models faster, iterate better and build more reliable and useful product features than your competition.
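To make the "write the tests first" idea concrete, here's a deliberately tiny sketch of an eval loop: a list of cases, a function that scores a system prompt against them, and a pick-the-best step. Everything here is hypothetical. In particular, fake_model is a stand-in I invented so the example runs offline. A real suite would call an actual model and use far richer checks.

```python
from typing import Callable

# Each eval case pairs a user input with a check on the model's reply.
EvalCase = tuple[str, Callable[[str], bool]]


def run_evals(
    system_prompt: str,
    model: Callable[[str, str], str],
    cases: list[EvalCase],
) -> float:
    """Return the fraction of cases whose check passes for this prompt."""
    passed = sum(
        1 for user_input, check in cases
        if check(model(system_prompt, user_input))
    )
    return passed / len(cases)


def fake_model(system_prompt: str, user_input: str) -> str:
    """Hypothetical stand-in for a real LLM call (an assumption, not an API)."""
    if "reply in uppercase" in system_prompt.lower():
        return user_input.upper()
    return user_input


# Write the tests first...
CASES: list[EvalCase] = [
    ("hello", lambda reply: reply == "HELLO"),
    ("pelican", lambda reply: reply.isupper()),
]

# ...then look for a system prompt that passes them.
CANDIDATES = [
    "You are a helpful assistant.",
    "You are a helpful assistant. Reply in uppercase.",
]

scores = {prompt: run_evals(prompt, fake_model, CASES) for prompt in CANDIDATES}
best_prompt = max(scores, key=scores.get)
print(best_prompt, scores[best_prompt])
```

The useful property is that the cases outlive any individual prompt or model: swap in a new model and the same suite tells you whether anything regressed.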

Vercel's Malte Ubl:

When @v0 first came out we were paranoid about protecting the prompt with all kinds of pre and post processing complexity.

We completely pivoted to let it rip. A prompt without the evals, models, and especially UX is like getting a broken ASML machine without a manual

I'm still trying to figure out the best patterns for doing this for my own work. Everyone knows that evals are important, but there remains a lack of great guidance for how to best implement them - I'm tracking this under my evals tag. My SVG pelican riding a bicycle benchmark is a pale imitation of what a real eval suite should look like.

Apple Intelligence is bad, Apple's MLX library is excellent

As a Mac user I've been feeling a lot better about my choice of platform this year.

Last year it felt like my lack of a Linux/Windows machine with an NVIDIA GPU was a huge disadvantage in terms of trying out new models.

On paper, a 64GB Mac should be a great machine for running models due to the way the CPU and GPU can share the same memory. In practice, many models are released as model weights and libraries that reward NVIDIA's CUDA over other platforms.

The llama.cpp ecosystem helped a lot here, but the real breakthrough has been Apple's MLX library, "an array framework for Apple Silicon". It's fantastic.

Apple's mlx-lm Python library supports running a wide range of MLX-compatible models on my Mac, with excellent performance. mlx-community on Hugging Face offers more than 1,000 models that have been converted to the necessary format.

Prince Canuma's excellent, fast moving mlx-vlm project brings vision LLMs to Apple Silicon as well. I used that recently to run Qwen's QvQ.

While MLX is a game changer, Apple's own "Apple Intelligence" features have mostly been a disappointment. I wrote about their initial announcement in June, and I was optimistic that Apple had focused hard on the subset of LLM applications that preserve user privacy and minimize the chance of users getting misled by confusing features.

Now that those features are rolling out they're pretty weak. As an LLM power-user I know what these models are capable of, and Apple's LLM features offer a pale imitation of what a frontier LLM can do. Instead we're getting notification summaries that misrepresent news headlines and writing assistant tools that I've not found useful at all. Genmoji are kind of fun though.

The rise of inference-scaling "reasoning" models

The most interesting development in the final quarter of 2024 was the introduction of a new shape of LLM, exemplified by OpenAI's o1 models - initially released as o1-preview and o1-mini on September 12th.

One way to think about these models is as an extension of the chain-of-thought prompting trick, first explored in the May 2022 paper Large Language Models are Zero-Shot Reasoners.

This is that trick where, if you get a model to talk out loud about a problem it's solving, you often get a result which the model would not have achieved otherwise.

o1 takes this process and further bakes it into the model itself. The details are somewhat obfuscated: o1 models spend "reasoning tokens" thinking through the problem that are not directly visible to the user (though the ChatGPT UI shows a summary of them), then output a final result.

The biggest innovation here is that it opens up a new way to scale a model: instead of improving model performance purely through additional compute at training time, models can now take on harder problems by spending more compute on inference.

The sequel to o1, o3 (they skipped "o2" for European trademark reasons) was announced on 20th December with an impressive result against the ARC-AGI benchmark, albeit one that likely involved more than $1,000,000 of compute time expense!

o3 is expected to ship in January. I doubt many people have real-world problems that would benefit from that level of compute expenditure - I certainly don't! - but it appears to be a genuine next step in LLM architecture for taking on much harder problems.

OpenAI are not the only game in town here. Google released their first entrant in the category, gemini-2.0-flash-thinking-exp, on December 19th.

Alibaba's Qwen team released their QwQ model on November 28th - under an Apache 2.0 license, and that one I could run on my own machine. They followed that up with a vision reasoning model called QvQ on December 24th, which I also ran locally.

DeepSeek made their DeepSeek-R1-Lite-Preview model available to try out through their chat interface on November 20th.

To understand more about inference scaling I recommend Is AI progress slowing down? by Arvind Narayanan and Sayash Kapoor.

Nothing yet from Anthropic or Meta but I would be very surprised if they don't have their own inference-scaling models in the works. Meta published a relevant paper Training Large Language Models to Reason in a Continuous Latent Space in December.

Was the best currently available LLM trained in China for less than $6m?

Not quite, but almost! It does make for a great attention-grabbing headline.

The big news to end the year was the release of DeepSeek v3 - dropped on Hugging Face on Christmas Day without so much as a README file, then followed by documentation and a paper the day after that.

DeepSeek v3 is a huge 685B parameter model - one of the largest openly licensed models currently available, significantly bigger than the largest of Meta's Llama series, Llama 3.1 405B.

Benchmarks put it up there with Claude 3.5 Sonnet. Vibe benchmarks (aka the Chatbot Arena) currently rank it 7th, just behind the Gemini 2.0 and OpenAI 4o/o1 models. This is by far the highest ranking openly licensed model.

The really impressive thing about DeepSeek v3 is the training cost. The model was trained on 2,788,000 H800 GPU hours at an estimated cost of $5,576,000. Llama 3.1 405B trained for 30,840,000 GPU hours - 11x that used by DeepSeek v3, for a model that benchmarks slightly worse.
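Checking that arithmetic takes only a few lines. This sketch uses just the three figures quoted above. The per-GPU-hour rate it prints is derived from those numbers, not a separately sourced figure, and the variable names are my own.

```python
# Figures quoted in the text above.
DEEPSEEK_V3_GPU_HOURS = 2_788_000
DEEPSEEK_V3_COST_USD = 5_576_000
LLAMA_31_405B_GPU_HOURS = 30_840_000

# Rate implied by dividing the estimated cost by the GPU hours.
implied_usd_per_gpu_hour = DEEPSEEK_V3_COST_USD / DEEPSEEK_V3_GPU_HOURS

# How many times more GPU hours Llama 3.1 405B used.
gpu_hour_ratio = LLAMA_31_405B_GPU_HOURS / DEEPSEEK_V3_GPU_HOURS

print(f"Implied rate: ${implied_usd_per_gpu_hour:.2f} per GPU hour")
print(f"Llama 3.1 405B used {gpu_hour_ratio:.1f}x the GPU hours")
```

Note that the two models were trained on different hardware, so GPU hours are not a perfectly like-for-like comparison.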

Those US export regulations on GPUs to China seem to have inspired some very effective training optimizations!

The environmental impact got better

A welcome result of the increased efficiency of the models - both the hosted ones and the ones I can run locally - is that the energy usage and environmental impact of running a prompt has dropped enormously over the past couple of years.

OpenAI themselves are charging 100x less for a prompt compared to the GPT-3 days. I have it on good authority that neither Google Gemini nor Amazon Nova (two of the least expensive model providers) are running prompts at a loss.

I think this means that, as individual users, we don't need to feel any guilt at all for the energy consumed by the vast majority of our prompts. The impact is likely negligible compared to driving a car down the street or maybe even watching a video on YouTube.

Likewise, training. DeepSeek v3 training for less than $6m is a fantastic sign that training costs can and should continue to drop.

For less efficient models I find it useful to compare their energy usage to commercial flights. The largest Llama 3 model cost about the same as a single digit number of fully loaded passenger flights from New York to London. That's certainly not nothing, but once trained that model can be used by millions of people at no extra training cost.

The environmental impact got much, much worse

The much bigger problem here is the enormous competitive buildout of the infrastructure that is imagined to be necessary for these models in the future.

Companies like Google, Meta, Microsoft and Amazon are all spending billions of dollars rolling out new datacenters, with a very material impact on the electricity grid and the environment. There's even talk of spinning up new nuclear power stations, but those can take decades.

Is this infrastructure necessary? DeepSeek v3's $6m training cost and the continued crash in LLM prices might hint that it's not. But would you want to be the big tech executive that argued NOT to build out this infrastructure only to be proven wrong in a few years' time?

An interesting point of comparison here could be the way railways rolled out around the world in the 1800s. Constructing these required enormous investments and had a massive environmental impact, and many of the lines that were built turned out to be unnecessary - sometimes multiple lines from different companies serving the exact same routes!

The resulting bubbles contributed to several financial crashes: see Wikipedia for Panic of 1873, Panic of 1893, Panic of 1901 and the UK's Railway Mania. They left us with a lot of useful infrastructure, a great many bankruptcies and a great deal of environmental damage.

The year of slop

2024 was the year that the word "slop" became a term of art. I wrote about this in May, expanding on this tweet by @deepfates:

Watching in real time as “slop” becomes a term of art. the way that “spam” became the term for unwanted emails, “slop” is going in the dictionary as the term for unwanted AI generated content

I expanded that definition a tiny bit to this:

Slop describes AI-generated content that is both unrequested and unreviewed.

I ended up getting quoted talking about slop in both the Guardian and the NY Times. Here's what I said in the NY Times:

Society needs concise ways to talk about modern A.I. — both the positives and the negatives. ‘Ignore that email, it’s spam,’ and ‘Ignore that article, it’s slop,’ are both useful lessons.

I love the term "slop" because it so succinctly captures one of the ways we should not be using generative AI!

Slop was even in the running for Oxford Word of the Year 2024, but it lost to brain rot.

Synthetic training data works great

An idea that surprisingly seems to have stuck in the public consciousness is that of "model collapse". This was first described in the paper The Curse of Recursion: Training on Generated Data Makes Models Forget in May 2023, and repeated in Nature in July 2024 with the more eye-catching headline AI models collapse when trained on recursively generated data.

The idea is seductive: as the internet floods with AI-generated slop the models themselves will degenerate, feeding on their own output in a way that leads to their inevitable demise!

That's clearly not happening. Instead, we are seeing AI labs increasingly train on synthetic content - deliberately creating artificial data to help steer their models in the right way.

One of the best descriptions I've seen of this comes from the Phi-4 technical report, which included this:

Synthetic data as a substantial component of pretraining is becoming increasingly common, and the Phi series of models has consistently emphasized the importance of synthetic data. Rather than serving as a cheap substitute for organic data, synthetic data has several direct advantages over organic data.

Structured and Gradual Learning. In organic datasets, the relationship between tokens is often complex and indirect. Many reasoning steps may be required to connect the current token to the next, making it challenging for the model to learn effectively from next-token prediction. By contrast, each token generated by a language model is by definition predicted by the preceding tokens, making it easier for a model to follow the resulting reasoning patterns.

Another common technique is to use larger models to help create training data for their smaller, cheaper alternatives - a trick used by an increasing number of labs. DeepSeek v3 used "reasoning" data created by DeepSeek-R1. Meta's Llama 3.3 70B fine-tuning used over 25M synthetically generated examples.

Careful design of the training data that goes into an LLM appears to be the entire game for creating these models. The days of just grabbing a full scrape of the web and indiscriminately dumping it into a training run are long gone.

LLMs somehow got even harder to use

A drum I've been banging for a while is that LLMs are power-user tools - they're chainsaws disguised as kitchen knives. They look deceptively simple to use - how hard can it be to type messages to a chatbot? - but in reality you need a huge depth of both understanding and experience to make the most of them and avoid their many pitfalls.

If anything, this problem got worse in 2024.

We've built computer systems you can talk to in human language, that will answer your questions and usually get them right! ... depending on the question, and how you ask it, and whether it's accurately reflected in the undocumented and secret training set.

The number of available systems has exploded. Different systems have different tools they can apply to your problems - like Python and JavaScript and web search and image generation and maybe even database lookups... so you'd better understand what those tools are, what they can do and how to tell if the LLM used them or not.

Did you know ChatGPT has two entirely different ways of running Python now?

Want to build a Claude Artifact that talks to an external API? You'd better understand CSP and CORS HTTP headers first.

The models may have got more capable, but most of the limitations remained the same. OpenAI's o1 may finally be able to (mostly) count the Rs in strawberry, but its abilities are still limited by its nature as an LLM and the constraints placed on it by the harness it's running in. o1 can't run web searches or use Code Interpreter, but GPT-4o can - both in that same ChatGPT UI. (o1 will pretend to do those things if you ask it to, a regression to the URL hallucinations bug from early 2023).

What are we doing about this? Not much. Most users are thrown in at the deep end. The default LLM chat UI is like taking brand new computer users, dropping them into a Linux terminal and expecting them to figure it all out.

Meanwhile, it's increasingly common for end users to develop wildly inaccurate mental models of how these things work and what they are capable of. I've seen so many examples of people trying to win an argument with a screenshot from ChatGPT - an inherently ludicrous proposition, given the inherent unreliability of these models crossed with the fact that you can get them to say anything if you prompt them right.

There's a flipside to this too: a lot of better informed people have sworn off LLMs entirely because they can't see how anyone could benefit from a tool with so many flaws. The key skill in getting the most out of LLMs is learning to work with tech that is both inherently unreliable and incredibly powerful at the same time. This is a decidedly non-obvious skill to acquire!

There is so much space for helpful education content here, but we need to do a lot better than outsourcing it all to AI grifters with bombastic Twitter threads.

Knowledge is incredibly unevenly distributed

Most people have heard of ChatGPT by now. How many have heard of Claude?

The knowledge gap between the people who actively follow this stuff and the 99% of the population who do not is vast.

The pace of change doesn't help either. In just the past month we've seen general availability of live interfaces where you can point your phone's camera at something and talk about it with your voice... and optionally have it pretend to be Santa. Most self-certified nerds haven't even tried that yet.

Given the ongoing (and potential) impact on society that this technology has, I don't think the size of this gap is healthy. I'd like to see a lot more effort put into improving this.

LLMs need better criticism

A lot of people absolutely hate this stuff. In some of the spaces I hang out (Mastodon, Bluesky, Lobste.rs, even Hacker News on occasion) even suggesting that "LLMs are useful" can be enough to kick off a huge fight.

I get it. There are plenty of reasons to dislike this technology - the environmental impact, the (lack of) ethics of the training data, the lack of reliability, the negative applications, the potential impact on people's jobs.

LLMs absolutely warrant criticism. We need to be talking through these problems, finding ways to mitigate them and helping people learn how to use these tools responsibly in ways where the positive applications outweigh the negative.

I like people who are skeptical of this stuff. The hype has been deafening for more than two years now, and there are enormous quantities of snake oil and misinformation out there. A lot of very bad decisions are being made based on that hype. Being critical is a virtue.

If we want people with decision-making authority to make good decisions about how to apply these tools we first need to acknowledge that there ARE good applications, and then help explain how to put those into practice while avoiding the many unintuitive traps.

(If you still don't think there are any good applications at all I'm not sure why you made it to this point in the article!)

I think telling people that this whole field is environmentally catastrophic plagiarism machines that constantly make things up is doing those people a disservice, no matter how much truth that represents. There is genuine value to be had here, but getting to that value is unintuitive and needs guidance.

Those of us who understand this stuff have a duty to help everyone else figure it out.

Everything tagged "llms" on my blog in 2024

Because I undoubtedly missed a whole bunch of things, here's every long-form post I wrote in 2024 that I tagged with llms:

(This list generated using Django SQL Dashboard with a SQL query written for me by Claude.)

Tags: google, ai, openai, generative-ai, local-llms, llms, anthropic, gemini, meta, llm-reasoning, long-context, ai-energy-usage, coding-agents

Quoting Steven Johnson

2024-11-21T03:03:13+00:00

When we started working on what became NotebookLM in the summer of 2022, we could fit about 1,500 words in the context window. Now we can fit up to 1.5 million words. (And using various other tricks, effectively fit 25 million words.) The emergence of long context models is, I believe, the single most unappreciated AI development of the past two years, at least among the general public. It radically transforms the utility of these models in terms of actual, practical applications.

— Steven Johnson

Tags: google, ai, generative-ai, llms, gemini, notebooklm, long-context

Long context prompting tips

2024-08-26T18:39:27+00:00

Interesting tips here from Anthropic's documentation about how to best prompt Claude to work with longer documents.

Put longform data at the top: Place your long documents and inputs (~20K+ tokens) near the top of your prompt, above your query, instructions, and examples. This can significantly improve Claude’s performance across all models. Queries at the end can improve response quality by up to 30% in tests, especially with complex, multi-document inputs.

It recommends using not-quite-valid-XML to add those documents to those prompts, and using a prompt that asks Claude to extract direct quotes before replying to help it focus its attention on the most relevant information:

Find quotes from the patient records and appointment history that are relevant to diagnosing the patient's reported symptoms. Place these in <quotes> tags. Then, based on these quotes, list all information that would help the doctor diagnose the patient's symptoms. Place your diagnostic information in <info> tags.
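Here's a rough sketch of how those tips might fit together in code: documents wrapped in XML-ish tags at the top, then the quote-extraction instruction, with the actual question at the very end. The function name and the exact tag names are my own illustrative choices, not a required format.

```python
def build_long_context_prompt(
    documents: list[tuple[str, str]],
    question: str,
) -> str:
    """Assemble a prompt with long documents first and the query last.

    documents is a list of (source, content) pairs. Tag names below are
    illustrative assumptions; nothing validates them as real XML.
    """
    parts = ["<documents>"]
    for index, (source, content) in enumerate(documents, start=1):
        parts.append(f'<document index="{index}">')
        parts.append(f"<source>{source}</source>")
        parts.append(f"<document_content>\n{content}\n</document_content>")
        parts.append("</document>")
    parts.append("</documents>")
    parts.append("")
    # Ask for supporting quotes before the answer, to focus attention.
    parts.append(
        "Find quotes from the documents that are relevant to the question "
        "below. Place these in <quotes> tags. Then, based on these quotes, "
        "answer the question. Place your answer in <info> tags."
    )
    parts.append("")
    # The query goes at the end, after all of the longform data.
    parts.append(f"Question: {question}")
    return "\n".join(parts)


prompt = build_long_context_prompt(
    [("notes.txt", "Pelicans are large water birds."),
     ("log.txt", "Spotted three pelicans at the harbor.")],
    "How many pelicans were spotted?",
)
print(prompt)
```

The resulting string can be sent as the user message to whichever client library you are using.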

Via Datasette Discord

Tags: ai, prompt-engineering, generative-ai, llms, anthropic, claude, long-context

Context caching for Google Gemini

2024-05-14T20:42:33+00:00

Another new Gemini feature announced today. Long context models enable answering questions against large chunks of text, but the price of those long prompts can be prohibitive - $3.50/million for Gemini Pro 1.5 up to 128,000 tokens and $7/million beyond that.

Context caching offers a price optimization, where the long prefix prompt can be reused between requests, halving the cost per prompt but at an additional cost of $4.50 / 1 million tokens per hour to keep that context cache warm.

Given that hourly extra charge this isn't a default optimization for all cases, but certain high traffic applications might be able to save quite a bit on their longer prompt systems.
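
To get a feel for where caching starts to pay off, here is a rough break-even sketch using only the prices quoted above. It assumes the cached prefix is billed at exactly half the $3.50/million rate, that the prompt stays under 128,000 tokens, and it ignores both the cost of creating the cache and any uncached tokens after the prefix. The function names are made up for this example.

```python
INPUT_PRICE_PER_M = 3.50  # $ per million input tokens (prompts up to 128,000)
CACHED_PRICE_PER_M = INPUT_PRICE_PER_M / 2  # assumption: caching halves the cost
STORAGE_PRICE_PER_M_HOUR = 4.50  # $ per million cached tokens per hour


def hourly_cost(prefix_tokens, requests_per_hour, cached):
    """Dollar cost per hour of re-sending a shared prompt prefix."""
    millions = prefix_tokens / 1_000_000
    if cached:
        per_request = requests_per_hour * CACHED_PRICE_PER_M
        return millions * (per_request + STORAGE_PRICE_PER_M_HOUR)
    return millions * requests_per_hour * INPUT_PRICE_PER_M


def break_even_requests_per_hour():
    """Requests per hour at which cached and uncached costs are equal."""
    return STORAGE_PRICE_PER_M_HOUR / (INPUT_PRICE_PER_M - CACHED_PRICE_PER_M)


print(round(break_even_requests_per_hour(), 2))  # 2.57
for rate in (2, 3, 100):
    print(
        rate,
        round(hourly_cost(100_000, rate, cached=False), 3),
        round(hourly_cost(100_000, rate, cached=True), 3),
    )
```

Under those assumptions the break-even point does not depend on the prefix length: caching wins once the same prefix is reused three or more times an hour.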

It will be interesting to see if other vendors such as OpenAI and Anthropic offer a similar optimization in the future.

Update 14th August 2024: Anthropic's Claude now has its own version of prompt caching.

Via @officiallogank

Tags: google, ai, prompt-engineering, generative-ai, llms, gemini, llm-pricing, prompt-caching, long-context

llm-gemini 0.1a1

2024-03-28T03:32:15+00:00

llm-gemini 0.1a1

I upgraded my llm-gemini plugin to add support for the new Google Gemini Pro 1.5 model, which is beginning to roll out in early access.

The 1.5 model supports 1,048,576 input tokens and generates up to 8,192 output tokens—a big step up from Gemini 1.0 Pro which handled 30,720 and 2,048 respectively.

The big missing feature from my LLM tool at the moment is image input—a fantastic way to take advantage of that huge context window. I have a branch for this which I really need to get into a useful state.

Tags: google, projects, ai, generative-ai, llms, llm, gemini, long-context

The killer app of Gemini Pro 1.5 is video

2024-02-21T19:20:48+00:00

Last week Google introduced Gemini Pro 1.5, an enormous upgrade to their Gemini series of AI models.

Gemini Pro 1.5 has a 1,000,000 token context size. This is huge - previously that record was held by Claude 2.1 (200,000 tokens) and gpt-4-turbo (128,000 tokens) - though the difference in tokenizer implementations between the models means this isn't a perfectly direct comparison.

I've been playing with Gemini Pro 1.5 for a few days, and I think the most exciting feature isn't so much the token count... it's the ability to use video as an input.

I've been accessing the model through the Google AI Studio interface - I don't have API access yet.

I tried filling up the context with content from my blog, and got some good results. But things got really exciting when I began experimenting with video uploads (after some false starts - the team have been actively fixing bugs, with variable bitrate support landing just this morning).

I took this seven second video of one of my bookshelves:

Then I uploaded it with the following prompt:

JSON array of books in this video

That 7 second video consumed just 1,841 tokens out of my 1,048,576 token limit.

Gemini Pro 1.5 didn't return JSON, but it did reply with this:

It is not possible to fully identify all the books in the video. However, here are some of the books that can be seen:

The Lord of the Rings by J.R.R. Tolkien

Structure and Interpretation of Computer Programs by Harold Abelson and Gerald Jay Sussman

Rework by Jason Fried and David Heinemeier Hansson

The Hacker Ethic and the Spirit of the Information Age by Pekka Himanen

The Google Story by David A. Vise

The Art of Agile Development by James Shore and Shane Warden

Creative Code by John Maeda

Web Operations by John Allspaw and Jesse Robbins

Python Framework by David Mertz

The Personal MBA by Josh Kaufman

Growing Up with Lucy by April Henry

Getting Things Done by David Allen

Site-Seeing: A Visual Approach to Web Usability by Luke Wroblewski

JavaScript: The Definitive Guide by David Flanagan

Building Accessible Websites by Joe Clark

Information Architecture for the World Wide Web by Louis Rosenfeld and Peter Morville

Don't Make Me Think: A Common Sense Approach to Web Usability by Steve Krug

Programming Erlang: Software for a Concurrent World by Joe Armstrong

XSLT 2.0 Web Development by Dmitry Kirsanov

Pragmatic Version Control Using Git by Travis Swicegood

Party Animals: My Family and Other Communists by David Aaronovitch

A follow-up prompt:

as a JSON array of objects, with title and author keys

Gave me that back as JSON:

[
  {
    "title": "The Lord of the Rings",
    "author": "J.R.R. Tolkien"
  },
  {
    "title": "Structure and Interpretation of Computer Programs",
    "author": "Harold Abelson and Gerald Jay Sussman"
  },
  {
    "title": "Rework",
    "author": "Jason Fried and David Heinemeier Hansson"
  },
  {
    "title": "The Hacker Ethic and the Spirit of the Information Age",
    "author": "Pekka Himanen"
  },
  {
    "title": "The Google Story",
    "author": "David A. Vise"
  },
  {
    "title": "The Art of Agile Development",
    "author": "James Shore and Shane Warden"
  },
  {
    "title": "Creative Code",
    "author": "John Maeda"
  },
  {
    "title": "Web Operations",
    "author": "John Allspaw and Jesse Robbins"
  },
  {
    "title": "Python Framework",
    "author": "David Mertz"
  },
  {
    "title": "The Personal MBA",
    "author": "Josh Kaufman"
  },
  {
    "title": "Growing Up with Lucy",
    "author": "April Henry"
  },
  {
    "title": "Getting Things Done",
    "author": "David Allen"
  },
  {
    "title": "Site-Seeing: A Visual Approach to Web Usability",
    "author": "Luke Wroblewski"
  },
  {
    "title": "JavaScript: The Definitive Guide",
    "author": "David Flanagan"
  },
  {
    "title": "Building Accessible Websites",
    "author": "Joe Clark"
  },
  {
    "title": "Information Architecture for the World Wide Web",
    "author": "Louis Rosenfeld and Peter Morville"
  },
  {
    "title": "Don't Make Me Think: A Common Sense Approach to Web Usability",
    "author": "Steve Krug"
  },
  {
    "title": "Programming Erlang: Software for a Concurrent World",
    "author": "Joe Armstrong"
  },
  {
    "title": "XSLT 2.0 Web Development",
    "author": "Dmitry Kirsanov"
  },
  {
    "title": "Pragmatic Version Control Using Git",
    "author": "Travis Swicegood"
  },
  {
    "title": "Party Animals: My Family and Other Communists",
    "author": "David Aaronovitch"
  }
]

So how well did it do?

Honestly, I'm pretty astonished by this. The video is 7s long, it pans quite fast (there is some motion blur in the video) and some of the books are partially obscured by items.

It didn't get all of them, but it did about as good a job as I could have done given the same video.

I'm particularly impressed that it got "Site-Seeing: A Visual Approach to Web Usability" by Luke Wroblewski, where the book itself is almost entirely obscured by a squirrel leaving only "Site-Seeing" visible.

I did spot one hallucination: it lists "The Personal MBA by Josh Kaufman", but I don't own that book - the closest thing in the video is a blurry few frames of a book called "The Beermat Entrepreneur".

I decided to try it again with a different video and hopefully a better prompt. This time I videoed a bookshelf mostly full of cookbooks - in portrait, a bit longer (22 seconds) and panning down as well as across.

This video cost me 6,049 tokens - still a pretty tiny allocation.

My new prompt was:

Output a JSON array of {"title": "...", "authors": "..."} objects for books in this video

And.... it refused:

It looks like the safety filter may have taken offense to the word "Cocktail"!

I opened up the safety settings, dialled them down to "low" for every category and tried again. It appeared to refuse a second time.

So I channelled Mrs Doyle and said:

go on give me that JSON

And it worked!

[
  {
    "title": "The Cocktail Guy",
    "authors": "Rich Woods"
  },
  {
    "title": "Smuggler's Cove",
    "authors": "Martin Cate"
  },
  {
    "title": "The Curious Bartender",
    "authors": "Tristan Stephenson"
  },
  {
    "title": "The Art of Fermentation",
    "authors": "Sandor Ellix Katz"
  },
  {
    "title": "The Noma Guide to Fermentation",
    "authors": "René Redzepi & David Zilber"
  },
  {
    "title": "Cocktails #7",
    "authors": "Difford's Guide"
  },
  {
    "title": "The Curious Cookbook",
    "authors": ""
  },
  {
    "title": "Rival Crock Pot Cooking",
    "authors": ""
  },
  {
    "title": "The Art of French Cooking",
    "authors": "Fernande Garvin"
  },
  {
    "title": "The Book on Pie",
    "authors": "Erin Jeanne McDowell"
  },
  {
    "title": "The New Taste of Chocolate",
    "authors": ""
  },
  {
    "title": "Vegan Cakes and Bakes",
    "authors": "Jérôme Eckmeier & Daniela Lais"
  },
  {
    "title": "Kitchen Creamery",
    "authors": ""
  },
  {
    "title": "Good Food 101 Teatime Treats",
    "authors": "BBC"
  },
  {
    "title": "Betty Crocker's Cookbook",
    "authors": ""
  },
  {
    "title": "The Martha Stewart Cookbook",
    "authors": ""
  },
  {
    "title": "Feast",
    "authors": "Nigella Lawson"
  },
  {
    "title": "Moosewood Restaurant New Classics",
    "authors": ""
  },
  {
    "title": "World Food Café",
    "authors": "Chris & Carolyn Caldicott"
  },
  {
    "title": "Everyday Thai Cooking",
    "authors": "Katie Chin"
  },
  {
    "title": "Vegetarian Indian Cooking with Instant Pot",
    "authors": "Manali Singh"
  },
  {
    "title": "The Southern Vegetarian Cookbook",
    "authors": "Justin Fox Burks & Amy Lawrence"
  },
  {
    "title": "Vegetarian Cookbook",
    "authors": ""
  },
  {
    "title": "Französische Küche",
    "authors": ""
  },
  {
    "title": "Sushi-Making at Home",
    "authors": ""
  },
  {
    "title": "Kosher Cooking",
    "authors": ""
  },
  {
    "title": "The New Empanadas",
    "authors": "Marlena Spieler"
  },
  {
    "title": "Instant Pot Vegetarian Cookbook for Two",
    "authors": ""
  },
  {
    "title": "Vegetarian",
    "authors": "Wilkes & Cartwright"
  },
  {
    "title": "Breakfast",
    "authors": ""
  },
  {
    "title": "Nadiya's Kitchen",
    "authors": "Nadiya Hussain"
  },
  {
    "title": "New Food for Thought",
    "authors": "Jane Noraika"
  },
  {
    "title": "Beyond Curry Indian Cookbook",
    "authors": "D'Silva Sankalp"
  },
  {
    "title": "The 5 O'Clock Cookbook",
    "authors": ""
  },
  {
    "title": "Food Lab",
    "authors": "J. Kenji López-Alt"
  },
  {
    "title": "The Cook's Encyclopedia",
    "authors": ""
  },
  {
    "title": "The Cast Iron Nation",
    "authors": "Lodge"
  },
  {
    "title": "Urban Cook Book",
    "authors": ""
  },
  {
    "title": "In Search of Perfection",
    "authors": "Heston Blumenthal"
  },
  {
    "title": "Perfection",
    "authors": "Heston Blumenthal"
  },
  {
    "title": "An Economist Gets Lunch",
    "authors": "Tyler Cowen"
  },
  {
    "title": "The Colman's Mustard Cookbook",
    "authors": "Pam Hartley"
  },
  {
    "title": "The Student Grub Guide",
    "authors": "Williams"
  },
  {
    "title": "Easy Meals for One & Two",
    "authors": ""
  },
  {
    "title": "Jack Monroe Tin Can Cook",
    "authors": ""
  },
  {
    "title": "Slow Cooker",
    "authors": ""
  },
  {
    "title": "The Students' Sausage, Egg, and Beans Cookbook",
    "authors": ""
  },
  {
    "title": "Quick & Easy Students' Cookbook",
    "authors": ""
  },
  {
    "title": "Student Cookbook Guide",
    "authors": ""
  },
  {
    "title": "The Best Little Marinades Cookbook",
    "authors": "Adler"
  },
  {
    "title": "The New Book of Middle Eastern Food",
    "authors": "Claudia Roden"
  },
  {
    "title": "Vegetarian Meals",
    "authors": "Rosamond Richardson"
  },
  {
    "title": "Girl! Mother Tells You How",
    "authors": ""
  }
]

Once again, I find those results pretty astounding.

What to make of this

The ability to extract structured content from text is already one of the most exciting use-cases for LLMs. GPT-4 Vision and LLaVA expanded that to images. And now Gemini Pro 1.5 expands that to video.

The ability to analyze video like this feels SO powerful. Being able to take a 20 second video of a bookshelf and get back a JSON array of those books is just the first thing I thought to try.

The usual LLM caveats apply. It can miss things and it can hallucinate incorrect details. Half of the work in making the most of this class of technology is figuring out how to work around these limitations, but I feel like we're making good progress on that.
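
One small, practical workaround is to treat the model's reply as untrusted input: check that it really is JSON in the expected shape, and flag incomplete entries for a human to look at. The sketch below assumes the `title` / `authors` keys from the second prompt. The function names are hypothetical, and this only catches structural problems, not hallucinated titles.

```python
import json


def parse_books(reply):
    """Parse a model reply into a list of book dicts, or raise ValueError."""
    try:
        data = json.loads(reply)
    except json.JSONDecodeError as exc:
        # The model may answer in prose instead of JSON.
        raise ValueError("reply was not valid JSON") from exc
    if not isinstance(data, list):
        raise ValueError("expected a JSON array")
    books = []
    for item in data:
        if not isinstance(item, dict) or not isinstance(item.get("title"), str):
            raise ValueError(f"unexpected item: {item!r}")
        authors = item.get("authors", "")
        if not isinstance(authors, str):
            authors = ""
        books.append({"title": item["title"], "authors": authors})
    return books


def needs_review(books):
    """Titles with no author, which are worth checking by hand."""
    return [book["title"] for book in books if not book["authors"].strip()]


reply = """[
  {"title": "Feast", "authors": "Nigella Lawson"},
  {"title": "Kosher Cooking", "authors": ""}
]"""
books = parse_books(reply)
print(needs_review(books))  # ['Kosher Cooking']
```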

There's also the issue with the safety filters. As input to these models gets longer, the chance of something triggering a filter (like the first four letters of the word "cocktail") goes up.

So, as always with modern AI, there are still plenty of challenges to overcome.

But this really does feel like another one of those glimpses of a future that's suddenly far closer than I expected it to be.

A note on images vs. video

Initially I had assumed that video was handled differently from images, due partly to the surprisingly (to me) low token counts involved in processing a video.

This thread on Hacker News convinced me otherwise.

From this blog post:

Gemini 1.5 Pro can also reason across up to 1 hour of video. When you attach a video, Google AI Studio breaks it down into thousands of frames (without audio), and then you can perform highly sophisticated reasoning and problem-solving tasks since the Gemini models are multimodal.

Then in the Gemini 1.5 technical report:

When prompted with a 45 minute Buster Keaton movie “Sherlock Jr." (1924) (2,674 frames at 1FPS, 684k tokens), Gemini 1.5 Pro retrieves and extracts textual information from a specific frame in and provides the corresponding timestamp.

I ran my own experiment: I grabbed a frame from my video and uploaded that to Gemini in a new prompt.

That's 258 tokens for a single image.

Using the numbers from the Buster Keaton example, 684,000 tokens / 2,674 frames = 256 tokens per frame. So it looks like it really does work by breaking down the video into individual frames and processing each one as an image.

For my own videos: 1,841 / 258 = 7.14 (the 7s video) and 6,049 / 258 = 23.45 (the 22s video) - which makes me believe that videos are split up into one frame per second and each frame costs ~258 tokens.
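
The same arithmetic as a short script, in case you want to plug in your own numbers. The 258 tokens per frame and one frame per second figures are inferred from the measurements above rather than documented behaviour, so treat `estimate_video_tokens` (a name made up for this sketch) as a rough estimate only.

```python
TOKENS_PER_IMAGE = 258  # measured for a single uploaded frame


def implied_frames(video_tokens, tokens_per_frame=TOKENS_PER_IMAGE):
    """How many image-sized frames a video's token count corresponds to."""
    return video_tokens / tokens_per_frame


def estimate_video_tokens(seconds, fps=1, tokens_per_frame=TOKENS_PER_IMAGE):
    """Rough token estimate, assuming one frame is sampled per second."""
    return round(seconds * fps * tokens_per_frame)


for label, tokens in [("7s video", 1_841), ("22s video", 6_049)]:
    print(label, round(implied_frames(tokens), 2))
# 7s video 7.14
# 22s video 23.45

# Tokens per frame implied by the Buster Keaton example.
print(round(684_000 / 2_674, 1))  # 255.8

# An hour of video at 1 FPS.
print(estimate_video_tokens(60 * 60))  # 928800
```

That last figure fits under the 1,048,576 token limit, which is consistent with the "up to 1 hour of video" claim.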

As a further experiment I grabbed a snapshot of the first frame of the video and uploaded that as a plain image. This time I didn't even include a prompt:

Gemini Pro 1.5 guessed what I wanted and replied:

The books in the image are:

The Lord of the Rings by J.R.R. Tolkien

Structure and Interpretation of Computer Programs by Harold Abelson and Gerald Jay Sussman

The Source by James A. Michener

Rework by Jason Fried and David Heinemeier Hansson

The Hacker Ethic and the Spirit of the Information Age by Pekka Himanen

Google by David A. Vise

The Design of Everyday Things by Don Norman

Linux Server Hacks by Evi Nemeth et al.

The C Programming Language by Brian W. Kernighan and Dennis M. Ritchie

High Output Management by Andrew S. Grove

Mapping Hacks by Schuyler Erle et al.

The image input was 258 tokens, the total token count after the response was 410 tokens - so 152 tokens for the response from the model. Those image tokens pack in a lot of information!

Tags: google, ai, generative-ai, llms, gemini, long-context, cocktails

Our next-generation model: Gemini 1.5

2024-02-15T16:17:42+00:00

Our next-generation model: Gemini 1.5

The big news here is about context length: Gemini 1.5 (a Mixture-of-Experts model) will do 128,000 tokens in general release, is available in limited preview with a 1 million token context, and has shown promising research results with 10 million tokens!

1 million tokens is 700,000 words or around 7 novels—also described in the blog post as an hour of video or 11 hours of audio.

Via Jeff Dean

Tags: google, ai, generative-ai, llms, gemini, vision-llms, long-context, llm-release

Long context prompting for Claude 2.1

2023-12-06T23:44:49+00:00

Long context prompting for Claude 2.1

Claude 2.1 has a 200,000 token context, enough for around 500 pages of text. Convincing it to answer a question based on a single sentence buried deep within that content can be difficult, but Anthropic found that adding “Assistant: Here is the most relevant sentence in the context:” to the end of the prompt was enough to raise Claude 2.1’s score from 27% to 98% on their evaluation.
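
A minimal sketch of that trick, assuming the legacy `Human:` / `Assistant:` text-completion prompt format. The only part taken from Anthropic's write-up is the sentence that starts the assistant's reply; the function name and the layout of the rest of the prompt are assumptions.

```python
PREFILL = "Here is the most relevant sentence in the context:"


def build_claude_21_prompt(context, question):
    """Build a prompt that ends by starting the assistant's answer for it."""
    return (
        f"\n\nHuman: {context}\n\n{question}"
        # Starting the reply steers the model to quote the relevant
        # sentence first instead of declining to answer.
        f"\n\nAssistant: {PREFILL}"
    )


prompt = build_claude_21_prompt(
    "<long document text goes here>",
    "What is the best thing to do in San Francisco?",
)
print(prompt)
```

The model then continues from the end of that string, so its first output is the sentence it considers most relevant.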