Simon Willison's Weblog: prompt-caching

Gemini 2.5 Models now support implicit caching

2025-05-09T02:46:52+00:00

Gemini 2.5 Models now support implicit caching

I just spotted a cacheTokensDetails key in the token usage JSON while running a long chain of prompts against Gemini 2.5 Flash - despite not configuring caching myself:

{"cachedContentTokenCount": 200658, "promptTokensDetails": [{"modality": "TEXT", "tokenCount": 204082}], "cacheTokensDetails": [{"modality": "TEXT", "tokenCount": 200658}], "thoughtsTokenCount": 2326}

I went searching and it turns out Gemini had a massive upgrade to their prompt caching earlier today:

Implicit caching directly passes cache cost savings to developers without the need to create an explicit cache. Now, when you send a request to one of the Gemini 2.5 models, if the request shares a common prefix as one of previous requests, then it’s eligible for a cache hit. We will dynamically pass cost savings back to you, providing the same 75% token discount. [...]

To make more requests eligible for cache hits, we reduced the minimum request size for 2.5 Flash to 1024 tokens and 2.5 Pro to 2048 tokens.

Previously you needed to both explicitly configure the cache and pay a per-hour charge to keep that cache warm.

This new mechanism is so much more convenient! It imitates how both DeepSeek and OpenAI implement prompt caching, leaving Anthropic as the remaining large provider who require you to manually configure prompt caching to get it to work.

Gemini's explicit caching mechanism is still available. The documentation says:

Explicit caching is useful in cases where you want to guarantee cost savings, but with some added developer work.

With implicit caching the cost savings aren't possible to predict in advance, especially since the cache timeout within which a prefix will be discounted isn't described and presumably varies based on load and other circumstances outside of the developer's control.

Update: DeepMind's Philipp Schmid:

There is no fixed time, but it's should be a few minutes.

Tags: ai, prompt-engineering, generative-ai, llms, gemini, llm-pricing, prompt-caching

Quoting Alex Albert

2025-01-16T16:14:27+00:00

We've adjusted prompt caching so that you now only need to specify cache write points in your prompts - we'll automatically check for cache hits at previous positions. No more manual tracking of read locations needed.

— Alex Albert, Anthropic

Tags: ai, generative-ai, llms, anthropic, claude, alex-albert, prompt-caching

Anthropic: Message Batches (beta)

2024-10-08T18:18:57+00:00

Anthropic: Message Batches (beta)

Anthropic now have a batch mode, allowing you to send prompts to Claude in batches which will be processed within 24 hours (though probably much faster than that) and come at a 50% price discount.

This matches the batch models offered by OpenAI and by Google Gemini, both of which also provide a 50% discount.

Update 15th October 2024: Alex Albert confirms that Anthropic batching and prompt caching can be combined:

Don't know if folks have realized yet that you can get close to a 95% discount on Claude 3.5 Sonnet tokens when you combine prompt caching with the new Batches API

Via @alexalbert__

Tags: ai, openai, generative-ai, llms, anthropic, claude, gemini, alex-albert, prompt-caching

Gemini 1.5 Flash-8B is now production ready

2024-10-03T20:16:36+00:00

Gemini 1.5 Flash-8B is now production ready

Gemini 1.5 Flash-8B is "a smaller and faster variant of 1.5 Flash" - and is now released to production, at half the price of the 1.5 Flash model.

It's really, really cheap:

$0.0375 per 1 million input tokens on prompts <128K
$0.15 per 1 million output tokens on prompts <128K
$0.01 per 1 million input tokens on cached prompts <128K

Prices are doubled for prompts longer than 128K.

I believe images are still charged at a flat rate of 258 tokens, which I think means a single non-cached image with Flash should cost 0.00097 cents - a number so tiny I'm doubting if I got the calculation right.

OpenAI's cheapest model remains GPT-4o mini, at $0.15/1M input - though that drops to half of that for reused prompt prefixes thanks to their new prompt caching feature (or by half if you use batches, though those can’t be combined with OpenAI prompt caching. Gemini also offer half-off for batched requests).

Anthropic's cheapest model is still Claude 3 Haiku at $0.25/M, though that drops to $0.03/M for cached tokens (if you configure them correctly).

I've released llm-gemini 0.2 with support for the new model:

llm install -U llm-gemini
llm keys set gemini
# Paste API key here
llm -m gemini-1.5-flash-8b-latest "say hi"

Via @OfficialLoganK

Tags: google, ai, openai, generative-ai, llms, llm, anthropic, gemini, vision-llms, llm-pricing, prompt-caching, llm-release

OpenAI DevDay: Let’s build developer tools, not digital God

2024-10-02T22:33:13+00:00

I had a fun time live blogging OpenAI DevDay yesterday - I’ve now shared notes about the live blogging system I threw other in a hurry on the day (with assistance from Claude and GPT-4o). Now that the smoke has settled a little, here are my impressions from the event.

Compared to last year
Prompt caching, aka the big price drop
GPT-4o audio via the new WebSocket Realtime API
Model distillation is fine-tuning made much easier
Let’s build developer tools, not digital God

Compared to last year

Comparison with the first DevDay in November 2023 are unavoidable. That event was much more keynote-driven: just in the keynote OpenAI released GPT-4 vision, and Assistants, and GPTs, and GPT-4 Turbo (with a massive price drop), and their text-to-speech API. It felt more like a launch-focused product event than something explicitly for developers.

This year was different. Media weren’t invited, there was no livestream, Sam Altman didn’t present the opening keynote (he was interviewed at the end of the day instead) and the new features, while impressive, were not as abundant.

Several features were released in the last few months that could have been saved for DevDay: GPT-4o mini and the o1 model family are two examples. I’m personally happy that OpenAI are shipping features like that as they become ready rather than holding them back for an event.

I’m a bit surprised they didn’t talk about Whisper Turbo at the conference though, released just the day before - especially since that’s one of the few pieces of technology they release under an open source (MIT) license.

This was clearly intended as an event by developers, for developers. If you don’t build software on top of OpenAI’s platform there wasn’t much to catch your attention here.

As someone who does build software on top of OpenAI, there was a ton of valuable and interesting stuff.

Prompt caching, aka the big price drop

I was hoping we might see a price drop, seeing as there’s an ongoing pricing war between Gemini, Anthropic and OpenAI. We got one in an interesting shape: a 50% discount on input tokens for prompts with a shared prefix.

This isn’t a new idea: both Google Gemini and Claude offer a form of prompt caching discount, if you configure them correctly and make smart decisions about when and how the cache should come into effect.

The difference here is that OpenAI apply the discount automatically:

API calls to supported models will automatically benefit from Prompt Caching on prompts longer than 1,024 tokens. The API caches the longest prefix of a prompt that has been previously computed, starting at 1,024 tokens and increasing in 128-token increments. If you reuse prompts with common prefixes, we will automatically apply the Prompt Caching discount without requiring you to make any changes to your API integration.

50% off repeated long prompts is a pretty significant price reduction!

Anthropic's Claude implementation saves more money: 90% off rather than 50% - but is significantly more work to put into play.

Gemini’s caching requires you to pay per hour to keep your cache warm which makes it extremely difficult to effectively build against in comparison to the other two.

It's worth noting that OpenAI are not the first company to offer automated caching discounts: DeepSeek have offered that through their API for a few months.

GPT-4o audio via the new WebSocket Realtime API

Absolutely the biggest announcement of the conference: the new Realtime API is effectively the API version of ChatGPT advanced voice mode, a user-facing feature that finally rolled out to everyone just a week ago.

This means we can finally tap directly into GPT-4o’s multimodal audio support: we can send audio directly into the model (without first transcribing it to text via something like Whisper), and we can have it directly return speech without needing to run a separate text-to-speech model.

The way they chose to expose this is interesting: it’s not (yet) part of their existing chat completions API, instead using an entirely new API pattern built around WebSockets.

They designed it like that because they wanted it to be as realtime as possible: the API lets you constantly stream audio and text in both directions, and even supports allowing users to speak over and interrupt the model!

So far the Realtime API supports text, audio and function call / tool usage - but doesn't (yet) support image input (I've been assured that's coming soon). The combination of audio and function calling is super exciting alone though - several of the demos at DevDay used these to build fun voice-driven interactive web applications.

I like this WebSocket-focused API design a lot. My only hesitation is that, since an API key is needed to open a WebSocket connection, actually running this in production involves spinning up an authenticating WebSocket proxy. I hope OpenAI can provide a less code-intensive way of solving this in the future.

Code they showed during the event demonstrated using the native browser WebSocket class directly, but I can't find those code examples online now. I hope they publish it soon. For the moment the best things to look at are the openai-realtime-api-beta and openai-realtime-console repositories.

The new playground/realtime debugging tool - the OpenAI playground for the Realtime API - is a lot of fun to try out too.

Model distillation is fine-tuning made much easier

The other big developer-facing announcements were around model distillation, which to be honest is more of a usability enhancement and minor rebranding of their existing fine-tuning features.

OpenAI have offered fine-tuning for a few years now, most recently against their GPT-4o and GPT-4o mini models. They’ve practically been begging people to try it out, offering generous free tiers in previous months:

Today [August 20th 2024] we’re launching fine-tuning for GPT-4o, one of the most requested features from developers. We are also offering 1M training tokens per day for free for every organization through September 23.

That free offer has now been extended. A footnote on the pricing page today:

Fine-tuning for GPT-4o and GPT-4o mini is free up to a daily token limit through October 31, 2024. For GPT-4o, each qualifying org gets up to 1M complimentary training tokens daily and any overage will be charged at the normal rate of $25.00/1M tokens. For GPT-4o mini, each qualifying org gets up to 2M complimentary training tokens daily and any overage will be charged at the normal rate of $3.00/1M tokens

The problem with fine-tuning is that it’s really hard to do effectively. I tried it a couple of years ago myself against GPT-3 - just to apply tags to my blog content - and got disappointing results which deterred me from spending more money iterating on the process.

To fine-tune a model effectively you need to gather a high quality set of examples and you need to construct a robust set of automated evaluations. These are some of the most challenging (and least well understood) problems in the whole nascent field of prompt engineering.

OpenAI’s solution is a bit of a rebrand. “Model distillation” is a form of fine-tuning where you effectively teach a smaller model how to do a task based on examples generated by a larger model. It’s a very effective technique. Meta recently boasted about how their impressive Llama 3.2 1B and 3B models were “taught” by their larger models:

[...] powerful teacher models can be leveraged to create smaller models that have improved performance. We used two methods—pruning and distillation—on the 1B and 3B models, making them the first highly capable lightweight Llama models that can fit on devices efficiently.

Yesterday OpenAI released two new features to help developers implement this pattern.

The first is stored completions. You can now pass a "store": true parameter to have OpenAI permanently store your prompt and its response in their backend, optionally with your own additional tags to help you filter the captured data later.

You can view your stored completions at platform.openai.com/chat-completions.

I’ve been doing effectively the same thing with my LLM command-line tool logging to a SQLite database for over a year now. It's a really productive pattern.

OpenAI pitch stored completions as a great way to collect a set of training data from their large models that you can later use to fine-tune (aka distill into) a smaller model.

The second, even more impactful feature, is evals. You can now define and run comprehensive prompt evaluations directly inside the OpenAI platform.

OpenAI’s new eval tool competes directly with a bunch of existing startups - I’m quite glad I didn’t invest much effort in this space myself!

The combination of evals and stored completions certainly seems like it should make the challenge of fine-tuning a custom model far more tractable.

The other fine-tuning announcement, greeted by applause in the room, was fine-tuning for images. This has always felt like one of the most obviously beneficial fine-tuning use-cases for me, since it’s much harder to get great image recognition results from sophisticated prompting alone.

From a strategic point of view this makes sense as well: it has become increasingly clear over the last year that many prompts are inherently transferable between models - it’s very easy to take an application with prompts designed for GPT-4o and switch it to Claude or Gemini or Llama with few if any changes required.

A fine-tuned model on the OpenAI platform is likely to be far more sticky.

Let’s build developer tools, not digital God

In the last session of the day I furiously live blogged the Fireside Chat between Sam Altman and Kevin Weil, trying to capture as much of what they were saying as possible.

A bunch of the questions were about AGI. I’m personally quite uninterested in AGI: it’s always felt a bit too much like science fiction for me. I want useful AI-driven tools that help me solve the problems I want to solve.

One point of frustration: Sam referenced OpenAI’s five-level framework a few times. I found several news stories (many paywalled - here's one that isn't) about it but I can’t find a definitive URL on an OpenAI site that explains what it is! This is why you should always Give people something to link to so they can talk about your features and ideas.

Both Sam and Kevin seemed to be leaning away from AGI as a term. From my live blog notes (which paraphrase what was said unless I use quotation marks):

Sam says they're trying to avoid the term now because it has become so over-loaded. Instead they think about their new five steps framework.

"I feel a little bit less certain on that" with respect to the idea that an AGI will make a new scientific discovery.

Kevin: "There used to be this idea of AGI as a binary thing [...] I don't think that's how think about it any more".

Sam: Most people looking back in history won't agree when AGI happened. The turing test wooshed past and nobody cared.

I for one found this very reassuring. The thing I want from OpenAI is more of what we got yesterday: I want platform tools that I can build unique software on top of which I colud not have built previously.

If the ongoing, well-documented internal turmoil at OpenAI from the last year is a result of the organization reprioritizing towards shipping useful, reliable tools for developers (and consumers) over attempting to build a digital God, then I’m all for it.

And yet… OpenAI just this morning finalized a raise of another $6.5 billion dollars at a staggering $157 billion post-money valuation. That feels more like a digital God valuation to me than a platform for developers in an increasingly competitive space.

Tags: websockets, ai, openai, generative-ai, llms, prompt-caching

Introducing Contextual Retrieval

2024-09-20T01:34:21+00:00

Introducing Contextual Retrieval

Here's an interesting new embedding/RAG technique, described by Anthropic but it should work for any embedding model against any other LLM.

One of the big challenges in implementing semantic search against vector embeddings - often used as part of a RAG system - is creating "chunks" of documents that are most likely to semantically match queries from users.

Anthropic provide this solid example where semantic chunks might let you down:

Imagine you had a collection of financial information (say, U.S. SEC filings) embedded in your knowledge base, and you received the following question: "What was the revenue growth for ACME Corp in Q2 2023?"

A relevant chunk might contain the text: "The company's revenue grew by 3% over the previous quarter." However, this chunk on its own doesn't specify which company it's referring to or the relevant time period, making it difficult to retrieve the right information or use the information effectively.

Their proposed solution is to take each chunk at indexing time and expand it using an LLM - so the above sentence would become this instead:

This chunk is from an SEC filing on ACME corp's performance in Q2 2023; the previous quarter's revenue was $314 million. The company's revenue grew by 3% over the previous quarter.

This chunk was created by Claude 3 Haiku (their least expensive model) using the following prompt template:

<document>
{{WHOLE_DOCUMENT}}
</document>
Here is the chunk we want to situate within the whole document
<chunk>
{{CHUNK_CONTENT}}
</chunk>
Please give a short succinct context to situate this chunk within the overall document for the purposes of improving search retrieval of the chunk. Answer only with the succinct context and nothing else.

Here's the really clever bit: running the above prompt for every chunk in a document could get really expensive thanks to the inclusion of the entire document in each prompt. Claude added context caching last month, which allows you to pay around 1/10th of the cost for tokens cached up to your specified beakpoint.

By Anthropic's calculations:

Assuming 800 token chunks, 8k token documents, 50 token context instructions, and 100 tokens of context per chunk, the one-time cost to generate contextualized chunks is $1.02 per million document tokens.

Anthropic provide a detailed notebook demonstrating an implementation of this pattern. Their eventual solution combines cosine similarity and BM25 indexing, uses embeddings from Voyage AI and adds a reranking step powered by Cohere.

The notebook also includes an evaluation set using JSONL - here's that evaluation data in Datasette Lite.

Via Alex Albert

Tags: search, ai, prompt-engineering, generative-ai, vector-search, llms, embeddings, anthropic, claude, rag, prompt-caching

Quoting Alex Albert

2024-08-15T18:09:04+00:00

Examples are the #1 thing I recommend people use in their prompts because they work so well. The problem is that adding tons of examples increases your API costs and latency. Prompt caching fixes this. You can now add tons of examples to every prompt and create an alternative to a model finetuned on your task with basically zero cost/latency increase. […]

This works even better with smaller models. You can generate tons of examples (test case + solution) with 3.5 Sonnet and then use those examples to create a few-shot prompt for Haiku.

— Alex Albert

Tags: ai, prompt-engineering, llms, anthropic, claude, alex-albert, claude-3-5-sonnet, prompt-caching

DeepSeek API introduces Context Caching on Disk

2024-08-14T20:48:32+00:00

DeepSeek API introduces Context Caching on Disk

I wrote about Claude prompt caching this morning. It turns out Chinese LLM lab DeepSeek released their own implementation of context caching a couple of weeks ago, with the simplest possible pricing model: it's just turned on by default for all users.

When duplicate inputs are detected, the repeated parts are retrieved from the cache, bypassing the need for recomputation. This not only reduces service latency but also significantly cuts down on overall usage costs.

For cache hits, DeepSeek charges $0.014 per million tokens, slashing API costs by up to 90%.

[...]

The disk caching service is now available for all users, requiring no code or interface changes. The cache service runs automatically, and billing is based on actual cache hits.

DeepSeek currently offer two frontier models, DeepSeek-V2 and DeepSeek-Coder-V2, both of which can be run as open weights models or accessed via their API.

Via Alex Bradbury

Tags: ai, generative-ai, llms, deepseek, prompt-caching, ai-in-china

Prompt caching with Claude

2024-08-14T17:07:35+00:00

Prompt caching with Claude

The Claude API now supports prompt caching, allowing you to mark reused portions of long prompts (like a large document provided as context). Claude will cache these for up to five minutes, and any prompts within that five minutes that reuse the context will be both significantly faster and will be charged at a significant discount: ~10% of the cost of sending those uncached tokens.

Writing to the cache costs money. The cache TTL is reset every time it gets a cache hit, so any application running more than one prompt every five minutes should see significant price decreases from this. If you app prompts less than once every five minutes you'll be losing money.

This is similar to Google Gemini's context caching feature, but the pricing model works differently. Gemini charge $4.50/million tokens/hour for their caching (that's for Gemini 1.5 Pro - Gemini 1.5 Flash is $1/million/hour), for a quarter price discount on input tokens (see their pricing).

Claude’s implementation also appears designed to help with ongoing conversations. Using caching during an individual user’s multi-turn conversation - where a full copy of the entire transcript is sent with each new prompt - could help even for very low traffic (or even single user) applications.

Here's the full documentation for the new Claude caching feature, currently only enabled if you pass "anthropic-beta: prompt-caching-2024-07-31" as an HTTP header.

Interesting to note that this caching implementation doesn't save on HTTP overhead: if you have 1MB of context you still need to send a 1MB HTTP request for every call. I guess the overhead of that HTTP traffic is negligible compared to the overhead of processing those tokens once they arrive.

One minor annoyance in the announcement for this feature:

Detailed instruction sets: Share extensive lists of instructions, procedures, and examples to fine-tune Claude's responses. [...]

I wish Anthropic wouldn't use the term "fine-tune" in this context (they do the same thing in their tweet). This feature is unrelated to model fine-tuning (a feature Claude provides via AWS Bedrock). People find this terminology confusing already, frequently misinterpreting "fine-tuning" as being the same thing as "tweaking your prompt until it works better", and Anthropic's language here doesn't help.

Via @AnthropicAI

Tags: ai, prompt-engineering, generative-ai, llms, anthropic, claude, gemini, llm-pricing, prompt-caching

Context caching for Google Gemini

2024-05-14T20:42:33+00:00

Context caching for Google Gemini

Another new Gemini feature announced today. Long context models enable answering questions against large chunks of text, but the price of those long prompts can be prohibitive - $3.50/million for Gemini Pro 1.5 up to 128,000 tokens and $7/million beyond that.

Context caching offers a price optimization, where the long prefix prompt can be reused between requests, halving the cost per prompt but at an additional cost of $4.50 / 1 million tokens per hour to keep that context cache warm.

Given that hourly extra charge this isn't a default optimization for all cases, but certain high traffic applications might be able to save quite a bit on their longer prompt systems.

It will be interesting to see if other vendors such as OpenAI and Anthropic offer a similar optimization in the future.

Update 14th August 2024: Anthropic's Claude now has its own version of prompt caching.

Via @officiallogank

Tags: google, ai, prompt-engineering, generative-ai, llms, gemini, llm-pricing, prompt-caching, long-context