Simon Willison's Weblog: tokenization

Claude Token Counter, now with model comparisons

2026-04-20T00:50:45+00:00

Claude Token Counter, now with model comparisons

I upgraded my Claude Token Counter tool to add the ability to run the same count against different models in order to compare them.

As far as I can tell Claude Opus 4.7 is the first model to change the tokenizer, so it's only worth running comparisons between 4.7 and 4.6. The Claude token counting API accepts any Claude model ID though so I've included options for all four of the notable current models (Opus 4.7 and 4.6, Sonnet 4.6, and Haiku 4.5).

In the Opus 4.7 announcement Anthropic said:

Opus 4.7 uses an updated tokenizer that improves how the model processes text. The tradeoff is that the same input can map to more tokens—roughly 1.0–1.35× depending on the content type.

I pasted the Opus 4.7 system prompt into the token counting tool and found that the Opus 4.7 tokenizer used 1.46x the number of tokens as Opus 4.6.

Opus 4.7 uses the same pricing is Opus 4.6 - $5 per million input tokens and $25 per million output tokens - but this token inflation means we can expect it to be around 40% more expensive.

The token counter tool also accepts images. Opus 4.7 has improved image support, described like this:

Opus 4.7 has better vision for high-resolution images: it can accept images up to 2,576 pixels on the long edge (~3.75 megapixels), more than three times as many as prior Claude models.

I tried counting tokens for a 3456x2234 pixel 3.7MB PNG and got an even bigger increase in token counts - 3.01x times the number of tokens for 4.7 compared to 4.6:

Update: That 3x increase for images is entirely due to Opus 4.7 being able to handle higher resolutions. I tried that again with a 682x318 pixel image and it took 314 tokens with Opus 4.7 and 310 with Opus 4.6, so effectively the same cost.

Update 2: I tried a 15MB, 30 page text-heavy PDF and Opus 4.7 reported 60,934 tokens while 4.6 reported 56,482 - that's a 1.08x multiplier, significantly lower than the multiplier I got for raw text.

Tags: ai, generative-ai, llms, anthropic, claude, llm-pricing, tokenization

Sam Rose explains how LLMs work with a visual essay

2025-12-19T18:33:41+00:00

Sam Rose explains how LLMs work with a visual essay

Sam Rose is one of my favorite authors of explorable interactive explanations - here's his previous collection.

Sam joined ngrok in September as a developer educator. Here's his first big visual explainer for them, ostensibly about how prompt caching works but it quickly expands to cover tokenization, embeddings, and the basics of the transformer architecture.

The result is one of the clearest and most accessible introductions to LLM internals I've seen anywhere.

Tags: ai, explorables, generative-ai, llms, sam-rose, tokenization

Dummy's Guide to Modern LLM Sampling

2025-05-04T21:13:02+00:00

Dummy's Guide to Modern LLM Sampling

This is an extremely useful, detailed set of explanations by @AlpinDale covering the various different sampling strategies used by modern LLMs. LLMs return a set of next-token probabilities for every token in their corpus - a layer above the LLM can then use sampling strategies to decide which one to use.

I finally feel like I understand the difference between Top-K and Top-P! Top-K is when you narrow down to e.g. the 20 most likely candidates for next token and then pick one of those. Top-P instead "the smallest set of words whose combined probability exceeds threshold P" - so if you set it to 0.5 you'll filter out tokens in the lower half of the probability distribution.

There are a bunch more sampling strategies in here that I'd never heard of before - Top-A, Top-N-Sigma, Epsilon-Cutoff and more.

Reading the descriptions here of Repetition Penalty and Don't Repeat Yourself made me realize that I need to be a little careful with those for some of my own uses of LLMs.

I frequently feed larger volumes of text (or code) into an LLM and ask it to output subsets of that text as direct quotes, to answer questions like "which bit of this code handles authentication tokens" or "show me direct quotes that illustrate the main themes in this conversation".

Careless use of frequency penalty strategies might go against what I'm trying to achieve with those prompts.

Via Hacker News

Tags: ai, prompt-engineering, generative-ai, llms, tokenization

SQL injection-like attack on LLMs with special tokens

2024-08-20T22:01:50+00:00

SQL injection-like attack on LLMs with special tokens

Andrej Karpathy explains something that's been confusing me for the best part of a year:

The decision by LLM tokenizers to parse special tokens in the input string (<s>, <|endoftext|>, etc.), while convenient looking, leads to footguns at best and LLM security vulnerabilities at worst, equivalent to SQL injection attacks.

LLMs frequently expect you to feed them text that is templated like this:

<|user|>\nCan you introduce yourself<|end|>\n<|assistant|>

But what happens if the text you are processing includes one of those weird sequences of characters, like <|assistant|>? Stuff can definitely break in very unexpected ways.

LLMs generally reserve special token integer identifiers for these, which means that it should be possible to avoid this scenario by encoding the special token as that ID (for example 32001 for <|assistant|> in the Phi-3-mini-4k-instruct vocabulary) while that same sequence of characters in untrusted text is encoded as a longer sequence of smaller tokens.

Many implementations fail to do this! Thanks to Andrej I've learned that modern releases of Hugging Face transformers have a split_special_tokens=True parameter (added in 4.32.0 in August 2023) that can handle it. Here's an example:

>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")
>>> tokenizer.encode("<|assistant|>")
[32001]
>>> tokenizer.encode("<|assistant|>", split_special_tokens=True)
[529, 29989, 465, 22137, 29989, 29958]

A better option is to use the apply_chat_template() method, which should correctly handle this for you (though I'd like to see confirmation of that).

Tags: security, sql-injection, transformers, ai, andrej-karpathy, prompt-injection, generative-ai, llms, tokenization

A Picture is Worth 170 Tokens: How Does GPT-4o Encode Images?

2024-06-07T23:30:13+00:00

A Picture is Worth 170 Tokens: How Does GPT-4o Encode Images?

Oran Looney dives into the question of how GPT-4o tokenizes images - an image "costs" just 170 tokens, despite being able to include more text than could be encoded in that many tokens by the standard tokenizer.

There are some really neat tricks in here. I particularly like the experimental validation section where Oran creates 5x5 (and larger) grids of coloured icons and asks GPT-4o to return a JSON matrix of icon descriptions. This works perfectly at 5x5, gets 38/49 for 7x7 and completely fails at 13x13.

I'm not convinced by the idea that GPT-4o runs standard OCR such as Tesseract to enhance its ability to interpret text, but I would love to understand more about how this all works. I imagine a lot can be learned from looking at how openly licensed vision models such as LLaVA work, but I've not tried to understand that myself yet.

Via Hacker News

Tags: ai, openai, generative-ai, gpt-4, llms, tokenization, gpt

mistralai/mistral-common

2024-04-18T00:39:54+00:00

mistralai/mistral-common

New from Mistral: mistral-common, an open source Python library providing "a set of tools to help you work with Mistral models".

So far that means a tokenizer! This is similar to OpenAI's tiktoken library in that it lets you run tokenization in your own code, which crucially means you can count the number of tokens that you are about to use - useful for cost estimates but also for cramming the maximum allowed tokens in the context window for things like RAG.

Mistral's library is better than tiktoken though, in that it also includes logic for correctly calculating the tokens needed for conversation construction and tool definition. With OpenAI's APIs you're currently left guessing how many tokens are taken up by these advanced features.

Anthropic haven't published any form of tokenizer at all - it's the feature I'd most like to see from them next.

Here's how to explore the vocabulary of the tokenizer:

MistralTokenizer.from_model(
    "open-mixtral-8x22b"
).instruct_tokenizer.tokenizer.vocab()[:12]

['<unk>', '<s>', '</s>', '[INST]', '[/INST]', '[TOOL_CALLS]', '[AVAILABLE_TOOLS]', '[/AVAILABLE_TOOLS]', '[TOOL_RESULTS]', '[/TOOL_RESULTS]']

Tags: python, ai, openai, prompt-engineering, generative-ai, llms, anthropic, mistral, rag, tokenization

The Tokenizer Playground

2024-03-19T02:18:59+00:00

The Tokenizer Playground

I built a tool like this a while ago, but this one is much better: it provides an interface for experimenting with tokenizers from a wide range of model architectures, including Llama, Claude, Mistral and Grok-1—all running in the browser using Transformers.js.

Via @xenovacom

Tags: ai, generative-ai, llms, transformers-js, tokenization

Let's build the GPT Tokenizer

2024-02-20T18:02:13+00:00

Let's build the GPT Tokenizer

When Andrej Karpathy left OpenAI last week a lot of people expressed hope that he would be increasing his output of educational YouTube videos.

Here’s an in-depth 2 hour dive into how tokenizers work and how to build one from scratch, published this morning.

The section towards the end, “revisiting and explaining the quirks of LLM tokenization”, helps explain a number of different LLM weaknesses—inability to reverse strings, confusion over arithmetic and even a note on why YAML can work better than JSON when providing data to LLMs (the same data can be represented in less tokens).

Tags: ai, andrej-karpathy, generative-ai, llms, tokenization

Llama encoder and decoder

2023-06-13T22:37:29+00:00

Llama encoder and decoder

I forked my GPT tokenizer Observable notebook to create a similar tool for exploring the tokenization scheme used by the Llama family of LLMs, using the new llama-tokenizer-js JavaScript library.

Tags: ai, observable, generative-ai, llama, llms, tokenization

Understanding GPT tokenizers

2023-06-08T20:37:00+00:00

Large language models such as GPT-3/4, LLaMA and PaLM work in terms of tokens. They take text, convert it into tokens (integers), then predict which tokens should come next.

Playing around with these tokens is an interesting way to get a better idea for how this stuff actually works under the hood.

OpenAI offer a Tokenizer tool for exploring how tokens work

I've built my own, slightly more interesting tool as an Observable notebook:

https://observablehq.com/@simonw/gpt-tokenizer

You can use the notebook to convert text to tokens, tokens to text and also to run searches against the full token table.

Here's what the notebook looks like:

The text I'm tokenizing here is:

The dog eats the apples
El perro come las manzanas
片仮名

This produces 21 integer tokens: 5 for the English text, 8 for the Spanish text and six (two each) for those three Japanese characters. The two newlines are each represented by tokens as well.

The notebook uses the tokenizer from GPT-2 (borrowing from this excellent notebook by EJ Fox and Ian Johnson), so it's useful primarily as an educational tool - there are differences between how it works and the latest tokenizers for GPT-3 and above.

Exploring some interesting tokens

Playing with the tokenizer reveals all sorts of interesting patterns.

Most common English words are assigned a single token. As demonstrated above:

"The": 464
" dog": 3290
" eats": 25365
" the": 262
" apples": 22514

Note that capitalization is important here. "The" with a capital T is token 464, but " the" with both a leading space and a lowercase t is token 262.

Many words also have a token that incorporates a leading space. This makes for much more efficient encoding of full sentences, since they can be encoded without needing to spend a token on each whitespace character.

Languages other than English suffer from less efficient tokenization.

"El perro come las manzanas" in Spanish is encoded like this:

"El": 9527
" per": 583
"ro": 305
" come": 1282
" las": 39990
" man": 582
"zan": 15201
"as": 292

The English bias is obvious here. " man" gets a lower token ID of 582, because it's an English word. "zan" gets a token ID of 15201 because it's not a word that stands alone in English, but is a common enough sequence of characters that it still warrants its own token.

Some languages even have single characters that end up encoding to multiple tokens, such as these Japanese characters:

片: 31965 229
仮: 20015 106
名: 28938 235

Glitch tokens

A fascinating subset of tokens are what are known as "glitch tokens". My favourite example of those is token 23282 - " davidjl".

We can find that token by searching for "david" using the search box in the notebook:

Riley Goodside highlighted some weird behaviour with that token:

Why this happens is an intriguing puzzle.

It looks likely that this token refers to user davidjl123 on Reddit, a keen member of the /r/counting subreddit. He's posted incremented numbers there well over 163,000 times.

Presumably that subreddit ended up in the training data used to create the tokenizer used by GPT-2, and since that particular username showed up hundreds of thousands of times it ended up getting its own token.

But why would that break things like this? The best theory I've seen so far came from londons_explore on Hacker News:

These glitch tokens are all near the centroid of the token embedding space. That means that the model cannot really differentiate between these tokens and the others equally near the center of the embedding space, and therefore when asked to 'repeat' them, gets the wrong one.

That happened because the tokens were on the internet many millions of times (the davidjl user has 163,000 posts on reddit simply counting increasing numbers), yet the tokens themselves were never hard to predict (and therefore while training, the gradients became nearly zero, and the embedding vectors decayed to zero, which some optimizers will do when normalizing weights).

The conversation attached to the post SolidGoldMagikarp (plus, prompt generation) on LessWrong has a great deal more detail on this phenomenon.

Counting tokens with tiktoken

OpenAI's models each have a token limit. It's sometimes necessary to count the number of tokens in a string before passing it to the API, in order to ensure that limit is not exceeded.

One technique that needs this is Retrieval Augmented Generation, where you answer a user's question by running a search (or an embedding search) against a corpus of documents, extract the most likely content and include that as context in a prompt.

The key to successfully implementing that pattern is to include as much relevant context as will fit within the token limit - so you need to be able to count tokens.

OpenAI provide a Python library for doing this called tiktoken.

If you dig around inside the library you'll find it currently includes five different tokenization schemes: r50k_base, p50k_base, p50k_edit, cl100k_base and gpt2.

Of these cl100k_base is the most relevant, being the tokenizer for both GPT-4 and the inexpensive gpt-3.5-turbo model used by current ChatGPT.

p50k_base is used by text-davinci-003. A full mapping of models to tokenizers can be found in the MODEL_TO_ENCODING dictionary in tiktoken/model.py.

Here's how to use tiktoken:

import tiktoken

encoding = tiktoken.encoding_for_model("gpt-4")
# or "gpt-3.5-turbo" or "text-davinci-003"

tokens = encoding.encode("Here is some text")
token_count = len(tokens)

tokens will now be an array of four integer token IDs - [8586, 374, 1063, 1495] in this case.

Use the .decode() method to turn an array of token IDs back into text:

text = encoding.decode(tokens)
# 'Here is some text'

The first time you call encoding_for_model() the encoding data will be fetched over HTTP from a openaipublic.blob.core.windows.net Azure blob storage bucket (code here). This is cached in a temp directory, but that will get cleared should your machine restart. You can force it to use a more persistent cache directory by setting a TIKTOKEN_CACHE_DIR environment variable.

ttok

I introduced my ttok tool a few weeks ago. It's a command-line wrapper around tiktoken with two key features: it can count tokens in text that is piped to it, and it can also truncate that text down to a specified number of tokens:

# Count tokens
echo -n "Count these tokens" | ttok
# Outputs: 3 (the newline is skipped thanks to echo -n)

# Truncation
curl 'https://simonwillison.net/' | strip-tags -m | ttok -t 6
# Outputs: Simon Willison’s Weblog

# View integer token IDs
echo "Show these tokens" | ttok --tokens
# Outputs: 7968 1521 11460 198

Use -m gpt2 or similar to use an encoding for a different model.

Watching tokens get generated

Once you understand tokens, the way GPT tools generate text starts to make a lot more sense.

In particular, it's fun to watch GPT-4 streaming back its output as independent tokens (GPT-4 is slightly slower than 3.5, making it easier to see what's going on).

Here's what I get for llm -s 'Five names for a pet pelican' -4 - using my llm CLI tool to generate text from GPT-4:

As you can see, names that are not in the dictionary such as "Pelly" take multiple tokens, but "Captain Gulliver" outputs the token "Captain" as a single chunk.

Tags: projects, ai, gpt-3, openai, generative-ai, gpt-4, llms, tokenization, ai-bias, gpt

Examples of weird GPT-4 behavior for the string " davidjl"

2023-06-08T09:29:42+00:00

Examples of weird GPT-4 behavior for the string " davidjl"

GPT-4, when told to repeat or otherwise process the string “ davidjl” (note the leading space character), treats it as “jndl” or “jspb” or “JDL” instead. It turns out “ davidjl” has its own single token in the tokenizer: token ID 23282, presumably dating back to the GPT-2 days.

Riley Goodside refers to these as “glitch tokens”.

This token might refer to Reddit user davidjl123 who ranks top of the league for the old /r/counting subreddit, with 163,477 posts there which presumably ended up in older training data.

Tags: reddit, ai, openai, generative-ai, riley-goodside, gpt-4, llms, tokenization, gpt

llm, ttok and strip-tags - CLI tools for working with ChatGPT and other LLMs

2023-05-18T21:04:01+00:00

I've been building out a small suite of command-line tools for working with ChatGPT, GPT-4 and potentially other language models in the future.

The three tools I've built so far are:

llm - a command-line tool for sending prompts to the OpenAI APIs, outputting the response and logging the results to a SQLite database. I introduced that a few weeks ago.
ttok - a tool for counting and truncating text based on tokens
strip-tags - a tool for stripping HTML tags from text, and optionally outputting a subset of the page based on CSS selectors

The idea with these tools is to support working with language model prompts using Unix pipes.

You can install the three like this:

pipx install llm
pipx install ttok
pipx install strip-tags

Or use pip if you haven't adopted pipx yet.

llm depends on an OpenAI API key in the OPENAI_API_KEY environment variable or a ~/.openai-api-key.txt text file (Update: later versions of LLM changed this, see setup instructions). The other tools don't require any configuration.

Now let's use them to summarize the homepage of the New York Times:

curl -s https://www.nytimes.com/ \
  | strip-tags .story-wrapper \
  | ttok -t 4000 \
  | llm --system 'summary bullet points'

Here's what that command outputs when you run it in the terminal:

Let's break that down.

curl -s https://www.nytimes.com/ uses curl to retrieve the HTML for the New York Times homepage - the -s option prevents it from outputting any progress information.
strip-tags .story-wrapper accepts HTML to standard input, finds just the areas of that page identified by the CSS selector .story-wrapper, then outputs the text for those areas with all HTML tags removed.
ttok -t 4000 accepts text to standard input, tokenizes it using the default tokenizer for the gpt-3.5-turbo model, truncates to the first 4,000 tokens and outputs those tokens converted back to text.
llm --system 'summary bullet points' accepts the text to standard input as the user prompt, adds a system prompt of "summary bullet points".

It's all about the tokens

I built strip-tags and ttok this morning because I needed better ways to work with tokens.

LLMs such as ChatGPT and GPT-4 work with tokens, not characters.

This is an implementation detail, but it's one that you can't avoid for two reasons:

APIs have token limits. If you try and send more than the limit you'll get an error message like this one: "This model's maximum context length is 4097 tokens. However, your messages resulted in 116142 tokens. Please reduce the length of the messages."
Tokens are how pricing works. gpt-3.5-turbo (the model used by ChatGPT, and the default model used by the llm command) costs $0.002 / 1,000 tokens. GPT-4 is $0.03 / 1,000 tokens of input and $0.06 / 1,000 for output.

Being able to keep track of token counts is really important.

But tokens are actually really hard to count! The rule of thumb is roughly 0.75 * number-of-words, but you can get an exact count by running the same tokenizer that the model uses on your own machine.

OpenAI's tiktoken library (documented in this notebook) is the best way to do this.

My ttok tool is a very thin wrapper around that library. It can do three different things:

Count tokens
Truncate text to a desired number of tokens
Show you the tokens

Here's a quick example showing all three of those in action:

$ echo 'Here is some text' | ttok
5
$ echo 'Here is some text' | ttok --truncate 2
Here is
$ echo 'Here is some text' | ttok --tokens    
8586 374 1063 1495 198

My GPT-3 token encoder and decoder Observable notebook provides an interface for exploring how these tokens work in more detail.

Stripping tags from HTML

HTML tags take up a lot of tokens, and usually aren't relevant to the prompt you are sending to the model.

My new strip-tags command strips those tags out.

Here's an example showing quite how much of a difference that can make:

$ curl -s https://simonwillison.net/ | ttok
21543
$ curl -s https://simonwillison.net/ | strip-tags | ttok
9688

For my blog's homepage, stripping tags reduces the token count by more than half!

The above is still too many tokens to send to the API.

We could truncate them, like this:

$ curl -s https://simonwillison.net/ \
  | strip-tags | ttok --truncate 4000 \
  | llm --system 'turn this into a bad poem'

Which outputs:

download-esm,

A tool to download ECMAScript modules.

Get your packages straight from CDN,

No need for build scripts, let that burden end.

All dependencies will be fetched,

Import statements will be re-writched.

Works like a charm, simple and sleek,

JavaScript just got a whole lot more chic.

But often it's only specific parts of a page that we care about. The strip-tags command takes an optional list of CSS selectors as arguments - if provided, only those parts of the page will be output.

That's how the New York Times example works above. Compare the following:

$ curl -s https://www.nytimes.com/ | ttok             
210544
$ curl -s https://www.nytimes.com/ | strip-tags | ttok
115117
$ curl -s https://www.nytimes.com/ | strip-tags .story-wrapper | ttok
2165

By selecting just the text from within the <section class="story-wrapper"> elements we can trim the whole page down to just the headlines and summaries of each of the main articles on the page.

Future plans

I'm really enjoying being able to use the terminal to interact with LLMs in this way. Having a quick way to pipe content to a model opens up all kinds of fun opportunities.

Want a quick explanation of how some code works using GPT-4? Try this:

cat ttok/cli.py | llm --system 'Explain this code' --gpt4

(Output here).

I've been having fun piping my shot-scraper tool into it too, which goes a step further than strip-tags in providing a full headless browser.

Here's an example that uses the Readability recipe from this TIL to extract the main article content, then further strips HTML tags from it and pipes it into the llm command:

shot-scraper javascript https://www.theguardian.com/uk-news/2023/may/18/rmt-to-hold-rail-strike-across-england-on-eve-of-fa-cup-final "
async () => {
    const readability = await import('https://cdn.skypack.dev/@mozilla/readability');
    return (new readability.Readability(document)).parse().content;
}" | strip-tags | llm --system summarize

In terms of next steps, the thing I'm most excited about is teaching that llm command how to talk to other models - initially Claude and PaLM2 via APIs, but I'd love to get it working against locally hosted models running on things like llama.cpp as well.

Tags: cli, projects, ai, openai, generative-ai, chatgpt, llms, llm, tokenization