<?xml version="1.0" encoding="utf-8"?>
<feed xml:lang="en-us" xmlns="http://www.w3.org/2005/Atom"><title>Simon Willison's Weblog: tokenization</title><link href="http://simonwillison.net/" rel="alternate"/><link href="http://simonwillison.net/tags/tokenization.atom" rel="self"/><id>http://simonwillison.net/</id><updated>2026-04-20T00:50:45+00:00</updated><author><name>Simon Willison</name></author><entry><title>Claude Token Counter, now with model comparisons</title><link href="https://simonwillison.net/2026/Apr/20/claude-token-counts/#atom-tag" rel="alternate"/><published>2026-04-20T00:50:45+00:00</published><updated>2026-04-20T00:50:45+00:00</updated><id>https://simonwillison.net/2026/Apr/20/claude-token-counts/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://tools.simonwillison.net/claude-token-counter"&gt;Claude Token Counter, now with model comparisons&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
I &lt;a href="https://github.com/simonw/tools/pull/269"&gt;upgraded&lt;/a&gt; my Claude Token Counter tool to add the ability to run the same count against different models in order to compare them.&lt;/p&gt;
&lt;p&gt;As far as I can tell Claude Opus 4.7 is the first model to change the tokenizer, so it's only worth running comparisons between 4.7 and 4.6. The Claude &lt;a href="https://platform.claude.com/docs/en/build-with-claude/token-counting"&gt;token counting API&lt;/a&gt; accepts any Claude model ID though so I've included options for all four of the notable current models (Opus 4.7 and 4.6, Sonnet 4.6, and Haiku 4.5).&lt;/p&gt;
&lt;p&gt;In the Opus 4.7 announcement &lt;a href="https://www.anthropic.com/news/claude-opus-4-7#migrating-from-opus-46-to-opus-47"&gt;Anthropic said&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Opus 4.7 uses an updated tokenizer that improves how the model processes text. The tradeoff is that the same input can map to more tokens—roughly 1.0–1.35× depending on the content type.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I pasted the &lt;a href="https://github.com/simonw/research/blob/2cf912666ba08ef0c00a1b51ee07c9a8e64579ef/extract-system-prompts/claude-opus-4-7.md?plain=1"&gt;Opus 4.7 system prompt&lt;/a&gt; into the token counting tool and found that the Opus 4.7 tokenizer used 1.46x as many tokens as Opus 4.6.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Screenshot of a token comparison tool. Models to compare: claude-opus-4-7 (checked), claude-opus-4-6 (checked), claude-opus-4-5, claude-sonnet-4-6, claude-haiku-4-5. Note: &amp;quot;These models share the same tokenizer&amp;quot;. Blue &amp;quot;Count Tokens&amp;quot; button. Results table — Model | Tokens | vs. lowest. claude-opus-4-7: 7,335 tokens, 1.46x (yellow badge). claude-opus-4-6: 5,039 tokens, 1.00x (green badge)." src="https://static.simonwillison.net/static/2026/claude-token-count.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;Opus 4.7 uses the same pricing as Opus 4.6 - $5 per million input tokens and $25 per million output tokens - but this token inflation means we can expect it to be around 40% more expensive.&lt;/p&gt;
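&lt;p&gt;As a rough sanity check on that estimate, here is a minimal sketch of the input-cost arithmetic for the system prompt counts above. It assumes the $5 per million input token price applies identically to both models and ignores output tokens; the &lt;code&gt;input_cost&lt;/code&gt; helper is a name made up for illustration.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Price quoted above; assumed identical for both models
INPUT_PRICE_PER_MILLION = 5.00

def input_cost(tokens):
    """Dollar cost of sending this many input tokens."""
    return tokens * INPUT_PRICE_PER_MILLION / 1_000_000

# Token counts reported by the tool for the same system prompt
opus_4_7 = input_cost(7_335)
opus_4_6 = input_cost(5_039)
ratio = opus_4_7 / opus_4_6

print(f"Opus 4.7: ${opus_4_7:.4f}  Opus 4.6: ${opus_4_6:.4f}  ratio: {ratio:.2f}x")
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For this one prompt the ratio works out to about 1.46x; other kinds of content will give different ratios.&lt;/p&gt;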
&lt;p&gt;The token counter tool also accepts images. Opus 4.7 has improved image support, described like this:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Opus 4.7 has better vision for high-resolution images: it can accept images up to 2,576 pixels on the long edge (~3.75 megapixels), more than three times as many as prior Claude models.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I tried counting tokens for a 3456x2234 pixel 3.7MB PNG and got an even bigger increase in token counts - 3.01x the number of tokens for 4.7 compared to 4.6:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Same UI, this time with an uploaded screenshot PNG image. claude-opus-4-7: 4,744 tokens, 3.01x (yellow badge). claude-opus-4-6: 1,578 tokens, 1.00x (green badge)." src="https://static.simonwillison.net/static/2026/claude-token-count-image.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Update&lt;/strong&gt;: That 3x increase for images is &lt;em&gt;entirely&lt;/em&gt; due to Opus 4.7 being able to handle higher resolutions. I tried that again with a 682x318 pixel image and it took 314 tokens with Opus 4.7 and 310 with Opus 4.6, so effectively the same cost.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Update 2&lt;/strong&gt;: I tried a 15MB, 30 page text-heavy PDF and Opus 4.7 reported 60,934 tokens while 4.6 reported 56,482 - that's a 1.08x multiplier, significantly lower than the multiplier I got for raw text.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/anthropic"&gt;anthropic&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude"&gt;claude&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-pricing"&gt;llm-pricing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/tokenization"&gt;tokenization&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="anthropic"/><category term="claude"/><category term="llm-pricing"/><category term="tokenization"/></entry><entry><title>Sam Rose explains how LLMs work with a visual essay</title><link href="https://simonwillison.net/2025/Dec/19/sam-rose-llms/#atom-tag" rel="alternate"/><published>2025-12-19T18:33:41+00:00</published><updated>2025-12-19T18:33:41+00:00</updated><id>https://simonwillison.net/2025/Dec/19/sam-rose-llms/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://ngrok.com/blog/prompt-caching/"&gt;Sam Rose explains how LLMs work with a visual essay&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Sam Rose is one of my favorite authors of &lt;a href="https://simonwillison.net/tags/explorables/"&gt;explorable interactive explanations&lt;/a&gt; - here's &lt;a href="https://samwho.dev/"&gt;his previous collection&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Sam joined ngrok in September as a developer educator. Here's his first big visual explainer for them, ostensibly about how prompt caching works but it quickly expands to cover tokenization, embeddings, and the basics of the transformer architecture.&lt;/p&gt;
&lt;p&gt;The result is one of the clearest and most accessible introductions to LLM internals I've seen anywhere.&lt;/p&gt;
&lt;div style="text-align: center"&gt;&lt;img alt="Animation. Starts in tokens mode with an array of 75, 305, 24, 887 - clicking embeddings animates those into a 2D array showing each one to be composed of three floating point numbers." src="https://static.simonwillison.net/static/2025/tokens-embeddings.gif" style="max-width: 100%"&gt;&lt;/div&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/explorables"&gt;explorables&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sam-rose"&gt;sam-rose&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/tokenization"&gt;tokenization&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="explorables"/><category term="generative-ai"/><category term="llms"/><category term="sam-rose"/><category term="tokenization"/></entry><entry><title>Dummy's Guide to Modern LLM Sampling</title><link href="https://simonwillison.net/2025/May/4/llm-sampling/#atom-tag" rel="alternate"/><published>2025-05-04T21:13:02+00:00</published><updated>2025-05-04T21:13:02+00:00</updated><id>https://simonwillison.net/2025/May/4/llm-sampling/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://rentry.co/samplers"&gt;Dummy&amp;#x27;s Guide to Modern LLM Sampling&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
This is an extremely useful, detailed set of explanations by &lt;a href="https://x.com/AlpinDale"&gt;@AlpinDale&lt;/a&gt; covering the various sampling strategies used by modern LLMs. LLMs return a set of next-token probabilities for every token in their vocabulary - a layer above the LLM can then use sampling strategies to decide which one to use.&lt;/p&gt;
&lt;p&gt;I finally feel like I understand the difference between &lt;a href="https://rentry.co/samplers#top-k"&gt;Top-K&lt;/a&gt; and &lt;a href="https://rentry.co/samplers#top-p"&gt;Top-P&lt;/a&gt;! Top-K is when you narrow down to e.g. the 20 most likely candidates for next token and then pick one of those. Top-P instead selects "the smallest set of words whose combined probability exceeds threshold P" - so if you set it to 0.5 you'll filter out tokens in the lower half of the probability distribution.&lt;/p&gt;
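&lt;p&gt;To make the distinction concrete, here is a minimal sketch of both filters over a toy next-token distribution. The token strings, probabilities and the &lt;code&gt;ranked&lt;/code&gt;, &lt;code&gt;top_k&lt;/code&gt;, &lt;code&gt;top_p&lt;/code&gt; and &lt;code&gt;sample&lt;/code&gt; helper names are all invented for illustration; real inference stacks apply these filters to logits over the full vocabulary.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import random
from bisect import bisect_right
from itertools import accumulate

def ranked(probs):
    """Tokens sorted from most to least likely."""
    return sorted(probs.items(), key=lambda pair: pair[1], reverse=True)

def top_k(probs, k):
    """Keep only the k most likely tokens."""
    return dict(ranked(probs)[:k])

def top_p(probs, p):
    """Keep the smallest set of top tokens whose combined probability exceeds p."""
    pairs = ranked(probs)
    running = list(accumulate(prob for _, prob in pairs))
    # bisect_right counts the running totals that do not yet exceed p,
    # so one more token is needed to push the total past the threshold
    keep = bisect_right(running, p) + 1
    return dict(pairs[:keep])

def sample(candidates):
    """Pick one token, weighted by its probability among the survivors."""
    tokens = list(candidates)
    weights = list(candidates.values())
    return random.choices(tokens, weights=weights, k=1)[0]

# Toy distribution, not output from a real model
probs = {"the": 0.4, "a": 0.3, "cat": 0.2, "dog": 0.1}
print(top_k(probs, 3))
print(top_p(probs, 0.5))
print(sample(top_p(probs, 0.5)))
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;With P set to 0.5 only "the" and "a" survive here, since 0.4 alone does not exceed the threshold but 0.4 + 0.3 does. Top-K with K set to 3 keeps a fixed number of candidates no matter how the probability is spread across them.&lt;/p&gt;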
&lt;p&gt;There are a bunch more sampling strategies in here that I'd never heard of before - Top-A, Top-N-Sigma, Epsilon-Cutoff and more.&lt;/p&gt;
&lt;p&gt;Reading the descriptions here of &lt;a href="https://rentry.co/samplers#repetition-penalty"&gt;Repetition Penalty&lt;/a&gt; and &lt;a href="https://rentry.co/samplers#dry-dont-repeat-yourself"&gt;Don't Repeat Yourself&lt;/a&gt; made me realize that I need to be a little careful with those for some of my own uses of LLMs.&lt;/p&gt;
&lt;p&gt;I frequently feed larger volumes of text (or code) into an LLM and ask it to output subsets of that text as direct quotes, to answer questions like "which bit of this code handles authentication tokens" or "show me direct quotes that illustrate the main themes in this conversation".&lt;/p&gt;
&lt;p&gt;Careless use of frequency penalty strategies might go against what I'm trying to achieve with those prompts.

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=43887637"&gt;Hacker News&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/prompt-engineering"&gt;prompt-engineering&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/tokenization"&gt;tokenization&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="prompt-engineering"/><category term="generative-ai"/><category term="llms"/><category term="tokenization"/></entry><entry><title>SQL injection-like attack on LLMs with special tokens</title><link href="https://simonwillison.net/2024/Aug/20/sql-injection-like-attack-on-llms-with-special-tokens/#atom-tag" rel="alternate"/><published>2024-08-20T22:01:50+00:00</published><updated>2024-08-20T22:01:50+00:00</updated><id>https://simonwillison.net/2024/Aug/20/sql-injection-like-attack-on-llms-with-special-tokens/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://twitter.com/karpathy/status/1823418177197646104"&gt;SQL injection-like attack on LLMs with special tokens&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Andrej Karpathy explains something that's been confusing me for the best part of a year:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The decision by LLM tokenizers to parse special tokens in the input string (&lt;code&gt;&amp;lt;s&amp;gt;&lt;/code&gt;, &lt;code&gt;&amp;lt;|endoftext|&amp;gt;&lt;/code&gt;, etc.), while convenient looking, leads to footguns at best and LLM security vulnerabilities at worst, equivalent to SQL injection attacks. &lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;LLMs frequently expect you to feed them text that is templated like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;lt;|user|&amp;gt;\nCan you introduce yourself&amp;lt;|end|&amp;gt;\n&amp;lt;|assistant|&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;But what happens if the text you are processing includes one of those weird sequences of characters, like &lt;code&gt;&amp;lt;|assistant|&amp;gt;&lt;/code&gt;? Stuff can definitely break in very unexpected ways.&lt;/p&gt;
&lt;p&gt;LLMs generally reserve special token integer identifiers for these, which means that it should be possible to avoid this scenario by encoding the special token as that ID (for example &lt;code&gt;32001&lt;/code&gt; for &lt;code&gt;&amp;lt;|assistant|&amp;gt;&lt;/code&gt; in the &lt;code&gt;Phi-3-mini-4k-instruct&lt;/code&gt; &lt;a href="https://huggingface.co/microsoft/Phi-3-mini-4k-instruct/blob/main/added_tokens.json"&gt;vocabulary&lt;/a&gt;) while that same sequence of characters in untrusted text is encoded as a longer sequence of smaller tokens.&lt;/p&gt;
&lt;p&gt;Many implementations fail to do this! Thanks to Andrej I've learned that modern releases of Hugging Face &lt;a href="https://pypi.org/project/transformers/"&gt;transformers&lt;/a&gt; have a &lt;code&gt;split_special_tokens=True&lt;/code&gt; parameter (added &lt;a href="https://github.com/huggingface/transformers/releases/tag/v4.32.0"&gt;in 4.32.0&lt;/a&gt; in August 2023) that can handle it. Here's an example:&lt;/p&gt;
&lt;div class="highlight highlight-text-python-console"&gt;&lt;pre&gt;&amp;gt;&amp;gt;&amp;gt; &lt;span class="pl-k"&gt;from&lt;/span&gt; transformers &lt;span class="pl-k"&gt;import&lt;/span&gt; AutoTokenizer
&amp;gt;&amp;gt;&amp;gt; tokenizer &lt;span class="pl-k"&gt;=&lt;/span&gt; AutoTokenizer.from_pretrained(&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;microsoft/Phi-3-mini-4k-instruct&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;)
&amp;gt;&amp;gt;&amp;gt; tokenizer.encode(&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;&amp;lt;|assistant|&amp;gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;)
[32001]
&amp;gt;&amp;gt;&amp;gt; tokenizer.encode(&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;&amp;lt;|assistant|&amp;gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;, &lt;span class="pl-v"&gt;split_special_tokens&lt;/span&gt;&lt;span class="pl-k"&gt;=&lt;/span&gt;&lt;span class="pl-c1"&gt;True&lt;/span&gt;)
[529, 29989, 465, 22137, 29989, 29958]&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;A better option is to use the &lt;a href="https://huggingface.co/docs/transformers/main/en/chat_templating"&gt;apply_chat_template()&lt;/a&gt; method, which should correctly handle this for you (though I'd like to see confirmation of that).


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/security"&gt;security&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sql-injection"&gt;sql-injection&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/transformers"&gt;transformers&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/andrej-karpathy"&gt;andrej-karpathy&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/prompt-injection"&gt;prompt-injection&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/tokenization"&gt;tokenization&lt;/a&gt;&lt;/p&gt;



</summary><category term="security"/><category term="sql-injection"/><category term="transformers"/><category term="ai"/><category term="andrej-karpathy"/><category term="prompt-injection"/><category term="generative-ai"/><category term="llms"/><category term="tokenization"/></entry><entry><title>A Picture is Worth 170 Tokens: How Does GPT-4o Encode Images?</title><link href="https://simonwillison.net/2024/Jun/7/how-does-gpt-4o-encode-images/#atom-tag" rel="alternate"/><published>2024-06-07T23:30:13+00:00</published><updated>2024-06-07T23:30:13+00:00</updated><id>https://simonwillison.net/2024/Jun/7/how-does-gpt-4o-encode-images/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.oranlooney.com/post/gpt-cnn/"&gt;A Picture is Worth 170 Tokens: How Does GPT-4o Encode Images?&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Oran Looney dives into the question of how GPT-4o tokenizes images - an image "costs" just 170 tokens, despite being able to include more text than could be encoded in that many tokens by the standard tokenizer.&lt;/p&gt;
&lt;p&gt;There are some really neat tricks in here. I particularly like the &lt;a href="https://www.oranlooney.com/post/gpt-cnn/#experimental-validation"&gt;experimental validation section&lt;/a&gt; where Oran creates 5x5 (and larger) grids of coloured icons and asks GPT-4o to return a JSON matrix of icon descriptions. This works perfectly at 5x5, gets 38/49 for 7x7 and completely fails at 13x13.&lt;/p&gt;
&lt;p&gt;I'm not convinced by the idea that GPT-4o runs standard OCR such as Tesseract to enhance its ability to interpret text, but I would love to understand more about how this all works. I imagine a lot can be learned from looking at how openly licensed vision models such as LLaVA work, but I've not tried to understand that myself yet.

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=40608269"&gt;Hacker News&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openai"&gt;openai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gpt-4"&gt;gpt-4&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/tokenization"&gt;tokenization&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gpt"&gt;gpt&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="openai"/><category term="generative-ai"/><category term="gpt-4"/><category term="llms"/><category term="tokenization"/><category term="gpt"/></entry><entry><title>mistralai/mistral-common</title><link href="https://simonwillison.net/2024/Apr/18/mistral-common/#atom-tag" rel="alternate"/><published>2024-04-18T00:39:54+00:00</published><updated>2024-04-18T00:39:54+00:00</updated><id>https://simonwillison.net/2024/Apr/18/mistral-common/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/mistralai/mistral-common"&gt;mistralai/mistral-common&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
New from Mistral: mistral-common, an open source Python library providing "a set of tools to help you work with Mistral models".&lt;/p&gt;
&lt;p&gt;So far that means a tokenizer! This is similar to OpenAI's tiktoken library in that it lets you run tokenization in your own code, which crucially means you can count the number of tokens that you are about to use - useful for cost estimates but also for cramming the maximum allowed tokens in the context window for things like RAG.&lt;/p&gt;
&lt;p&gt;Mistral's library is better than tiktoken though, in that it also includes logic for correctly calculating the tokens needed for conversation construction and tool definition. With OpenAI's APIs you're currently left guessing how many tokens are taken up by these advanced features.&lt;/p&gt;
&lt;p&gt;Anthropic haven't published any form of tokenizer at all - it's the feature I'd most like to see from them next.&lt;/p&gt;
&lt;p&gt;Here's how to explore the vocabulary of the tokenizer:&lt;/p&gt;
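&lt;pre&gt;&lt;code&gt;from mistral_common.tokens.tokenizers.mistral import MistralTokenizer

# Slice the start of the tokenizer's vocabulary list
MistralTokenizer.from_model(
    "open-mixtral-8x22b"
).instruct_tokenizer.tokenizer.vocab()[:12]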
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;code&gt;['&amp;lt;unk&amp;gt;', '&amp;lt;s&amp;gt;', '&amp;lt;/s&amp;gt;', '[INST]', '[/INST]', '[TOOL_CALLS]', '[AVAILABLE_TOOLS]', '[/AVAILABLE_TOOLS]', '[TOOL_RESULTS]', '[/TOOL_RESULTS]']&lt;/code&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openai"&gt;openai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/prompt-engineering"&gt;prompt-engineering&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/anthropic"&gt;anthropic&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/mistral"&gt;mistral&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/rag"&gt;rag&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/tokenization"&gt;tokenization&lt;/a&gt;&lt;/p&gt;



</summary><category term="python"/><category term="ai"/><category term="openai"/><category term="prompt-engineering"/><category term="generative-ai"/><category term="llms"/><category term="anthropic"/><category term="mistral"/><category term="rag"/><category term="tokenization"/></entry><entry><title>The Tokenizer Playground</title><link href="https://simonwillison.net/2024/Mar/19/the-tokenizer-playground/#atom-tag" rel="alternate"/><published>2024-03-19T02:18:59+00:00</published><updated>2024-03-19T02:18:59+00:00</updated><id>https://simonwillison.net/2024/Mar/19/the-tokenizer-playground/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://huggingface.co/spaces/Xenova/the-tokenizer-playground"&gt;The Tokenizer Playground&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
I built a tool like this a while ago, but this one is much better: it provides an interface for experimenting with tokenizers from a wide range of model architectures, including Llama, Claude, Mistral and Grok-1—all running in the browser using Transformers.js.

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/xenovacom/status/1769546095871287423"&gt;@xenovacom&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/transformers-js"&gt;transformers-js&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/tokenization"&gt;tokenization&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="transformers-js"/><category term="tokenization"/></entry><entry><title>Let's build the GPT Tokenizer</title><link href="https://simonwillison.net/2024/Feb/20/lets-build-the-gpt-tokenizer/#atom-tag" rel="alternate"/><published>2024-02-20T18:02:13+00:00</published><updated>2024-02-20T18:02:13+00:00</updated><id>https://simonwillison.net/2024/Feb/20/lets-build-the-gpt-tokenizer/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.youtube.com/watch?v=zduSFxRajkE"&gt;Let&amp;#x27;s build the GPT Tokenizer&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
When Andrej Karpathy left OpenAI last week a lot of people expressed hope that he would be increasing his output of educational YouTube videos.&lt;/p&gt;

&lt;p&gt;Here’s an in-depth 2 hour dive into how tokenizers work and how to build one from scratch, published this morning.&lt;/p&gt;

&lt;p&gt;The section towards the end, “revisiting and explaining the quirks of LLM tokenization”, helps explain a number of different LLM weaknesses - inability to reverse strings, confusion over arithmetic and even a note on why YAML can work better than JSON when providing data to LLMs (the same data can be represented in fewer tokens).


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/andrej-karpathy"&gt;andrej-karpathy&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/tokenization"&gt;tokenization&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="andrej-karpathy"/><category term="generative-ai"/><category term="llms"/><category term="tokenization"/></entry><entry><title>Llama encoder and decoder</title><link href="https://simonwillison.net/2023/Jun/13/llama-encoder-and-decoder/#atom-tag" rel="alternate"/><published>2023-06-13T22:37:29+00:00</published><updated>2023-06-13T22:37:29+00:00</updated><id>https://simonwillison.net/2023/Jun/13/llama-encoder-and-decoder/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://observablehq.com/@simonw/llama-encoder-and-decoder"&gt;Llama encoder and decoder&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
I forked my GPT tokenizer Observable notebook to create a similar tool for exploring the tokenization scheme used by the Llama family of LLMs, using the new llama-tokenizer-js JavaScript library.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/observable"&gt;observable&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llama"&gt;llama&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/tokenization"&gt;tokenization&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="observable"/><category term="generative-ai"/><category term="llama"/><category term="llms"/><category term="tokenization"/></entry><entry><title>Understanding GPT tokenizers</title><link href="https://simonwillison.net/2023/Jun/8/gpt-tokenizers/#atom-tag" rel="alternate"/><published>2023-06-08T20:37:00+00:00</published><updated>2023-06-08T20:37:00+00:00</updated><id>https://simonwillison.net/2023/Jun/8/gpt-tokenizers/#atom-tag</id><summary type="html">
    &lt;p&gt;Large language models such as GPT-3/4, LLaMA and PaLM work in terms of tokens. They take text, convert it into tokens (integers), then predict which tokens should come next.&lt;/p&gt;
&lt;p&gt;Playing around with these tokens is an interesting way to get a better idea for how this stuff actually works under the hood.&lt;/p&gt;
&lt;p&gt;OpenAI offer a &lt;a href="https://platform.openai.com/tokenizer"&gt;Tokenizer&lt;/a&gt; tool for exploring how tokens work.&lt;/p&gt;
&lt;p&gt;I've built my own, slightly more interesting tool as an Observable notebook:&lt;/p&gt;
&lt;p&gt;&lt;a href="https://observablehq.com/@simonw/gpt-tokenizer"&gt;https://observablehq.com/@simonw/gpt-tokenizer&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;You can use the notebook to convert text to tokens, tokens to text and also to run searches against the full token table.&lt;/p&gt;
&lt;p&gt;Here's what the notebook looks like:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2023/gpt-token-encoder-decoder.jpg" alt="GPT token encoder and decoder. Enter text to tokenize it: Then a textarea containing The dog eats the apples, El perro come las manzanas, 片仮名. 21 integer token IDs are displayed, followed by a colorful output that displays each word (or partial word) along with its corresponding integer token. The Japanese characters correspond to two integer tokens each." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;The text I'm tokenizing here is:&lt;/p&gt;
&lt;blockquote&gt;
&lt;pre&gt;&lt;code&gt;The dog eats the apples
El perro come las manzanas
片仮名
&lt;/code&gt;&lt;/pre&gt;
&lt;/blockquote&gt;
&lt;p&gt;This produces 21 integer tokens: five for the English text, eight for the Spanish text and six (two each) for those three Japanese characters. The two newlines are each represented by a token as well.&lt;/p&gt;
&lt;p&gt;The notebook uses the tokenizer from GPT-2 (borrowing from &lt;a href="https://observablehq.com/@codingwithfire/gpt-3-encoder"&gt;this excellent notebook&lt;/a&gt; by EJ Fox and Ian Johnson), so it's useful primarily as an educational tool - there are differences between how it works and the latest tokenizers for GPT-3 and above.&lt;/p&gt;
&lt;h4 id="interesting-tokens"&gt;Exploring some interesting tokens&lt;/h4&gt;
&lt;p&gt;Playing with the tokenizer reveals all sorts of interesting patterns.&lt;/p&gt;
&lt;p&gt;Most common English words are assigned a single token. As demonstrated above:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;"The": 464&lt;/li&gt;
&lt;li&gt;" dog": 3290&lt;/li&gt;
&lt;li&gt;" eats": 25365&lt;/li&gt;
&lt;li&gt;" the": 262&lt;/li&gt;
&lt;li&gt;" apples": 22514&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Note that capitalization is important here. "The" with a capital T is token 464, but " the" with both a leading space and a lowercase t is token 262.&lt;/p&gt;
&lt;p&gt;Many words also have a token that incorporates a leading space. This makes for much more efficient encoding of full sentences, since they can be encoded without needing to spend a token on each whitespace character.&lt;/p&gt;
&lt;p&gt;Languages other than English suffer from less efficient tokenization.&lt;/p&gt;
&lt;p&gt;"El perro come las manzanas" in Spanish is encoded like this:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;"El": 9527&lt;/li&gt;
&lt;li&gt;" per": 583&lt;/li&gt;
&lt;li&gt;"ro": 305&lt;/li&gt;
&lt;li&gt;" come": 1282&lt;/li&gt;
&lt;li&gt;" las": 39990&lt;/li&gt;
&lt;li&gt;" man": 582&lt;/li&gt;
&lt;li&gt;"zan": 15201&lt;/li&gt;
&lt;li&gt;"as": 292&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The English bias is obvious here. " man" gets a lower token ID of 582, because it's an English word. "zan" gets a token ID of 15201 because it's not a word that stands alone in English, but is a common enough sequence of characters that it still warrants its own token.&lt;/p&gt;
&lt;p&gt;Some languages even have single characters that end up encoding to multiple tokens, such as these Japanese characters:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;片: 31965 229&lt;/li&gt;
&lt;li&gt;仮: 20015 106&lt;/li&gt;
&lt;li&gt;名: 28938 235&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 id="glitch-tokens"&gt;Glitch tokens&lt;/h4&gt;
&lt;p&gt;A fascinating subset of tokens are what are known as "glitch tokens". My favourite example of those is token 23282 - " davidjl".&lt;/p&gt;
&lt;p&gt;We can find that token by searching for "david" using the search box in the notebook:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2023/gpt-token-search.jpg" alt="Search tokens: a search box containing david, with 5 results. 3271 is David with a leading space, 11006 is David with no leading space, but 23282 is davidjl with a leading space." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;Riley Goodside &lt;a href="https://twitter.com/goodside/status/1666598580319035392"&gt;highlighted some weird behaviour&lt;/a&gt; with that token:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2023/gpt-token-davidjl.jpg" alt="Four GPT-4 sessions. Prompt: Repeat the string &amp;quot;davidjl&amp;quot;. Response: jndl. Prompt: How many letters are in this username: davidjl. Response: The username &amp;quot;jdnl&amp;quot; contains 4 letters. Prompt: Are the strings &amp;quot;jdl&amp;quot; and &amp;quot;davidjl&amp;quot; identical? Response: No, the strings &amp;quot;jdl&amp;quot; and &amp;quot;jspb&amp;quot; are not identical. Each character in a string contributes to its uniqueness. Even though both strings have the same number of characters and two of them are the same (&amp;quot;j&amp;quot; and &amp;quot;d&amp;quot;), the last characters are different. Thus, &amp;quot;jdl&amp;quot; and &amp;quot;jspb&amp;quot; are different strings. Prompt: What's the deal with davidjl? Response: I'm sorry, but &amp;quot;JDL&amp;quot; is quite ambiguous. It could refer to the Jewish Defense League, a right- wing organization, or to Java Development Language, among other things. Could you please provide more context or clarify what you're referring to? I'll do my best to give you the information you're looking for." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;Why this happens is an intriguing puzzle.&lt;/p&gt;
&lt;p&gt;It looks likely that this token refers to user &lt;a href="https://www.reddit.com/user/davidjl123"&gt;davidjl123&lt;/a&gt; on Reddit, a keen member of the &lt;a href="https://reddit.com/r/counting"&gt;/r/counting&lt;/a&gt; subreddit. He's posted incremented numbers there well over 163,000 times.&lt;/p&gt;
&lt;p&gt;Presumably that subreddit ended up in the training data used to create the tokenizer used by GPT-2, and since that particular username showed up hundreds of thousands of times it ended up getting its own token.&lt;/p&gt;
&lt;p&gt;But why would that break things like this? The best theory I've seen so far came from &lt;a href="https://news.ycombinator.com/item?id=36245187"&gt;londons_explore on Hacker News&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;These glitch tokens are all near the centroid of the token embedding space. That means that the model cannot really differentiate between these tokens and the others equally near the center of the embedding space, and therefore when asked to 'repeat' them, gets the wrong one.&lt;/p&gt;
&lt;p&gt;That happened because the tokens were on the internet many millions of times (the davidjl user has 163,000 posts on reddit simply counting increasing numbers), yet the tokens themselves were never hard to predict (and therefore while training, the gradients became nearly zero, and the embedding vectors decayed to zero, which some optimizers will do when normalizing weights).&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The conversation attached to the post &lt;a href="https://www.lesswrong.com/posts/aPeJE8bSo6rAFoLqg/solidgoldmagikarp-plus-prompt-generation"&gt;SolidGoldMagikarp (plus, prompt generation)&lt;/a&gt; on LessWrong has a great deal more detail on this phenomenon.&lt;/p&gt;
&lt;h4 id="counting-tokens"&gt;Counting tokens with tiktoken&lt;/h4&gt;
&lt;p&gt;OpenAI's models each have a token limit. It's sometimes necessary to count the number of tokens in a string before passing it to the API, in order to ensure that limit is not exceeded.&lt;/p&gt;
&lt;p&gt;One technique that needs this is &lt;a href="https://simonwillison.net/2023/Jan/13/semantic-search-answers/"&gt;Retrieval Augmented Generation&lt;/a&gt;, where you answer a user's question by running a search (or an embedding search) against a corpus of documents, extract the most likely content and include that as context in a prompt.&lt;/p&gt;
&lt;p&gt;The key to successfully implementing that pattern is to include as much relevant context as will fit within the token limit - so you need to be able to count tokens.&lt;/p&gt;
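&lt;p&gt;A minimal sketch of that packing step, assuming the search results arrive ranked best-first. The names &lt;code&gt;pack_context&lt;/code&gt; and &lt;code&gt;whitespace_token_count&lt;/code&gt; are hypothetical, and the whitespace-based counter is only a stand-in so the example runs without any dependencies:&lt;/p&gt;

```python
def whitespace_token_count(text):
    # Stand-in counter (an assumption for this sketch): one "token" per
    # whitespace-separated word. Real tokenizers will give different counts.
    return len(text.split())


def pack_context(documents, token_limit, count_tokens=whitespace_token_count):
    """Take ranked documents in order until the next one would exceed token_limit."""
    chosen = []
    used = 0
    for doc in documents:
        cost = count_tokens(doc)
        if used + cost > token_limit:
            break  # stop at the first document that no longer fits
        chosen.append(doc)
        used += cost
    return chosen, used


docs = ["alpha beta gamma", "delta epsilon", "zeta eta theta iota"]
selected, total = pack_context(docs, token_limit=6)
print(selected, total)
```

&lt;p&gt;The budget should also leave room for the question, any instructions and the model's reply. For real use the counter needs to be the same tokenizer the model uses, passed in as &lt;code&gt;count_tokens&lt;/code&gt;.&lt;/p&gt;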
&lt;p&gt;OpenAI provide a Python library for doing this called &lt;a href="https://github.com/openai/tiktoken"&gt;tiktoken&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;If you dig around inside the library you'll find it currently includes five different tokenization schemes: &lt;code&gt;r50k_base&lt;/code&gt;, &lt;code&gt;p50k_base&lt;/code&gt;, &lt;code&gt;p50k_edit&lt;/code&gt;, &lt;code&gt;cl100k_base&lt;/code&gt; and &lt;code&gt;gpt2&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Of these &lt;code&gt;cl100k_base&lt;/code&gt; is the most relevant, being the tokenizer for both GPT-4 and the inexpensive &lt;code&gt;gpt-3.5-turbo&lt;/code&gt; model used by current ChatGPT.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;p50k_base&lt;/code&gt; is used by &lt;code&gt;text-davinci-003&lt;/code&gt;. A full mapping of models to tokenizers can be found in the &lt;code&gt;MODEL_TO_ENCODING&lt;/code&gt; dictionary in &lt;code&gt;tiktoken/model.py&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Here's how to use &lt;code&gt;tiktoken&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-s1"&gt;tiktoken&lt;/span&gt;

&lt;span class="pl-s1"&gt;encoding&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;tiktoken&lt;/span&gt;.&lt;span class="pl-en"&gt;encoding_for_model&lt;/span&gt;(&lt;span class="pl-s"&gt;"gpt-4"&lt;/span&gt;)
&lt;span class="pl-c"&gt;# or "gpt-3.5-turbo" or "text-davinci-003"&lt;/span&gt;

&lt;span class="pl-s1"&gt;tokens&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;encoding&lt;/span&gt;.&lt;span class="pl-en"&gt;encode&lt;/span&gt;(&lt;span class="pl-s"&gt;"Here is some text"&lt;/span&gt;)
&lt;span class="pl-s1"&gt;token_count&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-en"&gt;len&lt;/span&gt;(&lt;span class="pl-s1"&gt;tokens&lt;/span&gt;)&lt;/pre&gt;
&lt;p&gt;&lt;code&gt;tokens&lt;/code&gt; will now be an array of four integer token IDs - &lt;code&gt;[8586, 374, 1063, 1495]&lt;/code&gt; in this case.&lt;/p&gt;
&lt;p&gt;Use the &lt;code&gt;.decode()&lt;/code&gt; method to turn an array of token IDs back into text:&lt;/p&gt;
&lt;pre&gt;&lt;span class="pl-s1"&gt;text&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;encoding&lt;/span&gt;.&lt;span class="pl-en"&gt;decode&lt;/span&gt;(&lt;span class="pl-s1"&gt;tokens&lt;/span&gt;)
&lt;span class="pl-c"&gt;# 'Here is some text'&lt;/span&gt;&lt;/pre&gt;
&lt;p&gt;The first time you call &lt;code&gt;encoding_for_model()&lt;/code&gt; the encoding data will be fetched over HTTP from an &lt;code&gt;openaipublic.blob.core.windows.net&lt;/code&gt; Azure blob storage bucket (&lt;a href="https://github.com/openai/tiktoken/blob/0.4.0/tiktoken_ext/openai_public.py"&gt;code here&lt;/a&gt;). This is cached in a temp directory, but that will get cleared should your machine restart. You can force it to use a more persistent cache directory by setting a &lt;code&gt;TIKTOKEN_CACHE_DIR&lt;/code&gt; environment variable.&lt;/p&gt;
&lt;h4 id="ttok"&gt;ttok&lt;/h4&gt;
&lt;p&gt;I introduced my &lt;a href="https://github.com/simonw/ttok"&gt;ttok&lt;/a&gt; tool &lt;a href="https://simonwillison.net/2023/May/18/cli-tools-for-llms/"&gt;a few weeks ago&lt;/a&gt;. It's a command-line wrapper around &lt;code&gt;tiktoken&lt;/code&gt; with two key features: it can count tokens in text that is piped to it, and it can also truncate that text down to a specified number of tokens:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; Count tokens&lt;/span&gt;
&lt;span class="pl-c1"&gt;echo&lt;/span&gt; -n &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Count these tokens&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-k"&gt;|&lt;/span&gt; ttok
&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; Outputs: 3 (the newline is skipped thanks to echo -n)&lt;/span&gt;

&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; Truncation&lt;/span&gt;
curl &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;https://simonwillison.net/&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-k"&gt;|&lt;/span&gt; strip-tags -m &lt;span class="pl-k"&gt;|&lt;/span&gt; ttok -t 6
&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; Outputs: Simon Willison’s Weblog&lt;/span&gt;

&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; View integer token IDs&lt;/span&gt;
&lt;span class="pl-c1"&gt;echo&lt;/span&gt; &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Show these tokens&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-k"&gt;|&lt;/span&gt; ttok --tokens
&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; Outputs: 7968 1521 11460 198&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Use &lt;code&gt;-m gpt2&lt;/code&gt; or similar to use an encoding for a different model.&lt;/p&gt;
&lt;h4 id="watching-tokens"&gt;Watching tokens get generated&lt;/h4&gt;
&lt;p&gt;Once you understand tokens, the way GPT tools generate text starts to make a lot more sense.&lt;/p&gt;
&lt;p&gt;In particular, it's fun to watch GPT-4 streaming back its output as independent tokens (GPT-4 is slightly slower than 3.5, making it easier to see what's going on).&lt;/p&gt;
&lt;p&gt;Here's what I get for &lt;code&gt;llm -s 'Five names for a pet pelican' -4&lt;/code&gt; - using my &lt;a href="https://github.com/simonw/llm"&gt;llm&lt;/a&gt; CLI tool to generate text from GPT-4:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2023/gpt-token-pelican-names.gif" alt="Terminal window running that command. 1. Pelly 2. Beaky 3. SkyDancer 4. Scoop 5. Captain Gulliver - most of those words take more than one token, but Captain is output instantly." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;As you can see, names that are not in the dictionary such as "Pelly" take multiple tokens, but "Captain Gulliver" outputs the token "Captain" as a single chunk.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gpt-3"&gt;gpt-3&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openai"&gt;openai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gpt-4"&gt;gpt-4&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/tokenization"&gt;tokenization&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-bias"&gt;ai-bias&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gpt"&gt;gpt&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="projects"/><category term="ai"/><category term="gpt-3"/><category term="openai"/><category term="generative-ai"/><category term="gpt-4"/><category term="llms"/><category term="tokenization"/><category term="ai-bias"/><category term="gpt"/></entry><entry><title>Examples of weird GPT-4 behavior for the string " davidjl"</title><link href="https://simonwillison.net/2023/Jun/8/davidjl/#atom-tag" rel="alternate"/><published>2023-06-08T09:29:42+00:00</published><updated>2023-06-08T09:29:42+00:00</updated><id>https://simonwillison.net/2023/Jun/8/davidjl/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://twitter.com/goodside/status/1666598580319035392"&gt;Examples of weird GPT-4 behavior for the string &amp;quot; davidjl&amp;quot;&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;GPT-4, when told to repeat or otherwise process the string “ davidjl” (note the leading space character), treats it as “jndl” or “jspb” or “JDL” instead. It turns out “ davidjl” has its own single token in the tokenizer: token ID 23282, presumably dating back to the GPT-2 days.&lt;/p&gt;

&lt;p&gt;Riley Goodside refers to these as “glitch tokens”.&lt;/p&gt;

&lt;p&gt;This token might refer to Reddit user davidjl123 who ranks top of the league for the old /r/counting subreddit, with 163,477 posts there which presumably ended up in older training data.&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/reddit"&gt;reddit&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openai"&gt;openai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/riley-goodside"&gt;riley-goodside&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gpt-4"&gt;gpt-4&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/tokenization"&gt;tokenization&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gpt"&gt;gpt&lt;/a&gt;&lt;/p&gt;



</summary><category term="reddit"/><category term="ai"/><category term="openai"/><category term="generative-ai"/><category term="riley-goodside"/><category term="gpt-4"/><category term="llms"/><category term="tokenization"/><category term="gpt"/></entry><entry><title>llm, ttok and strip-tags - CLI tools for working with ChatGPT and other LLMs</title><link href="https://simonwillison.net/2023/May/18/cli-tools-for-llms/#atom-tag" rel="alternate"/><published>2023-05-18T21:04:01+00:00</published><updated>2023-05-18T21:04:01+00:00</updated><id>https://simonwillison.net/2023/May/18/cli-tools-for-llms/#atom-tag</id><summary type="html">
    &lt;p&gt;I've been building out a small suite of command-line tools for working with ChatGPT, GPT-4 and potentially other language models in the future.&lt;/p&gt;
&lt;p&gt;The three tools I've built so far are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/llm"&gt;llm&lt;/a&gt;&lt;/strong&gt; - a command-line tool for sending prompts to the OpenAI APIs, outputting the response and logging the results to a SQLite database. I introduced that &lt;a href="https://simonwillison.net/2023/Apr/4/llm/"&gt;a few weeks ago&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/ttok"&gt;ttok&lt;/a&gt;&lt;/strong&gt; - a tool for counting and truncating text based on tokens&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/strip-tags"&gt;strip-tags&lt;/a&gt;&lt;/strong&gt; - a tool for stripping HTML tags from text, and optionally outputting a subset of the page based on CSS selectors&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The idea with these tools is to support working with language model prompts using Unix pipes.&lt;/p&gt;
&lt;p&gt;You can install the three like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;pipx install llm
pipx install ttok
pipx install strip-tags&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Or use &lt;code&gt;pip&lt;/code&gt; if you haven't adopted &lt;a href="https://pypa.github.io/pipx/"&gt;pipx&lt;/a&gt; yet.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;llm&lt;/code&gt; depends on an OpenAI API key in the &lt;code&gt;OPENAI_API_KEY&lt;/code&gt; environment variable or a &lt;code&gt;~/.openai-api-key.txt&lt;/code&gt; text file (Update: later versions of LLM changed this, see &lt;a href="https://llm.datasette.io/en/stable/setup.html#api-key-management"&gt;setup instructions&lt;/a&gt;). The other tools don't require any configuration.&lt;/p&gt;
&lt;p&gt;Now let's use them to summarize the homepage of the New York Times:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;curl -s https://www.nytimes.com/ \
  &lt;span class="pl-k"&gt;|&lt;/span&gt; strip-tags .story-wrapper \
  &lt;span class="pl-k"&gt;|&lt;/span&gt; ttok -t 4000 \
  &lt;span class="pl-k"&gt;|&lt;/span&gt; llm --system &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;summary bullet points&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Here's what that command outputs when you run it in the terminal:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2023/llm-nytimes.gif" alt="Animated output from running that command: 1. Senator Dianne Feinstein suffered complications from encephalitis during her recent bout with shingles, which has raised concerns about her health among some of her allies. 2. Investors, economists, and executives are preparing contingency plans in case of a possible United States debt default, but the timeline for when the government will run out of cash is uncertain. 3. The Pentagon has freed up an additional $3 billion for Ukraine through an accounting mistake, relieving pressure on the Biden administration to ask Congress for more money for weapon supplies. 4. Explosions damaged a Russian-controlled freight train in Crimea, and the railway operator has suggested that it may have been an act of sabotage, but there is no confirmation yet from Ukrainian authorities. 5. Group of Seven leaders are expected to celebrate the success of a novel effort to stabilize global oil markets and punish Russia through an untested oil price cap." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;Let's break that down.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;curl -s https://www.nytimes.com/&lt;/code&gt; uses &lt;code&gt;curl&lt;/code&gt; to retrieve the HTML for the New York Times homepage - the &lt;code&gt;-s&lt;/code&gt; option prevents it from outputting any progress information.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;strip-tags .story-wrapper&lt;/code&gt; accepts HTML on standard input, finds just the areas of that page identified by the CSS selector &lt;code&gt;.story-wrapper&lt;/code&gt;, then outputs the text for those areas with all HTML tags removed.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;ttok -t 4000&lt;/code&gt; accepts text on standard input, tokenizes it using the default tokenizer for the &lt;code&gt;gpt-3.5-turbo&lt;/code&gt; model, truncates to the first 4,000 tokens and outputs those tokens converted back to text.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;llm --system 'summary bullet points'&lt;/code&gt; accepts text on standard input as the user prompt, adds a system prompt of "summary bullet points", sends both to the model and outputs the response.&lt;/li&gt;
&lt;/ul&gt;
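&lt;p&gt;The truncation step (&lt;code&gt;ttok -t 4000&lt;/code&gt;) boils down to encode, slice, decode. Here is a rough sketch of that idea, not the actual &lt;code&gt;ttok&lt;/code&gt; source. &lt;code&gt;truncate_to_tokens&lt;/code&gt;, &lt;code&gt;toy_encode&lt;/code&gt; and &lt;code&gt;toy_decode&lt;/code&gt; are hypothetical names, and the toy codec (one token per UTF-8 byte) is an assumption made so the example runs without a real tokenizer:&lt;/p&gt;

```python
def truncate_to_tokens(text, limit, encode, decode):
    """Keep only the first `limit` tokens of text, then turn them back into a string."""
    return decode(encode(text)[:limit])


# Toy codec for demonstration only (not a real model tokenizer):
# every UTF-8 byte counts as one token.
def toy_encode(text):
    return list(text.encode("utf-8"))


def toy_decode(token_ids):
    # errors="ignore" drops a multi-byte character that truncation cut in half
    return bytes(token_ids).decode("utf-8", errors="ignore")


print(truncate_to_tokens("Here is some text", 7, toy_encode, toy_decode))
```

&lt;p&gt;With &lt;code&gt;tiktoken&lt;/code&gt; installed, the &lt;code&gt;encode&lt;/code&gt; and &lt;code&gt;decode&lt;/code&gt; methods of an encoding object could be passed in place of the toy functions.&lt;/p&gt;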
&lt;h3&gt;It's all about the tokens&lt;/h3&gt;
&lt;p&gt;I built &lt;code&gt;strip-tags&lt;/code&gt; and &lt;code&gt;ttok&lt;/code&gt; this morning because I needed better ways to work with tokens.&lt;/p&gt;
&lt;p&gt;LLMs such as ChatGPT and GPT-4 work with tokens, not characters.&lt;/p&gt;
&lt;p&gt;This is an implementation detail, but it's one that you can't avoid for two reasons:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;APIs have token limits. If you try to send more than the limit you'll get an error message like this one: "This model's maximum context length is 4097 tokens. However, your messages resulted in 116142 tokens. Please reduce the length of the messages."&lt;/li&gt;
&lt;li&gt;Tokens are how pricing works. &lt;code&gt;gpt-3.5-turbo&lt;/code&gt; (the model used by ChatGPT, and the default model used by the &lt;code&gt;llm&lt;/code&gt; command) costs $0.002 / 1,000 tokens. GPT-4 is $0.03 / 1,000 tokens of input and $0.06 / 1,000 for output.&lt;/li&gt;
&lt;/ol&gt;
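&lt;p&gt;To make the pricing arithmetic concrete, here is a small sketch using the per-1,000-token prices quoted above. &lt;code&gt;PRICE_PER_1K&lt;/code&gt; and &lt;code&gt;estimate_cost&lt;/code&gt; are hypothetical names, the prices are a snapshot from the time of writing, and applying the single &lt;code&gt;gpt-3.5-turbo&lt;/code&gt; figure to both input and output is an assumption:&lt;/p&gt;

```python
# US dollars per 1,000 tokens, copied from the figures quoted in the text.
# These are a historical snapshot, so check current pricing before relying on them.
PRICE_PER_1K = {
    # Assumption: the single quoted price applies to input and output alike.
    "gpt-3.5-turbo": {"input": 0.002, "output": 0.002},
    "gpt-4": {"input": 0.03, "output": 0.06},
}


def estimate_cost(model, input_tokens, output_tokens=0):
    """Return the estimated cost in dollars for one request."""
    prices = PRICE_PER_1K[model]
    total = input_tokens * prices["input"] + output_tokens * prices["output"]
    return total / 1000


# A 4,000 token prompt with a 500 token reply on each model
for name in PRICE_PER_1K:
    print(name, round(estimate_cost(name, 4000, 500), 4))
```

&lt;p&gt;The same prompt works out to more than ten times the price on GPT-4, which is one reason to trim input before sending it.&lt;/p&gt;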
&lt;p&gt;Being able to keep track of token counts is really important.&lt;/p&gt;
&lt;p&gt;But tokens are actually really hard to count! The rule of thumb for English text is that one token is roughly 0.75 of a word, so expect about four tokens for every three words. You can get an exact count by running the same tokenizer that the model uses on your own machine.&lt;/p&gt;
&lt;p&gt;OpenAI's &lt;a href="https://github.com/openai/tiktoken"&gt;tiktoken&lt;/a&gt; library (documented &lt;a href="https://github.com/openai/openai-cookbook/blob/main/examples/How_to_count_tokens_with_tiktoken.ipynb"&gt;in this notebook&lt;/a&gt;) is the best way to do this.&lt;/p&gt;
&lt;p&gt;My &lt;code&gt;ttok&lt;/code&gt; tool is a &lt;a href="https://github.com/simonw/ttok/blob/0.1/ttok/cli.py"&gt;very thin wrapper&lt;/a&gt; around that library. It can do three different things:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Count tokens&lt;/li&gt;
&lt;li&gt;Truncate text to a desired number of tokens&lt;/li&gt;
&lt;li&gt;Show you the tokens&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Here's a quick example showing all three of those in action:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;$ &lt;span class="pl-c1"&gt;echo&lt;/span&gt; &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;Here is some text&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-k"&gt;|&lt;/span&gt; ttok
5
$ &lt;span class="pl-c1"&gt;echo&lt;/span&gt; &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;Here is some text&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-k"&gt;|&lt;/span&gt; ttok --truncate 2
Here is
$ &lt;span class="pl-c1"&gt;echo&lt;/span&gt; &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;Here is some text&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-k"&gt;|&lt;/span&gt; ttok --tokens    
8586 374 1063 1495 198&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;My &lt;a href="https://observablehq.com/@simonw/gpt-3-token-encoder-decoder"&gt;GPT-3 token encoder and decoder&lt;/a&gt; Observable notebook provides an interface for exploring how these tokens work in more detail.&lt;/p&gt;
&lt;h3&gt;Stripping tags from HTML&lt;/h3&gt;
&lt;p&gt;HTML tags take up a lot of tokens, and usually aren't relevant to the prompt you are sending to the model.&lt;/p&gt;
&lt;p&gt;My new &lt;code&gt;strip-tags&lt;/code&gt; command strips those tags out.&lt;/p&gt;
&lt;p&gt;Here's an example showing quite how much of a difference that can make:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;$ curl -s https://simonwillison.net/ &lt;span class="pl-k"&gt;|&lt;/span&gt; ttok
21543
$ curl -s https://simonwillison.net/ &lt;span class="pl-k"&gt;|&lt;/span&gt; strip-tags &lt;span class="pl-k"&gt;|&lt;/span&gt; ttok
9688&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;For my blog's homepage, stripping tags reduces the token count by more than half!&lt;/p&gt;
&lt;p&gt;The above is still too many tokens to send to the API.&lt;/p&gt;
&lt;p&gt;We could truncate them, like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;$ curl -s https://simonwillison.net/ \
  &lt;span class="pl-k"&gt;|&lt;/span&gt; strip-tags &lt;span class="pl-k"&gt;|&lt;/span&gt; ttok --truncate 4000 \
  &lt;span class="pl-k"&gt;|&lt;/span&gt; llm --system &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;turn this into a bad poem&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Which outputs:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;download-esm,

A tool to download ECMAScript modules.

Get your packages straight from CDN,

No need for build scripts, let that burden end.

All dependencies will be fetched,

Import statements will be re-writched.

Works like a charm, simple and sleek,

JavaScript just got a whole lot more chic.
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;But often it's only specific parts of a page that we care about. The &lt;code&gt;strip-tags&lt;/code&gt; command takes an optional list of CSS selectors as arguments - if provided, only those parts of the page will be output.&lt;/p&gt;
&lt;p&gt;That's how the New York Times example works above. Compare the following:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;$ curl -s https://www.nytimes.com/ &lt;span class="pl-k"&gt;|&lt;/span&gt; ttok             
210544
$ curl -s https://www.nytimes.com/ &lt;span class="pl-k"&gt;|&lt;/span&gt; strip-tags &lt;span class="pl-k"&gt;|&lt;/span&gt; ttok
115117
$ curl -s https://www.nytimes.com/ &lt;span class="pl-k"&gt;|&lt;/span&gt; strip-tags .story-wrapper &lt;span class="pl-k"&gt;|&lt;/span&gt; ttok
2165&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;By selecting just the text from within the &lt;code&gt;&amp;lt;section class="story-wrapper"&amp;gt;&lt;/code&gt; elements we can trim the whole page down to just the headlines and summaries of each of the main articles on the page.&lt;/p&gt;
&lt;h3&gt;Future plans&lt;/h3&gt;
&lt;p&gt;I'm really enjoying being able to use the terminal to interact with LLMs in this way. Having a quick way to pipe content to a model opens up all kinds of fun opportunities.&lt;/p&gt;
&lt;p&gt;Want a quick explanation of how some code works using GPT-4? Try this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;cat ttok/cli.py | llm --system 'Explain this code' --gpt4
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;(&lt;a href="https://gist.github.com/simonw/a06e091310ceee0b9d5146722279c93c"&gt;Output here&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;I've been having fun piping my &lt;a href="https://shot-scraper.datasette.io/"&gt;shot-scraper tool&lt;/a&gt; into it too, which goes a step further than &lt;code&gt;strip-tags&lt;/code&gt; in providing a full headless browser.&lt;/p&gt;
&lt;p&gt;Here's an example that uses the &lt;a href="https://til.simonwillison.net/shot-scraper/readability"&gt;Readability recipe from this TIL&lt;/a&gt; to extract the main article content, then further strips HTML tags from it and pipes it into the &lt;code&gt;llm&lt;/code&gt; command:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;shot-scraper javascript https://www.theguardian.com/uk-news/2023/may/18/rmt-to-hold-rail-strike-across-england-on-eve-of-fa-cup-final &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;async () =&amp;gt; {&lt;/span&gt;
&lt;span class="pl-s"&gt;    const readability = await import('https://cdn.skypack.dev/@mozilla/readability');&lt;/span&gt;
&lt;span class="pl-s"&gt;    return (new readability.Readability(document)).parse().content;&lt;/span&gt;
&lt;span class="pl-s"&gt;}&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-k"&gt;|&lt;/span&gt; strip-tags &lt;span class="pl-k"&gt;|&lt;/span&gt; llm --system summarize&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;In terms of next steps, the thing I'm most excited about is teaching that &lt;code&gt;llm&lt;/code&gt; command how to talk to other models - initially Claude and PaLM2 via APIs, but I'd love to get it working against locally hosted models running on things like &lt;a href="https://github.com/ggerganov/llama.cpp"&gt;llama.cpp&lt;/a&gt; as well.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/cli"&gt;cli&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openai"&gt;openai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/chatgpt"&gt;chatgpt&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/tokenization"&gt;tokenization&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="cli"/><category term="projects"/><category term="ai"/><category term="openai"/><category term="generative-ai"/><category term="chatgpt"/><category term="llms"/><category term="llm"/><category term="tokenization"/></entry></feed>