<?xml version="1.0" encoding="utf-8"?>
<feed xml:lang="en-us" xmlns="http://www.w3.org/2005/Atom"><title>Simon Willison's Weblog: paul-gauthier</title><link href="http://simonwillison.net/" rel="alternate"/><link href="http://simonwillison.net/tags/paul-gauthier.atom" rel="self"/><id>http://simonwillison.net/</id><updated>2025-05-08T05:49:12+00:00</updated><author><name>Simon Willison</name></author><entry><title>llm-gemini 0.19.1</title><link href="https://simonwillison.net/2025/May/8/llm-gemini-0191/#atom-tag" rel="alternate"/><published>2025-05-08T05:49:12+00:00</published><updated>2025-05-08T05:49:12+00:00</updated><id>https://simonwillison.net/2025/May/8/llm-gemini-0191/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/simonw/llm-gemini/releases/tag/0.19.1"&gt;llm-gemini 0.19.1&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Bugfix release for my &lt;a href="https://github.com/simonw/llm-gemini"&gt;llm-gemini&lt;/a&gt; plugin, which was recording the number of output tokens (needed to calculate the price of a response) incorrectly for the Gemini "thinking" models. Those models turn out to return &lt;code&gt;candidatesTokenCount&lt;/code&gt; and &lt;code&gt;thoughtsTokenCount&lt;/code&gt; as two separate values which need to be added together to get the total billed output token count. Full details in &lt;a href="https://github.com/simonw/llm-gemini/issues/75"&gt;this issue&lt;/a&gt;.&lt;/p&gt;
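The fix boils down to a one-line sum. Here's a minimal sketch, assuming a `usageMetadata` dictionary with the two field names described above - the helper itself is illustrative, not the plugin's actual code:

```python
def billed_output_tokens(usage_metadata: dict) -> int:
    # Gemini "thinking" models report visible output tokens and
    # reasoning tokens separately; both count toward billed output.
    candidates = usage_metadata.get("candidatesTokenCount") or 0
    thoughts = usage_metadata.get("thoughtsTokenCount") or 0
    return candidates + thoughts
```

Counting only `candidatesTokenCount` - the previous behavior - undercounts badly for thinking models, where the reasoning tokens can dominate the total.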
&lt;p&gt;I spotted this potential bug in &lt;a href="https://gist.github.com/simonw/87a59e7f5c12274d65e2ac053b0eacdb#token-usage"&gt;this response log&lt;/a&gt; this morning, and my concerns were confirmed when Paul Gauthier wrote about a similar fix in Aider in &lt;a href="https://aider.chat/2025/05/07/gemini-cost.html"&gt;Gemini 2.5 Pro Preview 03-25 benchmark cost&lt;/a&gt;, where he noted that the $6.32 cost recorded to benchmark Gemini 2.5 Pro Preview 03-25 was incorrect. Since that model is no longer available (despite &lt;a href="https://simonwillison.net/2025/May/6/gemini-25-pro-preview/"&gt;the date-based model alias persisting&lt;/a&gt;) Paul is not able to accurately calculate the new cost, but it's likely a lot more since the Gemini 2.5 Pro Preview 05-06 benchmark cost $37.&lt;/p&gt;
&lt;p&gt;I've gone through my &lt;a href="https://simonwillison.net/tags/gemini/"&gt;gemini tag&lt;/a&gt; and attempted to update my previous posts with new calculations - this mostly involved increases on the order of 12.336 cents to 16.316 cents (&lt;a href="https://simonwillison.net/2025/May/6/gemini-25-pro-preview/"&gt;as seen here&lt;/a&gt;).&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gemini"&gt;gemini&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/aider"&gt;aider&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-pricing"&gt;llm-pricing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/paul-gauthier"&gt;paul-gauthier&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="llm"/><category term="gemini"/><category term="aider"/><category term="llm-pricing"/><category term="paul-gauthier"/></entry><entry><title>Aider: Using uv as an installer</title><link href="https://simonwillison.net/2025/Mar/6/aider-using-uv-as-an-installer/#atom-tag" rel="alternate"/><published>2025-03-06T01:47:20+00:00</published><updated>2025-03-06T01:47:20+00:00</updated><id>https://simonwillison.net/2025/Mar/6/aider-using-uv-as-an-installer/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://aider.chat/2025/01/15/uv.html"&gt;Aider: Using uv as an installer&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Paul Gauthier has an innovative solution for the challenge of helping end users get a copy of his Aider CLI Python utility installed in an isolated virtual environment without first needing to teach them what an "isolated virtual environment" is.&lt;/p&gt;
&lt;p&gt;Provided you already have a Python install of version 3.8 or higher you can run this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;pip install aider-install &amp;amp;&amp;amp; aider-install
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;a href="https://pypi.org/project/aider-install/"&gt;aider-install&lt;/a&gt; package itself depends on &lt;a href="https://github.com/astral-sh/uv"&gt;uv&lt;/a&gt;. When you run &lt;code&gt;aider-install&lt;/code&gt; it executes the following &lt;a href="https://github.com/Aider-AI/aider-install/blob/main/aider_install/main.py"&gt;Python code&lt;/a&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;span class="pl-k"&gt;def&lt;/span&gt; &lt;span class="pl-en"&gt;install_aider&lt;/span&gt;():
    &lt;span class="pl-k"&gt;try&lt;/span&gt;:
        &lt;span class="pl-s1"&gt;uv_bin&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;uv&lt;/span&gt;.&lt;span class="pl-c1"&gt;find_uv_bin&lt;/span&gt;()
        &lt;span class="pl-s1"&gt;subprocess&lt;/span&gt;.&lt;span class="pl-c1"&gt;check_call&lt;/span&gt;([
            &lt;span class="pl-s1"&gt;uv_bin&lt;/span&gt;, &lt;span class="pl-s"&gt;"tool"&lt;/span&gt;, &lt;span class="pl-s"&gt;"install"&lt;/span&gt;, &lt;span class="pl-s"&gt;"--force"&lt;/span&gt;, &lt;span class="pl-s"&gt;"--python"&lt;/span&gt;, &lt;span class="pl-s"&gt;"python3.12"&lt;/span&gt;, &lt;span class="pl-s"&gt;"aider-chat@latest"&lt;/span&gt;
        ])
        &lt;span class="pl-s1"&gt;subprocess&lt;/span&gt;.&lt;span class="pl-c1"&gt;check_call&lt;/span&gt;([&lt;span class="pl-s1"&gt;uv_bin&lt;/span&gt;, &lt;span class="pl-s"&gt;"tool"&lt;/span&gt;, &lt;span class="pl-s"&gt;"update-shell"&lt;/span&gt;])
    &lt;span class="pl-k"&gt;except&lt;/span&gt; &lt;span class="pl-s1"&gt;subprocess&lt;/span&gt;.&lt;span class="pl-c1"&gt;CalledProcessError&lt;/span&gt; &lt;span class="pl-k"&gt;as&lt;/span&gt; &lt;span class="pl-s1"&gt;e&lt;/span&gt;:
        &lt;span class="pl-en"&gt;print&lt;/span&gt;(&lt;span class="pl-s"&gt;f"Failed to install aider: &lt;span class="pl-s1"&gt;&lt;span class="pl-kos"&gt;{&lt;/span&gt;&lt;span class="pl-s1"&gt;e&lt;/span&gt;&lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;/span&gt;"&lt;/span&gt;)
        &lt;span class="pl-s1"&gt;sys&lt;/span&gt;.&lt;span class="pl-c1"&gt;exit&lt;/span&gt;(&lt;span class="pl-c1"&gt;1&lt;/span&gt;)&lt;/pre&gt;

&lt;p&gt;This first figures out the location of the &lt;code&gt;uv&lt;/code&gt; Rust binary, then uses it to install his &lt;a href="https://pypi.org/project/aider-chat/"&gt;aider-chat&lt;/a&gt; package by running the equivalent of this command:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;uv tool install --force --python python3.12 aider-chat@latest
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This will in turn install a brand new standalone copy of Python 3.12 and tuck it away in uv's own managed directory structure where it shouldn't hurt anything else.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;aider-chat&lt;/code&gt; script defaults to being dropped in the XDG standard directory, which is probably &lt;code&gt;~/.local/bin&lt;/code&gt; - see &lt;a href="https://docs.astral.sh/uv/concepts/tools/#the-bin-directory"&gt;uv's documentation&lt;/a&gt;. The &lt;a href="https://docs.astral.sh/uv/concepts/tools/#overwriting-executables"&gt;--force flag&lt;/a&gt; ensures that &lt;code&gt;uv&lt;/code&gt; will overwrite any previous attempts at installing &lt;code&gt;aider-chat&lt;/code&gt; in that location with the new one.&lt;/p&gt;
&lt;p&gt;Finally, running &lt;code&gt;uv tool update-shell&lt;/code&gt; ensures that bin directory is &lt;a href="https://docs.astral.sh/uv/concepts/tools/#the-path"&gt;on the user's PATH&lt;/a&gt;.&lt;/p&gt;
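The effect of that last step can be approximated with a quick check - an illustrative sketch (the `bin_on_path` helper is hypothetical, not part of uv):

```python
import os
from pathlib import Path


def bin_on_path(bin_dir: str, path_env: str) -> bool:
    # True if bin_dir appears as an entry in a PATH-style string
    target = Path(bin_dir).expanduser()
    return any(
        Path(entry).expanduser() == target
        for entry in path_env.split(os.pathsep)
        if entry
    )
```

If `~/.local/bin` is already on the PATH, `uv tool update-shell` has nothing to do; otherwise it appends the export to the user's shell configuration.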
&lt;p&gt;I &lt;em&gt;think&lt;/em&gt; I like this. There is a LOT of stuff going on here, and experienced users may well opt for an &lt;a href="https://aider.chat/docs/install.html"&gt;alternative installation mechanism&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;But for non-expert Python users who just want to start using Aider, I think this pattern represents quite a tasteful way of getting everything working with minimal risk of breaking the user's system.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Update&lt;/strong&gt;: Paul &lt;a href="https://twitter.com/paulgauthier/status/1897486573857595877"&gt;adds&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Offering this install method dramatically reduced the number of GitHub issues from users with conflicted/broken python environments.&lt;/p&gt;
&lt;p&gt;I also really like the "curl | sh" aider installer based on uv. Even users who don't have python installed can use it.&lt;/p&gt;
&lt;/blockquote&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/cli"&gt;cli&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/aider"&gt;aider&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/uv"&gt;uv&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/paul-gauthier"&gt;paul-gauthier&lt;/a&gt;&lt;/p&gt;



</summary><category term="cli"/><category term="python"/><category term="aider"/><category term="uv"/><category term="paul-gauthier"/></entry><entry><title>Initial impressions of GPT-4.5</title><link href="https://simonwillison.net/2025/Feb/27/introducing-gpt-45/#atom-tag" rel="alternate"/><published>2025-02-27T22:02:59+00:00</published><updated>2025-02-27T22:02:59+00:00</updated><id>https://simonwillison.net/2025/Feb/27/introducing-gpt-45/#atom-tag</id><summary type="html">
    &lt;p&gt;GPT-4.5 &lt;a href="https://openai.com/index/introducing-gpt-4-5/"&gt;is out today&lt;/a&gt; as a "research preview" - it's available to OpenAI Pro ($200/month) customers and to developers with an API key. OpenAI also published &lt;a href="https://openai.com/index/gpt-4-5-system-card/"&gt;a GPT-4.5 system card&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I've &lt;a href="https://github.com/simonw/llm/issues/795"&gt;added it to LLM&lt;/a&gt; so you can run it like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm -m gpt-4.5-preview 'impress me'&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;It's &lt;em&gt;very&lt;/em&gt; expensive right now: &lt;a href="https://openai.com/api/pricing/"&gt;currently&lt;/a&gt; $75.00 per million input tokens and $150/million for output! For comparison, o1 is $15/$60 and GPT-4o is $2.50/$10. GPT-4o mini is $0.15/$0.60 making OpenAI's least expensive model 500x cheaper than GPT-4.5 for input and 250x cheaper for output!&lt;/p&gt;
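Those multiples check out if you run the arithmetic on the quoted per-million-token prices (figures as stated above):

```python
# USD per million tokens (input, output), as quoted at launch
prices = {
    "gpt-4.5-preview": (75.00, 150.00),
    "o1": (15.00, 60.00),
    "gpt-4o": (2.50, 10.00),
    "gpt-4o-mini": (0.15, 0.60),
}

# How many times cheaper GPT-4o mini is than GPT-4.5
input_ratio = prices["gpt-4.5-preview"][0] / prices["gpt-4o-mini"][0]
output_ratio = prices["gpt-4.5-preview"][1] / prices["gpt-4o-mini"][1]
```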
&lt;p&gt;As far as I can tell almost all of its key characteristics are the same as GPT-4o: it has the same 128,000 context length, handles the same inputs (text and image) and even has the same training cut-off date of October 2023.&lt;/p&gt;
&lt;p&gt;So what's it better at? According to OpenAI's blog post:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Combining deep understanding of the world with improved collaboration results in a model that integrates ideas naturally in warm and intuitive conversations that are more attuned to human collaboration. GPT‑4.5 has a better understanding of what humans mean and interprets subtle cues or implicit expectations with greater nuance and “EQ”. GPT‑4.5 also shows stronger aesthetic intuition and creativity. It excels at helping with writing and design.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;They include this chart of win-rates against GPT-4o, where it wins between 56.8% and 63.2% of the time for different classes of query:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/gpt-win-rates.jpg" alt="Bar chart showing GPT-4.5 win-rate vs GPT-4o across three categories: Everyday queries (57.0%), Professional queries (63.2%), and Creative intelligence (56.8%)." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;They also report a SimpleQA hallucination rate of 37.1% - a big improvement on GPT-4o (61.8%) and o3-mini (80.3%) but not much better than o1 (44%). The coding benchmarks all appear to score similarly to o3-mini.&lt;/p&gt;
&lt;p&gt;Paul Gauthier &lt;a href="https://twitter.com/paulgauthier/status/1895221869844013108"&gt;reports&lt;/a&gt; a score of 45% on Aider's &lt;a href="https://aider.chat/docs/leaderboards/"&gt;polyglot coding benchmark&lt;/a&gt; - below DeepSeek V3 (48%), Sonnet 3.7 (60% without thinking, 65% with thinking) and o3-mini (60.4%) but significantly ahead of GPT-4o (23.1%).&lt;/p&gt;
&lt;p id="confidence"&gt;OpenAI don't seem to have enormous confidence in the model themselves:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;GPT‑4.5 is a very large and compute-intensive model, making it more &lt;a href="https://openai.com/api/pricing/"&gt;expensive⁠&lt;/a&gt; than and not a replacement for GPT‑4o. Because of this, we're evaluating whether to continue serving it in the API long-term as we balance supporting current capabilities with building future models.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;It drew me this for "Generate an SVG of a pelican riding a bicycle":&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/pelican-gpt45.jpg" alt="A pretty simple pelican, not as good as other leading models" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;Accessed via the API the model feels weirdly slow - here's an animation showing how that pelican was rendered - the full response &lt;a href="https://gist.github.com/simonw/90834e1ca91e3f802d80f67bac94ad7d#file-pelican-json-L41"&gt;took 112 seconds&lt;/a&gt;!&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/gpt-45.gif" alt="Animated terminal session - the tokens are coming back very slowly" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;OpenAI's Rapha Gontijo Lopes &lt;a href="https://twitter.com/rapha_gl/status/1895213014699385082"&gt;calls this&lt;/a&gt; "(probably) the largest model in the world" - evidently the problem with large models is that they are a whole lot slower than their smaller alternatives!&lt;/p&gt;
&lt;p&gt;Andrej Karpathy &lt;a href="https://x.com/karpathy/status/1895213020982472863"&gt;has published some notes&lt;/a&gt; on the new model, where he highlights that the improvements are limited considering the 10x increase in training compute over GPT-4:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;I remember being a part of a hackathon trying to find concrete prompts where GPT4 outperformed 3.5. They definitely existed, but clear and concrete "slam dunk" examples were difficult to find. [...] So it is with that expectation that I went into testing GPT4.5, which I had access to for a few days, and which saw 10X more pretraining compute than GPT4. And I feel like, once again, I'm in the same hackathon 2 years ago. Everything is a little bit better and it's awesome, but also not exactly in ways that are trivial to point to.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Andrej is also running a fun &lt;a href="https://twitter.com/karpathy/status/1895213023238987854"&gt;vibes-based polling evaluation&lt;/a&gt; comparing output from GPT-4.5 and GPT-4o. &lt;strong&gt;Update:&lt;/strong&gt; &lt;a href="https://twitter.com/karpathy/status/1895337579589079434"&gt;GPT-4o won 4/5 rounds&lt;/a&gt;!&lt;/p&gt;
&lt;p&gt;There's an &lt;a href="https://news.ycombinator.com/item?id=43197872"&gt;extensive thread&lt;/a&gt; about GPT-4.5 on Hacker News. When it hit 324 comments I ran a summary of it using GPT-4.5 itself with &lt;a href="https://til.simonwillison.net/llms/claude-hacker-news-themes"&gt;this script&lt;/a&gt;:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;hn-summary.sh 43197872 -m gpt-4.5-preview&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Here's &lt;a href="https://gist.github.com/simonw/5e9f5e94ac8840f698c280293d39965e"&gt;the result&lt;/a&gt;, which took 154 seconds to generate and cost $2.11 (25,797 input tokens and 1,225 output tokens, price calculated using my &lt;a href="https://tools.simonwillison.net/llm-prices"&gt;LLM pricing calculator&lt;/a&gt;).&lt;/p&gt;
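That figure can be reproduced from the token counts - a quick sketch of the same arithmetic the pricing calculator performs:

```python
def cost_usd(input_tokens, output_tokens, input_per_m, output_per_m):
    # Prices are quoted in USD per million tokens
    return (input_tokens / 1_000_000) * input_per_m + (
        output_tokens / 1_000_000
    ) * output_per_m


# GPT-4.5 preview: $75/M input, $150/M output
summary_cost = cost_usd(25_797, 1_225, 75.00, 150.00)
```

That works out to roughly $2.12 - within a cent of the reported figure, the difference being down to rounding.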
&lt;p&gt;For comparison, I ran the same prompt against &lt;a href="https://gist.github.com/simonw/592d651ec61daec66435a6f718c0618b"&gt;GPT-4o&lt;/a&gt;, &lt;a href="https://gist.github.com/simonw/cc760217623769f0d7e4687332bce409"&gt;GPT-4o Mini&lt;/a&gt;, &lt;a href="https://gist.github.com/simonw/6f11e1974e4d613258b3237380e0ecb3"&gt;Claude 3.7 Sonnet&lt;/a&gt;, &lt;a href="https://gist.github.com/simonw/c178f02c97961e225eb615d4b9a1dea3"&gt;Claude 3.5 Haiku&lt;/a&gt;, &lt;a href="https://gist.github.com/simonw/0c6f071d9ad1cea493de4e5e7a0986bb"&gt;Gemini 2.0 Flash&lt;/a&gt;, &lt;a href="https://gist.github.com/simonw/8a71396a4a219d8281e294b61a9d6dd5"&gt;Gemini 2.0 Flash Lite&lt;/a&gt; and &lt;a href="https://gist.github.com/simonw/112e3f4660a1a410151e86ec677e34ab"&gt;Gemini 2.0 Pro&lt;/a&gt;.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openai"&gt;openai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/andrej-karpathy"&gt;andrej-karpathy&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/evals"&gt;evals&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/uv"&gt;uv&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/paul-gauthier"&gt;paul-gauthier&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="ai"/><category term="openai"/><category term="andrej-karpathy"/><category term="generative-ai"/><category term="llms"/><category term="evals"/><category term="uv"/><category term="pelican-riding-a-bicycle"/><category term="paul-gauthier"/><category term="llm-release"/></entry><entry><title>Aider Polyglot leaderboard results for Claude 3.7 Sonnet</title><link href="https://simonwillison.net/2025/Feb/25/aider-polyglot-leaderboard/#atom-tag" rel="alternate"/><published>2025-02-25T00:56:03+00:00</published><updated>2025-02-25T00:56:03+00:00</updated><id>https://simonwillison.net/2025/Feb/25/aider-polyglot-leaderboard/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://aider.chat/docs/leaderboards/#polyglot-leaderboard"&gt;Aider Polyglot leaderboard results for Claude 3.7 Sonnet&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Paul Gauthier's &lt;a href="https://aider.chat/2024/12/21/polyglot.html#the-polyglot-benchmark"&gt;Aider Polyglot benchmark&lt;/a&gt; is one of my favourite independent benchmarks for LLMs, partly because it focuses on code and partly because Paul is very responsive at evaluating new models.&lt;/p&gt;
&lt;p&gt;The brand new Claude 3.7 Sonnet just took the top place, when run with an increased 32,000 thinking token limit.&lt;/p&gt;
&lt;p&gt;It's interesting comparing the benchmark costs - 3.7 Sonnet spent $36.83 running the whole thing, significantly more than the previously leading DeepSeek R1 + Claude 3.5 combo, but a whole lot less than third place o1-high:&lt;/p&gt;
&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;Model&lt;/th&gt;
      &lt;th&gt;% completed&lt;/th&gt;
      &lt;th&gt;Total cost&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;claude-3-7-sonnet-20250219 (32k thinking tokens)&lt;/td&gt;
      &lt;td&gt;64.9%&lt;/td&gt;
      &lt;td&gt;$36.83&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;DeepSeek R1 + claude-3-5-sonnet-20241022&lt;/td&gt;
      &lt;td&gt;64.0%&lt;/td&gt;
      &lt;td&gt;$13.29&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;o1-2024-12-17 (high)&lt;/td&gt;
      &lt;td&gt;61.7%&lt;/td&gt;
      &lt;td&gt;$186.50&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;claude-3-7-sonnet-20250219 (no thinking)&lt;/td&gt;
      &lt;td&gt;60.4%&lt;/td&gt;
      &lt;td&gt;$17.72&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;o3-mini (high)&lt;/td&gt;
      &lt;td&gt;60.4%&lt;/td&gt;
      &lt;td&gt;$18.16&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
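To put the table in perspective, the relative costs are easy to compute (figures taken from the table above):

```python
# Total benchmark cost in USD, from the leaderboard table
costs = {
    "claude-3-7-sonnet (32k thinking)": 36.83,
    "deepseek-r1 + claude-3-5-sonnet": 13.29,
    "o1-high": 186.50,
}

# o1-high cost roughly five times as much as the new leader
o1_vs_sonnet = costs["o1-high"] / costs["claude-3-7-sonnet (32k thinking)"]
```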

&lt;p&gt;No results yet for Claude 3.7 Sonnet on the &lt;a href="https://lmarena.ai/"&gt;LM Arena leaderboard&lt;/a&gt;, which has recently been dominated by Gemini 2.0 and Grok 3.&lt;/p&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/paulgauthier/status/1894167915869737058"&gt;@paulgauthier&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/anthropic"&gt;anthropic&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude"&gt;claude&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/aider"&gt;aider&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/evals"&gt;evals&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/paul-gauthier"&gt;paul-gauthier&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="anthropic"/><category term="claude"/><category term="aider"/><category term="evals"/><category term="paul-gauthier"/></entry><entry><title>Quoting Paul Gauthier</title><link href="https://simonwillison.net/2025/Jan/26/paul-gauthier/#atom-tag" rel="alternate"/><published>2025-01-26T21:59:49+00:00</published><updated>2025-01-26T21:59:49+00:00</updated><id>https://simonwillison.net/2025/Jan/26/paul-gauthier/#atom-tag</id><summary type="html">
    &lt;blockquote cite="https://news.ycombinator.com/item?id=42831769#42834527"&gt;&lt;p&gt;In my experience with AI coding, very large context windows aren't useful in practice. Every model seems to get confused when you feed them more than ~25-30k tokens. The models stop obeying their system prompts, can't correctly find/transcribe pieces of code in the context, etc.&lt;/p&gt;
&lt;p&gt;Developing aider, I've seen this problem with gpt-4o, Sonnet, DeepSeek, etc. Many aider users report this too. It's perhaps the #1 problem users have, so I created a &lt;a href="https://aider.chat/docs/troubleshooting/edit-errors.html#dont-add-too-many-files"&gt;dedicated help page&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Very large context may be useful for certain tasks with lots of "low value" context. But for coding, it seems to lure users into a problematic regime.&lt;/p&gt;&lt;/blockquote&gt;
&lt;p class="cite"&gt;&amp;mdash; &lt;a href="https://news.ycombinator.com/item?id=42831769#42834527"&gt;Paul Gauthier&lt;/a&gt;&lt;/p&gt;

    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-programming"&gt;ai-assisted-programming&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/aider"&gt;aider&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/long-context"&gt;long-context&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/paul-gauthier"&gt;paul-gauthier&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="ai-assisted-programming"/><category term="aider"/><category term="long-context"/><category term="paul-gauthier"/></entry><entry><title>deepseek-ai/DeepSeek-V3-Base</title><link href="https://simonwillison.net/2024/Dec/25/deepseek-v3/#atom-tag" rel="alternate"/><published>2024-12-25T19:00:33+00:00</published><updated>2024-12-25T19:00:33+00:00</updated><id>https://simonwillison.net/2024/Dec/25/deepseek-v3/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://huggingface.co/deepseek-ai/DeepSeek-V3-Base"&gt;deepseek-ai/DeepSeek-V3-Base&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
No model card or announcement yet, but this new model release from Chinese AI lab DeepSeek (an arm of Chinese hedge fund &lt;a href="https://en.wikipedia.org/wiki/High-Flyer_(company)"&gt;High-Flyer&lt;/a&gt;) looks very significant.&lt;/p&gt;
&lt;p&gt;It's a huge model - 685B parameters, 687.9 GB on disk (&lt;a href="https://til.simonwillison.net/git/size-of-lfs-files"&gt;TIL how to size a git-lfs repo&lt;/a&gt;). The architecture is &lt;a href="https://twitter.com/dysondunbar/status/1871955700949430299"&gt;a Mixture of Experts&lt;/a&gt; with 256 experts, using 8 per token.&lt;/p&gt;
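A Mixture of Experts layer routes each token through only a small subset of expert networks, which is how a 685B parameter model keeps inference costs down. Here's a toy illustration of the top-8-of-256 routing pattern described above, with random scores standing in for a trained router - purely illustrative, pure stdlib:

```python
import math
import random

random.seed(0)
n_experts, k = 256, 8  # DeepSeek-V3-style counts

# A trained router would score experts from the token's hidden state;
# random scores stand in for that here.
scores = [random.gauss(0, 1) for _ in range(n_experts)]

# Select the k highest-scoring experts for this token
top_k = sorted(range(n_experts), key=scores.__getitem__)[-k:]

# Softmax over just the selected experts gives the mixing weights;
# the token's output is the weighted sum of those experts' outputs.
m = max(scores[i] for i in top_k)
weights = [math.exp(scores[i] - m) for i in top_k]
total = sum(weights)
weights = [w / total for w in weights]
```

Only the 8 selected experts run for each token, so the active parameter count per token is a small fraction of the full 685B.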
&lt;p&gt;For comparison, Meta AI's largest released model is their &lt;a href="https://ai.meta.com/blog/meta-llama-3-1/"&gt;Llama 3.1 model&lt;/a&gt; with 405B parameters.&lt;/p&gt;
&lt;p&gt;The new model is apparently available to some people via both &lt;a href="https://chat.deepseek.com/"&gt;chat.deepseek.com&lt;/a&gt; and the DeepSeek API as part of a staged rollout.&lt;/p&gt;
&lt;p&gt;Paul Gauthier got API access and &lt;a href="https://twitter.com/paulgauthier/status/1871919612000092632"&gt;used it&lt;/a&gt; to update his new &lt;a href="https://aider.chat/docs/leaderboards/"&gt;Aider Polyglot leaderboard&lt;/a&gt; - DeepSeek v3 preview scored 48.4%, putting it in second place behind &lt;code&gt;o1-2024-12-17 (high)&lt;/code&gt; and in front of both &lt;code&gt;claude-3-5-sonnet-20241022&lt;/code&gt; and &lt;code&gt;gemini-exp-1206&lt;/code&gt;!&lt;/p&gt;
&lt;p&gt;&lt;img alt="Aider leaderboard chart showing DeepSeek Chat V3 preview in second place" src="https://static.simonwillison.net/static/2024/deepseek-v3.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;I never know if I can believe models or not (the first time I asked "what model are you?" it claimed to be "based on OpenAI's GPT-4 architecture"), but I just got this result using &lt;a href="https://llm.datasette.io/"&gt;LLM&lt;/a&gt; and the &lt;a href="https://pypi.org/project/llm-deepseek/"&gt;llm-deepseek&lt;/a&gt; plugin:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm -m deepseek-chat 'what deepseek model are you?'
&lt;/code&gt;&lt;/pre&gt;
&lt;blockquote&gt;
&lt;p&gt;I'm DeepSeek-V3 created exclusively by DeepSeek. I'm an AI assistant, and I'm at your service! Feel free to ask me anything you'd like. I'll do my best to assist you.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Here's my &lt;a href="https://gist.github.com/simonw/e7528dc52828fb31415f6e14e3527b93"&gt;initial experiment log&lt;/a&gt;.&lt;/p&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/ivanfioravanti/status/1871945175616135298"&gt;@ivanfioravanti&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/hugging-face"&gt;hugging-face&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/aider"&gt;aider&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/deepseek"&gt;deepseek&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/paul-gauthier"&gt;paul-gauthier&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-in-china"&gt;ai-in-china&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="hugging-face"/><category term="aider"/><category term="deepseek"/><category term="paul-gauthier"/><category term="llm-release"/><category term="ai-in-china"/></entry><entry><title>Quantization matters</title><link href="https://simonwillison.net/2024/Nov/23/quantization-matters/#atom-tag" rel="alternate"/><published>2024-11-23T18:39:23+00:00</published><updated>2024-11-23T18:39:23+00:00</updated><id>https://simonwillison.net/2024/Nov/23/quantization-matters/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://aider.chat/2024/11/21/quantization.html"&gt;Quantization matters&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
What impact does quantization have on the performance of an LLM? I've been wondering about that for quite a while, and now here are some concrete numbers from Paul Gauthier.&lt;/p&gt;
&lt;p&gt;He ran differently quantized versions of Qwen 2.5 32B Instruct through his &lt;a href="https://aider.chat/docs/benchmarks.html#the-benchmark"&gt;Aider code editing benchmark&lt;/a&gt; and saw a range of scores.&lt;/p&gt;
&lt;p&gt;The &lt;a href="https://huggingface.co/Qwen/Qwen2.5-Coder-32B-Instruct"&gt;original released weights&lt;/a&gt; (BF16) scored highest at 71.4%, with Ollama's &lt;a href="https://ollama.com/library/qwen2.5-coder:32b-instruct-fp16"&gt;qwen2.5-coder:32b-instruct-fp16&lt;/a&gt; (a 66GB download) achieving the same score.&lt;/p&gt;
&lt;p&gt;The quantized Ollama &lt;a href="https://ollama.com/library/qwen2.5-coder:32b-instruct-q4_K_M"&gt;qwen2.5-coder:32b-instruct-q4_K_M&lt;/a&gt; (a 20GB download) saw a massive drop in quality, scoring just 53.4% on the same benchmark.&lt;/p&gt;
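The trade-off is stark when laid out numerically (scores and download sizes from the post):

```python
variants = {
    # name: (benchmark score %, download size GB)
    "bf16": (71.4, 66),
    "q4_K_M": (53.4, 20),
}

# Q4_K_M saves 46GB of disk but gives up 18 percentage points
score_drop = variants["bf16"][0] - variants["q4_K_M"][0]
size_saving = variants["bf16"][1] - variants["q4_K_M"][1]
```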

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/paulgauthier/status/1859684310204473349"&gt;Paul Gauthier&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/aider"&gt;aider&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/qwen"&gt;qwen&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ollama"&gt;ollama&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/paul-gauthier"&gt;paul-gauthier&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-in-china"&gt;ai-in-china&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="generative-ai"/><category term="local-llms"/><category term="llms"/><category term="aider"/><category term="qwen"/><category term="ollama"/><category term="paul-gauthier"/><category term="ai-in-china"/></entry><entry><title>Qwen2.5-Coder-32B is an LLM that can code well that runs on my Mac</title><link href="https://simonwillison.net/2024/Nov/12/qwen25-coder/#atom-tag" rel="alternate"/><published>2024-11-12T23:37:36+00:00</published><updated>2024-11-12T23:37:36+00:00</updated><id>https://simonwillison.net/2024/Nov/12/qwen25-coder/#atom-tag</id><summary type="html">
    &lt;p&gt;There's a whole lot of buzz around the new &lt;a href="https://qwenlm.github.io/blog/qwen2.5-coder-family/"&gt;Qwen2.5-Coder Series&lt;/a&gt; of open source (Apache 2.0 licensed) LLM releases from Alibaba's Qwen research team. On first impression it looks like the buzz is well deserved.&lt;/p&gt;
&lt;p&gt;Qwen claim:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Qwen2.5-Coder-32B-Instruct has become the current SOTA open-source code model, matching the coding capabilities of GPT-4o.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;That's a &lt;em&gt;big&lt;/em&gt; claim for a 32B model that's small enough that it can run on my 64GB MacBook Pro M2. The Qwen published scores look impressive, comparing favorably with GPT-4o and Claude 3.5 Sonnet (October 2024) edition across various code-related benchmarks:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2024/qwen-scores.jpg" alt="In benchmark comparisons, Qwen 2.5 Coder (32B Instruct) outperforms both GPT-4o and Claude 3.5 Sonnet on LiveCodeBench, Spider, and BIRD-SQL metrics, falls behind on MBPP, Aider, and CodeArena, shows mixed results on MultiPL-E, and performs similarly on HumanEval and McEval benchmarks." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;How about benchmarks from other researchers? Paul Gauthier's &lt;a href="https://aider.chat/docs/leaderboards/"&gt;Aider benchmarks&lt;/a&gt; have a great reputation and &lt;a href="https://twitter.com/paulgauthier/status/1856018124031832236"&gt;Paul reports&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The new Qwen 2.5 Coder models did very well on aider's code editing benchmark. The 32B Instruct model scored in between GPT-4o and 3.5 Haiku.&lt;/p&gt;
&lt;p&gt;84% 3.5 Sonnet,
75% 3.5 Haiku,
74% Qwen2.5 Coder 32B,
71% GPT-4o,
69% Qwen2.5 Coder 14B,
58% Qwen2.5 Coder 7B&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2024/qwen-paul.jpg" alt="Those numbers as a chart" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;That was for the Aider "whole edit" benchmark. The "diff" benchmark &lt;a href="https://twitter.com/paulgauthier/status/1856042640279777420"&gt;scores well&lt;/a&gt; too, with Qwen2.5 Coder 32B tying with GPT-4o (but a little behind Claude 3.5 Haiku).&lt;/p&gt;
&lt;p&gt;Given these scores (and the &lt;a href="https://www.reddit.com/r/LocalLLaMA/comments/1gp84in/qwen25coder_32b_the_ai_thats_revolutionizing/"&gt;positive buzz on Reddit&lt;/a&gt;) I had to try it for myself.&lt;/p&gt;
&lt;p&gt;My attempts to run the &lt;a href="https://huggingface.co/Qwen/Qwen2.5-Coder-32B-Instruct-GGUF"&gt;Qwen/Qwen2.5-Coder-32B-Instruct-GGUF&lt;/a&gt; Q8 using &lt;a href="https://github.com/simonw/llm-gguf"&gt;llm-gguf&lt;/a&gt; were a bit too slow, because I don't currently have that library compiled to use my Mac's GPU.&lt;/p&gt;
&lt;p&gt;But both the &lt;a href="https://ollama.com/"&gt;Ollama&lt;/a&gt; version &lt;em&gt;and&lt;/em&gt; the &lt;a href="https://github.com/ml-explore/mlx"&gt;MLX&lt;/a&gt; version worked great!&lt;/p&gt;
&lt;p&gt;I installed the Ollama version using:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;ollama pull qwen2.5-coder:32b
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;That fetched a 20GB quantized file. I ran a prompt through that using my &lt;a href="https://llm.datasette.io/"&gt;LLM&lt;/a&gt; tool and Sergey Alexandrov's &lt;a href="https://github.com/taketwo/llm-ollama"&gt;llm-ollama&lt;/a&gt; plugin like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm install llm-ollama
llm models # Confirming the new model is present
llm -m qwen2.5-coder:32b 'python function that takes URL to a CSV file and path to a SQLite database, fetches the CSV with the standard library, creates a table with the right columns and inserts the data'
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here's &lt;a href="https://gist.github.com/simonw/0a47f9e35a50d4e25a47826f4ab75dda"&gt;the result&lt;/a&gt;. The code worked, but I had to work around a frustrating &lt;code&gt;ssl&lt;/code&gt; bug first (which wouldn't have been an issue if I'd allowed the model to use &lt;code&gt;requests&lt;/code&gt; or &lt;code&gt;httpx&lt;/code&gt; instead of the standard library).&lt;/p&gt;
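For context, here's a rough sketch of the kind of function that prompt asks for — this is my own code to illustrate the task, not the model's actual output (the gist has that):

```python
# Fetch a CSV over HTTP with only the standard library and load it
# into a SQLite table -- a hand-written sketch of the prompted task.
import csv
import io
import sqlite3
import urllib.request


def csv_to_sqlite(url, db_path, table="data"):
    # Fetch the CSV using only the standard library (no requests/httpx)
    with urllib.request.urlopen(url) as response:
        text = response.read().decode("utf-8")
    rows = list(csv.reader(io.StringIO(text)))
    headers, body = rows[0], rows[1:]
    conn = sqlite3.connect(db_path)
    # Quote column names so arbitrary CSV headers are safe to use
    cols = ", ".join('"{}"'.format(h) for h in headers)
    conn.execute('CREATE TABLE IF NOT EXISTS "{}" ({})'.format(table, cols))
    placeholders = ", ".join("?" for _ in headers)
    conn.executemany(
        'INSERT INTO "{}" VALUES ({})'.format(table, placeholders), body
    )
    conn.commit()
    return conn
```

Note that `urllib.request` is exactly where the model's stdlib-only version hit the `ssl` trouble mentioned above.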
&lt;p&gt;I also tried running it via MLX, Apple's fast array framework for Apple Silicon, using the &lt;a href="https://github.com/ml-explore/mlx-lm"&gt;mlx-lm&lt;/a&gt; library directly, run via &lt;a href="https://github.com/astral-sh/uv"&gt;uv&lt;/a&gt; like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;uv run --with mlx-lm \
  mlx_lm.generate \
  --model mlx-community/Qwen2.5-Coder-32B-Instruct-8bit \
  --max-tokens 4000 \
  --prompt 'write me a python function that renders a mandelbrot fractal as wide as the current terminal'
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;That gave me a &lt;em&gt;very&lt;/em&gt; &lt;a href="https://gist.github.com/simonw/1cc1e0418a04dbd19cd281cf9b43666f"&gt;satisfying result&lt;/a&gt; - when I ran the code it generated in a terminal I got this:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2024/mlx-fractal.jpg" alt="macOS terminal window displaying a pleasing mandelbrot fractal as ASCII art" style="max-width: 100%;" /&gt;&lt;/p&gt;
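To give a sense of what that prompt demands, here's a hand-written sketch of a terminal-width mandelbrot renderer — my own code, not the model's output (which is in the gist):

```python
# Render the mandelbrot set as ASCII art sized to the terminal,
# using a simple escape-time coloring.
import shutil


def mandelbrot(width=None, height=24, max_iter=30):
    if width is None:
        # Use the current terminal width, falling back to 80 columns
        width = shutil.get_terminal_size((80, 24)).columns
    chars = " .:-=+*#%@"  # escape-time palette, densest character last
    lines = []
    for row in range(height):
        # Map each character cell to a point in the complex plane
        y = -1.2 + 2.4 * row / max(height - 1, 1)
        line = ""
        for col in range(width):
            x = -2.2 + 3.2 * col / max(width - 1, 1)
            z, c = 0j, complex(x, y)
            n = max_iter
            for i in range(max_iter):
                z = z * z + c
                if abs(z) > 2:  # diverged: record the escape time
                    n = i
                    break
            line += chars[n * (len(chars) - 1) // max_iter]
        lines.append(line)
    return "\n".join(lines)


if __name__ == "__main__":
    print(mandelbrot())
```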

&lt;p&gt;MLX reported the following performance metrics:&lt;/p&gt;
&lt;pre&gt;Prompt: 49 tokens, 95.691 tokens-per-sec
Generation: 723 tokens, 10.016 tokens-per-sec
Peak memory: 32.685 GB&lt;/pre&gt;

&lt;p&gt;Let's see how it does on the &lt;a href="https://simonwillison.net/2024/Oct/25/pelicans-on-a-bicycle/"&gt;Pelican on a bicycle benchmark&lt;/a&gt;.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;llm -m qwen2.5-coder:32b 'Generate an SVG of a pelican riding a bicycle'&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Here's &lt;a href="https://gist.github.com/simonw/56217af454695a90be2c8e09c703198a"&gt;what I got&lt;/a&gt;:&lt;/p&gt;

&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2024/qwen-pelican.svg" alt="A jumble of shapes. The pelican has a yellow body, a black head and a weird proboscis kind of thing. The bicycle is several brown overlapping shapes that looks a bit like a tractor." /&gt;&lt;/p&gt;

&lt;p&gt;Questionable Pelican SVG drawings aside, this is a really promising development. 32GB is just small enough that I can run the model on my Mac without having to quit every other application I'm running, and both the speed and the quality of the results feel genuinely competitive with the current best of the hosted models.&lt;/p&gt;

&lt;p&gt;Given that code assistance is probably around 80% of my LLM usage at the moment this is a meaningfully useful release for how I engage with this class of technology.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/mandelbrot"&gt;mandelbrot&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/open-source"&gt;open-source&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-programming"&gt;ai-assisted-programming&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/uv"&gt;uv&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/qwen"&gt;qwen&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/mlx"&gt;mlx&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ollama"&gt;ollama&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/paul-gauthier"&gt;paul-gauthier&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-in-china"&gt;ai-in-china&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="mandelbrot"/><category term="open-source"/><category term="ai"/><category term="generative-ai"/><category term="local-llms"/><category term="llms"/><category term="ai-assisted-programming"/><category term="llm"/><category term="uv"/><category term="qwen"/><category term="mlx"/><category term="ollama"/><category term="pelican-riding-a-bicycle"/><category term="paul-gauthier"/><category term="llm-release"/><category term="ai-in-china"/></entry><entry><title>LLMs are bad at returning code in JSON</title><link href="https://simonwillison.net/2024/Aug/16/llms-are-bad-at-returning-code-in-json/#atom-tag" rel="alternate"/><published>2024-08-16T17:04:39+00:00</published><updated>2024-08-16T17:04:39+00:00</updated><id>https://simonwillison.net/2024/Aug/16/llms-are-bad-at-returning-code-in-json/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://aider.chat/2024/08/14/code-in-json.html"&gt;LLMs are bad at returning code in JSON&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Paul Gauthier's &lt;a href="https://aider.chat/"&gt;Aider&lt;/a&gt; is a terminal-based coding assistant which works against multiple different models. As part of developing the project Paul runs extensive benchmarks, and his latest shows an interesting result: LLMs are slightly less reliable at producing working code if you request that code be returned as part of a JSON response.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Coding skill by model and code wrapping strategy - four models, each showing their pass rate % average of five runs. Claude 3.5 Sonnet gets 60.5% with Markdown, 54.1% with JSON. DeepSeek-Coder V2 0724 gets 60.6% with Markdown, 51.1% with JSON. GPT-4o-2024-05-13 gets 60.0% with Markdown, 59.6% with JSON. GPT-4o-2024-08-06 gets 60.8% with Markdown, 57.6% with JSON, and 56.9% with JSON (strict). Markdown consistently performs better than JSON across all models." src="https://static.simonwillison.net/static/2024/llm-code-json.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;The May release of GPT-4o is the closest to a perfect score - the August release appears to have regressed slightly, and the new structured output mode doesn't help and could even make things worse (though that difference may not be statistically significant).&lt;/p&gt;
&lt;p&gt;Paul recommends using Markdown delimiters here instead, which are less likely to introduce confusing nested quoting issues.&lt;/p&gt;
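A quick illustration of that nesting problem (my own example, not taken from Paul's benchmark): the same snippet is emitted verbatim inside a Markdown fence, but inside a JSON string every quote and backslash must itself be escaped.

```python
# Compare the literal characters a model must produce for the same
# code snippet when wrapped in Markdown versus embedded in JSON.
import json

code = 'print("hello\\nworld")'

# Markdown: the code appears character-for-character inside the fence
markdown = "```python\n" + code + "\n```"

# JSON: quotes and backslashes in the code must be escaped again,
# which is exactly where models tend to slip up
as_json = json.dumps({"code": code})
```

Running this, `as_json` contains doubled backslashes and `\"` escapes that the Markdown version never needs.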

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/paulgauthier/status/1824442504290374061"&gt;@paulgauthier&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/json"&gt;json&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/prompt-engineering"&gt;prompt-engineering&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/aider"&gt;aider&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/evals"&gt;evals&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/paul-gauthier"&gt;paul-gauthier&lt;/a&gt;&lt;/p&gt;



</summary><category term="json"/><category term="ai"/><category term="prompt-engineering"/><category term="generative-ai"/><category term="llms"/><category term="aider"/><category term="evals"/><category term="paul-gauthier"/></entry><entry><title>Aider</title><link href="https://simonwillison.net/2024/Jul/31/aider/#atom-tag" rel="alternate"/><published>2024-07-31T03:26:17+00:00</published><updated>2024-07-31T03:26:17+00:00</updated><id>https://simonwillison.net/2024/Jul/31/aider/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/paul-gauthier/aider"&gt;Aider&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Aider is an impressive open source coding chat assistant that runs locally in your terminal, developed by Paul Gauthier (founding CTO of &lt;a href="https://en.wikipedia.org/wiki/Inktomi"&gt;Inktomi&lt;/a&gt; back in &lt;a href="https://www.linkedin.com/in/paulgauthier/details/experience/"&gt;1996-2000&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;I tried it out today, using an Anthropic API key to run it using Claude 3.5 Sonnet:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;pipx install aider-chat
export ANTHROPIC_API_KEY=api-key-here
aider --dark-mode
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I found the &lt;code&gt;--dark-mode&lt;/code&gt; flag necessary to make it legible using the macOS terminal "Pro" theme.&lt;/p&gt;
&lt;p&gt;Aider starts by generating a &lt;a href="https://aider.chat/docs/repomap.html"&gt;concise map of files&lt;/a&gt; in your current Git repository. This is passed to the LLM along with the prompts that you type, and Aider can then request additional files be added to that context - or you can add them manually with the &lt;code&gt;/add filename&lt;/code&gt; command.&lt;/p&gt;
&lt;p&gt;It defaults to making modifications to files and then committing them directly to Git with a generated commit message. I found myself preferring the &lt;code&gt;/ask&lt;/code&gt; command which lets you ask a question without making any file modifications:&lt;/p&gt;
&lt;p&gt;&lt;a href="https://static.simonwillison.net/static/2024/aider.jpg"&gt;&lt;img alt="Loaded /Users/simon/Dropbox/Development/simonwillisonblog/.env Aider v0.46.1 Models: claude-3-5-sonnet-20240620 with diff edit format, weak model claude-3-haiku-20240307 Git repo: .git with 156 files Repo-map: using 1024 tokens Use /help &amp;lt;question&amp;gt; for help, run &amp;quot;aider --help&amp;quot; to see cmd line args  &amp;gt; /add blog/search.py  Added /Users/simon/Dropbox/Development/simonwillisonblog/blog/search.py to the chat  blog/search.py   &amp;gt; /ask what does ?exclude.tag= do in the search feature?         The ?exclude.tag= parameter in the search feature is used to exclude specific tags from the search results. Here's how it      works in the code:        1 The excluded tags are retrieved from the request's GET parameters:   excluded_tags = request.GET.getlist(&amp;quot;exclude.tag&amp;quot;)    This line collects all the tags that should be excluded from the search results. The getlist() method is used because there    might be multiple excluded tags. 2 These excluded tags are then applied to the queryset for each content type (Entry, Blogmark, Quotation) in the   make_queryset() function:       for exclude_tag in excluded_tags:  qs = qs.exclude(tags__tag=exclude_tag)     This loop iterates through each excluded tag and uses Django's exclude() method to filter out any items that have the   specified tag.    So, if a user includes ?exclude.tag=python&amp;amp;exclude.tag=django in their search query, the results will not include any items    tagged with either &amp;quot;python&amp;quot; or &amp;quot;django&amp;quot;. This feature allows users to refine their search by explicitly removing certain topics from the results, which can be   particularly useful when searching through a large number of tagged items.  Tokens: 3,884 sent, 248 received. Cost: $0.02 request, $0.02 session. 
blog/search.py" src="https://static.simonwillison.net/static/2024/aider.jpg" /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;The Aider documentation includes &lt;a href="https://aider.chat/examples/README.html"&gt;extensive examples&lt;/a&gt; and the tool can work with a wide range of &lt;a href="https://aider.chat/docs/llms.html"&gt;different LLMs&lt;/a&gt;, though it recommends GPT-4o, Claude 3.5 Sonnet (or 3 Opus) and DeepSeek Coder V2 for the best results. Aider maintains &lt;a href="https://aider.chat/docs/leaderboards/"&gt;its own leaderboard&lt;/a&gt;, emphasizing that "Aider works best with LLMs which are good at &lt;em&gt;editing&lt;/em&gt; code, not just good at writing code".&lt;/p&gt;
&lt;p&gt;The prompts it uses are pretty fascinating - they're tucked away in various &lt;code&gt;*_prompts.py&lt;/code&gt; files in &lt;a href="https://github.com/paul-gauthier/aider/tree/main/aider/coders"&gt;aider/coders&lt;/a&gt;.&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-programming"&gt;ai-assisted-programming&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude-3-5-sonnet"&gt;claude-3-5-sonnet&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/aider"&gt;aider&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/paul-gauthier"&gt;paul-gauthier&lt;/a&gt;&lt;/p&gt;



</summary><category term="python"/><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="ai-assisted-programming"/><category term="claude-3-5-sonnet"/><category term="aider"/><category term="paul-gauthier"/></entry></feed>