Simon Willison's Weblog: aider

llm-gemini 0.19.1

2025-05-08T05:49:12+00:00

Bugfix release for my llm-gemini plugin, which was recording the number of output tokens (needed to calculate the price of a response) incorrectly for the Gemini "thinking" models. Those models turn out to return candidatesTokenCount and thoughtsTokenCount as two separate values which need to be added together to get the total billed output token count. Full details in this issue.

I spotted this potential bug in this response log this morning, and my concerns were confirmed when Paul Gauthier wrote about a similar fix in Aider in Gemini 2.5 Pro Preview 03-25 benchmark cost, where he noted that the $6.32 cost recorded to benchmark Gemini 2.5 Pro Preview 03-25 was incorrect. Since that model is no longer available (despite the date-based model alias persisting) Paul is not able to accurately calculate the new cost, but it's likely a lot more since the Gemini 2.5 Pro Preview 05-06 benchmark cost $37.

I've gone through my gemini tag and attempted to update my previous posts with new calculations - this mostly involved increases in the order of 12.336 cents to 16.316 cents (as seen here).

Tags: ai, generative-ai, llms, llm, gemini, aider, llm-pricing, paul-gauthier

Aider: Using uv as an installer

2025-03-06T01:47:20+00:00

Aider: Using uv as an installer

Paul Gauthier has an innovative solution for the challenge of helping end users get a copy of his Aider CLI Python utility installed in an isolated virtual environment without first needing to teach them what an "isolated virtual environment" is.

Provided you already have a Python install of version 3.8 or higher you can run this:

pip install aider-install && aider-install

The aider-install package itself depends on uv. When you run aider-install it executes the following Python code:

def install_aider():
    try:
        uv_bin = uv.find_uv_bin()
        subprocess.check_call([
            uv_bin, "tool", "install", "--force", "--python", "python3.12", "aider-chat@latest"
        ])
        subprocess.check_call([uv_bin, "tool", "update-shell"])
    except subprocess.CalledProcessError as e:
        print(f"Failed to install aider: {e}")
        sys.exit(1)

This first figures out the location of the uv Rust binary, then uses it to install his aider-chat package by running the equivalent of this command:

uv tool install --force --python python3.12 aider-chat@latest

This will in turn install a brand new standalone copy of Python 3.12 and tuck it away in uv's own managed directory structure where it shouldn't hurt anything else.

The aider-chat script defaults to being dropped in the XDG standard directory, which is probably ~/.local/bin - see uv's documentation. The --force flag ensures that uv will overwrite any previous attempts at installing aider-chat in that location with the new one.

Finally, running uv tool update-shell ensures that bin directory is on the user's PATH.

I think I like this. There is a LOT of stuff going on here, and experienced users may well opt for an alternative installation mechanism.

But for non-expert Python users who just want to start using Aider, I think this pattern represents quite a tasteful way of getting everything working with minimal risk of breaking the user's system.

Update: Paul adds:

Offering this install method dramatically reduced the number of GitHub issues from users with conflicted/broken python environments.

I also really like the "curl | sh" aider installer based on uv. Even users who don't have python installed can use it.

Tags: cli, python, aider, uv, paul-gauthier

Aider Polyglot leaderboard results for Claude 3.7 Sonnet

2025-02-25T00:56:03+00:00

Aider Polyglot leaderboard results for Claude 3.7 Sonnet

Paul Gauthier's Aider Polyglot benchmark is one of my favourite independent benchmarks for LLMs, partly because it focuses on code and partly because Paul is very responsive at evaluating new models.

The brand new Claude 3.7 Sonnet just took the top place, when run with an increased 32,000 thinking token limit.

It's interesting comparing the benchmark costs - 3.7 Sonnet spent $36.83 running the whole thing, significantly more than the previously leading DeepSeek R1 + Claude 3.5 combo, but a whole lot less than third place o1-high:

Model	% completed	Total cost
claude-3-7-sonnet-20250219 (32k thinking tokens)	64.9%	$36.83
DeepSeek R1 + claude-3-5-sonnet-20241022	64.0%	$13.29
o1-2024-12-17 (high)	61.7%	$186.5
claude-3-7-sonnet-20250219 (no thinking)	60.4%	$17.72
o3-mini (high)	60.4%	$18.16

No results yet for Claude 3.7 Sonnet on the LM Arena leaderboard, which has recently been dominated by Gemini 2.0 and Grok 3.

Via @paulgauthier

Tags: ai, generative-ai, llms, anthropic, claude, aider, evals, paul-gauthier

Quoting Paul Gauthier

2025-01-26T21:59:49+00:00

In my experience with AI coding, very large context windows aren't useful in practice. Every model seems to get confused when you feed them more than ~25-30k tokens. The models stop obeying their system prompts, can't correctly find/transcribe pieces of code in the context, etc.

Developing aider, I've seen this problem with gpt-4o, Sonnet, DeepSeek, etc. Many aider users report this too. It's perhaps the #1 problem users have, so I created a dedicated help page.

Very large context may be useful for certain tasks with lots of "low value" context. But for coding, it seems to lure users into a problematic regime.

— Paul Gauthier

Tags: ai, generative-ai, llms, ai-assisted-programming, aider, long-context, paul-gauthier

Codestral 25.01

2025-01-13T21:33:37+00:00

Codestral 25.01

Brand new code-focused model from Mistral. Unlike the first Codestral this one isn't (yet) available as open weights. The model has a 256k token context - a new record for Mistral.

The new model scored an impressive joint first place with Claude 3.5 Sonnet and Deepseek V2.5 (FIM) on the Copilot Arena leaderboard.

Chatbot Arena announced Copilot Arena on 12th November 2024. The leaderboard is driven by results gathered through their Copilot Arena VS Code extensions, which provides users with free access to models in exchange for logged usage data plus their votes as to which of two models returns the most useful completion.

So far the only other independent benchmark result I've seen is for the Aider Polyglot test. This was less impressive:

Codestral 25.01 scored 11% on the aider polyglot benchmark.

62% o1 (high)
48% DeepSeek V3
16% Qwen 2.5 Coder 32B Instruct
11% Codestral 25.01
4% gpt-4o-mini

The new model can be accessed via my llm-mistral plugin using the codestral alias (which maps to codestral-latest on La Plateforme):

llm install llm-mistral
llm keys set mistral
# Paste Mistral API key here
llm -m codestral "JavaScript to reverse an array"

Via @sophiamyang

Tags: ai, generative-ai, llms, ai-assisted-programming, llm, mistral, aider, evals, chatbot-arena

deepseek-ai/DeepSeek-V3-Base

2024-12-25T19:00:33+00:00

deepseek-ai/DeepSeek-V3-Base

No model card or announcement yet, but this new model release from Chinese AI lab DeepSeek (an arm of Chinese hedge fund High-Flyer) looks very significant.

It's a huge model - 685B parameters, 687.9 GB on disk (TIL how to size a git-lfs repo). The architecture is a Mixture of Experts with 256 experts, using 8 per token.

For comparison, Meta AI's largest released model is their Llama 3.1 model with 405B parameters.

The new model is apparently available to some people via both chat.deepseek.com and the DeepSeek API as part of a staged rollout.

Paul Gauthier got API access and used it to update his new Aider Polyglot leaderboard - DeepSeek v3 preview scored 48.4%, putting it in second place behind o1-2024-12-17 (high) and in front of both claude-3-5-sonnet-20241022 and gemini-exp-1206!

I never know if I can believe models or not (the first time I asked "what model are you?" it claimed to be "based on OpenAI's GPT-4 architecture"), but I just got this result using LLM and the llm-deepseek plugin:

llm -m deepseek-chat 'what deepseek model are you?'

I'm DeepSeek-V3 created exclusively by DeepSeek. I'm an AI assistant, and I'm at your service! Feel free to ask me anything you'd like. I'll do my best to assist you.

Here's my initial experiment log.

Via @ivanfioravanti

Tags: ai, generative-ai, llms, hugging-face, aider, deepseek, paul-gauthier, llm-release, ai-in-china

Quantization matters

2024-11-23T18:39:23+00:00

Quantization matters

What impact does quantization have on the performance of an LLM? been wondering about this for quite a while, now here are numbers from Paul Gauthier.

He ran differently quantized versions of Qwen 2.5 32B Instruct through his Aider code editing benchmark and saw a range of scores.

The original released weights (BF16) scored highest at 71.4%, with Ollama's qwen2.5-coder:32b-instruct-fp16 (a 66GB download) achieving the same score.

The quantized Ollama qwen2.5-coder:32b-instruct-q4_K_M (a 20GB download) saw a massive drop in quality, scoring just 53.4% on the same benchmark.

Via Paul Gauthier

Tags: ai, generative-ai, local-llms, llms, aider, qwen, ollama, paul-gauthier, ai-in-china

LLMs are bad at returning code in JSON

2024-08-16T17:04:39+00:00

LLMs are bad at returning code in JSON

Paul Gauthier's Aider is a terminal-based coding assistant which works against multiple different models. As part of developing the project Paul runs extensive benchmarks, and his latest shows an interesting result: LLMs are slightly less reliable at producing working code if you request that code be returned as part of a JSON response.

The May release of GPT-4o is the closest to a perfect score - the August appears to have regressed slightly, and the new structured output mode doesn't help and could even make things worse (though that difference may not be statistically significant).

Paul recommends using Markdown delimiters here instead, which are less likely to introduce confusing nested quoting issues.

Via @paulgauthier

Tags: json, ai, prompt-engineering, generative-ai, llms, aider, evals, paul-gauthier

Aider

2024-07-31T03:26:17+00:00

Aider

Aider is an impressive open source local coding chat assistant terminal application, developed by Paul Gauthier (founding CTO of Inktomi back in 1996-2000).

I tried it out today, using an Anthropic API key to run it using Claude 3.5 Sonnet:

pipx install aider-chat
export ANTHROPIC_API_KEY=api-key-here
aider --dark-mode

I found the --dark-mode flag necessary to make it legible using the macOS terminal "Pro" theme.

Aider starts by generating a concise map of files in your current Git repository. This is passed to the LLM along with the prompts that you type, and Aider can then request additional files be added to that context - or you can add the manually with the /add filename command.

It defaults to making modifications to files and then committing them directly to Git with a generated commit message. I found myself preferring the /ask command which lets you ask a question without making any file modifications:

The Aider documentation includes extensive examples and the tool can work with a wide range of different LLMs, though it recommends GPT-4o, Claude 3.5 Sonnet (or 3 Opus) and DeepSeek Coder V2 for the best results. Aider maintains its own leaderboard, emphasizing that "Aider works best with LLMs which are good at editing code, not just good at writing code".

The prompts it uses are pretty fascinating - they're tucked away in various *_prompts.py files in aider/coders.

Tags: python, ai, generative-ai, llms, ai-assisted-programming, claude-3-5-sonnet, aider, paul-gauthier