<?xml version="1.0" encoding="utf-8"?>
<feed xml:lang="en-us" xmlns="http://www.w3.org/2005/Atom"><title>Simon Willison's Weblog: aider</title><link href="http://simonwillison.net/" rel="alternate"/><link href="http://simonwillison.net/tags/aider.atom" rel="self"/><id>http://simonwillison.net/</id><updated>2025-05-08T05:49:12+00:00</updated><author><name>Simon Willison</name></author><entry><title>llm-gemini 0.19.1</title><link href="https://simonwillison.net/2025/May/8/llm-gemini-0191/#atom-tag" rel="alternate"/><published>2025-05-08T05:49:12+00:00</published><updated>2025-05-08T05:49:12+00:00</updated><id>https://simonwillison.net/2025/May/8/llm-gemini-0191/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/simonw/llm-gemini/releases/tag/0.19.1"&gt;llm-gemini 0.19.1&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Bugfix release for my &lt;a href="https://github.com/simonw/llm-gemini"&gt;llm-gemini&lt;/a&gt; plugin, which was recording the number of output tokens (needed to calculate the price of a response) incorrectly for the Gemini "thinking" models. Those models turn out to return &lt;code&gt;candidatesTokenCount&lt;/code&gt; and &lt;code&gt;thoughtsTokenCount&lt;/code&gt; as two separate values which need to be added together to get the total billed output token count. Full details in &lt;a href="https://github.com/simonw/llm-gemini/issues/75"&gt;this issue&lt;/a&gt;.&lt;/p&gt;
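&lt;p&gt;A minimal sketch of the corrected accounting described above. It assumes a plain dictionary that carries the two counts under the field names mentioned in this post. The function names and the price used are hypothetical, for illustration only, and this is not the plugin's actual code:&lt;/p&gt;

```python
def billed_output_tokens(usage):
    """Return the output tokens billed for one response.

    `usage` is a hypothetical dict holding the two counts described above.
    A missing or empty thoughtsTokenCount is treated as 0 (an assumption).
    """
    candidates = usage.get("candidatesTokenCount") or 0
    thoughts = usage.get("thoughtsTokenCount") or 0
    return candidates + thoughts


def output_cost_cents(usage, dollars_per_million_output_tokens):
    """Convert billed output tokens to a cost in cents."""
    tokens = billed_output_tokens(usage)
    return tokens * dollars_per_million_output_tokens * 100 / 1_000_000


# Illustrative numbers only, not taken from any real response or price list.
example_usage = {"candidatesTokenCount": 500, "thoughtsTokenCount": 1500}
print(billed_output_tokens(example_usage))
print(output_cost_cents(example_usage, 10.0))
```

&lt;p&gt;Counting only &lt;code&gt;candidatesTokenCount&lt;/code&gt; in this made-up example would report a quarter of the billed output tokens, which is the kind of undercount the fix addresses.&lt;/p&gt;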
&lt;p&gt;I spotted this potential bug in &lt;a href="https://gist.github.com/simonw/87a59e7f5c12274d65e2ac053b0eacdb#token-usage"&gt;this response log&lt;/a&gt; this morning, and my concerns were confirmed when Paul Gauthier wrote about a similar fix in Aider in &lt;a href="https://aider.chat/2025/05/07/gemini-cost.html"&gt;Gemini 2.5 Pro Preview 03-25 benchmark cost&lt;/a&gt;, where he noted that the $6.32 cost recorded to benchmark Gemini 2.5 Pro Preview 03-25 was incorrect. Since that model is no longer available (despite &lt;a href="https://simonwillison.net/2025/May/6/gemini-25-pro-preview/"&gt;the date-based model alias persisting&lt;/a&gt;) Paul is not able to accurately calculate the new cost, but it's likely a lot more since the Gemini 2.5 Pro Preview 05-06 benchmark cost $37.&lt;/p&gt;
&lt;p&gt;I've gone through my &lt;a href="https://simonwillison.net/tags/gemini/"&gt;gemini tag&lt;/a&gt; and attempted to update my previous posts with new calculations - this mostly involved increases in the order of 12.336 cents to 16.316 cents (&lt;a href="https://simonwillison.net/2025/May/6/gemini-25-pro-preview/"&gt;as seen here&lt;/a&gt;).&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gemini"&gt;gemini&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/aider"&gt;aider&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-pricing"&gt;llm-pricing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/paul-gauthier"&gt;paul-gauthier&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="llm"/><category term="gemini"/><category term="aider"/><category term="llm-pricing"/><category term="paul-gauthier"/></entry><entry><title>Aider: Using uv as an installer</title><link href="https://simonwillison.net/2025/Mar/6/aider-using-uv-as-an-installer/#atom-tag" rel="alternate"/><published>2025-03-06T01:47:20+00:00</published><updated>2025-03-06T01:47:20+00:00</updated><id>https://simonwillison.net/2025/Mar/6/aider-using-uv-as-an-installer/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://aider.chat/2025/01/15/uv.html"&gt;Aider: Using uv as an installer&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Paul Gauthier has an innovative solution for the challenge of helping end users get a copy of his Aider CLI Python utility installed in an isolated virtual environment without first needing to teach them what an "isolated virtual environment" is.&lt;/p&gt;
&lt;p&gt;Provided you already have a Python install of version 3.8 or higher you can run this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;pip install aider-install &amp;amp;&amp;amp; aider-install
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;a href="https://pypi.org/project/aider-install/"&gt;aider-install&lt;/a&gt; package itself depends on &lt;a href="https://github.com/astral-sh/uv"&gt;uv&lt;/a&gt;. When you run &lt;code&gt;aider-install&lt;/code&gt; it executes the following &lt;a href="https://github.com/Aider-AI/aider-install/blob/main/aider_install/main.py"&gt;Python code&lt;/a&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;span class="pl-k"&gt;def&lt;/span&gt; &lt;span class="pl-en"&gt;install_aider&lt;/span&gt;():
    &lt;span class="pl-k"&gt;try&lt;/span&gt;:
        &lt;span class="pl-s1"&gt;uv_bin&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;uv&lt;/span&gt;.&lt;span class="pl-c1"&gt;find_uv_bin&lt;/span&gt;()
        &lt;span class="pl-s1"&gt;subprocess&lt;/span&gt;.&lt;span class="pl-c1"&gt;check_call&lt;/span&gt;([
            &lt;span class="pl-s1"&gt;uv_bin&lt;/span&gt;, &lt;span class="pl-s"&gt;"tool"&lt;/span&gt;, &lt;span class="pl-s"&gt;"install"&lt;/span&gt;, &lt;span class="pl-s"&gt;"--force"&lt;/span&gt;, &lt;span class="pl-s"&gt;"--python"&lt;/span&gt;, &lt;span class="pl-s"&gt;"python3.12"&lt;/span&gt;, &lt;span class="pl-s"&gt;"aider-chat@latest"&lt;/span&gt;
        ])
        &lt;span class="pl-s1"&gt;subprocess&lt;/span&gt;.&lt;span class="pl-c1"&gt;check_call&lt;/span&gt;([&lt;span class="pl-s1"&gt;uv_bin&lt;/span&gt;, &lt;span class="pl-s"&gt;"tool"&lt;/span&gt;, &lt;span class="pl-s"&gt;"update-shell"&lt;/span&gt;])
    &lt;span class="pl-k"&gt;except&lt;/span&gt; &lt;span class="pl-s1"&gt;subprocess&lt;/span&gt;.&lt;span class="pl-c1"&gt;CalledProcessError&lt;/span&gt; &lt;span class="pl-k"&gt;as&lt;/span&gt; &lt;span class="pl-s1"&gt;e&lt;/span&gt;:
        &lt;span class="pl-en"&gt;print&lt;/span&gt;(&lt;span class="pl-s"&gt;f"Failed to install aider: &lt;span class="pl-s1"&gt;&lt;span class="pl-kos"&gt;{&lt;/span&gt;&lt;span class="pl-s1"&gt;e&lt;/span&gt;&lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;/span&gt;"&lt;/span&gt;)
        &lt;span class="pl-s1"&gt;sys&lt;/span&gt;.&lt;span class="pl-c1"&gt;exit&lt;/span&gt;(&lt;span class="pl-c1"&gt;1&lt;/span&gt;)&lt;/pre&gt;

&lt;p&gt;This first figures out the location of the &lt;code&gt;uv&lt;/code&gt; Rust binary, then uses it to install his &lt;a href="https://pypi.org/project/aider-chat/"&gt;aider-chat&lt;/a&gt; package by running the equivalent of this command:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;uv tool install --force --python python3.12 aider-chat@latest
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This will in turn install a brand new standalone copy of Python 3.12 and tuck it away in uv's own managed directory structure where it shouldn't hurt anything else.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;aider-chat&lt;/code&gt; script defaults to being dropped in the XDG standard directory, which is probably &lt;code&gt;~/.local/bin&lt;/code&gt; - see &lt;a href="https://docs.astral.sh/uv/concepts/tools/#the-bin-directory"&gt;uv's documentation&lt;/a&gt;. The &lt;a href="https://docs.astral.sh/uv/concepts/tools/#overwriting-executables"&gt;--force flag&lt;/a&gt; ensures that &lt;code&gt;uv&lt;/code&gt; will overwrite any previous attempts at installing &lt;code&gt;aider-chat&lt;/code&gt; in that location with the new one.&lt;/p&gt;
&lt;p&gt;Finally, running &lt;code&gt;uv tool update-shell&lt;/code&gt; ensures that bin directory is &lt;a href="https://docs.astral.sh/uv/concepts/tools/#the-path"&gt;on the user's PATH&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I &lt;em&gt;think&lt;/em&gt; I like this. There is a LOT of stuff going on here, and experienced users may well opt for an &lt;a href="https://aider.chat/docs/install.html"&gt;alternative installation mechanism&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;But for non-expert Python users who just want to start using Aider, I think this pattern represents quite a tasteful way of getting everything working with minimal risk of breaking the user's system.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Update&lt;/strong&gt;: Paul &lt;a href="https://twitter.com/paulgauthier/status/1897486573857595877"&gt;adds&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Offering this install method dramatically reduced the number of GitHub issues from users with conflicted/broken python environments.&lt;/p&gt;
&lt;p&gt;I also really like the "curl | sh" aider installer based on uv. Even users who don't have python installed can use it.&lt;/p&gt;
&lt;/blockquote&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/cli"&gt;cli&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/aider"&gt;aider&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/uv"&gt;uv&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/paul-gauthier"&gt;paul-gauthier&lt;/a&gt;&lt;/p&gt;



</summary><category term="cli"/><category term="python"/><category term="aider"/><category term="uv"/><category term="paul-gauthier"/></entry><entry><title>Aider Polyglot leaderboard results for Claude 3.7 Sonnet</title><link href="https://simonwillison.net/2025/Feb/25/aider-polyglot-leaderboard/#atom-tag" rel="alternate"/><published>2025-02-25T00:56:03+00:00</published><updated>2025-02-25T00:56:03+00:00</updated><id>https://simonwillison.net/2025/Feb/25/aider-polyglot-leaderboard/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://aider.chat/docs/leaderboards/#polyglot-leaderboard"&gt;Aider Polyglot leaderboard results for Claude 3.7 Sonnet&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Paul Gauthier's &lt;a href="https://aider.chat/2024/12/21/polyglot.html#the-polyglot-benchmark"&gt;Aider Polyglot benchmark&lt;/a&gt; is one of my favourite independent benchmarks for LLMs, partly because it focuses on code and partly because Paul is very responsive at evaluating new models.&lt;/p&gt;
&lt;p&gt;The brand new Claude 3.7 Sonnet just took the top place, when run with an increased 32,000 thinking token limit.&lt;/p&gt;
&lt;p&gt;It's interesting comparing the benchmark costs - 3.7 Sonnet spent $36.83 running the whole thing, significantly more than the previously leading DeepSeek R1 + Claude 3.5 combo, but a whole lot less than third place o1-high:&lt;/p&gt;
&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;Model&lt;/th&gt;
      &lt;th&gt;% completed&lt;/th&gt;
      &lt;th&gt;Total cost&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;claude-3-7-sonnet-20250219 (32k thinking tokens)&lt;/td&gt;
      &lt;td&gt;64.9%&lt;/td&gt;
      &lt;td&gt;$36.83&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;DeepSeek R1 + claude-3-5-sonnet-20241022&lt;/td&gt;
      &lt;td&gt;64.0%&lt;/td&gt;
      &lt;td&gt;$13.29&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;o1-2024-12-17 (high)&lt;/td&gt;
      &lt;td&gt;61.7%&lt;/td&gt;
      &lt;td&gt;$186.5&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;claude-3-7-sonnet-20250219 (no thinking)&lt;/td&gt;
      &lt;td&gt;60.4%&lt;/td&gt;
      &lt;td&gt;$17.72&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;o3-mini (high)&lt;/td&gt;
      &lt;td&gt;60.4%&lt;/td&gt;
      &lt;td&gt;$18.16&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
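&lt;p&gt;To put those costs side by side, here is a small sketch that computes the ratios between the top three runs. The dollar figures are copied from the table above; the variable names are just for illustration:&lt;/p&gt;

```python
# Total benchmark costs in US dollars, copied from the table above.
costs = {
    "claude-3-7-sonnet-20250219 (32k thinking tokens)": 36.83,
    "DeepSeek R1 + claude-3-5-sonnet-20241022": 13.29,
    "o1-2024-12-17 (high)": 186.5,
}

sonnet = costs["claude-3-7-sonnet-20250219 (32k thinking tokens)"]
r1_combo = costs["DeepSeek R1 + claude-3-5-sonnet-20241022"]
o1_high = costs["o1-2024-12-17 (high)"]

# How many times more expensive each run was than the cheaper one.
sonnet_vs_r1 = sonnet / r1_combo
o1_vs_sonnet = o1_high / sonnet

print(f"3.7 Sonnet (thinking) cost {sonnet_vs_r1:.1f}x the R1 + 3.5 Sonnet run")
print(f"o1 (high) cost {o1_vs_sonnet:.1f}x the 3.7 Sonnet (thinking) run")
```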

&lt;p&gt;No results yet for Claude 3.7 Sonnet on the &lt;a href="https://lmarena.ai/"&gt;LM Arena leaderboard&lt;/a&gt;, which has recently been dominated by Gemini 2.0 and Grok 3.&lt;/p&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/paulgauthier/status/1894167915869737058"&gt;@paulgauthier&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/anthropic"&gt;anthropic&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude"&gt;claude&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/aider"&gt;aider&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/evals"&gt;evals&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/paul-gauthier"&gt;paul-gauthier&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="anthropic"/><category term="claude"/><category term="aider"/><category term="evals"/><category term="paul-gauthier"/></entry><entry><title>Quoting Paul Gauthier</title><link href="https://simonwillison.net/2025/Jan/26/paul-gauthier/#atom-tag" rel="alternate"/><published>2025-01-26T21:59:49+00:00</published><updated>2025-01-26T21:59:49+00:00</updated><id>https://simonwillison.net/2025/Jan/26/paul-gauthier/#atom-tag</id><summary type="html">
    &lt;blockquote cite="https://news.ycombinator.com/item?id=42831769#42834527"&gt;&lt;p&gt;In my experience with AI coding, very large context windows aren't useful in practice. Every model seems to get confused when you feed them more than ~25-30k tokens. The models stop obeying their system prompts, can't correctly find/transcribe pieces of code in the context, etc.&lt;/p&gt;
&lt;p&gt;Developing aider, I've seen this problem with gpt-4o, Sonnet, DeepSeek, etc. Many aider users report this too. It's perhaps the #1 problem users have, so I created a &lt;a href="https://aider.chat/docs/troubleshooting/edit-errors.html#dont-add-too-many-files"&gt;dedicated help page&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Very large context may be useful for certain tasks with lots of "low value" context. But for coding, it seems to lure users into a problematic regime.&lt;/p&gt;&lt;/blockquote&gt;
&lt;p class="cite"&gt;&amp;mdash; &lt;a href="https://news.ycombinator.com/item?id=42831769#42834527"&gt;Paul Gauthier&lt;/a&gt;&lt;/p&gt;

    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-programming"&gt;ai-assisted-programming&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/aider"&gt;aider&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/long-context"&gt;long-context&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/paul-gauthier"&gt;paul-gauthier&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="ai-assisted-programming"/><category term="aider"/><category term="long-context"/><category term="paul-gauthier"/></entry><entry><title>Codestral 25.01</title><link href="https://simonwillison.net/2025/Jan/13/codestral-2501/#atom-tag" rel="alternate"/><published>2025-01-13T21:33:37+00:00</published><updated>2025-01-13T21:33:37+00:00</updated><id>https://simonwillison.net/2025/Jan/13/codestral-2501/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://mistral.ai/news/codestral-2501/"&gt;Codestral 25.01&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Brand new code-focused model from Mistral. Unlike &lt;a href="https://simonwillison.net/2024/May/30/codestral/"&gt;the first Codestral&lt;/a&gt; this one isn't (&lt;a href="https://twitter.com/sophiamyang/status/1878908474811404664"&gt;yet&lt;/a&gt;) available as open weights. The model has a 256k token context - a new record for Mistral.&lt;/p&gt;
&lt;p&gt;The new model scored an impressive joint first place with Claude 3.5 Sonnet and DeepSeek V2.5 (FIM) on the Copilot Arena leaderboard.&lt;/p&gt;
&lt;p&gt;Chatbot Arena &lt;a href="https://blog.lmarena.ai/blog/2024/copilot-arena/"&gt;announced Copilot Arena&lt;/a&gt; on 12th November 2024. The leaderboard is driven by results gathered through their &lt;a href="https://lmarena.ai/copilot"&gt;Copilot Arena&lt;/a&gt; VS Code extension, which provides users with free access to models in exchange for logged usage data plus their votes as to which of two models returns the most useful completion.&lt;/p&gt;
&lt;p&gt;So far the only other independent benchmark result I've seen is for the &lt;a href="https://aider.chat/docs/leaderboards/"&gt;Aider Polyglot test&lt;/a&gt;. This was &lt;a href="https://twitter.com/paulgauthier/status/1878886495609815054"&gt;less impressive&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Codestral 25.01 scored 11% on the aider polyglot benchmark. &lt;/p&gt;
&lt;p&gt;62% o1 (high)&lt;br&gt;
48% DeepSeek V3&lt;br&gt;
16% Qwen 2.5 Coder 32B Instruct&lt;br&gt;
11% Codestral 25.01&lt;br&gt;
4% gpt-4o-mini&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The new model can be accessed via my &lt;a href="https://github.com/simonw/llm-mistral"&gt;llm-mistral&lt;/a&gt; plugin using the &lt;code&gt;codestral&lt;/code&gt; alias (which maps to &lt;code&gt;codestral-latest&lt;/code&gt; on &lt;a href="https://docs.mistral.ai/getting-started/models/models_overview/"&gt;La Plateforme&lt;/a&gt;):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm install llm-mistral
llm keys set mistral
# Paste Mistral API key here
llm -m codestral "JavaScript to reverse an array"
&lt;/code&gt;&lt;/pre&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/sophiamyang/status/1878902888434479204"&gt;@sophiamyang&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-programming"&gt;ai-assisted-programming&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/mistral"&gt;mistral&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/aider"&gt;aider&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/evals"&gt;evals&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/chatbot-arena"&gt;chatbot-arena&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="ai-assisted-programming"/><category term="llm"/><category term="mistral"/><category term="aider"/><category term="evals"/><category term="chatbot-arena"/></entry><entry><title>deepseek-ai/DeepSeek-V3-Base</title><link href="https://simonwillison.net/2024/Dec/25/deepseek-v3/#atom-tag" rel="alternate"/><published>2024-12-25T19:00:33+00:00</published><updated>2024-12-25T19:00:33+00:00</updated><id>https://simonwillison.net/2024/Dec/25/deepseek-v3/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://huggingface.co/deepseek-ai/DeepSeek-V3-Base"&gt;deepseek-ai/DeepSeek-V3-Base&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
No model card or announcement yet, but this new model release from Chinese AI lab DeepSeek (an arm of Chinese hedge fund &lt;a href="https://en.wikipedia.org/wiki/High-Flyer_(company)"&gt;High-Flyer&lt;/a&gt;) looks very significant.&lt;/p&gt;
&lt;p&gt;It's a huge model - 685B parameters, 687.9 GB on disk (&lt;a href="https://til.simonwillison.net/git/size-of-lfs-files"&gt;TIL how to size a git-lfs repo&lt;/a&gt;). The architecture is &lt;a href="https://twitter.com/dysondunbar/status/1871955700949430299"&gt;a Mixture of Experts&lt;/a&gt; with 256 experts, using 8 per token.&lt;/p&gt;
&lt;p&gt;For comparison, Meta AI's largest released model is their &lt;a href="https://ai.meta.com/blog/meta-llama-3-1/"&gt;Llama 3.1 model&lt;/a&gt; with 405B parameters.&lt;/p&gt;
&lt;p&gt;The new model is apparently available to some people via both &lt;a href="https://chat.deepseek.com/"&gt;chat.deepseek.com&lt;/a&gt; and the DeepSeek API as part of a staged rollout.&lt;/p&gt;
&lt;p&gt;Paul Gauthier got API access and &lt;a href="https://twitter.com/paulgauthier/status/1871919612000092632"&gt;used it&lt;/a&gt; to update his new &lt;a href="https://aider.chat/docs/leaderboards/"&gt;Aider Polyglot leaderboard&lt;/a&gt; - DeepSeek v3 preview scored 48.4%, putting it in second place behind &lt;code&gt;o1-2024-12-17 (high)&lt;/code&gt; and in front of both &lt;code&gt;claude-3-5-sonnet-20241022&lt;/code&gt; and &lt;code&gt;gemini-exp-1206&lt;/code&gt;!&lt;/p&gt;
&lt;p&gt;&lt;img alt="Aider leaderboard chart showing DeepSeek Chat V3 preview in second place" src="https://static.simonwillison.net/static/2024/deepseek-v3.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;I never know if I can believe models or not (the first time I asked "what model are you?" it claimed to be "based on OpenAI's GPT-4 architecture"), but I just got this result using &lt;a href="https://llm.datasette.io/"&gt;LLM&lt;/a&gt; and the &lt;a href="https://pypi.org/project/llm-deepseek/"&gt;llm-deepseek&lt;/a&gt; plugin:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm -m deepseek-chat 'what deepseek model are you?'
&lt;/code&gt;&lt;/pre&gt;
&lt;blockquote&gt;
&lt;p&gt;I'm DeepSeek-V3 created exclusively by DeepSeek. I'm an AI assistant, and I'm at your service! Feel free to ask me anything you'd like. I'll do my best to assist you.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Here's my &lt;a href="https://gist.github.com/simonw/e7528dc52828fb31415f6e14e3527b93"&gt;initial experiment log&lt;/a&gt;.&lt;/p&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/ivanfioravanti/status/1871945175616135298"&gt;@ivanfioravanti&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/hugging-face"&gt;hugging-face&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/aider"&gt;aider&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/deepseek"&gt;deepseek&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/paul-gauthier"&gt;paul-gauthier&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-in-china"&gt;ai-in-china&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="hugging-face"/><category term="aider"/><category term="deepseek"/><category term="paul-gauthier"/><category term="llm-release"/><category term="ai-in-china"/></entry><entry><title>Quantization matters</title><link href="https://simonwillison.net/2024/Nov/23/quantization-matters/#atom-tag" rel="alternate"/><published>2024-11-23T18:39:23+00:00</published><updated>2024-11-23T18:39:23+00:00</updated><id>https://simonwillison.net/2024/Nov/23/quantization-matters/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://aider.chat/2024/11/21/quantization.html"&gt;Quantization matters&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;What impact does quantization have on the performance of an LLM? I've been wondering about this for quite a while, and now here are numbers from Paul Gauthier.&lt;/p&gt;
&lt;p&gt;He ran differently quantized versions of Qwen 2.5 Coder 32B Instruct through his &lt;a href="https://aider.chat/docs/benchmarks.html#the-benchmark"&gt;Aider code editing benchmark&lt;/a&gt; and saw a range of scores.&lt;/p&gt;
&lt;p&gt;The &lt;a href="https://huggingface.co/Qwen/Qwen2.5-Coder-32B-Instruct"&gt;original released weights&lt;/a&gt; (BF16) scored highest at 71.4%, with Ollama's &lt;a href="https://ollama.com/library/qwen2.5-coder:32b-instruct-fp16"&gt;qwen2.5-coder:32b-instruct-fp16&lt;/a&gt; (a 66GB download) achieving the same score.&lt;/p&gt;
&lt;p&gt;The quantized Ollama &lt;a href="https://ollama.com/library/qwen2.5-coder:32b-instruct-q4_K_M"&gt;qwen2.5-coder:32b-instruct-q4_K_M&lt;/a&gt; (a 20GB download) saw a massive drop in quality, scoring just 53.4% on the same benchmark.&lt;/p&gt;
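&lt;p&gt;A quick sketch of the size of that drop, using only the two scores quoted above (the variable names are illustrative):&lt;/p&gt;

```python
# Benchmark scores (percent) quoted above.
full_precision_score = 71.4  # original BF16 weights and Ollama fp16
q4_k_m_score = 53.4          # Ollama q4_K_M quantization

# Difference in percentage points, and as a share of the original score.
absolute_drop = full_precision_score - q4_k_m_score
relative_drop = absolute_drop / full_precision_score

print(f"Absolute drop: {absolute_drop:.1f} percentage points")
print(f"Relative drop: {relative_drop:.1%}")
```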

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/paulgauthier/status/1859684310204473349"&gt;Paul Gauthier&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/aider"&gt;aider&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/qwen"&gt;qwen&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ollama"&gt;ollama&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/paul-gauthier"&gt;paul-gauthier&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-in-china"&gt;ai-in-china&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="generative-ai"/><category term="local-llms"/><category term="llms"/><category term="aider"/><category term="qwen"/><category term="ollama"/><category term="paul-gauthier"/><category term="ai-in-china"/></entry><entry><title>LLMs are bad at returning code in JSON</title><link href="https://simonwillison.net/2024/Aug/16/llms-are-bad-at-returning-code-in-json/#atom-tag" rel="alternate"/><published>2024-08-16T17:04:39+00:00</published><updated>2024-08-16T17:04:39+00:00</updated><id>https://simonwillison.net/2024/Aug/16/llms-are-bad-at-returning-code-in-json/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://aider.chat/2024/08/14/code-in-json.html"&gt;LLMs are bad at returning code in JSON&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Paul Gauthier's &lt;a href="https://aider.chat/"&gt;Aider&lt;/a&gt; is a terminal-based coding assistant which works against multiple different models. As part of developing the project Paul runs extensive benchmarks, and his latest shows an interesting result: LLMs are slightly less reliable at producing working code if you request that code be returned as part of a JSON response.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Coding skill by model and code wrapping strategy - four models, each showing their pass rate % average of five runs. Claude 3.5 Sonnet gets 60.5% with Markdown, 54.1% with JSON. DeepSeek-Coder V2 0724 gets 60.6% with Markdown, 51.1% with JSON. GPT-4o-2024-05-13 gets 60.0% with Markdown, 59.6% with JSON. GPT-4o-2024-08-06 gets 60.8% with Markdown, 57.6% with JSON, and 56.9% with JSON (strict). Markdown consistently performs better than JSON across all models." src="https://static.simonwillison.net/static/2024/llm-code-json.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;The May release of GPT-4o is the closest to a perfect score - the August release appears to have regressed slightly, and the new structured output mode doesn't help and could even make things worse (though that difference may not be statistically significant).&lt;/p&gt;
&lt;p&gt;Paul recommends using Markdown delimiters here instead, which are less likely to introduce confusing nested quoting issues.&lt;/p&gt;
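&lt;p&gt;To illustrate the nested quoting issue, here is a small sketch comparing the same snippet wrapped in JSON and in a Markdown fence. The &lt;code&gt;code&lt;/code&gt; key and the sample function are hypothetical and are not taken from Aider's prompts or benchmark:&lt;/p&gt;

```python
import json

# A tiny snippet that contains both double quotes and newlines.
code = 'def greet(name):\n    print(f"Hello, {name}!")\n'

# Wrapped in JSON: every double quote and newline has to be escaped.
# The "code" key is a hypothetical response shape used for illustration.
as_json = json.dumps({"code": code})

# Wrapped in a Markdown fence: the code appears verbatim.
fence = "`" * 3
as_markdown = f"{fence}python\n{code}{fence}"

print(as_json)
print(as_markdown)

# Count the backslash escapes each format required.
print("escapes in JSON:", as_json.count("\\"))
print("escapes in Markdown:", as_markdown.count("\\"))
```

&lt;p&gt;The JSON form forces the model to emit an escape for every quote and line break inside the code, while the fenced form lets it write the code exactly as it would appear in a file.&lt;/p&gt;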

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/paulgauthier/status/1824442504290374061"&gt;@paulgauthier&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/json"&gt;json&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/prompt-engineering"&gt;prompt-engineering&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/aider"&gt;aider&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/evals"&gt;evals&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/paul-gauthier"&gt;paul-gauthier&lt;/a&gt;&lt;/p&gt;



</summary><category term="json"/><category term="ai"/><category term="prompt-engineering"/><category term="generative-ai"/><category term="llms"/><category term="aider"/><category term="evals"/><category term="paul-gauthier"/></entry><entry><title>Aider</title><link href="https://simonwillison.net/2024/Jul/31/aider/#atom-tag" rel="alternate"/><published>2024-07-31T03:26:17+00:00</published><updated>2024-07-31T03:26:17+00:00</updated><id>https://simonwillison.net/2024/Jul/31/aider/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/paul-gauthier/aider"&gt;Aider&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Aider is an impressive open source local coding chat assistant terminal application, developed by Paul Gauthier (founding CTO of &lt;a href="https://en.wikipedia.org/wiki/Inktomi"&gt;Inktomi&lt;/a&gt; back in &lt;a href="https://www.linkedin.com/in/paulgauthier/details/experience/"&gt;1996-2000&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;I tried it out today, using an Anthropic API key to run it using Claude 3.5 Sonnet:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;pipx install aider-chat
export ANTHROPIC_API_KEY=api-key-here
aider --dark-mode
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I found the &lt;code&gt;--dark-mode&lt;/code&gt; flag necessary to make it legible using the macOS terminal "Pro" theme.&lt;/p&gt;
&lt;p&gt;Aider starts by generating a &lt;a href="https://aider.chat/docs/repomap.html"&gt;concise map of files&lt;/a&gt; in your current Git repository. This is passed to the LLM along with the prompts that you type, and Aider can then request additional files be added to that context - or you can add them manually with the &lt;code&gt;/add filename&lt;/code&gt; command.&lt;/p&gt;
&lt;p&gt;It defaults to making modifications to files and then committing them directly to Git with a generated commit message. I found myself preferring the &lt;code&gt;/ask&lt;/code&gt; command which lets you ask a question without making any file modifications:&lt;/p&gt;
&lt;p&gt;&lt;a href="https://static.simonwillison.net/static/2024/aider.jpg"&gt;&lt;img alt="Loaded /Users/simon/Dropbox/Development/simonwillisonblog/.env Aider v0.46.1 Models: claude-3-5-sonnet-20240620 with diff edit format, weak model claude-3-haiku-20240307 Git repo: .git with 156 files Repo-map: using 1024 tokens Use /help &amp;lt;question&amp;gt; for help, run &amp;quot;aider --help&amp;quot; to see cmd line args  &amp;gt; /add blog/search.py  Added /Users/simon/Dropbox/Development/simonwillisonblog/blog/search.py to the chat  blog/search.py   &amp;gt; /ask what does ?exclude.tag= do in the search feature?         The ?exclude.tag= parameter in the search feature is used to exclude specific tags from the search results. Here's how it      works in the code:        1 The excluded tags are retrieved from the request's GET parameters:   excluded_tags = request.GET.getlist(&amp;quot;exclude.tag&amp;quot;)    This line collects all the tags that should be excluded from the search results. The getlist() method is used because there    might be multiple excluded tags. 2 These excluded tags are then applied to the queryset for each content type (Entry, Blogmark, Quotation) in the   make_queryset() function:       for exclude_tag in excluded_tags:  qs = qs.exclude(tags__tag=exclude_tag)     This loop iterates through each excluded tag and uses Django's exclude() method to filter out any items that have the   specified tag.    So, if a user includes ?exclude.tag=python&amp;amp;exclude.tag=django in their search query, the results will not include any items    tagged with either &amp;quot;python&amp;quot; or &amp;quot;django&amp;quot;. This feature allows users to refine their search by explicitly removing certain topics from the results, which can be   particularly useful when searching through a large number of tagged items.  Tokens: 3,884 sent, 248 received. Cost: $0.02 request, $0.02 session. 
blog/search.py" src="https://static.simonwillison.net/static/2024/aider.jpg" /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;The Aider documentation includes &lt;a href="https://aider.chat/examples/README.html"&gt;extensive examples&lt;/a&gt; and the tool can work with a wide range of &lt;a href="https://aider.chat/docs/llms.html"&gt;different LLMs&lt;/a&gt;, though it recommends GPT-4o, Claude 3.5 Sonnet (or 3 Opus) and DeepSeek Coder V2 for the best results. Aider maintains &lt;a href="https://aider.chat/docs/leaderboards/"&gt;its own leaderboard&lt;/a&gt;, emphasizing that "Aider works best with LLMs which are good at &lt;em&gt;editing&lt;/em&gt; code, not just good at writing code".&lt;/p&gt;
&lt;p&gt;The prompts it uses are pretty fascinating - they're tucked away in various &lt;code&gt;*_prompts.py&lt;/code&gt; files in &lt;a href="https://github.com/paul-gauthier/aider/tree/main/aider/coders"&gt;aider/coders&lt;/a&gt;.&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-programming"&gt;ai-assisted-programming&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude-3-5-sonnet"&gt;claude-3-5-sonnet&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/aider"&gt;aider&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/paul-gauthier"&gt;paul-gauthier&lt;/a&gt;&lt;/p&gt;



</summary><category term="python"/><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="ai-assisted-programming"/><category term="claude-3-5-sonnet"/><category term="aider"/><category term="paul-gauthier"/></entry></feed>