<?xml version="1.0" encoding="utf-8"?>
<feed xml:lang="en-us" xmlns="http://www.w3.org/2005/Atom"><title>Simon Willison's Weblog: paul-gauthier</title><link href="http://simonwillison.net/" rel="alternate"/><link href="http://simonwillison.net/tags/paul-gauthier.atom" rel="self"/><id>http://simonwillison.net/</id><updated>2025-05-08T05:49:12+00:00</updated><author><name>Simon Willison</name></author><entry><title>llm-gemini 0.19.1</title><link href="https://simonwillison.net/2025/May/8/llm-gemini-0191/#atom-tag" rel="alternate"/><published>2025-05-08T05:49:12+00:00</published><updated>2025-05-08T05:49:12+00:00</updated><id>https://simonwillison.net/2025/May/8/llm-gemini-0191/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/simonw/llm-gemini/releases/tag/0.19.1"&gt;llm-gemini 0.19.1&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Bugfix release for my &lt;a href="https://github.com/simonw/llm-gemini"&gt;llm-gemini&lt;/a&gt; plugin, which was recording the number of output tokens (needed to calculate the price of a response) incorrectly for the Gemini "thinking" models. Those models turn out to return &lt;code&gt;candidatesTokenCount&lt;/code&gt; and &lt;code&gt;thoughtsTokenCount&lt;/code&gt; as two separate values which need to be added together to get the total billed output token count. Full details in &lt;a href="https://github.com/simonw/llm-gemini/issues/75"&gt;this issue&lt;/a&gt;.&lt;/p&gt;
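The fix boils down to a one-line sum. Here's a minimal sketch, assuming a `usageMetadata` dictionary with the two field names described above - the helper itself is illustrative, not the plugin's actual code:

```python
def billed_output_tokens(usage_metadata: dict) -> int:
    # Gemini "thinking" models report visible output tokens and
    # reasoning tokens separately; both count toward billed output.
    candidates = usage_metadata.get("candidatesTokenCount") or 0
    thoughts = usage_metadata.get("thoughtsTokenCount") or 0
    return candidates + thoughts
```

Counting only `candidatesTokenCount` - the previous behavior - undercounts badly for thinking models, where the reasoning tokens can dominate the total.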
&lt;p&gt;I spotted this potential bug in &lt;a href="https://gist.github.com/simonw/87a59e7f5c12274d65e2ac053b0eacdb#token-usage"&gt;this response log&lt;/a&gt; this morning, and my concerns were confirmed when Paul Gauthier wrote about a similar fix in Aider in &lt;a href="https://aider.chat/2025/05/07/gemini-cost.html"&gt;Gemini 2.5 Pro Preview 03-25 benchmark cost&lt;/a&gt;, where he noted that the $6.32 cost recorded to benchmark Gemini 2.5 Pro Preview 03-25 was incorrect. Since that model is no longer available (despite &lt;a href="https://simonwillison.net/2025/May/6/gemini-25-pro-preview/"&gt;the date-based model alias persisting&lt;/a&gt;) Paul is not able to accurately calculate the new cost, but it's likely a lot more since the Gemini 2.5 Pro Preview 05-06 benchmark cost $37.&lt;/p&gt;
&lt;p&gt;I've gone through my &lt;a href="https://simonwillison.net/tags/gemini/"&gt;gemini tag&lt;/a&gt; and attempted to update my previous posts with new calculations - this mostly involved increases on the order of 12.336 cents to 16.316 cents (&lt;a href="https://simonwillison.net/2025/May/6/gemini-25-pro-preview/"&gt;as seen here&lt;/a&gt;).&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gemini"&gt;gemini&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/aider"&gt;aider&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-pricing"&gt;llm-pricing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/paul-gauthier"&gt;paul-gauthier&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="llm"/><category term="gemini"/><category term="aider"/><category term="llm-pricing"/><category term="paul-gauthier"/></entry><entry><title>Aider: Using uv as an installer</title><link href="https://simonwillison.net/2025/Mar/6/aider-using-uv-as-an-installer/#atom-tag" rel="alternate"/><published>2025-03-06T01:47:20+00:00</published><updated>2025-03-06T01:47:20+00:00</updated><id>https://simonwillison.net/2025/Mar/6/aider-using-uv-as-an-installer/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://aider.chat/2025/01/15/uv.html"&gt;Aider: Using uv as an installer&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Paul Gauthier has an innovative solution for the challenge of helping end users get a copy of his Aider CLI Python utility installed in an isolated virtual environment without first needing to teach them what an "isolated virtual environment" is.&lt;/p&gt;
&lt;p&gt;Provided you already have a Python install of version 3.8 or higher you can run this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;pip install aider-install &amp;amp;&amp;amp; aider-install
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;a href="https://pypi.org/project/aider-install/"&gt;aider-install&lt;/a&gt; package itself depends on &lt;a href="https://github.com/astral-sh/uv"&gt;uv&lt;/a&gt;. When you run &lt;code&gt;aider-install&lt;/code&gt; it executes the following &lt;a href="https://github.com/Aider-AI/aider-install/blob/main/aider_install/main.py"&gt;Python code&lt;/a&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;span class="pl-k"&gt;def&lt;/span&gt; &lt;span class="pl-en"&gt;install_aider&lt;/span&gt;():
    &lt;span class="pl-k"&gt;try&lt;/span&gt;:
        &lt;span class="pl-s1"&gt;uv_bin&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;uv&lt;/span&gt;.&lt;span class="pl-c1"&gt;find_uv_bin&lt;/span&gt;()
        &lt;span class="pl-s1"&gt;subprocess&lt;/span&gt;.&lt;span class="pl-c1"&gt;check_call&lt;/span&gt;([
            &lt;span class="pl-s1"&gt;uv_bin&lt;/span&gt;, &lt;span class="pl-s"&gt;"tool"&lt;/span&gt;, &lt;span class="pl-s"&gt;"install"&lt;/span&gt;, &lt;span class="pl-s"&gt;"--force"&lt;/span&gt;, &lt;span class="pl-s"&gt;"--python"&lt;/span&gt;, &lt;span class="pl-s"&gt;"python3.12"&lt;/span&gt;, &lt;span class="pl-s"&gt;"aider-chat@latest"&lt;/span&gt;
        ])
        &lt;span class="pl-s1"&gt;subprocess&lt;/span&gt;.&lt;span class="pl-c1"&gt;check_call&lt;/span&gt;([&lt;span class="pl-s1"&gt;uv_bin&lt;/span&gt;, &lt;span class="pl-s"&gt;"tool"&lt;/span&gt;, &lt;span class="pl-s"&gt;"update-shell"&lt;/span&gt;])
    &lt;span class="pl-k"&gt;except&lt;/span&gt; &lt;span class="pl-s1"&gt;subprocess&lt;/span&gt;.&lt;span class="pl-c1"&gt;CalledProcessError&lt;/span&gt; &lt;span class="pl-k"&gt;as&lt;/span&gt; &lt;span class="pl-s1"&gt;e&lt;/span&gt;:
        &lt;span class="pl-en"&gt;print&lt;/span&gt;(&lt;span class="pl-s"&gt;f"Failed to install aider: &lt;span class="pl-s1"&gt;&lt;span class="pl-kos"&gt;{&lt;/span&gt;&lt;span class="pl-s1"&gt;e&lt;/span&gt;&lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;/span&gt;"&lt;/span&gt;)
        &lt;span class="pl-s1"&gt;sys&lt;/span&gt;.&lt;span class="pl-c1"&gt;exit&lt;/span&gt;(&lt;span class="pl-c1"&gt;1&lt;/span&gt;)&lt;/pre&gt;

&lt;p&gt;This first figures out the location of the &lt;code&gt;uv&lt;/code&gt; Rust binary, then uses it to install his &lt;a href="https://pypi.org/project/aider-chat/"&gt;aider-chat&lt;/a&gt; package by running the equivalent of this command:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;uv tool install --force --python python3.12 aider-chat@latest
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This will in turn install a brand new standalone copy of Python 3.12 and tuck it away in uv's own managed directory structure where it shouldn't hurt anything else.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;aider-chat&lt;/code&gt; script defaults to being dropped in the XDG standard directory, which is probably &lt;code&gt;~/.local/bin&lt;/code&gt; - see &lt;a href="https://docs.astral.sh/uv/concepts/tools/#the-bin-directory"&gt;uv's documentation&lt;/a&gt;. The &lt;a href="https://docs.astral.sh/uv/concepts/tools/#overwriting-executables"&gt;--force flag&lt;/a&gt; ensures that &lt;code&gt;uv&lt;/code&gt; will overwrite any previous attempts at installing &lt;code&gt;aider-chat&lt;/code&gt; in that location with the new one.&lt;/p&gt;
&lt;p&gt;Finally, running &lt;code&gt;uv tool update-shell&lt;/code&gt; ensures that bin directory is &lt;a href="https://docs.astral.sh/uv/concepts/tools/#the-path"&gt;on the user's PATH&lt;/a&gt;.&lt;/p&gt;
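The effect of that last step can be approximated with a quick check - an illustrative sketch (the `bin_on_path` helper is hypothetical, not part of uv):

```python
import os
from pathlib import Path


def bin_on_path(bin_dir: str, path_env: str) -> bool:
    # True if bin_dir appears as an entry in a PATH-style string
    target = Path(bin_dir).expanduser()
    return any(
        Path(entry).expanduser() == target
        for entry in path_env.split(os.pathsep)
        if entry
    )
```

If `~/.local/bin` is already on the PATH, `uv tool update-shell` has nothing to do; otherwise it appends the export to the user's shell configuration.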
&lt;p&gt;I &lt;em&gt;think&lt;/em&gt; I like this. There is a LOT of stuff going on here, and experienced users may well opt for an &lt;a href="https://aider.chat/docs/install.html"&gt;alternative installation mechanism&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;But for non-expert Python users who just want to start using Aider, I think this pattern represents quite a tasteful way of getting everything working with minimal risk of breaking the user's system.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Update&lt;/strong&gt;: Paul &lt;a href="https://twitter.com/paulgauthier/status/1897486573857595877"&gt;adds&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Offering this install method dramatically reduced the number of GitHub issues from users with conflicted/broken python environments.&lt;/p&gt;
&lt;p&gt;I also really like the "curl | sh" aider installer based on uv. Even users who don't have python installed can use it.&lt;/p&gt;
&lt;/blockquote&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/cli"&gt;cli&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/aider"&gt;aider&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/uv"&gt;uv&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/paul-gauthier"&gt;paul-gauthier&lt;/a&gt;&lt;/p&gt;



</summary><category term="cli"/><category term="python"/><category term="aider"/><category term="uv"/><category term="paul-gauthier"/></entry><entry><title>Initial impressions of GPT-4.5</title><link href="https://simonwillison.net/2025/Feb/27/introducing-gpt-45/#atom-tag" rel="alternate"/><published>2025-02-27T22:02:59+00:00</published><updated>2025-02-27T22:02:59+00:00</updated><id>https://simonwillison.net/2025/Feb/27/introducing-gpt-45/#atom-tag</id><summary type="html">
    &lt;p&gt;GPT-4.5 &lt;a href="https://openai.com/index/introducing-gpt-4-5/"&gt;is out today&lt;/a&gt; as a "research preview" - it's available to OpenAI Pro ($200/month) customers and to developers with an API key. OpenAI also published &lt;a href="https://openai.com/index/gpt-4-5-system-card/"&gt;a GPT-4.5 system card&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I've &lt;a href="https://github.com/simonw/llm/issues/795"&gt;added it to LLM&lt;/a&gt; so you can run it like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm -m gpt-4.5-preview 'impress me'&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;It's &lt;em&gt;very&lt;/em&gt; expensive right now: &lt;a href="https://openai.com/api/pricing/"&gt;currently&lt;/a&gt; $75.00 per million input tokens and $150/million for output! For comparison, o1 is $15/$60 and GPT-4o is $2.50/$10. GPT-4o mini is $0.15/$0.60 making OpenAI's least expensive model 500x cheaper than GPT-4.5 for input and 250x cheaper for output!&lt;/p&gt;
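Those multiples check out if you run the arithmetic on the quoted per-million-token prices (figures as stated above):

```python
# USD per million tokens (input, output), as quoted at launch
prices = {
    "gpt-4.5-preview": (75.00, 150.00),
    "o1": (15.00, 60.00),
    "gpt-4o": (2.50, 10.00),
    "gpt-4o-mini": (0.15, 0.60),
}

# How many times cheaper GPT-4o mini is than GPT-4.5
input_ratio = prices["gpt-4.5-preview"][0] / prices["gpt-4o-mini"][0]
output_ratio = prices["gpt-4.5-preview"][1] / prices["gpt-4o-mini"][1]
```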
&lt;p&gt;As far as I can tell almost all of its key characteristics are the same as GPT-4o: it has the same 128,000 context length, handles the same inputs (text and image) and even has the same training cut-off date of October 2023.&lt;/p&gt;
&lt;p&gt;So what's it better at? According to OpenAI's blog post:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Combining deep understanding of the world with improved collaboration results in a model that integrates ideas naturally in warm and intuitive conversations that are more attuned to human collaboration. GPT‑4.5 has a better understanding of what humans mean and interprets subtle cues or implicit expectations with greater nuance and “EQ”. GPT‑4.5 also shows stronger aesthetic intuition and creativity. It excels at helping with writing and design.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;They include this chart of win-rates against GPT-4o, where it wins between 56.8% and 63.2% of the time for different classes of query:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/gpt-win-rates.jpg" alt="Bar chart showing GPT-4.5 win-rate vs GPT-4o across three categories: Everyday queries (57.0%), Professional queries (63.2%), and Creative intelligence (56.8%)." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;They also report a SimpleQA hallucination rate of 37.1% - a big improvement on GPT-4o (61.8%) and o3-mini (80.3%) but not much better than o1 (44%). The coding benchmarks all appear to score similarly to o3-mini.&lt;/p&gt;
&lt;p&gt;Paul Gauthier &lt;a href="https://twitter.com/paulgauthier/status/1895221869844013108"&gt;reports&lt;/a&gt; a score of 45% on Aider's &lt;a href="https://aider.chat/docs/leaderboards/"&gt;polyglot coding benchmark&lt;/a&gt; - below DeepSeek V3 (48%), Sonnet 3.7 (60% without thinking, 65% with thinking) and o3-mini (60.4%) but significantly ahead of GPT-4o (23.1%).&lt;/p&gt;
&lt;p id="confidence"&gt;OpenAI don't seem to have enormous confidence in the model themselves:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;GPT‑4.5 is a very large and compute-intensive model, making it more &lt;a href="https://openai.com/api/pricing/"&gt;expensive⁠&lt;/a&gt; than and not a replacement for GPT‑4o. Because of this, we're evaluating whether to continue serving it in the API long-term as we balance supporting current capabilities with building future models.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;It drew me this for "Generate an SVG of a pelican riding a bicycle":&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/pelican-gpt45.jpg" alt="A pretty simple pelican, not as good as other leading models" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;Accessed via the API the model feels weirdly slow - here's an animation showing how that pelican was rendered - the full response &lt;a href="https://gist.github.com/simonw/90834e1ca91e3f802d80f67bac94ad7d#file-pelican-json-L41"&gt;took 112 seconds&lt;/a&gt;!&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/gpt-45.gif" alt="Animated terminal session - the tokens are coming back very slowly" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;OpenAI's Rapha Gontijo Lopes &lt;a href="https://twitter.com/rapha_gl/status/1895213014699385082"&gt;calls this&lt;/a&gt; "(probably) the largest model in the world" - evidently the problem with large models is that they are a whole lot slower than their smaller alternatives!&lt;/p&gt;
&lt;p&gt;Andrej Karpathy &lt;a href="https://x.com/karpathy/status/1895213020982472863"&gt;has published some notes&lt;/a&gt; on the new model, where he highlights that the improvements are limited considering the 10x increase in training compute over GPT-4:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;I remember being a part of a hackathon trying to find concrete prompts where GPT4 outperformed 3.5. They definitely existed, but clear and concrete "slam dunk" examples were difficult to find. [...] So it is with that expectation that I went into testing GPT4.5, which I had access to for a few days, and which saw 10X more pretraining compute than GPT4. And I feel like, once again, I'm in the same hackathon 2 years ago. Everything is a little bit better and it's awesome, but also not exactly in ways that are trivial to point to.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Andrej is also running a fun &lt;a href="https://twitter.com/karpathy/status/1895213023238987854"&gt;vibes-based polling evaluation&lt;/a&gt; comparing output from GPT-4.5 and GPT-4o. &lt;strong&gt;Update:&lt;/strong&gt; &lt;a href="https://twitter.com/karpathy/status/1895337579589079434"&gt;GPT-4o won 4/5 rounds&lt;/a&gt;!&lt;/p&gt;
&lt;p&gt;There's an &lt;a href="https://news.ycombinator.com/item?id=43197872"&gt;extensive thread&lt;/a&gt; about GPT-4.5 on Hacker News. When it hit 324 comments I ran a summary of it using GPT-4.5 itself with &lt;a href="https://til.simonwillison.net/llms/claude-hacker-news-themes"&gt;this script&lt;/a&gt;:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;hn-summary.sh 43197872 -m gpt-4.5-preview&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Here's &lt;a href="https://gist.github.com/simonw/5e9f5e94ac8840f698c280293d39965e"&gt;the result&lt;/a&gt;, which took 154 seconds to generate and cost $2.11 (25,797 input tokens and 1,225 output tokens, price calculated using my &lt;a href="https://tools.simonwillison.net/llm-prices"&gt;LLM pricing calculator&lt;/a&gt;).&lt;/p&gt;
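That figure can be reproduced from the token counts - a quick sketch of the same arithmetic the pricing calculator performs:

```python
def cost_usd(input_tokens, output_tokens, input_per_m, output_per_m):
    # Prices are quoted in USD per million tokens
    return (input_tokens / 1_000_000) * input_per_m + (
        output_tokens / 1_000_000
    ) * output_per_m


# GPT-4.5 preview: $75/M input, $150/M output
summary_cost = cost_usd(25_797, 1_225, 75.00, 150.00)
```

That works out to roughly $2.12 - within a cent of the reported figure, the difference being down to rounding.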
&lt;p&gt;For comparison, I ran the same prompt against &lt;a href="https://gist.github.com/simonw/592d651ec61daec66435a6f718c0618b"&gt;GPT-4o&lt;/a&gt;, &lt;a href="https://gist.github.com/simonw/cc760217623769f0d7e4687332bce409"&gt;GPT-4o Mini&lt;/a&gt;, &lt;a href="https://gist.github.com/simonw/6f11e1974e4d613258b3237380e0ecb3"&gt;Claude 3.7 Sonnet&lt;/a&gt;, &lt;a href="https://gist.github.com/simonw/c178f02c97961e225eb615d4b9a1dea3"&gt;Claude 3.5 Haiku&lt;/a&gt;, &lt;a href="https://gist.github.com/simonw/0c6f071d9ad1cea493de4e5e7a0986bb"&gt;Gemini 2.0 Flash&lt;/a&gt;, &lt;a href="https://gist.github.com/simonw/8a71396a4a219d8281e294b61a9d6dd5"&gt;Gemini 2.0 Flash Lite&lt;/a&gt; and &lt;a href="https://gist.github.com/simonw/112e3f4660a1a410151e86ec677e34ab"&gt;Gemini 2.0 Pro&lt;/a&gt;.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openai"&gt;openai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/andrej-karpathy"&gt;andrej-karpathy&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/evals"&gt;evals&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/uv"&gt;uv&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/paul-gauthier"&gt;paul-gauthier&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="ai"/><category term="openai"/><category term="andrej-karpathy"/><category term="generative-ai"/><category term="llms"/><category term="evals"/><category term="uv"/><category term="pelican-riding-a-bicycle"/><category term="paul-gauthier"/><category term="llm-release"/></entry><entry><title>Aider Polyglot leaderboard results for Claude 3.7 Sonnet</title><link href="https://simonwillison.net/2025/Feb/25/aider-polyglot-leaderboard/#atom-tag" rel="alternate"/><published>2025-02-25T00:56:03+00:00</published><updated>2025-02-25T00:56:03+00:00</updated><id>https://simonwillison.net/2025/Feb/25/aider-polyglot-leaderboard/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://aider.chat/docs/leaderboards/#polyglot-leaderboard"&gt;Aider Polyglot leaderboard results for Claude 3.7 Sonnet&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Paul Gauthier's &lt;a href="https://aider.chat/2024/12/21/polyglot.html#the-polyglot-benchmark"&gt;Aider Polyglot benchmark&lt;/a&gt; is one of my favourite independent benchmarks for LLMs, partly because it focuses on code and partly because Paul is very responsive at evaluating new models.&lt;/p&gt;
&lt;p&gt;The brand new Claude 3.7 Sonnet just took the top place, when run with an increased 32,000 thinking token limit.&lt;/p&gt;
&lt;p&gt;It's interesting comparing the benchmark costs - 3.7 Sonnet spent $36.83 running the whole thing, significantly more than the previously leading DeepSeek R1 + Claude 3.5 combo, but a whole lot less than third place o1-high:&lt;/p&gt;
&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;Model&lt;/th&gt;
      &lt;th&gt;% completed&lt;/th&gt;
      &lt;th&gt;Total cost&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;claude-3-7-sonnet-20250219 (32k thinking tokens)&lt;/td&gt;
      &lt;td&gt;64.9%&lt;/td&gt;
      &lt;td&gt;$36.83&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;DeepSeek R1 + claude-3-5-sonnet-20241022&lt;/td&gt;
      &lt;td&gt;64.0%&lt;/td&gt;
      &lt;td&gt;$13.29&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;o1-2024-12-17 (high)&lt;/td&gt;
      &lt;td&gt;61.7%&lt;/td&gt;
      &lt;td&gt;$186.50&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;claude-3-7-sonnet-20250219 (no thinking)&lt;/td&gt;
      &lt;td&gt;60.4%&lt;/td&gt;
      &lt;td&gt;$17.72&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;o3-mini (high)&lt;/td&gt;
      &lt;td&gt;60.4%&lt;/td&gt;
      &lt;td&gt;$18.16&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
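To put the table in perspective, the relative costs are easy to compute (figures taken from the table above):

```python
# Total benchmark cost in USD, from the leaderboard table
costs = {
    "claude-3-7-sonnet (32k thinking)": 36.83,
    "deepseek-r1 + claude-3-5-sonnet": 13.29,
    "o1-high": 186.50,
}

# o1-high cost roughly five times as much as the new leader
o1_vs_sonnet = costs["o1-high"] / costs["claude-3-7-sonnet (32k thinking)"]
```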

&lt;p&gt;No results yet for Claude 3.7 Sonnet on the &lt;a href="https://lmarena.ai/"&gt;LM Arena leaderboard&lt;/a&gt;, which has recently been dominated by Gemini 2.0 and Grok 3.&lt;/p&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/paulgauthier/status/1894167915869737058"&gt;@paulgauthier&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/anthropic"&gt;anthropic&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude"&gt;claude&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/aider"&gt;aider&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/evals"&gt;evals&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/paul-gauthier"&gt;paul-gauthier&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="anthropic"/><category term="claude"/><category term="aider"/><category term="evals"/><category term="paul-gauthier"/></entry><entry><title>Quoting Paul Gauthier</title><link href="https://simonwillison.net/2025/Jan/26/paul-gauthier/#atom-tag" rel="alternate"/><published>2025-01-26T21:59:49+00:00</published><updated>2025-01-26T21:59:49+00:00</updated><id>https://simonwillison.net/2025/Jan/26/paul-gauthier/#atom-tag</id><summary type="html">
    &lt;blockquote cite="https://news.ycombinator.com/item?id=42831769#42834527"&gt;&lt;p&gt;In my experience with AI coding, very large context windows aren't useful in practice. Every model seems to get confused when you feed them more than ~25-30k tokens. The models stop obeying their system prompts, can't correctly find/transcribe pieces of code in the context, etc.&lt;/p&gt;
&lt;p&gt;Developing aider, I've seen this problem with gpt-4o, Sonnet, DeepSeek, etc. Many aider users report this too. It's perhaps the #1 problem users have, so I created a &lt;a href="https://aider.chat/docs/troubleshooting/edit-errors.html#dont-add-too-many-files"&gt;dedicated help page&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Very large context may be useful for certain tasks with lots of "low value" context. But for coding, it seems to lure users into a problematic regime.&lt;/p&gt;&lt;/blockquote&gt;
&lt;p class="cite"&gt;&amp;mdash; &lt;a href="https://news.ycombinator.com/item?id=42831769#42834527"&gt;Paul Gauthier&lt;/a&gt;&lt;/p&gt;

    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-programming"&gt;ai-assisted-programming&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/aider"&gt;aider&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/long-context"&gt;long-context&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/paul-gauthier"&gt;paul-gauthier&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="ai-assisted-programming"/><category term="aider"/><category term="long-context"/><category term="paul-gauthier"/></entry><entry><title>deepseek-ai/DeepSeek-V3-Base</title><link href="https://simonwillison.net/2024/Dec/25/deepseek-v3/#atom-tag" rel="alternate"/><published>2024-12-25T19:00:33+00:00</published><updated>2024-12-25T19:00:33+00:00</updated><id>https://simonwillison.net/2024/Dec/25/deepseek-v3/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://huggingface.co/deepseek-ai/DeepSeek-V3-Base"&gt;deepseek-ai/DeepSeek-V3-Base&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
No model card or announcement yet, but this new model release from Chinese AI lab DeepSeek (an arm of Chinese hedge fund &lt;a href="https://en.wikipedia.org/wiki/High-Flyer_(company)"&gt;High-Flyer&lt;/a&gt;) looks very significant.&lt;/p&gt;
&lt;p&gt;It's a huge model - 685B parameters, 687.9 GB on disk (&lt;a href="https://til.simonwillison.net/git/size-of-lfs-files"&gt;TIL how to size a git-lfs repo&lt;/a&gt;). The architecture is &lt;a href="https://twitter.com/dysondunbar/status/1871955700949430299"&gt;a Mixture of Experts&lt;/a&gt; with 256 experts, using 8 per token.&lt;/p&gt;
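A Mixture of Experts layer routes each token through only a small subset of expert networks, which is how a 685B parameter model keeps inference costs down. Here's a toy illustration of the top-8-of-256 routing pattern described above, with random scores standing in for a trained router - purely illustrative, pure stdlib:

```python
import math
import random

random.seed(0)
n_experts, k = 256, 8  # DeepSeek-V3-style counts

# A trained router would score experts from the token's hidden state;
# random scores stand in for that here.
scores = [random.gauss(0, 1) for _ in range(n_experts)]

# Select the k highest-scoring experts for this token
top_k = sorted(range(n_experts), key=scores.__getitem__)[-k:]

# Softmax over just the selected experts gives the mixing weights;
# the token's output is the weighted sum of those experts' outputs.
m = max(scores[i] for i in top_k)
weights = [math.exp(scores[i] - m) for i in top_k]
total = sum(weights)
weights = [w / total for w in weights]
```

Only the 8 selected experts run for each token, so the active parameter count per token is a small fraction of the full 685B.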
&lt;p&gt;For comparison, Meta AI's largest released model is their &lt;a href="https://ai.meta.com/blog/meta-llama-3-1/"&gt;Llama 3.1 model&lt;/a&gt; with 405B parameters.&lt;/p&gt;
&lt;p&gt;The new model is apparently available to some people via both &lt;a href="https://chat.deepseek.com/"&gt;chat.deepseek.com&lt;/a&gt; and the DeepSeek API as part of a staged rollout.&lt;/p&gt;
&lt;p&gt;Paul Gauthier got API access and &lt;a href="https://twitter.com/paulgauthier/status/1871919612000092632"&gt;used it&lt;/a&gt; to update his new &lt;a href="https://aider.chat/docs/leaderboards/"&gt;Aider Polyglot leaderboard&lt;/a&gt; - DeepSeek v3 preview scored 48.4%, putting it in second place behind &lt;code&gt;o1-2024-12-17 (high)&lt;/code&gt; and in front of both &lt;code&gt;claude-3-5-sonnet-20241022&lt;/code&gt; and &lt;code&gt;gemini-exp-1206&lt;/code&gt;!&lt;/p&gt;
&lt;p&gt;&lt;img alt="Aider leaderboard chart showing DeepSeek Chat V3 preview in second place" src="https://static.simonwillison.net/static/2024/deepseek-v3.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;I never know if I can believe models or not (the first time I asked "what model are you?" it claimed to be "based on OpenAI's GPT-4 architecture"), but I just got this result using &lt;a href="https://llm.datasette.io/"&gt;LLM&lt;/a&gt; and the &lt;a href="https://pypi.org/project/llm-deepseek/"&gt;llm-deepseek&lt;/a&gt; plugin:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm -m deepseek-chat 'what deepseek model are you?'
&lt;/code&gt;&lt;/pre&gt;
&lt;blockquote&gt;
&lt;p&gt;I'm DeepSeek-V3 created exclusively by DeepSeek. I'm an AI assistant, and I'm at your service! Feel free to ask me anything you'd like. I'll do my best to assist you.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Here's my &lt;a href="https://gist.github.com/simonw/e7528dc52828fb31415f6e14e3527b93"&gt;initial experiment log&lt;/a&gt;.&lt;/p&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/ivanfioravanti/status/1871945175616135298"&gt;@ivanfioravanti&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/hugging-face"&gt;hugging-face&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/aider"&gt;aider&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/deepseek"&gt;deepseek&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/paul-gauthier"&gt;paul-gauthier&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-in-china"&gt;ai-in-china&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="hugging-face"/><category term="aider"/><category term="deepseek"/><category term="paul-gauthier"/><category term="llm-release"/><category term="ai-in-china"/></entry><entry><title>Quantization matters</title><link href="https://simonwillison.net/2024/Nov/23/quantization-matters/#atom-tag" rel="alternate"/><published>2024-11-23T18:39:23+00:00</published><updated>2024-11-23T18:39:23+00:00</updated><id>https://simonwillison.net/2024/Nov/23/quantization-matters/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://aider.chat/2024/11/21/quantization.html"&gt;Quantization matters&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
What impact does quantization have on the performance of an LLM? I've been wondering about that for quite a while, and now here are some concrete numbers from Paul Gauthier.&lt;/p&gt;
&lt;p&gt;He ran differently quantized versions of Qwen 2.5 32B Instruct through his &lt;a href="https://aider.chat/docs/benchmarks.html#the-benchmark"&gt;Aider code editing benchmark&lt;/a&gt; and saw a range of scores.&lt;/p&gt;
&lt;p&gt;The &lt;a href="https://huggingface.co/Qwen/Qwen2.5-Coder-32B-Instruct"&gt;original released weights&lt;/a&gt; (BF16) scored highest at 71.4%, with Ollama's &lt;a href="https://ollama.com/library/qwen2.5-coder:32b-instruct-fp16"&gt;qwen2.5-coder:32b-instruct-fp16&lt;/a&gt; (a 66GB download) achieving the same score.&lt;/p&gt;
&lt;p&gt;The quantized Ollama &lt;a href="https://ollama.com/library/qwen2.5-coder:32b-instruct-q4_K_M"&gt;qwen2.5-coder:32b-instruct-q4_K_M&lt;/a&gt; (a 20GB download) saw a massive drop in quality, scoring just 53.4% on the same benchmark.&lt;/p&gt;
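The trade-off is stark when laid out numerically (scores and download sizes from the post):

```python
variants = {
    # name: (benchmark score %, download size GB)
    "bf16": (71.4, 66),
    "q4_K_M": (53.4, 20),
}

# Q4_K_M saves 46GB of disk but gives up 18 percentage points
score_drop = variants["bf16"][0] - variants["q4_K_M"][0]
size_saving = variants["bf16"][1] - variants["q4_K_M"][1]
```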

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/paulgauthier/status/1859684310204473349"&gt;Paul Gauthier&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/aider"&gt;aider&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/qwen"&gt;qwen&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ollama"&gt;ollama&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/paul-gauthier"&gt;paul-gauthier&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-in-china"&gt;ai-in-china&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="generative-ai"/><category term="local-llms"/><category term="llms"/><category term="aider"/><category term="qwen"/><category term="ollama"/><category term="paul-gauthier"/><category term="ai-in-china"/></entry><entry><title>Qwen2.5-Coder-32B is an LLM that can code well that runs on my Mac</title><link href="https://simonwillison.net/2024/Nov/12/qwen25-coder/#atom-tag" rel="alternate"/><published>2024-11-12T23:37:36+00:00</published><updated>2024-11-12T23:37:36+00:00</updated><id>https://simonwillison.net/2024/Nov/12/qwen25-coder/#atom-tag</id><summary type="html">
    &lt;p&gt;There's a whole lot of buzz around the new &lt;a href="https://qwenlm.github.io/blog/qwen2.5-coder-family/"&gt;Qwen2.5-Coder Series&lt;/a&gt; of open source (Apache 2.0 licensed) LLM releases from Alibaba's Qwen research team. On first impression it looks like the buzz is well deserved.&lt;/p&gt;
&lt;p&gt;Qwen claim:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Qwen2.5-Coder-32B-Instruct has become the current SOTA open-source code model, matching the coding capabilities of GPT-4o.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;That's a &lt;em&gt;big&lt;/em&gt; claim for a 32B model that's small enough that it can run on my 64GB MacBook Pro M2. The Qwen published scores look impressive, comparing favorably with GPT-4o and Claude 3.5 Sonnet (October 2024) edition across various code-related benchmarks:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2024/qwen-scores.jpg" alt="In benchmark comparisons, Qwen 2.5 Coder (32B Instruct) outperforms both GPT-4o and Claude 3.5 Sonnet on LiveCodeBench, Spider, and BIRD-SQL metrics, falls behind on MBPP, Aider, and CodeArena, shows mixed results on MultiPL-E, and performs similarly on HumanEval and McEval benchmarks." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;How about benchmarks from other researchers? Paul Gauthier's &lt;a href="https://aider.chat/docs/leaderboards/"&gt;Aider benchmarks&lt;/a&gt; have a great reputation and &lt;a href="https://twitter.com/paulgauthier/status/1856018124031832236"&gt;Paul reports&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The new Qwen 2.5 Coder models did very well on aider's code editing benchmark. The 32B Instruct model scored in between GPT-4o and 3.5 Haiku.&lt;/p&gt;
&lt;p&gt;84% 3.5 Sonnet,
75% 3.5 Haiku,
74% Qwen2.5 Coder 32B,
71% GPT-4o,
69% Qwen2.5 Coder 14B,
58% Qwen2.5 Coder 7B&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2024/qwen-paul.jpg" alt="Those numbers as a chart" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;That was for the Aider "whole edit" benchmark. The "diff" benchmark &lt;a href="https://twitter.com/paulgauthier/status/1856042640279777420"&gt;scores well&lt;/a&gt; too, with Qwen2.5 Coder 32B tying with GPT-4o (but a little behind Claude 3.5 Haiku).&lt;/p&gt;
&lt;p&gt;Given these scores (and the &lt;a href="https://www.reddit.com/r/LocalLLaMA/comments/1gp84in/qwen25coder_32b_the_ai_thats_revolutionizing/"&gt;positive buzz on Reddit&lt;/a&gt;) I had to try it for myself.&lt;/p&gt;
&lt;p&gt;My attempts to run the &lt;a href="https://huggingface.co/Qwen/Qwen2.5-Coder-32B-Instruct-GGUF"&gt;Qwen/Qwen2.5-Coder-32B-Instruct-GGUF&lt;/a&gt; Q8 using &lt;a href="https://github.com/simonw/llm-gguf"&gt;llm-gguf&lt;/a&gt; were a bit too slow, because I don't currently have that library compiled to use my Mac's GPU.&lt;/p&gt;
&lt;p&gt;But both the &lt;a href="https://ollama.com/"&gt;Ollama&lt;/a&gt; version &lt;em&gt;and&lt;/em&gt; the &lt;a href="https://github.com/ml-explore/mlx"&gt;MLX&lt;/a&gt; version worked great!&lt;/p&gt;
&lt;p&gt;I installed the Ollama version using:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;ollama pull qwen2.5-coder:32b
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;That fetched a 20GB quantized file. I ran a prompt through that using my &lt;a href="https://llm.datasette.io/"&gt;LLM&lt;/a&gt; tool and Sergey Alexandrov's &lt;a href="https://github.com/taketwo/llm-ollama"&gt;llm-ollama&lt;/a&gt; plugin like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm install llm-ollama
llm models # Confirming the new model is present
llm -m qwen2.5-coder:32b 'python function that takes URL to a CSV file and path to a SQLite database, fetches the CSV with the standard library, creates a table with the right columns and inserts the data'
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here's &lt;a href="https://gist.github.com/simonw/0a47f9e35a50d4e25a47826f4ab75dda"&gt;the result&lt;/a&gt;. The code worked, but I had to work around a frustrating &lt;code&gt;ssl&lt;/code&gt; bug first (which wouldn't have been an issue if I'd allowed the model to use &lt;code&gt;requests&lt;/code&gt; or &lt;code&gt;httpx&lt;/code&gt; instead of the standard library).&lt;/p&gt;
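For context, here's a rough sketch of the kind of function that prompt asks for — this is my own code to illustrate the task, not the model's actual output (the gist has that):

```python
# Fetch a CSV over HTTP with only the standard library and load it
# into a SQLite table -- a hand-written sketch of the prompted task.
import csv
import io
import sqlite3
import urllib.request


def csv_to_sqlite(url, db_path, table="data"):
    # Fetch the CSV using only the standard library (no requests/httpx)
    with urllib.request.urlopen(url) as response:
        text = response.read().decode("utf-8")
    rows = list(csv.reader(io.StringIO(text)))
    headers, body = rows[0], rows[1:]
    conn = sqlite3.connect(db_path)
    # Quote column names so arbitrary CSV headers are safe to use
    cols = ", ".join('"{}"'.format(h) for h in headers)
    conn.execute('CREATE TABLE IF NOT EXISTS "{}" ({})'.format(table, cols))
    placeholders = ", ".join("?" for _ in headers)
    conn.executemany(
        'INSERT INTO "{}" VALUES ({})'.format(table, placeholders), body
    )
    conn.commit()
    return conn
```

Note that `urllib.request` is exactly where the model's stdlib-only version hit the `ssl` trouble mentioned above.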
&lt;p&gt;I also tried running it via MLX, Apple's fast array framework for Apple Silicon, using the &lt;a href="https://github.com/ml-explore/mlx-lm"&gt;mlx-lm&lt;/a&gt; library directly, run via &lt;a href="https://github.com/astral-sh/uv"&gt;uv&lt;/a&gt; like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;uv run --with mlx-lm \
  mlx_lm.generate \
  --model mlx-community/Qwen2.5-Coder-32B-Instruct-8bit \
  --max-tokens 4000 \
  --prompt 'write me a python function that renders a mandelbrot fractal as wide as the current terminal'
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;That gave me a &lt;em&gt;very&lt;/em&gt; &lt;a href="https://gist.github.com/simonw/1cc1e0418a04dbd19cd281cf9b43666f"&gt;satisfying result&lt;/a&gt; - when I ran the code it generated in a terminal I got this:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2024/mlx-fractal.jpg" alt="macOS terminal window displaying a pleasing mandelbrot fractal as ASCII art" style="max-width: 100%;" /&gt;&lt;/p&gt;
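To give a sense of what that prompt demands, here's a hand-written sketch of a terminal-width mandelbrot renderer — my own code, not the model's output (which is in the gist):

```python
# Render the mandelbrot set as ASCII art sized to the terminal,
# using a simple escape-time coloring.
import shutil


def mandelbrot(width=None, height=24, max_iter=30):
    if width is None:
        # Use the current terminal width, falling back to 80 columns
        width = shutil.get_terminal_size((80, 24)).columns
    chars = " .:-=+*#%@"  # escape-time palette, densest character last
    lines = []
    for row in range(height):
        # Map each character cell to a point in the complex plane
        y = -1.2 + 2.4 * row / max(height - 1, 1)
        line = ""
        for col in range(width):
            x = -2.2 + 3.2 * col / max(width - 1, 1)
            z, c = 0j, complex(x, y)
            n = max_iter
            for i in range(max_iter):
                z = z * z + c
                if abs(z) > 2:  # diverged: record the escape time
                    n = i
                    break
            line += chars[n * (len(chars) - 1) // max_iter]
        lines.append(line)
    return "\n".join(lines)


if __name__ == "__main__":
    print(mandelbrot())
```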

&lt;p&gt;MLX reported the following performance metrics:&lt;/p&gt;
&lt;pre&gt;Prompt: 49 tokens, 95.691 tokens-per-sec
Generation: 723 tokens, 10.016 tokens-per-sec
Peak memory: 32.685 GB&lt;/pre&gt;

&lt;p&gt;Let's see how it does on the &lt;a href="https://simonwillison.net/2024/Oct/25/pelicans-on-a-bicycle/"&gt;Pelican on a bicycle benchmark&lt;/a&gt;.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;llm -m qwen2.5-coder:32b 'Generate an SVG of a pelican riding a bicycle'&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Here's &lt;a href="https://gist.github.com/simonw/56217af454695a90be2c8e09c703198a"&gt;what I got&lt;/a&gt;:&lt;/p&gt;

&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2024/qwen-pelican.svg" alt="A jumble of shapes. The pelican has a yellow body, a black head and a weird proboscis kind of thing. The bicycle is several brown overlapping shapes that looks a bit like a tractor." /&gt;&lt;/p&gt;

&lt;p&gt;Questionable Pelican SVG drawings aside, this is a really promising development. 32GB is just small enough that I can run the model on my Mac without having to quit every other application I'm running, and both the speed and the quality of the results feel genuinely competitive with the current best of the hosted models.&lt;/p&gt;

&lt;p&gt;Given that code assistance is probably around 80% of my LLM usage at the moment this is a meaningfully useful release for how I engage with this class of technology.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/mandelbrot"&gt;mandelbrot&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/open-source"&gt;open-source&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-programming"&gt;ai-assisted-programming&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/uv"&gt;uv&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/qwen"&gt;qwen&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/mlx"&gt;mlx&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ollama"&gt;ollama&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/paul-gauthier"&gt;paul-gauthier&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-in-china"&gt;ai-in-china&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="mandelbrot"/><category term="open-source"/><category term="ai"/><category term="generative-ai"/><category term="local-llms"/><category term="llms"/><category term="ai-assisted-programming"/><category term="llm"/><category term="uv"/><category term="qwen"/><category term="mlx"/><category term="ollama"/><category term="pelican-riding-a-bicycle"/><category term="paul-gauthier"/><category term="llm-release"/><category term="ai-in-china"/></entry><entry><title>LLMs are bad at returning code in JSON</title><link href="https://simonwillison.net/2024/Aug/16/llms-are-bad-at-returning-code-in-json/#atom-tag" rel="alternate"/><published>2024-08-16T17:04:39+00:00</published><updated>2024-08-16T17:04:39+00:00</updated><id>https://simonwillison.net/2024/Aug/16/llms-are-bad-at-returning-code-in-json/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://aider.chat/2024/08/14/code-in-json.html"&gt;LLMs are bad at returning code in JSON&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Paul Gauthier's &lt;a href="https://aider.chat/"&gt;Aider&lt;/a&gt; is a terminal-based coding assistant which works against multiple different models. As part of developing the project Paul runs extensive benchmarks, and his latest shows an interesting result: LLMs are slightly less reliable at producing working code if you request that code be returned as part of a JSON response.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Coding skill by model and code wrapping strategy - four models, each showing their pass rate % average of five runs. Claude 3.5 Sonnet gets 60.5% with Markdown, 54.1% with JSON. DeepSeek-Coder V2 0724 gets 60.6% with Markdown, 51.1% with JSON. GPT-4o-2024-05-13 gets 60.0% with Markdown, 59.6% with JSON. GPT-4o-2024-08-06 gets 60.8% with Markdown, 57.6% with JSON, and 56.9% with JSON (strict). Markdown consistently performs better than JSON across all models." src="https://static.simonwillison.net/static/2024/llm-code-json.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;The May release of GPT-4o is the closest to a perfect score - the August release appears to have regressed slightly, and the new structured output mode doesn't help and could even make things worse (though that difference may not be statistically significant).&lt;/p&gt;
&lt;p&gt;Paul recommends using Markdown delimiters here instead, which are less likely to introduce confusing nested quoting issues.&lt;/p&gt;
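A quick illustration of that nesting problem (my own example, not taken from Paul's benchmark): the same snippet is emitted verbatim inside a Markdown fence, but inside a JSON string every quote and backslash must itself be escaped.

```python
# Compare the literal characters a model must produce for the same
# code snippet when wrapped in Markdown versus embedded in JSON.
import json

code = 'print("hello\\nworld")'

# Markdown: the code appears character-for-character inside the fence
markdown = "```python\n" + code + "\n```"

# JSON: quotes and backslashes in the code must be escaped again,
# which is exactly where models tend to slip up
as_json = json.dumps({"code": code})
```

Running this, `as_json` contains doubled backslashes and `\"` escapes that the Markdown version never needs.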

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/paulgauthier/status/1824442504290374061"&gt;@paulgauthier&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/json"&gt;json&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/prompt-engineering"&gt;prompt-engineering&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/aider"&gt;aider&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/evals"&gt;evals&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/paul-gauthier"&gt;paul-gauthier&lt;/a&gt;&lt;/p&gt;



</summary><category term="json"/><category term="ai"/><category term="prompt-engineering"/><category term="generative-ai"/><category term="llms"/><category term="aider"/><category term="evals"/><category term="paul-gauthier"/></entry><entry><title>Aider</title><link href="https://simonwillison.net/2024/Jul/31/aider/#atom-tag" rel="alternate"/><published>2024-07-31T03:26:17+00:00</published><updated>2024-07-31T03:26:17+00:00</updated><id>https://simonwillison.net/2024/Jul/31/aider/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/paul-gauthier/aider"&gt;Aider&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Aider is an impressive open source coding chat assistant that runs locally in your terminal, developed by Paul Gauthier (founding CTO of &lt;a href="https://en.wikipedia.org/wiki/Inktomi"&gt;Inktomi&lt;/a&gt; back in &lt;a href="https://www.linkedin.com/in/paulgauthier/details/experience/"&gt;1996-2000&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;I tried it out today, using an Anthropic API key to run it using Claude 3.5 Sonnet:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;pipx install aider-chat
export ANTHROPIC_API_KEY=api-key-here
aider --dark-mode
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I found the &lt;code&gt;--dark-mode&lt;/code&gt; flag necessary to make it legible using the macOS terminal "Pro" theme.&lt;/p&gt;
&lt;p&gt;Aider starts by generating a &lt;a href="https://aider.chat/docs/repomap.html"&gt;concise map of files&lt;/a&gt; in your current Git repository. This is passed to the LLM along with the prompts that you type, and Aider can then request additional files be added to that context - or you can add them manually with the &lt;code&gt;/add filename&lt;/code&gt; command.&lt;/p&gt;
&lt;p&gt;It defaults to making modifications to files and then committing them directly to Git with a generated commit message. I found myself preferring the &lt;code&gt;/ask&lt;/code&gt; command which lets you ask a question without making any file modifications:&lt;/p&gt;
&lt;p&gt;&lt;a href="https://static.simonwillison.net/static/2024/aider.jpg"&gt;&lt;img alt="Loaded /Users/simon/Dropbox/Development/simonwillisonblog/.env Aider v0.46.1 Models: claude-3-5-sonnet-20240620 with diff edit format, weak model claude-3-haiku-20240307 Git repo: .git with 156 files Repo-map: using 1024 tokens Use /help &amp;lt;question&amp;gt; for help, run &amp;quot;aider --help&amp;quot; to see cmd line args  &amp;gt; /add blog/search.py  Added /Users/simon/Dropbox/Development/simonwillisonblog/blog/search.py to the chat  blog/search.py   &amp;gt; /ask what does ?exclude.tag= do in the search feature?         The ?exclude.tag= parameter in the search feature is used to exclude specific tags from the search results. Here's how it      works in the code:        1 The excluded tags are retrieved from the request's GET parameters:   excluded_tags = request.GET.getlist(&amp;quot;exclude.tag&amp;quot;)    This line collects all the tags that should be excluded from the search results. The getlist() method is used because there    might be multiple excluded tags. 2 These excluded tags are then applied to the queryset for each content type (Entry, Blogmark, Quotation) in the   make_queryset() function:       for exclude_tag in excluded_tags:  qs = qs.exclude(tags__tag=exclude_tag)     This loop iterates through each excluded tag and uses Django's exclude() method to filter out any items that have the   specified tag.    So, if a user includes ?exclude.tag=python&amp;amp;exclude.tag=django in their search query, the results will not include any items    tagged with either &amp;quot;python&amp;quot; or &amp;quot;django&amp;quot;. This feature allows users to refine their search by explicitly removing certain topics from the results, which can be   particularly useful when searching through a large number of tagged items.  Tokens: 3,884 sent, 248 received. Cost: $0.02 request, $0.02 session. 
blog/search.py" src="https://static.simonwillison.net/static/2024/aider.jpg" /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;The Aider documentation includes &lt;a href="https://aider.chat/examples/README.html"&gt;extensive examples&lt;/a&gt; and the tool can work with a wide range of &lt;a href="https://aider.chat/docs/llms.html"&gt;different LLMs&lt;/a&gt;, though it recommends GPT-4o, Claude 3.5 Sonnet (or 3 Opus) and DeepSeek Coder V2 for the best results. Aider maintains &lt;a href="https://aider.chat/docs/leaderboards/"&gt;its own leaderboard&lt;/a&gt;, emphasizing that "Aider works best with LLMs which are good at &lt;em&gt;editing&lt;/em&gt; code, not just good at writing code".&lt;/p&gt;
&lt;p&gt;The prompts it uses are pretty fascinating - they're tucked away in various &lt;code&gt;*_prompts.py&lt;/code&gt; files in &lt;a href="https://github.com/paul-gauthier/aider/tree/main/aider/coders"&gt;aider/coders&lt;/a&gt;.&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-programming"&gt;ai-assisted-programming&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude-3-5-sonnet"&gt;claude-3-5-sonnet&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/aider"&gt;aider&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/paul-gauthier"&gt;paul-gauthier&lt;/a&gt;&lt;/p&gt;



</summary><category term="python"/><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="ai-assisted-programming"/><category term="claude-3-5-sonnet"/><category term="aider"/><category term="paul-gauthier"/></entry></feed>