<?xml version="1.0" encoding="utf-8"?>
<feed xml:lang="en-us" xmlns="http://www.w3.org/2005/Atom"><title>Simon Willison's Weblog: testing</title><link href="http://simonwillison.net/" rel="alternate"/><link href="http://simonwillison.net/tags/testing.atom" rel="self"/><id>http://simonwillison.net/</id><updated>2026-03-29T20:08:45+00:00</updated><author><name>Simon Willison</name></author><entry><title>Pretext</title><link href="https://simonwillison.net/2026/Mar/29/pretext/#atom-tag" rel="alternate"/><published>2026-03-29T20:08:45+00:00</published><updated>2026-03-29T20:08:45+00:00</updated><id>https://simonwillison.net/2026/Mar/29/pretext/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/chenglou/pretext"&gt;Pretext&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Exciting new browser library from Cheng Lou, previously a React core developer and the original creator of the &lt;a href="https://github.com/chenglou/react-motion"&gt;react-motion&lt;/a&gt; animation library.&lt;/p&gt;
&lt;p&gt;Pretext solves the problem of calculating the height of a paragraph of line-wrapped text &lt;em&gt;without touching the DOM&lt;/em&gt;. The usual way of doing this is to render the text and measure its dimensions, but this is extremely expensive. Pretext uses an array of clever tricks to make this much, much faster, which enables all sorts of new text rendering effects in browser applications.&lt;/p&gt;
&lt;p&gt;Here's &lt;a href="https://chenglou.me/pretext/dynamic-layout/"&gt;one demo&lt;/a&gt; that shows the kind of things this makes possible:&lt;/p&gt;
&lt;video autoplay loop muted playsinline
  poster="https://static.simonwillison.net/static/2026/pretex.jpg"&gt;
  &lt;source src="https://static.simonwillison.net/static/2026/pretex.mp4" type="video/mp4"&gt;
&lt;/video&gt;

&lt;p&gt;The key to how this works is the way it separates calculations into a call to a &lt;code&gt;prepare()&lt;/code&gt; function followed by multiple calls to &lt;code&gt;layout()&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;prepare()&lt;/code&gt; function splits the input text into segments (effectively words, but it can take things like soft hyphens, non-Latin character sequences, and emoji into account as well) and measures those using an off-screen canvas, then caches the results. This is comparatively expensive but only runs once.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;layout()&lt;/code&gt; function can then emulate the word-wrapping logic in browsers to figure out how many wrapped lines the text will occupy at a specified width and measure the overall height.&lt;/p&gt;
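&lt;p&gt;To make the split concrete, here is a minimal Python sketch of the same two-phase idea. It is not Pretext's code or API: the &lt;code&gt;prepare&lt;/code&gt; and &lt;code&gt;layout&lt;/code&gt; signatures, the greedy wrapping rule, and the &lt;code&gt;fake_measure&lt;/code&gt; stand-in for canvas measurement are all assumptions made for illustration.&lt;/p&gt;
&lt;pre&gt;
```python
def prepare(text, measure):
    """Split text into words and measure each one once (the expensive step)."""
    return [(word, measure(word)) for word in text.split()]


def layout(prepared, max_width, space_width, line_height):
    """Greedy word wrap over cached widths; returns (line_count, total_height)."""
    lines = 0
    current = 0.0
    for _word, width in prepared:
        if lines == 0:
            # The first word always starts the first line.
            lines, current = 1, width
        elif current + space_width + width > max_width:
            # The word does not fit, so it starts a new line.
            lines += 1
            current = width
        else:
            current += space_width + width
    return lines, lines * line_height


def fake_measure(word):
    # Hypothetical stand-in for canvas measurement: 8 units per character.
    return 8.0 * len(word)


prepared = prepare("the quick brown fox jumps over the lazy dog", fake_measure)

# One prepare() call, then as many cheap layout() calls as needed.
narrow = layout(prepared, max_width=80, space_width=8, line_height=20)
wide = layout(prepared, max_width=400, space_width=8, line_height=20)
print(narrow, wide)
```
&lt;/pre&gt;
&lt;p&gt;The point of the design is that resizing a container only repeats the cheap arithmetic in &lt;code&gt;layout()&lt;/code&gt;; nothing is re-measured.&lt;/p&gt;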
&lt;p&gt;I &lt;a href="https://claude.ai/share/7859cbe1-1350-4341-bb40-6aa241d6a1fe"&gt;had Claude&lt;/a&gt; build me &lt;a href="https://tools.simonwillison.net/pretext-explainer"&gt;this interactive artifact&lt;/a&gt; to help me visually understand what's going on, based on a simplified version of Pretext itself.&lt;/p&gt;
&lt;p&gt;The way this is tested is particularly impressive. The earlier tests &lt;a href="https://github.com/chenglou/pretext/commit/d07dd7a5008726f99a15cebe0abd9031022e28ef#diff-835c37ed3b9234ed4d90c7703addb8e47f4fee6d9a28481314afd15ac472f8d2"&gt;rendered a full copy of the Great Gatsby&lt;/a&gt; in multiple browsers to confirm that the estimated measurements were correct against a large volume of text. This was later joined by &lt;a href="https://github.com/chenglou/pretext/tree/main/corpora"&gt;the corpora/ folder&lt;/a&gt; using the same technique against lengthy public domain documents in Thai, Chinese, Korean, Japanese, Arabic, and more.&lt;/p&gt;
&lt;p&gt;Cheng Lou &lt;a href="https://twitter.com/_chenglou/status/2037715226838343871"&gt;says&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The engine’s tiny (few kbs), aware of browser quirks, supports all the languages you’ll need, including Korean mixed with RTL Arabic and platform-specific emojis&lt;/p&gt;
&lt;p&gt;This was achieved through showing Claude Code and Codex the browsers ground truth, and have them measure &amp;amp; iterate against those at every significant container width, running over weeks&lt;/p&gt;
&lt;/blockquote&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/_chenglou/status/2037713766205608234"&gt;@_chenglou&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/browsers"&gt;browsers&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/css"&gt;css&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/javascript"&gt;javascript&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/testing"&gt;testing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/react"&gt;react&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/typescript"&gt;typescript&lt;/a&gt;&lt;/p&gt;



</summary><category term="browsers"/><category term="css"/><category term="javascript"/><category term="testing"/><category term="react"/><category term="typescript"/></entry><entry><title>Agentic manual testing</title><link href="https://simonwillison.net/guides/agentic-engineering-patterns/agentic-manual-testing/#atom-tag" rel="alternate"/><published>2026-03-06T05:43:54+00:00</published><updated>2026-03-06T05:43:54+00:00</updated><id>https://simonwillison.net/guides/agentic-engineering-patterns/agentic-manual-testing/#atom-tag</id><summary type="html">
    &lt;p&gt;&lt;em&gt;&lt;a href="https://simonwillison.net/guides/agentic-engineering-patterns/"&gt;Agentic Engineering Patterns&lt;/a&gt; &amp;gt;&lt;/em&gt;&lt;/p&gt;
    &lt;p&gt;The defining characteristic of a coding agent is that it can &lt;em&gt;execute the code&lt;/em&gt; that it writes. This is what makes coding agents so much more useful than LLMs that simply spit out code without any way to verify it.&lt;/p&gt;
&lt;p&gt;Never assume that code generated by an LLM works until that code has been executed.&lt;/p&gt;
&lt;p&gt;Coding agents have the ability to confirm that the code they have produced works as intended, or iterate further on that code until it does.&lt;/p&gt;
&lt;p&gt;Getting agents to &lt;a href="https://simonwillison.net/guides/agentic-engineering-patterns/red-green-tdd/"&gt;write unit tests&lt;/a&gt;, especially using test-first TDD, is a powerful way to ensure they have exercised the code they are writing.&lt;/p&gt;
&lt;p&gt;That's not the only worthwhile approach, though. &lt;/p&gt;
&lt;p&gt;Just because code passes tests doesn't mean it works as intended. Anyone who's worked with automated tests will have seen cases where the tests all pass but the code itself fails in some obvious way - it might crash the server on startup, fail to display a crucial UI element, or miss some detail that the tests failed to cover.&lt;/p&gt;
&lt;p&gt;Automated tests are no replacement for &lt;strong&gt;manual testing&lt;/strong&gt;. I like to see a feature working with my own eyes before I land it in a release.&lt;/p&gt;
&lt;p&gt;I've found that getting agents to manually test code is valuable as well, frequently revealing issues that weren't spotted by the automated tests.&lt;/p&gt;
&lt;h2 id="mechanisms-for-agentic-manual-testing"&gt;Mechanisms for agentic manual testing&lt;/h2&gt;
&lt;p&gt;How an agent should "manually" test a piece of code varies depending on what that code is.&lt;/p&gt;
&lt;p&gt;For Python libraries a useful pattern is &lt;code&gt;python -c "... code ..."&lt;/code&gt;. You can pass a string (or multiline string) of Python code directly to the Python interpreter, including code that imports other modules.&lt;/p&gt;
&lt;p&gt;The coding agents are all familiar with this trick and will sometimes use it without prompting. Reminding them to test using &lt;code&gt;python -c&lt;/code&gt; can often be effective though:&lt;/p&gt;
&lt;pre&gt;Try that new function on some edge cases using `python -c`&lt;/pre&gt;
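&lt;p&gt;As a rough illustration of what that looks like in practice, the sketch below passes a one-line snippet to the interpreter with &lt;code&gt;-c&lt;/code&gt; and captures what it prints. It drives the interpreter through &lt;code&gt;subprocess&lt;/code&gt; so the example is self-contained, and &lt;code&gt;textwrap.shorten&lt;/code&gt; is only a stand-in for "that new function".&lt;/p&gt;
&lt;pre&gt;
```python
import subprocess
import sys

# A one-off edge-case check, the kind of thing an agent might pass to `python -c`.
# textwrap.shorten is a stand-in here for whatever function is being tested.
snippet = "import textwrap; print(textwrap.shorten('hello   world', width=40))"

result = subprocess.run(
    [sys.executable, "-c", snippet],
    capture_output=True,
    text=True,
    check=True,  # raise if the snippet exits with a non-zero status
)
output = result.stdout.strip()
print(output)
```
&lt;/pre&gt;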
&lt;p&gt;Other languages may have similar mechanisms, and if they don't it's still quick for an agent to write out a demo file and then compile and run it. I sometimes encourage it to use &lt;code&gt;/tmp&lt;/code&gt; purely to avoid those files being accidentally committed to the repository later on.&lt;/p&gt;
&lt;pre&gt;Write code in `/tmp` to try edge cases of that function and then compile and run it&lt;/pre&gt;
&lt;p&gt;Many of my projects involve building web applications with JSON APIs. For these I tell the agent to exercise them using &lt;code&gt;curl&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;Run a dev server and explore that new JSON API using `curl`&lt;/pre&gt;
&lt;p&gt;Telling an agent to "explore" often results in it trying out a bunch of different aspects of a new API, which can quickly cover a whole lot of ground.&lt;/p&gt;
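&lt;p&gt;The same loop can be sketched in Python, assuming a throwaway server rather than a real project. Everything here is hypothetical: &lt;code&gt;DemoHandler&lt;/code&gt;, the &lt;code&gt;/api/items&lt;/code&gt; path, and the response shape are invented for the example, and &lt;code&gt;urllib&lt;/code&gt; stands in for &lt;code&gt;curl&lt;/code&gt;.&lt;/p&gt;
&lt;pre&gt;
```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class DemoHandler(BaseHTTPRequestHandler):
    """Hypothetical JSON API: echoes the requested path back as JSON."""

    def do_GET(self):
        body = json.dumps({"path": self.path, "ok": True}).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        # Silence the default per-request logging.
        pass


# Port 0 asks the operating system for any free port.
server = HTTPServer(("127.0.0.1", 0), DemoHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
port = server.server_address[1]

# The "explore the API" step: make a request and look at what comes back.
with urllib.request.urlopen(f"http://127.0.0.1:{port}/api/items") as response:
    status = response.status
    payload = json.loads(response.read())

server.shutdown()
server.server_close()
print(status, payload)
```
&lt;/pre&gt;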
&lt;p&gt;If an agent finds something that doesn't work through their manual testing, I like to tell them to fix it with red/green TDD. This ensures the new case ends up covered by the permanent automated tests.&lt;/p&gt;
&lt;h2 id="using-browser-automation-for-web-uis"&gt;Using browser automation for web UIs&lt;/h2&gt;
&lt;p&gt;Having a manual testing procedure in place becomes even more valuable if a project involves an interactive web UI.&lt;/p&gt;
&lt;p&gt;Historically these have been difficult to test from code, but the past decade has seen notable improvements in systems for automating real web browsers. Running a real Chrome or Firefox or Safari browser against an application can uncover all sorts of interesting problems in a realistic setting.&lt;/p&gt;
&lt;p&gt;Coding agents know how to use these tools extremely well.&lt;/p&gt;
&lt;p&gt;The most powerful of these today is &lt;strong&gt;&lt;a href="https://playwright.dev/"&gt;Playwright&lt;/a&gt;&lt;/strong&gt;, an open source library developed by Microsoft. Playwright offers a full-featured API with bindings in multiple popular programming languages and can automate any of the popular browser engines.&lt;/p&gt;
&lt;p&gt;Simply telling your agent to "test that with Playwright" may be enough. The agent can then select the language binding that makes the most sense, or use Playwright's &lt;a href="https://github.com/microsoft/playwright-cli"&gt;playwright-cli&lt;/a&gt; tool.&lt;/p&gt;
&lt;p&gt;Coding agents work really well with dedicated CLIs. &lt;a href="https://github.com/vercel-labs/agent-browser"&gt;agent-browser&lt;/a&gt; by Vercel is a comprehensive CLI wrapper around Playwright specially designed for coding agents to use.&lt;/p&gt;
&lt;p&gt;My own project &lt;a href="https://github.com/simonw/rodney"&gt;Rodney&lt;/a&gt; serves a similar purpose, albeit using the Chrome DevTools Protocol to directly control an instance of Chrome.&lt;/p&gt;
&lt;p&gt;Here's an example prompt I use to test things with Rodney:&lt;/p&gt;
&lt;pre&gt;Start a dev server and then use `uvx rodney --help` to test the new homepage, look at screenshots to confirm the menu is in the right place&lt;/pre&gt;
&lt;p&gt;There are three tricks in this prompt:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Saying "use &lt;code&gt;uvx rodney --help&lt;/code&gt;" causes the agent to run &lt;code&gt;rodney --help&lt;/code&gt; via the &lt;a href="https://docs.astral.sh/uv/guides/tools/"&gt;uvx&lt;/a&gt; package management tool, which automatically installs Rodney the first time it is called.&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;rodney --help&lt;/code&gt; command is specifically designed to give agents everything they need to know to both understand and use the tool. Here's &lt;a href="https://github.com/simonw/rodney/blob/main/help.txt"&gt;that help text&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Saying "look at screenshots" hints to the agent that it should use the &lt;code&gt;rodney screenshot&lt;/code&gt; command and reminds it that it can use its own vision abilities against the resulting image files to evaluate the visual appearance of the page.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;That's a whole lot of manual testing baked into a short prompt!&lt;/p&gt;
&lt;p&gt;Rodney and tools like it offer a wide array of capabilities, from running JavaScript on the loaded site to scrolling, clicking, typing, and even reading the accessibility tree of the page.&lt;/p&gt;
&lt;p&gt;As with other forms of manual tests, issues found and fixed via browser automation can then be added to permanent automated tests as well.&lt;/p&gt;
&lt;p&gt;Many developers have avoided too many automated browser tests in the past due to their reputation for flakiness - the smallest tweak to the HTML of a page can result in frustrating waves of test breaks.&lt;/p&gt;
&lt;p&gt;Having coding agents maintain those tests over time greatly reduces the friction involved in keeping them up-to-date in the face of design changes to the web interfaces.&lt;/p&gt;
&lt;h2 id="have-them-take-notes-with-showboat"&gt;Have them take notes with Showboat&lt;/h2&gt;
&lt;p&gt;Having agents manually test code can catch extra problems, but it can also be used to create artifacts that can help document the code and demonstrate how it has been tested.&lt;/p&gt;
&lt;p&gt;I'm fascinated by the challenge of having agents &lt;em&gt;show their work&lt;/em&gt;. Being able to see demos or documented experiments is a really useful way of confirming that the agent has comprehensively solved the challenge it was given.&lt;/p&gt;
&lt;p&gt;I built &lt;a href="https://github.com/simonw/showboat"&gt;Showboat&lt;/a&gt; to facilitate building documents that capture the agentic manual testing flow.&lt;/p&gt;
&lt;p&gt;Here's a prompt I frequently use:&lt;/p&gt;
&lt;pre&gt;Run `uvx showboat --help` and then create a `notes/api-demo.md` showboat document and use it to test and document that new API.&lt;/pre&gt;
&lt;p&gt;As with Rodney above, the &lt;code&gt;showboat --help&lt;/code&gt; command teaches the agent what Showboat is and how to use it. Here's &lt;a href="https://github.com/simonw/showboat/blob/main/help.txt"&gt;that help text in full&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The three key Showboat commands are &lt;code&gt;note&lt;/code&gt;, &lt;code&gt;exec&lt;/code&gt;, and &lt;code&gt;image&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;note&lt;/code&gt; appends a Markdown note to the Showboat document. &lt;code&gt;exec&lt;/code&gt; records a command, then runs that command and records its output. &lt;code&gt;image&lt;/code&gt; adds an image to the document - useful for screenshots of web applications taken using Rodney.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;exec&lt;/code&gt; command is the most important of these, because it captures a command along with the resulting output. This shows you what the agent did and what the result was, and is designed to discourage the agent from cheating and writing what it &lt;em&gt;hoped&lt;/em&gt; had happened into the document.&lt;/p&gt;
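&lt;p&gt;The underlying idea can be sketched in a few lines of Python. This is not how Showboat is implemented: &lt;code&gt;record_exec&lt;/code&gt; and the list-of-lines document are hypothetical, and the sketch only shows why recording real output is harder to fake than a hand-written note.&lt;/p&gt;
&lt;pre&gt;
```python
import subprocess
import sys


def record_exec(document, command):
    """Run command and append both the command and its real output to document."""
    result = subprocess.run(command, capture_output=True, text=True)
    document.append("$ " + " ".join(command))
    # The recorded text comes from the process itself, not from the caller.
    document.append(result.stdout.rstrip("\n"))
    return result.returncode


doc = []
exit_code = record_exec(doc, [sys.executable, "-c", "print(2 + 2)"])
print("\n".join(doc))
```
&lt;/pre&gt;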
&lt;p&gt;I've been finding the Showboat pattern to work really well for documenting the work that has been achieved during my agent sessions. I'm hoping to see similar patterns adopted across a wider set of tools.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/playwright"&gt;playwright&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/testing"&gt;testing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/agentic-engineering"&gt;agentic-engineering&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/coding-agents"&gt;coding-agents&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-programming"&gt;ai-assisted-programming&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/rodney"&gt;rodney&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/showboat"&gt;showboat&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="playwright"/><category term="testing"/><category term="agentic-engineering"/><category term="ai"/><category term="llms"/><category term="coding-agents"/><category term="ai-assisted-programming"/><category term="rodney"/><category term="showboat"/></entry><entry><title>First run the tests</title><link href="https://simonwillison.net/guides/agentic-engineering-patterns/first-run-the-tests/#atom-tag" rel="alternate"/><published>2026-02-24T12:30:05+00:00</published><updated>2026-02-24T12:30:05+00:00</updated><id>https://simonwillison.net/guides/agentic-engineering-patterns/first-run-the-tests/#atom-tag</id><summary type="html">
    &lt;p&gt;&lt;em&gt;&lt;a href="https://simonwillison.net/guides/agentic-engineering-patterns/"&gt;Agentic Engineering Patterns&lt;/a&gt; &amp;gt;&lt;/em&gt;&lt;/p&gt;
    &lt;p&gt;Automated tests are no longer optional when working with coding agents.&lt;/p&gt;
&lt;p&gt;The old excuses for not writing them - that they're time consuming and expensive to constantly rewrite while a codebase is rapidly evolving - no longer hold when an agent can knock them into shape in just a few minutes.&lt;/p&gt;
&lt;p&gt;They're also &lt;em&gt;vital&lt;/em&gt; for ensuring AI-generated code does what it claims to do.  If the code has never been executed it's pure luck if it actually works when deployed to production.&lt;/p&gt;
&lt;p&gt;Tests are also a great tool to help get an agent up to speed with an existing codebase. Watch what happens when you ask Claude Code or similar about an existing feature - the chances are high that they'll find and read the relevant tests.&lt;/p&gt;
&lt;p&gt;Agents are already biased towards testing, but the presence of an existing test suite will almost certainly push the agent into testing new changes that it makes.&lt;/p&gt;
&lt;p&gt;Any time I start a new session with an agent against an existing project I'll start by prompting a variant of the following:&lt;/p&gt;
&lt;pre&gt;First run the tests&lt;/pre&gt;
&lt;p&gt;For my Python projects I have &lt;a href="https://til.simonwillison.net/uv/dependency-groups"&gt;pyproject.toml set up&lt;/a&gt; such that I can prompt this instead:&lt;/p&gt;
&lt;pre&gt;Run &amp;quot;uv run pytest&amp;quot;&lt;/pre&gt;
&lt;p&gt;These four-word prompts serve several purposes:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;It tells the agent that there is a test suite and forces it to figure out how to run the tests. This makes it almost certain that the agent will run the tests in the future to ensure it didn't break anything.&lt;/li&gt;
&lt;li&gt;Most test harnesses will give the agent a rough indication of how many tests there are. This can act as a proxy for how large and complex the project is, and also hints that the agent should search the tests themselves if it wants to learn more.&lt;/li&gt;
&lt;li&gt;It puts the agent in a testing mindset. Having run the tests it's natural for it to then expand them with its own tests later on.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Similar to &lt;a href="https://simonwillison.net/guides/agentic-engineering-patterns/red-green-tdd/"&gt;"Use red/green TDD"&lt;/a&gt;, "First run the tests" provides a four-word prompt that encompasses a substantial amount of software engineering discipline that's already baked into the models.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/testing"&gt;testing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/tdd"&gt;tdd&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/coding-agents"&gt;coding-agents&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-programming"&gt;ai-assisted-programming&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/agentic-engineering"&gt;agentic-engineering&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="testing"/><category term="tdd"/><category term="ai"/><category term="llms"/><category term="coding-agents"/><category term="ai-assisted-programming"/><category term="generative-ai"/><category term="agentic-engineering"/></entry><entry><title>Red/green TDD</title><link href="https://simonwillison.net/guides/agentic-engineering-patterns/red-green-tdd/#atom-tag" rel="alternate"/><published>2026-02-23T07:12:28+00:00</published><updated>2026-02-23T07:12:28+00:00</updated><id>https://simonwillison.net/guides/agentic-engineering-patterns/red-green-tdd/#atom-tag</id><summary type="html">
    &lt;p&gt;&lt;em&gt;&lt;a href="https://simonwillison.net/guides/agentic-engineering-patterns/"&gt;Agentic Engineering Patterns&lt;/a&gt; &amp;gt;&lt;/em&gt;&lt;/p&gt;
    &lt;p&gt;"&lt;strong&gt;Use red/green TDD&lt;/strong&gt;" is a pleasingly succinct way to get better results out of a coding agent.&lt;/p&gt;
&lt;p&gt;TDD stands for Test Driven Development. It's a programming style where you ensure every piece of code you write is accompanied by automated tests that demonstrate the code works.&lt;/p&gt;
&lt;p&gt;The most disciplined form of TDD is test-first development. You write the automated tests first, confirm that they fail, then iterate on the implementation until the tests pass.&lt;/p&gt;
&lt;p&gt;This turns out to be a &lt;em&gt;fantastic&lt;/em&gt; fit for coding agents. A significant risk with coding agents is that they might write code that doesn't work, or build code that is unnecessary and never gets used, or both.&lt;/p&gt;
&lt;p&gt;Test-first development helps protect against both of these common mistakes, and also ensures a robust automated test suite that protects against future regressions. As projects grow the chance that a new change might break an existing feature grows with them. A comprehensive test suite is by far the most effective way to keep those features working.&lt;/p&gt;
&lt;p&gt;It's important to confirm that the tests fail before implementing the code to make them pass. If you skip that step you risk building a test that passes already, hence failing to exercise and confirm your new implementation.&lt;/p&gt;
&lt;p&gt;That's what "red/green" means: the red phase watches the tests fail, then the green phase confirms that they now pass.&lt;/p&gt;
&lt;p&gt;Every good model understands "red/green TDD" as a shorthand for the much longer "use test driven development, write the tests first, confirm that the tests fail before you implement the change that gets them to pass".&lt;/p&gt;
&lt;p&gt;Example prompt:&lt;/p&gt;
&lt;pre&gt;Build a Python function to extract headers from a markdown string. Use red/green TDD.&lt;/pre&gt;
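&lt;p&gt;For a sense of what the two phases look like, here is a compressed sketch for that prompt. The names &lt;code&gt;extract_headers&lt;/code&gt;, &lt;code&gt;extract_headers_stub&lt;/code&gt;, and &lt;code&gt;run_test&lt;/code&gt; are hypothetical, the stub stands in for the state of the code before any implementation exists, and the function is deliberately simplified: it only handles &lt;code&gt;#&lt;/code&gt;-style headers and does not skip fenced code blocks.&lt;/p&gt;
&lt;pre&gt;
```python
import re


def extract_headers_stub(markdown):
    """State of the code during the red phase: nothing implemented yet."""
    raise NotImplementedError


def extract_headers(markdown):
    """Return (level, text) for each '#'-style header line.

    Simplification: headers inside fenced code blocks are not skipped.
    """
    headers = []
    for line in markdown.splitlines():
        match = re.match(r"^(#{1,6})\s+(.+?)\s*$", line)
        if match:
            headers.append((len(match.group(1)), match.group(2)))
    return headers


def run_test(func):
    """The test is written first; returns True only if func passes it."""
    sample = "# Title\n\nBody text\n\n## Section"
    try:
        return func(sample) == [(1, "Title"), (2, "Section")]
    except NotImplementedError:
        return False


red = run_test(extract_headers_stub)  # red: the test fails before the code exists
green = run_test(extract_headers)     # green: the same test passes afterwards
print(red, green)
```
&lt;/pre&gt;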
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/testing"&gt;testing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/tdd"&gt;tdd&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/coding-agents"&gt;coding-agents&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-programming"&gt;ai-assisted-programming&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/agentic-engineering"&gt;agentic-engineering&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="testing"/><category term="tdd"/><category term="coding-agents"/><category term="ai-assisted-programming"/><category term="agentic-engineering"/></entry><entry><title>Rodney v0.4.0</title><link href="https://simonwillison.net/2026/Feb/17/rodney/#atom-tag" rel="alternate"/><published>2026-02-17T23:02:33+00:00</published><updated>2026-02-17T23:02:33+00:00</updated><id>https://simonwillison.net/2026/Feb/17/rodney/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/simonw/rodney/releases/tag/v0.4.0"&gt;Rodney v0.4.0&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;My &lt;a href="https://github.com/simonw/rodney"&gt;Rodney&lt;/a&gt; CLI tool for browser automation has attracted quite the flurry of PRs since I announced it &lt;a href="https://simonwillison.net/2026/Feb/10/showboat-and-rodney/#rodney-cli-browser-automation-designed-to-work-with-showboat"&gt;last week&lt;/a&gt;. Here are the release notes for the just-released v0.4.0:&lt;/p&gt;
&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;Errors now use exit code 2, which means exit code 1 is just for check failures. &lt;a href="https://github.com/simonw/rodney/pull/15"&gt;#15&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;New &lt;code&gt;rodney assert&lt;/code&gt; command for running JavaScript tests, exit code 1 if they fail. &lt;a href="https://github.com/simonw/rodney/issues/19"&gt;#19&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;New directory-scoped sessions with &lt;code&gt;--local&lt;/code&gt;/&lt;code&gt;--global&lt;/code&gt; flags. &lt;a href="https://github.com/simonw/rodney/pull/14"&gt;#14&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;New &lt;code&gt;reload --hard&lt;/code&gt; and &lt;code&gt;clear-cache&lt;/code&gt; commands. &lt;a href="https://github.com/simonw/rodney/pull/17"&gt;#17&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;New &lt;code&gt;rodney start --show&lt;/code&gt; option to make the browser window visible. Thanks, &lt;a href="https://github.com/antocuni"&gt;Antonio Cuni&lt;/a&gt;. &lt;a href="https://github.com/simonw/rodney/pull/13"&gt;#13&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;New &lt;code&gt;rodney connect PORT&lt;/code&gt; command to debug an already-running Chrome instance. Thanks, &lt;a href="https://github.com/pnf"&gt;Peter Fraenkel&lt;/a&gt;. &lt;a href="https://github.com/simonw/rodney/pull/12"&gt;#12&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;New &lt;code&gt;RODNEY_HOME&lt;/code&gt; environment variable to support custom state directories. Thanks, &lt;a href="https://github.com/senko"&gt;Senko Rašić&lt;/a&gt;. &lt;a href="https://github.com/simonw/rodney/pull/11"&gt;#11&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;New &lt;code&gt;--insecure&lt;/code&gt; flag to ignore certificate errors. Thanks, &lt;a href="https://github.com/zgolus"&gt;Jakub Zgoliński&lt;/a&gt;. &lt;a href="https://github.com/simonw/rodney/pull/10"&gt;#10&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Windows support: avoid &lt;code&gt;Setsid&lt;/code&gt; on Windows via build-tag helpers. Thanks, &lt;a href="https://github.com/adm1neca"&gt;adm1neca&lt;/a&gt;. &lt;a href="https://github.com/simonw/rodney/pull/18"&gt;#18&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Tests now run on &lt;code&gt;windows-latest&lt;/code&gt; and &lt;code&gt;macos-latest&lt;/code&gt; in addition to Linux.&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;I've been using &lt;a href="https://github.com/simonw/showboat"&gt;Showboat&lt;/a&gt; to create demos of new features - here those are for &lt;a href="https://github.com/simonw/rodney/tree/v0.4.0/notes/assert-command-demo"&gt;rodney assert&lt;/a&gt;, &lt;a href="https://github.com/simonw/rodney/tree/v0.4.0/notes/clear-cache-demo"&gt;rodney reload --hard&lt;/a&gt;, &lt;a href="https://github.com/simonw/rodney/tree/v0.4.0/notes/error-codes-demo"&gt;rodney exit codes&lt;/a&gt;, and &lt;a href="https://github.com/simonw/rodney/tree/v0.4.0/notes/local-sessions-demo"&gt;rodney start --local&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;rodney assert&lt;/code&gt; command is pretty neat: you can now use Rodney to test a web app through multiple steps in a shell script that looks something like this (adapted from &lt;a href="https://github.com/simonw/rodney/blob/v0.4.0/README.md#combining-checks-in-a-shell-script"&gt;the README&lt;/a&gt;):&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#!&lt;/span&gt;/bin/bash&lt;/span&gt;
&lt;span class="pl-c1"&gt;set&lt;/span&gt; -euo pipefail

FAIL=0

&lt;span class="pl-en"&gt;check&lt;/span&gt;() {
    &lt;span class="pl-k"&gt;if&lt;/span&gt; &lt;span class="pl-k"&gt;!&lt;/span&gt; &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;span class="pl-smi"&gt;$@&lt;/span&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;&lt;span class="pl-k"&gt;;&lt;/span&gt; &lt;span class="pl-k"&gt;then&lt;/span&gt;
        &lt;span class="pl-c1"&gt;echo&lt;/span&gt; &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;FAIL: &lt;span class="pl-smi"&gt;$*&lt;/span&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;
        FAIL=1
    &lt;span class="pl-k"&gt;fi&lt;/span&gt;
}

rodney start
rodney open &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;https://example.com&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;
rodney waitstable

&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; Assert elements exist&lt;/span&gt;
check rodney exists &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;h1&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;

&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; Assert key elements are visible&lt;/span&gt;
check rodney visible &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;h1&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;
check rodney visible &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;#main-content&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;

&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; Assert JS expressions&lt;/span&gt;
check rodney assert &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;document.title&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;Example Domain&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;
check rodney assert &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;document.querySelectorAll("p").length&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;2&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;

&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; Assert accessibility requirements&lt;/span&gt;
check rodney ax-find --role navigation

rodney stop

&lt;span class="pl-k"&gt;if&lt;/span&gt; [ &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;span class="pl-smi"&gt;$FAIL&lt;/span&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-k"&gt;-ne&lt;/span&gt; 0 ]&lt;span class="pl-k"&gt;;&lt;/span&gt; &lt;span class="pl-k"&gt;then&lt;/span&gt;
    &lt;span class="pl-c1"&gt;echo&lt;/span&gt; &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Some checks failed&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;
    &lt;span class="pl-c1"&gt;exit&lt;/span&gt; 1
&lt;span class="pl-k"&gt;fi&lt;/span&gt;
&lt;span class="pl-c1"&gt;echo&lt;/span&gt; &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;All checks passed&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/browsers"&gt;browsers&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/testing"&gt;testing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/annotated-release-notes"&gt;annotated-release-notes&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/rodney"&gt;rodney&lt;/a&gt;&lt;/p&gt;



</summary><category term="browsers"/><category term="projects"/><category term="testing"/><category term="annotated-release-notes"/><category term="rodney"/></entry><entry><title>Introducing Showboat and Rodney, so agents can demo what they’ve built</title><link href="https://simonwillison.net/2026/Feb/10/showboat-and-rodney/#atom-tag" rel="alternate"/><published>2026-02-10T17:45:29+00:00</published><updated>2026-02-10T17:45:29+00:00</updated><id>https://simonwillison.net/2026/Feb/10/showboat-and-rodney/#atom-tag</id><summary type="html">
    &lt;p&gt;A key challenge working with coding agents is having them both test what they’ve built and demonstrate that software to you, their supervisor. This goes beyond automated tests - we need artifacts that show their progress and help us see exactly what the agent-produced software is able to do. I’ve just released two new tools aimed at this problem: &lt;a href="https://github.com/simonw/showboat"&gt;Showboat&lt;/a&gt; and &lt;a href="https://github.com/simonw/rodney"&gt;Rodney&lt;/a&gt;.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2026/Feb/10/showboat-and-rodney/#proving-code-actually-works"&gt;Proving code actually works&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2026/Feb/10/showboat-and-rodney/#showboat-agents-build-documents-to-demo-their-work"&gt;Showboat: Agents build documents to demo their work&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2026/Feb/10/showboat-and-rodney/#rodney-cli-browser-automation-designed-to-work-with-showboat"&gt;Rodney: CLI browser automation designed to work with Showboat&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2026/Feb/10/showboat-and-rodney/#test-driven-development-helps-but-we-still-need-manual-testing"&gt;Test-driven development helps, but we still need manual testing&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2026/Feb/10/showboat-and-rodney/#i-built-both-of-these-tools-on-my-phone"&gt;I built both of these tools on my phone&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h4 id="proving-code-actually-works"&gt;Proving code actually works&lt;/h4&gt;
&lt;p&gt;I recently wrote about how the job of a software engineer isn't to write code, it's to &lt;em&gt;&lt;a href="https://simonwillison.net/2025/Dec/18/code-proven-to-work/"&gt;deliver code that works&lt;/a&gt;&lt;/em&gt;. A big part of that is proving to ourselves and to other people that the code we are responsible for behaves as expected.&lt;/p&gt;
&lt;p&gt;This becomes even more important - and challenging - as we embrace coding agents as a core part of our software development process.&lt;/p&gt;
&lt;p&gt;The more code we churn out with agents, the more valuable tools are that reduce the amount of manual QA time we need to spend.&lt;/p&gt;
&lt;p&gt;One of the most interesting things about &lt;a href="https://simonwillison.net/2026/Feb/7/software-factory/"&gt;the StrongDM software factory model&lt;/a&gt; is how they ensure that their software is well tested and delivers value despite their policy that "code must not be reviewed by humans". Part of their solution involves expensive swarms of QA agents running through "scenarios" to exercise their software. It's fascinating, but I don't want to spend thousands of dollars on QA robots if I can avoid it!&lt;/p&gt;
&lt;p&gt;I need tools that allow agents to clearly demonstrate their work to me, while minimizing the opportunities for them to cheat about what they've done.&lt;/p&gt;

&lt;h4 id="showboat-agents-build-documents-to-demo-their-work"&gt;Showboat: Agents build documents to demo their work&lt;/h4&gt;
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/simonw/showboat"&gt;Showboat&lt;/a&gt;&lt;/strong&gt; is the tool I built to help agents demonstrate their work to me.&lt;/p&gt;
&lt;p&gt;It's a CLI tool (a Go binary, optionally &lt;a href="https://simonwillison.net/2026/Feb/4/distributing-go-binaries/"&gt;wrapped in Python&lt;/a&gt; to make it easier to install) that helps an agent construct a Markdown document demonstrating exactly what their newly developed code can do.&lt;/p&gt;
&lt;p&gt;It's not designed for humans to run, but here's how you would run it anyway:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;showboat init demo.md &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;How to use curl and jq&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;
showboat note demo.md &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Here's how to use curl and jq together.&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;
showboat &lt;span class="pl-c1"&gt;exec&lt;/span&gt; demo.md bash &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;curl -s https://api.github.com/repos/simonw/rodney | jq .description&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;
showboat note demo.md &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;And the curl logo, to demonstrate the image command:&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;
showboat image demo.md &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;curl -o curl-logo.png https://curl.se/logo/curl-logo.png &amp;amp;&amp;amp; echo curl-logo.png&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Here's what the result looks like if you open it up in VS Code and preview the Markdown:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2026/curl-demo.jpg" alt="Screenshot showing a Markdown file &amp;quot;demo.md&amp;quot; side-by-side with its rendered preview. The Markdown source (left) shows: &amp;quot;# How to use curl and jq&amp;quot;, italic timestamp &amp;quot;2026-02-10T01:12:30Z&amp;quot;, prose &amp;quot;Here's how to use curl and jq together.&amp;quot;, a bash code block with &amp;quot;curl -s https://api.github.com/repos/simonw/rodney | jq .description&amp;quot;, output block showing '&amp;quot;CLI tool for interacting with the web&amp;quot;', text &amp;quot;And the curl logo, to demonstrate the image command:&amp;quot;, a bash {image} code block with &amp;quot;curl -o curl-logo.png https://curl.se/logo/curl-logo.png &amp;amp;&amp;amp; echo curl-logo.png&amp;quot;, and a Markdown image reference &amp;quot;2056e48f-2026-02-10&amp;quot;. The rendered preview (right) displays the formatted heading, timestamp, prose, styled code blocks, and the curl logo image in dark teal showing &amp;quot;curl://&amp;quot; with circuit-style design elements." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;Here's that &lt;a href="https://gist.github.com/simonw/fb0b24696ed8dd91314fe41f4c453563#file-demo-md"&gt;demo.md file in a Gist&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;So a sequence of &lt;code&gt;showboat init&lt;/code&gt;, &lt;code&gt;showboat note&lt;/code&gt;, &lt;code&gt;showboat exec&lt;/code&gt; and &lt;code&gt;showboat image&lt;/code&gt; commands constructs a Markdown document one section at a time, with the output of those &lt;code&gt;exec&lt;/code&gt; commands automatically added to the document directly following the commands that were run.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;image&lt;/code&gt; command is a little special - it looks for a file path to an image in the output of the command and copies that image to the current folder and references it in the file.&lt;/p&gt;
&lt;p&gt;That's basically the whole thing! There's a &lt;code&gt;pop&lt;/code&gt; command to remove the most recently added section if something goes wrong, a &lt;code&gt;verify&lt;/code&gt; command to re-run the document and check nothing has changed (I'm not entirely convinced by the design of that one) and an &lt;code&gt;extract&lt;/code&gt; command that reverse-engineers the CLI commands that were used to create the document.&lt;/p&gt;
&lt;p&gt;It's pretty simple - just 172 lines of Go.&lt;/p&gt;
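&lt;p&gt;To make the append-as-you-go idea concrete, here's a rough Python sketch of the &lt;code&gt;init&lt;/code&gt;, &lt;code&gt;note&lt;/code&gt; and &lt;code&gt;exec&lt;/code&gt; steps. This is not Showboat's actual implementation (which is written in Go), and the function names and the exact Markdown layout are my own assumptions for illustration. It uses &lt;code&gt;sh&lt;/code&gt; rather than &lt;code&gt;bash&lt;/code&gt; to keep the sketch portable.&lt;/p&gt;

```python
import subprocess
import tempfile
from pathlib import Path

# Markdown code fence, built from single backticks so this listing stays readable.
FENCE = "`" * 3


def init(doc, title):
    """Start a new document with a top-level heading (hypothetical helper)."""
    Path(doc).write_text(f"# {title}\n\n")


def note(doc, text):
    """Append a paragraph of prose to the document."""
    with open(doc, "a") as f:
        f.write(f"{text}\n\n")


def exec_block(doc, shell, code):
    """Run code with the given shell, then append the code and its captured output."""
    result = subprocess.run([shell, "-c", code], capture_output=True, text=True)
    with open(doc, "a") as f:
        f.write(f"{FENCE}{shell}\n{code}\n{FENCE}\n\n")
        f.write(f"{FENCE}output\n{result.stdout}{FENCE}\n\n")
    return result.stdout


demo = Path(tempfile.mkdtemp()) / "demo.md"
init(demo, "How to use echo")
note(demo, "A command followed by its real output:")
exec_block(demo, "sh", "echo hello")
print(demo.read_text())
```

&lt;p&gt;The important property is that the output block is written by the tool from what the command really printed, not typed in by whoever is building the document.&lt;/p&gt;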
&lt;p&gt;I packaged it up with my &lt;a href="https://github.com/simonw/go-to-wheel"&gt;go-to-wheel&lt;/a&gt; tool which means you can run it without even installing it first like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;uvx showboat --help&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;That &lt;code&gt;--help&lt;/code&gt; command is really important: it's designed to provide a coding agent with &lt;em&gt;everything it needs to know&lt;/em&gt; in order to use the tool. Here's &lt;a href="https://github.com/simonw/showboat/blob/main/help.txt"&gt;that help text in full&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;This means you can pop open Claude Code and tell it:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;Run "uvx showboat --help" and then use showboat to create a demo.md document describing the feature you just built&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;And that's it! The &lt;code&gt;--help&lt;/code&gt; text acts &lt;a href="https://simonwillison.net/2025/Oct/16/claude-skills/"&gt;a bit like a Skill&lt;/a&gt;. Your agent can read the help text and use every feature of Showboat to create a document that demonstrates whatever it is you need demonstrated.&lt;/p&gt;
&lt;p&gt;Here's a fun trick: if you set Claude off to build a Showboat document you can pop that open in VS Code and watch the preview pane update in real time as the agent runs through the demo. It's a bit like having your coworker talk you through their latest work in a screensharing session.&lt;/p&gt;
&lt;p&gt;And finally, some examples. Here are documents I had Claude create using Showboat to help demonstrate features I was working on in other projects:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/simonw/showboat-demos/blob/main/shot-scraper/README.md"&gt;shot-scraper: A Comprehensive Demo&lt;/a&gt; runs through the full suite of features of my &lt;a href="https://shot-scraper.datasette.io/"&gt;shot-scraper&lt;/a&gt; browser automation tool, mainly to exercise the &lt;code&gt;showboat image&lt;/code&gt; command.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/simonw/sqlite-history-json/blob/main/demos/cli.md"&gt;sqlite-history-json CLI demo&lt;/a&gt; demonstrates the CLI feature I added to my new &lt;a href="https://github.com/simonw/sqlite-history-json"&gt;sqlite-history-json&lt;/a&gt; Python library.
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href="https://github.com/simonw/sqlite-history-json/blob/main/demos/row-state-sql.md"&gt;row-state-sql CLI Demo&lt;/a&gt; shows a new &lt;code&gt;row-state-sql&lt;/code&gt; command I added to that same project.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href="https://github.com/simonw/sqlite-history-json/blob/main/demos/change-grouping.md"&gt;Change grouping with Notes&lt;/a&gt; demonstrates another feature where groups of changes within the same transaction can have a note attached to them.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/simonw/research/blob/main/libkrun-go-cli-tool/demo.md"&gt;krunsh: Pipe Shell Commands to an Ephemeral libkrun MicroVM&lt;/a&gt; is a particularly convoluted example where I managed to get Claude Code for web to run a libkrun microVM inside a QEMU emulated Linux environment inside the Claude gVisor sandbox.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I've now used Showboat often enough that I've convinced myself of its utility.&lt;/p&gt;
&lt;p&gt;(I've also seen agents cheat! Since the demo file is Markdown the agent will sometimes edit that file directly rather than using Showboat, which could result in command outputs that don't reflect what actually happened. Here's &lt;a href="https://github.com/simonw/showboat/issues/12"&gt;an issue about that&lt;/a&gt;.)&lt;/p&gt;
&lt;h4 id="rodney-cli-browser-automation-designed-to-work-with-showboat"&gt;Rodney: CLI browser automation designed to work with Showboat&lt;/h4&gt;
&lt;p&gt;Many of the projects I work on involve web interfaces. Agents often build entirely new pages for these, and I want to see those represented in the demos.&lt;/p&gt;
&lt;p&gt;Showboat's image feature was designed to allow agents to capture screenshots as part of their demos, originally using my &lt;a href="https://shot-scraper.datasette.io/"&gt;shot-scraper tool&lt;/a&gt; or &lt;a href="https://www.playwright.dev"&gt;Playwright&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The Showboat format benefits from CLI utilities. I went looking for good options for managing a multi-turn browser session from a CLI and came up short, so I decided to try building something new.&lt;/p&gt;
&lt;p&gt;Claude Opus 4.6 pointed me to the &lt;a href="https://github.com/go-rod/rod"&gt;Rod&lt;/a&gt; Go library for interacting with the Chrome DevTools protocol. It's fantastic - it provides a comprehensive wrapper across basically everything you can do with automated Chrome, all in a self-contained library that compiles to a few MBs.&lt;/p&gt;
&lt;p&gt;All Rod was missing was a CLI.&lt;/p&gt;
&lt;p&gt;I built the first version &lt;a href="https://github.com/simonw/research/blob/main/go-rod-cli/README.md"&gt;as an asynchronous report prototype&lt;/a&gt;, which convinced me it was worth spinning out into its own project.&lt;/p&gt;
&lt;p&gt;I called it Rodney as a nod to the Rod library it builds on and a reference to &lt;a href="https://en.wikipedia.org/wiki/Only_Fools_and_Horses"&gt;Only Fools and Horses&lt;/a&gt; - and because the package name was available on PyPI.&lt;/p&gt;
&lt;p&gt;You can run Rodney using &lt;code&gt;uvx rodney&lt;/code&gt; or install it like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;uv tool install rodney&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;(Or grab a Go binary &lt;a href="https://github.com/simonw/rodney/releases/"&gt;from the releases page&lt;/a&gt;.)&lt;/p&gt;
&lt;p&gt;Here's a simple example session:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;rodney start &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; starts Chrome in the background&lt;/span&gt;
rodney open https://datasette.io/
rodney js &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;Array.from(document.links).map(el =&amp;gt; el.href).slice(0, 5)&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;
rodney click &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;a[href="/for"]&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;
rodney js location.href
rodney js document.title
rodney screenshot datasette-for-page.png
rodney stop&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Here's what that looks like in the terminal:&lt;/p&gt;
&lt;p&gt;&lt;img alt=";~ % rodney start
Chrome started (PID 91462)
Debug URL: ws://127.0.0.1:64623/devtools/browser/cac6988e-8153-483b-80b9-1b75c611868d
~ % rodney open https://datasette.io/
Datasette: An open source multi-tool for exploring and publishing data
~ % rodney js 'Array.from(document.links).map(el =&amp;gt; el.href).slice(0, 5)'
[
&amp;quot;https://datasette.io/for&amp;quot;,
&amp;quot;https://docs.datasette.io/en/stable/&amp;quot;,
&amp;quot;https://datasette.io/tutorials&amp;quot;,
&amp;quot;https://datasette.io/examples&amp;quot;,
&amp;quot;https://datasette.io/plugins&amp;quot;
]
~ % rodney click 'a[href=&amp;quot;/for&amp;quot;]'
Clicked
~ % rodney js location.href
https://datasette.io/for
~ % rodney js document.title
Use cases for Datasette
~ % rodney screenshot datasette-for-page.png
datasette-for-page.png
~ % rodney stop
Chrome stopped" src="https://static.simonwillison.net/static/2026/rodney-demo.jpg" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;As with Showboat, this tool is not designed to be used by humans! The goal is for coding agents to be able to run &lt;code&gt;rodney --help&lt;/code&gt; and see everything they need to know to start using the tool. You can see &lt;a href="https://github.com/simonw/rodney/blob/main/help.txt"&gt;that help output&lt;/a&gt; in the GitHub repo.&lt;/p&gt;
&lt;p&gt;Here are three demonstrations of Rodney that I created using Showboat:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/simonw/showboat-demos/blob/main/rodney/README.md"&gt;Rodney's original feature set&lt;/a&gt;, including screenshots of pages and executing JavaScript.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/simonw/rodney/blob/main/notes/accessibility-features/README.md"&gt;Rodney's new accessibility testing features&lt;/a&gt;, built during development of those features to show what they could do.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/simonw/showboat-demos/blob/main/datasette-database-page-accessibility-audit/README.md"&gt;Using those features to run a basic accessibility audit of a page&lt;/a&gt;. I was impressed at how well Claude Opus 4.6 responded to the prompt "Use showboat and rodney to perform an accessibility audit of &lt;a href="https://latest.datasette.io/fixtures"&gt;https://latest.datasette.io/fixtures&lt;/a&gt;" - &lt;a href="https://gisthost.github.io/?dce6b2680db4b05c04469ed8f251eb34/index.html"&gt;transcript here&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 id="test-driven-development-helps-but-we-still-need-manual-testing"&gt;Test-driven development helps, but we still need manual testing&lt;/h4&gt;
&lt;p&gt;After being a career-long skeptic of the test-first, maximum test coverage school of software development (I like &lt;a href="https://simonwillison.net/2022/Oct/29/the-perfect-commit/#tests"&gt;tests included&lt;/a&gt; development instead) I've recently come around to test-first processes as a way to force agents to write only the code that's necessary to solve the problem at hand.&lt;/p&gt;
&lt;p&gt;Many of my Python coding agent sessions start the same way:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;Run the existing tests with "uv run pytest". Build using red/green TDD.&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Telling the agents how to run the tests doubles as an indicator that tests on this project exist and matter. Agents will read existing tests before writing their own so having a clean test suite with good patterns makes it more likely they'll write good tests of their own.&lt;/p&gt;
&lt;p&gt;The frontier models all understand that "red/green TDD" means they should write the test first, run it and watch it fail and then write the code to make it pass - it's a convenient shortcut.&lt;/p&gt;
&lt;p&gt;I find this greatly increases the quality of the code and the likelihood that the agent will produce the right thing with the fewest prompts to guide it.&lt;/p&gt;
&lt;p&gt;But anyone who's worked with tests will know that just because the automated tests pass doesn't mean the software actually works! That’s the motivation behind Showboat and Rodney - I never trust any feature until I’ve seen it running with my own eyes.&lt;/p&gt;
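&lt;p&gt;If the red/green loop is unfamiliar, here's a minimal sketch of it in plain Python. The &lt;code&gt;slugify&lt;/code&gt; function and the tiny &lt;code&gt;run_test&lt;/code&gt; helper are hypothetical stand-ins; in a real project the agent would run the test suite with pytest instead.&lt;/p&gt;

```python
def run_test(test):
    """Run one test function and report 'red' (failing) or 'green' (passing)."""
    try:
        test()
    except (AssertionError, NotImplementedError):
        return "red"
    return "green"


# Step 1 (red): the test is written while the implementation is still a stub.
def slugify(title):
    raise NotImplementedError("not written yet")


def test_slugify():
    assert slugify("Hello, World!") == "hello-world"


red = run_test(test_slugify)


# Step 2 (green): write only enough code to make that test pass.
def slugify(title):
    cleaned = "".join(c if c.isalnum() else " " for c in title.lower())
    return "-".join(cleaned.split())


green = run_test(test_slugify)
print(red, green)
```

&lt;p&gt;Watching the test fail first is the point: it confirms the test is actually exercising the new code rather than passing by accident.&lt;/p&gt;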
&lt;p&gt;Before building Showboat I'd often add a “manual” testing step to my agent sessions, something like:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;Once the tests pass, start a development server and exercise the new feature using curl&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h4 id="i-built-both-of-these-tools-on-my-phone"&gt;I built both of these tools on my phone&lt;/h4&gt;
&lt;p&gt;Both Showboat and Rodney started life as Claude Code for web projects created via the Claude iPhone app. Most of the ongoing feature work for them happened in the same way.&lt;/p&gt;
&lt;p&gt;I'm still a little startled at how much of my coding work I get done on my phone now, but I'd estimate that the majority of code I ship to GitHub these days was written for me by coding agents driven via that iPhone app.&lt;/p&gt;
&lt;p&gt;I initially designed these two tools for use in asynchronous coding agent environments like Claude Code for the web. So far that's working out really well.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/go"&gt;go&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/testing"&gt;testing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/markdown"&gt;markdown&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-programming"&gt;ai-assisted-programming&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/coding-agents"&gt;coding-agents&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/async-coding-agents"&gt;async-coding-agents&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/showboat"&gt;showboat&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/rodney"&gt;rodney&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="go"/><category term="projects"/><category term="testing"/><category term="markdown"/><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="ai-assisted-programming"/><category term="coding-agents"/><category term="async-coding-agents"/><category term="showboat"/><category term="rodney"/></entry><entry><title>Tips for getting coding agents to write good Python tests</title><link href="https://simonwillison.net/2026/Jan/26/tests/#atom-tag" rel="alternate"/><published>2026-01-26T23:55:29+00:00</published><updated>2026-01-26T23:55:29+00:00</updated><id>https://simonwillison.net/2026/Jan/26/tests/#atom-tag</id><summary type="html">
    &lt;p&gt;Someone &lt;a href="https://news.ycombinator.com/item?id=46765460#46765823"&gt;asked&lt;/a&gt; on Hacker News if I had any tips for getting coding agents to write decent quality tests. Here's what I said:&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;I work in Python which helps a lot because there are a TON of good examples of pytest tests floating around in the training data, including things like usage of fixture libraries for mocking external HTTP APIs and snapshot testing and other neat patterns.&lt;/p&gt;
&lt;p&gt;Or I can say "use pytest-httpx to mock the endpoints" and Claude knows what I mean.&lt;/p&gt;
&lt;p&gt;Keeping an eye on the tests is important. The most common anti-pattern I see is large amounts of duplicated test setup code - which isn't a huge deal, I'm much more tolerant of duplicated logic in tests than I am in implementation, but it's still worth pushing back on.&lt;/p&gt;
&lt;p&gt;"Refactor those tests to use pytest.mark.parametrize" and "extract the common setup into a pytest fixture" work really well there.&lt;/p&gt;
&lt;p&gt;Generally though the best way to get good tests out of a coding agent is to make sure it's working in a project with an existing test suite that uses good patterns. Coding agents pick the existing patterns up without needing any extra prompting at all.&lt;/p&gt;
&lt;p&gt;I find that once a project has clean basic tests the new tests added by the agents tend to match them in quality. It's similar to how working on large projects with a team of other developers works - keeping the code clean means when people look for examples of how to write a test they'll be pointed in the right direction.&lt;/p&gt;
&lt;p&gt;One last tip I use a lot is this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Clone datasette/datasette-enrichments
from GitHub to /tmp and imitate the
testing patterns it uses
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I do this all the time with different existing projects I've written - the quickest way to show an agent how you like something to be done is to have it look at an example.&lt;/p&gt;

    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/testing"&gt;testing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/coding-agents"&gt;coding-agents&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/hacker-news"&gt;hacker-news&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pytest"&gt;pytest&lt;/a&gt;&lt;/p&gt;



</summary><category term="testing"/><category term="coding-agents"/><category term="python"/><category term="generative-ai"/><category term="ai"/><category term="llms"/><category term="hacker-news"/><category term="pytest"/></entry><entry><title>A Software Library with No Code</title><link href="https://simonwillison.net/2026/Jan/10/a-software-library-with-no-code/#atom-tag" rel="alternate"/><published>2026-01-10T23:41:58+00:00</published><updated>2026-01-10T23:41:58+00:00</updated><id>https://simonwillison.net/2026/Jan/10/a-software-library-with-no-code/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.dbreunig.com/2026/01/08/a-software-library-with-no-code.html"&gt;A Software Library with No Code&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Provocative experiment from Drew Breunig, who designed a new library for time formatting ("3 hours ago" kind of thing) called "whenwords" that has no code at all, just a carefully written specification, an AGENTS.md and a collection of conformance tests in a YAML file.&lt;/p&gt;
&lt;p&gt;Pass that to your coding agent of choice, tell it what language you need and it will write it for you on demand!&lt;/p&gt;
&lt;p&gt;This meshes neatly with my recent &lt;a href="https://simonwillison.net/2025/Dec/31/the-year-in-llms/#the-year-of-conformance-suites"&gt;interest in conformance suites&lt;/a&gt;. If you publish good enough language-independent tests it's pretty astonishing how far today's coding agents can take you!
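&lt;p&gt;To show what "language-independent tests" can mean in practice, here's a hedged sketch: the cases are plain data and the runner is a few lines in whatever language you need. The case format, the &lt;code&gt;time_ago&lt;/code&gt; function and the expected strings are all invented for this illustration and are not taken from whenwords. I've used JSON rather than YAML so the sketch only needs the standard library, and it assumes non-negative inputs.&lt;/p&gt;

```python
import json

# Hypothetical conformance cases: plain data that a runner in any language can load.
CASES = json.loads("""
[
  {"seconds": 30, "expected": "just now"},
  {"seconds": 120, "expected": "2 minutes ago"},
  {"seconds": 3600, "expected": "1 hour ago"},
  {"seconds": 10800, "expected": "3 hours ago"}
]
""")


def time_ago(seconds):
    """One possible implementation; an agent could generate the same in any language."""
    if seconds // 60 == 0:
        return "just now"
    for unit, size in (("hour", 3600), ("minute", 60)):
        count = seconds // size
        if count:
            return f"{count} {unit}{'' if count == 1 else 's'} ago"


# The runner: compare every case against the implementation and collect mismatches.
failures = [c for c in CASES if time_ago(c["seconds"]) != c["expected"]]
print(f"{len(CASES) - len(failures)} of {len(CASES)} cases passed")
```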


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/testing"&gt;testing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-programming"&gt;ai-assisted-programming&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/drew-breunig"&gt;drew-breunig&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/coding-agents"&gt;coding-agents&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/conformance-suites"&gt;conformance-suites&lt;/a&gt;&lt;/p&gt;



</summary><category term="testing"/><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="ai-assisted-programming"/><category term="drew-breunig"/><category term="coding-agents"/><category term="conformance-suites"/></entry><entry><title>Quoting D. Richard Hipp</title><link href="https://simonwillison.net/2025/Dec/29/d-richard-hipp/#atom-tag" rel="alternate"/><published>2025-12-29T21:51:49+00:00</published><updated>2025-12-29T21:51:49+00:00</updated><id>https://simonwillison.net/2025/Dec/29/d-richard-hipp/#atom-tag</id><summary type="html">
    &lt;blockquote cite="https://sigmodrecord.org/publications/sigmodRecord/1906/pdfs/06_Profiles_Hipp.pdf"&gt;&lt;p&gt;But once we got that and got this aviation grade testing in place, the number of bugs just dropped to a trickle. Now we still do have bugs but the aviation grade testing allows us to move fast, which is important because in this business you either move fast or you're disrupted. So, we're able to make major changes to the structure of the code that we deliver and be confident that we're not breaking things because we had these intense tests. Probably half the time we spend is actually writing new tests, we're constantly writing new tests. And over the 17-year history, we have amassed a huge suite of tests which we run constantly.&lt;/p&gt;
&lt;p&gt;Other database engines don't do this; don't have this
level of testing. But they're still high quality, I mean, I
noticed in particular, PostgreSQL is a very high-quality database engine, they don't have many bugs. I went to the PostgreSQL and ask them “how do you prevent the bugs”? We talked about this for a while. What I came away with was they've got a very elaborate peer review process, and if they've got code that has worked for 10 years they just don't mess with it, leave it alone, it
works. Whereas we change our code fearlessly, and we have a much smaller team and we don't have the peer review process.&lt;/p&gt;&lt;/blockquote&gt;
&lt;p class="cite"&gt;&amp;mdash; &lt;a href="https://sigmodrecord.org/publications/sigmodRecord/1906/pdfs/06_Profiles_Hipp.pdf"&gt;D. Richard Hipp&lt;/a&gt;, ACM SIGMOD Record, June 2019 (PDF)&lt;/p&gt;

    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/postgresql"&gt;postgresql&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sqlite"&gt;sqlite&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/testing"&gt;testing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/d-richard-hipp"&gt;d-richard-hipp&lt;/a&gt;&lt;/p&gt;



</summary><category term="postgresql"/><category term="sqlite"/><category term="testing"/><category term="d-richard-hipp"/></entry><entry><title>TIL: Subtests in pytest 9.0.0+</title><link href="https://simonwillison.net/2025/Dec/5/til-pytest-subtests/#atom-tag" rel="alternate"/><published>2025-12-05T06:03:29+00:00</published><updated>2025-12-05T06:03:29+00:00</updated><id>https://simonwillison.net/2025/Dec/5/til-pytest-subtests/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://til.simonwillison.net/pytest/subtests"&gt;TIL: Subtests in pytest 9.0.0+&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
I spotted an interesting new feature &lt;a href="https://docs.pytest.org/en/stable/changelog.html#pytest-9-0-0-2025-11-05"&gt;in the release notes for pytest 9.0.0&lt;/a&gt;: &lt;a href="https://docs.pytest.org/en/stable/how-to/subtests.html#subtests"&gt;subtests&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I'm a &lt;em&gt;big&lt;/em&gt; user of the &lt;a href="https://docs.pytest.org/en/stable/example/parametrize.html"&gt;pytest.mark.parametrize&lt;/a&gt; decorator - see &lt;a href="https://simonwillison.net/2018/Jul/28/documentation-unit-tests/"&gt;Documentation unit tests&lt;/a&gt; from 2018 - so I thought it would be interesting to try out subtests and see if they're a useful alternative.&lt;/p&gt;
&lt;p&gt;Short version: this parameterized test:&lt;/p&gt;
&lt;pre&gt;&lt;span class="pl-en"&gt;@&lt;span class="pl-s1"&gt;pytest&lt;/span&gt;.&lt;span class="pl-c1"&gt;mark&lt;/span&gt;.&lt;span class="pl-c1"&gt;parametrize&lt;/span&gt;(&lt;span class="pl-s"&gt;"setting"&lt;/span&gt;, &lt;span class="pl-s1"&gt;app&lt;/span&gt;.&lt;span class="pl-c1"&gt;SETTINGS&lt;/span&gt;)&lt;/span&gt;
&lt;span class="pl-k"&gt;def&lt;/span&gt; &lt;span class="pl-en"&gt;test_settings_are_documented&lt;/span&gt;(&lt;span class="pl-s1"&gt;settings_headings&lt;/span&gt;, &lt;span class="pl-s1"&gt;setting&lt;/span&gt;):
    &lt;span class="pl-k"&gt;assert&lt;/span&gt; &lt;span class="pl-s1"&gt;setting&lt;/span&gt;.&lt;span class="pl-c1"&gt;name&lt;/span&gt; &lt;span class="pl-c1"&gt;in&lt;/span&gt; &lt;span class="pl-s1"&gt;settings_headings&lt;/span&gt;&lt;/pre&gt;
&lt;p&gt;Becomes this using subtests instead:&lt;/p&gt;
&lt;pre&gt;&lt;span class="pl-k"&gt;def&lt;/span&gt; &lt;span class="pl-en"&gt;test_settings_are_documented&lt;/span&gt;(&lt;span class="pl-s1"&gt;settings_headings&lt;/span&gt;, &lt;span class="pl-s1"&gt;subtests&lt;/span&gt;):
    &lt;span class="pl-k"&gt;for&lt;/span&gt; &lt;span class="pl-s1"&gt;setting&lt;/span&gt; &lt;span class="pl-c1"&gt;in&lt;/span&gt; &lt;span class="pl-s1"&gt;app&lt;/span&gt;.&lt;span class="pl-c1"&gt;SETTINGS&lt;/span&gt;:
        &lt;span class="pl-k"&gt;with&lt;/span&gt; &lt;span class="pl-s1"&gt;subtests&lt;/span&gt;.&lt;span class="pl-c1"&gt;test&lt;/span&gt;(&lt;span class="pl-s1"&gt;setting&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-s1"&gt;setting&lt;/span&gt;.&lt;span class="pl-c1"&gt;name&lt;/span&gt;):
            &lt;span class="pl-k"&gt;assert&lt;/span&gt; &lt;span class="pl-s1"&gt;setting&lt;/span&gt;.&lt;span class="pl-c1"&gt;name&lt;/span&gt; &lt;span class="pl-c1"&gt;in&lt;/span&gt; &lt;span class="pl-s1"&gt;settings_headings&lt;/span&gt;&lt;/pre&gt;
&lt;p&gt;Why is this better? Two reasons:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;It appears to run a bit faster&lt;/li&gt;
&lt;li&gt;Subtests can be created programmatically after running some setup code first&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;I &lt;a href="https://gistpreview.github.io/?0487e5bb12bcbed850790a6324788e1b"&gt;had Claude Code&lt;/a&gt; port &lt;a href="https://github.com/simonw/datasette/pull/2609/files"&gt;several tests&lt;/a&gt; to the new pattern. I like it.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/testing"&gt;testing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pytest"&gt;pytest&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/til"&gt;til&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-programming"&gt;ai-assisted-programming&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/coding-agents"&gt;coding-agents&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude-code"&gt;claude-code&lt;/a&gt;&lt;/p&gt;



</summary><category term="python"/><category term="testing"/><category term="ai"/><category term="pytest"/><category term="til"/><category term="generative-ai"/><category term="llms"/><category term="ai-assisted-programming"/><category term="coding-agents"/><category term="claude-code"/></entry><entry><title>Vibing a Non-Trivial Ghostty Feature</title><link href="https://simonwillison.net/2025/Oct/11/vibing-a-non-trivial-ghostty-feature/#atom-tag" rel="alternate"/><published>2025-10-11T16:33:59+00:00</published><updated>2025-10-11T16:33:59+00:00</updated><id>https://simonwillison.net/2025/Oct/11/vibing-a-non-trivial-ghostty-feature/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://mitchellh.com/writing/non-trivial-vibing"&gt;Vibing a Non-Trivial Ghostty Feature&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Mitchell Hashimoto provides a comprehensive answer to the &lt;em&gt;frequent&lt;/em&gt; demand for a detailed description of shipping a non-trivial production feature to an existing project using AI-assistance. In this case it's a slick unobtrusive auto-update UI for his &lt;a href="https://ghostty.org/"&gt;Ghostty&lt;/a&gt; terminal emulator, written in Swift.&lt;/p&gt;
&lt;p&gt;Mitchell shares full transcripts of the 16 coding sessions he carried out using &lt;a href="https://ampcode.com/"&gt;Amp Code&lt;/a&gt; across 2 days and around 8 hours of computer time, at a token cost of $15.98.&lt;/p&gt;
&lt;p&gt;Amp has the nicest shared transcript feature of any of the coding agent tools, as seen &lt;a href="https://ampcode.com/threads/T-9fc3eb88-5aa2-45e4-8f6d-03697f53102d"&gt;in this example&lt;/a&gt;. I'd love to see Claude Code and Codex CLI and Gemini CLI and friends imitate this.&lt;/p&gt;
&lt;p&gt;There are plenty of useful tips in here. I like this note about the importance of a cleanup step:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The cleanup step is really important. To cleanup effectively you have to have a pretty good understanding of the code, so this forces me to not blindly accept AI-written code. Subsequently, better organized and documented code helps future agentic sessions perform better.&lt;/p&gt;
&lt;p&gt;I sometimes tongue-in-cheek refer to this as the "anti-slop session".&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;And this on how writing some code by hand can sometimes put the agent on the right track:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;I spent some time manually restructured the view model. This involved switching to a tagged union rather than the struct with a bunch of optionals. I renamed some types, moved stuff around.&lt;/p&gt;
&lt;p&gt;I knew from experience that this small bit of manual work in the middle would set the agents up for success in future sessions for both the frontend and backend. After completing it, I continued with a marathon set of cleanup sessions.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Here's one of those refactoring prompts:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;Turn each @macos/Sources/Features/Update/UpdatePopoverView.swift case into a dedicated fileprivate Swift view that takes the typed value as its parameter so that we can remove the guards.&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Mitchell advises ending every session with a prompt like this one, asking the agent about any obvious omissions:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;Are there any other improvements you can see to be made with the @macos/Sources/Features/Update feature? Don't write any code. Consult the oracle. Consider parts of the code that can also get more unit tests added.&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;("Consult the oracle" is an Amp-specific pattern for running a task through a more expensive, more capable model.)&lt;/p&gt;
&lt;p&gt;Is this all worthwhile? Mitchell thinks so:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Many people on the internet argue whether AI enables you to work faster or not. In this case, I think I shipped this faster than I would have if I had done it all myself, in particular because iterating on minor SwiftUI styling is so tedious and time consuming for me personally and AI does it so well.&lt;/p&gt;
&lt;p&gt;I think the faster/slower argument for me personally is missing the thing I like the most: the AI can work for me while I step away to do other things.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Here's &lt;a href="https://github.com/ghostty-org/ghostty/pull/9116/files"&gt;the resulting PR&lt;/a&gt;, which touches 21 files.&lt;/p&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/mitchellh/status/1977016133409820684"&gt;@mitchellh&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/testing"&gt;testing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-programming"&gt;ai-assisted-programming&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/mitchell-hashimoto"&gt;mitchell-hashimoto&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/coding-agents"&gt;coding-agents&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/swift"&gt;swift&lt;/a&gt;&lt;/p&gt;



</summary><category term="testing"/><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="ai-assisted-programming"/><category term="mitchell-hashimoto"/><category term="coding-agents"/><category term="swift"/></entry><entry><title>TIL: Testing different Python versions with uv with-editable and uv-test</title><link href="https://simonwillison.net/2025/Oct/9/uv-test/#atom-tag" rel="alternate"/><published>2025-10-09T03:37:06+00:00</published><updated>2025-10-09T03:37:06+00:00</updated><id>https://simonwillison.net/2025/Oct/9/uv-test/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://til.simonwillison.net/python/uv-tests"&gt;TIL: Testing different Python versions with uv with-editable and uv-test&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
While tinkering with upgrading various projects to handle Python 3.14 I finally figured out a universal &lt;code&gt;uv&lt;/code&gt; recipe for running the tests for the current project in any specified version of Python:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;uv run --python 3.14 --isolated --with-editable '.[test]' pytest
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This should work in any directory with a &lt;code&gt;pyproject.toml&lt;/code&gt; (or even a &lt;code&gt;setup.py&lt;/code&gt;) that defines a &lt;code&gt;test&lt;/code&gt; set of extra dependencies and uses &lt;code&gt;pytest&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;--with-editable '.[test]'&lt;/code&gt; bit ensures that changes you make to that directory will be picked up by future test runs. The &lt;code&gt;--isolated&lt;/code&gt; flag ensures no other environments will affect your test run.&lt;/p&gt;
&lt;p&gt;I like this pattern so much I built a little shell script that uses it, &lt;a href="https://til.simonwillison.net/python/uv-tests#user-content-uv-test"&gt;shown here&lt;/a&gt;. Now I can change to any Python project directory and run:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;uv-test
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Or for a different Python version:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;uv-test -p 3.11
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I can pass additional &lt;code&gt;pytest&lt;/code&gt; options too:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;uv-test -p 3.11 -k permissions
&lt;/code&gt;&lt;/pre&gt;
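&lt;p&gt;To make the argument handling concrete, here is a minimal Python sketch of how a wrapper like this could translate its arguments into the &lt;code&gt;uv run&lt;/code&gt; recipe above. It is not the shell script from the TIL: the function name &lt;code&gt;build_uv_test_command&lt;/code&gt; and the default version of 3.14 are assumptions made for illustration.&lt;/p&gt;
&lt;pre&gt;
```python
def build_uv_test_command(args, default_python="3.14"):
    """Translate uv-test style arguments into a `uv run` argument list."""
    python_version = default_python  # assumed default, not taken from the real script
    pytest_args = list(args)
    # A leading "-p VERSION" picks the interpreter; everything else goes to pytest.
    if len(pytest_args) >= 2 and pytest_args[0] == "-p":
        python_version = pytest_args[1]
        pytest_args = pytest_args[2:]
    return [
        "uv", "run",
        "--python", python_version,
        "--isolated",
        "--with-editable", ".[test]",
        "pytest",
        *pytest_args,
    ]
```
&lt;/pre&gt;
&lt;p&gt;The resulting list could be handed to &lt;code&gt;subprocess.run()&lt;/code&gt; as-is, which avoids any shell quoting issues around &lt;code&gt;.[test]&lt;/code&gt;.&lt;/p&gt;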


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/testing"&gt;testing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pytest"&gt;pytest&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/til"&gt;til&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/uv"&gt;uv&lt;/a&gt;&lt;/p&gt;



</summary><category term="python"/><category term="testing"/><category term="pytest"/><category term="til"/><category term="uv"/></entry><entry><title>Hypothesis is now thread-safe</title><link href="https://simonwillison.net/2025/Aug/8/hypothesis/#atom-tag" rel="alternate"/><published>2025-08-08T22:08:55+00:00</published><updated>2025-08-08T22:08:55+00:00</updated><id>https://simonwillison.net/2025/Aug/8/hypothesis/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://hypothesis.works/articles/thread-safe/"&gt;Hypothesis is now thread-safe&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Hypothesis is a property-based testing library for Python. It lets you write tests like this one:&lt;/p&gt;
&lt;pre&gt;&lt;span class="pl-k"&gt;from&lt;/span&gt; &lt;span class="pl-s1"&gt;hypothesis&lt;/span&gt; &lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-s1"&gt;given&lt;/span&gt;, &lt;span class="pl-s1"&gt;strategies&lt;/span&gt; &lt;span class="pl-k"&gt;as&lt;/span&gt; &lt;span class="pl-s1"&gt;st&lt;/span&gt;

&lt;span class="pl-en"&gt;@&lt;span class="pl-en"&gt;given&lt;/span&gt;(&lt;span class="pl-s1"&gt;st&lt;/span&gt;.&lt;span class="pl-c1"&gt;lists&lt;/span&gt;(&lt;span class="pl-s1"&gt;st&lt;/span&gt;.&lt;span class="pl-c1"&gt;integers&lt;/span&gt;()))&lt;/span&gt;
&lt;span class="pl-k"&gt;def&lt;/span&gt; &lt;span class="pl-en"&gt;test_matches_builtin&lt;/span&gt;(&lt;span class="pl-s1"&gt;ls&lt;/span&gt;):
    &lt;span class="pl-k"&gt;assert&lt;/span&gt; &lt;span class="pl-en"&gt;sorted&lt;/span&gt;(&lt;span class="pl-s1"&gt;ls&lt;/span&gt;) &lt;span class="pl-c1"&gt;==&lt;/span&gt; &lt;span class="pl-en"&gt;my_sort&lt;/span&gt;(&lt;span class="pl-s1"&gt;ls&lt;/span&gt;)&lt;/pre&gt;

&lt;p&gt;This will automatically create a collection of test fixtures that exercise a large array of expected list and integer shapes. Here's &lt;a href="https://gist.github.com/simonw/74014071af1553921e0307efd2280168"&gt;a Gist&lt;/a&gt; demonstrating the tests the above code will run, which include things like:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;[]
[0]
[-62, 13194]
[44, -19562, 44, -12803, -24012]
[-7531692443171623764, -109369043848442345045856489093298649615]
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Hypothesis contributor Liam DeVoe was recently sponsored by Quansight to add thread safety to Hypothesis, which has become important recently due to Python free threading:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;While we of course would always have loved for Hypothesis to be thread-safe, thread-safety has historically not been a priority, because running Hypothesis tests under multiple threads is not something we see often.&lt;/p&gt;
&lt;p&gt;That changed recently. Python---as both a language, and a community---is gearing up to &lt;a href="https://peps.python.org/pep-0703/"&gt;remove the global interpreter lock (GIL)&lt;/a&gt;, in a build called &lt;a href="https://docs.python.org/3/howto/free-threading-python.html"&gt;free threading&lt;/a&gt;. Python packages, especially those that interact with the C API, will need to test that their code still works under the free threaded build. A great way to do this is to run each test in the suite in two or more threads simultaneously. [...]&lt;/p&gt;
&lt;p&gt;Nathan mentioned that because Hypothesis is not thread-safe, Hypothesis tests in community packages have to be skipped when testing free threaded compatibility, which removes a substantial battery of coverage.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Now that Hypothesis is thread-safe, another blocker to increased Python ecosystem support for free threading has been removed!&lt;/p&gt;
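&lt;p&gt;The "run each test in two or more threads simultaneously" idea from the quote can be sketched with nothing but the standard library. This is a hand-rolled illustration, not how Hypothesis or any particular pytest plugin does it; &lt;code&gt;run_in_threads&lt;/code&gt; is a hypothetical helper name.&lt;/p&gt;
&lt;pre&gt;
```python
import threading


def run_in_threads(test_func, n_threads=2):
    """Run test_func simultaneously in n_threads threads; re-raise the first failure."""
    barrier = threading.Barrier(n_threads)  # line the threads up so they start together
    errors = []

    def worker():
        barrier.wait()
        try:
            test_func()
        except BaseException as exc:  # includes AssertionError from failed asserts
            errors.append(exc)

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    if errors:
        raise errors[0]
```
&lt;/pre&gt;
&lt;p&gt;A test that mutates shared state without locking has a chance of failing under a helper like this on a free-threaded build, which is the kind of bug this style of testing is meant to surface.&lt;/p&gt;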

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://lobste.rs/s/zrbpds/hypothesis_is_now_thread_safe"&gt;lobste.rs&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/gil"&gt;gil&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/testing"&gt;testing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/threads"&gt;threads&lt;/a&gt;&lt;/p&gt;



</summary><category term="gil"/><category term="python"/><category term="testing"/><category term="threads"/></entry><entry><title>PR #537: Fix Markdown in og descriptions</title><link href="https://simonwillison.net/2025/Jun/3/openai-codex-pr/#atom-tag" rel="alternate"/><published>2025-06-03T23:58:34+00:00</published><updated>2025-06-03T23:58:34+00:00</updated><id>https://simonwillison.net/2025/Jun/3/openai-codex-pr/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/simonw/simonwillisonblog/pull/537"&gt;PR #537: Fix Markdown in og descriptions&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Since &lt;a href="https://openai.com/index/introducing-codex/"&gt;OpenAI Codex&lt;/a&gt; is now available to us ChatGPT Plus subscribers I decided to try it out against my blog.&lt;/p&gt;
&lt;p&gt;It's a very nice implementation of the GitHub-connected coding "agent" pattern, as also seen in Google's &lt;a href="https://jules.google/"&gt;Jules&lt;/a&gt; and Microsoft's &lt;a href="https://github.blog/changelog/2025-05-19-github-copilot-coding-agent-in-public-preview/"&gt;Copilot Coding Agent&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;First I had to configure an environment for it. My Django blog uses PostgreSQL which isn't part of the &lt;a href="https://github.com/openai/codex-universal"&gt;default Codex container&lt;/a&gt;, so I had Claude Sonnet 4 &lt;a href="https://claude.ai/share/a5ce65c2-a9a4-4ae7-b645-71bd9fd6ea2c"&gt;help me&lt;/a&gt; come up with a startup recipe to get PostgreSQL working.&lt;/p&gt;
&lt;p&gt;I attached my &lt;a href="https://github.com/simonw/simonwillisonblog"&gt;simonw/simonwillisonblog&lt;/a&gt; GitHub repo and used the following as the "setup script" for the environment:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Install PostgreSQL
apt-get update &amp;amp;&amp;amp; apt-get install -y postgresql postgresql-contrib

# Start PostgreSQL service
service postgresql start

# Create a test database and user
sudo -u postgres createdb simonwillisonblog
sudo -u postgres psql -c "CREATE USER testuser WITH PASSWORD 'testpass';"
sudo -u postgres psql -c "GRANT ALL PRIVILEGES ON DATABASE simonwillisonblog TO testuser;"
sudo -u postgres psql -c "ALTER USER testuser CREATEDB;"

pip install -r requirements.txt
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I left "Agent internet access" off for reasons &lt;a href="https://simonwillison.net/2025/Jun/3/codex-agent-internet-access/"&gt;described previously&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Then I prompted Codex with the following (after one previous experimental task to check that it could run my tests):&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Notes and blogmarks can both use Markdown.&lt;/p&gt;
&lt;p&gt;They serve &lt;code&gt;meta property="og:description" content="&lt;/code&gt; tags on the page, but those tags include that raw Markdown which looks bad on social media previews.&lt;/p&gt;
&lt;p&gt;Fix it so they instead use just the text with markdown stripped - so probably render it to HTML and then strip the HTML tags.&lt;/p&gt;
&lt;p&gt;Include passing tests.&lt;/p&gt;
&lt;p&gt;Try to run the tests, the postgresql details are:&lt;/p&gt;
&lt;p&gt;database = simonwillisonblog
username = testuser
password = testpass&lt;/p&gt;
&lt;p&gt;Put those in the DATABASE_URL environment variable.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I left it to churn away for a few minutes (4m12s, to be precise) and &lt;a href="https://chatgpt.com/s/cd_683f8b81657881919a8d1ce71978a2df"&gt;it came back&lt;/a&gt; with a fix that edited two templates and added one more (passing) test. Here's &lt;a href="https://github.com/simonw/simonwillisonblog/pull/537/files"&gt;that change in full&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;And sure enough, the social media cards for my posts now look like this - no visible Markdown any more:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Screenshot of a web browser showing a blog post preview card on Bluesky. The URL in the address bar reads &amp;quot;https://simonwillison.net/2025/Jun/3/pr-537-fix-markdown-in-og-descriptions/&amp;quot;. The preview card shows the title &amp;quot;PR #537: Fix Markdown in og descriptions&amp;quot; and begins with the text &amp;quot;Since OpenAI Codex is now available to us ChatGPT Plus subscribers I decided to try it out against my blog. It's a very nice implementation of the GitHub-connected coding&amp;quot;. The domain &amp;quot;simonwillison.net&amp;quot; appears at the bottom of the card." src="https://static.simonwillison.net/static/2025/codex-fix.jpg" /&gt;
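&lt;p&gt;The "render it to HTML and then strip the HTML tags" approach from the prompt can be illustrated with the standard library. This sketch covers only the second step and assumes the Markdown has already been rendered to HTML by some other library; it is not the code from the PR, and the names &lt;code&gt;strip_tags&lt;/code&gt; and &lt;code&gt;_TextExtractor&lt;/code&gt; are made up for this example.&lt;/p&gt;
&lt;pre&gt;
```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect only the text nodes of an HTML fragment."""

    def __init__(self):
        # convert_charrefs=True turns entities into plain characters for us.
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_tags(html_fragment):
    """Return the text of html_fragment with tags removed and whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(html_fragment)
    parser.close()  # flush any text still buffered by the parser
    return " ".join("".join(parser.chunks).split())
```
&lt;/pre&gt;
&lt;p&gt;Collapsing whitespace at the end keeps paragraph breaks in the rendered HTML from turning into stray newlines inside the &lt;code&gt;og:description&lt;/code&gt; attribute.&lt;/p&gt;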


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/django"&gt;django&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/github"&gt;github&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/postgresql"&gt;postgresql&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/testing"&gt;testing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openai"&gt;openai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/chatgpt"&gt;chatgpt&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-programming"&gt;ai-assisted-programming&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-agents"&gt;ai-agents&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/coding-agents"&gt;coding-agents&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/async-coding-agents"&gt;async-coding-agents&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/jules"&gt;jules&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/codex"&gt;codex&lt;/a&gt;&lt;/p&gt;



</summary><category term="django"/><category term="github"/><category term="postgresql"/><category term="testing"/><category term="ai"/><category term="openai"/><category term="generative-ai"/><category term="chatgpt"/><category term="llms"/><category term="ai-assisted-programming"/><category term="ai-agents"/><category term="coding-agents"/><category term="async-coding-agents"/><category term="jules"/><category term="codex"/></entry><entry><title>AI-assisted development needs automated tests</title><link href="https://simonwillison.net/2025/May/28/automated-tests/#atom-tag" rel="alternate"/><published>2025-05-28T21:16:32+00:00</published><updated>2025-05-28T21:16:32+00:00</updated><id>https://simonwillison.net/2025/May/28/automated-tests/#atom-tag</id><summary type="html">
    &lt;p&gt;I wonder if one of the reasons I'm finding LLMs so much more useful for coding than a lot of people that I see in online discussions is that effectively &lt;em&gt;all&lt;/em&gt; of the code I work on has automated tests.&lt;/p&gt;
&lt;p&gt;I've been trying to stay true to the idea of a &lt;a href="https://simonwillison.net/2022/Oct/29/the-perfect-commit/"&gt;Perfect Commit&lt;/a&gt; - one that bundles the implementation, tests and documentation in a single unit - for over five years now. As a result almost every piece of (non &lt;a href="https://simonwillison.net/tags/vibe-coding/"&gt;vibe-coding&lt;/a&gt;) code I work on has pretty comprehensive test coverage.&lt;/p&gt;
&lt;p&gt;This &lt;em&gt;massively&lt;/em&gt; derisks my use of LLMs. If an LLM writes weird, convoluted code that solves my problem I can prove that it works with tests - and then have it refactor the code until it looks good to me, keeping the tests green the whole time.&lt;/p&gt;
&lt;p&gt;LLMs help write the tests, too. I finally have a 24/7 pair programmer who can remember how to use &lt;a href="https://docs.python.org/3/library/unittest.mock.html"&gt;unittest.mock&lt;/a&gt;!&lt;/p&gt;
&lt;p&gt;Next time someone complains that they've found LLMs to be more of a hindrance than a help in their programming work, I'm going to try to remember to ask after the health of their test suite.&lt;/p&gt;

    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/vibe-coding"&gt;vibe-coding&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/testing"&gt;testing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-programming"&gt;ai-assisted-programming&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;&lt;/p&gt;



</summary><category term="vibe-coding"/><category term="testing"/><category term="ai-assisted-programming"/><category term="generative-ai"/><category term="ai"/><category term="llms"/></entry><entry><title>No docs, no bugs</title><link href="https://simonwillison.net/2025/May/22/no-docs-no-bugs/#atom-tag" rel="alternate"/><published>2025-05-22T01:55:31+00:00</published><updated>2025-05-22T01:55:31+00:00</updated><id>https://simonwillison.net/2025/May/22/no-docs-no-bugs/#atom-tag</id><summary type="html">
    &lt;p&gt;&lt;strong&gt;If your library doesn't have any documentation, it can't have any bugs.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Documentation specifies what your code is supposed to do. Your tests specify what it actually does.&lt;/p&gt;
&lt;p&gt;Bugs exist when your test-enforced implementation fails to match the behavior described in your documentation. Without documentation, a bug is just undefined behavior.&lt;/p&gt;
&lt;p&gt;If you aim to follow &lt;a href="https://semver.org/"&gt;semantic versioning&lt;/a&gt; you bump your major version when you release a backwards incompatible change. Such changes cannot exist if your code is not comprehensively documented!&lt;/p&gt;
&lt;p&gt;&lt;small&gt;Inspired by a half-remembered conversation I had with &lt;a href="https://movieos.org/"&gt;Tom Insam&lt;/a&gt; many years ago. &lt;/small&gt;&lt;/p&gt;

    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/testing"&gt;testing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/semantic-versioning"&gt;semantic-versioning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/documentation"&gt;documentation&lt;/a&gt;&lt;/p&gt;



</summary><category term="testing"/><category term="semantic-versioning"/><category term="documentation"/></entry><entry><title>Quoting Evan Miller</title><link href="https://simonwillison.net/2025/Mar/13/evan-miller/#atom-tag" rel="alternate"/><published>2025-03-13T22:50:33+00:00</published><updated>2025-03-13T22:50:33+00:00</updated><id>https://simonwillison.net/2025/Mar/13/evan-miller/#atom-tag</id><summary type="html">
    &lt;blockquote cite="https://www.evanmiller.org/functional-tests-as-a-tree-of-continuations.html"&gt;&lt;p&gt;One of the most essential practices for maintaining the long-term quality of computer code is to write automated tests that ensure the program continues to act as expected, even when other people (including your future self) muck with it.&lt;/p&gt;&lt;/blockquote&gt;
&lt;p class="cite"&gt;&amp;mdash; &lt;a href="https://www.evanmiller.org/functional-tests-as-a-tree-of-continuations.html"&gt;Evan Miller&lt;/a&gt;&lt;/p&gt;

    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/testing"&gt;testing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/evan-miller"&gt;evan-miller&lt;/a&gt;&lt;/p&gt;



</summary><category term="testing"/><category term="evan-miller"/></entry><entry><title>Smoke test your Django admin site</title><link href="https://simonwillison.net/2025/Mar/13/smoke-test-your-django-admin/#atom-tag" rel="alternate"/><published>2025-03-13T15:02:09+00:00</published><updated>2025-03-13T15:02:09+00:00</updated><id>https://simonwillison.net/2025/Mar/13/smoke-test-your-django-admin/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://jmduke.com/posts/post/django-admin-changelist-test/"&gt;Smoke test your Django admin site&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Justin Duke demonstrates a neat pattern for running simple tests against your internal Django admin site: introspect every admin route via &lt;code&gt;django.urls.get_resolver()&lt;/code&gt; and loop through them with &lt;code&gt;@pytest.mark.parametrize&lt;/code&gt; to check they all return a 200 HTTP status code.&lt;/p&gt;
&lt;p&gt;This catches simple mistakes with the admin configuration that trigger exceptions that might otherwise go undetected.&lt;/p&gt;
&lt;p&gt;I rarely write automated tests against my own admin sites and often feel guilty about it. I wrote &lt;a href="https://til.simonwillison.net/django/testing-django-admin-with-pytest"&gt;some notes&lt;/a&gt; on testing it with &lt;a href="https://pytest-django.readthedocs.io/en/latest/helpers.html#fixtures"&gt;pytest-django fixtures&lt;/a&gt; a few years ago.&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/django"&gt;django&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/django-admin"&gt;django-admin&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/testing"&gt;testing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pytest"&gt;pytest&lt;/a&gt;&lt;/p&gt;



</summary><category term="django"/><category term="django-admin"/><category term="python"/><category term="testing"/><category term="pytest"/></entry><entry><title>Quoting Catherine Wu</title><link href="https://simonwillison.net/2025/Feb/24/catherine-wu/#atom-tag" rel="alternate"/><published>2025-02-24T23:48:37+00:00</published><updated>2025-02-24T23:48:37+00:00</updated><id>https://simonwillison.net/2025/Feb/24/catherine-wu/#atom-tag</id><summary type="html">
    &lt;blockquote cite="https://news.ycombinator.com/item?id=43163011#43164561"&gt;&lt;p&gt;We find that Claude is really good at test driven development, so we often ask Claude to write tests first and then ask Claude to iterate against the tests.&lt;/p&gt;&lt;/blockquote&gt;
&lt;p class="cite"&gt;&amp;mdash; &lt;a href="https://news.ycombinator.com/item?id=43163011#43164561"&gt;Catherine Wu&lt;/a&gt;, Anthropic&lt;/p&gt;

    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/tdd"&gt;tdd&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/testing"&gt;testing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-programming"&gt;ai-assisted-programming&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/anthropic"&gt;anthropic&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude"&gt;claude&lt;/a&gt;&lt;/p&gt;



</summary><category term="tdd"/><category term="testing"/><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="ai-assisted-programming"/><category term="anthropic"/><category term="claude"/></entry><entry><title>My Approach to Building Large Technical Projects</title><link href="https://simonwillison.net/2024/Dec/28/my-approach-to-building-large-technical-projects/#atom-tag" rel="alternate"/><published>2024-12-28T14:54:46+00:00</published><updated>2024-12-28T14:54:46+00:00</updated><id>https://simonwillison.net/2024/Dec/28/my-approach-to-building-large-technical-projects/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://mitchellh.com/writing/building-large-technical-projects"&gt;My Approach to Building Large Technical Projects&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Mitchell Hashimoto wrote this piece about taking on large projects back in June 2023. The project he described in the post is a terminal emulator written in Zig called &lt;a href="https://ghostty.org/"&gt;Ghostty&lt;/a&gt; which just reached its &lt;a href="https://mitchellh.com/writing/ghostty-1-0-reflection"&gt;1.0 release&lt;/a&gt;.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;I've learned that when I break down my large tasks in chunks that result in seeing tangible forward progress, I tend to finish my work and retain my excitement throughout the project. People are all motivated and driven in different ways, so this may not work for you, but as a broad generalization I've not found an engineer who doesn't get excited by a good demo. And the goal is to always give yourself a good demo.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;For backend-heavy projects the lack of an initial UI is a challenge here, so Mitchell advocates for early automated tests as a way to start exercising code and seeing progress right from the start. Don't let tests get in the way of demos though:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;No matter what I'm working on, I try to build one or two demos per week intermixed with automated test feedback as explained in the previous section.&lt;/p&gt;
&lt;p&gt;Building a demo also provides you with invaluable product feedback. You can quickly intuit whether something &lt;em&gt;feels good&lt;/em&gt;, even if it isn't fully functional.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;For more on the development of Ghostty see &lt;a href="https://mitchellh.com/writing/ghostty-and-useful-zig-patterns"&gt;this talk&lt;/a&gt; Mitchell gave at Zig Showtime last year:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;I want the terminal to be a modern platform for text application development, analogous to the browser being a modern platform for GUI application development (for better or worse).&lt;/p&gt;
&lt;/blockquote&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://bsky.app/profile/vickiboykis.com/post/3l7xplgkifb2p"&gt;@vickiboykis.com&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/open-source"&gt;open-source&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/software-engineering"&gt;software-engineering&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/testing"&gt;testing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/zig"&gt;zig&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/mitchell-hashimoto"&gt;mitchell-hashimoto&lt;/a&gt;&lt;/p&gt;



</summary><category term="open-source"/><category term="software-engineering"/><category term="testing"/><category term="zig"/><category term="mitchell-hashimoto"/></entry><entry><title>Quoting Carson Gross</title><link href="https://simonwillison.net/2024/Nov/26/carson-gross/#atom-tag" rel="alternate"/><published>2024-11-26T04:50:21+00:00</published><updated>2024-11-26T04:50:21+00:00</updated><id>https://simonwillison.net/2024/Nov/26/carson-gross/#atom-tag</id><summary type="html">
    &lt;blockquote cite="https://htmx.org/essays/codin-dirty/"&gt;&lt;p&gt;My preferred approach in many projects is to do some unit testing, but not a ton, early on in the project and wait until the core APIs and concepts of a module have crystallized.&lt;/p&gt;
&lt;p&gt;At that point I then test the API exhaustively with integrations tests.&lt;/p&gt;
&lt;p&gt;In my experience, these integration tests are much more useful than unit tests, because they remain stable and useful even as you change the implementation around. They aren’t as tied to the current codebase, but rather express higher level invariants that survive refactors much more readily.&lt;/p&gt;&lt;/blockquote&gt;
&lt;p class="cite"&gt;&amp;mdash; &lt;a href="https://htmx.org/essays/codin-dirty/"&gt;Carson Gross&lt;/a&gt;&lt;/p&gt;

    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/programming"&gt;programming&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/testing"&gt;testing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/carson-gross"&gt;carson-gross&lt;/a&gt;&lt;/p&gt;



</summary><category term="programming"/><category term="testing"/><category term="carson-gross"/></entry><entry><title>[red-knot] type inference/checking test framework</title><link href="https://simonwillison.net/2024/Oct/16/markdown-test-framework/#atom-tag" rel="alternate"/><published>2024-10-16T20:43:55+00:00</published><updated>2024-10-16T20:43:55+00:00</updated><id>https://simonwillison.net/2024/Oct/16/markdown-test-framework/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/astral-sh/ruff/pull/13636"&gt;[red-knot] type inference/checking test framework&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Ruff maintainer Carl Meyer recently landed an interesting new design for a testing framework. It's based on Markdown, and could be described as a form of "literate testing" - the testing equivalent of Donald Knuth's &lt;a href="https://en.wikipedia.org/wiki/Literate_programming"&gt;literate programming&lt;/a&gt;.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;A markdown test file is a suite of tests, each test can contain one or more Python files, with optionally specified path/name. The test writes all files to an in-memory file system, runs red-knot, and matches the resulting diagnostics against &lt;code&gt;Type:&lt;/code&gt; and &lt;code&gt;Error:&lt;/code&gt; assertions embedded in the Python source as comments.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Test suites are Markdown documents with embedded fenced blocks that look &lt;a href="https://github.com/astral-sh/ruff/blob/2095ea83728d32959a435ab749acce48dfb76256/crates/red_knot_python_semantic/resources/mdtest/literal/float.md?plain=1#L5-L7"&gt;like this&lt;/a&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;```py
reveal_type(1.0) # revealed: float
```
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Tests can optionally include a &lt;code&gt;path=&lt;/code&gt; specifier, which can provide neater messages when reporting test failures:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;```py path=branches_unify_to_non_union_type.py
def could_raise_returns_str() -&amp;gt; str:
    return 'foo'
...
```
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;A larger example test suite can be browsed in the &lt;a href="https://github.com/astral-sh/ruff/tree/6282402a8cb44ac6362c6007fc911c3d75729648/crates/red_knot_python_semantic/resources/mdtest"&gt;red_knot_python_semantic/resources/mdtest&lt;/a&gt; directory.&lt;/p&gt;
&lt;p&gt;This document &lt;a href="https://github.com/astral-sh/ruff/blob/main/crates/red_knot_python_semantic/resources/mdtest/exception/control_flow.md"&gt;on control flow for exception handlers&lt;/a&gt; (from &lt;a href="https://github.com/astral-sh/ruff/pull/13729"&gt;this PR&lt;/a&gt;) is the best example I've found of detailed prose documentation to accompany the tests.&lt;/p&gt;
&lt;p&gt;The system is implemented in Rust, but it's easy to imagine an alternative version of this idea written in Python as a &lt;code&gt;pytest&lt;/code&gt; plugin. This feels like an evolution of the old Python &lt;a href="https://docs.python.org/3/library/doctest.html"&gt;doctest&lt;/a&gt; idea, except that tests are embedded directly in Markdown rather than being embedded in Python code docstrings.&lt;/p&gt;
&lt;p&gt;... and it looks like such plugins exist already. Here are three that I've found so far:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/modal-labs/pytest-markdown-docs"&gt;pytest-markdown-docs&lt;/a&gt; by Elias Freider and Modal Labs.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.sphinx-doc.org/en/master/usage/extensions/doctest.html"&gt;sphinx.ext.doctest&lt;/a&gt; is a core Sphinx extension for running test snippets in documentation.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/scientific-python/pytest-doctestplus"&gt;pytest-doctestplus&lt;/a&gt; from the Scientific Python community, first released in 2011.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I tried &lt;code&gt;pytest-markdown-docs&lt;/code&gt; by creating a &lt;code&gt;doc.md&lt;/code&gt; file like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Hello test doc

```py
assert 1 + 2 == 3
```

But this fails:

```py
assert 1 + 2 == 4
```
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;And then running it with &lt;a href="https://docs.astral.sh/uv/guides/tools/"&gt;uvx&lt;/a&gt; like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;uvx --with pytest-markdown-docs pytest --markdown-docs
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I got one pass and one fail:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;_______ docstring for /private/tmp/doc.md __________
Error in code block:
```
10   assert 1 + 2 == 4
11   
```
Traceback (most recent call last):
  File "/private/tmp/tt/doc.md", line 10, in &amp;lt;module&amp;gt;
    assert 1 + 2 == 4
AssertionError

============= short test summary info ==============
FAILED doc.md::/private/tmp/doc.md
=========== 1 failed, 1 passed in 0.02s ============
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I also &lt;a href="https://twitter.com/exhaze/status/1846675911225364742"&gt;just learned&lt;/a&gt; that the venerable Python &lt;code&gt;doctest&lt;/code&gt; standard library module has the ability to &lt;a href="https://docs.python.org/3/library/doctest.html#simple-usage-checking-examples-in-a-text-file"&gt;run tests in documentation files&lt;/a&gt; too, with &lt;code&gt;doctest.testfile("example.txt")&lt;/code&gt;: "The file content is treated as if it were a single giant docstring; the file doesn’t need to contain a Python program!"&lt;/p&gt;
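&lt;p&gt;Here is a minimal sketch of that &lt;code&gt;doctest.testfile()&lt;/code&gt; approach, mirroring the one-pass, one-fail document above. The file name, its contents and the temporary directory are all assumptions made for illustration; &lt;code&gt;module_relative=False&lt;/code&gt; tells doctest to treat the path as an ordinary filesystem path.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;
```python
import doctest
import os
import tempfile

# Hypothetical documentation file: prose with two embedded examples,
# the second of which is deliberately wrong.
DOC_TEXT = """Hello test doc
==============

This example passes:

    >>> 1 + 2
    3

But this fails:

    >>> 1 + 2
    4
"""

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write(DOC_TEXT)
    # The whole file is treated as one giant docstring. Failures are
    # reported on stdout and summarised in the returned TestResults.
    results = doctest.testfile(path, module_relative=False)

print(results.failed, "failed out of", results.attempted)
```
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The returned &lt;code&gt;TestResults&lt;/code&gt; named tuple carries the failed and attempted counts, so a wrapper script or test could assert on those rather than parsing the printed report.&lt;/p&gt;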

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/charliermarsh/status/1846544708480168229"&gt;Charlie Marsh&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/testing"&gt;testing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/markdown"&gt;markdown&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/rust"&gt;rust&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pytest"&gt;pytest&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ruff"&gt;ruff&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/uv"&gt;uv&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/astral"&gt;astral&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/donald-knuth"&gt;donald-knuth&lt;/a&gt;&lt;/p&gt;



</summary><category term="python"/><category term="testing"/><category term="markdown"/><category term="rust"/><category term="pytest"/><category term="ruff"/><category term="uv"/><category term="astral"/><category term="donald-knuth"/></entry><entry><title>An LLM TDD loop</title><link href="https://simonwillison.net/2024/Oct/13/an-llm-tdd-loop/#atom-tag" rel="alternate"/><published>2024-10-13T19:37:47+00:00</published><updated>2024-10-13T19:37:47+00:00</updated><id>https://simonwillison.net/2024/Oct/13/an-llm-tdd-loop/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://codeinthehole.com/tips/llm-tdd-loop-script/"&gt;An LLM TDD loop&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Super neat demo by David Winterbottom, who wrapped my &lt;a href="https://llm.datasette.io/"&gt;LLM&lt;/a&gt; and &lt;a href="https://github.com/simonw/files-to-prompt"&gt;files-to-prompt&lt;/a&gt; tools in &lt;a href="https://gist.github.com/codeinthehole/d12af317a76b43423b111fd6d508c4fc"&gt;a short Bash script&lt;/a&gt; that can be fed a file full of Python unit tests and an empty implementation file and will then iterate on that file in a loop until the tests pass.&lt;/p&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/codeinthehole/status/1845541873651274144"&gt;@codeinthehole&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/testing"&gt;testing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pytest"&gt;pytest&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-programming"&gt;ai-assisted-programming&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/files-to-prompt"&gt;files-to-prompt&lt;/a&gt;&lt;/p&gt;



</summary><category term="python"/><category term="testing"/><category term="ai"/><category term="pytest"/><category term="generative-ai"/><category term="llms"/><category term="ai-assisted-programming"/><category term="llm"/><category term="files-to-prompt"/></entry><entry><title>inline-snapshot</title><link href="https://simonwillison.net/2024/Apr/16/inline-snapshot/#atom-tag" rel="alternate"/><published>2024-04-16T16:04:25+00:00</published><updated>2024-04-16T16:04:25+00:00</updated><id>https://simonwillison.net/2024/Apr/16/inline-snapshot/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://15r10nk.github.io/inline-snapshot/"&gt;inline-snapshot&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;I'm a big fan of snapshot testing, where expected values are captured the first time a test suite runs and then asserted against in future runs. It's a very productive way to build a robust test suite.&lt;/p&gt;
&lt;p&gt;inline-snapshot by Frank Hoffmann is a particularly neat implementation of the pattern. It defines a &lt;code&gt;snapshot()&lt;/code&gt; function which you can use in your tests:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;assert 1548 * 18489 == snapshot()&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;When you run that test using &lt;code&gt;pytest --inline-snapshot=create&lt;/code&gt; the &lt;code&gt;snapshot()&lt;/code&gt; function will be replaced in your code (using AST manipulation) with itself wrapping the &lt;code&gt;repr()&lt;/code&gt; of the expected result:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;assert 1548 * 18489 == snapshot(28620972)&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;If you modify the code and need to update the tests you can run &lt;code&gt;pytest --inline-snapshot=fix&lt;/code&gt; to regenerate the recorded snapshot values.&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/testing"&gt;testing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pytest"&gt;pytest&lt;/a&gt;&lt;/p&gt;



</summary><category term="python"/><category term="testing"/><category term="pytest"/></entry><entry><title>Your AI Product Needs Evals</title><link href="https://simonwillison.net/2024/Mar/31/your-ai-product-needs-evals/#atom-tag" rel="alternate"/><published>2024-03-31T21:53:59+00:00</published><updated>2024-03-31T21:53:59+00:00</updated><id>https://simonwillison.net/2024/Mar/31/your-ai-product-needs-evals/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://hamel.dev/blog/posts/evals/"&gt;Your AI Product Needs Evals&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Hamel Husain: “I’ve seen many successful and unsuccessful approaches to building LLM products. I’ve found that unsuccessful products almost always share a common root cause: a failure to create robust evaluation systems.”&lt;/p&gt;

&lt;p&gt;I’ve been frustrated about this for a while: I know I need to move beyond “vibe checks” for the systems I have started to build on top of LLMs, but I was lacking a thorough guide about how to build automated (and manual) evals in a productive way.&lt;/p&gt;

&lt;p&gt;Hamel has provided exactly the tutorial I was needing for this, with a really thorough example case-study.&lt;/p&gt;

&lt;p&gt;Using GPT-4 to create test cases is an interesting approach: “Write 50 different instructions that a real estate agent can give to his assistant to create contacts on his CRM. The contact details can include name, phone, email, partner name, birthday, tags, company, address and job.”&lt;/p&gt;

&lt;p&gt;Also important: “... unlike traditional unit tests, you don’t necessarily need a 100% pass rate. Your pass rate is a product decision.”&lt;/p&gt;

&lt;p&gt;Hamel’s guide then covers the importance of traces for evaluating real-world performance of your deployed application, plus the pros and cons of leaning on automated evaluation using LLMs themselves.&lt;/p&gt;

&lt;p&gt;Plus some wisdom from a footnote: “A reasonable heuristic is to keep reading logs until you feel like you aren’t learning anything new.”&lt;/p&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/HamelHusain/status/1773765490663735319"&gt;@HamelHusain&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/testing"&gt;testing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/hamel-husain"&gt;hamel-husain&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/evals"&gt;evals&lt;/a&gt;&lt;/p&gt;



</summary><category term="testing"/><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="hamel-husain"/><category term="evals"/></entry><entry><title>time-machine example test for a segfault in Python</title><link href="https://simonwillison.net/2024/Mar/23/test-segfault-in-python/#atom-tag" rel="alternate"/><published>2024-03-23T19:44:07+00:00</published><updated>2024-03-23T19:44:07+00:00</updated><id>https://simonwillison.net/2024/Mar/23/test-segfault-in-python/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/adamchainz/time-machine/pull/433/files#diff-92ea7165ddf0128246b9758ee9554b3eccb4eceb3d4719bdea9f5495ebbe10a1R477-R495"&gt;time-machine example test for a segfault in Python&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Here's a really neat testing trick by Adam Johnson. Someone reported a segfault bug in his time-machine library. How do you write a unit test that exercises a segfault without crashing the entire test suite?&lt;/p&gt;
&lt;p&gt;Adam's solution is a test that does this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;subprocess.run([sys.executable, "-c", code_that_crashes_python], check=True)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;code&gt;sys.executable&lt;/code&gt; is the path to the current Python executable - ensuring the code will run in the same virtual environment as the test suite itself. The &lt;code&gt;-c&lt;/code&gt; option can be used to have it run a (multi-line) string of Python code, and &lt;code&gt;check=True&lt;/code&gt; causes the &lt;code&gt;subprocess.run()&lt;/code&gt; function to raise a &lt;code&gt;CalledProcessError&lt;/code&gt; if the subprocess exits with a non-zero return code, which includes being killed by a segfault.&lt;/p&gt;
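&lt;p&gt;A minimal sketch of the same pattern follows. It is not Adam's actual test: &lt;code&gt;os.abort()&lt;/code&gt; is used here as a hypothetical stand-in for code that crashes the interpreter, and the &lt;code&gt;crashes_python()&lt;/code&gt; helper is an illustrative name of my own. The crash happens in the child process, so the parent only ever sees an exception it can catch.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;
```python
import subprocess
import sys

# Hypothetical stand-in for code that crashes Python: os.abort() kills the
# interpreter immediately instead of raising a normal exception.
code_that_crashes_python = "import os; os.abort()"


def crashes_python(code):
    """Run code in a fresh interpreter and report whether it exited uncleanly."""
    try:
        subprocess.run(
            [sys.executable, "-c", code],
            check=True,  # raise CalledProcessError on a non-zero exit code
            capture_output=True,  # keep the child's output out of the test log
        )
    except subprocess.CalledProcessError:
        return True
    return False


print(crashes_python(code_that_crashes_python))
print(crashes_python("print('still alive')"))
```
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;A regression test for a fixed crash would call &lt;code&gt;subprocess.run(..., check=True)&lt;/code&gt; directly and let the exception fail the test if the crash ever comes back.&lt;/p&gt;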
&lt;p&gt;I'm absolutely going to be borrowing this pattern next time I need to add tests to cover a crashing bug in one of my projects.&lt;/p&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://fosstodon.org/@adamchainz/112144774490159195"&gt;@adamchainz&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/testing"&gt;testing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/adam-johnson"&gt;adam-johnson&lt;/a&gt;&lt;/p&gt;



</summary><category term="python"/><category term="testing"/><category term="adam-johnson"/></entry><entry><title>Testcontainers</title><link href="https://simonwillison.net/2024/Feb/28/testcontainers/#atom-tag" rel="alternate"/><published>2024-02-28T02:41:38+00:00</published><updated>2024-02-28T02:41:38+00:00</updated><id>https://simonwillison.net/2024/Feb/28/testcontainers/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://testcontainers.com/"&gt;Testcontainers&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Not sure how I missed this: Testcontainers is a family of testing libraries (for Python, Go, JavaScript, Ruby, Rust and a bunch more) that make it trivial to spin up a service such as PostgreSQL or Redis in a container for the duration of your tests and then spin it back down again.&lt;/p&gt;
&lt;p&gt;The Python example code is delightful:&lt;/p&gt;
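&lt;pre&gt;&lt;code&gt;from testcontainers.core.container import DockerContainer
from testcontainers.core.waiting_utils import wait_for_logs

redis = DockerContainer("redis:5.0.3-alpine").with_exposed_ports(6379)
redis.start()
wait_for_logs(redis, "Ready to accept connections")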
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I much prefer integration-style tests over unit tests, and I like to make sure any of my projects that depend on PostgreSQL or similar can run their tests against a real running instance. I've invested heavily in spinning up Varnish or Elasticsearch ephemeral instances in the past - Testcontainers look like they could save me a lot of time.&lt;/p&gt;
&lt;p&gt;The open source project started in 2015, spun off a company called AtomicJar in 2021 and was acquired by Docker in December 2023.&lt;/p&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=39531536"&gt;Hacker News&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/redis"&gt;redis&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/testing"&gt;testing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/docker"&gt;docker&lt;/a&gt;&lt;/p&gt;



</summary><category term="redis"/><category term="testing"/><category term="docker"/></entry><entry><title>Quoting Will Wilson</title><link href="https://simonwillison.net/2024/Feb/13/foundationdb/#atom-tag" rel="alternate"/><published>2024-02-13T17:20:07+00:00</published><updated>2024-02-13T17:20:07+00:00</updated><id>https://simonwillison.net/2024/Feb/13/foundationdb/#atom-tag</id><summary type="html">
    &lt;blockquote cite="https://antithesis.com/blog/is_something_bugging_you/"&gt;&lt;p&gt;Before we even started writing the database, we first wrote a fully-deterministic event-based network simulation that our database could plug into. This system let us simulate an entire cluster of interacting database processes, all within a single-threaded, single-process application, and all driven by the same random number generator. We could run this virtual cluster, inject network faults, kill machines, simulate whatever crazy behavior we wanted, and see how it reacted. Best of all, if one particular simulation run found a bug in our application logic, we could run it over and over again with the same random seed, and the exact same series of events would happen in the exact same order. That meant that even for the weirdest and rarest bugs, we got infinity “tries” at figuring it out, and could add logging, or do whatever else we needed to do to track it down.&lt;/p&gt;
&lt;p&gt;[...] At FoundationDB, once we hit the point of having ~zero bugs and confidence that any new ones would be found immediately, we entered into this blessed condition and we flew.&lt;/p&gt;
&lt;p&gt;[...] We had built this sophisticated testing system to make our database more solid, but to our shock that wasn’t the biggest effect it had. The biggest effect was that it gave our tiny engineering team the productivity of a team 50x its size.&lt;/p&gt;&lt;/blockquote&gt;
&lt;p class="cite"&gt;&amp;mdash; &lt;a href="https://antithesis.com/blog/is_something_bugging_you/"&gt;Will Wilson&lt;/a&gt;, on FoundationDB&lt;/p&gt;

    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/databases"&gt;databases&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/testing"&gt;testing&lt;/a&gt;&lt;/p&gt;



</summary><category term="databases"/><category term="testing"/></entry><entry><title>promptfoo: How to benchmark Llama2 Uncensored vs. GPT-3.5 on your own inputs</title><link href="https://simonwillison.net/2023/Sep/10/promptfoo/#atom-tag" rel="alternate"/><published>2023-09-10T16:19:18+00:00</published><updated>2023-09-10T16:19:18+00:00</updated><id>https://simonwillison.net/2023/Sep/10/promptfoo/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://promptfoo.dev/docs/guides/llama2-uncensored-benchmark-ollama/"&gt;promptfoo: How to benchmark Llama2 Uncensored vs. GPT-3.5 on your own inputs&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;promptfoo is a CLI and library for “evaluating LLM output quality”. This tutorial in their documentation about using it to compare Llama 2 to gpt-3.5-turbo is a good illustration of how it works: it uses YAML files to configure the prompts, and more YAML to define assertions such as “not-icontains: AI language model”.&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/cli"&gt;cli&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/testing"&gt;testing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;&lt;/p&gt;



</summary><category term="cli"/><category term="testing"/><category term="ai"/><category term="generative-ai"/><category term="llms"/></entry><entry><title>pytest-icdiff</title><link href="https://simonwillison.net/2023/Jun/3/pytest-icdiff/#atom-tag" rel="alternate"/><published>2023-06-03T16:59:24+00:00</published><updated>2023-06-03T16:59:24+00:00</updated><id>https://simonwillison.net/2023/Jun/3/pytest-icdiff/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/hjwp/pytest-icdiff"&gt;pytest-icdiff&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;This is neat: “pip install pytest-icdiff” provides an instant usability upgrade to the output of failed tests in pytest, especially if the assertions involve comparing larger strings or nested JSON objects.&lt;/p&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://mastodon.social/@hynek/110479665200902390"&gt;@hynek&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/testing"&gt;testing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pytest"&gt;pytest&lt;/a&gt;&lt;/p&gt;



</summary><category term="python"/><category term="testing"/><category term="pytest"/></entry></feed>