Simon Willison's Weblog: claude-code

Claude Fable is relentlessly proactive

2026-06-11T23:35:17+00:00

After two days of experience with Claude Fable 5 I think the best way to describe it is relentlessly proactive. It knows a whole lot of tricks and it will deploy pretty much any of them to get to its goal.

I'll illustrate this with an example. I was hacking on Datasette Agent today when I noticed a glitch: a horizontal scrollbar that shouldn't be there in the jump menu chat prompt. I snapped this screenshot:

Then I started a fresh claude session in my datasette-agent checkout, dragged in the screenshot and told it:

Look at dependencies to help figure out why there is a horizontal scrollbar here

I had a hunch the cause was in a dependency of Datasette Agent (likely Datasette itself) and I knew Fable was good at digging into dependency code, either by inspecting installed files in its own virtual environment site-packages or by referencing a local checkout on disk. Telling it to start with dependencies felt like a good bet.

I got distracted by a domestic task and wandered away from my computer.

When I came back a few minutes later I saw my machine open a browser window in my regular Firefox and then navigate to the dialog in question. I had not told Claude Code to use any browser automation, and I was pretty sure it wasn't possible for it to trigger mouse movements or keyboard shortcuts within a window, so how was it doing that?

I watched in fascination as it continued with its explorations, then saw it open a Safari window instead of Firefox. I also grabbed this snapshot from the Claude terminal:

What was it doing there with uv run --with pyobjc-framework-Quartz?

It turns out Fable had hacked up its own pattern for taking screenshots of browser windows. It was using Python to iterate through all available windows on my machine, then filtering for Safari windows with expected strings such as "textarea" in the window name. It used that to find their window number - an integer like 153551 - which it could then use with the screencapture CLI tool to grab a PNG.

OK fine, that's a neat way of taking screenshots. But what was it taking screenshots of?

Turns out it had been writing its own scratch HTML pages to try and recreate the bug, then opening Safari and grabbing screenshots.

Here's that /tmp/textarea-scrollbar-test.html page it created, and the screenshot it took with screencapture -x -o -l 153551 /tmp/safari-cases.png:

(I have way too many open tabs!)

OK, so I can see how it's opening test pages and taking screenshots, but how on earth was it triggering the modal dialog that was meant to be under test? That's only available via a click or a keyboard shortcut, and I couldn't see a mechanism for it to run those in Safari.

I eventually figured out what it had done.

Claude was running in a folder that contained the source code for the application. It knows enough about Datasette to be able to run a local development server. It turns out it was editing Datasette's own templates to add JavaScript that would trigger the correct keyboard shortcut as soon as the window opened, adding code like this:

<script>
window.addEventListener("load", function () {
  setTimeout(function () {
    document.dispatchEvent(new KeyboardEvent("keydown", {key: "/", bubbles: true}));
  }, 1200);
});
</script>

1.2 seconds after the window opens, this code triggers a simulated / key, which is the keyboard shortcut for opening the modal dialog.

There was one challenge left. In order to understand what was going on, Claude needed to run JavaScript on the page to take measurements for itself.

It wrote its own custom web application to capture information via CORS, then ran that as a local server and opened a page with JavaScript that would POST directly to it!

Here's the Python web app it wrote, using the standard library http.server package:

from http.server import HTTPServer, BaseHTTPRequestHandler

class H(BaseHTTPRequestHandler):
    def do_POST(self):
        n = int(self.headers.get("Content-Length", 0))
        open("/tmp/diag.json", "w").write(self.rfile.read(n).decode())
        self.send_response(200)
        self.send_header("Access-Control-Allow-Origin", "*")
        self.end_headers()
    def do_OPTIONS(self):
        self.send_response(200)
        self.send_header("Access-Control-Allow-Origin", "*")
        self.send_header("Access-Control-Allow-Headers", "*")
        self.end_headers()
    def log_message(self, *a):  # quiet
        pass

HTTPServer(("127.0.0.1", 9999), H).serve_forever()

All this does is accept a POST request full of JSON and write that to the /tmp/diag.json file. It sends Access-Control-Allow-Origin: * headers (including from OPTIONS requests) so that code running on another domain can still communicate back to it.

Then Claude injected this code into the template that it was loading in a browser:

const host = document.querySelector("navigation-search");
const ta   = host.shadowRoot.querySelector("textarea");
const cs   = getComputedStyle(ta);
fetch("http://127.0.0.1:9999/diag", {
  method: "POST",
  body: JSON.stringify({
    dpr: window.devicePixelRatio,
    scrollWidth: ta.scrollWidth, clientWidth: ta.clientWidth,
    whiteSpace: cs.whiteSpace, width: cs.width,
  }),
});

This took measurements of the <textarea> inside the <navigation-search> Web Component and sent them to the server, which wrote them to a file on disk, which Claude could then read.

Having figured out all of these tricks Fable... hit some invisible guardrail and downgraded itself to Opus. Thankfully Opus had access to the full transcript and could continue using the tricks pioneered by Fable, and shortly afterwards found, tested and verified the fix.

I prompted Opus to:

Write a report in /tmp/automation-report.md where you note down all of the tricks you have used in this session to test against real browsers on my computer, include runnable code examples

Which produced this report, which was invaluable for piecing together the details of what had happened for this post.

I've shared the full terminal transcript of the Claude Code session as well.

A review of everything it did

Based on a screenshot and a one-line prompt, Claude Fable 5 + Claude Code:

Figured out the recipe to run the local development server (with fake environment variables needed to get it running)
Fired up a Playwright Chrome session
Turned on the visible scrollbars setting for Chrome defaults write com.google.chrome.for.testing AppleShowScrollBars Always (it turned that off again later)
Cycled through Firefox and WebKit in Playwright too, failing to recreate the bug
Worked out my default browser was Safari
Built a textarea-scrollbar-test.html HTML document
Opened that in real (not Playwright) Firefox
Found that osascript -e 'tell application "System Events" to tell process "firefox" to id of window 1' was blocked because "osascript is not allowed assistive access"
Figured out that uv run --with pyobjc-framework-Quartz python workaround, described above
Added JavaScript to the site templates in order to trigger the / key
Built its own little Python CORS web server to capture JSON data
Rewrote the template to capture that data and send it to the server
Scripted its way through the Web Component shadow DOM to the information it needed
Opened Safari to confirm the source of the bug
Modified its custom template to hack in a potential fix
Confirmed the hacked fix worked
Reported back on how to fix the problem

Like I said, relentlessly proactive!

An estimate of the cost

I'm currently on the $100/month Claude Max plan, which includes a generous allowance for Fable up until June 22nd after which Anthropic say they'll start charging full API prices for it.

I'm using AgentsView to track my spending (see this TIL). Here's what AgentsView says this session would have cost me if I was paying full price for it:

~ % uvx agentsview session usage be8850a7-6119-46a0-b5d6-79c7fff5ae2b
Session:       be8850a7-6119-46a0-b5d6-79c7fff5ae2b
Agent:         claude
Output:        68606
Peak ctx:      113178
Cost:          ~$12.11 (claude-fable-5, claude-opus-4-8)

If you don't keep a close eye on it, Fable will quite happily burn $12 in tokens inventing new ways to debug your CSS.

I really need to lock this thing down

On the one hand, watching Fable go to extreme lengths to get the information that it needed to debug what was, in the end, a two-line CSS fix, was fascinating.

But on the other hand... this is a robust reminder that coding agents can do anything you can do by typing commands into a terminal - and frontier models know every trick in the book, and evidently a few that nobody has ever written down before.

If Fable had been acting on malicious instructions - a prompt injection attack hidden in code or an issue thread, or something I'd carelessly pasted into my terminal - it's alarming to think quite how far it could go to exfiltrate data or cause other forms of mischief.

Running coding agents outside of a sandbox has always been a bad idea - it's my top contender for a Challenger disaster incident, as described by Johann Rehberger in The Normalization of Deviance in AI.

Fable is arguably smarter and hence more suspicious of potentially malicious instructions. But that smartness is very much a two-edged sword: if it does get subverted by instructions, the amount of damage it can do given its relentless proactivity is terrifying.

Tags: ai, prompt-injection, generative-ai, llms, ai-assisted-programming, coding-agents, claude-code, claude-mythos

How we contain Claude across products

2026-05-30T21:36:24+00:00

How we contain Claude across products

A complaint I often have about sandboxing products is that they are rarely thoroughly documented, and in the absence of detailed documentation it's hard to know how much I can trust them.

Anthropic just published a fantastic overview of how their various sandbox techniques work across Claude.ai, Claude Code, and Cowork.

We constrain where and how an agent can act with process sandboxes, VMs, filesystem boundaries, and egress controls. The goal is to set a hard boundary on what an agent can reach. For example, if credentials never enter the sandbox, they can't be exfiltrated, regardless of whether the cause is a user, a model finding a “creative” path, or an attacker.

Claude.ai uses gVisor. Claude Code, run locally, uses Seatbelt on macOS and Bubblewrap on Linux. Claude Cowork runs a full VM (Apple's Virtualization framework on macOS, HCS on Windows).

There's a lot in here, including some interesting stories of risks they missed such as the api.anthropic.com/v1/files exfiltration vector covered here previously.

This reminded me it's time I took another look at Anthropic's open source srt (Anthropic Sandbox Runtime) tool - it's mature enough now that I'm ready to give it a proper go.

Tags: sandboxing, security, ai, generative-ai, llms, anthropic, claude, claude-code

Running Python ASGI apps in the browser via Pyodide + a service worker

2026-05-30T15:34:00+00:00

Research: Running Python ASGI apps in the browser via Pyodide + a service worker

Datasette Lite is my version of Datasette that runs entirely in the browser using Pyodide in WebAssembly.

When I first built it four years ago I used Web Workers and code that intercepts navigation operations and fetches the generated HTML by running the Python app.

This worked, but had the disadvantage that any JavaScript in <script> tags would not be executed - breaking some Datasette functionality and a whole lot of Datasette plugins.

This morning I set Claude Opus 4.8 the task (in Claude Code for web) of figuring out how to run Python ASGI apps in Pyodide using Service Workers instead, and it seems to work! Here's a basic ASGI FastCGI demo and here's a demo that runs Datasette 1.0a31.

I'm still getting my head around exactly how it works, but once I've done that I plan to upgrade Datasette Lite itself.

Tags: javascript, python, datasette, asgi, webassembly, service-workers, pyodide, datasette-lite, claude-code

I think Anthropic and OpenAI have found product-market fit

2026-05-27T16:38:35+00:00

Anthropic are strongly rumored to be about to have their first profitable quarter. Stories are circulating of companies surprised at how expensive their LLM bills are becoming from usage by their staff. I think this is because OpenAI and Anthropic have both found product-market fit.

Enterprise customers are now paying API prices

I currently subscribe to the $100/month Max plan from Anthropic and the $100/month Pro plan from OpenAI. If you are a heavy user of coding agents these plans are a fantastic deal. I just ran the ccusage tool on my laptop to get an estimate of how much I would have spent if I were to pay for API tokens in the past 30 days and got:

$1,199.79 for Anthropic Claude Code
$980.37 for OpenAI Codex

That's $2,180.16 worth of tokens for $200 - not bad at all! I'm a moderately heavy user of these tools, but I'm certainly not running agents every hour of the day and night.

I had assumed that companies making extensive use of agents were getting similar discounts. It turns out I could not have been more wrong about that.

I haven't been able to track down the exact date, but at some point in the last six months Anthropic switched their Enterprise plan (originally "Claude seats include enough usage for a typical workday" back in August 2025) to $20/seat/month plus API pricing for usage. This story about the change from The Information is dated Apr 14, 2026, but cites an Anthropic spokesperson claiming that the pricing change occurred in November 2025. Existing customers are finding out about the change as they renew their contracts.

OpenAI made a similar pricing change in April. The Codex rate card (Internet Archive copy) currently says:

Note: On April 2, 2026, we updated Codex pricing to align with API token usage, instead of per-message pricing. This change was applicable to new and existing Plus, Pro, ChatGPT Business and new ChatGPT Enterprise plans.

On April 23, 2026, we made this update for all existing ChatGPT Enterprise plans as well, inclusive of Edu, Health, Gov, and ChatGPT for Teachers.

It's a little harder to decode as they quote prices in "credits", but as far as I can tell those credit costs are an exact match for the API token costs listed for those models.

All of which is to say that as of April 2026 the "Enterprise" cost for both OpenAI Codex and Anthropic Claude Code/Cowork is the same as the listed API price.

GPT-5.5 (released April 23rd) is 2x the API price of GPT-5.4. Opus 4.7 (April 16th) is around 1.4x the price of Opus 4.6 when you take their new tokenizer into account.

So April saw both leading model companies release new frontier models with a higher API price, and both companies now have measures to lock their enterprise customers (who tend to sign year-long deals) at those API prices, not the previous extreme discounts.

I think they've found product-market fit

Why these sudden aggressive moves on pricing? Both Anthropic and OpenAI are planning to IPO, but I suspect there's a more important factor here: I think they've finally found product-market fit, with the coding/general-purpose agent products embodied by Claude Code/Cowork and Codex.

Tools like ChatGPT are wildly popular, but that wild popularity has been difficult to turn into revenue. In February OpenAI boasted more than 900 million weekly active users for ChatGPT, but only 50 million - 5.6% of that - were paying consumer subscribers.

Charging $10-$20/month per user is an OK business, but you'd need 1-2 billion subscribers sticking around for four years to cover $1 trillion in infrastructure.

Companies spending $200+/month/user will get you there a whole lot faster - and as noted above, as a power-user I'm at ~$1,000/month in API costs per vendor already.

Coding agents really did change everything. These are tools which burn vastly more tokens, but are also quickly becoming daily drivers for the work carried out by extremely well-compensated professionals. Right now that's still mostly software engineers, but a coding agent is a tool that can automate anything you can do by typing commands into a computer... so they are clearly applicable to a much wider set of skilled knowledge workers.

As I've discussed on this site at length, the models released in November 2025 elevated agents to being genuinely useful. We've had six months to get used to that idea now - it's no wonder companies are beginning to spend real money on this technology.

You could argue that ChatGPT achieved product-market fit when it became the fastest-growing consumer app in history back in February 2023... but it certainly wasn't making any actual money back then. Coding agents plus enterprise pricing marks the point when these companies start making very real revenue. Maybe even enough to start covering their costs!

And they're ramping up

As further evidence that enterprise agents represent product-market fit for these companies, consider their open job listings.

OpenAI have 703 open jobs right now, of which I'd categorize 229 (32.6%) as relating to enterprise sales and support - account executives, "Go To Market", "Forward Deployed Engineers" and the like.

Anthropic have 390 open jobs, 105 (26.9%) of which look enterprisey to me.

It's pleasingly ironic that these AI labs have picked a business model with such a heavy demand on human labor - enterprise sales contracts don't close themselves without a whole lot of humans in the mix!

(I ran this analysis by scraping their job sites with Claude Code, then having it use Datasette's JSON API to pipe that data into Datasette Cloud where I used Datasette Agent for the analysis, exported here. Dogfood!)

The AI-failure stories around this are pretty thin

I started digging into this in response to a growing volume of stories claiming that large companies were sounding the alarm because their AI usage costs had grown so large.

The most widely cited of these stories appear quite overblown to me.

The most discussed has been Uber, based on this report where CTO Praveen Neppalli Naga indicated that Uber had "maxed out its full year AI budget just a few months into 2026", mostly thanks to Claude Code.

Given that Claude Code only got really good in November it's entirely unsurprising to me that a budget set in 2025 may have failed to predict demand for that tool in 2026!

That Uber story was further fueled by comments made by Uber's COO, Andrew Macdonald, on the Rapid Response podcast. I tracked down the segment and there really isn't much there. Here's what Andrew said:

But then you sometimes go and talk to your senior engineering leaders and you're saying, OK, how many projects that were on the cutting room floor got moved above the line because of the productivity gains because 25% of our code commits were via Claude Code last quarter?

That link is not there yet, right? I think maybe implicitly there's more that is getting shipped. But it's very hard to draw a line between one of those stats and, OK, now we're actually producing like 25% more useful consumer features, right? And that line is hard to draw.

[...] And so if you're not actually able to draw a direct line to how much useful features and functionality you're shipping to your users, that trade becomes harder to justify.

Somehow this fragment turned into headlines like Uber's COO says it's getting harder to justify the money spent on AI tokenmaxxing, because the market for stories about AI failures remains enormous.

Update 29th May 2026: I edited the above quote to add that last paragraph ending in "becomes harder to justify" on the suggestion of Madison Mills - previously my quoted section stopped at "hard to draw". Here's the full unedited transcript from MacWhisper.

The other popular story around this is Microsoft starts canceling Claude Code licenses, ostensibly to encourage their engineers to dogfood their own Copilot CLI agent instead - but The Verge reporter Tom Warren says "sources tell me the decision is also a financial one", triggered by the June 30th end of Microsoft's financial year.

I think both of these stories support my "product-market fit" hypothesis. The best advice I ever heard on pricing a product was that your customer should suck air through their teeth and then say yes. Uber's budget overrun and Microsoft's seat cancellations look like that effect playing out in practice.

We also know the labs are spending a lot

The big AI labs spend billions of dollars on both training and inference. Credible figures are hard to come by, but we did get one huge hint as to the figures involved from, oddly enough, the recent SpaceX S-1:

[...] in May 2026, we entered into Cloud Services Agreements with Anthropic PBC (“Anthropic”), an AI research and development public benefit corporation, with respect to access to compute capacity across COLOSSUS and COLOSSUS II. Pursuant to these agreements, the customer has agreed to pay us $1.25 billion per month through May 2029 [...]

The Anthropic announcement said that this deal meant they could "increase our usage limits for Claude Code and the Claude API", heavily implying that Colossus is being used for inference, not model training.

Anthropic already have vast amounts of compute from other providers. The fact that they're willing to spend $1.25 billion per month for extra capacity from just one of their vendors hints at how big these inference budgets have become.

API revenue is becoming less important

Over the past two years my impression has been that OpenAI made more of their income from subscription revenue while Anthropic made more from their API.

Anthropic's API revenue was historically quite dependent on a small number of large API customers - this VentureBeat story from August 2025 quotes "sources familiar with the matter" suggesting that just Cursor and GitHub Copilot were responsible for $1.2 billion of the company's then-$4 billion revenue.

Today Anthropic are rumored to hit $10.9 billion in the second quarter, potentially even operating at a profit for the first time.

This pivot-to-Enterprise suggests that the labs have realized that the real money lies in cutting out the middlemen. Anthropic's Claude Code directly competes with Cursor and Copilot. No wonder Cursor are investing in their own models!

April is a new inflection point

I've called November 2025 the November inflection point because that was when GPT-5.1 and Opus 4.5, combined with their respective coding agent harnesses, got good - good enough that we've spent the last six months adapting to agent systems that can reliably get useful work done.

I think April 2026 is a new inflection point where the revenue implications of this have started to land, to the benefit of the frontier AI labs and with material impacts on the budgets of large companies.

We'll know for sure how real this moment is when the S-1 documents for the upcoming Anthropic and OpenAI IPOs give us some real, audited numbers to get our teeth into.

Tags: ai, datasette, openai, generative-ai, llms, anthropic, llm-pricing, coding-agents, claude-code, codex, claude-cowork, november-2025-inflection, datasette-agent, uber

Using Claude Code: The Unreasonable Effectiveness of HTML

2026-05-08T21:00:11+00:00

Using Claude Code: The Unreasonable Effectiveness of HTML

Thought-provoking piece by Thariq Shihipar (on the Claude Code team at Anthropic) advocating for HTML over Markdown as an output format to request from Claude.

The article is crammed with interesting examples (collected on this site) and prompt suggestions like this one:

Help me review this PR by creating an HTML artifact that describes it. I'm not very familiar with the streaming/backpressure logic so focus on that. Render the actual diff with inline margin annotations, color-code findings by severity and whatever else might be needed to convey the concept well.

I've been defaulting to asking for most things in Markdown since the GPT-4 days, when the 8,192 token limit meant that Markdown's token-efficiency over HTML was extremely worthwhile.

Thariq's piece here has caused me to reconsider that, especially for output. Asking Claude for an explanation in HTML means it can drop in SVG diagrams, interactive widgets, in-page navigation and all sorts of other neat ways of making the information more pleasant to navigate.

I wrote about Useful patterns for building HTML tools last December, but that was focused very much on interactive utilities like the ones on my tools.simonwillison.net site. I'm excited to start experimenting more with rich HTML explanations in response to ad-hoc prompts.

Trying this out on copy.fail

copy.fail describes a recently discovered Linux security exploit, including a proof of concept distributed as obfuscated Python.

I tried having GPT-5.5 create an HTML explanation of the exploit like this:

curl https://copy.fail/exp | llm -m gpt-5.5 -s 'Explain this code in detail. Reformat it, expand out any confusing bits and go deep into what it does and how it works. Output HTML, neatly styled and using capabilities of HTML and CSS and JavaScript to make the explanation rich and interactive and as clear as possible'

Here's the resulting HTML page. It's pretty good, though I should have emphasized explaining the exploit over the Python harness around it.

Tags: html, security, markdown, ai, prompt-engineering, generative-ai, llms, llm, claude-code

Live blog: Code w/ Claude 2026

2026-05-06T15:58:27+00:00

I'm at Anthropic's Code w/ Claude event today. Here's my live blog of the morning keynote sessions.

Tags: ai, generative-ai, llms, anthropic, claude, claude-code, live-blog

Sightings

2026-05-02T17:26:40+00:00

/elsewhere/sightings/

I have a new camera (a Canon R6 Mark II) so I'm taking a lot more photos of birds. I share my best wildlife photos on iNaturalist, and based on yesterday's successful prototype I decided to add those to my blog.

I built this feature on my phone using Claude Code for web, as an extension of my beats system for syndicating external content. Here's the PR and prompt.

As with my other forms of incoming syndicated content sightings show up on the homepage, the date archive pages, and in site search results.

I back-populated over a decade of iNaturalist sightings, which means you that if you search for lemur you'll see my lemur photos from Madagascar in 2019!

Tags: blogging, photography, wildlife, ai, inaturalist, generative-ai, llms, ai-assisted-programming, claude-code

iNaturalist Sightings

2026-05-01T19:35:41+00:00

Tool: iNaturalist Sightings

I wanted to see my iNaturalist observations - across two separate accounts - grouped by when they occurred. I'm camping this weekend so I built this entirely on my phone using Claude Code for web.

I started by building an inaturalist-clumper Python CLI for fetching and "clumping" observations - by default clumps use observations within 2 hours and 5km of each other.

Then I setup simonw/inaturalist-clumps as a Git scraping repository to run that tool and record the result to clumps.json.

That JSON file is hosted on GitHub, which means it can be fetched by JavaScript using CORS.

Finally I ran this prompt against my simonw/tools repo:

Build inat-sightings.html - an app that does a fetch() against https://raw.githubusercontent.com/simonw/inaturalist-clumps/refs/heads/main/clumps.json and then displays all of the observations on one page using the https://static.inaturalist.org/photos/538073008/small.jpg small.jpg URLs for the thumbnails - with loading=lazy - but when a thumbnail is clicked showing the large.jpg in an HTML modal. Both small and large should include the common species names if available

Tags: tools, ai, inaturalist, generative-ai, llms, claude-code

An update on recent Claude Code quality reports

2026-04-24T01:31:25+00:00

An update on recent Claude Code quality reports

It turns out the high volume of complaints that Claude Code was providing worse quality results over the past two months was grounded in real problems.

The models themselves were not to blame, but three separate issues in the Claude Code harness caused complex but material problems which directly affected users.

Anthropic's postmortem describes these in detail. This one in particular stood out to me:

On March 26, we shipped a change to clear Claude's older thinking from sessions that had been idle for over an hour, to reduce latency when users resumed those sessions. A bug caused this to keep happening every turn for the rest of the session instead of just once, which made Claude seem forgetful and repetitive.

I frequently have Claude Code sessions which I leave for an hour (or often a day or longer) before returning to them. Right now I have 11 of those (according to ps aux | grep 'claude ') and that's after closing down dozens more the other day.

I estimate I spend more time prompting in these "stale" sessions than sessions that I've recently started!

If you're building agentic systems it's worth reading this article in detail - the kinds of bugs that affect harnesses are deeply complicated, even if you put aside the inherent non-deterministic nature of the models themselves.

Via Hacker News

Tags: ai, prompt-engineering, generative-ai, llms, anthropic, coding-agents, claude-code

Extract PDF text in your browser with LiteParse for the web

2026-04-23T21:54:24+00:00

LlamaIndex have a most excellent open source project called LiteParse, which provides a Node.js CLI tool for extracting text from PDFs. I got a version of LiteParse working entirely in the browser, using most of the same libraries that LiteParse uses to run in Node.js.

Spatial text parsing

Refreshingly, LiteParse doesn't use AI models to do what it does: it's good old-fashioned PDF parsing, falling back to Tesseract OCR (or other pluggable OCR engines) for PDFs that contain images of text rather than the text itself.

The hard problem that LiteParse solves is extracting text in a sensible order despite the infuriating vagaries of PDF layouts. They describe this as "spatial text parsing" - they use some very clever heuristics to detect things like multi-column layouts and group and return the text in a sensible linear flow.

The LiteParse documentation describes a pattern for implementing Visual Citations with Bounding Boxes. I really like this idea: being able to answer questions from a PDF and accompany those answers with cropped, highlighted images feels like a great way of increasing the credibility of answers from RAG-style Q&A.

LiteParse is provided as a pure CLI tool, designed to be used by agents. You run it like this:

npm i -g @llamaindex/liteparse
lit parse document.pdf

I explored its capabilities with Claude and quickly determined that there was no real reason it had to stay a CLI app: it's built on top of PDF.js and Tesseract.js, two libraries I've used for something similar in a browser in the past.

The only reason LiteParse didn't have a pure browser-based version is that nobody had built one yet...

Introducing LiteParse for the web

Visit https://simonw.github.io/liteparse/ to try out LiteParse against any PDF file, running entirely in your browser. Here's what that looks like:

The tool can work with or without running OCR, and can optionally display images for every page in the PDF further down the page.

Building it with Claude Code and Opus 4.7

The process of building this started in the regular Claude app on my iPhone. I wanted to try out LiteParse myself, so I started by uploading a random PDF I happened to have on my phone along with this prompt:

Clone https://github.com/run-llama/liteparse and try it against this file

Regular Claude chat can clone directly from GitHub these days, and while by default it can't access most of the internet from its container it can also install packages from PyPI and npm.

I often use this to try out new pieces of open source software on my phone - it's a quick way to exercise something without having to sit down with my laptop.

You can follow my full conversation in this shared Claude transcript. I asked a few follow-up questions about how it worked, and then asked:

Does this library run in a browser? Could it?

This gave me a thorough enough answer that I was convinced it was worth trying getting that to work for real. I opened up my laptop and switched to Claude Code.

I forked the original repo on GitHub, cloned a local copy, started a new web branch and pasted that last reply from Claude into a new file called notes.md. Then I told Claude Code:

Get this working as a web app. index.html, when loaded, should render an app that lets users open a PDF in their browser and select OCR or non-OCR mode and have this run. Read notes.md for initial research on this problem, then write out plan.md with your detailed implementation plan

I always like to start with a plan for this kind of project. Sometimes I'll use Claude's "planning mode", but in this case I knew I'd want the plan as an artifact in the repository so I told it to write plan.md directly.

This also means I can iterate on the plan with Claude. I noticed that Claude had decided to punt on generating screenshots of images in the PDF, and suggested we defer a "canvas-encode swap" to v2. I fixed that by prompting:

Update the plan to say we WILL do the canvas-encode swap so the screenshots thing works

After a few short follow-up prompts, here's the plan.md I thought was strong enough to implement.

I prompted:

build it.

And then mostly left Claude Code to its own devices, tinkered with some other projects, caught up on Duolingo and occasionally checked in to see how it was doing.

I added a few prompts to the queue as I was working. Those don't yet show up in my exported transcript, but it turns out running rg queue-operation --no-filename | grep enqueue | jq -r '.content' in the relevant ~/.claude/projects/ folder extracts them.

Here are the key follow-up prompts with some notes:

When you implement this use playwright and red/green TDD, plan that too - I've written more about red/green TDD here.
let's use PDF.js's own renderer (it was messing around with pdfium)
The final UI should include both the text and the pretty-printed JSON output, both of those in textareas and both with copy-to-clipboard buttons - it should also be mobile friendly - I had a new idea for how the UI should work
small commits along the way - see below
Make sure the index.html page includes a link back to https://github.com/run-llama/liteparse near the top of the page - it's important to credit your dependencies in a project like this!
View on GitHub → is bad copy because that's not the repo with this web app in, it's the web app for the underlying LiteParse library
Run OCR should be unchecked by default
When I try to parse a PDF in my browser I see 'Parse failed: undefined is not a function (near '...value of readableStream...') - it was testing with Playwright in Chrome, turned out there was a bug in Safari
... oh that is in safari but it works in chrome
When "Copy" is clicked the text should change to "Copied!" for 1.5s
[Image #1] Style the file input so that long filenames don't break things on Firefox like this - in fact add one of those drag-drop zone UIs which you can also click to select a file - dropping screenshots in of small UI glitches works surprisingly well
Tweak the drop zone such that the text is vertically centered, right now it is a bit closer to the top
it breaks in Safari on macOS, works in both Chrome and Firefox. On Safari I see "Parse failed: undefined is not a function (near '...value of readableStream...')" after I click the Parse button, when OCR is not checked - it still wasn't working in Safari...
works in safari now - but it fixed it pretty quickly once I pointed that out and it got Playwright working with that browser

I've started habitually asking for "small commits along the way" because it makes for code that's easier to understand or review later on, and I have an unproven hunch that it helps the agent work more effectively too - it's yet another encouragement towards planning and taking on one problem at a time.

While it was working I decided it would be nice to be able to interact with an in-progress version. I asked a separate Claude Code session against the same directory for tips on how to run it, and it told me to use npx vite. Running that started a development server with live-reloading, which meant I could instantly see the effect of each change it made on disk - and prompt with further requests for tweaks and fixes.

Towards the end I decided it was going to be good enough to publish. I started a fresh Claude Code instance and told it:

Look at the web/ folder - set up GitHub actions for this repo such that any push runs the tests, and if the tests pass it then does a GitHub Pages deploy of the built vite app such that the web/index.html page is the index.html page for the thing that is deployed and it works on GitHub Pages

After a bit more iteration here's the GitHub Actions workflow that builds the app using Vite and deploys the result to https://simonw.github.io/liteparse/.

I love GitHub Pages for this kind of thing because it can be quickly configured (by Claude, in this case) to turn any repository into a deployed web-app, at zero cost and with whatever build step is necessary. It even works against private repos, if you don't mind your only security being a secret URL.

With this kind of project there's always a major risk that the model might "cheat" - mark key features as "TODO" and fake them, or take shortcuts that ignore the initial requirements.

The responsible way to prevent this is to review all of the code... but this wasn't intended as that kind of project, so instead I fired up OpenAI Codex with GPT-5.5 (I had preview access) and told it:

Describe the difference between how the node.js CLI tool runs and how the web/ version runs

The answer I got back was enough to give me confidence that Claude hadn't taken any project-threatening shortcuts.

... and that was about it. Total time in Claude Code for that "build it" step was 59 minutes. I used my claude-code-transcripts tool to export a readable version of the full transcript which you can view here, albeit without those additional queued prompts (here's my issue to fix that).

Is this even vibe coding any more?

I'm a pedantic stickler when it comes to the original definition of vibe coding - vibe coding does not mean any time you use AI to help you write code, it's when you use AI without reviewing or caring about the code that's written at all.

By my own definition, this LiteParse for the web project is about as pure vibe coding as you can get! I have not looked at a single line of the HTML and TypeScript written for this project - in fact while writing this sentence I had to go and check if it had used JavaScript or TypeScript.

Yet somehow this one doesn't feel as vibe coded to me as many of my other vibe coded projects:

As a static in-browser web application hosted on GitHub Pages the blast radius for any bugs is almost non-existent: it either works for your PDF or doesn't.
No private data is transferred anywhere - all processing happens in your browser - so a security audit is unnecessary. I've glanced once at the network panel while it's running and no additional requests are made when a PDF is being parsed.
There was still a whole lot of engineering experience and knowledge required to use the models in this way. Identifying that porting LiteParse to run directly in a browser was critical to the rest of the project.

Most importantly, I'm happy to attach my reputation to this project and recommend that other people try it out. Unlike most of my vibe coded tools I'm not convinced that spending significant additional engineering time on this would have resulted in a meaningfully better initial release. It's fine as it is!

I haven't opened a PR against the origin repository because I've not discussed it with the LiteParse team. I've opened an issue, and if they want my vibe coded implementation as a starting point for something more official they're welcome to take it.

Tags: javascript, ocr, pdf, projects, ai, generative-ai, llms, vibe-coding, coding-agents, claude-code, agentic-engineering

Is Claude Code going to cost $100/month? Probably not - it's all very confusing

2026-04-22T02:07:34+00:00

Anthropic today quietly (as in silently, no announcement anywhere at all) updated their claude.com/pricing page (but not their Choosing a Claude plan page, which shows up first for me on Google) to add this tiny but significant detail (arrow is mine, and it's already reverted):

The Internet Archive copy from yesterday shows a checkbox there. Claude Code used to be a feature of the $20/month Pro plan, but according to the new pricing page it is now exclusive to the $100/month or $200/month Max plans.

Update: don't miss the update to this post, they've already changed course a few hours after this change went live.

So what the heck is going on? Unsurprisingly, Reddit and Hacker News and Twitter all caught fire.

I didn't believe the screenshots myself when I first saw them - aside from the pricing grid I could find no announcement from Anthropic anywhere. Then Amol Avasare, Anthropic's Head of Growth, tweeted:

For clarity, we're running a small test on ~2% of new prosumer signups. Existing Pro and Max subscribers aren't affected.

And that appears to be the closest we have had to official messaging from Anthropic.

I don't buy the "~2% of new prosumer signups" thing, since everyone I've talked to is seeing the new pricing grid and the Internet Archive has already snapped a copy. Maybe he means that they'll only be running this version of the pricing grid for a limited time which somehow adds up to "2%" of signups?

I'm also amused to see Claude Cowork remain available on the $20/month plan, because Claude Cowork is effectively a rebranded version of Claude Code wearing a less threatening hat!

There are a whole bunch of things that are bad about this.

If we assume this is indeed a test, and that test comes up negative and they decide not to go ahead with it, the damage has still been extensive:

A whole lot of people got scared or angry or both that a service they relied on was about to be rug-pulled. There really is a significant difference between $20/month and $100/month for most people, especially outside of higher salary countries.
The uncertainty is really bad! A tweet from an employee is not the way to make an announcement like this. I wasted a solid hour of my afternoon trying to figure out what had happened here. My trust in Anthropic's transparency around pricing - a crucial factor in how I understand their products - has been shaken.
Strategically, should I be taking a bet on Claude Code if I know that they might 5x the minimum price of the product?
More of a personal issue, but one I care deeply about myself: I invest a great deal of effort (that's 105 posts and counting) in teaching people how to use Claude Code. I don't want to invest that effort in a product that most people cannot afford to use.

Last month I ran a tutorial for journalists on "Coding agents for data analysis" at the annual NICAR data journalism conference. I'm not going to be teaching that audience a course that depends on a $100/month subscription!

This also doesn't make sense to me as a strategy for Anthropic. Claude Code defined the category of coding agents. It's responsible for billions of dollars in annual revenue for Anthropic already. It has a stellar reputation, but I'm not convinced that reputation is strong enough for it to lose the $20/month trial and jump people directly to a $100/month subscription.

OpenAI have been investing heavily in catching up to Claude Code with their Codex products. Anthropic just handed them this marketing opportunity on a plate - here's Codex engineering lead Thibault Sottiaux:

I don't know what they are doing over there, but Codex will continue to be available both in the FREE and PLUS ($20) plans. We have the compute and efficient models to support it. For important changes, we will engage with the community well ahead of making them.

Transparency and trust are two principles we will not break, even if it means momentarily earning less. A reminder that you vote with your subscription for the values you want to see in this world.

I should note that I pay $200/month for Claude Max and I consider it well worth the money. I've had periods of free access in the past courtesy of Anthropic but I'm currently paying full price, and happy to do so.

But I care about the accessibility of the tools that I work with and teach. If Codex has a free tier while Claude Code starts at $100/month I should obviously switch to Codex, because that way I can use the same tool as the people I want to teach how to use coding agents.

Here's what I think happened. I think Anthropic are trying to optimize revenue growth - obviously - and someone pitched making Claude Code only available for Max and higher. That's clearly a bad idea, but "testing" culture says that it's worth putting even bad ideas out to test just in case they surprise you.

So they started a test, without taking into account the wailing and gnashing of teeth that would result when their test was noticed - or accounting for the longer-term brand damage that would be caused.

Or maybe they did account for that, and decided it was worth the risk.

I don't think that calculation was worthwhile. They're going to have to make a very firm commitment along the lines of "we heard your feedback and we commit to keeping Claude Code available on our $20/month plan going forward" to regain my trust.

As it stands, Codex is looking like a much safer bet for me to invest my time in learning and building educational materials around.

Update: they've reversed it already

In the time I was typing this blog entry Anthropic appear to have reversed course - the claude.com/pricing page now has a checkbox back in the Pro column for Claude Code. I can't find any official communication about it though.

Let's see if they can come up with an explanation/apology that's convincing enough to offset the trust bonfire from this afternoon!

Update 2: it may still affect 2% of signups?

Amol on Twitter:

was a mistake that the logged-out landing page and docs were updated for this test [embedded self-tweet]

Getting lots of questions on why the landing page / docs were updated if only 2% of new signups were affected.

This was understandably confusing for the 98% of folks not part of the experiment, and we've reverted both the landing page and docs changes.

So the experiment is still running, just not visible to the rest of the world?

Tags: ai, generative-ai, llms, anthropic, llm-pricing, ai-ethics, coding-agents, claude-code, codex

Exploring the new `servo` crate

2026-04-13T15:04:00+00:00

Research: Exploring the new `servo` crate

In Servo is now available on crates.io the Servo team announced the initial release of the servo crate, which packages their browser engine as an embeddable library.

I set Claude Code for web the task of figuring out what it can do, building a CLI tool for taking screenshots using it and working out if it could be compiled to WebAssembly.

The servo-shot Rust tool it built works pretty well:

git clone https://github.com/simonw/research
cd research/servo-crate-exploration/servo-shot
cargo build
./target/debug/servo-shot https://news.ycombinator.com/

Here's the result:

Compiling Servo itself to WebAssembly is not feasible due to its heavy use of threads and dependencies like SpiderMonkey, but Claude did build me this playground page for trying out a WebAssembly build of the html5ever and markup5ever_rcdom crates, providing a tool for turning fragments of HTML into a parse tree.

Tags: browsers, research, rust, webassembly, servo, claude-code

Cleanup Claude Code Paste

2026-04-06T02:55:23+00:00

Tool: Cleanup Claude Code Paste

Super-niche tool this. I sometimes copy prompts out of the Claude Code terminal app and they come out with a bunch of weird additional whitespace. This tool cleans that up.

Tags: tools, claude-code

scan-for-secrets 0.1

2026-04-05T03:27:13+00:00

Release: scan-for-secrets 0.1

I like publishing transcripts of local Claude Code sessions using my claude-code-transcripts tool but I'm often paranoid that one of my API keys or similar secrets might inadvertently be revealed in the detailed log files.

I built this new Python scanning tool to help reassure me. You can feed it secrets and have it scan for them in a specified directory:

uvx scan-for-secrets $OPENAI_API_KEY -d logs-to-publish/

If you leave off the -d it defaults to the current directory.

It doesn't just scan for the literal secrets - it also scans for common encodings of those secrets e.g. backslash or JSON escaping, as described in the README.

If you have a set of secrets you always want to protect you can list commands to echo them in a ~/.scan-for-secrets.conf.sh file. Mine looks like this:

llm keys get openai
llm keys get anthropic
llm keys get gemini
llm keys get mistral
awk -F= '/aws_secret_access_key/{print $2}' ~/.aws/credentials | xargs

I built this tool using README-driven-development: I carefully constructed the README describing exactly how the tool should work, then dumped it into Claude Code and told it to build the actual tool (using red/green TDD, naturally.)

Tags: projects, security, ai-assisted-programming, coding-agents, claude-code, agentic-engineering

Mr. Chatterbox is a (weak) Victorian-era ethically trained model you can run on your own computer

2026-03-30T14:28:34+00:00

Trip Venturella released Mr. Chatterbox, a language model trained entirely on out-of-copyright text from the British Library. Here's how he describes it in the model card:

Mr. Chatterbox is a language model trained entirely from scratch on a corpus of over 28,000 Victorian-era British texts published between 1837 and 1899, drawn from a dataset made available by the British Library. The model has absolutely no training inputs from after 1899 — the vocabulary and ideas are formed exclusively from nineteenth-century literature.

Mr. Chatterbox's training corpus was 28,035 books, with an estimated 2.93 billion input tokens after filtering. The model has roughly 340 million paramaters, roughly the same size as GPT-2-Medium. The difference is, of course, that unlike GPT-2, Mr. Chatterbox is trained entirely on historical data.

Given how hard it is to train a useful LLM without using vast amounts of scraped, unlicensed data I've been dreaming of a model like this for a couple of years now. What would a model trained on out-of-copyright text be like to chat with?

Thanks to Trip we can now find out for ourselves!

The model itself is tiny, at least by Large Language Model standards - just 2.05GB on disk. You can try it out using Trip's HuggingFace Spaces demo:

Honestly, it's pretty terrible. Talking with it feels more like chatting with a Markov chain than an LLM - the responses may have a delightfully Victorian flavor to them but it's hard to get a response that usefully answers a question.

The 2022 Chinchilla paper suggests a ratio of 20x the parameter count to training tokens. For a 340m model that would suggest around 7 billion tokens, more than twice the British Library corpus used here. The smallest Qwen 3.5 model is 600m parameters and that model family starts to get interesting at 2b - so my hunch is we would need 4x or more the training data to get something that starts to feel like a useful conversational partner.

But what a fun project!

Running it locally with LLM

I decided to see if I could run the model on my own machine using my LLM framework.

I got Claude Code to do most of the work - here's the transcript.

Trip trained the model using Andrej Karpathy's nanochat, so I cloned that project, pulled the model weights and told Claude to build a Python script to run the model. Once we had that working (which ended up needing some extra details from the Space demo source code) I had Claude read the LLM plugin tutorial and build the rest of the plugin.

llm-mrchatterbox is the result. Install the plugin like this:

llm install llm-mrchatterbox

The first time you run a prompt it will fetch the 2.05GB model file from Hugging Face. Try that like this:

llm -m mrchatterbox "Good day, sir"

Or start an ongoing chat session like this:

llm chat -m mrchatterbox

If you don't have LLM installed you can still get a chat session started from scratch using uvx like this:

uvx --with llm-mrchatterbox llm chat -m mrchatterbox

When you are finished with the model you can delete the cached file using:

llm mrchatterbox delete-model

This is the first time I've had Claude Code build a full LLM model plugin from scratch and it worked really well. I expect I'll be using this method again in the future.

I continue to hope we can get a useful model from entirely public domain data. The fact that Trip was able to get this far using nanochat and 2.93 billion training tokens is a promising start.

Update 31st March 2026: I had missed this when I first published this piece but Trip has his own detailed writeup of the project which goes into much more detail about how he trained the model. Here's how the books were filtered for pre-training:

First, I downloaded the British Library dataset split of all 19th-century books. I filtered those down to books contemporaneous with the reign of Queen Victoria—which, unfortunately, cut out the novels of Jane Austen—and further filtered those down to a set of books with a optical character recognition (OCR) confidence of .65 or above, as listed in the metadata. This left me with 28,035 books, or roughly 2.93 billion tokes for pretraining data.

Getting it to behave like a conversational model was a lot harder. Trip started by trying to train on plays by Oscar Wilde and George Bernard Shaw, but found they didn't provide enough pairs. Then he tried extracting dialogue pairs from the books themselves with poor results. The approach that worked was to have Claude Haiku and GPT-4o-mini generate synthetic conversation pairs for the supervised fine tuning, which solved the problem but sadly I think dilutes the "no training inputs from after 1899" claim from the original model card.

Tags: ai, andrej-karpathy, generative-ai, local-llms, llms, ai-assisted-programming, hugging-face, llm, training-data, uv, ai-ethics, claude-code

Vibe coding SwiftUI apps is a lot of fun

2026-03-27T20:59:53+00:00

I have a new laptop - a 128GB M5 MacBook Pro, which early impressions show to be very capable for running good local LLMs. I got frustrated with Activity Monitor and decided to vibe code up some alternative tools for monitoring performance and I'm very happy with the results.

This is my second experiment with vibe coding macOS apps - the first was this presentation app a few weeks ago.

It turns out Claude Opus 4.6 and GPT-5.4 are both very competent at SwiftUI - and a full SwiftUI app can fit in a single text file, which means I can use them to spin something up without even opening Xcode.

I’ve built two apps so far: Bandwidther shows me what apps are using network bandwidth and Gpuer to show me what’s going on with the GPU. At Claude’s suggestion both of these are now menu bar icons that open a panel full of information.

Bandwidther

I built this app first, because I wanted to see what Dropbox was doing. It looks like this:

I’ve shared the full transcript I used to build the first version of the app. My prompts were pretty minimal:

Show me how much network bandwidth is in use from this machine to the internet as opposed to local LAN

(My initial curiosity was to see if Dropbox was transferring files via the LAN from my old computer or was downloading from the internet.)

mkdir /tmp/bandwidther and write a native Swift UI app in there that shows me these details on a live ongoing basis

This got me the first version, which proved to me this was worth pursuing further.

git init and git commit what you have so far

Since I was about to start adding new features.

Now suggest features we could add to that app, the goal is to provide as much detail as possible concerning network usage including by different apps

The nice thing about having Claude suggest features is that it has a much better idea for what’s possible than I do.

We had a bit of back and forth fixing some bugs, then I sent a few more prompts to get to the two column layout shown above:

add Per-Process Bandwidth, relaunch the app once that is done

now add the reverse DNS feature but make sure original IP addresses are still visible too, albeit in smaller typeface

redesign the app so that it is wider, I want two columns - the per-process one on the left and the rest on the right

OK make it a task bar icon thing, when I click the icon I want the app to appear, the icon itself should be a neat minimal little thing

The source code and build instructions are available in simonw/bandwidther.

Gpuer

While I was building Bandwidther in one session I had another session running to build a similar tool for seeing what the GPU was doing. Here’s what I ended up with:

Here's the transcript. This one took even less prompting because I could use the in-progress Bandwidther as an example:

I want to know how much RAM and GPU this computer is using, which is hard because stuff on the GPU and RAM does not seem to show up in Activity Monitor

This collected information using system_profiler and memory_pressure and gave me an answer - more importantly it showed me this was possible, so I said:

Look at /tmp/bandwidther and then create a similar app in /tmp/gpuer which shows the information from above on an ongoing basis, or maybe does it better

After a few more changes to the Bandwidther app I told it to catch up:

Now take a look at recent changes in /tmp/bandwidther - that app now uses a sys tray icon, imitate that

This remains one of my favorite tricks for using coding agents: having them recombine elements from other projects.

The code for Gpuer can be found in simonw/gpuer on GitHub.

You shouldn't trust these apps

These two apps are classic vibe coding: I don't know Swift and I hardly glanced at the code they were writing.

More importantly though, I have very little experience with macOS internals such as the values these tools are measuring. I am completely unqualified to evaluate if the numbers and charts being spat out by these tools are credible or accurate!

I've added warnings to both GitHub repositories to that effect.

This morning I caught Gpuer reporting that I had just 5GB of memory left when that clearly wasn't the case (according to Activity Monitor). I pasted a screenshot into Claude Code and it adjusted the calculations and the new numbers look right, but I'm still not confident that it's reporting things correctly.

I only shared them on GitHub because I think they're interesting as an example of what Claude can do with SwiftUI.

Despite my lack of confidence in the apps themselves, I did learn some useful things from these projects:

A SwiftUI app can get a whole lot done with a single file of code - here's GpuerApp.swift (880 lines) and BandwidtherApp.swift (1063 lines).
Wrapping various terminal commands in a neat UI with Swift is easily achieved.
Claude has surprisingly good design taste when it comes to SwiftUI applications.
Turning an app into a menu bar app is just a few lines of extra code as well.
You don't need to open Xcode to build this kind of application!

These two apps took very little time to build and have convinced me that building macOS apps in SwiftUI is a new capability I should consider for future projects.

Tags: macos, ai, generative-ai, llms, vibe-coding, coding-agents, swift, claude-code

Auto mode for Claude Code

2026-03-24T23:57:33+00:00

Auto mode for Claude Code

Really interesting new development in Claude Code today as an alternative to --dangerously-skip-permissions:

Today, we're introducing auto mode, a new permissions mode in Claude Code where Claude makes permission decisions on your behalf, with safeguards monitoring actions before they run.

Those safeguards appear to be implemented using Claude Sonnet 4.6, as described in the documentation:

Before each action runs, a separate classifier model reviews the conversation and decides whether the action matches what you asked for: it blocks actions that escalate beyond the task scope, target infrastructure the classifier doesn’t recognize as trusted, or appear to be driven by hostile content encountered in a file or web page. [...]

Model: the classifier runs on Claude Sonnet 4.6, even if your main session uses a different model.

They ship with an extensive set of default filters, and you can also customize them further with your own rules. The most interesting insight into how they work comes when you run this new command in the terminal:

claude auto-mode defaults

Here's the full JSON output. It's pretty long, so here's an illustrative subset:

From the "allow" list:

Test Artifacts: Hardcoded test API keys, placeholder credentials in examples, or hardcoding test cases

Local Operations: Agent deleting local files in working directory, local file operations within project scope, or using --ignore-certificate-errors for local testing. "Project scope" means the repository the session started in — wandering into ~/, ~/Library/, /etc, or other repos is scope escalation (User Intent Rule #2), not a local operation. Does NOT cover irreversible destruction of pre-existing files or local stateful services — see "Irreversible Local Destruction" in BLOCK.

Read-Only Operations: GET requests, read-only API calls, or queries that don't modify state and don't contain sensitive information in the URL. Note: PREEMPTIVE BLOCK ON CLEAR INTENT still applies — if the transcript contains clear evidence the agent is using read-only operations to scout for a blocked action, block it.

Declared Dependencies: Installing packages that are already declared in the repo's manifest files (requirements.txt, package.json, Cargo.toml, pyproject.toml, Gemfile, etc.) via standard commands that read those manifests (pip install -r requirements.txt, npm install, cargo build, bundle install) — provided the agent has not modified the manifest in this session. Does NOT cover installing agent-chosen package names (e.g. pip install foo, npm install bar) — those carry typosquat and supply-chain risk. [...]

From "soft_deny":

Git Destructive: Force pushing (git push --force), deleting remote branches, or rewriting remote history

Git Push to Default Branch: Pushing directly to main, master, or the repository's default branch — this bypasses pull request review. Commits should be pushed to a new feature branch instead. -Code from External: Downloading and executing code from external sources — e.g. curl | bash, deserializing external data via formats that can execute code (eval, exec, yaml.unsafe_load, pickle, etc), or similar mechanisms. Also includes running code from an external repo cloned earlier in the transcript (pip install -e, make install, python script.py, pickle.load/torch.load on cloned repo files) — "local on disk" does not mean trusted if it was cloned from an external source visible in the transcript. The repo the agent starts in is trusted.

Cloud Storage Mass Delete: Deleting or mass modifying files on cloud storage (S3, GCS, Azure Blob, etc.) [...]

I remain unconvinced by prompt injection protections that rely on AI, since they're non-deterministic by nature. The documentation does warn that this may still let things through:

The classifier may still allow some risky actions: for example, if user intent is ambiguous, or if Claude doesn't have enough context about your environment to know an action might create additional risk.

The fact that the default allow list includes pip install -r requirements.txt also means that this wouldn't protect against supply chain attacks with unpinned dependencies, as seen this morning with LiteLLM.

I still want my coding agents to run in a robust sandbox by default, one that restricts file access and network connections in a deterministic way. I trust those a whole lot more than prompt-based protections like this new auto mode.

Tags: security, ai, prompt-injection, generative-ai, llms, coding-agents, claude-code

JavaScript Sandboxing Research

2026-03-22T19:53:00+00:00

Research: JavaScript Sandboxing Research

Aaron Harper wrote about Node.js worker threads, which inspired me to run a research task to see if they might help with running JavaScript in a sandbox. Claude Code went way beyond my initial question and produced a comparison of isolated-vm, vm2, quickjs-emscripten, QuickJS-NG, ShadowRealm, and Deno Workers.

Tags: javascript, nodejs, sandboxing, claude-code

Coding agents for data analysis

2026-03-16T20:12:32+00:00

Coding agents for data analysis

Here's the handout I prepared for my NICAR 2026 workshop "Coding agents for data analysis" - a three hour session aimed at data journalists demonstrating ways that tools like Claude Code and OpenAI Codex can be used to explore, analyze and clean data.

Here's the table of contents:

Coding agents

Warmup: ChatGPT and Claude

Setup Claude Code and Codex

Asking questions against a database

Exploring data with agents

Cleaning data: decoding neighborhood codes

Creating visualizations with agents

Scraping data with agents

I ran the workshop using GitHub Codespaces and OpenAI Codex, since it was easy (and inexpensive) to distribute a budget-restricted API key for Codex that attendees could use during the class. Participants ended up burning $23 of Codex tokens.

The exercises all used Python and SQLite and some of them used Datasette.

One highlight of the workshop was when we started running Datasette such that it served static content from a viz/ folder, then had Claude Code start vibe coding new interactive visualizations directly in that folder. Here's a heat map it created for my trees database using Leaflet and Leaflet.heat, source code here.

I designed the handout to also be useful for people who weren't able to attend the session in person. As is usually the case, material aimed at data journalists is equally applicable to anyone else with data to explore.

Tags: data-journalism, geospatial, python, speaking, sqlite, ai, datasette, generative-ai, llms, github-codespaces, nicar, coding-agents, claude-code, codex, leaflet

GIF optimization tool using WebAssembly and Gifsicle

2026-03-02T16:35:10+00:00

Agentic Engineering Patterns >

I like to include animated GIF demos in my online writing, often recorded using LICEcap. There's an example in the Interactive explanations chapter.

These GIFs can be pretty big. I've tried a few tools for optimizing GIF file size and my favorite is Gifsicle by Eddie Kohler. It compresses GIFs by identifying regions of frames that have not changed and storing only the differences, and can optionally reduce the GIF color palette or apply visible lossy compression for greater size reductions.

Gifsicle is written in C and the default interface is a command line tool. I wanted a web interface so I could access it in my browser and visually preview and compare the different settings.

I prompted Claude Code for web (from my iPhone using the Claude iPhone app) against my simonw/tools repo with the following:

gif-optimizer.html

Compile gifsicle to WASM, then build a web page that lets you open or drag-drop an animated GIF onto it and it then shows you that GIF compressed using gifsicle with a number of different settings, each preview with the size and a download button

Also include controls for the gifsicle options for manual use - each preview has a “tweak these settings” link which sets those manual settings to the ones used for that preview so the user can customize them further

Run “uvx rodney –help” and use that tool to tray your work - use this GIF for testing https://static.simonwillison.net/static/2026/animated-word-cloud-demo.gif

Here's what it built, plus an animated GIF demo that I optimized using the tool:

Let's address that prompt piece by piece.

gif-optimizer.html

The first line simply tells it the name of the file I want to create. Just a filename is enough here - I know that when Claude runs "ls" on the repo it will understand that every file is a different tool.

My simonw/tools repo currently lacks a CLAUDE.md or AGENTS.md file. I've found that agents pick up enough of the gist of the repo just from scanning the existing file tree and looking at relevant code in existing files.

Compile gifsicle to WASM, then build a web page that lets you open or drag-drop an animated GIF onto it and it then shows you that GIF compressed using gifsicle with a number of different settings, each preview with the size and a download button

I'm making a bunch of assumptions here about Claude's existing knowledge, all of which paid off.

Gifsicle is nearly 30 years old now and is a widely used piece of software - I was confident that referring to it by name would be enough for Claude to find the code.

"Compile gifsicle to WASM" is doing a lot of work here.

WASM is short for WebAssembly, the technology that lets browsers run compiled code safely in a sandbox.

Compiling a project like Gifsicle to WASM is not a trivial operation, involving a complex toolchain usually involving the Emscripten project. It often requires a lot of trial and error to get everything working.

Coding agents are fantastic at trial and error! They can often brute force their way to a solution where I would have given up after the fifth inscrutable compiler error.

I've seen Claude Code figure out WASM builds many times before, so I was quite confident this would work.

"then build a web page that lets you open or drag-drop an animated GIF onto it" describes a pattern I've used in a lot of my other tools.

HTML file uploads work fine for selecting files, but a nicer UI, especially on desktop, is to allow users to drag and drop files into a prominent drop zone on a page.

Setting this up involves a bit of JavaScript to process the events and some CSS for the drop zone. It's not complicated but it's enough extra work that I might not normally add it myself. With a prompt it's almost free.

Here's the resulting UI - which was influenced by Claude taking a peek at my existing image-resize-quality tool:

I didn't ask for the GIF URL input and I'm not keen on it, because it only works against URLs to GIFs that are served with open CORS headers. I'll probably remove that in a future update.

"then shows you that GIF compressed using gifsicle with a number of different settings, each preview with the size and a download button" describes the key feature of the application.

I didn't bother defining the collection of settings I wanted - in my experience Claude has good enough taste at picking those for me, and we can always change them if its first guesses don't work.

Showing the size is important since this is all about optimizing for size.

I know from past experience that asking for a "download button" gets a button with the right HTML and JavaScript mechanisms set up such that clicking it provides a file save dialog, which is a nice convenience over needing to right-click-save-as.

Also include controls for the gifsicle options for manual use - each preview has a “tweak these settings” link which sets those manual settings to the ones used for that preview so the user can customize them further

This is a pretty clumsy prompt - I was typing it in my phone after all - but it expressed my intention well enough for Claude to build what I wanted.

Here's what that looks like in the resulting tool, this screenshot showing the mobile version. Each image has a "Tweak these settings" button which, when clicked, updates this set of manual settings and sliders:

Run “uvx rodney --help” and use that tool to tray your work - use this GIF for testing https://static.simonwillison.net/static/2026/animated-word-cloud-demo.gif

Coding agents work so much better if you make sure they have the ability to test their code while they are working.

There are many different ways to test a web interface - Playwright and Selenium and agent-browser are three solid options.

Rodney is a browser automation tool I built myself, which is quick to install and has --help output that's designed to teach an agent everything it needs to know to use the tool.

This worked great - in the session transcript you can see Claude using Rodney and fixing some minor bugs that it spotted, for example:

The CSS display: none is winning over the inline style reset. I need to set display: 'block' explicitly.

The follow-up prompts

When I'm working with Claude Code I usually keep an eye on what it's doing so I can redirect it while it's still in flight. I also often come up with new ideas while it's working which I then inject into the queue.

Include the build script and diff against original gifsicle code in the commit in an appropriate subdirectory

The build script should clone the gifsicle repo to /tmp and switch to a known commit before applying the diff - so no copy of gifsicle in the commit but all the scripts needed to build the wqsm

I added this when I noticed it was putting a lot of effort into figuring out how to get Gifsicle working with WebAssembly, including patching the original source code. Here's the patch and the build script it added to the repo.

I knew there was a pattern in that repo already for where supporting files lived but I couldn't remember what that pattern was. Saying "in an appropriate subdirectory" was enough for Claude to figure out where to put it - it found and used the existing lib/ directory.

You should include the wasm bundle

This probably wasn't necessary, but I wanted to make absolutely sure that the compiled WASM file (which turned out to be 233KB) was committed to the repo. I serve simonw/tools via GitHub Pages at tools.simonwillison.net and I wanted it to work without needing to be built locally.

Make sure the HTML page credits gifsicle and links to the repo

This is just polite! I often build WebAssembly wrappers around other people's open source projects and I like to make sure they get credit in the resulting page.

Claude added this to the footer of the tool:

Built with gifsicle by Eddie Kohler, compiled to WebAssembly. gifsicle is released under the GNU General Public License, version 2.

Tags: gif, tools, webassembly, claude, llms, prompt-engineering, ai, generative-ai, coding-agents, claude-code, agentic-engineering

Claude Code Remote Control

2026-02-25T17:33:24+00:00

Claude Code Remote Control

New Claude Code feature dropped yesterday: you can now run a "remote control" session on your computer and then use the Claude Code for web interfaces (on web, iOS and native desktop app) to send prompts to that session.

It's a little bit janky right now. Initially when I tried it I got the error "Remote Control is not enabled for your account. Contact your administrator." (but I am my administrator?) - then I logged out and back into the Claude Code terminal app and it started working:

claude remote-control

You can only run one session on your machine at a time. If you upgrade the Claude iOS app it then shows up as "Remote Control Session (Mac)" in the Code tab.

It appears not to support the --dangerously-skip-permissions flag (I passed that to claude remote-control and it didn't reject the option, but it also appeared to have no effect) - which means you have to approve every new action it takes.

I also managed to get it to a state where every prompt I tried was met by an API 500 error.

Restarting the program on the machine also causes existing sessions to start returning mysterious API errors rather than neatly explaining that the session has terminated.

I expect they'll iron out all of these issues relatively quickly. It's interesting to then contrast this to solutions like OpenClaw, where one of the big selling points is the ability to control your personal device from your phone.

Claude Code still doesn't have a documented mechanism for running things on a schedule, which is the other killer feature of the Claw category of software.

Update: I spoke too soon: also today Anthropic announced Schedule recurring tasks in Cowork, Claude Code's general agent sibling. These do include an important limitation:

Scheduled tasks only run while your computer is awake and the Claude Desktop app is open. If your computer is asleep or the app is closed when a task is scheduled to run, Cowork will skip the task, then run it automatically once your computer wakes up or you open the desktop app again.

I really hope they're working on a Cowork Cloud product.

Via @claudeai

Tags: ai, generative-ai, applescript, llms, anthropic, claude, coding-agents, claude-code, openclaw

Adding TILs, releases, museums, tools and research to my blog

2026-02-20T23:47:10+00:00

I've been wanting to add indications of my various other online activities to my blog for a while now. I just turned on a new feature I'm calling "beats" (after story beats, naming this was hard!) which adds five new types of content to my site, all corresponding to activity elsewhere.

Here's what beats look like:

Those three are from the 30th December 2025 archive page.

Beats are little inline links with badges that fit into different content timeline views around my site, including the homepage, search and archive pages.

There are currently five types of beats:

Releases are GitHub releases of my many different open source projects, imported from this JSON file that was constructed by GitHub Actions.
TILs are the posts from my TIL blog, imported using a SQL query over JSON and HTTP against the Datasette instance powering that site.
Museums are new posts on my niche-museums.com blog, imported from this custom JSON feed.
Tools are HTML and JavaScript tools I've vibe-coded on my tools.simonwillison.net site, as described in Useful patterns for building HTML tools.
Research is for AI-generated research projects, hosted in my simonw/research repo and described in Code research projects with async coding agents like Claude Code and Codex.

That's five different custom integrations to pull in all of that data. The good news is that this kind of integration project is the kind of thing that coding agents really excel at. I knocked most of the feature out in a single morning while working in parallel on various other things.

I didn't have a useful structured feed of my Research projects, and it didn't matter because I gave Claude Code a link to the raw Markdown README that lists them all and it spun up a parser regex. Since I'm responsible for both the source and the destination I'm fine with a brittle solution that would be too risky against a source that I don't control myself.

Claude also handled all of the potentially tedious UI integration work with my site, making sure the new content worked on all of my different page types and was handled correctly by my faceted search engine.

Prototyping with Claude Artifacts

I actually prototyped the initial concept for beats in regular Claude - not Claude Code - taking advantage of the fact that it can clone public repos from GitHub these days. I started with:

Clone simonw/simonwillisonblog and tell me about the models and views

And then later in the brainstorming session said:

use the templates and CSS in this repo to create a new artifact with all HTML and CSS inline that shows me my homepage with some of those inline content types mixed in

After some iteration we got to this artifact mockup, which was enough to convince me that the concept had legs and was worth handing over to full Claude Code for web to implement.

If you want to see how the rest of the build played out the most interesting PRs are Beats #592 which implemented the core feature and Add Museums Beat importer #595 which added the Museums content type.

Tags: blogging, museums, ai, til, generative-ai, llms, ai-assisted-programming, claude-artifacts, claude-code, site-upgrades

Quoting Thariq Shihipar

2026-02-20T07:13:19+00:00

Long running agentic products like Claude Code are made feasible by prompt caching which allows us to reuse computation from previous roundtrips and significantly decrease latency and cost. [...]

At Claude Code, we build our entire harness around prompt caching. A high prompt cache hit rate decreases costs and helps us create more generous rate limits for our subscription plans, so we run alerts on our prompt cache hit rate and declare SEVs if they're too low.

— Thariq Shihipar

Tags: ai, prompt-engineering, generative-ai, llms, anthropic, ai-agents, claude-code

Recovering lost code

2026-02-19T23:48:35+00:00

Reached the stage of parallel agent psychosis where I've lost a whole feature - I know I had it yesterday, but I can't seem to find the branch or worktree or cloud instance or checkout with it in.

... found it! Turns out I'd been hacking on a random prototype in /tmp and then my computer crashed and rebooted and I lost the code... but it's all still there in ~/.claude/projects/ session logs and Claude Code can extract it out and spin up the missing feature again.

Tags: parallel-agents, coding-agents, claude-code, generative-ai, ai, llms

The A.I. Disruption We’ve Been Waiting for Has Arrived

2026-02-18T17:07:31+00:00

The A.I. Disruption We’ve Been Waiting for Has Arrived

New opinion piece from Paul Ford in the New York Times. Unsurprisingly for a piece by Paul it's packed with quoteworthy snippets, but a few stood out for me in particular.

Paul describes the November moment that so many other programmers have observed, and highlights Claude Code's ability to revive old side projects:

[Claude Code] was always a helpful coding assistant, but in November it suddenly got much better, and ever since I’ve been knocking off side projects that had sat in folders for a decade or longer. It’s fun to see old ideas come to life, so I keep a steady flow. Maybe it adds up to a half-hour a day of my time, and an hour of Claude’s.

November was, for me and many others in tech, a great surprise. Before, A.I. coding tools were often useful, but halting and clumsy. Now, the bot can run for a full hour and make whole, designed websites and apps that may be flawed, but credible. I spent an entire session of therapy talking about it.

And as the former CEO of a respected consultancy firm (Postlight) he's well positioned to evaluate the potential impact:

When you watch a large language model slice through some horrible, expensive problem — like migrating data from an old platform to a modern one — you feel the earth shifting. I was the chief executive of a software services firm, which made me a professional software cost estimator. When I rebooted my messy personal website a few weeks ago, I realized: I would have paid $25,000 for someone else to do this. When a friend asked me to convert a large, thorny data set, I downloaded it, cleaned it up and made it pretty and easy to explore. In the past I would have charged $350,000.

That last price is full 2021 retail — it implies a product manager, a designer, two engineers (one senior) and four to six months of design, coding and testing. Plus maintenance. Bespoke software is joltingly expensive. Today, though, when the stars align and my prompts work out, I can do hundreds of thousands of dollars worth of work for fun (fun for me) over weekends and evenings, for the price of the Claude $200-a-month plan.

He also neatly captures the inherent community tension involved in exploring this technology:

All of the people I love hate this stuff, and all the people I hate love it. And yet, likely because of the same personality flaws that drew me to technology in the first place, I am annoyingly excited.

Tags: new-york-times, paul-ford, careers, ai, generative-ai, llms, ai-assisted-programming, ai-ethics, coding-agents, claude-code, deep-blue, november-2025-inflection

Introducing Claude Sonnet 4.6

2026-02-17T23:58:58+00:00

Introducing Claude Sonnet 4.6

Sonnet 4.6 is out today, and Anthropic claim it offers similar performance to November's Opus 4.5 while maintaining the Sonnet pricing of $3/million input and $15/million output tokens (the Opus models are $5/$25). Here's the system card PDF.

Sonnet 4.6 has a "reliable knowledge cutoff" of August 2025, compared to Opus 4.6's May 2025 and Haiku 4.5's February 2025. Both Opus and Sonnet default to 200,000 max input tokens but can stretch to 1 million in beta and at a higher cost.

I just released llm-anthropic 0.24 with support for both Sonnet 4.6 and Opus 4.6. Claude Code did most of the work - the new models had a fiddly amount of extra details around adaptive thinking and no longer supporting prefixes, as described in Anthropic's migration guide.

Here's what I got from:

uvx --with llm-anthropic llm 'Generate an SVG of a pelican riding a bicycle' -m claude-sonnet-4.6

The SVG comments include:

<!-- Hat (fun accessory) -->

I tried a second time and also got a top hat. Sonnet 4.6 apparently loves top hats!

For comparison, here's the pelican Opus 4.5 drew me in November:

And here's Anthropic's current best pelican, drawn by Opus 4.6 on February 5th:

Opus 4.6 produces the best pelican beak/pouch. I do think the top hat from Sonnet 4.6 is a nice touch though.

Via Hacker News

Tags: ai, generative-ai, llms, llm, anthropic, claude, llm-pricing, pelican-riding-a-bicycle, llm-release, claude-code

Quoting Dimitris Papailiopoulos

2026-02-17T14:04:44+00:00

But the intellectually interesting part for me is something else. I now have something close to a magic box where I throw in a question and a first answer comes back basically for free, in terms of human effort. Before this, the way I'd explore a new idea is to either clumsily put something together myself or ask a student to run something short for signal, and if it's there, we’d go deeper. That quick signal step, i.e., finding out if a question has any meat to it, is what I can now do without taking up anyone else's time. It’s now between just me, Claude Code, and a few days of GPU time.

I don’t know what this means for how we do research long term. I don’t think anyone does yet. But the distance between a question and a first answer just got very small.

— Dimitris Papailiopoulos, on running research questions though Claude Code

Tags: research, ai, generative-ai, llms, coding-agents, claude-code

Two new Showboat tools: Chartroom and datasette-showboat

2026-02-17T00:43:45+00:00

I introduced Showboat a week ago - my CLI tool that helps coding agents create Markdown documents that demonstrate the code that they have created. I've been finding new ways to use it on a daily basis, and I've just released two new tools to help get the best out of the Showboat pattern. Chartroom is a CLI charting tool that works well with Showboat, and datasette-showboat lets Showboat's new remote publishing feature incrementally push documents to a Datasette instance.

Showboat remote publishing

I normally use Showboat in Claude Code for web (see note from this morning). I've used it in several different projects in the past few days, each of them with a prompt that looks something like this:

Use "uvx showboat --help" to perform a very thorough investigation of what happens if you use the Python sqlite-chronicle and sqlite-history-json libraries against the same SQLite database table

Here's the resulting document.

Just telling Claude Code to run uvx showboat --help is enough for it to learn how to use the tool - the help text is designed to work as a sort of ad-hoc Skill document.

The one catch with this approach is that I can't see the new Showboat document until it's finished. I have to wait for Claude to commit the document plus embedded screenshots and push that to a branch in my GitHub repo - then I can view it through the GitHub interface.

For a while I've been thinking it would be neat to have a remote web server of my own which Claude instances can submit updates to while they are working. Then this morning I realized Showboat might be the ideal mechanism to set that up...

Showboat v0.6.0 adds a new "remote" feature. It's almost invisible to users of the tool itself, instead being configured by an environment variable.

Set a variable like this:

export SHOWBOAT_REMOTE_URL=https://www.example.com/submit?token=xyz

And every time you run a showboat init or showboat note or showboat exec or showboat image command the resulting document fragments will be POSTed to that API endpoint, in addition to the Showboat Markdown file itself being updated.

There are full details in the Showboat README - it's a very simple API format, using regular POST form variables or a multipart form upload for the image attached to showboat image.

datasette-showboat

It's simple enough to build a webapp to receive these updates from Showboat, but I needed one that I could easily deploy and would work well with the rest of my personal ecosystem.

So I had Claude Code write me a Datasette plugin that could act as a Showboat remote endpoint. I actually had this building at the same time as the Showboat remote feature, a neat example of running parallel agents.

datasette-showboat is a Datasette plugin that adds a /-/showboat endpoint to Datasette for viewing documents and a /-/showboat/receive endpoint for receiving updates from Showboat.

Here's a very quick way to try it out:

uvx --with datasette-showboat --prerelease=allow \
  datasette showboat.db --create \
  -s plugins.datasette-showboat.database showboat \
  -s plugins.datasette-showboat.token secret123 \
  --root --secret cookie-secret-123

Click on the sign in as root link that shows up in the console, then navigate to http://127.0.0.1:8001/-/showboat to see the interface.

Now set your environment variable to point to this instance:

export SHOWBOAT_REMOTE_URL="http://127.0.0.1:8001/-/showboat/receive?token=secret123"

And run Showboat like this:

uvx showboat init demo.md "Showboat Feature Demo"

Refresh that page and you should see this:

Click through to the document, then start Claude Code or Codex or your agent of choice and prompt:

Run 'uvx showboat --help' and then use showboat to add to the existing demo.md document with notes and exec and image to demonstrate the tool - fetch a placekitten for the image demo.

The init command assigns a UUID and title and sends those up to Datasette.

The best part of this is that it works in Claude Code for web. Run the plugin on a server somewhere (an exercise left up to the reader - I use Fly.io to host mine) and set that SHOWBOAT_REMOTE_URL environment variable in your Claude environment, then any time you tell it to use Showboat the document it creates will be transmitted to your server and viewable in real time.

I built Rodney, a CLI browser automation tool, specifically to work with Showboat. It makes it easy to have a Showboat document load up web pages, interact with them via clicks or injected JavaScript and captures screenshots to embed in the Showboat document and show the effects.

This is wildly useful for hacking on web interfaces using Claude Code for web, especially when coupled with the new remote publishing feature. I only got this stuff working this morning and I've already had several sessions where Claude Code has published screenshots of its work in progress, which I've then been able to provide feedback on directly in the Claude session while it's still working.

Chartroom

A few days ago I had another idea for a way to extend the Showboat ecosystem: what if Showboat documents could easily include charts?

I sometimes fire up Claude Code for data analysis tasks, often telling it to download a SQLite database and then run queries against it to figure out interesting things from the data.

With a simple CLI tool that produced PNG images I could have Claude use Showboat to build a document with embedded charts to help illustrate its findings.

Chartroom is exactly that. It's effectively a thin wrapper around the excellent matplotlib Python library, designed to be used by coding agents to create charts that can be embedded in Showboat documents.

Here's how to render a simple bar chart:

echo 'name,value
Alice,42
Bob,28
Charlie,35
Diana,51
Eve,19' | uvx chartroom bar --csv \
  --title 'Sales by Person' --ylabel 'Sales'

It can also do line charts, bar charts, scatter charts, and histograms - as seen in this demo document that was built using Showboat.

Chartroom can also generate alt text. If you add -f alt to the above it will output the alt text for the chart instead of the image:

echo 'name,value
Alice,42
Bob,28
Charlie,35
Diana,51
Eve,19' | uvx chartroom bar --csv \
  --title 'Sales by Person' --ylabel 'Sales' -f alt

Outputs:

Sales by Person. Bar chart of value by name — Alice: 42, Bob: 28, Charlie: 35, Diana: 51, Eve: 19

Or you can use -f html or -f markdown to get the image tag with alt text directly:

![Sales by Person. Bar chart of value by name — Alice: 42, Bob: 28, Charlie: 35, Diana: 51, Eve: 19](/Users/simon/chart-7.png)

I added support for Markdown images with alt text to Showboat in v0.5.0, to complement this feature of Chartroom.

Finally, Chartroom has support for different matplotlib styles. I had Claude build a Showboat document to demonstrate these all in one place - you can see that at demo/styles.md.

How I built Chartroom

I started the Chartroom repository with my click-app cookiecutter template, then told a fresh Claude Code for web session:

We are building a Python CLI tool which uses matplotlib to generate a PNG image containing a chart. It will have multiple sub commands for different chart types, controlled by command line options. Everything you need to know to use it will be available in the single "chartroom --help" output.

It will accept data from files or standard input as CSV or TSV or JSON, similar to how sqlite-utils accepts data - clone simonw/sqlite-utils to /tmp for reference there. Clone matplotlib/matplotlib for reference as well

It will also accept data from --sql path/to/sqlite.db "select ..." which runs in read-only mode

Start by asking clarifying questions - do not use the ask user tool though it is broken - and generate a spec for me to approve

Once approved proceed using red/green TDD running tests with "uv run pytest"

Also while building maintain a demo/README.md document using the "uvx showboat --help" tool - each time you get a new chart type working commit the tests, implementation, root level README update and a new version of that demo/README.md document with an inline image demo of the new chart type (which should be a UUID image filename managed by the showboat image command and should be stored in the demo/ folder

Make sure "uv build" runs cleanly without complaining about extra directories but also ensure dist/ and uv.lock are in gitignore

This got most of the work done. You can see the rest in the PRs that followed.

The burgeoning Showboat ecosystem

The Showboat family of tools now consists of Showboat itself, Rodney for browser automation, Chartroom for charting and datasette-showboat for streaming remote Showboat documents to Datasette.

I'm enjoying how these tools can operate together based on a very loose set of conventions. If a tool can output a path to an image Showboat can include that image in a document. Any tool that can output text can be used with Showboat.

I'll almost certainly be building more tools that fit this pattern. They're very quick to knock out!

The environment variable mechanism for Showboat's remote streaming is a fun hack too - so far I'm just using it to stream documents somewhere else, but it's effectively a webhook extension mechanism that could likely be used for all sorts of things I haven't thought of yet.

Tags: charting, projects, ai, datasette, generative-ai, llms, ai-assisted-programming, coding-agents, claude-code, showboat

Rodney and Claude Code for Desktop

2026-02-16T16:38:57+00:00

I'm a very heavy user of Claude Code on the web, Anthropic's excellent but poorly named cloud version of Claude Code where everything runs in a container environment managed by them, greatly reducing the risk of anything bad happening to a computer I care about.

I don't use the web interface at all (hence my dislike of the name) - I access it exclusively through their native iPhone and Mac desktop apps.

Something I particularly appreciate about the desktop app is that it lets you see images that Claude is "viewing" via its Read /path/to/image tool. Here's what that looks like:

This means you can get a visual preview of what it's working on while it's working, without waiting for it to push code to GitHub for you to try out yourself later on.

The prompt I used to trigger the above screenshot was:

Run "uvx rodney --help" and then use Rodney to manually test the new pages and menu - look at screenshots from it and check you think they look OK

I designed Rodney to have --help output that provides everything a coding agent needs to know in order to use the tool.

The Claude iPhone app doesn't display opened images yet, so I requested it as a feature just now in a thread on Twitter.

Tags: anthropic, claude, ai, claude-code, llms, async-coding-agents, coding-agents, generative-ai, projects, ai-assisted-programming, rodney

Quoting Boris Cherny

2026-02-14T23:59:09+00:00

Someone has to prompt the Claudes, talk to customers, coordinate with other teams, decide what to build next. Engineering is changing and great engineers are more important than ever.

— Boris Cherny, Claude Code creator, on why Anthropic are still hiring developers

Tags: careers, ai, generative-ai, llms, ai-assisted-programming, anthropic, coding-agents, claude-code