Simon Willison's Weblog: agentic-engineering

AI enthusiasts are in a race against time, AI skeptics are in a race against entropy

2026-06-04T23:55:27+00:00

AI enthusiasts are in a race against time, AI skeptics are in a race against entropy

Charity Majors neatly captures the dynamic between AI enthusiasts and AI skeptics, both of whom are trying to build great software, often in the same teams:

The enthusiasts are not wrong. We are starting to see real, non-imaginary, discontinuous leaps in capabilities from teams that lean in hard to working with AI. And this does not feel like a normal technology cycle where you can wait for the dust to settle; teams that sit this out while competitors are hustling could be out of business before the dust settles. That’s a real, existential threat.

The skeptics are also not wrong. When you ship code faster than engineers can read it, in domains where nobody has full context, you are making withdrawals from a trust account that took years to build. Reliability degrades, institutional knowledge evaporates. You end up with systems nobody understands, products burbling into incoherence, and on-call rotations that grind people up and spit them out. That is ALSO a real existential threat.

Charity recommends treating this as both a leadership challenge and an engineering challenge. The key issue:

There is no natural feedback loop connecting enthusiasts with skeptics.

Designing feedback loops to help "mend the gap in shared reality" between the two groups is a fascinating organizational design problem.

Via Lobste.rs

Tags: ai, charity-majors, agentic-engineering

Quoting Mitchell Hashimoto

2026-05-14T22:31:20+00:00

[...] On the interesting side is how fungible programming languages are nowadays. Programming languages used to be LOCK IN, and they're increasingly not so. You think the Bun rewrite in Rust is good for Rust? Bun has shown they can be in probably any language they want in roughly a week or two. Rust is expendable. Its useful until its not then it can be thrown out. That's interesting!

— Mitchell Hashimoto, on Bun porting from Zig to Rust

Tags: ai, rust, zig, generative-ai, llms, mitchell-hashimoto, bun, agentic-engineering

Thoughts on GitLab's workforce reduction" and "structural and strategic decisions"

2026-05-11T23:58:55+00:00

GitLab Act 2

There's a lot going on in this announcement from GitLab about the "workforce reduction" and "structural and strategic decisions" they are making with respect to the agentic era.

They're "planning to reduce the number of countries by up to 30% where we have small teams". One of the most interesting things about GitLab is that they have employees spread across a large number of countries - 18 are listed in their public employee handbook but this post says they are "operating in nearly 60 countries". That handbook used to document their payroll workflows for those countries too - they stopped publishing that in 2023 but the last public version (hooray for version control) remains a fascinating read. Since we don't know which of those 60 countries have small teams, we can't calculate how many countries that 30% applies to.
"We're planning to flatten the organization, removing up to three layers of management in some functions so leaders are closer to the work." - this isn't the first announcement of this type I've seen that's trimming management. Coinbase recently announced a much more aggressive version of this: they were "flattening our org structure to 5 layers max below" and "No pure managers: Every leader at Coinbase must also be a strong and active individual contributor. Managers should be like player-coaches".
In terms of team structure: "We're re-organizing R&D to create roughly 60 smaller, more empowered teams with end-to-end ownership, nearly doubling the number of independent teams." I've always loved the idea of individual teams that can ship features unblocked by other teams, and it makes sense to me that agentic engineering can increase the capability of such teams. The 37signals public employee handbook used to have a section on working In self-sufficient, independent teams which perfectly captured this for me, I'm sad to see they removed that detail in January 2024!
Tucked away towards the bottom: "We will be retiring CREDIT as our values framework" - that's the values framework described on this page: "Collaboration, Results for Customers, Efficiency, Diversity, Inclusion & Belonging, Iteration, and Transparency". The new values are "Speed with Quality, Ownership Mindset, Customer Outcomes". The fact that "Diversity" is no longer in there is likely to attract a whole lot of attention, so it's worth noting that a sub-bullet under Customer Outcomes reads "Interpersonal excellence: individuals who are good humans, embrace diversity, inclusion and belonging, assume good intent and treat everyone with respect".

Here's the part of their new strategy that most resonated with me:

The agentic era multiplies demand for software. Software has been the force multiplier behind nearly every business transformation of the last two decades. The constraint was the cost and time of producing and managing it. That constraint is collapsing. As the cost of producing software collapses, demand for it will expand. Last year, the developer platform market used to be measured in tens of dollars per user per month, this year it is hundreds/user/month and headed to thousands. Not only is the value of software for builders increasing, but we believe there will be more software and builders than ever, and we will serve an increasing volume of both.

That very much encapsulates my own optimistic, Jevons-paradox-inspired hope for how this will all work out.

Their opinion on this does need to be taken with a big grain of salt though. GitLab's stock price was ~$52 a year ago and is ~$26 today, and it's plausible that the drop corresponds to uncertainty about GitLab's continued growth as agentic engineering eats its way through their core market.

If your entire business depends on software engineering growing as a field and producing larger volumes of more lucrative seats, you have a strong incentive to believe that agents will have that effect!

Via Hacker News

Tags: 37signals, careers, ai, gitlab, coding-agents, jevons-paradox, agentic-engineering

Quoting James Shore

2026-05-11T19:48:32+00:00

Your AI coding agent, the one you use to write code, needs to reduce your maintenance costs. Not by a little bit, either. You write code twice as quick now? Better hope you’ve halved your maintenance costs. Three times as productive? One third the maintenance costs. Otherwise, you’re screwed. You’re trading a temporary speed boost for permanent indenture. [...]

The math only works if the LLM decreases your maintenance costs, and by exactly the inverse of the rate it adds code. If you double your output and your cost of maintaining that output, two times two means you’ve quadrupled your maintenance costs. If you double your output and hold your maintenance costs steady, two times one means you’ve still doubled your maintenance costs.

— James Shore, You Need AI That Reduces Maintenance Costs

Tags: ai, generative-ai, llms, ai-assisted-programming, coding-agents, agentic-engineering

Vibe coding and agentic engineering are getting closer than I'd like

2026-05-06T14:24:08+00:00

I recently talked with Joseph Ruscio about AI coding tools for Heavybit's High Leverage podcast: Ep. #9, The AI Coding Paradigm Shift with Simon Willison. Here are some of my highlights, including my disturbing realization that vibe coding and agentic engineering have started to converge in my own work.

One thing I really enjoy about podcasts is that they sometimes push me to think out loud in a way that exposes an idea I've not previously been able to put into words.

Vibe coding and agentic engineering are starting to overlap

A few weeks after vibe coding was first coined I published Not all AI-assisted programming is vibe coding (but vibe coding rocks), where I firmly staked out my belief that "vibe coding" is a very different beast from responsible use of AI to write code, which I've since started to call agentic engineering.

When Joseph brought up the distinction between the two I had a sudden realization that they're not nearly as distinct for me as they used to be:

Weirdly though, those things have started to blur for me already, which is quite upsetting.

I thought we had a very clear delineation where vibe coding is the thing where you're not looking at the code at all. You might not even know how to program. You might be a non-programmer who asks for a thing, and gets a thing, and if the thing works, then great! And if it doesn't, you tell it that it doesn't work and cross your fingers.

But at no point are you really caring about the code quality or any of those additional constraints. And my take on vibe coding was that it's fantastic, provided you understand when it can be used and when it can't.

A personal tool for you, where if there's a bug it hurts only you, go ahead!

If you're building software for other people, vibe coding is grossly irresponsible because it's other people's information. Other people get hurt by your stupid bugs. You need to have a higher level than that.

This contrasts with agentic engineering where you are a professional software engineer. You understand security and maintainability and operations and performance and so forth. You're using these tools to the highest of your own ability. I'm finding the scope of challenges I can take on has gone up by a significant amount because I've got the support of these tools.

But I'm still leaning on my 25 years of experience as a software engineer.

The goal is to build high quality production systems: if you're building lower quality stuff faster, I think that's bad. I want to build higher quality stuff faster. I want everything I'm building to be better in every way than it was before.

The problem is that as the coding agents get more reliable, I'm not reviewing every line of code that they write anymore, even for my production level stuff.

I know full well that if you ask Claude Code to build a JSON API endpoint that runs a SQL query and outputs the results as JSON, it's just going to do it right. It's not going to mess that up. You have it add automated tests, you have it add documentation, you know it's going to be good.

But I'm not reviewing that code. And now I've got that feeling of guilt: if I haven't reviewed the code, is it really responsible for me to use this in production?

The thing that really helps me is thinking back to when I've worked at larger organizations where I've been an engineering manager. Other teams are building software that my team depends on.

If another team hands over something and says, "hey, this is the image resize service, here's how to use it to resize your images"... I'm not going to go and read every line of code that they wrote.

I'm going to look at their documentation and I'm going to use it to resize some images. And then I'm going to start shipping my own features. And if I start running into problems where the image resizer thing appears to have bugs or the performance isn't good, that's when I might dig into their Git repositories and see what's going on. But for the most part I treat that as a semi-black box that I don't look at until I need to.

I'm starting to treat the agents in the same way. And it still feels uncomfortable, because human beings are accountable for what they do. A team can build a reputation. I can say "I trust that team over there. They built good software in the past. They're not going to build something rubbish because that affects their professional reputations."

Claude Code does not have a professional reputation! It can't take accountability for what it's done. But it's been proving itself anyway - time and time again it's churning out straightforward things and doing them right in the style that I like.

There's an element of the normalization of deviance here - every time a model turns out to have written the right code without me monitoring it closely there's a risk that I'll trust it at the wrong moment in the future and get burned.

The new challenge of evaluating software

It used to be if you found a GitHub repository with a hundred commits and a good readme and automated tests and stuff, you could be pretty sure that the person writing that had put a lot of care and attention into that project.

And now I can knock out a git repository with a hundred commits and a beautiful readme and comprehensive tests of every line of code in half an hour! It looks identical to those projects that have had a great deal of care and attention. Maybe it is as good as them. I don't know. I can't tell from looking at it. Even for my own projects, I can't tell.

So I realized what I value more than the quality of the tests and documentation is that I want somebody to have used the thing. If you've got a vibe coded thing which you have used every day for the past two weeks, that's much more valuable to me than something that you've just spat out and hardly even exercised.

The bottlenecks have shifted

If you can go from producing 200 lines of code a day to 2,000 lines of code a day, what else breaks? The entire software development lifecycle was, it turns out, designed around the idea that it takes a day to produce a few hundred lines of code. And now it doesn't.

It's not just the downstream stuff, it's the upstream stuff as well. I saw a great talk by Jenny Wen, who's the design leader at Anthropic, where she said we have all of these design processes that are based around the idea that you need to get the design right - because if you hand it off to the engineers and they spend three months building the wrong thing, that's catastrophic.

There's this whole very extensive design process that you put in place because that design results in expensive work. But if it doesn't take three months to build, maybe the design process can be a whole lot riskier because cost, if you get something wrong, has been reduced so much.

Why I'm still not afraid for my career

When I look at my conversations with the agents, it's very clear to me that this is moon language for the vast majority of human beings.

There are a whole bunch of reasons I'm not scared that my career as a software engineer is over now that computers can write their own code, partly because these things are amplifiers of existing experience. If you know what you're doing, you can run so much faster with them. [...]

I'm constantly reminded as I work with these tools how hard the thing that we do is. Producing software is a ferociously difficult thing to do. And you could give me all of the AI tools in the world and what we're trying to achieve here is still really difficult. [...]

Matthew Yglesias, who's a political commentator, yesterday tweeted, "Five months in, I think I've decided that I don't want to vibecode — I want professionally managed software companies to use AI coding assistance to make more/better/cheaper software products that they sell to me for money." And that feels about right to me. I can plumb my house if I watch enough YouTube videos on plumbing. I would rather hire a plumber.

On the threat to SaaS providers of companies rolling their own solutions instead:

I just realized it's the thing I said earlier about how I only want to use your side project if you've used it for a few weeks. The enterprise version of that is I don't want a CRM unless at least two other giant enterprises have successfully used that CRM for six months. [...] You want solutions that are proven to work before you take a risk on them.

Tags: ai, generative-ai, llms, podcast-appearances, vibe-coding, coding-agents, agentic-engineering

Redis Array Playground

2026-05-04T15:53:57+00:00

Tool: Redis Array Playground

Salvatore Sanfilippo submitted a PR adding a new data type - arrays - to Redis.

The new commands are ARCOUNT, ARDEL, ARDELRANGE, ARGET, ARGETRANGE, ARGREP, ARINFO, ARINSERT, ARLASTITEMS, ARLEN, ARMGET, ARMSET, ARNEXT, AROP, ARRING, ARSCAN, ARSEEK, ARSET.

The implementation is currently available in a branch, so I had Claude Code for web build this interactive playground for trying out the new commands in a WASM-compiled build of a subset of Redis running in the browser.

The most interesting new command is ARGREP which can run a server-side grep against a range of values in the array using the newly vendored TRE regex library.

Salvatore wrote more about the AI-assisted development process for the array type in Redis array type: short story of a long development.

Tags: c, redis, regular-expressions, salvatore-sanfilippo, ai, webassembly, generative-ai, llms, agentic-engineering

Codex CLI 0.128.0 adds /goal

2026-04-30T23:23:17+00:00

Codex CLI 0.128.0 adds /goal

The latest version of OpenAI's Codex CLI coding agent adds their own version of the Ralph loop: you can now set a /goal and Codex will keep on looping until it evaluates that the goal has been completed... or the configured token budget has been exhausted.

It looks like the feature is mainly implemented though the goals/continuation.md and goals/budget_limit.md prompts, which are automatically injected at the end of a turn.

Via @fcoury

Tags: ai, openai, prompt-engineering, generative-ai, llms, coding-agents, system-prompts, codex, agentic-engineering

Quoting Matthew Yglesias

2026-04-28T13:25:29+00:00

Five months in, I think I've decided that I don't want to vibecode — I want professionally managed software companies to use AI coding assistance to make more/better/cheaper software products that they sell to me for money.

— Matthew Yglesias, in a now-deleted Tweet

Tags: ai, ai-assisted-programming, vibe-coding, agentic-engineering

Extract PDF text in your browser with LiteParse for the web

2026-04-23T21:54:24+00:00

LlamaIndex have a most excellent open source project called LiteParse, which provides a Node.js CLI tool for extracting text from PDFs. I got a version of LiteParse working entirely in the browser, using most of the same libraries that LiteParse uses to run in Node.js.

Spatial text parsing

Refreshingly, LiteParse doesn't use AI models to do what it does: it's good old-fashioned PDF parsing, falling back to Tesseract OCR (or other pluggable OCR engines) for PDFs that contain images of text rather than the text itself.

The hard problem that LiteParse solves is extracting text in a sensible order despite the infuriating vagaries of PDF layouts. They describe this as "spatial text parsing" - they use some very clever heuristics to detect things like multi-column layouts and group and return the text in a sensible linear flow.

The LiteParse documentation describes a pattern for implementing Visual Citations with Bounding Boxes. I really like this idea: being able to answer questions from a PDF and accompany those answers with cropped, highlighted images feels like a great way of increasing the credibility of answers from RAG-style Q&A.

LiteParse is provided as a pure CLI tool, designed to be used by agents. You run it like this:

npm i -g @llamaindex/liteparse
lit parse document.pdf

I explored its capabilities with Claude and quickly determined that there was no real reason it had to stay a CLI app: it's built on top of PDF.js and Tesseract.js, two libraries I've used for something similar in a browser in the past.

The only reason LiteParse didn't have a pure browser-based version is that nobody had built one yet...

Introducing LiteParse for the web

Visit https://simonw.github.io/liteparse/ to try out LiteParse against any PDF file, running entirely in your browser. Here's what that looks like:

The tool can work with or without running OCR, and can optionally display images for every page in the PDF further down the page.

Building it with Claude Code and Opus 4.7

The process of building this started in the regular Claude app on my iPhone. I wanted to try out LiteParse myself, so I started by uploading a random PDF I happened to have on my phone along with this prompt:

Clone https://github.com/run-llama/liteparse and try it against this file

Regular Claude chat can clone directly from GitHub these days, and while by default it can't access most of the internet from its container it can also install packages from PyPI and npm.

I often use this to try out new pieces of open source software on my phone - it's a quick way to exercise something without having to sit down with my laptop.

You can follow my full conversation in this shared Claude transcript. I asked a few follow-up questions about how it worked, and then asked:

Does this library run in a browser? Could it?

This gave me a thorough enough answer that I was convinced it was worth trying getting that to work for real. I opened up my laptop and switched to Claude Code.

I forked the original repo on GitHub, cloned a local copy, started a new web branch and pasted that last reply from Claude into a new file called notes.md. Then I told Claude Code:

Get this working as a web app. index.html, when loaded, should render an app that lets users open a PDF in their browser and select OCR or non-OCR mode and have this run. Read notes.md for initial research on this problem, then write out plan.md with your detailed implementation plan

I always like to start with a plan for this kind of project. Sometimes I'll use Claude's "planning mode", but in this case I knew I'd want the plan as an artifact in the repository so I told it to write plan.md directly.

This also means I can iterate on the plan with Claude. I noticed that Claude had decided to punt on generating screenshots of images in the PDF, and suggested we defer a "canvas-encode swap" to v2. I fixed that by prompting:

Update the plan to say we WILL do the canvas-encode swap so the screenshots thing works

After a few short follow-up prompts, here's the plan.md I thought was strong enough to implement.

I prompted:

build it.

And then mostly left Claude Code to its own devices, tinkered with some other projects, caught up on Duolingo and occasionally checked in to see how it was doing.

I added a few prompts to the queue as I was working. Those don't yet show up in my exported transcript, but it turns out running rg queue-operation --no-filename | grep enqueue | jq -r '.content' in the relevant ~/.claude/projects/ folder extracts them.

Here are the key follow-up prompts with some notes:

When you implement this use playwright and red/green TDD, plan that too - I've written more about red/green TDD here.
let's use PDF.js's own renderer (it was messing around with pdfium)
The final UI should include both the text and the pretty-printed JSON output, both of those in textareas and both with copy-to-clipboard buttons - it should also be mobile friendly - I had a new idea for how the UI should work
small commits along the way - see below
Make sure the index.html page includes a link back to https://github.com/run-llama/liteparse near the top of the page - it's important to credit your dependencies in a project like this!
View on GitHub → is bad copy because that's not the repo with this web app in, it's the web app for the underlying LiteParse library
Run OCR should be unchecked by default
When I try to parse a PDF in my browser I see 'Parse failed: undefined is not a function (near '...value of readableStream...') - it was testing with Playwright in Chrome, turned out there was a bug in Safari
... oh that is in safari but it works in chrome
When "Copy" is clicked the text should change to "Copied!" for 1.5s
[Image #1] Style the file input so that long filenames don't break things on Firefox like this - in fact add one of those drag-drop zone UIs which you can also click to select a file - dropping screenshots in of small UI glitches works surprisingly well
Tweak the drop zone such that the text is vertically centered, right now it is a bit closer to the top
it breaks in Safari on macOS, works in both Chrome and Firefox. On Safari I see "Parse failed: undefined is not a function (near '...value of readableStream...')" after I click the Parse button, when OCR is not checked - it still wasn't working in Safari...
works in safari now - but it fixed it pretty quickly once I pointed that out and it got Playwright working with that browser

I've started habitually asking for "small commits along the way" because it makes for code that's easier to understand or review later on, and I have an unproven hunch that it helps the agent work more effectively too - it's yet another encouragement towards planning and taking on one problem at a time.

While it was working I decided it would be nice to be able to interact with an in-progress version. I asked a separate Claude Code session against the same directory for tips on how to run it, and it told me to use npx vite. Running that started a development server with live-reloading, which meant I could instantly see the effect of each change it made on disk - and prompt with further requests for tweaks and fixes.

Towards the end I decided it was going to be good enough to publish. I started a fresh Claude Code instance and told it:

Look at the web/ folder - set up GitHub actions for this repo such that any push runs the tests, and if the tests pass it then does a GitHub Pages deploy of the built vite app such that the web/index.html page is the index.html page for the thing that is deployed and it works on GitHub Pages

After a bit more iteration here's the GitHub Actions workflow that builds the app using Vite and deploys the result to https://simonw.github.io/liteparse/.

I love GitHub Pages for this kind of thing because it can be quickly configured (by Claude, in this case) to turn any repository into a deployed web-app, at zero cost and with whatever build step is necessary. It even works against private repos, if you don't mind your only security being a secret URL.

With this kind of project there's always a major risk that the model might "cheat" - mark key features as "TODO" and fake them, or take shortcuts that ignore the initial requirements.

The responsible way to prevent this is to review all of the code... but this wasn't intended as that kind of project, so instead I fired up OpenAI Codex with GPT-5.5 (I had preview access) and told it:

Describe the difference between how the node.js CLI tool runs and how the web/ version runs

The answer I got back was enough to give me confidence that Claude hadn't taken any project-threatening shortcuts.

... and that was about it. Total time in Claude Code for that "build it" step was 59 minutes. I used my claude-code-transcripts tool to export a readable version of the full transcript which you can view here, albeit without those additional queued prompts (here's my issue to fix that).

Is this even vibe coding any more?

I'm a pedantic stickler when it comes to the original definition of vibe coding - vibe coding does not mean any time you use AI to help you write code, it's when you use AI without reviewing or caring about the code that's written at all.

By my own definition, this LiteParse for the web project is about as pure vibe coding as you can get! I have not looked at a single line of the HTML and TypeScript written for this project - in fact while writing this sentence I had to go and check if it had used JavaScript or TypeScript.

Yet somehow this one doesn't feel as vibe coded to me as many of my other vibe coded projects:

As a static in-browser web application hosted on GitHub Pages the blast radius for any bugs is almost non-existent: it either works for your PDF or doesn't.
No private data is transferred anywhere - all processing happens in your browser - so a security audit is unnecessary. I've glanced once at the network panel while it's running and no additional requests are made when a PDF is being parsed.
There was still a whole lot of engineering experience and knowledge required to use the models in this way. Identifying that porting LiteParse to run directly in a browser was critical to the rest of the project.

Most importantly, I'm happy to attach my reputation to this project and recommend that other people try it out. Unlike most of my vibe coded tools I'm not convinced that spending significant additional engineering time on this would have resulted in a meaningfully better initial release. It's fine as it is!

I haven't opened a PR against the origin repository because I've not discussed it with the LiteParse team. I've opened an issue, and if they want my vibe coded implementation as a starting point for something more official they're welcome to take it.

Tags: javascript, ocr, pdf, projects, ai, generative-ai, llms, vibe-coding, coding-agents, claude-code, agentic-engineering

Adding a new content type to my blog-to-newsletter tool

2026-04-18T03:15:36+00:00

Agentic Engineering Patterns >

Here's an example of a deceptively short prompt that got a quite a lot of work done in a single shot.

First, some background. I send out a free Substack newsletter around once a week containing content copied-and-pasted from my blog. I'm effectively using Substack as a lightweight way to allow people to subscribe to my blog via email.

I generate the newsletter with my blog-to-newsletter tool - an HTML and JavaScript app that fetches my latest content from this Datasette instance and formats it as rich text HTML, which I can then copy to my clipboard and paste into the Substack editor. Here's a detailed explanation of how that works.

I recently added a new type of content to my blog to capture content that I post elsewhere, which I called "beats". These include things like releases of my open source projects, new tools that I've built, museums that I've visited (from niche-museums.com) and other external content.

I wanted to include these in the generated newsletter. Here's the prompt I ran against the simonw/tools repository that hosts my blog-to-newsletter tool, using Claude Code on the web.

Clone simonw/simonwillisonblog from github to /tmp for reference

Update blog-to-newsletter.html to include beats that have descriptions - similar to how the Atom everything feed on the blog works

Run it with python -m http.server and use `uvx rodney --help` to test it - compare what shows up in the newsletter with what's on the homepage of https://simonwillison.net

This got me the exact solution I needed. Let's break down the prompt.

Clone simonw/simonwillisonblog from github to /tmp for reference

I use this pattern a lot. Coding agents can clone code from GitHub, and the best way to explain a problem is often to have them look at relevant code. By telling them to clone to /tmp I ensure they don't accidentally end up including that reference code in their own commit later on.

The simonw/simonwillisonblog repository contains the source code for my Django-powered simonwillison.net blog. This includes the logic and database schema for my new "beats" feature.

Update blog-to-newsletter.html to include beats that have descriptions - similar to how the Atom everything feed on the blog works

Referencing blog-to-newsletter.html is all I need here to tell Claude which of the 200+ HTML apps in that simonw/tools repo it should be modifying.

Beats are automatically imported from multiple sources. Often they aren't very interesting - a dot-release bug fix for one of my smaller open source projects, for example.

My blog includes a way for me to add additional descriptions to any beat, which provides extra commentary but also marks that beat as being more interesting than those that I haven't annotated in some way.

I already use this as a distinction to decide which beats end up in my site's Atom feed. Telling Claude to imitate that saves me from having to describe the logic in any extra detail.

Run it with python -m http.server and use `uvx rodney --help` to test it - compare what shows up in the newsletter with what's on the homepage of https://simonwillison.net

Coding agents always work best if they have some kind of validation mechanism they can use to test their own work.

In this case I wanted Claude Code to actively check that the changes it made to my tool would correctly fetch and display the latest data.

I reminded it to use python -m http.server as a static server because I've had issues in the past with applications that fetch data and break when served as a file from disk instead of a localhost server. In this particular case that may not have been necessary, but my prompting muscle memory has python -m http.server baked in at this point!

I described the uvx rodney --help trick in the agentic manual testing chapter. Rodney is browser automation software that can be installed using uvx, and that has --help output designed to teach an agent everything it needs to know in order to use the tool.

I figured that telling Claude to compare the results in the newsletter to the content of my blog's homepage would be enough for it to confidently verify that the new changes were working correctly, since I had recently posted content that matched the new requirements.

You can see the full session here, or if that doesn't work I have an alternative transcript showing all of the individual tool calls.

The resulting PR made exactly the right change. It added an additional UNION clause to the SQL query that fetched the blog's content, filtering out draft beats and beats that have nothing in their note column:

...
union all
select
  id,
  'beat' as type,
  title,
  created,
  slug,
  'No HTML' as html,
  json_object(
    'created', date(created),
    'beat_type', beat_type,
    'title', title,
    'url', url,
    'commentary', commentary,
    'note', note
  ) as json,
  url as external_url
from blog_beat
where coalesce(note, '') != '' and is_draft = 0
union all
...

And it figured out a mapping of beat types to their formal names, presumably derived from the Django ORM definition that it read while it was exploring the reference codebase:

const beatTypeDisplay = {
  release: 'Release',
  til: 'TIL',
  til_update: 'TIL updated',
  research: 'Research',
  tool: 'Tool',
  museum: 'Museum'
};

Telling agents to use another codebase as reference is a powerful shortcut for communicating complex concepts with minimal additional information needed in the prompt.

Tags: ai, llms, prompt-engineering, coding-agents, ai-assisted-programming, generative-ai, agentic-engineering, github

Steve Yegge

2026-04-13T20:59:00+00:00

Steve Yegge:

I was chatting with my buddy at Google, who's been a tech director there for about 20 years, about their AI adoption. Craziest convo I've had all year.

The TL;DR is that Google engineering appears to have the same AI adoption footprint as John Deere, the tractor company. Most of the industry has the same internal adoption curve: 20% agentic power users, 20% outright refusers, 60% still using Cursor or equivalent chat tool. It turns out Google has this curve too. [...]

There has been an industry-wide hiring freeze for 18+ months, during which time nobody has been moving jobs. So there are no clued-in people coming in from the outside to tell Google how far behind they are, how utterly mediocre they have become as an eng org.

Addy Osmani:

On behalf of @Google, this post doesn't match the state of agentic coding at our company. Over 40K SWEs use agentic coding weekly here. Googlers have access to our own versions of @antigravity, @geminicli, custom models, skills, CLIs and MCPs for our daily work. Orchestrators, agent loops, virtual SWE teams and many other systems are actively available to folks. [...]

Demis Hassabis:

Maybe tell your buddy to do some actual work and to stop spreading absolute nonsense. This post is completely false and just pure clickbait.

Update 20th April 2026: Steve doubled down:

My tweet last week about Google's AI adoption drew a lot of pushback, to say the least.

Since then, Googlers from multiple orgs have reached out to me independently and anonymously. They've expressed fear of being doxxed, concern about what they saw as bullying of me, and general corroboration of my original tweet. [...]

Tags: addy-osmani, steve-yegge, google, generative-ai, agentic-engineering, ai, llms

Eight years of wanting, three months of building with AI

2026-04-05T23:54:18+00:00

Eight years of wanting, three months of building with AI

Lalit Maganti provides one of my favorite pieces of long-form writing on agentic engineering I've seen in ages.

They spent eight years thinking about and then three months building syntaqlite, which they describe as "high-fidelity devtools that SQLite deserves".

The goal was to provide fast, robust and comprehensive linting and verifying tools for SQLite, suitable for use in language servers and other development tools - a parser, formatter, and verifier for SQLite queries. I've found myself wanting this kind of thing in the past myself, hence my (far less production-ready) sqlite-ast project from a few months ago.

Lalit had been procrastinating on this project for years, because of the inevitable tedium of needing to work through 400+ grammar rules to help build a parser. That's exactly the kind of tedious work that coding agents excel at!

Claude Code helped get over that initial hump and build the first prototype:

AI basically let me put aside all my doubts on technical calls, my uncertainty of building the right thing and my reluctance to get started by giving me very concrete problems to work on. Instead of “I need to understand how SQLite’s parsing works”, it was “I need to get AI to suggest an approach for me so I can tear it up and build something better". I work so much better with concrete prototypes to play with and code to look at than endlessly thinking about designs in my head, and AI lets me get to that point at a pace I could not have dreamed about before. Once I took the first step, every step after that was so much easier.

That first vibe-coded prototype worked great as a proof of concept, but they eventually made the decision to throw it away and start again from scratch. AI worked great for the low level details but did not produce a coherent high-level architecture:

I found that AI made me procrastinate on key design decisions. Because refactoring was cheap, I could always say “I’ll deal with this later.” And because AI could refactor at the same industrial scale it generated code, the cost of deferring felt low. But it wasn’t: deferring decisions corroded my ability to think clearly because the codebase stayed confusing in the meantime.

The second attempt took a lot longer and involved a great deal more human-in-the-loop decision making, but the result is a robust library that can stand the test of time.

It's worth setting aside some time to read this whole thing - it's full of non-obvious downsides to working heavily with AI, as well as a detailed explanation of how they overcame those hurdles.

The key idea I took away from this concerns AI's weakness in terms of design and architecture:

When I was working on something where I didn’t even know what I wanted, AI was somewhere between unhelpful and harmful. The architecture of the project was the clearest case: I spent weeks in the early days following AI down dead ends, exploring designs that felt productive in the moment but collapsed under scrutiny. In hindsight, I have to wonder if it would have been faster just thinking it through without AI in the loop at all.

But expertise alone isn’t enough. Even when I understood a problem deeply, AI still struggled if the task had no objectively checkable answer. Implementation has a right answer, at least at a local level: the code compiles, the tests pass, the output matches what you asked for. Design doesn’t. We’re still arguing about OOP decades after it first took off.

Via Hacker News

Tags: sqlite, ai, generative-ai, llms, ai-assisted-programming, vibe-coding, agentic-engineering

Syntaqlite Playground

2026-04-05T19:32:59+00:00

Tool: Syntaqlite Playground

Lalit Maganti's syntaqlite is currently being discussed on Hacker News thanks to Eight years of wanting, three months of building with AI, a deep dive into how it was built.

This inspired me to revisit a research project I ran when Lalit first released it a couple of weeks ago, where I tried it out and then compiled it to a WebAssembly wheel so it could run in Pyodide in a browser (the library itself uses C and Rust).

This new playground loads up the Python library and provides a UI for trying out its different features: formating, parsing into an AST, validating, and tokenizing SQLite SQL queries.

Update: not sure how I missed this but syntaqlite has its own WebAssembly playground linked to from the README.

Tags: sql, sqlite, tools, webassembly, ai-assisted-programming, agentic-engineering

scan-for-secrets 0.1

2026-04-05T03:27:13+00:00

Release: scan-for-secrets 0.1

I like publishing transcripts of local Claude Code sessions using my claude-code-transcripts tool but I'm often paranoid that one of my API keys or similar secrets might inadvertently be revealed in the detailed log files.

I built this new Python scanning tool to help reassure me. You can feed it secrets and have it scan for them in a specified directory:

uvx scan-for-secrets $OPENAI_API_KEY -d logs-to-publish/

If you leave off the -d it defaults to the current directory.

It doesn't just scan for the literal secrets - it also scans for common encodings of those secrets e.g. backslash or JSON escaping, as described in the README.

If you have a set of secrets you always want to protect you can list commands to echo them in a ~/.scan-for-secrets.conf.sh file. Mine looks like this:

llm keys get openai
llm keys get anthropic
llm keys get gemini
llm keys get mistral
awk -F= '/aws_secret_access_key/{print $2}' ~/.aws/credentials | xargs

I built this tool using README-driven-development: I carefully constructed the README describing exactly how the tool should work, then dumped it into Claude Code and told it to build the actual tool (using red/green TDD, naturally.)

Tags: projects, security, ai-assisted-programming, coding-agents, claude-code, agentic-engineering

The cognitive impact of coding agents

2026-04-03T23:57:04+00:00

A fun thing about recording a podcast with a professional like Lenny Rachitsky is that his team know how to slice the resulting video up into TikTok-sized short form vertical videos. Here's one he shared on Twitter today which ended up attracting over 1.1m views!

That was 48 seconds. Our full conversation lasted 1 hour 40 minutes.

Tags: ai-ethics, coding-agents, agentic-engineering, generative-ai, podcast-appearances, ai, llms, cognitive-debt

Highlights from my conversation about agentic engineering on Lenny's Podcast

2026-04-02T20:40:47+00:00

I was a guest on Lenny Rachitsky's podcast, in a new episode titled An AI state of the union: We've passed the inflection point, dark factories are coming, and automation timelines. It's available on YouTube, Spotify, and Apple Podcasts. Here are my highlights from our conversation, with relevant links.

The November inflection point

4:19 - The end result of these two labs throwing everything they had at making their models better at code is that in November we had what I call the inflection point where GPT 5.1 and Claude Opus 4.5 came along.

They were both incrementally better than the previous models, but in a way that crossed a threshold where previously the code would mostly work, but you had to pay very close attention to it. And suddenly we went from that to... almost all of the time it does what you told it to do, which makes all of the difference in the world.

Now you can spin up a coding agent and say, build me a Mac application that does this thing, and you'll get something back which won't just be a buggy pile of rubbish that doesn't do anything.

Software engineers as bellwethers for other information workers

5:49 - I can churn out 10,000 lines of code in a day. And most of it works. Is that good? Like, how do we get from most of it works to all of it works? There are so many new questions that we're facing, which I think makes us a bellwether for other information workers.

Code is easier than almost every other problem that you pose these agents because code is obviously right or wrong - either it works or it doesn't work. There might be a few subtle hidden bugs, but generally you can tell if the thing actually works.

If it writes you an essay, if it prepares a lawsuit for you, it's so much harder to derive if it's actually done a good job, and to figure out if it got things right or wrong. But it's happening to us as software engineers. It came for us first.

And we're figuring out, OK, what do our careers look like? How do we work as teams when part of what we did that used to take most of the time doesn't take most of the time anymore? What does that look like? And it's going to be very interesting seeing how this rolls out to other information work in the future.

Lawyers are falling for this really badly. The AI hallucination cases database is up to 1,228 cases now!

Plus this bit from the cold open at the start:

It used to be you'd ask ChatGPT for some code, and it would spit out some code, and you'd have to run it and test it. The coding agents take that step for you now. And an open question for me is how many other knowledge work fields are actually prone to these agent loops?

Writing code on my phone

8:19 - I write so much of my code on my phone. It's wild. I can get good work done walking the dog along the beach, which is delightful.

I mainly use the Claude iPhone app for this, both with a regular Claude chat session (which can execute code now) or using it to control Claude Code for web.

Responsible vibe coding

9:55 If you're vibe coding something for yourself, where the only person who gets hurt if it has bugs is you, go wild. That's completely fine. The moment you ship your vibe coding code for other people to use, where your bugs might actually harm somebody else, that's when you need to take a step back.

Dark Factories and StrongDM

12:49 The reason it's called the dark factory is there's this idea in factory automation that if your factory is so automated that you don't need any people there, you can turn the lights off. Like the machines can operate in complete darkness if you don't need people on the factory floor. What does that look like for software? [...]

So there's this policy that nobody writes any code: you cannot type code into a computer. And honestly, six months ago, I thought that was crazy. And today, probably 95% of the code that I produce, I didn't type myself. That world is practical already because the latest models are good enough that you can tell them to rename that variable and refactor and add this line there... and they'll just do it - it's faster than you typing on the keyboard yourself.

The next rule though, is nobody reads the code. And this is the thing which StrongDM started doing last year.

I wrote a lot more about StrongDM's dark factory explorations back in February.

The bottleneck has moved to testing

21:27 - It used to be, you'd come up with a spec and you hand it to your engineering team. And three weeks later, if you're lucky, they'd come back with an implementation. And now that maybe takes three hours, depending on how well the coding agents are established for that kind of thing. So now what, right? Now, where else are the bottlenecks?

Anyone who's done any product work knows that your initial ideas are always wrong. What matters is proving them, and testing them.

We can test things so much faster now because we can build workable prototypes so much quicker. So there's an interesting thing I've been doing in my own work where any feature that I want to design, I'll often prototype three different ways it could work because that takes very little time.

I've always loved prototyping things, and prototyping is even more valuable now.

22:40 - A UI prototype is free now. ChatGPT and Claude will just build you a very convincing UI for anything that you describe. And that's how you should be working. I think anyone who's doing product design and isn't vibe coding little prototypes is missing out on the most powerful boost that we get in that step.

But then what do you do? Given your three options that you have instead of one option, how do you prove to yourself which one of those is the best? I don't have a confident answer to that. I expect this is where the good old fashioned usability testing comes in.

This stuff is exhausting

26:25 - I'm finding that using coding agents well is taking every inch of my 25 years of experience as a software engineer, and it is mentally exhausting. I can fire up four agents in parallel and have them work on four different problems. And by like 11 AM, I am wiped out for the day. [...]

There's a personal skill we have to learn in finding our new limits - what's a responsible way for us not to burn out.

I've talked to a lot of people who are losing sleep because they're like, my coding agents could be doing work for me. I'm just going to stay up an extra half hour and set off a bunch of extra things... and then waking up at four in the morning. That's obviously unsustainable. [...]

There's an element of sort of gambling and addiction to how we're using some of these tools.

Interruptions cost a lot less now

45:16 - People talk about how important it is not to interrupt your coders. Your coders need to have solid two to four hour blocks of uninterrupted work so they can spin up their mental model and churn out the code. That's changed completely. My programming work, I need two minutes every now and then to prompt my agent about what to do next. And then I can do the other stuff and I can go back. I'm much more interruptible than I used to be.

My ability to estimate software is broken

28:19 - I've got 25 years of experience in how long it takes to build something. And that's all completely gone - it doesn't work anymore because I can look at a problem and say that this is going to take two weeks, so it's not worth it. And now it's like... maybe it's going to take 20 minutes because the reason it would have taken two weeks was all of the sort of crufty coding things that the AI is now covering for us.

I constantly throw tasks at AI that I don't think it'll be able to do because every now and then it does it. And when it doesn't do it, you learn, right? But when it does do something, especially something that the previous models couldn't do, that's actually cutting edge AI research.

And a related anecdote:

36:56 - A lot of my friends have been talking about how they have this backlog of side projects, right? For the last 10, 15 years, they've got projects they never quite finished. And some of them are like, well, I've done them all now. Last couple of months, I just went through and every evening I'm like, let's take that project and finish it. And they almost feel a sort of sense of loss at the end where they're like, well, okay, my backlog's gone. Now what am I going to build?

It's tough for people in the middle

29:29 - So ThoughtWorks, the big IT consultancy, did an offsite about a month ago, and they got a whole bunch of engineering VPs in from different companies to talk about this stuff. And one of the interesting theories they came up with is they think this stuff is really good for experienced engineers, like it amplifies their skills. It's really good for new engineers because it solves so many of those onboarding problems. The problem is the people in the middle. If you're mid-career, if you haven't made it to sort of super senior engineer yet, but you're not sort of new either, that's the group which is probably in the most trouble right now.

I mentioned Cloudflare hiring 1,000 interns, and Shopify too.

Lenny asked for my advice for people stuck in that middle:

31:21 - That's a big responsibility you're putting on me there! I think the way forward is to lean into this stuff and figure out how do I help this make me better?

A lot of people worry about skill atrophy: if the AI is doing it for you, you're not learning anything. I think if you're worried about that, you push back at it. You have to be mindful about how you're applying the technology and think, okay, I've been given this thing that can answer any question and often gets it right. How can I use this to amplify my own skills, to learn new things, to take on much more ambitious projects? [...]

33:05 - Everything is changing so fast right now. The only universal skill is being able to roll with the changes. That's the thing that we all need.

The term that comes up most in these conversations about how you can be great with AI is agency. I think agents have no agency at all. I would argue that the one thing AI can never have is agency because it doesn't have human motivations.

So I'd say that's the thing is to invest in your own agency and invest in how to use this technology to get better at what you do and to do new things.

It's harder to evaluate software

The fact that it's so easy to create software with detailed documentation and robust tests means it's harder to figure out what's a credible project.

37:47 Sometimes I'll have an idea for a piece of software, Python library or whatever, and I can knock it out in like an hour and get to a point where it's got documentation and tests and all of those things, and it looks like the kind of software that previously I'd have spent several weeks on - and I can stick it up on GitHub

And yet... I don't believe in it. And the reason I don't believe in it is that I got to rush through all of those things... I think the quality is probably good, but I haven't spent enough time with it to feel confident in that quality. Most importantly, I haven't used it yet.

It turns out when I'm using somebody else's software, the thing I care most about is I want them to have used it for months.

I've got some very cool software that I built that I've never used. It was quicker to build it than to actually try and use it!

The misconception that AI tools are easy

41:31 - Everyone's like, oh, it must be easy. It's just a chat bot. It's not easy. That's one of the great misconceptions in AI is that using these tools effectively is easy. It takes a lot of practice and it takes a lot of trying things that didn't work and trying things that did work.

Coding agents are useful for security research now

19:04 - In the past sort of three to six months, they've started being credible as security researchers, which is sending shockwaves through the security research industry.

See Thomas Ptacek: Vulnerability Research Is Cooked.

At the same time, open source projects are being bombarded with junk security reports:

20:05 - There are these people who don't know what they're doing, who are asking ChatGPT to find a security hole and then reporting it to the maintainer. And the report looks good. ChatGPT can produce a very well formatted report of a vulnerability. It's a total waste of time. It's not actually verified as being a real problem.

A good example of the right way to do this is Anthropic's collaboration with Firefox, where Anthropic's security team verified every security problem before passing them to Mozilla.

OpenClaw

Of course we had to talk about OpenClaw! Lenny had his running on a Mac Mini.

1:29:23 - OpenClaw demonstrates that people want a personal digital assistant so much that they are willing to not just overlook the security side of things, but also getting the thing running is not easy. You've got to create API keys and tokens and install stuff. It's not trivial to get set up and hundreds of thousands of people got it set up. [...]

The first line of code for OpenClaw was written on November the 25th. And then in the Super Bowl, there was an ad for AI.com, which was effectively a vaporware white labeled OpenClaw hosting provider. So we went from first line of code in November to Super Bowl ad in what? Three and a half months.

I continue to love Drew Breunig's description of OpenClaw as a digital pet:

A friend of mine said that OpenClaw is basically a Tamagotchi. It's a digital pet and you buy the Mac Mini as an aquarium.

Journalists are good at dealing with unreliable sources

In talking about my explorations of AI for data journalism through Datasette:

1:34:58 - You would have thought that AI is a very bad fit for journalism where the whole idea is to find the truth. But the flip side is journalists deal with untrustworthy sources all the time. The art of journalism is you talk to a bunch of people and some of them lie to you and you figure out what's true. So as long as the journalist treats the AI as yet another unreliable source, they're actually better equipped to work with AI than most other professions are.

The pelican benchmark

Obviously we talked about pelicans riding bicycles:

56:10 - There appears to be a very strong correlation between how good their drawing of a pelican riding a bicycle is and how good they are at everything else. And nobody can explain to me why that is. [...]

People kept on asking me, what if labs cheat on the benchmark? And my answer has always been, really, all I want from life is a really good picture of a pelican riding a bicycle. And if I can trick every AI lab in the world into cheating on benchmarks to get it, then that just achieves my goal.

59:56 - I think something people often miss is that this space is inherently funny. The fact that we have these incredibly expensive, power hungry, supposedly the most advanced computers of all time. And if you ask them to draw a pelican on a bicycle, it looks like a five-year-old drew it. That's really funny to me.

And finally, some good news about parrots

Lenny asked if I had anything else I wanted to leave listeners with to wrap up the show, so I went with the best piece of news in the world right now.

1:38:10 - There is a rare parrot in New Zealand called the Kākāpō. There are only 250 of these parrots left in the world. They are flightless nocturnal parrots - beautiful green dumpy looking things. And the good news is they're having a fantastic breeding season in 2026,

They only breed when the Rimu trees in New Zealand have a mass fruiting season, and the Rimu trees haven't done that since 2022 - so there has not been a single baby kākāpō born in four years.

This year, the Rimu trees are in fruit. The kākāpō are breeding. There have been dozens of new chicks born. It's a really, really good time. It's great news for rare New Zealand parrots and you should look them up because they're delightful.

Everyone should watch the live stream of Rakiura on her nest with two chicks!

YouTube chapters

Here's the full list of chapters Lenny's team defined for the YouTube video:

00:00: Introduction to Simon Willison
02:40: The November 2025 inflection point
08:01: What's possible now with AI coding
10:42: Vibe coding vs. agentic engineering
13:57: The dark-factory pattern
20:41: Where bottlenecks have shifted
23:36: Where human brains will continue to be valuable
25:32: Defending of software engineers
29:12: Why experienced engineers get better results
30:48: Advice for avoiding the permanent underclass
33:52: Leaning into AI to amplify your skills
35:12: Why Simon says he's working harder than ever
37:23: The market for pre-2022 human-written code
40:01: Prediction: 50% of engineers writing 95% AI code by the end of 2026
44:34: The impact of cheap code
48:27: Simon's AI stack
54:08: Using AI for research
55:12: The pelican-riding-a-bicycle benchmark
59:01: The inherent ridiculousness of AI
1:00:52: Hoarding things you know how to do
1:08:21: Red/green TDD pattern for better AI code
1:14:43: Starting projects with good templates
1:16:31: The lethal trifecta and prompt injection
1:21:53: Why 97% effectiveness is a failing grade
1:25:19: The normalization of deviance
1:28:32: OpenClaw: the security nightmare everyone is looking past
1:34:22: What's next for Simon
1:36:47: Zero-deliverable consulting
1:38:05: Good news about Kakapo parrots

Tags: ai, kakapo, generative-ai, llms, podcast-appearances, coding-agents, agentic-engineering

Quoting Soohoon Choi

2026-04-01T02:07:16+00:00

I want to argue that AI models will write good code because of economic incentives. Good code is cheaper to generate and maintain. Competition is high between the AI models right now, and the ones that win will help developers ship reliable features fastest, which requires simple, maintainable code. Good code will prevail, not only because we want it to (though we do!), but because economic forces demand it. Markets will not reward slop in coding, in the long-term.

— Soohoon Choi, Slop Is Not Necessarily The Future

Tags: ai, generative-ai, llms, ai-assisted-programming, slop, agentic-engineering

Quoting Matt Webb

2026-03-28T12:04:26+00:00

The thing about agentic coding is that agents grind problems into dust. Give an agent a problem and a while loop and - long term - it’ll solve that problem even if it means burning a trillion tokens and re-writing down to the silicon. [...]

But we want AI agents to solve coding problems quickly and in a way that is maintainable and adaptive and composable (benefiting from improvements elsewhere), and where every addition makes the whole stack better.

So at the bottom is really great libraries that encapsulate hard problems, with great interfaces that make the “right” way the easy way for developers building apps with them. Architecture!

While I’m vibing (I call it vibing now, not coding and not vibe coding) while I’m vibing, I am looking at lines of code less than ever before, and thinking about architecture more than ever before.

— Matt Webb, An appreciation for (technical) architecture

Tags: definitions, matt-webb, ai, generative-ai, llms, ai-assisted-programming, vibe-coding, coding-agents, agentic-engineering

We Rewrote JSONata with AI in a Day, Saved $500K/Year

2026-03-27T00:35:01+00:00

We Rewrote JSONata with AI in a Day, Saved $500K/Year

Bit of a hyperbolic framing but this looks like another case study of vibe porting, this time spinning up a new custom Go implementation of the JSONata JSON expression language - similar in focus to jq, and heavily associated with the Node-RED platform.

As with other vibe-porting projects the key enabling factor was JSONata's existing test suite, which helped build the first working Go version in 7 hours and $400 of token spend.

The Reco team then used a shadow deployment for a week to run the new and old versions in parallel to confirm the new implementation exactly matched the behavior of the old one.

Tags: go, json, ai, generative-ai, llms, agentic-engineering, vibe-porting

Thoughts on slowing the fuck down

2026-03-25T21:47:17+00:00

Thoughts on slowing the fuck down

Mario Zechner created the Pi agent framework used by OpenClaw, giving considerable credibility to his opinions on current trends in agentic engineering. He's not impressed:

We have basically given up all discipline and agency for a sort of addiction, where your highest goal is to produce the largest amount of code in the shortest amount of time. Consequences be damned.

Agents and humans both make mistakes, but agent mistakes accumulate much faster:

A human is a bottleneck. A human cannot shit out 20,000 lines of code in a few hours. Even if the human creates such booboos at high frequency, there's only so many booboos the human can introduce in a codebase per day. [...]

With an orchestrated army of agents, there is no bottleneck, no human pain. These tiny little harmless booboos suddenly compound at a rate that's unsustainable. You have removed yourself from the loop, so you don't even know that all the innocent booboos have formed a monster of a codebase. You only feel the pain when it's too late. [...]

You have zero fucking idea what's going on because you delegated all your agency to your agents. You let them run free, and they are merchants of complexity.

I think Mario is exactly right about this. Agents let us move so much faster, but this speed also means that changes which we would normally have considered over the course of weeks are landing in a matter of hours.

It's so easy to let the codebase evolve outside of our abilities to reason clearly about it. Cognitive debt is real.

Mario recommends slowing down:

Give yourself time to think about what you're actually building and why. Give yourself an opportunity to say, fuck no, we don't need this. Set yourself limits on how much code you let the clanker generate per day, in line with your ability to actually review the code.

Anything that defines the gestalt of your system, that is architecture, API, and so on, write it by hand. [...]

I'm not convinced writing by hand is the best way to address this, but it's absolutely the case that we need the discipline to find a new balance of speed v.s. mental thoroughness now that typing out the code is no longer anywhere close to being the bottleneck on writing software.

Tags: ai, generative-ai, llms, coding-agents, cognitive-debt, agentic-engineering, pi

Experimenting with Starlette 1.0 with Claude skills

2026-03-22T23:57:44+00:00

Starlette 1.0 is out! This is a really big deal. I think Starlette may be the Python framework with the most usage compared to its relatively low brand recognition because Starlette is the foundation of FastAPI, which has attracted a huge amount of buzz that seems to have overshadowed Starlette itself.

Kim Christie started working on Starlette in 2018 and it quickly became my favorite out of the new breed of Python ASGI frameworks. The only reason I didn't use it as the basis for my own Datasette project was that it didn't yet promise stability, and I was determined to provide a stable API for Datasette's own plugins... albeit I still haven't been brave enough to ship my own 1.0 release (after 26 alphas and counting)!

Then in September 2025 Marcelo Trylesinski announced that Starlette and Uvicorn were transferring to their GitHub account, in recognition of their many years of contributions and to make it easier for them to receive sponsorship against those projects.

The 1.0 version has a few breaking changes compared to the 0.x series, described in the release notes for 1.0.0rc1 that came out in February.

The most notable of these is a change to how code runs on startup and shutdown. Previously that was handled by on_startup and on_shutdown parameters, but the new system uses a neat lifespan mechanism instead based around an async context manager:

@contextlib.asynccontextmanager
async def lifespan(app):
    async with some_async_resource():
        print("Run at startup!")
        yield
        print("Run on shutdown!")

app = Starlette(
    routes=routes,
    lifespan=lifespan
)

If you haven't tried Starlette before it feels to me like an asyncio-native cross between Flask and Django, unsurprising since creator Kim Christie is also responsible for Django REST Framework. Crucially, this means you can write most apps as a single Python file, Flask style.

This makes it really easy for LLMs to spit out a working Starlette app from a single prompt.

There's just one problem there: if 1.0 breaks compatibility with the Starlette code that the models have been trained on, how can we have them generate code that works with 1.0?

I decided to see if I could get this working with a Skill.

Building a Skill with Claude

Regular Claude Chat on claude.ai has skills, and one of those default skills is the skill-creator skill. This means Claude knows how to build its own skills.

So I started a chat session and told it:

Clone Starlette from GitHub - it just had its 1.0 release. Build a skill markdown document for this release which includes code examples of every feature.

I didn't even tell it where to find the repo, Starlette is widely enough known that I expected it could find it on its own.

It ran git clone https://github.com/encode/starlette.git which is actually the old repository name, but GitHub handles redirects automatically so this worked just fine.

The resulting skill document looked very thorough to me... and then I noticed a new button at the top I hadn't seen before labelled "Copy to your skills". So I clicked it:

And now my regular Claude chat has access to that skill!

A task management demo app

I started a new conversation and prompted:

Build a task management app with Starlette, it should have projects and tasks and comments and labels

And Claude did exactly that, producing a simple GitHub Issues clone using Starlette 1.0, a SQLite database (via aiosqlite) and a Jinja2 template.

Claude even tested the app manually like this:

cd /home/claude/taskflow && timeout 5 python -c "
import asyncio
from database import init_db
asyncio.run(init_db())
print('DB initialized successfully')
" 2>&1

pip install httpx --break-system-packages -q \
  && cd /home/claude/taskflow && \
  python -c "
from starlette.testclient import TestClient
from main import app

client = TestClient(app)

r = client.get('/api/stats')
print('Stats:', r.json())

r = client.get('/api/projects')
print('Projects:', len(r.json()), 'found')

r = client.get('/api/tasks')
print('Tasks:', len(r.json()), 'found')

r = client.get('/api/labels')
print('Labels:', len(r.json()), 'found')

r = client.get('/api/tasks/1')
t = r.json()
print(f'Task 1: \"{t[\"title\"]}\" - {len(t[\"comments\"])} comments, {len(t[\"labels\"])} labels')

r = client.post('/api/tasks', json={'title':'Test task','project_id':1,'priority':'high','label_ids':[1,2]})
print('Created task:', r.status_code, r.json()['title'])

r = client.post('/api/comments', json={'task_id':1,'content':'Test comment'})
print('Created comment:', r.status_code)

r = client.get('/')
print('Homepage:', r.status_code, '- length:', len(r.text))

print('\nAll tests passed!')
"

For all of the buzz about Claude Code, it's easy to overlook that Claude itself counts as a coding agent now, fully able to both write and then test the code that it is writing.

Here's what the resulting app looked like. The code is here in my research repository.

Tags: open-source, python, ai, asgi, kim-christie, generative-ai, llms, ai-assisted-programming, claude, coding-agents, skills, agentic-engineering, starlette

Using Git with coding agents

2026-03-21T22:08:24+00:00

Agentic Engineering Patterns >

Git is a key tool for working with coding agents. Keeping code in version control lets us record how that code changes over time and investigate and reverse any mistakes. All of the coding agents are fluent in using Git's features, both basic and advanced.

This fluency means we can be more ambitious about how we use Git ourselves. We don't need to memorize how to do things with Git, but staying aware of what's possible means we can take advantage of the full suite of Git's abilities.

Git essentials

Each Git project lives in a repository - a folder on disk that can track changes made to the files within it. Those changes are recorded in commits - timestamped bundles of changes to one or more files accompanied by a commit message describing those changes and an author recording who made them.

Git supports branches, which allow you to construct and experiment with new changes independently of each other. Branches can then be merged back into your main branch (using various methods) once they are deemed ready.

Git repositories can be cloned onto a new machine, and that clone includes both the current files and the full history of changes to them. This means developers - or coding agents - can browse and explore that history without any extra network traffic, making history diving effectively free.

Git repositories can live just on your own machine, but Git is designed to support collaboration and backups by publishing them to a remote, which can be public or private. GitHub is the most popular place for these remotes but Git is open source software that enables hosting these remotes on any machine or service that supports the Git protocol.

Core concepts and prompts

Coding agents all have a deep understanding of Git jargon. The following prompts should work with any of them:

Start a new Git repo here

To turn the folder the agent is working in into a Git repository - the agent will probably run the git init command. If you just say "repo" agents will assume you mean a Git repository.

Commit these changes

Create a new Git commit to record the changes the agent has made - usually with the git commit -m "commit message" command.

Add username/repo as a github remote

This should configure your repository for GitHub. You'll need to create a new repo first using github.com/new, and configure your machine to talk to GitHub.

Review changes made today

Or "recent changes" or "last three commits".

This is a great way to start a fresh coding agents session. Telling the agent to look at recent changes causes it to run git log, which can instantly load its context with details of what you have been working on recently - both the modified code and the commit messages that describe it.

Seeding the session in this way means you can start talking about that code - suggest additional fixes, ask questions about how it works, or propose the next change that builds on what came before.

Integrate latest changes from main

Run this on your main branch to fetch other contributions from the remote repository, or run it in a branch to integrate the latest changes on main.

There are multiple ways to merge changes, including merge, rebase, squash or fast-forward. If you can't remember the details of these that's fine:

Discuss options for integrating changes from main

Agents are great at explaining the pros and cons of different merging strategies, and everything in git can always be undone so there's minimal risk in trying new things.

Sort out this git mess for me

I use this universal prompt surprisingly often! Here's a recent example where it fixed a cherry-pick for me that failed with a merge conflict.

There are plenty of ways you can get into a mess with Git, often through pulls or rebase commands that end in a merge conflict, or just through adding the wrong things to Git's staging environment.

Unpicking those used to be the most difficult and time consuming parts of working with Git. No more! Coding agents can navigate the most Byzantine of merge conflicts, reasoning through the intent of the new code and figuring out what to keep and how to combine conflicting changes. If your code has automated tests (and it should) the agent can ensure those pass before finalizing that merge.

Find and recover my code that does ...

If you lose code that you are working on that's previously been committed (or saved with git stash) your agent can probably find it for you.

Git has a mechanism called the reflog which can often capture details of code that hasn't been committed to a permanent branch. Agents can search that, and search other branches too.

Just tell them what to find and watch them dive in.

Use git bisect to find when this bug was introduced: ...

Git bisect is one of the most powerful debugging tools in Git's arsenal, but it has a relatively steep learning curve that often deters developers from using it.

When you run a bisect operation you provide Git with some kind of test condition and a start and ending commit range. Git then runs a binary search to identify the earliest commit for which your test condition fails.

This can efficiently answer the question "what first caused this bug". The only downside is the need to express the test for the bug in a format that Git bisect can execute.

Coding agents can handle this boilerplate for you. This upgrades Git bisect from an occasional use tool to one you can deploy any time you are curious about the historic behavior of your software.

Rewriting history

Let's get into the fun advanced stuff.

The commit history of a Git repository is not fixed. The data is just files on disk after all (tucked away in a hidden .git/ directory), and Git itself provides tools that can be used to modify that history.

Don't think of the Git history as a permanent record of what actually happened - instead consider it to be a deliberately authored story that describes the progression of the software project.

This story is a tool to aid future development. Permanently recording mistakes and cancelled directions can sometimes be useful, but repository authors can make editorial decisions about what to keep and how best to capture that history.

Coding agents are really good at using Git's advanced history rewriting features.

Undo or rewrite commits

Undo last commit

It's common to commit code and then regret it - realize that it includes a file you didn't mean to include, for example. The git recipe for this is git reset --soft HEAD~1. I've never been able to remember that, and now I don't have to!

Remove uv.lock from that last commit

You can also perform more finely grained surgery on commits - rewriting them to remove just a single file, for example.

Combine last three commits with a better commit message

Agents can rewrite commit messages and can combine multiple commits into a single unit.

I've found that frontier models usually have really good taste in commit messages. I used to insist on writing these myself but I've accepted that the quality they produce is generally good enough, and often even better than what I would have produced myself.

Building a new repository from scraps of an older one

A trick I find myself using quite often is extracting out code from a larger repository into a new one while maintaining the key history of that code.

One common example is library extraction. I may have built some classes and functions into a project and later realized they would make more sense as a standalone reusable code library.

Start a new repo at /tmp/distance-functions and build a Python library there with the lib/distance_functions.py module from here - build a similar commit history copying the author and commit dates in the new repo

This kind of operation used to be involved enough that most developers would create a fresh copy detached from that old commit history. We don't have to settle for that any more!

Tags: coding-agents, generative-ai, github, agentic-engineering, ai, git, llms

Subagents

2026-03-17T12:32:28+00:00

Agentic Engineering Patterns >

LLMs are restricted by their context limit - how many tokens they can fit in their working memory at any given time. These values have not increased much over the past two years even as the LLMs themselves have seen dramatic improvements in their abilities - they generally top out at around 1,000,000, and benchmarks frequently report better quality results below 200,000.

Carefully managing the context such that it fits within those limits is critical to getting great results out of a model.

Subagents provide a simple but effective way to handle larger tasks without burning through too much of the coding agent’s valuable top-level context.

When a coding agent uses a subagent it effectively dispatches a fresh copy of itself to achieve a specified goal, with a new context window that starts with a fresh prompt.

Claude Code’s Explore subagent

Claude Code uses subagents extensively as part of its standard way of working. Let's use that as an illustrative example.

Any time you start a new task against an existing repo Claude Code first needs to explore that repo to figure out its general shape and find relevant information needed to achieve that task.

It does this by constructing a prompt and dispatching a subagent to perform that exploration and return a description of what it finds.

I kicked off a Claude Code session with a screenshot of a page from my guide and this prompt:

Make the chapter diffs also show which characters have changed in this diff view with a darker color of red or green for the individually changed segments of text within the line

Here's the transcript of that session.

Claude Code started by running an “Explore” subagent with the following starter prompt:

Find the code that implements the diff view for "chapters" in this Django blog. I need to find:

Templates that render diffs (look for diff-related HTML/CSS with red/green backgrounds)

Python code that generates diffs (look for difflib usage or similar)

Any JavaScript related to diff rendering

CSS styles for the diff view (red/green line backgrounds)

Search thoroughly - check templates/, static/, blog/ directories. Look for keywords like "diff", "chapter", "revision", "history", "compare".

Subagents work similar to any other tool call: the parent agent dispatches them just as they would any other tool and waits for the response. It's interesting to see models prompt themselves in this way - they generally have good taste in prompting strategies.

The Explore subagent churned away for a few seconds and returned a response that started like this:

Perfect! Now let me create a comprehensive summary with all the findings:

Summary: Chapter Diff View Implementation

I found the complete implementation of the diff view for chapters in this Django blog. Here are the key components:

1. Python Code - Diff Generation (/home/user/simonwillisonblog/guides/views.py, lines 139-202)

The chapter_changes() view function uses Python's built-in difflib.unified_diff() to generate diffs: […]

The full subagent response included all of the details the parent agent needed in order to start editing the code to address my original request.

Parallel subagents

This Explore subagent is the simplest example of how subagents can work, with the parent agent pausing while the subagent runs. The principle advantage of this kind of subagent is that it can work with a fresh context in a way that avoids spending tokens from the parent’s available limit.

Subagents can also provide a significant performance boost by having the parent agent run multiple subagents at the same time, potentially also using faster and cheaper models such as Claude Haiku to accelerate those tasks.

Coding agents that support subagents can use them based on your instructions. Try prompts like this:

Use subagents to find and update all of the templates that are affected by this change.

For tasks that involve editing several files - and where those files are not dependent on each other - this can offer a significant speed boost.

Specialist subagents

Some coding agents allow subagents to run with further customizations, often in the form of a custom system prompt or custom tools or both, which allow those subagents to take on a different role.

These roles can cover a variety of useful specialties:

A code reviewer agent can review code and identify bugs, feature gaps or weaknesses in the design.
A test runner agent can run the test. This is particularly worthwhile if your test suite is large and verbose, as the subagent can hide the full test output from the main coding agent and report back with just details of any failures.
A debugger agent can specialize in debugging problems, spending its token allowance reasoning though the codebase and running snippets of code to help isolate steps to reproduce and determine the root cause of a bug.

While it can be tempting to go overboard breaking up tasks across dozens of different specialist subagents, it's important to remember that the main value of subagents is in preserving that valuable root context and managing token-heavy operations. Your root coding agent is perfectly capable of debugging or reviewing its own output provided it has the tokens to spare.

Official documentation

Several popular coding agents support subagents, each with their own documentation on how to use them:

Tags: parallel-agents, coding-agents, generative-ai, agentic-engineering, ai, llms

Use subagents and custom agents in Codex

2026-03-16T23:03:56+00:00

Use subagents and custom agents in Codex

Subagents were announced in general availability today for OpenAI Codex, after several weeks of preview behind a feature flag.

They're very similar to the Claude Code implementation, with default subagents for "explorer", "worker" and "default". It's unclear to me what the difference between "worker" and "default" is but based on their CSV example I think "worker" is intended for running large numbers of small tasks in parallel.

Codex also lets you define custom agents as TOML files in ~/.codex/agents/. These can have custom instructions and be assigned to use specific models - including gpt-5.3-codex-spark if you want some raw speed. They can then be referenced by name, as demonstrated by this example prompt from the documentation:

Investigate why the settings modal fails to save. Have browser_debugger reproduce it, code_mapper trace the responsible code path, and ui_fixer implement the smallest fix once the failure mode is clear.

The subagents pattern is widely supported in coding agents now. Here's documentation across a number of different platforms:

Update: I added a chapter on Subagents to my Agentic Engineering Patterns guide.

Via @OpenAIDevs

Tags: ai, openai, generative-ai, llms, coding-agents, codex, parallel-agents, agentic-engineering

How coding agents work

2026-03-16T14:01:41+00:00

Agentic Engineering Patterns >

As with any tool, understanding how coding agents work under the hood can help you make better decisions about how to apply them.

A coding agent is a piece of software that acts as a harness for an LLM, extending that LLM with additional capabilities that are powered by invisible prompts and implemented as callable tools.

Large Language Models

At the heart of any coding agent is a Large Language Model, or LLM. These have names like GPT-5.4 or Claude Opus 4.6 or Gemini 3.1 Pro or Qwen3.5-35B-A3B.

An LLM is a machine learning model that can complete a sentence of text. Give the model the phrase "the cat sat on the " and it will (almost certainly) suggest "mat" as the next word in the sentence.

As these models get larger and train on increasing amounts of data, they can complete more complex sentences - like "a python function to download a file from a URL is def download_file(url): ".

LLMs don't actually work directly with words - they work with tokens. A sequence of text is converted into a sequence of integer tokens, so "the cat sat on the " becomes [3086, 9059, 10139, 402, 290, 220]. This is worth understanding because LLM providers charge based on the number of tokens processed, and are limited in how many tokens they can consider at a time.

You can experiment with the OpenAI tokenizer to see how this works at platform.openai.com/tokenizer.

The input to an LLM is called the prompt. The text returned by an LLM is called the completion, or sometimes the response.

Many models today are multimodal, which means they can accept more than just text as input. Vision LLMs (vLLMs) can accept images as part of the input, which means you can feed them sketches or photos or screenshots. A common misconception is that these are run through a separate process for OCR or image analysis, but these inputs are actually turned into yet more token integers which are processed in the same way as text.

Chat templated prompts

The first LLMs worked as completion engines - users were expected to provide a prompt which could then be completed by the model, such as the two examples shown above.

This wasn't particularly user-friendly so models mostly switched to using chat templated prompts instead, which represent communication with the model as a simulated conversation.

This is actually just a form of completion prompt with a special format that looks something like this.

user: write a python function to download a file from a URL
assistant:

The natural completion for this prompt is for the assistant (represented by the LLM) to answer the user's question with some Python code.

LLMs are stateless: every time they execute a prompt they start from the same blank slate.

To maintain the simulation of a conversation, the software that talks to the model needs to maintain its own state and replay the entire existing conversation every time the user enters a new chat prompt:

user: write a python function to download a file from a URL
assistant: def download_url(url):
    return urllib.request.urlopen(url).read()
user: use the requests library instead
assistant:

Since providers charge for both input and output tokens, this means that as a conversation gets longer, each prompt becomes more expensive since the number of input tokens grows every time.

Token caching

Most model providers offset this somewhat through a cheaper rate for cached input tokens - common token prefixes that have been processed within a short time period can be charged at a lower rate as the underlying infrastructure can cache and then reuse many of the expensive calculations used to process that input.

Coding agents are designed with this optimization in mind - they avoid modifying earlier conversation content to ensure the cache is used as efficiently as possible.

Calling tools

The defining feature of an LLM agent is that agents can call tools. But what is a tool?

A tool is a function that the agent harness makes available to the LLM.

At the level of the prompt itself, that looks something like this:

system: If you need to access the weather, end your turn with <tool>get_weather(city_name)</tool>
user: what's the weather in San Francisco?
assistant:

Here the assistant might respond with the following text:

<tool>get_weather("San Francisco")</tool>

The model harness software then extracts that function call request from the response - probably with a regular expression - and executes the tool.

It then returns the result to the model, with a constructed prompt that looks something like this:

system: If you need to access the weather, end your turn with <tool>get_weather(city_name)</tool>
user: what's the weather in San Francisco?
assistant: <tool>get_weather("San Francisco")</tool>
user: <tool-result>61°, Partly cloudy</tool-result>
assistant:

The LLM can now use that tool result to help generate an answer to the user's question.

Most coding agents define a dozen or more tools for the agent to call. The most powerful of these allow for code execution - a Bash() tool for executing terminal commands, or a Python() tool for running Python code, for example.

The system prompt

In the previous example I included an initial message marked "system" which informed the LLM about the available tool and how to call it.

Coding agents usually start every conversation with a system prompt like this, which is not shown to the user but provides instructions telling the model how it should behave.

These system prompts can be hundreds of lines long. Here's the system prompt for OpenAI Codex as-of March 2026, which is a useful clear example of the kind of instructions that make these coding agents work.

Reasoning

One of the big new advances in 2025 was the introduction of reasoning to the frontier model families.

Reasoning, sometimes presented as thinking in the UI, is when a model spends additional time generating text that talks through the problem and its potential solutions before presenting a reply to the user.

This can look similar to a person thinking out loud, and has a similar effect. Crucially it allows models to spend more time (and more tokens) working on a problem in order to hopefully get a better result.

Reasoning is particularly useful for debugging issues in code as it gives the model an opportunity to navigate more complex code paths, mixing in tool calls and using the reasoning phase to follow function calls back to the potential source of an issue.

Many coding agents include options for dialing up or down the reasoning effort level, encouraging models to spend more time chewing on harder problems.

LLM + system prompt + tools in a loop

Believe it or not, that's most of what it takes to build a coding agent!

If you want to develop a deeper understanding of how these things work, a useful exercise is to try building your own agent from scratch. A simple tool loop can be achieved with a few dozen lines of code on top of an existing LLM API.

A good tool loop is a great deal more work than that, but the fundamental mechanics are surprisingly straightforward.

Tags: coding-agents, generative-ai, agentic-engineering, ai, llms

What is agentic engineering?

2026-03-15T22:41:57+00:00

Agentic Engineering Patterns >

I use the term agentic engineering to describe the practice of developing software with the assistance of coding agents.

What are coding agents? They're agents that can both write and execute code. Popular examples include Claude Code, OpenAI Codex, and Gemini CLI.

What's an agent? Clearly defining that term is a challenge that has frustrated AI researchers since at least the 1990s but the definition I've come to accept, at least in the field of Large Language Models (LLMs) like GPT-5 and Gemini and Claude, is this one:

Agents run tools in a loop to achieve a goal

The "agent" is software that calls an LLM with your prompt and passes it a set of tool definitions, then calls any tools that the LLM requests and feeds the results back into the LLM.

For coding agents, those tools include one that can execute code.

You prompt the coding agent to define a goal. The agent then generates and executes code in a loop until that goal has been met.

Code execution is the defining capability that makes agentic engineering possible. Without the ability to directly run the code, anything output by an LLM is of limited value. With code execution, these agents can start iterating towards software that demonstrably works.

Agentic engineering

Now that we have software that can write working code, what is there left for us humans to do?

The answer is so much stuff.

Writing code has never been the sole activity of a software engineer. The craft has always been figuring out what code to write. Any given software problem has dozens of potential solutions, each with their own tradeoffs. Our job is to navigate those options and find the ones that are the best fit for our unique set of circumstances and requirements.

Getting great results out of coding agents is a deep subject in its own right, especially now as the field continues to evolve at a bewildering rate.

We need to provide our coding agents with the tools they need to solve our problems, specify those problems in the right level of detail, and verify and iterate on the results until we are confident they address our problems in a robust and credible way.

LLMs don't learn from their past mistakes, but coding agents can, provided we deliberately update our instructions and tool harnesses to account for what we learn along the way.

Used effectively, coding agents can help us be much more ambitious with the projects we take on. Agentic engineering should help us produce more, better quality code that solves more impactful problems.

Isn't this just vibe coding?

The term "vibe coding" was coined by Andrej Karpathy in February 2025 - coincidentally just three weeks prior to the original release of Claude Code - to describe prompting LLMs to write code while you "forget that the code even exists".

Some people extend that definition to cover any time an LLM is used to produce code at all, but I think that's a mistake. Vibe coding is more useful in its original definition - we need a term to describe unreviewed, prototype-quality LLM-generated code that distinguishes it from code that the author has brought up to a production ready standard.

About this guide

Just like the field it attempts to cover, Agentic Engineering Patterns is very much a work in progress. My goal is to identify and describe patterns for working with these tools that demonstrably get results, and that are unlikely to become outdated as the tools advance.

I'll continue adding more chapters as new techniques emerge. No chapter should be considered finished. I'll be updating existing chapters as our understanding of these patterns evolves.

Tags: coding-agents, agent-definitions, generative-ai, agentic-engineering, ai, llms

My fireside chat about agentic engineering at the Pragmatic Summit

2026-03-14T18:19:38+00:00

I was a speaker last month at the Pragmatic Summit in San Francisco, where I participated in a fireside chat session about Agentic Engineering hosted by Eric Lui from Statsig.

The video is available on YouTube. Here are my highlights from the conversation.

Stages of AI adoption

We started by talking about the different phases a software developer goes through in adopting AI coding tools.

02:45

I feel like there are different stages of AI adoption as a programmer. You start off with you've got ChatGPT and you ask it questions and occasionally it helps you out. And then the big step is when you move to the coding agents that are writing code for you—initially writing bits of code and then there's that moment where the agent writes more code than you do, which is a big moment. And that for me happened only about maybe six months ago.

03:42

The new thing as of what, three weeks ago, is you don't read the code. If anyone saw StrongDM—they had a big thing come out last week where they talked about their software factory and their two principles were nobody writes any code, nobody reads any code, which is clear insanity. That is wildly irresponsible. They're a security company building security software, which is why it's worth paying close attention—like how could this possibly be working?

I talked about StrongDM more in How StrongDM's AI team build serious software without even looking at the code.

Trusting AI output

We discussed the challenge of knowing when to trust the AI's output as opposed to reviewing every line with a fine tooth-comb.