<?xml version="1.0" encoding="utf-8"?>
<feed xml:lang="en-us" xmlns="http://www.w3.org/2005/Atom"><title>Simon Willison's Weblog: pydantic</title><link href="http://simonwillison.net/" rel="alternate"/><link href="http://simonwillison.net/tags/pydantic.atom" rel="self"/><id>http://simonwillison.net/</id><updated>2026-02-06T22:31:31+00:00</updated><author><name>Simon Willison</name></author><entry><title>Running Pydantic's Monty Rust sandboxed Python subset in WebAssembly</title><link href="https://simonwillison.net/2026/Feb/6/pydantic-monty/#atom-tag" rel="alternate"/><published>2026-02-06T22:31:31+00:00</published><updated>2026-02-06T22:31:31+00:00</updated><id>https://simonwillison.net/2026/Feb/6/pydantic-monty/#atom-tag</id><summary type="html">
    &lt;p&gt;There's a jargon-filled headline for you! Everyone's &lt;a href="https://simonwillison.net/2026/Jan/8/llm-predictions-for-2026/#1-year-we-re-finally-going-to-solve-sandboxing"&gt;building sandboxes&lt;/a&gt; for running untrusted code right now, and Pydantic's latest attempt, &lt;a href="https://github.com/pydantic/monty"&gt;Monty&lt;/a&gt;, provides a custom Python-like language (a subset of Python) in Rust and makes it available as both a Rust library and a Python package. I got it working in WebAssembly, providing a sandbox-in-a-sandbox.&lt;/p&gt;
&lt;p&gt;Here's &lt;a href="https://github.com/pydantic/monty"&gt;how they describe Monty&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Monty avoids the cost, latency, complexity and general faff of using a full container-based sandbox for running LLM generated code.&lt;/p&gt;
&lt;p&gt;Instead, it lets you safely run Python code written by an LLM embedded in your agent, with startup times measured in single digit microseconds not hundreds of milliseconds.&lt;/p&gt;
&lt;p&gt;What Monty &lt;strong&gt;can&lt;/strong&gt; do:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Run a reasonable subset of Python code - enough for your agent to express what it wants to do&lt;/li&gt;
&lt;li&gt;Completely block access to the host environment: filesystem, env variables and network access are all implemented via external function calls the developer can control&lt;/li&gt;
&lt;li&gt;Call functions on the host - only functions you give it access to [...]&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;A quick way to try it out is via &lt;a href="https://github.com/astral-sh/uv"&gt;uv&lt;/a&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;uv run --with pydantic-monty python -m asyncio
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then paste this into the Python interactive prompt - the &lt;code&gt;-m asyncio&lt;/code&gt; enables top-level await:&lt;/p&gt;
&lt;pre&gt;&lt;span&gt;import&lt;/span&gt; &lt;span&gt;pydantic_monty&lt;/span&gt;
&lt;span&gt;code&lt;/span&gt; &lt;span&gt;=&lt;/span&gt; &lt;span&gt;pydantic_monty&lt;/span&gt;.&lt;span&gt;Monty&lt;/span&gt;(&lt;span&gt;'print("hello " + str(4 * 5))'&lt;/span&gt;)
&lt;span&gt;await&lt;/span&gt; &lt;span&gt;pydantic_monty&lt;/span&gt;.&lt;span&gt;run_monty_async&lt;/span&gt;(&lt;span&gt;code&lt;/span&gt;)&lt;/pre&gt;
&lt;p&gt;Monty supports a &lt;em&gt;very&lt;/em&gt; small subset of Python - it doesn't even support class declarations yet!&lt;/p&gt;
&lt;p&gt;But, given its target use-case, that's not actually a problem.&lt;/p&gt;
&lt;p&gt;The neat thing about providing tools like this for LLMs is that they're really good at iterating against error messages. A coding agent can run some Python code, get an error message telling it that classes aren't supported and then try again with a different approach.&lt;/p&gt;
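&lt;p&gt;That loop is easy to picture. Here's a minimal sketch of it in plain Python - the &lt;code&gt;run_sandboxed&lt;/code&gt; stub and its error message are stand-ins of my own invention, not Monty's actual API:&lt;/p&gt;

```python
# Sketch of the iterate-against-errors loop a coding agent can run.
# run_sandboxed is a stand-in for a real sandbox like Monty, which
# raises when code uses an unsupported feature such as a class definition.

def run_sandboxed(code: str) -> str:
    if "class " in code:
        raise SyntaxError("class definitions are not supported")
    # A real sandbox would execute the code; here we just echo it.
    return f"ran: {code}"

def run_with_retries(candidates: list[str]) -> str:
    last_error = None
    for code in candidates:
        try:
            return run_sandboxed(code)
        except SyntaxError as exc:
            # An agent would feed this message back to the model
            # and ask for a revised attempt.
            last_error = str(exc)
    raise RuntimeError(f"all attempts failed: {last_error}")

# The first attempt uses a class; the "revised" attempt uses a dict instead.
result = run_with_retries([
    "class Point:\n    pass",
    "point = {'x': 1, 'y': 2}",
])
print(result)  # → ran: point = {'x': 1, 'y': 2}
```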
&lt;p&gt;I wanted to try this in a browser, so I fired up &lt;a href="https://simonwillison.net/2025/Nov/6/async-code-research/"&gt;a code research task&lt;/a&gt; in Claude Code for web and kicked it off with the following:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Clone &lt;a href="https://github.com/pydantic/monty"&gt;https://github.com/pydantic/monty&lt;/a&gt; to /tmp and figure out how to compile it into a python WebAssembly wheel that can then be loaded in Pyodide. The wheel file itself should be checked into the repo along with build scripts and passing pytest playwright test scripts that load Pyodide from a CDN and the wheel from a “python -m http.server” localhost and demonstrate it working&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Then a little later:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;I want an additional WASM file that works independently of Pyodide, which is also usable in a web browser - build that too along with playwright tests that show it working. Also build two HTML files - one called demo.html and one called pyodide-demo.html - these should work similar to &lt;a href="https://tools.simonwillison.net/micropython"&gt;https://tools.simonwillison.net/micropython&lt;/a&gt; (download that code with curl to inspect it) - one should load the WASM build, the other should load Pyodide and have it use the WASM wheel. These will be served by GitHub Pages so they can load the WASM and wheel from a relative path since the .html files will be served from the same folder as the wheel and WASM file&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Here's &lt;a href="https://gisthost.github.io/?22d88e6367d7e002c4fb383c213c2df2/page-001.html"&gt;the transcript&lt;/a&gt;, and the &lt;a href="https://github.com/simonw/research/tree/main/monty-wasm-pyodide"&gt;final research report&lt;/a&gt; it produced.&lt;/p&gt;
&lt;p&gt;I now have the Monty Rust code compiled to WebAssembly in two different shapes - as a &lt;code&gt;.wasm&lt;/code&gt; bundle you can load and call from JavaScript, and as a &lt;code&gt;monty-wasm-pyodide/pydantic_monty-0.0.3-cp313-cp313-emscripten_4_0_9_wasm32.whl&lt;/code&gt; wheel file which can be loaded into &lt;a href="https://pyodide.org/"&gt;Pyodide&lt;/a&gt; and then called from Python in Pyodide in WebAssembly in a browser.&lt;/p&gt;
&lt;p&gt;Here are those two demos, hosted on GitHub Pages:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://simonw.github.io/research/monty-wasm-pyodide/demo.html"&gt;Monty WASM demo&lt;/a&gt; - a UI over JavaScript that loads the Rust WASM module directly.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://simonw.github.io/research/monty-wasm-pyodide/pyodide-demo.html"&gt;Monty Pyodide demo&lt;/a&gt; - this one provides an identical interface but here the code is &lt;a href="https://github.com/simonw/research/blob/3add1ffec70b530711fa237d91f546da5bcf1f1c/monty-wasm-pyodide/pyodide-demo.html#L257-L280"&gt;loading Pyodide and then installing the Monty WASM wheel&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2026/monty-pyodide.jpg" alt="Screenshot of a web app titled &amp;quot;Monty via Pyodide&amp;quot; with description &amp;quot;Run Monty (a sandboxed Python interpreter by Pydantic) inside Pyodide (CPython compiled to WebAssembly). This loads the pydantic-monty wheel and uses its full Python API. Code is saved in the URL for sharing.&amp;quot; A green banner reads &amp;quot;Code executed successfully!&amp;quot; Below are example buttons labeled &amp;quot;Basic&amp;quot;, &amp;quot;Inputs&amp;quot;, &amp;quot;Reuse&amp;quot;, &amp;quot;Error Handling&amp;quot;, &amp;quot;Fibonacci&amp;quot;, and &amp;quot;Classes&amp;quot;. A code editor labeled &amp;quot;Python Code (runs inside Monty sandbox via Pyodide):&amp;quot; contains: &amp;quot;import pydantic_monty\n\n# Create interpreter with input variables\nm = pydantic_monty.Monty('x + y', inputs=['x', 'y'])\n\n# Run with different inputs\nresult1 = m.run(inputs={&amp;quot;x&amp;quot;: 10, &amp;quot;y&amp;quot;: 20})\nprint(f&amp;quot;10 + 20 = {result1}&amp;quot;)\n\nresult2 = m.run(inputs={&amp;quot;x&amp;quot;: 100, &amp;quot;y&amp;quot;: 200})&amp;quot; with &amp;quot;Run Code&amp;quot; and &amp;quot;Clear&amp;quot; buttons. The Output section shows &amp;quot;10 + 20 = 30&amp;quot; and &amp;quot;100 + 200 = 300&amp;quot; with a &amp;quot;Copy&amp;quot; button. Footer reads &amp;quot;Executed in 4.0ms&amp;quot;." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;As a connoisseur of sandboxes - the more options the better! - this new entry from Pydantic ticks a lot of my boxes. It's small, fast, widely available (thanks to Rust and WebAssembly) and provides strict limits on memory usage, CPU time and access to disk and network.&lt;/p&gt;
&lt;p&gt;It was also a great excuse to spin up another demo showing how easy it is these days to turn compiled code like C or Rust into WebAssembly that runs in both a browser and a Pyodide environment.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/javascript"&gt;javascript&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sandboxing"&gt;sandboxing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/rust"&gt;rust&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/webassembly"&gt;webassembly&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pyodide"&gt;pyodide&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-programming"&gt;ai-assisted-programming&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pydantic"&gt;pydantic&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/coding-agents"&gt;coding-agents&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude-code"&gt;claude-code&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="javascript"/><category term="python"/><category term="sandboxing"/><category term="ai"/><category term="rust"/><category term="webassembly"/><category term="pyodide"/><category term="generative-ai"/><category term="llms"/><category term="ai-assisted-programming"/><category term="pydantic"/><category term="coding-agents"/><category term="claude-code"/></entry><entry><title>mistralai/mistral-vibe</title><link href="https://simonwillison.net/2025/Dec/9/mistral-vibe/#atom-tag" rel="alternate"/><published>2025-12-09T20:19:21+00:00</published><updated>2025-12-09T20:19:21+00:00</updated><id>https://simonwillison.net/2025/Dec/9/mistral-vibe/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/mistralai/mistral-vibe"&gt;mistralai/mistral-vibe&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Here's the Apache 2.0 licensed source code for Mistral's new "Vibe" CLI coding agent, &lt;a href="https://mistral.ai/news/devstral-2-vibe-cli"&gt;released today&lt;/a&gt; alongside Devstral 2.&lt;/p&gt;
&lt;p&gt;It's a neat implementation of the now standard terminal coding agent pattern, built in Python on top of Pydantic and Rich/Textual (here are &lt;a href="https://github.com/mistralai/mistral-vibe/blob/v1.0.4/pyproject.toml#L29-L46"&gt;the dependencies&lt;/a&gt;). &lt;a href="https://github.com/google-gemini/gemini-cli"&gt;Gemini CLI&lt;/a&gt; is TypeScript, Claude Code is closed source (TypeScript, now &lt;a href="https://simonwillison.net/2025/Dec/2/anthropic-acquires-bun/"&gt;on top of Bun&lt;/a&gt;), OpenAI's &lt;a href="https://github.com/openai/codex"&gt;Codex CLI&lt;/a&gt; is Rust. &lt;a href="https://github.com/OpenHands/OpenHands"&gt;OpenHands&lt;/a&gt; is the other major Python coding agent I know of, but I'm likely missing some others. (UPDATE: &lt;a href="https://github.com/MoonshotAI/kimi-cli"&gt;Kimi CLI&lt;/a&gt; is another open source Apache 2 Python one.)&lt;/p&gt;
&lt;p&gt;The Vibe source code is pleasant to read and the crucial prompts are neatly extracted out into Markdown files. Some key places to look:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/mistralai/mistral-vibe/blob/v1.0.4/vibe/core/prompts/cli.md"&gt;core/prompts/cli.md&lt;/a&gt; is the main system prompt ("You are operating as and within Mistral Vibe, a CLI coding-agent built by Mistral AI...")&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/mistralai/mistral-vibe/blob/v1.0.4/vibe/core/prompts/compact.md"&gt;core/prompts/compact.md&lt;/a&gt; is the prompt used to generate compacted summaries of conversations ("Create a comprehensive summary of our entire conversation that will serve as complete context for continuing this work...")&lt;/li&gt;
&lt;li&gt;Each of the core tools has its own prompt file:&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/mistralai/mistral-vibe/blob/v1.0.4/vibe/core/tools/builtins/prompts/bash.md"&gt;.../prompts/bash.md&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/mistralai/mistral-vibe/blob/v1.0.4/vibe/core/tools/builtins/prompts/grep.md"&gt;.../prompts/grep.md&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/mistralai/mistral-vibe/blob/v1.0.4/vibe/core/tools/builtins/prompts/read_file.md"&gt;.../prompts/read_file.md&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/mistralai/mistral-vibe/blob/v1.0.4/vibe/core/tools/builtins/prompts/write_file.md"&gt;.../prompts/write_file.md&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/mistralai/mistral-vibe/blob/v1.0.4/vibe/core/tools/builtins/prompts/search_replace.md"&gt;.../prompts/search_replace.md&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/mistralai/mistral-vibe/blob/v1.0.4/vibe/core/tools/builtins/prompts/todo.md"&gt;.../prompts/todo.md&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The Python implementations of those tools &lt;a href="https://github.com/mistralai/mistral-vibe/tree/v1.0.4/vibe/core/tools/builtins"&gt;can be found here&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I tried it out and had it build me a Space Invaders game using three.js with the following prompt:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;make me a space invaders game as HTML with three.js loaded from a CDN&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;img alt="Animated screenshot demo of Mistral Vibe running in a terminal. The text reads: I've created a Space Invaders game using HTML and Three. js loaded from a CDN. The game is now available in the file space_invaders.html in your current directory. Here's how to play: 1. Open the space_invaders.html file in a web browser 2. Use the left and right arrow keys to move your player (green rectangle) 3. Press the spacebar to shoot at the invaders (red rectangles) 4. Try to get the highest score before the invaders reach you or hit you with their bullets The game features: © Player movement with arrow keys © Shooting mechanics with spacebar © Enemy invaders that move back and forth © Collision detection « Score tracking * Game over screen © Increasing difficulty Writing file (64s esc to interrupt) »» auto-approve on (shift-tab to toggle) - 7% of 100k tokens" src="https://static.simonwillison.net/static/2025/vibe.gif" /&gt;&lt;/p&gt;
&lt;p&gt;Here's &lt;a href="https://github.com/simonw/space-invaders-by-llms/blob/main/mistral-vibe-devstral-2/index.html"&gt;the source code&lt;/a&gt;  and &lt;a href="https://space-invaders.simonwillison.net/mistral-vibe-devstral-2/"&gt;the live game&lt;/a&gt; (hosted in my new &lt;a href="https://github.com/simonw/space-invaders-by-llms"&gt;space-invaders-by-llms&lt;/a&gt; repo). It did OK.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/prompt-engineering"&gt;prompt-engineering&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/textual"&gt;textual&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-programming"&gt;ai-assisted-programming&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/mistral"&gt;mistral&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pydantic"&gt;pydantic&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/vibe-coding"&gt;vibe-coding&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/coding-agents"&gt;coding-agents&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/system-prompts"&gt;system-prompts&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/space-invaders"&gt;space-invaders&lt;/a&gt;&lt;/p&gt;



</summary><category term="python"/><category term="ai"/><category term="prompt-engineering"/><category term="generative-ai"/><category term="llms"/><category term="textual"/><category term="ai-assisted-programming"/><category term="mistral"/><category term="pydantic"/><category term="vibe-coding"/><category term="coding-agents"/><category term="system-prompts"/><category term="space-invaders"/></entry><entry><title>Reflections on OpenAI</title><link href="https://simonwillison.net/2025/Jul/15/reflections-on-openai/#atom-tag" rel="alternate"/><published>2025-07-15T18:02:41+00:00</published><updated>2025-07-15T18:02:41+00:00</updated><id>https://simonwillison.net/2025/Jul/15/reflections-on-openai/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://calv.info/openai-reflections"&gt;Reflections on OpenAI&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Calvin French-Owen spent just over a year working at OpenAI, during which time the organization grew from 1,000 to 3,000 people and Calvin found himself in "the top 30% by tenure".&lt;/p&gt;
&lt;p&gt;His reflections on leaving are &lt;em&gt;fascinating&lt;/em&gt; - absolutely crammed with detail about OpenAI's internal culture that I haven't seen described anywhere else before.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;I think of OpenAI as an organization that started like Los Alamos. It was a group of scientists and tinkerers investigating the cutting edge of science. That group happened to accidentally spawn the most viral consumer app in history. And then grew to have ambitions to sell to governments and enterprises.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;There's a lot in here, and it's worth spending time with the whole thing. A few points that stood out to me below.&lt;/p&gt;
&lt;p&gt;Firstly, OpenAI are a Python shop who lean a whole lot on &lt;a href="https://docs.pydantic.dev/latest/"&gt;Pydantic&lt;/a&gt; and &lt;a href="https://fastapi.tiangolo.com/"&gt;FastAPI&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;OpenAI uses a &lt;strong&gt;giant monorepo&lt;/strong&gt; which is ~mostly Python (though there is a growing set of Rust services and a handful of Golang services sprinkled in for things like network proxies). This creates a lot of strange-looking code because there are so many ways you can write Python. You will encounter both libraries designed for scale from 10y Google veterans as well as throwaway Jupyter notebooks from newly-minted PhDs. Pretty much everything operates around FastAPI to create APIs and Pydantic for validation. But there aren't style guides enforced writ-large.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;ChatGPT's success has influenced everything that they build, even at a technical level:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Chat runs really deep&lt;/strong&gt;. Since ChatGPT took off, a &lt;em&gt;lot&lt;/em&gt; of the codebase is structured around the idea of chat messages and conversations. These primitives are so baked at this point, you should probably ignore them at your own peril.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Here's a rare peek at how improvements to large models get discovered and incorporated into training runs:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;How large models are trained (at a high-level).&lt;/strong&gt; There's a spectrum from "experimentation" to "engineering". Most ideas start out as small-scale experiments. If the results look promising, they then get incorporated into a bigger run. Experimentation is as much about tweaking the core algorithms as it is tweaking the data mix and carefully studying the results. On the large end, doing a big run almost looks like giant distributed systems engineering. There will be weird edge cases and things you didn't expect.&lt;/p&gt;
&lt;/blockquote&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=44573195"&gt;Hacker News&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openai"&gt;openai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/chatgpt"&gt;chatgpt&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pydantic"&gt;pydantic&lt;/a&gt;&lt;/p&gt;



</summary><category term="python"/><category term="ai"/><category term="openai"/><category term="generative-ai"/><category term="chatgpt"/><category term="llms"/><category term="pydantic"/></entry><entry><title>MCP Run Python</title><link href="https://simonwillison.net/2025/Apr/18/mcp-run-python/#atom-tag" rel="alternate"/><published>2025-04-18T04:51:20+00:00</published><updated>2025-04-18T04:51:20+00:00</updated><id>https://simonwillison.net/2025/Apr/18/mcp-run-python/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/pydantic/pydantic-ai/tree/main/mcp-run-python"&gt;MCP Run Python&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Pydantic AI's MCP server for running LLM-generated Python code in a sandbox. They ended up using a trick I explored &lt;a href="https://til.simonwillison.net/deno/pyodide-sandbox"&gt;two years ago&lt;/a&gt;: using a &lt;a href="https://deno.com/"&gt;Deno&lt;/a&gt; process to run &lt;a href="https://pyodide.org/"&gt;Pyodide&lt;/a&gt; in a WebAssembly sandbox.&lt;/p&gt;
&lt;p&gt;Here's a bit of a wild trick: since Deno loads code on-demand from &lt;a href="https://jsr.io/"&gt;JSR&lt;/a&gt;, and &lt;a href="https://docs.astral.sh/uv/guides/scripts/"&gt;uv run&lt;/a&gt; can install Python dependencies on demand via the &lt;code&gt;--with&lt;/code&gt; option... here's a one-liner you can paste into a macOS shell (provided you have Deno and &lt;code&gt;uv&lt;/code&gt; installed already) which will run the example from &lt;a href="https://github.com/pydantic/pydantic-ai/blob/v0.1.2/mcp-run-python/README.md"&gt;their README&lt;/a&gt; - calculating the number of days between two dates in the most complex way imaginable:&lt;/p&gt;
&lt;pre&gt;ANTHROPIC_API_KEY=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;sk-ant-...&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; \
uv run --with pydantic-ai python -c &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;import asyncio&lt;/span&gt;
&lt;span class="pl-s"&gt;from pydantic_ai import Agent&lt;/span&gt;
&lt;span class="pl-s"&gt;from pydantic_ai.mcp import MCPServerStdio&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;server = MCPServerStdio(&lt;/span&gt;
&lt;span class="pl-s"&gt;    "deno",&lt;/span&gt;
&lt;span class="pl-s"&gt;    args=[&lt;/span&gt;
&lt;span class="pl-s"&gt;        "run",&lt;/span&gt;
&lt;span class="pl-s"&gt;        "-N",&lt;/span&gt;
&lt;span class="pl-s"&gt;        "-R=node_modules",&lt;/span&gt;
&lt;span class="pl-s"&gt;        "-W=node_modules",&lt;/span&gt;
&lt;span class="pl-s"&gt;        "--node-modules-dir=auto",&lt;/span&gt;
&lt;span class="pl-s"&gt;        "jsr:@pydantic/mcp-run-python",&lt;/span&gt;
&lt;span class="pl-s"&gt;        "stdio",&lt;/span&gt;
&lt;span class="pl-s"&gt;    ],&lt;/span&gt;
&lt;span class="pl-s"&gt;)&lt;/span&gt;
&lt;span class="pl-s"&gt;agent = Agent("claude-3-5-haiku-latest", mcp_servers=[server])&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;async def main():&lt;/span&gt;
&lt;span class="pl-s"&gt;    async with agent.run_mcp_servers():&lt;/span&gt;
&lt;span class="pl-s"&gt;        result = await agent.run("How many days between 2000-01-01 and 2025-03-18?")&lt;/span&gt;
&lt;span class="pl-s"&gt;    print(result.output)&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;asyncio.run(main())&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;

&lt;p&gt;I ran that just now and got:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The number of days between January 1st, 2000 and March 18th, 2025 is 9,208 days.&lt;/p&gt;
&lt;/blockquote&gt;
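&lt;p&gt;The model's answer checks out - here's the same calculation done directly in plain Python, no sandbox required:&lt;/p&gt;

```python
from datetime import date

# Same calculation as the LLM-generated code, without the sandbox.
days_between = (date(2025, 3, 18) - date(2000, 1, 1)).days
print(days_between)  # → 9208
```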
&lt;p&gt;I thoroughly enjoy how tools like &lt;code&gt;uv&lt;/code&gt; and Deno enable throwing together shell one-liner demos like this one.&lt;/p&gt;
&lt;p&gt;Here's &lt;a href="https://gist.github.com/simonw/54fc42ef9a7fb8f777162bbbfbba4f23"&gt;an extended version&lt;/a&gt; of this example which adds pretty-printed logging of the messages exchanged with the LLM to illustrate exactly what happened. The most important piece is this tool call where Claude 3.5 Haiku asks for Python code to be executed my the MCP server:&lt;/p&gt;
&lt;pre&gt;&lt;span class="pl-en"&gt;ToolCallPart&lt;/span&gt;(
    &lt;span class="pl-s1"&gt;tool_name&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-s"&gt;'run_python_code'&lt;/span&gt;,
    &lt;span class="pl-s1"&gt;args&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;{
        &lt;span class="pl-s"&gt;'python_code'&lt;/span&gt;: (
            &lt;span class="pl-s"&gt;'from datetime import date&lt;span class="pl-cce"&gt;\n&lt;/span&gt;'&lt;/span&gt;
            &lt;span class="pl-s"&gt;'&lt;span class="pl-cce"&gt;\n&lt;/span&gt;'&lt;/span&gt;
            &lt;span class="pl-s"&gt;'date1 = date(2000, 1, 1)&lt;span class="pl-cce"&gt;\n&lt;/span&gt;'&lt;/span&gt;
            &lt;span class="pl-s"&gt;'date2 = date(2025, 3, 18)&lt;span class="pl-cce"&gt;\n&lt;/span&gt;'&lt;/span&gt;
            &lt;span class="pl-s"&gt;'&lt;span class="pl-cce"&gt;\n&lt;/span&gt;'&lt;/span&gt;
            &lt;span class="pl-s"&gt;'days_between = (date2 - date1).days&lt;span class="pl-cce"&gt;\n&lt;/span&gt;'&lt;/span&gt;
            &lt;span class="pl-s"&gt;'print(f"Number of days between {date1} and {date2}: {days_between}")'&lt;/span&gt;
        ),
    },
    &lt;span class="pl-s1"&gt;tool_call_id&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-s"&gt;'toolu_01TXXnQ5mC4ry42DrM1jPaza'&lt;/span&gt;,
    &lt;span class="pl-s1"&gt;part_kind&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-s"&gt;'tool-call'&lt;/span&gt;,
)&lt;/pre&gt;

&lt;p&gt;I also managed to run it against &lt;a href="https://ollama.com/library/mistral-small3.1"&gt;Mistral Small 3.1&lt;/a&gt; (15GB) running locally using &lt;a href="https://ollama.com/"&gt;Ollama&lt;/a&gt; (I had to add "Use your python tool" to the prompt to get it to work):&lt;/p&gt;
&lt;pre&gt;ollama pull mistral-small3.1:24b

uv run --with devtools --with pydantic-ai python -c &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;import asyncio&lt;/span&gt;
&lt;span class="pl-s"&gt;from devtools import pprint&lt;/span&gt;
&lt;span class="pl-s"&gt;from pydantic_ai import Agent, capture_run_messages&lt;/span&gt;
&lt;span class="pl-s"&gt;from pydantic_ai.models.openai import OpenAIModel&lt;/span&gt;
&lt;span class="pl-s"&gt;from pydantic_ai.providers.openai import OpenAIProvider&lt;/span&gt;
&lt;span class="pl-s"&gt;from pydantic_ai.mcp import MCPServerStdio&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;server = MCPServerStdio(&lt;/span&gt;
&lt;span class="pl-s"&gt;    "deno",&lt;/span&gt;
&lt;span class="pl-s"&gt;    args=[&lt;/span&gt;
&lt;span class="pl-s"&gt;        "run",&lt;/span&gt;
&lt;span class="pl-s"&gt;        "-N",&lt;/span&gt;
&lt;span class="pl-s"&gt;        "-R=node_modules",&lt;/span&gt;
&lt;span class="pl-s"&gt;        "-W=node_modules",&lt;/span&gt;
&lt;span class="pl-s"&gt;        "--node-modules-dir=auto",&lt;/span&gt;
&lt;span class="pl-s"&gt;        "jsr:@pydantic/mcp-run-python",&lt;/span&gt;
&lt;span class="pl-s"&gt;        "stdio",&lt;/span&gt;
&lt;span class="pl-s"&gt;    ],&lt;/span&gt;
&lt;span class="pl-s"&gt;)&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;agent = Agent( &lt;/span&gt;
&lt;span class="pl-s"&gt;    OpenAIModel(                          &lt;/span&gt;
&lt;span class="pl-s"&gt;        model_name="mistral-small3.1:latest",&lt;/span&gt;
&lt;span class="pl-s"&gt;        provider=OpenAIProvider(base_url="http://localhost:11434/v1"),                &lt;/span&gt;
&lt;span class="pl-s"&gt;    ),            &lt;/span&gt;
&lt;span class="pl-s"&gt;    mcp_servers=[server],&lt;/span&gt;
&lt;span class="pl-s"&gt;)&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;async def main():&lt;/span&gt;
&lt;span class="pl-s"&gt;    with capture_run_messages() as messages:&lt;/span&gt;
&lt;span class="pl-s"&gt;        async with agent.run_mcp_servers():&lt;/span&gt;
&lt;span class="pl-s"&gt;            result = await agent.run("How many days between 2000-01-01 and 2025-03-18? Use your python tool.")&lt;/span&gt;
&lt;span class="pl-s"&gt;    pprint(messages)&lt;/span&gt;
&lt;span class="pl-s"&gt;    print(result.output)&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;asyncio.run(main())&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;

&lt;p&gt;Here's &lt;a href="https://gist.github.com/simonw/e444a81440bda2f37b0fef205780074a"&gt;the full output&lt;/a&gt; including the debug logs.

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=43691230"&gt;Hacker News&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sandboxing"&gt;sandboxing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/deno"&gt;deno&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude"&gt;claude&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/mistral"&gt;mistral&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-tool-use"&gt;llm-tool-use&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/uv"&gt;uv&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ollama"&gt;ollama&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pydantic"&gt;pydantic&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/model-context-protocol"&gt;model-context-protocol&lt;/a&gt;&lt;/p&gt;



</summary><category term="python"/><category term="sandboxing"/><category term="ai"/><category term="deno"/><category term="generative-ai"/><category term="local-llms"/><category term="llms"/><category term="claude"/><category term="mistral"/><category term="llm-tool-use"/><category term="uv"/><category term="ollama"/><category term="pydantic"/><category term="model-context-protocol"/></entry><entry><title>llm-docsmith</title><link href="https://simonwillison.net/2025/Apr/10/llm-docsmith/#atom-tag" rel="alternate"/><published>2025-04-10T18:09:18+00:00</published><updated>2025-04-10T18:09:18+00:00</updated><id>https://simonwillison.net/2025/Apr/10/llm-docsmith/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://mathpn.com/posts/llm-docsmith/"&gt;llm-docsmith&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Matheus Pedroni released this neat plugin for LLM for adding docstrings to existing Python code. You can run it like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm install llm-docsmith
llm docsmith ./scripts/main.py -o
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;-o&lt;/code&gt; option previews the changes that will be made - without &lt;code&gt;-o&lt;/code&gt; it edits the files directly.&lt;/p&gt;
&lt;p&gt;It also accepts a &lt;code&gt;-m claude-3.7-sonnet&lt;/code&gt; option to use a model other than the default (GPT-4o mini).&lt;/p&gt;
&lt;p&gt;The implementation uses the Python &lt;a href="https://pypi.org/project/libcst/"&gt;libcst&lt;/a&gt; "Concrete Syntax Tree" package to manipulate the code, which means there's no chance of it making edits to anything other than the docstrings.&lt;/p&gt;
&lt;p&gt;Here's &lt;a href="https://github.com/mathpn/llm-docsmith/blob/v0.1/docsmith.py#L10-L30"&gt;the full system prompt&lt;/a&gt; it uses.&lt;/p&gt;
&lt;p&gt;One neat trick is at the end of the system prompt it says:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;You will receive a JSON template. Fill the slots marked with &amp;lt;SLOT&amp;gt; with the appropriate description. Return as JSON.&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;That template is actually JSON generated using these Pydantic classes:&lt;/p&gt;
&lt;pre&gt;&lt;span class="pl-k"&gt;class&lt;/span&gt; &lt;span class="pl-v"&gt;Argument&lt;/span&gt;(&lt;span class="pl-v"&gt;BaseModel&lt;/span&gt;):
    &lt;span class="pl-s1"&gt;name&lt;/span&gt;: &lt;span class="pl-smi"&gt;str&lt;/span&gt;
    &lt;span class="pl-s1"&gt;description&lt;/span&gt;: &lt;span class="pl-smi"&gt;str&lt;/span&gt;
    &lt;span class="pl-s1"&gt;annotation&lt;/span&gt;: &lt;span class="pl-s1"&gt;str&lt;/span&gt; &lt;span class="pl-c1"&gt;|&lt;/span&gt; &lt;span class="pl-c1"&gt;None&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-c1"&gt;None&lt;/span&gt;
    &lt;span class="pl-s1"&gt;default&lt;/span&gt;: &lt;span class="pl-s1"&gt;str&lt;/span&gt; &lt;span class="pl-c1"&gt;|&lt;/span&gt; &lt;span class="pl-c1"&gt;None&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-c1"&gt;None&lt;/span&gt;

&lt;span class="pl-k"&gt;class&lt;/span&gt; &lt;span class="pl-v"&gt;Return&lt;/span&gt;(&lt;span class="pl-v"&gt;BaseModel&lt;/span&gt;):
    &lt;span class="pl-s1"&gt;description&lt;/span&gt;: &lt;span class="pl-smi"&gt;str&lt;/span&gt;
    &lt;span class="pl-s1"&gt;annotation&lt;/span&gt;: &lt;span class="pl-s1"&gt;str&lt;/span&gt; &lt;span class="pl-c1"&gt;|&lt;/span&gt; &lt;span class="pl-c1"&gt;None&lt;/span&gt;

&lt;span class="pl-k"&gt;class&lt;/span&gt; &lt;span class="pl-v"&gt;Docstring&lt;/span&gt;(&lt;span class="pl-v"&gt;BaseModel&lt;/span&gt;):
    &lt;span class="pl-s1"&gt;node_type&lt;/span&gt;: &lt;span class="pl-v"&gt;Literal&lt;/span&gt;[&lt;span class="pl-s"&gt;"class"&lt;/span&gt;, &lt;span class="pl-s"&gt;"function"&lt;/span&gt;]
    &lt;span class="pl-s1"&gt;name&lt;/span&gt;: &lt;span class="pl-smi"&gt;str&lt;/span&gt;
    &lt;span class="pl-s1"&gt;docstring&lt;/span&gt;: &lt;span class="pl-smi"&gt;str&lt;/span&gt;
    &lt;span class="pl-s1"&gt;args&lt;/span&gt;: &lt;span class="pl-s1"&gt;list&lt;/span&gt;[&lt;span class="pl-smi"&gt;Argument&lt;/span&gt;] &lt;span class="pl-c1"&gt;|&lt;/span&gt; &lt;span class="pl-c1"&gt;None&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-c1"&gt;None&lt;/span&gt;
    &lt;span class="pl-s1"&gt;ret&lt;/span&gt;: &lt;span class="pl-v"&gt;Return&lt;/span&gt; &lt;span class="pl-c1"&gt;|&lt;/span&gt; &lt;span class="pl-c1"&gt;None&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-c1"&gt;None&lt;/span&gt;

&lt;span class="pl-k"&gt;class&lt;/span&gt; &lt;span class="pl-v"&gt;Documentation&lt;/span&gt;(&lt;span class="pl-v"&gt;BaseModel&lt;/span&gt;):
    &lt;span class="pl-s1"&gt;entries&lt;/span&gt;: &lt;span class="pl-s1"&gt;list&lt;/span&gt;[&lt;span class="pl-smi"&gt;Docstring&lt;/span&gt;]&lt;/pre&gt;

&lt;p&gt;The code adds &lt;code&gt;&amp;lt;SLOT&amp;gt;&lt;/code&gt; notes to that in various places, so the template included in the prompt ends up looking like this:&lt;/p&gt;
&lt;pre&gt;{
  &lt;span class="pl-ent"&gt;"entries"&lt;/span&gt;: [
    {
      &lt;span class="pl-ent"&gt;"node_type"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;function&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
      &lt;span class="pl-ent"&gt;"name"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;create_docstring_node&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
      &lt;span class="pl-ent"&gt;"docstring"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;&amp;lt;SLOT&amp;gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
      &lt;span class="pl-ent"&gt;"args"&lt;/span&gt;: [
        {
          &lt;span class="pl-ent"&gt;"name"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;docstring_text&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
          &lt;span class="pl-ent"&gt;"description"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;&amp;lt;SLOT&amp;gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
          &lt;span class="pl-ent"&gt;"annotation"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;str&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
          &lt;span class="pl-ent"&gt;"default"&lt;/span&gt;: &lt;span class="pl-c1"&gt;null&lt;/span&gt;
        },
        {
          &lt;span class="pl-ent"&gt;"name"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;indent&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
          &lt;span class="pl-ent"&gt;"description"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;&amp;lt;SLOT&amp;gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
          &lt;span class="pl-ent"&gt;"annotation"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;str&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
          &lt;span class="pl-ent"&gt;"default"&lt;/span&gt;: &lt;span class="pl-c1"&gt;null&lt;/span&gt;
        }
      ],
      &lt;span class="pl-ent"&gt;"ret"&lt;/span&gt;: {
        &lt;span class="pl-ent"&gt;"description"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;&amp;lt;SLOT&amp;gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
        &lt;span class="pl-ent"&gt;"annotation"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;cst.BaseStatement&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;
      }
    }
  ]
}&lt;/pre&gt;
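&lt;p&gt;The mechanics of the trick are easy to reproduce. Here's a stdlib-only sketch (the &lt;code&gt;fill_slots&lt;/code&gt; helper is my own illustration, not code from the plugin) of how a slot template like the one above gets filled once the model returns its JSON:&lt;/p&gt;

```python
import json

# Hypothetical illustration of the <SLOT> template trick - not code
# from llm-docsmith itself. The model is asked to return the same
# JSON with each "<SLOT>" replaced; here we fill the slots locally.
template = {
    "entries": [
        {
            "node_type": "function",
            "name": "create_docstring_node",
            "docstring": "<SLOT>",
            "args": [{"name": "docstring_text", "description": "<SLOT>"}],
        }
    ]
}

def fill_slots(node, answers):
    """Recursively replace "<SLOT>" markers with queued answers."""
    if isinstance(node, dict):
        return {k: fill_slots(v, answers) for k, v in node.items()}
    if isinstance(node, list):
        return [fill_slots(v, answers) for v in node]
    if node == "<SLOT>":
        return answers.pop(0)
    return node

filled = fill_slots(template, [
    "Create a new docstring node.",
    "The text of the docstring.",
])
print(json.dumps(filled, indent=2))
```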

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/pnmath/status/1909386592944292018"&gt;@pnmath&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/plugins"&gt;plugins&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/prompt-engineering"&gt;prompt-engineering&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pydantic"&gt;pydantic&lt;/a&gt;&lt;/p&gt;



</summary><category term="plugins"/><category term="python"/><category term="ai"/><category term="prompt-engineering"/><category term="generative-ai"/><category term="llm"/><category term="pydantic"/></entry><entry><title>Pydantic Evals</title><link href="https://simonwillison.net/2025/Apr/1/pydantic-evals/#atom-tag" rel="alternate"/><published>2025-04-01T04:43:56+00:00</published><updated>2025-04-01T04:43:56+00:00</updated><id>https://simonwillison.net/2025/Apr/1/pydantic-evals/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://ai.pydantic.dev/evals/"&gt;Pydantic Evals&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Brand new package from David Montague and the Pydantic AI team which directly tackles what I consider to be the single hardest problem in AI engineering: building evals to determine if your LLM-based system is working correctly and getting better over time.&lt;/p&gt;
&lt;p&gt;The feature is described as "in beta" and comes with this very realistic warning:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Unlike unit tests, evals are an emerging art/science; anyone who claims to know for sure exactly how your evals should be defined can safely be ignored.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This code example from their documentation illustrates the relationship between the two key nouns - Cases and Datasets:&lt;/p&gt;
&lt;pre&gt;&lt;span class="pl-k"&gt;from&lt;/span&gt; &lt;span class="pl-s1"&gt;pydantic_evals&lt;/span&gt; &lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-v"&gt;Case&lt;/span&gt;, &lt;span class="pl-v"&gt;Dataset&lt;/span&gt;

&lt;span class="pl-s1"&gt;case1&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-en"&gt;Case&lt;/span&gt;(
    &lt;span class="pl-s1"&gt;name&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-s"&gt;"simple_case"&lt;/span&gt;,
    &lt;span class="pl-s1"&gt;inputs&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-s"&gt;"What is the capital of France?"&lt;/span&gt;,
    &lt;span class="pl-s1"&gt;expected_output&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-s"&gt;"Paris"&lt;/span&gt;,
    &lt;span class="pl-s1"&gt;metadata&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;{&lt;span class="pl-s"&gt;"difficulty"&lt;/span&gt;: &lt;span class="pl-s"&gt;"easy"&lt;/span&gt;},
)

&lt;span class="pl-s1"&gt;dataset&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-en"&gt;Dataset&lt;/span&gt;(&lt;span class="pl-s1"&gt;cases&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;[&lt;span class="pl-s1"&gt;case1&lt;/span&gt;])&lt;/pre&gt;

&lt;p&gt;The library also supports custom evaluators, including LLM-as-a-judge:&lt;/p&gt;
&lt;pre&gt;&lt;span class="pl-en"&gt;Case&lt;/span&gt;(
    &lt;span class="pl-s1"&gt;name&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-s"&gt;"vegetarian_recipe"&lt;/span&gt;,
    &lt;span class="pl-s1"&gt;inputs&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-en"&gt;CustomerOrder&lt;/span&gt;(
        &lt;span class="pl-s1"&gt;dish_name&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-s"&gt;"Spaghetti Bolognese"&lt;/span&gt;, &lt;span class="pl-s1"&gt;dietary_restriction&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-s"&gt;"vegetarian"&lt;/span&gt;
    ),
    &lt;span class="pl-s1"&gt;expected_output&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-c1"&gt;None&lt;/span&gt;,
    &lt;span class="pl-s1"&gt;metadata&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;{&lt;span class="pl-s"&gt;"focus"&lt;/span&gt;: &lt;span class="pl-s"&gt;"vegetarian"&lt;/span&gt;},
    &lt;span class="pl-s1"&gt;evaluators&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;(
        &lt;span class="pl-en"&gt;LLMJudge&lt;/span&gt;(
            &lt;span class="pl-s1"&gt;rubric&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-s"&gt;"Recipe should not contain meat or animal products"&lt;/span&gt;,
        ),
    ),
)&lt;/pre&gt;

&lt;p&gt;Cases and datasets can also be serialized to YAML.&lt;/p&gt;
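&lt;p&gt;To make the pattern concrete, here's a stdlib-only sketch of the underlying idea - my own illustration using dataclasses, not the &lt;code&gt;pydantic_evals&lt;/code&gt; API: each case pairs inputs with an expected output, and the dataset scores a task function across every case:&lt;/p&gt;

```python
from dataclasses import dataclass, field

# Stdlib-only sketch of the Case/Dataset idea - an illustration,
# not the pydantic_evals API.
@dataclass
class Case:
    name: str
    inputs: str
    expected_output: str
    metadata: dict = field(default_factory=dict)

@dataclass
class Dataset:
    cases: list

    def evaluate(self, task):
        # Run the task under test against every case and record
        # whether its output matched the expectation.
        return {c.name: task(c.inputs) == c.expected_output for c in self.cases}

dataset = Dataset(cases=[
    Case("simple_case", "What is the capital of France?", "Paris"),
])
results = dataset.evaluate(lambda question: "Paris" if "France" in question else "?")
```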
&lt;p&gt;My first impressions are that this looks like a solid implementation of a sensible design. I'm looking forward to trying it out against a real project.&lt;/p&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/samuel_colvin/status/1906841604377211375"&gt;@samuel_colvin&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/evals"&gt;evals&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pydantic"&gt;pydantic&lt;/a&gt;&lt;/p&gt;



</summary><category term="python"/><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="evals"/><category term="pydantic"/></entry><entry><title>googleapis/python-genai</title><link href="https://simonwillison.net/2024/Dec/12/python-genai/#atom-tag" rel="alternate"/><published>2024-12-12T16:21:46+00:00</published><updated>2024-12-12T16:21:46+00:00</updated><id>https://simonwillison.net/2024/Dec/12/python-genai/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/googleapis/python-genai"&gt;googleapis/python-genai&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Google released this brand new Python library for accessing their generative AI models yesterday, offering an alternative to their existing &lt;a href="https://github.com/google-gemini/generative-ai-python"&gt;generative-ai-python&lt;/a&gt; library.&lt;/p&gt;
&lt;p&gt;The API design looks very solid to me, and it includes both sync and async implementations. Here's an async streaming response:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;async for response in client.aio.models.generate_content_stream(
    model='gemini-2.0-flash-exp',
    contents='Tell me a story in 300 words.'
):
    print(response.text)
&lt;/code&gt;&lt;/pre&gt;
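&lt;p&gt;That async iteration pattern is worth internalizing, since most streaming LLM APIs now look like this. Here's a self-contained sketch with a fake stream standing in for the Gemini client - &lt;code&gt;fake_stream&lt;/code&gt; is my own stand-in, not part of the library:&lt;/p&gt;

```python
import asyncio

# fake_stream stands in for client.aio.models.generate_content_stream:
# an async generator yielding chunks, which is the shape most
# streaming LLM client libraries expose.
async def fake_stream(text, chunk_size=8):
    for i in range(0, len(text), chunk_size):
        await asyncio.sleep(0)  # yield control, as a network await would
        yield text[i:i + chunk_size]

async def main():
    chunks = []
    async for chunk in fake_stream("Tell me a story in 300 words."):
        chunks.append(chunk)  # display or process each chunk as it arrives
    return "".join(chunks)

result = asyncio.run(main())
print(result)
```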
&lt;p&gt;It also includes Pydantic-based output schema support and some nice syntactic sugar for defining tools using Python functions.&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/async"&gt;async&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/google"&gt;google&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gemini"&gt;gemini&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-tool-use"&gt;llm-tool-use&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pydantic"&gt;pydantic&lt;/a&gt;&lt;/p&gt;



</summary><category term="async"/><category term="google"/><category term="python"/><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="gemini"/><category term="llm-tool-use"/><category term="pydantic"/></entry><entry><title>PydanticAI</title><link href="https://simonwillison.net/2024/Dec/2/pydanticai/#atom-tag" rel="alternate"/><published>2024-12-02T21:08:50+00:00</published><updated>2024-12-02T21:08:50+00:00</updated><id>https://simonwillison.net/2024/Dec/2/pydanticai/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://ai.pydantic.dev/"&gt;PydanticAI&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
New project from Pydantic, which they describe as an "Agent Framework / shim to use Pydantic with LLMs".&lt;/p&gt;
&lt;p&gt;I asked &lt;a href="https://twitter.com/simonw/status/1863567881553977819"&gt;which agent definition they are using&lt;/a&gt; and it's the "system prompt with bundled tools" one. To their credit, they explain that &lt;a href="https://ai.pydantic.dev/agents/"&gt;in their documentation&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The &lt;a href="https://ai.pydantic.dev/api/agent/"&gt;Agent&lt;/a&gt; has full API documentation, but conceptually you can think of an agent as a container for:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A &lt;a href="https://ai.pydantic.dev/agents/#system-prompts"&gt;system prompt&lt;/a&gt; — a set of instructions for the LLM written by the developer&lt;/li&gt;
&lt;li&gt;One or more &lt;a href="https://ai.pydantic.dev/agents/#function-tools"&gt;retrieval tools&lt;/a&gt; — functions that the LLM may call to get information while generating a response&lt;/li&gt;
&lt;li&gt;An optional structured &lt;a href="https://ai.pydantic.dev/results/"&gt;result type&lt;/a&gt; — the structured datatype the LLM must return at the end of a run&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
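&lt;p&gt;That "container" framing is compact enough to sketch in a few lines of plain Python. This is my own illustration of the concept, not the PydanticAI API:&lt;/p&gt;

```python
from dataclasses import dataclass, field
from typing import Callable

# Illustration of the "agent as container" framing quoted above -
# a system prompt, some callable tools, and a result type. This is
# a concept sketch, not the PydanticAI API.
@dataclass
class Agent:
    system_prompt: str
    tools: dict[str, Callable] = field(default_factory=dict)
    result_type: type = str

    def call_tool(self, name: str, *args):
        # In a real agent the LLM decides which tool to call;
        # here we invoke one directly for illustration.
        return self.tools[name](*args)

agent = Agent(
    system_prompt="You are a helpful assistant.",
    tools={"add": lambda a, b: a + b},
)
```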
&lt;p&gt;Given how many other existing tools already lean on Pydantic to help define JSON schemas for talking to LLMs this is an interesting complementary direction for Pydantic to take.&lt;/p&gt;
&lt;p&gt;There's some overlap here with my own &lt;a href="https://llm.datasette.io/"&gt;LLM&lt;/a&gt; project, which I still hope to add a function calling / tools abstraction to in the future.&lt;/p&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/pydantic/status/1863538947059544218"&gt;@pydantic&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-tool-use"&gt;llm-tool-use&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-agents"&gt;ai-agents&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pydantic"&gt;pydantic&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/agent-definitions"&gt;agent-definitions&lt;/a&gt;&lt;/p&gt;



</summary><category term="python"/><category term="generative-ai"/><category term="llms"/><category term="llm"/><category term="llm-tool-use"/><category term="ai-agents"/><category term="pydantic"/><category term="agent-definitions"/></entry><entry><title>Themes from DjangoCon US 2024</title><link href="https://simonwillison.net/2024/Sep/27/themes-from-djangocon-us-2024/#atom-tag" rel="alternate"/><published>2024-09-27T23:36:02+00:00</published><updated>2024-09-27T23:36:02+00:00</updated><id>https://simonwillison.net/2024/Sep/27/themes-from-djangocon-us-2024/#atom-tag</id><summary type="html">
    &lt;p&gt;I just arrived home from a trip to Durham, North Carolina for DjangoCon US 2024. I’ve already written &lt;a href="https://simonwillison.net/2024/Sep/25/djp-a-plugin-system-for-django/"&gt;about my talk where I announced a new plugin system for Django&lt;/a&gt;; here are my notes on some of the other themes that resonated with me during the conference.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2024/Sep/27/themes-from-djangocon-us-2014/#growing-the-django-software-foundation-dsf-"&gt;Growing the Django Software Foundation (DSF)&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2024/Sep/27/themes-from-djangocon-us-2014/#could-we-fund-a-django-lts-accessibility-audit-"&gt;Could we fund a Django LTS accessibility audit?&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2024/Sep/27/themes-from-djangocon-us-2014/#django-fellows-continue-to-provide-outstanding-value"&gt;Django fellows continue to provide outstanding value&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2024/Sep/27/themes-from-djangocon-us-2014/#django-needs-feature-champions"&gt;Django needs feature champions&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2024/Sep/27/themes-from-djangocon-us-2014/#htmx-fits-django-really-well"&gt;htmx fits Django really well&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2024/Sep/27/themes-from-djangocon-us-2014/#django-ninja-has-positive-buzz"&gt;Django Ninja has positive buzz&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2024/Sep/27/themes-from-djangocon-us-2014/#valkey-as-a-last-minute-sponsor"&gt;Valkey as a last-minute sponsor&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2024/Sep/27/themes-from-djangocon-us-2014/#durham-has-a-world-class-collection-of-tubas"&gt;Durham has a world-class collection of tubas&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h4 id="growing-the-django-software-foundation-dsf-"&gt;Growing the Django Software Foundation (DSF)&lt;/h4&gt;
&lt;p&gt;Jacob Kaplan-Moss gave &lt;a href="https://2024.djangocon.us/talks/if-we-had-1000000-what-could-the-dsf-do-with-4x-its-budget/"&gt;my favorite talk&lt;/a&gt; of the conference, asking what the Django Software Foundation could do if it quadrupled its annual income from $250,000 to $1 million dollars, and then mapping out a convincing path to get there.&lt;/p&gt;
&lt;p&gt;I really liked this diagram Jacob provided summarizing the foundation’s current income and expenditures. It’s pretty cool that $90,000 of annual income comes from individual donors - over a third of the total, with corporate donors providing the remaining $160,000.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2024/dsf-diagram.jpg" alt="Financial breakdown diagram with the following numbers:  PLATINUM &amp;amp; GOLD: $125,000 CORPORATE DONORS: $160,000 BUDGET: $255,000 SILVER &amp;amp; BELOW: $35,000 INDIVIDUAL DONORS: $90,000  Spending:  WAGES (FELLOWS): $200,000 GRANTS: $35,000 OTHER: $5,000 FEES/HOSTING: $10,000 SURPLUS: $10,000​​​​​​​​​​​​​​​​" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;Top priority would be hiring an Executive Director for the foundation, which is currently led entirely by an elected, volunteer board. I’ve seen how useful a professional ED is from my own experiences &lt;a href="https://simonwillison.net/2024/Sep/18/board-of-the-python-software-foundation/"&gt;on the Python Software Foundation board&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Having someone working full time on the foundation outside of our current fellows - who have more than enough on their plates already - would enable the foundation to both take on more ambitious goals and also raise more money with which to tackle them.&lt;/p&gt;
&lt;p&gt;A line that Jacob used repeatedly in his talk about funding the foundation was this: if you or your organization &lt;em&gt;wouldn’t&lt;/em&gt; want to sponsor Django, he’d love to know why that is - understanding those blockers right now is almost as valuable as receiving actual cash. You can reach out to him at &lt;code&gt;jacob at djangoproject.com&lt;/code&gt;.&lt;/p&gt;
&lt;h4 id="could-we-fund-a-django-lts-accessibility-audit-"&gt;Could we fund a Django LTS accessibility audit?&lt;/h4&gt;
&lt;p&gt;Django fellows and the &lt;a href="https://github.com/django/deps/blob/main/final/0011-accessibility-team.rst"&gt;Django Accessibility Team&lt;/a&gt; have been focusing significant effort on the accessibility of the Django admin. I found this very inspiring, and in combination with the talk of more funding for the foundation it put an idea in my head: what if every Django LTS release (once every two years) was backed by a full, professional accessibility audit, run by an agency staffed with developers who use screen readers?&lt;/p&gt;
&lt;p&gt;Imagine how much impact it would have if the default Django admin interface had excellent, documented accessibility out of the box. It could improve things for hundreds of thousands of users, and set an excellent precedent for projects (and foundations) in the wider open source community.&lt;/p&gt;
&lt;p&gt;This also feels to me like something that should be inherently attractive to sponsors. A lot of agencies use Django for government work, where accessibility is a requirement with teeth. Would one of those agencies like to be the “accessibility sponsor” for a major Django release?&lt;/p&gt;
&lt;h4 id="django-fellows-continue-to-provide-outstanding-value"&gt;Django fellows continue to provide outstanding value&lt;/h4&gt;
&lt;p&gt;The &lt;a href="https://www.djangoproject.com/fundraising/#fellowship-program"&gt;DSF’s fellowship program&lt;/a&gt; remains one of the most impactful initiatives I’ve seen anywhere for ensuring the ongoing sustainability of a community-driven open source project.&lt;/p&gt;
&lt;p&gt;Both of the current fellows, Natalia Bidart and Sarah Boyce, were in attendance and gave talks. It was great getting to meet them in person.&lt;/p&gt;
&lt;p&gt;If you’re not familiar with the program, the fellows are contractors who are paid by the DSF to keep the Django project ticking over - handling many of the somewhat less glamorous tasks of responsible open source maintenance such as ticket triage, release management, security fixes and code review.&lt;/p&gt;
&lt;p&gt;The fellows program is in its tenth year, and is a key reason that Django continues to release new versions &lt;a href="https://www.djangoproject.com/download/#supported-versions"&gt;on a regular schedule&lt;/a&gt; despite having no single corporate parent with paid developers.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2024/django-roadmap.png" alt="Software release timeline: 4.2 LTS (April 2023), 5.0 (August 2024), 5.1 (2025), 5.2 LTS (2026), 6.0 (2027), 6.1 (2027), 6.2 LTS (2028), 7.0 (2029). LTS versions have extended support periods." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;Unsurprisingly there is always more work than fellow capacity, hence Jacob’s desire to further expand the existing program.&lt;/p&gt;
&lt;p&gt;The fellows program launched with a policy that fellows should not work on new feature development. I believe this was partly related to interpretation of IRS nonprofit guidelines which have since been reconsidered, and there is a growing consensus now that this policy should be dropped.&lt;/p&gt;
&lt;h4 id="django-needs-feature-champions"&gt;Django needs feature champions&lt;/h4&gt;
&lt;p&gt;Django has a well deserved reputation for stability, reliability and a dependable release process. It has less of a reputation for constantly turning out ground-breaking new features.&lt;/p&gt;
&lt;p&gt;Long-time Django contributors who I talked to all had a similar position on this: the challenge here is that big new features need dedicated champions to both lead design and development on them and to push them through to completion.&lt;/p&gt;
&lt;p&gt;The pool of community members who are both willing and able to take on these larger projects is currently too small.&lt;/p&gt;
&lt;p&gt;There are a number of ways we could address this - most notably through investing financial resources in sponsoring feature development. This has worked well for Django in the past - Django’s migrations work was funded by &lt;a href="https://www.kickstarter.com/projects/andrewgodwin/schema-migrations-for-django"&gt;a Kickstarter campaign&lt;/a&gt; back in 2013.&lt;/p&gt;
&lt;p&gt;The Django Software Foundation will shortly be announcing details of elections for both the DSF board and the Django Steering Council. These are extremely influential positions for people who want to help solve some of these larger problems.&lt;/p&gt;
&lt;h4 id="htmx-fits-django-really-well"&gt;htmx fits Django really well&lt;/h4&gt;
&lt;p&gt;&lt;a href="https://htmx.org/"&gt;htmx&lt;/a&gt; is an incredibly good fit for the uncodified Django community philosophy of building for the web. It came up in multiple talks. It feels like it may be a solution that the Django community has been seeking for years, as a very compelling alternative to writing everything in SPA JavaScript and using Django purely as a backend via something like Django REST Framework.&lt;/p&gt;
&lt;p&gt;I've been slightly resistant to embracing htmx myself purely because it's such a critical dependency and in the past I wasn't convinced of its staying power. It's now mature, stable and widely-enough used that I'm ready to consider it for my own long-term projects.&lt;/p&gt;
&lt;h4 id="django-ninja-has-positive-buzz"&gt;Django Ninja has positive buzz&lt;/h4&gt;
&lt;p&gt;I haven’t paid much attention to &lt;a href="https://django-ninja.dev/"&gt;Django Ninja&lt;/a&gt; but it had a lot of very positive buzz at the conference as well, as a tool for quickly building full-featured, performant API endpoints (thanks to Rust-backed &lt;a href="https://docs.pydantic.dev/"&gt;Pydantic&lt;/a&gt; for serialization) with &lt;a href="https://django-ninja.dev/#interactive-api-docs"&gt;interactive API docs&lt;/a&gt; powered by OpenAPI.&lt;/p&gt;
&lt;p&gt;I respect Django REST Framework a lot, but my personal programming style leans away from Class Based Views, which it uses quite a bit. Django Ninja looks like it might fit my function-view biases better.&lt;/p&gt;
&lt;p&gt;I wrote about Richard Terry’s excellent &lt;a href="https://github.com/radiac/nanodjango"&gt;nanodjango&lt;/a&gt; single-file Django application tool &lt;a href="https://simonwillison.net/2024/Sep/24/nanodjango/"&gt;the other day&lt;/a&gt; - Django Ninja comes baked into that project as well.&lt;/p&gt;
&lt;h4 id="valkey-as-a-last-minute-sponsor"&gt;Valkey as a last-minute sponsor&lt;/h4&gt;
&lt;p&gt;The three platinum sponsors for DjangoCon this year were &lt;a href="https://www.revsys.com/"&gt;REVSYS&lt;/a&gt;, &lt;a href="https://www.caktusgroup.com/"&gt;Caktus Group&lt;/a&gt; and &lt;a href="https://valkey.io/"&gt;Valkey&lt;/a&gt;. Valkey were a late and somewhat surprising addition to the sponsorship lineup.&lt;/p&gt;
&lt;p&gt;Valkey is the &lt;a href="https://www.linuxfoundation.org/press/linux-foundation-launches-open-source-valkey-community"&gt;Linux Foundation backed&lt;/a&gt; fork of Redis, created in response to Redis &lt;a href="https://redis.io/blog/redis-adopts-dual-source-available-licensing/"&gt;ditching their Open Source license&lt;/a&gt; (which I took quite personally, having contributed my own free effort to promoting and improving Redis in the past).&lt;/p&gt;
&lt;p&gt;Aside from expressing thanks to them, I usually don’t pay sponsors that much attention. For some reason this one hit differently - the fact that Valkey were ready to step in as a major sponsor despite being only a few months old has caused me to take that project a whole lot more seriously than I did before. I’ll certainly consider them next time I come across a Redis-shaped problem.&lt;/p&gt;
&lt;h4 id="durham-has-a-world-class-collection-of-tubas"&gt;Durham has a world-class collection of tubas&lt;/h4&gt;
&lt;p&gt;My favorite category of &lt;a href="https://www.niche-museums.com/"&gt;Niche Museum&lt;/a&gt; is one that's available by appointment only where the person who collected everything is available to show you around.&lt;/p&gt;
&lt;p&gt;I always check &lt;a href="https://www.atlasobscura.com/"&gt;Atlas Obscura&lt;/a&gt; any time I visit a new city, and this time I was delighted to learn about The Vincent and Ethel Simonetti Historic Tuba Collection!&lt;/p&gt;
&lt;p&gt;I promoted it in the DjangoCon US #outings Slack channel and got together a group of five conference attendees for a visit on Thursday, shortly before my flight.&lt;/p&gt;
&lt;p&gt;It was &lt;em&gt;peak&lt;/em&gt; Niche Museum. I’ve posted photos and notes over &lt;a href="https://www.niche-museums.com/112"&gt;on my Niche Museums&lt;/a&gt; website, the first new article there in quite a while.&lt;/p&gt;

&lt;p&gt;&lt;img alt="More than a dozen varied and beautiful tubas, each with a neat attached label." src="https://static.simonwillison.net/static/2024/tuba-collection-card.jpeg" /&gt;&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/accessibility"&gt;accessibility&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/conferences"&gt;conferences&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/django"&gt;django&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/djangocon"&gt;djangocon&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/jacob-kaplan-moss"&gt;jacob-kaplan-moss&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/redis"&gt;redis&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/dsf"&gt;dsf&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pydantic"&gt;pydantic&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/htmx"&gt;htmx&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="accessibility"/><category term="conferences"/><category term="django"/><category term="djangocon"/><category term="jacob-kaplan-moss"/><category term="python"/><category term="redis"/><category term="dsf"/><category term="pydantic"/><category term="htmx"/></entry><entry><title>Jiter</title><link href="https://simonwillison.net/2024/Sep/22/jiter/#atom-tag" rel="alternate"/><published>2024-09-22T20:03:07+00:00</published><updated>2024-09-22T20:03:07+00:00</updated><id>https://simonwillison.net/2024/Sep/22/jiter/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/pydantic/jiter/tree/main/crates/jiter-python"&gt;Jiter&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;One of the challenges in dealing with LLM streaming APIs is the need to parse partial JSON - until the stream has ended you won't have a complete valid JSON object, but you may want to display components of that JSON as they become available.&lt;/p&gt;
&lt;p&gt;I've solved this previously using the &lt;a href="https://pypi.org/project/ijson/"&gt;ijson&lt;/a&gt; streaming JSON library, see &lt;a href="https://til.simonwillison.net/json/ijson-stream"&gt;my previous TIL&lt;/a&gt;.&lt;/p&gt;
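&lt;p&gt;The underlying problem is easy to demonstrate with the standard library alone - &lt;code&gt;json.loads()&lt;/code&gt; is all-or-nothing on truncated input, which is why incremental parsers are needed at all:&lt;/p&gt;

```python
import json

# A truncated JSON document, as you might receive mid-stream from an LLM API
partial_json = b'{"name": "John", "age": 30, "city": "New Yor'

# The standard library parser refuses any truncated document outright
try:
    json.loads(partial_json)
except json.JSONDecodeError as e:
    print(f"stdlib refuses partial input: {e.msg}")
```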
&lt;p&gt;Today I found out about Jiter, a new option from the team behind Pydantic. It's written in Rust and extracted from &lt;a href="https://github.com/pydantic/pydantic-core"&gt;pydantic-core&lt;/a&gt;, so the Python wrapper for it can be installed using:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;pip install jiter
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;You can feed it an incomplete JSON bytes object and use &lt;code&gt;partial_mode="on"&lt;/code&gt; to parse the valid subset:&lt;/p&gt;
&lt;pre&gt;&lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-s1"&gt;jiter&lt;/span&gt;
&lt;span class="pl-s1"&gt;partial_json&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s"&gt;b'{"name": "John", "age": 30, "city": "New Yor'&lt;/span&gt;
&lt;span class="pl-s1"&gt;jiter&lt;/span&gt;.&lt;span class="pl-en"&gt;from_json&lt;/span&gt;(&lt;span class="pl-s1"&gt;partial_json&lt;/span&gt;, &lt;span class="pl-s1"&gt;partial_mode&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-s"&gt;"on"&lt;/span&gt;)
&lt;span class="pl-c"&gt;# {'name': 'John', 'age': 30}&lt;/span&gt;&lt;/pre&gt;

&lt;p&gt;Or use &lt;code&gt;partial_mode="trailing-strings"&lt;/code&gt; to include incomplete string fields too:&lt;/p&gt;
&lt;pre&gt;&lt;span class="pl-s1"&gt;jiter&lt;/span&gt;.&lt;span class="pl-en"&gt;from_json&lt;/span&gt;(&lt;span class="pl-s1"&gt;partial_json&lt;/span&gt;, &lt;span class="pl-s1"&gt;partial_mode&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-s"&gt;"trailing-strings"&lt;/span&gt;)
&lt;span class="pl-c"&gt;# {'name': 'John', 'age': 30, 'city': 'New Yor'}&lt;/span&gt;&lt;/pre&gt;

&lt;p&gt;The &lt;a href="https://github.com/pydantic/jiter/blob/ae5fc7d8548c90ad8762dfdf2ea6461776c2feb6/crates/jiter-python/README.md"&gt;current README&lt;/a&gt; was a little thin, so I submitted &lt;a href="https://github.com/pydantic/jiter/pull/143"&gt;a PR&lt;/a&gt; with some extra examples. I &lt;a href="https://gist.github.com/simonw/264d487db1a18f8585c2ca0c68e50d1e"&gt;got some help&lt;/a&gt; from &lt;code&gt;files-to-prompt&lt;/code&gt; and Claude 3.5 Sonnet:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;cd crates/jiter-python/ &amp;amp;&amp;amp; files-to-prompt -c README.md tests | llm -m claude-3.5-sonnet --system 'write a new README with comprehensive documentation'&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=41615404#41618393"&gt;jackmpcollins on Hacker News&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/json"&gt;json&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/rust"&gt;rust&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-programming"&gt;ai-assisted-programming&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pydantic"&gt;pydantic&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/files-to-prompt"&gt;files-to-prompt&lt;/a&gt;&lt;/p&gt;



</summary><category term="json"/><category term="python"/><category term="rust"/><category term="ai-assisted-programming"/><category term="pydantic"/><category term="files-to-prompt"/></entry><entry><title>Python Developers Survey 2023 Results</title><link href="https://simonwillison.net/2024/Sep/3/python-developers-survey-2023/#atom-tag" rel="alternate"/><published>2024-09-03T02:47:45+00:00</published><updated>2024-09-03T02:47:45+00:00</updated><id>https://simonwillison.net/2024/Sep/3/python-developers-survey-2023/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://lp.jetbrains.com/python-developers-survey-2023/"&gt;Python Developers Survey 2023 Results&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The seventh annual Python survey is out. Here are the things that caught my eye or that I found surprising:&lt;/p&gt;
&lt;p&gt;25% of survey respondents had been programming in Python for less than a year, and 33% had less than a year of professional experience.&lt;/p&gt;
&lt;p&gt;37% of Python developers reported contributing to open-source projects last year - a new question for the survey. This is delightfully high!&lt;/p&gt;
&lt;p&gt;6% of users are still using Python 2. The survey notes:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Almost half of Python 2 holdouts are under 21 years old and a third are students. Perhaps courses are still using Python 2?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;In web frameworks, Flask and Django are neck and neck at 33% each, but &lt;a href="https://fastapi.tiangolo.com/"&gt;FastAPI&lt;/a&gt; is a close third at 29%! &lt;a href="https://www.starlette.io/"&gt;Starlette&lt;/a&gt; is at 6%, but that's an under-count because it's the basis for FastAPI.&lt;/p&gt;
&lt;p&gt;The most popular library in "other framework and libraries" was BeautifulSoup with 31%, then Pillow 28%, then &lt;a href="https://github.com/opencv/opencv-python"&gt;OpenCV-Python&lt;/a&gt; at 22% (wow!) and Pydantic at 22%. Tkinter had 17%. These numbers are all a surprise to me.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://docs.pytest.org/en/stable/"&gt;pytest&lt;/a&gt; scores 52% for unit testing, &lt;code&gt;unittest&lt;/code&gt; from the standard library just 25%. I'm glad to see &lt;code&gt;pytest&lt;/code&gt; so widely used, it's my favourite testing tool across any programming language.&lt;/p&gt;
&lt;p&gt;The top cloud providers are AWS, then Google Cloud Platform, then Azure... but &lt;a href="https://www.pythonanywhere.com/"&gt;PythonAnywhere&lt;/a&gt; (11%) took fourth place just ahead of DigitalOcean (10%). And &lt;a href="https://www.alibabacloud.com/"&gt;Alibaba Cloud&lt;/a&gt; is a new entrant in sixth place (after Heroku) with 4%. Heroku's ending of its free plan dropped them from 14% in 2021 to 7% now.&lt;/p&gt;
&lt;p&gt;Linux and Windows are equal at 55%, macOS is at 29%. This was one of many multiple-choice questions that could add up to more than 100%.&lt;/p&gt;
&lt;p&gt;In databases, SQLite usage was trending down - 38% in 2021 to 34% for 2023, but still in second place behind PostgreSQL, stable at 43%.&lt;/p&gt;
&lt;p&gt;The survey incorporates quotes from different Python experts responding to the numbers; it's worth &lt;a href="https://lp.jetbrains.com/python-developers-survey-2023/"&gt;reading through the whole thing&lt;/a&gt;.&lt;/p&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://pyfound.blogspot.com/2024/08/python-developers-survey-2023-results.html"&gt;PSF news&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/open-source"&gt;open-source&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/postgresql"&gt;postgresql&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sqlite"&gt;sqlite&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/surveys"&gt;surveys&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pytest"&gt;pytest&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/psf"&gt;psf&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pydantic"&gt;pydantic&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/starlette"&gt;starlette&lt;/a&gt;&lt;/p&gt;



</summary><category term="open-source"/><category term="postgresql"/><category term="python"/><category term="sqlite"/><category term="surveys"/><category term="pytest"/><category term="psf"/><category term="pydantic"/><category term="starlette"/></entry><entry><title>OpenAI: Introducing Structured Outputs in the API</title><link href="https://simonwillison.net/2024/Aug/6/openai-structured-outputs/#atom-tag" rel="alternate"/><published>2024-08-06T18:32:25+00:00</published><updated>2024-08-06T18:32:25+00:00</updated><id>https://simonwillison.net/2024/Aug/6/openai-structured-outputs/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://openai.com/index/introducing-structured-outputs-in-the-api/"&gt;OpenAI: Introducing Structured Outputs in the API&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;OpenAI have offered structured outputs for a while now: you could specify &lt;code&gt;"response_format": {"type": "json_object"}&lt;/code&gt; to request a valid JSON object, or you could use the &lt;a href="https://platform.openai.com/docs/guides/function-calling"&gt;function calling&lt;/a&gt; mechanism to request responses that match a specific schema.&lt;/p&gt;
&lt;p&gt;Neither of these modes was guaranteed to return valid JSON! In my experience they usually did, but there was always a chance that something could go wrong and the returned JSON might not match the schema, or might not even be valid JSON at all.&lt;/p&gt;
&lt;p&gt;Outside of OpenAI techniques like &lt;a href="https://github.com/1rgs/jsonformer"&gt;jsonformer&lt;/a&gt; and &lt;a href="https://til.simonwillison.net/llms/llama-cpp-python-grammars"&gt;llama.cpp grammars&lt;/a&gt; could provide those guarantees against open weights models, by interacting directly with the next-token logic to ensure that only tokens that matched the required schema were selected.&lt;/p&gt;
&lt;p&gt;OpenAI credit that work in this announcement, so they're presumably using the same trick. They've provided two new ways to guarantee valid outputs. The first is a new &lt;code&gt;"strict": true&lt;/code&gt; option for function definitions. The second is a new feature: a &lt;code&gt;"type": "json_schema"&lt;/code&gt; option for the &lt;code&gt;"response_format"&lt;/code&gt; field which lets you then pass a JSON schema (and another &lt;code&gt;"strict": true&lt;/code&gt; flag) to specify your required output.&lt;/p&gt;
&lt;p&gt;I've been using the existing &lt;code&gt;"tools"&lt;/code&gt; mechanism for exactly this already in my &lt;a href="https://github.com/datasette/datasette-extract"&gt;datasette-extract&lt;/a&gt; plugin - defining a function that I have no intention of executing just to get structured data out of the API in the shape that I want.&lt;/p&gt;
&lt;p&gt;Why isn't &lt;code&gt;"strict": true&lt;/code&gt; by default? Here's OpenAI's &lt;a href="https://news.ycombinator.com/item?id=41173223#41174306"&gt;Ted Sanders&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;We didn't cover this in the announcement post, but there are a few reasons:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The first request with each JSON schema will be slow, as we need to preprocess the JSON schema into a context-free grammar. If you don't want that latency hit (e.g., you're prototyping, or have a use case that uses variable one-off schemas), then you might prefer "strict": false&lt;/li&gt;
&lt;li&gt;You might have a schema that isn't covered by our subset of JSON schema. (To keep performance fast, we don't support some more complex/long-tail features.)&lt;/li&gt;
&lt;li&gt;In JSON mode and Structured Outputs, failures are rarer but more catastrophic. If the model gets too confused, it can get stuck in loops where it just prints technically valid output forever without ever closing the object. In these cases, you can end up waiting a minute for the request to hit the max_token limit, and you also have to pay for all those useless tokens. So if you have a really tricky schema, and you'd rather get frequent failures back quickly instead of infrequent failures back slowly, you might also want &lt;code&gt;"strict": false&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;But in 99% of cases, you'll want &lt;code&gt;"strict": true&lt;/code&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;More &lt;a href="https://news.ycombinator.com/item?id=41173223#41174213"&gt;from Ted&lt;/a&gt; on how the new mode differs from function calling:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Under the hood, it's quite similar to function calling. A few differences:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Structured Outputs is a bit more straightforward. e.g., you don't have to pretend you're writing a function where the second arg could be a two-page report to the user, and then pretend the "function" was called successfully by returning &lt;code&gt;{"success": true}&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Having two interfaces lets us teach the model different default behaviors and styles, depending on which you use&lt;/li&gt;
&lt;li&gt;Another difference is that our current implementation of function calling can return both a text reply plus a function call (e.g., "Let me look up that flight for you"), whereas Structured Outputs will only return the JSON&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;The official &lt;code&gt;openai-python&lt;/code&gt; library also &lt;a href="https://github.com/openai/openai-python/commit/bf1ca86cf392eb0ffed1e146937c5d73d8a568f0"&gt;added structured output support&lt;/a&gt; this morning, based on Pydantic and looking very similar to the &lt;a href="https://python.useinstructor.com/"&gt;Instructor library&lt;/a&gt; (also credited as providing inspiration in their announcement).&lt;/p&gt;
&lt;p&gt;There are some key limitations on the new structured output mode, &lt;a href="https://platform.openai.com/docs/guides/structured-outputs/supported-schemas"&gt;described in the documentation&lt;/a&gt;. Only a subset of JSON schema is supported, and most notably the &lt;code&gt;"additionalProperties": false&lt;/code&gt; property must be set on all objects and all object keys must be listed in &lt;code&gt;"required"&lt;/code&gt; - no optional keys are allowed.&lt;/p&gt;
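&lt;p&gt;As a sketch (with an invented &lt;code&gt;person&lt;/code&gt; schema, here purely for illustration), a strict-mode-compatible &lt;code&gt;response_format&lt;/code&gt; payload looks something like this:&lt;/p&gt;

```python
import json

# A hypothetical schema shaped to satisfy the documented strict-mode rules:
# every object sets "additionalProperties": false and lists every key in "required".
schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer"},
    },
    "required": ["name", "age"],     # no optional keys allowed
    "additionalProperties": False,   # mandatory on every object
}

# The response_format payload described in the announcement
response_format = {
    "type": "json_schema",
    "json_schema": {"name": "person", "strict": True, "schema": schema},
}
print(json.dumps(response_format, indent=2))
```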
&lt;p&gt;Another interesting new feature: if the model denies a request on safety grounds a new &lt;a href="https://platform.openai.com/docs/guides/structured-outputs/refusals"&gt;refusal message&lt;/a&gt; will be returned:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;{
  "message": {
    "role": "assistant",
    "refusal": "I'm sorry, I cannot assist with that request."
  }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Finally, tucked away at the bottom of this announcement is a significant new model release with a major price cut:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;By switching to the new &lt;code&gt;gpt-4o-2024-08-06&lt;/code&gt;, developers save 50% on inputs ($2.50/1M input tokens) and 33% on outputs ($10.00/1M output tokens) compared to &lt;code&gt;gpt-4o-2024-05-13&lt;/code&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This new model &lt;a href="https://platform.openai.com/docs/models/gpt-4o"&gt;also supports&lt;/a&gt; 16,384 output tokens, up from 4,096.&lt;/p&gt;
&lt;p&gt;The price change is particularly notable because &lt;a href="https://simonwillison.net/2024/Jul/18/gpt-4o-mini/"&gt;GPT-4o-mini&lt;/a&gt;, the much cheaper alternative to GPT-4o, prices image inputs at the &lt;em&gt;same price&lt;/em&gt; as GPT-4o. This new model cuts that by half (&lt;a href="https://news.ycombinator.com/item?id=41173223#41174929"&gt;confirmed here&lt;/a&gt;), making &lt;code&gt;gpt-4o-2024-08-06&lt;/code&gt; the new cheapest model from OpenAI for handling image inputs.&lt;/p&gt;
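&lt;p&gt;A quick back-of-envelope comparison (the 2024-05-13 prices of $5.00/$15.00 per million tokens are implied by the stated 50% and 33% savings):&lt;/p&gt;

```python
# Prices in dollars per million tokens; the 2024-05-13 figures are
# back-derived from the stated 50% input / 33% output savings.
PRICES = {
    "gpt-4o-2024-05-13": {"input": 5.00, "output": 15.00},
    "gpt-4o-2024-08-06": {"input": 2.50, "output": 10.00},
}

def cost(model, input_tokens, output_tokens):
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# Example workload: 100k input tokens and 10k output tokens
old = cost("gpt-4o-2024-05-13", 100_000, 10_000)
new = cost("gpt-4o-2024-08-06", 100_000, 10_000)
print(f"old: ${old:.2f}  new: ${new:.2f}")  # old: $0.65  new: $0.35
```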


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/json"&gt;json&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openai"&gt;openai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/structured-extraction"&gt;structured-extraction&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pydantic"&gt;pydantic&lt;/a&gt;&lt;/p&gt;



</summary><category term="json"/><category term="ai"/><category term="openai"/><category term="generative-ai"/><category term="llms"/><category term="structured-extraction"/><category term="pydantic"/></entry><entry><title>Quoting James Bennett</title><link href="https://simonwillison.net/2023/Apr/7/james-bennett/#atom-tag" rel="alternate"/><published>2023-04-07T02:19:12+00:00</published><updated>2023-04-07T02:19:12+00:00</updated><id>https://simonwillison.net/2023/Apr/7/james-bennett/#atom-tag</id><summary type="html">
    &lt;blockquote cite="https://lobste.rs/s/2beggz/different_uses_python_type_hints#c_bbbae5"&gt;&lt;p&gt;Several libraries let you declare objects with type-hinted members and automatically derive validation rules and serialization/deserialization from the type hints – Pydantic is the most popular, but alternatives like msgspec are out there too. There’s also a whole new generation of web frameworks like FastAPI and Starlite which use type hints at runtime to do not just input validation and serialization/deserialization but also things like dependency injection.&lt;/p&gt;
&lt;p&gt;Personally, I’ve seen more significant gains in productivity from those runtime usages of Python’s type hints than from any static ahead-of-time type checking, which mostly is only useful to me as documentation.&lt;/p&gt;&lt;/blockquote&gt;
&lt;p class="cite"&gt;&amp;mdash; &lt;a href="https://lobste.rs/s/2beggz/different_uses_python_type_hints#c_bbbae5"&gt;James Bennett&lt;/a&gt;&lt;/p&gt;

    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/james-bennett"&gt;james-bennett&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pydantic"&gt;pydantic&lt;/a&gt;&lt;/p&gt;



</summary><category term="james-bennett"/><category term="python"/><category term="pydantic"/></entry><entry><title>The different uses of Python type hints</title><link href="https://simonwillison.net/2023/Apr/7/the-different-uses-of-python-type-hints/#atom-tag" rel="alternate"/><published>2023-04-07T02:17:04+00:00</published><updated>2023-04-07T02:17:04+00:00</updated><id>https://simonwillison.net/2023/Apr/7/the-different-uses-of-python-type-hints/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://lukeplant.me.uk/blog/posts/the-different-uses-of-python-type-hints/"&gt;The different uses of Python type hints&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Luke Plant describes five different categories for how Python optional types are being used today: IDE assistants, type checking, runtime behavior changes via introspection (e.g. Pydantic), code documentation, compiler instructions (à la mypyc)—and a bonus sixth, dependency injection.&lt;/p&gt;
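&lt;p&gt;That third category - runtime introspection - is the one Pydantic depends on. A toy sketch of the idea using only the standard library:&lt;/p&gt;

```python
from dataclasses import dataclass
from typing import get_type_hints

@dataclass
class User:
    name: str
    age: int

def validate(cls, data):
    """Toy runtime validation driven by type hints, in the spirit of Pydantic."""
    hints = get_type_hints(cls)  # read the annotations back at runtime
    for field, expected in hints.items():
        if not isinstance(data[field], expected):
            raise TypeError(f"{field} must be {expected.__name__}")
    return cls(**data)

user = validate(User, {"name": "Alice", "age": 30})
print(user)
```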

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://lobste.rs/s/2beggz/different_uses_python_type_hints"&gt;lobste.rs&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/luke-plant"&gt;luke-plant&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pydantic"&gt;pydantic&lt;/a&gt;&lt;/p&gt;



</summary><category term="luke-plant"/><category term="python"/><category term="pydantic"/></entry><entry><title>SQLModel</title><link href="https://simonwillison.net/2021/Aug/24/sqlmodel/#atom-tag" rel="alternate"/><published>2021-08-24T23:16:42+00:00</published><updated>2021-08-24T23:16:42+00:00</updated><id>https://simonwillison.net/2021/Aug/24/sqlmodel/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/tiangolo/sqlmodel"&gt;SQLModel&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;A new project by FastAPI creator Sebastián Ramírez: SQLModel builds on top of both SQLAlchemy and Sebastián’s Pydantic validation library to provide a new ORM that’s designed around Python 3’s optional typing. The real brilliance here is that a SQLModel subclass is simultaneously a valid SQLAlchemy ORM model AND a valid Pydantic validation model, saving on duplicate code by allowing the same class to be used both for form/API validation and for interacting with the database.&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/orm"&gt;orm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sql"&gt;sql&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sqlalchemy"&gt;sqlalchemy&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pydantic"&gt;pydantic&lt;/a&gt;&lt;/p&gt;



</summary><category term="orm"/><category term="python"/><category term="sql"/><category term="sqlalchemy"/><category term="pydantic"/></entry></feed>