Simon Willison's Weblog: json

research-llm-apis 2026-04-04

2026-04-05T00:32:11+00:00

Release: research-llm-apis 2026-04-04

I'm working on a major change to my LLM Python library and CLI tool. LLM provides an abstraction layer over hundreds of different LLMs from dozens of different vendors thanks to its plugin system, and some of those vendors have grown new features over the past year which LLM's abstraction layer can't handle, such as server-side tool execution.

To help design that new abstraction layer I had Claude Code read through the Python client libraries for Anthropic, OpenAI, Gemini and Mistral and use those to help craft curl commands to access the raw JSON for both streaming and non-streaming modes across a range of different scenarios. Both the scripts and the captured outputs now live in this new repo.

Tags: apis, json, llms, llm

We Rewrote JSONata with AI in a Day, Saved $500K/Year

2026-03-27T00:35:01+00:00

We Rewrote JSONata with AI in a Day, Saved $500K/Year

Bit of a hyperbolic framing but this looks like another case study of vibe porting, this time spinning up a new custom Go implementation of the JSONata JSON expression language - similar in focus to jq, and heavily associated with the Node-RED platform.

As with other vibe-porting projects the key enabling factor was JSONata's existing test suite, which helped build the first working Go version in 7 hours and $400 of token spend.

The Reco team then used a shadow deployment for a week to run the new and old versions in parallel to confirm the new implementation exactly matched the behavior of the old one.

Tags: go, json, ai, generative-ai, llms, agentic-engineering, vibe-porting

SQLite Tags Benchmark: Comparing 5 Tagging Strategies

2026-03-20T02:57:00+00:00

Research: SQLite Tags Benchmark: Comparing 5 Tagging Strategies

I had Claude Code run a micro-benchmark comparing different approaches to implementing tagging in SQLite. Traditional many-to-many tables won, but FTS5 came a close second. Full table scans with LIKE queries performed better than I expected, but full table scans with JSON arrays and json_each() were much slower.

Tags: json, sqlite

Open Responses

2026-01-15T23:56:56+00:00

Open Responses

This is the standardization effort I've most wanted in the world of LLMs: a vendor-neutral specification for the JSON API that clients can use to talk to hosted LLMs.

Open Responses aims to provide exactly that as a documented standard, derived from OpenAI's Responses API.

I was hoping for one based on their older Chat Completions API since so many other products have cloned the already, but basing it on Responses does make sense since that API was designed with the feature of more recent models - such as reasoning traces - baked into the design.

What's certainly notable is the list of launch partners. OpenRouter alone means we can expect to be able to use this protocol with almost every existing model, and Hugging Face, LM Studio, vLLM, Ollama and Vercel cover a huge portion of the common tools used to serve models.

For protocols like this I really want to see a comprehensive, language-independent conformance test site. Open Responses has a subset of that - the official repository includes src/lib/compliance-tests.ts which can be used to exercise a server implementation, and is available as a React app on the official site that can be pointed at any implementation served via CORS.

What's missing is the equivalent for clients. I plan to spin up my own client library for this in Python and I'd really like to be able to run that against a conformance suite designed to check that my client correctly handles all of the details.

Via VB

Tags: json, standards, ai, openai, generative-ai, llms, openrouter, conformance-suites

Progressive JSON

2025-06-01T04:45:32+00:00

Progressive JSON

This post by Dan Abramov is a trap! It proposes a fascinating way of streaming JSON objects to a client in a way that provides the shape of the JSON before the stream has completed, then fills in the gaps as more data arrives... and then turns out to be a sneaky tutorial in how React Server Components work.

Ignoring the sneakiness, the imaginary streaming JSON format it describes is a fascinating thought exercise:

{
  header: "$1",
  post: "$2",
  footer: "$3"
}
/* $1 */
"Welcome to my blog"
/* $3 */
"Hope you like it"
/* $2 */
{
  content: "$4",
  comments: "$5"
}
/* $4 */
"This is my article"
/* $5 */
["$6", "$7", "$8"]
/* $6 */
"This is the first comment"
/* $7 */
"This is the second comment"
/* $8 */
"This is the third comment"

After each block the full JSON document so far can be constructed, and Dan suggests interleaving Promise() objects along the way for placeholders that have not yet been fully resolved - so after receipt of block $3 above (note that the blocks can be served out of order) the document would look like this:

{
  header: "Welcome to my blog",
  post: new Promise(/* ... not yet resolved ... */),
  footer: "Hope you like it"
}

I'm tucking this idea away in case I ever get a chance to try it out in the future.

Tags: javascript, json, react, dan-abramov

Incomplete JSON Pretty Printer

2025-03-28T00:18:43+00:00

Incomplete JSON Pretty Printer

Every now and then a log file or a tool I'm using will spit out a bunch of JSON that terminates unexpectedly, meaning I can't copy it into a text editor and pretty-print it to see what's going on.

The other day I got frustrated with this and had the then-new GPT-4.5 build me a pretty-printer that didn't mind incomplete JSON, using an OpenAI Canvas. Here's the chat and here's the resulting interactive.

I spotted a bug with the way it indented code today so I pasted it into Claude 3.7 Sonnet Thinking mode and had it make a bunch of improvements - full transcript here. Here's the finished code.

In many ways this is a perfect example of vibe coding in action. At no point did I look at a single line of code that either of the LLMs had written for me. I honestly don't care how this thing works: it could not be lower stakes for me, the worst a bug could do is show me poorly formatted incomplete JSON.

I was vaguely aware that some kind of state machine style parser would be needed, because you can't parse incomplete JSON with a regular JSON parser. Building simple parsers is the kind of thing LLMs are surprisingly good at, and also the kind of thing I don't want to take on for a trivial project.

At one point I told Claude "Try using your code execution tool to check your logic", because I happen to know Claude can write and then execute JavaScript independently of using it for artifacts. That helped it out a bunch.

I later dropped in the following:

modify the tool to work better on mobile screens and generally look a bit nicer - and remove the pretty print JSON button, it should update any time the input text is changed. Also add a "copy to clipboard" button next to the results. And add a button that says "example" which adds a longer incomplete example to demonstrate the tool, make that example pelican themed.

It's fun being able to say "generally look a bit nicer" and get a perfectly acceptable result!

Tags: json, tools, ai, generative-ai, chatgpt, llms, claude, vibe-coding

PostgreSQL 17: SQL/JSON is here!

2024-10-13T19:01:02+00:00

PostgreSQL 17: SQL/JSON is here!

Hubert Lubaczewski dives into the new JSON features added in PostgreSQL 17, released a few weeks ago on the 26th of September. This is the latest in his long series of similar posts about new PostgreSQL features.

The features are based on the new SQL:2023 standard from June 2023. If you want to actually read the specification for SQL:2023 it looks like you have to buy a PDF from ISO for 194 Swiss Francs (currently $226). Here's a handy summary by Peter Eisentraut: SQL:2023 is finished: Here is what's new.

There's a lot of neat stuff in here. I'm particularly interested in the json_table() table-valued function, which can convert a JSON string into a table with quite a lot of flexibility. You can even specify a full table schema as part of the function call:

SELECT * FROM json_table(
    '[{"a":10,"b":20},{"a":30,"b":40}]'::jsonb,
    '$[*]'
    COLUMNS (
        id FOR ORDINALITY,
        column_a int4 path '$.a',
        column_b int4 path '$.b',
        a int4,
        b int4,
        c text
    )
);

SQLite has solid JSON support already and often imitates PostgreSQL features, so I wonder if we'll see an update to SQLite that reflects some aspects of this new syntax.

Via lobste.rs

Tags: json, postgresql, sql, sqlite

Wikidata is a Giant Crosswalk File

2024-10-05T15:45:36+00:00

Wikidata is a Giant Crosswalk File

Drew Breunig shows how to take the 140GB Wikidata JSON export, use sed 's/,$//' to convert it to newline-delimited JSON, then use DuckDB to run queries and extract external identifiers, including a query that pulls out 500MB of latitude and longitude points.

Tags: json, wikipedia, duckdb, drew-breunig

Jiter

2024-09-22T20:03:07+00:00

Jiter

One of the challenges in dealing with LLM streaming APIs is the need to parse partial JSON - until the stream has ended you won't have a complete valid JSON object, but you may want to display components of that JSON as they become available.

I've solved this previously using the ijson streaming JSON library, see my previous TIL.

Today I found out about Jiter, a new option from the team behind Pydantic. It's written in Rust and extracted from pydantic-core, so the Python wrapper for it can be installed using:

pip install jiter

You can feed it an incomplete JSON bytes object and use partial_mode="on" to parse the valid subset:

import jiter
partial_json = b'{"name": "John", "age": 30, "city": "New Yor'
jiter.from_json(partial_json, partial_mode="on")
# {'name': 'John', 'age': 30}

Or use partial_mode="trailing-strings" to include incomplete string fields too:

jiter.from_json(partial_json, partial_mode="trailing-strings")
# {'name': 'John', 'age': 30, 'city': 'New Yor'}

The current README was a little thin, so I submiitted a PR with some extra examples. I got some help from files-to-prompt and Claude 3.5 Sonnet):

cd crates/jiter-python/ && files-to-prompt -c README.md tests | llm -m claude-3.5-sonnet --system 'write a new README with comprehensive documentation'

Via jackmpcollins on Hacker News

Tags: json, python, rust, ai-assisted-programming, pydantic, files-to-prompt

How streaming LLM APIs work

2024-09-22T03:48:12+00:00

How streaming LLM APIs work

New TIL. I used curl to explore the streaming APIs provided by OpenAI, Anthropic and Google Gemini and wrote up detailed notes on what I learned.

Also includes example code for receiving streaming events in Python with HTTPX and receiving streaming events in client-side JavaScript using fetch().

Tags: apis, http, json, llms

json-flatten, now with format documentation

2024-09-07T05:43:01+00:00

json-flatten, now with format documentation

json-flatten is a fun little Python library I put together a few years ago for converting JSON data into a flat key-value format, suitable for inclusion in an HTML form or query string. It lets you take a structure like this one:

{"foo": {"bar": [1, True, None]}

And convert it into key-value pairs like this:

foo.bar.[0]$int=1
foo.bar.[1]$bool=True
foo.bar.[2]$none=None

The flatten(dictionary) function function converts to that format, and unflatten(dictionary) converts back again.

I was considering the library for a project today and realized that the 0.3 README was a little thin - it showed how to use the library but didn't provide full details of the format it used.

On a hunch, I decided to see if files-to-prompt plus LLM plus Claude 3.5 Sonnet could write that documentation for me. I ran this command:

files-to-prompt *.py | llm -m claude-3.5-sonnet --system 'write detailed documentation in markdown describing the format used to represent JSON and nested JSON as key/value pairs, include a table as well'

That *.py picked up both json_flatten.py and test_json_flatten.py - I figured the test file had enough examples in that it should act as a good source of information for the documentation.

This worked really well! You can see the first draft it produced here.

It included before and after examples in the documentation. I didn't fully trust these to be accurate, so I gave it this follow-up prompt:

llm -c "Rewrite that document to use the Python cog library to generate the examples"

I'm a big fan of Cog for maintaining examples in READMEs that are generated by code. Cog has been around for a couple of decades now so it was a safe bet that Claude would know about it.

This almost worked - it produced valid Cog syntax like the following:

[[[cog
example = {
"fruits": ["apple", "banana", "cherry"]
}

cog.out("```json\n")
cog.out(str(example))
cog.out("\n```\n")
cog.out("Flattened:\n```\n")
for key, value in flatten(example).items():
    cog.out(f"{key}: {value}\n")
cog.out("```\n")
]]]
[[[end]]]

But that wasn't entirely right, because it forgot to include the Markdown comments that would hide the Cog syntax, which should have looked like this:

<!-- [[[cog -->
...
<!-- ]]] -->
...
<!-- [[[end]]] -->

I could have prompted it to correct itself, but at this point I decided to take over and edit the rest of the documentation by hand.

The end result was documentation that I'm really happy with, and that I probably wouldn't have bothered to write if Claude hadn't got me started.

Tags: json, projects, ai, generative-ai, llms, ai-assisted-programming, llm, anthropic, claude, claude-3-5-sonnet, files-to-prompt

json-flatten 0.3.1

2024-09-07T04:14:40+00:00

Release: json-flatten 0.3.1

Tags: json, python

New improved commit messages for scrape-hacker-news-by-domain

2024-09-06T05:40:01+00:00

New improved commit messages for scrape-hacker-news-by-domain

My simonw/scrape-hacker-news-by-domain repo has a very specific purpose. Once an hour it scrapes the Hacker News /from?site=simonwillison.net page (and the equivalent for datasette.io) using my shot-scraper tool and stashes the parsed links, scores and comment counts in JSON files in that repo.

It does this mainly so I can subscribe to GitHub's Atom feed of the commit log - visit simonw/scrape-hacker-news-by-domain/commits/main and add .atom to the URL to get that.

NetNewsWire will inform me within about an hour if any of my content has made it to Hacker News, and the repo will track the score and comment count for me over time. I wrote more about how this works in Scraping web pages from the command line with shot-scraper back in March 2022.

Prior to the latest improvement, the commit messages themselves were pretty uninformative. The message had the date, and to actually see which Hacker News post it was referring to, I had to click through to the commit and look at the diff.

I built my csv-diff tool a while back to help address this problem: it can produce a slightly more human-readable version of a diff between two CSV or JSON files, ideally suited for including in a commit message attached to a git scraping repo like this one.

I got that working, but there was still room for improvement. I recently learned that any Hacker News thread has an undocumented URL at /latest?id=x which displays the most recently added comments at the top.

I wanted that in my commit messages, so I could quickly click a link to see the most recent comments on a thread.

So... I added one more feature to csv-diff: a new --extra option lets you specify a Python format string to be used to add extra fields to the displayed difference.

My GitHub Actions workflow now runs this command:

csv-diff simonwillison-net.json simonwillison-net-new.json \
  --key id --format json \
  --extra latest 'https://news.ycombinator.com/latest?id={id}' \
  >> /tmp/commit.txt

This generates the diff between the two versions, using the id property in the JSON to tie records together. It adds a latest field linking to that URL.

The commits now look like this:

Tags: hacker-news, json, projects, github-actions, git-scraping, shot-scraper

LLMs are bad at returning code in JSON

2024-08-16T17:04:39+00:00

LLMs are bad at returning code in JSON

Paul Gauthier's Aider is a terminal-based coding assistant which works against multiple different models. As part of developing the project Paul runs extensive benchmarks, and his latest shows an interesting result: LLMs are slightly less reliable at producing working code if you request that code be returned as part of a JSON response.

The May release of GPT-4o is the closest to a perfect score - the August appears to have regressed slightly, and the new structured output mode doesn't help and could even make things worse (though that difference may not be statistically significant).

Paul recommends using Markdown delimiters here instead, which are less likely to introduce confusing nested quoting issues.

Via @paulgauthier

Tags: json, ai, prompt-engineering, generative-ai, llms, aider, evals, paul-gauthier

Share Claude conversations by converting their JSON to Markdown

2024-08-08T20:40:20+00:00

Share Claude conversations by converting their JSON to Markdown

Anthropic's Claude is missing one key feature that I really appreciate in ChatGPT: the ability to create a public link to a full conversation transcript. You can publish individual artifacts from Claude, but I often find myself wanting to publish the whole conversation.

Before ChatGPT added that feature I solved it myself with this ChatGPT JSON transcript to Markdown Observable notebook. Today I built the same thing for Claude.

Here's how to use it:

The key is to load a Claude conversation on their website with your browser DevTools network panel open and then filter URLs for chat_. You can use the Copy -> Response right click menu option to get the JSON for that conversation, then paste it into that new Observable notebook to get a Markdown transcript.

I like sharing these by pasting them into a "secret" Gist - that way they won't be indexed by search engines (adding more AI generated slop to the world) but can still be shared with people who have the link.

Here's an example transcript from this morning. I started by asking Claude:

I want to breed spiders in my house to get rid of all of the flies. What spider would you recommend?

When it suggested that this was a bad idea because it might attract pests, I asked:

What are the pests might they attract? I really like possums

It told me that possums are attracted by food waste, but "deliberately attracting them to your home isn't recommended" - so I said:

Thank you for the tips on attracting possums to my house. I will get right on that! [...] Once I have attracted all of those possums, what other animals might be attracted as a result? Do you think I might get a mountain lion?

It emphasized how bad an idea that would be and said "This would be extremely dangerous and is a serious public safety risk.", so I said:

OK. I took your advice and everything has gone wrong: I am now hiding inside my house from the several mountain lions stalking my backyard, which is full of possums

Claude has quite a preachy tone when you ask it for advice on things that are clearly a bad idea, which makes winding it up with increasingly ludicrous questions a lot of fun.

Tags: json, projects, tools, markdown, ai, observable, generative-ai, llms, anthropic, claude

OpenAI: Introducing Structured Outputs in the API

2024-08-06T18:32:25+00:00

OpenAI: Introducing Structured Outputs in the API

OpenAI have offered structured outputs for a while now: you could specify "response_format": {"type": "json_object"}} to request a valid JSON object, or you could use the function calling mechanism to request responses that match a specific schema.

Neither of these modes were guaranteed to return valid JSON! In my experience they usually did, but there was always a chance that something could go wrong and the returned code could not match the schema, or even not be valid JSON at all.

Outside of OpenAI techniques like jsonformer and llama.cpp grammars could provide those guarantees against open weights models, by interacting directly with the next-token logic to ensure that only tokens that matched the required schema were selected.

OpenAI credit that work in this announcement, so they're presumably using the same trick. They've provided two new ways to guarantee valid outputs. The first a new "strict": true option for function definitions. The second is a new feature: a "type": "json_schema" option for the "response_format" field which lets you then pass a JSON schema (and another "strict": true flag) to specify your required output.

I've been using the existing "tools" mechanism for exactly this already in my datasette-extract plugin - defining a function that I have no intention of executing just to get structured data out of the API in the shape that I want.

Why isn't "strict": true by default? Here's OpenAI's Ted Sanders:

We didn't cover this in the announcement post, but there are a few reasons:

The first request with each JSON schema will be slow, as we need to preprocess the JSON schema into a context-free grammar. If you don't want that latency hit (e.g., you're prototyping, or have a use case that uses variable one-off schemas), then you might prefer "strict": false

You might have a schema that isn't covered by our subset of JSON schema. (To keep performance fast, we don't support some more complex/long-tail features.)

In JSON mode and Structured Outputs, failures are rarer but more catastrophic. If the model gets too confused, it can get stuck in loops where it just prints technically valid output forever without ever closing the object. In these cases, you can end up waiting a minute for the request to hit the max_token limit, and you also have to pay for all those useless tokens. So if you have a really tricky schema, and you'd rather get frequent failures back quickly instead of infrequent failures back slowly, you might also want "strict": false

But in 99% of cases, you'll want "strict": true.

More from Ted on how the new mode differs from function calling:

Under the hood, it's quite similar to function calling. A few differences:

Structured Outputs is a bit more straightforward. e.g., you don't have to pretend you're writing a function where the second arg could be a two-page report to the user, and then pretend the "function" was called successfully by returning {"success": true}

Having two interfaces lets us teach the model different default behaviors and styles, depending on which you use

Another difference is that our current implementation of function calling can return both a text reply plus a function call (e.g., "Let me look up that flight for you"), whereas Structured Outputs will only return the JSON

The official openai-python library also added structured output support this morning, based on Pydantic and looking very similar to the Instructor library (also credited as providing inspiration in their announcement).

There are some key limitations on the new structured output mode, described in the documentation. Only a subset of JSON schema is supported, and most notably the "additionalProperties": false property must be set on all objects and all object keys must be listed in "required" - no optional keys are allowed.

Another interesting new feature: if the model denies a request on safety grounds a new refusal message will be returned:

{
  "message": {
    "role": "assistant",
    "refusal": "I'm sorry, I cannot assist with that request."
  }
}

Finally, tucked away at the bottom of this announcement is a significant new model release with a major price cut:

By switching to the new gpt-4o-2024-08-06, developers save 50% on inputs ($2.50/1M input tokens) and 33% on outputs ($10.00/1M output tokens) compared to gpt-4o-2024-05-13.

This new model also supports 16,384 output tokens, up from 4,096.

The price change is particularly notable because GPT-4o-mini, the much cheaper alternative to GPT-4o, prices image inputs at the same price as GPT-4o. This new model cuts that by half (confirmed here), making gpt-4o-2024-08-06 the new cheapest model from OpenAI for handling image inputs.

Tags: json, ai, openai, generative-ai, llms, structured-extraction, pydantic

Ham radio general exam question pool as JSON

2024-05-11T19:16:49+00:00

Ham radio general exam question pool as JSON

I scraped a pass of my Ham radio general exam this morning. One of the tools I used to help me pass was a Datasette instance with all 429 questions from the official question pool. I've published that raw data as JSON on GitHub, which I converted from the official question pool document using an Observable notebook.

Relevant TIL: How I studied for my Ham radio general exam.

Tags: json, projects, radio, datasette, observable, ham-radio

Tips on Adding JSON Output to Your CLI App

2024-04-20T21:43:58+00:00

Tips on Adding JSON Output to Your CLI App

Kelly Brazil - also the author of jc, the neat CLI tool that converts the output of common Unix utilities such as dig into JSON - provides some useful do's and don'ts for adding JSON output as an option to a command-line tool.

Kelly recommends defaulting to arrays of flat objects - or newline-delimited objects - and suggests including an "unbuffer" option for streaming tools that discourages the OS from buffering output that is being sent through a pipe.

Via Hacker News

Tags: cli, json

Caddy: Config Adapters

2024-02-13T04:22:08+00:00

Caddy: Config Adapters

The Caddy web application server is configured using JSON, but their “config adapters” plugin mechanism allows you to write configuration files in YAML, TOML, JSON5 (JSON with comments), and even nginx format which then gets automatically converted to JSON for you.

Caddy author Matt Holt: “We put an end to the config format wars in Caddy by letting you use any format you want!”

Via @mholt6

Tags: json, matt-holt

SQLite 3.45

2024-01-15T20:15:42+00:00

SQLite 3.45

Released today. The big new feature is JSONB support, a new, specific-to-SQLite binary internal representation of JSON which can provide up to a 3x performance improvement for JSON-heavy operations, plus a 5-10% saving it terms of bytes stored on disk.

Tags: databases, json, sqlite

jo

2023-10-08T05:20:09+00:00

Neat little C utility (available via brew/apt-get install etc) for conveniently outputting JSON from a shell: “jo -p name=jo n=17 parser=false” will output a JSON object with string, integer and boolean values, and you can nest it to create nested objects. Looks very handy.

Via lobste.rs

Tags: c, json

jq 1.7

2023-10-02T04:58:54+00:00

jq 1.7

First new release of jq in five years! The project has moved from a solo maintainer to a new team with a dedicated GitHub organization. A ton of new features in this release—I’m most excited about the new pick(.key1, .key2.nested) builtin for emitting a selected subset of the incoming objects, and the --raw-output0 option which outputs zero byte delimited lists, designed to be piped to “xargs -0”.

Tags: json, jq

Lark parsing library JSON tutorial

2023-08-13T21:50:16+00:00

Lark parsing library JSON tutorial

A very convincing tutorial for a new-to-me parsing library for Python called Lark.

The tutorial covers building a full JSON parser from scratch, which ends up being just 19 lines of grammar definition code and 15 lines for the transformer to turn that tree into the final JSON.

It then gets into the details of optimization—the default Earley algorithm is quite slow, but swapping that out for a LALR parser (a one-line change) provides a 5x speedup for this particular example.

Via spandanb/learndb-py

Tags: compilers, json, parsing, python

Datasette 1.0a3

2023-08-09T20:49:35+00:00

Datasette 1.0a3

A new Datasette alpha release. This one previews the new default JSON API design that’s coming in 1.0—the single most significant change in the 1.0 milestone, since I plan to keep that API stable for many years to come.

Tags: json, projects, datasette

SQLite 3.42.0

2023-05-18T21:14:07+00:00

SQLite 3.42.0

The latest SQLite has a tiny feature I requested on the SQLite Forum - SELECT unixepoch('subsec') now returns the current time in milliseconds since the Unix epoch, a big improvement on the previous recipe of select cast((julianday('now') - 2440587.5) * 86400 * 1000 as integer)!

Also in the release: JSON5 support (JSON with multi-line strings and comments), a bunch of improvements to the query planner and CLI tool, plus various interesting internal changes.

Tags: json, sqlite

Jsonformer: A Bulletproof Way to Generate Structured JSON from Language Models

2023-05-08T23:02:01+00:00

Jsonformer: A Bulletproof Way to Generate Structured JSON from Language Models

This is such an interesting trick. A common challenge with LLMs is getting them to output a specific JSON shape of data reliably, without occasionally messing up and generating invalid JSON or outputting other text.

Jsonformer addresses this in a truly ingenious way: it implements code that interacts with the logic that decides which token to output next, influenced by a JSON schema. If that code knows that the next token after a double quote should be a comma it can force the issue for that specific token.

This means you can get reliable, robust JSON output even for much smaller, less capable language models.

It’s built against Hugging Face transformers, but there’s no reason the same idea couldn’t be applied in other contexts as well.

Tags: json, ai, generative-ai, llms, hugging-face

Datasette: Gather feedback on new ?_extra= design

2023-03-22T23:14:19+00:00

Datasette: Gather feedback on new ?_extra= design

I just landed the single biggest backwards-incompatible change to Datasette ever, in preparation for the 1.0 release. It’s a change to the default JSON format from the Datasette API—the new format is much slimmer, and can be expanded using a new ?_extra= query string parameter. I’m desperately keen on getting feedback on this change! This issues has more details and a call for feedback.

Tags: json, datasette

sqlite-jsonschema

2023-01-28T03:50:46+00:00

sqlite-jsonschema

“A SQLite extension for validating JSON objects with JSON Schema”, building on the jsonschema Rust crate. SQLite and JSON are already a great combination—Alex suggests using this extension to implement check constraints to validate JSON columns before inserting into a table, or just to run queries finding existing data that doesn’t match a given schema.

Tags: json, jsonschema, sqlite, rust, alex-garcia

Datasette's new JSON write API: The first alpha of Datasette 1.0

2022-12-02T23:15:07+00:00

This week I published the first alpha release of Datasette 1.0, with a significant new feature: Datasette core now includes a JSON API for creating and dropping tables and inserting, updating and deleting data.

Combined with Datasette's existing APIs for reading and filtering table data and executing SELECT queries this effectively turns Datasette into a SQLite-backed JSON data layer for any application.

If you squint at it the right way, you could even describe it as offering a NoSQL interface to a SQL database!

My initial motivation for this work was to provide an API for loading data into my Datasette Cloud SaaS product - but now that I've got it working I'm realizing that it can be applied to a whole host of interesting things.

I shipped the 1.0a0 alpha on Wednesday, then spent the last two days ironing out some bugs (released in 1.0a1) and building some illustrative demos.

Scraping Hacker News to build an atom feed

My first demo reuses my scrape-hacker-news-by-domain project from earlier this year.

https://news.ycombinator.com/from?site=simonwillison.net is the page on Hacker News that shows submissions from my blog. I like to keep an eye on that page to see if anyone has linked to my work.

Data from that page is not currently available through the official Hacker News API... but it's in an HTML format that's pretty easy to scrape.

My shot-scraper command-line browser automation tool has the ability to execute JavaScript against a web page and return scraped data as JSON.

I wrote about that in Scraping web pages from the command line with shot-scraper, including a recipe for scraping that Hacker News page that looks like this:

shot-scraper javascript \
  "https://news.ycombinator.com/from?site=simonwillison.net" \
  -i scrape.js -o simonwillison-net.json

Here's that scrape.js script.

I've been running a Git scraper that executes that scraping script using GitHub Actions for several months now, out of my simonw/scrape-hacker-news-by-domain repository.

Today I modified that script to also publish the data it has scraped to my personal Datasette Cloud account using the new API - and then used the datasette-atom plugin to generate an Atom feed from that data.

Here's the new table in Datasette Cloud.

This is the bash script that runs in GitHub Actions and pushes the data to Datasette:

export SIMONWILLISON_ROWS=$(
  jq -n --argjson rows "$(cat simonwillison-net.json)" \
  '{ "rows": $rows, "replace": true }'
)
curl -X POST \
  https://simon.datasette.cloud/data/hacker_news_posts/-/insert \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $DS_TOKEN" \
  -d "$SIMONWILLISON_ROWS"

$DS_TOKEN is an environment variable containing a signed API token, see the API token documentation for details.

I'm using jq here (with a recipe generated using GPT-3) to convert the scraped data into the JSON format needeed by the Datasette API. The result looks like this:

{
  "rows": [
    {
      "id": "33762438",
      "title": "Coping strategies for the serial project hoarder",
      "url": "https://simonwillison.net/2022/Nov/26/productivity/",
      "dt": "2022-11-27T12:12:56",
      "points": 222,
      "submitter": "usrme",
      "commentsUrl": "https://news.ycombinator.com/item?id=33762438",
      "numComments": 38
    }
  ],
  "replace": true
}

This is then POSTed up to the https://simon.datasette.cloud/data/hacker_news_posts/-/insert API endpoint.

The "rows" key is a list of rows to be inserted.

"replace": true tells Datasette to replace any existing rows with the same primary key. Without that, the API would return an error if any rows already existed.

The API also accepts "ignore": true which will cause it to ignore any rows that already exist.

Full insert API documentation is here.

Initially creating the table

Before I could insert any rows I needed to create the table.

I did that from the command-line too, using this recipe:

export ROWS=$(
  jq -n --argjson rows "$(cat simonwillison-net.json)" \
  '{ "table": "hacker_news_posts", "rows": $rows, "pk": "id" }'
)
# Use curl to POST some JSON to a URL
curl -X POST \
  https://simon.datasette.cloud/data/-/create \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $DS_TOKEN" \
  -d $ROWS

This uses the same trick as above, but hits a different API endpoint: /data/-/create which is the endpoint for creating a table in the data.db database.

The JSON submitted to that endpoint looks like this:

{
  "table": "hacker_news_posts",
  "pk": "id",
  "rows": [
    {
      "id": "33762438",
      "title": "Coping strategies for the serial project hoarder",
      "url": "https://simonwillison.net/2022/Nov/26/productivity/",
      "dt": "2022-11-27T12:12:56",
      "points": 222,
      "submitter": "usrme",
      "commentsUrl": "https://news.ycombinator.com/item?id=33762438",
      "numComments": 38
    }
  ]
}

It's almost the same shape as the /-/insert call above. That's because it's using a feature of the Datasette API inherited from sqlite-utils - it can create a table from a list of rows, automatically determining the correct schema.

If you already know your schema you can pass a "columns": [...] key instead, but I've found that this kind of automatic schema generation works really well in practice.

Datasette will let you call the create API like that multiple times, and if the table already exists it will insert new rows directly into the existing tables. I expect this to be a really convenient way to write automation scripts where you don't want to bother checking if the table exists already.

Building an Atom feed

My end goal with this demo was to build an Atom feed I could subscribe to in my NetNewsWire feed reader.

I have a plugin for that already: datasette-atom, which lets you generate an Atom feed for any data in Datasette, defined using a SQL query.

I created a SQL view for this (using the datasette-write plugin, which is installed on Datasette Cloud):

CREATE VIEW hacker_news_posts_atom as select
  id as atom_id,
  title as atom_title,
  url,
  commentsUrl as atom_link,
  dt || 'Z' as atom_updated,
  'Submitter: ' || submitter || ' - ' || points || ' points, ' || numComments || ' comments' as atom_content
from
  hacker_news_posts
order by
  dt desc
limit
  100;

datasette-atom requires a table, view or SQL query that returns atom_id, atom_title and atom_updated columns - and will make use of atom_link and atom_content as well if they are present.

Datasette Cloud defaults to keeping all tables and views private - but a while ago I created the datasette-public plugin to provide a UI for making a table public.

It turned out this didn't work for SQL views yet, so I fixed that - then used that option to make my view public. You can visit it at:

https://simon.datasette.cloud/data/hacker_news_posts_atom

And to get an Atom feed, just add .atom to the end of the URL:

https://simon.datasette.cloud/data/hacker_news_posts_atom.atom

Here's what it looks like in NetNewsWire:

I'm pretty excited about being able to combine these tools in this way: it makes getting from scraped data to a Datasette table to an Atom feed a very repeatable process.

Building a TODO list application

My second demo explores what it looks like to develop custom applications against the new API.

TodoMVC is a project that provides the same TODO list interface built using dozens of different JavaScript frameworks, as a comparison tool.

I decided to use it to build my own TODO list application, using Datasette as the backend.

You can try it out at https://todomvc.datasette.io/ - but be warned that the demo resets every 15 minutes so don't use it for real task tracking!

The source code for this demo lives in simonw/todomvc-datasette - which also serves the demo itself using GitHub Pages.

The code is based on the TodoMVC Vanilla JavaScript example. I used that unmodified, except for one file - store.js, which I modified to use the Datasette API instead of localStorage.

The demo currently uses a hard-coded authentication token, which is signed to allow actions to be performed against the https://latest.datasette.io/ demo instance as a user called todomvc.

That user is granted permissions in a custom plugin at the moment, but I plan to provide a more user-friendly way to do this in the future.

A couple of illustrative snippets of code. First, on page load this constructor uses the Datasette API to create the table used by the application:

function Store(name, callback) {
  callback = callback || function () {};

  // Ensure a table exists with this name
  let self = this;
  self._dbName = `todo_${name}`;
  fetch("https://latest.datasette.io/ephemeral/-/create", {
    method: "POST",
    mode: "cors",
    headers: {
      Authorization: `Bearer ${TOKEN}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      table: self._dbName,
      columns: [
        {name: "id", type: "integer"},
        {name: "title", type: "text"},
        {name: "completed", type: "integer"},
      ],
      pk: "id",
    }),
  }).then(function (r) {
    callback.call(this, []);
  });
}

Most applications would run against a table that has already been created, but this felt like a good opportunity to show what table creation looks like.

Note that the table is being created using /ephemeral/-/create - this endpoint that lets you create tables in the ephemeral database, which is a temporary database that drops every table after 15 minutes. I built the datasette-ephemeral-tables plugin to make this possible.

Here's the code which is called when a new TODO list item is created or updated:

Store.prototype.save = function (updateData, callback, id) {
// {title, completed}
callback = callback || function () {};
var table = this._dbName;

// If an ID was actually given, find the item and update each property
if (id) {
  fetch(
    `https://latest.datasette.io/ephemeral/${table}/${id}/-/update`,
    {
      method: "POST",
      mode: "cors",
      headers: {
        Authorization: `Bearer ${TOKEN}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({update: updateData}),
    }
  )
    .then((r) => r.json())
    .then((data) => {
      callback.call(self, data);
    });
} else {
  // Save it and store ID
  fetch(`https://latest.datasette.io/ephemeral/${table}/-/insert`, {
    method: "POST",
    mode: "cors",
    headers: {
      Authorization: `Bearer ${TOKEN}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      row: updateData,
    }),
  })
    .then((r) => r.json())
    .then((data) => {
      let row = data.rows[0];
      callback.call(self, row);
    });
}
};

TodoMVC passes an id if a record is being updated - which this code uses as a sign that the ...table/row-id/-/update API should be called (see update API documentation).

If the row doen't have an ID it is inserted using table/-/insert, this time using the "row": key because we are only inserting a single row.

The hardest part of getting this to work was ensuring Datasette's CORS mode worked correctly for writes. I had to add a new Access-Control-Allow-Methods header, which I shipped in Datasette 1.0a1 (see issue #1922).

Try the ephemeral hosted API

I built the datasette-ephemeral-tables plugin because I wanted to provide a demo instance of the write API that anyone could try out without needing to install Datasette themselves - but that wouldn't leave me responsible for taking care of their data or cleaning up any of their mess.

You're welcome to experiment with the API using the https://latest.datasette.io/ demo instance.

First, you'll need to sign in as a root user. You can do that (no password required) using the button on this page.

Once signed in you can view the ephemeral database (which isn't visible to anonymous users) here:

https://latest.datasette.io/ephemeral

You can use the API explorer to try out the different write APIs against it here:

https://latest.datasette.io/-/api

And you can create your own signed token for accessing the API on this page:

https://latest.datasette.io/-/create-token

The TodoMVC application described above also uses the ephemeral database, so you may see a todo_todos-vanillajs table appear there if anyone is playing with that demo.

Or run this on your own machine

You can install the latest Datasette alpha like this:

pip install datasette==1.0a1

Then create a database and sign in as the root user in order to gain access to the API:

datasette demo.db --create --root

Click on the link it outputs to sign in as the root user, then visit the API explorer to start trying out the API:

http://127.0.0.1:8001/-/api

The API explorer works without a token at all, using your existing browser cookies.

If you want to try the API using curl or similar you can use this page to create a new signed API token for the root user:

http://127.0.0.1:8001/-/create-token

This token will become invalid if you restart the server, unless you fix the DATASETTE_SECRET environment variable to a stable string before you start the server:

export DATASETTE_SECRET=$(
  python3 -c 'print(__import__("secrets").token_hex(16))'
)

Check the Write API documentation for more details.

What's next?

If you have feedback on these APIs, now is the time to share it! I'm hoping to ship Datasette 1.0 at the start of 2023, after which these APIs will be considered stable for hopefully a long time to come.

If you have thoughts or feedback (or questions) join us on the Datasette Discord. You can also file issue comments against Datasette itself.

My priority for the next 1.0 alpha is to bake in a small number of backwards incompatible changes to other aspects of Datasette's JSON API that I've been hoping to include in 1.0 for a while.

I'm also going to be rolling out API support to my Datasette Cloud preview users. If you're interested in trying that out you can request access here.

Tags: apis, json, projects, datasette

Building a BFT JSON CRDT

2022-11-21T19:56:50+00:00

Building a BFT JSON CRDT

Jacky Zhao describes their project to build a CRDT library for JSON data in Rust, and includes a thorough explanation of what CRDTs are and how they work. “I write this blog post mostly as a note to my past self, distilling a lot of what I’ve learned since into a blog post I wish I had read before going in”—the best kind of blog post!

Via Hacker News

Tags: json, rust, crdt