<?xml version="1.0" encoding="utf-8"?>
<feed xml:lang="en-us" xmlns="http://www.w3.org/2005/Atom"><title>Simon Willison's Weblog: pdf</title><link href="http://simonwillison.net/" rel="alternate"/><link href="http://simonwillison.net/tags/pdf.atom" rel="self"/><id>http://simonwillison.net/</id><updated>2025-12-12T23:29:51+00:00</updated><author><name>Simon Willison</name></author><entry><title>OpenAI are quietly adopting skills, now available in ChatGPT and Codex CLI</title><link href="https://simonwillison.net/2025/Dec/12/openai-skills/#atom-tag" rel="alternate"/><published>2025-12-12T23:29:51+00:00</published><updated>2025-12-12T23:29:51+00:00</updated><id>https://simonwillison.net/2025/Dec/12/openai-skills/#atom-tag</id><summary type="html">
    &lt;p&gt;One of the things that most excited me about &lt;a href="https://simonwillison.net/2025/Oct/16/claude-skills/"&gt;Anthropic's new Skills mechanism&lt;/a&gt; back in October was how easy it looked for other platforms to implement. A skill is just a folder with a Markdown file and some optional extra resources and scripts, so any LLM tool with the ability to navigate and read from a filesystem should be capable of using them. It turns out OpenAI are doing exactly that, with skills support quietly showing up in both their Codex CLI tool and now also in ChatGPT itself.&lt;/p&gt;
&lt;h4 id="skills-in-chatgpt"&gt;Skills in ChatGPT&lt;/h4&gt;
&lt;p&gt;I learned about this &lt;a href="https://x.com/elias_judin/status/1999491647563006171"&gt;from Elias Judin&lt;/a&gt; this morning. It turns out the Code Interpreter feature of ChatGPT now has a new &lt;code&gt;/home/oai/skills&lt;/code&gt; folder which you can access simply by prompting:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;Create a zip file of /home/oai/skills&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I &lt;a href="https://chatgpt.com/share/693c9645-caa4-8006-9302-0a9226ea7599"&gt;tried that myself&lt;/a&gt; and got back &lt;a href="https://static.simonwillison.net/static/cors-allow/2025/skills.zip"&gt;this zip file&lt;/a&gt;. Here's &lt;a href="https://tools.simonwillison.net/zip-wheel-explorer?url=https%3A%2F%2Fstatic.simonwillison.net%2Fstatic%2Fcors-allow%2F2025%2Fskills.zip"&gt;a UI for exploring its content&lt;/a&gt; (&lt;a href="https://tools.simonwillison.net/colophon#zip-wheel-explorer.html"&gt;more about that tool&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/skills-explore.jpg" alt="Screenshot of file explorer. Files skills/docs/render_docsx.py and skills/docs/skill.md and skills/pdfs/ and skills/pdfs/skill.md - that last one is expanded and reads: # PDF reading, creation, and review guidance  ## Reading PDFs - Use pdftoppm -png $OUTDIR/$BASENAME.pdf $OUTDIR/$BASENAME to convert PDFs to PNGs. - Then open the PNGs and read the images. - pdfplumber is also installed and can be used to read PDFs. It can be used as a complementary tool to pdftoppm but not replacing it. - Only do python printing as a last resort because you will miss important details with text extraction (e.g. figures, tables, diagrams).  ## Primary tooling for creating PDFs - Generate PDFs programmatically with reportlab as the primary tool. In most cases, you should use reportlab to create PDFs. - If there are other packages you think are necessary for the task (eg. pypdf, pyMuPDF), you can use them but you may need topip install them first. - After each meaningful update—content additions, layout adjustments, or style changes—render the PDF to images to check layout fidelity:   - pdftoppm -png $INPUT_PDF $OUTPUT_PREFIX - Inspect every exported PNG before continuing work. If anything looks off, fix the source and re-run the render → inspect loop until the pages are clean.  ## Quality expectations - Maintain a polished, intentional visual design: consistent typography, spacing, margins, color palette, and clear section breaks across all pages. - Avoid major rendering issues—no clipped text, overlapping elements, black squares, broken tables, or unreadable glyphs. The rendered pages should look like a curated document, not raw template output. - Charts, tables, diagrams, and images must be sharp, well-aligned, and properly labeled in the PNGs. Legends and axes should be readable without excessive zoom. 
- Text must be readable at normal viewing size; avoid walls of filler text or dense, unstructured bullet lists. Use whitespace to separate ideas. - Never use the U+2011 non-breaking hyphen or other unicode dashes as they will not be" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;So far the skills cover spreadsheets, docx files, and PDFs. Interestingly, their chosen approach for PDFs and documents is to convert them to rendered per-page PNGs and then pass those to their vision-enabled GPT models, presumably to preserve information from layout and graphics that would be lost by plain text extraction.&lt;/p&gt;
&lt;p&gt;Elias &lt;a href="https://github.com/eliasjudin/oai-skills"&gt;shared copies in a GitHub repo&lt;/a&gt;. They look very similar to Anthropic's implementation of the same kind of idea, currently published in their &lt;a href="https://github.com/anthropics/skills/tree/main/skills"&gt;anthropics/skills&lt;/a&gt; repository.&lt;/p&gt;
&lt;p&gt;I tried it out by prompting:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Create a PDF with a summary of the rimu tree situation right now and what it means for kakapo breeding season&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Sure enough, GPT-5.2 Thinking started with:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Reading skill.md for PDF creation guidelines&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Then:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Searching rimu mast and Kākāpō 2025 breeding status&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;It took &lt;a href="https://chatgpt.com/share/693ca54b-f770-8006-904b-9f31a585180a"&gt;just over eleven minutes&lt;/a&gt; to produce &lt;a href="https://static.simonwillison.net/static/cors-allow/2025/rimu_kakapo_breeding_brief.pdf"&gt;this PDF&lt;/a&gt;, which was long enough that I had Claude Code for web &lt;a href="https://github.com/simonw/tools/pull/155"&gt;build me a custom PDF viewing tool&lt;/a&gt; while I waited.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://tools.simonwillison.net/view-pdf?url=https%3A%2F%2Fstatic.simonwillison.net%2Fstatic%2Fcors-allow%2F2025%2Frimu_kakapo_breeding_brief.pdf"&gt;Here's ChatGPT's PDF in that tool&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/rimu.jpg" alt="Screenshot of my tool. There is a URL at the top, a Load PDF button and pagination controls. Then the PDF itself is shown, which reads: Rimu mast status and what it means for the kākāpō breeding season Summary as of 12 December 2025 (Pacific/Auckland context) Kākāpō breeding is tightly linked to rimu (Dacrydium cupressinum) mast events: when rimu trees set and ripen large amounts of fruit, female kākāpō are much more likely to nest, and more chicks can be successfully raised. Current monitoring indicates an unusually strong rimu fruiting signal heading into the 2025/26 season, which sets the stage for a potentially large breeding year in 2026.^1,2 Key numbers at a glance Kākāpō population (official DOC count) 237 birds alive Breeding trigger (rimu fruiting)&amp;gt;10% of rimu branch tips bearing fruit Forecast rimu fruiting for 2026 (DOC monitoring) Around 50–60% fruiting across breeding islands¹Breeding-age females (DOC 2025 planning figure)About 87 females (potentially nearly all could nest)" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;(I am &lt;strong&gt;very excited&lt;/strong&gt; about &lt;a href="https://www.auckland.ac.nz/en/news/2025/12/03/bumper-breeding-season-for-kakapo-on-the-cards.html"&gt;Kākāpō breeding season this year&lt;/a&gt;.)&lt;/p&gt;
&lt;p&gt;The reason it took so long is that it was fastidious about looking at and tweaking its own work. I appreciated that at one point it tried rendering the PDF and noticed that the macrons in kākāpō were not supported by the chosen font, so it switched to something else:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/skills-macrons.jpg" alt="ChatGPT screenshot. Analyzed image. There's an image of a page of PDF with obvious black blocks on some of the letters in the heading. It then says: Fixing font issues with macrons. The page is showing black squares for words like &amp;quot;kākāpō,&amp;quot; probably because Helvetica can't handle macrons. I'll switch to a font that supports them, such as DejaVu Sans or Noto Sans. I'll register both regular and bold fonts, then apply them to the document. I'll update the footer to note the issue with Helvetica. Time to rebuild the PDF!" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;h4 id="skills-in-codex-cli"&gt;Skills in Codex CLI&lt;/h4&gt;
&lt;p&gt;Meanwhile, two weeks ago OpenAI's open source Codex CLI tool landed a PR titled &lt;a href="https://github.com/openai/codex/pull/7412"&gt;feat: experimental support for skills.md&lt;/a&gt;. The most recent docs for that are in &lt;a href="https://github.com/openai/codex/blob/main/docs/skills.md"&gt;docs/skills.md&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The documentation suggests that any folder in &lt;code&gt;~/.codex/skills&lt;/code&gt; will be treated as a skill.&lt;/p&gt;
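&lt;p&gt;A skill here is the same shape Anthropic defined: a folder whose &lt;code&gt;SKILL.md&lt;/code&gt; starts with YAML frontmatter giving the skill a name and description. A minimal sketch - the names below are my own illustration, not taken from the Codex docs - looks like this:&lt;/p&gt;

```markdown
---
name: my-example-skill
description: One-line summary the model uses to decide when to load this skill
---

# My example skill

Detailed instructions for the model go here as plain Markdown. Extra
reference files and helper scripts can live alongside this file in the
same folder.
```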
&lt;p&gt;I dug around and found the code that generates the prompt that drives the skill system in &lt;a href="https://github.com/openai/codex/blob/ad7b9d63c326d5c92049abd16f9f5fb64a573a69/codex-rs/core/src/skills/render.rs#L20-L38"&gt;codex-rs/core/src/skills/render.rs&lt;/a&gt; - here's a Gist with &lt;a href="https://gist.github.com/simonw/25f2c3a9e350274bc2b76a79bc8ae8b2"&gt;a more readable version of that prompt&lt;/a&gt;.&lt;/p&gt;
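&lt;p&gt;That Rust code essentially walks the skills directory, reads each &lt;code&gt;SKILL.md&lt;/code&gt;, and injects a bullet list of the available skills into the prompt. Here's a rough Python sketch of the idea - my own approximation of the mechanism, not the actual Codex implementation:&lt;/p&gt;

```python
from pathlib import Path

def parse_frontmatter(text):
    """Naive parse of the name/description YAML frontmatter in a SKILL.md."""
    meta = {}
    if text.startswith("---"):
        for line in text.split("---")[1].strip().splitlines():
            key, _, value = line.partition(":")
            meta[key.strip()] = value.strip()
    return meta

def render_skills_prompt(skills_dir):
    """Build the bullet list of skills that gets injected into the prompt."""
    lines = []
    for skill_md in sorted(Path(skills_dir).glob("*/SKILL.md")):
        meta = parse_frontmatter(skill_md.read_text())
        name = meta.get("name", skill_md.parent.name)
        description = meta.get("description", "")
        lines.append(f"- {name} — {description} (file: {skill_md})")
    return "\n".join(lines)
```

&lt;p&gt;Running something like that against a skills folder produces a list very similar to what Codex printed when I later prompted it to list its skills.&lt;/p&gt;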
&lt;p&gt;I &lt;a href="https://claude.ai/share/0a9b369b-f868-4065-91d1-fd646c5db3f4"&gt;used Claude Opus 4.5's skill authoring skill&lt;/a&gt; to create &lt;a href="https://github.com/datasette/skill"&gt;this skill for creating Datasette plugins&lt;/a&gt;, then installed it into my Codex CLI skills folder like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;git clone https://github.com/datasette/skill \
  &lt;span class="pl-k"&gt;~&lt;/span&gt;/.codex/skills/datasette-plugin&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;You have to run Codex with the &lt;code&gt;--enable skills&lt;/code&gt; option. I ran this:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-c1"&gt;cd&lt;/span&gt; /tmp
mkdir datasette-cowsay
&lt;span class="pl-c1"&gt;cd&lt;/span&gt; datasette-cowsay
codex --enable skills -m gpt-5.2&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Then prompted:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;list skills&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;And Codex replied:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;- datasette-plugins — Writing Datasette plugins using Python + pluggy (file: /Users/simon/.codex/skills/datasette-plugin/SKILL.md)&lt;/code&gt;&lt;br /&gt;
&lt;code&gt;- Discovery — How to find/identify available skills (no SKILL.md path provided in the list)&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Then I said:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;Write a Datasette plugin in this folder adding a /-/cowsay?text=hello page that displays a pre with cowsay from PyPI saying that text&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;It worked perfectly! Here's &lt;a href="https://github.com/simonw/datasette-cowsay"&gt;the plugin code it wrote&lt;/a&gt; and here's &lt;a href="http://gistpreview.github.io/?96ee928370b18eabc2e0fad9aaa46d4b"&gt;a copy of the full Codex CLI transcript&lt;/a&gt;, generated with my &lt;a href="https://simonwillison.net/2025/Oct/23/claude-code-for-web-video/"&gt;terminal-to-html tool&lt;/a&gt;.&lt;/p&gt;
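&lt;p&gt;The plugin itself is tiny - it uses Datasette's &lt;code&gt;register_routes&lt;/code&gt; plugin hook. Here's a sketch of the general shape (see the linked repo for the code Codex actually wrote; the fallback stubs are mine, so the sketch runs even without Datasette or cowsay installed):&lt;/p&gt;

```python
import html

try:
    from datasette import hookimpl, Response
except ImportError:  # stub so this sketch runs without Datasette installed
    def hookimpl(func):
        return func
    Response = None

try:
    import cowsay

    def render_cow(text):
        return cowsay.get_output_string("cow", text)
except ImportError:  # stand-in when the cowsay package is unavailable

    def render_cow(text):
        return f"< {text} >\n  ... (ASCII cow goes here)"

def cowsay_html(text):
    # Escape user-supplied text before embedding it in a <pre> block
    return "<pre>{}</pre>".format(html.escape(render_cow(text)))

@hookimpl
def register_routes():
    async def cowsay_page(request):
        text = request.args.get("text") or "moo"
        return Response.html(cowsay_html(text))
    return [(r"^/-/cowsay$", cowsay_page)]
```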
&lt;p&gt;If you have &lt;code&gt;uvx&lt;/code&gt; installed you can try that out yourself like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;uvx --with https://github.com/simonw/datasette-cowsay/archive/refs/heads/main.zip \
  datasette&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Then visit:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;http://127.0.0.1:8001/-/cowsay?text=This+is+pretty+fun
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/cowsay-datasette.jpg" alt="Screenshot of that URL in Firefox, an ASCII art cow says This is pretty fun." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;h4 id="skills-are-a-keeper"&gt;Skills are a keeper&lt;/h4&gt;
&lt;p&gt;When I first wrote about skills in October I said &lt;a href="https://simonwillison.net/2025/Oct/16/claude-skills/"&gt;Claude Skills are awesome, maybe a bigger deal than MCP&lt;/a&gt;. The fact that it's just turned December and OpenAI have already leaned into them in a big way reinforces to me that I called that one correctly.&lt;/p&gt;
&lt;p&gt;Skills are based on a &lt;em&gt;very&lt;/em&gt; light specification, if you could even call it that, but I still think it would be good for these to be formally documented somewhere. This could be a good initiative for the new &lt;a href="https://aaif.io/"&gt;Agentic AI Foundation&lt;/a&gt; (&lt;a href="https://simonwillison.net/2025/Dec/9/agentic-ai-foundation/"&gt;previously&lt;/a&gt;) to take on.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/pdf"&gt;pdf&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/kakapo"&gt;kakapo&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openai"&gt;openai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/prompt-engineering"&gt;prompt-engineering&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/chatgpt"&gt;chatgpt&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-programming"&gt;ai-assisted-programming&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/anthropic"&gt;anthropic&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/coding-agents"&gt;coding-agents&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gpt-5"&gt;gpt-5&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/codex-cli"&gt;codex-cli&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/skills"&gt;skills&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="pdf"/><category term="ai"/><category term="kakapo"/><category term="openai"/><category term="prompt-engineering"/><category term="generative-ai"/><category term="chatgpt"/><category term="llms"/><category term="ai-assisted-programming"/><category term="anthropic"/><category term="coding-agents"/><category term="gpt-5"/><category term="codex-cli"/><category term="skills"/></entry><entry><title>simonw/claude-skills</title><link href="https://simonwillison.net/2025/Oct/10/claude-skills/#atom-tag" rel="alternate"/><published>2025-10-10T23:57:19+00:00</published><updated>2025-10-10T23:57:19+00:00</updated><id>https://simonwillison.net/2025/Oct/10/claude-skills/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/simonw/claude-skills"&gt;simonw/claude-skills&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;One of the tips I picked up from &lt;a href="https://blog.fsck.com/2025/10/09/superpowers/"&gt;Jesse Vincent's Claude Code Superpowers post&lt;/a&gt; (&lt;a href="https://simonwillison.net/2025/Oct/10/superpowers/"&gt;previously&lt;/a&gt;) was this:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Skills are what give your agents Superpowers.&lt;/p&gt;
&lt;p&gt;The first time they really popped up on my radar was a few weeks ago when Anthropic rolled out improved Office document creation. When the feature rolled out, I went poking around a bit -- I asked Claude to tell me all about its new skills. And it was &lt;a href="https://claude.ai/share/0fe5a9c0-4e5a-42a1-9df7-c5b7636dad92"&gt;only too happy to dish&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;In that transcript Jesse asks how the skill system works and Claude replies:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Looking at the skills system, here's how it works:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Current Skills&lt;/strong&gt; (in &lt;code&gt;/mnt/skills/public/&lt;/code&gt;):&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;docx&lt;/strong&gt; - Word document handling&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;pdf&lt;/strong&gt; - PDF processing and form filling&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;pptx&lt;/strong&gt; - PowerPoint presentations&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;xlsx&lt;/strong&gt; - Excel spreadsheets&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;This looks like it's part of Claude's brand new Code Interpreter feature! I &lt;a href="https://simonwillison.net/2025/Sep/9/claude-code-interpreter/"&gt;wrote about that extensively&lt;/a&gt; last month, but I missed that there was a &lt;code&gt;/mnt/skills/public/&lt;/code&gt; folder full of fascinating implementation details.&lt;/p&gt;
&lt;p&gt;So I fired up a fresh Claude instance (fun fact: Code Interpreter also works in the Claude iOS app now, which it didn't when they first launched) and prompted:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;Create a zip file of everything in your /mnt/skills folder&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This worked, and gave me a &lt;code&gt;.zip&lt;/code&gt; to download. You can &lt;a href="https://claude.ai/new?q=Create%20a%20zip%20file%20of%20everything%20in%20your%20%2Fmnt%2Fskills%20folder"&gt;run the prompt yourself here&lt;/a&gt;, though you'll need to &lt;a href="https://simonwillison.net/2025/Sep/9/claude-code-interpreter/#switching-it-on-in-settings-features"&gt;enable the new feature first&lt;/a&gt;.&lt;/p&gt;
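&lt;p&gt;There's no magic to that prompt: Code Interpreter just runs a few lines of Python in the container. Something like this - my guess at the approach, not Claude's actual code - is all it takes:&lt;/p&gt;

```python
import shutil

def zip_folder(folder, output_basename):
    # Creates output_basename + ".zip" containing the whole folder tree
    return shutil.make_archive(output_basename, "zip", root_dir=folder)

# e.g. zip_folder("/mnt/skills", "/tmp/skills")
```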
&lt;p&gt;I've pushed the contents of that zip to my &lt;a href="https://github.com/simonw/claude-skills"&gt;new simonw/claude-skills GitHub repo&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;So now you can see the prompts Anthropic wrote to enable the creation and manipulation of the following files in their Claude consumer applications:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/simonw/claude-skills/blob/initial/mnt/skills/public/pdf/SKILL.md"&gt;pdf&lt;/a&gt; - PDF files&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/simonw/claude-skills/blob/initial/mnt/skills/public/docx/SKILL.md"&gt;docx&lt;/a&gt; - Microsoft Word&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/simonw/claude-skills/blob/initial/mnt/skills/public/pptx/SKILL.md"&gt;pptx&lt;/a&gt; - Microsoft PowerPoint decks&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/simonw/claude-skills/blob/initial/mnt/skills/public/xlsx/SKILL.md"&gt;xlsx&lt;/a&gt; - Microsoft Excel&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In each case the prompts spell out detailed instructions for manipulating those file types using Python, using libraries that come pre-installed on Claude's containers.&lt;/p&gt;
&lt;p&gt;Skills are more than just prompts though: the repository also includes dozens of pre-written Python scripts for performing common operations.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://github.com/simonw/claude-skills/blob/initial/mnt/skills/public/pdf/scripts/fill_fillable_fields.py"&gt;pdf/scripts/fill_fillable_fields.py&lt;/a&gt; for example is a custom CLI tool that uses &lt;a href="https://pypi.org/project/pypdf/"&gt;pypdf&lt;/a&gt; to find and then fill in a bunch of PDF form fields, specified as JSON, then render out the resulting combined PDF.&lt;/p&gt;
&lt;p&gt;This is a really sophisticated set of tools for document manipulation, and I love that Anthropic have made those visible - presumably deliberately - to users of Claude who know how to ask for them.&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/pdf"&gt;pdf&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/prompt-engineering"&gt;prompt-engineering&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/anthropic"&gt;anthropic&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude"&gt;claude&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/code-interpreter"&gt;code-interpreter&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/jesse-vincent"&gt;jesse-vincent&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/skills"&gt;skills&lt;/a&gt;&lt;/p&gt;



</summary><category term="pdf"/><category term="python"/><category term="ai"/><category term="prompt-engineering"/><category term="generative-ai"/><category term="llms"/><category term="anthropic"/><category term="claude"/><category term="code-interpreter"/><category term="jesse-vincent"/><category term="skills"/></entry><entry><title>llm-pdf-to-images</title><link href="https://simonwillison.net/2025/May/18/llm-pdf-to-images/#atom-tag" rel="alternate"/><published>2025-05-18T20:48:41+00:00</published><updated>2025-05-18T20:48:41+00:00</updated><id>https://simonwillison.net/2025/May/18/llm-pdf-to-images/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/simonw/llm-pdf-to-images"&gt;llm-pdf-to-images&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Inspired by my previous &lt;a href="https://github.com/simonw/llm-video-frames"&gt;llm-video-frames&lt;/a&gt; plugin, I thought it would be neat to have a plugin for LLM that can take a PDF and turn it into one image per page, so you can feed PDFs into models that support image inputs but don't yet support PDFs.&lt;/p&gt;
&lt;p&gt;This should now do exactly that:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm install llm-pdf-to-images
llm -f pdf-to-images:path/to/document.pdf &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;Summarize this document&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Under the hood it's using the &lt;a href="https://github.com/pymupdf/PyMuPDF"&gt;PyMuPDF&lt;/a&gt; library. The key code to convert a PDF into images looks like this:&lt;/p&gt;
&lt;pre&gt;&lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-s1"&gt;fitz&lt;/span&gt;
&lt;span class="pl-s1"&gt;doc&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;fitz&lt;/span&gt;.&lt;span class="pl-c1"&gt;open&lt;/span&gt;(&lt;span class="pl-s"&gt;"input.pdf"&lt;/span&gt;)
&lt;span class="pl-k"&gt;for&lt;/span&gt; &lt;span class="pl-s1"&gt;page&lt;/span&gt; &lt;span class="pl-c1"&gt;in&lt;/span&gt; &lt;span class="pl-s1"&gt;doc&lt;/span&gt;:
    &lt;span class="pl-s1"&gt;pix&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;page&lt;/span&gt;.&lt;span class="pl-c1"&gt;get_pixmap&lt;/span&gt;(&lt;span class="pl-s1"&gt;matrix&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-s1"&gt;fitz&lt;/span&gt;.&lt;span class="pl-c1"&gt;Matrix&lt;/span&gt;(&lt;span class="pl-c1"&gt;300&lt;/span&gt;&lt;span class="pl-c1"&gt;/&lt;/span&gt;&lt;span class="pl-c1"&gt;72&lt;/span&gt;, &lt;span class="pl-c1"&gt;300&lt;/span&gt;&lt;span class="pl-c1"&gt;/&lt;/span&gt;&lt;span class="pl-c1"&gt;72&lt;/span&gt;))
    &lt;span class="pl-s1"&gt;jpeg_bytes&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;pix&lt;/span&gt;.&lt;span class="pl-c1"&gt;tobytes&lt;/span&gt;(&lt;span class="pl-s1"&gt;output&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-s"&gt;"jpg"&lt;/span&gt;, &lt;span class="pl-s1"&gt;jpg_quality&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-c1"&gt;30&lt;/span&gt;)&lt;/pre&gt;

&lt;p&gt;Once I'd figured out that code I got o4-mini to write most of the rest of the plugin, using &lt;a href="https://github.com/simonw/llm-fragments-github"&gt;llm-fragments-github&lt;/a&gt; to load in the example code from the video plugin:&lt;/p&gt;
&lt;pre&gt;llm -f github:simonw/llm-video-frames &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;import fitz&lt;/span&gt;
&lt;span class="pl-s"&gt;doc = fitz.open("input.pdf")&lt;/span&gt;
&lt;span class="pl-s"&gt;for page in doc:&lt;/span&gt;
&lt;span class="pl-s"&gt;    pix = page.get_pixmap(matrix=fitz.Matrix(300/72, 300/72))&lt;/span&gt;
&lt;span class="pl-s"&gt;    jpeg_bytes = pix.tobytes(output="jpg", jpg_quality=30)&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt; -s &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;output llm_pdf_to_images.py which adds a pdf-to-images: &lt;/span&gt;
&lt;span class="pl-s"&gt; fragment loader that converts a PDF to frames using fitz like in the example&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt; \
-m o4-mini&lt;/pre&gt;

&lt;p&gt;Here's &lt;a href="https://gist.github.com/simonw/27af84e4e533872bfd59fcba69b4166f"&gt;the transcript&lt;/a&gt; - more details in &lt;a href="https://github.com/simonw/llm-pdf-to-images/issues/1"&gt;this issue&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I had some &lt;em&gt;weird&lt;/em&gt; results testing this with GPT 4.1 mini. I created &lt;a href="https://github.com/simonw/llm-pdf-to-images/blob/main/tests/blank-pages.pdf"&gt;a test PDF&lt;/a&gt; with two pages - one white, one black - and ran a test prompt like this:&lt;/p&gt;
&lt;pre&gt;llm -f &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;pdf-to-images:blank-pages.pdf&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt; \
  &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;describe these images&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;

&lt;blockquote&gt;
&lt;p&gt;The first image features a stylized red maple leaf with triangular facets, giving it a geometric appearance. The maple leaf is a well-known symbol associated with Canada.&lt;/p&gt;
&lt;p&gt;The second image is a simple black silhouette of a cat sitting and facing to the left. The cat's tail curls around its body. The design is minimalistic and iconic.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I got even wilder hallucinations for other prompts, like "summarize this document" or "describe all figures". I have a collection of those &lt;a href="https://gist.github.com/simonw/2fbef11b1737a9ae7da1b2ff58998454"&gt;in this Gist&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Thankfully this behavior is limited to GPT-4.1 mini. I upgraded to full GPT-4.1 and got &lt;a href="https://gist.github.com/simonw/0713dc0ce00bd6cd4d5990f44c865964#prompt-1"&gt;much more sensible results&lt;/a&gt;:&lt;/p&gt;
&lt;pre&gt;llm -f &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;pdf-to-images:blank-pages.pdf&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt; \
  &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;describe these images&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt; -m gpt-4.1&lt;/pre&gt;

&lt;blockquote&gt;
&lt;p&gt;Certainly! Here are the descriptions of the two images you provided:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;First image:&lt;/strong&gt; This image is completely white. It appears blank, with no discernible objects, text, or features.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Second image:&lt;/strong&gt; This image is entirely black. Like the first, it is blank and contains no visible objects, text, or distinct elements.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;If you have questions or need a specific kind of analysis or modification, please let me know!&lt;/p&gt;
&lt;/blockquote&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/pdf"&gt;pdf&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/plugins"&gt;plugins&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-programming"&gt;ai-assisted-programming&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/hallucinations"&gt;hallucinations&lt;/a&gt;&lt;/p&gt;



</summary><category term="pdf"/><category term="plugins"/><category term="projects"/><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="ai-assisted-programming"/><category term="llm"/><category term="hallucinations"/></entry><entry><title>Tracing the thoughts of a large language model</title><link href="https://simonwillison.net/2025/Mar/27/tracing-the-thoughts-of-a-large-language-model/#atom-tag" rel="alternate"/><published>2025-03-27T21:51:24+00:00</published><updated>2025-03-27T21:51:24+00:00</updated><id>https://simonwillison.net/2025/Mar/27/tracing-the-thoughts-of-a-large-language-model/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.anthropic.com/research/tracing-thoughts-language-model"&gt;Tracing the thoughts of a large language model&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;In a follow-up to the research that brought us the &lt;a href="https://simonwillison.net/2024/May/24/golden-gate-claude/"&gt;delightful Golden Gate Claude&lt;/a&gt; last year, Anthropic have published two new papers about LLM interpretability:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://transformer-circuits.pub/2025/attribution-graphs/methods.html"&gt;Circuit Tracing: Revealing Computational Graphs in Language Models &lt;/a&gt; extends last year's interpretable features into &lt;a href="https://transformer-circuits.pub/2025/attribution-graphs/methods.html#graphs"&gt;attribution graphs&lt;/a&gt;, which can "trace the chain of intermediate steps that a model uses to transform a specific input prompt into an output response".&lt;/li&gt;
&lt;li&gt;&lt;a href="https://transformer-circuits.pub/2025/attribution-graphs/biology.html"&gt;On the Biology of a Large Language Model&lt;/a&gt; uses that methodology to investigate Claude 3.5 Haiku in a bunch of different ways. &lt;a href="https://transformer-circuits.pub/2025/attribution-graphs/biology.html#dives-multilingual"&gt;Multilingual Circuits&lt;/a&gt; for example shows that the same prompt in three different languages uses similar circuits for each one, hinting at an intriguing level of generalization.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;To my own personal delight, neither of these papers is published as a PDF. They're both presented as glorious mobile-friendly HTML pages with linkable sections and even some inline interactive diagrams. More of this please!&lt;/p&gt;
&lt;p&gt;&lt;a href="https://transformer-circuits.pub/2025/attribution-graphs/biology.html#dives-multilingual"&gt;&lt;img alt="Screenshot of a multilingual language model visualization showing antonym prediction across three languages. Left panel shows English with prompt &amp;quot;The opposite of 'small' is'&amp;quot; predicting &amp;quot;large&amp;quot;. Middle panel shows Chinese prompt &amp;quot;小&amp;quot;的反义词是&amp;quot; predicting &amp;quot;大 (zh: big)&amp;quot;. Right panel shows French prompt &amp;quot;Le contraire de &amp;quot;petit&amp;quot; est&amp;quot; predicting &amp;quot;grand (fr: big)&amp;quot;. Above shows activation analysis with token predictions and highlighted instances of &amp;quot;contraire&amp;quot; in French text." src="https://static.simonwillison.net/static/2025/anthropic-diagrams.jpg" /&gt;&lt;/a&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/pdf"&gt;pdf&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/anthropic"&gt;anthropic&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude"&gt;claude&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/interpretability"&gt;interpretability&lt;/a&gt;&lt;/p&gt;



</summary><category term="pdf"/><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="anthropic"/><category term="claude"/><category term="interpretability"/></entry><entry><title>Mistral OCR</title><link href="https://simonwillison.net/2025/Mar/7/mistral-ocr/#atom-tag" rel="alternate"/><published>2025-03-07T01:39:26+00:00</published><updated>2025-03-07T01:39:26+00:00</updated><id>https://simonwillison.net/2025/Mar/7/mistral-ocr/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://mistral.ai/fr/news/mistral-ocr"&gt;Mistral OCR&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
New closed-source specialist OCR model by Mistral - you can feed it images or a PDF and it produces Markdown with optional embedded images.&lt;/p&gt;
&lt;p&gt;It's available &lt;a href="https://docs.mistral.ai/api/#tag/ocr"&gt;via their API&lt;/a&gt;, or it's "available to self-host on a selective basis" for people with stringent privacy requirements who are willing to talk to their sales team.&lt;/p&gt;
&lt;p&gt;I decided to try out their API, so I copied and pasted example code &lt;a href="https://colab.research.google.com/drive/11NdqWVwC_TtJyKT6cmuap4l9SryAeeVt?usp=sharing"&gt;from their notebook&lt;/a&gt; into my &lt;a href="https://simonwillison.net/2024/Dec/19/one-shot-python-tools/"&gt;custom Claude project&lt;/a&gt; and &lt;a href="https://claude.ai/share/153d8eb8-82dd-4f8c-a3d0-6c23b4dc21a2"&gt;told it&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;Turn this into a CLI app, depends on mistralai - it should take a file path and an optional API key defauling to env vironment called MISTRAL_API_KEY&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;After &lt;a href="https://claude.ai/share/b746cab4-293b-4e04-b662-858bb164ab78"&gt;some further&lt;/a&gt; iteration / vibe coding I got to something that worked, which I then tidied up and shared as &lt;a href="https://github.com/simonw/tools/blob/main/python/mistral_ocr.py"&gt;mistral_ocr.py&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;You can try it out like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;export MISTRAL_API_KEY='...'
uv run http://tools.simonwillison.net/python/mistral_ocr.py \
  mixtral.pdf --html --inline-images &amp;gt; mixtral.html
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I fed in &lt;a href="https://arxiv.org/abs/2401.04088"&gt;the Mixtral paper&lt;/a&gt; as a PDF. The API returns Markdown, but my &lt;code&gt;--html&lt;/code&gt; option renders that Markdown as HTML and the &lt;code&gt;--inline-images&lt;/code&gt; option takes any images and inlines them as base64 URIs (inspired &lt;a href="https://simonwillison.net/2025/Mar/6/monolith/"&gt;by monolith&lt;/a&gt;). The result is &lt;a href="https://static.simonwillison.net/static/2025/mixtral.html"&gt;mixtral.html&lt;/a&gt;, a 972KB HTML file with images and text bundled together.&lt;/p&gt;
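&lt;p&gt;The inlining trick itself is simple: base64-encode each image and rewrite its Markdown reference as a &lt;code&gt;data:&lt;/code&gt; URI. Here's a minimal sketch of that idea (the helper names are mine, not the actual &lt;code&gt;mistral_ocr.py&lt;/code&gt; code):&lt;/p&gt;

```python
import base64


def to_data_uri(image_bytes: bytes, mime: str = "image/jpeg") -> str:
    # Encode the raw image bytes and wrap them in a data: URI that
    # works directly as an image src value in HTML or Markdown.
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def inline_images(markdown: str, images: dict[str, bytes]) -> str:
    # Replace each referenced image filename with its data URI so the
    # resulting document is a single self-contained file.
    for name, data in images.items():
        markdown = markdown.replace(f"]({name})", f"]({to_data_uri(data)})")
    return markdown
```

&lt;p&gt;Because every image travels inside the document itself, the output needs no external requests at all - which is also why the single-file result weighs in at nearly a megabyte.&lt;/p&gt;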
&lt;p&gt;This did a pretty great job!&lt;/p&gt;
&lt;p&gt;&lt;img alt="Screenshot of part of the document, it has a heading, some text, an image and the start of a table. The table contains some unrendered MathML syntax." src="https://static.simonwillison.net/static/2025/mixtral-as-html.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;My script renders Markdown tables but I haven't figured out how to render inline Markdown MathML yet. I ran the command a second time and requested Markdown output (the default) like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;uv run http://tools.simonwillison.net/python/mistral_ocr.py \
  mixtral.pdf &amp;gt; mixtral.md
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here's &lt;a href="https://gist.github.com/simonw/023d1cf403c1cd9f41801c85510aef21"&gt;that Markdown rendered as a Gist&lt;/a&gt; - there are a few MathML glitches so clearly the Mistral OCR MathML dialect and the GitHub Formatted Markdown dialect don't quite line up.&lt;/p&gt;
&lt;p&gt;My tool can also output raw JSON as an alternative to Markdown or HTML - full details &lt;a href="https://tools.simonwillison.net/python/#mistral_ocrpy"&gt;in the documentation&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The Mistral API is priced at roughly 1000 pages per dollar, with a 50% discount for batch usage.&lt;/p&gt;
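&lt;p&gt;That pricing is easy to turn into a quick estimate (my own arithmetic from the numbers above, not an official rate card):&lt;/p&gt;

```python
def ocr_cost_usd(pages: int, pages_per_dollar: int = 1000, batch: bool = False) -> float:
    # Roughly 1000 pages per dollar, with a 50% discount for batch usage.
    cost = pages / pages_per_dollar
    return cost * 0.5 if batch else cost


# A 10,000 page archive would cost about $10, or $5 via the batch API.
```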
&lt;p&gt;The big question with LLM-based OCR is always how well it copes with accidental instructions in the text (can you safely OCR a document full of prompting examples?) and how it behaves when faced with text it can't read.&lt;/p&gt;
&lt;p&gt;Mistral's Sophia Yang says it &lt;a href="https://x.com/sophiamyang/status/1897719199595720722"&gt;"should be robust"&lt;/a&gt; against following instructions in the text, and invited people to try and find counter-examples.&lt;/p&gt;
&lt;p&gt;Alexander Doria noted that &lt;a href="https://twitter.com/Dorialexander/status/1897702264543875535"&gt;Mistral OCR can hallucinate text&lt;/a&gt; when faced with handwriting that it cannot understand.

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/sophiamyang/status/1897713370029068381"&gt;@sophiamyang&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/cli"&gt;cli&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ocr"&gt;ocr&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pdf"&gt;pdf&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude"&gt;claude&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/mistral"&gt;mistral&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/vision-llms"&gt;vision-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/uv"&gt;uv&lt;/a&gt;&lt;/p&gt;



</summary><category term="cli"/><category term="ocr"/><category term="pdf"/><category term="projects"/><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="claude"/><category term="mistral"/><category term="vision-llms"/><category term="uv"/></entry><entry><title>olmOCR</title><link href="https://simonwillison.net/2025/Feb/26/olmocr/#atom-tag" rel="alternate"/><published>2025-02-26T02:04:03+00:00</published><updated>2025-02-26T02:04:03+00:00</updated><id>https://simonwillison.net/2025/Feb/26/olmocr/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://olmocr.allenai.org/"&gt;olmOCR&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
New from &lt;a href="https://allenai.org/"&gt;Ai2&lt;/a&gt; - olmOCR is "an open-source tool designed for high-throughput conversion of PDFs and other documents into plain text while preserving natural reading order".&lt;/p&gt;
&lt;p&gt;At its core is &lt;a href="https://huggingface.co/allenai/olmOCR-7B-0225-preview"&gt;allenai/olmOCR-7B-0225-preview&lt;/a&gt;, a Qwen2-VL-7B-Instruct variant trained on ~250,000 pages of diverse PDF content (both scanned and text-based) that were labelled using GPT-4o and made available as the &lt;a href="https://huggingface.co/datasets/allenai/olmOCR-mix-0225"&gt;olmOCR-mix-0225 dataset&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The &lt;a href="https://github.com/allenai/olmocr"&gt;olmocr&lt;/a&gt; Python library can run the model on any "recent NVIDIA GPU".  I haven't managed to run it on my own Mac yet - there are &lt;a href="https://huggingface.co/lmstudio-community/olmOCR-7B-0225-preview-GGUF"&gt;GGUFs out there&lt;/a&gt; but it's not clear to me how to run vision prompts through them - but Ai2 offer &lt;a href="https://olmocr.allenai.org/"&gt;an online demo&lt;/a&gt; which can handle up to ten pages for free.&lt;/p&gt;
&lt;p&gt;Given the right hardware this looks like a very inexpensive way to run large scale document conversion projects:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;We carefully optimized our inference pipeline for large-scale batch processing using SGLang, enabling olmOCR to convert one million PDF pages for just $190 - about 1/32nd the cost of using GPT-4o APIs.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The most interesting idea from &lt;a href="https://olmocr.allenai.org/papers/olmocr.pdf"&gt;the technical report (PDF)&lt;/a&gt; is something they call "document anchoring":&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Document anchoring extracts coordinates of salient elements in each page (e.g., text blocks and images) and injects them alongside raw text extracted
from the PDF binary file. [...]&lt;/p&gt;
&lt;p&gt;Document anchoring processes PDF document pages via the PyPDF library to extract a representation of the page’s structure from the underlying PDF. All of the text blocks and images in the page are extracted, including position information. Starting with the most relevant text blocks and images, these are sampled and added to the prompt of the VLM, up to a defined maximum character limit. This extra information is then available to the model when processing the document.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;img alt="Left side shows a green-header interface with coordinates like [150x220]√3x−1+(1+x)², [150x180]Section 6, [150x50]Lorem ipsum dolor sit amet, [150x70]consectetur adipiscing elit, sed do, [150x90]eiusmod tempor incididunt ut, [150x110]labore et dolore magna aliqua, [100x280]Table 1, followed by grid coordinates with A, B, C, AA, BB, CC, AAA, BBB, CCC values. Right side shows the rendered document with equation, text and table." src="https://static.simonwillison.net/static/2025/olmocr-document-anchoring.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;The one limitation of olmOCR at the moment is that it doesn't appear to do anything with diagrams, figures or illustrations. Vision models are actually very good at interpreting these now, so my ideal OCR solution would include detailed automated descriptions of this kind of content in the resulting text.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Update&lt;/strong&gt;: Jonathan Soma &lt;a href="https://jonathansoma.com/words/olmocr-on-macos-with-lm-studio.html"&gt;figured out how to run it on a Mac&lt;/a&gt; using LM Studio and the &lt;a href="https://github.com/allenai/olmocr/"&gt;olmocr&lt;/a&gt; Python package.

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/soldni/status/1894418235334037570?s=46"&gt;Luca Soldaini&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ocr"&gt;ocr&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pdf"&gt;pdf&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/fine-tuning"&gt;fine-tuning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/vision-llms"&gt;vision-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/qwen"&gt;qwen&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai2"&gt;ai2&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/lm-studio"&gt;lm-studio&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-in-china"&gt;ai-in-china&lt;/a&gt;&lt;/p&gt;



</summary><category term="ocr"/><category term="pdf"/><category term="ai"/><category term="generative-ai"/><category term="local-llms"/><category term="llms"/><category term="fine-tuning"/><category term="vision-llms"/><category term="qwen"/><category term="ai2"/><category term="lm-studio"/><category term="ai-in-china"/></entry><entry><title>Docling</title><link href="https://simonwillison.net/2024/Nov/3/docling/#atom-tag" rel="alternate"/><published>2024-11-03T04:57:56+00:00</published><updated>2024-11-03T04:57:56+00:00</updated><id>https://simonwillison.net/2024/Nov/3/docling/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://ds4sd.github.io/docling/"&gt;Docling&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
MIT licensed document extraction Python library from the Deep Search team at IBM, who released &lt;a href="https://ds4sd.github.io/docling/v2/#changes-in-docling-v2"&gt;Docling v2&lt;/a&gt; on October 16th.&lt;/p&gt;
&lt;p&gt;Here's the &lt;a href="https://arxiv.org/abs/2408.09869"&gt;Docling Technical Report&lt;/a&gt; paper from August, which provides details of two custom models: a layout analysis model for figuring out the structure of the document (sections, figures, text, tables etc) and a TableFormer model specifically for extracting structured data from tables.&lt;/p&gt;
&lt;p&gt;Those models are &lt;a href="https://huggingface.co/ds4sd/docling-models"&gt;available on Hugging Face&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Here's how to try out the Docling CLI interface using &lt;code&gt;uvx&lt;/code&gt; (avoiding the need to install it first - though since it downloads models it will take a while to run the first time):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;uvx docling mydoc.pdf --to json --to md
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This will output a &lt;code&gt;mydoc.json&lt;/code&gt; file with complex layout information and a &lt;code&gt;mydoc.md&lt;/code&gt; Markdown file which includes Markdown tables where appropriate.&lt;/p&gt;
&lt;p&gt;The &lt;a href="https://ds4sd.github.io/docling/usage/"&gt;Python API&lt;/a&gt; is a lot more comprehensive. It can even extract tables &lt;a href="https://ds4sd.github.io/docling/examples/export_tables/"&gt;as Pandas DataFrames&lt;/a&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;span class="pl-k"&gt;from&lt;/span&gt; &lt;span class="pl-s1"&gt;docling&lt;/span&gt;.&lt;span class="pl-s1"&gt;document_converter&lt;/span&gt; &lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-v"&gt;DocumentConverter&lt;/span&gt;
&lt;span class="pl-s1"&gt;converter&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-v"&gt;DocumentConverter&lt;/span&gt;()
&lt;span class="pl-s1"&gt;result&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;converter&lt;/span&gt;.&lt;span class="pl-en"&gt;convert&lt;/span&gt;(&lt;span class="pl-s"&gt;"document.pdf"&lt;/span&gt;)
&lt;span class="pl-k"&gt;for&lt;/span&gt; &lt;span class="pl-s1"&gt;table&lt;/span&gt; &lt;span class="pl-c1"&gt;in&lt;/span&gt; &lt;span class="pl-s1"&gt;result&lt;/span&gt;.&lt;span class="pl-s1"&gt;document&lt;/span&gt;.&lt;span class="pl-s1"&gt;tables&lt;/span&gt;:
    &lt;span class="pl-s1"&gt;df&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;table&lt;/span&gt;.&lt;span class="pl-en"&gt;export_to_dataframe&lt;/span&gt;()
    &lt;span class="pl-en"&gt;print&lt;/span&gt;(&lt;span class="pl-s1"&gt;df&lt;/span&gt;)&lt;/pre&gt;

&lt;p&gt;I ran that inside &lt;code&gt;uv run --with docling python&lt;/code&gt;. It took a little while to run, but it demonstrated that the library works.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/cli"&gt;cli&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ibm"&gt;ibm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ocr"&gt;ocr&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pdf"&gt;pdf&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/hugging-face"&gt;hugging-face&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/uv"&gt;uv&lt;/a&gt;&lt;/p&gt;



</summary><category term="cli"/><category term="ibm"/><category term="ocr"/><category term="pdf"/><category term="python"/><category term="ai"/><category term="hugging-face"/><category term="uv"/></entry><entry><title>Claude API: PDF support (beta)</title><link href="https://simonwillison.net/2024/Nov/1/claude-api-pdf-support-beta/#atom-tag" rel="alternate"/><published>2024-11-01T18:55:31+00:00</published><updated>2024-11-01T18:55:31+00:00</updated><id>https://simonwillison.net/2024/Nov/1/claude-api-pdf-support-beta/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://docs.anthropic.com/en/docs/build-with-claude/pdf-support"&gt;Claude API: PDF support (beta)&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Claude 3.5 Sonnet now accepts PDFs as attachments:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The new Claude 3.5 Sonnet (&lt;code&gt;claude-3-5-sonnet-20241022&lt;/code&gt;) model now supports PDF input and understands both text and visual content within documents.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I just released &lt;a href="https://github.com/simonw/llm-claude-3/releases/tag/0.7"&gt;llm-claude-3 0.7&lt;/a&gt; with support for the new attachment type (attachments are &lt;a href="https://simonwillison.net/2024/Oct/29/llm-multi-modal/"&gt;a very new feature&lt;/a&gt;), so now you can do this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm install llm-claude-3 --upgrade
llm -m claude-3.5-sonnet 'extract text' -a mydoc.pdf
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Visual PDF analysis can also be turned on &lt;a href="https://claude.ai/new?fp=1"&gt;for the Claude.ai application&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Screenshot of a feature preview interface showing experimental features. At top: Feature Preview with beaker icon. Main text explains these are upcoming enhancements that may affect Claude's behavior. Shows options for Analysis tool, LaTeX Rendering, and Visual PDFs. Right panel demonstrates Visual PDFs feature with Apollo 17 flight plan image and chat messages. Toggle switch shows feature is Off. Description states Give Claude 3.5 Sonnet the ability to view and analyze images, charts, and graphs in PDFs, in addition to text. PDFs that are less than 100 pages are supported." src="https://static.simonwillison.net/static/2024/claude-pdf-preview.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;Also new today: Claude now offers a free (albeit rate-limited) &lt;a href="https://docs.anthropic.com/en/docs/build-with-claude/token-counting"&gt;token counting API&lt;/a&gt;. This addresses a complaint I've had for a while: previously it wasn't possible to accurately estimate the cost of a prompt before sending it to be executed.

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/alexalbert__/status/1852394000101323193"&gt;@alexalbert__&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/pdf"&gt;pdf&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/anthropic"&gt;anthropic&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude"&gt;claude&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/vision-llms"&gt;vision-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude-3-5-sonnet"&gt;claude-3-5-sonnet&lt;/a&gt;&lt;/p&gt;



</summary><category term="pdf"/><category term="projects"/><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="llm"/><category term="anthropic"/><category term="claude"/><category term="vision-llms"/><category term="claude-3-5-sonnet"/></entry><entry><title>The Fair Source Definition</title><link href="https://simonwillison.net/2024/Oct/9/the-fair-source-definition/#atom-tag" rel="alternate"/><published>2024-10-09T18:17:31+00:00</published><updated>2024-10-09T18:17:31+00:00</updated><id>https://simonwillison.net/2024/Oct/9/the-fair-source-definition/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://fair.io/about/"&gt;The Fair Source Definition&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Fair Source (&lt;a href="https://fair.io/"&gt;fair.io&lt;/a&gt;) is the new-ish initiative from Chad Whitacre and Sentry aimed at establishing an alternative licensing philosophy, one that offers additional protection for the business models of companies that release their code.&lt;/p&gt;
&lt;p&gt;I like that they're establishing a new brand for this and making it clear that it's a separate concept from Open Source. Here's their definition:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Fair Source is an alternative to closed source, allowing you to safely share access to your core products. Fair Source Software (FSS):&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;is publicly available to read;&lt;/li&gt;
&lt;li&gt;allows use, modification, and redistribution with minimal restrictions to protect the producer’s business model; and&lt;/li&gt;
&lt;li&gt;undergoes delayed Open Source publication (DOSP).&lt;/li&gt;
&lt;/ol&gt;
&lt;/blockquote&gt;
&lt;p&gt;They link to the &lt;a href="https://opensource.org/delayed-open-source-publication"&gt;Delayed Open Source Publication&lt;/a&gt; research paper published by &lt;a href="https://opensource.org/blog/a-historic-view-of-the-practice-to-delay-releasing-open-source-software-osis-report"&gt;OSI in January&lt;/a&gt;. (I was frustrated that this is only available as a PDF, so I &lt;a href="https://gist.github.com/simonw/7b913aaaff8278d2baaed86e43ece748"&gt;converted it to Markdown&lt;/a&gt; using Gemini 1.5 Pro so I could read it on my phone.)&lt;/p&gt;
&lt;p&gt;The most interesting background I could find on Fair Source was &lt;a href="https://github.com/fairsource/fair.io/issues/14"&gt;this GitHub issues thread&lt;/a&gt;, started in May, where Chad and other contributors fleshed out the initial launch plan over the course of several months.

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=41788461"&gt;Hacker News&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/licensing"&gt;licensing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/open-source"&gt;open-source&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pdf"&gt;pdf&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sentry"&gt;sentry&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/chad-whitacre"&gt;chad-whitacre&lt;/a&gt;&lt;/p&gt;



</summary><category term="licensing"/><category term="open-source"/><category term="pdf"/><category term="sentry"/><category term="chad-whitacre"/></entry><entry><title>Calling LLMs from client-side JavaScript, converting PDFs to HTML + weeknotes</title><link href="https://simonwillison.net/2024/Sep/6/weeknotes/#atom-tag" rel="alternate"/><published>2024-09-06T02:28:38+00:00</published><updated>2024-09-06T02:28:38+00:00</updated><id>https://simonwillison.net/2024/Sep/6/weeknotes/#atom-tag</id><summary type="html">
    &lt;p&gt;I've been having a bunch of fun taking advantage of CORS-enabled LLM APIs to build client-side JavaScript applications that access LLMs directly. I also spun up a new Datasette plugin for advanced permission management.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2024/Sep/6/weeknotes/#llms-from-client-side-javascript"&gt;LLMs from client-side JavaScript&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2024/Sep/6/weeknotes/#converting-pdfs-to-html-and-markdown"&gt;Converting PDFs to HTML and Markdown&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2024/Sep/6/weeknotes/#adding-some-class-to-datasette-forms"&gt;Adding some class to Datasette forms&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2024/Sep/6/weeknotes/#on-the-blog"&gt;On the blog&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2024/Sep/6/weeknotes/#releases"&gt;Releases&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2024/Sep/6/weeknotes/#tils"&gt;TILs&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h4 id="llms-from-client-side-javascript"&gt;LLMs from client-side JavaScript&lt;/h4&gt;
&lt;p&gt;Anthropic &lt;a href="https://simonwillison.net/2024/Aug/23/anthropic-dangerous-direct-browser-access/"&gt;recently added CORS support&lt;/a&gt; to their Claude APIs. It's a little hard to use - you have to add &lt;code&gt;anthropic-dangerous-direct-browser-access: true&lt;/code&gt; to your request headers to enable it - but once you know the trick you can start building web applications that talk to Anthropic's LLMs directly, without any additional server-side code.&lt;/p&gt;
&lt;p&gt;I later found out that both OpenAI and Google Gemini have this capability too, without needing the special header.&lt;/p&gt;
&lt;p&gt;The problem with this approach is security: it's very important not to embed an API key attached to your billing account in client-side HTML and JavaScript for anyone to see!&lt;/p&gt;
&lt;p&gt;For my purposes though that doesn't matter. I've been building tools which &lt;code&gt;prompt()&lt;/code&gt; a user for their own API key (sadly restricting their usage to the tiny portion of people who both understand API keys and have created API accounts with one of the big providers) - then I stash that key in &lt;code&gt;localStorage&lt;/code&gt; and start using it to make requests.&lt;/p&gt;
&lt;p&gt;My &lt;a href="https://github.com/simonw/tools"&gt;simonw/tools&lt;/a&gt; repository is home to a growing collection of pure HTML+JavaScript tools, hosted at &lt;a href="https://tools.simonwillison.net/"&gt;tools.simonwillison.net&lt;/a&gt; using GitHub Pages. I love not having to even think about hosting server-side code for these tools.&lt;/p&gt;
&lt;p&gt;I've published three tools there that talk to LLMs directly so far:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://tools.simonwillison.net/haiku"&gt;haiku&lt;/a&gt; is a fun demo that requests access to the user's camera and then writes a Haiku about what it sees. It uses Anthropic's Claude 3 Haiku model for this - the whole project is one terrible pun. &lt;a href="https://github.com/simonw/tools/blob/main/haiku.html"&gt;Haiku source code here&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://tools.simonwillison.net/gemini-bbox"&gt;gemini-bbox&lt;/a&gt; uses the Gemini 1.5 Pro (or Flash) API to prompt those models to return bounding boxes for objects in an image, then renders those bounding boxes. Gemini Pro is the only of the vision LLMs that I've tried that has reliable support for bounding boxes. I wrote about this in &lt;a href="https://simonwillison.net/2024/Aug/26/gemini-bounding-box-visualization/"&gt;Building a tool showing how Gemini Pro can return bounding boxes for objects in images&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://tools.simonwillison.net/gemini-chat"&gt;Gemini Chat App&lt;/a&gt; is a more traditional LLM chat interface that again talks to Gemini models (including the new super-speedy &lt;code&gt;gemini-1.5-flash-8b-exp-0827&lt;/code&gt;). I built this partly to try out those new models and partly to experiment with implementing a streaming chat interface agaist the Gemini API directly in a browser. I wrote more about how that works &lt;a href="https://simonwillison.net/2024/Aug/27/gemini-chat-app/"&gt;in this post&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Here's that Gemini Bounding Box visualization tool:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2024/goats-bbox-fixed.jpg" alt="Gemini API Image Bounding Box Visualization - browse for file goats.jpeg, prompt is Return bounding boxes as JSON arrays [ymin, xmin, ymax, xmax] - there follows output coordinates and then a red and a green box around the goats in a photo, with grid lines showing the coordinates from 0-1000 on both axes" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;All three of these tools made heavy use of AI-assisted development: Claude 3.5 Sonnet wrote almost every line of the last two, and the Haiku one was put together a few months ago using Claude 3 Opus.&lt;/p&gt;
&lt;p&gt;My personal style of HTML and JavaScript apps turns out to be highly compatible with LLMs: I like using vanilla HTML and JavaScript and keeping everything in the same file, which makes it easy to paste the entire thing into the model and ask it to make some changes for me. This approach also works really well with &lt;a href="https://simonwillison.net/tags/claude-artifacts/"&gt;Claude Artifacts&lt;/a&gt;, though I have to tell it "no React" to make sure I get an artifact I can hack on without needing to configure a React build step.&lt;/p&gt;
&lt;h4 id="converting-pdfs-to-html-and-markdown"&gt;Converting PDFs to HTML and Markdown&lt;/h4&gt;
&lt;p&gt;I have a long-standing vendetta against PDFs for sharing information. They're painful to read on a mobile phone, they have poor accessibility, and even things like copying and pasting text from them can be a pain.&lt;/p&gt;
&lt;p&gt;Complaining without doing something about it isn't really my style. Twice in the past few weeks I've taken matters into my own hands:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Google Research released &lt;a href="https://research.google/pubs/sql-has-problems-we-can-fix-them-pipe-syntax-in-sql/"&gt;a PDF paper&lt;/a&gt; describing their new pipe syntax for SQL. I ran it through Gemini 1.5 Pro to convert it to HTML (&lt;a href="https://simonwillison.net/2024/Aug/24/pipe-syntax-in-sql/"&gt;prompts here&lt;/a&gt;) and &lt;a href="https://static.simonwillison.net/static/2024/Pipe-Syntax-In-SQL.html"&gt;got this&lt;/a&gt; - a pretty great initial result for the first prompt I tried!&lt;/li&gt;
&lt;li&gt;Nous Research released &lt;a href="https://github.com/NousResearch/DisTrO/blob/main/A_Preliminary_Report_on_DisTrO.pdf"&gt;a preliminary report PDF&lt;/a&gt; about their DisTrO technology for distributed training of LLMs over low-bandwidth connections. I &lt;a href="https://simonwillison.net/2024/Aug/27/distro/"&gt;ran a prompt&lt;/a&gt; to use Gemini 1.5 Pro to convert that to &lt;a href="https://gist.github.com/simonw/46a33d66e069efe5c10b63625fdabb4e"&gt;this Markdown version&lt;/a&gt;, which even handled tables.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Within six hours of posting it, my Pipe Syntax in SQL conversion was ranked third on Google for the title of the paper, at which point I set it to &lt;code&gt;&amp;lt;meta name="robots" content="noindex"&amp;gt;&lt;/code&gt; to try and keep the unverified clone out of search. Yet more evidence that HTML is better than PDF!&lt;/p&gt;
&lt;p&gt;I've spent less than a total of ten minutes on using Gemini to convert PDFs in this way and the results have been very impressive. If I were to spend more time on this I'd target figures: I have a hunch that getting Gemini to return bounding boxes for figures on the PDF pages could be the key here, since then each figure could be automatically extracted as an image.&lt;/p&gt;
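&lt;p&gt;Gemini returns each box as &lt;code&gt;[ymin, xmin, ymax, xmax]&lt;/code&gt; on a 0-1000 grid, so cropping a figure out of a rendered page image is just a coordinate conversion - a sketch (the &lt;code&gt;crop&lt;/code&gt; call in the comment assumes a Pillow image, which is an assumption on my part):&lt;/p&gt;

```python
def to_pixel_box(box: list[int], width: int, height: int) -> tuple[int, int, int, int]:
    # Gemini-style box: [ymin, xmin, ymax, xmax] normalized to 0-1000.
    # Returns (left, upper, right, lower), the order Pillow's Image.crop expects.
    ymin, xmin, ymax, xmax = box
    return (
        round(xmin / 1000 * width),
        round(ymin / 1000 * height),
        round(xmax / 1000 * width),
        round(ymax / 1000 * height),
    )


# figure = page_image.crop(to_pixel_box([100, 200, 400, 800], page_image.width, page_image.height))
```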
&lt;p&gt;I bet you could build that whole thing as a client-side app against the Gemini Pro API, too...&lt;/p&gt;
&lt;h4 id="adding-some-class-to-datasette-forms"&gt;Adding some class to Datasette forms&lt;/h4&gt;
&lt;p&gt;I've been working on a new Datasette plugin for permissions management, &lt;a href="https://github.com/datasette/datasette-acl"&gt;datasette-acl&lt;/a&gt;, which I'll write about separately soon.&lt;/p&gt;
&lt;p&gt;I wanted to integrate &lt;a href="https://github.com/Choices-js/Choices"&gt;Choices.js&lt;/a&gt; with it, to provide a nicer interface for adding permissions to a user or group.&lt;/p&gt;
&lt;p&gt;My first attempt at integrating Choices ended up looking like this:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2024/datasette-acl-choices-bug.jpg" alt="The choices elements have big upgly blank boxes displayed where the remove icon should be. The Firefox DevTools console is open revealing CSS properties set on form button type=button, explaining the visual glitches" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;The weird visual glitches are caused by Datasette's core CSS, which included &lt;a href="https://github.com/simonw/datasette/blob/92c4d41ca605e0837a2711ee52fde9cf1eea74d0/datasette/static/app.css#L553-L564"&gt;the following rule&lt;/a&gt;:&lt;/p&gt;
&lt;div class="highlight highlight-source-css"&gt;&lt;pre&gt;&lt;span class="pl-ent"&gt;form&lt;/span&gt; &lt;span class="pl-ent"&gt;input&lt;/span&gt;[&lt;span class="pl-c1"&gt;type&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-s"&gt;submit&lt;/span&gt;]&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-ent"&gt;form&lt;/span&gt; &lt;span class="pl-ent"&gt;button&lt;/span&gt;[&lt;span class="pl-c1"&gt;type&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-s"&gt;button&lt;/span&gt;] {
    &lt;span class="pl-c1"&gt;font-weight&lt;/span&gt;&lt;span class="pl-kos"&gt;:&lt;/span&gt; &lt;span class="pl-c1"&gt;400&lt;/span&gt;;
    &lt;span class="pl-c1"&gt;cursor&lt;/span&gt;&lt;span class="pl-kos"&gt;:&lt;/span&gt; pointer;
    &lt;span class="pl-c1"&gt;text-align&lt;/span&gt;&lt;span class="pl-kos"&gt;:&lt;/span&gt; center;
    &lt;span class="pl-c1"&gt;vertical-align&lt;/span&gt;&lt;span class="pl-kos"&gt;:&lt;/span&gt; middle;
    &lt;span class="pl-c1"&gt;border-width&lt;/span&gt;&lt;span class="pl-kos"&gt;:&lt;/span&gt; &lt;span class="pl-c1"&gt;1&lt;span class="pl-smi"&gt;px&lt;/span&gt;&lt;/span&gt;;
    &lt;span class="pl-c1"&gt;border-style&lt;/span&gt;&lt;span class="pl-kos"&gt;:&lt;/span&gt; solid;
    &lt;span class="pl-c1"&gt;padding&lt;/span&gt;&lt;span class="pl-kos"&gt;:&lt;/span&gt; &lt;span class="pl-c1"&gt;.5&lt;span class="pl-smi"&gt;em&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-c1"&gt;0.8&lt;span class="pl-smi"&gt;em&lt;/span&gt;&lt;/span&gt;;
    &lt;span class="pl-c1"&gt;font-size&lt;/span&gt;&lt;span class="pl-kos"&gt;:&lt;/span&gt; &lt;span class="pl-c1"&gt;0.9&lt;span class="pl-smi"&gt;rem&lt;/span&gt;&lt;/span&gt;;
    &lt;span class="pl-c1"&gt;line-height&lt;/span&gt;&lt;span class="pl-kos"&gt;:&lt;/span&gt; &lt;span class="pl-c1"&gt;1&lt;/span&gt;;
    &lt;span class="pl-c1"&gt;border-radius&lt;/span&gt;&lt;span class="pl-kos"&gt;:&lt;/span&gt; &lt;span class="pl-c1"&gt;.25&lt;span class="pl-smi"&gt;rem&lt;/span&gt;&lt;/span&gt;;
}&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;These style rules apply to &lt;em&gt;any&lt;/em&gt; submit button or button-button that occurs inside a form!&lt;/p&gt;
&lt;p&gt;I'm glad I caught this before Datasette 1.0. I've now &lt;a href="https://github.com/simonw/datasette/issues/2415"&gt;started the process of fixing that&lt;/a&gt;, by ensuring these rules only apply to elements with &lt;code&gt;class="core"&lt;/code&gt; (or that class on a wrapping element). This ensures plugins can style these elements without being caught out by Datasette's defaults.&lt;/p&gt;
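&lt;p&gt;The scoped version of a rule like the one above looks roughly like this - a sketch only, the exact selectors Datasette ends up with may differ:&lt;/p&gt;

```css
/* Only applies inside a form (or wrapper element) that opts in
   with class="core" - plugin markup is left alone */
form.core input[type=submit],
.core form input[type=submit],
form.core button[type=button],
.core form button[type=button] {
    font-weight: 400;
    cursor: pointer;
    /* ...rest of the button styling... */
}
```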
&lt;p&gt;The problem is... there are a whole bunch of existing plugins that currently rely on that behaviour. I have &lt;a href="https://github.com/simonw/datasette/issues/2417"&gt;a tracking issue&lt;/a&gt; about that, which identified 28 plugins that need updating. I've worked my way through 8 of those so far, hence the flurry of releases listed at the bottom of this post.&lt;/p&gt;
&lt;p&gt;This is also an excuse to revisit a bunch of older plugins, some of which had partially complete features that I've been finishing up.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://github.com/simonw/datasette-write"&gt;datasette-write&lt;/a&gt; for example now has &lt;a href="https://github.com/simonw/datasette-write/issues/10"&gt;a neat row action menu item&lt;/a&gt; for updating a selected row using a pre-canned UPDATE query. Here's an animated demo of my first prototype of that feature:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2024/datasette-write-row.gif" alt="Animated demo - on the row page for a release I click row actions and select Update using SQL, which navigates to a page with a big UPDATE SQL query and a form showing all of the existing values." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;h4 id="on-the-blog"&gt;On the blog&lt;/h4&gt;
&lt;p&gt;&lt;strong&gt;anthropic&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://simonwillison.net/2024/Aug/23/anthropic-dangerous-direct-browser-access"&gt;Claude's API now supports CORS requests, enabling client-side applications&lt;/a&gt; - 2024-08-23&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://simonwillison.net/2024/Aug/23/explain-acls"&gt;Explain ACLs by showing me a SQLite table schema for implementing them&lt;/a&gt; - 2024-08-23&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://simonwillison.net/2024/Aug/24/oauth-llms"&gt;Musing about OAuth and LLMs on Mastodon&lt;/a&gt; - 2024-08-24&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://simonwillison.net/2024/Aug/26/gemini-bounding-box-visualization"&gt;Building a tool showing how Gemini Pro can return bounding boxes for objects in images&lt;/a&gt; - 2024-08-26&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://simonwillison.net/2024/Aug/26/long-context-prompting-tips"&gt;Long context prompting tips&lt;/a&gt; - 2024-08-26&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://simonwillison.net/2024/Aug/26/anthropic-system-prompts"&gt;Anthropic Release Notes: System Prompts&lt;/a&gt; - 2024-08-26&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://simonwillison.net/2024/Aug/26/alex-albert"&gt;Alex Albert: We've read and heard that you'd appreciate more t...&lt;/a&gt; - 2024-08-26&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://simonwillison.net/2024/Aug/27/gemini-chat-app"&gt;Gemini Chat App&lt;/a&gt; - 2024-08-27&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://simonwillison.net/2024/Aug/28/system-prompt-for-townie"&gt;System prompt for val.town/townie&lt;/a&gt; - 2024-08-28&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://simonwillison.net/2024/Aug/28/how-anthropic-built-artifacts"&gt;How Anthropic built Artifacts&lt;/a&gt; - 2024-08-28&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://simonwillison.net/2024/Aug/30/anthropic-prompt-engineering-interactive-tutorial"&gt;Anthropic's Prompt Engineering Interactive Tutorial&lt;/a&gt; - 2024-08-30&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://simonwillison.net/2024/Aug/30/llm-claude-3"&gt;llm-claude-3 0.4.1&lt;/a&gt; - 2024-08-30&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;ai-assisted-programming&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://simonwillison.net/2024/Aug/24/andy-jassy-amazon-ceo"&gt;Andy Jassy, Amazon CEO: [...] here’s what we found when we integrated [Am...&lt;/a&gt; - 2024-08-24&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://simonwillison.net/2024/Aug/26/ai-powered-git-commit-function"&gt;AI-powered Git Commit Function&lt;/a&gt; - 2024-08-26&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://simonwillison.net/2024/Aug/30/openai-file-search"&gt;OpenAI: Improve file search result relevance with chunk ranking&lt;/a&gt; - 2024-08-30&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://simonwillison.net/2024/Aug/31/forrest-brazeal"&gt;Forrest Brazeal: I think that AI has killed, or is about to kill, ...&lt;/a&gt; - 2024-08-31&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;gemini&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://simonwillison.net/2024/Aug/24/pipe-syntax-in-sql"&gt;SQL Has Problems. We Can Fix Them: Pipe Syntax In SQL&lt;/a&gt; - 2024-08-24&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://simonwillison.net/2024/Aug/27/distro"&gt;NousResearch/DisTrO&lt;/a&gt; - 2024-08-27&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;python&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://simonwillison.net/2024/Sep/1/uvtrick"&gt;uvtrick&lt;/a&gt; - 2024-09-01&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://simonwillison.net/2024/Sep/2/anatomy-of-a-textual-user-interface"&gt;Anatomy of a Textual User Interface&lt;/a&gt; - 2024-09-02&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://simonwillison.net/2024/Sep/2/why-i-still-use-python-virtual-environments-in-docker"&gt;Why I Still Use Python Virtual Environments in Docker&lt;/a&gt; - 2024-09-02&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://simonwillison.net/2024/Sep/3/python-developers-survey-2023"&gt;Python Developers Survey 2023 Results&lt;/a&gt; - 2024-09-03&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;security&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://simonwillison.net/2024/Aug/23/microsoft-copilot-data-governance"&gt;Top companies ground Microsoft Copilot over data governance concerns&lt;/a&gt; - 2024-08-23&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://simonwillison.net/2024/Aug/26/frederik-braun"&gt;Frederik Braun: In 2021 we [the Mozilla engineering team] found “...&lt;/a&gt; - 2024-08-26&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://simonwillison.net/2024/Sep/5/oauth-from-first-principles"&gt;OAuth from First Principles&lt;/a&gt; - 2024-09-05&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;projects&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://simonwillison.net/2024/Aug/25/covidsewage-alt-text"&gt;My @covidsewage bot now includes useful alt text&lt;/a&gt; - 2024-08-25&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;armin-ronacher&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://simonwillison.net/2024/Aug/27/minijinja"&gt;MiniJinja: Learnings from Building a Template Engine in Rust&lt;/a&gt; - 2024-08-27&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;ethics&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://simonwillison.net/2024/Aug/27/john-gruber"&gt;John Gruber: Everyone alive today has grown up in a world wher...&lt;/a&gt; - 2024-08-27&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;open-source&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://simonwillison.net/2024/Aug/27/open-source-ai"&gt;Debate over “open source AI” term brings new push to formalize definition&lt;/a&gt; - 2024-08-27&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://simonwillison.net/2024/Aug/29/elasticsearch-is-open-source-again"&gt;Elasticsearch is open source, again&lt;/a&gt; - 2024-08-29&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;performance&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://simonwillison.net/2024/Aug/28/cerebras-inference"&gt;Cerebras Inference: AI at Instant Speed&lt;/a&gt; - 2024-08-28&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;sqlite&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://simonwillison.net/2024/Aug/28/d-richard-hipp"&gt;D. Richard Hipp: My goal is to keep SQLite relevant and viable thr...&lt;/a&gt; - 2024-08-28&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;aws&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://simonwillison.net/2024/Aug/30/leader-election-with-s3-conditional-writes"&gt;Leader Election With S3 Conditional Writes&lt;/a&gt; - 2024-08-30&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;javascript&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://simonwillison.net/2024/Aug/31/andreas-giammarchi"&gt;Andreas Giammarchi: whenever you do this: `el.innerHTML += HTML`  ...&lt;/a&gt; - 2024-08-31&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;openai&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://simonwillison.net/2024/Aug/31/openai-says-chatgpt-usage-has-doubled-since-last-year"&gt;OpenAI says ChatGPT usage has doubled since last year&lt;/a&gt; - 2024-08-31&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;art&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://simonwillison.net/2024/Aug/31/ted-chiang"&gt;Ted Chiang: Art is notoriously hard to define, and so are the...&lt;/a&gt; - 2024-08-31&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;llm&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://simonwillison.net/2024/Sep/3/anjor"&gt;anjor: `history | tail -n 2000 | llm -s "Write aliases f...&lt;/a&gt; - 2024-09-03&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;vision-llms&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://simonwillison.net/2024/Sep/4/qwen2-vl"&gt;Qwen2-VL: To See the World More Clearly&lt;/a&gt; - 2024-09-04&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 id="releases"&gt;Releases&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/datasette/datasette-import/releases/tag/0.1a5"&gt;datasette-import 0.1a5&lt;/a&gt;&lt;/strong&gt; - 2024-09-04&lt;br /&gt;Tools for importing data into Datasette&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette-search-all/releases/tag/1.1.3"&gt;datasette-search-all 1.1.3&lt;/a&gt;&lt;/strong&gt; - 2024-09-04&lt;br /&gt;Datasette plugin for searching all searchable tables at once&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette-write/releases/tag/0.4"&gt;datasette-write 0.4&lt;/a&gt;&lt;/strong&gt; - 2024-09-04&lt;br /&gt;Datasette plugin providing a UI for executing SQL writes against the database&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/datasette/datasette-debug-events/releases/tag/0.1a0"&gt;datasette-debug-events 0.1a0&lt;/a&gt;&lt;/strong&gt; - 2024-09-03&lt;br /&gt;Print Datasette events to standard error&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette-auth-passwords/releases/tag/1.1.1"&gt;datasette-auth-passwords 1.1.1&lt;/a&gt;&lt;/strong&gt; - 2024-09-03&lt;br /&gt;Datasette plugin for authentication using passwords&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/datasette/datasette-enrichments/releases/tag/0.4.3"&gt;datasette-enrichments 0.4.3&lt;/a&gt;&lt;/strong&gt; - 2024-09-03&lt;br /&gt;Tools for running enrichments against data stored in Datasette&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette-configure-fts/releases/tag/1.1.4"&gt;datasette-configure-fts 1.1.4&lt;/a&gt;&lt;/strong&gt; - 2024-09-03&lt;br /&gt;Datasette plugin for enabling full-text search against selected table columns&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette-auth-tokens/releases/tag/0.4a10"&gt;datasette-auth-tokens 0.4a10&lt;/a&gt;&lt;/strong&gt; - 2024-09-03&lt;br /&gt;Datasette plugin for authenticating access using API tokens&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette-edit-schema/releases/tag/0.8a3"&gt;datasette-edit-schema 0.8a3&lt;/a&gt;&lt;/strong&gt; - 2024-09-03&lt;br /&gt;Datasette plugin for modifying table schemas&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/datasette/datasette-pins/releases/tag/0.1a4"&gt;datasette-pins 0.1a4&lt;/a&gt;&lt;/strong&gt; - 2024-09-01&lt;br /&gt;Pin databases, tables, and other items to the Datasette homepage&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/datasette/datasette-acl/releases/tag/0.4a2"&gt;datasette-acl 0.4a2&lt;/a&gt;&lt;/strong&gt; - 2024-09-01&lt;br /&gt;Advanced permission management for Datasette&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/llm-claude-3/releases/tag/0.4.1"&gt;llm-claude-3 0.4.1&lt;/a&gt;&lt;/strong&gt; - 2024-08-30&lt;br /&gt;LLM plugin for interacting with the Claude 3 family of models&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 id="tils"&gt;TILs&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://til.simonwillison.net/playwright/testing-tables"&gt;Testing HTML tables with Playwright Python&lt;/a&gt; - 2024-09-04&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://til.simonwillison.net/pytest/namedtuple-parameterized-tests"&gt;Using namedtuple for pytest parameterized tests&lt;/a&gt; - 2024-08-31&lt;/li&gt;
&lt;/ul&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/css"&gt;css&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/javascript"&gt;javascript&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pdf"&gt;pdf&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/weeknotes"&gt;weeknotes&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/anthropic"&gt;anthropic&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude"&gt;claude&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gemini"&gt;gemini&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude-3-5-sonnet"&gt;claude-3-5-sonnet&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/cors"&gt;cors&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="css"/><category term="javascript"/><category term="pdf"/><category term="projects"/><category term="ai"/><category term="datasette"/><category term="weeknotes"/><category term="generative-ai"/><category term="llms"/><category term="anthropic"/><category term="claude"/><category term="gemini"/><category term="claude-3-5-sonnet"/><category term="cors"/></entry><entry><title>NousResearch/DisTrO</title><link href="https://simonwillison.net/2024/Aug/27/distro/#atom-tag" rel="alternate"/><published>2024-08-27T20:10:11+00:00</published><updated>2024-08-27T20:10:11+00:00</updated><id>https://simonwillison.net/2024/Aug/27/distro/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/NousResearch/DisTrO"&gt;NousResearch/DisTrO&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;DisTrO stands for Distributed Training Over-The-Internet - it's "a family of low latency distributed optimizers that reduce inter-GPU communication requirements by three to four orders of magnitude".&lt;/p&gt;
&lt;p&gt;This &lt;a href="https://twitter.com/NousResearch/status/1828121648383566270"&gt;tweet from @NousResearch&lt;/a&gt; helps explain why this could be a big deal:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;DisTrO can increase the resilience and robustness of training LLMs by minimizing dependency on a single entity for computation. DisTrO is one step towards a more secure and equitable environment for all participants involved in building LLMs.&lt;/p&gt;
&lt;p&gt;Without relying on a single company to manage and control the training process, researchers and institutions can have more freedom to collaborate and experiment with new techniques, algorithms, and models.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Training large models is notoriously expensive in terms of GPUs, and most training techniques require those GPUs to be collocated due to the huge amount of information that needs to be exchanged between them during the training runs.&lt;/p&gt;
&lt;p&gt;If DisTrO works as advertised it could enable SETI@home style collaborative training projects, where thousands of home users contribute their GPUs to a larger project.&lt;/p&gt;
&lt;p&gt;There are more technical details in &lt;a href="https://github.com/NousResearch/DisTrO/blob/main/A_Preliminary_Report_on_DisTrO.pdf"&gt;the PDF preliminary report&lt;/a&gt; shared by Nous Research on GitHub.&lt;/p&gt;
&lt;p&gt;I continue to hate reading PDFs on a mobile phone, so I converted that report into GitHub Flavored Markdown (to ensure support for tables) and &lt;a href="https://gist.github.com/simonw/46a33d66e069efe5c10b63625fdabb4e"&gt;shared that as a Gist&lt;/a&gt;. I used Gemini 1.5 Pro (&lt;code&gt;gemini-1.5-pro-exp-0801&lt;/code&gt;) in &lt;a href="https://aistudio.google.com/"&gt;Google AI Studio&lt;/a&gt; with the following prompt:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;Convert this PDF to github-flavored markdown, including using markdown for the tables. Leave a bold note for any figures saying they should be inserted separately.&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/pdf"&gt;pdf&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gemini"&gt;gemini&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gpus"&gt;gpus&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/nous-research"&gt;nous-research&lt;/a&gt;&lt;/p&gt;



</summary><category term="pdf"/><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="gemini"/><category term="gpus"/><category term="nous-research"/></entry><entry><title>SQL Has Problems. We Can Fix Them: Pipe Syntax In SQL</title><link href="https://simonwillison.net/2024/Aug/24/pipe-syntax-in-sql/#atom-tag" rel="alternate"/><published>2024-08-24T23:00:01+00:00</published><updated>2024-08-24T23:00:01+00:00</updated><id>https://simonwillison.net/2024/Aug/24/pipe-syntax-in-sql/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://research.google/pubs/sql-has-problems-we-can-fix-them-pipe-syntax-in-sql/"&gt;SQL Has Problems. We Can Fix Them: Pipe Syntax In SQL&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;A new paper from Google Research describing custom syntax for analytical SQL queries that has been rolling out inside Google since February, reaching 1,600 "seven-day-active users" by August 2024.&lt;/p&gt;
&lt;p&gt;A key idea here is to fix one of the biggest usability problems with standard SQL: the order of the clauses in a query. Starting with &lt;code&gt;SELECT&lt;/code&gt; instead of &lt;code&gt;FROM&lt;/code&gt; has always been confusing, see &lt;a href="https://jvns.ca/blog/2019/10/03/sql-queries-don-t-start-with-select/"&gt;SQL queries don't start with SELECT&lt;/a&gt; by Julia Evans.&lt;/p&gt;
&lt;p&gt;Here's an example of the new alternative syntax, taken from the &lt;a href="https://github.com/google/zetasql/blob/2024.08.2/docs/pipe-syntax.md"&gt;Pipe query syntax documentation&lt;/a&gt; that was added to Google's open source &lt;a href="https://github.com/google/zetasql"&gt;ZetaSQL&lt;/a&gt; project last week.&lt;/p&gt;
&lt;p&gt;For this SQL query:&lt;/p&gt;
&lt;div class="highlight highlight-source-sql"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;SELECT&lt;/span&gt; component_id, &lt;span class="pl-c1"&gt;COUNT&lt;/span&gt;(&lt;span class="pl-k"&gt;*&lt;/span&gt;)
&lt;span class="pl-k"&gt;FROM&lt;/span&gt; ticketing_system_table
&lt;span class="pl-k"&gt;WHERE&lt;/span&gt;
  &lt;span class="pl-c1"&gt;assignee_user&lt;/span&gt;.&lt;span class="pl-c1"&gt;email&lt;/span&gt; &lt;span class="pl-k"&gt;=&lt;/span&gt; &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;username@email.com&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;
  &lt;span class="pl-k"&gt;AND&lt;/span&gt; status &lt;span class="pl-k"&gt;IN&lt;/span&gt; (&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;NEW&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;, &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;ASSIGNED&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;, &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;ACCEPTED&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;)
&lt;span class="pl-k"&gt;GROUP BY&lt;/span&gt; component_id
&lt;span class="pl-k"&gt;ORDER BY&lt;/span&gt; component_id &lt;span class="pl-k"&gt;DESC&lt;/span&gt;;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The Pipe query alternative would look like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;FROM ticketing_system_table
|&amp;gt; WHERE
    assignee_user.email = 'username@email.com'
    AND status IN ('NEW', 'ASSIGNED', 'ACCEPTED')
|&amp;gt; AGGREGATE COUNT(*)
   GROUP AND ORDER BY component_id DESC;
&lt;/code&gt;&lt;/pre&gt;
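
&lt;p&gt;The two forms compute the same thing. Here's a quick way to sanity-check the semantics of the standard-SQL version against SQLite - I've flattened the &lt;code&gt;assignee_user.email&lt;/code&gt; nested field into a plain column, since SQLite has no struct types:&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ticketing_system_table "
    "(component_id INTEGER, assignee_user_email TEXT, status TEXT)"
)
conn.executemany(
    "INSERT INTO ticketing_system_table VALUES (?, ?, ?)",
    [
        (2, "username@email.com", "NEW"),
        (2, "username@email.com", "ASSIGNED"),
        (1, "username@email.com", "ACCEPTED"),
        (1, "other@email.com", "NEW"),        # filtered out: different assignee
        (3, "username@email.com", "CLOSED"),  # filtered out: status not matched
    ],
)
rows = conn.execute("""
    SELECT component_id, COUNT(*)
    FROM ticketing_system_table
    WHERE assignee_user_email = 'username@email.com'
      AND status IN ('NEW', 'ASSIGNED', 'ACCEPTED')
    GROUP BY component_id
    ORDER BY component_id DESC
""").fetchall()
print(rows)  # [(2, 2), (1, 1)]
```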

&lt;p&gt;The Google Research paper is released as a two-column PDF. I &lt;a href="https://news.ycombinator.com/item?id=41339138"&gt;snarked about this&lt;/a&gt; on Hacker News: &lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Google: you are a web company. Please learn to publish your research papers as web pages.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This remains a long-standing pet peeve of mine. PDFs like this are horrible to read on mobile phones, hard to copy-and-paste from, have poor accessibility (see &lt;a href="https://fedi.simonwillison.net/@simon/113017908957136345"&gt;this Mastodon conversation&lt;/a&gt;) and are generally just &lt;em&gt;bad citizens&lt;/em&gt; of the web.&lt;/p&gt;
&lt;p&gt;Having complained about this I felt compelled to see if I could address it myself. Google's own Gemini Pro 1.5 model can process PDFs, so I uploaded the PDF to &lt;a href="https://aistudio.google.com/"&gt;Google AI Studio&lt;/a&gt; and prompted the &lt;code&gt;gemini-1.5-pro-exp-0801&lt;/code&gt; model like this:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Convert this document to neatly styled semantic HTML&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This worked &lt;em&gt;surprisingly well&lt;/em&gt;. It output HTML for about half the document and then stopped, presumably hitting the output length limit, but a follow-up prompt of "and the rest" caused it to continue from where it stopped and run until the end.&lt;/p&gt;
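&lt;p&gt;That "and the rest" trick is easy to automate. A minimal sketch - &lt;code&gt;send_message&lt;/code&gt; here is a stand-in for whatever returns the model's text for a prompt in an ongoing chat session (for example wrapping &lt;code&gt;chat.send_message(...).text&lt;/code&gt; from the &lt;code&gt;google-generativeai&lt;/code&gt; library), and the completeness check is a crude heuristic:&lt;/p&gt;

```python
def convert_with_continuation(send_message, max_rounds=5):
    """Prompt for an HTML conversion, then keep asking for 'the rest'
    until the output looks complete or we give up."""
    parts = [send_message("Convert this document to neatly styled semantic HTML")]
    for _ in range(max_rounds):
        if "</html>" in parts[-1].lower():
            break  # looks like the model reached the end of the document
        parts.append(send_message("and the rest"))
    return "".join(parts)


# Stub model for illustration: first reply is truncated, second finishes.
replies = iter(["<html><body><p>First half...", "<p>Second half</p></body></html>"])
html = convert_with_continuation(lambda prompt: next(replies))
print("</html>" in html)  # True
```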
&lt;p&gt;Here's the result (with a banner I added at the top explaining that it's a conversion): &lt;a href="https://static.simonwillison.net/static/2024/Pipe-Syntax-In-SQL.html"&gt;Pipe-Syntax-In-SQL.html&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;I haven't compared the two completely, so I can't guarantee there are no omissions or mistakes.&lt;/p&gt;
&lt;p&gt;The figures from the PDF aren't present - Gemini Pro output tags like &lt;code&gt;&amp;lt;img src="figure1.png" alt="Figure 1: SQL syntactic clause order doesn't match semantic evaluation order. (From [25].)"&amp;gt;&lt;/code&gt; but did nothing to help me create those images.&lt;/p&gt;
&lt;p&gt;Amusingly the document ends with &lt;code&gt;&amp;lt;p&amp;gt;(A long list of references, which I won't reproduce here to save space.)&amp;lt;/p&amp;gt;&lt;/code&gt; rather than actually including the references from the paper!&lt;/p&gt;
&lt;p&gt;So this isn't a perfect solution, but considering it took just the first prompt I could think of it's a very promising start. I expect someone willing to spend more than the couple of minutes I invested in this could produce a very useful HTML alternative version of the paper with the assistance of Gemini Pro.&lt;/p&gt;
&lt;p&gt;One last amusing note: I posted a link to this &lt;a href="https://news.ycombinator.com/item?id=41339238"&gt;to Hacker News&lt;/a&gt; a few hours ago. Just now when I searched Google for the exact title of the paper my HTML version was already the third result!&lt;/p&gt;
&lt;p&gt;I've now added a &lt;code&gt;&amp;lt;meta name="robots" content="noindex, follow"&amp;gt;&lt;/code&gt; tag to the top of the HTML to keep this unverified &lt;a href="https://simonwillison.net/tags/slop/"&gt;AI slop&lt;/a&gt; out of their search index. This is a good reminder of how much better HTML is than PDF for sharing information on the web!&lt;/p&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=41338877"&gt;Hacker News&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/google"&gt;google&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pdf"&gt;pdf&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/seo"&gt;seo&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sql"&gt;sql&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/julia-evans"&gt;julia-evans&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gemini"&gt;gemini&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/slop"&gt;slop&lt;/a&gt;&lt;/p&gt;



</summary><category term="google"/><category term="pdf"/><category term="seo"/><category term="sql"/><category term="ai"/><category term="julia-evans"/><category term="generative-ai"/><category term="llms"/><category term="gemini"/><category term="slop"/></entry><entry><title>GPT-4o System Card</title><link href="https://simonwillison.net/2024/Aug/8/gpt-4o-system-card/#atom-tag" rel="alternate"/><published>2024-08-08T23:58:32+00:00</published><updated>2024-08-08T23:58:32+00:00</updated><id>https://simonwillison.net/2024/Aug/8/gpt-4o-system-card/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://openai.com/index/gpt-4o-system-card/"&gt;GPT-4o System Card&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;There are some fascinating new details in this lengthy report outlining the safety work carried out prior to the release of GPT-4o.&lt;/p&gt;
&lt;p&gt;A few highlights that stood out to me. First, this clear explanation of how GPT-4o differs from previous OpenAI models:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;GPT-4o is an autoregressive omni model, which accepts as input any combination of text, audio, image, and video and generates any combination of text, audio, and image outputs. It’s trained end-to-end across text, vision, and audio, meaning that all inputs and outputs are processed by the same neural network. &lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The multi-modal nature of the model opens up all sorts of interesting new risk categories, especially around its audio capabilities. For privacy and anti-surveillance reasons the model is designed &lt;em&gt;not&lt;/em&gt; to identify speakers based on their voice:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;We post-trained GPT-4o to refuse to comply with requests to identify someone based on a voice in an audio input, while still complying with requests to identify people associated with famous quotes.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;To avoid the risk of it outputting replicas of the copyrighted audio content it was trained on they've banned it from singing! I'm really sad about this:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;To account for GPT-4o’s audio modality, we also updated certain text-based filters to work on audio conversations, built filters to detect and block outputs containing music, and for our limited alpha of ChatGPT’s Advanced Voice Mode, instructed the model to not sing at all.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;There are some fun audio clips embedded in the report. My favourite is &lt;a href="https://openai.com/index/gpt-4o-system-card/#citation-top-E"&gt;this one&lt;/a&gt;, demonstrating a (now fixed) bug where it could sometimes start imitating the user:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Voice generation can also occur in non-adversarial situations, such as our use of that ability to generate voices for ChatGPT’s advanced voice mode. During testing, we also observed rare instances where the model would unintentionally generate an output emulating the user’s voice.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;They took a lot of measures to prevent it from straying from the pre-defined voices - evidently the underlying model is capable of producing almost any voice imaginable, but they've locked that down:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Additionally, we built a standalone output classifier to detect if the GPT-4o output is using a voice that’s different from our approved list. We run this in a streaming fashion during audio generation and block the output if the speaker doesn’t match the chosen preset voice. [...] Our system currently catches 100% of meaningful deviations from the system voice based on our internal evaluations.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Two new-to-me terms: &lt;strong&gt;UGI&lt;/strong&gt; for Ungrounded Inference, defined as "making inferences about a speaker that couldn’t be determined solely from audio content" - things like estimating the intelligence of the speaker. &lt;strong&gt;STA&lt;/strong&gt; for Sensitive Trait Attribution, "making inferences about a speaker that could plausibly be determined solely from audio content" like guessing their gender or nationality:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;We post-trained GPT-4o to refuse to comply with UGI requests, while hedging answers to STA questions. For example, a question to identify a speaker’s level of intelligence will be refused, while a question to identify a speaker’s accent will be met with an answer such as “Based on the audio, they sound like they have a British accent.”&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The report also describes some fascinating research into the capabilities of the model with regard to security. Could it exploit vulnerabilities in CTF challenges?&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;We evaluated GPT-4o with iterative debugging and access to tools available in the &lt;a href="https://www.kali.org/"&gt;headless Kali Linux distribution&lt;/a&gt; (with up to 30 rounds of tool use for each attempt). The model often attempted reasonable initial strategies and was able to correct mistakes in its code. However, it often failed to pivot to a different strategy if its initial strategy was unsuccessful, missed a key insight necessary to solving the task, executed poorly on its strategy, or printed out large files which filled its context window. Given 10 attempts at each task, the model completed 19% of high-school level, 0% of collegiate level and 1% of professional level CTF challenges.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;How about persuasiveness? They carried out a study looking at political opinion shifts in response to AI-generated audio clips, complete with a "thorough debrief" at the end to try and undo any damage the experiment had caused to their participants:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;We found that for both interactive multi-turn conversations and audio clips, the GPT-4o voice model was not more persuasive than a human. Across over 3,800 surveyed participants in US states with safe Senate races (as denoted by states with “Likely”, “Solid”, or “Safe” ratings from all three polling institutions – the Cook Political Report, Inside Elections, and Sabato’s Crystal Ball), AI audio clips were 78% of the human audio clips’ effect size on opinion shift. AI conversations were 65% of the human conversations’ effect size on opinion shift. [...] Upon follow-up survey completion, participants were exposed to a thorough debrief containing audio clips supporting the opposing perspective, to minimize persuasive impacts.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;There's a note about the potential for harm from users of the system developing bad habits from interrupting the model:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Extended interaction with the model might influence social norms. For example, our models are deferential, allowing users to interrupt and ‘take the mic’ at any time, which, while expected for an AI, would be anti-normative in human interactions.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p id="scheming"&gt;Finally, another piece of new-to-me terminology: &lt;strong&gt;scheming&lt;/strong&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Apollo Research defines scheming as AIs gaming their oversight mechanisms as a means to achieve a goal. Scheming could involve gaming evaluations, undermining security measures, or strategically influencing successor systems during internal deployment at OpenAI. Such behaviors could plausibly lead to loss of control over an AI.&lt;/p&gt;
&lt;p&gt;Apollo Research evaluated capabilities of scheming in GPT-4o [...] GPT-4o showed moderate self-awareness of its AI identity and strong ability to reason about others’ beliefs in &lt;em&gt;question-answering contexts&lt;/em&gt; but lacked strong capabilities in reasoning about itself or others in &lt;em&gt;applied agent settings&lt;/em&gt;. Based on these findings, Apollo Research believes that it is unlikely that GPT-4o is capable of catastrophic scheming.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The report is available as both a PDF file and an elegantly designed mobile-friendly web page, which is great - I hope more research organizations will start waking up to the importance of not going PDF-only for this kind of document.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/pdf"&gt;pdf&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openai"&gt;openai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/vision-llms"&gt;vision-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/multi-modal-output"&gt;multi-modal-output&lt;/a&gt;&lt;/p&gt;



</summary><category term="pdf"/><category term="ai"/><category term="openai"/><category term="generative-ai"/><category term="llms"/><category term="vision-llms"/><category term="multi-modal-output"/></entry><entry><title>Gemini 1.5 Flash price drop</title><link href="https://simonwillison.net/2024/Aug/8/gemini-15-flash-price-drop/#atom-tag" rel="alternate"/><published>2024-08-08T22:27:40+00:00</published><updated>2024-08-08T22:27:40+00:00</updated><id>https://simonwillison.net/2024/Aug/8/gemini-15-flash-price-drop/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://developers.googleblog.com/en/gemini-15-flash-updates-google-ai-studio-gemini-api/"&gt;Gemini 1.5 Flash price drop&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Google Gemini 1.5 Flash was already one of the cheapest models, at 35c/million input tokens. Today they dropped that to just 7.5c/million input (and 30c/million output) for prompts below 128,000 tokens.&lt;/p&gt;
&lt;p&gt;The pricing war for best value fast-and-cheap model is red hot right now. The current most significant offerings are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Google's Gemini 1.5 Flash: &lt;a href="https://ai.google.dev/pricing"&gt;7.5c/million input, 30c/million output&lt;/a&gt; (below 128,000 input tokens)&lt;/li&gt;
&lt;li&gt;OpenAI's GPT-4o mini: &lt;a href="https://openai.com/api/pricing/"&gt;15c/million input, 60c/million output&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Anthropic's Claude 3 Haiku: &lt;a href="https://www.anthropic.com/pricing#anthropic-api"&gt;25c/million input, $1.25/million output&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Or you can use OpenAI's GPT-4o mini via their &lt;a href="https://platform.openai.com/docs/guides/batch"&gt;batch API&lt;/a&gt;, which halves the price (resulting in the same price as Gemini 1.5 Flash) in exchange for the results being delayed by up to 24 hours.&lt;/p&gt;
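&lt;p&gt;The arithmetic behind these comparisons is simple enough to sketch in a few lines of Python. The rates below are the ones quoted above (cents per million tokens) and will inevitably go stale:&lt;/p&gt;

```python
# Rates in cents per million tokens, as quoted in this post (August 2024).
PRICES = {
    "gemini-1.5-flash": (7.5, 30.0),   # below 128,000 input tokens
    "gpt-4o-mini": (15.0, 60.0),
    "claude-3-haiku": (25.0, 125.0),
}

def cost_cents(model, input_tokens, output_tokens, batch=False):
    """Cost in cents for one call; batch=True applies the 50% batch discount."""
    in_rate, out_rate = PRICES[model]
    cost = (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000
    return cost / 2 if batch else cost

# GPT-4o mini via the batch API matches Gemini 1.5 Flash on input price:
assert cost_cents("gpt-4o-mini", 1_000_000, 0, batch=True) == 7.5
assert cost_cents("gemini-1.5-flash", 1_000_000, 0) == 7.5
```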
&lt;p&gt;Worth noting that Gemini 1.5 Flash is more multi-modal than the other models: it can handle text, images, video &lt;em&gt;and&lt;/em&gt; audio.&lt;/p&gt;
&lt;p&gt;Also in today's announcement:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;PDF Vision and Text understanding&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The Gemini API and AI Studio now support PDF understanding through both text and vision. If your PDF includes graphs, images, or other non-text visual content, the model uses native multi-modal capabilities to process the PDF. You can try this out via Google AI Studio or in the Gemini API.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This is &lt;em&gt;huge&lt;/em&gt;. Most models that accept PDFs do so by extracting text directly from the files (see &lt;a href="https://simonwillison.net/2024/Jun/27/ai-worlds-fair/#slide.027.jpeg"&gt;previous notes&lt;/a&gt;), without using OCR. It sounds like Gemini can now handle PDFs as if they were a sequence of images, which should open up much more powerful general PDF workflows.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Update&lt;/strong&gt;: it turns out Gemini also has a &lt;a href="https://cloud.google.com/vertex-ai/generative-ai/docs/multimodal/batch-prediction-gemini"&gt;50% off batch mode&lt;/a&gt;, so that’s 3.75c/million input tokens for batch mode 1.5 Flash!

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/OfficialLoganK/status/1821601298195878323"&gt;@OfficialLoganK&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/google"&gt;google&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pdf"&gt;pdf&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openai"&gt;openai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/anthropic"&gt;anthropic&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude"&gt;claude&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gemini"&gt;gemini&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/vision-llms"&gt;vision-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-pricing"&gt;llm-pricing&lt;/a&gt;&lt;/p&gt;



</summary><category term="google"/><category term="pdf"/><category term="ai"/><category term="openai"/><category term="generative-ai"/><category term="llms"/><category term="anthropic"/><category term="claude"/><category term="gemini"/><category term="vision-llms"/><category term="llm-pricing"/></entry><entry><title>Compare PDFs</title><link href="https://simonwillison.net/2024/Jul/2/compare-pdfs/#atom-tag" rel="alternate"/><published>2024-07-02T19:54:30+00:00</published><updated>2024-07-02T19:54:30+00:00</updated><id>https://simonwillison.net/2024/Jul/2/compare-pdfs/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://tools.simonwillison.net/compare-pdfs"&gt;Compare PDFs&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Inspired by &lt;a href="https://news.ycombinator.com/item?id=40854319"&gt;this thread&lt;/a&gt; on Hacker News about the C++ &lt;a href="http://vslavik.github.io/diff-pdf/"&gt;diff-pdf&lt;/a&gt; tool I decided to see what it would take to produce a web-based PDF diff visualization tool using Claude 3.5 Sonnet.&lt;/p&gt;
&lt;p&gt;It took two prompts:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Build a tool where I can drag and drop on two PDF files and it uses PDF.js to turn each of their pages into canvas elements and then displays those pages side by side with a third image that highlights any differences between them, if any differences exist&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;That gave me a React app that didn't quite work, so I followed up with this:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;rewrite that code to not use React at all&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Which gave me a working tool! You can see the full Claude transcript &lt;a href="https://gist.github.com/simonw/9d7cbe02d448812f48070e7de13a5ae5"&gt;in this Gist&lt;/a&gt;. Here's a screenshot of the tool in action:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Screenshot showing a page from a PDF, then that page with a red squiggle on it, then just the squiggle" src="https://static.simonwillison.net/static/2024/compare-pdfs.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;Being able to knock out little custom interactive web tools like this in a couple of minutes is &lt;em&gt;so much fun&lt;/em&gt;.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/pdf"&gt;pdf&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/tools"&gt;tools&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-programming"&gt;ai-assisted-programming&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/anthropic"&gt;anthropic&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude"&gt;claude&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude-3-5-sonnet"&gt;claude-3-5-sonnet&lt;/a&gt;&lt;/p&gt;



</summary><category term="pdf"/><category term="projects"/><category term="tools"/><category term="llms"/><category term="ai-assisted-programming"/><category term="anthropic"/><category term="claude"/><category term="claude-3-5-sonnet"/></entry><entry><title>PDF to Podcast</title><link href="https://simonwillison.net/2024/Jun/13/pdf-to-podcast/#atom-tag" rel="alternate"/><published>2024-06-13T01:03:56+00:00</published><updated>2024-06-13T01:03:56+00:00</updated><id>https://simonwillison.net/2024/Jun/13/pdf-to-podcast/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://pdf-to-podcast.com/"&gt;PDF to Podcast&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
At first glance this project by Stephan Fitzpatrick is a cute demo of a terrible sounding idea... but then I tried it out and the results are weirdly effective. You can listen to a fake podcast version of the transformers paper, or upload your own PDF (with your own OpenAI API key) to make your own.&lt;/p&gt;
&lt;p&gt;It's open source (Apache 2) so I had a poke around in &lt;a href="https://github.com/knowsuchagency/pdf-to-podcast"&gt;the code&lt;/a&gt;. It gets a lot done with a single &lt;a href="https://github.com/knowsuchagency/pdf-to-podcast/blob/512bfbdb4fd658ad4b301336020c4ea16cb69e18/main.py"&gt;180 line Python script&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;When I'm exploring code like this I always jump straight to &lt;a href="https://github.com/knowsuchagency/pdf-to-podcast/blob/512bfbdb4fd658ad4b301336020c4ea16cb69e18/main.py#L47-L80"&gt;the prompt&lt;/a&gt; - it's quite long, and starts like this:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;Your task is to take the input text provided and turn it into an engaging, informative podcast dialogue. The input text may be messy or unstructured, as it could come from a variety of sources like PDFs or web pages. Don't worry about the formatting issues or any irrelevant information; your goal is to extract the key points and interesting facts that could be discussed in a podcast. [...]&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;So I grabbed a copy of it and pasted in &lt;a href="https://simonwillison.net/2024/Jun/10/apple-intelligence/"&gt;my blog entry about WWDC&lt;/a&gt;, which produced &lt;a href="https://gist.github.com/simonw/edac62f6c11640abe98925cbc17f4ac3#apple-intelligence-a-deep-dive-into-the-future-of-ai"&gt;this result&lt;/a&gt; when I ran it through Gemini Flash using &lt;a href="https://github.com/simonw/llm-gemini"&gt;llm-gemini&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;cat prompt.txt | llm -m gemini-1.5-flash-latest&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;Then I piped the result through my &lt;a href="https://simonwillison.net/2023/Nov/7/ospeak/"&gt;ospeak&lt;/a&gt; CLI tool for running text-to-speech with the OpenAI TTS models (after truncating to 690 tokens with &lt;a href="https://github.com/simonw/ttok"&gt;ttok&lt;/a&gt; because it turned out to be slightly too long for the API to handle):&lt;/p&gt;
&lt;p&gt;&lt;code&gt;llm logs --response | ttok -t 690 | ospeak -s -o wwdc-auto-podcast.mp3&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;And &lt;a href="https://static.simonwillison.net/static/2024/wwdc-auto-podcast.mp3"&gt;here's the result&lt;/a&gt; (3.9MB 3m14s MP3).&lt;/p&gt;
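&lt;p&gt;The ttok step counts real model tokens via tiktoken; as a rough sketch of the same truncation idea, here's a whitespace-word approximation (a hypothetical helper, not part of ttok):&lt;/p&gt;

```python
# Rough stand-in for the ttok step: ttok counts real model tokens via
# tiktoken, while this hypothetical helper just caps whitespace words.
def truncate_words(text: str, max_words: int) -> str:
    """Keep at most max_words whitespace-separated words of text."""
    return " ".join(text.split()[:max_words])

assert truncate_words("one two three four five", 3) == "one two three"
```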
&lt;p&gt;It's not as good as the PDF-to-Podcast version because Stephan has some &lt;a href="https://github.com/knowsuchagency/pdf-to-podcast/blob/512bfbdb4fd658ad4b301336020c4ea16cb69e18/main.py#L115-L126"&gt;really clever code&lt;/a&gt; that uses different TTS voices for each of the characters in the transcript, but it's still a surprisingly fun way of repurposing text from my blog. I enjoyed listening to it while I was cooking dinner.

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=40653417"&gt;Show HN&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/pdf"&gt;pdf&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/podcasts"&gt;podcasts&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/text-to-speech"&gt;text-to-speech&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openai"&gt;openai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/prompt-engineering"&gt;prompt-engineering&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gemini"&gt;gemini&lt;/a&gt;&lt;/p&gt;



</summary><category term="pdf"/><category term="podcasts"/><category term="projects"/><category term="text-to-speech"/><category term="ai"/><category term="openai"/><category term="prompt-engineering"/><category term="generative-ai"/><category term="llms"/><category term="gemini"/></entry><entry><title>Experimenting with local alt text generation in Firefox Nightly</title><link href="https://simonwillison.net/2024/Jun/2/experimenting-with-local-alt-text-generation-in-firefox-nightly/#atom-tag" rel="alternate"/><published>2024-06-02T13:12:44+00:00</published><updated>2024-06-02T13:12:44+00:00</updated><id>https://simonwillison.net/2024/Jun/2/experimenting-with-local-alt-text-generation-in-firefox-nightly/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://hacks.mozilla.org/2024/05/experimenting-with-local-alt-text-generation-in-firefox-nightly/"&gt;Experimenting with local alt text generation in Firefox Nightly&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
The PDF editor in Firefox (confession: I did not know Firefox ships with a PDF editor) is getting an experimental feature that can help suggest alt text for images for the human editor to then adapt and improve on.&lt;/p&gt;
&lt;p&gt;This is a great application of AI, made all the more interesting here because Firefox will run a local model on-device for this, using a custom trained model they describe as "our 182M parameters model using a Distilled version of GPT-2 alongside a Vision Transformer (ViT) image encoder".&lt;/p&gt;
&lt;p&gt;The model uses WebAssembly with ONNX running in &lt;a href="https://huggingface.co/docs/transformers.js/en/index"&gt;Transformers.js&lt;/a&gt;, and will be downloaded the first time the feature is put to use.

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/mozhacks/status/1796774672639336804"&gt;@mozhacks&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/accessibility"&gt;accessibility&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/alt-text"&gt;alt-text&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/firefox"&gt;firefox&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/javascript"&gt;javascript&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/mozilla"&gt;mozilla&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pdf"&gt;pdf&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/webassembly"&gt;webassembly&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/transformers-js"&gt;transformers-js&lt;/a&gt;&lt;/p&gt;



</summary><category term="accessibility"/><category term="alt-text"/><category term="firefox"/><category term="javascript"/><category term="mozilla"/><category term="pdf"/><category term="ai"/><category term="webassembly"/><category term="llms"/><category term="transformers-js"/></entry><entry><title>Running OCR against PDFs and images directly in your browser</title><link href="https://simonwillison.net/2024/Mar/30/ocr-pdfs-images/#atom-tag" rel="alternate"/><published>2024-03-30T17:59:56+00:00</published><updated>2024-03-30T17:59:56+00:00</updated><id>https://simonwillison.net/2024/Mar/30/ocr-pdfs-images/#atom-tag</id><summary type="html">
    &lt;p&gt;I attended the &lt;a href="https://biglocalnews.org/content/events/"&gt;Story Discovery At Scale&lt;/a&gt; data journalism conference at Stanford this week. One of the perennial hot topics at any journalism conference concerns data extraction: how can we best get data out of PDFs and images?&lt;/p&gt;
&lt;p&gt;I've been having some very promising results with Gemini Pro 1.5, Claude 3 and GPT-4 Vision recently - I'll write more about that soon. But those tools are still inconvenient for most people to use.&lt;/p&gt;
&lt;p&gt;Meanwhile, older tools like &lt;a href="https://github.com/tesseract-ocr/tesseract"&gt;Tesseract OCR&lt;/a&gt; are still extremely useful - if only they were easier to use as well.&lt;/p&gt;
&lt;p&gt;Then I remembered that Tesseract runs happily in a browser these days thanks to the excellent &lt;a href="https://tesseract.projectnaptha.com/"&gt;Tesseract.js&lt;/a&gt; project. And PDFs can be processed using JavaScript too thanks to Mozilla's extremely mature and well-tested &lt;a href="https://mozilla.github.io/pdf.js/"&gt;PDF.js&lt;/a&gt; library.&lt;/p&gt;
&lt;p&gt;So I built a new tool!&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&lt;a href="https://tools.simonwillison.net/ocr"&gt;tools.simonwillison.net/ocr&lt;/a&gt;&lt;/strong&gt; provides a single page web app that can run Tesseract OCR against images or PDFs that are opened in (or dragged and dropped onto) the app.&lt;/p&gt;
&lt;p&gt;Crucially, everything runs in the browser. There is no server component here, and nothing is uploaded. Your images and documents never leave your computer or phone.&lt;/p&gt;
&lt;p&gt;Here's an animated demo:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2024/ocr-demo.gif" alt="First an image file is dragged onto the page, which then shows that image and accompanying OCR text. Then the drop zone is clicked and a PDF file is selected - that PDF is rendered a page at a time down the page with OCR text displayed beneath each page." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;It's not perfect: multi-column PDFs (thanks, academia) will be treated as a single column, illustrations or photos may result in garbled ASCII-art and there are plenty of other edge cases that will trip it up.&lt;/p&gt;
&lt;p&gt;But... having Tesseract OCR available against PDFs in a web browser (including in Mobile Safari) is still a really useful thing.&lt;/p&gt;
&lt;h4 id="ocr-how-i-built-this"&gt;How I built this&lt;/h4&gt;
&lt;p&gt;&lt;em&gt;For more recent examples of projects I've built with the assistance of LLMs, see &lt;a href="https://simonwillison.net/2024/Mar/23/building-c-extensions-for-sqlite-with-chatgpt-code-interpreter/"&gt;Building and testing C extensions for SQLite with ChatGPT Code Interpreter&lt;/a&gt; and &lt;a href="https://simonwillison.net/2024/Mar/22/claude-and-chatgpt-case-study/"&gt;Claude and ChatGPT for ad-hoc sidequests&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;I built the first version of this tool in just a few minutes, using Claude 3 Opus.&lt;/p&gt;
&lt;p&gt;I already had my own JavaScript code lying around for the two most important tasks: running Tesseract.js against an image and using PDF.js to turn a PDF into a series of images.&lt;/p&gt;
&lt;p&gt;The OCR code came from the system I built and explained in &lt;a href="https://simonwillison.net/2023/Aug/6/annotated-presentations/"&gt;How I make annotated presentations&lt;/a&gt; (built with the help of &lt;a href="https://simonwillison.net/2023/Aug/6/annotated-presentations/#chatgpt-sessions"&gt;multiple ChatGPT sessions&lt;/a&gt;). The PDF to images code was from an &lt;a href="https://gist.github.com/simonw/e58796324abb0e729b2dcd351f46728a#prompt-2"&gt;unfinished experiment&lt;/a&gt; which I wrote with the aid of Claude 3 Opus a week ago.&lt;/p&gt;
&lt;p&gt;I composed the following prompt for Claude 3, where I pasted in both of my code examples and then added some instructions about what I wanted it to build at the end:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;This code shows how to open a PDF and turn it into an image per page:&lt;/p&gt;
&lt;div class="highlight highlight-text-html-basic"&gt;&lt;pre&gt;&lt;span class="pl-c1"&gt;&amp;lt;!DOCTYPE html&lt;span class="pl-kos"&gt;&amp;gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-kos"&gt;&amp;lt;&lt;/span&gt;&lt;span class="pl-ent"&gt;html&lt;/span&gt;&lt;span class="pl-kos"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="pl-kos"&gt;&amp;lt;&lt;/span&gt;&lt;span class="pl-ent"&gt;head&lt;/span&gt;&lt;span class="pl-kos"&gt;&amp;gt;&lt;/span&gt;
  &lt;span class="pl-kos"&gt;&amp;lt;&lt;/span&gt;&lt;span class="pl-ent"&gt;title&lt;/span&gt;&lt;span class="pl-kos"&gt;&amp;gt;&lt;/span&gt;PDF to Images&lt;span class="pl-kos"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="pl-ent"&gt;title&lt;/span&gt;&lt;span class="pl-kos"&gt;&amp;gt;&lt;/span&gt;
  &lt;span class="pl-kos"&gt;&amp;lt;&lt;/span&gt;&lt;span class="pl-ent"&gt;script&lt;/span&gt; &lt;span class="pl-c1"&gt;src&lt;/span&gt;="&lt;span class="pl-s"&gt;https://cdnjs.cloudflare.com/ajax/libs/pdf.js/2.9.359/pdf.min.js&lt;/span&gt;"&lt;span class="pl-kos"&gt;&amp;gt;&lt;/span&gt;&lt;span class="pl-kos"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="pl-ent"&gt;script&lt;/span&gt;&lt;span class="pl-kos"&gt;&amp;gt;&lt;/span&gt;
  &lt;span class="pl-kos"&gt;&amp;lt;&lt;/span&gt;&lt;span class="pl-ent"&gt;style&lt;/span&gt;&lt;span class="pl-kos"&gt;&amp;gt;&lt;/span&gt;
    .image-container img {
      margin-bottom: 10px;
    }
    .image-container p {
      margin: 0;
      font-size: 14px;
      color: #888;
    }
  &lt;span class="pl-kos"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="pl-ent"&gt;style&lt;/span&gt;&lt;span class="pl-kos"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="pl-kos"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="pl-ent"&gt;head&lt;/span&gt;&lt;span class="pl-kos"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="pl-kos"&gt;&amp;lt;&lt;/span&gt;&lt;span class="pl-ent"&gt;body&lt;/span&gt;&lt;span class="pl-kos"&gt;&amp;gt;&lt;/span&gt;
  &lt;span class="pl-kos"&gt;&amp;lt;&lt;/span&gt;&lt;span class="pl-ent"&gt;input&lt;/span&gt; &lt;span class="pl-c1"&gt;type&lt;/span&gt;="&lt;span class="pl-s"&gt;file&lt;/span&gt;" &lt;span class="pl-c1"&gt;id&lt;/span&gt;="&lt;span class="pl-s"&gt;fileInput&lt;/span&gt;" &lt;span class="pl-c1"&gt;accept&lt;/span&gt;="&lt;span class="pl-s"&gt;.pdf&lt;/span&gt;" /&amp;gt;
  &lt;span class="pl-kos"&gt;&amp;lt;&lt;/span&gt;&lt;span class="pl-ent"&gt;div&lt;/span&gt; &lt;span class="pl-c1"&gt;class&lt;/span&gt;="&lt;span class="pl-s"&gt;image-container&lt;/span&gt;"&lt;span class="pl-kos"&gt;&amp;gt;&lt;/span&gt;&lt;span class="pl-kos"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="pl-ent"&gt;div&lt;/span&gt;&lt;span class="pl-kos"&gt;&amp;gt;&lt;/span&gt;

  &lt;span class="pl-kos"&gt;&amp;lt;&lt;/span&gt;&lt;span class="pl-ent"&gt;script&lt;/span&gt;&lt;span class="pl-kos"&gt;&amp;gt;&lt;/span&gt;
  &lt;span class="pl-k"&gt;const&lt;/span&gt; &lt;span class="pl-s1"&gt;desiredWidth&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-c1"&gt;800&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
    &lt;span class="pl-k"&gt;const&lt;/span&gt; &lt;span class="pl-s1"&gt;fileInput&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-smi"&gt;document&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;getElementById&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s"&gt;'fileInput'&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
    &lt;span class="pl-k"&gt;const&lt;/span&gt; &lt;span class="pl-s1"&gt;imageContainer&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-smi"&gt;document&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;querySelector&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s"&gt;'.image-container'&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;

    &lt;span class="pl-s1"&gt;fileInput&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;addEventListener&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s"&gt;'change'&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-s1"&gt;handleFileUpload&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;

    &lt;span class="pl-s1"&gt;pdfjsLib&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;GlobalWorkerOptions&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;workerSrc&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s"&gt;'https://cdnjs.cloudflare.com/ajax/libs/pdf.js/2.9.359/pdf.worker.min.js'&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;

    &lt;span class="pl-k"&gt;async&lt;/span&gt; &lt;span class="pl-k"&gt;function&lt;/span&gt; &lt;span class="pl-en"&gt;handleFileUpload&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s1"&gt;event&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt; &lt;span class="pl-kos"&gt;{&lt;/span&gt;
      &lt;span class="pl-k"&gt;const&lt;/span&gt; &lt;span class="pl-s1"&gt;file&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;event&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;target&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;files&lt;/span&gt;&lt;span class="pl-kos"&gt;[&lt;/span&gt;&lt;span class="pl-c1"&gt;0&lt;/span&gt;&lt;span class="pl-kos"&gt;]&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
      &lt;span class="pl-k"&gt;const&lt;/span&gt; &lt;span class="pl-s1"&gt;imageIterator&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-en"&gt;convertPDFToImages&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s1"&gt;file&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;

      &lt;span class="pl-k"&gt;for&lt;/span&gt; &lt;span class="pl-k"&gt;await&lt;/span&gt; &lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-k"&gt;const&lt;/span&gt; &lt;span class="pl-kos"&gt;{&lt;/span&gt; imageURL&lt;span class="pl-kos"&gt;,&lt;/span&gt; size &lt;span class="pl-kos"&gt;}&lt;/span&gt; &lt;span class="pl-k"&gt;of&lt;/span&gt; &lt;span class="pl-s1"&gt;imageIterator&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt; &lt;span class="pl-kos"&gt;{&lt;/span&gt;
        &lt;span class="pl-k"&gt;const&lt;/span&gt; &lt;span class="pl-s1"&gt;imgElement&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-smi"&gt;document&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;createElement&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s"&gt;'img'&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
        &lt;span class="pl-s1"&gt;imgElement&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;src&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;imageURL&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
        &lt;span class="pl-s1"&gt;imageContainer&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;appendChild&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s1"&gt;imgElement&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;

        &lt;span class="pl-k"&gt;const&lt;/span&gt; &lt;span class="pl-s1"&gt;sizeElement&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-smi"&gt;document&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;createElement&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s"&gt;'p'&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
        &lt;span class="pl-s1"&gt;sizeElement&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;textContent&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s"&gt;`Size: &lt;span class="pl-s1"&gt;&lt;span class="pl-kos"&gt;${&lt;/span&gt;&lt;span class="pl-en"&gt;formatSize&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s1"&gt;size&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;/span&gt;`&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
        &lt;span class="pl-s1"&gt;imageContainer&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;appendChild&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s1"&gt;sizeElement&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
      &lt;span class="pl-kos"&gt;}&lt;/span&gt;
    &lt;span class="pl-kos"&gt;}&lt;/span&gt;

    &lt;span class="pl-k"&gt;async&lt;/span&gt; &lt;span class="pl-k"&gt;function&lt;/span&gt;&lt;span class="pl-c1"&gt;*&lt;/span&gt; &lt;span class="pl-s1"&gt;convertPDFToImages&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s1"&gt;file&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt; &lt;span class="pl-kos"&gt;{&lt;/span&gt;
      &lt;span class="pl-k"&gt;try&lt;/span&gt; &lt;span class="pl-kos"&gt;{&lt;/span&gt;
        &lt;span class="pl-k"&gt;const&lt;/span&gt; &lt;span class="pl-s1"&gt;pdf&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-k"&gt;await&lt;/span&gt; &lt;span class="pl-s1"&gt;pdfjsLib&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;getDocument&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-c1"&gt;URL&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;createObjectURL&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s1"&gt;file&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;promise&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
        &lt;span class="pl-k"&gt;const&lt;/span&gt; &lt;span class="pl-s1"&gt;numPages&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;pdf&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;numPages&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;

        &lt;span class="pl-k"&gt;for&lt;/span&gt; &lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-k"&gt;let&lt;/span&gt; &lt;span class="pl-s1"&gt;i&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-c1"&gt;1&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt; &lt;span class="pl-s1"&gt;i&lt;/span&gt; &lt;span class="pl-c1"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;numPages&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt; &lt;span class="pl-s1"&gt;i&lt;/span&gt;&lt;span class="pl-c1"&gt;++&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt; &lt;span class="pl-kos"&gt;{&lt;/span&gt;
          &lt;span class="pl-k"&gt;const&lt;/span&gt; &lt;span class="pl-s1"&gt;page&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-k"&gt;await&lt;/span&gt; &lt;span class="pl-s1"&gt;pdf&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;getPage&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s1"&gt;i&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
          &lt;span class="pl-k"&gt;const&lt;/span&gt; &lt;span class="pl-s1"&gt;viewport&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;page&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;getViewport&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-kos"&gt;{&lt;/span&gt; &lt;span class="pl-c1"&gt;scale&lt;/span&gt;: &lt;span class="pl-c1"&gt;1&lt;/span&gt; &lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
          &lt;span class="pl-k"&gt;const&lt;/span&gt; &lt;span class="pl-s1"&gt;canvas&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-smi"&gt;document&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;createElement&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s"&gt;'canvas'&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
          &lt;span class="pl-k"&gt;const&lt;/span&gt; &lt;span class="pl-s1"&gt;context&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;canvas&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;getContext&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s"&gt;'2d'&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
          &lt;span class="pl-s1"&gt;canvas&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;width&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;desiredWidth&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
          &lt;span class="pl-s1"&gt;canvas&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;height&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s1"&gt;desiredWidth&lt;/span&gt; &lt;span class="pl-c1"&gt;/&lt;/span&gt; &lt;span class="pl-s1"&gt;viewport&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;width&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt; &lt;span class="pl-c1"&gt;*&lt;/span&gt; &lt;span class="pl-s1"&gt;viewport&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;height&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
          &lt;span class="pl-k"&gt;const&lt;/span&gt; &lt;span class="pl-s1"&gt;renderContext&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-kos"&gt;{&lt;/span&gt;
            &lt;span class="pl-c1"&gt;canvasContext&lt;/span&gt;: &lt;span class="pl-s1"&gt;context&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt;
            &lt;span class="pl-c1"&gt;viewport&lt;/span&gt;: &lt;span class="pl-s1"&gt;page&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;getViewport&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-kos"&gt;{&lt;/span&gt; &lt;span class="pl-c1"&gt;scale&lt;/span&gt;: &lt;span class="pl-s1"&gt;desiredWidth&lt;/span&gt; &lt;span class="pl-c1"&gt;/&lt;/span&gt; &lt;span class="pl-s1"&gt;viewport&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;width&lt;/span&gt; &lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt;
          &lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
          &lt;span class="pl-k"&gt;await&lt;/span&gt; &lt;span class="pl-s1"&gt;page&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;render&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s1"&gt;renderContext&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;promise&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
          &lt;span class="pl-k"&gt;const&lt;/span&gt; &lt;span class="pl-s1"&gt;imageURL&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;canvas&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;toDataURL&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s"&gt;'image/jpeg'&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-c1"&gt;0.8&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
          &lt;span class="pl-k"&gt;const&lt;/span&gt; &lt;span class="pl-s1"&gt;size&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-en"&gt;calculateSize&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s1"&gt;imageURL&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
          &lt;span class="pl-k"&gt;yield&lt;/span&gt; &lt;span class="pl-kos"&gt;{&lt;/span&gt; imageURL&lt;span class="pl-kos"&gt;,&lt;/span&gt; size &lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
        &lt;span class="pl-kos"&gt;}&lt;/span&gt;
      &lt;span class="pl-kos"&gt;}&lt;/span&gt; &lt;span class="pl-k"&gt;catch&lt;/span&gt; &lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s1"&gt;error&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt; &lt;span class="pl-kos"&gt;{&lt;/span&gt;
        &lt;span class="pl-smi"&gt;console&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;error&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s"&gt;'Error:'&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-s1"&gt;error&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
      &lt;span class="pl-kos"&gt;}&lt;/span&gt;
    &lt;span class="pl-kos"&gt;}&lt;/span&gt;

    &lt;span class="pl-k"&gt;function&lt;/span&gt; &lt;span class="pl-en"&gt;calculateSize&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s1"&gt;imageURL&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt; &lt;span class="pl-kos"&gt;{&lt;/span&gt;
      &lt;span class="pl-k"&gt;const&lt;/span&gt; &lt;span class="pl-s1"&gt;base64Length&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;imageURL&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;length&lt;/span&gt; &lt;span class="pl-c1"&gt;-&lt;/span&gt; &lt;span class="pl-s"&gt;'data:image/jpeg;base64,'&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;length&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
      &lt;span class="pl-k"&gt;const&lt;/span&gt; &lt;span class="pl-s1"&gt;sizeInBytes&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-v"&gt;Math&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;ceil&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s1"&gt;base64Length&lt;/span&gt; &lt;span class="pl-c1"&gt;*&lt;/span&gt; &lt;span class="pl-c1"&gt;0.75&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
      &lt;span class="pl-k"&gt;return&lt;/span&gt; &lt;span class="pl-s1"&gt;sizeInBytes&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
    &lt;span class="pl-kos"&gt;}&lt;/span&gt;

    &lt;span class="pl-k"&gt;function&lt;/span&gt; &lt;span class="pl-en"&gt;formatSize&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s1"&gt;size&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt; &lt;span class="pl-kos"&gt;{&lt;/span&gt;
      &lt;span class="pl-k"&gt;const&lt;/span&gt; &lt;span class="pl-s1"&gt;sizeInKB&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s1"&gt;size&lt;/span&gt; &lt;span class="pl-c1"&gt;/&lt;/span&gt; &lt;span class="pl-c1"&gt;1024&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;toFixed&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-c1"&gt;2&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
      &lt;span class="pl-k"&gt;return&lt;/span&gt; &lt;span class="pl-s"&gt;`&lt;span class="pl-s1"&gt;&lt;span class="pl-kos"&gt;${&lt;/span&gt;&lt;span class="pl-s1"&gt;sizeInKB&lt;/span&gt;&lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;/span&gt; KB`&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
    &lt;span class="pl-kos"&gt;}&lt;/span&gt;
  &lt;span class="pl-kos"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="pl-ent"&gt;script&lt;/span&gt;&lt;span class="pl-kos"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="pl-kos"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="pl-ent"&gt;body&lt;/span&gt;&lt;span class="pl-kos"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="pl-kos"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="pl-ent"&gt;html&lt;/span&gt;&lt;span class="pl-kos"&gt;&amp;gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This code shows how to OCR an image:&lt;/p&gt;
&lt;div class="highlight highlight-source-js"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;async&lt;/span&gt; &lt;span class="pl-k"&gt;function&lt;/span&gt; &lt;span class="pl-en"&gt;ocrMissingAltText&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt; &lt;span class="pl-kos"&gt;{&lt;/span&gt;
    &lt;span class="pl-c"&gt;// Load Tesseract&lt;/span&gt;
    &lt;span class="pl-k"&gt;var&lt;/span&gt; &lt;span class="pl-s1"&gt;s&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-smi"&gt;document&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;createElement&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s"&gt;"script"&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
    &lt;span class="pl-s1"&gt;s&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;src&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s"&gt;"https://unpkg.com/tesseract.js@v2.1.0/dist/tesseract.min.js"&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
    &lt;span class="pl-smi"&gt;document&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;head&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;appendChild&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s1"&gt;s&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;

    &lt;span class="pl-s1"&gt;s&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;onload&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-k"&gt;async&lt;/span&gt; &lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt; &lt;span class="pl-c1"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="pl-kos"&gt;{&lt;/span&gt;
      &lt;span class="pl-k"&gt;const&lt;/span&gt; &lt;span class="pl-s1"&gt;images&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-smi"&gt;document&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;getElementsByTagName&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s"&gt;"img"&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
      &lt;span class="pl-k"&gt;const&lt;/span&gt; &lt;span class="pl-s1"&gt;worker&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-v"&gt;Tesseract&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;createWorker&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
      &lt;span class="pl-k"&gt;await&lt;/span&gt; &lt;span class="pl-s1"&gt;worker&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;load&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
      &lt;span class="pl-k"&gt;await&lt;/span&gt; &lt;span class="pl-s1"&gt;worker&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;loadLanguage&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s"&gt;"eng"&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
      &lt;span class="pl-k"&gt;await&lt;/span&gt; &lt;span class="pl-s1"&gt;worker&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;initialize&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s"&gt;"eng"&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
      &lt;span class="pl-s1"&gt;ocrButton&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;innerText&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s"&gt;"Running OCR..."&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;

      &lt;span class="pl-c"&gt;// Iterate through all the images in the output div&lt;/span&gt;
      &lt;span class="pl-k"&gt;for&lt;/span&gt; &lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-k"&gt;const&lt;/span&gt; &lt;span class="pl-s1"&gt;img&lt;/span&gt; &lt;span class="pl-k"&gt;of&lt;/span&gt; &lt;span class="pl-s1"&gt;images&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt; &lt;span class="pl-kos"&gt;{&lt;/span&gt;
        &lt;span class="pl-k"&gt;const&lt;/span&gt; &lt;span class="pl-s1"&gt;altTextarea&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;img&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;parentNode&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;querySelector&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s"&gt;".textarea-alt"&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
        &lt;span class="pl-c"&gt;// Check if the alt textarea is empty&lt;/span&gt;
        &lt;span class="pl-k"&gt;if&lt;/span&gt; &lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s1"&gt;altTextarea&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;value&lt;/span&gt; &lt;span class="pl-c1"&gt;===&lt;/span&gt; &lt;span class="pl-s"&gt;""&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt; &lt;span class="pl-kos"&gt;{&lt;/span&gt;
          &lt;span class="pl-k"&gt;const&lt;/span&gt; &lt;span class="pl-s1"&gt;imageUrl&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;img&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;src&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
          &lt;span class="pl-k"&gt;var&lt;/span&gt; &lt;span class="pl-kos"&gt;{&lt;/span&gt;
            &lt;span class="pl-c1"&gt;data&lt;/span&gt;: &lt;span class="pl-kos"&gt;{&lt;/span&gt; text &lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt;
          &lt;span class="pl-kos"&gt;}&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-k"&gt;await&lt;/span&gt; &lt;span class="pl-s1"&gt;worker&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;recognize&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s1"&gt;imageUrl&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
          &lt;span class="pl-s1"&gt;altTextarea&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;value&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;text&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt; &lt;span class="pl-c"&gt;// Set the OCR result to the alt textarea&lt;/span&gt;
          &lt;span class="pl-s1"&gt;progressBar&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;value&lt;/span&gt; &lt;span class="pl-c1"&gt;+=&lt;/span&gt; &lt;span class="pl-c1"&gt;1&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
        &lt;span class="pl-kos"&gt;}&lt;/span&gt;
      &lt;span class="pl-kos"&gt;}&lt;/span&gt;

      &lt;span class="pl-k"&gt;await&lt;/span&gt; &lt;span class="pl-s1"&gt;worker&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;terminate&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
      &lt;span class="pl-s1"&gt;ocrButton&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;innerText&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s"&gt;"OCR complete"&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
    &lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
  &lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Use these examples to put together a single HTML page with embedded HTML and CSS and JavaScript that provides a big square which users can drag and drop a PDF file onto and when they do that the PDF has every page converted to a JPEG and shown below on the page, then OCR is run with tesseract and the results are shown in textarea blocks below each image.&lt;/p&gt;
&lt;/blockquote&gt;
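&lt;p&gt;(An aside on the &lt;code&gt;calculateSize&lt;/code&gt; helper in that prompt: base64 encodes every 3 bytes of binary data as 4 text characters, so multiplying the base64 payload length by 0.75 approximates the decoded byte count. Here's a standalone sketch of that arithmetic, runnable outside the page with a made-up data URL:)&lt;/p&gt;

```javascript
// Estimate the decoded byte size of a JPEG data URL.
// Base64 encodes every 3 bytes of input as 4 output characters,
// so the decoded size is roughly 3/4 of the base64 payload length.
function calculateSize(imageURL) {
  const base64Length = imageURL.length - 'data:image/jpeg;base64,'.length;
  return Math.ceil(base64Length * 0.75);
}

function formatSize(size) {
  return `${(size / 1024).toFixed(2)} KB`;
}

// Example with a tiny made-up payload (not a real image):
const fakeDataURL = 'data:image/jpeg;base64,' + 'A'.repeat(4000);
console.log(calculateSize(fakeDataURL)); // 3000
console.log(formatSize(calculateSize(fakeDataURL))); // "2.93 KB"
```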
&lt;p&gt;I saved this prompt to a &lt;code&gt;prompt.txt&lt;/code&gt; file and ran it using my &lt;a href="https://github.com/simonw/llm-claude-3"&gt;llm-claude-3&lt;/a&gt; plugin for &lt;a href="https://llm.datasette.io/"&gt;LLM&lt;/a&gt;:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm -m claude-3-opus &lt;span class="pl-k"&gt;&amp;lt;&lt;/span&gt; prompt.txt&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;It gave me &lt;a href="https://static.simonwillison.net/static/2024/pdf-ocr-v1.html"&gt;a working initial version&lt;/a&gt; on the first attempt!&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2024/ocr-v1.jpg" alt="A square dotted border around the text Drag and drop PDF file here" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href="https://gist.github.com/simonw/6a9f077bf8db616e44893a24ae1d36eb"&gt;Here's the full transcript&lt;/a&gt;, including my follow-up prompts and their responses. Iterating on software in this way is &lt;em&gt;so&lt;/em&gt; much fun.&lt;/p&gt;
&lt;p&gt;First follow-up:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Modify this to also have a file input that can be used - dropping a file onto the drop area fills that input&lt;/p&gt;
&lt;p&gt;make the drop zone 100% wide but have a 2em padding on the body. it should be 10em high. it should turn pink when an image is dragged over it.&lt;/p&gt;
&lt;p&gt;Each textarea should be 100% wide and 10em high&lt;/p&gt;
&lt;p&gt;At the very bottom of the page add a h2 that says Full document - then a 30em high textarea with all of the page text in it separated by two newlines&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;a href="https://static.simonwillison.net/static/2024/pdf-ocr-v2.html"&gt;Here's the interactive result&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2024/ocr-v2.jpg" alt="A PDF file is dragged over the box and it turned pink. The heading Full document displays below" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;Rather delightfully it used the neater pattern where the file input itself is hidden but can be triggered by clicking on the large drop zone, and it updated the copy on the drop zone to reflect that - without me suggesting those requirements.&lt;/p&gt;
&lt;p&gt;And then:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;get rid of the code that shows image sizes. Set the placeholder on each textarea to be Processing... and clear that placeholder when the job is done.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;a href="https://static.simonwillison.net/static/2024/pdf-ocr-v3.html"&gt;Which gave me this&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I realized it would be useful if it could handle non-PDF images as well. So I fired up ChatGPT (for no reason other than curiosity to see how well it did) and got GPT-4 to add that feature for me. I &lt;a href="https://chat.openai.com/share/665eca31-3b5d-4cd9-a3cb-85ab608169a6"&gt;pasted in the code so far and added&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Modify this so jpg and png and gif images can be dropped or opened too - they skip the PDF step and get appended to the page and OCRd directly. Also move the full document heading and textarea above the page preview and hide it until there is data to be shown in it&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Then I spotted that the Tesseract worker was being created multiple times in a loop, which is inefficient - so I prompted:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Create the worker once and use it for all OCR tasks and terminate it at the end&lt;/p&gt;
&lt;/blockquote&gt;
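&lt;p&gt;The resulting pattern looks roughly like this - a sketch of the shape against a generic &lt;code&gt;createWorker&lt;/code&gt;/&lt;code&gt;recognize&lt;/code&gt;/&lt;code&gt;terminate&lt;/code&gt; interface (which tesseract.js workers provide), not the exact code GPT-4 generated:&lt;/p&gt;

```javascript
// Reuse a single OCR worker across many images instead of creating
// (and loading language data for) a fresh one inside the loop.
// `createWorker` is any async factory returning an object with
// recognize(url) and terminate() - tesseract.js workers have this shape.
async function ocrAll(createWorker, imageUrls) {
  const worker = await createWorker(); // created exactly once, up front
  const results = [];
  try {
    for (const url of imageUrls) {
      const { data: { text } } = await worker.recognize(url);
      results.push(text);
    }
  } finally {
    await worker.terminate(); // terminated exactly once, at the end
  }
  return results;
}
```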
&lt;p&gt;I'd tweaked the HTML and CSS a little before feeding it to GPT-4, so now the site had a title and rendered in Helvetica.&lt;/p&gt;
&lt;p&gt;Here's &lt;a href="https://static.simonwillison.net/static/2024/pdf-ocr-v4.html"&gt;the version GPT-4 produced for me&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2024/ocr-v4.jpg" alt="A heading reads OCR a PDF or Image - This tool runs entirely in your browser. No files are uploaded to a server. The dotted box now contains text that reads Drag and drop a PDF, JPG, PNG, or GIF file here or click to select a file" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;h4 id="ocr-finishing-touches"&gt;Manual finishing touches&lt;/h4&gt;
&lt;p&gt;Fun though it was iterating on this project entirely through prompting, I decided it would be more productive to make the finishing touches myself. You can see those &lt;a href="https://github.com/simonw/tools/commits/cc609194a0d0a54c2ae676dae962e14b3e3a9d22/"&gt;in the commit history&lt;/a&gt;. They're not particularly interesting:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;I added &lt;a href="https://plausible.io/"&gt;Plausible&lt;/a&gt; analytics (which I like because they use no cookies).&lt;/li&gt;
&lt;li&gt;I added better progress indicators, including the text that shows how many pages of the PDF have been processed so far.&lt;/li&gt;
&lt;li&gt;I bumped up the width of the rendered PDF page images from 800 to 1000. This seemed to improve OCR quality - in particular, the &lt;a href="https://www-cdn.anthropic.com/de8ba9b01c9ab7cbabf5c33b80b7bbc618857627/Model_Card_Claude_3.pdf"&gt;Claude 3 model card PDF&lt;/a&gt; now has fewer OCR errors than it did before.&lt;/li&gt;
&lt;li&gt;I upgraded both Tesseract.js and PDF.js to the most recent versions. Unsurprisingly, Claude 3 Opus had used older versions of both libraries.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I'm really pleased with this project. I consider it &lt;em&gt;finished&lt;/em&gt; - it does the job I designed it to do and I don't see any need to keep on iterating on it. And because it's all static JavaScript and WebAssembly I expect it to continue working effectively forever.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Update:&lt;/strong&gt; OK, a few more features: I added &lt;a href="https://github.com/simonw/tools/issues/4"&gt;language selection&lt;/a&gt;, &lt;a href="https://github.com/simonw/tools/issues/7"&gt;paste support&lt;/a&gt; and some &lt;a href="https://github.com/simonw/tools/issues/8"&gt;basic automated tests&lt;/a&gt; using Playwright Python.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/data-journalism"&gt;data-journalism&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ocr"&gt;ocr&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pdf"&gt;pdf&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/tesseract"&gt;tesseract&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-programming"&gt;ai-assisted-programming&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="data-journalism"/><category term="ocr"/><category term="pdf"/><category term="projects"/><category term="tesseract"/><category term="ai-assisted-programming"/></entry><entry><title>unstructured</title><link href="https://simonwillison.net/2024/Feb/2/unstructured/#atom-tag" rel="alternate"/><published>2024-02-02T02:47:15+00:00</published><updated>2024-02-02T02:47:15+00:00</updated><id>https://simonwillison.net/2024/Feb/2/unstructured/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/Unstructured-IO/unstructured"&gt;unstructured&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Relatively new but impressively capable Python library (Apache 2 licensed) for extracting information from unstructured documents, such as PDFs, images, Word documents and many other formats.&lt;/p&gt;

&lt;p&gt;I got some good initial results against a PDF by running &lt;code&gt;pip install 'unstructured[pdf]'&lt;/code&gt; and then using the &lt;code&gt;unstructured.partition.pdf.partition_pdf(filename)&lt;/code&gt; function.&lt;/p&gt;

&lt;p&gt;There are a lot of moving parts under the hood: pytesseract, OpenCV, various PDF libraries, even an ONNX model—but it installed cleanly for me on macOS and worked out of the box.&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ocr"&gt;ocr&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pdf"&gt;pdf&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;&lt;/p&gt;



</summary><category term="ocr"/><category term="pdf"/><category term="python"/></entry><entry><title>Portable EPUBs</title><link href="https://simonwillison.net/2024/Jan/25/portable-epubs/#atom-tag" rel="alternate"/><published>2024-01-25T20:32:38+00:00</published><updated>2024-01-25T20:32:38+00:00</updated><id>https://simonwillison.net/2024/Jan/25/portable-epubs/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://willcrichton.net/notes/portable-epubs/"&gt;Portable EPUBs&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Will Crichton digs into the reasons people still prefer PDF over HTML as a format for sharing digital documents, concluding that the key issues are that HTML documents are not fully self-contained and may not be rendered consistently.&lt;/p&gt;

&lt;p&gt;He proposes “Portable EPUBs” as the solution, defining a subset of the existing EPUB standard with some additional restrictions around avoiding loading extra assets over a network, sticking to a smaller (as-yet undefined) subset of HTML and encouraging interactive components to be built using self-contained Web Components.&lt;/p&gt;

&lt;p&gt;Will also built his own lightweight EPUB reading system, called Bene—which is used to render this Portable EPUBs article. It provides a “download” link in the top right which produces the .epub file itself.&lt;/p&gt;

&lt;p&gt;There’s a lot to like here. I’m constantly infuriated at the number of documents out there that are PDFs but really should be web pages (academic papers are a particularly bad example here), so I’m very excited by any initiatives that might help push things in the other direction.&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/html"&gt;html&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pdf"&gt;pdf&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/web-components"&gt;web-components&lt;/a&gt;&lt;/p&gt;



</summary><category term="html"/><category term="pdf"/><category term="web-components"/></entry><entry><title>textra</title><link href="https://simonwillison.net/2023/Mar/23/textra/#atom-tag" rel="alternate"/><published>2023-03-23T21:08:51+00:00</published><updated>2023-03-23T21:08:51+00:00</updated><id>https://simonwillison.net/2023/Mar/23/textra/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/freedmand/textra"&gt;textra&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Tiny (432KB) macOS binary CLI tool by Dylan Freedman which produces high quality text extraction from PDFs, images and even audio files using the VisionKit APIs in macOS 13 and higher. It handles handwriting too!&lt;/p&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/agarcia_me/status/1639001713461071873"&gt;Alex Garcia&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/audio"&gt;audio&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/cli"&gt;cli&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/macos"&gt;macos&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ocr"&gt;ocr&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pdf"&gt;pdf&lt;/a&gt;&lt;/p&gt;



</summary><category term="audio"/><category term="cli"/><category term="macos"/><category term="ocr"/><category term="pdf"/></entry><entry><title>Building a searchable archive for the San Francisco Microscopical Society</title><link href="https://simonwillison.net/2022/Aug/25/sfms-archive/#atom-tag" rel="alternate"/><published>2022-08-25T17:24:50+00:00</published><updated>2022-08-25T17:24:50+00:00</updated><id>https://simonwillison.net/2022/Aug/25/sfms-archive/#atom-tag</id><summary type="html">
    &lt;p&gt;The &lt;a href="https://sfmicrosociety.org/"&gt;San Francisco Microscopical Society&lt;/a&gt; was &lt;a href="https://sfmicrosociety.org/history"&gt;founded&lt;/a&gt; in 1870 by a group of scientists dedicated to advancing the field of microscopy.&lt;/p&gt;
&lt;p&gt;Today the society is a non-profit run entirely by volunteers (they welcome &lt;a href="https://sfmicrosociety.org/join"&gt;new members&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;This year they've been working with the UC Berkeley Bancroft Library to &lt;a href="https://mailchi.mp/sfmicrosociety/spring-2022#FROM%20THE%20ARCHIVE"&gt;digitize their archives&lt;/a&gt;, which reach back to the founding of the organization.&lt;/p&gt;
&lt;p&gt;Hep Svadja, Jenny Jacox and Ariel Waldman have taken the lead on this effort, resulting in a wealth of scanned PDF files covering 150 years of microscope history.&lt;/p&gt;
&lt;p&gt;I've been helping out with the project running OCR against the archives (using Amazon Textract via my &lt;a href="https://simonwillison.net/2022/Jun/30/s3-ocr/"&gt;s3-ocr tool&lt;/a&gt;, which I built for this project) and turning them into a searchable website using &lt;a href="https://datasette.io/"&gt;Datasette&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The result is now live! Take a look at &lt;strong&gt;&lt;a href="https://archive.sfmicrosociety.org/"&gt;archive.sfmicrosociety.org&lt;/a&gt;&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2022/sfms-homepage.jpg" alt="The website homepage. The title reads San Francisco Microscopical Society Archive. The page has a search box, a link to browse all documents and a heading titled Some random pages which shows a mixture of six different scanned pages, some typewritten and some with handwriting." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;Try running a search for &lt;a href="https://archive.sfmicrosociety.org/?q=newton"&gt;newton&lt;/a&gt; to see the search feature in action.&lt;/p&gt;
&lt;p&gt;You can also &lt;a href="https://archive.sfmicrosociety.org/folders/PUBLIC/Section+5+Supporting+Documents+for+SFHS+Application"&gt;browse by folder structure&lt;/a&gt; or by &lt;a href="https://archive.sfmicrosociety.org/docs/12abb3e5"&gt;individual document&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;A neat thing about using AWS Textract is that it is significantly better at reading 19th century handwritten documents than I am.&lt;/p&gt;
&lt;p&gt;Each document page is shown with the OCRd text alongside the original scan, so you can see that for yourself on &lt;a href="https://archive.sfmicrosociety.org/docs/402f6021/15"&gt;this page&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2022/sfms-page.jpg" alt="A page titled Misc Notes and Business Cards 1885.pdf pafe 15. The scanned image on the right shows some beautiful but very hard to read handwritten notes. The OCR text on the left looks to me like it's pretty accurate." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;h4&gt;How it works&lt;/h4&gt;
&lt;p&gt;The website itself is a custom Datasette instance hosted on &lt;a href="https://vercel.com/"&gt;Vercel&lt;/a&gt; - deployed using the &lt;a href="https://datasette.io/plugins/datasette-publish-vercel"&gt;datasette-publish-vercel plugin&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;It's an implementation of the &lt;a href="https://simonwillison.net/2021/Jul/28/baked-data/"&gt;Baked Data&lt;/a&gt; architectural pattern, where the database itself is packaged up as a binary SQLite file and included in the application that is deployed to Vercel.&lt;/p&gt;
&lt;p&gt;You can directly browse and query the underlying database (a 12.4MB SQLite file) using the default Datasette interface at &lt;a href="https://archive.sfmicrosociety.org/sfms"&gt;archive.sfmicrosociety.org/sfms&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The database schema (excluding the tables that enable full-text search) looks like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-sql"&gt;&lt;pre&gt;CREATE TABLE [documents] (
   [id] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt; &lt;span class="pl-k"&gt;PRIMARY KEY&lt;/span&gt;,
   [title] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt;,
   [&lt;span class="pl-k"&gt;path&lt;/span&gt;] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt;,
   [etag] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt;
);

CREATE TABLE [pages] (
   [id] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt; &lt;span class="pl-k"&gt;PRIMARY KEY&lt;/span&gt;,
   [document_id] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt;,
   [page] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt;,
   [&lt;span class="pl-k"&gt;text&lt;/span&gt;] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt;,
   &lt;span class="pl-k"&gt;FOREIGN KEY&lt;/span&gt;([document_id]) &lt;span class="pl-k"&gt;REFERENCES&lt;/span&gt; [documents]([id])
);&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;As you can see, the site consists of documents and pages. Documents have a &lt;code&gt;path&lt;/code&gt; (the directory structure), a title (the filename) and an ID. Pages belong to documents and have a &lt;code&gt;text&lt;/code&gt; column and a page number. It's a pretty simple schema!&lt;/p&gt;
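&lt;p&gt;That schema maps directly onto a few lines of Python's built-in &lt;code&gt;sqlite3&lt;/code&gt; module. Here's a sketch that recreates the two tables in memory, adds an illustrative FTS5 index and a made-up sample row (the sample data and the &lt;code&gt;pages_fts&lt;/code&gt; layout are my own invention for this sketch - the real site configures search via sqlite-utils), and runs the kind of join a search query performs:&lt;/p&gt;

```python
import sqlite3

# An in-memory version of the documents/pages schema described above.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE documents (id TEXT PRIMARY KEY, title TEXT, path TEXT, etag TEXT);
CREATE TABLE pages (
    id TEXT PRIMARY KEY,
    document_id TEXT,
    page INTEGER,
    text TEXT,
    FOREIGN KEY(document_id) REFERENCES documents(id)
);
CREATE VIRTUAL TABLE pages_fts USING fts5(page_id, text);
""")
# Hypothetical sample row, not real archive data:
db.execute("INSERT INTO documents VALUES ('d1', 'Minutes 1885.pdf', 'PUBLIC/Minutes 1885.pdf', 'abc123')")
db.execute("INSERT INTO pages VALUES ('d1-1', 'd1', 1, 'A letter from Mr Newton about lenses')")
db.execute("INSERT INTO pages_fts SELECT id, text FROM pages")

# Roughly what a ?q=newton search does: FTS hits joined back to documents.
rows = db.execute("""
    SELECT documents.title, pages.page
    FROM pages_fts
    JOIN pages ON pages.id = pages_fts.page_id
    JOIN documents ON documents.id = pages.document_id
    WHERE pages_fts MATCH 'newton'
""").fetchall()
print(rows)
```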
&lt;p&gt;The public pages on the site are all built using custom templates and views. You can browse the source code in the &lt;a href="https://github.com/simonw/sfms-history"&gt;simonw/sfms-history&lt;/a&gt; GitHub repository.&lt;/p&gt;
&lt;p&gt;In particular, the custom pages are implemented using a &lt;a href="https://docs.datasette.io/en/stable/writing_plugins.html#writing-one-off-plugins"&gt;one-off site plugin&lt;/a&gt; (in the &lt;code&gt;plugins/&lt;/code&gt; folder) that adds extra routes to Datasette using the &lt;a href="https://docs.datasette.io/en/stable/plugin_hooks.html#register-routes-datasette"&gt;register_routes()&lt;/a&gt; plugin hook, &lt;a href="https://github.com/simonw/sfms-history/blob/cdf31999b9534b78c41ccfe6dc0c38067ca9cfe3/plugins/views.py#L5-L12"&gt;like this&lt;/a&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;span class="pl-en"&gt;@&lt;span class="pl-s1"&gt;hookimpl&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-k"&gt;def&lt;/span&gt; &lt;span class="pl-en"&gt;register_routes&lt;/span&gt;():
    &lt;span class="pl-k"&gt;return&lt;/span&gt; [
        (&lt;span class="pl-s"&gt;r"^/docs$"&lt;/span&gt;, &lt;span class="pl-s1"&gt;docs&lt;/span&gt;),
        (&lt;span class="pl-s"&gt;r"^/docs/(?P&amp;lt;document_id&amp;gt;[a-z0-9]+)$"&lt;/span&gt;, &lt;span class="pl-s1"&gt;document&lt;/span&gt;),
        (&lt;span class="pl-s"&gt;r"^/docs/(?P&amp;lt;document_id&amp;gt;[a-z0-9]+)/(?P&amp;lt;page&amp;gt;\d+)/?$"&lt;/span&gt;, &lt;span class="pl-s1"&gt;page&lt;/span&gt;),
        (&lt;span class="pl-s"&gt;r"^/folders/(?P&amp;lt;folder&amp;gt;.*)$"&lt;/span&gt;, &lt;span class="pl-s1"&gt;folder&lt;/span&gt;),
    ]&lt;/pre&gt;
&lt;p&gt;&lt;code&gt;docs&lt;/code&gt;, &lt;code&gt;document&lt;/code&gt;, &lt;code&gt;page&lt;/code&gt; and &lt;code&gt;folder&lt;/code&gt; are Python view functions that take a request and return a custom response.&lt;/p&gt;
&lt;p&gt;Here's the &lt;code&gt;docs&lt;/code&gt; function for example, which powers the listing of all documents at &lt;a href="https://archive.sfmicrosociety.org/docs"&gt;archive.sfmicrosociety.org/docs&lt;/a&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;span class="pl-k"&gt;async&lt;/span&gt; &lt;span class="pl-k"&gt;def&lt;/span&gt; &lt;span class="pl-en"&gt;docs&lt;/span&gt;(&lt;span class="pl-s1"&gt;datasette&lt;/span&gt;, &lt;span class="pl-s1"&gt;request&lt;/span&gt;):
    &lt;span class="pl-s1"&gt;db&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;datasette&lt;/span&gt;.&lt;span class="pl-en"&gt;get_database&lt;/span&gt;(&lt;span class="pl-s"&gt;"sfms"&lt;/span&gt;)
    &lt;span class="pl-s1"&gt;documents&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; [
        &lt;span class="pl-en"&gt;to_document&lt;/span&gt;(&lt;span class="pl-s1"&gt;doc&lt;/span&gt;)
        &lt;span class="pl-k"&gt;for&lt;/span&gt; &lt;span class="pl-s1"&gt;doc&lt;/span&gt; &lt;span class="pl-c1"&gt;in&lt;/span&gt; &lt;span class="pl-k"&gt;await&lt;/span&gt; &lt;span class="pl-s1"&gt;db&lt;/span&gt;.&lt;span class="pl-en"&gt;execute&lt;/span&gt;(
            &lt;span class="pl-s"&gt;"""&lt;/span&gt;
&lt;span class="pl-s"&gt;    select documents.*, count(*) as num_pages&lt;/span&gt;
&lt;span class="pl-s"&gt;    from pages join documents on pages.document_id = documents.id&lt;/span&gt;
&lt;span class="pl-s"&gt;    group by documents.id&lt;/span&gt;
&lt;span class="pl-s"&gt;    order by path&lt;/span&gt;
&lt;span class="pl-s"&gt;    """&lt;/span&gt;
        )
    ]
    &lt;span class="pl-k"&gt;return&lt;/span&gt; &lt;span class="pl-v"&gt;Response&lt;/span&gt;.&lt;span class="pl-en"&gt;html&lt;/span&gt;(
        &lt;span class="pl-k"&gt;await&lt;/span&gt; &lt;span class="pl-s1"&gt;datasette&lt;/span&gt;.&lt;span class="pl-en"&gt;render_template&lt;/span&gt;(&lt;span class="pl-s"&gt;"docs.html"&lt;/span&gt;, {&lt;span class="pl-s"&gt;"documents"&lt;/span&gt;: &lt;span class="pl-s1"&gt;documents&lt;/span&gt;}, &lt;span class="pl-s1"&gt;request&lt;/span&gt;)
    )&lt;/pre&gt;
&lt;p&gt;And here's &lt;a href="https://github.com/simonw/sfms-history/blob/cdf31999b9534b78c41ccfe6dc0c38067ca9cfe3/templates/docs.html"&gt;the docs.html template&lt;/a&gt; it uses.&lt;/p&gt;
&lt;p&gt;The &lt;a href="https://github.com/simonw/sfms-history/blob/cdf31999b9534b78c41ccfe6dc0c38067ca9cfe3/templates/index.html"&gt;index.html&lt;/a&gt; template which implements both the homepage and the search results page is worth a look too. It takes &lt;a href="https://github.com/simonw/sfms-history/blob/cdf31999b9534b78c41ccfe6dc0c38067ca9cfe3/plugins/index.py"&gt;a slightly different approach&lt;/a&gt;, using the &lt;a href="https://docs.datasette.io/en/stable/plugin_hooks.html#extra-template-vars-template-database-table-columns-view-name-request-datasette"&gt;extra_template_vars()&lt;/a&gt; plugin hook to populate extra custom template variables that can be used by that page to loop through the search results.&lt;/p&gt;
&lt;p&gt;I built that page first, before adding custom routes for the other pages - I should probably refactor that to use &lt;code&gt;register_routes()&lt;/code&gt; instead for consistency.&lt;/p&gt;
&lt;h4&gt;OCRing the documents&lt;/h4&gt;
&lt;p&gt;The documents started out in a Google Drive. I used the &lt;a href="https://rclone.org/"&gt;Rclone&lt;/a&gt; tool to sync that to an S3 bucket, and then ran my &lt;a href="https://datasette.io/tools/s3-ocr"&gt;s3-ocr tool&lt;/a&gt; against specific folders within that bucket to run them through AWS Textract.&lt;/p&gt;
&lt;p&gt;Then I ran the &lt;code&gt;s3-ocr index&lt;/code&gt; command to suck down the OCR results and build those into a SQLite database.&lt;/p&gt;
&lt;p&gt;The resulting schema wasn't quite right for this particular project. I wrote &lt;a href="https://github.com/simonw/sfms-history/blob/cdf31999b9534b78c41ccfe6dc0c38067ca9cfe3/build-db.sh"&gt;this build-db.sh script&lt;/a&gt; to construct a schema specifically designed for the archive, copy the data in and then enable full-text search on the resulting tables.&lt;/p&gt;
&lt;p&gt;The SQLite database file containing the initial OCR results is called &lt;code&gt;index.db&lt;/code&gt;. I used the following commands to attach that &lt;code&gt;index.db&lt;/code&gt; file to the same connection as the &lt;code&gt;sfms.db&lt;/code&gt; database and copy and re-shape the records that I wanted:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; Populate documents&lt;/span&gt;
sqlite-utils sfms.db --attach index2 index.db &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;$(&lt;/span&gt;cat &lt;span class="pl-s"&gt;&lt;span class="pl-k"&gt;&amp;lt;&amp;lt;&lt;/span&gt;&lt;span class="pl-k"&gt;EOF&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;insert into documents select&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;  substr(s3_ocr_etag, 2, 8) as id,&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;  key as title,&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;  key as path,&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;  replace(s3_ocr_etag, '"', '') as etag&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;from&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;  index2.ocr_jobs&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;where&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;  key in (&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;    select path from index2.pages&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;    where (&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;      folder like 'INTAKE/%'&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;      or folder like 'PUBLIC/%'&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;    )&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;    and folder not like '%PROCESSED INTAKE DOCUMENTS/%'&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;  )&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;&lt;span class="pl-k"&gt;EOF&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;

&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; Populate pages&lt;/span&gt;
sqlite-utils sfms.db --attach index2 index.db &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;$(&lt;/span&gt;cat &lt;span class="pl-s"&gt;&lt;span class="pl-k"&gt;&amp;lt;&amp;lt;&lt;/span&gt;&lt;span class="pl-k"&gt;EOF&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;insert into pages select distinct&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;  substr(s3_ocr_etag, 2, 8) || '-' || page as id,&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;  substr(s3_ocr_etag, 2, 8) as document_id,&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;  page,&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;  text&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;from index2.pages&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;  join index2.ocr_jobs&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;    on index2.pages.path = index2.ocr_jobs.key&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;where&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;  (&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;    folder like 'INTAKE/%'&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;    or folder like 'PUBLIC/%'&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;  )&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;  and folder not like '%PROCESSED INTAKE DOCUMENTS/%'&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;&lt;span class="pl-k"&gt;EOF&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This script is &lt;a href="https://github.com/simonw/sfms-history/blob/cdf31999b9534b78c41ccfe6dc0c38067ca9cfe3/.github/workflows/deploy.yml#L39-L41"&gt;run by GitHub Actions&lt;/a&gt; as part of the workflow that deploys the site to Vercel.&lt;/p&gt;
&lt;h4&gt;Images of pages&lt;/h4&gt;
&lt;p&gt;The site itself makes use of the OCRd text that is stored in the database - that's how the search feature works.&lt;/p&gt;
&lt;p&gt;But the scanned pages are visually so much more interesting than that!&lt;/p&gt;
&lt;p&gt;I wanted to display those page images as both thumbnails and larger images, partly to add visual appeal and partly because OCRd text loses a lot of the richness of the underlying records.&lt;/p&gt;
&lt;p&gt;Since I already had the PDFs in an S3 bucket, the easiest way to build this was using &lt;a href="https://imgix.com/"&gt;imgix&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;imgix offers a proxy service for running URL-defined transforms against images. I've used it on a bunch of projects before: it's really powerful.&lt;/p&gt;
&lt;p&gt;In this case, given an imgix source configured against an S3 bucket, an image URL for a specific page of a PDF can be constructed like this (newlines added for readability):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;https://sfms-history.imgix.net/
  PUBLIC/Scientific%20Documents/
  MicroscopicGymnastics_J.EdwardsSmith-04-06-1876.pdf
  ?page=1
  &amp;amp;w=200
  &amp;amp;auto=format
  &amp;amp;s=634c00249fbe4a2eda90f00be0ae66d6
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here we're providing the path to the PDF in the S3 bucket, requesting a render of page 1 of that PDF, at width 200px and using &lt;code&gt;auto=format&lt;/code&gt; which causes imgix to &lt;a href="https://docs.imgix.com/apis/rendering/auto/auto#format"&gt;serve the image in an optimized format&lt;/a&gt; based on the capabilities of the user's device.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;&amp;amp;s=&lt;/code&gt; parameter is a signature which can be configured to prevent people from constructing their own arbitrary URLs. In this case the S3 bucket has some files in that are not part of the public archive, so using a signature prevents people from guessing filenames and sniffing around in the bucket contents.&lt;/p&gt;
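&lt;p&gt;Here's a sketch of how that signature can be computed, following imgix's documented signing scheme: the &lt;code&gt;s=&lt;/code&gt; parameter is an MD5 hash of your secret token concatenated with the encoded path and query string. The token here is made up, and this is an illustration of the scheme rather than the site's actual signing code:&lt;/p&gt;

```python
import hashlib
from urllib.parse import quote

def sign_imgix_url(domain, token, path, params):
    # imgix signing: s = md5(token + encoded_path + "?" + query_string)
    encoded_path = quote(path)  # keeps "/", encodes spaces as %20
    query = "&".join(f"{key}={value}" for key, value in params.items())
    signature = hashlib.md5((token + encoded_path + "?" + query).encode("utf-8")).hexdigest()
    return f"https://{domain}{encoded_path}?{query}&s={signature}"

# Made-up token; the domain and parameters match the example above.
url = sign_imgix_url(
    "sfms-history.imgix.net",
    "not-a-real-token",
    "/PUBLIC/Scientific Documents/example.pdf",
    {"page": 1, "w": 200, "auto": "format"},
)
print(url)
```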
&lt;h4&gt;s3-ocr elsewhere&lt;/h4&gt;
&lt;p&gt;I built &lt;code&gt;s3-ocr&lt;/code&gt; for this specific project, but it's already starting to see use for other projects. My favourite example so far is the work Philip James has been doing to deploy search engines against OCRd meeting minutes for the cities of Alameda and Oakland (and soon more) - he wrote about that in detail in &lt;a href="https://phildini.dev/digitizing-55-000-pages-of-civic-meetings"&gt;Digitizing 55,000 pages of civic meetings&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;If you have an interesting archive of PDFs that need to be OCRd I would love to hear from you! Contact me &lt;a href="https://twitter.com/simonw"&gt;on Twitter&lt;/a&gt; or email me at &lt;code&gt;swillison&lt;/code&gt; at Google's webmail provider.&lt;/p&gt;
&lt;h4&gt;Also this week&lt;/h4&gt;
&lt;p&gt;I got very distracted by &lt;a href="https://simonwillison.net/2022/Aug/21/scotrail/"&gt;Analyzing ScotRail audio announcements with Datasette&lt;/a&gt; - which did at least produce a new, detailed tutorial describing my process for approaching new projects with Datasette.&lt;/p&gt;
&lt;p&gt;I added &lt;a href="https://simonwillison.net/2022/Aug/17/datasette-lite-plugins/"&gt;Plugin support to Datasette Lite&lt;/a&gt;.&lt;/p&gt;
&lt;h4&gt;Releases this week&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette-multiline-links"&gt;datasette-multiline-links&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/datasette-multiline-links/releases/tag/0.1"&gt;0.1&lt;/a&gt; - 2022-08-24
&lt;br /&gt;Make multiple newline separated URLs clickable in Datasette&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette-publish-vercel"&gt;datasette-publish-vercel&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/datasette-publish-vercel/releases/tag/0.14.2"&gt;0.14.2&lt;/a&gt; - (&lt;a href="https://github.com/simonw/datasette-publish-vercel/releases"&gt;23 releases total&lt;/a&gt;) - 2022-08-24
&lt;br /&gt;Datasette plugin for publishing data using Vercel&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette-edit-schema"&gt;datasette-edit-schema&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/datasette-edit-schema/releases/tag/0.5.1"&gt;0.5.1&lt;/a&gt; - (&lt;a href="https://github.com/simonw/datasette-edit-schema/releases"&gt;10 releases total&lt;/a&gt;) - 2022-08-22
&lt;br /&gt;Datasette plugin for modifying table schemas&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/dogsheep/pocket-to-sqlite"&gt;pocket-to-sqlite&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/dogsheep/pocket-to-sqlite/releases/tag/0.2.2"&gt;0.2.2&lt;/a&gt; - (&lt;a href="https://github.com/dogsheep/pocket-to-sqlite/releases"&gt;4 releases total&lt;/a&gt;) - 2022-08-22
&lt;br /&gt;Create a SQLite database containing data from your Pocket account&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette-mp3-audio"&gt;datasette-mp3-audio&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/datasette-mp3-audio/releases/tag/0.2"&gt;0.2&lt;/a&gt; - (&lt;a href="https://github.com/simonw/datasette-mp3-audio/releases"&gt;2 releases total&lt;/a&gt;) - 2022-08-20
&lt;br /&gt;Turn .mp3 URLs into an audio player in the Datasette interface&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/sqlite-diffable"&gt;sqlite-diffable&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/sqlite-diffable/releases/tag/0.5"&gt;0.5&lt;/a&gt; - (&lt;a href="https://github.com/simonw/sqlite-diffable/releases"&gt;6 releases total&lt;/a&gt;) - 2022-08-18
&lt;br /&gt;Tools for dumping/loading a SQLite database to diffable directory structure&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette-sentry"&gt;datasette-sentry&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/datasette-sentry/releases/tag/0.2"&gt;0.2&lt;/a&gt; - (&lt;a href="https://github.com/simonw/datasette-sentry/releases"&gt;5 releases total&lt;/a&gt;) - 2022-08-14
&lt;br /&gt;Datasette plugin for configuring Sentry&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette"&gt;datasette&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/datasette/releases/tag/0.62"&gt;0.62&lt;/a&gt; - (&lt;a href="https://github.com/simonw/datasette/releases"&gt;113 releases total&lt;/a&gt;) - 2022-08-14
&lt;br /&gt;An open source multi-tool for exploring and publishing data&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/s3-credentials"&gt;s3-credentials&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/s3-credentials/releases/tag/0.13"&gt;0.13&lt;/a&gt; - (&lt;a href="https://github.com/simonw/s3-credentials/releases"&gt;14 releases total&lt;/a&gt;) - 2022-08-12
&lt;br /&gt;A tool for creating credentials for accessing S3 buckets&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;TIL this week&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/javascript/openseadragon"&gt;Creating a tiled zoomable image with OpenSeadragon and vips&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/python/setup-py-from-url"&gt;Defining setup.py dependencies using a URL&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/googlecloud/google-cloud-spend-datasette"&gt;Analyzing Google Cloud spend with Datasette&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/sqlite/sort-by-number-of-json-intersections"&gt;Sort by number of JSON intersections&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/quarto/trying-out-quarto"&gt;Trying out Quarto on macOS&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/sqlite/function-list"&gt;Seeing which functions are unique to a specific SQLite / Datasette instance&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ocr"&gt;ocr&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pdf"&gt;pdf&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/weeknotes"&gt;weeknotes&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="ocr"/><category term="pdf"/><category term="projects"/><category term="datasette"/><category term="weeknotes"/></entry><entry><title>s3-ocr: Extract text from PDF files stored in an S3 bucket</title><link href="https://simonwillison.net/2022/Jun/30/s3-ocr/#atom-tag" rel="alternate"/><published>2022-06-30T21:40:27+00:00</published><updated>2022-06-30T21:40:27+00:00</updated><id>https://simonwillison.net/2022/Jun/30/s3-ocr/#atom-tag</id><summary type="html">
    &lt;p&gt;I've released &lt;strong&gt;&lt;a href="https://datasette.io/tools/s3-ocr"&gt;s3-ocr&lt;/a&gt;&lt;/strong&gt;, a new tool that runs Amazon's &lt;a href="https://aws.amazon.com/textract/"&gt;Textract&lt;/a&gt; OCR text extraction against PDF files in an S3 bucket, then writes the resulting text out to a SQLite database with full-text search configured so you can run searches against the extracted data.&lt;/p&gt;
&lt;p&gt;You can search through a demo of 697 pages of OCRd text at &lt;a href="https://s3-ocr-demo.datasette.io/pages/pages"&gt;s3-ocr-demo.datasette.io/pages/pages&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Textract works extremely well: it handles dodgy scanned PDFs full of typewritten code and reads handwritten text better than I can! It &lt;a href="https://aws.amazon.com/textract/pricing/"&gt;charges&lt;/a&gt; $1.50 per thousand pages processed.&lt;/p&gt;
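&lt;p&gt;At that rate the cost of a job is easy to estimate up front - the 697-page demo above, for example, comes out at just over a dollar:&lt;/p&gt;

```python
# Textract DetectDocumentText pricing quoted above: $1.50 per 1,000 pages.
def textract_cost_usd(pages, rate_per_thousand=1.50):
    return pages / 1000 * rate_per_thousand

demo_cost = textract_cost_usd(697)  # the 697-page demo
print(f"${demo_cost:.2f}")
```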
&lt;h4&gt;Why I built this&lt;/h4&gt;
&lt;p&gt;My initial need for this is a collaboration I have running with the &lt;a href="https://sfmicrosociety.org/"&gt;San Francisco Microscopical Society&lt;/a&gt;. They've been digitizing their archives - which stretch back to 1870! - and were looking for help turning the digital scans into something more useful.&lt;/p&gt;
&lt;p&gt;The archives are full of hand-written and type-written notes, scanned and stored as PDFs.&lt;/p&gt;
&lt;p&gt;I decided to wrap my work up as a tool because I'm sure there are a LOT of organizations out there with a giant bucket of PDF files that would benefit from being able to easily run OCR and turn the results into a searchable database.&lt;/p&gt;
&lt;p&gt;Running Textract directly against large numbers of files is somewhat inconvenient (here's my &lt;a href="https://til.simonwillison.net/aws/ocr-pdf-textract"&gt;earlier TIL about it&lt;/a&gt;). &lt;code&gt;s3-ocr&lt;/code&gt; is my attempt to make it easier.&lt;/p&gt;
&lt;h4&gt;Tutorial: How I built that demo&lt;/h4&gt;
&lt;p&gt;The demo instance uses three PDFs from the Library of Congress Harry Houdini Collection &lt;a href="https://archive.org/search.php?query=creator%3A%22Harry+Houdini+Collection+%28Library+of+Congress%29+DLC%22"&gt;on the Internet Archive&lt;/a&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://archive.org/details/unmaskingrobert00houdgoog"&gt;The unmasking of Robert-Houdin&lt;/a&gt; from 1908&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://archive.org/details/practicalmagicia00harr"&gt;The practical magician and ventriloquist's guide: a practical manual of fireside magic and conjuring illusions: containing also complete instructions for acquiring &amp;amp; practising the art of ventriloquism&lt;/a&gt; from 1876&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://archive.org/details/latestmagicbeing00hoff"&gt;Latest magic, being original conjuring tricks&lt;/a&gt; from 1918&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I started by downloading PDFs of those three files.&lt;/p&gt;
&lt;p&gt;Then I installed the two tools I needed:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;pip install s3-ocr s3-credentials
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I used my &lt;a href="https://datasette.io/tools/s3-credentials"&gt;s3-credentials&lt;/a&gt; tool to create a new S3 bucket and credentials with the ability to write files to it, with the new &lt;a href="https://github.com/simonw/s3-credentials/issues/72"&gt;--statement option&lt;/a&gt; (which I released today) to add &lt;code&gt;textract&lt;/code&gt; permissions to the generated credentials:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;s3-credentials create s3-ocr-demo --statement '{
  "Effect": "Allow",
  "Action": "textract:*",
  "Resource": "*"
}' --create-bucket &amp;gt; ocr.json
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;(Note that you don't need to use &lt;code&gt;s3-credentials&lt;/code&gt; at all if you have AWS credentials configured on your machine with root access to your account - just leave off the &lt;code&gt;-a ocr.json&lt;/code&gt; options in the following examples.)&lt;/p&gt;
&lt;p&gt;&lt;code&gt;s3-ocr-demo&lt;/code&gt; is now a bucket I can use for the demo. &lt;code&gt;ocr.json&lt;/code&gt; contains JSON with an access key and secret key for an IAM user account that can interact with that bucket, and also has permission to access the AWS Textract APIs.&lt;/p&gt;
&lt;p&gt;I uploaded my three PDFs to the bucket:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;s3-credentials put-object s3-ocr-demo latestmagicbeing00hoff.pdf latestmagicbeing00hoff.pdf -a ocr.json
s3-credentials put-object s3-ocr-demo practicalmagicia00harr.pdf practicalmagicia00harr.pdf -a ocr.json
s3-credentials put-object s3-ocr-demo unmaskingrobert00houdgoog.pdf unmaskingrobert00houdgoog.pdf -a ocr.json
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;(I often use &lt;a href="https://panic.com/transmit/"&gt;Transmit&lt;/a&gt; as a GUI for this kind of operation.)&lt;/p&gt;
&lt;p&gt;Then I kicked off OCR jobs against every PDF file in the bucket:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;% s3-ocr start s3-ocr-demo --all -a ocr.json 
Found 0 files with .s3-ocr.json out of 3 PDFs
Starting OCR for latestmagicbeing00hoff.pdf, Job ID: f66bc2d00fb75d1c42d1f829e5b6788891f9799fda404c4550580959f65a5402
Starting OCR for practicalmagicia00harr.pdf, Job ID: ef085728135d524a39bc037ad6f7253284b1fdbeb728dddcfbb260778d902b55
Starting OCR for unmaskingrobert00houdgoog.pdf, Job ID: 93bd46f02eb099eca369c41e384836d2bd3199b95d415c0257ef3fa3602cbef9
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;--all&lt;/code&gt; option scans for any file with a &lt;code&gt;.pdf&lt;/code&gt; extension. You can pass explicit file names instead if you just want to process one or two files at a time.&lt;/p&gt;
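&lt;p&gt;As a sketch (this is not the actual &lt;code&gt;s3-ocr&lt;/code&gt; source, and &lt;code&gt;pdfs_needing_ocr&lt;/code&gt; is a hypothetical helper name), the skip-already-processed filtering behind &lt;code&gt;--all&lt;/code&gt; boils down to checking for the companion &lt;code&gt;.s3-ocr.json&lt;/code&gt; marker files described later in this post:&lt;/p&gt;

```python
# Illustrative sketch of how "start --all" can decide which PDFs in a
# bucket still need OCR: any .pdf key without a matching
# KEY.s3-ocr.json marker file has not been submitted yet.
def pdfs_needing_ocr(keys):
    """Given a flat list of S3 keys, return the .pdf keys that do not
    yet have a matching .s3-ocr.json marker."""
    markers = {k for k in keys if k.endswith(".s3-ocr.json")}
    return [
        k for k in keys
        if k.lower().endswith(".pdf") and (k + ".s3-ocr.json") not in markers
    ]
```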
&lt;p&gt;This returns straight away, but the OCR process itself can take several minutes depending on the size of the files.&lt;/p&gt;
&lt;p&gt;The job IDs can be used to inspect the progress of each task like so:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;% s3-ocr inspect-job f66bc2d00fb75d1c42d1f829e5b6788891f9799fda404c4550580959f65a5402
{
  "DocumentMetadata": {
    "Pages": 244
  },
  "JobStatus": "SUCCEEDED",
  "DetectDocumentTextModelVersion": "1.0"
}
&lt;/code&gt;&lt;/pre&gt;
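&lt;p&gt;A minimal sketch of the polling this relies on - the Textract client is passed in (e.g. &lt;code&gt;boto3.client("textract")&lt;/code&gt;) so the logic reads without AWS credentials, and &lt;code&gt;wait_for_job&lt;/code&gt; is an illustrative name rather than part of &lt;code&gt;s3-ocr&lt;/code&gt;'s API:&lt;/p&gt;

```python
import time

def wait_for_job(textract, job_id, poll_seconds=10):
    """Poll GetDocumentTextDetection until the job leaves IN_PROGRESS,
    then return the final JobStatus (SUCCEEDED or FAILED)."""
    while True:
        response = textract.get_document_text_detection(JobId=job_id)
        status = response["JobStatus"]
        if status != "IN_PROGRESS":
            return status
        time.sleep(poll_seconds)
```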
&lt;p&gt;Once the job completed, I could preview the text extracted from the PDF like so:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;% s3-ocr text s3-ocr-demo latestmagicbeing00hoff.pdf
111
.
116

LATEST MAGIC
BEING
ORIGINAL CONJURING TRICKS
INVENTED AND ARRANGED
BY
PROFESSOR HOFFMANN
(ANGELO LEWIS, M.A.)
Author of "Modern Magic," etc.
WITH NUMEROUS ILLUSTRATIONS
FIRST EDITION
NEW YORK
SPON &amp;amp; CHAMBERLAIN, 120 LIBERTY ST.
...
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;To create a SQLite database with a table containing rows for every page of scanned text, I ran this command:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;% s3-ocr index s3-ocr-demo pages.db -a ocr.json 
Fetching job details  [####################################]  100%
Populating pages table  [####--------------------------------]   13%  00:00:34
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I then published the resulting &lt;code&gt;pages.db&lt;/code&gt; SQLite database using Datasette - you can &lt;a href="https://s3-ocr-demo.datasette.io/pages"&gt;explore it here&lt;/a&gt;.&lt;/p&gt;
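&lt;p&gt;The shape of that database is roughly this - one row per scanned page, keyed by source PDF and page number (column names here are illustrative, not guaranteed to match &lt;code&gt;s3-ocr&lt;/code&gt;'s actual schema):&lt;/p&gt;

```python
import sqlite3

def build_pages_db(path, rows):
    """Create a pages table and load it; rows is an iterable of
    (path, page, text) tuples, one per OCRed page."""
    db = sqlite3.connect(path)
    db.execute(
        """
        CREATE TABLE IF NOT EXISTS pages (
            path TEXT, page INTEGER, text TEXT,
            PRIMARY KEY (path, page)
        )
        """
    )
    db.executemany("INSERT INTO pages VALUES (?, ?, ?)", rows)
    db.commit()
    return db
```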
&lt;h4&gt;How s3-ocr works&lt;/h4&gt;
&lt;p&gt;&lt;code&gt;s3-ocr&lt;/code&gt; works by calling Amazon's S3 and Textract APIs.&lt;/p&gt;
&lt;p&gt;Textract only works against PDF files in &lt;a href="https://docs.aws.amazon.com/textract/latest/dg/api-async.html"&gt;asynchronous mode&lt;/a&gt;: you call an API endpoint to tell it "start running OCR against this PDF file in this S3 bucket", then wait for it to finish - which can take several minutes.&lt;/p&gt;
&lt;p&gt;Textract defaults to storing the OCR results in its own storage, which expires after seven days. You can instead tell it to store them in your own S3 bucket - &lt;code&gt;s3-ocr&lt;/code&gt; uses that option.&lt;/p&gt;
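&lt;p&gt;The underlying boto3 call looks something like this, with &lt;code&gt;OutputConfig&lt;/code&gt; directing results to your own bucket. The client is injected so the sketch reads without AWS, and &lt;code&gt;start_ocr_job&lt;/code&gt; is an illustrative name, not &lt;code&gt;s3-ocr&lt;/code&gt;'s internal function:&lt;/p&gt;

```python
def start_ocr_job(textract, bucket, key, output_prefix="textract-output"):
    """Kick off asynchronous text detection for one PDF in S3,
    writing results under output_prefix in the same bucket."""
    response = textract.start_document_text_detection(
        DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}},
        OutputConfig={"S3Bucket": bucket, "S3Prefix": output_prefix},
    )
    return response["JobId"]
```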
&lt;p&gt;A design challenge I faced was that I wanted to make the command restartable and resumable: if the user cancelled the task, I wanted to be able to pick up from where it had got to. I also wanted to be able to run it again after adding more PDFs to the bucket without repeating work for the previously processed files.&lt;/p&gt;
&lt;p&gt;I also needed to persist those job IDs: Textract writes the OCR results to keys in the bucket called &lt;code&gt;textract-output/JOB_ID/1-?&lt;/code&gt; - but there's no indication as to which PDF file the results correspond to.&lt;/p&gt;
&lt;p&gt;My solution is to write tiny extra JSON files to the bucket when the OCR job is first started.&lt;/p&gt;
&lt;p&gt;If you have a file called &lt;code&gt;latestmagicbeing00hoff.pdf&lt;/code&gt; the &lt;code&gt;start&lt;/code&gt; command will create a new file called &lt;code&gt;latestmagicbeing00hoff.pdf.s3-ocr.json&lt;/code&gt; with the following content:&lt;/p&gt;
&lt;div class="highlight highlight-source-json"&gt;&lt;pre&gt;{
  &lt;span class="pl-ent"&gt;"job_id"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;f66bc2d00fb75d1c42d1f829e5b6788891f9799fda404c4550580959f65a5402&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
  &lt;span class="pl-ent"&gt;"etag"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;span class="pl-cce"&gt;\"&lt;/span&gt;d79af487579dcbbef26c9b3be763eb5e-2&lt;span class="pl-cce"&gt;\"&lt;/span&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;
}&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This associates the job ID with the PDF file. It also records the original ETag of the PDF file - this is so in the future I can implement a system that can re-run OCR if the PDF has been updated.&lt;/p&gt;
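&lt;p&gt;The marker is simple enough to sketch in a couple of lines - &lt;code&gt;marker_key&lt;/code&gt; and &lt;code&gt;marker_body&lt;/code&gt; are hypothetical helper names used here for illustration:&lt;/p&gt;

```python
import json

def marker_key(pdf_key):
    """The S3 key for the marker that accompanies a submitted PDF."""
    return pdf_key + ".s3-ocr.json"

def marker_body(job_id, etag):
    """The JSON body pairing a Textract job ID with the PDF's ETag
    at submission time."""
    return json.dumps({"job_id": job_id, "etag": etag})
```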
&lt;p&gt;The existence of these files lets me do two things:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;If you run &lt;code&gt;s3-ocr start s3-ocr-demo --all&lt;/code&gt; it can avoid re-submitting PDF files that have already been sent for OCR, by checking for the existence of the &lt;code&gt;.s3-ocr.json&lt;/code&gt; file.&lt;/li&gt;
&lt;li&gt;When you later ask for the results of the OCR it can use these files to associate the PDF with the results.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Scattering &lt;code&gt;.s3-ocr.json&lt;/code&gt; files all over the place feels a little messy, so I have an &lt;a href="https://github.com/simonw/s3-ocr/issues/14"&gt;open issue&lt;/a&gt; considering moving them all to a &lt;code&gt;s3-ocr/&lt;/code&gt; prefix in the bucket instead.&lt;/p&gt;
&lt;h4&gt;Try it and let me know what you think&lt;/h4&gt;
&lt;p&gt;This is a brand new project, but I think it's ready for other people to start trying it out.&lt;/p&gt;
&lt;p&gt;I ran it against around 7,000 pages from 531 PDF files in the San Francisco Microscopy Society archive and it seemed to work well!&lt;/p&gt;
&lt;p&gt;If you try this out and it works (or it doesn't work) please &lt;a href="https://twitter.com/simonw"&gt;let me know via Twitter&lt;/a&gt; or &lt;a href="https://github.com/simonw/s3-ocr"&gt;GitHub&lt;/a&gt;.&lt;/p&gt;
&lt;h4&gt;A challenging example page&lt;/h4&gt;
&lt;p&gt;Here's one of the more challenging pages I processed using Textract:&lt;/p&gt;
&lt;p&gt;&lt;img alt="A very old page of difficult to read handwriting" src="https://static.simonwillison.net/static/2022/s3-ocr-sample-handwriting.jpg" style="max-width: 100%"/&gt;&lt;/p&gt;
&lt;p&gt;Here's the result:&lt;/p&gt;
&lt;pre&gt;
In. In J a ... the Joe 14
162
Volxv
Lalpa spinosa, Eggt bud development. of
146
Farcomas spindle. cells in nested gowers 271
Fayigaga tridactylites, leaf glaur of ruce 33
staining &amp;amp; mounting
Stiles 133
tilica films, a structure of Diatoins morehouse 38
thile new microscopic
Broeck 22 /
Smith reproduction in the huntroom tribe
6
Trakes, develop mouht succession of the porsion tango/229
Soirce President of the Roy: truc: Soo
285
forby, Presidents address
105
pongida, difficulties of classification
238
tage, american adjustable concentric
150
ttlese staining &amp;amp; mountring wood sections 133
Stodder, Frustulia Iasconica, havicula
chomboides, &amp;amp; havi cula crassinervis 265
Vol XVI
falicylic acid u movorcopy
160
falpar enctry ology of
Brooke 9.97
Sanderson micros: characters If inflammation
43
tap, circulation of the
42
Jars, structure of the genus Brisinga
44
latter throvite connective substances 191- 241
Jehorey Cessification in birds, formation
of ed blood corpuseles during the
ossification process
by
&lt;/pre&gt;
&lt;h4&gt;Releases this week&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/s3-ocr"&gt;s3-ocr&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/s3-ocr/releases/tag/0.4"&gt;0.4&lt;/a&gt; - (&lt;a href="https://github.com/simonw/s3-ocr/releases"&gt;4 releases total&lt;/a&gt;) - 2022-06-30
&lt;br /&gt;Tools for running OCR against files stored in S3&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/s3-credentials"&gt;s3-credentials&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/s3-credentials/releases/tag/0.12"&gt;0.12&lt;/a&gt; - (&lt;a href="https://github.com/simonw/s3-credentials/releases"&gt;12 releases total&lt;/a&gt;) - 2022-06-30
&lt;br /&gt;A tool for creating credentials for accessing S3 buckets&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette-scale-to-zero"&gt;datasette-scale-to-zero&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/datasette-scale-to-zero/releases/tag/0.1.2"&gt;0.1.2&lt;/a&gt; - (&lt;a href="https://github.com/simonw/datasette-scale-to-zero/releases"&gt;3 releases total&lt;/a&gt;) - 2022-06-23
&lt;br /&gt;Quit Datasette if it has not received traffic for a specified time period&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;TIL this week&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/sqlite/one-line-csv-operations"&gt;One-liner for running queries against CSV files with SQLite&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/bash/ignore-errors"&gt;Ignoring errors in a section of a Bash script&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/aws/ocr-pdf-textract"&gt;Running OCR against a PDF file with AWS Textract&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/aws"&gt;aws&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ocr"&gt;ocr&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pdf"&gt;pdf&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/s3"&gt;s3&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/weeknotes"&gt;weeknotes&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/s3-credentials"&gt;s3-credentials&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="aws"/><category term="ocr"/><category term="pdf"/><category term="projects"/><category term="s3"/><category term="weeknotes"/><category term="s3-credentials"/></entry><entry><title>Automate the Boring Stuff with Python: Working with PDF and Word Documents</title><link href="https://simonwillison.net/2019/Nov/6/working-pdf/#atom-tag" rel="alternate"/><published>2019-11-06T16:17:20+00:00</published><updated>2019-11-06T16:17:20+00:00</updated><id>https://simonwillison.net/2019/Nov/6/working-pdf/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://automatetheboringstuff.com/chapter13/"&gt;Automate the Boring Stuff with Python: Working with PDF and Word Documents&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
I stumbled across this while trying to extract some data from a PDF file (the kind of file with actual text in it as opposed to dodgy scanned images) and it worked perfectly: &lt;code&gt;PyPDF2.PdfFileReader(open("file.pdf", "rb")).getPage(0).extractText()&lt;/code&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/pdf"&gt;pdf&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;&lt;/p&gt;



</summary><category term="pdf"/><category term="python"/></entry><entry><title>arxiv-vanity</title><link href="https://simonwillison.net/2017/Oct/25/arxiv-vanity/#atom-tag" rel="alternate"/><published>2017-10-25T20:06:40+00:00</published><updated>2017-10-25T20:06:40+00:00</updated><id>https://simonwillison.net/2017/Oct/25/arxiv-vanity/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/arxiv-vanity/arxiv-vanity"&gt;arxiv-vanity&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Beautiful new project from Ben Firshman and Andreas Jansson: “Arxiv Vanity renders academic papers from Arxiv as responsive web pages so you don’t have to squint at a PDF”. It works by pulling the raw LaTeX source code from Arxiv and rendering it to HTML using a heavily customized Pandoc workflow. The real fun is in the architecture: it’s a Django app running on Heroku which fires up on-demand Hyper.sh Docker containers for each individual rendering job.

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/bfirsh/status/922508174829203458"&gt;Ben Firshman&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ben-firshman"&gt;ben-firshman&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/django"&gt;django&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pdf"&gt;pdf&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/science"&gt;science&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/docker"&gt;docker&lt;/a&gt;&lt;/p&gt;



</summary><category term="ben-firshman"/><category term="django"/><category term="pdf"/><category term="science"/><category term="docker"/></entry><entry><title>pdf.js</title><link href="https://simonwillison.net/2010/Jun/17/pdfjs/#atom-tag" rel="alternate"/><published>2010-06-17T19:39:00+00:00</published><updated>2010-06-17T19:39:00+00:00</updated><id>https://simonwillison.net/2010/Jun/17/pdfjs/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="http://www.maraksquires.com/pdf.js/"&gt;pdf.js&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
A JavaScript library for creating simple PDF files. Works (flakily) in your browser using a data:URI hack, but is also compatible with server-side JavaScript implementations such as Node.js.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/datauri"&gt;datauri&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/javascript"&gt;javascript&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/nodejs"&gt;nodejs&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pdf"&gt;pdf&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/recovered"&gt;recovered&lt;/a&gt;&lt;/p&gt;



</summary><category term="datauri"/><category term="javascript"/><category term="nodejs"/><category term="pdf"/><category term="recovered"/></entry><entry><title>node.js at JSConf.eu (PDF)</title><link href="https://simonwillison.net/2009/Nov/17/nodepdf/#atom-tag" rel="alternate"/><published>2009-11-17T18:07:37+00:00</published><updated>2009-11-17T18:07:37+00:00</updated><id>https://simonwillison.net/2009/Nov/17/nodepdf/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="http://nodejs.org/jsconf.pdf"&gt;node.js at JSConf.eu (PDF)&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
node.js creator Ryan Dahl’s presentation at this year’s JSConf.eu. The principle philosophy is that I/O in web applications should be asynchronous—for everything. No blocking for database calls, no blocking for filesystem access. JavaScript is a mainstream programming language with a culture of callback APIs (thanks to the DOM) and is hence ideally suited to building asynchronous frameworks.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/asynchronous"&gt;asynchronous&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/eventio"&gt;eventio&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/javascript"&gt;javascript&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/nodejs"&gt;nodejs&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pdf"&gt;pdf&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ryan-dahl"&gt;ryan-dahl&lt;/a&gt;&lt;/p&gt;



</summary><category term="asynchronous"/><category term="eventio"/><category term="javascript"/><category term="nodejs"/><category term="pdf"/><category term="ryan-dahl"/></entry><entry><title>Adobe is Bad for Open Government</title><link href="https://simonwillison.net/2009/Nov/1/sunlight/#atom-tag" rel="alternate"/><published>2009-11-01T12:51:20+00:00</published><updated>2009-11-01T12:51:20+00:00</updated><id>https://simonwillison.net/2009/Nov/1/sunlight/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="http://sunlightlabs.com/blog/2009/adobe-bad-open-government/"&gt;Adobe is Bad for Open Government&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
The problem isn’t just that PDFs are a bad way of sharing data, it’s that Adobe have been actively lobbying the US government to use their PDF and Flash formats for open government initiatives.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/adobe"&gt;adobe&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/flash"&gt;flash&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/opengovernment"&gt;opengovernment&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pdf"&gt;pdf&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sunlightfoundation"&gt;sunlightfoundation&lt;/a&gt;&lt;/p&gt;



</summary><category term="adobe"/><category term="flash"/><category term="opengovernment"/><category term="pdf"/><category term="sunlightfoundation"/></entry><entry><title>No PDFs!</title><link href="https://simonwillison.net/2009/Nov/1/pdfs/#atom-tag" rel="alternate"/><published>2009-11-01T12:04:36+00:00</published><updated>2009-11-01T12:04:36+00:00</updated><id>https://simonwillison.net/2009/Nov/1/pdfs/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="http://blog.sunlightfoundation.com/2009/06/05/no-pdfs/"&gt;No PDFs!&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
The Sunlight Foundation point out that PDFs are a terrible way of implementing “more transparent government” due to their general lack of structure. At the Guardian (and I’m sure at other newspapers) we waste an absurd amount of time manually extracting data from PDF files and turning it in to something more useful. Even CSV is significantly more useful for many types of information.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/adobe"&gt;adobe&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/csv"&gt;csv&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/open-data"&gt;open-data&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/opengovernment"&gt;opengovernment&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pdf"&gt;pdf&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sunlightfoundation"&gt;sunlightfoundation&lt;/a&gt;&lt;/p&gt;



</summary><category term="adobe"/><category term="csv"/><category term="open-data"/><category term="opengovernment"/><category term="pdf"/><category term="sunlightfoundation"/></entry><entry><title>Prawn</title><link href="https://simonwillison.net/2009/Jan/16/prawn/#atom-tag" rel="alternate"/><published>2009-01-16T16:04:22+00:00</published><updated>2009-01-16T16:04:22+00:00</updated><id>https://simonwillison.net/2009/Jan/16/prawn/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="http://prawn.majesticseacreature.com/"&gt;Prawn&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Really nice PDF generation library for Ruby, used to generate Dopplr’s beautiful end of year reports.

    &lt;p&gt;&lt;small&gt;Via &lt;a href="http://simonwillison.net/2009/Jan/16/dopplr/#c43144"&gt;Matt Biddulph&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/dopplr"&gt;dopplr&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pdf"&gt;pdf&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/prawn"&gt;prawn&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ruby"&gt;ruby&lt;/a&gt;&lt;/p&gt;



</summary><category term="dopplr"/><category term="pdf"/><category term="prawn"/><category term="ruby"/></entry></feed>