<?xml version="1.0" encoding="utf-8"?>
<feed xml:lang="en-us" xmlns="http://www.w3.org/2005/Atom"><title>Simon Willison's Weblog: pdf</title><link href="http://simonwillison.net/" rel="alternate"/><link href="http://simonwillison.net/tags/pdf.atom" rel="self"/><id>http://simonwillison.net/</id><updated>2025-12-12T23:29:51+00:00</updated><author><name>Simon Willison</name></author><entry><title>OpenAI are quietly adopting skills, now available in ChatGPT and Codex CLI</title><link href="https://simonwillison.net/2025/Dec/12/openai-skills/#atom-tag" rel="alternate"/><published>2025-12-12T23:29:51+00:00</published><updated>2025-12-12T23:29:51+00:00</updated><id>https://simonwillison.net/2025/Dec/12/openai-skills/#atom-tag</id><summary type="html">
    &lt;p&gt;One of the things that most excited me about &lt;a href="https://simonwillison.net/2025/Oct/16/claude-skills/"&gt;Anthropic's new Skills mechanism&lt;/a&gt; back in October was how easy it looked for other platforms to implement. A skill is just a folder with a Markdown file and some optional extra resources and scripts, so any LLM tool with the ability to navigate and read from a filesystem should be capable of using them. It turns out OpenAI are doing exactly that, with skills support quietly showing up in both their Codex CLI tool and now also in ChatGPT itself.&lt;/p&gt;
&lt;h4 id="skills-in-chatgpt"&gt;Skills in ChatGPT&lt;/h4&gt;
&lt;p&gt;I learned about this &lt;a href="https://x.com/elias_judin/status/1999491647563006171"&gt;from Elias Judin&lt;/a&gt; this morning. It turns out the Code Interpreter feature of ChatGPT now has a new &lt;code&gt;/home/oai/skills&lt;/code&gt; folder which you can access simply by prompting:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;Create a zip file of /home/oai/skills&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I &lt;a href="https://chatgpt.com/share/693c9645-caa4-8006-9302-0a9226ea7599"&gt;tried that myself&lt;/a&gt; and got back &lt;a href="https://static.simonwillison.net/static/cors-allow/2025/skills.zip"&gt;this zip file&lt;/a&gt;. Here's &lt;a href="https://tools.simonwillison.net/zip-wheel-explorer?url=https%3A%2F%2Fstatic.simonwillison.net%2Fstatic%2Fcors-allow%2F2025%2Fskills.zip"&gt;a UI for exploring its content&lt;/a&gt; (&lt;a href="https://tools.simonwillison.net/colophon#zip-wheel-explorer.html"&gt;more about that tool&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/skills-explore.jpg" alt="Screenshot of file explorer. Files skills/docs/render_docsx.py and skills/docs/skill.md and skills/pdfs/ and skills/pdfs/skill.md - that last one is expanded and reads: # PDF reading, creation, and review guidance  ## Reading PDFs - Use pdftoppm -png $OUTDIR/$BASENAME.pdf $OUTDIR/$BASENAME to convert PDFs to PNGs. - Then open the PNGs and read the images. - pdfplumber is also installed and can be used to read PDFs. It can be used as a complementary tool to pdftoppm but not replacing it. - Only do python printing as a last resort because you will miss important details with text extraction (e.g. figures, tables, diagrams).  ## Primary tooling for creating PDFs - Generate PDFs programmatically with reportlab as the primary tool. In most cases, you should use reportlab to create PDFs. - If there are other packages you think are necessary for the task (eg. pypdf, pyMuPDF), you can use them but you may need topip install them first. - After each meaningful update—content additions, layout adjustments, or style changes—render the PDF to images to check layout fidelity:   - pdftoppm -png $INPUT_PDF $OUTPUT_PREFIX - Inspect every exported PNG before continuing work. If anything looks off, fix the source and re-run the render → inspect loop until the pages are clean.  ## Quality expectations - Maintain a polished, intentional visual design: consistent typography, spacing, margins, color palette, and clear section breaks across all pages. - Avoid major rendering issues—no clipped text, overlapping elements, black squares, broken tables, or unreadable glyphs. The rendered pages should look like a curated document, not raw template output. - Charts, tables, diagrams, and images must be sharp, well-aligned, and properly labeled in the PNGs. Legends and axes should be readable without excessive zoom. 
- Text must be readable at normal viewing size; avoid walls of filler text or dense, unstructured bullet lists. Use whitespace to separate ideas. - Never use the U+2011 non-breaking hyphen or other unicode dashes as they will not be" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;So far the skills cover spreadsheets, docx files, and PDFs. Interestingly, their chosen approach for PDFs and documents is to convert them to rendered per-page PNGs and then pass those to their vision-enabled GPT models, presumably to preserve information from layout and graphics that would be lost by plain text extraction.&lt;/p&gt;
&lt;p&gt;Elias &lt;a href="https://github.com/eliasjudin/oai-skills"&gt;shared copies in a GitHub repo&lt;/a&gt;. They look very similar to Anthropic's implementation of the same kind of idea, currently published in their &lt;a href="https://github.com/anthropics/skills/tree/main/skills"&gt;anthropics/skills&lt;/a&gt; repository.&lt;/p&gt;
&lt;p&gt;I tried it out by prompting:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Create a PDF with a summary of the rimu tree situation right now and what it means for kakapo breeding season&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Sure enough, GPT-5.2 Thinking started with:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Reading skill.md for PDF creation guidelines&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Then:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Searching rimu mast and Kākāpō 2025 breeding status&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;It took &lt;a href="https://chatgpt.com/share/693ca54b-f770-8006-904b-9f31a585180a"&gt;just over eleven minutes&lt;/a&gt; to produce &lt;a href="https://static.simonwillison.net/static/cors-allow/2025/rimu_kakapo_breeding_brief.pdf"&gt;this PDF&lt;/a&gt;, which was long enough that I had Claude Code for web &lt;a href="https://github.com/simonw/tools/pull/155"&gt;build me a custom PDF viewing tool&lt;/a&gt; while I waited.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://tools.simonwillison.net/view-pdf?url=https%3A%2F%2Fstatic.simonwillison.net%2Fstatic%2Fcors-allow%2F2025%2Frimu_kakapo_breeding_brief.pdf"&gt;Here's ChatGPT's PDF in that tool&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/rimu.jpg" alt="Screenshot of my tool. There is a URL at the top, a Load PDF button and pagination controls. Then the PDF itself is shown, which reads: Rimu mast status and what it means for the kākāpō breeding season Summary as of 12 December 2025 (Pacific/Auckland context) Kākāpō breeding is tightly linked to rimu (Dacrydium cupressinum) mast events: when rimu trees set and ripen large amounts of fruit, female kākāpō are much more likely to nest, and more chicks can be successfully raised. Current monitoring indicates an unusually strong rimu fruiting signal heading into the 2025/26 season, which sets the stage for a potentially large breeding year in 2026.^1,2 Key numbers at a glance Kākāpō population (official DOC count) 237 birds alive Breeding trigger (rimu fruiting)&amp;gt;10% of rimu branch tips bearing fruit Forecast rimu fruiting for 2026 (DOC monitoring) Around 50–60% fruiting across breeding islands¹Breeding-age females (DOC 2025 planning figure)About 87 females (potentially nearly all could nest)" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;(I am &lt;strong&gt;very excited&lt;/strong&gt; about &lt;a href="https://www.auckland.ac.nz/en/news/2025/12/03/bumper-breeding-season-for-kakapo-on-the-cards.html"&gt;Kākāpō breeding season this year&lt;/a&gt;.)&lt;/p&gt;
&lt;p&gt;The reason it took so long is that it was fastidious about looking at and tweaking its own work. I appreciated that at one point it tried rendering the PDF and noticed that the macrons in kākāpō were not supported by the chosen font, so it switched to something else:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/skills-macrons.jpg" alt="ChatGPT screenshot. Analyzed image. There's an image of a page of PDF with obvious black blocks on some of the letters in the heading. It then says: Fixing font issues with macrons. The page is showing black squares for words like &amp;quot;kākāpō,&amp;quot; probably because Helvetica can't handle macrons. I'll switch to a font that supports them, such as DejaVu Sans or Noto Sans. I'll register both regular and bold fonts, then apply them to the document. I'll update the footer to note the issue with Helvetica. Time to rebuild the PDF!" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;h4 id="skills-in-codex-cli"&gt;Skills in Codex CLI&lt;/h4&gt;
&lt;p&gt;Meanwhile, two weeks ago OpenAI's open source Codex CLI tool landed a PR titled &lt;a href="https://github.com/openai/codex/pull/7412"&gt;feat: experimental support for skills.md&lt;/a&gt;. The most recent docs for that are in &lt;a href="https://github.com/openai/codex/blob/main/docs/skills.md"&gt;docs/skills.md&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The documentation suggests that any folder in &lt;code&gt;~/.codex/skills&lt;/code&gt; will be treated as a skill.&lt;/p&gt;
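&lt;p&gt;A skill here is the same shape Anthropic defined: a folder whose &lt;code&gt;SKILL.md&lt;/code&gt; starts with YAML frontmatter giving the skill a name and description. A minimal sketch - the names below are my own illustration, not taken from the Codex docs - looks like this:&lt;/p&gt;

```markdown
---
name: my-example-skill
description: One-line summary the model uses to decide when to load this skill
---

# My example skill

Detailed instructions for the model go here as plain Markdown. Extra
reference files and helper scripts can live alongside this file in the
same folder.
```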
&lt;p&gt;I dug around and found the code that generates the prompt that drives the skill system in &lt;a href="https://github.com/openai/codex/blob/ad7b9d63c326d5c92049abd16f9f5fb64a573a69/codex-rs/core/src/skills/render.rs#L20-L38"&gt;codex-rs/core/src/skills/render.rs&lt;/a&gt; - here's a Gist with &lt;a href="https://gist.github.com/simonw/25f2c3a9e350274bc2b76a79bc8ae8b2"&gt;a more readable version of that prompt&lt;/a&gt;.&lt;/p&gt;
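&lt;p&gt;That Rust code essentially walks the skills directory, reads each &lt;code&gt;SKILL.md&lt;/code&gt;, and injects a bullet list of the available skills into the prompt. Here's a rough Python sketch of the idea - my own approximation of the mechanism, not the actual Codex implementation:&lt;/p&gt;

```python
from pathlib import Path

def parse_frontmatter(text):
    """Naive parse of the name/description YAML frontmatter in a SKILL.md."""
    meta = {}
    if text.startswith("---"):
        for line in text.split("---")[1].strip().splitlines():
            key, _, value = line.partition(":")
            meta[key.strip()] = value.strip()
    return meta

def render_skills_prompt(skills_dir):
    """Build the bullet list of skills that gets injected into the prompt."""
    lines = []
    for skill_md in sorted(Path(skills_dir).glob("*/SKILL.md")):
        meta = parse_frontmatter(skill_md.read_text())
        name = meta.get("name", skill_md.parent.name)
        description = meta.get("description", "")
        lines.append(f"- {name} — {description} (file: {skill_md})")
    return "\n".join(lines)
```

&lt;p&gt;Running something like that against a skills folder produces a list very similar to what Codex printed when I later prompted it to list its skills.&lt;/p&gt;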
&lt;p&gt;I &lt;a href="https://claude.ai/share/0a9b369b-f868-4065-91d1-fd646c5db3f4"&gt;used Claude Opus 4.5's skill authoring skill&lt;/a&gt; to create &lt;a href="https://github.com/datasette/skill"&gt;this skill for creating Datasette plugins&lt;/a&gt;, then installed it into my Codex CLI skills folder like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;git clone https://github.com/datasette/skill \
  &lt;span class="pl-k"&gt;~&lt;/span&gt;/.codex/skills/datasette-plugin&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;You have to run Codex with the &lt;code&gt;--enable skills&lt;/code&gt; option. I ran this:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-c1"&gt;cd&lt;/span&gt; /tmp
mkdir datasette-cowsay
&lt;span class="pl-c1"&gt;cd&lt;/span&gt; datasette-cowsay
codex --enable skills -m gpt-5.2&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Then prompted:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;list skills&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;And Codex replied:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;- datasette-plugins — Writing Datasette plugins using Python + pluggy (file: /Users/simon/.codex/skills/datasette-plugin/SKILL.md)&lt;/code&gt;&lt;br /&gt;
&lt;code&gt;- Discovery — How to find/identify available skills (no SKILL.md path provided in the list)&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Then I said:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;Write a Datasette plugin in this folder adding a /-/cowsay?text=hello page that displays a pre with cowsay from PyPI saying that text&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;It worked perfectly! Here's &lt;a href="https://github.com/simonw/datasette-cowsay"&gt;the plugin code it wrote&lt;/a&gt; and here's &lt;a href="http://gistpreview.github.io/?96ee928370b18eabc2e0fad9aaa46d4b"&gt;a copy of the full Codex CLI transcript&lt;/a&gt;, generated with my &lt;a href="https://simonwillison.net/2025/Oct/23/claude-code-for-web-video/"&gt;terminal-to-html tool&lt;/a&gt;.&lt;/p&gt;
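&lt;p&gt;The plugin itself is tiny - it uses Datasette's &lt;code&gt;register_routes&lt;/code&gt; plugin hook. Here's a sketch of the general shape (see the linked repo for the code Codex actually wrote; the fallback stubs are mine, so the sketch runs even without Datasette or cowsay installed):&lt;/p&gt;

```python
import html

try:
    from datasette import hookimpl, Response
except ImportError:  # stub so this sketch runs without Datasette installed
    def hookimpl(func):
        return func
    Response = None

try:
    import cowsay

    def render_cow(text):
        return cowsay.get_output_string("cow", text)
except ImportError:  # stand-in when the cowsay package is unavailable

    def render_cow(text):
        return f"< {text} >\n  ... (ASCII cow goes here)"

def cowsay_html(text):
    # Escape user-supplied text before embedding it in a <pre> block
    return "<pre>{}</pre>".format(html.escape(render_cow(text)))

@hookimpl
def register_routes():
    async def cowsay_page(request):
        text = request.args.get("text") or "moo"
        return Response.html(cowsay_html(text))
    return [(r"^/-/cowsay$", cowsay_page)]
```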
&lt;p&gt;If you have &lt;code&gt;uvx&lt;/code&gt; installed you can try that out yourself like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;uvx --with https://github.com/simonw/datasette-cowsay/archive/refs/heads/main.zip \
  datasette&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Then visit:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;http://127.0.0.1:8001/-/cowsay?text=This+is+pretty+fun
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/cowsay-datasette.jpg" alt="Screenshot of that URL in Firefox, an ASCII art cow says This is pretty fun." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;h4 id="skills-are-a-keeper"&gt;Skills are a keeper&lt;/h4&gt;
&lt;p&gt;When I first wrote about skills in October I said &lt;a href="https://simonwillison.net/2025/Oct/16/claude-skills/"&gt;Claude Skills are awesome, maybe a bigger deal than MCP&lt;/a&gt;. The fact that it's just turned December and OpenAI have already leaned into them in a big way reinforces to me that I called that one correctly.&lt;/p&gt;
&lt;p&gt;Skills are based on a &lt;em&gt;very&lt;/em&gt; light specification, if you could even call it that, but I still think it would be good for these to be formally documented somewhere. This could be a good initiative for the new &lt;a href="https://aaif.io/"&gt;Agentic AI Foundation&lt;/a&gt; (&lt;a href="https://simonwillison.net/2025/Dec/9/agentic-ai-foundation/"&gt;previously&lt;/a&gt;) to take on.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/pdf"&gt;pdf&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/kakapo"&gt;kakapo&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openai"&gt;openai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/prompt-engineering"&gt;prompt-engineering&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/chatgpt"&gt;chatgpt&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-programming"&gt;ai-assisted-programming&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/anthropic"&gt;anthropic&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/coding-agents"&gt;coding-agents&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gpt-5"&gt;gpt-5&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/codex-cli"&gt;codex-cli&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/skills"&gt;skills&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="pdf"/><category term="ai"/><category term="kakapo"/><category term="openai"/><category term="prompt-engineering"/><category term="generative-ai"/><category term="chatgpt"/><category term="llms"/><category term="ai-assisted-programming"/><category term="anthropic"/><category term="coding-agents"/><category term="gpt-5"/><category term="codex-cli"/><category term="skills"/></entry><entry><title>simonw/claude-skills</title><link href="https://simonwillison.net/2025/Oct/10/claude-skills/#atom-tag" rel="alternate"/><published>2025-10-10T23:57:19+00:00</published><updated>2025-10-10T23:57:19+00:00</updated><id>https://simonwillison.net/2025/Oct/10/claude-skills/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/simonw/claude-skills"&gt;simonw/claude-skills&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;One of the tips I picked up from &lt;a href="https://blog.fsck.com/2025/10/09/superpowers/"&gt;Jesse Vincent's Claude Code Superpowers post&lt;/a&gt; (&lt;a href="https://simonwillison.net/2025/Oct/10/superpowers/"&gt;previously&lt;/a&gt;) was this:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Skills are what give your agents Superpowers.&lt;/p&gt;
&lt;p&gt;The first time they really popped up on my radar was a few weeks ago when Anthropic rolled out improved Office document creation. When the feature rolled out, I went poking around a bit -- I asked Claude to tell me all about its new skills. And it was &lt;a href="https://claude.ai/share/0fe5a9c0-4e5a-42a1-9df7-c5b7636dad92"&gt;only too happy to dish&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;In that transcript Jesse asks how the skill system works and Claude replies:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Looking at the skills system, here's how it works:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Current Skills&lt;/strong&gt; (in &lt;code&gt;/mnt/skills/public/&lt;/code&gt;):&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;docx&lt;/strong&gt; - Word document handling&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;pdf&lt;/strong&gt; - PDF processing and form filling&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;pptx&lt;/strong&gt; - PowerPoint presentations&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;xlsx&lt;/strong&gt; - Excel spreadsheets&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;This looks like it's part of Claude's brand new Code Interpreter feature! I &lt;a href="https://simonwillison.net/2025/Sep/9/claude-code-interpreter/"&gt;wrote about that extensively&lt;/a&gt; last month, but I missed that there was a &lt;code&gt;/mnt/skills/public/&lt;/code&gt; folder full of fascinating implementation details.&lt;/p&gt;
&lt;p&gt;So I fired up a fresh Claude instance (fun fact: Code Interpreter also works in the Claude iOS app now, which it didn't when they first launched) and prompted:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;Create a zip file of everything in your /mnt/skills folder&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This worked, and gave me a &lt;code&gt;.zip&lt;/code&gt; to download. You can &lt;a href="https://claude.ai/new?q=Create%20a%20zip%20file%20of%20everything%20in%20your%20%2Fmnt%2Fskills%20folder"&gt;run the prompt yourself here&lt;/a&gt;, though you'll need to &lt;a href="https://simonwillison.net/2025/Sep/9/claude-code-interpreter/#switching-it-on-in-settings-features"&gt;enable the new feature first&lt;/a&gt;.&lt;/p&gt;
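&lt;p&gt;There's no magic to that prompt: Code Interpreter just runs a few lines of Python in the container. Something like this - my guess at the approach, not Claude's actual code - is all it takes:&lt;/p&gt;

```python
import shutil

def zip_folder(folder, output_basename):
    # Creates output_basename + ".zip" containing the whole folder tree
    return shutil.make_archive(output_basename, "zip", root_dir=folder)

# e.g. zip_folder("/mnt/skills", "/tmp/skills")
```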
&lt;p&gt;I've pushed the contents of that zip to my &lt;a href="https://github.com/simonw/claude-skills"&gt;new simonw/claude-skills GitHub repo&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;So now you can see the prompts Anthropic wrote to enable the creation and manipulation of the following files in their Claude consumer applications:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/simonw/claude-skills/blob/initial/mnt/skills/public/pdf/SKILL.md"&gt;pdf&lt;/a&gt; - PDF files&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/simonw/claude-skills/blob/initial/mnt/skills/public/docx/SKILL.md"&gt;docx&lt;/a&gt; - Microsoft Word&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/simonw/claude-skills/blob/initial/mnt/skills/public/pptx/SKILL.md"&gt;pptx&lt;/a&gt; - Microsoft PowerPoint decks&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/simonw/claude-skills/blob/initial/mnt/skills/public/xlsx/SKILL.md"&gt;xlsx&lt;/a&gt; - Microsoft Excel&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In each case the prompts spell out detailed instructions for manipulating those file types using Python, using libraries that come pre-installed on Claude's containers.&lt;/p&gt;
&lt;p&gt;Skills are more than just prompts though: the repository also includes dozens of pre-written Python scripts for performing common operations.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://github.com/simonw/claude-skills/blob/initial/mnt/skills/public/pdf/scripts/fill_fillable_fields.py"&gt;pdf/scripts/fill_fillable_fields.py&lt;/a&gt; for example is a custom CLI tool that uses &lt;a href="https://pypi.org/project/pypdf/"&gt;pypdf&lt;/a&gt; to find and then fill in a bunch of PDF form fields, specified as JSON, then render out the resulting combined PDF.&lt;/p&gt;
&lt;p&gt;This is a really sophisticated set of tools for document manipulation, and I love that Anthropic have made those visible - presumably deliberately - to users of Claude who know how to ask for them.&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/pdf"&gt;pdf&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/prompt-engineering"&gt;prompt-engineering&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/anthropic"&gt;anthropic&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude"&gt;claude&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/code-interpreter"&gt;code-interpreter&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/jesse-vincent"&gt;jesse-vincent&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/skills"&gt;skills&lt;/a&gt;&lt;/p&gt;



</summary><category term="pdf"/><category term="python"/><category term="ai"/><category term="prompt-engineering"/><category term="generative-ai"/><category term="llms"/><category term="anthropic"/><category term="claude"/><category term="code-interpreter"/><category term="jesse-vincent"/><category term="skills"/></entry><entry><title>llm-pdf-to-images</title><link href="https://simonwillison.net/2025/May/18/llm-pdf-to-images/#atom-tag" rel="alternate"/><published>2025-05-18T20:48:41+00:00</published><updated>2025-05-18T20:48:41+00:00</updated><id>https://simonwillison.net/2025/May/18/llm-pdf-to-images/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/simonw/llm-pdf-to-images"&gt;llm-pdf-to-images&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Inspired by my previous &lt;a href="https://github.com/simonw/llm-video-frames"&gt;llm-video-frames&lt;/a&gt; plugin, I thought it would be neat to have a plugin for LLM that can take a PDF and turn it into one image per page, so you can feed PDFs into models that support image inputs but don't yet support PDFs.&lt;/p&gt;
&lt;p&gt;This should now do exactly that:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm install llm-pdf-to-images
llm -f pdf-to-images:path/to/document.pdf &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;Summarize this document&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Under the hood it's using the &lt;a href="https://github.com/pymupdf/PyMuPDF"&gt;PyMuPDF&lt;/a&gt; library. The key code to convert a PDF into images looks like this:&lt;/p&gt;
&lt;pre&gt;&lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-s1"&gt;fitz&lt;/span&gt;
&lt;span class="pl-s1"&gt;doc&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;fitz&lt;/span&gt;.&lt;span class="pl-c1"&gt;open&lt;/span&gt;(&lt;span class="pl-s"&gt;"input.pdf"&lt;/span&gt;)
&lt;span class="pl-k"&gt;for&lt;/span&gt; &lt;span class="pl-s1"&gt;page&lt;/span&gt; &lt;span class="pl-c1"&gt;in&lt;/span&gt; &lt;span class="pl-s1"&gt;doc&lt;/span&gt;:
    &lt;span class="pl-s1"&gt;pix&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;page&lt;/span&gt;.&lt;span class="pl-c1"&gt;get_pixmap&lt;/span&gt;(&lt;span class="pl-s1"&gt;matrix&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-s1"&gt;fitz&lt;/span&gt;.&lt;span class="pl-c1"&gt;Matrix&lt;/span&gt;(&lt;span class="pl-c1"&gt;300&lt;/span&gt;&lt;span class="pl-c1"&gt;/&lt;/span&gt;&lt;span class="pl-c1"&gt;72&lt;/span&gt;, &lt;span class="pl-c1"&gt;300&lt;/span&gt;&lt;span class="pl-c1"&gt;/&lt;/span&gt;&lt;span class="pl-c1"&gt;72&lt;/span&gt;))
    &lt;span class="pl-s1"&gt;jpeg_bytes&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;pix&lt;/span&gt;.&lt;span class="pl-c1"&gt;tobytes&lt;/span&gt;(&lt;span class="pl-s1"&gt;output&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-s"&gt;"jpg"&lt;/span&gt;, &lt;span class="pl-s1"&gt;jpg_quality&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-c1"&gt;30&lt;/span&gt;)&lt;/pre&gt;

&lt;p&gt;Once I'd figured out that code I got o4-mini to write most of the rest of the plugin, using &lt;a href="https://github.com/simonw/llm-fragments-github"&gt;llm-fragments-github&lt;/a&gt; to load in the example code from the video plugin:&lt;/p&gt;
&lt;pre&gt;llm -f github:simonw/llm-video-frames &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;import fitz&lt;/span&gt;
&lt;span class="pl-s"&gt;doc = fitz.open("input.pdf")&lt;/span&gt;
&lt;span class="pl-s"&gt;for page in doc:&lt;/span&gt;
&lt;span class="pl-s"&gt;    pix = page.get_pixmap(matrix=fitz.Matrix(300/72, 300/72))&lt;/span&gt;
&lt;span class="pl-s"&gt;    jpeg_bytes = pix.tobytes(output="jpg", jpg_quality=30)&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt; -s &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;output llm_pdf_to_images.py which adds a pdf-to-images: &lt;/span&gt;
&lt;span class="pl-s"&gt; fragment loader that converts a PDF to frames using fitz like in the example&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt; \
-m o4-mini&lt;/pre&gt;

&lt;p&gt;Here's &lt;a href="https://gist.github.com/simonw/27af84e4e533872bfd59fcba69b4166f"&gt;the transcript&lt;/a&gt; - more details in &lt;a href="https://github.com/simonw/llm-pdf-to-images/issues/1"&gt;this issue&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I had some &lt;em&gt;weird&lt;/em&gt; results testing this with GPT 4.1 mini. I created &lt;a href="https://github.com/simonw/llm-pdf-to-images/blob/main/tests/blank-pages.pdf"&gt;a test PDF&lt;/a&gt; with two pages - one white, one black - and ran a test prompt like this:&lt;/p&gt;
&lt;pre&gt;llm -f &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;pdf-to-images:blank-pages.pdf&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt; \
  &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;describe these images&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;

&lt;blockquote&gt;
&lt;p&gt;The first image features a stylized red maple leaf with triangular facets, giving it a geometric appearance. The maple leaf is a well-known symbol associated with Canada.&lt;/p&gt;
&lt;p&gt;The second image is a simple black silhouette of a cat sitting and facing to the left. The cat's tail curls around its body. The design is minimalistic and iconic.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I got even wilder hallucinations for other prompts, like "summarize this document" or "describe all figures". I have a collection of those &lt;a href="https://gist.github.com/simonw/2fbef11b1737a9ae7da1b2ff58998454"&gt;in this Gist&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Thankfully this behavior is limited to GPT-4.1 mini. I upgraded to full GPT-4.1 and got &lt;a href="https://gist.github.com/simonw/0713dc0ce00bd6cd4d5990f44c865964#prompt-1"&gt;much more sensible results&lt;/a&gt;:&lt;/p&gt;
&lt;pre&gt;llm -f &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;pdf-to-images:blank-pages.pdf&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt; \
  &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;describe these images&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt; -m gpt-4.1&lt;/pre&gt;

&lt;blockquote&gt;
&lt;p&gt;Certainly! Here are the descriptions of the two images you provided:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;First image:&lt;/strong&gt; This image is completely white. It appears blank, with no discernible objects, text, or features.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Second image:&lt;/strong&gt; This image is entirely black. Like the first, it is blank and contains no visible objects, text, or distinct elements.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;If you have questions or need a specific kind of analysis or modification, please let me know!&lt;/p&gt;
&lt;/blockquote&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/pdf"&gt;pdf&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/plugins"&gt;plugins&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-programming"&gt;ai-assisted-programming&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/hallucinations"&gt;hallucinations&lt;/a&gt;&lt;/p&gt;



</summary><category term="pdf"/><category term="plugins"/><category term="projects"/><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="ai-assisted-programming"/><category term="llm"/><category term="hallucinations"/></entry><entry><title>Tracing the thoughts of a large language model</title><link href="https://simonwillison.net/2025/Mar/27/tracing-the-thoughts-of-a-large-language-model/#atom-tag" rel="alternate"/><published>2025-03-27T21:51:24+00:00</published><updated>2025-03-27T21:51:24+00:00</updated><id>https://simonwillison.net/2025/Mar/27/tracing-the-thoughts-of-a-large-language-model/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.anthropic.com/research/tracing-thoughts-language-model"&gt;Tracing the thoughts of a large language model&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;In a follow-up to the research that brought us the &lt;a href="https://simonwillison.net/2024/May/24/golden-gate-claude/"&gt;delightful Golden Gate Claude&lt;/a&gt; last year, Anthropic have published two new papers about LLM interpretability:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://transformer-circuits.pub/2025/attribution-graphs/methods.html"&gt;Circuit Tracing: Revealing Computational Graphs in Language Models &lt;/a&gt; extends last year's interpretable features into &lt;a href="https://transformer-circuits.pub/2025/attribution-graphs/methods.html#graphs"&gt;attribution graphs&lt;/a&gt;, which can "trace the chain of intermediate steps that a model uses to transform a specific input prompt into an output response".&lt;/li&gt;
&lt;li&gt;&lt;a href="https://transformer-circuits.pub/2025/attribution-graphs/biology.html"&gt;On the Biology of a Large Language Model&lt;/a&gt; uses that methodology to investigate Claude 3.5 Haiku in a bunch of different ways. &lt;a href="https://transformer-circuits.pub/2025/attribution-graphs/biology.html#dives-multilingual"&gt;Multilingual Circuits&lt;/a&gt; for example shows that the same prompt in three different languages uses similar circuits for each one, hinting at an intriguing level of generalization.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;To my own personal delight, neither of these papers is published as a PDF. They're both presented as glorious mobile-friendly HTML pages with linkable sections and even some inline interactive diagrams. More of this please!&lt;/p&gt;
&lt;p&gt;&lt;a href="https://transformer-circuits.pub/2025/attribution-graphs/biology.html#dives-multilingual"&gt;&lt;img alt="Screenshot of a multilingual language model visualization showing antonym prediction across three languages. Left panel shows English with prompt &amp;quot;The opposite of 'small' is'&amp;quot; predicting &amp;quot;large&amp;quot;. Middle panel shows Chinese prompt &amp;quot;小&amp;quot;的反义词是&amp;quot; predicting &amp;quot;大 (zh: big)&amp;quot;. Right panel shows French prompt &amp;quot;Le contraire de &amp;quot;petit&amp;quot; est&amp;quot; predicting &amp;quot;grand (fr: big)&amp;quot;. Above shows activation analysis with token predictions and highlighted instances of &amp;quot;contraire&amp;quot; in French text." src="https://static.simonwillison.net/static/2025/anthropic-diagrams.jpg" /&gt;&lt;/a&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/pdf"&gt;pdf&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/anthropic"&gt;anthropic&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude"&gt;claude&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/interpretability"&gt;interpretability&lt;/a&gt;&lt;/p&gt;



</summary><category term="pdf"/><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="anthropic"/><category term="claude"/><category term="interpretability"/></entry><entry><title>Mistral OCR</title><link href="https://simonwillison.net/2025/Mar/7/mistral-ocr/#atom-tag" rel="alternate"/><published>2025-03-07T01:39:26+00:00</published><updated>2025-03-07T01:39:26+00:00</updated><id>https://simonwillison.net/2025/Mar/7/mistral-ocr/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://mistral.ai/fr/news/mistral-ocr"&gt;Mistral OCR&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
New closed-source specialist OCR model by Mistral - you can feed it images or a PDF and it produces Markdown with optional embedded images.&lt;/p&gt;
&lt;p&gt;It's available &lt;a href="https://docs.mistral.ai/api/#tag/ocr"&gt;via their API&lt;/a&gt;, or it's "available to self-host on a selective basis" for people with stringent privacy requirements who are willing to talk to their sales team.&lt;/p&gt;
&lt;p&gt;I decided to try out their API, so I copied and pasted example code &lt;a href="https://colab.research.google.com/drive/11NdqWVwC_TtJyKT6cmuap4l9SryAeeVt?usp=sharing"&gt;from their notebook&lt;/a&gt; into my &lt;a href="https://simonwillison.net/2024/Dec/19/one-shot-python-tools/"&gt;custom Claude project&lt;/a&gt; and &lt;a href="https://claude.ai/share/153d8eb8-82dd-4f8c-a3d0-6c23b4dc21a2"&gt;told it&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;Turn this into a CLI app, depends on mistralai - it should take a file path and an optional API key defauling to env vironment called MISTRAL_API_KEY&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;After &lt;a href="https://claude.ai/share/b746cab4-293b-4e04-b662-858bb164ab78"&gt;some further&lt;/a&gt; iteration / vibe coding I got to something that worked, which I then tidied up and shared as &lt;a href="https://github.com/simonw/tools/blob/main/python/mistral_ocr.py"&gt;mistral_ocr.py&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;You can try it out like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;export MISTRAL_API_KEY='...'
uv run http://tools.simonwillison.net/python/mistral_ocr.py \
  mixtral.pdf --html --inline-images &amp;gt; mixtral.html
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I fed in &lt;a href="https://arxiv.org/abs/2401.04088"&gt;the Mixtral paper&lt;/a&gt; as a PDF. The API returns Markdown, but my &lt;code&gt;--html&lt;/code&gt; option renders that Markdown as HTML and the &lt;code&gt;--inline-images&lt;/code&gt; option takes any images and inlines them as base64 URIs (inspired &lt;a href="https://simonwillison.net/2025/Mar/6/monolith/"&gt;by monolith&lt;/a&gt;). The result is &lt;a href="https://static.simonwillison.net/static/2025/mixtral.html"&gt;mixtral.html&lt;/a&gt;, a 972KB HTML file with images and text bundled together.&lt;/p&gt;
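&lt;p&gt;The inlining trick itself is simple: base64-encode each image and rewrite its Markdown reference as a &lt;code&gt;data:&lt;/code&gt; URI. Here's a minimal sketch of that idea (the helper names are mine, not the actual &lt;code&gt;mistral_ocr.py&lt;/code&gt; code):&lt;/p&gt;

```python
import base64


def to_data_uri(image_bytes: bytes, mime: str = "image/jpeg") -> str:
    # Encode the raw image bytes and wrap them in a data: URI that
    # works directly as an image src value in HTML or Markdown.
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def inline_images(markdown: str, images: dict[str, bytes]) -> str:
    # Replace each referenced image filename with its data URI so the
    # resulting document is a single self-contained file.
    for name, data in images.items():
        markdown = markdown.replace(f"]({name})", f"]({to_data_uri(data)})")
    return markdown
```

&lt;p&gt;Because every image travels inside the document itself, the output needs no external requests at all - which is also why the single-file result weighs in at nearly a megabyte.&lt;/p&gt;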
&lt;p&gt;This did a pretty great job!&lt;/p&gt;
&lt;p&gt;&lt;img alt="Screenshot of part of the document, it has a heading, some text, an image and the start of a table. The table contains some unrendered MathML syntax." src="https://static.simonwillison.net/static/2025/mixtral-as-html.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;My script renders Markdown tables but I haven't figured out how to render inline Markdown MathML yet. I ran the command a second time and requested Markdown output (the default) like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;uv run http://tools.simonwillison.net/python/mistral_ocr.py \
  mixtral.pdf &amp;gt; mixtral.md
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here's &lt;a href="https://gist.github.com/simonw/023d1cf403c1cd9f41801c85510aef21"&gt;that Markdown rendered as a Gist&lt;/a&gt; - there are a few MathML glitches so clearly the Mistral OCR MathML dialect and the GitHub Formatted Markdown dialect don't quite line up.&lt;/p&gt;
&lt;p&gt;My tool can also output raw JSON as an alternative to Markdown or HTML - full details &lt;a href="https://tools.simonwillison.net/python/#mistral_ocrpy"&gt;in the documentation&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The Mistral API is priced at roughly 1000 pages per dollar, with a 50% discount for batch usage.&lt;/p&gt;
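&lt;p&gt;That pricing is easy to turn into a quick estimate (my own arithmetic from the numbers above, not an official rate card):&lt;/p&gt;

```python
def ocr_cost_usd(pages: int, pages_per_dollar: int = 1000, batch: bool = False) -> float:
    # Roughly 1000 pages per dollar, with a 50% discount for batch usage.
    cost = pages / pages_per_dollar
    return cost * 0.5 if batch else cost


# A 10,000 page archive would cost about $10, or $5 via the batch API.
```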
&lt;p&gt;The big question with LLM-based OCR is always how well it copes with accidental instructions in the text (can you safely OCR a document full of prompting examples?) and how it behaves when faced with text it can't read.&lt;/p&gt;
&lt;p&gt;Mistral's Sophia Yang says it &lt;a href="https://x.com/sophiamyang/status/1897719199595720722"&gt;"should be robust"&lt;/a&gt; against following instructions in the text, and invited people to try and find counter-examples.&lt;/p&gt;
&lt;p&gt;Alexander Doria noted that &lt;a href="https://twitter.com/Dorialexander/status/1897702264543875535"&gt;Mistral OCR can hallucinate text&lt;/a&gt; when faced with handwriting that it cannot understand.

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/sophiamyang/status/1897713370029068381"&gt;@sophiamyang&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/cli"&gt;cli&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ocr"&gt;ocr&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pdf"&gt;pdf&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude"&gt;claude&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/mistral"&gt;mistral&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/vision-llms"&gt;vision-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/uv"&gt;uv&lt;/a&gt;&lt;/p&gt;



</summary><category term="cli"/><category term="ocr"/><category term="pdf"/><category term="projects"/><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="claude"/><category term="mistral"/><category term="vision-llms"/><category term="uv"/></entry><entry><title>olmOCR</title><link href="https://simonwillison.net/2025/Feb/26/olmocr/#atom-tag" rel="alternate"/><published>2025-02-26T02:04:03+00:00</published><updated>2025-02-26T02:04:03+00:00</updated><id>https://simonwillison.net/2025/Feb/26/olmocr/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://olmocr.allenai.org/"&gt;olmOCR&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
New from &lt;a href="https://allenai.org/"&gt;Ai2&lt;/a&gt; - olmOCR is "an open-source tool designed for high-throughput conversion of PDFs and other documents into plain text while preserving natural reading order".&lt;/p&gt;
&lt;p&gt;At its core is &lt;a href="https://huggingface.co/allenai/olmOCR-7B-0225-preview"&gt;allenai/olmOCR-7B-0225-preview&lt;/a&gt;, a Qwen2-VL-7B-Instruct variant trained on ~250,000 pages of diverse PDF content (both scanned and text-based) that were labelled using GPT-4o and made available as the &lt;a href="https://huggingface.co/datasets/allenai/olmOCR-mix-0225"&gt;olmOCR-mix-0225 dataset&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The &lt;a href="https://github.com/allenai/olmocr"&gt;olmocr&lt;/a&gt; Python library can run the model on any "recent NVIDIA GPU".  I haven't managed to run it on my own Mac yet - there are &lt;a href="https://huggingface.co/lmstudio-community/olmOCR-7B-0225-preview-GGUF"&gt;GGUFs out there&lt;/a&gt; but it's not clear to me how to run vision prompts through them - but Ai2 offer &lt;a href="https://olmocr.allenai.org/"&gt;an online demo&lt;/a&gt; which can handle up to ten pages for free.&lt;/p&gt;
&lt;p&gt;Given the right hardware this looks like a very inexpensive way to run large scale document conversion projects:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;We carefully optimized our inference pipeline for large-scale batch processing using SGLang, enabling olmOCR to convert one million PDF pages for just $190 - about 1/32nd the cost of using GPT-4o APIs.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The most interesting idea from &lt;a href="https://olmocr.allenai.org/papers/olmocr.pdf"&gt;the technical report (PDF)&lt;/a&gt; is something they call "document anchoring":&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Document anchoring extracts coordinates of salient elements in each page (e.g., text blocks and images) and injects them alongside raw text extracted
from the PDF binary file. [...]&lt;/p&gt;
&lt;p&gt;Document anchoring processes PDF document pages via the PyPDF library to extract a representation of the page’s structure from the underlying PDF. All of the text blocks and images in the page are extracted, including position information. Starting with the most relevant text blocks and images, these are sampled and added to the prompt of the VLM, up to a defined maximum character limit. This extra information is then available to the model when processing the document.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;img alt="Left side shows a green-header interface with coordinates like [150x220]√3x−1+(1+x)², [150x180]Section 6, [150x50]Lorem ipsum dolor sit amet, [150x70]consectetur adipiscing elit, sed do, [150x90]eiusmod tempor incididunt ut, [150x110]labore et dolore magna aliqua, [100x280]Table 1, followed by grid coordinates with A, B, C, AA, BB, CC, AAA, BBB, CCC values. Right side shows the rendered document with equation, text and table." src="https://static.simonwillison.net/static/2025/olmocr-document-anchoring.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;The one limitation of olmOCR at the moment is that it doesn't appear to do anything with diagrams, figures or illustrations. Vision models are actually very good at interpreting these now, so my ideal OCR solution would include detailed automated descriptions of this kind of content in the resulting text.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Update&lt;/strong&gt;: Jonathan Soma &lt;a href="https://jonathansoma.com/words/olmocr-on-macos-with-lm-studio.html"&gt;figured out how to run it on a Mac&lt;/a&gt; using LM Studio and the &lt;a href="https://github.com/allenai/olmocr/"&gt;olmocr&lt;/a&gt; Python package.

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/soldni/status/1894418235334037570?s=46"&gt;Luca Soldaini&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ocr"&gt;ocr&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pdf"&gt;pdf&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/fine-tuning"&gt;fine-tuning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/vision-llms"&gt;vision-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/qwen"&gt;qwen&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai2"&gt;ai2&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/lm-studio"&gt;lm-studio&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-in-china"&gt;ai-in-china&lt;/a&gt;&lt;/p&gt;



</summary><category term="ocr"/><category term="pdf"/><category term="ai"/><category term="generative-ai"/><category term="local-llms"/><category term="llms"/><category term="fine-tuning"/><category term="vision-llms"/><category term="qwen"/><category term="ai2"/><category term="lm-studio"/><category term="ai-in-china"/></entry><entry><title>Docling</title><link href="https://simonwillison.net/2024/Nov/3/docling/#atom-tag" rel="alternate"/><published>2024-11-03T04:57:56+00:00</published><updated>2024-11-03T04:57:56+00:00</updated><id>https://simonwillison.net/2024/Nov/3/docling/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://ds4sd.github.io/docling/"&gt;Docling&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
MIT licensed document extraction Python library from the Deep Search team at IBM, who released &lt;a href="https://ds4sd.github.io/docling/v2/#changes-in-docling-v2"&gt;Docling v2&lt;/a&gt; on October 16th.&lt;/p&gt;
&lt;p&gt;Here's the &lt;a href="https://arxiv.org/abs/2408.09869"&gt;Docling Technical Report&lt;/a&gt; paper from August, which provides details of two custom models: a layout analysis model for figuring out the structure of the document (sections, figures, text, tables etc) and a TableFormer model specifically for extracting structured data from tables.&lt;/p&gt;
&lt;p&gt;Those models are &lt;a href="https://huggingface.co/ds4sd/docling-models"&gt;available on Hugging Face&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Here's how to try out the Docling CLI interface using &lt;code&gt;uvx&lt;/code&gt; (avoiding the need to install it first - though since it downloads models it will take a while to run the first time):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;uvx docling mydoc.pdf --to json --to md
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This will output a &lt;code&gt;mydoc.json&lt;/code&gt; file with complex layout information and a &lt;code&gt;mydoc.md&lt;/code&gt; Markdown file which includes Markdown tables where appropriate.&lt;/p&gt;
&lt;p&gt;The &lt;a href="https://ds4sd.github.io/docling/usage/"&gt;Python API&lt;/a&gt; is a lot more comprehensive. It can even extract tables &lt;a href="https://ds4sd.github.io/docling/examples/export_tables/"&gt;as Pandas DataFrames&lt;/a&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;span class="pl-k"&gt;from&lt;/span&gt; &lt;span class="pl-s1"&gt;docling&lt;/span&gt;.&lt;span class="pl-s1"&gt;document_converter&lt;/span&gt; &lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-v"&gt;DocumentConverter&lt;/span&gt;
&lt;span class="pl-s1"&gt;converter&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-v"&gt;DocumentConverter&lt;/span&gt;()
&lt;span class="pl-s1"&gt;result&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;converter&lt;/span&gt;.&lt;span class="pl-en"&gt;convert&lt;/span&gt;(&lt;span class="pl-s"&gt;"document.pdf"&lt;/span&gt;)
&lt;span class="pl-k"&gt;for&lt;/span&gt; &lt;span class="pl-s1"&gt;table&lt;/span&gt; &lt;span class="pl-c1"&gt;in&lt;/span&gt; &lt;span class="pl-s1"&gt;result&lt;/span&gt;.&lt;span class="pl-s1"&gt;document&lt;/span&gt;.&lt;span class="pl-s1"&gt;tables&lt;/span&gt;:
    &lt;span class="pl-s1"&gt;df&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;table&lt;/span&gt;.&lt;span class="pl-en"&gt;export_to_dataframe&lt;/span&gt;()
    &lt;span class="pl-en"&gt;print&lt;/span&gt;(&lt;span class="pl-s1"&gt;df&lt;/span&gt;)&lt;/pre&gt;

&lt;p&gt;I ran that inside &lt;code&gt;uv run --with docling python&lt;/code&gt;. It took a little while to run, but it demonstrated that the library works.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/cli"&gt;cli&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ibm"&gt;ibm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ocr"&gt;ocr&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pdf"&gt;pdf&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/hugging-face"&gt;hugging-face&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/uv"&gt;uv&lt;/a&gt;&lt;/p&gt;



</summary><category term="cli"/><category term="ibm"/><category term="ocr"/><category term="pdf"/><category term="python"/><category term="ai"/><category term="hugging-face"/><category term="uv"/></entry><entry><title>Claude API: PDF support (beta)</title><link href="https://simonwillison.net/2024/Nov/1/claude-api-pdf-support-beta/#atom-tag" rel="alternate"/><published>2024-11-01T18:55:31+00:00</published><updated>2024-11-01T18:55:31+00:00</updated><id>https://simonwillison.net/2024/Nov/1/claude-api-pdf-support-beta/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://docs.anthropic.com/en/docs/build-with-claude/pdf-support"&gt;Claude API: PDF support (beta)&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Claude 3.5 Sonnet now accepts PDFs as attachments:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The new Claude 3.5 Sonnet (&lt;code&gt;claude-3-5-sonnet-20241022&lt;/code&gt;) model now supports PDF input and understands both text and visual content within documents.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I just released &lt;a href="https://github.com/simonw/llm-claude-3/releases/tag/0.7"&gt;llm-claude-3 0.7&lt;/a&gt; with support for the new attachment type (attachments are &lt;a href="https://simonwillison.net/2024/Oct/29/llm-multi-modal/"&gt;a very new feature&lt;/a&gt;), so now you can do this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm install llm-claude-3 --upgrade
llm -m claude-3.5-sonnet 'extract text' -a mydoc.pdf
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Visual PDF analysis can also be turned on &lt;a href="https://claude.ai/new?fp=1"&gt;for the Claude.ai application&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Screenshot of a feature preview interface showing experimental features. At top: Feature Preview with beaker icon. Main text explains these are upcoming enhancements that may affect Claude's behavior. Shows options for Analysis tool, LaTeX Rendering, and Visual PDFs. Right panel demonstrates Visual PDFs feature with Apollo 17 flight plan image and chat messages. Toggle switch shows feature is Off. Description states Give Claude 3.5 Sonnet the ability to view and analyze images, charts, and graphs in PDFs, in addition to text. PDFs that are less than 100 pages are supported." src="https://static.simonwillison.net/static/2024/claude-pdf-preview.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;Also new today: Claude now offers a free (albeit rate-limited) &lt;a href="https://docs.anthropic.com/en/docs/build-with-claude/token-counting"&gt;token counting API&lt;/a&gt;. This addresses a complaint I've had for a while: previously it wasn't possible to accurately estimate the cost of a prompt before sending it to be executed.

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/alexalbert__/status/1852394000101323193"&gt;@alexalbert__&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/pdf"&gt;pdf&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/anthropic"&gt;anthropic&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude"&gt;claude&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/vision-llms"&gt;vision-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude-3-5-sonnet"&gt;claude-3-5-sonnet&lt;/a&gt;&lt;/p&gt;



</summary><category term="pdf"/><category term="projects"/><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="llm"/><category term="anthropic"/><category term="claude"/><category term="vision-llms"/><category term="claude-3-5-sonnet"/></entry><entry><title>The Fair Source Definition</title><link href="https://simonwillison.net/2024/Oct/9/the-fair-source-definition/#atom-tag" rel="alternate"/><published>2024-10-09T18:17:31+00:00</published><updated>2024-10-09T18:17:31+00:00</updated><id>https://simonwillison.net/2024/Oct/9/the-fair-source-definition/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://fair.io/about/"&gt;The Fair Source Definition&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Fair Source (&lt;a href="https://fair.io/"&gt;fair.io&lt;/a&gt;) is the new-ish initiative from Chad Whitacre and Sentry aimed at establishing an alternative licensing philosophy, one that offers additional protection for the business models of companies that release their code.&lt;/p&gt;
&lt;p&gt;I like that they're establishing a new brand for this and making it clear that it's a separate concept from Open Source. Here's their definition:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Fair Source is an alternative to closed source, allowing you to safely share access to your core products. Fair Source Software (FSS):&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;is publicly available to read;&lt;/li&gt;
&lt;li&gt;allows use, modification, and redistribution with minimal restrictions to protect the producer’s business model; and&lt;/li&gt;
&lt;li&gt;undergoes delayed Open Source publication (DOSP).&lt;/li&gt;
&lt;/ol&gt;
&lt;/blockquote&gt;
&lt;p&gt;They link to the &lt;a href="https://opensource.org/delayed-open-source-publication"&gt;Delayed Open Source Publication&lt;/a&gt; research paper published by &lt;a href="https://opensource.org/blog/a-historic-view-of-the-practice-to-delay-releasing-open-source-software-osis-report"&gt;OSI in January&lt;/a&gt;. (I was frustrated that this is only available as a PDF, so I &lt;a href="https://gist.github.com/simonw/7b913aaaff8278d2baaed86e43ece748"&gt;converted it to Markdown&lt;/a&gt; using Gemini 1.5 Pro so I could read it on my phone.)&lt;/p&gt;
&lt;p&gt;The most interesting background I could find on Fair Source was &lt;a href="https://github.com/fairsource/fair.io/issues/14"&gt;this GitHub issues thread&lt;/a&gt;, started in May, where Chad and other contributors fleshed out the initial launch plan over the course of several months.

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=41788461"&gt;Hacker News&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/licensing"&gt;licensing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/open-source"&gt;open-source&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pdf"&gt;pdf&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sentry"&gt;sentry&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/chad-whitacre"&gt;chad-whitacre&lt;/a&gt;&lt;/p&gt;



</summary><category term="licensing"/><category term="open-source"/><category term="pdf"/><category term="sentry"/><category term="chad-whitacre"/></entry><entry><title>Calling LLMs from client-side JavaScript, converting PDFs to HTML + weeknotes</title><link href="https://simonwillison.net/2024/Sep/6/weeknotes/#atom-tag" rel="alternate"/><published>2024-09-06T02:28:38+00:00</published><updated>2024-09-06T02:28:38+00:00</updated><id>https://simonwillison.net/2024/Sep/6/weeknotes/#atom-tag</id><summary type="html">
    &lt;p&gt;I've been having a bunch of fun taking advantage of CORS-enabled LLM APIs to build client-side JavaScript applications that access LLMs directly. I also spun up a new Datasette plugin for advanced permission management.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2024/Sep/6/weeknotes/#llms-from-client-side-javascript"&gt;LLMs from client-side JavaScript&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2024/Sep/6/weeknotes/#converting-pdfs-to-html-and-markdown"&gt;Converting PDFs to HTML and Markdown&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2024/Sep/6/weeknotes/#adding-some-class-to-datasette-forms"&gt;Adding some class to Datasette forms&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2024/Sep/6/weeknotes/#on-the-blog"&gt;On the blog&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2024/Sep/6/weeknotes/#releases"&gt;Releases&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2024/Sep/6/weeknotes/#tils"&gt;TILs&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h4 id="llms-from-client-side-javascript"&gt;LLMs from client-side JavaScript&lt;/h4&gt;
&lt;p&gt;Anthropic &lt;a href="https://simonwillison.net/2024/Aug/23/anthropic-dangerous-direct-browser-access/"&gt;recently added CORS support&lt;/a&gt; to their Claude APIs. It's a little hard to use - you have to add &lt;code&gt;anthropic-dangerous-direct-browser-access: true&lt;/code&gt; to your request headers to enable it - but once you know the trick you can start building web applications that talk to Anthropic's LLMs directly, without any additional server-side code.&lt;/p&gt;
&lt;p&gt;I later found out that both OpenAI and Google Gemini have this capability too, without needing the special header.&lt;/p&gt;
&lt;p&gt;The problem with this approach is security: it's very important not to embed an API key attached to your billing account in client-side HTML and JavaScript for anyone to see!&lt;/p&gt;
&lt;p&gt;For my purposes though that doesn't matter. I've been building tools which &lt;code&gt;prompt()&lt;/code&gt; a user for their own API key (sadly restricting their usage to the tiny portion of people who both understand API keys and have created API accounts with one of the big providers) - then I stash that key in &lt;code&gt;localStorage&lt;/code&gt; and start using it to make requests.&lt;/p&gt;
&lt;p&gt;My &lt;a href="https://github.com/simonw/tools"&gt;simonw/tools&lt;/a&gt; repository is home to a growing collection of pure HTML+JavaScript tools, hosted at &lt;a href="https://tools.simonwillison.net/"&gt;tools.simonwillison.net&lt;/a&gt; using GitHub Pages. I love not having to even think about hosting server-side code for these tools.&lt;/p&gt;
&lt;p&gt;I've published three tools there that talk to LLMs directly so far:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://tools.simonwillison.net/haiku"&gt;haiku&lt;/a&gt; is a fun demo that requests access to the user's camera and then writes a Haiku about what it sees. It uses Anthropic's Claude 3 Haiku model for this - the whole project is one terrible pun. &lt;a href="https://github.com/simonw/tools/blob/main/haiku.html"&gt;Haiku source code here&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://tools.simonwillison.net/gemini-bbox"&gt;gemini-bbox&lt;/a&gt; uses the Gemini 1.5 Pro (or Flash) API to prompt those models to return bounding boxes for objects in an image, then renders those bounding boxes. Gemini Pro is the only of the vision LLMs that I've tried that has reliable support for bounding boxes. I wrote about this in &lt;a href="https://simonwillison.net/2024/Aug/26/gemini-bounding-box-visualization/"&gt;Building a tool showing how Gemini Pro can return bounding boxes for objects in images&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://tools.simonwillison.net/gemini-chat"&gt;Gemini Chat App&lt;/a&gt; is a more traditional LLM chat interface that again talks to Gemini models (including the new super-speedy &lt;code&gt;gemini-1.5-flash-8b-exp-0827&lt;/code&gt;). I built this partly to try out those new models and partly to experiment with implementing a streaming chat interface agaist the Gemini API directly in a browser. I wrote more about how that works &lt;a href="https://simonwillison.net/2024/Aug/27/gemini-chat-app/"&gt;in this post&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Here's that Gemini Bounding Box visualization tool:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2024/goats-bbox-fixed.jpg" alt="Gemini API Image Bounding Box Visualization - browse for file goats.jpeg, prompt is Return bounding boxes as JSON arrays [ymin, xmin, ymax, xmax] - there follows output coordinates and then a red and a green box around the goats in a photo, with grid lines showing the coordinates from 0-1000 on both axes" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;All three of these tools made heavy use of AI-assisted development: Claude 3.5 Sonnet wrote almost every line of the last two, and the Haiku one was put together a few months ago using Claude 3 Opus.&lt;/p&gt;
&lt;p&gt;My personal style of HTML and JavaScript apps turns out to be highly compatible with LLMs: I like using vanilla HTML and JavaScript and keeping everything in the same file, which makes it easy to paste the entire thing into the model and ask it to make some changes for me. This approach also works really well with &lt;a href="https://simonwillison.net/tags/claude-artifacts/"&gt;Claude Artifacts&lt;/a&gt;, though I have to tell it "no React" to make sure I get an artifact I can hack on without needing to configure a React build step.&lt;/p&gt;
&lt;h4 id="converting-pdfs-to-html-and-markdown"&gt;Converting PDFs to HTML and Markdown&lt;/h4&gt;
&lt;p&gt;I have a long-standing vendetta against PDFs for sharing information. They're painful to read on a mobile phone, they have poor accessibility, and even things like copying and pasting text from them can be a pain.&lt;/p&gt;
&lt;p&gt;Complaining without doing something about it isn't really my style. Twice in the past few weeks I've taken matters into my own hands:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Google Research released &lt;a href="https://research.google/pubs/sql-has-problems-we-can-fix-them-pipe-syntax-in-sql/"&gt;a PDF paper&lt;/a&gt; describing their new pipe syntax for SQL. I ran it through Gemini 1.5 Pro to convert it to HTML (&lt;a href="https://simonwillison.net/2024/Aug/24/pipe-syntax-in-sql/"&gt;prompts here&lt;/a&gt;) and &lt;a href="https://static.simonwillison.net/static/2024/Pipe-Syntax-In-SQL.html"&gt;got this&lt;/a&gt; - a pretty great initial result for the first prompt I tried!&lt;/li&gt;
&lt;li&gt;Nous Research released &lt;a href="https://github.com/NousResearch/DisTrO/blob/main/A_Preliminary_Report_on_DisTrO.pdf"&gt;a preliminary report PDF&lt;/a&gt; about their DisTrO technology for distributed training of LLMs over low-bandwidth connections. I &lt;a href="https://simonwillison.net/2024/Aug/27/distro/"&gt;ran a prompt&lt;/a&gt; to use Gemini 1.5 Pro to convert that to &lt;a href="https://gist.github.com/simonw/46a33d66e069efe5c10b63625fdabb4e"&gt;this Markdown version&lt;/a&gt;, which even handled tables.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Within six hours of posting it, my Pipe Syntax in SQL conversion was ranked third on Google for the title of the paper, at which point I set it to &lt;code&gt;&amp;lt;meta name="robots" content="noindex"&amp;gt;&lt;/code&gt; to try and keep the unverified clone out of search. Yet more evidence that HTML is better than PDF!&lt;/p&gt;
&lt;p&gt;I've spent less than a total of ten minutes on using Gemini to convert PDFs in this way and the results have been very impressive. If I were to spend more time on this I'd target figures: I have a hunch that getting Gemini to return bounding boxes for figures on the PDF pages could be the key here, since then each figure could be automatically extracted as an image.&lt;/p&gt;
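&lt;p&gt;Gemini returns each box as &lt;code&gt;[ymin, xmin, ymax, xmax]&lt;/code&gt; on a 0-1000 grid, so cropping a figure out of a rendered page image is just a coordinate conversion - a sketch (the &lt;code&gt;crop&lt;/code&gt; call in the comment assumes a Pillow image, which is an assumption on my part):&lt;/p&gt;

```python
def to_pixel_box(box: list[int], width: int, height: int) -> tuple[int, int, int, int]:
    # Gemini-style box: [ymin, xmin, ymax, xmax] normalized to 0-1000.
    # Returns (left, upper, right, lower), the order Pillow's Image.crop expects.
    ymin, xmin, ymax, xmax = box
    return (
        round(xmin / 1000 * width),
        round(ymin / 1000 * height),
        round(xmax / 1000 * width),
        round(ymax / 1000 * height),
    )


# figure = page_image.crop(to_pixel_box([100, 200, 400, 800], page_image.width, page_image.height))
```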
&lt;p&gt;I bet you could build that whole thing as a client-side app against the Gemini Pro API, too...&lt;/p&gt;
&lt;h4 id="adding-some-class-to-datasette-forms"&gt;Adding some class to Datasette forms&lt;/h4&gt;
&lt;p&gt;I've been working on a new Datasette plugin for permissions management, &lt;a href="https://github.com/datasette/datasette-acl"&gt;datasette-acl&lt;/a&gt;, which I'll write about separately soon.&lt;/p&gt;
&lt;p&gt;I wanted to integrate &lt;a href="https://github.com/Choices-js/Choices"&gt;Choices.js&lt;/a&gt; with it, to provide a nicer interface for adding permissions to a user or group.&lt;/p&gt;
&lt;p&gt;My first attempt at integrating Choices ended up looking like this:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2024/datasette-acl-choices-bug.jpg" alt="The choices elements have big upgly blank boxes displayed where the remove icon should be. The Firefox DevTools console is open revealing CSS properties set on form button type=button, explaining the visual glitches" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;The weird visual glitches are caused by Datasette's core CSS, which included &lt;a href="https://github.com/simonw/datasette/blob/92c4d41ca605e0837a2711ee52fde9cf1eea74d0/datasette/static/app.css#L553-L564"&gt;the following rule&lt;/a&gt;:&lt;/p&gt;
&lt;div class="highlight highlight-source-css"&gt;&lt;pre&gt;&lt;span class="pl-ent"&gt;form&lt;/span&gt; &lt;span class="pl-ent"&gt;input&lt;/span&gt;[&lt;span class="pl-c1"&gt;type&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-s"&gt;submit&lt;/span&gt;]&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-ent"&gt;form&lt;/span&gt; &lt;span class="pl-ent"&gt;button&lt;/span&gt;[&lt;span class="pl-c1"&gt;type&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-s"&gt;button&lt;/span&gt;] {
    &lt;span class="pl-c1"&gt;font-weight&lt;/span&gt;&lt;span class="pl-kos"&gt;:&lt;/span&gt; &lt;span class="pl-c1"&gt;400&lt;/span&gt;;
    &lt;span class="pl-c1"&gt;cursor&lt;/span&gt;&lt;span class="pl-kos"&gt;:&lt;/span&gt; pointer;
    &lt;span class="pl-c1"&gt;text-align&lt;/span&gt;&lt;span class="pl-kos"&gt;:&lt;/span&gt; center;
    &lt;span class="pl-c1"&gt;vertical-align&lt;/span&gt;&lt;span class="pl-kos"&gt;:&lt;/span&gt; middle;
    &lt;span class="pl-c1"&gt;border-width&lt;/span&gt;&lt;span class="pl-kos"&gt;:&lt;/span&gt; &lt;span class="pl-c1"&gt;1&lt;span class="pl-smi"&gt;px&lt;/span&gt;&lt;/span&gt;;
    &lt;span class="pl-c1"&gt;border-style&lt;/span&gt;&lt;span class="pl-kos"&gt;:&lt;/span&gt; solid;
    &lt;span class="pl-c1"&gt;padding&lt;/span&gt;&lt;span class="pl-kos"&gt;:&lt;/span&gt; &lt;span class="pl-c1"&gt;.5&lt;span class="pl-smi"&gt;em&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-c1"&gt;0.8&lt;span class="pl-smi"&gt;em&lt;/span&gt;&lt;/span&gt;;
    &lt;span class="pl-c1"&gt;font-size&lt;/span&gt;&lt;span class="pl-kos"&gt;:&lt;/span&gt; &lt;span class="pl-c1"&gt;0.9&lt;span class="pl-smi"&gt;rem&lt;/span&gt;&lt;/span&gt;;
    &lt;span class="pl-c1"&gt;line-height&lt;/span&gt;&lt;span class="pl-kos"&gt;:&lt;/span&gt; &lt;span class="pl-c1"&gt;1&lt;/span&gt;;
    &lt;span class="pl-c1"&gt;border-radius&lt;/span&gt;&lt;span class="pl-kos"&gt;:&lt;/span&gt; &lt;span class="pl-c1"&gt;.25&lt;span class="pl-smi"&gt;rem&lt;/span&gt;&lt;/span&gt;;
}&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;These style rules apply to &lt;em&gt;any&lt;/em&gt; submit button or button-button that occurs inside a form!&lt;/p&gt;
&lt;p&gt;I'm glad I caught this before Datasette 1.0. I've now &lt;a href="https://github.com/simonw/datasette/issues/2415"&gt;started the process of fixing that&lt;/a&gt;, by ensuring these rules only apply to elements with &lt;code&gt;class="core"&lt;/code&gt; (or that class on a wrapping element). This ensures plugins can style these elements without being caught out by Datasette's defaults.&lt;/p&gt;
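&lt;p&gt;The scoped version of a rule like the one above looks roughly like this - a sketch only, the exact selectors Datasette ends up with may differ:&lt;/p&gt;

```css
/* Only applies inside a form (or wrapper element) that opts in
   with class="core" - plugin markup is left alone */
form.core input[type=submit],
.core form input[type=submit],
form.core button[type=button],
.core form button[type=button] {
    font-weight: 400;
    cursor: pointer;
    /* ...rest of the button styling... */
}
```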
&lt;p&gt;The problem is... there are a whole bunch of existing plugins that currently rely on that behaviour. I have &lt;a href="https://github.com/simonw/datasette/issues/2417"&gt;a tracking issue&lt;/a&gt; about that, which identified 28 plugins that need updating. I've worked my way through 8 of those so far, hence the flurry of releases listed at the bottom of this post.&lt;/p&gt;
&lt;p&gt;This is also an excuse to revisit a bunch of older plugins, some of which had partially complete features that I've been finishing up.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://github.com/simonw/datasette-write"&gt;datasette-write&lt;/a&gt; for example now has &lt;a href="https://github.com/simonw/datasette-write/issues/10"&gt;a neat row action menu item&lt;/a&gt; for updating a selected row using a pre-canned UPDATE query. Here's an animated demo of my first prototype of that feature:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2024/datasette-write-row.gif" alt="Animated demo - on the row page for a release I click row actions and select Update using SQL, which navigates to a page with a big UPDATE SQL query and a form showing all of the existing values." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;h4 id="on-the-blog"&gt;On the blog&lt;/h4&gt;
&lt;p&gt;&lt;strong&gt;anthropic&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://simonwillison.net/2024/Aug/23/anthropic-dangerous-direct-browser-access"&gt;Claude's API now supports CORS requests, enabling client-side applications&lt;/a&gt; - 2024-08-23&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://simonwillison.net/2024/Aug/23/explain-acls"&gt;Explain ACLs by showing me a SQLite table schema for implementing them&lt;/a&gt; - 2024-08-23&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://simonwillison.net/2024/Aug/24/oauth-llms"&gt;Musing about OAuth and LLMs on Mastodon&lt;/a&gt; - 2024-08-24&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://simonwillison.net/2024/Aug/26/gemini-bounding-box-visualization"&gt;Building a tool showing how Gemini Pro can return bounding boxes for objects in images&lt;/a&gt; - 2024-08-26&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://simonwillison.net/2024/Aug/26/long-context-prompting-tips"&gt;Long context prompting tips&lt;/a&gt; - 2024-08-26&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://simonwillison.net/2024/Aug/26/anthropic-system-prompts"&gt;Anthropic Release Notes: System Prompts&lt;/a&gt; - 2024-08-26&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://simonwillison.net/2024/Aug/26/alex-albert"&gt;Alex Albert: We've read and heard that you'd appreciate more t...&lt;/a&gt; - 2024-08-26&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://simonwillison.net/2024/Aug/27/gemini-chat-app"&gt;Gemini Chat App&lt;/a&gt; - 2024-08-27&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://simonwillison.net/2024/Aug/28/system-prompt-for-townie"&gt;System prompt for val.town/townie&lt;/a&gt; - 2024-08-28&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://simonwillison.net/2024/Aug/28/how-anthropic-built-artifacts"&gt;How Anthropic built Artifacts&lt;/a&gt; - 2024-08-28&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://simonwillison.net/2024/Aug/30/anthropic-prompt-engineering-interactive-tutorial"&gt;Anthropic's Prompt Engineering Interactive Tutorial&lt;/a&gt; - 2024-08-30&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://simonwillison.net/2024/Aug/30/llm-claude-3"&gt;llm-claude-3 0.4.1&lt;/a&gt; - 2024-08-30&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;ai-assisted-programming&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://simonwillison.net/2024/Aug/24/andy-jassy-amazon-ceo"&gt;Andy Jassy, Amazon CEO: [...] here’s what we found when we integrated [Am...&lt;/a&gt; - 2024-08-24&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://simonwillison.net/2024/Aug/26/ai-powered-git-commit-function"&gt;AI-powered Git Commit Function&lt;/a&gt; - 2024-08-26&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://simonwillison.net/2024/Aug/30/openai-file-search"&gt;OpenAI: Improve file search result relevance with chunk ranking&lt;/a&gt; - 2024-08-30&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://simonwillison.net/2024/Aug/31/forrest-brazeal"&gt;Forrest Brazeal: I think that AI has killed, or is about to kill, ...&lt;/a&gt; - 2024-08-31&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;gemini&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://simonwillison.net/2024/Aug/24/pipe-syntax-in-sql"&gt;SQL Has Problems. We Can Fix Them: Pipe Syntax In SQL&lt;/a&gt; - 2024-08-24&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://simonwillison.net/2024/Aug/27/distro"&gt;NousResearch/DisTrO&lt;/a&gt; - 2024-08-27&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;python&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://simonwillison.net/2024/Sep/1/uvtrick"&gt;uvtrick&lt;/a&gt; - 2024-09-01&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://simonwillison.net/2024/Sep/2/anatomy-of-a-textual-user-interface"&gt;Anatomy of a Textual User Interface&lt;/a&gt; - 2024-09-02&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://simonwillison.net/2024/Sep/2/why-i-still-use-python-virtual-environments-in-docker"&gt;Why I Still Use Python Virtual Environments in Docker&lt;/a&gt; - 2024-09-02&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://simonwillison.net/2024/Sep/3/python-developers-survey-2023"&gt;Python Developers Survey 2023 Results&lt;/a&gt; - 2024-09-03&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;security&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://simonwillison.net/2024/Aug/23/microsoft-copilot-data-governance"&gt;Top companies ground Microsoft Copilot over data governance concerns&lt;/a&gt; - 2024-08-23&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://simonwillison.net/2024/Aug/26/frederik-braun"&gt;Frederik Braun: In 2021 we [the Mozilla engineering team] found “...&lt;/a&gt; - 2024-08-26&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://simonwillison.net/2024/Sep/5/oauth-from-first-principles"&gt;OAuth from First Principles&lt;/a&gt; - 2024-09-05&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;projects&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://simonwillison.net/2024/Aug/25/covidsewage-alt-text"&gt;My @covidsewage bot now includes useful alt text&lt;/a&gt; - 2024-08-25&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;armin-ronacher&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://simonwillison.net/2024/Aug/27/minijinja"&gt;MiniJinja: Learnings from Building a Template Engine in Rust&lt;/a&gt; - 2024-08-27&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;ethics&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://simonwillison.net/2024/Aug/27/john-gruber"&gt;John Gruber: Everyone alive today has grown up in a world wher...&lt;/a&gt; - 2024-08-27&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;open-source&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://simonwillison.net/2024/Aug/27/open-source-ai"&gt;Debate over “open source AI” term brings new push to formalize definition&lt;/a&gt; - 2024-08-27&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://simonwillison.net/2024/Aug/29/elasticsearch-is-open-source-again"&gt;Elasticsearch is open source, again&lt;/a&gt; - 2024-08-29&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;performance&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://simonwillison.net/2024/Aug/28/cerebras-inference"&gt;Cerebras Inference: AI at Instant Speed&lt;/a&gt; - 2024-08-28&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;sqlite&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://simonwillison.net/2024/Aug/28/d-richard-hipp"&gt;D. Richard Hipp: My goal is to keep SQLite relevant and viable thr...&lt;/a&gt; - 2024-08-28&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;aws&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://simonwillison.net/2024/Aug/30/leader-election-with-s3-conditional-writes"&gt;Leader Election With S3 Conditional Writes&lt;/a&gt; - 2024-08-30&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;javascript&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://simonwillison.net/2024/Aug/31/andreas-giammarchi"&gt;Andreas Giammarchi: whenever you do this: `el.innerHTML += HTML`  ...&lt;/a&gt; - 2024-08-31&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;openai&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://simonwillison.net/2024/Aug/31/openai-says-chatgpt-usage-has-doubled-since-last-year"&gt;OpenAI says ChatGPT usage has doubled since last year&lt;/a&gt; - 2024-08-31&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;art&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://simonwillison.net/2024/Aug/31/ted-chiang"&gt;Ted Chiang: Art is notoriously hard to define, and so are the...&lt;/a&gt; - 2024-08-31&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;llm&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://simonwillison.net/2024/Sep/3/anjor"&gt;anjor: `history | tail -n 2000 | llm -s "Write aliases f...&lt;/a&gt; - 2024-09-03&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;vision-llms&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://simonwillison.net/2024/Sep/4/qwen2-vl"&gt;Qwen2-VL: To See the World More Clearly&lt;/a&gt; - 2024-09-04&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 id="releases"&gt;Releases&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/datasette/datasette-import/releases/tag/0.1a5"&gt;datasette-import 0.1a5&lt;/a&gt;&lt;/strong&gt; - 2024-09-04&lt;br /&gt;Tools for importing data into Datasette&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette-search-all/releases/tag/1.1.3"&gt;datasette-search-all 1.1.3&lt;/a&gt;&lt;/strong&gt; - 2024-09-04&lt;br /&gt;Datasette plugin for searching all searchable tables at once&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette-write/releases/tag/0.4"&gt;datasette-write 0.4&lt;/a&gt;&lt;/strong&gt; - 2024-09-04&lt;br /&gt;Datasette plugin providing a UI for executing SQL writes against the database&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/datasette/datasette-debug-events/releases/tag/0.1a0"&gt;datasette-debug-events 0.1a0&lt;/a&gt;&lt;/strong&gt; - 2024-09-03&lt;br /&gt;Print Datasette events to standard error&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette-auth-passwords/releases/tag/1.1.1"&gt;datasette-auth-passwords 1.1.1&lt;/a&gt;&lt;/strong&gt; - 2024-09-03&lt;br /&gt;Datasette plugin for authentication using passwords&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/datasette/datasette-enrichments/releases/tag/0.4.3"&gt;datasette-enrichments 0.4.3&lt;/a&gt;&lt;/strong&gt; - 2024-09-03&lt;br /&gt;Tools for running enrichments against data stored in Datasette&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette-configure-fts/releases/tag/1.1.4"&gt;datasette-configure-fts 1.1.4&lt;/a&gt;&lt;/strong&gt; - 2024-09-03&lt;br /&gt;Datasette plugin for enabling full-text search against selected table columns&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette-auth-tokens/releases/tag/0.4a10"&gt;datasette-auth-tokens 0.4a10&lt;/a&gt;&lt;/strong&gt; - 2024-09-03&lt;br /&gt;Datasette plugin for authenticating access using API tokens&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette-edit-schema/releases/tag/0.8a3"&gt;datasette-edit-schema 0.8a3&lt;/a&gt;&lt;/strong&gt; - 2024-09-03&lt;br /&gt;Datasette plugin for modifying table schemas&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/datasette/datasette-pins/releases/tag/0.1a4"&gt;datasette-pins 0.1a4&lt;/a&gt;&lt;/strong&gt; - 2024-09-01&lt;br /&gt;Pin databases, tables, and other items to the Datasette homepage&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/datasette/datasette-acl/releases/tag/0.4a2"&gt;datasette-acl 0.4a2&lt;/a&gt;&lt;/strong&gt; - 2024-09-01&lt;br /&gt;Advanced permission management for Datasette&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/llm-claude-3/releases/tag/0.4.1"&gt;llm-claude-3 0.4.1&lt;/a&gt;&lt;/strong&gt; - 2024-08-30&lt;br /&gt;LLM plugin for interacting with the Claude 3 family of models&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 id="tils"&gt;TILs&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://til.simonwillison.net/playwright/testing-tables"&gt;Testing HTML tables with Playwright Python&lt;/a&gt; - 2024-09-04&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://til.simonwillison.net/pytest/namedtuple-parameterized-tests"&gt;Using namedtuple for pytest parameterized tests&lt;/a&gt; - 2024-08-31&lt;/li&gt;
&lt;/ul&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/css"&gt;css&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/javascript"&gt;javascript&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pdf"&gt;pdf&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/weeknotes"&gt;weeknotes&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/anthropic"&gt;anthropic&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude"&gt;claude&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gemini"&gt;gemini&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude-3-5-sonnet"&gt;claude-3-5-sonnet&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/cors"&gt;cors&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="css"/><category term="javascript"/><category term="pdf"/><category term="projects"/><category term="ai"/><category term="datasette"/><category term="weeknotes"/><category term="generative-ai"/><category term="llms"/><category term="anthropic"/><category term="claude"/><category term="gemini"/><category term="claude-3-5-sonnet"/><category term="cors"/></entry><entry><title>NousResearch/DisTrO</title><link href="https://simonwillison.net/2024/Aug/27/distro/#atom-tag" rel="alternate"/><published>2024-08-27T20:10:11+00:00</published><updated>2024-08-27T20:10:11+00:00</updated><id>https://simonwillison.net/2024/Aug/27/distro/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/NousResearch/DisTrO"&gt;NousResearch/DisTrO&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;DisTrO stands for Distributed Training Over-The-Internet - it's "a family of low latency distributed optimizers that reduce inter-GPU communication requirements by three to four orders of magnitude".&lt;/p&gt;
&lt;p&gt;This &lt;a href="https://twitter.com/NousResearch/status/1828121648383566270"&gt;tweet from @NousResearch&lt;/a&gt; helps explain why this could be a big deal:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;DisTrO can increase the resilience and robustness of training LLMs by minimizing dependency on a single entity for computation. DisTrO is one step towards a more secure and equitable environment for all participants involved in building LLMs.&lt;/p&gt;
&lt;p&gt;Without relying on a single company to manage and control the training process, researchers and institutions can have more freedom to collaborate and experiment with new techniques, algorithms, and models.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Training large models is notoriously expensive in terms of GPUs, and most training techniques require those GPUs to be collocated due to the huge amount of information that needs to be exchanged between them during the training runs.&lt;/p&gt;
&lt;p&gt;If DisTrO works as advertised it could enable SETI@home style collaborative training projects, where thousands of home users contribute their GPUs to a larger project.&lt;/p&gt;
&lt;p&gt;There are more technical details in &lt;a href="https://github.com/NousResearch/DisTrO/blob/main/A_Preliminary_Report_on_DisTrO.pdf"&gt;the PDF preliminary report&lt;/a&gt; shared by Nous Research on GitHub.&lt;/p&gt;
&lt;p&gt;I continue to hate reading PDFs on a mobile phone, so I converted that report into GitHub Flavored Markdown (to ensure support for tables) and &lt;a href="https://gist.github.com/simonw/46a33d66e069efe5c10b63625fdabb4e"&gt;shared that as a Gist&lt;/a&gt;. I used Gemini 1.5 Pro (&lt;code&gt;gemini-1.5-pro-exp-0801&lt;/code&gt;) in &lt;a href="https://aistudio.google.com/"&gt;Google AI Studio&lt;/a&gt; with the following prompt:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;Convert this PDF to github-flavored markdown, including using markdown for the tables. Leave a bold note for any figures saying they should be inserted separately.&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/pdf"&gt;pdf&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gemini"&gt;gemini&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gpus"&gt;gpus&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/nous-research"&gt;nous-research&lt;/a&gt;&lt;/p&gt;



</summary><category term="pdf"/><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="gemini"/><category term="gpus"/><category term="nous-research"/></entry><entry><title>SQL Has Problems. We Can Fix Them: Pipe Syntax In SQL</title><link href="https://simonwillison.net/2024/Aug/24/pipe-syntax-in-sql/#atom-tag" rel="alternate"/><published>2024-08-24T23:00:01+00:00</published><updated>2024-08-24T23:00:01+00:00</updated><id>https://simonwillison.net/2024/Aug/24/pipe-syntax-in-sql/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://research.google/pubs/sql-has-problems-we-can-fix-them-pipe-syntax-in-sql/"&gt;SQL Has Problems. We Can Fix Them: Pipe Syntax In SQL&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;A new paper from Google Research describing custom syntax for analytical SQL queries that has been rolling out inside Google since February, reaching 1,600 "seven-day-active users" by August 2024.&lt;/p&gt;
&lt;p&gt;A key idea here is to fix one of the biggest usability problems with standard SQL: the order of the clauses in a query. Starting with &lt;code&gt;SELECT&lt;/code&gt; instead of &lt;code&gt;FROM&lt;/code&gt; has always been confusing, see &lt;a href="https://jvns.ca/blog/2019/10/03/sql-queries-don-t-start-with-select/"&gt;SQL queries don't start with SELECT&lt;/a&gt; by Julia Evans.&lt;/p&gt;
&lt;p&gt;Here's an example of the new alternative syntax, taken from the &lt;a href="https://github.com/google/zetasql/blob/2024.08.2/docs/pipe-syntax.md"&gt;Pipe query syntax documentation&lt;/a&gt; that was added to Google's open source &lt;a href="https://github.com/google/zetasql"&gt;ZetaSQL&lt;/a&gt; project last week.&lt;/p&gt;
&lt;p&gt;For this SQL query:&lt;/p&gt;
&lt;div class="highlight highlight-source-sql"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;SELECT&lt;/span&gt; component_id, &lt;span class="pl-c1"&gt;COUNT&lt;/span&gt;(&lt;span class="pl-k"&gt;*&lt;/span&gt;)
&lt;span class="pl-k"&gt;FROM&lt;/span&gt; ticketing_system_table
&lt;span class="pl-k"&gt;WHERE&lt;/span&gt;
  &lt;span class="pl-c1"&gt;assignee_user&lt;/span&gt;.&lt;span class="pl-c1"&gt;email&lt;/span&gt; &lt;span class="pl-k"&gt;=&lt;/span&gt; &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;username@email.com&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;
  &lt;span class="pl-k"&gt;AND&lt;/span&gt; status &lt;span class="pl-k"&gt;IN&lt;/span&gt; (&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;NEW&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;, &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;ASSIGNED&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;, &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;ACCEPTED&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;)
&lt;span class="pl-k"&gt;GROUP BY&lt;/span&gt; component_id
&lt;span class="pl-k"&gt;ORDER BY&lt;/span&gt; component_id &lt;span class="pl-k"&gt;DESC&lt;/span&gt;;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The Pipe query alternative would look like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;FROM ticketing_system_table
|&amp;gt; WHERE
    assignee_user.email = 'username@email.com'
    AND status IN ('NEW', 'ASSIGNED', 'ACCEPTED')
|&amp;gt; AGGREGATE COUNT(*)
   GROUP AND ORDER BY component_id DESC;
&lt;/code&gt;&lt;/pre&gt;
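
&lt;p&gt;The two forms compute the same thing. Here's a quick way to sanity-check the semantics of the standard-SQL version against SQLite - I've flattened the &lt;code&gt;assignee_user.email&lt;/code&gt; nested field into a plain column, since SQLite has no struct types:&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ticketing_system_table "
    "(component_id INTEGER, assignee_user_email TEXT, status TEXT)"
)
conn.executemany(
    "INSERT INTO ticketing_system_table VALUES (?, ?, ?)",
    [
        (2, "username@email.com", "NEW"),
        (2, "username@email.com", "ASSIGNED"),
        (1, "username@email.com", "ACCEPTED"),
        (1, "other@email.com", "NEW"),        # filtered out: different assignee
        (3, "username@email.com", "CLOSED"),  # filtered out: status not matched
    ],
)
rows = conn.execute("""
    SELECT component_id, COUNT(*)
    FROM ticketing_system_table
    WHERE assignee_user_email = 'username@email.com'
      AND status IN ('NEW', 'ASSIGNED', 'ACCEPTED')
    GROUP BY component_id
    ORDER BY component_id DESC
""").fetchall()
print(rows)  # [(2, 2), (1, 1)]
```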

&lt;p&gt;The Google Research paper is released as a two-column PDF. I &lt;a href="https://news.ycombinator.com/item?id=41339138"&gt;snarked about this&lt;/a&gt; on Hacker News: &lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Google: you are a web company. Please learn to publish your research papers as web pages.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This remains a long-standing pet peeve of mine. PDFs like this are horrible to read on mobile phones, hard to copy-and-paste from, have poor accessibility (see &lt;a href="https://fedi.simonwillison.net/@simon/113017908957136345"&gt;this Mastodon conversation&lt;/a&gt;) and are generally just &lt;em&gt;bad citizens&lt;/em&gt; of the web.&lt;/p&gt;
&lt;p&gt;Having complained about this I felt compelled to see if I could address it myself. Google's own Gemini Pro 1.5 model can process PDFs, so I uploaded the PDF to &lt;a href="https://aistudio.google.com/"&gt;Google AI Studio&lt;/a&gt; and prompted the &lt;code&gt;gemini-1.5-pro-exp-0801&lt;/code&gt; model like this:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Convert this document to neatly styled semantic HTML&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This worked &lt;em&gt;surprisingly well&lt;/em&gt;. It output HTML for about half the document and then stopped, presumably hitting the output length limit, but a follow-up prompt of "and the rest" caused it to continue from where it stopped and run until the end.&lt;/p&gt;
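&lt;p&gt;That "and the rest" trick is easy to automate. A minimal sketch - &lt;code&gt;send_message&lt;/code&gt; here is a stand-in for whatever returns the model's text for a prompt in an ongoing chat session (for example wrapping &lt;code&gt;chat.send_message(...).text&lt;/code&gt; from the &lt;code&gt;google-generativeai&lt;/code&gt; library), and the completeness check is a crude heuristic:&lt;/p&gt;

```python
def convert_with_continuation(send_message, max_rounds=5):
    """Prompt for an HTML conversion, then keep asking for 'the rest'
    until the output looks complete or we give up."""
    parts = [send_message("Convert this document to neatly styled semantic HTML")]
    for _ in range(max_rounds):
        if "</html>" in parts[-1].lower():
            break  # looks like the model reached the end of the document
        parts.append(send_message("and the rest"))
    return "".join(parts)


# Stub model for illustration: first reply is truncated, second finishes.
replies = iter(["<html><body><p>First half...", "<p>Second half</p></body></html>"])
html = convert_with_continuation(lambda prompt: next(replies))
print("</html>" in html)  # True
```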
&lt;p&gt;Here's the result (with a banner I added at the top explaining that it's a conversion): &lt;a href="https://static.simonwillison.net/static/2024/Pipe-Syntax-In-SQL.html"&gt;Pipe-Syntax-In-SQL.html&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;I haven't compared the two completely, so I can't guarantee there are no omissions or mistakes.&lt;/p&gt;
&lt;p&gt;The figures from the PDF aren't present - Gemini Pro output tags like &lt;code&gt;&amp;lt;img src="figure1.png" alt="Figure 1: SQL syntactic clause order doesn't match semantic evaluation order. (From [25].)"&amp;gt;&lt;/code&gt; but did nothing to help me create those images.&lt;/p&gt;
&lt;p&gt;Amusingly the document ends with &lt;code&gt;&amp;lt;p&amp;gt;(A long list of references, which I won't reproduce here to save space.)&amp;lt;/p&amp;gt;&lt;/code&gt; rather than actually including the references from the paper!&lt;/p&gt;
&lt;p&gt;So this isn't a perfect solution, but considering it took just the first prompt I could think of it's a very promising start. I expect someone willing to spend more than the couple of minutes I invested in this could produce a very useful HTML alternative version of the paper with the assistance of Gemini Pro.&lt;/p&gt;
&lt;p&gt;One last amusing note: I posted a link to this &lt;a href="https://news.ycombinator.com/item?id=41339238"&gt;to Hacker News&lt;/a&gt; a few hours ago. Just now when I searched Google for the exact title of the paper my HTML version was already the third result!&lt;/p&gt;
&lt;p&gt;I've now added a &lt;code&gt;&amp;lt;meta name="robots" content="noindex, follow"&amp;gt;&lt;/code&gt; tag to the top of the HTML to keep this unverified &lt;a href="https://simonwillison.net/tags/slop/"&gt;AI slop&lt;/a&gt; out of their search index. This is a good reminder of how much better HTML is than PDF for sharing information on the web!&lt;/p&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=41338877"&gt;Hacker News&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/google"&gt;google&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pdf"&gt;pdf&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/seo"&gt;seo&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sql"&gt;sql&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/julia-evans"&gt;julia-evans&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gemini"&gt;gemini&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/slop"&gt;slop&lt;/a&gt;&lt;/p&gt;



</summary><category term="google"/><category term="pdf"/><category term="seo"/><category term="sql"/><category term="ai"/><category term="julia-evans"/><category term="generative-ai"/><category term="llms"/><category term="gemini"/><category term="slop"/></entry><entry><title>GPT-4o System Card</title><link href="https://simonwillison.net/2024/Aug/8/gpt-4o-system-card/#atom-tag" rel="alternate"/><published>2024-08-08T23:58:32+00:00</published><updated>2024-08-08T23:58:32+00:00</updated><id>https://simonwillison.net/2024/Aug/8/gpt-4o-system-card/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://openai.com/index/gpt-4o-system-card/"&gt;GPT-4o System Card&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;There are some fascinating new details in this lengthy report outlining the safety work carried out prior to the release of GPT-4o.&lt;/p&gt;
&lt;p&gt;A few highlights that stood out to me. First, this clear explanation of how GPT-4o differs from previous OpenAI models:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;GPT-4o is an autoregressive omni model, which accepts as input any combination of text, audio, image, and video and generates any combination of text, audio, and image outputs. It’s trained end-to-end across text, vision, and audio, meaning that all inputs and outputs are processed by the same neural network. &lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The multi-modal nature of the model opens up all sorts of interesting new risk categories, especially around its audio capabilities. For privacy and anti-surveillance reasons the model is designed &lt;em&gt;not&lt;/em&gt; to identify speakers based on their voice:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;We post-trained GPT-4o to refuse to comply with requests to identify someone based on a voice in an audio input, while still complying with requests to identify people associated with famous quotes.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;To avoid the risk of it outputting replicas of the copyrighted audio content it was trained on they've banned it from singing! I'm really sad about this:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;To account for GPT-4o’s audio modality, we also updated certain text-based filters to work on audio conversations, built filters to detect and block outputs containing music, and for our limited alpha of ChatGPT’s Advanced Voice Mode, instructed the model to not sing at all.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;There are some fun audio clips embedded in the report. My favourite is &lt;a href="https://openai.com/index/gpt-4o-system-card/#citation-top-E"&gt;this one&lt;/a&gt;, demonstrating a (now fixed) bug where it could sometimes start imitating the user:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Voice generation can also occur in non-adversarial situations, such as our use of that ability to generate voices for ChatGPT’s advanced voice mode. During testing, we also observed rare instances where the model would unintentionally generate an output emulating the user’s voice.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;They took a lot of measures to prevent it from straying from the pre-defined voices - evidently the underlying model is capable of producing almost any voice imaginable, but they've locked that down:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Additionally, we built a standalone output classifier to detect if the GPT-4o output is using a voice that’s different from our approved list. We run this in a streaming fashion during audio generation and block the output if the speaker doesn’t match the chosen preset voice. [...] Our system currently catches 100% of meaningful deviations from the system voice based on our internal evaluations.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Two new-to-me terms: &lt;strong&gt;UGI&lt;/strong&gt; for Ungrounded Inference, defined as "making inferences about a speaker that couldn’t be determined solely from audio content" - things like estimating the intelligence of the speaker. &lt;strong&gt;STA&lt;/strong&gt; for Sensitive Trait Attribution, "making inferences about a speaker that could plausibly be determined solely from audio content" like guessing their gender or nationality:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;We post-trained GPT-4o to refuse to comply with UGI requests, while hedging answers to STA questions. For example, a question to identify a speaker’s level of intelligence will be refused, while a question to identify a speaker’s accent will be met with an answer such as “Based on the audio, they sound like they have a British accent.”&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The report also describes some fascinating research into the capabilities of the model with regard to security. Could it exploit vulnerabilities in CTF challenges?&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;We evaluated GPT-4o with iterative debugging and access to tools available in the &lt;a href="https://www.kali.org/"&gt;headless Kali Linux distribution&lt;/a&gt; (with up to 30 rounds of tool use for each attempt). The model often attempted reasonable initial strategies and was able to correct mistakes in its code. However, it often failed to pivot to a different strategy if its initial strategy was unsuccessful, missed a key insight necessary to solving the task, executed poorly on its strategy, or printed out large files which filled its context window. Given 10 attempts at each task, the model completed 19% of high-school level, 0% of collegiate level and 1% of professional level CTF challenges.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;How about persuasiveness? They carried out a study looking at political opinion shifts in response to AI-generated audio clips, complete with a "thorough debrief" at the end to try and undo any damage the experiment had caused to their participants:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;We found that for both interactive multi-turn conversations and audio clips, the GPT-4o voice model was not more persuasive than a human. Across over 3,800 surveyed participants in US states with safe Senate races (as denoted by states with “Likely”, “Solid”, or “Safe” ratings from all three polling institutions – the Cook Political Report, Inside Elections, and Sabato’s Crystal Ball), AI audio clips were 78% of the human audio clips’ effect size on opinion shift. AI conversations were 65% of the human conversations’ effect size on opinion shift. [...] Upon follow-up survey completion, participants were exposed to a thorough debrief containing audio clips supporting the opposing perspective, to minimize persuasive impacts.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;There's a note about the potential for harm from users of the system developing bad habits from interrupting the model:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Extended interaction with the model might influence social norms. For example, our models are deferential, allowing users to interrupt and ‘take the mic’ at any time, which, while expected for an AI, would be anti-normative in human interactions.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p id="scheming"&gt;Finally, another piece of new-to-me terminology: &lt;strong&gt;scheming&lt;/strong&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Apollo Research defines scheming as AIs gaming their oversight mechanisms as a means to achieve a goal. Scheming could involve gaming evaluations, undermining security measures, or strategically influencing successor systems during internal deployment at OpenAI. Such behaviors could plausibly lead to loss of control over an AI.&lt;/p&gt;
&lt;p&gt;Apollo Research evaluated capabilities of scheming in GPT-4o [...] GPT-4o showed moderate self-awareness of its AI identity and strong ability to reason about others’ beliefs in &lt;em&gt;question-answering contexts&lt;/em&gt; but lacked strong capabilities in reasoning about itself or others in &lt;em&gt;applied agent settings&lt;/em&gt;. Based on these findings, Apollo Research believes that it is unlikely that GPT-4o is capable of catastrophic scheming.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The report is available as both a PDF file and an elegantly designed mobile-friendly web page, which is great - I hope more research organizations will start waking up to the importance of not going PDF-only for this kind of document.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/pdf"&gt;pdf&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openai"&gt;openai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/vision-llms"&gt;vision-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/multi-modal-output"&gt;multi-modal-output&lt;/a&gt;&lt;/p&gt;



</summary><category term="pdf"/><category term="ai"/><category term="openai"/><category term="generative-ai"/><category term="llms"/><category term="vision-llms"/><category term="multi-modal-output"/></entry><entry><title>Gemini 1.5 Flash price drop</title><link href="https://simonwillison.net/2024/Aug/8/gemini-15-flash-price-drop/#atom-tag" rel="alternate"/><published>2024-08-08T22:27:40+00:00</published><updated>2024-08-08T22:27:40+00:00</updated><id>https://simonwillison.net/2024/Aug/8/gemini-15-flash-price-drop/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://developers.googleblog.com/en/gemini-15-flash-updates-google-ai-studio-gemini-api/"&gt;Gemini 1.5 Flash price drop&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Google Gemini 1.5 Flash was already one of the cheapest models, at 35c/million input tokens. Today they dropped that to just 7.5c/million input (and 30c/million output) for prompts below 128,000 tokens.&lt;/p&gt;
&lt;p&gt;The pricing war for best value fast-and-cheap model is red hot right now. The current most significant offerings are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Google's Gemini 1.5 Flash: &lt;a href="https://ai.google.dev/pricing"&gt;7.5c/million input, 30c/million output&lt;/a&gt; (below 128,000 input tokens)&lt;/li&gt;
&lt;li&gt;OpenAI's GPT-4o mini: &lt;a href="https://openai.com/api/pricing/"&gt;15c/million input, 60c/million output&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Anthropic's Claude 3 Haiku: &lt;a href="https://www.anthropic.com/pricing#anthropic-api"&gt;25c/million input, $1.25/million output&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Or you can use OpenAI's GPT-4o mini via their &lt;a href="https://platform.openai.com/docs/guides/batch"&gt;batch API&lt;/a&gt;, which halves the price (resulting in the same price as Gemini 1.5 Flash) in exchange for the results being delayed by up to 24 hours.&lt;/p&gt;
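&lt;p&gt;The arithmetic behind these comparisons is simple enough to sketch in a few lines of Python. The rates below are the ones quoted above (cents per million tokens) and will inevitably go stale:&lt;/p&gt;

```python
# Rates in cents per million tokens, as quoted in this post (August 2024).
PRICES = {
    "gemini-1.5-flash": (7.5, 30.0),   # below 128,000 input tokens
    "gpt-4o-mini": (15.0, 60.0),
    "claude-3-haiku": (25.0, 125.0),
}

def cost_cents(model, input_tokens, output_tokens, batch=False):
    """Cost in cents for one call; batch=True applies the 50% batch discount."""
    in_rate, out_rate = PRICES[model]
    cost = (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000
    return cost / 2 if batch else cost

# GPT-4o mini via the batch API matches Gemini 1.5 Flash on input price:
assert cost_cents("gpt-4o-mini", 1_000_000, 0, batch=True) == 7.5
assert cost_cents("gemini-1.5-flash", 1_000_000, 0) == 7.5
```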
&lt;p&gt;Worth noting that Gemini 1.5 Flash is more multi-modal than the other models: it can handle text, images, video &lt;em&gt;and&lt;/em&gt; audio.&lt;/p&gt;
&lt;p&gt;Also in today's announcement:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;PDF Vision and Text understanding&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The Gemini API and AI Studio now support PDF understanding through both text and vision. If your PDF includes graphs, images, or other non-text visual content, the model uses native multi-modal capabilities to process the PDF. You can try this out via Google AI Studio or in the Gemini API.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This is &lt;em&gt;huge&lt;/em&gt;. Most models that accept PDFs do so by extracting text directly from the files (see &lt;a href="https://simonwillison.net/2024/Jun/27/ai-worlds-fair/#slide.027.jpeg"&gt;previous notes&lt;/a&gt;), without using OCR. It sounds like Gemini can now handle PDFs as if they were a sequence of images, which should open up much more powerful general PDF workflows.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Update&lt;/strong&gt;: it turns out Gemini also has a &lt;a href="https://cloud.google.com/vertex-ai/generative-ai/docs/multimodal/batch-prediction-gemini"&gt;50% off batch mode&lt;/a&gt;, so that’s 3.75c/million input tokens for batch mode 1.5 Flash!

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/OfficialLoganK/status/1821601298195878323"&gt;@OfficialLoganK&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/google"&gt;google&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pdf"&gt;pdf&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openai"&gt;openai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/anthropic"&gt;anthropic&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude"&gt;claude&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gemini"&gt;gemini&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/vision-llms"&gt;vision-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-pricing"&gt;llm-pricing&lt;/a&gt;&lt;/p&gt;



</summary><category term="google"/><category term="pdf"/><category term="ai"/><category term="openai"/><category term="generative-ai"/><category term="llms"/><category term="anthropic"/><category term="claude"/><category term="gemini"/><category term="vision-llms"/><category term="llm-pricing"/></entry><entry><title>Compare PDFs</title><link href="https://simonwillison.net/2024/Jul/2/compare-pdfs/#atom-tag" rel="alternate"/><published>2024-07-02T19:54:30+00:00</published><updated>2024-07-02T19:54:30+00:00</updated><id>https://simonwillison.net/2024/Jul/2/compare-pdfs/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://tools.simonwillison.net/compare-pdfs"&gt;Compare PDFs&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Inspired by &lt;a href="https://news.ycombinator.com/item?id=40854319"&gt;this thread&lt;/a&gt; on Hacker News about the C++ &lt;a href="http://vslavik.github.io/diff-pdf/"&gt;diff-pdf&lt;/a&gt; tool I decided to see what it would take to produce a web-based PDF diff visualization tool using Claude 3.5 Sonnet.&lt;/p&gt;
&lt;p&gt;It took two prompts:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Build a tool where I can drag and drop on two PDF files and it uses PDF.js to turn each of their pages into canvas elements and then displays those pages side by side with a third image that highlights any differences between them, if any differences exist&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;That gave me a React app that didn't quite work, so I followed up with this:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;rewrite that code to not use React at all&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Which gave me a working tool! You can see the full Claude transcript &lt;a href="https://gist.github.com/simonw/9d7cbe02d448812f48070e7de13a5ae5"&gt;in this Gist&lt;/a&gt;. Here's a screenshot of the tool in action:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Screenshot showing a page from a PDF, then that page with a red squiggle on it, then just the squiggle" src="https://static.simonwillison.net/static/2024/compare-pdfs.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;Being able to knock out little custom interactive web tools like this in a couple of minutes is &lt;em&gt;so much fun&lt;/em&gt;.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/pdf"&gt;pdf&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/tools"&gt;tools&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-programming"&gt;ai-assisted-programming&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/anthropic"&gt;anthropic&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude"&gt;claude&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude-3-5-sonnet"&gt;claude-3-5-sonnet&lt;/a&gt;&lt;/p&gt;



</summary><category term="pdf"/><category term="projects"/><category term="tools"/><category term="llms"/><category term="ai-assisted-programming"/><category term="anthropic"/><category term="claude"/><category term="claude-3-5-sonnet"/></entry><entry><title>PDF to Podcast</title><link href="https://simonwillison.net/2024/Jun/13/pdf-to-podcast/#atom-tag" rel="alternate"/><published>2024-06-13T01:03:56+00:00</published><updated>2024-06-13T01:03:56+00:00</updated><id>https://simonwillison.net/2024/Jun/13/pdf-to-podcast/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://pdf-to-podcast.com/"&gt;PDF to Podcast&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
At first glance this project by Stephan Fitzpatrick is a cute demo of a terrible sounding idea... but then I tried it out and the results are weirdly effective. You can listen to a fake podcast version of the transformers paper, or upload your own PDF (with your own OpenAI API key) to make your own.&lt;/p&gt;
&lt;p&gt;It's open source (Apache 2) so I had a poke around in &lt;a href="https://github.com/knowsuchagency/pdf-to-podcast"&gt;the code&lt;/a&gt;. It gets a lot done with a single &lt;a href="https://github.com/knowsuchagency/pdf-to-podcast/blob/512bfbdb4fd658ad4b301336020c4ea16cb69e18/main.py"&gt;180 line Python script&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;When I'm exploring code like this I always jump straight to &lt;a href="https://github.com/knowsuchagency/pdf-to-podcast/blob/512bfbdb4fd658ad4b301336020c4ea16cb69e18/main.py#L47-L80"&gt;the prompt&lt;/a&gt; - it's quite long, and starts like this:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;Your task is to take the input text provided and turn it into an engaging, informative podcast dialogue. The input text may be messy or unstructured, as it could come from a variety of sources like PDFs or web pages. Don't worry about the formatting issues or any irrelevant information; your goal is to extract the key points and interesting facts that could be discussed in a podcast. [...]&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;So I grabbed a copy of it and pasted in &lt;a href="https://simonwillison.net/2024/Jun/10/apple-intelligence/"&gt;my blog entry about WWDC&lt;/a&gt;, which produced &lt;a href="https://gist.github.com/simonw/edac62f6c11640abe98925cbc17f4ac3#apple-intelligence-a-deep-dive-into-the-future-of-ai"&gt;this result&lt;/a&gt; when I ran it through Gemini Flash using &lt;a href="https://github.com/simonw/llm-gemini"&gt;llm-gemini&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;cat prompt.txt | llm -m gemini-1.5-flash-latest&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;Then I piped the result through my &lt;a href="https://simonwillison.net/2023/Nov/7/ospeak/"&gt;ospeak&lt;/a&gt; CLI tool for running text-to-speech with the OpenAI TTS models (after truncating to 690 tokens with &lt;a href="https://github.com/simonw/ttok"&gt;ttok&lt;/a&gt; because it turned out to be slightly too long for the API to handle):&lt;/p&gt;
&lt;p&gt;&lt;code&gt;llm logs --response | ttok -t 690 | ospeak -s -o wwdc-auto-podcast.mp3&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;And &lt;a href="https://static.simonwillison.net/static/2024/wwdc-auto-podcast.mp3"&gt;here's the result&lt;/a&gt; (3.9MB 3m14s MP3).&lt;/p&gt;
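&lt;p&gt;The ttok step counts real model tokens via tiktoken; as a rough sketch of the same truncation idea, here's a whitespace-word approximation (a hypothetical helper, not part of ttok):&lt;/p&gt;

```python
# Rough stand-in for the ttok step: ttok counts real model tokens via
# tiktoken, while this hypothetical helper just caps whitespace words.
def truncate_words(text: str, max_words: int) -> str:
    """Keep at most max_words whitespace-separated words of text."""
    return " ".join(text.split()[:max_words])

assert truncate_words("one two three four five", 3) == "one two three"
```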
&lt;p&gt;It's not as good as the PDF-to-Podcast version because Stephan has some &lt;a href="https://github.com/knowsuchagency/pdf-to-podcast/blob/512bfbdb4fd658ad4b301336020c4ea16cb69e18/main.py#L115-L126"&gt;really clever code&lt;/a&gt; that uses different TTS voices for each of the characters in the transcript, but it's still a surprisingly fun way of repurposing text from my blog. I enjoyed listening to it while I was cooking dinner.

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=40653417"&gt;Show HN&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/pdf"&gt;pdf&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/podcasts"&gt;podcasts&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/text-to-speech"&gt;text-to-speech&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openai"&gt;openai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/prompt-engineering"&gt;prompt-engineering&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gemini"&gt;gemini&lt;/a&gt;&lt;/p&gt;



</summary><category term="pdf"/><category term="podcasts"/><category term="projects"/><category term="text-to-speech"/><category term="ai"/><category term="openai"/><category term="prompt-engineering"/><category term="generative-ai"/><category term="llms"/><category term="gemini"/></entry><entry><title>Experimenting with local alt text generation in Firefox Nightly</title><link href="https://simonwillison.net/2024/Jun/2/experimenting-with-local-alt-text-generation-in-firefox-nightly/#atom-tag" rel="alternate"/><published>2024-06-02T13:12:44+00:00</published><updated>2024-06-02T13:12:44+00:00</updated><id>https://simonwillison.net/2024/Jun/2/experimenting-with-local-alt-text-generation-in-firefox-nightly/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://hacks.mozilla.org/2024/05/experimenting-with-local-alt-text-generation-in-firefox-nightly/"&gt;Experimenting with local alt text generation in Firefox Nightly&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
The PDF editor in Firefox (confession: I did not know Firefox ships with a PDF editor) is getting an experimental feature that can help suggest alt text for images for the human editor to then adapt and improve on.&lt;/p&gt;
&lt;p&gt;This is a great application of AI, made all the more interesting here because Firefox will run a local model on-device for this, using a custom trained model they describe as "our 182M parameters model using a Distilled version of GPT-2 alongside a Vision Transformer (ViT) image encoder".&lt;/p&gt;
&lt;p&gt;The model uses WebAssembly with ONNX running in &lt;a href="https://huggingface.co/docs/transformers.js/en/index"&gt;Transformers.js&lt;/a&gt;, and will be downloaded the first time the feature is put to use.

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/mozhacks/status/1796774672639336804"&gt;@mozhacks&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/accessibility"&gt;accessibility&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/alt-text"&gt;alt-text&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/firefox"&gt;firefox&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/javascript"&gt;javascript&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/mozilla"&gt;mozilla&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pdf"&gt;pdf&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/webassembly"&gt;webassembly&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/transformers-js"&gt;transformers-js&lt;/a&gt;&lt;/p&gt;



</summary><category term="accessibility"/><category term="alt-text"/><category term="firefox"/><category term="javascript"/><category term="mozilla"/><category term="pdf"/><category term="ai"/><category term="webassembly"/><category term="llms"/><category term="transformers-js"/></entry><entry><title>Running OCR against PDFs and images directly in your browser</title><link href="https://simonwillison.net/2024/Mar/30/ocr-pdfs-images/#atom-tag" rel="alternate"/><published>2024-03-30T17:59:56+00:00</published><updated>2024-03-30T17:59:56+00:00</updated><id>https://simonwillison.net/2024/Mar/30/ocr-pdfs-images/#atom-tag</id><summary type="html">
    &lt;p&gt;I attended the &lt;a href="https://biglocalnews.org/content/events/"&gt;Story Discovery At Scale&lt;/a&gt; data journalism conference at Stanford this week. One of the perennial hot topics at any journalism conference concerns data extraction: how can we best get data out of PDFs and images?&lt;/p&gt;
&lt;p&gt;I've been having some very promising results with Gemini Pro 1.5, Claude 3 and GPT-4 Vision recently - I'll write more about that soon. But those tools are still inconvenient for most people to use.&lt;/p&gt;
&lt;p&gt;Meanwhile, older tools like &lt;a href="https://github.com/tesseract-ocr/tesseract"&gt;Tesseract OCR&lt;/a&gt; are still extremely useful - if only they were easier to use as well.&lt;/p&gt;
&lt;p&gt;Then I remembered that Tesseract runs happily in a browser these days thanks to the excellent &lt;a href="https://tesseract.projectnaptha.com/"&gt;Tesseract.js&lt;/a&gt; project. And PDFs can be processed using JavaScript too thanks to Mozilla's extremely mature and well-tested &lt;a href="https://mozilla.github.io/pdf.js/"&gt;PDF.js&lt;/a&gt; library.&lt;/p&gt;
&lt;p&gt;So I built a new tool!&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&lt;a href="https://tools.simonwillison.net/ocr"&gt;tools.simonwillison.net/ocr&lt;/a&gt;&lt;/strong&gt; provides a single page web app that can run Tesseract OCR against images or PDFs that are opened in (or dragged and dropped onto) the app.&lt;/p&gt;
&lt;p&gt;Crucially, everything runs in the browser. There is no server component here, and nothing is uploaded. Your images and documents never leave your computer or phone.&lt;/p&gt;
&lt;p&gt;Here's an animated demo:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2024/ocr-demo.gif" alt="First an image file is dragged onto the page, which then shows that image and accompanying OCR text. Then the drop zone is clicked and a PDF file is selected - that PDF is rendered a page at a time down the page with OCR text displayed beneath each page." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;It's not perfect: multi-column PDFs (thanks, academia) will be treated as a single column, illustrations or photos may result in garbled ASCII-art and there are plenty of other edge cases that will trip it up.&lt;/p&gt;
&lt;p&gt;But... having Tesseract OCR available against PDFs in a web browser (including in Mobile Safari) is still a really useful thing.&lt;/p&gt;
&lt;h4 id="ocr-how-i-built-this"&gt;How I built this&lt;/h4&gt;
&lt;p&gt;&lt;em&gt;For more recent examples of projects I've built with the assistance of LLMs, see &lt;a href="https://simonwillison.net/2024/Mar/23/building-c-extensions-for-sqlite-with-chatgpt-code-interpreter/"&gt;Building and testing C extensions for SQLite with ChatGPT Code Interpreter&lt;/a&gt; and &lt;a href="https://simonwillison.net/2024/Mar/22/claude-and-chatgpt-case-study/"&gt;Claude and ChatGPT for ad-hoc sidequests&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;I built the first version of this tool in just a few minutes, using Claude 3 Opus.&lt;/p&gt;
&lt;p&gt;I already had my own JavaScript code lying around for the two most important tasks: running Tesseract.js against an image and using PDF.js to turn a PDF into a series of images.&lt;/p&gt;
&lt;p&gt;The OCR code came from the system I built and explained in &lt;a href="https://simonwillison.net/2023/Aug/6/annotated-presentations/"&gt;How I make annotated presentations&lt;/a&gt; (built with the help of &lt;a href="https://simonwillison.net/2023/Aug/6/annotated-presentations/#chatgpt-sessions"&gt;multiple ChatGPT sessions&lt;/a&gt;). The PDF to images code was from an &lt;a href="https://gist.github.com/simonw/e58796324abb0e729b2dcd351f46728a#prompt-2"&gt;unfinished experiment&lt;/a&gt; which I wrote with the aid of Claude 3 Opus a week ago.&lt;/p&gt;
&lt;p&gt;I composed the following prompt for Claude 3, where I pasted in both of my code examples and then added some instructions about what I wanted it to build at the end:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;This code shows how to open a PDF and turn it into an image per page:&lt;/p&gt;
&lt;div class="highlight highlight-text-html-basic"&gt;&lt;pre&gt;&lt;span class="pl-c1"&gt;&amp;lt;!DOCTYPE html&lt;span class="pl-kos"&gt;&amp;gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-kos"&gt;&amp;lt;&lt;/span&gt;&lt;span class="pl-ent"&gt;html&lt;/span&gt;&lt;span class="pl-kos"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="pl-kos"&gt;&amp;lt;&lt;/span&gt;&lt;span class="pl-ent"&gt;head&lt;/span&gt;&lt;span class="pl-kos"&gt;&amp;gt;&lt;/span&gt;
  &lt;span class="pl-kos"&gt;&amp;lt;&lt;/span&gt;&lt;span class="pl-ent"&gt;title&lt;/span&gt;&lt;span class="pl-kos"&gt;&amp;gt;&lt;/span&gt;PDF to Images&lt;span class="pl-kos"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="pl-ent"&gt;title&lt;/span&gt;&lt;span class="pl-kos"&gt;&amp;gt;&lt;/span&gt;
  &lt;span class="pl-kos"&gt;&amp;lt;&lt;/span&gt;&lt;span class="pl-ent"&gt;script&lt;/span&gt; &lt;span class="pl-c1"&gt;src&lt;/span&gt;="&lt;span class="pl-s"&gt;https://cdnjs.cloudflare.com/ajax/libs/pdf.js/2.9.359/pdf.min.js&lt;/span&gt;"&lt;span class="pl-kos"&gt;&amp;gt;&lt;/span&gt;&lt;span class="pl-kos"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="pl-ent"&gt;script&lt;/span&gt;&lt;span class="pl-kos"&gt;&amp;gt;&lt;/span&gt;
  &lt;span class="pl-kos"&gt;&amp;lt;&lt;/span&gt;&lt;span class="pl-ent"&gt;style&lt;/span&gt;&lt;span class="pl-kos"&gt;&amp;gt;&lt;/span&gt;
    .image-container img {
      margin-bottom: 10px;
    }
    .image-container p {
      margin: 0;
      font-size: 14px;
      color: #888;
    }
  &lt;span class="pl-kos"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="pl-ent"&gt;style&lt;/span&gt;&lt;span class="pl-kos"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="pl-kos"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="pl-ent"&gt;head&lt;/span&gt;&lt;span class="pl-kos"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="pl-kos"&gt;&amp;lt;&lt;/span&gt;&lt;span class="pl-ent"&gt;body&lt;/span&gt;&lt;span class="pl-kos"&gt;&amp;gt;&lt;/span&gt;
  &lt;span class="pl-kos"&gt;&amp;lt;&lt;/span&gt;&lt;span class="pl-ent"&gt;input&lt;/span&gt; &lt;span class="pl-c1"&gt;type&lt;/span&gt;="&lt;span class="pl-s"&gt;file&lt;/span&gt;" &lt;span class="pl-c1"&gt;id&lt;/span&gt;="&lt;span class="pl-s"&gt;fileInput&lt;/span&gt;" &lt;span class="pl-c1"&gt;accept&lt;/span&gt;="&lt;span class="pl-s"&gt;.pdf&lt;/span&gt;" /&amp;gt;
  &lt;span class="pl-kos"&gt;&amp;lt;&lt;/span&gt;&lt;span class="pl-ent"&gt;div&lt;/span&gt; &lt;span class="pl-c1"&gt;class&lt;/span&gt;="&lt;span class="pl-s"&gt;image-container&lt;/span&gt;"&lt;span class="pl-kos"&gt;&amp;gt;&lt;/span&gt;&lt;span class="pl-kos"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="pl-ent"&gt;div&lt;/span&gt;&lt;span class="pl-kos"&gt;&amp;gt;&lt;/span&gt;

  &lt;span class="pl-kos"&gt;&amp;lt;&lt;/span&gt;&lt;span class="pl-ent"&gt;script&lt;/span&gt;&lt;span class="pl-kos"&gt;&amp;gt;&lt;/span&gt;
  &lt;span class="pl-k"&gt;const&lt;/span&gt; &lt;span class="pl-s1"&gt;desiredWidth&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-c1"&gt;800&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
    &lt;span class="pl-k"&gt;const&lt;/span&gt; &lt;span class="pl-s1"&gt;fileInput&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-smi"&gt;document&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;getElementById&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s"&gt;'fileInput'&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
    &lt;span class="pl-k"&gt;const&lt;/span&gt; &lt;span class="pl-s1"&gt;imageContainer&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-smi"&gt;document&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;querySelector&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s"&gt;'.image-container'&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;

    &lt;span class="pl-s1"&gt;fileInput&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;addEventListener&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s"&gt;'change'&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-s1"&gt;handleFileUpload&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;

    &lt;span class="pl-s1"&gt;pdfjsLib&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;GlobalWorkerOptions&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;workerSrc&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s"&gt;'https://cdnjs.cloudflare.com/ajax/libs/pdf.js/2.9.359/pdf.worker.min.js'&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;

    &lt;span class="pl-k"&gt;async&lt;/span&gt; &lt;span class="pl-k"&gt;function&lt;/span&gt; &lt;span class="pl-en"&gt;handleFileUpload&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s1"&gt;event&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt; &lt;span class="pl-kos"&gt;{&lt;/span&gt;
      &lt;span class="pl-k"&gt;const&lt;/span&gt; &lt;span class="pl-s1"&gt;file&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;event&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;target&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;files&lt;/span&gt;&lt;span class="pl-kos"&gt;[&lt;/span&gt;&lt;span class="pl-c1"&gt;0&lt;/span&gt;&lt;span class="pl-kos"&gt;]&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
      &lt;span class="pl-k"&gt;const&lt;/span&gt; &lt;span class="pl-s1"&gt;imageIterator&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-en"&gt;convertPDFToImages&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s1"&gt;file&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;

      &lt;span class="pl-k"&gt;for&lt;/span&gt; &lt;span class="pl-k"&gt;await&lt;/span&gt; &lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-k"&gt;const&lt;/span&gt; &lt;span class="pl-kos"&gt;{&lt;/span&gt; imageURL&lt;span class="pl-kos"&gt;,&lt;/span&gt; size &lt;span class="pl-kos"&gt;}&lt;/span&gt; &lt;span class="pl-k"&gt;of&lt;/span&gt; &lt;span class="pl-s1"&gt;imageIterator&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt; &lt;span class="pl-kos"&gt;{&lt;/span&gt;
        &lt;span class="pl-k"&gt;const&lt;/span&gt; &lt;span class="pl-s1"&gt;imgElement&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-smi"&gt;document&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;createElement&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s"&gt;'img'&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
        &lt;span class="pl-s1"&gt;imgElement&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;src&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;imageURL&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
        &lt;span class="pl-s1"&gt;imageContainer&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;appendChild&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s1"&gt;imgElement&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;

        &lt;span class="pl-k"&gt;const&lt;/span&gt; &lt;span class="pl-s1"&gt;sizeElement&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-smi"&gt;document&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;createElement&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s"&gt;'p'&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
        &lt;span class="pl-s1"&gt;sizeElement&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;textContent&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s"&gt;`Size: &lt;span class="pl-s1"&gt;&lt;span class="pl-kos"&gt;${&lt;/span&gt;&lt;span class="pl-en"&gt;formatSize&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s1"&gt;size&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;/span&gt;`&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
        &lt;span class="pl-s1"&gt;imageContainer&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;appendChild&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s1"&gt;sizeElement&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
      &lt;span class="pl-kos"&gt;}&lt;/span&gt;
    &lt;span class="pl-kos"&gt;}&lt;/span&gt;

    &lt;span class="pl-k"&gt;async&lt;/span&gt; &lt;span class="pl-k"&gt;function&lt;/span&gt;&lt;span class="pl-c1"&gt;*&lt;/span&gt; &lt;span class="pl-s1"&gt;convertPDFToImages&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s1"&gt;file&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt; &lt;span class="pl-kos"&gt;{&lt;/span&gt;
      &lt;span class="pl-k"&gt;try&lt;/span&gt; &lt;span class="pl-kos"&gt;{&lt;/span&gt;
        &lt;span class="pl-k"&gt;const&lt;/span&gt; &lt;span class="pl-s1"&gt;pdf&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-k"&gt;await&lt;/span&gt; &lt;span class="pl-s1"&gt;pdfjsLib&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;getDocument&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-c1"&gt;URL&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;createObjectURL&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s1"&gt;file&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;promise&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
        &lt;span class="pl-k"&gt;const&lt;/span&gt; &lt;span class="pl-s1"&gt;numPages&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;pdf&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;numPages&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;

        &lt;span class="pl-k"&gt;for&lt;/span&gt; &lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-k"&gt;let&lt;/span&gt; &lt;span class="pl-s1"&gt;i&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-c1"&gt;1&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt; &lt;span class="pl-s1"&gt;i&lt;/span&gt; &lt;span class="pl-c1"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;numPages&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt; &lt;span class="pl-s1"&gt;i&lt;/span&gt;&lt;span class="pl-c1"&gt;++&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt; &lt;span class="pl-kos"&gt;{&lt;/span&gt;
          &lt;span class="pl-k"&gt;const&lt;/span&gt; &lt;span class="pl-s1"&gt;page&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-k"&gt;await&lt;/span&gt; &lt;span class="pl-s1"&gt;pdf&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;getPage&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s1"&gt;i&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
          &lt;span class="pl-k"&gt;const&lt;/span&gt; &lt;span class="pl-s1"&gt;viewport&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;page&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;getViewport&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-kos"&gt;{&lt;/span&gt; &lt;span class="pl-c1"&gt;scale&lt;/span&gt;: &lt;span class="pl-c1"&gt;1&lt;/span&gt; &lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
          &lt;span class="pl-k"&gt;const&lt;/span&gt; &lt;span class="pl-s1"&gt;canvas&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-smi"&gt;document&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;createElement&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s"&gt;'canvas'&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
          &lt;span class="pl-k"&gt;const&lt;/span&gt; &lt;span class="pl-s1"&gt;context&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;canvas&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;getContext&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s"&gt;'2d'&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
          &lt;span class="pl-s1"&gt;canvas&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;width&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;desiredWidth&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
          &lt;span class="pl-s1"&gt;canvas&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;height&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s1"&gt;desiredWidth&lt;/span&gt; &lt;span class="pl-c1"&gt;/&lt;/span&gt; &lt;span class="pl-s1"&gt;viewport&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;width&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt; &lt;span class="pl-c1"&gt;*&lt;/span&gt; &lt;span class="pl-s1"&gt;viewport&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;height&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
          &lt;span class="pl-k"&gt;const&lt;/span&gt; &lt;span class="pl-s1"&gt;renderContext&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-kos"&gt;{&lt;/span&gt;
            &lt;span class="pl-c1"&gt;canvasContext&lt;/span&gt;: &lt;span class="pl-s1"&gt;context&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt;
            &lt;span class="pl-c1"&gt;viewport&lt;/span&gt;: &lt;span class="pl-s1"&gt;page&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;getViewport&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-kos"&gt;{&lt;/span&gt; &lt;span class="pl-c1"&gt;scale&lt;/span&gt;: &lt;span class="pl-s1"&gt;desiredWidth&lt;/span&gt; &lt;span class="pl-c1"&gt;/&lt;/span&gt; &lt;span class="pl-s1"&gt;viewport&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;width&lt;/span&gt; &lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt;
          &lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
          &lt;span class="pl-k"&gt;await&lt;/span&gt; &lt;span class="pl-s1"&gt;page&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;render&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s1"&gt;renderContext&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;promise&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
          &lt;span class="pl-k"&gt;const&lt;/span&gt; &lt;span class="pl-s1"&gt;imageURL&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;canvas&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;toDataURL&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s"&gt;'image/jpeg'&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-c1"&gt;0.8&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
          &lt;span class="pl-k"&gt;const&lt;/span&gt; &lt;span class="pl-s1"&gt;size&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-en"&gt;calculateSize&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s1"&gt;imageURL&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
          &lt;span class="pl-k"&gt;yield&lt;/span&gt; &lt;span class="pl-kos"&gt;{&lt;/span&gt; imageURL&lt;span class="pl-kos"&gt;,&lt;/span&gt; size &lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
        &lt;span class="pl-kos"&gt;}&lt;/span&gt;
      &lt;span class="pl-kos"&gt;}&lt;/span&gt; &lt;span class="pl-k"&gt;catch&lt;/span&gt; &lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s1"&gt;error&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt; &lt;span class="pl-kos"&gt;{&lt;/span&gt;
        &lt;span class="pl-smi"&gt;console&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;error&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s"&gt;'Error:'&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-s1"&gt;error&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
      &lt;span class="pl-kos"&gt;}&lt;/span&gt;
    &lt;span class="pl-kos"&gt;}&lt;/span&gt;

    &lt;span class="pl-k"&gt;function&lt;/span&gt; &lt;span class="pl-en"&gt;calculateSize&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s1"&gt;imageURL&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt; &lt;span class="pl-kos"&gt;{&lt;/span&gt;
      &lt;span class="pl-k"&gt;const&lt;/span&gt; &lt;span class="pl-s1"&gt;base64Length&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;imageURL&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;length&lt;/span&gt; &lt;span class="pl-c1"&gt;-&lt;/span&gt; &lt;span class="pl-s"&gt;'data:image/jpeg;base64,'&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;length&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
      &lt;span class="pl-k"&gt;const&lt;/span&gt; &lt;span class="pl-s1"&gt;sizeInBytes&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-v"&gt;Math&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;ceil&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s1"&gt;base64Length&lt;/span&gt; &lt;span class="pl-c1"&gt;*&lt;/span&gt; &lt;span class="pl-c1"&gt;0.75&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
      &lt;span class="pl-k"&gt;return&lt;/span&gt; &lt;span class="pl-s1"&gt;sizeInBytes&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
    &lt;span class="pl-kos"&gt;}&lt;/span&gt;

    &lt;span class="pl-k"&gt;function&lt;/span&gt; &lt;span class="pl-en"&gt;formatSize&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s1"&gt;size&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt; &lt;span class="pl-kos"&gt;{&lt;/span&gt;
      &lt;span class="pl-k"&gt;const&lt;/span&gt; &lt;span class="pl-s1"&gt;sizeInKB&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s1"&gt;size&lt;/span&gt; &lt;span class="pl-c1"&gt;/&lt;/span&gt; &lt;span class="pl-c1"&gt;1024&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;toFixed&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-c1"&gt;2&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
      &lt;span class="pl-k"&gt;return&lt;/span&gt; &lt;span class="pl-s"&gt;`&lt;span class="pl-s1"&gt;&lt;span class="pl-kos"&gt;${&lt;/span&gt;&lt;span class="pl-s1"&gt;sizeInKB&lt;/span&gt;&lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;/span&gt; KB`&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
    &lt;span class="pl-kos"&gt;}&lt;/span&gt;
  &lt;span class="pl-kos"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="pl-ent"&gt;script&lt;/span&gt;&lt;span class="pl-kos"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="pl-kos"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="pl-ent"&gt;body&lt;/span&gt;&lt;span class="pl-kos"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="pl-kos"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="pl-ent"&gt;html&lt;/span&gt;&lt;span class="pl-kos"&gt;&amp;gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This code shows how to OCR an image:&lt;/p&gt;
&lt;div class="highlight highlight-source-js"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;async&lt;/span&gt; &lt;span class="pl-k"&gt;function&lt;/span&gt; &lt;span class="pl-en"&gt;ocrMissingAltText&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt; &lt;span class="pl-kos"&gt;{&lt;/span&gt;
    &lt;span class="pl-c"&gt;// Load Tesseract&lt;/span&gt;
    &lt;span class="pl-k"&gt;var&lt;/span&gt; &lt;span class="pl-s1"&gt;s&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-smi"&gt;document&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;createElement&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s"&gt;"script"&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
    &lt;span class="pl-s1"&gt;s&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;src&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s"&gt;"https://unpkg.com/tesseract.js@v2.1.0/dist/tesseract.min.js"&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
    &lt;span class="pl-smi"&gt;document&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;head&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;appendChild&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s1"&gt;s&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;

    &lt;span class="pl-s1"&gt;s&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;onload&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-k"&gt;async&lt;/span&gt; &lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt; &lt;span class="pl-c1"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="pl-kos"&gt;{&lt;/span&gt;
      &lt;span class="pl-k"&gt;const&lt;/span&gt; &lt;span class="pl-s1"&gt;images&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-smi"&gt;document&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;getElementsByTagName&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s"&gt;"img"&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
      &lt;span class="pl-k"&gt;const&lt;/span&gt; &lt;span class="pl-s1"&gt;worker&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-v"&gt;Tesseract&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;createWorker&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
      &lt;span class="pl-k"&gt;await&lt;/span&gt; &lt;span class="pl-s1"&gt;worker&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;load&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
      &lt;span class="pl-k"&gt;await&lt;/span&gt; &lt;span class="pl-s1"&gt;worker&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;loadLanguage&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s"&gt;"eng"&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
      &lt;span class="pl-k"&gt;await&lt;/span&gt; &lt;span class="pl-s1"&gt;worker&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;initialize&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s"&gt;"eng"&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
      &lt;span class="pl-s1"&gt;ocrButton&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;innerText&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s"&gt;"Running OCR..."&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;

      &lt;span class="pl-c"&gt;// Iterate through all the images in the output div&lt;/span&gt;
      &lt;span class="pl-k"&gt;for&lt;/span&gt; &lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-k"&gt;const&lt;/span&gt; &lt;span class="pl-s1"&gt;img&lt;/span&gt; &lt;span class="pl-k"&gt;of&lt;/span&gt; &lt;span class="pl-s1"&gt;images&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt; &lt;span class="pl-kos"&gt;{&lt;/span&gt;
        &lt;span class="pl-k"&gt;const&lt;/span&gt; &lt;span class="pl-s1"&gt;altTextarea&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;img&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;parentNode&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;querySelector&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s"&gt;".textarea-alt"&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
        &lt;span class="pl-c"&gt;// Check if the alt textarea is empty&lt;/span&gt;
        &lt;span class="pl-k"&gt;if&lt;/span&gt; &lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s1"&gt;altTextarea&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;value&lt;/span&gt; &lt;span class="pl-c1"&gt;===&lt;/span&gt; &lt;span class="pl-s"&gt;""&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt; &lt;span class="pl-kos"&gt;{&lt;/span&gt;
          &lt;span class="pl-k"&gt;const&lt;/span&gt; &lt;span class="pl-s1"&gt;imageUrl&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;img&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;src&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
          &lt;span class="pl-k"&gt;var&lt;/span&gt; &lt;span class="pl-kos"&gt;{&lt;/span&gt;
            &lt;span class="pl-c1"&gt;data&lt;/span&gt;: &lt;span class="pl-kos"&gt;{&lt;/span&gt; text &lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt;
          &lt;span class="pl-kos"&gt;}&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-k"&gt;await&lt;/span&gt; &lt;span class="pl-s1"&gt;worker&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;recognize&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s1"&gt;imageUrl&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
          &lt;span class="pl-s1"&gt;altTextarea&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;value&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;text&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt; &lt;span class="pl-c"&gt;// Set the OCR result to the alt textarea&lt;/span&gt;
          &lt;span class="pl-s1"&gt;progressBar&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;value&lt;/span&gt; &lt;span class="pl-c1"&gt;+=&lt;/span&gt; &lt;span class="pl-c1"&gt;1&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
        &lt;span class="pl-kos"&gt;}&lt;/span&gt;
      &lt;span class="pl-kos"&gt;}&lt;/span&gt;

      &lt;span class="pl-k"&gt;await&lt;/span&gt; &lt;span class="pl-s1"&gt;worker&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;terminate&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
      &lt;span class="pl-s1"&gt;ocrButton&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;innerText&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s"&gt;"OCR complete"&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
    &lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
  &lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Use these examples to put together a single HTML page with embedded HTML and CSS and JavaScript that provides a big square which users can drag and drop a PDF file onto and when they do that the PDF has every page converted to a JPEG and shown below on the page, then OCR is run with tesseract and the results are shown in textarea blocks below each image.&lt;/p&gt;
&lt;/blockquote&gt;
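&lt;p&gt;(An aside on the &lt;code&gt;calculateSize&lt;/code&gt; helper in that prompt: base64 encodes every 3 bytes of binary data as 4 text characters, so multiplying the base64 payload length by 0.75 approximates the decoded byte count. Here's a standalone sketch of that arithmetic, runnable outside the page with a made-up data URL:)&lt;/p&gt;

```javascript
// Estimate the decoded byte size of a JPEG data URL.
// Base64 encodes every 3 bytes of input as 4 output characters,
// so the decoded size is roughly 3/4 of the base64 payload length.
function calculateSize(imageURL) {
  const base64Length = imageURL.length - 'data:image/jpeg;base64,'.length;
  return Math.ceil(base64Length * 0.75);
}

function formatSize(size) {
  return `${(size / 1024).toFixed(2)} KB`;
}

// Example with a tiny made-up payload (not a real image):
const fakeDataURL = 'data:image/jpeg;base64,' + 'A'.repeat(4000);
console.log(calculateSize(fakeDataURL)); // 3000
console.log(formatSize(calculateSize(fakeDataURL))); // "2.93 KB"
```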
&lt;p&gt;I saved this prompt to a &lt;code&gt;prompt.txt&lt;/code&gt; file and ran it using my &lt;a href="https://github.com/simonw/llm-claude-3"&gt;llm-claude-3&lt;/a&gt; plugin for &lt;a href="https://llm.datasette.io/"&gt;LLM&lt;/a&gt;:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm -m claude-3-opus &lt;span class="pl-k"&gt;&amp;lt;&lt;/span&gt; prompt.txt&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;It gave me &lt;a href="https://static.simonwillison.net/static/2024/pdf-ocr-v1.html"&gt;a working initial version&lt;/a&gt; on the first attempt!&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2024/ocr-v1.jpg" alt="A square dotted border around the text Drag and drop PDF file here" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href="https://gist.github.com/simonw/6a9f077bf8db616e44893a24ae1d36eb"&gt;Here's the full transcript&lt;/a&gt;, including my follow-up prompts and their responses. Iterating on software in this way is &lt;em&gt;so&lt;/em&gt; much fun.&lt;/p&gt;
&lt;p&gt;First follow-up:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Modify this to also have a file input that can be used - dropping a file onto the drop area fills that input&lt;/p&gt;
&lt;p&gt;make the drop zone 100% wide but have a 2em padding on the body. it should be 10em high. it should turn pink when an image is dragged over it.&lt;/p&gt;
&lt;p&gt;Each textarea should be 100% wide and 10em high&lt;/p&gt;
&lt;p&gt;At the very bottom of the page add a h2 that says Full document - then a 30em high textarea with all of the page text in it separated by two newlines&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;a href="https://static.simonwillison.net/static/2024/pdf-ocr-v2.html"&gt;Here's the interactive result&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2024/ocr-v2.jpg" alt="A PDF file is dragged over the box and it turned pink. The heading Full document displays below" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;Rather delightfully it used the neater pattern where the file input itself is hidden but can be triggered by clicking on the large drop zone, and it updated the copy on the drop zone to reflect that - without me suggesting those requirements.&lt;/p&gt;
&lt;p&gt;And then:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;get rid of the code that shows image sizes. Set the placeholder on each textarea to be Processing... and clear that placeholder when the job is done.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;a href="https://static.simonwillison.net/static/2024/pdf-ocr-v3.html"&gt;Which gave me this&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I realized it would be useful if it could handle non-PDF images as well. So I fired up ChatGPT (for no reason other than curiosity to see how well it did) and got GPT-4 to add that feature for me. I &lt;a href="https://chat.openai.com/share/665eca31-3b5d-4cd9-a3cb-85ab608169a6"&gt;pasted in the code so far and added&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Modify this so jpg and png and gif images can be dropped or opened too - they skip the PDF step and get appended to the page and OCRd directly. Also move the full document heading and textarea above the page preview and hide it until there is data to be shown in it&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Then I spotted that the Tesseract worker was being created multiple times in a loop, which is inefficient - so I prompted:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Create the worker once and use it for all OCR tasks and terminate it at the end&lt;/p&gt;
&lt;/blockquote&gt;
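&lt;p&gt;The resulting pattern looks roughly like this - a sketch of the shape against a generic &lt;code&gt;createWorker&lt;/code&gt;/&lt;code&gt;recognize&lt;/code&gt;/&lt;code&gt;terminate&lt;/code&gt; interface (which tesseract.js workers provide), not the exact code GPT-4 generated:&lt;/p&gt;

```javascript
// Reuse a single OCR worker across many images instead of creating
// (and loading language data for) a fresh one inside the loop.
// `createWorker` is any async factory returning an object with
// recognize(url) and terminate() - tesseract.js workers have this shape.
async function ocrAll(createWorker, imageUrls) {
  const worker = await createWorker(); // created exactly once, up front
  const results = [];
  try {
    for (const url of imageUrls) {
      const { data: { text } } = await worker.recognize(url);
      results.push(text);
    }
  } finally {
    await worker.terminate(); // terminated exactly once, at the end
  }
  return results;
}
```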
&lt;p&gt;I'd tweaked the HTML and CSS a little before feeding it to GPT-4, so now the site had a title and rendered in Helvetica.&lt;/p&gt;
&lt;p&gt;Here's &lt;a href="https://static.simonwillison.net/static/2024/pdf-ocr-v4.html"&gt;the version GPT-4 produced for me&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2024/ocr-v4.jpg" alt="A heading reads OCR a PDF or Image - This tool runs entirely in your browser. No files are uploaded to a server. The dotted box now contains text that reads Drag and drop a PDF, JPG, PNG, or GIF file here or click to select a file" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;h4 id="ocr-finishing-touches"&gt;Manual finishing touches&lt;/h4&gt;
&lt;p&gt;Fun though it was iterating on this project entirely through prompting, I decided it would be more productive to make the finishing touches myself. You can see those &lt;a href="https://github.com/simonw/tools/commits/cc609194a0d0a54c2ae676dae962e14b3e3a9d22/"&gt;in the commit history&lt;/a&gt;. They're not particularly interesting:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;I added &lt;a href="https://plausible.io/"&gt;Plausible&lt;/a&gt; analytics (which I like because they use no cookies).&lt;/li&gt;
&lt;li&gt;I added better progress indicators, including the text that shows how many pages of the PDF have been processed so far.&lt;/li&gt;
&lt;li&gt;I bumped up the width of the rendered PDF page images from 800 to 1000. This seemed to improve OCR quality - in particular, the &lt;a href="https://www-cdn.anthropic.com/de8ba9b01c9ab7cbabf5c33b80b7bbc618857627/Model_Card_Claude_3.pdf"&gt;Claude 3 model card PDF&lt;/a&gt; now has fewer OCR errors than it did before.&lt;/li&gt;
&lt;li&gt;I upgraded both Tesseract.js and PDF.js to the most recent versions. Unsurprisingly, Claude 3 Opus had used older versions of both libraries.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I'm really pleased with this project. I consider it &lt;em&gt;finished&lt;/em&gt; - it does the job I designed it to do and I don't see any need to keep on iterating on it. And because it's all static JavaScript and WebAssembly I expect it to continue working effectively forever.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Update:&lt;/strong&gt; OK, a few more features: I added &lt;a href="https://github.com/simonw/tools/issues/4"&gt;language selection&lt;/a&gt;, &lt;a href="https://github.com/simonw/tools/issues/7"&gt;paste support&lt;/a&gt; and some &lt;a href="https://github.com/simonw/tools/issues/8"&gt;basic automated tests&lt;/a&gt; using Playwright Python.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/data-journalism"&gt;data-journalism&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ocr"&gt;ocr&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pdf"&gt;pdf&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/tesseract"&gt;tesseract&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-programming"&gt;ai-assisted-programming&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="data-journalism"/><category term="ocr"/><category term="pdf"/><category term="projects"/><category term="tesseract"/><category term="ai-assisted-programming"/></entry><entry><title>unstructured</title><link href="https://simonwillison.net/2024/Feb/2/unstructured/#atom-tag" rel="alternate"/><published>2024-02-02T02:47:15+00:00</published><updated>2024-02-02T02:47:15+00:00</updated><id>https://simonwillison.net/2024/Feb/2/unstructured/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/Unstructured-IO/unstructured"&gt;unstructured&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Relatively new but impressively capable Python library (Apache 2 licensed) for extracting information from unstructured documents, such as PDFs, images, Word documents and many other formats.&lt;/p&gt;

&lt;p&gt;I got some good initial results against a PDF by running &lt;code&gt;pip install 'unstructured[pdf]'&lt;/code&gt; and then using the &lt;code&gt;unstructured.partition.pdf.partition_pdf(filename)&lt;/code&gt; function.&lt;/p&gt;

&lt;p&gt;There are a lot of moving parts under the hood: pytesseract, OpenCV, various PDF libraries, even an ONNX model—but it installed cleanly for me on macOS and worked out of the box.&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ocr"&gt;ocr&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pdf"&gt;pdf&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;&lt;/p&gt;



</summary><category term="ocr"/><category term="pdf"/><category term="python"/></entry><entry><title>Portable EPUBs</title><link href="https://simonwillison.net/2024/Jan/25/portable-epubs/#atom-tag" rel="alternate"/><published>2024-01-25T20:32:38+00:00</published><updated>2024-01-25T20:32:38+00:00</updated><id>https://simonwillison.net/2024/Jan/25/portable-epubs/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://willcrichton.net/notes/portable-epubs/"&gt;Portable EPUBs&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Will Crichton digs into the reasons people still prefer PDF over HTML as a format for sharing digital documents, concluding that the key issues are that HTML documents are not fully self-contained and may not be rendered consistently.&lt;/p&gt;

&lt;p&gt;He proposes “Portable EPUBs” as the solution, defining a subset of the existing EPUB standard with some additional restrictions around avoiding loading extra assets over a network, sticking to a smaller (as-yet undefined) subset of HTML and encouraging interactive components to be built using self-contained Web Components.&lt;/p&gt;

&lt;p&gt;Will also built his own lightweight EPUB reading system, called Bene—which is used to render this Portable EPUBs article. It provides a “download” link in the top right which produces the .epub file itself.&lt;/p&gt;

&lt;p&gt;There’s a lot to like here. I’m constantly infuriated at the number of documents out there that are PDFs but really should be web pages (academic papers are a particularly bad example here), so I’m very excited by any initiatives that might help push things in the other direction.&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/html"&gt;html&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pdf"&gt;pdf&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/web-components"&gt;web-components&lt;/a&gt;&lt;/p&gt;



</summary><category term="html"/><category term="pdf"/><category term="web-components"/></entry><entry><title>textra</title><link href="https://simonwillison.net/2023/Mar/23/textra/#atom-tag" rel="alternate"/><published>2023-03-23T21:08:51+00:00</published><updated>2023-03-23T21:08:51+00:00</updated><id>https://simonwillison.net/2023/Mar/23/textra/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/freedmand/textra"&gt;textra&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Tiny (432KB) macOS binary CLI tool by Dylan Freedman which produces high quality text extraction from PDFs, images and even audio files using the VisionKit APIs in macOS 13 and higher. It handles handwriting too!&lt;/p&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/agarcia_me/status/1639001713461071873"&gt;Alex Garcia&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/audio"&gt;audio&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/cli"&gt;cli&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/macos"&gt;macos&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ocr"&gt;ocr&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pdf"&gt;pdf&lt;/a&gt;&lt;/p&gt;



</summary><category term="audio"/><category term="cli"/><category term="macos"/><category term="ocr"/><category term="pdf"/></entry><entry><title>Building a searchable archive for the San Francisco Microscopical Society</title><link href="https://simonwillison.net/2022/Aug/25/sfms-archive/#atom-tag" rel="alternate"/><published>2022-08-25T17:24:50+00:00</published><updated>2022-08-25T17:24:50+00:00</updated><id>https://simonwillison.net/2022/Aug/25/sfms-archive/#atom-tag</id><summary type="html">
    &lt;p&gt;The &lt;a href="https://sfmicrosociety.org/"&gt;San Francisco Microscopical Society&lt;/a&gt; was &lt;a href="https://sfmicrosociety.org/history"&gt;founded&lt;/a&gt; in 1870 by a group of scientists dedicated to advancing the field of microscopy.&lt;/p&gt;
&lt;p&gt;Today the society is a non-profit run entirely by volunteers (they welcome &lt;a href="https://sfmicrosociety.org/join"&gt;new members&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;This year they've been working with the UC Berkeley Bancroft Library to &lt;a href="https://mailchi.mp/sfmicrosociety/spring-2022#FROM%20THE%20ARCHIVE"&gt;digitize their archives&lt;/a&gt;, which reach back to the founding of the organization.&lt;/p&gt;
&lt;p&gt;Hep Svadja, Jenny Jacox and Ariel Waldman have taken the lead on this effort, resulting in a wealth of scanned PDF files covering 150 years of microscope history.&lt;/p&gt;
&lt;p&gt;I've been helping out with the project running OCR against the archives (using Amazon Textract via my &lt;a href="https://simonwillison.net/2022/Jun/30/s3-ocr/"&gt;s3-ocr tool&lt;/a&gt;, which I built for this project) and turning them into a searchable website using &lt;a href="https://datasette.io/"&gt;Datasette&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The result is now live! Take a look at &lt;strong&gt;&lt;a href="https://archive.sfmicrosociety.org/"&gt;archive.sfmicrosociety.org&lt;/a&gt;&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2022/sfms-homepage.jpg" alt="The website homepage. The title reads San Francisco Microscopical Society Archive. The page has a search box, a link to browse all documents and a heading titled Some random pages which shows a mixture of six different scanned pages, some typewritten and some with handwriting." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;Try running a search for &lt;a href="https://archive.sfmicrosociety.org/?q=newton"&gt;newton&lt;/a&gt; to see the search feature in action.&lt;/p&gt;
&lt;p&gt;You can also &lt;a href="https://archive.sfmicrosociety.org/folders/PUBLIC/Section+5+Supporting+Documents+for+SFHS+Application"&gt;browse by folder structure&lt;/a&gt; or by &lt;a href="https://archive.sfmicrosociety.org/docs/12abb3e5"&gt;individual document&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;A neat thing about using AWS Textract is that it is significantly better at reading 19th century handwritten documents than I am.&lt;/p&gt;
&lt;p&gt;Each document page is shown with the OCRd text alongside the original scan, so you can see that for yourself on &lt;a href="https://archive.sfmicrosociety.org/docs/402f6021/15"&gt;this page&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2022/sfms-page.jpg" alt="A page titled Misc Notes and Business Cards 1885.pdf pafe 15. The scanned image on the right shows some beautiful but very hard to read handwritten notes. The OCR text on the left looks to me like it's pretty accurate." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;h4&gt;How it works&lt;/h4&gt;
&lt;p&gt;The website itself is a custom Datasette instance hosted on &lt;a href="https://vercel.com/"&gt;Vercel&lt;/a&gt; - deployed using the &lt;a href="https://datasette.io/plugins/datasette-publish-vercel"&gt;datasette-publish-vercel plugin&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;It's an implementation of the &lt;a href="https://simonwillison.net/2021/Jul/28/baked-data/"&gt;Baked Data&lt;/a&gt; architectural pattern, where the database itself is packaged up as a binary SQLite file and included in the application that is deployed to Vercel.&lt;/p&gt;
&lt;p&gt;You can directly browse and query the underlying database (a 12.4MB SQLite file) using the default Datasette interface at &lt;a href="https://archive.sfmicrosociety.org/sfms"&gt;archive.sfmicrosociety.org/sfms&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The database schema (excluding the tables that enable full-text search) looks like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-sql"&gt;&lt;pre&gt;CREATE TABLE [documents] (
   [id] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt; &lt;span class="pl-k"&gt;PRIMARY KEY&lt;/span&gt;,
   [title] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt;,
   [&lt;span class="pl-k"&gt;path&lt;/span&gt;] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt;,
   [etag] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt;
);

CREATE TABLE [pages] (
   [id] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt; &lt;span class="pl-k"&gt;PRIMARY KEY&lt;/span&gt;,
   [document_id] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt;,
   [page] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt;,
   [&lt;span class="pl-k"&gt;text&lt;/span&gt;] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt;,
   &lt;span class="pl-k"&gt;FOREIGN KEY&lt;/span&gt;([document_id]) &lt;span class="pl-k"&gt;REFERENCES&lt;/span&gt; [documents]([id])
);&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;As you can see, the site consists of documents and pages. Documents have a &lt;code&gt;path&lt;/code&gt; (the directory structure), a title (the filename) and an ID. Pages belong to documents and have a &lt;code&gt;text&lt;/code&gt; column and a page number. It's a pretty simple schema!&lt;/p&gt;
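&lt;p&gt;That schema maps directly onto a few lines of Python's built-in &lt;code&gt;sqlite3&lt;/code&gt; module. Here's a sketch that recreates the two tables in memory, adds an illustrative FTS5 index and a made-up sample row (the sample data and the &lt;code&gt;pages_fts&lt;/code&gt; layout are my own invention for this sketch - the real site configures search via sqlite-utils), and runs the kind of join a search query performs:&lt;/p&gt;

```python
import sqlite3

# An in-memory version of the documents/pages schema described above.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE documents (id TEXT PRIMARY KEY, title TEXT, path TEXT, etag TEXT);
CREATE TABLE pages (
    id TEXT PRIMARY KEY,
    document_id TEXT,
    page INTEGER,
    text TEXT,
    FOREIGN KEY(document_id) REFERENCES documents(id)
);
CREATE VIRTUAL TABLE pages_fts USING fts5(page_id, text);
""")
# Hypothetical sample row, not real archive data:
db.execute("INSERT INTO documents VALUES ('d1', 'Minutes 1885.pdf', 'PUBLIC/Minutes 1885.pdf', 'abc123')")
db.execute("INSERT INTO pages VALUES ('d1-1', 'd1', 1, 'A letter from Mr Newton about lenses')")
db.execute("INSERT INTO pages_fts SELECT id, text FROM pages")

# Roughly what a ?q=newton search does: FTS hits joined back to documents.
rows = db.execute("""
    SELECT documents.title, pages.page
    FROM pages_fts
    JOIN pages ON pages.id = pages_fts.page_id
    JOIN documents ON documents.id = pages.document_id
    WHERE pages_fts MATCH 'newton'
""").fetchall()
print(rows)
```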
&lt;p&gt;The public pages on the site are all built using custom templates and views. You can browse the source code in the &lt;a href="https://github.com/simonw/sfms-history"&gt;simonw/sfms-history&lt;/a&gt; GitHub repository.&lt;/p&gt;
&lt;p&gt;In particular, the custom pages are implemented using a &lt;a href="https://docs.datasette.io/en/stable/writing_plugins.html#writing-one-off-plugins"&gt;one-off site plugin&lt;/a&gt; (in the &lt;code&gt;plugins/&lt;/code&gt; folder) that adds extra routes to Datasette using the &lt;a href="https://docs.datasette.io/en/stable/plugin_hooks.html#register-routes-datasette"&gt;register_routes()&lt;/a&gt; plugin hook, &lt;a href="https://github.com/simonw/sfms-history/blob/cdf31999b9534b78c41ccfe6dc0c38067ca9cfe3/plugins/views.py#L5-L12"&gt;like this&lt;/a&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;span class="pl-en"&gt;@&lt;span class="pl-s1"&gt;hookimpl&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-k"&gt;def&lt;/span&gt; &lt;span class="pl-en"&gt;register_routes&lt;/span&gt;():
    &lt;span class="pl-k"&gt;return&lt;/span&gt; [
        (&lt;span class="pl-s"&gt;r"^/docs$"&lt;/span&gt;, &lt;span class="pl-s1"&gt;docs&lt;/span&gt;),
        (&lt;span class="pl-s"&gt;r"^/docs/(?P&amp;lt;document_id&amp;gt;[a-z0-9]+)$"&lt;/span&gt;, &lt;span class="pl-s1"&gt;document&lt;/span&gt;),
        (&lt;span class="pl-s"&gt;r"^/docs/(?P&amp;lt;document_id&amp;gt;[a-z0-9]+)/(?P&amp;lt;page&amp;gt;\d+)/?$"&lt;/span&gt;, &lt;span class="pl-s1"&gt;page&lt;/span&gt;),
        (&lt;span class="pl-s"&gt;r"^/folders/(?P&amp;lt;folder&amp;gt;.*)$"&lt;/span&gt;, &lt;span class="pl-s1"&gt;folder&lt;/span&gt;),
    ]&lt;/pre&gt;
&lt;p&gt;&lt;code&gt;docs&lt;/code&gt;, &lt;code&gt;document&lt;/code&gt;, &lt;code&gt;page&lt;/code&gt; and &lt;code&gt;folder&lt;/code&gt; are Python view functions that take a request and return a custom response.&lt;/p&gt;
&lt;p&gt;Here's the &lt;code&gt;docs&lt;/code&gt; function for example, which powers the listing of all documents at &lt;a href="https://archive.sfmicrosociety.org/docs"&gt;archive.sfmicrosociety.org/docs&lt;/a&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;span class="pl-k"&gt;async&lt;/span&gt; &lt;span class="pl-k"&gt;def&lt;/span&gt; &lt;span class="pl-en"&gt;docs&lt;/span&gt;(&lt;span class="pl-s1"&gt;datasette&lt;/span&gt;, &lt;span class="pl-s1"&gt;request&lt;/span&gt;):
    &lt;span class="pl-s1"&gt;db&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;datasette&lt;/span&gt;.&lt;span class="pl-en"&gt;get_database&lt;/span&gt;(&lt;span class="pl-s"&gt;"sfms"&lt;/span&gt;)
    &lt;span class="pl-s1"&gt;documents&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; [
        &lt;span class="pl-en"&gt;to_document&lt;/span&gt;(&lt;span class="pl-s1"&gt;doc&lt;/span&gt;)
        &lt;span class="pl-k"&gt;for&lt;/span&gt; &lt;span class="pl-s1"&gt;doc&lt;/span&gt; &lt;span class="pl-c1"&gt;in&lt;/span&gt; &lt;span class="pl-k"&gt;await&lt;/span&gt; &lt;span class="pl-s1"&gt;db&lt;/span&gt;.&lt;span class="pl-en"&gt;execute&lt;/span&gt;(
            &lt;span class="pl-s"&gt;"""&lt;/span&gt;
&lt;span class="pl-s"&gt;    select documents.*, count(*) as num_pages&lt;/span&gt;
&lt;span class="pl-s"&gt;    from pages join documents on pages.document_id = documents.id&lt;/span&gt;
&lt;span class="pl-s"&gt;    group by documents.id&lt;/span&gt;
&lt;span class="pl-s"&gt;    order by path&lt;/span&gt;
&lt;span class="pl-s"&gt;    """&lt;/span&gt;
        )
    ]
    &lt;span class="pl-k"&gt;return&lt;/span&gt; &lt;span class="pl-v"&gt;Response&lt;/span&gt;.&lt;span class="pl-en"&gt;html&lt;/span&gt;(
        &lt;span class="pl-k"&gt;await&lt;/span&gt; &lt;span class="pl-s1"&gt;datasette&lt;/span&gt;.&lt;span class="pl-en"&gt;render_template&lt;/span&gt;(&lt;span class="pl-s"&gt;"docs.html"&lt;/span&gt;, {&lt;span class="pl-s"&gt;"documents"&lt;/span&gt;: &lt;span class="pl-s1"&gt;documents&lt;/span&gt;}, &lt;span class="pl-s1"&gt;request&lt;/span&gt;)
    )&lt;/pre&gt;
&lt;p&gt;And here's &lt;a href="https://github.com/simonw/sfms-history/blob/cdf31999b9534b78c41ccfe6dc0c38067ca9cfe3/templates/docs.html"&gt;the docs.html template&lt;/a&gt; it uses.&lt;/p&gt;
&lt;p&gt;The &lt;a href="https://github.com/simonw/sfms-history/blob/cdf31999b9534b78c41ccfe6dc0c38067ca9cfe3/templates/index.html"&gt;index.html&lt;/a&gt; template which implements both the homepage and the search results page is worth a look too. It takes &lt;a href="https://github.com/simonw/sfms-history/blob/cdf31999b9534b78c41ccfe6dc0c38067ca9cfe3/plugins/index.py"&gt;a slightly different approach&lt;/a&gt;, using the &lt;a href="https://docs.datasette.io/en/stable/plugin_hooks.html#extra-template-vars-template-database-table-columns-view-name-request-datasette"&gt;extra_template_vars()&lt;/a&gt; plugin hook to populate extra custom template variables that can be used by that page to loop through the search results.&lt;/p&gt;
&lt;p&gt;I built that page first, before adding custom routes for the other pages - I should probably refactor that to use &lt;code&gt;register_routes()&lt;/code&gt; instead for consistency.&lt;/p&gt;
&lt;h4&gt;OCRing the documents&lt;/h4&gt;
&lt;p&gt;The documents started out in a Google Drive. I used the &lt;a href="https://rclone.org/"&gt;Rclone&lt;/a&gt; tool to sync that to an S3 bucket, and then ran my &lt;a href="https://datasette.io/tools/s3-ocr"&gt;s3-ocr tool&lt;/a&gt; against specific folders within that bucket to run them through AWS Textract.&lt;/p&gt;
&lt;p&gt;Then I ran the &lt;code&gt;s3-ocr index&lt;/code&gt; command to suck down the OCR results and build those into a SQLite database.&lt;/p&gt;
&lt;p&gt;The resulting schema wasn't quite right for this particular project. I wrote &lt;a href="https://github.com/simonw/sfms-history/blob/cdf31999b9534b78c41ccfe6dc0c38067ca9cfe3/build-db.sh"&gt;this build-db.sh script&lt;/a&gt; to construct a schema specifically designed for the archive, copy the data in and then enable full-text search on the resulting tables.&lt;/p&gt;
&lt;p&gt;The SQLite database file containing the initial OCR results is called &lt;code&gt;index.db&lt;/code&gt;. I used the following commands to attach that &lt;code&gt;index.db&lt;/code&gt; file to the same connection as the &lt;code&gt;sfms.db&lt;/code&gt; database and copy and re-shape the records that I wanted:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; Populate documents&lt;/span&gt;
sqlite-utils sfms.db --attach index2 index.db &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;$(&lt;/span&gt;cat &lt;span class="pl-s"&gt;&lt;span class="pl-k"&gt;&amp;lt;&amp;lt;&lt;/span&gt;&lt;span class="pl-k"&gt;EOF&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;insert into documents select&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;  substr(s3_ocr_etag, 2, 8) as id,&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;  key as title,&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;  key as path,&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;  replace(s3_ocr_etag, '"', '') as etag&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;from&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;  index2.ocr_jobs&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;where&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;  key in (&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;    select path from index2.pages&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;    where (&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;      folder like 'INTAKE/%'&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;      or folder like 'PUBLIC/%'&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;    )&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;    and folder not like '%PROCESSED INTAKE DOCUMENTS/%'&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;  )&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;&lt;span class="pl-k"&gt;EOF&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;

&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; Populate pages&lt;/span&gt;
sqlite-utils sfms.db --attach index2 index.db &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;$(&lt;/span&gt;cat &lt;span class="pl-s"&gt;&lt;span class="pl-k"&gt;&amp;lt;&amp;lt;&lt;/span&gt;&lt;span class="pl-k"&gt;EOF&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;insert into pages select distinct&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;  substr(s3_ocr_etag, 2, 8) || '-' || page as id,&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;  substr(s3_ocr_etag, 2, 8) as document_id,&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;  page,&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;  text&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;from index2.pages&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;  join index2.ocr_jobs&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;    on index2.pages.path = index2.ocr_jobs.key&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;where&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;  (&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;    folder like 'INTAKE/%'&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;    or folder like 'PUBLIC/%'&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;  )&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;  and folder not like '%PROCESSED INTAKE DOCUMENTS/%'&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;&lt;span class="pl-k"&gt;EOF&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This script is &lt;a href="https://github.com/simonw/sfms-history/blob/cdf31999b9534b78c41ccfe6dc0c38067ca9cfe3/.github/workflows/deploy.yml#L39-L41"&gt;run by GitHub Actions&lt;/a&gt; as part of the workflow that deploys the site to Vercel.&lt;/p&gt;
&lt;h4&gt;Images of pages&lt;/h4&gt;
&lt;p&gt;The site itself makes use of the OCRd text that is stored in the database - that's how the search feature works.&lt;/p&gt;
&lt;p&gt;But the scanned pages are visually so much more interesting than that!&lt;/p&gt;
&lt;p&gt;I wanted to display those page images as both thumbnails and larger images, partly to add visual appeal and partly because OCRd text loses a lot of the richness of the underlying records.&lt;/p&gt;
&lt;p&gt;Since I already had the PDFs in an S3 bucket, the easiest way to build this was using &lt;a href="https://imgix.com/"&gt;imgix&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;imgix offers a proxy service for running URL-defined transforms against images. I've used it on a bunch of projects before: it's really powerful.&lt;/p&gt;
&lt;p&gt;In this case, given an imgix source configured against an S3 bucket, an image URL for a specific page of a PDF can be constructed like this (newlines added for readability):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;https://sfms-history.imgix.net/
  PUBLIC/Scientific%20Documents/
  MicroscopicGymnastics_J.EdwardsSmith-04-06-1876.pdf
  ?page=1
  &amp;amp;w=200
  &amp;amp;auto=format
  &amp;amp;s=634c00249fbe4a2eda90f00be0ae66d6
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here we're providing the path to the PDF in the S3 bucket, requesting a render of page 1 of that PDF, at width 200px and using &lt;code&gt;auto=format&lt;/code&gt; which causes imgix to &lt;a href="https://docs.imgix.com/apis/rendering/auto/auto#format"&gt;serve the image in an optimized format&lt;/a&gt; based on the capabilities of the user's device.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;&amp;amp;s=&lt;/code&gt; parameter is a signature which can be configured to prevent people from constructing their own arbitrary URLs. In this case the S3 bucket has some files in that are not part of the public archive, so using a signature prevents people from guessing filenames and sniffing around in the bucket contents.&lt;/p&gt;
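&lt;p&gt;Here's a sketch of how that signature can be computed, following imgix's documented signing scheme: the &lt;code&gt;s=&lt;/code&gt; parameter is an MD5 hash of your secret token concatenated with the encoded path and query string. The token here is made up, and this is an illustration of the scheme rather than the site's actual signing code:&lt;/p&gt;

```python
import hashlib
from urllib.parse import quote

def sign_imgix_url(domain, token, path, params):
    # imgix signing: s = md5(token + encoded_path + "?" + query_string)
    encoded_path = quote(path)  # keeps "/", encodes spaces as %20
    query = "&".join(f"{key}={value}" for key, value in params.items())
    signature = hashlib.md5((token + encoded_path + "?" + query).encode("utf-8")).hexdigest()
    return f"https://{domain}{encoded_path}?{query}&s={signature}"

# Made-up token; the domain and parameters match the example above.
url = sign_imgix_url(
    "sfms-history.imgix.net",
    "not-a-real-token",
    "/PUBLIC/Scientific Documents/example.pdf",
    {"page": 1, "w": 200, "auto": "format"},
)
print(url)
```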
&lt;h4&gt;s3-ocr elsewhere&lt;/h4&gt;
&lt;p&gt;I built &lt;code&gt;s3-ocr&lt;/code&gt; for this specific project, but it's already starting to see use for other projects. My favourite example so far is the work Philip James has been doing to deploy search engines against OCRd meeting minutes for the cities of Alameda and Oakland (and soon more) - he wrote about that in detail in &lt;a href="https://phildini.dev/digitizing-55-000-pages-of-civic-meetings"&gt;Digitizing 55,000 pages of civic meetings&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;If you have an interesting archive of PDFs that need to be OCRd I would love to hear from you! Contact me &lt;a href="https://twitter.com/simonw"&gt;on Twitter&lt;/a&gt; or email me at &lt;code&gt;swillison&lt;/code&gt; at Google's webmail provider.&lt;/p&gt;
&lt;h4&gt;Also this week&lt;/h4&gt;
&lt;p&gt;I got very distracted by &lt;a href="https://simonwillison.net/2022/Aug/21/scotrail/"&gt;Analyzing ScotRail audio announcements with Datasette&lt;/a&gt; - which did at least produce a new, detailed tutorial describing my process for approaching new projects with Datasette.&lt;/p&gt;
&lt;p&gt;I added &lt;a href="https://simonwillison.net/2022/Aug/17/datasette-lite-plugins/"&gt;Plugin support to Datasette Lite&lt;/a&gt;.&lt;/p&gt;
&lt;h4&gt;Releases this week&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette-multiline-links"&gt;datasette-multiline-links&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/datasette-multiline-links/releases/tag/0.1"&gt;0.1&lt;/a&gt; - 2022-08-24
&lt;br /&gt;Make multiple newline separated URLs clickable in Datasette&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette-publish-vercel"&gt;datasette-publish-vercel&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/datasette-publish-vercel/releases/tag/0.14.2"&gt;0.14.2&lt;/a&gt; - (&lt;a href="https://github.com/simonw/datasette-publish-vercel/releases"&gt;23 releases total&lt;/a&gt;) - 2022-08-24
&lt;br /&gt;Datasette plugin for publishing data using Vercel&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette-edit-schema"&gt;datasette-edit-schema&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/datasette-edit-schema/releases/tag/0.5.1"&gt;0.5.1&lt;/a&gt; - (&lt;a href="https://github.com/simonw/datasette-edit-schema/releases"&gt;10 releases total&lt;/a&gt;) - 2022-08-22
&lt;br /&gt;Datasette plugin for modifying table schemas&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/dogsheep/pocket-to-sqlite"&gt;pocket-to-sqlite&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/dogsheep/pocket-to-sqlite/releases/tag/0.2.2"&gt;0.2.2&lt;/a&gt; - (&lt;a href="https://github.com/dogsheep/pocket-to-sqlite/releases"&gt;4 releases total&lt;/a&gt;) - 2022-08-22
&lt;br /&gt;Create a SQLite database containing data from your Pocket account&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette-mp3-audio"&gt;datasette-mp3-audio&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/datasette-mp3-audio/releases/tag/0.2"&gt;0.2&lt;/a&gt; - (&lt;a href="https://github.com/simonw/datasette-mp3-audio/releases"&gt;2 releases total&lt;/a&gt;) - 2022-08-20
&lt;br /&gt;Turn .mp3 URLs into an audio player in the Datasette interface&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/sqlite-diffable"&gt;sqlite-diffable&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/sqlite-diffable/releases/tag/0.5"&gt;0.5&lt;/a&gt; - (&lt;a href="https://github.com/simonw/sqlite-diffable/releases"&gt;6 releases total&lt;/a&gt;) - 2022-08-18
&lt;br /&gt;Tools for dumping/loading a SQLite database to diffable directory structure&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette-sentry"&gt;datasette-sentry&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/datasette-sentry/releases/tag/0.2"&gt;0.2&lt;/a&gt; - (&lt;a href="https://github.com/simonw/datasette-sentry/releases"&gt;5 releases total&lt;/a&gt;) - 2022-08-14
&lt;br /&gt;Datasette plugin for configuring Sentry&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette"&gt;datasette&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/datasette/releases/tag/0.62"&gt;0.62&lt;/a&gt; - (&lt;a href="https://github.com/simonw/datasette/releases"&gt;113 releases total&lt;/a&gt;) - 2022-08-14
&lt;br /&gt;An open source multi-tool for exploring and publishing data&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/s3-credentials"&gt;s3-credentials&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/s3-credentials/releases/tag/0.13"&gt;0.13&lt;/a&gt; - (&lt;a href="https://github.com/simonw/s3-credentials/releases"&gt;14 releases total&lt;/a&gt;) - 2022-08-12
&lt;br /&gt;A tool for creating credentials for accessing S3 buckets&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;TIL this week&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/javascript/openseadragon"&gt;Creating a tiled zoomable image with OpenSeadragon and vips&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/python/setup-py-from-url"&gt;Defining setup.py dependencies using a URL&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/googlecloud/google-cloud-spend-datasette"&gt;Analyzing Google Cloud spend with Datasette&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/sqlite/sort-by-number-of-json-intersections"&gt;Sort by number of JSON intersections&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/quarto/trying-out-quarto"&gt;Trying out Quarto on macOS&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/sqlite/function-list"&gt;Seeing which functions are unique to a specific SQLite / Datasette instance&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ocr"&gt;ocr&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pdf"&gt;pdf&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/weeknotes"&gt;weeknotes&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="ocr"/><category term="pdf"/><category term="projects"/><category term="datasette"/><category term="weeknotes"/></entry><entry><title>s3-ocr: Extract text from PDF files stored in an S3 bucket</title><link href="https://simonwillison.net/2022/Jun/30/s3-ocr/#atom-tag" rel="alternate"/><published>2022-06-30T21:40:27+00:00</published><updated>2022-06-30T21:40:27+00:00</updated><id>https://simonwillison.net/2022/Jun/30/s3-ocr/#atom-tag</id><summary type="html">
    &lt;p&gt;I've released &lt;strong&gt;&lt;a href="https://datasette.io/tools/s3-ocr"&gt;s3-ocr&lt;/a&gt;&lt;/strong&gt;, a new tool that runs Amazon's &lt;a href="https://aws.amazon.com/textract/"&gt;Textract&lt;/a&gt; OCR text extraction against PDF files in an S3 bucket, then writes the resulting text out to a SQLite database with full-text search configured so you can run searches against the extracted data.&lt;/p&gt;
&lt;p&gt;You can search through a demo of 697 pages of OCRd text at &lt;a href="https://s3-ocr-demo.datasette.io/pages/pages"&gt;s3-ocr-demo.datasette.io/pages/pages&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Textract works extremely well: it handles dodgy scanned PDFs full of typewritten code and reads handwritten text better than I can! It &lt;a href="https://aws.amazon.com/textract/pricing/"&gt;charges&lt;/a&gt; $1.50 per thousand pages processed.&lt;/p&gt;
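&lt;p&gt;At that rate the cost of a job is easy to estimate up front - the 697-page demo above, for example, comes out at just over a dollar:&lt;/p&gt;

```python
# Textract DetectDocumentText pricing quoted above: $1.50 per 1,000 pages.
def textract_cost_usd(pages, rate_per_thousand=1.50):
    return pages / 1000 * rate_per_thousand

demo_cost = textract_cost_usd(697)  # the 697-page demo
print(f"${demo_cost:.2f}")
```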
&lt;h4&gt;Why I built this&lt;/h4&gt;
&lt;p&gt;My initial need for this is a collaboration I have running with the &lt;a href="https://sfmicrosociety.org/"&gt;San Francisco Microscopical Society&lt;/a&gt;. They've been digitizing their archives - which stretch back to 1870! - and were looking for help turning the digital scans into something more useful.&lt;/p&gt;
&lt;p&gt;The archives are full of hand-written and type-written notes, scanned and stored as PDFs.&lt;/p&gt;
&lt;p&gt;I decided to wrap my work up as a tool because I'm sure there are a LOT of organizations out there with a giant bucket of PDF files that would benefit from being able to easily run OCR and turn the results into a searchable database.&lt;/p&gt;
&lt;p&gt;Running Textract directly against large numbers of files is somewhat inconvenient (here's my &lt;a href="https://til.simonwillison.net/aws/ocr-pdf-textract"&gt;earlier TIL about it&lt;/a&gt;). &lt;code&gt;s3-ocr&lt;/code&gt; is my attempt to make it easier.&lt;/p&gt;
&lt;h4&gt;Tutorial: How I built that demo&lt;/h4&gt;
&lt;p&gt;The demo instance uses three PDFs from the Library of Congress Harry Houdini Collection &lt;a href="https://archive.org/search.php?query=creator%3A%22Harry+Houdini+Collection+%28Library+of+Congress%29+DLC%22"&gt;on the Internet Archive&lt;/a&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://archive.org/details/unmaskingrobert00houdgoog"&gt;The unmasking of Robert-Houdin&lt;/a&gt; from 1908&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://archive.org/details/practicalmagicia00harr"&gt;The practical magician and ventriloquist's guide: a practical manual of fireside magic and conjuring illusions: containing also complete instructions for acquiring &amp;amp; practising the art of ventriloquism&lt;/a&gt; from 1876&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://archive.org/details/latestmagicbeing00hoff"&gt;Latest magic, being original conjuring tricks&lt;/a&gt; from 1918&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I started by downloading PDFs of those three files.&lt;/p&gt;
&lt;p&gt;Then I installed the two tools I needed:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;pip install s3-ocr s3-credentials
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I used my &lt;a href="https://datasette.io/tools/s3-credentials"&gt;s3-credentials&lt;/a&gt; tool to create a new S3 bucket and credentials with the ability to write files to it, with the new &lt;a href="https://github.com/simonw/s3-credentials/issues/72"&gt;--statement option&lt;/a&gt; (which I released today) to add &lt;code&gt;textract&lt;/code&gt; permissions to the generated credentials:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;s3-credentials create s3-ocr-demo --statement '{
  "Effect": "Allow",
  "Action": "textract:*",
  "Resource": "*"
}' --create-bucket &amp;gt; ocr.json
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;(Note that you don't need to use &lt;code&gt;s3-credentials&lt;/code&gt; at all if you have AWS credentials configured on your machine with root access to your account - just leave off the &lt;code&gt;-a ocr.json&lt;/code&gt; options in the following examples.)&lt;/p&gt;
&lt;p&gt;&lt;code&gt;s3-ocr-demo&lt;/code&gt; is now a bucket I can use for the demo. &lt;code&gt;ocr.json&lt;/code&gt; contains JSON with an access key and secret key for an IAM user account that can interact with that bucket, and also has permission to access the AWS Textract APIs.&lt;/p&gt;
&lt;p&gt;I uploaded my three PDFs to the bucket:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;s3-credentials put-object s3-ocr-demo latestmagicbeing00hoff.pdf latestmagicbeing00hoff.pdf -a ocr.json
s3-credentials put-object s3-ocr-demo practicalmagicia00harr.pdf practicalmagicia00harr.pdf -a ocr.json
s3-credentials put-object s3-ocr-demo unmaskingrobert00houdgoog.pdf unmaskingrobert00houdgoog.pdf -a ocr.json
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;(I often use &lt;a href="https://panic.com/transmit/"&gt;Transmit&lt;/a&gt; as a GUI for this kind of operation.)&lt;/p&gt;
&lt;p&gt;Then I kicked off OCR jobs against every PDF file in the bucket:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;% s3-ocr start s3-ocr-demo --all -a ocr.json 
Found 0 files with .s3-ocr.json out of 3 PDFs
Starting OCR for latestmagicbeing00hoff.pdf, Job ID: f66bc2d00fb75d1c42d1f829e5b6788891f9799fda404c4550580959f65a5402
Starting OCR for practicalmagicia00harr.pdf, Job ID: ef085728135d524a39bc037ad6f7253284b1fdbeb728dddcfbb260778d902b55
Starting OCR for unmaskingrobert00houdgoog.pdf, Job ID: 93bd46f02eb099eca369c41e384836d2bd3199b95d415c0257ef3fa3602cbef9
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;--all&lt;/code&gt; option scans for any file with a &lt;code&gt;.pdf&lt;/code&gt; extension. You can pass explicit file names instead if you just want to process one or two files at a time.&lt;/p&gt;
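&lt;p&gt;As a sketch (this is not the actual &lt;code&gt;s3-ocr&lt;/code&gt; source, and &lt;code&gt;pdfs_needing_ocr&lt;/code&gt; is a hypothetical helper name), the skip-already-processed filtering behind &lt;code&gt;--all&lt;/code&gt; boils down to checking for the companion &lt;code&gt;.s3-ocr.json&lt;/code&gt; marker files described later in this post:&lt;/p&gt;

```python
# Illustrative sketch of how "start --all" can decide which PDFs in a
# bucket still need OCR: any .pdf key without a matching
# KEY.s3-ocr.json marker file has not been submitted yet.
def pdfs_needing_ocr(keys):
    """Given a flat list of S3 keys, return the .pdf keys that do not
    yet have a matching .s3-ocr.json marker."""
    markers = {k for k in keys if k.endswith(".s3-ocr.json")}
    return [
        k for k in keys
        if k.lower().endswith(".pdf") and (k + ".s3-ocr.json") not in markers
    ]
```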
&lt;p&gt;This returns straight away, but the OCR process itself can take several minutes depending on the size of the files.&lt;/p&gt;
&lt;p&gt;The job IDs can be used to inspect the progress of each task like so:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;% s3-ocr inspect-job f66bc2d00fb75d1c42d1f829e5b6788891f9799fda404c4550580959f65a5402
{
  "DocumentMetadata": {
    "Pages": 244
  },
  "JobStatus": "SUCCEEDED",
  "DetectDocumentTextModelVersion": "1.0"
}
&lt;/code&gt;&lt;/pre&gt;
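&lt;p&gt;A minimal sketch of the polling this relies on - the Textract client is passed in (e.g. &lt;code&gt;boto3.client("textract")&lt;/code&gt;) so the logic reads without AWS credentials, and &lt;code&gt;wait_for_job&lt;/code&gt; is an illustrative name rather than part of &lt;code&gt;s3-ocr&lt;/code&gt;'s API:&lt;/p&gt;

```python
import time

def wait_for_job(textract, job_id, poll_seconds=10):
    """Poll GetDocumentTextDetection until the job leaves IN_PROGRESS,
    then return the final JobStatus (SUCCEEDED or FAILED)."""
    while True:
        response = textract.get_document_text_detection(JobId=job_id)
        status = response["JobStatus"]
        if status != "IN_PROGRESS":
            return status
        time.sleep(poll_seconds)
```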
&lt;p&gt;Once the job completed, I could preview the text extracted from the PDF like so:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;% s3-ocr text s3-ocr-demo latestmagicbeing00hoff.pdf
111
.
116

LATEST MAGIC
BEING
ORIGINAL CONJURING TRICKS
INVENTED AND ARRANGED
BY
PROFESSOR HOFFMANN
(ANGELO LEWIS, M.A.)
Author of "Modern Magic," etc.
WITH NUMEROUS ILLUSTRATIONS
FIRST EDITION
NEW YORK
SPON &amp;amp; CHAMBERLAIN, 120 LIBERTY ST.
...
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;To create a SQLite database with a table containing rows for every page of scanned text, I ran this command:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;% s3-ocr index s3-ocr-demo pages.db -a ocr.json 
Fetching job details  [####################################]  100%
Populating pages table  [####--------------------------------]   13%  00:00:34
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I then published the resulting &lt;code&gt;pages.db&lt;/code&gt; SQLite database using Datasette - you can &lt;a href="https://s3-ocr-demo.datasette.io/pages"&gt;explore it here&lt;/a&gt;.&lt;/p&gt;
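&lt;p&gt;The shape of that database is roughly this - one row per scanned page, keyed by source PDF and page number (column names here are illustrative, not guaranteed to match &lt;code&gt;s3-ocr&lt;/code&gt;'s actual schema):&lt;/p&gt;

```python
import sqlite3

def build_pages_db(path, rows):
    """Create a pages table and load it; rows is an iterable of
    (path, page, text) tuples, one per OCRed page."""
    db = sqlite3.connect(path)
    db.execute(
        """
        CREATE TABLE IF NOT EXISTS pages (
            path TEXT, page INTEGER, text TEXT,
            PRIMARY KEY (path, page)
        )
        """
    )
    db.executemany("INSERT INTO pages VALUES (?, ?, ?)", rows)
    db.commit()
    return db
```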
&lt;h4&gt;How s3-ocr works&lt;/h4&gt;
&lt;p&gt;&lt;code&gt;s3-ocr&lt;/code&gt; works by calling Amazon's S3 and Textract APIs.&lt;/p&gt;
&lt;p&gt;Textract only works against PDF files in &lt;a href="https://docs.aws.amazon.com/textract/latest/dg/api-async.html"&gt;asynchronous mode&lt;/a&gt;: you call an API endpoint to tell it "start running OCR against this PDF file in this S3 bucket", then wait for it to finish - which can take several minutes.&lt;/p&gt;
&lt;p&gt;Textract defaults to storing the OCR results in its own storage, which expires after seven days. You can instead tell it to store them in your own S3 bucket - &lt;code&gt;s3-ocr&lt;/code&gt; uses that option.&lt;/p&gt;
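&lt;p&gt;The underlying boto3 call looks something like this, with &lt;code&gt;OutputConfig&lt;/code&gt; directing results to your own bucket. The client is injected so the sketch reads without AWS, and &lt;code&gt;start_ocr_job&lt;/code&gt; is an illustrative name, not &lt;code&gt;s3-ocr&lt;/code&gt;'s internal function:&lt;/p&gt;

```python
def start_ocr_job(textract, bucket, key, output_prefix="textract-output"):
    """Kick off asynchronous text detection for one PDF in S3,
    writing results under output_prefix in the same bucket."""
    response = textract.start_document_text_detection(
        DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}},
        OutputConfig={"S3Bucket": bucket, "S3Prefix": output_prefix},
    )
    return response["JobId"]
```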
&lt;p&gt;A design challenge I faced was that I wanted to make the command restartable and resumable: if the user cancelled the task, I wanted to be able to pick up from where it had got to. I also wanted to be able to run it again after adding more PDFs to the bucket without repeating work for the previously processed files.&lt;/p&gt;
&lt;p&gt;I also needed to persist those job IDs: Textract writes the OCR results to keys in the bucket called &lt;code&gt;textract-output/JOB_ID/1-?&lt;/code&gt; - but there's no indication as to which PDF file the results correspond to.&lt;/p&gt;
&lt;p&gt;My solution is to write tiny extra JSON files to the bucket when the OCR job is first started.&lt;/p&gt;
&lt;p&gt;If you have a file called &lt;code&gt;latestmagicbeing00hoff.pdf&lt;/code&gt; the &lt;code&gt;start&lt;/code&gt; command will create a new file called &lt;code&gt;latestmagicbeing00hoff.pdf.s3-ocr.json&lt;/code&gt; with the following content:&lt;/p&gt;
&lt;div class="highlight highlight-source-json"&gt;&lt;pre&gt;{
  &lt;span class="pl-ent"&gt;"job_id"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;f66bc2d00fb75d1c42d1f829e5b6788891f9799fda404c4550580959f65a5402&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
  &lt;span class="pl-ent"&gt;"etag"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;span class="pl-cce"&gt;\"&lt;/span&gt;d79af487579dcbbef26c9b3be763eb5e-2&lt;span class="pl-cce"&gt;\"&lt;/span&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;
}&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This associates the job ID with the PDF file. It also records the original ETag of the PDF file - this is so in the future I can implement a system that can re-run OCR if the PDF has been updated.&lt;/p&gt;
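&lt;p&gt;The marker is simple enough to sketch in a couple of lines - &lt;code&gt;marker_key&lt;/code&gt; and &lt;code&gt;marker_body&lt;/code&gt; are hypothetical helper names used here for illustration:&lt;/p&gt;

```python
import json

def marker_key(pdf_key):
    """The S3 key for the marker that accompanies a submitted PDF."""
    return pdf_key + ".s3-ocr.json"

def marker_body(job_id, etag):
    """The JSON body pairing a Textract job ID with the PDF's ETag
    at submission time."""
    return json.dumps({"job_id": job_id, "etag": etag})
```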
&lt;p&gt;The existence of these files lets me do two things:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;If you run &lt;code&gt;s3-ocr start s3-ocr-demo --all&lt;/code&gt; it can avoid re-submitting PDF files that have already been sent for OCR, by checking for the existence of the &lt;code&gt;.s3-ocr.json&lt;/code&gt; file.&lt;/li&gt;
&lt;li&gt;When you later ask for the results of the OCR it can use these files to associate the PDF with the results.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Scattering &lt;code&gt;.s3-ocr.json&lt;/code&gt; files all over the place feels a little messy, so I have an &lt;a href="https://github.com/simonw/s3-ocr/issues/14"&gt;open issue&lt;/a&gt; considering moving them all to a &lt;code&gt;s3-ocr/&lt;/code&gt; prefix in the bucket instead.&lt;/p&gt;
&lt;h4&gt;Try it and let me know what you think&lt;/h4&gt;
&lt;p&gt;This is a brand new project, but I think it's ready for other people to start trying it out.&lt;/p&gt;
&lt;p&gt;I ran it against around 7,000 pages from 531 PDF files in the San Francisco Microscopy Society archive and it seemed to work well!&lt;/p&gt;
&lt;p&gt;If you try this out and it works (or it doesn't work) please &lt;a href="https://twitter.com/simonw"&gt;let me know via Twitter&lt;/a&gt; or &lt;a href="https://github.com/simonw/s3-ocr"&gt;GitHub&lt;/a&gt;.&lt;/p&gt;
&lt;h4&gt;A challenging example page&lt;/h4&gt;
&lt;p&gt;Here's one of the more challenging pages I processed using Textract:&lt;/p&gt;
&lt;p&gt;&lt;img alt="A very old page of difficult to read handwriting" src="https://static.simonwillison.net/static/2022/s3-ocr-sample-handwriting.jpg" style="max-width: 100%"/&gt;&lt;/p&gt;
&lt;p&gt;Here's the result:&lt;/p&gt;
&lt;pre&gt;
In. In J a ... the Joe 14
162
Volxv
Lalpa spinosa, Eggt bud development. of
146
Farcomas spindle. cells in nested gowers 271
Fayigaga tridactylites, leaf glaur of ruce 33
staining &amp;amp; mounting
Stiles 133
tilica films, a structure of Diatoins morehouse 38
thile new microscopic
Broeck 22 /
Smith reproduction in the huntroom tribe
6
Trakes, develop mouht succession of the porsion tango/229
Soirce President of the Roy: truc: Soo
285
forby, Presidents address
105
pongida, difficulties of classification
238
tage, american adjustable concentric
150
ttlese staining &amp;amp; mountring wood sections 133
Stodder, Frustulia Iasconica, havicula
chomboides, &amp;amp; havi cula crassinervis 265
Vol XVI
falicylic acid u movorcopy
160
falpar enctry ology of
Brooke 9.97
Sanderson micros: characters If inflammation
43
tap, circulation of the
42
Jars, structure of the genus Brisinga
44
latter throvite connective substances 191- 241
Jehorey Cessification in birds, formation
of ed blood corpuseles during the
ossification process
by
&lt;/pre&gt;
&lt;h4&gt;Releases this week&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/s3-ocr"&gt;s3-ocr&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/s3-ocr/releases/tag/0.4"&gt;0.4&lt;/a&gt; - (&lt;a href="https://github.com/simonw/s3-ocr/releases"&gt;4 releases total&lt;/a&gt;) - 2022-06-30
&lt;br /&gt;Tools for running OCR against files stored in S3&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/s3-credentials"&gt;s3-credentials&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/s3-credentials/releases/tag/0.12"&gt;0.12&lt;/a&gt; - (&lt;a href="https://github.com/simonw/s3-credentials/releases"&gt;12 releases total&lt;/a&gt;) - 2022-06-30
&lt;br /&gt;A tool for creating credentials for accessing S3 buckets&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette-scale-to-zero"&gt;datasette-scale-to-zero&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/datasette-scale-to-zero/releases/tag/0.1.2"&gt;0.1.2&lt;/a&gt; - (&lt;a href="https://github.com/simonw/datasette-scale-to-zero/releases"&gt;3 releases total&lt;/a&gt;) - 2022-06-23
&lt;br /&gt;Quit Datasette if it has not received traffic for a specified time period&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;TIL this week&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/sqlite/one-line-csv-operations"&gt;One-liner for running queries against CSV files with SQLite&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/bash/ignore-errors"&gt;Ignoring errors in a section of a Bash script&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/aws/ocr-pdf-textract"&gt;Running OCR against a PDF file with AWS Textract&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/aws"&gt;aws&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ocr"&gt;ocr&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pdf"&gt;pdf&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/s3"&gt;s3&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/weeknotes"&gt;weeknotes&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/s3-credentials"&gt;s3-credentials&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="aws"/><category term="ocr"/><category term="pdf"/><category term="projects"/><category term="s3"/><category term="weeknotes"/><category term="s3-credentials"/></entry><entry><title>Automate the Boring Stuff with Python: Working with PDF and Word Documents</title><link href="https://simonwillison.net/2019/Nov/6/working-pdf/#atom-tag" rel="alternate"/><published>2019-11-06T16:17:20+00:00</published><updated>2019-11-06T16:17:20+00:00</updated><id>https://simonwillison.net/2019/Nov/6/working-pdf/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://automatetheboringstuff.com/chapter13/"&gt;Automate the Boring Stuff with Python: Working with PDF and Word Documents&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
I stumbled across this while trying to extract some data from a PDF file (the kind of file with actual text in it as opposed to dodgy scanned images) and it worked perfectly: &lt;code&gt;PyPDF2.PdfFileReader(open("file.pdf", "rb")).getPage(0).extractText()&lt;/code&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/pdf"&gt;pdf&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;&lt;/p&gt;



</summary><category term="pdf"/><category term="python"/></entry><entry><title>arxiv-vanity</title><link href="https://simonwillison.net/2017/Oct/25/arxiv-vanity/#atom-tag" rel="alternate"/><published>2017-10-25T20:06:40+00:00</published><updated>2017-10-25T20:06:40+00:00</updated><id>https://simonwillison.net/2017/Oct/25/arxiv-vanity/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/arxiv-vanity/arxiv-vanity"&gt;arxiv-vanity&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Beautiful new project from Ben Firshman and Andreas Jansson: “Arxiv Vanity renders academic papers from Arxiv as responsive web pages so you don’t have to squint at a PDF”. It works by pulling the raw LaTeX source code from Arxiv and rendering it to HTML using a heavily customized Pandoc workflow. The real fun is in the architecture: it’s a Django app running on Heroku which fires up on-demand Hyper.sh Docker containers for each individual rendering job.

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/bfirsh/status/922508174829203458"&gt;Ben Firshman&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ben-firshman"&gt;ben-firshman&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/django"&gt;django&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pdf"&gt;pdf&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/science"&gt;science&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/docker"&gt;docker&lt;/a&gt;&lt;/p&gt;



</summary><category term="ben-firshman"/><category term="django"/><category term="pdf"/><category term="science"/><category term="docker"/></entry><entry><title>pdf.js</title><link href="https://simonwillison.net/2010/Jun/17/pdfjs/#atom-tag" rel="alternate"/><published>2010-06-17T19:39:00+00:00</published><updated>2010-06-17T19:39:00+00:00</updated><id>https://simonwillison.net/2010/Jun/17/pdfjs/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="http://www.maraksquires.com/pdf.js/"&gt;pdf.js&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
A JavaScript library for creating simple PDF files. Works (flakily) in your browser using a data:URI hack, but is also compatible with server-side JavaScript implementations such as Node.js.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/datauri"&gt;datauri&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/javascript"&gt;javascript&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/nodejs"&gt;nodejs&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pdf"&gt;pdf&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/recovered"&gt;recovered&lt;/a&gt;&lt;/p&gt;



</summary><category term="datauri"/><category term="javascript"/><category term="nodejs"/><category term="pdf"/><category term="recovered"/></entry><entry><title>node.js at JSConf.eu (PDF)</title><link href="https://simonwillison.net/2009/Nov/17/nodepdf/#atom-tag" rel="alternate"/><published>2009-11-17T18:07:37+00:00</published><updated>2009-11-17T18:07:37+00:00</updated><id>https://simonwillison.net/2009/Nov/17/nodepdf/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="http://nodejs.org/jsconf.pdf"&gt;node.js at JSConf.eu (PDF)&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
node.js creator Ryan Dahl’s presentation at this year’s JSConf.eu. The principle philosophy is that I/O in web applications should be asynchronous—for everything. No blocking for database calls, no blocking for filesystem access. JavaScript is a mainstream programming language with a culture of callback APIs (thanks to the DOM) and is hence ideally suited to building asynchronous frameworks.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/asynchronous"&gt;asynchronous&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/eventio"&gt;eventio&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/javascript"&gt;javascript&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/nodejs"&gt;nodejs&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pdf"&gt;pdf&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ryan-dahl"&gt;ryan-dahl&lt;/a&gt;&lt;/p&gt;



</summary><category term="asynchronous"/><category term="eventio"/><category term="javascript"/><category term="nodejs"/><category term="pdf"/><category term="ryan-dahl"/></entry><entry><title>Adobe is Bad for Open Government</title><link href="https://simonwillison.net/2009/Nov/1/sunlight/#atom-tag" rel="alternate"/><published>2009-11-01T12:51:20+00:00</published><updated>2009-11-01T12:51:20+00:00</updated><id>https://simonwillison.net/2009/Nov/1/sunlight/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="http://sunlightlabs.com/blog/2009/adobe-bad-open-government/"&gt;Adobe is Bad for Open Government&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
The problem isn’t just that PDFs are a bad way of sharing data, it’s that Adobe have been actively lobbying the US government to use their PDF and Flash formats for open government initiatives.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/adobe"&gt;adobe&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/flash"&gt;flash&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/opengovernment"&gt;opengovernment&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pdf"&gt;pdf&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sunlightfoundation"&gt;sunlightfoundation&lt;/a&gt;&lt;/p&gt;



</summary><category term="adobe"/><category term="flash"/><category term="opengovernment"/><category term="pdf"/><category term="sunlightfoundation"/></entry><entry><title>No PDFs!</title><link href="https://simonwillison.net/2009/Nov/1/pdfs/#atom-tag" rel="alternate"/><published>2009-11-01T12:04:36+00:00</published><updated>2009-11-01T12:04:36+00:00</updated><id>https://simonwillison.net/2009/Nov/1/pdfs/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="http://blog.sunlightfoundation.com/2009/06/05/no-pdfs/"&gt;No PDFs!&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
The Sunlight Foundation point out that PDFs are a terrible way of implementing “more transparent government” due to their general lack of structure. At the Guardian (and I’m sure at other newspapers) we waste an absurd amount of time manually extracting data from PDF files and turning it in to something more useful. Even CSV is significantly more useful for many types of information.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/adobe"&gt;adobe&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/csv"&gt;csv&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/open-data"&gt;open-data&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/opengovernment"&gt;opengovernment&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pdf"&gt;pdf&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sunlightfoundation"&gt;sunlightfoundation&lt;/a&gt;&lt;/p&gt;



</summary><category term="adobe"/><category term="csv"/><category term="open-data"/><category term="opengovernment"/><category term="pdf"/><category term="sunlightfoundation"/></entry><entry><title>Prawn</title><link href="https://simonwillison.net/2009/Jan/16/prawn/#atom-tag" rel="alternate"/><published>2009-01-16T16:04:22+00:00</published><updated>2009-01-16T16:04:22+00:00</updated><id>https://simonwillison.net/2009/Jan/16/prawn/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="http://prawn.majesticseacreature.com/"&gt;Prawn&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Really nice PDF generation library for Ruby, used to generate Dopplr’s beautiful end of year reports.

    &lt;p&gt;&lt;small&gt;Via &lt;a href="http://simonwillison.net/2009/Jan/16/dopplr/#c43144"&gt;Matt Biddulph&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/dopplr"&gt;dopplr&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pdf"&gt;pdf&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/prawn"&gt;prawn&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ruby"&gt;ruby&lt;/a&gt;&lt;/p&gt;



</summary><category term="dopplr"/><category term="pdf"/><category term="prawn"/><category term="ruby"/></entry></feed>