<?xml version="1.0" encoding="utf-8"?>
<feed xml:lang="en-us" xmlns="http://www.w3.org/2005/Atom"><title>Simon Willison's Weblog: xml</title><link href="http://simonwillison.net/" rel="alternate"/><link href="http://simonwillison.net/tags/xml.atom" rel="self"/><id>http://simonwillison.net/</id><updated>2025-11-05T22:24:57+00:00</updated><author><name>Simon Willison</name></author><entry><title>Removing XSLT for a more secure browser</title><link href="https://simonwillison.net/2025/Nov/5/removing-xslt/#atom-tag" rel="alternate"/><published>2025-11-05T22:24:57+00:00</published><updated>2025-11-05T22:24:57+00:00</updated><id>https://simonwillison.net/2025/Nov/5/removing-xslt/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://developer.chrome.com/docs/web-platform/deprecating-xslt"&gt;Removing XSLT for a more secure browser&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Previously discussed &lt;a href="https://simonwillison.net/2025/Aug/19/xslt/"&gt;back in August&lt;/a&gt;, it looks like it's now official:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Chrome intends to deprecate and remove XSLT from the browser. [...] We intend to remove support from version 155 (November 17, 2026). The &lt;a href="https://github.com/mozilla/standards-positions/issues/1287#issuecomment-3227145793"&gt;Firefox&lt;/a&gt; and &lt;a href="https://github.com/whatwg/html/issues/11523#issuecomment-3149280766"&gt;WebKit&lt;/a&gt; projects have also indicated plans to remove XSLT from their browser engines. [...]&lt;/p&gt;
&lt;p&gt;The continued inclusion of XSLT 1.0 in web browsers presents a significant and unnecessary security risk. The underlying libraries that process these transformations, such as &lt;a href="https://github.com/GNOME/libxslt"&gt;libxslt&lt;/a&gt; (used by Chromium browsers), are complex, aging C/C++ codebases. This type of code is notoriously susceptible to memory safety vulnerabilities like buffer overflows, which can lead to arbitrary code execution.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I mostly encounter XSLT on people's Atom/RSS feeds, converting those to a more readable format in case someone should navigate directly to that link. Jake Archibald &lt;a href="https://jakearchibald.com/2025/making-xml-human-readable-without-xslt/"&gt;shared an alternative solution to that&lt;/a&gt; back in September.&lt;/p&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=45823059"&gt;Hacker News&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/browsers"&gt;browsers&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/chrome"&gt;chrome&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/security"&gt;security&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/web-standards"&gt;web-standards&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/xml"&gt;xml&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/xslt"&gt;xslt&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/jake-archibald"&gt;jake-archibald&lt;/a&gt;&lt;/p&gt;



</summary><category term="browsers"/><category term="chrome"/><category term="security"/><category term="web-standards"/><category term="xml"/><category term="xslt"/><category term="jake-archibald"/></entry><entry><title>Making XML human-readable without XSLT</title><link href="https://simonwillison.net/2025/Sep/2/making-xml-human-readable-without-xslt/#atom-tag" rel="alternate"/><published>2025-09-02T19:32:57+00:00</published><updated>2025-09-02T19:32:57+00:00</updated><id>https://simonwillison.net/2025/Sep/2/making-xml-human-readable-without-xslt/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://jakearchibald.com/2025/making-xml-human-readable-without-xslt/"&gt;Making XML human-readable without XSLT&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;In response to the &lt;a href="https://simonwillison.net/2025/Aug/19/xslt/"&gt;recent discourse&lt;/a&gt; about XSLT support in browsers, Jake Archibald shares a new-to-me alternative trick for making an XML document readable in a browser: adding the following element near the top of the XML:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;lt;script
  xmlns="http://www.w3.org/1999/xhtml"
  src="script.js" defer="" /&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;That &lt;code&gt;script.js&lt;/code&gt; will then be executed by the browser, and can swap out the XML with HTML by creating new elements using the correct namespace:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;const htmlEl = document.createElementNS(
  'http://www.w3.org/1999/xhtml',
  'html',
);
document.documentElement.replaceWith(htmlEl);
// Now populate the new DOM
&lt;/code&gt;&lt;/pre&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/browsers"&gt;browsers&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/javascript"&gt;javascript&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/rss"&gt;rss&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/xml"&gt;xml&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/xslt"&gt;xslt&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/jake-archibald"&gt;jake-archibald&lt;/a&gt;&lt;/p&gt;



</summary><category term="browsers"/><category term="javascript"/><category term="rss"/><category term="xml"/><category term="xslt"/><category term="jake-archibald"/></entry><entry><title>My First Open Source AI Generated Library</title><link href="https://simonwillison.net/2025/Jun/21/my-first-open-source-ai-generated-library/#atom-tag" rel="alternate"/><published>2025-06-21T23:22:45+00:00</published><updated>2025-06-21T23:22:45+00:00</updated><id>https://simonwillison.net/2025/Jun/21/my-first-open-source-ai-generated-library/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://lucumr.pocoo.org/2025/6/21/my-first-ai-library/"&gt;My First Open Source AI Generated Library&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Armin Ronacher had Claude and Claude Code do almost &lt;em&gt;all of the work&lt;/em&gt; in building, testing, packaging and publishing a new Python library based on his design:&lt;/p&gt;
&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;It wrote ~1100 lines of code for the parser&lt;/li&gt;
&lt;li&gt;It wrote ~1000 lines of tests&lt;/li&gt;
&lt;li&gt;It configured the entire Python package, CI, PyPI publishing&lt;/li&gt;
&lt;li&gt;Generated a README, drafted a changelog, designed a logo, made it theme-aware&lt;/li&gt;
&lt;li&gt;Did multiple refactorings to make me happier&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;The project? &lt;a href="https://github.com/mitsuhiko/sloppy-xml-py"&gt;sloppy-xml-py&lt;/a&gt;, a lax XML parser (and violation of everything the XML Working Group hold sacred) which ironically is necessary because LLMs themselves frequently output "XML" that includes validation errors.&lt;/p&gt;
&lt;p&gt;Claude's SVG logo design is actually pretty decent, turns out it can draw &lt;a href="https://simonwillison.net/2025/May/22/code-with-claude-live-blog/#live-update-357"&gt;more than just bad pelicans&lt;/a&gt;!&lt;/p&gt;
&lt;center&gt;&lt;img alt="Hand drawn style, orange rough rectangle containing &amp;lt; { s } &amp;gt; - then the text Sloppy XML below in black" src="https://static.simonwillison.net/static/2025/sloppy-xml.jpg"&gt;&lt;/center&gt;

&lt;p&gt;I think experiments like this are a really valuable way to explore the capabilities of these models. Armin's conclusion:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;This was an experiment to see how far I could get with minimal manual effort, and to unstick myself from an annoying blocker. The result is good enough for my immediate use case and I also felt good enough to publish it to PyPI in case someone else has the same problem.&lt;/p&gt;
&lt;p&gt;Treat it as a curious side project which says more about what's possible today than what's necessarily advisable.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I'd like to present a slightly different conclusion here. The most interesting thing about this project is that &lt;strong&gt;the code is good&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;My criteria for good code these days are the following:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Solves a defined problem, well enough that I'm not tempted to solve it in a different way&lt;/li&gt;
&lt;li&gt;Uses minimal dependencies&lt;/li&gt;
&lt;li&gt;Clear and easy to understand&lt;/li&gt;
&lt;li&gt;Well tested, with tests that prove that the code does what it's meant to do&lt;/li&gt;
&lt;li&gt;Comprehensive documentation&lt;/li&gt;
&lt;li&gt;Packaged and published in a way that makes it convenient for me to use&lt;/li&gt;
&lt;li&gt;Designed to be easy to maintain and make changes in the future&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;code&gt;sloppy-xml-py&lt;/code&gt; fits all of those criteria. It's useful, well defined, &lt;a href="https://github.com/mitsuhiko/sloppy-xml-py/blob/main/sloppy_xml.py"&gt;the code is readable&lt;/a&gt; with just about the right level of comments, everything is tested, the documentation explains everything I need to know, and it's been shipped to PyPI.&lt;/p&gt;
&lt;p&gt;I'd be proud to have written this myself.&lt;/p&gt;
&lt;p&gt;This example is &lt;em&gt;not&lt;/em&gt; an argument for replacing programmers with LLMs. The code is good because Armin is an expert programmer who stayed in full control throughout the process. As I wrote the other day, this kind of work takes &lt;a href="https://simonwillison.net/2025/Jun/18/coding-agents/"&gt;a skilled individual with both deep domain understanding and deep understanding of the capabilities of the agent&lt;/a&gt;.&lt;/p&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://bsky.app/profile/mitsuhiko.at/post/3ls4ov5fk7c2l"&gt;@mitsuhiko.at&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/armin-ronacher"&gt;armin-ronacher&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/open-source"&gt;open-source&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pypi"&gt;pypi&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/xml"&gt;xml&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-programming"&gt;ai-assisted-programming&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude"&gt;claude&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude-code"&gt;claude-code&lt;/a&gt;&lt;/p&gt;



</summary><category term="armin-ronacher"/><category term="open-source"/><category term="pypi"/><category term="python"/><category term="xml"/><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="ai-assisted-programming"/><category term="claude"/><category term="claude-code"/></entry><entry><title>Cracking The Dave &amp; Buster’s Anomaly</title><link href="https://simonwillison.net/2025/Jun/5/cracking-the-dave-busters-anomaly/#atom-tag" rel="alternate"/><published>2025-06-05T10:23:31+00:00</published><updated>2025-06-05T10:23:31+00:00</updated><id>https://simonwillison.net/2025/Jun/5/cracking-the-dave-busters-anomaly/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://rambo.codes/posts/2025-05-12-cracking-the-dave-and-busters-anomaly"&gt;Cracking The Dave &amp;amp; Buster’s Anomaly&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Guilherme Rambo reports on a weird iOS Messages bug:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The bug is that, if you try to send an audio message using the Messages app to someone who’s also using the Messages app, and that message happens to include the name “Dave and Buster’s”, the message will never be received.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Guilherme captured the logs from an affected device and spotted an XHTMLParseFailure error.&lt;/p&gt;
&lt;p&gt;It turned out the iOS automatic transcription mechanism was recognizing the brand name and converting it to the official restaurant chain's preferred spelling "Dave &amp;amp; Buster’s"... which was then incorrectly escaped and triggered a parse error!&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/xhtml"&gt;xhtml&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/xml"&gt;xml&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ios"&gt;ios&lt;/a&gt;&lt;/p&gt;



</summary><category term="xhtml"/><category term="xml"/><category term="ios"/></entry><entry><title>Anthropic's Prompt Engineering Interactive Tutorial</title><link href="https://simonwillison.net/2024/Aug/30/anthropic-prompt-engineering-interactive-tutorial/#atom-tag" rel="alternate"/><published>2024-08-30T02:52:04+00:00</published><updated>2024-08-30T02:52:04+00:00</updated><id>https://simonwillison.net/2024/Aug/30/anthropic-prompt-engineering-interactive-tutorial/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/anthropics/courses/tree/master/prompt_engineering_interactive_tutorial"&gt;Anthropic&amp;#x27;s Prompt Engineering Interactive Tutorial&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Anthropic continue their trend of offering the best documentation of any of the leading LLM vendors. This tutorial is delivered as a set of Jupyter notebooks - I used it as an excuse to try &lt;a href="https://docs.astral.sh/uv/guides/tools/"&gt;uvx&lt;/a&gt; like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;git clone https://github.com/anthropics/courses
uvx --from jupyter-core jupyter notebook courses&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This installed a working Jupyter system, started the server and launched my browser within a few seconds.&lt;/p&gt;
&lt;p&gt;The first few chapters are pretty basic, demonstrating simple prompts run through the Anthropic API. I used &lt;code&gt;%pip install anthropic&lt;/code&gt; instead of &lt;code&gt;!pip install anthropic&lt;/code&gt; to make sure the package was installed in the correct virtual environment, &lt;a href="https://github.com/anthropics/courses/issues/30"&gt;then filed an issue and a PR&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;One new-to-me trick: in the first chapter the tutorial suggests running this:&lt;/p&gt;
&lt;pre&gt;&lt;span class="pl-v"&gt;API_KEY&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s"&gt;"your_api_key_here"&lt;/span&gt;
&lt;span class="pl-c1"&gt;%&lt;/span&gt;&lt;span class="pl-s1"&gt;store&lt;/span&gt; &lt;span class="pl-v"&gt;API_KEY&lt;/span&gt;&lt;/pre&gt;
&lt;p&gt;This stashes your Anthropic API key in the &lt;a href="https://ipython.readthedocs.io/en/stable/config/extensions/storemagic.html"&gt;IPython store&lt;/a&gt;. In subsequent notebooks you can restore the &lt;code&gt;API_KEY&lt;/code&gt; variable like this:&lt;/p&gt;
&lt;pre&gt;&lt;span class="pl-c1"&gt;%&lt;/span&gt;&lt;span class="pl-s1"&gt;store&lt;/span&gt; &lt;span class="pl-c1"&gt;-&lt;/span&gt;&lt;span class="pl-s1"&gt;r&lt;/span&gt; &lt;span class="pl-v"&gt;API_KEY&lt;/span&gt;&lt;/pre&gt;
&lt;p&gt;I poked around and on macOS those variables are stored in files of the same name in &lt;code&gt;~/.ipython/profile_default/db/autorestore&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://github.com/anthropics/courses/blob/master/prompt_engineering_interactive_tutorial/Anthropic%201P/04_Separating_Data_and_Instructions.ipynb"&gt;Chapter 4: Separating Data and Instructions&lt;/a&gt; included some interesting notes on Claude's support for content wrapped in XML-tag-style delimiters:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; While Claude can recognize and work with a wide range of separators and delimeters, we recommend that you &lt;strong&gt;use specifically XML tags as separators&lt;/strong&gt; for Claude, as Claude was trained specifically to recognize XML tags as a prompt organizing mechanism. Outside of function calling, &lt;strong&gt;there are no special sauce XML tags that Claude has been trained on that you should use to maximally boost your performance&lt;/strong&gt;. We have purposefully made Claude very malleable and customizable this way.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Plus this note on the importance of avoiding typos, with a nod back to the &lt;a href="https://simonwillison.net/2023/Apr/5/sycophancy-sandbagging/"&gt;problem of sandbagging&lt;/a&gt; where models match their intelligence and tone to that of their prompts:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;This is an important lesson about prompting: &lt;strong&gt;small details matter&lt;/strong&gt;! It's always worth it to &lt;strong&gt;scrub your prompts for typos and grammatical errors&lt;/strong&gt;. Claude is sensitive to patterns (in its early years, before finetuning, it was a raw text-prediction tool), and it's more likely to make mistakes when you make mistakes, smarter when you sound smart, sillier when you sound silly, and so on.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;a href="https://github.com/anthropics/courses/blob/master/prompt_engineering_interactive_tutorial/Anthropic%201P/05_Formatting_Output_and_Speaking_for_Claude.ipynb"&gt;Chapter 5: Formatting Output and Speaking for Claude&lt;/a&gt; includes notes on one of Claude's most interesting features: &lt;em&gt;prefill&lt;/em&gt;, where you can tell it how to start its response:&lt;/p&gt;
&lt;pre&gt;&lt;span class="pl-s1"&gt;client&lt;/span&gt;.&lt;span class="pl-s1"&gt;messages&lt;/span&gt;.&lt;span class="pl-en"&gt;create&lt;/span&gt;(
    &lt;span class="pl-s1"&gt;model&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-s"&gt;"claude-3-haiku-20240307"&lt;/span&gt;,
    &lt;span class="pl-s1"&gt;max_tokens&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-c1"&gt;100&lt;/span&gt;,
    &lt;span class="pl-s1"&gt;messages&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;[
        {&lt;span class="pl-s"&gt;"role"&lt;/span&gt;: &lt;span class="pl-s"&gt;"user"&lt;/span&gt;, &lt;span class="pl-s"&gt;"content"&lt;/span&gt;: &lt;span class="pl-s"&gt;"JSON facts about cats"&lt;/span&gt;},
        {&lt;span class="pl-s"&gt;"role"&lt;/span&gt;: &lt;span class="pl-s"&gt;"assistant"&lt;/span&gt;, &lt;span class="pl-s"&gt;"content"&lt;/span&gt;: &lt;span class="pl-s"&gt;"{"&lt;/span&gt;}
    ]
)&lt;/pre&gt;
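
&lt;p&gt;One practical detail with prefill: the text that comes back continues from the prefilled characters and does not repeat them, so the prefill has to be joined back on before parsing. A minimal sketch of that step, assuming the completion text has already been pulled out of the API response. The &lt;code&gt;parse_prefilled_json&lt;/code&gt; helper and the sample completion string are made up for illustration, and no API call is made here:&lt;/p&gt;

```python
import json

# The same text that was sent as the start of the assistant message.
PREFILL = "{"


def parse_prefilled_json(prefill, completion):
    """Join the prefill back onto the completion and parse the result as JSON."""
    return json.loads(prefill + completion)


# Hypothetical completion text, standing in for what a real response might contain.
completion = '"fact": "Cats sleep for much of the day"}'

facts = parse_prefilled_json(PREFILL, completion)
print(facts["fact"])
```

&lt;p&gt;In real use &lt;code&gt;json.loads&lt;/code&gt; can still raise an error if the response is cut off by &lt;code&gt;max_tokens&lt;/code&gt; before the closing brace arrives.&lt;/p&gt;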

&lt;p&gt;Things start to get really interesting in &lt;a href="https://github.com/anthropics/courses/blob/master/prompt_engineering_interactive_tutorial/Anthropic%201P/06_Precognition_Thinking_Step_by_Step.ipynb"&gt;Chapter 6: Precognition (Thinking Step by Step)&lt;/a&gt;, which suggests using XML tags to help the model consider different arguments prior to generating a final answer:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;Is this review sentiment positive or negative? First, write the best arguments for each side in &amp;lt;positive-argument&amp;gt; and &amp;lt;negative-argument&amp;gt; XML tags, then answer.&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The tags make it easy to strip out the "thinking out loud" portions of the response.&lt;/p&gt;
&lt;p&gt;It also warns about Claude's sensitivity to ordering. If you give Claude two options (e.g. for sentiment analysis):&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;In most situations (but not all, confusingly enough), &lt;strong&gt;Claude is more likely to choose the second of two options&lt;/strong&gt;, possibly because in its training data from the web, second options were more likely to be correct.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This effect can be reduced using the thinking out loud / brainstorming prompting techniques.&lt;/p&gt;
&lt;p&gt;A related tip is proposed in &lt;a href="https://github.com/anthropics/courses/blob/master/prompt_engineering_interactive_tutorial/Anthropic%201P/08_Avoiding_Hallucinations.ipynb"&gt;Chapter 8: Avoiding Hallucinations&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;How do we fix this? Well, a great way to reduce hallucinations on long documents is to &lt;strong&gt;make Claude gather evidence first.&lt;/strong&gt; &lt;/p&gt;
&lt;p&gt;In this case, we &lt;strong&gt;tell Claude to first extract relevant quotes, then base its answer on those quotes&lt;/strong&gt;. Telling Claude to do so here makes it correctly notice that the quote does not answer the question.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I really like the example prompt they provide here, for answering complex questions against a long document:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;&amp;lt;question&amp;gt;What was Matterport's subscriber base on the precise date of May 31, 2020?&amp;lt;/question&amp;gt;&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;&lt;code&gt;Please read the below document. Then, in &amp;lt;scratchpad&amp;gt; tags, pull the most relevant quote from the document and consider whether it answers the user's question or whether it lacks sufficient detail. Then write a brief numerical answer in &amp;lt;answer&amp;gt; tags.&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=41395921"&gt;Hacker News&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/xml"&gt;xml&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/jupyter"&gt;jupyter&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/prompt-engineering"&gt;prompt-engineering&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/anthropic"&gt;anthropic&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude"&gt;claude&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/uv"&gt;uv&lt;/a&gt;&lt;/p&gt;



</summary><category term="python"/><category term="xml"/><category term="ai"/><category term="jupyter"/><category term="prompt-engineering"/><category term="generative-ai"/><category term="llms"/><category term="anthropic"/><category term="claude"/><category term="uv"/></entry><entry><title>SIARD: Software Independent Archiving of Relational Databases</title><link href="https://simonwillison.net/2022/May/4/siard/#atom-tag" rel="alternate"/><published>2022-05-04T22:40:27+00:00</published><updated>2022-05-04T22:40:27+00:00</updated><id>https://simonwillison.net/2022/May/4/siard/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.bar.admin.ch/bar/en/home/archiving/tools/siard-suite.html"&gt;SIARD: Software Independent Archiving of Relational Databases&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
I hadn’t heard of this before but it looks really interesting: the Federal Archives of Switzerland developed a standard for archiving any relational database as a zip file full of XML, which “is used in over 50 countries around the globe”.

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/MAndrewWaugh/status/1521981238890082304"&gt;@MAndrewWaugh&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/archives"&gt;archives&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/databases"&gt;databases&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/xml"&gt;xml&lt;/a&gt;&lt;/p&gt;



</summary><category term="archives"/><category term="databases"/><category term="xml"/></entry><entry><title>Building an Evernote to SQLite exporter</title><link href="https://simonwillison.net/2020/Oct/16/building-evernote-sqlite-exporter/#atom-tag" rel="alternate"/><published>2020-10-16T20:12:07+00:00</published><updated>2020-10-16T20:12:07+00:00</updated><id>https://simonwillison.net/2020/Oct/16/building-evernote-sqlite-exporter/#atom-tag</id><summary type="html">
    &lt;p&gt;I've been using Evernote for over a decade, and I've long wanted to export my data from it so I can do interesting things with it.&lt;/p&gt;
&lt;p&gt;Ideally I'd use their API for that, so I could continually export new notes via a cron. Unfortunately the API is implemented &lt;a href="https://evernote.com/blog/so-api-together-evernote-and-thrift/"&gt;using Thrift&lt;/a&gt; (the only public API I've seen to use that) and my attempts to speak to it from Python have so far &lt;a href="https://twitter.com/simonw/status/1257743727264595968"&gt;failed to get anywhere&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Last weekend I decided to try using their "Export notes" feature instead, and its &lt;a href="https://evernote.com/blog/how-evernotes-xml-export-format-works/"&gt;ENEX XML format&lt;/a&gt;.&lt;/p&gt;
&lt;h4&gt;ENEX - the Evernote XML export format&lt;/h4&gt;
&lt;p&gt;Select a note - or multiple notes - in the Evernote macOS desktop app, use the &lt;code&gt;File -&amp;gt; Export Notes&lt;/code&gt; menu item and Evernote will create a &lt;code&gt;Notes.enex&lt;/code&gt; exported XML file.&lt;/p&gt;
&lt;p&gt;These files can get BIG, because any images attached to your notes will be embedded as base64 encoded binary data within the XML. My export is 2.8GB!&lt;/p&gt;
&lt;p&gt;When faced with unknown XML like this it's useful to get a high-level overview of what tags are present. In 2009 I wrote &lt;a href="https://github.com/simonw/xml-analyser"&gt;a Python script&lt;/a&gt; for this - as part of this project I updated it for Python 3 and pushed &lt;a href="https://pypi.org/project/xml-analyser/"&gt;a release to PyPI&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Here's a subset of the output when run against my &lt;code&gt;Notes.enex&lt;/code&gt; file (&lt;a href="https://gist.github.com/simonw/eb86bbdd953c2c1dba2007f43e689f2f"&gt;full output is here&lt;/a&gt;).&lt;/p&gt;
&lt;div class="highlight highlight-source-json"&gt;&lt;pre&gt;{
    &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;note&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;: {
        &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;attr_counts&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;: {},
        &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;child_counts&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;: {
            &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;content&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;: &lt;span class="pl-c1"&gt;2126&lt;/span&gt;,
            &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;created&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;: &lt;span class="pl-c1"&gt;2126&lt;/span&gt;,
            &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;note-attributes&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;: &lt;span class="pl-c1"&gt;2126&lt;/span&gt;,
            &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;resource&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;: &lt;span class="pl-c1"&gt;2605&lt;/span&gt;,
            &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;tag&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;: &lt;span class="pl-c1"&gt;45&lt;/span&gt;,
            &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;title&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;: &lt;span class="pl-c1"&gt;2126&lt;/span&gt;,
            &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;updated&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;: &lt;span class="pl-c1"&gt;2126&lt;/span&gt;
        },
        &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;count&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;: &lt;span class="pl-c1"&gt;2126&lt;/span&gt;,
        &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;parent_counts&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;: {
            &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;en-export&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;: &lt;span class="pl-c1"&gt;2126&lt;/span&gt;
        }
    },
    &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;note-attributes&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;: {
        &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;attr_counts&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;: {},
        &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;child_counts&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;: {
            &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;altitude&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;: &lt;span class="pl-c1"&gt;1466&lt;/span&gt;,
            &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;application-data&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;: &lt;span class="pl-c1"&gt;449&lt;/span&gt;,
            &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;author&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;: &lt;span class="pl-c1"&gt;998&lt;/span&gt;,
            &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;classifications&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;: &lt;span class="pl-c1"&gt;51&lt;/span&gt;,
            &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;content-class&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;: &lt;span class="pl-c1"&gt;387&lt;/span&gt;,
            &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;latitude&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;: &lt;span class="pl-c1"&gt;1480&lt;/span&gt;,
            &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;longitude&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;: &lt;span class="pl-c1"&gt;1480&lt;/span&gt;,
            &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;reminder-done-time&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;: &lt;span class="pl-c1"&gt;2&lt;/span&gt;,
            &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;reminder-order&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;: &lt;span class="pl-c1"&gt;2126&lt;/span&gt;,
            &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;reminder-time&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;: &lt;span class="pl-c1"&gt;1&lt;/span&gt;,
            &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;source&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;: &lt;span class="pl-c1"&gt;1664&lt;/span&gt;,
            &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;source-application&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;: &lt;span class="pl-c1"&gt;423&lt;/span&gt;,
            &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;source-url&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;: &lt;span class="pl-c1"&gt;85&lt;/span&gt;,
            &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;subject-date&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;: &lt;span class="pl-c1"&gt;10&lt;/span&gt;
        },
        &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;count&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;: &lt;span class="pl-c1"&gt;2126&lt;/span&gt;,
        &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;parent_counts&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;: {
            &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;note&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;: &lt;span class="pl-c1"&gt;2126&lt;/span&gt;
        }
    }
}&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This shows me that every note is represented as a &lt;code&gt;&amp;lt;note&amp;gt;&lt;/code&gt; element, and crucial metadata lives in children of a &lt;code&gt;&amp;lt;note-attributes&amp;gt;&lt;/code&gt; child element.&lt;/p&gt;
&lt;p&gt;(I thought I'd need to tweak the script for performance since it works by loading the entire file into memory, but my laptop has 32GB of RAM so it didn't even blink.)&lt;/p&gt;
&lt;h4&gt;ENEX limitations&lt;/h4&gt;
&lt;p&gt;I ran into some significant limitations while working with my ENEX export.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Notebooks are not represented in the file at all - you just get the notes. So I can't tell which recipes were in my "cooking" notebook. It's possible to work around this by manually exporting the notes from each notebook one at a time and storing them in separate export files. I didn't bother.&lt;/li&gt;
&lt;li&gt;Some of the data in the file - the note content itself for example - consists of further blocks of XML embedded in CDATA. This means you have to run a nested XML parser for every note you process.&lt;/li&gt;
&lt;li&gt;The notes XML &lt;code&gt;&amp;lt;en-note&amp;gt;&lt;/code&gt; format is mostly XHTML, but includes custom &lt;code&gt;&amp;lt;en-media hash="..."&amp;gt;&lt;/code&gt; tags where inline images should be displayed.&lt;/li&gt;
&lt;li&gt;Those inline images are CDATA encoded base64 strings. They don't include a hash, but I figured out that decoding the base64 string and then running it through MD5 generates the hash that is used in the corresponding &lt;code&gt;&amp;lt;en-media&amp;gt;&lt;/code&gt; tag.&lt;/li&gt;
&lt;li&gt;Notes in the export don't have any form of unique ID!&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;That last limitation - the lack of unique IDs - is a huge pain. It means that any export is by necessity a one-time operation - if you edit a note and attempt a re-export you will get a brand new record that can't be automatically used to update the previous one.&lt;/p&gt;
&lt;p&gt;This is particularly frustrating because I know that Evernote assigns a GUID to each note - they just don't include them in the export file.&lt;/p&gt;
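&lt;p&gt;The hash observation from the list above is easy to check with the Python standard library. This is a minimal sketch, not code from the exporter: &lt;code&gt;resource_hash()&lt;/code&gt; is a hypothetical helper and the sample payload is made up.&lt;/p&gt;

```python
import base64
import hashlib


def resource_hash(b64_text):
    """Return the MD5 hex digest of the decoded bytes of a base64 string."""
    # Base64 payloads are often wrapped across lines; b64decode's default
    # non-validating mode discards newlines and other non-alphabet characters.
    raw = base64.b64decode(b64_text)
    return hashlib.md5(raw).hexdigest()


# Hypothetical payload standing in for the base64 text of one embedded image
sample = base64.b64encode(b"not really an image").decode("ascii")
print(resource_hash(sample))
```

&lt;p&gt;If the observation holds, the printed digest is the value to look for in the &lt;code&gt;hash&lt;/code&gt; attribute of the matching &lt;code&gt;&amp;lt;en-media&amp;gt;&lt;/code&gt; tag.&lt;/p&gt;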
&lt;h3&gt;Writing an exporter&lt;/h3&gt;
&lt;p&gt;My &lt;a href="https://dogsheep.github.io/"&gt;Dogsheep&lt;/a&gt; family of tools aims to liberate personal data from all kinds of different sources and convert it into SQLite, which means I can explore, query and visualize it using &lt;a href="https://datasette.io/"&gt;Datasette&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://github.com/dogsheep/evernote-to-sqlite"&gt;dogsheep/evernote-to-sqlite&lt;/a&gt; is my latest entry in that series.&lt;/p&gt;
&lt;p&gt;I based it on &lt;a href="https://github.com/dogsheep/healthkit-to-sqlite"&gt;healthkit-to-sqlite&lt;/a&gt; because that, too, has to deal with a multi-gigabyte XML file. Both tools use the Python standard library's &lt;a href="https://docs.python.org/3/library/xml.etree.elementtree.html#pull-api-for-non-blocking-parsing"&gt;XMLPullParser&lt;/a&gt;, which makes XML elements available as a stream without needing to load the entire file into memory at once.&lt;/p&gt;
&lt;p&gt;The conversion code &lt;a href="https://github.com/dogsheep/evernote-to-sqlite/blob/0.2/evernote_to_sqlite/utils.py"&gt;is here&lt;/a&gt;. It ended up being relatively straightforward, using &lt;code&gt;ElementTree&lt;/code&gt; to extract data from the XML and &lt;a href="https://sqlite-utils.readthedocs.io/"&gt;sqlite-utils&lt;/a&gt; to write it to a database.&lt;/p&gt;
&lt;p&gt;It implements a progress bar by tracking the number of bytes that have been read from the underlying XML file.&lt;/p&gt;
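&lt;p&gt;Here's a rough sketch of that streaming pattern, with the byte count exposed through a callback that a progress bar could hook into. It is not the code from the repository: &lt;code&gt;stream_notes()&lt;/code&gt;, &lt;code&gt;on_progress&lt;/code&gt; and the tiny in-memory document are all assumptions made for illustration.&lt;/p&gt;

```python
import io
import xml.etree.ElementTree as ET


def stream_notes(fp, chunk_size=1024, on_progress=None):
    """Yield each note element from a binary file object, one chunk at a time."""
    parser = ET.XMLPullParser(["end"])
    bytes_read = 0
    while True:
        chunk = fp.read(chunk_size)
        if not chunk:
            break
        bytes_read += len(chunk)
        parser.feed(chunk)
        if on_progress is not None:
            on_progress(bytes_read)  # a progress bar would be updated here
        for _event, element in parser.read_events():
            if element.tag == "note":
                yield element
                # Free the note's children once the caller is done with it
                element.clear()
    parser.close()


# Build a tiny stand-in export in memory (the real input would be the .enex file)
root = ET.Element("en-export")
for title in ["First", "Second", "Third"]:
    note = ET.SubElement(root, "note")
    ET.SubElement(note, "title").text = title
fake_file = io.BytesIO(ET.tostring(root))

progress = []
titles = [n.findtext("title") for n in stream_notes(fake_file, 16, progress.append)]
print(titles, progress[-1])
```

&lt;p&gt;Each note has to be used before the generator resumes, because the element is cleared as soon as control returns to &lt;code&gt;stream_notes()&lt;/code&gt;.&lt;/p&gt;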
&lt;p&gt;One optimization: originally I created a single &lt;code&gt;resources&lt;/code&gt; table with the resource metadata and a BLOB column containing the binary image contents.&lt;/p&gt;
&lt;p&gt;This table was huge - over a GB - and had very poor performance for operations such as querying and faceting across unindexed columns - because any table scan had to work through MBs of binary data.&lt;/p&gt;
&lt;p&gt;My personal Evernote notebook has around 2,000 resources. 2,000 rows of metadata should usually be very fast to query.&lt;/p&gt;
&lt;p&gt;So... I split the binary data out into a two column &lt;code&gt;resources_data&lt;/code&gt; table - &lt;code&gt;md5&lt;/code&gt; primary key and a BLOB for &lt;code&gt;data&lt;/code&gt;. This massively sped up queries against that &lt;code&gt;resources&lt;/code&gt; table.&lt;/p&gt;
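&lt;p&gt;The shape of that split looks roughly like this. I'm using the standard library &lt;code&gt;sqlite3&lt;/code&gt; module here to keep the sketch self-contained; the &lt;code&gt;mime&lt;/code&gt; and &lt;code&gt;file_name&lt;/code&gt; columns and the &lt;code&gt;insert_resource()&lt;/code&gt; helper are assumptions, not the tool's actual schema.&lt;/p&gt;

```python
import hashlib
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript(
    """
    CREATE TABLE resources (
        md5 TEXT PRIMARY KEY,
        mime TEXT,
        file_name TEXT
    );
    CREATE TABLE resources_data (
        md5 TEXT PRIMARY KEY,
        data BLOB
    );
    """
)


def insert_resource(db, data, mime, file_name):
    """Store metadata and binary content in separate tables, keyed by MD5."""
    md5 = hashlib.md5(data).hexdigest()
    db.execute(
        "INSERT OR REPLACE INTO resources VALUES (?, ?, ?)", (md5, mime, file_name)
    )
    db.execute("INSERT OR REPLACE INTO resources_data VALUES (?, ?)", (md5, data))
    return md5


key = insert_resource(db, b"\x89PNG fake bytes", "image/png", "scan.png")

# Metadata queries scan only the small resources table
by_mime = db.execute("SELECT mime, count(*) FROM resources GROUP BY mime").fetchall()
print(by_mime)

# The binary content is fetched by primary key only when it is needed
blob = db.execute("SELECT data FROM resources_data WHERE md5 = ?", (key,)).fetchone()[0]
```

&lt;p&gt;A scan across &lt;code&gt;resources&lt;/code&gt; now reads a few thousand short rows, and the large BLOB values are only read when a row in &lt;code&gt;resources_data&lt;/code&gt; is looked up by its key.&lt;/p&gt;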
&lt;p&gt;And it works! Running the following command produces a SQLite database file containing all of my notes, note metadata and embedded images:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;evernote-to-sqlite enex evernote.db MyNotes.enex
&lt;/code&gt;&lt;/pre&gt;
&lt;h4&gt;Serving up formatted notes&lt;/h4&gt;
&lt;p&gt;Figuring out the best way to serve up the note content in Datasette is still a work in progress - but I've hacked together a delightfully terrible way of doing this using a one-off plugin.&lt;/p&gt;
&lt;p&gt;Datasette plugins are usually packaged and installed via PyPI, but there's &lt;a href="https://docs.datasette.io/en/stable/writing_plugins.html#writing-one-off-plugins"&gt;an alternative option&lt;/a&gt; for plugins that don't warrant distribution: create a &lt;code&gt;plugins/&lt;/code&gt; directory, drop in one or more Python files and start Datasette pointing &lt;code&gt;--plugins-dir&lt;/code&gt; at that directory.&lt;/p&gt;
&lt;p&gt;This is great for hacking on terrible ideas. Here's the &lt;code&gt;evernote.py&lt;/code&gt; plugin I wrote in full:&lt;/p&gt;
&lt;pre&gt;&lt;span class="pl-k"&gt;from&lt;/span&gt; &lt;span class="pl-s1"&gt;datasette&lt;/span&gt; &lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-s1"&gt;hookimpl&lt;/span&gt;
&lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-s1"&gt;jinja2&lt;/span&gt;

&lt;span class="pl-v"&gt;START&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s"&gt;"&amp;lt;en-note"&lt;/span&gt;
&lt;span class="pl-v"&gt;END&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s"&gt;"&amp;lt;/en-note&amp;gt;"&lt;/span&gt;
&lt;span class="pl-v"&gt;TEMPLATE&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s"&gt;"""&lt;/span&gt;
&lt;span class="pl-s"&gt;&amp;lt;div style="max-width: 500px; white-space: normal; overflow-wrap: break-word;"&amp;gt;{}&amp;lt;/div&amp;gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;"""&lt;/span&gt;.&lt;span class="pl-en"&gt;strip&lt;/span&gt;()

&lt;span class="pl-v"&gt;EN_MEDIA_SCRIPT&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s"&gt;"""&lt;/span&gt;
&lt;span class="pl-s"&gt;Array.from(document.querySelectorAll('en-media')).forEach(el =&amp;gt; {&lt;/span&gt;
&lt;span class="pl-s"&gt;    let hash = el.getAttribute('hash');&lt;/span&gt;
&lt;span class="pl-s"&gt;    let type = el.getAttribute('type');&lt;/span&gt;
&lt;span class="pl-s"&gt;    let path = `/evernote/resources_data/${hash}.json?_shape=array`;&lt;/span&gt;
&lt;span class="pl-s"&gt;    fetch(path).then(r =&amp;gt; r.json()).then(rows =&amp;gt; {&lt;/span&gt;
&lt;span class="pl-s"&gt;        let b64 = rows[0].data.encoded;&lt;/span&gt;
&lt;span class="pl-s"&gt;        let data = `data:${type};base64,${b64}`;&lt;/span&gt;
&lt;span class="pl-s"&gt;        el.innerHTML = `&amp;lt;img style="max-width: 300px" src="${data}"&amp;gt;`;&lt;/span&gt;
&lt;span class="pl-s"&gt;    });&lt;/span&gt;
&lt;span class="pl-s"&gt;});&lt;/span&gt;
&lt;span class="pl-s"&gt;"""&lt;/span&gt;

&lt;span class="pl-en"&gt;@&lt;span class="pl-s1"&gt;hookimpl&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-k"&gt;def&lt;/span&gt; &lt;span class="pl-en"&gt;render_cell&lt;/span&gt;(&lt;span class="pl-s1"&gt;value&lt;/span&gt;, &lt;span class="pl-s1"&gt;table&lt;/span&gt;):
    &lt;span class="pl-k"&gt;if&lt;/span&gt; &lt;span class="pl-c1"&gt;not&lt;/span&gt; &lt;span class="pl-s1"&gt;table&lt;/span&gt;:
        &lt;span class="pl-c"&gt;# Don't render content from arbitrary SQL queries, could be XSS hole&lt;/span&gt;
        &lt;span class="pl-k"&gt;return&lt;/span&gt;
    &lt;span class="pl-k"&gt;if&lt;/span&gt; &lt;span class="pl-c1"&gt;not&lt;/span&gt; &lt;span class="pl-s1"&gt;value&lt;/span&gt; &lt;span class="pl-c1"&gt;or&lt;/span&gt; &lt;span class="pl-c1"&gt;not&lt;/span&gt; &lt;span class="pl-en"&gt;isinstance&lt;/span&gt;(&lt;span class="pl-s1"&gt;value&lt;/span&gt;, &lt;span class="pl-s1"&gt;str&lt;/span&gt;):
        &lt;span class="pl-k"&gt;return&lt;/span&gt;
    &lt;span class="pl-s1"&gt;value&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;value&lt;/span&gt;.&lt;span class="pl-en"&gt;strip&lt;/span&gt;()
    &lt;span class="pl-k"&gt;if&lt;/span&gt; &lt;span class="pl-s1"&gt;value&lt;/span&gt;.&lt;span class="pl-en"&gt;startswith&lt;/span&gt;(&lt;span class="pl-v"&gt;START&lt;/span&gt;) &lt;span class="pl-c1"&gt;and&lt;/span&gt; &lt;span class="pl-s1"&gt;value&lt;/span&gt;.&lt;span class="pl-en"&gt;endswith&lt;/span&gt;(&lt;span class="pl-v"&gt;END&lt;/span&gt;):
        &lt;span class="pl-s1"&gt;trimmed&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;value&lt;/span&gt;[&lt;span class="pl-en"&gt;len&lt;/span&gt;(&lt;span class="pl-v"&gt;START&lt;/span&gt;) : &lt;span class="pl-c1"&gt;-&lt;/span&gt;&lt;span class="pl-en"&gt;len&lt;/span&gt;(&lt;span class="pl-v"&gt;END&lt;/span&gt;)]
        &lt;span class="pl-s1"&gt;trimmed&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;trimmed&lt;/span&gt;.&lt;span class="pl-en"&gt;split&lt;/span&gt;(&lt;span class="pl-s"&gt;"&amp;gt;"&lt;/span&gt;, &lt;span class="pl-c1"&gt;1&lt;/span&gt;)[&lt;span class="pl-c1"&gt;1&lt;/span&gt;]
        &lt;span class="pl-c"&gt;# Replace those horrible double newlines&lt;/span&gt;
        &lt;span class="pl-s1"&gt;trimmed&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;trimmed&lt;/span&gt;.&lt;span class="pl-en"&gt;replace&lt;/span&gt;(&lt;span class="pl-s"&gt;"&amp;lt;div&amp;gt;&amp;lt;br /&amp;gt;&amp;lt;/div&amp;gt;"&lt;/span&gt;, &lt;span class="pl-s"&gt;"&amp;lt;br&amp;gt;"&lt;/span&gt;)
        &lt;span class="pl-k"&gt;return&lt;/span&gt; &lt;span class="pl-s1"&gt;jinja2&lt;/span&gt;.&lt;span class="pl-v"&gt;Markup&lt;/span&gt;(&lt;span class="pl-v"&gt;TEMPLATE&lt;/span&gt;.&lt;span class="pl-en"&gt;format&lt;/span&gt;(&lt;span class="pl-s1"&gt;trimmed&lt;/span&gt;))

&lt;span class="pl-en"&gt;@&lt;span class="pl-s1"&gt;hookimpl&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-k"&gt;def&lt;/span&gt; &lt;span class="pl-en"&gt;extra_body_script&lt;/span&gt;():
    &lt;span class="pl-k"&gt;return&lt;/span&gt; &lt;span class="pl-v"&gt;EN_MEDIA_SCRIPT&lt;/span&gt;&lt;/pre&gt;
&lt;p&gt;This uses two &lt;a href="https://docs.datasette.io/en/stable/plugin_hooks.html"&gt;Datasette plugin hooks&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;render_cell()&lt;/code&gt; is called every time Datasette's interface renders the value of a column. Here I'm looking for the &lt;code&gt;&amp;lt;en-note&amp;gt;&lt;/code&gt; and &lt;code&gt;&amp;lt;/en-note&amp;gt;&lt;/code&gt; tags and, if they are present, stripping them off and marking their inner content as safe to display using &lt;code&gt;jinja2.Markup()&lt;/code&gt; (without this they would be automatically HTML escaped).&lt;/p&gt;
&lt;p&gt;&lt;code&gt;extra_body_script()&lt;/code&gt; can be used to inject additional JavaScript at the bottom of the page. I'm injecting some particularly devious JavaScript which scans the page for Evernote's &lt;code&gt;&amp;lt;en-media&amp;gt;&lt;/code&gt; tags and, if it finds any, runs a &lt;code&gt;fetch()&lt;/code&gt; to get the base64-encoded data from Datasette and then injects that into the page as a 300px wide image element using a &lt;code&gt;data:&lt;/code&gt; URI.&lt;/p&gt;
&lt;p&gt;Like I said, this is a pretty terrifying hack! But it works just fine, and my notes are now visible inside my personal Datasette instance:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2020/evernote-datasette.png" alt="One of my notes displayed in Datasette, with an embedded image" style="max-width:100%;" /&gt;&lt;/p&gt;
&lt;h4&gt;Bonus feature: search by OCR&lt;/h4&gt;
&lt;p&gt;An interesting feature of Evernote is that it runs cloud-based OCR against images in your notes, enabling you to search across the text contained within them.&lt;/p&gt;
&lt;p&gt;It turns out the output of this OCR process is included in the export XML. It looks like this:&lt;/p&gt;
&lt;div class="highlight highlight-text-xml"&gt;&lt;pre&gt;&amp;lt;&lt;span class="pl-ent"&gt;recoIndex&lt;/span&gt;
  &lt;span class="pl-e"&gt;docType&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;unknown&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;
  &lt;span class="pl-e"&gt;objType&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;image&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;objID&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;76dd28b07797cc9f3f129c4871c5293c&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; 
  &lt;span class="pl-e"&gt;engineVersion&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;7.0.24.1&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; 
  &lt;span class="pl-e"&gt;recoType&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;service&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; 
  &lt;span class="pl-e"&gt;lang&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;en&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; 
  &lt;span class="pl-e"&gt;objWidth&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;670&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; 
  &lt;span class="pl-e"&gt;objHeight&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;128&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;&amp;gt;
  &amp;lt;&lt;span class="pl-ent"&gt;item&lt;/span&gt; &lt;span class="pl-e"&gt;x&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;26&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;y&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;52&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;w&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;81&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;h&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;29&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;&amp;gt;
    &amp;lt;&lt;span class="pl-ent"&gt;t&lt;/span&gt; &lt;span class="pl-e"&gt;w&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;76&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;&amp;gt;This&amp;lt;/&lt;span class="pl-ent"&gt;t&lt;/span&gt;&amp;gt;
  &amp;lt;/&lt;span class="pl-ent"&gt;item&lt;/span&gt;&amp;gt;
  &amp;lt;&lt;span class="pl-ent"&gt;item&lt;/span&gt; &lt;span class="pl-e"&gt;x&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;134&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;y&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;52&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;w&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;35&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;h&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;29&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;&amp;gt;
    &amp;lt;&lt;span class="pl-ent"&gt;t&lt;/span&gt; &lt;span class="pl-e"&gt;w&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;79&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;&amp;gt;is&amp;lt;/&lt;span class="pl-ent"&gt;t&lt;/span&gt;&amp;gt;
  &amp;lt;/&lt;span class="pl-ent"&gt;item&lt;/span&gt;&amp;gt;
  &amp;lt;&lt;span class="pl-ent"&gt;item&lt;/span&gt; &lt;span class="pl-e"&gt;x&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;196&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;y&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;60&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;w&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;37&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;h&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;21&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;&amp;gt;
    &amp;lt;&lt;span class="pl-ent"&gt;t&lt;/span&gt; &lt;span class="pl-e"&gt;w&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;73&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;&amp;gt;so&amp;lt;/&lt;span class="pl-ent"&gt;t&lt;/span&gt;&amp;gt;
  &amp;lt;/&lt;span class="pl-ent"&gt;item&lt;/span&gt;&amp;gt;
  &amp;lt;&lt;span class="pl-ent"&gt;item&lt;/span&gt; &lt;span class="pl-e"&gt;x&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;300&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;y&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;60&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;w&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;57&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;h&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;21&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;&amp;gt;
    &amp;lt;&lt;span class="pl-ent"&gt;t&lt;/span&gt; &lt;span class="pl-e"&gt;w&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;71&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;&amp;gt;can&amp;lt;/&lt;span class="pl-ent"&gt;t&lt;/span&gt;&amp;gt;
  &amp;lt;/&lt;span class="pl-ent"&gt;item&lt;/span&gt;&amp;gt;
  &amp;lt;&lt;span class="pl-ent"&gt;item&lt;/span&gt; &lt;span class="pl-e"&gt;x&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;382&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;y&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;54&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;w&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;79&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;h&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;27&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;&amp;gt;
    &amp;lt;&lt;span class="pl-ent"&gt;t&lt;/span&gt; &lt;span class="pl-e"&gt;w&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;77&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;&amp;gt;test&amp;lt;/&lt;span class="pl-ent"&gt;t&lt;/span&gt;&amp;gt;
  &amp;lt;/&lt;span class="pl-ent"&gt;item&lt;/span&gt;&amp;gt;
  &amp;lt;&lt;span class="pl-ent"&gt;item&lt;/span&gt; &lt;span class="pl-e"&gt;x&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;486&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;y&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;52&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;w&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;59&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;h&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;29&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;&amp;gt;
    &amp;lt;&lt;span class="pl-ent"&gt;t&lt;/span&gt; &lt;span class="pl-e"&gt;w&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;82&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;&amp;gt;the&amp;lt;/&lt;span class="pl-ent"&gt;t&lt;/span&gt;&amp;gt;
  &amp;lt;/&lt;span class="pl-ent"&gt;item&lt;/span&gt;&amp;gt;
  &amp;lt;&lt;span class="pl-ent"&gt;item&lt;/span&gt; &lt;span class="pl-e"&gt;x&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;570&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;y&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;54&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;w&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;59&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;h&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;25&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;&amp;gt;
    &amp;lt;&lt;span class="pl-ent"&gt;t&lt;/span&gt; &lt;span class="pl-e"&gt;w&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;74&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;&amp;gt;OCR&amp;lt;/&lt;span class="pl-ent"&gt;t&lt;/span&gt;&amp;gt;
    &amp;lt;&lt;span class="pl-ent"&gt;t&lt;/span&gt; &lt;span class="pl-e"&gt;w&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;33&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;&amp;gt;DeR&amp;lt;/&lt;span class="pl-ent"&gt;t&lt;/span&gt;&amp;gt;
    &amp;lt;&lt;span class="pl-ent"&gt;t&lt;/span&gt; &lt;span class="pl-e"&gt;w&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;15&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;&amp;gt;OCR!&amp;lt;/&lt;span class="pl-ent"&gt;t&lt;/span&gt;&amp;gt;
    &amp;lt;&lt;span class="pl-ent"&gt;t&lt;/span&gt; &lt;span class="pl-e"&gt;w&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;14&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;&amp;gt;OCR]&amp;lt;/&lt;span class="pl-ent"&gt;t&lt;/span&gt;&amp;gt;
  &amp;lt;/&lt;span class="pl-ent"&gt;item&lt;/span&gt;&amp;gt;
&amp;lt;/&lt;span class="pl-ent"&gt;recoIndex&lt;/span&gt;&amp;gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;As far as I can tell an &lt;code&gt;&amp;lt;item&amp;gt;&lt;/code&gt; gets multiple &lt;code&gt;&amp;lt;t&amp;gt;&lt;/code&gt; elements only if the OCR wasn't 100% sure.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;evernote-to-sqlite&lt;/code&gt; creates a searchable &lt;code&gt;ocr&lt;/code&gt; text field using values from that XML, so I can search my images in Datasette.&lt;/p&gt;
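&lt;p&gt;One way to flatten that XML into searchable text is to keep a single candidate per &lt;code&gt;&amp;lt;item&amp;gt;&lt;/code&gt;. The sketch below assumes the &lt;code&gt;w&lt;/code&gt; attribute on &lt;code&gt;&amp;lt;t&amp;gt;&lt;/code&gt; is a confidence weight, which is a guess based on the sample rather than anything documented, and &lt;code&gt;ocr_text()&lt;/code&gt; is a hypothetical helper that may not match what the tool itself does.&lt;/p&gt;

```python
import xml.etree.ElementTree as ET


def ocr_text(reco_index):
    """Join the highest-weighted candidate from each item of a recoIndex element."""
    words = []
    for item in reco_index.iter("item"):
        candidates = item.findall("t")
        if candidates:
            # ASSUMPTION: w is a confidence weight, so the largest value wins
            best = max(candidates, key=lambda t: int(t.get("w", "0")))
            words.append(best.text or "")
    return " ".join(words)


# Rebuild two items resembling the sample above using ElementTree calls
reco = ET.Element("recoIndex")
for candidates in [[("76", "This")], [("74", "OCR"), ("33", "DeR"), ("15", "OCR!")]]:
    item = ET.SubElement(reco, "item")
    for weight, text in candidates:
        ET.SubElement(item, "t", w=weight).text = text

print(ocr_text(reco))
```

&lt;p&gt;Keeping every candidate instead would make the search index more forgiving of OCR mistakes, at the cost of noisier text.&lt;/p&gt;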
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sqlite"&gt;sqlite&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/xml"&gt;xml&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/dogsheep"&gt;dogsheep&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sqlite-utils"&gt;sqlite-utils&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="projects"/><category term="sqlite"/><category term="xml"/><category term="datasette"/><category term="dogsheep"/><category term="sqlite-utils"/></entry><entry><title>xml-analyser</title><link href="https://simonwillison.net/2020/Oct/12/xml-analyser/#atom-tag" rel="alternate"/><published>2020-10-12T00:41:22+00:00</published><updated>2020-10-12T00:41:22+00:00</updated><id>https://simonwillison.net/2020/Oct/12/xml-analyser/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/simonw/xml-analyser"&gt;xml-analyser&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
In building evernote-to-sqlite I dusted off an ancient (2009) project I built that scans through an XML file and provides a summary of what elements are present in the document and how they relate to each other. I’ve now packaged it up as a CLI app and published it on PyPI.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/cli"&gt;cli&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/xml"&gt;xml&lt;/a&gt;&lt;/p&gt;



</summary><category term="cli"/><category term="projects"/><category term="xml"/></entry><entry><title>Using memory-profiler to debug excessive memory usage in healthkit-to-sqlite</title><link href="https://simonwillison.net/2019/Jul/24/memory-profiler/#atom-tag" rel="alternate"/><published>2019-07-24T08:25:07+00:00</published><updated>2019-07-24T08:25:07+00:00</updated><id>https://simonwillison.net/2019/Jul/24/memory-profiler/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/dogsheep/healthkit-to-sqlite/issues/7"&gt;Using memory-profiler to debug excessive memory usage in healthkit-to-sqlite&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
This morning I figured out how to use the memory-profiler module (and mprof command line tool) to debug memory usage of Python processes. I added the details, including screenshots, to this GitHub issue. It helped me knock down RAM usage for my healthkit-to-sqlite from 2.5GB to just 80MB by making smarter usage of the ElementTree pull parser.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/elementtree"&gt;elementtree&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/memory"&gt;memory&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/profiling"&gt;profiling&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/xml"&gt;xml&lt;/a&gt;&lt;/p&gt;



</summary><category term="elementtree"/><category term="memory"/><category term="profiling"/><category term="python"/><category term="xml"/></entry><entry><title>Convert Locations.kml (pulled from an iPhone backup) to SQLite</title><link href="https://simonwillison.net/2019/Jun/14/locations-kml/#atom-tag" rel="alternate"/><published>2019-06-14T00:45:01+00:00</published><updated>2019-06-14T00:45:01+00:00</updated><id>https://simonwillison.net/2019/Jun/14/locations-kml/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://gist.github.com/simonw/2f09eb4b9fa41a3721f3e87f18813221"&gt;Convert Locations.kml (pulled from an iPhone backup) to SQLite&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
I’ve been playing around with data from my iPhone using the iPhone Backup Extractor app and one of the things it exports for you is a Locations.kml file full of location history data. I wrote a tiny script using Python’s ElementTree XMLPullParser to efficiently iterate through the Placemarks and yield them as dictionaries, which I then batch-inserted into sqlite-utils to create a SQLite database.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/kml"&gt;kml&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sqlite"&gt;sqlite&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/xml"&gt;xml&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sqlite-utils"&gt;sqlite-utils&lt;/a&gt;&lt;/p&gt;



</summary><category term="kml"/><category term="projects"/><category term="sqlite"/><category term="xml"/><category term="sqlite-utils"/></entry><entry><title>Exploring the UK Register of Members Interests with SQL and Datasette</title><link href="https://simonwillison.net/2018/Apr/25/register-members-interests/#atom-tag" rel="alternate"/><published>2018-04-25T15:49:39+00:00</published><updated>2018-04-25T15:49:39+00:00</updated><id>https://simonwillison.net/2018/Apr/25/register-members-interests/#atom-tag</id><summary type="html">
    &lt;p&gt;Ever wondered which UK Members of Parliament &lt;a href="https://register-of-members-interests.datasettes.com/regmem?sql=select+people_name%2C+person_id%2C+count%28*%29+as+n%2C+group_concat%28item%2C+%22+%3A%3A+%22%29+from+%28select+distinct+item%2C+people.name+as+people_name%2C+person_id%0D%0Afrom+items+join+people+on+items.person_id+%3D+people.id%0D%0Awhere+items.rowid+in+%28select+rowid+from+%5Bitems_fts%5D+where+%5Bitems_fts%5D+match+%3Asearch%29+order+by+hash%29%0D%0Agroup+by+people_name%2C+person_id%0D%0Aorder+by+n+desc%3B&amp;amp;search=helicopter"&gt;get gifted the most helicopter rides&lt;/a&gt;? How about &lt;a href="https://register-of-members-interests.datasettes.com/regmem?sql=select+distinct+item%2C+people.name%2C+person_id%0D%0Afrom+items+join+people+on+items.person_id+%3D+people.id%0D%0Awhere+items.rowid+in+%28select+rowid+from+%5Bitems_fts%5D+where+%5Bitems_fts%5D+match+%3Asearch%29+order+by+hash&amp;amp;search=gift+sultan+brunei"&gt;which MPs have been given Christmas hampers&lt;/a&gt; by the Sultan of Brunei? (David Cameron, William Hague and Michael Howard apparently). Here’s how to dig through the Register of Members Interests using SQL and Datasette.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://register-of-members-interests.datasettes.com/regmem?sql=select+distinct+item%2C+people.name%2C+person_id%0D%0Afrom+items+join+people+on+items.person_id+%3D+people.id%0D%0Awhere+items.rowid+in+%28select+rowid+from+%5Bitems_fts%5D+where+%5Bitems_fts%5D+match+%3Asearch%29+order+by+hash&amp;amp;search=gift+sultan+brunei" class="a-img"&gt;&lt;img style="max-width: 100%" src="https://static.simonwillison.net/static/2018/mp-gifts-sultan.png" alt="Gifts from the Sultan" /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href="https://www.mysociety.org/"&gt;mySociety&lt;/a&gt; have been building incredible civic participation applications like &lt;a href="https://www.theyworkforyou.com/"&gt;TheyWorkForYou&lt;/a&gt; and &lt;a href="https://www.fixmystreet.com/"&gt;FixMyStreet&lt;/a&gt; for nearly 15 years now, and have accumulated all kinds of interesting data along the way.&lt;/p&gt;
&lt;p&gt;They recently launched their own data portal at &lt;a href="https://data.mysociety.org/"&gt;data.mysociety.org&lt;/a&gt; listing all of the information they have available. While exploring it I stumbled across their copy of the &lt;a href="https://data.mysociety.org/datasets/members-interest/"&gt;UK Register of Members Interests&lt;/a&gt;. Every UK Member of Parliament has to &lt;a href="https://www.parliament.uk/mps-lords-and-offices/standards-and-financial-interests/parliamentary-commissioner-for-standards/registers-of-interests/register-of-members-financial-interests/"&gt;register their conflicts of interest and income sources&lt;/a&gt;, and mySociety have an ongoing project to parse that data into a more useful format.&lt;/p&gt;
&lt;p&gt;It won’t surprise you to hear that I couldn’t resist turning their XML files into a SQLite database.&lt;/p&gt;
&lt;p&gt;The result is &lt;a href="https://register-of-members-interests.datasettes.com"&gt;register-of-members-interests.datasettes.com&lt;/a&gt; - a &lt;a href="https://github.com/simonw/datasette"&gt;Datasette&lt;/a&gt; instance running against a SQLite database containing over 1.3 million line-items registered by 1,419 MPs over the course of 18 years.&lt;/p&gt;
&lt;h3&gt;&lt;a id="Some_fun_queries_14"&gt;&lt;/a&gt;Some fun queries&lt;/h3&gt;
&lt;p&gt;A few of my favourites so far:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Which MPs have taken &lt;a href="https://register-of-members-interests.datasettes.com/regmem?sql=select+people_name%2C+person_id%2C+count%28*%29+as+n%2C+group_concat%28item%2C+%22+%3A%3A+%22%29+from+%28select+distinct+item%2C+people.name+as+people_name%2C+person_id%0D%0Afrom+items+join+people+on+items.person_id+%3D+people.id%0D%0Awhere+items.rowid+in+%28select+rowid+from+%5Bitems_fts%5D+where+%5Bitems_fts%5D+match+%3Asearch%29+order+by+hash%29%0D%0Agroup+by+people_name%2C+person_id%0D%0Aorder+by+n+desc%3B&amp;amp;search=helicopter"&gt;the most donated helicopter rides&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Which MPs have accepted &lt;a href="https://register-of-members-interests.datasettes.com/regmem?sql=select+distinct+item%2C+people.name%2C+person_id%0D%0Afrom+items+join+people+on+items.person_id+%3D+people.id%0D%0Awhere+items.rowid+in+%28select+rowid+from+%5Bitems_fts%5D+where+%5Bitems_fts%5D+match+%3Asearch%29+order+by+hash&amp;amp;search=gift+sultan+brunei"&gt;gifts from the Sultan of Brunei&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;A better gifts query with more robust de-duping: here's gifts of &lt;a href="https://register-of-members-interests.datasettes.com/regmem?sql=select+item%2C+people.name%2C+max%28date%29+as+max_date%0D%0Afrom+items+join+people+on+items.person_id+%3D+people.id%0D%0Awhere+items.rowid+in+%28select+rowid+from+%5Bitems_fts%5D+where+%5Bitems_fts%5D+match+%3Asearch%29%0D%0Agroup+by+item%2C+people.name%2C+person_id%0D%0Aorder+by+max_date+desc&amp;amp;search=hamper"&gt;hampers&lt;/a&gt;, &lt;a href="https://register-of-members-interests.datasettes.com/regmem?sql=select+item%2C+people.name%2C+max%28date%29+as+max_date%0D%0Afrom+items+join+people+on+items.person_id+%3D+people.id%0D%0Awhere+items.rowid+in+%28select+rowid+from+%5Bitems_fts%5D+where+%5Bitems_fts%5D+match+%3Asearch%29%0D%0Agroup+by+item%2C+people.name%2C+person_id%0D%0Aorder+by+max_date+desc&amp;amp;search=gift+watch"&gt;watches&lt;/a&gt; and &lt;a href="https://register-of-members-interests.datasettes.com/regmem?sql=select+item%2C+people.name%2C+max%28date%29+as+max_date%0D%0Afrom+items+join+people+on+items.person_id+%3D+people.id%0D%0Awhere+items.rowid+in+%28select+rowid+from+%5Bitems_fts%5D+where+%5Bitems_fts%5D+match+%3Asearch%29%0D%0Agroup+by+item%2C+people.name%2C+person_id%0D%0Aorder+by+max_date+desc&amp;amp;search=glastonbury"&gt;Glastonbury festival tickets&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Which MPs own significant shares in &lt;a href="https://register-of-members-interests.datasettes.com/regmem/mps_ranked_by_matching_shareholdings?search=shell"&gt;Shell&lt;/a&gt;, &lt;a href="https://register-of-members-interests.datasettes.com/regmem/mps_ranked_by_matching_shareholdings?search=apple"&gt;Apple&lt;/a&gt;, or &lt;a href="https://register-of-members-interests.datasettes.com/regmem/mps_ranked_by_matching_shareholdings?search=bp"&gt;BP&lt;/a&gt;?&lt;/li&gt;
&lt;li&gt;Who has reported &lt;a href="https://register-of-members-interests.datasettes.com/regmem/mps_ranked_by_matching_overseas_line_items?search=saudi+arabia"&gt;the most overseas trips to Saudi Arabia&lt;/a&gt;? The top hit there was Daniel Kawczynski, and it turns out his &lt;a href="https://en.wikipedia.org/wiki/Daniel_Kawczynski"&gt;Wikipedia page&lt;/a&gt; has a section dedicated to his relationship with the kingdom.&lt;/li&gt;
&lt;li&gt;How much do MPs get paid for &lt;a href="https://register-of-members-interests.datasettes.com/regmem/items?_search=Have+I+Got+News+for+You&amp;amp;_sort_desc=date"&gt;appearances on Have I Got News For You&lt;/a&gt;? The going rate seems to be £1,500 but you can find &lt;a href="https://register-of-members-interests.datasettes.com/regmem?sql=select+distinct+item%2C+person_id%2C+people.name+from+items+join+people+on+people.id+%3D+items.person_id%0D%0Awhere+%22item%22+not+like+%22%251%2C500%25%22+and+%22item%22+not+like+%22%251500%22+and+%22item%22+like+%22%25%C2%A3%25%22+and+items.rowid+in+%28select+rowid+from+%5Bitems_fts%5D+where+%5Bitems_fts%5D+match+%3Asearch%29+order+by+date+desc+limit+101&amp;amp;search=Have+I+Got+News+For+You"&gt;some interesting discrepancies&lt;/a&gt; if you exclude that value from the results.&lt;/li&gt;
&lt;li&gt;Which MPs are responsible for &lt;a href="https://register-of-members-interests.datasettes.com/regmem?sql=select+%22person_id%22%2C+people.name%2C+count(*)+as+%22count%22%0D%0Afrom+items+join+people+on+items.person_id+%3D+people.id%0D%0Agroup+by+%22person_id%22+order+by+%22count%22+desc+limit+100"&gt;the most total line items reported&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;&lt;a id="Understanding_the_data_model_23"&gt;&lt;/a&gt;Understanding the data model&lt;/h3&gt;
&lt;p&gt;Most of the action takes place in the &lt;a href="https://register-of-members-interests.datasettes.com/regmem/items"&gt;items&lt;/a&gt; table, where each item is a line-item from an MP’s filing. You can search that table by keyword (see helicopter example above) or apply filters to it using the standard Datasette interface. You can also execute your own SQL directly against the database.&lt;/p&gt;
&lt;p&gt;Each item is filed against a &lt;a href="https://register-of-members-interests.datasettes.com/regmem/categories?_sort=type"&gt;category&lt;/a&gt;. There appears to have been quite a bit of churn in the way that the categories are defined over the years, plus the data is pretty untidy - there are no fewer than 10 ways of spelling “Remunerated employment, office, profession etc.” for example!&lt;/p&gt;
&lt;p&gt;&lt;a href="https://register-of-members-interests.datasettes.com/regmem/categories" class="a-img"&gt;&lt;img style="max-width: 100%" src="https://static.simonwillison.net/static/2018/mp-categories.png" alt="Categories" /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;There are also a LOT of duplicate items in the set - it appears that MPs frequently list the same item (a rental property for example) every time they fill out the register. SQL DISTINCT clauses can help filter through these, as seen in some of the above examples.&lt;/p&gt;
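&lt;p&gt;As a rough illustration of that de-duplication, here is a minimal sketch using Python’s &lt;code&gt;sqlite3&lt;/code&gt; module against an in-memory table. The table layout and the sample rows are made up for the example and are much simpler than the real &lt;code&gt;items&lt;/code&gt; table:&lt;/p&gt;

```python
import sqlite3

# Hypothetical, simplified stand-in for the real items table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (person_id TEXT, item TEXT, date TEXT)")
conn.executemany(
    "INSERT INTO items VALUES (?, ?, ?)",
    [
        # The same rental property registered in two different years
        ("10001", "Rental property in London", "2016-01-11"),
        ("10001", "Rental property in London", "2017-01-09"),
        ("10001", "Fee for a speech", "2017-01-09"),
    ],
)

# DISTINCT collapses the repeated registration into a single row
distinct_items = [
    row[0]
    for row in conn.execute(
        "SELECT DISTINCT item FROM items WHERE person_id = ? ORDER BY item",
        ("10001",),
    )
]
print(distinct_items)
```

&lt;p&gt;Note that &lt;code&gt;DISTINCT&lt;/code&gt; only helps when the repeated text is identical; items that were reworded between filings will still show up separately.&lt;/p&gt;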
&lt;p&gt;The data also has the concepts of both &lt;a href="https://register-of-members-interests.datasettes.com/regmem/members"&gt;members&lt;/a&gt; and &lt;a href="https://register-of-members-interests.datasettes.com/regmem/people"&gt;people&lt;/a&gt;. As far as I can tell people are distinct, but members may contain duplicates - presumably to represent MPs who have served more than one term in office. It looks like the member field &lt;a href="https://register-of-members-interests.datasettes.com/regmem/items?_sort_desc=date&amp;amp;member_id__notblank=1"&gt;stopped being populated in March 2015&lt;/a&gt; so analysis is best performed against the people table.&lt;/p&gt;
&lt;p&gt;One concept I have introduced myself is the &lt;code&gt;record_id&lt;/code&gt;. In the XML documents the items are often grouped together into a related collection, like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;lt;regmem personid=&amp;quot;uk.org.publicwhip/person/10001&amp;quot;
    memberid=&amp;quot;uk.org.publicwhip/member/40289&amp;quot; membername=&amp;quot;Diane Abbott&amp;quot; date=&amp;quot;2014-07-14&amp;quot;&amp;gt;
    &amp;lt;category type=&amp;quot;2&amp;quot; name=&amp;quot;Remunerated employment, office, profession etc&amp;quot;&amp;gt;
        &amp;lt;item&amp;gt;Payments from MRL Public Sector Consultants, Pepple House, 8 Broad Street, Great Cambourne, Cambridge CB23 6HJ:&amp;lt;/item&amp;gt;
        &amp;lt;item&amp;gt;26 November 2013, I received a fee of £1,000 for speaking at the 1st African Legislative Summit, National Assembly, Abuja, Nigeria.  Hours: 8 hrs. The cost of my flights, transfers and hotel accommodation in Abuja were also met; estimated value £5,000. &amp;lt;em&amp;gt;(Registered 3 December 2013)&amp;lt;/em&amp;gt;&amp;lt;/item&amp;gt;
        &amp;lt;item&amp;gt;23 July 2013, I received a fee of £5,000 for appearing as a contestant on ITV&amp;amp;#8217;s &amp;amp;#8216;The Chase Celebrity &amp;amp;#8211; Series 3&amp;amp;#8217; television programme.  Address of payer:  ITV Studios Ltd, London Television Centre, Upper Ground, London SE1 9Lt.  Hours: 12 hrs.   &amp;lt;em&amp;gt;(Registered 23 July 2013)&amp;lt;/em&amp;gt;&amp;lt;/item&amp;gt;
    &amp;lt;/category&amp;gt;
&amp;lt;/regmem&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;While these items are presented as separate line items, their grouping carries meaning: the first line item here acts as a kind of heading to help provide context to the other items.&lt;/p&gt;
&lt;p&gt;To model this in the simplest way possible, I’ve attempted to preserve the order of these groups using a pair of additional columns: the &lt;code&gt;record_id&lt;/code&gt; and the &lt;code&gt;sort_order&lt;/code&gt;. I construct the &lt;code&gt;record_id&lt;/code&gt; using a collection of other fields - the idea is for it to be sortable, and for each line-item in the same grouping to have the same &lt;code&gt;record_id&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;record_id = &amp;quot;{date}-{category_id}-{person_id}-{record}&amp;quot;.format(
    date=date,
    category_id=category_id,
    person_id=person_id.split(&amp;quot;/&amp;quot;)[-1],
    record=record,
)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The resulting &lt;code&gt;record_id&lt;/code&gt; might look like this: &lt;code&gt;2018-04-16-70b64e89-24878-0&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;To recreate that particular sequence of line-items, you can search for all items matching that &lt;code&gt;record_id&lt;/code&gt; and then sort them by their &lt;code&gt;sort_order&lt;/code&gt;. Here’s &lt;a href="https://register-of-members-interests.datasettes.com/regmem/items?_sort=sort_order&amp;amp;record_id__exact=2014-07-14-1f359ece-10001-0"&gt;that record from Diane Abbott&lt;/a&gt; shown with its surrounding context.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://register-of-members-interests.datasettes.com/regmem/items?_sort=sort_order&amp;amp;record_id__exact=2014-07-14-1f359ece-10001-0" class="a-img"&gt;&lt;img style="max-width: 100%" src="https://static.simonwillison.net/static/2018/mp-record.png" alt="A single record" /&gt;&lt;/a&gt;&lt;/p&gt;
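&lt;p&gt;The same lookup can be run as plain SQL. Here is a minimal sketch using Python’s &lt;code&gt;sqlite3&lt;/code&gt; module; the three-column table, the placeholder item text, the second &lt;code&gt;record_id&lt;/code&gt; and the &lt;code&gt;items_for_record()&lt;/code&gt; helper are all assumptions made for the example, not part of the real database or script:&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical cut-down version of the items table
conn.execute("CREATE TABLE items (record_id TEXT, sort_order INTEGER, item TEXT)")
conn.executemany(
    "INSERT INTO items VALUES (?, ?, ?)",
    [
        # Deliberately inserted out of order; the item text is placeholder data
        ("2014-07-14-1f359ece-10001-0", 2, "second payment"),
        ("2014-07-14-1f359ece-10001-0", 0, "heading line"),
        ("2014-07-14-1f359ece-10001-1", 0, "a different record"),
        ("2014-07-14-1f359ece-10001-0", 1, "first payment"),
    ],
)


def items_for_record(conn, record_id):
    # Fetch every line-item in one grouping, in its original document order
    sql = "SELECT item FROM items WHERE record_id = ? ORDER BY sort_order"
    return [row[0] for row in conn.execute(sql, (record_id,))]


grouped = items_for_record(conn, "2014-07-14-1f359ece-10001-0")
print(grouped)
```

&lt;p&gt;Whether &lt;code&gt;sort_order&lt;/code&gt; starts at 0 or 1 in the real data does not matter for this query, since only the relative ordering is used.&lt;/p&gt;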
&lt;h3&gt;&lt;a id="How_I_built_it_65"&gt;&lt;/a&gt;How I built it&lt;/h3&gt;
&lt;p&gt;The short version: I downloaded all of the XML files and wrote a Python script which parsed them using &lt;a href="https://docs.python.org/3/library/xml.etree.elementtree.html"&gt;ElementTree&lt;/a&gt; and inserted them into a SQLite database. I’ve &lt;a href="https://github.com/simonw/register-of-members-interests"&gt;put the code on GitHub&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;A couple of fun tricks: firstly, I borrowed some code from &lt;a href="https://github.com/simonw/csvs-to-sqlite"&gt;csvs-to-sqlite&lt;/a&gt; to create the full-text search index and enable searching:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;def create_and_populate_fts(conn):
    create_sql = &amp;quot;&amp;quot;&amp;quot;
        CREATE VIRTUAL TABLE &amp;quot;items_fts&amp;quot;
        USING {fts_version} (item, person_name, content=&amp;quot;items&amp;quot;)
    &amp;quot;&amp;quot;&amp;quot;.format(
        fts_version=best_fts_version()
    )
    conn.executescript(create_sql)
    conn.executescript(
        &amp;quot;&amp;quot;&amp;quot;
        INSERT INTO &amp;quot;items_fts&amp;quot; (rowid, item, person_name)
        SELECT items.rowid, items.item, people.name
        FROM items LEFT JOIN people ON items.person_id = people.id
    &amp;quot;&amp;quot;&amp;quot;
    )
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;best_fts_version()&lt;/code&gt; function &lt;a href="https://github.com/simonw/register-of-members-interests/blob/6372cfbc0905c4c045714e9092a981668f2d6345/convert_xml_to_sqlite.py#L56-L71"&gt;implements basic feature detection&lt;/a&gt; against SQLite by trying operations in an in-memory database.&lt;/p&gt;
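&lt;p&gt;The linked code is the real implementation. As a rough sketch of the general idea only, feature detection along these lines can try to create a virtual table with each FTS module in turn and keep the first one that works. The &lt;code&gt;detect_fts_version()&lt;/code&gt; name and the probe table names are hypothetical:&lt;/p&gt;

```python
import sqlite3


def detect_fts_version():
    # Hypothetical helper: returns the newest FTS module this SQLite build
    # supports, or None if no FTS module is available.
    conn = sqlite3.connect(":memory:")
    try:
        for fts in ("FTS5", "FTS4", "FTS3"):
            try:
                # Creating a virtual table fails with OperationalError
                # when the module was not compiled in
                conn.execute(
                    "CREATE VIRTUAL TABLE probe_{0} USING {0} (t)".format(fts)
                )
                return fts
            except sqlite3.OperationalError:
                continue
        return None
    finally:
        conn.close()


print(detect_fts_version())
```

&lt;p&gt;Because the probe runs against a throwaway in-memory database, it leaves nothing behind in the database being built.&lt;/p&gt;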
&lt;p&gt;Secondly, I ended up writing my own tiny utility function for inserting records into SQLite. SQLite has useful &lt;code&gt;INSERT OR REPLACE INTO&lt;/code&gt; syntax which inserts a record, or replaces the existing row outright if there is a match on the primary key. This meant I could write this utility function and use it for all of my data inserts:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;def insert_or_replace(conn, table, record):
    pairs = record.items()
    columns = [p[0] for p in pairs]
    params = [p[1] for p in pairs]
    sql = &amp;quot;INSERT OR REPLACE INTO {table} ({column_list}) VALUES ({value_list});&amp;quot;.format(
        table=table,
        column_list=&amp;quot;, &amp;quot;.join(columns),
        value_list=&amp;quot;, &amp;quot;.join([&amp;quot;?&amp;quot; for p in params]),
    )
    conn.execute(sql, params)

# ...

insert_or_replace(
    db,
    &amp;quot;people&amp;quot;,
    {
        &amp;quot;id&amp;quot;: person_id,
        &amp;quot;name&amp;quot;: regmem_el.attrib[&amp;quot;membername&amp;quot;],
    },
)
&lt;/code&gt;&lt;/pre&gt;
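&lt;p&gt;To see the replace behaviour in action, here is a small self-contained sketch that repeats the helper and calls it twice with the same primary key against an in-memory database. The table definition and the sample names are assumptions made for the example; the important detail is that &lt;code&gt;id&lt;/code&gt; is declared as the primary key:&lt;/p&gt;

```python
import sqlite3


def insert_or_replace(conn, table, record):
    # Same approach as the helper above: one ? placeholder per column.
    # Table and column names are interpolated, so they must be trusted.
    columns = list(record.keys())
    params = list(record.values())
    sql = "INSERT OR REPLACE INTO {table} ({column_list}) VALUES ({value_list});".format(
        table=table,
        column_list=", ".join(columns),
        value_list=", ".join("?" for _ in params),
    )
    conn.execute(sql, params)


conn = sqlite3.connect(":memory:")
# Hypothetical minimal people table with id as the primary key
conn.execute("CREATE TABLE people (id TEXT PRIMARY KEY, name TEXT)")

insert_or_replace(conn, "people", {"id": "10001", "name": "Example MP"})
# Second call with the same id replaces the row instead of raising an error
insert_or_replace(conn, "people", {"id": "10001", "name": "Example MP (updated)"})

rows = conn.execute("SELECT id, name FROM people").fetchall()
print(rows)
```

&lt;p&gt;One thing to keep in mind: &lt;code&gt;INSERT OR REPLACE&lt;/code&gt; deletes the matching row and inserts a new one, so any column left out of the record falls back to its default rather than keeping its old value.&lt;/p&gt;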
&lt;h3&gt;&lt;a id="What_can_you_find_113"&gt;&lt;/a&gt;What can you find?&lt;/h3&gt;
&lt;p&gt;I’ve really only scratched the surface of what’s in here with my initial queries. What can you find? Send me Datasette query links &lt;a href="https://twitter.com/simonw"&gt;on Twitter&lt;/a&gt; with your discoveries!&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/mysociety"&gt;mysociety&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/political-hacking"&gt;political-hacking&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/politics"&gt;politics&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sqlite"&gt;sqlite&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/xml"&gt;xml&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="mysociety"/><category term="political-hacking"/><category term="politics"/><category term="projects"/><category term="sqlite"/><category term="xml"/><category term="datasette"/></entry><entry><title>Has JSON pretty much replaced XML for string processing for the web, or are there use cases where XML is still necessary?</title><link href="https://simonwillison.net/2012/Feb/25/has-json-pretty-much/#atom-tag" rel="alternate"/><published>2012-02-25T17:17:00+00:00</published><updated>2012-02-25T17:17:00+00:00</updated><id>https://simonwillison.net/2012/Feb/25/has-json-pretty-much/#atom-tag</id><summary type="html">
    &lt;p&gt;&lt;em&gt;My answer to &lt;a href="https://www.quora.com/Has-JSON-pretty-much-replaced-XML-for-string-processing-for-the-web-or-are-there-use-cases-where-XML-is-still-necessary/answer/Simon-Willison"&gt;Has JSON pretty much replaced XML for string processing for the web, or are there use cases where XML is still necessary?&lt;/a&gt; on Quora&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;It's replaced XML as the default format for most APIs. XML is still necessary for Atom/RSS feeds and other existing standards built on top of XML. It's also a better choice than JSON for markup-style data - stuff like XHTML where tags are applied to sequences of characters within larger chunks of text.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/json"&gt;json&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/web-development"&gt;web-development&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/xml"&gt;xml&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/quora"&gt;quora&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="json"/><category term="web-development"/><category term="xml"/><category term="quora"/></entry><entry><title>What are XML feed best practices?</title><link href="https://simonwillison.net/2012/Jan/31/what-are-xml-feed/#atom-tag" rel="alternate"/><published>2012-01-31T14:29:00+00:00</published><updated>2012-01-31T14:29:00+00:00</updated><id>https://simonwillison.net/2012/Jan/31/what-are-xml-feed/#atom-tag</id><summary type="html">
    &lt;p&gt;&lt;em&gt;My answer to &lt;a href="https://www.quora.com/What-are-XML-feed-best-practices/answer/Simon-Willison"&gt;What are XML feed best practices?&lt;/a&gt; on Quora&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;It sounds like you're pretty much screwed already, if you're dealing with companies that still think FTPing XML around is a sensible thing to do.&lt;/p&gt;

&lt;p&gt;I would suggest focusing on what you can control. Assume that you will be passed bad data - weird formats, not-well-formed XML, duplicate entries etc. Your job is to handle all of this without going mad, and without your codebase turning into an unmanageable ball of mud.&lt;/p&gt;

&lt;p&gt;So, start by figuring out your own core data model / abstraction. It will need to be VERY loose - as few required fields as possible, since you can be sure some of the feeds you are consuming will come in with stuff missing at some point or another.&lt;/p&gt;

&lt;p&gt;Separate your feed consumers from the rest of your code. Having your own good internal Web API (which could consume JSON rather than XML since you control it) might be smart, since that will provide a solid separation and you can then write all of your feed consumers as separate pieces of code that just POST new items to the API.&lt;/p&gt;

&lt;p&gt;Learn to love, respect and cherish unique identifiers... but be very wary of supposedly unique identifiers from external sources unless you can be absolutely sure they won't change on you. Create your own unique IDs at the first available opportunity, treat them properly within your own system and map external identifiers to them whenever you can.&lt;/p&gt;

&lt;p&gt;Write your consumers in a dynamic language with a solid interactive prompt, like Python or Ruby. This will make them much easier to write and debug. Use whatever you like for your core data storage / API.&lt;/p&gt;

&lt;p&gt;Since your incoming data will come in all shapes and sizes, consider a document store such as MongoDB or Riak over a SQL database. Avoiding SQL migrations will help you out a lot.&lt;/p&gt;

&lt;p&gt;Log and store absolutely everything. Ideally you should be able to re-execute every import that the system has ever executed, in order, to make debugging and fixing errors non terrifying. That will almost certainly prove impossible, but it's a nice thought.&lt;/p&gt;

&lt;p&gt;Good luck!&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/databases"&gt;databases&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/mysql"&gt;mysql&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/php"&gt;php&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/xml"&gt;xml&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/quora"&gt;quora&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="databases"/><category term="mysql"/><category term="php"/><category term="xml"/><category term="quora"/></entry><entry><title>What is the difference between XHTML 1.0 strict and transitional?</title><link href="https://simonwillison.net/2012/Jan/14/what-is-the-difference/#atom-tag" rel="alternate"/><published>2012-01-14T13:09:00+00:00</published><updated>2012-01-14T13:09:00+00:00</updated><id>https://simonwillison.net/2012/Jan/14/what-is-the-difference/#atom-tag</id><summary type="html">
    &lt;p&gt;&lt;em&gt;My answer to &lt;a href="https://www.quora.com/What-is-the-difference-between-XHTML-1-0-strict-and-transitional/answer/Simon-Willison"&gt;What is the difference between XHTML 1.0 strict and transitional?&lt;/a&gt; on Quora&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Not a lot. XHTML transitional lets you use a few presentational attributes and elements that aren't available in XHTML strict. Here's a more detailed overview from back in 2005: &lt;span&gt;&lt;a href="http://24ways.org/2005/transitional-vs-strict-markup"&gt;http://24ways.org/2005/transitio...&lt;/a&gt;&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;These days you're much better off using HTML5 and ignoring XHTML entirely.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/html"&gt;html&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/web-development"&gt;web-development&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/xhtml"&gt;xhtml&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/xml"&gt;xml&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/quora"&gt;quora&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="html"/><category term="web-development"/><category term="xhtml"/><category term="xml"/><category term="quora"/></entry><entry><title>Indexing JSON in Solr 3.1</title><link href="https://simonwillison.net/2010/Dec/10/solr/#atom-tag" rel="alternate"/><published>2010-12-10T09:46:00+00:00</published><updated>2010-12-10T09:46:00+00:00</updated><id>https://simonwillison.net/2010/Dec/10/solr/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="http://www.lucidimagination.com/blog/2010/12/08/indexing-json-in-solr-3-1/"&gt;Indexing JSON in Solr 3.1&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
The next release of Solr will support indexing documents provided as JSON—Solr currently requires incoming documents to be formatted as XML.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/json"&gt;json&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/search"&gt;search&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/solr"&gt;solr&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/xml"&gt;xml&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/recovered"&gt;recovered&lt;/a&gt;&lt;/p&gt;



</summary><category term="json"/><category term="search"/><category term="solr"/><category term="xml"/><category term="recovered"/></entry><entry><title>Quoting James Clark</title><link href="https://simonwillison.net/2010/Dec/2/james/#atom-tag" rel="alternate"/><published>2010-12-02T18:48:00+00:00</published><updated>2010-12-02T18:48:00+00:00</updated><id>https://simonwillison.net/2010/Dec/2/james/#atom-tag</id><summary type="html">
    &lt;blockquote cite="http://blog.jclark.com/2010/11/xml-vs-web_24.html"&gt;&lt;p&gt;I think the Web community has spoken, and it’s clear that what it wants is HTML5, JavaScript and JSON. XML isn’t going away but I see it being less and less a Web technology; it won’t be something that you send over the wire on the public Web, but just one of many technologies that are used on the server to manage and generate what you do send over the wire.&lt;/p&gt;&lt;/blockquote&gt;
&lt;p class="cite"&gt;&amp;mdash; &lt;a href="http://blog.jclark.com/2010/11/xml-vs-web_24.html"&gt;James Clark&lt;/a&gt;&lt;/p&gt;

    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/html5"&gt;html5&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/json"&gt;json&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/xml"&gt;xml&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/recovered"&gt;recovered&lt;/a&gt;&lt;/p&gt;



</summary><category term="html5"/><category term="json"/><category term="xml"/><category term="recovered"/></entry><entry><title>Introducing BERT and BERT-RPC</title><link href="https://simonwillison.net/2009/Oct/21/bert/#atom-tag" rel="alternate"/><published>2009-10-21T22:11:37+00:00</published><updated>2009-10-21T22:11:37+00:00</updated><id>https://simonwillison.net/2009/Oct/21/bert/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="http://github.com/blog/531-introducing-bert-and-bert-rpc"&gt;Introducing BERT and BERT-RPC&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Justification for inventing a brand new serialisation protocol: Thrift and Protocol Buffers both use IDLs and code generation, XML “is not convertible to a simple unambiguous data structure in any language I’ve ever used” and JSON lacks support for unencoded binary data. The result is BERT—Binary ERlang Term—which extracts a format from Erlang in much the same way that JSON extracted one from JavaScript.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/erlang"&gt;erlang&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/github"&gt;github&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/javascript"&gt;javascript&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/json"&gt;json&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/protocolbuffers"&gt;protocolbuffers&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/serialisation"&gt;serialisation&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/thrift"&gt;thrift&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/xml"&gt;xml&lt;/a&gt;&lt;/p&gt;



</summary><category term="erlang"/><category term="github"/><category term="javascript"/><category term="json"/><category term="protocolbuffers"/><category term="serialisation"/><category term="thrift"/><category term="xml"/></entry><entry><title>minixsv</title><link href="https://simonwillison.net/2009/Aug/12/minixsv/#atom-tag" rel="alternate"/><published>2009-08-12T16:59:57+00:00</published><updated>2009-08-12T16:59:57+00:00</updated><id>https://simonwillison.net/2009/Aug/12/minixsv/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="http://www.familieleuthe.de/DownloadMiniXsv.html"&gt;minixsv&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
As far as I can tell, this is the only library that can validate XML using pure Python (no C extension required). I’d be extremely happy if someone would write a pure Python library (or one that only depends on ElementTree, which is included in the standard library) for validating XML against a Relax NG Compact syntax schema. Even DTD validation would be better than nothing!

    &lt;p&gt;&lt;small&gt;Via &lt;a href="http://stackoverflow.com/questions/1243449/validating-xml-in-python-without-non-python-dependencies"&gt;Stack Overflow&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/elementtree"&gt;elementtree&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/minixsv"&gt;minixsv&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/relaxng"&gt;relaxng&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/validation"&gt;validation&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/xml"&gt;xml&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/xmlschema"&gt;xmlschema&lt;/a&gt;&lt;/p&gt;



</summary><category term="elementtree"/><category term="minixsv"/><category term="python"/><category term="relaxng"/><category term="validation"/><category term="xml"/><category term="xmlschema"/></entry><entry><title>xmlwitch</title><link href="https://simonwillison.net/2009/Jul/24/xmlwitch/#atom-tag" rel="alternate"/><published>2009-07-24T00:33:57+00:00</published><updated>2009-07-24T00:33:57+00:00</updated><id>https://simonwillison.net/2009/Jul/24/xmlwitch/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="http://github.com/galvez/xmlwitch/"&gt;xmlwitch&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
An XML building library for Python that doesn’t suck (I love ElementTree for parsing XML, but I’ve never really liked it for generation). Makes smart use of the with statement.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/withstatement"&gt;withstatement&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/xml"&gt;xml&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/xmlwitch"&gt;xmlwitch&lt;/a&gt;&lt;/p&gt;



</summary><category term="python"/><category term="withstatement"/><category term="xml"/><category term="xmlwitch"/></entry><entry><title>Quoting Henri Sivonen</title><link href="https://simonwillison.net/2009/Jul/6/xhtml/#atom-tag" rel="alternate"/><published>2009-07-06T12:46:23+00:00</published><updated>2009-07-06T12:46:23+00:00</updated><id>https://simonwillison.net/2009/Jul/6/xhtml/#atom-tag</id><summary type="html">
    &lt;blockquote cite="http://hsivonen.iki.fi/xhtml2-html5-q-and-a/"&gt;&lt;p&gt;There are two meanings to XHTML: technical and marketing. The technical kind (XHTML served using the application/xhtml+xml MIME type) is a formulation of HTML as an XML vocabulary. The marketing kind (XHTML served using the text/html MIME type) is processed just like HTML by browsers but the authors attempt to observe slightly different syntax rules in order to make it seem that they are doing something newer and shinier compared to HTML.&lt;/p&gt;&lt;/blockquote&gt;
&lt;p class="cite"&gt;&amp;mdash; &lt;a href="http://hsivonen.iki.fi/xhtml2-html5-q-and-a/"&gt;Henri Sivonen&lt;/a&gt;&lt;/p&gt;

    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/buzzwords"&gt;buzzwords&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/henri-sivonen"&gt;henri-sivonen&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/xhtml"&gt;xhtml&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/xml"&gt;xml&lt;/a&gt;&lt;/p&gt;



</summary><category term="buzzwords"/><category term="henri-sivonen"/><category term="xhtml"/><category term="xml"/></entry><entry><title>With YQL Execute, the Internet becomes your database</title><link href="https://simonwillison.net/2009/Apr/29/yql/#atom-tag" rel="alternate"/><published>2009-04-29T22:50:54+00:00</published><updated>2009-04-29T22:50:54+00:00</updated><id>https://simonwillison.net/2009/Apr/29/yql/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="http://developer.yahoo.net/blog/archives/2009/04/yql_execute.html"&gt;With YQL Execute, the Internet becomes your database&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
This is nuts (in a good way). Yahoo!’s intriguing universal SQL-style  XML/JSONP web service interface now supports JavaScript as a kind of stored procedure language, meaning you can use JavaScript and E4X to screen-scrape web pages, then query the results with YQL.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/apis"&gt;apis&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/e4x"&gt;e4x&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/javascript"&gt;javascript&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/json"&gt;json&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/jsonp"&gt;jsonp&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sql"&gt;sql&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/xml"&gt;xml&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/yahoo"&gt;yahoo&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/yql"&gt;yql&lt;/a&gt;&lt;/p&gt;



</summary><category term="apis"/><category term="e4x"/><category term="javascript"/><category term="json"/><category term="jsonp"/><category term="sql"/><category term="xml"/><category term="yahoo"/><category term="yql"/></entry><entry><title>A few notes on the Guardian Open Platform</title><link href="https://simonwillison.net/2009/Mar/10/openplatform/#atom-tag" rel="alternate"/><published>2009-03-10T14:28:39+00:00</published><updated>2009-03-10T14:28:39+00:00</updated><id>https://simonwillison.net/2009/Mar/10/openplatform/#atom-tag</id><summary type="html">
    &lt;p&gt;This morning we launched the &lt;a href="http://www.guardian.co.uk/open-platform"&gt;Guardian Open Platform&lt;/a&gt; at a well attended event in our new offices in &lt;a href="http://www.kingsplace.co.uk/"&gt;Kings Place&lt;/a&gt;. This is one of the main projects I've been helping out with since joining the Guardian last year, and it's fantastic to finally have it out in the open.&lt;/p&gt;

&lt;p&gt;There are two components to the launch today: the Content API and the Data Store. I'll describe the Data Store first as it deserves not to get buried in the discussion about its larger cousin.&lt;/p&gt;

&lt;h4&gt;The Data Store&lt;/h4&gt;

&lt;p&gt;&lt;a href="http://www.guardian.co.uk/profile/simonrogers"&gt;Simon Rogers&lt;/a&gt; is the Guardian news editor who is principally responsible for gathering data about the world. If you ever see an infographic in the paper, the chances are Simon had a hand in researching the data for it. His delicious feed is a &lt;a href="http://delicious.com/smfrogers"&gt;positive gold mine&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;As of today, a sizeable portion of the data he collects for the newspaper will also be published online. As a starting point, we're publishing over &lt;a href="http://www.guardian.co.uk/data-store"&gt;80 data sets&lt;/a&gt;, all using Google Spreadsheets which means it's all accessible through the &lt;a href="http://code.google.com/apis/spreadsheets/overview.html"&gt;Spreadsheets Data API&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Here's Simon's take on it, from &lt;a href="http://www.guardian.co.uk/news/datablog/2009/mar/10/blogpost1"&gt;Welcome to the Datablog&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote cite="http://www.guardian.co.uk/news/datablog/2009/mar/10/blogpost1"&gt;&lt;p&gt;Everyday we work with datasets from around the world. We have had to check this data and make sure it's the best we can get, from the most credible sources. But then it lives for the moment of the paper's publication and afterward disappears into a hard drive, rarely to emerge again before updating a year later.&lt;/p&gt;

&lt;p&gt;So, together with its companion site, the Data Store – a directory of all the stats we post – we are opening up that data for everyone. Whenever we come across something interesting or relevant or useful, we'll post it up here and let you know what we're planning to do with it.&lt;/p&gt;&lt;/blockquote&gt;

&lt;p&gt;It's worth spending quite a while digging around the data. Most sets come with a full description, including where the data was sourced from. New data sets will be announced &lt;a href="http://www.guardian.co.uk/news/datablog"&gt;on the Datablog&lt;/a&gt;, which is cleverly subtitled "Facts are sacred".&lt;/p&gt;

&lt;h4&gt;The Content API&lt;/h4&gt;

&lt;p&gt;&lt;a href="http://api.guardianapis.com/docs/"&gt;The Content API&lt;/a&gt; provides REST-ish access to over a million items of content, mostly from the last decade but with a few gems that are &lt;a href="http://www.guardian.co.uk/world/1944/aug/26/france.secondworldwar"&gt;a little bit older&lt;/a&gt;. Various types of content are available - article is the most common, but you can grab information (though not necessarily content) about audio, video, galleries and more. You can retrieve 50 items at a time, and pagination is unlimited (provided you stay below the API's rate limit).&lt;/p&gt;

&lt;p&gt;Articles are provided with their full body content, though this does not currently include any HTML tags (a known issue). It's a good idea to review &lt;a href="http://www.guardian.co.uk/open-platform/terms-and-conditions"&gt;our terms and conditions&lt;/a&gt;, but you should know that if you opt to republish our article bodies on your site we may ask you to include our ads alongside our content in the future.&lt;/p&gt;

&lt;p&gt;We serve 15 minute HTTP cache headers, but you are allowed to store our content for up to 24 hours. You really, really don't want to store content for longer than that, as in addition to violating our T&amp;amp;Cs you might find yourself inadvertently publishing an article that has been retracted for legal reasons. UK libel laws can be pretty scary.&lt;/p&gt;
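&lt;p&gt;A minimal sketch of how a client might honour both limits, assuming a simple in-memory store. The names &lt;code&gt;ContentStore&lt;/code&gt;, &lt;code&gt;HTTP_CACHE_SECONDS&lt;/code&gt; and &lt;code&gt;MAX_STORE_SECONDS&lt;/code&gt; are hypothetical and are not part of the API or its client libraries.&lt;/p&gt;

```python
import time

HTTP_CACHE_SECONDS = 15 * 60        # the 15 minute HTTP cache lifetime
MAX_STORE_SECONDS = 24 * 60 * 60    # the 24 hour storage limit


class ContentStore:
    """Hypothetical in-memory store that drops items after 24 hours."""

    def __init__(self, clock=time.time):
        self._clock = clock
        self._items = {}  # item id -> (stored_at, content)

    def put(self, item_id, content):
        self._items[item_id] = (self._clock(), content)

    def get(self, item_id):
        """Return (content, needs_refresh), or (None, True) if missing or expired."""
        entry = self._items.get(item_id)
        if entry is None:
            return None, True
        stored_at, content = entry
        age = self._clock() - stored_at
        if age >= MAX_STORE_SECONDS:
            # Too old to keep: delete it so a retracted article cannot linger.
            del self._items[item_id]
            return None, True
        # Still storable, but refetch once the HTTP cache lifetime has passed.
        return content, age >= HTTP_CACHE_SECONDS
```

&lt;p&gt;Deleting on read keeps the sketch short; a real store would probably also purge expired items on a schedule.&lt;/p&gt;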

&lt;p&gt;In addition to regular search, you can also filter our content using tags. Tags are a core aspect of the Guardian's &lt;a href="http://www.guardian.co.uk/help/insideguardian+series/an-abc-of-r2"&gt;R2 platform&lt;/a&gt;, being used for keywords, contributors, "series" (used to implement blogs), content types and more. Every item returned by the API includes tags, and the tags can be used to further filter the results.&lt;/p&gt;

&lt;p&gt;We also return a list of filters at the bottom of each page of search results showing the tags that could be used to filter that result set, ordered by the number of results (you may have seen this feature referred to as faceted search or guided navigation). Handy tip: you can use ?count=0 in your search API request to turn off results entirely and just get back the filters section. The race is on to be first to release a tag relationship browser based on this feature.&lt;/p&gt;
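&lt;p&gt;A rough sketch of building such a filters-only request. Only the &lt;code&gt;count=0&lt;/code&gt; parameter comes from the tip above; the endpoint URL and the &lt;code&gt;q&lt;/code&gt;, &lt;code&gt;api_key&lt;/code&gt; and &lt;code&gt;filter&lt;/code&gt; parameter names are placeholders assumed for illustration, so check the API documentation for the real ones.&lt;/p&gt;

```python
from urllib.parse import urlencode

# Placeholder endpoint: an assumption for illustration, not the documented URL.
SEARCH_ENDPOINT = "http://api.example.com/search"


def filters_only_url(query, api_key, tag_filters=()):
    """Build a search URL that asks for zero results, leaving just the filters."""
    # "q" and "api_key" are assumed parameter names; "count" is from the tip.
    params = [("q", query), ("count", 0), ("api_key", api_key)]
    # Each tag narrows the result set further; "filter" is an assumed name.
    params.extend(("filter", tag) for tag in tag_filters)
    return SEARCH_ENDPOINT + "?" + urlencode(params)
```

&lt;p&gt;Passing a list of pairs rather than a dictionary to &lt;code&gt;urlencode&lt;/code&gt; allows the same parameter name to be repeated, once per tag.&lt;/p&gt;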

&lt;p&gt;API responses can be had in custom XML, JSON or Atom. The Atom format is the least mature at the moment, and we'd welcome suggestions for improving it from the community.&lt;/p&gt;

&lt;p&gt;I released &lt;a href="http://code.google.com/p/openplatform-python/"&gt;a Python client library&lt;/a&gt; for the API this morning, and we also have libraries for &lt;a href="http://code.google.com/p/openplatform-ruby/"&gt;Ruby&lt;/a&gt;, &lt;a href="http://code.google.com/p/openplatform-java/"&gt;Java&lt;/a&gt; and &lt;a href="http://code.google.com/p/openplatform-php/"&gt;PHP&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;We also have an API Explorer (written in JavaScript and jQuery, hosted on the same domain as the API so that it can make Ajax requests) but you'll need an API key to try it out.&lt;/p&gt;

&lt;h4&gt;The bad news&lt;/h4&gt;

&lt;p&gt;The response to the API release has been terrific (check out what &lt;a href="http://www.tom-watson.co.uk/2009/03/guardian-open-platform/"&gt;Tom Watson&lt;/a&gt; had to say), but as a result it's likely that API key provisions will be significantly lower than the overall demand for them. Please bear with us while we work towards a more widely accessible release.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/apis"&gt;apis&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/atom"&gt;atom&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/contentapi"&gt;contentapi&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/data"&gt;data&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/data-journalism"&gt;data-journalism&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datastore"&gt;datastore&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/guardian"&gt;guardian&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/javascript"&gt;javascript&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/journalism"&gt;journalism&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/jquery"&gt;jquery&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/json"&gt;json&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openplatform"&gt;openplatform&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/simon-rogers"&gt;simon-rogers&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/tom-watson"&gt;tom-watson&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/xml"&gt;xml&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="apis"/><category term="atom"/><category term="contentapi"/><category term="data"/><category term="data-journalism"/><category term="datastore"/><category term="guardian"/><category term="javascript"/><category term="journalism"/><category term="jquery"/><category term="json"/><category term="openplatform"/><category term="python"/><category term="simon-rogers"/><category term="tom-watson"/><category term="xml"/></entry><entry><title>JsonML (JSON Markup Language)</title><link href="https://simonwillison.net/2009/Feb/10/jsonml/#atom-tag" rel="alternate"/><published>2009-02-10T15:03:21+00:00</published><updated>2009-02-10T15:03:21+00:00</updated><id>https://simonwillison.net/2009/Feb/10/jsonml/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="http://jsonml.org/"&gt;JsonML (JSON Markup Language)&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
An almost non-lossy serialization format for sending XML as JSON (plain text in between elements is ignored). Uses the (element-name, attribute-dictionary, list-of-children) tuple format, which sadly means many common cases end up taking more bytes than the original XML. Still an improvement on serializations that behave differently when a list of children has only one item in it.
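&lt;p&gt;A small sketch of the idea, following the (element-name, attribute-dictionary, list-of-children) shape described above rather than the JsonML specification itself. The function names are illustrative only.&lt;/p&gt;

```python
import json
import xml.etree.ElementTree as ET


def to_tuple_form(element):
    """Convert an element to [name, attributes, children], recursively."""
    children = [to_tuple_form(child) for child in element]
    # Text that is the sole content of a leaf element is kept as a string child;
    # text mixed in between child elements is dropped in this sketch.
    if not children and element.text and element.text.strip():
        children.append(element.text.strip())
    return [element.tag, dict(element.attrib), children]


def xml_to_json(xml_text):
    """Parse an XML string and serialize its tuple form as JSON."""
    return json.dumps(to_tuple_form(ET.fromstring(xml_text)))
```

&lt;p&gt;For example, &lt;code&gt;&amp;lt;a href="/"&amp;gt;home&amp;lt;/a&amp;gt;&lt;/code&gt; (20 characters) becomes &lt;code&gt;["a", {"href": "/"}, ["home"]]&lt;/code&gt; (30 characters), which illustrates the size penalty. The children are always a list, even when there is only one of them.&lt;/p&gt;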


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/json"&gt;json&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/jsonml"&gt;jsonml&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/serialization"&gt;serialization&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/xml"&gt;xml&lt;/a&gt;&lt;/p&gt;



</summary><category term="json"/><category term="jsonml"/><category term="serialization"/><category term="xml"/></entry><entry><title>Crowbar</title><link href="https://simonwillison.net/2009/Jan/24/crowbar/#atom-tag" rel="alternate"/><published>2009-01-24T23:52:55+00:00</published><updated>2009-01-24T23:52:55+00:00</updated><id>https://simonwillison.net/2009/Jan/24/crowbar/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="http://simile.mit.edu/wiki/Crowbar"&gt;Crowbar&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Headless Gecko/XULRunner which exposes a web service API for screen scraping using a real browser DOM—just pass it the URL of a page and the URL of a screen scraping JavaScript script (a bit like a Greasemonkey user script) and get back RDF/XML.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/crowbar"&gt;crowbar&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/dom"&gt;dom&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gecko"&gt;gecko&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/greasemonkey"&gt;greasemonkey&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/mozilla"&gt;mozilla&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/rdf"&gt;rdf&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/scraping"&gt;scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/webservice"&gt;webservice&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/xml"&gt;xml&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/xulrunner"&gt;xulrunner&lt;/a&gt;&lt;/p&gt;



</summary><category term="crowbar"/><category term="dom"/><category term="gecko"/><category term="greasemonkey"/><category term="mozilla"/><category term="rdf"/><category term="scraping"/><category term="webservice"/><category term="xml"/><category term="xulrunner"/></entry><entry><title>How to install lxml python module on mac os 10.5 (leopard)</title><link href="https://simonwillison.net/2008/Dec/15/lxml/#atom-tag" rel="alternate"/><published>2008-12-15T00:05:21+00:00</published><updated>2008-12-15T00:05:21+00:00</updated><id>https://simonwillison.net/2008/Dec/15/lxml/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="http://lsimons.wordpress.com/2008/08/31/how-to-install-lxml-python-module-on-mac-os-105-leopard/"&gt;How to install lxml python module on mac os 10.5 (leopard)&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Instructions that work! Finally, I can find out what all the fuss is about.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/leopard"&gt;leopard&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/libxml2"&gt;libxml2&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/lxml"&gt;lxml&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/macos"&gt;macos&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/xml"&gt;xml&lt;/a&gt;&lt;/p&gt;



</summary><category term="leopard"/><category term="libxml2"/><category term="lxml"/><category term="macos"/><category term="python"/><category term="xml"/></entry><entry><title>pyquery</title><link href="https://simonwillison.net/2008/Dec/6/python/#atom-tag" rel="alternate"/><published>2008-12-06T09:53:45+00:00</published><updated>2008-12-06T09:53:45+00:00</updated><id>https://simonwillison.net/2008/Dec/6/python/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="http://pypi.python.org/pypi/pyquery"&gt;pyquery&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
“A jQuery-like library for Python”—implemented on top of lxml, providing jQuery style methods for manipulating an HTML or XML document.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/jquery"&gt;jquery&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/lxml"&gt;lxml&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pyquery"&gt;pyquery&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/xml"&gt;xml&lt;/a&gt;&lt;/p&gt;



</summary><category term="jquery"/><category term="lxml"/><category term="pyquery"/><category term="python"/><category term="xml"/></entry><entry><title>Magnificent Seven - the value of Atom</title><link href="https://simonwillison.net/2008/Oct/19/bill/#atom-tag" rel="alternate"/><published>2008-10-19T22:24:21+00:00</published><updated>2008-10-19T22:24:21+00:00</updated><id>https://simonwillison.net/2008/Oct/19/bill/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="http://www.dehora.net/journal/2008/10/07/magnificent-seven-the-value-of-atom/"&gt;Magnificent Seven - the value of Atom&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
The seven core things that Atom solves so that you don’t have to.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/atom"&gt;atom&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/bill-de-hora"&gt;bill-de-hora&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/rest"&gt;rest&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/xml"&gt;xml&lt;/a&gt;&lt;/p&gt;



</summary><category term="atom"/><category term="bill-de-hora"/><category term="rest"/><category term="xml"/></entry><entry><title>cascadenik: cascading sheets of style for mapnik</title><link href="https://simonwillison.net/2008/Aug/30/cascadenik/#atom-tag" rel="alternate"/><published>2008-08-30T10:04:35+00:00</published><updated>2008-08-30T10:04:35+00:00</updated><id>https://simonwillison.net/2008/Aug/30/cascadenik/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="http://mike.teczno.com/notes/cascadenik.html"&gt;cascadenik: cascading sheets of style for mapnik&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Great idea. Mapnik (the open source tile rendering system used by OpenStreetMap and others) has a complex style configuration based on XML. Michal Migurski has built a CSS-style equivalent which compiles down to XML, hopefully making it much quicker and easier to get started with Mapnik customisation.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/cascadenik"&gt;cascadenik&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/css"&gt;css&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/mapnik"&gt;mapnik&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/mapping"&gt;mapping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/michal-migurski"&gt;michal-migurski&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openstreetmap"&gt;openstreetmap&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/xml"&gt;xml&lt;/a&gt;&lt;/p&gt;



</summary><category term="cascadenik"/><category term="css"/><category term="mapnik"/><category term="mapping"/><category term="michal-migurski"/><category term="openstreetmap"/><category term="xml"/></entry><entry><title>Tip: Configure SAX parsers for secure processing</title><link href="https://simonwillison.net/2008/Aug/23/tip/#atom-tag" rel="alternate"/><published>2008-08-23T11:12:10+00:00</published><updated>2008-08-23T11:12:10+00:00</updated><id>https://simonwillison.net/2008/Aug/23/tip/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="http://www.ibm.com/developerworks/xml/library/x-tipcfsx.html"&gt;Tip: Configure SAX parsers for secure processing&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Explains the billion laughs attack, among others.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/billionlaughs"&gt;billionlaughs&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/elliotte-rusty-harold"&gt;elliotte-rusty-harold&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sax"&gt;sax&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/security"&gt;security&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/xml"&gt;xml&lt;/a&gt;&lt;/p&gt;



</summary><category term="billionlaughs"/><category term="elliotte-rusty-harold"/><category term="sax"/><category term="security"/><category term="xml"/></entry><entry><title>DoS vulnerability in REXML</title><link href="https://simonwillison.net/2008/Aug/23/dos/#atom-tag" rel="alternate"/><published>2008-08-23T11:11:13+00:00</published><updated>2008-08-23T11:11:13+00:00</updated><id>https://simonwillison.net/2008/Aug/23/dos/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="http://www.ruby-lang.org/en/news/2008/08/23/dos-vulnerability-in-rexml/"&gt;DoS vulnerability in REXML&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Ruby’s REXML library is susceptible to the “billion laughs” denial of service attack where recursively nested entities expand a single entity reference to a billion characters (kind of like the exploding zip file attack). Rails applications that process user-supplied XML should apply the monkey-patch ASAP; a proper gem update is forthcoming.
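&lt;p&gt;The growth is easy to work out. This is a sketch of the arithmetic only, assuming the commonly cited shape of the attack in which each entity is defined as a fixed number of references to the one below it; the function name and parameters are illustrative and it parses no XML.&lt;/p&gt;

```python
def expanded_length(levels, refs_per_level, base_text="lol"):
    """Length in characters of the fully expanded top-level entity.

    Level 0 is the literal base text; each higher level is defined as
    refs_per_level references to the level below it.
    """
    return len(base_text) * refs_per_level ** levels
```

&lt;p&gt;With nine levels of ten references each and a three-character base string, &lt;code&gt;expanded_length(9, 10)&lt;/code&gt; comes to 3,000,000,000 characters from a document well under a kilobyte in size.&lt;/p&gt;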


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/billionlaughs"&gt;billionlaughs&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/denial-of-service"&gt;denial-of-service&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/rails"&gt;rails&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/rexml"&gt;rexml&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ruby"&gt;ruby&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/security"&gt;security&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/xml"&gt;xml&lt;/a&gt;&lt;/p&gt;



</summary><category term="billionlaughs"/><category term="denial-of-service"/><category term="rails"/><category term="rexml"/><category term="ruby"/><category term="security"/><category term="xml"/></entry></feed>