Simon Willison's Weblog: xml

Removing XSLT for a more secure browser

2025-11-05T22:24:57+00:00

Previously discussed back in August, it looks like it's now official:

Chrome intends to deprecate and remove XSLT from the browser. [...] We intend to remove support from version 155 (November 17, 2026). The Firefox and WebKit projects have also indicated plans to remove XSLT from their browser engines. [...]

The continued inclusion of XSLT 1.0 in web browsers presents a significant and unnecessary security risk. The underlying libraries that process these transformations, such as libxslt (used by Chromium browsers), are complex, aging C/C++ codebases. This type of code is notoriously susceptible to memory safety vulnerabilities like buffer overflows, which can lead to arbitrary code execution.

I mostly encounter XSLT on people's Atom/RSS feeds, converting those to a more readable format in case someone should navigate directly to that link. Jake Archibald shared an alternative solution to that back in September.

Via Hacker News

Tags: browsers, chrome, security, web-standards, xml, xslt, jake-archibald

Making XML human-readable without XSLT

2025-09-02T19:32:57+00:00

In response to the recent discourse about XSLT support in browsers, Jake Archibald shares a new-to-me alternative trick for making an XML document readable in a browser: adding the following element near the top of the XML:

<script
  xmlns="http://www.w3.org/1999/xhtml"
  src="script.js" defer="" />

That script.js will then be executed by the browser, and can swap out the XML with HTML by creating new elements using the correct namespace:

const htmlEl = document.createElementNS(
  'http://www.w3.org/1999/xhtml',
  'html',
);
document.documentElement.replaceWith(htmlEl);
// Now populate the new DOM

Tags: browsers, javascript, rss, xml, xslt, jake-archibald

My First Open Source AI Generated Library

2025-06-21T23:22:45+00:00

Armin Ronacher had Claude and Claude Code do almost all of the work in building, testing, packaging and publishing a new Python library based on his design:

It wrote ~1100 lines of code for the parser

It wrote ~1000 lines of tests

It configured the entire Python package, CI, PyPI publishing

Generated a README, drafted a changelog, designed a logo, made it theme-aware

Did multiple refactorings to make me happier

The project? sloppy-xml-py, a lax XML parser (and violation of everything the XML Working Group hold sacred) which ironically is necessary because LLMs themselves frequently output "XML" that includes validation errors.

Claude's SVG logo design is actually pretty decent, turns out it can draw more than just bad pelicans!

I think experiments like this are a really valuable way to explore the capabilities of these models. Armin's conclusion:

This was an experiment to see how far I could get with minimal manual effort, and to unstick myself from an annoying blocker. The result is good enough for my immediate use case and I also felt good enough to publish it to PyPI in case someone else has the same problem.

Treat it as a curious side project which says more about what's possible today than what's necessarily advisable.

I'd like to present a slightly different conclusion here. The most interesting thing about this project is that the code is good.

My criteria for good code these days are the following:

Solves a defined problem, well enough that I'm not tempted to solve it in a different way
Uses minimal dependencies
Clear and easy to understand
Well tested, with tests that prove the code does what it's meant to do
Comprehensive documentation
Packaged and published in a way that makes it convenient for me to use
Designed to be easy to maintain and make changes in the future

sloppy-xml-py fits all of those criteria. It's useful, well defined, the code is readable with just about the right level of comments, everything is tested, the documentation explains everything I need to know, and it's been shipped to PyPI.

I'd be proud to have written this myself.

This example is not an argument for replacing programmers with LLMs. The code is good because Armin is an expert programmer who stayed in full control throughout the process. As I wrote the other day, this takes "a skilled individual with both deep domain understanding and deep understanding of the capabilities of the agent".

Via @mitsuhiko.at

Tags: armin-ronacher, open-source, pypi, python, xml, ai, generative-ai, llms, ai-assisted-programming, claude, claude-code

Cracking The Dave & Buster’s Anomaly

2025-06-05T10:23:31+00:00

Guilherme Rambo reports on a weird iOS Messages bug:

The bug is that, if you try to send an audio message using the Messages app to someone who’s also using the Messages app, and that message happens to include the name “Dave and Buster’s”, the message will never be received.

Guilherme captured the logs from an affected device and spotted an XHTMLParseFailure error.

It turned out the iOS automatic transcription mechanism was recognizing the brand name and converting it to the official restaurant chain's preferred spelling "Dave & Buster’s"... which was then incorrectly escaped and triggered a parse error!
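Here's a minimal Python sketch of that class of failure. It is not Apple's code: the `<p>` wrapper and the sample transcript are made up for illustration. It shows a strict XML parser rejecting a bare ampersand, and the same text parsing cleanly once it has been escaped.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Hypothetical transcript text, standing in for the transcribed audio message
transcript = "Meet me at Dave & Buster's"


def wrap(text: str) -> str:
    # Hypothetical wrapper element, not the real Messages payload format
    return f"<p>{text}</p>"


def parses(markup: str) -> bool:
    """Return True if the markup is well-formed XML."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


raw_ok = parses(wrap(transcript))              # bare "&" is not well-formed
escaped_ok = parses(wrap(escape(transcript)))  # "&" becomes "&amp;"
print(raw_ok, escaped_ok)  # False True
```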

Tags: xhtml, xml, ios

Anthropic's Prompt Engineering Interactive Tutorial

2024-08-30T02:52:04+00:00

Anthropic continue their trend of offering the best documentation of any of the leading LLM vendors. This tutorial is delivered as a set of Jupyter notebooks - I used it as an excuse to try uvx like this:

git clone https://github.com/anthropics/courses
uvx --from jupyter-core jupyter notebook courses

This installed a working Jupyter system, started the server and launched my browser within a few seconds.

The first few chapters are pretty basic, demonstrating simple prompts run through the Anthropic API. I used %pip install anthropic instead of !pip install anthropic to make sure the package was installed in the correct virtual environment, then filed an issue and a PR.

One new-to-me trick: in the first chapter the tutorial suggests running this:

API_KEY = "your_api_key_here"
%store API_KEY

This stashes your Anthropic API key in the IPython store. In subsequent notebooks you can restore the API_KEY variable like this:

%store -r API_KEY

I poked around and on macOS those variables are stored in files of the same name in ~/.ipython/profile_default/db/autorestore.

Chapter 4: Separating Data and Instructions included some interesting notes on Claude's support for content wrapped in XML-tag-style delimiters:

Note: While Claude can recognize and work with a wide range of separators and delimiters, we recommend that you use specifically XML tags as separators for Claude, as Claude was trained specifically to recognize XML tags as a prompt organizing mechanism. Outside of function calling, there are no special sauce XML tags that Claude has been trained on that you should use to maximally boost your performance. We have purposefully made Claude very malleable and customizable this way.

Plus this note on the importance of avoiding typos, with a nod back to the problem of sandbagging where models match their intelligence and tone to that of their prompts:

This is an important lesson about prompting: small details matter! It's always worth it to scrub your prompts for typos and grammatical errors. Claude is sensitive to patterns (in its early years, before finetuning, it was a raw text-prediction tool), and it's more likely to make mistakes when you make mistakes, smarter when you sound smart, sillier when you sound silly, and so on.

Chapter 5: Formatting Output and Speaking for Claude includes notes on one of Claude's most interesting features: prefill, where you can tell it how to start its response:

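import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

client.messages.create(
    model="claude-3-haiku-20240307",
    max_tokens=100,
    messages=[
        {"role": "user", "content": "JSON facts about cats"},
        # Prefill: the response continues from this opening brace
        {"role": "assistant", "content": "{"}
    ]
)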

Things start to get really interesting in Chapter 6: Precognition (Thinking Step by Step), which suggests using XML tags to help the model consider different arguments prior to generating a final answer:

Is this review sentiment positive or negative? First, write the best arguments for each side in <positive-argument> and <negative-argument> XML tags, then answer.

The tags make it easy to strip out the "thinking out loud" portions of the response.
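As a rough illustration of that stripping step, here's a small Python sketch. The response text is invented for the example and `strip_thinking` is a hypothetical helper, not something from the tutorial.

```python
import re

# Invented model response using the tags from the prompt above
response = (
    "<positive-argument>The reviewer praises the plot.</positive-argument>\n"
    "<negative-argument>They found the ending rushed.</negative-argument>\n"
    "Overall the review is positive."
)

THINKING_TAGS = ("positive-argument", "negative-argument")


def strip_thinking(text: str, tags=THINKING_TAGS) -> str:
    """Remove each <tag>...</tag> block, leaving only the final answer."""
    for tag in tags:
        name = re.escape(tag)
        # Non-greedy match; DOTALL lets the tag body span multiple lines
        text = re.sub(rf"<{name}>.*?</{name}>", "", text, flags=re.DOTALL)
    return text.strip()


final_answer = strip_thinking(response)
print(final_answer)  # Overall the review is positive.
```

A regular expression is enough here because the tags are simple delimiters rather than a full XML document.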

It also warns about Claude's sensitivity to ordering. If you give Claude two options (e.g. for sentiment analysis):

In most situations (but not all, confusingly enough), Claude is more likely to choose the second of two options, possibly because in its training data from the web, second options were more likely to be correct.

This effect can be reduced using the thinking out loud / brainstorming prompting techniques.

A related tip is proposed in Chapter 8: Avoiding Hallucinations:

How do we fix this? Well, a great way to reduce hallucinations on long documents is to make Claude gather evidence first.

In this case, we tell Claude to first extract relevant quotes, then base its answer on those quotes. Telling Claude to do so here makes it correctly notice that the quote does not answer the question.

I really like the example prompt they provide here, for answering complex questions against a long document:

<question>What was Matterport's subscriber base on the precise date of May 31, 2020?</question>

Please read the below document. Then, in <scratchpad> tags, pull the most relevant quote from the document and consider whether it answers the user's question or whether it lacks sufficient detail. Then write a brief numerical answer in <answer> tags.

Via Hacker News

Tags: python, xml, ai, jupyter, prompt-engineering, generative-ai, llms, anthropic, claude, uv

SIARD: Software Independent Archiving of Relational Databases

2022-05-04T22:40:27+00:00

I hadn’t heard of this before but it looks really interesting: the Federal Archives of Switzerland developed a standard for archiving any relational database as a zip file full of XML which “is used in over 50 countries around the globe”.

Via @MAndrewWaugh

Tags: archives, databases, xml

Building an Evernote to SQLite exporter

2020-10-16T20:12:07+00:00

I've been using Evernote for over a decade, and I've long wanted to export my data from it so I can do interesting things with it.

Ideally I'd use their API for that, so I could continually export new notes via a cron. Unfortunately the API is implemented using Thrift (the only public API I've seen to use that) and my attempts to speak to it from Python have so far failed to get anywhere.

Last weekend I decided to try using their "Export notes" feature instead, and its ENEX XML format.

ENEX - the Evernote XML export format

Select a note - or multiple notes - in the Evernote macOS desktop app, use the File -> Export Notes menu item and Evernote will create a Notes.enex exported XML file.

These files can get BIG, because any images attached to your notes will be embedded as base64 encoded binary data within the XML. My export is 2.8GB!

When faced with unknown XML like this it's useful to get a high level overview of what tags are present. In 2009 I wrote a Python script for this - as part of this project I updated it for Python 3 and pushed a release to PyPI.

Here's a subset of the output when run against my Notes.enex file (full output is here).

{
    "note": {
        "attr_counts": {},
        "child_counts": {
            "content": 2126,
            "created": 2126,
            "note-attributes": 2126,
            "resource": 2605,
            "tag": 45,
            "title": 2126,
            "updated": 2126
        },
        "count": 2126,
        "parent_counts": {
            "en-export": 2126
        }
    },
    "note-attributes": {
        "attr_counts": {},
        "child_counts": {
            "altitude": 1466,
            "application-data": 449,
            "author": 998,
            "classifications": 51,
            "content-class": 387,
            "latitude": 1480,
            "longitude": 1480,
            "reminder-done-time": 2,
            "reminder-order": 2126,
            "reminder-time": 1,
            "source": 1664,
            "source-application": 423,
            "source-url": 85,
            "subject-date": 10
        },
        "count": 2126,
        "parent_counts": {
            "note": 2126
        }
    }
}

This shows me that every note is represented as a <note> element, and crucial metadata lives in children of a <note-attributes> child element.

(I thought I'd need to tweak the script for performance since it works by loading the entire file into memory, but my laptop has 32GB of RAM so it didn't even blink.)
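The kind of summary shown above can be approximated in a few lines of ElementTree. This is a simplified sketch under my own assumptions, not the actual script: `summarize` is a hypothetical name, it skips attribute counts, and like the original it loads the whole document into memory.

```python
import xml.etree.ElementTree as ET
from collections import Counter, defaultdict


def summarize(xml_text: str) -> dict:
    """Count each tag, plus how often each child and parent tag occurs."""
    summary = defaultdict(
        lambda: {"count": 0, "child_counts": Counter(), "parent_counts": Counter()}
    )

    def visit(el, parent_tag):
        info = summary[el.tag]
        info["count"] += 1
        if parent_tag is not None:
            info["parent_counts"][parent_tag] += 1
        for child in el:
            info["child_counts"][child.tag] += 1
            visit(child, el.tag)

    visit(ET.fromstring(xml_text), None)
    return dict(summary)


# Tiny made-up document in the same shape as an ENEX export
sample = (
    "<en-export>"
    "<note><title>A</title><tag>x</tag><tag>y</tag></note>"
    "<note><title>B</title></note>"
    "</en-export>"
)
result = summarize(sample)
print(result["note"]["count"], dict(result["note"]["child_counts"]))
```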

ENEX limitations

I ran into some significant limitations while working with my ENEX export.

Notebooks are not represented in the file at all - you just get the notes. So I can't tell which recipes were in my "cooking" notebook. It's possible to work around this by manually exporting the notes from each notebook one at a time and storing them in separate export files. I didn't bother.
Some of the data in the file - the note content itself for example - consists of further blocks of XML embedded in CDATA. This means you have to run a nested XML parser for every note you process.
The notes XML <en-note> format is mostly XHTML, but includes custom <en-media hash="..."> tags where inline images should be displayed
Those inline images are CDATA encoded base64 strings. They don't include a hash, but I figured out that decoding the base64 string and then running it through MD5 generates the hash that is used in the corresponding <en-media> tag.
Notes in the export don't have any form of unique ID!

That last limitation - the lack of unique IDs - is a huge pain. It means that any export is by necessity a one-time operation - if you edit a note and attempt a re-export you will get a brand new record that can't be automatically used to update the previous one.

This is particularly frustrating because I know that Evernote assigns a GUID to each note - they just don't include them in the export file.
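The hash trick from the list above (decode the base64, then MD5 the bytes) can be sketched like this. `resource_hash` is a hypothetical helper name, and the sample data is just the bytes `hello` rather than a real image.

```python
import base64
import hashlib


def resource_hash(b64_data: str) -> str:
    """Return the hex MD5 of the decoded resource bytes."""
    # b64decode's default (validate=False) discards characters outside the
    # base64 alphabet, so line breaks inside the exported string are tolerated
    raw = base64.b64decode(b64_data)
    return hashlib.md5(raw).hexdigest()


encoded = base64.b64encode(b"hello").decode("ascii")
digest = resource_hash(encoded)
# Same data with a line break in the middle, as exported XML often wraps it
wrapped_digest = resource_hash(encoded[:4] + "\n" + encoded[4:])
print(digest)  # 5d41402abc4b2a76b9719d911017c592
```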

Writing an exporter

My Dogsheep family of tools aims to liberate personal data from all kinds of different sources and convert it into SQLite, which means I can explore, query and visualize it using Datasette.

dogsheep/evernote-to-sqlite is my latest entry in that series.

I based it on healthkit-to-sqlite because that, too, has to deal with a multiple GB XML file. Both tools use the Python standard library's XMLPullParser, which makes XML elements available as a stream without needing to load the entire file into memory at once.

The conversion code is here. It ended up being relatively straightforward, using ElementTree to extract data from the XML and sqlite-utils to write it to a database.

It implements a progress bar by tracking the number of bytes that have been read from the underlying XML file.
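Here's a minimal sketch of that streaming pattern, not the tool's actual code. `iter_notes` and its `chunk_size` parameter are illustrative names, and the running byte count is the sort of number a progress bar could be driven from.

```python
import io
import xml.etree.ElementTree as ET


def iter_notes(fp, chunk_size=64 * 1024):
    """Yield (bytes_read_so_far, title) for each <note> in a binary file."""
    parser = ET.XMLPullParser(events=("end",))
    bytes_read = 0
    while True:
        chunk = fp.read(chunk_size)
        if not chunk:
            break
        bytes_read += len(chunk)
        parser.feed(chunk)
        for _event, el in parser.read_events():
            if el.tag == "note":
                yield bytes_read, el.findtext("title")
                # Drop the note's children so memory use stays flat
                el.clear()
    parser.close()


# Made-up miniature export; a tiny chunk size forces several feed() calls
data = (
    b"<en-export><note><title>One</title></note>"
    b"<note><title>Two</title></note></en-export>"
)
results = list(iter_notes(io.BytesIO(data), chunk_size=16))
print([title for _, title in results])  # ['One', 'Two']
```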

One optimization: originally I created a single resources table with the resource metadata and a BLOB column containing the binary image contents.

This table was huge - over a GB - and had very poor performance for operations such as querying and faceting across unindexed columns - because any table scan had to work through MBs of binary data.

My personal Evernote notebook has around 2,000 resources. 2,000 rows of metadata should usually be very fast to query.

So... I split the binary data out into a two column resources_data table - md5 primary key and a BLOB for data. This massively sped up queries against that resources table.
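A sketch of that split using the standard library's sqlite3 module. Only the `md5` key and the `data` BLOB column come from the description above; the `mime` and `file_name` metadata columns and the sample row are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Small metadata rows: cheap to scan, filter and facet
    CREATE TABLE resources (
        md5 TEXT PRIMARY KEY,
        mime TEXT,       -- assumed column
        file_name TEXT   -- assumed column
    );
    -- Bulky binary content lives on its own, keyed by the same md5
    CREATE TABLE resources_data (
        md5 TEXT PRIMARY KEY,
        data BLOB
    );
    """
)
conn.execute(
    "INSERT INTO resources VALUES (?, ?, ?)", ("abc123", "image/png", "cat.png")
)
conn.execute(
    "INSERT INTO resources_data VALUES (?, ?)", ("abc123", b"\x89PNG...")
)

# Metadata queries never have to read past the image bytes
pngs = conn.execute(
    "SELECT file_name FROM resources WHERE mime = ?", ("image/png",)
).fetchall()

# The blob is fetched by primary key only when it is needed
(blob,) = conn.execute(
    "SELECT data FROM resources_data WHERE md5 = ?", ("abc123",)
).fetchone()
```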

And it works! Running the following command produces a SQLite database file containing all of my notes, note metadata and embedded images:

evernote-to-sqlite enex evernote.db MyNotes.enex

Serving up formatted notes

Figuring out the best way to serve up the note content in Datasette is still a work in progress - but I've hacked together a delightfully terrible way of doing this using a one-off plugin.

Datasette plugins are usually packaged and installed via PyPI, but there's an alternative option for plugins that don't warrant distribution: create a plugins/ directory, drop in one or more Python files and start Datasette pointing --plugins-dir at that directory.

This is great for hacking on terrible ideas. Here's the evernote.py plugin I wrote in full:

from datasette import hookimpl
# jinja2.Markup was removed in Jinja2 3.1; markupsafe.Markup is the same class
from markupsafe import Markup

START = "<en-note"
END = "</en-note>"
TEMPLATE = """
<div style="max-width: 500px; white-space: normal; overflow-wrap: break-word;">{}</div>
""".strip()

EN_MEDIA_SCRIPT = """
Array.from(document.querySelectorAll('en-media')).forEach(el => {
    let hash = el.getAttribute('hash');
    let type = el.getAttribute('type');
    let path = `/evernote/resources_data/${hash}.json?_shape=array`;
    fetch(path).then(r => r.json()).then(rows => {
        let b64 = rows[0].data.encoded;
        let data = `data:${type};base64,${b64}`;
        el.innerHTML = `<img style="max-width: 300px" src="${data}">`;
    });
});
"""

@hookimpl
def render_cell(value, table):
    if not table:
        # Don't render content from arbitrary SQL queries, could be XSS hole
        return
    if not value or not isinstance(value, str):
        return
    value = value.strip()
    if value.startswith(START) and value.endswith(END):
        # Strip the outer <en-note ...> and </en-note> tags
        trimmed = value[len(START) : -len(END)]
        trimmed = trimmed.split(">", 1)[1]
        # Replace those horrible double newlines
        trimmed = trimmed.replace("<div><br /></div>", "<br>")
        return Markup(TEMPLATE.format(trimmed))

@hookimpl
def extra_body_script():
    return EN_MEDIA_SCRIPT

This uses two Datasette plugin hooks.

render_cell() is called every time Datasette's interface renders the value of a column. Here I'm looking for the <en-note> and </en-note> tags and, if they are present, stripping them off and marking their inner content as safe to display using Markup() from markupsafe, the class that older Jinja2 releases exposed as jinja2.Markup() (without this they would be automatically HTML escaped).

extra_body_script() can be used to inject additional JavaScript at the bottom of the page. I'm injecting some particularly devious JavaScript which scans the page for Evernote's <en-media> tags and, if it finds any, runs a fetch() to get the base64-encoded data from Datasette and then injects that into the page as a 300px wide image element using a data: URI.

Like I said, this is a pretty terrifying hack! But it works just fine, and my notes are now visible inside my personal Datasette instance.

Bonus feature: search by OCR

An interesting feature of Evernote is that it runs cloud-based OCR against images in your notes, enabling you to search across the text contained within them.

It turns out the output of this OCR process is included in the export XML. It looks like this:

<recoIndex
  docType="unknown"
  objType="image" objID="76dd28b07797cc9f3f129c4871c5293c" 
  engineVersion="7.0.24.1" 
  recoType="service" 
  lang="en" 
  objWidth="670" 
  objHeight="128">
  <item x="26" y="52" w="81" h="29">
    <t w="76">This</t>
  </item>
  <item x="134" y="52" w="35" h="29">
    <t w="79">is</t>
  </item>
  <item x="196" y="60" w="37" h="21">
    <t w="73">so</t>
  </item>
  <item x="300" y="60" w="57" h="21">
    <t w="71">can</t>
  </item>
  <item x="382" y="54" w="79" h="27">
    <t w="77">test</t>
  </item>
  <item x="486" y="52" w="59" h="29">
    <t w="82">the</t>
  </item>
  <item x="570" y="54" w="59" h="25">
    <t w="74">OCR</t>
    <t w="33">DeR</t>
    <t w="15">OCR!</t>
    <t w="14">OCR]</t>
  </item>
</recoIndex>

As far as I can tell an <item> gets multiple <t> elements only if the OCR wasn't 100% sure.

evernote-to-sqlite creates a searchable ocr text field using values from that XML, so I can search my images in Datasette.
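One way to turn that XML into searchable text is to keep a single candidate per `<item>`. The sketch below assumes the `w` attribute on `<t>` is a confidence weight where higher is better. That is a guess based on the sample above, and this is not necessarily how evernote-to-sqlite does it.

```python
import xml.etree.ElementTree as ET

# Trimmed-down version of the sample above
RECO = """<recoIndex objType="image">
  <item x="26" y="52" w="81" h="29"><t w="76">This</t></item>
  <item x="570" y="54" w="59" h="25">
    <t w="74">OCR</t><t w="33">DeR</t><t w="15">OCR!</t>
  </item>
</recoIndex>"""


def ocr_text(reco_xml: str) -> str:
    """Join the highest-weighted <t> candidate from each <item>."""
    root = ET.fromstring(reco_xml)
    words = []
    for item in root.iter("item"):
        candidates = item.findall("t")
        if not candidates:
            continue
        # Assumption: w on <t> is a confidence weight, higher is better
        best = max(candidates, key=lambda t: int(t.get("w", "0")))
        words.append(best.text or "")
    return " ".join(words)


text = ocr_text(RECO)
print(text)  # This OCR
```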

Tags: projects, sqlite, xml, datasette, dogsheep, sqlite-utils

xml-analyser

2020-10-12T00:41:22+00:00

In building evernote-to-sqlite I dusted off an ancient (2009) project I built that scans through an XML file and provides a summary of what elements are present in the document and how they relate to each other. I’ve now packaged it up as a CLI app and published it on PyPI.

Tags: cli, projects, xml

Using memory-profiler to debug excessive memory usage in healthkit-to-sqlite

2019-07-24T08:25:07+00:00

This morning I figured out how to use the memory-profiler module (and mprof command line tool) to debug memory usage of Python processes. I added the details, including screenshots, to this GitHub issue. It helped me knock down RAM usage for my healthkit-to-sqlite from 2.5GB to just 80MB by making smarter usage of the ElementTree pull parser.

Tags: elementtree, memory, profiling, python, xml

Convert Locations.kml (pulled from an iPhone backup) to SQLite

2019-06-14T00:45:01+00:00

I’ve been playing around with data from my iPhone using the iPhone Backup Extractor app and one of the things it exports for you is a Locations.kml file full of location history data. I wrote a tiny script using Python’s ElementTree XMLPullParser to efficiently iterate through the Placemarks and yield them as dictionaries, which I then batch-inserted into sqlite-utils to create a SQLite database.
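The batch-insert half of that pattern might look something like this. The sketch uses the standard library's sqlite3 module in place of sqlite-utils, and the `locations` table, its columns and both helper names are assumptions rather than what the script actually does.

```python
import itertools
import sqlite3


def batches(iterable, size):
    """Yield lists of up to `size` items from any iterable."""
    it = iter(iterable)
    while True:
        batch = list(itertools.islice(it, size))
        if not batch:
            return
        yield batch


def insert_placemarks(conn, placemarks, batch_size=100):
    """Insert dictionaries into a (hypothetical) locations table in batches."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS locations "
        "(name TEXT, latitude REAL, longitude REAL)"
    )
    for batch in batches(placemarks, batch_size):
        with conn:  # one transaction per batch
            conn.executemany(
                "INSERT INTO locations VALUES (:name, :latitude, :longitude)",
                batch,
            )


conn = sqlite3.connect(":memory:")
# Made-up placemarks standing in for the dictionaries yielded by the parser
fake = ({"name": f"p{i}", "latitude": 1.0, "longitude": 2.0} for i in range(250))
insert_placemarks(conn, fake, batch_size=100)
total = conn.execute("SELECT COUNT(*) FROM locations").fetchone()[0]
print(total)  # 250
```

Because `batches` pulls lazily from the generator, only one batch of dictionaries is held in memory at a time.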

Tags: kml, projects, sqlite, xml, sqlite-utils

Exploring the UK Register of Members Interests with SQL and Datasette

2018-04-25T15:49:39+00:00

Ever wondered which UK Members of Parliament get gifted the most helicopter rides? How about which MPs have been given Christmas hampers by the Sultan of Brunei? (David Cameron, William Hague and Michael Howard apparently). Here’s how to dig through the Register of Members Interests using SQL and Datasette.

mySociety have been building incredible civic participation applications like TheyWorkForYou and FixMyStreet for nearly 15 years now, and have accumulated all kinds of interesting data along the way.

They recently launched their own data portal at data.mysociety.org listing all of the information they have available. While exploring it I stumbled across their copy of the UK Register of Members Interests. Every UK Member of Parliament has to register their conflicts of interest and income sources, and mySociety have an ongoing project to parse that data into a more useful format.

It won’t surprise you to hear that I couldn’t resist turning their XML files into a SQLite database.

The result is register-of-members-interests.datasettes.com - a Datasette instance running against a SQLite database containing over 1.3 million line-items registered by 1,419 MPs over the course of 18 years.

Some fun queries

A few of my favourites so far:

Which MPs have taken the most donated helicopter rides
Which MPs have accepted gifts from the Sultan of Brunei
A better gifts query with more robust de-duping: here's gifts of hampers, watches and Glastonbury festival tickets
Which MPs own significant shares in Shell, Apple, or BP.
Who has reported the most overseas trips to Saudi Arabia? The top hit there was Daniel Kawczynski, and it turns out his Wikipedia page has a section dedicated to his relationship with the kingdom.
How much do MPs get paid for appearances on Have I Got News For You. The going rate seems to be £1,500 but you can find some interesting discrepancies if you exclude that value from the results.
Which MPs are responsible for the most total line items reported

Understanding the data model

Most of the action takes place in the items table, where each item is a line-item from an MP’s filing. You can search that table by keyword (see helicopter example above) or apply filters to it using the standard Datasette interface. You can also execute your own SQL directly against the database.

Each item is filed against a category. There appears to have been quite a bit of churn in the way that the categories are defined over the years, plus the data is pretty untidy - there are no less than 10 ways of spelling “Remunerated employment, office, profession etc.” for example!

There are also a LOT of duplicate items in the set - it appears that MPs frequently list the same item (a rental property for example) every time they fill out the register. SQL DISTINCT clauses can help filter through these, as seen in some of the above examples.
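As a small illustration of the de-duping idea, here's a sketch against an in-memory table. The rows are invented, and the `date` column is an assumption about the schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (person_id TEXT, item TEXT, date TEXT)")
conn.executemany(
    "INSERT INTO items VALUES (?, ?, ?)",
    [
        # The same rental property re-listed in two different registers
        ("p1", "Rental property in London", "2016-01-01"),
        ("p1", "Rental property in London", "2017-01-01"),
        ("p2", "Helicopter ride", "2017-03-01"),
    ],
)

# Leaving the date out of the SELECT lets DISTINCT collapse the repeats
rows = conn.execute(
    "SELECT DISTINCT person_id, item FROM items ORDER BY person_id"
).fetchall()
print(len(rows))  # 2
```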

The data also has the concepts of both members and people. As far as I can tell people are distinct, but members may contain duplicates - presumably to represent MPs who have served more than one term in office. It looks like the member field stopped being populated in March 2015 so analysis is best performed against the people table.

One concept I have introduced myself is the record_id. In the XML documents the items are often grouped together into a related collection, like this:

<regmem personid="uk.org.publicwhip/person/10001"
    memberid="uk.org.publicwhip/member/40289" membername="Diane Abbott" date="2014-07-14">
    <category type="2" name="Remunerated employment, office, profession etc">
        <item>Payments from MRL Public Sector Consultants, Pepple House, 8 Broad Street, Great Cambourne, Cambridge CB23 6HJ:</item>
        <item>26 November 2013, I received a fee of £1,000 for speaking at the 1st African Legislative Summit, National Assembly, Abuja, Nigeria.  Hours: 8 hrs. The cost of my flights, transfers and hotel accommodation in Abuja were also met; estimated value £5,000. <em>(Registered 3 December 2013)</em></item>
        <item>23 July 2013, I received a fee of £5,000 for appearing as a contestant on ITV&#8217;s &#8216;The Chase Celebrity &#8211; Series 3&#8217; television programme.  Address of payer:  ITV Studios Ltd, London Television Centre, Upper Ground, London SE1 9Lt.  Hours: 12 hrs.   <em>(Registered 23 July 2013)</em></item>
    </category>
</regmem>

While these items are presented as separate line items, their grouping carries meaning: the first line item here acts as a kind of heading to help provide context to the other items.

To model this in the simplest way possible, I’ve attempted to preserve the order of these groups using a pair of additional columns: the record_id and the sort_order. I construct the record_id using a collection of other fields - the idea is for it to be sortable, and for each line-item in the same grouping to have the same record_id:

record_id = "{date}-{category_id}-{person_id}-{record}".format(
    date=date,
    category_id=category_id,
    person_id=person_id.split("/")[-1],
    record=record,
)

The resulting record_id might look like this: 2018-04-16-70b64e89-24878-0

To recreate that particular sequence of line-items, you can search for all items matching that record_id and then sort them by their sort_order. Here’s that record from Diane Abbott shown with its surrounding context.
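That lookup can be sketched as a single query. The table here is a cut-down stand-in for the real items table, and the record_id value and row text are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (record_id TEXT, sort_order INTEGER, item TEXT)")

rid = "2014-07-14-2-10001-0"  # made-up record_id
conn.executemany(
    "INSERT INTO items VALUES (?, ?, ?)",
    [
        # Inserted out of order on purpose
        (rid, 2, "Fee for appearing on a television programme"),
        (rid, 0, "Payments from a consultancy:"),
        (rid, 1, "Fee for speaking at a summit"),
        ("2014-07-14-3-10001-0", 0, "Unrelated item in another group"),
    ],
)

# Same record_id selects the group; sort_order restores the original sequence
group = [
    row[0]
    for row in conn.execute(
        "SELECT item FROM items WHERE record_id = ? ORDER BY sort_order", (rid,)
    )
]
print(group[0])  # Payments from a consultancy:
```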

How I built it

The short version: I downloaded all of the XML files and wrote a Python script which parsed them using ElementTree and inserted them into a SQLite database. I’ve put the code on GitHub.

A couple of fun tricks: firstly, I borrowed some code from csvs-to-sqlite to create the full-text search index and enable searching:

def create_and_populate_fts(conn):
    create_sql = """
        CREATE VIRTUAL TABLE "items_fts"
        USING {fts_version} (item, person_name, content="items")
    """.format(
        fts_version=best_fts_version()
    )
    conn.executescript(create_sql)
    conn.executescript(
        """
        INSERT INTO "items_fts" (rowid, item, person_name)
        SELECT items.rowid, items.item, people.name
        FROM items LEFT JOIN people ON items.person_id = people.id
    """
    )

The best_fts_version() function implements basic feature detection against SQLite by trying operations in an in-memory database.
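Feature detection along those lines could be sketched as follows. This is a reconstruction of the idea under my own assumptions, not necessarily the code in csvs-to-sqlite, which is why it uses the different name `detect_fts_version`.

```python
import sqlite3


def detect_fts_version():
    """Return the newest FTS module this SQLite build supports, or None."""
    conn = sqlite3.connect(":memory:")
    try:
        for fts in ("FTS5", "FTS4", "FTS3"):
            try:
                # Creating a throwaway virtual table fails if the module is missing
                conn.execute(f"CREATE VIRTUAL TABLE v_{fts} USING {fts} (t)")
                return fts
            except sqlite3.OperationalError:
                continue
        return None
    finally:
        conn.close()


detected = detect_fts_version()
print(detected)
```

Trying the operation and catching the error avoids having to parse version numbers or compile options.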

Secondly, I ended up writing my own tiny utility function for inserting records into SQLite. SQLite has useful INSERT OR REPLACE INTO syntax which allows you to insert a record and will automatically replace an existing record if there is a match on the primary key. This meant I could write this utility function and use it for all of my data inserts:

def insert_or_replace(conn, table, record):
    pairs = record.items()
    columns = [p[0] for p in pairs]
    params = [p[1] for p in pairs]
    sql = "INSERT OR REPLACE INTO {table} ({column_list}) VALUES ({value_list});".format(
        table=table,
        column_list=", ".join(columns),
        value_list=", ".join(["?" for p in params]),
    )
    conn.execute(sql, params)

# ...

insert_or_replace(
    db,
    "people",
    {
        "id": person_id,
        "name": regmem_el.attrib["membername"],
    },
)

What can you find?

I’ve really only scratched the surface of what’s in here with my initial queries. What can you find? Send me Datasette query links on Twitter with your discoveries!

Tags: mysociety, political-hacking, politics, projects, sqlite, xml, datasette

Has JSON pretty much replaced XML for string processing for the web, or are there use cases where XML is still necessary?

2012-02-25T17:17:00+00:00

My answer to Has JSON pretty much replaced XML for string processing for the web, or are there use cases where XML is still necessary? on Quora

It's replaced XML as the default format for most APIs. XML is still necessary for Atom/RSS feeds and other existing standards built on top of XML. It's also a better choice than JSON for markup-style data - stuff like XHTML where tags are applied to sequences of characters within larger chunks of text.

Tags: json, web-development, xml, quora

What are XML feed best practices?

2012-01-31T14:29:00+00:00

My answer to What are XML feed best practices? on Quora

It sounds like you're pretty much screwed already, if you're dealing with companies that still think FTPing XML around is a sensible thing to do.

I would suggest focusing on what you can control. Assume that you will be passed bad data - weird formats, not-well-formed XML, duplicate entries etc. Your job is to handle all of this without going mad, and without your codebase turning into an unmanageable ball of mud.

So, start by figuring out your own core data model / abstraction. It will need to be VERY loose - as few required fields as possible, since you can be sure some of the feeds you are consuming will come in with stuff missing at some point or another.

Separate your feed consumers from the rest of your code. Having your own good internal Web API (which could consume JSON rather than XML since you control it) might be smart, since that will provide a solid separation and you can then write all of your feed consumers as separate pieces of code that just POST new items to the API.

Learn to love, respect and cherish unique identifiers... but be very wary of supposedly unique identifiers from external sources unless you can be absolutely sure they won't change on you. Create your own unique IDs at the first available opportunity, treat them properly within your own system and map external identifiers to them whenever you can.
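A bare-bones sketch of that mapping idea, using an in-memory dictionary where a real system would use a database table. The `IdMap` class, the feed names and the external IDs are all hypothetical.

```python
import uuid


class IdMap:
    """Map (source, external_id) pairs to IDs that this system controls."""

    def __init__(self):
        self._by_external = {}

    def internal_id(self, source: str, external_id: str) -> str:
        key = (source, external_id)
        if key not in self._by_external:
            # Mint our own ID the first time an external identifier is seen
            self._by_external[key] = uuid.uuid4().hex
        return self._by_external[key]


ids = IdMap()
first = ids.internal_id("partner-a", "42")
again = ids.internal_id("partner-a", "42")
other = ids.internal_id("partner-b", "42")
print(first == again, first == other)  # True False
```

Keying on the source as well as the external ID means two feeds that happen to reuse the same identifier don't collide.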

Write your consumers in a dynamic language with a solid interactive prompt, like Python or Ruby. This will make them much easier to write and debug. Use whatever you like for your core data storage / API.

Since your incoming data will come in all shapes and sizes, consider a document store such as MongoDB or Riak over a SQL database. Avoiding SQL migrations will help you out a lot.

Log and store absolutely everything. Ideally you should be able to re-execute every import that the system has ever executed, in order, to make debugging and fixing errors non-terrifying. That will almost certainly prove impossible, but it's a nice thought.

Good luck!

Tags: databases, mysql, php, xml, quora

What is the difference between XHTML 1.0 strict and transitional?

2012-01-14T13:09:00+00:00

My answer to What is the difference between XHTML 1.0 strict and transitional? on Quora

Not a lot. XHTML transitional lets you use a few presentational attributes and elements that aren't available in XHTML strict. Here's a more detailed overview from back in 2005: http://24ways.org/2005/transitio...

These days you're much better off using HTML5 and ignoring XHTML entirely.

Tags: html, web-development, xhtml, xml, quora

Indexing JSON in Solr 3.1

2010-12-10T09:46:00+00:00

Indexing JSON in Solr 3.1

The next release of Solr will support indexing documents provided as JSON—Solr currently requires incoming documents to be formatted as XML.

Tags: json, search, solr, xml, recovered

Quoting James Clark

2010-12-02T18:48:00+00:00

I think the Web community has spoken, and it’s clear that what it wants is HTML5, JavaScript and JSON. XML isn’t going away but I see it being less and less a Web technology; it won’t be something that you send over the wire on the public Web, but just one of many technologies that are used on the server to manage and generate what you do send over the wire.

— James Clark

Tags: html5, json, xml, recovered

Introducing BERT and BERT-RPC

2009-10-21T22:11:37+00:00

Introducing BERT and BERT-RPC

Justification for inventing a brand new serialisation protocol: Thrift and Protocol Buffers both use IDLs and code generation, XML “is not convertible to a simple unambiguous data structure in any language I’ve ever used” and JSON lacks support for unencoded binary data. The result is BERT—Binary ERlang Term—which extracts a format from Erlang in much the same way that JSON extracted one from JavaScript.

Tags: erlang, github, javascript, json, protocolbuffers, serialisation, thrift, xml

minixsv

2009-08-12T16:59:57+00:00

minixsv

As far as I can tell, this is the only library that can validate XML using pure Python (no C extension required). I’d be extremely happy if someone would write a pure Python library (or one that only depends on ElementTree, which is included in the standard library) for validating XML against a Relax NG Compact syntax schema. Even DTD validation would be better than nothing!

Via Stack Overflow

Tags: elementtree, minixsv, python, relaxng, validation, xml, xmlschema

xmlwitch

2009-07-24T00:33:57+00:00

xmlwitch

An XML building library for Python that doesn’t suck (I love ElementTree for parsing XML, but I’ve never really liked it for generation). Makes smart use of the with statement.

Tags: python, withstatement, xml, xmlwitch

Quoting Henri Sivonen

2009-07-06T12:46:23+00:00

There are two meanings to XHTML: technical and marketing. The technical kind (XHTML served using the application/xhtml+xml MIME type) is a formulation of HTML as an XML vocabulary. The marketing kind (XHTML served using the text/html MIME type) is processed just like HTML by browsers but the authors attempt to observe slightly different syntax rules in order to make it seem that they are doing something newer and shinier compared to HTML.

— Henri Sivonen

Tags: buzzwords, henri-sivonen, xhtml, xml

With YQL Execute, the Internet becomes your database

2009-04-29T22:50:54+00:00

With YQL Execute, the Internet becomes your database

This is nuts (in a good way). Yahoo!’s intriguing universal SQL-style XML/JSONP web service interface now supports JavaScript as a kind of stored procedure language, meaning you can use JavaScript and E4X to screen-scrape web pages, then query the results with YQL.

Tags: apis, e4x, javascript, json, jsonp, sql, xml, yahoo, yql

A few notes on the Guardian Open Platform

2009-03-10T14:28:39+00:00

This morning we launched the Guardian Open Platform at a well attended event in our new offices in Kings Place. This is one of the main projects I've been helping out with since joining the Guardian last year, and it's fantastic to finally have it out in the open.

There are two components to the launch today: the Content API and the Data Store. I'll describe the Data Store first as it deserves not to get buried in the discussion about its larger cousin.

The Data Store

Simon Rogers is the Guardian news editor who is principally responsible for gathering data about the world. If you ever see an infographic in the paper, the chances are Simon had a hand in researching the data for it. His delicious feed is a positive gold mine.

As of today, a sizeable portion of the data he collects for the newspaper will also be published online. As a starting point, we're publishing over 80 data sets, all using Google Spreadsheets which means it's all accessible through the Spreadsheets Data API.

Here's Simon's take on it, from Welcome to the Datablog:

Everyday we work with datasets from around the world. We have had to check this data and make sure it's the best we can get, from the most credible sources. But then it lives for the moment of the paper's publication and afterward disappears into a hard drive, rarely to emerge again before updating a year later.

So, together with its companion site, the Data Store – a directory of all the stats we post – we are opening up that data for everyone. Whenever we come across something interesting or relevant or useful, we'll post it up here and let you know what we're planning to do with it.

It's worth spending quite a while digging around the data. Most sets come with a full description, including where the data was sourced from. New data sets will be announced on the Datablog, which is cleverly subtitled "Facts are sacred".

The Content API

The Content API provides REST-ish access to over a million items of content, mostly from the last decade but with a few gems that are a little bit older. Various types of content are available - article is the most common, but you can grab information (though not necessarily content) about audio, video, galleries and more. You can retrieve 50 items at a time, and pagination is unlimited (provided you stay below the API's rate limit).

Articles are provided with their full body content, though this does not currently include any HTML tags (a known issue). It's a good idea to review our terms and conditions, but you should know that if you opt to republish our article bodies on your site we may ask you to include our ads alongside our content in the future.

We serve 15 minute HTTP cache headers, but you are allowed to store our content for up to 24 hours. You really, really don't want to store content for longer than that, as in addition to violating our T&Cs you might find yourself inadvertently publishing an article that has been retracted for legal reasons. UK libel laws can be pretty scary.

In addition to regular search, you can also filter our content using tags. Tags are a core aspect of the Guardian's R2 platform, being used for keywords, contributors, "series" (used to implement blogs), content types and more. Every item returned by the API includes tags, and the tags can be used to further filter the results.

We also return a list of filters at the bottom of each page of search results showing the tags that could be used to filter that result set, ordered by the number of results (you may have seen this feature referred to as faceted search or guided navigation). Handy tip: you can use ?count=0 in your search API call to turn off results entirely and just get back the filters section. The race is on to be first to release a tag relationship browser based on this feature.

API responses can be had in custom XML, JSON or Atom. The Atom format is the least mature at the moment, and we'd welcome suggestions for improving it from the community.

I released a Python client library for the API this morning, and we also have libraries for Ruby, Java and PHP.

We also have an API Explorer (written in JavaScript and jQuery, hosted on the same domain as the API so that it can make Ajax requests) but you'll need an API key to try it out.

The bad news

The response to the API release has been terrific (check out what Tom Watson had to say), but as a result it's likely that API key provisions will be significantly lower than the overall demand for them. Please bear with us while we work towards a more widely accessible release.

Tags: apis, atom, contentapi, data, data-journalism, datastore, guardian, javascript, journalism, jquery, json, openplatform, python, simon-rogers, tom-watson, xml

JsonML (JSON Markup Language)

2009-02-10T15:03:21+00:00

JsonML (JSON Markup Language)

An almost non-lossy serialization format for sending XML as JSON (plain text in between elements is ignored). Uses the (element-name, attribute-dictionary, list-of-children) tuple format, which sadly means many common cases end up taking more bytes than the original XML. Still an improvement on serializations that behave differently when a list of children has only one item in it.
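A rough sketch of that tuple shape in Python, using ElementTree. This is my own illustration of the (element-name, attribute-dictionary, list-of-children) layout described above, not the JsonML reference implementation; the `to_tuple_form` name and the sample markup are made up, and dropping text that sits between sibling elements is one reading of the caveat above.

```python
import json
import xml.etree.ElementTree as ET


def to_tuple_form(element):
    """Convert an element to [name, attributes, children]."""
    children = []
    text = (element.text or "").strip()
    if text:
        children.append(text)
    for child in element:
        children.append(to_tuple_form(child))
        # child.tail (text between sibling elements) is deliberately dropped.
    return [element.tag, dict(element.attrib), children]


xml_source = '<ul class="nav"><li>Home</li><li>About</li></ul>'
converted = to_tuple_form(ET.fromstring(xml_source))
encoded = json.dumps(converted, separators=(",", ":"))
print(encoded)
print(len(xml_source), len(encoded))
```

Even with compact separators, the empty attribute dictionaries and the extra brackets make the JSON for this small list longer than the XML it came from.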

Tags: json, jsonml, serialization, xml

Crowbar

2009-01-24T23:52:55+00:00

Crowbar

Headless Gecko/XULRunner which exposes a web service API for screen scraping using a real browser DOM—just pass it the URL of a page and the URL of a screen scraping JavaScript script (a bit like a Greasemonkey user script) and get back RDF/XML.

Tags: crowbar, dom, gecko, greasemonkey, mozilla, rdf, scraping, webservice, xml, xulrunner

How to install lxml python module on mac os 10.5 (leopard)

2008-12-15T00:05:21+00:00

How to install lxml python module on mac os 10.5 (leopard)

Instructions that work! Finally, I can find out what all the fuss is about.

Tags: leopard, libxml2, lxml, macos, python, xml

pyquery

2008-12-06T09:53:45+00:00

pyquery

“A jQuery-like library for Python”—implemented on top of lxml, providing jQuery style methods for manipulating an HTML or XML document.

Tags: jquery, lxml, pyquery, python, xml

Magnificent Seven - the value of Atom

2008-10-19T22:24:21+00:00

Magnificent Seven - the value of Atom

The seven core things that Atom solves so that you don’t have to.

Tags: atom, bill-de-hora, rest, xml

cascadenik: cascading sheets of style for mapnik

2008-08-30T10:04:35+00:00

cascadenik: cascading sheets of style for mapnik

Great idea. Mapnik (the open source tile rendering system used by OpenStreetMap and others) has a complex style configuration based on XML. Michal Migurski has built a CSS-style equivalent which compiles down to XML, hopefully making it much quicker and easier to get started with Mapnik customisation.

Tags: cascadenik, css, mapnik, mapping, michal-migurski, openstreetmap, xml

Tip: Configure SAX parsers for secure processing

2008-08-23T11:12:10+00:00

Tip: Configure SAX parsers for secure processing

Explains the billion laughs attack, among others.

Tags: billionlaughs, elliotte-rusty-harold, sax, security, xml

DoS vulnerability in REXML

2008-08-23T11:11:13+00:00

DoS vulnerability in REXML

Ruby’s REXML library is susceptible to the “billion laughs” denial of service attack where recursively nested entities expand a single entity reference to a billion characters (kind of like the exploding zip file attack). Rails applications that process user-supplied XML should apply the monkey-patch ASAP; a proper gem update is forthcoming.
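The arithmetic behind the attack can be sketched in Python without parsing anything. The entity names, the nine levels and the fan-out of ten are the classic illustration and are my assumptions here, not details taken from the REXML advisory; the code only builds the declarations as strings and computes how large the expansion would be.

```python
def entity_declarations(levels, fanout, seed="lol"):
    """Build DTD entity lines where each level references the previous one `fanout` times."""
    lines = [f'<!ENTITY lol0 "{seed}">']
    for level in range(1, levels + 1):
        references = f"&lol{level - 1};" * fanout
        lines.append(f'<!ENTITY lol{level} "{references}">')
    return lines


def expanded_length(levels, fanout, seed="lol"):
    """Characters produced if the top-level entity were fully expanded."""
    return len(seed) * fanout ** levels


declarations = entity_declarations(levels=9, fanout=10)
document_size = sum(len(line) for line in declarations)
print(document_size, expanded_length(levels=9, fanout=10))
```

Ten short declarations stay under a kilobyte on the wire, while the fully expanded top-level entity would hold a billion copies of the seed string.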

Tags: billionlaughs, denial-of-service, rails, rexml, ruby, security, xml