<?xml version="1.0" encoding="utf-8"?>
<feed xml:lang="en-us" xmlns="http://www.w3.org/2005/Atom"><title>Simon Willison's Weblog: henri-sivonen</title><link href="http://simonwillison.net/" rel="alternate"/><link href="http://simonwillison.net/tags/henri-sivonen.atom" rel="self"/><id>http://simonwillison.net/</id><updated>2025-12-17T01:48:54+00:00</updated><author><name>Simon Willison</name></author><entry><title>firefox parser/html/java/README.txt</title><link href="https://simonwillison.net/2025/Dec/17/firefox-parser/#atom-tag" rel="alternate"/><published>2025-12-17T01:48:54+00:00</published><updated>2025-12-17T01:48:54+00:00</updated><id>https://simonwillison.net/2025/Dec/17/firefox-parser/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/mozilla-firefox/firefox/tree/main/parser/html/java"&gt;firefox parser/html/java/README.txt&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;TIL (or TIR - &lt;a href="https://simonwillison.net/2009/Jul/11/john/"&gt;Today I was Reminded&lt;/a&gt;) that the HTML5 Parser used by Firefox is maintained as Java code (&lt;a href="https://github.com/mozilla-firefox/firefox/commits/main/parser/html/javasrc"&gt;commit history here&lt;/a&gt;) and converted to C++ using a custom translation script.&lt;/p&gt;
&lt;p&gt;You can see that in action by checking out the ~8GB Firefox repository and running:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;cd parser/html/java
make sync
make translate
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here's &lt;a href="http://gistpreview.github.io/?e53ff836cb44816670adddc3a518b3cc"&gt;a terminal session where I did that&lt;/a&gt;, including the output of &lt;code&gt;git diff&lt;/code&gt; showing the updated C++ files.&lt;/p&gt;
&lt;p&gt;I did some digging and found that the code that does the translation work lives, weirdly, in the &lt;a href="https://github.com/validator/validator"&gt;Nu Html Checker&lt;/a&gt; repository on GitHub, which powers the W3C's &lt;a href="https://validator.w3.org/nu/"&gt;validator.w3.org/nu/&lt;/a&gt; validation service!&lt;/p&gt;
&lt;p&gt;Here's a snippet from &lt;a href="https://github.com/validator/validator/blob/dfd1948624259c63027bc5953e89bdeee81fb7b0/htmlparser/translator-src/nu/validator/htmlparser/cpptranslate/CppVisitor.java#L421-L442"&gt;htmlparser/cpptranslate/CppVisitor.java&lt;/a&gt; showing how a class declaration is converted into C++:&lt;/p&gt;
&lt;pre&gt;    &lt;span class="pl-k"&gt;protected&lt;/span&gt; &lt;span class="pl-smi"&gt;void&lt;/span&gt; &lt;span class="pl-en"&gt;startClassDeclaration&lt;/span&gt;() {
        &lt;span class="pl-s1"&gt;printer&lt;/span&gt;.&lt;span class="pl-en"&gt;print&lt;/span&gt;(&lt;span class="pl-s"&gt;"#define "&lt;/span&gt;);
        &lt;span class="pl-s1"&gt;printer&lt;/span&gt;.&lt;span class="pl-en"&gt;print&lt;/span&gt;(&lt;span class="pl-s1"&gt;className&lt;/span&gt;);
        &lt;span class="pl-s1"&gt;printer&lt;/span&gt;.&lt;span class="pl-en"&gt;printLn&lt;/span&gt;(&lt;span class="pl-s"&gt;"_cpp__"&lt;/span&gt;);
        &lt;span class="pl-s1"&gt;printer&lt;/span&gt;.&lt;span class="pl-en"&gt;printLn&lt;/span&gt;();

        &lt;span class="pl-k"&gt;for&lt;/span&gt; (&lt;span class="pl-smi"&gt;int&lt;/span&gt; &lt;span class="pl-s1"&gt;i&lt;/span&gt; = &lt;span class="pl-c1"&gt;0&lt;/span&gt;; &lt;span class="pl-s1"&gt;i&lt;/span&gt; &amp;lt; &lt;span class="pl-smi"&gt;Main&lt;/span&gt;.&lt;span class="pl-c1"&gt;H_LIST&lt;/span&gt;.&lt;span class="pl-s1"&gt;length&lt;/span&gt;; &lt;span class="pl-s1"&gt;i&lt;/span&gt;++) {
            &lt;span class="pl-smi"&gt;String&lt;/span&gt; &lt;span class="pl-s1"&gt;klazz&lt;/span&gt; = &lt;span class="pl-smi"&gt;Main&lt;/span&gt;.&lt;span class="pl-c1"&gt;H_LIST&lt;/span&gt;[&lt;span class="pl-s1"&gt;i&lt;/span&gt;];
            &lt;span class="pl-k"&gt;if&lt;/span&gt; (!&lt;span class="pl-s1"&gt;klazz&lt;/span&gt;.&lt;span class="pl-en"&gt;equals&lt;/span&gt;(&lt;span class="pl-s1"&gt;javaClassName&lt;/span&gt;)) {
                &lt;span class="pl-s1"&gt;printer&lt;/span&gt;.&lt;span class="pl-en"&gt;print&lt;/span&gt;(&lt;span class="pl-s"&gt;"#include &lt;span class="pl-cce"&gt;\"&lt;/span&gt;"&lt;/span&gt;);
                &lt;span class="pl-s1"&gt;printer&lt;/span&gt;.&lt;span class="pl-en"&gt;print&lt;/span&gt;(&lt;span class="pl-s1"&gt;cppTypes&lt;/span&gt;.&lt;span class="pl-en"&gt;classPrefix&lt;/span&gt;());
                &lt;span class="pl-s1"&gt;printer&lt;/span&gt;.&lt;span class="pl-en"&gt;print&lt;/span&gt;(&lt;span class="pl-s1"&gt;klazz&lt;/span&gt;);
                &lt;span class="pl-s1"&gt;printer&lt;/span&gt;.&lt;span class="pl-en"&gt;printLn&lt;/span&gt;(&lt;span class="pl-s"&gt;".h&lt;span class="pl-cce"&gt;\"&lt;/span&gt;"&lt;/span&gt;);
            }
        }

        &lt;span class="pl-s1"&gt;printer&lt;/span&gt;.&lt;span class="pl-en"&gt;printLn&lt;/span&gt;();
        &lt;span class="pl-s1"&gt;printer&lt;/span&gt;.&lt;span class="pl-en"&gt;print&lt;/span&gt;(&lt;span class="pl-s"&gt;"#include &lt;span class="pl-cce"&gt;\"&lt;/span&gt;"&lt;/span&gt;);
        &lt;span class="pl-s1"&gt;printer&lt;/span&gt;.&lt;span class="pl-en"&gt;print&lt;/span&gt;(&lt;span class="pl-s1"&gt;className&lt;/span&gt;);
        &lt;span class="pl-s1"&gt;printer&lt;/span&gt;.&lt;span class="pl-en"&gt;printLn&lt;/span&gt;(&lt;span class="pl-s"&gt;".h&lt;span class="pl-cce"&gt;\"&lt;/span&gt;"&lt;/span&gt;);
        &lt;span class="pl-s1"&gt;printer&lt;/span&gt;.&lt;span class="pl-en"&gt;printLn&lt;/span&gt;();
    }&lt;/pre&gt;
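&lt;p&gt;To make the output of that method concrete, here is a minimal standalone sketch of the same preamble logic. It is not the translator's actual code: the &lt;code&gt;PreambleSketch&lt;/code&gt; class, the &lt;code&gt;preamble&lt;/code&gt; method, the sample class names and the idea that the C++ class name is simply the prefix plus the Java class name are all assumptions made for illustration.&lt;/p&gt;

```java
// Hypothetical, simplified stand-in for the preamble logic quoted above.
// The class name, method name and sample values are illustrative only.
public class PreambleSketch {

    // Builds the text that the quoted method would send to its printer.
    public static String preamble(String classPrefix, String javaClassName, String[] headerClasses) {
        // Assumption: the C++ class name is the prefix plus the Java class name.
        String className = classPrefix + javaClassName;
        StringBuilder out = new StringBuilder();
        out.append("#define ").append(className).append("_cpp__\n\n");
        // Include the header of every other translated class.
        for (String klazz : headerClasses) {
            if (!klazz.equals(javaClassName)) {
                out.append("#include \"").append(classPrefix).append(klazz).append(".h\"\n");
            }
        }
        // Finally include the class's own header, set off by blank lines.
        out.append("\n#include \"").append(className).append(".h\"\n\n");
        return out.toString();
    }

    public static void main(String[] args) {
        // Made-up subset of class names, not the real contents of Main.H_LIST.
        String[] headerClasses = {"Tokenizer", "TreeBuilder", "StackNode"};
        System.out.print(preamble("nsHtml5", "Tokenizer", headerClasses));
    }
}
```

&lt;p&gt;Run with those sample values, the sketch emits a &lt;code&gt;#define&lt;/code&gt; line for the class being translated, one &lt;code&gt;#include&lt;/code&gt; per other class in the list, and then an &lt;code&gt;#include&lt;/code&gt; of the class's own header.&lt;/p&gt;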

&lt;p&gt;Here's a &lt;a href="https://johnresig.com/blog/html-5-parsing/"&gt;fascinating blog post&lt;/a&gt; from John Resig explaining how validator author Henri Sivonen introduced the new parser into Firefox in 2009.

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=46295771#46296888"&gt;Hacker News conversation&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/c-plus-plus"&gt;c-plus-plus&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/firefox2"&gt;firefox2&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/henri-sivonen"&gt;henri-sivonen&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/java"&gt;java&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/john-resig"&gt;john-resig&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/mozilla"&gt;mozilla&lt;/a&gt;&lt;/p&gt;



</summary><category term="c-plus-plus"/><category term="firefox2"/><category term="henri-sivonen"/><category term="java"/><category term="john-resig"/><category term="mozilla"/></entry><entry><title>Firefox 4: the HTML5 parser - inline SVG, speed and more</title><link href="https://simonwillison.net/2010/May/12/firefox/#atom-tag" rel="alternate"/><published>2010-05-12T08:56:00+00:00</published><updated>2010-05-12T08:56:00+00:00</updated><id>https://simonwillison.net/2010/May/12/firefox/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="http://hacks.mozilla.org/2010/05/firefox-4-the-html5-parser-inline-svg-speed-and-more/"&gt;Firefox 4: the HTML5 parser - inline SVG, speed and more&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
A complete replacement for the oldest part of Gecko (the HTML parser dates back to 1998) headed up by HTML5 validator author Henri Sivonen, using the parsing algorithm defined in the HTML5 specification. Improvements include parsing taking place off the main UI thread and the ability to embed SVG and MathML directly inline in HTML pages.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/firefox"&gt;firefox&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gecko"&gt;gecko&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/henri-sivonen"&gt;henri-sivonen&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/html5"&gt;html5&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/mathml"&gt;mathml&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/svg"&gt;svg&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/recovered"&gt;recovered&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/firefox4"&gt;firefox4&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/parser"&gt;parser&lt;/a&gt;&lt;/p&gt;



</summary><category term="firefox"/><category term="gecko"/><category term="henri-sivonen"/><category term="html5"/><category term="mathml"/><category term="svg"/><category term="recovered"/><category term="firefox4"/><category term="parser"/></entry><entry><title>HTML 5 Parsing</title><link href="https://simonwillison.net/2009/Jul/11/john/#atom-tag" rel="alternate"/><published>2009-07-11T23:36:57+00:00</published><updated>2009-07-11T23:36:57+00:00</updated><id>https://simonwillison.net/2009/Jul/11/john/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="http://ejohn.org/blog/html-5-parsing/"&gt;HTML 5 Parsing&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Firefox nightlies include a new parser that implements the HTML5 parsing algorithm (disabled by default), which uses C++ code automatically generated from Henri Sivonen’s Java parser first used in the HTML5 validator.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/browsers"&gt;browsers&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/firefox"&gt;firefox&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/henri-sivonen"&gt;henri-sivonen&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/html5"&gt;html5&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/john-resig"&gt;john-resig&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/mozilla"&gt;mozilla&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/parsing"&gt;parsing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/validator"&gt;validator&lt;/a&gt;&lt;/p&gt;



</summary><category term="browsers"/><category term="firefox"/><category term="henri-sivonen"/><category term="html5"/><category term="john-resig"/><category term="mozilla"/><category term="parsing"/><category term="validator"/></entry><entry><title>Quoting Henri Sivonen</title><link href="https://simonwillison.net/2009/Jul/6/xhtml/#atom-tag" rel="alternate"/><published>2009-07-06T12:46:23+00:00</published><updated>2009-07-06T12:46:23+00:00</updated><id>https://simonwillison.net/2009/Jul/6/xhtml/#atom-tag</id><summary type="html">
    &lt;blockquote cite="http://hsivonen.iki.fi/xhtml2-html5-q-and-a/"&gt;&lt;p&gt;There are two meanings to XHTML: technical and marketing. The technical kind (XHTML served using the application/xhtml xml MIME type) is a formulation of HTML as an XML vocabulary. The marketing kind (XHTML served using the text/html MIME type) is processed just like HTML by browsers but the authors attempt to observe slightly different syntax rules in order to make it seem that they are doing something newer and shinier compared to HTML.&lt;/p&gt;&lt;/blockquote&gt;
&lt;p class="cite"&gt;&amp;mdash; &lt;a href="http://hsivonen.iki.fi/xhtml2-html5-q-and-a/"&gt;Henri Sivonen&lt;/a&gt;&lt;/p&gt;

    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/buzzwords"&gt;buzzwords&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/henri-sivonen"&gt;henri-sivonen&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/xhtml"&gt;xhtml&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/xml"&gt;xml&lt;/a&gt;&lt;/p&gt;



</summary><category term="buzzwords"/><category term="henri-sivonen"/><category term="xhtml"/><category term="xml"/></entry><entry><title>An Unofficial Q&amp;A about the Discontinuation of the XHTML2 WG</title><link href="https://simonwillison.net/2009/Jul/6/unnofficial/#atom-tag" rel="alternate"/><published>2009-07-06T12:27:18+00:00</published><updated>2009-07-06T12:27:18+00:00</updated><id>https://simonwillison.net/2009/Jul/6/unnofficial/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="http://hsivonen.iki.fi/xhtml2-html5-q-and-a/"&gt;An Unnofficial Q&amp;amp;A about the Discontinuation of the XHTML2 WG&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
By Henri Sivonen.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/faq"&gt;faq&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/henri-sivonen"&gt;henri-sivonen&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/html"&gt;html&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/html2"&gt;html2&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/html5"&gt;html5&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/w3c"&gt;w3c&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/web-standards"&gt;web-standards&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/xhtml"&gt;xhtml&lt;/a&gt;&lt;/p&gt;



</summary><category term="faq"/><category term="henri-sivonen"/><category term="html"/><category term="html2"/><category term="html5"/><category term="w3c"/><category term="web-standards"/><category term="xhtml"/></entry><entry><title>Quoting Henri Sivonen</title><link href="https://simonwillison.net/2008/Mar/20/reality/#atom-tag" rel="alternate"/><published>2008-03-20T14:43:43+00:00</published><updated>2008-03-20T14:43:43+00:00</updated><id>https://simonwillison.net/2008/Mar/20/reality/#atom-tag</id><summary type="html">
    &lt;blockquote cite="http://hsivonen.iki.fi/rdf/"&gt;&lt;p&gt;Draconian failure on error is not the answer problems of Postel's law. Draconian error handling  creates an unstable equilibrium in Game Theory terms  - it only lasts until one player breaks the rule. One non-Draconian XML5 implementation in key client product and the Draconian XML ranks would break. Well-specified error recovery is the right way to implement the liberal part of Postel's law.&lt;/p&gt;&lt;/blockquote&gt;
&lt;p class="cite"&gt;&amp;mdash; &lt;a href="http://hsivonen.iki.fi/rdf/"&gt;Henri Sivonen&lt;/a&gt;&lt;/p&gt;

    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/draconian"&gt;draconian&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/henri-sivonen"&gt;henri-sivonen&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/html5"&gt;html5&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/law"&gt;law&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/postelslaw"&gt;postelslaw&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/xml"&gt;xml&lt;/a&gt;&lt;/p&gt;



</summary><category term="draconian"/><category term="henri-sivonen"/><category term="html5"/><category term="law"/><category term="postelslaw"/><category term="xml"/></entry></feed>