<?xml version="1.0" encoding="utf-8"?>
<feed xml:lang="en-us" xmlns="http://www.w3.org/2005/Atom"><title>Simon Willison's Weblog: parsing</title><link href="http://simonwillison.net/" rel="alternate"/><link href="http://simonwillison.net/tags/parsing.atom" rel="self"/><id>http://simonwillison.net/</id><updated>2025-05-26T07:23:17+00:00</updated><author><name>Simon Willison</name></author><entry><title>GitHub Issues search now supports nested queries and boolean operators: Here’s how we (re)built it</title><link href="https://simonwillison.net/2025/May/26/github-issues-search/#atom-tag" rel="alternate"/><published>2025-05-26T07:23:17+00:00</published><updated>2025-05-26T07:23:17+00:00</updated><id>https://simonwillison.net/2025/May/26/github-issues-search/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.blog/developer-skills/application-development/github-issues-search-now-supports-nested-queries-and-boolean-operators-heres-how-we-rebuilt-it/"&gt;GitHub Issues search now supports nested queries and boolean operators: Here’s how we (re)built it&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;GitHub Issues got a significant search upgrade &lt;a href="https://simonwillison.net/2025/Jan/16/evolving-github-issues/"&gt;back in January&lt;/a&gt;. Deborah Digges provides some behind-the-scenes details about how it works and how they rolled it out.&lt;/p&gt;
&lt;p&gt;The signature new feature is complex boolean logic: you can now search for things like &lt;code&gt;is:issue state:open author:rileybroughten (type:Bug OR type:Epic)&lt;/code&gt;, nested up to five levels deep.&lt;/p&gt;
&lt;p&gt;Queries are parsed into an AST using the Ruby &lt;a href="https://github.com/kschiess/parslet"&gt;parslet&lt;/a&gt; PEG grammar library. The AST is then compiled into a nested Elasticsearch &lt;code&gt;bool&lt;/code&gt; JSON query.&lt;/p&gt;
&lt;p&gt;GitHub Issues search deals with around 2,000 queries a second so robust testing is extremely important! The team rolled it out invisibly to 1% of live traffic, running the new implementation via a queue and comparing the number of results returned against the old production code to try and spot any degradations.&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/elasticsearch"&gt;elasticsearch&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/github"&gt;github&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ops"&gt;ops&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/parsing"&gt;parsing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ruby"&gt;ruby&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/scaling"&gt;scaling&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/search"&gt;search&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/github-issues"&gt;github-issues&lt;/a&gt;&lt;/p&gt;



</summary><category term="elasticsearch"/><category term="github"/><category term="ops"/><category term="parsing"/><category term="ruby"/><category term="scaling"/><category term="search"/><category term="github-issues"/></entry><entry><title>Lark parsing library JSON tutorial</title><link href="https://simonwillison.net/2023/Aug/13/lark-parsing-library-json-tutorial/#atom-tag" rel="alternate"/><published>2023-08-13T21:50:16+00:00</published><updated>2023-08-13T21:50:16+00:00</updated><id>https://simonwillison.net/2023/Aug/13/lark-parsing-library-json-tutorial/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://lark-parser.readthedocs.io/en/stable/json_tutorial.html"&gt;Lark parsing library JSON tutorial&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;A very convincing tutorial for a new-to-me parsing library for Python called Lark.&lt;/p&gt;

&lt;p&gt;The tutorial covers building a full JSON parser from scratch, which ends up being just 19 lines of grammar definition code and 15 lines for the transformer to turn that tree into the final JSON.&lt;/p&gt;

&lt;p&gt;It then gets into the details of optimization: the default Earley algorithm is quite slow, but swapping that out for a LALR parser (a one-line change) provides a 5x speedup for this particular example.&lt;/p&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://github.com/spandanb/learndb-py"&gt;spandanb/learndb-py&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/compilers"&gt;compilers&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/json"&gt;json&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/parsing"&gt;parsing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;&lt;/p&gt;



</summary><category term="compilers"/><category term="json"/><category term="parsing"/><category term="python"/></entry><entry><title>Building Search DSLs with Django</title><link href="https://simonwillison.net/2023/Jun/19/building-search-dsls-with-django/#atom-tag" rel="alternate"/><published>2023-06-19T08:30:32+00:00</published><updated>2023-06-19T08:30:32+00:00</updated><id>https://simonwillison.net/2023/Jun/19/building-search-dsls-with-django/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://danlamanna.com/posts/building-search-dsls-with-django/"&gt;Building Search DSLs with Django&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Neat tutorial by Dan Lamanna: how to build a GitHub-style search feature—supporting modifiers like “is:open author:danlamanna”—using PyParsing and the Django ORM.

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://lobste.rs/s/itjx6c/building_search_dsls_with_django"&gt;Lobsters&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/django"&gt;django&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/dsl"&gt;dsl&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/parsing"&gt;parsing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/search"&gt;search&lt;/a&gt;&lt;/p&gt;



</summary><category term="django"/><category term="dsl"/><category term="parsing"/><category term="python"/><category term="search"/></entry><entry><title>Comby</title><link href="https://simonwillison.net/2022/May/18/comby/#atom-tag" rel="alternate"/><published>2022-05-18T05:47:24+00:00</published><updated>2022-05-18T05:47:24+00:00</updated><id>https://simonwillison.net/2022/May/18/comby/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://comby.dev/"&gt;Comby&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Describes itself as “Structural search and replace for any language”. Lets you execute search and replace patterns that look a little bit like simplified regular expressions, but with some deep OCaml-powered magic that makes them aware of comment, string and nested parenthesis rules for different languages. This means you can use it to construct scripts that automate common refactoring or code upgrade tasks.

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=31413533"&gt;Hacker News&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/parsing"&gt;parsing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/refactoring"&gt;refactoring&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ocaml"&gt;ocaml&lt;/a&gt;&lt;/p&gt;



</summary><category term="parsing"/><category term="refactoring"/><category term="ocaml"/></entry><entry><title>pup</title><link href="https://simonwillison.net/2020/Feb/14/pup/#atom-tag" rel="alternate"/><published>2020-02-14T16:25:27+00:00</published><updated>2020-02-14T16:25:27+00:00</updated><id>https://simonwillison.net/2020/Feb/14/pup/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/ericchiang/pup"&gt;pup&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
This is a great idea: a command-line tool for parsing HTML on stdin using CSS selectors. It’s like jq but for HTML. Supports a sensible collection of selectors and has a number of output options for the selected nodes, including plain text and JSON. It also works as a simple pretty-printer for HTML.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/cli"&gt;cli&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/html"&gt;html&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/parsing"&gt;parsing&lt;/a&gt;&lt;/p&gt;



</summary><category term="cli"/><category term="html"/><category term="parsing"/></entry><entry><title>Parsing CSV using ANTLR and Python 3</title><link href="https://simonwillison.net/2018/Apr/6/python3-antlr-csv/#atom-tag" rel="alternate"/><published>2018-04-06T14:33:58+00:00</published><updated>2018-04-06T14:33:58+00:00</updated><id>https://simonwillison.net/2018/Apr/6/python3-antlr-csv/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/jszheng/py3antlr4book/tree/master/08-CSV"&gt;Parsing CSV using ANTLR and Python 3&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
I’ve been trying to figure out how to use ANTLR grammars from Python—this is the first example I’ve found that really clicked for me.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/antlr"&gt;antlr&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/csv"&gt;csv&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/parsing"&gt;parsing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;&lt;/p&gt;



</summary><category term="antlr"/><category term="csv"/><category term="parsing"/><category term="python"/></entry><entry><title>Parse shell one-liners with pyparsing</title><link href="https://simonwillison.net/2017/Oct/22/pyparsing/#atom-tag" rel="alternate"/><published>2017-10-22T13:35:41+00:00</published><updated>2017-10-22T13:35:41+00:00</updated><id>https://simonwillison.net/2017/Oct/22/pyparsing/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://nvbn.github.io/2016/07/05/shell-ast/"&gt;Parse shell one-liners with pyparsing&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Neat introduction to the pyparsing library, both for parsing tokens into labeled sections and constructing an AST from them.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/parsing"&gt;parsing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;&lt;/p&gt;



</summary><category term="parsing"/><category term="python"/></entry><entry><title>Quoting Andrew Clover</title><link href="https://simonwillison.net/2009/Nov/16/regex/#atom-tag" rel="alternate"/><published>2009-11-16T10:32:15+00:00</published><updated>2009-11-16T10:32:15+00:00</updated><id>https://simonwillison.net/2009/Nov/16/regex/#atom-tag</id><summary type="html">
    &lt;blockquote cite="http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454"&gt;&lt;p&gt;Every time you attempt to parse HTML with regular expressions, the unholy child weeps the blood of virgins, and Russian hackers pwn your webapp. Parsing HTML with regex summons tainted souls into the realm of the living. HTML and regex go together like love, marriage, and ritual infanticide.&lt;/p&gt;&lt;/blockquote&gt;
&lt;p class="cite"&gt;&amp;mdash; &lt;a href="http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454"&gt;Andrew Clover&lt;/a&gt;&lt;/p&gt;

    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/andrew-clover"&gt;andrew-clover&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/funny"&gt;funny&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/html"&gt;html&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/parsing"&gt;parsing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/regular-expressions"&gt;regular-expressions&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/stackoverflow"&gt;stackoverflow&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/xhtml"&gt;xhtml&lt;/a&gt;&lt;/p&gt;



</summary><category term="andrew-clover"/><category term="funny"/><category term="html"/><category term="parsing"/><category term="regular-expressions"/><category term="stackoverflow"/><category term="xhtml"/></entry><entry><title>HTML 5 Parsing</title><link href="https://simonwillison.net/2009/Jul/11/john/#atom-tag" rel="alternate"/><published>2009-07-11T23:36:57+00:00</published><updated>2009-07-11T23:36:57+00:00</updated><id>https://simonwillison.net/2009/Jul/11/john/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="http://ejohn.org/blog/html-5-parsing/"&gt;HTML 5 Parsing&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Firefox nightlies include a new parser that implements the HTML5 parsing algorithm (disabled by default), which uses C++ code automatically generated from Henri Sivonen’s Java parser first used in the HTML5 validator.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/browsers"&gt;browsers&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/firefox"&gt;firefox&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/henri-sivonen"&gt;henri-sivonen&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/html5"&gt;html5&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/john-resig"&gt;john-resig&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/mozilla"&gt;mozilla&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/parsing"&gt;parsing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/validator"&gt;validator&lt;/a&gt;&lt;/p&gt;



</summary><category term="browsers"/><category term="firefox"/><category term="henri-sivonen"/><category term="html5"/><category term="john-resig"/><category term="mozilla"/><category term="parsing"/><category term="validator"/></entry><entry><title>South's Design</title><link href="https://simonwillison.net/2009/May/13/south/#atom-tag" rel="alternate"/><published>2009-05-13T12:30:45+00:00</published><updated>2009-05-13T12:30:45+00:00</updated><id>https://simonwillison.net/2009/May/13/south/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="http://www.aeracode.org/2009/5/9/souths-design/"&gt;South&amp;#x27;s Design&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Andrew Godwin explains why South resorts to parsing your models.py file in order to construct the information it needs for creating automatic migrations.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/andrew-godwin"&gt;andrew-godwin&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/django"&gt;django&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/models"&gt;models&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/orm"&gt;orm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/parsing"&gt;parsing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/south"&gt;south&lt;/a&gt;&lt;/p&gt;



</summary><category term="andrew-godwin"/><category term="django"/><category term="models"/><category term="orm"/><category term="parsing"/><category term="python"/><category term="south"/></entry><entry><title>Simple Top-Down Parsing in Python</title><link href="https://simonwillison.net/2008/Jul/19/simple/#atom-tag" rel="alternate"/><published>2008-07-19T23:37:59+00:00</published><updated>2008-07-19T23:37:59+00:00</updated><id>https://simonwillison.net/2008/Jul/19/simple/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="http://effbot.org/zone/simple-top-down-parsing.htm"&gt;Simple Top-Down Parsing in Python&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Eye-opening tutorial on building a recursive descent parser for Python, in Python, that uses top-down operator precedence.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/compilers"&gt;compilers&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/effbot"&gt;effbot&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/fredrik-lundh"&gt;fredrik-lundh&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/parsing"&gt;parsing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/recursivedescent"&gt;recursivedescent&lt;/a&gt;&lt;/p&gt;



</summary><category term="compilers"/><category term="effbot"/><category term="fredrik-lundh"/><category term="parsing"/><category term="python"/><category term="recursivedescent"/></entry><entry><title>python4ply tutorial</title><link href="https://simonwillison.net/2008/Mar/11/pythonply/#atom-tag" rel="alternate"/><published>2008-03-11T05:49:22+00:00</published><updated>2008-03-11T05:49:22+00:00</updated><id>https://simonwillison.net/2008/Mar/11/pythonply/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="http://www.dalkescientific.com/writings/diary/archive/2008/03/09/python4ply_tutorial_1.html"&gt;python4ply tutorial&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
python4ply is a parser for Python written in Python using the PLY toolkit, which compiles to Python bytecode using the built-in compiler module. The tutorial shows how to use it to add support for Perl-style 1_000_000 readable numbers.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/compilers"&gt;compilers&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/lexing"&gt;lexing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/parsing"&gt;parsing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/python4ply"&gt;python4ply&lt;/a&gt;&lt;/p&gt;



</summary><category term="compilers"/><category term="lexing"/><category term="parsing"/><category term="python"/><category term="python4ply"/></entry></feed>