<?xml version="1.0" encoding="utf-8"?>
<feed xml:lang="en-us" xmlns="http://www.w3.org/2005/Atom"><title>Simon Willison's Weblog: regular-expressions</title><link href="http://simonwillison.net/" rel="alternate"/><link href="http://simonwillison.net/tags/regular-expressions.atom" rel="self"/><id>http://simonwillison.net/</id><updated>2026-05-04T17:52:00+00:00</updated><author><name>Simon Willison</name></author><entry><title>TRE Python binding — ReDoS robustness demo</title><link href="https://simonwillison.net/2026/May/4/tre-python-binding/#atom-tag" rel="alternate"/><published>2026-05-04T17:52:00+00:00</published><updated>2026-05-04T17:52:00+00:00</updated><id>https://simonwillison.net/2026/May/4/tre-python-binding/#atom-tag</id><summary type="html">
    
        &lt;p&gt;&lt;strong&gt;Research:&lt;/strong&gt; &lt;a href="https://github.com/simonw/research/tree/main/tre-python-binding#readme"&gt;TRE Python binding — ReDoS robustness demo&lt;/a&gt;&lt;/p&gt;
        &lt;p&gt;If it's &lt;a href="https://simonwillison.net/2026/May/4/redis-array/"&gt;good enough for antirez&lt;/a&gt; to add to Redis I figured Ville Laurikari's &lt;a href="https://github.com/laurikari/tre/"&gt;TRE&lt;/a&gt; regular expression engine was worth exploring in a little more detail.&lt;/p&gt;
&lt;p&gt;I had Claude Code build an experimental Python binding (it used &lt;code&gt;ctypes&lt;/code&gt;) and try some malicious regular expression attacks against the library. TRE handles those much better than Python's standard library implementation, thanks mainly to the lack of support for backtracking.&lt;/p&gt;
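&lt;p&gt;To make the backtracking problem concrete, here is a minimal sketch that uses only Python's standard library &lt;code&gt;re&lt;/code&gt; module. It is not the code from the research repo, and it does not call the TRE binding, whose API is not shown here. The pattern and the helper name &lt;code&gt;time_failed_match&lt;/code&gt; are illustrative assumptions.&lt;/p&gt;

```python
import re
import time

# Classic "evil" pattern: nested quantifiers force a backtracking engine to
# try exponentially many ways of splitting the run of "a" characters.
EVIL_PATTERN = re.compile(r"(a+)+$")


def time_failed_match(n):
    """Return (matched, seconds) for matching n "a"s followed by one "b"."""
    text = "a" * n + "b"  # the trailing "b" guarantees the match fails
    start = time.perf_counter()
    matched = EVIL_PATTERN.match(text) is not None
    return matched, time.perf_counter() - start


if __name__ == "__main__":
    # Sizes are kept small on purpose: each extra "a" roughly doubles the work.
    for n in (12, 14, 16, 18):
        matched, seconds = time_failed_match(n)
        print(f"n={n} matched={matched} seconds={seconds:.4f}")
```

&lt;p&gt;On a backtracking engine the reported time should grow sharply as &lt;code&gt;n&lt;/code&gt; increases, which is the behaviour a non-backtracking engine avoids.&lt;/p&gt;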
    
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/c"&gt;c&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ctypes"&gt;ctypes&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/regular-expressions"&gt;regular-expressions&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/security"&gt;security&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="c"/><category term="ctypes"/><category term="python"/><category term="regular-expressions"/><category term="security"/></entry><entry><title>Redis Array Playground</title><link href="https://simonwillison.net/2026/May/4/redis-array/#atom-tag" rel="alternate"/><published>2026-05-04T15:53:57+00:00</published><updated>2026-05-04T15:53:57+00:00</updated><id>https://simonwillison.net/2026/May/4/redis-array/#atom-tag</id><summary type="html">
    
        &lt;p&gt;&lt;strong&gt;Tool:&lt;/strong&gt; &lt;a href="https://tools.simonwillison.net/redis-array"&gt;Redis Array Playground&lt;/a&gt;&lt;/p&gt;
        &lt;p&gt;Salvatore Sanfilippo submitted &lt;a href="https://github.com/redis/redis/pull/15162"&gt;a PR&lt;/a&gt; adding a new data type - arrays - to Redis. &lt;/p&gt;
&lt;p&gt;The new commands are &lt;code&gt;ARCOUNT&lt;/code&gt;, &lt;code&gt;ARDEL&lt;/code&gt;, &lt;code&gt;ARDELRANGE&lt;/code&gt;, &lt;code&gt;ARGET&lt;/code&gt;, &lt;code&gt;ARGETRANGE&lt;/code&gt;, &lt;code&gt;ARGREP&lt;/code&gt;, &lt;code&gt;ARINFO&lt;/code&gt;, &lt;code&gt;ARINSERT&lt;/code&gt;, &lt;code&gt;ARLASTITEMS&lt;/code&gt;, &lt;code&gt;ARLEN&lt;/code&gt;, &lt;code&gt;ARMGET&lt;/code&gt;, &lt;code&gt;ARMSET&lt;/code&gt;, &lt;code&gt;ARNEXT&lt;/code&gt;, &lt;code&gt;AROP&lt;/code&gt;, &lt;code&gt;ARRING&lt;/code&gt;, &lt;code&gt;ARSCAN&lt;/code&gt;, &lt;code&gt;ARSEEK&lt;/code&gt;, &lt;code&gt;ARSET&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;The implementation is currently available in a branch, so I &lt;a href="https://github.com/simonw/tools/pull/277"&gt;had Claude Code for web&lt;/a&gt; 
build this interactive playground for trying out the new commands in a WASM-compiled build of a subset of Redis running in the browser.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Screenshot of a Redis command builder UI. Left sidebar shows commands ARSCAN, ARSEEK, ARSET. Main panel has a &amp;quot;predicate oneof&amp;quot; section with a MATCH dropdown and value CHERRY, plus a &amp;quot;+ add another&amp;quot; button. Below is &amp;quot;options (optional) oneof&amp;quot; with checkboxes: AND (checked), OR (unchecked), LIMIT (checked, value 10), WITHVALUES (checked), NOCASE (checked). COMMAND section shows: ARGREP myarr - + MATCH CHERRY AND LIMIT 10 WITHVALUES NOCASE. A red &amp;quot;Run command&amp;quot; button is below. REPLY section shows &amp;quot;(no reply yet)&amp;quot;." src="https://static.simonwillison.net/static/2026/redis-array-explorer-card.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;The most interesting new command is &lt;code&gt;ARGREP&lt;/code&gt; which can run a server-side grep against a range of values in the array using the newly vendored &lt;a href="https://github.com/laurikari/tre/"&gt;TRE regex library&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Salvatore wrote more about the AI-assisted development process for the array type in &lt;a href="https://antirez.com/news/164"&gt;Redis array type: short story of a long development&lt;/a&gt;.&lt;/p&gt;
    
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/c"&gt;c&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/redis"&gt;redis&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/regular-expressions"&gt;regular-expressions&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/salvatore-sanfilippo"&gt;salvatore-sanfilippo&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/webassembly"&gt;webassembly&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/agentic-engineering"&gt;agentic-engineering&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="c"/><category term="redis"/><category term="regular-expressions"/><category term="salvatore-sanfilippo"/><category term="ai"/><category term="webassembly"/><category term="generative-ai"/><category term="llms"/><category term="agentic-engineering"/></entry><entry><title>MicroQuickJS</title><link href="https://simonwillison.net/2025/Dec/23/microquickjs/#atom-tag" rel="alternate"/><published>2025-12-23T20:53:40+00:00</published><updated>2025-12-23T20:53:40+00:00</updated><id>https://simonwillison.net/2025/Dec/23/microquickjs/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/bellard/mquickjs"&gt;MicroQuickJS&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;New project from programming legend Fabrice Bellard, of ffmpeg and QEMU and QuickJS and &lt;a href="https://bellard.org"&gt;so much more&lt;/a&gt; fame:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;MicroQuickJS (aka. MQuickJS) is a Javascript engine targetted at embedded systems. It compiles and runs Javascript programs with as low as 10 kB of RAM. The whole engine requires about 100 kB of ROM (ARM Thumb-2 code) including the C library. The speed is comparable to QuickJS.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;It supports &lt;a href="https://github.com/bellard/mquickjs/blob/17ce6fe54c1ea4f500f26636bd22058fce2ce61a/README.md#javascript-subset-reference"&gt;a subset of full JavaScript&lt;/a&gt;, though it looks like a rich and full-featured subset to me.&lt;/p&gt;
&lt;p&gt;One of my ongoing interests is sandboxing: mechanisms for executing untrusted code - from end users or generated by LLMs - in an environment that restricts memory usage and applies a strict time limit and restricts file or network access. Could MicroQuickJS be useful in that context?&lt;/p&gt;
&lt;p&gt;I fired up Claude Code for web (on my iPhone) and kicked off &lt;a href="https://simonwillison.net/2025/Nov/6/async-code-research/"&gt;an asynchronous research project&lt;/a&gt; to explore that question:&lt;/p&gt;
&lt;p&gt;My full prompt &lt;a href="https://github.com/simonw/research/pull/50#issue-3757781692"&gt;is here&lt;/a&gt;. It started like this:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;Clone https://github.com/bellard/mquickjs to /tmp&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;&lt;code&gt;Investigate this code as the basis for a safe sandboxing environment for running untrusted code such that it cannot exhaust memory or CPU or access files or the network&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;&lt;code&gt;First try building python bindings for this using FFI - write a script that builds these by checking out the code to /tmp and building against that, to avoid copying the C code in this repo permanently. Write and execute tests with pytest to exercise it as a sandbox&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;&lt;code&gt;Then build a "real" Python extension not using FFI and experiment with that&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;&lt;code&gt;Then try compiling the C to WebAssembly and exercising it via both node.js and Deno, with a similar suite of tests [...]&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I later added to the interactive session:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;Does it have a regex engine that might allow a resource exhaustion attack from an expensive regex?&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;(The answer was no - the regex engine calls the interrupt handler even during pathological expression backtracking, meaning that any configured time limit should still hold.)&lt;/p&gt;
&lt;p&gt;Here's &lt;a href="https://gistpreview.github.io/?6e07c54db7bb8ed8aa0eccfe4a384679"&gt;the full transcript&lt;/a&gt; and the &lt;a href="https://github.com/simonw/research/blob/main/mquickjs-sandbox/README.md"&gt;final report&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Some key observations:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;MicroQuickJS is &lt;em&gt;very&lt;/em&gt; well suited to the sandbox problem. It has robust memory and time limits baked in, it doesn't expose any dangerous primitives like filesystem or network access and even has a regular expression engine that protects against exhaustion attacks (provided you configure a time limit).&lt;/li&gt;
&lt;li&gt;Claude spun up and tested a Python library that calls a MicroQuickJS shared library (involving a little bit of extra C), a compiled Python binding and a library that uses the original MicroQuickJS CLI tool. All of those approaches work well.&lt;/li&gt;
&lt;li&gt;Compiling to WebAssembly was a little harder. It got a version working in Node.js and Deno and Pyodide, but the Python libraries wasmer and wasmtime proved harder, apparently because "mquickjs uses setjmp/longjmp for error handling". It managed to get to a working wasmtime version with &lt;a href="https://github.com/simonw/research/blob/main/mquickjs-sandbox/README.md#working-solution"&gt;a gross hack&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I'm really excited about this. MicroQuickJS is tiny, full featured, looks robust and comes from excellent pedigree. I think this makes for a very solid new entrant in the quest for a robust sandbox.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Update&lt;/strong&gt;: I had Claude Code build &lt;a href="https://tools.simonwillison.net/microquickjs"&gt;tools.simonwillison.net/microquickjs&lt;/a&gt;, an interactive web playground for trying out the WebAssembly build of MicroQuickJS, adapted from my previous &lt;a href="https://tools.simonwillison.net/quickjs"&gt;QuickJS playground&lt;/a&gt;. My QuickJS page loads 2.28 MB (675 KB transferred). The MicroQuickJS one loads 303 KB (120 KB transferred).&lt;/p&gt;
&lt;p&gt;Here are &lt;a href="https://github.com/simonw/tools/pull/180#issue-3758595291"&gt;the prompts I used&lt;/a&gt; for that.&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/c"&gt;c&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/javascript"&gt;javascript&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/nodejs"&gt;nodejs&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/regular-expressions"&gt;regular-expressions&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sandboxing"&gt;sandboxing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/webassembly"&gt;webassembly&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/deno"&gt;deno&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pyodide"&gt;pyodide&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude-code"&gt;claude-code&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/fabrice-bellard"&gt;fabrice-bellard&lt;/a&gt;&lt;/p&gt;



</summary><category term="c"/><category term="javascript"/><category term="nodejs"/><category term="python"/><category term="regular-expressions"/><category term="sandboxing"/><category term="ai"/><category term="webassembly"/><category term="deno"/><category term="pyodide"/><category term="generative-ai"/><category term="llms"/><category term="claude-code"/><category term="fabrice-bellard"/></entry><entry><title>New dashboard: alt text for all my images</title><link href="https://simonwillison.net/2025/Apr/28/dashboard-alt-text/#atom-tag" rel="alternate"/><published>2025-04-28T01:22:27+00:00</published><updated>2025-04-28T01:22:27+00:00</updated><id>https://simonwillison.net/2025/Apr/28/dashboard-alt-text/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://simonwillison.net/dashboard/alt-text/"&gt;New dashboard: alt text for all my images&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;I got curious today about how I'd been using alt text for images on my blog, and realized that since I have &lt;a href="https://django-sql-dashboard.datasette.io/"&gt;Django SQL Dashboard&lt;/a&gt; running on this site and PostgreSQL is capable of &lt;a href="https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454"&gt;parsing HTML with regular expressions&lt;/a&gt; I could probably find out using a SQL query.&lt;/p&gt;
&lt;p&gt;I pasted &lt;a href="https://simonwillison.net/dashboard/schema/"&gt;my PostgreSQL schema&lt;/a&gt; into Claude and gave it a pretty long prompt:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Give this PostgreSQL schema I want a query that returns all of my images and their alt text. Images are sometimes stored as HTML image tags and other times stored in markdown.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;blog_quotation.quotation&lt;/code&gt;, &lt;code&gt;blog_note.body&lt;/code&gt; both contain markdown. &lt;code&gt;blog_blogmark.commentary&lt;/code&gt; has markdown if &lt;code&gt;use_markdown&lt;/code&gt; is true or HTML otherwise. &lt;code&gt;blog_entry.body&lt;/code&gt; is always HTML&lt;/p&gt;
&lt;p&gt;Write me a SQL query to extract all of my images and their alt tags using regular expressions. In HTML documents it should look for either &lt;code&gt;&amp;lt;img .* src="..." .* alt="..."&lt;/code&gt; or &lt;code&gt;&amp;lt;img alt="..." .* src="..."&lt;/code&gt; (images may be self-closing XHTML style in some places). In Markdown they will always be &lt;code&gt;![alt text](url)&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;I want the resulting table to have three columns: URL, alt_text, src - the URL column needs to be constructed as e.g. &lt;code&gt;/2025/Feb/2/slug&lt;/code&gt; for a record where created is on 2nd feb 2025 and the &lt;code&gt;slug&lt;/code&gt; column contains &lt;code&gt;slug&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;Use CTEs and unions where appropriate&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;It almost got it right on the first go, and with &lt;a href="https://claude.ai/share/e3b996d3-b480-436d-aa40-9caa7609474f"&gt;a couple of follow-up prompts&lt;/a&gt; I had the query I wanted. I also added the option to &lt;a href="https://simonwillison.net/dashboard/alt-text/?search=pelican"&gt;search&lt;/a&gt; my alt text / image URLs, which has already helped me hunt down and fix a few old images on expired domain names. Here's a copy of &lt;a href="https://gist.github.com/simonw/5b44a662354e124e33cc1d4704cdb91a"&gt;the finished 100 line SQL query&lt;/a&gt;.&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/accessibility"&gt;accessibility&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/alt-text"&gt;alt-text&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/postgresql"&gt;postgresql&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/regular-expressions"&gt;regular-expressions&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sql"&gt;sql&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/django-sql-dashboard"&gt;django-sql-dashboard&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-programming"&gt;ai-assisted-programming&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude"&gt;claude&lt;/a&gt;&lt;/p&gt;



</summary><category term="accessibility"/><category term="alt-text"/><category term="postgresql"/><category term="regular-expressions"/><category term="sql"/><category term="ai"/><category term="django-sql-dashboard"/><category term="generative-ai"/><category term="llms"/><category term="ai-assisted-programming"/><category term="claude"/></entry><entry><title>tc39/proposal-regex-escaping</title><link href="https://simonwillison.net/2025/Feb/18/tc39proposal-regex-escaping/#atom-tag" rel="alternate"/><published>2025-02-18T21:53:56+00:00</published><updated>2025-02-18T21:53:56+00:00</updated><id>https://simonwillison.net/2025/Feb/18/tc39proposal-regex-escaping/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/tc39/proposal-regex-escaping"&gt;tc39/proposal-regex-escaping&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;I just heard &lt;a href="https://social.coop/@kriskowal/114026510846190089"&gt;from Kris Kowal&lt;/a&gt; that this proposal for ECMAScript has been approved for ECMA TC-39:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Almost 20 years later, @simon’s RegExp.escape idea comes to fruition. This reached “Stage 4” at ECMA TC-39 just now, which formalizes that multiple browsers have shipped the feature and it’s in the next revision of the JavaScript specification.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I'll be honest, I had completely forgotten about my 2006 blog entry &lt;a href="https://simonwillison.net/2006/Jan/20/escape/"&gt;Escaping regular expression characters in JavaScript&lt;/a&gt; where I proposed that JavaScript should have an equivalent of the Python &lt;a href="https://docs.python.org/3/library/re.html#re.escape"&gt;re.escape()&lt;/a&gt; function.&lt;/p&gt;
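&lt;p&gt;For anyone who hasn't used it, here is a small illustration of what Python's &lt;code&gt;re.escape()&lt;/code&gt; is for. The &lt;code&gt;user_input&lt;/code&gt; value is a made-up example, not something from the original post.&lt;/p&gt;

```python
import re

user_input = "a.c"  # hypothetical search term typed by a user

# Compiled as-is, the "." is a wildcard that matches any character.
unescaped = re.compile(user_input)

# re.escape() backslash-escapes special characters so they match literally.
escaped = re.compile(re.escape(user_input))

print(bool(unescaped.fullmatch("abc")))  # True: "." matched "b"
print(bool(escaped.fullmatch("abc")))    # False: a literal "." is required
print(bool(escaped.fullmatch("a.c")))    # True
```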
&lt;p&gt;It turns out my post was referenced in &lt;a href="https://esdiscuss.org/topic/regexp-escape"&gt;this 15 year old thread&lt;/a&gt; on the esdiscuss mailing list, which evolved over time into a proposal which turned into &lt;a href="https://caniuse.com/mdn-javascript_builtins_regexp_escape"&gt;implementations&lt;/a&gt; in Safari, Firefox and soon Chrome - here's &lt;a href="https://github.com/v8/v8/commit/b5c08badc7b3d4b85b2645b1a4d9973ee6efaa91"&gt;the commit landing it in v8&lt;/a&gt; on February 12th 2025.&lt;/p&gt;
&lt;p&gt;One of the best things about having a long-running blog is that sometimes posts you forgot about over a decade ago turn out to have a life of their own.&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/blogging"&gt;blogging&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ecmascript"&gt;ecmascript&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/javascript"&gt;javascript&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/regular-expressions"&gt;regular-expressions&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/standards"&gt;standards&lt;/a&gt;&lt;/p&gt;



</summary><category term="blogging"/><category term="ecmascript"/><category term="javascript"/><category term="regular-expressions"/><category term="standards"/></entry><entry><title>datasette-enrichments-re2 0.1a0</title><link href="https://simonwillison.net/2023/Nov/28/datasette-enrichments-re2/#atom-tag" rel="alternate"/><published>2023-11-28T02:07:14+00:00</published><updated>2023-11-28T02:07:14+00:00</updated><id>https://simonwillison.net/2023/Nov/28/datasette-enrichments-re2/#atom-tag</id><summary type="html">
    
        &lt;p&gt;&lt;strong&gt;Release:&lt;/strong&gt; &lt;a href="https://github.com/datasette/datasette-enrichments-re2/releases/tag/0.1a0"&gt;datasette-enrichments-re2 0.1a0&lt;/a&gt;&lt;/p&gt;
        
    
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/regular-expressions"&gt;regular-expressions&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="regular-expressions"/><category term="datasette"/></entry><entry><title>Why I invented "dash encoding", a new encoding scheme for URL paths</title><link href="https://simonwillison.net/2022/Mar/5/dash-encoding/#atom-tag" rel="alternate"/><published>2022-03-05T21:50:38+00:00</published><updated>2022-03-05T21:50:38+00:00</updated><id>https://simonwillison.net/2022/Mar/5/dash-encoding/#atom-tag</id><summary type="html">
    &lt;p&gt;&lt;a href="https://datasette.io/"&gt;Datasette&lt;/a&gt; now includes its own custom string encoding scheme, which I've called &lt;strong&gt;dash encoding&lt;/strong&gt;. I really didn't want to have to invent something new here, but unfortunately I think this is the best solution to my very particular problem. Some notes on how dash encoding works and why I created it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Update 18th March 2022&lt;/strong&gt;: This turned out not to be the right idea for my project after all! I ended up settling on a &lt;a href="https://simonwillison.net/2022/Mar/19/weeknotes/#tilde-encoding"&gt;Tilde encoding&lt;/a&gt; scheme instead.&lt;/p&gt;

&lt;h4&gt;Table names and rows in URLs&lt;/h4&gt;
&lt;p&gt;I've put a lot of thought into the design of Datasette's URLs.&lt;/p&gt;
&lt;p&gt;Datasette exposes relational database tables, as both web pages and a JSON API.&lt;/p&gt;
&lt;p&gt;Consider a database in a SQLite file called &lt;code&gt;legislators.db&lt;/code&gt;, containing a table called &lt;code&gt;legislator_terms&lt;/code&gt; (example from &lt;a href="https://datasette.io/tutorials/explore"&gt;this tutorial&lt;/a&gt;). The URL path to the web interface for that table will be:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="http://congress-legislators.datasettes.com/legislators/legislator_terms"&gt;/legislators/legislator_terms&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;And the JSON API will be here:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="http://congress-legislators.datasettes.com/legislators/legislator_terms.json"&gt;/legislators/legislator_terms.json&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;(Worth noting that Datasette supports other formats here too - &lt;a href="https://docs.datasette.io/en/stable/csv_export.html"&gt;CSV&lt;/a&gt; by default, and plugins can add more formats such as &lt;a href="https://datasette.io/plugins/datasette-geojson"&gt;GeoJSON&lt;/a&gt; or &lt;a href="https://datasette.io/plugins/datasette-atom"&gt;Atom&lt;/a&gt; or &lt;a href="https://datasette.io/plugins/datasette-ics"&gt;iCal&lt;/a&gt;.)&lt;/p&gt;
&lt;p&gt;Datasette also provides pages (and APIs) for individual rows, identified by their primary key:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="http://congress-legislators.datasettes.com/legislators/legislator_terms/1"&gt;/legislators/legislator_terms/1&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://congress-legislators.datasettes.com/legislators/legislator_terms/1.json"&gt;/legislators/legislator_terms/1.json&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For tables with compound primary keys, these pages can include the primary key values separated by commas:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://latest.datasette.io/fixtures/compound_three_primary_keys/a,a,a"&gt;/fixtures/compound_three_primary_keys/a,a,a&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This is all pretty straightforward so far. But now we get to the challenge: what if a table's name or a row's primary key contains a forward slash or a period character?&lt;/p&gt;
&lt;p&gt;This could break the URL scheme!&lt;/p&gt;
&lt;p&gt;SQLite table names are allowed to contain almost any character, and Datasette is designed to work with any existing SQLite database - so I can't guarantee that a table with one of those characters won't need to be handled.&lt;/p&gt;
&lt;p&gt;Consider a database with two tables - one called &lt;code&gt;legislator_terms&lt;/code&gt; and another called &lt;code&gt;legislator_terms/1&lt;/code&gt; - given the URL &lt;code&gt;/legislators/legislator_terms/1&lt;/code&gt; it's no longer clear if it refers to the table with that name or the row with primary key 1 in the other table!&lt;/p&gt;
&lt;p&gt;A similar problem exists for table names such as &lt;code&gt;legislators.csv&lt;/code&gt; - which end in a format. Or primary key string values that end in &lt;code&gt;.json&lt;/code&gt;.&lt;/p&gt;
&lt;h4&gt;Why URL encoding doesn't work here&lt;/h4&gt;
&lt;p&gt;Up until now, Datasette has solved this problem using &lt;a href="https://en.wikipedia.org/wiki/Percent-encoding"&gt;URL percent encoding&lt;/a&gt;. This provides a standard mechanism for encoding "special" characters in URLs.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;legislator_terms/1&lt;/code&gt; encodes to &lt;code&gt;legislator_terms%2F1&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;This should be enough to solve the problem. The URL to that weirdly named table can now be:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;/legislators/legislator_terms%2F1&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;When routing the URL, the application can take this into account and identify that this is a table named &lt;code&gt;legislator_terms/1&lt;/code&gt;, as opposed to a request for the row with ID &lt;code&gt;1&lt;/code&gt; in the &lt;code&gt;legislator_terms&lt;/code&gt; table.&lt;/p&gt;
&lt;p&gt;There are two remaining problems.&lt;/p&gt;
&lt;p&gt;Firstly, the "." character is ignored by URL encoding, so we still can't tell the difference between &lt;code&gt;/db/table.json&lt;/code&gt; and a table called &lt;code&gt;table.json&lt;/code&gt;. I worked around this issue in Datasette by supporting an optional alternative &lt;code&gt;?_format=json&lt;/code&gt; parameter, but it's &lt;a href="https://github.com/simonw/datasette/issues/1439"&gt;messy and confusing&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Much more seriously, it turns out there are numerous common pieces of web infrastructure that "helpfully" decode escaped characters in URLs before passing them on to the underlying web application!&lt;/p&gt;
&lt;p&gt;I first encountered this in the ASGI standard itself, which decoded characters in the &lt;code&gt;path&lt;/code&gt; field before they were passed to the rest of the application. I submitted &lt;a href="https://github.com/django/asgiref/issues/87#issuecomment-500168070"&gt;a PR&lt;/a&gt; adding &lt;code&gt;raw_path&lt;/code&gt; to ASGI precisely to work around this problem for Datasette.&lt;/p&gt;
&lt;p&gt;Over time though, the problem kept cropping up. Datasette aims to run on as many hosting platforms as possible. I've seen URL escaping applied at a higher level enough times now to be very suspicious of any load balancer or proxy or other web server mechanism that might end up executing between Datasette and the rest of the web.&lt;/p&gt;
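&lt;p&gt;Both problems are easy to reproduce with Python's standard &lt;code&gt;urllib.parse&lt;/code&gt; module. This is a minimal sketch: the &lt;code&gt;simulate_decoding_proxy&lt;/code&gt; helper is hypothetical and only stands in for whatever layer decodes the path before the application sees it.&lt;/p&gt;

```python
from urllib.parse import quote, unquote

# Percent-encoding a table name that contains a forward slash.
table_name = "legislator_terms/1"
encoded = quote(table_name, safe="")
print(encoded)  # legislator_terms%2F1

# The "." character is left untouched, so format suffixes stay ambiguous.
print(quote("table.json", safe=""))  # table.json


def simulate_decoding_proxy(path):
    """Hypothetical stand-in for a proxy that decodes the path too early."""
    return unquote(path)


# After early decoding, the table URL is indistinguishable from a row URL.
path = "/legislators/" + encoded
print(simulate_decoding_proxy(path))  # /legislators/legislator_terms/1
```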

&lt;p&gt;&lt;strong&gt;Update:&lt;/strong&gt; Flask core maintainer David Lord &lt;a href="https://twitter.com/davidism/status/1500251083070787585"&gt;confirms on Twitter&lt;/a&gt; that this is a long-standing known problem:&lt;/p&gt;

&lt;blockquote cite="https://twitter.com/davidism/status/1500251083070787585"&gt;&lt;p&gt;This behavior in Apache/nginx/etc is why WSGI/ASGI can't specify "literal URL the user typed in", because anything in front of the app might modify slashes or anything else. So all the spec can provide is "decoded URL".&lt;/p&gt;&lt;/blockquote&gt;

&lt;p&gt;So, I need a way of encoding a table name that might include &lt;code&gt;/&lt;/code&gt; and &lt;code&gt;.&lt;/code&gt; characters in a way that will survive some other layer of the stack decoding URL encoded strings in the URL path before Datasette gets to see them!&lt;/p&gt;
&lt;h4&gt;Introducing dash encoding&lt;/h4&gt;
&lt;p&gt;That's where dash encoding comes in. I tried to design the fastest, simplest encoding mechanism I could that would solve this very specific problem.&lt;/p&gt;
&lt;p&gt;Loose requirements:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Reversible - it's crucial that any possible value survives a round-trip through the encoding&lt;/li&gt;
&lt;li&gt;Avoid changing the string at all if possible. Otherwise I could use something like base64, but I wanted to keep the name in the URL as close to readable as possible&lt;/li&gt;
&lt;li&gt;Survive interference by proxies and load balancers that might try to be helpful&lt;/li&gt;
&lt;li&gt;Fast to apply the transformation&lt;/li&gt;
&lt;li&gt;As simple as possible&lt;/li&gt;
&lt;li&gt;Easy to implement, including in languages other than Python&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Dash encoding consists of three simple steps:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Replace all single hyphen characters &lt;code&gt;-&lt;/code&gt; with two hyphens &lt;code&gt;--&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Replace any forward slash &lt;code&gt;/&lt;/code&gt; character with hyphen forward slash &lt;code&gt;-/&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Replace any period character &lt;code&gt;.&lt;/code&gt; with hyphen period &lt;code&gt;-.&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;To reverse the encoding, run those steps backwards.&lt;/p&gt;
&lt;p&gt;Here's the Python implementation of this encoding scheme:&lt;/p&gt;
&lt;pre&gt;&lt;span class="pl-k"&gt;def&lt;/span&gt; &lt;span class="pl-en"&gt;dash_encode&lt;/span&gt;(&lt;span class="pl-s1"&gt;s&lt;/span&gt;: &lt;span class="pl-s1"&gt;str&lt;/span&gt;) &lt;span class="pl-c1"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="pl-s1"&gt;str&lt;/span&gt;:
     &lt;span class="pl-s"&gt;"Returns dash-encoded string - for example ``/foo/bar`` -&amp;gt; ``-/foo-/bar``"&lt;/span&gt;
     &lt;span class="pl-k"&gt;return&lt;/span&gt; &lt;span class="pl-s1"&gt;s&lt;/span&gt;.&lt;span class="pl-en"&gt;replace&lt;/span&gt;(&lt;span class="pl-s"&gt;"-"&lt;/span&gt;, &lt;span class="pl-s"&gt;"--"&lt;/span&gt;).&lt;span class="pl-en"&gt;replace&lt;/span&gt;(&lt;span class="pl-s"&gt;"."&lt;/span&gt;, &lt;span class="pl-s"&gt;"-."&lt;/span&gt;).&lt;span class="pl-en"&gt;replace&lt;/span&gt;(&lt;span class="pl-s"&gt;"/"&lt;/span&gt;, &lt;span class="pl-s"&gt;"-/"&lt;/span&gt;)

&lt;span class="pl-k"&gt;def&lt;/span&gt; &lt;span class="pl-en"&gt;dash_decode&lt;/span&gt;(&lt;span class="pl-s1"&gt;s&lt;/span&gt;: &lt;span class="pl-s1"&gt;str&lt;/span&gt;) &lt;span class="pl-c1"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="pl-s1"&gt;str&lt;/span&gt;:
     &lt;span class="pl-s"&gt;"Decodes a dash-encoded string, so ``-/foo-/bar`` -&amp;gt; ``/foo/bar``"&lt;/span&gt;
     &lt;span class="pl-k"&gt;return&lt;/span&gt; &lt;span class="pl-s1"&gt;s&lt;/span&gt;.&lt;span class="pl-en"&gt;replace&lt;/span&gt;(&lt;span class="pl-s"&gt;"-/"&lt;/span&gt;, &lt;span class="pl-s"&gt;"/"&lt;/span&gt;).&lt;span class="pl-en"&gt;replace&lt;/span&gt;(&lt;span class="pl-s"&gt;"-."&lt;/span&gt;, &lt;span class="pl-s"&gt;"."&lt;/span&gt;).&lt;span class="pl-en"&gt;replace&lt;/span&gt;(&lt;span class="pl-s"&gt;"--"&lt;/span&gt;, &lt;span class="pl-s"&gt;"-"&lt;/span&gt;)&lt;/pre&gt;
&lt;p&gt;And the pytest tests for it:&lt;/p&gt;
&lt;pre&gt;&lt;span class="pl-en"&gt;@&lt;span class="pl-s1"&gt;pytest&lt;/span&gt;.&lt;span class="pl-s1"&gt;mark&lt;/span&gt;.&lt;span class="pl-en"&gt;parametrize&lt;/span&gt;(&lt;/span&gt;
&lt;span class="pl-en"&gt;     &lt;span class="pl-s"&gt;"original,expected"&lt;/span&gt;,&lt;/span&gt;
&lt;span class="pl-en"&gt;     (&lt;/span&gt;
&lt;span class="pl-en"&gt;         (&lt;span class="pl-s"&gt;"abc"&lt;/span&gt;, &lt;span class="pl-s"&gt;"abc"&lt;/span&gt;),&lt;/span&gt;
&lt;span class="pl-en"&gt;         (&lt;span class="pl-s"&gt;"/foo/bar"&lt;/span&gt;, &lt;span class="pl-s"&gt;"-/foo-/bar"&lt;/span&gt;),&lt;/span&gt;
&lt;span class="pl-en"&gt;         (&lt;span class="pl-s"&gt;"/-/bar"&lt;/span&gt;, &lt;span class="pl-s"&gt;"-/---/bar"&lt;/span&gt;),&lt;/span&gt;
&lt;span class="pl-en"&gt;         (&lt;span class="pl-s"&gt;"-/db-/table---.csv-.csv"&lt;/span&gt;, &lt;span class="pl-s"&gt;"---/db---/table-------.csv---.csv"&lt;/span&gt;),&lt;/span&gt;
&lt;span class="pl-en"&gt;     ),&lt;/span&gt;
&lt;span class="pl-en"&gt; )&lt;/span&gt;
 &lt;span class="pl-k"&gt;def&lt;/span&gt; &lt;span class="pl-en"&gt;test_dash_encoding&lt;/span&gt;(&lt;span class="pl-s1"&gt;original&lt;/span&gt;, &lt;span class="pl-s1"&gt;expected&lt;/span&gt;):
     &lt;span class="pl-s1"&gt;actual&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;utils&lt;/span&gt;.&lt;span class="pl-en"&gt;dash_encode&lt;/span&gt;(&lt;span class="pl-s1"&gt;original&lt;/span&gt;)
     &lt;span class="pl-k"&gt;assert&lt;/span&gt; &lt;span class="pl-s1"&gt;actual&lt;/span&gt; &lt;span class="pl-c1"&gt;==&lt;/span&gt; &lt;span class="pl-s1"&gt;expected&lt;/span&gt;
     &lt;span class="pl-c"&gt;# And test round-trip&lt;/span&gt;
     &lt;span class="pl-k"&gt;assert&lt;/span&gt; &lt;span class="pl-s1"&gt;original&lt;/span&gt; &lt;span class="pl-c1"&gt;==&lt;/span&gt; &lt;span class="pl-s1"&gt;utils&lt;/span&gt;.&lt;span class="pl-en"&gt;dash_decode&lt;/span&gt;(&lt;span class="pl-s1"&gt;actual&lt;/span&gt;)&lt;/pre&gt;
&lt;p&gt;Here's &lt;a href="https://github.com/simonw/datasette/commit/d1cb73180b4b5a07538380db76298618a5fc46b6"&gt;the full commit&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;This meets my requirements.&lt;/p&gt;
&lt;h4&gt;Capturing these with a regular expression&lt;/h4&gt;
&lt;p&gt;There was one remaining challenge. Datasette uses regular expressions - inspired by Django - to route requests to the correct page.&lt;/p&gt;
&lt;p&gt;I wanted a regular expression that could extract dash-encoded values while distinguishing them from &lt;code&gt;/&lt;/code&gt; and &lt;code&gt;-&lt;/code&gt; and &lt;code&gt;.&lt;/code&gt; characters that were not encoded in that way.&lt;/p&gt;
&lt;p&gt;Here's the pattern I came up with to match dash-encoded strings:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;([^\/\-\.]*|(\-/)|(\-\.)|(\-\-))*&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;Broken down:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;[^\/\-\.]*&lt;/code&gt; means 0 or more characters that are NOT one of &lt;code&gt;.&lt;/code&gt; or &lt;code&gt;/&lt;/code&gt; or &lt;code&gt;-&lt;/code&gt; - since we don't care about those characters at all&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;(\-/)&lt;/code&gt; means the explicit sequence &lt;code&gt;-/&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;(\-\.)&lt;/code&gt; means the explicit sequence &lt;code&gt;-.&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;(\-\-)&lt;/code&gt; means the explicit sequence &lt;code&gt;--&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Those four are wrapped in a group combined with the &lt;code&gt;|&lt;/code&gt; or operator&lt;/li&gt;
&lt;li&gt;The group is then wrapped in a &lt;code&gt;(..)*&lt;/code&gt; - specifying that it can repeat as many times as you like&lt;/li&gt;
&lt;/ul&gt;
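&lt;p&gt;As a rough check of that breakdown, here is a minimal Python sketch that applies the pattern with &lt;code&gt;re.fullmatch()&lt;/code&gt;. The &lt;code&gt;is_dash_encoded()&lt;/code&gt; helper is a hypothetical name used for illustration and is not part of Datasette:&lt;/p&gt;

```python
import re

# The pattern from the post, unchanged
DASH_ENCODED = re.compile(r"([^\/\-\.]*|(\-/)|(\-\.)|(\-\-))*")


def is_dash_encoded(value):
    "True if every -, . and / in value is part of a -/, -. or -- pair"
    return DASH_ENCODED.fullmatch(value) is not None


print(is_dash_encoded("-/foo-/bar"))  # the encoded form of /foo/bar
print(is_dash_encoded("/foo/bar"))    # raw slashes, so not dash-encoded
```

&lt;p&gt;Using &lt;code&gt;fullmatch()&lt;/code&gt; matters here: every part of the pattern is optional, so a plain &lt;code&gt;match()&lt;/code&gt; would succeed on any input by matching an empty prefix.&lt;/p&gt;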
&lt;p&gt;A better way to break down this regular expression is visually, &lt;a href="https://www.debuggex.com/r/KYfCocdmuBHxHETv"&gt;using Debuggex&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2022/dash-encoding-regex.png" alt="A visualization of the regular expression, showing how it loops around the inner concept of none of those three characters or one of the three explicit character groupings." style="max-width:100%;" /&gt;&lt;/p&gt;
&lt;p&gt;Combining this into the full regular expression that matches a &lt;code&gt;/database/table.format&lt;/code&gt; path is even messier, due to the need to add non-capturing group syntax &lt;code&gt;(?:..)&lt;/code&gt; and named groups &lt;code&gt;(?P&amp;lt;name&amp;gt;...)&lt;/code&gt; - it ends up looking like this:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;^/(?P&amp;lt;database&amp;gt;[^/]+)/(?P&amp;lt;table&amp;gt;(?:[^\/\-\.]*|(?:\-/)*|(?:\-\.)*|(?:\-\-)*)*?)\.(?P&amp;lt;format&amp;gt;\w+)?$&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;Visualized &lt;a href="https://www.debuggex.com/r/aTF6lx5JpaMN6UYz"&gt;with Debuggex&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2022/dash-encoding-regex-full.png" alt="The more complex regex visualized." style="max-width:100%;" /&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Update:&lt;/strong&gt; Thanks to suggestions &lt;a href="https://twitter.com/dracos/status/1500236433809973248"&gt;from Matthew Somerville&lt;/a&gt; I simplified this further to:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;^/(?P&amp;lt;database&amp;gt;[^/]+)/(?P&amp;lt;table&amp;gt;[^\/\-\.]*|\-/|\-\.|\-\-)*(?P&amp;lt;format&amp;gt;\.\w+)?$&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2022/dash-encoding-regex-simpler.png" alt="This looks less complex in Debuggex" style="max-width:100%;" /&gt;&lt;/p&gt;

&lt;h4&gt;Next steps: implementation&lt;/h4&gt;
&lt;p&gt;I'm currently working on integrating it into Datasette in &lt;a href="https://github.com/simonw/datasette/pull/1648"&gt;this PR&lt;/a&gt;. The full history of my thinking around this problem can be found &lt;a href="https://github.com/simonw/datasette/issues/1439"&gt;in issue 1439&lt;/a&gt;, with comments stretching back to August last year!&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/regular-expressions"&gt;regular-expressions&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/urls"&gt;urls&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="regular-expressions"/><category term="urls"/><category term="datasette"/></entry><entry><title>Search and replace with regular expressions in VS Code</title><link href="https://simonwillison.net/2021/Aug/2/vs-code-regular-expressions/#atom-tag" rel="alternate"/><published>2021-08-02T21:30:11+00:00</published><updated>2021-08-02T21:30:11+00:00</updated><id>https://simonwillison.net/2021/Aug/2/vs-code-regular-expressions/#atom-tag</id><summary type="html">
    
        &lt;p&gt;&lt;strong&gt;TIL:&lt;/strong&gt; &lt;a href="https://til.simonwillison.net/vscode/vs-code-regular-expressions"&gt;Search and replace with regular expressions in VS Code&lt;/a&gt;&lt;/p&gt;
        
    
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/regular-expressions"&gt;regular-expressions&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="regular-expressions"/></entry><entry><title>datasette-ripgrep: deploy a regular expression search engine for your source code</title><link href="https://simonwillison.net/2020/Nov/28/datasette-ripgrep/#atom-tag" rel="alternate"/><published>2020-11-28T06:51:06+00:00</published><updated>2020-11-28T06:51:06+00:00</updated><id>https://simonwillison.net/2020/Nov/28/datasette-ripgrep/#atom-tag</id><summary type="html">
    &lt;p&gt;This week I built &lt;a href="https://github.com/simonw/datasette-ripgrep"&gt;datasette-ripgrep&lt;/a&gt; - a web application  for running regular expression searches against source code, built on top of the amazing &lt;a href="https://github.com/BurntSushi/ripgrep"&gt;ripgrep&lt;/a&gt; command-line tool.&lt;/p&gt;
&lt;h4&gt;datasette-ripgrep demo&lt;/h4&gt;
&lt;p&gt;I've deployed a demo version of the application here:&lt;/p&gt;
&lt;p&gt;&lt;a href="https://ripgrep.datasette.io/-/ripgrep?pattern=pytest"&gt;ripgrep.datasette.io/-/ripgrep?pattern=pytest&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;The demo runs searches against the source code of every one of my GitHub repositories that start with &lt;code&gt;datasette&lt;/code&gt; - &lt;a href="https://github-to-sqlite.dogsheep.net/github/repos?name__startswith=datasette&amp;amp;owner__exact=9599"&gt;61 repos&lt;/a&gt; right now - so it should include all of my Datasette plugins plus the core Datasette repository itself.&lt;/p&gt;
&lt;p&gt;Since it's running on top of &lt;code&gt;ripgrep&lt;/code&gt;, it supports regular expressions. This is absurdly useful. Some examples:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Every usage of the &lt;code&gt;.plugin_config(&lt;/code&gt; method: &lt;a href="https://ripgrep.datasette.io/-/ripgrep?pattern=%5C.plugin_config%5C%28"&gt;plugin_config\(&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Everywhere I use &lt;code&gt;async with httpx.AsyncClient&lt;/code&gt; (usually in tests): &lt;a href="https://ripgrep.datasette.io/-/ripgrep?pattern=async+with.*AsyncClient"&gt;async with.*AsyncClient&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;All places where I use a Jinja &lt;code&gt;|&lt;/code&gt; filter inside a variable: &lt;a href="https://ripgrep.datasette.io/-/ripgrep?pattern=%5C%7B%5C%7B.*%5C%7C.*%5C%7D%5C%7D"&gt;\{\{.*\|.*\}\}&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;
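&lt;p&gt;Search links like these can also be built programmatically. A small sketch using Python's standard &lt;code&gt;urllib.parse.urlencode()&lt;/code&gt;; the &lt;code&gt;search_url()&lt;/code&gt; helper is a hypothetical name for illustration:&lt;/p&gt;

```python
from urllib.parse import urlencode

# Base URL of the demo instance linked above
BASE = "https://ripgrep.datasette.io/-/ripgrep"


def search_url(pattern):
    "Build a search URL, percent-encoding the regular expression"
    return BASE + "?" + urlencode({"pattern": pattern})


print(search_url(r"\.plugin_config\("))
# https://ripgrep.datasette.io/-/ripgrep?pattern=%5C.plugin_config%5C%28
```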
&lt;p&gt;I usually run ripgrep as &lt;code&gt;rg&lt;/code&gt; on the command-line, or use it within Visual Studio Code (&lt;a href="https://twitter.com/simonw/status/1331381448171929600"&gt;fun fact&lt;/a&gt;: the reason VS Code's "Find in Files" is so good is it's running ripgrep under the hood).&lt;/p&gt;
&lt;p&gt;So why have it as a web application? Because this means I can link to it, bookmark it and use it on my phone.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2020/datasette-ripgrep.png" alt="A screenshot of datasette-ripgrep in action" style="max-width: 100%" /&gt;&lt;/p&gt;
&lt;h4&gt;Why build this?&lt;/h4&gt;
&lt;p&gt;There are plenty of great existing code search tools out there already: I've heard great things about &lt;a href="https://github.com/livegrep/livegrep"&gt;livegrep&lt;/a&gt;, and a quick Google search shows a bunch of other options.&lt;/p&gt;
&lt;p&gt;Aside from being a fun project, &lt;code&gt;datasette-ripgrep&lt;/code&gt; has one key advantage: it gets to benefit from Datasette's publishing mechanism, which means it's really easy to deploy.&lt;/p&gt;
&lt;p&gt;That &lt;a href="https://ripgrep.datasette.io/"&gt;ripgrep.datasette.io&lt;/a&gt; demo is deployed by checking out the source code to be searched into an &lt;code&gt;all&lt;/code&gt; directory and then using the following command:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;datasette publish cloudrun \
    --metadata metadata.json \
    --static all:all \
    --install=datasette-ripgrep \
    --service datasette-ripgrep \
    --apt-get-install ripgrep
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;code&gt;all&lt;/code&gt; is a folder containing the source code to be searched. &lt;code&gt;metadata.json&lt;/code&gt; contains this:&lt;/p&gt;
&lt;div class="highlight highlight-source-json"&gt;&lt;pre&gt;{
    &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;plugins&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;: {
        &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;datasette-ripgrep&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;: {
            &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;path&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;/app/all&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
            &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;time_limit&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;: &lt;span class="pl-c1"&gt;3.0&lt;/span&gt;
        }
    }
}&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;That's all there is to it! The result is a deployed code search engine, running on Google Cloud Run.&lt;/p&gt;
&lt;p&gt;(If you want to try this yourself you'll need to be using the just-released Datasette 0.52.)&lt;/p&gt;
&lt;p&gt;The &lt;a href="https://github.com/simonw/datasette-ripgrep/blob/main/.github/workflows/deploy_demo.yml"&gt;GitHub Action workflow&lt;/a&gt; that deploys the demo also uses my &lt;a href="https://github.com/dogsheep/github-to-sqlite"&gt;github-to-sqlite&lt;/a&gt; tool to fetch my repos and then shallow-clones the ones that begin with &lt;code&gt;datasette&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;If you have &lt;a href="https://docs.datasette.io/en/stable/publish.html#publishing-to-google-cloud-run"&gt;your own Google Cloud Run credentials&lt;/a&gt;, you can run your own copy of that workflow against your own repositories.&lt;/p&gt;
&lt;h4&gt;A different kind of Datasette plugin&lt;/h4&gt;
&lt;p&gt;Datasette is a tool for publishing SQLite databases, so most Datasette plugins integrate with SQLite in some way.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;datasette-ripgrep&lt;/code&gt; is different: it makes no use of SQLite at all, but instead takes advantage of Datasette's URL routing, &lt;code&gt;datasette publish&lt;/code&gt; deployments and permissions system.&lt;/p&gt;
&lt;p&gt;The plugin implementation is currently &lt;a href="https://github.com/simonw/datasette-ripgrep/blob/07b9ced2935b0b6080c1c42fcaf6ab9e8003d186/datasette_ripgrep/__init__.py"&gt;134 lines of code&lt;/a&gt;, excluding tests and templates.&lt;/p&gt;
&lt;p&gt;While the plugin doesn't use SQLite, it does share a common philosophy with Datasette: the plugin bundles the source code that it is going to search as part of the deployed application, in a similar way to how Datasette usually bundles one or more SQLite database files.&lt;/p&gt;
&lt;p&gt;As such, it's extremely inexpensive to run and can be deployed to serverless hosting. If you need to scale it, you can run more copies.&lt;/p&gt;
&lt;p&gt;This does mean that the application needs to be re-deployed to pick up changes to the searchable code. I'll probably set my demo to do this on a daily basis.&lt;/p&gt;
&lt;h4&gt;Controlling processes from asyncio&lt;/h4&gt;
&lt;p&gt;The trickiest part of the implementation was figuring out how to use Python's &lt;code&gt;asyncio.create_subprocess_exec()&lt;/code&gt; method to safely run the &lt;code&gt;rg&lt;/code&gt; process in response to incoming requests.&lt;/p&gt;
&lt;p&gt;I don't want expensive searches to tie up the server, so I implemented two limits here. The first is a time limit: by default, searches have one second to run, after which the &lt;code&gt;rg&lt;/code&gt; process will be terminated and only results received so far will be returned. This is achieved using the &lt;a href="https://docs.python.org/3/library/asyncio-task.html#asyncio.wait_for"&gt;asyncio.wait_for()&lt;/a&gt; function.&lt;/p&gt;
&lt;p&gt;I also implemented a limit on the number of matching lines that can be returned, defaulting to 2,000. Any more than that and the process is terminated early.&lt;/p&gt;
&lt;p&gt;Both of these limits can be customized using plugin settings (documented in &lt;a href="https://github.com/simonw/datasette-ripgrep/blob/main/README.md"&gt;the README&lt;/a&gt;). You can see how they are implemented in the &lt;a href="https://github.com/simonw/datasette-ripgrep/blob/0.2/datasette_ripgrep/__init__.py#L9-L55"&gt;async def run_ripgrep(pattern, path, time_limit=1.0, max_lines=2000)&lt;/a&gt; function.&lt;/p&gt;
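&lt;p&gt;The general shape of those two limits can be sketched like this. It is a simplified illustration and not the plugin's actual &lt;code&gt;run_ripgrep()&lt;/code&gt; code: &lt;code&gt;run_limited()&lt;/code&gt; is a hypothetical helper that accepts any command, and the demo at the bottom runs a short Python child process as a stand-in for &lt;code&gt;rg&lt;/code&gt;.&lt;/p&gt;

```python
import asyncio
import sys


async def run_limited(args, time_limit=1.0, max_lines=2000):
    "Run a command, returning (lines, timed_out, truncated)"
    proc = await asyncio.create_subprocess_exec(
        *args,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.DEVNULL,
    )
    lines = []
    truncated = False

    async def read_lines():
        nonlocal truncated
        async for raw in proc.stdout:
            if len(lines) >= max_lines:
                # Line limit reached: stop reading, the process is killed below
                truncated = True
                break
            lines.append(raw.decode("utf-8", "replace").rstrip("\r\n"))

    timed_out = False
    try:
        # wait_for() cancels read_lines() once time_limit seconds have passed;
        # lines collected before the cancellation stay in the list
        await asyncio.wait_for(read_lines(), timeout=time_limit)
    except asyncio.TimeoutError:
        timed_out = True

    if timed_out or truncated:
        try:
            proc.kill()
        except ProcessLookupError:
            pass  # the process had already exited
    await proc.wait()
    return lines, timed_out, truncated


if __name__ == "__main__":
    # Stand-in for rg: a child Python process that prints ten lines
    command = [sys.executable, "-c", "for i in range(10): print(i)"]
    print(asyncio.run(run_limited(command, max_lines=3)))
```

&lt;p&gt;Collecting lines into a list owned by the outer function is what allows partial results to survive the timeout: cancelling the reader does not discard what it has already appended.&lt;/p&gt;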
&lt;h4&gt;Highlighted linkable line numbers&lt;/h4&gt;
&lt;p&gt;The other fun implementation detail is the way the source code listings are displayed. I'm using CSS to display the line numbers in a way that makes them visible without them breaking copy-and-paste (inspired by &lt;a href="https://www.sylvaindurand.org/using-css-to-add-line-numbering/"&gt;this article by Sylvain Durand&lt;/a&gt;).&lt;/p&gt;
&lt;div class="highlight highlight-source-css"&gt;&lt;pre&gt;&lt;span class="pl-ent"&gt;code&lt;/span&gt;:&lt;span class="pl-c1"&gt;before&lt;/span&gt; {
    &lt;span class="pl-c1"&gt;content&lt;/span&gt;: &lt;span class="pl-en"&gt;attr&lt;/span&gt;(data-line);
    &lt;span class="pl-c1"&gt;display&lt;/span&gt;: inline-block;
    &lt;span class="pl-c1"&gt;width&lt;/span&gt;: &lt;span class="pl-c1"&gt;3.5&lt;span class="pl-smi"&gt;ch&lt;/span&gt;&lt;/span&gt;;
    &lt;span class="pl-c1"&gt;-webkit-user-select&lt;/span&gt;: none;
    &lt;span class="pl-c1"&gt;color&lt;/span&gt;: &lt;span class="pl-pds"&gt;&lt;span class="pl-kos"&gt;#&lt;/span&gt;666&lt;/span&gt;;
}&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The HTML looks like this:&lt;/p&gt;
&lt;div class="highlight highlight-text-html-basic"&gt;&lt;pre&gt;&lt;span class="pl-kos"&gt;&amp;lt;&lt;/span&gt;&lt;span class="pl-ent"&gt;pre&lt;/span&gt;&lt;span class="pl-kos"&gt;&amp;gt;&lt;/span&gt;&lt;span class="pl-kos"&gt;&amp;lt;&lt;/span&gt;&lt;span class="pl-ent"&gt;code&lt;/span&gt; &lt;span class="pl-c1"&gt;id&lt;/span&gt;="&lt;span class="pl-s"&gt;L1&lt;/span&gt;" &lt;span class="pl-c1"&gt;data-line&lt;/span&gt;="&lt;span class="pl-s"&gt;1&lt;/span&gt;"&lt;span class="pl-kos"&gt;&amp;gt;&lt;/span&gt;from setuptools import setup&lt;span class="pl-kos"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="pl-ent"&gt;code&lt;/span&gt;&lt;span class="pl-kos"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="pl-kos"&gt;&amp;lt;&lt;/span&gt;&lt;span class="pl-ent"&gt;code&lt;/span&gt; &lt;span class="pl-c1"&gt;id&lt;/span&gt;="&lt;span class="pl-s"&gt;L2&lt;/span&gt;" &lt;span class="pl-c1"&gt;data-line&lt;/span&gt;="&lt;span class="pl-s"&gt;2&lt;/span&gt;"&lt;span class="pl-kos"&gt;&amp;gt;&lt;/span&gt;import os&lt;span class="pl-kos"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="pl-ent"&gt;code&lt;/span&gt;&lt;span class="pl-kos"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="pl-kos"&gt;&amp;lt;&lt;/span&gt;&lt;span class="pl-ent"&gt;code&lt;/span&gt; &lt;span class="pl-c1"&gt;id&lt;/span&gt;="&lt;span class="pl-s"&gt;L3&lt;/span&gt;" &lt;span class="pl-c1"&gt;data-line&lt;/span&gt;="&lt;span class="pl-s"&gt;3&lt;/span&gt;"&lt;span class="pl-kos"&gt;&amp;gt;&lt;/span&gt;&amp;amp;nbsp;&lt;span class="pl-kos"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="pl-ent"&gt;code&lt;/span&gt;&lt;span class="pl-kos"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="pl-kos"&gt;&amp;lt;&lt;/span&gt;&lt;span class="pl-ent"&gt;code&lt;/span&gt; &lt;span class="pl-c1"&gt;id&lt;/span&gt;="&lt;span class="pl-s"&gt;L4&lt;/span&gt;" &lt;span class="pl-c1"&gt;data-line&lt;/span&gt;="&lt;span class="pl-s"&gt;4&lt;/span&gt;"&lt;span class="pl-kos"&gt;&amp;gt;&lt;/span&gt;VERSION = &amp;amp;#34;0.1&amp;amp;#34;&lt;span class="pl-kos"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="pl-ent"&gt;code&lt;/span&gt;&lt;span class="pl-kos"&gt;&amp;gt;&lt;/span&gt;
...&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;I wanted to imitate GitHub's handling of line links, where adding &lt;code&gt;#L23&lt;/code&gt; to the URL both jumps to that line and causes the line to be highlighted. Here's &lt;a href="https://ripgrep.datasette.io/-/ripgrep/view/datasette-allow-permissions-debug/setup.py#L23"&gt;a demo of that&lt;/a&gt; - I use the following JavaScript to update the contents of a &lt;code&gt;&amp;lt;style id="highlightStyle"&amp;gt;&amp;lt;/style&amp;gt;&lt;/code&gt; element in the document head any time the URL fragment changes:&lt;/p&gt;
&lt;div class="highlight highlight-source-js"&gt;&lt;pre&gt;&lt;span class="pl-c1"&gt;&amp;lt;&lt;/span&gt;&lt;span class="pl-ent"&gt;script&lt;/span&gt;&lt;span class="pl-c1"&gt;&amp;gt;&lt;/span&gt;
var highlightStyle = document.getElementById('highlightStyle');
function highlightLineFromFragment() &lt;span class="pl-kos"&gt;{&lt;/span&gt;
    &lt;span class="pl-en"&gt;if&lt;/span&gt; &lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-pds"&gt;/&lt;span class="pl-cce"&gt;^&lt;/span&gt;#L&lt;span class="pl-cce"&gt;\d&lt;/span&gt;&lt;span class="pl-c1"&gt;+&lt;/span&gt;&lt;span class="pl-cce"&gt;$&lt;/span&gt;/&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;exec&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s1"&gt;location&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;hash&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;&lt;/span&gt; &lt;span class="pl-kos"&gt;{&lt;/span&gt;
        &lt;span class="pl-s1"&gt;highlightStyle&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;innerText&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s"&gt;`&lt;span class="pl-s1"&gt;&lt;span class="pl-kos"&gt;${&lt;/span&gt;&lt;span class="pl-s1"&gt;location&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;hash&lt;/span&gt;&lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;/span&gt; { background-color: yellow; }`&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
    &lt;span class="pl-kos"&gt;}&lt;/span&gt;
&lt;span class="pl-kos"&gt;}&lt;/span&gt;
&lt;span class="pl-en"&gt;highlightLineFromFragment&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
&lt;span class="pl-smi"&gt;window&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;addEventListener&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s"&gt;"hashchange"&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-s1"&gt;highlightLineFromFragment&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
&lt;span class="pl-c1"&gt;&amp;lt;&lt;/span&gt;/&lt;span class="pl-ent"&gt;script&lt;/span&gt;&lt;span class="pl-c1"&gt;&amp;gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;It's the simplest way I could think of to achieve this effect.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Update 28th November 2020&lt;/strong&gt;: Louis Lévêque on Twitter suggested using the CSS &lt;a href="https://developer.mozilla.org/en-US/docs/Web/CSS/:target"&gt;:target selector&lt;/a&gt; instead, which is indeed MUCH simpler - I deleted the above JavaScript and replaced it with this CSS:&lt;/p&gt;
&lt;div class="highlight highlight-source-css"&gt;&lt;pre&gt;:&lt;span class="pl-c1"&gt;target&lt;/span&gt; {
    &lt;span class="pl-c1"&gt;background-color&lt;/span&gt;: &lt;span class="pl-pds"&gt;&lt;span class="pl-kos"&gt;#&lt;/span&gt;FFFF99&lt;/span&gt;;
}&lt;/pre&gt;&lt;/div&gt;
&lt;h4&gt;Next steps for this project&lt;/h4&gt;
&lt;p&gt;I'm pleased to have got &lt;a href="https://github.com/simonw/datasette-ripgrep"&gt;datasette-ripgrep&lt;/a&gt; to a workable state, and I'm looking forward to using it to answer questions about the growing Datasette ecosystem. I don't know how much more time I'll invest in this - if it proves useful then I may well expand it.&lt;/p&gt;
&lt;p&gt;I do think there's something really interesting about being able to spin up this kind of code search engine on demand using &lt;code&gt;datasette publish&lt;/code&gt;. It feels like a very useful trick to have access to.&lt;/p&gt;
&lt;h4&gt;Better URLs for my TILs&lt;/h4&gt;
&lt;p&gt;My other project this week was an upgrade to &lt;a href="https://til.simonwillison.net/"&gt;til.simonwillison.net&lt;/a&gt;: I finally spent the time to &lt;a href="https://github.com/simonw/til/issues/34"&gt;design nicer URLs&lt;/a&gt; for the site.&lt;/p&gt;
&lt;p&gt;Before:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;til.simonwillison.net/til/til/javascript_manipulating-query-params.md&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;After:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;til.simonwillison.net/javascript/manipulating-query-params&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;The implementation for this takes advantage of a feature I sneaked into Datasette 0.49: &lt;a href="https://simonwillison.net/2020/Sep/15/datasette-0-49#path-parameters-custom-page-templates"&gt;Path parameters for custom page templates&lt;/a&gt;. I can create a template file called &lt;code&gt;pages/{topic}/{slug}.html&lt;/code&gt; and Datasette will use that template to handle 404 errors that match that pattern.&lt;/p&gt;
&lt;p&gt;Here's &lt;a href="https://github.com/simonw/til/blob/main/templates/pages/%7Btopic%7D/%7Bslug%7D.html"&gt;the new pages/{topic}/{slug}.html&lt;/a&gt; template for my TIL site. It uses the &lt;code&gt;sql()&lt;/code&gt; template function from the &lt;a href="https://github.com/simonw/datasette-template-sql"&gt;datasette-template-sql&lt;/a&gt; plugin to retrieve and render the matching TIL, or raises a 404 if no TIL can be found.&lt;/p&gt;
&lt;p&gt;I also needed to set up redirects from the old pages to the new ones. I wrote a &lt;a href="https://til.simonwillison.net/til/til/datasette_redirects-for-datasette.md"&gt;TIL on redirects for Datasette&lt;/a&gt; explaining how I did that.&lt;/p&gt;
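&lt;p&gt;The mapping from old paths to new ones is mechanical. A minimal Python sketch of it, assuming every old path has the &lt;code&gt;/til/til/{topic}_{slug}.md&lt;/code&gt; shape shown in the "Before" example and that topics never contain an underscore. &lt;code&gt;new_til_path()&lt;/code&gt; is a hypothetical helper; the redirect mechanism itself is the one described in the linked TIL:&lt;/p&gt;

```python
OLD_PREFIX = "/til/til/"
OLD_SUFFIX = ".md"


def new_til_path(old_path):
    "Map /til/til/{topic}_{slug}.md to /{topic}/{slug}, or None if it does not fit"
    if not (old_path.startswith(OLD_PREFIX) and old_path.endswith(OLD_SUFFIX)):
        return None
    name = old_path[len(OLD_PREFIX):-len(OLD_SUFFIX)]
    # Assumption: the first underscore separates the topic from the slug
    topic, sep, slug = name.partition("_")
    if not (sep and topic and slug):
        return None
    return "/{}/{}".format(topic, slug)


print(new_til_path("/til/til/javascript_manipulating-query-params.md"))
# /javascript/manipulating-query-params
```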
&lt;h4&gt;TIL this week&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/til/til/datasette_redirects-for-datasette.md"&gt;Redirects for Datasette&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;Releases this week&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/simonw/datasette-ripgrep/releases/tag/0.2"&gt;datasette-ripgrep 0.2&lt;/a&gt; - 2020-11-27&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/simonw/datasette-ripgrep/releases/tag/0.1"&gt;datasette-ripgrep 0.1&lt;/a&gt; - 2020-11-26&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/simonw/datasette-atom/releases/tag/0.8.1"&gt;datasette-atom 0.8.1&lt;/a&gt; - 2020-11-25&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/simonw/datasette-ripgrep/releases/tag/0.1a1"&gt;datasette-ripgrep 0.1a1&lt;/a&gt; - 2020-11-25&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/simonw/datasette-ripgrep/releases/tag/0.1a0"&gt;datasette-ripgrep 0.1a0&lt;/a&gt; - 2020-11-25&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/simonw/datasette-graphql/releases/tag/1.2.1"&gt;datasette-graphql 1.2.1&lt;/a&gt; - 2020-11-24&lt;/li&gt;
&lt;/ul&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/async"&gt;async&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/css"&gt;css&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/regular-expressions"&gt;regular-expressions&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/weeknotes"&gt;weeknotes&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/cloudrun"&gt;cloudrun&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ripgrep"&gt;ripgrep&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/baked-data"&gt;baked-data&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="async"/><category term="css"/><category term="projects"/><category term="python"/><category term="regular-expressions"/><category term="datasette"/><category term="weeknotes"/><category term="cloudrun"/><category term="ripgrep"/><category term="baked-data"/></entry><entry><title>The unexpected Google wide domain check bypass</title><link href="https://simonwillison.net/2020/Mar/9/unexpected-google-wide-domain-check-bypass/#atom-tag" rel="alternate"/><published>2020-03-09T23:27:41+00:00</published><updated>2020-03-09T23:27:41+00:00</updated><id>https://simonwillison.net/2020/Mar/9/unexpected-google-wide-domain-check-bypass/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://bugs.xdavidhu.me/google/2020/03/08/the-unexpected-google-wide-domain-check-bypass/"&gt;The unexpected Google wide domain check bypass&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Fantastic story of discovering a devious security vulnerability in a bunch of Google products stemming from a single exploitable regular expression in the Google Closure JavaScript library.

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=22527842"&gt;Hacker News&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/regular-expressions"&gt;regular-expressions&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/security"&gt;security&lt;/a&gt;&lt;/p&gt;



</summary><category term="regular-expressions"/><category term="security"/></entry><entry><title>Weeknotes: ONA19, twitter-to-sqlite, datasette-rure</title><link href="https://simonwillison.net/2019/Sep/13/weeknotestwitter-sqlite-datasette-rure/#atom-tag" rel="alternate"/><published>2019-09-13T21:35:52+00:00</published><updated>2019-09-13T21:35:52+00:00</updated><id>https://simonwillison.net/2019/Sep/13/weeknotestwitter-sqlite-datasette-rure/#atom-tag</id><summary type="html">
    &lt;p&gt;I’ve decided to start writing &lt;a href="https://weeknot.es/what-on-earth-are-weeknotes-a81874c5cef9"&gt;weeknotes&lt;/a&gt; for the duration of &lt;a href="https://simonwillison.net/2019/Sep/10/jsk-fellowship/"&gt;my JSK fellowship&lt;/a&gt;. Here goes!&lt;/p&gt;
&lt;p&gt;I started the fellowship last week, and this week I’ve been attending the &lt;a href="https://simonwillison.net/2019/Sep/10/jsk-fellowship/"&gt;Online News Association conference&lt;/a&gt; in New Orleans along with the other fellows.&lt;/p&gt;
&lt;p&gt;Natalie and I have had a lot of fun exploring New Orleans, including several niche museums (documented in &lt;a href="https://twitter.com/simonw/status/1171159213436997633"&gt;this Twitter thread&lt;/a&gt;, which I expect to continue adding to for years to come).&lt;/p&gt;
&lt;p&gt;ONA is the largest digital news conference on the calendar, and as such it’s a great opportunity to meet all kinds of different journalists and get a start on figuring out how best to approach the fellowship.&lt;/p&gt;
&lt;h3&gt;twitter-to-sqlite&lt;/h3&gt;
&lt;p&gt;I’ve been working on this tool for a couple of weeks now, but this week it really started to prove its worth. &lt;a href="https://github.com/dogsheep/twitter-to-sqlite"&gt;twitter-to-sqlite&lt;/a&gt; is a utility I’m building to fetch data from Twitter and write it into a SQLite database. It has &lt;a href="https://github.com/dogsheep/twitter-to-sqlite/blob/master/README.md"&gt;a comprehensive README&lt;/a&gt;, as do all of my recent projects: I’ve been doing this for long enough that I know that time spent on documentation will pay me back enormously in just a few months.&lt;/p&gt;
&lt;p&gt;My initial goal for this tool was to use it for personal analytics, partly inspired by my &lt;a href="https://simonwillison.net/2018/Jan/28/analyzing-my-twitter-followers/"&gt;Twitter followers project&lt;/a&gt; from last year. Then ONA sent out a spreadsheet with the Twitter names of most of the conference attendees and I realised there was an opportunity to use it for something more interesting.&lt;/p&gt;
&lt;p&gt;I won’t share the attendee list here because it’s intended to be private to conference attendees, but the short version of what I did with it is this:&lt;/p&gt;
&lt;p&gt;First, load the attendee list into a database using &lt;a href="https://github.com/simonw/csvs-to-sqlite"&gt;csvs-to-sqlite&lt;/a&gt; (after first converting the XLS file to CSV using Numbers):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ csvs-to-sqlite attendees.csv ona.db
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Next, fetch the full Twitter profiles for every one of those attendees. I added a &lt;code&gt;--sql&lt;/code&gt; option to &lt;code&gt;twitter-to-sqlite&lt;/code&gt; for this (&lt;a href="https://github.com/dogsheep/twitter-to-sqlite/issues/8"&gt;issue #8&lt;/a&gt;) and ran the following:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ twitter-to-sqlite users-lookup ona.db --sql="select Twitter from attendees"
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now I can run &lt;code&gt;datasette ona.db&lt;/code&gt; and execute full-text searches against the complete Twitter profiles of those attendees - great for figuring out e.g. who is attending the conference from the Washington Post.&lt;/p&gt;
&lt;p&gt;I took this a step further: can I identify people I follow on Twitter who are attending the conference? I can pull the list of IDs of people I follow like so:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ twitter-to-sqlite friends-ids ona.db simonw
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now I can see which of my Twitter friends are attending the conference by loading it into &lt;a href="https://datasette.readthedocs.io/"&gt;Datasette&lt;/a&gt; and using this query:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;select * from users
where screen_name in (select Twitter from attendees)
and id in (select followed_id from following where follower_id = 12497)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This is not a particularly efficient SQL query… and it doesn’t matter! The great thing about working against tiny SQLite databases that contain just a few thousand rows of data is that you can nest SQL queries like this with little concern for their performance - this one runs on my laptop in 18.213ms.&lt;/p&gt;
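&lt;p&gt;To try the nested query without any real data, here is a small sketch using Python’s &lt;code&gt;sqlite3&lt;/code&gt; module. The table and column names follow the query above, but the rows are invented and the schema is a simplified assumption, not the actual schema that &lt;code&gt;twitter-to-sqlite&lt;/code&gt; creates.&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Simplified, assumed schema with a handful of made-up rows.
conn.executescript(
    """
    create table attendees (Twitter text);
    create table users (id integer primary key, screen_name text);
    create table following (followed_id integer, follower_id integer);
    insert into attendees values ('alice'), ('bob');
    insert into users values (1, 'alice'), (2, 'bob'), (3, 'carol');
    insert into following values (1, 12497), (3, 12497);
    """
)

# Same shape as the query above: two independent "in (select ...)" filters.
rows = conn.execute(
    """
    select screen_name from users
    where screen_name in (select Twitter from attendees)
    and id in (select followed_id from following where follower_id = 12497)
    """
).fetchall()
print(rows)  # [('alice',)]
```

&lt;p&gt;Only &lt;code&gt;alice&lt;/code&gt; is both an attendee and followed by user 12497, so she is the single row returned.&lt;/p&gt;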
&lt;h3&gt;datasette-rure&lt;/h3&gt;
&lt;p&gt;SQLite has &lt;a href="https://www.sqlite.org/lang_expr.html#like"&gt;a REGEXP operator&lt;/a&gt;, which is documented thus:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The REGEXP operator is a special syntax for the regexp() user function. No regexp() user function is defined by default and so use of the REGEXP operator will normally result in an error message. If an application-defined SQL function named “regexp” is added at run-time, then the “X REGEXP Y” operator will be implemented as a call to “regexp(Y,X)”.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Python’s sqlite3 module lets you &lt;a href="https://docs.python.org/3/library/sqlite3.html#sqlite3.Connection.create_function"&gt;register custom functions&lt;/a&gt;, so this should be an easy fix. I’ve been holding off on implementing this in Datasette for a while though because Datasette allows user-provided queries, and Python’s regular expressions have a nasty quality: you can easily construct a regular expression and input that will &lt;a href="https://www.benfrederickson.com/python-catastrophic-regular-expressions-and-the-gil/"&gt;hang the Python interpreter&lt;/a&gt;.&lt;/p&gt;
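&lt;p&gt;A minimal sketch of that registration, using only the standard library. One loud assumption: it is backed by Python’s own &lt;code&gt;re&lt;/code&gt; module, so it carries exactly the hang risk described above and is only reasonable for trusted queries. The &lt;code&gt;people&lt;/code&gt; table is made up for illustration.&lt;/p&gt;

```python
import re
import sqlite3


def regexp(pattern, value):
    # SQLite rewrites "X REGEXP Y" as regexp(Y, X), so the pattern comes first.
    if value is None:
        return None
    # Backed by Python's backtracking engine: not safe for untrusted patterns.
    return 1 if re.search(pattern, value) else 0


conn = sqlite3.connect(":memory:")
conn.create_function("regexp", 2, regexp)

conn.execute("create table people (name text)")
conn.executemany(
    "insert into people values (?)", [("alice",), ("bob",), ("carol",)]
)

rows = conn.execute(
    "select name from people where name REGEXP '^[ab]' order by name"
).fetchall()
print(rows)  # [('alice',), ('bob',)]
```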
&lt;p&gt;So I needed a safe regular expression library that I could trust not to hang my interpreter on the wrong input.&lt;/p&gt;
&lt;p&gt;Google have such a library - &lt;a href="https://github.com/google/re2"&gt;re2&lt;/a&gt; - but the various Python bindings for it aren’t cleanly installable via &lt;code&gt;pip install&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;I asked about this &lt;a href="https://twitter.com/simonw/status/1171415169202827264"&gt;on Twitter&lt;/a&gt; and Alex Willmer &lt;a href="https://twitter.com/moreati/status/1171477803365044232"&gt;pointed me&lt;/a&gt; to &lt;a href="https://github.com/davidblewett/rure-python"&gt;rure-python&lt;/a&gt; - a Python wrapper for the Rust regular expression library, which “guarantees linear time searching using finite automata” - exactly what I was looking for!&lt;/p&gt;
&lt;p&gt;So I built a quick Datasette plugin: &lt;a href="https://github.com/simonw/datasette-rure"&gt;datasette-rure&lt;/a&gt;, which provides a &lt;code&gt;regexp()&lt;/code&gt; function to enable the &lt;code&gt;REGEXP&lt;/code&gt; operator, powered by the Rust regular expression engine.&lt;/p&gt;
&lt;p&gt;I also added &lt;code&gt;regexp_match(pattern, text, index)&lt;/code&gt; and &lt;code&gt;regexp_matches(pattern, text)&lt;/code&gt; functions. Here are their examples from &lt;a href="https://github.com/simonw/datasette-rure/blob/master/README.md"&gt;the README&lt;/a&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;select regexp_match('.*( and .*)', title) as n from articles where n is not null
-- Returns the ' and X' component of any matching titles, e.g.
--     and Recognition
--     and Transitions Their Place
-- etc
select regexp_matches(
    'hello (?P&amp;lt;name&amp;gt;\w+) the (?P&amp;lt;species&amp;gt;\w+)',
    'hello bob the dog, hello maggie the cat, hello tarquin the otter'
)
-- Returns a JSON array:
-- [{"name": "bob", "species": "dog"},
--  {"name": "maggie", "species": "cat"},
--  {"name": "tarquin", "species": "otter"}]
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Returning JSON from a SQLite custom function is a fun trick: it means you can then loop over the returned rows or even join them against other tables within SQL using the bundled &lt;a href="https://www.sqlite.org/json1.html#jeach"&gt;json_each() function&lt;/a&gt;.&lt;/p&gt;
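&lt;p&gt;A rough sketch of that trick with the standard library. &lt;code&gt;pairs_json&lt;/code&gt; is a hypothetical stand-in, not the plugin’s own function: it uses Python’s &lt;code&gt;re&lt;/code&gt; module rather than the Rust engine and returns positional groups instead of named ones. It also assumes the SQLite build bundled with Python includes the JSON functions.&lt;/p&gt;

```python
import json
import re
import sqlite3


def pairs_json(pattern, text):
    # Hypothetical helper: every match becomes a JSON array of its groups.
    return json.dumps([list(m.groups()) for m in re.finditer(pattern, text)])


conn = sqlite3.connect(":memory:")
conn.create_function("pairs_json", 2, pairs_json)

# json_each() expands the returned JSON array into one row per element,
# so the matches can be filtered or joined like any other table.
rows = conn.execute(
    """
    select json_extract(value, '$[0]') as name,
           json_extract(value, '$[1]') as species
    from json_each(pairs_json(?, ?))
    """,
    (
        r"hello (\w+) the (\w+)",
        "hello bob the dog, hello maggie the cat, hello tarquin the otter",
    ),
).fetchall()
print(rows)
```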
&lt;p&gt;I deployed &lt;a href="https://datasette-rure-demo.datasette.io/24ways?sql=select+regexp_matches%28%0D%0A++++%27hello+%28%3FP%3Cname%3E%5Cw%2B%29+the+%28%3FP%3Cspecies%3E%5Cw%2B%29%27%2C%0D%0A++++%27hello+bob+the+dog%2C+hello+maggie+the+cat%2C+hello+tarquin+the+otter%27%0D%0A%29"&gt;an interactive demo&lt;/a&gt; of Datasette running the plugin. I’ve been trying to do this for all of the plugins I release - &lt;a href="https://github.com/simonw/datasette-jq"&gt;datasette-jq&lt;/a&gt; is another recent example (the interactive demos are linked from &lt;a href="https://github.com/simonw/datasette-jq/blob/master/README.md"&gt;the README&lt;/a&gt;).&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/regular-expressions"&gt;regular-expressions&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sqlite"&gt;sqlite&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/twitter"&gt;twitter&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/rust"&gt;rust&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/weeknotes"&gt;weeknotes&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="projects"/><category term="regular-expressions"/><category term="sqlite"/><category term="twitter"/><category term="rust"/><category term="datasette"/><category term="weeknotes"/></entry><entry><title>Details of the Cloudflare outage on July 2, 2019</title><link href="https://simonwillison.net/2019/Jul/12/details-cloudflare-outage-july-2-2019/#atom-tag" rel="alternate"/><published>2019-07-12T17:36:25+00:00</published><updated>2019-07-12T17:36:25+00:00</updated><id>https://simonwillison.net/2019/Jul/12/details-cloudflare-outage-july-2-2019/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://blog.cloudflare.com/details-of-the-cloudflare-outage-on-july-2-2019/"&gt;Details of the Cloudflare outage on July 2, 2019&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Best retrospective I’ve read in a long time. The outage was caused by a backtracking regex rule that was added to the Web Application Firewall project. WAF rules roll out globally and skip most of Cloudflare’s regular gradual rollout process (delightfully animal themed: DOG for the dogfooding PoP that their employees use, PIG for the Guinea Pig PoPs reserved for free customers, then Canary for the final step) so that they can deploy counter-measures to newly discovered vulnerabilities as quickly as possible. The real value in the retro is that it provides an extremely deep insight into how Cloudflare organize, test and manage their changes. Really interesting stuff.

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=20421538"&gt;Hacker News&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/operations"&gt;operations&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/regular-expressions"&gt;regular-expressions&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/cloudflare"&gt;cloudflare&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/postmortem"&gt;postmortem&lt;/a&gt;&lt;/p&gt;



</summary><category term="operations"/><category term="regular-expressions"/><category term="cloudflare"/><category term="postmortem"/></entry><entry><title>r1chardj0n3s/parse: Parse strings using a specification based on the Python format() syntax.</title><link href="https://simonwillison.net/2018/Feb/25/parse/#atom-tag" rel="alternate"/><published>2018-02-25T16:58:32+00:00</published><updated>2018-02-25T16:58:32+00:00</updated><id>https://simonwillison.net/2018/Feb/25/parse/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/r1chardj0n3s/parse"&gt;r1chardj0n3s/parse: Parse strings using a specification based on the Python format() syntax.&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Really neat API design: parse() behaves almost exactly in the opposite way to Python’s built-in format(), so you can use format strings as an alternative to regular expressions for extracting specific data from a string.

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://github.com/kennethreitz/requests-html/blob/master/Pipfile"&gt;requests-html/Pipfile&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/regular-expressions"&gt;regular-expressions&lt;/a&gt;&lt;/p&gt;



</summary><category term="python"/><category term="regular-expressions"/></entry><entry><title>A Regular Expression Matcher: Code by Rob Pike, Exegesis by Brian Kernighan</title><link href="https://simonwillison.net/2017/Dec/5/regex/#atom-tag" rel="alternate"/><published>2017-12-05T18:36:12+00:00</published><updated>2017-12-05T18:36:12+00:00</updated><id>https://simonwillison.net/2017/Dec/5/regex/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="http://www.cs.princeton.edu/courses/archive/spr09/cos333/beautiful.html"&gt;A Regular Expression Matcher: Code by Rob Pike, Exegesis by Brian Kernighan&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Delightfully clear and succinct 30-line C implementation of a regular expression matcher that supports $, ^, . and * operations.

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=15840487"&gt;Hacker News&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/c"&gt;c&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/regular-expressions"&gt;regular-expressions&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/rob-pike"&gt;rob-pike&lt;/a&gt;&lt;/p&gt;



</summary><category term="c"/><category term="regular-expressions"/><category term="rob-pike"/></entry><entry><title>What are the best resources for learning regular expressions?</title><link href="https://simonwillison.net/2012/Feb/21/what-are-the-best/#atom-tag" rel="alternate"/><published>2012-02-21T17:47:00+00:00</published><updated>2012-02-21T17:47:00+00:00</updated><id>https://simonwillison.net/2012/Feb/21/what-are-the-best/#atom-tag</id><summary type="html">
    &lt;p&gt;&lt;em&gt;My answer to &lt;a href="https://www.quora.com/What-are-the-best-resources-for-learning-regular-expressions/answer/Simon-Willison"&gt;What are the best resources for learning regular expressions?&lt;/a&gt; on Quora&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The O'Reilly book on Regular Expressions is absolutely superb. It will help you build a much deeper understanding of how they actually work than any online tutorial I've seen.&lt;/p&gt;

&lt;p&gt;Chapter 3 of "Text Processing in Python" is excellent too, and is available for free online:&lt;/p&gt;

&lt;span&gt;&lt;a href="http://gnosis.cx/TPiP/chap3.txt"&gt;http://gnosis.cx/TPiP/chap3.txt&lt;/a&gt;&lt;/span&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/programming"&gt;programming&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/programming-languages"&gt;programming-languages&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/regular-expressions"&gt;regular-expressions&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/quora"&gt;quora&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="programming"/><category term="programming-languages"/><category term="python"/><category term="regular-expressions"/><category term="quora"/></entry><entry><title>Escaping regular expression characters in JavaScript (updated)</title><link href="https://simonwillison.net/2010/Jul/4/escaping/#atom-tag" rel="alternate"/><published>2010-07-04T18:23:00+00:00</published><updated>2010-07-04T18:23:00+00:00</updated><id>https://simonwillison.net/2010/Jul/4/escaping/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="http://simonwillison.net/2006/Jan/20/escape/#p-6"&gt;Escaping regular expression characters in JavaScript (updated)&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
The JavaScript regular expression meta-character escaping code I posted back in 2006 has some serious flaws—I’ve just posted an update to the original post.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/escaping"&gt;escaping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/javascript"&gt;javascript&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/regular-expressions"&gt;regular-expressions&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/recovered"&gt;recovered&lt;/a&gt;&lt;/p&gt;



</summary><category term="escaping"/><category term="javascript"/><category term="regular-expressions"/><category term="recovered"/></entry><entry><title>Introduction to Surlex</title><link href="https://simonwillison.net/2010/Apr/11/surlex/#atom-tag" rel="alternate"/><published>2010-04-11T19:23:35+00:00</published><updated>2010-04-11T19:23:35+00:00</updated><id>https://simonwillison.net/2010/Apr/11/surlex/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="http://codysoyland.com/2009/sep/6/introduction-surlex/"&gt;Introduction to Surlex&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
A neat drop-in alternative for Django’s regular expression based URL parsing, providing simpler syntax for common path patterns.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/codysoyland"&gt;codysoyland&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/django"&gt;django&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/regular-expressions"&gt;regular-expressions&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/surlex"&gt;surlex&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/urls"&gt;urls&lt;/a&gt;&lt;/p&gt;



</summary><category term="codysoyland"/><category term="django"/><category term="python"/><category term="regular-expressions"/><category term="surlex"/><category term="urls"/></entry><entry><title>RE2: a principled approach to regular expression matching</title><link href="https://simonwillison.net/2010/Mar/12/re2/#atom-tag" rel="alternate"/><published>2010-03-12T09:28:21+00:00</published><updated>2010-03-12T09:28:21+00:00</updated><id>https://simonwillison.net/2010/Mar/12/re2/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="http://google-opensource.blogspot.com/2010/03/re2-principled-approach-to-regular.html"&gt;RE2: a principled approach to regular expression matching&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Google have open sourced RE2, the C++ regular expression library they developed for Google Code Search, Sawzall, Bigtable and other internal projects. Unlike PCRE it avoids the potential for exponential run time and unbounded stack usage and guarantees that searches complete in linear time, mainly by dropping support for back references.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/c-plus-plus"&gt;c-plus-plus&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/google"&gt;google&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/open-source"&gt;open-source&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pcre"&gt;pcre&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/re2"&gt;re2&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/regular-expressions"&gt;regular-expressions&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/russcox"&gt;russcox&lt;/a&gt;&lt;/p&gt;



</summary><category term="c-plus-plus"/><category term="google"/><category term="open-source"/><category term="pcre"/><category term="re2"/><category term="regular-expressions"/><category term="russcox"/></entry><entry><title>Request Routing With URI Templates in Node.JS</title><link href="https://simonwillison.net/2009/Nov/24/verbose/#atom-tag" rel="alternate"/><published>2009-11-24T09:06:22+00:00</published><updated>2009-11-24T09:06:22+00:00</updated><id>https://simonwillison.net/2009/Nov/24/verbose/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="http://www.dashdashverbose.com/2009/11/request-routing-with-uri-templates-in.html"&gt;Request Routing With URI Templates in Node.JS&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
I quite like this approach (though the implementation is a bit “this” heavy for my taste). JavaScript has no equivalent to Python’s raw strings, so regular expression based routing à la Django ends up being a bit uglier in JavaScript. URI template syntax is more appealing.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/django"&gt;django&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/javascript"&gt;javascript&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/nodejs"&gt;nodejs&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/regular-expressions"&gt;regular-expressions&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/uritemplates"&gt;uritemplates&lt;/a&gt;&lt;/p&gt;



</summary><category term="django"/><category term="javascript"/><category term="nodejs"/><category term="python"/><category term="regular-expressions"/><category term="uritemplates"/></entry><entry><title>Quoting Andrew Clover</title><link href="https://simonwillison.net/2009/Nov/16/regex/#atom-tag" rel="alternate"/><published>2009-11-16T10:32:15+00:00</published><updated>2009-11-16T10:32:15+00:00</updated><id>https://simonwillison.net/2009/Nov/16/regex/#atom-tag</id><summary type="html">
    &lt;blockquote cite="http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454"&gt;&lt;p&gt;Every time you attempt to parse HTML with regular expressions, the unholy child weeps the blood of virgins, and Russian hackers pwn your webapp. Parsing HTML with regex summons tainted souls into the realm of the living. HTML and regex go together like love, marriage, and ritual infanticide.&lt;/p&gt;&lt;/blockquote&gt;
&lt;p class="cite"&gt;&amp;mdash; &lt;a href="http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454"&gt;Andrew Clover&lt;/a&gt;&lt;/p&gt;

    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/andrew-clover"&gt;andrew-clover&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/funny"&gt;funny&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/html"&gt;html&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/parsing"&gt;parsing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/regular-expressions"&gt;regular-expressions&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/stackoverflow"&gt;stackoverflow&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/xhtml"&gt;xhtml&lt;/a&gt;&lt;/p&gt;



</summary><category term="andrew-clover"/><category term="funny"/><category term="html"/><category term="parsing"/><category term="regular-expressions"/><category term="stackoverflow"/><category term="xhtml"/></entry><entry><title>Django security updates released</title><link href="https://simonwillison.net/2009/Oct/10/django/#atom-tag" rel="alternate"/><published>2009-10-10T00:24:59+00:00</published><updated>2009-10-10T00:24:59+00:00</updated><id>https://simonwillison.net/2009/Oct/10/django/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="http://www.djangoproject.com/weblog/2009/oct/09/security/"&gt;Django security updates released&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
A potential denial of service vulnerability has been discovered in the regular expressions used by the EmailField and URLField in Django’s form library: a malicious input could trigger pathological performance. Patches (and patched releases) for Django 1.1 and Django 1.0 have been published.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/denial-of-service"&gt;denial-of-service&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/django"&gt;django&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/regular-expressions"&gt;regular-expressions&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/security"&gt;security&lt;/a&gt;&lt;/p&gt;



</summary><category term="denial-of-service"/><category term="django"/><category term="python"/><category term="regular-expressions"/><category term="security"/></entry><entry><title>Introducing Yardbird</title><link href="https://simonwillison.net/2009/May/22/yardbird/#atom-tag" rel="alternate"/><published>2009-05-22T23:13:39+00:00</published><updated>2009-05-22T23:13:39+00:00</updated><id>https://simonwillison.net/2009/May/22/yardbird/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="http://zork.net/motd/nick/django/introducing-yardbird.html"&gt;Introducing Yardbird&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
I absolutely love it: an IRC bot built on top of Twisted that passes incoming messages off to Django code running in a separate thread. Request and Response objects are used to represent incoming and outgoing messages, and Django’s regex-based URL routing is used to dispatch messages to different handling functions based on their content.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/django"&gt;django&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/irc"&gt;irc&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/regular-expressions"&gt;regular-expressions&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/threads"&gt;threads&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/twisted"&gt;twisted&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/yardbird"&gt;yardbird&lt;/a&gt;&lt;/p&gt;



</summary><category term="django"/><category term="irc"/><category term="regular-expressions"/><category term="threads"/><category term="twisted"/><category term="yardbird"/></entry><entry><title>Python gems of my own</title><link href="https://simonwillison.net/2008/Nov/3/python/#atom-tag" rel="alternate"/><published>2008-11-03T11:59:48+00:00</published><updated>2008-11-03T11:59:48+00:00</updated><id>https://simonwillison.net/2008/Nov/3/python/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="http://ericholscher.com/blog/2008/nov/3/python-gems-my-own/"&gt;Python gems of my own&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Did you know you can pass 128 as a flag to Python’s re.compile() function to spit out a parse tree? I didn’t. &lt;code&gt;re.compile("pattern", 128)&lt;/code&gt;
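&lt;p&gt;A small sketch of the same trick using the named constant instead of the bare number. In CPython the integer 128 is the value of &lt;code&gt;re.DEBUG&lt;/code&gt;; the pattern here is arbitrary and only for illustration.&lt;/p&gt;

```python
import re

# 128 is the integer value behind the named flag re.DEBUG.
debug_flag = int(re.DEBUG)
print(debug_flag)  # 128

# Compiling with the flag prints the parsed pattern to standard output.
compiled = re.compile("ab+c", re.DEBUG)

# The compiled pattern still behaves like any other.
print(compiled.match("abbc") is not None)  # True
```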

    &lt;p&gt;&lt;small&gt;Via &lt;a href="http://www.eflorenzano.com/blog/post/gems-python/"&gt;Eric Florenzano&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/eric-holscher"&gt;eric-holscher&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/regular-expressions"&gt;regular-expressions&lt;/a&gt;&lt;/p&gt;



</summary><category term="eric-holscher"/><category term="python"/><category term="regular-expressions"/></entry><entry><title>Primality regex</title><link href="https://simonwillison.net/2007/Mar/18/perl/#atom-tag" rel="alternate"/><published>2007-03-18T01:17:45+00:00</published><updated>2007-03-18T01:17:45+00:00</updated><id>https://simonwillison.net/2007/Mar/18/perl/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="http://montreal.pm.org/tech/neil_kandalgaonkar.shtml"&gt;Primality regex&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
A regular expression that can identify prime numbers. Unsurprisingly, this one comes from the Perl community.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/perl"&gt;perl&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/primenumbers"&gt;primenumbers&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/regular-expressions"&gt;regular-expressions&lt;/a&gt;&lt;/p&gt;



</summary><category term="perl"/><category term="primenumbers"/><category term="regular-expressions"/></entry><entry><title>Wrapping Text With Regular Expressions</title><link href="https://simonwillison.net/2006/Jun/29/wrapping/#atom-tag" rel="alternate"/><published>2006-06-29T11:12:51+00:00</published><updated>2006-06-29T11:12:51+00:00</updated><id>https://simonwillison.net/2006/Jun/29/wrapping/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="http://macromates.com/blog/archives/2006/06/28/wrapping-text-with-regular-expressions/"&gt;Wrapping Text With Regular Expressions&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Neat regexp trick.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/regular-expressions"&gt;regular-expressions&lt;/a&gt;&lt;/p&gt;



</summary><category term="regular-expressions"/></entry><entry><title>Escaping regular expression characters in JavaScript</title><link href="https://simonwillison.net/2006/Jan/20/escape/#atom-tag" rel="alternate"/><published>2006-01-20T12:19:13+00:00</published><updated>2006-01-20T12:19:13+00:00</updated><id>https://simonwillison.net/2006/Jan/20/escape/#atom-tag</id><summary type="html">
    &lt;p id="p-0"&gt;JavaScript's support for regular expressions is generally pretty good, but there is one notable omission: an escaping mechanism for literal strings. Say for example you need to create a regular expression that removes a specific string from the end of a string. If you know the string you want to remove when you write the script this is easy:&lt;/p&gt;

&lt;div class="highlight highlight-source-js"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;var&lt;/span&gt; &lt;span class="pl-s1"&gt;newString&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;oldString&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;replace&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-pds"&gt;&lt;span class="pl-c1"&gt;/&lt;/span&gt;&lt;span class="pl-s"&gt;R&lt;/span&gt;&lt;span class="pl-s"&gt;e&lt;/span&gt;&lt;span class="pl-s"&gt;m&lt;/span&gt;&lt;span class="pl-s"&gt;o&lt;/span&gt;&lt;span class="pl-s"&gt;v&lt;/span&gt;&lt;span class="pl-s"&gt;e&lt;/span&gt;&lt;span class="pl-s"&gt; &lt;/span&gt;&lt;span class="pl-s"&gt;f&lt;/span&gt;&lt;span class="pl-s"&gt;r&lt;/span&gt;&lt;span class="pl-s"&gt;o&lt;/span&gt;&lt;span class="pl-s"&gt;m&lt;/span&gt;&lt;span class="pl-s"&gt; &lt;/span&gt;&lt;span class="pl-s"&gt;e&lt;/span&gt;&lt;span class="pl-s"&gt;n&lt;/span&gt;&lt;span class="pl-s"&gt;d&lt;/span&gt;&lt;span class="pl-cce"&gt;$&lt;/span&gt;&lt;span class="pl-c1"&gt;/&lt;/span&gt;&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-s"&gt;''&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p id="p-1"&gt;But what if the string to be removed comes from a variable? You'll need to construct a regular expression from the variable, using the RegExp constructor function:&lt;/p&gt;

&lt;div class="highlight highlight-source-js"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;var&lt;/span&gt; &lt;span class="pl-s1"&gt;re&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-k"&gt;new&lt;/span&gt; &lt;span class="pl-v"&gt;RegExp&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s1"&gt;stringToRemove&lt;/span&gt; &lt;span class="pl-c1"&gt;+&lt;/span&gt; &lt;span class="pl-s"&gt;'$'&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
&lt;span class="pl-k"&gt;var&lt;/span&gt; &lt;span class="pl-s1"&gt;newString&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;oldString&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;replace&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s1"&gt;re&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-s"&gt;''&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p id="p-2"&gt;But what if the string you want to remove may contain regular expression metacharacters - characters like $ or . that affect the behaviour of the expression? Languages such as Python provide functions for escaping these characters (see &lt;a href="https://docs.python.org/2/library/re.html#re.escape" title="Python re module contents"&gt;re.escape&lt;/a&gt;); with JavaScript you have to write your own.&lt;/p&gt;
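&lt;p&gt;For comparison, a minimal Python sketch of the same suffix-removal task using &lt;code&gt;re.escape()&lt;/code&gt;. The function and variable names are illustrative only.&lt;/p&gt;

```python
import re


def remove_suffix(old_string, string_to_remove):
    # re.escape() backslash-escapes metacharacters such as $ and . so the
    # text is matched literally; the trailing "$" anchors it to the end.
    pattern = re.compile(re.escape(string_to_remove) + "$")
    return pattern.sub("", old_string)


print(remove_suffix("Total: $5.00", "$5.00"))  # prints "Total: "

# Without escaping, "." would match any character; with it, "x" is not a match.
print(remove_suffix("Total: $5x00", "$5.00"))  # prints "Total: $5x00"
```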

&lt;p id="p-3"&gt;Here's mine:&lt;/p&gt;

&lt;div class="highlight highlight-source-js"&gt;&lt;pre&gt;&lt;span class="pl-v"&gt;RegExp&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;escape&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-k"&gt;function&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s1"&gt;text&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt; &lt;span class="pl-kos"&gt;{&lt;/span&gt;
  &lt;span class="pl-k"&gt;if&lt;/span&gt; &lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-c1"&gt;!&lt;/span&gt;&lt;span class="pl-smi"&gt;arguments&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;callee&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;sRE&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt; &lt;span class="pl-kos"&gt;{&lt;/span&gt;
    &lt;span class="pl-k"&gt;var&lt;/span&gt; &lt;span class="pl-s1"&gt;specials&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-kos"&gt;[&lt;/span&gt;
      &lt;span class="pl-s"&gt;'/'&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-s"&gt;'.'&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-s"&gt;'*'&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-s"&gt;'+'&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-s"&gt;'?'&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-s"&gt;'|'&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt;
      &lt;span class="pl-s"&gt;'('&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-s"&gt;')'&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-s"&gt;'['&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-s"&gt;']'&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-s"&gt;'{'&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-s"&gt;'}'&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-s"&gt;'\\'&lt;/span&gt;
    &lt;span class="pl-kos"&gt;]&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
    &lt;span class="pl-smi"&gt;arguments&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;callee&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;sRE&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-k"&gt;new&lt;/span&gt; &lt;span class="pl-v"&gt;RegExp&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;
      &lt;span class="pl-s"&gt;'(\\'&lt;/span&gt; &lt;span class="pl-c1"&gt;+&lt;/span&gt; &lt;span class="pl-s1"&gt;specials&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;join&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s"&gt;'|\\'&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt; &lt;span class="pl-c1"&gt;+&lt;/span&gt; &lt;span class="pl-s"&gt;')'&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-s"&gt;'g'&lt;/span&gt;
    &lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
  &lt;span class="pl-kos"&gt;}&lt;/span&gt;
  &lt;span class="pl-k"&gt;return&lt;/span&gt; &lt;span class="pl-s1"&gt;text&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;replace&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-smi"&gt;arguments&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;callee&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;sRE&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-s"&gt;'\\$1'&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
&lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p id="p-4"&gt;This deals with another common problem in JavaScript: compiling a regular expression once (rather than every time you use it) while keeping it local to a function. &lt;code&gt;argmuments.callee&lt;/code&gt; inside a function always refers to the function itself, and since JavaScript functions are objects you can store properties on them. In this case, the first time the function is run it compiles a regular expression and stashes it in the sRE property. On subsequent calls the pre-compiled expression can be reused.&lt;/p&gt;

&lt;p id="p-5"&gt;In the above snippet I've added my function as a property of the &lt;code&gt;RegExp&lt;/code&gt; constructor. There's no pressing reason to do this other than a desire to keep generic functionality relating to regular expression handling the same place. If you rename the function it will still work as expected, since the use of &lt;code&gt;arguments.callee&lt;/code&gt; eliminates any coupling between the function definition and the rest of the code.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Update 18th Feb 2025&lt;/strong&gt;: 19 years after I published this, &lt;code&gt;RegExp.escape()&lt;/code&gt; has &lt;a href="https://simonwillison.net/2025/Feb/18/tc39proposal-regex-escaping/"&gt;made it into the language&lt;/a&gt;!&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/escaping"&gt;escaping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/javascript"&gt;javascript&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/regular-expressions"&gt;regular-expressions&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="escaping"/><category term="javascript"/><category term="regular-expressions"/></entry><entry><title>Lexical Analysis, Python-style</title><link href="https://simonwillison.net/2005/Apr/28/lexical/#atom-tag" rel="alternate"/><published>2005-04-28T15:02:50+00:00</published><updated>2005-04-28T15:02:50+00:00</updated><id>https://simonwillison.net/2005/Apr/28/lexical/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="http://jason.diamond.name/weblog/2005/04/26/lexical-analysis-python-style"&gt;Lexical Analysis, Python-style&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Clever trick using named groups in regular expressions.

    &lt;p&gt;&lt;small&gt;&lt;/small&gt;Via &lt;a href="http://www.pythonware.com/daily/"&gt;Daily Python-URL&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/regular-expressions"&gt;regular-expressions&lt;/a&gt;&lt;/p&gt;



</summary><category term="regular-expressions"/></entry><entry><title>Unicode Regular Expressions</title><link href="https://simonwillison.net/2004/Mar/3/unicode/#atom-tag" rel="alternate"/><published>2004-03-03T03:06:47+00:00</published><updated>2004-03-03T03:06:47+00:00</updated><id>https://simonwillison.net/2004/Mar/3/unicode/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="http://www.unicode.org/reports/tr18/tr18-9.html"&gt;Unicode Regular Expressions&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
[a \U00010450] Match “a” or U+10450 SHAVIAN LETTER PEEP


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/regular-expressions"&gt;regular-expressions&lt;/a&gt;&lt;/p&gt;



</summary><category term="regular-expressions"/></entry><entry><title>Brief Guide to Regular Expressions</title><link href="https://simonwillison.net/2003/Dec/4/brief/#atom-tag" rel="alternate"/><published>2003-12-04T01:48:15+00:00</published><updated>2003-12-04T01:48:15+00:00</updated><id>https://simonwillison.net/2003/Dec/4/brief/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="http://textartisan.com/articles/regex.html"&gt;Brief Guide to Regular Expressions&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Regular expressions for everyone else

    &lt;p&gt;&lt;small&gt;&lt;/small&gt;Via &lt;a href="http://www.yarinareth.net/caveatlector/archive/week_2003_11_30.html#e002415"&gt;And suddenly itâ€™s a good day&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/regular-expressions"&gt;regular-expressions&lt;/a&gt;&lt;/p&gt;



</summary><category term="regular-expressions"/></entry><entry><title>Capturing the power of re.split</title><link href="https://simonwillison.net/2003/Oct/26/reSplit/#atom-tag" rel="alternate"/><published>2003-10-26T03:01:38+00:00</published><updated>2003-10-26T03:01:38+00:00</updated><id>https://simonwillison.net/2003/Oct/26/reSplit/#atom-tag</id><summary type="html">
    &lt;p&gt;A couple of Python tips. The first is really a tip for Mozilla/Firebird: You can set up a &lt;a href="http://www.mozilla.org/docs/end-user/keywords.html"&gt;Custom Keyword&lt;/a&gt; for instantly accessing Python module documentation using the string &lt;samp&gt;www.python.org/doc/current/lib/module-%s.html&lt;/samp&gt; - I have this set up as &lt;samp&gt;pydoc&lt;/samp&gt;, so I can type &lt;samp&gt;pydoc re&lt;/samp&gt; to jump straight to the re module documentation. I only set it up half an hour ago and I've already used it about a dozen times.&lt;/p&gt;

&lt;p&gt;The second tip is so powerful I've been kicking myself for not finding out about it sooner. It relates to the regular expression module's &lt;code class="python"&gt;re.split()&lt;/code&gt; function. Just like &lt;code&gt;string.split()&lt;/code&gt;, this lets you split up a string based on a certain token. With &lt;code class="python"&gt;string.split()&lt;/code&gt; the token you split on isn't included in the resulting list:&lt;/p&gt;

&lt;pre&gt;&lt;code class="python"&gt;&amp;gt;&amp;gt;&amp;gt; 'pipe|separated|values'.split('|')
['pipe', 'separated', 'values']
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This is also true of &lt;code class="python"&gt;re.split&lt;/code&gt;:&lt;/p&gt;

&lt;pre&gt;&lt;code class="python"&gt;&amp;gt;&amp;gt;&amp;gt; splitter = re.compile('&amp;lt;.&amp;gt;')
&amp;gt;&amp;gt;&amp;gt; splitter.split('hi&amp;lt;a&amp;gt;there&amp;lt;b&amp;gt;from&amp;lt;c&amp;gt;python')
['hi', 'there', 'from', 'python']
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Here's the magic part though. If you put part or all of the regular expression in parentheses, the separating tokens get included in the resulting list:&lt;/p&gt;

&lt;pre&gt;&lt;code class="python"&gt;&amp;gt;&amp;gt;&amp;gt; splitter = re.compile('(&amp;lt;.&amp;gt;)')
&amp;gt;&amp;gt;&amp;gt; splitter.split('hi&amp;lt;a&amp;gt;there&amp;lt;b&amp;gt;from&amp;lt;c&amp;gt;python')
['hi', '&amp;lt;a&amp;gt;', 'there', '&amp;lt;b&amp;gt;', 'from', '&amp;lt;c&amp;gt;', 'python']
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Why is this a big deal? Because it suddenly makes writing simple parsers and tokenisers a whole heck of a lot easier. Using the above example, say you wanted to do something with each of the &lt;samp&gt;&amp;lt;?&amp;gt;&lt;/samp&gt; style tags. You can just iterate through the resulting list identifying each tag using the regular expression you've already compiled and then altering just those list items, before joining the whole list back together again at the end.&lt;/p&gt;
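
&lt;p&gt;As a rough sketch of that loop: the snippet below assumes a made-up single-character &lt;code&gt;{x}&lt;/code&gt; tag syntax in place of the angle-bracket tags above, and the &lt;code&gt;shout_tags&lt;/code&gt; helper is a name invented for illustration. It splits with a capturing group, alters only the list items that are tags, then joins everything back together.&lt;/p&gt;

```python
import re

# Capturing group: re.split() keeps each matched tag in the result list.
# The {x} tag syntax is an assumption made for this sketch.
splitter = re.compile(r'(\{.\})')


def shout_tags(text):
    """Upper-case every single-character {x} tag, leaving other text alone."""
    parts = splitter.split(text)
    for i, part in enumerate(parts):
        # fullmatch() tells a tag apart from the text between tags.
        if splitter.fullmatch(part):
            parts[i] = part.upper()
    return ''.join(parts)


print(shout_tags('hi{a}there{b}from{c}python'))  # hi{A}there{B}from{C}python
```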

&lt;p&gt;Simple parsing and replacement of easily identified tags can already be achieved using the &lt;code class="python"&gt;re.sub()&lt;/code&gt; method, which allows you to provide a callback function to process each matching token. The difference with using &lt;code class="python"&gt;re.split()&lt;/code&gt; is that you can easily take into account the &lt;em&gt;order&lt;/em&gt; of the tokens, allowing you to build systems that can use special tags to define areas of documents without getting confused by nesting tag sets. As a simple example, you could build a basic event-based &lt;acronym title="eXtensible Markup Language"&gt;XML&lt;/acronym&gt; parser using just a couple of expressions. In fact, I discovered this technique while examining the source code for the &lt;a href="http://sourceforge.net/projects/tinpy/"&gt;tinpy&lt;/a&gt; tiny Python template module, which gives a clue to why I'm so interested in it.&lt;/p&gt;
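
&lt;p&gt;To illustrate the point about order and nesting, here is a minimal sketch of an event-style tokeniser. It is not how tinpy works and it is not a real XML parser: the &lt;code&gt;{name}&lt;/code&gt; and &lt;code&gt;{/name}&lt;/code&gt; tag syntax, the &lt;code&gt;events&lt;/code&gt; function and the event names are all assumptions made for the example. Because &lt;code&gt;re.split()&lt;/code&gt; returns text and tags in document order, a plain list used as a stack is enough to check that regions are nested correctly.&lt;/p&gt;

```python
import re

# Hypothetical tag syntax: {name} opens a region and {/name} closes it.
TAG = re.compile(r'(\{/?\w+\})')


def events(text):
    """Yield ('start', name), ('end', name) and ('text', s) in document order."""
    stack = []  # names of the regions that are currently open
    for part in TAG.split(text):
        if not TAG.fullmatch(part):
            if part:  # re.split() yields '' between adjacent tags
                yield ('text', part)
        elif part.startswith('{/'):
            name = part[2:-1]
            # A closing tag must match the most recently opened region.
            if not stack or stack.pop() != name:
                raise ValueError('unexpected closing tag: ' + part)
            yield ('end', name)
        else:
            name = part[1:-1]
            stack.append(name)
            yield ('start', name)
    if stack:
        raise ValueError('unclosed tag: ' + stack[-1])


for event in events('a{b}c{i}d{/i}{/b}e'):
    print(event)
```

&lt;p&gt;The same job is awkward with a &lt;code class="python"&gt;re.sub()&lt;/code&gt; callback, which sees each tag but not the text between tags.&lt;/p&gt;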

&lt;p&gt;Having discovered this feature in Python, I just had to see if it existed in other languages as well. Unsurprisingly it does; &lt;acronym title="PHP: Hypertext Preprocessor"&gt;PHP&lt;/acronym&gt;'s &lt;a href="http://www.php.net/preg_split"&gt;preg_split&lt;/a&gt; offers an optional &lt;code&gt;PREG_SPLIT_DELIM_CAPTURE&lt;/code&gt; flag (added in &lt;acronym title="PHP: Hypertext Preprocessor"&gt;PHP&lt;/acronym&gt; 4.0.5) and JavaScript has similar behaviour to Python, including the splitting token if it is wrapped in parentheses.&lt;/p&gt;

&lt;p&gt;I'm probably the last person to find out about this, but it's such a useful technique I felt I just had to share it with the world.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/regular-expressions"&gt;regular-expressions&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="regular-expressions"/></entry></feed>