<?xml version="1.0" encoding="utf-8"?>
<feed xml:lang="en-us" xmlns="http://www.w3.org/2005/Atom"><title>Simon Willison's Weblog: political-hacking</title><link href="http://simonwillison.net/" rel="alternate"/><link href="http://simonwillison.net/tags/political-hacking.atom" rel="self"/><id>http://simonwillison.net/</id><updated>2025-08-19T20:40:50+00:00</updated><author><name>Simon Willison</name></author><entry><title>XSLT on congress.gov</title><link href="https://simonwillison.net/2025/Aug/19/xslt/#atom-tag" rel="alternate"/><published>2025-08-19T20:40:50+00:00</published><updated>2025-08-19T20:40:50+00:00</updated><id>https://simonwillison.net/2025/Aug/19/xslt/#atom-tag</id><summary type="html">
    &lt;p&gt;Today I learned - via &lt;a href="https://github.com/whatwg/html/pull/11563"&gt;a proposal to remove mentions of XSLT from the HTML spec&lt;/a&gt; - that &lt;code&gt;congress.gov&lt;/code&gt; uses XSLT to serve XML bills as XHTML - here's &lt;a href="https://www.congress.gov/117/bills/hr3617/BILLS-117hr3617ih.xml"&gt;H. R. 3617 117th CONGRESS 1st Session&lt;/a&gt; for example.&lt;/p&gt;
&lt;p&gt;View source on that page and it starts like this:&lt;/p&gt;
&lt;pre&gt;&amp;lt;?&lt;span class="pl-ent"&gt;xml&lt;/span&gt;&lt;span class="pl-e"&gt; version&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;1.0&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;?&amp;gt;
&amp;lt;?&lt;span class="pl-ent"&gt;xml-stylesheet&lt;/span&gt;&lt;span class="pl-e"&gt; type&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;text/xsl&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;&lt;span class="pl-e"&gt; href&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;billres.xsl&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;?&amp;gt;
&amp;lt;!&lt;span class="pl-ent"&gt;DOCTYPE&lt;/span&gt; &lt;span class="pl-e"&gt;bill&lt;/span&gt; PUBLIC "-//US Congress//DTDs/bill.dtd//EN" "bill.dtd"&amp;gt;
&amp;lt;&lt;span class="pl-ent"&gt;bill&lt;/span&gt; &lt;span class="pl-e"&gt;bill-stage&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Introduced-in-House&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;dms-id&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;H5BD50AB7712141319B352D46135AAC2B&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;public-private&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;public&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;key&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;H&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-e"&gt;bill-type&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;olc&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;&amp;gt; 
&amp;lt;&lt;span class="pl-ent"&gt;metadata&lt;/span&gt; &lt;span class="pl-e"&gt;xmlns&lt;/span&gt;&lt;span class="pl-e"&gt;:&lt;/span&gt;&lt;span class="pl-e"&gt;dc&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;http://purl.org/dc/elements/1.1/&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;&amp;gt;
&amp;lt;&lt;span class="pl-ent"&gt;dublinCore&lt;/span&gt;&amp;gt;
&amp;lt;&lt;span class="pl-ent"&gt;dc&lt;/span&gt;&lt;span class="pl-ent"&gt;:&lt;/span&gt;&lt;span class="pl-ent"&gt;title&lt;/span&gt;&amp;gt;117 HR 3617 IH: Marijuana Opportunity Reinvestment and Expungement Act of 2021&amp;lt;/&lt;span class="pl-ent"&gt;dc&lt;/span&gt;&lt;span class="pl-ent"&gt;:&lt;/span&gt;&lt;span class="pl-ent"&gt;title&lt;/span&gt;&amp;gt;
&amp;lt;&lt;span class="pl-ent"&gt;dc&lt;/span&gt;&lt;span class="pl-ent"&gt;:&lt;/span&gt;&lt;span class="pl-ent"&gt;publisher&lt;/span&gt;&amp;gt;U.S. House of Representatives&amp;lt;/&lt;span class="pl-ent"&gt;dc&lt;/span&gt;&lt;span class="pl-ent"&gt;:&lt;/span&gt;&lt;span class="pl-ent"&gt;publisher&lt;/span&gt;&amp;gt;
&amp;lt;&lt;span class="pl-ent"&gt;dc&lt;/span&gt;&lt;span class="pl-ent"&gt;:&lt;/span&gt;&lt;span class="pl-ent"&gt;date&lt;/span&gt;&amp;gt;2021-05-28&amp;lt;/&lt;span class="pl-ent"&gt;dc&lt;/span&gt;&lt;span class="pl-ent"&gt;:&lt;/span&gt;&lt;span class="pl-ent"&gt;date&lt;/span&gt;&amp;gt;
&amp;lt;&lt;span class="pl-ent"&gt;dc&lt;/span&gt;&lt;span class="pl-ent"&gt;:&lt;/span&gt;&lt;span class="pl-ent"&gt;format&lt;/span&gt;&amp;gt;text/xml&amp;lt;/&lt;span class="pl-ent"&gt;dc&lt;/span&gt;&lt;span class="pl-ent"&gt;:&lt;/span&gt;&lt;span class="pl-ent"&gt;format&lt;/span&gt;&amp;gt;
&amp;lt;&lt;span class="pl-ent"&gt;dc&lt;/span&gt;&lt;span class="pl-ent"&gt;:&lt;/span&gt;&lt;span class="pl-ent"&gt;language&lt;/span&gt;&amp;gt;EN&amp;lt;/&lt;span class="pl-ent"&gt;dc&lt;/span&gt;&lt;span class="pl-ent"&gt;:&lt;/span&gt;&lt;span class="pl-ent"&gt;language&lt;/span&gt;&amp;gt;
&amp;lt;&lt;span class="pl-ent"&gt;dc&lt;/span&gt;&lt;span class="pl-ent"&gt;:&lt;/span&gt;&lt;span class="pl-ent"&gt;rights&lt;/span&gt;&amp;gt;Pursuant to Title 17 Section 105 of the United States Code, this file is not subject to copyright protection and is in the public domain.&amp;lt;/&lt;span class="pl-ent"&gt;dc&lt;/span&gt;&lt;span class="pl-ent"&gt;:&lt;/span&gt;&lt;span class="pl-ent"&gt;rights&lt;/span&gt;&amp;gt;
&amp;lt;/&lt;span class="pl-ent"&gt;dublinCore&lt;/span&gt;&amp;gt;
&amp;lt;/&lt;span class="pl-ent"&gt;metadata&lt;/span&gt;&amp;gt;
&amp;lt;&lt;span class="pl-ent"&gt;form&lt;/span&gt;&amp;gt;
&amp;lt;&lt;span class="pl-ent"&gt;distribution-code&lt;/span&gt; &lt;span class="pl-e"&gt;display&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;yes&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;&amp;gt;I&amp;lt;/&lt;span class="pl-ent"&gt;distribution-code&lt;/span&gt;&amp;gt; 
&amp;lt;&lt;span class="pl-ent"&gt;congress&lt;/span&gt; &lt;span class="pl-e"&gt;display&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;yes&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;&amp;gt;117th CONGRESS&amp;lt;/&lt;span class="pl-ent"&gt;congress&lt;/span&gt;&amp;gt;&amp;lt;&lt;span class="pl-ent"&gt;session&lt;/span&gt; &lt;span class="pl-e"&gt;display&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;yes&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;&amp;gt;1st Session&amp;lt;/&lt;span class="pl-ent"&gt;session&lt;/span&gt;&amp;gt; 
&amp;lt;&lt;span class="pl-ent"&gt;legis-num&lt;/span&gt; &lt;span class="pl-e"&gt;display&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;yes&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;&amp;gt;H. R. 3617&amp;lt;/&lt;span class="pl-ent"&gt;legis-num&lt;/span&gt;&amp;gt; 
&amp;lt;&lt;span class="pl-ent"&gt;current-chamber&lt;/span&gt;&amp;gt;IN THE HOUSE OF REPRESENTATIVES&amp;lt;/&lt;span class="pl-ent"&gt;current-chamber&lt;/span&gt;&amp;gt;&lt;/pre&gt;

&lt;p&gt;Digging into those XSLT stylesheets leads to &lt;code&gt;billres-details.xsl&lt;/code&gt; - &lt;a href="https://gist.github.com/simonw/64c9f172533203c09acbcf13a0bb67c4"&gt;gist copy here&lt;/a&gt; - which starts with a huge changelog comment with notes dating all the way back to 2004!&lt;/p&gt;

    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/html"&gt;html&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/political-hacking"&gt;political-hacking&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/web-standards"&gt;web-standards&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/xslt"&gt;xslt&lt;/a&gt;&lt;/p&gt;



</summary><category term="html"/><category term="political-hacking"/><category term="web-standards"/><category term="xslt"/></entry><entry><title>18f.org</title><link href="https://simonwillison.net/2025/Mar/2/18forg/#atom-tag" rel="alternate"/><published>2025-03-02T09:24:37+00:00</published><updated>2025-03-02T09:24:37+00:00</updated><id>https://simonwillison.net/2025/Mar/2/18forg/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://18f.org/"&gt;18f.org&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
New site by members of 18F, the team within the US government that were doing some of the most effective work at improving government efficiency.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;For over 11 years, 18F has been proudly serving you to make government technology work better. We are non-partisan civil servants. 18F has worked on hundreds of projects, all designed to make government technology not just efficient but effective, and to save money for American taxpayers.&lt;/p&gt;
&lt;p&gt;However, all employees at 18F – a group that the Trump Administration GSA Technology Transformation Services Director called "the gold standard" of civic tech – were terminated today at midnight ET.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;18F was doing exactly the type of work that DOGE claims to want – yet we were eliminated.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The entire team is now on "administrative leave" and locked out of their computers.&lt;/p&gt;
&lt;p&gt;But these are not the kind of civil servants to abandon their mission without a fight:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;We’re not done yet.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;We’re still absorbing what has happened. We’re wrestling with what it will mean for ourselves and our families, as well as the impact on our partners and the American people.&lt;/p&gt;
&lt;p&gt;But we came to the government to fix things. And we’re not done with this work yet.&lt;/p&gt;
&lt;p&gt;More to come.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;You can &lt;a href="https://bsky.app/profile/team18f.bsky.social"&gt;follow @team18f.bsky.social&lt;/a&gt; on Bluesky.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/government"&gt;government&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/political-hacking"&gt;political-hacking&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/politics"&gt;politics&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/bluesky"&gt;bluesky&lt;/a&gt;&lt;/p&gt;



</summary><category term="government"/><category term="political-hacking"/><category term="politics"/><category term="bluesky"/></entry><entry><title>Private School Labeler on Bluesky</title><link href="https://simonwillison.net/2024/Nov/22/private-school-labeler-on-bluesky/#atom-tag" rel="alternate"/><published>2024-11-22T17:44:34+00:00</published><updated>2024-11-22T17:44:34+00:00</updated><id>https://simonwillison.net/2024/Nov/22/private-school-labeler-on-bluesky/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://bsky.app/profile/daddys.cash"&gt;Private School Labeler on Bluesky&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
I am utterly delighted by this subversive use of Bluesky's &lt;a href="https://docs.bsky.app/docs/advanced-guides/moderation"&gt;labels feature&lt;/a&gt;, which allows you to subscribe to a custom application that then adds visible labels to profiles.&lt;/p&gt;
&lt;p&gt;The feature was designed for moderation, but this labeler subverts it by displaying labels on accounts belonging to British public figures showing which expensive private school they went to and what the current fees are for that school.&lt;/p&gt;
&lt;p&gt;Here's what it looks like on an account - tapping the label brings up the information about the fees:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Screenshot of a social media profile and post. Profile shows &amp;quot;James O'Brien @mrjamesob.bsky.social&amp;quot; with 166.7K followers, 531 following, 183 posts. Bio reads &amp;quot;Broadcaster &amp;amp; author.&amp;quot; Shows education at Ampleforth School and Private School. Contains a repost from Julia Hines about Rabbi Jeffrey, followed by a label showing &amp;quot;Ampleforth School £46,740/year (2024/2025). This label was applied by Private School Labeller" src="https://static.simonwillison.net/static/2024/bluesky-label.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;These labels are only visible to users who have deliberately subscribed to the labeler. Unsurprisingly, some of those labeled aren't too happy about it!&lt;/p&gt;
&lt;p&gt;In response to a comment about attending on a scholarship, the label creator &lt;a href="https://bsky.app/profile/daddys.cash/post/3lbl43ifho22n"&gt;said&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;I'm explicit with the labeller that scholarship pupils, grant pupils, etc, are still included - because it's the later effects that are useful context - students from these schools get a leg up and a degree of privilege, which contributes eg to the overrepresentation in British media/politics&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;On the one hand, there are clearly opportunities for abuse here. But given the opt-in nature of the labelers, this doesn't feel hugely different to someone creating a separate webpage full of information about Bluesky profiles.&lt;/p&gt;
&lt;p&gt;I'm intrigued by the possibilities of labelers. There's a list of others on &lt;a href="https://www.bluesky-labelers.io/"&gt;bluesky-labelers.io&lt;/a&gt;, including another brilliant hack: &lt;a href="https://bsky.app/profile/did:plc:w6yx4bltuzdmiolooi4kd6zt"&gt;Bookmarks&lt;/a&gt;, which lets you "report" a post to the labeler and then displays those reported posts in a custom feed - providing a private bookmarks feature that Bluesky itself currently lacks.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Update:&lt;/strong&gt; &lt;a href="https://bsky.app/profile/us-gov-funding.bsky.social"&gt;@us-gov-funding.bsky.social&lt;/a&gt; is the inevitable labeler for US politicians showing which companies and industries are their top donors, built &lt;a href="https://bsky.app/profile/hipstersmoothie.com/post/3lbl2lgnq7c2f"&gt;by Andrew Lisowski&lt;/a&gt; (&lt;a href="https://github.com/hipstersmoothie/us-gov-contributions-labeler"&gt;source code here&lt;/a&gt;) using data sourced from &lt;a href="https://www.opensecrets.org/"&gt;OpenScrets&lt;/a&gt;. Here's what it looks like on &lt;a href="https://bsky.app/profile/senatorschumer.bsky.social/post/3lbkvtdc5ik2z"&gt;this post&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Post by Chuck Schumer. Labels show affiliated organizations: Citigroup Inc, Goldman Sachs, Lawyers/Law Firms, Paul, Weiss et al, Real Estate, Securities &amp;amp; Investment. Post text reads &amp;quot;Democracy is in serious trouble, but it's not dead. We all have power, and we can use it together to defend our freedoms.&amp;quot;" src="https://static.simonwillison.net/static/2024/chuck-label.jpg" /&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/apis"&gt;apis&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/moderation"&gt;moderation&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/political-hacking"&gt;political-hacking&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/politics"&gt;politics&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/bluesky"&gt;bluesky&lt;/a&gt;&lt;/p&gt;



</summary><category term="apis"/><category term="moderation"/><category term="political-hacking"/><category term="politics"/><category term="bluesky"/></entry><entry><title>Project: Civic Band - scraping and searching PDF meeting minutes from hundreds of municipalities</title><link href="https://simonwillison.net/2024/Nov/16/civic-band/#atom-tag" rel="alternate"/><published>2024-11-16T22:14:01+00:00</published><updated>2024-11-16T22:14:01+00:00</updated><id>https://simonwillison.net/2024/Nov/16/civic-band/#atom-tag</id><summary type="html">
    &lt;p&gt;I interviewed &lt;a href="https://phildini.dev/"&gt;Philip James&lt;/a&gt; about &lt;a href="https://civic.band/"&gt;Civic Band&lt;/a&gt;, his "slowly growing collection of databases of the minutes from civic governments". Philip demonstrated the site and talked through his pipeline for scraping and indexing meeting minutes from many different local government authorities around the USA.&lt;/p&gt;

&lt;iframe style="margin-top: 1.5em; margin-bottom: 1.5em;" width="560" height="315" src="https://www.youtube-nocookie.com/embed/OziYd7xcGzc" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen="allowfullscreen"&gt; &lt;/iframe&gt;

&lt;p&gt;We recorded this conversation as part of yesterday's Datasette Public Office Hours session.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2024/Nov/16/civic-band/#civic-band"&gt;Civic Band&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2024/Nov/16/civic-band/#the-technical-stack"&gt;The technical stack&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2024/Nov/16/civic-band/#scale-and-storage"&gt;Scale and storage&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2024/Nov/16/civic-band/#future-plans"&gt;Future plans&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h4 id="civic-band"&gt;Civic Band&lt;/h4&gt;
&lt;p&gt;Philip was inspired to start thinking more about local government after the 2016 US election. He realised that there was a huge amount of information about decisions made by local authorities tucked away in their meeting minutes,but that information was hidden away in thousands of PDF files across many different websites.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;There was this massive backlog of basically every decision that had ever been made by one of these bodies. But it was almost impossible to discover because it lives in these systems where the method of exchange is a PDF.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Philip lives in Alameda, which makes its minutes available &lt;a href="https://alameda.legistar.com/Calendar.aspx"&gt;via this portal&lt;/a&gt; powered by &lt;a href="https://granicus.com/product/legistar-agenda-management/"&gt;Legistar&lt;/a&gt;. It turns out there are a small number of vendors that provide this kind of software tool, so once you've written a scraper for one it's likely to work for many others as well.&lt;/p&gt;
&lt;p&gt;Here's &lt;a href="https://alameda.ca.civic.band/"&gt;the Civic Band portal for Alameda&lt;/a&gt;, powered by &lt;a href="https://datasette.io/"&gt;Datasette&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2024/civic-band-1.jpg" alt="Datasette instance titled Alameda Civic Data, has search box, a note that says  A fully-searchable database of Alameda, CA civic meeting minutes. Last updated: 2024-11-15T20:27:36. See the full list at Civic Band and a meetings database with tables minutes and agendas." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;It's running the &lt;a href="https://github.com/simonw/datasette-search-all"&gt;datasette-search-all&lt;/a&gt; plugin and has both tables configured for full-text search. Here's a &lt;a href="https://alameda.ca.civic.band/-/search?q=housing"&gt;search for housing&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2024/civic-band-2.jpg" alt="Search all tables - for housing. 43 results in meetings: agendas. Each result shows a meeting, date, page, text and a rendered page image" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;h4 id="the-technical-stack"&gt;The technical stack&lt;/h4&gt;
&lt;p&gt;The public Civic Band sites all run using Datasette in Docker Containers - one container per municipality. They're hosted on a single &lt;a href="https://www.hetzner.com/"&gt;Hetzner&lt;/a&gt; machine.&lt;/p&gt;
&lt;p&gt;The ingestion pipeline runs separately from the main hosting environment, using a Mac Mini on Philp's desk at home.&lt;/p&gt;
&lt;p&gt;OCR works by breaking each PDF up into images and then running &lt;a href="https://github.com/tesseract-ocr/tesseract"&gt;Tesseract OCR&lt;/a&gt; against them directly on the Mac Mini. This processes in the order of 10,000 or less new pages of documents a day.&lt;/p&gt;
&lt;p&gt;Philip treats PDF as a normalization target, because the pipeline is designed around documents with pages of text. In the rare event that a municipality publishes documents in another format such as &lt;code&gt;.docx&lt;/code&gt; he converts them to PDF before processing.&lt;/p&gt;
&lt;p&gt;PNG images of the PDF pages are served via a CDN, and the OCRd text is written to SQLite database files - one per municipality. &lt;a href="https://sqlite.org/fts5.html"&gt;SQLite FTS&lt;/a&gt; provides full-text search.&lt;/p&gt;
&lt;h4 id="scale-and-storage"&gt;Scale and storage&lt;/h4&gt;
&lt;p&gt;The entire project currently comes to about 265GB on disk.  The PNGs of the pages use about 350GB of CDN storage.&lt;/p&gt;
&lt;p&gt;Most of the individual SQLite databases are very small. The largest is for &lt;a href="https://maui-county.hi.civic.band/"&gt;Maui County&lt;/a&gt; which is around 535MB because that county has professional stenographers taking detailed notes for every one of their meetings.&lt;/p&gt;
&lt;p&gt;Each city adds only a few documents a week so growth is manageable even as the number of cities grows.&lt;/p&gt;
&lt;h4 id="future-plans"&gt;Future plans&lt;/h4&gt;
&lt;p&gt;We talked quite a bit about a goal to allow users to subscribe to updates that match specific search terms.&lt;/p&gt;
&lt;p&gt;Philip has been building out a separate site called Civic Observer to address this need, which will store searches and then execute the periodically using the Datasette JSON API, with a Django app to record state to avoid sending the same alert more than once.&lt;/p&gt;

&lt;p&gt;I've had a long term ambition to build some kind of saved search alerts plugin for Datasette generally, to allow users to subscribe to new results for arbitrary SQL queries. My &lt;a href="https://github.com/simonw/sqlite-chronicle"&gt;sqlite-chronicle&lt;/a&gt; library is part or that effort - it uses SQLite triggers to maintain version numbers for individual rows in a table, allowing you to query just the rows that have been inserted or modified since the version number last time you ran the query.&lt;/p&gt;

&lt;p&gt;Philip is keen to talk to anyone who is interested in using Civic Band or helping expand it to even more cities. You can find him on the &lt;a href="https://datasette.io/discord"&gt;Datasette Discord&lt;/a&gt;.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/data-journalism"&gt;data-journalism&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/political-hacking"&gt;political-hacking&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/politics"&gt;politics&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sqlite"&gt;sqlite&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette-public-office-hours"&gt;datasette-public-office-hours&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="data-journalism"/><category term="political-hacking"/><category term="politics"/><category term="sqlite"/><category term="datasette"/><category term="datasette-public-office-hours"/></entry><entry><title>Spevktator: OSINT analysis tool for VK</title><link href="https://simonwillison.net/2022/Sep/5/spevktator/#atom-tag" rel="alternate"/><published>2022-09-05T20:48:20+00:00</published><updated>2022-09-05T20:48:20+00:00</updated><id>https://simonwillison.net/2022/Sep/5/spevktator/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/MischaU8/spevktator/blob/master/README.md"&gt;Spevktator: OSINT analysis tool for VK&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
This is a really cool project that came out of a recent Bellingcat hackathon. Spevktator takes 67,000 posts from five popular Russian news channels on VK (a popular Russian social media platform) and makes them available in Datasette, along with automated translations to English, post sharing metrics and sentiment analysis scores. This README includes some detailed analysis of the data, plus a link to an Observable notebook that implements custom visualizations against queries run directly against the Datasette instance.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/political-hacking"&gt;political-hacking&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/observable"&gt;observable&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/bellingcat"&gt;bellingcat&lt;/a&gt;&lt;/p&gt;



</summary><category term="political-hacking"/><category term="datasette"/><category term="observable"/><category term="bellingcat"/></entry><entry><title>Digitizing 55,000 pages of civic meetings</title><link href="https://simonwillison.net/2022/Aug/22/digitizing-55000-pages-of-civic-meetings/#atom-tag" rel="alternate"/><published>2022-08-22T16:26:04+00:00</published><updated>2022-08-22T16:26:04+00:00</updated><id>https://simonwillison.net/2022/Aug/22/digitizing-55000-pages-of-civic-meetings/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://phildini.dev/digitizing-55-000-pages-of-civic-meetings"&gt;Digitizing 55,000 pages of civic meetings&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Philip James has been building public, searchable archives of city council meetings for various cities—Oakland and Alamedia so far—using my s3-ocr script to run Textract OCR against the PDFs of the minutes, and deploying them to Fly using Datasette. This is a really cool project, and very much the kind of thing I’ve been hoping to support with the tools I’ve been building.

    &lt;p&gt;&lt;small&gt;&lt;/small&gt;Via &lt;a href="https://twitter.com/phildini/status/1561745329971613696"&gt;@phildini&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/archiving"&gt;archiving&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ocr"&gt;ocr&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/political-hacking"&gt;political-hacking&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/fly"&gt;fly&lt;/a&gt;&lt;/p&gt;



</summary><category term="archiving"/><category term="ocr"/><category term="political-hacking"/><category term="datasette"/><category term="fly"/></entry><entry><title>Exploring the UK Register of Members Interests with SQL and Datasette</title><link href="https://simonwillison.net/2018/Apr/25/register-members-interests/#atom-tag" rel="alternate"/><published>2018-04-25T15:49:39+00:00</published><updated>2018-04-25T15:49:39+00:00</updated><id>https://simonwillison.net/2018/Apr/25/register-members-interests/#atom-tag</id><summary type="html">
    &lt;p&gt;Ever wondered which UK Members of Parliament &lt;a href="https://register-of-members-interests.datasettes.com/regmem?sql=select+people_name%2C+person_id%2C+count%28*%29+as+n%2C+group_concat%28item%2C+%22+%3A%3A+%22%29+from+%28select+distinct+item%2C+people.name+as+people_name%2C+person_id%0D%0Afrom+items+join+people+on+items.person_id+%3D+people.id%0D%0Awhere+items.rowid+in+%28select+rowid+from+%5Bitems_fts%5D+where+%5Bitems_fts%5D+match+%3Asearch%29+order+by+hash%29%0D%0Agroup+by+people_name%2C+person_id%0D%0Aorder+by+n+desc%3B&amp;amp;search=helicopter"&gt;get gifted the most helicopter rides&lt;/a&gt;? How about &lt;a href="https://register-of-members-interests.datasettes.com/regmem?sql=select+distinct+item%2C+people.name%2C+person_id%0D%0Afrom+items+join+people+on+items.person_id+%3D+people.id%0D%0Awhere+items.rowid+in+%28select+rowid+from+%5Bitems_fts%5D+where+%5Bitems_fts%5D+match+%3Asearch%29+order+by+hash&amp;amp;search=gift+sultan+brunei"&gt;which MPs have been given Christmas hampers&lt;/a&gt; by the Sultan of Brunei? (David Cameron, William Hague and Michael Howard apparently). Here’s how to dig through the Register of Members Interests using SQL and Datasette.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://register-of-members-interests.datasettes.com/regmem?sql=select+distinct+item%2C+people.name%2C+person_id%0D%0Afrom+items+join+people+on+items.person_id+%3D+people.id%0D%0Awhere+items.rowid+in+%28select+rowid+from+%5Bitems_fts%5D+where+%5Bitems_fts%5D+match+%3Asearch%29+order+by+hash&amp;amp;search=gift+sultan+brunei" class="a-img"&gt;&lt;img style="max-width: 100%" src="https://static.simonwillison.net/static/2018/mp-gifts-sultan.png" alt="Gifts from the Sultan" /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href="https://www.mysociety.org/"&gt;mySociety&lt;/a&gt; have been building incredible civic participation applications like &lt;a href="https://www.theyworkforyou.com/"&gt;TheyWorkForYou&lt;/a&gt; and &lt;a href="https://www.fixmystreet.com/"&gt;FixMyStreet&lt;/a&gt; for nearly 15 years now, and have accumulated all kinds of interesting data along the way.&lt;/p&gt;
&lt;p&gt;They recently launched their own data portal at &lt;a href="https://data.mysociety.org/"&gt;data.mysociety.org&lt;/a&gt; listing all of the information they have available. While exploring it I stumbled across their copy of the &lt;a href="https://data.mysociety.org/datasets/members-interest/"&gt;UK Register of Members Interests&lt;/a&gt;. Every UK Member of Parliament has to &lt;a href="https://www.parliament.uk/mps-lords-and-offices/standards-and-financial-interests/parliamentary-commissioner-for-standards/registers-of-interests/register-of-members-financial-interests/"&gt;register their conflicts of interest and income sources&lt;/a&gt;, and mySociety have an ongoing project to parse that data into a more useful format.&lt;/p&gt;
&lt;p&gt;It won’t surprise you to hear that I couldn’t resist turning their XML files into a SQLite database.&lt;/p&gt;
&lt;p&gt;The result is &lt;a href="https://register-of-members-interests.datasettes.com"&gt;register-of-members-interests.datasettes.com&lt;/a&gt; - a &lt;a href="https://github.com/simonw/datasette"&gt;Datasette&lt;/a&gt; instance running against a SQLite database containing over 1.3 million line-items registered by 1,419 MPs over the course of 18 years.&lt;/p&gt;
&lt;h3&gt;&lt;a id="Some_fun_queries_14"&gt;&lt;/a&gt;Some fun queries&lt;/h3&gt;
&lt;p&gt;A few of my favourites so far:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Which MPs have taken &lt;a href="https://register-of-members-interests.datasettes.com/regmem?sql=select+people_name%2C+person_id%2C+count%28*%29+as+n%2C+group_concat%28item%2C+%22+%3A%3A+%22%29+from+%28select+distinct+item%2C+people.name+as+people_name%2C+person_id%0D%0Afrom+items+join+people+on+items.person_id+%3D+people.id%0D%0Awhere+items.rowid+in+%28select+rowid+from+%5Bitems_fts%5D+where+%5Bitems_fts%5D+match+%3Asearch%29+order+by+hash%29%0D%0Agroup+by+people_name%2C+person_id%0D%0Aorder+by+n+desc%3B&amp;amp;search=helicopter"&gt;the most donated helicopter rides&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Which MPs have accepted &lt;a href="https://register-of-members-interests.datasettes.com/regmem?sql=select+distinct+item%2C+people.name%2C+person_id%0D%0Afrom+items+join+people+on+items.person_id+%3D+people.id%0D%0Awhere+items.rowid+in+%28select+rowid+from+%5Bitems_fts%5D+where+%5Bitems_fts%5D+match+%3Asearch%29+order+by+hash&amp;amp;search=gift+sultan+brunei"&gt;gifts from the Surtan of Brunei&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;A better gifts query with more robust de-duping: here's gifts of &lt;a href="https://register-of-members-interests.datasettes.com/regmem?sql=select+item%2C+people.name%2C+max%28date%29+as+max_date%0D%0Afrom+items+join+people+on+items.person_id+%3D+people.id%0D%0Awhere+items.rowid+in+%28select+rowid+from+%5Bitems_fts%5D+where+%5Bitems_fts%5D+match+%3Asearch%29%0D%0Agroup+by+item%2C+people.name%2C+person_id%0D%0Aorder+by+max_date+desc&amp;amp;search=hamper"&gt;hampers&lt;/a&gt;, &lt;a href="https://register-of-members-interests.datasettes.com/regmem?sql=select+item%2C+people.name%2C+max%28date%29+as+max_date%0D%0Afrom+items+join+people+on+items.person_id+%3D+people.id%0D%0Awhere+items.rowid+in+%28select+rowid+from+%5Bitems_fts%5D+where+%5Bitems_fts%5D+match+%3Asearch%29%0D%0Agroup+by+item%2C+people.name%2C+person_id%0D%0Aorder+by+max_date+desc&amp;amp;search=gift+watch"&gt;watches&lt;/a&gt; and &lt;a href="https://register-of-members-interests.datasettes.com/regmem?sql=select+item%2C+people.name%2C+max%28date%29+as+max_date%0D%0Afrom+items+join+people+on+items.person_id+%3D+people.id%0D%0Awhere+items.rowid+in+%28select+rowid+from+%5Bitems_fts%5D+where+%5Bitems_fts%5D+match+%3Asearch%29%0D%0Agroup+by+item%2C+people.name%2C+person_id%0D%0Aorder+by+max_date+desc&amp;amp;search=glastonbury"&gt;Glastonbury festival tickets&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Which MPs own significant shares in &lt;a href="https://register-of-members-interests.datasettes.com/regmem/mps_ranked_by_matching_shareholdings?search=shell"&gt;Shell&lt;/a&gt;, &lt;a href="(https://register-of-members-interests.datasettes.com/regmem/mps_ranked_by_matching_shareholdings?search=apple)"&gt;Apple&lt;/a&gt;, or &lt;a href="https://register-of-members-interests.datasettes.com/regmem/mps_ranked_by_matching_shareholdings?search=bp"&gt;BP&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Who has reported &lt;a href="https://register-of-members-interests.datasettes.com/regmem/mps_ranked_by_matching_overseas_line_items?search=saudi+arabia"&gt;the most overseas trips to Saudi Arabia&lt;/a&gt;? The top hit there was Daniel Kawczynski, and it turns out his &lt;a href="https://en.wikipedia.org/wiki/Daniel_Kawczynski"&gt;Wikipedia page&lt;/a&gt; has a section dedicated to his relationship with the kindgom.&lt;/li&gt;
&lt;li&gt;How much do MPs get paid for &lt;a href="https://register-of-members-interests.datasettes.com/regmem/items?_search=Have+I+Got+News+for+You&amp;amp;_sort_desc=date"&gt;appearances on Have I Got News For You&lt;/a&gt;. The going rate seems to be £1,500 but you can find &lt;a href="https://register-of-members-interests.datasettes.com/regmem?sql=select+distinct+item%2C+person_id%2C+people.name+from+items+join+people+on+people.id+%3D+items.person_id%0D%0Awhere+%22item%22+not+like+%22%251%2C500%25%22+and+%22item%22+not+like+%22%251500%22+and+%22item%22+like+%22%25%C2%A3%25%22+and+items.rowid+in+%28select+rowid+from+%5Bitems_fts%5D+where+%5Bitems_fts%5D+match+%3Asearch%29+order+by+date+desc+limit+101&amp;amp;search=Have+I+Got+News+For+You"&gt;some interesting discrepancies&lt;/a&gt; if you exclude that value from the results.&lt;/li&gt;
&lt;li&gt;Which MPs are responsible for &lt;a href="https://register-of-members-interests.datasettes.com/regmem?sql=select+%22person_id%22%2C+people.name%2C+count(*)+as+%22count%22%0D%0Afrom+items+join+people+on+items.person_id+%3D+people.id%0D%0Agroup+by+%22person_id%22+order+by+%22count%22+desc+limit+100"&gt;the most total line items reported&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;&lt;a id="Understanding_the_data_model_23"&gt;&lt;/a&gt;Understanding the data model&lt;/h3&gt;
&lt;p&gt;Most of the action takes place in the &lt;a href="https://register-of-members-interests.datasettes.com/regmem/items"&gt;items&lt;/a&gt; table, where each item is a line-item from an MP’s filing. You can search that table by keyword (see helicopter example above) or apply filters to it using the standard Datasette interface. You can also execute your own SQL directly against the database.&lt;/p&gt;
&lt;p&gt;Each item is filed against a &lt;a href="https://register-of-members-interests.datasettes.com/regmem/categories?_sort=type"&gt;category&lt;/a&gt;. There appears to have been quite a bit of churn in the way that the categories are defined over the years, plus the data is pretty untidy - there are no less than 10 ways of spelling “Remunerated employment, office, profession etc.” for example!&lt;/p&gt;
&lt;p&gt;&lt;a href="https://register-of-members-interests.datasettes.com/regmem/categories" class="a-img"&gt;&lt;img style="max-width: 100%" src="https://static.simonwillison.net/static/2018/mp-categories.png" alt="Categories" /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;There are also a LOT of duplicate items in the set - it appears that MPs frequently list the same item (a rental property for example) every time they fill out the register. SQL DISTINCT clauses can help filter through these, as seen in some of the above examples.&lt;/p&gt;
&lt;p&gt;The data also has the concepts of both &lt;a href="https://register-of-members-interests.datasettes.com/regmem/members"&gt;members&lt;/a&gt; and &lt;a href="https://register-of-members-interests.datasettes.com/regmem/people"&gt;people&lt;/a&gt;. As far as I can tell people are distinct, but members may contain duplicates - presumably to represent MPs who have served more than one term in office. It looks like the member field &lt;a href="https://register-of-members-interests.datasettes.com/regmem/items?_sort_desc=date&amp;amp;member_id__notblank=1"&gt;stopped being populated in March 2015&lt;/a&gt; so analysis is best performed against the people table.&lt;/p&gt;
&lt;p&gt;Once concept I have introduced myself is the &lt;code&gt;record_id&lt;/code&gt;. In the XML documents the items are often grouped together into a related collection, like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;lt;regmem personid=&amp;quot;uk.org.publicwhip/person/10001&amp;quot;
    memberid=&amp;quot;uk.org.publicwhip/member/40289&amp;quot; membername=&amp;quot;Diane Abbott&amp;quot; date=&amp;quot;2014-07-14&amp;quot;&amp;gt;
    &amp;lt;category type=&amp;quot;2&amp;quot; name=&amp;quot;Remunerated employment, office, profession etc&amp;quot;&amp;gt;
        &amp;lt;item&amp;gt;Payments from MRL Public Sector Consultants, Pepple House, 8 Broad Street, Great Cambourne, Cambridge CB23 6HJ:&amp;lt;/item&amp;gt;
        &amp;lt;item&amp;gt;26 November 2013, I received a fee of £1,000 for speaking at the 1st African Legislative Summit, National Assembly, Abuja, Nigeria.  Hours: 8 hrs. The cost of my flights, transfers and hotel accommodation in Abuja were also met; estimated value £5,000. &amp;lt;em&amp;gt;(Registered 3 December 2013)&amp;lt;/em&amp;gt;&amp;lt;/item&amp;gt;
        &amp;lt;item&amp;gt;23 July 2013, I received a fee of £5,000 for appearing as a contestant on ITV&amp;amp;#8217;s &amp;amp;#8216;The Chase Celebrity &amp;amp;#8211; Series 3&amp;amp;#8217; television programme.  Address of payer:  ITV Studios Ltd, London Television Centre, Upper Ground, London SE1 9Lt.  Hours: 12 hrs.   &amp;lt;em&amp;gt;(Registered 23 July 2013)&amp;lt;/em&amp;gt;&amp;lt;/item&amp;gt;
    &amp;lt;/category&amp;gt;
&amp;lt;/regmem&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;While these items are presented as separate line items, their grouping carries meaning: the first line item here acts as a kind of heading to help provide context to the other items.&lt;/p&gt;
&lt;p&gt;To model this in the simplest way possible, I’ve attempted to preserve the order of these groups using a pair of additional columns: the &lt;code&gt;record_id&lt;/code&gt; and the &lt;code&gt;sort_order&lt;/code&gt;. I construct the &lt;code&gt;record_id&lt;/code&gt; using a collection of other fields - the idea is for it to be sortable, and for each line-item in the same grouping to have the same &lt;code&gt;record_id&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;record_id = &amp;quot;{date}-{category_id}-{person_id}-{record}&amp;quot;.format(
    date=date,
    category_id=category_id,
    person_id=person_id.split(&amp;quot;/&amp;quot;)[
        -1
    ],
    record=record,
)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The resulting &lt;code&gt;record_id&lt;/code&gt; might look like this: &lt;code&gt;2018-04-16-70b64e89-24878-0&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;To recreate that particular sequence of line-items, you can search for all items matching that &lt;code&gt;record_id&lt;/code&gt; and then sort them by their &lt;code&gt;sort_order&lt;/code&gt;. Here’s &lt;a href="https://register-of-members-interests.datasettes.com/regmem/items?_sort=sort_order&amp;amp;record_id__exact=2014-07-14-1f359ece-10001-0"&gt;that record from Diane Abbott&lt;/a&gt; shown with its surrounding context.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://register-of-members-interests.datasettes.com/regmem/items?_sort=sort_order&amp;amp;record_id__exact=2014-07-14-1f359ece-10001-0" class="a-img"&gt;&lt;img style="max-width: 100%" src="https://static.simonwillison.net/static/2018/mp-record.png" alt="A single record" /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;&lt;a id="How_I_built_it_65"&gt;&lt;/a&gt;How I built it&lt;/h3&gt;
&lt;p&gt;The short version: I downloaded all of the XML files and wrote a Python script which parsed them using &lt;a href="https://docs.python.org/3/library/xml.etree.elementtree.html"&gt;ElementTree&lt;/a&gt; and inserted them into a SQLite database. I’ve &lt;a href="https://github.com/simonw/register-of-members-interests"&gt;put the code on GitHub&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;A couple of fun tricks: firstly, I borrowed some code from &lt;a href="https://github.com/simonw/csvs-to-sqlite"&gt;csvs-to-sqlite&lt;/a&gt; to create the full-text search index and enable searching:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;def create_and_populate_fts(conn):
    create_sql = &amp;quot;&amp;quot;&amp;quot;
        CREATE VIRTUAL TABLE &amp;quot;items_fts&amp;quot;
        USING {fts_version} (item, person_name, content=&amp;quot;items&amp;quot;)
    &amp;quot;&amp;quot;&amp;quot;.format(
        fts_version=best_fts_version()
    )
    conn.executescript(create_sql)
    conn.executescript(
        &amp;quot;&amp;quot;&amp;quot;
        INSERT INTO &amp;quot;items_fts&amp;quot; (rowid, item, person_name)
        SELECT items.rowid, items.item, people.name
        FROM items LEFT JOIN people ON items.person_id = people.id
    &amp;quot;&amp;quot;&amp;quot;
    )
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;best_fts_version()&lt;/code&gt; function &lt;a href="https://github.com/simonw/register-of-members-interests/blob/6372cfbc0905c4c045714e9092a981668f2d6345/convert_xml_to_sqlite.py#L56-L71"&gt;implements basic feature detection&lt;/a&gt; against SQLite by trying operations in an in-memory database.&lt;/p&gt;
&lt;p&gt;Secondly, I ended up writing my own tiny utility function for inserting records into SQLite. SQLite has useful &lt;code&gt;INSERT OR REPLACE INTO&lt;/code&gt; syntax which allows you to insert a record and will automatically update an existing record if there is a match on the primary key. This meant I could write this utility function and use it for all of my data inserts:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;def insert_or_replace(conn, table, record):
    pairs = record.items()
    columns = [p[0] for p in pairs]
    params = [p[1] for p in pairs]
    sql = &amp;quot;INSERT OR REPLACE INTO {table} ({column_list}) VALUES ({value_list});&amp;quot;.format(
        table=table,
        column_list=&amp;quot;, &amp;quot;.join(columns),
        value_list=&amp;quot;, &amp;quot;.join([&amp;quot;?&amp;quot; for p in params]),
    )
    conn.execute(sql, params)

# ...

insert_or_replace(
    db,
    &amp;quot;people&amp;quot;,
    {
        &amp;quot;id&amp;quot;: person_id,
        &amp;quot;name&amp;quot;: regmem_el.attrib[&amp;quot;membername&amp;quot;],
    },
)
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;&lt;a id="What_can_you_find_113"&gt;&lt;/a&gt;What can you find?&lt;/h3&gt;
&lt;p&gt;I’ve really only scratched the surface of what’s in here with my initial queries. What can you find? Send me Datasette query links &lt;a href="https://twitter.com/simonw"&gt;on Twitter&lt;/a&gt; with your discoveries!&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/mysociety"&gt;mysociety&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/political-hacking"&gt;political-hacking&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/politics"&gt;politics&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sqlite"&gt;sqlite&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/xml"&gt;xml&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="mysociety"/><category term="political-hacking"/><category term="politics"/><category term="projects"/><category term="sqlite"/><category term="xml"/><category term="datasette"/></entry><entry><title>Video speech matching on TheyWorkForYou.com</title><link href="https://simonwillison.net/2008/Jun/1/video/#atom-tag" rel="alternate"/><published>2008-06-01T13:52:55+00:00</published><updated>2008-06-01T13:52:55+00:00</updated><id>https://simonwillison.net/2008/Jun/1/video/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="http://www.theyworkforyou.com/video/"&gt;Video speech matching on TheyWorkForYou.com&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Launched this morning at BarCamp London by Matthew Somerville—TheyWorkForYou now has video from BBC Parliament but they need your help matching it exactly to their transcripts from Hansard. Neat example of a game that helps process large amounts of data.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/barcamplondon"&gt;barcamplondon&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/barcamplondon4"&gt;barcamplondon4&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/government"&gt;government&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/matthew-somerville"&gt;matthew-somerville&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/mysociety"&gt;mysociety&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/political-hacking"&gt;political-hacking&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/theyworkforyou"&gt;theyworkforyou&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/timestamping"&gt;timestamping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/video"&gt;video&lt;/a&gt;&lt;/p&gt;



</summary><category term="barcamplondon"/><category term="barcamplondon4"/><category term="government"/><category term="matthew-somerville"/><category term="mysociety"/><category term="political-hacking"/><category term="theyworkforyou"/><category term="timestamping"/><category term="video"/></entry><entry><title>WhatDoTheyKnow</title><link href="https://simonwillison.net/2008/Mar/4/whatdotheyknow/#atom-tag" rel="alternate"/><published>2008-03-04T23:38:46+00:00</published><updated>2008-03-04T23:38:46+00:00</updated><id>https://simonwillison.net/2008/Mar/4/whatdotheyknow/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="http://www.whatdotheyknow.com/"&gt;WhatDoTheyKnow&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
New from mySociety: a site for submitting and publically tracking Freedom of Information requests to the UK government.

    &lt;p&gt;&lt;small&gt;&lt;/small&gt;Via &lt;a href="http://www.mysociety.org/2008/02/22/mysocietys-freedom-of-information-site-goes-live/"&gt;mySociety&amp;#x27;s Freedom of Information site goes live&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/foi"&gt;foi&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/government"&gt;government&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/mysociety"&gt;mysociety&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/political-hacking"&gt;political-hacking&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/whatdotheyknow"&gt;whatdotheyknow&lt;/a&gt;&lt;/p&gt;



</summary><category term="foi"/><category term="government"/><category term="mysociety"/><category term="political-hacking"/><category term="whatdotheyknow"/></entry><entry><title>They Work For You</title><link href="https://simonwillison.net/2004/Jun/6/theyWorkForYou/#atom-tag" rel="alternate"/><published>2004-06-06T21:51:21+00:00</published><updated>2004-06-06T21:51:21+00:00</updated><id>https://simonwillison.net/2004/Jun/6/theyWorkForYou/#atom-tag</id><summary type="html">
    &lt;p id="p-0"&gt;Today is/was (you never can tell with these wretched time zone differences) &lt;a href="http://www.xcom2002.com/nc04/"&gt;NotCon 2004&lt;/a&gt;, London's premiere low-cost, informal, one-day technology conference. Friday's &lt;a href="http://www.ntk.net/2004/06/04/"&gt;MiniNTK&lt;/a&gt; promised the &lt;q cite="http://www.ntk.net/2004/06/04/"&gt;unveiling of a new project from the people behind FaxYourMP and PublicWhip&lt;/q&gt; and sure enough, here it is: &lt;a href="http://www.theyworkforyou.com/"&gt;TheyWorkForYou.com&lt;/a&gt;.&lt;/p&gt;

&lt;p id="p-1"&gt;TheyWorkForYou.com is the finest example of a &lt;a href="/2004/Feb/17/hackingPolitics/" title="Hacking the political system"&gt;political hack&lt;/a&gt; I've ever seen. It's basically an ultra-user-friendly front-end to the Hansard public record of all speeches and debates in the houses of commons, with each &lt;acronym title="Member of Parliament"&gt;MP&lt;/acronym&gt; getting their own page complete with a summary of their recent performance and member's interests and a list of their recently recorded parliamentary appearances.&lt;/p&gt;

&lt;p id="p-2"&gt;The community features are pretty impressive as well: you can add comments to any speech made in any debate, and there's also a &lt;a href="http://www.theyworkforyou.com/addterm/"&gt;user-populated glossary&lt;/a&gt; which is automatically linked in to the speech transcripts.&lt;/p&gt;

&lt;p id="p-3"&gt;But with all that, the most useful feature is probably the most subtle: you can subscribe to an &lt;acronym title="Really Simple Syndication"&gt;RSS&lt;/acronym&gt; feed of your &lt;acronym title="Member of Parliament"&gt;MP&lt;/acronym&gt;'s appearances, meaning you can keep track of everything they say on your behalf. It's simple, powerful and a perfect example of the political hacker ethic at work.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/mysociety"&gt;mysociety&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/political-hacking"&gt;political-hacking&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/politics"&gt;politics&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="mysociety"/><category term="political-hacking"/><category term="politics"/></entry></feed>