<?xml version="1.0" encoding="utf-8"?>
<feed xml:lang="en-us" xmlns="http://www.w3.org/2005/Atom"><title>Simon Willison's Weblog: colin-dellow</title><link href="http://simonwillison.net/" rel="alternate"/><link href="http://simonwillison.net/tags/colin-dellow.atom" rel="self"/><id>http://simonwillison.net/</id><updated>2024-04-21T22:28:29+00:00</updated><author><name>Simon Willison</name></author><entry><title>qrank</title><link href="https://simonwillison.net/2024/Apr/21/qrank/#atom-tag" rel="alternate"/><published>2024-04-21T22:28:29+00:00</published><updated>2024-04-21T22:28:29+00:00</updated><id>https://simonwillison.net/2024/Apr/21/qrank/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/hikeratlas/qrank"&gt;qrank&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Interesting and very niche project by Colin Dellow.&lt;/p&gt;

&lt;p&gt;Wikidata has pages for huge numbers of concepts, people, places and things.&lt;/p&gt;

&lt;p&gt;One of the many pieces of data they publish is QRank—“ranking Wikidata entities by aggregating page views on Wikipedia, Wikispecies, Wikibooks, Wikiquote, and other Wikimedia projects”. Every item gets a score and these scores can be used to answer questions like “which island nations get the most interest across Wikipedia”—potentially useful for things like deciding which labels to display on a highly compressed map of the world.&lt;/p&gt;

&lt;p&gt;QRank is published as a gzipped CSV file.&lt;/p&gt;

&lt;p&gt;Colin’s hikeratlas/qrank GitHub repository runs a weekly GitHub Actions workflow that fetches the latest qrank.csv.gz file and loads it into a SQLite database using SQLite’s “.import” mechanism. It then publishes the resulting SQLite database as an asset attached to the “latest” GitHub release on that repo—currently a 307MB file.&lt;/p&gt;
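&lt;p&gt;The core of that load step can be sketched in a few lines of Python (the column names here are assumed, and a tiny inline CSV stands in for the real dump):&lt;/p&gt;

```python
import csv, sqlite3

# Stand-in for the real qrank.csv dump (column names are assumed)
with open("qrank.csv", "w", newline="") as f:
    f.write("id,qrank\n42,12345\n64,99999\n")

# Load it into SQLite, mirroring what the ".import" mechanism does
db = sqlite3.connect("qrank.db")
db.execute("create table if not exists qrank (id integer primary key, qrank integer)")
with open("qrank.csv", newline="") as f:
    db.executemany(
        "insert or replace into qrank (id, qrank) values (:id, :qrank)",
        csv.DictReader(f),
    )
db.commit()
print(db.execute("select qrank from qrank where id = 42").fetchone()[0])  # 12345
```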

&lt;p&gt;The database itself has just a single table mapping the Wikidata ID (a primary key integer) to the latest QRank—another integer. You’d need your own set of data with Wikidata IDs to join against this to do anything useful.&lt;/p&gt;
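&lt;p&gt;As a hypothetical illustration (table and column names are made up), joining your own data against it looks like this:&lt;/p&gt;

```python
import sqlite3

# Toy data standing in for the real 307MB qrank database
db = sqlite3.connect(":memory:")
db.execute("create table qrank (id integer primary key, qrank integer)")
db.executemany("insert into qrank values (?, ?)", [(42, 12345), (64, 99999)])

# Your own table of things with Wikidata IDs
db.execute("create table islands (wikidata_id integer, name text)")
db.executemany(
    "insert into islands values (?, ?)",
    [(42, "Example Island"), (64, "Another Island")],
)

# Rank your own records by their QRank score
rows = db.execute("""
    select islands.name, qrank.qrank
    from islands join qrank on islands.wikidata_id = qrank.id
    order by qrank.qrank desc
""").fetchall()
print(rows)  # [('Another Island', 99999), ('Example Island', 12345)]
```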

&lt;p&gt;I’d never thought of using GitHub Releases for this kind of thing. I think it’s a really interesting pattern.&lt;/p&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=40104774#40106913"&gt;Hacker News comment&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/sqlite"&gt;sqlite&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/wikipedia"&gt;wikipedia&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/github-actions"&gt;github-actions&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/colin-dellow"&gt;colin-dellow&lt;/a&gt;&lt;/p&gt;



</summary><category term="sqlite"/><category term="wikipedia"/><category term="github-actions"/><category term="colin-dellow"/></entry><entry><title>datasette-scraper, Big Local News and other weeknotes</title><link href="https://simonwillison.net/2023/Jan/30/datasette-scraper/#atom-tag" rel="alternate"/><published>2023-01-30T02:52:50+00:00</published><updated>2023-01-30T02:52:50+00:00</updated><id>https://simonwillison.net/2023/Jan/30/datasette-scraper/#atom-tag</id><summary type="html">
    &lt;p&gt;In addition to exploring &lt;a href="https://simonwillison.net/2023/Jan/27/exploring-musiccaps/"&gt;the new MusicCaps training and evaluation data&lt;/a&gt; I've been working on the big Datasette JSON refactor, and getting excited about a Datasette project that I didn't work on at all.&lt;/p&gt;
&lt;h4&gt;datasette-scraper&lt;/h4&gt;
&lt;p&gt;The best thing about a plugin system is that you can wake up one day and your software has grown extra features without you even having to review a pull request.&lt;/p&gt;
&lt;p&gt;Colin Dellow's &lt;a href="https://github.com/cldellow/datasette-scraper"&gt;datasette-scraper&lt;/a&gt; - first released a few weeks ago - takes that a step further: it's a plugin that builds an entire custom application on top of Datasette.&lt;/p&gt;
&lt;p&gt;It's really cool!&lt;/p&gt;
&lt;p&gt;Colin has a &lt;a href="https://www.youtube.com/watch?v=zrSGnz7ErNI"&gt;ten minute demo&lt;/a&gt; up on YouTube which is well worth checking out.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://www.youtube.com/watch?v=zrSGnz7ErNI"&gt;&lt;img src="https://static.simonwillison.net/static/2023/datasette-scraper.jpg" alt="Screenshot of a dss_crawl_queue_history table, showing a history of crawls - with a window to the side showing the terminal output as the crawler ran" style="max-width: 100%;" /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;The plugin implements a website crawler which can crawl pages, build up a queue (using &lt;code&gt;sitemap.xml&lt;/code&gt; if it's available), then fetch and cache the content - storing its progress and the resulting data in a SQLite database hosted by Datasette.&lt;/p&gt;
&lt;p&gt;It uses some really neat tricks to customize various pieces of the Datasette interface to provide an interface for configuring and controlling the crawler.&lt;/p&gt;
&lt;p&gt;Most impressively of all, it implements its own plugin hooks... which means you can use small custom plugins to define how you would like data to be extracted from the pages you are crawling.&lt;/p&gt;
&lt;p&gt;Colin has a bunch of other Datasette plugins that are worth checking out too:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/cldellow/datasette-rewrite-sql"&gt;datasette-rewrite-sql&lt;/a&gt; monkey-patches Datasette's database connection code (since there's no appropriate plugin hook there yet) to provide its own hook for further plugin functions that can rewrite SQL queries before they get executed.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/cldellow/datasette-ersatz-table-valued-functions"&gt;datasette-ersatz-table-valued-functions&lt;/a&gt; (&lt;em&gt;ersatz (adj.): made or used as a substitute, typically an inferior one, for something else.&lt;/em&gt;) is a delightfully gnarly hack which supports custom table-valued SQL functions in SQLite, despite Python's &lt;code&gt;sqlite3&lt;/code&gt; module not providing those. It works by rewriting a SQL query against a function that returns JSON to use a &lt;a href="https://github.com/cldellow/ersatz-table-valued-functions"&gt;gnarly CTE and json_each()&lt;/a&gt; combo instead.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/cldellow/datasette-ui-extras"&gt;datasette-ui-extras&lt;/a&gt; is brand new: it tweaks the Datasette default interface in various ways, adding features like sticky header and facets in a sidebar. I'm so excited to see someone experimenting with changes to the default UI in this way, and I fully expect that some of the ideas Colin is playing with here will make it into Datasette core in the future.&lt;/li&gt;
&lt;/ul&gt;
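&lt;p&gt;The CTE and &lt;code&gt;json_each()&lt;/code&gt; trick behind datasette-ersatz-table-valued-functions is general enough to sketch with Python's &lt;code&gt;sqlite3&lt;/code&gt; module (this illustrates the pattern, not the plugin's own query-rewriting code):&lt;/p&gt;

```python
import json, sqlite3

db = sqlite3.connect(":memory:")

# A scalar function that returns a JSON array rather than a table
db.create_function(
    "series", 2, lambda start, stop: json.dumps(list(range(start, stop)))
)

# The CTE + json_each() combo turns that JSON array back into rows
rows = db.execute("""
    with result as (select series(1, 5) as j)
    select value from result, json_each(result.j)
""").fetchall()
print([r[0] for r in rows])  # [1, 2, 3, 4]
```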
&lt;h4 id="datasette-and-big-local-news"&gt;Datasette and Big Local News&lt;/h4&gt;
&lt;p&gt;&lt;a href="https://biglocalnews.org"&gt;Big Local News&lt;/a&gt; is a collaborative data archive for journalists run by a team out of Stanford.&lt;/p&gt;
&lt;p&gt;I worked with &lt;a href="https://palewi.re/who-is-ben-welsh/"&gt;Ben Welsh&lt;/a&gt; this week to &lt;a href="https://biglocalnews.org/content/news/2023/01/30/datasette-integration.html"&gt;build an updated version&lt;/a&gt; of an old integration with Datasette, which went live on Friday.&lt;/p&gt;
&lt;p&gt;Here's the animated GIF demo &lt;a href="https://biglocalnews.org/content/news/2023/01/30/datasette-integration.html"&gt;from their announcement&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2023/big-local-datasette.gif" alt="The user starts out on the Big Local site, browses into a project, selects a menu item next to a CSV file in that project and runs &amp;quot;Open in Datasette&amp;quot; - they are then taken to a separate Datasette instance where that CSV file is imported into a table, ready for them to interact with." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;The code I wrote for this integration lives in &lt;a href="https://github.com/simonw/datasette-big-local"&gt;simonw/datasette-big-local&lt;/a&gt;. I ended up building a custom plugin which exclusively works with Big Local (so it's not shipped to PyPI).&lt;/p&gt;
&lt;p&gt;The main challenge with this project involved authentication and permissions. Big Local has both public and private datasets, and we wanted the integration to work with both. But if someone opened a file in a private project, only they and others with access to that project should be able to view the resulting table.&lt;/p&gt;
&lt;p&gt;In the end I solved this by creating a new SQLite database for each project, then configuring &lt;a href="https://docs.datasette.io/en/stable/authentication.html"&gt;Datasette's authentication system&lt;/a&gt; to run a permission check for each user to check that they were allowed to access files in the associated project, via a call to the Big Local GraphQL API.&lt;/p&gt;
&lt;p&gt;The first time a user opens a file in Big Local, JavaScript on the site performs an HTTP POST to Datasette, transmitting the project ID, the filename and an authentication token for that user. Datasette can then use that token to call the GraphQL API on the user's behalf, verifying their permissions and signing them into the Datasette instance using a signed cookie.&lt;/p&gt;
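&lt;p&gt;The signed-cookie part of that flow can be sketched with just the standard library (this illustrates the idea, not Datasette's actual cookie format):&lt;/p&gt;

```python
import base64, hashlib, hmac, json

SECRET = b"server-side-secret"  # hypothetical signing key

def sign_actor(actor):
    # Serialize the actor payload and append an HMAC signature
    payload = base64.urlsafe_b64encode(json.dumps(actor).encode()).decode()
    sig = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
    return payload + "." + sig

def verify_actor(cookie):
    # Reject the cookie unless the signature matches
    payload, _, sig = cookie.rpartition(".")
    expected = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
    if hmac.compare_digest(sig, expected):
        return json.loads(base64.urlsafe_b64decode(payload))
    return None

cookie = sign_actor({"id": "user-123", "project": "example-project"})
print(verify_actor(cookie)["id"])  # user-123
```

The point of the signature is that later requests can be trusted without re-calling the GraphQL API on every page load.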
&lt;p&gt;The &lt;a href="https://github.com/simonw/datasette-big-local/blob/main/README.md"&gt;datasette-big-local README&lt;/a&gt; includes full details about how this all works.&lt;/p&gt;

&lt;h4&gt;Progress on ?_extra=&lt;/h4&gt;
&lt;p&gt;I finally started making progress on &lt;a href="https://github.com/simonw/datasette/issues/262"&gt;Datasette issue #262: Add ?_extra= mechanism for requesting extra properties in JSON&lt;/a&gt; - first opened back in May 2018!&lt;/p&gt;
&lt;p&gt;This is the key step on the journey to slimming down Datasette's default JSON representation for tables and queries.&lt;/p&gt;
&lt;p&gt;I want to return this by default:&lt;/p&gt;
&lt;div class="highlight highlight-source-json"&gt;&lt;pre&gt;{
    &lt;span class="pl-ent"&gt;"ok"&lt;/span&gt;: &lt;span class="pl-c1"&gt;true&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"rows"&lt;/span&gt;: [
        {&lt;span class="pl-ent"&gt;"id"&lt;/span&gt;: &lt;span class="pl-c1"&gt;1&lt;/span&gt;, &lt;span class="pl-ent"&gt;"title"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Example 1&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;},
        {&lt;span class="pl-ent"&gt;"id"&lt;/span&gt;: &lt;span class="pl-c1"&gt;2&lt;/span&gt;, &lt;span class="pl-ent"&gt;"title"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Example 2&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;},
        {&lt;span class="pl-ent"&gt;"id"&lt;/span&gt;: &lt;span class="pl-c1"&gt;3&lt;/span&gt;, &lt;span class="pl-ent"&gt;"title"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Example 3&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;}
    ],
    &lt;span class="pl-ent"&gt;"next"&lt;/span&gt;: &lt;span class="pl-c1"&gt;null&lt;/span&gt;
}&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Then allow users to specify all kinds of extra information - the table schema, the suggested facets, information on columns, the underlying SQL query... all by adding on &lt;code&gt;?_extra=x&lt;/code&gt; parameters to the URL.&lt;/p&gt;
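&lt;p&gt;The shape of that mechanism is easy to sketch (the extras and their names here are hypothetical):&lt;/p&gt;

```python
from urllib.parse import parse_qs, urlparse

# Hypothetical registry of extras, keyed by ?_extra= name
EXTRAS = {
    "count": lambda rows: {"count": len(rows)},
    "columns": lambda rows: {"columns": sorted(rows[0]) if rows else []},
}

def table_response(url, rows):
    # Slim default representation...
    response = {"ok": True, "rows": rows, "next": None}
    # ...plus whatever extras the URL asked for
    for name in parse_qs(urlparse(url).query).get("_extra", []):
        if name in EXTRAS:
            response.update(EXTRAS[name](rows))
    return response

rows = [{"id": 1, "title": "Example 1"}, {"id": 2, "title": "Example 2"}]
print(table_response("/db/table.json?_extra=count", rows)["count"])  # 2
```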
&lt;p&gt;This turns out to be a lot of work: I'm having to completely refactor the internals of the most complicated part of Datasette's codebase.&lt;/p&gt;
&lt;p&gt;Still lots more to go, but I'm happy to finally be making progress here.&lt;/p&gt;
&lt;h4&gt;Improved Datasette examples&lt;/h4&gt;
&lt;p&gt;The Datasette website has long had an &lt;a href="https://datasette.io/examples"&gt;Examples&lt;/a&gt; page linked to from the top navigation - and analytics show that it's one of the most visited pages on the site.&lt;/p&gt;
&lt;p&gt;I finally gave that page the upgrade it needed. It now starts with illustrated examples that have been selected to help highlight what Datasette can do - both the variety of problems it can be used to solve, and the way plugins can be used to add extra functionality.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2023/datasette-examples.png" alt="Screenshot of the Examples of sites built using Datasette page, featuring datasette.io and LAION-Aesthetics and github-to-sqlite with GraphQL" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;h4&gt;shot-scraper 1.1&lt;/h4&gt;
&lt;p&gt;I implemented the new screenshots for the examples page using my &lt;a href="https://shot-scraper.datasette.io/"&gt;shot-scraper&lt;/a&gt; screenshot automation tool.&lt;/p&gt;
&lt;p&gt;The screenshots are taken by a GitHub Actions workflow in the &lt;a href="https://github.com/simonw/datasette-screenshots"&gt;datasette-screenshots&lt;/a&gt; repository.&lt;/p&gt;
&lt;p&gt;I &lt;a href="https://github.com/simonw/datasette-screenshots/commit/70d5a7adff89d66ee959867cfe32893eb59ad2b7"&gt;added 5 new screenshot definitions&lt;/a&gt; to the YAML that powers that workflow, which is used by the &lt;a href="https://shot-scraper.datasette.io/en/stable/multi.html"&gt;shot-scraper multi&lt;/a&gt; command.&lt;/p&gt;
&lt;p&gt;In working with &lt;code&gt;shot-scraper&lt;/code&gt; I spotted a couple of opportunities for small improvements, and released &lt;a href="https://github.com/simonw/shot-scraper/releases/tag/1.1"&gt;version 1.1&lt;/a&gt; with the following changes:&lt;/p&gt;
&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;New &lt;code&gt;--log-console&lt;/code&gt; option for logging the output of calls to &lt;code&gt;console.log()&lt;/code&gt; to standard error. &lt;a href="https://github.com/simonw/shot-scraper/issues/101"&gt;#101&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;New &lt;code&gt;--skip&lt;/code&gt; and &lt;code&gt;--fail&lt;/code&gt; options to specify what should happen if an HTTP 4xx or 5xx error is encountered while trying to load the page. &lt;code&gt;--skip&lt;/code&gt; will ignore the error and either exit cleanly or move on to the next screenshot (in the case of &lt;code&gt;multi&lt;/code&gt;). &lt;code&gt;--fail&lt;/code&gt; will cause the tool to return a non-zero exit code, useful for running in CI environments. &lt;a href="https://github.com/simonw/shot-scraper/issues/102"&gt;#102&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;h4&gt;datasette-granian&lt;/h4&gt;
&lt;p&gt;&lt;a href="https://github.com/emmett-framework/granian"&gt;Granian&lt;/a&gt; is a new web server for running Python WSGI and ASGI applications, written in Rust.&lt;/p&gt;
&lt;p&gt;A while ago I built &lt;a href="https://datasette.io/plugins/datasette-gunicorn"&gt;datasette-gunicorn&lt;/a&gt;, a plugin which adds a &lt;code&gt;datasette gunicorn my.db&lt;/code&gt; command for serving Datasette using the &lt;a href="https://gunicorn.org/"&gt;Gunicorn&lt;/a&gt; WSGI server.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://github.com/simonw/datasette-granian"&gt;datasette-granian&lt;/a&gt; now provides the same thing using Granian. It's an alpha release because I haven't actually used it in production yet, but it seems to work well and it adds yet another option for people who want to deploy Datasette.&lt;/p&gt;
&lt;p&gt;Granian's creator Giovanni Barillari &lt;a href="https://github.com/emmett-framework/granian/issues/35"&gt;was really helpful&lt;/a&gt; as I figured out how to dynamically serve a freshly configured ASGI application rather than just passing a module path to the &lt;code&gt;granian&lt;/code&gt; CLI command.&lt;/p&gt;
&lt;h4&gt;datasette-faiss 0.2&lt;/h4&gt;
&lt;p&gt;I &lt;a href="https://simonwillison.net/2023/Jan/13/semantic-search-answers/"&gt;introduced datasette-faiss&lt;/a&gt; a few weeks ago. It's a plugin that suppors fast vector similarity lookups within Datasette using the &lt;a href="https://github.com/facebookresearch/faiss"&gt;FAISS&lt;/a&gt; vector search library by Facebook Research.&lt;/p&gt;
&lt;p&gt;The first release of the plugin created a FAISS index on server startup for each table that contains an embeddings column. Any similarity searches would then be run against that entire table.&lt;/p&gt;
&lt;p&gt;But what if you want to combine those searches with other filters in a query? For example, first filter to every article published in 2022, then run a similarity search on what's left.&lt;/p&gt;
&lt;p&gt;In &lt;a href="https://github.com/simonw/datasette-faiss/releases/tag/0.2"&gt;datasette-faiss 0.2&lt;/a&gt; I introduced two new SQLite aggregate functions: &lt;code&gt;faiss_agg()&lt;/code&gt; and &lt;code&gt;faiss_agg_with_scores()&lt;/code&gt;, that are designed to handle this case.&lt;/p&gt;
&lt;p&gt;The new functions work by constructing a new FAISS index from scratch every time they are called, covering just the rows that were processed by the aggregation.&lt;/p&gt;
&lt;p&gt;This is best illustrated with an example. The following query first selects the embeddings for just the blog entries published in 2022, then uses those to find items that are most similar to the provided ID.&lt;/p&gt;
&lt;div class="highlight highlight-source-sql"&gt;&lt;pre&gt;with entries_2022 &lt;span class="pl-k"&gt;as&lt;/span&gt; (
  &lt;span class="pl-k"&gt;select&lt;/span&gt;
    id,
    embedding
  &lt;span class="pl-k"&gt;from&lt;/span&gt;
    blog_entry_embeddings
  &lt;span class="pl-k"&gt;where&lt;/span&gt;
    id &lt;span class="pl-k"&gt;in&lt;/span&gt; (&lt;span class="pl-k"&gt;select&lt;/span&gt; id &lt;span class="pl-k"&gt;from&lt;/span&gt; blog_entry &lt;span class="pl-k"&gt;where&lt;/span&gt; created &lt;span class="pl-k"&gt;like&lt;/span&gt; &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;2022%&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;)
),
faiss &lt;span class="pl-k"&gt;as&lt;/span&gt; (
  &lt;span class="pl-k"&gt;select&lt;/span&gt;
    faiss_agg(
      id,
      embedding,
      (&lt;span class="pl-k"&gt;select&lt;/span&gt; embedding &lt;span class="pl-k"&gt;from&lt;/span&gt; blog_entry_embeddings &lt;span class="pl-k"&gt;where&lt;/span&gt; id &lt;span class="pl-k"&gt;=&lt;/span&gt; :id),
      &lt;span class="pl-c1"&gt;10&lt;/span&gt;
    ) &lt;span class="pl-k"&gt;as&lt;/span&gt; results
  &lt;span class="pl-k"&gt;from&lt;/span&gt;
    entries_2022
),
ids &lt;span class="pl-k"&gt;as&lt;/span&gt; (
  &lt;span class="pl-k"&gt;select&lt;/span&gt;
    value &lt;span class="pl-k"&gt;as&lt;/span&gt; id
  &lt;span class="pl-k"&gt;from&lt;/span&gt;
    json_each(&lt;span class="pl-c1"&gt;faiss&lt;/span&gt;.&lt;span class="pl-c1"&gt;results&lt;/span&gt;),
    faiss
)
&lt;span class="pl-k"&gt;select&lt;/span&gt;
  &lt;span class="pl-c1"&gt;blog_entry&lt;/span&gt;.&lt;span class="pl-c1"&gt;id&lt;/span&gt;,
  &lt;span class="pl-c1"&gt;blog_entry&lt;/span&gt;.&lt;span class="pl-c1"&gt;title&lt;/span&gt;,
  &lt;span class="pl-c1"&gt;blog_entry&lt;/span&gt;.&lt;span class="pl-c1"&gt;created&lt;/span&gt;
&lt;span class="pl-k"&gt;from&lt;/span&gt;
  ids
  &lt;span class="pl-k"&gt;join&lt;/span&gt; blog_entry &lt;span class="pl-k"&gt;on&lt;/span&gt; &lt;span class="pl-c1"&gt;ids&lt;/span&gt;.&lt;span class="pl-c1"&gt;id&lt;/span&gt; &lt;span class="pl-k"&gt;=&lt;/span&gt; &lt;span class="pl-c1"&gt;blog_entry&lt;/span&gt;.&lt;span class="pl-c1"&gt;id&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;a href="https://datasette.simonwillison.net/simonwillisonblog?sql=with+entries_2022+as+%28%0D%0A++select%0D%0A++++id%2C%0D%0A++++embedding%0D%0A++from%0D%0A++++blog_entry_embeddings%0D%0A++where%0D%0A++++id+in+%28select+id+from+blog_entry+where+created+like+%272022%25%27%29%0D%0A%29%2C%0D%0Afaiss+as+%28%0D%0A++select%0D%0A++++faiss_agg%28%0D%0A++++++id%2C%0D%0A++++++embedding%2C%0D%0A++++++%28select+embedding+from+blog_entry_embeddings+where+id+%3D+%3Aid%29%2C%0D%0A++++++10%0D%0A++++%29+as+results%0D%0A++from%0D%0A++++entries_2022%0D%0A%29%2C%0D%0Aids+as+%28%0D%0A++select%0D%0A++++value+as+id%0D%0A++from%0D%0A++++json_each%28faiss.results%29%2C%0D%0A++++faiss%0D%0A%29%0D%0Aselect%0D%0A++blog_entry.id%2C%0D%0A++blog_entry.title%2C%0D%0A++blog_entry.created%0D%0Afrom%0D%0A++ids%0D%0A++join+blog_entry+on+ids.id+%3D+blog_entry.id&amp;amp;id=8214"&gt;You can try the query out here&lt;/a&gt;.&lt;/p&gt;
&lt;h4&gt;Releases this week&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/shot-scraper"&gt;shot-scraper&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/shot-scraper/releases/tag/1.1"&gt;1.1&lt;/a&gt; - (&lt;a href="https://github.com/simonw/shot-scraper/releases"&gt;25 releases total&lt;/a&gt;) - 2023-01-30
&lt;br /&gt;A command-line utility for taking automated screenshots of websites&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette-render-markdown"&gt;datasette-render-markdown&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/datasette-render-markdown/releases/tag/2.1.1"&gt;2.1.1&lt;/a&gt; - (&lt;a href="https://github.com/simonw/datasette-render-markdown/releases"&gt;10 releases total&lt;/a&gt;) - 2023-01-27
&lt;br /&gt;Datasette plugin for rendering Markdown&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette-youtube-embed"&gt;datasette-youtube-embed&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/datasette-youtube-embed/releases/tag/0.1"&gt;0.1&lt;/a&gt; - 2023-01-27
&lt;br /&gt;Turn YouTube URLs into embedded players in Datasette&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette-granian"&gt;datasette-granian&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/datasette-granian/releases/tag/0.1a0"&gt;0.1a0&lt;/a&gt; - 2023-01-20
&lt;br /&gt;Run Datasette using the Granian HTTP server&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette-faiss"&gt;datasette-faiss&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/datasette-faiss/releases/tag/0.2"&gt;0.2&lt;/a&gt; - (&lt;a href="https://github.com/simonw/datasette-faiss/releases"&gt;2 releases total&lt;/a&gt;) - 2023-01-19
&lt;br /&gt;Maintain a FAISS index for specified Datasette tables&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;TIL this week&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/git/rewrite-repo-remove-secrets"&gt;Rewriting a Git repo to remove secrets from the history&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/sqlite/pragma-function-list"&gt;SQLite pragma_function_list()&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/python/lxml-m1-mac"&gt;Installing lxml for Python on an M1/M2 Mac&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/sqlite/cte-values"&gt;Combining CTEs and VALUES in SQLite&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/plugins"&gt;plugins&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/weeknotes"&gt;weeknotes&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/shot-scraper"&gt;shot-scraper&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/colin-dellow"&gt;colin-dellow&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="plugins"/><category term="projects"/><category term="datasette"/><category term="weeknotes"/><category term="shot-scraper"/><category term="colin-dellow"/></entry><entry><title>datasette-scraper walkthrough on YouTube</title><link href="https://simonwillison.net/2023/Jan/29/datasette-scraper-walkthrough/#atom-tag" rel="alternate"/><published>2023-01-29T05:23:42+00:00</published><updated>2023-01-29T05:23:42+00:00</updated><id>https://simonwillison.net/2023/Jan/29/datasette-scraper-walkthrough/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.youtube.com/watch?v=zrSGnz7ErNI"&gt;datasette-scraper walkthrough on YouTube&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
datasette-scraper is Colin Dellow’s new plugin that turns Datasette into a powerful web scraping tool, with a web UI based on plugin-driven customizations to the Datasette interface. It’s really impressive, and this ten minute demo shows quite how much it is capable of: it can crawl sitemaps and fetch pages, caching them (using zstandard with optional custom dictionaries for extra compression) to speed up subsequent crawls... and you can add your own plugins to extract structured data from crawled pages and save it to a separate SQLite table!

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://datasette.io/plugins/datasette-scraper"&gt;datasette-scraper&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/plugins"&gt;plugins&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/scraping"&gt;scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/colin-dellow"&gt;colin-dellow&lt;/a&gt;&lt;/p&gt;



</summary><category term="plugins"/><category term="scraping"/><category term="datasette"/><category term="colin-dellow"/></entry><entry><title>Query Parquet files in SQLite</title><link href="https://simonwillison.net/2018/Jun/24/query-parquet-files-sqlite/#atom-tag" rel="alternate"/><published>2018-06-24T19:44:06+00:00</published><updated>2018-06-24T19:44:06+00:00</updated><id>https://simonwillison.net/2018/Jun/24/query-parquet-files-sqlite/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://cldellow.com/2018/06/22/sqlite-parquet-vtable.html"&gt;Query Parquet files in SQLite&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Colin Dellow built a SQLite virtual table extension that lets you query Parquet files directly using SQL. Parquet is interesting because it’s a columnar format that dramatically reduces the space needed to store tables with lots of duplicate column data—most CSV files, for example. Colin reports being able to shrink a 1291 MB CSV file from the Canadian census to an equivalent Parquet file weighing just 42MB (3% of the original)—then running a complex query against the data in just 60ms. I’d love to see someone get this extension working with Datasette.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/sqlite"&gt;sqlite&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/big-data"&gt;big-data&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/parquet"&gt;parquet&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/colin-dellow"&gt;colin-dellow&lt;/a&gt;&lt;/p&gt;



</summary><category term="sqlite"/><category term="big-data"/><category term="datasette"/><category term="parquet"/><category term="colin-dellow"/></entry></feed>