<?xml version="1.0" encoding="utf-8"?>
<feed xml:lang="en-us" xmlns="http://www.w3.org/2005/Atom"><title>Simon Willison's Weblog: facets</title><link href="http://simonwillison.net/" rel="alternate"/><link href="http://simonwillison.net/tags/facets.atom" rel="self"/><id>http://simonwillison.net/</id><updated>2018-05-20T23:42:18+00:00</updated><author><name>Simon Willison</name></author><entry><title>Datasette Facets</title><link href="https://simonwillison.net/2018/May/20/datasette-facets/#atom-tag" rel="alternate"/><published>2018-05-20T23:42:18+00:00</published><updated>2018-05-20T23:42:18+00:00</updated><id>https://simonwillison.net/2018/May/20/datasette-facets/#atom-tag</id><summary type="html">
    &lt;p&gt;&lt;a href="https://github.com/simonw/datasette/releases/tag/0.22"&gt;Datasette 0.22&lt;/a&gt; is out with the most significant new feature I’ve added since the initial release: &lt;strong&gt;faceted browse&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://github.com/simonw/datasette"&gt;Datasette&lt;/a&gt; lets you deploy an instant web UI and JSON API for any SQLite database. &lt;a href="https://github.com/simonw/csvs-to-sqlite"&gt;csvs-to-sqlite&lt;/a&gt; makes it easy to create a SQLite database out of any collection of CSV files. &lt;a href="https://publish.datasettes.com/"&gt;Datasette Publish&lt;/a&gt; is a web app that can run these combined tools against CSV files you upload from your browser. And now the new &lt;a href="https://datasette.readthedocs.io/en/latest/facets.html"&gt;Datasette Facets&lt;/a&gt; feature lets you explore any CSV file using faceted navigation with a couple of clicks.&lt;/p&gt;
&lt;h3&gt;&lt;a id="Exploring_characters_from_Marvel_comics_6"&gt;&lt;/a&gt;Exploring characters from Marvel comics&lt;/h3&gt;
&lt;p&gt;Let’s use facets to explore every character in the Marvel Universe.&lt;/p&gt;
&lt;p&gt;FiveThirtyEight have published &lt;a href="https://github.com/fivethirtyeight/data/tree/master/comic-characters"&gt;a CSV file&lt;/a&gt; of 16,376 characters from Marvel comics, scraped from Wikia as part of the research for their 2014 story &lt;a href="https://fivethirtyeight.com/features/women-in-comic-books/"&gt;Comic Books Are Still Made By Men, For Men And About Men&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Here’s that CSV file &lt;a href="https://fivethirtyeight.datasettes.com/fivethirtyeight-5de27e3/comic-characters%2Fmarvel-wikia-data"&gt;loaded into the latest version of Datasette&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img style="max-width: 100%" src="https://static.simonwillison.net/static/2018/marvel-facets.gif" alt="Marvel characters explored using Datasette Facets" /&gt;&lt;/p&gt;
&lt;p&gt;We start by applying the &lt;a href="https://fivethirtyeight.datasettes.com/fivethirtyeight-5de27e3/comic-characters%2Fmarvel-wikia-data?_facet=ID&amp;amp;_facet=ALIGN&amp;amp;_facet=SEX"&gt;identity status, alignment and gender facets&lt;/a&gt;. Then we filter down to just the &lt;a href="https://fivethirtyeight.datasettes.com/fivethirtyeight-5de27e3/comic-characters%2Fmarvel-wikia-data?_facet=ID&amp;amp;_facet=ALIGN&amp;amp;_facet=SEX&amp;amp;ID=Public+Identity&amp;amp;ALIGN=Bad+Characters"&gt;bad characters with a public identity&lt;/a&gt;, and apply the &lt;a href="https://fivethirtyeight.datasettes.com/fivethirtyeight-5de27e3/comic-characters%2Fmarvel-wikia-data?_facet=ID&amp;amp;_facet=ALIGN&amp;amp;_facet=SEX&amp;amp;ID=Public+Identity&amp;amp;ALIGN=Bad+Characters&amp;amp;_facet=EYE#facet-EYE"&gt;eye colour facet&lt;/a&gt;. Now we can filter to just the &lt;a href="https://fivethirtyeight.datasettes.com/fivethirtyeight-5de27e3/comic-characters%2Fmarvel-wikia-data?_facet=ID&amp;amp;_facet=ALIGN&amp;amp;_facet=SEX&amp;amp;ID=Public+Identity&amp;amp;ALIGN=Bad+Characters&amp;amp;_facet=EYE&amp;amp;EYE=Yellow+Eyes"&gt;20 bad characters with a public identity and yellow eyes&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;At each stage along the way we could see numerical summaries of the other facets. That’s a pretty sophisticated piece of analysis we’ve been able to run with just a few clicks (and it works responsively on mobile as well).&lt;/p&gt;
&lt;p&gt;I’ve published a full copy of everything else in the FiveThirtyEight data repository, which means you can find plenty more examples of facets in action at &lt;a href="https://fivethirtyeight.datasettes.com/"&gt;https://fivethirtyeight.datasettes.com/&lt;/a&gt; - one example: &lt;a href="https://fivethirtyeight.datasettes.com/fivethirtyeight-5de27e3/antiquities-act%2Factions_under_antiquities_act?_facet=current_agency&amp;amp;_facet=states&amp;amp;_facet=action&amp;amp;_facet=pres_or_congress"&gt;Actions under the Antiquities Act&lt;/a&gt;, faceted by states, pres_or_congress, action and current_agency.&lt;/p&gt;
&lt;h3&gt;&lt;a id="Analyzing_GSA_IT_Standards_with_Datasette_Publish_22"&gt;&lt;/a&gt;Analyzing GSA IT Standards with Datasette Publish&lt;/h3&gt;
&lt;p&gt;The US government’s General Services Administration have a GitHub account, and they use it to publish &lt;a href="https://github.com/GSA/data"&gt;a repository of assorted data&lt;/a&gt; as CSVs.&lt;/p&gt;
&lt;p&gt;Let’s take one of those CSVS and analyze it with Datasette Facets, using the &lt;a href="https://publish.datasettes.com/"&gt;Datasette Publish&lt;/a&gt; web app to upload and process the CSV.&lt;/p&gt;
&lt;p&gt;We’ll start with the &lt;a href="https://github.com/GSA/data/blob/master/enterprise-architecture/it-standards.csv"&gt;it-standards.csv&lt;/a&gt; file, downloaded from their repo. We’ll upload it to Datasette Publish and add some associated metadata:&lt;/p&gt;
&lt;p&gt;&lt;img style="max-width: 100%" src="https://static.simonwillison.net/static/2018/gsa-facets.gif" alt="GSA IT standards published using Datasette Publish" /&gt;&lt;/p&gt;
&lt;p&gt;Here’s the result, with &lt;a href="https://datasette-tcuhqhqpud.now.sh/csv-data-41a17b5/it-standards?_facet=Status&amp;amp;_facet=Deployment+Type"&gt;the Status and Deployment Type facets applied&lt;/a&gt;. And here’s a query showing just &lt;a href="https://datasette-tcuhqhqpud.now.sh/csv-data-41a17b5/it-standards?_facet=Status&amp;amp;_facet=Deployment+Type&amp;amp;Deployment+Type=SaaS&amp;amp;Status=Approved+-+Preferred"&gt;SaaS tools with status Approved - Preferred&lt;/a&gt;.&lt;/p&gt;
&lt;h3&gt;&lt;a id="European_Power_Stations_34"&gt;&lt;/a&gt;European Power Stations&lt;/h3&gt;
&lt;p&gt;The &lt;a href="https://open-power-system-data.org/"&gt;Open Power System Data project&lt;/a&gt; publishes data about electricity systems. They publish data in a number of formats, including SQLite databases. Let’s take their &lt;a href="https://data.open-power-system-data.org/conventional_power_plants/"&gt;conventional_power_plants.sqlite file&lt;/a&gt; and explore it with Datasette. With Datasette installed, run the following commands in your terminal:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;wget https://data.open-power-system-data.org/conventional_power_plants/2018-02-27/conventional_power_plants.sqlite
datasette conventional_power_plants.sqlite
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This will start Datasette running at &lt;code&gt;http://127.0.0.1:8001/&lt;/code&gt; ready for you to explore the data.&lt;/p&gt;
&lt;p&gt;Next we can publish the SQLite database directly to the internet using the &lt;code&gt;datasette publish&lt;/code&gt; command-line tool:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ datasette publish now conventional_power_plants.sqlite \
    --source=&amp;quot;Open Power System Data. 2018. Data Package Conventional power plants. Version 2018-02-27&amp;quot; \
    --source_url=&amp;quot;https://data.open-power-system-data.org/conventional_power_plants/2018-02-27/&amp;quot; \
    --title=&amp;quot;Conventional power plants&amp;quot; \
    --branch=master
&amp;gt; Deploying /private/var/folders/jj/fngnv0810tn2lt_kd3911pdc0000gp/T/tmpufvxrzgp/datasette under simonw
&amp;gt; https://datasette-tgngfjddix.now.sh [in clipboard] (sfo1) [11s]
&amp;gt; Synced 3 files (1.28MB) [11s]
&amp;gt; Building…
&amp;gt; ▲ docker build
Sending build context to Docker daemon 1.343 MBkB
&amp;gt; Step 1/7 : FROM python:3
&amp;gt; 3: Pulling from library/python
&amp;gt; 3d77ce4481b1: Already exists
&amp;gt; 534514c83d69: Already exists
...
&amp;gt; Successfully built da7ac223e8aa
&amp;gt; Successfully tagged registry.now.systems/now/3d6d318f0da06d3ea1bc97417c7dc484aaac9026:latest
&amp;gt; ▲ Storing image
&amp;gt; Build completed
&amp;gt; Verifying instantiation in sfo1
&amp;gt; [0] Serve! files=('conventional_power_plants.sqlite',) on port 8001
&amp;gt; [0] [2018-05-20 22:51:51 +0000] [1] [INFO] Goin' Fast @ http://0.0.0.0:8001
&amp;gt; [0] [2018-05-20 22:51:51 +0000] [1] [INFO] Starting worker [1]
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Finally, let’s give it a nicer URL using &lt;code&gt;now alias&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt; now alias https://datasette-tgngfjddix.now.sh conventional-power-plants.now.sh
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The result can now be seen at &lt;a href="https://conventional-power-plants.now.sh/"&gt;https://conventional-power-plants.now.sh/&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Here’s every conventional power plant in Europe &lt;a href="https://conventional-power-plants.now.sh/conventional_power_plants-e3c301c/conventional_power_plants_EU?_facet=country&amp;amp;_facet=energy_source&amp;amp;_facet=technology"&gt;faceted by country, energy source and technology&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img style="max-width: 100%" src="https://static.simonwillison.net/static/2018/power-plant-facets.png" alt="Power Plant Facets" /&gt;&lt;/p&gt;
&lt;h3&gt;&lt;a id="Implementation_notes_80"&gt;&lt;/a&gt;Implementation notes&lt;/h3&gt;
&lt;p&gt;I love faceted search engines. One of my first approaches to understanding any new large dataset has long been to throw it into a faceted search engine and see what comes out. In the past I’ve built them using &lt;a href="https://simonwillison.net/tags/solr/"&gt;Solr&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/elasticsearch/"&gt;Elasticsearch&lt;/a&gt;, &lt;a href="https://simonwillison.net/2017/Oct/5/django-postgresql-faceted-search/"&gt;PostgreSQL&lt;/a&gt; and even &lt;a href="https://whoosh.readthedocs.io/en/latest/facets.html"&gt;Whoosh&lt;/a&gt;. I guess it was inevitable that I’d try to build one with SQLite.&lt;/p&gt;
&lt;p&gt;You can follow the development of Datasette Facets in the now-closed &lt;a href="https://github.com/simonw/datasette/issues/255"&gt;issue #255&lt;/a&gt; on GitHub.&lt;/p&gt;
&lt;p&gt;Facets are requested by appending one or more &lt;code&gt;?_facet=colname&lt;/code&gt; parameters to the URL. This causes Datasette to run the following SQL query for each of those specified columns:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;select colname as value, count(*) as count
from tablename where (current where clauses)
group by colname order by count desc limit 31
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For large tables, this could get expensive. Datasette supports time limits for SQLite queries, and facets are given up to 200ms (by default, this limit &lt;a href="https://datasette.readthedocs.io/en/latest/config.html#facet-time-limit-ms"&gt;can be customized&lt;/a&gt;) to finish executing. If the query doesn’t complete in the given time the user sees a warning that the facet could not be displayed.&lt;/p&gt;
&lt;p&gt;We ask for 31 values in the limit clause even though we only display 30. This lets us detect if there are more values available and show a &lt;code&gt;...&lt;/code&gt; indicator to let the user know that the facets were truncated.&lt;/p&gt;
&lt;p&gt;Datasette also suggests facets that you might want to apply. This is implemented using another query, this time run against every column that is not yet being used as a facet. If a table has 20 columns this means 20 queries, so they run with an even tighter 50ms time limit. The query looks like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;select distinct colname
from tablename where (current where clauses)
limit 31
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;All we are doing here is trying to determine if the column in question has less than 30 unique values. The limit clause here means that if you run this query against a column with entirely distinct values (the primary key for example) the query will terminate extremely quickly - after it has found just the first 31 values.&lt;/p&gt;
&lt;p&gt;Once the query has executed, we count the distinct values and check to see if this column, when used as a facet:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Will return 30 or less unique options&lt;/li&gt;
&lt;li&gt;Will return more than one unique option&lt;/li&gt;
&lt;li&gt;Will return less unique options than the current total number of filtered rows&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If the query takes longer than 50ms we terminate it and do not suggest that column as a potential facet.&lt;/p&gt;
&lt;h3&gt;&lt;a id="Facets_via_JSON_112"&gt;&lt;/a&gt;Facets via JSON&lt;/h3&gt;
&lt;p&gt;As with everything in Datasette, the facets you can view in your browser are also available as part of the JSON API (which ships with CORS headers so you can easily fetch data from JavaScript running in a browser on any web page).&lt;/p&gt;
&lt;p&gt;To get back JSON, add &lt;code&gt;.json&lt;/code&gt; to the path (before the first &lt;code&gt;?&lt;/code&gt;). Here’s that power plants example returned as JSON: &lt;a href="https://conventional-power-plants.now.sh/conventional_power_plants-e3c301c/conventional_power_plants_EU.json?_facet=country&amp;amp;_facet=energy_source&amp;amp;_facet=technology"&gt;https://conventional-power-plants.now.sh/conventional_power_plants-e3c301c/conventional_power_plants_EU.json?_facet=country&amp;amp;_facet=energy_source&amp;amp;_facet=technology&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Please &lt;a href="https://twitter.com/simonw"&gt;let me know&lt;/a&gt; if you build something interesting with Datasette Facets!&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/facets"&gt;facets&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/facetedsearch"&gt;facetedsearch&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="facets"/><category term="projects"/><category term="facetedsearch"/><category term="datasette"/></entry><entry><title>Digg Search: Now With 99.987% Less Suck</title><link href="https://simonwillison.net/2009/Apr/10/digg/#atom-tag" rel="alternate"/><published>2009-04-10T22:17:57+00:00</published><updated>2009-04-10T22:17:57+00:00</updated><id>https://simonwillison.net/2009/Apr/10/digg/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="http://blog.digg.com/?p=653"&gt;Digg Search: Now With 99.987% Less Suck&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Really nice implementation of faceted search, still using Lucene and Solr under the hood.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/digg"&gt;digg&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/facets"&gt;facets&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/full-text-search"&gt;full-text-search&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/lucene"&gt;lucene&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/search"&gt;search&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/solr"&gt;solr&lt;/a&gt;&lt;/p&gt;



</summary><category term="digg"/><category term="facets"/><category term="full-text-search"/><category term="lucene"/><category term="search"/><category term="solr"/></entry><entry><title>FacetMaps</title><link href="https://simonwillison.net/2002/Jul/28/facetMaps/#atom-tag" rel="alternate"/><published>2002-07-28T12:57:24+00:00</published><updated>2002-07-28T12:57:24+00:00</updated><id>https://simonwillison.net/2002/Jul/28/facetMaps/#atom-tag</id><summary type="html">
    &lt;p&gt;Yet another interesting take on &lt;acronym title="eXtensible Markup Language"&gt;XML&lt;/acronym&gt; metadata representations: &lt;a href="http://facetmap.com/"&gt;FacetMaps&lt;/a&gt;. A facet map (as I understand it) is a way of combining facets with hierarchies, best explained by the excellent interactive &lt;a href="http://facetmap.com:8080/browse.jsp"&gt;three minute concept intro&lt;/a&gt; on the site. One of the main contrasts to &lt;acronym title="XFML Faceted Metadata Language"&gt;XFML&lt;/acronym&gt; is that in a Facet Map Facets, rather than Topics, are the principle categorisation element. A resource in a Facet Map is linked directly to one or more facets, rather than going through a topic. The XML format is pretty simple (a lot simpler than &lt;acronym title="XML Topic Maps"&gt;XTM&lt;/acronym&gt; and &lt;acronym title="XFML Faceted Metadata Language"&gt;XFML&lt;/acronym&gt;) so I might have a go at a &lt;acronym title="PHP: Hypertext Preprocessor"&gt;PHP&lt;/acronym&gt; implementation at some point.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/facets"&gt;facets&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/xfml"&gt;xfml&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="facets"/><category term="xfml"/></entry><entry><title>Facets understood</title><link href="https://simonwillison.net/2002/Jul/28/facetsUnderstood/#atom-tag" rel="alternate"/><published>2002-07-28T11:40:57+00:00</published><updated>2002-07-28T11:40:57+00:00</updated><id>https://simonwillison.net/2002/Jul/28/facetsUnderstood/#atom-tag</id><summary type="html">
    &lt;p&gt;And suddenly I understand faceted metadata. Sometimes all you need for that final moment of insight is a good example, and Peter Van Djick's &lt;a href="http://poorbuthappy.com/colombia/topics.php" title="List of topics on this website"&gt;Columbia Guide Site Map&lt;/a&gt; is just what I needed. A facet is simply a "flat", mutually exclusive (at least as far as the &lt;a href="http://www.xfml.org/spec/"&gt;XFML specification&lt;/a&gt; is concerned) way of categorising a topic - it can be described as a bottom-up method of categorisation rather than the more common hierarchical top-down approach (as seen on the &lt;acronym title="Open Directory Project"&gt;ODP&lt;/acronym&gt;) which seeks to assign all topics as sub-topics of something else. Peter writes in &lt;a href="http://www.xfml.org/#3"&gt;XFML Background and Concepts&lt;/a&gt; that &lt;q cite="http://www.xfml.org/#3"&gt;Faceted taxonomies are generally more powerful for websites than classic hierarchical taxonomies&lt;/q&gt; - this seems to make a great deal of sense, and it will be interesting to see this demonstrated by &lt;acronym title="eXtensible Faceted Metedata Language"&gt;XFML&lt;/acronym&gt; in the near future.&lt;/p&gt;

&lt;p&gt;Update: Incidentally, IAWiki have an excellent page describing &lt;a href="http://www.iawiki.net/FacetedClassification"&gt;FacetedClassification&lt;/a&gt;.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/facets"&gt;facets&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/peter-van-dijck"&gt;peter-van-dijck&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/xfml"&gt;xfml&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="facets"/><category term="peter-van-dijck"/><category term="xfml"/></entry></feed>