<?xml version="1.0" encoding="utf-8"?>
<feed xml:lang="en-us" xmlns="http://www.w3.org/2005/Atom"><title>Simon Willison's Weblog: visualization</title><link href="http://simonwillison.net/" rel="alternate"/><link href="http://simonwillison.net/tags/visualization.atom" rel="self"/><id>http://simonwillison.net/</id><updated>2025-09-09T18:11:32+00:00</updated><author><name>Simon Willison</name></author><entry><title>My review of Claude's new Code Interpreter, released under a very confusing name</title><link href="https://simonwillison.net/2025/Sep/9/claude-code-interpreter/#atom-tag" rel="alternate"/><published>2025-09-09T18:11:32+00:00</published><updated>2025-09-09T18:11:32+00:00</updated><id>https://simonwillison.net/2025/Sep/9/claude-code-interpreter/#atom-tag</id><summary type="html">
    &lt;p&gt;Today on the Anthropic blog: &lt;strong&gt;&lt;a href="https://www.anthropic.com/news/create-files"&gt;Claude can now create and edit files&lt;/a&gt;&lt;/strong&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Claude can now create and edit Excel spreadsheets, documents, PowerPoint slide decks, and PDFs directly in &lt;a href="https://claude.ai/"&gt;Claude.ai&lt;/a&gt; and the desktop app. [...]&lt;/p&gt;
&lt;p&gt;File creation is now available as a preview for Max, Team, and Enterprise plan users. Pro users will get access in the coming weeks.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Then right at the &lt;em&gt;very end&lt;/em&gt; of their post:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;This feature gives Claude internet access to create and analyze files, which may put your data at risk. Monitor chats closely when using this feature. &lt;a href="https://support.anthropic.com/en/articles/12111783-create-and-edit-files-with-claude"&gt;Learn more&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;And tucked away half way down their &lt;a href="https://support.anthropic.com/en/articles/12111783-create-and-edit-files-with-claude"&gt;Create and edit files with Claude&lt;/a&gt; help article:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;With this feature, Claude can also do more advanced data analysis and data science work. Claude can create Python scripts for data analysis. Claude can create data visualizations in image files like PNG. You can also upload CSV, TSV, and other files for data analysis and visualization.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Talk about &lt;a href="https://www.merriam-webster.com/wordplay/bury-the-lede-versus-lead"&gt;burying the lede&lt;/a&gt;... this is their version of &lt;a href="https://simonwillison.net/tags/code-interpreter/"&gt;ChatGPT Code Interpreter&lt;/a&gt;, my all-time favorite feature of ChatGPT!&lt;/p&gt;

&lt;p&gt;Claude can now write and execute custom Python (and Node.js) code in a server-side sandbox and use it to process and analyze data.&lt;/p&gt;
&lt;p&gt;In a particularly egregious example of AI companies being terrible at naming features, the official name for this one really does appear to be &lt;strong&gt;Upgraded file creation and analysis&lt;/strong&gt;. Sigh.&lt;/p&gt;
&lt;p&gt;This is quite a confusing release, because Claude &lt;em&gt;already&lt;/em&gt; had a variant of this feature, &lt;a href="https://www.anthropic.com/news/analysis-tool"&gt;released in October 2024&lt;/a&gt; with the weak but more sensible name &lt;strong&gt;Analysis tool&lt;/strong&gt;. Here are &lt;a href="https://simonwillison.net/2024/Oct/24/claude-analysis-tool/"&gt;my notes from when that came out&lt;/a&gt;. That tool worked by generating and executing JavaScript in the user's own browser.&lt;/p&gt;
&lt;p&gt;The new tool works entirely differently. It's much closer in implementation to OpenAI's Code Interpreter: Claude now has access to a server-side container environment in which it can run shell commands and execute Python and Node.js code to manipulate data and both read and generate files.&lt;/p&gt;
&lt;p&gt;It's worth noting that Anthropic have a similar feature in their API called &lt;a href="https://docs.anthropic.com/en/docs/agents-and-tools/tool-use/code-execution-tool"&gt;Code execution tool&lt;/a&gt;, but today is the first time end-users of Claude have been able to execute arbitrary code in a server-side container.&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Sep/9/claude-code-interpreter/#switching-it-on-in-settings-features"&gt;Switching it on in settings/features&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Sep/9/claude-code-interpreter/#exploring-the-environment"&gt;Exploring the environment&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Sep/9/claude-code-interpreter/#starting-with-something-easy"&gt;Starting with something easy&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Sep/9/claude-code-interpreter/#something-much-harder-recreating-the-ai-adoption-chart"&gt;Something much harder: recreating the AI adoption chart&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Sep/9/claude-code-interpreter/#prompt-injection-risks"&gt;Prompt injection risks&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Sep/9/claude-code-interpreter/#my-verdict-on-claude-code-interpreter-so-far"&gt;My verdict on Claude Code Interpreter so far&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Sep/9/claude-code-interpreter/#ai-labs-find-explaining-this-feature-incredibly-difficult"&gt;AI labs find explaining this feature incredibly difficult&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 id="switching-it-on-in-settings-features"&gt;Switching it on in settings/features&lt;/h4&gt;
&lt;p&gt;I have a Pro plan, but I found the setting to enable this feature on the &lt;a href="https://claude.ai/settings/features"&gt;claude.ai/settings/features&lt;/a&gt; page. It's possible my account was granted early access without me realizing, since the Pro plan isn't supposed to have it yet:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/claude-analysis-toggle.jpg" alt="Experimental. Preview and provide feedback on upcoming enhancements to our platform. Please note: experimental features might influence Claude’s behavior and some interactions may differ from the standard experience. Analysis tool:  Claude can write and run code to process data, run analysis, and produce data visualizations in real time. Upgraded file creation and analysis: Allow Claude to create and edit docs, spreadsheets, presentations, PDFs, and data reports on web and desktop. Does not support versioning or remixing of Artifacts. This feature gives Claude network access to create and analyze files, which has security risks. Monitor chats closely when using this feature. The two features each have a toggle - the toggle for the file creation tool is turned on." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;Interestingly, toggling on the "Upgraded file creation and analysis" option toggles off the "Analysis tool" option, and vice versa. You can't have both at the same time.&lt;/p&gt;
&lt;p&gt;I guess this is to help reduce confusion? It surprised me though as Claude can sometimes use the existing analysis tool to help it in the creation of artifacts, as I &lt;a href="https://simonwillison.net/2025/Jun/25/ai-powered-apps-with-claude/"&gt;noted here&lt;/a&gt; when I wrote about the Claudeception feature they added in June.&lt;/p&gt;
&lt;h4 id="exploring-the-environment"&gt;Exploring the environment&lt;/h4&gt;
&lt;p&gt;I started by using the tool to explore the tool. Here's my &lt;a href="https://claude.ai/share/d796a596-380c-4db2-847d-2eeedc620bc0"&gt;first transcript&lt;/a&gt; where I simply prompted the following to see what it could figure out about itself:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;Tell me everything you can about your shell and Python environments&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Here's part of the output:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/claude-files-1.jpg" alt="My prompt, then Claude shows: I'll explore my shell and Python environments to give you a comprehensive overview. • Getting system information and OS details • Checking shell type and version • Getting shell info and bash version • Checking Python version and locations - this one is expandedd to show: bash python3 -version &amp;amp;&amp;amp; which python3 &amp;amp;&amp;amp; which python Output Python 3.12.3 /usr/bin/python3 /usr/bin/python" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;And the key details it returned, each of which I've verified as true by checking the displayed output of the relevant commands:&lt;/p&gt;
&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;OS&lt;/strong&gt;: Ubuntu 24.04.2 LTS (Noble) running on Linux kernel 4.4.0-&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Architecture&lt;/strong&gt;: x86_64&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Shell&lt;/strong&gt;: GNU Bash 5.2.21(1)-release&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Current User&lt;/strong&gt;: root (with access to /home/claude directory)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Working Directory&lt;/strong&gt;: Currently in root (/), but /home/claude is the main workspace&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Python Version&lt;/strong&gt;: Python 3.12.3&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Python Executable&lt;/strong&gt;: /usr/bin/python3&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Package Manager&lt;/strong&gt;: pip 24.0&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Python Path&lt;/strong&gt;: Standard system locations including:
&lt;ul&gt;
&lt;li&gt;/usr/lib/python3.12 (standard library)&lt;/li&gt;
&lt;li&gt;/usr/local/lib/python3.12/dist-packages (local packages)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Disk Space&lt;/strong&gt;: 4.9GB total, with 4.6GB available&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Memory&lt;/strong&gt;: 9.0GB RAM available, minimal usage currently&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Node.js&lt;/strong&gt;: v18.19.1 available&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;npm&lt;/strong&gt;: Configured with global packages in &lt;code&gt;/home/claude/.npm-global&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;So we have an Ubuntu 24.04.2 container with 9GB of RAM, ~5GB of disk, Python 3.12.3, and Node.js v18.19.1.&lt;/p&gt;
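&lt;p&gt;For the record, most of those facts are easy to double-check from inside any Python sandbox using only standard-library calls. A minimal sketch (the specific numbers will of course differ by environment):&lt;/p&gt;

```python
import platform
import shutil
import sys

# Collect the same environment facts Claude reported, stdlib only.
info = {
    "os": platform.platform(),                    # distro/kernel details
    "architecture": platform.machine(),           # e.g. x86_64
    "python_version": platform.python_version(),  # e.g. 3.12.3
    "python_executable": sys.executable,          # e.g. /usr/bin/python3
    "disk_total_gb": round(shutil.disk_usage("/").total / 1e9, 1),
    "disk_free_gb": round(shutil.disk_usage("/").free / 1e9, 1),
}

for key, value in info.items():
    print(f"{key}: {value}")
```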
&lt;p&gt;How about network access? I tried the following:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;Can you access the Internet? Can you run curl? How about using Python http to fetch https://latest.datasette.io/fixtures.json ?&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Again, you can see the &lt;a href="https://claude.ai/share/d796a596-380c-4db2-847d-2eeedc620bc0"&gt;detailed output in the transcript&lt;/a&gt;. Claude tried &lt;code&gt;https://latest.datasette.io/fixtures.json&lt;/code&gt; and then &lt;code&gt;https://httpbin.org/json&lt;/code&gt; and got a 403 forbidden error for both, then &lt;code&gt;https://google.com&lt;/code&gt; and got this curious result:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;curl -s -I https://google.com&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Output:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;HTTP/1.1 200 OK
date: Tue, 09 Sep 2025 16:02:17 GMT
server: envoy

HTTP/2 403 
content-length: 13
content-type: text/plain
date: Tue, 09 Sep 2025 16:02:17 GMT
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Claude did note that it can still use the &lt;code&gt;web_fetch&lt;/code&gt; and &lt;code&gt;web_search&lt;/code&gt; tools independently of that container environment, so it should be able to fetch web content using tools running outside of the container and then write it to a file there.&lt;/p&gt;
&lt;p&gt;On a hunch I tried this:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;Run pip install sqlite-utils&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;... and it worked! Claude can &lt;code&gt;pip install&lt;/code&gt; additional packages from &lt;a href="https://pypi.org/"&gt;PyPI&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;A little more poking around revealed the following relevant environment variables:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;HTTPS_PROXY=http://21.0.0.167:15001
no_proxy=localhost,127.0.0.1,169.254.169.254,metadata.google.internal,*.svc.cluster.local,*.local,*.googleapis.com,*.google.com
NO_PROXY=localhost,127.0.0.1,169.254.169.254,metadata.google.internal,*.svc.cluster.local,*.local,*.googleapis.com,*.google.com
https_proxy=http://21.0.0.167:15001
http_proxy=http://21.0.0.167:15001
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;So based on an earlier HTTP header there's an &lt;a href="https://www.envoyproxy.io/"&gt;Envoy proxy&lt;/a&gt; running at that address which apparently enforces a strict allowlist.&lt;/p&gt;
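&lt;p&gt;Those environment variables are all Python's standard library needs: &lt;code&gt;urllib&lt;/code&gt; (and libraries built on it, like &lt;code&gt;requests&lt;/code&gt;) routes traffic through the proxy automatically. A minimal sketch, re-using the addresses from the output above:&lt;/p&gt;

```python
import os
import urllib.request

# Simulate the proxy environment Claude reported (the address is the one
# from its output; any other sandbox would show different values).
os.environ["https_proxy"] = "http://21.0.0.167:15001"
os.environ["http_proxy"] = "http://21.0.0.167:15001"
os.environ["no_proxy"] = "localhost,127.0.0.1"

# Python's stdlib HTTP machinery picks these up automatically, which is
# why every HTTP request from inside the container goes via Envoy.
proxies = urllib.request.getproxies()
print(proxies["https"])
print(proxies["http"])
```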
&lt;p&gt;I later noticed that &lt;a href="https://support.anthropic.com/en/articles/12111783-create-and-edit-files-with-claude#h_0ee9d698a1"&gt;the help page&lt;/a&gt; includes a full description of what's on that allowlist:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Anthropic Services (Explicit)&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;api.anthropic.com, statsig.anthropic.com&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Version Control&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;github.com&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Package Managers - JavaScript/Node&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;NPM:&lt;/strong&gt; registry.npmjs.org, npmjs.com, npmjs.org&lt;br /&gt;
&lt;strong&gt;Yarn:&lt;/strong&gt; yarnpkg.com, registry.yarnpkg.com&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Package Managers - Python&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;pypi.org, files.pythonhosted.org, pythonhosted.org&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;So it looks like we have a &lt;em&gt;very&lt;/em&gt; similar system to ChatGPT Code Interpreter. The key differences are that Claude's system can install additional Python packages and has Node.js pre-installed.&lt;/p&gt;
&lt;p&gt;One important limitation from the docs:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The maximum file size is 30MB per file for both uploads and downloads.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The ChatGPT &lt;a href="https://help.openai.com/en/articles/8555545-file-uploads-faq"&gt;limit here&lt;/a&gt; is 512MB. I've often uploaded 100MB+ SQLite database files to ChatGPT, so I'm a little disappointed by this lower limit for Claude.&lt;/p&gt;
&lt;h4 id="starting-with-something-easy"&gt;Starting with something easy&lt;/h4&gt;
&lt;p&gt;I grabbed a copy of the SQLite database behind &lt;a href="https://til.simonwillison.net/"&gt;my TILs website&lt;/a&gt; (21.9MB &lt;a href="https://s3.amazonaws.com/til.simonwillison.net/tils.db"&gt;from here&lt;/a&gt;) and uploaded it to Claude, then prompted:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;Use your Python environment to explore this SQLite database and generate a PDF file containing a join diagram of all the tables&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Here's &lt;a href="https://claude.ai/share/f91a95be-0fb0-4e14-b46c-792b47117a3d"&gt;that conversation&lt;/a&gt;. It did an OK job, producing both &lt;a href="https://static.simonwillison.net/static/2025/til_database_join_diagram.pdf"&gt;the PDF&lt;/a&gt; I asked for and a PNG equivalent, shown here since created files are not available in shared chats:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/til_database_join_diagram.jpg" alt="Each table gets a box with a name and columns. A set of lines is overlaid which doesn't quite seem to represent the joins in a useful fashion." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;This isn't an ideal result - those join lines are difficult to follow - but I'm confident I could get from here to something I liked with only a little more prompting. The important thing is that the system clearly works, and can analyze data in uploaded SQLite files and use them to produce images and PDFs.&lt;/p&gt;
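&lt;p&gt;The raw material for a diagram like this is easy to extract with Python's built-in &lt;code&gt;sqlite3&lt;/code&gt; module. Here's a sketch against a tiny hypothetical schema (not the actual tables in my tils.db file):&lt;/p&gt;

```python
import sqlite3

# A throwaway in-memory database standing in for the uploaded SQLite file.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE topic (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE til (
        id INTEGER PRIMARY KEY,
        title TEXT,
        topic_id INTEGER REFERENCES topic(id)
    );
""")

# List every table, then ask SQLite which columns reference which tables -
# exactly the edges a join diagram needs to draw.
tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'")]
edges = []
for table in tables:
    for fk in conn.execute(f"PRAGMA foreign_key_list({table})"):
        # fk fields: id, seq, referenced table, from-column, to-column, ...
        edges.append((table, fk[3], fk[2], fk[4]))
        print(f"{table}.{fk[3]} references {fk[2]}.{fk[4]}")
```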
&lt;h4 id="something-much-harder-recreating-the-ai-adoption-chart"&gt;Something much harder: recreating the AI adoption chart&lt;/h4&gt;
&lt;p&gt;Thankfully I have a fresh example of a really challenging ChatGPT Code Interpreter task from just last night, which I described in great detail in &lt;a href="https://simonwillison.net/2025/Sep/9/apollo-ai-adoption/"&gt;Recreating the Apollo AI adoption rate chart with GPT-5, Python and Pyodide&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Short version: I took &lt;a href="https://www.apolloacademy.com/ai-adoption-rate-trending-down-for-large-companies/"&gt;this chart&lt;/a&gt; from Apollo Global and asked ChatGPT to recreate it based on a screenshot and an uploaded XLSX file.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/apollo-ai-chart.jpg" alt="AI adoption rates starting to decline for larger firms. A chart of AI adoption rate by firm size. Includes lines for 250+, 100-249, 50-99, 20-49, 10-19, 5-8 and 1-4 sized organizations. Chart starts in November 2023 with percentages ranging from 3 to 5, then all groups grow through August 2025 albeit with the 250+ group having a higher score than the others. That 25+ group peaks in Jul5 2025 at around 14% and then appears to slope slightly downwards to 12% by August. Some of the other lines also start to tip down, though not as much." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;This time I skipped the bit where I had ChatGPT hunt down the original data and jumped straight to the "recreate this chart" step. I used the exact same prompt as I provided to ChatGPT:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;Use this data to recreate this chart using python&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;And uploaded the same two files - &lt;a href="https://static.simonwillison.net/static/cors-allow/2025/Employment-Size-Class-Sep-2025.xlsx"&gt;this XLSX file&lt;/a&gt; and the &lt;a href="https://static.simonwillison.net/static/2025/apollo-ai-chart.jpg"&gt;screenshot of the original chart&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/claude-files-2.jpg" alt="Claude UI - prompt is &amp;quot;Use this data to recreate this chart using python&amp;quot; - I've selected Sonnet 4 and given it both the XLSX and the screenshot as attachments." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;Claude wrote and ran a bunch of Python code and produced this:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/claude-attempt-1.jpg" alt="The chart has jagged lines, not smooth ones, but they do appear to show the correct numbers. The title and axis are positioned in ways that overlap other text a bit." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;That's not a bad first attempt - we'll sort out the jagged lines in a moment. With ChatGPT I later told it to plot a six-survey rolling average (since that's what Apollo had done with theirs), so I ran the exact same prompt against Claude:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;Plot it as a six survey rolling average&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/claude-attempt-2.jpg" alt="The lines are more even now, and appear to match the numbers reported by Apollo. They still have a jagged look to them though." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;So I told Claude:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;Make those lines less jagged&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This did not do what I wanted:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/claude-attempt-3.jpg" alt="The chart looks mostly the same. The lines are a tiny bit thicker." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;So I gave it a bit of a hint:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;No not like that, I wanted the lines to be a smooth curve like in the original screenshot - you shouldn't have to use scipy for this or manipulate the data, I'm sure matplotlib can do this already&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;A little bit of sycophancy:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;You're absolutely right! Matplotlib has built-in curve smoothing. I was overcomplicating it. Let me create a version that uses matplotlib's natural curve smoothing capabilities. [...]&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This time it used the &lt;code&gt;antialiased=True&lt;/code&gt; and &lt;code&gt;solid_capstyle='round'&lt;/code&gt; options in matplotlib, which again wasn't what I was after. I tried one more thing:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;I did not mean smooth the rendering of the lines. I meant that I wanted curved and not straight lines between points.&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;(Normally if I have to argue this much with a model I'll start a fresh session and try a different prompting strategy from the start.)&lt;/p&gt;
&lt;p&gt;But this did give me what I was looking for:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/claude-attempt-4.jpg" alt="This time the chart finally has smooth curves between points." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;Oddly enough my ChatGPT chart suffered from that same overlapping title glitch, so I gave Claude the same prompt I had given ChatGPT:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;fix the chart title, it looks like this:&lt;/code&gt; (with a screenshot)&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/claude-attempt-5.jpg" alt="Now the chart shows a clear title at the top saying AI adoption rates starting to decline for larger firms" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;We got there in the end! I've shared &lt;a href="https://claude.ai/share/cc32d405-cb53-4e52-a1a0-9b4df4e528ac"&gt;the full transcript of the chat&lt;/a&gt;, although frustratingly the images and some of the code may not be visible. I &lt;a href="https://gist.github.com/simonw/806e1aa0e6c29ad64834037f779e0dc0"&gt;created this Gist&lt;/a&gt; with copies of the files that it let me download.&lt;/p&gt;
&lt;h4 id="prompt-injection-risks"&gt;Prompt injection risks&lt;/h4&gt;
&lt;p&gt;ChatGPT Code Interpreter has no access to the internet at all, which limits how much damage an attacker can do if they manage to sneak their own malicious instructions into the model's context.&lt;/p&gt;
&lt;p&gt;Since Claude Code Interpreter (I'm &lt;em&gt;not&lt;/em&gt; going to be calling it "Upgraded file creation and analysis"!) has a limited form of internet access, we need to worry about &lt;a href="https://simonwillison.net/2025/Jun/16/the-lethal-trifecta/"&gt;lethal trifecta&lt;/a&gt; and other prompt injection attacks.&lt;/p&gt;
&lt;p&gt;The &lt;a href="https://support.anthropic.com/en/articles/12111783-create-and-edit-files-with-claude#h_0ee9d698a1"&gt;help article&lt;/a&gt; actually covers this in some detail:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;It is possible for a bad actor to inconspicuously add instructions via external files or websites that trick Claude into:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Downloading and running untrusted code in the sandbox environment for malicious purposes.&lt;/li&gt;
&lt;li&gt;Reading sensitive data from a &lt;a href="http://claude.ai"&gt;claude.ai&lt;/a&gt; connected knowledge source (e.g., Remote MCP, projects) and using the sandbox environment to make an external network request to leak the data.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This means Claude can be tricked into sending information from its context (e.g., prompts, projects, data via MCP, Google integrations) to malicious third parties. To mitigate these risks, we recommend you monitor Claude while using the feature and stop it if you see it using or accessing data unexpectedly.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;"We recommend you monitor Claude while using the feature" smells me to me like unfairly outsourcing the problem to Anthropic's users, but I'm not sure what more they can do!&lt;/p&gt;
&lt;p&gt;It's interesting that they still describe the external communication risk even though they've locked down a lot of network access. My best guess is that they know that allowlisting &lt;code&gt;github.com&lt;/code&gt; opens an &lt;em&gt;enormous&lt;/em&gt; array of potential exfiltration vectors.&lt;/p&gt;
&lt;p&gt;Anthropic also note:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;We have performed red-teaming and security testing on the feature. We have a continuous process for ongoing security testing and red-teaming of this feature.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I plan to be cautious using this feature with any data that I very much don't want to be leaked to a third party, if there's even the slightest chance that malicious instructions might sneak their way in.&lt;/p&gt;
&lt;h4 id="my-verdict-on-claude-code-interpreter-so-far"&gt;My verdict on Claude Code Interpreter so far&lt;/h4&gt;
&lt;p&gt;I'm generally very excited about this. Code Interpreter has been my most-valued LLM feature since it launched in early 2023, and the Claude version includes some upgrades on the original - package installation, Node.js support - that I expect will be very useful.&lt;/p&gt;
&lt;p&gt;I don't particularly mark it down for taking a little more prompting to recreate the Apollo chart than ChatGPT did. For one thing I was using Claude Sonnet 4 - I expect Claude Opus 4.1 would have done better. I also have a much stronger intuition for Code Interpreter prompts that work with GPT-5.&lt;/p&gt;
&lt;p&gt;I don't think my chart recreation exercise here should be taken as showing any meaningful differences between the two.&lt;/p&gt;
&lt;h4 id="ai-labs-find-explaining-this-feature-incredibly-difficult"&gt;AI labs find explaining this feature incredibly difficult&lt;/h4&gt;
&lt;p&gt;I find it &lt;em&gt;fascinating&lt;/em&gt; how difficult the AI labs find describing this feature to people! OpenAI went from "Code Interpreter" to "Advanced Data Analysis" and maybe back again? It's hard to even find their official landing page for that feature now. (I &lt;a href="https://chatgpt.com/share/68c070ff-fe9c-8006-91b5-cff799253836"&gt;got GPT-5 to look for it&lt;/a&gt; and it hunted for 37 seconds and settled on the help page for &lt;a href="https://help.openai.com/en/articles/8437071-data-analysis-with-chatgpt"&gt;Data analysis with ChatGPT&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;Anthropic already used the bad name "Analysis tool" for a different implementation, and now have the somehow-worse name "Upgraded file creation and analysis". Their launch announcement avoids even talking about code execution, focusing exclusively on the tool's ability to generate spreadsheets and PDFs!&lt;/p&gt;
&lt;p&gt;I wonder if any of the AI labs will crack the code on how to name and explain this thing? I feel like it's still a very under-appreciated feature of LLMs, despite having been around for more than two years now.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/nodejs"&gt;nodejs&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/visualization"&gt;visualization&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openai"&gt;openai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/prompt-injection"&gt;prompt-injection&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/chatgpt"&gt;chatgpt&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-programming"&gt;ai-assisted-programming&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/anthropic"&gt;anthropic&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude"&gt;claude&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/code-interpreter"&gt;code-interpreter&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-tool-use"&gt;llm-tool-use&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/vibe-coding"&gt;vibe-coding&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="nodejs"/><category term="python"/><category term="visualization"/><category term="ai"/><category term="openai"/><category term="prompt-injection"/><category term="generative-ai"/><category term="chatgpt"/><category term="llms"/><category term="ai-assisted-programming"/><category term="anthropic"/><category term="claude"/><category term="code-interpreter"/><category term="llm-tool-use"/><category term="vibe-coding"/></entry><entry><title>Recreating the Apollo AI adoption rate chart with GPT-5, Python and Pyodide</title><link href="https://simonwillison.net/2025/Sep/9/apollo-ai-adoption/#atom-tag" rel="alternate"/><published>2025-09-09T06:47:49+00:00</published><updated>2025-09-09T06:47:49+00:00</updated><id>https://simonwillison.net/2025/Sep/9/apollo-ai-adoption/#atom-tag</id><summary type="html">
    &lt;p&gt;Apollo Global Management's "Chief Economist" Dr. Torsten Sløk released &lt;a href="https://www.apolloacademy.com/ai-adoption-rate-trending-down-for-large-companies/"&gt;this interesting chart&lt;/a&gt; which appears to show a slowdown in AI adoption rates among large (&amp;gt;250 employees) companies:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/apollo-ai-chart.jpg" alt="AI adoption rates starting to decline for larger firms. A chart of AI adoption rate by firm size. Includes lines for 250+, 100-249, 50-99, 20-49, 10-19, 5-8 and 1-4 sized organizations. Chart starts in November 2023 with percentages ranging from 3 to 5, then all groups grow through August 2025 albeit with the 250+ group having a higher score than the others. That 25+ group peaks in Jul5 2025 at around 14% and then appears to slope slightly downwards to 12% by August. Some of the other lines also start to tip down, though not as much." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;Here's the full description that accompanied the chart:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The US Census Bureau conducts a biweekly survey of 1.2 million firms, and one question is whether a business has used AI tools such as machine learning, natural language processing, virtual agents or voice recognition to help produce goods or services in the past two weeks. Recent data by firm size shows that AI adoption has been declining among companies with more than 250 employees, see chart below.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;(My first thought on seeing that chart is that I hope it represents the &lt;em&gt;peak of inflated expectations&lt;/em&gt; leading into the &lt;em&gt;trough of disillusionment&lt;/em&gt; in the &lt;a href="https://en.wikipedia.org/wiki/Gartner_hype_cycle"&gt;Gartner Hype Cycle&lt;/a&gt; (which Wikipedia calls "largely disputed, with studies pointing to it being inconsistently true at best"), since that means we might be reaching the end of the initial hype phase and heading towards the &lt;em&gt;slope of enlightenment&lt;/em&gt;.)&lt;/p&gt;
&lt;h4 id="finding-the-us-census-data-with-gpt-5-search"&gt;Finding the US Census data with GPT-5 search&lt;/h4&gt;
&lt;p&gt;This is the first I'd heard of the US Census Bureau running a biweekly (that's once every two weeks) survey about AI!&lt;/p&gt;
&lt;p&gt;I decided to track down the numbers and see if I could recreate the chart myself. And since &lt;a href="https://simonwillison.net/2025/Sep/6/research-goblin/"&gt;GPT-5 is really good at search now&lt;/a&gt; I fed it the following prompt to see how well it could do:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;&amp;gt; The US Census Bureau conducts a biweekly survey of 1.2 million firms, and one question is whether a business has used AI tools such as machine learning, natural language processing, virtual agents or voice recognition to help produce goods or services in the past two weeks. Recent data by firm size shows that AI adoption has been declining among companies with more than 250 employees, see chart below.&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;&lt;code&gt;Find me that original data&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;It took 1m19s to &lt;a href="https://chatgpt.com/share/68bf48cf-0e70-8006-a045-96fa8e7ddfc1"&gt;find exactly what I needed&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Here’s the original U.S. Census Bureau source for that series:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;BTOS “Employment size class” time series (XLSX)&lt;/strong&gt; — biweekly shares of firms answering &lt;strong&gt;Yes&lt;/strong&gt; to “In the last two weeks, did this business use Artificial Intelligence (AI) in producing goods or services?”, broken out by firm-size buckets (incl. &lt;strong&gt;250+ employees&lt;/strong&gt;). Coverage: &lt;strong&gt;Sep 11, 2023 → Aug 24, 2025&lt;/strong&gt;. (&lt;a href="https://www.census.gov/hfp/btos/data_downloads"&gt;Census.gov&lt;/a&gt;) [...]&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;That &lt;a href="https://www.census.gov/hfp/btos/data_downloads"&gt;Census page&lt;/a&gt; was not &lt;em&gt;at all&lt;/em&gt; obvious. Thankfully GPT-5 had tipped me off to the "Employment size class" file, this link here:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/census-page.jpg" alt="US Census website. Business Trends and Outlook Survey, Updated August 28, 2025. Current Data has 6 visible XLSX files with names like WFH Supplement, WFH Questions 27-29, National, Sectur, Subsector and Emplomyent size class. A red arrow highlights that last one." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;So I downloaded that file, and confirmed that it was indeed a spreadsheet containing the data I wanted (in among all sorts of other survey questions). Here's &lt;a href="https://static.simonwillison.net/static/cors-allow/2025/Employment-Size-Class-Sep-2025.xlsx"&gt;a 374KB XLSX copy&lt;/a&gt; of the file I downloaded.&lt;/p&gt;
&lt;h4 id="recreating-the-chart-with-gpt-5-code-interpreter"&gt;Recreating the chart with GPT-5 code interpreter&lt;/h4&gt;
&lt;p&gt;So what should I do with it now? I decided to see if GPT-5 could turn the spreadsheet back into that original chart, using Python running in its &lt;a href="https://simonwillison.net/tags/code-interpreter/"&gt;code interpreter&lt;/a&gt; tool.&lt;/p&gt;
&lt;p&gt;So I uploaded the XLSX file back to ChatGPT, dropped in a screenshot of the Apollo chart and prompted:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;Use this data to recreate this chart using python&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/chart-prompt.jpg" alt="ChatGPT. I dropped in a screenshot of the chart, uploaded the spreadsheet which turned into an inline table browser UI and prompted it to recreate the chart using python." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;I thought this was a pretty tall order, but it's always worth throwing big challenges at an LLM to learn from how well it does.&lt;/p&gt;
&lt;p&gt;It &lt;em&gt;really worked hard on this&lt;/em&gt;. I didn't time it exactly but it spent at least 7 minutes "reasoning" across 5 different thinking blocks, interspersed with over a dozen Python analysis sessions. It used &lt;code&gt;pandas&lt;/code&gt; and &lt;code&gt;numpy&lt;/code&gt; to explore the uploaded spreadsheet and find the right figures, then tried several attempts at plotting with &lt;code&gt;matplotlib&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;As far as I can tell GPT-5 in ChatGPT can now feed charts it creates back into its own vision model, because it appeared to render a broken (empty) chart and then keep on trying to get it working.&lt;/p&gt;
&lt;p&gt;It found a data dictionary in the last tab of the spreadsheet and used that to build a lookup table matching the letters &lt;code&gt;A&lt;/code&gt; through &lt;code&gt;G&lt;/code&gt; to the actual employee size buckets.&lt;/p&gt;
&lt;p&gt;At the end of the process it spat out this chart:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/recreated-chart-1.jpg" alt="matplotlib chart. The title is AI adoption rates starting to decline for larger firms, though there's a typography glitch in that title. It has a neat legend for the different size ranges, then a set of lines that look about right compared to the above graph - but they are more spiky and the numbers appear to trend up again at the end of the chart." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;At first glance I thought it had nailed it... but then I compared the chart more closely with the Apollo original and spotted some definite discrepancies. GPT-5's chart peaked at 14.5% but the highest value in Apollo's was more like 13.5%. The GPT-5 chart was spikier - and most interestingly it included a clear uptick in the last data point where Apollo's had trended downwards.&lt;/p&gt;
&lt;p&gt;I decided it was time to look at the actual data. I opened up the spreadsheet in Numbers, found the AI question columns and manually reviewed them. They seemed to match the GPT-5 chart results - so why the difference to Apollo's?&lt;/p&gt;
&lt;p&gt;Then I noticed a crucial detail in the Apollo chart that I had cropped out of my original screenshot!&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Note: Data is six-survey moving average.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;So I told ChatGPT:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;Do the first question, plot it as a six survey rolling average&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I asked for the first question because it turned out the spreadsheet included two relevant AI questions:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;In the last two weeks, did this business use Artificial Intelligence (AI) in producing goods or services? (Examples of AI: machine learning, natural language processing, virtual agents, voice recognition, etc.)&lt;/li&gt;
&lt;li&gt;During the next six months, do you think this business will be using Artificial Intelligence (AI) in producing goods or services? (Examples of AI: machine learning, natural language processing, virtual agents, voice recognition, etc.)&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;It churned away for a little longer, added this code to the script:&lt;/p&gt;
&lt;pre&gt;&lt;span class="pl-c"&gt;# Compute 6-survey rolling average (biweekly cadence → ~12 weeks)&lt;/span&gt;
&lt;span class="pl-s1"&gt;rolled&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;wide&lt;/span&gt;.&lt;span class="pl-c1"&gt;rolling&lt;/span&gt;(&lt;span class="pl-s1"&gt;window&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-c1"&gt;6&lt;/span&gt;, &lt;span class="pl-s1"&gt;min_periods&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-c1"&gt;6&lt;/span&gt;).&lt;span class="pl-c1"&gt;mean&lt;/span&gt;()&lt;/pre&gt;
&lt;p&gt;And popped out this chart (after I told it to fix the glitch in the title):&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/recreated-chart-2.jpg" alt="Second chart. This time the lines are basically an exact match for the Apollo one." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;I think it's done it! This is a very solid match for the Apollo original, recreated using &lt;code&gt;matplotlib&lt;/code&gt; and &lt;code&gt;pandas&lt;/code&gt; from the same underlying source data from the US Census.&lt;/p&gt;
&lt;p&gt;Here's the full Python code it wrote, which I think is quite readable (in as much as Pandas code can be):&lt;/p&gt;
&lt;pre&gt;&lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-s1"&gt;pandas&lt;/span&gt; &lt;span class="pl-k"&gt;as&lt;/span&gt; &lt;span class="pl-s1"&gt;pd&lt;/span&gt;
&lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-s1"&gt;matplotlib&lt;/span&gt;.&lt;span class="pl-s1"&gt;pyplot&lt;/span&gt; &lt;span class="pl-k"&gt;as&lt;/span&gt; &lt;span class="pl-s1"&gt;plt&lt;/span&gt;
&lt;span class="pl-k"&gt;from&lt;/span&gt; &lt;span class="pl-s1"&gt;matplotlib&lt;/span&gt;.&lt;span class="pl-s1"&gt;ticker&lt;/span&gt; &lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-v"&gt;PercentFormatter&lt;/span&gt;

&lt;span class="pl-s1"&gt;path&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s"&gt;"/mnt/data/Employment Size Class.xlsx"&lt;/span&gt;

&lt;span class="pl-s1"&gt;resp&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;pd&lt;/span&gt;.&lt;span class="pl-c1"&gt;read_excel&lt;/span&gt;(&lt;span class="pl-s1"&gt;path&lt;/span&gt;, &lt;span class="pl-s1"&gt;sheet_name&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-s"&gt;"Response Estimates"&lt;/span&gt;)
&lt;span class="pl-s1"&gt;dates&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;pd&lt;/span&gt;.&lt;span class="pl-c1"&gt;read_excel&lt;/span&gt;(&lt;span class="pl-s1"&gt;path&lt;/span&gt;, &lt;span class="pl-s1"&gt;sheet_name&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-s"&gt;"Collection and Reference Dates"&lt;/span&gt;)

&lt;span class="pl-s1"&gt;is_current&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;resp&lt;/span&gt;[&lt;span class="pl-s"&gt;"Question"&lt;/span&gt;].&lt;span class="pl-c1"&gt;astype&lt;/span&gt;(&lt;span class="pl-s1"&gt;str&lt;/span&gt;).&lt;span class="pl-c1"&gt;str&lt;/span&gt;.&lt;span class="pl-c1"&gt;strip&lt;/span&gt;().&lt;span class="pl-c1"&gt;str&lt;/span&gt;.&lt;span class="pl-c1"&gt;startswith&lt;/span&gt;(&lt;span class="pl-s"&gt;"In the last two weeks"&lt;/span&gt;)
&lt;span class="pl-s1"&gt;ai_yes&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;resp&lt;/span&gt;[&lt;span class="pl-s1"&gt;is_current&lt;/span&gt; &lt;span class="pl-c1"&gt;&amp;amp;&lt;/span&gt; &lt;span class="pl-s1"&gt;resp&lt;/span&gt;[&lt;span class="pl-s"&gt;"Answer"&lt;/span&gt;].&lt;span class="pl-c1"&gt;astype&lt;/span&gt;(&lt;span class="pl-s1"&gt;str&lt;/span&gt;).&lt;span class="pl-c1"&gt;str&lt;/span&gt;.&lt;span class="pl-c1"&gt;strip&lt;/span&gt;().&lt;span class="pl-c1"&gt;str&lt;/span&gt;.&lt;span class="pl-c1"&gt;lower&lt;/span&gt;().&lt;span class="pl-c1"&gt;eq&lt;/span&gt;(&lt;span class="pl-s"&gt;"yes"&lt;/span&gt;)].&lt;span class="pl-c1"&gt;copy&lt;/span&gt;()

&lt;span class="pl-s1"&gt;code_to_bucket&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; {&lt;span class="pl-s"&gt;"A"&lt;/span&gt;:&lt;span class="pl-s"&gt;"1-4"&lt;/span&gt;,&lt;span class="pl-s"&gt;"B"&lt;/span&gt;:&lt;span class="pl-s"&gt;"5-9"&lt;/span&gt;,&lt;span class="pl-s"&gt;"C"&lt;/span&gt;:&lt;span class="pl-s"&gt;"10-19"&lt;/span&gt;,&lt;span class="pl-s"&gt;"D"&lt;/span&gt;:&lt;span class="pl-s"&gt;"20-49"&lt;/span&gt;,&lt;span class="pl-s"&gt;"E"&lt;/span&gt;:&lt;span class="pl-s"&gt;"50-99"&lt;/span&gt;,&lt;span class="pl-s"&gt;"F"&lt;/span&gt;:&lt;span class="pl-s"&gt;"100-249"&lt;/span&gt;,&lt;span class="pl-s"&gt;"G"&lt;/span&gt;:&lt;span class="pl-s"&gt;"250 or more employees"&lt;/span&gt;}
&lt;span class="pl-s1"&gt;ai_yes&lt;/span&gt;[&lt;span class="pl-s"&gt;"Bucket"&lt;/span&gt;] &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;ai_yes&lt;/span&gt;[&lt;span class="pl-s"&gt;"Empsize"&lt;/span&gt;].&lt;span class="pl-c1"&gt;map&lt;/span&gt;(&lt;span class="pl-s1"&gt;code_to_bucket&lt;/span&gt;)

&lt;span class="pl-s1"&gt;period_cols&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; [&lt;span class="pl-s1"&gt;c&lt;/span&gt; &lt;span class="pl-k"&gt;for&lt;/span&gt; &lt;span class="pl-s1"&gt;c&lt;/span&gt; &lt;span class="pl-c1"&gt;in&lt;/span&gt; &lt;span class="pl-s1"&gt;ai_yes&lt;/span&gt;.&lt;span class="pl-c1"&gt;columns&lt;/span&gt; &lt;span class="pl-k"&gt;if&lt;/span&gt; &lt;span class="pl-en"&gt;str&lt;/span&gt;(&lt;span class="pl-s1"&gt;c&lt;/span&gt;).&lt;span class="pl-c1"&gt;isdigit&lt;/span&gt;() &lt;span class="pl-c1"&gt;and&lt;/span&gt; &lt;span class="pl-en"&gt;len&lt;/span&gt;(&lt;span class="pl-en"&gt;str&lt;/span&gt;(&lt;span class="pl-s1"&gt;c&lt;/span&gt;))&lt;span class="pl-c1"&gt;==&lt;/span&gt;&lt;span class="pl-c1"&gt;6&lt;/span&gt;]
&lt;span class="pl-s1"&gt;long&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;ai_yes&lt;/span&gt;.&lt;span class="pl-c1"&gt;melt&lt;/span&gt;(&lt;span class="pl-s1"&gt;id_vars&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;[&lt;span class="pl-s"&gt;"Bucket"&lt;/span&gt;], &lt;span class="pl-s1"&gt;value_vars&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-s1"&gt;period_cols&lt;/span&gt;, &lt;span class="pl-s1"&gt;var_name&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-s"&gt;"Smpdt"&lt;/span&gt;, &lt;span class="pl-s1"&gt;value_name&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-s"&gt;"value"&lt;/span&gt;)

&lt;span class="pl-s1"&gt;dates&lt;/span&gt;[&lt;span class="pl-s"&gt;"Smpdt"&lt;/span&gt;] &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;dates&lt;/span&gt;[&lt;span class="pl-s"&gt;"Smpdt"&lt;/span&gt;].&lt;span class="pl-c1"&gt;astype&lt;/span&gt;(&lt;span class="pl-s1"&gt;str&lt;/span&gt;)
&lt;span class="pl-s1"&gt;long&lt;/span&gt;[&lt;span class="pl-s"&gt;"Smpdt"&lt;/span&gt;] &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;long&lt;/span&gt;[&lt;span class="pl-s"&gt;"Smpdt"&lt;/span&gt;].&lt;span class="pl-c1"&gt;astype&lt;/span&gt;(&lt;span class="pl-s1"&gt;str&lt;/span&gt;)
&lt;span class="pl-s1"&gt;merged&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;long&lt;/span&gt;.&lt;span class="pl-c1"&gt;merge&lt;/span&gt;(&lt;span class="pl-s1"&gt;dates&lt;/span&gt;[[&lt;span class="pl-s"&gt;"Smpdt"&lt;/span&gt;,&lt;span class="pl-s"&gt;"Ref End"&lt;/span&gt;]], &lt;span class="pl-s1"&gt;on&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-s"&gt;"Smpdt"&lt;/span&gt;, &lt;span class="pl-s1"&gt;how&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-s"&gt;"left"&lt;/span&gt;)
&lt;span class="pl-s1"&gt;merged&lt;/span&gt;[&lt;span class="pl-s"&gt;"date"&lt;/span&gt;] &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;pd&lt;/span&gt;.&lt;span class="pl-c1"&gt;to_datetime&lt;/span&gt;(&lt;span class="pl-s1"&gt;merged&lt;/span&gt;[&lt;span class="pl-s"&gt;"Ref End"&lt;/span&gt;], &lt;span class="pl-s1"&gt;errors&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-s"&gt;"coerce"&lt;/span&gt;)

&lt;span class="pl-s1"&gt;merged&lt;/span&gt;[&lt;span class="pl-s"&gt;"value"&lt;/span&gt;] &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;pd&lt;/span&gt;.&lt;span class="pl-c1"&gt;to_numeric&lt;/span&gt;(&lt;span class="pl-s1"&gt;long&lt;/span&gt;[&lt;span class="pl-s"&gt;"value"&lt;/span&gt;].&lt;span class="pl-c1"&gt;astype&lt;/span&gt;(&lt;span class="pl-s1"&gt;str&lt;/span&gt;).&lt;span class="pl-c1"&gt;str&lt;/span&gt;.&lt;span class="pl-c1"&gt;replace&lt;/span&gt;(&lt;span class="pl-s"&gt;"%"&lt;/span&gt;,&lt;span class="pl-s"&gt;""&lt;/span&gt;,&lt;span class="pl-s1"&gt;regex&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-c1"&gt;False&lt;/span&gt;).&lt;span class="pl-c1"&gt;str&lt;/span&gt;.&lt;span class="pl-c1"&gt;strip&lt;/span&gt;(), &lt;span class="pl-s1"&gt;errors&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-s"&gt;"coerce"&lt;/span&gt;)

&lt;span class="pl-s1"&gt;order&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; [&lt;span class="pl-s"&gt;"250 or more employees"&lt;/span&gt;,&lt;span class="pl-s"&gt;"100-249"&lt;/span&gt;,&lt;span class="pl-s"&gt;"50-99"&lt;/span&gt;,&lt;span class="pl-s"&gt;"20-49"&lt;/span&gt;,&lt;span class="pl-s"&gt;"10-19"&lt;/span&gt;,&lt;span class="pl-s"&gt;"5-9"&lt;/span&gt;,&lt;span class="pl-s"&gt;"1-4"&lt;/span&gt;]
&lt;span class="pl-s1"&gt;wide&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;merged&lt;/span&gt;.&lt;span class="pl-c1"&gt;pivot_table&lt;/span&gt;(&lt;span class="pl-s1"&gt;index&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-s"&gt;"date"&lt;/span&gt;, &lt;span class="pl-s1"&gt;columns&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-s"&gt;"Bucket"&lt;/span&gt;, &lt;span class="pl-s1"&gt;values&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-s"&gt;"value"&lt;/span&gt;, &lt;span class="pl-s1"&gt;aggfunc&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-s"&gt;"mean"&lt;/span&gt;).&lt;span class="pl-c1"&gt;sort_index&lt;/span&gt;()
&lt;span class="pl-s1"&gt;wide&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;wide&lt;/span&gt;[[&lt;span class="pl-s1"&gt;c&lt;/span&gt; &lt;span class="pl-k"&gt;for&lt;/span&gt; &lt;span class="pl-s1"&gt;c&lt;/span&gt; &lt;span class="pl-c1"&gt;in&lt;/span&gt; &lt;span class="pl-s1"&gt;order&lt;/span&gt; &lt;span class="pl-k"&gt;if&lt;/span&gt; &lt;span class="pl-s1"&gt;c&lt;/span&gt; &lt;span class="pl-c1"&gt;in&lt;/span&gt; &lt;span class="pl-s1"&gt;wide&lt;/span&gt;.&lt;span class="pl-c1"&gt;columns&lt;/span&gt;]]
&lt;span class="pl-s1"&gt;rolled&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;wide&lt;/span&gt;.&lt;span class="pl-c1"&gt;rolling&lt;/span&gt;(&lt;span class="pl-s1"&gt;window&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-c1"&gt;6&lt;/span&gt;, &lt;span class="pl-s1"&gt;min_periods&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-c1"&gt;6&lt;/span&gt;).&lt;span class="pl-c1"&gt;mean&lt;/span&gt;()

&lt;span class="pl-s1"&gt;start&lt;/span&gt;, &lt;span class="pl-s1"&gt;end&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;pd&lt;/span&gt;.&lt;span class="pl-c1"&gt;Timestamp&lt;/span&gt;(&lt;span class="pl-s"&gt;"2023-11-01"&lt;/span&gt;), &lt;span class="pl-s1"&gt;pd&lt;/span&gt;.&lt;span class="pl-c1"&gt;Timestamp&lt;/span&gt;(&lt;span class="pl-s"&gt;"2025-08-31"&lt;/span&gt;)
&lt;span class="pl-s1"&gt;rolled_win&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;rolled&lt;/span&gt;.&lt;span class="pl-c1"&gt;loc&lt;/span&gt;[(&lt;span class="pl-s1"&gt;rolled&lt;/span&gt;.&lt;span class="pl-c1"&gt;index&lt;/span&gt; &lt;span class="pl-c1"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;start&lt;/span&gt;) &lt;span class="pl-c1"&gt;&amp;amp;&lt;/span&gt; (&lt;span class="pl-s1"&gt;rolled&lt;/span&gt;.&lt;span class="pl-c1"&gt;index&lt;/span&gt; &lt;span class="pl-c1"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;end&lt;/span&gt;)]

&lt;span class="pl-s1"&gt;fig&lt;/span&gt;, &lt;span class="pl-s1"&gt;ax&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;plt&lt;/span&gt;.&lt;span class="pl-c1"&gt;subplots&lt;/span&gt;(&lt;span class="pl-s1"&gt;figsize&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;(&lt;span class="pl-c1"&gt;12&lt;/span&gt;, &lt;span class="pl-c1"&gt;6&lt;/span&gt;))
&lt;span class="pl-k"&gt;for&lt;/span&gt; &lt;span class="pl-s1"&gt;col&lt;/span&gt; &lt;span class="pl-c1"&gt;in&lt;/span&gt; &lt;span class="pl-s1"&gt;order&lt;/span&gt;:
    &lt;span class="pl-k"&gt;if&lt;/span&gt; &lt;span class="pl-s1"&gt;col&lt;/span&gt; &lt;span class="pl-c1"&gt;in&lt;/span&gt; &lt;span class="pl-s1"&gt;rolled_win&lt;/span&gt;.&lt;span class="pl-c1"&gt;columns&lt;/span&gt;:
        &lt;span class="pl-s1"&gt;ax&lt;/span&gt;.&lt;span class="pl-c1"&gt;plot&lt;/span&gt;(&lt;span class="pl-s1"&gt;rolled_win&lt;/span&gt;.&lt;span class="pl-c1"&gt;index&lt;/span&gt;, &lt;span class="pl-s1"&gt;rolled_win&lt;/span&gt;[&lt;span class="pl-s1"&gt;col&lt;/span&gt;], &lt;span class="pl-s1"&gt;label&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-s1"&gt;col&lt;/span&gt;, &lt;span class="pl-s1"&gt;linewidth&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-c1"&gt;2&lt;/span&gt;)

&lt;span class="pl-s1"&gt;ax&lt;/span&gt;.&lt;span class="pl-c1"&gt;set_title&lt;/span&gt;(&lt;span class="pl-s"&gt;"AI adoption (last two weeks) — 6‑survey rolling average"&lt;/span&gt;, &lt;span class="pl-s1"&gt;pad&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-c1"&gt;16&lt;/span&gt;)
&lt;span class="pl-s1"&gt;ax&lt;/span&gt;.&lt;span class="pl-c1"&gt;yaxis&lt;/span&gt;.&lt;span class="pl-c1"&gt;set_major_formatter&lt;/span&gt;(&lt;span class="pl-en"&gt;PercentFormatter&lt;/span&gt;(&lt;span class="pl-c1"&gt;100&lt;/span&gt;))
&lt;span class="pl-s1"&gt;ax&lt;/span&gt;.&lt;span class="pl-c1"&gt;set_ylabel&lt;/span&gt;(&lt;span class="pl-s"&gt;"%"&lt;/span&gt;)
&lt;span class="pl-s1"&gt;ax&lt;/span&gt;.&lt;span class="pl-c1"&gt;set_xlabel&lt;/span&gt;(&lt;span class="pl-s"&gt;""&lt;/span&gt;)
&lt;span class="pl-s1"&gt;ax&lt;/span&gt;.&lt;span class="pl-c1"&gt;grid&lt;/span&gt;(&lt;span class="pl-c1"&gt;True&lt;/span&gt;, &lt;span class="pl-s1"&gt;alpha&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-c1"&gt;0.25&lt;/span&gt;, &lt;span class="pl-s1"&gt;linestyle&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-s"&gt;"--"&lt;/span&gt;)
&lt;span class="pl-s1"&gt;ax&lt;/span&gt;.&lt;span class="pl-c1"&gt;legend&lt;/span&gt;(&lt;span class="pl-s1"&gt;title&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-c1"&gt;None&lt;/span&gt;, &lt;span class="pl-s1"&gt;loc&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-s"&gt;"upper left"&lt;/span&gt;, &lt;span class="pl-s1"&gt;ncols&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-c1"&gt;2&lt;/span&gt;, &lt;span class="pl-s1"&gt;frameon&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-c1"&gt;False&lt;/span&gt;)
&lt;span class="pl-s1"&gt;plt&lt;/span&gt;.&lt;span class="pl-c1"&gt;tight_layout&lt;/span&gt;()

&lt;span class="pl-s1"&gt;png_path&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s"&gt;"/mnt/data/ai_adoption_rolling6_by_firm_size.png"&lt;/span&gt;
&lt;span class="pl-s1"&gt;svg_path&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s"&gt;"/mnt/data/ai_adoption_rolling6_by_firm_size.svg"&lt;/span&gt;
&lt;span class="pl-s1"&gt;plt&lt;/span&gt;.&lt;span class="pl-c1"&gt;savefig&lt;/span&gt;(&lt;span class="pl-s1"&gt;png_path&lt;/span&gt;, &lt;span class="pl-s1"&gt;dpi&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-c1"&gt;200&lt;/span&gt;, &lt;span class="pl-s1"&gt;bbox_inches&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-s"&gt;"tight"&lt;/span&gt;)
&lt;span class="pl-s1"&gt;plt&lt;/span&gt;.&lt;span class="pl-c1"&gt;savefig&lt;/span&gt;(&lt;span class="pl-s1"&gt;svg_path&lt;/span&gt;, &lt;span class="pl-s1"&gt;bbox_inches&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-s"&gt;"tight"&lt;/span&gt;)&lt;/pre&gt;
&lt;p&gt;I like how it generated &lt;a href="https://static.simonwillison.net/static/2025/ai_adoption_rolling6_by_firm_size.svg"&gt;an SVG version&lt;/a&gt; of the chart without me even asking for it.&lt;/p&gt;
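&lt;p&gt;The most intricate part of that listing is the &lt;code&gt;melt&lt;/code&gt;/&lt;code&gt;pivot_table&lt;/code&gt; round trip: the survey file stores one column per period, which has to become one row per date with one column per size bucket before &lt;code&gt;rolling()&lt;/code&gt; can average down the date axis. A toy version with made-up data (just the shape of the survey layout, not its contents) shows the reshape:&lt;/p&gt;

```python
import pandas as pd

# Two made-up size buckets, two made-up survey periods as columns.
df = pd.DataFrame({
    "Bucket": ["1-4", "250 or more employees"],
    "202401": [4.0, 10.0],
    "202402": [5.0, 11.0],
})

# Wide-by-period to long, then long to wide-by-bucket with periods as rows.
long = df.melt(id_vars=["Bucket"], var_name="Smpdt", value_name="value")
wide = long.pivot_table(index="Smpdt", columns="Bucket", values="value", aggfunc="mean")

print(wide.shape)  # one row per period, one column per bucket: (2, 2)
```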
&lt;p&gt;You can access &lt;a href="https://chatgpt.com/share/68bf48cf-0e70-8006-a045-96fa8e7ddfc1"&gt;the ChatGPT transcript&lt;/a&gt; to see full details of everything it did.&lt;/p&gt;
&lt;h4 id="rendering-that-chart-client-side-using-pyodide"&gt;Rendering that chart client-side using Pyodide&lt;/h4&gt;
&lt;p&gt;I had one more challenge to try out. Could I render that same chart entirely in the browser using &lt;a href="https://pyodide.org/en/stable/"&gt;Pyodide&lt;/a&gt;, which can execute both Pandas and Matplotlib?&lt;/p&gt;
&lt;p&gt;I fired up a new ChatGPT GPT-5 session and prompted:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;Build a canvas that loads Pyodide and uses it to render an example bar chart with pandas and matplotlib and then displays that on the page&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;My goal here was simply to see if I could get a proof of concept of a chart rendered, ideally using the Canvas feature of ChatGPT. Canvas is OpenAI's version of Claude Artifacts, which lets the model write and then execute HTML and JavaScript directly in the ChatGPT interface.&lt;/p&gt;
&lt;p&gt;It worked! Here's &lt;a href="https://chatgpt.com/c/68bf2993-ca94-832a-a95e-fb225911c0a6"&gt;the transcript&lt;/a&gt; and here's &lt;a href="https://tools.simonwillison.net/pyodide-bar-chart"&gt;what it built me&lt;/a&gt;, exported to my &lt;a href="https://tools.simonwillison.net/"&gt;tools.simonwillison.net&lt;/a&gt; GitHub Pages site (&lt;a href="https://github.com/simonw/tools/blob/main/pyodide-bar-chart.html"&gt;source code here&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/pyodide-matplotlib.jpg" alt="Screenshot of a web application demonstrating Pyodide integration. Header reads &amp;quot;Pyodide + pandas + matplotlib — Bar Chart&amp;quot; with subtitle &amp;quot;This page loads Pyodide in the browser, uses pandas to prep some data, renders a bar chart with matplotlib, and displays it below — all client-side.&amp;quot; Left panel shows terminal output: &amp;quot;Ready&amp;quot;, &amp;quot;# Python environment ready&amp;quot;, &amp;quot;• pandas 2.2.0&amp;quot;, &amp;quot;• numpy 1.26.4&amp;quot;, &amp;quot;• matplotlib 3.5.2&amp;quot;, &amp;quot;Running chart code...&amp;quot;, &amp;quot;Done. Chart updated.&amp;quot; with &amp;quot;Re-run demo&amp;quot; and &amp;quot;Show Python&amp;quot; buttons. Footer note: &amp;quot;CDN: pyodide, pandas, numpy, matplotlib are fetched on demand. First run may take a few seconds.&amp;quot; Right panel displays a bar chart titled &amp;quot;Example Bar Chart (pandas + matplotlib in Pyodide)&amp;quot; showing blue bars for months Jan through Jun with values approximately: Jan(125), Feb(130), Mar(80), Apr(85), May(85), Jun(120). Y-axis labeled &amp;quot;Streams&amp;quot; ranges 0-120, X-axis labeled &amp;quot;Month&amp;quot;." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;I've now proven to myself that I can render those Python charts directly in the browser. Next step: recreate the Apollo chart.&lt;/p&gt;
&lt;p&gt;I knew it would need a way to load the spreadsheet that was CORS-enabled. I uploaded my copy to my &lt;code&gt;/static/cors-allow/2025/...&lt;/code&gt; directory (configured in Cloudflare to serve CORS headers), pasted in the finished plotting code from earlier and told ChatGPT:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;Now update it to have less explanatory text and a less exciting design (black on white is fine) and run the equivalent of this:&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;(... pasted in Python code from earlier ...)&lt;/p&gt;
&lt;p&gt;&lt;code&gt;Load the XLSX sheet from https://static.simonwillison.net/static/cors-allow/2025/Employment-Size-Class-Sep-2025.xlsx&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;It didn't quite work - I got an error about &lt;code&gt;openpyxl&lt;/code&gt; which I manually researched the fix for and prompted:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;Use await micropip.install("openpyxl") to install openpyxl - instead of using loadPackage&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I had to paste in another error message:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;zipfile.BadZipFile: File is not a zip file&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
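&lt;p&gt;That &lt;code&gt;BadZipFile&lt;/code&gt; error is a useful tell: XLSX files are ZIP archives, so when a fetch silently returns an error page instead of the spreadsheet, the ZIP parser is the first thing to notice. A quick standard-library sanity check (a sketch, assuming the fetched bytes are in a &lt;code&gt;data&lt;/code&gt; variable) can distinguish the two cases before handing the bytes to pandas:&lt;/p&gt;

```python
import io
import zipfile

def looks_like_xlsx(data):
    # XLSX is a ZIP container, so a failed or CORS-blocked fetch that
    # returned an error page will fail this check immediately.
    return zipfile.is_zipfile(io.BytesIO(data))

# A minimal in-memory ZIP stands in for a real workbook...
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("xl/workbook.xml", "stub")
print(looks_like_xlsx(buf.getvalue()))  # True

# ...while the bytes of an error page do not parse as a ZIP.
print(looks_like_xlsx(b"404 Not Found"))  # False
```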
&lt;p&gt;Then one about a &lt;code&gt;SyntaxError: unmatched ')'&lt;/code&gt; and a &lt;code&gt;TypeError: Legend.__init__() got an unexpected keyword argument 'ncols'&lt;/code&gt; - copying and pasting error messages remains a frustrating but necessary part of the vibe-coding loop.&lt;/p&gt;
&lt;p&gt;... but with those fixes in place, the resulting code worked! Visit &lt;a href="https://tools.simonwillison.net/ai-adoption"&gt;tools.simonwillison.net/ai-adoption&lt;/a&gt; to see the final result:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/recreated-chart-pyodide.jpg" alt="Web page. Title is AI adoption - 6-survey rolling average. Has a Run, Downlaed PNG, Downlaod SVG button. Panel on the left says Loading Python... Fetcing packages numpy, pandas, matplotlib. Installing openpyxl via micropop... ready. Running. Done. Right hand panel shows the rendered chart." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;Here's the code for that page, &lt;a href="https://github.com/simonw/tools/blob/main/ai-adoption.html"&gt;170 lines&lt;/a&gt; all-in of HTML, CSS, JavaScript and Python.&lt;/p&gt;
&lt;h4 id="what-i-ve-learned-from-this"&gt;What I've learned from this&lt;/h4&gt;
&lt;p&gt;This was another of those curiosity-inspired investigations that turned into a whole set of useful lessons.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;GPT-5 is great at tracking down US Census data, no matter how difficult their site is to understand if you don't work with their data often&lt;/li&gt;
&lt;li&gt;It can do a very good job of turning data + a screenshot of a chart into a recreation of that chart using code interpreter, Pandas and matplotlib&lt;/li&gt;
&lt;li&gt;Running Python + matplotlib in a browser via Pyodide is very easy and only takes a few dozen lines of code&lt;/li&gt;
&lt;li&gt;Fetching an XLSX sheet into Pyodide is only a small extra step using &lt;code&gt;pyfetch&lt;/code&gt; and &lt;code&gt;openpyxl&lt;/code&gt;:
&lt;pre style="margin-top: 0.5em"&gt;&lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-s1"&gt;micropip&lt;/span&gt;
&lt;span class="pl-k"&gt;await&lt;/span&gt; &lt;span class="pl-s1"&gt;micropip&lt;/span&gt;.&lt;span class="pl-c1"&gt;install&lt;/span&gt;(&lt;span class="pl-s"&gt;"openpyxl"&lt;/span&gt;)
&lt;span class="pl-k"&gt;from&lt;/span&gt; &lt;span class="pl-s1"&gt;pyodide&lt;/span&gt;.&lt;span class="pl-s1"&gt;http&lt;/span&gt; &lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-s1"&gt;pyfetch&lt;/span&gt;
&lt;span class="pl-s1"&gt;resp_fetch&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-k"&gt;await&lt;/span&gt; &lt;span class="pl-en"&gt;pyfetch&lt;/span&gt;(&lt;span class="pl-c1"&gt;URL&lt;/span&gt;)
&lt;span class="pl-s1"&gt;wb_bytes&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-k"&gt;await&lt;/span&gt; &lt;span class="pl-s1"&gt;resp_fetch&lt;/span&gt;.&lt;span class="pl-c1"&gt;bytes&lt;/span&gt;()
&lt;span class="pl-s1"&gt;xf&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;pd&lt;/span&gt;.&lt;span class="pl-c1"&gt;ExcelFile&lt;/span&gt;(&lt;span class="pl-s1"&gt;io&lt;/span&gt;.&lt;span class="pl-c1"&gt;BytesIO&lt;/span&gt;(&lt;span class="pl-s1"&gt;wb_bytes&lt;/span&gt;), &lt;span class="pl-s1"&gt;engine&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-s"&gt;'openpyxl'&lt;/span&gt;)&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;Another new-to-me pattern: you can render an image to the DOM from Pyodide code &lt;a href="https://github.com/simonw/tools/blob/cf26ed8a6f243159bdc90a3d88f818261732103f/ai-adoption.html#L124"&gt;like this&lt;/a&gt;:
&lt;pre style="margin-top: 0.5em"&gt;&lt;span class="pl-k"&gt;from&lt;/span&gt; &lt;span class="pl-s1"&gt;js&lt;/span&gt; &lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-s1"&gt;document&lt;/span&gt;
&lt;span class="pl-s1"&gt;document&lt;/span&gt;.&lt;span class="pl-c1"&gt;getElementById&lt;/span&gt;(&lt;span class="pl-s"&gt;'plot'&lt;/span&gt;).&lt;span class="pl-c1"&gt;src&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s"&gt;'data:image/png;base64,'&lt;/span&gt; &lt;span class="pl-c1"&gt;+&lt;/span&gt; &lt;span class="pl-s1"&gt;img_b64&lt;/span&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;/ul&gt;
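&lt;p&gt;For the other half of that pattern, the &lt;code&gt;img_b64&lt;/code&gt; string can come straight out of Matplotlib: render the figure into an in-memory buffer and base64-encode the PNG bytes. This is a sketch of the general approach rather than the exact code from that page:&lt;/p&gt;

```python
import base64
import io

import matplotlib
matplotlib.use("Agg")  # off-screen backend, as when rendering for the DOM
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([1, 2, 3], [2, 4, 8])

buf = io.BytesIO()
fig.savefig(buf, format="png", bbox_inches="tight")
img_b64 = base64.b64encode(buf.getvalue()).decode("ascii")

# PNG bytes always base64-encode to a string starting "iVBORw0KGgo".
print(img_b64[:11])
```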
&lt;p&gt;I will most definitely be using these techniques again in the future.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Update&lt;/strong&gt;: Coincidentally, Anthropic released their own upgraded equivalent to ChatGPT Code Interpreter later on the day that I published this story, so I &lt;a href="https://simonwillison.net/2025/Sep/9/claude-code-interpreter/#something-much-harder-recreating-the-ai-adoption-chart"&gt;ran the same chart recreation experiment&lt;/a&gt; against Claude Sonnet 4 to see how it compared.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/census"&gt;census&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/data-journalism"&gt;data-journalism&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/javascript"&gt;javascript&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/tools"&gt;tools&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/visualization"&gt;visualization&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pyodide"&gt;pyodide&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openai"&gt;openai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/chatgpt"&gt;chatgpt&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-programming"&gt;ai-assisted-programming&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/code-interpreter"&gt;code-interpreter&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-reasoning"&gt;llm-reasoning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/vibe-coding"&gt;vibe-coding&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-search"&gt;ai-assisted-search&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gpt-5"&gt;gpt-5&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="census"/><category term="data-journalism"/><category term="javascript"/><category term="python"/><category term="tools"/><category term="visualization"/><category term="ai"/><category term="pyodide"/><category term="openai"/><category term="generative-ai"/><category term="chatgpt"/><category term="llms"/><category term="ai-assisted-programming"/><category term="code-interpreter"/><category term="llm-reasoning"/><category term="vibe-coding"/><category term="ai-assisted-search"/><category term="gpt-5"/></entry><entry><title>python-importtime-graph</title><link href="https://simonwillison.net/2025/Jun/20/python-importtime-graph/#atom-tag" rel="alternate"/><published>2025-06-20T19:31:45+00:00</published><updated>2025-06-20T19:31:45+00:00</updated><id>https://simonwillison.net/2025/Jun/20/python-importtime-graph/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/kmichel/python-importtime-graph"&gt;python-importtime-graph&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;I was exploring why a Python tool was taking over a second to start running and I learned about the &lt;code&gt;python -X importtime&lt;/code&gt; feature, &lt;a href="https://docs.python.org/3/using/cmdline.html#cmdoption-X"&gt;documented here&lt;/a&gt;. Adding that option causes Python to spit out a text tree showing the time spent importing every module.&lt;/p&gt;
&lt;p&gt;I tried that like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;python -X importtime -m llm plugins
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;That's for &lt;a href="https://llm.datasette.io/"&gt;LLM&lt;/a&gt; running 41 different plugins. Here's &lt;a href="https://gist.github.com/simonw/5b7ee41b5ee324105f23ee695d4c0906"&gt;the full output&lt;/a&gt; from that command, which starts like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import time: self [us] | cumulative | imported package
import time:        77 |         77 |   _io
import time:        19 |         19 |   marshal
import time:       131 |        131 |   posix
import time:       363 |        590 | _frozen_importlib_external
import time:       450 |        450 |   time
import time:       110 |        559 | zipimport
import time:        64 |         64 |     _codecs
import time:       252 |        315 |   codecs
import time:       277 |        277 |   encodings.aliases
&lt;/code&gt;&lt;/pre&gt;
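The trace format is regular enough to parse with a few lines of Python. Here is a hypothetical sketch (not part of any tool mentioned here) that turns each data row into a `(self_us, cumulative_us, depth, module)` tuple, using the indentation of the module name as an approximation of nesting depth:

```python
def parse_importtime(lines):
    """Parse `python -X importtime` output rows.

    Each data row looks like:
        import time:        77 |         77 |   _io
    The indentation of the module name encodes its nesting depth.
    """
    rows = []
    for line in lines:
        if not line.startswith("import time:"):
            continue
        fields = line[len("import time:"):].split("|")
        if len(fields) != 3 or "self" in fields[0]:
            continue  # skip the header row
        name = fields[2].rstrip()
        # roughly two spaces of indentation per level of nesting
        depth = (len(name) - len(name.lstrip(" "))) // 2
        rows.append((int(fields[0]), int(fields[1]), depth, name.strip()))
    return rows
```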
&lt;p&gt;Kevin Michel built this excellent tool for visualizing these traces as a treemap. It runs in a browser - visit &lt;a href="https://kmichel.github.io/python-importtime-graph/"&gt;kmichel.github.io/python-importtime-graph/&lt;/a&gt; and paste in the trace to get the visualization.&lt;/p&gt;
&lt;p&gt;Here's what I got for that LLM example trace:&lt;/p&gt;
&lt;p&gt;&lt;img alt="An illegibly dense treemap" src="https://static.simonwillison.net/static/2025/llm-importtime.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;As you can see, it's pretty dense! Here's &lt;a href="https://static.simonwillison.net/static/2025/llm-importtime.svg"&gt;the SVG version&lt;/a&gt; which is a lot more readable, since you can zoom in to individual sections.&lt;/p&gt;
&lt;p&gt;Zooming in it looks like this:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Performance profiling treemap visualization showing execution times for Python libraries and modules, with color-coded rectangular blocks sized proportionally to their execution time, displaying &amp;quot;Total: 2845.828 ms&amp;quot; at the top with major components like &amp;quot;lim.cli: 2256.275 ms&amp;quot; and &amp;quot;openai: 150.043 ms&amp;quot;" src="https://static.simonwillison.net/static/2025/llm-importtime-zoom.jpg" /&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/performance"&gt;performance&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/visualization"&gt;visualization&lt;/a&gt;&lt;/p&gt;



</summary><category term="performance"/><category term="python"/><category term="visualization"/></entry><entry><title>Observable Plot: Waffle mark</title><link href="https://simonwillison.net/2024/Aug/6/observable-plot-waffle-mark/#atom-tag" rel="alternate"/><published>2024-08-06T21:40:48+00:00</published><updated>2024-08-06T21:40:48+00:00</updated><id>https://simonwillison.net/2024/Aug/6/observable-plot-waffle-mark/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://observablehq.com/plot/marks/waffle"&gt;Observable Plot: Waffle mark&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;New feature in &lt;a href="https://github.com/observablehq/plot/blob/v0.6.16/CHANGELOG.md#0616"&gt;Observable Plot 0.6.16&lt;/a&gt;: the waffle mark! I really like this one. Here's an example showing the gender and weight of athletes in this year's Olympics:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Displays weight distribution data for males and females using small colored squares stacked to form columns, with blue representing females and orange representing males. The x-axis shows weight from 30 to 170, while the y-axis indicates frequency up to 2,800. The distribution forms a bell curve, with females peaking at lower weights than males, and clear differences visible between the genders, especially at weight extremes." src="https://static.simonwillison.net/static/2024/waffle.png" /&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/mbostock/status/1820867956366065819"&gt;@mbostock&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/javascript"&gt;javascript&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/visualization"&gt;visualization&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/observable"&gt;observable&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/observable-plot"&gt;observable-plot&lt;/a&gt;&lt;/p&gt;



</summary><category term="javascript"/><category term="visualization"/><category term="observable"/><category term="observable-plot"/></entry><entry><title>I’ve stopped using box plots. Should you?</title><link href="https://simonwillison.net/2024/Jun/19/box-plots/#atom-tag" rel="alternate"/><published>2024-06-19T00:22:55+00:00</published><updated>2024-06-19T00:22:55+00:00</updated><id>https://simonwillison.net/2024/Jun/19/box-plots/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://nightingaledvs.com/ive-stopped-using-box-plots-should-you/"&gt;I’ve stopped using box plots. Should you?&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Nick Desbarats explains box plots (including with &lt;a href="https://www.youtube.com/watch?v=iBq23-eQhp8"&gt;this excellent short YouTube video&lt;/a&gt;) and then discusses why he thinks "typically less than 20 percent" of participants in his workshops already understand how to read them.&lt;/p&gt;
&lt;p&gt;A key problem is that they are unintuitive: a box plot has four sections, two thin lines (the top and bottom whisker segments) and two larger boxes, joined around the median. Each of these elements represents the same number of samples (one quartile each), but the thin lines vs. thick boxes imply that the whiskers contain fewer samples than the boxes.&lt;/p&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://lobste.rs/s/io4aui/i_ve_stopped_using_box_plots_should_you"&gt;lobste.rs&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/visualization"&gt;visualization&lt;/a&gt;&lt;/p&gt;



</summary><category term="visualization"/></entry><entry><title>datasette-plot - a new Datasette Plugin for building data visualizations</title><link href="https://simonwillison.net/2023/Dec/31/datasette-plot/#atom-tag" rel="alternate"/><published>2023-12-31T05:04:19+00:00</published><updated>2023-12-31T05:04:19+00:00</updated><id>https://simonwillison.net/2023/Dec/31/datasette-plot/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.datasette.cloud/blog/2023/datasette-plot/"&gt;datasette-plot - a new Datasette Plugin for building data visualizations&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
I forgot to link to this here last week: Alex Garcia released the first version of datasette-plot, a brand new Datasette visualization plugin built on top of the Observable Plot charting library. We plan to use this as the new, updated alternative to my older datasette-vega plugin.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/plugins"&gt;plugins&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/visualization"&gt;visualization&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/observable"&gt;observable&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/alex-garcia"&gt;alex-garcia&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/observable-plot"&gt;observable-plot&lt;/a&gt;&lt;/p&gt;



</summary><category term="plugins"/><category term="visualization"/><category term="datasette"/><category term="observable"/><category term="alex-garcia"/><category term="observable-plot"/></entry><entry><title>datasette-dashboards</title><link href="https://simonwillison.net/2022/Apr/7/datasette-dashboards/#atom-tag" rel="alternate"/><published>2022-04-07T18:36:45+00:00</published><updated>2022-04-07T18:36:45+00:00</updated><id>https://simonwillison.net/2022/Apr/7/datasette-dashboards/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/rclement/datasette-dashboards"&gt;datasette-dashboards&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Romain Clement’s datasette-dashboards plugin lets you configure dashboards for Datasette using YAML, combining markdown blocks, Vega graphs and single number metrics using a layout powered by CSS grids. This is a beautiful piece of software design, with a very compelling live demo.

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=30916986"&gt;rmnclmnt comment on Hacker News&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/plugins"&gt;plugins&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/visualization"&gt;visualization&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;&lt;/p&gt;



</summary><category term="plugins"/><category term="visualization"/><category term="datasette"/></entry><entry><title>Datasette table diagram using Mermaid</title><link href="https://simonwillison.net/2022/Feb/14/datasette-table-diagram-using-mermaid/#atom-tag" rel="alternate"/><published>2022-02-14T19:43:15+00:00</published><updated>2022-02-14T19:43:15+00:00</updated><id>https://simonwillison.net/2022/Feb/14/datasette-table-diagram-using-mermaid/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://observablehq.com/@simonw/datasette-table-diagram-using-mermaid"&gt;Datasette table diagram using Mermaid&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Mermaid is a DSL for generating diagrams from plain text, designed to be embedded in Markdown. GitHub just added support for Mermaid to their Markdown pipeline, which inspired me to try it out. Here’s an Observable Notebook I built which uses Mermaid to visualize the relationships between Datasette tables based on their foreign keys.

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/simonw/status/1493305519481626626"&gt;@simonw&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/dsl"&gt;dsl&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/github"&gt;github&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/visualization"&gt;visualization&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/observable"&gt;observable&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/mermaid"&gt;mermaid&lt;/a&gt;&lt;/p&gt;



</summary><category term="dsl"/><category term="github"/><category term="visualization"/><category term="datasette"/><category term="observable"/><category term="mermaid"/></entry><entry><title>Observable Plot Cheatsheets</title><link href="https://simonwillison.net/2022/Jan/25/observable-plot-cheatsheets/#atom-tag" rel="alternate"/><published>2022-01-25T22:12:45+00:00</published><updated>2022-01-25T22:12:45+00:00</updated><id>https://simonwillison.net/2022/Jan/25/observable-plot-cheatsheets/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://observablehq.com/@observablehq/plot-cheatsheets"&gt;Observable Plot Cheatsheets&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Beautiful new set of cheatsheets by Mike Freeman for the Observable Plot charting library. This is really top notch documentation—the cheatsheets are available as printable PDFs but the real value here is in the interactive versions of them, which include Observable-powered sliders to tweak the different examples and copy out the resulting generated code.

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/mf_viz/status/1485705178925916175"&gt;@mf_viz&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/visualization"&gt;visualization&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/observable"&gt;observable&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/observable-plot"&gt;observable-plot&lt;/a&gt;&lt;/p&gt;



</summary><category term="visualization"/><category term="observable"/><category term="observable-plot"/></entry><entry><title>Plot &amp; Vega-Lite</title><link href="https://simonwillison.net/2021/May/4/plot-vega-lite/#atom-tag" rel="alternate"/><published>2021-05-04T16:32:47+00:00</published><updated>2021-05-04T16:32:47+00:00</updated><id>https://simonwillison.net/2021/May/4/plot-vega-lite/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://observablehq.com/@observablehq/plot-vega-lite"&gt;Plot &amp;amp; Vega-Lite&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Useful documentation comparing the brand new Observable Plot to Vega-Lite, complete with examples of how to achieve the same thing in both libraries.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/visualization"&gt;visualization&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/d3"&gt;d3&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/observable"&gt;observable&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/observable-plot"&gt;observable-plot&lt;/a&gt;&lt;/p&gt;



</summary><category term="visualization"/><category term="d3"/><category term="observable"/><category term="observable-plot"/></entry><entry><title>Observable Plot</title><link href="https://simonwillison.net/2021/May/4/observable-plot/#atom-tag" rel="alternate"/><published>2021-05-04T16:28:34+00:00</published><updated>2021-05-04T16:28:34+00:00</updated><id>https://simonwillison.net/2021/May/4/observable-plot/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://observablehq.com/@observablehq/plot"&gt;Observable Plot&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
This is huge: a brand new high-level JavaScript visualization library from Mike Bostock, the author of D3—partially inspired by Vega-Lite which I’ve used enthusiastically in the past. First impressions are that this is a big step forward for quickly building high-quality visualizations. It’s released under the ISC license which is “functionally equivalent to the BSD 2-Clause and MIT licenses”.

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/mbostock/status/1389611128339144704"&gt;@mbostock&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/open-source"&gt;open-source&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/visualization"&gt;visualization&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/d3"&gt;d3&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/observable"&gt;observable&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/mike-bostock"&gt;mike-bostock&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/observable-plot"&gt;observable-plot&lt;/a&gt;&lt;/p&gt;



</summary><category term="open-source"/><category term="visualization"/><category term="d3"/><category term="observable"/><category term="mike-bostock"/><category term="observable-plot"/></entry><entry><title>Weeknotes: datasette-seaborn, fivethirtyeight-polls</title><link href="https://simonwillison.net/2020/Sep/18/weeknotes-datasette-seaborn/#atom-tag" rel="alternate"/><published>2020-09-18T19:49:02+00:00</published><updated>2020-09-18T19:49:02+00:00</updated><id>https://simonwillison.net/2020/Sep/18/weeknotes-datasette-seaborn/#atom-tag</id><summary type="html">
    &lt;p&gt;This week I &lt;a href="https://simonwillison.net/2020/Sep/15/datasette-0-49/"&gt;released Datasette 0.49&lt;/a&gt; and tinkered with datasette-seaborn, dogsheep-beta and polling data from FiveThirtyEight.&lt;/p&gt;
&lt;h4 id="weeknotes-18-sep-datasette-seaborn"&gt;datasette-seaborn&lt;/h4&gt;
&lt;p&gt;Datasette currently has two key plugins for data visualization: &lt;a href="https://github.com/simonw/datasette-vega"&gt;datasette-vega&lt;/a&gt; for line, bar and scatter charts (powered by &lt;a href="https://vega.github.io/vega-lite/"&gt;Vega-Lite&lt;/a&gt;) and &lt;a href="https://github.com/simonw/datasette-cluster-map"&gt;datasette-cluster-map&lt;/a&gt; for creating clustered marker maps.&lt;/p&gt;
&lt;p&gt;I'm always looking for other interesting visualization opportunities. Seaborn 0.11 &lt;a href="https://seaborn.pydata.org/whatsnew.html"&gt;came out last week&lt;/a&gt; and is a very exciting piece of software.&lt;/p&gt;
&lt;p&gt;Seaborn focuses on statistical visualization tools - histograms, boxplots and the like - and represents 8 years of development led by Michael Waskom. It's built on top of &lt;a href="https://matplotlib.org/"&gt;Matplotlib&lt;/a&gt; and exhibits meticulously good taste.&lt;/p&gt;
&lt;p&gt;So I've started building &lt;a href="https://github.com/simonw/datasette-seaborn"&gt;datasette-seaborn&lt;/a&gt;, a plugin which provides an HTTP interface for generating Seaborn charts using data stored in Datasette.&lt;/p&gt;
&lt;p&gt;It's still very alpha but early results are promising. I've chosen to implement it as a &lt;a href="https://docs.datasette.io/en/stable/plugin_hooks.html#register-output-renderer-datasette"&gt;custom output renderer&lt;/a&gt; - so adding &lt;code&gt;.seaborn&lt;/code&gt; to any Datasette table (plus some querystring parameters) will output a rendered PNG chart of that data, just like how &lt;code&gt;.json&lt;/code&gt; or &lt;code&gt;.csv&lt;/code&gt; give you that data in different export formats.&lt;/p&gt;
&lt;p&gt;Here's an example chart generated by the plugin using the fabulous &lt;a href="https://allisonhorst.github.io/palmerpenguins/"&gt;palmerpenguins&lt;/a&gt; dataset (intended as a new alternative to &lt;a href="https://archive.ics.uci.edu/ml/datasets/iris"&gt;Iris&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2020/penguins.seaborn.png" style="max-width:100%;" /&gt;&lt;/p&gt;
&lt;p&gt;I generated this image from the following URL:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;https://datasette-seaborn-demo.datasette.io/penguins/penguins.seaborn?_size=1000&amp;amp;_seaborn=kdeplot&amp;amp;_seaborn_x=flipper_length_mm&amp;amp;_seaborn_hue=species&amp;amp;_seaborn_multiple=stack&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;This interface should be considered unstable and likely to change, but it illustrates the key idea behind the plugin: use &lt;code&gt;?_seaborn_x&lt;/code&gt; parameters to feed in options for the chart that you want to render.&lt;/p&gt;
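Those querystring parameters can be assembled programmatically rather than by hand. A sketch using only the Python standard library (the parameter names come straight from the example URL above):

```python
from urllib.parse import urlencode

base = "https://datasette-seaborn-demo.datasette.io/penguins/penguins.seaborn"
params = {
    "_size": 1000,
    "_seaborn": "kdeplot",  # which Seaborn plotting function to call
    "_seaborn_x": "flipper_length_mm",
    "_seaborn_hue": "species",
    "_seaborn_multiple": "stack",
}
# urlencode() handles escaping, so values can contain spaces or special characters
url = base + "?" + urlencode(params)
```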
&lt;p&gt;The two biggest issues with the plugin right now are that it &lt;a href="https://github.com/simonw/datasette-seaborn/issues/4"&gt;renders images on the main thread&lt;/a&gt;, potentially blocking the event loop, and it passes querystring arguments directly to seaborn &lt;a href="https://github.com/simonw/datasette-seaborn/issues/3"&gt;without first validating them&lt;/a&gt; which is almost certainly a security problem.&lt;/p&gt;
&lt;p&gt;So very much an alpha, but it's a promising start!&lt;/p&gt;
&lt;h4 id="weeknotes-18-sep-dogsheep-beta"&gt;dogsheep-beta&lt;/h4&gt;
&lt;p&gt;&lt;a href="https://github.com/dogsheep/dogsheep-beta"&gt;Dogsheep Beta&lt;/a&gt; provides faceted search across my personal data from a whole variety of different sources.&lt;/p&gt;
&lt;p&gt;I demo'd it &lt;a href="https://www.youtube.com/watch?v=CPQCD3Qxxik"&gt;at PyCon AU&lt;/a&gt; a few weeks ago, and promised that a full write-up would follow. I still need to honour that promise! I'm figuring out how to provide a good interactive demo at the moment that doesn't expose my personal data.&lt;/p&gt;
&lt;p&gt;I added sort by date in addition to sort by relevance in &lt;a href="https://github.com/dogsheep/dogsheep-beta/releases/tag/0.7"&gt;version 0.7&lt;/a&gt; this week.&lt;/p&gt;
&lt;h4 id="weeknotes-18-sep-fivethirtyeight-polls"&gt;fivethirtyeight-polls&lt;/h4&gt;
&lt;p&gt;&lt;a href="https://fivethirtyeight.com/"&gt;FiveThirtEight&lt;/a&gt; have long published the data behind their stories in their &lt;a href="https://github.com/fivethirtyeight/data"&gt;fivethirtyeight/data&lt;/a&gt; GitHub repository, and I've been using that data to power &lt;a href="https://fivethirtyeight.datasettes.com/"&gt;a Datasette demo&lt;/a&gt; ever since I first released the project.&lt;/p&gt;
&lt;p&gt;They run an index of their data projects at &lt;a href="https://data.fivethirtyeight.com/"&gt;data.fivethirtyeight.com&lt;/a&gt;, and this week I noticed that they list US election polling data there that wasn't being picked up by my &lt;a href="https://fivethirtyeight.datasettes.com/"&gt;fivethirtyeight.datasettes.com&lt;/a&gt; site.&lt;/p&gt;
&lt;p&gt;It turns out this is listed in the GitHub repository &lt;a href="https://github.com/fivethirtyeight/data/tree/master/polls"&gt;as a README.md file&lt;/a&gt; but without the actual CSV data. Instead, the README links to external CSV files with URLs like &lt;code&gt;https://projects.fivethirtyeight.com/polls-page/president_primary_polls.csv&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;It looks to me like they're running their own custom web application which provides the CSV data as an export format, rather than keeping that CSV data directly in their GitHub repository.&lt;/p&gt;
&lt;p&gt;This makes sense - I imagine they run a lot of custom code to help them manage their data - but does mean that &lt;a href="https://github.com/simonw/fivethirtyeight-datasette"&gt;my Datasette publishing scripts&lt;/a&gt; weren't catching their polling data.&lt;/p&gt;
&lt;p&gt;Since they release their data as Creative Commons Attribution 4.0 International I decided to start archiving it on GitHub, where it would be easier to automate.&lt;/p&gt;
&lt;p&gt;I set up &lt;a href="https://github.com/simonw/fivethirtyeight-polls"&gt;simonw/fivethirtyeight-polls&lt;/a&gt; to do just that. It's a classic implementation of the &lt;a href="https://github.com/topics/git-scraping"&gt;git-scraping pattern&lt;/a&gt;: it runs &lt;a href="https://github.com/simonw/fivethirtyeight-polls/blob/main/.github/workflows/build.yml"&gt;a workflow script&lt;/a&gt; four times a day which fetches their latest CSV files and commits them to a repo. This means I now have a &lt;a href="https://github.com/simonw/fivethirtyeight-polls/commits/main"&gt;commit history&lt;/a&gt; of changes they have made to their polling data!&lt;/p&gt;
&lt;p&gt;I updated my FiveThirtyEight Datasette script to publish that data as a new &lt;code&gt;polls.db&lt;/code&gt; database, which is now being served at &lt;a href="https://fivethirtyeight.datasettes.com/polls"&gt;fivethirtyeight.datasettes.com/polls&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;And since that Datasette instance runs the &lt;a href="https://github.com/simonw/datasette-graphql"&gt;datasette-graphql&lt;/a&gt; plugin, you can now use GraphQL to explore FiveThirtyEight's most recent polling data at &lt;a href="https://fivethirtyeight.datasettes.com/graphql/polls"&gt;https://fivethirtyeight.datasettes.com/graphql/polls&lt;/a&gt; - here's an &lt;a href="https://fivethirtyeight.datasettes.com/graphql/polls?query=%7B%0A%20%20generic_ballot_polls(filter%3A%7Bpollster_rating_name%3A%7Beq%3A%22YouGov%22%7D%7D)%20%7B%0A%20%20%20%20totalCount%0A%20%20%20%20nodes%20%7B%0A%20%20%20%20%20%20pollster%0A%20%20%20%20%20%20sponsor_ids%0A%20%20%20%20%20%20sponsors%0A%20%20%20%20%20%20partisan%0A%20%20%20%20%20%20display_name%0A%20%20%20%20%20%20pollster_rating_name%0A%20%20%20%20%20%20fte_grade%0A%20%20%20%20%20%20population%0A%20%20%20%20%20%20population_full%0A%20%20%20%20%20%20methodology%0A%20%20%20%20%20%20office_type%0A%20%20%20%20%20%20start_date%0A%20%20%20%20%20%20end_date%0A%20%20%20%20%20%20tracking%0A%20%20%20%20%20%20created_at%0A%20%20%20%20%20%20notes%0A%20%20%20%20%20%20url%0A%20%20%20%20%20%20stage%0A%20%20%20%20%20%20dem%0A%20%20%20%20%20%20rep%0A%20%20%20%20%20%20ind%0A%20%20%20%20%20%20question_id%0A%20%20%20%20%20%20poll_id%0A%20%20%20%20%20%20cycle%0A%20%20%20%20%20%20state%0A%20%20%20%20%20%20pollster_id%0A%20%20%20%20%20%20pollster_rating_id%0A%20%20%20%20%20%20sample_size%0A%20%20%20%20%20%20seat_number%0A%20%20%20%20%20%20sponsor_candidate%0A%20%20%20%20%20%20nationwide_batch%0A%20%20%20%20%20%20rowid%0A%20%20%20%20%7D%0A%20%20%7D%0A%7D%0A"&gt;example query&lt;/a&gt;.&lt;/p&gt;
&lt;h4 id="weeknotes-18-sep-github-to-sqlite-get"&gt;github-to-sqlite get&lt;/h4&gt;
&lt;p&gt;&lt;a href="https://github.com/dogsheep/github-to-sqlite"&gt;github-to-sqlite&lt;/a&gt; lets you fetch data from the GitHub API and write it into a SQLite database.&lt;/p&gt;
&lt;p&gt;This week I &lt;a href="https://github.com/dogsheep/github-to-sqlite/issues/50"&gt;added a sub-command&lt;/a&gt; for hitting the API directly and getting back data on the console, inspired by the &lt;code&gt;fetch&lt;/code&gt; subcommand in &lt;code&gt;twitter-to-sqlite&lt;/code&gt;. This is useful for trying out new APIs, since it both takes into account your GitHub authentication credentials (from an environment variable or an &lt;code&gt;auth.json&lt;/code&gt; file) and can handle &lt;code&gt;Link&lt;/code&gt; header pagination.&lt;/p&gt;
&lt;p&gt;This example fetches all of my repos, paginating across multiple API requests:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;github-to-sqlite get /users/simonw/repos --paginate
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;You can use the &lt;code&gt;--nl&lt;/code&gt; option to get back the results as newline-delimited JSON. This makes it easy to pipe them directly into &lt;code&gt;sqlite-utils&lt;/code&gt; like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;github-to-sqlite get /users/simonw/repos --paginate --nl \
  | sqlite-utils insert simonw.db repos - --nl
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;See &lt;a href="https://sqlite-utils.readthedocs.io/en/stable/cli.html#inserting-json-data"&gt;Inserting JSON data&lt;/a&gt; in the &lt;code&gt;sqlite-utils&lt;/code&gt; documentation for an explanation of what this is doing.&lt;/p&gt;
&lt;h4&gt;TIL this week&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/til/til/github-actions_debug-tmate.md"&gt;Open a debugging shell in GitHub Actions with tmate&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/til/til/github-actions_service-containers-docker.md"&gt;Talking to a PostgreSQL service container from inside a Docker container&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;Releases this week&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/dogsheep/dogsheep-beta/releases/tag/0.7.1"&gt;dogsheep-beta 0.7.1&lt;/a&gt; - 2020-09-17&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/dogsheep/dogsheep-beta/releases/tag/0.7"&gt;dogsheep-beta 0.7&lt;/a&gt; - 2020-09-17&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/dogsheep/github-to-sqlite/releases/tag/2.6"&gt;github-to-sqlite 2.6&lt;/a&gt; - 2020-09-17&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/simonw/datasette/releases/tag/0.49.1"&gt;datasette 0.49.1&lt;/a&gt; - 2020-09-15&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/simonw/datasette-ics/releases/tag/0.5"&gt;datasette-ics 0.5&lt;/a&gt; - 2020-09-14&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/simonw/datasette-copyable/releases/tag/0.3"&gt;datasette-copyable 0.3&lt;/a&gt; - 2020-09-14&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/simonw/datasette-atom/releases/tag/0.8"&gt;datasette-atom 0.8&lt;/a&gt; - 2020-09-14&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/simonw/datasette-yaml/releases/tag/0.1"&gt;datasette-yaml 0.1&lt;/a&gt; - 2020-09-14&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/simonw/datasette/releases/tag/0.49"&gt;datasette 0.49&lt;/a&gt; - 2020-09-14&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/simonw/datasette/releases/tag/0.49a1"&gt;datasette 0.49a1&lt;/a&gt; - 2020-09-14&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/simonw/datasette-seaborn/releases/tag/0.1a1"&gt;datasette-seaborn 0.1a1&lt;/a&gt; - 2020-09-11&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/simonw/datasette-seaborn/releases/tag/0.1a0"&gt;datasette-seaborn 0.1a0&lt;/a&gt; - 2020-09-11&lt;/li&gt;
&lt;/ul&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/visualization"&gt;visualization&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/weeknotes"&gt;weeknotes&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="projects"/><category term="visualization"/><category term="datasette"/><category term="weeknotes"/></entry><entry><title>Datasette table diagram, now with a DOT graph</title><link href="https://simonwillison.net/2020/May/8/datasette-table-diagram-now-dot-graph/#atom-tag" rel="alternate"/><published>2020-05-08T03:23:32+00:00</published><updated>2020-05-08T03:23:32+00:00</updated><id>https://simonwillison.net/2020/May/8/datasette-table-diagram-now-dot-graph/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://observablehq.com/@simonw/datasette-table-diagram"&gt;Datasette table diagram, now with a DOT graph&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Thomas Ballinger shared a huge improvement to my Observable notebook for rendering a diagram of a collection of Datasette tables. He showed how to use the DOT language to render a full schema diagram with arrows joining the different tables. I’ve applied his changes to my notebook.

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/ballingt/status/1258592855787925504"&gt;@ballingt&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/visualization"&gt;visualization&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/observable"&gt;observable&lt;/a&gt;&lt;/p&gt;



</summary><category term="visualization"/><category term="datasette"/><category term="observable"/></entry><entry><title>Weeknotes: Niche Museums, Kepler, Trees and Streaks</title><link href="https://simonwillison.net/2019/Oct/28/niche-museums-kepler/#atom-tag" rel="alternate"/><published>2019-10-28T22:42:10+00:00</published><updated>2019-10-28T22:42:10+00:00</updated><id>https://simonwillison.net/2019/Oct/28/niche-museums-kepler/#atom-tag</id><summary type="html">
    &lt;h3 id="Niche_Museums_4"&gt;Niche Museums&lt;/h3&gt;

&lt;p&gt;Every now and then someone will ask “so when are you going to build Museums Near Me then?”, based on &lt;a href="https://twitter.com/simonw/status/1171159213436997633"&gt;my obsession with niche museums&lt;/a&gt; and websites like &lt;a href="https://www.owlsnearme.com/"&gt;www.owlsnearme.com&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;For my Strategic Communications course at Stanford last week I had to give a midterm presentation - a six-minute talk to convince my audience of something, accompanied by slides and a handout.&lt;/p&gt;
&lt;p&gt;I chose “you should seek out and explore tiny museums” as my topic, and used it as an excuse to finally start the website!&lt;/p&gt;
&lt;p&gt;&lt;a href="https://www.niche-museums.com/"&gt;www.niche-museums.com&lt;/a&gt; is the result. It’s a small but growing collection of niche museums (17 so far, mostly in the San Francisco Bay Area) complete with the all important blue “Use my location” button to see museums near you.&lt;/p&gt;
&lt;p&gt;Naturally I built it on &lt;a href="https://github.com/simonw/datasette"&gt;Datasette&lt;/a&gt;. I’ll be writing more about the implementation (and releasing the underlying code) soon. I also built a new plugin for it, &lt;a href="https://github.com/simonw/datasette-haversine"&gt;datasette-haversine&lt;/a&gt;.&lt;/p&gt;

&lt;h3 id="Mapping_museums_against_Starbucks_16"&gt;Mapping museums against Starbucks&lt;/h3&gt;

&lt;p&gt;I needed a way to emphasize quite how many tiny museums there are in the USA. I decided to do this with a visualization.&lt;/p&gt;
&lt;p&gt;It turns out there are 15,891 branches of Starbucks in the USA… and at least 30,132 museums!&lt;/p&gt;

&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2019/starbucks.png" alt="15,891 Starbucks" style="max-width: 100%" /&gt;&lt;/p&gt;

&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2019/museums.png" alt="At least 30.132 museums!" style="max-width: 100%" /&gt;&lt;/p&gt;

&lt;p&gt;I made these maps using a couple of sources.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://www.alltheplaces.xyz/"&gt;All The Places&lt;/a&gt; is a crowdsourced scraper project which aims to build scrapers for every company that has a “store locator” area of their website. Starbucks has &lt;a href="https://www.starbucks.com/store-locator"&gt;a store locator&lt;/a&gt; and All The Places have &lt;a href="https://github.com/alltheplaces/alltheplaces/blob/master/locations/spiders/starbucks.py"&gt;a scraper for it&lt;/a&gt;, so you can download GeoJSON of every Starbucks. I wrote a quick script to import that GeoJSON into Datasette using sqlite-utils.&lt;/p&gt;
&lt;p&gt;The &lt;a href="https://www.imls.gov/"&gt;Institute of Museum and Library Services&lt;/a&gt; is an independent agency of the federal government that supports museums and libraries across the country. They publish a &lt;a href="https://www.imls.gov/research-evaluation/data-collection/museum-data-files"&gt;dataset of Museums in the USA&lt;/a&gt; as a set of CSV files. I used &lt;a href="https://github.com/simonw/csvs-to-sqlite"&gt;csvs-to-sqlite&lt;/a&gt; to load those into Datasette, than ran a union query to combine the three files together.&lt;/p&gt;
&lt;p&gt;So I have Datasette instances (with a CSV export feature) for both Starbucks and USA museums, with latitudes and longitudes for each.&lt;/p&gt;
&lt;p&gt;Now how to turn that into a map?&lt;/p&gt;
&lt;p&gt;I turned to my new favourite GIS tool, &lt;a href="https://kepler.gl/"&gt;Kepler&lt;/a&gt;. Kepler is an open source GIS visualization tool released by Uber, based on WebGL. It’s astonishingly powerful and can be used directly in your browser by clicking the “Get Started” button on their website (which I assumed would take you to installation instructions, but no, it loads up the entire tool in your browser).&lt;/p&gt;
&lt;p&gt;You can import millions of points of data into Kepler and it will visualize them for you directly. I used a Datasette query to export the CSVs, then loaded in my Starbucks CSV, exported an image, loaded in the Museums CSV as a separate colour and exported a second image. The whole project ended up taking about 15 minutes. Kepler is a great addition to the toolbelt!&lt;/p&gt;

&lt;h3 id="Animating_the_PGE_outages_40"&gt;Animating the PG&amp;amp;E outages&lt;/h3&gt;

&lt;p&gt;My &lt;a href="https://simonwillison.net/2019/Oct/10/pge-outages/"&gt;PG&amp;amp;E outages scraper&lt;/a&gt; continues to record a snapshot of the PG&amp;amp;E outage map JSON every ten minutes. I’m posting updates to &lt;a href="https://twitter.com/simonw/status/1182440312590848001"&gt;a thread on Twitter&lt;/a&gt;, but discovering Kepler inspired me to look at more sophisticated visualization options.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://medium.com/vis-gl/animating-40-years-of-california-earthquakes-e4ffcdd4a289"&gt;This tutorial&lt;/a&gt; by Giuseppe Macrì tipped me off the the fact that you can use Kepler to animate points against timestamps!&lt;/p&gt;
&lt;p&gt;Here’s the result: a video animation showing how PG&amp;amp;E’s outages have evolved since the 5th of October:&lt;/p&gt;

&lt;blockquote class="twitter-tweet"&gt;&lt;p lang="en" dir="ltr"&gt;Here&amp;#39;s a video animation of PG&amp;amp;E&amp;#39;s outages from October 5th up until just a few minutes ago &lt;a href="https://t.co/50K3BrROZR"&gt;pic.twitter.com/50K3BrROZR&lt;/a&gt;&lt;/p&gt;- Simon Willison (@simonw) &lt;a href="https://twitter.com/simonw/status/1188612004572880896?ref_src=twsrc%5Etfw"&gt;October 28, 2019&lt;/a&gt;&lt;/blockquote&gt;

&lt;h3 id="Hayes_Valley_Trees_50"&gt;Hayes Valley Trees&lt;/h3&gt;

&lt;p&gt;The city announced plans to cut down 27 ficus trees in our neighborhood in San Francisco. I’ve been working with Natalie to help a small group of citizens organize an appeal, and this weekend I helped run a survey of the affected trees (recording their exact locations in a CSV file) and then built &lt;a href="https://www.hayes-valley-trees.com/"&gt;www.hayes-valley-trees.com&lt;/a&gt; (&lt;a href="https://github.com/simonw/hayes-valley-trees"&gt;source on GitHub&lt;/a&gt;) to link to from fliers attached to each affected tree.&lt;/p&gt;
&lt;p&gt;It started out as &lt;a href="https://glitch.com/~hayes-valley-trees"&gt;a Datasette&lt;/a&gt; (running on Glitch) but since it’s only 27 data points I ended up freezing the data in a static JSON file to avoid having to tolerate any cold start times. The site is deployed as static assets on Zeit Now using their handy &lt;a href="https://zeit.co/github"&gt;GitHub continuous deployment tool&lt;/a&gt;.&lt;/p&gt;
&lt;h3 id="Streaks_56"&gt;Streaks&lt;/h3&gt;
&lt;p&gt;It turns out I’m very motivated by streaks: I’m at 342 days for Duolingo Spanish and 603 days for an Apple Watch move streak. Could I apply this to other things in my life?&lt;/p&gt;
&lt;p&gt;I &lt;a href="https://twitter.com/simonw/status/1186824721280593920"&gt;asked on Twitter&lt;/a&gt; and was recommended the &lt;a href="https://streaks.app/"&gt;Streaks iOS app&lt;/a&gt;. It’s beautiful! I’m now tracking streaks for guitar practice, Duolingo, checking email, checking Slack, reading some books and adding a new museum to &lt;a href="http://www.niche-museums.com"&gt;www.niche-museums.com&lt;/a&gt; (if I add one a day I can get from 17 museums today to 382 in a year!)&lt;/p&gt;
&lt;p&gt;It seems to be working pretty well so far. I particularly like their iPhone widget.&lt;/p&gt;

&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2019/streaks-widget.jpg" alt="Streaks widget" style="max-width: 100%" /&gt;&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/museums"&gt;museums&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/productivity"&gt;productivity&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/visualization"&gt;visualization&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/weeknotes"&gt;weeknotes&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/baked-data"&gt;baked-data&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/streaks"&gt;streaks&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/duolingo"&gt;duolingo&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="museums"/><category term="productivity"/><category term="projects"/><category term="visualization"/><category term="weeknotes"/><category term="baked-data"/><category term="streaks"/><category term="duolingo"/></entry><entry><title>kepler.gl</title><link href="https://simonwillison.net/2019/Oct/25/keplergl/#atom-tag" rel="alternate"/><published>2019-10-25T04:16:07+00:00</published><updated>2019-10-25T04:16:07+00:00</updated><id>https://simonwillison.net/2019/Oct/25/keplergl/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://kepler.gl/"&gt;kepler.gl&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Uber built this open source geospatial analysis tool for large-scale data sets, and they offer it as a free hosted online tool—just click Get Started on the site. I uploaded two CSV files with 30,000+ latitude/longitude points in them just now and used Kepler to render them as images.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/geospatial"&gt;geospatial&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/visualization"&gt;visualization&lt;/a&gt;&lt;/p&gt;



</summary><category term="geospatial"/><category term="visualization"/></entry><entry><title>Thematic map - GIS Wiki</title><link href="https://simonwillison.net/2019/Oct/21/thematic-map-gis-wiki/#atom-tag" rel="alternate"/><published>2019-10-21T02:25:17+00:00</published><updated>2019-10-21T02:25:17+00:00</updated><id>https://simonwillison.net/2019/Oct/21/thematic-map-gis-wiki/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="http://wiki.gis.com/wiki/index.php/Thematic_map"&gt;Thematic map - GIS Wiki&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
This is a really useful wiki full of GIS information, and the coverage of different types of thematic maps is particularly thorough.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/geospatial"&gt;geospatial&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/visualization"&gt;visualization&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/wiki"&gt;wiki&lt;/a&gt;&lt;/p&gt;



</summary><category term="geospatial"/><category term="visualization"/><category term="wiki"/></entry><entry><title>Exploring Neural Networks with Activation Atlases</title><link href="https://simonwillison.net/2019/Apr/19/exploring-neural-networks-activation-atlases/#atom-tag" rel="alternate"/><published>2019-04-19T02:24:16+00:00</published><updated>2019-04-19T02:24:16+00:00</updated><id>https://simonwillison.net/2019/Apr/19/exploring-neural-networks-activation-atlases/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://distill.pub/2019/activation-atlas/"&gt;Exploring Neural Networks with Activation Atlases&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Another promising attempt at visualizing what’s going on inside a neural network.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/machine-learning"&gt;machine-learning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/visualization"&gt;visualization&lt;/a&gt;&lt;/p&gt;



</summary><category term="machine-learning"/><category term="visualization"/></entry><entry><title>List of Physical Visualizations</title><link href="https://simonwillison.net/2019/Mar/4/list-of-physical-visualizations/#atom-tag" rel="alternate"/><published>2019-03-04T02:45:52+00:00</published><updated>2019-03-04T02:45:52+00:00</updated><id>https://simonwillison.net/2019/Mar/4/list-of-physical-visualizations/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="http://dataphys.org/list/"&gt;List of Physical Visualizations&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
“A chronological list of physical visualizations and related artifacts, maintained by Pierre Dragicevic and Yvonne Jansen”—327 and counting!

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/dvhdr/status/1102307310645460992"&gt;@dvhdr&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/visualization"&gt;visualization&lt;/a&gt;&lt;/p&gt;



</summary><category term="visualization"/></entry><entry><title>datasette-vega</title><link href="https://simonwillison.net/2018/Jun/29/datasette-vega/#atom-tag" rel="alternate"/><published>2018-06-29T15:00:48+00:00</published><updated>2018-06-29T15:00:48+00:00</updated><id>https://simonwillison.net/2018/Jun/29/datasette-vega/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette-vega"&gt;datasette-vega&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
I wrote a visualization plugin for Datasette that uses the excellent Vega “visualization grammar” library to provide bar, line and scatter charts configurable against any Datasette table or SQL query.

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/simonw/status/1012693000022036482"&gt;@simonw&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/plugins"&gt;plugins&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/visualization"&gt;visualization&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;&lt;/p&gt;



</summary><category term="plugins"/><category term="projects"/><category term="visualization"/><category term="datasette"/></entry><entry><title>Datasette plugins, and building a clustered map visualization</title><link href="https://simonwillison.net/2018/Apr/20/datasette-plugins/#atom-tag" rel="alternate"/><published>2018-04-20T15:41:11+00:00</published><updated>2018-04-20T15:41:11+00:00</updated><id>https://simonwillison.net/2018/Apr/20/datasette-plugins/#atom-tag</id><summary type="html">
    &lt;p&gt;&lt;a href="https://github.com/simonw/datasette"&gt;Datasette&lt;/a&gt; now supports plugins!&lt;/p&gt;
&lt;p&gt;Last Saturday &lt;a href="https://twitter.com/simonw/status/985377670388105216"&gt;I asked Twitter&lt;/a&gt; for examples of Python projects with successful plugin ecosystems. &lt;a href="https://docs.pytest.org/"&gt;pytest&lt;/a&gt; was the clear winner: the &lt;a href="https://plugincompat.herokuapp.com/"&gt;pytest plugin compatibility table&lt;/a&gt; (an ingenious innovation that I would love to eventually copy for Datasette) lists 457 plugins, and even the core pytest system itself is built as a collection of default plugins that can be replaced or overridden.&lt;/p&gt;
&lt;p&gt;Best of all: pytest’s plugin mechanism is available as a separate package: &lt;a href="https://pluggy.readthedocs.io/"&gt;pluggy&lt;/a&gt;. And pluggy was exactly what I needed for Datasette.&lt;/p&gt;
&lt;p&gt;You can follow the ongoing development of the feature in issue &lt;a href="https://github.com/simonw/datasette/issues/14"&gt;#14&lt;/a&gt;. This morning I released &lt;a href="https://github.com/simonw/datasette/releases/tag/0.20"&gt;Datasette 0.20&lt;/a&gt; with support for a number of different plugin hooks: plugins can add custom template tags and SQL functions, and can also bundle their own static assets, JavaScript, CSS and templates. The hooks are described in some detail in the &lt;a href="https://datasette.readthedocs.io/en/latest/plugins.html"&gt;Datasette Plugins&lt;/a&gt; documentation.&lt;/p&gt;
&lt;h2&gt;&lt;a id="datasetteclustermap_10"&gt;&lt;/a&gt;datasette-cluster-map&lt;/h2&gt;
&lt;p&gt;I also released my first plugin: &lt;a href="https://pypi.org/project/datasette-cluster-map/"&gt;datasette-cluster-map&lt;/a&gt;. Once installed, it looks out for database tables that have a &lt;code&gt;latitude&lt;/code&gt; and &lt;code&gt;longitude&lt;/code&gt; column. When it finds them, it draws all of the points on an interactive map using &lt;a href="http://leafletjs.com/"&gt;Leaflet&lt;/a&gt; and &lt;a href="https://github.com/Leaflet/Leaflet.markercluster"&gt;Leaflet.markercluster&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Let’s &lt;a href="https://datasette-cluster-map-demo.now.sh/polar-bears-455fe3a/USGS_WC_eartags_output_files_2009-2011-Status"&gt;try it out on some polar bears&lt;/a&gt;!&lt;/p&gt;
&lt;p&gt;&lt;img style="width: 100%" src="https://static.simonwillison.net/static/2018/polar-bears-cluster-map.gif" alt="Polar Bears on a cluster map"/&gt;&lt;/p&gt;
&lt;p&gt;The USGS Alaska Science Center have released a delightful set of data entitled &lt;a href="https://alaska.usgs.gov/products/data.php?dataid=130"&gt;Sensor and Location data from Ear Tag PTTs Deployed on Polar Bears in the Southern Beaufort Sea 2009 to 2011&lt;/a&gt;. It’s a collection of CSV files, which means it’s &lt;a href="https://gist.github.com/simonw/9f8bf23b37a42d7628c4dcc4bba10253"&gt;trivial to convert it to SQLite&lt;/a&gt; using my &lt;a href="https://github.com/simonw/csvs-to-sqlite"&gt;csvs-to-sqlite&lt;/a&gt; tool.&lt;/p&gt;
&lt;p&gt;Having created the SQLite database, we can deploy it to a hosting account on &lt;a href="https://zeit.co/now"&gt;Zeit Now&lt;/a&gt; alongside the new plugin like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Make sure we have the latest datasette
pip3 install datasette --upgrade
# Deploy polar-bears.db to now with an increased default page_size
datasette publish now \
    --install=datasette-cluster-map \
    --extra-options &amp;quot;--page_size=500&amp;quot; \
    polar-bears.db
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;--install&lt;/code&gt; option is new in Datasette 0.20 (it works for &lt;code&gt;datasette publish heroku&lt;/code&gt; as well) - it tells the publishing provider to &lt;code&gt;pip install&lt;/code&gt; the specified package. You can use it more than once to install multiple plugins, and it accepts a path to a zip file in addition to the name of a PyPI package.&lt;/p&gt;
&lt;p&gt;Explore the full demo at &lt;a href="https://datasette-cluster-map-demo.now.sh/polar-bears"&gt;https://datasette-cluster-map-demo.now.sh/polar-bears&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;&lt;a id="Visualize_any_query_on_a_map_35"&gt;&lt;/a&gt;Visualize any query on a map&lt;/h2&gt;
&lt;p&gt;Since the plugin inserts itself at the top of any Datasette table view with &lt;code&gt;latitude&lt;/code&gt; and &lt;code&gt;longitude&lt;/code&gt; columns, there are all sorts of neat tricks you can do with it.&lt;/p&gt;
&lt;p&gt;I also loaded the San Francisco tree list (thanks, &lt;a href="https://data.sfgov.org/City-Infrastructure/Street-Tree-List/tkzw-k3nq"&gt;Department of Public Works&lt;/a&gt;) into the demo. Impressively, you can click “load all” &lt;a href="https://datasette-cluster-map-demo.now.sh/sf-trees-02c8ef1/Street_Tree_List"&gt;on this page&lt;/a&gt; and &lt;code&gt;Leaflet.markercluster&lt;/code&gt; will load in all 189,144 points and display them on the same map… and it works fine on my laptop and my phone. Computers in 2018 are pretty good!&lt;/p&gt;
&lt;p&gt;But since it’s a Datasette table, we can filter it. Here’s a map of &lt;a href="https://datasette-cluster-map-demo.now.sh/sf-trees-02c8ef1/Street_Tree_List?qSpecies=2"&gt;every New Zealand Xmas Tree&lt;/a&gt; in San Francisco (8,683 points). Here’s &lt;a href="https://datasette-cluster-map-demo.now.sh/sf-trees-02c8ef1/Street_Tree_List?qCareAssistant=1"&gt;every tree where the Caretaker is Friends of the Urban Forest&lt;/a&gt;. Here’s &lt;a href="https://datasette-cluster-map-demo.now.sh/sf-trees-02c8ef1/Street_Tree_List?PlantDate__contains=1990&amp;amp;_search=palm&amp;amp;_sort=qLegalStatus"&gt;every palm tree planted in 1990&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img style="width: 100%" src="https://static.simonwillison.net/static/2018/datasette-palm-trees-1990.png" alt="Palm trees planted in 1990" /&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Update:&lt;/strong&gt; This is an incorrect example: there are 21 matches on "palm avenue" because the FTS search index covers the address field - they're not actually palm trees. Here's a corrected query for &lt;a href="https://datasette-cluster-map-demo.now.sh/sf-trees-02c8ef1?sql=select+PlantDate%2C+Street_Tree_List.rowid%2C+latitude%2C+longitude%2C+qSpecies.value%0D%0Afrom+Street_Tree_List+join+qSpecies+on+Street_Tree_List.qSpecies+%3D+qSpecies.id%0D%0Awhere+qSpecies.value+like+%22%25palm%25%22+and+PlantDate+like+%22%251990%25%22"&gt;palm trees planted in 1990&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The plugin currently only works against columns called &lt;code&gt;latitude&lt;/code&gt; and &lt;code&gt;longitude&lt;/code&gt;… but if your columns are called something else, don’t worry: you can craft a custom SQL query that aliases your columns and everything will work as intended. Here’s an example &lt;a href="https://datasette-cluster-map-demo.now.sh/polar-bears-455fe3a?sql=select+*%2C+%22Capture+Latitude%22+as+latitude%2C+%22Capture+Longitude%22+as+longitude%0D%0Afrom+%5BUSGS_WC_eartag_deployments_2009-2011%5D"&gt;against some more polar bear data&lt;/a&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;select *, &amp;quot;Capture Latitude&amp;quot; as latitude, &amp;quot;Capture Longitude&amp;quot; as longitude
from [USGS_WC_eartag_deployments_2009-2011]
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;&lt;a id="Writing_your_own_plugins_50"&gt;&lt;/a&gt;Writing your own plugins&lt;/h2&gt;
&lt;p&gt;I’m really excited to see what people invent. If you want to have a go, your first stop should be the &lt;a href="https://datasette.readthedocs.io/en/latest/plugins.html"&gt;Plugins documentation&lt;/a&gt;. If you want an example of a simple plugin (including the all-important mechanism for packaging it up using &lt;code&gt;setup.py&lt;/code&gt;) take a look at &lt;a href="https://github.com/simonw/datasette-cluster-map"&gt;datasette-cluster-map&lt;/a&gt; on GitHub.&lt;/p&gt;
&lt;p&gt;And if you have any thoughts, ideas or suggestions on how the plugin mechanism can be further employed please join the conversation on &lt;a href="https://github.com/simonw/datasette/issues/14"&gt;issue #14&lt;/a&gt;. I’ve literally just got started with Datasette’s plugin hooks, and I’m very keen to hear about things people want to build that aren’t yet supported.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/maps"&gt;maps&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/plugins"&gt;plugins&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/visualization"&gt;visualization&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pytest"&gt;pytest&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/leaflet"&gt;leaflet&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="maps"/><category term="plugins"/><category term="projects"/><category term="visualization"/><category term="datasette"/><category term="pytest"/><category term="leaflet"/></entry><entry><title>Visualizing disk IO activity using log-scale banded graphs</title><link href="https://simonwillison.net/2018/Apr/11/visualizing-disk-io-activity-using-log-scale-banded-graphs/#atom-tag" rel="alternate"/><published>2018-04-11T17:04:45+00:00</published><updated>2018-04-11T17:04:45+00:00</updated><id>https://simonwillison.net/2018/Apr/11/visualizing-disk-io-activity-using-log-scale-banded-graphs/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://bvckup2.com/wip/10042018"&gt;Visualizing disk IO activity using log-scale banded graphs&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
This is a neat data visualization trick: to display rates of disk I/O, it splits the rate into a GB, MB and KB section on a stacked chart. This means that if you are getting jitter on the order of kilobytes even while running at 400+ MB/second, you can still see that jitter in the KB section.
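The banding idea is easy to replicate - here is a sketch of how a rate might be split into the three sections (my own illustration, not code from the post):

```python
# Split a bytes-per-second rate into GB, MB and KB components so that
# KB-scale jitter stays visible next to values hundreds of MB larger.
KB, MB, GB = 1024, 1024 ** 2, 1024 ** 3

def bands(bytes_per_second):
    gb, rest = divmod(bytes_per_second, GB)
    mb, rest = divmod(rest, MB)
    return gb, mb, rest / KB

# 400 MB/s with 37 KB of jitter -> (0, 400, 37.0)
print(bands(400 * MB + 37 * KB))
```

Plot each component as its own stacked band and the jitter gets its own visual lane.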

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=16812288"&gt;Show HN&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/visualization"&gt;visualization&lt;/a&gt;&lt;/p&gt;



</summary><category term="visualization"/></entry><entry><title>Vega-Lite</title><link href="https://simonwillison.net/2018/Mar/28/vega-lite/#atom-tag" rel="alternate"/><published>2018-03-28T17:22:05+00:00</published><updated>2018-03-28T17:22:05+00:00</updated><id>https://simonwillison.net/2018/Mar/28/vega-lite/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://vega.github.io/vega-lite/"&gt;Vega-Lite&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
A “high-level grammar of interactive graphics”. Part of the Vega project, which provides a mechanism for creating declarative visualizations by defining them using JSON. Vega-Lite is particularly interesting to me because it makes extremely tasteful decisions about how data should be visualized—give it some records, tell it which properties to plot on an axis and it will default to a display that makes sense for that data. The more I play with this the more impressed I am at the quality of its default settings.
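To give a flavour of that declarative JSON, here is a minimal bar chart spec, shown as a Python dict (the data values are made up):

```python
import json

# A minimal Vega-Lite specification: data, a mark type and an encoding.
# Scales, axes and labels are all inferred with sensible defaults.
spec = {
    'data': {'values': [
        {'city': 'SF', 'museums': 50},
        {'city': 'NYC', 'museums': 80},
    ]},
    'mark': 'bar',
    'encoding': {
        'x': {'field': 'city', 'type': 'nominal'},
        'y': {'field': 'museums', 'type': 'quantitative'},
    },
}
print(json.dumps(spec, indent=2))
```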


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/visualization"&gt;visualization&lt;/a&gt;&lt;/p&gt;



</summary><category term="visualization"/></entry><entry><title>Observable: An Earthquake Globe in Ten Minutes</title><link href="https://simonwillison.net/2018/Jan/31/observable-earthquake-globe-ten-minutes/#atom-tag" rel="alternate"/><published>2018-01-31T17:01:16+00:00</published><updated>2018-01-31T17:01:16+00:00</updated><id>https://simonwillison.net/2018/Jan/31/observable-earthquake-globe-ten-minutes/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.youtube.com/watch?v=uEmDwflQ3xE"&gt;Observable: An Earthquake Globe in Ten Minutes&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Well worth your time. Jeremy Ashkenas uses Observable to live-code an interactive visualization of recent earthquakes around the world, using USGS data (fetched as JSON), d3, topoJSON and an Observable notebook. I’m sold—this is truly ground-breaking new technology.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/javascript"&gt;javascript&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/jeremy-ashkenas"&gt;jeremy-ashkenas&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/visualization"&gt;visualization&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/explorables"&gt;explorables&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/d3"&gt;d3&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/observable"&gt;observable&lt;/a&gt;&lt;/p&gt;



</summary><category term="javascript"/><category term="jeremy-ashkenas"/><category term="visualization"/><category term="explorables"/><category term="d3"/><category term="observable"/></entry><entry><title>Exploring Line Lengths in Python Packages</title><link href="https://simonwillison.net/2017/Nov/10/exploring-line-lengths-in-python-packages/#atom-tag" rel="alternate"/><published>2017-11-10T15:34:29+00:00</published><updated>2017-11-10T15:34:29+00:00</updated><id>https://simonwillison.net/2017/Nov/10/exploring-line-lengths-in-python-packages/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="http://jakevdp.github.io/blog/2017/11/09/exploring-line-lengths-in-python-packages/"&gt;Exploring Line Lengths in Python Packages&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Interesting exploration of the impact of the 79-character line length rule of thumb on various Python packages—and a thoroughly useful guide to histogram plotting in Jupyter, pandas and matplotlib.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/pandas"&gt;pandas&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/visualization"&gt;visualization&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/jupyter"&gt;jupyter&lt;/a&gt;&lt;/p&gt;



</summary><category term="pandas"/><category term="python"/><category term="visualization"/><category term="jupyter"/></entry><entry><title>Generating interactive HTML charts from Python?</title><link href="https://simonwillison.net/2016/Nov/25/generating-interactive-html-charts-from/#atom-tag" rel="alternate"/><published>2016-11-25T20:05:00+00:00</published><updated>2016-11-25T20:05:00+00:00</updated><id>https://simonwillison.net/2016/Nov/25/generating-interactive-html-charts-from/#atom-tag</id><summary type="html">
    &lt;p&gt;&lt;em&gt;My answer to &lt;a href="http://ask.metafilter.com/303041/Generating-interactive-HTML-charts-from-Python#4388982"&gt;Generating interactive HTML charts from Python?&lt;/a&gt; on Ask MetaFilter&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;D3 is absolutely amazing but the learning curve is a bit steep. Totally worth the effort to learn it in the long run, but it's not so useful if you want to get something done quickly.&lt;/p&gt;
&lt;p&gt;I've used &lt;a href="http://nvd3.org"&gt;NVD3&lt;/a&gt; successfully in the past - it's another high level library on top of D3. Much faster to get results than D3 on its own.&lt;/p&gt;
&lt;p&gt;From your description it sounds like you should also check out &lt;a href="http://square.github.io/crossfilter/"&gt;crossfilter&lt;/a&gt; which, for the right use-cases, is phenomenal.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ask-metafilter"&gt;ask-metafilter&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/bokeh"&gt;bokeh&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pandas"&gt;pandas&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/visualization"&gt;visualization&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/plotly"&gt;plotly&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datavis"&gt;datavis&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="ask-metafilter"/><category term="bokeh"/><category term="pandas"/><category term="python"/><category term="visualization"/><category term="plotly"/><category term="datavis"/></entry><entry><title>Visualization Strategies: Text &amp; Documents</title><link href="https://simonwillison.net/2008/Aug/22/visualization/#atom-tag" rel="alternate"/><published>2008-08-22T11:17:22+00:00</published><updated>2008-08-22T11:17:22+00:00</updated><id>https://simonwillison.net/2008/Aug/22/visualization/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="http://www.timshowers.com/2008/08/visualization-strategies-text-documents/"&gt;Visualization Strategies: Text &amp;amp; Documents&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
“List of ...” style posts usually make me want to stab someone with a fork; this is how that kind of post should be done—well researched, carefully written and, most importantly, it doesn’t call itself a “Top X Ys that will Z your ZZ”!


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/lists"&gt;lists&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/visualization"&gt;visualization&lt;/a&gt;&lt;/p&gt;



</summary><category term="lists"/><category term="visualization"/></entry><entry><title>Designing Google Reader's trends</title><link href="https://simonwillison.net/2007/Jan/15/designing/#atom-tag" rel="alternate"/><published>2007-01-15T00:53:23+00:00</published><updated>2007-01-15T00:53:23+00:00</updated><id>https://simonwillison.net/2007/Jan/15/designing/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="http://www.veen.com/jeff/archives/000955.html"&gt;Designing Google Reader&amp;#x27;s trends&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
“But beyond the visualization, this serves as a good example of collecting and understanding the ambient information that flows through our digital lives.”


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/design"&gt;design&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/google"&gt;google&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/google-reader"&gt;google-reader&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/jeffrey-veen"&gt;jeffrey-veen&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/visualization"&gt;visualization&lt;/a&gt;&lt;/p&gt;



</summary><category term="design"/><category term="google"/><category term="google-reader"/><category term="jeffrey-veen"/><category term="visualization"/></entry></feed>