<?xml version="1.0" encoding="utf-8"?>
<feed xml:lang="en-us" xmlns="http://www.w3.org/2005/Atom"><title>Simon Willison's Weblog: vector-search</title><link href="http://simonwillison.net/" rel="alternate"/><link href="http://simonwillison.net/tags/vector-search.atom" rel="self"/><id>http://simonwillison.net/</id><updated>2025-11-11T23:38:39+00:00</updated><author><name>Simon Willison</name></author><entry><title>Scaling HNSWs</title><link href="https://simonwillison.net/2025/Nov/11/scaling-hnsws/#atom-tag" rel="alternate"/><published>2025-11-11T23:38:39+00:00</published><updated>2025-11-11T23:38:39+00:00</updated><id>https://simonwillison.net/2025/Nov/11/scaling-hnsws/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://antirez.com/news/156"&gt;Scaling HNSWs&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Salvatore Sanfilippo spent much of this year working on &lt;a href="https://github.com/redis/redis/blob/8.2.3/modules/vector-sets/README.md"&gt;vector sets for Redis&lt;/a&gt;, which first shipped in &lt;a href="https://redis.io/blog/redis-8-ga/"&gt;Redis 8 in May&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;A big part of that work involved implementing HNSW - Hierarchical Navigable Small World - an indexing technique first introduced in &lt;a href="https://arxiv.org/abs/1603.09320"&gt;this 2016 paper&lt;/a&gt; by Yu. A. Malkov and D. A. Yashunin.&lt;/p&gt;
&lt;p&gt;Salvatore's detailed notes on the Redis implementation here offer an immersive trip through a fascinating modern field of computer science. He describes several new contributions he's made to the HNSW algorithm, mainly around efficient deletion and updating of existing indexes.&lt;/p&gt;
&lt;p&gt;Since embedding vectors are notoriously memory-hungry, I particularly appreciated this note about how you can scale a large HNSW vector set across many different nodes and run parallel queries against them for both reads and writes:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;[...] if you have different vectors about the same use case split in different instances / keys, you can ask VSIM for the same query vector into all the instances, and add the WITHSCORES option (that returns the cosine distance) and merge the results client-side, and you have magically scaled your hundred of millions of vectors into multiple instances, splitting your dataset N times [One interesting thing about such a use case is that you can query the N instances in parallel using multiplexing, if your client library is smart enough].&lt;/p&gt;
&lt;p&gt;Another very notable thing about HNSWs exposed in this raw way, is that you can finally scale writes very easily. Just hash your element modulo N, and target the resulting Redis key/instance. Multiple instances can absorb the (slow, but still fast for HNSW standards) writes at the same time, parallelizing an otherwise very slow process.&lt;/p&gt;
&lt;/blockquote&gt;
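&lt;p&gt;Here's a minimal Python sketch of that idea, not Redis code: each "shard" is a plain dict standing in for one Redis key or instance, query_shard() stands in for a VSIM ... WITHSCORES call, and I'm assuming a score where larger means more similar. The function names and the SHA-1 based hashing are my own illustration.&lt;/p&gt;

```python
import hashlib
import heapq
import math

def shard_for(element, n_shards):
    """Pick a shard by hashing the element name modulo N (stable across runs)."""
    digest = hashlib.sha1(element.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_shards

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def query_shard(shard, query, k):
    """Stand-in for one similarity query: brute-force top-k on a single shard."""
    scored = ((cosine_similarity(vec, query), name) for name, vec in shard.items())
    return heapq.nlargest(k, scored)

def merged_top_k(shards, query, k):
    """Ask every shard for its own top-k, then merge the candidates client-side."""
    candidates = []
    for shard in shards:
        candidates.extend(query_shard(shard, query, k))
    return [(name, score) for score, name in heapq.nlargest(k, candidates)]

# Writes: route each element to exactly one shard.
N = 3
shards = [dict() for _ in range(N)]
vectors = {
    "a": [1.0, 0.0],
    "b": [0.9, 0.1],
    "c": [0.0, 1.0],
    "d": [-1.0, 0.0],
}
for name, vec in vectors.items():
    shards[shard_for(name, N)][name] = vec

# Reads: the same query goes to every shard and the results are merged.
results = merged_top_k(shards, [1.0, 0.0], k=2)
print(results)
```

&lt;p&gt;Because every shard returns its own top-k, the merged top-k is the same as it would be for a single brute-force index over all the vectors. With a real approximate index on each shard that only holds approximately.&lt;/p&gt;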
&lt;p&gt;It's always exciting to see new implementations of fundamental algorithms and data structures like this make it into Redis because Salvatore's C code is so clearly commented and pleasant to read - here's &lt;a href="https://github.com/redis/redis/blob/8.2.3/modules/vector-sets/hnsw.c"&gt;vector-sets/hnsw.c&lt;/a&gt; and &lt;a href="https://github.com/redis/redis/blob/8.2.3/modules/vector-sets/vset.c"&gt;vector-sets/vset.c&lt;/a&gt;.&lt;/p&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=45887466"&gt;Hacker News&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/algorithms"&gt;algorithms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/c"&gt;c&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/computer-science"&gt;computer-science&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/data-structures"&gt;data-structures&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/redis"&gt;redis&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/salvatore-sanfilippo"&gt;salvatore-sanfilippo&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/vector-search"&gt;vector-search&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/embeddings"&gt;embeddings&lt;/a&gt;&lt;/p&gt;



</summary><category term="algorithms"/><category term="c"/><category term="computer-science"/><category term="data-structures"/><category term="redis"/><category term="salvatore-sanfilippo"/><category term="vector-search"/><category term="embeddings"/></entry><entry><title>The case against pgvector</title><link href="https://simonwillison.net/2025/Nov/3/the-case-against-pgvector/#atom-tag" rel="alternate"/><published>2025-11-03T20:26:10+00:00</published><updated>2025-11-03T20:26:10+00:00</updated><id>https://simonwillison.net/2025/Nov/3/the-case-against-pgvector/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://alex-jacobs.com/posts/the-case-against-pgvector/"&gt;The case against pgvector&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;I wasn't keen on the title of this piece but the content is great: Alex Jacobs talks through lessons learned trying to run the popular pgvector PostgreSQL vector indexing extension at scale, in particular the challenges involved in maintaining a large index with close-to-realtime updates using the IVFFlat or HNSW index types.&lt;/p&gt;
&lt;p&gt;The section on pre- vs. post-filtering is particularly useful:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Okay but let's say you solve your index and insert problems. Now you have a document search system with millions of vectors. Documents have metadata---maybe they're marked as &lt;code&gt;draft&lt;/code&gt;, &lt;code&gt;published&lt;/code&gt;, or &lt;code&gt;archived&lt;/code&gt;. A user searches for something, and you only want to return published documents.&lt;/p&gt;
&lt;p&gt;[...] should Postgres filter on status first (pre-filter) or do the vector search first and then filter (post-filter)?&lt;/p&gt;
&lt;p&gt;This seems like an implementation detail. It’s not. It’s the difference between queries that take 50ms and queries that take 5 seconds. It’s also the difference between returning the most relevant results and… not.&lt;/p&gt;
&lt;/blockquote&gt;
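&lt;p&gt;To make the relevance half of that concrete, here's a toy brute-force sketch in Python. It is not how pgvector plans queries, and the documents, field names and functions are all made up for illustration. It shows how post-filtering can come back empty when the nearest rows all have the wrong status.&lt;/p&gt;

```python
def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def post_filter(docs, query, k, status):
    """Vector search first (top k overall), then drop rows with the wrong status."""
    nearest = sorted(docs, key=lambda d: distance(d["vec"], query))[:k]
    return [d["id"] for d in nearest if d["status"] == status]

def pre_filter(docs, query, k, status):
    """Filter on status first, then take the k nearest among what is left."""
    allowed = [d for d in docs if d["status"] == status]
    nearest = sorted(allowed, key=lambda d: distance(d["vec"], query))[:k]
    return [d["id"] for d in nearest]

# Hypothetical documents: the two closest to the query are both drafts.
docs = [
    {"id": 1, "vec": [0.0, 0.1], "status": "draft"},
    {"id": 2, "vec": [0.1, 0.0], "status": "draft"},
    {"id": 3, "vec": [0.5, 0.5], "status": "published"},
    {"id": 4, "vec": [2.0, 2.0], "status": "published"},
]
query = [0.0, 0.0]

post_results = post_filter(docs, query, 2, "published")
pre_results = pre_filter(docs, query, 2, "published")
print(post_results, pre_results)
```

&lt;p&gt;The trade-off in a real system is that pre-filtering may not be able to use the vector index at all, which is where the latency difference in the quote comes from.&lt;/p&gt;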
&lt;p&gt;The &lt;a href="https://news.ycombinator.com/item?id=45798479"&gt;Hacker News thread&lt;/a&gt; for this article attracted a robust discussion, including some fascinating comments by Discourse developer Rafael dos Santos Silva (xfalcox) about how they are using pgvector at scale:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;We [run pgvector in production] at Discourse, in thousands of databases, and it's leveraged in most of the billions of page views we serve. [...]&lt;/p&gt;
&lt;p&gt;Also worth mentioning that we use quantization extensively:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;halfvec (16bit float) for storage - bit (binary vectors) for indexes&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Which makes the storage cost and on-going performance good enough that we could enable this in all our hosting. [...]&lt;/p&gt;
&lt;p&gt;In Discourse embeddings power:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Related Topics, a list of topics to read next, which uses embeddings of the current topic as the key to search for similar ones&lt;/li&gt;
&lt;li&gt;Suggesting tags and categories when composing a new topic&lt;/li&gt;
&lt;li&gt;Augmented search&lt;/li&gt;
&lt;li&gt;RAG for uploaded files&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
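&lt;p&gt;For a feel of what "binary vectors for indexes" means, here's a small Python sketch of one common form of binary quantization: keep one bit per dimension (the sign) and compare by Hamming distance. This is my own illustration under that assumption, not Discourse's code or pgvector's implementation.&lt;/p&gt;

```python
def binarize(vec):
    """One bit per dimension: 1 if the component is positive, else 0."""
    return int("".join("1" if x > 0 else "0" for x in vec), 2)

def hamming(a, b):
    """Number of bit positions where the two binary vectors differ."""
    return bin(a ^ b).count("1")

# Hypothetical 4-dimensional embeddings.
query = binarize([0.3, -0.2, 0.8, -0.5])
doc_a = binarize([0.1, -0.9, 0.4, -0.1])   # same signs as the query
doc_b = binarize([-0.3, 0.2, 0.8, -0.5])   # two signs flipped

print(hamming(query, doc_a), hamming(query, doc_b))
```

&lt;p&gt;A 1,024-dimension float32 vector takes 4,096 bytes; the same vector as bits takes 128 bytes, at the cost of a much coarser notion of distance.&lt;/p&gt;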

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=45798479"&gt;Hacker News&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/postgresql"&gt;postgresql&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/scaling"&gt;scaling&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/vector-search"&gt;vector-search&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/embeddings"&gt;embeddings&lt;/a&gt;&lt;/p&gt;



</summary><category term="postgresql"/><category term="scaling"/><category term="vector-search"/><category term="embeddings"/></entry><entry><title>Quoting James Luan</title><link href="https://simonwillison.net/2025/Sep/8/james-luan/#atom-tag" rel="alternate"/><published>2025-09-08T16:24:24+00:00</published><updated>2025-09-08T16:24:24+00:00</updated><id>https://simonwillison.net/2025/Sep/8/james-luan/#atom-tag</id><summary type="html">
    &lt;blockquote cite="https://zilliz.com/blog/will-amazon-s3-vectors-kill-vector-databases-or-save-them"&gt;&lt;p&gt;I recently spoke with the CTO of a popular AI note-taking app who told me something surprising: they spend &lt;strong&gt;&lt;em&gt;twice&lt;/em&gt;&lt;/strong&gt; &lt;em&gt;as much&lt;/em&gt; on vector search as they do on OpenAI API calls. Think about that for a second. Running the retrieval layer costs them more than paying for the LLM itself.&lt;/p&gt;&lt;/blockquote&gt;
&lt;p class="cite"&gt;&amp;mdash; &lt;a href="https://zilliz.com/blog/will-amazon-s3-vectors-kill-vector-databases-or-save-them"&gt;James Luan&lt;/a&gt;, Engineering architect of Milvus&lt;/p&gt;

    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/vector-search"&gt;vector-search&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/embeddings"&gt;embeddings&lt;/a&gt;&lt;/p&gt;



</summary><category term="vector-search"/><category term="embeddings"/></entry><entry><title>Cursor: Security</title><link href="https://simonwillison.net/2025/May/11/cursor-security/#atom-tag" rel="alternate"/><published>2025-05-11T19:15:46+00:00</published><updated>2025-05-11T19:15:46+00:00</updated><id>https://simonwillison.net/2025/May/11/cursor-security/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.cursor.com/en/security"&gt;Cursor: Security&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Cursor's security documentation page includes a surprising amount of detail about how the Cursor text editor's backend systems work.&lt;/p&gt;
&lt;p&gt;I've recently learned that checking an organization's list of documented subprocessors is a great way to get a feel for how everything works under the hood - it's a loose "view source" for their infrastructure! That was how I confirmed that Anthropic's search features &lt;a href="https://simonwillison.net/2025/Mar/21/"&gt;used Brave search&lt;/a&gt; back in March.&lt;/p&gt;
&lt;p&gt;Cursor's list includes AWS, Azure and GCP (AWS for primary infrastructure, Azure and GCP for "some secondary infrastructure"). They host their own custom models on &lt;a href="https://fireworks.ai/"&gt;Fireworks&lt;/a&gt; and make API calls out to OpenAI, Anthropic, Gemini and xAI depending on user preferences. They're using &lt;a href="https://turbopuffer.com/"&gt;turbopuffer&lt;/a&gt; as a hosted vector store.&lt;/p&gt;
&lt;p&gt;The most interesting section is about &lt;a href="https://www.cursor.com/en/security#codebase-indexing"&gt;codebase indexing&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Cursor allows you to semantically index your codebase, which allows it to answer questions with the context of all of your code as well as write better code by referencing existing implementations. […]&lt;/p&gt;
&lt;p&gt;At our server, we chunk and embed the files, and store the embeddings in Turbopuffer. To allow filtering vector search results by file path, we store with every vector an obfuscated relative file path, as well as the line range the chunk corresponds to. We also store the embedding in a cache in AWS, indexed by the hash of the chunk, to ensure that indexing the same codebase a second time is much faster (which is particularly useful for teams).&lt;/p&gt;
&lt;p&gt;At inference time, we compute an embedding, let Turbopuffer do the nearest neighbor search, send back the obfuscated file path and line range to the client, and read those file chunks on the client locally. We then send those chunks back up to the server to answer the user’s question.&lt;/p&gt;
&lt;/blockquote&gt;
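&lt;p&gt;The "cache indexed by the hash of the chunk" part is a content-addressed cache. Here's a rough Python sketch of the pattern, assuming SHA-256 as the hash and an in-memory dict as the store. The class name and the fake_embed() placeholder are hypothetical and say nothing about Cursor's actual implementation.&lt;/p&gt;

```python
import hashlib

class EmbeddingCache:
    """Cache embeddings under the hash of the chunk text, not the file path."""

    def __init__(self, embed):
        self._embed = embed      # function that turns a chunk into a vector
        self._store = {}         # hash of chunk text, mapped to its embedding
        self.misses = 0

    def get(self, chunk):
        key = hashlib.sha256(chunk.encode("utf-8")).hexdigest()
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._embed(chunk)
        return self._store[key]

def fake_embed(chunk):
    """Placeholder for a real embedding model call."""
    return [float(len(chunk)), float(chunk.count(" "))]

cache = EmbeddingCache(fake_embed)
first = cache.get("def add(a, b): return a + b")
second = cache.get("def add(a, b): return a + b")  # identical chunk, cache hit
print(cache.misses)
```

&lt;p&gt;Because the key depends only on the chunk's content, a second person indexing the same code gets cache hits regardless of where the files live on their machine.&lt;/p&gt;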
&lt;p&gt;When operating in &lt;a href="https://www.cursor.com/security#privacy-mode-guarantee"&gt;privacy mode&lt;/a&gt; - which they say is enabled by 50% of their users - they are careful not to store any raw code on their servers for longer than the duration of a single request. This is why they store the embeddings and obfuscated file paths but not the code itself.&lt;/p&gt;
&lt;p&gt;Reading this made me instantly think of the paper &lt;a href="https://simonwillison.net/2024/Jan/8/text-embeddings-reveal-almost-as-much-as-text/"&gt;Text Embeddings Reveal (Almost) As Much As Text&lt;/a&gt; about how vector embeddings can be reversed. The security documentation touches on that in the notes:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Embedding reversal: academic work has shown that reversing embeddings is possible in some cases. Current attacks rely on having access to the model and embedding short strings into big vectors, which makes us believe that the attack would be somewhat difficult to do here. That said, it is definitely possible for an adversary who breaks into our vector database to learn things about the indexed codebases.&lt;/p&gt;
&lt;/blockquote&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://lobste.rs/s/myrlhi/how_cursor_indexes_codebases_fast"&gt;lobste.rs&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/security"&gt;security&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/vector-search"&gt;vector-search&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-programming"&gt;ai-assisted-programming&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/embeddings"&gt;embeddings&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/cursor"&gt;cursor&lt;/a&gt;&lt;/p&gt;



</summary><category term="security"/><category term="ai"/><category term="generative-ai"/><category term="vector-search"/><category term="llms"/><category term="ai-assisted-programming"/><category term="embeddings"/><category term="cursor"/></entry><entry><title>Redis is open source again</title><link href="https://simonwillison.net/2025/May/1/redis-is-open-source-again/#atom-tag" rel="alternate"/><published>2025-05-01T17:19:36+00:00</published><updated>2025-05-01T17:19:36+00:00</updated><id>https://simonwillison.net/2025/May/1/redis-is-open-source-again/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://antirez.com/news/151"&gt;Redis is open source again&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Salvatore Sanfilippo:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Five months ago, I rejoined Redis and quickly started to talk with my colleagues about a possible switch to the AGPL license, only to discover that there was already an ongoing discussion, a very old one, too. [...]&lt;/p&gt;
&lt;p&gt;I’ll be honest: I truly wanted the code I wrote for the new Vector Sets data type to be released under an open source license. [...]&lt;/p&gt;
&lt;p&gt;So, honestly, while I can’t take credit for the license switch, I hope I contributed a little bit to it, because today I’m happy. I’m happy that Redis is open source software again, under the terms of the AGPLv3 license.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I'm absolutely &lt;em&gt;thrilled&lt;/em&gt; to hear this. Redis 8.0 is &lt;a href="https://redis.io/blog/redis-8-ga/"&gt;out today under the new license&lt;/a&gt;, including a beta release of &lt;a href="https://redis.io/docs/latest/develop/data-types/vector-sets/"&gt;Vector Sets&lt;/a&gt;. I've been watching Salvatore's work on those with &lt;a href="https://antirez.com/news/149"&gt;fascination&lt;/a&gt;, while sad that I probably wouldn't use it often due to the janky license. That concern is now gone. I'm looking forward to putting them through their paces!&lt;/p&gt;
&lt;p&gt;See also &lt;a href="https://redis.io/blog/agplv3/"&gt;Redis is now available under the AGPLv3 open source license&lt;/a&gt; on the Redis blog. An interesting note from that is that they are also:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Integrating Redis Stack technologies, including JSON, Time Series, probabilistic data types, Redis Query Engine and more into core Redis 8 under AGPL&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;That's a whole bunch of new things that weren't previously part of Redis core.&lt;/p&gt;
&lt;p&gt;I hadn't encountered &lt;a href="https://redis.io/docs/latest/develop/interact/search-and-query/"&gt;Redis Query Engine&lt;/a&gt; before - it looks like that's a whole set of features that turn Redis into more of an Elasticsearch-style document database, complete with full-text search, vector search, geospatial operations and aggregations. It supports search syntax that looks a bit like this:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;FT.SEARCH places "museum @city:(san francisco|oakland) @shape:[CONTAINS $poly]" PARAMS 2 poly 'POLYGON((-122.5 37.7, -122.5 37.8, -122.4 37.8, -122.4 37.7, -122.5 37.7))' DIALECT 3&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;(Noteworthy that Elasticsearch chose the AGPL too when they switched back from the SSPL to an open source license &lt;a href="https://simonwillison.net/2024/Aug/29/elasticsearch-is-open-source-again/"&gt;last year&lt;/a&gt;).&lt;/p&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=43859446"&gt;Hacker News&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/open-source"&gt;open-source&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/redis"&gt;redis&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/salvatore-sanfilippo"&gt;salvatore-sanfilippo&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/vector-search"&gt;vector-search&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/janky-licenses"&gt;janky-licenses&lt;/a&gt;&lt;/p&gt;



</summary><category term="open-source"/><category term="redis"/><category term="salvatore-sanfilippo"/><category term="vector-search"/><category term="janky-licenses"/></entry><entry><title>Quoting Jo Kristian Bergum</title><link href="https://simonwillison.net/2024/Dec/28/jo-kristian-bergum/#atom-tag" rel="alternate"/><published>2024-12-28T14:22:29+00:00</published><updated>2024-12-28T14:22:29+00:00</updated><id>https://simonwillison.net/2024/Dec/28/jo-kristian-bergum/#atom-tag</id><summary type="html">
    &lt;blockquote cite="https://twitter.com/jobergum/status/1872923872007217309"&gt;&lt;p&gt;Looking back, it's clear we overcomplicated things. While embeddings fundamentally changed how we can represent and compare content, they didn't need an entirely new infrastructure category. What we label as "vector databases" are, in reality, search engines with vector capabilities. The market is already correcting this categorization—vector search providers rapidly add traditional search features while established search engines incorporate vector search capabilities. This category convergence isn't surprising: building a good retrieval engine has always been about combining multiple retrieval and ranking strategies. Vector search is just another powerful tool in that toolbox, not a category of its own.&lt;/p&gt;&lt;/blockquote&gt;
&lt;p class="cite"&gt;&amp;mdash; &lt;a href="https://twitter.com/jobergum/status/1872923872007217309"&gt;Jo Kristian Bergum&lt;/a&gt;&lt;/p&gt;

    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/search"&gt;search&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/vector-search"&gt;vector-search&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/embeddings"&gt;embeddings&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/jo-kristian-bergum"&gt;jo-kristian-bergum&lt;/a&gt;&lt;/p&gt;



</summary><category term="search"/><category term="vector-search"/><category term="embeddings"/><category term="jo-kristian-bergum"/></entry><entry><title>From where I left</title><link href="https://simonwillison.net/2024/Dec/10/from-where-i-left/#atom-tag" rel="alternate"/><published>2024-12-10T18:56:26+00:00</published><updated>2024-12-10T18:56:26+00:00</updated><id>https://simonwillison.net/2024/Dec/10/from-where-i-left/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://antirez.com/news/144"&gt;From where I left&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Four and a half years after he left the project, Redis creator Salvatore Sanfilippo is returning to work on Redis.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Hacking randomly was cool but, in the long run, my feeling was that I was lacking a real purpose, and every day I started to feel a bigger urgency to be part of the tech world again. At the same time, I saw the Redis community fragmenting, something that was a bit concerning to me, even as an outsider.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I'm personally still upset at the license change, but Salvatore sees it as necessary to support the commercial business model for Redis Labs. It feels to me like a betrayal of the volunteer efforts by previous contributors. I &lt;a href="https://news.ycombinator.com/item?id=42378488#42379400"&gt;posted about that&lt;/a&gt; on Hacker News and Salvatore replied:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;I can understand that, but the thing about the BSD license is that such value never gets lost. People are able to fork, and after a fork for the original project to still lead will be require to put something more on the table.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Salvatore's first new project is an exploration of adding vector sets to Redis. The vector similarity API he previews in this post reminds me of why I fell in love with Redis in the first place - it's clean, simple and feels obviously right to me.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;VSIM top_1000_movies_imdb ELE "The Matrix"  WITHSCORES
1) "The Matrix"
2) "0.9999999403953552"
3) "Ex Machina"
4) "0.8680362105369568"
...
&lt;/code&gt;&lt;/pre&gt;
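&lt;p&gt;The WITHSCORES reply above alternates element and score. Assuming a client hands that back as a flat list of strings (which depends on the client library and protocol version), a few lines of Python turn it into pairs. The pair_scores() helper is my own, not part of any Redis client.&lt;/p&gt;

```python
def pair_scores(reply):
    """Turn a flat [element, score, element, score, ...] reply into (element, float) pairs."""
    return [(reply[i], float(reply[i + 1])) for i in range(0, len(reply) - 1, 2)]

# The values from the example output above, as a flat list of strings.
reply = ["The Matrix", "0.9999999403953552", "Ex Machina", "0.8680362105369568"]
pairs = pair_scores(reply)
print(pairs)
```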


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/open-source"&gt;open-source&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/redis"&gt;redis&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/salvatore-sanfilippo"&gt;salvatore-sanfilippo&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/vector-search"&gt;vector-search&lt;/a&gt;&lt;/p&gt;



</summary><category term="open-source"/><category term="redis"/><category term="salvatore-sanfilippo"/><category term="vector-search"/></entry><entry><title>Hybrid full-text search and vector search with SQLite</title><link href="https://simonwillison.net/2024/Oct/4/hybrid-full-text-search-and-vector-search-with-sqlite/#atom-tag" rel="alternate"/><published>2024-10-04T16:22:09+00:00</published><updated>2024-10-04T16:22:09+00:00</updated><id>https://simonwillison.net/2024/Oct/4/hybrid-full-text-search-and-vector-search-with-sqlite/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://alexgarcia.xyz/blog/2024/sqlite-vec-hybrid-search/index.html"&gt;Hybrid full-text search and vector search with SQLite&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;As part of Alex’s work on his &lt;a href="https://github.com/asg017/sqlite-vec"&gt;sqlite-vec&lt;/a&gt; SQLite extension - adding fast vector lookups to SQLite - he’s been investigating hybrid search, where search results from both vector similarity and traditional full-text search are combined.&lt;/p&gt;
&lt;p&gt;The most promising approach looks to be &lt;a href="https://learn.microsoft.com/en-us/azure/search/hybrid-search-ranking"&gt;Reciprocal Rank Fusion&lt;/a&gt;, which combines the top ranked items from both approaches. Here’s Alex’s SQL query:&lt;/p&gt;
&lt;div class="highlight highlight-source-sql"&gt;&lt;pre&gt;&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;--&lt;/span&gt; the sqlite-vec KNN vector search results&lt;/span&gt;
with vec_matches &lt;span class="pl-k"&gt;as&lt;/span&gt; (
  &lt;span class="pl-k"&gt;select&lt;/span&gt;
    article_id,
    row_number() over (&lt;span class="pl-k"&gt;order by&lt;/span&gt; distance) &lt;span class="pl-k"&gt;as&lt;/span&gt; rank_number,
    distance
  &lt;span class="pl-k"&gt;from&lt;/span&gt; vec_articles
  &lt;span class="pl-k"&gt;where&lt;/span&gt;
    headline_embedding match lembed(:query)
    &lt;span class="pl-k"&gt;and&lt;/span&gt; k &lt;span class="pl-k"&gt;=&lt;/span&gt; :k
),
&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;--&lt;/span&gt; the FTS5 search results&lt;/span&gt;
fts_matches &lt;span class="pl-k"&gt;as&lt;/span&gt; (
  &lt;span class="pl-k"&gt;select&lt;/span&gt;
    rowid,
    row_number() over (&lt;span class="pl-k"&gt;order by&lt;/span&gt; rank) &lt;span class="pl-k"&gt;as&lt;/span&gt; rank_number,
    rank &lt;span class="pl-k"&gt;as&lt;/span&gt; score
  &lt;span class="pl-k"&gt;from&lt;/span&gt; fts_articles
  &lt;span class="pl-k"&gt;where&lt;/span&gt; headline match :query
  &lt;span class="pl-k"&gt;limit&lt;/span&gt; :k
),
&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;--&lt;/span&gt; combine FTS5 + vector search results with RRF&lt;/span&gt;
final &lt;span class="pl-k"&gt;as&lt;/span&gt; (
  &lt;span class="pl-k"&gt;select&lt;/span&gt;
    &lt;span class="pl-c1"&gt;articles&lt;/span&gt;.&lt;span class="pl-c1"&gt;id&lt;/span&gt;,
    &lt;span class="pl-c1"&gt;articles&lt;/span&gt;.&lt;span class="pl-c1"&gt;headline&lt;/span&gt;,
    &lt;span class="pl-c1"&gt;vec_matches&lt;/span&gt;.&lt;span class="pl-c1"&gt;rank_number&lt;/span&gt; &lt;span class="pl-k"&gt;as&lt;/span&gt; vec_rank,
    &lt;span class="pl-c1"&gt;fts_matches&lt;/span&gt;.&lt;span class="pl-c1"&gt;rank_number&lt;/span&gt; &lt;span class="pl-k"&gt;as&lt;/span&gt; fts_rank,
    &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;--&lt;/span&gt; RRF algorithm&lt;/span&gt;
    (
      coalesce(&lt;span class="pl-c1"&gt;1&lt;/span&gt;.&lt;span class="pl-c1"&gt;0&lt;/span&gt; &lt;span class="pl-k"&gt;/&lt;/span&gt; (:rrf_k &lt;span class="pl-k"&gt;+&lt;/span&gt; &lt;span class="pl-c1"&gt;fts_matches&lt;/span&gt;.&lt;span class="pl-c1"&gt;rank_number&lt;/span&gt;), &lt;span class="pl-c1"&gt;0&lt;/span&gt;.&lt;span class="pl-c1"&gt;0&lt;/span&gt;) &lt;span class="pl-k"&gt;*&lt;/span&gt; :weight_fts &lt;span class="pl-k"&gt;+&lt;/span&gt;
      coalesce(&lt;span class="pl-c1"&gt;1&lt;/span&gt;.&lt;span class="pl-c1"&gt;0&lt;/span&gt; &lt;span class="pl-k"&gt;/&lt;/span&gt; (:rrf_k &lt;span class="pl-k"&gt;+&lt;/span&gt; &lt;span class="pl-c1"&gt;vec_matches&lt;/span&gt;.&lt;span class="pl-c1"&gt;rank_number&lt;/span&gt;), &lt;span class="pl-c1"&gt;0&lt;/span&gt;.&lt;span class="pl-c1"&gt;0&lt;/span&gt;) &lt;span class="pl-k"&gt;*&lt;/span&gt; :weight_vec
    ) &lt;span class="pl-k"&gt;as&lt;/span&gt; combined_rank,
    &lt;span class="pl-c1"&gt;vec_matches&lt;/span&gt;.&lt;span class="pl-c1"&gt;distance&lt;/span&gt; &lt;span class="pl-k"&gt;as&lt;/span&gt; vec_distance,
    &lt;span class="pl-c1"&gt;fts_matches&lt;/span&gt;.&lt;span class="pl-c1"&gt;score&lt;/span&gt; &lt;span class="pl-k"&gt;as&lt;/span&gt; fts_score
  &lt;span class="pl-k"&gt;from&lt;/span&gt; fts_matches
  full outer &lt;span class="pl-k"&gt;join&lt;/span&gt; vec_matches &lt;span class="pl-k"&gt;on&lt;/span&gt; &lt;span class="pl-c1"&gt;vec_matches&lt;/span&gt;.&lt;span class="pl-c1"&gt;article_id&lt;/span&gt; &lt;span class="pl-k"&gt;=&lt;/span&gt; &lt;span class="pl-c1"&gt;fts_matches&lt;/span&gt;.&lt;span class="pl-c1"&gt;rowid&lt;/span&gt;
  &lt;span class="pl-k"&gt;join&lt;/span&gt; articles &lt;span class="pl-k"&gt;on&lt;/span&gt; &lt;span class="pl-c1"&gt;articles&lt;/span&gt;.&lt;span class="pl-c1"&gt;rowid&lt;/span&gt; &lt;span class="pl-k"&gt;=&lt;/span&gt; coalesce(&lt;span class="pl-c1"&gt;fts_matches&lt;/span&gt;.&lt;span class="pl-c1"&gt;rowid&lt;/span&gt;, &lt;span class="pl-c1"&gt;vec_matches&lt;/span&gt;.&lt;span class="pl-c1"&gt;article_id&lt;/span&gt;)
  &lt;span class="pl-k"&gt;order by&lt;/span&gt; combined_rank &lt;span class="pl-k"&gt;desc&lt;/span&gt;
)
&lt;span class="pl-k"&gt;select&lt;/span&gt; &lt;span class="pl-k"&gt;*&lt;/span&gt; &lt;span class="pl-k"&gt;from&lt;/span&gt; final;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;I’ve been puzzled in the past over how to best do that because the distance scores from vector similarity and the relevance scores from FTS are meaningless in comparison to each other. RRF doesn’t even attempt to compare them - it uses them purely for &lt;code&gt;row_number()&lt;/code&gt; ranking within each set and combines the results based on that.&lt;/p&gt;
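&lt;p&gt;The same calculation is easy to sketch outside SQL. This Python version is my own illustration rather than Alex's code: it takes ranked lists of ids, ignores the raw scores entirely, and gives an item missing from a list a contribution of zero, like the coalesce() calls in the query. The default rrf_k of 60 is a commonly used value and an assumption here, since the SQL takes it as a parameter.&lt;/p&gt;

```python
def reciprocal_rank_fusion(rankings, rrf_k=60, weights=None):
    """Fuse several ranked lists of ids into one list, using ranks only."""
    weights = weights or [1.0] * len(rankings)
    scores = {}
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes weight / (rrf_k + rank) for the items it contains.
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (rrf_k + rank)
    # Highest combined score first; ties broken by id for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

fts_ranking = ["doc3", "doc1", "doc7"]   # hypothetical full-text order
vec_ranking = ["doc1", "doc9", "doc3"]   # hypothetical vector KNN order
fused = reciprocal_rank_fusion([fts_ranking, vec_ranking])
print(fused)
```

&lt;p&gt;Items that appear in both lists float to the top, and the two items that appear in only one list are ordered by how highly that one list ranked them.&lt;/p&gt;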


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/full-text-search"&gt;full-text-search&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/search"&gt;search&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sql"&gt;sql&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sqlite"&gt;sqlite&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/alex-garcia"&gt;alex-garcia&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/vector-search"&gt;vector-search&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/embeddings"&gt;embeddings&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/rag"&gt;rag&lt;/a&gt;&lt;/p&gt;



</summary><category term="full-text-search"/><category term="search"/><category term="sql"/><category term="sqlite"/><category term="alex-garcia"/><category term="vector-search"/><category term="embeddings"/><category term="rag"/></entry><entry><title>Introducing Contextual Retrieval</title><link href="https://simonwillison.net/2024/Sep/20/introducing-contextual-retrieval/#atom-tag" rel="alternate"/><published>2024-09-20T01:34:21+00:00</published><updated>2024-09-20T01:34:21+00:00</updated><id>https://simonwillison.net/2024/Sep/20/introducing-contextual-retrieval/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.anthropic.com/news/contextual-retrieval"&gt;Introducing Contextual Retrieval&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Here's an interesting new embedding/RAG technique, described by Anthropic, but it should work for any embedding model paired with any LLM.&lt;/p&gt;
&lt;p&gt;One of the big challenges in implementing semantic search against vector embeddings - often used as part of a RAG system - is creating "chunks" of documents that are most likely to semantically match queries from users.&lt;/p&gt;
&lt;p&gt;Anthropic provide this solid example where semantic chunks might let you down:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Imagine you had a collection of financial information (say, U.S. SEC filings) embedded in your knowledge base, and you received the following question: "What was the revenue growth for ACME Corp in Q2 2023?"&lt;/p&gt;
&lt;p&gt;A relevant chunk might contain the text: "The company's revenue grew by 3% over the previous quarter." However, this chunk on its own doesn't specify which company it's referring to or the relevant time period, making it difficult to retrieve the right information or use the information effectively.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Their proposed solution is to take each chunk at indexing time and expand it using an LLM - so the above sentence would become this instead:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;This chunk is from an SEC filing on ACME corp's performance in Q2 2023; the previous quarter's revenue was $314 million. The company's revenue grew by 3% over the previous quarter.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This chunk was created by Claude 3 Haiku (their least expensive model) using the following prompt template:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;&amp;lt;document&amp;gt;&lt;/code&gt;&lt;br&gt;
&lt;code&gt;{{WHOLE_DOCUMENT}}&lt;/code&gt;&lt;br&gt;
&lt;code&gt;&amp;lt;/document&amp;gt;&lt;/code&gt;&lt;br&gt;
&lt;code&gt;Here is the chunk we want to situate within the whole document&lt;/code&gt;&lt;br&gt;
&lt;code&gt;&amp;lt;chunk&amp;gt;&lt;/code&gt;&lt;br&gt;
&lt;code&gt;{{CHUNK_CONTENT}}&lt;/code&gt;&lt;br&gt;
&lt;code&gt;&amp;lt;/chunk&amp;gt;&lt;/code&gt;&lt;br&gt;
&lt;code&gt;Please give a short succinct context to situate this chunk within the overall document for the purposes of improving search retrieval of the chunk. Answer only with the succinct context and nothing else.&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
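&lt;p&gt;The indexing-time step can be sketched in a few lines of Python. This is a minimal sketch, not Anthropic's implementation: &lt;code&gt;generate_context&lt;/code&gt; is a hypothetical callable standing in for the LLM call that fills in the prompt template, and joining the context to the chunk with a single space is an assumption based on the ACME example.&lt;/p&gt;

```python
from typing import Callable, List


def contextualize_chunks(
    document: str,
    chunks: List[str],
    generate_context: Callable[[str, str], str],
) -> List[str]:
    """Prefix each chunk with LLM-written context before it is embedded."""
    contextualized = []
    for chunk in chunks:
        # generate_context is hypothetical: it receives the whole document
        # plus one chunk and returns a short situating sentence.
        context = generate_context(document, chunk).strip()
        # The original chunk text is kept in full after the context.
        contextualized.append(context + " " + chunk)
    return contextualized
```

&lt;p&gt;The returned strings are what would be embedded and indexed in place of the bare chunks.&lt;/p&gt;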
&lt;p&gt;Here's the really clever bit: running the above prompt for every chunk in a document could get really expensive thanks to the inclusion of the entire document in each prompt. Claude &lt;a href="https://simonwillison.net/2024/Aug/14/prompt-caching-with-claude/"&gt;added context caching&lt;/a&gt; last month, which allows you to pay around 1/10th of the cost for tokens cached up to your specified breakpoint.&lt;/p&gt;
&lt;p&gt;By Anthropic's calculations:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Assuming 800 token chunks, 8k token documents, 50 token context instructions, and 100 tokens of context per chunk, the one-time cost to generate contextualized chunks is $1.02 per million document tokens.&lt;/p&gt;
&lt;/blockquote&gt;
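&lt;p&gt;Here is one way that figure might be reconstructed. Everything beyond the four token counts is an assumption: the per-million-token prices are assumed Claude 3 Haiku prices that are not stated in the quote, and the accounting (one cache write per document, then a cached read of the whole document for every chunk) is just one reading that happens to land on the quoted number. Anthropic's exact method may differ.&lt;/p&gt;

```python
# Assumed Claude 3 Haiku prices in US dollars per million tokens.
# None of these prices appear in the quoted text.
INPUT = 0.25
OUTPUT = 1.25
CACHE_WRITE = 0.30
CACHE_READ = 0.03


def cost_per_million_document_tokens(
    document_tokens=8000,
    chunk_tokens=800,
    instruction_tokens=50,
    context_tokens=100,
):
    chunks = document_tokens // chunk_tokens
    # Assumption: the document is written to the cache once, and every
    # chunk request is then billed a cached read of the whole document.
    cached = document_tokens * CACHE_WRITE + chunks * document_tokens * CACHE_READ
    # Each request also sends its chunk and the instructions uncached.
    uncached = chunks * (chunk_tokens + instruction_tokens) * INPUT
    # Each request generates a short context for its chunk.
    generated = chunks * context_tokens * OUTPUT
    # Prices are per million tokens, so dividing the total by the document
    # length gives dollars per million document tokens.
    return (cached + uncached + generated) / document_tokens


print(round(cost_per_million_document_tokens(), 2))
```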
&lt;p&gt;Anthropic provide a &lt;a href="https://github.com/anthropics/anthropic-cookbook/blob/main/skills/contextual-embeddings/guide.ipynb"&gt;detailed notebook&lt;/a&gt; demonstrating an implementation of this pattern. Their eventual solution combines cosine similarity and BM25 indexing, uses embeddings from &lt;a href="https://docs.voyageai.com/docs/embeddings"&gt;Voyage AI&lt;/a&gt; and adds a reranking step powered by &lt;a href="https://cohere.com/rerank"&gt;Cohere&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The notebook also includes an evaluation set using JSONL - here's that evaluation data &lt;a href="https://lite.datasette.io/?json=https://github.com/anthropics/anthropic-cookbook/blob/main/skills/contextual-embeddings/data/evaluation_set.jsonl#/data/evaluation_set"&gt;in Datasette Lite&lt;/a&gt;.&lt;/p&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/alexalbert__/status/1836854956785352776"&gt;Alex Albert&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/search"&gt;search&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/prompt-engineering"&gt;prompt-engineering&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/vector-search"&gt;vector-search&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/embeddings"&gt;embeddings&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/anthropic"&gt;anthropic&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude"&gt;claude&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/rag"&gt;rag&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/prompt-caching"&gt;prompt-caching&lt;/a&gt;&lt;/p&gt;



</summary><category term="search"/><category term="ai"/><category term="prompt-engineering"/><category term="generative-ai"/><category term="vector-search"/><category term="llms"/><category term="embeddings"/><category term="anthropic"/><category term="claude"/><category term="rag"/><category term="prompt-caching"/></entry><entry><title>OpenAI: Improve file search result relevance with chunk ranking</title><link href="https://simonwillison.net/2024/Aug/30/openai-file-search/#atom-tag" rel="alternate"/><published>2024-08-30T04:03:01+00:00</published><updated>2024-08-30T04:03:01+00:00</updated><id>https://simonwillison.net/2024/Aug/30/openai-file-search/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://platform.openai.com/docs/assistants/tools/file-search/improve-file-search-result-relevance-with-chunk-ranking"&gt;OpenAI: Improve file search result relevance with chunk ranking&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;I've mostly been ignoring OpenAI's &lt;a href="https://platform.openai.com/docs/assistants/overview"&gt;Assistants API&lt;/a&gt;. It provides an alternative to their standard messages API where you construct "assistants": chatbots that have optional access to additional tools and that store full conversation threads on the server, so you don't need to pass the previous conversation with every call to their API.&lt;/p&gt;
&lt;p&gt;I'm pretty comfortable with their existing API and I found the assistants API to be quite a bit more complicated. So far the only thing I've used it for is a &lt;a href="https://github.com/simonw/scrape-openai-code-interpreter/blob/main/scrape.py"&gt;script to scrape OpenAI Code Interpreter&lt;/a&gt; to keep track of &lt;a href="https://github.com/simonw/scrape-openai-code-interpreter/commits/main/packages.txt"&gt;updates to their environment's Python packages&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Code Interpreter aside, the other interesting assistants feature is &lt;a href="https://platform.openai.com/docs/assistants/tools/file-search"&gt;File Search&lt;/a&gt;. You can upload files in a wide variety of formats and OpenAI will chunk them, store the chunks in a vector store and make them available to help answer questions posed to your assistant - it's their version of hosted &lt;a href="https://simonwillison.net/tags/rag/"&gt;RAG&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Prior to today OpenAI had kept the details of how this worked undocumented. I found this infuriating, because when I'm building a RAG system the details of how files are chunked and scored for relevance are the &lt;em&gt;whole game&lt;/em&gt; - without understanding that I can't make effective decisions about what kind of documents to use and how to build on top of the tool.&lt;/p&gt;
&lt;p&gt;This has finally changed! You can now run a "step" (a round of conversation in the chat) and then retrieve details of exactly which chunks of the file were used in the response and how they were scored using the following incantation:&lt;/p&gt;
&lt;pre&gt;&lt;span class="pl-s1"&gt;run_step&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;client&lt;/span&gt;.&lt;span class="pl-s1"&gt;beta&lt;/span&gt;.&lt;span class="pl-s1"&gt;threads&lt;/span&gt;.&lt;span class="pl-s1"&gt;runs&lt;/span&gt;.&lt;span class="pl-s1"&gt;steps&lt;/span&gt;.&lt;span class="pl-en"&gt;retrieve&lt;/span&gt;(
    &lt;span class="pl-s1"&gt;thread_id&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-s"&gt;"thread_abc123"&lt;/span&gt;,
    &lt;span class="pl-s1"&gt;run_id&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-s"&gt;"run_abc123"&lt;/span&gt;,
    &lt;span class="pl-s1"&gt;step_id&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-s"&gt;"step_abc123"&lt;/span&gt;,
    &lt;span class="pl-s1"&gt;include&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;[
        &lt;span class="pl-s"&gt;"step_details.tool_calls[*].file_search.results[*].content"&lt;/span&gt;
    ]
)&lt;/pre&gt;
&lt;p&gt;(See what I mean about the API being a little obtuse?)&lt;/p&gt;
&lt;p&gt;I tried this out today and the results were very promising. Here's &lt;a href="https://gist.github.com/simonw/0c8b87ad1e23e81060594a4760bd370d"&gt;a chat transcript&lt;/a&gt; with an assistant I created against an old PDF copy of the Datasette documentation - I used the above new API to dump out the full list of snippets used to answer the question "tell me about ways to use spatialite". &lt;/p&gt;
&lt;p&gt;It pulled in a lot of content! 57,017 characters by my count, spread across 20 search results (&lt;a href="https://platform.openai.com/docs/assistants/tools/file-search/customizing-file-search-settings"&gt;customizable&lt;/a&gt;), for a total of 15,021 tokens as measured by &lt;a href="https://github.com/simonw/ttok"&gt;ttok&lt;/a&gt;. At current GPT-4o-mini prices that would cost 0.225 cents (less than a quarter of a cent), but with regular GPT-4o it would cost 7.5 cents.&lt;/p&gt;
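&lt;p&gt;The arithmetic behind those two figures, as a small Python sketch. The per-million-token input prices are not stated above; they are inferred from the quoted costs, so treat them as assumptions.&lt;/p&gt;

```python
def cost_in_cents(tokens: int, dollars_per_million_tokens: float) -> float:
    """Cost of sending `tokens` input tokens, in US cents."""
    return tokens * dollars_per_million_tokens / 1_000_000 * 100


# Assumed input prices in dollars per million tokens, inferred from the
# costs quoted in the text rather than taken from a price list.
GPT_4O_MINI = 0.15
GPT_4O = 5.00

print(round(cost_in_cents(15_021, GPT_4O_MINI), 3))
print(round(cost_in_cents(15_021, GPT_4O), 1))
```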
&lt;p&gt;OpenAI provide up to 1GB of vector storage for free, then charge $0.10/GB/day for vector storage beyond that. My 173 page PDF seems to have taken up 728KB after being chunked and stored, so that GB should stretch a pretty long way.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Confession:&lt;/strong&gt; I couldn't be bothered to work through the OpenAI code examples myself, so I hit Ctrl+A on that web page and copied the whole lot into Claude 3.5 Sonnet, then prompted it:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;Based on this documentation, write me a Python CLI app (using the Click CLi library) with the following features:&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;&lt;code&gt;openai-file-chat add-files name-of-vector-store *.pdf *.txt&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;&lt;code&gt;This creates a new vector store called name-of-vector-store and adds all the files passed to the command to that store.&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;&lt;code&gt;openai-file-chat name-of-vector-store1 name-of-vector-store2 ...&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;&lt;code&gt;This starts an interactive chat with the user, where any time they hit enter the question is answered by a chat assistant using the specified vector stores.&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;We &lt;a href="https://gist.github.com/simonw/97e29b86540fcc627da4984daf5b7f9f"&gt;iterated on this a few times&lt;/a&gt; to build me a one-off CLI app for trying out the new features. It's got a few bugs that I haven't fixed yet, but it was a very productive way of prototyping against the new API.&lt;/p&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/OpenAIDevs/status/1829259020437475771"&gt;@OpenAIDevs&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openai"&gt;openai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/vector-search"&gt;vector-search&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-programming"&gt;ai-assisted-programming&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/embeddings"&gt;embeddings&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/rag"&gt;rag&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude-3-5-sonnet"&gt;claude-3-5-sonnet&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-search"&gt;ai-assisted-search&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="openai"/><category term="generative-ai"/><category term="vector-search"/><category term="llms"/><category term="ai-assisted-programming"/><category term="embeddings"/><category term="rag"/><category term="claude-3-5-sonnet"/><category term="ai-assisted-search"/></entry><entry><title>Using DuckDB for Embeddings and Vector Search</title><link href="https://simonwillison.net/2024/Jun/15/duckdb-for-embeddings/#atom-tag" rel="alternate"/><published>2024-06-15T14:39:18+00:00</published><updated>2024-06-15T14:39:18+00:00</updated><id>https://simonwillison.net/2024/Jun/15/duckdb-for-embeddings/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://blog.brunk.io/posts/similarity-search-with-duckdb/"&gt;Using DuckDB for Embeddings and Vector Search&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Sören Brunk's comprehensive tutorial combining DuckDB 1.0, a subset of German Wikipedia from Hugging Face (loaded using Parquet), the &lt;a href="https://huggingface.co/BAAI/bge-m3"&gt;BGE M3&lt;/a&gt; embedding model and DuckDB's &lt;a href="https://duckdb.org/2024/05/03/vector-similarity-search-vss.html"&gt;new vss extension&lt;/a&gt; for implementing an HNSW vector index.&lt;/p&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/soebrunk/status/1801631086386012453"&gt;@soebrunk&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/parquet"&gt;parquet&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/duckdb"&gt;duckdb&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/vector-search"&gt;vector-search&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/embeddings"&gt;embeddings&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="parquet"/><category term="duckdb"/><category term="vector-search"/><category term="embeddings"/></entry><entry><title>I'm writing a new vector search SQLite Extension</title><link href="https://simonwillison.net/2024/May/3/sqlite-vec/#atom-tag" rel="alternate"/><published>2024-05-03T03:16:39+00:00</published><updated>2024-05-03T03:16:39+00:00</updated><id>https://simonwillison.net/2024/May/3/sqlite-vec/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://alexgarcia.xyz/blog/2024/building-new-vector-search-sqlite/index.html"&gt;I&amp;#x27;m writing a new vector search SQLite Extension&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Alex Garcia is working on &lt;code&gt;sqlite-vec&lt;/code&gt;, a spiritual successor to his &lt;code&gt;sqlite-vss&lt;/code&gt; project. The new SQLite C extension will have zero other dependencies (&lt;code&gt;sqlite-vss&lt;/code&gt; used some tricky C++ libraries) and will work using virtual tables, storing chunks of vectors in shadow tables to avoid needing to load everything into memory at once.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/c"&gt;c&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sqlite"&gt;sqlite&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/vectors"&gt;vectors&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/alex-garcia"&gt;alex-garcia&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/vector-search"&gt;vector-search&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/embeddings"&gt;embeddings&lt;/a&gt;&lt;/p&gt;



</summary><category term="c"/><category term="sqlite"/><category term="vectors"/><category term="alex-garcia"/><category term="vector-search"/><category term="embeddings"/></entry><entry><title>My binary vector search is better than your FP32 vectors</title><link href="https://simonwillison.net/2024/Mar/26/binary-vector-search/#atom-tag" rel="alternate"/><published>2024-03-26T04:56:25+00:00</published><updated>2024-03-26T04:56:25+00:00</updated><id>https://simonwillison.net/2024/Mar/26/binary-vector-search/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://blog.pgvecto.rs/my-binary-vector-search-is-better-than-your-fp32-vectors"&gt;My binary vector search is better than your FP32 vectors&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;I’m still trying to get my head around this, but here’s what I understand so far.&lt;/p&gt;

&lt;p&gt;Embedding vectors as calculated by models such as OpenAI text-embedding-3-small are arrays of floating point values, which look something like this:&lt;/p&gt;

&lt;p&gt;[0.0051681744, 0.017187592, -0.018685209, -0.01855924, -0.04725188...] - 1,536 elements long&lt;/p&gt;

&lt;p&gt;Different embedding models have different lengths, but they tend to be hundreds up to low thousands of numbers. If each float is 32 bits that’s 4 bytes per float, which can add up to a lot of memory if you have millions of embedding vectors to compare.&lt;/p&gt;

&lt;p&gt;If you look at those numbers you’ll note that they are all pretty small positive or negative numbers, close to 0.&lt;/p&gt;

&lt;p&gt;Binary vector search is a trick where you take that sequence of floating point numbers and turn it into a binary vector—just a list of 1s and 0s, where you store a 1 if the corresponding float was greater than 0 and a 0 otherwise.&lt;/p&gt;

&lt;p&gt;For the above example, this would start [1, 1, 0, 0, 0...]&lt;/p&gt;

&lt;p&gt;Incredibly, it looks like the cosine distance between these 0 and 1 vectors captures much of the semantically relevant meaning present in the distance between the much more accurate vectors. This means you can use 1/32nd of the space and still get useful results!&lt;/p&gt;
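&lt;p&gt;A minimal Python sketch of the trick, including where the 1/32nd figure comes from. The function names are hypothetical, and packing the bits into bytes is just one possible storage layout.&lt;/p&gt;

```python
def binarize(vector):
    """1 where the float is greater than 0, otherwise 0."""
    return [1 if x > 0 else 0 for x in vector]


def pack_bits(bits):
    """Pack a list of 0/1 values into bytes, eight bits per byte."""
    value = 0
    for bit in bits:
        value = value * 2 + bit
    # Round the byte count up so lengths that are not a multiple of 8 fit.
    return value.to_bytes((len(bits) + 7) // 8, "big")


floats = [0.0051681744, 0.017187592, -0.018685209, -0.01855924, -0.04725188]
print(binarize(floats))

# A 1,536 dimension vector: 4 bytes per 32-bit float versus 1 bit per value.
packed = pack_bits(binarize([0.1] * 1536))
print(len(packed), 1536 * 4)
```

&lt;p&gt;The packed vector is 192 bytes, against 6,144 bytes for the same vector stored as 32-bit floats.&lt;/p&gt;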

&lt;p&gt;Ce Gao here suggests a further optimization: use the binary vectors for a fast brute-force lookup of the top 200 matches, then run a more expensive re-ranking against those filtered values using the full floating point vectors.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/vector-search"&gt;vector-search&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/embeddings"&gt;embeddings&lt;/a&gt;&lt;/p&gt;



</summary><category term="vector-search"/><category term="embeddings"/></entry><entry><title>Embeddings: What they are and why they matter</title><link href="https://simonwillison.net/2023/Oct/23/embeddings/#atom-tag" rel="alternate"/><published>2023-10-23T13:36:21+00:00</published><updated>2023-10-23T13:36:21+00:00</updated><id>https://simonwillison.net/2023/Oct/23/embeddings/#atom-tag</id><summary type="html">
    &lt;p&gt;Embeddings are a really neat trick that often come wrapped in a pile of intimidating jargon.&lt;/p&gt;
&lt;p&gt;If you can make it through that jargon, they unlock powerful and exciting techniques that can be applied to all sorts of interesting problems.&lt;/p&gt;
&lt;p&gt;I gave a talk about embeddings at &lt;a href="https://pybay.com/"&gt;PyBay 2023&lt;/a&gt;. This article represents an improved version of that talk, which should stand alone even without watching the video.&lt;/p&gt;
&lt;p&gt;If you're not yet familiar with embeddings I hope to give you everything you need to get started applying them to real-world problems.&lt;/p&gt;
&lt;p&gt;In this article:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2023/Oct/23/embeddings/#the-38-minute-video-version"&gt;The 38 minute video version&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2023/Oct/23/embeddings/#what-are-embeddings"&gt;What are embeddings?&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2023/Oct/23/embeddings/#related-content-using-embeddings"&gt;Related content using embeddings&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2023/Oct/23/embeddings/#exploring-how-these-things-work-with-word2vec"&gt;Exploring how these things work with Word2Vec&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2023/Oct/23/embeddings/#calculating-embeddings-using-my-llm-tool"&gt;Calculating embeddings using my LLM tool&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2023/Oct/23/embeddings/#vibes-based-search"&gt;Vibes-based search&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2023/Oct/23/embeddings/#embeddings-for-code-using-symbex"&gt;Embeddings for code using Symbex&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2023/Oct/23/embeddings/#embedding-text-and-images-together-using-clip"&gt;Embedding text and images together using CLIP&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2023/Oct/23/embeddings/#faucet-finder-finding-faucets-with-clip"&gt;Faucet Finder: finding faucets with CLIP&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2023/Oct/23/embeddings/#clustering-embeddings"&gt;Clustering embeddings&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2023/Oct/23/embeddings/#visualize-in-2d-with-principal-component-analysis"&gt;Visualize in 2D with Principal Component Analysis&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2023/Oct/23/embeddings/#scoring-sentences-using-average-locations"&gt;Scoring sentences using average locations&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2023/Oct/23/embeddings/#answering-questions-with-retrieval-augmented-generation"&gt;Answering questions with Retrieval-Augmented Generation&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2023/Oct/23/embeddings/#qa"&gt;Q&amp;amp;A&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2023/Oct/23/embeddings/#embeddings-further-reading"&gt;Further reading&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;!-- cutoff --&gt;

&lt;h4 id="the-38-minute-video-version"&gt;The 38 minute video version&lt;/h4&gt;
&lt;p&gt;Here's &lt;a href="https://www.youtube.com/watch?v=ArnMdc-ICCM&amp;amp;t=50s"&gt;a video&lt;/a&gt; of the talk that I gave at PyBay:&lt;/p&gt;

&lt;iframe style="max-width: 100%" width="560" height="315" src="https://www.youtube-nocookie.com/embed/ArnMdc-ICCM?start=50" frameborder="0" allow="autoplay; encrypted-media" allowfullscreen="allowfullscreen"&gt; &lt;/iframe&gt;

&lt;p&gt;The audio quality of the official video wasn't great due to an issue with the microphone, but I ran that audio through Adobe's &lt;a href="https://podcast.adobe.com/enhance"&gt;Enhance Speech tool&lt;/a&gt; and uploaded my own video with the enhanced audio to YouTube.&lt;/p&gt;
&lt;h4 id="what-are-embeddings"&gt;What are embeddings?&lt;/h4&gt;
&lt;p&gt;Embeddings are a technology that's adjacent to the wider field of Large Language Models - the technology behind ChatGPT and Gemini and Claude.&lt;/p&gt;
&lt;p&gt;&lt;img loading="lazy" src="https://static.simonwillison.net/static/2023/embeddings/embeddings.002.jpeg" alt="On the left, a blog entry titled Storing and serving related documents with openai-to-sqlite and embeddings. On the right, a JSON array of floating point numbers, with the caption Fixed size: 300, 1000, 1536..." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;Embeddings are based around one trick: take a piece of content - in this case &lt;a href="https://til.simonwillison.net/llms/openai-embeddings-related-content"&gt;a blog entry&lt;/a&gt; - and turn that piece of content into an array of floating point numbers.&lt;/p&gt;
&lt;p&gt;The key thing about that array is that it will always be the same length, no matter how long the content is. The length is defined by the embedding model you are using - an array might be 300, or 1,000, or 1,536 numbers long.&lt;/p&gt;
&lt;p&gt;The best way to think about this array of numbers is to imagine it as co-ordinates in a very weird multi-dimensional space.&lt;/p&gt;
&lt;p&gt;It's hard to visualize 1,536 dimensional space, so here's a 3D visualization of the same idea:&lt;/p&gt;
&lt;p&gt;&lt;img loading="lazy" src="https://static.simonwillison.net/static/2023/embeddings/embeddings.003.jpeg" alt="a 3D chart showing a location in many-multi-dimensional space. 400 randomly placed red dots are scattered around the chart." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;Why place content in this space? Because we can learn interesting things about that content based on its location - in particular, based on what else is nearby.&lt;/p&gt;
&lt;p&gt;The location within the space represents the semantic meaning of the content, according to the embedding model's weird, mostly incomprehensible understanding of the world. It might capture colors, shapes, concepts or all sorts of other characteristics of the content that has been embedded.&lt;/p&gt;
&lt;p&gt;Nobody fully understands what those individual numbers mean, but we know that their locations can be used to find out useful things about the content.&lt;/p&gt;
&lt;h4 id="related-content-using-embeddings"&gt;Related content using embeddings&lt;/h4&gt;
&lt;p&gt;One of the first problems I solved with embeddings was to build a "related content" feature for &lt;a href="https://til.simonwillison.net/"&gt;my TIL blog&lt;/a&gt;. I wanted to be able to show a list of related articles at the bottom of each page.&lt;/p&gt;
&lt;p&gt;I did this using embeddings - in this case, I used the OpenAI &lt;code&gt;text-embedding-ada-002&lt;/code&gt; model, which is available &lt;a href="https://platform.openai.com/docs/guides/embeddings"&gt;via their API&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I currently have 472 articles on my site. I calculated the 1,536 dimensional embedding vector (array of floating point numbers) for each of those articles, and stored those vectors in my site's SQLite database.&lt;/p&gt;
&lt;p&gt;Now, if I want to find related articles for a given article, I can calculate the &lt;em&gt;cosine similarity&lt;/em&gt; between the embedding vector for that article and every other article in the database, then return the 10 closest matches by distance.&lt;/p&gt;
&lt;p&gt;There's an example at &lt;a href="https://til.simonwillison.net/sqlite/sqlite-tg#related"&gt;the bottom of this page&lt;/a&gt;. The top five related articles for &lt;a href="https://til.simonwillison.net/sqlite/sqlite-tg"&gt;Geospatial SQL queries in SQLite using TG, sqlite-tg and datasette-sqlite-tg&lt;/a&gt; are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://til.simonwillison.net/sqlite/geopoly"&gt;Geopoly in SQLite&lt;/a&gt; - 2023-01-04&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://til.simonwillison.net/spatialite/viewing-geopackage-data-with-spatialite-and-datasette"&gt;Viewing GeoPackage data with SpatiaLite and Datasette&lt;/a&gt; - 2022-12-11&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://til.simonwillison.net/gis/gdal-sql"&gt;Using SQL with GDAL&lt;/a&gt; - 2023-03-09&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://til.simonwillison.net/spatialite/knn"&gt;KNN queries with SpatiaLite&lt;/a&gt; - 2021-05-16&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://til.simonwillison.net/spatialite/gunion-to-combine-geometries"&gt;GUnion to combine geometries in SpatiaLite&lt;/a&gt; - 2022-04-12&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;That's a pretty good list!&lt;/p&gt;
&lt;p&gt;Here's &lt;a href="https://github.com/simonw/llm/blob/bf229945fe57036fa75e8105e59d9e506a720156/llm/__init__.py#L252C1-L256C53"&gt;the Python function&lt;/a&gt; I'm using to calculate those cosine similarity distances:&lt;/p&gt;
&lt;pre&gt;&lt;span class="pl-k"&gt;def&lt;/span&gt; &lt;span class="pl-en"&gt;cosine_similarity&lt;/span&gt;(&lt;span class="pl-s1"&gt;a&lt;/span&gt;, &lt;span class="pl-s1"&gt;b&lt;/span&gt;):
    &lt;span class="pl-s1"&gt;dot_product&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-en"&gt;sum&lt;/span&gt;(&lt;span class="pl-s1"&gt;x&lt;/span&gt; &lt;span class="pl-c1"&gt;*&lt;/span&gt; &lt;span class="pl-s1"&gt;y&lt;/span&gt; &lt;span class="pl-k"&gt;for&lt;/span&gt; &lt;span class="pl-s1"&gt;x&lt;/span&gt;, &lt;span class="pl-s1"&gt;y&lt;/span&gt; &lt;span class="pl-c1"&gt;in&lt;/span&gt; &lt;span class="pl-en"&gt;zip&lt;/span&gt;(&lt;span class="pl-s1"&gt;a&lt;/span&gt;, &lt;span class="pl-s1"&gt;b&lt;/span&gt;))
    &lt;span class="pl-s1"&gt;magnitude_a&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-en"&gt;sum&lt;/span&gt;(&lt;span class="pl-s1"&gt;x&lt;/span&gt; &lt;span class="pl-c1"&gt;*&lt;/span&gt; &lt;span class="pl-s1"&gt;x&lt;/span&gt; &lt;span class="pl-k"&gt;for&lt;/span&gt; &lt;span class="pl-s1"&gt;x&lt;/span&gt; &lt;span class="pl-c1"&gt;in&lt;/span&gt; &lt;span class="pl-s1"&gt;a&lt;/span&gt;) &lt;span class="pl-c1"&gt;**&lt;/span&gt; &lt;span class="pl-c1"&gt;0.5&lt;/span&gt;
    &lt;span class="pl-s1"&gt;magnitude_b&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-en"&gt;sum&lt;/span&gt;(&lt;span class="pl-s1"&gt;x&lt;/span&gt; &lt;span class="pl-c1"&gt;*&lt;/span&gt; &lt;span class="pl-s1"&gt;x&lt;/span&gt; &lt;span class="pl-k"&gt;for&lt;/span&gt; &lt;span class="pl-s1"&gt;x&lt;/span&gt; &lt;span class="pl-c1"&gt;in&lt;/span&gt; &lt;span class="pl-s1"&gt;b&lt;/span&gt;) &lt;span class="pl-c1"&gt;**&lt;/span&gt; &lt;span class="pl-c1"&gt;0.5&lt;/span&gt;
    &lt;span class="pl-k"&gt;return&lt;/span&gt; &lt;span class="pl-s1"&gt;dot_product&lt;/span&gt; &lt;span class="pl-c1"&gt;/&lt;/span&gt; (&lt;span class="pl-s1"&gt;magnitude_a&lt;/span&gt; &lt;span class="pl-c1"&gt;*&lt;/span&gt; &lt;span class="pl-s1"&gt;magnitude_b&lt;/span&gt;)&lt;/pre&gt;
&lt;p&gt;My TIL site runs on my &lt;a href="https://datasette.io/"&gt;Datasette&lt;/a&gt; Python framework, which supports building sites on top of a SQLite database. I wrote more about how that works in &lt;a href="https://simonwillison.net/2021/Jul/28/baked-data/"&gt;the Baked Data architectural pattern&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;You can browse the SQLite table that stores the calculated embeddings at &lt;a href="https://til.simonwillison.net/tils/embeddings"&gt;tils/embeddings&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img loading="lazy" src="https://static.simonwillison.net/static/2023/embeddings/til-simonwillison-net-tils-embeddings.png" alt="Screenshot of the embeddings table in Datasette, it has 472 rows, each of which consists of a text ID and a binary 6,144 byte embedding" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;Those are binary values. We can &lt;a href="https://til.simonwillison.net/tils?sql=select+id%2C+hex%28embedding%29+from+embeddings"&gt;run this SQL query&lt;/a&gt; to view them as hexadecimal:&lt;/p&gt;
&lt;div class="highlight highlight-source-sql"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;select&lt;/span&gt; id, hex(embedding) &lt;span class="pl-k"&gt;from&lt;/span&gt; embeddings&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;img loading="lazy" src="https://static.simonwillison.net/static/2023/embeddings/til-simonwillison-net-tils.png" alt="Running that SQL query in Datasette returns text IDs and long hexadecimal strings for each embedding" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;That's still not very readable though. We can use the &lt;code&gt;llm_embed_decode()&lt;/code&gt; custom SQL function to turn them into a JSON array:&lt;/p&gt;
&lt;div class="highlight highlight-source-sql"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;select&lt;/span&gt; id, llm_embed_decode(embedding) &lt;span class="pl-k"&gt;from&lt;/span&gt; embeddings &lt;span class="pl-k"&gt;limit&lt;/span&gt; &lt;span class="pl-c1"&gt;10&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;a href="https://til.simonwillison.net/tils?sql=select+id%2C+llm_embed_decode%28embedding%29+from+embeddings+limit+10"&gt;Try that here&lt;/a&gt;. It shows that each article is accompanied by that array of 1,536 floating point numbers.&lt;/p&gt;
&lt;p&gt;&lt;img loading="lazy" src="https://static.simonwillison.net/static/2023/embeddings/til-simonwillison-net-tils.1.png" alt="Now the SQL query returns a JSON array of floating point numbers for each ID" style="max-width: 100%;" /&gt;&lt;/p&gt;
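&lt;p&gt;The 6,144 byte size of each stored value is consistent with 1,536 floats at four bytes each. Here's a minimal sketch of that kind of encoding in Python, assuming the blob is a packed sequence of 32-bit floats in the machine's native byte order. The &lt;code&gt;encode&lt;/code&gt; and &lt;code&gt;decode&lt;/code&gt; names are hypothetical, and this is not necessarily how &lt;code&gt;llm_embed_decode()&lt;/code&gt; is implemented.&lt;/p&gt;

```python
from array import array


def encode(values):
    """Pack floats into a blob of 32-bit floats (native byte order)."""
    return array("f", values).tobytes()


def decode(blob):
    """Turn a blob of packed 32-bit floats back into a list of floats."""
    floats = array("f")
    floats.frombytes(blob)
    return floats.tolist()


blob = encode([0.0] * 1536)
print(len(blob))
```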
&lt;p&gt;We can use another custom SQL function, &lt;code&gt;llm_embed_cosine(vector1, vector2)&lt;/code&gt;, to calculate those cosine distances and find the most similar content.&lt;/p&gt;
&lt;p&gt;That SQL function &lt;a href="https://github.com/simonw/datasette-llm-embed/blob/ebded67fa9ee19db2c4b1badb1895cef0d58ac4a/datasette_llm_embed.py#L22-L26"&gt;is defined here&lt;/a&gt; in my &lt;a href="https://datasette.io/plugins/datasette-llm-embed"&gt;datasette-llm-embed&lt;/a&gt; plugin.&lt;/p&gt;
&lt;p&gt;Here's a query returning the five most similar articles to my SQLite TG article:&lt;/p&gt;
&lt;div class="highlight highlight-source-sql"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;select&lt;/span&gt;
  id,
  llm_embed_cosine(
    embedding,
    (
      &lt;span class="pl-k"&gt;select&lt;/span&gt;
        embedding
      &lt;span class="pl-k"&gt;from&lt;/span&gt;
        embeddings
      &lt;span class="pl-k"&gt;where&lt;/span&gt;
        id &lt;span class="pl-k"&gt;=&lt;/span&gt; &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;sqlite_sqlite-tg.md&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;
    )
  ) &lt;span class="pl-k"&gt;as&lt;/span&gt; score
&lt;span class="pl-k"&gt;from&lt;/span&gt;
  embeddings
&lt;span class="pl-k"&gt;order by&lt;/span&gt;
  score &lt;span class="pl-k"&gt;desc&lt;/span&gt;
&lt;span class="pl-k"&gt;limit&lt;/span&gt; &lt;span class="pl-c1"&gt;5&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;a href="https://til.simonwillison.net/tils?sql=select%0D%0A++id%2C%0D%0A++llm_embed_cosine%28%0D%0A++++embedding%2C%0D%0A++++%28%0D%0A++++++select%0D%0A++++++++embedding%0D%0A++++++from%0D%0A++++++++embeddings%0D%0A++++++where%0D%0A++++++++id+%3D+%27sqlite_sqlite-tg.md%27%0D%0A++++%29%0D%0A++%29+as+score%0D%0Afrom%0D%0A++embeddings%0D%0Aorder+by%0D%0A++score+desc%0D%0Alimit+5"&gt;Executing that query&lt;/a&gt; returns the following results:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;id&lt;/th&gt;
&lt;th&gt;score&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;sqlite_sqlite-tg.md&lt;/td&gt;
&lt;td&gt;1.0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;sqlite_geopoly.md&lt;/td&gt;
&lt;td&gt;0.8817322855676049&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;spatialite_viewing-geopackage-data-with-spatialite-and-datasette.md&lt;/td&gt;
&lt;td&gt;0.8813094978399854&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;gis_gdal-sql.md&lt;/td&gt;
&lt;td&gt;0.8799581261326747&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;spatialite_knn.md&lt;/td&gt;
&lt;td&gt;0.8692992294266506&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;As expected, the similarity between the article and itself is 1.0. The other articles are all related to geospatial SQL queries in SQLite.&lt;/p&gt;
&lt;p&gt;This query takes around 400ms to execute. To speed things up, I pre-calculate the top 10 similarities for every article and store them in a separate table called &lt;a href="https://til.simonwillison.net/tils/similarities"&gt;tils/similarities&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img loading="lazy" src="https://static.simonwillison.net/static/2023/embeddings/til-simonwillison-net-tils-similarities.png" alt="The similarities table has 4,922 rows each with an id, other_id and score column." style="max-width: 100%;" /&gt;&lt;/p&gt;
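&lt;p&gt;The pre-calculation idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the code that built that table: the &lt;code&gt;top_similarities()&lt;/code&gt; helper, the toy vectors and the decision to skip each document's match against itself are all hypothetical.&lt;/p&gt;

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def top_similarities(embeddings, n=10):
    """Return (id, other_id, score) rows: the n best matches for every id.

    embeddings is a dict mapping an id to a list of floats.
    """
    rows = []
    for doc_id, vector in embeddings.items():
        scored = [
            (cosine(vector, other_vector), other_id)
            for other_id, other_vector in embeddings.items()
            if other_id != doc_id  # assumption: skip the trivial self-match
        ]
        scored.sort(reverse=True)  # highest score first
        for score, other_id in scored[:n]:
            rows.append((doc_id, other_id, score))
    return rows


# Toy three-dimensional vectors, invented for illustration only
toy = {
    "a.md": [1.0, 0.0, 0.0],
    "b.md": [0.9, 0.1, 0.0],
    "c.md": [0.0, 1.0, 0.0],
}
rows = top_similarities(toy, n=1)
```

&lt;p&gt;This compares every document against every other document, so the work grows with the square of the collection size. That's fine for a few hundred articles and is exactly the cost you want to pay once at build time rather than on every page load.&lt;/p&gt;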
&lt;p&gt;I wrote a Python function to &lt;a href="https://github.com/simonw/til/blob/a244856c72000760c2939550ae2a78266dee4f07/plugins/template_vars.py#L21-L37"&gt;look up related documents from that table&lt;/a&gt; and &lt;a href="https://github.com/simonw/til/blob/a244856c72000760c2939550ae2a78266dee4f07/templates/pages/%7Btopic%7D/%7Bslug%7D.html#L37-L45"&gt;called it from the template&lt;/a&gt; that's used to render the article page.&lt;/p&gt;
&lt;p&gt;My &lt;a href="https://til.simonwillison.net/llms/openai-embeddings-related-content"&gt;Storing and serving related documents with openai-to-sqlite and embeddings&lt;/a&gt; TIL explains how this all works in detail, including how GitHub Actions are used to fetch new embeddings &lt;a href="https://github.com/simonw/til/blob/a244856c72000760c2939550ae2a78266dee4f07/.github/workflows/build.yml#L67-L75"&gt;as part of the build script&lt;/a&gt; that deploys the site.&lt;/p&gt;
&lt;p&gt;I used the OpenAI embeddings API for this project. It's extremely inexpensive - for my TIL website I embedded around 402,500 tokens, which at $0.0001 / 1,000 tokens comes to $0.04 - just 4 cents!&lt;/p&gt;
&lt;p&gt;It's really easy to use: you POST it some text along with your API key, and it gives you back that JSON array of floating point numbers.&lt;/p&gt;
&lt;p&gt;&lt;img loading="lazy" src="https://static.simonwillison.net/static/2023/embeddings/embeddings.006.jpeg" alt="Screenshot of curl against api.openai.com/v1/embeddings sending a Bearer token header and a JSON body specifying input text and the text-embedding-ada-002 model. The API responds with a JSON list of numbers." style="max-width: 100%;" /&gt;&lt;/p&gt;
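&lt;p&gt;Here's roughly the same request as that curl command, sketched with the Python standard library. The helper names &lt;code&gt;build_embedding_request()&lt;/code&gt; and &lt;code&gt;fetch_embedding()&lt;/code&gt; are hypothetical, and error handling is left out.&lt;/p&gt;

```python
import json
import urllib.request


def build_embedding_request(text, api_key, model="text-embedding-ada-002"):
    """Build (but do not send) a POST request for the embeddings endpoint."""
    body = json.dumps({"input": text, "model": model}).encode("utf-8")
    return urllib.request.Request(
        "https://api.openai.com/v1/embeddings",
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def fetch_embedding(text, api_key):
    """Send the request and return the list of floats from the response."""
    request = build_embedding_request(text, api_key)
    with urllib.request.urlopen(request) as response:
        payload = json.load(response)
    # The vector lives under data[0].embedding in the JSON response
    return payload["data"][0]["embedding"]
```

&lt;p&gt;Calling &lt;code&gt;fetch_embedding("Hello world", api_key)&lt;/code&gt; needs a real API key and network access.&lt;/p&gt;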
&lt;p&gt;But... it's a proprietary model. A few months ago OpenAI &lt;a href="https://openai.com/blog/gpt-4-api-general-availability#deprecation-of-older-embeddings-models"&gt;shut down some of their older embeddings models&lt;/a&gt;, which is a problem if you've stored large numbers of embeddings from those models since you'll need to recalculate them against a supported model if you want to be able to embed anything else new.&lt;/p&gt;
&lt;p&gt;&lt;img loading="lazy" src="https://static.simonwillison.net/static/2023/embeddings/embeddings.007.jpeg" alt="Screenshot of the OpenAI First-generation text embedding models list, showing the shutdown date of 4th April 2024 for 7 legacy models." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;To OpenAI's credit, they did promise to "cover the financial cost of users re-embedding content with these new models." - but it's still a reason to be cautious about relying on proprietary models.&lt;/p&gt;
&lt;p&gt;The good news is that there are extremely powerful openly licensed models which you can run on your own hardware, avoiding any risk of them being shut down. We'll talk about that more in a moment.&lt;/p&gt;
&lt;h4 id="exploring-how-these-things-work-with-word2vec"&gt;Exploring how these things work with Word2Vec&lt;/h4&gt;
&lt;p&gt;Google Research put out an influential paper 10 years ago describing an early embedding model they created called Word2Vec.&lt;/p&gt;
&lt;p&gt;That paper is &lt;a href="https://arxiv.org/abs/1301.3781"&gt;Efficient Estimation of Word Representations in Vector Space&lt;/a&gt;, dated 16th January 2013. It's a paper that helped kick off widespread interest in embeddings.&lt;/p&gt;
&lt;p&gt;Word2Vec is a model that takes single words and turns them into a list of 300 numbers. That list of numbers captures something about the meaning of the associated word.&lt;/p&gt;
&lt;p&gt;This is best illustrated by a demo.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://turbomaze.github.io/word2vecjson/"&gt;turbomaze.github.io/word2vecjson&lt;/a&gt; is an interactive tool put together by &lt;a href="https://anthony.ai/"&gt;Anthony Liu&lt;/a&gt; with a 10,000 word subset of the Word2Vec corpus. You can view &lt;a href="https://turbomaze.github.io/word2vecjson/data/wordvecs10000.js"&gt;this JavaScript file&lt;/a&gt; to see the JSON for those 10,000 words and their associated 300-long arrays of numbers.&lt;/p&gt;
&lt;p&gt;&lt;img loading="lazy" src="https://static.simonwillison.net/static/2023/embeddings/word2vec.jpg" alt="Screenshot of the Word to Vec JS Demo showing the results for france and the algebra results for germany + paris - france" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;Search for a word to find similar words based on the cosine similarity of their Word2Vec representations. For example, the word "france" returns the following related results:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;word&lt;/th&gt;
&lt;th&gt;similarity&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;france&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;french&lt;/td&gt;
&lt;td&gt;0.7000748343471224&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;belgium&lt;/td&gt;
&lt;td&gt;0.6933180492111168&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;paris&lt;/td&gt;
&lt;td&gt;0.6334910653433325&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;germany&lt;/td&gt;
&lt;td&gt;0.627075617939471&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;italy&lt;/td&gt;
&lt;td&gt;0.6135215284228007&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;spain&lt;/td&gt;
&lt;td&gt;0.6064218103692152&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;That's a mixture of French things and European geography.&lt;/p&gt;
&lt;p&gt;A really interesting thing you can do here is perform arithmetic on these vectors.&lt;/p&gt;
&lt;p&gt;Take the vector for "germany", add "paris" and subtract "france". The resulting vector is closest to "berlin"!&lt;/p&gt;
&lt;p&gt;Something about this model has captured the idea of nationalities and geography to the point that you can use arithmetic to explore additional facts about the world.&lt;/p&gt;
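&lt;p&gt;The arithmetic itself is simple enough to sketch in Python. The four-dimensional vectors below are hand-made toys that I invented so the example works out cleanly; they are not real Word2Vec data, which has 300 learned dimensions with no human-readable labels.&lt;/p&gt;

```python
import math

# Hand-made toy vectors, invented purely for illustration.
# Dimensions: (is a country, is a capital city, French, German).
vectors = {
    "france": [1.0, 0.0, 1.0, 0.0],
    "paris": [0.0, 1.0, 1.0, 0.0],
    "germany": [1.0, 0.0, 0.0, 1.0],
    "berlin": [0.0, 1.0, 0.0, 1.0],
}


def cosine(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def closest(target):
    """Return the word whose vector has the highest cosine similarity to target."""
    return max(vectors, key=lambda word: cosine(vectors[word], target))


# germany + paris - france, computed one dimension at a time
target = [
    g + p - f
    for g, p, f in zip(vectors["germany"], vectors["paris"], vectors["france"])
]
result = closest(target)
print(result)  # berlin
```

&lt;p&gt;Subtracting "france" from "paris" removes the French-ness and leaves the capital-ness, and adding "germany" puts the German-ness back in. With real embeddings the result only lands near "berlin" rather than exactly on it.&lt;/p&gt;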
&lt;p&gt;Word2Vec was trained on 1.6 billion words of content. The embedding models we use today are trained on much larger datasets and capture a much richer understanding of the underlying relationships.&lt;/p&gt;
&lt;h4 id="calculating-embeddings-using-my-llm-tool"&gt;Calculating embeddings using my LLM tool&lt;/h4&gt;
&lt;p&gt;I've been building a command-line utility and Python library called &lt;a href="https://llm.datasette.io/"&gt;LLM&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;You can read more about LLM here:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://simonwillison.net/2023/May/18/cli-tools-for-llms/"&gt;llm, ttok and strip-tags—CLI tools for working with ChatGPT and other LLMs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://simonwillison.net/2023/Jul/12/llm/"&gt;The LLM CLI tool now supports self-hosted language models via plugins&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://simonwillison.net/2023/Sep/4/llm-embeddings/"&gt;LLM now provides tools for working with embeddings&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://simonwillison.net/2023/Sep/12/llm-clip-and-chat/"&gt;Build an image search engine with llm-clip, chat with models with llm chat&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;LLM is a tool for working with Large Language Models. You can install it like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;pip install llm&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Or via Homebrew:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;brew install llm&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;You can use it as a command-line tool for interacting with LLMs, or as &lt;a href="https://llm.datasette.io/en/stable/python-api.html"&gt;a Python library&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Out of the box it can work with the OpenAI API. Set &lt;a href="https://llm.datasette.io/en/stable/setup.html#saving-and-using-stored-keys"&gt;an API key&lt;/a&gt; and you can run commands like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;ten fun names for a pet pelican&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Where it gets really fun is when you start &lt;a href="https://llm.datasette.io/en/stable/plugins/index.html"&gt;installing plugins&lt;/a&gt;. There are plugins that add entirely new language models to it, including models that run directly on your own machine.&lt;/p&gt;
&lt;p&gt;A few months ago &lt;a href="https://simonwillison.net/2023/Sep/4/llm-embeddings/"&gt;I extended LLM&lt;/a&gt; to support plugins that can run embedding models as well.&lt;/p&gt;
&lt;p&gt;Here's how to run the catchily titled &lt;a href="https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2"&gt;all-MiniLM-L6-v2&lt;/a&gt; model using LLM:&lt;/p&gt;
&lt;p&gt;&lt;img loading="lazy" src="https://static.simonwillison.net/static/2023/embeddings/embeddings.014.jpeg" alt="Slide showing the commands listed below" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;First, we install &lt;code&gt;llm&lt;/code&gt; and then use that to install the &lt;a href="https://github.com/simonw/llm-sentence-transformers"&gt;llm-sentence-transformers&lt;/a&gt; plugin - a wrapper around the &lt;a href="https://www.sbert.net/"&gt;SentenceTransformers&lt;/a&gt; library.&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;pip install llm
llm install llm-sentence-transformers&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Next we need to register the &lt;code&gt;all-MiniLM-L6-v2&lt;/code&gt; model. This will download the model from Hugging Face to your computer:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm sentence-transformers register all-MiniLM-L6-v2&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;We can test that out by embedding a single sentence like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm embed -m sentence-transformers/all-MiniLM-L6-v2 \
  -c &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;Hello world&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This outputs a JSON array that starts like this:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;[-0.03447725251317024, 0.031023245304822922, 0.006734962109476328, 0.026108916848897934, -0.03936201333999634, ...&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;Embeddings like this on their own aren't very interesting - we need to store and compare them to start getting useful results.&lt;/p&gt;
&lt;p&gt;LLM can store embeddings in a "collection" - a SQLite table. The &lt;a href="https://llm.datasette.io/en/stable/embeddings/cli.html#llm-embed-multi"&gt;embed-multi command&lt;/a&gt; can be used to embed multiple pieces of content at once and store them in a collection.&lt;/p&gt;
&lt;p&gt;That's what this next command does:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm embed-multi readmes \
  --model sentence-transformers/all-MiniLM-L6-v2 \
  --files &lt;span class="pl-k"&gt;~&lt;/span&gt;/ &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;**/README.md&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt; --store&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Here we are populating a collection called "readmes".&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;--files&lt;/code&gt; option takes two arguments: a directory to search and a glob pattern to match against filenames. In this case I'm searching my home directory recursively for any file named &lt;code&gt;README.md&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;--store&lt;/code&gt; option causes LLM to store the raw text in the SQLite table in addition to the embedding vector.&lt;/p&gt;
&lt;p&gt;This command took around 30 minutes to run on my computer, but it worked! I now have a collection called &lt;code&gt;readmes&lt;/code&gt; with 16,796 rows - one for each &lt;code&gt;README.md&lt;/code&gt; file it found in my home directory.&lt;/p&gt;
&lt;h4 id="vibes-based-search"&gt;Vibes-based search&lt;/h4&gt;
&lt;p&gt;Now that we have a collection of embeddings, we can run searches against it using the &lt;a href="https://llm.datasette.io/en/stable/embeddings/cli.html#llm-similar"&gt;llm similar command&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img loading="lazy" src="https://static.simonwillison.net/static/2023/embeddings/embeddings.015.jpeg" alt="A terminal running llm similar and piping the results through jq" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm similar readmes -c &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;sqlite backup tools&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-k"&gt;|&lt;/span&gt; jq .id&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;We are asking for items in the &lt;code&gt;readmes&lt;/code&gt; collection that are similar to the embedding vector for the phrase "sqlite backup tools".&lt;/p&gt;
&lt;p&gt;This command outputs JSON by default, which includes the full text of the README files since we stored them using &lt;code&gt;--store&lt;/code&gt; earlier.&lt;/p&gt;
&lt;p&gt;Piping the results through &lt;code&gt;jq .id&lt;/code&gt; causes the command to output just the IDs of the matching rows.&lt;/p&gt;
&lt;p&gt;The top matching results are:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;"sqlite-diffable/README.md"
"sqlite-dump/README.md"
"ftstri/sqlite/ext/repair/README.md"
"simonw/README.md"
"sqlite-generate/README.md"
"sqlite-history/README.md"
"dbf-to-sqlite/README.md"
"ftstri/sqlite/ext/README.md"
"sqlite-utils/README.md"
"ftstri/sqlite/README.md"
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;These are good results! Each of these READMEs either describes a tool for working with SQLite backups or a project that relates to backups in some way.&lt;/p&gt;
&lt;p&gt;What's interesting about this is that it's not guaranteed that the term "backups" appeared directly in the text of those READMEs. The content is semantically similar to that phrase, but might not be an exact textual match.&lt;/p&gt;
&lt;p&gt;We can call this semantic search. I like to think of it as &lt;strong&gt;vibes-based search&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;The vibes of those READMEs relate to our search term, according to this weird multi-dimensional space representation of the meaning of words.&lt;/p&gt;
&lt;p&gt;This is absurdly useful. If you've ever built a search engine for a website, you know that exact matches don't always help people find what they are looking for.&lt;/p&gt;
&lt;p&gt;We can use this kind of semantic search to build better search engines for a whole bunch of different kinds of content.&lt;/p&gt;
&lt;h4 id="embeddings-for-code-using-symbex"&gt;Embeddings for code using Symbex&lt;/h4&gt;
&lt;p&gt;Another tool I've been building is called &lt;a href="https://github.com/simonw/symbex"&gt;Symbex&lt;/a&gt;. It's a tool for exploring the symbols in a Python codebase.&lt;/p&gt;
&lt;p&gt;I &lt;a href="https://simonwillison.net/2023/Jun/18/symbex/"&gt;originally built it&lt;/a&gt; to help quickly find Python functions and classes and pipe them into LLMs to help explain and rewrite them.&lt;/p&gt;
&lt;p&gt;Then I realized that I could use it to calculate embeddings for all of the functions in a codebase, and use those embeddings to build a code search engine.&lt;/p&gt;
&lt;p&gt;I added a feature that could output JSON or CSV representing the symbols it found, using the same output format that &lt;code&gt;llm embed-multi&lt;/code&gt; can use as an input.&lt;/p&gt;
&lt;p&gt;Here's how I built a collection of all of the functions in my &lt;a href="https://github.com/simonw/datasette"&gt;Datasette&lt;/a&gt; project, using a newly released model called &lt;a href="https://huggingface.co/TaylorAI/gte-tiny"&gt;gte-tiny&lt;/a&gt; - just a 60MB file!&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm sentence-transformers register TaylorAI/gte-tiny

&lt;span class="pl-c1"&gt;cd&lt;/span&gt; datasette/datasette

symbex &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;*&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;*:*&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt; --nl &lt;span class="pl-k"&gt;|&lt;/span&gt; \
  llm embed-multi functions - \
  --model sentence-transformers/TaylorAI/gte-tiny \
  --format nl \
  --store&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;code&gt;symbex '*' '*:*' --nl&lt;/code&gt; finds all functions (&lt;code&gt;*&lt;/code&gt;) and class methods (the &lt;code&gt;*:*&lt;/code&gt; pattern) in the current directory and outputs them as newline-delimited JSON.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;llm embed-multi ... --format nl&lt;/code&gt; command expects newline-delimited JSON as input, so we can pipe the output of &lt;code&gt;symbex&lt;/code&gt; directly into it.&lt;/p&gt;
&lt;p&gt;This defaults to storing the embeddings in the default LLM SQLite database. You can add &lt;code&gt;--database /tmp/data.db&lt;/code&gt; to specify an alternative location.&lt;/p&gt;
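&lt;p&gt;To make the newline-delimited JSON format concrete, here's a small stand-in for what Symbex does, built on Python's &lt;code&gt;ast&lt;/code&gt; module. This is not how Symbex is implemented: the &lt;code&gt;functions_as_nl_json()&lt;/code&gt; helper is hypothetical, and the &lt;code&gt;id&lt;/code&gt; and &lt;code&gt;code&lt;/code&gt; keys and the file-name-plus-line-number ID scheme are my assumptions about a reasonable record shape.&lt;/p&gt;

```python
import ast
import json


def functions_as_nl_json(source, filename):
    """Return one JSON line per function or method found in source."""
    tree = ast.parse(source)
    lines = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            record = {
                # Assumed ID scheme: file name plus starting line number
                "id": f"{filename}:{node.lineno}",
                # The exact source text of the function
                "code": ast.get_source_segment(source, node),
            }
            # json.dumps escapes newlines, so each record stays on one line
            lines.append(json.dumps(record))
    return "\n".join(lines)


example = '''
def hello(name):
    return "Hello " + name


class Greeter:
    def wave(self):
        return "o/"
'''
output = functions_as_nl_json(example, "example.py")
```

&lt;p&gt;Each line of &lt;code&gt;output&lt;/code&gt; is a complete JSON object, which is what makes the format so easy to stream through a pipe from one tool into another.&lt;/p&gt;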
&lt;p&gt;And now... I can run vibes-based semantic search against my codebase!&lt;/p&gt;
&lt;p&gt;I could use the &lt;code&gt;llm similar&lt;/code&gt; command for this, but I also have the ability to run these searches using Datasette itself.&lt;/p&gt;
&lt;p&gt;Here's a SQL query for that, using the &lt;a href="https://datasette.io/plugins/datasette-llm-embed"&gt;datasette-llm-embed&lt;/a&gt; plugin from earlier:&lt;/p&gt;
&lt;div class="highlight highlight-source-sql"&gt;&lt;pre&gt;with input &lt;span class="pl-k"&gt;as&lt;/span&gt; (
  &lt;span class="pl-k"&gt;select&lt;/span&gt;
    llm_embed(
      &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;sentence-transformers/TaylorAI/gte-tiny&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;,
      :input
    ) &lt;span class="pl-k"&gt;as&lt;/span&gt; e
)
&lt;span class="pl-k"&gt;select&lt;/span&gt;
  id,
  content
&lt;span class="pl-k"&gt;from&lt;/span&gt;
  embeddings,
  input
&lt;span class="pl-k"&gt;where&lt;/span&gt;
  collection_id &lt;span class="pl-k"&gt;=&lt;/span&gt; (
    &lt;span class="pl-k"&gt;select&lt;/span&gt; id &lt;span class="pl-k"&gt;from&lt;/span&gt; collections &lt;span class="pl-k"&gt;where&lt;/span&gt; name &lt;span class="pl-k"&gt;=&lt;/span&gt; &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;functions&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;
  )
&lt;span class="pl-k"&gt;order by&lt;/span&gt;
  llm_embed_cosine(embedding, &lt;span class="pl-c1"&gt;input&lt;/span&gt;.&lt;span class="pl-c1"&gt;e&lt;/span&gt;) &lt;span class="pl-k"&gt;desc&lt;/span&gt;
&lt;span class="pl-k"&gt;limit&lt;/span&gt; &lt;span class="pl-c1"&gt;5&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The &lt;code&gt;:input&lt;/code&gt; parameter is automatically turned into a form field by Datasette.&lt;/p&gt;
&lt;p&gt;When I run this, I get back functions that relate to the concept of listing plugins:&lt;/p&gt;
&lt;p&gt;&lt;img loading="lazy" src="https://static.simonwillison.net/static/2023/embeddings/list-plugins.jpg" alt="Running that query in Datasette with an input of list plugins returns the plugins() function from the cli.py file on line 175" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;The key idea here is to use SQLite as an integration point - a substrate for combining together multiple tools.&lt;/p&gt;
&lt;p&gt;I can run separate tools that extract functions from a codebase, run them through an embedding model, write those embeddings to SQLite and then run queries against the results.&lt;/p&gt;
&lt;p&gt;Anything that can be piped into a tool can now be embedded and processed by the other components of this ecosystem.&lt;/p&gt;
&lt;h4 id="embedding-text-and-images-together-using-clip"&gt;Embedding text and images together using CLIP&lt;/h4&gt;
&lt;p&gt;My current favorite embedding model is &lt;a href="https://openai.com/blog/clip/"&gt;CLIP&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;CLIP is a fascinating model released by OpenAI - back in January 2021, when they were still doing most things in the open - that can embed both text and images.&lt;/p&gt;
&lt;p&gt;Crucially, it embeds them both into the same vector space.&lt;/p&gt;
&lt;p&gt;If you embed the string "dog", you'll get a location in 512-dimensional space (depending on your CLIP configuration).&lt;/p&gt;
&lt;p&gt;If you embed a photograph of a dog, you'll get a location in that same space... and it will be close in terms of distance to the location of the string "dog"!&lt;/p&gt;
&lt;p&gt;This means we can search for related images using text, and search for related text using images.&lt;/p&gt;
&lt;p&gt;I built &lt;a href="https://observablehq.com/@simonw/openai-clip-in-a-browser"&gt;an interactive demo&lt;/a&gt; to help explain how this works. The demo is an Observable notebook that runs the CLIP model directly in the browser.&lt;/p&gt;
&lt;p&gt;It's a pretty heavy page - it has to load 158MB of resources (64.6MB for the CLIP text model and 87.6MB for the image model) - but once loaded you can use it to embed an image, then embed a string of text and calculate the distance between the two.&lt;/p&gt;
&lt;p&gt;I can give it this photo I took of a beach:&lt;/p&gt;
&lt;p&gt;&lt;img loading="lazy" src="https://static.simonwillison.net/static/2023/embeddings/beach.jpg" alt="A bright blue sky over a beach, with sandy cliffs and the Pacific ocean in the frame" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;Then type in different text strings to calculate a similarity score, here displayed as a percentage:&lt;/p&gt;
&lt;p&gt;&lt;img loading="lazy" src="https://static.simonwillison.net/static/2023/embeddings/clip.gif" alt="Animation showing different similarity scores for different text strings" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;text&lt;/th&gt;
&lt;th&gt;score&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;beach&lt;/td&gt;
&lt;td&gt;26.946%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;city&lt;/td&gt;
&lt;td&gt;19.839%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;sunshine&lt;/td&gt;
&lt;td&gt;24.146%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;sunshine beach&lt;/td&gt;
&lt;td&gt;26.741%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;california&lt;/td&gt;
&lt;td&gt;25.686%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;california beach&lt;/td&gt;
&lt;td&gt;27.427%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;It's pretty amazing that we can do all of this in JavaScript running in the browser!&lt;/p&gt;
&lt;p&gt;There's an obvious catch: it's not actually that useful to be able to take an arbitrary photo and say "how similar is this to the term 'city'?".&lt;/p&gt;
&lt;p&gt;The trick is to build additional interfaces on top of this. Once again, we have the ability to build vibes-based search engines.&lt;/p&gt;
&lt;p&gt;Here's a great example of one of those.&lt;/p&gt;
&lt;h4 id="faucet-finder-finding-faucets-with-clip"&gt;Faucet Finder: finding faucets with CLIP&lt;/h4&gt;
&lt;p&gt;&lt;a href="https://www.dbreunig.com/"&gt;Drew Breunig&lt;/a&gt; used LLM and my &lt;a href="https://github.com/simonw/llm-clip"&gt;llm-clip&lt;/a&gt; plugin to build a search engine for faucet taps.&lt;/p&gt;
&lt;p&gt;He was renovating his bathroom, and he needed to buy new faucets. So he scraped 20,000 photographs of faucets from a faucet supply company and ran CLIP against all of them.&lt;/p&gt;
&lt;p&gt;He used the result to build &lt;a href="https://faucet-finder.fly.dev/"&gt;Faucet Finder&lt;/a&gt; - a custom tool (deployed using Datasette) for finding faucets that look similar to other faucets.&lt;/p&gt;
&lt;p&gt;&lt;img loading="lazy" src="https://static.simonwillison.net/static/2023/embeddings/faucet-finder.jpg" alt="The Faucet Finder homepage - six faucets, each with a Find Similar button." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;Among other things, this means you can find an expensive faucet you like and then look for visually similar cheaper options!&lt;/p&gt;
&lt;p&gt;Drew wrote more about his project in &lt;a href="https://www.dbreunig.com/2023/09/26/faucet-finder.html"&gt;Finding Bathroom Faucets with Embeddings&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Drew's demo uses pre-calculated embeddings to display similar results without having to run the CLIP model on the server.&lt;/p&gt;
&lt;p&gt;Inspired by this, I spent some time figuring out &lt;a href="https://til.simonwillison.net/fly/clip-on-fly"&gt;how to deploy a server-side CLIP model&lt;/a&gt; hosted by my own &lt;a href="https://fly.io/"&gt;Fly.io&lt;/a&gt; account.&lt;/p&gt;
&lt;p&gt;Drew's Datasette instance &lt;a href="https://faucet-finder.fly.dev/faucets/embeddings"&gt;includes this table&lt;/a&gt; of embedding vectors, exposed via the Datasette API.&lt;/p&gt;
&lt;p&gt;I deployed my own instance with &lt;a href="https://clip-datasette-on-fly.fly.dev/_memory?sql=select+hex(llm_embed(%27clip%27%2C+%3Aq))+as+x&amp;amp;q=purple"&gt;this API&lt;/a&gt; for embedding text strings, then built an Observable notebook demo that hits both APIs and combines the results.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://observablehq.com/@simonw/search-for-faucets-with-clip-api"&gt;observablehq.com/@simonw/search-for-faucets-with-clip-api&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Now I can search for things like "gold purple" and get back vibes-based faucet results:&lt;/p&gt;
&lt;p&gt;&lt;img loading="lazy" src="https://static.simonwillison.net/static/2023/embeddings/clip-gold-purple.jpg" alt="Observable notebook: Search for Faucets with CLIP. The search term gold purple produces 8 alarmingly tasteless faucets in those combined colors." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;Being able to spin up this kind of ultra-specific search engine in a few hours is exactly the kind of trick that excites me about having embeddings as a tool in my toolbox.&lt;/p&gt;
&lt;h4 id="clustering-embeddings"&gt;Clustering embeddings&lt;/h4&gt;
&lt;p&gt;Related content and semantic / vibes-based search are the two most common applications of embeddings, but there are a bunch of other neat things you can do with them too.&lt;/p&gt;
&lt;p&gt;One of those is clustering.&lt;/p&gt;
&lt;p&gt;I built a plugin for this called &lt;a href="https://github.com/simonw/llm-cluster"&gt;llm-cluster&lt;/a&gt; which implements this using &lt;a href="https://scikit-learn.org/stable/modules/clustering.html"&gt;sklearn.cluster&lt;/a&gt; from scikit-learn.&lt;/p&gt;
&lt;p&gt;To demonstrate that, I used my &lt;a href="https://github.com/simonw/paginate-json"&gt;paginate-json&lt;/a&gt; tool and the GitHub issues API to collect the titles of all of the issues in my &lt;code&gt;simonw/llm&lt;/code&gt; repository into a collection called &lt;code&gt;llm-issues&lt;/code&gt;:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;paginate-json &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;https://api.github.com/repos/simonw/llm/issues?state=all&amp;amp;filter=all&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt; \
  &lt;span class="pl-k"&gt;|&lt;/span&gt; jq &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;[.[] | {id: .id, title: .title}]&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt; \
  &lt;span class="pl-k"&gt;|&lt;/span&gt; llm embed-multi llm-issues - \
  --store&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Now I can create 10 clusters of issues like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm install llm-cluster

llm cluster llm-issues 10&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Clusters are output as a JSON array, with output that looks something like this (truncated):&lt;/p&gt;
&lt;div class="highlight highlight-source-json"&gt;&lt;pre&gt;[
  {
    &lt;span class="pl-ent"&gt;"id"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;2&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"items"&lt;/span&gt;: [
      {
        &lt;span class="pl-ent"&gt;"id"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;1650662628&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
        &lt;span class="pl-ent"&gt;"content"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Initial design&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;
      },
      {
        &lt;span class="pl-ent"&gt;"id"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;1650682379&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
        &lt;span class="pl-ent"&gt;"content"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Log prompts and responses to SQLite&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;
      }
    ]
  },
  {
    &lt;span class="pl-ent"&gt;"id"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;4&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"items"&lt;/span&gt;: [
      {
        &lt;span class="pl-ent"&gt;"id"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;1650760699&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
        &lt;span class="pl-ent"&gt;"content"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;llm web command - launches a web server&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;
      },
      {
        &lt;span class="pl-ent"&gt;"id"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;1759659476&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
        &lt;span class="pl-ent"&gt;"content"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;`llm models` command&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;
      },
      {
        &lt;span class="pl-ent"&gt;"id"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;1784156919&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
        &lt;span class="pl-ent"&gt;"content"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;`llm.get_model(alias)` helper&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;
      }
    ]
  },
  {
    &lt;span class="pl-ent"&gt;"id"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;7&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"items"&lt;/span&gt;: [
      {
        &lt;span class="pl-ent"&gt;"id"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;1650765575&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
        &lt;span class="pl-ent"&gt;"content"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;--code mode for outputting code&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;
      },
      {
        &lt;span class="pl-ent"&gt;"id"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;1659086298&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
        &lt;span class="pl-ent"&gt;"content"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Accept PROMPT from --stdin&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;
      },
      {
        &lt;span class="pl-ent"&gt;"id"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;1714651657&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
        &lt;span class="pl-ent"&gt;"content"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Accept input from standard in&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;
      }
    ]
  }
]&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;These do appear to be related, but we can do better. The &lt;code&gt;llm cluster&lt;/code&gt; command has a &lt;code&gt;--summary&lt;/code&gt; option which causes it to pass the resulting cluster text through an LLM and use it to generate a descriptive name for each cluster:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm cluster llm-issues 10 --summary&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This gives back names like "Log Management and Interactive Prompt Tracking" and "Continuing Conversation Mechanism and Management". See &lt;a href="https://github.com/simonw/llm-cluster/blob/main/README.md#generating-summaries-for-each-cluster"&gt;the README&lt;/a&gt; for more details.&lt;/p&gt;
&lt;h4 id="visualize-in-2d-with-principal-component-analysis"&gt;Visualize in 2D with Principal Component Analysis&lt;/h4&gt;
&lt;p&gt;The problem with massively multi-dimensional space is that it's really hard to visualize.&lt;/p&gt;
&lt;p&gt;We can use a technique called Principal Component Analysis to reduce the dimensionality of the data to a more manageable size - and it turns out lower dimensions continue to capture useful semantic meaning about the content.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://interconnected.org/"&gt;Matt Webb&lt;/a&gt; used the OpenAI embedding model to generate embeddings for descriptions of every episode of the BBC's In Our Time podcast. He used these to find related episodes, but also ran PCA against them to create &lt;a href="https://interconnected.org/more/2023/02/in_our_time-PCA-plot.html"&gt;an interactive 2D visualization&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img loading="lazy" src="https://static.simonwillison.net/static/2023/embeddings/cpa.gif" alt="Animated screenshot of a cloud of points in 2D space. At one side hovering over them shows things like The War of 1812 and The Battle of Trafalgar - at the other side we get Quantum Gravity and Higgs Boson and Carbon" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;Reducing 1,536 dimensions to just two still produces a meaningful way of exploring the data! Episodes about historic wars show up near each other, elsewhere there's a cluster of episodes about modern scientific discoveries.&lt;/p&gt;
&lt;p&gt;Matt wrote more about this in &lt;a href="https://interconnected.org/home/2023/02/07/braggoscope"&gt;Browse the BBC In Our Time archive by Dewey decimal code&lt;/a&gt;.&lt;/p&gt;
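&lt;p&gt;As a rough illustration of the dimensionality reduction described above, here is a minimal PCA sketch using NumPy. The &lt;code&gt;pca_2d&lt;/code&gt; helper and the random toy data are my own assumptions for illustration, not the code Matt used.&lt;/p&gt;

```python
import numpy as np

def pca_2d(vectors):
    """Project row vectors onto their top two principal components."""
    data = np.asarray(vectors, dtype=float)
    # PCA works on mean-centered data
    centered = data - data.mean(axis=0)
    # Rows of vt are the principal directions, strongest first
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T

# Toy stand-in for real embeddings: 6 points in 5 dimensions
rng = np.random.default_rng(0)
points = rng.normal(size=(6, 5))
coords = pca_2d(points)
print(coords.shape)
```

&lt;p&gt;Each row of &lt;code&gt;coords&lt;/code&gt; is an x, y pair that could be handed to a plotting library.&lt;/p&gt;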
&lt;h4 id="scoring-sentences-using-average-locations"&gt;Scoring sentences using average locations&lt;/h4&gt;
&lt;p&gt;Another trick with embeddings is to use them for classification.&lt;/p&gt;
&lt;p&gt;First calculate the average location for a group of embeddings that you have classified in a certain way, then compare embeddings of new content to those locations to assign it to a category.&lt;/p&gt;
&lt;p&gt;Amelia Wattenberger demonstrated a beautiful example of this in &lt;a href="https://wattenberger.com/thoughts/yay-embeddings-math"&gt;Getting creative with embeddings&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;She wanted to help people improve their writing by encouraging a mixture of concrete and abstract sentences. But how do you tell if a sentence of text is concrete or abstract?&lt;/p&gt;
&lt;p&gt;Her trick was to generate samples of the two types of sentence, calculate their average locations and then score new sentences based on how close they are to either end of that newly defined spectrum.&lt;/p&gt;
&lt;p&gt;&lt;img loading="lazy" src="https://static.simonwillison.net/static/2023/embeddings/amelia.jpg" alt="A document. Different sentences are displayed in different shades of green and purple, with a key on the right hand side showing that green means concreete and purple means abstract, with a gradient between them." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;This score can even be converted into a color loosely representing how abstract or concrete a given sentence is!&lt;/p&gt;
&lt;p&gt;This is a really neat demonstration of the kind of creative interfaces you can start to build on top of this technology.&lt;/p&gt;
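&lt;p&gt;The average-location idea can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the three-dimensional vectors are made up, the function names are hypothetical, and I've used Euclidean distance for simplicity, which may not be what Amelia's implementation uses.&lt;/p&gt;

```python
import math

def mean_vector(vectors):
    """Average location (centroid) of a list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]

def spectrum_score(vector, concrete_center, abstract_center):
    """Positive means closer to the concrete end, negative closer to abstract."""
    return math.dist(vector, abstract_center) - math.dist(vector, concrete_center)

# Made-up 3D vectors standing in for real sentence embeddings
concrete_examples = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]]
abstract_examples = [[0.1, 0.9, 0.0], [0.2, 0.8, 0.1]]
concrete_center = mean_vector(concrete_examples)
abstract_center = mean_vector(abstract_examples)

score = spectrum_score([0.7, 0.3, 0.0], concrete_center, abstract_center)
print(round(score, 3))
```

&lt;p&gt;The sign and size of the score is what you would map onto a color gradient.&lt;/p&gt;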
&lt;h4 id="answering-questions-with-retrieval-augmented-generation"&gt;Answering questions with Retrieval-Augmented Generation&lt;/h4&gt;
&lt;p&gt;I'll finish with the idea that first got me excited about embeddings.&lt;/p&gt;
&lt;p&gt;Everyone who tries out ChatGPT ends up asking the same question: how could I use a version of this to answer questions based on my own private notes, or the internal documents owned by my company?&lt;/p&gt;
&lt;p&gt;People assume that the answer is to train a custom model on top of that content, likely at great expense.&lt;/p&gt;
&lt;p&gt;It turns out that's not actually necessary. You can use an off-the-shelf Large Language Model (a hosted one or one that runs locally) and a trick called Retrieval Augmented Generation, or RAG.&lt;/p&gt;
&lt;p&gt;The key idea is this: a user asks a question. You search your private documents for content that appears relevant to the question, then paste excerpts of that content into the LLM (respecting its size limit, usually between 3,000 and 6,000 words) along with the original question.&lt;/p&gt;
&lt;p&gt;The LLM can then answer the question based on the additional content you provided.&lt;/p&gt;
&lt;p&gt;This cheap trick is astonishingly effective. It's trivial to get a basic version of this working - the challenge is in getting it to work as well as possible given the infinite set of questions a user might ask.&lt;/p&gt;
&lt;p&gt;The key problem in RAG is figuring out the best possible excerpts of content to include in the prompt to the LLM.&lt;/p&gt;
&lt;p&gt;"Vibes-based" semantic search powered by embeddings is exactly the kind of thing you need to gather potentially relevant content to help answer a user's question.&lt;/p&gt;
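&lt;p&gt;To make the "paste excerpts into the prompt" step concrete, here is a minimal sketch. The &lt;code&gt;build_prompt&lt;/code&gt; function, its &lt;code&gt;word_budget&lt;/code&gt; parameter and the prompt layout are all assumptions for illustration. It expects excerpts that have already been ranked by a search step, most relevant first.&lt;/p&gt;

```python
def build_prompt(question, excerpts, word_budget=3000):
    """Pack the highest-ranked excerpts into a prompt without exceeding a word budget."""
    chosen = []
    used = 0
    for text in excerpts:  # assumed to be sorted most relevant first
        words = len(text.split())
        if used + words > word_budget:
            break
        chosen.append(text)
        used += words
    context = "\n\n".join(chosen)
    return f"{context}\n\nQuestion: {question}\nAnswer:"

excerpts = [
    "shot-scraper is a tool for taking automated screenshots of web pages.",
    "It is built on top of Playwright.",
    "An unrelated paragraph that will not fit in a tiny budget.",
]
prompt = build_prompt("What is shot-scraper?", excerpts, word_budget=20)
print(prompt)
```

&lt;p&gt;Counting words is a crude stand-in for counting tokens, but it shows where the size limit gets enforced.&lt;/p&gt;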
&lt;p&gt;I built a version of this against content from my blog, described in detail in &lt;a href="https://til.simonwillison.net/llms/embed-paragraphs"&gt;Embedding paragraphs from my blog with E5-large-v2&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I used a model called &lt;a href="https://huggingface.co/intfloat/e5-large-v2"&gt;E5-large-v2&lt;/a&gt; for this. It's a model trained with this exact use-case in mind.&lt;/p&gt;
&lt;p&gt;A challenge with finding content relevant to questions is that the user's question - "What is shot-scraper?" - isn't guaranteed to be considered semantically similar to content that answers that question. Questions and assertions have a different grammar from each other.&lt;/p&gt;
&lt;p&gt;E5-large-v2 addresses this by supporting two types of content. You can embed passages (factual sentences) and queries (questions) in the same space, similar to how CLIP supports both images and text.&lt;/p&gt;
&lt;p&gt;I embedded 19,000 paragraphs of text from my blog as passages, and now I can embed a question as a query and use that to find the paragraphs most likely to answer that question.&lt;/p&gt;
&lt;p&gt;The result is RAG implemented as a one-liner Bash script:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm similar blog-paragraphs -c &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;query: &lt;span class="pl-smi"&gt;$1&lt;/span&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; \
  &lt;span class="pl-k"&gt;|&lt;/span&gt; jq &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;.content | sub("passage: "; "")&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt; -r \
  &lt;span class="pl-k"&gt;|&lt;/span&gt; llm -m mlc-chat-Llama-2-7b-chat-hf-q4f16_1 \
  &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;span class="pl-smi"&gt;$1&lt;/span&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; -s &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;You answer questions as a single paragraph&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This example uses Llama 2 Chat 7B (with the &lt;a href="https://github.com/simonw/llm-mlc"&gt;llm-mlc&lt;/a&gt; plugin), so I'm able to answer questions entirely offline using code running on my laptop.&lt;/p&gt;
&lt;p&gt;Running this:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;./blog-answer.sh &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;What is shot-scraper?&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Outputs this:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Shot-scraper is a Python utility that wraps Playwright, providing both a command line interface and a YAML-driven configuration flow for automating the process of taking screenshots of web pages and scraping data from them using JavaScript. It can be used to take one-off screenshots or take multiple screenshots in a repeatable way by defining them in a YAML file. Additionally, it can be used to execute JavaScript on a page and return the resulting value.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;That's a really good description of my &lt;a href="https://shot-scraper.datasette.io/"&gt;shot-scraper&lt;/a&gt; tool. I checked and none of that output is an exact match to content I had previously published on my blog.&lt;/p&gt;
&lt;h4 id="qa"&gt;Q&amp;amp;A&lt;/h4&gt;
&lt;p&gt;My talk ended with a Q&amp;amp;A session. Here are the summarized questions and answers.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;How does LangChain fit into this?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href="https://www.langchain.com/"&gt;LangChain&lt;/a&gt; is a popular framework for implementing features on top of LLMs. It covers a &lt;em&gt;lot&lt;/em&gt; of ground - my only problem with LangChain is that you have to invest a lot of work in understanding how it works and what it can do for you. Retrieval Augmented Generation is one of the key features of LangChain, so a lot of the things I've shown you today could be built on top of LangChain if you invest the effort.&lt;/p&gt;
&lt;p&gt;My philosophy around this stuff differs from LangChain in that I'm focusing on building a suite of small tools that can work together, as opposed to a single framework that solves everything in one go.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Have you tried distance functions other than cosine similarity?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;I have not. Cosine similarity is the default function that everyone else seems to be using and I've not spent any time yet exploring other options.&lt;/p&gt;
&lt;p&gt;I actually got ChatGPT to write all of my different versions of cosine similarity, across both Python and JavaScript!&lt;/p&gt;
&lt;p&gt;A fascinating thing about RAG is that it has so many different knobs that you can tweak. You can try different distance functions, different embedding models, different prompting strategies and different LLMs. There's a lot of scope for experimentation here.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;What do you need to adjust if you have 1 billion objects?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The demos I've shown today have all been on the small side - up to around 20,000 embeddings. This is small enough that you can run brute force cosine similarity functions against everything and get back results in a reasonable amount of time.&lt;/p&gt;
&lt;p&gt;If you're dealing with more data there are a growing number of options that can help.&lt;/p&gt;
&lt;p&gt;Lots of startups are launching new "vector databases" - which are effectively databases that are custom built to answer nearest-neighbour queries against vectors as quickly as possible.&lt;/p&gt;
&lt;p&gt;I'm not convinced you need an entirely new database for this: I'm more excited about adding custom indexes to existing databases. For example, SQLite has &lt;a href="https://github.com/asg017/sqlite-vss"&gt;sqlite-vss&lt;/a&gt; and PostgreSQL has &lt;a href="https://github.com/pgvector/pgvector"&gt;pgvector&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I've also done some successful experiments with Facebook's &lt;a href="https://github.com/facebookresearch/faiss"&gt;FAISS&lt;/a&gt; library, including building a Datasette plugin that uses it called &lt;a href="https://datasette.io/plugins/datasette-faiss"&gt;datasette-faiss&lt;/a&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;What improvements to embedding models are you excited to see?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;I'm really excited about multi-modal models. CLIP is a great example, but I've also been experimenting with Facebook's &lt;a href="https://github.com/facebookresearch/ImageBind"&gt;ImageBind&lt;/a&gt;, which "learns a joint embedding across six different modalities - images, text, audio, depth, thermal, and IMU data." It looks like we can go a lot further than just images and text!&lt;/p&gt;
&lt;p&gt;I also like the trend of these models getting smaller. Earlier I demonstrated a new model, gte-tiny, which is just 60MB. Being able to run these things on constrained devices, or in the browser, is really exciting to me.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
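&lt;p&gt;For reference, the brute force cosine similarity approach mentioned in the answers above looks roughly like this in pure Python. It's a sketch with made-up three-dimensional vectors and hypothetical function names, and it assumes no vector is all zeros.&lt;/p&gt;

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def most_similar(query, embeddings, n=3):
    """Score every stored vector against the query and return the top n."""
    scored = [(cosine_similarity(query, vector), key) for key, vector in embeddings.items()]
    scored.sort(reverse=True)
    return scored[:n]

# Tiny made-up vectors; real embeddings have hundreds of dimensions
embeddings = {
    "dog": [0.9, 0.1, 0.0],
    "hound": [0.8, 0.2, 0.1],
    "sqlite": [0.0, 0.1, 0.9],
}
results = most_similar([1.0, 0.0, 0.0], embeddings, n=2)
print([key for _, key in results])
```

&lt;p&gt;Every query touches every stored vector, which is fine for tens of thousands of items and is exactly the cost that vector indexes are designed to avoid at larger scales.&lt;/p&gt;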
&lt;h4 id="embeddings-further-reading"&gt;Further reading&lt;/h4&gt;
&lt;p&gt;If you want to dive more into the low-level details of how embeddings work, I suggest the following:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://vickiboykis.com/what_are_embeddings/"&gt;What are embeddings?&lt;/a&gt; by Vicki Boykis&lt;/li&gt;
&lt;li&gt;&lt;a href="https://txt.cohere.com/text-embeddings/"&gt;Text Embeddings Visually Explained&lt;/a&gt; by Meor Amer for Cohere&lt;/li&gt;
&lt;li&gt;&lt;a href="https://projector.tensorflow.org/"&gt;The Tensorflow Embedding Projector&lt;/a&gt; - an interactive tool for exploring embedding spaces&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.pinecone.io/learn/"&gt;Learn to Love Working with Vector Embeddings&lt;/a&gt; is a collection of tutorials from vector database vendor Pinecone&lt;/li&gt;
&lt;/ul&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/my-talks"&gt;my-talks&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/vector-search"&gt;vector-search&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/embeddings"&gt;embeddings&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/annotated-talks"&gt;annotated-talks&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/rag"&gt;rag&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/clip"&gt;clip&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="my-talks"/><category term="ai"/><category term="generative-ai"/><category term="vector-search"/><category term="embeddings"/><category term="llm"/><category term="annotated-talks"/><category term="rag"/><category term="clip"/></entry><entry><title>LLM now provides tools for working with embeddings</title><link href="https://simonwillison.net/2023/Sep/4/llm-embeddings/#atom-tag" rel="alternate"/><published>2023-09-04T20:32:39+00:00</published><updated>2023-09-04T20:32:39+00:00</updated><id>https://simonwillison.net/2023/Sep/4/llm-embeddings/#atom-tag</id><summary type="html">
    &lt;p&gt;&lt;a href="https://llm.datasette.io/"&gt;LLM&lt;/a&gt; is my Python library and command-line tool for working with language models. I just released &lt;a href="https://llm.datasette.io/en/stable/changelog.html#v0-9"&gt;LLM 0.9&lt;/a&gt; with a new set of features that extend LLM to provide tools for working with &lt;em&gt;embeddings&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;This is a long post with a lot of theory and background. If you already know what embeddings are, here's a TLDR you can try out straight away:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; Install LLM&lt;/span&gt;
pip install llm

&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; If you already installed via Homebrew/pipx you can upgrade like this:&lt;/span&gt;
llm install -U llm

&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; Install the llm-sentence-transformers plugin&lt;/span&gt;
llm install llm-sentence-transformers

&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; Install the all-MiniLM-L6-v2 embedding model&lt;/span&gt;
llm sentence-transformers register all-MiniLM-L6-v2

&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; Generate and store embeddings for every README.md in your home directory, recursively&lt;/span&gt;
llm embed-multi readmes \
  --model sentence-transformers/all-MiniLM-L6-v2 \
  --files &lt;span class="pl-k"&gt;~&lt;/span&gt;/ &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;**/README.md&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;
  &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; Add --store to store the text content as well&lt;/span&gt;

&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; Run a similarity search for "sqlite" against those embeddings&lt;/span&gt;
llm similar readmes -c sqlite&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;For everyone else, read on and the above example should hopefully all make sense.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2023/Sep/4/llm-embeddings/#embeddings"&gt;Embeddings&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2023/Sep/4/llm-embeddings/#llms-new-embedding-features"&gt;LLM's new embedding features&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2023/Sep/4/llm-embeddings/#embedding-collections"&gt;Embedding collections&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2023/Sep/4/llm-embeddings/#embedding-similarity-search"&gt;Embedding similarity search&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2023/Sep/4/llm-embeddings/#embedding-in-bulk"&gt;Embedding in bulk&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2023/Sep/4/llm-embeddings/#embedding-files-in-a-directory"&gt;Embedding files in a directory&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2023/Sep/4/llm-embeddings/#embeddings-in-python"&gt;Embeddings in Python&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2023/Sep/4/llm-embeddings/#llm-cluster"&gt;Clustering with llm-cluster&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2023/Sep/4/llm-embeddings/#future-plans"&gt;Future plans&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2023/Sep/4/llm-embeddings/#indexing"&gt;Indexing&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2023/Sep/4/llm-embeddings/#chunking"&gt;Chunking&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2023/Sep/4/llm-embeddings/#get-involved"&gt;Get involved&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h4 id="embeddings"&gt;Embeddings&lt;/h4&gt;
&lt;p&gt;Embeddings are a fascinating concept within the larger world of language models.&lt;/p&gt;
&lt;p&gt;I explain embeddings in detail (with both a video and heavily annotated slides) in &lt;a href="https://simonwillison.net/2023/Oct/23/embeddings/"&gt;Embeddings: What they are and why they matter&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;An embedding model lets you take a string of text - a word, sentence, paragraph or even a whole document - and turn that into an array of floating point numbers called an &lt;em&gt;embedding vector&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2023/wordcamp-llms/llm-work-for-you.055.jpeg" alt="On the left is a text post from one of my sites: Storing and serving related documents with openai-to-sqlite and embeddings. An arrow points to a huge JSON array on the right, with the label 1536 floating point numbers." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;A model will always produce the same length of array - 1,536 numbers for the &lt;a href="https://platform.openai.com/docs/guides/embeddings"&gt;OpenAI embedding model&lt;/a&gt;, 384 for &lt;a href="https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2"&gt;all-MiniLM-L6-v2&lt;/a&gt; - but the array itself is inscrutable. What are you meant to do with it?&lt;/p&gt;
&lt;p&gt;The answer is that you can compare them. I like to think of an embedding vector as a location in 1,536-dimensional space. The distance between two vectors is a measure of how semantically similar they are, at least according to the model that produced them.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2023/wordcamp-llms/llm-work-for-you.056.jpeg" alt="A location in 1,536 dimension space  There's a 3D plot with 400 red dots arranged randomly across 3 axis." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;"One happy dog" and "A playful hound" will end up close together, even though they don't share any keywords. The embedding vector represents the language model's interpretation of the meaning of the text.&lt;/p&gt;
&lt;p&gt;Things you can do with embeddings include:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Find &lt;strong&gt;related items&lt;/strong&gt;. I use this on &lt;a href="https://til.simonwillison.net/"&gt;my TIL site&lt;/a&gt; to display related articles, as described in &lt;a href="https://til.simonwillison.net/llms/openai-embeddings-related-content"&gt;Storing and serving related documents with openai-to-sqlite and embeddings&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Build &lt;strong&gt;semantic search&lt;/strong&gt;. As shown above, an embeddings-based search engine can find content relevant to the user's search term even if none of the keywords match.&lt;/li&gt;
&lt;li&gt;Implement &lt;strong&gt;retrieval augmented generation&lt;/strong&gt; - the trick where you take a user's question, find relevant documentation in your own corpus and use that to get an LLM to spit out an answer. More on that &lt;a href="https://simonwillison.net/2023/Aug/27/wordcamp-llms/#retrieval-augmented-generation"&gt;here&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Clustering&lt;/strong&gt;: you can find clusters of nearby items and identify patterns in a corpus of documents.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Classification&lt;/strong&gt;: calculate the embedding of a piece of text and compare it to pre-calculated "average" embeddings for different categories.&lt;/li&gt;
&lt;/ol&gt;
&lt;h4 id="llms-new-embedding-features"&gt;LLM's new embedding features&lt;/h4&gt;
&lt;p&gt;My goal with LLM is to provide a plugin-driven abstraction around a growing collection of language models. I want to make installing, using and comparing these models as easy as possible.&lt;/p&gt;
&lt;p&gt;The new release adds several command-line tools for working with embeddings, plus a new Python API for working with embeddings in your own code.&lt;/p&gt;
&lt;p&gt;It also adds support for installing additional embedding models via plugins. I've released one plugin for this so far: &lt;a href="https://github.com/simonw/llm-sentence-transformers"&gt;llm-sentence-transformers&lt;/a&gt;, which adds support for new models based on the &lt;a href="https://www.sbert.net/"&gt;sentence-transformers&lt;/a&gt; library.&lt;/p&gt;
&lt;p&gt;The example above shows how to use &lt;code&gt;sentence-transformers&lt;/code&gt;. LLM also supports API-driven access to the OpenAI &lt;code&gt;ada-002&lt;/code&gt; model.&lt;/p&gt;
&lt;p&gt;Here's how to embed some text using &lt;code&gt;ada-002&lt;/code&gt;, assuming you have &lt;a href="https://llm.datasette.io/en/stable/setup.html"&gt;installed LLM already&lt;/a&gt;:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; Set your OpenAI API key&lt;/span&gt;
llm keys &lt;span class="pl-c1"&gt;set&lt;/span&gt; openai
&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; &amp;lt;paste key here&amp;gt;&lt;/span&gt;

&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; Embed some text&lt;/span&gt;
llm embed -m ada-002 -c &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Hello world&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This will output a huge JSON list of floating point numbers to your terminal. You can add &lt;code&gt;-f base64&lt;/code&gt; (or &lt;code&gt;-f hex&lt;/code&gt;) to get that back in a different format, though none of these outputs are instantly useful.&lt;/p&gt;
&lt;p&gt;Embeddings are much more interesting when you store them.&lt;/p&gt;
&lt;p&gt;LLM already uses SQLite to &lt;a href="https://llm.datasette.io/en/stable/logging.html"&gt;store prompts and responses&lt;/a&gt;. It was a natural fit to use SQLite to store embeddings as well.&lt;/p&gt;
&lt;h4 id="embedding-collections"&gt;Embedding collections&lt;/h4&gt;
&lt;p&gt;LLM 0.9 introduces the concept of a &lt;strong&gt;collection&lt;/strong&gt; of embeddings. A collection has a name - like &lt;code&gt;readmes&lt;/code&gt; - and contains a set of embeddings, each of which has an ID and an embedding vector.&lt;/p&gt;
&lt;p&gt;All of the embeddings in a collection are generated by the same model, to ensure they can be compared with each other.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;llm embed&lt;/code&gt; command can store the vector in the database instead of returning it to the console. Pass it the name of an existing (or to-be-created) collection and the ID to use to store the embedding.&lt;/p&gt;
&lt;p&gt;Here we'll store the embedding for the phrase "Hello world" in a collection called &lt;code&gt;phrases&lt;/code&gt; with the ID &lt;code&gt;hello&lt;/code&gt;, using that &lt;code&gt;ada-002&lt;/code&gt; embedding model:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm embed phrases hello -m ada-002 -c &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Hello world&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Future phrases can be added without needing to specify the model again, since it is remembered by the collection:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm embed phrases goodbye -c &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Goodbye world&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The &lt;code&gt;llm embed-db collections&lt;/code&gt; command shows a list of collections:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;phrases: ada-002
  2 embeddings
readmes: sentence-transformers/all-MiniLM-L6-v2
  16796 embeddings&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The data is stored in a SQLite &lt;code&gt;embeddings&lt;/code&gt; table with the following schema:&lt;/p&gt;
&lt;div class="highlight highlight-source-sql"&gt;&lt;pre&gt;CREATE TABLE [collections] (
   [id] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt; &lt;span class="pl-k"&gt;PRIMARY KEY&lt;/span&gt;,
   [name] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt;,
   [model] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt;
);
&lt;span class="pl-k"&gt;CREATE&lt;/span&gt; &lt;span class="pl-k"&gt;TABLE&lt;/span&gt; "&lt;span class="pl-en"&gt;embeddings&lt;/span&gt;" (
   [collection_id] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt; &lt;span class="pl-k"&gt;REFERENCES&lt;/span&gt; [collections]([id]),
   [id] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt;,
   [embedding] BLOB,
   [content] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt;,
   [content_hash] BLOB,
   [metadata] &lt;span class="pl-k"&gt;TEXT&lt;/span&gt;,
   [updated] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt;,
   &lt;span class="pl-k"&gt;PRIMARY KEY&lt;/span&gt; ([collection_id], [id])
);

CREATE UNIQUE INDEX [idx_collections_name]
    &lt;span class="pl-k"&gt;ON&lt;/span&gt; [collections] ([name]);
CREATE INDEX [idx_embeddings_content_hash]
    &lt;span class="pl-k"&gt;ON&lt;/span&gt; [embeddings] ([content_hash]);&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;By default this is the SQLite database at the location revealed by &lt;code&gt;llm embed-db path&lt;/code&gt;, but you can pass &lt;code&gt;--database my-embeddings.db&lt;/code&gt; to various LLM commands to use a different database.&lt;/p&gt;
&lt;p&gt;Each embedding vector is stored as a binary BLOB in the &lt;code&gt;embedding&lt;/code&gt; column, consisting of those floating point numbers packed together as 32 bit floats.&lt;/p&gt;
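&lt;p&gt;Here's a small sketch of what packing floats into a BLOB like that can look like, using Python's &lt;code&gt;struct&lt;/code&gt; module. The &lt;code&gt;encode&lt;/code&gt; and &lt;code&gt;decode&lt;/code&gt; names are mine, and the byte order is an assumption: this sketch uses the machine's native order, which may differ from what LLM itself writes.&lt;/p&gt;

```python
import struct

def encode(values):
    """Pack a list of floats into bytes, 4 bytes per value."""
    # "=" selects native byte order with standard sizes (an assumption here)
    return struct.pack("=" + "f" * len(values), *values)

def decode(blob):
    """Unpack bytes produced by encode() back into a tuple of floats."""
    return struct.unpack("=" + "f" * (len(blob) // 4), blob)

vector = [0.5, -1.25, 3.0]
blob = encode(vector)
print(len(blob))
print(decode(blob))
```

&lt;p&gt;At four bytes per number, a 1,536-dimension vector takes 6,144 bytes, considerably less than the same vector written out as JSON text.&lt;/p&gt;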
&lt;p&gt;The &lt;code&gt;content_hash&lt;/code&gt; column contains an MD5 hash of the content. This helps avoid re-calculating the embedding (which can cost actual money for API-based embedding models like &lt;code&gt;ada-002&lt;/code&gt;) unless the content has changed.&lt;/p&gt;
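&lt;p&gt;The hash check can be pictured like this. It's a simplified sketch, not LLM's actual code: the function names are hypothetical, and &lt;code&gt;stored_hashes&lt;/code&gt; is a plain dictionary standing in for a lookup against the &lt;code&gt;content_hash&lt;/code&gt; column.&lt;/p&gt;

```python
import hashlib

def content_hash(text):
    """MD5 digest of the text, as raw bytes suitable for a BLOB column."""
    return hashlib.md5(text.encode("utf-8")).digest()

def needs_embedding(text, stored_hashes, item_id):
    """True when the item is new or its content changed since it was embedded."""
    return stored_hashes.get(item_id) != content_hash(text)

# stored_hashes stands in for a lookup against the content_hash column
stored_hashes = {"hello": content_hash("Hello world")}
print(needs_embedding("Hello world", stored_hashes, "hello"))
print(needs_embedding("Hello world!", stored_hashes, "hello"))
print(needs_embedding("Goodbye world", stored_hashes, "goodbye"))
```

&lt;p&gt;Only the second and third calls report that an embedding is needed: one has changed content, the other has an ID that hasn't been seen before.&lt;/p&gt;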
&lt;p&gt;The &lt;code&gt;content&lt;/code&gt; column is usually &lt;code&gt;null&lt;/code&gt;, but can contain a copy of the original text content if you pass the &lt;code&gt;--store&lt;/code&gt; option to the &lt;code&gt;llm embed&lt;/code&gt; command.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;metadata&lt;/code&gt; can contain a JSON object with metadata, if you pass &lt;code&gt;--metadata '{"json": "goes here"}'&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;You don't have to pass content using &lt;code&gt;-c&lt;/code&gt; - you can instead pass a file path using the &lt;code&gt;-i/--input&lt;/code&gt; option:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm embed docs llm-setup -m ada-002 -i llm/docs/setup.md&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Or pipe things to standard input like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;cat llm/docs/setup.md &lt;span class="pl-k"&gt;|&lt;/span&gt; llm embed docs llm-setup -m ada-002 -i -&lt;/pre&gt;&lt;/div&gt;
&lt;h4 id="embedding-similarity-search"&gt;Embedding similarity search&lt;/h4&gt;
&lt;p&gt;Once you've built a collection, you can search for similar embeddings using the &lt;code&gt;llm similar&lt;/code&gt; command.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;-c "term"&lt;/code&gt; option will embed the text you pass in using the embedding model for the collection and use that as the comparison vector:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm similar readmes -c sqlite&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;You can also pass the ID of an object in that collection to use that embedding instead. This gets you related documents, for example:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm similar readmes sqlite-utils/README.md&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The output from this command is currently newline-delimited JSON.&lt;/p&gt;
&lt;h4 id="embedding-in-bulk"&gt;Embedding in bulk&lt;/h4&gt;
&lt;p&gt;The &lt;code&gt;llm embed&lt;/code&gt; command embeds a single string at a time. &lt;code&gt;llm embed-multi&lt;/code&gt; is much more powerful: you can feed a CSV or JSON file, a SQLite database or even have it read from a directory of files in order to embed multiple items at once.&lt;/p&gt;
&lt;p&gt;Many embedding models are optimized for batch operations, so embedding multiple items at a time can provide a significant speed boost.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;embed-multi&lt;/code&gt; command is described &lt;a href="https://llm.datasette.io/en/stable/embeddings/cli.html#llm-embed-multi"&gt;in detail in the documentation&lt;/a&gt;. Here are a couple of fun things you can do with it.&lt;/p&gt;
&lt;p&gt;First, I'm going to create embeddings for every single one of my Apple Notes.&lt;/p&gt;
&lt;p&gt;My &lt;a href="https://datasette.io/tools/apple-notes-to-sqlite"&gt;apple-notes-to-sqlite&lt;/a&gt; tool can export Apple Notes to a SQLite database. I'll run that first:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;apple-notes-to-sqlite notes.db&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This took quite a while to run on my machine and generated an 828MB SQLite database containing 6,462 records!&lt;/p&gt;
&lt;p&gt;Next, I'm going to embed the content of all of those notes using the &lt;code&gt;sentence-transformers/all-MiniLM-L6-v2&lt;/code&gt; model:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm embed-multi notes \
  -d notes.db \
  --sql &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;select id, title, body from notes&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt; \
  -m sentence-transformers/all-MiniLM-L6-v2&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This took around 15 minutes to run, and increased the size of my database by 13MB.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;--sql&lt;/code&gt; option here specifies a SQL query. The first column must be an &lt;code&gt;id&lt;/code&gt;, then any subsequent columns will be concatenated together to form the content to embed.&lt;/p&gt;
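&lt;p&gt;Here's a small sketch of that row-to-content mapping using Python's built-in &lt;code&gt;sqlite3&lt;/code&gt; module. The &lt;code&gt;rows_to_items()&lt;/code&gt; helper and the sample rows are hypothetical, and joining the columns with a single space (skipping nulls) is an assumption made for this example:&lt;/p&gt;

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("create table notes (id integer primary key, title text, body text)")
db.executemany(
    "insert into notes (id, title, body) values (?, ?, ?)",
    [(1, "Blog ideas", "Write about embeddings"), (2, "Shopping", None)],
)


def rows_to_items(rows):
    # The first column is the ID; the remaining columns become the content.
    # Joining with a single space is an assumption made for this sketch.
    for row in rows:
        item_id, rest = row[0], row[1:]
        yield item_id, " ".join(str(value) for value in rest if value is not None)


items = list(rows_to_items(db.execute("select id, title, body from notes order by id")))
```

&lt;p&gt;Each resulting pair is an ID plus the text that would be handed to the embedding model.&lt;/p&gt;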
&lt;p&gt;In this case the embeddings are written back to the same &lt;code&gt;notes.db&lt;/code&gt; database that the content came from.&lt;/p&gt;
&lt;p&gt;And now I can run embedding similarity operations against all of my Apple notes!&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm similar notes -d notes.db -c &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;ideas for blog posts&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;h4 id="embedding-files-in-a-directory"&gt;Embedding files in a directory&lt;/h4&gt;
&lt;p&gt;Let's revisit the example from the top of this post. In this case, I'm using the &lt;code&gt;--files&lt;/code&gt; option to search for files on disk and embed each of them:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm embed-multi readmes \
  --model sentence-transformers/all-MiniLM-L6-v2 \
  --files &lt;span class="pl-k"&gt;~&lt;/span&gt;/ &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;**/README.md&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The &lt;code&gt;--files&lt;/code&gt; option takes two arguments: a path to a directory and a pattern to match against filenames. In this case I'm searching my home directory recursively for any files named &lt;code&gt;README.md&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Running this command gives me embeddings for all of my README.md files, which I can then search against like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm similar readmes -c sqlite&lt;/pre&gt;&lt;/div&gt;
&lt;h4 id="embeddings-in-python"&gt;Embeddings in Python&lt;/h4&gt;
&lt;p&gt;So far I've only covered the command-line tools. LLM 0.9 also introduces a new Python API for working with embeddings.&lt;/p&gt;
&lt;p&gt;There are two aspects to this. If you just want to embed content and handle the resulting vectors yourself, you can use &lt;code&gt;llm.get_embedding_model()&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-s1"&gt;llm&lt;/span&gt;

&lt;span class="pl-c"&gt;# This takes model IDs and aliases defined by plugins:&lt;/span&gt;
&lt;span class="pl-s1"&gt;model&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;llm&lt;/span&gt;.&lt;span class="pl-en"&gt;get_embedding_model&lt;/span&gt;(&lt;span class="pl-s"&gt;"sentence-transformers/all-MiniLM-L6-v2"&lt;/span&gt;)
&lt;span class="pl-s1"&gt;vector&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;model&lt;/span&gt;.&lt;span class="pl-en"&gt;embed&lt;/span&gt;(&lt;span class="pl-s"&gt;"This is text to embed"&lt;/span&gt;)&lt;/pre&gt;
&lt;p&gt;&lt;code&gt;vector&lt;/code&gt; will then be a Python list of floating point numbers.&lt;/p&gt;
&lt;p&gt;You can serialize that to the same binary format that LLM uses like this:&lt;/p&gt;
&lt;pre&gt;&lt;span class="pl-s1"&gt;binary_vector&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;llm&lt;/span&gt;.&lt;span class="pl-en"&gt;encode&lt;/span&gt;(&lt;span class="pl-s1"&gt;vector&lt;/span&gt;)
&lt;span class="pl-c"&gt;# And to deserialize:&lt;/span&gt;
&lt;span class="pl-s1"&gt;vector&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;llm&lt;/span&gt;.&lt;span class="pl-en"&gt;decode&lt;/span&gt;(&lt;span class="pl-s1"&gt;binary_vector&lt;/span&gt;)&lt;/pre&gt;
&lt;p&gt;The second aspect of the Python API is the &lt;code&gt;llm.Collection&lt;/code&gt; class, for working with collections of embeddings. This example code is quoted &lt;a href="https://llm.datasette.io/en/stable/embeddings/python-api.html#working-with-collections"&gt;from the documentation&lt;/a&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-s1"&gt;sqlite_utils&lt;/span&gt;
&lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-s1"&gt;llm&lt;/span&gt;

&lt;span class="pl-c"&gt;# This collection will use an in-memory database that will be&lt;/span&gt;
&lt;span class="pl-c"&gt;# discarded when the Python process exits&lt;/span&gt;
&lt;span class="pl-s1"&gt;collection&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;llm&lt;/span&gt;.&lt;span class="pl-v"&gt;Collection&lt;/span&gt;(&lt;span class="pl-s"&gt;"entries"&lt;/span&gt;, &lt;span class="pl-s1"&gt;model_id&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-s"&gt;"ada-002"&lt;/span&gt;)

&lt;span class="pl-c"&gt;# Or you can persist the database to disk like this:&lt;/span&gt;
&lt;span class="pl-s1"&gt;db&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;sqlite_utils&lt;/span&gt;.&lt;span class="pl-v"&gt;Database&lt;/span&gt;(&lt;span class="pl-s"&gt;"my-embeddings.db"&lt;/span&gt;)
&lt;span class="pl-s1"&gt;collection&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;llm&lt;/span&gt;.&lt;span class="pl-v"&gt;Collection&lt;/span&gt;(&lt;span class="pl-s"&gt;"entries"&lt;/span&gt;, &lt;span class="pl-s1"&gt;db&lt;/span&gt;, &lt;span class="pl-s1"&gt;model_id&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-s"&gt;"ada-002"&lt;/span&gt;)

&lt;span class="pl-c"&gt;# You can pass a model directly using model= instead of model_id=&lt;/span&gt;
&lt;span class="pl-s1"&gt;embedding_model&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;llm&lt;/span&gt;.&lt;span class="pl-en"&gt;get_embedding_model&lt;/span&gt;(&lt;span class="pl-s"&gt;"ada-002"&lt;/span&gt;)
&lt;span class="pl-s1"&gt;collection&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;llm&lt;/span&gt;.&lt;span class="pl-v"&gt;Collection&lt;/span&gt;(&lt;span class="pl-s"&gt;"entries"&lt;/span&gt;, &lt;span class="pl-s1"&gt;db&lt;/span&gt;, &lt;span class="pl-s1"&gt;model&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-s1"&gt;embedding_model&lt;/span&gt;)

&lt;span class="pl-c"&gt;# Store a string in the collection with an ID:&lt;/span&gt;
&lt;span class="pl-s1"&gt;collection&lt;/span&gt;.&lt;span class="pl-en"&gt;embed&lt;/span&gt;(&lt;span class="pl-s"&gt;"hound"&lt;/span&gt;, &lt;span class="pl-s"&gt;"my happy hound"&lt;/span&gt;)

&lt;span class="pl-c"&gt;# Or to store content and extra metadata:&lt;/span&gt;
&lt;span class="pl-s1"&gt;collection&lt;/span&gt;.&lt;span class="pl-en"&gt;embed&lt;/span&gt;(
    &lt;span class="pl-s"&gt;"hound"&lt;/span&gt;,
    &lt;span class="pl-s"&gt;"my happy hound"&lt;/span&gt;,
    &lt;span class="pl-s1"&gt;metadata&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;{&lt;span class="pl-s"&gt;"name"&lt;/span&gt;: &lt;span class="pl-s"&gt;"Hound"&lt;/span&gt;},
    &lt;span class="pl-s1"&gt;store&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-c1"&gt;True&lt;/span&gt;
)

&lt;span class="pl-c"&gt;# Or embed things in bulk:&lt;/span&gt;
&lt;span class="pl-s1"&gt;collection&lt;/span&gt;.&lt;span class="pl-en"&gt;embed_multi&lt;/span&gt;(
    [
        (&lt;span class="pl-s"&gt;"hound"&lt;/span&gt;, &lt;span class="pl-s"&gt;"my happy hound"&lt;/span&gt;),
        (&lt;span class="pl-s"&gt;"cat"&lt;/span&gt;, &lt;span class="pl-s"&gt;"my dissatisfied cat"&lt;/span&gt;),
    ],
    &lt;span class="pl-c"&gt;# Add this to store the strings in the content column:&lt;/span&gt;
    &lt;span class="pl-s1"&gt;store&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-c1"&gt;True&lt;/span&gt;,
)&lt;/pre&gt;
&lt;p&gt;As with everything else in LLM, the goal is that anything you can do with the CLI can be done with the Python API, and vice-versa.&lt;/p&gt;
&lt;h4 id="llm-cluster"&gt;Clustering with llm-cluster&lt;/h4&gt;
&lt;p&gt;Another interesting application of embeddings is that you can use them to cluster content - identifying patterns in a corpus of documents.&lt;/p&gt;
&lt;p&gt;I've started exploring this area with a new plugin, called &lt;strong&gt;&lt;a href="https://github.com/simonw/llm-cluster"&gt;llm-cluster&lt;/a&gt;.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;You can install it like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm install llm-cluster&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Let's create a new collection using data pulled from GitHub. I'm going to import all of the &lt;a href="https://github.com/simonw/llm/issues"&gt;LLM issues&lt;/a&gt; from the GitHub API, using my &lt;a href="https://github.com/simonw/paginate-json"&gt;paginate-json&lt;/a&gt; tool:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;paginate-json &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;https://api.github.com/repos/simonw/llm/issues?state=all&amp;amp;filter=all&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt; \
  &lt;span class="pl-k"&gt;|&lt;/span&gt; jq &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;[.[] | {id: .id, title: .title}]&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt; \
  &lt;span class="pl-k"&gt;|&lt;/span&gt; llm embed-multi llm-issues - \
    --database issues.db \
    --model sentence-transformers/all-MiniLM-L6-v2 \
    --store&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Running this gives me an &lt;code&gt;issues.db&lt;/code&gt; SQLite database with 218 embeddings contained in a collection called &lt;code&gt;llm-issues&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Now let's try out the &lt;code&gt;llm cluster&lt;/code&gt; command, requesting ten clusters from that collection:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm cluster llm-issues --database issues.db 10&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The output from this command, truncated, looks like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-json"&gt;&lt;pre&gt;[
  {
    &lt;span class="pl-ent"&gt;"id"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;0&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"items"&lt;/span&gt;: [
      {
        &lt;span class="pl-ent"&gt;"id"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;1784149135&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
        &lt;span class="pl-ent"&gt;"content"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Tests fail with pydantic 2&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;
      },
      {
        &lt;span class="pl-ent"&gt;"id"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;1837084995&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
        &lt;span class="pl-ent"&gt;"content"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Allow for use of Pydantic v1 as well as v2.&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;
      },
      {
        &lt;span class="pl-ent"&gt;"id"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;1857942721&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
        &lt;span class="pl-ent"&gt;"content"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Get tests passing against Pydantic 1&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;
      }
    ]
  },
  {
    &lt;span class="pl-ent"&gt;"id"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;1&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"items"&lt;/span&gt;: [
      {
        &lt;span class="pl-ent"&gt;"id"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;1724577618&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
        &lt;span class="pl-ent"&gt;"content"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Better ways of storing and accessing API keys&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;
      },
      {
        &lt;span class="pl-ent"&gt;"id"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;1772024726&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
        &lt;span class="pl-ent"&gt;"content"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Support for `-o key value` options such as `temperature`&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;
      },
      {
        &lt;span class="pl-ent"&gt;"id"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;1784111239&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
        &lt;span class="pl-ent"&gt;"content"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;`--key` should be used in place of the environment variable&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;
      }
    ]
  },
  {
    &lt;span class="pl-ent"&gt;"id"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;8&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"items"&lt;/span&gt;: [
      {
        &lt;span class="pl-ent"&gt;"id"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;1835739724&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
        &lt;span class="pl-ent"&gt;"content"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Bump the python-packages group with 1 update&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;
      },
      {
        &lt;span class="pl-ent"&gt;"id"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;1848143453&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
        &lt;span class="pl-ent"&gt;"content"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Python library support for adding aliases&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;
      },
      {
        &lt;span class="pl-ent"&gt;"id"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;1857268563&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
        &lt;span class="pl-ent"&gt;"content"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Bump the python-packages group with 1 update&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;
      }
    ]
  }
]&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;These look pretty good! But wouldn't it be neat if we had a snappy title for each one?&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;--summary&lt;/code&gt; option can provide exactly that, by piping the members of each cluster through a call to another LLM in order to generate a useful summary.&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm cluster llm-issues --database issues.db 10 --summary&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This uses &lt;code&gt;gpt-3.5-turbo&lt;/code&gt; to generate a summary for each cluster, with this default prompt:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Short, concise title for this cluster of related documents.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The results I got back are pretty good, including:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Template Storage and Management Improvements&lt;/li&gt;
&lt;li&gt;Package and Dependency Updates and Improvements&lt;/li&gt;
&lt;li&gt;Adding Conversation Mechanism and Tools&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I tried the same thing using a Llama 2 model &lt;a href="https://simonwillison.net/2023/Aug/1/llama-2-mac/"&gt;running on my own laptop&lt;/a&gt;, with a custom prompt:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm cluster llm-issues --database issues.db 10 \
  --summary --model mlc-chat-Llama-2-13b-chat-hf-q4f16_1 \
  --prompt 'Concise title for this cluster of related documents, just return the title'
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I didn't quite get what I wanted! Llama 2 is proving a lot harder to prompt, so each cluster came back with something that looked like this:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Sure! Here's a concise title for this cluster of related documents:&lt;/p&gt;
&lt;p&gt;"Design Improvements for the Neat Prompt System"&lt;/p&gt;
&lt;p&gt;This title captures the main theme of the documents, which is to improve the design of the Neat prompt system. It also highlights the focus on improving the system's functionality and usability&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;a href="https://github.com/simonw/llm-cluster"&gt;llm-cluster&lt;/a&gt; only took a few hours to throw together, which I'm seeing as a positive indicator that the LLM library is developing in the right direction.&lt;/p&gt;
&lt;h4 id="future-plans"&gt;Future plans&lt;/h4&gt;
&lt;p&gt;The two future features I'm most excited about are indexing and chunking.&lt;/p&gt;
&lt;h5 id="indexing"&gt;Indexing&lt;/h5&gt;
&lt;p&gt;The &lt;a href="https://llm.datasette.io/en/stable/embeddings/cli.html#llm-similar"&gt;llm similar&lt;/a&gt; command and &lt;a href="https://llm.datasette.io/en/stable/embeddings/python-api.html#retrieving-similar-items"&gt;collection.similar()&lt;/a&gt; Python method currently use effectively the slowest brute force approach possible: calculate the cosine similarity between the input vector and every other embedding in the collection, then sort the results.&lt;/p&gt;
&lt;p&gt;This works fine for collections with a few hundred items, but will start to suffer for collections of 100,000 or more.&lt;/p&gt;
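&lt;p&gt;The brute force approach described above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the code LLM ships: &lt;code&gt;brute_force_similar()&lt;/code&gt; and the in-memory &lt;code&gt;embeddings&lt;/code&gt; dictionary are hypothetical, and the vectors are assumed to be non-zero:&lt;/p&gt;

```python
import math


def cosine_similarity(a, b):
    # Assumes both vectors are non-zero and have the same length.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def brute_force_similar(query, embeddings, number=10):
    # embeddings: hypothetical dict mapping an ID to its vector.
    # Score every stored vector against the query, then sort the scores.
    scored = [
        (cosine_similarity(query, vector), item_id)
        for item_id, vector in embeddings.items()
    ]
    scored.sort(reverse=True)
    return [(item_id, score) for score, item_id in scored[:number]]


embeddings = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
results = brute_force_similar([1.0, 0.0], embeddings, number=2)
```

&lt;p&gt;Every query touches every stored vector, which is why the cost grows linearly with the size of the collection.&lt;/p&gt;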
&lt;p&gt;There are plenty of potential ways of speeding this up: you can run a vector index like &lt;a href="https://github.com/facebookresearch/faiss"&gt;FAISS&lt;/a&gt; or &lt;a href="https://github.com/nmslib/hnswlib"&gt;hnswlib&lt;/a&gt;, use a database extension like &lt;a href="https://github.com/asg017/sqlite-vss"&gt;sqlite-vss&lt;/a&gt; or &lt;a href="https://github.com/pgvector/pgvector"&gt;pgvector&lt;/a&gt;, or turn to a hosted vector database like &lt;a href="https://www.pinecone.io/"&gt;Pinecone&lt;/a&gt; or &lt;a href="https://milvus.io/"&gt;Milvus&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;With this many potential solutions, the obvious answer for LLM is to address this with plugins.&lt;/p&gt;
&lt;p&gt;I'm still thinking through the details, but the core idea is that users should be able to define an index against one or more collections, and LLM will then coordinate updates to that index. These may not happen in real-time - some indexes can be expensive to rebuild, so there are benefits to applying updates in batches.&lt;/p&gt;
&lt;p&gt;I experimented with FAISS earlier this year in &lt;a href="https://datasette.io/plugins/datasette-faiss"&gt;datasette-faiss&lt;/a&gt;. That's likely to be the base for my first implementation.&lt;/p&gt;
&lt;p&gt;The &lt;a href="https://llm.datasette.io/en/stable/embeddings/python-api.html#sql-schema"&gt;embeddings table&lt;/a&gt; has an &lt;code&gt;updated&lt;/code&gt; timestamp column to support this use-case - so indexers can run against just the items that have changed since the last indexing run.&lt;/p&gt;
&lt;p&gt;Follow &lt;a href="https://github.com/simonw/llm/issues/216"&gt;issue #216&lt;/a&gt; for updates on this feature.&lt;/p&gt;
&lt;h5 id="chunking"&gt;Chunking&lt;/h5&gt;
&lt;p&gt;When building an embeddings-based search engine, the hardest challenge is deciding how best to "chunk" the documents.&lt;/p&gt;
&lt;p&gt;Users will type in short phrases or questions. The embedding for a four word question might not necessarily map closely to the embedding of a thousand word article, even if the article itself should be a good match for that query.&lt;/p&gt;
&lt;p&gt;To maximize the chance of returning the most relevant content, we need to be smarter about what we embed.&lt;/p&gt;
&lt;p&gt;I'm still trying to get a good feeling for the strategies that make sense here. Some that I've seen include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Split a document up into fixed length shorter segments.&lt;/li&gt;
&lt;li&gt;Split into segments with a ~10% overlap with the previous and next segments, to reduce problems caused by words and sentences being split in a way that disrupts their semantic meaning.&lt;/li&gt;
&lt;li&gt;Split by sentence, using NLP techniques.&lt;/li&gt;
&lt;li&gt;Split into higher level sections, based on things like document headings.&lt;/li&gt;
&lt;/ul&gt;
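&lt;p&gt;As a rough sketch of the fixed-length and overlapping strategies from that list, here's a hypothetical &lt;code&gt;chunk_words()&lt;/code&gt; helper. Measuring length in whitespace-separated words is an assumption made for simplicity; a real implementation might count tokens or characters instead:&lt;/p&gt;

```python
def chunk_words(text, size=200, overlap=20):
    # Hypothetical helper: fixed-length chunks measured in whitespace-separated
    # words, where each chunk repeats the last `overlap` words of the previous one.
    if overlap not in range(size):
        raise ValueError("overlap must be at least 0 and smaller than size")
    words = text.split()
    if not words:
        return []
    step = size - overlap
    # Stop once the remaining words are already covered by the previous chunk.
    last_start = max(len(words) - overlap, 1)
    return [
        " ".join(words[start:start + size])
        for start in range(0, last_start, step)
    ]


chunks = chunk_words("a b c d e f g h i j", size=4, overlap=1)
```

&lt;p&gt;The defaults of 200 words with 20 words of overlap correspond to the ~10% figure mentioned above. Setting &lt;code&gt;overlap=0&lt;/code&gt; gives plain fixed-length segments.&lt;/p&gt;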
&lt;p&gt;Then there are more exciting, LLM-driven approaches:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Generate an LLM summary of a document and embed that.&lt;/li&gt;
&lt;li&gt;Ask an LLM "What questions are answered by the following text?" and then embed each of the resulting questions!&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;It's possible to try out these different techniques using LLM already: write code that does the splitting, then feed the results to &lt;a href="https://llm.datasette.io/en/stable/embeddings/python-api.html#storing-embeddings-in-bulk"&gt;Collection.embed_multi()&lt;/a&gt; or &lt;a href="https://llm.datasette.io/en/stable/embeddings/cli.html#llm-embed-multi"&gt;llm embed-multi&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;But... it would be really cool if LLM could split documents for you - with the splitting techniques themselves defined by plugins, to make it easy to try out new approaches.&lt;/p&gt;
&lt;h4 id="get-involved"&gt;Get involved&lt;/h4&gt;
&lt;p&gt;It should be clear by now that the potential scope of the LLM project is enormous. I'm trying to use plugins to tie together an enormous and rapidly growing ecosystem of models and techniques into something that's as easy for people to work with and build on as possible.&lt;/p&gt;
&lt;p&gt;There are plenty of ways you can help!&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://datasette.io/discord-llm"&gt;Join the #llm Discord&lt;/a&gt; to talk about the project.&lt;/li&gt;
&lt;li&gt;Try out plugins and run different models with them. There are &lt;a href="https://llm.datasette.io/en/stable/plugins/directory.html"&gt;12 plugins already&lt;/a&gt;, and several of those can be used to run dozens if not hundreds of models (&lt;a href="https://github.com/simonw/llm-mlc"&gt;llm-mlc&lt;/a&gt;, &lt;a href="https://github.com/simonw/llm-gpt4all"&gt;llm-gpt4all&lt;/a&gt; and &lt;a href="https://github.com/simonw/llm-llama-cpp"&gt;llm-llama-cpp&lt;/a&gt; in particular). I've hardly scratched the surface of these myself, and I'm testing exclusively on Apple Silicon. I'm really keen to learn more about which models work well, which models don't and which perform the best on different hardware.&lt;/li&gt;
&lt;li&gt;Try &lt;a href="https://llm.datasette.io/en/stable/plugins/tutorial-model-plugin.html"&gt;building a plugin&lt;/a&gt; for a new model. My dream here is that every significant Large Language Model will have an LLM plugin that makes it easy to install and use.&lt;/li&gt;
&lt;li&gt;Build stuff using LLM and let me know what you've built. Nothing fuels an open source project more than stories of cool things people have built with it.&lt;/li&gt;
&lt;/ul&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/cli"&gt;cli&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/open-source"&gt;open-source&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sqlite"&gt;sqlite&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/vector-search"&gt;vector-search&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/embeddings"&gt;embeddings&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/rag"&gt;rag&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="cli"/><category term="open-source"/><category term="projects"/><category term="sqlite"/><category term="ai"/><category term="generative-ai"/><category term="vector-search"/><category term="llms"/><category term="embeddings"/><category term="llm"/><category term="rag"/></entry><entry><title>sqlite-vss v0.1.1 Annotated Release Notes</title><link href="https://simonwillison.net/2023/Jul/20/sqlite-vss-v011-annotated-release-notes/#atom-tag" rel="alternate"/><published>2023-07-20T17:48:50+00:00</published><updated>2023-07-20T17:48:50+00:00</updated><id>https://simonwillison.net/2023/Jul/20/sqlite-vss-v011-annotated-release-notes/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://observablehq.com/@asg017/sqlite-vss-v0-1-1-annotated-release-notes"&gt;sqlite-vss v0.1.1 Annotated Release Notes&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Alex Garcia’s sqlite-vss adds vector search directly to SQLite through a custom extension. It’s now easily installed for Python, Node.js, Deno, Elixir, Go, Rust and Ruby (“gem install sqlite-vss”), and is being used actively by enough people that Alex is getting actionable feedback, including fixes for memory leaks spotted in production.

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/agarcia_me/status/1682055955549417473"&gt;@agarcia_me&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/sqlite"&gt;sqlite&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/annotated-release-notes"&gt;annotated-release-notes&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/alex-garcia"&gt;alex-garcia&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/vector-search"&gt;vector-search&lt;/a&gt;&lt;/p&gt;



</summary><category term="sqlite"/><category term="annotated-release-notes"/><category term="alex-garcia"/><category term="vector-search"/></entry><entry><title>Vector Search</title><link href="https://simonwillison.net/2023/Jun/2/vector-search/#atom-tag" rel="alternate"/><published>2023-06-02T05:02:22+00:00</published><updated>2023-06-02T05:02:22+00:00</updated><id>https://simonwillison.net/2023/Jun/2/vector-search/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://amjith.com/blog/2023/vector_search/"&gt;Vector Search&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Amjith Ramanujam provides a very thorough tutorial on implementing vector similarity search using SentenceTransformers embeddings (all-MiniLM-L6-v2) executed using sqlite-utils, then served via datasette-sqlite-vss and deployed using Fly.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/sqlite"&gt;sqlite&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/fly"&gt;fly&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/vector-search"&gt;vector-search&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/embeddings"&gt;embeddings&lt;/a&gt;&lt;/p&gt;



</summary><category term="sqlite"/><category term="ai"/><category term="datasette"/><category term="fly"/><category term="vector-search"/><category term="embeddings"/></entry><entry><title>Introducing sqlite-vss: A SQLite Extension for Vector Search</title><link href="https://simonwillison.net/2023/Feb/10/sqlite-vss/#atom-tag" rel="alternate"/><published>2023-02-10T22:53:14+00:00</published><updated>2023-02-10T22:53:14+00:00</updated><id>https://simonwillison.net/2023/Feb/10/sqlite-vss/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://observablehq.com/@asg017/introducing-sqlite-vss"&gt;Introducing sqlite-vss: A SQLite Extension for Vector Search&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
This latest SQLite extension from Alex Garcia is possibly his best yet: it adds FAISS-powered vector similarity search directly to SQLite, enabling fast KNN similarity lookups against a virtual table that feels a lot like SQLite’s own built-in full text search feature. This write-up includes interactive demos using Datasette called from an Observable notebook, running similarity searches against an index of 200,000 news headlines and summaries in less than 50ms.

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://fedi.simonwillison.net/@simon/109842505896608627"&gt;@simon on Mastodon&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/sqlite"&gt;sqlite&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/observable"&gt;observable&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/alex-garcia"&gt;alex-garcia&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/vector-search"&gt;vector-search&lt;/a&gt;&lt;/p&gt;



</summary><category term="sqlite"/><category term="datasette"/><category term="observable"/><category term="alex-garcia"/><category term="vector-search"/></entry><entry><title>Weeknotes: AI hacking and a SpatiaLite tutorial</title><link href="https://simonwillison.net/2023/Jan/15/weeknotes/#atom-tag" rel="alternate"/><published>2023-01-15T19:45:39+00:00</published><updated>2023-01-15T19:45:39+00:00</updated><id>https://simonwillison.net/2023/Jan/15/weeknotes/#atom-tag</id><summary type="html">
    &lt;p&gt;Short weeknotes this time because the key things I worked on have already been covered here:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://simonwillison.net/2023/Jan/13/semantic-search-answers/"&gt;How to implement Q&amp;amp;A against your documentation with GPT3, embeddings and Datasette&lt;/a&gt; described my experiment using custom SQL functions in Datasette to provide a Q&amp;amp;A interface against an existing corpus of documents. This resulted in a flurry of software releases:
&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://datasette.io/plugins/datasette-openai"&gt;datasette-openai&lt;/a&gt; - custom SQL functions for querying the OpenAI APIs from Datasette&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://datasette.io/plugins/datasette-faiss"&gt;datasette-faiss&lt;/a&gt; - a plugin that builds a FAISS index for fast vector similarity search&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://datasette.io/plugins/datasette-cookies-for-magic-parameters"&gt;datasette-cookies-for-magic-parameters&lt;/a&gt; - a supporting plugin that helps store secrets such as API keys in cookies so they won't be transmitted in query strings&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://datasette.io/tools/openai-to-sqlite"&gt;openai-to-sqlite&lt;/a&gt; - a command-line tool for storing OpenAI API responses in a SQLite database - useful for calculating embeddings&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://simonwillison.net/2023/Jan/9/datasette-064/"&gt;Datasette 0.64, with a warning about SpatiaLite&lt;/a&gt; - I released Datasette 0.64 with a new &lt;code&gt;--setting default_allow_sql off&lt;/code&gt; setting, due to the discovery that some of the SQL functions added by the SpatiaLite module can be used to crash the Datasette instance.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://datasette.io/tutorials/spatialite"&gt;Building a location to time zone API with SpatiaLite&lt;/a&gt; is the latest in the official series of tutorials for Datasette. It demonstrates how SpatiaLite can be used to safely build and deploy a geospatial API.&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;Releases this week&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette-openai"&gt;datasette-openai&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/datasette-openai/releases/tag/0.2"&gt;0.2&lt;/a&gt; - (&lt;a href="https://github.com/simonw/datasette-openai/releases"&gt;4 releases total&lt;/a&gt;) - 2023-01-14
&lt;br /&gt;SQL functions for calling OpenAI APIs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/openai-to-sqlite"&gt;openai-to-sqlite&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/openai-to-sqlite/releases/tag/0.2"&gt;0.2&lt;/a&gt; - (&lt;a href="https://github.com/simonw/openai-to-sqlite/releases"&gt;2 releases total&lt;/a&gt;) - 2023-01-13
&lt;br /&gt;Save OpenAI API results to a SQLite database&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette-cookies-for-magic-parameters"&gt;datasette-cookies-for-magic-parameters&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/datasette-cookies-for-magic-parameters/releases/tag/0.1.2"&gt;0.1.2&lt;/a&gt; - (&lt;a href="https://github.com/simonw/datasette-cookies-for-magic-parameters/releases"&gt;3 releases total&lt;/a&gt;) - 2023-01-12
&lt;br /&gt;UI for setting cookies to populate magic parameters&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/git-history"&gt;git-history&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/git-history/releases/tag/0.7a0"&gt;0.7a0&lt;/a&gt; - (&lt;a href="https://github.com/simonw/git-history/releases"&gt;10 releases total&lt;/a&gt;) - 2023-01-11
&lt;br /&gt;Tools for analyzing Git history using SQLite&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette"&gt;datasette&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/datasette/releases/tag/0.64.1"&gt;0.64.1&lt;/a&gt; - (&lt;a href="https://github.com/simonw/datasette/releases"&gt;124 releases total&lt;/a&gt;) - 2023-01-11
&lt;br /&gt;An open source multi-tool for exploring and publishing data&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette-faiss"&gt;datasette-faiss&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/datasette-faiss/releases/tag/0.1a0"&gt;0.1a0&lt;/a&gt; - 2023-01-11
&lt;br /&gt;Maintain a FAISS index for specified Datasette tables&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/json-to-files"&gt;json-to-files&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/json-to-files/releases/tag/0.1"&gt;0.1&lt;/a&gt; - 2023-01-10
&lt;br /&gt;Create separate files on disk based on a JSON object&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette-auth-passwords"&gt;datasette-auth-passwords&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/datasette-auth-passwords/releases/tag/1.1"&gt;1.1&lt;/a&gt; - (&lt;a href="https://github.com/simonw/datasette-auth-passwords/releases"&gt;12 releases total&lt;/a&gt;) - 2023-01-09
&lt;br /&gt;Datasette plugin for authentication using passwords&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette-publish-fly"&gt;datasette-publish-fly&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/datasette-publish-fly/releases/tag/1.3"&gt;1.3&lt;/a&gt; - (&lt;a href="https://github.com/simonw/datasette-publish-fly/releases"&gt;9 releases total&lt;/a&gt;) - 2023-01-09
&lt;br /&gt;Datasette plugin for publishing data using Fly&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/shapefile-to-sqlite"&gt;shapefile-to-sqlite&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/shapefile-to-sqlite/releases/tag/0.4.2"&gt;0.4.2&lt;/a&gt; - (&lt;a href="https://github.com/simonw/shapefile-to-sqlite/releases"&gt;6 releases total&lt;/a&gt;) - 2023-01-08
&lt;br /&gt;Load shapefiles into a SQLite (optionally SpatiaLite) database&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;TIL this week&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/clickhouse/github-explorer"&gt;Querying the GitHub archive with the ClickHouse playground&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/cooking/breakfast-tacos"&gt;Breakfast tacos with eggs, vegetarian sausages and avocado sauce&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/sqlite/geopoly"&gt;Geopoly in SQLite&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/sqlite/sqlite-extensions-python-macos"&gt;Loading SQLite extensions in Python on macOS&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/shot-scraper/scraping-flourish"&gt;Scraping the Sky News Westminster Accounts, a Flourish application&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/python/pipx-alpha"&gt;Upgrading a pipx application to an alpha version&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/spatialite"&gt;spatialite&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/weeknotes"&gt;weeknotes&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gpt-3"&gt;gpt-3&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openai"&gt;openai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/vector-search"&gt;vector-search&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="spatialite"/><category term="ai"/><category term="datasette"/><category term="weeknotes"/><category term="gpt-3"/><category term="openai"/><category term="generative-ai"/><category term="vector-search"/><category term="llms"/></entry><entry><title>How to implement Q&amp;A against your documentation with GPT3, embeddings and Datasette</title><link href="https://simonwillison.net/2023/Jan/13/semantic-search-answers/#atom-tag" rel="alternate"/><published>2023-01-13T23:47:30+00:00</published><updated>2023-01-13T23:47:30+00:00</updated><id>https://simonwillison.net/2023/Jan/13/semantic-search-answers/#atom-tag</id><summary type="html">
    &lt;p&gt;If you've spent any time with GPT-3 or ChatGPT, you've likely thought about how useful it would be if you could point them at a specific, current collection of text or documentation and have them use that as part of their input for answering questions.&lt;/p&gt;
&lt;p&gt;It turns out there is a neat trick for doing exactly that. I've been experimenting with it using my &lt;a href="https://datasette.io/"&gt;Datasette&lt;/a&gt; web application as a rapid prototyping platform for trying out new AI techniques using custom SQL functions.&lt;/p&gt;
&lt;p&gt;Here's how to do this:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Run a text search (or a semantic search, described later) against your documentation to find content that looks like it could be relevant to the user's question&lt;/li&gt;
&lt;li&gt;Grab extracts of that content and glue them all together into a blob of text&lt;/li&gt;
&lt;li&gt;Construct a prompt consisting of that text followed by "Given the above content, answer the following question: " and the user's question&lt;/li&gt;
&lt;li&gt;Send the whole thing through &lt;a href="https://beta.openai.com/docs/api-reference/completions"&gt;the GPT-3 API&lt;/a&gt; and see what comes back&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I've been calling this the &lt;em&gt;semantic search answers&lt;/em&gt; pattern. &lt;strong&gt;Update:&lt;/strong&gt; Since publishing this post I've learned that this technique is known as &lt;strong&gt;Retrieval-Augmented Generation&lt;/strong&gt; or RAG, as described in &lt;a href="https://arxiv.org/abs/2005.11401"&gt;this paper from May 2020&lt;/a&gt;. I've also been pointed to the &lt;a href="https://github.com/openai/openai-cookbook/blob/main/examples/Question_answering_using_embeddings.ipynb"&gt;Question Answering using Embeddings&lt;/a&gt; notebook in the OpenAI cookbook which describes this same technique.&lt;/p&gt;
&lt;p&gt;This is a form of &lt;a href="https://simonwillison.net/tags/promptengineering/"&gt;prompt engineering&lt;/a&gt; (and hence is prone to &lt;a href="https://simonwillison.net/series/prompt-injection/"&gt;prompt injection attacks&lt;/a&gt;, so bear that in mind). You need to fit your overall prompt into the token limit for GPT-3, which is currently 4,000 tokens. A token is more-or-less the same thing as a word.&lt;/p&gt;
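&lt;p&gt;The prompt assembly steps listed above can be sketched in a few lines of Python. This is a minimal illustration rather than the code behind the demos: the &lt;code&gt;build_prompt&lt;/code&gt; name and the &lt;code&gt;max_words&lt;/code&gt; budget are assumptions, and counting words only approximates how the model counts tokens.&lt;/p&gt;

```python
def build_prompt(extracts, question, max_words=3000):
    """Glue extracts together, then append the instruction and the question.

    max_words is a hypothetical, rough stand-in for a real token budget.
    """
    instruction = "Given the above content, answer the following question: "
    suffix = instruction + question
    # Reserve room for the instruction and question, then trim the context.
    budget = max(max_words - len(suffix.split()), 0)
    words = " ".join(extracts).split()
    context = " ".join(words[:budget])
    return context + "\n\n" + suffix


if __name__ == "__main__":
    print(build_prompt(["shot-scraper takes screenshots."], "What does shot-scraper do?"))
```

&lt;p&gt;Trimming the context rather than the question keeps the user's question intact when the retrieved extracts are too long.&lt;/p&gt;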
&lt;p&gt;In this post I'll show how I implemented this pattern using SQLite, Datasette and some custom SQL functions and tools written in Python.&lt;/p&gt;
&lt;h4&gt;First, some demos&lt;/h4&gt;
&lt;p&gt;I've implemented this pattern against my blog - 3,000 posts going back over twenty years. I built it using some custom SQLite SQL functions that call the OpenAI APIs.&lt;/p&gt;
&lt;p&gt;First, some examples of questions and answers I ran against my blog:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;What does shot scraper do?&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Shot-scraper is a command-line tool that helps automate the process of taking screenshots of web pages and scraping data from them using JavaScript. It can be used to take one-off screenshots or take multiple screenshots in a repeatable way by defining them in a YAML file. It can also be used to execute JavaScript on a page and return the resulting value.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;a href="https://shot-scraper.datasette.io/"&gt;shot-scraper&lt;/a&gt; is a tool I released in 2022 - after the 2021 GPT-3 training cut-off date. This is a very good summary - better I think than anything I've written about &lt;code&gt;shot-scraper&lt;/code&gt; myself.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;What museums has Simon been to?&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Simon has been to the Pitt Rivers museum, the British Museum, the Science Museum, the Natural History Museum, the V&amp;amp;A, the Victoria and Albert Museum, and the Pioneer Museum in Paso Robles. He has also seen the Giant Squid at the Natural History Museum and Charles Darwin's collection at the spirit building.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This is entirely correct, though I've talked about a lot more museums than that on my blog.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;What are some unconventional things you can do with GitHub Actions?&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
&lt;ol&gt;
&lt;li&gt;Create dynamic content for GitHub repository templates using cookiecutter and GitHub Actions&lt;/li&gt;
&lt;li&gt;Automate repetitive tasks like code linting, testing, and deployment using GitHub Actions&lt;/li&gt;
&lt;li&gt;Run scheduled scrapers in GitHub Actions&lt;/li&gt;
&lt;li&gt;Use GitHub Actions to trigger a webhook call when a repository is forked&lt;/li&gt;
&lt;li&gt;Create an automated issue-tracking system with GitHub Issues and GitHub Actions&lt;/li&gt;
&lt;li&gt;Automatically extract captions and transcripts from online videos using Whisper and GitHub Actions.&lt;/li&gt;
&lt;/ol&gt;
&lt;/blockquote&gt;
&lt;p&gt;Some of my relevant posts:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://simonwillison.net/2021/Aug/28/dynamic-github-repository-templates/"&gt;Dynamic content for GitHub repository templates using cookiecutter and GitHub Actions&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://simonwillison.net/2020/Oct/9/git-scraping/"&gt;Git scraping: track changes over time by scraping to a Git repository&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://simonwillison.net/2022/Sep/30/action-transcription/"&gt;A tool to run caption extraction against online videos using Whisper and GitHub Issues/Actions&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I've never written anything about triggering webhook calls or issue-tracking systems though, so 4 and 5 in that list are a little suspicious.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;What is the big grey box at the top of the utility pole? What does it do?&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The big grey box at the top of the utility pole is likely a transformer. Transformers are used to step down the voltage of electricity from the high-voltage lines that run along the pole to the lower-voltage lines that are used to provide electricity to homes and businesses.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I'm including this example because I've never written anything even remotely related to transformers and utility poles on my blog. This demonstrates that in the absence of useful context GPT-3 will answer the question entirely by itself - which may or may not be what you want from this system.&lt;/p&gt;
&lt;h4&gt;Try this out yourself&lt;/h4&gt;
&lt;p&gt;If you want to try this out yourself you'll need to get your own API key from OpenAI. I don't want to foot the bill for people using my blog as a free source of GPT-3 prompt answers!&lt;/p&gt;
&lt;p&gt;You can &lt;a href="https://beta.openai.com/signup"&gt;sign up for one here&lt;/a&gt;. I believe they are still running a free trial period.&lt;/p&gt;
&lt;p&gt;Now head over to this page:&lt;/p&gt;
&lt;p&gt;&lt;a href="https://datasette.simonwillison.net/simonwillisonblog/answer_question?_hide_sql=1"&gt;https://datasette.simonwillison.net/simonwillisonblog/answer_question?_hide_sql=1&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;You'll need to paste in your OpenAI key. I am not logging keys anywhere, and the form stores your key in a cookie to avoid transmitting it in a GET query string, where it could be accidentally logged.&lt;/p&gt;
&lt;p&gt;Then type in your question and see what comes back!&lt;/p&gt;
&lt;p&gt;Let's talk about how this all works - in a whole lot of detail.&lt;/p&gt;
&lt;h4 id="embeddings"&gt;Semantic search using embeddings&lt;/h4&gt;
&lt;p&gt;You can implement the first step of this sequence using any search engine you like - but there's a catch: we are encouraging users here to ask questions, which increases the chance that they might include text in their prompt which doesn't exactly match documents in our index.&lt;/p&gt;
&lt;p&gt;"What are the key features of Datasette?" for example might miss blog entries that don't include the word "feature" even though they describe functionality of the software in detail.&lt;/p&gt;
&lt;p&gt;What we want here is &lt;strong&gt;semantic search&lt;/strong&gt; - we want to find documents that match the meaning of the user's search term, even if the matching keywords are not present.&lt;/p&gt;
&lt;p&gt;OpenAI have a less well-known API that can help here, which had a big upgrade (and major price reduction) &lt;a href="https://openai.com/blog/new-and-improved-embedding-model/"&gt;back in December&lt;/a&gt;: their embedding model.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Update 31st January 2023:&lt;/strong&gt; I figured out how to run an alternative embeddings model that can execute entirely on my laptop, described here: &lt;a href="https://til.simonwillison.net/python/gtr-t5-large"&gt;Calculating embeddings with gtr-t5-large in Python&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;An &lt;strong&gt;embedding&lt;/strong&gt; is a list of floating point numbers.&lt;/p&gt;
&lt;p&gt;As an example, consider a latitude/longitude location: it's a list of two floating point numbers. You can use those numbers to find other nearby points by calculating distances between them.&lt;/p&gt;
&lt;p&gt;Add a third number and now you can plot locations in three dimensional space - and still calculate distances between them to find the closest points.&lt;/p&gt;
&lt;p&gt;This idea keeps on working even as we go beyond three dimensions: you can calculate distances between vectors of any length, no matter how many dimensions they have.&lt;/p&gt;
&lt;p&gt;So if we can represent some text in a many-multi-dimensional vector space, we can calculate distances between those vectors to find the closest matches.&lt;/p&gt;
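&lt;p&gt;To make the distance idea concrete, here is a small sketch of Euclidean distance that works for vectors of any length. The function names and the sample points are illustrative assumptions, not part of the implementation described in this post.&lt;/p&gt;

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two vectors of the same length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def closest(target, candidates):
    """Return the name of the candidate vector nearest to target."""
    return min(candidates, key=lambda name: euclidean_distance(target, candidates[name]))


if __name__ == "__main__":
    # Works the same way in two dimensions, three dimensions or 1,536.
    print(euclidean_distance([0, 0], [3, 4]))
    print(closest([0, 0, 0], {"near": [1, 0, 0], "far": [5, 5, 5]}))
```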
&lt;p&gt;The OpenAI embedding model lets you take any string of text (up to a ~8,000 word length limit) and turn that into a list of 1,536 floating point numbers. We'll call this list the "embedding" for the text.&lt;/p&gt;
&lt;p&gt;These numbers are derived from a sophisticated language model. They take a vast amount of knowledge of human language and flatten that down to a list of floating point numbers - at 4 bytes per floating point number that's 4*1,536 = 6,144 bytes per embedding - 6KiB.&lt;/p&gt;
&lt;p&gt;The distance between two embeddings represents how semantically similar the two pieces of text are: the smaller the distance, the closer their meaning.&lt;/p&gt;
&lt;p&gt;The two most obvious applications of this are search and similarity scores.&lt;/p&gt;
&lt;p&gt;Take a user's search term. Calculate its embedding. Now find the distance between that embedding and every pre-calculated embedding in a corpus and return the 10 closest results.&lt;/p&gt;
&lt;p&gt;Or for document similarity: calculate embeddings for every document in a collection, then look at each one in turn and find the closest other embeddings: those are the documents that are most similar to it.&lt;/p&gt;
&lt;p&gt;For my semantic search answers implementation, I use an embedding-based semantic search as the first step to find the best matches for the question. I then assemble these top 5 matches into the prompt to pass to GPT-3.&lt;/p&gt;
&lt;h4&gt;Calculating embeddings&lt;/h4&gt;
&lt;p&gt;Embeddings can be calculated from text using &lt;a href="https://beta.openai.com/docs/guides/embeddings"&gt;the OpenAI embeddings API&lt;/a&gt;. It's really easy to use:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;curl https://api.openai.com/v1/embeddings \
  -H &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Content-Type: application/json&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; \
  -H &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Authorization: Bearer &lt;span class="pl-smi"&gt;$OPENAI_API_KEY&lt;/span&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; \
  -d &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;{"input": "Your text string goes here",&lt;/span&gt;
&lt;span class="pl-s"&gt;       "model":"text-embedding-ada-002"}&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The documentation doesn't mention this, but you can pass a list of strings (&lt;a href="https://github.com/openai/openai-python/blob/777c1c3de1a9cfc0a33e6376cc09b9badbb9cdf9/openai/embeddings_utils.py#L43"&gt;up to 2048&lt;/a&gt; according to the official Python library source code) as &lt;code&gt;"input"&lt;/code&gt; to run embeddings in bulk:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;curl https://api.openai.com/v1/embeddings \
  -H &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Content-Type: application/json&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; \
  -H &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Authorization: Bearer &lt;span class="pl-smi"&gt;$OPENAI_API_KEY&lt;/span&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; \
  -d &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;{"input": ["First string", "Second string", "Third string"],&lt;/span&gt;
&lt;span class="pl-s"&gt;       "model":"text-embedding-ada-002"}&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The returned data from this API looks like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-json"&gt;&lt;pre&gt;{
  &lt;span class="pl-ent"&gt;"data"&lt;/span&gt;: [
    {
      &lt;span class="pl-ent"&gt;"embedding"&lt;/span&gt;: [
        &lt;span class="pl-c1"&gt;-0.006929283495992422&lt;/span&gt;,
        &lt;span class="pl-c1"&gt;-0.005336422007530928&lt;/span&gt;,
        &lt;span class="pl-c1"&gt;...&lt;/span&gt;
        &lt;span class="pl-c1"&gt;-4.547132266452536e-05&lt;/span&gt;,
        &lt;span class="pl-c1"&gt;-0.024047505110502243&lt;/span&gt;
      ],
      &lt;span class="pl-ent"&gt;"index"&lt;/span&gt;: &lt;span class="pl-c1"&gt;0&lt;/span&gt;,
      &lt;span class="pl-ent"&gt;"object"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;embedding&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;
    }
  ]
}&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;As expected, it's a list of 1,536 floating point numbers.&lt;/p&gt;
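&lt;p&gt;The same request and response handling can be sketched in Python using only the standard library. The helper names here are hypothetical. The URL, headers and body mirror the curl examples above, and sorting by &lt;code&gt;"index"&lt;/code&gt; is a defensive assumption so that results line up with the input strings.&lt;/p&gt;

```python
import json
import urllib.request

API_URL = "https://api.openai.com/v1/embeddings"


def build_embeddings_request(texts, api_key, model="text-embedding-ada-002"):
    """Build (but do not send) a POST request for one string or a list of strings."""
    body = json.dumps({"input": texts, "model": model}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer " + api_key,
        },
        method="POST",
    )


def extract_embeddings(response):
    """Pull the embedding lists out of a decoded response, ordered by index."""
    items = sorted(response["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in items]
```

&lt;p&gt;Sending the request would look something like &lt;code&gt;json.load(urllib.request.urlopen(request))&lt;/code&gt;, with the result passed to &lt;code&gt;extract_embeddings()&lt;/code&gt;.&lt;/p&gt;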
&lt;p&gt;I've been storing embeddings as a binary string that appends all of the floating point numbers together, using their 4-byte representation.&lt;/p&gt;
&lt;p&gt;Here are the tiny Python functions I've been using for doing that:&lt;/p&gt;
&lt;pre&gt;&lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-s1"&gt;struct&lt;/span&gt;

&lt;span class="pl-k"&gt;def&lt;/span&gt; &lt;span class="pl-en"&gt;decode&lt;/span&gt;(&lt;span class="pl-s1"&gt;blob&lt;/span&gt;):
    &lt;span class="pl-k"&gt;return&lt;/span&gt; &lt;span class="pl-s1"&gt;struct&lt;/span&gt;.&lt;span class="pl-en"&gt;unpack&lt;/span&gt;(&lt;span class="pl-s"&gt;"f"&lt;/span&gt; &lt;span class="pl-c1"&gt;*&lt;/span&gt; &lt;span class="pl-c1"&gt;1536&lt;/span&gt;, &lt;span class="pl-s1"&gt;blob&lt;/span&gt;)

&lt;span class="pl-k"&gt;def&lt;/span&gt; &lt;span class="pl-en"&gt;encode&lt;/span&gt;(&lt;span class="pl-s1"&gt;values&lt;/span&gt;):
    &lt;span class="pl-k"&gt;return&lt;/span&gt; &lt;span class="pl-s1"&gt;struct&lt;/span&gt;.&lt;span class="pl-en"&gt;pack&lt;/span&gt;(&lt;span class="pl-s"&gt;"f"&lt;/span&gt; &lt;span class="pl-c1"&gt;*&lt;/span&gt; &lt;span class="pl-c1"&gt;1536&lt;/span&gt;, &lt;span class="pl-c1"&gt;*&lt;/span&gt;&lt;span class="pl-s1"&gt;values&lt;/span&gt;)&lt;/pre&gt;
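&lt;p&gt;Those functions hard-code 1,536 dimensions. A variant that infers the length from its input, shown here as a sketch rather than the code I actually run, makes it easy to check the round trip and the storage size. Values come back rounded to 32-bit precision.&lt;/p&gt;

```python
import struct


def encode(values):
    """Pack a sequence of floats into 4 bytes each, in native byte order."""
    return struct.pack("%df" % len(values), *values)


def decode(blob):
    """Unpack a blob produced by encode() back into a tuple of floats."""
    count = len(blob) // struct.calcsize("f")
    return struct.unpack("%df" % count, blob)


if __name__ == "__main__":
    blob = encode([0.5, -1.25, 2.0])
    print(len(blob), decode(blob))
    # A full-size embedding: 1,536 floats at 4 bytes each.
    print(len(encode([0.0] * 1536)))
```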
&lt;p&gt;I then store them in SQLite &lt;code&gt;blob&lt;/code&gt; columns in my database.&lt;/p&gt;
&lt;p&gt;I wrote a custom tool for doing this, called &lt;a href="https://datasette.io/tools/openai-to-sqlite"&gt;openai-to-sqlite&lt;/a&gt;. I can run it like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;openai-to-sqlite embeddings simonwillisonblog.db \
  --sql &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;select id, title, body from blog_entry&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt; \
  --table blog_entry_embeddings&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This concatenates together the &lt;code&gt;title&lt;/code&gt; and &lt;code&gt;body&lt;/code&gt; columns from that table, runs them through the OpenAI embeddings API and stores the results in a new table called &lt;code&gt;blog_entry_embeddings&lt;/code&gt; with the following schema:&lt;/p&gt;
&lt;div class="highlight highlight-source-sql"&gt;&lt;pre&gt;CREATE TABLE [blog_entry_embeddings] (
   [id] &lt;span class="pl-k"&gt;INTEGER&lt;/span&gt; &lt;span class="pl-k"&gt;PRIMARY KEY&lt;/span&gt;,
   [embedding] BLOB
)&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;I can join this against the &lt;code&gt;blog_entry&lt;/code&gt; table by ID later on.&lt;/p&gt;
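&lt;p&gt;Here is a minimal sketch of that storage layout using Python's built-in &lt;code&gt;sqlite3&lt;/code&gt; module. The three-number embedding and the cut-down &lt;code&gt;blog_entry&lt;/code&gt; table are made up for illustration. The embeddings table follows the schema above.&lt;/p&gt;

```python
import sqlite3
import struct


def store_embedding(conn, entry_id, values):
    """Pack floats into a BLOB (4 bytes each) and save them against an ID."""
    blob = struct.pack("%df" % len(values), *values)
    conn.execute(
        "INSERT INTO blog_entry_embeddings (id, embedding) VALUES (?, ?)",
        (entry_id, blob),
    )


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE blog_entry (id INTEGER PRIMARY KEY, title TEXT, body TEXT);
    CREATE TABLE blog_entry_embeddings (id INTEGER PRIMARY KEY, embedding BLOB);
""")
conn.execute("INSERT INTO blog_entry VALUES (1, 'Hello', 'First post')")
store_embedding(conn, 1, [0.25, 0.5, 0.75])

# Join the embeddings back to their entries by ID.
rows = conn.execute("""
    SELECT blog_entry.title, length(blog_entry_embeddings.embedding)
    FROM blog_entry
    JOIN blog_entry_embeddings ON blog_entry.id = blog_entry_embeddings.id
""").fetchall()
print(rows)
```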
&lt;h4&gt;Finding the closest matches&lt;/h4&gt;
&lt;p&gt;The easiest way to calculate similarity between two embedding arrays is to use cosine similarity. A simple Python function for that looks like this:&lt;/p&gt;
&lt;pre&gt;&lt;span class="pl-k"&gt;def&lt;/span&gt; &lt;span class="pl-en"&gt;cosine_similarity&lt;/span&gt;(&lt;span class="pl-s1"&gt;a&lt;/span&gt;, &lt;span class="pl-s1"&gt;b&lt;/span&gt;):
    &lt;span class="pl-s1"&gt;dot_product&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-en"&gt;sum&lt;/span&gt;(&lt;span class="pl-s1"&gt;x&lt;/span&gt; &lt;span class="pl-c1"&gt;*&lt;/span&gt; &lt;span class="pl-s1"&gt;y&lt;/span&gt; &lt;span class="pl-k"&gt;for&lt;/span&gt; &lt;span class="pl-s1"&gt;x&lt;/span&gt;, &lt;span class="pl-s1"&gt;y&lt;/span&gt; &lt;span class="pl-c1"&gt;in&lt;/span&gt; &lt;span class="pl-en"&gt;zip&lt;/span&gt;(&lt;span class="pl-s1"&gt;a&lt;/span&gt;, &lt;span class="pl-s1"&gt;b&lt;/span&gt;))
    &lt;span class="pl-s1"&gt;magnitude_a&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-en"&gt;sum&lt;/span&gt;(&lt;span class="pl-s1"&gt;x&lt;/span&gt; &lt;span class="pl-c1"&gt;*&lt;/span&gt; &lt;span class="pl-s1"&gt;x&lt;/span&gt; &lt;span class="pl-k"&gt;for&lt;/span&gt; &lt;span class="pl-s1"&gt;x&lt;/span&gt; &lt;span class="pl-c1"&gt;in&lt;/span&gt; &lt;span class="pl-s1"&gt;a&lt;/span&gt;) &lt;span class="pl-c1"&gt;**&lt;/span&gt; &lt;span class="pl-c1"&gt;0.5&lt;/span&gt;
    &lt;span class="pl-s1"&gt;magnitude_b&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-en"&gt;sum&lt;/span&gt;(&lt;span class="pl-s1"&gt;x&lt;/span&gt; &lt;span class="pl-c1"&gt;*&lt;/span&gt; &lt;span class="pl-s1"&gt;x&lt;/span&gt; &lt;span class="pl-k"&gt;for&lt;/span&gt; &lt;span class="pl-s1"&gt;x&lt;/span&gt; &lt;span class="pl-c1"&gt;in&lt;/span&gt; &lt;span class="pl-s1"&gt;b&lt;/span&gt;) &lt;span class="pl-c1"&gt;**&lt;/span&gt; &lt;span class="pl-c1"&gt;0.5&lt;/span&gt;
    &lt;span class="pl-k"&gt;return&lt;/span&gt; &lt;span class="pl-s1"&gt;dot_product&lt;/span&gt; &lt;span class="pl-c1"&gt;/&lt;/span&gt; (&lt;span class="pl-s1"&gt;magnitude_a&lt;/span&gt; &lt;span class="pl-c1"&gt;*&lt;/span&gt; &lt;span class="pl-s1"&gt;magnitude_b&lt;/span&gt;)&lt;/pre&gt;
&lt;p&gt;You can brute-force find the top matches for a table by executing that comparison for every row and returning the ones with the highest score.&lt;/p&gt;
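&lt;p&gt;A brute-force search along those lines might look like the following sketch. The &lt;code&gt;top_matches&lt;/code&gt; helper and the dictionary of embeddings are assumptions for illustration, and the code assumes no vector is all zeros, since that would cause a division by zero.&lt;/p&gt;

```python
import heapq


def cosine_similarity(a, b):
    dot_product = sum(x * y for x, y in zip(a, b))
    magnitude_a = sum(x * x for x in a) ** 0.5
    magnitude_b = sum(x * x for x in b) ** 0.5
    return dot_product / (magnitude_a * magnitude_b)


def top_matches(query, embeddings, n=10):
    """Score every stored embedding against query and keep the n best.

    embeddings maps an ID to its vector. Every row is compared, so the
    cost grows linearly with the size of the collection.
    """
    scored = ((cosine_similarity(query, vec), key) for key, vec in embeddings.items())
    return [(key, score) for score, key in heapq.nlargest(n, scored)]


if __name__ == "__main__":
    sample = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
    print(top_matches([1.0, 0.0], sample, n=2))
```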
&lt;p&gt;I added this to my &lt;a href="https://datasette.io/plugins/datasette-openai"&gt;datasette-openai&lt;/a&gt; Datasette plugin as a custom SQL function called &lt;code&gt;openai_embedding_similarity()&lt;/code&gt;. Here's a query that uses it:&lt;/p&gt;
&lt;div class="highlight highlight-source-sql"&gt;&lt;pre&gt;with input &lt;span class="pl-k"&gt;as&lt;/span&gt; (
  &lt;span class="pl-k"&gt;select&lt;/span&gt;
    embedding
  &lt;span class="pl-k"&gt;from&lt;/span&gt;
    blog_entry_embeddings
  &lt;span class="pl-k"&gt;where&lt;/span&gt;
    id &lt;span class="pl-k"&gt;=&lt;/span&gt; :entry_id
),
top_n &lt;span class="pl-k"&gt;as&lt;/span&gt; (
  &lt;span class="pl-k"&gt;select&lt;/span&gt;
    id,
    openai_embedding_similarity(
      &lt;span class="pl-c1"&gt;blog_entry_embeddings&lt;/span&gt;.&lt;span class="pl-c1"&gt;embedding&lt;/span&gt;,
      &lt;span class="pl-c1"&gt;input&lt;/span&gt;.&lt;span class="pl-c1"&gt;embedding&lt;/span&gt;
    ) &lt;span class="pl-k"&gt;as&lt;/span&gt; score
  &lt;span class="pl-k"&gt;from&lt;/span&gt;
    blog_entry_embeddings,
    input
  &lt;span class="pl-k"&gt;order by&lt;/span&gt;
    score &lt;span class="pl-k"&gt;desc&lt;/span&gt;
  &lt;span class="pl-k"&gt;limit&lt;/span&gt;
    &lt;span class="pl-c1"&gt;20&lt;/span&gt;
)
&lt;span class="pl-k"&gt;select&lt;/span&gt;
  score,
  &lt;span class="pl-c1"&gt;blog_entry&lt;/span&gt;.&lt;span class="pl-c1"&gt;id&lt;/span&gt;,
  &lt;span class="pl-c1"&gt;blog_entry&lt;/span&gt;.&lt;span class="pl-c1"&gt;title&lt;/span&gt;
&lt;span class="pl-k"&gt;from&lt;/span&gt;
  blog_entry
  &lt;span class="pl-k"&gt;join&lt;/span&gt; top_n &lt;span class="pl-k"&gt;on&lt;/span&gt; &lt;span class="pl-c1"&gt;blog_entry&lt;/span&gt;.&lt;span class="pl-c1"&gt;id&lt;/span&gt; &lt;span class="pl-k"&gt;=&lt;/span&gt; &lt;span class="pl-c1"&gt;top_n&lt;/span&gt;.&lt;span class="pl-c1"&gt;id&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;a href="https://datasette.simonwillison.net/simonwillisonblog?sql=with+input+as+%28%0D%0A++select%0D%0A++++embedding%0D%0A++from%0D%0A++++blog_entry_embeddings%0D%0A++where%0D%0A++++id+%3D+%3Aentry_id%0D%0A%29%2C%0D%0Atop_n+as+%28%0D%0A++select%0D%0A++++id%2C%0D%0A++++openai_embedding_similarity%28%0D%0A++++++blog_entry_embeddings.embedding%2C%0D%0A++++++input.embedding%0D%0A++++%29+as+score%0D%0A++from%0D%0A++++blog_entry_embeddings%2C%0D%0A++++input%0D%0A++order+by%0D%0A++++score+desc%0D%0A++limit%0D%0A++++20%0D%0A%29%0D%0Aselect%0D%0A++score%2C%0D%0A++blog_entry.id%2C%0D%0A++blog_entry.title%0D%0Afrom%0D%0A++blog_entry%0D%0A++join+top_n+on+blog_entry.id+%3D+top_n.id&amp;amp;entry_id=7977"&gt;Try that out here&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;This takes as input the ID of one of my blog entries and returns a list of the other entries, ordered by their similarity score.&lt;/p&gt;
&lt;p&gt;Unfortunately this is pretty slow! It takes over 1.3s to run against all 3,000 embeddings in my blog.&lt;/p&gt;
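&lt;p&gt;For a sense of how a custom SQL function like this can be wired up, here is a sketch that uses &lt;code&gt;create_function()&lt;/code&gt; from Python's &lt;code&gt;sqlite3&lt;/code&gt; module. The &lt;code&gt;embedding_similarity&lt;/code&gt; name, the tiny two-dimensional vectors and the table are hypothetical, and this is not the plugin's actual source code.&lt;/p&gt;

```python
import sqlite3
import struct


def decode(blob):
    return struct.unpack("%df" % (len(blob) // 4), blob)


def embedding_similarity(blob_a, blob_b):
    """Cosine similarity between two embeddings stored as packed floats."""
    a, b = decode(blob_a), decode(blob_b)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(x * x for x in b) ** 0.5
    return dot / (norm_a * norm_b)


conn = sqlite3.connect(":memory:")
# Expose the Python function to SQL as a two-argument scalar function.
conn.create_function("embedding_similarity", 2, embedding_similarity)
conn.execute("CREATE TABLE embeddings (id INTEGER PRIMARY KEY, embedding BLOB)")
for row_id, vector in [(1, [1.0, 0.0]), (2, [0.0, 1.0]), (3, [1.0, 1.0])]:
    conn.execute(
        "INSERT INTO embeddings VALUES (?, ?)",
        (row_id, struct.pack("2f", *vector)),
    )

# SQLite calls back into Python once per row, which is why this is slow.
ranked = conn.execute("""
    SELECT e.id, embedding_similarity(e.embedding, q.embedding) AS score
    FROM embeddings AS e,
         (SELECT embedding FROM embeddings WHERE id = :entry_id) AS q
    ORDER BY score DESC
""", {"entry_id": 1}).fetchall()
print(ranked)
```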
&lt;p&gt;I did some research and found that a highly regarded solution for fast vector similarity calculations is &lt;a href="https://github.com/facebookresearch/faiss"&gt;FAISS&lt;/a&gt;, by Facebook AI research. It has neat Python bindings and can be installed using &lt;code&gt;pip install faiss-cpu&lt;/code&gt; (the &lt;code&gt;-gpu&lt;/code&gt; version requires a GPU).&lt;/p&gt;
&lt;p&gt;FAISS works against an in-memory index. My blog's Datasette instance uses the &lt;a href="https://simonwillison.net/2021/Jul/28/baked-data/"&gt;baked data&lt;/a&gt; pattern which means the entire thing is re-deployed any time the data changes - as such, I can spin up an in-memory index once on startup without needing to worry about updating the index continually as rows in the database change.&lt;/p&gt;
&lt;p&gt;So I built another plugin to do that: &lt;a href="https://datasette.io/plugins/datasette-faiss"&gt;datasette-faiss&lt;/a&gt; - which can be configured to build an in-memory FAISS index against a configured table on startup, and can then be queried using another custom SQL function.&lt;/p&gt;
&lt;p&gt;Here's the related entries query from above rewritten to use the FAISS index:&lt;/p&gt;
&lt;div class="highlight highlight-source-sql"&gt;&lt;pre&gt;with input &lt;span class="pl-k"&gt;as&lt;/span&gt; (
  &lt;span class="pl-k"&gt;select&lt;/span&gt;
    embedding
  &lt;span class="pl-k"&gt;from&lt;/span&gt;
    blog_entry_embeddings
  &lt;span class="pl-k"&gt;where&lt;/span&gt;
    id &lt;span class="pl-k"&gt;=&lt;/span&gt; :entry_id
),
top_n &lt;span class="pl-k"&gt;as&lt;/span&gt; (
  &lt;span class="pl-k"&gt;select&lt;/span&gt; value &lt;span class="pl-k"&gt;as&lt;/span&gt; id &lt;span class="pl-k"&gt;from&lt;/span&gt; json_each(
    faiss_search(
      &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;simonwillisonblog&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;,
      &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;blog_entry_embeddings&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;,
      &lt;span class="pl-c1"&gt;input&lt;/span&gt;.&lt;span class="pl-c1"&gt;embedding&lt;/span&gt;,
      &lt;span class="pl-c1"&gt;20&lt;/span&gt;
    )
  ), input
)
&lt;span class="pl-k"&gt;select&lt;/span&gt;
  &lt;span class="pl-c1"&gt;blog_entry&lt;/span&gt;.&lt;span class="pl-c1"&gt;id&lt;/span&gt;,
  &lt;span class="pl-c1"&gt;blog_entry&lt;/span&gt;.&lt;span class="pl-c1"&gt;title&lt;/span&gt;
&lt;span class="pl-k"&gt;from&lt;/span&gt;
  blog_entry
  &lt;span class="pl-k"&gt;join&lt;/span&gt; top_n &lt;span class="pl-k"&gt;on&lt;/span&gt; &lt;span class="pl-c1"&gt;blog_entry&lt;/span&gt;.&lt;span class="pl-c1"&gt;id&lt;/span&gt; &lt;span class="pl-k"&gt;=&lt;/span&gt; &lt;span class="pl-c1"&gt;top_n&lt;/span&gt;.&lt;span class="pl-c1"&gt;id&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;a href="https://datasette.simonwillison.net/simonwillisonblog?sql=with+input+as+%28%0D%0A++select%0D%0A++++embedding%0D%0A++from%0D%0A++++blog_entry_embeddings%0D%0A++where%0D%0A++++id+%3D+%3Aentry_id%0D%0A%29%2C%0D%0Atop_n+as+%28%0D%0A++select+value+as+id+from+json_each%28%0D%0A++++faiss_search%28%0D%0A++++++%27simonwillisonblog%27%2C%0D%0A++++++%27blog_entry_embeddings%27%2C%0D%0A++++++input.embedding%2C%0D%0A++++++20%0D%0A++++%29%0D%0A++%29%2C+input%0D%0A%29%0D%0Aselect%0D%0A++blog_entry.id%2C%0D%0A++blog_entry.title%0D%0Afrom%0D%0A++blog_entry%0D%0A++join+top_n+on+blog_entry.id+%3D+top_n.id&amp;amp;entry_id=7977"&gt;This one runs&lt;/a&gt; in 4.8ms!&lt;/p&gt;
&lt;p&gt;&lt;code&gt;faiss_search(database_name, table_name, embedding, n)&lt;/code&gt; returns a JSON array of the top &lt;code&gt;n&lt;/code&gt; IDs from the specified embeddings table, based on distance scores from the provided &lt;code&gt;embedding&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;json_each()&lt;/code&gt; trick here is a workaround for the fact that Python's SQLite driver doesn't yet provide an easy way to write table-valued functions - SQL functions that return something in the shape of a table.&lt;/p&gt;
&lt;p&gt;Instead, I use &lt;code&gt;json_each()&lt;/code&gt; to turn the string JSON array of IDs from &lt;code&gt;faiss_search()&lt;/code&gt; into a table that I can run further joins against.&lt;/p&gt;
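&lt;p&gt;Here's a minimal sketch of that &lt;code&gt;json_each()&lt;/code&gt; trick using only Python's standard library. It is not the plugin's code: &lt;code&gt;top_ids()&lt;/code&gt; is a hypothetical stand-in for &lt;code&gt;faiss_search()&lt;/code&gt; that returns a hard-coded JSON array, the table contents are invented, and it assumes your SQLite build includes the JSON functions.&lt;/p&gt;

```python
import json
import sqlite3


def top_ids(n):
    # Hypothetical stand-in for faiss_search(): a scalar SQL function can
    # only return a single value, so the IDs are packed into a JSON string.
    ranked = [3, 1, 2]  # invented ranking, most similar first
    return json.dumps(ranked[:n])


conn = sqlite3.connect(":memory:")
conn.create_function("top_ids", 1, top_ids)

conn.execute("create table blog_entry (id integer primary key, title text)")
conn.executemany(
    "insert into blog_entry (id, title) values (?, ?)",
    [(1, "One"), (2, "Two"), (3, "Three"), (4, "Four")],
)

# json_each() expands the JSON array into rows with "key" (the array index)
# and "value" (the ID) columns, which can then be joined like any table.
rows = conn.execute(
    """
    select blog_entry.id, blog_entry.title
    from json_each(top_ids(:n)) as ids
    join blog_entry on blog_entry.id = ids.value
    order by ids.key
    """,
    {"n": 2},
).fetchall()
print(rows)
```

&lt;p&gt;Ordering by the &lt;code&gt;key&lt;/code&gt; column keeps the rows in the same order as the JSON array, which matters if the array is already sorted by similarity.&lt;/p&gt;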
&lt;h4&gt;Implementing semantic search with embeddings&lt;/h4&gt;
&lt;p&gt;So far we've just seen embeddings used for finding similar items. Let's implement semantic search, using a user-provided query.&lt;/p&gt;
&lt;p&gt;This is going to need an API key again, because it involves a call to OpenAI to run embeddings against the user's search query.&lt;/p&gt;
&lt;p&gt;Here's the SQL query:&lt;/p&gt;
&lt;div class="highlight highlight-source-sql"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;select&lt;/span&gt;
  value,
  &lt;span class="pl-c1"&gt;blog_entry&lt;/span&gt;.&lt;span class="pl-c1"&gt;title&lt;/span&gt;,
  substr(&lt;span class="pl-c1"&gt;blog_entry&lt;/span&gt;.&lt;span class="pl-c1"&gt;body&lt;/span&gt;, &lt;span class="pl-c1"&gt;1&lt;/span&gt;, &lt;span class="pl-c1"&gt;500&lt;/span&gt;)
&lt;span class="pl-k"&gt;from&lt;/span&gt;
  json_each(
    faiss_search(
      &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;simonwillisonblog&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;,
      &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;blog_entry_embeddings&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;,
      (
        &lt;span class="pl-k"&gt;select&lt;/span&gt;
          openai_embedding(:query, :_cookie_openai_api_key)
      ),
      &lt;span class="pl-c1"&gt;10&lt;/span&gt;
    )
  )
  &lt;span class="pl-k"&gt;join&lt;/span&gt; blog_entry &lt;span class="pl-k"&gt;on&lt;/span&gt; value &lt;span class="pl-k"&gt;=&lt;/span&gt; &lt;span class="pl-c1"&gt;blog_entry&lt;/span&gt;.&lt;span class="pl-c1"&gt;id&lt;/span&gt;
  &lt;span class="pl-k"&gt;where&lt;/span&gt; length(coalesce(:query, &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;)) &lt;span class="pl-k"&gt;&amp;gt;&lt;/span&gt; &lt;span class="pl-c1"&gt;0&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;a href="https://datasette.simonwillison.net/simonwillisonblog/embedding_search"&gt;Try that here&lt;/a&gt; (with extra some cosmetic tricks.)&lt;/p&gt;
&lt;p&gt;We're using a new function here: &lt;code&gt;openai_embedding()&lt;/code&gt; - which takes some text and an API key and returns an embedding for that text.&lt;/p&gt;
&lt;p&gt;The API key comes from &lt;code&gt;:_cookie_openai_api_key&lt;/code&gt; - this is a special Datasette mechanism called &lt;a href="https://docs.datasette.io/en/stable/sql_queries.html#magic-parameters"&gt;magic parameters&lt;/a&gt; which can read variables from cookies.&lt;/p&gt;
&lt;p&gt;The &lt;a href="https://datasette.io/plugins/datasette-cookies-for-magic-parameters"&gt;datasette-cookies-for-magic-parameters&lt;/a&gt; plugin notices these and turns them into an interface for the user to populate the cookies with, decsribed earlier.&lt;/p&gt;
&lt;p&gt;One last trick: adding &lt;code&gt;where length(coalesce(:query, '')) &amp;gt; 0&lt;/code&gt; to the query means that the query won't run if the user hasn't entered any text into the search box.&lt;/p&gt;
&lt;h4&gt;Constructing a prompt from semantic search query results&lt;/h4&gt;
&lt;p&gt;Getting back to our semantic search answers pattern.&lt;/p&gt;
&lt;p&gt;We need a way to construct a prompt for GPT-3 using the results of our semantic search query.&lt;/p&gt;
&lt;p&gt;There's one big catch: GPT-3 has a length limit, and it's strictly enforced. If you pass even one token over that limit you'll get an error.&lt;/p&gt;
&lt;p&gt;We want to use as much material from the top five search results as possible, leaving enough space for the rest of the prompt (the user's question and our own text) and the prompt response.&lt;/p&gt;
&lt;p&gt;I ended up solving this with another custom SQL function:&lt;/p&gt;
&lt;div class="highlight highlight-source-sql"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;select&lt;/span&gt; openai_build_prompt(content, &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;Context:&lt;/span&gt;
&lt;span class="pl-s"&gt;------------&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;, &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;------------&lt;/span&gt;
&lt;span class="pl-s"&gt;Given the above context, answer the following question: &lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-k"&gt;||&lt;/span&gt; :question,
  &lt;span class="pl-c1"&gt;500&lt;/span&gt;
  ) &lt;span class="pl-k"&gt;from&lt;/span&gt; search_results&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This function works as an &lt;em&gt;aggregate&lt;/em&gt; function - it takes a table of results and returns a single string.&lt;/p&gt;
&lt;p&gt;It takes the column to aggregate - in this case &lt;code&gt;content&lt;/code&gt; - as the first argument. Then it takes a prefix and a suffix, which are concatenated together with the aggregated content in the middle.&lt;/p&gt;
&lt;p&gt;The fourth argument is the number of tokens to allow for the response.&lt;/p&gt;
&lt;p&gt;The function then attempts to truncate each of the input values to the maximum length that will still allow them all to be concatenated together while staying inside that 4,000 token limit.&lt;/p&gt;
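&lt;p&gt;To illustrate the shape of an aggregate like this, here's a rough sketch built on &lt;code&gt;sqlite3.create_aggregate()&lt;/code&gt;. It is not the plugin's implementation. The name &lt;code&gt;build_prompt&lt;/code&gt;, the &lt;code&gt;count_tokens()&lt;/code&gt; helper and the sample rows are all hypothetical, whitespace-separated words stand in for real tokens, and every row gets an equal share of the budget, which is a simplification.&lt;/p&gt;

```python
import sqlite3

TOKEN_LIMIT = 4000  # the overall limit referred to above


def count_tokens(text):
    # Assumption: count whitespace-separated words instead of real tokens.
    return len(text.split())


class BuildPrompt:
    """Aggregate that fits every row between a prefix and a suffix."""

    def __init__(self):
        self.texts = []
        self.prefix = ""
        self.suffix = ""
        self.completion_tokens = 0

    def step(self, text, prefix, suffix, completion_tokens):
        # SQLite calls this once per row in the aggregated table.
        self.texts.append(text or "")
        self.prefix = prefix
        self.suffix = suffix
        self.completion_tokens = completion_tokens

    def finalize(self):
        # Called once at the end; returns the single aggregated string.
        if not self.texts:
            return None
        reserved = (
            count_tokens(self.prefix)
            + count_tokens(self.suffix)
            + self.completion_tokens
        )
        budget = max(TOKEN_LIMIT - reserved, 0)
        per_text = budget // len(self.texts)  # equal share for each row
        truncated = [" ".join(t.split()[:per_text]) for t in self.texts]
        return self.prefix + "\n".join(truncated) + self.suffix


conn = sqlite3.connect(":memory:")
conn.create_aggregate("build_prompt", 4, BuildPrompt)
conn.execute("create table search_results (content text)")
conn.executemany(
    "insert into search_results (content) values (?)",
    [("word " * 3000,), ("term " * 3000,)],  # two deliberately long rows
)
prompt = conn.execute(
    "select build_prompt(content, ?, ?, 500) from search_results",
    ("Context:\n", "\nQuestion: why?"),
).fetchone()[0]
print(count_tokens(prompt))
```

&lt;p&gt;A real version would need to count tokens with the model's own tokenizer, since word counts and token counts can differ considerably.&lt;/p&gt;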
&lt;h4&gt;Adding it all together&lt;/h4&gt;
&lt;p&gt;With all of the above in place, the following query is my full implementation of semantic search answers against my blog:&lt;/p&gt;
&lt;div class="highlight highlight-source-sql"&gt;&lt;pre&gt;with query &lt;span class="pl-k"&gt;as&lt;/span&gt; (
  &lt;span class="pl-k"&gt;select&lt;/span&gt;
    openai_embedding(:question, :_cookie_openai_api_key) &lt;span class="pl-k"&gt;as&lt;/span&gt; q
),
top_n &lt;span class="pl-k"&gt;as&lt;/span&gt; (
  &lt;span class="pl-k"&gt;select&lt;/span&gt;
    value
  &lt;span class="pl-k"&gt;from&lt;/span&gt; json_each(
    faiss_search(
      &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;simonwillisonblog&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;,
      &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;blog_entry_embeddings&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;,
      (&lt;span class="pl-k"&gt;select&lt;/span&gt; q &lt;span class="pl-k"&gt;from&lt;/span&gt; query),
      &lt;span class="pl-c1"&gt;5&lt;/span&gt;
    )
  )
  &lt;span class="pl-k"&gt;where&lt;/span&gt; length(coalesce(:question, &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;)) &lt;span class="pl-k"&gt;&amp;gt;&lt;/span&gt; &lt;span class="pl-c1"&gt;0&lt;/span&gt;
),
texts &lt;span class="pl-k"&gt;as&lt;/span&gt; (
  &lt;span class="pl-k"&gt;select&lt;/span&gt; &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;Created: &lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-k"&gt;||&lt;/span&gt; created &lt;span class="pl-k"&gt;||&lt;/span&gt; &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;, Title: &lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-k"&gt;||&lt;/span&gt; title &lt;span class="pl-k"&gt;||&lt;/span&gt; 
  &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;, Body: &lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-k"&gt;||&lt;/span&gt; openai_strip_tags(body) &lt;span class="pl-k"&gt;as&lt;/span&gt; &lt;span class="pl-k"&gt;text&lt;/span&gt;
  &lt;span class="pl-k"&gt;from&lt;/span&gt; blog_entry &lt;span class="pl-k"&gt;where&lt;/span&gt; id &lt;span class="pl-k"&gt;in&lt;/span&gt; (&lt;span class="pl-k"&gt;select&lt;/span&gt; value &lt;span class="pl-k"&gt;from&lt;/span&gt; top_n)
),
prompt &lt;span class="pl-k"&gt;as&lt;/span&gt; (
  &lt;span class="pl-k"&gt;select&lt;/span&gt; openai_build_prompt(&lt;span class="pl-k"&gt;text&lt;/span&gt;, &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;Context:&lt;/span&gt;
&lt;span class="pl-s"&gt;------------&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;, &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;------------&lt;/span&gt;
&lt;span class="pl-s"&gt;Given the above context, answer the following question: &lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-k"&gt;||&lt;/span&gt; :question,
  &lt;span class="pl-c1"&gt;500&lt;/span&gt;
  ) &lt;span class="pl-k"&gt;as&lt;/span&gt; prompt &lt;span class="pl-k"&gt;from&lt;/span&gt; texts
)
&lt;span class="pl-k"&gt;select&lt;/span&gt;
  &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;Response&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-k"&gt;as&lt;/span&gt; title,
  openai_davinci(
    prompt,
    &lt;span class="pl-c1"&gt;500&lt;/span&gt;,
    &lt;span class="pl-c1"&gt;0&lt;/span&gt;.&lt;span class="pl-c1"&gt;7&lt;/span&gt;,
    :_cookie_openai_api_key
  ) &lt;span class="pl-k"&gt;as&lt;/span&gt; value
  &lt;span class="pl-k"&gt;from&lt;/span&gt; prompt
  &lt;span class="pl-k"&gt;where&lt;/span&gt; length(coalesce(:question, &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;)) &lt;span class="pl-k"&gt;&amp;gt;&lt;/span&gt; &lt;span class="pl-c1"&gt;0&lt;/span&gt;
&lt;span class="pl-k"&gt;union all&lt;/span&gt;
&lt;span class="pl-k"&gt;select&lt;/span&gt;
  &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;Prompt&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-k"&gt;as&lt;/span&gt; title,
  prompt &lt;span class="pl-k"&gt;from&lt;/span&gt; prompt&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;As you can see, I really like using CTEs (the &lt;code&gt;with name as (...)&lt;/code&gt; pattern) to assemble complex queries like this.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;texts as ...&lt;/code&gt; CTE is where I strip HTML tags from my content (using another custom function from the &lt;code&gt;datasette-openai&lt;/code&gt; plugin called &lt;code&gt;openai_strip_tags()&lt;/code&gt;) and assemble it along with the &lt;code&gt;Created&lt;/code&gt; and &lt;code&gt;Title&lt;/code&gt; metadata. Adding these gave the system a better chance of answering questions like "When did Natalie and Simon get married?" with the correct year.&lt;/p&gt;
&lt;p&gt;The last part of this query uses a handy debugging trick: it returns two rows via a &lt;code&gt;union all&lt;/code&gt; - the first has a &lt;code&gt;Response&lt;/code&gt; label and shows the response from GPT-3, while the second has a &lt;code&gt;Prompt&lt;/code&gt; label and shows the prompt that I passed to the model.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2023/semantic-search-answer.jpg" alt="A Datasette form page. Question is When did Natalie and Simon get married?. Answer is Natalie and Simon got married on Saturday the 5th of June in 2010. The prompt is then displayed, which is a whole bunch of text from relevant blog entries." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;h4&gt;Next steps&lt;/h4&gt;
&lt;p&gt;There are &lt;em&gt;so many ways&lt;/em&gt; to improve this system.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Smarter prompt design. My prompt here is the first thing that I got to work - I'm certain there are all kinds of tricks that could be used to make this more effective.&lt;/li&gt;
&lt;li&gt;Better selection of the content to include in the prompt. I'm using embedding search but then truncating to the first portion: a smarter implementation would attempt to crop out the most relevant parts of each entry, maybe by using embeddings against smaller chunks of text.
&lt;ul&gt;
  &lt;li&gt;Yoz &lt;a href="https://mastodon.social/@yoz/109684742557484622"&gt;tipped me off&lt;/a&gt; to &lt;a href="https://gpt-index.readthedocs.io/en/latest/"&gt;GPT Index&lt;/a&gt;, a project which aims to solve this exact problem by using a pre-trained LLM to help summarize text to better fit in a prompt used for these kinds of queries.&lt;/li&gt;
&lt;li&gt;Spotted &lt;a href="https://twitter.com/theseamouse/status/1614453236349693953"&gt;this idea&lt;/a&gt; from Hassan Hayat: "don't embed the question when searching. Ask GPT-3 to generate a fake answer, embed this answer, and use this to search". See also this paper about &lt;a href="https://arxiv.org/abs/2212.10496"&gt;Hypothetical Document
Embeddings&lt;/a&gt;, via &lt;a href="https://twitter.com/mathemagic1an/status/1615378778863157248"&gt;Jay Hack&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Hold out for GPT-4: I've heard rumours that the next version of the model will have a significantly larger token limit, which should make this mechanism work much better.&lt;/li&gt;
&lt;/ul&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/search"&gt;search&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sqlite"&gt;sqlite&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gpt-3"&gt;gpt-3&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/vector-search"&gt;vector-search&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/embeddings"&gt;embeddings&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/rag"&gt;rag&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-search"&gt;ai-assisted-search&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gpt"&gt;gpt&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="projects"/><category term="search"/><category term="sqlite"/><category term="ai"/><category term="datasette"/><category term="gpt-3"/><category term="generative-ai"/><category term="vector-search"/><category term="llms"/><category term="embeddings"/><category term="rag"/><category term="ai-assisted-search"/><category term="gpt"/></entry><entry><title>Vector search engines</title><link href="https://simonwillison.net/2003/Mar/1/vectorSearchEngines/#atom-tag" rel="alternate"/><published>2003-03-01T13:07:18+00:00</published><updated>2003-03-01T13:07:18+00:00</updated><id>https://simonwillison.net/2003/Mar/1/vectorSearchEngines/#atom-tag</id><summary type="html">
    &lt;p&gt;&lt;a href="http://www.perl.com/pub/a/2003/02/19/engine.html"&gt;Building a Vector Space Search Engine in Perl&lt;/a&gt; by Maciej Cegłowski:&lt;/p&gt;

&lt;blockquote cite="http://www.perl.com/pub/a/2003/02/19/engine.html"&gt;
&lt;p&gt;Vector-space search engines use the notion of a &lt;strong&gt;term space&lt;/strong&gt;, where each document is represented as a vector in a high-dimensional space. There are as many dimensions as there are unique words in the entire collection. Because a document's position in the term space is determined by the words it contains, documents with many words in common end up close together, while documents with few shared words end up far apart.&lt;/p&gt;

&lt;p&gt;To search our collection, we project a query into this term space and calculate the distance from the query vector to all the document vectors in turn. Those documents that are within a certain threshold distance get added to our result set. If all this sounds like gobbledygook to you, then don't worry - it will become clearer when we write the code.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Having done a course on Linear Algebra last term, it's interesting to see how it can be applied to the search problem. The technique described lends itself well to finding "similar documents" as well, as documents with similar word content will end up "near" to each other when projected on to the vector space.&lt;/p&gt;
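
&lt;p&gt;As a rough illustration of the idea in the quote, here is a tiny term-space search sketched in Python rather than Perl. It is not the article's code: the sample documents, the function names and the 0.3 threshold are all invented, there is no stemming, and it ranks by cosine similarity (higher means closer) rather than by a distance.&lt;/p&gt;

```python
import math
from collections import Counter


def vectorize(text):
    # One dimension per unique word; the value is how often the word occurs.
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    # Dot product over shared words, divided by the two vector lengths.
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    length_a = math.sqrt(sum(count * count for count in a.values()))
    length_b = math.sqrt(sum(count * count for count in b.values()))
    if length_a == 0 or length_b == 0:
        return 0.0
    return dot / (length_a * length_b)


# Invented sample collection
documents = {
    "cats": "the cat sat on the mat",
    "dogs": "the dog chased the ball",
    "pets": "the cat and the dog sat together",
}
vectors = {name: vectorize(text) for name, text in documents.items()}


def search(query, threshold=0.3):
    # Project the query into the term space and compare it with every document.
    query_vector = vectorize(query)
    scored = [
        (cosine_similarity(query_vector, vector), name)
        for name, vector in vectors.items()
    ]
    return [name for score, name in sorted(scored, reverse=True) if score >= threshold]


results = search("cat sat")
print(results)
```

&lt;p&gt;Finding "similar documents" works the same way: use one of the stored document vectors as the query instead of a vector built from search terms.&lt;/p&gt;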

&lt;p&gt;The article is also yet another demonstration of how Perl's modules make it such a powerful tool. Lingua::Stem is used to find word "stems", providing a free algorithm for eliminating related words like cat and cats. The performance overhead of using Perl arrays to represent large vectors is avoided with the &lt;acronym title="Perl Data Language"&gt;PDL&lt;/acronym&gt; module, which implements a whole set of matrix algebra functions in compiled C for high performance. Without these two modules the technique described would be a great deal less powerful. Of course, neither of them is available for &lt;acronym title="PHP: HyperText Preprocessor"&gt;PHP&lt;/acronym&gt; or Python, my scripting languages of choice.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/maciej-ceglowski"&gt;maciej-ceglowski&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/perl"&gt;perl&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/vector-search"&gt;vector-search&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="maciej-ceglowski"/><category term="perl"/><category term="vector-search"/></entry></feed>