Simon Willison's Weblog: vector-search

Scaling HNSWs

2025-11-11T23:38:39+00:00

Salvatore Sanfilippo spent much of this year working on vector sets for Redis, which first shipped in Redis 8 in May.

A big part of that work involved implementing HNSW - Hierarchical Navigable Small World - an indexing technique first introduced in this 2016 paper by Yu. A. Malkov and D. A. Yashunin.

Salvatore's detailed notes on the Redis implementation here offer an immersive trip through a fascinating modern field of computer science. He describes several new contributions he's made to the HNSW algorithm, mainly around efficient deletion and updating of existing indexes.

Since embedding vectors are notoriously memory-hungry I particularly appreciated this note about how you can scale a large HNSW vector set across many different nodes and run parallel queries against them for both reads and writes:

[...] if you have different vectors about the same use case split in different instances / keys, you can ask VSIM for the same query vector into all the instances, and add the WITHSCORES option (that returns the cosine distance) and merge the results client-side, and you have magically scaled your hundred of millions of vectors into multiple instances, splitting your dataset N times [One interesting thing about such a use case is that you can query the N instances in parallel using multiplexing, if your client library is smart enough].

Another very notable thing about HNSWs exposed in this raw way, is that you can finally scale writes very easily. Just hash your element modulo N, and target the resulting Redis key/instance. Multiple instances can absorb the (slow, but still fast for HNSW standards) writes at the same time, parallelizing an otherwise very slow process.
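The two ideas in that quote, routing writes with a hash modulo N and merging per-instance results on the client, can be sketched in plain Python. This is an illustration only: it makes no Redis calls, the function names are hypothetical, and it assumes each instance returns (element, score) pairs where a higher score means more similar.

```python
import hashlib
import heapq


def shard_for(element: str, num_shards: int) -> int:
    """Pick a shard by hashing the element name modulo N.

    A stable hash is used instead of Python's built-in hash(), which is
    randomized per process and would route the same element differently
    on different clients.
    """
    digest = hashlib.sha1(element.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def merge_results(per_shard_results, k):
    """Merge (element, score) lists from every shard into a global top-k.

    Assumption: higher scores mean more similar, and scores from
    different shards are directly comparable.
    """
    all_hits = (hit for hits in per_shard_results for hit in hits)
    return heapq.nlargest(k, all_hits, key=lambda hit: hit[1])


# Made-up results from two hypothetical instances.
shard_a = [("The Matrix", 0.99), ("Ex Machina", 0.87)]
shard_b = [("Inception", 0.91), ("Tron", 0.80)]
top_three = merge_results([shard_a, shard_b], k=3)
```

Each instance only needs to return its own top k, because the global top k is always contained in the union of the per-instance top k lists.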

It's always exciting to see new implementations of fundamental algorithms and data structures like this make it into Redis because Salvatore's C code is so clearly commented and pleasant to read - here's vector-sets/hnsw.c and vector-sets/vset.c.

Via Hacker News

Tags: algorithms, c, computer-science, data-structures, redis, salvatore-sanfilippo, vector-search, embeddings

The case against pgvector

2025-11-03T20:26:10+00:00

I wasn't keen on the title of this piece but the content is great: Alex Jacobs talks through lessons learned trying to run the popular pgvector PostgreSQL vector indexing extension at scale, in particular the challenges involved in maintaining a large index with close-to-realtime updates using the IVFFlat or HNSW index types.

The section on pre- vs. post-filtering is particularly useful:

Okay but let's say you solve your index and insert problems. Now you have a document search system with millions of vectors. Documents have metadata---maybe they're marked as draft, published, or archived. A user searches for something, and you only want to return published documents.

[...] should Postgres filter on status first (pre-filter) or do the vector search first and then filter (post-filter)?

This seems like an implementation detail. It’s not. It’s the difference between queries that take 50ms and queries that take 5 seconds. It’s also the difference between returning the most relevant results and… not.
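Here's a toy sketch of the relevance half of that problem, using brute-force search over four made-up documents. It is not how pgvector executes queries (that involves approximate indexes and the query planner), and every name in it is hypothetical. It only shows why filtering after a top-k vector search can return fewer than k rows.

```python
def cosine_similarity(a, b):
    dot_product = sum(x * y for x, y in zip(a, b))
    magnitude_a = sum(x * x for x in a) ** 0.5
    magnitude_b = sum(x * x for x in b) ** 0.5
    return dot_product / (magnitude_a * magnitude_b)


def post_filter_search(query, docs, k, status):
    """Take the top k by similarity, then drop rows with the wrong status."""
    ranked = sorted(
        docs, key=lambda d: cosine_similarity(query, d["vec"]), reverse=True
    )
    return [d["id"] for d in ranked[:k] if d["status"] == status]


def pre_filter_search(query, docs, k, status):
    """Filter on status first, then rank only the surviving rows."""
    candidates = [d for d in docs if d["status"] == status]
    ranked = sorted(
        candidates, key=lambda d: cosine_similarity(query, d["vec"]), reverse=True
    )
    return [d["id"] for d in ranked[:k]]


# Made-up data: the two closest documents to the query are both drafts.
DOCS = [
    {"id": "a", "status": "draft", "vec": [1.0, 0.0]},
    {"id": "b", "status": "draft", "vec": [0.9, 0.1]},
    {"id": "c", "status": "published", "vec": [0.5, 0.5]},
    {"id": "d", "status": "published", "vec": [0.0, 1.0]},
]
QUERY = [1.0, 0.0]

post = post_filter_search(QUERY, DOCS, k=2, status="published")  # []
pre = pre_filter_search(QUERY, DOCS, k=2, status="published")    # ["c", "d"]
```

The post-filter version comes back empty even though two published documents exist, because both slots in the top 2 were taken by drafts.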

The Hacker News thread for this article attracted a robust discussion, including some fascinating comments by Discourse developer Rafael dos Santos Silva (xfalcox) about how they are using pgvector at scale:

We [run pgvector in production] at Discourse, in thousands of databases, and it's leveraged in most of the billions of page views we serve. [...]

Also worth mentioning that we use quantization extensively:

halfvec (16bit float) for storage
bit (binary vectors) for indexes

Which makes the storage cost and on-going performance good enough that we could enable this in all our hosting. [...]

In Discourse embeddings power:

Related Topics, a list of topics to read next, which uses embeddings of the current topic as the key to search for similar ones

Suggesting tags and categories when composing a new topic

Augmented search

RAG for uploaded files

Via Hacker News

Tags: postgresql, scaling, vector-search, embeddings

Quoting James Luan

2025-09-08T16:24:24+00:00

I recently spoke with the CTO of a popular AI note-taking app who told me something surprising: they spend twice as much on vector search as they do on OpenAI API calls. Think about that for a second. Running the retrieval layer costs them more than paying for the LLM itself.

— James Luan, Engineering architect of Milvus

Tags: vector-search, embeddings

Cursor: Security

2025-05-11T19:15:46+00:00

Cursor's security documentation page includes a surprising amount of detail about how the Cursor text editor's backend systems work.

I've recently learned that checking an organization's list of documented subprocessors is a great way to get a feel for how everything works under the hood - it's a loose "view source" for their infrastructure! That was how I confirmed that Anthropic's search features used Brave search back in March.

Cursor's list includes AWS, Azure and GCP (AWS for primary infrastructure, Azure and GCP for "some secondary infrastructure"). They host their own custom models on Fireworks and make API calls out to OpenAI, Anthropic, Gemini and xAI depending on user preferences. They're using turbopuffer as a hosted vector store.

The most interesting section is about codebase indexing:

Cursor allows you to semantically index your codebase, which allows it to answer questions with the context of all of your code as well as write better code by referencing existing implementations. […]

At our server, we chunk and embed the files, and store the embeddings in Turbopuffer. To allow filtering vector search results by file path, we store with every vector an obfuscated relative file path, as well as the line range the chunk corresponds to. We also store the embedding in a cache in AWS, indexed by the hash of the chunk, to ensure that indexing the same codebase a second time is much faster (which is particularly useful for teams).

At inference time, we compute an embedding, let Turbopuffer do the nearest neighbor search, send back the obfuscated file path and line range to the client, and read those file chunks on the client locally. We then send those chunks back up to the server to answer the user’s question.

When operating in privacy mode - which they say is enabled by 50% of their users - they are careful not to store any raw code on their servers for longer than the duration of a single request. This is why they store the embeddings and obfuscated file paths but not the code itself.
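The caching idea in that description can be sketched like this. It's my own illustration rather than Cursor's code: the class name and the fake_embed function are hypothetical stand-ins, and an in-memory dict takes the place of whatever cache they run in AWS.

```python
import hashlib


def fake_embed(text):
    """Hypothetical stand-in for a real embedding model."""
    return [float(len(text)), float(text.count(" "))]


class EmbeddingCache:
    """Cache embeddings keyed by the SHA-256 hash of the chunk text.

    Only the hash and the vector are stored, never the chunk itself.
    """

    def __init__(self, embed_fn):
        self._embed_fn = embed_fn
        self._store = {}
        self.misses = 0

    def get(self, chunk: str):
        key = hashlib.sha256(chunk.encode("utf-8")).hexdigest()
        if key not in self._store:
            # Only call the (expensive) embedding model on a cache miss.
            self.misses += 1
            self._store[key] = self._embed_fn(chunk)
        return self._store[key]


cache = EmbeddingCache(fake_embed)
first = cache.get("def add(a, b): return a + b")
second = cache.get("def add(a, b): return a + b")  # served from the cache
```

Because the key depends only on the chunk's content, a teammate indexing the same file would hit the same cache entries.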

Reading this made me instantly think of the paper Text Embeddings Reveal (Almost) As Much As Text about how vector embeddings can be reversed. The security documentation touches on that in the notes:

Embedding reversal: academic work has shown that reversing embeddings is possible in some cases. Current attacks rely on having access to the model and embedding short strings into big vectors, which makes us believe that the attack would be somewhat difficult to do here. That said, it is definitely possible for an adversary who breaks into our vector database to learn things about the indexed codebases.

Via lobste.rs

Tags: security, ai, generative-ai, vector-search, llms, ai-assisted-programming, embeddings, cursor

Redis is open source again

2025-05-01T17:19:36+00:00

Salvatore Sanfilippo:

Five months ago, I rejoined Redis and quickly started to talk with my colleagues about a possible switch to the AGPL license, only to discover that there was already an ongoing discussion, a very old one, too. [...]

I’ll be honest: I truly wanted the code I wrote for the new Vector Sets data type to be released under an open source license. [...]

So, honestly, while I can’t take credit for the license switch, I hope I contributed a little bit to it, because today I’m happy. I’m happy that Redis is open source software again, under the terms of the AGPLv3 license.

I'm absolutely thrilled to hear this. Redis 8.0 is out today under the new license, including a beta release of Vector Sets. I've been watching Salvatore's work on those with fascination, while sad that I probably wouldn't use it often due to the janky license. That concern is now gone. I'm looking forward to putting them through their paces!

See also Redis is now available under the AGPLv3 open source license on the Redis blog. An interesting note from that is that they are also:

Integrating Redis Stack technologies, including JSON, Time Series, probabilistic data types, Redis Query Engine and more into core Redis 8 under AGPL

That's a whole bunch of new things that weren't previously part of Redis core.

I hadn't encountered Redis Query Engine before - it looks like that's a whole set of features that turn Redis into more of an Elasticsearch-style document database complete with full-text, vector search operations and geospatial operations and aggregations. It supports search syntax that looks a bit like this:

FT.SEARCH places "museum @city:(san francisco|oakland) @shape:[CONTAINS $poly]" PARAMS 2 poly 'POLYGON((-122.5 37.7, -122.5 37.8, -122.4 37.8, -122.4 37.7, -122.5 37.7))' DIALECT 3

(Noteworthy that Elasticsearch chose the AGPL too when they switched back from the SSPL to an open source license last year).

Via Hacker News

Tags: open-source, redis, salvatore-sanfilippo, vector-search, janky-licenses

Quoting Jo Kristian Bergum

2024-12-28T14:22:29+00:00

Looking back, it's clear we overcomplicated things. While embeddings fundamentally changed how we can represent and compare content, they didn't need an entirely new infrastructure category. What we label as "vector databases" are, in reality, search engines with vector capabilities. The market is already correcting this categorization—vector search providers rapidly add traditional search features while established search engines incorporate vector search capabilities. This category convergence isn't surprising: building a good retrieval engine has always been about combining multiple retrieval and ranking strategies. Vector search is just another powerful tool in that toolbox, not a category of its own.

— Jo Kristian Bergum

Tags: search, vector-search, embeddings, jo-kristian-bergum

From where I left

2024-12-10T18:56:26+00:00

Four and a half years after he left the project, Redis creator Salvatore Sanfilippo is returning to work on Redis.

Hacking randomly was cool but, in the long run, my feeling was that I was lacking a real purpose, and every day I started to feel a bigger urgency to be part of the tech world again. At the same time, I saw the Redis community fragmenting, something that was a bit concerning to me, even as an outsider.

I'm personally still upset at the license change, but Salvatore sees it as necessary to support the commercial business model for Redis Labs. It feels to me like a betrayal of the volunteer efforts by previous contributors. I posted about that on Hacker News and Salvatore replied:

I can understand that, but the thing about the BSD license is that such value never gets lost. People are able to fork, and after a fork for the original project to still lead will be require to put something more on the table.

Salvatore's first new project is an exploration of adding vector sets to Redis. The vector similarity API he previews in this post reminds me of why I fell in love with Redis in the first place - it's clean, simple and feels obviously right to me.

VSIM top_1000_movies_imdb ELE "The Matrix"  WITHSCORES
1) "The Matrix"
2) "0.9999999403953552"
3) "Ex Machina"
4) "0.8680362105369568"
...

Tags: open-source, redis, salvatore-sanfilippo, vector-search

Hybrid full-text search and vector search with SQLite

2024-10-04T16:22:09+00:00

As part of Alex’s work on his sqlite-vec SQLite extension - adding fast vector lookups to SQLite - he’s been investigating hybrid search, where search results from both vector similarity and traditional full-text search are combined together.

The most promising approach looks to be Reciprocal Rank Fusion, which combines the top ranked items from both approaches. Here’s Alex’s SQL query:

-- the sqlite-vec KNN vector search results
with vec_matches as (
  select
    article_id,
    row_number() over (order by distance) as rank_number,
    distance
  from vec_articles
  where
    headline_embedding match lembed(:query)
    and k = :k
),
-- the FTS5 search results
fts_matches as (
  select
    rowid,
    row_number() over (order by rank) as rank_number,
    rank as score
  from fts_articles
  where headline match :query
  limit :k
),
-- combine FTS5 + vector search results with RRF
final as (
  select
    articles.id,
    articles.headline,
    vec_matches.rank_number as vec_rank,
    fts_matches.rank_number as fts_rank,
    -- RRF algorithm
    (
      coalesce(1.0 / (:rrf_k + fts_matches.rank_number), 0.0) * :weight_fts +
      coalesce(1.0 / (:rrf_k + vec_matches.rank_number), 0.0) * :weight_vec
    ) as combined_rank,
    vec_matches.distance as vec_distance,
    fts_matches.score as fts_score
  from fts_matches
  full outer join vec_matches on vec_matches.article_id = fts_matches.rowid
  join articles on articles.rowid = coalesce(fts_matches.rowid, vec_matches.article_id)
  order by combined_rank desc
)
select * from final;

I’ve been puzzled in the past over how to best do that because the distance scores from vector similarity and the relevance scores from FTS are meaningless in comparison to each other. RRF doesn’t even attempt to compare them - it uses them purely for row_number() ranking within each set and combines the results based on that.
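The same idea fits in a few lines of Python, which may make the ranking logic easier to follow than the SQL. This is a sketch, not a port of Alex's query: the function name is mine, and the default rrf_k of 60 is an assumption (a commonly used constant), where the SQL takes it as the :rrf_k parameter.

```python
def reciprocal_rank_fusion(rankings, rrf_k=60, weights=None):
    """Fuse several best-first lists of ids into one combined ranking.

    Each item scores weight / (rrf_k + rank) for every list it appears in.
    Items missing from a list contribute nothing for that list.
    """
    if weights is None:
        weights = [1.0] * len(rankings)
    scores = {}
    for ranking, weight in zip(rankings, weights):
        for rank, item in enumerate(ranking, start=1):
            scores[item] = scores.get(item, 0.0) + weight / (rrf_k + rank)
    return sorted(scores, key=lambda item: scores[item], reverse=True)


# Made-up ids: "b" is 2nd in full-text search and 1st in vector search.
fts_ranking = ["a", "b", "c"]
vec_ranking = ["b", "d", "a"]
combined = reciprocal_rank_fusion([fts_ranking, vec_ranking])
```

Here "b" ends up first: it ranks well in both lists, which beats "a" being top of one list and third in the other.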

Tags: full-text-search, search, sql, sqlite, alex-garcia, vector-search, embeddings, rag

Introducing Contextual Retrieval

2024-09-20T01:34:21+00:00

Here's an interesting new embedding/RAG technique, described by Anthropic but it should work for any embedding model against any other LLM.

One of the big challenges in implementing semantic search against vector embeddings - often used as part of a RAG system - is creating "chunks" of documents that are most likely to semantically match queries from users.

Anthropic provide this solid example where semantic chunks might let you down:

Imagine you had a collection of financial information (say, U.S. SEC filings) embedded in your knowledge base, and you received the following question: "What was the revenue growth for ACME Corp in Q2 2023?"

A relevant chunk might contain the text: "The company's revenue grew by 3% over the previous quarter." However, this chunk on its own doesn't specify which company it's referring to or the relevant time period, making it difficult to retrieve the right information or use the information effectively.

Their proposed solution is to take each chunk at indexing time and expand it using an LLM - so the above sentence would become this instead:

This chunk is from an SEC filing on ACME corp's performance in Q2 2023; the previous quarter's revenue was $314 million. The company's revenue grew by 3% over the previous quarter.

This chunk was created by Claude 3 Haiku (their least expensive model) using the following prompt template:

<document>
{{WHOLE_DOCUMENT}}
</document>
Here is the chunk we want to situate within the whole document
<chunk>
{{CHUNK_CONTENT}}
</chunk>
Please give a short succinct context to situate this chunk within the overall document for the purposes of improving search retrieval of the chunk. Answer only with the succinct context and nothing else.

Here's the really clever bit: running the above prompt for every chunk in a document could get really expensive thanks to the inclusion of the entire document in each prompt. Claude added prompt caching last month, which allows you to pay around 1/10th of the cost for tokens cached up to your specified breakpoint.
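To see why caching fits this pattern so well, here's a sketch that splits the template into the part that is identical for every chunk and the part that changes. The function names are hypothetical and no API call is made. How you mark the prefix as cacheable depends on the API you are using and isn't shown.

```python
def build_prompt_parts(document: str, chunk: str):
    """Return (shared_prefix, per_chunk_suffix) for the template above.

    The prefix contains the whole document and is the same for every
    chunk, so it is the part worth caching.
    """
    prefix = f"<document>\n{document}\n</document>\n"
    suffix = (
        "Here is the chunk we want to situate within the whole document\n"
        f"<chunk>\n{chunk}\n</chunk>\n"
        "Please give a short succinct context to situate this chunk within "
        "the overall document for the purposes of improving search retrieval "
        "of the chunk. Answer only with the succinct context and nothing else."
    )
    return prefix, suffix


def contextualize(chunk: str, context: str) -> str:
    """Prepend the generated context to the chunk before embedding it."""
    return f"{context} {chunk}"


# Made-up document and chunks, purely for illustration.
document = "ACME Corp Q2 2023 filing. Revenue grew by 3%. Costs fell by 1%."
chunks = ["Revenue grew by 3%.", "Costs fell by 1%."]
parts = [build_prompt_parts(document, chunk) for chunk in chunks]
```

Every entry in parts shares exactly the same prefix, so only the short suffix differs between requests.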

By Anthropic's calculations:

Assuming 800 token chunks, 8k token documents, 50 token context instructions, and 100 tokens of context per chunk, the one-time cost to generate contextualized chunks is $1.02 per million document tokens.

Anthropic provide a detailed notebook demonstrating an implementation of this pattern. Their eventual solution combines cosine similarity and BM25 indexing, uses embeddings from Voyage AI and adds a reranking step powered by Cohere.

The notebook also includes an evaluation set using JSONL - here's that evaluation data in Datasette Lite.

Via Alex Albert

Tags: search, ai, prompt-engineering, generative-ai, vector-search, llms, embeddings, anthropic, claude, rag, prompt-caching

OpenAI: Improve file search result relevance with chunk ranking

2024-08-30T04:03:01+00:00

I've mostly been ignoring OpenAI's Assistants API. It provides an alternative to their standard messages API where you construct "assistants", chatbots with optional access to additional tools and that store full conversation threads on the server so you don't need to pass the previous conversation with every call to their API.

I'm pretty comfortable with their existing API and I found the assistants API to be quite a bit more complicated. So far the only thing I've used it for is a script to scrape OpenAI Code Interpreter to keep track of updates to their environment's Python packages.

Code Interpreter aside, the other interesting assistants feature is File Search. You can upload files in a wide variety of formats and OpenAI will chunk them, store the chunks in a vector store and make them available to help answer questions posed to your assistant - it's their version of hosted RAG.

Prior to today OpenAI had kept the details of how this worked undocumented. I found this infuriating, because when I'm building a RAG system the details of how files are chunked and scored for relevance are the whole game - without understanding that I can't make effective decisions about what kind of documents to use and how to build on top of the tool.

This has finally changed! You can now run a "step" (a round of conversation in the chat) and then retrieve details of exactly which chunks of the file were used in the response and how they were scored using the following incantation:

run_step = client.beta.threads.runs.steps.retrieve(
    thread_id="thread_abc123",
    run_id="run_abc123",
    step_id="step_abc123",
    include=[
        "step_details.tool_calls[*].file_search.results[*].content"
    ]
)

(See what I mean about the API being a little obtuse?)

I tried this out today and the results were very promising. Here's a chat transcript with an assistant I created against an old PDF copy of the Datasette documentation - I used the above new API to dump out the full list of snippets used to answer the question "tell me about ways to use spatialite".

It pulled in a lot of content! 57,017 characters by my count, spread across 20 search results (customizable), for a total of 15,021 tokens as measured by ttok. At current GPT-4o-mini prices that would cost 0.225 cents (less than a quarter of a cent), but with regular GPT-4o it would cost 7.5 cents.
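The arithmetic behind those two figures, as a quick sketch. The per-million-token input prices are assumptions: they are the rates implied by the numbers above ($0.15 for GPT-4o-mini, $5.00 for GPT-4o) and will not stay current.

```python
def input_cost_cents(tokens, dollars_per_million_tokens):
    """Cost in cents of sending this many input tokens at the given rate."""
    return tokens / 1_000_000 * dollars_per_million_tokens * 100


TOKENS = 15_021

# Assumed input prices in dollars per million tokens.
GPT_4O_MINI_PRICE = 0.15
GPT_4O_PRICE = 5.00

mini_cents = input_cost_cents(TOKENS, GPT_4O_MINI_PRICE)  # roughly 0.225
full_cents = input_cost_cents(TOKENS, GPT_4O_PRICE)       # roughly 7.5
```

That cost applies to every question, since the retrieved chunks are sent as input each time.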

OpenAI provide up to 1GB of vector storage for free, then charge $0.10/GB/day for vector storage beyond that. My 173 page PDF seems to have taken up 728KB after being chunked and stored, so that GB should stretch a pretty long way.

Confession: I couldn't be bothered to work through the OpenAI code examples myself, so I hit Ctrl+A on that web page and copied the whole lot into Claude 3.5 Sonnet, then prompted it:

Based on this documentation, write me a Python CLI app (using the Click CLi library) with the following features:

openai-file-chat add-files name-of-vector-store *.pdf *.txt

This creates a new vector store called name-of-vector-store and adds all the files passed to the command to that store.

openai-file-chat name-of-vector-store1 name-of-vector-store2 ...

This starts an interactive chat with the user, where any time they hit enter the question is answered by a chat assistant using the specified vector stores.

We iterated on this a few times to build me a one-off CLI app for trying out the new features. It's got a few bugs that I haven't fixed yet, but it was a very productive way of prototyping against the new API.

Via @OpenAIDevs

Tags: ai, openai, generative-ai, vector-search, llms, ai-assisted-programming, embeddings, rag, claude-3-5-sonnet, ai-assisted-search

Using DuckDB for Embeddings and Vector Search

2024-06-15T14:39:18+00:00

Sören Brunk's comprehensive tutorial combining DuckDB 1.0, a subset of German Wikipedia from Hugging Face (loaded using Parquet), the BGE M3 embedding model and DuckDB's new vss extension for implementing an HNSW vector index.

Via @soebrunk

Tags: ai, parquet, duckdb, vector-search, embeddings

I'm writing a new vector search SQLite Extension

2024-05-03T03:16:39+00:00

Alex Garcia is working on sqlite-vec, a spiritual successor to his sqlite-vss project. The new SQLite C extension will have zero other dependencies (sqlite-vss used some tricky C++ libraries) and will work using virtual tables, storing chunks of vectors in shadow tables to avoid needing to load everything into memory at once.

Tags: c, sqlite, vectors, alex-garcia, vector-search, embeddings

My binary vector search is better than your FP32 vectors

2024-03-26T04:56:25+00:00

I’m still trying to get my head around this, but here’s what I understand so far.

Embedding vectors as calculated by models such as OpenAI text-embedding-3-small are arrays of floating point values, which look something like this:

[0.0051681744, 0.017187592, -0.018685209, -0.01855924, -0.04725188...] - 1536 elements long

Different embedding models have different lengths, but they tend to be hundreds up to low thousands of numbers. If each float is 32 bits that’s 4 bytes per float, which can add up to a lot of memory if you have millions of embedding vectors to compare.

If you look at those numbers you’ll note that they are all pretty small positive or negative numbers, close to 0.

Binary vector search is a trick where you take that sequence of floating point numbers and turn it into a binary vector—just a list of 1s and 0s, where you store a 1 if the corresponding float was greater than 0 and a 0 otherwise.

For the above example, this would start [1, 1, 0, 0, 0...]

Incredibly, it looks like the cosine distance between these 0 and 1 vectors captures much of the semantically relevant meaning present in the distance between the much more accurate vectors. This means you can use 1/32nd of the space and still get useful results!
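Here's the space saving worked through for a million vectors of 1,536 dimensions. The function names are mine, and the figures cover raw vector storage only, ignoring any index overhead.

```python
def float32_bytes(num_vectors, dimensions):
    """Storage for full-precision vectors: 4 bytes per dimension."""
    return num_vectors * dimensions * 4


def binary_bytes(num_vectors, dimensions):
    """Storage for binary vectors: 1 bit per dimension.

    Each vector is rounded up to a whole number of bytes.
    """
    return num_vectors * ((dimensions + 7) // 8)


full = float32_bytes(1_000_000, 1536)   # 6,144,000,000 bytes, about 6.1GB
packed = binary_bytes(1_000_000, 1536)  # 192,000,000 bytes, about 192MB
ratio = full // packed
```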

Ce Gao here suggests a further optimization: use the binary vectors for a fast brute-force lookup of the top 200 matches, then run a more expensive re-ranking against those filtered values using the full floating point vectors.
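A sketch of that two-stage approach in plain Python. One swap to be aware of: the first stage here uses Hamming distance (the number of differing bits) rather than cosine distance, since that is a common and cheap way to compare binary vectors. The function names are mine. A real system would pre-compute and bit-pack the binary vectors instead of binarizing on every query.

```python
def binarize(vector):
    """Store a 1 where the float is greater than 0, and a 0 otherwise."""
    return [1 if x > 0 else 0 for x in vector]


def hamming_distance(a, b):
    """Count the positions where two binary vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def cosine_similarity(a, b):
    dot_product = sum(x * y for x, y in zip(a, b))
    magnitude_a = sum(x * x for x in a) ** 0.5
    magnitude_b = sum(x * x for x in b) ** 0.5
    return dot_product / (magnitude_a * magnitude_b)


def search(query, vectors, shortlist_size, k):
    """vectors maps id -> float vector. Returns the top k ids."""
    query_bits = binarize(query)
    # Stage 1: cheap brute-force shortlist using the binary vectors.
    shortlist = sorted(
        vectors,
        key=lambda vid: hamming_distance(query_bits, binarize(vectors[vid])),
    )[:shortlist_size]
    # Stage 2: re-rank only the shortlist using the full float vectors.
    return sorted(
        shortlist,
        key=lambda vid: cosine_similarity(query, vectors[vid]),
        reverse=True,
    )[:k]


# Made-up three-dimensional vectors for illustration.
VECTORS = {
    "a": [0.9, 0.1, -0.2],
    "b": [0.1, 0.9, -0.2],
    "c": [-0.5, -0.5, 0.5],
}
best = search([1.0, 0.2, -0.1], VECTORS, shortlist_size=2, k=1)
```

In this toy data "a" and "b" have identical binary vectors, so only the second stage can tell them apart. That is exactly the job of the re-ranking step.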

Tags: vector-search, embeddings

Embeddings: What they are and why they matter

2023-10-23T13:36:21+00:00

Embeddings are a really neat trick that often come wrapped in a pile of intimidating jargon.

If you can make it through that jargon, they unlock powerful and exciting techniques that can be applied to all sorts of interesting problems.

I gave a talk about embeddings at PyBay 2023. This article represents an improved version of that talk, which should stand alone even without watching the video.

If you're not yet familiar with embeddings I hope to give you everything you need to get started applying them to real-world problems.

The 38 minute video version

Here's a video of the talk that I gave at PyBay:

The audio quality of the official video wasn't great due to an issue with the microphone, but I ran that audio through Adobe's Enhance Speech tool and uploaded my own video with the enhanced audio to YouTube.

What are embeddings?

Embeddings are a technology that's adjacent to the wider field of Large Language Models - the technology behind ChatGPT and Gemini and Claude.

Embeddings are based around one trick: take a piece of content - in this case a blog entry - and turn that piece of content into an array of floating point numbers.

The key thing about that array is that it will always be the same length, no matter how long the content is. The length is defined by the embedding model you are using - an array might be 300, or 1,000, or 1,536 numbers long.

The best way to think about this array of numbers is to imagine it as co-ordinates in a very weird multi-dimensional space.

It's hard to visualize 1,536 dimensional space, so here's a 3D visualization of the same idea:

Why place content in this space? Because we can learn interesting things about that content based on its location - in particular, based on what else is nearby.

The location within the space represents the semantic meaning of the content, according to the embedding model's weird, mostly incomprehensible understanding of the world. It might capture colors, shapes, concepts or all sorts of other characteristics of the content that has been embedded.

Nobody fully understands what those individual numbers mean, but we know that their locations can be used to find out useful things about the content.

One of the first problems I solved with embeddings was to build a "related content" feature for my TIL blog. I wanted to be able to show a list of related articles at the bottom of each page.

I did this using embeddings - in this case, I used the OpenAI text-embedding-ada-002 model, which is available via their API.

I currently have 472 articles on my site. I calculated the 1,536 dimensional embedding vector (array of floating point numbers) for each of those articles, and stored those vectors in my site's SQLite database.

Now, if I want to find related articles for a given article, I can calculate the cosine similarity between the embedding vector for that article and every other article in the database, then return the 10 closest matches by distance.

There's an example at the bottom of this page. The top five related articles for Geospatial SQL queries in SQLite using TG, sqlite-tg and datasette-sqlite-tg are:

Geopoly in SQLite - 2023-01-04
Viewing GeoPackage data with SpatiaLite and Datasette - 2022-12-11
Using SQL with GDAL - 2023-03-09
KNN queries with SpatiaLite - 2021-05-16
GUnion to combine geometries in SpatiaLite - 2022-04-12

That's a pretty good list!

Here's the Python function I'm using to calculate those cosine similarity distances:

def cosine_similarity(a, b):
    dot_product = sum(x * y for x, y in zip(a, b))
    magnitude_a = sum(x * x for x in a) ** 0.5
    magnitude_b = sum(x * x for x in b) ** 0.5
    return dot_product / (magnitude_a * magnitude_b)

My TIL site runs on my Datasette Python framework, which supports building sites on top of a SQLite database. I wrote more about how that works in the Baked Data architectural pattern.

You can browse the SQLite table that stores the calculated embeddings at tils/embeddings.

Those are binary values. We can run this SQL query to view them as hexadecimal:

select id, hex(embedding) from embeddings

That's still not very readable though. We can use the llm_embed_decode() custom SQL function to turn them into a JSON array:

select id, llm_embed_decode(embedding) from embeddings limit 10

Try that here. It shows that each article is accompanied by that array of 1,536 floating point numbers.
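If you want to decode one of those binary values in Python instead, something like this should work. It assumes the blob is a packed sequence of little-endian 32-bit floats, which is my understanding of how LLM stores embeddings. The function names here are illustrative.

```python
import struct


def encode(values):
    """Pack a list of floats into little-endian 32-bit floats."""
    return struct.pack("<" + "f" * len(values), *values)


def decode(blob):
    """Unpack a blob of little-endian 32-bit floats into a list."""
    return list(struct.unpack("<" + "f" * (len(blob) // 4), blob))


blob = encode([1.0, -0.5, 0.25])
values = decode(blob)
```

At 4 bytes per float, a 1,536-dimension embedding takes 6,144 bytes in this format.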

We can use another custom SQL function, llm_embed_cosine(vector1, vector2), to calculate those cosine distances and find the most similar content.

That SQL function is defined here in my datasette-llm-embed plugin.

Here's a query returning the five most similar articles to my SQLite TG article:

select
  id,
  llm_embed_cosine(
    embedding,
    (
      select
        embedding
      from
        embeddings
      where
        id = 'sqlite_sqlite-tg.md'
    )
  ) as score
from
  embeddings
order by
  score desc
limit 5

Executing that query returns the following results:

id	score
sqlite_sqlite-tg.md	1.0
sqlite_geopoly.md	0.8817322855676049
spatialite_viewing-geopackage-data-with-spatialite-and-datasette.md	0.8813094978399854
gis_gdal-sql.md	0.8799581261326747
spatialite_knn.md	0.8692992294266506

As expected, the similarity between the article and itself is 1.0. The other articles are all related to geospatial SQL queries in SQLite.

This query takes around 400ms to execute. To speed things up, I pre-calculate the top 10 similarities for every article and store them in a separate table called tils/similarities.
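Here's a minimal sketch of that pre-calculation step using Python's sqlite3 module. The table layout and function names are assumptions made for illustration and are not the schema my site uses.

```python
import sqlite3


def cosine_similarity(a, b):
    dot_product = sum(x * y for x, y in zip(a, b))
    magnitude_a = sum(x * x for x in a) ** 0.5
    magnitude_b = sum(x * x for x in b) ** 0.5
    return dot_product / (magnitude_a * magnitude_b)


def build_similarities(db, embeddings, top_n=10):
    """embeddings maps id -> vector. Stores the top_n matches per id."""
    db.execute(
        "create table if not exists similarities "
        "(id text, other_id text, score real, primary key (id, other_id))"
    )
    for article_id, vector in embeddings.items():
        # Score every other article, best first.
        scored = sorted(
            (
                (cosine_similarity(vector, other), other_id)
                for other_id, other in embeddings.items()
                if other_id != article_id
            ),
            reverse=True,
        )
        db.executemany(
            "insert or replace into similarities values (?, ?, ?)",
            [(article_id, other_id, score) for score, other_id in scored[:top_n]],
        )
    db.commit()


def related(db, article_id):
    """Look up the stored matches for one article, best first."""
    rows = db.execute(
        "select other_id from similarities where id = ? order by score desc",
        (article_id,),
    )
    return [row[0] for row in rows]


# Made-up two-dimensional embeddings for three articles.
db = sqlite3.connect(":memory:")
build_similarities(db, {"a": [1, 0], "b": [0.9, 0.1], "c": [0, 1]}, top_n=1)
```

The expensive all-pairs comparison happens once at build time, so rendering a page only needs a cheap indexed lookup.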

I wrote a Python function to look up related documents from that table and called it from the template that's used to render the article page.

My Storing and serving related documents with openai-to-sqlite and embeddings TIL explains how this all works in detail, including how GitHub Actions are used to fetch new embeddings as part of the build script that deploys the site.

I used the OpenAI embeddings API for this project. It's extremely inexpensive - for my TIL website I embedded around 402,500 tokens, which at $0.0001 / 1,000 tokens comes to $0.04 - just 4 cents!

It's really easy to use: you POST it some text along with your API key, it gives you back that JSON array of floating point numbers.

But... it's a proprietary model. A few months ago OpenAI shut down some of their older embeddings models, which is a problem if you've stored large numbers of embeddings from those models since you'll need to recalculate them against a supported model if you want to be able to embed anything else new.

To OpenAI's credit, they did promise to "cover the financial cost of users re-embedding content with these new models." - but it's still a reason to be cautious about relying on proprietary models.

The good news is that there are extremely powerful openly licensed models which you can run on your own hardware, avoiding any risk of them being shut down. We'll talk about that more in a moment.

Exploring how these things work with Word2Vec

Google Research put out an influential paper 10 years ago describing an early embedding model they created called Word2Vec.

That paper is Efficient Estimation of Word Representations in Vector Space, dated 16th January 2013. It's a paper that helped kick off widespread interest in embeddings.

Word2Vec is a model that takes single words and turns them into a list of 300 numbers. That list of numbers captures something about the meaning of the associated word.

This is best illustrated by a demo.

turbomaze.github.io/word2vecjson is an interactive tool put together by Anthony Liu with a 10,000 word subset of the Word2Vec corpus. You can view this JavaScript file to see the JSON for those 10,000 words and their associated 300-long arrays of numbers.

Search for a word to find similar words based on cosine distance to their Word2Vec representation. For example, the word "france" returns the following related results:

word	similarity
france	1
french	0.7000748343471224
belgium	0.6933180492111168
paris	0.6334910653433325
germany	0.627075617939471
italy	0.6135215284228007
spain	0.6064218103692152

That's a mixture of French things and European geography.

A really interesting thing you can do here is perform arithmetic on these vectors.

Take the vector for "germany", add "paris" and subtract "france". The resulting vector is closest to "berlin"!

Something about this model has captured the idea of nationalities and geography to the point that you can use arithmetic to explore additional facts about the world.
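To make the arithmetic concrete, here's a toy version using four-dimensional vectors I invented by hand. They are not real Word2Vec output. The dimensions loosely stand for "French", "German", "Italian" and "capital city", which is far tidier than anything a real model learns.

```python
def cosine_similarity(a, b):
    dot_product = sum(x * y for x, y in zip(a, b))
    magnitude_a = sum(x * x for x in a) ** 0.5
    magnitude_b = sum(x * x for x in b) ** 0.5
    return dot_product / (magnitude_a * magnitude_b)


# Hand-made toy vectors: [french, german, italian, capital]
WORDS = {
    "france": [1.0, 0.0, 0.0, 0.0],
    "paris": [1.0, 0.0, 0.0, 1.0],
    "germany": [0.0, 1.0, 0.0, 0.0],
    "berlin": [0.0, 1.0, 0.0, 1.0],
    "italy": [0.0, 0.0, 1.0, 0.0],
    "rome": [0.0, 0.0, 1.0, 1.0],
}


def analogy(add_a, add_b, subtract, words=WORDS):
    """Return the word closest to add_a + add_b - subtract.

    The three input words are excluded from the candidates.
    """
    target = [
        a + b - c
        for a, b, c in zip(words[add_a], words[add_b], words[subtract])
    ]
    candidates = [w for w in words if w not in (add_a, add_b, subtract)]
    return max(candidates, key=lambda w: cosine_similarity(target, words[w]))


result = analogy("germany", "paris", "france")  # "berlin"
```

Subtracting "france" from "paris" leaves just the "capital city" component, and adding that to "germany" lands exactly on the toy vector for "berlin".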

Word2Vec was trained on 1.6 billion words of content. The embedding models we use today are trained on much larger datasets and capture a much richer understanding of the underlying relationships.

Calculating embeddings using my LLM tool

I've been building a command-line utility and Python library called LLM.

You can read more about LLM here:

LLM is a tool for working with Large Language Models. You can install it like this:

pip install llm

Or via Homebrew:

brew install llm

You can use it as a command-line tool for interacting with LLMs, or as a Python library.

Out of the box it can work with the OpenAI API. Set an API key and you can run commands like this:

llm 'ten fun names for a pet pelican'

Where it gets really fun is when you start installing plugins. There are plugins that add entirely new language models to it, including models that run directly on your own machine.

A few months ago I extended LLM to support plugins that can run embedding models as well.

Here's how to run the catchily titled all-MiniLM-L6-v2 model using LLM:

First, we install llm and then use that to install the llm-sentence-transformers plugin - a wrapper around the SentenceTransformers library.

pip install llm
llm install llm-sentence-transformers

Next we need to register the all-MiniLM-L6-v2 model. This will download the model from Hugging Face to your computer:

llm sentence-transformers register all-MiniLM-L6-v2

We can test that out by embedding a single sentence like this:

llm embed -m sentence-transformers/all-MiniLM-L6-v2 \
  -c 'Hello world'

This outputs a JSON array that starts like this:

[-0.03447725251317024, 0.031023245304822922, 0.006734962109476328, 0.026108916848897934, -0.03936201333999634, ...

Embeddings like this on their own aren't very interesting - we need to store and compare them to start getting useful results.

LLM can store embeddings in a "collection" - a SQLite table. The embed-multi command can be used to embed multiple pieces of content at once and store them in a collection.

That's what this next command does:

llm embed-multi readmes \
  --model sentence-transformers/all-MiniLM-L6-v2 \
  --files ~/ '**/README.md' --store

Here we are populating a collection called "readmes".

The --files option takes two arguments: a directory to search and a glob pattern to match against filenames. In this case I'm searching my home directory recursively for any file named README.md.

The --store option causes LLM to store the raw text in the SQLite table in addition to the embedding vector.

This command took around 30 minutes to run on my computer, but it worked! I now have a collection called readmes with 16,796 rows - one for each README.md file it found in my home directory.

Vibes-based search

Now that we have a collection of embeddings, we can run searches against it using the llm similar command:

llm similar readmes -c 'sqlite backup tools' | jq .id

We are asking for items in the readmes collection that are similar to the embedding vector for the phrase "sqlite backup tools".

This command outputs JSON by default, which includes the full text of the README files since we stored them using --store earlier.

Piping the results through jq .id causes the command to output just the IDs of the matching rows.

The top matching results are:

"sqlite-diffable/README.md"
"sqlite-dump/README.md"
"ftstri/sqlite/ext/repair/README.md"
"simonw/README.md"
"sqlite-generate/README.md"
"sqlite-history/README.md"
"dbf-to-sqlite/README.md"
"ftstri/sqlite/ext/README.md"
"sqlite-utils/README.md"
"ftstri/sqlite/README.md"

These are good results! Each of these READMEs either describes a tool for working with SQLite backups or a project that relates to backups in some way.

What's interesting about this is that it's not guaranteed that the term "backups" appeared directly in the text of those READMEs. The content is semantically similar to that phrase, but might not be an exact textual match.

We can call this semantic search. I like to think of it as vibes-based search.

The vibes of those READMEs relate to our search term, according to this weird multi-dimensional space representation of the meaning of words.

This is absurdly useful. If you've ever built a search engine for a website, you know that exact matches don't always help people find what they are looking for.

We can use this kind of semantic search to build better search engines for a whole bunch of different kinds of content.

Embeddings for code using Symbex

Another tool I've been building is called Symbex. It's a tool for exploring the symbols in a Python codebase.

I originally built it to help quickly find Python functions and classes and pipe them into LLMs to help explain and rewrite them.

Then I realized that I could use it to calculate embeddings for all of the functions in a codebase, and use those embeddings to build a code search engine.

I added a feature that could output JSON or CSV representing the symbols it found, using the same output format that llm embed-multi can use as an input.

Here's how I built a collection of all of the functions in my Datasette project, using a newly released model called gte-tiny - just a 60MB file!

llm sentence-transformers register TaylorAI/gte-tiny

cd datasette/datasette

symbex '*' '*:*' --nl | \
  llm embed-multi functions - \
  --model sentence-transformers/TaylorAI/gte-tiny \
  --format nl \
  --store

symbex '*' '*:*' --nl finds all functions (*) and class methods (the *:* pattern) in the current directory and outputs them as newline-delimited JSON.

The llm embed-multi ... --format nl command expects newline-delimited JSON as input, so we can pipe the output of symbex directly into it.

This defaults to storing the embeddings in the default LLM SQLite database. You can add --database /tmp/data.db to specify an alternative location.

And now... I can run vibes-based semantic search against my codebase!

I could use the llm similar command for this, but I also have the ability to run these searches using Datasette itself.

Here's a SQL query for that, using the datasette-llm-embed plugin from earlier:

with input as (
  select
    llm_embed(
      'sentence-transformers/TaylorAI/gte-tiny',
      :input
    ) as e
)
select
  id,
  content
from
  embeddings,
  input
where
  collection_id = (
    select id from collections where name = 'functions'
  )
order by
  llm_embed_cosine(embedding, input.e) desc
limit 5

The :input parameter is automatically turned into a form field by Datasette.

When I run this, I get back functions that relate to the concept of listing plugins:

The key idea here is to use SQLite as an integration point - a substrate for combining together multiple tools.

I can run separate tools that extract functions from a codebase, run them through an embedding model, write those embeddings to SQLite and then run queries against the results.

Anything that can be piped into a tool can now be embedded and processed by the other components of this ecosystem.

Embedding text and images together using CLIP

My current favorite embedding model is CLIP.

CLIP is a fascinating model released by OpenAI - back in January 2021, when they were still doing most things in the open - that can embed both text and images.

Crucially, it embeds them both into the same vector space.

If you embed the string "dog", you'll get a location in 512 dimensional space (depending on your CLIP configuration).

If you embed a photograph of a dog, you'll get a location in that same space... and it will be close in terms of distance to the location of the string "dog"!

This means we can search for related images using text, and search for related text using images.

I built an interactive demo to help explain how this works. The demo is an Observable notebook that runs the CLIP model directly in the browser.

It's a pretty heavy page - it has to load 158MB of resources (64.6MB for the CLIP text model and 87.6MB for the image model) - but once loaded you can use it to embed an image, then embed a string of text and calculate the distance between the two.

I can give it this photo I took of a beach:

Then type in different text strings to calculate a similarity score, here displayed as a percentage:

text	score
beach	26.946%
city	19.839%
sunshine	24.146%
sunshine beach	26.741%
california	25.686%
california beach	27.427%

It's pretty amazing that we can do all of this in JavaScript running in the browser!

There's an obvious catch: it's not actually that useful to be able to take an arbitrary photo and say "how similar is this to the term 'city'?".

The trick is to build additional interfaces on top of this. Once again, we have the ability to build vibes-based search engines.

Here's a great example of one of those.

Faucet Finder: finding faucets with CLIP

Drew Breunig used LLM and my llm-clip plugin to build a search engine for faucet taps.

He was renovating his bathroom, and he needed to buy new faucets. So he scraped 20,000 photographs of faucets from a faucet supply company and ran CLIP against all of them.

He used the result to build Faucet Finder - a custom tool (deployed using Datasette) for finding faucets that look similar to other faucets.

Among other things, this means you can find an expensive faucet you like and then look for visually similar cheaper options!

Drew wrote more about his project in Finding Bathroom Faucets with Embeddings.

Drew's demo uses pre-calculated embeddings to display similar results without having to run the CLIP model on the server.

Inspired by this, I spent some time figuring out how to deploy a server-side CLIP model hosted by my own Fly.io account.

Drew's Datasette instance includes this table of embedding vectors, exposed via the Datasette API.

I deployed my own instance with this API for embedding text strings, then built an Observable notebook demo that hits both APIs and combines the results.

observablehq.com/@simonw/search-for-faucets-with-clip-api

Now I can search for things like "gold purple" and get back vibes-based faucet results:

Being able to spin up this kind of ultra-specific search engine in a few hours is exactly the kind of trick that excites me about having embeddings as a tool in my toolbox.

Clustering embeddings

Related content and semantic / vibes-based search are the two most common applications of embeddings, but there are a bunch of other neat things you can do with them too.

One of those is clustering.

I built a plugin for this called llm-cluster which implements this using sklearn.cluster from scikit-learn.

To demonstrate that, I used my paginate-json tool and the GitHub issues API to collect the titles of all of the issues in my simonw/llm repository into a collection called llm-issues:

paginate-json 'https://api.github.com/repos/simonw/llm/issues?state=all&filter=all' \
  | jq '[.[] | {id: .id, title: .title}]' \
  | llm embed-multi llm-issues - \
  --store

Now I can create 10 clusters of issues like this:

llm install llm-cluster

llm cluster llm-issues 10

Clusters are output as a JSON array, with output that looks something like this (truncated):

[
  {
    "id": "2",
    "items": [
      {
        "id": "1650662628",
        "content": "Initial design"
      },
      {
        "id": "1650682379",
        "content": "Log prompts and responses to SQLite"
      }
    ]
  },
  {
    "id": "4",
    "items": [
      {
        "id": "1650760699",
        "content": "llm web command - launches a web server"
      },
      {
        "id": "1759659476",
        "content": "`llm models` command"
      },
      {
        "id": "1784156919",
        "content": "`llm.get_model(alias)` helper"
      }
    ]
  },
  {
    "id": "7",
    "items": [
      {
        "id": "1650765575",
        "content": "--code mode for outputting code"
      },
      {
        "id": "1659086298",
        "content": "Accept PROMPT from --stdin"
      },
      {
        "id": "1714651657",
        "content": "Accept input from standard in"
      }
    ]
  }
]

These do appear to be related, but we can do better. The llm cluster command has a --summary option which causes it to pass the resulting cluster text through an LLM and use it to generate a descriptive name for each cluster:

llm cluster llm-issues 10 --summary

This gives back names like "Log Management and Interactive Prompt Tracking" and "Continuing Conversation Mechanism and Management". See the README for more details.
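To give a feel for what a clustering algorithm does with embedding vectors, here's a tiny k-means sketch in plain Python. This is not how llm-cluster is implemented (it uses sklearn.cluster from scikit-learn). The two-number points and the choice to seed the centroids with the first k points are simplifications made for this example.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iterations=10):
    """Assign each point to one of k clusters; returns a list of cluster indexes."""
    # Simplification: seed centroids with the first k points (real libraries pick smarter seeds)
    centroids = [list(p) for p in points[:k]]
    assignments = []
    for _ in range(iterations):
        # Step 1: attach every point to its nearest centroid
        assignments = [
            min(range(k), key=lambda c: squared_distance(point, centroids[c]))
            for point in points
        ]
        # Step 2: move each centroid to the average of its members
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignments

# Toy "embeddings": two obvious groups, interleaved
points = [[0, 0], [10, 10], [0, 1], [10, 11], [1, 0], [11, 10]]
assignments = kmeans(points, k=2)
print(assignments)  # [0, 1, 0, 1, 0, 1]
```

With real embeddings each point would have hundreds of dimensions, but the two alternating steps stay the same.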

Visualize in 2D with Principal Component Analysis

The problem with massively multi-dimensional space is that it's really hard to visualize.

We can use a technique called Principal Component Analysis to reduce the dimensionality of the data to a more manageable size - and it turns out lower dimensions continue to capture useful semantic meaning about the content.

Matt Webb used the OpenAI embedding model to generate embeddings for descriptions of every episode of the BBC's In Our Time podcast. He used these to find related episodes, but also ran PCA against them to create an interactive 2D visualization.

Reducing 1,536 dimensions to just two still produces a meaningful way of exploring the data! Episodes about historic wars show up near each other, elsewhere there's a cluster of episodes about modern scientific discoveries.

Matt wrote more about this in Browse the BBC In Our Time archive by Dewey decimal code.

Scoring sentences using average locations

Another trick with embeddings is to use them for classification.

First calculate the average location for a group of embeddings that you have classified in a certain way, then compare embeddings of new content to those locations to assign it to a category.
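A minimal sketch of that average-location approach, using made-up two-number vectors in place of real embeddings and cosine similarity as the comparison (one reasonable choice among several):

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def average(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def classify(vector, centroids):
    """Return the label whose average location is most similar to vector."""
    return max(centroids, key=lambda label: cosine_similarity(vector, centroids[label]))

# Toy vectors standing in for embeddings of already-classified examples
labeled = {
    "concrete": [[0.9, 0.1], [0.8, 0.3]],
    "abstract": [[0.1, 0.9], [0.2, 0.7]],
}
centroids = {label: average(vectors) for label, vectors in labeled.items()}

print(classify([0.7, 0.2], centroids))  # concrete
print(classify([0.1, 0.6], centroids))  # abstract
```

The averages only need to be calculated once. Classifying new content is then one embedding call plus a handful of comparisons.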

Amelia Wattenberger demonstrated a beautiful example of this in Getting creative with embeddings.

She wanted to help people improve their writing by encouraging a mixture of concrete and abstract sentences. But how do you tell if a sentence of text is concrete or abstract?

Her trick was to generate samples of the two types of sentence, calculate their average locations and then score new sentences based on how close they are to either end of that newly defined spectrum.

This score can even be converted into a color loosely representing how abstract or concrete a given sentence is!

This is a really neat demonstration of the kind of creative interfaces you can start to build on top of this technology.

Answering questions with Retrieval-Augmented Generation

I'll finish with the idea that first got me excited about embeddings.

Everyone who tries out ChatGPT ends up asking the same question: how could I use a version of this to answer questions based on my own private notes, or the internal documents owned by my company?

People assume that the answer is to train a custom model on top of that content, likely at great expense.

It turns out that's not actually necessary. You can use an off-the-shelf Large Language Model (a hosted one or one that runs locally) and a trick called Retrieval Augmented Generation, or RAG.

The key idea is this: a user asks a question. You search your private documents for content that appears relevant to the question, then paste excerpts of that content into the LLM (respecting its size limit, usually between 3,000 and 6,000 words) along with the original question.

The LLM can then answer the question based on the additional content you provided.
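The "paste excerpts into the prompt" step can be sketched like this. The build_prompt() function, the word-count budget and the prompt layout are all assumptions made for illustration. Real systems usually count tokens rather than words.

```python
def build_prompt(question, excerpts, max_words=3000):
    """Pack excerpts (assumed sorted most-relevant first) into a prompt within a word budget."""
    chosen = []
    used = len(question.split())
    for excerpt in excerpts:
        words = len(excerpt.split())
        if used + words > max_words:
            break  # stop before exceeding the budget
        chosen.append(excerpt)
        used += words
    context = "\n\n".join(chosen)
    return f"{context}\n\nQuestion: {question}"

excerpts = [
    "one two three",
    "four five",
    "six seven eight nine",
]
# Tiny budget so the cut-off is easy to see: only the first two excerpts fit
prompt = build_prompt("What is it?", excerpts, max_words=8)
print(prompt)
```

The interesting part of RAG is everything that happens before this function: deciding which excerpts to pass in, and in what order.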

This cheap trick is astonishingly effective. It's trivial to get a basic version of this working - the challenge is in getting it to work as well as possible given the infinite set of questions a user might ask.

The key problem in RAG is figuring out the best possible excerpts of content to include in the prompt to the LLM.

"Vibes-based" semantic search powered by embedding is exactly the kind of thing you need to gather potentially relevant content to help answer a user's question.

I built a version of this against content from my blog, described in detail in Embedding paragraphs from my blog with E5-large-v2.

I used a model called E5-large-v2 for this. It's a model trained with this exact use-case in mind.

A challenge with finding content relevant to questions is that the user's question - "What is shot-scraper?" - isn't guaranteed to be considered semantically similar to content that answers that question. Questions and assertions have a different grammar from each other.

E5-large-v2 addresses this by supporting two types of content. You can embed phrases (factual sentences) and queries (questions) in the same space, similar to how CLIP supports both images and text.

I embedded 19,000 paragraphs of text from my blog as phrases, and now I can embed a question as a query and use that to find the paragraphs most likely to answer that question.

The result is RAG implemented as a one-liner Bash script:

llm similar blog-paragraphs -c "query: $1" \
  | jq '.content | sub("passage: "; "")' -r \
  | llm -m mlc-chat-Llama-2-7b-chat-hf-q4f16_1 \
  "$1" -s 'You answer questions as a single paragraph'

This example uses Llama 2 Chat 7B running on my laptop (with the llm-mlc plugin), so I'm able to answer questions entirely offline.

Running this:

./blog-answer.sh 'What is shot-scraper?'

Outputs this:

Shot-scraper is a Python utility that wraps Playwright, providing both a command line interface and a YAML-driven configuration flow for automating the process of taking screenshots of web pages and scraping data from them using JavaScript. It can be used to take one-off screenshots or take multiple screenshots in a repeatable way by defining them in a YAML file. Additionally, it can be used to execute JavaScript on a page and return the resulting value.

That's a really good description of my shot-scraper tool. I checked and none of that output is an exact match to content I had previously published on my blog.

Q&A

My talk ended with a Q&A session. Here are the summarized questions and answers.

How does LangChain fit into this?

LangChain is a popular framework for implementing features on top of LLMs. It covers a lot of ground - my only problem with LangChain is that you have to invest a lot of work in understanding how it works and what it can do for you. Retrieval Augmented Generation is one of the key features of LangChain, so a lot of the things I've shown you today could be built on top of LangChain if you invest the effort.

My philosophy around this stuff differs from LangChain in that I'm focusing on building a suite of small tools that can work together, as opposed to a single framework that solves everything in one go.

Have you tried distance functions other than cosine similarity?

I have not. Cosine similarity is the default function that everyone else seems to be using and I've not spent any time yet exploring other options.

I actually got ChatGPT to write all of my different versions of cosine similarity, across both Python and JavaScript!

A fascinating thing about RAG is that it has so many different knobs that you can tweak. You can try different distance functions, different embedding models, different prompting strategies and different LLMs. There's a lot of scope for experimentation here.

What do you need to adjust if you have 1 billion objects?

The demos I've shown today have all been on the small side - up to around 20,000 embeddings. This is small enough that you can run brute force cosine similarity functions against everything and get back results in a reasonable amount of time.
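Brute force here means scoring the query against every stored vector and keeping the best few. A minimal sketch, with toy two-number vectors and hypothetical IDs standing in for real embeddings:

```python
import heapq
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def brute_force_search(query, items, k=3):
    """items: iterable of (id, vector) pairs. Returns the k best (score, id) pairs, best first."""
    scored = ((cosine_similarity(query, vector), item_id) for item_id, vector in items)
    return heapq.nlargest(k, scored)

# Hypothetical stored embeddings
items = [
    ("a", [1.0, 0.0]),
    ("b", [0.9, 0.1]),
    ("c", [0.0, 1.0]),
    ("d", [-1.0, 0.0]),
]
results = brute_force_search([1.0, 0.0], items, k=2)
for score, item_id in results:
    print(item_id, round(score, 3))
```

The cost grows linearly with the number of stored vectors, which is fine for tens of thousands of items and is the reason specialized indexes exist for much larger collections.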

If you're dealing with more data there are a growing number of options that can help.

Lots of startups are launching new "vector databases" - which are effectively databases that are custom built to answer nearest-neighbour queries against vectors as quickly as possible.

I'm not convinced you need an entirely new database for this: I'm more excited about adding custom indexes to existing databases. For example, SQLite has sqlite-vss and PostgreSQL has pgvector.

I've also done some successful experiments with Facebook's FAISS library, including building a Datasette plugin that uses it called datasette-faiss.

What improvements to embedding models are you excited to see?

I'm really excited about multi-modal models. CLIP is a great example, but I've also been experimenting with Facebook's ImageBind, which "learns a joint embedding across six different modalities - images, text, audio, depth, thermal, and IMU data." It looks like we can go a lot further than just images and text!

I also like the trend of these models getting smaller. I demonstrated a new model, gte-tiny, earlier which is just 60MB. Being able to run these things on constrained devices, or in the browser, is really exciting to me.

LLM now provides tools for working with embeddings

2023-09-04T20:32:39+00:00

LLM is my Python library and command-line tool for working with language models. I just released LLM 0.9 with a new set of features that extend LLM to provide tools for working with embeddings.

This is a long post with a lot of theory and background. If you already know what embeddings are, here's a TLDR you can try out straight away:

# Install LLM
pip install llm

# If you already installed via Homebrew/pipx you can upgrade like this:
llm install -U llm

# Install the llm-sentence-transformers plugin
llm install llm-sentence-transformers

# Install the all-MiniLM-L6-v2 embedding model
llm sentence-transformers register all-MiniLM-L6-v2

# Generate and store embeddings for every README.md in your home directory, recursively
llm embed-multi readmes \
  --model sentence-transformers/all-MiniLM-L6-v2 \
  --files ~/ '**/README.md'
  # Add --store to store the text content as well

# Run a similarity search for "sqlite" against those embeddings
llm similar readmes -c sqlite

For everyone else, read on and the above example should hopefully all make sense.

Embeddings

Embeddings are a fascinating concept within the larger world of language models.

I explain embeddings in detail (with both a video and heavily annotated slides) in Embeddings: What they are and why they matter.

An embedding model lets you take a string of text - a word, sentence, paragraph or even a whole document - and turn that into an array of floating point numbers called an embedding vector.

A model will always produce the same length of array - 1,536 numbers for the OpenAI embedding model, 384 for all-MiniLM-L6-v2 - but the array itself is inscrutable. What are you meant to do with it?

The answer is that you can compare them. I like to think of an embedding vector as a location in 1,536-dimensional space. The distance between two vectors is a measure of how semantically similar they are in meaning, at least according to the model that produced them.

"One happy dog" and "A playful hound" will end up close together, even though they don't share any keywords. The embedding vector represents the language model's interpretation of the meaning of the text.

Things you can do with embeddings include:

Find related items. I use this on my TIL site to display related articles, as described in Storing and serving related documents with openai-to-sqlite and embeddings.
Build semantic search. As shown above, an embeddings-based search engine can find content relevant to the user's search term even if none of the keywords match.
Implement retrieval augmented generation - the trick where you take a user's question, find relevant documentation in your own corpus and use that to get an LLM to spit out an answer. More on that here.
Clustering: you can find clusters of nearby items and identify patterns in a corpus of documents.
Classification: calculate the embedding of a piece of text and compare it to pre-calculated "average" embeddings for different categories.

LLM's new embedding features

My goal with LLM is to provide a plugin-driven abstraction around a growing collection of language models. I want to make installing, using and comparing these models as easy as possible.

The new release adds several command-line tools for working with embeddings, plus a new Python API for working with embeddings in your own code.

It also adds support for installing additional embedding models via plugins. I've released one plugin for this so far: llm-sentence-transformers, which adds support for new models based on the sentence-transformers library.

The example above shows how to use sentence-transformers. LLM also supports API-driven access to the OpenAI ada-002 model.

Here's how to embed some text using ada-002, assuming you have installed LLM already:

# Set your OpenAI API key
llm keys set openai
# <paste key here>

# Embed some text
llm embed -m ada-002 -c "Hello world"

This will output a huge JSON list of floating point numbers to your terminal. You can add -f base64 (or -f hex) to get that back in a different format, though none of these outputs are instantly useful.

Embeddings are much more interesting when you store them.

LLM already uses SQLite to store prompts and responses. It was a natural fit to use SQLite to store embeddings as well.

Embedding collections

LLM 0.9 introduces the concept of a collection of embeddings. A collection has a name - like readmes - and contains a set of embeddings, each of which has an ID and an embedding vector.

All of the embeddings in a collection are generated by the same model, to ensure they can be compared with each other.

The llm embed command can store the vector in the database instead of returning it to the console. Pass it the name of an existing (or to-be-created) collection and the ID to use to store the embedding.

Here we'll store the embedding for the phrase "Hello world" in a collection called phrases with the ID hello, using that ada-002 embedding model:

llm embed phrases hello -m ada-002 -c "Hello world"

Future phrases can be added without needing to specify the model again, since it is remembered by the collection:

llm embed phrases goodbye -c "Goodbye world"

The llm embed-db collections command shows a list of collections:

phrases: ada-002
  2 embeddings
readmes: sentence-transformers/all-MiniLM-L6-v2
  16796 embeddings

The data is stored in a SQLite embeddings table with the following schema:

CREATE TABLE [collections] (
   [id] INTEGER PRIMARY KEY,
   [name] TEXT,
   [model] TEXT
);
CREATE TABLE "embeddings" (
   [collection_id] INTEGER REFERENCES [collections]([id]),
   [id] TEXT,
   [embedding] BLOB,
   [content] TEXT,
   [content_hash] BLOB,
   [metadata] TEXT,
   [updated] INTEGER,
   PRIMARY KEY ([collection_id], [id])
);

CREATE UNIQUE INDEX [idx_collections_name]
    ON [collections] ([name]);
CREATE INDEX [idx_embeddings_content_hash]
    ON [embeddings] ([content_hash]);

By default this is the SQLite database at the location revealed by llm embed-db path, but you can pass --database my-embeddings.db to various LLM commands to use a different database.

Each embedding vector is stored as a binary BLOB in the embedding column, consisting of those floating point numbers packed together as 32 bit floats.
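Here's a rough sketch of what packing floats into a BLOB like that can look like, using Python's struct module. The little-endian byte order is an assumption made for this example. In practice the llm.encode() and llm.decode() helpers described later are the things to use.

```python
import struct

def encode(values):
    """Pack a list of floats into bytes as consecutive 32 bit floats (little-endian assumed)."""
    return struct.pack("<" + "f" * len(values), *values)

def decode(blob):
    """Unpack bytes produced by encode() back into a list of floats."""
    return list(struct.unpack("<" + "f" * (len(blob) // 4), blob))

blob = encode([0.5, -1.25, 3.0])
print(len(blob))     # 12 - three floats at four bytes each
print(decode(blob))  # [0.5, -1.25, 3.0]
```

At four bytes per number, a 384-number vector takes 1,536 bytes and a 1,536-number vector takes 6,144 bytes. Values that can't be represented exactly in 32 bits lose a little precision on the round trip.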

The content_hash column contains an MD5 hash of the content. This helps avoid re-calculating the embedding (which can cost actual money for API-based embedding models like ada-002) unless the content has changed.
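The idea behind that hash check can be sketched as follows. The dictionary store, the embed_if_changed() function and the fake embedding function are all hypothetical stand-ins, not LLM's actual code.

```python
import hashlib

def content_hash(text):
    """MD5 digest of the text, as raw bytes."""
    return hashlib.md5(text.encode("utf-8")).digest()

def embed_if_changed(store, item_id, text, embed):
    """Embed text only if its hash differs from the stored one. Returns True if embed() ran."""
    new_hash = content_hash(text)
    existing = store.get(item_id)
    if existing is not None and existing[0] == new_hash:
        return False  # content unchanged, keep the stored vector
    store[item_id] = (new_hash, embed(text))
    return True

calls = []

def fake_embed(text):
    """Stand-in for a real (possibly paid) embedding model call."""
    calls.append(text)
    return [float(len(text))]

store = {}
embed_if_changed(store, "hello", "Hello world", fake_embed)  # embeds
embed_if_changed(store, "hello", "Hello world", fake_embed)  # skipped, same content
embed_if_changed(store, "hello", "Hello there", fake_embed)  # embeds again, content changed
print(len(calls))  # 2
```

MD5 is fine for this purpose because the hash only needs to detect changes, not resist deliberate tampering.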

The content column is usually null, but can contain a copy of the original text content if you pass the --store option to the llm embed command.

metadata can contain a JSON object with metadata, if you pass --metadata '{"json": "goes here"}'.

You don't have to pass content using -c - you can instead pass a file path using the -i/--input option:

llm embed docs llm-setup -m ada-002 -i llm/docs/setup.md

Or pipe things to standard input like this:

cat llm/docs/setup.md | llm embed docs llm-setup -m ada-002 -i -

Embedding similarity search

Once you've built a collection, you can search for similar embeddings using the llm similar command.

The -c "term" option will embed the text you pass in using the embedding model for the collection and use that as the comparison vector:

llm similar readmes -c sqlite

You can also pass the ID of an object in that collection to use that embedding instead. This gets you related documents, for example:

llm similar readmes sqlite-utils/README.md

The output from this command is currently newline-delimited JSON.
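Newline-delimited JSON is one JSON object per line, which makes it easy to process from your own code. A small sketch of parsing it in Python follows. The sample lines and their "id" and "score" fields are illustrative only, not captured output.

```python
import json

def parse_ndjson(text):
    """Parse newline-delimited JSON into a list of Python objects, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]

# Illustrative sample, standing in for the output of a similarity search
sample = (
    '{"id": "a/README.md", "score": 0.81}\n'
    '{"id": "b/README.md", "score": 0.42}\n'
)
rows = parse_ndjson(sample)
for row in rows:
    print(row["id"], row["score"])
```

Because each line is a complete JSON document, tools like jq can also process the output one record at a time.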

Embedding in bulk

The llm embed command embeds a single string at a time. llm embed-multi is much more powerful: you can feed a CSV or JSON file, a SQLite database or even have it read from a directory of files in order to embed multiple items at once.

Many embedding models are optimized for batch operations, so embedding multiple items at a time can provide a significant speed boost.

The embed-multi command is described in detail in the documentation. Here are a couple of fun things you can do with it.

First, I'm going to create embeddings for every single one of my Apple Notes.

My apple-notes-to-sqlite tool can export Apple Notes to a SQLite database. I'll run that first:

apple-notes-to-sqlite notes.db

This took quite a while to run on my machine and generated an 828MB SQLite database containing 6,462 records!

Next, I'm going to embed the content of all of those notes using the sentence-transformers/all-MiniLM-L6-v2 model:

llm embed-multi notes \
  -d notes.db \
  --sql 'select id, title, body from notes' \
  -m sentence-transformers/all-MiniLM-L6-v2

This took around 15 minutes to run, and increased the size of my database by 13MB.

The --sql option here specifies a SQL query. The first column must be an id, then any subsequent columns will be concatenated together to form the content to embed.
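That "first column is the ID, the rest become the content" rule can be sketched with Python's sqlite3 module. Joining the columns with a single space and skipping NULL values are assumptions made for this example. The exact joining behaviour of llm embed-multi may differ.

```python
import sqlite3

def rows_to_items(rows):
    """Turn (id, col1, col2, ...) rows into (id, text) pairs ready for embedding."""
    for item_id, *columns in rows:
        # Assumption: join remaining columns with a space, ignoring NULLs
        text = " ".join(str(value) for value in columns if value is not None)
        yield str(item_id), text

# Toy stand-in for notes.db
db = sqlite3.connect(":memory:")
db.execute("create table notes (id integer primary key, title text, body text)")
db.executemany(
    "insert into notes values (?, ?, ?)",
    [(1, "Ideas", "Write about embeddings"), (2, "Shopping", None)],
)

items = list(rows_to_items(db.execute("select id, title, body from notes order by id")))
for item_id, text in items:
    print(item_id, "->", text)
```

Selecting both title and body means the note's title contributes to the embedding as well as its text.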

In this case the embeddings are written back to the same notes.db database that the content came from.

And now I can run embedding similarity operations against all of my Apple notes!

llm similar notes -d notes.db -c 'ideas for blog posts'

Embedding files in a directory

Let's revisit the example from the top of this post. In this case, I'm using the --files option to search for files on disk and embed each of them:

llm embed-multi readmes \
  --model sentence-transformers/all-MiniLM-L6-v2 \
  --files ~/ '**/README.md'

The --files option takes two arguments: a path to a directory and a pattern to match against filenames. In this case I'm searching my home directory recursively for any files named README.md.
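A rough Python equivalent of that directory-plus-pattern search, using pathlib. Using the path relative to the search directory as the ID is an assumption here, based on IDs like sqlite-utils/README.md. The temporary directory just keeps the example self-contained.

```python
import tempfile
from pathlib import Path

def find_files(directory, pattern):
    """Yield (id, text) pairs for files under directory that match a glob pattern."""
    base = Path(directory).expanduser()
    for path in sorted(base.glob(pattern)):
        if path.is_file():
            # Assumption: the ID is the file's path relative to the search directory
            yield path.relative_to(base).as_posix(), path.read_text(encoding="utf-8")

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "a").mkdir()
    (root / "b" / "c").mkdir(parents=True)
    (root / "a" / "README.md").write_text("# Project A", encoding="utf-8")
    (root / "a" / "notes.md").write_text("not a README", encoding="utf-8")
    (root / "b" / "c" / "README.md").write_text("# Project C", encoding="utf-8")
    found = list(find_files(tmp, "**/README.md"))

for file_id, text in found:
    print(file_id, "->", text)
```

The ** in the pattern matches any number of nested directories, which is what makes the search recursive.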

Running this command gives me embeddings for all of my README.md files, which I can then search against like this:

llm similar readmes -c sqlite

Embeddings in Python

So far I've only covered the command-line tools. LLM 0.9 also introduces a new Python API for working with embeddings.

There are two aspects to this. If you just want to embed content and handle the resulting vectors yourself, you can use llm.get_embedding_model():

import llm

# This takes model IDs and aliases defined by plugins:
model = llm.get_embedding_model("sentence-transformers/all-MiniLM-L6-v2")
vector = model.embed("This is text to embed")

vector will then be a Python list of floating point numbers.

You can serialize that to the same binary format that LLM uses like this:

binary_vector = llm.encode(vector)
# And to deserialize:
vector = llm.decode(binary_vector)

The second aspect of the Python API is the llm.Collection class, for working with collections of embeddings. This example code is quoted from the documentation:

import sqlite_utils
import llm

# This collection will use an in-memory database that will be
# discarded when the Python process exits
collection = llm.Collection("entries", model_id="ada-002")

# Or you can persist the database to disk like this:
db = sqlite_utils.Database("my-embeddings.db")
collection = llm.Collection("entries", db, model_id="ada-002")

# You can pass a model directly using model= instead of model_id=
embedding_model = llm.get_embedding_model("ada-002")
collection = llm.Collection("entries", db, model=embedding_model)

# Store a string in the collection with an ID:
collection.embed("hound", "my happy hound")

# Or to store content and extra metadata:
collection.embed(
    "hound",
    "my happy hound",
    metadata={"name": "Hound"},
    store=True
)

# Or embed things in bulk:
collection.embed_multi(
    [
        ("hound", "my happy hound"),
        ("cat", "my dissatisfied cat"),
    ],
    # Add this to store the strings in the content column:
    store=True,
)

As with everything else in LLM, the goal is that anything you can do with the CLI can be done with the Python API, and vice-versa.

Clustering with llm-cluster

Another interesting application of embeddings is that you can use them to cluster content - identifying patterns in a corpus of documents.

I've started exploring this area with a new plugin, called llm-cluster.

You can install it like this:

llm install llm-cluster

Let's create a new collection using data pulled from GitHub. I'm going to import all of the LLM issues from the GitHub API, using my paginate-json tool:

paginate-json 'https://api.github.com/repos/simonw/llm/issues?state=all&filter=all' \
  | jq '[.[] | {id: .id, title: .title}]' \
  | llm embed-multi llm-issues - \
    --database issues.db \
    --model sentence-transformers/all-MiniLM-L6-v2 \
    --store

Running this gives me an issues.db SQLite database with 218 embeddings contained in a collection called llm-issues.

Now let's try out the llm-cluster command, requesting ten clusters from that collection:

llm cluster llm-issues --database issues.db 10

The output from this command, truncated, looks like this:

[
  {
    "id": "0",
    "items": [
      {
        "id": "1784149135",
        "content": "Tests fail with pydantic 2"
      },
      {
        "id": "1837084995",
        "content": "Allow for use of Pydantic v1 as well as v2."
      },
      {
        "id": "1857942721",
        "content": "Get tests passing against Pydantic 1"
      }
    ]
  },
  {
    "id": "1",
    "items": [
      {
        "id": "1724577618",
        "content": "Better ways of storing and accessing API keys"
      },
      {
        "id": "1772024726",
        "content": "Support for `-o key value` options such as `temperature`"
      },
      {
        "id": "1784111239",
        "content": "`--key` should be used in place of the environment variable"
      }
    ]
  },
  {
    "id": "8",
    "items": [
      {
        "id": "1835739724",
        "content": "Bump the python-packages group with 1 update"
      },
      {
        "id": "1848143453",
        "content": "Python library support for adding aliases"
      },
      {
        "id": "1857268563",
        "content": "Bump the python-packages group with 1 update"
      }
    ]
  }
]

These look pretty good! But wouldn't it be neat if we had a snappy title for each one?

The --summary option can provide exactly that, by piping the members of each cluster through a call to another LLM in order to generate a useful summary.

llm cluster llm-issues --database issues.db 10 --summary

This uses gpt-3.5-turbo to generate a summary for each cluster, with this default prompt:

Short, concise title for this cluster of related documents.

The results I got back are pretty good, including:

Template Storage and Management Improvements
Package and Dependency Updates and Improvements
Adding Conversation Mechanism and Tools

I tried the same thing using a Llama 2 model running on my own laptop, with a custom prompt:

llm cluster llm-issues --database issues.db 10 \
  --summary --model mlc-chat-Llama-2-13b-chat-hf-q4f16_1 \
  --prompt 'Concise title for this cluster of related documents, just return the title'

I didn't quite get what I wanted! Llama 2 is proving a lot harder to prompt, so each cluster came back with something that looked like this:

Sure! Here's a concise title for this cluster of related documents:

"Design Improvements for the Neat Prompt System"

This title captures the main theme of the documents, which is to improve the design of the Neat prompt system. It also highlights the focus on improving the system's functionality and usability

llm-cluster only took a few hours to throw together, which I'm seeing as a positive indicator that the LLM library is developing in the right direction.

Future plans

The two future features I'm most excited about are indexing and chunking.

Indexing

The llm similar command and collection.similar() Python method currently use effectively the slowest brute force approach possible: calculate the cosine similarity between the input vector and every other embedding in the collection, then sort the results.

This works fine for collections with a few hundred items, but will start to suffer for collections of 100,000 or more.

There are plenty of potential ways of speeding this up: you can run a vector index like FAISS or hnswlib, use a database extension like sqlite-vss or pgvector, or turn to a hosted vector database like Pinecone or Milvus.

With this many potential solutions, the obvious answer for LLM is to address this with plugins.

I'm still thinking through the details, but the core idea is that users should be able to define an index against one or more collections, and LLM will then coordinate updates to that index. These may not happen in real-time - some indexes can be expensive to rebuild, so there are benefits to applying updates in batches.

I experimented with FAISS earlier this year in datasette-faiss. That's likely to be the base for my first implementation.

The embeddings table has an updated timestamp column to support this use-case - so indexers can run against just the items that have changed since the last indexing run.
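
Here's a minimal sketch of how an indexer might use that column. It assumes a simplified, hypothetical embeddings table with just id, embedding and an integer updated column (the real LLM schema has more columns), and rows_changed_since() is a made-up helper name, not part of LLM:

```python
import sqlite3


def rows_changed_since(db, last_indexed):
    """Return (id, embedding) rows updated after the last indexing run.

    Assumes a simplified, hypothetical table: embeddings(id, embedding, updated),
    where updated is an integer Unix timestamp.
    """
    return db.execute(
        "select id, embedding from embeddings where updated > ? order by updated",
        (last_indexed,),
    ).fetchall()


db = sqlite3.connect(":memory:")
db.execute("create table embeddings (id text primary key, embedding blob, updated integer)")
db.executemany(
    "insert into embeddings values (?, ?, ?)",
    [("a", b"\x00", 100), ("b", b"\x01", 200), ("c", b"\x02", 300)],
)

# Pretend the previous indexing run finished at timestamp 150
changed = rows_changed_since(db, 150)
changed_ids = [row[0] for row in changed]
```

An index plugin could then feed only those rows into its index and record the newest updated value it saw, ready for the next batch.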

Follow issue #216 for updates on this feature.

Chunking

When building an embeddings-based search engine, the hardest challenge is deciding how best to "chunk" the documents.

Users will type in short phrases or questions. The embedding for a four word question might not necessarily map closely to the embedding of a thousand word article, even if the article itself should be a good match for that query.

To maximize the chance of returning the most relevant content, we need to be smarter about what we embed.

I'm still trying to get a good feeling for the strategies that make sense here. Some that I've seen include:

Split a document into shorter, fixed-length segments.
Split into segments that include a ~10% overlap with the previous and next segments, to reduce problems caused by words and sentences being split in a way that disrupts their semantic meaning.
Split by sentence, using NLP techniques.
Split into higher level sections, based on things like document headings.
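
As a rough illustration of the first two strategies, here's a sketch of a fixed-length splitter with overlap. It counts whitespace-separated words rather than model tokens, and chunk_words() and its defaults are my own illustrative choices, not anything LLM provides:

```python
def chunk_words(text, size=200, overlap=20):
    """Split text into chunks of up to `size` words.

    Each chunk repeats the last `overlap` words of the previous chunk
    (20 of 200 gives the ~10% overlap described above).
    """
    if not 0 <= overlap < size:
        raise ValueError("overlap must be at least 0 and smaller than size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # this chunk already reached the end of the text
    return chunks


demo = " ".join(f"w{i}" for i in range(10))
demo_chunks = chunk_words(demo, size=4, overlap=1)
```

With size=4 and overlap=1 each chunk starts with the word that ended the one before it, so a phrase that straddles a boundary still appears whole in at least one chunk.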

Then there are more exciting, LLM-driven approaches:

Generate an LLM summary of a document and embed that.
Ask an LLM "What questions are answered by the following text?" and then embed each of the resulting questions!

It's possible to try out these different techniques using LLM already: write code that does the splitting, then feed the results to Collection.embed_multi() or llm embed-multi.

But... it would be really cool if LLM could split documents for you - with the splitting techniques themselves defined by plugins, to make it easy to try out new approaches.

Get involved

It should be clear by now that the potential scope of the LLM project is enormous. I'm trying to use plugins to tie together an enormous and rapidly growing ecosystem of models and techniques into something that's as easy for people to work with and build on as possible.

There are plenty of ways you can help!

Join the #llm Discord to talk about the project.
Try out plugins and run different models with them. There are 12 plugins already, and several of those can be used to run dozens if not hundreds of models (llm-mlc, llm-gpt4all and llm-llama-cpp in particular). I've hardly scratched the surface of these myself, and I'm testing exclusively on Apple Silicon. I'm really keen to learn more about which models work well, which models don't and which perform the best on different hardware.
Try building a plugin for a new model. My dream here is that every significant Large Language Model will have an LLM plugin that makes it easy to install and use.
Build stuff using LLM and let me know what you've built. Nothing fuels an open source project more than stories of cool things people have built with it.

Tags: cli, open-source, projects, sqlite, ai, generative-ai, vector-search, llms, embeddings, llm, rag

sqlite-vss v0.1.1 Annotated Release Notes

2023-07-20T17:48:50+00:00

Alex Garcia’s sqlite-vss adds vector search directly to SQLite through a custom extension. It’s now easily installed for Python, Node.js, Deno, Elixir, Go, Rust and Ruby (“gem install sqlite-vss”), and is being used actively by enough people that Alex is getting actionable feedback, including fixes for memory leaks spotted in production.

Via @agarcia_me

Tags: sqlite, annotated-release-notes, alex-garcia, vector-search

Vector Search

2023-06-02T05:02:22+00:00

Amjith Ramanujam provides a very thorough tutorial on implementing vector similarity search using SentenceTransformers embeddings (all-MiniLM-L6-v2) executed using sqlite-utils, then served via datasette-sqlite-vss and deployed using Fly.

Tags: sqlite, ai, datasette, fly, vector-search, embeddings

Introducing sqlite-vss: A SQLite Extension for Vector Search

2023-02-10T22:53:14+00:00

This latest SQLite extension from Alex Garcia is possibly his best yet: it adds FAISS-powered vector similarity search directly to SQLite, enabling fast KNN similarity lookups against a virtual table that feels a lot like SQLite’s own built-in full text search feature. This write-up includes interactive demos using Datasette called from an Observable notebook, running similarity searches against an index of 200,000 news headlines and summaries in less than 50ms.

Via @simon on Mastodon

Tags: sqlite, datasette, observable, alex-garcia, vector-search

Weeknotes: AI hacking and a SpatiaLite tutorial

2023-01-15T19:45:39+00:00

Short weeknotes this time because the key things I worked on have already been covered here:

How to implement Q&A against your documentation with GPT3, embeddings and Datasette described my experiment using custom SQL functions in Datasette to provide a Q&A interface against an existing corpus of documents. This resulted in a flurry of software releases:
- datasette-openai - custom SQL functions for querying the OpenAI APIs from Datasette
- datasette-faiss - a plugin that builds a FAISS index for fast vector similarity search
- datasette-cookies-for-magic-parameters - a supporting plugin that helps store secrets such as API keys in cookies so they won't be transmitted in query strings
- openai-to-sqlite - a command-line tool for storing OpenAI API responses in a SQLite database - useful for calculating embeddings
Datasette 0.64, with a warning about SpatiaLite - I released Datasette 0.64 with a new --setting default_allow_sql off setting, due to the discovery that some of the SQL functions added by the SpatiaLite module can be used to crash the Datasette instance.
Building a location to time zone API with SpatiaLite is the latest in the official series of tutorials for Datasette. It demonstrates how SpatiaLite can be used to safely build and deploy a geospatial API.

Releases this week

datasette-openai: 0.2 - (4 releases total) - 2023-01-14
SQL functions for calling OpenAI APIs
openai-to-sqlite: 0.2 - (2 releases total) - 2023-01-13
Save OpenAI API results to a SQLite database
datasette-cookies-for-magic-parameters: 0.1.2 - (3 releases total) - 2023-01-12
UI for setting cookies to populate magic parameters
git-history: 0.7a0 - (10 releases total) - 2023-01-11
Tools for analyzing Git history using SQLite
datasette: 0.64.1 - (124 releases total) - 2023-01-11
An open source multi-tool for exploring and publishing data
datasette-faiss: 0.1a0 - 2023-01-11
Maintain a FAISS index for specified Datasette tables
json-to-files: 0.1 - 2023-01-10
Create separate files on disk based on a JSON object
datasette-auth-passwords: 1.1 - (12 releases total) - 2023-01-09
Datasette plugin for authentication using passwords
datasette-publish-fly: 1.3 - (9 releases total) - 2023-01-09
Datasette plugin for publishing data using Fly
shapefile-to-sqlite: 0.4.2 - (6 releases total) - 2023-01-08
Load shapefiles into a SQLite (optionally SpatiaLite) database

TIL this week

Tags: spatialite, ai, datasette, weeknotes, gpt-3, openai, generative-ai, vector-search, llms

How to implement Q&A against your documentation with GPT3, embeddings and Datasette

2023-01-13T23:47:30+00:00

If you've spent any time with GPT-3 or ChatGPT, you've likely thought about how useful it would be if you could point them at a specific, current collection of text or documentation and have them use that as part of their input for answering questions.

It turns out there is a neat trick for doing exactly that. I've been experimenting with it using my Datasette web application as a rapid prototyping platform for trying out new AI techniques using custom SQL functions.

Here's how to do this:

Run a text search (or a semantic search, described later) against your documentation to find content that looks like it could be relevant to the user's question
Grab extracts of that content and glue them all together into a blob of text
Construct a prompt consisting of that text followed by "Given the above content, answer the following question: " and the user's question
Send the whole thing through the GPT-3 API and see what comes back

I've been calling this the semantic search answers pattern. Update: Since publishing this post I've learned that this technique is known as Retrieval-Augmented Generation or RAG, as described in this paper from May 2020. I've also been pointed to the Question Answering using Embeddings notebook in the OpenAI cookbook which describes this same technique.

This is a form of prompt engineering (and hence is prone to prompt injection attacks, so bear that in mind). You need to fit your overall prompt into the token limit for GPT-3, which is currently 4,000 tokens. A token is more-or-less the same thing as a word.

In this post I'll show how I implemented this pattern using SQLite, Datasette and some custom SQL functions and tools written in Python.

First, some demos

I've implemented this pattern against my blog - 3,000 posts going back over twenty years. I built it using some custom SQLite SQL functions that call the OpenAI APIs.

First, some examples of questions and answers I ran against my blog:

What does shot scraper do?

Shot-scraper is a command-line tool that helps automate the process of taking screenshots of web pages and scraping data from them using JavaScript. It can be used to take one-off screenshots or take multiple screenshots in a repeatable way by defining them in a YAML file. It can also be used to execute JavaScript on a page and return the resulting value.

shot-scraper is a tool I released in 2022 - after the 2021 GPT-3 training cut-off date. This is a very good summary - better I think than anything I've written about shot-scraper myself.

What museums has Simon been to?

Simon has been to the Pitt Rivers museum, the British Museum, the Science Museum, the Natural History Museum, the V&A, the Victoria and Albert Museum, and the Pioneer Museum in Paso Robles. He has also seen the Giant Squid at the Natural History Museum and Charles Darwin's collection at the spirit building.

This is entirely correct, though I've talked about a lot more museums than that on my blog.

What are some unconventional things you can do with GitHub Actions?

1. Create dynamic content for GitHub repository templates using cookiecutter and GitHub Actions
2. Automate repetitive tasks like code linting, testing, and deployment using GitHub Actions
3. Run scheduled scrapers in GitHub Actions
4. Use GitHub Actions to trigger a webhook call when a repository is forked
5. Create an automated issue-tracking system with GitHub Issues and GitHub Actions
6. Automatically extract captions and transcripts from online videos using Whisper and GitHub Actions.

Some of my relevant posts:

I've never written anything about triggering webhook calls or issue-tracking systems though, so 4 and 5 in that list are a little suspicious.

What is the big grey box at the top of the utility pole? What does it do?

The big grey box at the top of the utility pole is likely a transformer. Transformers are used to step down the voltage of electricity from the high-voltage lines that run along the pole to the lower-voltage lines that are used to provide electricity to homes and businesses.

I'm including this example because I've never written anything even remotely related to transformers and utility poles on my blog. This demonstrates that in the absence of useful context GPT-3 will answer the question entirely by itself - which may or may not be what you want from this system.

Try this out yourself

If you want to try this out yourself you'll need to get your own API key from OpenAI. I don't want to foot the bill for people using my blog as a free source of GPT-3 prompt answers!

You can sign up for one here. I believe they are still running a free trial period.

Now head over to this page:

https://datasette.simonwillison.net/simonwillisonblog/answer_question?_hide_sql=1

You'll need to paste in your OpenAI key. I am not logging these anywhere, and the form stores the key in a cookie to avoid transmitting it in a GET query string, where it could be accidentally logged somewhere.

Then type in your question and see what comes back!

Let's talk about how this all works - in a whole lot of detail.

Semantic search using embeddings

You can implement the first step of this sequence using any search engine you like - but there's a catch: we are encouraging users here to ask questions, which increases the chance that they might include text in their prompt which doesn't exactly match documents in our index.

"What are the key features of Datasette?" for example might miss blog entries that don't include the word "feature" even though they describe functionality of the software in detail.

What we want here is semantic search - we want to find documents that match the meaning of the user's search term, even if the matching keywords are not present.

OpenAI have a less well-known API that can help here, which had a big upgrade (and major price reduction) back in December: their embedding model.

Update 31st January 2023: I figured out how to run an alternative embeddings model that can execute entirely on my laptop, described here: Calculating embeddings with gtr-t5-large in Python.

An embedding is a list of floating point numbers.

As an example, consider a latitude/longitude location: it's a list of two floating point numbers. You can use those numbers to find other nearby points by calculating distances between them.

Add a third number and now you can plot locations in three dimensional space - and still calculate distances between them to find the closest points.

This idea keeps on working even as we go beyond three dimensions: you can calculate distances between vectors of any length, no matter how many dimensions they have.
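
To make that concrete, here's a small sketch of a straight-line (Euclidean) distance function that works for vectors of any length. It's just one possible distance measure, used here for illustration, and the function name is my own:

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two vectors of the same length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


d2 = euclidean_distance([0.0, 0.0], [3.0, 4.0])            # two dimensions
d3 = euclidean_distance([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])  # three dimensions
d5 = euclidean_distance([0.0] * 5, [1.0] * 5)              # five dimensions
```

The same function would work unchanged on two lists of 1,536 numbers.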

So if we can represent some text in a many-multi-dimensional vector space, we can calculate distances between those vectors to find the closest matches.

The OpenAI embedding model lets you take any string of text (up to a ~8,000 word length limit) and turn that into a list of 1,536 floating point numbers. We'll call this list the "embedding" for the text.

These numbers are derived from a sophisticated language model. They take a vast amount of knowledge of human language and flatten that down to a list of floating point numbers - at 4 bytes per floating point number that's 4*1,536 = 6,144 bytes per embedding - 6KiB.

The distance between two embeddings represents how semantically similar the two pieces of text are to each other.

The two most obvious applications of this are search and similarity scores.

Take a user's search term. Calculate its embedding. Now find the distance between that embedding and every pre-calculated embedding in a corpus and return the 10 closest results.

Or for document similarity: calculate embeddings for every document in a collection, then look at each one in turn and find the closest other embeddings: those are the documents that are most similar to it.

For my semantic search answers implementation, I use an embedding-based semantic search as the first step to find the best matches for the question. I then assemble these top 5 matches into the prompt to pass to GPT-3.

Calculating embeddings

Embeddings can be calculated from text using the OpenAI embeddings API. It's really easy to use:

curl https://api.openai.com/v1/embeddings \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -d '{"input": "Your text string goes here",
       "model":"text-embedding-ada-002"}'

The documentation doesn't mention this, but you can pass a list of strings (up to 2048 according to the official Python library source code) as "input" to run embeddings in bulk:

curl https://api.openai.com/v1/embeddings \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -d '{"input": ["First string", "Second string", "Third string"],
       "model":"text-embedding-ada-002"}'

The returned data from this API looks like this:

{
  "data": [
    {
      "embedding": [
        -0.006929283495992422,
        -0.005336422007530928,
        ...
        -4.547132266452536e-05,
        -0.024047505110502243
      ],
      "index": 0,
      "object": "embedding"
    }
  ]
}

As expected, it's a list of 1,536 floating point numbers.
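
Here's a sketch of the two bits of glue code a bulk run might need: splitting a long list of strings into request bodies of at most 2048 inputs, and pulling the vectors back out of a response in input order. It makes no network calls, the function names are hypothetical, and the sample response is hand-written to match the shape shown above:

```python
import json

MAX_INPUTS_PER_REQUEST = 2048  # the per-request limit quoted above


def build_request_bodies(texts, model="text-embedding-ada-002",
                         batch_size=MAX_INPUTS_PER_REQUEST):
    """Return one JSON request body per batch of input strings."""
    return [
        json.dumps({"input": texts[start:start + batch_size], "model": model})
        for start in range(0, len(texts), batch_size)
    ]


def embeddings_from_response(response_text):
    """Return the embedding lists from a response body, ordered by "index"."""
    payload = json.loads(response_text)
    items = sorted(payload["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in items]


bodies = build_request_bodies([f"String {i}" for i in range(5000)])
batch_sizes = [len(json.loads(body)["input"]) for body in bodies]

# Hand-written stand-in for a real response, with 3 numbers instead of 1,536
sample = json.dumps({
    "data": [
        {"embedding": [0.4, 0.5, 0.6], "index": 1, "object": "embedding"},
        {"embedding": [0.1, 0.2, 0.3], "index": 0, "object": "embedding"},
    ]
})
vectors = embeddings_from_response(sample)
```

Sorting by "index" is a defensive choice on my part, so each vector lines up with the string that produced it even if the items arrive in a different order.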

I've been storing embeddings as a binary string that appends all of the floating point numbers together, using their 4-byte representation.

Here are the tiny Python functions I've been using for doing that:

import struct

def decode(blob):
    return struct.unpack("f" * 1536, blob)

def encode(values):
    return struct.pack("f" * 1536, *values)

I then store them in SQLite blob columns in my database.
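
Here's a sketch of that round trip using Python's sqlite3 module. It's a variation on the functions above rather than the exact code I run: it works out the vector length from the data instead of hard-coding 1536, pins the byte order to little-endian, and uses a throwaway demo_embeddings table:

```python
import sqlite3
import struct


def encode(values):
    # "<" forces little-endian 4-byte floats, independent of the host machine
    return struct.pack("<" + "f" * len(values), *values)


def decode(blob):
    # Each float is 4 bytes, so the blob length tells us the vector length
    return struct.unpack("<" + "f" * (len(blob) // 4), blob)


db = sqlite3.connect(":memory:")
db.execute("create table demo_embeddings (id integer primary key, embedding blob)")
db.execute("insert into demo_embeddings values (?, ?)", (1, encode([0.5, -1.25, 2.0])))

blob = db.execute("select embedding from demo_embeddings where id = 1").fetchone()[0]
restored = decode(blob)
```

The values in this example were picked because they survive the trip through 4-byte floats exactly. Most real embedding values will come back very slightly rounded.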

I wrote a custom tool for doing this, called openai-to-sqlite. I can run it like this:

openai-to-sqlite embeddings simonwillisonblog.db \
  --sql 'select id, title, body from blog_entry' \
  --table blog_entry_embeddings

This concatenates together the title and body columns from that table, runs them through the OpenAI embeddings API and stores the results in a new table called blog_entry_embeddings with the following schema:

CREATE TABLE [blog_entry_embeddings] (
   [id] INTEGER PRIMARY KEY,
   [embedding] BLOB
)

I can join this against the blog_entry table by ID later on.

Finding the closest matches

The easiest way to calculate similarity between two embedding arrays is to use cosine similarity. A simple Python function for that looks like this:

def cosine_similarity(a, b):
    dot_product = sum(x * y for x, y in zip(a, b))
    magnitude_a = sum(x * x for x in a) ** 0.5
    magnitude_b = sum(x * x for x in b) ** 0.5
    return dot_product / (magnitude_a * magnitude_b)

You can brute-force find the top matches for a table by executing that comparison for every row and returning the ones with the highest score.
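
In plain Python that brute-force approach might look something like this sketch. top_matches() and the toy two-dimensional vectors are made up for illustration:

```python
def cosine_similarity(a, b):
    dot_product = sum(x * y for x, y in zip(a, b))
    magnitude_a = sum(x * x for x in a) ** 0.5
    magnitude_b = sum(x * x for x in b) ** 0.5
    return dot_product / (magnitude_a * magnitude_b)


def top_matches(query, embeddings, n=10):
    """Score every (id, vector) pair against query; return the n best as (score, id)."""
    scored = [(cosine_similarity(query, vector), id_) for id_, vector in embeddings.items()]
    scored.sort(reverse=True)  # highest similarity first
    return scored[:n]


toy = {
    "east": [1.0, 0.0],
    "north-east": [1.0, 1.0],
    "north": [0.0, 1.0],
    "west": [-1.0, 0.0],
}
best = top_matches([1.0, 0.1], toy, n=2)
best_ids = [id_ for _, id_ in best]
```

Every query touches every stored vector, so the cost grows linearly with the size of the collection.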

I added this to my datasette-openai Datasette plugin as a custom SQL function called openai_embedding_similarity(). Here's a query that uses it:

with input as (
  select
    embedding
  from
    blog_entry_embeddings
  where
    id = :entry_id
),
top_n as (
  select
    id,
    openai_embedding_similarity(
      blog_entry_embeddings.embedding,
      input.embedding
    ) as score
  from
    blog_entry_embeddings,
    input
  order by
    score desc
  limit
    20
)
select
  score,
  blog_entry.id,
  blog_entry.title
from
  blog_entry
  join top_n on blog_entry.id = top_n.id

Try that out here.

This takes as input the ID of one of my blog entries and returns a list of the other entries, ordered by their similarity score.

Unfortunately this is pretty slow! It takes over 1.3s to run against all 3,000 embeddings in my blog.

I did some research and found that a highly regarded solution for fast vector similarity calculations is FAISS, by Facebook AI Research. It has neat Python bindings and can be installed using pip install faiss-cpu (the -gpu version requires a GPU).

FAISS works against an in-memory index. My blog's Datasette instance uses the baked data pattern which means the entire thing is re-deployed any time the data changes - as such, I can spin up an in-memory index once on startup without needing to worry about updating the index continually as rows in the database change.

So I built another plugin to do that: datasette-faiss - which can be configured to build an in-memory FAISS index against a configured table on startup, and can then be queried using another custom SQL function.

Here's the related entries query from above rewritten to use the FAISS index:

with input as (
  select
    embedding
  from
    blog_entry_embeddings
  where
    id = :entry_id
),
top_n as (
  select value as id from json_each(
    faiss_search(
      'simonwillisonblog',
      'blog_entry_embeddings',
      input.embedding,
      20
    )
  ), input
)
select
  blog_entry.id,
  blog_entry.title
from
  blog_entry
  join top_n on blog_entry.id = top_n.id

This one runs in 4.8ms!

faiss_search(database_name, table_name, embedding, n) returns a JSON array of the top n IDs from the specified embeddings table, based on distance scores from the provided embedding.

The json_each() trick here is a workaround for the fact that Python's SQLite driver doesn't yet provide an easy way to write table-valued functions - SQL functions that return something in the shape of a table.

Instead, I use json_each() to turn the string JSON array of IDs from faiss_search() into a table that I can run further joins against.
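
Here's a self-contained sketch of the same trick using Python's sqlite3 module. fake_search() is a stand-in I made up for this example: it returns a hard-coded JSON array instead of consulting a real index. It assumes your SQLite build includes the JSON functions, which most recent builds do:

```python
import json
import sqlite3


def fake_search(n):
    # Stand-in for a real search function: returns a JSON array of IDs as a string
    return json.dumps([3, 1, 2][:n])


db = sqlite3.connect(":memory:")
db.create_function("fake_search", 1, fake_search)
db.execute("create table entries (id integer primary key, title text)")
db.executemany(
    "insert into entries values (?, ?)",
    [(1, "One"), (2, "Two"), (3, "Three")],
)

# json_each() expands the JSON array into rows with key and value columns
rows = db.execute(
    """
    select entries.id, entries.title
    from json_each(fake_search(2))
    join entries on entries.id = json_each.value
    order by json_each.key
    """
).fetchall()
```

Ordering by json_each.key preserves the order of the IDs in the array, which matters if the function returns its best match first.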

Implementing semantic search with embeddings

So far we've just seen embeddings used for finding similar items. Let's implement semantic search, using a user-provided query.

This is going to need an API key again, because it involves a call to OpenAI to run embeddings against the user's search query.

Here's the SQL query:

select
  value,
  blog_entry.title,
  substr(blog_entry.body, 0, 500)
from
  json_each(
    faiss_search(
      'simonwillisonblog',
      'blog_entry_embeddings',
      (
        select
          openai_embedding(:query, :_cookie_openai_api_key)
      ),
      10
    )
  )
  join blog_entry on value = blog_entry.id
  where length(coalesce(:query, '')) > 0

Try that here (with some extra cosmetic tricks).

We're using a new function here: openai_embedding() - which takes some text and an API key and returns an embedding for that text.

The API key comes from :_cookie_openai_api_key - this is a special Datasette mechanism called magic parameters which can read variables from cookies.

The datasette-cookies-for-magic-parameters plugin notices these and turns them into an interface for the user to populate the cookies with, as described earlier.

One last trick: adding where length(coalesce(:query, '')) > 0 to the query means that the query won't run if the user hasn't entered any text into the search box.

Constructing a prompt from semantic search query results

Getting back to our semantic search answers pattern.

We need a way to construct a prompt for GPT-3 using the results of our semantic search query.

There's one big catch: GPT-3 has a length limit, and it's strictly enforced. If you pass even one token over that limit you'll get an error.

We want to use as much material from the top five search results as possible, leaving enough space for the rest of the prompt (the user's question and our own text) and the prompt response.

I ended up solving this with another custom SQL function:

select openai_build_prompt(content, 'Context:
------------
', '
------------
Given the above context, answer the following question: ' || :question,
  500
  ) from search_results

This function works as an aggregate function - it takes a table of results and returns a single string.

It takes the column to aggregate - in this case content - as the first argument. Then it takes a prefix and a suffix, which are concatenated together with the aggregated content in the middle.

The fourth argument is the number of tokens to allow for the response.

The function then attempts to truncate each of the input values to the maximum length that will still allow them all to be concatenated together while staying inside that 4,000 token limit.
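
To illustrate the general idea, here's a simplified Python sketch. It is not the plugin's actual implementation: it approximates tokens as whitespace-separated words, gives every text an equal share of the budget, and build_prompt() and its parameters are names I've invented for this example:

```python
def build_prompt(texts, prefix, suffix, completion_tokens, token_limit=4000):
    """Truncate each text equally so the prompt plus response fits the limit.

    Tokens are approximated as whitespace-separated words; a real
    implementation would count tokens with the model's tokenizer.
    """
    def count(s):
        return len(s.split())

    budget = token_limit - completion_tokens - count(prefix) - count(suffix)
    if not texts or budget <= 0:
        return prefix + suffix
    per_text = budget // len(texts)
    # Note: splitting and re-joining also collapses runs of whitespace
    truncated = [" ".join(text.split()[:per_text]) for text in texts]
    return prefix + "\n".join(truncated) + suffix


docs = ["alpha " * 50, "beta " * 50]
prompt = build_prompt(
    docs,
    "Context:\n------------\n",
    "\n------------\nGiven the above context, answer the following question: What?",
    completion_tokens=10,
    token_limit=40,
)
approx_tokens = len(prompt.split())
```

With a limit of 40 and 10 tokens reserved for the response, the prefix and suffix use 12 words and the two texts get 9 words each.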

Adding it all together

With all of the above in place, the following query is my full implementation of semantic search answers against my blog:

with query as (
  select
    openai_embedding(:question, :_cookie_openai_api_key) as q
),
top_n as (
  select
    value
  from json_each(
    faiss_search(
      'simonwillisonblog',
      'blog_entry_embeddings',
      (select q from query),
      5
    )
  )
  where length(coalesce(:question, '')) > 0
),
texts as (
  select 'Created: ' || created || ', Title: ' || title || 
  ', Body: ' || openai_strip_tags(body) as text
  from blog_entry where id in (select value from top_n)
),
prompt as (
  select openai_build_prompt(text, 'Context:
------------
', '
------------
Given the above context, answer the following question: ' || :question,
  500
  ) as prompt from texts
)
select
  'Response' as title,
  openai_davinci(
    prompt,
    500,
    0.7,
    :_cookie_openai_api_key
  ) as value
  from prompt
  where length(coalesce(:question, '')) > 0
union all
select
  'Prompt' as title,
  prompt from prompt

As you can see, I really like using CTEs (the with name as (...) pattern) to assemble complex queries like this.

The texts as ... CTE is where I strip HTML tags from my content (using another custom function from the datasette-openai plugin called openai_strip_tags()) and assemble it along with the Created and Title metadata. Adding these gave the system a better chance of answering questions like "When did Natalie and Simon get married?" with the correct year.

The last part of this query uses a handy debugging trick: it returns two rows via a union all - the first has a Response label and shows the response from GPT-3, while the second has a Prompt label and shows the prompt that I passed to the model.

Next steps

There are so many ways to improve this system.

Smarter prompt design. My prompt here is the first thing that I got to work - I'm certain there are all kinds of tricks that could be used to make this more effective.
Better selection of the content to include in the prompt. I'm using embedding search but then truncating to the first portion: a smarter implementation would attempt to crop out the most relevant parts of each entry, maybe by using embeddings against smaller chunks of text.
- Yoz tipped me off to GPT Index, a project which aims to solve this exact problem by using a pre-trained LLM to help summarize text to better fit in a prompt used for these kinds of queries.
- Spotted this idea from Hassan Hayat: "don't embed the question when searching. Ask GPT-3 to generate a fake answer, embed this answer, and use this to search". See also this paper about Hypothetical Document Embeddings, via Jay Hack.
Hold out for GPT-4: I've heard rumours that the next version of the model will have a significantly larger token limit, which should result in much better results from this mechanism.

Tags: projects, search, sqlite, ai, datasette, gpt-3, generative-ai, vector-search, llms, embeddings, rag, ai-assisted-search, gpt

Vector search engines

2003-03-01T13:07:18+00:00

Building a Vector Space Search Engine in Perl by Maciej Cegłowski:

Vector-space search engines use the notion of a term space, where each document is represented as a vector in a high-dimensional space. There are as many dimensions as there are unique words in the entire collection. Because a document's position in the term space is determined by the words it contains, documents with many words in common end up close together, while documents with few shared words end up far apart.

To search our collection, we project a query into this term space and calculate the distance from the query vector to all the document vectors in turn. Those documents that are within a certain threshold distance get added to our result set. If all this sounds like gobbledygook to you, then don't worry - it will become clearer when we write the code.

Having done a course on Linear Algebra last term, it's interesting to see how it can be applied to the search problem. The technique described lends itself well to finding "similar documents" as well, as documents with similar word content will end up "near" to each other when projected on to the vector space.
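The article's code is in Perl, but the term-space idea can be roughly sketched in plain Python. This is my own illustration, not a port of Maciej's implementation: the function names are made up, there is no word stemming, and I've used cosine similarity with a minimum score, which amounts to a maximum-distance threshold expressed the other way round.

```python
import math
import re

def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())

def to_vector(text, position):
    """Count how often each known word occurs; unknown words have no dimension."""
    vector = [0] * len(position)
    for word in tokenize(text):
        if word in position:
            vector[position[word]] += 1
    return vector

def build_index(documents):
    """One dimension per unique word in the collection, one vector per document."""
    vocabulary = sorted({word for doc in documents for word in tokenize(doc)})
    position = {word: i for i, word in enumerate(vocabulary)}
    return position, [to_vector(doc, position) for doc in documents]

def cosine(a, b):
    """Cosine similarity: 1.0 means same direction, 0.0 means no shared words."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def search(query, documents, threshold=0.3):
    """Project the query into the term space and keep documents that score well."""
    position, vectors = build_index(documents)
    q = to_vector(query, position)
    scored = [(cosine(q, v), doc) for v, doc in zip(vectors, documents)]
    return sorted((pair for pair in scored if pair[0] >= threshold), reverse=True)

documents = [
    "The cat sat on the mat",
    "Cats and dogs",
    "Stock markets fell sharply",
]
for score, doc in search("cat on a mat", documents):
    print(f"{score:.2f} {doc}")
```

Because nothing here reduces words to their stems, "cat" and "cats" land on separate dimensions, so the second document shares no terms with the query.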

The article is also yet another demonstration of how Perl's modules make it such a powerful tool. Lingua::Stem is used to find word "stems", providing a free algorithm for treating related words like cat and cats as the same term. The performance overhead of using Perl arrays to represent large vectors is avoided with the PDL module, which implements a whole set of matrix algebra functions in compiled C for high performance. Without these two modules the technique described would be a great deal less powerful. Of course, neither of them is available for PHP or Python, my scripting languages of choice.

Tags: maciej-ceglowski, perl, vector-search

Simon Willison's Weblog: vector-search

Scaling HNSWs

The case against pgvector

Quoting James Luan

Cursor: Security

Redis is open source again

Quoting Jo Kristian Bergum

From where I left

Hybrid full-text search and vector search with SQLite

Introducing Contextual Retrieval

OpenAI: Improve file search result relevance with chunk ranking

Using DuckDB for Embeddings and Vector Search

I'm writing a new vector search SQLite Extension

My binary vector search is better than your FP32 vectors

Embeddings: What they are and why they matter

The 38 minute video version

What are embeddings?

Related content using embeddings

Exploring how these things work with Word2Vec

Calculating embeddings using my LLM tool

Vibes-based search

Embeddings for code using Symbex

Embedding text and images together using CLIP

Faucet Finder: finding faucets with CLIP

Clustering embeddings

Visualize in 2D with Principal Component Analysis

Scoring sentences using average locations

Answering questions with Retrieval-Augmented Generation

Q&A

Further reading

LLM now provides tools for working with embeddings

Embeddings

LLM's new embedding features

Embedding collections

Embedding similarity search

Embedding in bulk

Embedding files in a directory

Embeddings in Python

Clustering with llm-cluster

Future plans

Indexing

Chunking

Get involved

sqlite-vss v0.1.1 Annotated Release Notes

Vector Search

Introducing sqlite-vss: A SQLite Extension for Vector Search

Weeknotes: AI hacking and a SpatiaLite tutorial

Releases this week

TIL this week

How to implement Q&A against your documentation with GPT3, embeddings and Datasette

First, some demos

Try this out yourself

Semantic search using embeddings

Calculating embeddings

Finding the closest matches

Implementing semantic search with embeddings

Constructing a prompt from semantic search query results

Adding it all together

Next steps

Vector search engines