sqlite-vec & sqlite-rembed
Before we begin, a quick note that I tidied up the uk Deno CLI project from the other day: there are now more comments, a README, and a Justfile to make it easier to walk through. Even if you’re not interested in URL-ops, it’s not a horribad generic example of how to wrap a small JS library into an executable CLI tool.
Also, one update to the Apple Intelligence section from the other day: Apple started giving me AI-generated response suggestions in Messages. I now fear for all textboxes on macOS.
Just one section today, as we walk through a recent pair of add-ons to SQLite3 that make it pretty straightforward to get into vector search ops on your own hardware (no OpenAI tax required!).
sqlite-vec & sqlite-rembed

Alex Garcia has made some extensions to the venerable SQLite that make it (I try not to use this phrase lightly) really easy to get into vector search ops:
- sqlite-vec is “an extremely small, [C-based] “fast enough” vector search SQLite extension”. It’s a project sponsored by the Mozilla Builders project.
- sqlite-rembed is “a [Rust-based] SQLite extension for generating text embeddings from remote APIs (OpenAI, Nomic, Cohere, llamafile, Ollama, etc.)”
There are pre-built extensions for sqlite-vec in the GH releases, and if you run make release in a cloned sqlite-rembed repo, the target/release directory will contain the extension for that. For macOS folks: I needed to switch to Homebrew sqlite3 for the extensions to work. YMMV.
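Once you have both, a ten-second sanity check in the sqlite3 shell will confirm the extensions actually load. This is just a sketch, assuming the extension files are in the directory you launch from; vec_version() is the version helper sqlite-vec documents, and a clean .load with no error is signal enough for rembed0:
.load vec0
.load rembed0
-- should print a version string (e.g. "v0.x.x") if sqlite-vec loaded cleanly
select vec_version();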
Once you have the extensions in a directory, you need some data! I still need to go back and add TL;DRs to legacy Drops, but many of them now have such a section, so I threw together a small R script to create a SQLite database from those resource entries:
library(RSQLite)
library(tidyverse)
# make a sequence from range in a 2-item vector
sseq <- \(.x) seq.int(.x[1], .x[2])
# extract the individual entries in the Drop's TL;DR section
extract_tldrs <- \(.x) {
  # keep only the lines between the first two "###" headings
  .x <- .x[sseq(which(grepl("^###", .x))[1:2] + c(1, -1))]
  # drop blockquote lines, underscore-prefixed lines, and blank lines
  .x <- .x[!grepl("^>|^_|^$", .x)]
  .x <- .x[.x != ""]
  # strip the leading "-" bullet markers and surrounding whitespace
  trimws(sub("^\\-", "", .x[grepl("^\\-", .x)]))
}
# find all Drops with TL;DR sections
res <- system(r"(rg -l --no-line-number "### TL;DR[[:space:]]*$" ~/projects/dropchat)", intern=TRUE)
# read them in
posts <- lapply(res, readLines, warn=FALSE)
# associate the filename with the resource descriptions
map2_df(basename(res), posts, \(.drop, .post) {
  tibble(
    drop = .drop,
    tldr = extract_tldrs(.post)
  )
}) |>
  mutate(
    id = 1:n() # add an index
  ) -> xdf
# save it out
con <- dbConnect(RSQLite::SQLite(), "~/projects/vector-search/drop.db")
dbWriteTable(con, "articles", xdf)
dbDisconnect(con)
That ended up making ~300 entries:
Rows: 298
Columns: 3
$ drop <chr> "2023-10-04-index.md", "2023-10-04-index.md", "2023-10-04-index…
$ tldr <chr> "This section discusses the anatomy of a mature design system e…
$ id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, …
You can do something similar with your own content, but be aware that you’ll need to handle breaking it up into segments manually if each piece is large. That’s why I went with the TL;DR summary entries vs. the entire contents of each section.
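If you want a rough sense of how big your segments are before embedding them (a decent proxy for whether you need to chunk further), a quick query against the articles table we just wrote will tell you. This is only a sketch, using the column names from the R script above:
-- how many segments do we have, and how long are they?
select
  count(*)          as n_segments,
  min(length(tldr)) as shortest_chars,
  max(length(tldr)) as longest_chars,
  avg(length(tldr)) as avg_chars
from articles;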
We’ll be making embeddings from those entries, so we need something to help us do that with our new SQLite3 extensions. I used the mxbai-embed-large-v1 llamafile (that’s a direct d/l link) since it’s small and fast. Download it and fire it up at the command line so it’s in server mode.
With the new extensions and that database in the same directory, fire up sqlite3 drop.db and prepare for the work:
.load rembed0
.load vec0
-- this is a required table that the `rembed` function references.
-- it knows the URL/port configs for ollama and llamafile but
-- you can specify anything that works. the README has all the info
insert into temp.rembed_clients(name, options) values('llamafile', 'llamafile');
-- this will hold our article embeddings
create virtual table vec_articles using vec0( tldr_embeddings float[1024] );
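Before bulk-loading anything, it’s worth a one-off check that the model’s embedding size matches the float[1024] declaration above (mxbai-embed-large-v1 produces 1024-dimension vectors). This sketch assumes rembed’s raw float32 output is directly readable by sqlite-vec’s helpers, which is the whole point of pairing the two:
-- should return 1024; if not, adjust the float[N] size in vec_articles
select vec_length(rembed('llamafile', 'a quick test sentence'));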
Now, we need to generate the embeddings:
insert into
vec_articles(rowid, tldr_embeddings)
select
rowid,
rembed('llamafile', tldr)
from
articles;
Yep. That’s it. Really!
That took just under eight seconds on my Mac Studio (I’ll try it on lesser hardware at some point).
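A quick count confirms everything made it across (nothing fancy, just a sanity check):
-- both counts should match (~300 in this corpus)
select
  (select count(*) from articles)     as n_articles,
  (select count(*) from vec_articles) as n_embeddings;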
Now, we can just query it! Each query needs embeddings generated for it, too, but that’s quick work. Let’s find top Drops that are about web scraping in some way, shape, or form:
with matches as (
select
rowid,
distance
from vec_articles
where tldr_embeddings match rembed('llamafile', 'web scraping tool')
order by distance
limit 5
)
select
*
from matches
left join articles on articles.rowid = matches.rowid;
Despite the vector search operation being plain ol’ brute-force, that query finished in 0.077 seconds and returned entries from the 2023-11-29, 2024-02-21, 2024-02-14, 2024-01-10, and 2023-09-25 Drops (in distance order, with distances ranging from 0.796438753604889 to 0.844647645950317). These were the entries:
- The edition concludes with a mention of Flyscrape, a tool for internet scraping written in Golang. Flyscrape uses an embedded JavaScript interpreter for configuration and processing functions. The author plans to convert his Capitol insurrection DoJ scraper to Flyscrape and will report back on the results. The tool can be installed via a command provided in the post or through pre-built binaries on the site source.
- Waybackpack: A command-line tool for downloading the entire Wayback Machine archive for a given URL, allowing users to scrape data from older versions of resources or resources that have disappeared. Installation and usage instructions are available at https://github.com/jsvine/waybackpack.
- Webhook.Site: An online tool that provides a unique, random URL (and email address) for testing webhooks or arbitrary HTTP requests. It displays requests in real-time for inspection without needing own server infrastructure. The site also features a custom graphical editor and scripting language for processing HTTP requests, making it useful for connecting incompatible APIs or quickly building new ones.
- The post discusses two tools for archiving and preserving digital content. The first tool is Monolith, a Rust-based CLI tool that can convert any HTML page into a self-contained HTML file. This tool embeds all necessary assets such as CSS, JavaScript, and images into a single HTML file, allowing for offline access and preservation of the original webpage.
- tldts: A JavaScript library for extracting hostnames, domains, public suffixes, top-level domains, and subdomains from URLs. It is fast, supports Unicode/IDNA, and is continuously updated. Check it out on npm and GitHub.
(I never did do the Flyscrape thing.)
Not a bad job for a terrible query!
I tried “monospace font” and the first result was the post on Monaspace (older posts with monospace font content had no TL;DR, remember), and “docker or container” had solid top three results for this corpus. Oddly enough, neither “syntax highlight” nor “terminal editor” had the results I thought they would in the top five.
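If you only want the readable bits back, the same KNN pattern trims down nicely. Here’s a sketch of that “monospace font” query returning just the Drop filename, the matching TL;DR entry, and a rounded distance (swap in whatever query string you like):
with matches as (
  select
    rowid,
    distance
  from vec_articles
  where tldr_embeddings match rembed('llamafile', 'monospace font')
  order by distance
  limit 5
)
select
  articles."drop" as drop_file, -- "drop" is a keyword, so it needs quoting here
  articles.tldr,
  round(matches.distance, 3) as distance
from matches
left join articles on articles.rowid = matches.rowid;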
One really nice thing about sqlite-vec is that it runs everywhere (including SQLite in the browser). Which means browser-based search for, say, static sites will soon be leveled up quite a few notches.
Both of these extensions are in the early stages of development, so they’re only going to get better (they’re already pretty dope), and they’re definitely helping democratize working with embeddings and vector search.
FIN
Remember, you can follow and interact with the full text of The Daily Drop’s free posts on Mastodon via @dailydrop.hrbrmstr.dev@dailydrop.hrbrmstr.dev ☮️