Drop #565 (2024-12-04): All Strings ATTACHed

Abusing Remotely Hosted DuckDB Databases For Fun And Full Text Search

Drew Breunig’s post on “Turning Your Root URL Into a DuckDB Remote Database” has been making the rounds this week, and I feel pretty daft for not having lingered on DuckDB’s manual page on ATTACH long enough before now to realize the full potential of this seemingly mundane command.

If you’re a DuckDB user, you very, very likely read remote Parquet (and other) files on-the-regular. I’ve got a few chapters on that idiom in the still-unfinished “Cooking With DuckDB” tome. For example:

$ duckdb -c "
FROM read_parquet([
  'https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-01.parquet',
  'https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-02.parquet',
  'https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-03.parquet',
  'https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-04.parquet',
  'https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-05.parquet',
  'https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-06.parquet',
  'https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-07.parquet',
  'https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-08.parquet',
  'https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-09.parquet',
  'https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-10.parquet',
  'https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-11.parquet',
  'https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-12.parquet'
  ])
SELECT
  COUNT(*) AS n_rides_2023
"
┌──────────────┐
│ n_rides_2023 │
│    int64     │
├──────────────┤
│     38310226 │
└──────────────┘

I do that quite a bit in many-an-Observable-Framework dashboard I put together for $WORK.

But, you can have DuckDB ATTACH a full-on DuckDB database from a remote source. If you’ve only ever used the “:memory:” mode of DuckDB, you should read up on “persistence” before continuing.
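The idiom looks like this (the remote URL below is a hypothetical placeholder, not a real database; remote databases over HTTP attach read-only):

```sql
-- local, persistent database file
ATTACH 'local.db' AS local_db;

-- remote DuckDB database over HTTP (hypothetical URL); httpfs fetches
-- only the byte ranges it needs, and the attach is read-only
ATTACH 'https://example.com/some.db' AS remote_db (READ_ONLY);
USE remote_db;
```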

Drew’s post caused a bonkers thought to enter my noggin: will DuckDB’s full text search capability work in such a remote mode? If so, it opens up tons of possibilities.

Let’s explore one of them.

Getting The Daily Drop Into A Database


Longtime readers will likely remember that I use markdown to write all the posts. After a bumpy start (back in 2022), I settled into a comfy idiom for the shape of the post files: a boring date + numeric issue title, usually (unless I forget) with a tagline; a line below that with the topics; and then the markdown content of the sections. I also rejiggered the directory structure after said rough start. Since earlier posts did not follow the idiom, I had to go back and edit a few dozen of them to match this style.
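Concretely, each post file starts like this (an illustrative sketch, not an actual post file):

```
Drop #999 (2024-01-01): Some Pithy Tagline
topic one; topic two; topic three

## topic one

…markdown content…
```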

Since the posts were, now, all in good shape, that made it possible to shove them into a DuckDB database (with a little help from R):

library(stringi)
library(purrr)
library(tibble)

# find all the Drops (`pdir` is the root directory of the post files)
fils <- list.files(pdir, pattern = "index.md", full.names = TRUE, recursive = TRUE) 

# read them in
fils |> 
  map(stri_read_lines) -> drops

# get the dates of the posts
days <- stri_match_first_regex(fils, "([[:digit:]]{4}-[[:digit:]]{2}-[[:digit:]]{2})")[,2]

# get the title of the posts
titles <- map_chr(drops, head, 1)

# get the sections in the posts
sections <- trimws(map_chr(drops, \(.x) .x[2]))

# collapse the lines in each post into one giant string
markdown <- map_chr(drops, paste0, collapse = "\n")

# we're going to extract the URLs that are in the posts
url_regex <- "https?://(?:www\\.)?[-a-zA-Z0-9@:%._\\+~#=]{1,256}\\.[a-zA-Z0-9()]{1,6}\\b(?:[-a-zA-Z0-9()@:%_\\+.~#?&//=]*)"

stri_match_all_regex(markdown, url_regex, omit_no_match = TRUE, multi_line = TRUE) |> 
  lapply(
    stri_replace_last_regex, r"(\).|\)$)", ""
  ) |> 
  lapply(
    unique
  ) |> 
  lapply(
    stri_subset_regex, "twitter|substack|youtube.com/watch$", negate = TRUE # ignore crufty ones
  ) |> 
  lapply(
    stri_replace_all_fixed, "http://", "https://"
  ) -> urls

# save it out
jsonlite::stream_out(
  tibble(
    day = days,
    title = titles,
    sections = sections,
    md = markdown,
    urls = urls
  ),
  gzfile("/tmp/drops.json.gz")
)
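For the curious, the URL-extraction steps in that pipeline translate to Python roughly like this (a sketch of the same steps, not the code I actually used — same regex, strip the trailing `)` that markdown link syntax leaves behind, dedupe, drop the crufty hosts, force https):

```python
import re

# same pattern as the R version above
URL_RE = re.compile(
    r"https?://(?:www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}"
    r"\b(?:[-a-zA-Z0-9()@:%_\+.~#?&//=]*)"
)

def extract_urls(markdown: str) -> list[str]:
    # match every URL in the post body
    urls = URL_RE.findall(markdown)
    # strip the trailing ')' (or ').') that markdown link syntax leaves behind
    urls = [re.sub(r"\)\.?$", "", u) for u in urls]
    out, seen = [], set()
    for u in urls:
        u = "https://" + u[len("http://"):] if u.startswith("http://") else u
        # ignore crufty ones, keep the first occurrence of each URL
        if u not in seen and not re.search(r"twitter|substack|youtube\.com/watch$", u):
            seen.add(u)
            out.append(u)
    return out
```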

I could have done more in R, but I still favor working in DuckDB’s CLI for SQL ops, so we can turn that JSON into a DuckDB database with full text search capabilities:

$ duckdb drops.db
> INSTALL fts; -- install the full text search extension
> LOAD fts; -- load it
> CREATE TABLE posts AS (FROM read_ndjson('drops.json.gz')); -- create table from the json
> .schema
CREATE TABLE posts("day" DATE, title VARCHAR, sections VARCHAR, md VARCHAR, urls VARCHAR[]);
> PRAGMA create_fts_index('posts', 'title', 'md'); -- setup fts indices
> .schema
CREATE TABLE fts_main_posts.dict(termid BIGINT, term VARCHAR, df BIGINT);
CREATE TABLE fts_main_posts.docs(docid BIGINT, "name" VARCHAR, len BIGINT);
CREATE TABLE fts_main_posts.fields(fieldid BIGINT, field VARCHAR);
CREATE TABLE posts("day" DATE, title VARCHAR, sections VARCHAR, md VARCHAR, urls VARCHAR[]);
CREATE TABLE fts_main_posts.stats(num_docs BIGINT, avgdl DOUBLE);
CREATE TABLE fts_main_posts.stopwords(sw VARCHAR);
CREATE TABLE fts_main_posts.terms(docid BIGINT, fieldid BIGINT, termid BIGINT);

I put that at https://rud.is/data/drops.db. That directory already had a generous CORS policy, but I had to make some further tweaks to the nginx config for it to efficiently handle the remote DuckDB operations:

location ~* /data {
  add_header "Access-Control-Allow-Origin" "*";
  add_header "Access-Control-Expose-Headers" "Content-Range";
  sendfile on;
  tcp_nopush on;
  add_header Accept-Ranges bytes;
  proxy_force_ranges on;
  proxy_http_version 1.1;
  proxy_set_header Range $http_range;
  proxy_set_header If-Range $http_if_range;
}

I also had to touch drops.db.wal in that data/ directory, since DuckDB looks for that Write-Ahead Log file and will error out if it doesn’t find it.
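Those Range/If-Range bits are the important part: DuckDB’s httpfs client doesn’t download the whole database file, it issues HTTP range requests for just the blocks it needs, so the server has to honor the Range header. This is what one of those partial reads looks like at the HTTP level (a stdlib sketch that only constructs the request; sending it would need network access):

```python
from urllib.request import Request

# the kind of partial read DuckDB's httpfs performs against the remote db:
# ask for just the first 4 KiB of the file instead of the whole thing
req = Request("https://rud.is/data/drops.db", headers={"Range": "bytes=0-4095"})
print(req.get_header("Range"))
```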

You can go ahead and play with that remote file from the DuckDB CLI right now. The below will find the top 2 posts about atproto:

$ duckdb
> INSTALL fts;
> LOAD fts;
> ATTACH 'https://rud.is/data/drops.db' AS drops_db;
> USE drops_db;
> .schema
CREATE TABLE fts_main_posts.dict(termid BIGINT, term VARCHAR, df BIGINT);
CREATE TABLE fts_main_posts.docs(docid BIGINT, "name" VARCHAR, len BIGINT);
CREATE TABLE fts_main_posts.fields(fieldid BIGINT, field VARCHAR);
CREATE TABLE posts("day" DATE, title VARCHAR, sections VARCHAR, md VARCHAR, urls VARCHAR[]);
CREATE TABLE fts_main_posts.stats(num_docs BIGINT, avgdl DOUBLE);
CREATE TABLE fts_main_posts.stopwords(sw VARCHAR);
CREATE TABLE fts_main_posts.terms(docid BIGINT, fieldid BIGINT, termid BIGINT);
> .mode json
>
FROM (
  FROM posts
  SELECT *, fts_main_posts.match_bm25(
    title,
    'atproto'
  ) AS score
) sq
SELECT sections, score
WHERE score IS NOT NULL
ORDER BY score DESC
LIMIT 2;
[
  {"sections":"atproto/bsky Deep Dive; Ethernet Is Old; Stocky","score":3.1507045410668577},
  {"sections":"goat; sampler; 2024 CSS Survey Results","score":3.124673589581529}
]

I may be odd, but I find that to be super cool.
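If you’re wondering what match_bm25 actually computes, it’s the Okapi BM25 ranking function. Here’s a toy version over pre-tokenized documents (my own sketch with common k/b defaults; DuckDB’s real implementation also handles stemming, stopwords, and such):

```python
import math

def bm25(query_terms, doc, corpus, k=1.2, b=0.75):
    """Toy Okapi BM25: score `doc` (a token list) against `query_terms`,
    using `corpus` (a list of token lists) for document frequencies."""
    n = len(corpus)
    avgdl = sum(len(d) for d in corpus) / n  # average document length
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)  # document frequency
        if df == 0:
            continue
        idf = math.log((n - df + 0.5) / (df + 0.5) + 1)  # smoothed IDF
        tf = doc.count(term)  # term frequency in this document
        score += idf * (tf * (k + 1)) / (tf + k * (1 - b + b * len(doc) / avgdl))
    return score
```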

We could have just gone and used the epic Charm ecosystem to build a TUI remote DuckDB Drop reader app, but I wanted to see if this would work in a DuckDB WASM context. So, I tried the above in DuckDB’s Web Shell (that link runs the query in your browser), and it turns out DuckDB WASM does support the full text search extension.

Since that worked, I got the daft idea to throw together a quick/small Post Explorer app using Vite, Lit, DuckDB, and my lame CSS skills.

If you’re comfortable with those pieces of tech, you can likely skip the blathering, below, and hit up the source on Codeberg.

The Daily Drop In DuckDB Post Explorer

Vite has direct support for Lit, so you can do:

$ npm create vite@latest duckdb-drops -- --template lit-ts

or, if you prefer wizard mode:

$ npm create vite@latest

For those keen-eyed readers who noticed the lit-ts bit and pondered, “I thought you didn’t like TypeScript?”: I really don’t, but using TS in a Vite+Lit context speeds up development quite a bit, especially when it comes to Lit’s decorators.

Once the skeleton project is created, I followed the boilerplate on the DuckDB WASM npm repo to get it working. The salient bits of that are below:

import * as duckdb from "@duckdb/duckdb-wasm";
import duckdb_wasm from "@duckdb/duckdb-wasm/dist/duckdb-mvp.wasm?url";
import mvp_worker from "@duckdb/duckdb-wasm/dist/duckdb-browser-mvp.worker.js?url";
import duckdb_wasm_eh from "@duckdb/duckdb-wasm/dist/duckdb-eh.wasm?url";
import eh_worker from "@duckdb/duckdb-wasm/dist/duckdb-browser-eh.worker.js?url";

const MANUAL_BUNDLES: duckdb.DuckDBBundles = {
  mvp: {
    mainModule: duckdb_wasm,
    mainWorker: mvp_worker,
  },
  eh: {
    mainModule: duckdb_wasm_eh,
    mainWorker: eh_worker,
  },
};

// …

// in Lit, this will fire up duckdb when the element is on the page
async firstUpdated() {
  this.isLoading = true;
  try {
    const bundle = await duckdb.selectBundle(MANUAL_BUNDLES);
    const worker = new Worker(bundle.mainWorker!);
    const logger = new duckdb.ConsoleLogger();
    this.db = new duckdb.AsyncDuckDB(logger, worker);
    await this.db.instantiate(bundle.mainModule, bundle.pthreadWorker);
    const c = await this.db.connect();
    await c.query(`
      INSTALL fts;
      LOAD fts;
      ATTACH 'https://rud.is/data/drops.db' AS drops_db;
      USE drops_db;
    `);
    this.dbConn = c;
    await this.performSearch(""); // init the full table listing
  } finally {
    this.isLoading = false;
  }
}

The searching/listing is handled like this:

// table results are either "everything" or the ones from the full text search
async performSearch(query: string) {
  // escape single quotes; interpolating raw input into SQL is injection-prone
  const q = query.trim().replace(/'/g, "''");
  const searchSQL = q
    ? `
      FROM posts
      SELECT day, title, sections, md, urls, fts_main_posts.match_bm25(title, '${q}') AS score
      WHERE fts_main_posts.match_bm25(title, '${q}') IS NOT NULL
      ORDER BY score DESC;
    `
    : `
      FROM posts
      SELECT day, title, sections, md, urls
      ORDER BY day DESC;
    `;

  const res = await this.dbConn.query(searchSQL);
  this.posts = await Promise.all(
    res.toArray().map(async (row: any) => {
      const post: PostRecord = {
        day: new Date(row.day).toISOString().split("T")[0],
        title: row.title,
        sections: row.sections,
        md: row.md,
        urls: Array.from(row.urls),
        score: row.score,
        renderedMd: await marked(row.md), // pre-render the markdown: the async render works in dev mode, but Vite complains about it in build
      };
      return post;
    }),
  );
}

The rest is really just boring HTML/CSS/Lit boilerplate code, but we now have the viewer side of a bona fide content management system ready to go.

This is not optimized/robust code as it was very hastily thrown together, but it shows that we can now use this powerful feature of DuckDB (full text search) in a remote context.

The app is here, the full code is here, and I’ll work on a system to update the database with new posts later this week.


FIN

We all will need to get much, much better at sensitive comms, and Signal is one of the only ways to do that in modern times. You should absolutely use that if you are doing any kind of community organizing (etc.). Ping me on Mastodon or Bluesky with a “🦇?” request (public or faux-private) and I’ll provide a one-time use link to connect us on Signal.

Remember, you can follow and interact with the full text of The Daily Drop’s free posts on Mastodon via @dailydrop.hrbrmstr.dev@dailydrop.hrbrmstr.dev ☮️
