Drop #565 (2024-12-04): All Strings ATTACHed

Abusing Remotely Hosted DuckDB Databases For Fun And Full Text Search

Drew Breunig’s post on “Turning Your Root URL Into a DuckDB Remote Database” has been making the rounds this week, and I feel pretty daft for not having lingered on DuckDB’s manual page on ATTACH long enough before now to realize the full potential of this seemingly mundane command.

If you’re a DuckDB user, you very, very likely read remote Parquet (and other) files on-the-regular. I’ve got a few chapters on that idiom in the still-unfinished “Cooking With DuckDB” tome. For example:

$ duckdb -c "
FROM read_parquet([
  'https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-01.parquet',
  'https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-02.parquet',
  'https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-03.parquet',
  'https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-04.parquet',
  'https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-05.parquet',
  'https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-06.parquet',
  'https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-07.parquet',
  'https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-08.parquet',
  'https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-09.parquet',
  'https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-10.parquet',
  'https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-11.parquet',
  'https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-12.parquet'
  ])
SELECT
  COUNT(*) AS n_rides_2023
"
┌──────────────┐
│ n_rides_2023 │
│    int64     │
├──────────────┤
│     38310226 │
└──────────────┘

I do that quite a bit in many-an-Observable-Framework dashboard I put together for $WORK.

But, you can have DuckDB ATTACH a full-on DuckDB database from a remote source. If you’ve only ever used the “:memory:” mode of DuckDB, you should read up on “persistence” before continuing.
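The idiom looks like this (the remote URL below is a hypothetical placeholder, not a real database; remote databases over HTTP attach read-only):

```sql
-- local, persistent database file
ATTACH 'local.db' AS local_db;

-- remote DuckDB database over HTTP (hypothetical URL); httpfs fetches
-- only the byte ranges it needs, and the attach is read-only
ATTACH 'https://example.com/some.db' AS remote_db (READ_ONLY);
USE remote_db;
```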

Drew’s post caused a bonkers thought to enter my noggin: will DuckDB’s full text search capability work in such a remote mode? If so, it opens up tons of possibilities.

Let’s explore one of them.

Getting The Daily Drop Into A Database


Longtime readers will likely remember that I use markdown to write all the posts. After a bumpy start (back in 2022), I settled into a comfy idiom for the shape of the post files: a boring date + numeric issue title, usually (unless I forget) with a tagline; a line below that with the topics; and then the markdown content of the sections. I also rejiggered the directory structure after said rough start. Since earlier posts did not follow the idiom, I had to go back and edit a few dozen of them to match this style.
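Concretely, each post file starts like this (an illustrative sketch, not an actual post file):

```
Drop #999 (2024-01-01): Some Pithy Tagline
topic one; topic two; topic three

## topic one

…markdown content…
```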

Since the posts were, now, all in good shape, that made it possible to shove them into a DuckDB database (with a little help from R):

library(stringi)
library(purrr)
library(tibble)

# find all the Drops (`pdir` is the root directory of the post files)
fils <- list.files(pdir, pattern = "index.md", full.names = TRUE, recursive = TRUE) 

# read them in
fils |> 
  map(stri_read_lines) -> drops

# get the dates of the posts
days <- stri_match_first_regex(fils, "([[:digit:]]{4}-[[:digit:]]{2}-[[:digit:]]{2})")[,2]

# get the title of the posts
titles <- map_chr(drops, head, 1)

# get the sections in the posts
sections <- trimws(map_chr(drops, \(.x) .x[2]))

# collapse the lines in each post into one giant string
markdown <- map_chr(drops, paste0, collapse = "\n")

# we're going to extract the URLs that are in the posts
url_regex <- "https?://(?:www\\.)?[-a-zA-Z0-9@:%._\\+~#=]{1,256}\\.[a-zA-Z0-9()]{1,6}\\b(?:[-a-zA-Z0-9()@:%_\\+.~#?&//=]*)"

stri_match_all_regex(markdown, url_regex, omit_no_match = TRUE, multi_line = TRUE) |> 
  lapply(
    stri_replace_last_regex, r"(\).|\)$)", ""
  ) |> 
  lapply(
    unique
  ) |> 
  lapply(
    stri_subset_regex, "twitter|substack|youtube.com/watch$", negate = TRUE # ignore crufty ones
  ) |> 
  lapply(
    stri_replace_all_fixed, "http://", "https://"
  ) -> urls

# save it out
jsonlite::stream_out(
  tibble(
    day = days,
    title = titles,
    sections = sections,
    md = markdown,
    urls = urls
  ),
  gzfile("/tmp/drops.json.gz")
)
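For the curious, the URL-extraction steps in that pipeline translate to Python roughly like this (a sketch of the same steps, not the code I actually used — same regex, strip the trailing `)` that markdown link syntax leaves behind, dedupe, drop the crufty hosts, force https):

```python
import re

# same pattern as the R version above
URL_RE = re.compile(
    r"https?://(?:www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}"
    r"\b(?:[-a-zA-Z0-9()@:%_\+.~#?&//=]*)"
)

def extract_urls(markdown: str) -> list[str]:
    # match every URL in the post body
    urls = URL_RE.findall(markdown)
    # strip the trailing ')' (or ').') that markdown link syntax leaves behind
    urls = [re.sub(r"\)\.?$", "", u) for u in urls]
    out, seen = [], set()
    for u in urls:
        u = "https://" + u[len("http://"):] if u.startswith("http://") else u
        # ignore crufty ones, keep the first occurrence of each URL
        if u not in seen and not re.search(r"twitter|substack|youtube\.com/watch$", u):
            seen.add(u)
            out.append(u)
    return out
```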

I could have done more in R, but I still favor working in DuckDB’s CLI for SQL ops, so we can turn that JSON into a DuckDB database with full text search capabilities:

$ duckdb drops.db
> INSTALL fts; -- install the full text search extension
> LOAD fts; -- load it
> CREATE TABLE posts AS (FROM read_ndjson('drops.json.gz')); -- create table from the json
> .schema
CREATE TABLE posts("day" DATE, title VARCHAR, sections VARCHAR, md VARCHAR, urls VARCHAR[]);
> PRAGMA create_fts_index('posts', 'title', 'md'); -- setup fts indices
> .schema
CREATE TABLE fts_main_posts.dict(termid BIGINT, term VARCHAR, df BIGINT);
CREATE TABLE fts_main_posts.docs(docid BIGINT, "name" VARCHAR, len BIGINT);
CREATE TABLE fts_main_posts.fields(fieldid BIGINT, field VARCHAR);
CREATE TABLE posts("day" DATE, title VARCHAR, sections VARCHAR, md VARCHAR, urls VARCHAR[]);
CREATE TABLE fts_main_posts.stats(num_docs BIGINT, avgdl DOUBLE);
CREATE TABLE fts_main_posts.stopwords(sw VARCHAR);
CREATE TABLE fts_main_posts.terms(docid BIGINT, fieldid BIGINT, termid BIGINT);

I put that at https://rud.is/data/drops.db. That directory already had a generous CORS policy, but I had to make some further tweaks to the nginx config for it to efficiently handle the remote DuckDB operations:

location ~* /data {
  add_header "Access-Control-Allow-Origin" "*";
  add_header "Access-Control-Expose-Headers" "Content-Range";
  sendfile on;
  tcp_nopush on;
  add_header Accept-Ranges bytes;
  proxy_force_ranges on;
  proxy_http_version 1.1;
  proxy_set_header Range $http_range;
  proxy_set_header If-Range $http_if_range;
}

I also had to touch drops.db.wal in that data/ directory, since DuckDB looks for that Write-Ahead Log file and will error out if it doesn’t find it.
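Those Range/If-Range bits are the important part: DuckDB’s httpfs client doesn’t download the whole database file, it issues HTTP range requests for just the blocks it needs, so the server has to honor the Range header. This is what one of those partial reads looks like at the HTTP level (a stdlib sketch that only constructs the request; sending it would need network access):

```python
from urllib.request import Request

# the kind of partial read DuckDB's httpfs performs against the remote db:
# ask for just the first 4 KiB of the file instead of the whole thing
req = Request("https://rud.is/data/drops.db", headers={"Range": "bytes=0-4095"})
print(req.get_header("Range"))
```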

You can go ahead and play with that remote file from the DuckDB CLI right now. The below will find the top 2 posts about atproto:

$ duckdb
> INSTALL fts;
> LOAD fts;
> ATTACH 'https://rud.is/data/drops.db' AS drops_db;
> USE drops_db;
> .schema
CREATE TABLE fts_main_posts.dict(termid BIGINT, term VARCHAR, df BIGINT);
CREATE TABLE fts_main_posts.docs(docid BIGINT, "name" VARCHAR, len BIGINT);
CREATE TABLE fts_main_posts.fields(fieldid BIGINT, field VARCHAR);
CREATE TABLE posts("day" DATE, title VARCHAR, sections VARCHAR, md VARCHAR, urls VARCHAR[]);
CREATE TABLE fts_main_posts.stats(num_docs BIGINT, avgdl DOUBLE);
CREATE TABLE fts_main_posts.stopwords(sw VARCHAR);
CREATE TABLE fts_main_posts.terms(docid BIGINT, fieldid BIGINT, termid BIGINT);
> .mode json
>
FROM (
  FROM posts
  SELECT *, fts_main_posts.match_bm25(
    title,
    'atproto'
  ) AS score
) sq
SELECT sections, score
WHERE score IS NOT NULL
ORDER BY score DESC
LIMIT 2;
[
  {"sections":"atproto/bsky Deep Dive; Ethernet Is Old; Stocky","score":3.1507045410668577},
  {"sections":"goat; sampler; 2024 CSS Survey Results","score":3.124673589581529}
]

I may be odd, but I find that to be super cool.
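If you’re wondering what match_bm25 actually computes, it’s the Okapi BM25 ranking function. Here’s a toy version over pre-tokenized documents (my own sketch with common k/b defaults; DuckDB’s real implementation also handles stemming, stopwords, and such):

```python
import math

def bm25(query_terms, doc, corpus, k=1.2, b=0.75):
    """Toy Okapi BM25: score `doc` (a token list) against `query_terms`,
    using `corpus` (a list of token lists) for document frequencies."""
    n = len(corpus)
    avgdl = sum(len(d) for d in corpus) / n  # average document length
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)  # document frequency
        if df == 0:
            continue
        idf = math.log((n - df + 0.5) / (df + 0.5) + 1)  # smoothed IDF
        tf = doc.count(term)  # term frequency in this document
        score += idf * (tf * (k + 1)) / (tf + k * (1 - b + b * len(doc) / avgdl))
    return score
```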

We could have just gone and used the epic Charm ecosystem to build a TUI remote DuckDB Drop reader app, but I wanted to see if this would work in a DuckDB WASM context. So, I tried the above in DuckDB’s Web Shell (that link runs the query in your browser), and it turns out DuckDB WASM does support the full text search extension.

Since that worked, I got the daft idea to throw together a quick/small Post Explorer app using Vite, Lit, DuckDB, and my lame CSS skills.

If you’re comfortable with those pieces of tech, you can likely skip the blathering, below, and hit up the source on Codeberg.

The Daily Drop In DuckDB Post Explorer

Vite has direct support for Lit, so you can do:

$ npm create vite@latest duckdb-drops -- --template lit-ts

or, if you prefer wizard mode:

$ npm create vite@latest

For those keen-eyed readers who noticed the lit-ts bit and pondered, “I thought you didn’t like TypeScript?”: I really don’t, but using TS in a Vite+Lit context speeds up development quite a bit, especially when it comes to Lit’s decorators.

Once the skeleton project is created, I followed the boilerplate on the DuckDB WASM npm repo to get it working. The salient bits of that are below:

import * as duckdb from "@duckdb/duckdb-wasm";
import duckdb_wasm from "@duckdb/duckdb-wasm/dist/duckdb-mvp.wasm?url";
import mvp_worker from "@duckdb/duckdb-wasm/dist/duckdb-browser-mvp.worker.js?url";
import duckdb_wasm_eh from "@duckdb/duckdb-wasm/dist/duckdb-eh.wasm?url";
import eh_worker from "@duckdb/duckdb-wasm/dist/duckdb-browser-eh.worker.js?url";

const MANUAL_BUNDLES: duckdb.DuckDBBundles = {
  mvp: {
    mainModule: duckdb_wasm,
    mainWorker: mvp_worker,
  },
  eh: {
    mainModule: duckdb_wasm_eh,
    mainWorker: eh_worker,
  },
};

// …

// in Lit, this will fire up duckdb when the element is on the page
async firstUpdated() {
  this.isLoading = true;
  try {
    const bundle = await duckdb.selectBundle(MANUAL_BUNDLES);
    const worker = new Worker(bundle.mainWorker!);
    const logger = new duckdb.ConsoleLogger();
    this.db = new duckdb.AsyncDuckDB(logger, worker);
    await this.db.instantiate(bundle.mainModule, bundle.pthreadWorker);
    const c = await this.db.connect();
    await c.query(`
      INSTALL fts;
      LOAD fts;
      ATTACH 'https://rud.is/data/drops.db' AS drops_db;
      USE drops_db;
    `);
    this.dbConn = c;
    await this.performSearch(""); // init the full table listing
  } finally {
    this.isLoading = false;
  }
}

The searching/listing is handled like this:

// table results are either "everything" or the ones from the full text search
async performSearch(query: string) {
  // escape single quotes; interpolating raw input into SQL is injection-prone
  const q = query.trim().replace(/'/g, "''");
  const searchSQL = q
    ? `
      FROM posts
      SELECT day, title, sections, md, urls, fts_main_posts.match_bm25(title, '${q}') AS score
      WHERE fts_main_posts.match_bm25(title, '${q}') IS NOT NULL
      ORDER BY score DESC;
    `
    : `
      FROM posts
      SELECT day, title, sections, md, urls
      ORDER BY day DESC;
    `;

  const res = await this.dbConn.query(searchSQL);
  this.posts = await Promise.all(
    res.toArray().map(async (row: any) => {
      const post: PostRecord = {
        day: new Date(row.day).toISOString().split("T")[0],
        title: row.title,
        sections: row.sections,
        md: row.md,
        urls: Array.from(row.urls),
        score: row.score,
        renderedMd: await marked(row.md), // pre-render the markdown: the async render works in dev mode, but Vite complains about it in build
      };
      return post;
    }),
  );
}

The rest is really just boring HTML/CSS/Lit boilerplate code, but we now have the viewer side of a bona fide content management system ready to go.

This is not optimized/robust code as it was very hastily thrown together, but it shows that we can now use this powerful feature of DuckDB (full text search) in a remote context.

The app is here, the full code is here, and I’ll work on a system to update the database with new posts later this week.


FIN

We all will need to get much, much better at sensitive comms, and Signal is one of the only ways to do that in modern times. You should absolutely use that if you are doing any kind of community organizing (etc.). Ping me on Mastodon or Bluesky with a “🦇?” request (public or faux-private) and I’ll provide a one-time use link to connect us on Signal.

Remember, you can follow and interact with the full text of The Daily Drop’s free posts on Mastodon via @dailydrop.hrbrmstr.dev@dailydrop.hrbrmstr.dev ☮️
