Drop #762 (2026-01-23): If It Talks Like A…

duck_tails; crawler; url_pattern

I was going to make some space between DuckDB-centric Drops, but I literally had to use one of the new Community extensions I discovered yesterday to solve a self-inflicted problem (not my doing, but I am in charge, so it’s technically my problem, now). And, if I’m going to show off one extension, I might as well pick two more!


TL;DR

(This is an LLM/GPT-generated summary of today’s Drop. Ollama and MiniMax M2.1.)


duck_tails

Photo by mauro carlo stella on Pexels.com

In the early days of $WORK (years before I joined), the folks doing systems engineering, network engineering, and software engineering/app dev were not “data” people. For the most part, that’s still true, and — as such — sometimes causes a bit of pain for us data folk who need to perform research tasks.

One of those is somewhat more of an annoyance than the others, and that would be how, where, and when the API and app use tag name, id, and slug for input or output to the caller.

We have naming conventions for tags that we refine over time, and I have one example from this week. This tag — GNU Inetutils Telnetd Authentication Bypass CVE-2026-24061 Attempt — has our canonical form of the product/“thing”, the weakness, the CVE, and the action. Some of that is also encoded in fields, but there needs to be a nice display name for humans. When that weakness was first disclosed, no CVE existed, so the name was just “GNU Inetutils Telnetd Authentication Bypass Attempt”. And, when the data pipeline actually “tags” a given session, that’s permanent (I mean, we technically can go back and modify the record, but that violates all sorts of rules and contracts we have with ourselves and those folks who rely on the data).

Sadly, two APIs that I and others kind of rely on only allow querying by name, and one of them doesn’t store the UUID. (That sound you definitely just heard was me, yet again, smashing my head on the desk.) Truth be told, seeing the name in the query is nice for self-documentation (and typing/copying UUIDs is a pain), but when we’re tracking an emergent threat like this one, it means we have to add both names to the query. And, when I do make a query by id in these and other systems, the different names will keep coming back until the data ages out of production (~90d; but we keep all of it around forever). It’s not a showstopper, but it is annoying.

We’re about to move to storing the tags we create in a database, which will make normalizing everything a bit easier. Right now, they live in JSON files in git, and at least once a day there’s a process that scoops up all the tags into one file, which is also in git.

Sure, we could have thought this through a bit more and kept timestamped copies of the combined file, but “live and learn”?

I’ve used various tricks to get files out of git history but they’re all horribly clunky. So when I saw duck_tails yesterday, I might have SQUEEd. Here’s the official description:

Duck Tails brings git-aware data analysis capabilities to DuckDB, enabling sophisticated version-controlled data workflows. The extension provides three core capabilities:

Git Filesystem Access: Use the git:// protocol to access any file in your git repository at any commit, branch, or tag. This allows you to query historical data states, compare versions, and perform temporal analysis directly in SQL.
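To make that git:// bit concrete, here’s a tiny, hedged sketch of the two calls the script below leans on: git_log() to enumerate the commits that touched a file, and read_json() pointed at a git:// path with an @ref suffix (the @branch form is my read of the description above; the commit-hash form is the one I actually use):

INSTALL duck_tails FROM community;
LOAD duck_tails;

-- which commits touched this file?
FROM git_log('git://docs/tags-no-metadata.json');

-- read the file as it existed at a ref; per the description, a commit
-- hash, branch, or tag should all work after the '@'
FROM read_json('git://docs/tags-no-metadata.json@main')
SELECT name, id;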

The functions are very straightforward, and within about 3 minutes I was able to create a very small Bash+DuckDB script to get what I want:

#!/usr/bin/env bash

cd /Users/hrbrmstr/gn/labs-viz-data || exit 1

echo "Retrieving git history…"

# one generated query per commit that touched the tags file
mapfile -t queries < <(duckdb -c "INSTALL duck_tails FROM community;
LOAD duck_tails;
COPY (
  FROM git_log('git://docs/tags-no-metadata.json')
  SELECT
    'FROM read_json(''git://docs/tags-no-metadata.json@' || commit_hash || ''') SELECT name, id' AS query
) TO '/dev/stdout' (FORMAT csv, QUOTE '', HEADER false)")

echo "Building the combined query…"

# glue the per-commit queries together with UNIONs
combined=""
for i in "${!queries[@]}"; do
  if [[ $i -gt 0 ]]; then
    combined+="
UNION
"
  fi
  combined+="${queries[$i]}"
done

echo "Creating CSV…"

# de-dupe the (name, id) pairs across all commits and write the lookup table
duckdb -c "INSTALL duck_tails FROM community;
LOAD duck_tails;
COPY (SELECT DISTINCT name, id FROM (${combined})) TO '~/gn/labs-viz-data/tag-name-id-history.csv' (FORMAT csv, QUOTE '\"', FORCE_QUOTE *, HEADER true)"

echo "Done!"

The first query builds a series of FROM statements that we combine in Bash, adding a UNION between each one. And, at the end of the process, I have a super nice translation table I can use to glue things together without being too verbose, while always being able to rely on the id (which is not technically 100% true, but that’s a problem to solve for another day)!
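For example (a hedged sketch; the UUID below is a placeholder, not a real tag id), when one of those name-only APIs needs every name a tag has ever worn:

-- every display name a given tag id has ever had, straight from the
-- translation table (the UUID is a placeholder for illustration)
SELECT name
FROM read_csv('tag-name-id-history.csv')
WHERE id = '00000000-0000-0000-0000-000000000000';

The same trick works in reverse: join on name to recover the id for records that only carry a display name.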

It’s a great extension that relies on the widely-used libgit2 for all the underlying git work.


crawler

Thanks to “AI”, it’s getting harder to scrape content (legitimately) from the internet, but it’s not impossible, and there are sites on the open web that still do not mind if you responsibly yank some of their HTML.

The DuckDB crawler (GH) extension gives our fowl friend scraping and content crunching superpowers. I’m just going to show off a couple, but here are a few more to whet some appetites:

  • crawl() table function with automatic rate limiting and robots.txt compliance
  • crawl_url() for LATERAL joins
  • sitemap() for XML sitemap parsing
  • jq() and htmlpath() functions for CSS selector-based extraction
  • html.readability for article extraction
  • html.schema for JSON-LD/microdata parsing
  • CRAWLING MERGE INTO syntax for upsert operations

(There’s quite a bit more information and many more examples at the GH link.)

This is one small example of the utility:

🦆>FROM crawl([
     'https://duckdb.org/community_extensions/extensions/duck_tails',
     'https://duckdb.org/community_extensions/extensions/crawler',
     'https://duckdb.org/community_extensions/extensions/html_readability',
     'https://duckdb.org/community_extensions/extensions/urlpattern'
   ])
   SELECT
     url,
     jq(html.document, 'title').text AS title,
     html.readability.text_content AS contents
   ;
┌──────────────────────┬──────────────────────┬───────────────────────────────────────────────────────────────┐
│ url │ title │ contents │
│ varchar │ varchar │ json │
├──────────────────────┼──────────────────────┼───────────────────────────────────────────────────────────────┤
│ https://duckdb.org… │ duck_tails – DuckD… │ "Smart Development Intelligence for DuckDB - Git-aware data… │
│ https://duckdb.org… │ crawler – DuckDB C… │ "SQL-native web crawler with HTML extraction and MERGE supp… │
│ https://duckdb.org… │ html_readability –… │ "Extract readable content from HTML using Mozilla's Readabi… │
│ https://duckdb.org… │ urlpattern – DuckD… │ "WHATWG URLPattern API for matching and extracting componen… │
└──────────────────────┴──────────────────────┴───────────────────────────────────────────────────────────────┘
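The feature list above also mentions sitemap() and crawl_url() for LATERAL joins. Here’s roughly (and I do mean roughly) how I’d expect those to compose: the sitemap URL is a placeholder, and the column and struct names are my guesses, so check the GH docs before copying this anywhere.

INSTALL crawler FROM community;
LOAD crawler;

-- hedged sketch: the sitemap URL is a placeholder, and the `url` column
-- name plus the shape of crawl_url()'s result are assumptions on my part
SELECT
  s.url,
  jq(page.html.document, 'title').text AS title
FROM sitemap('https://example.com/sitemap.xml') AS s,
     LATERAL crawl_url(s.url) AS page
LIMIT 10;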

(NOTE: there’s also an html_readability extension if you already have a pile of HTML around.)

I don’t have tons of scraping tasks to do anymore, but when I do, I’m reaching for this. I’ll always lean into using two standalone binaries (duckdb & this library) that work everywhere painlessly vs. a gajillion dependencies in R or Python (or whatev).


url_pattern

Photo by Designecologist on Pexels.com

While I may not need to scrape, I do need to parse and compare URLs all the time. Some other extensions have functions for the parsing, but we now have a dedicated urlpattern extension that focuses on the matching bit (and it can extract components as well).

It implements the WHATWG “URL Pattern” standard.
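Just to give a flavor of the pattern syntax itself: the ':name' group notation below is straight from the standard, but the function name is a stand-in I made up for illustration, not the extension’s actual API (the GH README has the real thing).

INSTALL urlpattern FROM community;
LOAD urlpattern;

-- ':name' is WHATWG URL Pattern group syntax; url_pattern_extract() is a
-- made-up stand-in for whatever matching/extraction function the
-- extension actually ships; consult the GH README
SELECT url_pattern_extract(
  'https://duckdb.org/community_extensions/extensions/:name',
  'https://duckdb.org/community_extensions/extensions/urlpattern'
) AS matched;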

This extension also has some thorough examples in GH, so I’ll leave you to peruse them as you prepare for the crushing cold and snow headed our way (at least for ~180m folks in the U.S.).


FIN

Remember, you can follow and interact with the full text of The Daily Drop’s free posts on:

  • Mastodon via @dailydrop.hrbrmstr.dev@dailydrop.hrbrmstr.dev
  • Bluesky via <https://bsky.app/profile/dailydrop.hrbrmstr.dev.web.brid.gy>

☮️

