duckdb-mcp (Golang); Yardstick; DataSketches
The pace of change in the DuckDB ecosystem has not slowed down, and there are some new items external to the DuckDB core that we’ll look at today.
TL;DR
(This is an LLM/GPT-generated summary of today’s Drop, produced via Ollama and MiniMax M2.1.)
- A Go-based MCP server for DuckDB that adds tools like list_tables and describe_table, and exposes schema information as MCP resources to help LLMs better interact with databases (https://git.sr.ht/~hrbrmstr/duckdb-mcp)
- The Yardstick extension for DuckDB implements “measure” column types that allow reusable calculations to correctly recalculate at different aggregation levels, using an AT operator for context shifting (https://github.com/sidequery/yardstick/blob/main/README.md)
- The DuckDB DataSketches extension provides streaming algorithms for distinct counts, percentiles, and frequent items with bounded memory usage and the ability to store and merge sketch states (https://github.com/query-farm/datasketches)
duckdb-mcp (Golang)

I’ve been using the Python MCP server for DuckDB and MotherDuck for a while now, and it’s solid for what it does. But after enough sessions watching LLMs struggle with some complex tables (nested structures, needing to use DuckDB-flavoured SQL syntax, etc.), I decided to build something with a bit more batteries included.
This Go port adds tools, resources, and prompts that — in real, daily, battle-tested workflows — have helped reduce the volume of busted queries and dramatically improved the decision trees LLMs make during an analysis workflow. list_tables shows you what’s in the database along with column counts. describe_table can show sample rows and null statistics, not just the schema. list_databases handles the case where you’ve attached multiple databases during a given session. The server exposes schema information as MCP resources, which means LLMs can access table structures without making tool calls. And there are built-in prompts that front-load DuckDB-specific knowledge—things like the FROM-first syntax, file querying, and JSON operations—so the LLM starts with better context. The OG epic setup prompt from the folks at MotherDuck is included since it’s really solid and I did not feel like reinventing the wheel.
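For context, these are the sorts of DuckDB-isms those prompts front-load, since generic SQL training data tends to miss them (the file paths below are just placeholders):

```sql
-- FROM-first syntax: SELECT is optional, and files are queryable directly
FROM 'logs/*.parquet' LIMIT 10;

-- Read a CSV with no CREATE TABLE / COPY dance
SELECT count(*) FROM read_csv_auto('events.csv');

-- JSON extraction operators (shown here on a literal; normally on a JSON column)
SELECT '{"user": {"id": 42}}'::JSON ->> '$.user.id' AS user_id;
```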
The other reason I wrote this was practical: it’s a single Go binary. No uvx, no npx, no dependency installation at runtime. You download it, point it at your database (local file, :memory:, MotherDuck, or S3), and it just works. I’m not paranoid about package managers, but I do prefer knowing exactly what’s running on my systems when any component on it has access to my data.
If you’re doing any database work with LLMs or another MCP client, this might save you some time. Just clone the repo (pre-built binaries are “coming soon” as I re-figure out how to work with Sourcehut), build the MCP server, and follow the examples for getting it into your workflow.
Yardstick

SQL has a (well, more than one) fundamental limitation in that you can’t create a reusable calculation that works correctly at different aggregation levels. Say you define a profitMargin formula in a view that groups by product and date. If someone queries that view and groups only by product, they get garbage (for folks who are reading but aren’t deep into SQL, that result is an average of averages instead of the true margin). The solution proposed in the paper behind Yardstick is a new column type called a “measure”. A measure is a formula attached to a table that “knows” how to recalculate itself based on whatever GROUP BY the query uses. When you reference a measure, SQL expands it into a subquery with the right filters for the current context. No self-joins! No repeated WHERE clauses! Stays crispy in milk! (Well, mebbe not that last one.) To put it more plainly, the formula lives in one place and just works everywhere.
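To make the trap concrete, here’s a minimal plain-DuckDB sketch (the sales table and its columns are made up for illustration) where re-aggregating a per-product-per-date margin view silently produces an average of averages:

```sql
-- Hypothetical sales data: revenue and cost per product per day
CREATE TABLE sales (product TEXT, sale_date DATE, revenue DOUBLE, cost DOUBLE);

-- A view that bakes the margin formula in at the product + date grain
CREATE VIEW margin_by_product_date AS
SELECT
    product,
    sale_date,
    (SUM(revenue) - SUM(cost)) / SUM(revenue) AS profit_margin
FROM sales
GROUP BY product, sale_date;

-- Wrong: averaging the pre-computed daily margins weights every day equally,
-- no matter how much revenue each day contributed
SELECT product, AVG(profit_margin) AS margin
FROM margin_by_product_date
GROUP BY product;

-- Right: the formula has to be restated from the raw rows at the new grain
SELECT product, (SUM(revenue) - SUM(cost)) / SUM(revenue) AS margin
FROM sales
GROUP BY product;
```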
The practical advantage is eliminating the self-join Hades you hit when doing comparisons. Say you want this year’s revenue vs. last year’s. Without these “measures”, you write the same daft subquery twice with different year filters, then join them together. With measures, you write revenue / revenue AT (SET year = year - 1). The AT operator lets you say “calculate this measure, but pretend the year is something else” or “calculate this ignoring the region grouping entirely” (AT (ALL region)). This is how business intelligence (BI) tools like Looker have worked internally for years, and the concepts in said paper just bring that capability into standard SQL semantics.
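To see why that matters, here’s the same this-year-versus-last-year comparison both ways. The tables are hypothetical, and the measure version uses the paper-style syntax quoted above (the measure-bearing sales view is sketched in the Yardstick snippet below):

```sql
CREATE TABLE orders (product TEXT, year INTEGER, amount DOUBLE);

-- Without measures: the same aggregation written twice, then self-joined
WITH cur AS (
    SELECT product, SUM(amount) AS revenue
    FROM orders WHERE year = 2025 GROUP BY product
),
prev AS (
    SELECT product, SUM(amount) AS revenue
    FROM orders WHERE year = 2024 GROUP BY product
)
SELECT product, cur.revenue / prev.revenue AS yoy_ratio
FROM cur JOIN prev USING (product);

-- With measures: one expression, evaluated in a shifted context
SELECT product, revenue / revenue AT (SET year = year - 1) AS yoy_ratio
FROM sales
GROUP BY product;
```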
The Yardstick extension implements the concepts from the paper in DuckDB. You define measures in a view with SUM(amount) AS MEASURE revenue, then query them with AGGREGATE(revenue). The AT modifiers give you the context-shifting superpowers: AT (ALL) for grand totals, AT (ALL region) for totals across all regions, AT (SET year = 2023) to lock a dimension to a specific value. Percent-of-total calculations that used to require common table expressions (CTEs) or window functions now become simple one-liners, and idioms such as year-over-year growth are just subtraction.
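Here’s roughly what that looks like with the extension, pieced together from the syntax described above. The orders table and its columns are made up, and the exact grammar (e.g. whether AT sits inside or outside AGGREGATE, or whether the view needs a GROUP BY) should be checked against the README before copying:

```sql
INSTALL yardstick FROM community;
LOAD yardstick;

CREATE TABLE orders (region TEXT, year INTEGER, amount DOUBLE);

-- Define the measure once, in a view
CREATE VIEW sales AS
SELECT region, year, SUM(amount) AS MEASURE revenue
FROM orders
GROUP BY ALL;

-- The measure recalculates correctly at whatever grain the query asks for
SELECT region, AGGREGATE(revenue) AS revenue
FROM sales
GROUP BY region;

-- Percent of total: AT (ALL region) evaluates the measure ignoring the region grouping
SELECT region,
       AGGREGATE(revenue) / AGGREGATE(revenue AT (ALL region)) AS pct_of_total
FROM sales
GROUP BY region;

-- Lock a dimension to a specific value
SELECT region, AGGREGATE(revenue AT (SET year = 2023)) AS revenue_2023
FROM sales
GROUP BY region;
```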
You can install it with INSTALL yardstick FROM community to start playing. I’ve been combing through SQL and R code that did some of this higher-level grunt work and found enough targets to make it worth investing time in this new methodology and the extended SQL syntax idioms. We’ll take a look at my results in a forthcoming Drop.
DataSketches

The DuckDB DataSketches extension gives us access to Apache DataSketches, a library of streaming algorithms that answer hard questions about massive datasets without breaking your memory budget (which is quite convenient given how our LLM overlords have made purchasing memory a luxury item). With it, distinct counts across billions of rows, percentile calculations, and finding the most frequent items in a stream are all computed in a single pass with bounded error guarantees.
DuckDB already ships with approx_count_distinct() and approx_quantile(), which are ace for one-off queries. But those functions are opaque: they compute an answer and throw away the internal state. The DataSketches extension takes a different approach. It returns the sketch itself as a BLOB we can store, query later, or merge with other sketches. This matters when data is partitioned across time periods or machines, since we can just sketch each partition independently, then combine them without ever touching the raw data again.
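The workflow looks roughly like this. The events table and the datasketch_hll* function names are placeholders rather than the extension’s actual API (check the repo’s docs for the real signatures); the point is the shape of the pattern: sketch each partition, store the BLOB, merge and estimate later.

```sql
INSTALL datasketches FROM community;
LOAD datasketches;

-- One sketch per daily partition; function names are illustrative
CREATE TABLE daily_user_sketches AS
SELECT
    day,
    datasketch_hll(user_id) AS user_sketch   -- sketch state stored as a BLOB
FROM events
GROUP BY day;

-- Later: merge the stored sketches and estimate distinct users for the month,
-- without ever re-reading the raw events
SELECT datasketch_hll_estimate(datasketch_hll_union(user_sketch)) AS approx_users
FROM daily_user_sketches
WHERE day BETWEEN DATE '2025-06-01' AND DATE '2025-06-30';
```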
The extension covers three main problem areas. For quantile estimation, you get KLL (the general-purpose workhorse), TDigest (better accuracy at extreme percentiles like p99), and REQ (for relative error guarantees on skewed distributions). For distinct counting, there’s the venerable HyperLogLog, CPC (which uses about 40% less space), and Theta sketches (which uniquely support set operations like intersection and difference). The Frequent Items sketch identifies heavy hitters in your data.
Each sketch type lets us tune accuracy versus memory through a parameter (usually called K or lg_k). Higher values mean tighter error bounds and larger sketches. The documentation includes tables showing exactly what error to expect at each setting, so no guesswork is required.
This is on the TODO for next week since I’ve been managing a data pipeline where I gather up all the Arkime-enriched sessions across the fleet every hour (now going back to June 2025), and it’s at the point where I am going to need to spread out the DuckDB files versus keeping everything in one. I have a feeling this might speed up some tasks, but we’ll see how it goes and (again) report back.
FIN
Remember, you can follow and interact with the full text of The Daily Drop’s free posts on:
- Mastodon via @dailydrop.hrbrmstr.dev@dailydrop.hrbrmstr.dev
- Bluesky via <https://bsky.app/profile/dailydrop.hrbrmstr.dev.web.brid.gy>
☮️