One-ish topic today since I wanted to make sure folks who don’t do data ops for a living grokked why the DuckDB folks layered in a “lakehouse” feature via their new extension.
If you know what data lakes/lakehouses are, skip to the DuckLake section. I [briefly] repeat the salient bits from the first section there.
TL;DR On Data Lakes
Pretend you’re managing data for a typical company that maintains inventory, has customers, and processes orders. The simplest approach would be organizing your files like this:
data/
├── customers/
│   ├── 2024-01-01.parquet
│   ├── 2024-01-02.parquet
│   └── 2024-01-03.parquet
├── orders/
│   ├── 2024-01-01.parquet
│   ├── 2024-01-02.parquet
│   └── 2024-01-03.parquet
└── products/
    ├── electronics.parquet
    └── clothing.parquet
This works fine for simple scenarios. You can query these files directly with tools like DuckDB, read them with R, or process them with many other types of tooling. The structure is obvious to humans, and basic analytics are straightforward.
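For instance, DuckDB can query a layout like this directly with a glob pattern (a minimal sketch; the paths mirror the tree above):
-- count rows across every daily customer file at once; DuckDB expands the glob
SELECT count(*) AS n_customers
FROM read_parquet('data/customers/*.parquet');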
But here’s where things start to break down as your system grows.
Think about what happens when your data team wants to perform common operations that any database takes for granted. This presents many challenges in a “simple files” setting:
- Schema Evolution Problems: Suppose your customers table originally had columns for id, name, and email. Six months later, you need to add a phone number column. With just files in folders, you now have a mixed situation. Some files have three columns, others have four. How do you know which is which? How do you query across all files when they have different schemas? You’d need to maintain this information somewhere else, essentially creating your own metadata system (see the sketch after this list).
- The Update Nightmare: Imagine you discover that customer ID 12345 has an incorrect email address across multiple files spanning several months. In a traditional database, this would be a simple UPDATE statement. With files in folders, you need to read each relevant file, modify the records, and write new files. But what happens if someone else is reading those files while you’re updating them? You have no way to coordinate these operations.
- No Transaction Guarantees: Consider what happens when you need to update both customer information and their associated orders atomically. In a database, you’d wrap both operations in a transaction to ensure they either both succeed or both fail. With files in folders, if your process crashes after updating customers but before updating orders, you’re left with inconsistent data and no automatic way to recover.
- Time Travel and Versioning: Suppose you need to see what your data looked like last Tuesday for a compliance audit. With files in folders, you’d need to manually track which files existed at that time and what their schemas were. There’s no systematic way to recreate a point-in-time view of your data.
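To make the schema-drift point concrete: even just reading across old and new files forces you to handle the mismatch yourself. In DuckDB that means reaching for union_by_name (a sketch, assuming the column names from the bullet above):
-- older files lack the phone column; union_by_name lines columns up by name
-- and fills the gaps with NULLs
SELECT id, name, email, phone
FROM read_parquet('data/customers/*.parquet', union_by_name => true);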
Traditional data lakes tried to solve the scale problem by saying, “just dump everything in object storage and figure it out later.” This worked for some use cases but created new problems. You could store massive amounts of data cheaply, but you lost the ability to perform complex queries efficiently, maintain data quality, or ensure consistency across datasets.
Data lakes became notorious for turning into “data swamps” where data went to die because it was too difficult to maintain and query effectively.
Lakehouse architectures represent the recognition that we need the best of both worlds: the scalability and cost-effectiveness of data lakes with the reliability and query capabilities of databases.
Here’s how lakehouse formats like Iceberg, Delta Lake, and DuckLake solve the problems we just noted:
- Metadata Management: Instead of relying on folder naming conventions and hoping for the best, lakehouse formats maintain explicit metadata about table schemas, partitioning schemes, and file locations. This metadata acts like a catalog that tells you exactly what data exists, where it lives, and how it’s structured.
- ACID Transactions: Lakehouse formats implement transaction logs that track every change to your data. When you update that customer’s email address, the system can ensure the operation either completes fully or fails cleanly, leaving your data in a consistent state.
- Schema Evolution with Compatibility: When you add that phone number column, the lakehouse format tracks this schema change and can handle queries that span both old and new data formats. It knows that older files don’t have the phone column and can return null values appropriately.
- Time Travel and Versioning: The transaction log becomes a complete history of your data. You can query any point in time by asking the system to show you which files were active and what the schema looked like at that moment.
- Optimizations: Lakehouse formats can maintain statistics about your data, track which files contain which data ranges, and even compact small files automatically. These optimizations make queries much faster than scanning through arbitrary file structures.
Let’s see how the same operation would work in each approach:
Simple Files Approach:
# To add a new customer, you might:
1. Figure out which file to add to (today's date?)
2. Read the existing file
3. Append the new record
4. Write a new file
5. Hope no one else is doing the same thing simultaneously
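In plain DuckDB, with no lakehouse layer, that dance might look something like this sketch (file names and the new customer are made up for illustration):
-- steps 1-4: read the current file, append the record, write a *new* file
-- (assumes these files carry only id, name, and email columns)
COPY (
    SELECT * FROM read_parquet('data/customers/2024-01-03.parquet')
    UNION ALL
    SELECT 12346 AS id, 'Charlie' AS name, 'charlie@example.com' AS email
) TO 'data/customers/2024-01-03_v2.parquet' (FORMAT PARQUET);
-- step 5: nothing stops another writer from doing the same thing concurrently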
Lakehouse Approach:
-- Just insert the data - the lakehouse handles everything else
INSERT INTO customers VALUES (12346, 'Charlie', 'charlie@example.com', '555-0199');
-- The system automatically:
-- - Determines optimal file placement
-- - Updates metadata atomically
-- - Maintains transaction log
-- - Handles concurrent operations safely
Now we can understand why DuckLake’s approach (ref. next section) is particularly clever. Traditional lakehouse formats like Iceberg and Delta Lake store their metadata alongside the data files, creating JSON and Avro files that describe the table structure and transaction history. This works, but it has limitations: metadata operations require file I/O, coordination between multiple writers is complex, and the metadata files themselves can become performance bottlenecks.
DuckLake says: “Why not use a real database for the metadata?” This eliminates many coordination problems because databases are already designed to handle concurrent access, ACID transactions, and fast lookups. The data still lives in Parquet files in your object storage, but all the metadata lives in a proper database where it can be managed efficiently.
Managing metadata is a different concern from storing data, and each deserves the right tool for the job. Data belongs in efficient, open formats like Parquet on cheap object storage. Metadata belongs in systems designed for fast lookups, transactions, and concurrent access — which is precisely what databases provide.
DuckLake
As noted in the first section, DuckLake represents a reimagining of how we build data lake architectures. Its integrated data lake and catalog format solves a central problem in modern data systems: the complexity and limitations of traditional lakehouse formats. Instead of storing metadata in files, it uses an actual SQL database to manage all the catalog information. This might seem obvious in retrospect, but it’s a breakthrough that simplifies everything while making the system more reliable and faster.
DuckLake employs a three-way separation of concerns. Traditional systems typically separate compute and storage, but DuckLake adds a third component: metadata. Here’s how it works:
- Storage Layer: Your actual data lives in open formats like Parquet files on object storage such as AWS S3, Google Cloud Storage, or Azure Blob Storage. This gives you all the benefits of cheap, scalable storage with an open format that any tool can read.
- Metadata Layer: All catalog information, table schemas, transaction logs, and system metadata live in a separate SQL database. This could be PostgreSQL, MySQL, SQLite, or even DuckDB itself. Because it’s a real database, you get all the ACID properties, concurrency control, and performance optimizations that databases provide.
- Compute Layer: Instead of a traditional client-server model, DuckLake uses a “compute-everywhere” approach. Each person querying the data or application runs their own DuckDB instance, which connects to the shared metadata database and reads data from the shared storage. This eliminates traditional bottlenecks and resource scheduling problems.
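To make the three layers tangible, here’s a sketch of what choosing a metadata backend looks like at attach time. It follows the same ATTACH pattern shown below for PostgreSQL; the catalog names and paths are illustrative:
-- catalog in a local DuckDB file, data as Parquet files in a local folder
ATTACH 'ducklake:metadata.ducklake' AS local_lake (DATA_PATH 'data_files/');
-- or keep the catalog in SQLite instead:
-- ATTACH 'ducklake:sqlite:metadata.sqlite' AS local_lake (DATA_PATH 'data_files/');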
This design solves several core problems that plague other lakehouse formats. When Iceberg or Delta Lake need to perform metadata operations, they must read and write files to object storage, which is slow and creates consistency challenges. DuckLake performs these operations against a database, which is orders of magnitude faster and provides true ACID guarantees.
Think of it this way: if you want to check what columns a table has, traditional formats might need to scan through multiple metadata files on S3 (taking hundreds of milliseconds), while DuckLake can query its metadata database and get the answer in a couple of milliseconds.
DuckLake provides all the advanced features you’d expect from a modern data lake system, but implements them more elegantly:
- Time Travel and Snapshots: You can query your data as it existed at any point in time. Unlike other systems that might struggle with managing many snapshots, DuckLake can handle millions of snapshots efficiently because the metadata is stored in a database designed for such operations.
- Schema Evolution: Adding columns, changing data types, or restructuring tables becomes straightforward SQL operations rather than complex file manipulations. The database tracks all these changes transactionally.
- Concurrent Access: Multiple DuckDB instances can read and write to the same DuckLake simultaneously with full ACID guarantees. This “multiplayer DuckDB” capability was one of the original motivations for creating DuckLake.
- Advanced Security Features: DuckLake introduces innovative encryption capabilities. Every Parquet file gets its own encryption key, and these keys are stored in the metadata database. This means you can store encrypted data on untrusted public cloud storage while keeping the decryption keys in your secure, private metadata server.
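The DuckLake docs describe encryption as something you switch on when attaching. A hedged sketch follows; treat the ENCRYPTED option and its placement as assumptions to verify against your extension version:
-- sketch only: enable per-file encryption for data written to untrusted storage
ATTACH 'ducklake:postgres:dbname=ducklake_catalog host=your_postgres_host' AS secure_lake
    (DATA_PATH 's3://your-bucket/data/', ENCRYPTED);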
This post has a complete example, but let’s take a quick look here at how one might actually use DuckLake in practice.
First, you install the DuckLake extension in DuckDB and connect to your metadata database. If you’re using PostgreSQL as your catalog database and S3 for storage, you might run something like this:
INSTALL ducklake;
INSTALL postgres;
ATTACH 'ducklake:postgres:dbname=ducklake_catalog host=your_postgres_host' AS my_ducklake
    (DATA_PATH 's3://your-bucket/data/');
USE my_ducklake;
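One detail the snippet above glosses over: DuckDB needs credentials for that S3 bucket, which you’d typically hand it via the built-in secrets manager. The values below are placeholders:
-- placeholder credentials; substitute your own key, secret, and region
CREATE SECRET my_s3_secret (
    TYPE S3,
    KEY_ID 'AKIA...',
    SECRET '...',
    REGION 'us-east-1'
);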
Once connected, you work with DuckLake using standard SQL operations. Creating tables, inserting data, and querying all use familiar SQL syntax:
CREATE TABLE customers (
    id INTEGER,
    name VARCHAR,
    email VARCHAR,
    created_at TIMESTAMP
);
INSERT INTO customers VALUES
    (1, 'Michael Waltz', 'michael@nsa.gov', '2025-01-20'),
    (2, 'Pete Hegseth', 'pete@dod.gov', '2025-01-24'),
    (3, 'Jeffrey Goldberg', 'jeff@atlantic.com', '2025-03-13');
Behind the scenes, DuckLake stores the actual data as Parquet files in your S3 bucket and maintains all the metadata in your PostgreSQL database. When you query the data, DuckDB efficiently reads the relevant Parquet files while using the metadata to optimize the query.
One of DuckLake’s spiffiest features is how it handles schema changes. Suppose you need to add a phone number column to your customers table:
ALTER TABLE customers ADD COLUMN phone VARCHAR;
UPDATE customers SET phone = '555-0123' WHERE id = 1;
This operation is tracked as a new snapshot in the metadata database. Later, you can use time travel to see your data before this change:
-- Query current version
SELECT * FROM customers;
-- Query a specific snapshot (e.g., version 5, before adding the phone column)
SELECT * FROM customers AT (VERSION => 5);
This time travel capability is incredibly useful for debugging, auditing, or recovering from mistakes. Because the metadata is stored in a database rather than files, these operations are fast and reliable.
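Version numbers are handy, but for an audit you usually want wall-clock time. The same AT clause accepts a timestamp; the date below is arbitrary, and the exact keyword form is worth double-checking against the DuckLake docs:
-- show the table as it looked at a specific moment
SELECT * FROM customers AT (TIMESTAMP => TIMESTAMP '2025-06-01 00:00:00');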
DuckLake’s architecture provides several performance advantages over traditional lakehouse formats. Metadata queries are dramatically faster because they hit a database rather than scanning files on object storage. The system can handle many more concurrent operations because database concurrency control is more sophisticated than file-based locking mechanisms.
The encryption capabilities open up new possibilities for data sharing and compliance. You can store sensitive data on public cloud infrastructure while maintaining complete control over access through your private metadata server. This separation of concerns allows for architectures that weren’t practical with traditional lakehouse formats.
DuckLake is designed as an open standard, not just a DuckDB-specific feature. While the current implementation is a DuckDB extension, the format itself is database-agnostic. This means other query engines could potentially implement DuckLake support in the future.
The project also plans to add import and export capabilities for existing table formats like Iceberg, making it easier to migrate from other lakehouse solutions. MotherDuck, the managed DuckDB service, is adding hosted support for DuckLake, which will provide additional scalability and management features.
DuckLake is particularly well-suited for scenarios where you need the flexibility of a data lake but also require strong consistency guarantees, frequent metadata operations, or high concurrency. It’s an excellent choice for “multiplayer” analytics scenarios where multiple teams or applications need to read and write the same datasets simultaneously.
The encryption features make it valuable for organizations with strict data governance requirements, while the performance characteristics make it ideal for applications that require low-latency data access combined with large-scale storage.
By leveraging the power of SQL databases for metadata management while keeping data in open formats, DuckLake provides a compelling alternative to more complex file-based approaches.
FIN
Remember, you can follow and interact with the full text of The Daily Drop’s free posts on:
- 🐘 Mastodon via @dailydrop.hrbrmstr.dev@dailydrop.hrbrmstr.dev
- 🦋 Bluesky via https://bsky.app/profile/dailydrop.hrbrmstr.dev.web.brid.gy
☮️