Bonus Drop #70 (2024-12-14): Knowledge Drop

Adding CIDR Handling To DuckDB’s INET Extension

I’m getting “OK” with building personal-use extensions for DuckDB, but today is my first time trying to PR some new functionality into an existing extension.

Photo by Pixabay on Pexels.com

The inet extension defines the INET data type for storing IPv4 and IPv6 Internet addresses. It supports the CIDR (Classless Inter-Domain Routing) notation for subnet masks (e.g., 198.51.100.0/222001:db8:3c4d::/48).

CIDR notation comes up quite a bit in my $DAYJOB, and I had need to add this functionality to the DuckDB INET extension (well, “desire” more than “need”), so let’s walk through how I made it, since DuckDB extensions can — at first glance — be pretty daunting. While most of the extensions are in C++ (like this one), there are numerous efforts to bring extension building to Rust, Zig, and other languages.

When working with existing or a new template extension, it is quite important to ensure you’re collecting the source from the submodules used in the extension:

$ git clone --recurse-submodules git@github.com:hrbrmstr/duckdb_inet.git

We care mostly about the src directory in there. This is the structure with the files we’re modifying highlighted:

├── include
│   ├── html_charref.hpp
│   ├── inet_extension.hpp
│   ├── inet_functions.hpp 👈🏼
│   └── ipaddress.hpp 👈🏼
├── inet_escape_functions.cpp
├── inet_extension.cpp 👈🏼
├── inet_functions.cpp 👈🏼
└── ipaddress.cpp 👈🏼

First up: CIDR notation combines an IP address with a network prefix length (mask) to specify a range of IP addresses. For example:

  • 192.168.1.0/24 represents all IP addresses from 192.168.1.0 to 192.168.1.255
  • 2001:db8::/32 represents a range of IPv6 addresses

We’ll first need a way (in ipaddress.[ch]pp) to test if an IP address value includes a CIDR (i.e., is a “network”):

bool IPAddress::IsCIDR() {
    return (type == IPAddressType::IP_ADDRESS_V4 && mask < IPV4_DEFAULT_MASK) ||
           (type == IPAddressType::IP_ADDRESS_V6 && mask < IPV6_DEFAULT_MASK);
}

This method determines if an IP address represents a CIDR range by checking:

  • For IPv4: if the mask is less than 32 (IPV4_DEFAULT_MASK)
  • For IPv6: if the mask is less than 128 (IPV6_DEFAULT_MASK)

If true, the address represents a network range rather than a single host.

We’re now going to want a function to take in an IP address and expand it into a DuckDB list/array (which may be one element if we have a bare IPv4 or IPv6). Let’s register that function (in inet_extension.cpp):

ExtensionUtil::RegisterFunction(
  *db.instance,
  ScalarFunction("expand_cidr", {inet_type}, LogicalType::LIST(inet_type),
  INetFunctions::ExpandCIDR));

This registers a scalar function that:

  • Takes an INET type as input
  • Returns a LIST of INET values
  • Uses INetFunctions::ExpandCIDR as the implementation

The main implementation in INetFunctions::ExpandCIDR (in inet_functions.[ch]pp) does the heavy lifting.

This part.

auto & ipaddress_vector = args.data[0];
UnifiedVectorFormat ipaddress_data;
ipaddress_vector.ToUnifiedFormat(args.size(), ipaddress_data);

gets the input vector and converts it to a unified format for processing.

We then need to extract the components of each INET value (type, address, mask).

auto & entries = StructVector::GetEntries(ipaddress_vector);
auto ip_type_data = FlatVector::GetData<uint8_t>(*entries[0]);
auto address_data = FlatVector::GetData<hugeint_t>(*entries[1]);
auto mask_data = FlatVector::GetData<uint16_t>(*entries[2]);

DuckDB gives us batches of rows to process (hence the for (idx_t i = 0; i < args.size(); i++) { … } loop).

Next up, we have to figure out the number of hosts based on the address type and mask (2^(bits-mask)):

if (inet.IsCIDR()) {
  IPAddress network = inet.Network();
  IPAddress broadcast = inet.Broadcast();
  hugeint_t hosts;
  if (addr_type == IPAddressType::IP_ADDRESS_V4) {
    hosts = hugeint_t(1) << (32 - mask);
  } else {
    hosts = hugeint_t(1) << (128 - mask);
  }

Then, it’s a boring for loop to do the expansion:

for (hugeint_t j = 0; j < hosts && current.address <= broadcast.address; j = j + one) {
  ip_list.emplace_back(Value::STRUCT(inet_type, {
    Value::UTINYINT(uint8_t(current.type)),
    Value::HUGEINT(ToCompatAddr(current.address, current.type)),
    Value::USMALLINT(current.type == IPAddressType::IP_ADDRESS_V4 ? 32 : 128)
  }));
  current.address = current.address + one;
}

NOTE: We will have to work out what to do when a crazy request is made for expanding a gazillion IPv6 addresses.

If it wasn’t a CIDR we just emplace_back the original value.

The resultant LIST is added to return response:

result.SetValue(i, Value::LIST(inet_type, std::move(ip_list)));

With the above context handy, the full source for the function should be a bit easier to grok.

See It In Action

Here’s a sample of how it works:

🦆>SELECT UNNEST(expand_cidr('192.168.1.0/32'::INET)) ip;
┌─────────────┐
│     ip      │
│    inet     │
├─────────────┤
│ 192.168.1.0 │
└─────────────┘

🦆>SELECT UNNEST(expand_cidr('192.168.1.0/30'::INET)) ip;
┌─────────────┐
│     ip      │
│    inet     │
├─────────────┤
│ 192.168.1.0 │
│ 192.168.1.1 │
│ 192.168.1.2 │
│ 192.168.1.3 │
└─────────────┘

🦆>FROM (SELECT UNNEST(expand_cidr('192.168.1.0/16'::INET)) ip) SELECT COUNT(ip);
┌───────────┐
│ count(ip) │
│   int64   │
├───────────┤
│     65536 │
└───────────┘

🦆>SELECT expand_cidr('2001:db8::/126'::INET);
┌─────────────────────────────────────────────────────┐
│     expand_cidr(CAST('2001:db8::/126' AS INET))     │
│                       inet[]                        │
├─────────────────────────────────────────────────────┤
│ [2001:db8::, 2001:db8::1, 2001:db8::2, 2001:db8::3] │
└─────────────────────────────────────────────────────┘

🦆>SELECT expand_cidr('2603:3005:1507:5900:9c2b:2430:c08e:addf'::INET);
┌──────────────────────────────────────────────────────────────────────┐
│ expand_cidr(CAST('2603:3005:1507:5900:9c2b:2430:c08e:addf' AS INET)) │
│                                inet[]                                │
├──────────────────────────────────────────────────────────────────────┤
│ [2603:3005:1507:5900:9c2b:2430:c08e:addf]                            │
└──────────────────────────────────────────────────────────────────────┘

I’ve got a PR up for this, so it will hopefully make it into the official extension soon-ish.


FIN

We all will need to get much, much better at sensitive comms, and Signal is one of the only ways to do that in modern times. You should absolutely use that if you are doing any kind of community organizing (etc.). Ping me on Mastodon or Bluesky with a “🦇?” request (public or faux-private) and I’ll provide a one-time use link to connect us on Signal.

Remember, you can follow and interact with the full text of The Daily Drop’s free posts on:

  • 🐘 Mastodon via @dailydrop.hrbrmstr.dev@dailydrop.hrbrmstr.dev
  • 🦋 Bluesky via https://bsky.app/profile/dailydrop.hrbrmstr.dev.web.brid.gy

Also, refer to:

to see how to access a regularly updated database of all the Drops with extracted links, and full-text search capability. ☮️

Leave a comment

This site uses Akismet to reduce spam. Learn how your comment data is processed.