Drop #467 (2024-05-15): Wonkish Wednesday

Homebrew Build Provenance; The Replacement (Parquet) Killers; The Heat Death Of The Internet

Nothing to (technically) download and install in today’s proper edition, just some solid improvements in safety for one ecosystem (in the first section), hope for the data format future (in the middle section), and — since this is me, after all — a nice, depressing closing section.

Homebrew Build Provenance

Trail of Bits is, perhaps, one of the most respected cybersecurity research and consulting firms out there. In a recent post, they did a deep dive into the implementation of build provenance for Homebrew, a popular package manager for macOS (and Linux). The initiative, a collaboration with Alpha-Omega and OpenSSF, aims to enhance the security of software supply chains by introducing cryptographically verifiable attestations to Homebrew’s build process.

Build provenance, as explained in the post, serves as a mechanism to ensure that each software build can be traced back to its source, detailing the exact workflow and metadata, including the specific git commit and GitHub Actions run ID. This level of transparency is crucial at a time when the integrity of software supply chains is frequently called into question due to the increasing sophistication of attackers. By implementing these measures, Homebrew aims to mitigate the risks posed by compromised or rogue maintainers who might attempt to inject malicious code into the builds. The cryptographic attestation ensures that any tampering with the build process becomes detectable, thus preventing silent compromises that could affect countless users.
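To make the "traced back to its source" idea a bit more concrete, here is a minimal Python sketch of just one piece of a provenance check: confirming that the artifact you downloaded is the one the attestation talks about. It assumes the attestation payload is shaped like an in-toto statement (a `subject` list with SHA-256 digests); the file name, bottle contents, and predicate body are all hypothetical, and the sketch deliberately skips signature verification, which is what actually makes a real attestation trustworthy.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Hex-encoded SHA-256 digest of some bytes."""
    return hashlib.sha256(data).hexdigest()


def subject_matches(statement: dict, artifact_name: str, artifact_bytes: bytes) -> bool:
    """Return True if the statement lists this artifact with a matching digest.

    NOTE: this only compares digests. It does NOT verify who signed the
    statement, so on its own it proves nothing about authenticity.
    """
    digest = sha256_hex(artifact_bytes)
    for subject in statement.get("subject", []):
        same_name = subject.get("name") == artifact_name
        same_digest = subject.get("digest", {}).get("sha256") == digest
        if same_name and same_digest:
            return True
    return False


# Hypothetical artifact and statement, for illustration only.
bottle = b"pretend this is a Homebrew bottle tarball"
statement = {
    "_type": "https://in-toto.io/Statement/v1",
    "subject": [
        {"name": "example.bottle.tar.gz", "digest": {"sha256": sha256_hex(bottle)}}
    ],
    "predicateType": "https://slsa.dev/provenance/v1",
    # Placeholder: a real predicate records the workflow, commit, and run.
    "predicate": {"note": "hypothetical placeholder"},
}

print(subject_matches(statement, "example.bottle.tar.gz", bottle))
print(subject_matches(statement, "example.bottle.tar.gz", bottle + b"tampered"))
```

The first call should report a match and the second should not, since changing even one byte of the artifact changes its digest. In a real setup you would lean on a proper verifier to check the signature and the claimed workflow before trusting any of the fields.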

One reason I 💙 ToB is that they’re brutally honest. In the post, they fully acknowledge the limitations of build provenance. While it significantly enhances security by making any malicious activity more visible and subject to scrutiny, it does not outright prevent a determined attacker, especially one with maintainer privileges, from tampering with the software. This candid admission reveals something many folks do not grok and many vendors will not own up to: we have, and always will have, an ongoing arms race in cybersecurity. Our defensive measures must continually evolve to keep pace with the tactics employed by attackers. Adding this new provenance control is great, but it will most assuredly cause attackers to figure out other ways to accomplish their goals.

The technical implementation of this feature in Homebrew leverages GitHub’s new artifact attestations feature, which was accessed by the Trail of Bits team during its private beta phase. This integration not only showcases the potential for collaboration between open-source communities and cybersecurity experts but also highlights the growing importance of platform providers like GitHub in securing software development practices across the industry.

It’s a great, accessible read, and you can likely implement something like this for your org for free (or cheap).

The Replacement (Parquet) Killers

Chris Riccomini has an accessible and pretty thorough post that introduces two emerging data storage formats, Nimble and Lance V2 (LV2), which are poised to challenge the dominance of Apache Parquet. These new formats, developed by Meta and LanceDB respectively, are working to optimize the handling of data types that are increasingly common in ML and AI, such as vectors, images, and videos.

Apache Parquet is the 800-lb gorilla in terms of storage formats for data analytics. It’s super efficient when it comes to online analytical processing (OLAP) queries across various cloud data warehouses and data lakes. However, its performance begins to falter when dealing with the complex and unstructured data types that are emerging as the typical ones in ML and AI workflows. This is where Nimble and LV2 step in, designed from the ground up to better meet these specialized needs.

Nimble, as described in the post, adopts an incremental improvement strategy over Parquet, focusing on handling wide schemas and enhancing metadata flexibility. It utilizes FlatBuffers for decoding, which allows for selective reading of metadata bytes, potentially reducing the overhead seen in traditional formats. On the other hand, LV2 takes a more radical approach by eliminating Parquet’s row groups altogether, opting instead for a leaner structure that consists of data pages and a footer, with a highly extensible system for types and encodings.
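The "data pages plus a footer" idea is easier to see in a toy. The Python sketch below is not Lance's or Nimble's actual on-disk layout; it is a made-up format (the function names, the JSON footer, and the integer-only columns are all my assumptions) that just shows why a footer with offsets lets a reader pull one column without touching the rest.

```python
import io
import json
import struct


def write_toy_file(columns: dict[str, list[int]]) -> bytes:
    """Write each column as one page of little-endian int64s, then a footer."""
    buf = io.BytesIO()
    footer = {}
    for name, values in columns.items():
        page = struct.pack(f"<{len(values)}q", *values)
        # The footer remembers where each page lives and how many values it has.
        footer[name] = {"offset": buf.tell(), "length": len(page), "count": len(values)}
        buf.write(page)
    footer_bytes = json.dumps(footer).encode("utf-8")
    buf.write(footer_bytes)
    # The last 4 bytes hold the footer length so a reader can find the footer.
    buf.write(struct.pack("<I", len(footer_bytes)))
    return buf.getvalue()


def read_column(blob: bytes, name: str) -> list[int]:
    """Read a single column by consulting the footer, ignoring other pages."""
    (footer_len,) = struct.unpack("<I", blob[-4:])
    footer = json.loads(blob[-4 - footer_len:-4])
    entry = footer[name]
    page = blob[entry["offset"]:entry["offset"] + entry["length"]]
    return list(struct.unpack(f"<{entry['count']}q", page))


blob = write_toy_file({"id": [1, 2, 3], "score": [10, 20, 30]})
print(read_column(blob, "score"))
```

A real format would use a compact binary footer rather than JSON (the post notes Nimble uses FlatBuffers for its metadata), plus pluggable encodings per page, but the "seek to the footer, then seek to the page" shape is the part worth noticing.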

Despite these innovations, both formats are still in their infancy with several features underdeveloped for broader OLAP use cases. As an example, Nimble lacks support for predicate pushdown, a pretty important (crucial?) feature for efficient query processing, and LV2’s encoding capabilities remain pretty basic. These gaps highlight that while the formats excel in specialized ML tasks—demonstrated by Nimble’s performance in Meta’s benchmarks—they are not yet ready to fully replace Parquet in all aspects of data processing.
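If predicate pushdown is a new term, the gist is that the reader uses stored statistics to skip data that cannot possibly match a filter. Here is a small Python sketch under some loud assumptions: the `Chunk` class, its min/max fields, and the `scan_greater_than` function are hypothetical names of mine, not part of Parquet, Nimble, or LV2.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    """A hypothetical block of column values with min/max statistics."""
    min_value: int
    max_value: int
    values: list[int]


def make_chunk(values: list[int]) -> Chunk:
    return Chunk(min(values), max(values), values)


def scan_greater_than(chunks: list[Chunk], threshold: int) -> tuple[list[int], int]:
    """Return values > threshold and how many chunks actually had to be read."""
    matches: list[int] = []
    chunks_read = 0
    for chunk in chunks:
        # Pushdown step: if even the largest value fails the filter, skip the chunk.
        if chunk.max_value <= threshold:
            continue
        chunks_read += 1
        matches.extend(v for v in chunk.values if v > threshold)
    return matches, chunks_read


chunks = [make_chunk([1, 5, 9]), make_chunk([12, 18, 20]), make_chunk([25, 31, 40])]
matches, chunks_read = scan_greater_than(chunks, 19)
print(matches, chunks_read)  # [20, 25, 31, 40] 2
```

Only two of the three chunks get read here. On a large table sitting in object storage, skipping chunks like this is where much of the query speedup comes from, which is why the lack of it matters for general OLAP work.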

If you’re not familiar with these formats, the article is most certainly worth your time.

The Heat Death Of The Internet

“The Heat Death of the Internet” is a metaphorical expression used to describe a scenario where the internet, as a vibrant and diverse ecosystem, gradually loses its richness and diversity, becoming more homogenized and less dynamic. This concept draws an analogy to the “heat death of the universe,” a theoretical end state of the universe where all energy is evenly distributed, and no more work can be done due to a lack of energy gradients, leading to a state of maximum entropy and minimal activity.

It’s also a stoic blog post that I think all readers will relate to. (Presented without further comment.)

FIN

Remember, you can follow and interact with the full text of The Daily Drop’s free posts on Mastodon via @dailydrop.hrbrmstr.dev@dailydrop.hrbrmstr.dev ☮️
