Drop #484 (2024-06-20): Data All The Way Down

Tektite; DataFusion; postgres-websockets

(Given — CW: language — this threat, we’ll refrain from leaning on the recently more frequent default theme for Thursday. O_O)

We go back to a “data infrastructure” theme for today’s Drop with a look at a new event streaming platform built with analytics in mind, a foundational way to perform at-scale SQL or DataFrame ops in Rust, and a way to wire up the Swiss Army knife that is PostgreSQL to apps via websockets.

In other news, the author of neofetch (a utility we’ve covered before) hung up their git lasso and headed off into farmerville back in April. Godspeed you, sir. Also, thank goodness, now, for fastfetch.


TL;DR

(This is an AI-generated summary of today’s Drop using Perplexity with the GPT-4o model.)

  • Tektite: Tektite is a new event streaming database written in Go, integrating functionalities of Kafka and Redpanda with advanced processing capabilities like Flink. It supports real-time aggregations, complex data operations, and WASM modules for high-performance computations. Tektite
  • DataFusion: Apache Arrow DataFusion is a high-performance, extensible query engine written in Rust, leveraging Apache Arrow for in-memory data processing. It supports SQL and DataFrame APIs, optimized for parallel processing, and is used by projects like InfluxDB and Arroyo. Apache Arrow DataFusion
  • postgres-websockets: This Haskell-based middleware adds websocket capabilities to PostgreSQL’s LISTEN and NOTIFY commands, enabling real-time notifications and interactions with web or mobile apps. It simplifies managing websocket connections and includes built-in mechanisms for handling database connection failures. postgres-websockets

Tektite


Tektite (GH) is a new event streaming database, written in Go, that integrates the functionalities of traditional event streaming platforms like Kafka and Redpanda with advanced event processing capabilities similar to Flink. Unlike conventional streaming solutions that often layer on top of existing databases, Tektite is designed from the ground up as a standalone database optimized for event streaming and processing.

It maintains full compatibility with Kafka, so we can create and manage topics using any Kafka client. Beyond basic event streaming, Tektite incorporates a powerful expression language and function library for data filtering, transformation, and processing. With it, we can implement custom processing logic using WASM modules, enabling high-performance, server-side computations. I’m kind of super excited about that part, since those modules don’t have to be written in Go, and WASM runs close to native speed, which opens up a ton of possibilities.
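To make the Kafka compatibility concrete, here’s a minimal sketch of producing events to a Tektite topic with a bog-standard Kafka client (the kafka-python package here; the endpoint address and topic name are hypothetical stand-ins):

```python
# Sketch: producing to a Tektite topic with a stock Kafka client.
# Assumes the kafka-python package and a local Tektite server exposing
# its Kafka-compatible endpoint on localhost:9092 (address and topic
# name are hypothetical).
import json

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Tektite speaks the Kafka protocol, so this is an ordinary produce call
producer.send("sensor-readings", {"sensor_id": 42, "temp_c": 21.7})
producer.flush()
```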

Tektite can maintain real-time windowed aggregations and materialized views (super helpful if you’re using event streaming for streaming analytics pipelines). It also supports complex data operations, including stream/stream and stream/table joins, which let us create new streams by combining data from multiple sources. We do quite a bit of “enrichment” in streaming pipelines at work, and I’m especially intrigued at how straightforward this feature seems to be.
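If “windowed aggregation” is new to you, here’s a conceptual illustration in plain Python (emphatically not Tektite syntax) of what a tumbling-window count over an event stream does; Tektite maintains this kind of state server-side and keeps the results queryable:

```python
# Conceptual illustration only: a tumbling-window count over a stream.
from collections import Counter

WINDOW_SECS = 60

def window_counts(events):
    """Count events per key within fixed 60-second windows.

    `events` is an iterable of (timestamp, key) pairs, assumed to
    arrive roughly in time order. The final partial window is not
    emitted, which is fine for an illustration.
    """
    counts: Counter = Counter()
    window_start = None
    for ts, key in events:
        if window_start is None:
            window_start = ts
        if ts - window_start >= WINDOW_SECS:
            yield window_start, dict(counts)  # emit the closed window
            counts.clear()
            window_start = ts
        counts[key] += 1
```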

Given the Kafka API compatibility, Tektite can bridge to and from existing external Kafka-compatible servers, providing flexibility in hybrid deployment scenarios and enabling gradual migration from legacy systems.

Tektite employs a distributed log-structured merge tree (LSM) (arXiv). This data structure has performance characteristics that make it attractive for providing indexed access to files with high insert volume, such as transactional log data. Tektite’s take is a distributed, level-based LSM tree, where the data is maintained as sorted-string tables (SSTables) in multiple levels of the tree. The SSTables are persisted in the object store (flushed there asynchronously), and ‘hot’ tables are cached in the Tektite cluster for fast access. This design both keeps writes cheap and helps Tektite reliably store huge amounts of data.
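For the curious, here’s a toy sketch of the core LSM idea (a deliberately naive illustration, not Tektite’s implementation): writes land in an in-memory buffer (the memtable) and get frozen into immutable sorted runs (SSTables) once the buffer fills.

```python
# Toy LSM illustration: memtable writes, flushed as immutable sorted
# runs. Tektite does this at cluster scale, persisting SSTables to
# object storage and caching hot ones locally.

class TinyLSM:
    def __init__(self, memtable_limit: int = 4):
        self.memtable: dict[str, str] = {}
        self.sstables: list[list[tuple[str, str]]] = []  # newest last
        self.memtable_limit = memtable_limit

    def put(self, key: str, value: str) -> None:
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            # flush: freeze the memtable into an immutable, sorted run
            self.sstables.append(sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key: str) -> str | None:
        if key in self.memtable:
            return self.memtable[key]
        # check newest runs first; real engines binary-search SSTables
        # and compact/merge levels in the background
        for table in reversed(self.sstables):
            for k, v in table:
                if k == key:
                    return v
        return None
```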

The project is in active development, with most features nearing completion. The development team is focusing on automated testing and performance optimization to ensure a robust and fast 1.0 release later in the year. It’s licensed under the Apache v2.0 license, and the devs welcome contributions from the community.

Once it has a stable release, we’re — very likely — going to kick the tyres at work, so I’ll report back when we do.

DataFusion


Apache Arrow DataFusion (GH) (PDF) is a high-performance, extensible query engine written in Rust that leverages Apache Arrow as its in-memory format. It’s designed to be a modular, embeddable query engine that can be integrated into various data-centric systems, and it provides a robust framework for query planning, optimization, and execution, making it a pretty spiffy choice when building custom data analytics applications.

It supports SQL and DataFrame APIs, so we get to use the interface we’re most comfortable with (very R {tidyverse}-esque!).

The execution engine is optimized for parallel, vectorized processing, leveraging Rust’s much-lauded performance and safety features. And, it has extensibility baked-in, so we can layer in custom functions, data sources, and even data-wrangling optimizations. Thanks to the “Arrow” part, DataFusion also natively supports various data formats, including Parquet, Avro, CSV, JSON, and Arrow IPC files.

InfluxDB uses it, as do Arroyo, dask-sql, and many others. For you Python folks out there, DataFusion has a module for you which abstracts away the beautiful Rust operations so you can write terrible Python code instead. It has plenty of Python examples, too.
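Here’s a quick taste of both APIs via that Python module (a minimal sketch assuming the datafusion package from PyPI; the CSV file and column names are hypothetical):

```python
# Minimal sketch with the `datafusion` PyPI package; data.csv and its
# columns are hypothetical. Both routes run on the same Rust engine.
from datafusion import SessionContext, col, functions as f

ctx = SessionContext()
ctx.register_csv("readings", "data.csv")  # Parquet/JSON/Avro work too

# SQL API
ctx.sql(
    "SELECT sensor_id, avg(temp_c) FROM readings GROUP BY sensor_id"
).show()

# Equivalent DataFrame API
ctx.table("readings").aggregate(
    [col("sensor_id")], [f.avg(col("temp_c"))]
).show()
```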

postgres-websockets


PostgreSQL’s LISTEN and NOTIFY commands give us asynchronous notifications within a given database. This feature is useful for applications that need to react to changes in a database without resorting to constant polling, which can be inefficient and resource-intensive. The LISTEN command registers the current session as a listener on a specified notification channel. When a session executes LISTEN channel_name, it tells PostgreSQL to notify it whenever a NOTIFY command is executed on that channel. The NOTIFY command sends a notification to all sessions that are listening on the specified channel. It can also include an optional payload, which is a string that can carry additional information.
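Here’s a minimal Python listener using psycopg2 to show the moving parts (the connection string and channel name are hypothetical); another session firing NOTIFY orders, 'some payload' will wake this loop:

```python
# Minimal LISTEN loop with psycopg2; connection parameters and the
# 'orders' channel are hypothetical stand-ins.
import select

import psycopg2
import psycopg2.extensions

conn = psycopg2.connect("dbname=app user=app")
conn.set_isolation_level(psycopg2.extensions.ISOLATION_LEVEL_AUTOCOMMIT)

with conn.cursor() as cur:
    cur.execute("LISTEN orders;")  # register on the 'orders' channel

while True:
    # block until the socket is readable (i.e., a NOTIFY arrived)
    select.select([conn], [], [])
    conn.poll()
    while conn.notifies:
        note = conn.notifies.pop(0)
        print(f"channel={note.channel} payload={note.payload}")
```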

Most (mebbe “many”?) folks end up writing some custom middleware to get these events pushed up to web or iOS/Android apps, which is boring, repetitive, and prone to error.

postgres-websockets (GH) is a Haskell-based middleware that adds websocket capabilities on top of these notifications, letting us:

  • send messages to a websocket, triggering a NOTIFY command in a PostgreSQL database
  • receive messages sent to any database channel through a websocket
  • authorize the use of channels using a JWT issued by another service
  • authorize read-only, write-only, or read and write websockets

It abstracts the complexities involved in setting up and managing websocket connections so we can focus on an application’s logic. It also includes built-in mechanisms to handle database connection failures, either through self-healing connections or external supervision (which is One More Thing™ you’d have to include in any custom middleware implementation).

Detractors may argue that it’s unusable given that it’s written in Haskell, but it ships with pre-built amd64 Linux binaries, and building it for other platforms is straightforward (though it may take a while). And, said haters likely use pandoc without even thinking about it, so they’re not ones to judge.

The README has some great info and basic examples, and the JS websocket client example is very grokable.

FIN

Remember, you can follow and interact with the full text of The Daily Drop’s free posts on Mastodon via @dailydrop.hrbrmstr.dev@dailydrop.hrbrmstr.dev ☮️
