Drop #431 (2024-03-22): hrbrtips & hrbrtricks

Painless Production Postgres To Parquet (Featuring 1Password); Hacking x-callback-url; Robustif-AI-ing Bash Scripts

There’s a hint of Weekend Project Edition in this Friday Drop, in that you may decide to poke at some of these topics in the next couple of days if they’re fit for a purpose or two you might have. The first two come from some work I’ve been doing in the past couple weeks, and the last one is from a discussion on Bluesky 🦋 this morning. There’s some “how to” associated with each, so a TL;DR doesn’t make a ton of sense for this issue.

Painless Production Postgres To Parquet (Featuring 1Password)

Photo by Anna Tarazevich on Pexels.com

We have a diverse array of data sources at work, ranging from Elasticsearch, to Amazon Athena, to Postgres, to MariaDB, and then more than a few from API sources. Whilst debugging a pretty gnarly data pipeline error last week, I discovered a surprise table in one of our core Postgres databases. One of our amazing engineers had set up a data archiving process for some data that I’ve been wanting a history of for quite a while, and I definitely missed the memo (read: I failed to monitor Notion for page updates b/c I hate Notion).

This data being in Postgres is fine, but it’s not yuge data, and getting the history out and into Parquet (or any other format where I can move it around) became a bit of a priority for me.

It’s simple enough to write an R or (ugh) Python script to do this, but — being always curious as to what’s out there — I decided to seek something with fewer dependencies (in the event I ever need to do this again), and ended up being bonkers surprised at how fast and simple doing this exercise with ClickHouse was.

For those readers who just had some whiplash due to me not saying “with DuckDB” at the end of that last paragraph, I’ll note that I tried this with DuckDB, but it was painfully slow and far less informative about what it was doing than ClickHouse was.

Now, the ClickHouse docs for how to do this are impeccable, so I’ll focus more on the idiom I chose to work with rather than repeat their documentation verbatim.

This particular Postgres instance is a Very Important One. As such, the connection information is not something we keep in env files or any other plaintext file anywhere. For production systems, there are cloud vaults that are used, but we lowly data scientist folks also hit it from our workstations, so the information is in 1Password (work’s password manager choice, not mine). Along with shared company vaults, an ugly and bloated Electron app, and an annoying browser extension, 1Password also has a fairly decent CLI utility. After unlocking the main vault via the horrible Electron app, you can then get CLI access to it via:

eval $(op signin --account YOUR_ORGANIZATION)

We have the credentials stored as database fields in a database record note type in 1Password, so it’s super straightforward to get those for scripts like the one I made:

JSON=$(op item get --vault shared VAULT_ENTRY_ID --format=json 2>/dev/null)

PG_HOST=$(echo "${JSON}" | jq -r '.fields[] | select (.id=="hostname") | .value')
PG_USER=$(echo "${JSON}" | jq -r '.fields[] | select (.id=="username") | .value')
PG_PASS=$(echo "${JSON}" | jq -r '.fields[] | select (.id=="password") | .value')
PG_BASE=$(echo "${JSON}" | jq -r '.fields[] | select (.id=="database") | .value')

That lets us safely use them with the ClickHouse CLI utility that is easily downloadable from their site:

clickhouse local \
  --progress \
  --output_format_parquet_string_as_string=true \
  --output_format_parquet_compression_method=zstd \
  --query "
SELECT
-- what you want
FROM
  postgresql(
    '${PG_HOST}:5432', '${PG_BASE}', '${TABLE}', '${PG_USER}', '${PG_PASS}'
  )
-- add more stuff here
INTO OUTFILE '${OUTFILE}'
"

As the query runs, you get very helpful information on the progress:

↘ Progress: 327.05 thousand rows, 52.60 MB (18.93 thousand rows/s., 3.04 MB/s.)

You have a great deal of control over the output format parameters, and I used zstd since I need to work with these Parquet files in DuckDB. In fact, this is part of a Bash “dataloader” in one of the Observable Framework data explorer sites I built for work.

One caveat is that you should likely make sure to do any CASTing required on the export side (we use the Postgres IP and UUID data types, along with Postgres variable-length text arrays, in the official schema, and I keep forgetting to do that conversion).
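For illustration, here’s a sketch of what that export-side casting might look like in the ClickHouse query. The column names are hypothetical stand-ins; the type-conversion functions (`toString`, `arrayStringConcat`) are real ClickHouse functions:

```sql
SELECT
  toString(src_ip)             AS src_ip,   -- Postgres inet → plain String
  toString(event_id)           AS event_id, -- Postgres uuid → plain String
  arrayStringConcat(tags, ',') AS tags      -- Postgres text[] → one comma-separated String
FROM
  postgresql(
    '${PG_HOST}:5432', '${PG_BASE}', '${TABLE}', '${PG_USER}', '${PG_PASS}'
  )
```

Doing the conversion here means the Parquet file lands with boring, portable String columns instead of types DuckDB may choke on.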

Definitely give this tool a try if you have need to do this type of conversion (or to other more malleable formats).

Hacking x-callback-urls

The long covid 🧠 🌫️ continues to be annoying, and I’m relying more on “notes” than I ever have before. I’ve been switching around the various note-taking systems we’ve looked at in these Drops and it appears I’ve settled into one that is a macOS/iOS/iPadOS-only solution — Bear Notes. It’s a proper macOS app that does not do anything wonky UX-wise, unlike every other one out there that’s just in an Electron shell. The iOS integration, along with integration with Apple’s “share ↑” idioms, and accessible format export options have made it kind of nice to use.

It also supports something called x-callback-urls. This is a macOS (et al.)-specific inter-app communication protocol. You can hit the list of links Perplexity generated for me when I asked it to list out all the apps that support this scheme.
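The scheme itself is simple enough to pick apart. Here’s the rough anatomy of one of these URLs — the action and parameters below are just an illustration, not a guaranteed-real call:

```shell
# Rough anatomy of an x-callback-url (action/parameters here are illustrative):
#
#   bear  ://  x-callback-url  /  create  ?  title=Test  &  x-success=myapp://ok
#   scheme     fixed host         action     parameters     optional success callback
url="bear://x-callback-url/create?title=Test&x-success=myapp://ok"

scheme="${url%%://*}"                                      # → "bear" (the receiving app)
action="${url#*x-callback-url/}"; action="${action%%\?*}"  # → "create"
echo "${scheme} / ${action}"
```

The `x-success`/`x-error` parameters are what put the “callback” in x-callback-url: the receiving app invokes them to hand control (and results) back to the caller.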

Bear happens to have robust support for these URLs, which is great! Especially since I don’t always want to add notes in a macOS app window, and I have some launchd scripts that regularly collect things I want saved in notes.

Thankfully, ages ago, a kind soul made a tiny macOS app that lets you work with these x-callback-urls from the command line, returning anything the app sends back in a nice JSON format.

I made a generic shell script wrapper for executing the xcall binary from inside the app’s macOS bundle directory:

#!/usr/bin/env bash
# save this as `xcall` on your path
/Applications/xcall.app/Contents/MacOS/xcall -activateApp NO "$@"

And then made a Bear-specific CLI for talking to Bear:

#!/usr/bin/env bash
# save this as `bear`; takes a single action-plus-parameters argument
"${HOME}/bin/xcall" -url "bear://x-callback-url/$1"
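Parameter values need to be percent-encoded before they land in the URL, and since jq is already in the toolbox here, a tiny helper covers that. The note title/text below are just example content:

```shell
# percent-encode a string for use in an x-callback-url query parameter
enc() { jq -rn --arg v "$1" '$v|@uri'; }

# create a note via the `bear` wrapper above (guarded so this is a no-op
# on systems without the wrapper installed)
if command -v bear >/dev/null 2>&1; then
  bear "create?title=$(enc "CLI test")&text=$(enc "Hello from the shell")"
fi
```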

I ended up making a CLI note Bash script using Gum, with which we’ll close this section out, since it’s a good introduction to the final section. (Hit the link, as it’s a bit too long to warrant putting in the newsletter directly.)

If you use macOS apps and have ever thought “I wish I could do X from the CLI with it”, poke around and see what support it might have for x-callback-url.

NOTE: I cloned, read the source, re-compiled, and signed the xcall app if anyone needs a “safe” copy of it.

Robustif-AI-ing Bash Scripts

Photo by cottonbro studio on Pexels.com

Vicki Boykis started a fun thread on Bluesky today on a specific utility use case for LLM/GPTs. I chimed in with another one, then ended up demonstrating it to folks who asked.

The final script in the previous section was modified by Perplexity, and I let it take credit for that in the comments. I detest writing the absolutely essential boilerplate around Bash scripts to make them more robust (and, to help remember what that one-off script I’ll “never need again” does and how to use it).

The original script had some common best-practices, but the model managed to round out all the corners.

Hit the thread for a link to the prompt and another example.

I’ll try this out with a local LLM once I un-**** my Python setups on my Macs so it’ll actually use the Metal framework for the GPU ops.

FIN

Remember, you can follow and interact with the full text of The Daily Drop’s free posts on Mastodon via @dailydrop.hrbrmstr.dev@dailydrop.hrbrmstr.dev ☮️