Drop #509 (2024-08-01): SQL Shenanigans

pickaxe ; fselect & fsql; 🐤 shellfs

Who doesn’t 💙 a solid SQL query? Ah, the joy of tossing some boolean algebra at massive data stores can warm the heart on the coldest of days. But, what if I told you you could run SQL on some odd data sources? Yes, these sacred incantations can be used for more than just pulling data from columnar or traditional relational datastores.

So, today, we’ll look at four tools (just three sections though) that let us run SQL in the strangest of places. Don’t worry, this is the good kind of weird.

TL;DR

(This is an AI-generated summary of today’s Drop using Sonnet via Perplexity.)

Here’s a concise three-bullet summary of the blog post:

pickaxe: A tool that allows using SQL operations for web scraping, supporting CSS selectors, JavaScript execution, and complex data manipulation [https://github.com/bitsummation/pickaxe]
fselect: A fast, Rust-based tool that enables running SQL queries over the filesystem, offering extensive support for file metadata and content analysis [https://github.com/jhspetersson/fselect]
shellfs: A DuckDB extension that allows executing shell commands within SQL queries, enabling seamless integration of command-line operations with database operations [https://community-extensions.duckdb.org/extensions/shellfs.html]

pickaxe

Photo by Samer Daboul on Pexels.com

The other day, when I was poking at alternative web scraping idioms/tools for the Drop, I went down a serious rabbit hole that ultimately led to pickaxe that lets us use SQL ops for web scraping!

I’ll be super up-front that this is an old-ish, .NET-based tool. BUT! If you can get past that (hey I did), then read on (it works on Linux, macOS, and Windows).

First, head to the above GH site and download the self-contained version for your OS. That way you don’t need to figure out how to get ancient .NET libraries and frameworks set up on your system. You can use the Dockerized version, but the native ones work fine, too (yes, the macOS one is x86_64, but it runs great in Rosetta 2).

I could just leave you with the README as the author did a bang-up job, but to coax you into tryign it out:

it leverages CSS selectors to pinpoint specific elements on a web page, allowing for precise data extraction
for pages that rely on client-side rendering, it can execute JavaScript to ensure accurate data extraction
the tool supports parallel page downloads using the thread hint
AND groks proxies
it can store results in memory, files, or MSSQL (ugh) databases
plus it supports subqueries, joins, and case statements for complex data manipulation

I know we’re all used to JSON output from tools, these days, but Pickaxe came from the before times, so we’re kind of stuck with somewhat janky *SV export support. I say “janky” as it doesn’t do a great job quoting fields, so I highly recommend using the “group separator” ASCII code as the delimiter to ensure you can break on fields with precision.

The README has extensive examples, but we’ll toss one in here for good measure.

The following gets the title and URL of all the stores up on CNN’s “lite” site and makes a GSSV out of them:

create file results(title string, href string)
with (
    fieldterminator = '␝',
    rowterminator = '\r\n'
)
location '/tmp/results.psv'

insert into results
select 
  pick 'a' as title,
  pick 'a' take attribute 'href' as link
from 
  download page 'https://lite.cnn.com/'
where
  nodes = 'ul li'

Here’s the first few rows of the result:

title␝href
Trump’s comments about Harris’ race kicks off a new – yet familiar – chapter in the 2024 presidential campaign␝/2024/08/01/politics/nabj-trump-harris-analysis/index.html
UK rocked by far-right riots fueled by online disinformation about Southport stabbings␝/2024/08/01/uk/southport-attack-disinformation-far-right-riots-intl-gbr/index.html
Algeria boxer who had gender test issue wins first Olympic fight in Paris when opponent quits␝/2024/08/01/sport/khelif-carini-womens-boxing-paris-olympics/index.html
Bank of England cuts interest rate for first time in four years, but it was the closest call␝/2024/08/01/business/bank-of-england-interest-rates-cut/index.html
Federal judge says New Jersey’s ban on AR-15 rifles is unconstitutional␝/2024/08/01/politics/federal-judge-says-new-jerseys-ban-on-ar-15-rifles-is-unconstitutional/index.html
US and Russia carrying out massive prisoner swap that is expected to include Evan Gershkovich and Paul Whelan␝/2024/08/01/politics/russia-us-prisoner-swap/index.html
Olympic medalist Tom Daley unveils his latest achievement: A sweater, knitted poolside␝/2024/08/01/style/tom-daley-knitting-paris-olympics/index.html
This country’s doctors can prescribe four-week spa breaks to frazzled parents␝/travel/germany-kur-parents-spa-breaks-wellness/index.html
There are cracks forming in the US jobs market␝/2024/08/01/economy/us-jobs-report-preview-july/index.html

It’s a neat way to think about scraping, and I may poke at it a bit more for some of the side projects I have running that cull data from hosted HTML pages.

fselect & fsql

Photo by panumas nikhomkhai on Pexels.com

While fd has been a welcome modern alternative to the ancient find command, it’s is still a tool with somewhat arcane syntax. It sure would be nice to be able to run SQL over the filesystem (without taking on the burden of a full-out osquery install). And, thanks to fselect and fsql we can!

Now, fsql is way slower than fselect, and a bit less functional, so you can head over to the repo and check it out if you’d like (it’s Golang-based, if that’s something that entices you). fselect is Rust-based and lightning fast.

There’s no way I’m going to be able to tell you all of the options for fselect. Just an fselect -h produces 215 lines of help, and you will not believe what it supports (it can even query inside of archives!). It has support for the traditional elements, such as name, extension, path, abspath, directory, absdir, size, fsize, uid, and gid. But it also has extensive support for more detailed filesystem information. On top of that, there are added niceties like line_count, tons of exif fields, MP3/ID3 metadata, and identifying if a file is_shebang, is_empty, is_archive, is_audio, is_book, is_doc, is_font, is_image, is_source, or is_video (and, there’s more).

Did I mention it’s fast? It’s fast!

This:

fselect "SELECT COUNT(*) AS ct FROM ~/ DEPTH 5 INTO json"
[{"as":"AS","count(*)":"1719872","ct":"ct"}]

took:

1.27s user 2.17s system 28% cpu 12.086 total

I may never use fd or find ever again.

🐤 shellfs

Photo by Alan Cabello on Pexels.com

We can’t have a SQL Drop without some mention of the BESTEST DATABASE IN THE WORLD (which, for the uninitiated, is DuckDB).

DuckDB provides extensive support for third-party extensions to help you bend data to your will.

One super-cool extension is shellfs which turns any file-context string (say, to read_csv('…')) into an area where you can run a series of shell commands, terminated by a |. That will cause DuckDB to shell out, run the sequence of commands, and treat the result as data.

Again, the docs are great, but I’ve got a good “why?” example for you here.

I keep tabs on our “known exploited vulnerability” coverage against CISA KEV (yes, I know most of you know that by now but there are new readers this week!). I don’t need R or icky Python to get an instant summary of said performance with shellfs, as I can do it all in SQL!

-- Make the extension available
INSTALL shellfs FROM community;
LOAD shellfs;

-- In-memory tag table from a live `curl`
-- By piping to `jq` I can avoid JSON ops in DuckDB
CREATE TABLE tags AS (
  FROM 
    read_json('curl -s https://rud.is/data/tags.json | jq ".tags" |')
  SELECT 
    name AS tag, 
    created_at,
    UNNEST(cves) AS cve
);

-- In-memory KEV table from a live `curl`
CREATE TABLE kev AS (
  FROM 
    read_json('curl -s https://rud.is/data/kev.json | jq ".vulnerabilities" |')
  SELECT 
    cveID AS cve, 
    dateAdded as added_to_kev
);

-- Compute the delta of when we had a tag vs. when a vuln was added to KEV
CREATE TABLE delta AS (
  FROM 
    tags t, kev k
  SELECT
    k.added_to_kev - t.created_at AS diff
  WHERE 
    k.added_to_kev IS NOT NULL AND
    t.cve = k.cve
);

-- 💥
SUMMARIZE delta;

Then run that from the terminal:

duckdb -json < perf.sql | jq
[
  {
    "column_name": "diff",
    "column_type": "BIGINT",
    "min": "-889",
    "max": "1448",
    "approx_unique": 169,
    "avg": "136.0327868852459",
    "std": "345.3942817722654",
    "q25": "-10",
    "q50": "14",
    "q75": "277",
    "count": 366,
    "null_percentage": 0.00
  }
]

No Bash wrapper script or anything!

shellfs makes DuckDB even handier than it already is!

FIN

Remember, you can follow and interact with the full text of The Daily Drop’s free posts on Mastodon via @dailydrop.hrbrmstr.dev@dailydrop.hrbrmstr.dev ☮️

hrbrmstr's Daily Drop

Drop #509 (2024-08-01): SQL Shenanigans

TL;DR

pickaxe

fselect & fsql

🐤 shellfs

FIN

Leave a comment Cancel reply

Drop #509 (2024-08-01): SQL Shenanigans

TL;DR

pickaxe

fselect & fsql

🐤 shellfs

FIN

Share this:

Leave a comment Cancel reply