url-knife; gather
CORRECTION: It turns out the “can’t click on pages/tabs in SigmaOS” is an artifact of macOS 15 beta (the Sigma folks must be using some seekrit SwiftUI API elements that aren’t officially supported). Clicking works as expected on macOS 14.
This very late Monday Drop was, indeed, started on Monday. But I was delayed as I ran out of time (a) porting the first resource into something I/y’all could use in an ES6 context (without requiring a bundler), and (b) finishing up one function in an R wrapper package for said resource. Sure, I could have picked different resources, but I already had the gather section done.
Anyway…
We’re grabbing content and doing all sorts of unnatural things to it in today’s edition.
url-knife

I came across the url-knife JavaScript library at random, and I’m pretty stoked that I did, as it does a phenom job slicing and dicing plaintext and XML/HTML content then serving up emails, URLs, comments, and entities/elements/tags. It also does a super job when asked to normalize potentially janky URLs, and has a super-fast well-formed URL parser.
Rather than blather, here, about the JavaScript-side of url-knife, please head on over to this Observable Notebook where you can see the examples in action and play with them.
You can also play with it locally in ES6 module projects by downloading url-knife.bundle.js and importing Pattern from it. If you’re in a JS project that’s already doing bundling, just use the official NPM version.
I really wanted to be able to use this from R (after all, it’s the world’s best data-wrangling language). So, I tossed it into {V8} for a test run:
ctx <- V8::v8()
# This will run in your R session if you have {V8} and {httr} installed
#
# BUT
#
# NEVER do sourcing from a remote URL you do not have full control over 🙃
ctx$source(file = "https://rud.is/dl/url-knife.bundle-v8.js")
httr::GET("https://text.npr.org/") |>
  httr::content(as = "text") -> doc
ctx$call("Pattern.XmlArea.extractAllElements", doc) |>
  str()
## 'data.frame': 155 obs. of 5 variables:
## $ value : chr "<html lang=\"en\">" "<head>" "<title>" "</title>" ...
## $ elementName: chr "html" "head" "title" "/title" ...
## $ startIndex : int 16 33 44 78 91 162 218 565 2196 2205 ...
## $ lastIndex : int 31 38 50 85 156 212 559 571 2203 2211 ...
## $ commentArea: logi FALSE FALSE FALSE FALSE FALSE FALSE ...
(Most of the other functions also return a very rich bit of output.)
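Judging from the output above, startIndex and lastIndex look to be zero-based and inclusive: the 16-character `<html lang="en">` runs from 16 to 31. That is an inference from this one result, not something confirmed against the url-knife docs. Under that assumption, here is a minimal sketch of slicing the original text back out with those columns. The `doc` string and the `elements` data frame are hand-built stand-ins with the same shape as the real return value, so the block runs without {V8} or a network connection.

```r
# Hypothetical stand-in for a fetched page; ASCII only, so R character
# positions line up with JavaScript string indexes.
doc <- "<!DOCTYPE html>\n<html lang=\"en\">\n<head>\n<title>Hi</title>"

# Hand-built rows in the same shape as the str() output above. With {V8}
# loaded, these would come from Pattern.XmlArea.extractAllElements instead.
elements <- data.frame(
  elementName = c("html", "head", "title", "/title"),
  startIndex  = c(16L, 33L, 40L, 49L),
  lastIndex   = c(31L, 38L, 46L, 56L)
)

# Assumption: indexes are zero-based and inclusive, so shift both ends by
# one for R's one-based substring().
tags <- substring(doc, elements$startIndex + 1L, elements$lastIndex + 1L)
print(tags)

# Keep opening tags only (closing tags start with "/") and tally them.
opening <- elements$elementName[!startsWith(elements$elementName, "/")]
print(table(opening))
```

One caveat: JavaScript counts UTF-16 code units while R counts characters, so the offsets may drift apart on pages with emoji or other characters outside the Basic Multilingual Plane.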
Since that worked super well, I threw together a small R package that wraps the JS library. There are 1–2 functions I still need to finish (PRs welcome 😉).
I’m going to throw together a Deno/Bun CLI program that uses this library, as I think it’ll be super handy to have these extractors in a self-contained binary.
gather

Markdown has become even more important now that many of us are experimenting with RAG systems. So, having solid tooling to get content into Markdown format is super helpful.
Gather CLI (GH) is a Swift-based command-line tool that converts web URLs to Markdown documents. It offers various options for input, output, and formatting, including the ability to extract content from the clipboard, environment variables, or raw HTML. The tool can be used to clip web pages into Markdown text without ads and comments, making it useful for note-taking and other projects. (NB: I had Apple Intelligence write this paragraph on purpose; more on that in Wednesday’s Drop).
It has some special options for yanking content from StackOverflow (this was baked-in before SO started stealing all our content), and for getting content into a note-taking system the author also wrote.
I haven’t tried compiling it at all, but it may actually work on non-macOS systems (I’ll give that a go once I get a free mo’).
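For folks who would rather drive this from R, a minimal sketch follows. It assumes the `gather` binary is on the PATH and that it prints Markdown to stdout when handed a bare URL. `gather_markdown()` is a hypothetical helper name, not part of gather or of any package mentioned here, and the `cmd` argument exists only so a different binary can be swapped in.

```r
# Hypothetical helper: run a URL-to-Markdown CLI and return its stdout as
# a character vector (one element per output line).
# Assumptions: `cmd` is on the PATH and prints Markdown for a bare URL.
gather_markdown <- function(url, cmd = "gather") {
  if (!nzchar(Sys.which(cmd))) {
    stop("Could not find '", cmd, "' on the PATH", call. = FALSE)
  }
  # shQuote() keeps characters like ? and & in the URL away from the shell
  system2(cmd, args = shQuote(url), stdout = TRUE)
}

# Usage (needs gather installed, so it is left commented out):
# md <- gather_markdown("https://sebs.website/blog/know-your-razors-guillotines-and-hammers")
# writeLines(md, "razors.md")
```

Returning the lines as a character vector keeps the result easy to pass to `writeLines()` or to collapse into a single string for chunking.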
Here’s a sample of how it performs on basic content like this: https://sebs.website/blog/know-your-razors-guillotines-and-hammers (I chose this since it’ll fit OK in WP):
# Know your Razors, Guillotines & Hammers
[Source](https://sebs.website/blog/know-your-razors-guillotines-and-hammers "Know your Razors, Guillotines & Hammers")
Navigating modernity can be tough. Luckily philosophers, poets, and great thinkers over the centuries has developed a number of mental shortcuts, in the form of heuristics, razors, guillotines, and other aphorisms. We've collected the most important ones here in a 'cheat sheet' of sorts. Memorize these, and you may never look at the world in the same way again…
**Alder's Razor**
If something cannot be settled by experiment or observation, then it is not worthy of debate.
**Chesterton's Fence**
Reforms should not be made until the reasons behind the existing state of affairs is understood.
**Duck Test**
The duck test is a form of abductive reasoning, usually expressed as "If it looks like a duck, swims like a duck, and quacks like a duck, then it probably is a duck."
**Grice's razor (also known as Giume's razor)**
As a principle of parsimony, conversational implications are to be preferred over semantic context for linguistic explanations. AKA Address what the speaker actually meant, instead of addressing the literal meaning of what they actually said
**Hanlon's Razor**
Never Attribute to malice that which can be adequately explained by stupidity.
**Hitchen's Razor**
What can be asserted without evidence can also be dismissed without evidence.
**Hobson's Choice**
A free choice where only one choice is offered
**Hume's Guillotine**
If the cause, assigned for any effect, be not sufficient to produce it, we must either reject that cause or add to it such qualities as will give it a just proportion to the effect.
**Occam's Razor**
When confronted with competing explanations, often the explanation with the fewest assumptions is the correct explanation.
**Maslow's Hammer**
To treat everything as if it were a nail, If the only tool you have is a hammer.
**Popper's Falsifiability Principle**
For a theory to be considered scientific, it must be falsifiable.
**Sagan Standard**
Extraordinary claims require extraordinary evidence.
**Shirky Principle**
Institutions will try to preserve the problem to which they are the solution.
If you've made it this far I owe you a beer the next time I see you 🍺. Want to get in touch? [Follow me on Twitter(X)][1].
[1]: http://twitter.com/sebs_tweets
(apologies for a twitter ref at the end there)
There are many uses besides RAG, of course. Since it gets rid of so much cruft, gather and similar tools are great for yanking recipes for use in note-taking systems (I use Bear). For example, I can just do “bearit $URL” and it’ll create a Bear note and tag it as something I’ve gathered, so I can find them and augment them as needed. Bear’s browser extension does a fine job as well (it also brings over images), but it’s nice to have my own CLI workflow, too.
I’ll report back on the non-macOS compile in a day or two.
FIN
Remember, you can follow and interact with the full text of The Daily Drop’s free posts on Mastodon via @dailydrop.hrbrmstr.dev@dailydrop.hrbrmstr.dev ☮️