Drop #519 (2024-08-22): Cache Me If You Can

Building A Lightweight Internet Research Setup With SearXNG And Jina, Plus A Caching Layer For The Jina Calls

I realized, in my usual “doh!” shock and horror, that I neglected to make a note in the last Drop about a newsletter hiatus this week (hey, if John Oliver can take arbitrary weeks off…). There’s just a lot going on at the hrbrcompound and $WORK this week.

So, I feel like I owe y’all at least one Drop!

Let’s take today’s mea culpa Drop to talk a bit about the programmatic side of SearXNG and how to wire results from it up to Jina to create a mini-markdown content database which also serves as a nice caching layer to Jina calls, and bundle it all up into a lightweight Internet research utility.


SearXNG

Photo by Skitterphoto on Pexels.com

The mind boggles at the fact that I have not yet covered SearXNG (GH). A link for it was casually Dropped in September of 2022, but never received the full Drop treatment.

SearXNG is a free, self-hostable (in your workstation’s Docker, or other container engine, setup) internet metasearch engine that aggregates results from over 70 search services without tracking or profiling, which gives us a broader range of information from a single query. It’s flexible enough to let us choose which search engines to include and adjust various settings to tailor the overall search experience/output. The container setup instructions are pretty ace, so I’ll leave that to you. After you get it working, this is what an example query result from an interactive session might look like:

NOTE: to enable the “Download results” buttons, you can add the following under the search: section of settings.yml:

formats:
    - html
    - json
    - rss
    - csv

(You’ll need to restart the container after that.)

I still use Kagi for my regular interactive searches, but rely on SearXNG’s API for programmatically pulling content. That sounds fancier than it is, since all I have is a Bash script (searx) with a single curl call:

#!/usr/bin/env bash

# one-shot query against a self-hosted SearXNG instance's JSON API
# (the "json" format must be enabled in settings.yml, per above)
curl \
  --url "http://your-instance-ip-port/search" \
  --data-urlencode "q='${1}'" \
  --data "format=json" \
  --fail \
  --silent \
  --insecure \
  --location 2>/dev/null || echo '{ "message": "Error fetching SearXNG results" }'

Any given search incantation with that returns JSON, and each result looks like this:

$ searx 'national public data breach' | jq -sr '.[0].results[]'
{
  "url": "https://krebsonsecurity.com/2024/08/national-public-data-published-its-own-passwords/",
  "title": "National Public Data Published Its Own Passwords – Krebs on Security",
  "content": "2 days ago - New details are emerging about a breach at National Public Data (NPD), a consumer data broker that recently spilled hundreds of millions of Americans' Social Security Numbers, addresses, and phone numbers online. KrebsOnSecurity has learned that another NPD data broker…",
  "publishedDate": null,
  "thumbnail": "",
  "engine": "brave",
  "parsed_url": [
    "https",
    "krebsonsecurity.com",
    "/2024/08/national-public-data-published-its-own-passwords/",
    "",
    "",
    ""
  ],
  "template": "default.html",
  "engines": [
    "qwant",
    "duckduckgo",
    "brave"
  ],
  "positions": [
    1,
    1,
    1
  ],
  "score": 9.0,
  "category": "general"
}

That score value is super handy: it indicates how relevant or important the result is deemed to be in relation to the search query.

SearXNG calculates this score based on various factors, including:

  • the position of the result across different search engines
  • the number of search engines that returned this result
  • other relevance metrics provided by the individual search engines

The value isn’t capped at a fixed maximum; it grows with how many engines agree on a result and how highly each one ranks it. Higher values indicate greater relevance, so that 9.0 means the result is likely pretty important/good.

SearXNG uses this score to help rank and order the results it presents via the web interface, and it does a bit of work behind the scenes to normalize and compare relevance across different sources.

Rather than leave the cozy CLI, I can get a feel for the top search results right in the terminal:

$ searx 'national public data breach' | jq -sr '.[0].results[] | select(.score >= 1) | "\(.score) \(.title)"'
9.0 National Public Data Published Its Own Passwords – Krebs on Security
2.4 Was your Social Security number compromised in a massive data breach?
2.1666666666666665 The Slow-Burn Nightmare of the National Public Data Breach | WIRED
1.5227272727272727 National Public Data breach update: class action lawsuits pile up
1.5119047619047619 National Public Data Cyber Attack: Massive Data Breach Exposed Countless Social Security Numbers and Personal Info - CNET Money
1.2857142857142856 Data broker blunders as millions are exposed with public passwords | Fox News
1.5 2.9 billion records stolen in Social Security data hack, USDoD claims
1.0666666666666667 Hackers may have stolen your Social Security number in a massive breach ...
1.0666666666666667 Mozilla Monitor | National Public Data
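Note that jq emits those lines in the order SearXNG returned them, not strictly by score. If you’d rather have them sorted highest-first, a small tweak to the same jq incantation does it (searx_top is just my name for a tiny wrapper around the filter):

```shell
# filter decent results, then sort by score descending before printing
# (sort_by sorts ascending, so negate the score)
searx_top() {
  jq -sr '[ .[0].results[] | select(.score >= 1) ]
          | sort_by(-.score)[]
          | "\(.score) \(.title)"'
}
```

Usage is the same pipeline as above: `searx 'national public data breach' | searx_top`.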

Headlines are great, but I want to see the content. While there are many ways to do that, let’s talk about getting the content with Jina.

Jina

Photo by Boris Hamer on Pexels.com

We only mentioned the embeddings side of Jina last October. They have quite a few free API endpoints with extremely generous API call allowances and stupid affordable extended allowances (they’re likely slurping up all URLs/content you send to it, so keep that in mind as to why the free tier is so generous).

The one we’re talking about today is the Reader endpoint. It turns the content at any URL into super clean Markdown. You can see that right now by hitting up https://r.jina.ai/https://dailydrop.hrbrmstr.dev/2024/08/18/bonus-drop-58-2024-08-18-one-thing-well/, which is the Jina Reader link to the Markdown version of the last Bonus Drop. It also converts PDF files to Markdown super well.
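Programmatically, the Reader endpoint is just a URL prefix: stick https://r.jina.ai/ in front of the target URL and fetch it. A trivial helper (the prefix is Jina’s documented pattern; the function name is mine):

```shell
# build a Jina Reader URL by prefixing the target URL
jina_reader_url() {
  printf 'https://r.jina.ai/%s\n' "$1"
}
```

Then something like `curl -s "$(jina_reader_url 'https://example.com')"` hands back Markdown.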

Despite having a cool free tier, I don’t want to hit it up each time for SearXNG results I’ve already seen, so I threw together a small Go CLI jinac that adds a caching layer to Jina’s Reader API results using SQLite as the cache store.
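The caching idea itself is dead simple. Here’s a rough shell sketch of the flow (jinac proper is Go + SQLite; this toy uses one file per URL purely so the logic is visible, and `fetch` stands in for the actual Jina Reader call):

```shell
#!/usr/bin/env bash
# Toy version of jinac's cache-then-fetch logic. `fetch` is a stand-in
# for the real Reader call, e.g.: fetch() { curl -s "https://r.jina.ai/${1}"; }

CACHE_DIR="${CACHE_DIR:-${HOME}/.cache/jinac-sketch}"

cached_fetch() {
  local url="$1" key path
  mkdir -p "${CACHE_DIR}"
  # turn the URL into a filesystem-safe cache key
  key=$(printf '%s' "${url}" | tr -c 'A-Za-z0-9' '_')
  path="${CACHE_DIR}/${key}"
  if [[ -s "${path}" ]]; then
    cat "${path}"                    # hit: serve from cache, no network
  else
    fetch "${url}" | tee "${path}"   # miss: fetch, emit, and store
  fi
}
```

Swap the file-per-URL store for a SQLite table keyed by URL and you have the shape of what jinac does.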

An API result will have the following structure:

$ jinac 'https://dailydrop.hrbrmstr.dev/2024/08/15/drop-517-2024-08-15-thursdataday/' | jq
{
  "code": 200,
  "status": 20000,
  "data": {
    "title": "Drop #517 (2024-08-15): Thursdataday",
    "url": "https://dailydrop.hrbrmstr.dev/2024/08/15/drop-517-2024-08-15-thursdataday/",
    "content": "_committed; sq; CSVs Are Kinda Bad_…",
    "publishedTime": "2024-08-15T13:11:00+00:00",
    "usage": {
      "tokens": 2483
    }
  }
}

We can see all the URLs we have cached via:

$ jinac --keys
https://dailydrop.hrbrmstr.dev/2024/08/15/drop-517-2024-08-15-thursdataday/
https://krebsonsecurity.com/2024/08/national-public-data-published-its-own-passwords/
https://krebsonsecurity.com/2024/08/nationalpublicdata-com-hack-exposes-a-nations-data/
https://www.cbsnews.com/news/social-security-number-leak-npd-breach-what-to-know/
https://www.cnet.com/personal-finance/identity-theft/social-security-numbers-and-personal-data-of-billions-breached-in-national-public-data-cyber-attack-heres-what-you-need-to-know/
https://www.fastcompany.com/91176676/national-public-data-breach-update-class-action-lawsuits-npd
https://www.foxnews.com/tech/data-broker-blunders-millions-exposed-public-passwords
https://www.kiplinger.com/personal-finance/billions-hacked-in-national-public-data-breach
https://www.latimes.com/business/story/2024-08-13/hacker-claims-theft-of-every-american-social-security-number
https://www.reddit.com/r/cybersecurity/comments/1ex8mbn/major_national_public_data_leak_worse_than/
https://www.securityintelligence.com/news/national-public-data-breach-publishes-private-data-billions-us-citizens/
https://www.securitymagazine.com/articles/100951-security-leaders-discuss-the-national-public-data-breach
https://www.theverge.com/2024/8/14/24220212/national-public-data-breach-social-security-3-billion
https://www.usatoday.com/story/tech/2024/08/15/social-security-hack-national-public-data-breach/74807903007/
https://www.usatoday.com/story/tech/2024/08/17/social-security-hack-national-public-data-confirms/74843810007/
https://www.wgal.com/article/national-public-data-breach-find-out-if-your-social-security-number-was-involved/61928667
https://www.wired.com/story/national-public-data-breach-leak/

Let’s wire this up with the SearXNG CLI and glow to give me a super-fast TUI for reading the content of the top search results.

CLI [Re]Search

Photo by Pixabay on Pexels.com

Here’s an expanded view of what might go into another script (researx) to let us preview search results right in the terminal:

#!/usr/bin/env bash

# search, keep decently-scored hits, then page through each result
searx "${*}" |
  jq -sr '.[0].results[] | select(.score >= 0.5) | .url' |
  while IFS= read -r url; do
    echo "${url}"
    # fetch via the cache, pull out the Markdown, page it with glow
    jinac "${url}" |
      jq -r '.data.content' |
      glow -p
  done

So, if I do:

$ researx national public data breach

I’ll be able to read through the content without much cruft.

And, I now have all that content saved locally. Which means I have programmatic access to it:

$ jinac --show-cache-dir # the CLI uses XDG cross-platform appdir standards
Cache directory: /Users/hrbrmstr/Library/Application Support/jinac
$ duckdb \
  -list -s "FROM cache SELECT url LIMIT 3" \
  "/Users/hrbrmstr/Library/Application Support/jinac"/cache.db 2>/dev/null
url
https://krebsonsecurity.com/2024/08/national-public-data-published-its-own-passwords/
https://www.fastcompany.com/91176676/national-public-data-breach-update-class-action-lawsuits-npd
https://www.wgal.com/article/national-public-data-breach-find-out-if-your-social-security-number-was-involved/61928667

including the JSON column:

$ duckdb \
  -list \
  -s "FROM cache SELECT (response::JSON).data.title AS title LIMIT 3" \
  "/Users/hrbrmstr/Library/Application Support/jinac"/cache.db 2>/dev/null
title
"National Public Data Published Its Own Passwords"
"National Public Data breach update: Lawsuits pile up against Florida-based background check company after 'security incident'"
"National Public Data breach: How to find out if your Social Security number was involved"

The content is also in decent shape for piping into your fav AI overlord.
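For example, the whole cache can be dumped into one big context file to paste or pipe into a model (a sketch built on the jinac CLI shown above; build_context and context.md are my names):

```shell
# concatenate every cached page's Markdown into a single file,
# with a blank line between documents
build_context() {
  local out="${1:-context.md}"
  jinac --keys |
    while IFS= read -r url; do
      jinac "${url}" | jq -r '.data.content'
      echo   # separator between documents
    done > "${out}"
}
```

Run `build_context breach.md` and you have the full research corpus in one Markdown file.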

The jinac code is up on Codeberg.


FIN

Remember, you can follow and interact with the full text of The Daily Drop’s free posts on Mastodon via @dailydrop.hrbrmstr.dev@dailydrop.hrbrmstr.dev ☮️