Bonus Drop #73 (2025-01-25): Crawl • Slice • Learn

lightpanda; yek; ManKier

Three distinctly diverse resources await thee in this Weekend Bonus Drop.

I noted this on socmed today, but the DoJ took down its dedicated January 6th Capitol riot page.

NPR has a very good one, but they’re cowards and collaborators, so it will most certainly go away at some point. Here’s a bit of Bash to get the JSON that backs the table:

curl \
  --silent \
  --url "https://apps.npr.org/dailygraphics/graphics/capitol-riot-table-20210204/table.html" | \
  xmq select "//script[2]" to-text  | \
  sed \
    -e 's/var DATA =//' \
    -e 's/;$//g' \
    -e 's/var L.*//g' | \
  jq .

And, the data from my daily scrapes (up to Nov 6, 2024) is available as well, and most assuredly will not be taken down (at least by me).

Aside: My insane “47 Watch” project now has a small companion web app where you can browse through all of the EOs without having to endure the White House site. I defaulted to lunr.js for client-side full-text search, but will swap that out for one of the more modern options soon.


TL;DR

(This is an AI-generated summary of today’s Drop using Ollama + llama 3.2 and a custom prompt.)

  • Lightpanda is a headless browser offering 11x faster execution and 9x lower memory usage than Chrome, designed specifically for AI agents and web automation (https://lightpanda.io)
  • Yek is a Rust-based tool that processes codebases 230x faster than alternatives, using Git history to intelligently chunk and prioritize files for LLM consumption (https://github.com/bodo-run/yek)
  • ManKier transforms Unix man pages into modern interactive HTML5 documentation with features like dynamic navigation, intelligent search, and command explanation capabilities (https://www.mankier.com)

lightpanda


Traditional browser automation tools like Selenium and headless Chrome have served us pretty well but come with serious overhead, requiring full browser instances that consume substantial resources.

If Lightpanda (GH) continues to evolve, we may just enter a whole new world of JavaScript-enabled web scraping.

It’s a purpose-built browser automation tool that has remarkable speed and is super resource efficient. Their stats claim it can complete 100 page requests in 2.3 seconds compared to traditional tools’ 25.2 seconds, while using just 24 MB of memory versus the typical 200+ MB.

Key features touted by the devs include:

  • Zero-latency startup
  • Complete embeddability
  • Direct JavaScript execution
  • Comprehensive web standards support

This efficiency, provided they complete all the compatibility tests, may revolutionize (that’s not hyperbole) web archiving, something we ALL need to get in the habit of doing regularly.

Of course, it will further enable AI Applications (for good or evil) since the tool is lightweight enough to deploy thousands of concurrent instances for autonomous agents and ML training with minimal resource impact.

It has the potential to significantly reduce cloud infra costs through lower resource consumption and higher concurrency.

There are prebuilt binaries, so grab one and follow along with a simple test, if so inclined.

I made a single HTML page with some JS tests on it. We can do a naive check that it’s working without doing anything fancy:

$ curl --silent https://rud.is/ex/bamboo.html > curl.out
$ lightpanda --verbose --dump https://rud.is/ex/bamboo.html > lp.out
$ diff lp.out curl.out
2c2,3
< <html lang="en"><head>
---
> <html lang="en">
> <head>
38c39
<         <div id="js-check">JavaScript is working!</div>
---
>         <div id="js-check"></div>
70a72,73
> </body>
> </html>
72,74d74
<
<
< </body></html>

There are five tests on the page, but that easy-mode test only caught one.
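
If you just want to confirm that one test without squinting at a diff, a plain grep for the string the script injects will do (this assumes the page keeps the js-check div and its “JavaScript is working!” text):

$ grep -F 'JavaScript is working!' lp.out    # matches: lightpanda ran the script
$ grep -F 'JavaScript is working!' curl.out  # no match (exit 1): curl only saw the raw HTML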

Unfortunately, there’s an issue with their puppeteer/playwright compatibility (it figures it cropped up on the weekend I had this Drop planned), so we’ll carve out a section in a future Drop to check back in with the project and run some more advanced tests.


yek

LLMs have fixed context windows (i.e., a maximum number of tokens they can process at once). We need to use a process called “chunking” to split large inputs into smaller pieces that fit within these limits while preserving meaningful context. This lets LLMs process large codebases or documents that would otherwise exceed their capacity.

The chunking process requires careful consideration of:

  • size: chunks must fit within context limits
  • coherence: chunks should maintain semantic meaning
  • overlap: some content may need to be repeated between chunks to preserve context
  • relevance: important sections should be prioritized within limited context space

Without effective chunking, LLMs would be limited to processing only small documents or risk losing critical context from larger inputs.
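
To make the byte-size flavor concrete, here’s a deliberately crude, coreutils-only sketch: concatenate a repo’s tracked text files and slice the stream into 10 MB pieces. This is the naive baseline that smarter, Git-aware chunkers improve on, not how any of them actually work:

# crude byte-based chunking: list tracked files, drop obvious binaries by
# extension, concatenate what's left, and split the stream into 10 MB pieces
# (assumes well-behaved filenames; a sketch, not a robust pipeline)
git ls-files \
  | grep -viE '\.(png|jpe?g|gif|ico|woff2?|pdf)$' \
  | xargs cat \
  | split -b 10M - chunk_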

Enter Yek, a high-performance Rust-based tool that prepares code repositories for LLM consumption. The tool processes text-based files from repositories or directories, serializing them into a unified format while following .gitignore rules and using Git history to determine file importance. It automagically detects and ignores binary files.

For content management, Yek implements smart chunking through either token count-based segmentation or byte size-based segmentation, defaulting to 10MB chunks. The tool prioritizes files intelligently by placing more important ones later in the output sequence, considering Git history patterns, custom priority rules from yek.toml, and file location and type.

Performance testing shows Yek has exceptional speed. It processed the entire Next.js project in ~5 seconds compared to >20 minutes for alternatives. You can also process multiple directories and stream the output.
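
The invocations below are my reading of the project README, so treat the exact arguments as assumptions and check yek --help for whatever version you install:

# assumed invocations; verify against yek --help for your version
yek                   # serialize the current repo with the defaults
yek src/ tests/       # process multiple directories in one run
yek src/ | less       # output streams to stdout when piped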


ManKier

ManKier is on a mission to level-up Unix manual page documentation by transforming traditional troff-formatted man pages into modern, interactive, semantic HTML5 documentation. The platform’s name cleverly references the industrial process of treating fabric in a circular vat (a kier), drawing a parallel to how it processes and refines raw man page content.

The platform implements a dynamic side-menu system that outlines sections and subsections, enabling precise navigation and direct section linking. This hierarchical organization provides both a quick reference and contextual understanding of complex documentation.

ManKier incorporates an intelligent search system activated by pressing ‘s’, featuring autocomplete functionality. The search system distinguishes between option searches (prefixed with “-”) and general command searches, streamlining the documentation discovery process.

All command options within the documentation are implemented as interactive links. This lets us quickly access detailed explanations of specific parameters, and is particularly useful when examining command synopses or usage examples.

The platform can be configured as a custom search engine in modern browsers, allowing direct access to man pages through the URL bar using the “man” keyword shortcut. This feature extends to specific option lookups, such as “man grep -e” for targeted reference.
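
That same predictable URL scheme makes the pages easy to grab from the shell, too. Assuming the /<section>/<command> pattern holds (e.g., section 1 for grep), stashing a local copy is a one-liner:

# assumes ManKier's /<section>/<command> URL layout; adjust if that changes
curl --silent "https://www.mankier.com/1/grep" > grep.1.html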

BUT WAIT! THERE’S MOAR!

ManKier includes a sophisticated command explanation feature that breaks down complex command sequences. That means we can paste in complete commands for detailed analysis of each component and option.

Through integration with tldr.sh, ManKier also provides condensed quick-reference sections for commands, making it easier to grasp essential functionality without diving into extensive documentation.
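
Those condensed bits come from the tldr.sh project itself, so any of its terminal clients will serve up the same content (that’s the upstream project, though, not a ManKier feature):

tldr grep    # condensed, example-driven page for grep
tldr curl    # same treatment for curl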

And, impressively, it is fully WCAG 2.1 level AA compliant.

It makes heavy use of the manner CLI tool to generate the base HTML pages.


FIN

Remember, you can follow and interact with the full text of The Daily Drop’s free posts on:

  • 🐘 Mastodon via @dailydrop.hrbrmstr.dev@dailydrop.hrbrmstr.dev
  • 🦋 Bluesky via https://bsky.app/profile/dailydrop.hrbrmstr.dev.web.brid.gy

Also, refer to:

to see how to access a regularly updated database of all the Drops with extracted links, and full-text search capability. ☮️
