Scout; Posting; Crawlee
Two scraping resources and one API-ing resource to help you wrangle HTTP ops more efficiently.
TL;DR
(This is an AI-generated summary of today’s Drop using Sonnet via Perplexity.)
- Scout (https://scout-lang.netlify.app/): A domain-specific language for web scraping and crawling, written in Rust. It offers a simple syntax and coder-friendly features for both basic and complex web scraping tasks.
- Posting (https://github.com/darrenburns/posting): A modern, terminal-based HTTP client that brings Postman-like functionality to the command line. It allows organizing and storing HTTP request collections in a git-friendly format.
- Crawlee (https://crawlee.dev/): A web scraping and browser automation library for Node.js, offering a unified API for HTTP and browser-based crawling. It includes features like proxy rotation, session management, and support for various crawling methods.
Scout

Scout (GH) is a domain-specific language (DSL) designed specifically for web scraping and crawling tasks. Written in Rust, Scout combines a simple, easy-to-learn syntax with coder-friendly web crawling capabilities. It aims to lower the barrier to entry for web scraping while providing advanced features for more complex scenarios. Please note that you need to have icky Firefox installed along with geckodriver before trying to use Scout, as it performs all web-ops through that browser.
The language spec fits on a single web page (which means nothing — since one page could scroll forever — but the main div of that page has ~200 lines and ~6K characters), and the operations are all fairly straightforward. For example, take this sample REPL session (just running scout in the terminal starts a REPL):
>> goto "https://example.com"
Null
>> h1 = $"h1"
Null
>> h1 |> textContent()
"Example Domain"
That’s great, but we likely want our scraping to hand back some data we can actually use. For that, Scout has a results JSON output structure which gets populated via scrape operations. Let’s grab all the links and titles from the main Hacker News site (NOTE: use their API if doing IRL “scraping” of their site):
goto "https://news.ycombinator.com/"
items = $$"tr.athing > td.title > span.titleline"
for item in items do
  scrape {
    text: item |> textContent(),
    link: $(item)"a" |> href(),
  }
end
That will spit back something like:
{
  "results": {
    "https://news.ycombinator.com/": [
      {
        "link": "https://www.burn-heart.com/rulers-of-the-ancient-world",
        "text": "Rulers of the Ancient World: period correct measuring tools (burn-heart.com)"
      },
      {
        "link": "https://typesetinthefuture.com/2018/12/04/walle/",
        "text": "The Typeset of Wall·E (2018) (typesetinthefuture.com)"
      },
      {
        "link": "https://matttproud.com/blog/posts/x-window-system-boot-stipple.html",
        "text": "Iconography of the X Window System: The Boot Stipple (matttproud.com)"
      },
      {
        "link": "https://payloadspace.com/habitable-worlds-observatory-and-the-future-of-space-telescopes-in-the-era-of-heavy-lift-launch/",
        "text": "Future of Space Telescopes in the Era of Super Heavy Lift Launch (payloadspace.com)"
      },
      …
    ]
  }
}
In debug mode, you can see the Firefox instrumentation (an example of that is in the section header). It also groks proxies and such, so if you’re behind one, you can still use it.
I really like the language spec so far, and learning the language has been very straightforward.
Cannot wait to see how this evolves, and I’ll see if I can put together a larger, more real-life example over the coming weeks.
Posting

Posting is a modern, terminal-based HTTP client that brings Postman-like functionality to the terminal/CLI. It enables us to send and test HTTP requests directly from the terminal, which means you can also use it over SSH (perhaps from a different system that has better access to the resources in question).
The tool lets us organize and store collections of HTTP requests in a git-friendly format, using (ugh) YAML files with a .posting.yaml extension.
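To give a feel for the format, here's a rough sketch of what a single saved request might look like. The file name, URL, and values are placeholders, and the field names are going from the examples in the repo's docs, so double-check the README for the exact schema:

# create-user.posting.yaml (hypothetical file; verify field names against the Posting README)
name: Create user
description: Adds a new user to the system
method: POST
url: https://jsonplaceholder.typicode.com/users
headers:
- name: Content-Type
  value: application/json
body:
  content: |-
    {
      "firstName": "John",
      "email": "john.doe@example.com"
    }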
As one might expect, it fully supports the use of environment variables and .env files, so we don’t have to hard-code different values for, say, dev/test/prod.
Since it’s a modern TUI, it fully supports efficient navigation using both keyboard shortcuts and mouse input, including a “jump mode” for quick access to different parts of the interface. It’s also fully theme-able (including the layout), and lets us customize the response formatting as well. Said modernity is also shown via a built-in command palette, so you’re a quick keystroke-or-two away from every command.
I’ve stolen one of their screenshots for the section header, but please head to the GH repo, as there’s tons more info, and some animated screens that show off some of the features way better than any more blathering I could do.
Crawlee

(Given that it sure feels like they’re suggesting stealing content for use in AI training is OK, I almost did not include this today, but I know Drop readers will use services/tools responsibly.)
Crawlee (GH) is a “web scraping and browser automation library for Node.js to build reliable crawlers. In JavaScript and TypeScript. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP. Both headful and headless mode. With proxy rotation.”
The tool/library has several key features that make it a pretty compelling option for web crawling. It offers a unified API that caters to both HTTP and browser-based crawling, which gives us plenty of flexibility, especially when a given website thinks it is too clever by half. The platform automatically scales based on system resources, and also integrates proxy rotation and session management, which helps enhance anonymity and reduces the likelihood of IP bans. (Again, please consider using this tool responsibly!)
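To make the proxy bit concrete, here's a minimal TypeScript sketch of wiring a ProxyConfiguration into a crawler. It follows the pattern in the Crawlee docs, the proxy URLs are placeholders, and I haven't kicked the tyres on this myself yet:

import { CheerioCrawler, ProxyConfiguration } from 'crawlee';

// Placeholder proxies; swap in your own endpoints or a provider's.
const proxyConfiguration = new ProxyConfiguration({
    proxyUrls: [
        'http://proxy-1.example.com:8000',
        'http://proxy-2.example.com:8000',
    ],
});

const crawler = new CheerioCrawler({
    proxyConfiguration, // requests get rotated across the proxies above
    async requestHandler({ request, proxyInfo, log }) {
        log.info(`Fetched ${request.url} via ${proxyInfo?.url}`);
    },
});

await crawler.run(['https://example.com']);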
It also has a persistent queue system, which helps manage the URLs to be crawled. Configurable storage options are also available for the scraped data. And, Crawlee comes with built-in error handling and retry mechanisms, ensuring robustness and reliability during the crawling process.
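Here's a tiny sketch of those storage bits (same caveats: TypeScript, untested by me, names taken from the library's docs):

import { Dataset, KeyValueStore, RequestQueue } from 'crawlee';

// The request queue persists to ./storage, so an interrupted crawl can pick up where it left off
const queue = await RequestQueue.open();
await queue.addRequest({ url: 'https://example.com' });

// Key-value store for one-off records; dataset for row-like scraped results
await KeyValueStore.setValue('RUN_INFO', { startedAt: new Date().toISOString() });
await Dataset.pushData({ url: 'https://example.com', note: 'placeholder row' });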
Crawlee comes with various crawling methods to suit different needs. The “CheerioCrawler” provides fast HTTP-based crawling with Cheerio for HTML parsing, making it ideal for lightweight and speedy tasks. For more complex requirements, the “PuppeteerCrawler” utilizes headless Chrome automation via Puppeteer, offering extensive control over the browsing environment. The “PlaywrightCrawler” supports multi-browser automation with Playwright, compatible with Chrome, Firefox, and WebKit.
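And, to show how little code a basic crawl takes, here's a sketch along the lines of the project's quick-start example (CheerioCrawler flavor; the target URL is just the project's own site):

import { CheerioCrawler, Dataset } from 'crawlee';

const crawler = new CheerioCrawler({
    // $ is the Cheerio handle to the parsed page
    async requestHandler({ request, $, enqueueLinks, log }) {
        const title = $('title').text();
        log.info(`Title of ${request.loadedUrl}: ${title}`);

        // Save the result to the default dataset (lands under ./storage)
        await Dataset.pushData({ url: request.loadedUrl, title });

        // Add links found on the page to the crawl queue
        await enqueueLinks();
    },
    maxRequestsPerCrawl: 50, // safety cap while experimenting
});

await crawler.run(['https://crawlee.dev']);

Swapping in PuppeteerCrawler or PlaywrightCrawler follows the same shape, with a browser page handle in the requestHandler context instead of the Cheerio $.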
By abstracting away common scraping challenges, this tool/library lets us focus on extracting and processing data rather than dealing with low-level crawling logistics. It seems super-flexible, and the feature set looks fairly comprehensive.
While I do my best to try things out before adding them to the Drop, this one is on the “TODO” list. If you kick the tyres before I do, def lemme know how you liked it!
FIN
Remember, you can follow and interact with the full text of The Daily Drop’s free posts on Mastodon via @dailydrop.hrbrmstr.dev@dailydrop.hrbrmstr.dev ☮️