Drop #421 (2024-02-14): Reach Out And Track/Touch/Collect Something

Open Measures; Webhook.Site; Wget WARC-Style

Today's edition is all about data, whether it's what's happening on smaller content platforms like Bluesky and Mastodon, receiving data from apps and APIs, or creating content collections for analysis.

TL;DR

(This is an AI-generated summary of today’s Drop.)

(Perplexity is back to not giving links…o_O)

  • Open Measures: A data-driven platform designed to combat online extremism and disinformation by providing journalists, researchers, and organizations with tools and data to investigate harmful activities. The platform tracks a wide array of sites, especially those on the fringes of the internet, and offers a robust API for integrating its data into in-house dashboards for ongoing investigations.
  • Webhook.Site: An online tool that provides a unique, random URL (and email address) for testing webhooks or arbitrary HTTP requests. It displays requests in real-time for inspection without needing your own server infrastructure. The site also features a custom graphical editor and scripting language for processing HTTP requests, making it useful for connecting incompatible APIs or quickly building new ones.
  • Wget WARC-Style: An introduction to using Wget for creating web archive (WARC) files, a standardized format for archiving web content including HTML, CSS, JavaScript, and digital media, along with metadata about the retrieval process. Wget’s built-in support for WARC format allows for straightforward archiving of websites, with options for compression and extensive archiving like capturing an entire website.

Open Measures

Prelude: I feel so daft for not knowing about this site until yesterday.

I know each and every reader of the Drop groks that disinformation and extremism are yuge problems. There are definitely guardians out there who help track this activity and thwart the plans of those who seek to do harm, but there just aren't enough of them. It turns out, though, that we can all get some telemetry on these activities and, perhaps, do our part, even in some small way.

Open Measures is a data-driven site that lets us all battle online extremism and disinformation. The platform is designed to empower journalists, researchers, and organizations dedicated to social good, enabling them to delve into and investigate these harmful activities. With a focus on transparency and cooperation, Open Measures offers a suite of tools that are well-integrated and open source, along with incredibly cool data. The team behind it works super hard to ensure that both the tools and data are accessible to all who wish to use them for public benefit.

The tool tracks a wide array of sites, particularly those on the fringes of the internet, where extremism and disinformation tend to proliferate. By providing access to hundreds of millions of data points across these fringe social networks, Open Measures equips users with the necessary data to uncover and analyze trends related to hate, misinformation, and disinformation.

They have a great search tool and API. The API enables folks to bring Open Measures data into their own in-house dashboards and plug into the platform to augment their ongoing investigations. This feature is especially useful for modern threat intelligence operations that require systems to communicate seamlessly with one another. The API, while robust, is (rightfully so) rate-limited to mitigate risks from malicious actors. However, for users who find the Public API insufficient for their needs, Open Measures offers Pro or Enterprise functionality, which may better suit their requirements.

The need for and utility of Open Measures cannot be overstated. On a daily basis, we all witness the real-world consequences of online disinformation and extremism. Having tools that can sift through scads of data to identify harmful content is invaluable. For researchers and journalists, this means being able to trace and piece together online threats with greater efficiency. For social good organizations, it means having the resources to combat disinformation and protect communities from harm.

Now, while it's an essential tool for tracking this malicious activity, the site is, fundamentally, indexing the public content on the listed networks. That means you can use it for arbitrary queries, such as checking how often the term "bridge" was seen on, say, Bluesky and Mastodon in the past month (ref: section header images). (For those still unaware, there's yet another firestorm on Mastodon related to a new protocol bridge being built between it and Bluesky.)
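
If you want to pull that same sort of query programmatically, a minimal sketch with curl and jq might look like the following. Note that the hostname, endpoint, and parameter names here are assumptions from memory, not gospel; check the Open Measures API docs for the real shapes.

# Hypothetical Open Measures public API query (endpoint and parameter
# names are assumptions; consult the official API docs):
# recent posts mentioning "bridge" on Bluesky over the past month.
curl -s 'https://api.openmeasures.io/content?term=bridge&site=bluesky&since=2024-01-14&until=2024-02-14&limit=25' \
  -H 'accept: application/json' | jq '.'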

Webhook.Site

I needed to test out a webhook for something last week and hit up my Raindrop.io bookmarks to see what I used last time, and was glad to see that Webhook.site was still around and kicking!

If you ever need to get the hang of a new webhook (or arbitrary HTTP request), or test out what you might want to do in response to an email, you need to check this site out. Putting it simply, Webhook.site (I do dislike product names that are also domain names) gives you a unique, random URL (and, as noted, even an email address) to which you can send HTTP requests and emails. These requests are displayed in real-time, which lets us inspect their content without the need to set up and maintain our own server infrastructure. This is incredibly useful for testing webhooks, debugging HTTP requests, and even creating workflows. Said workflows can be wrangled through the site's custom graphical editor or with their scripting language to transform, validate, and process HTTP requests in various ways.

Webhook.site can also act as an intermediary (i.e., proxying requests), so you can see what was sent in the past, transform webhooks into other formats, and re-send them to different systems. This makes it a pretty useful tool for connecting APIs that aren't natively compatible, or for building APIs quickly without needing to set up infrastructure from scratch.

The external service comes with a great CLI tool, which can receive forwarded requests or execute commands in response to them. Their documentation is also exceptional.
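
As a taste of that CLI, here's a hedged sketch of forwarding incoming requests to a local service; the command and flag names are from my (possibly fuzzy) memory of their docs, so verify against the official documentation before relying on them.

# Hypothetical: forward everything that hits your webhook.site URL/token
# to an app running locally (flag names are assumptions; check the docs).
whcli forward --token=U-U-I-D --target=http://localhost:8080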

However, while this service offers a plethora of benefits and is super easy to use, it's important to exercise caution when deciding what you hook up to it. Webhooks, by their nature, often involve sending data to a public endpoint on the internet. Without proper security measures, this could expose sensitive data to unauthorized parties. As an aside, I'm really not sure you should be using any service that transmits sensitive information over a webhook. Just make sure you go into using a third-party site like this with eyes open and deliberate consideration.

The section header is a result of this call:

curl -X POST 'https://webhook.site/U-U-I-D' \
  -H 'content-type: application/json' \
  -d $'{"message": "Hey there Daily Drop readers!"}'

Wget WARC-Style

In preparation for one section coming tomorrow, I wanted to spread some scraping 💙 (it is Valentine’s Day, after all) to wget. The curl utility and library gets most of the ahTTpention these days, but there are things it just cannot do (because it was not designed to do them). One of these things is creating web archive (WARC) files.

These files are in a standardized format used for archiving all web content, including the HTML, CSS, JavaScript, and digital media of web pages, along with the metadata about the retrieval process. This makes WARC an ideal format for digital preservation efforts or for anyone looking to capture and store web content for future reference.

Creating WARC files with Wget is straightforward, thanks to its built-in support for the format. To start archiving a website, you simply need to use the --warc-file option followed by a filename for your archive. If you use Wget’s --input-file option, you can save a whole collection of sites into one WARC file. For example:

wget --input-file=urls.txt --warc-file="research-collection"

This command downloads each URL listed in urls.txt and saves the content into a WARC file named research-collection.warc.gz. The .gz extension indicates that the file is compressed with gzip, which is Wget's default behavior to save space. If you prefer an uncompressed WARC file, add the --no-warc-compression option to your command.
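
For instance, the uncompressed version of the earlier collection command would look like this:

wget --input-file=urls.txt --warc-file="research-collection" --no-warc-compression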

For more extensive archiving, such as capturing an entire website, you can combine the --warc-file option with Wget's --mirror option. This tells Wget to recursively download the website, preserving its directory structure and including all pages and resources, so you get a complete snapshot of the site at the time of download, stored in one or more WARC files. If you also set --warc-max-size, Wget will split the archive into multiple files once that limit is reached, appending a sequence number to each file's name.
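
A sketch of that full-site flavor follows (the target URL is just a placeholder; swap in whatever you're archiving):

# Mirror a site (plus page requisites) into gzip-compressed WARC files,
# rolling over to a new numbered WARC once each hits ~1 GB.
wget --mirror --page-requisites \
     --warc-file="site-snapshot" --warc-max-size=1G \
     https://example.com/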

We’ll be using this technique, tomorrow, to explore a new tool that can operate on WARC files, so I wanted to make sure y’all had time to check this out before we tap into that content.

FIN

Remember, you can follow and interact with the full text of The Daily Drop’s free posts on Mastodon via @dailydrop.hrbrmstr.dev@dailydrop.hrbrmstr.dev ☮️
