Drop #402 (2024-01-10): Hoarding Can Be A Good Thing

Monolith; Archivematica

Just two resources, today, as $WORK and .1 have encroached on the cycles usually reserved for personal research this week.

I spend an inordinate amount of time archiving and preserving content from the web and other places. Not just for these Drops, but also for the work I do fighting the good fight in cyber and against those who seek to dismantle liberal democracy and harm others. The two tools in today’s drop make that work a bit easier (though there is a bit of tedium in the second tool as it is a more “official” hoarding platform).

If you need/want to preserve content from the web, or archive other digital assets, read on!

TL;DR

This is an AI-generated summary of today’s Drop.

  • The post discusses two tools for archiving and preserving digital content. The first tool is Monolith, a Rust-based CLI tool that can convert any HTML page into a self-contained HTML file. This tool embeds all necessary assets such as CSS, JavaScript, and images into a single HTML file, allowing for offline access and preservation of the original webpage.
  • The second tool discussed is Archivematica, an open-source digital preservation system. Archivematica processes digital content such as documents, photos, and videos to ensure they meet international preservation standards. It creates Archival Information Packages (AIPs) and Dissemination Information Packages (DIPs) from Submission Information Packages (SIPs), ensuring the digital content remains accessible despite technological changes.

Monolith

One of the more compelling features of both R Markdown and Quarto HTML documents is the ability to create an entirely self-contained HTML file. While I wouldn’t use said file in a production hosting capacity (they can be yuge), they are super handy for shipping interactive reports around.

What if you could do that for any HTML page from the command line?

We’re not talking about generating a WARC archive, or a (now deprecated) WebKit .webarchive. This is a fully standalone and functional single HTML file.

Well, we can, with monolith (GH), a simple and efficient Rust-based CLI tool that embeds everything needed to reproduce a web page into a single HTML file. Give it a URL as input, and it outputs one HTML file that faithfully reproduces the original page, including assets like CSS, JavaScript, and images. This means you get a fully interactive page, not just a static screenshot or janky PDF. It’s like having the entire web page in your pocket, available anytime, anywhere, even without an internet connection.
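To make "embedding all the things" a bit more concrete: the core trick is rewriting asset references as data URLs, so the bytes travel inside the HTML itself. What follows is a minimal Python sketch of that idea, not Monolith's actual implementation. The helper names (`to_data_url`, `inline_assets`) are made up for illustration, it only handles `src="..."` attributes, and the "GIF" bytes are a placeholder rather than a real image.

```python
import base64
import mimetypes
import re


def to_data_url(name: str, payload: bytes) -> str:
    """Encode raw bytes as a base64 data URL, guessing the MIME type from the name."""
    mime, _ = mimetypes.guess_type(name)
    mime = mime or "application/octet-stream"
    encoded = base64.b64encode(payload).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def inline_assets(html: str, assets: dict[str, bytes]) -> str:
    """Swap src="..." references for data URLs when we hold the asset locally."""

    def swap(match: re.Match) -> str:
        ref = match.group(2)
        if ref not in assets:
            return match.group(0)  # leave references we cannot resolve untouched
        return f"{match.group(1)}{to_data_url(ref, assets[ref])}{match.group(3)}"

    return re.sub(r'(src=")([^"]+)(")', swap, html)


# Placeholder bytes (just a GIF header), standing in for a downloaded asset.
page = '<img src="dot.gif"><img src="https://example.com/remote.png">'
standalone = inline_assets(page, {"dot.gif": b"GIF89a"})
print(standalone)
```

A real tool also has to fetch each asset, walk CSS `url(...)` references, handle scripts and fonts, and parse the HTML properly instead of leaning on a regex, which is exactly the tedium Monolith takes off your plate.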

Monolith is not just a simple web scraper. It’s a sophisticated tool that has evolved over time, with features added in response to community input. For instance, it supports a wide range of charsets aside from UTF-8, and it has an option for saving a document using a custom encoding. It can also extract and embed the contents of <noscript> tags, and it can force the saved document’s charset to always be UTF-8.

You can even use it with Chromium/Thorium to capture the state and resources of a dynamically loaded page.
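For the browser-assisted capture, the general shape is "have a headless browser dump the rendered DOM, then pipe that into monolith". Here is a hedged Python sketch of wiring that up. The flags mirror the pipeline shown in the project's README as I understand it (`-` for stdin, `-I` to isolate the document, `-b` for the base URL, `-o` for the output file), but do verify them against `monolith --help` for your installed version. The `chromium` binary name is an assumption and varies by platform, and `build_pipeline`/`capture` are hypothetical helper names.

```python
import subprocess


def build_pipeline(url: str, out_path: str, browser: str = "chromium"):
    """Return two argv lists: the browser DOM dump, then monolith reading stdin."""
    dump_dom = [browser, "--headless", "--incognito", "--dump-dom", url]
    # "-" reads HTML from stdin, -I isolates the document,
    # -b sets the base URL for resolving relative assets, -o names the output.
    embed = ["monolith", "-", "-I", "-b", url, "-o", out_path]
    return dump_dom, embed


def capture(url: str, out_path: str, browser: str = "chromium") -> int:
    """Pipe the rendered DOM into monolith and return monolith's exit code."""
    dump_dom, embed = build_pipeline(url, out_path, browser)
    with subprocess.Popen(dump_dom, stdout=subprocess.PIPE) as producer:
        result = subprocess.run(embed, stdin=producer.stdout)
    return result.returncode
```

With both binaries on your PATH, something like `capture("https://rud.is/b", "rud.is.html")` should leave you with a single file reflecting the page after its scripts have run.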

The repo has extensive installation and usage examples, so I’ll leave you in their hands for that. And, the section header is a partial capture of the HTML generated by Monolith archiving rud.is/b.

Archivematica

Archivematica (GH) is an open-source digital preservation system that’s a bit like a sophisticated time capsule for all sorts of digital content: documents, photos, videos, etc. This tool takes these files and processes them so that they’re preserved in a way that meets international standards.

Along with faithfully storing the original content, Archivematica converts it into formats that are less likely to become obsolete, so that future generations can still access it. This process involves creating Archival Information Packages (AIPs) and Dissemination Information Packages (DIPs) from Submission Information Packages (SIPs). In OAIS terms, the SIP is what you hand over, the AIP is what gets stored for the long haul, and the DIP is what gets delivered to people who want to view the material. These packages are like the DNA of digital preservation, ensuring that all the necessary information for future access and understanding is bundled up neatly.
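If the SIP/AIP/DIP alphabet soup is new to you, a toy model may help. The Python sketch below is purely illustrative and has nothing to do with Archivematica's real code or API: the `Package` class, the `ingest` function, and both format tables are invented for this example, and the "conversion" step just copies bytes where a real system would transcode them.

```python
import hashlib
import os
from dataclasses import dataclass, field

# Hypothetical normalization rules, invented for this sketch only.
PRESERVATION_FORMAT = {".doc": ".pdf", ".bmp": ".tif"}
ACCESS_FORMAT = {".doc": ".pdf", ".bmp": ".jpg"}


@dataclass
class Package:
    kind: str  # "SIP", "AIP", or "DIP"
    files: dict[str, bytes] = field(default_factory=dict)
    checksums: dict[str, str] = field(default_factory=dict)


def ingest(sip: Package) -> tuple[Package, Package]:
    """Turn a submission package into an archival package and an access package."""
    aip, dip = Package("AIP"), Package("DIP")
    for name, data in sip.files.items():
        stem, ext = os.path.splitext(name)
        # The AIP always keeps the original, plus a checksum for later integrity checks.
        aip.files[name] = data
        aip.checksums[name] = hashlib.sha256(data).hexdigest()
        # Placeholder "conversion": a real system would transcode the bytes here.
        if ext in PRESERVATION_FORMAT:
            aip.files[stem + PRESERVATION_FORMAT[ext]] = data
        # The DIP holds only an access-friendly copy.
        dip.files[stem + ACCESS_FORMAT.get(ext, ext)] = data
    return aip, dip


aip, dip = ingest(Package("SIP", {"report.doc": b"example bytes"}))
print(sorted(aip.files), sorted(dip.files))
```

The point to take away is the split: originals and preservation copies live together in the AIP, while the DIP is a lighter derivative meant for viewing.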

The utility of Archivematica can’t be overstated. Technology changes at breakneck speed, and the risk of digital files becoming unreadable is very, very real. Archivematica mitigates this risk by adhering to the Open Archival Information System (OAIS) reference model (published as ISO 14721), which is the preeminent standard for preserving digital information. Following this model is what gives the preserved content its best shot at staying accessible as new technology comes along.

For those who work in libraries, archives, or any institution with a digital collection, Archivematica is a game-changer. I’d argue it’s also a great tool for those of us who are trying to salvage the last vestiges of liberal democracy, as it enables us to precisely and accurately preserve history. Plus, being open-source means that the code is freely available for anyone to study, modify, and improve. This transparency is crucial for institutions that want to show stakeholders exactly how they’re preserving cultural heritage materials.

The target environment is Linux, but it is container-friendly.

FIN

Remember, you can follow and interact with the full text of The Daily Drop’s free posts on Mastodon via @dailydrop.hrbrmstr.dev@dailydrop.hrbrmstr.dev ☮️
