Drop #702 (2025-08-29): Archival Edition

Kiwix; Antifa Torrent Server; Talk About The Weather

As political pressures mount against Wikipedia, the importance of maintaining independent, archival copies of this crucial information resource (and others like it) has never been clearer. While Wikipedia has revolutionized how we access knowledge, its centralized nature and susceptibility to external pressure make it vulnerable. Creating and regularly updating archival snapshots serves as both a safeguard against potential censorship and a historical record of how information evolves over time.

So, today, we’ll look at one easy way to ensure you’ve got copies of, or at least continued access to, a number of important data stores.


TL;DR

(This is an LLM/GPT-generated summary of today’s Drop using SmolLM3-3B-8bit via MLX and a custom prompt.)

  • Kiwix is a free offline reader for Wikipedia and other content, using ZIM files with LZMA2 compression and a Systemd-based server for hosting, with a 2025-08 ZIM file added to a local library. (https://kiwix.org/en/applications)
  • The Antifa Torrent Server hosts a 112 TB archive of U.S. government data, including datasets from NOAA, NIH, and other agencies, preserved through torrents and community seeders. (https://tech.lgbt/@Lydie)
  • OpenWeatherMap Community & Weather Data Preservation focuses on preserving NOAA/NWS data via Climate Reanalyzer, CWOP, and open-source tools like WeeWX, which supports local weather station data preservation. (https://climatereanalyzer.org)

Kiwix

Kiwix is a non-profit organization and a free and open-source software project that is, essentially, an offline reader for online content like Wikipedia, Project Gutenberg, TED Talks, and more. It makes knowledge available to folks with no or limited internet access. The name “Kiwix” is a play on the word “Wiki,” reflecting the project’s initial goal of making Wikipedia accessible offline.

It packages up entire websites into .zim files. The file format employs a clever architecture built around cluster-based storage with LZMA2 compression, where content is organized into variable-sized clusters (typically ~1 MB) that contain multiple compressed articles or media objects. The format uses a two-level indexing system: a directory entry table that maps article URLs to cluster positions via title hashes, and a cluster pointer table that provides direct byte offsets into the file for O(1) random access.

Each ZIM file begins with an 80-byte header containing magic numbers, article counts, cluster counts, and pointer offsets, followed by the MIME type list, URL pointer list, title pointer list, and cluster pointer list before the actual compressed data clusters. The namespace system uses single-character prefixes (A/ for articles, I/ for images, M/ for metadata) to categorize content types, while the redirect mechanism allows efficient handling of article aliases without data duplication. Full-text search is implemented through a Xapian database embedded within the ZIM file, creating inverted indices that map terms to article positions, enabling complex boolean queries and relevance ranking even in offline environments.
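
If you want to poke at that structure yourself, one low-tech sanity check is dumping the fixed-size header with xxd (the filename below is a placeholder for any ZIM you have on hand). Per the openZIM spec, the first four bytes are the magic number 0x044D495A, which shows up on disk as the ASCII characters “ZIM” followed by 0x04:

# Dump the 80-byte ZIM header: magic number, version fields, UUID, then the
# entry/cluster counts and pointer-list offsets described above
$ xxd -l 80 some_content.zim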

Kiwix is available as a native application for Android, Linux, macOS, iOS and Windows (it’s also available as Chrome, Firefox and Edge extensions). Content files can be downloaded from the apps or from their library.

We’re going to focus on getting the Kiwix Server installed and neatly tucked into a Systemd configuration. The server is provided as a self-contained binary in the Kiwix-Tools (GH) suite. Hit up this part of the Kiwix main site to download the binaries. Under the hood, Kiwix Server is “just” a .zim-compatible web server. With it, you can host any ZIM content locally or globally.

My setup for Kiwix is based on their x86_64 binaries, which are:

  • kiwix-manage: this creates, modifies, and manages ZIM file libraries/catalogs (adding/removing ZIM files, setting metadata)
  • kiwix-search: this performs full-text search queries against ZIM files from the command line
  • kiwix-serve: this runs a local HTTP server that serves ZIM file content through a web interface, making offline content accessible via browser
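
Grabbing and staging them looks something like the below (the tarball name, and the versioned directory it unpacks into, change over time, so check the download page for the current release before copy/pasta-ing):

$ cd /tmp
$ wget "https://download.kiwix.org/release/kiwix-tools/kiwix-tools_linux-x86_64.tar.gz"
$ tar xzf kiwix-tools_linux-x86_64.tar.gz
$ sudo cp kiwix-tools_linux-x86_64-*/kiwix-* /usr/local/bin/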

I put them all in /usr/local/bin and did a:

mkdir -p /data/kiwix/{content,library}

to hold Kiwix content.

This is the Systemd configuration (which I have in ~/.config/systemd/user/kiwix.service):

[Unit]
Description=Kiwix Server
After=network.target

[Service]
Type=exec
ExecStart=/usr/local/bin/kiwix-serve --port=8080 --library /data/kiwix/library/library.xml
Restart=always
RestartSec=10

# Security settings
NoNewPrivileges=true
PrivateTmp=true
ProtectSystem=strict
ProtectHome=false
ReadWritePaths=/data/kiwix

# Environment
Environment=KIWIX_DATA_DIR=/data/kiwix

[Install]
WantedBy=default.target

Use whatever port you want (8080 is very likely already consumed on your system if you tinker a lot).
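
One gotcha with user-level units on a headless box: they only run while that account has an active session, unless you tell logind to keep the user’s service manager around:

# Let this user's services start at boot and keep running after logout
$ loginctl enable-linger "$USER"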

Now let’s tell Systemd about it:

# Reload systemd user configuration
$ systemctl --user daemon-reload

# Enable the service to start automatically
$ systemctl --user enable kiwix.service

# Start the service
$ systemctl --user start kiwix.service

# Check status
$ systemctl --user status kiwix.service
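
If you want to watch what the server is grumbling about, tail its journal:

# Follow the service logs (Ctrl-C to stop)
$ journalctl --user -u kiwix.service -f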

We don’t have any .zim content yet, so the logs are gonna (temporarily) whine. Let’s take care of that by grabbing and integrating the top 100 most popular English Wikipedia articles from August 2025. First, the grab:

$ cd /data/kiwix/content
$ wget "https://download.kiwix.org/zim/wikipedia/wikipedia_en_100_2025-08.zim"

You can use any HTTP download method you like (wget can resume from the last saved byte position with -c/--continue if interrupted). They also support torrenting the downloads (which is 100% a better way to go if you can run a torrent client).
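
Most Kiwix mirrors also publish a .sha256 file next to each ZIM; assuming yours does, it’s worth a quick integrity check before feeding a multi-hundred-megabyte download to the library:

$ wget "https://download.kiwix.org/zim/wikipedia/wikipedia_en_100_2025-08.zim.sha256"
$ sha256sum -c wikipedia_en_100_2025-08.zim.sha256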

Now, we integrate it:

# Create empty library
$ touch /data/kiwix/library/library.xml

# Add ZIM files to library
$ kiwix-manage /data/kiwix/library/library.xml add /data/kiwix/content/wikipedia_en_100_2025-08.zim

# Restart Kiwix
$ systemctl --user restart kiwix.service

# Verify
$ kiwix-manage /data/kiwix/library/library.xml show
id:             67008d9b-938d-dc31-db75-fa3f5d4f0fde
path:           /data/kiwix/content/wikipedia_en_100_2025-08.zim
url:
title:          Wikipedia 100
name:           wikipedia_en_100
tags:           wikipedia;_category:wikipedia;_details:yes;_ftindex:yes;_pictures:yes;_videos:yes
description:    Top hundred Wikipedia articles
creator:        Wikipedia
date:           2025-08-10
articleCount:   4951
mediaCount:     3845
size:           325864448 KB

The section header is what you’d see if you navigated to the Kiwix server page for that .zim file.
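
Two other quick checks are handy here: poke the HTTP endpoint to make sure kiwix-serve is answering, and run a full-text query straight against the ZIM with kiwix-search (the exact kiwix-search invocation can vary a bit between kiwix-tools releases):

# Any 200/30x response means the server is up and serving the library
$ curl -sI http://localhost:8080/ | head -n 1

# Full-text search against the ZIM itself from the CLI
$ kiwix-search /data/kiwix/content/wikipedia_en_100_2025-08.zim "solar eclipse"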

Kiwix has TONS OF CONTENT spanning all sorts of topics, so definitely linger over their library.

If all of the above seems like “work” (or you don’t have a home server, and want to use the web-based reader vs. load apps on your endpoints), the Kiwix folks have an “easy button” in their Hotspots. These are pre-configured Raspberry Pi 5 devices with a 256 GB NVMe M.2 SSD (expandable up to 1 TB) and pre-loaded content (check out the Hotspots page for the info on the content). The current price range for these Hotspots for folks in the U.S. is ~$300–400 USD, and they would be ideal donations to local libraries/schools, especially those in rural areas with garbage (or no) internet.


Antifa Torrent Server

Lydie operates a crowdfunded 112 TB torrent seeding server of data preserved from various U.S. government websites (especially those that have already been purged or are about/likely to be purged). It’s quite the list:

  • ExtremeWeather: A large-scale climate dataset for semi-supervised detection, localization, and understanding of extreme weather events (ExtremeWeather A large-scale climate dataset.torrent) 1.5TB
  • English Wikipedia Text Snapshot December 2024 (enwiki-Dec2024.torrent) 23GB
  • NIH Project Data, 2019-2024 (NIH Project Data 2019-2024 – data sets.torrent) 20GB
  • NIH Project Data, 2019-2024 (NIH Project Data 2019-2024 – raw data.torrent) 300MB
  • 01-28-2025 CDC Data sets (20250128-cdc-datasets.torrent) 20GB
  • 10-28-2024 End of Term 2024 Pre Election Crawl (End of Term 2024 Pre Election Crawls 2024-10-28.torrent) 7GB
  • Thru 2024 Federal Election Commission Campaign Finance Data (Federal Election Commission Campaign Finance Data.torrent) 87GB
  • MMWR Morbidity and Mortality Weekly Reports (MMWR Morbidity and Mortality Weekly Reports) 1.3GB
  • catalog.archives.gov: LGBT topics (catalog-archives-gov-lgbt.torrent) 1.3TB
  • Department of Education complete datasets (data.ed.gov.torrent) 107GB
  • Health Resources Services Administration (data_hrsa_gov.torrent) 2GB
  • FBI Crime Data Explorer (fbi-cde.torrent) 51GB
  • NOAA Weather Data (ftp-ncei-noaa-gov.torrent) 120GB
  • January 6th Committee Report (jan6.torrent) 100GB
  • OSHA Publications (osha.torrent) 2.2GB
  • Snapshot of all videos from National Archive Youtube channel (National Archives YouTube Channel 03-16-2025.torrent) 555GB
  • DoE National Solar Radiation Database 2018 (nsrdb_tgy-2018.torrent) 250GB
  • DoE National Solar Radiation Database 2017 (nsrdb_tgy-2017.torrent) 250GB
  • DoE National Solar Radiation Database 2016 (nsrdb_tdy-2016.torrent) 207GB
  • Bureau of Economic Analysis (bea.gov.torrent) 4.6GB
  • Jet Propulsion Laboratory website (www-jpl-nasa.torrent) 4.4GB
  • Office for Civil Rights Data on Equal Access to Education (ocrdata_ed_gov.torrent) 2.9GB
  • CORPSCONNECTION YouTube channel (usace-contentdm-oclc-org12.torrent) 580GB
  • The complete contents of ftp.nlm.nih.gov (ftp.nlm.nih.gov.torrent) 26GB
  • Papers with the keyword “transgender” from academia.edu. (download_transgender.torrent) 2.5GB
  • NLM Publications and Productions (NLMPublicationsandProductions.torrent) 109GB
  • Department of Energy Youtube (UsDeptOfEnergyYoutube.torrent) 246GB
  • USPTO PatentsView Github (PatentsViewGithub.torrent) 5GB
  • Department of Education Web Archive Collection Zipped (Department of Education.wacz.torrent) 271GB
  • Podcasts that are part of the USIP Podcast Network (usip-podcast-network.torrent) 62GB
  • United States Institute of Peace YouTube Rip (usip-youtube.torrent) 273GB
  • Institute of Museum and Library Services website (www.imls.gov.7z.torrent) 2GB
  • Mirror of sites.ed.gov from 2025-05-20 (sites.ed.gov.7z.torrent) 2.4GB

It’s all operated locally and powered via solar arrays. Much of the data comes from SciOp, whose torrents the operator also seeds. SciOp is a much larger and more established data preservation project that is part of “Safeguarding Research & Culture” (SRC) and serves as a distributed archive for scientific, cultural, and public information. (There was a great talk at WHY2025 on that initiative.)

The platform relies on volunteers (“seeders”) who donate storage space and bandwidth to maintain a distributed network of archived data. It focuses on preserving datasets that are at risk of disappearing from the internet. Newcomers can help by installing a BitTorrent client, finding uploads that need more seeders, and keeping their client running to help distribute the data. Folks can subscribe to RSS feeds organized by topics and priorities to automatically download new torrents or get notifications about new uploads.
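
A minimal “become a seeder” sketch, assuming Transmission as the client (any BitTorrent client with a daemon mode will do, and the torrent filename below is a placeholder):

# Debian/Ubuntu package names; use your distro's equivalents
$ sudo apt install transmission-daemon transmission-cli

# Add a torrent you want to help seed, storing the payload on a roomy volume
# (Debian's default daemon config enables RPC auth; tweak settings.json or
# pass --auth user:password as needed)
$ transmission-remote --download-dir /data/seeds --add "some-at-risk-dataset.torrent"

# Check status; leave the daemon running to keep seeding
$ transmission-remote --list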

Anyone can create an account and contribute by scraping data, creating dataset descriptions, making torrents, and uploading them (new accounts are moderated). The platform lets folks flag datasets at risk of being taken offline and create “container datasets” to coordinate preservation efforts. They welcome developers to help with bugs, features, and code reviews through their open-source development process.

Both of these efforts are noble and vital to the preservation of knowledge and objective truth. If you can help out in any way, it will make a difference. For folks who want to join the fight but aren’t able to perform literal “boots on the ground” work (there are scads of very good reasons this may be the case), this is a great way to pitch in and make a real difference.


Talk About The Weather

With the U.S. NOAA/NWS facing potential privatization threats, the open weather data community has been mobilizing around several preservation and alternative efforts. OpenWeatherMap started as a commercial service but has evolved into something more significant for data preservation.

One of the most interesting preservation efforts is Climate Reanalyzer, combined with community-driven weather station networks like CWOP (the Citizen Weather Observer Program), which together provide alternative weather data collection that doesn’t rely on “government” infrastructure.

For self-hosting weather data, the WeeWX project is all kinds of spiffy. It’s open-source weather station software that supports dozens of hardware types, archives weather data locally, and generates rich web-based reports. With its flexible configuration system, it can also publish or back up data to external services and storage locations, ensuring that weather records are both preserved and accessible.
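
Getting a station logging locally is a pretty small lift. A minimal sketch, assuming the pip-based WeeWX 5.x install (hardware driver, units, and location get picked interactively):

# Install WeeWX and create a station configuration
$ python3 -m pip install weewx
$ weectl station create

# Run the daemon in the foreground to confirm data is flowing; wrap it in a
# systemd unit (like the Kiwix one above) for long-term operation
$ weewxd

WeeWX’s built-in RESTful services can also push observations out to networks like the CWOP one mentioned above.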

For broader weather data preservation, ECMWF’s ERA5 reanalysis data is freely available and represents one of the most comprehensive weather datasets ever created. It’s 80+ years of global weather data at high resolution, and the Copernicus Climate Data Store provides APIs for bulk downloading.

There are a surprising number of other sources for contributing and accessing weather data. It might be a good idea to familiarize yourself with the ones I’ve mentioned and then Kagi around for others to see which one(s) you can rely on if/when the NOAA/NWS data does fall silent.


FIN

Remember, you can follow and interact with the full text of The Daily Drop’s free posts on:

  • 🐘 Mastodon via @dailydrop.hrbrmstr.dev@dailydrop.hrbrmstr.dev
  • 🦋 Bluesky via https://bsky.app/profile/dailydrop.hrbrmstr.dev.web.brid.gy

☮️
