Drop #429 (2024-02-29): Multi-Threaded Edition v2024.02

The Missing Semester of Your CS Education; WARC-GPT; Web-Check

We’ve got a wide-ranging smorgasbord for Drop readers today, from a “back to the basics” resource all the way to an “easy” way to run LLM/GPT/RAGs over your own data.

Estimated read time: ~5 minutes.

TL;DR

(This is an AI-generated summary of today’s Drop)

  • MIT’s “The Missing Semester of Your CS Education” offers a comprehensive set of coursework aimed at teaching practical, everyday tools essential for computer science students and professionals, covering topics from shell navigation to security and cryptography. Positive student feedback highlights the immediate applicability of these skills in internships and jobs. The Missing Semester of Your CS Education
  • WARC-GPT is an open-source tool designed for exploring web archives with AI, enabling users to navigate vast expanses of web archives. It uses Retrieval Augmented Generation techniques to generate accurate and contextually relevant responses from archived content. The tool’s flexibility allows for customization and use of different models and APIs, making it versatile for researchers and historians. WARC-GPT (GitHub)
  • Web-Check is a tool that provides detailed analysis of potential attack vectors, server architecture, security configurations, and technologies used by websites. It offers a comprehensive dashboard with information on IP, SSL chain, DNS records, and more, making it a valuable resource for optimizing and securing websites. The tool’s on-demand intelligence is beneficial for understanding or improving a website’s structure and security posture. Web-Check (GitHub)

The Missing Semester of Your CS Education

Photo by Ann H on Pexels.com

MIT’s “The Missing Semester of Your CS Education” is a set of coursework designed to fill in the blanks of practical, everyday tools that are essential not only for folks rising through the computer science ranks, but for anyone (scientists, analysts, researchers) who has found themselves at a system prompt with the mission to “get X done”. Sure, you may be a great scientist, analyst, researcher, etc., but that blinking cursor in a terminal window can be pretty daunting if this is your first time there (or if you have been away from it for a minute).

The course structure is straightforward and intensive, spread over a few weeks, covering a range of topics that are crucial yet often overlooked: It kicks off with a course overview and dives into the shell, teaching folks how to navigate and control systems without a graphical interface. From there, it moves on to shell tools and scripting, editors like Vim, data wrangling (in the sense of using Unix pipes [|] to route and transform data), command-line environments, version control with Git, debugging and profiling, metaprogramming, security and cryptography, and a potpourri of other useful skills.

Student feedback has been overwhelmingly positive. Many have expressed gratitude for the practical skills they’ve gained, which have been immediately applicable in internships and jobs. Some even lament the fact that they didn’t have access to this knowledge earlier in their education, as it would have made their studies and initial work experiences much smoother.

Folks already at some level of mastery may find some more advanced nuggets in the metaprogramming and security and cryptography sections.

WARC-GPT

NOTE: Some parts of this resource could require paying the OpenAI/Perplexity/etc. tax depending on the types of local resources you have.

This section was supposed to come out two Thursdays ago, right after the one about using wget to make WARC files from a list of URLs.

While using more advanced LLM/GPT/RAG services (like Perplexity) turns out to be pretty useful at times, one is still generally stuck with uploading a finite set of data for the interactive versions to operate on, even over a long-ish query/response session (I refuse to call them “conversations”).

Sure, there’s an increasing number of tutorials on how to build your own custom GPT setup that works off of your data (and we’ve covered some tooling in this area), but I’ve been hoping for something more geared toward an idiom where I could point it at a bunch of things I’ve captured from the internets, and then ask questions of it. And, I think I’ve found it.

WARC-GPT (GH) is an open-source tool designed to navigate the potentially vast expanses of web archives one might collect, using this fancy new tech provided by our modern AI overlords.

Essentially the process involves reading each WARC file, filtering out irrelevant records, and then using the remaining content to populate a knowledge base. This base is then utilized by WARC-GPT’s REST API and web UI to answer questions about the ingested archives. The tool employs common Retrieval Augmented Generation (the “RAG” from above) techniques, pulling relevant text excerpts based on the question asked, and coalescing various elements into a retrieval prompt to generate accurate and contextually relevant responses.
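To make that flow a bit more concrete, here is a minimal sketch of the filter, index, retrieve, and prompt steps. It is not WARC-GPT's code: the record dictionaries are hand-made stand-ins for parsed WARC responses, every function name is hypothetical, and a toy bag-of-words cosine similarity is swapped in where a real RAG setup would use an embedding model and a vector store.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a string and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(records):
    """Keep only text-like records and store a term-count vector for each."""
    index = []
    for record in records:
        if not record["content_type"].startswith("text/"):
            continue  # filter out records that are not useful for Q&A
        index.append((record, Counter(tokenize(record["text"]))))
    return index


def retrieve(index, question, k=2):
    """Return the k records whose vectors are most similar to the question."""
    query = Counter(tokenize(question))
    ranked = sorted(index, key=lambda item: cosine(query, item[1]), reverse=True)
    return [record for record, _ in ranked[:k]]


def build_prompt(question, excerpts):
    """Combine retrieved excerpts and the question into one retrieval prompt."""
    context = "\n".join(f"[{r['url']}] {r['text']}" for r in excerpts)
    return f"Answer using only these excerpts:\n{context}\n\nQuestion: {question}"


# Hypothetical, hand-made records standing in for parsed WARC responses.
records = [
    {"url": "https://example.com/a", "content_type": "text/html",
     "text": "The lunar lander touched down on the moon in 1969."},
    {"url": "https://example.com/b", "content_type": "image/png", "text": ""},
    {"url": "https://example.com/c", "content_type": "text/html",
     "text": "Sourdough bread needs a long, slow fermentation."},
]

index = build_index(records)
question = "When did the lander reach the moon?"
top = retrieve(index, question, k=1)
prompt = build_prompt(question, top)
print(prompt)
```

The resulting prompt string is what would be handed to whichever model is configured; the sketch stops short of calling one.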

The design emphasizes customizability and transportability, letting us interchange settings, models, and prompts for experimentation. This flexibility means WARC-GPT can run locally, using open-source models served through Ollama by default, or interact with closed-source LLM APIs such as those from OpenAI, Perplexity, or Anthropic, provided API keys are configured. This adaptability makes WARC-GPT a pretty versatile tool for researchers, historians, and anyone interested in extracting valuable insights from web archives.

I used their demo case study and the wget technique I showed in the other Drop to build a WARC file WARC-GPT could operate on, to see if I would be able to replicate their findings.
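For folks who missed that earlier Drop, the general shape of a wget-to-WARC invocation can be sketched as below. This is an assumption about a reasonable command, not necessarily the exact one from that post: the file names `urls.txt` and `my-archive` and the helper function are hypothetical, while `--input-file`, `--warc-file`, and `--page-requisites` are standard wget options.

```python
import shlex


def wget_warc_command(url_list_path, warc_name):
    """Build a wget argv that fetches every URL in a file into one WARC."""
    return [
        "wget",
        f"--input-file={url_list_path}",  # text file with one URL per line
        f"--warc-file={warc_name}",       # wget appends .warc.gz to this name
        "--page-requisites",              # also fetch images, CSS, and scripts
    ]


# Hypothetical file names, used purely for illustration.
cmd = wget_warc_command("urls.txt", "my-archive")
print(shlex.join(cmd))
# To actually run the capture: subprocess.run(cmd, check=True)
```

Building the argument list separately from running it makes it easy to eyeball the command before pointing it at a long list of URLs.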

It installed without a hitch and, as you can see from the above screen captures, it works!

Along with the Q&A ability, they have baked in some other goodies like a way to view the embeddings (see the section header).

I’m going to be testing it out on a bunch of cybersecurity-related resources over the coming weeks and will report back on how effective it has been.

Web-Check

This one is very much making the rounds, but just in case it slipped through your feeds, Web-Check (GH) is so cool that you should stop reading and just go visit it to see what it’s all about.

If you stuck around (thank you), this bonkers cool tool is designed to uncover potential attack vectors, analyze server architecture, view security configurations, and identify the technologies a site is using. The dashboard presents a wealth of information including IP info, SSL chain, DNS records, cookies, headers, domain info, and much more. This level of detail is invaluable for optimizing and securing websites, as well as for educational purposes. The tool’s ability to provide on-demand intelligence makes it a powerful asset for anyone looking to understand or improve a website’s structure and security posture. Plus, you can download the data after it does its work.
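To give a flavor of just one of those checks, here is a tiny sketch that flags commonly recommended security headers missing from a response. It is an illustration only, not how Web-Check does it: the header list is my own short selection, the function name is hypothetical, and the sample headers are made up rather than captured from a real site.

```python
# Response headers commonly recommended for hardening a site
# (an illustrative selection, not Web-Check's actual list).
RECOMMENDED_HEADERS = (
    "Strict-Transport-Security",
    "Content-Security-Policy",
    "X-Content-Type-Options",
    "X-Frame-Options",
    "Referrer-Policy",
)


def missing_security_headers(headers):
    """Return recommended headers absent from a response (case-insensitive)."""
    present = {name.lower() for name in headers}
    return [h for h in RECOMMENDED_HEADERS if h.lower() not in present]


# Hypothetical headers, as might be captured from a single HTTP response.
sample = {
    "Content-Type": "text/html; charset=utf-8",
    "strict-transport-security": "max-age=63072000",
    "X-Frame-Options": "DENY",
}

missing = missing_security_headers(sample)
print(missing)
```

Header names are compared case-insensitively because HTTP does not treat their casing as significant.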

The (large) section header shows how bad Elon Musk is at running a website.

FIN

Remember, you can follow and interact with the full text of The Daily Drop’s free posts on Mastodon via @dailydrop.hrbrmstr.dev@dailydrop.hrbrmstr.dev ☮️