Bonus Drop #49 (2024-05-19): 🦙s All The Way Down

llamafile; llamafile Bash One-Liners; whisperfile; Firecrawl

Heads up! The first three sections are “AI”-centric.

I grok that a sizable percentage of savvy readers aren’t “into” “AI”. If that describes you (no judge!), then you should just check out the fourth section on “Firecrawl”. While that tool’s presumptive use is to get web content into an “AI”-friendly format (markdown), it’s a nice, general purpose free utility (or paid API).


llamafile

Photo by josiah farrow on Pexels.com

If you already know about llamafile, then head on down to sections two-to-four.

One of the main challenges in working with LLM/GPTs has been the complexity of distributing and deploying these models. They usually require specialized hardware, complex software dependencies, and substantial computational resources, making them inaccessible to a fair percentage of curious and wonderful humans.

Simon’s llm solves a fair bit of this complexity problem, but it’s Python, and does introduce some complexity of its own (no dig! It’s a great tool!). ollama also solves some of this, but models are still separate from the program itself, and different platforms require different binaries.

llamafile overcomes these limitations and makes local LLM/GPT machinations more accessible and human-friendly. The project combines the llama.cpp inference engine with Cosmopolitan Libc (covered in a previous Drop) into a single-file executable, significantly simplifying distribution and deployment.

Like ollama (and certain modes of llm), llamafile emphasizes local inference, meaning all computations and data processing occur on your device. This gets us crunchy privacy and tasty data security by eliminating the need to pay the OpenAI tax. It also allows LLM/GPT applications to function offline or in environments with limited internet connectivity.

It combines llama.cpp, Cosmopolitan Libc, and the model weights of your choice into a single, cross-platform executable. It’s flexible enough to know how to use Nvidia/AMD/Apple Silicon GPUs, or just CPUs if that’s all that’s available. Justine has a great post on the CPU speedups.

It works great! However, it can also present some challenges. The executable files are usually fairly substantial, reaching several gigabytes depending on the LLM used. This could pose distribution and storage challenges, particularly on devices with limited resources or in environments with restricted bandwidth. Special execution considerations are also required on the “why won’t it just die, already” Windows operating systems due to max executable size constraints.
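If memory serves, the workaround the project suggests for that Windows size cap is to keep the weights outside the executable and point the runner at a separate GGUF file via the llama.cpp-style -m flag. A rough sketch, with placeholder URLs and filenames:

$ curl -LO https://example.com/llamafile.exe            # small runner binary (placeholder URL)
$ curl -LO https://example.com/mistral-7b.Q4_K_M.gguf   # weights kept outside the .exe (placeholder)
$ ./llamafile.exe -m mistral-7b.Q4_K_M.gguf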

The Quickstart has all you need to kick the tyres (the gist of it is below), so I’ll leave you to follow along there. However, take a peek at the next section to see just why this new LLM/GPT runner is so cool/useful.
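For the impatient, the Quickstart boils down to three commands: grab a .llamafile, mark it executable, and run it. The download URL below is from memory (and the model itself is just an example), so defer to the repo’s README for the current links:

$ curl -LO https://huggingface.co/Mozilla/llava-v1.5-7b-llamafile/resolve/main/llava-v1.5-7b-q4.llamafile   # example model
$ chmod +x llava-v1.5-7b-q4.llamafile
$ ./llava-v1.5-7b-q4.llamafile   # spins up a local web chat UI in your browser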

llamafile Bash One-liners

Photo by Fabio Eckert on Pexels.com

Justine writes great blog posts, so I won’t spend much of your time here.

“Bash One-Liners for LLMs” provides code snippets and command examples that illustrate using llamafile and different LLM models for various — and practical — applications on the command line.

I rejiggered the URL summarization example on Mastodon the other day using pdftotext to show that Mistral could summarize a paper from arXiv. I’m fortunate enough to run some decent Apple Silicon gear, but even hobbyist hardware should be able to get stuff done without too much of a wait.
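If you want to riff on that yourself, the general shape of the pipeline looks something like this (the model filename and arXiv URL are placeholders, and the flags follow llama.cpp’s CLI, so sanity-check them against your llamafile version):

$ curl -sL -o paper.pdf 'https://arxiv.org/pdf/2405.00000'   # placeholder paper
$ ./mistral-7b-instruct.llamafile --temp 0.3 -c 8192 -n 600 \
    -p "[INST]Summarize this paper in five bullet points: $(pdftotext -layout paper.pdf -)[/INST]"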

Combining these one-liners with a caching system similar to Simon’s llm is a great way to get started with your own RAG experiments.
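A dirt-simple shell take on that caching idea (everything here is illustrative; the paths, filenames, and cache layout are made up) is to key the model output on a hash of the prompt and only call the model on a miss:

$ prompt="[INST]Summarize: $(pdftotext -layout paper.pdf -)[/INST]"
$ key=$(printf '%s' "$prompt" | shasum -a 256 | cut -d' ' -f1)
$ mkdir -p ~/.cache/llamafile
$ [ -f ~/.cache/llamafile/$key ] || ./mistral-7b-instruct.llamafile -p "$prompt" > ~/.cache/llamafile/$key
$ cat ~/.cache/llamafile/$key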

whisperfile

I really like whisper.cpp, and use it weekly on podcasts that do not have human transcripts (NOTE: it does seem like all podcasts are trending towards using Whisper, so I may not need to do this anymore).

CJ Pais shows how to use llamafile with Whisper models in his whisperfile project. Just grab one of the pre-built binaries and feed the web app an audio file (if it’s not in the right format, whisperfile will auto-convert it for you, provided ffmpeg is available).
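There’s also a plain command-line path if you’d rather skip the browser. A hedged sketch (the binary name and audio URL are placeholders, and the flags are whisper.cpp’s, so check the whisperfile README for the exact incantation):

$ curl -sL -o episode.mp3 'https://example.com/podcast/episode-123.mp3'   # placeholder audio file
$ ./whisper-medium.en.llamafile -f episode.mp3 -otxt -of episode          # writes episode.txt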

It works great, and I’m hoping we see more cool, focused examples like this.

I wish the tech billionaires/VCs had launched their ginned-up “AI revolution” with retrieval augmented generation — and focused, utilitarian use-cases like Whisper — out of the gate. There is so much potential in these discrete tasks, vs. expecting to get useful, repeatable output from a static, massive training corpus with a hard-stop temporal content limit. I also wish they hadn’t made a deliberate play for scammers, content spammers, and sloppers.

Firecrawl

Photo by Erik Mclean on Pexels.com

Firecrawl (GH) is both a paid API and something you can run locally with Docker that crawls web content and turns it into Markdown. The use case presented by the developers is feeding this data to LLM/GPTs, but it’s a great tool to just get the content from a site along with OpenGraph metadata.

After perusing the API — which works locally or via their paid service — I gave it a go:

# kick off a crawl job; the API responds with a job ID
$ curl -s -X POST http://localhost:3002/v0/crawl \
    -H 'Content-Type: application/json' \
    -d '{
      "url": "https://thenightly.com.au/world/chinese-government-officials-son-gloats-about-multi-year-hack-on-australian-intelligence-agencies-c-14617200"
    }' | jq -r '.jobId'

# poll the job with that ID and render the scraped Markdown
$ curl -s -X GET http://localhost:3002/v0/crawl/status/5dea4149-0e71-47fb-bb0f-e824a600be1c \
  -H 'Content-Type: application/json' | \
  jq -r '.data[] | .content' | bat -l md

It works great when it works. The docker compose version (I didn’t test out the paid version) seems to have a hard time with WordPress sites (so it failed to get the content from the Drop or my main blog).

The instructions to run it locally execute flawlessly, and the four containers it launches eat up only around 800 MB of RAM.
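For the curious, that local spin-up is basically clone-and-compose. This is from memory (the env file path and settings may have changed, so defer to the repo’s self-hosting guide):

$ git clone https://github.com/mendableai/firecrawl && cd firecrawl
$ cp apps/api/.env.example .env      # assumption: adjust the auth/queue settings per the self-hosting docs
$ docker compose up -d               # API answers on http://localhost:3002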


FIN

Remember, you can follow and interact with the full text of The Daily Drop’s free posts on Mastodon via @dailydrop.hrbrmstr.dev@dailydrop.hrbrmstr.dev ☮️

One response to “Bonus Drop #49 (2024-05-19): 🦙s All The Way Down”

  1. contactrmussernet

    Hey, I’m actually working on something that matches what you were mentioning:

    https://github.com/rmusser01/tldw

    The idea is that, using several tools together, we can create processing pipelines to extract and store content for search and eventually RAG, though not immediately. I had wondered how to approach making a local inference option, but hadn’t thought of llamafile, despite being aware of it. I think that’s what I’ll try for testing, thanks!

    Currently it is very much a WIP/alpha, but the goal is to be able to ingest video (transcribed using Whisper), website articles (working on figuring this out; I had literally just last night discovered firecrawl, and am still looking for an effective multi-platform solution that can be bundled and not require additional installs), PDF and Word doc formats (OCR), and ebooks (Calibre).

    The idea is to save all your data in a structured text format, and then be able to feed it to an LLM with custom prompts (built-in prompt library + viewer, another SQLite db) and have the summaries + prompts used for said summary all tracked in the DB. And since the DB is SQLite, it’s easy to copy, distribute, and share, so people can create curated, personalized data sets that they can refer to and search against at a moment’s notice, or ask about parts of it (ideally everything, but RAG is something I’m still learning) and have a discussion with the LLM about it, ask it to explain certain things, etc. And, of course, it’s open source (Apache 2.0).
