Local “AI” for Visual Content: Why I’m Done Feeding the Machine; Llamafied Alt Txt; Enhanced OCR

Local “AI” for Visual Content: Why I’m Done Feeding the Machine
If you’re so opposed to LLMs/GPT/“AI” that even local-first inference is off-limits, you can close this email and skip this Bonus Drop entirely.
But if you want to level up your online accessibility game — making images, charts, and video clips readable for folks with screen readers — without feeding the “Big AI” machine, or if you need to extract more than just text from images and PDFs (again, without supporting “Big AI”), then this Bonus Drop is absolutely for you.
I spent my recent $WORK shutdown week doing the usual: fixing the house’s central A/C, tackling outdoor projects around the compound (seriously need to get some goats), cycling, hiking, and coding up two projects. These projects, completely unplanned, both ended up using local inference via Ollama and compatible models for visual content processing.
Feel free to skip ahead to those sections if you want to avoid my “why local AI” rant below.
The Big AI Reckoning is Coming
Without being too hyperbolic, “Big AI” — OpenAI, Perplexity, Anthropic, Microsoft, Google, Meta, Cursor, and their ilk — is in serious trouble. You wouldn’t know this from the breathless tech press coverage, Fortune 1000 exec proclamations, or big consulting firms waxing poetic about AI productivity gains. You also wouldn’t know it from the “✨” sprinkled into every major app, pre-generating slop after every keystroke and begging you to hit TAB.
Back in early Q1, I told a colleague that by Q3 we’d see popular “AI” chat and API services charging $200/month for non-garbage results and more than a handful of daily uses. Virtually all of them now have exactly that premium tier. Even at those prices, I doubt any will turn a real profit anytime soon—though immediate profitability isn’t the goal. They need enough folks hooked on their slop services to justify continuous price increases for the next hit.
These companies are now inflating both salaries and egos (though I’m not sure the latter is possible) of “AI experts” by poaching each other’s talent. They’re partnering with hardware, browser, and OS vendors to ensure every keystroke, tweet, email, and audio snippet becomes training data. And speaking of training data — they’re killing news sites with search summaries while forcing the rest of us into erecting algorithmic barriers and endless slop mazes to ward off their scrapers.
The writing is on the wall: a reckoning is coming. Many major players will yank their “AI” services (Microsoft and Google have never discontinued services you rely on, right?), leaving dependent companies in the lurch and giving survivors freedom to jack up prices even more.
Why Local “AI” Makes Sense
Despite all this, I’m not an “AI” detractor. What I am is a “Big AI” detractor. I’m also skeptical of putting probabilistic, non-deterministic systems in charge of critical business functions. We’ll see the year of the Linux desktop before we get ubiquitous “intelligent” autonomous agents.
But I’m absolutely in favor of using local “AI” for well-defined, focused use cases with some tolerance for non-critical uncertainty. “AI” should help people, not replace them, non-consensually manipulate/generate images, or substitute for genuine human connection.
Both projects I’m sharing today meet exactly those criteria.
Llamafied Alt Txt

There is a high probability you follow me on either Mastodon or Bluesky. As such, you are likely aware that I try super hard to use media alt text when posting visual content. Vision models have evolved sufficiently to make them more than just adequate at alt text generation. For the past few months I’ve been using different ones from the Ollama catalog to see which worked the best in terms of both speed and content, and have settled on one I like, but more on that in a moment.
I believe it is paramount to perform inference on images using a “local” model. I scare-quoted the word “local” since it does not necessarily mean something you have sitting in your house/apartment/van. I have the privilege of being able to run a maxxed-out M4 Mac Mini for my local work. Renting GPUs at some provider is still “local”, as long as you’re bringing your own models. When you paste an image into Claude, Perplexity, or ChatGPT and give them a question prompt, you’ve just given them more training data (regardless of the lies in their ToS). Feel free to do that with your own images/content, but do not paste the work of others into them — at least not without permission.
Hollama is great, but I mainly use alt text when posting to Mastodon or Bluesky, and Hollama does not live there. So, I decided to build the Ollama Alt Text Chrome Extension (I’m horribad at names). You need to be running an Ollama server somewhere the extension can reach, with at least one vision model loaded. After the aforementioned testing, I can highly recommend llama3.2-vision:11b. It does a bang-up job with this battle-tested prompt:
Describe this image as alt text for a screen reader. Be concise but comprehensive, focusing on key visual elements, text content, and spatial relationships. Limit to 200 words.
Feel free to customize the prompt, as there is a valid school of thought that favors brevity. I have, however, had direct feedback from folks who rely on alt text that a number of the longer descriptions this prompt generates were helpful.
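The extension handles the plumbing for you, but if you want to wire the same flow into your own tooling, everything goes through Ollama’s /api/generate endpoint: base64-encode the image, pass it in the images array along with the prompt above, and read the response field back. Here’s a minimal Go sketch of that call (my code, not the extension’s; it assumes Ollama is on its default port, 11434):

```go
// altext.go: ask a local Ollama vision model to describe an image.
// Sketch only. Assumes Ollama is listening on localhost:11434 and
// llama3.2-vision:11b has already been pulled.
package main

import (
	"bytes"
	"encoding/base64"
	"encoding/json"
	"fmt"
	"net/http"
	"os"
)

const prompt = `Describe this image as alt text for a screen reader. ` +
	`Be concise but comprehensive, focusing on key visual elements, ` +
	`text content, and spatial relationships. Limit to 200 words.`

func main() {
	// Read the image and base64-encode it; /api/generate takes
	// images as an array of base64 strings.
	img, err := os.ReadFile(os.Args[1])
	if err != nil {
		panic(err)
	}

	body, _ := json.Marshal(map[string]any{
		"model":  "llama3.2-vision:11b",
		"prompt": prompt,
		"images": []string{base64.StdEncoding.EncodeToString(img)},
		"stream": false,
	})

	resp, err := http.Post("http://localhost:11434/api/generate", "application/json", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	// Non-streaming responses put the full generation in "response".
	var out struct {
		Response string `json:"response"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		panic(err)
	}
	fmt.Println(out.Response)
}
```

Running go run altext.go screenshot.png prints the description to stdout, ready to trim and paste into the alt text field.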
This is what it generates for the section header:
A scene from the Marvel Cinematic Universe, specifically from the movie “Captain America: The Winter Soldier.” The scene shows Captain America, played by Chris Evans, standing over a group of fallen agents, with his iconic shield held high.
Key Elements:
- Captain America: The central figure, standing over the fallen agents, with his shield held high.
- Fallen Agents: A group of agents, likely from the agency S.H.I.E.L.D., lying on the ground, suggesting they have been defeated or killed.
- Setting: The scene appears to be set in a high-tech facility, possibly S.H.I.E.L.D. headquarters, with a modern and sleek design.
- Atmosphere: The atmosphere is tense and intense, with a sense of urgency and danger.
Overall Impression:
The image conveys a sense of action and drama, with Captain America standing heroically over the fallen agents. The scene suggests a high-stakes battle or mission, with Captain America fighting to protect his allies and defeat his enemies. The image is likely from a pivotal moment in the movie, where Captain America must use his skills and resources to save the day.
At a minimum, it provides a starting point that you can customize.
The original intent was for this to be solely a context menu item, but I’m not an expert at Chrome extensions, so I highly recommend using the pinned “ALT” tool: copy the image you want alt text for to the clipboard and use “Process Clipboard Image” for the most consistent results.
You have to load this as an unpacked extension, as I’m not sure I’ll ever get it into the Chrome Web Store.
Enhanced OCR

The Llama vision model works great on everything I’ve slung at it, but I was floored at how well nanonets/Nanonets-OCR-s performed when it comes to generating well-structured Markdown from all of the content in a given image. Thus, hrbrmstr/nanxt was born.
It’s a Golang CLI tool for processing PDFs and images with any vision model, but it defaults to this quantized Nanonets one from the Ollama catalog (which you’ll need to install). I’ve been building more CLIs with Deno these days, but Golang has better PDF tooling when it comes to producing a compiled binary (there are issues with signal handling on macOS, as noted in the repo). Given that this CLI has to split PDFs into pages, make an image from each page, and feed it to the model, using Go seemed like the better choice.
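If you’re curious what that pipeline looks like in practice, here’s a rough sketch of just the page-splitting step. I’m shelling out to poppler’s pdftoppm purely for illustration (an assumption on my part; nanxt itself uses Go-native PDF tooling), and each resulting PNG would then go through the same Ollama /api/generate call shown in the alt-text sketch earlier:

```go
// rasterize.go: split a PDF into one PNG per page for OCR.
// Sketch only. Requires poppler's pdftoppm on PATH; nanxt does not
// shell out like this.
package main

import (
	"fmt"
	"os"
	"os/exec"
	"path/filepath"
)

func main() {
	pdf := os.Args[1]

	outDir, err := os.MkdirTemp("", "nanxt-pages")
	if err != nil {
		panic(err)
	}

	// 200 DPI here roughly matches the CLI's --std quality setting.
	cmd := exec.Command("pdftoppm", "-png", "-r", "200", pdf, filepath.Join(outDir, "page"))
	if out, err := cmd.CombinedOutput(); err != nil {
		panic(fmt.Errorf("pdftoppm failed: %v\n%s", err, out))
	}

	// pdftoppm writes page-1.png, page-2.png, ... (zero-padded for
	// longer documents); each one gets fed to the vision model.
	pages, _ := filepath.Glob(filepath.Join(outDir, "page-*.png"))
	for _, p := range pages {
		fmt.Println(p)
	}
}
```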
The CLI UX is a bit rough (I had hoped to get more done by today), but it gives you tons of control over the process:
Usage:
nanxt <input.pdf|input.png|input.jpg|input.jpeg|input.gif> [flags]
Flags:
--basic Basic API settings: 4K context, 2K predict (default)
--complex Complex API settings: 12K context, 8K predict
-h, --help help for nanxt
--high High quality: 300 DPI, max 2048px
--max Maximum quality: 300 DPI, max 2048px for complex documents
-m, --model string Ollama model name (default "benhaotang/Nanonets-OCR-s")
--options string JSON file containing Ollama API options
-o, --output string Output file path (default: stdout)
--std Standard quality: 200 DPI, max 1024px (default)
The image in the section header is from a recent report and is included in the repo. Here’s what the model can do:
$ ./nanxt tests/6.png | jq -r .[0].text
Processing page 1 of 1
Survey Demographics Continued
Race/Ethnicity
<table>
<tr>
<td>White</td>
<td>60%</td>
</tr>
<tr>
<td>Hispanic/Latino</td>
<td>20%</td>
</tr>
<tr>
<td>Black</td>
<td>12%</td>
</tr>
<tr>
<td>Asian</td>
<td>6%</td>
</tr>
</table>
Geography
<img>A map of the United States with states colored in different shades of blue, orange, and pink, representing different demographic categories. The states are labeled with their abbreviations. The legend on the right shows the percentages for each category: 25% for Hispanic/Latino, 20% for Black, and 18% for Asian.</img>
Caregiving Responsibility
Fathers/male caregivers
74%
(n=1,080)
Total = 1,456
caregivers in sample
(60% of sample)
Socioeconomic Status
High income
36%
Middle income
32%
Low income
32%
If you scanned through that result, you know it messed up (a bit) and could not read the super-tiny text in the treemap. Nonetheless, the result is still impressive. It identified the key components of the page and understood the majority of the content despite having no contextual prose to work with. While I could have shown a perfect output example, I wanted to re-emphasize something I said at the beginning of the Drop: this is a well-defined, focused use case with some tolerance for non-critical uncertainty (unless you intend to put it into automation without human curation).
While the quantized version works well, it does have a tendency to do the “probabilistic repeat loop” on occasion. The original model running under CUDA performs much better.
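If you do hit that loop, the --options flag is the escape hatch: it takes a JSON file of Ollama API options, and Ollama’s documented sampler parameters (repeat_penalty, repeat_last_n, temperature) are the usual knobs for taming repetition. Something like the following is a plausible starting point rather than a tested nanxt config; the context/predict values just mirror the --basic defaults:

```json
{
  "temperature": 0,
  "repeat_penalty": 1.3,
  "repeat_last_n": 256,
  "num_ctx": 4096,
  "num_predict": 2048
}
```

Pass it with ./nanxt tests/6.png --options options.json and compare the output against a run without it.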
Both this and the extension are nowhere near finished products, but I mention them today to encourage folks both to experiment with local inference for similar, focused use cases and to come up with a plan for when — not “if” — the bottom falls out of a provider you or your team/workplace rely on.
FIN
Oh, and — soon — your work Zoom/Teams meetings are very likely going to look eerily like the modern version of this:
Isn’t it great living in the future!
☮️