Bonus Drop #84 (2025-05-31): Shop^w Infer Local With Context

Wiring Up An MCP To Ollama For Local-First Contextual Inference

I am so glad May is nearly over. Travel really messed with the Drop cadence, and I’m hoping June is way more sublime.

No Drop on Friday because I spent time on something worth talking about today: wiring up a stdio MCP (Model Context Protocol) server to a local Ollama workflow. Yes, that means all the LLM/GPT/“AI” malcontents can leave the room now. (FWIW: “AI” is not going away, and you should at least “know your enemy”, plus “The average AI criticism has gotten lazy, and that’s dangerous”)

Model Matters

When working locally, model selection becomes critical. My go-to local-first Ollama models are llama 3.2 and qwen 3. The former is a great, fast general workhorse, while the latter — though slower — is seriously impressive and can handle more complex tasks with ease. Both are capable of calling tools, but each has trade-offs that matter for this project.

Initially, llama 3.2 wasn’t super-great at tool calling, and qwen 3 was just too slow — even on my maxed-out M4 Mini — to be practical. I tried a bunch of alternatives with the project we’ll be covering below. At first, llama3-groq-tool-use looked like a solid choice, but its built-in 8K context window limitation (more on context windows in a later section) only makes it useful for short conversations with tools that don’t return much data.

After poking around more than a few forums and testing extensively, I settled on qwen2.5-coder. It strikes the right balance: decent context window, super fast performance, and strong community recommendations for tool use scenarios.

To streamline this trial-and-error process, I created a tool-use:latest model using a small Modelfile:

FROM qwen2.5-coder:7b
PARAMETER temperature 0.0
PARAMETER num_ctx 32768
PARAMETER top_p 0.95

This way, I didn’t have to keep switching out model names during testing.
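Building it is a one-liner (ollama create tool-use:latest -f Modelfile), and from then on tool-use:latest works with ollama run and the API like any other local tag.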

Building the MCP-Ollama Bridge (i.e., “Duct-taping An MCP To Ollama”)

One pet “AI” peeve I have is that we all keep calling “Claude”, “ChatGPT”, “Grok”, and “Perplexity” “models” instead of what they really are: inference platforms. Each has an entire (and likely massive) application framework wrapped around the core model to enable capabilities like internet search, MCP server tool calls, Google Drive access, and more. Each also has robust core system prompts, along with additional subsystem prompts and guidance messages when they grab extra context.

The MCP paradigm was created to establish a common interface for tool calling. Before this, we had to bundle tool code inside custom applications that hardcoded what was available to the LLM/GPT they were designed to work with. That approach is brittle and — if we’re being honest — somewhat limited, since I’m willing to bet a case/when or if/then decision loop with some light parsing and regular expressions could handle most of these jobs (and faster too).

MCPs help separate concerns, which means capabilities are no longer stuck inside the processing application. We can wire up many MCPs to a given application, making the system more modular and flexible. In this context, having something more sophisticated than hardcoded case/if idioms actually makes sense.

We still need an LLM/GPT wrapper to orchestrate everything, which is what I spent time working on yesterday. Here’s the basic architecture.

The MCP tool calling pattern follows a straightforward workflow:

  1. Tool Discovery: When the system starts, it queries the MCP server to discover available tools and their capabilities
  2. Tool Translation: MCP tool definitions are converted to Ollama-compatible formats
  3. Contextual Decision Making: The LLM analyzes user queries and autonomously decides which tools to use
  4. Tool Execution: When a tool is needed, the system makes a JSON-RPC call to the MCP server
  5. Result Integration: Tool results are seamlessly integrated back into the conversation context
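To make steps 1 and 2 concrete, here’s a minimal sketch of the discovery/translation shim, assuming the tool shape MCP’s tools/list returns (name, description, inputSchema) and the function-style tool definitions the ollama npm client expects. The names are illustrative, not lifted from my repo:

// Shape of a tool definition as returned by the MCP server's tools/list call
interface McpTool {
  name: string;
  description?: string;
  inputSchema: Record<string, unknown>; // JSON Schema describing the tool's arguments
}

// Shape the ollama npm client expects in its `tools` array
interface OllamaTool {
  type: "function";
  function: {
    name: string;
    description: string;
    parameters: Record<string, unknown>;
  };
}

// Step 2: translate MCP tool definitions into Ollama-compatible ones
function toOllamaTools(mcpTools: McpTool[]): OllamaTool[] {
  return mcpTools.map((t) => ({
    type: "function",
    function: {
      name: t.name,
      description: t.description ?? "",
      parameters: t.inputSchema,
    },
  }));
}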

The communication with the MCP follows JSON-RPC 2.0 standards. Here’s what a typical tool call looks like:

{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "tools/call",
  "params": {
    "name": "search_kev",
    "arguments": {
      "vendor": "Microsoft",
      "limit": 10,
      "days_back": 30
    }
  }
}

In my example project, the MCP server processes this request, queries the CISA KEV database, and returns structured results that the LLM can then interpret and present to the user in a meaningful way.
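On the wire, the reply is a JSON-RPC result whose content array carries text blocks that get handed back to the model. The payload below is illustrative (the actual text depends on how the server formats its output), but the envelope follows the MCP spec:

{
  "jsonrpc": "2.0",
  "id": 1,
  "result": {
    "content": [
      {
        "type": "text",
        "text": "{ \"count\": 3, \"vulnerabilities\": [ ... ] }"
      }
    ],
    "isError": false
  }
}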

The implementation has three core components working in concert.

The MCP Client (mcpClient.ts) serves as the communication hub, handling:

  • Spawning and managing the MCP server subprocess
  • Implementing JSON-RPC 2.0 communication over stdio
  • Converting MCP tool definitions to Ollama-compatible formats
  • Handling errors and timeouts gracefully
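Stripped of error handling, timeouts, and the MCP initialize handshake, the core of that client looks roughly like this. It’s a sketch, not the actual mcpClient.ts, and it assumes the server speaks newline-delimited JSON-RPC over stdio (which is what the MCP stdio transport specifies):

import { spawn, ChildProcessWithoutNullStreams } from "node:child_process";

// Minimal JSON-RPC 2.0 client for a stdio MCP server (sketch, not the real mcpClient.ts)
class McpStdioClient {
  private proc: ChildProcessWithoutNullStreams;
  private nextId = 1;
  private pending = new Map<number, (value: unknown) => void>();
  private buffer = "";

  constructor(command: string, args: string[] = []) {
    // Spawn the MCP server as a subprocess; its stdin/stdout are the transport
    this.proc = spawn(command, args);
    this.proc.stderr.pipe(process.stderr);
    this.proc.stdout.on("data", (chunk: Buffer) => {
      // MCP's stdio transport delimits JSON-RPC messages with newlines
      this.buffer += chunk.toString();
      let newline: number;
      while ((newline = this.buffer.indexOf("\n")) >= 0) {
        const line = this.buffer.slice(0, newline).trim();
        this.buffer = this.buffer.slice(newline + 1);
        if (!line) continue;
        const msg = JSON.parse(line);
        this.pending.get(msg.id)?.(msg.result ?? msg.error);
        this.pending.delete(msg.id);
      }
    });
  }

  // Send one JSON-RPC request and resolve with the matching response
  private request(method: string, params: Record<string, unknown> = {}): Promise<unknown> {
    const id = this.nextId++;
    this.proc.stdin.write(JSON.stringify({ jsonrpc: "2.0", id, method, params }) + "\n");
    return new Promise((resolve) => this.pending.set(id, resolve));
  }

  listTools() {
    return this.request("tools/list");
  }

  callTool(name: string, args: Record<string, unknown>) {
    return this.request("tools/call", { name, arguments: args });
  }

  close() {
    this.proc.kill();
  }
}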

The enhanced Ollama client (ollamaClient.ts) extends the standard Ollama client with MCP awareness:

  • Maintains conversation state across tool calls
  • Automatically executes tool calls when the model requests them
  • Integrates tool results back into the conversation flow
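The heart of it is a loop along these lines. Again, a sketch: it assumes the ollama npm package’s chat/tools API and any MCP client with a callTool method (like the one above); the real ollamaClient.ts adds guards such as a cap on loop iterations:

import ollama, { Message, Tool } from "ollama";

// Anything with a callTool() works here (e.g., the McpStdioClient sketch above)
interface ToolCaller {
  callTool(name: string, args: Record<string, unknown>): Promise<unknown>;
}

// Chat loop: let the model call MCP tools until it produces a final answer
async function chatWithTools(
  mcp: ToolCaller,
  tools: Tool[],
  messages: Message[],
  model = "tool-use:latest"
): Promise<string> {
  for (;;) {
    const response = await ollama.chat({ model, messages, tools });
    messages.push(response.message);

    const calls = response.message.tool_calls ?? [];
    if (calls.length === 0) {
      // No tool requested, so the content is the final answer
      return response.message.content;
    }

    // Execute each requested tool via the MCP server and feed the results back
    for (const call of calls) {
      const result = await mcp.callTool(
        call.function.name,
        call.function.arguments as Record<string, unknown>
      );
      messages.push({ role: "tool", content: JSON.stringify(result) });
    }
    // Loop so the model can read the tool output and answer (or call more tools)
  }
}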

The interactive session manager (mcp.ts) provides the user interface and orchestrates the entire system:

  • Command-line argument parsing for flexible deployment
  • Interactive prompt loop for natural conversations
  • Comprehensive error handling and graceful shutdown
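The prompt loop itself is little more than Node’s readline (sketch only; the real mcp.ts also parses CLI flags for the model and the MCP server command):

import * as readline from "node:readline/promises";
import { stdin as input, stdout as output } from "node:process";

// Minimal interactive loop: read a question, run it through chatWithTools, print the answer
async function repl(ask: (question: string) => Promise<string>) {
  const rl = readline.createInterface({ input, output });
  try {
    for (;;) {
      const question = (await rl.question("mcp> ")).trim();
      if (question === "" || question === "exit" || question === "quit") break;
      console.log(await ask(question));
    }
  } finally {
    rl.close(); // graceful shutdown of the readline interface
  }
}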

Fixing Context Window Overflows

As noted earlier, I had high hopes for llama3-groq-tool-use and several other models, but they all started behaving oddly during extended conversations. The culprit turned out to be context window limitations.

The issue stems from how the system maintains conversation state across tool calls. Most MCP servers start life like my kev-mcp — direct translations of existing APIs. If the underlying service has an endpoint for “get vulnerability by ID,” the MCP server exposes a function called get_vulnerability. If there’s a “list all vulnerabilities” endpoint, that becomes list_vulnerabilities. This one-to-one mapping feels logical and maintains familiar patterns for folks who already know the API.

However, problems emerge when you watch language models actually try to use these tools. Consider a security analyst asking: “What vulnerabilities should we prioritize patching this month?” With a basic API wrapper, the LLM faces a multi-step puzzle. It must first retrieve all vulnerabilities, then examine their due dates, filter for current relevance, possibly cross-reference with organizational systems, and finally synthesize recommendations. Each step requires a separate API call, and the LLM must hold increasingly complex state in its working memory while building toward an answer.

My initial implementation of kev-mcp turned out to be excellent at creating information overload. The KEV database is approaching 1,500 entries, and repeated inclusions of the full JSON responses add up fast, token-wise. The MCP server now includes options for trimming what goes into responses, plus new functions for providing concise statistics rather than having the model attempt to compute everything from raw data.
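Concretely, “trimming” and “concise statistics” look something like this (field names follow CISA’s public KEV JSON feed; the function names are illustrative, not the actual kev-mcp tool names):

// Subset of fields as they appear in CISA's KEV JSON feed
interface KevEntry {
  cveID: string;
  vendorProject: string;
  product: string;
  vulnerabilityName: string;
  dateAdded: string;
  dueDate: string;
  knownRansomwareCampaignUse: string;
}

// Return only the fields the model actually needs, instead of full entries
function trimEntries(entries: KevEntry[], limit = 25) {
  return entries.slice(0, limit).map((e) => ({
    cve: e.cveID,
    vendor: e.vendorProject,
    product: e.product,
    dueDate: e.dueDate,
    ransomware: e.knownRansomwareCampaignUse === "Known",
  }));
}

// Pre-compute aggregate stats so the model never sees ~1,500 raw records
function kevStats(entries: KevEntry[]) {
  const byVendor = new Map<string, number>();
  for (const e of entries) {
    byVendor.set(e.vendorProject, (byVendor.get(e.vendorProject) ?? 0) + 1);
  }
  return {
    total: entries.length,
    ransomwareLinked: entries.filter((e) => e.knownRansomwareCampaignUse === "Known").length,
    topVendors: [...byVendor.entries()].sort((a, b) => b[1] - a[1]).slice(0, 10),
  };
}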

Additionally, I was pretty lazy with the initial KEV MCP implementation and provided virtually no extended descriptions of what each tool does. Fixing this led to major inference improvements, as the model could better understand which tools to use and when.

Lessons Learned

I’m pleased that I now have a straightforward way to do local inference with MCP servers. Here are the key takeaways:

Model Selection Matters: Use a model with well-tuned tool calling capabilities and a large context window. The combination of these two factors determines whether your system will handle complex, multi-step conversations gracefully.

Design for Inference, Not Just APIs: MCP servers should be designed with the language model’s working patterns in mind, not just as direct API translations. This means providing summary functions, trimming verbose responses, and including rich tool descriptions.

Context Management is Critical: Monitor token usage and plan for context trimming strategies. Conversation state that grows without bounds will eventually break even the most capable models.
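For that last point, even a crude heuristic goes a long way. A sketch, assuming roughly four characters per token and a budget comfortably under the 32K num_ctx set in the Modelfile:

import { Message } from "ollama";

// Rough heuristic: ~4 characters per token (good enough for budgeting, not billing)
const estimateTokens = (messages: Message[]) =>
  Math.ceil(messages.reduce((n, m) => n + m.content.length, 0) / 4);

// Drop the oldest non-system messages until the conversation fits the budget
function trimContext(messages: Message[], maxTokens = 24_000): Message[] {
  const system = messages.filter((m) => m.role === "system");
  const rest = messages.filter((m) => m.role !== "system");
  while (rest.length > 2 && estimateTokens([...system, ...rest]) > maxTokens) {
    rest.shift();
  }
  return [...system, ...rest];
}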

Items still on my TODO list include enhancing this basic MCP+Ollama code to support multiple MCP servers simultaneously, adding support for remote MCP servers, implementing a running token count with automatic context trimming, and improving the REPL experience with better line editing in the question prompt.

You can see an example session at https://codeberg.org/hrbrmstr/ollama-with-mcp/src/branch/batman/example-session.md.


FIN

Remember, you can follow and interact with the full text of The Daily Drop’s free posts on:

  • 🐘 Mastodon via @dailydrop.hrbrmstr.dev@dailydrop.hrbrmstr.dev
  • 🦋 Bluesky via https://bsky.app/profile/dailydrop.hrbrmstr.dev.web.brid.gy

☮️