local-llm-mcp
Enables delegation of text generation tasks to local Ollama models, allowing the main agent to offload token-heavy work to a local LLM.
Click on "Install Server".
Wait a few minutes for the server to deploy. Once ready, it will show a "Started" state.
In the chat, type
@followed by the MCP server name and your instructions, e.g., "@local-llm-mcpgenerate a pytest skeleton for a user model"
That's it! The server will respond to your query, and you can continue using it as needed.
Here is a step-by-step guide with screenshots.
local-llm-mcp

Your coding agent is brilliant at deciding what to build — and overqualified for typing the boilerplate.
local-llm-mcplets Claude Code, Codex, and any MCP client hand the boring, token-heavy work to a model running on your own machine (or a cheap cloud one), while the frontier agent keeps doing the thinking.

TL;DR
Claude Code / Codex → decides, reads the repo, edits files, runs tests, reviews
local-llm-mcp → routes one bounded task out through an MCP tool
LM Studio / Ollama → drafts the boilerplate, tests, docs, summaries — for freeThe smart agent stays in charge. The cheap model just hands back text. You stop paying frontier prices to scaffold a pytest file.
git clone https://github.com/HenryLinyy/local-llm-mcp
cd local-llm-mcp
bash setup.sh # one command: venv, register with Claude Code + Codex, smoke testRelated MCP server: CodeBrain
The problem
Frontier coding agents are metered, and they're worth it — for judgment. Reading a repo, planning a change, reviewing a diff, deciding what's safe to merge.
But a huge slice of what they actually emit is bounded, low-risk generation:
the first draft of a function you're going to rewrite anyway
a
pytestskeletonboilerplate, config, glue code
a 600-line file summarized down to 10 bullets
"give me three alternative implementations"
You're paying premium per-token rates for work a 7B model on your laptop does fine. A local model costs $0 per token. The math only goes one way.
How it works
local-llm-mcp is a tiny MCP server. It exposes your local and cheap-cloud LLMs as tools the main agent can call. The delegated model only returns text — it can't read your repo, edit files, or run commands. That boundary is the whole point.

The main agent decides when to delegate, reviews what comes back, and remains 100% responsible for anything that touches your code.
Quick start
git clone https://github.com/HenryLinyy/local-llm-mcp
cd local-llm-mcp
bash setup.shThen open a new Claude Code or Codex session. That's it.
setup.sh will:
Create a
.venvand install the project.Create
keys.jsonandcustom_backends.jsonfrom examples.Pick a RAM safety threshold for your machine.
Register the MCP server with both Claude Code and Codex (if their CLIs are installed).
Run a smoke test.
No API keys required to start — local backends work out of the box.
✅ Delegate this / 🚫 keep this
✅ Good for the worker model | 🚫 Keep with the main agent |
README / docstring first drafts | Final architecture decisions |
Boilerplate, config, glue code | Security & correctness sign-off |
| Anything that edits your repo |
Long-file summaries | Running shell commands |
Repetitive format conversions | Applying a patch unreviewed |
"Sketch 3 alternative approaches" | Judgment calls of any kind |
Rule of thumb: if a wrong answer is cheap to catch, delegate it. If it's expensive to catch, don't.
Supported backends
Local backends need nothing but a running server. Cloud backends are optional fallbacks and read their key from an env var or keys.json.
Backend | Type | Protocol | Default URL | Default model | Key |
| local | OpenAI |
|
| — |
| local | OpenAI |
|
| — |
| local | OpenAI |
| auto | — |
| local | OpenAI |
| auto | — |
| local | OpenAI |
| auto | — |
| cloud | OpenAI |
|
|
|
| cloud | OpenAI |
|
|
|
| cloud | OpenAI |
|
|
|
| cloud | OpenAI |
|
|
|
| cloud | OpenAI |
|
|
|
| cloud | Anthropic |
|
|
|
Every base URL and default model can be overridden by env vars, e.g. OLLAMA_BASE_URL, DEEPSEEK_DEFAULT_MODEL. Need something else? Add a custom backend — no Python required.
MCP tools
Tool | Purpose |
| Send a prompt to a backend, get back text + usage metadata. |
| Show configured backends, URLs, protocols, key status. |
| Memory, guard state, backend reachability, config paths. |
| List model IDs from backends that expose |
| Add, update, or remove a custom backend live. |
| Reload |
| Change the RAM / exclusivity guards live. |
| Pin a system prefix for prompt-cache-friendly cloud calls. |
Talking to it
Just tell the agent what to delegate:
Use ask_local_model with backend="ollama" to draft a pytest suite for this module.
Don't apply it — review it first, then edit the repo yourself.Call local_status. If DS4 is running, use backend="ds4" for boilerplate.
Otherwise fall back to backend="ollama".More patterns in examples/claude-code-prompts.md.
Safety model
Local models are happy to OOM your machine. Two guards stop that:
RAM valve — local calls are refused when free memory drops below
LOCAL_LLM_MIN_FREE_GB.Exclusive backend — when a heavy local backend (
ds4by default) is up, other local backends are blocked so they don't fight for RAM.
Tune them live, no restart:
set_guard(min_free_gb=8)
set_guard(exclusive_backend="none")
set_guard(enforce=0)Secrets never enter git. keys.json, config.json, and custom_backends.json are gitignored; keys load from env vars or a chmod 600 file. Cloud backends skip the RAM guard — but they send your prompts to a third party and may cost money, so read SECURITY.md before pointing one at proprietary code.
Custom backends
Any OpenAI- or Anthropic-compatible endpoint works. Add it from the tool:
set_backend(name="my_qwen", base_url="http://localhost:9000/v1", default_model="qwen3-coder", local=1, protocol="openai")...or drop it in custom_backends.json and call refresh_backends. See examples/custom_backends.openrouter.json.
Does it actually save money?
It depends entirely on your workload — so this repo refuses to print a fake percentage. Instead it ships a harness so you can measure your numbers:
python scripts/benchmark.py --backend ollama --model qwen2.5-coder:7b --out results.jsonlThen compare premium-only vs. delegated mode using BENCHMARK.md and the runbook. Report what you find; don't take a number on faith — not even ours.
Manual install
python3 -m venv .venv
.venv/bin/python -m pip install -e .
# Claude Code
claude mcp add local-llm -s user -e LOCAL_LLM_MIN_FREE_GB=16 -- "$PWD/.venv/bin/python" "$PWD/server.py"
# Codex
codex mcp add local-llm --env LOCAL_LLM_MIN_FREE_GB=16 -- "$PWD/.venv/bin/python" "$PWD/server.py"Tests
python -m unittest discover -s tests -v
python scripts/smoke_test.pyCI runs both on Python 3.10, 3.11, and 3.12.
FAQ
Does the local model touch my files? No. It returns text only. Every edit goes through the main agent.
Do I need a GPU / a big Mac? No. Small coder models (7B) run on modest hardware, and the RAM valve keeps you from OOMing. No local model handy? Point it at a cheap cloud backend.
Is this a fusion model or an autonomous agent? Neither. It's a delegation layer — a tool your existing agent calls.
Why not just switch Claude Code to a cheaper model entirely? Because you want the frontier model's judgment and the cheap model's typing. This keeps both.
Windows / Linux? The server is cross-platform; the RAM guard reads vm_stat on macOS and /proc/meminfo on Linux. Shell helpers are macOS/zsh-flavored.
Related projects
qwable — a local multi-model gateway and agent runtime for Codex & Claude Code on Apple Silicon.
Conclava — a council of local LLMs with task-aware routing and multi-model deliberation.
Contributing
Issues and PRs welcome — see CONTRIBUTING.md. Launch notes live in docs/.
License
MIT — see LICENSE.
This server cannot be installed
Maintenance
Resources
Unclaimed servers have limited discoverability.
Looking for Admin?
If you are the server author, to access and configure the admin panel.
Latest Blog Posts
- Why MCP Servers Need Execution Sandboxing (And Why Your Current Stack Isn't Enough)By Om-Shree-0709 on .Agentic AiPrompt InjectionWebAssembly
MCP directory API
We provide all the information about MCP servers via our MCP API.
curl -X GET 'https://glama.ai/api/mcp/v1/servers/HenryLinyy/local-llm-mcp'
If you have feedback or need assistance with the MCP directory API, please join our Discord server