Skip to main content
Glama

local-llm-mcp

local-llm-mcp hero

Your coding agent is brilliant at deciding what to build — and overqualified for typing the boilerplate. local-llm-mcp lets Claude Code, Codex, and any MCP client hand the boring, token-heavy work to a model running on your own machine (or a cheap cloud one), while the frontier agent keeps doing the thinking.

terminal demo


TL;DR

Claude Code / Codex   →   decides, reads the repo, edits files, runs tests, reviews
local-llm-mcp         →   routes one bounded task out through an MCP tool
LM Studio / Ollama    →   drafts the boilerplate, tests, docs, summaries — for free

The smart agent stays in charge. The cheap model just hands back text. You stop paying frontier prices to scaffold a pytest file.

git clone https://github.com/HenryLinyy/local-llm-mcp
cd local-llm-mcp
bash setup.sh        # one command: venv, register with Claude Code + Codex, smoke test

Related MCP server: CodeBrain

The problem

Frontier coding agents are metered, and they're worth it — for judgment. Reading a repo, planning a change, reviewing a diff, deciding what's safe to merge.

But a huge slice of what they actually emit is bounded, low-risk generation:

  • the first draft of a function you're going to rewrite anyway

  • a pytest skeleton

  • boilerplate, config, glue code

  • a 600-line file summarized down to 10 bullets

  • "give me three alternative implementations"

You're paying premium per-token rates for work a 7B model on your laptop does fine. A local model costs $0 per token. The math only goes one way.

How it works

local-llm-mcp is a tiny MCP server. It exposes your local and cheap-cloud LLMs as tools the main agent can call. The delegated model only returns text — it can't read your repo, edit files, or run commands. That boundary is the whole point.

delegation boundary

The main agent decides when to delegate, reviews what comes back, and remains 100% responsible for anything that touches your code.

Quick start

git clone https://github.com/HenryLinyy/local-llm-mcp
cd local-llm-mcp
bash setup.sh

Then open a new Claude Code or Codex session. That's it.

setup.sh will:

  1. Create a .venv and install the project.

  2. Create keys.json and custom_backends.json from examples.

  3. Pick a RAM safety threshold for your machine.

  4. Register the MCP server with both Claude Code and Codex (if their CLIs are installed).

  5. Run a smoke test.

No API keys required to start — local backends work out of the box.

✅ Delegate this / 🚫 keep this

✅ Good for the worker model

🚫 Keep with the main agent

README / docstring first drafts

Final architecture decisions

Boilerplate, config, glue code

Security & correctness sign-off

pytest / unittest scaffolds

Anything that edits your repo

Long-file summaries

Running shell commands

Repetitive format conversions

Applying a patch unreviewed

"Sketch 3 alternative approaches"

Judgment calls of any kind

Rule of thumb: if a wrong answer is cheap to catch, delegate it. If it's expensive to catch, don't.

Supported backends

Local backends need nothing but a running server. Cloud backends are optional fallbacks and read their key from an env var or keys.json.

Backend

Type

Protocol

Default URL

Default model

Key

lmstudio

local

OpenAI

http://localhost:1234/v1

qwen/qwen3-coder-next

ollama

local

OpenAI

http://localhost:11434/v1

qwen2.5-coder:7b

vllm

local

OpenAI

http://localhost:8001/v1

auto

llamacpp

local

OpenAI

http://localhost:8080/v1

auto

ds4

local

OpenAI

http://127.0.0.1:8000/v1

auto

deepseek

cloud

OpenAI

https://api.deepseek.com/v1

deepseek-v4-flash

DEEPSEEK_API_KEY

openrouter

cloud

OpenAI

https://openrouter.ai/api/v1

anthropic/claude-sonnet-4

OPENROUTER_API_KEY

groq

cloud

OpenAI

https://api.groq.com/openai/v1

openai/gpt-oss-120b

GROQ_API_KEY

cerebras

cloud

OpenAI

https://api.cerebras.ai/v1

gpt-oss-120b

CEREBRAS_API_KEY

agnes

cloud

OpenAI

https://apihub.agnes-ai.com/v1

agnes-2.0-flash

AGNES_API_KEY

minimax

cloud

Anthropic

https://api.minimaxi.com/anthropic

MiniMax-M3

MINIMAX_API_KEY

Every base URL and default model can be overridden by env vars, e.g. OLLAMA_BASE_URL, DEEPSEEK_DEFAULT_MODEL. Need something else? Add a custom backend — no Python required.

MCP tools

Tool

Purpose

ask_local_model

Send a prompt to a backend, get back text + usage metadata.

list_backends

Show configured backends, URLs, protocols, key status.

local_status

Memory, guard state, backend reachability, config paths.

list_local_models / list_models

List model IDs from backends that expose GET /models.

set_backend

Add, update, or remove a custom backend live.

refresh_backends

Reload custom_backends.json without restarting.

set_guard

Change the RAM / exclusivity guards live.

set_system_prefix

Pin a system prefix for prompt-cache-friendly cloud calls.

Talking to it

Just tell the agent what to delegate:

Use ask_local_model with backend="ollama" to draft a pytest suite for this module.
Don't apply it — review it first, then edit the repo yourself.
Call local_status. If DS4 is running, use backend="ds4" for boilerplate.
Otherwise fall back to backend="ollama".

More patterns in examples/claude-code-prompts.md.

Safety model

Local models are happy to OOM your machine. Two guards stop that:

  1. RAM valve — local calls are refused when free memory drops below LOCAL_LLM_MIN_FREE_GB.

  2. Exclusive backend — when a heavy local backend (ds4 by default) is up, other local backends are blocked so they don't fight for RAM.

Tune them live, no restart:

set_guard(min_free_gb=8)
set_guard(exclusive_backend="none")
set_guard(enforce=0)

Secrets never enter git. keys.json, config.json, and custom_backends.json are gitignored; keys load from env vars or a chmod 600 file. Cloud backends skip the RAM guard — but they send your prompts to a third party and may cost money, so read SECURITY.md before pointing one at proprietary code.

Custom backends

Any OpenAI- or Anthropic-compatible endpoint works. Add it from the tool:

set_backend(name="my_qwen", base_url="http://localhost:9000/v1", default_model="qwen3-coder", local=1, protocol="openai")

...or drop it in custom_backends.json and call refresh_backends. See examples/custom_backends.openrouter.json.

Does it actually save money?

It depends entirely on your workload — so this repo refuses to print a fake percentage. Instead it ships a harness so you can measure your numbers:

python scripts/benchmark.py --backend ollama --model qwen2.5-coder:7b --out results.jsonl

Then compare premium-only vs. delegated mode using BENCHMARK.md and the runbook. Report what you find; don't take a number on faith — not even ours.

Manual install

python3 -m venv .venv
.venv/bin/python -m pip install -e .

# Claude Code
claude mcp add local-llm -s user -e LOCAL_LLM_MIN_FREE_GB=16 -- "$PWD/.venv/bin/python" "$PWD/server.py"

# Codex
codex mcp add local-llm --env LOCAL_LLM_MIN_FREE_GB=16 -- "$PWD/.venv/bin/python" "$PWD/server.py"

Tests

python -m unittest discover -s tests -v
python scripts/smoke_test.py

CI runs both on Python 3.10, 3.11, and 3.12.

FAQ

Does the local model touch my files? No. It returns text only. Every edit goes through the main agent.

Do I need a GPU / a big Mac? No. Small coder models (7B) run on modest hardware, and the RAM valve keeps you from OOMing. No local model handy? Point it at a cheap cloud backend.

Is this a fusion model or an autonomous agent? Neither. It's a delegation layer — a tool your existing agent calls.

Why not just switch Claude Code to a cheaper model entirely? Because you want the frontier model's judgment and the cheap model's typing. This keeps both.

Windows / Linux? The server is cross-platform; the RAM guard reads vm_stat on macOS and /proc/meminfo on Linux. Shell helpers are macOS/zsh-flavored.

  • qwable — a local multi-model gateway and agent runtime for Codex & Claude Code on Apple Silicon.

  • Conclava — a council of local LLMs with task-aware routing and multi-model deliberation.

Contributing

Issues and PRs welcome — see CONTRIBUTING.md. Launch notes live in docs/.

License

MIT — see LICENSE.

A
license - permissive license
-
quality - not tested
C
maintenance

Maintenance

Maintainers
Response time
Release cycle
Releases (12mo)
Commit activity

Resources

Unclaimed servers have limited discoverability.

Looking for Admin?

If you are the server author, to access and configure the admin panel.

Latest Blog Posts

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/HenryLinyy/local-llm-mcp'

If you have feedback or need assistance with the MCP directory API, please join our Discord server