Which integrations are available for this server?

Enables delegation of text generation tasks to local Ollama models, allowing the main agent to offload token-heavy work to a local LLM.

How do I use local-llm-mcp?

1. Click on "Install Server". 2. Wait a few minutes for the server to deploy. Once ready, it will show a "Started" state. 3. In the chat, type @ followed by the MCP server name and your instructions, e.g., "@local-llm-mcp generate a pytest skeleton for a user model" That's it! The server will respond to your query, and you can continue using it as needed. Here is a step-by-step guide with screenshots.

local-llm-mcp

by HenryLinyy

Overview Schema Related Servers Score Discussions

Python

Local

local-llm-mcp

local-llm-mcp hero

terminal demo

Why I built this

One afternoon I watched Claude Code reach for a frontier model — the good one, the one my credit card feels — to generate an __init__.py. Then a pytest skeleton. Then a throwaway first draft of a function I was about to rewrite anyway.

That's when it clicked: I was paying senior-engineer rates for a senior engineer's typing. The thinking — reading my repo, planning the change, reviewing the diff, deciding what's safe to ship — that's what the expensive model is for. The boilerplate is not.

So I did the lazy thing. I already had LM Studio running on my Mac, doing nothing between prompts. Why not let the cheap model on my own machine handle the grunt work, and keep the frontier brain for the calls that actually matter?

One snag: my Mac has 24GB of RAM, and local models cheerfully ate all of it until the whole session froze. So the first thing I built wasn't the clever delegation — it was a dumb RAM valve that refuses a local call when memory runs low. Everything else grew around that.

That's all local-llm-mcp is: a small MCP server that hands your coding agent an intern. The intern only ever returns text — it can't read your repo, edit a file, or run a command — and the senior model reviews everything before a single line changes.

The whole idea in one line: senior model decides, local model types.

Related MCP server: local-executor-mcp

Try it in three commands

git clone https://github.com/HenryLinyy/local-llm-mcp
cd local-llm-mcp
bash setup.sh        # venv + registers with Claude Code & Codex + smoke test

Then open a fresh Claude Code or Codex session. No API keys needed — local backends work out of the box. setup.sh even picks a RAM threshold based on your machine.

The mental model

Claude Code / Codex   the senior — reads the repo, plans, edits, runs tests, reviews
local-llm-mcp         hands one bounded task to the intern over MCP
Ollama / LM Studio    the intern — drafts boilerplate, tests, docs, summaries, for $0

A local 7B model costs $0 per token. My frontier agent does not. The rest is arithmetic — and I'll get to the arithmetic honestly below.

The intern can't touch your repo. That's not a missing feature; it's the point. I never have to trust the cheap model with anything that matters — worst case, it hands back a bad draft and the senior throws it out.

delegation boundary

What I hand the intern / what stays on my desk

✅ Hand to the intern	🚫 Keep with the senior
README & docstring first drafts	Final architecture decisions
Boilerplate, config, glue code	Security & correctness sign-off
`pytest` / `unittest` scaffolds	Anything that edits the repo
Long-file summaries	Running shell commands
Repetitive format conversions	Applying a patch unreviewed
"Sketch 3 alternative approaches"	Any judgment call

My rule of thumb: if a wrong answer is cheap to catch, delegate it. If it's expensive to catch, don't.

How I actually use it

There's no API to learn. I just tell the senior to delegate, in plain English:

Use ask_local_model with backend="ollama" to draft a pytest suite for this module.
Don't apply it — review it first, then edit the repo yourself.

Call local_status. If a local model is up, use it for boilerplate.
Otherwise fall back to backend="deepseek".

More of my go-to prompts are in examples/claude-code-prompts.md.

Bring any intern you like

Local backends need nothing but a running server. Cloud backends are optional — they read their key from an env var or keys.json, never from source.

Backend	Type	Protocol	Default URL	Default model	Key
`lmstudio`	local	OpenAI	`http://localhost:1234/v1`	`qwen/qwen3-coder-next`	—
`ollama`	local	OpenAI	`http://localhost:11434/v1`	`qwen2.5-coder:7b`	—
`vllm`	local	OpenAI	`http://localhost:8001/v1`	auto	—
`llamacpp`	local	OpenAI	`http://localhost:8080/v1`	auto	—
`ds4`	local	OpenAI	`http://127.0.0.1:8000/v1`	auto	—
`deepseek`	cloud	OpenAI	`https://api.deepseek.com/v1`	`deepseek-v4-flash`	`DEEPSEEK_API_KEY`
`openrouter`	cloud	OpenAI	`https://openrouter.ai/api/v1`	`anthropic/claude-sonnet-4`	`OPENROUTER_API_KEY`
`groq`	cloud	OpenAI	`https://api.groq.com/openai/v1`	`openai/gpt-oss-120b`	`GROQ_API_KEY`
`cerebras`	cloud	OpenAI	`https://api.cerebras.ai/v1`	`gpt-oss-120b`	`CEREBRAS_API_KEY`
`agnes`	cloud	OpenAI	`https://apihub.agnes-ai.com/v1`	`agnes-2.0-flash`	`AGNES_API_KEY`
`minimax`	cloud	Anthropic	`https://api.minimaxi.com/anthropic`	`MiniMax-M3`	`MINIMAX_API_KEY`

Every URL and default model is env-overridable. Running something exotic? Add a custom backend — no Python required.

The tools it exposes

Tool	Purpose
`ask_local_model`	Send a prompt to a backend, get back text + usage metadata.
`list_backends`	Show configured backends, URLs, protocols, key status.
`local_status`	Memory, guard state, backend reachability, config paths.
`list_local_models` / `list_models`	List model IDs from backends that expose `GET /models`.
`set_backend`	Add, update, or remove a custom backend live.
`refresh_backends`	Reload `custom_backends.json` without restarting.
`set_guard`	Change the RAM / exclusivity guards live.
`set_system_prefix`	Pin a system prefix for prompt-cache-friendly cloud calls.

The intern is on a short leash

I learned this the hard way, so you don't have to. Two guards:

RAM valve — local calls are refused when free memory drops below LOCAL_LLM_MIN_FREE_GB. This is the feature that exists because I froze my own machine one too many times.
Exclusive backend — when a heavy local server (ds4 by default) is up, other local backends stand down instead of fighting over memory.

Tune them live, no restart: set_guard(min_free_gb=8), set_guard(exclusive_backend="none").

And secrets never enter git — keys.json, config.json, and custom_backends.json are all gitignored; keys load from env vars or a chmod 600 file. Cloud backends skip the RAM guard, but they send your prompts to a third party and may cost money — read SECURITY.md before pointing one at proprietary code.

So how much does it save?

Here's where most READMEs lie to you with a big number. I'm not going to.

The honest answer is it depends entirely on your workload, so instead of inventing a percentage I shipped a harness to measure your own:

python scripts/benchmark.py --backend ollama --model qwen2.5-coder:7b --out results.jsonl

Compare premium-only vs. delegated mode with BENCHMARK.md and the runbook. Trust your numbers — not mine, and definitely not a number a README made up to get you to star it.

Add it straight from the tool:

set_backend(name="my_qwen", base_url="http://localhost:9000/v1", default_model="qwen3-coder", local=1, protocol="openai")

…or drop it in custom_backends.json and call refresh_backends. See examples/custom_backends.openrouter.json.

python3 -m venv .venv
.venv/bin/python -m pip install -e .

# Claude Code
claude mcp add local-llm -s user -e LOCAL_LLM_MIN_FREE_GB=16 -- "$PWD/.venv/bin/python" "$PWD/server.py"

# Codex
codex mcp add local-llm --env LOCAL_LLM_MIN_FREE_GB=16 -- "$PWD/.venv/bin/python" "$PWD/server.py"

python -m unittest discover -s tests -v
python scripts/smoke_test.py

CI runs both on Python 3.10, 3.11, and 3.12.

FAQ

Does the intern touch my files? Never. It returns text; every edit goes through the senior (your main agent).

Do I need a beefy GPU? No. 7B coder models run on modest hardware, and the RAM valve keeps you from OOMing. No local model handy? Point the intern at a cheap cloud backend.

Is this a fusion model or an autonomous agent? Neither. It's a delegation layer — a tool your existing agent calls.

Why not just switch my agent to a cheap model entirely? Because I want the frontier model's judgment and the cheap model's typing. This keeps both.

Windows / Linux? The server is cross-platform (the RAM guard reads vm_stat on macOS, /proc/meminfo on Linux). The shell helpers are macOS/zsh-flavored.

My other experiments in this space

qwable — a local multi-model gateway and agent runtime for Codex & Claude Code on Apple Silicon.
Conclava — a council of local LLMs with task-aware routing and multi-model deliberation.

Contributing

Issues and PRs welcome — see CONTRIBUTING.md. If this saved you some tokens, a ⭐ tells me it was worth open-sourcing.

License

MIT — see LICENSE. Built by someone who got tired of paying frontier prices for import os.

Install Server

license - permissive license

quality

maintenance

How are these scores calculated?

Maintenance

–Maintainers

–Response time

–Release cycle

–Releases (12mo)

Commit activity

Resources

GitHub Repository

Need Help?

Related Servers

Unclaimed servers have limited discoverability.

Looking for Admin?

If you are the server author, to access and configure the admin panel.

Tools

View all tools

Related MCP Servers

CodeBrain
Code Execution Developer Tools Autonomous Agents
Tschonsen
A
license
A
quality
D
maintenance
An MCP server that offloads bulk coding tasks to local LLMs, allowing Claude Code to delegate repetitive work like boilerplate generation and code polishing while preserving its context for complex reasoning.
Last updated 2026-04-19
10
1
MIT
local-executor-mcp
AI & Machine Learning Developer Tools
rasvan-ghiliciu
F
license
-
quality
C
maintenance
An MCP server that lets a frontier planner (Claude Code) delegate mechanical code-generation subtasks to a local LLM served by llama-swap, to save frontier-model tokens.
Last updated 2026-06-17
ollama-handoff
AI & Machine Learning Developer Tools Autonomous Agents
Michael-WhiteCapData
A
license
A
quality
B
maintenance
Offloads cheap work from cloud LLM agents to a local Ollama model, reducing costs and keeping frontier models focused on complex tasks.
Last updated 2026-06-23
8
2
MIT
local-agent
Developer Tools AI & Machine Learning Code Execution
MrFalach
F
license
-
quality
B
maintenance
Routes code generation tasks between local models and Claude Cloud, optimizing cost by handling simple tasks locally and reserving cloud thinking for complex tasks.
Last updated 2026-07-06

View all related MCP servers

Related MCP Connectors

AgentDrive
Cross-agent artifact workspace with provenance across Claude Code, Codex, Cursor, LangGraph.
Lassare
Human-in-the-loop for AI coding agents — ask questions, get approvals via Slack.
agentmailrooms-mcp
A paid remote MCP for OpenAI Codex agent coordination MCP, built to return verdicts, receipts, usage

View all MCP Connectors

Latest Blog Posts

Who's Calling? MCP Hosts Are an Identity Blind Spot (And the Spec Knows It)
By Om-Shree-0709 on July 25, 2026.
mcp
Agent Identity
OAuth 2.1
Your AI Chatbot Just Exposed Your CEO's Salary to an Intern
By Om-Shree-0709 on July 2, 2026.
Agent Identity
MCP Security
OAuth Delegation
Why MCP Servers Need Execution Sandboxing (And Why Your Current Stack Isn't Enough)
By Om-Shree-0709 on June 30, 2026.
Agentic Ai
Prompt Injection
WebAssembly

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/HenryLinyy/local-llm-mcp'

If you have feedback or need assistance with the MCP directory API, please join our Discord server

local-llm-mcp

Why I built this

Try it in three commands

The mental model

What I hand the intern / what stays on my desk

How I actually use it

Bring any intern you like

The tools it exposes

The intern is on a short leash

So how much does it save?

FAQ

My other experiments in this space

Contributing

License

Maintenance

Resources

Looking for Admin?

Tools

Related MCP Servers

CodeBrain

local-executor-mcp

ollama-handoff

local-agent

Related MCP Connectors

Latest Blog Posts

MCP directory API