LocalLama MCP Server
Integrates with Ollama to use local LLM models for task routing, allowing cost-effective and low-latency inference.
Local-first, provider-neutral Model Context Protocol server for coding-agent workflows. Routes tasks across local models (Ollama, LM Studio, llama.cpp), free OpenRouter models, and paid frontier models using cost, latency, context capacity, and benchmark history.
Node.js: >=22
⚠️ Early / experimental — not yet a stable release. This project is under active, rapid development and has not been fully verified end-to-end. MCP tool signatures, configuration, and behavior may change between releases without notice.
Version numbers follow SemVer mechanically (they're derived from Conventional Commit messages, not hand-picked), so a 1.x number signals only "a public surface exists"; it is not a promise of stability or completeness. If you depend on this server, pin to an exact version.
Tagged releases on main are the relatively safer builds.
The testing channel publishes bleeding-edge pre-releases (x.y.z-testing.n) for trying unproven changes early.
Overview
LocalLama MCP reduces token costs without sacrificing quality. Tasks are queued asynchronously — route_task returns a task_id immediately; callers poll get_task_status for results. The decision engine chooses local → free → paid based on measured provider capabilities and configurable thresholds.
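The queue-then-poll flow described above can be sketched as a small client-side helper. This is an illustration only: `ToolCaller` is a hypothetical stand-in for whatever function your MCP client uses to invoke a tool, and the argument names (`task`, `task_id`) are assumptions based on the payloads shown in this README rather than a confirmed schema.

```typescript
type TaskState = "pending" | "in_progress" | "completed" | "failed";

interface TaskStatus {
  status: TaskState;
  result?: unknown;
}

// Hypothetical transport: any function that invokes an MCP tool by name.
type ToolCaller = (name: string, args: Record<string, unknown>) => Promise<unknown>;

// Queue a task, then poll until it reaches a terminal state or the poll budget runs out.
async function runTaskToCompletion(
  callTool: ToolCaller,
  task: string,
  intervalMs = 1000,
  maxPolls = 60,
): Promise<TaskStatus> {
  // route_task returns immediately with a task_id (argument shape assumed).
  const queued = (await callTool("route_task", { task })) as { task_id: string };

  for (let attempt = 0; attempt < maxPolls; attempt++) {
    const status = (await callTool("get_task_status", {
      task_id: queued.task_id,
    })) as TaskStatus;

    if (status.status === "completed" || status.status === "failed") {
      return status;
    }
    // Wait before the next poll so the server is not hammered.
    await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error(`Task ${queued.task_id} did not finish after ${maxPolls} polls`);
}
```

A bounded poll count keeps a stuck task from blocking the caller forever; a real client may prefer exponential backoff or the live monitoring channel instead.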
Supported MCP clients: Codex, Claude Code, Claw Code, Cursor, GitHub Copilot Agent mode, and any generic MCP stdio client.
Requirements
Node.js 22+
npm
At least one of: Ollama, LM Studio, llama.cpp server, or an OpenRouter API key
Installation
git clone https://github.com/Heratiki/locallama-mcp.git
cd locallama-mcp
npm install
npm run build
Configuration
Copy .env.example to .env and edit with your values. The server resolves .env from its own root directory (or LOCALLAMA_ROOT_DIR when set), not from the MCP host's CWD.
# Local LLM Endpoints
LM_STUDIO_ENDPOINT=http://localhost:1234/v1
OLLAMA_ENDPOINT=http://localhost:11434/api
# LLAMA_CPP_ENDPOINT=http://localhost:8080 # leave unset to disable
# Routing thresholds
DEFAULT_LOCAL_MODEL=qwen2.5-coder-3b-instruct
TOKEN_THRESHOLD=1500
COST_THRESHOLD=0.02
QUALITY_THRESHOLD=0.7
# Provider concurrency
PROVIDER_HEALTH_PROBE_INTERVAL_MS=60000
PROVIDER_MAX_CONCURRENT_LOCAL=1
PROVIDER_MAX_CONCURRENT_REMOTE=5
PROVIDER_TIMEOUT_MS=120000
OLLAMA_TIMEOUT=120
# Code search (native BM25, no Python required)
CODE_SEARCH_ENABLED=true
CODE_SEARCH_EXCLUDE_PATTERNS=["node_modules/**","dist/**",".git/**"]
CODE_SEARCH_INDEX_ON_START=true
CODE_SEARCH_REINDEX_INTERVAL=3600
# Benchmarks
BENCHMARK_RUNS_PER_TASK=3
BENCHMARK_PARALLEL=false
BENCHMARK_MAX_PARALLEL_TASKS=2
BENCHMARK_TASK_TIMEOUT=60000
BENCHMARK_SAVE_RESULTS=true
BENCHMARK_RESULTS_PATH=./benchmark-results
RELIABLE_BENCHMARK_COUNT=3
MIN_VALIDATOR_SCORE=0.6
VALIDATION_RETRY_BUDGET=1
# Lock file
LOCK_FILE_CHECK_ACTIVE_PROCESS=true
REMOVE_STALE_LOCK_FILES=true
# OpenRouter (optional)
OPENROUTER_API_KEY=your_openrouter_api_key_here
OPENROUTER_FREE_ONLY=false
# Logging
LOG_LEVEL=debug
# Operational testing
# EXPECT_LOCAL_PROVIDER_DOWN=true
Key environment variables
Variable | Default | Description |
LM_STUDIO_ENDPOINT | — | LM Studio API base URL |
OLLAMA_ENDPOINT | — | Ollama API base URL |
LLAMA_CPP_ENDPOINT | — | llama-server URL; leave unset to disable provider |
DEFAULT_LOCAL_MODEL | — | Model name used when offloading to local provider |
TOKEN_THRESHOLD | | Token count above which local offload is considered |
COST_THRESHOLD | | USD cost above which local offload is preferred |
QUALITY_THRESHOLD | | Quality score below which paid API is always used |
RELIABLE_BENCHMARK_COUNT | | Benchmark runs required before empirical scores are treated as fully reliable |
MIN_VALIDATOR_SCORE | | Minimum validation score required before a model is eligible for external validation |
VALIDATION_RETRY_BUDGET | | Validation retry attempts allowed after an initial failed validation |
PROVIDER_MAX_CONCURRENT_LOCAL | | Shared local execution slot count |
PROVIDER_MAX_CONCURRENT_REMOTE | | Per-remote-provider slot count |
OPENROUTER_API_KEY | — | Enables OpenRouter provider and related tools |
OPENROUTER_FREE_ONLY | | Restrict OpenRouter to free-tier models only |
| — | Set |
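To make the threshold descriptions concrete, here is one possible reading of how the three routing thresholds could combine into a local → free → paid choice. It is a hedged sketch, not the server's actual decision engine: `TaskEstimate`, `chooseTier`, and the exact precedence of the checks are assumptions made for illustration.

```typescript
type Tier = "local" | "free" | "paid";

interface Thresholds {
  tokenThreshold: number;   // TOKEN_THRESHOLD
  costThreshold: number;    // COST_THRESHOLD (USD)
  qualityThreshold: number; // QUALITY_THRESHOLD
}

// Hypothetical per-task estimate; field names are illustrative only.
interface TaskEstimate {
  tokenCount: number;
  estimatedCostUsd: number;
  localQualityScore: number;
  localAvailable: boolean;
  freeAvailable: boolean;
}

function chooseTier(task: TaskEstimate, t: Thresholds): Tier {
  // Below the quality bar, the paid API is always used.
  if (task.localQualityScore < t.qualityThreshold) return "paid";

  // Offloading is only considered for tasks that are large or costly enough.
  const worthOffloading =
    task.tokenCount > t.tokenThreshold || task.estimatedCostUsd > t.costThreshold;
  if (!worthOffloading) return "paid";

  // Prefer local, then free remote models, then paid as the last resort.
  if (task.localAvailable) return "local";
  if (task.freeAvailable) return "free";
  return "paid";
}
```

The real engine also weighs latency, context capacity, and benchmark history, which this sketch leaves out.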
MCP Client Configuration
Build the server, then point your MCP client at node dist/index.js:
{
  "mcpServers": {
    "locallama": {
      "command": "node",
      "args": ["/path/to/locallama-mcp/dist/index.js"],
      "env": {
        "LM_STUDIO_ENDPOINT": "http://localhost:1234/v1",
        "OLLAMA_ENDPOINT": "http://localhost:11434/api",
        "DEFAULT_LOCAL_MODEL": "qwen2.5-coder-3b-instruct",
        "TOKEN_THRESHOLD": "1500",
        "COST_THRESHOLD": "0.02",
        "QUALITY_THRESHOLD": "0.7",
        "OPENROUTER_API_KEY": "your_openrouter_api_key_here"
      }
    }
  }
}
Claude Code users can place this in .mcp.json (project-scoped) or ~/.claude/settings.json (global).
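Values in the env block reach the server as strings, so numeric thresholds need parsing and validation before use. The sketch below shows one way to do that. `loadThresholds` and `readNumber` are hypothetical helpers, and the fallback numbers are simply the values from the sample .env earlier in this README; they are assumed here and not confirmed as the server's built-in defaults.

```typescript
interface RoutingThresholds {
  tokenThreshold: number;
  costThreshold: number;
  qualityThreshold: number;
}

type Env = Record<string, string | undefined>;

// Read one numeric variable, falling back when it is unset or blank.
function readNumber(env: Env, name: string, fallback: number): number {
  const raw = env[name];
  if (raw === undefined || raw.trim() === "") return fallback;
  const value = Number(raw);
  if (!Number.isFinite(value)) {
    throw new Error(`${name} must be a number, got "${raw}"`);
  }
  return value;
}

// Fallbacks mirror the sample .env values (an assumption, not documented defaults).
function loadThresholds(env: Env): RoutingThresholds {
  return {
    tokenThreshold: readNumber(env, "TOKEN_THRESHOLD", 1500),
    costThreshold: readNumber(env, "COST_THRESHOLD", 0.02),
    qualityThreshold: readNumber(env, "QUALITY_THRESHOLD", 0.7),
  };
}
```

Failing fast on a malformed value makes a typo in the client config visible at startup instead of silently skewing routing.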
Tools
Core tools (always available)
Tool | Inputs | Description |
|
| Queue a task asynchronously. Returns |
|
| Poll a non-blocking |
|
| Cancel all queued or in-progress jobs for a task. |
|
| Cancel a single background job. |
|
| Heuristic routing check with no LLM calls. Returns model/provider recommendation without executing the task. |
|
| Estimate USD cost before calling |
|
| Benchmark one task across local vs paid models. |
|
| Benchmark multiple tasks in one call. |
|
| Run built-in benchmark suites against a specific model. Persists results to |
|
| Index code with the native BM25 engine (no Python required). |
|
| Search indexed code using native BM25. |
| — | Reload |
| — | Check whether the server is up to date with the latest GitHub commit. |
| — | Pull latest changes from GitHub, run |
OpenRouter tools (require OPENROUTER_API_KEY)
Tool | Inputs | Description |
| — | List free models available from OpenRouter. |
| — | Clear cached model list and force a fresh fetch. |
|
| Benchmark free OpenRouter models. Results written to |
|
| Set a custom prompting strategy for an OpenRouter model. |
Async task flow
route_task → { task_id }
↓ poll
get_task_status → { status: "pending" | "in_progress" | "completed" | "failed", result? }
When local providers are contended by benchmark workloads, route_task surfaces contention metadata:
{
  "task_id": "...",
  "status": "queued",
  "queue_position": 2,
  "benchmark_contention": {
    "local_slot_contended": true,
    "active_benchmark_runs": 1,
    "queued_benchmark_runs": 2,
    "message": "Local execution slot currently contended by benchmark workloads."
  }
}
Resources
Static resources
URI | Description |
| Server status |
| Available local models |
| Currently active jobs |
| Memory bank file list (if directory exists) |
| All OpenRouter models (requires API key) |
| Free OpenRouter models (requires API key) |
| OpenRouter integration status (requires API key) |
Resource templates
URI template | Description |
| Token usage and costs for a specific API (e.g. |
| Progress for a specific job |
| Details for an OpenRouter model (requires API key) |
| Prompting strategy for an OpenRouter model (requires API key) |
Usage
Starting the server
npm start
A lock file prevents multiple instances. Stale locks from crashed processes are detected and cleaned up automatically.
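The stale-lock behaviour can be pictured with the following sketch. It assumes, purely for illustration, that the lock file stores JSON containing the owning process ID; the real file format and the helper names (`parseLockPid`, `decideOnLock`) are not taken from the project. The liveness probe is injected as a function; in Node.js, `process.kill(pid, 0)` is one common way to test whether a process exists.

```typescript
type LockDecision = "start" | "remove-stale-and-start" | "refuse";

// Extract a PID from lock file content (JSON format is an assumption).
function parseLockPid(content: string): number | null {
  try {
    const data = JSON.parse(content) as { pid?: unknown };
    const pid = data.pid;
    return typeof pid === "number" && Number.isInteger(pid) && pid > 0 ? pid : null;
  } catch {
    return null; // Unreadable lock content is treated as stale below.
  }
}

function decideOnLock(
  lockContent: string | null,          // null when no lock file exists
  isAlive: (pid: number) => boolean,   // injected process-liveness probe
  removeStale: boolean,                // mirrors REMOVE_STALE_LOCK_FILES
): LockDecision {
  if (lockContent === null) return "start";

  const pid = parseLockPid(lockContent);
  const stale = pid === null || !isAlive(pid);

  if (!stale) return "refuse";         // another live instance owns the lock
  return removeStale ? "remove-stale-and-start" : "refuse";
}
```

Keeping the probe injectable makes the decision logic testable without touching real processes or the file system.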
Running benchmarks
npm run benchmark
npm run benchmark:comprehensive
Results are stored in benchmark-results/ as JSON and Markdown summaries.
Dashboard
When the server is running, a web dashboard is available at http://localhost:3001 (server-local).
Features:
Real-time job queue with status, provider/model, and queue position
Task monitoring with per-job details and ETA
Manual route_task submission form
Task and job cancellation
Benchmark history
REST API endpoints:
Method | Path | Description |
|
| Queue summary and jobs. Filters: |
|
| Recent tasks. Filters: |
|
| Detailed task status |
|
| Submit a task ( |
|
| Cancel a task |
|
| Cancel a job |
Example submission:
curl -X POST http://localhost:3001/api/tasks \
-H "Content-Type: application/json" \
-d '{"task": "Refactor parser for readability", "context_length": 4096, "complexity": 0.6, "priority": "quality"}'
Live monitoring metadata
When the JobTracker WebSocket server is running, task-executing tools include:
{
  "task_id": "task-123",
  "monitoring": {
    "websocketUrl": "ws://127.0.0.1:8081",
    "activeJobsUri": "locallama://jobs/active",
    "jobProgressUriTemplate": "locallama://jobs/progress/{jobId}",
    "note": "Connect to websocketUrl for live updates, or use MCP resources."
  }
}
websocketUrl is scope: server-local; in SSH/container/Codespaces/WSL setups, forward the port before connecting.
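A client that wants to use this metadata has to check that it is present, since it only appears when the WebSocket server is running. The sketch below validates the monitoring block and expands the job progress URI template. The helper names (`getMonitoring`, `jobProgressUri`) are hypothetical; the field names come from the example payload above.

```typescript
interface MonitoringInfo {
  websocketUrl: string;
  activeJobsUri: string;
  jobProgressUriTemplate: string;
}

// Return the monitoring block if the tool result carries a well-formed one.
function getMonitoring(toolResult: unknown): MonitoringInfo | null {
  if (typeof toolResult !== "object" || toolResult === null) return null;

  const block = (toolResult as { monitoring?: unknown }).monitoring;
  if (typeof block !== "object" || block === null) return null;

  const fields = block as Record<string, unknown>;
  const { websocketUrl, activeJobsUri, jobProgressUriTemplate } = fields;
  if (
    typeof websocketUrl !== "string" ||
    typeof activeJobsUri !== "string" ||
    typeof jobProgressUriTemplate !== "string"
  ) {
    return null;
  }
  return { websocketUrl, activeJobsUri, jobProgressUriTemplate };
}

// Fill the {jobId} placeholder in the progress URI template.
function jobProgressUri(info: MonitoringInfo, jobId: string): string {
  return info.jobProgressUriTemplate.replace("{jobId}", encodeURIComponent(jobId));
}
```

Returning null instead of throwing lets callers fall back to plain polling when monitoring is unavailable.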
_server_reminder ambient metadata
Tools attach a _server_reminder field at most once every 30 minutes to surface monitoring info:
{
  "_server_reminder": {
    "schemaVersion": 1,
    "kind": "monitoring-reminder",
    "status": "reachable",
    "scope": "server-local",
    "message": "Optional monitoring available from MCP server host.",
    "monitoringUrl": "http://127.0.0.1:3001",
    "lastCheckedAt": 1747699200000
  }
}
Remote access
If your MCP client is not on the same machine as the server:
# SSH
ssh -L 8081:127.0.0.1:8081 -L 3001:127.0.0.1:3001 user@host
Dev Containers / Codespaces: forward ports 8081 (WebSocket) and 3001 (dashboard) via the VS Code Ports view.
WSL client + WSL server: use the WebSocket URL directly. Windows client + WSL server: forward port 8081 via VS Code or a local tunnel.
Provider integrations
Ollama
Set OLLAMA_ENDPOINT in .env. The server probes for available models on startup.
LM Studio
Set LM_STUDIO_ENDPOINT in .env. Exposes an OpenAI-compatible API.
llama.cpp (llama-server)
# Single model
llama-server -m /path/to/model.gguf --port 8080
# Router mode (multiple models)
llama-server --model /path/model1.gguf --model /path/model2.gguf --port 8080
Set LLAMA_CPP_ENDPOINT=http://localhost:8080 in .env. If the endpoint is unset or unreachable, the provider initialises silently and other providers are unaffected. The server does not manage the llama-server process lifecycle.
OpenRouter
Set OPENROUTER_API_KEY. The server fetches ~240 available models on startup (30+ free). Use clear_openrouter_tracking to force a refresh. Set OPENROUTER_FREE_ONLY=true to restrict to free-tier models.
Code search
Code search uses a native TypeScript BM25 engine — no Python or external dependencies required.
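For readers unfamiliar with BM25, the following is a minimal, self-contained sketch of the standard Okapi BM25 scoring formula. It is not the project's engine: the tokenizer, the function name `bm25Scores`, and the parameter values k1 = 1.2 and b = 0.75 are conventional choices assumed here for illustration.

```typescript
// Lowercase and split on non-identifier characters (a deliberately simple tokenizer).
function tokenize(text: string): string[] {
  return text.toLowerCase().match(/[a-z0-9_]+/g) ?? [];
}

// Score every document against the query with Okapi BM25.
function bm25Scores(docs: string[], query: string, k1 = 1.2, b = 0.75): number[] {
  const tokenized = docs.map(tokenize);
  const totalLength = tokenized.reduce((sum, doc) => sum + doc.length, 0);
  const avgLength = totalLength / Math.max(tokenized.length, 1);
  const queryTerms = Array.from(new Set(tokenize(query)));

  // Document frequency: in how many documents does each query term occur?
  const docFreq = new Map<string, number>();
  for (const term of queryTerms) {
    docFreq.set(term, tokenized.filter((doc) => doc.includes(term)).length);
  }

  return tokenized.map((doc) => {
    let score = 0;
    for (const term of queryTerms) {
      const tf = doc.filter((token) => token === term).length;
      if (tf === 0) continue;
      const n = docFreq.get(term) ?? 0;
      // Smoothed IDF that stays non-negative for very common terms.
      const idf = Math.log(1 + (tokenized.length - n + 0.5) / (n + 0.5));
      // Length normalisation: longer-than-average documents are penalised.
      const norm = tf + k1 * (1 - b + b * (doc.length / avgLength));
      score += (idf * tf * (k1 + 1)) / norm;
    }
    return score;
  });
}
```

Documents that never mention a query term score zero, and between two documents with the same term frequency the shorter one ranks higher.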
# Via MCP tool
retriv_init { "directories": ["/path/to/repo"], "force_reindex": true }
retriv_search { "query": "pagination logic" }
Development
npm run build # compile TypeScript + copy assets
npm start # run compiled server
npm run dev # TypeScript watch mode
npm test # build + run Jest (23 suites, 186 tests)
npm run lint # ESLint (note: eslint-plugin-import not installed; lint currently fails)
npm run lint:fix # ESLint with auto-fix
All test files mock server state to prevent multiple real instances during test runs.
Architecture
src/
index.ts entry point, lock file, MCP lifecycle
modules/
api-integration/ tool definitions, resources, routing adapters
decision-engine/ task analysis, model selection, coordination
cost-monitor/ token accounting, cost estimation
benchmark/ execution, scoring, summaries, DB storage
lm-studio/ LM Studio provider
ollama/ Ollama provider
llama-cpp/ llama-server provider
openrouter/ OpenRouter provider
core/provider/ shared provider registry and execution queue
updater/ self-update logic (check_for_updates, update_server)
job-store/ persistent Task/Job store
websocket-server/ live monitoring side channel
Decision engine uses two model data stores:
ModelRegistry + CapabilityDetector: benchmark-derived capability scores (authoritative for full routing)
modelsDbService: heuristic performance data seeded from ModelRegistry at startup; used by preemptiveRouting()
Project docs
File | Purpose |
| Shared operating guide for all coding agents |
| Current snapshot of completed and in-progress work |
| Long-form modernization backdrop |
| Active roadmap tasks |
| Branch implementation plan |
| Live test record and verified behavior |
| Real-world MCP test results and known open bugs |
| Core design principles and constraints |
| Historical append-only project memory |
Troubleshooting
Server won't start — lock file detected
Check if another instance is running (ps aux | grep locallama).
Stale locks from crashes are cleaned up automatically (REMOVE_STALE_LOCK_FILES=true).
If needed, manually remove locallama.lock from the project root.
OpenRouter models not appearing
Use clear_openrouter_tracking through the MCP interface to force a fresh fetch.
npm run lint fails
eslint-plugin-import is referenced in the config but not installed. Known issue. Build and tests are unaffected.
Security notes
API keys belong in .env, which is excluded from version control.
All log output goes to stderr; stdout is reserved for MCP JSON-RPC. Never write non-JSON to stdout.
Treat MCP tools as model-controlled surfaces. Avoid mutations without user approval.
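The stderr-only logging rule matters because a stdio MCP client parses everything on stdout as JSON-RPC. A minimal sketch of a level-filtered logger that respects this rule is shown below. `createLogger` is a hypothetical helper, not the project's logger; it relies only on the fact that `console.error` writes to stderr in Node.js.

```typescript
type LogLevel = "debug" | "info" | "warn" | "error";

const LEVEL_ORDER: Record<LogLevel, number> = {
  debug: 10,
  info: 20,
  warn: 30,
  error: 40,
};

// Build a logger that drops messages below minLevel and never touches stdout.
function createLogger(
  minLevel: LogLevel,
  write: (line: string) => void = (line) => console.error(line), // stderr by default
): (level: LogLevel, message: string) => void {
  return (level, message) => {
    if (LEVEL_ORDER[level] < LEVEL_ORDER[minLevel]) return;
    write(`[${level}] ${message}`);
  };
}
```

Usage: `const log = createLogger("info"); log("warn", "provider unreachable");` prints to stderr and leaves the JSON-RPC stream on stdout untouched.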
License
ISC