
Which integrations are available for this server?

Integration for audio generation and text-to-speech capabilities.
Provides access to Gemini 2.5 Pro and 2.5 Flash models with a free tier (1M tokens/day), optimized for generation tasks and long-context processing.
Enables routing to locally-hosted models for zero-cost, privacy-preserving, offline inference as the first tier in fallback chains.
Provides access to GPT-4o, GPT-4o-mini, and o3 models for code generation, analysis, and reasoning tasks.
Integration for research and current events using Sonar and Sonar Pro search-augmented models to get factual, up-to-date information.

ypollak2/llm-router

by ypollak2


Python

Hybrid

This server is an intelligent LLM routing system that classifies prompts and directs them to the cheapest capable model, saving 35–80% on AI costs while maintaining quality. Key capabilities include:

Smart Routing & Classification

llm_route, llm_auto, llm_stream, llm_classify — classify prompt complexity and route to optimal LLMs (budget/balanced/premium tiers) with budget-pressure awareness
llm_reroute, llm_approve_route — override or approve pending high-cost routing decisions
llm_select_agent — pick the best CLI agent (Claude Code, Codex, Gemini CLI) for session-level routing

Task-Specific LLM Tools

llm_query, llm_code, llm_analyze, llm_generate, llm_research, llm_edit — route prompts tailored to general queries, coding, analysis, creative generation, web-grounded research, or code editing

Media Generation

llm_image, llm_video, llm_audio — generate images (DALL-E, Flux, Stable Diffusion), video (Runway, Kling, Veo), or speech (ElevenLabs, OpenAI TTS)

Multi-Step Orchestration

llm_orchestrate, llm_pipeline_templates — decompose complex tasks into multi-LLM pipelines (research reports, competitive analysis, etc.)

Cost, Budget & Usage Monitoring

llm_usage, llm_savings, llm_gain, llm_session_spend, llm_budget, llm_quota_status — real-time cost tracking, savings dashboards, quota balances, and anomaly warnings across Claude, Codex, Gemini, and external APIs
llm_check_usage, llm_update_usage, llm_refresh_claude_usage — manage Claude subscription usage

Quality & Performance Analytics

llm_quality_report, llm_quality_guard, llm_benchmark, llm_model_eval, llm_model_usage, llm_model_export — routing accuracy metrics, model quality scores, degradation alerts, and exportable tracking data
llm_rate — rate routing decisions to improve future routing

Health, Providers & Configuration

llm_health, llm_hook_health, llm_providers, llm_setup — check provider/hook health and configure API keys (Ollama, OpenAI, Anthropic, Google, DeepSeek, Mistral, Groq, Perplexity, and more)
llm_set_profile, llm_policy — switch routing profiles and view active policies

Filesystem Operations

llm_fs_find, llm_fs_rename, llm_fs_edit_many, llm_fs_analyze_context — use cheap models to find files, generate rename commands, perform bulk edits, and analyze workspace context

Team & Collaboration

llm_team_report, llm_team_push, llm_digest, llm_dashboard — team savings reports, Slack/Discord/Telegram webhooks, spend spike detection, and a local web dashboard
llm_share_profile, llm_import_profile — share or import learned routing profiles with the community

Session & Cache Management

llm_save_session — summarize and persist session context for cross-session awareness
llm_cache_stats, llm_cache_clear — manage the prompt classification cache

Agoragentic Marketplace

agoragentic_task, agoragentic_browse, agoragentic_wallet, agoragentic_status — execute tasks, browse services, and manage USDC wallet on the Agoragentic capability marketplace

pip install llm-routing

Why People Install This

AI coding tools send too many prompts to premium models by default.

That means:

You waste paid tokens on simple questions
You burn through Claude, Gemini, or OpenAI quota faster than necessary
You stop working when one provider is rate-limited or down

llm-router sits between your coding tool and your model providers. It classifies each prompt, tries the cheapest capable model first, and falls back automatically when needed.

You keep the same workflow. The router changes the model choice underneath.


What You Get

Route trivial prompts to free or cheap models first
Keep premium models for the prompts that actually need them
Fall back across providers automatically
Track usage and estimated savings locally
Run everything on your own machine
Cost-inverted SUBSCRIPTION_LOCAL routing — free/local first for simple & moderate prompts, your one paid seat first for complex ones, with the seat demoted when its quota is strained (opt-in via LLM_ROUTER_SUBSCRIPTION_PROVIDER)
Keep secrets local — a prompt containing an API key/token/private key routes to local models only, fail-closed so it never reaches an external API
See it working — a surface_status line / terminal title / OS notification showing the last model routed, savings, and health, for hosts without a native statusline
Session-end summary — savings vs baseline, tier mix, per-provider cost, latency p50/p95/p99, and top routes, in Markdown or a rich panel
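
The "keep secrets local" behavior in the list above can be pictured with a small sketch. This is not the router's actual code: the regex patterns, the `LOCAL_PROVIDERS` set, and the function names are all hypothetical, chosen only to illustrate what fail-closed means (refuse the prompt rather than let it reach an external API).

```python
import re

# Hypothetical patterns; the router's real detection rules are not shown in this document.
SECRET_PATTERNS = [
    re.compile(r"sk-[A-Za-z0-9_-]{16,}"),               # OpenAI-style API keys
    re.compile(r"AIza[0-9A-Za-z_-]{20,}"),              # Google-style API keys
    re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),  # PEM private key headers
]

LOCAL_PROVIDERS = {"ollama"}  # assumption: only Ollama counts as local here


def contains_secret(prompt: str) -> bool:
    """True if any of the illustrative secret patterns appears in the prompt."""
    return any(p.search(prompt) for p in SECRET_PATTERNS)


def allowed_providers(prompt: str, configured: list[str]) -> list[str]:
    """Return the providers a prompt may be sent to, preserving configured order."""
    if not contains_secret(prompt):
        return list(configured)
    local = [p for p in configured if p in LOCAL_PROVIDERS]
    if not local:
        # Fail closed: refuse rather than fall back to an external API.
        raise PermissionError("prompt contains a secret and no local provider is configured")
    return local
```

The design point is the `raise`: when no local provider is configured, the sketch stops instead of quietly widening the provider set.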

Ranked #8 on RouterArena

llm-router was independently benchmarked and ranked #8 on RouterArena — a community leaderboard that evaluates model routers on routing accuracy, latency, cost efficiency, and fallback reliability.

Need Enterprise-Grade Routing? Meet Chuzom

Chuzom is the enterprise-ready evolution of the ideas in llm-router. If you're deploying at team or org scale, Chuzom adds the layer of control, governance, and integration that individual-developer tools don't need but enterprises do.

Capability	`llm-router`	Chuzom
Free-first routing chain	✅	✅
Claude / Codex / Gemini CLI hooks	✅	✅
MCP tool interface	✅	✅
Local-only, no proxy	✅	✅
Team-wide policy enforcement	—	✅
Audit log & compliance export	—	✅
SSO / SAML / OIDC integration	—	✅
Role-based provider access controls	—	✅
Multi-workspace / org model budgets	—	✅
SLA-backed support	—	✅

llm-router is the right choice for individual developers and small teams who want local cost savings with zero ops overhead. For organizations that need governance, auditability, and enterprise integrations, Chuzom is built for that.

Quick Start

1. Install

pip install llm-routing
llm-router install

Package name: llm-routing on PyPI. CLI command: llm-router.

2. Add providers (optional)

export OPENAI_API_KEY="sk-..."          # GPT-4o, o3
export GEMINI_API_KEY="AIza..."         # Gemini Flash/Pro (free tier available)
export OLLAMA_BASE_URL="http://localhost:11434"  # Local models (free)
export OPENROUTER_API_KEY="sk-or-v1-…"  # 343 OpenRouter models (qwen, deepseek, grok, …)

Works with zero API keys on Claude Code Pro/Max subscriptions — routing uses MCP tools that call external models only when beneficial. Add OPENROUTER_API_KEY to unlock the open-weight workhorse pool used by the cost_aggressive policy.

3. Verify

llm-router health            # Check provider connectivity

If you already use Claude Code, Codex, or Gemini CLI, keep your existing workflow and let llm-router choose models underneath it.

Example Routing

Prompt	Routed to
"What does this Python error mean?"	Ollama / Gemini Flash / Codex
"Refactor this endpoint"	GPT-4o / Gemini Pro
"Design a distributed tracing strategy"	o3 / Claude Opus

The exact chain depends on your configured providers, budget profile, and routing policy.

Works With

Tool	Mode	Typical Savings
Claude Code	Full auto-routing via hooks	60–80%
Codex CLI	Full auto-routing via hooks	60–80%
Gemini CLI	Full auto-routing via hooks	50–70%
VS Code / Cursor	Manual MCP tools	30–50%
Any MCP client	Manual MCP tools	Varies

Full auto-routing means hooks intercept prompts and route automatically with no workflow change.
Manual MCP tools means routing is available on demand through tools such as llm_query.

llm-router install                    # Claude Code (default)
llm-router install --host codex       # Codex CLI
llm-router install --host gemini-cli  # Gemini CLI
llm-router install --host vscode      # VS Code
llm-router install --host cursor      # Cursor

See guide/HOST_SUPPORT_MATRIX.md for full details on each host.

Protect Claude Code 5-hour quota

For a strict boundary that never automatically falls through to native Claude, configure:

# ~/.llm-router/routing.yaml
enforce: smart
mode: zero_claude

In zero_claude mode, prompts either complete through direct external execution or are blocked before Claude Code invokes its model. Prefix a prompt with claude: when you intentionally want a native Claude turn.
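
As a rough illustration of the zero_claude rule described above, the three possible outcomes can be written as one decision function. The `Decision` enum, the function name, and the `external_available` flag are hypothetical. Tolerating leading whitespace before the `claude:` prefix is also an assumption.

```python
from enum import Enum


class Decision(Enum):
    NATIVE_CLAUDE = "native_claude"  # user explicitly asked for a native Claude turn
    EXTERNAL = "external"            # completed through direct external execution
    BLOCKED = "blocked"              # stopped before Claude Code invokes its model


def zero_claude_decision(prompt: str, external_available: bool) -> Decision:
    """Decide what happens to a prompt under mode: zero_claude (illustrative only)."""
    # The "claude:" prefix is the documented opt-in for a native Claude turn.
    if prompt.lstrip().startswith("claude:"):
        return Decision.NATIVE_CLAUDE
    # Otherwise the prompt never falls through to Claude automatically.
    return Decision.EXTERNAL if external_available else Decision.BLOCKED
```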

How It Works

User prompt
    │
    ▼
┌──────────────────────┐
│ Complexity Classifier │  ← Heuristic (free, instant) or Ollama/Flash ($0.0001)
└──────────┬───────────┘
           │
           ▼
┌──────────────────────┐
│  Free-First Router   │  ← Tries cheapest model first, walks up the chain
│                      │
│  Ollama (free)       │
│  → Codex (prepaid)   │
│  → Gemini Flash      │
│  → GPT-4o / Claude   │
└──────────┬───────────┘
           │
           ▼
┌──────────────────────┐
│  Guards (parallel)   │  ← Circuit breaker, budget pressure, quality check
└──────────┬───────────┘
           │
           ▼
      Response + cost logged to local SQLite

Classification is free for most prompts: regex heuristics resolve about 70% of them instantly. Ambiguous prompts fall through to a local Ollama model or Gemini Flash, at roughly $0.0001 each.
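
The flow in the diagram can be approximated in a few lines of Python. Treat this as a minimal sketch under stated assumptions, not the router's implementation: the regexes, the `START` tier offsets, and the `call` callback are hypothetical. The guards also run sequentially here, whereas the diagram shows them in parallel.

```python
import re
from typing import Callable

# Hypothetical heuristics; the router's actual regex rules are not listed here.
SIMPLE = re.compile(r"\b(what does|what is|explain|rename|typo)\b", re.I)
COMPLEX = re.compile(r"\b(design|architect|strategy|distributed)\b", re.I)


def classify(prompt: str) -> str:
    """Free, instant heuristic pass over the prompt."""
    if COMPLEX.search(prompt):
        return "complex"
    if SIMPLE.search(prompt):
        return "simple"
    return "moderate"  # a real router would ask a cheap model at this point


# Cheapest first, mirroring the diagram. Where each tier starts is an assumption.
CHAIN = ["ollama", "codex", "gemini-flash", "gpt-4o"]
START = {"simple": 0, "moderate": 2, "complex": 3}


def route(prompt: str, call: Callable[[str, str], str], open_circuits=frozenset()):
    """Walk up the chain from the tier's starting point; return (model, reply)."""
    errors = []
    for model in CHAIN[START[classify(prompt)]:]:
        if model in open_circuits:  # circuit breaker: skip providers known to be down
            continue
        try:
            return model, call(model, prompt)
        except Exception as exc:    # any provider failure moves up the chain
            errors.append(f"{model}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))
```

Here `call(model, prompt)` stands in for whatever client sends the prompt to a provider. Passing it in keeps the sketch testable without network access.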

What You Can Do

Use case	How
Route simple questions to free local models	Auto (hooks) or `llm_query`
Protect Claude subscription quota	Budget pressure monitoring + auto-downgrade
Fall back across providers on failure	Automatic chain with circuit breakers
Track token spend and savings	`llm_usage`, `llm_savings`, session-end reports
Enforce routing policy for your team	`LLM_ROUTER_POLICY=aggressive`
Generate images/video/audio	`llm_image`, `llm_video`, `llm_audio`
Run multi-step research pipelines	`llm_orchestrate` with templates
Bulk-edit files with cheap models	`llm_fs_edit_many`
Compare two routing policies	`llm-router policy diff <a> <b>` (v10)
Benchmark + track Arena score	`llm-router benchmark run` / `regress` (v10)

CLI (operational commands)

Beyond the install + auth flow, llm-router ships several operational subcommands:

llm-router benchmark list                              # list registered benchmark runners
llm-router benchmark run routerarena --split sub_10    # route a dataset and score it
llm-router benchmark regress --policy <p> --benchmark <b>  # detect score regressions
llm-router policy diff balanced cost_aggressive        # per-prompt model + cost delta

These commands power the routing self-improvement loop. Routing decisions are persisted to a SQLite outcomes table, benchmark runs against a reference dataset establish baseline scores, and regress flags any score drop greater than 0.005 between releases.
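
The regression check amounts to a threshold comparison between two sets of scores. The sketch below assumes scores are stored as plain `{benchmark: score}` dictionaries. The function name and data shape are hypothetical; only the 0.005 threshold comes from the text above.

```python
REGRESSION_THRESHOLD = 0.005  # the drop size mentioned above


def find_regressions(baseline: dict[str, float], current: dict[str, float],
                     threshold: float = REGRESSION_THRESHOLD) -> dict[str, float]:
    """Return {benchmark: drop} for each benchmark whose score fell by more than threshold."""
    drops = {}
    for name, old in baseline.items():
        new = current.get(name)
        if new is None:
            continue  # benchmark not re-run for this release; nothing to compare
        drop = old - new
        if drop > threshold:
            drops[name] = round(drop, 6)
    return drops
```

A drop of exactly 0.005 is not flagged in this sketch, matching the strict "> 0.005" wording.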

Providers

Routing chains are built from your configured providers. You only need one.

Text LLM Providers

Provider	Models	Cost	Setup
Ollama	gemma4, qwen3.5, llama3, etc.	Free (local)	`OLLAMA_BASE_URL`
OpenAI	GPT-4o, o3, GPT-4o-mini	Paid API	`OPENAI_API_KEY`
Google	Gemini Flash, Pro	Free tier + paid	`GEMINI_API_KEY`
Anthropic	Claude Sonnet, Opus, Haiku	Paid API or subscription	`ANTHROPIC_API_KEY` or subscription
xAI	Grok-3	Paid API	`XAI_API_KEY`
DeepSeek	DeepSeek Chat, Reasoner	Paid API (ultra-cheap)	`DEEPSEEK_API_KEY`
Mistral	Mistral Large, Small	Paid API	`MISTRAL_API_KEY`
Cohere	Command R+	Paid API	`COHERE_API_KEY`
Perplexity	Sonar Pro (web-grounded)	Paid API	`PERPLEXITY_API_KEY`
Groq	Fast inference (Llama, Mixtral)	Free tier	`GROQ_API_KEY`
Together	Open-source models	Paid API	`TOGETHER_API_KEY`
HuggingFace	Open-source models	Free tier + paid	`HF_TOKEN`
OpenRouter	343 models (qwen3-235b, deepseek-v4-flash, grok-4.3, gemini-flash-lite, claude, gpt, …)	Paid API (one key, all providers)	`OPENROUTER_API_KEY`
Codex	GPT-5.4, o3 (prepaid desktop)	Included with Codex CLI	Auto-detected

Media Providers

Provider	Type	Setup
fal	Image (Flux), Video (Kling)	`FAL_KEY`
Stability	Image (Stable Diffusion 3)	`STABILITY_API_KEY`
ElevenLabs	Audio / TTS	`ELEVENLABS_API_KEY`
Runway	Video (Gen-3)	`RUNWAY_API_KEY`
Replicate	Various open-source models	`REPLICATE_API_TOKEN`

See guide/PROVIDERS.md for setup instructions and model recommendations.

Routing Policies

Control how aggressively the router offloads to cheap models. Policies ship as YAML files in src/llm_router/policies/ — write your own to override workhorses, subject specialists, and per-task chains.

Policy	Confidence Threshold	Typical Savings	Best For
Aggressive	2	60–75%	Maximum cost reduction
Balanced (default)	4	35–45%	Cost/quality tradeoff
Conservative	6	10–15%	Quality over cost
`cost_aggressive`	3	70–85%	OpenRouter open-weight workhorses + subject specialists. Activate with `OPENROUTER_API_KEY`. New in v10.

export LLM_ROUTER_POLICY=aggressive     # Or: balanced, conservative, cost_aggressive
export LLM_ROUTER_ENFORCE=smart          # smart | hard | soft | off
export LLM_ROUTER_PROFILE=balanced       # budget | balanced | premium | subscription_local
export LLM_ROUTER_BANDIT=on              # on (default) | off — opt out of telemetry-driven chain reorder

The cost_aggressive policy routes via OpenRouter:

export OPENROUTER_API_KEY=sk-or-v1-...
export LLM_ROUTER_POLICY=cost_aggressive
# Now: code → qwen3-coder-next, medical → gemini-flash-lite, reasoning → grok-4.3, …

See guide/POLICIES.md for the YAML schema and how to author your own policy.

LLM_ROUTER_ENFORCE controls how strictly the auto-route hook blocks direct model use:

smart — route when confident, pass through when uncertain
hard — always route, block unrouted tool calls
soft — suggest routing, never block
off — disable hook enforcement
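
One way to read the four modes is as a small decision table over the mode and whether the classifier is confident. The function below is only an illustration of that reading. Its name and the returned action labels ("route", "suggest", "pass") are hypothetical, not identifiers from the hook itself.

```python
def hook_action(mode: str, confident: bool) -> str:
    """Map an LLM_ROUTER_ENFORCE mode to an illustrative hook action."""
    if mode == "off":
        return "pass"      # hook enforcement disabled
    if mode == "soft":
        return "suggest"   # suggest routing, never block
    if mode == "hard":
        return "route"     # always route; unrouted tool calls are blocked
    if mode == "smart":
        # Route when confident, pass through when uncertain.
        return "route" if confident else "pass"
    raise ValueError(f"unknown LLM_ROUTER_ENFORCE value: {mode!r}")
```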

MCP Tools (60)

llm-router exposes 60 MCP tools organized by function:

Category	Tools	Examples
Routing & classification	7	`llm_route`, `llm_classify`, `llm_auto`, `llm_stream`
Text generation	6	`llm_query`, `llm_code`, `llm_analyze`, `llm_research`
Media generation	3	`llm_image`, `llm_video`, `llm_audio`
Pipeline orchestration	2	`llm_orchestrate`, `llm_pipeline_templates`
Admin & monitoring	20+	`llm_usage`, `llm_budget`, `llm_health`, `llm_savings`
Filesystem operations	4	`llm_fs_find`, `llm_fs_edit_many`
Subscription tracking	3	`llm_check_usage`, `llm_refresh_claude_usage`

Slim mode (LLM_ROUTER_SLIM=routing or core) reduces registered tools to save context tokens in constrained environments.

Full Tool Reference

Savings: How It Works

Savings are calculated by comparing actual spend against a baseline of routing every task to Claude Sonnet/Opus.

Methodology:

Each routed task logs: model used, tokens consumed, estimated cost
A baseline cost is computed as if the same tokens were processed by the most expensive model in the chain
Savings = (baseline - actual) / baseline

Assumptions and limitations:

Baseline assumes you would have used Opus/Sonnet for everything (worst case)
Token estimates use len(text) / 4 approximation, not exact tokenizer counts
Cost data comes from LiteLLM's pricing tables (may lag provider price changes)
Savings vary significantly by workload — code-heavy sessions route more to cheap models
The router itself adds small overhead (classification costs ~$0.0001 per ambiguous task)
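
The methodology above reduces to a few lines of arithmetic. The sketch below is a simplified illustration: it applies one flat baseline price per million tokens (a hypothetical input, not a value from LiteLLM's tables) and does not separate input from output tokens.

```python
def estimate_tokens(text: str) -> int:
    """The len(text) / 4 approximation described above, rounded down."""
    return len(text) // 4


def savings_ratio(tasks, baseline_price_per_mtok: float) -> float:
    """Compute (baseline - actual) / baseline over (text, actual_cost_usd) pairs."""
    baseline = 0.0
    actual = 0.0
    for text, cost in tasks:
        # Baseline: the same tokens priced as if sent to the most expensive model.
        baseline += estimate_tokens(text) / 1_000_000 * baseline_price_per_mtok
        actual += cost
    if baseline == 0:
        return 0.0  # nothing routed yet, so there is no baseline to compare against
    return (baseline - actual) / baseline
```

For example, a single 4,000-character task that actually cost $0.003, against a hypothetical baseline price of $15 per million tokens, gives a baseline of about $0.015 and a savings ratio of about 0.8.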

Observed range: 35–80% savings depending on policy and task mix. The "87%" figure in some docs represents a single-user peak over a specific development period, not a guaranteed outcome.

Trust, Privacy, and Local-First Design

llm-router runs entirely on your machine. There is no hosted proxy, no telemetry, no account required.

What	Where	Details
Your prompts	Sent to configured providers	Exactly like using those providers directly
API keys	`.env` or `~/.llm-router/config.yaml`	Local files, never transmitted
Usage logs	`~/.llm-router/usage.db`	Unencrypted SQLite (filesystem permissions)
Classification cache	In-memory	Cleared on process restart
Hook scripts	`~/.claude/hooks/`	Local shell scripts, inspectable

What we do:

Scrub API keys from structured logs
Detect hook deadlocks before installation
Store all data locally in ~/.llm-router/
Respect provider rate limits and TOS

What you should know:

Prompts are sent to whichever provider the router selects — review your provider's privacy policy
Usage logs (SQLite) are not encrypted at rest — use full-disk encryption if needed
The router cannot prevent model jailbreaks or prompt injection at the provider level

See SECURITY.md for responsible disclosure policy.

Configuration

Minimal setup — only configure what you have:

# Provider keys (set any combination)
export OPENAI_API_KEY="sk-proj-..."
export GEMINI_API_KEY="AIza..."
export OLLAMA_BASE_URL="http://localhost:11434"
export OLLAMA_BUDGET_MODELS="gemma4:latest,qwen3.5:latest"

# Routing behavior
export LLM_ROUTER_PROFILE="balanced"       # budget | balanced | premium | subscription_local
export LLM_ROUTER_POLICY="balanced"        # aggressive | balanced | conservative | cost_aggressive
export LLM_ROUTER_ENFORCE="smart"          # smart | hard | soft | off

For teams or environments where .env is restricted:

# User-level config (no project .env needed)
mkdir -p ~/.llm-router && chmod 700 ~/.llm-router
cat > ~/.llm-router/config.yaml << 'EOF'
openai_api_key: "sk-proj-..."
gemini_api_key: "AIza..."
ollama_base_url: "http://localhost:11434"
llm_router_profile: "balanced"
EOF
chmod 600 ~/.llm-router/config.yaml

Documentation

Document	Purpose
Quick Start (2 min)	Fastest path to working routing
Getting Started	Full setup walkthrough
Host Support Matrix	Per-host feature comparison
Providers	Provider setup and model recommendations
Tool Reference	All 60 MCP tools with examples
Architecture	Internal design and module structure
Troubleshooting	Common issues and fixes

Contributing

Contributions welcome. See CONTRIBUTING.md for full guidelines.

git clone https://github.com/ypollak2/llm-router.git
cd llm-router
uv sync --extra dev
uv run pytest tests/ -q         # Run tests (1900+)
uv run ruff check src/ tests/   # Lint

Package Names

Name	What it is
`llm-routing`	Current PyPI package (`pip install llm-routing`)
`llm-router`	CLI command and GitHub repo name
`claude-code-llm-router`	Deprecated legacy package (redirects to `llm-routing`)
