ypollak2/llm-router
This server is an intelligent LLM routing system that classifies prompts and directs them to the cheapest capable model, saving 35–80% on AI costs while maintaining quality. Key capabilities include:
Smart Routing & Classification
llm_route, llm_auto, llm_stream, llm_classify: classify prompt complexity and route to the optimal LLM (budget/balanced/premium tiers) with budget-pressure awareness
llm_reroute, llm_approve_route: override or approve pending high-cost routing decisions
llm_select_agent: pick the best CLI agent (Claude Code, Codex, Gemini CLI) for session-level routing
Task-Specific LLM Tools
llm_query, llm_code, llm_analyze, llm_generate, llm_research, llm_edit: route prompts tailored to general queries, coding, analysis, creative generation, web-grounded research, or code editing
Media Generation
llm_image, llm_video, llm_audio: generate images (DALL-E, Flux, Stable Diffusion), video (Runway, Kling, Veo), or speech (ElevenLabs, OpenAI TTS)
Multi-Step Orchestration
llm_orchestrate, llm_pipeline_templates: decompose complex tasks into multi-LLM pipelines (research reports, competitive analysis, etc.)
Cost, Budget & Usage Monitoring
llm_usage, llm_savings, llm_gain, llm_session_spend, llm_budget, llm_quota_status: real-time cost tracking, savings dashboards, quota balances, and anomaly warnings across Claude, Codex, Gemini, and external APIs
llm_check_usage, llm_update_usage, llm_refresh_claude_usage: manage Claude subscription usage
Quality & Performance Analytics
llm_quality_report, llm_quality_guard, llm_benchmark, llm_model_eval, llm_model_usage, llm_model_export: routing accuracy metrics, model quality scores, degradation alerts, and exportable tracking data
llm_rate: rate routing decisions to improve future routing
Health, Providers & Configuration
llm_health, llm_hook_health, llm_providers, llm_setup: check provider/hook health and configure API keys (Ollama, OpenAI, Anthropic, Google, DeepSeek, Mistral, Groq, Perplexity, and more)
llm_set_profile, llm_policy: switch routing profiles and view active policies
Filesystem Operations
llm_fs_find, llm_fs_rename, llm_fs_edit_many, llm_fs_analyze_context: use cheap models to find files, generate rename commands, perform bulk edits, and analyze workspace context
Team & Collaboration
llm_team_report, llm_team_push, llm_digest, llm_dashboard: team savings reports, Slack/Discord/Telegram webhooks, spend spike detection, and a local web dashboard
llm_share_profile, llm_import_profile: share or import learned routing profiles with the community
Session & Cache Management
llm_save_session: summarize and persist session context for cross-session awareness
llm_cache_stats, llm_cache_clear: manage the prompt classification cache
Agoragentic Marketplace
agoragentic_task, agoragentic_browse, agoragentic_wallet, agoragentic_status: execute tasks, browse services, and manage a USDC wallet on the Agoragentic capability marketplace
Integration for audio generation and text-to-speech capabilities.
Provides access to Gemini 2.5 Pro and 2.5 Flash models with a free tier (1M tokens/day), optimized for generation tasks and long-context processing.
Enables routing to locally-hosted models for zero-cost, privacy-preserving, offline inference as the first tier in fallback chains.
Provides access to GPT-4o, GPT-4o-mini, and o3 models for code generation, analysis, and reasoning tasks.
Integration for research and current events using Sonar and Sonar Pro search-augmented models to get factual, up-to-date information.
Why People Install This
AI coding tools send too many prompts to premium models by default.
That means:
You waste paid tokens on simple questions
You burn through Claude, Gemini, or OpenAI quota faster than necessary
You stop working when one provider is rate-limited or down
llm-router sits between your coding tool and your model providers. It classifies each prompt, tries the cheapest capable model first, and falls back automatically when needed.
You keep the same workflow. The router changes the model choice underneath.
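The "cheapest capable model first, fall back when needed" behavior described above can be pictured as a loop over an ordered chain. This is a minimal sketch, not llm-router's implementation: `ProviderError`, `route_with_fallback`, and the three stand-in providers are hypothetical names invented for illustration.

```python
from typing import Callable, Sequence


class ProviderError(Exception):
    """Raised by a provider callable when it is rate-limited or down."""


# Each chain entry is (name, callable). Names and ordering are illustrative only.
Provider = tuple[str, Callable[[str], str]]


def route_with_fallback(prompt: str, chain: Sequence[Provider]) -> tuple[str, str]:
    """Try providers in order (cheapest first) and return (provider_name, response)."""
    errors = []
    for name, call in chain:
        try:
            return name, call(prompt)
        except ProviderError as exc:
            # Remember the failure and move on to the next provider in the chain.
            errors.append(f"{name}: {exc}")
    raise ProviderError("all providers failed: " + "; ".join(errors))


# Stand-in providers so the sketch runs without any network access.
def local_model(prompt: str) -> str:
    raise ProviderError("not running")


def cheap_api(prompt: str) -> str:
    return f"cheap answer to: {prompt}"


def premium_api(prompt: str) -> str:
    return f"premium answer to: {prompt}"


chain = [("local", local_model), ("cheap", cheap_api), ("premium", premium_api)]
name, response = route_with_fallback("What does this error mean?", chain)
print(name, "->", response)
```

In this sketch the premium provider is never called, because the second entry in the chain succeeds.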
What You Get
Route trivial prompts to free or cheap models first
Keep premium models for the prompts that actually need them
Fall back across providers automatically
Track usage and estimated savings locally
Run everything on your own machine
Quick Start
1. Install
pip install llm-routing
llm-router install
Package name: llm-routing on PyPI. CLI command: llm-router.
2. Add providers (optional)
export OPENAI_API_KEY="sk-..." # GPT-4o, o3
export GEMINI_API_KEY="AIza..." # Gemini Flash/Pro (free tier available)
export OLLAMA_BASE_URL="http://localhost:11434" # Local models (free)
Works with zero API keys on Claude Code Pro/Max subscriptions: routing uses MCP tools that call external models only when beneficial.
3. Verify
llm-router health # Check provider connectivity
If you already use Claude Code, Codex, or Gemini CLI, keep your existing workflow and let llm-router choose models underneath it.
Example Routing
Prompt | Routed to |
"What does this Python error mean?" | Ollama / Gemini Flash / Codex |
"Refactor this endpoint" | GPT-4o / Gemini Pro |
"Design a distributed tracing strategy" | o3 / Claude Opus |
The exact chain depends on your configured providers, budget profile, and routing policy.
Works With
Tool | Mode | Typical Savings |
Claude Code | Full auto-routing via hooks | 60–80% |
Codex CLI | Full auto-routing via hooks | 60–80% |
Gemini CLI | Full auto-routing via hooks | 50–70% |
VS Code / Cursor | Manual MCP tools | 30–50% |
Any MCP client | Manual MCP tools | Varies |
Full auto-routing means hooks intercept prompts and route automatically with no workflow change.
Manual MCP tools means routing is available on demand through tools such as llm_query.
llm-router install # Claude Code (default)
llm-router install --host codex # Codex CLI
llm-router install --host gemini-cli # Gemini CLI
llm-router install --host vscode # VS Code
llm-router install --host cursor # Cursor
See docs/HOST_SUPPORT_MATRIX.md for full details on each host.
Protect Claude Code 5-hour quota
For a strict boundary that never automatically falls through to native Claude, configure:
# ~/.llm-router/routing.yaml
enforce: smart
mode: zero_claude
In zero_claude mode, prompts either complete through direct external execution or are blocked before Claude Code invokes its model. Prefix a prompt with claude: when you intentionally want a native Claude turn.
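The three outcomes described for zero_claude mode (native turn on request, external completion, or block) can be sketched as a small decision function. This is an illustration under assumptions, not the hook's actual code: `zero_claude_decision` is a hypothetical name, and treating the claude: prefix as case-sensitive is a guess.

```python
from typing import Optional

NATIVE_PREFIX = "claude:"  # prefix named in the text; case-sensitivity is assumed


def zero_claude_decision(prompt: str, external_result: Optional[str]) -> tuple[str, str]:
    """Return (action, payload) where action is 'native', 'external', or 'blocked'."""
    stripped = prompt.lstrip()
    if stripped.startswith(NATIVE_PREFIX):
        # The user explicitly asked for a native Claude turn.
        return "native", stripped[len(NATIVE_PREFIX):].strip()
    if external_result is not None:
        # An external provider completed the prompt, so Claude is never invoked.
        return "external", external_result
    # Nothing external succeeded: block rather than fall through to Claude.
    return "blocked", "no external provider completed the prompt"


print(zero_claude_decision("claude: review this design", None))
print(zero_claude_decision("rename this variable", "done by external model"))
print(zero_claude_decision("rename this variable", None))
```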
How It Works
User prompt
│
▼
┌──────────────────────┐
│ Complexity Classifier │ ← Heuristic (free, instant) or Ollama/Flash ($0.0001)
└──────────┬───────────┘
│
▼
┌──────────────────────┐
│ Free-First Router │ ← Tries cheapest model first, walks up the chain
│ │
│ Ollama (free) │
│ → Codex (prepaid) │
│ → Gemini Flash │
│ → GPT-4o / Claude │
└──────────┬───────────┘
│
▼
┌──────────────────────┐
│ Guards (parallel) │ ← Circuit breaker, budget pressure, quality check
└──────────┬───────────┘
│
▼
Response + cost logged to local SQLite
Classification is free for many tasks (regex heuristics catch ~70%) or near-free for ambiguous prompts when using local Ollama or Gemini Flash.
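The two-stage classification described above (free regex heuristics first, a cheap model only for ambiguous prompts) might look roughly like the following. The patterns, labels, and function names are hypothetical; the router's real heuristics are not shown in this README.

```python
import re
from typing import Callable, Optional

# Hypothetical patterns chosen for illustration only.
SIMPLE = re.compile(r"\b(what does|what is|explain|meaning of|typo)\b", re.I)
COMPLEX = re.compile(r"\b(design|architect|strategy|distributed|trade-?offs?)\b", re.I)


def classify_heuristic(prompt: str) -> Optional[str]:
    """Return 'simple' or 'complex', or None when the regexes are inconclusive."""
    looks_simple = bool(SIMPLE.search(prompt))
    looks_complex = bool(COMPLEX.search(prompt))
    if looks_simple and not looks_complex:
        return "simple"
    if looks_complex and not looks_simple:
        return "complex"
    return None  # ambiguous: defer to a model-based classifier


def classify(prompt: str, model_classifier: Callable[[str], str]) -> str:
    """Use the free heuristic when it is decisive, otherwise pay for a model call."""
    label = classify_heuristic(prompt)
    return label if label is not None else model_classifier(prompt)


# Stand-in for a call to a local Ollama model or Gemini Flash.
def fake_model_classifier(prompt: str) -> str:
    return "moderate"


for p in ("What does this Python error mean?",
          "Refactor this endpoint",
          "Design a distributed tracing strategy"):
    print(p, "->", classify(p, fake_model_classifier))
```

Only the second prompt reaches the model-based classifier here; the other two are settled by the regexes at no cost.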
What You Can Do
Use case | How |
Route simple questions to free local models | Auto (hooks) or |
Protect Claude subscription quota | Budget pressure monitoring + auto-downgrade |
Fall back across providers on failure | Automatic chain with circuit breakers |
Track token spend and savings | |
Enforce routing policy for your team | |
Generate images/video/audio | |
Run multi-step research pipelines | |
Bulk-edit files with cheap models | |
Providers
Routing chains are built from your configured providers. You only need one.
Text LLM Providers
Provider | Models | Cost | Setup |
Ollama | gemma4, qwen3.5, llama3, etc. | Free (local) | |
OpenAI | GPT-4o, o3, GPT-4o-mini | Paid API | |
Google | Gemini Flash, Pro | Free tier + paid | |
Anthropic | Claude Sonnet, Opus, Haiku | Paid API or subscription | |
xAI | Grok-3 | Paid API | |
DeepSeek | DeepSeek Chat, Reasoner | Paid API (ultra-cheap) | |
Mistral | Mistral Large, Small | Paid API | |
Cohere | Command R+ | Paid API | |
Perplexity | Sonar Pro (web-grounded) | Paid API | |
Groq | Fast inference (Llama, Mixtral) | Free tier | |
Together | Open-source models | Paid API | |
HuggingFace | Open-source models | Free tier + paid | |
Codex | GPT-5.4, o3 (prepaid desktop) | Included with Codex CLI | Auto-detected |
Media Providers
Provider | Type | Setup |
fal | Image (Flux), Video (Kling) | |
Stability | Image (Stable Diffusion 3) | |
ElevenLabs | Audio / TTS | |
Runway | Video (Gen-3) | |
Replicate | Various open-source models | |
See docs/PROVIDERS.md for setup instructions and model recommendations.
Routing Policies
Control how aggressively the router offloads to cheap models.
Policy | Confidence Threshold | Typical Savings | Best For |
Aggressive | 2 | 60–75% | Maximum cost reduction |
Balanced (default) | 4 | 35–45% | Cost/quality tradeoff |
Conservative | 6 | 10–15% | Quality over cost |
export LLM_ROUTER_POLICY=aggressive # Or: balanced, conservative
export LLM_ROUTER_ENFORCE=smart # smart | hard | soft | off
export LLM_ROUTER_PROFILE=balanced # budget | balanced | premium
LLM_ROUTER_ENFORCE controls how strictly the auto-route hook blocks direct model use:
smart: route when confident, pass through when uncertain
hard: always route, block unrouted tool calls
soft: suggest routing, never block
off: disable hook enforcement
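The four enforcement modes above can be read as a small decision table. The sketch below is an interpretation, not the hook's code: `hook_action` and `is_confident` are hypothetical names, and the rule that a classifier score at or above the policy threshold counts as "confident" is an assumption, since the threshold's scale is not documented here. The blocking of unrouted tool calls in hard mode is not modeled.

```python
# Thresholds copied from the policy table; their scale is not documented in this README.
POLICY_THRESHOLDS = {"aggressive": 2, "balanced": 4, "conservative": 6}


def is_confident(score: int, policy: str) -> bool:
    # Assumption: a score at or above the policy threshold counts as confident.
    return score >= POLICY_THRESHOLDS[policy]


def hook_action(enforce: str, confident: bool) -> str:
    """Map an enforcement mode to 'route', 'pass_through', or 'suggest'."""
    if enforce == "off":
        return "pass_through"  # hook enforcement disabled
    if enforce == "soft":
        return "suggest"       # never blocks, only recommends routing
    if enforce == "hard":
        return "route"         # always routes
    if enforce == "smart":
        return "route" if confident else "pass_through"
    raise ValueError(f"unknown enforce mode: {enforce!r}")


for mode in ("smart", "hard", "soft", "off"):
    print(mode, hook_action(mode, is_confident(3, "balanced")))
```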
MCP Tools (60)
llm-router exposes 60 MCP tools organized by function:
Category | Tools | Examples |
Routing & classification | 7 | |
Text generation | 6 | |
Media generation | 3 | |
Pipeline orchestration | 2 | |
Admin & monitoring | 20+ | |
Filesystem operations | 4 | |
Subscription tracking | 3 | |
Slim mode (LLM_ROUTER_SLIM=routing or core) reduces registered tools to save context tokens in constrained environments.
Savings: How It Works
Savings are calculated by comparing actual spend against a baseline of routing every task to Claude Sonnet/Opus.
Methodology:
Each routed task logs: model used, tokens consumed, estimated cost
A baseline cost is computed as if the same tokens were processed by the most expensive model in the chain
Savings = (baseline - actual) / baseline
Assumptions and limitations:
Baseline assumes you would have used Opus/Sonnet for everything (worst case)
Token estimates use the len(text) / 4 approximation, not exact tokenizer counts
Cost data comes from LiteLLM's pricing tables (may lag provider price changes)
Savings vary significantly by workload — code-heavy sessions route more to cheap models
The router itself adds small overhead (classification costs ~$0.0001 per ambiguous task)
Observed range: 35–80% savings depending on policy and task mix. The "87%" figure in some docs represents a single-user peak over a specific development period, not a guaranteed outcome.
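The methodology above reduces to a few lines of arithmetic. The sketch below assumes a single baseline price per million tokens and uses made-up task data; the price, the task list, and the function names are illustrative, and integer division stands in for the len(text) / 4 approximation.

```python
def estimate_tokens(text: str) -> int:
    """Rough token count using the len(text) / 4 approximation."""
    return max(1, len(text) // 4)


def savings_fraction(tasks: list[tuple[int, float]], baseline_usd_per_mtok: float) -> float:
    """tasks holds (tokens, actual_cost_usd) pairs, one per routed task."""
    baseline = sum(tokens for tokens, _ in tasks) * baseline_usd_per_mtok / 1_000_000
    actual = sum(cost for _, cost in tasks)
    if baseline == 0:
        return 0.0  # nothing was routed, so there is nothing to save
    return (baseline - actual) / baseline


# Hypothetical session: one free local call, one cheap call, one premium call.
tasks = [(1000, 0.0), (2000, 0.003), (1000, 0.015)]
fraction = savings_fraction(tasks, baseline_usd_per_mtok=15.0)
print(f"estimated savings: {fraction:.0%}")
```

Because the baseline prices every token at the most expensive rate, the result is an upper bound on what a user who already mixed models by hand would save.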
Trust, Privacy, and Local-First Design
llm-router runs entirely on your machine. There is no hosted proxy, no telemetry, no account required.
What | Where | Details |
Your prompts | Sent to configured providers | Exactly like using those providers directly |
API keys | | Local files, never transmitted |
Usage logs | | Unencrypted SQLite (filesystem permissions) |
Classification cache | In-memory | Cleared on process restart |
Hook scripts | | Local shell scripts, inspectable |
What we do:
Scrub API keys from structured logs
Detect hook deadlocks before installation
Store all data locally in ~/.llm-router/
Respect provider rate limits and TOS
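As an illustration of the first item in this list, scrubbing keys from a structured log record can be done by walking the record and redacting anything that looks like a key. The pattern below only covers the `sk-` and `AIza` prefixes that appear in this README's examples; it and the `scrub` helper are assumptions, not the router's actual scrubber.

```python
import re

# Hypothetical pattern: matches only the key shapes shown in this README's examples.
KEY_PATTERN = re.compile(r"\b(sk-[A-Za-z0-9_-]{8,}|AIza[A-Za-z0-9_-]{8,})")


def scrub(value):
    """Recursively redact key-like strings inside dicts, lists, and strings."""
    if isinstance(value, str):
        return KEY_PATTERN.sub("[REDACTED]", value)
    if isinstance(value, dict):
        return {key: scrub(item) for key, item in value.items()}
    if isinstance(value, list):
        return [scrub(item) for item in value]
    return value  # numbers, booleans, None pass through unchanged


record = {
    "event": "provider_call",
    "detail": "using key sk-proj-abcdef123456",
    "headers": ["x-goog-api-key: AIzaSyExample12345"],
    "tokens": 128,
}
print(scrub(record))
```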
What you should know:
Prompts are sent to whichever provider the router selects — review your provider's privacy policy
Usage logs (SQLite) are not encrypted at rest — use full-disk encryption if needed
The router cannot prevent model jailbreaks or prompt injection at the provider level
See SECURITY.md for responsible disclosure policy and docs/SECURITY_DESIGN.md for the full threat model.
Configuration
Minimal setup — only configure what you have:
# Provider keys (set any combination)
export OPENAI_API_KEY="sk-proj-..."
export GEMINI_API_KEY="AIza..."
export OLLAMA_BASE_URL="http://localhost:11434"
export OLLAMA_BUDGET_MODELS="gemma4:latest,qwen3.5:latest"
# Routing behavior
export LLM_ROUTER_PROFILE="balanced" # budget | balanced | premium
export LLM_ROUTER_POLICY="balanced" # aggressive | balanced | conservative
export LLM_ROUTER_ENFORCE="smart" # smart | hard | soft | off
For teams or environments where .env is restricted:
# User-level config (no project .env needed)
mkdir -p ~/.llm-router && chmod 700 ~/.llm-router
cat > ~/.llm-router/config.yaml << 'EOF'
openai_api_key: "sk-proj-..."
gemini_api_key: "AIza..."
ollama_base_url: "http://localhost:11434"
llm_router_profile: "balanced"
EOF
chmod 600 ~/.llm-router/config.yaml
Documentation
Document | Purpose |
| Fastest path to working routing |
| Full setup walkthrough |
docs/HOST_SUPPORT_MATRIX.md | Per-host feature comparison |
docs/PROVIDERS.md | Provider setup and model recommendations |
| All 60 MCP tools with examples |
| Internal design and module structure |
| Common issues and fixes |
docs/SECURITY_DESIGN.md | Threat model and data handling |
Contributing
Contributions welcome. See CONTRIBUTING.md for full guidelines.
git clone https://github.com/ypollak2/llm-router.git
cd llm-router
uv sync --extra dev
uv run pytest tests/ -q # Run tests (1900+)
uv run ruff check src/ tests/ # Lint
Package Names
Name | What it is |
llm-routing | Current PyPI package ( |
llm-router | CLI command and GitHub repo name |
| Deprecated legacy package (redirects to |