ypollak2/llm-router
Server Configuration
Describes the environment variables required to run the server.
| Name | Required | Description | Default |
|---|---|---|---|
| No arguments | | | |
Capabilities
Features and capabilities supported by this server
| Capability | Details |
|---|---|
| tools | {"listChanged": false} |
| prompts | {"listChanged": false} |
| resources | {"subscribe": false, "listChanged": false} |
| experimental | {} |
Tools
Functions exposed to the LLM to take actions
| Name | Description |
|---|---|
| llm_classify | Classify a prompt's complexity and recommend which model to use. Returns a recommendation that considers complexity, daily token budget, quality preference, and minimum model floor, and includes a budget usage bar. Complexity drives model selection at all times. Args: prompt: The task or question to classify. quality: Override quality mode ("best", "balanced", or "conserve"). min_model: Override minimum model floor ("haiku", "sonnet", or "opus"). |
| llm_track_usage | Report Claude Code model token usage for budget tracking. Call this after using an Agent with haiku/sonnet to track token consumption against the daily budget. This enables progressive model downshifting. Shows per-call savings vs opus and cumulative session savings. Args: model: The Claude model used — "haiku", "sonnet", or "opus". tokens_used: Approximate tokens consumed by the Agent call. complexity: The task complexity that was routed — "simple", "moderate", "complex". |
| llm_route | Smart router: classifies task complexity, then routes to the optimal external LLM. Uses a cheap classifier to assess complexity, then picks the matching model tier. For routing to Claude Code's own models (haiku/sonnet) without API keys, use llm_classify instead and follow its recommendation. Args: prompt: The task or question to route. task_type: Optional hint ("query", "research", "generate", "analyze", or "code"); auto-detected if omitted. complexity_override: Skip classification and force "simple", "moderate", or "complex". system_prompt: Optional system instructions. temperature: Sampling temperature (0.0-2.0). max_tokens: Maximum output tokens. context: Optional conversation context to help the model understand the broader task. |
| llm_auto | Auto-routing wrapper with persistent savings tracking that works from any host. Equivalent to llm_route, with additional savings tracking. Use llm_auto instead of llm_route when you are in a host that lacks a UserPromptSubmit hook (Codex CLI, Claude Desktop, GitHub Copilot): the savings are tracked server-side, so they accumulate correctly regardless of which client triggered the call. Args: prompt: The task or question to route. task_type: Optional hint ("query", "research", "generate", "analyze", or "code"). profile_override: Force a routing profile ("budget", "balanced", or "premium"). system_prompt: Optional system instructions. context: Optional conversation context. |
| llm_stream | Stream an LLM response for long-running tasks — shows output as it arrives. Uses the same routing logic as llm_route but streams chunks instead of waiting for the full response. Ideal for long-form generation, research summaries, or any task where seeing partial output early is valuable. Args: prompt: The task or question to stream. task_type: Task type hint — "query", "research", "generate", "analyze", "code". model: Optional model override (e.g. "openai/gpt-4o", "gemini/gemini-2.5-flash"). system_prompt: Optional system instructions. temperature: Sampling temperature (0.0-2.0). max_tokens: Maximum output tokens. |
| llm_select_agent | Classify a task prompt and return the recommended agent CLI + model for session-level routing. Use this BEFORE starting a Claude Code / Codex / Gemini CLI session to pick the right agent runtime for the task. This is session-level routing: it selects which agent to invoke, not which model to call mid-session. Decision tree (profile × complexity): budget + simple/moderate → codex + gpt-4o-mini; budget + complex → codex + gpt-4o (Codex handles most coding; escalate if needed); balanced + simple → codex + gpt-4o-mini; balanced + moderate → claude_code + sonnet; balanced + complex → claude_code + opus; premium + any → claude_code + opus. Returns JSON with: primary: agent binary name ("claude_code", "codex", or "gemini_cli"); primary_model: model flag value (pass via -m or --model); fallback: fallback agent if primary unavailable; fallback_model: model for fallback; task_type: classified task type (code / analyze / generate / research / query); complexity: simple, moderate, or complex; confidence: classifier confidence 0–1; reason: one-line classification rationale; env_check: dict of required env vars and whether they're set. Args: prompt: The task description to classify (same text you'd pass to the agent). profile: Routing profile, one of "budget", "balanced", or "premium" (default: "balanced"). |
| llm_reroute | Override the last routing decision and record it for feedback learning. Logs the correction to the database so future routing decisions for this task type have lowered confidence. Use this when llm_route, llm_query, llm_code, or any other tool chose the wrong model for your task. Args: to_tool: Which tool to use instead (e.g. "llm_analyze", "llm_code"). reason: Optional explanation — stored for routing quality improvement. original_tool: The tool that made the wrong decision (auto-detected if omitted). original_model: The model that was selected (for logging purposes). |
| llm_query | Send a general query to the best available LLM. Routes by complexity: simple→Haiku/Flash, moderate→Sonnet/GPT-4o, complex→Opus/o3. Args: prompt: The question or prompt to send. complexity: Task complexity — "simple", "moderate", or "complex". Drives model selection: simple→cheap (Haiku/Flash), moderate→balanced (Sonnet/GPT-4o), complex→premium (Opus/o3). Auto-detected from prompt length when omitted. model: Explicit model override, bypasses complexity routing entirely. system_prompt: Optional system instructions. temperature: Sampling temperature (0.0-2.0). max_tokens: Maximum output tokens. context: Optional conversation context to help the model understand the broader task. |
| llm_research | Search-augmented research query — routes to Perplexity for web-grounded answers. Best for: fact-checking, current events, finding sources, market research. Args: prompt: The research question. system_prompt: Optional system instructions. max_tokens: Maximum output tokens. context: Optional conversation context to help the model understand the broader task. |
| llm_generate | Generate creative or long-form content — routes to the best generation model. Best for: writing, summarization, brainstorming, content creation. Args: prompt: What to generate. complexity: Task complexity — "simple", "moderate", or "complex". Drives model selection. Simple tasks (short summaries) use cheap models; complex tasks (long-form, nuanced writing) use premium models. system_prompt: Optional system instructions (tone, format, audience). temperature: Sampling temperature (higher = more creative). max_tokens: Maximum output tokens. context: Optional conversation context to help the model understand the broader task. |
| llm_analyze | Deep analysis task — routes to the strongest reasoning model. Best for: data analysis, code review, problem decomposition, debugging. Args: prompt: What to analyze. complexity: Task complexity — "simple", "moderate", or "complex". Analysis tasks default to at least moderate. Pass "complex" for multi-file reviews or architecture decisions that warrant Opus/o3. system_prompt: Optional system instructions. max_tokens: Maximum output tokens. context: Optional conversation context to help the model understand the broader task. |
| llm_code | Coding task — routes to the best coding model. Best for: code generation, refactoring suggestions, algorithm design. Args: prompt: The coding task or question. complexity: Task complexity — "simple", "moderate", or "complex". Drives model selection: simple questions use Haiku/Flash, actual implementation tasks use Sonnet/GPT-4o, large refactors or architecture work use Opus/o3. system_prompt: Optional system instructions (language, framework, style). max_tokens: Maximum output tokens. context: Optional conversation context to help the model understand the broader task. |
| llm_edit | Route code-edit reasoning to a cheap model and return exact edit instructions. Instead of Opus reasoning about what to change (expensive), a cheap model reads the files, figures out the edits, and returns JSON. How to use the result: after calling this tool, apply each edit instruction using the Edit tool with the exact old_string → new_string pairs provided. Best for: refactoring, bug fixes, adding small features to existing files. Args: task: Natural-language description of what to change (e.g. "Add type hints to all public functions in router.py"). files: List of file paths to read and include in the prompt. Relative paths are resolved from the current working directory. Files larger than 32 KB are truncated with a note. context: Optional conversation context to help the model understand the task. |
| llm_image | Generate an image — auto-routes to Gemini Imagen, DALL-E, Flux, or Stable Diffusion. Args: prompt: Description of the image to generate. model: Optional model override (e.g. "gemini/imagen-3", "openai/dall-e-3", "fal/flux-pro", "stability/stable-diffusion-3"). size: Image size (e.g. "1024x1024", "1792x1024"). quality: Image quality — "standard" or "hd" (DALL-E only). |
| llm_video | Generate a video — routes to Gemini Veo, Runway, Kling, or other video models. Args: prompt: Description of the video to generate. model: Optional model override (e.g. "gemini/veo-2", "runway/gen3a_turbo", "fal/kling-video"). duration: Video duration in seconds (default: 5). |
| llm_audio | Generate speech/audio — routes to ElevenLabs or OpenAI TTS. Args: text: Text to convert to speech. model: Optional model override (e.g. "openai/tts-1-hd", "elevenlabs/eleven_multilingual_v2"). voice: Voice selection (OpenAI: alloy/echo/fable/onyx/nova/shimmer. ElevenLabs: voice ID). |
| llm_orchestrate | Multi-step orchestration — automatically decomposes complex tasks across multiple LLMs. Chains research, analysis, generation, and coding steps together, routing each to the optimal model. Use templates for common patterns or let the AI decompose. Free tier: up to 2-step pipelines. Pro tier: unlimited steps + auto-decomposition. Args: task: Description of the complex task to accomplish. template: Optional pipeline template: "research_report", "competitive_analysis", "content_pipeline", "code_review_fix". Omit for auto-decomposition. |
| llm_pipeline_templates | List available pipeline templates for multi-step orchestration. |
| llm_save_session | Summarize and save the current session for cross-session context. Uses a cheap model to generate a compact summary of the session's exchanges, then persists it to SQLite. Future routed calls will include this summary as context, giving external models awareness of prior work. Call this before ending a session or when switching to a different task. Sessions with fewer than 3 exchanges are skipped. |
| llm_set_profile | Switch the active routing profile. Args: profile: One of "budget", "balanced", or "premium". |
| llm_usage | Unified usage dashboard — Claude subscription, Codex, external APIs, and savings. Shows a complete picture of all LLM usage across all providers in one view. Args: period: Time period — "today", "week", "month", or "all". |
| llm_savings | Show time-bucketed savings dashboard: today / this week / this month / all-time. Displays actual spend vs Sonnet baseline and the efficiency multiplier (Nx) for each period. Use this to understand the real dollar value routing provides. Returns: Formatted savings table with efficiency multiplier. |
| llm_cache_stats | Show prompt classification cache statistics — hit rate, entries, memory usage. The cache stores ClassificationResult objects keyed by SHA-256(prompt + quality_mode + min_model). Budget pressure is always applied fresh, so cached classifications stay valid. |
| llm_cache_clear | Clear the prompt classification cache. |
| llm_quality_report | Show routing quality metrics — classification accuracy, savings, model distribution. Analyzes routing decisions over the specified period to show how the classifier is performing, which models are being selected, downshift rates, and cost efficiency. Args: days: Number of days to include in the report (default 7). |
| llm_quality_guard | Show quality scores per model with degradation alerts (v6.2). Displays rolling average judge scores for all routed models over the past N days. Alerts if any model's score < 0.7 with sufficient samples (quality degradation). Args: days: Number of days of history to analyze (default 7). Returns: Formatted table with model scores, trend arrows, and alerts. |
| llm_health | Check the health status of all configured LLM providers. |
| llm_hook_health | Check the health status of all routing hooks. |
| llm_providers | List all supported providers and which ones are configured. |
| llm_dashboard | Open the LLM Router web dashboard in the background. Starts a local HTTP server on localhost, at the port given by the port argument, showing routing stats, cost trends, model distribution, and recent decisions. Refreshes every 30s. The dashboard reads from the same SQLite DB the router writes, so no extra configuration is needed. Args: port: TCP port for the dashboard server (default 7337). Returns: URL and instructions for opening the dashboard. |
| llm_team_report | Show a team savings report for the current user and project. Displays call counts, cost savings, free-tier usage, and top models, broken down for the auto-detected user (git email) and project (git remote). Args: period. |
| llm_team_push | Push the team savings report to the configured notification channel. Sends a formatted message to the configured endpoint. Args: period. |
| llm_policy | Show the active routing policy and recent policy audit events. Displays the merged policy from all three layers. Also shows the last 10 policy enforcement events from the audit log. |
| llm_digest | Generate a savings digest and optionally send it to a webhook. Formats a savings summary for the given period. Also detects spend spikes and shows a "what if router was off?" simulation. Args: period. |
| llm_benchmark | Show routing accuracy benchmarks by task type. Accuracy is computed from llm_rate feedback (thumbs up/down). The more you rate responses with llm_rate, the more accurate this becomes. Also shows an optional community export status if LLM_ROUTER_COMMUNITY=true. |
| llm_model_eval | Evaluate and benchmark all available local and remote models. Runs a suite of benchmark tasks (reasoning, code) against each available model (Ollama, Codex, APIs) to determine quality, speed, and accuracy. Results are cached for 7 days and used to optimize routing priorities. Can be called manually to force a re-evaluation, or automatically runs once per week during session-end. Returns: Formatted evaluation results with quality scores and latency metrics. |
| llm_model_usage | Analyze which models are being selected in routing. Shows usage statistics for the last N hours. Args: hours: Look back this many hours (default: 24). Returns: Formatted usage statistics and analysis. |
| llm_model_export | Export model tracking data for external analysis. Exports complete routing history to a file for analysis in spreadsheets or data tools (Excel, Python, R, etc.). Args: format: Export format (csv, json). Default: csv Returns: Path to exported file and record count. |
| llm_session_spend | Show real-time session cost breakdown. Reports spend accumulated since this session started, broken down by model and tool. Fires an anomaly warning if spend exceeds the configured threshold (default $0.50) in under 10 minutes. Returns a formatted summary with per-model costs and anomaly status. |
| llm_approve_route | Approve or reject a pending high-cost routing decision. Use this when llm_route (or any routing tool) blocked a call because the estimated cost exceeded LLM_ROUTER_ESCALATE_ABOVE. The pending call is stored server-side until you approve or cancel it. Args: approve: True to proceed with the call, False to cancel it. downgrade_to: Optional cheaper model to use instead of the blocked one (e.g. "gemini/gemini-2.5-flash" instead of "openai/o3"). |
| llm_quota_status | Show quota balance across Claude, Gemini CLI, and Codex subscriptions. Monitors three subscription providers to help you understand which quota is being exhausted, and make routing decisions accordingly. The QUOTA_BALANCED profile uses this data to dynamically reorder the routing chain — keeping usage balanced across all three subscriptions. Returns: Formatted quota status with usage percentages and route recommendations. |
| llm_budget | Show real-time budget pressure for all configured providers (v5.0+). Reads live budget state from the Budget Oracle, which normalises provider quota into a single pressure value (0.0 = fully available, 1.0 = exhausted). Pressure sources by provider type: Local (Ollama, vLLM) — always 0.0 (free, no quota) Claude subscription — max(session_pct, weekly_pct, sonnet_pct) / 100 API-key providers — monthly spend / configured cap (0.0 if no cap) Returns: A formatted budget summary with pressure bars per provider. |
| llm_gain | Show token savings dashboard (RTK-style). Displays comprehensive token savings metrics across all routing decisions, showing actual costs vs. Opus baseline and efficiency multiplier. Args: period: Time period to analyze: "today", "week" (default), "month", or "all". Returns: Formatted savings dashboard. |
| llm_share_profile | Share your learned routing profile with the community. Exports ~/.llm-router/learned_routes.json and prepares it for upload to a shared community repository. Useful for publishing routing patterns you've learned that may benefit other llm-router users. Returns: Path to exported profile and upload instructions |
| llm_import_profile | Import a learned routing profile from community or URL. Imports a shared profile and merges it with your existing learned routes. Community profiles must have confidence >= 2 to be imported (strict validation). Args: url: URL to a profile JSON file (optional; defaults to community latest) Returns: Merge summary and new routes imported |
| llm_check_usage | Check real-time Claude subscription usage (session limits, weekly limits, extra spend). Shows cached data if available. If no data cached, returns the JS snippet to run via Playwright's browser_evaluate (one call, no page navigation needed). The budget pressure from this data feeds directly into model routing — higher usage = more aggressive downshifting to cheaper models. |
| llm_update_usage | Update cached Claude usage from the JSON API response. Call this with the result from browser_evaluate(FETCH_USAGE_JS). Accepts the full JSON object from the claude.ai internal API. The cached data is used by llm_classify for real budget pressure instead of token-based estimates. Args: data: JSON response from the claude.ai usage API (via browser_evaluate). |
| llm_refresh_claude_usage | Refresh Claude subscription usage via the OAuth API — no browser required. Reads the Claude Code OAuth token from the macOS Keychain, calls the Anthropic OAuth usage endpoint, and updates the local usage cache. Requires: Claude Code installed and authenticated on macOS. |
| llm_codex | Route a task to the local Codex desktop agent (OpenAI). Uses the Codex CLI to run tasks non-interactively. This uses the user's OpenAI subscription (not Claude quota) — ideal as a fallback when Claude limits are tight, or for tasks that benefit from OpenAI's models. Available models: gpt-5.4, o3, o4-mini, gpt-4o, gpt-4o-mini Args: prompt: The task or question to send to Codex. model: OpenAI model to use (default: gpt-5.4). |
| llm_gemini | Route a task to the local Gemini CLI agent (Google). Uses the Gemini CLI to run tasks non-interactively. This uses the user's Google One AI Pro subscription (not Claude quota) — ideal as a fallback when Claude limits are tight, or for tasks that benefit from Google's Gemini models. Available models: gemini-2.5-flash, gemini-2.0-flash, gemini-3-flash-preview Args: prompt: The task or question to send to Gemini. model: Google model to use (default: gemini-2.5-flash). |
| llm_setup | Set up and manage API providers, hooks, and routing enforcement. Args: action: What to do: "status", "guide", "discover", "add", "test", "provider", "install_hooks", or "uninstall_hooks". provider: Provider name (for "add", "test", and "provider" actions). api_key: API key value (for "add" action only). Key is validated before saving. |
| llm_rate | Rate the last (or a specific) routing decision as good or bad. Stores thumbs-up / thumbs-down feedback for that decision. Args: good: True if the routing was a good choice, False if it was a bad choice. decision_id: Row ID to rate; omit (or pass None) to rate the most recent routing decision. Returns: Confirmation string with the rated decision ID, or an error message. |
| llm_fs_find | Generate glob/grep commands to find files matching a natural-language description. Routes to Haiku/Ollama so the cheap model does pattern thinking. Claude executes the returned commands with Glob/Grep/Bash. Args: description: What you're looking for, e.g. "all Python files that import sqlite3" or "TypeScript files with TODO comments added in the last week". root: Optional root directory to search in. Defaults to current working directory. |
| llm_fs_rename | Generate shell commands for a file rename/reorganisation operation. Describe what you want to rename and the cheap model produces the mv/git mv commands. Args: description: What to rename and how, e.g. "rename all _old.py files in src/ to remove the old suffix" or "move all test*.py files from tests/unit/ into tests/". dry_run: When True, commands are prefixed with … |
| llm_fs_edit_many | Generate bulk edit instructions across multiple files. Use this for cross-file refactors, bulk renames within files, or updating repeated patterns across a module. Args: task: Natural-language description of what to change. |
| llm_fs_analyze_context | Analyze workspace files to build a routing context summary. Scans key files (package.json, pyproject.toml, go.mod, Cargo.toml, README, open TODOs) and produces a compact semantic summary stored in ~/.llm-router/context_summary.json. Subsequent routing decisions inject this summary into the system prompt so cheap models have workspace context. Call this once at the start of a project session or after major refactors. The summary is automatically used by llm_route and llm_auto — no further action required. Args: path: Workspace root to analyze (default: current directory). max_files: Maximum files to read (default: 20). |
| agoragentic_task | Execute a task on the Agoragentic capability marketplace. Routes automatically to the best-matching trusted provider. Handles USDC settlement on Base L2 blockchain. Args: task: Task type (e.g., "code_review", "summarization") input_json: Task input as JSON string max_budget_usdc: Maximum spend limit (optional) Returns: Execution result as JSON string |
| agoragentic_browse | Browse available services on the Agoragentic marketplace. Shows trust-verified providers and their capabilities. Returns: JSON list of available capabilities |
| agoragentic_wallet | Check Agoragentic wallet balance and status. Returns: Wallet info including balance, chain, and currency |
| agoragentic_status | Get llm-router agent status on Agoragentic. Shows registration status, available seller slots, listings, etc. Returns: Agent status as JSON |
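
Three of the rules stated in the Tools table are concrete enough to sketch in Python: the llm_select_agent decision tree, the llm_budget pressure formulas, and the llm_cache_stats cache key. The following is a minimal illustration written from those descriptions only, not the server's actual code. All function names here (select_agent, claude_pressure, api_pressure, cache_key) are hypothetical, and the cache key assumes plain string concatenation encoded as UTF-8, which is only the literal reading of SHA-256(prompt + quality_mode + min_model).

```python
import hashlib
from typing import Optional, Tuple

# (profile, complexity) -> (agent, model), copied from the llm_select_agent row.
_DECISION_TREE = {
    ("budget", "simple"): ("codex", "gpt-4o-mini"),
    ("budget", "moderate"): ("codex", "gpt-4o-mini"),
    ("budget", "complex"): ("codex", "gpt-4o"),
    ("balanced", "simple"): ("codex", "gpt-4o-mini"),
    ("balanced", "moderate"): ("claude_code", "sonnet"),
    ("balanced", "complex"): ("claude_code", "opus"),
}
_COMPLEXITIES = ("simple", "moderate", "complex")


def select_agent(profile: str, complexity: str) -> Tuple[str, str]:
    """Hypothetical helper: look up the agent and model for a profile/complexity pair."""
    if complexity not in _COMPLEXITIES:
        raise ValueError(f"unknown complexity: {complexity!r}")
    if profile == "premium":
        # "premium + any" always maps to claude_code + opus.
        return ("claude_code", "opus")
    try:
        return _DECISION_TREE[(profile, complexity)]
    except KeyError:
        raise ValueError(f"unknown profile: {profile!r}") from None


def claude_pressure(session_pct: float, weekly_pct: float, sonnet_pct: float) -> float:
    """Claude subscription pressure: max of the three percentages, divided by 100."""
    return max(session_pct, weekly_pct, sonnet_pct) / 100.0


def api_pressure(monthly_spend: float, cap: Optional[float]) -> float:
    """API-key provider pressure: monthly spend / cap, or 0.0 when no cap is set."""
    if cap is None or cap <= 0:  # treating a non-positive cap as "no cap" is an assumption
        return 0.0
    return monthly_spend / cap


def cache_key(prompt: str, quality_mode: str, min_model: str) -> str:
    """SHA-256 over the concatenated fields (separator and encoding are assumptions)."""
    return hashlib.sha256((prompt + quality_mode + min_model).encode("utf-8")).hexdigest()
```

Local providers (Ollama, vLLM) are described as always having pressure 0.0, so they need no formula. The sketch does not clamp pressure to 1.0, because the table does not say whether the server does.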
Prompts
Interactive templates invoked by user choice
| Name | Description |
|---|---|
| No prompts | |
Resources
Contextual data attached and managed by the client
| Name | Description |
|---|---|
| router_status | MCP resource returning a plain-text snapshot of the router's current state. Includes the active profile, subscription tier, configured provider counts (text and media), optional monthly budget, and per-provider circuit-breaker health status. Returns: A newline-delimited plain-text summary (not markdown). |
MCP directory API
We provide all the information about MCP servers via our MCP API.
curl -X GET 'https://glama.ai/api/mcp/v1/servers/ypollak2/llm-router'
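
The same request can be issued from Python with only the standard library. This is a minimal sketch: the function names are hypothetical, and it assumes the endpoint returns a JSON body, which the page does not state explicitly.

```python
import json
import urllib.request

# URL taken verbatim from the curl command above.
API_URL = "https://glama.ai/api/mcp/v1/servers/ypollak2/llm-router"


def build_request(url: str = API_URL) -> urllib.request.Request:
    """Build the GET request without sending it."""
    # The Accept header is an assumption; the curl example sends none.
    return urllib.request.Request(url, method="GET", headers={"Accept": "application/json"})


def fetch_server_info(url: str = API_URL, timeout: float = 10.0) -> dict:
    """Send the request and decode the body, assuming it is JSON."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Calling fetch_server_info() performs the network request; build_request() is split out so the request can be inspected or tested offline.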
If you have feedback or need assistance with the MCP directory API, please join our Discord server