| llm_classify | Classify a prompt's complexity and recommend which model to use. Returns a smart recommendation considering complexity, daily token budget,
quality preference, and minimum model floor. Includes a budget usage bar.
Complexity drives model selection at all times:
- simple → haiku, moderate → sonnet, complex → opus
Budget pressure is a late safety net only:
- 0-85%: no downshift — complexity routing handles efficiency
- 85-95%: downshift by 1 tier (opus→sonnet, sonnet→haiku)
- 95%+: downshift by 2 tiers, warns user
Args:
prompt: The task or question to classify.
quality: Override quality mode — "best", "balanced", or "conserve".
min_model: Override minimum model floor — "haiku", "sonnet", or "opus".
|
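The complexity routing and late budget safety net described above can be sketched roughly as follows (illustrative Python; the function name and tier ordering are assumptions, not the actual implementation):

```python
TIERS = ["haiku", "sonnet", "opus"]  # cheapest → most capable

def recommend(complexity: str, budget_used: float, min_model: str = "haiku") -> str:
    """Pick a model from complexity, then apply the budget downshift ladder."""
    base = {"simple": "haiku", "moderate": "sonnet", "complex": "opus"}[complexity]
    idx = TIERS.index(base)
    if budget_used >= 0.95:      # 95%+: downshift two tiers
        idx -= 2
    elif budget_used >= 0.85:    # 85-95%: downshift one tier
        idx -= 1
    # Never go below the minimum model floor (or off the bottom of the ladder).
    return TIERS[max(idx, TIERS.index(min_model), 0)]
```

Note the floor is applied after the downshift, so `min_model="sonnet"` holds even under heavy budget pressure.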
| llm_track_usage | Report Claude Code model token usage for budget tracking. Call this after using an Agent with haiku/sonnet to track token consumption
against the daily budget. This enables progressive model downshifting.
Shows per-call savings vs opus and cumulative session savings.
Args:
model: The Claude model used — "haiku", "sonnet", or "opus".
tokens_used: Approximate tokens consumed by the Agent call.
complexity: The task complexity that was routed — "simple", "moderate", "complex".
|
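The per-call "savings vs opus" readout can be approximated like so (a minimal sketch; the per-million-token prices are placeholder assumptions, not a real price sheet):

```python
# Illustrative prices per 1M tokens — placeholders, not actual pricing.
PRICE = {"haiku": 1.0, "sonnet": 6.0, "opus": 30.0}

def savings_vs_opus(model: str, tokens_used: int) -> float:
    """Dollars saved by running tokens_used on `model` instead of opus."""
    return (PRICE["opus"] - PRICE[model]) * tokens_used / 1_000_000
```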
| llm_route | Smart router — classifies task complexity, then routes to the optimal external LLM. Uses a cheap classifier to assess complexity, then picks the right model tier:
- simple → budget models (Gemini Flash, GPT-4o-mini)
- moderate → balanced models (GPT-4o, Sonnet, Gemini Pro)
- complex → premium models (o3, Opus)
For routing to Claude Code's own models (haiku/sonnet) without API keys,
use llm_classify instead and follow its recommendation.
Args:
prompt: The task or question to route.
task_type: Optional hint — "query", "research", "generate", "analyze", "code". Auto-detected if omitted.
complexity_override: Skip classification — force "simple", "moderate", or "complex".
system_prompt: Optional system instructions.
temperature: Sampling temperature (0.0-2.0).
max_tokens: Maximum output tokens.
context: Optional conversation context to help the model understand the broader task.
|
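The tier table above might be expressed as a sketch like this (model IDs are taken from the examples in this document; the fall-through-within-a-tier behavior is an assumption):

```python
MODEL_TIERS = {
    "simple":   ["gemini/gemini-2.5-flash", "openai/gpt-4o-mini"],
    "moderate": ["openai/gpt-4o", "anthropic/claude-sonnet", "gemini/gemini-pro"],
    "complex":  ["openai/o3", "anthropic/claude-opus"],
}

def pick_model(complexity: str, configured_providers: set) -> str:
    """Return the first model in the tier whose provider has an API key."""
    for candidate in MODEL_TIERS[complexity]:
        provider = candidate.split("/")[0]
        if provider in configured_providers:
            return candidate
    raise RuntimeError(f"no configured provider for tier '{complexity}'")
```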
| llm_stream | Stream an LLM response for long-running tasks — shows output as it arrives. Uses the same routing logic as llm_route but streams chunks instead of
waiting for the full response. Ideal for long-form generation, research
summaries, or any task where seeing partial output early is valuable.
Args:
prompt: The task or question to stream.
task_type: Task type hint — "query", "research", "generate", "analyze", "code".
model: Optional model override (e.g. "openai/gpt-4o", "gemini/gemini-2.5-flash").
system_prompt: Optional system instructions.
temperature: Sampling temperature (0.0-2.0).
max_tokens: Maximum output tokens.
|
| llm_query | Send a general query to the best available LLM. Routes by complexity: simple→Haiku/Flash, moderate→Sonnet/GPT-4o, complex→Opus/o3.
Args:
prompt: The question or prompt to send.
complexity: Task complexity — "simple", "moderate", or "complex". Drives model
selection: simple→cheap (Haiku/Flash), moderate→balanced (Sonnet/GPT-4o),
complex→premium (Opus/o3). Auto-detected from prompt length when omitted.
model: Explicit model override, bypasses complexity routing entirely.
system_prompt: Optional system instructions.
temperature: Sampling temperature (0.0-2.0).
max_tokens: Maximum output tokens.
context: Optional conversation context to help the model understand the broader task.
|
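The length-based auto-detection mentioned for `complexity` could look like this (illustrative only; the word-count thresholds are assumptions):

```python
def detect_complexity(prompt: str) -> str:
    """Crude length heuristic used when complexity is omitted."""
    words = len(prompt.split())
    if words < 30:
        return "simple"
    if words < 200:
        return "moderate"
    return "complex"
```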
| llm_research | Search-augmented research query — routes to Perplexity for web-grounded answers. Best for: fact-checking, current events, finding sources, market research.
Args:
prompt: The research question.
system_prompt: Optional system instructions.
max_tokens: Maximum output tokens.
context: Optional conversation context to help the model understand the broader task.
|
| llm_generate | Generate creative or long-form content — routes to the best generation model. Best for: writing, summarization, brainstorming, content creation.
Args:
prompt: What to generate.
complexity: Task complexity — "simple", "moderate", or "complex". Drives model
selection. Simple tasks (short summaries) use cheap models; complex tasks
(long-form, nuanced writing) use premium models.
system_prompt: Optional system instructions (tone, format, audience).
temperature: Sampling temperature (higher = more creative).
max_tokens: Maximum output tokens.
context: Optional conversation context to help the model understand the broader task.
|
| llm_analyze | Deep analysis task — routes to the strongest reasoning model. Best for: data analysis, code review, problem decomposition, debugging.
Args:
prompt: What to analyze.
complexity: Task complexity — "simple", "moderate", or "complex". Analysis tasks
default to at least moderate. Pass "complex" for multi-file reviews or
architecture decisions that warrant Opus/o3.
system_prompt: Optional system instructions.
max_tokens: Maximum output tokens.
context: Optional conversation context to help the model understand the broader task.
|
| llm_code | Coding task — routes to the best coding model. Best for: code generation, refactoring suggestions, algorithm design.
Args:
prompt: The coding task or question.
complexity: Task complexity — "simple", "moderate", or "complex". Drives model
selection: simple questions use Haiku/Flash, actual implementation tasks use
Sonnet/GPT-4o, large refactors or architecture work use Opus/o3.
system_prompt: Optional system instructions (language, framework, style).
max_tokens: Maximum output tokens.
context: Optional conversation context to help the model understand the broader task.
|
| llm_edit | Route code-edit reasoning to a cheap model and return exact edit instructions. Instead of Opus reasoning about what to change (expensive), a cheap model
reads the files, figures out the edits, and returns JSON ``{file, old_string,
new_string}`` pairs that Claude can apply mechanically via the Edit tool.
**How to use the result**: After calling this tool, apply each edit instruction
using the Edit tool with the exact old_string → new_string pairs provided.
Best for: refactoring, bug fixes, adding small features to existing files.
Args:
task: Natural-language description of what to change (e.g.
"Add type hints to all public functions in router.py").
files: List of file paths to read and include in the prompt.
Relative paths are resolved from the current working directory.
Files larger than 32 KB are truncated with a note.
context: Optional conversation context to help the model understand the task.
|
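Applying a returned edit pair mechanically might look like this (a sketch; `apply_edit` and the sample payload are hypothetical, shaped after the ``{file, old_string, new_string}`` format described above):

```python
import json

def apply_edit(text: str, old_string: str, new_string: str) -> str:
    """Apply one old_string → new_string pair; the match must be exact."""
    if old_string not in text:
        raise ValueError("old_string not found — the file may have drifted")
    return text.replace(old_string, new_string, 1)

# Hypothetical payload in the shape the tool returns:
edits = json.loads(
    '[{"file": "router.py",'
    ' "old_string": "def route(prompt):",'
    ' "new_string": "def route(prompt: str) -> str:"}]'
)
```

In practice Claude applies the same pairs via the Edit tool rather than rewriting files itself.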
| llm_image | Generate an image — auto-routes to Gemini Imagen, DALL-E, Flux, or Stable Diffusion.
Args:
prompt: Description of the image to generate.
model: Optional model override (e.g. "gemini/imagen-3", "openai/dall-e-3", "fal/flux-pro", "stability/stable-diffusion-3").
size: Image size (e.g. "1024x1024", "1792x1024").
quality: Image quality — "standard" or "hd" (DALL-E only).
|
| llm_video | Generate a video — routes to Gemini Veo, Runway, Kling, or other video models.
Args:
prompt: Description of the video to generate.
model: Optional model override (e.g. "gemini/veo-2", "runway/gen3a_turbo", "fal/kling-video").
duration: Video duration in seconds (default: 5).
|
| llm_audio | Generate speech/audio — routes to ElevenLabs or OpenAI TTS.
Args:
text: Text to convert to speech.
model: Optional model override (e.g. "openai/tts-1-hd", "elevenlabs/eleven_multilingual_v2").
voice: Voice selection (OpenAI: alloy/echo/fable/onyx/nova/shimmer. ElevenLabs: voice ID).
|
| llm_orchestrate | Multi-step orchestration — automatically decomposes complex tasks across multiple LLMs. Chains research, analysis, generation, and coding steps together, routing each
to the optimal model. Use templates for common patterns or let the AI decompose.
Free tier: up to 2-step pipelines. Pro tier: unlimited steps + auto-decomposition.
Args:
task: Description of the complex task to accomplish.
template: Optional pipeline template: "research_report", "competitive_analysis", "content_pipeline", "code_review_fix". Omit for auto-decomposition.
|
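A two-step template chain could be sketched as follows (hypothetical step lists and runner; real decomposition and routing are richer than this):

```python
# Illustrative templates: each step is (tool, role-for-that-step).
TEMPLATES = {
    "research_report": [("llm_research", "gather sources"),
                        ("llm_generate", "write report")],
    "code_review_fix": [("llm_analyze", "review code"),
                        ("llm_edit", "produce fixes")],
}

def run_pipeline(template: str, task: str, call_tool) -> list:
    """Chain the steps, feeding each step's output into the next as context."""
    outputs, context = [], task
    for tool, role in TEMPLATES[template]:
        context = call_tool(tool, prompt=f"{role}: {context}")
        outputs.append(context)
    return outputs
```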
| llm_pipeline_templates | List available pipeline templates for multi-step orchestration. |
| llm_save_session | Summarize and save the current session for cross-session context. Uses a cheap model to generate a compact summary of the session's exchanges,
then persists it to SQLite. Future routed calls will include this summary
as context, giving external models awareness of prior work. Call this before ending a session or when switching to a different task.
Sessions with fewer than 3 exchanges are skipped. |
| llm_set_profile | Switch the active routing profile.
Args:
profile: One of "budget", "balanced", or "premium".
|
| llm_usage | Unified usage dashboard — Claude subscription, Codex, external APIs, and savings. Shows a complete picture of all LLM usage across all providers in one view.
Args:
period: Time period — "today", "week", "month", or "all".
|
| llm_cache_stats | Show prompt classification cache statistics — hit rate, entries, memory usage. The cache stores ClassificationResult objects keyed by SHA-256(prompt + quality_mode + min_model).
Budget pressure is always applied fresh, so cached classifications stay valid.
|
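The cache key described above can be computed as follows (a sketch of the stated SHA-256 scheme; plain concatenation of the three fields is an assumption):

```python
import hashlib

def cache_key(prompt: str, quality_mode: str, min_model: str) -> str:
    """SHA-256 over prompt + quality_mode + min_model, as hex."""
    payload = (prompt + quality_mode + min_model).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()
```

Because budget pressure is not part of the key, entries survive budget drift and only the inputs that change classification invalidate the cache.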
| llm_cache_clear | Clear the prompt classification cache. |
| llm_quality_report | Show routing quality metrics — classification accuracy, savings, model distribution. Analyzes routing decisions over the specified period to show how the
classifier is performing, which models are being selected, downshift
rates, and cost efficiency.
Args:
days: Number of days to include in the report (default 7).
|
| llm_health | Check the health status of all configured LLM providers. |
| llm_providers | List all supported providers and which ones are configured. |
| llm_dashboard | Open the LLM Router web dashboard in the background. Starts a local HTTP server at localhost:<port> showing routing stats,
cost trends, model distribution, and recent decisions. Refreshes every 30s.
The dashboard reads from the same SQLite DB the router writes — no extra
configuration needed.
Args:
port: TCP port for the dashboard server (default 7337).
Returns:
URL and instructions for opening the dashboard.
|
| llm_check_usage | Check real-time Claude subscription usage (session limits, weekly limits, extra spend). Shows cached data if available. If no data cached, returns the JS snippet
to run via Playwright's browser_evaluate (one call, no page navigation needed).
The budget pressure from this data feeds directly into model routing —
higher usage = more aggressive downshifting to cheaper models.
|
| llm_update_usage | Update cached Claude usage from the JSON API response. Call this with the result from browser_evaluate(FETCH_USAGE_JS).
Accepts the full JSON object from the claude.ai internal API.
The cached data is used by llm_classify for real budget pressure
instead of token-based estimates.
Args:
data: JSON response from the claude.ai usage API (via browser_evaluate).
|
| llm_refresh_claude_usage | Refresh Claude subscription usage via the OAuth API — no browser required. Reads the Claude Code OAuth token from the macOS Keychain, calls the
Anthropic OAuth usage endpoint, and updates the local usage cache.
Requires: Claude Code installed and authenticated on macOS.
|
| llm_codex | Route a task to the local Codex desktop agent (OpenAI). Uses the Codex CLI to run tasks non-interactively on the user's
OpenAI subscription (not Claude quota) — ideal as a fallback when Claude
limits are tight, or for tasks that benefit from OpenAI's models.
Available models: gpt-5.4, o3, o4-mini, gpt-4o, gpt-4o-mini
Args:
prompt: The task or question to send to Codex.
model: OpenAI model to use (default: gpt-5.4).
|
| llm_setup | Set up and manage API providers, hooks, and routing enforcement.
Actions:
- "status": Show which providers are configured and which are missing
- "guide": Step-by-step guide to add recommended free/cheap providers
- "discover": Scan for existing API keys in environment (safe, read-only)
- "add": Add an API key for a provider (writes to .env file securely)
- "test": Validate API keys with a minimal call (tests configured or specific provider)
- "provider": Show details about a specific provider
- "install_hooks": Install auto-routing hooks globally (every Claude Code session)
- "uninstall_hooks": Remove auto-routing hooks
Args:
action: What to do — "status", "guide", "discover", "add", "test", "provider", "install_hooks", or "uninstall_hooks".
provider: Provider name (for "add", "test", and "provider" actions).
api_key: API key value (for "add" action only). Key is validated before saving.
|
| llm_rate | Rate the last (or a specific) routing decision as good or bad. Stores thumbs-up / thumbs-down feedback in the ``routing_decisions`` table.
Over time this signal can be used to retrain the local classifier so routing
choices improve based on your preferences.
Args:
good: True = routing was a good choice; False = bad choice.
decision_id: Row ID to rate. Omit (or pass None) to rate the most recent
routing decision.
Returns:
Confirmation string with the rated decision ID, or an error message.
|