Server Configuration

Describes the environment variables required to run the server.

No environment variables are required.

Capabilities

Features and capabilities supported by this server

Capability      Details
tools           { "listChanged": false }
prompts         { "listChanged": false }
resources       { "subscribe": false, "listChanged": false }
experimental    {}

Tools

Functions exposed to the LLM to take actions

llm_classify

Classify a prompt's complexity and recommend which model to use.

Returns a smart recommendation considering complexity, daily token budget,
quality preference, and minimum model floor. Includes budget usage bar.

Complexity drives model selection at all times:
- simple → haiku, moderate → sonnet, complex → opus
Budget pressure is a late safety net only:
- 0-85%: no downshift — complexity routing handles efficiency
- 85-95%: downshift by 1 tier (opus→sonnet, sonnet→haiku)
- 95%+: downshift by 2 tiers, warns user

Args:
    prompt: The task or question to classify.
    quality: Override quality mode — "best", "balanced", or "conserve".
    min_model: Override minimum model floor — "haiku", "sonnet", or "opus".
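The tier mapping and budget thresholds above can be sketched in a few lines. This is an illustrative reconstruction of the described rules, not the server's actual implementation; the function names are hypothetical.

```python
# Routing rules as described: complexity picks the tier, budget pressure
# is a late safety net that downshifts near the daily limit.
TIERS = ["haiku", "sonnet", "opus"]  # cheapest → most capable

def base_model(complexity: str) -> str:
    """Complexity drives selection: simple→haiku, moderate→sonnet, complex→opus."""
    return {"simple": "haiku", "moderate": "sonnet", "complex": "opus"}[complexity]

def apply_budget_pressure(model: str, budget_used_pct: float,
                          min_model: str = "haiku") -> str:
    """No downshift below 85%, one tier at 85-95%, two tiers at 95%+,
    never below the minimum model floor."""
    steps = 2 if budget_used_pct >= 95 else 1 if budget_used_pct >= 85 else 0
    idx = max(TIERS.index(model) - steps, TIERS.index(min_model))
    return TIERS[idx]

print(apply_budget_pressure(base_model("complex"), 90))  # opus downshifts to sonnet
```

Note how `min_model` acts as a floor: even at 95%+ pressure, a `min_model="sonnet"` request is never pushed down to haiku.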
llm_track_usage

Report Claude Code model token usage for budget tracking.

Call this after using an Agent with haiku/sonnet to track token consumption
against the daily budget. This enables progressive model downshifting.
Shows per-call savings vs opus and cumulative session savings.

Args:
    model: The Claude model used — "haiku", "sonnet", or "opus".
    tokens_used: Approximate tokens consumed by the Agent call.
    complexity: The task complexity that was routed — "simple", "moderate", "complex".
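The "savings vs opus" figure this tool reports presumably compares what the tokens would have cost on opus against the model actually used. A minimal sketch, with the caveat that the per-token rates below are placeholders and NOT Anthropic's actual pricing:

```python
# Hypothetical per-million-token rates, for illustration only.
RATE_PER_MTOK = {"haiku": 1.0, "sonnet": 6.0, "opus": 30.0}

def savings_vs_opus(model: str, tokens_used: int) -> float:
    """Dollars saved by running tokens_used on `model` instead of opus."""
    delta = RATE_PER_MTOK["opus"] - RATE_PER_MTOK[model]
    return tokens_used / 1_000_000 * delta

print(round(savings_vs_opus("haiku", 500_000), 2))  # 14.5 with the rates above
```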
llm_route

Smart router — classifies task complexity, then routes to the optimal external LLM.

Uses a cheap classifier to assess complexity, then picks the right model tier:
- simple → budget models (Gemini Flash, GPT-4o-mini)
- moderate → balanced models (GPT-4o, Sonnet, Gemini Pro)
- complex → premium models (o3, Opus)

For routing to Claude Code's own models (haiku/sonnet) without API keys,
use llm_classify instead and follow its recommendation.

Args:
    prompt: The task or question to route.
    task_type: Optional hint — "query", "research", "generate", "analyze", "code". Auto-detected if omitted.
    complexity_override: Skip classification — force "simple", "moderate", or "complex".
    system_prompt: Optional system instructions.
    temperature: Sampling temperature (0.0-2.0).
    max_tokens: Maximum output tokens.
    context: Optional conversation context to help the model understand the broader task.
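As with any MCP tool, llm_route is invoked via a JSON-RPC `tools/call` request. A sketch of such a request using the arguments listed above; the envelope follows the MCP specification, while the argument values are illustrative:

```python
import json

# Example MCP tools/call request for llm_route.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "llm_route",
        "arguments": {
            "prompt": "Summarize the trade-offs between SQLite and Postgres",
            "task_type": "analyze",  # optional hint; auto-detected if omitted
            "temperature": 0.3,
            "max_tokens": 800,
        },
    },
}
print(json.dumps(request, indent=2))
```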
llm_stream

Stream an LLM response for long-running tasks — shows output as it arrives.

Uses the same routing logic as llm_route but streams chunks instead of
waiting for the full response. Ideal for long-form generation, research
summaries, or any task where seeing partial output early is valuable.

Args:
    prompt: The task or question to stream.
    task_type: Task type hint — "query", "research", "generate", "analyze", "code".
    model: Optional model override (e.g. "openai/gpt-4o", "gemini/gemini-2.5-flash").
    system_prompt: Optional system instructions.
    temperature: Sampling temperature (0.0-2.0).
    max_tokens: Maximum output tokens.
llm_query

Send a general query to the best available LLM.

Routes by complexity: simple→Haiku/Flash, moderate→Sonnet/GPT-4o, complex→Opus/o3.

Args:
    prompt: The question or prompt to send.
    complexity: Task complexity — "simple", "moderate", or "complex". Drives model
        selection: simple→cheap (Haiku/Flash), moderate→balanced (Sonnet/GPT-4o),
        complex→premium (Opus/o3). Auto-detected from prompt length when omitted.
    model: Explicit model override, bypasses complexity routing entirely.
    system_prompt: Optional system instructions.
    temperature: Sampling temperature (0.0-2.0).
    max_tokens: Maximum output tokens.
    context: Optional conversation context to help the model understand the broader task.
llm_research

Search-augmented research query — routes to Perplexity for web-grounded answers.

Best for: fact-checking, current events, finding sources, market research.

Args:
    prompt: The research question.
    system_prompt: Optional system instructions.
    max_tokens: Maximum output tokens.
    context: Optional conversation context to help the model understand the broader task.
llm_generate

Generate creative or long-form content — routes to the best generation model.

Best for: writing, summarization, brainstorming, content creation.

Args:
    prompt: What to generate.
    complexity: Task complexity — "simple", "moderate", or "complex". Drives model
        selection. Simple tasks (short summaries) use cheap models; complex tasks
        (long-form, nuanced writing) use premium models.
    system_prompt: Optional system instructions (tone, format, audience).
    temperature: Sampling temperature (higher = more creative).
    max_tokens: Maximum output tokens.
    context: Optional conversation context to help the model understand the broader task.
llm_analyze

Deep analysis task — routes to the strongest reasoning model.

Best for: data analysis, code review, problem decomposition, debugging.

Args:
    prompt: What to analyze.
    complexity: Task complexity — "simple", "moderate", or "complex". Analysis tasks
        default to at least moderate. Pass "complex" for multi-file reviews or
        architecture decisions that warrant Opus/o3.
    system_prompt: Optional system instructions.
    max_tokens: Maximum output tokens.
    context: Optional conversation context to help the model understand the broader task.
llm_code

Coding task — routes to the best coding model.

Best for: code generation, refactoring suggestions, algorithm design.

Args:
    prompt: The coding task or question.
    complexity: Task complexity — "simple", "moderate", or "complex". Drives model
        selection: simple questions use Haiku/Flash, actual implementation tasks use
        Sonnet/GPT-4o, large refactors or architecture work use Opus/o3.
    system_prompt: Optional system instructions (language, framework, style).
    max_tokens: Maximum output tokens.
    context: Optional conversation context to help the model understand the broader task.
llm_edit

Route code-edit reasoning to a cheap model and return exact edit instructions.

Instead of Opus reasoning about what to change (expensive), a cheap model
reads the files, figures out the edits, and returns JSON ``{file, old_string,
new_string}`` pairs that Claude can apply mechanically via the Edit tool.

**How to use the result**: After calling this tool, apply each edit instruction
using the Edit tool with the exact old_string → new_string pairs provided.

Best for: refactoring, bug fixes, adding small features to existing files.

Args:
    task: Natural-language description of what to change (e.g.
        "Add type hints to all public functions in router.py").
    files: List of file paths to read and include in the prompt.
        Relative paths are resolved from the current working directory.
        Files larger than 32 KB are truncated with a note.
    context: Optional conversation context to help the model understand the task.
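Applying the returned pairs is, as the description says, mechanical: each edit is an exact-match string replacement in the named file. A sketch of that application step (the edit payload below is illustrative, not real tool output):

```python
import json

# Example llm_edit-style output: a JSON list of {file, old_string, new_string}.
edits_json = json.dumps([
    {"file": "router.py",
     "old_string": "def route(prompt):",
     "new_string": "def route(prompt: str) -> str:"},
])

def apply_edits(sources: dict[str, str], edits_json: str) -> dict[str, str]:
    """Apply each exact old_string → new_string pair to the matching file."""
    out = dict(sources)
    for edit in json.loads(edits_json):
        out[edit["file"]] = out[edit["file"]].replace(
            edit["old_string"], edit["new_string"], 1)
    return out

patched = apply_edits({"router.py": "def route(prompt):\n    ..."}, edits_json)
print(patched["router.py"].splitlines()[0])  # def route(prompt: str) -> str:
```

In practice Claude applies these pairs through its Edit tool rather than in Python, but the semantics are the same: the cheap model does the reasoning, the application is exact-match substitution.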
llm_image

Generate an image — auto-routes to Gemini Imagen, DALL-E, Flux, or Stable Diffusion.

Args:
    prompt: Description of the image to generate.
    model: Optional model override (e.g. "gemini/imagen-3", "openai/dall-e-3", "fal/flux-pro", "stability/stable-diffusion-3").
    size: Image size (e.g. "1024x1024", "1792x1024").
    quality: Image quality — "standard" or "hd" (DALL-E only).
llm_video

Generate a video — routes to Gemini Veo, Runway, Kling, or other video models.

Args:
    prompt: Description of the video to generate.
    model: Optional model override (e.g. "gemini/veo-2", "runway/gen3a_turbo", "fal/kling-video").
    duration: Video duration in seconds (default: 5).
llm_audio

Generate speech/audio — routes to ElevenLabs or OpenAI TTS.

Args:
    text: Text to convert to speech.
    model: Optional model override (e.g. "openai/tts-1-hd", "elevenlabs/eleven_multilingual_v2").
    voice: Voice selection (OpenAI: alloy/echo/fable/onyx/nova/shimmer. ElevenLabs: voice ID).
llm_orchestrate

Multi-step orchestration — automatically decomposes complex tasks across multiple LLMs.

Chains research, analysis, generation, and coding steps together, routing each
to the optimal model. Use templates for common patterns or let the AI decompose.

Free tier: up to 2-step pipelines. Pro tier: unlimited steps + auto-decomposition.

Args:
    task: Description of the complex task to accomplish.
    template: Optional pipeline template: "research_report", "competitive_analysis", "content_pipeline", "code_review_fix". Omit for auto-decomposition.
llm_pipeline_templates

List available pipeline templates for multi-step orchestration.

llm_save_session

Summarize and save the current session for cross-session context.

Uses a cheap model to generate a compact summary of the session's exchanges, then persists it to SQLite. Future routed calls will include this summary as context, giving external models awareness of prior work.

Call this before ending a session or when switching to a different task. Sessions with fewer than 3 exchanges are skipped.

llm_set_profile

Switch the active routing profile.

Args:
    profile: One of "budget", "balanced", or "premium".
llm_usage

Unified usage dashboard — Claude subscription, Codex, external APIs, and savings.

Shows a complete picture of all LLM usage across all providers in one view.

Args:
    period: Time period — "today", "week", "month", or "all".
llm_cache_stats

Show prompt classification cache statistics — hit rate, entries, memory usage.

The cache stores ClassificationResult objects keyed by SHA-256(prompt + quality_mode + min_model).
Budget pressure is always applied fresh, so cached classifications stay valid.
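The cache key described above is straightforward to sketch. The exact concatenation scheme and delimiter are assumptions; only the SHA-256-over-(prompt + quality_mode + min_model) shape comes from the description:

```python
import hashlib

def cache_key(prompt: str, quality_mode: str, min_model: str) -> str:
    """SHA-256 over prompt + quality mode + minimum model floor
    (delimiter is an assumption)."""
    raw = f"{prompt}|{quality_mode}|{min_model}"
    return hashlib.sha256(raw.encode()).hexdigest()

k1 = cache_key("Explain CRDTs", "balanced", "haiku")
k2 = cache_key("Explain CRDTs", "best", "haiku")
print(k1 != k2)  # True: a different quality mode is a different cache entry
```

Because budget pressure is applied after the cache lookup, a cached entry keyed this way never goes stale as the daily budget fills up.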
llm_cache_clear

Clear the prompt classification cache.

llm_quality_report

Show routing quality metrics — classification accuracy, savings, model distribution.

Analyzes routing decisions over the specified period to show how the
classifier is performing, which models are being selected, downshift
rates, and cost efficiency.

Args:
    days: Number of days to include in the report (default 7).
llm_health

Check the health status of all configured LLM providers.

llm_providers

List all supported providers and which ones are configured.

llm_dashboard

Open the LLM Router web dashboard in the background.

Starts a local HTTP server at localhost:<port> showing routing stats,
cost trends, model distribution, and recent decisions. Refreshes every 30s.

The dashboard reads from the same SQLite DB the router writes — no extra
configuration needed.

Args:
    port: TCP port for the dashboard server (default 7337).

Returns:
    URL and instructions for opening the dashboard.
llm_check_usage

Check real-time Claude subscription usage (session limits, weekly limits, extra spend).

Shows cached data if available. If no data is cached, returns the JS snippet
to run via Playwright's browser_evaluate (one call, no page navigation needed).

The budget pressure from this data feeds directly into model routing —
higher usage = more aggressive downshifting to cheaper models.
llm_update_usage

Update cached Claude usage from the JSON API response.

Call this with the result from browser_evaluate(FETCH_USAGE_JS).
Accepts the full JSON object from the claude.ai internal API.

The cached data is used by llm_classify for real budget pressure
instead of token-based estimates.

Args:
    data: JSON response from the claude.ai usage API (via browser_evaluate).
llm_refresh_claude_usage

Refresh Claude subscription usage via the OAuth API — no browser required.

Reads the Claude Code OAuth token from the macOS Keychain, calls the
Anthropic OAuth usage endpoint, and updates the local usage cache.

Requires: Claude Code installed and authenticated on macOS.
llm_codex

Route a task to the local Codex desktop agent (OpenAI).

Uses the Codex CLI to run tasks non-interactively. This uses the user's
OpenAI subscription (not Claude quota) — ideal as a fallback when Claude
limits are tight, or for tasks that benefit from OpenAI's models.

Available models: gpt-5.4, o3, o4-mini, gpt-4o, gpt-4o-mini

Args:
    prompt: The task or question to send to Codex.
    model: OpenAI model to use (default: gpt-5.4).
llm_setup

Set up and manage API providers, hooks, and routing enforcement.

Actions:
- "status": Show which providers are configured and which are missing
- "guide": Step-by-step guide to add recommended free/cheap providers
- "discover": Scan for existing API keys in environment (safe, read-only)
- "add": Add an API key for a provider (writes to .env file securely)
- "test": Validate API keys with a minimal call (tests configured or specific provider)
- "provider": Show details about a specific provider
- "install_hooks": Install auto-routing hooks globally (every Claude Code session)
- "uninstall_hooks": Remove auto-routing hooks

Args:
    action: What to do — "status", "guide", "discover", "add", "test", "provider", "install_hooks", or "uninstall_hooks".
    provider: Provider name (for "add", "test", and "provider" actions).
    api_key: API key value (for "add" action only). Key is validated before saving.
llm_rate

Rate the last (or a specific) routing decision as good or bad.

Stores thumbs-up / thumbs-down feedback in the ``routing_decisions`` table.
Over time this signal can be used to retrain the local classifier so routing
choices improve based on your preferences.

Args:
    good: True = routing was a good choice; False = bad choice.
    decision_id: Row ID to rate. Omit (or pass None) to rate the most recent
        routing decision.

Returns:
    Confirmation string with the rated decision ID, or an error message.

Prompts

Interactive templates invoked by user choice

No prompts

Resources

Contextual data attached and managed by the client

router_status

MCP resource returning a plain-text snapshot of the router's current state. Includes the active profile, subscription tier, configured provider counts (text and media), optional monthly budget, and per-provider circuit-breaker health status. Returns a newline-delimited plain-text summary (not markdown).
