# Model Selection and Cost
## How Cost Accrues
RLM uses **two models**: `model_name` (root) and `other_model_name` (recursive subcalls). Every `llm_query` subcall is an actual model invocation, so you pay for:
* Root call prompt + completion tokens
* **The sum of prompt + completion tokens across all recursive subcalls**
* Any extra "reasoning tokens" a provider meters (OpenRouter surfaces this in its stats UI)
The paper's whole trick is *not* "free reasoning"; it's shifting from "read everything" to "probe + recurse", often pairing a stronger root model with a cheaper recursive model.
## Cost-Efficient Strategy: Strong Root + Cheap Recursion
RLM gives you a natural place to be stingy:
* **Root model**: Choose for *planning/tool-use/code reasoning quality*
* **Other model**: Choose for *cheap repeated analysis* (summaries, regex scanning, localized reasoning, small code transforms)
This is exactly the pairing the paper leans on.
## OpenRouter Pricing and Accounting
OpenRouter prices are in USD per token. Important nuance: OpenRouter bills on **native token counts**, and recommends querying `/api/v1/generation` for precise per-request accounting.
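If you want that accounting in code, a request against the generation endpoint looks roughly like this. This is a minimal sketch: the endpoint and auth header are from the OpenRouter docs, but the exact response field names (e.g. `native_tokens_prompt`) are assumptions to verify against the current API reference.

```python
import os
import requests

OPENROUTER_API_KEY = os.environ["OPENROUTER_API_KEY"]

def fetch_generation_stats(generation_id: str) -> dict:
    """Fetch post-hoc accounting for one generation from OpenRouter.

    The generation id is returned on the completion response. Field names
    in the returned dict (e.g. native_tokens_prompt, total_cost) are
    illustrative -- check /api/v1/generation in the docs for the exact shape.
    """
    resp = requests.get(
        "https://openrouter.ai/api/v1/generation",
        params={"id": generation_id},
        headers={"Authorization": f"Bearer {OPENROUTER_API_KEY}"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["data"]

# Example: sum native token counts across a root call and its subcalls.
# stats = [fetch_generation_stats(gid) for gid in generation_ids]
# total_in = sum(s["native_tokens_prompt"] for s in stats)
# total_out = sum(s["native_tokens_completion"] for s in stats)
```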
**Total cost per MCP `rlm.solve`** (roughly):
```
cost ≈ ∑_{k ∈ {root} ∪ subcalls} (inTok_k · p^{(in)}_{m_k} + outTok_k · p^{(out)}_{m_k})
```
where `m_k` is the model used for that call (root vs. other).
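As a concrete instance of the formula, here is a minimal sketch that sums per-call costs over the root call and its subcalls. The prices are the example numbers quoted in the next section, expressed per million tokens; the token counts are made up for illustration.

```python
from dataclasses import dataclass

@dataclass
class CallUsage:
    model: str   # "root" or "other"
    in_tok: int  # prompt tokens for this call
    out_tok: int # completion tokens for this call

# Per-model prices in USD per *million* tokens (example numbers, not constants).
PRICES = {
    "root":  {"in": 0.15, "out": 0.60},  # e.g. openai/gpt-4o-mini
    "other": {"in": 0.03, "out": 0.09},  # e.g. qwen/qwen-2.5-coder-7b-instruct
}

def solve_cost(calls: list[CallUsage]) -> float:
    """Apply the formula above: sum input/output costs over root + subcalls."""
    return sum(
        c.in_tok / 1e6 * PRICES[c.model]["in"]
        + c.out_tok / 1e6 * PRICES[c.model]["out"]
        for c in calls
    )

# One root call plus three cheap recursive probes:
calls = [CallUsage("root", 12_000, 1_500)] + [CallUsage("other", 4_000, 400)] * 3
print(f"${solve_cost(calls):.6f}")  # ≈ $0.003168
```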
## Cost-Efficient OpenRouter Model Options (Examples)
Prices below come from OpenRouter model/provider pages (they can change; treat as examples, not constants):
### Ultra-cheap "other model" for recursion
* `qwen/qwen-2.5-coder-7b-instruct`: **$0.03/M input, $0.09/M output**
* `qwen/qwen-2.5-coder-32b-instruct`: (still very cheap; good coding quality per $)
### Cheap + fast + huge context
* `google/gemini-2.0-flash-001`: **$0.10/M input, $0.40/M output**, ~1M context
This is a strong default when you want low latency and can tolerate occasional "flash" behavior.
### Budget "root model" that's usually good enough
* `openai/gpt-4o-mini`: **$0.15/M input, $0.60/M output**
### Higher-quality roots (still relatively economical)
* DeepSeek variants have multiple price/perf points (including some free variants).
If you let users choose DeepSeek "free" routes, document reliability expectations.
## Hard Knobs That Dominate Spend
These are already in your request schema and server:
* `request.rlm.max_iterations` (trajectory length cap)
* `request.rlm.timeout_sec` (server wraps completion with `asyncio.wait_for`)
* `backend_kwargs` / `other_backend_kwargs` (where users can pass `max_tokens`, `temperature`, etc.)
Recommended sane defaults (illustrated in the sketch after this list):
* `max_iterations`: 8-15
* `other_backend_kwargs.max_tokens`: keep it small (e.g., 256-512) unless you explicitly need verbose sub-analyses
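Putting the knobs and the strong-root/cheap-other pairing together, a request body along these lines reflects the defaults above. This is a sketch only: the exact payload shape is whatever your MCP request schema defines; the nesting here simply mirrors the key paths named above, and the model choices are the example pairings from earlier sections.

```python
# Illustrative rlm.solve request body; the exact schema is defined by your server.
request = {
    "model_name": "openai/gpt-4o-mini",                      # root: planning / tool use
    "other_model_name": "qwen/qwen-2.5-coder-7b-instruct",   # cheap recursive probes
    "rlm": {
        "max_iterations": 12,   # trajectory length cap (8-15 range above)
        "timeout_sec": 300,     # server wraps completion with asyncio.wait_for
    },
    "backend_kwargs": {"temperature": 0.2},
    "other_backend_kwargs": {"max_tokens": 384},  # keep sub-analyses short
}
```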
## Local-Only Option (No Token Spend)
You already expose presets for Ollama and vLLM via OpenAI-compatible base URLs.
* Ollama documents OpenAI compatibility (the Responses API, plus earlier Chat Completions support); compatibility scope varies by endpoint.
* vLLM provides an OpenAI-compatible server (`vllm serve … --api-key …`).
This is the only path where RLM probes don't incur marginal API costs (you pay in GPU time instead).
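For reference, pointing an OpenAI-compatible client at a local server looks like this. It's a sketch: the base URLs are the common Ollama and vLLM defaults, and the model name is a placeholder for whatever you have pulled or served locally.

```python
from openai import OpenAI

# Ollama's OpenAI-compatible endpoint usually lives at /v1 on port 11434;
# vLLM's `vllm serve` defaults to port 8000. The api_key can be any non-empty
# string unless you configured one (e.g. vLLM's --api-key).
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

resp = client.chat.completions.create(
    model="qwen2.5-coder:7b",  # placeholder: whatever model you serve locally
    messages=[{"role": "user", "content": "Summarize this chunk: ..."}],
    max_tokens=256,
)
print(resp.choices[0].message.content)
```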
## Benchmarking Your Costs
Use `bench/bench_tokens.py` to compare baseline vs RLM token usage and verify savings on your own tasks.
## References
* [OpenRouter API Reference](https://openrouter.ai/docs/api/reference/overview)
* [OpenRouter Authentication](https://openrouter.ai/docs/api/reference/authentication)
* [OpenRouter Models](https://openrouter.ai/docs/guides/overview/models)