# Model Selection and Cost
## How Cost Accrues
RLM uses **two models**: `model_name` (root) and `other_model_name` (recursive subcalls). Every `llm_query` subcall is an actual model invocation, so you pay for:
* Root call prompt + completion tokens
* **The sum of prompt + completion tokens across all recursive subcalls**
* Any extra "reasoning tokens" a provider meters (OpenRouter surfaces this in its stats UI)
The paper's whole trick is *not* "free reasoning"; it's shifting from "read everything" to "probe + recurse", often pairing a stronger root model with a cheaper recursive model.
## Cost-Efficient Strategy: Strong Root + Cheap Recursion
RLM gives you a natural place to be stingy:
* **Root model**: Choose for *planning/tool-use/code reasoning quality*
* **Other model**: Choose for *cheap repeated analysis* (summaries, regex scanning, localized reasoning, small code transforms)
This is exactly the pairing the paper leans on.
## OpenRouter Pricing and Accounting
OpenRouter prices are in USD per token. Important nuance: OpenRouter bills on **native token counts**, and recommends querying `/api/v1/generation` for precise per-request accounting.
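If you want that accounting in code, a request against the generation endpoint looks roughly like this. This is a minimal sketch: the endpoint and auth header are from the OpenRouter docs, but the exact response field names (e.g. `native_tokens_prompt`) are assumptions to verify against the current API reference.

```python
import os
import requests

OPENROUTER_API_KEY = os.environ["OPENROUTER_API_KEY"]

def fetch_generation_stats(generation_id: str) -> dict:
    """Fetch post-hoc accounting for one generation from OpenRouter.

    The generation id is returned on the completion response. Field names
    in the returned dict (e.g. native_tokens_prompt, total_cost) are
    illustrative -- check /api/v1/generation in the docs for the exact shape.
    """
    resp = requests.get(
        "https://openrouter.ai/api/v1/generation",
        params={"id": generation_id},
        headers={"Authorization": f"Bearer {OPENROUTER_API_KEY}"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["data"]

# Example: sum native token counts across a root call and its subcalls.
# stats = [fetch_generation_stats(gid) for gid in generation_ids]
# total_in = sum(s["native_tokens_prompt"] for s in stats)
# total_out = sum(s["native_tokens_completion"] for s in stats)
```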
**Total cost per MCP `rlm.solve`** (roughly):
```
cost ≈ ∑_{k ∈ {root} ∪ subcalls} (inTok_k · p^{(in)}_{m_k} + outTok_k · p^{(out)}_{m_k})
```
where `m_k` is the model used for that call (root vs. other).
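As a concrete instance of the formula, here is a minimal sketch that sums per-call costs over the root call and its subcalls. The prices are the example numbers quoted in the next section, expressed per million tokens; the token counts are made up for illustration.

```python
from dataclasses import dataclass

@dataclass
class CallUsage:
    model: str   # "root" or "other"
    in_tok: int  # prompt tokens for this call
    out_tok: int # completion tokens for this call

# Per-model prices in USD per *million* tokens (example numbers, not constants).
PRICES = {
    "root":  {"in": 0.15, "out": 0.60},  # e.g. openai/gpt-4o-mini
    "other": {"in": 0.03, "out": 0.09},  # e.g. qwen/qwen-2.5-coder-7b-instruct
}

def solve_cost(calls: list[CallUsage]) -> float:
    """Apply the formula above: sum input/output costs over root + subcalls."""
    return sum(
        c.in_tok / 1e6 * PRICES[c.model]["in"]
        + c.out_tok / 1e6 * PRICES[c.model]["out"]
        for c in calls
    )

# One root call plus three cheap recursive probes:
calls = [CallUsage("root", 12_000, 1_500)] + [CallUsage("other", 4_000, 400)] * 3
print(f"${solve_cost(calls):.6f}")  # ≈ $0.003168
```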
## Cost-Efficient OpenRouter Model Options (Examples)
Prices below come from OpenRouter model/provider pages (they can change; treat as examples, not constants):
### Ultra-cheap "other model" for recursion
* `qwen/qwen-2.5-coder-7b-instruct`: **$0.03/M input, $0.09/M output**
* `qwen/qwen-2.5-coder-32b-instruct`: (still very cheap; good coding quality per $)
### Cheap + fast + huge context
* `google/gemini-2.0-flash-001`: **$0.10/M input, $0.40/M output**, ~1M context
This is a strong default when you want low latency and can tolerate occasional "flash" behavior.
### Budget "root model" that's usually good enough
* `openai/gpt-4o-mini`: **$0.15/M input, $0.60/M output**
### Higher-quality roots (still relatively economical)
* DeepSeek variants have multiple price/perf points (including some free variants).
If you let users choose DeepSeek "free" routes, document reliability expectations.
## Hard Knobs That Dominate Spend
These are already in your request schema and server:
* `request.rlm.max_iterations` (trajectory length cap)
* `request.rlm.timeout_sec` (server wraps completion with `asyncio.wait_for`)
* `backend_kwargs` / `other_backend_kwargs` (where users can pass `max_tokens`, `temperature`, etc.)
Recommended sane defaults (illustrated in the sketch after this list):
* `max_iterations`: 8-15
* `other_backend_kwargs.max_tokens`: keep it small (e.g., 256-512) unless you explicitly need verbose sub-analyses
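Putting the knobs and the strong-root/cheap-other pairing together, a request body along these lines reflects the defaults above. This is a sketch only: the exact payload shape is whatever your MCP request schema defines; the nesting here simply mirrors the key paths named above, and the model choices are the example pairings from earlier sections.

```python
# Illustrative rlm.solve request body; the exact schema is defined by your server.
request = {
    "model_name": "openai/gpt-4o-mini",                      # root: planning / tool use
    "other_model_name": "qwen/qwen-2.5-coder-7b-instruct",   # cheap recursive probes
    "rlm": {
        "max_iterations": 12,   # trajectory length cap (8-15 range above)
        "timeout_sec": 300,     # server wraps completion with asyncio.wait_for
    },
    "backend_kwargs": {"temperature": 0.2},
    "other_backend_kwargs": {"max_tokens": 384},  # keep sub-analyses short
}
```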
## Local-Only Option (No Token Spend)
You already expose presets for Ollama and vLLM via OpenAI-compatible base URLs.
* Ollama documents OpenAI compatibility (the Responses API, plus earlier Chat Completions support); compatibility scope varies by endpoint.
* vLLM provides an OpenAI-compatible server (`vllm serve … --api-key …`).
This is the only path where RLM probes don't incur marginal API costs (you pay in GPU time instead).
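For reference, pointing an OpenAI-compatible client at a local server looks like this. It's a sketch: the base URLs are the common Ollama and vLLM defaults, and the model name is a placeholder for whatever you have pulled or served locally.

```python
from openai import OpenAI

# Ollama's OpenAI-compatible endpoint usually lives at /v1 on port 11434;
# vLLM's `vllm serve` defaults to port 8000. The api_key can be any non-empty
# string unless you configured one (e.g. vLLM's --api-key).
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

resp = client.chat.completions.create(
    model="qwen2.5-coder:7b",  # placeholder: whatever model you serve locally
    messages=[{"role": "user", "content": "Summarize this chunk: ..."}],
    max_tokens=256,
)
print(resp.choices[0].message.content)
```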
## Benchmarking Your Costs
Use `bench/bench_tokens.py` to compare baseline vs RLM token usage and verify savings on your own tasks.
## References
* [OpenRouter API Reference](https://openrouter.ai/docs/api/reference/overview)
* [OpenRouter Authentication](https://openrouter.ai/docs/api/reference/authentication)
* [OpenRouter Models](https://openrouter.ai/docs/guides/overview/models)