# RLM Benchmarks
## One-shot vs. Recursive Probing
Standard MCP tool calls often stuff as much context as possible into a single prompt. This leads to:
1. Higher costs (attention scales quadratically with prompt length).
2. "Lost in the middle" effects, where mid-prompt content is effectively ignored.
3. Hard context-window limits.
RLM instead probes recursively, pulling in *exactly* the context a query needs, as the sketch below illustrates.
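The difference is easiest to see in code. Here is a minimal sketch, not the RLM implementation: `llm()` stands in for any chat-completion call, and the `READ`/`ANSWER` prompt protocol is invented for illustration.

```python
# Minimal sketch of one-shot vs. recursive probing (not the RLM implementation).
from pathlib import Path

def llm(prompt: str) -> str:
    """Placeholder for any chat-completion call (e.g., an OpenRouter request)."""
    raise NotImplementedError

def one_shot(query: str, files: list[Path]) -> str:
    # Baseline: concatenate every file into one giant prompt.
    context = "\n\n".join(f.read_text() for f in files)
    return llm(f"{context}\n\nQuestion: {query}")

def recursive(query: str, files: list[Path], max_steps: int = 10) -> str:
    # RLM-style: show only the file listing, then let the model pull in
    # one file per step until it has enough context to answer.
    listing = "\n".join(str(f) for f in files)
    notes = ""
    for _ in range(max_steps):
        step = llm(
            f"Files:\n{listing}\n\nNotes so far:\n{notes}\n\n"
            f"Question: {query}\n"
            "Reply `READ <path>` to inspect a file, or `ANSWER <text>` when done."
        )
        if step.startswith("ANSWER"):
            return step.removeprefix("ANSWER").strip()
        path = Path(step.removeprefix("READ").strip())
        notes += f"\n--- {path} ---\n{path.read_text()}"
    return notes  # step budget exhausted; return what was gathered
```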
### Benchmark Harness
You can run the included benchmark script to see the difference for your own codebase:
```bash
uv run python bench/bench_tokens.py \
--query "Detailed explanation of security boundaries" \
--globs "**/*.py" \
--provider_preset openrouter \
--model anthropic/claude-3-sonnet
```
### Typical Results
Illustrative numbers for a large-context task:
- **Vanilla (one-shot)**: ~120k input tokens, ~$0.36, ~15s latency.
- **RLM**: ~12k input tokens, ~$0.04, ~30s latency (sequential recursive steps).

**Win**: ~90% reduction in input tokens and cost for large-context tasks.
**Trade-off**: higher latency, since the recursive steps run sequentially; the small-codebase tests below show 10-100x wall-clock blowups.
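These headline numbers are plain arithmetic. The sketch below reproduces them, assuming a flat $3.00 per 1M input tokens (the rate implied by the $0.36 figure); real bills also include output tokens, which is why RLM's $0.04 sits slightly above its pure input cost.

```python
# Reproducing the "Typical Results" arithmetic above.
# Assumes a flat $3.00 per 1M input tokens, consistent with the $0.36 figure.
RATE_PER_M = 3.00

one_shot_tokens = 120_000
rlm_tokens = 12_000

one_shot_cost = one_shot_tokens / 1_000_000 * RATE_PER_M  # $0.36
rlm_input_cost = rlm_tokens / 1_000_000 * RATE_PER_M      # $0.036 (plus output tokens -> ~$0.04)

token_reduction = 1 - rlm_tokens / one_shot_tokens  # 0.90 -> 90%
cost_reduction = 1 - 0.04 / 0.36                    # ~0.89 -> ~89%
print(f"{token_reduction:.0%} fewer input tokens, {cost_reduction:.0%} cheaper")
```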
## Quick Tests
### Test 1 `qwen/qwen-2.5-coder-32b-instruct`
A quick test that runs for under $0.01 USD:
```bash
uv run python bench/bench_tokens.py \
--query "What are the core components of the RLM MCP server? Return a JSON list." \
--globs "rlm_mcp_server/*.py" \
--provider_preset openrouter \
--model qwen/qwen-2.5-coder-32b-instruct \
--dump-dir ./examples/benchmark_tests/qwen-2.5-coder-32b-instruct
```
RLM consumed more tokens on this test (and roughly 10x the cost), but produced a much more comprehensive and useful [answer](examples/benchmark_test/rlm_answer.txt) than the baseline [answer](examples/benchmark_test/baseline_answer.txt).
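To eyeball that difference yourself, something like the sketch below works, assuming the `--dump-dir` directory contains the two answer files linked above (adjust the path to wherever you pointed `--dump-dir`):

```python
# Quick side-by-side of the two answers written by --dump-dir.
from pathlib import Path

dump = Path("examples/benchmark_tests/qwen-2.5-coder-32b-instruct")
for name in ("baseline_answer.txt", "rlm_answer.txt"):
    text = (dump / name).read_text()
    print(f"=== {name} ({len(text)} chars) ===")
    print(text[:500])  # first 500 chars of each answer
```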
```text
Benchmark Results
┏━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┓
┃ Metric ┃ Baseline (One-Shot) ┃ RLM (Recursive) ┃ Delta ┃
┡━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━┩
│ Answer Score │ 1.0 │ 1.0 │ +0.0 │
│ Cost ($) │ $0.0004 │ $0.0042 │ $+0.0038 │
│ Time (sec) │ 1.93 │ 132.64 │ 130.71 │
│ Peak Prompt Tokens │ 5801 │ 0 (No sub-calls) │ N/A │
│ Total Input Tokens │ 5801 │ 44452 │ +38651 │
│ Total Output Tokens │ 30 │ 6883 │ +6853 │
│ Total Tokens │ 5831 │ 51335 │ +45504 │
└─────────────────────┴─────────────────────┴──────────────────┴──────────┘
Context Analysis
Raw Ingested Context Size: ~24151 bytes
Approx. Context Tokens: ~6037
RLM Recursive Steps: 9
```
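(The "Approx. Context Tokens" figure appears to be the raw byte count divided by 4, a common characters-per-token heuristic: 24151 / 4 ≈ 6037.)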
### Test 2 `google/gemini-2.0-flash-001`
A quick test that runs for under $0.01 USD:
```bash
uv run python bench/bench_tokens.py \
--query "What are the core components of the RLM MCP server? Return a JSON list." \
--globs "rlm_mcp_server/*.py" \
--provider_preset openrouter \
--model google/gemini-2.0-flash-001 \
--dump-dir ./examples/benchmark_tests/gemini-2.0-flash-001
```
```text
Benchmark Results
┏━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┓
┃ Metric ┃ Baseline (One-Shot) ┃ RLM (Recursive) ┃ Delta ┃
┡━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━┩
│ Answer Score │ 1.0 │ 1.0 │ +0.0 │
│ Cost ($) │ $0.0007 │ $0.0055 │ $+0.0048 │
│ Time (sec) │ 1.18 │ 45.07 │ 43.89 │
│ Peak Prompt Tokens │ 6710 │ 0 (No sub-calls) │ N/A │
│ Total Input Tokens │ 6710 │ 41722 │ +35012 │
│ Total Output Tokens │ 45 │ 3195 │ +3150 │
│ Total Tokens │ 6755 │ 44917 │ +38162 │
└─────────────────────┴─────────────────────┴──────────────────┴──────────┘
Context Analysis
Raw Ingested Context Size: ~24151 bytes
Approx. Context Tokens: ~6037
RLM Recursive Steps: 12
```
### Test 3 `google/gemini-2.5-flash-lite`
```bash
uv run python bench/bench_tokens.py \
--query "What are the core components of the RLM MCP server? Return a JSON list." \
--globs "rlm_mcp_server/*.py" \
--provider_preset openrouter \
--model google/gemini-2.5-flash-lite \
--dump-dir ./examples/benchmark_tests/gemini-2.5-flash-lite
```
```text
Benchmark Results
┏━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┓
┃ Metric ┃ Baseline (One-Shot) ┃ RLM (Recursive) ┃ Delta ┃
┡━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━┩
│ Answer Score │ 1.0 │ 1.0 │ +0.0 │
│ Cost ($) │ $0.0000 │ $0.0000 │ $+0.0000 │
│ Time (sec) │ 0.95 │ 18.13 │ 17.18 │
│ Peak Prompt Tokens │ 6712 │ 0 (No sub-calls) │ N/A │
│ Total Input Tokens │ 6712 │ 9871 │ +3159 │
│ Total Output Tokens │ 83 │ 917 │ +834 │
│ Total Tokens │ 6795 │ 10788 │ +3993 │
└─────────────────────┴─────────────────────┴──────────────────┴──────────┘
Context Analysis
Raw Ingested Context Size: ~24151 bytes
Approx. Context Tokens: ~6037
RLM Recursive Steps: 3
```
### Test 4 `openai/gpt-oss-120b`
```bash
uv run python bench/bench_tokens.py \
--query "What are the core components of the RLM MCP server? Return a JSON list." \
--globs "rlm_mcp_server/*.py" \
--provider_preset openrouter \
--model openai/gpt-oss-120b \
--dump-dir ./examples/benchmark_tests/gpt-oss-120b
```
```text
Benchmark Results
┏━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┓
┃ Metric ┃ Baseline (One-Shot) ┃ RLM (Recursive) ┃ Delta ┃
┡━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━┩
│ Answer Score │ 1.0 │ 1.0 │ +0.0 │
│ Cost ($) │ $0.0000 │ $0.0000 │ $+0.0000 │
│ Time (sec) │ 1.33 │ 40.06 │ 38.73 │
│ Peak Prompt Tokens │ 5857 │ 0 (No sub-calls) │ N/A │
│ Total Input Tokens │ 5857 │ 33801 │ +27944 │
│ Total Output Tokens │ 279 │ 2456 │ +2177 │
│ Total Tokens │ 6136 │ 36257 │ +30121 │
└─────────────────────┴─────────────────────┴──────────────────┴──────────┘
Context Analysis
Raw Ingested Context Size: ~24151 bytes
Approx. Context Tokens: ~6037
RLM Recursive Steps: 7
```
### Test 5 `openai/gpt-oss-20b`
```bash
uv run python bench/bench_tokens.py \
--query "What are the core components of the RLM MCP server? Return a JSON list." \
--globs "rlm_mcp_server/*.py" \
--provider_preset openrouter \
--model openai/gpt-oss-20b \
--dump-dir ./examples/benchmark_tests/gpt-oss-20b
```
```text
Benchmark Results
┏━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┓
┃ Metric ┃ Baseline (One-Shot) ┃ RLM (Recursive) ┃ Delta ┃
┡━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━┩
│ Answer Score │ 1.0 │ 0.0 │ -1.0 │
│ Cost ($) │ $0.0000 │ $0.0000 │ $+0.0000 │
│ Time (sec) │ 2.99 │ 320.76 │ 317.77 │
│ Peak Prompt Tokens │ 5857 │ 0 (No sub-calls) │ N/A │
│ Total Input Tokens │ 5857 │ 135520 │ +129663 │
│ Total Output Tokens │ 200 │ 19651 │ +19451 │
│ Total Tokens │ 6057 │ 155171 │ +149114 │
└─────────────────────┴─────────────────────┴──────────────────┴──────────┘
Context Analysis
Raw Ingested Context Size: ~24151 bytes
Approx. Context Tokens: ~6037
RLM Recursive Steps: 29
```
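This is the only failure in the suite: the 20B model never produced a scoring answer (Answer Score 0.0) despite 29 recursive steps and ~155k total tokens, suggesting smaller models can struggle to drive the recursive loop to completion.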
### Test 6 `openai/gpt-5-nano`
```bash
uv run python bench/bench_tokens.py \
--query "What are the core components of the RLM MCP server? Return a JSON list." \
--globs "rlm_mcp_server/*.py" \
--provider_preset openrouter \
--model openai/gpt-5-nano \
--dump-dir ./examples/benchmark_tests/gpt-5-nano
```
```text
Benchmark Results
┏━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┓
┃ Metric ┃ Baseline (One-Shot) ┃ RLM (Recursive) ┃ Delta ┃
┡━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━┩
│ Answer Score │ 1.0 │ 1.0 │ +0.0 │
│ Cost ($) │ $0.0000 │ $0.0000 │ $+0.0000 │
│ Time (sec) │ 9.04 │ 93.85 │ 84.81 │
│ Peak Prompt Tokens │ 5793 │ 0 (No sub-calls) │ N/A │
│ Total Input Tokens │ 5793 │ 11824 │ +6031 │
│ Total Output Tokens │ 823 │ 9532 │ +8709 │
│ Total Tokens │ 6616 │ 21356 │ +14740 │
└─────────────────────┴─────────────────────┴──────────────────┴──────────┘
Context Analysis
Raw Ingested Context Size: ~24151 bytes
Approx. Context Tokens: ~6037
RLM Recursive Steps: 7
```