# Deterministic fixtures for RLM benchmarks
These fixtures are a pragmatic fix for a recurring problem: open-ended repo questions are
hard to score because "correct" answers can vary wildly in formatting and verbosity.
The generator creates a small synthetic repo with **known ground truth** (provider presets
+ env vars). The scorer then grades any model/tool output mechanically.
## Generate a fixture
```bash
uv run python bench/fixtures/fixture_gen.py --out-dir /tmp/rlm_fixture --seed 1337
```
Outputs:
- `/tmp/rlm_fixture/repo/` — synthetic repo content
- `/tmp/rlm_fixture/gold.json` — ground truth
- `/tmp/rlm_fixture/query.txt` — deterministic query requiring strict JSON
- `/tmp/rlm_fixture/globs.txt` — suggested glob patterns
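
As a quick sanity check you can load the ground truth directly. The exact schema is whatever `fixture_gen.py` emits, so this sketch only inspects the top-level keys rather than assuming field names:

```python
import json

# Peek at the ground truth before scoring anything.
with open("/tmp/rlm_fixture/gold.json") as f:
    gold = json.load(f)

print(sorted(gold.keys()))  # inspect the schema fixture_gen.py actually produced
```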
## Run your benchmark scripts on the fixture
Use `globs.txt` to select the files your pipeline ingests, and `query.txt` as the task prompt.
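
One way to wire this up, assuming `globs.txt` holds one glob pattern per line (check the generated file for the actual layout):

```python
import pathlib

fixture = pathlib.Path("/tmp/rlm_fixture")

# Assumption: one glob pattern per line, blank lines ignored.
patterns = [
    ln.strip()
    for ln in (fixture / "globs.txt").read_text().splitlines()
    if ln.strip()
]

# Resolve patterns against the synthetic repo; sort for determinism.
files: list[pathlib.Path] = []
for pat in patterns:
    files.extend(sorted((fixture / "repo").glob(pat)))

query = (fixture / "query.txt").read_text()
# Feed `files` to your ingestion step and `query` to the model/tool under test.
```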
## Score any output
If your benchmark scripts write JSON artifacts (recommended), point the scorer at the
artifact file; it will prefer `answer_json` if present.
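
A minimal sketch of writing such an artifact. The `answer_json` field is the one the scorer prefers, per the note above; the other field and the answer payload shape are invented for illustration:

```python
import json
import pathlib

# Hypothetical artifact layout -- only `answer_json` is documented behavior.
artifact = {
    "answer_json": {"env_vars": ["EXAMPLE_TIMEOUT"]},  # the model's strict-JSON answer
    "raw_output": "...",  # optional: whatever the model actually printed
}

out_dir = pathlib.Path("/tmp/bench_output/whatever")
out_dir.mkdir(parents=True, exist_ok=True)
(out_dir / "baseline.json").write_text(json.dumps(artifact, indent=2))
```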
```bash
uv run python bench/fixtures/fixture_score.py \
  --gold /tmp/rlm_fixture/gold.json \
  --output /tmp/bench_output/whatever/baseline.json
```
The scorer prints a JSON report including precision/recall/F1 and exact-match.
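
The metrics are the standard set-based definitions. A sketch of the arithmetic, not the scorer's actual implementation (that lives in `fixture_score.py`):

```python
def prf1(gold: set[str], pred: set[str]) -> dict[str, float]:
    """Set-based precision/recall/F1 plus exact match."""
    tp = len(gold & pred)  # true positives: items in both sets
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "exact_match": float(gold == pred),
    }

# Example: one wrong item out of three -> P = R = F1 = 2/3, no exact match.
print(prf1({"A", "B", "C"}, {"A", "B", "D"}))
```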
## Design notes
- Decoy env var strings exist in comments (lines contain `DO_NOT_INCLUDE`). The query
  explicitly says to ignore these; models that "grep blindly" get penalized.
- Sorting requirements make outputs canonical and reduce scoring noise (both notes are
  illustrated in the sketch below).
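
A hypothetical illustration of both notes; the identifier names are invented and the real generated files will differ:

```python
import json
import os

# Real env var usage -- this belongs in the answer:
timeout = os.environ.get("EXAMPLE_TIMEOUT")

# Decoy inside a comment -- DO_NOT_INCLUDE, so a blind grep over-collects:
# token = os.environ.get("EXAMPLE_FAKE_TOKEN")

# Canonicalize before emitting: sorted lists keep exact-match scoring stable.
answer = {"env_vars": sorted({"EXAMPLE_TIMEOUT"})}
print(json.dumps(answer, sort_keys=True))
```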