Relace MCP Server

benchmark.md•6.26 KiB

# Benchmark Module > **Note**: This module is under active development. APIs and metrics may change. Evaluates `agentic_search` performance using the Loc-Bench dataset (derived from [LocAgent](https://github.com/IvanaXu/LocAgent)). ## 1. Setup **Prerequisites**: Python 3.11+, `git`, network access **Environment** (create `.env` in project root): ```bash SEARCH_PROVIDER=relace # or: openai, openrouter RELACE_API_KEY=your-key-here # or: OPENAI_API_KEY, OPENROUTER_API_KEY ``` **Dataset**: Build the Loc-Bench dataset from Hugging Face via datasets-server (no LocAgent required): ```bash uv run python -m benchmark.cli.build_locbench \ --output artifacts/data/raw/locbench_v1.jsonl ``` ## 2. Single Run ```bash # Basic run uv run python -m benchmark.cli.run --dataset artifacts/data/processed/elite_50.jsonl --limit 20 # Run on Loc-Bench (after build_locbench) uv run python -m benchmark.cli.run --dataset artifacts/data/raw/locbench_v1.jsonl --limit 20 # With parameter overrides uv run python -m benchmark.cli.run \ --dataset artifacts/data/processed/elite_50.jsonl \ --limit 64 --seed 0 --shuffle \ --max-turns 8 --temperature 0.2 -q # Resume from checkpoint after interruption uv run python -m benchmark.cli.run \ -o my_run --resume --timeout 300 --fail-fast 5 ``` **Outputs**: - Results: `benchmark/artifacts/results/<name>.jsonl` - Report: `benchmark/artifacts/reports/<name>.report.json` **Key options**: | Option | Default | Description | |--------|---------|-------------| | `--dataset` | locbench_v1.jsonl | Dataset path | | `-o, --output` | auto | Output file prefix | | `--limit` | all | Number of cases | | `--seed` | `0` | Random seed | | `--shuffle` | off | Randomize selection | | `--max-turns` | env | Override `SEARCH_MAX_TURNS` | | `--temperature` | env | Override `SEARCH_TEMPERATURE` | | `--prompt-file` | env | Override `SEARCH_PROMPT_FILE` (YAML) | | `--timeout` | none | Per-case timeout in seconds | | `--fail-fast` | none | Stop after N consecutive failures | | `--resume` | off | Resume from checkpoint | | `-v, --verbose` | off | Detailed logging | | `-q, --quiet` | off | Disable progress bar | | `--dry-run` | off | Preview only | ## 3. Grid Search (Hyperparameter Tuning) Run Cartesian product of `turns × temperature` combinations: ```bash uv run python -m benchmark.cli.grid \ --dataset artifacts/data/processed/elite_50.jsonl \ --limit 64 --seed 0 --shuffle \ --turns 4 --turns 6 --turns 8 \ --temperatures 0 --temperatures 0.2 --temperatures 0.4 --temperatures 0.6 ``` **Grid options**: | Option | Required | Description | |--------|----------|-------------| | `--turns` | ✓ | Grid values for `SEARCH_MAX_TURNS` (repeatable) | | `--temperatures` | ✓ | Grid values for `SEARCH_TEMPERATURE` (repeatable) | | `--search-prompt-file` | | Override prompt file for all runs | | `--output` | | Output directory prefix | | `--dry-run` | | Print planned runs without executing | **Output**: Grid summary saved to `artifacts/reports/<grid_name>.grid.json` ## 4. Dataset Validation Validate dataset correctness before running benchmarks: ```bash # Validate default dataset uv run python -m benchmark.cli.validate # Validate specific dataset uv run python -m benchmark.cli.validate --input artifacts/data/raw/locbench_v1.jsonl # Output report to file uv run python -m benchmark.cli.validate --output validation.json --verbose ``` **Validation checks**: - `hard_gt` / `soft_context` files exist - Line ranges are valid - Function names match AST parsing - `target_ranges` fall within context range - Solvability evidence appears in query **Options**: | Option | Default | Description | |--------|---------|-------------| | `--input` | `locbench_v1.jsonl` | Dataset path | | `--output` | stdout | Report output path | | `--limit` | all | Number to validate | | `-v/--verbose` | off | Detailed output | ## 5. Analyze and Report Results ```bash # Analyze single run (detailed stdout) uv run python -m benchmark.cli.analyze path/to/run.report.json # Compare multiple runs (Markdown output) uv run python -m benchmark.cli.report run1.report.json run2.report.json # Find best config from grid search uv run python -m benchmark.cli.report --best grid_curated_30.grid.json # Output comparison to file uv run python -m benchmark.cli.report -o comparison.md *.report.json ``` ## 6. Interpret Metrics | Metric | Formula | |--------|---------| | File Recall | GT files found / Total GT files | | File Precision | Correct files / Returned files | | Line Coverage | GT lines covered / Total GT lines | | Line Precision | Correct lines / Returned lines | | Line Prec (Matched) | Correct lines / Returned lines (matched files only) | | Function Hit Rate | Functions with overlap / Total functions | Each `*.report.json` includes metadata tracking: `temperature`, `max_turns`, `prompt_file` for reproducibility. ## 7. Troubleshooting | Problem | Solution | |---------|----------| | Missing API key | Set `RELACE_API_KEY` or provider-specific key | | Clone fails | Check network, ensure `git` installed | | Dataset not found | Place dataset in `benchmark/artifacts/data/` | | Slow first run | Normal—repos cached after first download | ## 8. Running Unit Tests ```bash uv run pytest benchmark/tests -v ``` ## Directory Structure ``` benchmark/ ├── cli/ │ ├── run.py # Single run CLI │ ├── grid.py # Grid search CLI │ ├── report.py # Report generation │ ├── analyze.py # Detailed analysis │ ├── curate.py # Dataset curation │ ├── validate.py # Dataset validation │ └── build_locbench.py # Loc-Bench builder ├── analysis/ # Analysis tools (function scope, etc.) ├── datasets/ # Dataset loaders ├── metrics/ # Metrics implementation ├── runner/ # Execution pipeline ├── tests/ # Unit tests ├── config.py # Configuration constants ├── schemas.py # Data structure definitions └── artifacts/ # (runtime generated, not in version control) ├── data/ # Dataset files ├── repos/ # Cached repositories ├── results/ # Run outputs (.jsonl) └── reports/ # Summary reports (.report.json, .grid.json) ```

Loading blob content...

Latest Blog Posts

Redis vs ioredis vs valkey-glide
By punkpeye on January 26, 2026.
benchmark
Redis
valkey
Quickstart: Publish an MCP Server to the MCP Registry
By punkpeye on January 24, 2026.
mcp
official reference mirror
Official MCP Registry Server.json Requirements
By punkpeye on January 24, 2026.
mcp
official reference mirror

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/possible055/relace-mcp'

If you have feedback or need assistance with the MCP directory API, please join our Discord server

benchmark.md•6.26 KiB