agentlens_benchmark
Create and manage A/B benchmarks for comparing agent configurations, track metrics like cost and latency, and obtain statistical results.
Instructions
Manage A/B benchmarks: create, list, check status, get results, and control lifecycle.
When to use: To set up controlled experiments comparing different agent configurations (models, prompts, parameters), track which variant performs better, and get statistical results.
Workflow:
1. create: Define a benchmark with 2+ variants and metrics
2. Tag sessions with variant tags during data collection
3. start: Transition benchmark to running
4. status: Check progress (session counts per variant)
5. results: Get statistical comparison with p-values
6. complete: Finalize the benchmark
Actions:
- create: Set up a new benchmark (name, variants[], metrics[])
- list: List benchmarks, optionally filter by status
- status: Get benchmark detail with per-variant session counts
- results: Get formatted comparison table with statistical analysis
- start: Transition benchmark to running state
- complete: Transition benchmark to completed state
Example: agentlens_benchmark({ action: "create", name: "GPT-4o vs Claude", variants: [{name: "gpt4o", tag: "v-gpt4o"}, {name: "claude", tag: "v-claude"}], metrics: ["cost", "latency", "success_rate"] })
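The remaining workflow steps can be sketched as a sequence of payloads that follow the create call. This is a minimal sketch rather than the tool's own client code: the `BenchmarkInput` type and the `lifecyclePayloads` helper are hypothetical names, the ID `"bench-123"` is a placeholder, and it is an assumption that the create response supplies the benchmark ID used afterwards.

```typescript
// Hypothetical payload type; field names follow the Input Schema table.
type Variant = { name: string; tag: string };

interface BenchmarkInput {
  action: "create" | "list" | "status" | "results" | "start" | "complete";
  name?: string;
  variants?: Variant[];
  metrics?: string[];
  benchmarkId?: string;
}

// Hypothetical helper: builds the payloads for the actions that come
// after `create`, in the order given by the workflow.
function lifecyclePayloads(benchmarkId: string): BenchmarkInput[] {
  const steps: BenchmarkInput["action"][] = ["start", "status", "results", "complete"];
  return steps.map((action) => ({ action, benchmarkId }));
}

// Same payload as the example above.
const createPayload: BenchmarkInput = {
  action: "create",
  name: "GPT-4o vs Claude",
  variants: [
    { name: "gpt4o", tag: "v-gpt4o" },
    { name: "claude", tag: "v-claude" },
  ],
  metrics: ["cost", "latency", "success_rate"],
};

// "bench-123" is a placeholder; the real ID is assumed to come from the create response.
const followUps = lifecyclePayloads("bench-123");
console.log(followUps.map((p) => p.action).join(" -> "));
```

Each element of `followUps` is meant to be passed to `agentlens_benchmark` one at a time. Session tagging is not a tool action, so it does not appear in the list; sessions carry the variant tags (`v-gpt4o`, `v-claude`) while data is collected.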
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| action | Yes | Action to perform | |
| name | No | Benchmark name (required for create) | |
| description | No | Benchmark description | |
| variants | No | Variants to compare (required for create, min 2) | |
| metrics | No | Metrics to track (e.g., ["cost", "latency", "success_rate"]) | |
| minSessions | No | Minimum sessions per variant before results are meaningful | |
| agentId | No | Agent ID to scope the benchmark to | |
| status | No | Filter by status (for list action) | |
| benchmarkId | No | Benchmark ID (required for status/results/start/complete) | |
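The conditional requirements in the table (a field marked "No" that is still required for certain actions) can be checked before a call is made. The sketch below is illustrative only: `validateInput` is a hypothetical helper, not part of the tool, and the field types (string IDs, numeric `minSessions`, variants shaped as `{ name, tag }`) are assumptions inferred from the descriptions and the example call.

```typescript
type Action = "create" | "list" | "status" | "results" | "start" | "complete";

// Assumed shape of the tool input; types are inferred, not taken from a published schema.
interface BenchmarkInput {
  action: Action;
  name?: string;
  description?: string;
  variants?: { name: string; tag: string }[];
  metrics?: string[];
  minSessions?: number;
  agentId?: string;
  status?: string;
  benchmarkId?: string;
}

// Hypothetical pre-flight check: returns a list of problems, empty if none were found.
function validateInput(input: BenchmarkInput): string[] {
  const errors: string[] = [];

  if (input.action === "create") {
    if (!input.name) {
      errors.push("name is required for create");
    }
    if (!input.variants || input.variants.length < 2) {
      errors.push("create requires at least 2 variants");
    }
  }

  // Actions that operate on an existing benchmark need its ID.
  const needsId: Action[] = ["status", "results", "start", "complete"];
  if (needsId.indexOf(input.action) !== -1 && !input.benchmarkId) {
    errors.push("benchmarkId is required for " + input.action);
  }

  return errors;
}

// A results request without an ID is reported as a problem.
console.log(validateInput({ action: "results" }));
```

Returning a list of messages instead of throwing lets a caller report every missing field at once.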