evaluate_retrieval
Score retrieval methods against labeled questions by averaging four ranking metrics (recall@k, precision@k, MRR@k, nDCG@k) to quantify retrieval quality.
Instructions
Score one retrieval method against the labeled question set.
Runs every labeled question through the chosen retriever and averages four ranking
metrics, so you can put a number on retrieval quality instead of eyeballing it.
Args:
method: Which retrieval strategy to score (default 'hybrid').
k: The rank cutoff applied to all four metrics: recall@k, precision@k, MRR@k, and nDCG@k (1 to 20, default 3).
Returns:
MetricsOut with fields: method, k, n_questions, recall_at_k, precision_at_k,
mrr_at_k, ndcg_at_k. Each metric is in the range 0.0 to 1.0.
Raises:
A tool error if method='dense' and the optional dense extra is not installed.
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| method | No | Retrieval strategy to score: 'bm25', 'tfidf', 'dense', or 'hybrid'. | hybrid |
| k | No | Cutoff k for recall@k, precision@k, MRR@k, and nDCG@k (1 to 20). | 3 |
Output Schema
| Name | Required | Description | Default |
|---|---|---|---|
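| method | Yes | Retrieval strategy that was scored. | |
| k | Yes | Cutoff k used for the metrics. | |
| n_questions | Yes | Number of labeled questions scored. | |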
| recall_at_k | Yes | Fraction of relevant docs found in the top k. | |
| precision_at_k | Yes | Fraction of the top k that are relevant. | |
| mrr_at_k | Yes | Mean reciprocal rank of the first relevant doc within k. | |
| ndcg_at_k | Yes | Normalized discounted cumulative gain at k. | |
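
The four metrics in the output can be illustrated with a minimal Python sketch. This is not the tool's actual implementation. It assumes binary relevance (a doc is either relevant or not), divides precision by k even when fewer than k docs come back, and uses hypothetical function names and doc IDs chosen only for illustration.

```python
import math
from typing import Dict, List, Sequence, Set, Tuple


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant docs that appear in the top k."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc in ranked[:k] if doc in relevant)
    return hits / len(relevant)


def precision_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top k that are relevant (assumption: denominator is k)."""
    hits = sum(1 for doc in ranked[:k] if doc in relevant)
    return hits / k


def mrr_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Reciprocal rank of the first relevant doc within k, else 0."""
    for rank, doc in enumerate(ranked[:k], start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """DCG of the ranking divided by the DCG of an ideal ranking (binary gains)."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc in enumerate(ranked[:k], start=1)
        if doc in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


def average_metrics(
    runs: List[Tuple[Sequence[str], Set[str]]], k: int
) -> Dict[str, float]:
    """Average each per-question metric over all (ranking, relevant set) pairs."""
    n = len(runs)
    if n == 0:
        raise ValueError("need at least one labeled question")
    return {
        "recall_at_k": sum(recall_at_k(r, rel, k) for r, rel in runs) / n,
        "precision_at_k": sum(precision_at_k(r, rel, k) for r, rel in runs) / n,
        "mrr_at_k": sum(mrr_at_k(r, rel, k) for r, rel in runs) / n,
        "ndcg_at_k": sum(ndcg_at_k(r, rel, k) for r, rel in runs) / n,
    }


# Two made-up questions: (ranking returned by a retriever, labeled relevant docs).
runs = [
    (["d3", "d1", "d7"], {"d1", "d9"}),
    (["a", "b", "c"], {"a"}),
]
scores = average_metrics(runs, k=3)
```

With these two made-up questions and k=3, the first question has recall 0.5 and reciprocal rank 0.5, while the second has recall 1.0 and reciprocal rank 1.0, so both averages come out to 0.75. Each averaged value stays in the 0.0 to 1.0 range, matching the documented output.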