mcp-llm-eval
Server Configuration
Describes the environment variables used to configure the server; all are optional.
| Name | Required | Description | Default |
|---|---|---|---|
| GOOGLE_API_KEY | No | API key for Google Gemini models | |
| OPENAI_API_KEY | No | API key for OpenAI models (includes judge model) | |
| ANTHROPIC_API_KEY | No | API key for Anthropic Claude models | |
| MCP_LLM_EVAL_JUDGE_MODEL | No | Model to use as judge for scoring | gpt-4o-mini |
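The table above can be read as a small configuration contract: three optional provider keys and a judge model that falls back to `gpt-4o-mini`. The following is a minimal sketch of how a client or wrapper script might resolve those values. The `EvalConfig` class and `load_config` function are hypothetical names used for illustration and are not part of the server itself.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

# Default taken from the configuration table above.
DEFAULT_JUDGE_MODEL = "gpt-4o-mini"


@dataclass(frozen=True)
class EvalConfig:
    """Hypothetical container for the documented environment variables."""
    google_api_key: Optional[str]
    openai_api_key: Optional[str]
    anthropic_api_key: Optional[str]
    judge_model: str


def load_config(env: Mapping[str, str] = os.environ) -> EvalConfig:
    """Resolve the documented variables, treating unset or blank values as missing."""

    def optional(name: str) -> Optional[str]:
        value = env.get(name, "").strip()
        return value or None

    return EvalConfig(
        google_api_key=optional("GOOGLE_API_KEY"),
        openai_api_key=optional("OPENAI_API_KEY"),
        anthropic_api_key=optional("ANTHROPIC_API_KEY"),
        # Fall back to the documented default when no judge model is set.
        judge_model=optional("MCP_LLM_EVAL_JUDGE_MODEL") or DEFAULT_JUDGE_MODEL,
    )
```

Passing a plain dictionary instead of `os.environ` makes the resolution easy to check without touching the real environment, for example `load_config({"OPENAI_API_KEY": "sk-test"})`.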
Capabilities
Features and capabilities supported by this server
| Capability | Details |
|---|---|
| tools | { "listChanged": false } |
| experimental | {} |
Tools
Functions exposed to the LLM to take actions
| Name | Description |
|---|---|
| run_evaluation | Run an LLM evaluation: load a dataset, query models via streaming, score responses with an LLM-as-judge, and return per-question scores, aggregate summary, and pass/fail status. |
| check_thresholds | Check evaluation results against quality gate thresholds. Returns pass/fail per metric and overall gate status. |
| list_evaluations | List past evaluation runs in a directory. Returns metadata for each run: timestamp, dataset, models, pass/fail, and cost. |
| get_evaluation | Retrieve the full details of a specific evaluation run: per-question per-model scores, responses, and judge reasoning. |
| compare_runs | Compare two evaluation runs and detect regressions. Flags metrics that worsened beyond configurable tolerance. |
| evaluate_retrieval | Run retrieval metrics (recall@k, precision@k, MRR, nDCG@k) against a labelled dataset with a configurable retrieval adapter. Returns per-query metrics, dataset-level aggregate, and p50/p95 retrieval latency. |
| evaluate_rag_end_to_end | Run the full RAG pipeline: retrieve chunks, generate answers using the retrieved chunks as context, and score with context_relevance and citation_faithfulness judges. Returns retrieval metrics, generation metrics, and judge scores per query, plus an aggregate. |
| check_retrieval_drift | Compare two retrieval evaluation result files and detect drift. Flags metrics that have regressed beyond configurable tolerance. Takes two result-set paths; does not persist history itself. |
| simulate_poisoned_corpus | [STUB - not implemented in v0.5.0] Inject poisoned chunks into a corpus and re-run retrieval evaluation. Returns a clear not-implemented response. |
| format_pr_comment | Generate a markdown PR comment from evaluation results. Includes results table, regression details, and threshold status. |
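To make the retrieval metrics and the tolerance-based regression check named in the table concrete, here is a minimal sketch using the standard textbook definitions with binary relevance. It is not the server's implementation. The function names, the assumption that every metric is "higher is better", and the default tolerance of 0.02 are illustrative choices, and each ranked list is assumed to contain no duplicate document IDs.

```python
import math
from typing import Dict, Sequence, Set, Tuple


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of all relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def precision_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top k slots occupied by relevant documents."""
    if k <= 0:
        return 0.0
    return sum(1 for doc in ranked[:k] if doc in relevant) / k


def reciprocal_rank(ranked: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant document; MRR is the mean over queries."""
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Binary-relevance nDCG: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc in enumerate(ranked[:k], start=1)
        if doc in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


def find_regressions(
    baseline: Dict[str, float],
    candidate: Dict[str, float],
    tolerance: float = 0.02,  # illustrative default, not the server's
) -> Dict[str, Tuple[float, float]]:
    """Return metrics whose value dropped by more than `tolerance`.

    Assumes higher is better for every metric present in both runs.
    """
    return {
        name: (baseline[name], candidate[name])
        for name in baseline
        if name in candidate and baseline[name] - candidate[name] > tolerance
    }
```

For a ranking `["d3", "d1", "d7", "d2"]` with relevant set `{"d1", "d2"}`, the top three results contain one of the two relevant documents, so recall@3 is 0.5 and precision@3 is 1/3. The first relevant hit is at rank 2, so the reciprocal rank is 0.5.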
Prompts
Interactive templates invoked by user choice
| Name | Description |
|---|---|
| No prompts | |
Resources
Contextual data attached and managed by the client
| Name | Description |
|---|---|
| No resources | |
MCP directory API
We provide all the information about MCP servers via our MCP API.
curl -X GET 'https://glama.ai/api/mcp/v1/servers/berkayildi/mcp-llm-eval'
If you have feedback or need assistance with the MCP directory API, please join our Discord server.