Schema | mcp-llm-eval

mcp-llm-eval

Server Configuration

Describes the environment variables required to run the server.

Name	Required	Description	Default
`GOOGLE_API_KEY`	No	API key for Google Gemini models
`OPENAI_API_KEY`	No	API key for OpenAI models (also used by the default judge model, gpt-4o-mini)
`ANTHROPIC_API_KEY`	No	API key for Anthropic Claude models
`MCP_LLM_EVAL_JUDGE_MODEL`	No	Model to use as judge for scoring	gpt-4o-mini
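
As an illustration of how the variables above might be resolved, the following is a minimal sketch, not the server's actual code. The function name `load_settings`, the `PROVIDER_KEYS` mapping, and the shape of the returned dictionary are assumptions made for this example; only the variable names and the `gpt-4o-mini` default come from the table.

```python
import os

# Default taken from the table above; everything else here is hypothetical.
DEFAULT_JUDGE_MODEL = "gpt-4o-mini"

# Hypothetical mapping from a provider label to its environment variable.
PROVIDER_KEYS = {
    "google": "GOOGLE_API_KEY",
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
}


def load_settings(env=None):
    """Return which providers have a key set and which judge model to use.

    `env` defaults to os.environ; passing a plain dict makes this testable.
    """
    env = os.environ if env is None else env
    # No key is required, so simply record the providers that are configured.
    providers = sorted(
        name for name, var in PROVIDER_KEYS.items() if env.get(var)
    )
    # Fall back to the documented default when the variable is unset or empty.
    judge_model = env.get("MCP_LLM_EVAL_JUDGE_MODEL") or DEFAULT_JUDGE_MODEL
    return {"providers": providers, "judge_model": judge_model}
```

Because every key is optional, a sketch like this only reports what is configured; whether the real server refuses to start without a key for the judge model is not stated in this listing.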

Capabilities

Features and capabilities supported by this server

Capability	Details
`tools`	{ "listChanged": false }
`experimental`	{}

Tools

Functions exposed to the LLM to take actions

Name	Description
run_evaluation	Run an LLM evaluation: load a dataset, query models via streaming, score responses with an LLM-as-judge, and return per-question scores, aggregate summary, and pass/fail status.
check_thresholds	Check evaluation results against quality gate thresholds. Returns pass/fail per metric and overall gate status.
list_evaluations	List past evaluation runs in a directory. Returns metadata for each run: timestamp, dataset, models, pass/fail, and cost.
get_evaluation	Retrieve the full details of a specific evaluation run: per-question per-model scores, responses, and judge reasoning.
compare_runs	Compare two evaluation runs and detect regressions. Flags metrics that worsened beyond configurable tolerance.
evaluate_retrieval	Run retrieval metrics (recall@k, precision@k, MRR, nDCG@k) against a labelled dataset with a configurable retrieval adapter. Returns per-query metrics, dataset-level aggregate, and p50/p95 retrieval latency.
evaluate_rag_end_to_end	Run the full RAG pipeline: retrieve chunks, generate answers using the retrieved chunks as context, and score with context_relevance and citation_faithfulness judges. Returns retrieval metrics, generation metrics, and judge scores per query, plus an aggregate.
check_retrieval_drift	Compare two retrieval evaluation result files and detect drift. Flags metrics that have regressed beyond configurable tolerance. Takes two result-set paths; does not persist history itself.
simulate_poisoned_corpus	[STUB - not implemented in v0.5.0] Inject poisoned chunks into a corpus and re-run retrieval evaluation. Returns a clear not-implemented response.
format_pr_comment	Generate a markdown PR comment from evaluation results. Includes results table, regression details, and threshold status.
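
The retrieval metrics named in the table (recall@k, precision@k, MRR, nDCG@k) have standard textbook definitions. The sketch below shows those definitions for a single query with binary relevance. It is not taken from the server's source: the function names are hypothetical, document IDs in the ranked list are assumed to be unique, and the server may compute these metrics differently (for example with graded relevance).

```python
import math


def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & set(relevant)) / len(relevant)


def precision_at_k(ranked, relevant, k):
    """Fraction of the top k results that are relevant."""
    if k <= 0:
        return 0.0
    return sum(1 for doc in ranked[:k] if doc in relevant) / k


def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant result, or 0.0 if none is found.

    MRR is the mean of this value over all queries in the dataset.
    """
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked, relevant, k):
    """Binary-relevance nDCG: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc in enumerate(ranked[:k], start=1)
        if doc in relevant
    )
    # The ideal ranking places every relevant document first.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0
```

For a ranking `["d3", "d1", "d7", "d2"]` with relevant set `{"d1", "d2"}` and k = 3, one relevant document is in the top three, so recall@3 is 0.5 and precision@3 is 1/3; the first relevant hit is at rank 2, so the reciprocal rank is 0.5.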

Prompts

Interactive templates invoked by user choice

Name	Description
No prompts

Resources

Contextual data attached and managed by the client

Name	Description
No resources

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/berkayildi/mcp-llm-eval'
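
The same request can be made from Python with the standard library. This is a minimal sketch: the helper names `server_url` and `fetch_server` are hypothetical, and it assumes the endpoint returns a JSON body, which this page does not state explicitly.

```python
import json
import urllib.request

# Base path taken from the curl example above.
API_BASE = "https://glama.ai/api/mcp/v1/servers"


def server_url(owner, name):
    """Build the directory API URL for one server."""
    return f"{API_BASE}/{owner}/{name}"


def fetch_server(owner, name, timeout=10.0):
    """GET the server record and parse it, assuming a JSON response."""
    request = urllib.request.Request(server_url(owner, name), method="GET")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `fetch_server("berkayildi", "mcp-llm-eval")` issues the request shown in the curl command; the fields of the returned record are not documented here, so inspect the result before relying on any particular key.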
