Server Configuration

Environment variables used to configure the server. All of them are marked as optional.

| Name | Required | Description | Default |
| --- | --- | --- | --- |
| GOOGLE_API_KEY | No | API key for Google Gemini models | |
| OPENAI_API_KEY | No | API key for OpenAI models (includes judge model) | |
| ANTHROPIC_API_KEY | No | API key for Anthropic Claude models | |
| MCP_LLM_EVAL_JUDGE_MODEL | No | Model to use as judge for scoring | gpt-4o-mini |
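
As an illustration of how these variables might be read, here is a minimal Python sketch. It is not taken from the server's source: the helper name `summarize_config` and the returned dictionary shape are assumptions made for this example. Only the variable names and the `gpt-4o-mini` default come from the table above.

```python
import os
from typing import Mapping

# Variable names as listed in the configuration table.
PROVIDER_KEYS = {
    "google": "GOOGLE_API_KEY",
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
}
# Default judge model as listed in the configuration table.
DEFAULT_JUDGE_MODEL = "gpt-4o-mini"


def summarize_config(env: Mapping[str, str] = os.environ) -> dict:
    """Hypothetical helper: report which providers have a key set
    and which judge model would be used."""
    providers = [name for name, var in PROVIDER_KEYS.items() if env.get(var)]
    judge = env.get("MCP_LLM_EVAL_JUDGE_MODEL") or DEFAULT_JUDGE_MODEL
    return {"providers": providers, "judge_model": judge}


if __name__ == "__main__":
    print(summarize_config())
```

Since the default judge is an OpenAI model, a setup that keeps the default would presumably also need `OPENAI_API_KEY` set, which matches the "includes judge model" note in the table.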

Capabilities

Features and capabilities supported by this server

| Capability | Details |
| --- | --- |
| tools | `{"listChanged": false}` |
| experimental | `{}` |

Tools

Functions exposed to the LLM to take actions

| Name | Description |
| --- | --- |
| run_evaluation | Run an LLM evaluation: load a dataset, query models via streaming, score responses with an LLM-as-judge, and return per-question scores, an aggregate summary, and pass/fail status. |
| check_thresholds | Check evaluation results against quality gate thresholds. Returns pass/fail per metric and overall gate status. |
| list_evaluations | List past evaluation runs in a directory. Returns metadata for each run: timestamp, dataset, models, pass/fail, and cost. |
| get_evaluation | Retrieve the full details of a specific evaluation run: per-question per-model scores, responses, and judge reasoning. |
| compare_runs | Compare two evaluation runs and detect regressions. Flags metrics that worsened beyond a configurable tolerance. |
| evaluate_retrieval | Run retrieval metrics (recall@k, precision@k, MRR, nDCG@k) against a labelled dataset with a configurable retrieval adapter. Returns per-query metrics, a dataset-level aggregate, and p50/p95 retrieval latency. |
| evaluate_rag_end_to_end | Run the full RAG pipeline: retrieve chunks, generate answers using the retrieved chunks as context, and score with context_relevance and citation_faithfulness judges. Returns retrieval metrics, generation metrics, and judge scores per query, plus an aggregate. |
| check_retrieval_drift | Compare two retrieval evaluation result files and detect drift. Flags metrics that have regressed beyond a configurable tolerance. Takes two result-set paths; does not persist history itself. |
| simulate_poisoned_corpus | [STUB - not implemented in v0.5.0] Inject poisoned chunks into a corpus and re-run retrieval evaluation. Returns a clear not-implemented response. |
| format_pr_comment | Generate a markdown PR comment from evaluation results. Includes a results table, regression details, and threshold status. |
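
For readers unfamiliar with the retrieval metrics named under evaluate_retrieval, the following Python sketch shows the standard textbook definitions for a single query. It assumes binary relevance labels and unique document IDs in the ranking; the function names are hypothetical and the server's own implementation may differ in details such as graded relevance or tie handling.

```python
import math
from typing import Sequence, Set


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of all relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def precision_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top k results that are relevant."""
    if k <= 0:
        return 0.0
    return sum(1 for doc in ranked[:k] if doc in relevant) / k


def reciprocal_rank(ranked: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none is found.
    MRR is the mean of this value over all queries."""
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Binary-relevance nDCG: DCG of the ranking divided by the DCG
    of an ideal ranking that places all relevant documents first."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc in enumerate(ranked[:k], start=1)
        if doc in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


if __name__ == "__main__":
    ranked = ["d3", "d1", "d7", "d2"]  # illustrative retrieval output
    relevant = {"d1", "d2"}            # illustrative gold labels
    print(recall_at_k(ranked, relevant, 3), precision_at_k(ranked, relevant, 3))
    print(reciprocal_rank(ranked, relevant), ndcg_at_k(ranked, relevant, 3))
```

A dataset-level aggregate, as mentioned in the tool description, would typically be the mean of each per-query value.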

Prompts

Interactive templates invoked by user choice

No prompts

Resources

Contextual data attached and managed by the client

No resources

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/berkayildi/mcp-llm-eval'

If you have feedback or need assistance with the MCP directory API, please join our Discord server.
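
The curl request above can also be issued from Python using only the standard library. This is a minimal sketch: the helper names are hypothetical, and it assumes the endpoint returns a JSON body, which the page does not document in detail.

```python
import json
import urllib.request

# Base path taken from the curl example above.
API_BASE = "https://glama.ai/api/mcp/v1/servers"


def server_url(owner: str, name: str) -> str:
    """Build the directory API URL for one server."""
    return f"{API_BASE}/{owner}/{name}"


def fetch_server(owner: str, name: str, timeout: float = 10.0):
    """GET the server record and decode it.
    Assumption: the response body is JSON; its schema is not shown here."""
    request = urllib.request.Request(
        server_url(owner, name),
        headers={"Accept": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


if __name__ == "__main__":
    record = fetch_server("berkayildi", "mcp-llm-eval")
    print(json.dumps(record, indent=2))
```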