Which integrations are available for this server?

Provides CI-friendly quality gates for agent evaluation, enabling integration with GitHub Actions workflows for automated testing and deployment.

How do I use agent-eval-mcp?

1. Click "Install Server".
2. Wait a few minutes for the server to deploy. Once ready, it shows a "Started" state.
3. In the chat, type @ followed by the MCP server name and your instructions, e.g., "@agent-eval-mcp run evaluation suite on my candidate agent outputs".

The server responds to your query, and you can continue using it as needed.

agent-eval-mcp

by ad-github1


Python

Local

Enterprise AI Agent Evaluation & Deployment Platform

A dependency-light evaluation platform for RAG/wiki-quality AI agents. It scores agent outputs for faithfulness, retrieval relevance, hallucination risk, latency, and cost; produces CI-friendly quality gates; emits regression/canary reports; and exposes the workflow through a lightweight MCP-style stdio tool server.

What Is Included

JSONL evaluation case format for RAG/wiki workflows.
Deterministic checks for:
- faithfulness to retrieved context and reference answer,
- retrieval relevance against question and expected keywords,
- hallucination risk from unsupported answer content,
- latency and cost thresholds.
100+ synthetic case generator.
CI/CD-style suite-level and case-level quality gates.
Markdown and JSON evaluation reports.
Regression report comparing baseline and candidate runs.
Canary promotion policy with traffic ramp decisions.
OpenTelemetry-compatible JSONL traces/metrics.
MCP-style stdio server exposing evaluation tools.
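
To make the idea of a deterministic check concrete, here is a minimal sketch based on token overlap. It is not the platform's actual scoring code: the function names, the tokenizer, and the overlap formulas are all assumptions chosen for illustration.

```python
from __future__ import annotations

import re


def tokens(text: str) -> set[str]:
    """Lowercase alphanumeric tokens; a deliberately simple tokenizer (assumption)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def faithfulness(answer: str, context: str) -> float:
    """Fraction of answer tokens that also occur in the supporting context."""
    answer_tokens = tokens(answer)
    if not answer_tokens:
        return 0.0
    return len(answer_tokens & tokens(context)) / len(answer_tokens)


def hallucination_risk(answer: str, context: str) -> float:
    """Fraction of answer tokens with no support in the context."""
    answer_tokens = tokens(answer)
    if not answer_tokens:
        return 0.0
    return len(answer_tokens - tokens(context)) / len(answer_tokens)


def keyword_coverage(expected_keywords: list[str], retrieved_text: str) -> float:
    """Fraction of expected keywords found (case-insensitively) in the retrieved text."""
    if not expected_keywords:
        return 0.0
    haystack = retrieved_text.lower()
    hits = sum(1 for keyword in expected_keywords if keyword.lower() in haystack)
    return hits / len(expected_keywords)
```

Because every function is a pure function of its inputs, the same case always yields the same score, which is what makes checks of this kind usable as CI gates.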


Quick Start

git clone https://github.com/ad-github1/ENTERPRISE-AI-AGENT-EVALUATION-PLATFORM.git
cd ENTERPRISE-AI-AGENT-EVALUATION-PLATFORM
PYTHONPATH=src python3 -m agent_eval_platform generate-cases --count 120 --out examples/wiki_eval_cases.jsonl
PYTHONPATH=src python3 -m agent_eval_platform evaluate \
  --cases examples/wiki_eval_cases.jsonl \
  --gate examples/quality_gate.json \
  --variant candidate \
  --json-out reports/eval_result.json \
  --markdown-out reports/eval_report.md \
  --traces-out reports/traces.jsonl
PYTHONPATH=src python3 -m unittest discover -s tests

Testing

Run the test suite:

PYTHONPATH=src python3 -m unittest discover -s tests

Current local result:

Ran 5 tests in 0.015s

OK

To install the package in editable mode and run the installed commands instead of python3 -m:

pip install -e .
agent-eval evaluate --cases examples/wiki_eval_cases.jsonl --gate examples/quality_gate.json
agent-eval-mcp

Case Format

Each JSONL row contains one evaluated agent run:

{
  "case_id": "case-0001",
  "question": "What contribution is Ada Lovelace known for in mathematics?",
  "reference_answer": "Ada Lovelace is known for Analytical Engine notes.",
  "expected_keywords": ["Ada Lovelace", "Analytical Engine"],
  "retrieved_docs": [
    {"doc_id": "wiki-1", "title": "Ada Lovelace", "text": "...", "score": 0.94}
  ],
  "agent_answer": "Ada Lovelace is known for Analytical Engine notes.",
  "latency_ms": 240.5,
  "cost_usd": 0.0031,
  "tags": ["wiki", "rag"]
}
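
The record above is pretty-printed for readability; in a JSONL file each case occupies a single line. The following sketch shows one way to load and sanity-check such a file before evaluation. The `parse_cases` helper and the choice of required fields (treating `tags` as optional) are assumptions, not part of the platform's API.

```python
import json

# Fields taken from the example record; treating "tags" as optional is an assumption.
REQUIRED_FIELDS = {
    "case_id", "question", "reference_answer", "expected_keywords",
    "retrieved_docs", "agent_answer", "latency_ms", "cost_usd",
}


def parse_cases(lines):
    """Parse JSONL lines into case dicts, rejecting rows with missing fields."""
    cases = []
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        case = json.loads(line)
        missing = REQUIRED_FIELDS - case.keys()
        if missing:
            raise ValueError(f"line {line_no}: missing fields {sorted(missing)}")
        cases.append(case)
    return cases
```

A file can be passed directly, since iterating an open file yields lines: `with open("examples/wiki_eval_cases.jsonl") as fh: cases = parse_cases(fh)`.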

CI Quality Gate

When --fail-on-gate is set and any gate threshold fails, the evaluator exits with a non-zero status, which fails the CI job:

PYTHONPATH=src python3 -m agent_eval_platform evaluate \
  --cases examples/wiki_eval_cases.jsonl \
  --gate examples/quality_gate.json \
  --fail-on-gate

See .github/workflows/agent-eval.yml for a GitHub Actions example.
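
The gate behaviour can be sketched as a threshold comparison followed by an exit-code decision. The layout of the gate dictionary below (`min` and `max` maps keyed by metric name) is hypothetical and is not the documented schema of examples/quality_gate.json; the helper names are likewise illustrative.

```python
def gate_failures(metrics, gate):
    """Return one human-readable message per violated threshold."""
    failures = []
    # "min" thresholds: the metric must be at least this value.
    for name, minimum in gate.get("min", {}).items():
        if metrics[name] < minimum:
            failures.append(f"{name} {metrics[name]:.3f} < {minimum:.3f}")
    # "max" thresholds: the metric must not exceed this value.
    for name, maximum in gate.get("max", {}).items():
        if metrics[name] > maximum:
            failures.append(f"{name} {metrics[name]:.3f} > {maximum:.3f}")
    return failures


def exit_code(failures, fail_on_gate):
    """Non-zero only when gating is enabled and at least one threshold failed."""
    return 1 if (fail_on_gate and failures) else 0
```

In a script, `sys.exit(exit_code(failures, fail_on_gate=True))` would propagate the result to the CI runner, which treats any non-zero status as a failed step.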

MCP-Style Tool Server

Run:

PYTHONPATH=src python3 -m agent_eval_platform.mcp_server

Supported JSON-RPC methods:

initialize
tools/list
tools/call with:
- run_evaluation_suite
- compare_regression
- decide_canary

The server intentionally communicates over stdio only and has no third-party dependencies. It follows the MCP tool shape closely enough for local agent integration demos without requiring the MCP Python SDK.
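
As an illustration of what a client would send, the sketch below builds JSON-RPC 2.0 requests for the methods listed above. The `name`/`arguments` shape of `tools/call` follows the MCP convention; the argument names (`cases_path`, `gate_path`) and the one-message-per-line framing are assumptions about this server, not confirmed details.

```python
import json


def rpc_request(request_id, method, params=None):
    """Serialize a JSON-RPC 2.0 request as a single line of JSON."""
    message = {"jsonrpc": "2.0", "id": request_id, "method": method}
    if params is not None:
        message["params"] = params
    return json.dumps(message)


def tool_call(request_id, tool_name, arguments):
    """Build a tools/call request for one of the exposed tools."""
    return rpc_request(
        request_id, "tools/call", {"name": tool_name, "arguments": arguments}
    )


# Hypothetical session: argument names are illustrative only.
session = [
    rpc_request(1, "initialize", {}),
    rpc_request(2, "tools/list"),
    tool_call(3, "run_evaluation_suite", {
        "cases_path": "examples/wiki_eval_cases.jsonl",
        "gate_path": "examples/quality_gate.json",
    }),
]
```

Each string in `session` could be written to the server's stdin followed by a newline, with responses read line by line from its stdout, assuming newline-delimited framing.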

Canary Workflow

PYTHONPATH=src python3 -m agent_eval_platform canary \
  --result reports/eval_result.json \
  --config examples/canary_config.json \
  --json-out reports/canary_decision.json

The decision is one of hold, increase_traffic, or promote, based on suite quality and minimum case coverage.
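
One plausible shape for such a policy is sketched below. The config keys (`min_pass_rate`, `min_cases`, `ramp_percent`) and the ramp logic are assumptions made for illustration; the real schema lives in examples/canary_config.json and may differ.

```python
def decide_canary(pass_rate, case_count, current_traffic, config):
    """Return a hold / increase_traffic / promote decision with reasons."""
    reasons = []
    if case_count < config["min_cases"]:
        reasons.append(f"case_count {case_count} < {config['min_cases']}")
    if pass_rate < config["min_pass_rate"]:
        reasons.append(f"pass_rate {pass_rate:.3f} < {config['min_pass_rate']:.3f}")
    if reasons:
        # Any failed check keeps traffic where it is.
        return {"action": "hold", "next_traffic_percent": current_traffic, "reasons": reasons}

    # Quality is acceptable: move to the next ramp step, or promote at the top.
    higher_steps = [step for step in config["ramp_percent"] if step > current_traffic]
    if not higher_steps:
        return {"action": "promote", "next_traffic_percent": 100.0, "reasons": []}
    return {"action": "increase_traffic", "next_traffic_percent": min(higher_steps), "reasons": []}
```

Returning the reasons alongside the action keeps the decision auditable: a reviewer can see which threshold blocked a rollout without re-running the evaluation.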

Evaluation Results

Evaluation Run

PYTHONPATH=src python3 -m agent_eval_platform evaluate \
  --cases examples/wiki_eval_cases.jsonl \
  --gate examples/quality_gate.json \
  --variant candidate \
  --json-out reports/eval_result.json \
  --markdown-out reports/eval_report.md \
  --traces-out reports/traces.jsonl

Aggregate Metrics

Metric	Value
Evaluation cases	120
Pass rate	82.5%
Average faithfulness	0.825
Average retrieval relevance	0.838
Average hallucination risk	0.153
p50 latency	392.58 ms
p95 latency	663.40 ms
p99 latency	872.87 ms
Average cost	$0.00393
Total cost	$0.47158
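
The figures above are internally consistent: 82.5% of 120 cases is 99 passing cases, and an average cost of $0.00393 over 120 cases is roughly $0.4716 in total. The sketch below shows how aggregates of this kind can be computed from per-case results. The `passed` field and the nearest-rank percentile method are assumptions; the platform may compute percentiles differently (for example, with interpolation).

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(results):
    """Aggregate per-case results (dicts with passed, latency_ms, cost_usd)."""
    latencies = [r["latency_ms"] for r in results]
    costs = [r["cost_usd"] for r in results]
    passed = sum(1 for r in results if r["passed"])
    return {
        "cases": len(results),
        "pass_rate": passed / len(results),
        "p50_latency_ms": percentile(latencies, 50),
        "p95_latency_ms": percentile(latencies, 95),
        "p99_latency_ms": percentile(latencies, 99),
        "avg_cost_usd": sum(costs) / len(costs),
        "total_cost_usd": sum(costs),
    }
```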

Canary Decision

PYTHONPATH=src python3 -m agent_eval_platform canary \
  --result reports/eval_result.json \
  --config examples/canary_config.json \
  --json-out reports/canary_decision.json

Result:

{
  "action": "hold",
  "next_traffic_percent": 10.0,
  "reasons": [
    "pass_rate 0.825 < 0.900"
  ]
}

The canary policy blocked promotion because the candidate run's pass rate (82.5%) fell below the configured 90% threshold. This demonstrates how the platform can prevent low-quality agent versions from being promoted automatically.

Observability

The evaluation emits OpenTelemetry-style JSONL traces to:

reports/traces.jsonl

Each case generates spans for faithfulness, retrieval relevance, hallucination risk, and final case-level pass/fail status, enabling debugging of failed agent responses.
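
A rough sketch of how per-case spans could be serialized to JSONL follows. The field names (`trace_id`, `span_id`, `name`, `attributes`) and the span names are assumptions loosely modelled on OpenTelemetry concepts; they are not the platform's documented trace schema.

```python
import json
import uuid


def case_spans(case_id, scores, passed):
    """Build one span per score plus a final span for the case verdict."""
    trace_id = uuid.uuid4().hex  # all spans of a case share one trace
    spans = [
        {
            "trace_id": trace_id,
            "span_id": uuid.uuid4().hex[:16],
            "name": name,
            "attributes": {"case_id": case_id, "score": value},
        }
        for name, value in scores.items()
    ]
    spans.append({
        "trace_id": trace_id,
        "span_id": uuid.uuid4().hex[:16],
        "name": "case_result",
        "attributes": {"case_id": case_id, "passed": passed},
    })
    return spans


def to_jsonl(spans):
    """Serialize spans as one JSON object per line."""
    return "".join(json.dumps(span) + "\n" for span in spans)
```

With a layout like this, debugging a failed case amounts to filtering the file for spans whose `case_result` has `passed` set to false and then reading the sibling spans that share the same `trace_id`.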


license - permissive license
