mcp-eval-harness

by salmankhan-prs

MCP-based code evaluation harness — sandboxed execution + LLM quality scoring.

An MCP server that exposes 4 evaluation tools. The server does no orchestration of its own: it is purely a tool provider, although `score_answer` and `detect_hallucination` call the OpenAI API internally. Any MCP-compatible client can connect and use these tools: the included AI agent, Cursor, Claude Desktop, or anything that speaks MCP.

Tools

| Tool | Description |
| --- | --- |
| `run_code` | Execute code in a sandboxed Docker container (or local fallback). Returns stdout, stderr, exit code. |
| `run_tests` | Run code against test cases. Returns pass/fail for each test with actual vs expected output. |
| `score_answer` | LLM-based quality scoring (0-10) against a rubric with breakdown by correctness, efficiency, readability, edge cases. |
| `detect_hallucination` | Check if text contains claims not supported by provided context. Returns confidence and unsupported claims. |

Architecture

┌─────────────────────┐       MCP (stdio)       ┌─────────────────────┐
│                     │  ◄───────────────────── │                     │
│   MCP Client        │       tool calls         │   MCP Server        │
│   (AI agent)        │  ────────────────────► │   (tool provider)   │
│                     │       results            │                     │
│   Has the brain —   │                          │   No AI here —      │
│   decides what to   │                          │   just runs code,   │
│   evaluate & when   │                          │   executes tests,   │
│                     │                          │  calls scoring APIs │
└─────────────────────┘                          └─────────────────────┘

Server = pure tool provider. Exposes run_code, run_tests, score_answer, detect_hallucination over MCP stdio. No decision-making, no orchestration.

Client = the brain. An AI agent (GPT-4o-mini) that decides which tools to call, in what order, and synthesizes a final evaluation report.

Transport = stdio. The client spawns the server as a subprocess and communicates via stdin/stdout. This is how MCP stdio works — same as how Cursor or Claude Desktop talk to MCP servers.
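
To make the stdio transport concrete, here is a minimal sketch of the wire format: MCP's stdio transport exchanges newline-delimited JSON-RPC 2.0 messages, and tools are invoked with the `tools/call` method. The helper names `encodeToolCall` and `decodeMessages` are hypothetical, and the `run_code` argument names in the usage note are assumptions rather than the server's confirmed schema. In practice the `@modelcontextprotocol/sdk` client handles this framing for you.

```typescript
// JSON-RPC 2.0 request shape used by MCP's `tools/call` method.
interface ToolCallRequest {
  jsonrpc: "2.0";
  id: number;
  method: "tools/call";
  params: { name: string; arguments: Record<string, unknown> };
}

// Serialize one request as a single newline-terminated line.
// JSON.stringify escapes newlines inside strings, so the line stays intact.
function encodeToolCall(
  id: number,
  name: string,
  args: Record<string, unknown>,
): string {
  const request: ToolCallRequest = {
    jsonrpc: "2.0",
    id,
    method: "tools/call",
    params: { name, arguments: args },
  };
  return JSON.stringify(request) + "\n";
}

// Split buffered output into complete messages; `rest` holds any
// trailing partial line that should be kept until more data arrives.
function decodeMessages(buffer: string): { messages: unknown[]; rest: string } {
  const lines = buffer.split("\n");
  const rest = lines.pop() ?? "";
  const messages = lines
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.parse(line) as unknown);
  return { messages, rest };
}
```

A client would write `encodeToolCall(1, "run_code", { language: "javascript", code: "console.log(1)" })` to the server's stdin and feed whatever arrives on stdout through `decodeMessages`.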

Use with Any MCP Client

The server isn't locked to our AI agent. You can use it with:

Cursor — Add to your MCP config:

{
  "mcpServers": {
    "eval-harness": {
      "command": "npx",
      "args": ["tsx", "server/index.ts"],
      "cwd": "/path/to/mcp-eval-harness"
    }
  }
}

Claude Desktop — Add to claude_desktop_config.json:

{
  "mcpServers": {
    "eval-harness": {
      "command": "npx",
      "args": ["tsx", "server/index.ts"],
      "cwd": "/path/to/mcp-eval-harness"
    }
  }
}

MCP Inspector (for debugging):

npx @modelcontextprotocol/inspector npx tsx server/index.ts

Stack

| Package | Purpose |
| --- | --- |
| `@modelcontextprotocol/sdk` | MCP server + client protocol |
| `@ai-sdk/mcp` | Vercel AI SDK MCP client adapter |
| `ai` | Vercel AI SDK core (`generateText`, `generateObject`) |
| `@ai-sdk/openai` | OpenAI provider (direct API key, NOT Vercel Gateway) |
| `zod` | Schema validation for tool inputs and structured outputs |
| `dotenv` | Environment variables |

Setup

git clone https://github.com/salmankhan-prs/mcp-eval-harness.git
cd mcp-eval-harness
pnpm install
cp .env.example .env
# Add your OpenAI API key to .env

Pull Docker images for sandboxed execution (optional — falls back to local if Docker isn't running):

docker pull node:22-alpine
docker pull python:3.12-alpine

Usage

pnpm eval examples/problems.json

Example Output

Evaluating: Two Sum
Language: javascript
============================================================

Test Results: 3/3 passed

Quality Score: 8.5/10
  Correctness:  9/10
  Efficiency:   9/10
  Readability:  8/10
  Edge Cases:   8/10

Hallucination Check: No hallucinations detected

Verdict: PASS

Project Structure

mcp-eval-harness/
├── README.md
├── package.json
├── tsconfig.json
├── .env.example
├── .gitignore
├── server/
│   ├── index.ts                     # MCP server — registers tools, starts stdio transport
│   └── tools/
│       ├── run-code.ts              # Docker-sandboxed code execution (hacky but works)
│       ├── run-tests.ts             # Test runner — executes code against test cases
│       ├── score-answer.ts          # LLM-based quality scoring against rubric
│       └── detect-hallucination.ts  # LLM-based hallucination detection
├── client/
│   └── index.ts                     # AI agent that orchestrates evaluation via MCP
└── examples/
    └── problems.json                # Sample coding problems with test cases

How Each Tool Works

run_code — Detects if Docker is available. If yes, runs code in a container with --network=none, memory/CPU/PID limits. If Docker is unavailable, falls back to local child_process with a 10s timeout. In production you'd swap this for a Firecracker microVM pool.
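
As a rough illustration of the sandboxed path, the sketch below assembles `docker run` arguments with networking disabled and memory, CPU, and PID limits applied. The flags are standard Docker CLI options and the images are the ones pulled in Setup, but `buildDockerArgs`, the in-container commands, and the specific limit values are assumptions, not the project's actual settings.

```typescript
type Language = "javascript" | "python";

// Images match those pulled in Setup; the in-container commands are assumed.
const RUNTIMES: Record<Language, { image: string; command: string[] }> = {
  javascript: { image: "node:22-alpine", command: ["node", "-e"] },
  python: { image: "python:3.12-alpine", command: ["python3", "-c"] },
};

interface SandboxLimits {
  memory: string; // value for --memory, e.g. "128m"
  cpus: string;   // value for --cpus, e.g. "0.5"
  pids: number;   // value for --pids-limit
}

// Hypothetical defaults, chosen only for illustration.
const DEFAULT_LIMITS: SandboxLimits = { memory: "128m", cpus: "0.5", pids: 64 };

// Build the argument list for `docker run` (without the leading "docker").
function buildDockerArgs(
  language: Language,
  code: string,
  limits: SandboxLimits = DEFAULT_LIMITS,
): string[] {
  const { image, command } = RUNTIMES[language];
  return [
    "run",
    "--rm",                          // remove the container when it exits
    "--network=none",                // no network access inside the sandbox
    `--memory=${limits.memory}`,
    `--cpus=${limits.cpus}`,
    `--pids-limit=${limits.pids}`,   // caps process count (fork-bomb guard)
    image,
    ...command,
    code,
  ];
}
```

The resulting array can be handed to Node's `child_process.spawn("docker", args)`; passing the code as a single argument avoids shell quoting problems.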

run_tests — Wraps the candidate solution + each test input into a runnable script. Calls run_code for each test case and compares stdout to expected output. Convention: solution must define a solution() function.
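
The wrap-and-compare step might look roughly like the sketch below for JavaScript solutions. All names here (`TestCase`, `buildTestScript`, `runLocally`, `runTests`) are hypothetical, the test-case shape is an assumption about `problems.json`, and `runLocally` is an in-process stand-in for `run_code` that provides no sandboxing at all.

```typescript
interface TestCase {
  input: unknown[];  // positional arguments passed to solution()
  expected: string;  // expected stdout for this case
}

interface TestResult {
  passed: boolean;
  actual: string;
  expected: string;
}

// Append a call to solution() that prints its JSON-encoded return value.
function buildTestScript(solutionSource: string, input: unknown[]): string {
  const args = input.map((value) => JSON.stringify(value)).join(", ");
  return `${solutionSource}\nconsole.log(JSON.stringify(solution(${args})));`;
}

// Stand-in for run_code: evaluates the script in-process and captures
// console.log output. This is NOT a sandbox; it exists only for the demo.
function runLocally(script: string): string {
  const lines: string[] = [];
  const fakeConsole = {
    log: (...parts: unknown[]) => {
      lines.push(parts.join(" "));
    },
  };
  new Function("console", script)(fakeConsole);
  return lines.join("\n");
}

// Run every test case and compare trimmed stdout to the expected output.
function runTests(solutionSource: string, tests: TestCase[]): TestResult[] {
  return tests.map(({ input, expected }) => {
    const actual = runLocally(buildTestScript(solutionSource, input)).trim();
    return { passed: actual === expected.trim(), actual, expected };
  });
}
```

Comparing stdout as trimmed strings keeps the check language-agnostic, at the cost of being sensitive to formatting differences such as `[0,1]` versus `[0, 1]`.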

score_answer — Sends the problem + solution to GPT-4o-mini with a structured output schema (Zod). Returns an overall score and per-criterion breakdown. This is what ReasonCore's expert evaluators do, but automated.
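
The project validates the model's reply with a Zod schema; the dependency-free sketch below shows the same idea with a hand-written check. The field names (`overall`, `breakdown`, `edgeCases`) are assumptions based on the criteria listed in the tools table, not the project's actual schema.

```typescript
const CRITERIA = ["correctness", "efficiency", "readability", "edgeCases"] as const;
type Criterion = (typeof CRITERIA)[number];

interface ScoreResult {
  overall: number;
  breakdown: Record<Criterion, number>;
}

// A valid score is a finite number on the 0-10 scale (NaN fails both comparisons).
function isScore(value: unknown): value is number {
  return typeof value === "number" && value >= 0 && value <= 10;
}

// Parse the model's JSON reply and reject anything outside the expected shape.
function parseScore(raw: string): ScoreResult {
  const data: unknown = JSON.parse(raw);
  if (typeof data !== "object" || data === null) {
    throw new Error("score reply must be a JSON object");
  }
  const { overall, breakdown: rawBreakdown } = data as {
    overall?: unknown;
    breakdown?: Record<string, unknown>;
  };
  if (!isScore(overall)) {
    throw new Error("overall must be a number in [0, 10]");
  }
  const breakdown = {} as Record<Criterion, number>;
  for (const criterion of CRITERIA) {
    const score = rawBreakdown?.[criterion];
    if (!isScore(score)) {
      throw new Error(`${criterion} must be a number in [0, 10]`);
    }
    breakdown[criterion] = score;
  }
  return { overall, breakdown };
}
```

Validating before use matters because an LLM can return out-of-range or missing fields even when asked for structured output.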

detect_hallucination — Sends a claim + context to GPT-4o-mini. Returns whether unsupported claims exist, a confidence score, and specific unsupported claims.
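
A minimal sketch of the two ends of this tool follows: building the prompt from a claim and its context, and turning the structured result into a one-line summary. The result field names, the prompt wording, the assumption that confidence lies in [0, 1], and the format of the "unsupported" summary line are all hypothetical; only the "No hallucinations detected" wording comes from the example output above.

```typescript
interface HallucinationResult {
  hasUnsupportedClaims: boolean;
  confidence: number;          // assumed to be in [0, 1]
  unsupportedClaims: string[];
}

// Assemble a prompt that asks the model to check TEXT against CONTEXT.
function buildHallucinationPrompt(claim: string, context: string): string {
  return [
    "List every claim in TEXT that is not supported by CONTEXT.",
    "Reply with JSON: {hasUnsupportedClaims, confidence, unsupportedClaims}.",
    "",
    "CONTEXT:",
    context,
    "",
    "TEXT:",
    claim,
  ].join("\n");
}

// Render the structured result as a single report line.
function summarize(result: HallucinationResult): string {
  if (!result.hasUnsupportedClaims) {
    return "Hallucination Check: No hallucinations detected";
  }
  const percent = Math.round(result.confidence * 100);
  const count = result.unsupportedClaims.length;
  return `Hallucination Check: ${count} unsupported claim(s), confidence ${percent}%`;
}
```

Keeping the context and the text under separate labelled headings makes it harder for the model to confuse what is being checked with what it is being checked against.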

Prerequisites

Node.js 22+
Docker Desktop running (optional — falls back to local execution)
OpenAI API key in .env
