mcp-prompt-lab
A local MCP server that brings prompt evaluation directly into your AI coding environment. Define test cases, run prompts against multiple LLM providers, score outputs with deterministic and LLM-graded assertions, and track quality over time — through 4 consolidated tools, 3 resources, and 2 prompt templates.
The only MCP server that exposes general-purpose prompt evaluation as MCP tools. Every other eval tool in the ecosystem tests MCP servers from the outside. This one brings eval capabilities into the host — so you iterate on prompts without leaving your editor.
Design Philosophy
Most MCP servers on GitHub are thin API wrappers: one endpoint becomes one tool, names are generic, there is no error recovery, and the README says "install and run." This project takes the opposite approach, applying production-grade patterns to a domain that matters — evaluation is the scarcest skill in AI engineering right now, and this server makes it accessible from any MCP-compatible host.
Tool Consolidation (4 tools, not 15+)
The official MCP filesystem server exposes 13 tools. Agents work best with 10-15 tools maximum — after that, tool selection accuracy degrades. This server consolidates all evaluation operations into 4 tools grouped by user intent, not by CRUD operation:
Tool | Intent | Actions |
eval_assert | "Check this output right now" | Single-purpose, no actions |
eval_suite | "Set up an evaluation" | 9 actions (CRUD for prompts + datasets + generation) |
eval_run | "Run an evaluation" | Single-purpose, matrix execution |
eval_analyze | "Look at results" | 5 actions (get, compare, list, delete, trends) |
Each tool uses action dispatch internally. The agent sees 4 clean entry points instead of 15+ individual tools competing for selection.
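A minimal sketch of what action dispatch can look like, assuming a discriminated union on the action field. The names AnalyzeInput and dispatchAnalyze and the returned strings are hypothetical and only illustrate the pattern; they are not taken from the server's source.

```typescript
// Hypothetical input shape for one consolidated tool. Only three of the
// five eval_analyze actions are sketched here.
type AnalyzeInput =
  | { action: "get_run"; run_id: number }
  | { action: "compare_runs"; run_ids: number[] }
  | { action: "list_runs" };

// One entry point for the agent; the `action` field selects the branch.
function dispatchAnalyze(input: AnalyzeInput): string {
  switch (input.action) {
    case "get_run":
      return `details for run ${input.run_id}`;
    case "compare_runs":
      return `comparing runs ${input.run_ids.join(", ")}`;
    case "list_runs":
      return "listing recent runs";
  }
}
```

Because the union is keyed on action, the compiler checks that each branch only reads the fields that belong to its action.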
Dynamic Hints
Every response includes hints, not just on errors but on success too. Five rules:
Errors say what happened AND what to do next. "Prompt 'x' exists (v2). To update, include checksum: 'abc123...'": the fix is in the error message.
Resource status requiring special settings gets communicated. LLM-graded assertions without grader_provider → hint tells you exactly which param to add.
Success responses suggest the logical follow-up. Save a prompt → "Use eval_run with prompt_name 'x' to run evaluations."
Wrong values suggest available options. Invalid provider string → lists all supported providers.
Auto-corrections get reported. Variables auto-extracted from {{var}} patterns → response confirms what was detected.
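The auto-extraction mentioned in the last rule could be sketched as follows, assuming variables are plain identifiers wrapped in double braces. The function name extractVariables and the exact pattern are illustrative assumptions, not the server's actual implementation.

```typescript
// Collect unique {{name}} placeholders in order of first appearance.
// Assumption: names are identifiers, optionally padded with whitespace.
function extractVariables(template: string): string[] {
  const pattern = /\{\{\s*([A-Za-z_][A-Za-z0-9_]*)\s*\}\}/g;
  const seen = new Set<string>();
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(template)) !== null) {
    seen.add(match[1]);
  }
  return Array.from(seen);
}
```

Returning the detected names is what lets a response confirm the auto-correction back to the caller.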
Token-Aware Responses
An eval with 50 cases x 3 providers = 150 results. Dumping all into context would blow the token budget. Response strategy:
eval_run returns a summary (pass rates, avg scores, latency, token usage) + top 3 failures with reasons
Full per-case breakdowns live behind eval_analyze get_run with limit/offset pagination
Long outputs are truncated to a token budget before embedding in failure reports
This makes eval results usable even in contexts with aggressive token limits.
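A rough sketch of truncating text to a token budget, assuming a heuristic of about 4 characters per token. The heuristic, the marker text, and the function names are assumptions for illustration; the server's tokens.ts may estimate differently.

```typescript
// Assumed heuristic: roughly 4 characters per token.
const CHARS_PER_TOKEN = 4;
const TRUNCATION_MARKER = "... [truncated]";

function estimateTokens(text: string): number {
  return Math.ceil(text.length / CHARS_PER_TOKEN);
}

// Return the text unchanged if it fits, otherwise cut it so that the
// result (including the marker) stays within the character budget.
function truncateToBudget(text: string, maxTokens: number): string {
  if (estimateTokens(text) <= maxTokens) return text;
  const keep = Math.max(0, maxTokens * CHARS_PER_TOKEN - TRUNCATION_MARKER.length);
  return text.slice(0, keep) + TRUNCATION_MARKER;
}
```

A character-based estimate is cheap and needs no tokenizer dependency, at the cost of being approximate.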
Checksum Pattern for Safe Mutations
Updating a saved prompt requires the current checksum (SHA-256 of the content). This prevents overwriting a prompt that was changed in another session or by another tool call — a real concern when multiple agents share the same MCP server. The error response always includes the current checksum, so recovery is one copy-paste away.
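The checksum guard could look roughly like this. It is a sketch under stated assumptions: an in-memory Map stands in for the SQLite table, node:crypto stands in for Bun.CryptoHasher, and the names sha256, StoredPrompt, and updatePrompt are hypothetical.

```typescript
import { createHash } from "node:crypto";

function sha256(content: string): string {
  return createHash("sha256").update(content).digest("hex");
}

interface StoredPrompt {
  content: string;
  version: number;
}

type UpdateResult =
  | { ok: true; version: number; checksum: string }
  | { ok: false; error: string };

// Create a prompt, or update it only when the caller proves it has seen
// the current content by supplying the matching checksum.
function updatePrompt(
  store: Map<string, StoredPrompt>,
  name: string,
  content: string,
  checksum?: string,
): UpdateResult {
  const existing = store.get(name);
  if (existing) {
    const current = sha256(existing.content);
    if (checksum !== current) {
      return {
        ok: false,
        error: `Prompt '${name}' exists (v${existing.version}). To update, include checksum: '${current}'`,
      };
    }
  }
  const version = (existing ? existing.version : 0) + 1;
  store.set(name, { content, version });
  return { ok: true, version, checksum: sha256(content) };
}
```

The error path carries the current checksum, so a caller can retry immediately with the value from the message.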
Architecture Overview
src/
  index.ts            STDIO transport entry point
  server.ts           MCP server setup: tools, resources, prompts
  tools/
    eval-assert.ts    Standalone assertion runner
    eval-suite.ts     Prompt & dataset CRUD + synthetic generation
    eval-run.ts       Evaluation execution orchestrator
    eval-analyze.ts   Run analysis, comparison, trends
  engine/
    assertions.ts     8 deterministic + 4 LLM-graded assertion types
    providers.ts      Unified provider resolution (6 providers + local)
    runner.ts         Concurrency-controlled matrix execution
  db/
    database.ts       SQLite (bun:sqlite) with WAL, migrations, lazy init
  utils/
    checksum.ts       SHA-256 via Bun.CryptoHasher
    hints.ts          Consistent response envelope formatting
    tokens.ts         Token estimation + truncation for context budget
Key architectural choices:
Bun-native SQLite (bun:sqlite): zero dependencies for persistence, WAL mode for safe concurrent reads
Vercel AI SDK: thin provider abstraction with unified generateText/generateObject across all providers. No orchestration frameworks (no LangChain, no CrewAI)
Zod v4: shared schema validation between MCP tool inputs and LLM grader structured outputs
STDIO-only transport: API keys stay in your local environment, eval history stays in local SQLite. Zero infrastructure to deploy.
Quick Start
Prerequisites: Bun installed.
git clone https://github.com/nicholasbarwicki/mcp-prompt-lab
cd mcp-prompt-lab
bun install
Verify the server starts:
bun run src/index.ts
To inspect with MCP Inspector: use STDIO transport, command bun, args ["run", "/absolute/path/to/src/index.ts"].
Configuration
Claude Code
Create .mcp.json in the project root (or add to ~/.claude.json globally):
{
"mcpServers": {
"prompt-lab": {
"command": "bun",
"args": ["run", "/absolute/path/to/mcp-prompt-lab/src/index.ts"],
"env": {
"OPENAI_API_KEY": "${OPENAI_API_KEY}",
"GOOGLE_GENERATIVE_AI_API_KEY": "${GOOGLE_GENERATIVE_AI_API_KEY}"
}
}
}
}
Claude Desktop
Add to ~/Library/Application Support/Claude/claude_desktop_config.json (macOS):
{
"mcpServers": {
"prompt-lab": {
"command": "bun",
"args": ["run", "/absolute/path/to/mcp-prompt-lab/src/index.ts"],
"env": {
"OPENAI_API_KEY": "sk-...",
"GOOGLE_GENERATIVE_AI_API_KEY": "AIza..."
}
}
}
}
Cursor / Windsurf / Any MCP Host
Same pattern — STDIO transport, bun run command, env vars for the providers you use.
Only include API keys for providers you intend to use. Deterministic assertions require no keys at all.
Bun auto-loads .env from the project root, so keys defined there are picked up automatically; exporting them in your shell (export OPENAI_API_KEY=sk-...) works too.
Tools Reference
All tools return a consistent JSON envelope:
{
"status": "success",
"data": { "..." },
"hints": ["What to do next..."]
}
Errors use "status": "error" with MCP's isError: true flag.
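A small sketch of helpers that build this envelope. The Envelope type mirrors the JSON shape shown above; the helper names ok and fail, and the shape of data on the error path, are assumptions for illustration only.

```typescript
// Mirrors the envelope shown above: status, data, hints.
interface Envelope<T> {
  status: "success" | "error";
  data: T;
  hints: string[];
}

function ok<T>(data: T, hints: string[] = []): Envelope<T> {
  return { status: "success", data, hints };
}

// Assumption: error details are carried as a message inside `data`.
function fail(message: string, hints: string[] = []): Envelope<{ message: string }> {
  return { status: "error", data: { message }, hints };
}

const savedResponse = ok(
  { name: "ticket-classifier", version: 1 },
  ["Use eval_run with prompt_name 'ticket-classifier' to run evaluations."],
);
```

Routing every response through the same two helpers is one way to keep hints present on success as well as on failure.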
eval_assert — Standalone Assertion Runner
Run assertions against any text output. No saved prompts, no datasets, no API keys for deterministic checks. This is the tool you'll use daily — the zero-setup entry point.
Input schema:
Field | Type | Required | Description |
output | string | yes | The text to evaluate |
assertions | Assertion[] | yes | List of assertion objects |
expected | string | no | Reference text for factuality/similarity |
| string | no | Original query for relevance assertions |
grader_provider | string | no | Provider for LLM-graded assertions, e.g. |
Example: Validate a Classifier Output (Deterministic Only)
No API key needed. Instant results.
{
"output": "{\"category\": \"billing\", \"confidence\": 0.87, \"priority\": \"high\"}",
"assertions": [
{ "type": "is-json" },
{ "type": "contains", "value": "category" },
{ "type": "regex", "value": "\"confidence\":\\s*0\\.\\d+" },
{ "type": "length-max", "value": "200" }
]
}
Response:
{
"status": "success",
"data": {
"pass": true,
"score": 1.0,
"results": [
{ "type": "is-json", "pass": true, "score": 1, "reason": "Valid JSON", "weight": 1 },
{
"type": "contains",
"value": "category",
"pass": true,
"score": 1,
"reason": "Output contains \"category\"",
"weight": 1
},
{
"type": "regex",
"value": "\"confidence\":\\s*0\\.\\d+",
"pass": true,
"score": 1,
"reason": "Output matches regex",
"weight": 1
},
{
"type": "length-max",
"value": "200",
"pass": true,
"score": 1,
"reason": "Output length 58 is within max 200",
"weight": 1
}
]
},
"hints": ["All assertions passed. Score: 1.00."]
}
Example: LLM-Graded Quality Check
Uses a grader model to evaluate subjective criteria.
{
"output": "Hey! So like, your account is kinda messed up. Gonna fix it tho, no worries lol",
"assertions": [
{ "type": "llm-rubric", "value": "Response is professional and concise", "weight": 2 },
{ "type": "not-contains", "value": "lol" },
{ "type": "length-max", "value": "500" }
],
"grader_provider": "openai:gpt-5-mini"
}
The grader uses generateObject with a Zod schema: guaranteed structured { pass, score, reason } response, no fragile regex parsing.
Example: Factuality Check Against Reference
{
"output": "The Eiffel Tower is 330 meters tall and was completed in 1889.",
"expected": "The Eiffel Tower is 330 meters tall, completed in 1889 for the World's Fair.",
"assertions": [{ "type": "factuality" }, { "type": "contains", "value": "1889" }],
"grader_provider": "openai:gpt-5-mini"
}
eval_suite — Prompt & Dataset Manager
CRUD for prompts and test datasets, plus LLM-powered synthetic test generation. All actions dispatched via the action field.
Input schema:
Field | Type | Required for | Description |
action | enum | always | One of the 9 actions below |
name | string | most actions | Prompt or dataset name |
content | string | | Prompt template content |
| string[] | no | Variable names (auto-extracted from |
tags | string[] | no | Tags for categorization and filtering |
checksum | string | updating prompt | Current checksum (required to update existing prompt) |
cases | Case[] | | Array of |
prompt_description | string | | What the prompt does |
count | number | no | Cases to generate (default: 5) |
provider | string | | Provider for generation (required explicitly, no default) |
| number | no | Pagination for list actions |
Actions:
Action | Required fields | Description |
save_prompt | | Upsert prompt. Auto-extracts |
get_prompt | | Retrieve full prompt content and metadata |
| -- | Paginated summary list |
| | Remove prompt |
save_dataset | | Upsert dataset |
get_dataset | | Retrieve dataset with all cases |
| -- | Paginated summary list |
| | Remove dataset |
generate_dataset | | Generate synthetic test cases via LLM |
Example: Save a Prompt (Variables Auto-Extracted)
{
"action": "save_prompt",
"name": "ticket-classifier",
"content": "You are a support ticket classifier.\n\nGiven a customer message, output JSON:\n- category: billing, technical, general\n- priority: low, medium, high\n- confidence: 0.0-1.0\n\nCustomer message: {{message}}",
"tags": ["classification", "support"]
}
The {{message}} variable is auto-extracted. Response includes checksum for future updates:
{
"status": "success",
"data": {
"action": "created",
"name": "ticket-classifier",
"variables": ["message"],
"version": 1,
"checksum": "a1b2c3..."
},
"hints": ["Use eval_run with prompt_name 'ticket-classifier' to run evaluations."]
}
Example: Update a Prompt (Checksum Required)
{
"action": "save_prompt",
"name": "ticket-classifier",
"content": "You are a support ticket classifier. Be strict about confidence — only output >0.8 when you're sure.\n\n{{message}}",
"checksum": "a1b2c3..."
}
If you omit the checksum, the error tells you exactly what to provide:
"Prompt 'ticket-classifier' exists (v1). To update, include checksum: 'a1b2c3...'"
Example: Generate a Synthetic Test Dataset
Have an LLM create diverse test cases from a description. Includes happy path, edge cases, and adversarial inputs automatically.
{
"action": "generate_dataset",
"prompt_description": "Classifies customer support tickets into category (billing/technical/general), priority (low/medium/high), and confidence (0-1)",
"count": 5,
"provider": "openai:gpt-5-mini",
"name": "ticket-tests-v1",
"tags": ["synthetic", "classification"]
}
Generated cases are saved to the database. Review with get_dataset before running an eval.
Example: Save a Manual Dataset
{
"action": "save_dataset",
"name": "ticket-tests-curated",
"cases": [
{
"vars": { "message": "I was charged twice for my subscription this month" },
"expected": "{\"category\": \"billing\", \"priority\": \"high\"}",
"description": "Clear billing issue"
},
{
"vars": { "message": "hey" },
"description": "Adversarial: vague single-word input"
},
{
"vars": { "message": "The app crashes when I open settings on Android 14" },
"expected": "{\"category\": \"technical\", \"priority\": \"medium\"}",
"description": "Technical issue with platform detail"
}
],
"tags": ["curated", "classification"]
}
eval_run — Execute Evaluations
The core evaluation engine. Runs a prompt against one or more providers across test cases, scores each output with assertions, and stores everything in SQLite.
Execution model: prompt x providers x cases x assertions = scored result matrix, run with configurable concurrency.
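The execution model can be sketched as a cross product of providers and cases, run through a small concurrency limiter. This is an illustration under assumptions: the names Job, buildMatrix, and runWithConcurrency are hypothetical, and the server's runner.ts may schedule work differently.

```typescript
interface Job {
  provider: string;
  caseIndex: number;
}

// Cross product of providers and cases: one job per matrix cell.
function buildMatrix(providers: string[], caseCount: number): Job[] {
  const jobs: Job[] = [];
  for (const provider of providers) {
    for (let caseIndex = 0; caseIndex < caseCount; caseIndex++) {
      jobs.push({ provider, caseIndex });
    }
  }
  return jobs;
}

// Run jobs with at most `limit` in flight; results keep job order.
async function runWithConcurrency<T>(
  jobs: Job[],
  limit: number,
  worker: (job: Job) => Promise<T>,
): Promise<T[]> {
  const results: T[] = new Array(jobs.length);
  let next = 0;
  // Each lane pulls the next unclaimed job until none remain.
  async function lane(): Promise<void> {
    while (next < jobs.length) {
      const index = next++;
      results[index] = await worker(jobs[index]);
    }
  }
  const laneCount = Math.max(1, Math.min(limit, jobs.length));
  await Promise.all(Array.from({ length: laneCount }, () => lane()));
  return results;
}
```

In a full run, the worker would call the provider and score the output with every assertion before returning the result for that cell.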
Input schema:
Field | Type | Required | Description |
prompt_name | string | one of | Name of a saved prompt |
prompt | string | one of | Inline prompt content |
dataset_name | string | one of | Name of a saved dataset |
cases | Case[] | one of | Inline test cases |
providers | string[] | yes | Provider strings (default: |
assertions | Assertion[] | no | If omitted, outputs are collected with |
tags | string[] | no | Tags for filtering runs later |
| number | no | Default: |
| number | no | Default: |
| number | no | Parallel requests (default: |
grader_provider | string | no | Required for LLM-graded assertions |
Providing both prompt_name and prompt is an error (ambiguous input). Same for dataset_name + cases.
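The "one of" rule can be sketched as a small resolver that accepts exactly one of a saved name or an inline value. The names Source and pickSource and the error wording are hypothetical, not the server's actual validation code.

```typescript
type Source<T> =
  | { kind: "saved"; name: string }
  | { kind: "inline"; value: T };

// Accept exactly one of the two inputs; reject both or neither.
function pickSource<T>(
  savedField: string,
  saved: string | undefined,
  inlineField: string,
  inline: T | undefined,
): Source<T> {
  if (saved !== undefined && inline !== undefined) {
    throw new Error(`Provide either ${savedField} or ${inlineField}, not both.`);
  }
  if (saved !== undefined) return { kind: "saved", name: saved };
  if (inline !== undefined) return { kind: "inline", value: inline };
  throw new Error(`Provide one of ${savedField} or ${inlineField}.`);
}
```

The same helper would cover both pairs: prompt_name with prompt, and dataset_name with cases.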
Example: Compare Two Providers
{
"prompt_name": "ticket-classifier",
"dataset_name": "ticket-tests-v1",
"providers": ["openai:gpt-5-mini", "google:gemini-3.1-flash-lite-preview"],
"assertions": [
{ "type": "is-json" },
{ "type": "contains", "value": "category" },
{ "type": "contains", "value": "priority" },
{
"type": "llm-rubric",
"value": "Classification is reasonable for the given customer message",
"weight": 2
}
],
"grader_provider": "openai:gpt-5-mini",
"tags": ["v1", "model-comparison"]
}
Response (token-aware summary):
{
"status": "success",
"data": {
"runId": 1,
"providers": ["openai:gpt-5-mini", "google:gemini-3.1-flash-lite-preview"],
"totalCases": 5,
"passRate": { "openai:gpt-5-mini": 0.8, "google:gemini-3.1-flash-lite-preview": 0.6 },
"avgScore": { "openai:gpt-5-mini": 0.85, "google:gemini-3.1-flash-lite-preview": 0.72 },
"avgLatencyMs": { "openai:gpt-5-mini": 650, "google:gemini-3.1-flash-lite-preview": 420 },
"totalTokens": { "openai:gpt-5-mini": 3200, "google:gemini-3.1-flash-lite-preview": 2800 },
"topFailures": [
{
"caseIndex": 1,
"description": "Adversarial: vague single-word input",
"provider": "google:gemini-3.1-flash-lite-preview",
"output": "I'd be happy to help! Could you...",
"failedAssertions": ["is-json: Invalid JSON"]
}
]
},
"hints": [
"Run complete. Use eval_analyze with action: 'get_run', run_id: 1 for full per-case breakdown.",
"Top providers by pass rate: openai:gpt-5-mini: 80%, google:gemini-3.1-flash-lite-preview: 60%"
]
}
Notice: only the summary and top 3 failures are returned, not all 10 individual results. Use eval_analyze for the full breakdown.
Example: Quick Inline Eval (No Saved Data)
No need to save anything — pass prompt and cases directly:
{
"prompt": "Translate the following English text to French:\n\n{{text}}",
"cases": [
{ "vars": { "text": "Hello, how are you?" }, "expected": "Bonjour, comment allez-vous ?" },
{
"vars": { "text": "The weather is nice today" },
"expected": "Le temps est beau aujourd'hui"
},
{ "vars": { "text": "" }, "description": "Edge case: empty input" }
],
"providers": ["openai:gpt-5-mini"],
"assertions": [
{ "type": "length-min", "value": "1" },
{ "type": "similarity", "threshold": 0.8 }
],
"grader_provider": "openai:gpt-5-mini"
}
eval_analyze — Inspect & Compare Results
Browse, compare, and manage stored evaluation runs. This is where you analyze failures, detect regressions, and track quality trends.
Input schema:
Field | Type | Required for | Description |
action | enum | always | One of the 5 actions below |
run_id | number | | Run ID |
run_ids | number[] | | At least 2 run IDs |
prompt_name | string | no | Filter by prompt name |
| string | no | Filter by tag |
| number | no | Pagination (default: 20) |
provider | string | no | Filter results by provider |
only_failed | boolean | no | Return only failed cases |
Actions:
Action | Description |
get_run | Full run details with paginated per-case results. Supports |
compare_runs | Side-by-side score comparison between 2+ runs. Computes deltas, detects regressions, counts improvements. |
list_runs | Recent runs, filterable by |
| Remove run and all its results (CASCADE). |
trends | Raw |
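The delta and regression logic behind compare_runs might look roughly like this, assuming each run is reduced to an average score per provider. The function names formatDelta and findRegressions are hypothetical; the output format follows the "+0.10 (0.85 -> 0.95)" style used in the compare_runs example response.

```typescript
// Format a score change as "+0.10 (0.85 -> 0.95)".
function formatDelta(before: number, after: number): string {
  const delta = after - before;
  const sign = delta >= 0 ? "+" : "-";
  return `${sign}${Math.abs(delta).toFixed(2)} (${before.toFixed(2)} -> ${after.toFixed(2)})`;
}

// Providers present in both runs whose score dropped.
function findRegressions(
  before: Record<string, number>,
  after: Record<string, number>,
): string[] {
  return Object.keys(before).filter(
    (provider) => provider in after && after[provider] < before[provider],
  );
}
```

An empty regression list together with positive deltas corresponds to the "No regressions detected" hint.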
Example: Deep Dive Into Failures
{
"action": "get_run",
"run_id": 1,
"only_failed": true,
"provider": "google:gemini-3.1-flash-lite-preview"
}
Returns only the cases that failed for Gemini, with full assertion results and the actual model output.
Example: Compare Runs After a Prompt Change
You fixed the prompt and re-ran. Now compare:
{
"action": "compare_runs",
"run_ids": [1, 2]
}
Response:
{
"status": "success",
"data": {
"runs": ["run_1", "run_2"],
"scoreChanges": {
"openai:gpt-5-mini": "+0.10 (0.85 -> 0.95)",
"google:gemini-3.1-flash-lite-preview": "+0.23 (0.72 -> 0.95)"
},
"regressions": [],
"improvements": 2
},
"hints": ["2 provider(s) improved. No regressions detected."]
}
Example: Quality Trends Over Time
{
"action": "trends",
"prompt_name": "ticket-classifier"
}
Returns chronological data for charting score progression across runs.
Resources
Three MCP resources are exposed for supporting clients to browse server state.
URI | Type | Contents |
| Static | JSON array of prompt summaries: name, version, variable count, tags |
| Static | JSON array of dataset summaries: name, case count, tags |
| Dynamic template | Full run metadata and summary for a specific run ID |
The runs resource supports URI completion — clients can tab-complete run IDs from the 50 most recent runs.
Resources return summaries only. Full content lives behind the tools (eval_suite get_prompt, eval_suite get_dataset, eval_analyze get_run). This separation keeps resource reads lightweight while full data is available on demand.
Prompt Templates
Two MCP prompt templates are registered for quick invocation from supporting clients.
quick-eval
Run a quick evaluation of a prompt against a single test input.
Arguments: prompt, test_input, provider
Generates a message that instructs the host to call eval_run and interpret results — a one-shot workflow.
generate-tests
Generate a synthetic test dataset for a prompt.
Arguments: prompt_description, variable_names, count
Generates a message that calls eval_suite generate_dataset with guidance to include happy-path, edge, and adversarial cases.
Assertions Reference
Deterministic (free, instant, no API key required)
Type | Checks | value |
contains | Substring is present | Substring to find |
not-contains | Substring is absent | Substring to reject |
| Exact match (trimmed) | Expected string |
regex | Regex pattern matches | Regex pattern |
| Output begins with prefix | Prefix string |
is-json | Output is valid JSON | Not used |
length-max | Output length <= N chars | Max length as string, e.g. |
length-min | Output length >= N chars | Min length as string, e.g. |
LLM-Graded (requires grader_provider)
Type | Checks | Uses |
llm-rubric | Free-form criteria | |
factuality | Output matches expected facts | |
| Output answers the query | |
similarity | Semantic similarity to expected | |
All assertions support a weight field (default: 1). The aggregate score is sum(weight * score) / sum(weight).
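A minimal sketch of that aggregate, with weight defaulting to 1 as stated. The names AssertionResult and aggregateScore are hypothetical, and returning 0 when the total weight is zero is an assumption made here to avoid dividing by zero.

```typescript
interface AssertionResult {
  score: number;
  weight?: number; // defaults to 1 when omitted
}

// sum(weight * score) / sum(weight)
function aggregateScore(results: AssertionResult[]): number {
  let weighted = 0;
  let totalWeight = 0;
  for (const result of results) {
    const weight = result.weight ?? 1;
    weighted += weight * result.score;
    totalWeight += weight;
  }
  // Assumption: an empty or zero-weight list scores 0.
  return totalWeight === 0 ? 0 : weighted / totalWeight;
}
```

For example, a failed rubric with weight 2, one failed check, and one passed check give (2*0 + 1*0 + 1*1) / (2 + 1 + 1) = 0.25.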
LLM-graded assertions use generateObject with a Zod schema for guaranteed structured { pass, score, reason } responses — no fragile regex or JSON.parse on raw LLM text. Each assertion type has a distinct grading prompt optimized for that evaluation style.
Provider Configuration
Provider strings follow the format "provider:model":
Provider string | Required env var | Example |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
The local provider uses an OpenAI-compatible endpoint. LOCAL_LLM_URL defaults to http://localhost:1234/v1 (LM Studio default).
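Parsing a provider string could be sketched as below. It assumes the split happens on the first colon, so model names that themselves contain colons stay intact; the function name parseProvider and the error wording are hypothetical.

```typescript
interface ProviderSpec {
  provider: string;
  model: string;
}

// Split "provider:model" on the first colon only.
function parseProvider(spec: string): ProviderSpec {
  const index = spec.indexOf(":");
  if (index <= 0 || index === spec.length - 1) {
    throw new Error(`Invalid provider string '${spec}'. Expected "provider:model".`);
  }
  return { provider: spec.slice(0, index), model: spec.slice(index + 1) };
}
```

The parsed provider name is what a resolver would use to look up the required environment variable before making any call.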
Missing API keys produce clear, actionable errors:
Provider 'anthropic:claude-sonnet-4-20250514' requires ANTHROPIC_API_KEY.
Set it in your MCP server config env block or shell environment.
No provider has a default for LLM-graded assertions or dataset generation. You must specify grader_provider and provider explicitly, so there is no silent API spending.
Provider resolution is powered by Vercel AI SDK — a thin abstraction layer over each provider's SDK. It handles retries, types, and multi-provider support in ~5 lines per provider. No orchestration frameworks, no hidden magic. If needed, each provider can be swapped to a raw SDK call without changing the tool interface.
Walkthrough: End-to-End Prompt Engineering
This walkthrough demonstrates the full iterative loop — from writing a prompt to shipping a validated version. Everything happens inside your MCP host (Claude Code, Cursor, Claude Desktop).
Step 1: Save Your Prompt
"Save this as 'meeting-action-items':
Extract action items from this meeting transcript. Return a JSON array
where each item has: owner, task, due_date (or null if not mentioned).
{{transcript}}"
Claude calls eval_suite with save_prompt. Variables auto-extracted: ["transcript"].
Step 2: Generate Test Cases
"Generate 5 test cases for meeting-action-items using openai:gpt-5-mini"
Claude calls eval_suite with generate_dataset. The LLM creates diverse cases:
Simple 1-on-1 meeting with clear action items
Multi-person standup with overlapping responsibilities
Vague meeting with no concrete actions
Conflicting assignments (same task, two owners)
Long rambling transcript with buried action items
Step 3: Run the First Eval
"Run meeting-action-items against the generated dataset using openai:gpt-5-mini and google:gemini-3.1-flash-lite-preview.
Assert: valid JSON, contains 'owner', and use llm-rubric 'All action items are accurately extracted with correct owners'.
Use openai:gpt-5-mini as the grader."
Result: GPT-5-mini: 80% pass rate. Gemini: 60%. Top failures:
Case 3 (vague meeting): Gemini hallucinated action items that weren't in the transcript
Case 4 (conflicting assignments): Both models assigned the task to only one person
Step 4: Fix the Prompt and Re-Run
"Update meeting-action-items to handle conflicts by noting both owners, and add an instruction
to output an empty array when no clear actions exist. Then re-run the same eval."
Claude calls eval_suite save_prompt with the checksum from v1, then eval_run with identical parameters.
Step 5: Compare Runs
"Compare the two runs"
Claude calls eval_analyze compare_runs:
GPT-5-mini: +0.15 (0.80 -> 0.95)
Gemini: +0.35 (0.60 -> 0.95)
Regressions: none
Both providers now pass at 95%. The prompt fix for conflict handling helped Gemini the most. No regressions. Ship it.
Step 6: Regression Check Later
A week later, you tweak the prompt for a new edge case.
"List recent runs for meeting-action-items, then run the same tests and compare against the last run"
Claude calls eval_analyze list_runs, then eval_run, then eval_analyze compare_runs. Instant confidence check before deploying the change.
Use Cases
Composability Matrix
What you want | Tools used | Setup needed |
"Check this output" | | None |
"Test my prompt" | | 30 seconds |
"Compare v1 vs v2" | | Already have v1 |
"Did I break anything?" | | Already have history |
"Which model is best?" | | Just a prompt + cases |
"Generate test data" | | Just a description |
"Track quality over time" | | Already have runs |
Specific Scenarios
Prompt iteration for classification tasks. Save a classifier prompt, generate synthetic edge cases, run against 2-3 providers, read failures, fix the prompt, re-run, compare. The full loop in one conversation.
Model selection for production. Same prompt, same test suite, 3 providers. One eval_run call gives pass rates, scores, latency, and token cost per provider. Data-driven model choice instead of vibes.
Pre-deploy regression testing. Changed a prompt? Run the existing test suite, compare against the last known-good run. Zero regressions = safe to deploy.
Output validation in agent pipelines. Use eval_assert as a quality gate — check that an agent's output is valid JSON, contains required fields, passes a rubric. No eval infrastructure needed.
Synthetic test generation. Describe what your prompt does, get diverse test cases including adversarial inputs. Review and curate before running evals.
Database
Created lazily on first tool call (STDIO servers must start instantly — no slow init).
Default location: ~/.mcp-prompt-lab/prompt-lab.db
Override with PROMPT_LAB_DB:
"env": { "PROMPT_LAB_DB": "/custom/path/prompt-lab.db" }
Reset: rm -rf ~/.mcp-prompt-lab
Backup: cp ~/.mcp-prompt-lab/prompt-lab.db ~/backups/prompt-lab-$(date +%Y%m%d).db
Schema: 4 tables — prompts, datasets, runs, results. Deleting a run cascades to its results.
Pragmas:
journal_mode=WAL: safe concurrent reads from multiple processes
foreign_keys=ON: enforced referential integrity
user_version: schema migration tracking (version 0 = run full schema, set to 1)
Engineering Decisions
Choices made deliberately, not by default.
Decision | Resolution | Why |
4 tools, not 15 | Group by user intent, not CRUD operation | Agents select tools better with fewer, well-scoped options. 4 is well under the 10-15 tool limit. |
STDIO-only, no HTTP | Local-first: keys in env, data in SQLite | Eval data is sensitive (prompts, test cases, model outputs). Remote HTTP would need auth, sessions, key management. |
Vercel AI SDK | Thin provider abstraction, not an orchestration framework | Unified |
Explicit grader_provider | No default grader, no silent API calls | LLM-graded assertions cost tokens. The user must opt in per-call. |
Checksum on mutations | SHA-256 of prompt content | Prevents overwriting prompts changed in another session. Error includes current checksum for easy recovery. |
Token-aware summaries | Summary + top 3 failures, details via pagination | 150 results in a single response would blow context. Summary-first keeps results usable in any host. |
Weighted scoring | | A |
Global DB, not per-project | | Eval data is cross-project by nature. You're comparing prompts, not building per-repo config. |
Lazy DB init | Created on first tool call, not server start | STDIO servers must respond to |
Zod v4 for grader schemas | | Guaranteed structured grader responses. No regex parsing of raw LLM text. |
Static hints, not generated | Hardcoded per action and outcome | Predictable, fast, zero tokens. Every response guides the next logical action. |
Bun-native SQLite | | One less dependency. Bun's SQLite is fast, synchronous, and supports WAL natively. |
License
MIT