
mcp-prompt-lab

A local MCP server that brings prompt evaluation directly into your AI coding environment. Define test cases, run prompts against multiple LLM providers, score outputs with deterministic and LLM-graded assertions, and track quality over time — through 4 consolidated tools, 3 resources, and 2 prompt templates.

The only MCP server that exposes general-purpose prompt evaluation as MCP tools. Every other eval tool in the ecosystem tests MCP servers from the outside. This one brings eval capabilities into the host — so you iterate on prompts without leaving your editor.




Design Philosophy

Most MCP servers on GitHub are thin API wrappers: one endpoint becomes one tool, names are generic, there is no error recovery, and the README says "install and run." This project takes the opposite approach, applying production-grade patterns to a domain that matters — evaluation is the scarcest skill in AI engineering right now, and this server makes it accessible from any MCP-compatible host.

Tool Consolidation (4 tools, not 15+)

The official MCP filesystem server exposes 13 tools. Agents work best with 10-15 tools maximum — after that, tool selection accuracy degrades. This server consolidates all evaluation operations into 4 tools grouped by user intent, not by CRUD operation:

| Tool | Intent | Actions |
| --- | --- | --- |
| eval_assert | "Check this output right now" | Single-purpose, no actions |
| eval_suite | "Set up an evaluation" | 9 actions (CRUD for prompts + datasets + generation) |
| eval_run | "Run an evaluation" | Single-purpose, matrix execution |
| eval_analyze | "Look at results" | 5 actions (get, compare, list, delete, trends) |

Each tool uses action dispatch internally. The agent sees 4 clean entry points instead of 15+ individual tools competing for selection.
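As a rough illustration of action dispatch, the sketch below models one consolidated tool as a discriminated union keyed on the action field. The action names come from the eval_analyze reference in this README; the type name, the function name, and the placeholder return strings are hypothetical and are not taken from the project's source.

```typescript
// Hypothetical sketch: one tool entry point, many actions.
// Action names mirror eval_analyze; handler bodies are placeholders.
type AnalyzeInput =
  | { action: "get_run"; run_id: number }
  | { action: "compare_runs"; run_ids: number[] }
  | { action: "list_runs" }
  | { action: "delete_run"; run_id: number }
  | { action: "trends"; prompt_name?: string };

function dispatchAnalyze(input: AnalyzeInput): string {
  // The switch is exhaustive, so the compiler flags any action
  // that is added to the union but not handled here.
  switch (input.action) {
    case "get_run":
      return `get_run:${input.run_id}`;
    case "compare_runs":
      return `compare_runs:${input.run_ids.join(",")}`;
    case "list_runs":
      return "list_runs";
    case "delete_run":
      return `delete_run:${input.run_id}`;
    case "trends":
      return `trends:${input.prompt_name ?? "*"}`;
  }
}
```

The agent still sees a single tool; the union keeps each action's required fields explicit at the type level.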

Dynamic Hints

Every response includes hints — not just on errors, but on success too. Five rules:

  1. Errors say what happened AND what to do next. "Prompt 'x' exists (v2). To update, include checksum: 'abc123...'" — the fix is in the error message.

  2. Resource status requiring special settings gets communicated. LLM-graded assertions without grader_provider → hint tells you exactly which param to add.

  3. Success responses suggest the logical follow-up. Save a prompt → "Use eval_run with prompt_name 'x' to run evaluations."

  4. Wrong values suggest available options. Invalid provider string → lists all supported providers.

  5. Auto-corrections get reported. Variables auto-extracted from {{var}} patterns → response confirms what was detected.

Token-Aware Responses

An eval with 50 cases x 3 providers = 150 results. Dumping all into context would blow the token budget. Response strategy:

  • eval_run returns a summary (pass rates, avg scores, latency, token usage) + top 3 failures with reasons

  • Full per-case breakdowns live behind eval_analyze get_run with limit/offset pagination

  • Long outputs are truncated to a token budget before embedding in failure reports

This makes eval results usable even in contexts with aggressive token limits.
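The summary-first strategy can be sketched as follows. This is a minimal illustration, not the project's tokens.ts: the 4-characters-per-token estimate, the function names, and the CaseResult shape are all assumptions made for the example.

```typescript
// Assumption: roughly 4 characters per token. The real estimator may differ.
const CHARS_PER_TOKEN = 4;

function estimateTokens(text: string): number {
  return Math.ceil(text.length / CHARS_PER_TOKEN);
}

// Cut text down to an approximate token budget, marking the cut with "...".
function truncateToBudget(text: string, maxTokens: number): string {
  if (estimateTokens(text) <= maxTokens) return text;
  return text.slice(0, maxTokens * CHARS_PER_TOKEN) + "...";
}

// Hypothetical per-case result shape for illustration.
interface CaseResult {
  caseIndex: number;
  pass: boolean;
  output: string;
}

// Return aggregate numbers plus at most `maxFailures` truncated failures.
function summarize(results: CaseResult[], maxFailures = 3, outputBudget = 50) {
  const failures = results.filter((r) => !r.pass);
  const passRate =
    results.length === 0 ? 0 : (results.length - failures.length) / results.length;
  return {
    totalCases: results.length,
    passRate,
    topFailures: failures.slice(0, maxFailures).map((f) => ({
      caseIndex: f.caseIndex,
      output: truncateToBudget(f.output, outputBudget),
    })),
  };
}
```

The response size then depends on the failure cap and the output budget rather than on the number of cases in the run.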

Checksum Pattern for Safe Mutations

Updating a saved prompt requires the current checksum (SHA-256 of the content). This prevents overwriting a prompt that was changed in another session or by another tool call — a real concern when multiple agents share the same MCP server. The error response always includes the current checksum, so recovery is one copy-paste away.
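A minimal sketch of the checksum guard, under stated assumptions: it hashes with node:crypto (which Bun also supports) instead of the Bun.CryptoHasher the project lists, and it uses an in-memory Map in place of SQLite. The savePrompt signature and the SaveResult type are hypothetical.

```typescript
import { createHash } from "node:crypto";

// SHA-256 of the prompt content, hex-encoded.
function sha256(content: string): string {
  return createHash("sha256").update(content, "utf8").digest("hex");
}

interface StoredPrompt {
  content: string;
  version: number;
  checksum: string;
}

type SaveResult =
  | { ok: true; action: "created" | "updated"; version: number; checksum: string }
  | { ok: false; error: string };

// Hypothetical upsert: creating is free, updating requires the current checksum.
function savePrompt(
  store: Map<string, StoredPrompt>,
  name: string,
  content: string,
  checksum?: string,
): SaveResult {
  const existing = store.get(name);
  if (existing === undefined) {
    const created = { content, version: 1, checksum: sha256(content) };
    store.set(name, created);
    return { ok: true, action: "created", version: 1, checksum: created.checksum };
  }
  if (checksum !== existing.checksum) {
    // The error carries the current checksum so the caller can retry directly.
    return {
      ok: false,
      error: `Prompt '${name}' exists (v${existing.version}). To update, include checksum: '${existing.checksum}'`,
    };
  }
  const updated = { content, version: existing.version + 1, checksum: sha256(content) };
  store.set(name, updated);
  return { ok: true, action: "updated", version: updated.version, checksum: updated.checksum };
}
```

A caller holding a stale checksum gets the rejection and the current checksum in one response, which is the recovery path described above.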


Architecture Overview

src/
  index.ts              STDIO transport entry point
  server.ts             MCP server setup: tools, resources, prompts
  tools/
    eval-assert.ts      Standalone assertion runner
    eval-suite.ts       Prompt & dataset CRUD + synthetic generation
    eval-run.ts         Evaluation execution orchestrator
    eval-analyze.ts     Run analysis, comparison, trends
  engine/
    assertions.ts       8 deterministic + 4 LLM-graded assertion types
    providers.ts        Unified provider resolution (6 providers + local)
    runner.ts           Concurrency-controlled matrix execution
  db/
    database.ts         SQLite (bun:sqlite) with WAL, migrations, lazy init
  utils/
    checksum.ts         SHA-256 via Bun.CryptoHasher
    hints.ts            Consistent response envelope formatting
    tokens.ts           Token estimation + truncation for context budget

Key architectural choices:

  • Bun-native SQLite (bun:sqlite) — zero dependencies for persistence, WAL mode for safe concurrent reads

  • Vercel AI SDK — thin provider abstraction with unified generateText/generateObject across all providers. No orchestration frameworks (no LangChain, no CrewAI)

  • Zod v4 — shared schema validation between MCP tool inputs and LLM grader structured outputs

  • STDIO-only transport — API keys stay in your local environment, eval history stays in local SQLite. Zero infrastructure to deploy.


Quick Start

Prerequisites: Bun installed.

git clone https://github.com/nicholasbarwicki/mcp-prompt-lab
cd mcp-prompt-lab
bun install

Verify the server starts:

bun run src/index.ts

To inspect with MCP Inspector: use STDIO transport, command bun, args ["run", "/absolute/path/to/src/index.ts"].


Configuration

Claude Code

Create .mcp.json in the project root (or add to ~/.claude.json globally):

{
  "mcpServers": {
    "prompt-lab": {
      "command": "bun",
      "args": ["run", "/absolute/path/to/mcp-prompt-lab/src/index.ts"],
      "env": {
        "OPENAI_API_KEY": "${OPENAI_API_KEY}",
        "GOOGLE_GENERATIVE_AI_API_KEY": "${GOOGLE_GENERATIVE_AI_API_KEY}"
      }
    }
  }
}

Claude Desktop

Add to ~/Library/Application Support/Claude/claude_desktop_config.json (macOS):

{
  "mcpServers": {
    "prompt-lab": {
      "command": "bun",
      "args": ["run", "/absolute/path/to/mcp-prompt-lab/src/index.ts"],
      "env": {
        "OPENAI_API_KEY": "sk-...",
        "GOOGLE_GENERATIVE_AI_API_KEY": "AIza..."
      }
    }
  }
}

Cursor / Windsurf / Any MCP Host

Same pattern — STDIO transport, bun run command, env vars for the providers you use.

Only include API keys for providers you intend to use. Deterministic assertions require no keys at all.

Bun auto-loads .env from the project root, so a line like OPENAI_API_KEY=sk-... in that file works too, as does export OPENAI_API_KEY=sk-... in your shell.


Tools Reference

All tools return a consistent JSON envelope:

{
  "status": "success",
  "data": { "..." },
  "hints": ["What to do next..."]
}

Errors use "status": "error" with MCP's isError: true flag.
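One way to picture the envelope is as a small pair of helpers that wrap every tool result. This is a sketch, not the project's hints.ts: the helper names are hypothetical, and the message field on the error variant is an assumption, since the README only shows the success shape. The content array with a text item and the isError flag follow the MCP tool result format.

```typescript
// Success shape from the example above; the error variant's `message` is assumed.
type Envelope<T> =
  | { status: "success"; data: T; hints: string[] }
  | { status: "error"; message: string; hints: string[] };

// Minimal MCP tool result: text content plus the isError flag.
interface ToolResult {
  content: { type: "text"; text: string }[];
  isError?: boolean;
}

function success<T>(data: T, hints: string[] = []): Envelope<T> {
  return { status: "success", data, hints };
}

function failure(message: string, hints: string[] = []): Envelope<never> {
  return { status: "error", message, hints };
}

// Serialize the envelope as JSON text and mirror its status in isError.
function toToolResult<T>(envelope: Envelope<T>): ToolResult {
  return {
    content: [{ type: "text", text: JSON.stringify(envelope) }],
    isError: envelope.status === "error",
  };
}
```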


eval_assert — Standalone Assertion Runner

Run assertions against any text output. No saved prompts, no datasets, no API keys for deterministic checks. This is the tool you'll use daily — the zero-setup entry point.

Input schema:

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| output | string | yes | The text to evaluate |
| assertions | Assertion[] | yes | List of assertion objects |
| expected | string | no | Reference text for factuality/similarity |
| input | string | no | Original query for relevance assertions |
| grader_provider | string | no | Provider for LLM-graded assertions, e.g. "openai:gpt-5-mini" |

Example: Validate a Classifier Output (Deterministic Only)

No API key needed. Instant results.

{
  "output": "{\"category\": \"billing\", \"confidence\": 0.87, \"priority\": \"high\"}",
  "assertions": [
    { "type": "is-json" },
    { "type": "contains", "value": "category" },
    { "type": "regex", "value": "\"confidence\":\\s*0\\.\\d+" },
    { "type": "length-max", "value": "200" }
  ]
}

Response:

{
  "status": "success",
  "data": {
    "pass": true,
    "score": 1.0,
    "results": [
      { "type": "is-json", "pass": true, "score": 1, "reason": "Valid JSON", "weight": 1 },
      {
        "type": "contains",
        "value": "category",
        "pass": true,
        "score": 1,
        "reason": "Output contains \"category\"",
        "weight": 1
      },
      {
        "type": "regex",
        "value": "\"confidence\":\\s*0\\.\\d+",
        "pass": true,
        "score": 1,
        "reason": "Output matches regex",
        "weight": 1
      },
      {
        "type": "length-max",
        "value": "200",
        "pass": true,
        "score": 1,
        "reason": "Output length 63 is within max 200",
        "weight": 1
      }
    ]
  },
  "hints": ["All assertions passed. Score: 1.00."]
}

Example: LLM-Graded Quality Check

Uses a grader model to evaluate subjective criteria.

{
  "output": "Hey! So like, your account is kinda messed up. Gonna fix it tho, no worries lol",
  "assertions": [
    { "type": "llm-rubric", "value": "Response is professional and concise", "weight": 2 },
    { "type": "not-contains", "value": "lol" },
    { "type": "length-max", "value": "500" }
  ],
  "grader_provider": "openai:gpt-5-mini"
}

The grader uses generateObject with a Zod schema — guaranteed structured { pass, score, reason } response, no fragile regex parsing.

Example: Factuality Check Against Reference

{
  "output": "The Eiffel Tower is 330 meters tall and was completed in 1889.",
  "expected": "The Eiffel Tower is 330 meters tall, completed in 1889 for the World's Fair.",
  "assertions": [{ "type": "factuality" }, { "type": "contains", "value": "1889" }],
  "grader_provider": "openai:gpt-5-mini"
}

eval_suite — Prompt & Dataset Manager

CRUD for prompts and test datasets, plus LLM-powered synthetic test generation. All actions dispatched via the action field.

Input schema:

| Field | Type | Required for | Description |
| --- | --- | --- | --- |
| action | enum | always | One of the 9 actions below |
| name | string | most actions | Prompt or dataset name |
| content | string | save_prompt | Prompt template content |
| variables | string[] | no | Variable names (auto-extracted from {{var}} if omitted) |
| tags | string[] | no | Tags for categorization and filtering |
| checksum | string | updating prompt | Current checksum (required to update existing prompt) |
| cases | Case[] | save_dataset | Array of { vars, expected?, description? } |
| prompt_description | string | generate_dataset | What the prompt does |
| count | number | no | Cases to generate (default: 5) |
| provider | string | generate_dataset | Provider for generation (required explicitly, no default) |
| limit / offset | number | no | Pagination for list actions |

Actions:

| Action | Required fields | Description |
| --- | --- | --- |
| save_prompt | name, content | Upsert prompt. Auto-extracts {{var}} variables. Update requires matching checksum. |
| get_prompt | name | Retrieve full prompt content and metadata |
| list_prompts | -- | Paginated summary list |
| delete_prompt | name | Remove prompt |
| save_dataset | name, cases | Upsert dataset |
| get_dataset | name | Retrieve dataset with all cases |
| list_datasets | -- | Paginated summary list |
| delete_dataset | name | Remove dataset |
| generate_dataset | prompt_description, provider | Generate synthetic test cases via LLM |

Example: Save a Prompt (Variables Auto-Extracted)

{
  "action": "save_prompt",
  "name": "ticket-classifier",
  "content": "You are a support ticket classifier.\n\nGiven a customer message, output JSON:\n- category: billing, technical, general\n- priority: low, medium, high\n- confidence: 0.0-1.0\n\nCustomer message: {{message}}",
  "tags": ["classification", "support"]
}

The {{message}} variable is auto-extracted. Response includes checksum for future updates:

{
  "status": "success",
  "data": {
    "action": "created",
    "name": "ticket-classifier",
    "variables": ["message"],
    "version": 1,
    "checksum": "a1b2c3..."
  },
  "hints": ["Use eval_run with prompt_name 'ticket-classifier' to run evaluations."]
}
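
Variable auto-extraction can be approximated with a single regular expression over the template. The sketch below is an assumption about how it might work, not the project's code: the accepted identifier syntax, the tolerance for whitespace inside the braces, and both function names are hypothetical.

```typescript
// Collect unique {{var}} names in order of first appearance.
// Assumption: names are identifier-like and may be padded with spaces.
function extractVariables(template: string): string[] {
  const pattern = /\{\{\s*([A-Za-z_]\w*)\s*\}\}/g;
  const names: string[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(template)) !== null) {
    if (names.indexOf(match[1]) === -1) names.push(match[1]);
  }
  return names;
}

// Substitute known variables; unknown placeholders are left untouched.
function renderTemplate(template: string, vars: Record<string, string>): string {
  return template.replace(
    /\{\{\s*([A-Za-z_]\w*)\s*\}\}/g,
    (placeholder: string, name: string) =>
      Object.prototype.hasOwnProperty.call(vars, name) ? vars[name] : placeholder,
  );
}
```

Rendering with a dataset case's vars object produces the concrete prompt that would be sent to a provider.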

Example: Update a Prompt (Checksum Required)

{
  "action": "save_prompt",
  "name": "ticket-classifier",
  "content": "You are a support ticket classifier. Be strict about confidence — only output >0.8 when you're sure.\n\n{{message}}",
  "checksum": "a1b2c3..."
}

If you omit the checksum, the error tells you exactly what to provide:

"Prompt 'ticket-classifier' exists (v1). To update, include checksum: 'a1b2c3...'"

Example: Generate a Synthetic Test Dataset

Have an LLM create diverse test cases from a description. Includes happy path, edge cases, and adversarial inputs automatically.

{
  "action": "generate_dataset",
  "prompt_description": "Classifies customer support tickets into category (billing/technical/general), priority (low/medium/high), and confidence (0-1)",
  "count": 5,
  "provider": "openai:gpt-5-mini",
  "name": "ticket-tests-v1",
  "tags": ["synthetic", "classification"]
}

Generated cases are saved to the database. Review with get_dataset before running an eval.

Example: Save a Manual Dataset

{
  "action": "save_dataset",
  "name": "ticket-tests-curated",
  "cases": [
    {
      "vars": { "message": "I was charged twice for my subscription this month" },
      "expected": "{\"category\": \"billing\", \"priority\": \"high\"}",
      "description": "Clear billing issue"
    },
    {
      "vars": { "message": "hey" },
      "description": "Adversarial: vague single-word input"
    },
    {
      "vars": { "message": "The app crashes when I open settings on Android 14" },
      "expected": "{\"category\": \"technical\", \"priority\": \"medium\"}",
      "description": "Technical issue with platform detail"
    }
  ],
  "tags": ["curated", "classification"]
}

eval_run — Execute Evaluations

The core evaluation engine. Runs a prompt against one or more providers across test cases, scores each output with assertions, and stores everything in SQLite.

Execution model: prompt x providers x cases x assertions = scored result matrix, run with configurable concurrency.
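
The execution model can be pictured as a flat list of (provider, case) cells processed by a fixed number of workers. This is a generic sketch of that pattern and makes no claim about how runner.ts is written; buildMatrix, runWithConcurrency, and the Cell type are hypothetical names.

```typescript
// One cell per provider/case pair: 3 providers x 50 cases = 150 cells.
interface Cell {
  provider: string;
  caseIndex: number;
}

function buildMatrix(providers: string[], caseCount: number): Cell[] {
  const cells: Cell[] = [];
  for (const provider of providers) {
    for (let i = 0; i < caseCount; i++) cells.push({ provider, caseIndex: i });
  }
  return cells;
}

// Run `worker` over all items with at most `limit` calls in flight.
// Results keep the order of the input items.
async function runWithConcurrency<T, R>(
  items: T[],
  limit: number,
  worker: (item: T) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;
  async function lane(): Promise<void> {
    while (next < items.length) {
      const index = next++; // claimed synchronously, so lanes never overlap
      results[index] = await worker(items[index]);
    }
  }
  const laneCount = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: laneCount }, () => lane()));
  return results;
}
```

With the documented default concurrency of 3, this pattern would keep at most three provider requests open at any time.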

Input schema:

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| prompt_name | string | one of | Name of a saved prompt |
| prompt | string | one of | Inline prompt content |
| dataset_name | string | one of | Name of a saved dataset |
| cases | Case[] | one of | Inline test cases |
| providers | string[] | yes | Provider strings (default: ["openai:gpt-4o"]) |
| assertions | Assertion[] | no | If omitted, outputs are collected with score: 1.0 |
| tags | string[] | no | Tags for filtering runs later |
| temperature | number | no | Default: 0 (deterministic) |
| max_tokens | number | no | Default: 1024 |
| concurrency | number | no | Parallel requests (default: 3) |
| grader_provider | string | no | Required for LLM-graded assertions |

Providing both prompt_name and prompt is an error (ambiguous input). Same for dataset_name + cases.

Example: Compare Two Providers

{
  "prompt_name": "ticket-classifier",
  "dataset_name": "ticket-tests-v1",
  "providers": ["openai:gpt-5-mini", "google:gemini-3.1-flash-lite-preview"],
  "assertions": [
    { "type": "is-json" },
    { "type": "contains", "value": "category" },
    { "type": "contains", "value": "priority" },
    {
      "type": "llm-rubric",
      "value": "Classification is reasonable for the given customer message",
      "weight": 2
    }
  ],
  "grader_provider": "openai:gpt-5-mini",
  "tags": ["v1", "model-comparison"]
}

Response (token-aware summary):

{
  "status": "success",
  "data": {
    "runId": 1,
    "providers": ["openai:gpt-5-mini", "google:gemini-3.1-flash-lite-preview"],
    "totalCases": 5,
    "passRate": { "openai:gpt-5-mini": 0.8, "google:gemini-3.1-flash-lite-preview": 0.6 },
    "avgScore": { "openai:gpt-5-mini": 0.85, "google:gemini-3.1-flash-lite-preview": 0.72 },
    "avgLatencyMs": { "openai:gpt-5-mini": 650, "google:gemini-3.1-flash-lite-preview": 420 },
    "totalTokens": { "openai:gpt-5-mini": 3200, "google:gemini-3.1-flash-lite-preview": 2800 },
    "topFailures": [
      {
        "caseIndex": 1,
        "description": "Adversarial: vague single-word input",
        "provider": "google:gemini-3.1-flash-lite-preview",
        "output": "I'd be happy to help! Could you...",
        "failedAssertions": ["is-json: Invalid JSON"]
      }
    ]
  },
  "hints": [
    "Run complete. Use eval_analyze with action: 'get_run', run_id: 1 for full per-case breakdown.",
    "Top providers by pass rate: openai:gpt-5-mini: 80%, google:gemini-3.1-flash-lite-preview: 60%"
  ]
}

Notice: only the summary and top 3 failures — not all 10 individual results. Use eval_analyze for the full breakdown.

Example: Quick Inline Eval (No Saved Data)

No need to save anything — pass prompt and cases directly:

{
  "prompt": "Translate the following English text to French:\n\n{{text}}",
  "cases": [
    { "vars": { "text": "Hello, how are you?" }, "expected": "Bonjour, comment allez-vous ?" },
    {
      "vars": { "text": "The weather is nice today" },
      "expected": "Le temps est beau aujourd'hui"
    },
    { "vars": { "text": "" }, "description": "Edge case: empty input" }
  ],
  "providers": ["openai:gpt-5-mini"],
  "assertions": [
    { "type": "length-min", "value": "1" },
    { "type": "similarity", "threshold": 0.8 }
  ],
  "grader_provider": "openai:gpt-5-mini"
}

eval_analyze — Inspect & Compare Results

Browse, compare, and manage stored evaluation runs. This is where you analyze failures, detect regressions, and track quality trends.

Input schema:

| Field | Type | Required for | Description |
| --- | --- | --- | --- |
| action | enum | always | One of the 5 actions below |
| run_id | number | get_run, delete_run | Run ID |
| run_ids | number[] | compare_runs | At least 2 run IDs |
| prompt_name | string | no | Filter by prompt name |
| tag | string | no | Filter by tag |
| limit / offset | number | no | Pagination (default: 20) |
| provider | string | no | Filter results by provider |
| only_failed | boolean | no | Return only failed cases |

Actions:

| Action | Description |
| --- | --- |
| get_run | Full run details with paginated per-case results. Supports provider and only_failed filters. |
| compare_runs | Side-by-side score comparison between 2+ runs. Computes deltas, detects regressions, counts improvements. |
| list_runs | Recent runs, filterable by prompt_name and tag. |
| delete_run | Remove run and all its results (CASCADE). |
| trends | Raw {run_id, date, providers, avg_score, pass_rate} list ordered by date. |
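
The delta computation behind compare_runs can be sketched for the two-run case as below. This is an illustrative assumption rather than the project's implementation: the function name, the rule that any drop counts as a regression, and the small epsilon for float noise are all hypothetical. The output format follows the scoreChanges strings shown in the compare_runs example in this README.

```typescript
type ProviderScores = Record<string, number>;

// Compare average scores per provider between an earlier and a later run.
function compareScores(before: ProviderScores, after: ProviderScores) {
  const EPSILON = 1e-9; // assumption: ignore floating-point noise
  const scoreChanges: Record<string, string> = {};
  const regressions: string[] = [];
  let improvements = 0;

  for (const provider of Object.keys(after)) {
    // Only providers present in both runs can be compared.
    if (!Object.prototype.hasOwnProperty.call(before, provider)) continue;
    const delta = after[provider] - before[provider];
    const sign = delta >= 0 ? "+" : "-";
    scoreChanges[provider] =
      `${sign}${Math.abs(delta).toFixed(2)} ` +
      `(${before[provider].toFixed(2)} -> ${after[provider].toFixed(2)})`;
    if (delta > EPSILON) improvements++;
    else if (delta < -EPSILON) regressions.push(provider);
  }
  return { scoreChanges, regressions, improvements };
}
```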

Example: Deep Dive Into Failures

{
  "action": "get_run",
  "run_id": 1,
  "only_failed": true,
  "provider": "google:gemini-3.1-flash-lite-preview"
}

Returns only the cases that failed for Gemini, with full assertion results and the actual model output.

Example: Compare Runs After a Prompt Change

You fixed the prompt and re-ran. Now compare:

{
  "action": "compare_runs",
  "run_ids": [1, 2]
}

Response:

{
  "status": "success",
  "data": {
    "runs": ["run_1", "run_2"],
    "scoreChanges": {
      "openai:gpt-5-mini": "+0.10 (0.85 -> 0.95)",
      "google:gemini-3.1-flash-lite-preview": "+0.23 (0.72 -> 0.95)"
    },
    "regressions": [],
    "improvements": 2
  },
  "hints": ["2 provider(s) improved. No regressions detected."]
}

Example: Track Score Trends for a Prompt

{
  "action": "trends",
  "prompt_name": "ticket-classifier"
}

Returns chronological data for charting score progression across runs.


Resources

Three MCP resources are exposed for supporting clients to browse server state.

| URI | Type | Contents |
| --- | --- | --- |
| prompt-lab://prompts | Static | JSON array of prompt summaries: name, version, variable count, tags |
| prompt-lab://datasets | Static | JSON array of dataset summaries: name, case count, tags |
| prompt-lab://runs/{runId} | Dynamic template | Full run metadata and summary for a specific run ID |

The runs resource supports URI completion — clients can tab-complete run IDs from the 50 most recent runs.

Resources return summaries only. Full content lives behind the tools (eval_suite get_prompt, eval_suite get_dataset, eval_analyze get_run). This separation keeps resource reads lightweight while full data is available on demand.


Prompt Templates

Two MCP prompt templates are registered for quick invocation from supporting clients.

quick-eval

Run a quick evaluation of a prompt against a single test input.

Arguments: prompt, test_input, provider

Generates a message that instructs the host to call eval_run and interpret results — a one-shot workflow.

generate-tests

Generate a synthetic test dataset for a prompt.

Arguments: prompt_description, variable_names, count

Generates a message that calls eval_suite generate_dataset with guidance to include happy-path, edge, and adversarial cases.


Assertions Reference

Deterministic (free, instant, no API key required)

| Type | Checks | value field |
| --- | --- | --- |
| contains | Substring is present | Substring to find |
| not-contains | Substring is absent | Substring to reject |
| equals | Exact match (trimmed) | Expected string |
| regex | Regex pattern matches | Regex pattern |
| starts-with | Output begins with prefix | Prefix string |
| is-json | Output is valid JSON | Not used |
| length-max | Output length <= N chars | Max length as string, e.g. "500" |
| length-min | Output length >= N chars | Min length as string, e.g. "10" |
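
The eight deterministic checks are simple enough to sketch in one function. The version below is an approximation for illustration only: the function and type names are hypothetical, and details such as trimming both sides for equals and measuring length in UTF-16 code units are assumptions, not confirmed behavior of assertions.ts.

```typescript
type DeterministicAssertion =
  | { type: "is-json" }
  | {
      type:
        | "contains"
        | "not-contains"
        | "equals"
        | "regex"
        | "starts-with"
        | "length-max"
        | "length-min";
      value: string;
    };

// Returns true when the output satisfies the assertion.
function passes(output: string, assertion: DeterministicAssertion): boolean {
  switch (assertion.type) {
    case "is-json":
      try {
        JSON.parse(output);
        return true;
      } catch {
        return false;
      }
    case "contains":
      return output.includes(assertion.value);
    case "not-contains":
      return !output.includes(assertion.value);
    case "equals":
      return output.trim() === assertion.value.trim(); // assumption: trim both sides
    case "regex":
      return new RegExp(assertion.value).test(output);
    case "starts-with":
      return output.startsWith(assertion.value);
    case "length-max":
      return output.length <= Number(assertion.value);
    case "length-min":
      return output.length >= Number(assertion.value);
  }
}
```

Deterministic checks score 1 on pass and 0 on fail, as in the eval_assert response example, so a boolean result is all the scoring step needs.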

LLM-Graded (requires grader_provider)

| Type | Checks | Uses |
| --- | --- | --- |
| llm-rubric | Free-form criteria | value = rubric text |
| factuality | Output matches expected facts | expected field as reference |
| relevance | Output answers the query | input field as the query |
| similarity | Semantic similarity to expected | expected field + threshold |

All assertions support a weight field (default: 1). The aggregate score is sum(weight * score) / sum(weight).
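
The aggregate formula is a weighted mean, which the following sketch spells out. The function name and the ScoredAssertion shape are hypothetical; returning 1 for an empty list is an assumption chosen to match the eval_run note that runs without assertions are collected with score 1.0.

```typescript
interface ScoredAssertion {
  type: string;
  score: number; // 0..1
  weight?: number; // defaults to 1
}

// sum(weight * score) / sum(weight)
function aggregateScore(results: ScoredAssertion[]): number {
  let weightedSum = 0;
  let totalWeight = 0;
  for (const result of results) {
    const weight = result.weight ?? 1;
    weightedSum += weight * result.score;
    totalWeight += weight;
  }
  // Assumption: with no assertions (or zero total weight) the score is 1.
  return totalWeight === 0 ? 1 : weightedSum / totalWeight;
}
```

For instance, if an llm-rubric with weight 2 fails, a not-contains check fails, and a length-max check passes, the aggregate is (2*0 + 1*0 + 1*1) / 4 = 0.25, so the heavily weighted rubric dominates the result.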

LLM-graded assertions use generateObject with a Zod schema for guaranteed structured { pass, score, reason } responses — no fragile regex or JSON.parse on raw LLM text. Each assertion type has a distinct grading prompt optimized for that evaluation style.


Provider Configuration

Provider strings follow the format "provider:model":

| Provider string | Required env var | Example |
| --- | --- | --- |
| openai:* | OPENAI_API_KEY | openai:gpt-5-mini |
| anthropic:* | ANTHROPIC_API_KEY | anthropic:claude-sonnet-4-20250514 |
| google:* | GOOGLE_GENERATIVE_AI_API_KEY | google:gemini-3.1-flash-lite-preview |
| xai:* | XAI_API_KEY | xai:grok-3 |
| openrouter:* | OPENROUTER_API_KEY | openrouter:meta-llama/llama-4-scout |
| local:* | LOCAL_LLM_URL (optional) | local:my-model |

The local provider uses an OpenAI-compatible endpoint. LOCAL_LLM_URL defaults to http://localhost:1234/v1 (LM Studio default).
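
Parsing a provider string and checking its key could look like the sketch below. It is an illustration built only from the table above: the function names are hypothetical, the split on the first colon is an assumption (it keeps slashes in model names such as meta-llama/llama-4-scout intact), and the environment is passed in as a plain object so the example does not depend on any runtime global.

```typescript
// Env var per provider, from the table above. null means no key is required.
const REQUIRED_ENV: Record<string, string | null> = {
  openai: "OPENAI_API_KEY",
  anthropic: "ANTHROPIC_API_KEY",
  google: "GOOGLE_GENERATIVE_AI_API_KEY",
  xai: "XAI_API_KEY",
  openrouter: "OPENROUTER_API_KEY",
  local: null,
};

function parseProvider(spec: string): { provider: string; model: string; envVar: string | null } {
  const colon = spec.indexOf(":");
  if (colon <= 0 || colon === spec.length - 1) {
    throw new Error(`Invalid provider string '${spec}'. Expected "provider:model".`);
  }
  const provider = spec.slice(0, colon);
  const model = spec.slice(colon + 1);
  if (!Object.prototype.hasOwnProperty.call(REQUIRED_ENV, provider)) {
    const supported = Object.keys(REQUIRED_ENV).join(", ");
    throw new Error(`Unknown provider '${provider}'. Supported: ${supported}`);
  }
  return { provider, model, envVar: REQUIRED_ENV[provider] };
}

// Returns an actionable message if the key is missing, otherwise null.
function missingKeyMessage(spec: string, env: Record<string, string | undefined>): string | null {
  const { envVar } = parseProvider(spec);
  if (envVar === null || env[envVar]) return null;
  return `Provider '${spec}' requires ${envVar}.`;
}
```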

Missing API keys produce clear, actionable errors:

Provider 'anthropic:claude-sonnet-4-20250514' requires ANTHROPIC_API_KEY.
Set it in your MCP server config env block or shell environment.

No provider has a default for LLM-graded assertions or dataset generation. You must specify grader_provider and provider explicitly — no silent API spending.

Provider resolution is powered by Vercel AI SDK — a thin abstraction layer over each provider's SDK. It handles retries, types, and multi-provider support in ~5 lines per provider. No orchestration frameworks, no hidden magic. If needed, each provider can be swapped to a raw SDK call without changing the tool interface.


Walkthrough: End-to-End Prompt Engineering

This walkthrough demonstrates the full iterative loop — from writing a prompt to shipping a validated version. Everything happens inside your MCP host (Claude Code, Cursor, Claude Desktop).

Step 1: Save Your Prompt

"Save this as 'meeting-action-items':

Extract action items from this meeting transcript. Return a JSON array
where each item has: owner, task, due_date (or null if not mentioned).

{{transcript}}"

Claude calls eval_suite with save_prompt. Variables auto-extracted: ["transcript"].

Step 2: Generate Test Cases

"Generate 5 test cases for meeting-action-items using openai:gpt-5-mini"

Claude calls eval_suite with generate_dataset. The LLM creates diverse cases:

  • Simple 1-on-1 meeting with clear action items

  • Multi-person standup with overlapping responsibilities

  • Vague meeting with no concrete actions

  • Conflicting assignments (same task, two owners)

  • Long rambling transcript with buried action items

Step 3: Run the First Eval

"Run meeting-action-items against the generated dataset using openai:gpt-5-mini and google:gemini-3.1-flash-lite-preview.
Assert: valid JSON, contains 'owner', and use llm-rubric 'All action items are accurately extracted with correct owners'.
Use openai:gpt-5-mini as the grader."

Result: GPT-5-mini: 80% pass rate. Gemini: 60%. Top failures:

  • Case 3 (vague meeting): Gemini hallucinated action items that weren't in the transcript

  • Case 4 (conflicting assignments): Both models assigned the task to only one person

Step 4: Fix the Prompt and Re-Run

"Update meeting-action-items to handle conflicts by noting both owners, and add an instruction
to output an empty array when no clear actions exist. Then re-run the same eval."

Claude calls eval_suite save_prompt with the checksum from v1, then eval_run with identical parameters.

Step 5: Compare Runs

"Compare the two runs"

Claude calls eval_analyze compare_runs:

GPT-5-mini: +0.15 (0.80 -> 0.95)
Gemini:     +0.35 (0.60 -> 0.95)
Regressions: none

Both providers now score 0.95. The two prompt fixes (conflict handling and the empty-array instruction) helped Gemini the most. No regressions. Ship it.

Step 6: Regression Check Later

A week later, you tweak the prompt for a new edge case.

"List recent runs for meeting-action-items, then run the same tests and compare against the last run"

Claude calls eval_analyze list_runs, then eval_run, then eval_analyze compare_runs. Instant confidence check before deploying the change.


Use Cases

Composability Matrix

| What you want | Tools used | Setup needed |
| --- | --- | --- |
| "Check this output" | eval_assert only | None |
| "Test my prompt" | eval_suite + eval_run | 30 seconds |
| "Compare v1 vs v2" | eval_run x2 + eval_analyze | Already have v1 |
| "Did I break anything?" | eval_analyze + eval_run + eval_analyze | Already have history |
| "Which model is best?" | eval_run (multiple providers) | Just a prompt + cases |
| "Generate test data" | eval_suite generate_dataset | Just a description |
| "Track quality over time" | eval_analyze trends | Already have runs |

Specific Scenarios

Prompt iteration for classification tasks. Save a classifier prompt, generate synthetic edge cases, run against 2-3 providers, read failures, fix the prompt, re-run, compare. The full loop in one conversation.

Model selection for production. Same prompt, same test suite, 3 providers. One eval_run call gives pass rates, scores, latency, and token cost per provider. Data-driven model choice instead of vibes.

Pre-deploy regression testing. Changed a prompt? Run the existing test suite, compare against the last known-good run. Zero regressions = safe to deploy.

Output validation in agent pipelines. Use eval_assert as a quality gate — check that an agent's output is valid JSON, contains required fields, passes a rubric. No eval infrastructure needed.

Synthetic test generation. Describe what your prompt does, get diverse test cases including adversarial inputs. Review and curate before running evals.


Database

Created lazily on first tool call (STDIO servers must start instantly — no slow init).

Default location: ~/.mcp-prompt-lab/prompt-lab.db

Override with PROMPT_LAB_DB:

"env": { "PROMPT_LAB_DB": "/custom/path/prompt-lab.db" }

Reset: rm -rf ~/.mcp-prompt-lab

Backup: cp ~/.mcp-prompt-lab/prompt-lab.db ~/backups/prompt-lab-$(date +%Y%m%d).db

Schema: 4 tables — prompts, datasets, runs, results. Deleting a run cascades to its results.

Pragmas:

  • journal_mode=WAL — safe concurrent reads from multiple processes

  • foreign_keys=ON — enforced referential integrity

  • user_version — schema migration tracking (version 0 = run full schema, set to 1)


Engineering Decisions

Choices made deliberately, not by default.

| Decision | Resolution | Why |
| --- | --- | --- |
| 4 tools, not 15 | Group by user intent, not CRUD operation | Agents select tools better with fewer, well-scoped options. 4 is well under the 10-15 tool limit. |
| STDIO-only, no HTTP | Local-first: keys in env, data in SQLite | Eval data is sensitive (prompts, test cases, model outputs). Remote HTTP would need auth, sessions, key management. |
| Vercel AI SDK | Thin provider abstraction, not an orchestration framework | Unified generateText/generateObject across 6 providers. Can migrate to raw SDKs later; the tool interface doesn't change. |
| Explicit grader_provider | No default grader, no silent API calls | LLM-graded assertions cost tokens. The user must opt in per-call. |
| Checksum on mutations | SHA-256 of prompt content | Prevents overwriting prompts changed in another session. Error includes current checksum for easy recovery. |
| Token-aware summaries | Summary + top 3 failures, details via pagination | 150 results in a single response would blow context. Summary-first keeps results usable in any host. |
| Weighted scoring | weight on every assertion | A length-max check and an llm-rubric on accuracy aren't equally important. Weights let the score reflect actual quality criteria. |
| Global DB, not per-project | ~/.mcp-prompt-lab/ default, overridable via env | Eval data is cross-project by nature. You're comparing prompts, not building per-repo config. |
| Lazy DB init | Created on first tool call, not server start | STDIO servers must respond to initialize instantly. SQLite setup happens when actually needed. |
| Zod v4 for grader schemas | generateObject with typed schemas | Guaranteed structured grader responses. No regex parsing of raw LLM text. |
| Static hints, not generated | Hardcoded per action and outcome | Predictable, fast, zero tokens. Every response guides the next logical action. |
| Bun-native SQLite | bun:sqlite, zero npm dependencies for DB | One less dependency. Bun's SQLite is fast, synchronous, and supports WAL natively. |


License

MIT
