# Design System MCP Evaluation
This repository contains tools for evaluating design system MCP answers against expected results using an LLM-as-judge methodology.
## Files
- `EVALUATION_SET_1.md` - Questions to test
- `EVALUATION_SET_1_OUTPUT.md` - MCP-generated answers
- `EVALUATION_SET_1_ANSWER_KEY.md` - Expected answers
- `run_questions.py` - Generates MCP answers via Q CLI (see Step 1)
- `simple_evaluate.py` - Evaluation script
## Quick Start
**Fully automated evaluation:**
```bash
# Default (uses EVALUATION_SET_1_OUTPUT.md)
python3 simple_evaluate.py
# Or with custom answer file
python3 simple_evaluate.py NEW_ANSWERS.md
```
This automatically:
- Creates evaluation prompt
- Runs Q CLI evaluation
- Saves results with pass/fail analysis to `evaluation_results_with_pass_fail_[filename].txt`
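Under the hood, the Q CLI step is the same pipe shown in the manual-evaluation command later in this document. The sketch below illustrates that step in isolation; it is a simplified illustration, not the actual internals of `simple_evaluate.py`.

```python
# Illustrative sketch of the Q CLI evaluation step; not the internals of simple_evaluate.py.
import subprocess

def run_q_evaluation(prompt_path: str, results_path: str) -> None:
    """Pipe an evaluation prompt into Q CLI and capture the judge's response."""
    with open(prompt_path, encoding="utf-8") as f:
        prompt = f.read()
    # Equivalent to: cat evaluation_prompt.txt | q chat --no-interactive > results.txt
    result = subprocess.run(
        ["q", "chat", "--no-interactive"],
        input=prompt,
        capture_output=True,
        text=True,
        check=True,
    )
    with open(results_path, "w", encoding="utf-8") as f:
        f.write(result.stdout)
```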
## Evaluation Process
### Step 1: Prepare Files
Ensure you have three files:
- Questions (markdown list format: `- Question text`)
- MCP answers (sections starting with `## Question N` and containing `**Answer:**`)
- Expected answers (sections with `Expected Answer:` format)
**MCP Answers Format Requirements:**
```markdown
## Question 1
**Question:** What should I use as a separator in breadcrumbs?
**Answer:** Your answer text here...
## Question 2
**Question:** Next question text...
**Answer:** Next answer text...
```
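The evaluation script needs to locate each answer within this structure. A minimal parser for the format above looks roughly like the following; it is illustrative only, and `simple_evaluate.py` may parse the file differently.

```python
# Minimal parser for the "## Question N" / "**Answer:**" format shown above.
# Illustrative only; simple_evaluate.py may use different parsing logic.
import re
from pathlib import Path

def parse_mcp_answers(path: str) -> dict[int, str]:
    """Return {question_number: answer_text} from an MCP answers file."""
    text = Path(path).read_text(encoding="utf-8")
    answers = {}
    # Split on "## Question N" headings, keeping the number with each chunk.
    for match in re.finditer(
        r"^## Question (\d+)\s*(.*?)(?=^## Question \d+|\Z)",
        text, flags=re.MULTILINE | re.DOTALL,
    ):
        number, body = int(match.group(1)), match.group(2)
        parts = body.split("**Answer:**", 1)
        if len(parts) == 2:
            answers[number] = parts[1].strip()
    return answers
```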
Create the MCP Answers file by running:
```bash
python3 run_questions.py
```
This script automatically:
- Reads all questions from `EVALUATION_SET_1.md`
- Asks each question using the design system MCP via Q CLI (with `--trust-all-tools` flag)
- Saves formatted answers to `EVALUATION_SET_1_OUTPUT.md`
- Saves complete Q CLI output to `EVALUATION_SET_1_FULL_OUTPUT.md` for debugging
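The general shape of that loop is sketched below. It is a simplified illustration, not the actual `run_questions.py`: the exact Q CLI invocation and the full-transcript output file are omitted, and the real script may pass questions to Q CLI differently.

```python
# Simplified sketch of the question-asking loop; not the actual run_questions.py.
import subprocess

def ask_questions(questions_path: str, output_path: str) -> None:
    """Ask each '- Question text' line via Q CLI and write the formatted answer file."""
    with open(questions_path, encoding="utf-8") as f:
        questions = [line[2:].strip() for line in f if line.startswith("- ")]

    sections = []
    for i, question in enumerate(questions, start=1):
        # Assumed invocation: pipe the question into a non-interactive, trusted Q CLI session.
        result = subprocess.run(
            ["q", "chat", "--no-interactive", "--trust-all-tools"],
            input=question, capture_output=True, text=True, check=True,
        )
        sections.append(
            f"## Question {i}\n**Question:** {question}\n**Answer:** {result.stdout.strip()}\n"
        )

    with open(output_path, "w", encoding="utf-8") as f:
        f.write("\n".join(sections))
```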
### Step 2: Run Automated Evaluation
```bash
python3 simple_evaluate.py [ANSWER_FILE.md]
```
This creates:
- `evaluation_prompt_[filename].txt` - LLM prompt
- `answer_comparison_[filename].json` - Structured comparison data
- `evaluation_results_with_pass_fail_[filename].txt` - **Complete results with scores and pass/fail analysis**
### Generated Results File Contents
- Summary statistics (average score, pass rate) shown first
- Score breakdown by performance level
- List of critical failures
- Individual question scores with explanations
- Pass/fail status for each question (≥6 = Pass)
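The summary portion of that file reduces to simple arithmetic over the per-question scores, along the lines of the sketch below (the scores and exact output formatting are illustrative).

```python
# Sketch of the summary arithmetic behind the results file (illustrative, not the script's code).
PASS_THRESHOLD = 6  # Pass: score >= 6/10, matching the criteria below

def summarize(scores: dict[int, int]) -> str:
    """Return a summary block similar to the one written to the results file."""
    passed = [q for q, s in scores.items() if s >= PASS_THRESHOLD]
    failed = [q for q, s in scores.items() if s < PASS_THRESHOLD]
    average = sum(scores.values()) / len(scores)
    pass_rate = 100.0 * len(passed) / len(scores)
    return (
        f"Average Score: {average:.2f}/10\n"
        f"Pass Rate: {pass_rate:.1f}% ({len(passed)} out of {len(scores)} questions passed)\n"
        f"Failed Questions: {len(failed)}\n"
    )

# Example: fifteen questions, all scoring 8 or 9
print(summarize({q: 8 + (q % 2) for q in range(1, 16)}))
```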
## Scoring Scale
- **10:** Perfect match in meaning and content
- **8-9:** Very good match, captures main points with minor differences
- **6-7:** Good match, captures key concepts but missing some details
- **4-5:** Partial match, some correct information but significant gaps
- **2-3:** Poor match, minimal correct information
- **0-1:** No match or completely incorrect
## Pass/Fail Criteria
- **Pass:** Score ≥ 6/10
- **Fail:** Score < 6/10
## Example Output
```
Average Score: 8.33/10 (100.0% pass rate)
Pass Rate: 100.0% (15 out of 15 questions passed)
Failed Questions: 0
```
## Manual Evaluation (if needed)
If the automated evaluation fails, you can run the Q CLI step manually:
```bash
cat evaluation_prompt_[filename].txt | q chat --no-interactive > results.txt
```
## Troubleshooting
- Ensure all three input files have matching question counts
- Check markdown formatting matches expected patterns
- Verify Q CLI is installed and accessible
- The script automatically strips ANSI escape codes from Q CLI output (see the sketch below)
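That ANSI cleanup typically comes down to a single regular expression. A minimal version is shown below; the exact pattern used by the scripts may differ.

```python
# Minimal ANSI escape-code stripper; the repository's scripts may use a different pattern.
import re

# Matches CSI sequences such as colors and cursor movement (e.g. "\x1b[31m").
ANSI_ESCAPE = re.compile(r"\x1b\[[0-9;?]*[ -/]*[@-~]")

def strip_ansi(text: str) -> str:
    """Remove ANSI escape sequences from Q CLI output."""
    return ANSI_ESCAPE.sub("", text)

assert strip_ansi("\x1b[31mHello\x1b[0m") == "Hello"
```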
## Customization
To modify evaluation criteria:
1. Edit the scoring scale in `simple_evaluate.py`
2. Adjust pass/fail threshold (currently 6/10)
3. Modify prompt template for different evaluation focus