# MCP Evaluation Server
### *The Ultimate AI Evaluation Platform*

> **Tools**: 63 Specialized Evaluation Tools
>
> **Author**: Mihai Criveti

An **MCP server** providing a comprehensive AI evaluation platform. Features **63 specialized tools** across **14 categories** for end-to-end AI system assessment, combining **LLM-as-a-judge techniques** with rule-based metrics.
## **Tool Categories Overview**

### **Core Evaluation (15 tools)**

- **4 Judge Tools** - LLM-as-a-judge evaluation with bias mitigation
- **4 Prompt Tools** - Clarity, consistency, completeness analysis
- **4 Agent Tools** - Tool usage, reasoning, task completion assessment
- **3 Quality Tools** - Factuality, coherence, toxicity detection

### **Advanced Assessment (39 tools)**

- **8 RAG Tools** - Retrieval relevance, context utilization, grounding verification
- **6 Bias & Fairness Tools** - Demographic bias, representation equity, intersectional analysis
- **5 Robustness Tools** - Adversarial testing, injection resistance, stability analysis
- **4 Safety & Alignment Tools** - Harmful content detection, instruction adherence, value alignment
- **4 Multilingual Tools** - Translation quality, cross-lingual consistency, cultural adaptation
- **4 Performance Tools** - Latency tracking, efficiency metrics, throughput scaling
- **8 Privacy Tools** - PII detection, data minimization, compliance, anonymization

### **System Management (9 tools)**

- **3 Workflow Tools** - Evaluation suites, parallel execution, results comparison
- **2 Calibration Tools** - Judge agreement testing, rubric optimization
- **4 Server Tools** - Health monitoring, cache statistics, system management

### **Technology**

- **LLM-as-a-Judge** - GPT-4, Azure OpenAI, with position bias mitigation
- **Statistical Rigor** - Confidence intervals, significance testing, correlation analysis
- **Multi-Modal Assessment** - Pattern matching + LLM evaluation + rule-based metrics
- **Extensible Architecture** - Configurable rubrics, custom criteria, plugin system
## **Quick Start**

### **Multiple Server Modes**

#### **MCP Server Mode (stdio)**

```bash
# One-command setup
pip install -e ".[dev]"

# Launch MCP server for Claude Desktop and other MCP clients
python -m mcp_eval_server.server
# or
make dev

# Health check endpoints (started automatically on port 8080)
curl http://localhost:8080/health   # Liveness probe
curl http://localhost:8080/ready    # Readiness probe
curl http://localhost:8080/metrics  # Performance metrics
```

#### **REST API Server Mode (HTTP)**

```bash
# Launch REST API server with FastAPI
python -m mcp_eval_server.rest_server --port 8080 --host 0.0.0.0
# or
make serve-rest

# Interactive API documentation
open http://localhost:8080/docs

# Quick API test
curl http://localhost:8080/health
curl http://localhost:8080/tools/categories
```

#### **HTTP Bridge Mode (MCP over HTTP)**

```bash
# MCP protocol over HTTP with Server-Sent Events
make serve-http

# Access via JSON-RPC over HTTP on port 9000
curl -X POST -H 'Content-Type: application/json' \
  -d '{"jsonrpc": "2.0", "id": 1, "method": "tools/list", "params": {}}' \
  http://localhost:9000/
```
## **Complete Tool Arsenal**

### **LLM-as-a-Judge Tools** (4 Tools)

- **Single Response Evaluation**: Customizable criteria with weighted scoring and confidence metrics
- **Pairwise Comparison**: Head-to-head analysis with automatic position bias mitigation
- **Multi-Response Ranking**: Tournament, round-robin, and scoring-based ranking algorithms
- **Reference-Based Evaluation**: Gold standard comparison for factuality, completeness, and style
- **Multi-Judge Consensus**: Ensemble evaluation with agreement analysis and confidence weighting

### **Prompt Evaluation Tools** (4 Tools)

- **Clarity Analysis**: Rule-based ambiguity detection + LLM semantic analysis with improvement recommendations
- **Consistency Testing**: Multi-run variance analysis across temperature settings with outlier detection
- **Completeness Measurement**: Component coverage analysis with visual heatmap generation
- **Relevance Assessment**: Semantic alignment using TF-IDF vectorization with drift analysis (see the sketch below)
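
The relevance assessment above relies on TF-IDF vectorization rather than heavy embedding models. A minimal sketch of the underlying idea using scikit-learn (illustrative only; the actual tool adds drift analysis and richer scoring):

```python
# Minimal sketch of TF-IDF-based relevance scoring, as used conceptually by
# the relevance assessment tool. Illustrative only, not the server's code.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def relevance_score(prompt: str, response: str) -> float:
    """Return cosine similarity between TF-IDF vectors of prompt and response."""
    vectorizer = TfidfVectorizer(stop_words="english")
    matrix = vectorizer.fit_transform([prompt, response])
    return float(cosine_similarity(matrix[0], matrix[1])[0, 0])

print(relevance_score("Explain how TLS certificates are validated",
                      "Certificates are validated by checking the chain of trust..."))
```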
### **Agent Evaluation Tools** (4 Tools)

- **Tool Usage Evaluation**: Selection accuracy, sequence optimization, parameter validation with efficiency scoring
- **Task Completion Analysis**: Multi-criteria success evaluation with partial credit and failure analysis
- **Reasoning Assessment**: Decision-making quality, logical coherence, and hallucination detection
- **Performance Benchmarking**: Comprehensive capability testing across skill levels with baseline comparison

### **Quality Assessment Tools** (3 Tools)

- **Factuality Checking**: Claims verification against knowledge bases with confidence scoring and evidence tracking
- **Coherence Analysis**: Logical flow assessment, contradiction detection, and structural analysis
- **Toxicity Detection**: Multi-category harmful content identification with bias pattern analysis

### **RAG Evaluation Tools** (8 Tools)

- **Retrieval Relevance**: Semantic similarity assessment with LLM judge validation and configurable thresholds
- **Context Utilization**: Analysis of how well retrieved context is integrated into generated responses
- **Answer Groundedness**: Claim verification against supporting context with strictness controls
- **Hallucination Detection**: Contradiction identification between responses and source context
- **Retrieval Coverage**: Topic completeness assessment and information gap analysis
- **Citation Accuracy**: Reference validation and citation quality scoring across multiple formats
- **Chunk Relevance**: Individual document segment evaluation with ranking and scoring
- **Retrieval Benchmarking**: Comparative analysis using standard IR metrics (precision, recall, MRR, NDCG); see the sketch below
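
For reference, the standard IR metrics that retrieval benchmarking reports can be computed directly from ranked relevance judgments. A small illustrative sketch (not the server's internal code):

```python
# Illustrative computation of precision@k, MRR, and NDCG from relevance lists.
import math

def precision_at_k(relevant: list[int], k: int) -> float:
    """Fraction of the top-k retrieved documents that are relevant."""
    return sum(relevant[:k]) / k

def mrr(relevant: list[int]) -> float:
    """Reciprocal rank of the first relevant document."""
    for rank, rel in enumerate(relevant, start=1):
        if rel:
            return 1.0 / rank
    return 0.0

def ndcg(gains: list[float], k: int) -> float:
    """Normalized discounted cumulative gain over graded relevance scores."""
    def dcg(scores):
        return sum(g / math.log2(i + 2) for i, g in enumerate(scores[:k]))
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0

ranked_relevance = [1, 0, 1, 1, 0]          # relevance of retrieved docs, in rank order
print(precision_at_k(ranked_relevance, 3))  # 0.666...
print(mrr(ranked_relevance))                # 1.0 (first doc is relevant)
print(ndcg([3.0, 0.0, 2.0, 1.0, 0.0], 5))   # graded gains
```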
### **Bias & Fairness Tools** (6 Tools)

- **Demographic Bias Detection**: Pattern matching and LLM assessment for protected group bias
- **Representation Fairness**: Balanced representation analysis across contexts and groups
- **Outcome Equity**: Disparate impact analysis across protected attributes
- **Cultural Sensitivity**: Cross-cultural appropriateness and awareness evaluation
- **Linguistic Bias Detection**: Language-based discrimination and dialect bias identification
- **Intersectional Fairness**: Compound bias effects across multiple identity dimensions

### **Robustness Tools** (5 Tools)

- **Adversarial Testing**: Malicious prompt resistance and attack vector evaluation
- **Input Sensitivity**: Response stability testing under input variations and perturbations
- **Prompt Injection Resistance**: Security defense evaluation against injection attacks
- **Distribution Shift**: Performance degradation analysis on out-of-domain data
- **Consistency Under Perturbation**: Output stability measurement across input modifications

### **Safety & Alignment Tools** (4 Tools)

- **Harmful Content Detection**: Multi-category risk assessment across safety dimensions
- **Instruction Following**: Constraint adherence and safety instruction compliance
- **Refusal Appropriateness**: Evaluation of appropriate system refusal behavior
- **Value Alignment**: Human values and ethical principles alignment assessment

### **Multilingual Tools** (4 Tools)

- **Translation Quality**: Accuracy, fluency, and completeness assessment across languages
- **Cross-Lingual Consistency**: Consistency evaluation across multiple language versions
- **Cultural Adaptation**: Localization quality and cultural appropriateness evaluation
- **Language Mixing Detection**: Inappropriate code-switching and language mixing identification

### **Performance Tools** (4 Tools)

- **Response Latency**: Generation speed tracking with statistical analysis and percentiles (see the sketch below)
- **Computational Efficiency**: Resource usage monitoring and efficiency metrics
- **Throughput Scaling**: Concurrent request handling and scaling behavior analysis
- **Memory Monitoring**: Memory consumption pattern tracking and leak detection
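
Latency percentiles like those reported by the response-latency tool can be derived with numpy, already among the server's lightweight dependencies. An illustrative sketch (the timed workload here is arbitrary):

```python
# Illustrative latency summary: mean, p50, p95, p99 from recorded durations.
# The timed workload is a stand-in, not the server's measurement code.
import time
import numpy as np

def time_call(fn, *args, **kwargs) -> float:
    """Return wall-clock seconds spent in a single call."""
    start = time.perf_counter()
    fn(*args, **kwargs)
    return time.perf_counter() - start

latencies = np.array([time_call(sum, range(100_000)) for _ in range(200)])
print(f"mean={latencies.mean()*1000:.2f}ms  "
      f"p50={np.percentile(latencies, 50)*1000:.2f}ms  "
      f"p95={np.percentile(latencies, 95)*1000:.2f}ms  "
      f"p99={np.percentile(latencies, 99)*1000:.2f}ms")
```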
### **Privacy Tools** (8 Tools)

- **PII Detection**: Personally identifiable information detection with configurable sensitivity (see the sketch below)
- **Data Minimization**: Evaluation of data collection necessity and purpose alignment
- **Consent Compliance**: Privacy regulation compliance assessment (GDPR, CCPA, COPPA, HIPAA)
- **Anonymization Effectiveness**: Re-identification risk analysis and utility preservation
- **Data Leakage Detection**: Unintended data exposure and inference leakage identification
- **Consent Clarity**: Readability and comprehensibility assessment of privacy notices
- **Data Retention Compliance**: Retention policy alignment and regulatory adherence
- **Privacy-by-Design**: System-level privacy implementation and design principle evaluation
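
Pattern-based PII detection can be sketched with a few regular expressions; the server's detector covers many more entity types, sensitivity levels, and contextual checks. A toy example:

```python
# Toy pattern-based PII scan (emails, US-style phone numbers, SSN-like strings).
# Illustrative only; not the server's PII detection implementation.
import re

PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "phone": re.compile(r"\b(?:\+1[ .-]?)?\(?\d{3}\)?[ .-]?\d{3}[ .-]?\d{4}\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def detect_pii(text: str) -> dict[str, list[str]]:
    """Return matches per PII category found in the text."""
    return {name: pattern.findall(text)
            for name, pattern in PII_PATTERNS.items()
            if pattern.search(text)}

print(detect_pii("Contact Jane at jane.doe@example.com or 555-123-4567."))
```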
### **Workflow Management Tools** (3 Tools)

- **Evaluation Suites**: Customizable multi-step pipelines with weighted criteria and success thresholds
- **Parallel/Sequential Execution**: Optimized processing with configurable concurrency and resource management
- **Results Comparison**: Statistical analysis with trend detection, significance testing, and regression analysis

### **Judge Calibration Tools** (2 Tools)

- **Agreement Testing**: Inter-judge correlation analysis with human baseline comparison
- **Rubric Optimization**: Automatic tuning using machine learning for improved human alignment

### **Server Management Tools** (4 Tools)

- **Judge Management**: Available model listing, capability assessment, configuration validation
- **Results Storage**: Comprehensive evaluation history with metadata and statistical reporting
- **Cache Management**: Multi-level caching statistics and performance optimization
- **Health Monitoring**: System status checks and performance metrics
## **Advanced Features**

### **LLM-as-a-Judge Best Practices**

- **Position Bias Mitigation**: Automatic response position randomization for fair comparisons (see the sketch below)
- **Chain-of-Thought Integration**: Step-by-step reasoning for enhanced evaluation quality
- **Confidence Calibration**: Self-assessment metrics for evaluation reliability
- **Multiple Judge Consensus**: Ensemble methods with disagreement analysis
- **Human Alignment**: Regular calibration against ground truth evaluations
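
Position bias mitigation in pairwise comparison amounts to randomizing which response the judge sees first and cross-checking the swapped order. A hedged sketch of the idea, where `ask_judge` is a hypothetical stand-in for the real judge-model call:

```python
# Sketch of position-bias mitigation for pairwise comparison: randomize the
# order, judge both orderings, and accept a winner only when they agree.
import secrets

def ask_judge(first: str, second: str) -> str:
    """Toy judge stand-in: prefers the longer response. Returns 'first' or 'second'."""
    return "first" if len(first) >= len(second) else "second"

def judged_winner(verdict: str, first_is_a: bool) -> str:
    """Translate a positional verdict back to the original 'a'/'b' labels."""
    if verdict == "first":
        return "a" if first_is_a else "b"
    return "b" if first_is_a else "a"

def debiased_comparison(response_a: str, response_b: str) -> str:
    # Secure randomization of the initial order, in the spirit of the README's
    # note on cryptographic randomness for bias mitigation.
    first_is_a = secrets.randbelow(2) == 0
    first, second = (response_a, response_b) if first_is_a else (response_b, response_a)

    winner_1 = judged_winner(ask_judge(first, second), first_is_a)
    winner_2 = judged_winner(ask_judge(second, first), not first_is_a)
    return winner_1 if winner_1 == winner_2 else "tie"

print(debiased_comparison("Short answer.", "A longer, more detailed answer."))  # -> b
```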
### **Performance & Scalability**

- **Lightweight Dependencies**: Uses standard libraries (scikit-learn, numpy) instead of heavy ML frameworks
- **Smart Caching**: Multi-level caching (memory + disk) with TTL and invalidation
- **Async Processing**: Non-blocking evaluation execution with configurable concurrency
- **Batch Operations**: Efficient multi-item processing with progress tracking
- **Resource Management**: Memory and CPU optimization with automatic scaling
- **Fast Startup**: Quick initialization without loading large pre-trained models

### **Enterprise Security**

- **Cryptographic Randomness**: Secure random number generation for bias mitigation
- **API Key Management**: Secure credential handling with environment variable integration
- **Input Validation**: Comprehensive parameter validation and sanitization
- **Error Isolation**: Graceful failure handling with detailed error reporting
- **Audit Trail**: Complete evaluation history with compliance reporting

### **Analytics & Insights**

- **Statistical Analysis**: Correlation analysis, significance testing, trend detection
- **Performance Metrics**: Latency tracking, throughput monitoring, success rate analysis
- **Quality Dashboards**: Real-time evaluation quality monitoring with alerting
- **Comparative Analysis**: A/B testing capabilities with regression detection
- **Predictive Analytics**: Performance trend forecasting and anomaly detection
## **Installation & Setup**
### **Quick Installation**
```bash
# Clone and install (lightweight dependencies only)
cd mcp-servers/python/mcp_eval_server
pip install -e ".[dev]"
# Set up API keys (optional - rule-based judge works without them)
export OPENAI_API_KEY="sk-your-key-here"
export AZURE_OPENAI_ENDPOINT="https://your-resource.openai.azure.com/"
export AZURE_OPENAI_API_KEY="your-azure-api-key"
# Configure health check endpoints (optional)
export HEALTH_CHECK_PORT=8080 # Default: 8080
export HEALTH_CHECK_HOST=0.0.0.0 # Default: 0.0.0.0
# Note: No heavy ML dependencies required!
# Uses efficient TF-IDF + scikit-learn instead of transformers
```
### **MCP Client Connection**
```json
{
"command": "python",
"args": ["-m", "mcp_eval_server.server"],
"cwd": "/path/to/mcp-servers/python/mcp_eval_server"
}
```
- **Protocol**: stdio (Model Context Protocol)
- **Transport**: Standard input/output (no HTTP port needed)
- **Tools Available**: 63 specialized evaluation tools
### **Health Check Endpoints**
The server automatically starts health check HTTP endpoints for monitoring:
```bash
# Health endpoints (started automatically with the MCP server)
curl http://localhost:8080/health # Liveness probe
curl http://localhost:8080/ready # Readiness probe
curl http://localhost:8080/metrics # Basic metrics
curl http://localhost:8080/ # Service info
# Kubernetes-style endpoints
curl http://localhost:8080/healthz # Alternative health
curl http://localhost:8080/readyz # Alternative readiness
```
**Health Check Response Example:**
```json
{
"status": "healthy",
"timestamp": 1698765432.123,
"uptime_seconds": 45.67,
"service": "mcp-eval-server",
"version": "0.1.0",
"checks": {
"server_running": true,
"uptime_ok": true
}
}
```
**Readiness Check Response Example:**
```json
{
"status": "ready",
"timestamp": 1698765432.123,
"service": "mcp-eval-server",
"version": "0.1.0",
"checks": {
"server_initialized": true,
"judge_tools_loaded": true,
"storage_initialized": true
}
}
```
### **Docker Deployment**
```bash
# Build container
make build
# Run with environment
make run
# Or use docker-compose
make compose-up
```
### **Development Setup**
```bash
# Install development dependencies
make dev-install
# Run development server
make dev
# Run tests
make test
# Check code quality
make lint
```
## **Usage Examples**

### **MCP Client Integration**
```python
# Multi-criteria evaluation with MCP client
result = await mcp_client.call_tool("judge.evaluate_response", {
"response": "Detailed technical explanation...",
"criteria": [
{"name": "technical_accuracy", "description": "Correctness of technical details", "scale": "1-5", "weight": 0.4},
{"name": "clarity", "description": "Explanation clarity", "scale": "1-5", "weight": 0.3},
{"name": "completeness", "description": "Coverage of key points", "scale": "1-5", "weight": 0.3}
],
"rubric": {
"criteria": [],
"scale_description": {
"1": "Severely lacking",
"2": "Below expectations",
"3": "Meets basic requirements",
"4": "Exceeds expectations",
"5": "Outstanding quality"
}
},
"judge_model": "gpt-4",
"use_cot": True
})
```
### **REST API Integration**
```bash
# Evaluate response via REST API
curl -X POST http://localhost:8080/judge/evaluate \
-H "Content-Type: application/json" \
-d '{
"response": "Paris is the capital of France",
"criteria": [
{
"name": "accuracy",
"description": "Factual accuracy",
"scale": "1-5",
"weight": 1.0
}
],
"rubric": {
"criteria": [],
"scale_description": {
"1": "Wrong",
"5": "Correct"
}
},
"judge_model": "gpt-4o-mini"
}'
```
```python
# Python REST API client
import httpx
import asyncio
async def evaluate_via_rest():
async with httpx.AsyncClient() as client:
response = await client.post("http://localhost:8080/judge/evaluate", json={
"response": "Technical explanation...",
"criteria": [
{"name": "quality", "description": "Overall quality", "scale": "1-5", "weight": 1.0}
],
"rubric": {
"criteria": [],
"scale_description": {"1": "Poor", "5": "Excellent"}
},
"judge_model": "gpt-4o-mini"
})
result = response.json()
return result
# Run evaluation
result = asyncio.run(evaluate_via_rest())
print(f"Overall score: {result['overall_score']}")
```
### **Advanced Pairwise Comparison**
```python
# Head-to-head comparison with bias mitigation
comparison = await mcp_client.call_tool("judge.pairwise_comparison", {
"response_a": "Technical solution A with implementation details...",
"response_b": "Alternative solution B with different approach...",
"criteria": [
{"name": "innovation", "description": "Novelty and creativity", "scale": "1-5", "weight": 0.4},
{"name": "feasibility", "description": "Implementation practicality", "scale": "1-5", "weight": 0.3},
{"name": "efficiency", "description": "Resource optimization", "scale": "1-5", "weight": 0.3}
],
"context": "Solutions for enterprise-scale data processing challenge",
"position_bias_mitigation": True,
"judge_model": "gpt-4-turbo"
})
```
### **Comprehensive Agent Benchmarking**
```python
# Full agent performance assessment
benchmark_result = await mcp_client.call_tool("agent.benchmark_performance", {
"benchmark_suite": "advanced_skills",
"agent_config": {
"model": "gpt-4",
"temperature": 0.7,
"tools_enabled": ["search", "calculator", "code_executor"]
},
"baseline_comparison": {
"name": "GPT-3.5 Baseline",
"scores": {"accuracy": 0.75, "efficiency": 0.68, "reliability": 0.72}
},
"metrics_focus": ["accuracy", "efficiency", "reliability", "creativity"]
})
```
### **Advanced Evaluation Suite**
```python
# Create sophisticated evaluation pipeline
suite = await mcp_client.call_tool("workflow.create_evaluation_suite", {
"suite_name": "comprehensive_ai_assessment",
"description": "Full-spectrum AI capability evaluation",
"evaluation_steps": [
{
"tool": "prompt.evaluate_clarity",
"weight": 0.15,
"parameters": {"target_model": "gpt-4", "domain_context": "technical"}
},
{
"tool": "judge.evaluate_response",
"weight": 0.25,
"parameters": {
"criteria": [
{"name": "technical_depth", "description": "Technical sophistication", "scale": "1-5", "weight": 0.4},
{"name": "practical_utility", "description": "Real-world applicability", "scale": "1-5", "weight": 0.6}
],
"judge_model": "gpt-4"
}
},
{
"tool": "quality.evaluate_factuality",
"weight": 0.20
},
{
"tool": "quality.measure_coherence",
"weight": 0.15
},
{
"tool": "quality.assess_toxicity",
"weight": 0.10
},
{
"tool": "agent.analyze_reasoning",
"weight": 0.15,
"parameters": {"judge_model": "gpt-4-turbo"}
}
],
"success_thresholds": {
"overall": 0.85,
"quality.evaluate_factuality": 0.90,
"quality.assess_toxicity": 0.95
},
"weights": {
"accuracy": 0.4,
"safety": 0.3,
"utility": 0.3
}
})
# Execute comprehensive evaluation
results = await mcp_client.call_tool("workflow.run_evaluation", {
"suite_id": suite["suite_id"],
"test_data": {
"response": "Complex AI system response...",
"context": "Enterprise deployment scenario...",
"reasoning_trace": [...],
"agent_trace": {...}
},
"parallel_execution": True,
"max_concurrent": 5
})
```
## **Advanced Configuration**
### **Custom Model Configuration**
The MCP Eval Server supports complete customization of judge models, allowing you to:
- Configure custom API endpoints and deployments
- Set provider-specific parameters and capabilities
- Create domain-specific model configurations
- Use custom environment variable names
```bash
# Use custom model configuration
export MCP_EVAL_MODELS_CONFIG="./my-custom-models.yaml"
export DEFAULT_JUDGE_MODEL="my-custom-judge"
# Copy default config for customization
make copy-config # Copies to ./custom-config/
make show-config # Show current configuration status
make validate-config # Validate configuration syntax
```
### **Model Configuration with Capabilities**
```yaml
models:
azure:
my-enterprise-gpt4:
provider: "azure"
deployment_name: "my-gpt4-deployment"
model_name: "gpt-4"
api_base_env: "AZURE_OPENAI_ENDPOINT"
api_key_env: "AZURE_OPENAI_API_KEY"
api_version_env: "AZURE_OPENAI_API_VERSION"
deployment_name_env: "AZURE_DEPLOYMENT_NAME"
default_temperature: 0.1 # Custom temperature
max_tokens: 3000 # Custom token limit
capabilities:
supports_cot: true
supports_pairwise: true
supports_ranking: true
supports_reference: true
max_context_length: 8192
optimal_temperature: 0.1
consistency_level: "very_high"
metadata:
purpose: "production_evaluation"
cost_tier: "premium"
ollama:
my-local-llama:
provider: "ollama"
model_name: "llama3:70b"
base_url_env: "OLLAMA_BASE_URL"
default_temperature: 0.3
max_tokens: 2000
request_timeout: 120 # Longer timeout for large models
# Custom defaults
defaults:
primary_judge: "my-enterprise-gpt4"
fallback_judge: "my-local-llama"
# Custom recommendations
recommendations:
production: ["my-enterprise-gpt4"]
development: ["my-local-llama"]
```
### **Advanced Evaluation Rubrics**
```yaml
rubrics:
technical_excellence:
name: "Technical Excellence Assessment"
criteria:
- name: "code_quality"
description: "Code structure, efficiency, and best practices"
scale: "1-10"
weight: 0.3
- name: "innovation"
description: "Novel approaches and creative solutions"
scale: "1-10"
weight: 0.25
- name: "scalability"
description: "System scalability and performance considerations"
scale: "1-10"
weight: 0.25
- name: "maintainability"
description: "Code maintainability and documentation quality"
scale: "1-10"
weight: 0.2
scale_description:
"1-2": "Severely deficient, requires major rework"
"3-4": "Below standards, significant improvements needed"
"5-6": "Meets basic requirements, minor improvements possible"
"7-8": "Exceeds expectations, high quality work"
"9-10": "Exceptional excellence, industry-leading quality"
```
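
The per-criterion weights in a rubric like this combine into a single overall score as a weighted average. A short illustrative calculation (the scores are made up; the server's real aggregation also tracks confidence):

```python
# Illustrative weighted aggregation for the technical_excellence rubric above.
criteria_weights = {
    "code_quality": 0.3,
    "innovation": 0.25,
    "scalability": 0.25,
    "maintainability": 0.2,
}
scores = {"code_quality": 8, "innovation": 6, "scalability": 7, "maintainability": 9}

overall = sum(scores[name] * weight for name, weight in criteria_weights.items())
print(f"overall score: {overall:.2f} / 10")  # 8*0.3 + 6*0.25 + 7*0.25 + 9*0.2 = 7.45
```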
### **Multi-Domain Benchmarks**
```yaml
benchmarks:
enterprise_readiness:
name: "Enterprise Readiness Assessment"
category: "production"
tasks:
- name: "security_analysis"
description: "Security vulnerability assessment and mitigation"
difficulty: "advanced"
expected_tools: ["security_scanner", "vulnerability_analyzer", "mitigation_planner"]
evaluation_metrics: ["threat_identification", "risk_assessment", "solution_quality"]
- name: "performance_optimization"
description: "System performance analysis and optimization"
difficulty: "advanced"
expected_tools: ["profiler", "optimizer", "benchmarker"]
evaluation_metrics: ["performance_gain", "resource_efficiency", "scalability_impact"]
```
## **Research-Grade Features**

### **Statistical Analysis**
- **Correlation Analysis**: Pearson, Spearman, Cohen's Kappa for agreement measurement
- **Significance Testing**: Statistical validation of evaluation differences
- **Trend Analysis**: Performance trajectory analysis with volatility assessment
- **Outlier Detection**: Anomaly identification in evaluation results
- **Confidence Intervals**: Uncertainty quantification for evaluation scores
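
The agreement statistics listed above can be computed with scipy (pulled in by scikit-learn) and scikit-learn itself. A minimal sketch comparing two judges' ratings on made-up data (not the calibration tool's implementation):

```python
# Illustrative inter-judge agreement metrics on made-up 1-5 ratings.
from scipy.stats import pearsonr, spearmanr
from sklearn.metrics import cohen_kappa_score

judge_a = [4, 5, 3, 2, 4, 5, 3, 4]
judge_b = [4, 4, 3, 2, 5, 5, 2, 4]

pearson, _ = pearsonr(judge_a, judge_b)
spearman, _ = spearmanr(judge_a, judge_b)
kappa = cohen_kappa_score(judge_a, judge_b, weights="quadratic")  # ordinal ratings

print(f"Pearson r:  {pearson:.3f}")
print(f"Spearman rho: {spearman:.3f}")
print(f"Weighted kappa: {kappa:.3f}")
```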
### **Experimental Capabilities**
- **Judge Calibration**: Systematic bias detection and correction algorithms
- **Rubric Evolution**: Machine learning-powered rubric optimization
- **Meta-Evaluation**: Evaluation of evaluation quality itself
- **Human Alignment**: Continuous calibration against expert human judgments
- **Cross-Validation**: K-fold validation for evaluation reliability
### **Domain-Specific Evaluations**
- **Technical Content**: Code quality, architecture assessment, security analysis
- **Creative Writing**: Originality, engagement, style consistency evaluation
- **Academic Work**: Research quality, citation analysis, argument strength
- **Customer Service**: Helpfulness, politeness, problem resolution effectiveness
- **Educational Content**: Learning objective achievement, instructional clarity
## **Production Architecture**

### **Infrastructure Components**
- **Multi-Judge Runtime**: Supports OpenAI, Azure OpenAI, and rule-based evaluation engines
- **Caching Layer**: Redis-compatible distributed caching with automatic invalidation
- **Results Database**: SQLite/PostgreSQL storage with comprehensive indexing
- **API Gateway**: RESTful endpoints with authentication and rate limiting
- **Monitoring System**: Prometheus metrics with Grafana dashboards
### **Deployment Options**
- **Container Deployment**: Production-ready Docker/Podman containers with security hardening
- **Kubernetes Support**: Helm charts with auto-scaling and service mesh integration
- **Cloud Integration**: AWS ECS, Azure Container Instances, Google Cloud Run compatibility
- **Edge Deployment**: Lightweight containers for edge computing scenarios
- **Development Mode**: Hot-reload development server with debugging capabilities
### **Security & Compliance**
- **Enterprise Security**: OAuth 2.0, JWT tokens, API key rotation
- **Data Privacy**: Encryption at rest and in transit, PII detection and filtering
- **Audit Logging**: Comprehensive audit trails with tamper detection
- **Compliance Ready**: SOC 2, GDPR, HIPAA compliance frameworks supported
- **Vulnerability Management**: Continuous security scanning and automated patching
## **Tool Ecosystem Map**

```
MCP EVALUATION SERVER - 63 SPECIALIZED TOOLS
============================================================

CORE EVALUATION SUITE (15 tools)
├── Judge Tools (4) ──────── LLM-as-a-judge evaluation
├── Prompt Tools (4) ─────── Clarity, consistency, optimization
├── Agent Tools (4) ──────── Performance, reasoning, benchmarking
└── Quality Tools (3) ────── Factuality, coherence, toxicity

ADVANCED ASSESSMENT SUITE (39 tools)
├── RAG Tools (8) ────────── Retrieval relevance, grounding, citations
├── Bias & Fairness (6) ──── Demographic bias, intersectional analysis
├── Robustness (5) ───────── Adversarial testing, injection resistance
├── Safety & Alignment (4) ─ Harmful content, value alignment
├── Multilingual (4) ─────── Translation, cultural adaptation
├── Performance (4) ──────── Latency, efficiency, scaling
└── Privacy (8) ──────────── PII detection, compliance, anonymization

SYSTEM MANAGEMENT (9 tools)
├── Workflow Tools (3) ───── Evaluation suites, parallel execution
├── Calibration (2) ──────── Judge agreement, rubric optimization
└── Server Tools (4) ─────── Health monitoring, system management

TOTAL: 63 TOOLS ACROSS 14 CATEGORIES
```
## **Complete Tool Reference**
### **Judge Tools (4/63)**
| Tool | Description | Key Features |
|------|-------------|--------------|
| `judge.evaluate_response` | Single response evaluation | Customizable criteria, weighted scoring, confidence metrics |
| `judge.pairwise_comparison` | Two-response comparison | Position bias mitigation, criterion-level analysis |
| `judge.rank_responses` | Multi-response ranking | Tournament/scoring algorithms, consistency measurement |
| `judge.evaluate_with_reference` | Reference-based evaluation | Gold standard comparison, similarity scoring |
### **Prompt Tools (4/63)**
| Tool | Description | Key Features |
|------|-------------|--------------|
| `prompt.evaluate_clarity` | Clarity assessment | Rule-based + LLM analysis, ambiguity detection |
| `prompt.test_consistency` | Consistency testing | Multi-run analysis, temperature variance |
| `prompt.measure_completeness` | Completeness analysis | Component coverage, heatmap visualization |
| `prompt.assess_relevance` | Relevance measurement | TF-IDF semantic alignment, drift analysis |
### **Agent Tools (4/63)**
| Tool | Description | Key Features |
|------|-------------|--------------|
| `agent.evaluate_tool_use` | Tool usage analysis | Selection accuracy, sequence optimization |
| `agent.measure_task_completion` | Task success evaluation | Multi-criteria assessment, partial credit |
| `agent.analyze_reasoning` | Reasoning quality assessment | Logic analysis, hallucination detection |
| `agent.benchmark_performance` | Performance benchmarking | Multi-domain testing, baseline comparison |
### **Quality Tools (3/63)**
| Tool | Description | Key Features |
|------|-------------|--------------|
| `quality.evaluate_factuality` | Factual accuracy checking | Claims verification, confidence scoring |
| `quality.measure_coherence` | Logical flow analysis | Coherence scoring, contradiction detection |
| `quality.assess_toxicity` | Harmful content detection | Multi-category analysis, bias detection |
### **RAG Tools (8/63)**
| Tool | Description | Key Features |
|------|-------------|--------------|
| `rag.evaluate_retrieval_relevance` | Document relevance assessment | Semantic similarity, LLM validation |
| `rag.measure_context_utilization` | Context usage analysis | Word overlap, sentence integration |
| `rag.assess_answer_groundedness` | Claim verification | Context support, strictness control |
| `rag.detect_hallucination_vs_context` | Contradiction detection | Statement verification, confidence scoring |
| `rag.evaluate_retrieval_coverage` | Topic completeness check | Information gap analysis, coverage scoring |
| `rag.assess_citation_accuracy` | Reference validation | Citation quality, format support |
| `rag.measure_chunk_relevance` | Document segment scoring | Individual chunk analysis, ranking |
| `rag.benchmark_retrieval_systems` | System comparison | IR metrics, performance analysis |
### **Bias & Fairness Tools (6/63)**
| Tool | Description | Key Features |
|------|-------------|--------------|
| `bias.detect_demographic_bias` | Protected group bias detection | Pattern matching, LLM assessment, sensitivity control |
| `bias.measure_representation_fairness` | Balanced representation analysis | Context evaluation, fairness metrics |
| `bias.evaluate_outcome_equity` | Disparate impact assessment | Outcome analysis, equity scoring |
| `bias.assess_cultural_sensitivity` | Cultural appropriateness evaluation | Cross-cultural awareness, sensitivity dimensions |
| `bias.detect_linguistic_bias` | Language-based discrimination | Dialect bias, formality assessment |
| `bias.measure_intersectional_fairness` | Multi-dimensional bias analysis | Compound effects, intersectional metrics |
### **Robustness Tools (5/63)**
| Tool | Description | Key Features |
|------|-------------|--------------|
| `robustness.test_adversarial_inputs` | Malicious prompt testing | Attack vectors, injection resistance |
| `robustness.measure_input_sensitivity` | Perturbation stability testing | Input variations, sensitivity thresholds |
| `robustness.evaluate_prompt_injection_resistance` | Security defense evaluation | Injection strategies, resistance scoring |
| `robustness.assess_distribution_shift` | Out-of-domain performance | Domain adaptation, degradation analysis |
| `robustness.measure_consistency_under_perturbation` | Output stability measurement | Perturbation consistency, variance analysis |
### **Safety & Alignment Tools (4/63)**
| Tool | Description | Key Features |
|------|-------------|--------------|
| `safety.detect_harmful_content` | Harmful content identification | Multi-category risk assessment, severity classification |
| `safety.assess_instruction_following` | Constraint adherence evaluation | Instruction parsing, compliance scoring |
| `safety.evaluate_refusal_appropriateness` | Refusal behavior assessment | Decision accuracy, precision/recall metrics |
| `safety.measure_value_alignment` | Human values alignment | Ethical principles, weighted assessment |
### **Multilingual Tools (4/63)**
| Tool | Description | Key Features |
|------|-------------|--------------|
| `multilingual.evaluate_translation_quality` | Translation assessment | Accuracy, fluency, cultural adaptation |
| `multilingual.measure_cross_lingual_consistency` | Multi-language consistency | Semantic preservation, factual alignment |
| `multilingual.assess_cultural_adaptation` | Localization evaluation | Cultural dimensions, adaptation scoring |
| `multilingual.detect_language_mixing` | Code-switching detection | Language purity, mixing appropriateness |
### **Performance Tools (4/63)**
| Tool | Description | Key Features |
|------|-------------|--------------|
| `performance.measure_response_latency` | Latency measurement | Statistical analysis, percentiles, timeout tracking |
| `performance.assess_computational_efficiency` | Resource usage monitoring | CPU/memory efficiency, per-token metrics |
| `performance.evaluate_throughput_scaling` | Scaling behavior analysis | Concurrency testing, bottleneck detection |
| `performance.monitor_memory_usage` | Memory consumption tracking | Usage patterns, leak detection, threshold monitoring |
### **Privacy Tools (8/63)**
| Tool | Description | Key Features |
|------|-------------|--------------|
| `privacy.detect_pii_exposure` | PII detection and analysis | Pattern matching, sensitivity levels, context analysis |
| `privacy.assess_data_minimization` | Data collection necessity | Purpose alignment, minimization scoring |
| `privacy.evaluate_consent_compliance` | Regulatory compliance assessment | GDPR/CCPA/COPPA/HIPAA standards, gap analysis |
| `privacy.measure_anonymization_effectiveness` | Anonymization quality evaluation | Re-identification risk, utility preservation |
| `privacy.detect_data_leakage` | Data exposure identification | Direct/inference leakage, unexpected data flow |
| `privacy.assess_consent_clarity` | Consent readability analysis | Grade level, accessibility, comprehension |
| `privacy.evaluate_data_retention_compliance` | Retention policy adherence | Policy-practice alignment, regulatory requirements |
| `privacy.assess_privacy_by_design` | System privacy implementation | Design principles, control effectiveness |
### **Workflow Tools (3/63)**
| Tool | Description | Key Features |
|------|-------------|--------------|
| `workflow.create_evaluation_suite` | Evaluation pipeline creation | Multi-step workflows, weighted criteria |
| `workflow.run_evaluation` | Suite execution | Parallel processing, progress tracking |
| `workflow.compare_evaluations` | Results comparison | Statistical analysis, trend detection |
### **Calibration Tools (2/63)**
| Tool | Description | Key Features |
|------|-------------|--------------|
| `calibration.test_judge_agreement` | Judge agreement testing | Correlation analysis, bias detection |
| `calibration.optimize_rubrics` | Rubric optimization | ML-powered tuning, human alignment |
### **Server Tools (4/63)**
| Tool | Description | Key Features |
|------|-------------|--------------|
| `server.get_available_judges` | List available judges | Model capabilities, status checking |
| `server.get_evaluation_suites` | List evaluation suites | Suite management, configuration viewing |
| `server.get_evaluation_results` | Retrieve results | History browsing, filtering, pagination |
| `server.get_cache_stats` | Cache statistics | Performance monitoring, optimization |
## **Innovation & Research Integration**

### **AI Research Applications**
- **Model Comparison Studies**: Systematic evaluation of different LLM architectures
- **Prompt Engineering Research**: Large-scale prompt effectiveness analysis
- **Agent Behavior Studies**: Comprehensive agent decision-making research
- **Bias Detection Research**: Systematic bias pattern analysis across models
- **Evaluation Methodology**: Meta-research on evaluation techniques themselves
### **Enterprise Applications**
- **Quality Assurance**: Automated content quality control in production systems
- **A/B Testing**: Systematic comparison of different AI configurations
- **Performance Monitoring**: Continuous evaluation of deployed AI systems
- **Compliance Reporting**: Automated generation of evaluation compliance reports
- **Cost Optimization**: Evaluation-driven optimization of AI system costs
### **Educational Applications**
- **Student Assessment**: Automated evaluation of student AI projects
- **Curriculum Development**: Assessment-driven AI curriculum optimization
- **Research Training**: Tools for training researchers in evaluation methodologies
- **Benchmark Creation**: Development of new evaluation benchmarks
- **Peer Review**: AI-assisted peer review systems for academic work
## **Getting Started**

### **Deployment Options Quick Reference**
| Mode | Command | Protocol | Port | Auth | Use Case |
|------|---------|----------|------|------|----------|
| **MCP Server** | `make dev` | stdio | none | none | Claude Desktop, MCP clients |
| **REST API** | `make serve-rest` | HTTP REST | 8080 | none | Direct HTTP API integration |
| **REST Public** | `make serve-rest-public` | HTTP REST | 8080 | none | Public REST API access |
| **HTTP Bridge** | `make serve-http` | JSON-RPC/HTTP | 9000 | none | MCP over HTTP, local testing |
| **HTTP Public** | `make serve-http-public` | JSON-RPC/HTTP | 9000 | none | MCP over HTTP, remote access |
| **Container** | `make run` | HTTP | 8080 | none | Docker deployment |
### **Immediate Quick Start**
#### **Option 1: MCP Server (stdio)**
```bash
# 1. Run MCP server (for Claude Desktop, etc.)
make dev # Shows connection info + starts server
# 2. Test basic functionality
make example # Run evaluation example
make test-mcp # Test MCP protocol
```
#### **Option 2: REST API Server (FastAPI)**
```bash
# 1. Run native REST API server
make serve-rest # Starts on http://localhost:8080
# 2. Test REST API endpoints
make test-rest # Test all REST endpoints
# 3. View interactive documentation
open http://localhost:8080/docs # Swagger UI
open http://localhost:8080/redoc # ReDoc
# 4. Get connection info
make rest-info # Show complete REST API guide
```
#### **Option 3: HTTP Bridge (MCP over HTTP)**
```bash
# 1. Run MCP protocol over HTTP
make serve-http # Starts on http://localhost:9000
# 2. Test HTTP endpoints
make test-http # Test MCP JSON-RPC endpoints
# 3. Get connection info
make http-info # Show complete HTTP bridge guide
```
#### **Option 4: Docker Deployment**
```bash
# Build and deploy
make build && make run
```
### **Integration Examples**
#### **MCP Client Integration**
```python
# Basic MCP integration
from mcp import Client
client = Client("mcp-eval-server")
# Evaluate any AI output
result = await client.call_tool("judge.evaluate_response", {
"response": "Your AI output here",
"criteria": [{"name": "quality", "description": "Overall quality", "scale": "1-5", "weight": 1.0}],
"rubric": {"criteria": [], "scale_description": {"1": "Poor", "5": "Excellent"}}
})
```
#### **REST API Integration**
```bash
# Start REST API server
make serve-rest
# Check server health
curl http://localhost:8080/health
# List tool categories
curl http://localhost:8080/tools/categories
# Evaluate response directly via REST
curl -X POST http://localhost:8080/judge/evaluate \
-H "Content-Type: application/json" \
-d '{
"response": "Paris is the capital of France.",
"criteria": [
{
"name": "accuracy",
"description": "Factual accuracy",
"scale": "1-5",
"weight": 1.0
}
],
"rubric": {
"criteria": [],
"scale_description": {"1": "Wrong", "5": "Correct"}
},
"judge_model": "rule-based"
}'
```
#### **HTTP Bridge Integration (MCP over HTTP)**
```bash
# Start HTTP bridge server
make serve-http
# List available tools (JSON-RPC)
curl -X POST \
-H "Content-Type: application/json" \
-d '{"jsonrpc": "2.0", "id": 1, "method": "tools/list", "params": {}}' \
http://localhost:9000/
# Evaluate response via HTTP bridge (JSON-RPC)
curl -X POST \
-H "Content-Type: application/json" \
-d '{
"jsonrpc": "2.0",
"id": 2,
"method": "tools/call",
"params": {
"name": "judge.evaluate_response",
"arguments": {
"response": "Paris is the capital of France.",
"criteria": [{"name": "accuracy", "description": "Factual accuracy", "scale": "1-5", "weight": 1.0}],
"rubric": {"criteria": [], "scale_description": {"1": "Wrong", "5": "Correct"}},
"judge_model": "rule-based"
}
}
}' \
http://localhost:9000/
```
#### **Python REST API Client Integration**
```python
import httpx
import asyncio
async def evaluate_via_rest_api():
"""Example using native REST API endpoints."""
async with httpx.AsyncClient() as client:
base_url = "http://localhost:8080"
# Check health
health = await client.get(f"{base_url}/health")
print(f"Server status: {health.json()['status']}")
# List tool categories
categories = await client.get(f"{base_url}/tools/categories")
print(f"Available categories: {len(categories.json()['categories'])}")
# Evaluate response using REST endpoint
evaluation = await client.post(f"{base_url}/judge/evaluate", json={
"response": "Your AI response here",
"criteria": [
{"name": "quality", "description": "Overall quality", "scale": "1-5", "weight": 1.0}
],
"rubric": {
"criteria": [],
"scale_description": {"1": "Poor", "5": "Excellent"}
},
"judge_model": "rule-based"
})
result = evaluation.json()
print(f"Evaluation score: {result['overall_score']}")
# Check content toxicity
toxicity = await client.post(f"{base_url}/quality/toxicity", json={
"content": "This is a test message",
"toxicity_categories": ["profanity", "hate_speech"],
"sensitivity_level": "moderate",
"judge_model": "rule-based"
})
result = toxicity.json()
print(f"Toxicity detected: {result['toxicity_detected']}")
# Run evaluation
asyncio.run(evaluate_via_rest_api())
```
#### **Python HTTP Bridge Client Integration**
```python
import httpx
import asyncio
async def evaluate_via_http_bridge():
"""Example using MCP over HTTP bridge."""
async with httpx.AsyncClient() as client:
base_url = "http://localhost:9000"
# List tools via JSON-RPC
tools_request = {
"jsonrpc": "2.0",
"id": 1,
"method": "tools/list",
"params": {}
}
response = await client.post(base_url, json=tools_request)
result = response.json()
tools = result.get("result", [])
print(f"Available tools: {len(tools)}")
# Evaluate response via JSON-RPC
eval_request = {
"jsonrpc": "2.0",
"id": 2,
"method": "tools/call",
"params": {
"name": "judge.evaluate_response",
"arguments": {
"response": "Your AI response here",
"criteria": [{"name": "quality", "description": "Overall quality", "scale": "1-5", "weight": 1.0}],
"rubric": {"criteria": [], "scale_description": {"1": "Poor", "5": "Excellent"}},
"judge_model": "rule-based"
}
}
}
response = await client.post(base_url, json=eval_request)
result = response.json()
print(f"Evaluation result: {result}")
# Run evaluation
asyncio.run(evaluate_via_http_bridge())
```