# RAG Evaluation Metrics - Complete Guide
## 🎯 Overview
Dingo's RAG evaluation metrics are based on best practices from the [RAGAS paper](https://arxiv.org/abs/2309.15217), DeepEval, and TruLens, and provide comprehensive evaluation of RAG systems.
### ✅ Supported Metrics (5/5)
| Metric | Evaluation Dimension | Required Fields | Source |
|--------|---------------------|-----------------|--------|
| **Faithfulness** | Answer Faithfulness | user_input, response, retrieved_contexts | RAGAS |
| **Answer Relevancy** | Answer Relevance | user_input, response | RAGAS |
| **Context Relevancy** | Context Relevance | user_input, retrieved_contexts | RAGAS + DeepEval + TruLens |
| **Context Recall** | Context Recall | user_input, retrieved_contexts, reference | RAGAS |
| **Context Precision** | Context Precision | user_input, retrieved_contexts, reference | RAGAS |
## 🚀 Quick Start
### 1. Run Examples
```bash
# Dataset mode - batch evaluation (recommended)
python examples/rag/dataset_rag_eval_baseline.py
# SDK mode - single evaluation
python examples/rag/sdk_rag_eval.py
# Simulate RAG system and evaluate
python examples/rag/e2e_RAG_eval_with_mockRAG_fiqa.py
```
### 2. SDK Mode - Single Evaluation
```python
import os
from dingo.config.input_args import EvaluatorLLMArgs, EmbeddingConfigArgs
from dingo.io.input import Data
from dingo.model.llm.rag.llm_rag_faithfulness import LLMRAGFaithfulness
# Configure LLM
LLMRAGFaithfulness.dynamic_config = EvaluatorLLMArgs(
    key=os.getenv("OPENAI_API_KEY"),
    api_url=os.getenv("OPENAI_BASE_URL", "https://api.openai.com/v1"),
    model=os.getenv("OPENAI_MODEL", "deepseek-chat"),
)
# Prepare data
data = Data(
    data_id="example_1",
    prompt="What is machine learning?",
    content="Machine learning is a branch of AI that enables computers to learn from data.",
    context=[
        "Machine learning is a subfield of AI.",
        "ML systems learn from data without explicit programming."
    ]
)
# Evaluate
result = LLMRAGFaithfulness.eval(data)
# View results
print(f"Score: {result.score}/10")
print(f"Passed: {not result.status}") # status=False means passed
print(f"Reason: {result.reason[0]}")
```
### 3. Dataset Mode - Batch Evaluation
```python
from dingo.config import InputArgs
from dingo.exec import Executor
# Configuration
llm_config = {
    "model": "gpt-4o-mini",
    "key": "YOUR_API_KEY",
    "api_url": "https://api.openai.com/v1",
}
llm_config_embedding = {
    "model": "gpt-4o-mini",
    "key": "YOUR_API_KEY",
    "api_url": "https://api.openai.com/v1",
    "embedding_config": {  # ⭐ Required for Answer Relevancy
        "model": "text-embedding-3-large",
        "api_url": "https://api.openai.com/v1",
        "key": "YOUR_API_KEY"
    },
    "parameters": {
        "strictness": 3,
        "threshold": 5
    }
}
input_data = {
    "task_name": "rag_evaluation",
    "input_path": "test/data/fiqa.jsonl",
    "output_path": "outputs/",
    "dataset": {"source": "local", "format": "jsonl"},
    "executor": {
        "max_workers": 10,
        "result_save": {"good": True, "bad": True, "all_labels": True}
    },
    "evaluator": [
        {
            "fields": {
                "prompt": "user_input",
                "content": "response",
                "context": "retrieved_contexts",
                "reference": "reference"
            },
            "evals": [
                {"name": "LLMRAGFaithfulness", "config": llm_config},
                {"name": "LLMRAGAnswerRelevancy", "config": llm_config_embedding},
                {"name": "LLMRAGContextRelevancy", "config": llm_config},
                {"name": "LLMRAGContextRecall", "config": llm_config},
                {"name": "LLMRAGContextPrecision", "config": llm_config}
            ]
        }
    ]
}
input_args = InputArgs(**input_data)
executor = Executor.exec_map["local"](input_args)
summary = executor.execute()
```
## 📋 Data Format
### Required Fields
| Metric | user_input | response | retrieved_contexts | reference | Notes |
|--------|-----------|----------|-------------------|-----------|-------|
| **Faithfulness** | ✅ | ✅ | ✅ | - | Measures if answer is based on context |
| **Answer Relevancy** | ✅ | ✅ | - | - | Measures if answer addresses the question |
| **Context Relevancy** | ✅ | - | ✅ | - | Measures if retrieved contexts are relevant |
| **Context Recall** | ✅ | - | ✅ | ✅ | Measures if all needed info is retrieved |
| **Context Precision** | ✅ | - | ✅ | ✅ | Measures ranking quality of retrieved contexts |
### Data Example (JSONL)
```jsonl
{"user_input": "What is deep learning?", "response": "Deep learning uses neural networks...", "retrieved_contexts": ["Deep learning is a subset of ML...", "Deep learning is used for image recognition..."]}
{"user_input": "Python features?", "response": "Python is concise and has rich libraries.", "retrieved_contexts": ["Python has clean syntax.", "Python has NumPy and other libraries."], "reference": "Python has clean syntax and a rich ecosystem."}
```
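Before a batch run, it can help to check which metrics each record has enough fields for. The following is a minimal, hypothetical pre-flight check (the `REQUIRED_FIELDS` mapping and `supported_metrics` helper are not part of Dingo; the JSONL path is the one used in the dataset-mode example above):
```python
import json

# Required fields per metric (mirrors the table above)
REQUIRED_FIELDS = {
    "LLMRAGFaithfulness": ["user_input", "response", "retrieved_contexts"],
    "LLMRAGAnswerRelevancy": ["user_input", "response"],
    "LLMRAGContextRelevancy": ["user_input", "retrieved_contexts"],
    "LLMRAGContextRecall": ["user_input", "retrieved_contexts", "reference"],
    "LLMRAGContextPrecision": ["user_input", "retrieved_contexts", "reference"],
}

def supported_metrics(record: dict) -> list:
    """Return the metrics this record has enough fields for."""
    return [name for name, fields in REQUIRED_FIELDS.items()
            if all(record.get(f) for f in fields)]

with open("test/data/fiqa.jsonl", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        print(record["user_input"][:40], "->", supported_metrics(record))
```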
## ⚙️ Configuration
### Configurable Parameters
| Parameter | Applicable Metrics | Default | Description |
|-----------|-------------------|---------|-------------|
| `threshold` | All metrics | 5.0 | Pass threshold (0-10) |
| `strictness` | Answer Relevancy | 3 | Number of questions to generate (1-5) |
| `embedding_config` | Answer Relevancy | - | **Required**: includes `model`, `api_url`, `key` |
### Embedding Configuration (Answer Relevancy)
`LLMRAGAnswerRelevancy` **requires `embedding_config`**:
**Option 1: Cloud LLM + Cloud Embedding**
```python
"config": {
"model": "deepseek-chat",
"key": "YOUR_API_KEY",
"api_url": "https://api.deepseek.com",
"embedding_config": { # ⭐ Required
"model": "text-embedding-3-large",
"api_url": "https://api.deepseek.com",
"key": "YOUR_API_KEY"
},
"parameters": {"strictness": 3, "threshold": 5}
}
```
**Option 2: Cloud LLM + Local Embedding (Recommended: Cost-effective)**
```python
"config": {
"model": "deepseek-chat",
"key": "YOUR_API_KEY",
"api_url": "https://api.deepseek.com",
"embedding_config": { # ⭐ Independent embedding service
"model": "BAAI/bge-m3",
"api_url": "http://localhost:8000/v1", # Local vLLM/Xinference
"key": "dummy-key"
},
"parameters": {"strictness": 3, "threshold": 5}
}
```
**Deploy Local Embedding (vLLM)**:
```bash
pip install vllm
python -m vllm.entrypoints.openai.api_server \
    --model BAAI/bge-m3 \
    --port 8000 \
    --host 0.0.0.0
```
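Once the server is up, you can sanity-check the endpoint with any OpenAI-compatible client before pointing `embedding_config` at it. A minimal sketch using the `openai` Python package (host, port, and model name match the command above):
```python
from openai import OpenAI  # pip install openai

# vLLM exposes an OpenAI-compatible API, so the standard client can query it
client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy-key")

resp = client.embeddings.create(model="BAAI/bge-m3", input="hello world")
print(len(resp.data[0].embedding))  # embedding dimension (1024 for bge-m3)
```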
**What happens if it is not configured?**
A runtime exception is raised:
```
ValueError: Embedding model not initialized. Please configure 'embedding_config' in your LLM config with:
- model: embedding model name (e.g., 'BAAI/bge-m3')
- api_url: embedding service URL
- key: API key (optional for local services)
```
## 📊 Metric Details
### 1️⃣ Faithfulness (Answer Faithfulness)
**Evaluation Goal**: Measure if the answer is entirely based on retrieved context, avoiding hallucinations
**Calculation**:
1. Break down answer into independent statements (claims)
2. Judge if each statement is supported by context
3. Faithfulness score = (Supported statements / Total statements) × 10
**Formula**:
```
Faithfulness = (Context-supported claims / Total claims) × 10
```
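As a worked example of the formula (the claim verdicts here are hypothetical; in Dingo they come from the LLM judge):
```python
# Hypothetical LLM verdicts: True = the claim is supported by the retrieved context
claim_verdicts = [True, True, True, False]

faithfulness = sum(claim_verdicts) / len(claim_verdicts) * 10
print(faithfulness)  # 3 of 4 claims supported -> 7.5
```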
**Recommended Threshold**: 7 (out of 10)
---
### 2️⃣ Answer Relevancy (Answer Relevance)
**Evaluation Goal**: Measure if the answer directly addresses the user question
**Calculation**:
1. Generate N reverse questions from the answer (questions inferred by LLM from the answer)
2. Calculate cosine similarity between embeddings of generated questions and original question
3. Answer Relevancy = Average of all similarities
**Formula**:
```
Answer Relevancy = (1/N) × Σ cosine_sim(E_gi, E_o)
Where:
- N: Number of generated questions, default 3 (adjustable via strictness parameter)
- E_gi: Embedding of the i-th generated question
- E_o: Embedding of the original question
```
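A worked sketch of the cosine-similarity averaging, with made-up vectors standing in for the real embedding-service output (the final ×10 scaling to the 0-10 threshold range is an assumption):
```python
import numpy as np

def cosine_sim(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical embeddings
e_original = np.array([0.9, 0.1, 0.2])           # original question
e_generated = [np.array([0.8, 0.2, 0.1]),        # N = 3 questions generated from the answer
               np.array([0.7, 0.3, 0.3]),
               np.array([0.9, 0.0, 0.4])]

# Average similarity, scaled to the 0-10 range used by the thresholds (scaling assumed)
answer_relevancy = float(np.mean([cosine_sim(g, e_original) for g in e_generated])) * 10
print(round(answer_relevancy, 2))
```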
**⚠️ Important**: This metric **requires `embedding_config`**:
- `model`: Embedding model name (e.g., `text-embedding-3-large`, `BAAI/bge-m3`)
- `api_url`: Embedding service address
- `key`: API key (optional for local services)
**Recommended Threshold**: 5 (out of 10)
---
### 3️⃣ Context Relevancy (Context Relevance)
**Evaluation Goal**: Measure if retrieved contexts are relevant to the question
**Calculation**:
Uses a **Dual-Judge System** from NVIDIA research:
**Judge 1 Scoring**:
- **0** = Context completely irrelevant
- **1** = Context partially relevant
- **2** = Context fully relevant
**Judge 2 Scoring**:
- Uses different prompt wording for another perspective
- Same 0-2 scoring standard
- Purpose: Reduce single-prompt bias
**Final Score**:
```
Context Relevancy = (Relevant contexts / Total contexts) × 10
Where:
- Relevant context: Average score from both judges ≥ threshold (default 1.0)
- Irrelevant context: Average score < threshold
```
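A worked example of the dual-judge aggregation (the judge ratings here are hypothetical; in Dingo they come from the two LLM prompts):
```python
# Hypothetical 0-2 ratings from the two judges for 4 retrieved contexts
judge1 = [2, 2, 0, 1]
judge2 = [2, 1, 0, 1]
threshold = 1.0  # a context counts as relevant if the two judges' average score >= 1.0

relevant = sum(1 for s1, s2 in zip(judge1, judge2) if (s1 + s2) / 2 >= threshold)
context_relevancy = relevant / len(judge1) * 10
print(context_relevancy)  # 3 of 4 contexts relevant -> 7.5
```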
**Recommended Threshold**: 5 (out of 10)
---
### 4️⃣ Context Recall
**Evaluation Goal**: Measure if all needed information is retrieved (requires reference answer)
**Calculation**:
1. Extract independent statements from reference answer
2. Judge if each statement can be attributed from retrieved contexts
3. Recall = (Context-supported reference statements / Total reference statements) × 10
**Formula**:
```
Context Recall = (Context-supported reference claims / Total reference claims) × 10
```
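A worked example of the formula (the attribution verdicts here are hypothetical; in Dingo they come from the LLM judge):
```python
# Hypothetical attribution verdicts: can each reference-answer claim be found in the contexts?
reference_claims_supported = [True, True, False, True, True]

context_recall = sum(reference_claims_supported) / len(reference_claims_supported) * 10
print(context_recall)  # 4 of 5 reference claims supported -> 8.0
```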
**Note**: **Requires a reference answer (`reference`)**; typically used in the evaluation phase
**Recommended Threshold**: 5 (out of 10)
---
### 5️⃣ Context Precision
**Evaluation Goal**: Measure ranking quality of retrieval results, whether relevant docs are at the top (requires reference answer)
**Calculation**:
1. For each position k, judge if the context is relevant (supports reference answer)
2. Calculate Precision@k for each position
3. Use relevance indicator (v_k) for weighted sum
**Formula**:
```
Context Precision = Σ(Precision@k × v_k) / Total relevant items in top K
Where:
- K: Total retrieved documents, e.g., 5 documents
- k: Current position (1st, 2nd, 3rd, ..., K-th)
- v_k: Relevance indicator, 0 (irrelevant) or 1 (relevant)
- Precision@k: Precision in first k documents, 0.0 to 1.0
- Precision@k = Relevant count in first k / k
```
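A worked example of the weighted Precision@k sum (the relevance indicators are hypothetical; in Dingo they come from the LLM judge, and the final 0-10 scaling is assumed from the threshold range):
```python
# Hypothetical relevance of each retrieved document, in ranked order (1 = relevant, 0 = irrelevant)
v = [1, 0, 1, 1, 0]  # K = 5 retrieved documents

weighted = 0.0
for k in range(1, len(v) + 1):
    precision_at_k = sum(v[:k]) / k        # relevant docs among the first k, divided by k
    weighted += precision_at_k * v[k - 1]  # only positions holding a relevant doc contribute

context_precision = weighted / max(sum(v), 1)  # divide by total relevant items in top K
print(round(context_precision, 3))  # 0.806 (≈ 8.1 on the 0-10 scale used by the thresholds)
```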
**Note**: **Requires a reference answer (`reference`)** to judge which contexts are relevant
**Recommended Threshold**: 5 (out of 10)
## 🌟 Best Practices
### 1. Metric Combinations
**Complete Evaluation** (5 metrics):
```python
"evals": [
{"name": "LLMRAGFaithfulness"}, # Detect hallucinations
{"name": "LLMRAGAnswerRelevancy"}, # Check answer relevance
{"name": "LLMRAGContextRelevancy"}, # Check context noise
{"name": "LLMRAGContextRecall"}, # Evaluate retrieval completeness
{"name": "LLMRAGContextPrecision"} # Evaluate retrieval ranking
]
```
**Production Environment** (no reference needed):
```python
"evals": [
{"name": "LLMRAGFaithfulness"}, # ⭐ Most important: prevent hallucinations
{"name": "LLMRAGAnswerRelevancy"}, # Ensure direct answers
{"name": "LLMRAGContextRelevancy"} # Check retrieval noise
]
```
**Evaluation Phase** (requires reference):
```python
"evals": [
{"name": "LLMRAGContextRecall"}, # Evaluate retrieval completeness
{"name": "LLMRAGContextPrecision"} # Evaluate retrieval ranking
]
```
### 2. Threshold Adjustment
Adjust thresholds (default 5) to your scenario (a config sketch follows this list):
- **Strict scenarios** (finance, medical): threshold 7-8
- **General scenarios** (Q&A systems): threshold 5-6
- **Loose scenarios** (exploratory search): threshold 3-4
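The threshold is set through the `parameters` block of each evaluator config, as in the examples above. A sketch of a strict-scenario setup (the model name and key placeholder are reused from the earlier examples):
```python
# Strict scenario (finance / medical): raise the pass threshold from the default 5 to 7
strict_llm_config = {
    "model": "gpt-4o-mini",
    "key": "YOUR_API_KEY",
    "api_url": "https://api.openai.com/v1",
    "parameters": {"threshold": 7},
}

evals = [
    {"name": "LLMRAGFaithfulness", "config": strict_llm_config},
    {"name": "LLMRAGContextRelevancy", "config": strict_llm_config},
]
```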
### 3. Iterative Optimization
1. **Initial Evaluation**: Evaluate the current system with all 5 metrics
2. **Identify Issues**:
   - **Low Faithfulness** → Generation model produces hallucinations
     - Optimize: Adjust generation prompts, use stronger models, enhance fact-checking
   - **Low Answer Relevancy** → Answer is off-topic or contains irrelevant info
     - Optimize: Improve generation prompts, limit answer length, enhance question understanding
   - **Low Context Relevancy** → Retrieval introduces noise
     - Optimize: Improve the retrieval algorithm, adjust the similarity threshold, improve the embedding model
   - **Low Context Recall** → Retrieval misses important info
     - Optimize: Increase Top-K, improve query rewriting, expand the knowledge base
   - **Low Context Precision** → Relevant docs are ranked too low
     - Optimize: Improve the ranking algorithm, adjust the reranker, improve relevance calculation
3. **Targeted Optimization**: Adjust components based on the issues found
4. **Re-evaluate**: Verify the effect of the optimizations
5. **Continuous Monitoring**: Monitor key metrics in production
### 4. Important Notes
- **LLM Dependency**: All metrics depend on an LLM API, so a correct API key and endpoint are required
- **Embedding Dependency**:
  - Answer Relevancy **requires `embedding_config`**: `model`, `api_url`, `key`
  - Can use cloud services (OpenAI, DeepSeek) or local deployment (vLLM, Xinference)
  - If not configured, an exception is thrown: `ValueError: Embedding model not initialized...`
- **Cost Considerations**: Evaluation incurs API costs; recommendations:
  - Development: Sample evaluation (50-100 samples)
  - Production: Use key metrics only (Faithfulness, Answer Relevancy, Context Relevancy)
  - Evaluation: Full evaluation with all metrics
- **Reference Requirements**:
  - Context Recall and Context Precision **require** `reference`
  - The other three metrics don't need `reference`
  - `reference` is mainly used in the evaluation phase; production usually doesn't need it
## 📖 For More Details
See the [Chinese version](rag_evaluation_metrics_zh.md) for comprehensive examples and detailed explanations.