# Dingo Factuality Assessment - Complete Guide
This guide explains how to use Dingo's integrated factuality assessment features to evaluate the factual accuracy of LLM-generated content.
## 🎯 Feature Overview
Factuality assessment evaluates whether LLM-generated responses contain factual errors or unverifiable claims. It is particularly useful for:
- **Content Quality Control**: Verify accuracy of generated content
- **Knowledge Base Validation**: Ensure knowledge base information is accurate
- **Training Data Filtering**: Filter out factually incorrect training samples
- **Real-time Output Verification**: Check factual accuracy of model outputs
## 🔧 Core Principles
### Evaluation Process
1. **Claim Extraction**: Break down response into independent factual claims
2. **Fact Verification**: Verify each claim against reference materials or knowledge base
3. **Score Calculation**: Calculate overall factuality score
4. **Issue Identification**: Identify specific factual errors
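A minimal conceptual sketch of these four steps follows. It is not Dingo's internal implementation: the `assess_factuality` helper and the example claims are invented for illustration, and in practice claim extraction and verification are performed by the configured LLM.
```python
# Conceptual sketch only; in Dingo these steps are carried out by the configured LLM.

def assess_factuality(claim_verdicts, threshold=5.0):
    """Steps 3-4: aggregate per-claim verdicts into a 0-10 score and collect issues."""
    verified = sum(1 for _, supported in claim_verdicts if supported)
    score = 10.0 * verified / len(claim_verdicts)
    issues = [claim for claim, supported in claim_verdicts if not supported]
    return score, score < threshold, issues  # score, has_issues, identified errors

# Steps 1-2: claims extracted from the response, each checked against the references
claim_verdicts = [
    ("Python was first released in 1991", True),     # supported by references
    ("Python was created by James Gosling", False),  # contradicted by references
    ("Python is an interpreted language", True),     # supported by references
]

score, has_issues, issues = assess_factuality(claim_verdicts)
print(round(score, 2), has_issues, issues)  # 6.67 False ['Python was created by James Gosling']
```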
### Scoring Mechanism
- **Score Range**: 0.0 - 10.0
- **Score Meaning**:
  - 8.0-10.0 = High factual accuracy
  - 5.0-7.9 = Moderate accuracy, some errors
  - 0.0-4.9 = Low accuracy, significant errors
- **Default Threshold**: 5.0 (configurable)
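As a quick reading aid, the hypothetical helper below maps a score to the bands above and applies the threshold the way Dingo does (scores below the threshold are flagged):
```python
# Hypothetical helper; the bands mirror the list above.

def interpret(score: float, threshold: float = 5.0) -> str:
    band = (
        "high factual accuracy" if score >= 8.0
        else "moderate accuracy, some errors" if score >= 5.0
        else "low accuracy, significant errors"
    )
    verdict = "passed" if score >= threshold else "flagged (below threshold)"
    return f"{score:.1f}/10: {band}, {verdict}"

print(interpret(8.5))  # 8.5/10: high factual accuracy, passed
print(interpret(3.2))  # 3.2/10: low accuracy, significant errors, flagged (below threshold)
```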
## 📋 Usage Requirements
### Data Format Requirements
```python
from dingo.io.input import Data
data = Data(
    data_id="test_1",
    prompt="User's question",                                  # Original question (optional)
    content="LLM's response",                                  # Response to assess
    context=["Reference material 1", "Reference material 2"]   # Reference materials (optional but recommended)
)
```
## 🚀 Quick Start
### SDK Mode - Single Assessment
```python
import os
from dingo.config.input_args import EvaluatorLLMArgs
from dingo.io.input import Data
from dingo.model.llm.llm_factcheck import LLMFactCheck
# Configure LLM
LLMFactCheck.dynamic_config = EvaluatorLLMArgs(
    key=os.getenv("OPENAI_API_KEY"),
    api_url=os.getenv("OPENAI_BASE_URL", "https://api.openai.com/v1"),
    model=os.getenv("OPENAI_MODEL", "gpt-4o-mini"),
    parameters={"threshold": 5.0}
)

# Prepare data
data = Data(
    data_id="test_1",
    prompt="When was Python released?",
    content="Python was released in 1991 by Guido van Rossum.",
    context=["Python was created by Guido van Rossum.", "Python was first released in 1991."]
)
# Execute assessment
result = LLMFactCheck.eval(data)
# View results
print(f"Score: {result.score}/10")
print(f"Has issues: {result.status}") # True = below threshold, False = passed
print(f"Reason: {result.reason[0]}")
```
### Dataset Mode - Batch Assessment
```python
from dingo.config import InputArgs
from dingo.exec import Executor
input_data = {
    "task_name": "factuality_assessment",
    "input_path": "test/data/responses.jsonl",
    "output_path": "outputs/",
    "dataset": {"source": "local", "format": "jsonl"},
    "executor": {
        "max_workers": 10,
        "result_save": {"good": True, "bad": True, "all_labels": True}
    },
    "evaluator": [
        {
            "fields": {
                "prompt": "question",
                "content": "response",
                "context": "references"
            },
            "evals": [
                {
                    "name": "LLMFactCheck",
                    "config": {
                        "model": "gpt-4o-mini",
                        "key": "YOUR_API_KEY",
                        "api_url": "https://api.openai.com/v1",
                        "parameters": {"threshold": 5.0}
                    }
                }
            ]
        }
    ]
}
input_args = InputArgs(**input_data)
executor = Executor.exec_map["local"](input_args)
summary = executor.execute()
print(f"Total: {summary.total}")
print(f"Passed: {summary.num_good}")
print(f"Issues: {summary.num_bad}")
print(f"Pass rate: {summary.score}%")
```
### Data File Format (JSONL)
```jsonl
{"question": "When was Python released?", "response": "Python was released in 1991 by Guido van Rossum.", "references": ["Python was created by Guido van Rossum.", "Python first appeared in 1991."]}
{"question": "What is the capital of France?", "response": "The capital of France is Paris.", "references": ["Paris is the capital and largest city of France."]}
```
## ⚙️ Configuration Options
### Threshold Adjustment
```python
LLMFactCheck.dynamic_config = EvaluatorLLMArgs(
    key="YOUR_API_KEY",
    api_url="https://api.openai.com/v1",
    model="gpt-4o-mini",
    parameters={"threshold": 5.0}  # Range: 0.0-10.0
)
```
**Threshold Recommendations**:
- **Strict scenarios** (medical, legal): threshold 7.0-8.0
- **General scenarios** (Q&A, documentation): threshold 5.0-6.0
- **Loose scenarios** (creative content, brainstorming): threshold 3.0-4.0
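For example, a stricter review pipeline (such as medical or legal content) only needs a higher threshold; everything else matches the configuration above:
```python
# Stricter setting: responses scoring below 7.5 are flagged for review.
LLMFactCheck.dynamic_config = EvaluatorLLMArgs(
    key="YOUR_API_KEY",
    api_url="https://api.openai.com/v1",
    model="gpt-4o-mini",
    parameters={"threshold": 7.5}
)
```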
### Model Selection
```python
# Option 1: GPT-4o (highest accuracy, higher cost)
LLMFactCheck.dynamic_config = EvaluatorLLMArgs(
    model="gpt-4o",
    key="YOUR_API_KEY",
    api_url="https://api.openai.com/v1"
)

# Option 2: GPT-4o-mini (balanced, recommended)
LLMFactCheck.dynamic_config = EvaluatorLLMArgs(
    model="gpt-4o-mini",
    key="YOUR_API_KEY",
    api_url="https://api.openai.com/v1"
)

# Option 3: Alternative LLM (DeepSeek, etc.)
LLMFactCheck.dynamic_config = EvaluatorLLMArgs(
    model="deepseek-chat",
    key="YOUR_API_KEY",
    api_url="https://api.deepseek.com"
)
```
## 📊 Output Format
### SDK Mode Output
```python
result = LLMFactCheck.eval(data)
# Basic information
result.score # Score: 0.0-10.0
result.status # Has issues: True (below threshold) / False (passed)
result.label # Labels: ["QUALITY_GOOD.FACTCHECK_PASS"] or ["QUALITY_BAD.FACTCHECK_FAIL"]
result.reason # Detailed reasons
result.metric # Metric name: "LLMFactCheck"
```
**Output Example (Passed)**:
```python
result.score = 8.5
result.status = False # False = passed
result.label = ["QUALITY_GOOD.FACTCHECK_PASS"]
result.reason = ["Factual accuracy assessment passed (score: 8.5/10). All claims verified: Python was released in 1991, Creator is Guido van Rossum."]
```
**Output Example (Failed)**:
```python
result.score = 3.2
result.status = True # True = failed
result.label = ["QUALITY_BAD.FACTCHECK_FAIL"]
result.reason = ["Factual accuracy assessment failed (score: 3.2/10). Errors detected: Python was not released in 1995 (correct: 1991)"]
```
## 🌟 Best Practices
### 1. Provide High-quality Reference Materials
**Good References**:
```python
context = [
    "Python was created by Guido van Rossum and first released in February 1991.",
    "Python is an interpreted, high-level programming language.",
    "Python 2.0 was released in 2000, Python 3.0 was released in 2008."
]
```
**Poor References**:
```python
context = [
    "Python",                           # Too brief
    "Python is a programming language"  # Lacks details
]
```
### 2. Suitable Use Cases
**✅ Suitable for**:
- Verifiable factual claims (dates, names, numbers, events)
- Historical facts
- Technical specifications
- Statistical data
**❌ Not suitable for**:
- Subjective opinions
- Future predictions
- Creative content
- Open-ended questions
### 3. Combined Use with Other Metrics
```python
"evaluator": [
{
"fields": {
"prompt": "user_input",
"content": "response",
"context": "retrieved_contexts"
},
"evals": [
{"name": "LLMRAGFaithfulness"}, # Answer faithfulness
{"name": "LLMFactCheck"}, # Factual accuracy
{"name": "RuleHallucinationHHEM"} # Hallucination detection
]
}
]
```
### 4. Iterative Optimization
1. **Initial Testing**: Use default threshold (5.0)
2. **Analyze Results**: Review false positives and false negatives
3. **Adjust Threshold**: Fine-tune based on business requirements
4. **Re-validate**: Test with new threshold
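Once you have a batch of scored results and human judgments, a simple threshold sweep can guide step 3. The sketch below uses made-up `scores` and `labels`; in practice you would read them from Dingo's saved output files.
```python
# Hypothetical threshold sweep; the scores and human labels below are illustrative.
scores = [8.5, 7.2, 6.1, 4.8, 3.2, 9.0, 5.5, 2.1]
labels = [True, True, True, False, False, True, True, False]  # True = human judged factual

for threshold in (4.0, 5.0, 6.0, 7.0):
    predictions = [score >= threshold for score in scores]
    agreement = sum(p == l for p, l in zip(predictions, labels)) / len(labels)
    print(f"threshold={threshold}: agreement with human labels = {agreement:.0%}")
```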
## 📈 Metric Comparison
| Metric | Purpose | Score Range | Requires Reference | Best For |
|--------|---------|-------------|-------------------|----------|
| **Factuality** | Verify factual accuracy | 0-10 | Optional (recommended) | Fact verification, knowledge base validation |
| **Faithfulness** | Check if based on context | 0-10 | Yes | RAG systems, prevent hallucinations |
| **Hallucination** | Detect contradictions with context | 0-1 | Yes | Fast hallucination detection |
**Recommendations**:
- **RAG evaluation**: Combine Faithfulness + Hallucination + Factuality
- **Content generation**: Use Factuality alone
- **Real-time verification**: Prioritize Hallucination (fast) or Faithfulness
## ❓ FAQ
### Q1: Difference between Factuality and Faithfulness?
- **Factuality**: Verifies if content is factually correct (can use external knowledge)
- **Faithfulness**: Checks if response is based on provided context (only looks at context-response relationship)
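As an illustration (a hypothetical example; the expected behaviour is described, not guaranteed), a response can be factually correct while still being unsupported by the supplied context:
```python
# The 1991 release date is factually correct but does not appear in the context,
# so a faithfulness check would typically flag it, while the factuality check can
# still verify it against general knowledge.
data = Data(
    data_id="faq_1",
    prompt="When was Python released?",
    content="Python was first released in 1991.",
    context=["Python is a high-level programming language created by Guido van Rossum."]
)
```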
### Q2: What if no reference materials are provided?
The LLM will use its internal knowledge for verification, but accuracy may be lower. **Recommendation**: Always provide reference materials for best results.
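When you do evaluate without references, the call is the same; simply omit `context` (a minimal sketch reusing the Quick Start setup):
```python
# No context provided: the evaluating model falls back on its own knowledge,
# so results may vary with the model used.
data = Data(
    data_id="no_context_1",
    prompt="When was Python released?",
    content="Python was released in 1991 by Guido van Rossum."
)
result = LLMFactCheck.eval(data)
```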
### Q3: How to handle domain-specific facts?
1. Provide domain-specific reference materials in `context`
2. Use domain-specific LLM models
3. Lower threshold to reduce false positives
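A sketch combining points 1 and 3 for a medical-domain check (the drug facts, reference strings, and threshold value here are illustrative; imports are the same as in the Quick Start):
```python
# Domain-specific references plus a slightly lower threshold to reduce false
# positives on niche claims (values are illustrative).
LLMFactCheck.dynamic_config = EvaluatorLLMArgs(
    key="YOUR_API_KEY",
    api_url="https://api.openai.com/v1",
    model="gpt-4o-mini",
    parameters={"threshold": 4.5}
)

data = Data(
    data_id="domain_1",
    content="Amoxicillin is a beta-lactam antibiotic used to treat bacterial infections.",
    context=[
        "Amoxicillin is a penicillin-class (beta-lactam) antibiotic.",
        "Amoxicillin is used to treat a number of bacterial infections."
    ]
)
result = LLMFactCheck.eval(data)
```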
### Q4: How to interpret scores?
- **8.0-10.0**: High accuracy, all or most facts verified
- **5.0-7.9**: Moderate accuracy, some errors or unverifiable claims
- **3.0-4.9**: Low accuracy, multiple errors
- **0.0-2.9**: Very low accuracy, serious factual errors
## 📖 Related Documents
- [RAG Evaluation Metrics Guide](rag_evaluation_metrics.md)
- [Hallucination Detection Guide](hallucination_detection_guide.md)
- [Response Quality Evaluation](../README.md#evaluation-metrics)
## 📝 Example Scenarios
### Scenario 1: Verify Historical Facts
```python
data = Data(
    content="Python was released in 1991 by Guido van Rossum at CWI in the Netherlands.",
    context=[
        "Python was created by Guido van Rossum.",
        "Python was first released in February 1991.",
        "Guido van Rossum began working on Python at CWI."
    ]
)
result = LLMFactCheck.eval(data)
# Expected: High score (>8.0), all facts verified
```
### Scenario 2: Detect Factual Errors
```python
data = Data(
    content="Python was released in 1995 by James Gosling.",  # Wrong year and author
    context=[
        "Python was created by Guido van Rossum.",
        "Python was first released in 1991."
    ]
)
)
result = LLMFactCheck.eval(data)
# Expected: Low score (<4.0), multiple errors detected
```
### Scenario 3: Assess Partially Correct Content
```python
data = Data(
    content="Python 3.0 was released in 2008. It introduced many breaking changes and removed backward compatibility with Python 2.x.",
    context=[
        "Python 3.0 was released on December 3, 2008.",
        "Python 3.0 was not backward compatible with Python 2.x series."
    ]
)
)
result = LLMFactCheck.eval(data)
# Expected: High score (7-9), facts mostly correct with minor imprecisions
```
### Scenario 4: Handle Unverifiable Claims
```python
data = Data(
    content="Python will become the most popular programming language in 2030.",  # Future prediction
    context=["Python is currently one of the most popular programming languages."]
)
)
result = LLMFactCheck.eval(data)
# Expected: Moderate score (4-6), future prediction cannot be verified
```