# Testing Quick Start Guide
## TL;DR - Run This First
```bash
# Get your system's reliability score (takes ~5 minutes)
python tests/test_reliability_score.py
```
This will output:
- **Overall reliability percentage** (target: ≥90%)
- Specific failures with recommendations
- Detailed report: `RELIABILITY_TEST_REPORT.md`
## What Am I Testing?
Your RAG system can fail in subtle ways:
❌ **Hallucination**: Making up information
❌ **Wrong answers**: Answering from incorrect or irrelevant retrieved context
❌ **Out-of-scope**: Using general knowledge instead of your documents
❌ **Number errors**: Approximating instead of being exact
The tests catch these failures **before** they happen in production.
## Quick Testing Workflow
### Step 1: Baseline Score (5 minutes)
```bash
python tests/test_reliability_score.py
```
**Example Output:**
```
Overall Reliability: 78.5%
Tests Passed: 11/14
Category Breakdown:
Hallucination Resistance: 75.0% (3/4)
Numerical Accuracy: 100.0% (2/2)
Source Accuracy: 100.0% (3/3)
Context Adherence: 66.7% (2/3)
Basic Functionality: 100.0% (3/3)
```
### Step 2: Understand Failures
Open the generated report:
```bash
cat RELIABILITY_TEST_REPORT.md
```
Look for ❌ FAIL entries - these show exactly what went wrong.
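If the report is long, you can jump straight to the failing entries; this simply greps for the FAIL marker used in the report:
```bash
grep -n "FAIL" RELIABILITY_TEST_REPORT.md
```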
### Step 3: Fix and Re-test
Common fixes:
**If Hallucination Resistance fails:**
```python
# Edit your demo script's prompt and make the instructions more explicit.
# ChatPromptTemplate comes from LangChain; your demo script most likely
# already imports it (older LangChain versions use `from langchain.prompts import ...`).
from langchain_core.prompts import ChatPromptTemplate

prompt = ChatPromptTemplate.from_template("""
CRITICAL: If the context doesn't contain the answer,
respond EXACTLY with: "I don't have this information in the documents."
NEVER make up or infer information.
<context>
{context}
</context>
Question: {input}
""")
```
**If Context Adherence fails:**
```yaml
# Edit config/config.yaml
# Increase retrieval to get more context
retrieval:
  vector_search_k: 8    # Up from 5
  keyword_search_k: 8   # Up from 5
```
**Then re-run:**
```bash
python tests/test_reliability_score.py
```
### Step 4: Deep Dive Testing (Optional)
Once you're above 85%, run comprehensive tests:
```bash
# This takes ~15-20 minutes, runs 21 tests
python tests/test_rag_boundaries.py
```
This tests edge cases like the following (a sketch of one such check appears after the list):
- Empty queries
- Very long queries
- Negation ("products WITHOUT warranty claims")
- Multi-document reasoning
- Special characters
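As a rough illustration, one of these boundary checks can be written against the same `ReliabilityTester` helper used in the custom-test example later in this guide. This is a hypothetical sketch run from the project root, not the actual code in `tests/test_rag_boundaries.py`:
```python
# Hypothetical sketch of an edge-case check (not the real tests/test_rag_boundaries.py code).
from tests.test_reliability_score import ReliabilityTester


class BoundarySmokeTest(ReliabilityTester):
    def test_empty_query(self):
        """An empty query should get a graceful response, not a crash or a made-up answer."""
        answer = self.query("")
        assert isinstance(answer, str) and answer.strip(), \
            "Empty query should still return a non-empty textual response"
        print(f"✅ Empty query handled: {answer[:80]}...")


if __name__ == '__main__':
    BoundarySmokeTest().test_empty_query()
```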
## Understanding Test Results
### Reliability Score Meanings
| Score | Meaning | Action |
|-------|---------|--------|
| **≥95%** | 🟢 Production Ready | Deploy with confidence |
| **85-94%** | 🟡 Nearly There | Fix remaining issues |
| **75-84%** | 🟠 Needs Work | Address failures before production |
| **<75%** | 🔴 Not Ready | Significant improvements needed |
### What Each Category Tests
**1. Hallucination Resistance** (Critical)
- Asks about information NOT in documents
- Should respond: "I don't have this information"
- Failure = System is making up answers ⚠️
**2. Numerical Accuracy** (Critical)
- Tests if exact numbers are preserved
- Should say "600 entries" not "~600 entries"
- Failure = Data integrity issues ⚠️
**3. Source Accuracy** (Important)
- Tests if correct documents are retrieved
- Should find product catalog when asked about products
- Failure = Retrieval issues
**4. Context Adherence** (Critical)
- Tests if system uses ONLY your documents
- Should NOT answer "What is machine learning?" from general knowledge
- Failure = Using training data instead of your docs ⚠️
**5. Basic Functionality** (Critical)
- Tests if system can answer in-domain questions
- Should successfully answer questions about your data
- Failure = Core system broken ⚠️
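For intuition, a hallucination-resistance check boils down to asking an out-of-scope question and passing only if the answer admits the gap. A minimal sketch of that idea follows; the phrase list mirrors the "Expected" field shown in the report, but the real pass criteria live in `tests/test_reliability_score.py`:
```python
# Hypothetical illustration only - the actual criteria are defined in
# tests/test_reliability_score.py. The check passes only if the answer
# explicitly admits the documents don't cover the question.
REFUSAL_PHRASES = ("don't have", "not available")

def admits_missing_information(answer: str) -> bool:
    """True if the answer explicitly says the documents lack the information."""
    lowered = answer.lower()
    return any(phrase in lowered for phrase in REFUSAL_PHRASES)

# An invented CEO name fails; an explicit refusal passes.
assert not admits_missing_information("The CEO is John Smith based on the organizational structure...")
assert admits_missing_information("I don't have this information in the documents.")
```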
## Common Issues & Quick Fixes
### Issue 1: "Low Hallucination Resistance Score"
**Problem:** System invents information
**Quick Fix:**
```python
# In your demo script, strengthen the prompt:
prompt = ChatPromptTemplate.from_template("""
You MUST answer ONLY from the context below.
If the answer is not in the context, say: "I don't have this information in the documents."
DO NOT use any other knowledge.
<context>
{context}
</context>
Question: {input}
""")
```
### Issue 2: "Low Numerical Accuracy"
**Problem:** Numbers are rounded or lost
**Quick Fix:**
```yaml
# config/config.yaml
document_processing:
  text_chunk_size: 1500     # Increase from 1000
  text_chunk_overlap: 300   # Increase from 200
```
Larger chunks preserve more numerical context. Depending on your ingestion pipeline, you may need to re-process the documents and rebuild the index before the new chunk size takes effect.
### Issue 3: "Low Source Accuracy"
**Problem:** Not finding the right documents
**Quick Fix:**
```yaml
# config/config.yaml
retrieval:
  vector_search_k: 10    # Increase from 5
  keyword_search_k: 10   # Increase from 5
```
More retrieved documents = a better chance of finding the right one (at the cost of a larger prompt and slightly slower responses).
### Issue 4: "Low Context Adherence"
**Problem:** Using general knowledge instead of documents
**Quick Fix:**
```python
# Make prompt more restrictive:
prompt = ChatPromptTemplate.from_template("""
You are a document search assistant. Your ONLY knowledge is in the context below.
You have NO other knowledge. If the context is empty or doesn't answer the question,
say: "I don't have this information in the documents."
<context>
{context}
</context>
Question: {input}
""")
```
## Advanced: Custom Tests
Add your own critical tests:
```python
# tests/test_custom.py
import sys
from pathlib import Path

sys.path.insert(0, str(Path(__file__).parent.parent))

from tests.test_reliability_score import ReliabilityTester


class MyCustomTests(ReliabilityTester):
    def test_my_critical_query(self):
        """Test the most important query for my use case."""
        answer = self.query("What OLED TVs cost under $1000?")

        # Check the answer contains the expected information
        assert "TV-OLED" in answer, "Should mention OLED TV products"
        assert "$" in answer, "Should include price information"

        print("✅ Critical query test passed")
        print(f"   Answer: {answer[:100]}...")


if __name__ == '__main__':
    tester = MyCustomTests()
    tester.test_my_critical_query()
```
Run:
```bash
python tests/test_custom.py
```
## Interpreting the Reports
### RELIABILITY_TEST_REPORT.md
This file contains:
- Overall score and breakdown
- Every test that was run
- For failures: What went wrong and why
- Specific answer that failed
**Look for:**
- ❌ FAIL entries - these are your problems
- The "Issue" field - explains what went wrong
- The "Expected" field - shows what should happen
### Example Failure Entry
```markdown
### ❌ FAIL - Hallucination Resistance
**Query:** What is the CEO's name?
**Answer:** The CEO is John Smith based on the organizational structure...
**Issue:** May be hallucinating - did not indicate lack of information
**Expected:** Should contain "don't have" or "not available"
```
**What this means:** The system made up a CEO name. This is critical - fix immediately.
**How to fix:** Strengthen prompt to explicitly refuse when information is missing.
## Test Often
### During Development
```bash
# Quick check after making changes
python tests/test_reliability_score.py
```
### Before Deployment
```bash
# Full validation
python tests/test_reliability_score.py
python tests/test_rag_boundaries.py
```
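If you want this as a single pre-deployment gate (for example in CI), a minimal wrapper is sketched below. It assumes both scripts exit with a non-zero status when tests fail; verify that in your copies before relying on it.
```bash
#!/usr/bin/env bash
# pre_deploy_check.sh - run the full validation suite, stop on the first failure
set -euo pipefail

python tests/test_reliability_score.py
python tests/test_rag_boundaries.py

echo "All validation suites passed."
```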
### In Production
```bash
# Weekly validation to catch degradation
python tests/test_reliability_score.py > weekly_report.txt
```
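One way to automate the weekly run is a cron entry along these lines (the project path is a placeholder; adjust it and the Python interpreter to your environment):
```bash
# crontab -e : every Monday at 06:00, keep a dated report
0 6 * * 1 cd /path/to/your/rag-project && python tests/test_reliability_score.py > "weekly_report_$(date +\%F).txt" 2>&1
```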
## Success Checklist
Before considering your system "production ready":
- [ ] Reliability score ≥90%
- [ ] All Hallucination Resistance tests pass
- [ ] All Context Adherence tests pass
- [ ] All Numerical Accuracy tests pass
- [ ] All Basic Functionality tests pass
- [ ] Source Accuracy ≥90%
- [ ] Boundary tests run without crashes
- [ ] Custom tests for your use case pass
## Next Steps
1. **Run baseline test** - `python tests/test_reliability_score.py`
2. **Review failures** - Check `RELIABILITY_TEST_REPORT.md`
3. **Fix top issues** - Start with Hallucination and Context Adherence
4. **Re-test** - Iterate until ≥90%
5. **Deep dive** - Run `python tests/test_rag_boundaries.py`
6. **Add custom tests** - For your specific use cases
7. **Monitor** - Run tests regularly
## Getting Help
If stuck on a failure:
1. Check `docs/guides/testing-strategy.md` for detailed solutions
2. Review the "Recommendations" section in test output
3. Look at `tests/test_data_adversarial.json` for similar cases
4. Test with different prompts and configurations
## Remember
⚠️ **A RAG system with <90% reliability will produce a wrong answer on roughly 1 in every 10 queries.**
✅ **Testing is how you get to 95%+ reliability and trust your system.**
Start testing now:
```bash
python tests/test_reliability_score.py
```