# Boundary Testing Suggestions for Hybrid RAG System
This document outlines comprehensive approaches to test the limits and boundaries of the Hybrid RAG system.
## Current Testing (In Progress)
### 1. **Large-Scale Data Ingestion** ✓ COMPLETE
- **41,000+ records** across 13 files
- CSV files: 41,000 rows
- Markdown files: 2,500+ sections
- Text files: 1,000+ specifications
- **Tests:** Load time, memory usage, chunking performance
### 2. **Vector Embedding Performance** ⏳ RUNNING
- Embedding generation speed
- ChromaDB performance at scale
- Memory consumption during indexing
- Throughput (embeddings/second)
### 3. **Retrieval Performance** ⏳ RUNNING
- Query latency at scale
- Accuracy with large document sets
- Hybrid search effectiveness
- Cross-document query performance
### 4. **End-to-End QA Performance** ⏳ RUNNING
- Response time with large context
- Answer quality at scale
- LLM performance with extensive retrieved context
---
## Additional Boundary Testing Recommendations
### Data Volume & Variety
#### 5. **Maximum Document Count**
```python
# Test with incrementally larger datasets
test_sizes = [1000, 5000, 10000, 50000, 100000, 500000]
for size in test_sizes:
    # Measure load time
    # Measure query performance
    # Track memory usage
    # Identify the breaking point
    ...
```
**Expected Limits:**
- ChromaDB: Millions of vectors (memory-dependent)
- BM25: Thousands of documents (in-memory scoring degrades as the corpus grows)
- System memory: Primary constraint
#### 6. **Document Size Limits**
```python
# Test individual file sizes
test_cases = [
"1 KB text file",
"100 KB CSV",
"1 MB markdown",
"10 MB PDF",
"50 MB combined dataset",
"100 MB+ dataset"
]
```
**What to Test:**
- Maximum single file size
- Total dataset size limits
- Memory exhaustion points
- Chunking performance degradation
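A minimal probing sketch (the `ingest_file` call is a placeholder for your loader + chunking + embedding path; synthetic files are generated on the fly):
```python
import os
import tempfile
import time

def probe_file_size_limits(sizes_mb=(0.001, 0.1, 1, 10, 50, 100)):
    """Generate synthetic text files of increasing size and time their ingestion."""
    results = {}
    for size_mb in sizes_mb:
        with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
            f.write("lorem ipsum " * int(size_mb * 1024 * 1024 / 12))  # ~size_mb MB of text
            path = f.name
        start = time.perf_counter()
        try:
            ingest_file(path)  # placeholder: your loader + chunker + embedder
            results[size_mb] = time.perf_counter() - start
        except MemoryError:
            results[size_mb] = "failed: out of memory"
        finally:
            os.remove(path)
    return results
```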
#### 7. **Chunk Count Stress Test**
```python
# Very small chunk sizes to maximize chunk count
chunk_sizes = [100, 250, 500, 1000, 2000]
chunk_overlaps = [0, 50, 100, 200, 400]
# For each (chunk_size, chunk_overlap) combination:
# - Measure embedding time
# - Measure storage size
# - Test the retrieval accuracy vs. performance trade-off
```
---
### Query Complexity
#### 8. **Query Length Limits**
```python
query_lengths = [
"5 words",
"50 words",
"200 words", # Paragraph
"1000 words", # Essay
"5000 words" # Document-length query
]
```
**Metrics:**
- Embedding time for long queries
- Retrieval accuracy
- Response degradation
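A small timing sketch, assuming an `embed_query` function exposed by your embedding layer (the name is illustrative):
```python
import time

def time_query_embeddings(word_counts=(5, 50, 200, 1000, 5000)):
    """Time embedding generation for increasingly long synthetic queries."""
    timings = {}
    for n_words in word_counts:
        query = " ".join(["performance"] * n_words)  # synthetic query of n_words words
        start = time.perf_counter()
        embed_query(query)  # illustrative: however your pipeline embeds a query
        timings[n_words] = time.perf_counter() - start
    return timings
```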
#### 9. **Concurrent Query Load**
```python
# Simulate multiple users
concurrent_users = [1, 5, 10, 25, 50, 100]
for users in concurrent_users:
    # Spawn `users` concurrent queries
    # Measure average latency
    # Measure throughput (queries/sec)
    # Identify bottlenecks
    ...
```
**Use threading or asyncio:**
```python
import asyncio

async def stress_test_queries(num_concurrent=10):
    # query_system must be an async callable; test_query is a representative query string
    tasks = [query_system(test_query) for _ in range(num_concurrent)]
    results = await asyncio.gather(*tasks)
    return analyze_results(results)
```
#### 10. **Retrieval K Parameter Sweep**
```python
k_values = [1, 5, 10, 20, 50, 100, 500]
for k in k_values:
    # Measure retrieval time
    # Measure answer quality
    # Find the optimal k value
    # Identify diminishing returns
    ...
```
---
### Memory & Resource Limits
#### 11. **Memory Pressure Testing**
```python
# Monitor memory usage while loading incrementally larger datasets
import psutil

def memory_stress_test():
    baseline = psutil.virtual_memory().available
    docs_loaded = 0
    while True:
        docs_loaded += load_more_documents()  # hypothetical loader; returns number of docs added
        current_mem = psutil.virtual_memory().available
        if current_mem < baseline * 0.1:      # ~90% of the baseline headroom consumed
            break
    return docs_loaded
```
**What to Monitor:**
- RSS (Resident Set Size)
- Peak memory usage
- Memory leaks during repeated operations
- Garbage collection performance
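Leak detection can piggyback on psutil's RSS reading: run the same query repeatedly and watch whether resident memory keeps growing (the `query_system` call is a placeholder for your QA entry point):
```python
import gc
import psutil

def check_for_leaks(iterations=100):
    """Repeat one query and track RSS growth; steady growth across iterations suggests a leak."""
    process = psutil.Process()
    gc.collect()
    baseline_rss = process.memory_info().rss
    for i in range(1, iterations + 1):
        query_system("representative test query")  # placeholder QA entry point
        if i % 10 == 0:
            gc.collect()
            growth_mb = (process.memory_info().rss - baseline_rss) / 1e6
            print(f"iteration {i}: RSS growth {growth_mb:.1f} MB")
```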
#### 12. **Disk I/O Limits**
```python
# Test ChromaDB persistence performance
test_operations = [
"Sequential writes",
"Random writes",
"Bulk inserts",
"Index rebuilds",
"Persistence frequency"
]
# Measure:
# - Write throughput (MB/s)
# - Read latency
# - Index size growth
# - Compaction performance
```
#### 13. **CPU Utilization**
```python
# Multi-core performance: parallel document processing
import multiprocessing

num_workers = [1, 2, 4, 8, 16]
for workers in num_workers:
    with multiprocessing.Pool(workers) as pool:
        results = pool.map(process_document, documents)  # process_document / documents are system-specific
    measure_speedup()
```
---
### Edge Cases & Error Handling
#### 14. **Malformed Data**
```python
edge_cases = [
"Empty files",
"Files with only whitespace",
"CSV with mismatched columns",
"Markdown with invalid syntax",
"Binary data in text files",
"Files with special characters (emoji, unicode)",
"Extremely long lines (no line breaks)",
"Files with BOM markers",
"Mixed encodings (UTF-8, Latin-1, ASCII)"
]
```
**Expected Behavior:**
- Graceful error handling
- Informative error messages
- System continues with valid files
- No crashes or data corruption
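A sketch of that behavior, assuming a per-file `load_file` function (hypothetical name): each file is loaded independently, and failures are logged and skipped rather than aborting the batch.
```python
import logging

logger = logging.getLogger("ingestion")

def load_documents_safely(paths):
    """Load each file independently; log and skip failures instead of failing the whole batch."""
    loaded, failed = [], []
    for path in paths:
        try:
            loaded.extend(load_file(path))  # hypothetical per-file loader
        except (UnicodeDecodeError, ValueError, OSError) as exc:
            logger.warning("Skipping %s: %s", path, exc)
            failed.append((path, str(exc)))
    return loaded, failed
```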
#### 15. **Network Failure Simulation**
```python
# Test Ollama connection failures
test_scenarios = [
"Ollama server down",
"Network timeout",
"Slow network (add latency)",
"Intermittent connectivity",
"Model not available"
]
# Implement retry logic
# Test fallback mechanisms
# Measure recovery time
```
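For the retry logic, an exponential-backoff wrapper around the model call is usually enough to exercise these scenarios (`call_ollama` stands in for however your pipeline invokes the model):
```python
import time

def call_with_retries(prompt, max_attempts=4, base_delay=1.0):
    """Retry the LLM call with exponential backoff; re-raise after the final attempt."""
    for attempt in range(max_attempts):
        try:
            return call_ollama(prompt)  # stand-in for your Ollama client call
        except (ConnectionError, TimeoutError) as exc:
            if attempt == max_attempts - 1:
                raise
            delay = base_delay * (2 ** attempt)
            print(f"Attempt {attempt + 1} failed ({exc}); retrying in {delay:.0f}s")
            time.sleep(delay)
```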
#### 16. **Race Conditions**
```python
# Concurrent writes to the vector store (write_to_vectorstore must be an async callable)
import asyncio

async def test_concurrent_writes():
    writers = [
        write_to_vectorstore(docs_batch_1),
        write_to_vectorstore(docs_batch_2),
        write_to_vectorstore(docs_batch_3),
    ]
    await asyncio.gather(*writers)
    # Test data consistency
    # Verify no corruption
    # Check for lock contention
```
---
### Performance Degradation Analysis
#### 17. **Time Complexity Analysis**
```python
# Measure how runtime scales with dataset size
dataset_sizes = [100, 500, 1000, 5000, 10000, 50000]
for size in dataset_sizes:
    load_time = measure_load(size)
    query_time = measure_query(size)

# Plot:
# - Load time vs. dataset size
# - Query time vs. dataset size
# - Determine the complexity class
```
**Expected Results:**
- Document loading: O(n)
- Embedding generation: O(n)
- Vector search: O(log n) or O(√n)
- BM25 search: roughly O(n) over the corpus (inverted-index implementations prune this in practice)
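To put a number on the complexity class, a log-log fit of the measured times works: a slope near 1 suggests linear behavior, near 0.5 suggests √n, near 2 suggests quadratic. A minimal sketch with NumPy:
```python
import numpy as np

def estimate_complexity(sizes, times):
    """Fit log(time) = slope * log(size) + c; the slope approximates the polynomial degree."""
    slope, _ = np.polyfit(np.log(sizes), np.log(times), 1)
    return slope
```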
#### 18. **Cache Performance**
```python
from statistics import mean

# Test repeated queries (cache hit rate); query(q) is assumed to return the query's latency
queries = generate_test_queries(1000)

# First pass (cold cache)
cold_times = [query(q) for q in queries]
# Second pass (warm cache)
warm_times = [query(q) for q in queries]

cache_speedup = mean(cold_times) / mean(warm_times)
```
---
### Accuracy & Quality Metrics
#### 19. **Retrieval Accuracy at Scale**
```python
# Create a ground-truth dataset
ground_truth = [
    {"query": "...", "relevant_docs": [...]},
    # ... 100+ test cases
]

# Measure:
metrics = {
    "Precision@k": precision_at_k(results, ground_truth),
    "Recall@k": recall_at_k(results, ground_truth),
    "MRR": mean_reciprocal_rank(results, ground_truth),
    "NDCG": normalized_discounted_cumulative_gain(results, ground_truth),
}
```
#### 20. **Answer Quality vs. Dataset Size**
```python
# Hypothesis: Answer quality may degrade with more irrelevant documents
dataset_sizes = [100, 1000, 10000, 100000]
for size in dataset_sizes:
    answers = query_system(test_questions, dataset_size=size)
    # Measure:
    # - Accuracy (human evaluation)
    # - Relevance scores
    # - Hallucination rate
    # - Answer confidence
```
---
### Real-World Scenarios
#### 21. **Continuous Ingestion Simulation**
```python
# Simulate documents being added over time
async def continuous_ingestion():
    while True:
        new_docs = fetch_new_documents()
        add_to_vectorstore(new_docs)
        await asyncio.sleep(60)  # every minute

# Test:
# - Query performance degradation
# - Index fragmentation
# - Memory growth
```
#### 22. **Mixed Workload Testing**
```python
# Realistic usage patterns
workload = {
    "60% simple queries": simple_queries,
    "30% complex queries": complex_queries,
    "5% document additions": add_operations,
    "5% document deletions": delete_operations,
}

# Run for an extended period (hours/days)
# Monitor:
# - Average latency
# - P50, P95, P99 latency percentiles
# - Error rates
# - Resource usage trends
```
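Percentile reporting for such a run is a one-liner with NumPy once per-query latencies are collected:
```python
import numpy as np

def latency_report(latencies_ms):
    """Summarize per-query latencies (ms) into the percentiles tracked above."""
    p50, p95, p99 = np.percentile(latencies_ms, [50, 95, 99])
    return {
        "count": len(latencies_ms),
        "mean_ms": float(np.mean(latencies_ms)),
        "p50_ms": float(p50),
        "p95_ms": float(p95),
        "p99_ms": float(p99),
    }
```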
#### 23. **Multi-Tenant Simulation**
```python
# Separate vector stores per tenant
num_tenants = [10, 50, 100, 500]
for tenants in num_tenants:
    # Create isolated environments
    # Test resource isolation
    # Measure overhead
    # Test cross-tenant query prevention
    ...
```
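One way to isolate tenants is one ChromaDB collection per tenant; a rough sketch (API as in chromadb 0.4+, paths and names illustrative):
```python
import chromadb

def build_tenant_collections(num_tenants=10):
    """Create one isolated collection per tenant and verify queries stay within it."""
    client = chromadb.PersistentClient(path="./chroma_multitenant")  # illustrative path
    for tenant_id in range(num_tenants):
        collection = client.get_or_create_collection(name=f"tenant_{tenant_id}")
        collection.add(
            ids=[f"tenant_{tenant_id}_doc_0"],
            documents=[f"Private document for tenant {tenant_id}"],
        )
    # Cross-tenant check: a query against tenant 0 must never return another tenant's ids
    results = client.get_collection("tenant_0").query(query_texts=["private document"], n_results=1)
    assert all(doc_id.startswith("tenant_0") for doc_id in results["ids"][0])
```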
---
### Failure & Recovery Testing
#### 24. **Crash Recovery**
```python
# Test scenarios:
test_cases = [
"Crash during embedding generation",
"Crash during vector store write",
"Crash during query processing",
"Disk full during persistence",
"OOM kill during large batch"
]
# Verify:
# - Data consistency after recovery
# - No corrupted indices
# - Ability to resume operations
```
#### 25. **Data Consistency Verification**
```python
def verify_consistency():
    # Count documents and indexed chunks; every source document should yield at least one chunk
    expected_docs = count_source_documents()
    indexed_chunks = count_vectorstore_chunks()
    assert indexed_chunks >= expected_docs, "Fewer indexed chunks than source documents"

    # Verify all documents are retrievable
    for doc_id in all_document_ids:
        results = query_by_id(doc_id)
        assert len(results) > 0, f"Document {doc_id} not retrievable"

    # Verify no duplicates
    all_ids = get_all_chunk_ids()
    assert len(all_ids) == len(set(all_ids)), "Duplicate chunks detected"
```
---
## Automated Testing Framework
### Test Suite Structure
```python
class BoundaryTestSuite:
    def __init__(self):
        self.results = {}

    def run_all_tests(self):
        # Data volume tests
        self.test_max_documents()
        self.test_max_file_size()
        self.test_max_chunks()

        # Performance tests
        self.test_concurrent_queries()
        self.test_query_complexity()
        self.test_memory_limits()

        # Quality tests
        self.test_retrieval_accuracy()
        self.test_answer_quality()

        # Failure tests
        self.test_error_handling()
        self.test_crash_recovery()

        # Generate report
        self.generate_comprehensive_report()
```
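Running the suite would then be a single entry point, e.g.:
```python
if __name__ == "__main__":
    suite = BoundaryTestSuite()
    suite.run_all_tests()
```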
---
## Recommended Tools
### Performance Monitoring
- **psutil**: CPU, memory, disk I/O monitoring
- **py-spy**: Python profiling
- **memory_profiler**: Line-by-line memory usage
- **cProfile**: Function-level performance profiling
### Load Testing
- **locust**: Distributed load testing
- **pytest-benchmark**: Microbenchmarking
- **asyncio**: Concurrent operation testing
### Metrics & Visualization
- **matplotlib/plotly**: Performance graphs
- **pandas**: Data analysis
- **Jupyter**: Interactive analysis notebooks
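As an example of how pytest-benchmark slots in (`query_system` again stands in for your QA entry point and would be imported from your own module):
```python
# test_query_benchmark.py -- run with: pytest --benchmark-only
def test_query_latency(benchmark):
    # `benchmark` is the pytest-benchmark fixture; it calls the function repeatedly and records stats
    result = benchmark(query_system, "representative test query")
    assert result  # sanity check that an answer came back
```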
---
## Success Criteria
### Performance Targets
- **Load time**: < 1 second per 1000 documents
- **Query latency**: < 500ms for 95th percentile
- **Throughput**: > 10 queries/second
- **Memory efficiency**: < 1 GB per 10,000 documents
### Scalability Targets
- **Support 100,000+ documents**
- **Handle 50+ concurrent users**
- **Process 1 GB+ of text data**
- **Maintain <5% error rate under load**
### Quality Targets
- **Retrieval Precision@10**: > 80%
- **Answer Relevance**: > 90%
- **Zero crashes under normal load**
- **<1% data loss during failures**
---
## Next Steps
1. ✓ Run current boundary test suite (41K records)
2. Implement concurrent query testing
3. Add memory pressure tests
4. Create continuous ingestion simulation
5. Build automated regression test suite
6. Document performance baselines
7. Create performance tuning guide
---
## Additional Test Scenarios to Consider
### 26. **Different Document Distributions**
```python
# Test with skewed data
distributions = [
"All CSV (structured only)",
"All Markdown (unstructured only)",
"Mixed (current)",
"Heavy text (95% markdown)",
"Heavy CSV (95% structured)"
]
```
### 27. **Query Pattern Analysis**
```python
# Real user behavior
patterns = [
"Bursty traffic (peak hours)",
"Sustained high load",
"Gradual ramp-up",
"Query diversity analysis",
"Repeat query percentage"
]
```
### 28. **Storage Efficiency**
```python
# Measure compression and deduplication
metrics = {
    "Raw data size": measure_source_size(),
    "Indexed size": measure_vectorstore_size(),
    "Compression ratio": calculate_ratio(),
    "Duplicate chunk percentage": find_duplicates(),
}
```
---
## Conclusion
This comprehensive boundary testing approach will help you:
1. **Identify system limits** before production deployment
2. **Optimize performance** for real-world workloads
3. **Ensure reliability** under stress conditions
4. **Document capacity planning** requirements
5. **Build confidence** in scalability
The current test with 41,000+ records is an excellent start. Consider implementing the additional tests above to fully understand your system's capabilities and limitations.