Crawl4AI+SearXNG MCP Server


NEO4J_QDRANT_INTEGRATION_GUIDE.md•11.7 KiB

# Neo4j-Qdrant Integration Layer Guide

## Overview

The Neo4j-Qdrant integration layer provides validated code search with hallucination prevention by combining:

- **Qdrant**: Semantic vector search for finding relevant code examples
- **Neo4j**: Structural validation against parsed repository knowledge graphs
- **Performance Optimization**: Caching, parallel processing, and health monitoring

## Architecture

```
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   User Query    │───▶│   Smart Code    │───▶│   Validated     │
│                 │    │   Search Tool   │    │   Results       │
└─────────────────┘    └─────────────────┘    └─────────────────┘
                                │
                                ▼
                       ┌─────────────────┐
                       │   Validated     │
                       │ Search Service  │
                       └─────────────────┘
                                │
                      ┌─────────┴─────────┐
                      ▼                   ▼
             ┌─────────────────┐ ┌─────────────────┐
             │     Qdrant      │ │     Neo4j       │
             │ Semantic Search │ │   Validation    │
             └─────────────────┘ └─────────────────┘
                      │                   │
                      ▼                   ▼
             ┌─────────────────┐ ┌─────────────────┐
             │ Code Examples   │ │   Repository    │
             │ + Embeddings    │ │   Structure     │
             └─────────────────┘ └─────────────────┘
```

## Key Components

### 1. ValidatedCodeSearchService

**Location**: `src/services/validated_search.py`

**Purpose**: Core service that combines Qdrant semantic search with Neo4j structural validation.

**Key Features**:

- Parallel validation for performance
- Confidence scoring algorithm
- Intelligent caching
- Fallback strategies when Neo4j unavailable

**Usage**:

```python
from services.validated_search import ValidatedCodeSearchService

# Initialize service
service = ValidatedCodeSearchService(database_client, neo4j_driver)

# Perform validated search
results = await service.search_and_validate_code(
    query="pydantic model validation example",
    match_count=5,
    source_filter="pydantic-ai",
    min_confidence=0.6,
    parallel_validation=True
)
```

### 2. Enhanced Hallucination Detection

**Location**: `src/knowledge_graph/enhanced_validation.py`

**Purpose**: Comprehensive AI script validation using both databases.

**Key Features**:

- AST-based script analysis
- Neo4j structural validation
- Qdrant semantic validation
- Combined confidence scoring
- Suggested corrections from real code

**Usage**:

```python
from knowledge_graph.enhanced_validation import EnhancedHallucinationDetector

# Initialize detector
detector = EnhancedHallucinationDetector(database_client, neo4j_driver)

# Check script for hallucinations
report = await detector.check_script_hallucinations(
    script_path="/path/to/script.py",
    include_code_suggestions=True,
    detailed_analysis=True
)
```

### 3. Smart Combined Query Tool

**Location**: `src/tools.py` (MCP tool: `smart_code_search`)

**Purpose**: Intelligent MCP tool that routes between Neo4j and Qdrant with validation.

**Key Features**:

- Validation mode selection (fast/balanced/thorough)
- Confidence threshold control
- Performance optimization
- Graceful degradation

**MCP Usage**:

```json
{
  "tool": "smart_code_search",
  "arguments": {
    "query": "async function with error handling",
    "match_count": 5,
    "source_filter": "fastapi",
    "min_confidence": 0.7,
    "validation_mode": "balanced",
    "include_suggestions": true
  }
}
```

### 4. Performance Optimization Layer

**Location**: `src/utils/integration_helpers.py`

**Key Features**:

- High-performance TTL cache with LRU eviction
- Batch processing with concurrency control
- Circuit breaker pattern for service failures
- Health monitoring for both databases
- Performance metrics and monitoring

## Validation Confidence Scoring

The integration uses a sophisticated confidence scoring algorithm:

### Scoring Components

1. **Neo4j Structural Validation** (60% weight):
   - Repository existence: 30%
   - Class/method existence: 40%
   - Structure correctness: 30%
2. **Qdrant Semantic Validation** (40% weight):
   - Semantic similarity score
   - Example validation confidence
   - Code pattern matching

### Confidence Thresholds

- **Critical Confidence**: ≥ 0.9
- **High Confidence**: ≥ 0.8
- **Medium Confidence**: ≥ 0.6
- **Low Confidence**: < 0.6

## Performance Features

### Caching Strategy

- **Validation Cache**: 1-hour TTL for Neo4j validation results
- **Query Cache**: 30-minute TTL for optimized queries
- **LRU Eviction**: Automatic cache management
- **Cache Statistics**: Hit rate, evictions, performance metrics

### Parallel Processing

- **Concurrent Validation**: Up to 10 parallel validations
- **Batch Processing**: Configurable batch sizes
- **Semaphore Control**: Resource-aware concurrency
- **Exception Handling**: Graceful degradation on failures

### Health Monitoring

- **Component Health**: Neo4j and Qdrant status checks
- **Integration Health**: Overall system status
- **Performance Metrics**: Response times, success rates
- **Circuit Breaker**: Automatic failure handling

## Environment Configuration

```bash
# Neo4j Configuration (required for full validation)
NEO4J_URI=bolt://localhost:7687
NEO4J_USER=neo4j
NEO4J_PASSWORD=your_password

# Qdrant Configuration (required)
QDRANT_URL=http://localhost:6333
QDRANT_API_KEY=your_api_key  # optional

# Knowledge Graph Features
USE_KNOWLEDGE_GRAPH=true
USE_AGENTIC_RAG=true
USE_HYBRID_SEARCH=true
```

## Usage Examples

### 1. Basic Validated Search

```python
# Search for code examples with validation
results = await service.search_and_validate_code(
    query="database connection pooling",
    match_count=3,
    min_confidence=0.7
)

print(f"Found {len(results['results'])} validated examples")
for result in results['results']:
    print(f"- {result['summary']} (confidence: {result['validation']['confidence_score']:.2f})")
```

### 2. Enhanced Hallucination Detection

```python
# Check AI-generated script
report = await detector.check_script_hallucinations("/tmp/ai_script.py")

if report['overall_assessment']['risk_level'] == 'high':
    print("⚠️ High hallucination risk detected!")
    for hallucination in report['hallucinations']['critical']:
        print(f"- {hallucination['type']}: {hallucination.get('element_name', 'Unknown')}")
```

### 3. Health Monitoring

```python
# Check system health
health = await service.get_health_status()
print(f"Overall Status: {health['overall_status']}")
print(f"Neo4j: {health['components']['neo4j']['status']}")
print(f"Qdrant: {health['components']['qdrant']['status']}")

# Get performance stats
stats = await service.get_cache_stats()
print(f"Cache Hit Rate: {stats['cache_stats']['hit_rate']:.2%}")
```

## Integration Workflow

### 1. Repository Preparation

```bash
# Parse repositories into Neo4j
curl -X POST "http://localhost:8000/parse_github_repository" \
  -d '{"repo_url": "https://github.com/pydantic/pydantic-ai.git"}'

# Extract and index code examples in Qdrant
curl -X POST "http://localhost:8000/extract_and_index_repository_code" \
  -d '{"repo_name": "pydantic-ai"}'
```

### 2. Validated Search

```bash
# Search with validation
curl -X POST "http://localhost:8000/smart_code_search" \
  -d '{
    "query": "pydantic model validation",
    "source_filter": "pydantic-ai",
    "validation_mode": "balanced",
    "min_confidence": 0.6
  }'
```

### 3. Hallucination Detection

```bash
# Check script for hallucinations
curl -X POST "http://localhost:8000/check_ai_script_hallucinations_enhanced" \
  -d '{
    "script_path": "/path/to/script.py",
    "include_code_suggestions": true
  }'
```

## Error Handling & Fallback Strategies

### Neo4j Unavailable

- **Fallback**: Qdrant-only semantic search
- **Confidence**: Reduced to neutral (0.5)
- **Suggestions**: Limited to semantic recommendations

### Qdrant Unavailable

- **Fallback**: Neo4j structural validation only
- **Search**: Disabled, validation-only mode
- **Performance**: Degraded but functional

### Both Systems Unavailable

- **Fallback**: Basic search without validation
- **Warning**: Clear indication of degraded mode
- **Suggestions**: Generic recommendations

## Performance Benchmarks

### Typical Response Times

- **Fast Mode**: < 200ms (lower accuracy)
- **Balanced Mode**: 200-500ms (optimal)
- **Thorough Mode**: 500ms-2s (highest accuracy)

### Cache Performance

- **Hit Rate**: 70-85% for repeated queries
- **Memory Usage**: ~50MB for 1000 cached validations
- **Eviction**: LRU-based automatic management

### Parallel Processing

- **Speedup**: 3-5x for validation-heavy operations
- **Concurrency**: Up to 10 parallel validations
- **Resource Usage**: Optimized for available system resources

## Troubleshooting

### Common Issues

1. **Low Confidence Scores**
   - Ensure repositories are properly parsed in Neo4j
   - Check code examples are indexed in Qdrant
   - Verify query relevance and specificity
2. **Performance Issues**
   - Monitor cache hit rates
   - Adjust concurrency limits
   - Check database connection health
3. **Validation Failures**
   - Verify Neo4j connectivity
   - Check repository parsing completeness
   - Review Qdrant collection status

### Debug Commands

```python
# Check integration health
health = await validate_integration_health(database_client, neo4j_driver)
print(health)

# Get performance stats
optimizer = get_performance_optimizer()
stats = await optimizer.get_performance_stats()
print(stats)

# Clear caches
await service.clear_validation_cache()
```

## Future Enhancements

### Planned Features

- **Multi-language Support**: Beyond Python
- **Code Quality Scoring**: Beyond hallucination detection
- **Advanced Caching**: Redis/Memcached backends
- **Real-time Updates**: Live repository synchronization
- **ML-based Optimization**: Adaptive confidence scoring

### Integration Improvements

- **GraphRAG**: Enhanced knowledge graph queries
- **Vector Similarity**: Advanced embedding techniques
- **Code Generation**: Validated code suggestions
- **IDE Integration**: Real-time validation in editors

## API Reference

### ValidatedCodeSearchService Methods

- `search_and_validate_code()`: Main search with validation
- `get_health_status()`: System health check
- `get_cache_stats()`: Performance statistics
- `clear_validation_cache()`: Cache management

### EnhancedHallucinationDetector Methods

- `check_script_hallucinations()`: Full script analysis
- `_perform_neo4j_validation()`: Structural validation
- `_perform_qdrant_validation()`: Semantic validation
- `_combine_validation_results()`: Result merging

### Performance Optimization Utilities

- `PerformanceCache`: High-performance caching
- `BatchProcessor`: Parallel processing
- `CircuitBreaker`: Failure handling
- `IntegrationHealthMonitor`: System monitoring

This integration layer provides a robust foundation for high-confidence code search with comprehensive validation and performance optimization.
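
As an illustration of the weights listed under Validation Confidence Scoring, the sketch below blends a structural score (60%) with a semantic score (40%) and maps the result to the documented thresholds. It is not the project's implementation: `StructuralChecks`, `combined_confidence`, and `confidence_label` are hypothetical names, and the Qdrant side is collapsed into a single `semantic_score` because the guide gives no sub-weights for it.

```python
from dataclasses import dataclass

# Weights copied from the "Scoring Components" section of this guide.
NEO4J_WEIGHT = 0.6
QDRANT_WEIGHT = 0.4
NEO4J_SUBWEIGHTS = {"repository": 0.3, "class_method": 0.4, "structure": 0.3}


@dataclass
class StructuralChecks:
    """Hypothetical container; each field is a score in [0, 1]."""
    repository: float    # repository existence
    class_method: float  # class/method existence
    structure: float     # structure correctness


def combined_confidence(checks: StructuralChecks, semantic_score: float) -> float:
    """Blend structural and semantic scores into one value in [0, 1]."""
    structural = (
        NEO4J_SUBWEIGHTS["repository"] * checks.repository
        + NEO4J_SUBWEIGHTS["class_method"] * checks.class_method
        + NEO4J_SUBWEIGHTS["structure"] * checks.structure
    )
    return NEO4J_WEIGHT * structural + QDRANT_WEIGHT * semantic_score


def confidence_label(score: float) -> str:
    """Map a score to the threshold names listed in this guide."""
    if score >= 0.9:
        return "critical"
    if score >= 0.8:
        return "high"
    if score >= 0.6:
        return "medium"
    return "low"
```

For example, a result whose repository and class/method checks pass but whose structure check fails, paired with a semantic score of 0.5, lands just above the medium threshold under these assumptions.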
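
The Caching Strategy section describes a TTL cache with LRU eviction and hit-rate statistics. A minimal sketch of that combination follows, using only the standard library. `TTLLRUCache` is a hypothetical class written for illustration; the real `PerformanceCache` in `src/utils/integration_helpers.py` may differ in interface and behavior.

```python
import time
from collections import OrderedDict
from typing import Any, Callable, Optional


class TTLLRUCache:
    """Hypothetical cache: entries expire after a TTL, and the least
    recently used entry is evicted once max_size is exceeded."""

    def __init__(self, max_size: int = 1000, ttl_seconds: float = 3600.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._max_size = max_size
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested
        self._items: "OrderedDict[str, tuple]" = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, key: str) -> Optional[Any]:
        entry = self._items.get(key)
        if entry is None:
            self.misses += 1
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            # Expired entries count as misses and are dropped.
            del self._items[key]
            self.misses += 1
            return None
        self._items.move_to_end(key)  # mark as most recently used
        self.hits += 1
        return value

    def set(self, key: str, value: Any) -> None:
        self._items[key] = (self._clock() + self._ttl, value)
        self._items.move_to_end(key)
        while len(self._items) > self._max_size:
            self._items.popitem(last=False)  # evict least recently used

    @property
    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0
```

With `ttl_seconds=3600` this would correspond to the 1-hour validation cache, and with `ttl_seconds=1800` to the 30-minute query cache.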
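
The Parallel Processing and fallback sections mention semaphore-limited concurrency (up to 10 validations) and a neutral confidence of 0.5 when validation cannot run. The sketch below shows one way these two ideas can fit together with `asyncio`. The function `validate_all` and its `validate` callback are assumptions made for this example and are not part of the documented API.

```python
import asyncio
from typing import Awaitable, Callable, List, Sequence

NEUTRAL_CONFIDENCE = 0.5  # fallback value described in this guide
MAX_PARALLEL = 10         # concurrency limit described in this guide


async def validate_all(
    items: Sequence[str],
    validate: Callable[[str], Awaitable[float]],
    limit: int = MAX_PARALLEL,
) -> List[float]:
    """Run a hypothetical `validate` coroutine over all items with at most
    `limit` running at once; failures degrade to the neutral confidence."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(item: str) -> float:
        async with semaphore:
            try:
                return await validate(item)
            except Exception:
                # Graceful degradation: keep the result, drop the certainty.
                return NEUTRAL_CONFIDENCE

    # gather preserves input order in its result list.
    return list(await asyncio.gather(*(guarded(item) for item in items)))
```

Catching every exception is a deliberate simplification here; a production version would likely log the error and distinguish connection failures from bugs.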
