- Enables crawling GitHub repositories to process content and build knowledge graphs for code validation
- Integrates with the Google Gemini API for generating embeddings and responses, configurable as the primary AI provider
- Leverages Neo4j to build knowledge graphs of code repositories for validating AI-generated code and detecting hallucinations
- Allows switching to OpenAI as the primary AI provider for generating embeddings and responses
- Uses Supabase with pgvector to store and search crawled content, enabling vector and hybrid search
MCP-RAG: Model Context Protocol with Retrieval Augmented Generation
Overview
MCP-RAG is an advanced Python-based system that combines the Model Context Protocol (MCP) with Retrieval Augmented Generation (RAG) capabilities. This project provides a comprehensive framework for intelligent web crawling, content processing, multi-modal data handling, and automated code evolution with integrated testing and evaluation systems.
🚀 Key Features
- Intelligent Web Crawling: Advanced crawling capabilities using Crawl4AI
- Multi-Modal Processing: Support for text, visual, and structured data processing
- Multiple LLM Providers: Integration with OpenAI, Ollama, and HuggingFace models
- Multiple Embedding Providers: Support for OpenAI, Cohere, HuggingFace, and Ollama embeddings
- Automated Code Evolution: Self-improving codebase with version control
- Comprehensive Testing: Automated testing and integration validation
- Security Sandbox: Secure code execution environment
- Performance Evaluation: Built-in correctness evaluation systems
- Resource Management: Intelligent resource allocation and monitoring
📁 Project Structure
🏗️ Architecture Overview
Core Components
1. MCP Server (crawl4ai_mcp.py)
- Main entry point and MCP protocol implementation
- Handles client connections and request routing
- Coordinates between different system components
- Provides unified API for all functionality
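The snippet below is a minimal sketch of how such a tool could be exposed through the FastMCP helper in the official MCP Python SDK; the server name, tool name, and crawl logic are illustrative assumptions, not code taken from the repository.

```python
# Minimal sketch of an MCP tool definition (illustrative; names are not
# taken from the repository). Uses FastMCP from the official MCP Python SDK.
from mcp.server.fastmcp import FastMCP
from crawl4ai import AsyncWebCrawler

mcp = FastMCP("mcp-rag")

@mcp.tool()
async def crawl_url(url: str) -> str:
    """Crawl a single URL and return its extracted markdown content."""
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=url)
        return str(result.markdown)

if __name__ == "__main__":
    mcp.run()  # serves over stdio by default
```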
2. Agent System (agents/)
- Base Agent: Foundation class for all AI agents
- Code Debugger: Automated code analysis and debugging
- Dependency Validator: Ensures code dependencies are correct
- Evolution Orchestrator: Manages automated code improvements
- Integration Tester: Validates component integration
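As a rough illustration of the agent pattern (the method names here are assumptions, not the repository's actual BaseAgent interface), a specialized agent extends a common base class and implements a single entry point:

```python
# Hypothetical illustration of the agent pattern; method names are
# assumptions, not the repository's actual BaseAgent interface.
from abc import ABC, abstractmethod

class BaseAgent(ABC):
    """Foundation class that specialized agents build on."""

    def __init__(self, name: str):
        self.name = name

    @abstractmethod
    async def run(self, task: dict) -> dict:
        """Execute one unit of work and return a structured result."""

class DependencyValidator(BaseAgent):
    """Checks that the modules a task declares can actually be imported."""

    async def run(self, task: dict) -> dict:
        import importlib.util
        missing = [m for m in task.get("modules", [])
                   if importlib.util.find_spec(m) is None]
        return {"agent": self.name, "missing_dependencies": missing}
```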
3. Multi-Provider Architecture
- LLM Providers: Supports OpenAI, Ollama, and HuggingFace models
- Embedding Providers: Multiple embedding backends for flexibility
- Unified Management: Centralized provider management and switching
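A hedged sketch of what this provider abstraction could look like is shown below; the interface and registry names are assumptions, while the client calls follow the current openai Python SDK.

```python
# Illustrative provider abstraction; the interface names are assumptions,
# while the client calls follow the current openai Python SDK.
from abc import ABC, abstractmethod
from typing import List

class EmbeddingProvider(ABC):
    @abstractmethod
    def embed(self, texts: List[str]) -> List[List[float]]:
        ...

class OpenAIEmbeddingProvider(EmbeddingProvider):
    def __init__(self, model: str = "text-embedding-3-small"):
        from openai import OpenAI  # reads OPENAI_API_KEY from the environment
        self._client = OpenAI()
        self._model = model

    def embed(self, texts: List[str]) -> List[List[float]]:
        response = self._client.embeddings.create(model=self._model, input=texts)
        return [item.embedding for item in response.data]

def get_embedding_provider(name: str) -> EmbeddingProvider:
    """Centralized switching point between configured backends."""
    registry = {"openai": OpenAIEmbeddingProvider}
    return registry[name]()
```

Adding another backend then amounts to implementing the same interface and registering it under a new key.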
4. Security & Resource Management
- Sandbox Environment: Secure code execution with resource limits
- Resource Monitor: Tracks CPU, memory, and GPU usage
- Access Control: Manages file system and network access
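For instance, a resource monitor along these lines could be built on psutil; this is an assumption about the approach rather than the project's actual monitor.

```python
# Simple resource-monitoring sketch built on psutil; an assumption about
# the approach, not the project's actual monitor.
import psutil

def snapshot() -> dict:
    """Return current CPU and memory usage for the running process."""
    process = psutil.Process()
    return {
        "cpu_percent": psutil.cpu_percent(interval=0.1),
        "rss_mb": process.memory_info().rss / (1024 * 1024),
        "open_files": len(process.open_files()),
    }

if __name__ == "__main__":
    print(snapshot())
```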
Data Flow
- Input Processing: Web content crawled and processed
- Multi-Modal Analysis: Text, images, and structured data extracted
- Embedding Generation: Content converted to vector representations
- LLM Processing: Intelligent analysis and response generation
- Quality Assurance: Automated testing and evaluation
- Evolution: Continuous improvement based on feedback
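A schematic, runnable stand-in for this flow is sketched below; every function is a hypothetical placeholder for the corresponding project module.

```python
# Schematic stand-in for the flow above; every function is a hypothetical
# placeholder for the corresponding project module.
import asyncio

async def crawl(url):                    # 1. Input Processing
    return f"<html>content of {url}</html>"

def extract(raw):                        # 2. Multi-Modal Analysis
    return raw.strip()

def embed(text):                         # 3. Embedding Generation
    return [float(len(text))]

async def generate(text, vectors):       # 4. LLM Processing
    return f"summary built from {len(text)} characters"

def evaluate(answer):                    # 5. Quality Assurance
    return bool(answer)

async def process_url(url):
    raw = await crawl(url)
    text = extract(raw)
    vectors = embed(text)
    answer = await generate(text, vectors)
    # 6. Evolution: in the real system, evaluation results feed back into
    # the improvement loop rather than just being returned.
    return {"answer": answer, "passed": evaluate(answer)}

print(asyncio.run(process_url("https://example.com")))
```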
🔧 Key Modules
Web Crawling & Content Processing
Embedding System
LLM Integration
Visual Processing
🛠️ Installation & Setup
Prerequisites
- Python 3.8+
- SQLite (for data storage)
- Optional: GPU support for local model inference
Dependencies
The project takes a modular approach: only the core packages are always required, and provider-specific packages are installed only for the backends you enable.
Core Dependencies:
- crawl4ai: Web crawling framework
- sqlite3: Database operations
- asyncio: Asynchronous operations
- json: Data serialization
Optional Provider Dependencies:
- openai: OpenAI API integration
- cohere: Cohere API integration
- transformers: HuggingFace models
- torch: PyTorch for local inference
- ollama: Local LLM serving
Configuration
- Set up API keys for external providers (OpenAI, Cohere)
- Configure database connection strings
- Set resource limits and security policies
- Initialize embedding and LLM providers
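One plausible shape for this configuration step, assuming environment variables as the source of settings, is sketched below; the variable names and defaults are illustrative.

```python
# One plausible configuration step; the environment variable names and
# defaults are illustrative assumptions.
import os

config = {
    "openai_api_key": os.environ.get("OPENAI_API_KEY", ""),
    "cohere_api_key": os.environ.get("COHERE_API_KEY", ""),
    "database_path": os.environ.get("MCP_RAG_DB", "mcp_rag.sqlite3"),
    "llm_provider": os.environ.get("LLM_PROVIDER", "openai"),
    "embedding_provider": os.environ.get("EMBEDDING_PROVIDER", "openai"),
    "sandbox_max_memory_mb": int(os.environ.get("SANDBOX_MAX_MEMORY_MB", "1024")),
}

missing_keys = [k for k, v in config.items() if k.endswith("_api_key") and not v]
if missing_keys:
    print(f"Warning: {missing_keys} not set; the corresponding remote providers will be unavailable.")
```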
📊 Performance Metrics
Codebase Statistics:
- Total Files: 41 files
- Total Lines of Code: 9,602 lines
- Primary Language: Python (32 files)
- Database Scripts: 2 SQL files
- Documentation: Implementation guide included
Component Distribution:
- Core MCP Server: 1,450 lines (largest component)
- Utilities: 850+ lines of helper functions
- Agent System: 6 specialized agents
- Provider Integrations: 15+ provider implementations
- Testing & Evaluation: Comprehensive test suites
🔍 Usage Examples
Basic Web Crawling
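A minimal example using Crawl4AI directly, assuming the library is used outside the MCP server's own tool wrappers (whose exact names are not documented here):

```python
# Minimal Crawl4AI usage, assuming direct use of the library rather than
# the MCP server's tool wrappers.
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com")
        print(str(result.markdown)[:500])  # first 500 characters of extracted markdown

asyncio.run(main())
```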
Multi-Modal Content Processing
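Building on the same crawl result, the sketch below pulls out text, image, and link data; the media and links fields follow Crawl4AI's documented result object but should be verified against the installed version.

```python
# Sketch of multi-modal extraction from a crawl result. The media and
# links fields follow Crawl4AI's documented result object; verify them
# against the installed version.
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com")
        text = str(result.markdown)
        images = result.media.get("images", [])    # structured image metadata
        links = result.links.get("internal", [])   # in-site links found on the page
        print(f"{len(text)} characters, {len(images)} images, {len(links)} internal links")

asyncio.run(main())
```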
Automated Code Evolution
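Conceptually, an evolution cycle applies a candidate change, runs the test suite, and keeps the change only if the tests still pass. The helper below is a hypothetical illustration of that loop, not the project's orchestrator API.

```python
# Conceptual sketch of one evolution cycle, not the project's orchestrator
# API: apply a candidate change, run the tests, keep the change only if
# they still pass.
import subprocess

def tests_pass() -> bool:
    """Run the test suite in a subprocess and report success."""
    result = subprocess.run(["pytest", "-q"], capture_output=True, text=True)
    return result.returncode == 0

def evolution_cycle(apply_patch, revert_patch) -> bool:
    apply_patch()
    if tests_pass():
        return True      # keep the improvement
    revert_patch()       # roll back on regression
    return False
```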
🧪 Testing & Quality Assurance
The project includes comprehensive testing frameworks:
- Automated Testing: Continuous integration testing
- Correctness Evaluation: Code quality metrics
- Integration Testing: Cross-component validation
- Performance Testing: Resource usage monitoring
- Security Testing: Sandbox validation
🔒 Security Features
- Sandboxed Execution: Isolated code execution environment
- Resource Limits: CPU, memory, and time constraints
- Access Control: Restricted file system and network access
- Input Validation: Comprehensive input sanitization
- Audit Logging: Complete operation tracking
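A common pattern for this kind of sandbox, shown below as an illustration only (not the project's implementation), is to run untrusted code in a subprocess with limits applied through the standard-library resource module on POSIX systems.

```python
# Illustrative sandbox pattern, not the project's implementation: run
# untrusted code in a subprocess with CPU-time and memory limits applied
# through the standard-library resource module (POSIX only).
import resource
import subprocess
import sys

def _apply_limits():
    resource.setrlimit(resource.RLIMIT_CPU, (5, 5))  # at most 5 seconds of CPU time
    limit = 512 * 1024 * 1024
    resource.setrlimit(resource.RLIMIT_AS, (limit, limit))  # 512 MB address space

def run_sandboxed(code: str, timeout: float = 10.0) -> subprocess.CompletedProcess:
    return subprocess.run(
        [sys.executable, "-c", code],
        preexec_fn=_apply_limits,  # limits applied in the child before exec
        capture_output=True,
        text=True,
        timeout=timeout,
    )

if __name__ == "__main__":
    print(run_sandboxed("print('hello from the sandbox')").stdout)
```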
🔄 Evolution & Maintenance
The project includes automated evolution capabilities:
- Version Control Integration: Automated git operations
- Dependency Management: Smart dependency updates
- Code Quality Improvement: Automated refactoring
- Performance Optimization: Continuous performance tuning
- Documentation Updates: Automated documentation generation
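As an example of how the version-control piece might be automated with GitPython (an assumption; the project could equally shell out to git directly):

```python
# One way the version-control step might be automated with GitPython; this
# is an assumption, not the project's actual implementation.
from git import Repo

def commit_improvement(repo_path: str, message: str) -> str:
    """Stage all changes and commit them, returning the new commit hash."""
    repo = Repo(repo_path)
    if not repo.is_dirty(untracked_files=True):
        return "nothing to commit"
    repo.git.add(all=True)
    commit = repo.index.commit(message)
    return commit.hexsha
```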
📈 Future Enhancements
- Multi-Agent Collaboration: Enhanced agent coordination
- Real-time Processing: Streaming data processing
- Advanced Visual Understanding: Improved multi-modal capabilities
- Distributed Processing: Scalable architecture
- Custom Model Training: Domain-specific model fine-tuning
🤝 Contributing
This project follows a modular architecture that makes it easy to extend:
- Adding New Providers: Implement the base interface for LLM or embedding providers
- Creating New Agents: Extend the BaseAgent class for specialized functionality
- Enhancing Security: Add new sandbox policies and security measures
- Improving Performance: Optimize resource management and processing pipelines
📝 License & Documentation
For detailed implementation information, refer to the Implementation.md file included in the repository. The project maintains comprehensive documentation for all major components and provides examples for common use cases.
This documentation was automatically generated from codebase analysis and provides a comprehensive overview of the MCP-RAG system architecture and capabilities.