When knowledge graph functionality is enabled, the server can parse GitHub repositories into Neo4j for code structure analysis, enabling validation of AI-generated code against real implementations.
Optional knowledge graph integration for parsing GitHub repositories, validating AI-generated code against real implementations, and detecting AI hallucinations in code generation.
Provides embedding generation for vector search and optional features like contextual embeddings, code example extraction, and summaries when advanced RAG strategies are enabled.
Integrated private search engine for web queries, with customizable engines (Google, Bing, DuckDuckGo), enabling AI agents to perform secure web searches directly through the MCP server.
Vector database integration for storing crawled content, performing semantic similarity searches, and enabling RAG (Retrieval Augmented Generation) capabilities with optional source filtering.
Web Crawling, Search and RAG Capabilities for AI Agents and AI Coding Assistants
Forked from https://github.com/coleam00/mcp-crawl4ai-rag, with added SearXNG integration and batch scraping and processing capabilities.
A self-contained Docker solution that combines the Model Context Protocol (MCP), Crawl4AI, SearXNG, and Supabase to provide AI agents and coding assistants with complete web search, crawling, and RAG capabilities.
🚀 Complete Stack in One Command: Deploy everything with docker compose up -d. No Python setup and no local dependencies beyond Docker are required (a Supabase project and an OpenAI API key are still needed; see Prerequisites).
🎯 Smart RAG vs Traditional Scraping
Unlike traditional scraping (such as Firecrawl) that dumps raw content and overwhelms LLM context windows, this solution uses intelligent RAG (Retrieval Augmented Generation) to:
- 🔍 Extract only relevant content using semantic similarity search
- ⚡ Prevent context overflow by returning focused, pertinent information
- 🧠 Enhance AI responses with precisely targeted knowledge
- 📊 Maintain context efficiency for better LLM performance
Flexible Output Options:
- RAG Mode (default): Returns semantically relevant chunks with similarity scores
- Raw Markdown Mode: Full content extraction when complete context is needed
- Hybrid Search: Combines semantic and keyword search for comprehensive results
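As a rough illustration of what RAG mode returns, the sketch below ranks stored chunks by cosine similarity to a query embedding and optionally filters by source. The function names, the dictionary layout, and the in-memory list are assumptions made for the example; the actual server stores embeddings in Supabase and runs the similarity search there.

```python
import math
from typing import Optional


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rag_query(
    query_vec: list[float],
    chunks: list[dict],
    top_k: int = 3,
    source: Optional[str] = None,
) -> list[dict]:
    """Return the top_k chunks most similar to query_vec, each with its score.

    Each chunk is a dict with hypothetical keys "content", "source", "embedding".
    """
    # Optional source filter: keep only chunks crawled from one source.
    candidates = [c for c in chunks if source is None or c["source"] == source]
    scored = [
        {**c, "similarity": cosine_similarity(query_vec, c["embedding"])}
        for c in candidates
    ]
    scored.sort(key=lambda c: c["similarity"], reverse=True)
    return scored[:top_k]
```

Raw Markdown mode would skip this ranking step entirely and return the scraped page content as-is.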
💡 Key Benefits
- 🔧 Zero Configuration: Pre-configured SearXNG instance included
- 🐳 Docker-Only: No Python environment setup required
- 🔍 Integrated Search: Built-in SearXNG for private, fast search
- ⚡ Production Ready: HTTPS, security, and monitoring included
- 🎯 AI-Optimized: RAG strategies built for coding assistants
Overview
This Docker-based MCP server provides a complete web intelligence stack that enables AI agents to:
- Search the web using the integrated SearXNG instance
- Crawl and scrape websites with advanced content extraction
- Store content in vector databases with intelligent chunking
- Perform RAG queries with multiple enhancement strategies
Advanced RAG Strategies Available:
- Contextual Embeddings for enriched semantic understanding
- Hybrid Search combining vector and keyword search
- Agentic RAG for specialized code example extraction
- Reranking for improved result relevance using cross-encoder models
- Knowledge Graph for AI hallucination detection and repository code analysis
See the Configuration section below for details on how to enable and configure these strategies.
Features
- Smart URL Detection: Automatically detects and handles different URL types (regular webpages, sitemaps, text files)
- Recursive Crawling: Follows internal links to discover content
- Parallel Processing: Efficiently crawls multiple pages simultaneously
- Content Chunking: Intelligently splits content by headers and size for better processing
- Vector Search: Performs RAG over crawled content, optionally filtering by data source for precision
- Source Retrieval: Retrieve sources available for filtering to guide the RAG process
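The URL detection and header-based chunking described above could look roughly like the following. This is a minimal sketch, not the server's code: detect_url_type, chunk_markdown, and the max_chars default are hypothetical, and the header check is deliberately naive (a "#" line inside a code fence would also start a new section).

```python
from urllib.parse import urlparse


def detect_url_type(url: str) -> str:
    """Classify a URL as 'sitemap', 'text_file', or 'webpage' from its path."""
    path = urlparse(url).path.lower()
    if "sitemap" in path and path.endswith(".xml"):
        return "sitemap"
    if path.endswith(".txt"):
        return "text_file"
    return "webpage"


def chunk_markdown(text: str, max_chars: int = 1000) -> list[str]:
    """Split markdown at header lines, then cut oversized sections by size."""
    sections: list[str] = []
    current: list[str] = []
    for line in text.splitlines():
        # A line starting with '#' is treated as a header opening a new section.
        if line.startswith("#") and current:
            sections.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current))

    chunks: list[str] = []
    for section in sections:
        # Sections longer than max_chars are cut into fixed-size pieces.
        for start in range(0, len(section), max_chars):
            piece = section[start:start + max_chars].strip()
            if piece:
                chunks.append(piece)
    return chunks
```

Splitting on headers first keeps each chunk topically coherent; the size cap only applies when a single section is too long to embed in one piece.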
Tools
The server provides essential web crawling and search tools:
Core Tools (Always Available)
- scrape_urls: Scrape one or more URLs and store their content in the vector database. Supports both single URLs and lists of URLs for batch processing.
- smart_crawl_url: Intelligently crawl a full website based on the type of URL provided (sitemap, llms-full.txt, or a regular webpage that needs to be crawled recursively).
- get_available_sources: Get a list of all available sources (domains) in the database.
- perform_rag_query: Search for relevant content using semantic search with optional source filtering.
- search (NEW!): Comprehensive web search tool that integrates SearXNG search with automated scraping and RAG processing. Performs a complete workflow: (1) searches SearXNG with the provided query, (2) extracts URLs from search results, (3) automatically scrapes all found URLs using existing scraping infrastructure, (4) stores content in vector database, and (5) returns either RAG-processed results organized by URL or raw markdown content. Ideal for research workflows, competitive analysis, and content discovery with built-in intelligence. Key parameters:
  - query: search terms
  - return_raw_markdown: bypasses RAG for raw content
  - num_results: search result limit
  - batch_size: database operation batching
  - max_concurrent: parallel scraping sessions
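The five-step search workflow can be sketched as a small pipeline. Everything here is illustrative: search_workflow and the four injected callables (search_fn, scrape_fn, store_fn, rag_fn) are hypothetical stand-ins for the SearXNG client, the scraper, the vector store, and the RAG query, and the sketch runs sequentially, so batch_size and max_concurrent are left out.

```python
from typing import Callable


def search_workflow(
    query: str,
    search_fn: Callable[[str], list[dict]],
    scrape_fn: Callable[[str], str],
    store_fn: Callable[[str, str], None],
    rag_fn: Callable[[str, str], list[str]],
    num_results: int = 5,
    return_raw_markdown: bool = False,
) -> dict:
    """Search, scrape, store, then return RAG results or raw markdown per URL."""
    # (1) Query the search engine and (2) pull URLs out of the results.
    results = search_fn(query)[:num_results]
    urls = [r["url"] for r in results if r.get("url")]

    # (3) Scrape every URL and (4) store its content in the vector database.
    pages: dict[str, str] = {}
    for url in urls:
        markdown = scrape_fn(url)
        store_fn(url, markdown)
        pages[url] = markdown

    # (5) Return raw markdown, or RAG-processed results organized by URL.
    if return_raw_markdown:
        return pages
    return {url: rag_fn(query, url) for url in urls}
```

Passing the stages in as callables keeps the sketch runnable without a live SearXNG or Supabase instance; stubs can stand in for each stage.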
Conditional Tools
- search_code_examples (requires USE_AGENTIC_RAG=true): Search specifically for code examples and their summaries from crawled documentation. This tool provides targeted code snippet retrieval for AI coding assistants.
Knowledge Graph Tools (requires USE_KNOWLEDGE_GRAPH=true, see below)
- parse_github_repository: Parse a GitHub repository into a Neo4j knowledge graph, extracting classes, methods, functions, and their relationships for hallucination detection.
- check_ai_script_hallucinations: Analyze Python scripts for AI hallucinations by validating imports, method calls, and class usage against the knowledge graph.
- query_knowledge_graph: Explore and query the Neo4j knowledge graph with commands like repos, classes, methods, and custom Cypher queries.
Prerequisites
Required:
- Docker and Docker Compose - This is a Docker-only solution
- Supabase account - For vector database and RAG functionality
- OpenAI API key - For generating embeddings
Optional:
- Neo4j instance - For knowledge graph functionality (see Knowledge Graph Setup)
- Custom domain - For production HTTPS deployment
Installation
This is a Docker-only solution - no Python environment setup required!
Quick Start
- Clone this repository:
- Configure environment:
- Deploy the complete stack:
That's it! Your complete search, crawl, and RAG stack is now running:
- MCP Server: http://localhost:8051
- SearXNG Search: http://localhost:8080 (internal)
- Caddy Proxy: Handles HTTPS and routing
What Gets Deployed
The Docker Compose stack includes:
- MCP Crawl4AI Server - Main application server
- SearXNG - Private search engine instance
- Valkey - Redis-compatible cache for SearXNG
- Caddy - Reverse proxy with automatic HTTPS
Database Setup IMPORTANT!
Before running the server, you need to set up the database with the pgvector extension:
- Go to the SQL Editor in your Supabase dashboard (create a new project first if necessary)
- Create a new query and paste the contents of
crawled_pages.sql
- Run the query to create the necessary tables and functions
Knowledge Graph Setup (Optional)
To enable AI hallucination detection and repository analysis features, you need to set up Neo4j.
Note: The knowledge graph functionality works fully with Docker and supports all features.
Neo4j Setup Options
Option 1: Local AI Package (Recommended)
The easiest way to get Neo4j running is with the Local AI Package:
- Clone and start Neo4j:
- Connection details for Docker:
  - URI: bolt://host.docker.internal:7687 (for Docker containers)
  - URI: bolt://localhost:7687 (for local access)
  - Username: neo4j
  - Password: Check Local AI Package documentation
Option 2: Neo4j Docker
Run Neo4j directly with Docker:
Option 3: Neo4j Desktop
Use Neo4j Desktop for a local GUI-based installation:
- Download and install: Neo4j Desktop
- Create a new database with your preferred settings
- Connection details:
  - URI: bolt://host.docker.internal:7687 (for Docker containers)
  - URI: bolt://localhost:7687 (for local access)
  - Username: neo4j
  - Password: Whatever you set during database creation
Configuration
Configure the Docker stack by editing your .env file (copy from .env.example):
Key Configuration Notes
🔍 SearXNG Integration: The stack includes a pre-configured SearXNG instance that runs automatically. No external setup required!
🐳 Docker Networking: The default configuration uses Docker internal networking (http://searxng:8080), which works out of the box.
🔐 Production Setup: For production, set SEARXNG_HOSTNAME to your domain and SEARXNG_TLS to your email for automatic HTTPS.
RAG Strategy Options
The Crawl4AI RAG MCP server supports five RAG strategies that can be enabled independently:
1. USE_CONTEXTUAL_EMBEDDINGS
When enabled, this strategy enhances each chunk's embedding with additional context from the entire document. The system passes both the full document and the specific chunk to an LLM (configured via MODEL_CHOICE) to generate enriched context that gets embedded alongside the chunk content.
- When to use: Enable this when you need high-precision retrieval where context matters, such as technical documentation where terms might have different meanings in different sections.
- Trade-offs: Slower indexing due to LLM calls for each chunk, but significantly better retrieval accuracy.
- Cost: Additional LLM API calls during indexing.
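A minimal sketch of the contextual-embedding step, assuming the LLM is reachable through a plain callable. The prompt wording, the max_doc_chars truncation, and the "---" separator are assumptions for illustration and are not taken from the server's source; the llm argument stands in for whatever model MODEL_CHOICE selects.

```python
from typing import Callable


def build_context_prompt(document: str, chunk: str, max_doc_chars: int = 25000) -> str:
    """Build a prompt asking the LLM to situate a chunk within its document."""
    return (
        "<document>\n" + document[:max_doc_chars] + "\n</document>\n"
        "<chunk>\n" + chunk + "\n</chunk>\n"
        "Give a short context that situates this chunk within the document, "
        "to improve search retrieval of the chunk. Answer with the context only."
    )


def contextualize_chunk(document: str, chunk: str, llm: Callable[[str], str]) -> str:
    """Return the text to embed: LLM-generated context followed by the chunk."""
    context = llm(build_context_prompt(document, chunk)).strip()
    # Fall back to the bare chunk if the model returns nothing useful.
    if not context:
        return chunk
    return context + "\n---\n" + chunk
```

The string returned by contextualize_chunk is what would be sent to the embedding model, so one extra LLM call is made per chunk at indexing time.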
2. USE_HYBRID_SEARCH
Combines traditional keyword search with semantic vector search to provide more comprehensive results. The system performs both searches in parallel and intelligently merges results, prioritizing documents that appear in both result sets.
- When to use: Enable this when users might search using specific technical terms, function names, or when exact keyword matches are important alongside semantic understanding.
- Trade-offs: Slightly slower search queries but more robust results, especially for technical content.
- Cost: No additional API costs, just computational overhead.
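One way to merge the two result sets so that documents found by both searches come first is sketched below. The merge_hybrid name and the exact tie-breaking order (vector order first, then keyword-only hits) are assumptions; the source only states that documents appearing in both sets are prioritized.

```python
def merge_hybrid(vector_hits: list[str], keyword_hits: list[str], limit: int = 5) -> list[str]:
    """Merge ranked document IDs, putting IDs found by both searches first."""
    keyword_set = set(keyword_hits)
    merged: list[str] = []
    seen: set[str] = set()

    def add(doc_ids) -> None:
        # Append IDs in order, skipping anything already merged.
        for doc_id in doc_ids:
            if doc_id not in seen:
                seen.add(doc_id)
                merged.append(doc_id)

    add(d for d in vector_hits if d in keyword_set)  # in both result sets
    add(vector_hits)                                 # vector-only hits
    add(keyword_hits)                                # keyword-only hits
    return merged[:limit]
```

For example, with vector hits a, b, c and keyword hits c, d, a, the merged order is a, c, b, d.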
3. USE_AGENTIC_RAG
Enables specialized code example extraction and storage. When crawling documentation, the system identifies code blocks (≥300 characters), extracts them with surrounding context, generates summaries, and stores them in a separate vector database table specifically designed for code search.
- When to use: Essential for AI coding assistants that need to find specific code examples, implementation patterns, or usage examples from documentation.
- Trade-offs: Significantly slower crawling due to code extraction and summarization, requires more storage space.
- Cost: Additional LLM API calls for summarizing each code example.
- Benefits: Provides a dedicated search_code_examples tool that AI agents can use to find specific code implementations.
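The code-block identification step might be sketched as follows: find fenced blocks in the crawled markdown, keep those of at least 300 characters, and capture some surrounding text. The function name, the regular expression, and the 200-character context window are assumptions for the example; summarization and storage are not shown.

```python
import re

# Built from pieces so the fence marker never appears literally in this file.
FENCE_MARK = "`" * 3
FENCE = re.compile(FENCE_MARK + r"([\w+-]*)\n(.*?)" + FENCE_MARK, re.DOTALL)


def extract_code_blocks(markdown: str, min_length: int = 300, context_chars: int = 200) -> list[dict]:
    """Return fenced code blocks of at least min_length characters with context."""
    blocks: list[dict] = []
    for match in FENCE.finditer(markdown):
        code = match.group(2).strip()
        if len(code) < min_length:
            continue  # short snippets are skipped
        blocks.append({
            "language": match.group(1),
            "code": code,
            # Text just before and after the block, for later summarization.
            "context_before": markdown[max(0, match.start() - context_chars):match.start()].strip(),
            "context_after": markdown[match.end():match.end() + context_chars].strip(),
        })
    return blocks
```

Each returned dictionary would then be summarized by the LLM and stored in the separate code-example table.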
4. USE_RERANKING
Applies cross-encoder reranking to search results after initial retrieval. Uses a lightweight cross-encoder model (cross-encoder/ms-marco-MiniLM-L-6-v2) to score each result against the original query, then reorders results by relevance.
- When to use: Enable this when search precision is critical and you need the most relevant results at the top. Particularly useful for complex queries where semantic similarity alone might not capture query intent.
- Trade-offs: Adds ~100-200ms to search queries depending on result count, but significantly improves result ordering.
- Cost: No additional API costs - uses a local model that runs on CPU.
- Benefits: Better result relevance, especially for complex queries. Works with both regular RAG search and code example search.
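The reranking step reduces to scoring (query, passage) pairs and sorting by score. The sketch below takes the scorer as a parameter so it runs without downloading a model; rerank, the "content" key, and "rerank_score" are hypothetical names. With the sentence-transformers package installed, the predict method of CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2") could be passed as score_fn, since it accepts a list of text pairs and returns one score per pair.

```python
from typing import Callable, Sequence


def rerank(
    query: str,
    results: list[dict],
    score_fn: Callable[[list[tuple[str, str]]], Sequence[float]],
) -> list[dict]:
    """Reorder results by a cross-encoder style relevance score, best first."""
    if not results:
        return []
    # One (query, passage) pair per result, scored in a single batch.
    pairs = [(query, r["content"]) for r in results]
    scores = score_fn(pairs)
    ranked = [{**r, "rerank_score": float(s)} for r, s in zip(results, scores)]
    ranked.sort(key=lambda r: r["rerank_score"], reverse=True)
    return ranked
```

Because the scorer sees the query and the passage together, it can separate results that have similar embeddings but differ in how well they answer the query.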
5. USE_KNOWLEDGE_GRAPH
Enables AI hallucination detection and repository analysis using Neo4j knowledge graphs. When enabled, the system can parse GitHub repositories into a graph database and validate AI-generated code against real repository structures. Fully compatible with Docker - all functionality works within the containerized environment.
- When to use: Enable this for AI coding assistants that need to validate generated code against real implementations, or when you want to detect when AI models hallucinate non-existent methods, classes, or incorrect usage patterns.
- Trade-offs: Requires Neo4j setup and additional dependencies. Repository parsing can be slow for large codebases, and validation requires repositories to be pre-indexed.
- Cost: No additional API costs for validation, but requires Neo4j infrastructure (can use free local installation or cloud AuraDB).
- Benefits: Provides three powerful tools: parse_github_repository for indexing codebases, check_ai_script_hallucinations for validating AI-generated code, and query_knowledge_graph for exploring indexed repositories.
Usage with MCP Tools:
You can tell the AI coding assistant to add a Python GitHub repository to the knowledge graph:
"Add https://github.com/pydantic/pydantic-ai.git to the knowledge graph"
Make sure the repo URL ends with .git.
You can also have the AI coding assistant check for hallucinations with scripts it creates using the MCP check_ai_script_hallucinations tool.
Recommended Configurations
For general documentation RAG:
For AI coding assistant with code examples:
For AI coding assistant with hallucination detection:
For fast, basic RAG:
Running the Server
The complete stack is managed through Docker Compose:
Start the Stack
View Logs
Stop the Stack
Restart Services
The MCP server will be available at http://localhost:8051 for SSE connections.
Integration with MCP Clients
After starting the Docker stack with docker compose up -d, your MCP server will be available for integration.
SSE Configuration (Recommended)
The Docker stack runs with SSE transport by default. Connect using:
Claude Desktop/Windsurf:
Windsurf (alternative syntax):
Claude Code CLI:
Docker Networking Notes
- Same machine: Use http://localhost:8051/sse
- Different container: Use http://host.docker.internal:8051/sse
- Remote access: Replace localhost with your server's IP address
Production Deployment
For production use with custom domains:
- Update your .env:
- Access via HTTPS:
Health Check
Verify the server is running:
Knowledge Graph Architecture
The knowledge graph system stores repository code structure in Neo4j with the following components:
Core Components (knowledge_graphs/ folder):
- parse_repo_into_neo4j.py: Clones and analyzes GitHub repositories, extracting Python classes, methods, functions, and imports into Neo4j nodes and relationships
- ai_script_analyzer.py: Parses Python scripts using AST to extract imports, class instantiations, method calls, and function usage
- knowledge_graph_validator.py: Validates AI-generated code against the knowledge graph to detect hallucinations (non-existent methods, incorrect parameters, etc.)
- hallucination_reporter.py: Generates comprehensive reports about detected hallucinations with confidence scores and recommendations
- query_knowledge_graph.py: Interactive CLI tool for exploring the knowledge graph (functionality now integrated into MCP tools)
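To give a feel for the AST-based analysis step, here is a minimal sketch using Python's standard ast module. It is not the code in ai_script_analyzer.py; analyze_script and the shape of its return value are assumptions, and it collects far less detail than a real analyzer would (no receiver types, no arguments).

```python
import ast


def analyze_script(source: str) -> dict:
    """Collect imports, plain calls, and attribute (method) calls from a script."""
    tree = ast.parse(source)
    imports: list[str] = []
    calls: list[str] = []
    method_calls: list[str] = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            imports.extend(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom):
            module = node.module or ""
            imports.extend(module + "." + alias.name for alias in node.names)
        elif isinstance(node, ast.Call):
            func = node.func
            if isinstance(func, ast.Name):
                # Plain call such as Agent(...) or print(...).
                calls.append(func.id)
            elif isinstance(func, ast.Attribute):
                # Attribute call such as agent.run_sync(...).
                method_calls.append(func.attr)
    # Sorted so the output does not depend on traversal order.
    return {
        "imports": sorted(imports),
        "calls": sorted(calls),
        "method_calls": sorted(method_calls),
    }
```

The extracted names are the raw material for validation: each import, class instantiation, and method call can then be looked up in the knowledge graph.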
Knowledge Graph Schema:
The Neo4j database stores code structure as:
Nodes:
- Repository: GitHub repositories
- File: Python files within repositories
- Class: Python classes with methods and attributes
- Method: Class methods with parameter information
- Function: Standalone functions
- Attribute: Class attributes
Relationships:
- Repository -[]-> File
- File -[]-> Class
- File -[]-> Function
- Class -[]-> Method
- Class -[]-> Attribute
Workflow:
- Repository Parsing: Use parse_github_repository tool to clone and analyze open-source repositories
- Code Validation: Use check_ai_script_hallucinations tool to validate AI-generated Python scripts
- Knowledge Exploration: Use query_knowledge_graph tool to explore available repositories, classes, and methods
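The validation step in this workflow amounts to checking each class and method a script uses against what was indexed. The sketch below uses an in-memory dictionary as a stand-in for the Neo4j graph; find_hallucinated_methods and the data shapes are hypothetical, and the real validator also checks things this sketch ignores, such as parameters.

```python
def find_hallucinated_methods(
    usages: list[tuple[str, str]],
    known: dict[str, set[str]],
) -> list[str]:
    """Report (class, method) usages that do not exist in the indexed code.

    usages: (class_name, method_name) pairs found in the AI-generated script.
    known:  class name -> set of method names, standing in for the graph.
    """
    problems: list[str] = []
    for class_name, method in usages:
        if class_name not in known:
            problems.append("unknown class: " + class_name)
        elif method not in known[class_name]:
            problems.append("unknown method: " + class_name + "." + method)
    return problems
```

An empty result means every usage was found. Classes from repositories that were never parsed are reported as unknown, which is why validation requires repositories to be pre-indexed.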
Troubleshooting
Docker Issues
Container won't start:
SearXNG not accessible:
Port conflicts:
Common Configuration Issues
Environment variables not loading:
- Ensure .env file is in the same directory as docker-compose.yml
- Verify no spaces around = in .env file
- Check for special characters that need quoting
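A quick way to catch the "spaces around =" problem is to scan the file before starting the stack. This is a small standalone sketch; find_env_problems is a hypothetical helper, not part of the repository, and it checks only the two issues shown (it does not validate quoting).

```python
def find_env_problems(text: str) -> list[tuple[int, str]]:
    """Return (line_number, problem) pairs for suspicious lines in a .env file."""
    problems: list[tuple[int, str]] = []
    for number, line in enumerate(text.splitlines(), start=1):
        stripped = line.strip()
        # Blank lines and comments are fine.
        if not stripped or stripped.startswith("#"):
            continue
        if "=" not in stripped:
            problems.append((number, "missing '='"))
            continue
        key, _, value = stripped.partition("=")
        # Whitespace directly before or after the first '=' is flagged.
        if key != key.rstrip() or value != value.lstrip():
            problems.append((number, "spaces around '='"))
    return problems
```

To check a real file, read it first, for example with pathlib.Path(".env").read_text(), and pass the contents to the function.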
API connection failures:
- Verify OPENAI_API_KEY is valid and has credits
- Check SUPABASE_URL and SUPABASE_SERVICE_KEY are correct
- Test API connectivity from within container:
Neo4j connection issues:
- Use host.docker.internal:7687 instead of localhost:7687 for Neo4j running on host
- Verify Neo4j is running and accessible
- Check firewall settings for port 7687
Performance Optimization
Memory usage:
Disk space:
Getting Help
- Check logs first: docker compose logs -f
- Verify configuration: docker compose config
- Test connectivity: Use curl commands shown above
- Reset everything: docker compose down -v && docker compose up -d
Development & Customization
This Docker stack provides a foundation for building more complex MCP servers:
- Modify the MCP server: Edit files in src/ and rebuild: docker compose build mcp-crawl4ai
- Add custom tools: Extend src/crawl4ai_mcp.py with @mcp.tool() decorators
- Customize SearXNG: Edit searxng/settings.yml and restart
- Add services: Extend docker-compose.yml with additional containers