Crawl4AI+SearXNG MCP Server
Web Crawling, Search, and RAG Capabilities for AI Agents and AI Coding Assistants
(Forked from coleam00/mcp-crawl4ai-rag)
A self-contained Docker solution that combines the Model Context Protocol (MCP), Crawl4AI, SearXNG, and Supabase to provide AI agents and coding assistants with complete web search, crawling, and RAG capabilities.
Complete Stack in One Command: Deploy everything with `make prod` - no Python setup, no dependencies, no external services required.
Smart RAG vs Traditional Scraping
Unlike traditional scraping (such as Firecrawl) that dumps raw content and overwhelms LLM context windows, this solution uses intelligent RAG (Retrieval Augmented Generation) to:
Extract only relevant content using semantic similarity search
Prevent context overflow by returning focused, pertinent information
Enhance AI responses with precisely targeted knowledge
Maintain context efficiency for better LLM performance
Flexible Output Options:
RAG Mode (default): Returns semantically relevant chunks with similarity scores
Raw Markdown Mode: Full content extraction when complete context is needed
Hybrid Search: Combines semantic and keyword search for comprehensive results
Key Benefits
Zero Configuration: Pre-configured SearXNG instance included
Docker-Only: No Python environment setup required
Integrated Search: Built-in SearXNG for private, fast search
Production Ready: HTTPS, security, and monitoring included
AI-Optimized: RAG strategies built for coding assistants
Project Roadmap
Current Focus: Agentic Search (Highest Priority)
We are implementing an intelligent, iterative search system that combines local knowledge, web search, and LLM-driven decision making to provide comprehensive answers while minimizing costs.
Why this matters:
Unique value proposition - no other MCP server offers this
50-70% cost reduction through selective crawling
High-quality, complete answers without manual iteration
Positions this as the most advanced RAG-MCP solution
Full Roadmap: See docs/PROJECT_ROADMAP.md - the single source of truth for all development priorities.
Architecture: See docs/AGENTIC_SEARCH_ARCHITECTURE.md for technical details.
Overview
This Docker-based MCP server provides a complete web intelligence stack that enables AI agents to:
Search the web using the integrated SearXNG instance
Crawl and scrape websites with advanced content extraction
Store content in vector databases with intelligent chunking
Perform RAG queries with multiple enhancement strategies
Advanced RAG Strategies Available:
Contextual Embeddings for enriched semantic understanding
Hybrid Search combining vector and keyword search
Agentic RAG for specialized code example extraction
Reranking for improved result relevance using cross-encoder models
Knowledge Graph for AI hallucination detection and repository code analysis
See the Configuration section below for details on how to enable and configure these strategies.
Features
Contextual Embeddings: Enhanced RAG with LLM-generated context for each chunk, improving search accuracy by 20-30% (Learn more)
Smart URL Detection: Automatically detects and handles different URL types (regular webpages, sitemaps, text files)
Recursive Crawling: Follows internal links to discover content
Parallel Processing: Efficiently crawls multiple pages simultaneously
Content Chunking: Intelligently splits content by headers and size for better processing
Vector Search: Performs RAG over crawled content, optionally filtering by data source for precision
Source Retrieval: List the sources available for filtering, to guide RAG queries
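As an illustration of the header-and-size chunking described above, here is a minimal sketch (assumed behavior, not the server's actual implementation):

```python
import re

def chunk_markdown(text: str, max_chars: int = 1000) -> list[str]:
    """Split markdown on header lines, then cap each chunk's size."""
    # Split immediately before every markdown header (# through ######).
    sections = re.split(r"(?=^#{1,6} )", text, flags=re.MULTILINE)
    chunks: list[str] = []
    for section in sections:
        section = section.strip()
        if not section:
            continue
        # Oversized sections are further split at paragraph boundaries.
        while len(section) > max_chars:
            cut = section.rfind("\n\n", 0, max_chars)
            if cut <= 0:
                cut = max_chars  # no paragraph break found: hard cut
            chunks.append(section[:cut].strip())
            section = section[cut:].strip()
        if section:
            chunks.append(section)
    return chunks
```

Keeping each chunk under a size budget is what prevents the context-overflow problem the RAG comparison above describes.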
Tools
The server provides essential web crawling and search tools:
Core Tools (Always Available)
scrape_urls: Scrape one or more URLs and store their content in the vector database. Supports both single URLs and lists of URLs for batch processing.
smart_crawl_url: Intelligently crawl a full website based on the type of URL provided (sitemap, llms-full.txt, or a regular webpage that needs to be crawled recursively)
get_available_sources: Get a list of all available sources (domains) in the database
perform_rag_query: Search for relevant content using semantic search with optional source filtering
NEW! search: Comprehensive web search tool that integrates SearXNG search with automated scraping and RAG processing. Performs a complete workflow: (1) searches SearXNG with the provided query, (2) extracts URLs from search results, (3) automatically scrapes all found URLs using existing scraping infrastructure, (4) stores content in the vector database, and (5) returns either RAG-processed results organized by URL or raw markdown content. Key parameters: `query` (search terms), `return_raw_markdown` (bypasses RAG for raw content), `num_results` (search result limit), `batch_size` (database operation batching), `max_concurrent` (parallel scraping sessions). Ideal for research workflows, competitive analysis, and content discovery with built-in intelligence.
Conditional Tools
search_code_examples (requires USE_AGENTIC_RAG=true): Search specifically for code examples and their summaries from crawled documentation. This tool provides targeted code snippet retrieval for AI coding assistants.
Knowledge Graph Tools (requires USE_KNOWLEDGE_GRAPH=true, see below)
NEW: Multi-Language Repository Parsing - The system now supports comprehensive analysis of repositories containing Python, JavaScript, TypeScript, Go, and other languages. See the Multi-Language Parsing Documentation for complete details.
parse_github_repository: Parse a GitHub repository into a Neo4j knowledge graph, extracting classes, methods, functions, and their relationships across multiple programming languages (Python, JavaScript, TypeScript, Go, etc.)
parse_local_repository: Parse local Git repositories directly without cloning, supporting multi-language codebases
parse_repository_branch: Parse specific branches of repositories for version-specific analysis
analyze_code_cross_language: NEW! Perform semantic search across multiple programming languages to find similar patterns (e.g., "authentication logic" across Python, JavaScript, and Go)
check_ai_script_hallucinations: Analyze Python scripts for AI hallucinations by validating imports, method calls, and class usage against the knowledge graph
query_knowledge_graph: Explore and query the Neo4j knowledge graph with commands like `repos`, `classes`, `methods`, and custom Cypher queries
get_script_analysis_info: Get information about script analysis setup, available paths, and usage instructions for hallucination detection tools
Code Search and Validation
Advanced Neo4j-Qdrant Integration for Reliable AI Code Generation
The system provides sophisticated code search and validation capabilities by combining:
Qdrant: Semantic vector search for finding relevant code examples
Neo4j: Structural validation against parsed repository knowledge graphs
AI Hallucination Detection: Prevents AI from generating non-existent methods or incorrect usage patterns
When to Use Neo4j vs Qdrant
| Use Case | Neo4j (Knowledge Graph) | Qdrant (Vector Search) | Combined Approach |
|----------|-------------------------|------------------------|-------------------|
| Exact Structure Validation | Perfect - validates class/method existence | Cannot verify structure | Best - structure + semantics |
| Semantic Code Search | Limited - no semantic understanding | Perfect - finds similar patterns | Best - validated similarity |
| Hallucination Detection | Good - catches structural errors | Cannot detect fake methods | Best - comprehensive validation |
| Code Discovery | Requires exact names | Perfect - fuzzy semantic search | Best - discovered + validated |
| Performance | Fast for exact queries | Fast for semantic search | Balanced - parallel validation |
Enhanced Tools for Code Search and Validation
14. smart_code_search (requires both USE_KNOWLEDGE_GRAPH=true and USE_AGENTIC_RAG=true)
Intelligent code search that combines Qdrant semantic search with Neo4j structural validation:
Semantic Discovery: Find code patterns using natural language queries
Structural Validation: Verify all code examples against real repository structure
Confidence Scoring: Get reliability scores for each result (0.0-1.0)
Validation Modes: Choose between "fast", "balanced", or "thorough" validation
Intelligent Fallback: Works even when one system is unavailable
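As a hypothetical sketch of how a 0.0-1.0 confidence score could blend the two signals (the server's actual scoring formula is not documented here; the weights are assumptions, only the mode names "fast"/"balanced"/"thorough" come from this README):

```python
def combined_confidence(semantic_score: float, structure_ok: bool,
                        mode: str = "balanced") -> float:
    """Blend Qdrant similarity with Neo4j structural validation."""
    # Weight the structural check more heavily in more thorough modes.
    weights = {"fast": 0.2, "balanced": 0.5, "thorough": 0.8}
    w = weights[mode]
    structural = 1.0 if structure_ok else 0.0
    return round((1 - w) * semantic_score + w * structural, 3)
```

Under this scheme a semantically strong match that fails structural validation is penalized harder as the validation mode gets more thorough.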
15. extract_and_index_repository_code (requires both systems)
Bridge Neo4j knowledge graph data into Qdrant for searchable code examples:
Knowledge Graph Extraction: Pull structured code from Neo4j
Semantic Indexing: Generate embeddings and store in Qdrant
Rich Metadata: Preserve class/method relationships and context
Batch Processing: Efficient indexing of large repositories
16. check_ai_script_hallucinations_enhanced (requires both systems)
Advanced hallucination detection using dual validation:
Neo4j Structural Check: Validate against actual repository structure
Qdrant Semantic Check: Find similar real code examples
Combined Confidence: Merge validation results for higher accuracy
Code Suggestions: Provide corrections from real code examples
Basic Workflow
Index Repository Structure:
`parse_github_repository("https://github.com/pydantic/pydantic-ai.git")`
Extract and Index Code Examples:
`extract_and_index_repository_code("pydantic-ai")`
Search with Validation:
`smart_code_search(query="async function with error handling", source_filter="pydantic-ai", min_confidence=0.7, validation_mode="balanced")`
Validate AI Code:
`check_ai_script_hallucinations_enhanced("/path/to/ai_script.py")`
Using Hallucination Detection Tools
The hallucination detection tools require access to Python scripts. The Docker container includes volume mounts for convenient script analysis:
Script Locations:
`./analysis_scripts/user_scripts/` - Place your Python scripts here (recommended)
`./analysis_scripts/test_scripts/` - For test scripts
`./analysis_scripts/validation_results/` - Results are automatically saved here
Quick Start:
Create a script:
`echo "import pandas as pd" > ./analysis_scripts/user_scripts/test.py`
Run validation: Use the `check_ai_script_hallucinations` tool with `script_path="test.py"`
Check results: View detailed analysis in `./analysis_scripts/validation_results/`
Path Translation: The system automatically translates relative paths to container paths, making it convenient to reference scripts by filename.
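That translation can be pictured with a small sketch (the container mount point `/app/analysis_scripts/user_scripts` is an assumption; only the relative-path-to-container-path behavior is stated by this README):

```python
from pathlib import PurePosixPath

# Assumed container-side mount point for ./analysis_scripts/user_scripts
CONTAINER_SCRIPTS_DIR = PurePosixPath("/app/analysis_scripts/user_scripts")

def translate_script_path(script_path: str) -> str:
    """Resolve a bare filename against the mounted scripts directory."""
    p = PurePosixPath(script_path)
    if p.is_absolute():
        return str(p)  # already a container path; pass through
    return str(CONTAINER_SCRIPTS_DIR / p)
```

This is why `script_path="test.py"` in the Quick Start above is enough to locate the script inside the container.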
Quick Start
Prerequisites
Docker and Docker Compose
Make (optional, for convenience commands)
8GB+ available RAM for all services
1. Start the Stack
Production deployment:
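The command block appears to have been lost in formatting; per the overview above, production deployment is a single make target:

```bash
make prod
```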
Development deployment:
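A development target is not named elsewhere in this README; `make dev` is an assumption - check the Makefile for the actual target name:

```bash
make dev
```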
2. Configure Claude Desktop (or other MCP client)
Add the MCP server to your claude_desktop_config.json:
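The config snippet did not survive formatting; a typical stdio-style entry might look like the sketch below. The server key, container name, and entry-point path are all assumptions - adjust them to your deployment:

```json
{
  "mcpServers": {
    "crawl4ai-rag": {
      "command": "docker",
      "args": ["exec", "-i", "mcp-crawl4ai", "python", "src/crawl4ai_mcp.py"]
    }
  }
}
```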
3. Test the Connection
Try these commands in Claude to verify everything works:
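For example (illustrative prompts that exercise the `search`, `scrape_urls`, and `get_available_sources` tools described above):

```
Search for the latest Next.js features and summarize the results
Scrape https://example.com and store it for later questions
What sources are available in the database?
```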
4. Multi-Language Repository Analysis
Test the new multi-language capabilities:
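For instance, a hypothetical session using the knowledge-graph tools listed earlier (the repository URL is an arbitrary Go-project example):

```
parse_github_repository("https://github.com/gin-gonic/gin.git")
query_knowledge_graph("repos")
analyze_code_cross_language("authentication logic")
```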
Architecture
The system consists of several Docker services working together:
Core Services
MCP Server: FastMCP-based server exposing all tools
Crawl4AI: Advanced web crawling and content extraction
SearXNG: Privacy-focused search engine (no external API keys)
Supabase: PostgreSQL + pgvector for embeddings and RAG
Neo4j: (Optional) Knowledge graph for code structure and hallucination detection
Qdrant: (Optional) Alternative vector database with advanced features
Data Flow
Configuration
The system supports extensive configuration through environment variables:
Core Configuration
Advanced RAG Configuration
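The variable listings did not survive formatting. As a sketch, the RAG strategy toggles referenced throughout this README look like the following - only USE_AGENTIC_RAG and USE_KNOWLEDGE_GRAPH are confirmed by this document; the remaining names follow the upstream project's conventions and should be verified against the repository's `.env.example`:

```bash
USE_CONTEXTUAL_EMBEDDINGS=true   # LLM-generated context per chunk
USE_HYBRID_SEARCH=true           # combine vector + keyword search
USE_RERANKING=true               # cross-encoder reranking of results
USE_AGENTIC_RAG=true             # enables search_code_examples
USE_KNOWLEDGE_GRAPH=true         # enables Neo4j knowledge-graph tools
```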
Multi-Language Repository Support
The system now provides comprehensive support for multi-language repositories:
Supported Languages
Python (.py) - Classes, functions, methods, imports, docstrings
JavaScript (.js, .jsx, .mjs, .cjs) - ES6+ features, React components
TypeScript (.ts, .tsx) - Interfaces, types, enums, generics
Go (.go) - Structs, interfaces, methods, packages
Key Features
Unified Knowledge Graph: All languages stored in single Neo4j instance
Cross-Language Search: Find similar patterns across different languages
Language-Aware Analysis: Respects language-specific syntax and conventions
Repository Size Safety: Built-in validation prevents resource exhaustion
Batch Processing: Optimized for large multi-language repositories
Example Multi-Language Workflow
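The example block itself is missing here; a plausible sequence, using tools defined earlier in this README (the repository URL and query strings are illustrative):

```
parse_github_repository("https://github.com/expressjs/express.git")
analyze_code_cross_language("request routing")
smart_code_search(query="middleware error handling", validation_mode="balanced")
```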
For complete documentation, see Multi-Language Parsing Guide.
Docker Services Detail
Service URLs (Development)
MCP Server: Internal only (accessed via Docker exec)
SearXNG: http://localhost:4040
Crawl4AI: http://localhost:8000
Supabase Studio: http://localhost:54323
Neo4j Browser: http://localhost:7474
Qdrant Dashboard: http://localhost:6333/dashboard (if enabled)
Volume Mounts
Performance and Scaling
Resource Requirements
Minimum (Development):
4GB RAM
10GB disk space
2 CPU cores
Recommended (Production):
8GB+ RAM
50GB+ disk space
4+ CPU cores
Optimization Settings
Troubleshooting
Common Issues
1. Services not starting:
2. MCP connection issues:
3. Multi-language parsing issues:
4. Repository too large:
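The command snippets for these fixes were lost in formatting. Generic first steps use the docker-compose invocation this README mentions under Getting Help (service names are assumptions - list the real ones with `docker-compose ps`):

```bash
docker-compose ps                  # check which services are unhealthy
docker-compose logs mcp-server     # inspect a failing service's logs
docker-compose restart searxng     # restart a single service
```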
Getting Help
Documentation: Check the `/docs` directory for detailed guides
Issues: Report bugs on GitHub Issues
Logs: All services log to Docker, accessible via `docker-compose logs [service]`
Development
Contributing
Fork the repository
Create a feature branch
Make changes with proper documentation
Add tests for new functionality
Submit a pull request
Adding Language Support
To add support for new programming languages:
Create an analyzer in `src/knowledge_graph/analyzers/`
Extend `AnalyzerFactory` to recognize the file extensions
Add language-specific patterns and parsing logic
Update documentation and tests
See the Language Analyzer Development Guide for details.
Testing
Prerequisites: Start Qdrant for integration tests
Run tests:
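The commands were stripped in formatting; a hedged sketch (pytest is assumed as the test runner; the image name is Qdrant's official one):

```bash
docker run -d -p 6333:6333 qdrant/qdrant   # start Qdrant for integration tests
pytest                                      # run the test suite
```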
License
This project is licensed under the MIT License - see the LICENSE file for details.
Credits
Original MCP Crawl4AI RAG: coleam00/mcp-crawl4ai-rag
Crawl4AI: unclecode/crawl4ai
SearXNG: searxng/searxng
FastMCP: jlowin/fastmcp
Development Tools
Import Verification
The repository includes comprehensive import verification tests to catch refactoring issues early:
Pre-commit Hooks
Install git hooks for automatic code quality checks:
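The install command is missing here; if the project uses the standard pre-commit tool (an assumption - check the repository for a hooks script or Makefile target), setup would be:

```bash
pip install pre-commit
pre-commit install
```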
The pre-commit hook ensures:
All modules can be imported without errors
No circular imports
Code passes basic linting checks