# Crawl4AI RAG MCP Server
A high-performance Retrieval-Augmented Generation (RAG) system that uses Crawl4AI for web content extraction and sqlite-vec for vector storage, with MCP integration for AI assistants.
## Summary
This system provides a production-ready RAG solution that combines:
- **Crawl4AI** for intelligent web content extraction with markdown conversion
- **SQLite with sqlite-vec** for vector storage and semantic search
- **RAM Database Mode** for 10-50x faster query performance
- **MCP Server** for AI assistant integration (LM-Studio, Claude Desktop, etc.)
- **REST API** for bidirectional communication and remote access
- **Security Layer** with input sanitization and domain blocking
## Quick Start
### Option 1: Local Development
1. **Clone and setup**:
```bash
git clone https://github.com/Rob-P-Smith/mcpragcrawl4ai.git
cd mcpragcrawl4ai
python3 -m venv .venv
source .venv/bin/activate # Linux/Mac
pip install -r requirements.txt
```
2. **Start Crawl4AI service**:
```bash
docker run -d --name crawl4ai -p 11235:11235 unclecode/crawl4ai:latest
```
3. **Configure environment**:
```bash
# Create .env file
cat > .env << EOF
IS_SERVER=true
USE_MEMORY_DB=true
LOCAL_API_KEY=dev-api-key
CRAWL4AI_URL=http://localhost:11235
EOF
```
4. **Run MCP server**:
```bash
python3 core/rag_processor.py
```
### Option 2: Docker Server Deployment
1. **Deploy full server** (REST API + MCP):
```bash
cd mcpragcrawl4ai
docker compose -f deployments/server/docker-compose.yml up -d
```
2. **Test deployment**:
```bash
curl http://localhost:8080/health
```
See [Deployment Guide](docs/deployments.md) for complete deployment options.
## Architecture
### Core Components
- **MCP Server** (core/rag_processor.py) - JSON-RPC 2.0 protocol handler (example request after this list)
- **RAG Database** (core/data/storage.py) - SQLite + sqlite-vec vector storage with RAM mode support
- **Content Cleaner** (core/data/content_cleaner.py) - Navigation removal and quality filtering
- **Sync Manager** (core/data/sync_manager.py) - RAM database differential sync with virtual table support
- **Crawler** (core/operations/crawler.py) - Web crawling with DFS algorithm and content extraction
- **Defense Layer** (core/data/dbdefense.py) - Input sanitization and security
- **REST API** (api/api.py) - FastAPI server with 15+ endpoints
- **Auth System** (api/auth.py) - API key authentication and rate limiting
- **Recrawl Utility** (core/utilities/recrawl_utility.py) - Batch URL recrawling via API with concurrent processing
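The MCP layer speaks JSON-RPC 2.0. As a rough illustration (the exact tool schemas live in core/rag_processor.py), a client invokes a tool with a `tools/call` request like the one below; the tool name and arguments mirror the usage examples later in this README:
```python
import json

# JSON-RPC 2.0 request an MCP client sends to invoke a tool
# (delivered over stdio after the standard MCP initialize handshake).
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "search_memory",
        "arguments": {"query": "list comprehensions", "tags": "python", "limit": 5},
    },
}
print(json.dumps(request, indent=2))
```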
### Database Schema
- **crawled_content** - Web content with markdown, embeddings, and metadata
- **content_vectors** - Vector embeddings (sqlite-vec vec0 virtual table with rowid support)
- **sessions** - User session tracking for temporary content
- **blocked_domains** - Domain blocklist with wildcard patterns
- **_sync_tracker** - Change tracking for RAM database differential sync (memory mode only)
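A minimal sketch of how the content and vector tables can be created with sqlite-vec; the file name and column names here are illustrative, not the project's exact schema:
```python
import sqlite3
import sqlite_vec

db = sqlite3.connect("crawl4ai_rag.db")  # assumed file name
db.enable_load_extension(True)
sqlite_vec.load(db)
db.enable_load_extension(False)

db.executescript("""
CREATE TABLE IF NOT EXISTS crawled_content (
    id INTEGER PRIMARY KEY,
    url TEXT UNIQUE,
    markdown TEXT,
    tags TEXT,
    crawled_at TEXT DEFAULT CURRENT_TIMESTAMP
);
-- vec0 virtual table holding one 384-dimensional embedding per row
CREATE VIRTUAL TABLE IF NOT EXISTS content_vectors USING vec0(
    embedding float[384]
);
""")
```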
### Technology Stack
- **Python 3.11+** with asyncio for concurrent operations
- **SQLite** with sqlite-vec extension for vector similarity search
- **SentenceTransformers** (all-MiniLM-L6-v2) for embedding generation
- **langdetect** for language detection and filtering
- **FastAPI** for REST API with automatic OpenAPI documentation
- **Crawl4AI** for intelligent web content extraction with fit_markdown
- **Docker** for containerized deployment
- **aiohttp** for async HTTP requests in utilities
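To show how the stack fits together, here is a hedged sketch of embedding a query with all-MiniLM-L6-v2 and running a nearest-neighbour search against a vec0 table (table, column, and file names follow the sketch above rather than the real schema):
```python
import sqlite3
import sqlite_vec
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
query_vec = model.encode("list comprehensions in python")  # 384-dimensional vector

db = sqlite3.connect("crawl4ai_rag.db")  # assumed file name
db.enable_load_extension(True)
sqlite_vec.load(db)

# KNN search: smaller distance means more semantically similar
rows = db.execute(
    "SELECT rowid, distance FROM content_vectors "
    "WHERE embedding MATCH ? ORDER BY distance LIMIT 5",
    [sqlite_vec.serialize_float32(query_vec.tolist())],
).fetchall()
print(rows)
```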
## Documentation
For detailed documentation, see:
- [Deployment Guide](docs/deployments.md) - Comprehensive deployment options
- [Installation Guide](docs/README.md) - Setup and configuration
- [API Documentation](docs/API_README.md) - REST API reference
- [Quick Start Guide](docs/guides/quick-start.md) - Get started quickly
- [Troubleshooting](docs/guides/troubleshooting.md) - Common issues and solutions
- [Full Documentation](docs/index.md) - Complete documentation index
## Key Features
### Performance
- **RAM Database Mode**: In-memory SQLite with differential sync for 10-50x faster queries (sketched after this list)
- **Vector Search**: 384-dimensional embeddings using all-MiniLM-L6-v2 for semantic search
- **Batch Crawling**: High-performance batch processing with retry logic and progress tracking
- **Content Optimization**: 70-80% storage reduction through intelligent cleaning and filtering
- **Efficient Storage**: fit_markdown conversion and content chunking for optimal retrieval
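The RAM database mode builds on SQLite's backup API: the on-disk database is copied into `:memory:` at startup and changes are written back on a schedule. The project syncs differentially using the `_sync_tracker` table; the sketch below shows only the simpler full-copy pattern:
```python
import sqlite3

disk = sqlite3.connect("crawl4ai_rag.db")  # assumed file name
ram = sqlite3.connect(":memory:")

disk.backup(ram)   # load the entire database into RAM at startup
# ... serve all queries from `ram` ...
ram.backup(disk)   # persist changes back to disk (the project does this differentially)
```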
### Functionality
- **Deep Crawling**: DFS-based multi-page crawling with depth and page limits (sketched after this list)
- **Content Cleaning**: Automatic removal of navigation, boilerplate, and low-quality content
- **Language Filtering**: Automatic detection and filtering of non-English content
- **Semantic Search**: Vector similarity search with tag filtering and deduplication
- **Target Search**: Intelligent search with automatic tag expansion
- **Content Management**: Full CRUD operations with retention policies and session management
- **Batch Recrawling**: Concurrent URL recrawling via API with rate limiting and progress tracking
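The deep-crawl behaviour amounts to a depth-first traversal bounded by depth and page limits. A minimal sketch, where `fetch_links` is a hypothetical stand-in for the Crawl4AI-backed page fetch and link extraction in core/operations/crawler.py:
```python
from typing import Callable, Iterable

def deep_crawl(start_url: str,
               fetch_links: Callable[[str], Iterable[str]],  # hypothetical helper
               max_depth: int = 2,
               max_pages: int = 50) -> list[str]:
    """Depth-first crawl bounded by depth and total page count."""
    visited: list[str] = []
    stack = [(start_url, 0)]
    seen = {start_url}
    while stack and len(visited) < max_pages:
        url, depth = stack.pop()
        visited.append(url)
        if depth < max_depth:
            for link in fetch_links(url):
                if link not in seen:
                    seen.add(link)
                    stack.append((link, depth + 1))
    return visited
```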
### Security
- **Input Sanitization**: Comprehensive SQL injection defense and input validation
- **Domain Blocking**: Wildcard-based domain blocking with social media and adult content filters (sketched after this list)
- **API Authentication**: API key-based authentication with rate limiting
- **Safe Crawling**: Automatic detection and blocking of forbidden content
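Wildcard domain blocking reduces to glob matching against the request's hostname. A minimal sketch (the patterns are examples; the real list lives in the blocked_domains table):
```python
from fnmatch import fnmatch
from urllib.parse import urlparse

BLOCKED_PATTERNS = ["facebook.com", "*.facebook.com", "*.tiktok.com"]  # example patterns

def is_blocked(url: str) -> bool:
    host = urlparse(url).hostname or ""
    return any(fnmatch(host, pattern) for pattern in BLOCKED_PATTERNS)

print(is_blocked("https://www.facebook.com/somepage"))  # True
```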
### Integration
- **MCP Server**: Full MCP protocol support for AI assistant integration
- **REST API**: Complete REST API with 15+ endpoints for all operations
- **Bidirectional Mode**: Server mode (hosts the API locally) and client mode (forwards requests to a remote server)
- **Docker Deployment**: Production-ready containerized deployment
## Quick Usage Examples
### Via MCP (in LM-Studio/Claude Desktop)
```
crawl_and_remember("https://docs.python.org/3/tutorial/", tags="python, tutorial")
search_memory("list comprehensions", tags="python", limit=5)
target_search("async programming best practices", initial_limit=5, expanded_limit=20)
get_database_stats()
```
### Via REST API
```bash
# Crawl and store content
curl -X POST http://localhost:8080/api/v1/crawl/store \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{"url": "https://docs.python.org/3/tutorial/", "tags": "python, tutorial"}'
# Semantic search
curl -X POST http://localhost:8080/api/v1/search \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{"query": "list comprehensions", "tags": "python", "limit": 5}'
# Get database stats
curl http://localhost:8080/api/v1/stats \
-H "Authorization: Bearer YOUR_API_KEY"
```
### Via Python Client
```python
import asyncio
from api.api import Crawl4AIClient

async def main():
    client = Crawl4AIClient("http://localhost:8080", "YOUR_API_KEY")
    result = await client.crawl_and_store("https://example.com", tags="example")
    search_results = await client.search("python tutorials", limit=10)
    stats = await client.get_database_stats()

asyncio.run(main())
```
## Performance Metrics
With RAM database mode enabled:
- **Search queries**: 20-50ms (vs 200-500ms disk mode)
- **Batch crawling**: 2,000+ URLs successfully processed
- **Database size**: 215MB (2,296 pages, 8,196 embeddings)
- **Sync overhead**: <100ms per differential sync (triggered after 5s of idle time or every 5min)
- **Sync reliability**: 100% success rate with virtual table support
- **Memory usage**: ~500MB for full in-memory database
- **Storage optimization**: 70-80% reduction through content cleaning