Crawl4AI RAG MCP Server

index.md•5.66 KiB

# Crawl4AI RAG MCP Server Documentation A high-performance Retrieval-Augmented Generation (RAG) system using Crawl4AI for web content extraction, sqlite-vec for vector storage, and MCP integration for AI assistants. ## Overview This is a production-ready RAG system that provides: - **High Performance**: RAM database mode with 10-50x faster queries via differential sync - **Semantic Search**: Vector-based content retrieval with 384-dimensional embeddings - **MCP Integration**: Full MCP protocol support for AI assistants (LM-Studio, Claude Desktop, etc.) - **REST API**: Complete REST API with 15+ endpoints for remote access - **Security**: Multi-layered input sanitization and domain blocking - **Batch Processing**: Advanced batch crawler with retry logic and progress tracking ## Operation Modes The system can operate in two modes: - **Server Mode**: Hosts REST API endpoints and MCP server locally - **Client Mode**: Forwards MCP tool calls to a remote REST API server ## Key Features - **RAM Database Mode**: In-memory SQLite with differential sync for ultra-fast queries - Virtual table support with hard-coded schemas - Dual sync strategy: idle (5s) + periodic (5min) - 100% sync reliability with proper error handling - **Content Optimization**: Intelligent cleaning removes 70-80% of navigation boilerplate - **Language Filtering**: Automatic detection and filtering of non-English content - **Deep Crawling**: DFS-based multi-page crawling with depth and page limits - **Input Sanitization**: Comprehensive SQL injection defense and validation - **Domain Blocking**: Wildcard-based blocking with social media and NSFW filters - **Batch Operations**: High-performance batch crawling with automatic retry - **Vector Search**: Semantic search with tag filtering and deduplication - **Concurrent Recrawling**: API-based batch URL recrawling with rate limiting - **Sync Monitoring**: Real-time sync metrics in `/api/v1/stats` endpoint ## Quick Start ### 1. Install Dependencies ```bash pip install -r requirements.txt ``` ### 2. Configure Environment Copy and edit the `.env` file: ```bash # For Server Mode (hosting the API) IS_SERVER=true USE_MEMORY_DB=true LOCAL_API_KEY=your-secure-api-key-here # For Client Mode (forwarding to remote) IS_SERVER=false REMOTE_API_URL=https://your-server.com:8080 REMOTE_API_KEY=your-remote-api-key-here ``` ### 3. Run in Server Mode ```bash # Option 1: Using the startup script python3 deployments/server/start_api_server.py # Option 2: Using uvicorn directly uvicorn api.api:create_app --host 0.0.0.0 --port 8080 # Option 3: Using Docker docker compose -f deployments/server/docker-compose.yml up -d ``` ### 4. Run in Client Mode ```bash # The existing MCP server automatically detects client mode python3 core/rag_processor.py ``` ## Documentation ### Getting Started - **[Installation Guide](README.md)** - Detailed setup instructions - **[Quick Start Guide](guides/quick-start.md)** - Get up and running quickly - **[Deployment Guide](deployments.md)** - Comprehensive deployment options (Server, Client, Docker) ### API Documentation - **[API Overview](api/index.md)** - REST API introduction and concepts - **[API Endpoints](api/endpoints.md)** - Complete endpoint reference with examples - **[API Reference](API_README.md)** - Legacy API documentation ### Advanced Features - **[RAM Database Mode](advanced/ram-database.md)** - In-memory database with differential sync - **[Security Features](advanced/security.md)** - Input sanitization and domain blocking - **[Batch Operations](advanced/batch-operations.md)** - Batch crawling and database restoration ### Guides - **[Docker Setup](docker/)** - Docker deployment and configuration - **[Troubleshooting](guides/troubleshooting.md)** - Common issues and solutions ## Architecture <img src="Diagram.svg" alt="Architecture Diagram" style="width: 50%; height: auto; display: block; margin: 0 auto;"> ## Security Features - **Input Sanitization**: Multi-layered SQL injection defense and validation (see [Security Guide](advanced/security.md)) - **Domain Blocking**: Wildcard-based domain blocking with social media and NSFW filters - **API Key Authentication**: Bearer token authentication for all endpoints - **Rate Limiting**: Configurable requests per minute per API key - **Input Validation**: Pydantic models for request validation - **URL Validation**: Prevents access to internal/private networks and blocked domains - **HTTPS Support**: Configurable CORS and security headers - **Session Management**: Automatic cleanup of expired sessions ## Performance Features - **RAM Database Mode**: In-memory SQLite with differential sync (see [RAM Database Guide](advanced/ram-database.md)) - 10-50x faster query performance - Automatic idle sync (5 seconds) and periodic sync (5 minutes) - Change tracking via triggers and shadow tables - **Content Processing Pipeline**: - Crawl4AI fit_markdown extraction (cleaner than raw_markdown) - Navigation and boilerplate removal (70-80% storage reduction) - Language detection and filtering (English-only by default) - Chunk quality filtering before embedding - **Vector Search**: 384-dimensional embeddings with similarity search - **Batch Processing**: Concurrent crawling with retry logic (see [Batch Operations Guide](advanced/batch-operations.md)) - **Content Deduplication**: URL-based deduplication in search results - **Efficient Storage**: fit_markdown conversion and intelligent content chunking ## Error Handling The system includes comprehensive error logging to `crawl4ai_rag_errors.log` with format: ``` timestamp|function_name|url|error_message|error_code|stack_trace ``` All major functions include try-catch blocks with detailed error logging for debugging and monitoring.

Loading blob content...

Latest Blog Posts

Redis vs ioredis vs valkey-glide
By punkpeye on January 26, 2026.
benchmark
Redis
valkey
Quickstart: Publish an MCP Server to the MCP Registry
By punkpeye on January 24, 2026.
mcp
official reference mirror
Official MCP Registry Server.json Requirements
By punkpeye on January 24, 2026.
mcp
official reference mirror

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/Rob-P-Smith/mcpragcrawl4ai'

If you have feedback or need assistance with the MCP directory API, please join our Discord server

index.md•5.66 KiB