# Crawl4AI RAG MCP Server - Documentation
Complete documentation for the Crawl4AI RAG (Retrieval-Augmented Generation) MCP Server implementation.
## Overview
This system provides a production-ready RAG solution that combines:
- **Crawl4AI** for intelligent web content extraction with markdown conversion
- **SQLite + sqlite-vec** for vector storage and semantic search
- **RAM Database Mode** for 10-50x faster queries, persisted to disk via differential sync
- **MCP integration** for AI assistants (LM-Studio, Claude Desktop, etc.)
- **REST API** with 15+ endpoints for remote access and integration
- **Security Layer** with input sanitization and domain blocking
## Quick Links
### Getting Started
- [Quick Start Guide](guides/quick-start.md) - Get up and running quickly
- [Deployment Guide](deployments.md) - Comprehensive deployment options
- [Installation Guide](#installation-guide) - Detailed setup instructions
### API & Integration
- [API Overview](api/index.md) - REST API introduction
- [API Endpoints](api/endpoints.md) - Complete endpoint reference
- [API Documentation](API_README.md) - Legacy API reference
- [Docker Setup](docker/index.md) - Docker deployment guide
### Advanced Features
- [RAM Database Mode](advanced/ram-database.md) - In-memory database with differential sync
- [Security Features](advanced/security.md) - Input sanitization and domain blocking
- [Batch Operations](advanced/batch-operations.md) - Batch crawling and database restoration
### Guides
- [Deployment Options](deployments.md) - Server, Client, and Local deployment
- [Troubleshooting](guides/troubleshooting.md) - Common issues and solutions
- [Full Documentation](index.md) - Complete documentation index
## Architecture
```
┌─────────────┐
│  LM-Studio  │
│ (MCP Client)│
└──────┬──────┘
       │
┌──────▼────────────┐
│    MCP Server     │
│  (core/rag_       │
│   processor.py)   │
└──────┬────────────┘
       │
┌──────▼────────────┐      ┌──────────────┐
│    RAG System     │──────│   Crawl4AI   │
│   (operations/    │      │   (Docker:   │
│    crawler.py)    │      │  port 11235) │
└──────┬────────────┘      └──────────────┘
       │
┌──────▼────────────┐
│  Vector Database  │
│   (SQLite +       │
│    sqlite-vec)    │
└───────────────────┘
```
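The client and server exchange JSON-RPC 2.0 messages over stdio. The sketch below drives the server by hand, assuming the common newline-delimited stdio transport; the handshake fields and tool arguments shown are illustrative, not the authoritative schema.

```python
# Minimal sketch: drive the MCP server over stdio with JSON-RPC 2.0.
# Assumes newline-delimited messages (the common MCP stdio transport);
# the exact handshake fields are illustrative.
import json
import subprocess

proc = subprocess.Popen(
    ["python3", "core/rag_processor.py"],
    stdin=subprocess.PIPE,
    stdout=subprocess.PIPE,
    text=True,
)

def rpc(req_id, method, params):
    # Write one request line, read one response line.
    proc.stdin.write(json.dumps({"jsonrpc": "2.0", "id": req_id,
                                 "method": method, "params": params}) + "\n")
    proc.stdin.flush()
    return json.loads(proc.stdout.readline())

print(rpc(1, "initialize", {"protocolVersion": "2024-11-05",
                            "capabilities": {},
                            "clientInfo": {"name": "demo", "version": "0.1"}}))
print(rpc(2, "tools/call", {"name": "search_memory",
                            "arguments": {"query": "vector search"}}))
```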
## Installation Guide
### Prerequisites
- Linux (Ubuntu recommended) or macOS
- Docker installed
- Python 3.8 or higher
- At least 4GB RAM available
- 10GB free disk space
### Quick Setup
1. **Clone repository**:
```bash
git clone https://github.com/Rob-P-Smith/mcpragcrawl4ai.git
cd mcpragcrawl4ai
```
2. **Install dependencies**:
```bash
python3 -m venv .venv
source .venv/bin/activate # Linux/Mac
pip install -r requirements.txt
```
3. **Start Crawl4AI**:
```bash
docker run -d --name crawl4ai -p 11235:11235 unclecode/crawl4ai:latest
```
4. **Configure environment**:
```bash
cat > .env << EOF
IS_SERVER=true
USE_MEMORY_DB=true
LOCAL_API_KEY=$(openssl rand -base64 32)
CRAWL4AI_URL=http://localhost:11235
EOF
```
5. **Run MCP server**:
```bash
python3 core/rag_processor.py
```
For detailed instructions, see the [Quick Start Guide](guides/quick-start.md).
## Project Structure
### Core Components
- **core/rag_processor.py**: Main MCP server implementation and JSON-RPC handling
- **core/operations/crawler.py**: Web crawling logic and deep crawling functionality
- **core/data/storage.py**: Database operations, content storage, and vector embeddings
- **core/data/sync_manager.py**: RAM database synchronization with differential sync
- **core/data/dbdefense.py**: Security layer for input sanitization and SQL injection prevention
- **core/utilities/**: Helper scripts including dbstats.py and batch_crawler.py
### API Layer
- **api/api.py**: FastAPI server with all REST endpoints
- **api/auth.py**: Authentication middleware and session management
### Deployment Configurations
- **deployments/server/**: Server deployment (REST API + MCP server in Docker)
- **deployments/client/**: Client deployment (lightweight MCP client forwarder)
## Available Tools
The MCP server provides the following tools; illustrative call payloads follow the list.
### Basic Crawling
1. **crawl_url** - Crawl without storing
2. **crawl_and_remember** - Crawl and store permanently
3. **crawl_temp** - Crawl and store temporarily (session-only)
### Deep Crawling
4. **deep_crawl_dfs** - Deep crawl multiple pages using depth-first search without storing
5. **deep_crawl_and_store** - Deep crawl multiple pages using DFS and store all in knowledge base
### Search & Knowledge Management
6. **search_memory** - Semantic search with tag filtering and deduplication
7. **target_search** - Intelligent search with automatic tag expansion
8. **list_memory** - List all stored content with optional filtering
9. **forget_url** - Remove specific content by URL
10. **clear_temp_memory** - Clear temporary session content
### Database & Security Tools
11. **get_database_stats** - Database statistics with RAM/disk mode indicator
12. **list_domains** - List all unique domains with page counts
13. **block_domain** - Block domains from being crawled (supports wildcards)
14. **unblock_domain** - Remove domain from blocklist
15. **list_blocked_domains** - View all blocked domain patterns
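The payloads below are illustrative guesses at how a few of these tools might be invoked in a `tools/call` request; the argument names are assumptions, so consult the [API Endpoints](api/endpoints.md) reference for the authoritative schemas.

```python
# Illustrative tools/call payloads; field names are assumptions,
# not the authoritative tool schemas.
example_calls = [
    {"name": "crawl_and_remember",
     "arguments": {"url": "https://example.com/docs", "tags": "docs,example"}},
    {"name": "deep_crawl_and_store",
     "arguments": {"url": "https://example.com", "max_depth": 2, "max_pages": 50}},
    {"name": "search_memory",
     "arguments": {"query": "differential sync", "limit": 5}},
    {"name": "block_domain",
     "arguments": {"domain": "*.example-ads.com"}},
]
```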
## Deployment Options
### Local Development
Run directly on your machine in a Python virtual environment.
- [Local Development Guide](guides/deployment.md#local-development-environment)
### Docker Server Deployment
Full-featured server with REST API + MCP server in Docker.
- [Server Deployment Guide](deployments.md#deployment-option-1-server-deployment)
- [Server README](../deployments/server/README.md)
### Docker Client Deployment
Lightweight client forwarding requests to remote server.
- [Client Deployment Guide](deployments.md#deployment-option-2-client-deployment)
- [Client README](../deployments/client/README.md)
## Configuration
### Environment Variables
**Core Settings:**
- `IS_SERVER` - `true` for server mode, `false` for client mode
- `SERVER_HOST` - Host to bind API server (default: `0.0.0.0`)
- `SERVER_PORT` - Port for API server (default: `8080`)
**Authentication:**
- `LOCAL_API_KEY` - API key for server mode
- `REMOTE_API_KEY` - API key for client mode
- `REMOTE_API_URL` - Remote server URL for client mode
**Database:**
- `DB_PATH` - SQLite database path (default: `crawl4ai_rag.db`)
- `USE_MEMORY_DB` - Enable RAM database mode (default: `true`)
**Services:**
- `CRAWL4AI_URL` - Crawl4AI service URL (default: `http://localhost:11235`)
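As a sketch of how these settings fit together, the snippet below reads them with the defaults listed above; the variable names match this section, while the loading code itself is illustrative.

```python
# Illustrative loader for the variables above, using the documented defaults.
import os

IS_SERVER     = os.getenv("IS_SERVER", "true").lower() == "true"
SERVER_HOST   = os.getenv("SERVER_HOST", "0.0.0.0")
SERVER_PORT   = int(os.getenv("SERVER_PORT", "8080"))
DB_PATH       = os.getenv("DB_PATH", "crawl4ai_rag.db")
USE_MEMORY_DB = os.getenv("USE_MEMORY_DB", "true").lower() == "true"
CRAWL4AI_URL  = os.getenv("CRAWL4AI_URL", "http://localhost:11235")
```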
### LM-Studio Configuration
For local development:
```json
{
"mcpServers": {
"crawl4ai-rag": {
"command": "/path/to/.venv/bin/python3",
"args": ["/path/to/core/rag_processor.py"]
}
}
}
```
For Docker server deployment:
```json
{
"mcpServers": {
"crawl4ai-rag": {
"command": "socat",
"args": ["-", "TCP:localhost:3000"]
}
}
}
```
For Docker client deployment:
```json
{
"mcpServers": {
"crawl4ai-rag": {
"command": "docker",
"args": ["exec", "-i", "crawl4ai-mcp-client", "python3", "core/rag_processor.py"]
}
}
}
```
## Features
### Performance
- **RAM Database Mode**: In-memory SQLite with 10-50x faster queries (see [RAM Database Guide](advanced/ram-database.md))
- **Differential Sync**: Flushes changes to disk after 5 seconds of idle time and on a 5-minute periodic timer (sketched after this list)
- **Vector Search**: 384-dimensional embeddings with similarity search
- **Efficient Storage**: Markdown conversion and content chunking
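A rough sketch of the RAM-database pattern under the sync intervals above; the real implementation lives in `core/data/sync_manager.py` and is differential, whereas the full backup shown here is for brevity only.

```python
# Sketch only: in-memory SQLite flushed to disk on idle (5s) or
# periodically (5min) using SQLite's backup API. The real sync_manager
# copies only changed rows; a full backup is shown here for brevity.
import sqlite3
import threading
import time

mem = sqlite3.connect(":memory:", check_same_thread=False)
IDLE_SECONDS, PERIODIC_SECONDS = 5, 300
last_write = time.monotonic()  # write paths should update this timestamp

def sync_to_disk():
    disk = sqlite3.connect("crawl4ai_rag.db")
    mem.backup(disk)  # atomic copy of the in-memory database to disk
    disk.close()

def sync_loop():
    last_sync = time.monotonic()
    while True:
        time.sleep(1)
        now = time.monotonic()
        dirty = last_write > last_sync               # writes since last sync
        idle = now - last_write >= IDLE_SECONDS      # 5s of quiet
        periodic_due = now - last_sync >= PERIODIC_SECONDS  # 5min elapsed
        if dirty and (idle or periodic_due):
            sync_to_disk()
            last_sync = now

threading.Thread(target=sync_loop, daemon=True).start()
```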
### Functionality
- **MCP Integration**: Full protocol support for AI assistants (LM-Studio, Claude Desktop, etc.)
- **Deep Crawling**: DFS-based multi-page crawling (max depth 5, max pages 250)
- **Semantic Search**: Vector-based retrieval with tag filtering and deduplication
- **Content Management**: Full CRUD operations with retention policies
- **Bidirectional Communication**: Server and client modes for flexible deployment
### Security
- **Input Sanitization**: Comprehensive SQL injection defense (see [Security Guide](advanced/security.md))
- **Domain Blocking**: Wildcard-based blocking with social media and NSFW filters (see the sketch after this list)
- **API Authentication**: Bearer token authentication with rate limiting
- **URL Validation**: Prevents access to internal/private networks
- **Session Management**: Automatic cleanup of temporary content
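For illustration, wildcard blocking can be expressed as hostname pattern matching. The sketch below uses `fnmatch` and a hard-coded list, whereas the real blocklist is stored in the database and managed via the `block_domain`/`unblock_domain` tools.

```python
# Illustrative wildcard domain check; the real blocklist is stored in
# the database and managed by block_domain / unblock_domain.
from fnmatch import fnmatch
from urllib.parse import urlparse

blocked_patterns = ["*.doubleclick.net", "facebook.com"]

def is_blocked(url):
    host = (urlparse(url).hostname or "").lower()
    return any(fnmatch(host, pattern) for pattern in blocked_patterns)

assert is_blocked("https://ads.doubleclick.net/tag")
assert not is_blocked("https://example.com/page")
```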
## Deep Crawling Features
- **DFS Strategy**: Uses depth-first search to crawl multiple interconnected pages (a minimal sketch follows this list)
- **Configurable Depth**: Control how many levels deep to crawl (max 5 levels)
- **Page Limits**: Restrict maximum pages crawled to prevent resource exhaustion (max 250 pages)
- **External Links**: Option to follow or ignore external domain links
- **URL Scoring**: Filter pages based on relevance scores (0.0-1.0 threshold)
- **Bulk Storage**: Store all discovered pages in the knowledge base with automatic tagging
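The sketch below shows the DFS traversal under these limits; `fetch_page_and_links` is a hypothetical stub standing in for the Crawl4AI request, and the parameter names are illustrative rather than the tool's actual signature.

```python
# Sketch of depth-first crawling under the documented limits
# (max_depth <= 5, max_pages <= 250). Parameter names are illustrative.
from urllib.parse import urlparse

def fetch_page_and_links(url):
    # Hypothetical stand-in for the Crawl4AI service call:
    # returns (markdown, outbound_links).
    return f"# stub content for {url}", []

def deep_crawl_dfs(start_url, max_depth=3, max_pages=50, follow_external=False):
    visited, results = set(), []
    stack = [(start_url, 0)]
    start_host = urlparse(start_url).hostname
    while stack and len(results) < max_pages:
        url, depth = stack.pop()
        if url in visited or depth > max_depth:
            continue
        visited.add(url)
        markdown, links = fetch_page_and_links(url)
        results.append({"url": url, "depth": depth, "content": markdown})
        for link in links:
            # Skip external domains unless explicitly allowed.
            if follow_external or urlparse(link).hostname == start_host:
                stack.append((link, depth + 1))
    return results
```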
## Data Persistence
- **Database**: SQLite file in the configured location (default: `crawl4ai_rag.db`)
- **Content Chunking**: Text split into 500-word chunks with 50-word overlap (illustrated below)
- **Retention Policies**: 'permanent', 'session_only', or time-based (e.g., '30_days')
- **Deep Crawl Tags**: Automatically tagged with 'deep_crawl' and depth information
- **Vector Embeddings**: 384-dimensional vectors using sentence-transformers 'all-MiniLM-L6-v2'
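As an illustration of this scheme, the sketch below chunks text into 500-word windows with 50-word overlap and embeds them with the model named above; the helper is illustrative, not the storage layer's actual code.

```python
# Illustrative chunking (500 words, 50-word overlap) and embedding
# with all-MiniLM-L6-v2 (384 dimensions). Not the storage layer's code.
from sentence_transformers import SentenceTransformer

def chunk_words(text, chunk_size=500, overlap=50):
    words = text.split()
    step = chunk_size - overlap  # each chunk starts 450 words after the last
    return [" ".join(words[i:i + chunk_size])
            for i in range(0, max(len(words) - overlap, 1), step)]

model = SentenceTransformer("all-MiniLM-L6-v2")
chunks = chunk_words("word " * 1200)   # ~1200 words -> 3 overlapping chunks
embeddings = model.encode(chunks)      # shape: (3, 384)
print(len(chunks), embeddings.shape)
```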
## Development Commands
### Running the MCP Server
```bash
# Run the main MCP server (local mode)
python3 core/rag_processor.py
# Run in client mode (forwards to remote API)
# First set IS_SERVER=false in .env
python3 core/rag_processor.py
```
### Running the REST API Server
```bash
# Start API server (server mode)
python3 deployments/server/start_api_server.py
# Or use uvicorn directly (--factory because create_app builds the app)
uvicorn api.api:create_app --factory --host 0.0.0.0 --port 8080
```
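Once the server is up, requests should carry the key from `LOCAL_API_KEY` as a Bearer token. A minimal check against the `/health` endpoint (also used in Quick Diagnostics below); endpoint paths beyond `/health` are documented in the [API Endpoints](api/endpoints.md) reference.

```python
# Illustrative authenticated request; whether /health itself enforces
# auth depends on the deployment, but the header pattern is the same.
import os
import requests

BASE_URL = "http://localhost:8080"
headers = {"Authorization": f"Bearer {os.environ['LOCAL_API_KEY']}"}

resp = requests.get(f"{BASE_URL}/health", headers=headers, timeout=10)
print(resp.status_code, resp.json())
```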
### Testing
```bash
# Test sqlite-vec installation
python3 core/utilities/test_sqlite_vec.py
# Test database operations
python3 core/utilities/dbstats.py
# Test batch crawling functionality
python3 core/utilities/batch_crawler.py
```
### Docker Operations
```bash
# Server Deployment (REST API + MCP Server)
docker compose -f deployments/server/docker-compose.yml up -d
docker compose -f deployments/server/docker-compose.yml logs -f
docker compose -f deployments/server/docker-compose.yml down
# Client Deployment (MCP Client only)
docker compose -f deployments/client/docker-compose.yml up -d
docker compose -f deployments/client/docker-compose.yml logs -f
docker compose -f deployments/client/docker-compose.yml down
# Standalone Crawl4AI service
docker run -d --name crawl4ai -p 11235:11235 unclecode/crawl4ai:latest
```
## Troubleshooting
For common issues and solutions, see the [Troubleshooting Guide](guides/troubleshooting.md).
### Quick Diagnostics
**Check Services:**
```bash
# Check Docker containers
docker ps
# Check ports
sudo lsof -i :8080 # REST API
sudo lsof -i :3000 # MCP Server
sudo lsof -i :11235 # Crawl4AI
```
**Test Connectivity:**
```bash
# Test REST API
curl http://localhost:8080/health
# Test Crawl4AI
curl http://localhost:11235/
```
**View Logs:**
```bash
# Application errors
tail -f data/crawl4ai_rag_errors.log
# Docker logs
docker logs crawl4ai
docker logs crawl4ai-rag-server
```
## Support
- [GitHub Issues](https://github.com/Rob-P-Smith/mcpragcrawl4ai/issues)
- [AGENTS.md](../AGENTS.md) - Developer guide
- [Troubleshooting Guide](guides/troubleshooting.md)
- [Deployment Guide](deployments.md)