# AGENTS.md
This file provides guidance to Agents working with code in this repository.
## Project Overview
This is a Crawl4AI RAG (Retrieval-Augmented Generation) MCP Server implementation that provides web crawling capabilities with semantic search and knowledge base storage. The system consists of both an MCP server and a REST API.
## Key Components
### Core Architecture
- **MCPServer**: Main server handling JSON-RPC 2.0 protocol for MCP communication
- **Crawl4AIRAG**: Web crawling interface that communicates with a local Crawl4AI service
- **RAGDatabase**: SQLite-based vector database using sqlite-vec extension for semantic search
- **Content Cleaner**: Navigation removal, boilerplate filtering, and chunk quality validation
- **Language Filter**: Automatic detection and filtering of non-English content using langdetect
- **RAM Database Mode**: In-memory SQLite database with differential sync for 10-50x faster queries
- **Virtual Table Support**: Special handling for vec0 virtual tables with hard-coded schemas
- **Dual Sync Strategy**: Idle sync (5s) + periodic sync (5min) for reliability
- **Change Tracking**: Trigger-based tracking for regular tables, manual tracking for virtual tables
- **Error Logging**: Comprehensive error logging system writing to `crawl4ai_rag_errors.log`
- **API Gateway**: REST API layer with authentication and bidirectional communication support
- **Security Layer**: Input sanitization and SQL injection defense system (dbdefense.py)
- **Recrawl Utility**: Batch URL recrawling with concurrent processing and API-based execution
### Dependencies
- **External Service**: Requires Crawl4AI service running on `http://localhost:11235`
- **Vector Database**: Uses sqlite-vec extension for vector similarity search
- **ML Model**: Pre-loads SentenceTransformer 'all-MiniLM-L6-v2' for embeddings
- **Language Detection**: langdetect library for identifying and filtering non-English content
- **Key Libraries**: sqlite3, sqlite_vec, sentence_transformers, numpy, requests, fastapi, uvicorn, aiohttp, langdetect
### File Structure
- **core/rag_processor.py**: Main MCP server implementation and JSON-RPC handling
- **core/operations/crawler.py**: Web crawling logic with fit_markdown extraction and deep crawling functionality
- **core/data/storage.py**: Database operations, content storage, vector embeddings, and language filtering
- **core/data/content_cleaner.py**: Navigation removal, boilerplate filtering, and content quality validation
- **core/data/sync_manager.py**: RAM database synchronization manager with differential sync
- **core/data/dbdefense.py**: Security layer for input sanitization and SQL injection prevention
- **core/utilities/**: Helper scripts for testing and batch operations
- **core/utilities/dbstats.py**: Database statistics and health monitoring
- **core/utilities/batch_crawler.py**: Batch crawling with retry logic and progress tracking
- **core/utilities/recrawl_utility.py**: Concurrent URL recrawling via API with rate limiting
- **api/**: REST API module for bidirectional communication
- **api/api.py**: FastAPI server with all REST endpoints
- **api/auth.py**: Authentication middleware and session management
- **deployments/**: Deployment configurations
- **deployments/server/**: Server deployment (REST API + MCP server in Docker)
- **deployments/client/**: Client deployment (lightweight MCP client forwarder)
### Database Schema
- **crawled_content**: Stores web content with metadata, retention policies, and session tracking
- **sessions**: Tracks user sessions for temporary content management
- **content_vectors**: Virtual table for vector embeddings using vec0 engine
- Schema: `['rowid', 'embedding', 'content_id']` (rowid is implicit in vec0 tables)
- Primary key: `content_id` (not `id` like regular tables)
- Special handling required in sync operations due to virtual table limitations
- **blocked_domains**: Stores blocked domain patterns with wildcard support
- **_sync_tracker**: Shadow table for tracking RAM database changes (RAM mode only)
- Tracks table_name, record_id, operation (INSERT/UPDATE/DELETE), timestamp
- Cleared after successful sync to disk
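As a rough point of reference, the schema above might be declared along these lines with sqlite-vec; this is a sketch only (the `sessions` table is omitted and column names/defaults may differ), and the authoritative DDL lives in `core/data/storage.py`:
```python
import sqlite3
import sqlite_vec

conn = sqlite3.connect("crawl4ai_rag.db")
conn.enable_load_extension(True)
sqlite_vec.load(conn)                      # load the vec0 extension
conn.enable_load_extension(False)

conn.executescript("""
CREATE TABLE IF NOT EXISTS crawled_content (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    url TEXT NOT NULL,
    title TEXT,
    content TEXT,
    tags TEXT,
    retention_policy TEXT DEFAULT 'permanent',
    session_id TEXT,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

CREATE TABLE IF NOT EXISTS blocked_domains (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    pattern TEXT NOT NULL UNIQUE            -- wildcard patterns, e.g. '*.example.com'
);

-- Shadow table used only in RAM mode to record pending changes.
CREATE TABLE IF NOT EXISTS _sync_tracker (
    table_name TEXT NOT NULL,
    record_id INTEGER,
    operation TEXT NOT NULL,                -- INSERT / UPDATE / DELETE
    timestamp TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- vec0 virtual table: content_id is the primary key, rowid is implicit.
CREATE VIRTUAL TABLE IF NOT EXISTS content_vectors USING vec0(
    content_id INTEGER PRIMARY KEY,
    embedding FLOAT[384]
);
""")
```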
## Available Tools
### Crawling Tools
1. **crawl_url**: Crawl without storing
2. **crawl_and_remember**: Crawl and store permanently
3. **crawl_temp**: Crawl and store temporarily (session-only)
4. **deep_crawl_dfs**: Deep crawl multiple pages using depth-first search without storing
5. **deep_crawl_and_store**: Deep crawl multiple pages using DFS and store all in knowledge base
### Search Tools
6. **search_memory**: Semantic search of stored content
- Supports tag filtering via `tags` parameter (comma-separated, ANY match)
- Results are automatically deduplicated by URL (keeps best match per URL)
- Example: `search_memory(query="react hooks", tags="react, frontend", limit=10)`
7. **target_search**: Intelligent search with automatic tag expansion
- Performs initial semantic search to discover tags
- Re-searches using discovered tags to expand results
- Deduplicates and ranks results by relevance
- Returns discovered tags and expansion metadata
- Example: `target_search(query="react hooks", initial_limit=5, expanded_limit=20)`
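The expansion flow behind `target_search` can be pictured roughly like this; the `db.search()` helper and the result fields (`url`, `tags`, `score`) are illustrative stand-ins, not the actual implementation in `core/rag_processor.py`:
```python
def target_search(db, query, initial_limit=5, expanded_limit=20):
    """Sketch of tag expansion: seed search -> collect tags -> re-search -> dedupe."""
    # 1. Initial semantic search to discover candidate tags.
    seed_results = db.search(query, limit=initial_limit)
    discovered_tags = sorted({
        tag.strip()
        for result in seed_results
        for tag in (result.get("tags") or "").split(",")
        if tag.strip()
    })

    # 2. Re-search using the discovered tags to widen the result set.
    expanded = db.search(query, tags=",".join(discovered_tags), limit=expanded_limit)

    # 3. Deduplicate by URL (keep the best-scoring match) and rank by relevance.
    best = {}
    for result in seed_results + expanded:
        url = result["url"]
        if url not in best or result["score"] > best[url]["score"]:
            best[url] = result
    ranked = sorted(best.values(), key=lambda r: r["score"], reverse=True)

    return {"results": ranked, "discovered_tags": discovered_tags}
```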
### Memory Management Tools
8. **list_memory**: List all stored content with optional filtering
9. **forget_url**: Remove specific content by URL
10. **clear_temp_memory**: Clear temporary session content
### Database Tools
11. **get_database_stats**: Get comprehensive database statistics
- Returns record counts, storage breakdown, and recent activity
- Includes `using_ram_db` flag to indicate if RAM or disk mode is active
- Queries the actual database in use (RAM when in memory mode)
12. **list_domains**: List all unique domains with page counts
### Security Tools
13. **block_domain**: Block domains from being crawled (supports wildcards)
14. **unblock_domain**: Remove domain from blocklist
15. **list_blocked_domains**: View all blocked domain patterns
## Development Commands
### Running the MCP Server
```bash
# Run the main MCP server (local mode)
python3 core/rag_processor.py
# Run in client mode (forwards to remote API)
# First set IS_SERVER=false in .env
python3 core/rag_processor.py
```
### Running the REST API Server
```bash
# Start API server (server mode)
python3 deployments/server/start_api_server.py
# Or use uvicorn directly (create_app is an app factory)
uvicorn api.api:create_app --factory --host 0.0.0.0 --port 8080
# Install API dependencies first
pip install -r requirements.txt
```
### Testing
```bash
# Test sqlite-vec installation
python3 core/utilities/test_sqlite_vec.py
# Test database operations
python3 core/utilities/dbstats.py
# Test batch crawling functionality
python3 core/utilities/batch_crawler.py
```
### Docker Operations
```bash
# Server Deployment (REST API + MCP Server)
docker compose -f deployments/server/docker-compose.yml up -d
docker compose -f deployments/server/docker-compose.yml logs -f
docker compose -f deployments/server/docker-compose.yml down
# Client Deployment (MCP Client only)
docker compose -f deployments/client/docker-compose.yml up -d
docker compose -f deployments/client/docker-compose.yml logs -f
docker compose -f deployments/client/docker-compose.yml down
# Standalone Crawl4AI service (if needed)
docker run -d --name crawl4ai -p 11235:11235 unclecode/crawl4ai:latest
docker logs crawl4ai
docker stop crawl4ai
```
The server expects JSON-RPC 2.0 requests via stdin and responds via stdout. Errors are logged to both stderr and the log file.
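A minimal driver for the stdio interface might look like this sketch; the `tools/call` shape follows the MCP convention, and a real client would also perform the MCP `initialize` handshake before calling tools (check `core/rag_processor.py` for the exact methods it accepts):
```python
import json
import subprocess

# Start the MCP server and exchange JSON-RPC 2.0 messages over stdin/stdout.
proc = subprocess.Popen(
    ["python3", "core/rag_processor.py"],
    stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True,
)

request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {"name": "crawl_url", "arguments": {"url": "https://example.com"}},
}
proc.stdin.write(json.dumps(request) + "\n")
proc.stdin.flush()

response = json.loads(proc.stdout.readline())   # response carries the matching "id"
print(response)
```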
## Deep Crawling Features
- **DFS Strategy**: Uses depth-first search to crawl multiple interconnected pages
- **Configurable Depth**: Control how many levels deep to crawl (max 5 levels)
- **Page Limits**: Restrict maximum pages crawled to prevent resource exhaustion (max 250 pages)
- **External Links**: Option to follow or ignore external domain links
- **URL Scoring**: Filter pages based on relevance scores (0.0-1.0 threshold)
- **Bulk Storage**: Store all discovered pages in the knowledge base with automatic tagging
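The traversal itself amounts to a bounded depth-first search. In the sketch below, `crawl_page(url)` stands in for the Crawl4AI call and is assumed to return the page content, its outgoing links, and a relevance score; the real logic lives in `core/operations/crawler.py`:
```python
from urllib.parse import urlparse

def deep_crawl_dfs(start_url, crawl_page, max_depth=3, max_pages=250,
                   include_external=False, score_threshold=0.0):
    """Bounded DFS over discovered links (sketch)."""
    start_domain = urlparse(start_url).netloc
    stack, seen, pages = [(start_url, 0)], set(), []

    while stack and len(pages) < max_pages:        # page limit guard
        url, depth = stack.pop()                   # LIFO pop -> depth-first order
        if url in seen or depth > max_depth:       # depth limit guard
            continue
        seen.add(url)

        content, links, score = crawl_page(url)
        if score < score_threshold:                # URL scoring filter (0.0-1.0)
            continue
        pages.append({"url": url, "depth": depth, "content": content})

        for link in links:
            if not include_external and urlparse(link).netloc != start_domain:
                continue                           # skip external domains by default
            stack.append((link, depth + 1))

    return pages
```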
## Data Persistence
- **Database**: SQLite file `crawl4ai_rag.db` in the working directory
- **Content Processing Pipeline**:
1. Crawl4AI extracts content using fit_markdown (cleaner than raw_markdown)
2. Content cleaner removes navigation, boilerplate, and low-quality content
3. Language detection filters out non-English content
4. Chunks are filtered to remove navigation-heavy segments
5. Only quality content is stored and embedded
- **Content Chunking**: Text split into 500-word chunks with 50-word overlap
- **Storage Optimization**: 70-80% reduction in storage through intelligent cleaning
- **Retention Policies**: 'permanent', 'session_only', or time-based (e.g., '30_days')
- **Deep Crawl Tags**: Automatically tagged with 'deep_crawl' and depth information
- **Vector Embeddings**: 384-dimensional vectors using all-MiniLM-L6-v2 model
- **Markdown Storage**: fit_markdown format for better semantic search quality
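Taken together, the pipeline, chunking parameters, and embedding model above suggest a flow like the following sketch (the navigation/boilerplate cleaning step from `core/data/content_cleaner.py` is omitted and function names are illustrative):
```python
from langdetect import detect
from langdetect.lang_detect_exception import LangDetectException
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")    # produces 384-dimensional vectors

def chunk_text(text, chunk_size=500, overlap=50):
    """Split text into 500-word chunks with a 50-word overlap."""
    words = text.split()
    step = chunk_size - overlap
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), step)]

def embed_english_content(fit_markdown):
    """Language filter -> chunk -> embed; returns (chunk, vector) pairs."""
    try:
        if detect(fit_markdown) != "en":           # drop non-English content
            return []
    except LangDetectException:                    # undetectable content is skipped
        return []
    chunks = chunk_text(fit_markdown)
    vectors = model.encode(chunks)                 # shape: (n_chunks, 384)
    return list(zip(chunks, vectors))
```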
## Environment Configuration
### Key Environment Variables
```bash
# RAM Database Mode (default: true)
USE_MEMORY_DB=true
# Server Mode (true = host API, false = forward to remote)
IS_SERVER=true
# API Configuration
API_PORT=8080
API_HOST=0.0.0.0
# Crawl4AI Service
CRAWL4AI_URL=http://localhost:11235
# Database Path
DB_PATH=/app/data/crawl4ai_rag.db
# Authentication
API_KEY=your-api-key-here
```
## Best Practices
### Performance Optimization
1. **Use RAM Database Mode**: Enable `USE_MEMORY_DB=true` for 10-50x faster queries
2. **Batch Operations**: Use batch crawler for bulk ingestion instead of individual crawls
3. **Tag Filtering**: Use specific tags in searches to reduce result set size
4. **Pagination**: Use limit/offset parameters to control memory usage
### Security
1. **Input Validation**: All inputs are automatically sanitized via dbdefense.py
2. **Domain Blocking**: Block untrusted or problematic domains before crawling
3. **API Authentication**: Always use API key authentication in production
4. **Rate Limiting**: Configure appropriate rate limits for your use case
### Content Management
1. **Retention Policies**: Use session_only for temporary data to avoid database bloat
2. **Regular Cleanup**: Periodically clear temp memory for long-running sessions
3. **Tag Organization**: Use consistent tagging strategy for better searchability
4. **Domain Monitoring**: Use list_domains to track content distribution
### Monitoring
1. **Database Stats**: Monitor `/api/v1/stats` for storage and performance metrics
2. **Sync Health**: Check `/api/v1/db/stats` for RAM sync status when using memory mode
3. **Error Logs**: Review `crawl4ai_rag_errors.log` for troubleshooting
4. **Blocked Domains**: Audit blocklist regularly via list_blocked_domains
## API Integration
### REST API Endpoints
The system provides a comprehensive REST API with the following endpoints:
#### Health & Status
- `GET /health` - Health check endpoint
- `GET /api/v1/help` - List all available tools with examples
- `GET /api/v1/status` - Server status and configuration
#### Crawling Endpoints
- `POST /api/v1/crawl` - Crawl URL without storing
- `POST /api/v1/crawl/store` - Crawl and store permanently
- `POST /api/v1/crawl/temp` - Crawl and store temporarily
- `POST /api/v1/crawl/deep/store` - Deep crawl and store multiple pages
#### Search Endpoints
- `POST /api/v1/search` - Semantic search with optional tag filtering
- `POST /api/v1/search/target` - Intelligent search with tag expansion
#### Memory Management
- `GET /api/v1/memory` - List stored content with pagination
- `DELETE /api/v1/memory` - Delete content by URL
- `DELETE /api/v1/memory/temp` - Clear temporary session content
#### Database & Statistics
- `GET /api/v1/stats` - Comprehensive database statistics
- `GET /api/v1/db/stats` - RAM database sync metrics and health
- `GET /api/v1/domains` - List all domains with page counts
#### Security & Blocking
- `GET /api/v1/blocked-domains` - List blocked domain patterns
- `POST /api/v1/blocked-domains` - Add domain to blocklist
- `DELETE /api/v1/blocked-domains` - Remove domain from blocklist
### API Features
- **Authentication**: API key-based authentication (Bearer token)
- **Rate Limiting**: Configurable rate limits per session
- **Bidirectional Mode**: Supports both server mode (hosting API) and client mode (forwarding to remote)
- **Configuration**: Uses `.env` file to switch between local and remote operation
- **Documentation**: Auto-generated OpenAPI docs at `/docs` endpoint
- **Error Handling**: Comprehensive error responses with detailed messages
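As an example, storing a page through the API might look like this; the endpoint path comes from the list above, while the JSON field names are assumptions to verify against the auto-generated `/docs`:
```python
import requests

API_URL = "http://localhost:8080"
API_KEY = "your-api-key-here"                        # same value as API_KEY in .env

resp = requests.post(
    f"{API_URL}/api/v1/crawl/store",
    headers={"Authorization": f"Bearer {API_KEY}"},  # API key as Bearer token
    json={"url": "https://example.com", "tags": "docs,example"},  # field names assumed
    timeout=60,
)
resp.raise_for_status()
print(resp.json())
```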
## Client Mode Support
The system supports running in client mode where it forwards requests to a remote API server. This allows for distributed deployment scenarios where the MCP server runs separately from the API backend.
## RAM Database Mode
The system supports running the entire SQLite database in memory for significantly faster query performance (10-50x improvement):
### How It Works
- **Initialization**: On container startup, the entire disk database is loaded into RAM using SQLite's backup API
- **Differential Sync**: Only changed records are synced back to disk using a shadow table (`_sync_tracker`) with triggers
- **Dual Sync Strategy**:
- **Idle Sync**: Syncs after 5 seconds of no write activity
- **Periodic Sync**: Syncs every 5 minutes regardless of activity
- **Automatic Tracking**: Regular tables use SQLite triggers for automatic change detection
- **Manual Tracking**: Virtual tables (like `content_vectors`) use manual tracking since they don't support triggers
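The initialization and dual sync strategy described above boil down to something like this sketch; the actual implementation (including the differential sync itself) lives in `core/data/sync_manager.py`:
```python
import sqlite3
import time

IDLE_SYNC_SECONDS = 5        # sync after 5 seconds of write inactivity
PERIODIC_SYNC_SECONDS = 300  # and at least every 5 minutes regardless of activity

def load_into_ram(db_path):
    """Copy the disk database into :memory: using SQLite's backup API."""
    disk = sqlite3.connect(db_path)
    ram = sqlite3.connect(":memory:")
    disk.backup(ram)
    disk.close()
    return ram

def should_sync(last_write, last_sync, now=None):
    """Dual strategy: idle sync or periodic sync, whichever condition fires first."""
    now = now if now is not None else time.time()
    idle = last_write > last_sync and (now - last_write) >= IDLE_SYNC_SECONDS
    periodic = (now - last_sync) >= PERIODIC_SYNC_SECONDS
    return idle or periodic
```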
### Virtual Table Support
The sync manager includes special handling for sqlite-vec virtual tables:
- **Schema Registry**: Hard-coded schemas for virtual tables (since `PRAGMA table_info` returns no columns for them)
- **Primary Key Handling**: Uses `content_id` for content_vectors instead of `id`
- **Column Detection**: Includes implicit `rowid` column in vec0 virtual tables
- **Validation**: Checks for empty column lists to prevent sync failures
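In code, the registry approach can be pictured like this; column names match the schema documented above, while the exact structure in `core/data/sync_manager.py` may differ:
```python
# PRAGMA table_info returns no columns for vec0 virtual tables, so the sync
# manager needs a hard-coded fallback along these lines.
VIRTUAL_TABLE_SCHEMAS = {
    "content_vectors": {
        "columns": ["rowid", "embedding", "content_id"],  # rowid is implicit in vec0
        "primary_key": "content_id",                      # not 'id' as in regular tables
    },
}

def get_table_columns(conn, table):
    """Prefer PRAGMA table_info; fall back to the registry for virtual tables."""
    cols = [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]
    if not cols:                                          # empty list -> virtual table
        cols = VIRTUAL_TABLE_SCHEMAS.get(table, {}).get("columns", [])
    if not cols:
        raise ValueError(f"No column information available for table {table!r}")
    return cols
```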
### Configuration
Set `USE_MEMORY_DB=true` in `.env` to enable RAM database mode. The system will:
- Load the entire database into `:memory:` on startup
- Track all changes via triggers and shadow table
- Sync changes to disk automatically (idle + periodic)
- Report `using_ram_db: true` in database stats while RAM mode is active
- Expose sync metrics in the `/api/v1/stats` response
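The trigger-based tracking for regular tables amounts to statements like the following (trigger names and exact SQL are illustrative; virtual tables cannot have triggers and are tracked manually instead):
```python
import sqlite3

TRACKING_TRIGGERS = """
CREATE TRIGGER IF NOT EXISTS trk_crawled_content_insert
AFTER INSERT ON crawled_content BEGIN
    INSERT INTO _sync_tracker (table_name, record_id, operation)
    VALUES ('crawled_content', NEW.id, 'INSERT');
END;

CREATE TRIGGER IF NOT EXISTS trk_crawled_content_update
AFTER UPDATE ON crawled_content BEGIN
    INSERT INTO _sync_tracker (table_name, record_id, operation)
    VALUES ('crawled_content', NEW.id, 'UPDATE');
END;

CREATE TRIGGER IF NOT EXISTS trk_crawled_content_delete
AFTER DELETE ON crawled_content BEGIN
    INSERT INTO _sync_tracker (table_name, record_id, operation)
    VALUES ('crawled_content', OLD.id, 'DELETE');
END;
"""

def install_tracking_triggers(ram_conn: sqlite3.Connection) -> None:
    """Attach change-tracking triggers to the in-memory database (RAM mode only)."""
    ram_conn.executescript(TRACKING_TRIGGERS)
```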
### Monitoring
- **Endpoint**: `/api/v1/stats` includes `sync_metrics` with full sync health data
- **Metrics**: Track total syncs, sync duration, failed syncs, records synced, and pending changes
- **Health**: Monitor pending changes, time since last sync, and sync success rate
- **Logs**: Check Docker logs for sync messages: `✅ Synced N changes to disk in Xs`
## Security Features
### Input Sanitization (dbdefense.py)
All user inputs are sanitized before database operations:
- **SQL Injection Defense**: Parameterized queries and strict input validation
- **URL Validation**: Protocol checking, domain validation, and blocked domain enforcement
- **String Sanitization**: Length limits, dangerous character removal, Unicode normalization
- **Type Coercion**: Integer, boolean, and pattern validation with safe defaults
- **Tag Sanitization**: Comma-separated tag lists with length and character restrictions
- **Blocked Domains**: Wildcard pattern matching to prevent crawling forbidden domains
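These checks can be pictured with a sketch like this; the limits and function names are illustrative, and the authoritative rules live in `core/data/dbdefense.py`:
```python
import fnmatch
import re
from urllib.parse import urlparse

MAX_TAG_LENGTH = 50          # illustrative limits; actual values are in dbdefense.py
MAX_TAGS = 20

def sanitize_tags(raw: str) -> str:
    """Strip dangerous characters and enforce length limits on a comma-separated tag list."""
    tags = []
    for tag in raw.split(","):
        tag = re.sub(r"[^\w\s-]", "", tag).strip()[:MAX_TAG_LENGTH]
        if tag:
            tags.append(tag)
    return ",".join(tags[:MAX_TAGS])

def is_url_allowed(url: str, blocked_patterns: list) -> bool:
    """Validate the protocol and enforce wildcard blocklist patterns like '*.example.com'."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        return False
    domain = parsed.netloc.lower()
    return not any(fnmatch.fnmatch(domain, p.lower()) for p in blocked_patterns)
```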
### Content Filtering
- **Social Media Platforms**: Automatic blocking of major social media sites
- **Adult Content**: Detection and blocking of NSFW domains
- **Custom Blocklist**: User-defined domain patterns with wildcard support
## Batch Crawling
### Legacy Batch Crawler (batch_crawler.py)
The system includes a batch crawler utility with advanced features:
- **Progress Tracking**: Real-time progress bars and statistics
- **Retry Logic**: Automatic retry of failed URLs with configurable attempts
- **Concurrency Control**: Configurable concurrent request limits
- **Error Logging**: Detailed error tracking with failed URL collection
- **Resume Support**: Can restart from failed URLs only
- **Rate Limiting**: Respects server rate limits and backoff strategies
### Recrawl Utility (recrawl_utility.py)
Modern utility for recrawling existing URLs with improved architecture:
- **API-Based**: Uses REST API instead of direct database access to avoid sync conflicts
- **Concurrent Processing**: Async processing with semaphore-based rate limiting
- **Direct DB Read**: Reads URLs from disk database, sends crawl requests to API
- **Rate Limiting**: Configurable delay and concurrency (e.g., 10 concurrent, 60 req/min)
- **Dry-Run Mode**: Preview what will be recrawled before executing
- **Progress Tracking**: Real-time progress with success/failure statistics
- **Filter Support**: Filter by retention policy, tags, or URL patterns
#### Usage Examples
```bash
# Dry run to preview recrawl
python3 core/utilities/recrawl_utility.py --all --dry-run
# Recrawl all URLs with 10 concurrent requests, 0.6s delay (60 req/min)
python3 core/utilities/recrawl_utility.py --all --concurrent 10 --delay 0.6
# Recrawl specific retention policy
python3 core/utilities/recrawl_utility.py --policy permanent --concurrent 5 --delay 1.0
# Recrawl URLs with specific tags
python3 core/utilities/recrawl_utility.py --tags "react,documentation" --limit 100
```
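Internally, the concurrency pattern described above (semaphore-bounded async requests plus a per-request delay) looks roughly like this; the endpoint and payload fields are assumptions to verify against `api/api.py`:
```python
import asyncio
import aiohttp

API_URL = "http://localhost:8080"
API_KEY = "your-api-key-here"

async def recrawl_all(urls, concurrent=10, delay=0.6):
    """At most `concurrent` requests in flight, each followed by a `delay`-second pause."""
    sem = asyncio.Semaphore(concurrent)

    async def recrawl_one(session, url):
        async with sem:                              # concurrency limit
            async with session.post(
                f"{API_URL}/api/v1/crawl/store",
                headers={"Authorization": f"Bearer {API_KEY}"},
                json={"url": url},                   # payload fields assumed
            ) as resp:
                ok = resp.status == 200
            await asyncio.sleep(delay)               # simple rate limiting
        return url, ok

    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(recrawl_one(session, u) for u in urls))

# Example: asyncio.run(recrawl_all(["https://example.com/docs"], concurrent=10, delay=0.6))
```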