Crawl4AI RAG MCP Server

A high-performance Retrieval-Augmented Generation (RAG) system that uses Crawl4AI for web content extraction, SQLite with sqlite-vec for vector storage, and the Model Context Protocol (MCP) for AI assistant integration.

Summary

This system provides a production-ready RAG solution that combines:

  • Crawl4AI for intelligent web content extraction with markdown conversion

  • SQLite with sqlite-vec for vector storage and semantic search

  • RAM Database Mode for 10-50x faster query performance

  • MCP Server for AI assistant integration (LM-Studio, Claude Desktop, etc.)

  • REST API for bidirectional communication and remote access

  • Security Layer with input sanitization and domain blocking

Quick Start

Option 1: Local Development

  1. Clone the repository and set up a virtual environment:

git clone https://github.com/Rob-P-Smith/mcpragcrawl4ai.git
cd mcpragcrawl4ai
python3 -m venv .venv
source .venv/bin/activate  # Linux/Mac
pip install -r requirements.txt
  2. Start the Crawl4AI service:

docker run -d --name crawl4ai -p 11235:11235 unclecode/crawl4ai:latest
  3. Configure the environment:

# Create .env file
cat > .env << EOF
IS_SERVER=true
USE_MEMORY_DB=true
LOCAL_API_KEY=dev-api-key
CRAWL4AI_URL=http://localhost:11235
EOF
  4. Run the MCP server:

python3 core/rag_processor.py
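
With the server running, you can exercise the stdio transport directly. The snippet below is a hypothetical smoke test: it assumes the server will answer a bare MCP tools/list request without a prior initialize handshake, which may not match this server's exact protocol flow.

# Hypothetical smoke test: send one JSON-RPC 2.0 request over stdio.
# The bare "tools/list" call (no initialize handshake) is an assumption.
import json
import subprocess

proc = subprocess.Popen(
    ["python3", "core/rag_processor.py"],
    stdin=subprocess.PIPE,
    stdout=subprocess.PIPE,
    text=True,
)
proc.stdin.write(json.dumps({"jsonrpc": "2.0", "id": 1, "method": "tools/list"}) + "\n")
proc.stdin.flush()
print(proc.stdout.readline())  # JSON-RPC response listing the available tools
proc.terminate()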

Option 2: Docker Server Deployment

  1. Deploy full server (REST API + MCP):

cd mcpragcrawl4ai
docker compose -f deployments/server/docker-compose.yml up -d
  2. Test the deployment:

curl http://localhost:8080/health

See Deployment Guide for complete deployment options.

Architecture

Core Components

  • MCP Server (core/rag_processor.py) - JSON-RPC 2.0 protocol handler (see the sketch after this list)

  • RAG Database (core/data/storage.py) - SQLite + sqlite-vec vector storage with RAM mode support

  • Content Cleaner (core/data/content_cleaner.py) - Navigation removal and quality filtering

  • Sync Manager (core/data/sync_manager.py) - RAM database differential sync with virtual table support

  • Crawler (core/operations/crawler.py) - Web crawling with DFS algorithm and content extraction

  • Defense Layer (core/data/dbdefense.py) - Input sanitization and security

  • REST API (api/api.py) - FastAPI server with 15+ endpoints

  • Auth System (api/auth.py) - API key authentication and rate limiting

  • Recrawl Utility (core/utilities/recrawl_utility.py) - Batch URL recrawling via API with concurrent processing
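
To make the MCP Server entry concrete: a JSON-RPC 2.0 stdio handler reduces to a read-dispatch-respond loop. The following is a minimal sketch of that pattern, not the code in core/rag_processor.py; the handler names are hypothetical.

# Minimal JSON-RPC 2.0 stdio loop (a sketch of the pattern only; handler
# names are hypothetical and not taken from core/rag_processor.py).
import json
import sys

def handle_tools_list(params: dict) -> dict:
    return {"tools": []}  # a real server would describe its tools here

HANDLERS = {"tools/list": handle_tools_list}

for line in sys.stdin:
    request = json.loads(line)
    handler = HANDLERS.get(request.get("method"))
    result = handler(request.get("params", {})) if handler else None
    response = {"jsonrpc": "2.0", "id": request.get("id"), "result": result}
    sys.stdout.write(json.dumps(response) + "\n")
    sys.stdout.flush()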

Database Schema

  • crawled_content - Web content with markdown, embeddings, and metadata

  • content_vectors - Vector embeddings (sqlite-vec vec0 virtual table with rowid support; see the sketch after this list)

  • sessions - User session tracking for temporary content

  • blocked_domains - Domain blocklist with wildcard patterns

  • _sync_tracker - Change tracking for RAM database differential sync (memory mode only)
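
A minimal sqlite-vec setup in the shape of this schema is shown below. The vec0 virtual table and 384-dimensional embeddings are stated elsewhere in this README; the column names and DDL details are assumptions for illustration.

# Minimal sqlite-vec setup mirroring the schema above (column names are
# illustrative assumptions; vec0 and the 384-dim size come from this README).
import sqlite3
import sqlite_vec
from sqlite_vec import serialize_float32

db = sqlite3.connect(":memory:")
db.enable_load_extension(True)
sqlite_vec.load(db)
db.enable_load_extension(False)

db.execute("CREATE TABLE crawled_content (id INTEGER PRIMARY KEY, url TEXT, markdown TEXT)")
db.execute("CREATE VIRTUAL TABLE content_vectors USING vec0(embedding float[384])")

# Insert a page plus its embedding, sharing the same rowid.
db.execute("INSERT INTO crawled_content (id, url, markdown) VALUES (1, 'https://example.com', '# Example')")
db.execute(
    "INSERT INTO content_vectors (rowid, embedding) VALUES (?, ?)",
    (1, serialize_float32([0.0] * 384)),
)

# k-nearest-neighbor query by vector distance.
rows = db.execute(
    "SELECT rowid, distance FROM content_vectors "
    "WHERE embedding MATCH ? ORDER BY distance LIMIT 5",
    (serialize_float32([0.0] * 384),),
).fetchall()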

Technology Stack

  • Python 3.11+ with asyncio for concurrent operations

  • SQLite with sqlite-vec extension for vector similarity search

  • SentenceTransformers (all-MiniLM-L6-v2) for embedding generation (see the example after this list)

  • langdetect for language detection and filtering

  • FastAPI for REST API with automatic OpenAPI documentation

  • Crawl4AI for intelligent web content extraction with fit_markdown

  • Docker for containerized deployment

  • aiohttp for async HTTP requests in utilities
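
The embedding model listed above is small enough to try directly; this snippet simply confirms the 384-dimensional output that the content_vectors table expects.

# Verify the embedding dimensionality used throughout the schema.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
embedding = model.encode("list comprehensions in python")
print(embedding.shape)  # (384,)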

Documentation

For detailed documentation, see the guides included in the repository.

Key Features

Performance

  • RAM Database Mode: In-memory SQLite with differential sync for 10-50x faster queries (see the sketch after this list)

  • Vector Search: 384-dimensional embeddings using all-MiniLM-L6-v2 for semantic search

  • Batch Crawling: High-performance batch processing with retry logic and progress tracking

  • Content Optimization: 70-80% storage reduction through intelligent cleaning and filtering

  • Efficient Storage: fit_markdown conversion and content chunking for optimal retrieval
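
The RAM database mode follows a standard SQLite pattern: copy the disk database into an in-memory connection at startup, serve queries from memory, and write changes back differentially. Below is a minimal sketch of the load step using Python's built-in backup API; the repository's sync_manager.py handles the harder write-back and virtual-table cases.

# Sketch of the RAM-load step only, using sqlite3's backup API.
# "rag.db" is a hypothetical filename; differential write-back is not shown.
# (A database with vec0 tables also needs sqlite-vec loaded on both connections.)
import sqlite3

disk_db = sqlite3.connect("rag.db")
mem_db = sqlite3.connect(":memory:")
disk_db.backup(mem_db)  # full copy of the disk database into memory
disk_db.close()
# Queries now run against mem_db at memory speed.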

Functionality

  • Deep Crawling: DFS-based multi-page crawling with depth and page limits (see the sketch after this list)

  • Content Cleaning: Automatic removal of navigation, boilerplate, and low-quality content

  • Language Filtering: Automatic detection and filtering of non-English content

  • Semantic Search: Vector similarity search with tag filtering and deduplication

  • Target Search: Intelligent search with automatic tag expansion

  • Content Management: Full CRUD operations with retention policies and session management

  • Batch Recrawling: Concurrent URL recrawling via API with rate limiting and progress tracking
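
The depth- and page-limited DFS mentioned in the Deep Crawling item can be sketched as follows; the real crawler in core/operations/crawler.py additionally extracts content, applies domain blocking, and runs asynchronously.

# Schematic DFS crawl with depth and page limits (illustrative only).
def dfs_crawl(start_url, get_links, max_depth=2, max_pages=50):
    visited = set()
    stack = [(start_url, 0)]
    while stack and len(visited) < max_pages:
        url, depth = stack.pop()
        if url in visited or depth > max_depth:
            continue
        visited.add(url)
        for link in get_links(url):  # caller-supplied link extraction
            stack.append((link, depth + 1))
    return visited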

Security

  • Input Sanitization: Comprehensive SQL injection defense and input validation

  • Domain Blocking: Wildcard-based domain blocking with social media and adult content filters (see the sketch after this list)

  • API Authentication: API key-based authentication with rate limiting

  • Safe Crawling: Automatic detection and blocking of forbidden content
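
Wildcard domain blocking of the kind described above maps naturally onto glob matching. The patterns below are illustrative; the actual blocklist lives in the blocked_domains table.

# Glob-style blocklist check (illustrative patterns; the real blocklist
# is stored in the blocked_domains table).
from fnmatch import fnmatch
from urllib.parse import urlparse

BLOCKED_PATTERNS = ["*.facebook.com", "twitter.com", "*.doubleclick.net"]

def is_blocked(url: str) -> bool:
    host = urlparse(url).hostname or ""
    return any(fnmatch(host, pattern) for pattern in BLOCKED_PATTERNS)

print(is_blocked("https://www.facebook.com/somepage"))  # True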

Integration

  • MCP Server: Full MCP protocol support for AI assistant integration

  • REST API: Complete REST API with 15+ endpoints for all operations

  • Bidirectional Mode: Server mode (host API) and client mode (forward to remote); see the sketch after this list

  • Docker Deployment: Production-ready containerized deployment
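
In client mode, local tool calls are forwarded to a remote server's REST API instead of a local database. Below is a minimal sketch of such a forwarder using aiohttp; the function name is hypothetical, while the /api/v1/search endpoint matches the REST examples below.

# Hypothetical client-mode forwarder: relay a search to a remote instance.
import aiohttp

async def forward_search(query: str, api_url: str, api_key: str) -> dict:
    async with aiohttp.ClientSession() as session:
        async with session.post(
            f"{api_url}/api/v1/search",
            headers={"Authorization": f"Bearer {api_key}"},
            json={"query": query, "limit": 5},
        ) as resp:
            return await resp.json()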

Quick Usage Examples

Via MCP (in LM-Studio/Claude Desktop)

crawl_and_remember("https://docs.python.org/3/tutorial/", tags="python, tutorial")
search_memory("list comprehensions", tags="python", limit=5)
target_search("async programming best practices", initial_limit=5, expanded_limit=20)
get_database_stats()

Via REST API

# Crawl and store content
curl -X POST http://localhost:8080/api/v1/crawl/store \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://docs.python.org/3/tutorial/", "tags": "python, tutorial"}'

# Semantic search
curl -X POST http://localhost:8080/api/v1/search \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"query": "list comprehensions", "tags": "python", "limit": 5}'

# Get database stats
curl http://localhost:8080/api/v1/stats \
  -H "Authorization: Bearer YOUR_API_KEY"

Via Python Client

import asyncio
from api.api import Crawl4AIClient

async def main():
    client = Crawl4AIClient("http://localhost:8080", "YOUR_API_KEY")
    result = await client.crawl_and_store("https://example.com", tags="example")
    search_results = await client.search("python tutorials", limit=10)
    stats = await client.get_database_stats()

asyncio.run(main())

Performance Metrics

With RAM database mode enabled:

  • Search queries: 20-50ms (vs 200-500ms disk mode)

  • Batch crawling: 2,000+ URLs successfully processed

  • Database size: 215MB (2,296 pages, 8,196 embeddings)

  • Sync overhead: <100ms for differential sync (idle: 5s, periodic: 5min)

  • Sync reliability: 100% success rate with virtual table support

  • Memory usage: ~500MB for full in-memory database

  • Storage optimization: 70-80% reduction through content cleaning
