Crawl4AI RAG MCP Server

by coleam00
MIT License

Integrations

  • Docker: Allows running the MCP server as a container, with configuration options for both SSE and stdio transports

  • n8n: Supports integration with n8n, with special network configuration instructions for Docker environments

  • Ollama: Planned future integration to enable running embedding models locally for complete privacy and control

An implementation of the Model Context Protocol (MCP), integrated with Crawl4AI and Supabase, that gives AI agents and AI coding assistants advanced web crawling and RAG capabilities.

With this MCP server, you can scrape anything and then use that knowledge anywhere for RAG.

The primary goal is to bring this MCP server into Archon as I evolve it into a knowledge engine that AI coding assistants can use to build AI agents. This first version of the Crawl4AI/RAG MCP server will be improved substantially soon, in particular by making it more configurable so you can use different embedding models and run everything locally with Ollama.

Overview

This MCP server provides tools that enable AI agents to crawl websites, store content in a vector database (Supabase), and perform RAG over the crawled content. It follows the best practices for building MCP servers based on the Mem0 MCP server template I provided on my channel previously.
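To make the store-then-retrieve flow concrete, here is a minimal in-memory sketch of vector search with an optional source filter. It is not the server's implementation: the real server keeps its vectors in Supabase, and the `Chunk` class, the `search` function, and the two-dimensional embeddings are hypothetical stand-ins chosen for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class Chunk:
    source: str             # e.g. the domain the page was crawled from
    text: str
    embedding: list[float]  # toy vector; a real one comes from an embedding model


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(store, query_embedding, source=None, k=3):
    """Return the k chunks most similar to the query, optionally from one source."""
    candidates = [c for c in store if source is None or c.source == source]
    ranked = sorted(
        candidates,
        key=lambda c: cosine(c.embedding, query_embedding),
        reverse=True,
    )
    return ranked[:k]


store = [
    Chunk("docs.example.com", "How to install", [1.0, 0.0]),
    Chunk("docs.example.com", "How to configure", [0.6, 0.8]),
    Chunk("blog.example.org", "Release notes", [0.0, 1.0]),
]

# Unfiltered search versus search restricted to a single source.
best_overall = search(store, [0.0, 1.0], k=1)
best_in_docs = search(store, [0.0, 1.0], source="docs.example.com", k=1)
```

Restricting the candidate set before ranking is what lets a source filter improve precision: chunks from unrelated sites never compete for the top slots.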

Vision

The Crawl4AI RAG MCP server is just the beginning. Here's where we're headed:

  1. Integration with Archon: Building this system directly into Archon to create a comprehensive knowledge engine for AI coding assistants to build better AI agents.
  2. Multiple Embedding Models: Expanding beyond OpenAI to support a variety of embedding models, including the ability to run everything locally with Ollama for complete control and privacy.
  3. Advanced RAG Strategies: Implementing sophisticated retrieval techniques like contextual retrieval, late chunking, and others to move beyond basic "naive lookups" and significantly enhance the power and precision of the RAG system, especially as it integrates with Archon.
  4. Enhanced Chunking Strategy: Implementing a Context 7-inspired chunking approach that focuses on examples and creates distinct, semantically meaningful sections for each chunk, improving retrieval precision.
  5. Performance Optimization: Increasing crawling and indexing speed so that new documentation can be indexed quickly enough to be used within the same prompt in an AI coding assistant.

Features

  • Smart URL Detection: Automatically detects and handles different URL types (regular webpages, sitemaps, text files)
  • Recursive Crawling: Follows internal links to discover content
  • Parallel Processing: Efficiently crawls multiple pages simultaneously
  • Content Chunking: Intelligently splits content by headers and size for better processing
  • Vector Search: Performs RAG over crawled content, optionally filtering by data source for precision
  • Source Retrieval: Retrieve sources available for filtering to guide the RAG process
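As an illustration of the Content Chunking feature, the sketch below splits Markdown on headers and then caps each section at a maximum size. This is an assumption about one reasonable way to do it, not the server's actual chunking code; the function name `chunk_markdown` and the default `max_chars` value are hypothetical.

```python
import re


def chunk_markdown(text: str, max_chars: int = 1000) -> list[str]:
    """Split on Markdown headers, then cap each section at max_chars."""
    # Zero-width split before every line that starts with 1-6 '#' and a space.
    sections = re.split(r"(?m)^(?=#{1,6} )", text)
    chunks = []
    for section in sections:
        section = section.strip()
        if not section:
            continue
        # Sections longer than max_chars are cut into fixed-size windows.
        for start in range(0, len(section), max_chars):
            chunks.append(section[start:start + max_chars])
    return chunks


doc = "# A\nintro\n## B\nbody"
chunks = chunk_markdown(doc)
```

Splitting on headers first keeps each chunk on a single topic; the size cap only applies when a section is too long to embed as one unit.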

Tools

The server provides four essential web crawling and search tools:

  1. crawl_single_page: Quickly crawl a single web page and store its content in the vector database
  2. smart_crawl_url: Intelligently crawl a full website based on the type of URL provided (sitemap, llms-full.txt, or a regular webpage that needs to be crawled recursively)
  3. get_available_sources: Get a list of all available sources (domains) in the database
  4. perform_rag_query: Search for relevant content using semantic search with optional source filtering
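The URL-type dispatch that smart_crawl_url describes can be sketched as a small classifier. The rules below (a `.txt` suffix for text files, an `.xml` path containing "sitemap" for sitemaps) are assumptions made for illustration, and `classify_url` is a hypothetical name; the server's own detection logic may differ.

```python
from urllib.parse import urlparse


def classify_url(url: str) -> str:
    """Guess which crawl strategy a URL calls for."""
    path = urlparse(url).path.lower()
    if path.endswith(".txt"):
        return "text_file"   # e.g. llms-full.txt: fetch and store directly
    if path.endswith(".xml") and "sitemap" in path:
        return "sitemap"     # parse the listed URLs and crawl them in parallel
    return "webpage"         # crawl recursively through internal links


kinds = [
    classify_url("https://example.com/sitemap.xml"),
    classify_url("https://example.com/llms-full.txt"),
    classify_url("https://example.com/docs/intro"),
]
```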

Prerequisites

Installation

Using Docker (Recommended)

  1. Clone this repository:
    git clone https://github.com/coleam00/mcp-crawl4ai-rag.git
    cd mcp-crawl4ai-rag
  2. Build the Docker image:
    docker build -t mcp/crawl4ai-rag --build-arg PORT=8051 .
  3. Create a .env file based on the configuration section below

Using uv directly (no Docker)

  1. Clone this repository:
    git clone https://github.com/coleam00/mcp-crawl4ai-rag.git
    cd mcp-crawl4ai-rag
  2. Install uv if you don't have it:
    pip install uv
  3. Create and activate a virtual environment:
    uv venv
    .venv\Scripts\activate  # on Mac/Linux: source .venv/bin/activate
  4. Install dependencies:
    uv pip install -e .
    crawl4ai-setup
  5. Create a .env file based on the configuration section below

Database Setup

Before running the server, you need to set up the database with the pgvector extension:

  1. Go to the SQL Editor in your Supabase dashboard (create a new project first if necessary)
  2. Create a new query and paste the contents of crawled_pages.sql
  3. Run the query to create the necessary tables and functions

Configuration

Create a .env file in the project root with the following variables:

# MCP Server Configuration
HOST=0.0.0.0
PORT=8051
TRANSPORT=sse

# OpenAI API Configuration
OPENAI_API_KEY=your_openai_api_key

# Supabase Configuration
SUPABASE_URL=your_supabase_project_url
SUPABASE_SERVICE_KEY=your_supabase_service_key
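A server reading these variables would typically validate them at startup. The following is a minimal sketch of that idea, not the server's code: the `Settings` class and `load_settings` function are hypothetical, and the fallback defaults for HOST, PORT, and TRANSPORT are assumptions taken from the sample values above.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    host: str
    port: int
    transport: str
    openai_api_key: str
    supabase_url: str
    supabase_service_key: str


REQUIRED = ("OPENAI_API_KEY", "SUPABASE_URL", "SUPABASE_SERVICE_KEY")


def load_settings(env=os.environ) -> Settings:
    """Build Settings from a mapping of environment variables, failing early."""
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError("Missing required variables: " + ", ".join(missing))
    transport = env.get("TRANSPORT", "sse")  # assumed default
    if transport not in ("sse", "stdio"):
        raise ValueError(f"TRANSPORT must be 'sse' or 'stdio', got {transport!r}")
    return Settings(
        host=env.get("HOST", "0.0.0.0"),      # assumed default
        port=int(env.get("PORT", "8051")),    # assumed default
        transport=transport,
        openai_api_key=env["OPENAI_API_KEY"],
        supabase_url=env["SUPABASE_URL"],
        supabase_service_key=env["SUPABASE_SERVICE_KEY"],
    )


# Example with an explicit mapping instead of the real process environment.
example = load_settings({
    "OPENAI_API_KEY": "your_openai_api_key",
    "SUPABASE_URL": "your_supabase_project_url",
    "SUPABASE_SERVICE_KEY": "your_supabase_service_key",
    "PORT": "9000",
})
```

Failing at startup with the names of the missing variables is usually easier to debug than a later authentication error from OpenAI or Supabase.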

Running the Server

Using Docker

docker run --env-file .env -p 8051:8051 mcp/crawl4ai-rag

Using Python

uv run src/crawl4ai_mcp.py

The server will start and listen on the configured host and port.

Integration with MCP Clients

SSE Configuration

Once you have the server running with SSE transport, you can connect to it using this configuration:

{
  "mcpServers": {
    "crawl4ai-rag": {
      "transport": "sse",
      "url": "http://localhost:8051/sse"
    }
  }
}

Note for Windsurf users: Use serverUrl instead of url in your configuration:

{
  "mcpServers": {
    "crawl4ai-rag": {
      "transport": "sse",
      "serverUrl": "http://localhost:8051/sse"
    }
  }
}

Note for Docker users: Use host.docker.internal instead of localhost if your client is running in a different container. This will apply if you are using this MCP server within n8n!
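The three SSE variants described in this section (default, Windsurf, and a client running in another container) differ only in the key name and the host. A small helper can generate any of them; `sse_client_config` and its parameters are hypothetical names used for illustration.

```python
import json


def sse_client_config(port: int = 8051, *, windsurf: bool = False,
                      in_docker: bool = False) -> dict:
    """Build the MCP client configuration for the SSE transport."""
    # A client inside another container cannot reach the server via localhost.
    host = "host.docker.internal" if in_docker else "localhost"
    # Windsurf expects "serverUrl" where other clients expect "url".
    url_key = "serverUrl" if windsurf else "url"
    return {
        "mcpServers": {
            "crawl4ai-rag": {
                "transport": "sse",
                url_key: f"http://{host}:{port}/sse",
            }
        }
    }


# Configuration for a client such as n8n running in its own container.
print(json.dumps(sse_client_config(in_docker=True), indent=2))
```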

Stdio Configuration

Add this server to your MCP configuration for Claude Desktop, Windsurf, or any other MCP client:

{
  "mcpServers": {
    "crawl4ai-rag": {
      "command": "python",
      "args": ["path/to/crawl4ai-mcp/src/crawl4ai_mcp.py"],
      "env": {
        "TRANSPORT": "stdio",
        "OPENAI_API_KEY": "your_openai_api_key",
        "SUPABASE_URL": "your_supabase_url",
        "SUPABASE_SERVICE_KEY": "your_supabase_service_key"
      }
    }
  }
}

Docker with Stdio Configuration

{
  "mcpServers": {
    "crawl4ai-rag": {
      "command": "docker",
      "args": ["run", "--rm", "-i",
               "-e", "TRANSPORT",
               "-e", "OPENAI_API_KEY",
               "-e", "SUPABASE_URL",
               "-e", "SUPABASE_SERVICE_KEY",
               "mcp/crawl4ai-rag"],
      "env": {
        "TRANSPORT": "stdio",
        "OPENAI_API_KEY": "your_openai_api_key",
        "SUPABASE_URL": "your_supabase_url",
        "SUPABASE_SERVICE_KEY": "your_supabase_service_key"
      }
    }
  }
}

Building Your Own Server

This implementation provides a foundation for building more complex MCP servers with web crawling capabilities. To build your own:

  1. Add your own tools by creating methods with the @mcp.tool() decorator
  2. Create your own lifespan function to add your own dependencies
  3. Modify the utils.py file for any helper functions you need
  4. Extend the crawling capabilities by adding more specialized crawlers
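The lifespan idea from step 2 can be sketched with the standard library alone: an async context manager creates shared dependencies at startup, hands them to the tools, and cleans up at shutdown. Everything named here (`AppContext`, `app_lifespan`, `word_count`) is hypothetical, and the wiring into the MCP server object and the @mcp.tool() decorator is deliberately left out so the example runs without extra packages.

```python
import asyncio
from contextlib import asynccontextmanager
from dataclasses import dataclass, field


@dataclass
class AppContext:
    """Hypothetical container for whatever the tools need (clients, caches)."""
    cache: dict = field(default_factory=dict)
    closed: bool = False


@asynccontextmanager
async def app_lifespan(server=None):
    ctx = AppContext()       # startup: open clients, warm caches
    try:
        yield ctx
    finally:
        ctx.closed = True    # shutdown: release resources


async def word_count(ctx: AppContext, text: str) -> int:
    """Hypothetical tool body; a real tool would carry the @mcp.tool() decorator."""
    if text not in ctx.cache:
        ctx.cache[text] = len(text.split())
    return ctx.cache[text]


async def main():
    async with app_lifespan() as ctx:
        result = await word_count(ctx, "crawl then retrieve")
    # After the block exits, the cleanup code in app_lifespan has run.
    return result, ctx.closed


outcome = asyncio.run(main())
```

Keeping dependencies in one context object means each new tool only needs a reference to that object, rather than creating its own clients.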