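Crawl4AI RAG MCP Server

A powerful implementation of the Model Context Protocol (MCP) integrated with Crawl4AI and Supabase, providing AI agents and AI coding assistants with advanced web crawling and RAG capabilities.

With this MCP server, you can scrape anything and then use that knowledge anywhere for RAG.

My primary goal is to bring this MCP server into Archon and evolve it into a knowledge engine that AI coding assistants can use to build AI agents. This first version of the Crawl4AI/RAG MCP server will be improved substantially soon, especially by making it more configurable so you can use different embedding models and run everything locally with Ollama.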

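Overview

This MCP server provides tools that enable AI agents to crawl websites, store the content in a vector database (Supabase), and perform RAG over the crawled content. It follows the best practices for building MCP servers, based on the Mem0 MCP server template I previously shared on my channel.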

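Vision

The Crawl4AI RAG MCP server is just the beginning. Here is where we are headed:

Integration with Archon: Building this system directly into Archon to create a comprehensive knowledge engine that AI coding assistants can use to build better AI agents.
Multiple embedding models: Expanding beyond OpenAI to support a variety of embedding models, including the ability to run everything locally with Ollama for complete control and privacy.
Advanced RAG strategies: Implementing more sophisticated retrieval techniques such as contextual retrieval and late chunking to go beyond basic "naive lookups" and significantly improve the power and precision of the RAG system, especially once it is integrated with Archon.
Enhanced chunking strategy: Implementing a Context 7-inspired chunking approach that focuses on examples and creates a distinct, semantically meaningful section for each chunk, improving retrieval precision.
Performance optimization: Increasing crawling and indexing speed so that it becomes realistic to "quickly" index new documentation and then use it within the same prompt in an AI coding assistant.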

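Features

Smart URL detection: Automatically detects and handles different URL types (regular webpages, sitemaps, text files)
Recursive crawling: Follows internal links to discover content
Parallel processing: Efficiently crawls multiple pages simultaneously
Content chunking: Intelligently splits content by header and size for better processing
Vector search: Performs RAG over crawled content, optionally filtering by data source for precision
Source retrieval: Retrieves the sources available for filtering to guide the RAG process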
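
To make the "content chunking" feature more concrete, here is a minimal sketch of splitting Markdown by header and then by size. The function name chunk_markdown and the max_chars default are assumptions chosen for illustration; the server's actual chunking logic may differ, and this sketch does not treat headers inside code fences specially.

```python
import re


def chunk_markdown(text: str, max_chars: int = 5000) -> list[str]:
    """Split Markdown at headings, then cap every piece at max_chars."""
    # Start a new section at each line that begins with 1-6 '#' plus a space.
    sections = re.split(r"^(?=#{1,6} )", text, flags=re.MULTILINE)
    chunks: list[str] = []
    for section in sections:
        section = section.strip()
        if not section:
            continue
        # Sections longer than max_chars are cut into fixed-size windows.
        for start in range(0, len(section), max_chars):
            chunks.append(section[start:start + max_chars])
    return chunks
```

Each returned chunk would then be embedded and stored as one row in the vector database.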

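Tools

The server provides four essential web crawling and search tools:

crawl_single_page: Quickly crawls a single web page and stores its content in the vector database
smart_crawl_url: Intelligently crawls a full website based on the type of URL provided (sitemap, llms-full.txt, or a regular webpage that needs to be crawled recursively)
get_available_sources: Gets a list of all available sources (domains) in the database
perform_rag_query: Searches for relevant content using semantic search with optional source filtering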
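
The URL-type dispatch that smart_crawl_url describes could look roughly like the sketch below. The helper classify_url and its return labels are hypothetical and only illustrate the idea; they are not taken from the server's source.

```python
from urllib.parse import urlparse


def classify_url(url: str) -> str:
    """Return 'text_file', 'sitemap', or 'webpage' for a given URL."""
    path = urlparse(url).path.lower()
    if path.endswith(".txt"):
        # e.g. llms-full.txt: fetch the file and chunk it directly
        return "text_file"
    if "sitemap" in path:
        # e.g. sitemap.xml: extract the listed URLs and crawl them in parallel
        return "sitemap"
    # Anything else is treated as a regular page and crawled recursively
    return "webpage"
```

A crawler would branch on the returned label to pick the matching crawl strategy.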

Prerequisites

Docker/Docker Desktop if running the MCP server as a container (recommended)
Python 3.12+ if running the MCP server directly through uv
Supabase (database for RAG)
OpenAI API key (for generating embeddings)

Installation

Using Docker (recommended)

Clone this repository:
git clone https://github.com/coleam00/mcp-crawl4ai-rag.git
cd mcp-crawl4ai-rag
Build the Docker image:
docker build -t mcp/crawl4ai-rag --build-arg PORT=8051 .
Create a .env file based on the Configuration section below

Using uv directly (no Docker)

Clone this repository:
git clone https://github.com/coleam00/mcp-crawl4ai-rag.git
cd mcp-crawl4ai-rag
Install uv if you don't have it:
pip install uv
Create and activate a virtual environment:
uv venv
.venv\Scripts\activate
# on Mac/Linux: source .venv/bin/activate
Install dependencies:
uv pip install -e .
crawl4ai-setup
Create a .env file based on the Configuration section below

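Database Setup

Before running the server, you need to set up the database with the pgvector extension:

Go to the SQL Editor in your Supabase dashboard (create a new project first if necessary)
Create a new query and paste the contents of crawled_pages.sql
Run the query to create the necessary tables and functions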

Configuration

Create a .env file in the project root with the following variables:

# MCP Server Configuration
HOST=0.0.0.0
PORT=8051
TRANSPORT=sse

# OpenAI API Configuration
OPENAI_API_KEY=your_openai_api_key

# Supabase Configuration
SUPABASE_URL=your_supabase_project_url
SUPABASE_SERVICE_KEY=your_supabase_service_key
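
As a rough illustration of how these variables might be consumed, the sketch below reads them from the environment, applies the defaults shown above, and fails early if a required key is missing. The function load_settings is a hypothetical helper, not part of the server's code.

```python
import os
from collections.abc import Mapping

REQUIRED_KEYS = ("OPENAI_API_KEY", "SUPABASE_URL", "SUPABASE_SERVICE_KEY")


def load_settings(env: Mapping[str, str] = os.environ) -> dict:
    """Collect server settings from env, raising if a required key is unset."""
    missing = [key for key in REQUIRED_KEYS if not env.get(key)]
    if missing:
        raise ValueError(f"Missing required variables: {', '.join(missing)}")
    return {
        "host": env.get("HOST", "0.0.0.0"),
        "port": int(env.get("PORT", "8051")),
        "transport": env.get("TRANSPORT", "sse"),
        "openai_api_key": env["OPENAI_API_KEY"],
        "supabase_url": env["SUPABASE_URL"],
        "supabase_service_key": env["SUPABASE_SERVICE_KEY"],
    }
```

Passing a plain dict instead of os.environ makes the helper easy to exercise without touching the real environment.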

Running the Server

Using Docker

docker run --env-file .env -p 8051:8051 mcp/crawl4ai-rag

Using Python

uv run src/crawl4ai_mcp.py

The server will start and listen on the configured host and port.

Integration with MCP Clients

SSE Configuration

Once the server is running with the SSE transport, you can connect to it using this configuration:

{
  "mcpServers": {
    "crawl4ai-rag": {
      "transport": "sse",
      "url": "http://localhost:8051/sse"
    }
  }
}

Note for Windsurf users: Use serverUrl instead of url in your configuration:
{
  "mcpServers": {
    "crawl4ai-rag": {
      "transport": "sse",
      "serverUrl": "http://localhost:8051/sse"
    }
  }
}
Note for Docker users: If your client runs in a different container, use host.docker.internal instead of localhost. The same applies if you use this MCP server from within n8n.
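
The two notes above can be captured in a small helper that generates the right client configuration. This is only a sketch: sse_client_config and its parameters are hypothetical names, and the output mirrors the JSON shown in this section.

```python
import json


def sse_client_config(
    port: int = 8051,
    windsurf: bool = False,
    client_in_container: bool = False,
) -> dict:
    """Build the SSE client config, adjusting for Windsurf and Docker."""
    # A client inside another container cannot reach the host via localhost.
    host = "host.docker.internal" if client_in_container else "localhost"
    # Windsurf expects the key serverUrl rather than url.
    url_key = "serverUrl" if windsurf else "url"
    return {
        "mcpServers": {
            "crawl4ai-rag": {
                "transport": "sse",
                url_key: f"http://{host}:{port}/sse",
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(sse_client_config(windsurf=True), indent=2))
```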

Stdio Configuration

Add this server to the MCP configuration for Claude Desktop, Windsurf, or any other MCP client:

{
  "mcpServers": {
    "crawl4ai-rag": {
      "command": "python",
      "args": ["path/to/crawl4ai-mcp/src/crawl4ai_mcp.py"],
      "env": {
        "TRANSPORT": "stdio",
        "OPENAI_API_KEY": "your_openai_api_key",
        "SUPABASE_URL": "your_supabase_url",
        "SUPABASE_SERVICE_KEY": "your_supabase_service_key"
      }
    }
  }
}

Docker with Stdio Configuration

{
  "mcpServers": {
    "crawl4ai-rag": {
      "command": "docker",
      "args": ["run", "--rm", "-i",
               "-e", "TRANSPORT",
               "-e", "OPENAI_API_KEY",
               "-e", "SUPABASE_URL",
               "-e", "SUPABASE_SERVICE_KEY",
               "mcp/crawl4ai-rag"],
      "env": {
        "TRANSPORT": "stdio",
        "OPENAI_API_KEY": "your_openai_api_key",
        "SUPABASE_URL": "your_supabase_url",
        "SUPABASE_SERVICE_KEY": "your_supabase_service_key"
      }
    }
  }
}

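Building Your Own Server

This implementation provides a foundation for building more complex MCP servers with web crawling capabilities. To build your own:

Add your own tools by creating methods with the @mcp.tool() decorator
Create your own lifespan function to add your own dependencies
Modify the utils.py file for any helper functions you need
Extend the crawling capabilities by adding more specialized crawlers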
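
As a minimal sketch of the first step, the example below registers a custom tool with the @mcp.tool() decorator. It assumes the server object is a FastMCP instance from the official MCP Python SDK (mcp.server.fastmcp); the tool count_words, the helper register_tools, and the server name are invented for illustration and are not part of this repository.

```python
def register_tools(mcp) -> None:
    """Attach custom tools to an MCP server object exposing a tool() decorator."""

    @mcp.tool()
    def count_words(text: str) -> int:
        """Count the whitespace-separated words in a piece of crawled text."""
        return len(text.split())


def build_server():
    """Create a server and register the tools (requires the mcp package)."""
    # Imported lazily so that this module can be loaded without the SDK installed.
    from mcp.server.fastmcp import FastMCP

    server = FastMCP("my-crawl-server")  # hypothetical server name
    register_tools(server)
    return server
```

Keeping registration in its own function makes it possible to exercise the tools with a stand-in object before wiring them into a real server.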
