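Documentation Crawler & MCP Server

This project provides a set of tools to crawl websites, generate Markdown documentation from them, and make that documentation searchable through a Model Context Protocol (MCP) server designed to integrate with tools such as Cursor.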

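Features

Web crawler:
- Crawls websites with crawl4ai, starting from a given URL.
- Configurable crawl depth, URL patterns (include/exclude), content types, and more.
- Optional HTML cleaning before Markdown conversion (removes navigation links, headers, and footers).
- Generates a single, consolidated Markdown file from the crawled content.
- Saves output to ./storage/ by default.
MCP server:
- Loads Markdown files from the ./storage/ directory.
- Parses the Markdown into semantic chunks based on headings.
- Generates a vector embedding for each chunk using sentence-transformers (multi-qa-mpnet-base-dot-v1).
- **Caching:** Uses a cache file (storage/document_chunks_cache.pkl) to store processed chunks and embeddings.
  - **First run:** After new documents are crawled, the server's initial startup can take some time, because it has to parse, chunk, and embed all of the content.
  - **Subsequent runs:** If the cache file exists and the modification times of the source .md files in ./storage/ have not changed, the server loads directly from the cache, which makes startup much faster.
  - **Cache invalidation:** If any .md file in ./storage/ has been modified, added, or removed since the cache was last created, the cache is invalidated and regenerated automatically.
- Exposes MCP tools to clients such as Cursor via fastmcp:
  - list_documents: Lists the available crawled documents.
  - get_document_headings: Retrieves the heading structure of a document.
  - search_documentation: Performs semantic search over document chunks using vector similarity.
Cursor integration: Designed to run the MCP server over stdio transport for use inside Cursor.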
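
The heading-based chunking mentioned above can be illustrated with a small, standard-library-only sketch. This is not the project's actual parser: the `Chunk` type, the `chunk_by_headings` function, and the decision to ignore `#` lines inside fenced code are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass

# ATX-style Markdown heading: one to six "#" characters, a space, then the title.
HEADING_RE = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")
FENCE = "`" * 3  # marker that opens or closes a fenced code block


@dataclass
class Chunk:
    heading: str  # heading text that opens the chunk ("" for text before any heading)
    level: int    # 1-6 for "#".."######", 0 for the preamble
    text: str     # body text that follows the heading


def chunk_by_headings(markdown: str) -> list[Chunk]:
    """Split Markdown into one chunk per heading (hypothetical helper)."""
    chunks: list[Chunk] = []
    heading, level, body = "", 0, []
    in_fence = False

    def flush() -> None:
        text = "\n".join(body).strip()
        if heading or text:
            chunks.append(Chunk(heading, level, text))

    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
        # Lines inside fenced code are never treated as headings.
        match = None if in_fence else HEADING_RE.match(line)
        if match:
            flush()
            heading, level, body = match.group(2), len(match.group(1)), []
        else:
            body.append(line)
    flush()
    return chunks
```

Each resulting chunk would then be passed to the embedding model; keeping the heading alongside the body text is one simple way to preserve context for search results.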

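Workflow

1. **Crawl:** Use the crawler_cli tool to crawl a website and generate a .md file in ./storage/.
2. **Run the server:** Configure and run mcp_server (typically managed by an MCP client such as Cursor).
3. **Load and embed:** The server automatically loads, chunks, and embeds the content of the .md files in ./storage/.
4. **Query:** Use an MCP client (for example, Cursor Agent) to query the crawled content through the server's tools (list_documents, search_documentation, and so on).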

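Setup

This project uses uv for dependency management and execution.

Install uv: Follow the instructions on the uv website.
Clone the repository:

git clone https://github.com/alizdavoodi/MCPDocSearch.git
cd MCPDocSearch

Install dependencies:

uv sync

This command creates a virtual environment (usually .venv) and installs all dependencies listed in pyproject.toml.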

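Usage

1. Crawling Documentation

Run the crawler with the crawl.py script or directly via uv run.

Basic example:

uv run python crawl.py https://docs.example.com

This crawls https://docs.example.com with the default settings and saves the output to ./storage/docs.example.com.md.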
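
Both examples in this README show a default output file named after the site's host name. A minimal sketch of that naming rule, assuming it is simply "host name plus .md" (the function `default_output_path` is hypothetical and not taken from the project's code):

```python
from pathlib import Path
from urllib.parse import urlparse


def default_output_path(start_url: str, storage_dir: str = "./storage") -> Path:
    """Derive an output path such as storage/docs.example.com.md from a start URL."""
    host = urlparse(start_url).hostname
    if not host:
        raise ValueError(f"URL has no host name: {start_url!r}")
    # Assumption: the path and query string of the URL do not affect the file name.
    return Path(storage_dir) / f"{host}.md"
```

If two crawls of the same host should not overwrite each other, passing --output explicitly avoids relying on the default name.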

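Example with options:

uv run python crawl.py https://docs.another.site --output ./storage/custom_name.md --max-depth 2 --keyword "API" --keyword "Reference" --exclude-pattern "*blog*"

View all options:

uv run python crawl.py --help

Key options include:

- --output / -o: Specifies the output file path.
- --max-depth / -d: Sets the crawl depth (must be between 1 and 5).
- --include-pattern / --exclude-pattern: Filters the URLs to crawl.
- --keyword / -k: Keywords used for relevance scoring during the crawl.
- --remove-links / --keep-links: Controls HTML cleaning.
- --cache-mode: Controls crawl4ai caching (DEFAULT, BYPASS, FORCE_REFRESH).
- --wait-for: Waits for a given time in seconds or for a CSS selector before capturing content (for example, 5 or 'css:.content'). Useful for pages that load content late.
- --js-code: Executes custom JavaScript on the page before capturing content.
- --page-load-timeout: Sets the maximum time, in seconds, to wait for a page to load.
- --wait-for-js-render / --no-wait-for-js-render: Enables a dedicated script that scrolls the page and clicks potential "Load more" buttons to better handle JavaScript-heavy single-page applications (SPAs). If --wait-for is not specified, a default wait time is set automatically.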
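
The --wait-for option accepts two kinds of value: a number of seconds or a css: prefixed selector. The sketch below shows one way such a value could be told apart. It is an illustration only; `WaitSpec` and `parse_wait_for` are hypothetical names, and the crawler's real parsing may differ.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class WaitSpec:
    seconds: Optional[float] = None      # set when the value is a plain number
    css_selector: Optional[str] = None   # set when the value starts with "css:"


def parse_wait_for(value: str) -> WaitSpec:
    """Interpret a --wait-for style value such as "5" or "css:.content"."""
    value = value.strip()
    if value.startswith("css:"):
        selector = value[len("css:"):].strip()
        if not selector:
            raise ValueError("empty CSS selector after 'css:'")
        return WaitSpec(css_selector=selector)
    try:
        seconds = float(value)
    except ValueError:
        raise ValueError(
            f"expected seconds or 'css:<selector>', got {value!r}"
        ) from None
    if seconds < 0:
        raise ValueError("wait time must not be negative")
    return WaitSpec(seconds=seconds)
```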

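Refining Crawls with Patterns and Depth

Sometimes you only want to crawl a specific subsection of a documentation site. This usually takes some trial and error with --include-pattern and --max-depth.

- --include-pattern: Restricts the crawler to following only links whose URL matches the given pattern. Use wildcards (*) for flexibility.
- --max-depth: Controls how many "clicks" away from the start URL the crawler goes. A depth of 1 means it only crawls pages linked directly from the start URL. A depth of 2 means it crawls those pages and the pages linked from them (if they also match the include pattern), and so on.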
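
How the two options interact can be shown on a toy, in-memory link graph. The sketch below is a plain breadth-first traversal that uses `fnmatchcase` as a stand-in for whatever pattern matching the crawler really applies; `plan_crawl` and the `SITE` graph are made up for illustration and no network access is involved.

```python
from collections import deque
from fnmatch import fnmatchcase


def plan_crawl(start, links, include_pattern="*", max_depth=2):
    """Return the URLs a depth-limited, pattern-filtered crawl would visit, in order.

    links maps each URL to the URLs it links to. The start URL counts as depth 0,
    so max_depth=1 visits only pages linked directly from the start page.
    """
    visited = [start]
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        url, depth = queue.popleft()
        if depth >= max_depth:
            continue  # links found on this page would exceed the depth limit
        for target in links.get(url, []):
            if target in seen or not fnmatchcase(target, include_pattern):
                continue
            seen.add(target)
            visited.append(target)
            queue.append((target, depth + 1))
    return visited


# Hypothetical site structure used only for this example.
SITE = {
    "/docs/admin-api-overview/": ["/docs/admin-api-clusters/", "/docs/client-libraries/"],
    "/docs/admin-api-clusters/": ["/docs/admin-api-tenants/"],
    "/docs/admin-api-tenants/": ["/docs/admin-api-topics/"],
}

if __name__ == "__main__":
    for depth in (1, 2, 3):
        print(depth, plan_crawl("/docs/admin-api-overview/", SITE, "*admin-api*", depth))
```

In this toy graph, each increase in depth picks up exactly one more admin-api page, while the client-libraries page is never visited because it does not match the pattern.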

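Example: Crawling only the Pulsar Admin API section

Suppose you only want the content under https://pulsar.apache.org/docs/4.0.x/admin-api-*.

- **Start URL:** You can begin at the overview page: https://pulsar.apache.org/docs/4.0.x/admin-api-overview/.
- **Include pattern:** You only want links that contain admin-api: --include-pattern "*admin-api*".
- **Max depth:** You need to work out how many levels deep the Admin API links go from the start page. Start with 2 and increase it if needed.
- **Verbose mode:** Use -v to see which URLs are visited or skipped, which helps when debugging patterns and depth.

uv run python crawl.py https://pulsar.apache.org/docs/4.0.x/admin-api-overview/ -v --include-pattern "*admin-api*" --max-depth 2

Check the output file (by default ./storage/pulsar.apache.org.md in this case). If pages are missing, try raising --max-depth to 3. If too many unrelated pages are included, make --include-pattern more specific or add --exclude-pattern rules.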

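2. Running the MCP Server

The MCP server is designed to be run by an MCP client such as Cursor over stdio transport. The command that runs the server is:

python -m mcp_server.main

However, it must be run from the project's root directory (MCPDocSearch) so that Python can find the mcp_server module.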

⚠️ Note: Embedding Time

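The MCP server generates embeddings locally on its first run, and again whenever the source Markdown files in ./storage/ change. This involves loading a machine learning model and processing every text chunk.

**Time varies:** The time needed to generate embeddings can differ significantly depending on:
- **Hardware:** Systems with a compatible GPU (CUDA or Apple Silicon/MPS) will be much faster than CPU-only systems.
- **Data size:** The total number of Markdown files and the length of their content directly affect processing time.
**Be patient:** For large documentation sets or slower hardware, the initial startup (or a startup after changes) can take several minutes. Subsequent startups that use the cache are much faster. ⏳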
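
The "rebuild only when the source files change" behaviour can be sketched with modification times and a pickle file. This is a simplified illustration, not the project's data loader: `snapshot_mtimes`, `load_or_rebuild`, and the `build` callback (which stands in for the slow chunk-and-embed step) are hypothetical.

```python
import pickle
from pathlib import Path


def snapshot_mtimes(storage_dir: Path) -> dict[str, int]:
    """Record the modification time of every .md file in the directory."""
    return {p.name: p.stat().st_mtime_ns for p in sorted(storage_dir.glob("*.md"))}


def load_or_rebuild(storage_dir: Path, cache_path: Path, build):
    """Return (data, cache_hit). build(storage_dir) is the expensive step."""
    current = snapshot_mtimes(storage_dir)
    if cache_path.exists():
        with cache_path.open("rb") as fh:
            cached = pickle.load(fh)  # only safe for cache files you wrote yourself
        # A modified, added, or removed .md file changes the snapshot and
        # therefore invalidates the cache.
        if cached.get("mtimes") == current:
            return cached["data"], True
    data = build(storage_dir)
    with cache_path.open("wb") as fh:
        pickle.dump({"mtimes": current, "data": data}, fh)
    return data, False
```

Comparing the whole snapshot, rather than only the newest timestamp, is what lets this sketch notice deleted files as well as edited ones.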

3. Configure Cursor / Claude for Desktop

To use this server with Cursor, create a .cursor/mcp.json file in the root of this project (MCPDocSearch/.cursor/mcp.json) with the following content. Replace /path/to/your/MCPDocSearch with the absolute path to this project directory on your machine.

{
  "mcpServers": {
    "doc-query-server": {
      "command": "uv",
      "args": [
        "--directory",
        "/path/to/your/MCPDocSearch",
        "run",
        "python",
        "-m",
        "mcp_server.main"
      ],
      "env": {}
    }
  }
}

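Explanation:

- "doc-query-server": The name of the server within Cursor.
- "command": "uv": Specifies uv as the command runner.
- "args":
  - "--directory", "/path/to/your/MCPDocSearch": Crucially, this tells uv to change its working directory to the project root before running the command. Replace the placeholder with the absolute path to the project on your machine.
  - "run", "python", "-m", "mcp_server.main": The command that uv executes in the correct directory and virtual environment.

After saving this file and restarting Cursor, "doc-query-server" should be available in Cursor's MCP settings and usable by the agent (for example, @doc-query-server search documentation for "how to install").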
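
Because the configuration needs an absolute path, it can be convenient to generate the file instead of editing it by hand. The following is an optional helper sketch, not part of the project: `build_cursor_config` and `write_cursor_config` are hypothetical names, and the keys simply mirror the JSON shown in this section.

```python
import json
from pathlib import Path


def build_cursor_config(project_root: Path) -> dict:
    """Build the mcp.json structure with the project's absolute path filled in."""
    return {
        "mcpServers": {
            "doc-query-server": {
                "command": "uv",
                "args": [
                    "--directory",
                    str(project_root.resolve()),  # absolute path, as the README requires
                    "run",
                    "python",
                    "-m",
                    "mcp_server.main",
                ],
                "env": {},
            }
        }
    }


def write_cursor_config(project_root: Path) -> Path:
    """Write <project_root>/.cursor/mcp.json and return its path."""
    target = project_root / ".cursor" / "mcp.json"
    target.parent.mkdir(parents=True, exist_ok=True)
    config = build_cursor_config(project_root)
    target.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return target


if __name__ == "__main__":
    # Assumes the script is run from the MCPDocSearch project root.
    print(write_cursor_config(Path.cwd()))
```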

For Claude for Desktop, follow the official documentation on setting up an MCP server.

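Dependencies

Key libraries used:

- crawl4ai: Core web crawling functionality.
- fastmcp: MCP server implementation.
- sentence-transformers: Generates text embeddings.
- torch: Required by sentence-transformers.
- typer: Builds the crawler CLI.
- uv: Project and environment management.
- beautifulsoup4 (via crawl4ai): HTML parsing.
- rich: Enhanced terminal output.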

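Architecture

The project follows this basic flow:

1. crawler_cli: You run this tool, providing a start URL and options.
2. Crawling: The tool uses crawl4ai to fetch web pages and follows links according to the configured rules (depth, patterns).
3. Cleaning: Optionally, the HTML content is cleaned with BeautifulSoup (navigation and links are removed).
4. Markdown generation: The cleaned HTML is converted to Markdown.
5. Storage: The generated Markdown content is saved to a file in the ./storage/ directory.
6. mcp_server: When the MCP server starts (usually through Cursor's configuration), it runs mcp_server/data_loader.py.
7. Loading and caching: The data loader checks for the cache file (.pkl). If the cache is valid, it loads the chunks and embeddings from it. Otherwise, it reads the .md files from ./storage/.
8. Chunking and embedding: The Markdown files are parsed into chunks based on headings. An embedding is generated for each chunk with sentence-transformers and kept in memory (and saved to the cache).
9. MCP tools: The server exposes tools (list_documents, search_documentation, and so on) via fastmcp.
10. Querying (Cursor): An MCP client such as Cursor can call these tools. search_documentation uses the precomputed embeddings to find the chunks that are semantically most similar to the query.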
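
The final step, ranking chunks against a query, comes down to comparing vectors. The sketch below uses tiny hand-written vectors in place of real sentence-transformers embeddings and scores them with a dot product, which is the scoring the multi-qa-mpnet-base-dot-v1 model is tuned for. `Hit`, `dot`, and `search` are illustrative names, not the server's actual API.

```python
from typing import NamedTuple, Sequence


class Hit(NamedTuple):
    score: float
    chunk_id: str


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(x * y for x, y in zip(a, b))


def search(
    query_vec: Sequence[float],
    chunk_vecs: dict[str, Sequence[float]],
    top_k: int = 3,
) -> list[Hit]:
    """Return the top_k chunks ranked by dot-product similarity to the query."""
    hits = [Hit(dot(query_vec, vec), chunk_id) for chunk_id, vec in chunk_vecs.items()]
    hits.sort(key=lambda hit: hit.score, reverse=True)
    return hits[:top_k]


# Toy 3-dimensional "embeddings"; real embeddings have hundreds of dimensions.
CHUNKS = {
    "install": [1.0, 0.0, 0.0],
    "config": [0.0, 1.0, 0.0],
    "api": [0.5, 0.5, 0.0],
}

if __name__ == "__main__":
    for hit in search([0.9, 0.1, 0.0], CHUNKS, top_k=2):
        print(f"{hit.chunk_id}: {hit.score:.2f}")
```

Because the chunk embeddings are computed once at load time, answering a query only requires embedding the query text and running a comparison like this one.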

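License

This project is licensed under the MIT License. See the LICENSE file for details.

Contributing

Contributions are welcome! Feel free to open an issue or submit a pull request.

Security Notes

**Pickle cache:** This project uses Python's pickle module to cache processed data (storage/document_chunks_cache.pkl). Unpickling data from untrusted sources can be unsafe. Make sure the ./storage/ directory is writable only by trusted users and processes.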
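
One quick way to check the write-permission advice on a POSIX system is to inspect the directory's mode bits. This is a rough sketch only: `writable_by_group_or_others` is a hypothetical helper, it does not account for ACLs, and the permission bits carry little meaning on Windows.

```python
import stat
from pathlib import Path


def writable_by_group_or_others(path: Path) -> bool:
    """Return True if the group or other users have write permission on path."""
    mode = path.stat().st_mode
    return bool(mode & (stat.S_IWGRP | stat.S_IWOTH))


if __name__ == "__main__":
    storage = Path("./storage")
    if storage.exists() and writable_by_group_or_others(storage):
        print("Warning: ./storage/ is writable by users other than the owner.")
```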
