mcp-server-webcrawl

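Bridge the gap between your web crawl and AI language models using the Model Context Protocol (MCP). With mcp-server-webcrawl, your AI client filters and analyzes web content under your direction or autonomously. The server includes a full-text search interface with boolean support, and resource filtering by type, HTTP status, and more.

mcp-server-webcrawl gives the LLM a complete menu for searching your web content, and works with a variety of web crawlers: wget, WARC, InterroBot, Katana, and SiteOne.

mcp-server-webcrawl is free and open source, and requires Claude Desktop and Python (>=3.10). It is installed from the command line with pip install: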

pip install mcp-server-webcrawl

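Features

Claude Desktop ready
Full-text search support
Filter by type, status, and more
Compatible with multiple crawlers
Supports advanced/boolean and field search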

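MCP Configuration

From the Claude Desktop menu, go to File > Settings > Developer. Click Edit Config to locate the configuration file, open it in the editor of your choice, and modify the example to reflect your datasrc path.

You can set up additional mcp-server-webcrawl connections under mcpServers as needed.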

{
  "mcpServers": {
    "webcrawl": {
      "command": [varies by OS/env, see below],
      "args": [varies by crawler, see below]
    }
  }
}

For step-by-step setup, see the Setup Guides.
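
The configuration shape above can also be generated rather than typed by hand. The following is a minimal sketch, not part of mcp-server-webcrawl: build_mcp_config is a hypothetical helper, and the command and datasrc values are the placeholders used elsewhere in this README.

```python
import json


def build_mcp_config(command, crawler, datasrc, name="webcrawl"):
    """Return a Claude Desktop config dict with one mcp-server-webcrawl entry.

    build_mcp_config is a hypothetical helper for illustration only.
    """
    return {
        "mcpServers": {
            name: {
                "command": command,
                "args": ["--crawler", crawler, "--datasrc", datasrc],
            }
        }
    }


# Placeholder values; substitute your own command and datasrc path.
config = build_mcp_config(
    "mcp-server-webcrawl", "wget", "/path/to/wget/archives/"
)
print(json.dumps(config, indent=2))
```

Paste the printed JSON into the file opened by Edit Config, merging it with any mcpServers entries you already have.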

Windows vs. macOS

Windows: set command to "mcp-server-webcrawl"

macOS: set command to the absolute path, i.e. the output of which mcp-server-webcrawl

For example:

"command": "/Users/yourusername/.local/bin/mcp-server-webcrawl",

To find the absolute path of the mcp-server-webcrawl executable on your system:

Open Terminal
Run which mcp-server-webcrawl
Copy the full path returned and use it in your config file
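
The same lookup can be done from Python, which may be convenient if you script your setup. This is a small sketch using the standard library's shutil.which; resolve_command is a hypothetical name, and the result depends on the PATH of the environment the script runs in.

```python
import json
import shutil


def resolve_command(name="mcp-server-webcrawl"):
    """Return the path of the executable found on PATH, or None if absent."""
    return shutil.which(name)


path = resolve_command()
if path is None:
    print("mcp-server-webcrawl was not found on PATH in this environment")
else:
    # json.dumps quotes and escapes the path for use in the config file.
    print(f'"command": {json.dumps(path)},')
```

If the script reports that nothing was found, check that pip installed the package into the same Python environment you are running.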

wget (using --mirror)

The datasrc argument should be set to the parent directory of the mirrors.

"args": ["--crawler", "wget", "--datasrc", "/path/to/wget/archives/"]

WARC

The datasrc argument should be set to the parent directory of the WARC files.

"args": ["--crawler", "warc", "--datasrc", "/path/to/warc/archives/"]

InterroBot

The datasrc argument should be set to the direct path to the database.

"args": ["--crawler", "interrobot", "--datasrc", "/path/to/Documents/InterroBot/interrobot.v2.db"]

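Katana

The datasrc argument should be set to the directory of root hosts. Katana separates pages and media by host, so ./archives/example.com/example.com is expected and appropriate. More complex sites expand the crawl data into the origin host directories.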

"args": ["--crawler", "katana", "--datasrc", "/path/to/katana/archives/"]

SiteOne (using Generate offline website)

The datasrc argument should be set to the parent directory of the archives, and archiving must be enabled.

"args": ["--crawler", "siteone", "--datasrc", "/path/to/SiteOne/archives/"]

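A common configuration mistake is pointing datasrc at the wrong kind of path. The sketch below checks the rule stated in the crawler sections above: wget, WARC, Katana, and SiteOne take a directory, while InterroBot takes the database file itself. check_datasrc is a hypothetical helper, not part of mcp-server-webcrawl, and it only inspects the path type, not the contents.

```python
from pathlib import Path

# Crawlers whose datasrc is a directory, per the sections above.
DIRECTORY_CRAWLERS = {"wget", "warc", "katana", "siteone"}
# InterroBot's datasrc is the path to the database file itself.
FILE_CRAWLERS = {"interrobot"}


def check_datasrc(crawler, datasrc):
    """Return None if datasrc is the right kind of path, else a message."""
    path = Path(datasrc)
    if crawler in DIRECTORY_CRAWLERS:
        if path.is_dir():
            return None
        return f"{crawler}: expected a directory at {datasrc}"
    if crawler in FILE_CRAWLERS:
        if path.is_file():
            return None
        return f"{crawler}: expected a database file at {datasrc}"
    return f"unknown crawler: {crawler}"


# Placeholder path; prints a message unless the directory exists.
print(check_datasrc("wget", "/path/to/wget/archives/"))
```

A return value of None means the path type matches what the crawler section asks for.
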
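Boolean Search Syntax

The query engine supports field-specific (field: value) searches and complex boolean expressions. Fulltext search is supported as a combination of the url, content, and headers fields.

While the API interface is designed to be consumed directly by the LLM, it helps to be familiar with the search syntax. Searches generated by the LLM are inspectable, but generally collapsed in the UI. If you need to see a query, expand the MCP collapsible.

Example Queries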

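Query example	Description
privacy	fulltext single keyword match
"privacy policy"	fulltext match of the exact phrase
boundar*	fulltext wildcard match of results starting with boundar (boundary, boundaries)
id: 12345	id field matches a specific resource by ID
url: example.com/*	url field matches results with URLs containing example.com/
type: html	type field matches HTML pages only
status: 200	status field matches a specific HTTP status code (equal to 200)
status: >=400	status field matches HTTP status codes greater than or equal to 400
content: h1	content field matches the content (HTTP response body, often but not always HTML)
headers: text/xml	headers field matches the HTTP response headers
privacy AND policy	fulltext match of both terms
privacy OR policy	fulltext match of either term
policy NOT privacy	fulltext match of policy results that do not contain privacy
(login OR signin) AND form	fulltext match of login or signin, together with form
type: html AND status: 200	matches only HTML pages with HTTP success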
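
If you want to experiment with the syntax outside the LLM, queries can be assembled as plain strings. The helpers below are a hypothetical sketch and not an API of mcp-server-webcrawl; they only show how field: value clauses combine with AND and OR, with parentheses for grouping.

```python
def field(name, value):
    """Format one field clause, e.g. field("type", "html") gives "type: html"."""
    return f"{name}: {value}"


def all_of(*clauses):
    """Join clauses with AND."""
    return " AND ".join(clauses)


def any_of(*clauses):
    """Join clauses with OR, parenthesized so the group can nest inside AND."""
    return "(" + " OR ".join(clauses) + ")"


print(all_of(field("type", "html"), field("status", "200")))
# prints: type: html AND status: 200
print(all_of(any_of("login", "signin"), "form"))
# prints: (login OR signin) AND form
```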

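Field Search Definitions

Field search provides precision: it lets you specify which column of the search index to filter on. Rather than searching all content, you can restrict a query to a specific attribute such as the URL, headers, or content body. This is more efficient when looking for specific attributes or patterns in crawl data.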

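Field	Description
id	database ID
url	resource URL
type	enumerated list of types (see the types table)
status	HTTP response code
headers	HTTP response headers
content	HTTP body: HTML, CSS, JS, and more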

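Content Types

Crawls contain many resource types beyond HTML pages. The type: field search filters by broad content type group, which is especially useful for filtering images without complex extension queries. For example, you can search type: html NOT content: login to find pages that do not contain "login", or search type: img to analyze image resources. The table below lists all content types supported by the search system.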

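Type	Description
html	web pages
iframe	iframes
img	web images
audio	web audio files
video	web video files
font	web font files
style	CSS stylesheets
script	JavaScript files
rss	RSS syndication feeds
text	plain text content
pdf	PDF files
doc	MS Word documents
other	uncategorized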
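
Because type is an enumerated field, a typo such as type: image matches nothing useful. As a hedged illustration, the sketch below validates a value against the table above before building the clause. CONTENT_TYPES and type_clause are hypothetical names, and the set assumes the type values are exactly those listed in the table.

```python
# Assumed to mirror the content types table above.
CONTENT_TYPES = frozenset({
    "html", "iframe", "img", "audio", "video", "font", "style",
    "script", "rss", "text", "pdf", "doc", "other",
})


def type_clause(content_type):
    """Return a type: clause, rejecting values missing from the table."""
    if content_type not in CONTENT_TYPES:
        raise ValueError(f"unsupported content type: {content_type!r}")
    return f"type: {content_type}"


print(type_clause("img"))
# prints: type: img
```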

