WebScraping-AI MCP Server

A Model Context Protocol (MCP) server implementation that integrates with WebScraping.AI to provide web data extraction capabilities.

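Features

Question answering about web page content
Structured data extraction from web pages
HTML content retrieval with JavaScript rendering
Plain text extraction from web pages
CSS selector-based content extraction
Multiple proxy types (datacenter, residential) with country selection
JavaScript rendering using headless Chrome/Chromium
Concurrent request management with rate limiting
Custom JavaScript execution on target pages
Device emulation (desktop, mobile, tablet)
Account usage monitoring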

Installation

Running with npx

env WEBSCRAPING_AI_API_KEY=your_api_key npx -y webscraping-ai-mcp

Manual Installation

# Clone the repository
git clone https://github.com/webscraping-ai/webscraping-ai-mcp-server.git
cd webscraping-ai-mcp-server

# Install dependencies
npm install

# Run
npm start

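Configuring in Cursor

Note: Requires Cursor version 0.45.6+

The WebScraping.AI MCP server can be configured in Cursor in two ways:

Project-specific configuration (recommended for team projects): create a .cursor/mcp.json file in your project directory:

{
  "servers": {
    "webscraping-ai": {
      "type": "command",
      "command": "npx -y webscraping-ai-mcp",
      "env": {
        "WEBSCRAPING_AI_API_KEY": "your-api-key",
        "WEBSCRAPING_AI_CONCURRENCY_LIMIT": "5"
      }
    }
  }
}

Global configuration (for personal use across all projects): create a ~/.cursor/mcp.json file in your home directory, using the same configuration format as above.

If you are on Windows and run into issues, try using cmd /c "set WEBSCRAPING_AI_API_KEY=your-api-key && npx -y webscraping-ai-mcp" as the command.

With this configuration, the WebScraping.AI tools become available to Cursor's AI agent automatically whenever they are relevant to a web scraping task.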
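If the same configuration has to be generated for several projects or machines, it can be built programmatically. The following is a minimal sketch that reproduces the JSON structure shown above; `buildCursorConfig` and its `windows` and `concurrency` options are hypothetical names introduced for illustration, not part of the server or of Cursor.

```javascript
// Hypothetical helper: builds the object to be written to .cursor/mcp.json.
// The shape mirrors the configuration documented in this README.
function buildCursorConfig(apiKey, { concurrency = 5, windows = false } = {}) {
  // On Windows, the README suggests wrapping the command in cmd /c
  const command = windows
    ? `cmd /c "set WEBSCRAPING_AI_API_KEY=${apiKey} && npx -y webscraping-ai-mcp"`
    : 'npx -y webscraping-ai-mcp';

  return {
    servers: {
      'webscraping-ai': {
        type: 'command',
        command,
        env: {
          WEBSCRAPING_AI_API_KEY: apiKey,
          // Environment variable values are strings
          WEBSCRAPING_AI_CONCURRENCY_LIMIT: String(concurrency)
        }
      }
    }
  };
}

// Print the JSON; redirect it into .cursor/mcp.json if it looks right
console.log(JSON.stringify(buildCursorConfig('your-api-key'), null, 2));
```

Printing the result rather than writing the file directly keeps the sketch free of side effects.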

Running on Claude Desktop

Add this to your claude_desktop_config.json:

{
  "mcpServers": {
    "mcp-server-webscraping-ai": {
      "command": "npx",
      "args": ["-y", "webscraping-ai-mcp"],
      "env": {
        "WEBSCRAPING_AI_API_KEY": "YOUR_API_KEY_HERE",
        "WEBSCRAPING_AI_CONCURRENCY_LIMIT": "5"
      }
    }
  }
}

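Configuration

Environment Variables

Required

WEBSCRAPING_AI_API_KEY: Your WebScraping.AI API key
- Required for all operations
- Get your API key from WebScraping.AI

Optional Configuration

WEBSCRAPING_AI_CONCURRENCY_LIMIT: Maximum number of concurrent requests (default: 5)
WEBSCRAPING_AI_DEFAULT_PROXY_TYPE: Type of proxy to use (default: residential)
WEBSCRAPING_AI_DEFAULT_JS_RENDERING: Enable or disable JavaScript rendering (default: true)
WEBSCRAPING_AI_DEFAULT_TIMEOUT: Maximum web page retrieval time in ms (default: 15000, maximum: 30000)
WEBSCRAPING_AI_DEFAULT_JS_TIMEOUT: Maximum JavaScript rendering time in ms (default: 2000)

Configuration Examples

For standard usage: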

# Required
export WEBSCRAPING_AI_API_KEY=your-api-key

# Optional - customize behavior (default values)
export WEBSCRAPING_AI_CONCURRENCY_LIMIT=5
export WEBSCRAPING_AI_DEFAULT_PROXY_TYPE=residential # datacenter or residential
export WEBSCRAPING_AI_DEFAULT_JS_RENDERING=true
export WEBSCRAPING_AI_DEFAULT_TIMEOUT=15000
export WEBSCRAPING_AI_DEFAULT_JS_TIMEOUT=2000
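To make the defaults and limits above concrete, here is a minimal sketch of how a Node.js process might read these variables. It is not the server's actual code: `loadConfig` and the returned property names are hypothetical, and the clamping of the timeout to 30000 ms is an assumption based on the documented maximum.

```javascript
// Hypothetical helper: read the documented variables from an env-like object
// and apply the defaults listed in this README.
function loadConfig(env = process.env) {
  if (!env.WEBSCRAPING_AI_API_KEY) {
    throw new Error('WEBSCRAPING_AI_API_KEY is required');
  }

  // Parse an integer, falling back to the default when unset or invalid
  const toInt = (value, fallback) => {
    const parsed = Number.parseInt(value, 10);
    return Number.isNaN(parsed) ? fallback : parsed;
  };

  return {
    apiKey: env.WEBSCRAPING_AI_API_KEY,
    concurrencyLimit: toInt(env.WEBSCRAPING_AI_CONCURRENCY_LIMIT, 5),
    proxyType: env.WEBSCRAPING_AI_DEFAULT_PROXY_TYPE || 'residential',
    // Assumption: anything other than the literal string "false" keeps rendering on
    jsRendering: env.WEBSCRAPING_AI_DEFAULT_JS_RENDERING !== 'false',
    // Assumption: values above the documented 30000 ms maximum are clamped
    timeout: Math.min(toInt(env.WEBSCRAPING_AI_DEFAULT_TIMEOUT, 15000), 30000),
    jsTimeout: toInt(env.WEBSCRAPING_AI_DEFAULT_JS_TIMEOUT, 2000)
  };
}

console.log(loadConfig({ WEBSCRAPING_AI_API_KEY: 'your-api-key' }));
```

Passing the environment in as a parameter makes the function easy to exercise without touching the real process environment.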

Available Tools

1. Question Tool (`webscraping_ai_question`)

Ask a question about the content of a web page.

{
  "name": "webscraping_ai_question",
  "arguments": {
    "url": "https://example.com",
    "question": "What is the main topic of this page?",
    "timeout": 30000,
    "js": true,
    "js_timeout": 2000,
    "wait_for": ".content-loaded",
    "proxy": "datacenter",
    "country": "us"
  }
}

Example response:

{
  "content": [
    {
      "type": "text",
      "text": "The main topic of this page is examples and documentation for HTML and web standards."
    }
  ],
  "isError": false
}
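A client usually wants the answer as a plain string rather than the full result envelope. The sketch below shows one way to unwrap a result of the shape shown above; `extractText` is a hypothetical helper name, and treating `isError: true` as an exception is a design choice of this sketch rather than documented behavior.

```javascript
// Hypothetical helper: join all text items of a tool result into one string,
// and throw if the result is flagged as an error.
function extractText(result) {
  const text = result.content
    .filter((item) => item.type === 'text')
    .map((item) => item.text)
    .join('\n');

  if (result.isError) {
    throw new Error(text || 'Tool call failed');
  }
  return text;
}

// Sample result copied from the example response in this README
const sampleAnswer = {
  content: [
    {
      type: 'text',
      text: 'The main topic of this page is examples and documentation for HTML and web standards.'
    }
  ],
  isError: false
};

console.log(extractText(sampleAnswer));
```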

2. Fields Tool (`webscraping_ai_fields`)

Extract structured data from a web page based on instructions.

{
  "name": "webscraping_ai_fields",
  "arguments": {
    "url": "https://example.com/product",
    "fields": {
      "title": "Extract the product title",
      "price": "Extract the product price",
      "description": "Extract the product description"
    },
    "js": true,
    "timeout": 30000
  }
}

Example response:

{
  "content": [
    {
      "type": "text",
      "text": {
        "title": "Example Product",
        "price": "$99.99",
        "description": "This is an example product description."
      }
    }
  ],
  "isError": false
}
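The example above shows the extracted fields as a nested object. MCP text content is normally a string, so a client may instead receive the fields as JSON-encoded text; this is an assumption, not something the README states. A defensive sketch that accepts both forms (`parseFields` is a hypothetical name):

```javascript
// Hypothetical helper: return the extracted fields as an object, whether the
// server delivered them as an object or as a JSON-encoded string.
function parseFields(result) {
  const item = result.content.find((entry) => entry.type === 'text');
  if (!item) {
    return {};
  }
  return typeof item.text === 'string' ? JSON.parse(item.text) : item.text;
}

// Both shapes produce the same object
const asObject = { content: [{ type: 'text', text: { title: 'Example Product', price: '$99.99' } }] };
const asString = { content: [{ type: 'text', text: '{"title":"Example Product","price":"$99.99"}' }] };

console.log(parseFields(asObject).price);
console.log(parseFields(asString).price);
```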

3. HTML Tool (`webscraping_ai_html`)

Get the full HTML of a web page, with JavaScript rendering.

{
  "name": "webscraping_ai_html",
  "arguments": {
    "url": "https://example.com",
    "js": true,
    "timeout": 30000,
    "wait_for": "#content-loaded"
  }
}

Example response:

{
  "content": [
    {
      "type": "text",
      "text": "<html>...[full HTML content]...</html>"
    }
  ],
  "isError": false
}

4. Text Tool (`webscraping_ai_text`)

Extract the visible text content from a web page.

{
  "name": "webscraping_ai_text",
  "arguments": {
    "url": "https://example.com",
    "js": true,
    "timeout": 30000
  }
}

Example response:

{
  "content": [
    {
      "type": "text",
      "text": "Example Domain\nThis domain is for use in illustrative examples in documents..."
    }
  ],
  "isError": false
}

5. Selected Tool (`webscraping_ai_selected`)

Extract content from a specific element using a CSS selector.

{
  "name": "webscraping_ai_selected",
  "arguments": {
    "url": "https://example.com",
    "selector": "div.main-content",
    "js": true,
    "timeout": 30000
  }
}

Example response:

{
  "content": [
    {
      "type": "text",
      "text": "<div class=\"main-content\">This is the main content of the page.</div>"
    }
  ],
  "isError": false
}

6. Selected Multiple Tool (`webscraping_ai_selected_multiple`)

Extract content from multiple elements using CSS selectors.

{
  "name": "webscraping_ai_selected_multiple",
  "arguments": {
    "url": "https://example.com",
    "selectors": ["div.header", "div.product-list", "div.footer"],
    "js": true,
    "timeout": 30000
  }
}

Example response:

{
  "content": [
    {
      "type": "text",
      "text": [
        "<div class=\"header\">Header content</div>",
        "<div class=\"product-list\">Product list content</div>",
        "<div class=\"footer\">Footer content</div>"
      ]
    }
  ],
  "isError": false
}
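In the example above, the returned fragments appear in the same order as the requested selectors. Assuming that ordering holds in general (the README does not state it explicitly), the fragments can be keyed by selector for easier lookup. `pairSelectors` is a hypothetical helper name.

```javascript
// Hypothetical helper: map each requested selector to its returned fragment.
// Assumption: fragments come back in the same order as the selectors.
function pairSelectors(selectors, result) {
  const item = result.content.find((entry) => entry.type === 'text');
  // Accept either an array or a JSON-encoded array
  const fragments = typeof item.text === 'string' ? JSON.parse(item.text) : item.text;

  return Object.fromEntries(
    selectors.map((selector, index) => [selector, fragments[index] ?? null])
  );
}

const multiResult = {
  content: [
    {
      type: 'text',
      text: [
        '<div class="header">Header content</div>',
        '<div class="product-list">Product list content</div>'
      ]
    }
  ],
  isError: false
};

// The third selector has no fragment, so it maps to null
console.log(pairSelectors(['div.header', 'div.product-list', 'div.footer'], multiResult));
```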

7. Account Tool (`webscraping_ai_account`)

Get information about your WebScraping.AI account.

{
  "name": "webscraping_ai_account",
  "arguments": {}
}

Example response:

{
  "content": [
    {
      "type": "text",
      "text": {
        "requests": 5000,
        "remaining": 4500,
        "limit": 10000,
        "resets_at": "2023-12-31T23:59:59Z"
      }
    }
  ],
  "isError": false
}
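One practical use of the account tool is to check the remaining quota before starting a large batch of requests. The sketch below assumes that `remaining` and `limit` in the example response mean "requests left" and "requests allowed per period"; the README does not define these fields, and `quotaStatus` and its threshold are hypothetical.

```javascript
// Hypothetical helper: summarize quota from an account info object.
// Assumption: `remaining` / `limit` is the share of the quota still available.
function quotaStatus(account, lowThreshold = 0.1) {
  const remainingRatio = account.limit > 0 ? account.remaining / account.limit : 0;
  return {
    remainingRatio,
    isLow: remainingRatio < lowThreshold
  };
}

// Values copied from the example response in this README
const accountInfo = {
  requests: 5000,
  remaining: 4500,
  limit: 10000,
  resets_at: '2023-12-31T23:59:59Z'
};

console.log(quotaStatus(accountInfo));
```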

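Common Options for All Tools

The following options can be used with all scraping tools:

timeout: Maximum web page retrieval time in ms (default: 15000, maximum: 30000)
js: Execute on-page JavaScript using a headless browser (default: true)
js_timeout: Maximum JavaScript rendering time in ms (default: 2000)
wait_for: CSS selector to wait for before returning the page content
proxy: Type of proxy, datacenter or residential (default: residential)
country: Country of the proxy to use (default: us). Supported countries: us, gb, de, it, fr, ca, es, ru, jp, kr, in
custom_proxy: Your own proxy URL, in the format "http://user:password@host:port"
device: Type of device emulation. Supported values: desktop, mobile, tablet
error_on_404: Return an error when the target page responds with a 404 HTTP status (default: false)
error_on_redirect: Return an error when the target page redirects (default: false)
js_script: Custom JavaScript code to execute on the target page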
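Checking options before sending a request can save an API call on an obviously invalid input. The following is a minimal client-side sketch based only on the limits listed above; `validateOptions` is a hypothetical helper, and the server may validate differently.

```javascript
// Allowed values taken from the option list in this README
const PROXY_TYPES = ['datacenter', 'residential'];
const DEVICE_TYPES = ['desktop', 'mobile', 'tablet'];
const MAX_TIMEOUT_MS = 30000;

// Hypothetical helper: return a list of human-readable problems (empty if none)
function validateOptions(options) {
  const errors = [];
  const { timeout, proxy, device, custom_proxy: customProxy } = options;

  if (timeout !== undefined &&
      (!Number.isInteger(timeout) || timeout <= 0 || timeout > MAX_TIMEOUT_MS)) {
    errors.push(`timeout must be an integer between 1 and ${MAX_TIMEOUT_MS}`);
  }
  if (proxy !== undefined && !PROXY_TYPES.includes(proxy)) {
    errors.push(`proxy must be one of: ${PROXY_TYPES.join(', ')}`);
  }
  if (device !== undefined && !DEVICE_TYPES.includes(device)) {
    errors.push(`device must be one of: ${DEVICE_TYPES.join(', ')}`);
  }
  if (customProxy !== undefined) {
    try {
      new URL(customProxy); // throws on a malformed URL
    } catch {
      errors.push('custom_proxy must be a valid URL');
    }
  }
  return errors;
}

console.log(validateOptions({ timeout: 45000, device: 'watch' }));
```

Returning a list instead of throwing on the first problem lets a caller report every issue at once.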

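Error Handling

The server provides robust error handling:

Automatic retries for transient errors
Rate limit handling with backoff
Detailed error messages
Network resilience

Example error response: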

{
  "content": [
    {
      "type": "text",
      "text": "API Error: 429 Too Many Requests"
    }
  ],
  "isError": true
}
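Even with server-side retries, a client can still receive an error result such as the 429 above. The sketch below shows client-side exponential backoff around a tool call. It is not the server's internal retry logic: `isRetryable`, `backoffDelay`, and `callWithRetry` are hypothetical names, and detecting the status code by matching the error text is an assumption based on the example message format.

```javascript
// Assumption: retryable failures mention a 429 or 5xx status in their text
function isRetryable(result) {
  if (!result.isError) {
    return false;
  }
  const text = result.content.map((item) => String(item.text)).join(' ');
  return /\b(429|5\d\d)\b/.test(text);
}

// Exponential backoff: base, 2x base, 4x base, ... capped at maxMs
function backoffDelay(attempt, baseMs = 500, maxMs = 8000) {
  return Math.min(baseMs * 2 ** attempt, maxMs);
}

// `callTool` is any function returning a promise of a tool result,
// for example () => client.callTool({ name, arguments })
async function callWithRetry(callTool, maxAttempts = 4) {
  let result;
  for (let attempt = 0; attempt < maxAttempts; attempt += 1) {
    result = await callTool();
    if (!isRetryable(result)) {
      return result;
    }
    if (attempt < maxAttempts - 1) {
      await new Promise((resolve) => setTimeout(resolve, backoffDelay(attempt)));
    }
  }
  return result; // still failing after the last attempt
}
```

Non-retryable errors, such as a 404 from the target page, are returned immediately so the caller can handle them.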

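Integration with LLMs

This server implements the Model Context Protocol (MCP), making it compatible with any MCP-enabled LLM platform. You can configure your LLM to use these tools for web scraping tasks.

Example: Configuring Claude with MCP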

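const { Anthropic } = require('@anthropic-ai/sdk');
const { Client } = require('@modelcontextprotocol/sdk/client/index.js');
const { StdioClientTransport } = require('@modelcontextprotocol/sdk/client/stdio.js');

async function main() {
  const anthropic = new Anthropic({
    apiKey: process.env.ANTHROPIC_API_KEY
  });

  // Launch the MCP server as a child process and talk to it over stdio
  const transport = new StdioClientTransport({
    command: 'npx',
    args: ['-y', 'webscraping-ai-mcp'],
    env: {
      ...process.env, // keep PATH so that npx can be found
      WEBSCRAPING_AI_API_KEY: 'your-api-key'
    }
  });

  const client = new Client({
    name: 'claude-client',
    version: '1.0.0'
  });

  await client.connect(transport);

  // Now you can use Claude with WebScraping.AI tools.
  // MCP tools use `inputSchema`; the Anthropic Messages API expects `input_schema`.
  const { tools } = await client.listTools();
  const response = await anthropic.messages.create({
    model: 'claude-3-5-sonnet-latest', // example model name; use any tool-capable model
    max_tokens: 1024,
    messages: [{ role: 'user', content: 'What is the main topic of example.com?' }],
    tools: tools.map((tool) => ({
      name: tool.name,
      description: tool.description,
      input_schema: tool.inputSchema
    }))
  });

  console.log(response.content);
  await client.close();
}

main().catch(console.error);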
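Listing the tools is only half of the loop: when Claude decides to use one, its reply contains `tool_use` blocks that the client must forward to the MCP server, and the results go back to Claude as `tool_result` blocks. The sketch below covers only that translation step, under the assumption that the Anthropic Messages API is used; `toMcpCalls` and `toToolResult` are hypothetical helper names.

```javascript
// Turn the tool_use blocks of a Messages API response into MCP call requests
function toMcpCalls(response) {
  return response.content
    .filter((block) => block.type === 'tool_use')
    .map((block) => ({
      id: block.id,
      request: { name: block.name, arguments: block.input }
    }));
}

// Wrap an MCP tool result as a tool_result block for the next user message
function toToolResult(toolUseId, mcpResult) {
  return {
    type: 'tool_result',
    tool_use_id: toolUseId,
    is_error: Boolean(mcpResult.isError),
    content: mcpResult.content.map((item) => ({
      type: 'text',
      // Serialize non-string payloads so the block stays valid text
      text: typeof item.text === 'string' ? item.text : JSON.stringify(item.text)
    }))
  };
}

// Sample reply in which Claude asks for the text tool
const sampleReply = {
  content: [
    { type: 'text', text: 'Let me fetch that page.' },
    { type: 'tool_use', id: 'toolu_1', name: 'webscraping_ai_text', input: { url: 'https://example.com' } }
  ]
};

console.log(toMcpCalls(sampleReply));
```

In a full loop, each `request` would be passed to `client.callTool(...)` on a connected MCP client, and the wrapped results sent back to Claude in a follow-up user message.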

Development

# Clone the repository
git clone https://github.com/webscraping-ai/webscraping-ai-mcp-server.git
cd webscraping-ai-mcp-server

# Install dependencies
npm install

# Run tests
npm test

# Add your .env file
cp .env.example .env

# Start the inspector
npx @modelcontextprotocol/inspector node src/index.js

Contributing

Fork the repository
Create your feature branch
Run tests: npm test
Submit a pull request

License

MIT License - see the LICENSE file for details
