web_fetch
Fetch web content from a URL and extract the plain text by stripping HTML tags. The max_len and timeout parameters control the output size and the request deadline.
Instructions
Fetches the content of the web page at the given URL and returns it as plain text (HTML tags are stripped automatically).
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| url | Yes | URL of the web page to fetch | |
| max_len | No | Maximum number of characters to return | 5000 |
| timeout | No | Request timeout in seconds | 15 |
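
For reference, a call that sets all three parameters might pass an arguments object like the sketch below; the URL and values are illustrative, not taken from the source.

```python
# Hypothetical arguments for a web_fetch call; values are illustrative.
args = {
    "url": "https://example.com/docs",  # required
    "max_len": 2000,                    # optional, defaults to 5000
    "timeout": 10,                      # optional, defaults to 15 seconds
}
```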
Implementation Reference
- src/onion_mcp_server/tools/web.py:88-110 (handler)
  Core handler for web_fetch tool: fetches URL content using httpx, strips HTML via BeautifulSoup, removes script/style/nav/footer/header tags, and returns plain text truncated to max_len.

```python
async def _web_fetch(args: dict) -> list[types.TextContent]:
    url = args["url"]
    max_len = int(args.get("max_len", 5000))
    timeout = int(args.get("timeout", 15))
    try:
        import httpx
        from bs4 import BeautifulSoup

        async with httpx.AsyncClient(timeout=timeout, follow_redirects=True) as client:
            resp = await client.get(url, headers={"User-Agent": "Mozilla/5.0"})
            resp.raise_for_status()
        soup = BeautifulSoup(resp.text, "html.parser")
        for tag in soup(["script", "style", "nav", "footer", "header"]):
            tag.decompose()
        text = soup.get_text(separator="\n", strip=True)
        text = "\n".join(line for line in text.splitlines() if line.strip())
        if len(text) > max_len:
            text = text[:max_len] + f"\n\n... [已截断,共 {len(text)} 字符]"
        return [types.TextContent(type="text", text=f"🌐 {url}\n\n{text}")]
    except ImportError:
        return [types.TextContent(type="text", text="❌ 需要安装依赖: pip install httpx beautifulsoup4")]
    except Exception as e:
        return [types.TextContent(type="text", text=f"❌ 抓取失败: {e}")]
```
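Outside the MCP server, the handler can be exercised directly for a quick check. A minimal sketch, assuming httpx and beautifulsoup4 are installed; the URL is illustrative:

```python
import asyncio

async def main() -> None:
    # Await the handler with a schema-shaped arguments dict.
    result = await _web_fetch({"url": "https://example.com", "max_len": 1000})
    print(result[0].text)  # "🌐 <url>" header followed by the extracted text

asyncio.run(main())
```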
- Schema/registration definition for web_fetch tool. Defines name, description, inputSchema with url (required), max_len (default 5000), and timeout (default 15).

```python
types.Tool(
    name="web_fetch",
    description="抓取指定 URL 的网页内容,返回纯文本(自动去除 HTML 标签)。",
    inputSchema={
        "type": "object",
        "properties": {
            "url": {"type": "string", "description": "要抓取的网页 URL"},
            "max_len": {
                "type": "integer",
                "description": "最大返回字符数(默认 5000)",
                "default": 5000,
            },
            "timeout": {
                "type": "integer",
                "description": "超时秒数(默认 15)",
                "default": 15,
            },
        },
        "required": ["url"],
    },
),
```
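Because the defaults live in the inputSchema dict, a client or test can read them straight off the registered Tool object. A sketch, assuming web_fetch happens to be the first entry in WEB_TOOLS:

```python
# Hypothetical inspection of the registered schema; the index 0 is an assumption.
tool = WEB_TOOLS[0]
assert tool.name == "web_fetch"
print(tool.inputSchema["required"])                          # ["url"]
print(tool.inputSchema["properties"]["max_len"]["default"])  # 5000
print(tool.inputSchema["properties"]["timeout"]["default"])  # 15
```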
- src/onion_mcp_server/server.py:58-59 (registration)
  Registration in server.py: maps web_fetch tool name to handle_web via the routing table _HANDLERS.

```python
for _t in WEB_TOOLS:
    _HANDLERS[_t.name] = handle_web
```
- src/onion_mcp_server/tools/__init__.py:5-15 (registration)
  Re-exports WEB_TOOLS and handle_web from tools/__init__.py for use by server.py.

```python
from onion_mcp_server.tools.web import WEB_TOOLS, handle_web
from onion_mcp_server.tools.system import SYSTEM_TOOLS, handle_system
# Imports for the AI, code, text, and data tool groups fall outside this excerpt.

__all__ = [
    "AI_TOOLS", "handle_ai",
    "CODE_TOOLS", "handle_code",
    "TEXT_TOOLS", "handle_text",
    "DATA_TOOLS", "handle_data",
    "WEB_TOOLS", "handle_web",
    "SYSTEM_TOOLS", "handle_system",
]
```
- Dispatcher helper that routes web_fetch calls to the actual _web_fetch implementation.

```python
async def handle_web(name: str, arguments: dict) -> list[types.TextContent]:
    if name == "web_fetch":
        return await _web_fetch(arguments)
    elif name == "web_search":
        return await _web_search(arguments)
    elif name == "web_extract":
        return await _web_extract(arguments)
    raise ValueError(f"未知 web 工具: {name}")
```
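End to end, a web_fetch call reaches the handler through the _HANDLERS lookup installed by the registration loop above. A sketch of that dispatch path; the names follow the excerpts, but this function is not part of server.py:

```python
async def dispatch_example() -> None:
    # _HANDLERS["web_fetch"] was set to handle_web by the registration loop.
    handler = _HANDLERS["web_fetch"]
    contents = await handler("web_fetch", {"url": "https://example.com"})
    print(contents[0].text)
```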