web_scrape
Extract text content and metadata from any webpage URL for data collection and analysis purposes.
Instructions
Scrape the content of the specified webpage and extract its text and metadata.
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| extractMetadata | No | Whether to extract metadata | true |
| extractText | No | Whether to extract plain-text content | true |
| url | Yes | URL of the webpage to scrape | |
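The defaults in the table can be illustrated with a small argument-resolution sketch. The names `WebScrapeArgs` and `resolveWebScrapeArgs` are hypothetical and do not appear in the server source. The sketch is also slightly stricter than the handler, which relies only on destructuring defaults and does not validate types.

```typescript
// Hypothetical shape of the fully resolved arguments (not part of the source).
interface WebScrapeArgs {
  url: string;
  extractText: boolean;
  extractMetadata: boolean;
}

// Hypothetical helper: checks the required url and fills in the documented defaults.
function resolveWebScrapeArgs(raw: {
  url?: unknown;
  extractText?: unknown;
  extractMetadata?: unknown;
}): WebScrapeArgs {
  const url = raw.url;
  if (typeof url !== "string" || url.length === 0) {
    throw new Error("url is required and must be a non-empty string");
  }
  return {
    url,
    // Both flags default to true when omitted (or when not a boolean, an assumption of this sketch).
    extractText: typeof raw.extractText === "boolean" ? raw.extractText : true,
    extractMetadata: typeof raw.extractMetadata === "boolean" ? raw.extractMetadata : true,
  };
}

// Usage: only url is supplied explicitly along with one flag.
const resolved = resolveWebScrapeArgs({ url: "https://example.com", extractMetadata: false });
console.log(resolved);
```

Omitting both flags yields a call that extracts text and metadata, which matches the `default: true` entries in the input schema.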
Implementation Reference
- src/index.ts:207-249 (handler) Primary handler for executing the web_scrape tool. Parses arguments, invokes the scraping helper, formats the scraped data (title, URL, metadata, and a 500-character content preview) into a text response, and rethrows any error wrapped in a failure message.
```typescript
private async handleWebScrape(args: any) {
  // Both flags default to true when the caller omits them.
  const { url, extractText = true, extractMetadata = true } = args;

  try {
    const content = await this.scrapeWebPage(url, extractText, extractMetadata);

    // Response labels are in Chinese: "网页内容抓取结果" = "Web page scrape result",
    // "标题" = "Title".
    let result = `网页内容抓取结果:\n\n`;
    result += `**标题**: ${content.title}\n`;
    result += `**URL**: ${content.url}\n\n`;

    // Metadata section ("元数据"); each field is listed only when present.
    if (extractMetadata && content.metadata) {
      result += `**元数据**:\n`;
      if (content.metadata.description) {
        result += `- 描述: ${content.metadata.description}\n`; // description
      }
      if (content.metadata.keywords) {
        result += `- 关键词: ${content.metadata.keywords}\n`; // keywords
      }
      if (content.metadata.author) {
        result += `- 作者: ${content.metadata.author}\n`; // author
      }
      if (content.metadata.publishedDate) {
        result += `- 发布日期: ${content.metadata.publishedDate}\n`; // published date
      }
      result += `\n`;
    }

    // Content preview ("内容摘要"): first 500 characters, with an ellipsis if truncated.
    if (extractText) {
      result += `**内容摘要** (前500字符):\n${content.content.substring(0, 500)}${content.content.length > 500 ? '...' : ''}`;
    }

    return {
      content: [
        {
          type: 'text',
          text: result,
        },
      ],
    };
  } catch (error) {
    // "网页抓取失败" = "Web page scraping failed"
    throw new Error(`网页抓取失败: ${error instanceof Error ? error.message : String(error)}`);
  }
}
```
- src/index.ts:101-120 (schema) Input schema defining the parameters for the web_scrape tool: a required URL and optional flags for text and metadata extraction.
```typescript
inputSchema: {
  type: 'object',
  properties: {
    url: {
      type: 'string',
      description: '要抓取的网页URL', // "URL of the webpage to scrape"
    },
    extractText: {
      type: 'boolean',
      description: '是否提取纯文本内容(默认true)', // "Whether to extract plain-text content (default true)"
      default: true,
    },
    extractMetadata: {
      type: 'boolean',
      description: '是否提取元数据(默认true)', // "Whether to extract metadata (default true)"
      default: true,
    },
  },
  required: ['url'],
},
```
- src/index.ts:98-121 (registration) Registration of the web_scrape tool in the ListToolsRequestSchema response, specifying its name, description, and input schema.
```typescript
{
  name: 'web_scrape',
  // "Scrape the content of the specified webpage and extract its text and metadata"
  description: '抓取指定网页的内容,提取文本和元数据',
  inputSchema: {
    type: 'object',
    properties: {
      url: {
        type: 'string',
        description: '要抓取的网页URL',
      },
      extractText: {
        type: 'boolean',
        description: '是否提取纯文本内容(默认true)',
        default: true,
      },
      extractMetadata: {
        type: 'boolean',
        description: '是否提取元数据(默认true)',
        default: true,
      },
    },
    required: ['url'],
  },
},
```
- src/index.ts:312-349 (helper) Core helper implementing the scraping logic: fetches the page with axios, parses it with cheerio, extracts the title, strips non-content elements and normalizes whitespace in the body text, and reads metadata from meta tags.
```typescript
private async scrapeWebPage(url: string, extractText: boolean, extractMetadata: boolean): Promise<WebPageContent> {
  // Fetch the page with a browser-like User-Agent and the configured timeout.
  const response = await axios.get(url, {
    timeout: this.requestTimeout,
    headers: {
      'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
    },
  });

  const $ = cheerio.load(response.data);
  const title = $('title').text().trim() || '无标题'; // "无标题" = "Untitled"

  let content = '';
  let metadata: any = {};

  if (extractText) {
    // Remove scripts, styles, and page chrome (nav, header, footer, aside),
    // then collapse whitespace runs in the remaining body text.
    $('script, style, nav, header, footer, aside').remove();
    content = $('body').text().replace(/\s+/g, ' ').trim();
  }

  if (extractMetadata) {
    // Each field falls back to an Open Graph or alternative meta tag when the primary one is missing.
    metadata = {
      description: $('meta[name="description"]').attr('content') || $('meta[property="og:description"]').attr('content'),
      keywords: $('meta[name="keywords"]').attr('content'),
      author: $('meta[name="author"]').attr('content') || $('meta[property="article:author"]').attr('content'),
      publishedDate: $('meta[property="article:published_time"]').attr('content') || $('meta[name="date"]').attr('content'),
    };
  }

  return {
    url,
    title,
    content,
    metadata,
  };
}
```
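
The two text-processing steps used above, whitespace normalization in the helper and the 500-character preview in the handler, can be tried in isolation. The following is a minimal sketch without axios or cheerio. The function names `normalizeWhitespace` and `buildPreview` are hypothetical and are not part of the server source.

```typescript
// Hypothetical helper: collapses every run of whitespace (spaces, tabs, newlines)
// into a single space and trims the ends, as the scraping helper does for body text.
function normalizeWhitespace(text: string): string {
  return text.replace(/\s+/g, " ").trim();
}

// Hypothetical helper: keeps the first `limit` characters and appends "..." only
// when the text was actually cut, mirroring the handler's preview expression.
function buildPreview(text: string, limit: number = 500): string {
  return text.length > limit ? text.substring(0, limit) + "..." : text;
}

// Usage: clean up raw body text, then build the preview.
const rawBody = "  Breaking   news\n\n\tToday's   top story  ";
const cleaned = normalizeWhitespace(rawBody);
console.log(cleaned);
console.log(buildPreview(cleaned, 10));
```

Note that `substring` counts UTF-16 code units, so a cut at the limit can split a character outside the Basic Multilingual Plane (for example, an emoji).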