web_search_and_scrape
Search the web with the Google Custom Search API and extract content from the top results to gather information for research and analysis.
Instructions
Search the web and scrape the content of the top results.
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| language | No | Search language (e.g. zh-CN, en-US) | zh-CN |
| maxResults | No | Maximum number of results to scrape | 3 |
| query | Yes | Search query keywords | |
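The defaults in the table can be applied before the tool is called. The following is a minimal sketch, not code from the server: `SearchAndScrapeArgs` and `normalizeArgs` are hypothetical names introduced here, and the validation rules are assumptions layered on top of the schema (the server's own handler simply destructures `args` with the same two defaults).

```typescript
// Hypothetical type and helper for illustration only; neither exists in src/index.ts.
interface SearchAndScrapeArgs {
  query: string;
  maxResults: number;
  language: string;
}

function normalizeArgs(raw: Record<string, unknown>): SearchAndScrapeArgs {
  // Same defaults as the input schema: maxResults = 3, language = 'zh-CN'.
  const { query, maxResults = 3, language = 'zh-CN' } = raw;

  // `query` is the only required field.
  if (typeof query !== 'string' || query.trim() === '') {
    throw new Error('query is required and must be a non-empty string');
  }
  // Assumed rule: a result count only makes sense as a positive integer.
  if (typeof maxResults !== 'number' || !Number.isInteger(maxResults) || maxResults < 1) {
    throw new Error('maxResults must be a positive integer');
  }
  if (typeof language !== 'string') {
    throw new Error('language must be a string');
  }
  return { query, maxResults, language };
}

// Only `query` is supplied, so maxResults and language fall back to their defaults.
console.log(normalizeArgs({ query: 'model context protocol' }));
```

Validating up front keeps a malformed call from reaching the search API at all, which may matter when requests count against an API quota.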
Implementation Reference
- src/index.ts:251-286 (handler) The primary handler for the 'web_search_and_scrape' tool. It runs a web search through performWebSearch, scrapes each result with scrapeWebPage, and builds a formatted response containing each result's title, URL, search snippet, and the first 300 characters of the scraped text. A failed scrape is reported inline for that result instead of aborting the whole call.
```typescript
private async handleWebSearchAndScrape(args: any) {
  const { query, maxResults = 3, language = 'zh-CN' } = args;

  try {
    // Run the search first
    const searchResults = await this.performWebSearch(query, maxResults, language);

    // Output header; in English: Search "<query>" and scrape content
    let result = `搜索 "${query}" 并抓取内容:\n\n`;

    // Then scrape the content of each result
    for (let i = 0; i < searchResults.length; i++) {
      const searchResult = searchResults[i];
      result += `## ${i + 1}. ${searchResult.title}\n`;
      result += `**URL**: ${searchResult.url}\n`;
      // Label in English: "Search snippet"
      result += `**搜索摘要**: ${searchResult.snippet}\n\n`;

      try {
        const scrapedContent = await this.scrapeWebPage(searchResult.url, true, false);
        // Label in English: "Scraped content summary (first 300 characters)"
        result += `**抓取内容摘要** (前300字符):\n${scrapedContent.content.substring(0, 300)}${scrapedContent.content.length > 300 ? '...' : ''}\n\n`;
      } catch (scrapeError) {
        // Label in English: "Scrape failed"; the loop continues with the next result
        result += `**抓取失败**: ${scrapeError instanceof Error ? scrapeError.message : String(scrapeError)}\n\n`;
      }
    }

    return {
      content: [
        {
          type: 'text',
          text: result,
        },
      ],
    };
  } catch (error) {
    // Message in English: "Search and scrape failed"
    throw new Error(`搜索和抓取失败: ${error instanceof Error ? error.message : String(error)}`);
  }
}
```

- src/index.ts:122-145 (schema) The input schema and metadata for the 'web_search_and_scrape' tool, registered in the ListTools response.
```typescript
{
  name: 'web_search_and_scrape',
  // In English: Search the web and scrape the content of the top results
  description: '搜索网页并抓取前几个结果的内容',
  inputSchema: {
    type: 'object',
    properties: {
      query: {
        type: 'string',
        // In English: Search query keywords
        description: '搜索查询关键词',
      },
      maxResults: {
        type: 'number',
        // In English: Maximum number of results to scrape (default 3)
        description: '最大抓取结果数量(默认3)',
        default: 3,
      },
      language: {
        type: 'string',
        // In English: Search language (e.g. zh-CN, en-US)
        description: '搜索语言(如:zh-CN, en-US)',
        default: 'zh-CN',
      },
    },
    required: ['query'],
  },
},
```

- src/index.ts:160-161 (registration) The switch case in the CallToolRequest handler that routes 'web_search_and_scrape' calls to the handler function.
```typescript
case 'web_search_and_scrape':
  return await this.handleWebSearchAndScrape(args);
```

- src/index.ts:288-310 (helper) Helper called by the handler to perform the web search through the Google Custom Search API.
```typescript
private async performWebSearch(query: string, maxResults: number, language: string): Promise<SearchResult[]> {
  const url = `https://www.googleapis.com/customsearch/v1`;
  const params = {
    key: this.searchApiKey,
    cx: this.searchEngineId,
    q: query,
    // Cap the request at 10 results
    num: Math.min(maxResults, 10),
    // Language restriction, e.g. lang_zh-CN for the default language
    lr: `lang_${language}`,
  };

  const response = await axios.get(url, {
    params,
    timeout: this.requestTimeout,
  });

  // A response with no matches has no `items` field
  const items = response.data.items || [];

  return items.map((item: any, index: number) => ({
    title: item.title,
    url: item.link,
    snippet: item.snippet,
    rank: index + 1,
  }));
}
```

- src/index.ts:312-350 (helper) Helper called by the handler to scrape a web page with axios and cheerio. It always extracts the page title and, depending on its two flags, the body text and metadata.
```typescript
private async scrapeWebPage(url: string, extractText: boolean, extractMetadata: boolean): Promise<WebPageContent> {
  const response = await axios.get(url, {
    timeout: this.requestTimeout,
    headers: {
      'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
    },
  });

  const $ = cheerio.load(response.data);
  // Fallback title in English: "No title"
  const title = $('title').text().trim() || '无标题';

  let content = '';
  let metadata: any = {};

  if (extractText) {
    // Remove scripts, styles, and page chrome (nav, header, footer, aside)
    $('script, style, nav, header, footer, aside').remove();
    // Collapse all whitespace runs into single spaces
    content = $('body').text().replace(/\s+/g, ' ').trim();
  }

  if (extractMetadata) {
    metadata = {
      description: $('meta[name="description"]').attr('content') || $('meta[property="og:description"]').attr('content'),
      keywords: $('meta[name="keywords"]').attr('content'),
      author: $('meta[name="author"]').attr('content') || $('meta[property="article:author"]').attr('content'),
      publishedDate: $('meta[property="article:published_time"]').attr('content') || $('meta[name="date"]').attr('content'),
    };
  }

  return {
    url,
    title,
    content,
    metadata,
  };
}
```
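The performWebSearch helper above builds its `lr` parameter as `lang_${language}`, so the schema's own example `en-US` becomes `lang_en-US`. The sketch below shows one way a caller might normalize the tag first. It rests on an assumption that should be checked against Google's documentation: that `lr` accepts `lang_` plus a bare language code, with Chinese as the exception (`lang_zh-CN`, `lang_zh-TW`). `toLrValue` is a hypothetical name and is not part of src/index.ts.

```typescript
// Hypothetical helper, not part of src/index.ts.
// Assumption: Google's `lr` parameter takes `lang_<code>` with a bare
// language code, except Chinese, which uses `zh-CN` and `zh-TW`.
function toLrValue(languageTag: string): string {
  const [primary, region] = languageTag.trim().split(/[-_]/);
  const lang = primary.toLowerCase();

  if (lang === 'zh') {
    // Assumed mapping: Taiwan and Hong Kong map to traditional Chinese,
    // everything else (including a bare "zh") to simplified Chinese.
    const upperRegion = (region ?? '').toUpperCase();
    return upperRegion === 'TW' || upperRegion === 'HK' ? 'lang_zh-TW' : 'lang_zh-CN';
  }

  // For every other language, drop the region subtag.
  return `lang_${lang}`;
}

console.log(toLrValue('zh-CN')); // lang_zh-CN
console.log(toLrValue('en-US')); // lang_en
```

With a helper like this, the default `zh-CN` produces the same value the server sends today, while region-qualified tags such as `en-US` are reduced to a language-only value.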