batch_crawl_urls
Extract content from multiple web pages simultaneously by crawling specified URLs, with options to retrieve main text and links.
Instructions
Crawl and extract content from multiple web pages
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| urls | Yes | Array of URLs to crawl (max 10) | |
| extractText | No | Extract main text content | true |
| extractLinks | No | Extract all links from pages | false |
| maxConcurrent | No | Maximum concurrent requests (1-5) | 3 |
| delay | No | Delay between batches in milliseconds (500-5000) | 1000 |
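
As an illustration, a call to `batch_crawl_urls` might pass arguments shaped like the sketch below; the URLs are placeholders, and any omitted optional field falls back to the default listed above.

```typescript
// Illustrative arguments for batch_crawl_urls; the URLs are placeholders.
const args = {
  urls: [
    'https://example.com/docs/getting-started',
    'https://example.com/blog/latest-post'
  ],                     // up to 10 URLs per call
  extractText: true,     // default: true
  extractLinks: false,   // default: false
  maxConcurrent: 2,      // allowed range 1-5, default: 3
  delay: 1500            // 500-5000 ms between batches, default: 1000
};
```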
Implementation Reference
- The main handler function for the `batch_crawl_urls` tool. It validates the input URLs, performs batched concurrent crawling via `WebCrawlerClient`, and returns structured output with success metrics (the shape of the success payload is sketched after this list).

  ```typescript
  execute: async (args: any) => {
    const { urls, extractText = true, extractLinks = false, maxConcurrent = 3, delay = 1000 } = args;

    try {
      const startTime = Date.now();

      // Validate URLs
      for (const url of urls) {
        try {
          new URL(url);
        } catch {
          return { success: false, error: `Invalid URL format: ${url}` };
        }
      }

      const results = await client.crawlMultiplePages(urls, {
        extractText,
        extractLinks,
        maxConcurrent,
        delay,
        maxContentLength: 3000 // reduce per-page content length when crawling multiple pages
      });

      const crawlTime = Date.now() - startTime;
      const successCount = results.filter(r => r.success).length;

      return {
        success: true,
        data: {
          source: 'Web Crawler',
          totalUrls: urls.length,
          successCount,
          failureCount: urls.length - successCount,
          crawlTime,
          results,
          summary: {
            successRate: Math.round((successCount / urls.length) * 100),
            averageTimePerPage: Math.round(crawlTime / urls.length),
            totalContentExtracted: results.filter(r => r.success).length
          },
          timestamp: Date.now()
        }
      };
    } catch (error) {
      return {
        success: false,
        error: `Multiple page crawling failed: ${error instanceof Error ? error.message : String(error)}`
      };
    }
  }
  ```
- Core helper method in `WebCrawlerClient` that implements batched concurrent page fetching, content extraction, per-URL error handling, and rate-limiting delays between batches (a standalone sketch of this batching pattern follows the list).

  ```typescript
  async crawlMultiplePages(urls: string[], options: any = {}) {
    const results = [];
    const maxConcurrent = options.maxConcurrent || 3;

    for (let i = 0; i < urls.length; i += maxConcurrent) {
      const batch = urls.slice(i, i + maxConcurrent);

      const batchPromises = batch.map(async (url) => {
        try {
          const pageData = await this.fetchPage(url, options);
          const content = this.extractContent(pageData.html, options);
          return {
            url,
            success: true,
            data: { ...content, status: pageData.status, finalUrl: pageData.url }
          };
        } catch (error) {
          return {
            url,
            success: false,
            error: error instanceof Error ? error.message : String(error)
          };
        }
      });

      const batchResults = await Promise.all(batchPromises);
      results.push(...batchResults);

      // Add a delay between batches to avoid sending requests too frequently
      if (i + maxConcurrent < urls.length) {
        await new Promise(resolve => setTimeout(resolve, options.delay || 1000));
      }
    }

    return results;
  }
  ```
- Input schema definition for the `batch_crawl_urls` tool, specifying the `urls` array (max 10 items), extraction options, concurrency limits, and the delay between batches.

  ```typescript
  inputSchema: {
    type: 'object',
    properties: {
      urls: {
        type: 'array',
        items: { type: 'string' },
        description: 'Array of URLs to crawl',
        maxItems: 10
      },
      extractText: {
        type: 'boolean',
        description: 'Extract main text content',
        default: true
      },
      extractLinks: {
        type: 'boolean',
        description: 'Extract all links from pages',
        default: false
      },
      maxConcurrent: {
        type: 'number',
        description: 'Maximum concurrent requests',
        default: 3,
        minimum: 1,
        maximum: 5
      },
      delay: {
        type: 'number',
        description: 'Delay between batches in milliseconds',
        default: 1000,
        minimum: 500,
        maximum: 5000
      }
    },
    required: ['urls']
  },
  ```
- src/tools/utility/web-crawler-tools.ts:292-386 (registration): Local tool registration in the `registerWebCrawlerTools` function, defining the name, description, category, source, input schema, and execute handler for `batch_crawl_urls`. The `inputSchema` and `execute` bodies are exactly the snippets shown above.

  ```typescript
  registry.registerTool({
    name: 'batch_crawl_urls',
    description: 'Crawl and extract content from multiple web pages',
    category: 'utility',
    source: 'Web Crawler',
    inputSchema: {
      // ... identical to the input schema snippet above
    },
    execute: async (args: any) => {
      // ... identical to the handler snippet above
    }
  });
  ```
- src/index.ts:248-248 (registration): Global registration call in `OpenSearchMCPServer.registerAllTools()`, importing and registering the web crawler tools, including `batch_crawl_urls`.

  ```typescript
  registerWebCrawlerTools(this.toolRegistry); // 2 tools: crawl_url_content, batch_crawl_urls
  ```
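
For callers consuming the tool's output, the success payload assembled by the `execute` handler above has roughly the following shape. The interface names below are invented for illustration and do not appear in the source; the field names and how they are computed come from the handler.

```typescript
// Hypothetical type names; the field layout is inferred from the execute handler.
interface BatchCrawlSummary {
  successRate: number;          // Math.round((successCount / totalUrls) * 100)
  averageTimePerPage: number;   // Math.round(crawlTime / totalUrls), in milliseconds
  totalContentExtracted: number;
}

interface BatchCrawlResponse {
  success: boolean;
  error?: string;               // present when URL validation or crawling fails
  data?: {
    source: 'Web Crawler';
    totalUrls: number;
    successCount: number;
    failureCount: number;
    crawlTime: number;          // total wall-clock time for the whole batch, in milliseconds
    results: Array<{ url: string; success: boolean; data?: unknown; error?: string }>;
    summary: BatchCrawlSummary;
    timestamp: number;
  };
}
```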
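
The batching strategy inside `crawlMultiplePages` can also be isolated as a small, generic helper: slice the input into groups of `maxConcurrent`, run each group with `Promise.all`, capture per-item failures so one bad URL cannot abort the batch, and sleep between groups. The sketch below is a minimal illustration of that pattern under those assumptions; the names `runInBatches` and `BatchResult` are not part of the actual client.

```typescript
// Minimal sketch of the batch-and-delay pattern used by crawlMultiplePages.
type BatchResult<T> =
  | { item: string; success: true; data: T }
  | { item: string; success: false; error: string };

async function runInBatches<T>(
  items: string[],
  worker: (item: string) => Promise<T>,
  maxConcurrent = 3,
  delayMs = 1000
): Promise<BatchResult<T>[]> {
  const results: BatchResult<T>[] = [];

  for (let i = 0; i < items.length; i += maxConcurrent) {
    const batch = items.slice(i, i + maxConcurrent);

    // Run one slice concurrently; failures are recorded per item instead of thrown.
    const batchResults = await Promise.all(
      batch.map(async (item): Promise<BatchResult<T>> => {
        try {
          return { item, success: true, data: await worker(item) };
        } catch (error) {
          return {
            item,
            success: false,
            error: error instanceof Error ? error.message : String(error)
          };
        }
      })
    );
    results.push(...batchResults);

    // Pause before the next slice to avoid hammering the target hosts.
    if (i + maxConcurrent < items.length) {
      await new Promise(resolve => setTimeout(resolve, delayMs));
    }
  }

  return results;
}
```

With `worker` set to something like `(url) => fetch(url).then(r => r.text())`, this reproduces the crawl loop minus the client's content extraction and length limits.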