batch_crawl_urls
Extract the main text content and links from multiple web pages in one call. Supply up to 10 URLs and tune the concurrency and inter-batch delay for efficient batch crawling with Open Search MCP.
Instructions
Crawl and extract content from multiple web pages
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| delay | No | Delay between batches in milliseconds (500-5000) | 1000 |
| extractLinks | No | Extract all links from pages | false |
| extractText | No | Extract main text content | true |
| maxConcurrent | No | Maximum concurrent requests (1-5) | 3 |
| urls | Yes | Array of URLs to crawl (max 10 items) | |
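For example, the following argument object satisfies the schema; the URLs are placeholders:

```typescript
// Example arguments for batch_crawl_urls (URLs are placeholders).
const args = {
  urls: [
    'https://example.com/docs/intro',
    'https://example.com/docs/setup',
    'https://example.com/blog/release-notes'
  ],                     // required, at most 10 entries
  extractText: true,     // default: true
  extractLinks: false,   // default: false
  maxConcurrent: 3,      // 1-5, default: 3
  delay: 1000            // 500-5000 ms between batches, default: 1000
};
```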
Implementation Reference
- The main handler function (`execute`) for the `batch_crawl_urls` tool. It validates the input URLs, calls the `crawlMultiplePages` helper with concurrency and delay controls, processes the results, and returns structured output with success metrics.

```typescript
execute: async (args: any) => {
  const { urls, extractText = true, extractLinks = false, maxConcurrent = 3, delay = 1000 } = args;

  try {
    const startTime = Date.now();

    // Validate URLs
    for (const url of urls) {
      try {
        new URL(url);
      } catch {
        return { success: false, error: `Invalid URL format: ${url}` };
      }
    }

    const results = await client.crawlMultiplePages(urls, {
      extractText,
      extractLinks,
      maxConcurrent,
      delay,
      maxContentLength: 3000 // Reduce per-page content length when handling multiple pages
    });

    const crawlTime = Date.now() - startTime;
    const successCount = results.filter(r => r.success).length;

    return {
      success: true,
      data: {
        source: 'Web Crawler',
        totalUrls: urls.length,
        successCount,
        failureCount: urls.length - successCount,
        crawlTime,
        results,
        summary: {
          successRate: Math.round((successCount / urls.length) * 100),
          averageTimePerPage: Math.round(crawlTime / urls.length),
          totalContentExtracted: results.filter(r => r.success).length
        },
        timestamp: Date.now()
      }
    };
  } catch (error) {
    return {
      success: false,
      error: `Multiple page crawling failed: ${error instanceof Error ? error.message : String(error)}`
    };
  }
}
```
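For orientation, the value returned by this handler has roughly the following shape. The type names below are illustrative and not part of the source:

```typescript
// Illustrative types for the handler's return value; names are not from the source.
interface BatchCrawlSummary {
  successRate: number;          // percentage of URLs crawled successfully
  averageTimePerPage: number;   // crawlTime / urls.length, in milliseconds
  totalContentExtracted: number;
}

interface BatchCrawlResult {
  success: boolean;
  error?: string;               // set when success is false
  data?: {
    source: 'Web Crawler';
    totalUrls: number;
    successCount: number;
    failureCount: number;
    crawlTime: number;          // total wall-clock time in milliseconds
    results: Array<{ url: string; success: boolean; data?: unknown; error?: string }>;
    summary: BatchCrawlSummary;
    timestamp: number;
  };
}
```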
- Input schema definition for the `batch_crawl_urls` tool, specifying the `urls` array (max 10 items), extraction options, concurrency, and delay, each with types, defaults, and constraints.

```typescript
inputSchema: {
  type: 'object',
  properties: {
    urls: {
      type: 'array',
      items: { type: 'string' },
      description: 'Array of URLs to crawl',
      maxItems: 10
    },
    extractText: {
      type: 'boolean',
      description: 'Extract main text content',
      default: true
    },
    extractLinks: {
      type: 'boolean',
      description: 'Extract all links from pages',
      default: false
    },
    maxConcurrent: {
      type: 'number',
      description: 'Maximum concurrent requests',
      default: 3,
      minimum: 1,
      maximum: 5
    },
    delay: {
      type: 'number',
      description: 'Delay between batches in milliseconds',
      default: 1000,
      minimum: 500,
      maximum: 5000
    }
  },
  required: ['urls']
},
```
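Callers constructing arguments by hand can mirror these constraints before invoking the tool. The following is a minimal client-side sketch; it is illustrative only, and the server's own schema validation remains authoritative:

```typescript
// Client-side pre-check mirroring the schema constraints above (illustrative only).
function checkBatchCrawlArgs(args: {
  urls: string[];
  maxConcurrent?: number;
  delay?: number;
}): string | null {
  // maxItems: 10; an empty array is also rejected here to avoid a pointless call.
  if (args.urls.length === 0 || args.urls.length > 10) {
    return 'urls must contain between 1 and 10 entries';
  }
  const maxConcurrent = args.maxConcurrent ?? 3;
  if (maxConcurrent < 1 || maxConcurrent > 5) {
    return 'maxConcurrent must be between 1 and 5';
  }
  const delay = args.delay ?? 1000;
  if (delay < 500 || delay > 5000) {
    return 'delay must be between 500 and 5000 milliseconds';
  }
  return null; // arguments look valid
}
```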
- src/tools/utility/web-crawler-tools.ts:292-386 (registration): the `registry.registerTool` call that defines and registers the `batch_crawl_urls` tool, combining the name, description, category, source, the input schema, and the `execute` handler shown above.

```typescript
registry.registerTool({
  name: 'batch_crawl_urls',
  description: 'Crawl and extract content from multiple web pages',
  category: 'utility',
  source: 'Web Crawler',
  inputSchema: { /* input schema as listed above */ },
  execute: async (args: any) => { /* handler body as listed above */ }
});
```
- Core helper method in `WebCrawlerClient` that crawls the URLs in concurrent batches using `fetchPage` and `extractContent`, pausing between batches to avoid rate limiting.

```typescript
async crawlMultiplePages(urls: string[], options: any = {}) {
  const results = [];
  const maxConcurrent = options.maxConcurrent || 3;

  for (let i = 0; i < urls.length; i += maxConcurrent) {
    const batch = urls.slice(i, i + maxConcurrent);
    const batchPromises = batch.map(async (url) => {
      try {
        const pageData = await this.fetchPage(url, options);
        const content = this.extractContent(pageData.html, options);
        return {
          url,
          success: true,
          data: {
            ...content,
            status: pageData.status,
            finalUrl: pageData.url
          }
        };
      } catch (error) {
        return {
          url,
          success: false,
          error: error instanceof Error ? error.message : String(error)
        };
      }
    });

    const batchResults = await Promise.all(batchPromises);
    results.push(...batchResults);

    // Delay between batches to avoid sending requests too frequently
    if (i + maxConcurrent < urls.length) {
      await new Promise(resolve => setTimeout(resolve, options.delay || 1000));
    }
  }

  return results;
}
```
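`crawlMultiplePages` relies on two other `WebCrawlerClient` methods, `fetchPage` and `extractContent`, which are not included in this reference. Below is a minimal sketch of what such helpers could look like, assuming a Node 18+ runtime with a global `fetch`; the signatures and extraction logic are assumptions, not the actual implementation:

```typescript
// Hypothetical stand-ins for WebCrawlerClient.fetchPage and WebCrawlerClient.extractContent;
// the real methods are not shown in this reference.
async function fetchPage(url: string, options: { timeout?: number } = {}) {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), options.timeout ?? 10_000);
  try {
    const response = await fetch(url, { signal: controller.signal, redirect: 'follow' });
    const html = await response.text();
    // Shape matches what crawlMultiplePages reads: html, status, and the final URL after redirects.
    return { html, status: response.status, url: response.url };
  } finally {
    clearTimeout(timer);
  }
}

function extractContent(
  html: string,
  options: { extractText?: boolean; extractLinks?: boolean; maxContentLength?: number } = {}
) {
  const result: { text?: string; links?: string[] } = {};
  if (options.extractText !== false) {
    // Crude text extraction: drop scripts/styles, strip tags, collapse whitespace.
    result.text = html
      .replace(/<script[\s\S]*?<\/script>/gi, '')
      .replace(/<style[\s\S]*?<\/style>/gi, '')
      .replace(/<[^>]+>/g, ' ')
      .replace(/\s+/g, ' ')
      .trim()
      .slice(0, options.maxContentLength ?? 5000);
  }
  if (options.extractLinks) {
    result.links = [...html.matchAll(/href="([^"]+)"/g)].map(m => m[1]);
  }
  return result;
}
```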
- src/index.ts:248 (registration): top-level call in the main server initialization that registers the web crawler tools module, including `batch_crawl_urls`.

```typescript
registerWebCrawlerTools(this.toolRegistry); // 2 tools: crawl_url_content, batch_crawl_urls
```
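The body of `registerWebCrawlerTools` itself is not included in this reference. A rough sketch of how it likely wires things together follows; the import path, registry type, and internal structure are all assumptions:

```typescript
// Path is an assumption; the actual client module location is not shown in this reference.
import { WebCrawlerClient } from './web-crawler-client';

// Sketch of the module entry point called above. Only the function name, the registry
// argument, and the two tool names come from the source; the rest is assumed.
export function registerWebCrawlerTools(registry: any): void {
  const client = new WebCrawlerClient(); // assumed: one shared crawler client for both tools

  // crawl_url_content registration (not covered by this reference)
  // ...

  // batch_crawl_urls registration (web-crawler-tools.ts:292-386, shown above)
  registry.registerTool({
    name: 'batch_crawl_urls',
    description: 'Crawl and extract content from multiple web pages',
    category: 'utility',
    source: 'Web Crawler'
    // inputSchema and execute as listed in the Implementation Reference
  });
}
```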