Open Search MCP

by flyanima

batch_crawl_urls

Extract content from multiple web pages simultaneously by crawling specified URLs, with options to retrieve main text and links.

Instructions

Crawl and extract content from multiple web pages

Input Schema

Name           Required   Description                              Default
urls           Yes        Array of URLs to crawl                   (none)
extractText    No         Extract main text content                true
extractLinks   No         Extract all links from pages             false
maxConcurrent  No         Maximum concurrent requests              3
delay          No         Delay between batches in milliseconds    1000
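
As a rough illustration (not taken from the project's documentation), the arguments a client passes to this tool might look like the sketch below; the field names mirror the schema above, and the URLs are placeholders:

    // Hypothetical batch_crawl_urls arguments (example URLs only)
    const args = {
      urls: [
        'https://example.com/article-1',
        'https://example.com/article-2'
      ],
      extractText: true,    // default: true
      extractLinks: false,  // default: false
      maxConcurrent: 3,     // allowed range 1-5, default 3
      delay: 1000           // 500-5000 ms between batches, default 1000
    };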

Implementation Reference

  • The main handler function for the batch_crawl_urls tool. Validates input URLs, performs batched concurrent crawling using WebCrawlerClient, processes results, and returns structured output with success metrics; the overall return shape is summarized in the sketch after this list.
    execute: async (args: any) => {
      const { urls, extractText = true, extractLinks = false, maxConcurrent = 3, delay = 1000 } = args;
      try {
        const startTime = Date.now();

        // Validate URLs
        for (const url of urls) {
          try {
            new URL(url);
          } catch {
            return { success: false, error: `Invalid URL format: ${url}` };
          }
        }

        const results = await client.crawlMultiplePages(urls, {
          extractText,
          extractLinks,
          maxConcurrent,
          delay,
          maxContentLength: 3000 // Reduce per-page content length when crawling multiple pages
        });

        const crawlTime = Date.now() - startTime;
        const successCount = results.filter(r => r.success).length;

        return {
          success: true,
          data: {
            source: 'Web Crawler',
            totalUrls: urls.length,
            successCount,
            failureCount: urls.length - successCount,
            crawlTime,
            results,
            summary: {
              successRate: Math.round((successCount / urls.length) * 100),
              averageTimePerPage: Math.round(crawlTime / urls.length),
              totalContentExtracted: results.filter(r => r.success).length
            },
            timestamp: Date.now()
          }
        };
      } catch (error) {
        return {
          success: false,
          error: `Multiple page crawling failed: ${error instanceof Error ? error.message : String(error)}`
        };
      }
    }
  • Core helper method in WebCrawlerClient that implements batched concurrent page fetching, content extraction, error handling, and rate-limiting delays. With 10 URLs, maxConcurrent = 3, and delay = 1000, for example, it runs four batches of 3, 3, 3, and 1 URLs, pausing one second after every batch except the last.
    async crawlMultiplePages(urls: string[], options: any = {}) {
      const results = [];
      const maxConcurrent = options.maxConcurrent || 3;

      for (let i = 0; i < urls.length; i += maxConcurrent) {
        const batch = urls.slice(i, i + maxConcurrent);

        const batchPromises = batch.map(async (url) => {
          try {
            const pageData = await this.fetchPage(url, options);
            const content = this.extractContent(pageData.html, options);
            return {
              url,
              success: true,
              data: { ...content, status: pageData.status, finalUrl: pageData.url }
            };
          } catch (error) {
            return {
              url,
              success: false,
              error: error instanceof Error ? error.message : String(error)
            };
          }
        });

        const batchResults = await Promise.all(batchPromises);
        results.push(...batchResults);

        // Add a delay between batches to avoid sending requests too frequently
        if (i + maxConcurrent < urls.length) {
          await new Promise(resolve => setTimeout(resolve, options.delay || 1000));
        }
      }

      return results;
    }
  • Input schema definition for the batch_crawl_urls tool, specifying parameters like urls array (max 10), extraction options, concurrency limits, and delays.
    inputSchema: {
      type: 'object',
      properties: {
        urls: {
          type: 'array',
          items: { type: 'string' },
          description: 'Array of URLs to crawl',
          maxItems: 10
        },
        extractText: {
          type: 'boolean',
          description: 'Extract main text content',
          default: true
        },
        extractLinks: {
          type: 'boolean',
          description: 'Extract all links from pages',
          default: false
        },
        maxConcurrent: {
          type: 'number',
          description: 'Maximum concurrent requests',
          default: 3,
          minimum: 1,
          maximum: 5
        },
        delay: {
          type: 'number',
          description: 'Delay between batches in milliseconds',
          default: 1000,
          minimum: 500,
          maximum: 5000
        }
      },
      required: ['urls']
    },
  • Local tool registration in the registerWebCrawlerTools function, defining the name, description, schema, and execute handler for batch_crawl_urls. Its inputSchema and execute fields are identical to the snippets shown above.
    registry.registerTool({
      name: 'batch_crawl_urls',
      description: 'Crawl and extract content from multiple web pages',
      category: 'utility',
      source: 'Web Crawler',
      inputSchema: { /* identical to the input schema snippet above */ },
      execute: async (args: any) => { /* identical to the execute handler snippet above */ }
    });
  • Global registration call in OpenSearchMCPServer.registerAllTools() (src/index.ts:248), importing and registering the web crawler tools, including batch_crawl_urls.
    registerWebCrawlerTools(this.toolRegistry); // 2 tools: crawl_url_content, batch_crawl_urls
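
The structured output assembled by the execute handler above can be summarized as the following shape; this interface is a descriptive sketch inferred from that code, not a type exported by the project:

    // Sketch of the handler's return value, inferred from the execute handler above (not an exported type)
    interface BatchCrawlResult {
      success: boolean;
      error?: string;                    // set when URL validation or crawling fails
      data?: {
        source: 'Web Crawler';
        totalUrls: number;
        successCount: number;
        failureCount: number;
        crawlTime: number;               // total milliseconds for the whole batch
        results: Array<{ url: string; success: boolean; data?: unknown; error?: string }>;
        summary: {
          successRate: number;           // percentage of URLs crawled successfully, rounded
          averageTimePerPage: number;    // crawlTime / totalUrls, rounded
          totalContentExtracted: number; // currently equal to successCount
        };
        timestamp: number;
      };
    }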
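
For completeness, a minimal client-side call is sketched below using the standard @modelcontextprotocol/sdk TypeScript client. The stdio launch command and entry point (node dist/index.js) are assumptions about how this server is started, not taken from the project's documentation:

    import { Client } from '@modelcontextprotocol/sdk/client/index.js';
    import { StdioClientTransport } from '@modelcontextprotocol/sdk/client/stdio.js';

    // Assumed launch command for the Open Search MCP server; adjust to the actual entry point.
    const transport = new StdioClientTransport({ command: 'node', args: ['dist/index.js'] });
    const client = new Client({ name: 'example-client', version: '1.0.0' });
    await client.connect(transport);

    // Invoke the batch_crawl_urls tool with a small batch of placeholder URLs.
    const result = await client.callTool({
      name: 'batch_crawl_urls',
      arguments: {
        urls: ['https://example.com/a', 'https://example.com/b'],
        extractText: true,
        maxConcurrent: 2
      }
    });

    console.log(result);
    await client.close();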
