Open Search MCP

by flyanima

batch_crawl_urls

Extract main text content and links from multiple web pages simultaneously. Specify URLs, manage concurrent requests, and set delays for efficient batch crawling with Open Search MCP.

Instructions

Crawl and extract content from multiple web pages

Input Schema

Name           Required  Description                            Default
delay          No        Delay between batches in milliseconds  1000
extractLinks   No        Extract all links from pages           false
extractText    No        Extract main text content              true
maxConcurrent  No        Maximum concurrent requests            3
urls           Yes       Array of URLs to crawl                 (none)
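
As a quick illustration, a call to batch_crawl_urls might pass arguments like the following. The values are illustrative; only urls is required, and the remaining fields fall back to the defaults above.

    // Illustrative arguments for batch_crawl_urls; only `urls` is required.
    const args = {
      urls: [
        'https://example.com/a',
        'https://example.com/b',
        'https://example.com/c'
      ],
      extractText: true,     // extract main text content (default)
      extractLinks: false,   // do not collect links (default)
      maxConcurrent: 2,      // at most 2 requests in flight per batch
      delay: 1500            // wait 1.5 seconds between batches
    };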

Implementation Reference

  • The main handler function (execute) for the batch_crawl_urls tool. It validates the input URLs, calls the crawlMultiplePages helper with concurrency and delay controls, processes the results, and returns structured output with success metrics (an illustrative response shape is sketched after this list).
    execute: async (args: any) => {
      const { urls, extractText = true, extractLinks = false, maxConcurrent = 3, delay = 1000 } = args;
      try {
        const startTime = Date.now();

        // Validate URLs
        for (const url of urls) {
          try {
            new URL(url);
          } catch {
            return { success: false, error: `Invalid URL format: ${url}` };
          }
        }

        const results = await client.crawlMultiplePages(urls, {
          extractText,
          extractLinks,
          maxConcurrent,
          delay,
          maxContentLength: 3000 // Reduce per-page content length to handle multiple pages
        });

        const crawlTime = Date.now() - startTime;
        const successCount = results.filter(r => r.success).length;

        return {
          success: true,
          data: {
            source: 'Web Crawler',
            totalUrls: urls.length,
            successCount,
            failureCount: urls.length - successCount,
            crawlTime,
            results,
            summary: {
              successRate: Math.round((successCount / urls.length) * 100),
              averageTimePerPage: Math.round(crawlTime / urls.length),
              totalContentExtracted: results.filter(r => r.success).length
            },
            timestamp: Date.now()
          }
        };
      } catch (error) {
        return {
          success: false,
          error: `Multiple page crawling failed: ${error instanceof Error ? error.message : String(error)}`
        };
      }
    }
  • Input schema definition for the batch_crawl_urls tool, specifying parameters like urls array (max 10), extraction options, concurrency, and delay with types, defaults, and constraints.
    inputSchema: {
      type: 'object',
      properties: {
        urls: {
          type: 'array',
          items: { type: 'string' },
          description: 'Array of URLs to crawl',
          maxItems: 10
        },
        extractText: {
          type: 'boolean',
          description: 'Extract main text content',
          default: true
        },
        extractLinks: {
          type: 'boolean',
          description: 'Extract all links from pages',
          default: false
        },
        maxConcurrent: {
          type: 'number',
          description: 'Maximum concurrent requests',
          default: 3,
          minimum: 1,
          maximum: 5
        },
        delay: {
          type: 'number',
          description: 'Delay between batches in milliseconds',
          default: 1000,
          minimum: 500,
          maximum: 5000
        }
      },
      required: ['urls']
    },
  • The registry.registerTool call that defines and registers the batch_crawl_urls tool including name, description, schema, and execute handler.
    registry.registerTool({
      name: 'batch_crawl_urls',
      description: 'Crawl and extract content from multiple web pages',
      category: 'utility',
      source: 'Web Crawler',
      inputSchema: {
        type: 'object',
        properties: {
          urls: { type: 'array', items: { type: 'string' }, description: 'Array of URLs to crawl', maxItems: 10 },
          extractText: { type: 'boolean', description: 'Extract main text content', default: true },
          extractLinks: { type: 'boolean', description: 'Extract all links from pages', default: false },
          maxConcurrent: { type: 'number', description: 'Maximum concurrent requests', default: 3, minimum: 1, maximum: 5 },
          delay: { type: 'number', description: 'Delay between batches in milliseconds', default: 1000, minimum: 500, maximum: 5000 }
        },
        required: ['urls']
      },
      execute: async (args: any) => {
        const { urls, extractText = true, extractLinks = false, maxConcurrent = 3, delay = 1000 } = args;
        try {
          const startTime = Date.now();
          // Validate URLs
          for (const url of urls) {
            try {
              new URL(url);
            } catch {
              return { success: false, error: `Invalid URL format: ${url}` };
            }
          }
          const results = await client.crawlMultiplePages(urls, {
            extractText,
            extractLinks,
            maxConcurrent,
            delay,
            maxContentLength: 3000 // Reduce per-page content length to handle multiple pages
          });
          const crawlTime = Date.now() - startTime;
          const successCount = results.filter(r => r.success).length;
          return {
            success: true,
            data: {
              source: 'Web Crawler',
              totalUrls: urls.length,
              successCount,
              failureCount: urls.length - successCount,
              crawlTime,
              results,
              summary: {
                successRate: Math.round((successCount / urls.length) * 100),
                averageTimePerPage: Math.round(crawlTime / urls.length),
                totalContentExtracted: results.filter(r => r.success).length
              },
              timestamp: Date.now()
            }
          };
        } catch (error) {
          return {
            success: false,
            error: `Multiple page crawling failed: ${error instanceof Error ? error.message : String(error)}`
          };
        }
      }
    });
  • Core helper method in WebCrawlerClient that performs batched concurrent crawling of multiple URLs, using fetchPage and extractContent, with delays to avoid rate limiting.
    async crawlMultiplePages(urls: string[], options: any = {}) {
      const results = [];
      const maxConcurrent = options.maxConcurrent || 3;

      for (let i = 0; i < urls.length; i += maxConcurrent) {
        const batch = urls.slice(i, i + maxConcurrent);
        const batchPromises = batch.map(async (url) => {
          try {
            const pageData = await this.fetchPage(url, options);
            const content = this.extractContent(pageData.html, options);
            return {
              url,
              success: true,
              data: { ...content, status: pageData.status, finalUrl: pageData.url }
            };
          } catch (error) {
            return {
              url,
              success: false,
              error: error instanceof Error ? error.message : String(error)
            };
          }
        });

        const batchResults = await Promise.all(batchPromises);
        results.push(...batchResults);

        // Add a delay between batches to avoid overly frequent requests
        if (i + maxConcurrent < urls.length) {
          await new Promise(resolve => setTimeout(resolve, options.delay || 1000));
        }
      }

      return results;
    }
  • src/index.ts:248-248 (registration)
    Top-level registration call in the main server initialization that registers the web crawler tools module, including batch_crawl_urls.
    registerWebCrawlerTools(this.toolRegistry); // 2 tools: crawl_url_content, batch_crawl_urls
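
As referenced above, a successful batch_crawl_urls call resolves to a structure along these lines. The numbers are illustrative only, and the results array holds one entry per URL in the order produced by crawlMultiplePages.

    // Illustrative successful response from the execute handler (values made up).
    const exampleResponse = {
      success: true,
      data: {
        source: 'Web Crawler',
        totalUrls: 3,
        successCount: 2,
        failureCount: 1,
        crawlTime: 4120,            // total wall-clock time in ms
        results: [ /* one { url, success, data | error } entry per URL */ ],
        summary: {
          successRate: 67,          // Math.round((2 / 3) * 100)
          averageTimePerPage: 1373, // Math.round(4120 / 3)
          totalContentExtracted: 2
        },
        timestamp: 1717430000000
      }
    };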


MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/flyanima/open-search-mcp'
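
The same endpoint can be queried from TypeScript, as in the minimal sketch below; the exact response fields are whatever the API returns.

    // Minimal sketch: fetch this server's record from the Glama MCP directory API.
    // Assumes a runtime with the global fetch API (Node 18+ or a browser).
    const res = await fetch('https://glama.ai/api/mcp/v1/servers/flyanima/open-search-mcp');
    if (!res.ok) throw new Error(`Request failed: ${res.status}`);
    const server = await res.json();
    console.log(server);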

If you have feedback or need assistance with the MCP directory API, please join our Discord server.