batch_webpage_scrape

Extract data from up to 20 webpages concurrently, with a configurable concurrency limit.

Instructions

Batch scrape multiple webpages with concurrent processing support.

Input Schema

Name           Required  Default  Description
urls           Yes       -        List of webpage URLs to scrape, up to 20.
maxConcurrent  No        3        Maximum number of concurrent requests (1-10).
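
For illustration, an arguments object that satisfies this schema might look like the following (the URLs are placeholders):

    // Example arguments for batch_webpage_scrape (placeholder URLs)
    const args = {
      urls: ['https://example.com', 'https://example.org'],
      maxConcurrent: 5
    };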

Implementation Reference

  • Main execution handler for the batch_webpage_scrape tool. Validates inputs, processes URLs in concurrent batches using the scrapeWebpage helper, and collects results and errors.
    async function handleBatchWebpageScrape(args) {
      const { urls, maxConcurrent = 3 } = args;

      // Validate inputs against the same limits declared in the schema
      if (!Array.isArray(urls) || urls.length === 0) {
        throw new Error('urls must be a non-empty array');
      }
      if (urls.length > 20) {
        throw new Error('A maximum of 20 URLs is supported');
      }
      if (maxConcurrent < 1 || maxConcurrent > 10) {
        throw new Error('maxConcurrent must be between 1 and 10');
      }

      const searchService = (await import('../services/searchService.js')).default;
      const results = [];
      const errors = [];

      // Process URLs in batches of at most maxConcurrent
      for (let i = 0; i < urls.length; i += maxConcurrent) {
        const batch = urls.slice(i, i + maxConcurrent);
        const batchPromises = batch.map(async (url) => {
          try {
            const result = await searchService.scrapeWebpage(url);
            return { success: true, url, data: result };
          } catch (error) {
            return { success: false, url, error: error.message };
          }
        });

        // Each promise resolves even on scrape failure (errors are caught
        // above); allSettled guards against anything that still rejects
        const batchResults = await Promise.allSettled(batchPromises);
        batchResults.forEach((result) => {
          if (result.status === 'fulfilled') {
            if (result.value.success) {
              results.push(result.value);
            } else {
              errors.push(result.value);
            }
          } else {
            errors.push({ url: 'unknown', error: result.reason?.message || 'Unknown error' });
          }
        });
      }

      return {
        tool: 'batch_webpage_scrape',
        totalUrls: urls.length,
        successful: results.length,
        failed: errors.length,
        maxConcurrent,
        results,
        errors,
        timestamp: new Date().toISOString()
      };
    }
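
    A minimal usage sketch, assuming the handler above is exported from its module (the import path is hypothetical):

      // Hypothetical import path; adjust to wherever the handler is exported
      import { handleBatchWebpageScrape } from './tools/batchWebpageScrape.js';

      const summary = await handleBatchWebpageScrape({
        urls: ['https://example.com', 'https://example.org'],
        maxConcurrent: 2
      });
      console.log(`${summary.successful}/${summary.totalUrls} pages scraped`);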
  • Input schema definition for batch_webpage_scrape tool, specifying urls array (1-20) and optional maxConcurrent (1-10).
    {
      name: 'batch_webpage_scrape',
      description: 'Batch scrape multiple webpages with concurrent processing support.',
      inputSchema: {
        type: 'object',
        properties: {
          urls: {
            type: 'array',
            items: { type: 'string' },
            description: 'List of webpage URLs to scrape, up to 20.',
            minItems: 1,
            maxItems: 20
          },
          maxConcurrent: {
            type: 'number',
            description: 'Maximum concurrency',
            default: 3,
            minimum: 1,
            maximum: 10
          }
        },
        required: ['urls']
      }
    }
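
    Since the handler's runtime checks mirror these constraints, a client could pre-validate arguments against the schema before calling the tool. A sketch using Ajv (the project is not shown to depend on Ajv; this is an assumption for illustration):

      import Ajv from 'ajv';

      // The inputSchema object from the tool definition above
      const inputSchema = {
        type: 'object',
        properties: {
          urls: { type: 'array', items: { type: 'string' }, minItems: 1, maxItems: 20 },
          maxConcurrent: { type: 'number', default: 3, minimum: 1, maximum: 10 }
        },
        required: ['urls']
      };

      const validate = new Ajv().compile(inputSchema);
      console.log(validate({ urls: [] }));                       // false: minItems violated
      console.log(validate({ urls: ['https://example.com'] }));  // true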
  • MCP ListTools request handler that returns the list of available tools, including batch_webpage_scrape, via generateTools().
    server.setRequestHandler(ListToolsRequestSchema, async () => {
      return { tools: generateTools() };
    });
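
    Listing alone does not make the tool invocable; the server also needs a CallTool handler. A plausible sketch of that wiring using the standard MCP SDK request schema (the dispatch below is an assumption, not code from this repository):

      import { CallToolRequestSchema } from '@modelcontextprotocol/sdk/types.js';

      // Assumed dispatch: route batch_webpage_scrape calls to the handler above
      server.setRequestHandler(CallToolRequestSchema, async (request) => {
        if (request.params.name === 'batch_webpage_scrape') {
          const result = await handleBatchWebpageScrape(request.params.arguments);
          return { content: [{ type: 'text', text: JSON.stringify(result, null, 2) }] };
        }
        throw new Error(`Unknown tool: ${request.params.name}`);
      });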
  • Supporting utility function scrapeWebpage, used by the batch_webpage_scrape handler to scrape individual webpages with axios and cheerio, extracting metadata, content, and links.
    async scrapeWebpage(url) {
      try {
        const response = await axios.get(url, {
          headers: {
            'User-Agent': this.getRandomUserAgent(),
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
            'Accept-Language': 'en-US,en;q=0.5',
            'Accept-Encoding': 'gzip, deflate',
            'Connection': 'keep-alive'
          },
          timeout: 15000
        });

        const $ = cheerio.load(response.data);

        // Extract page metadata
        const title = $('title').text().trim();
        const description = $('meta[name="description"]').attr('content') || '';
        const keywords = $('meta[name="keywords"]').attr('content') || '';

        // Extract main content, whitespace-collapsed and capped at 2000 characters
        const content = $('body').text()
          .replace(/\s+/g, ' ')
          .trim()
          .substring(0, 2000);

        // Extract up to 50 absolute links with their anchor text
        const links = [];
        $('a[href]').each((index, element) => {
          if (index < 50) {
            const href = $(element).attr('href');
            const text = $(element).text().trim();
            if (href && text && href.startsWith('http')) {
              links.push({ url: href, text });
            }
          }
        });

        logger.info(`Webpage scraped successfully: ${url}`);
        return { url, title, description, keywords, content, links, timestamp: new Date().toISOString() };
      } catch (error) {
        logger.error(`Webpage scraping error for ${url}:`, error);
        throw new Error(`Failed to scrape webpage: ${error.message}`);
      }
    }
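
    Calling the helper directly through the service it lives on (the import path matches the dynamic import in the handler above):

      import searchService from '../services/searchService.js';

      const page = await searchService.scrapeWebpage('https://example.com');
      console.log(page.title);        // contents of the page's <title>
      console.log(page.links.length); // up to 50 extracted links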


MCP directory API

We provide all the information about MCP servers via our MCP directory API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/Bosegluon2/spider-mcp'

If you have feedback or need assistance with the MCP directory API, please join our Discord server.