
scrape_html

Extract HTML content from web pages for development workflows, handling both static and dynamic content with configurable options.

Instructions

Scrape HTML content from a URL

Input Schema

| Name | Required | Description | Default |
| --- | --- | --- | --- |
| url | Yes | URL to scrape | |
| useBrowser | No | Use browser for dynamic content | false |
| timeout | No | Request timeout in milliseconds | 30000 |
| headers | No | Custom HTTP headers | |
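
A call supplying these parameters might look like the following (a hypothetical example; the URL and values are illustrative, and only `url` is required):

```typescript
// Hypothetical arguments for a scrape_html tool call; only `url` is required.
const args = {
  url: 'https://example.com/docs',          // page to fetch
  useBrowser: false,                        // static fetch; true selects the browser scraper
  timeout: 15000,                           // override the 30000 ms default
  headers: { 'Accept-Language': 'en-US' },  // optional custom HTTP headers
};
```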

Implementation Reference

  • Tool registration and schema definition for 'scrape_html', including input parameters for URL, browser usage, timeout, and headers:

    ```typescript
    {
      name: 'scrape_html',
      description: 'Scrape HTML content from a URL',
      inputSchema: {
        type: 'object',
        properties: {
          url: { type: 'string', description: 'URL to scrape' },
          useBrowser: {
            type: 'boolean',
            description: 'Use browser for dynamic content',
            default: false,
          },
          timeout: {
            type: 'number',
            description: 'Request timeout in milliseconds',
            default: 30000,
          },
          headers: { type: 'object', description: 'Custom HTTP headers' },
        },
        required: ['url'],
      },
    },
    ```
  • Dispatch handler for the 'scrape_html' tool call, choosing between the dynamic and static scraper based on useBrowser, then formatting the result (the shape of config is sketched after this list):

    ```typescript
    case 'scrape_html': {
      if (config.useBrowser) {
        const data = await dynamicScraper.scrapeDynamicContent(config);
        return Formatters.formatScrapedData(data);
      } else {
        const data = await staticScraper.scrapeHTML(config);
        return Formatters.formatScrapedData(data);
      }
    }
    ```
  • Core implementation of static HTML scraping: fetches the page with axios, parses it with cheerio, and extracts the title, body text, raw HTML, links, images, and tables (the ScrapedData and TableData shapes are also sketched after this list):

    ```typescript
    async scrapeHTML(config: ScrapingConfig): Promise<ScrapedData> {
      const validation = Validators.validateScrapingConfig(config);
      if (!validation.valid) {
        throw new Error(`Invalid scraping config: ${validation.errors.join(', ')}`);
      }

      try {
        const response = await axios.get(config.url, {
          headers: config.headers || {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
          },
          timeout: config.timeout || 30000,
          maxRedirects: config.maxRedirects || 5,
          validateStatus: (status) => status < 500,
        });

        const $ = cheerio.load(response.data);
        const title = $('title').text().trim();
        const text = $('body').text().replace(/\s+/g, ' ').trim();
        const html = response.data;

        // Extract links
        const links: string[] = [];
        $('a[href]').each((_, element) => {
          const href = $(element).attr('href');
          if (href) {
            try {
              const url = new URL(href, config.url);
              links.push(url.href);
            } catch {
              // Invalid URL, skip
            }
          }
        });

        // Extract images
        const images: string[] = [];
        $('img[src]').each((_, element) => {
          const src = $(element).attr('src');
          if (src) {
            try {
              const url = new URL(src, config.url);
              images.push(url.href);
            } catch {
              // Invalid URL, skip
            }
          }
        });

        // Extract tables
        const tables: TableData[] = [];
        $('table').each((_, tableElement) => {
          const table: TableData = { headers: [], rows: [] };

          // Extract caption
          const caption = $(tableElement).find('caption').text().trim();
          if (caption) {
            table.caption = caption;
          }

          // Extract headers
          $(tableElement)
            .find('thead th, thead td, tr:first-child th, tr:first-child td')
            .each((_, header) => {
              table.headers.push($(header).text().trim());
            });

          // Extract rows
          $(tableElement)
            .find('tbody tr, tr')
            .each((_, row) => {
              const rowData: string[] = [];
              $(row)
                .find('td, th')
                .each((_, cell) => {
                  rowData.push($(cell).text().trim());
                });
              if (rowData.length > 0) {
                table.rows.push(rowData);
              }
            });

          if (table.headers.length > 0 || table.rows.length > 0) {
            tables.push(table);
          }
        });

        return {
          url: config.url,
          title,
          text,
          html,
          links: [...new Set(links)],   // Remove duplicates
          images: [...new Set(images)], // Remove duplicates
          tables,
          scrapedAt: new Date(),
        };
      } catch (error) {
        throw new Error(
          `Failed to scrape ${config.url}: ${error instanceof Error ? error.message : String(error)}`
        );
      }
    }
    ```
  • src/server.ts:18-25 (registration): central registration of all tools, including webScrapingTools, which contains 'scrape_html' (see the server sketch after this list):

    ```typescript
    const allTools = [
      ...codeAnalysisTools,
      ...codeQualityTools,
      ...dependencyAnalysisTools,
      ...lintingTools,
      ...webScrapingTools,
      ...apiDiscoveryTools,
    ];
    ```
  • MCP server dispatch logic that routes 'scrape_html' calls (via the webScrapingTools check) to the web scraping handler:

    ```typescript
    } else if (webScrapingTools.some((t) => t.name === name)) {
      result = await handleWebScrapingTool(name, args || {});
    } else if (apiDiscoveryTools.some((t) => t.name === name)) {
    ```
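
The ScrapingConfig, ScrapedData, and TableData types referenced above are not defined on this page. Based on how their fields are read and produced in scrapeHTML, their shapes are plausibly as follows (a reconstruction inferred from the code, not the project's actual declarations):

```typescript
// Inferred from usage in scrapeHTML; these are assumptions, not the
// project's actual type declarations.
interface ScrapingConfig {
  url: string;
  useBrowser?: boolean;              // selects the dynamic (browser) scraper
  timeout?: number;                  // milliseconds; scrapeHTML defaults to 30000
  headers?: Record<string, string>;  // custom HTTP headers
  maxRedirects?: number;             // scrapeHTML defaults to 5
}

interface TableData {
  caption?: string;   // set only when the <table> has a <caption>
  headers: string[];
  rows: string[][];
}

interface ScrapedData {
  url: string;
  title: string;
  text: string;        // whitespace-collapsed body text
  html: string;        // raw response body
  links: string[];     // absolute URLs, de-duplicated
  images: string[];    // absolute URLs, de-duplicated
  tables: TableData[];
  scrapedAt: Date;
}
```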
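
For context, the registration array and the dispatch branch above typically sit inside an MCP server's ListTools and CallTool handlers. A minimal sketch, assuming the official @modelcontextprotocol/sdk and that handleWebScrapingTool returns a formatted string (both assumptions; only the fragments above are confirmed by this page):

```typescript
import { Server } from '@modelcontextprotocol/sdk/server/index.js';
import {
  CallToolRequestSchema,
  ListToolsRequestSchema,
} from '@modelcontextprotocol/sdk/types.js';

const server = new Server(
  { name: 'development-tools', version: '1.0.0' }, // hypothetical metadata
  { capabilities: { tools: {} } }
);

// Expose the combined tool list, including 'scrape_html'.
server.setRequestHandler(ListToolsRequestSchema, async () => ({
  tools: allTools,
}));

// Route tool calls to the matching handler group.
server.setRequestHandler(CallToolRequestSchema, async (request) => {
  const { name, arguments: args } = request.params;

  let result: string;
  if (webScrapingTools.some((t) => t.name === name)) {
    result = await handleWebScrapingTool(name, args || {});
  } else {
    throw new Error(`Unknown tool: ${name}`);
  }

  return { content: [{ type: 'text', text: result }] };
});
```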
