# scrape_html
Extract HTML content from web pages for development workflows, handling both static and dynamic content with configurable options.
## Instructions
Scrape HTML content from a URL
## Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| url | Yes | URL to scrape | |
| useBrowser | No | Use browser for dynamic content | false |
| timeout | No | Request timeout in milliseconds | 30000 |
| headers | No | Custom HTTP headers | |
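
To illustrate the schema, a `scrape_html` call might pass arguments like the following (the target URL and header are invented for the example; only `url` is required, and the other fields override the documented defaults):

```typescript
// Hypothetical arguments object for a scrape_html tool call.
const args = {
  url: 'https://example.com/docs',          // required: URL to scrape
  useBrowser: false,                        // static fetch (the default)
  timeout: 10000,                           // fail after 10 s instead of the 30 s default
  headers: { 'Accept-Language': 'en-US' },  // custom HTTP headers
};

console.log(JSON.stringify(args));
```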
## Implementation Reference
- `src/tools/web-scraping.ts:8-35` (registration) — Tool registration and schema definition for `scrape_html`, including input parameters for URL, browser usage, timeout, and headers.

  ```typescript
  {
    name: 'scrape_html',
    description: 'Scrape HTML content from a URL',
    inputSchema: {
      type: 'object',
      properties: {
        url: {
          type: 'string',
          description: 'URL to scrape',
        },
        useBrowser: {
          type: 'boolean',
          description: 'Use browser for dynamic content',
          default: false,
        },
        timeout: {
          type: 'number',
          description: 'Request timeout in milliseconds',
          default: 30000,
        },
        headers: {
          type: 'object',
          description: 'Custom HTTP headers',
        },
      },
      required: ['url'],
    },
  },
  ```
- `src/tools/web-scraping.ts:283-291` (handler) — Dispatch handler for the `scrape_html` tool call, choosing between the dynamic and static scraper based on `useBrowser`, then formatting the result.

  ```typescript
  case 'scrape_html': {
    if (config.useBrowser) {
      const data = await dynamicScraper.scrapeDynamicContent(config);
      return Formatters.formatScrapedData(data);
    } else {
      const data = await staticScraper.scrapeHTML(config);
      return Formatters.formatScrapedData(data);
    }
  }
  ```
- `src/scrapers/static-scraper.ts:10-113` (handler) — Core implementation of HTML scraping: fetches the page with axios, parses it with cheerio, and extracts the title, text, HTML, links, images, and tables.

  ```typescript
  async scrapeHTML(config: ScrapingConfig): Promise<ScrapedData> {
    const validation = Validators.validateScrapingConfig(config);
    if (!validation.valid) {
      throw new Error(`Invalid scraping config: ${validation.errors.join(', ')}`);
    }

    try {
      const response = await axios.get(config.url, {
        headers: config.headers || {
          'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
        },
        timeout: config.timeout || 30000,
        maxRedirects: config.maxRedirects || 5,
        validateStatus: (status) => status < 500,
      });

      const $ = cheerio.load(response.data);
      const title = $('title').text().trim();
      const text = $('body').text().replace(/\s+/g, ' ').trim();
      const html = response.data;

      // Extract links
      const links: string[] = [];
      $('a[href]').each((_, element) => {
        const href = $(element).attr('href');
        if (href) {
          try {
            const url = new URL(href, config.url);
            links.push(url.href);
          } catch {
            // Invalid URL, skip
          }
        }
      });

      // Extract images
      const images: string[] = [];
      $('img[src]').each((_, element) => {
        const src = $(element).attr('src');
        if (src) {
          try {
            const url = new URL(src, config.url);
            images.push(url.href);
          } catch {
            // Invalid URL, skip
          }
        }
      });

      // Extract tables
      const tables: TableData[] = [];
      $('table').each((_, tableElement) => {
        const table: TableData = {
          headers: [],
          rows: [],
        };

        // Extract caption
        const caption = $(tableElement).find('caption').text().trim();
        if (caption) {
          table.caption = caption;
        }

        // Extract headers
        $(tableElement)
          .find('thead th, thead td, tr:first-child th, tr:first-child td')
          .each((_, header) => {
            table.headers.push($(header).text().trim());
          });

        // Extract rows
        $(tableElement)
          .find('tbody tr, tr')
          .each((_, row) => {
            const rowData: string[] = [];
            $(row)
              .find('td, th')
              .each((_, cell) => {
                rowData.push($(cell).text().trim());
              });
            if (rowData.length > 0) {
              table.rows.push(rowData);
            }
          });

        if (table.headers.length > 0 || table.rows.length > 0) {
          tables.push(table);
        }
      });

      return {
        url: config.url,
        title,
        text,
        html,
        links: [...new Set(links)], // Remove duplicates
        images: [...new Set(images)], // Remove duplicates
        tables,
        scrapedAt: new Date(),
      };
    } catch (error) {
      throw new Error(`Failed to scrape ${config.url}: ${error instanceof Error ? error.message : String(error)}`);
    }
  }
  ```
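
The link and image extraction in `scrapeHTML` relies on the WHATWG `URL` constructor to resolve relative hrefs against the page URL, skips anything that fails to parse, and de-duplicates with a `Set`. That core pattern can be exercised in isolation (the helper name and sample hrefs below are invented for illustration):

```typescript
// Resolve candidate hrefs against a base page URL, skipping invalid ones,
// then de-duplicate — the same pattern scrapeHTML applies to links and images.
function resolveLinks(base: string, hrefs: string[]): string[] {
  const links: string[] = [];
  for (const href of hrefs) {
    try {
      links.push(new URL(href, base).href); // resolves relative paths against base
    } catch {
      // Invalid URL, skip — mirrors the scraper's catch-and-continue
    }
  }
  return [...new Set(links)]; // remove duplicates, preserving first-seen order
}

const resolved = resolveLinks('https://example.com/docs/', [
  'page.html',            // relative → https://example.com/docs/page.html
  '/about',               // root-relative → https://example.com/about
  'page.html',            // duplicate, dropped by the Set
  'http://',              // invalid URL, skipped
  'https://other.test/x', // absolute, kept as-is
]);
console.log(resolved);
```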
- `src/server.ts:18-25` (registration) — Central registration of all tools, including `webScrapingTools`, which contains `scrape_html`.

  ```typescript
  const allTools = [
    ...codeAnalysisTools,
    ...codeQualityTools,
    ...dependencyAnalysisTools,
    ...lintingTools,
    ...webScrapingTools,
    ...apiDiscoveryTools,
  ];
  ```
- `src/server.ts:70-72` (handler) — MCP server dispatch logic that routes `scrape_html` calls (via the `webScrapingTools` membership check) to the web scraping handler.

  ```typescript
  } else if (webScrapingTools.some((t) => t.name === name)) {
    result = await handleWebScrapingTool(name, args || {});
  } else if (apiDiscoveryTools.some((t) => t.name === name)) {
  ```
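
The dispatch above routes a tool call by checking which registered tool array contains its name. A minimal sketch of that routing pattern, with simplified stand-in tool arrays and return values rather than the real server code:

```typescript
// Simplified stand-in for the server's tool registry — not the real MCP types.
interface Tool { name: string }

const webScrapingTools: Tool[] = [{ name: 'scrape_html' }];
const apiDiscoveryTools: Tool[] = [{ name: 'discover_api' }]; // hypothetical tool name

// Route a tool-call name to the handler group that registered it,
// mirroring the chain of `.some()` membership checks in src/server.ts.
function route(name: string): string {
  if (webScrapingTools.some((t) => t.name === name)) {
    return 'web-scraping';
  } else if (apiDiscoveryTools.some((t) => t.name === name)) {
    return 'api-discovery';
  }
  return 'unknown';
}

console.log(route('scrape_html'));
```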