Glama
code-alchemist01

Development Tools MCP Server

scrape_html

Extract HTML content from web pages for development workflows, handling both static and dynamic content with configurable options.

Instructions

Scrape HTML content from a URL

Input Schema

Name        Required  Description                      Default
url         Yes       URL to scrape                    —
useBrowser  No        Use browser for dynamic content  false
timeout     No        Request timeout in milliseconds  30000
headers     No        Custom HTTP headers              —
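A `tools/call` request against this schema might look like the following sketch; the request shape follows the MCP JSON-RPC convention, and the `id` and argument values here are illustrative, not taken from the project:

```typescript
// Hypothetical tools/call payload for scrape_html. Only `url` is required;
// the other arguments shown are the schema's documented defaults plus a
// sample custom header.
const request = {
  jsonrpc: "2.0",
  id: 1,
  method: "tools/call",
  params: {
    name: "scrape_html",
    arguments: {
      url: "https://example.com",
      useBrowser: false, // static axios fetch (default)
      timeout: 30000, // milliseconds (default)
      headers: { "Accept-Language": "en-US" },
    },
  },
};

console.log(JSON.stringify(request.params));
```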

Implementation Reference

  • Tool registration and schema definition for 'scrape_html', including input parameters for URL, browser usage, timeout, and headers.
    {
      name: 'scrape_html',
      description: 'Scrape HTML content from a URL',
      inputSchema: {
        type: 'object',
        properties: {
          url: {
            type: 'string',
            description: 'URL to scrape',
          },
          useBrowser: {
            type: 'boolean',
            description: 'Use browser for dynamic content',
            default: false,
          },
          timeout: {
            type: 'number',
            description: 'Request timeout in milliseconds',
            default: 30000,
          },
          headers: {
            type: 'object',
            description: 'Custom HTTP headers',
          },
        },
        required: ['url'],
      },
    },
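The handlers below read a `ScrapingConfig` object; its shape can be inferred from the fields they access, though the project's actual type declaration may differ:

```typescript
// ScrapingConfig as inferred from the input schema and the axios call in
// scrapeHTML below; this is a sketch, not the project's canonical definition.
interface ScrapingConfig {
  url: string; // required: page to fetch
  useBrowser?: boolean; // route to the dynamic (browser-based) scraper
  timeout?: number; // request timeout in ms; scrapeHTML falls back to 30000
  headers?: Record<string, string>; // custom HTTP headers
  maxRedirects?: number; // axios redirect cap; scrapeHTML falls back to 5
}

const config: ScrapingConfig = { url: "https://example.com" };
// Defaults are applied at call time, mirroring `config.timeout || 30000`:
console.log(config.timeout ?? 30000, config.maxRedirects ?? 5);
```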
  • Dispatch handler for 'scrape_html' tool call, choosing between dynamic and static scraper based on useBrowser, then formatting the result.
    case 'scrape_html': {
      if (config.useBrowser) {
        const data = await dynamicScraper.scrapeDynamicContent(config);
        return Formatters.formatScrapedData(data);
      } else {
        const data = await staticScraper.scrapeHTML(config);
        return Formatters.formatScrapedData(data);
      }
    }
  • Core implementation of HTML scraping: fetches page with axios, parses with cheerio, extracts title, text, HTML, links, images, and tables.
    async scrapeHTML(config: ScrapingConfig): Promise<ScrapedData> {
      const validation = Validators.validateScrapingConfig(config);
      if (!validation.valid) {
        throw new Error(`Invalid scraping config: ${validation.errors.join(', ')}`);
      }
    
      try {
        const response = await axios.get(config.url, {
          headers: config.headers || {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
          },
          timeout: config.timeout || 30000,
          maxRedirects: config.maxRedirects || 5,
          validateStatus: (status) => status < 500,
        });
    
        const $ = cheerio.load(response.data);
        const title = $('title').text().trim();
        const text = $('body').text().replace(/\s+/g, ' ').trim();
        const html = response.data;
    
        // Extract links
        const links: string[] = [];
        $('a[href]').each((_, element) => {
          const href = $(element).attr('href');
          if (href) {
            try {
              const url = new URL(href, config.url);
              links.push(url.href);
            } catch {
              // Invalid URL, skip
            }
          }
        });
    
        // Extract images
        const images: string[] = [];
        $('img[src]').each((_, element) => {
          const src = $(element).attr('src');
          if (src) {
            try {
              const url = new URL(src, config.url);
              images.push(url.href);
            } catch {
              // Invalid URL, skip
            }
          }
        });
    
        // Extract tables
        const tables: TableData[] = [];
        $('table').each((_, tableElement) => {
          const table: TableData = {
            headers: [],
            rows: [],
          };
    
          // Extract caption
          const caption = $(tableElement).find('caption').text().trim();
          if (caption) {
            table.caption = caption;
          }
    
          // Extract headers
          $(tableElement)
            .find('thead th, thead td, tr:first-child th, tr:first-child td')
            .each((_, header) => {
              table.headers.push($(header).text().trim());
            });
    
          // Extract rows
          $(tableElement)
            .find('tbody tr, tr')
            .each((_, row) => {
              const rowData: string[] = [];
              $(row)
                .find('td, th')
                .each((_, cell) => {
                  rowData.push($(cell).text().trim());
                });
              if (rowData.length > 0) {
                table.rows.push(rowData);
              }
            });
    
          if (table.headers.length > 0 || table.rows.length > 0) {
            tables.push(table);
          }
        });
    
        return {
          url: config.url,
          title,
          text,
          html,
          links: [...new Set(links)], // Remove duplicates
          images: [...new Set(images)], // Remove duplicates
          tables,
          scrapedAt: new Date(),
        };
      } catch (error) {
        throw new Error(`Failed to scrape ${config.url}: ${error instanceof Error ? error.message : String(error)}`);
      }
    }
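The link and image loops above rely on the WHATWG `URL` constructor to resolve relative references against the page URL and to discard values it cannot parse; the same policy in isolation:

```typescript
// Standalone sketch of the resolution policy used for links and images:
// resolve each candidate against the page URL, skip unparsable values,
// and de-duplicate with a Set, as scrapeHTML's return value does.
function resolveAll(candidates: string[], pageUrl: string): string[] {
  const resolved: string[] = [];
  for (const candidate of candidates) {
    try {
      resolved.push(new URL(candidate, pageUrl).href);
    } catch {
      // Invalid URL, skip
    }
  }
  return [...new Set(resolved)];
}

console.log(resolveAll(["/docs", "docs", "https://"], "https://example.com/a/"));
// → [ 'https://example.com/docs', 'https://example.com/a/docs' ]
```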
  • src/server.ts:18-25 (registration)
    Central registration of all tools, including webScrapingTools which contains 'scrape_html'.
    const allTools = [
      ...codeAnalysisTools,
      ...codeQualityTools,
      ...dependencyAnalysisTools,
      ...lintingTools,
      ...webScrapingTools,
      ...apiDiscoveryTools,
    ];
  • MCP server dispatch logic that routes 'scrape_html' calls (via webScrapingTools check) to the web scraping handler.
    } else if (webScrapingTools.some((t) => t.name === name)) {
      result = await handleWebScrapingTool(name, args || {});
    } else if (apiDiscoveryTools.some((t) => t.name === name)) {
