scrape_html

Extract structured data from web pages, including text, links, and images, by providing a URL. This tool helps automate content collection for analysis, research, or integration workflows.

Input Schema

| Name          | Required | Description                                     | Default                   |
| ------------- | -------- | ----------------------------------------------- | ------------------------- |
| url           | Yes      | URL of the page to scrape (must be a valid URL) | —                         |
| extractText   | No       | Whether to extract the page's text content      | `DEFAULTS.EXTRACT_TEXT`   |
| extractLinks  | No       | Whether to extract links                        | `DEFAULTS.EXTRACT_LINKS`  |
| extractImages | No       | Whether to extract images                       | `DEFAULTS.EXTRACT_IMAGES` |
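
For example, a call to the tool might supply arguments like these (the URL is a placeholder; any flag left out falls back to its `DEFAULTS` constant):

```json
{
  "url": "https://example.com/article",
  "extractText": true,
  "extractLinks": true,
  "extractImages": false
}
```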

Implementation Reference

  • Function that registers the 'scrape_html' tool with the MCP server by calling server.tool() with the tool name, the Zod input schema (a required 'url' plus optional extraction flags with defaults from constants), and the async handler, which fetches the URL, extracts the requested content, and returns JSON-structured results wrapped in MCP content format.
    function registerScrapeHtml(server: McpServer): void {
      server.tool("scrape_html",
        {
          url: z.string().url("Valid URL is required"),
          extractText: z.boolean().optional().default(DEFAULTS.EXTRACT_TEXT),
          extractLinks: z.boolean().optional().default(DEFAULTS.EXTRACT_LINKS),
          extractImages: z.boolean().optional().default(DEFAULTS.EXTRACT_IMAGES)
        },
        async ({ url, extractText, extractLinks, extractImages }) => {
          return wrapToolExecution(async () => {
            const response = await fetch(url);
            if (!response.ok) {
              throw new Error(`HTTP ${response.status}: ${response.statusText}`);
            }
    
            const html = await response.text();
            const results = extractHtmlContent(html, extractText, extractLinks, extractImages);
    
            return {
              content: [{
                type: "text" as const,
                text: JSON.stringify(results, null, 2)
              }],
              metadata: {
                url,
                extracted: {
                  text: extractText,
                  links: extractLinks,
                  images: extractImages
                }
              }
            };
          }, {
            errorCode: ERROR_CODES.HTTP_REQUEST,
            context: "Failed to scrape HTML"
          });
        }
      );
    }
  • Core helper function that orchestrates HTML content extraction based on boolean flags, calling specialized extractors for text, links, and images.
    function extractHtmlContent(
      html: string,
      extractText: boolean,
      extractLinks: boolean,
      extractImages: boolean
    ): HtmlExtraction {
      const results: HtmlExtraction = {};
    
      if (extractText) {
        results.text = extractTextFromHtml(html);
      }
    
      if (extractLinks) {
        results.links = extractLinksFromHtml(html);
      }
    
      if (extractImages) {
        results.images = extractImagesFromHtml(html);
      }
    
      return results;
    }
  • src/index.ts:66 (registration)
    Top-level call to registerWebTools(server), which in turn calls registerScrapeHtml to register the scrape_html tool.
    registerWebTools(server);
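
The handler above wraps its body in `wrapToolExecution`, whose implementation is not shown on this page. A minimal sketch of what such a wrapper might look like, assuming it catches thrown errors and converts them into a tagged MCP-style error result (the signature and return shape here are assumptions, not the actual implementation):

```typescript
// Hypothetical sketch of wrapToolExecution: runs the handler and, instead of
// letting errors propagate, converts them into an MCP-style error result
// tagged with the given error code and context string.
interface ToolErrorOptions {
  errorCode: string;
  context: string;
}

interface ToolErrorResult {
  isError: true;
  content: { type: "text"; text: string }[];
}

async function wrapToolExecution<T>(
  fn: () => Promise<T>,
  options: ToolErrorOptions
): Promise<T | ToolErrorResult> {
  try {
    return await fn();
  } catch (err) {
    const message = err instanceof Error ? err.message : String(err);
    return {
      isError: true,
      content: [{
        type: "text",
        text: `[${options.errorCode}] ${options.context}: ${message}`
      }]
    };
  }
}
```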
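
The specialized extractors called by `extractHtmlContent` (`extractTextFromHtml`, `extractLinksFromHtml`, `extractImagesFromHtml`) are also not shown. A rough, regex-based sketch of two of them, purely for illustration; a production implementation would more likely use a real HTML parser, since regular expressions cannot handle arbitrary HTML reliably:

```typescript
// Hypothetical regex-based extractors. These handle only straightforward
// markup; the actual implementations behind this tool are not shown here.
function extractLinksFromHtml(html: string): string[] {
  // Collect the href value of every <a> tag.
  const links: string[] = [];
  const re = /<a\b[^>]*\bhref\s*=\s*["']([^"']+)["']/gi;
  let m: RegExpExecArray | null;
  while ((m = re.exec(html)) !== null) {
    links.push(m[1]);
  }
  return links;
}

function extractTextFromHtml(html: string): string {
  // Drop script/style blocks, strip remaining tags, collapse whitespace.
  return html
    .replace(/<(script|style)\b[\s\S]*?<\/\1>/gi, " ")
    .replace(/<[^>]+>/g, " ")
    .replace(/\s+/g, " ")
    .trim();
}
```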
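
`registerWebTools` itself is not listed on this page; based on the description above, it is presumably a thin aggregator. A hypothetical sketch (the `ToolServer` interface is a stand-in for the MCP SDK's `McpServer` type, and any registrations besides `registerScrapeHtml` are assumptions):

```typescript
// Hypothetical sketch: registerWebTools as a thin aggregator that registers
// each web-related tool on the server.
interface ToolServer {
  tool(name: string, schema: object, handler: (...args: unknown[]) => unknown): void;
}

function registerScrapeHtml(server: ToolServer): void {
  // Schema and handler as shown in the Implementation Reference above.
  server.tool("scrape_html", { /* zod schema */ }, async () => ({ content: [] }));
}

function registerWebTools(server: ToolServer): void {
  registerScrapeHtml(server);
  // Other web tools, if any, would be registered here as well.
}
```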
