
scrape_html

Extracts structured data from a web page, including text, links, and images, given its URL. This tool helps automate content collection for analysis, research, or integration workflows.

Input Schema

Name           Required  Description                                Default
url            Yes       URL of the page to scrape                  (none)
extractText    No        Include extracted text in the results      DEFAULTS.EXTRACT_TEXT
extractLinks   No        Include extracted links in the results     DEFAULTS.EXTRACT_LINKS
extractImages  No        Include extracted images in the results    DEFAULTS.EXTRACT_IMAGES
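For reference, only `url` is required; the extraction flags fall back to the schema defaults when omitted. A sketch of the argument shape (the interface name is introduced here for illustration and is not part of the server's source):

```typescript
// Illustrative shape of the scrape_html arguments, mirroring the Zod schema.
interface ScrapeHtmlArgs {
  url: string;             // required; must be a valid URL
  extractText?: boolean;   // defaults to DEFAULTS.EXTRACT_TEXT
  extractLinks?: boolean;  // defaults to DEFAULTS.EXTRACT_LINKS
  extractImages?: boolean; // defaults to DEFAULTS.EXTRACT_IMAGES
}

// Minimal call: only the URL; the optional flags use their defaults.
const minimalArgs: ScrapeHtmlArgs = { url: "https://example.com" };

// Fully specified call: extract links only.
const fullArgs: ScrapeHtmlArgs = {
  url: "https://example.com",
  extractText: false,
  extractLinks: true,
  extractImages: false,
};
```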

Implementation Reference

  • The main execution handler for the 'scrape_html' tool. Fetches HTML from the given URL, extracts specified content (text, links, images), and returns JSON-structured results wrapped in MCP content format.
    async ({ url, extractText, extractLinks, extractImages }) => {
      return wrapToolExecution(async () => {
        const response = await fetch(url);
        if (!response.ok) {
          throw new Error(`HTTP ${response.status}: ${response.statusText}`);
        }
        const html = await response.text();
        const results = extractHtmlContent(html, extractText, extractLinks, extractImages);
        return {
          content: [{ type: "text" as const, text: JSON.stringify(results, null, 2) }],
          metadata: { url, extracted: { text: extractText, links: extractLinks, images: extractImages } }
        };
      }, { errorCode: ERROR_CODES.HTTP_REQUEST, context: "Failed to scrape HTML" });
    }
  • Input schema using Zod validation for the tool parameters: required 'url', optional flags for extracting text, links, and images with defaults from constants.
    {
      url: z.string().url("Valid URL is required"),
      extractText: z.boolean().optional().default(DEFAULTS.EXTRACT_TEXT),
      extractLinks: z.boolean().optional().default(DEFAULTS.EXTRACT_LINKS),
      extractImages: z.boolean().optional().default(DEFAULTS.EXTRACT_IMAGES)
    },
  • Function that registers the 'scrape_html' tool with the MCP server by calling server.tool() with the tool name, input schema, and handler function.
    function registerScrapeHtml(server: McpServer): void {
      server.tool(
        "scrape_html",
        {
          url: z.string().url("Valid URL is required"),
          extractText: z.boolean().optional().default(DEFAULTS.EXTRACT_TEXT),
          extractLinks: z.boolean().optional().default(DEFAULTS.EXTRACT_LINKS),
          extractImages: z.boolean().optional().default(DEFAULTS.EXTRACT_IMAGES)
        },
        async ({ url, extractText, extractLinks, extractImages }) => {
          return wrapToolExecution(async () => {
            const response = await fetch(url);
            if (!response.ok) {
              throw new Error(`HTTP ${response.status}: ${response.statusText}`);
            }
            const html = await response.text();
            const results = extractHtmlContent(html, extractText, extractLinks, extractImages);
            return {
              content: [{ type: "text" as const, text: JSON.stringify(results, null, 2) }],
              metadata: { url, extracted: { text: extractText, links: extractLinks, images: extractImages } }
            };
          }, { errorCode: ERROR_CODES.HTTP_REQUEST, context: "Failed to scrape HTML" });
        }
      );
    }
  • Core helper function that orchestrates HTML content extraction based on boolean flags, calling specialized extractors for text, links, and images.
    function extractHtmlContent(
      html: string,
      extractText: boolean,
      extractLinks: boolean,
      extractImages: boolean
    ): HtmlExtraction {
      const results: HtmlExtraction = {};
      if (extractText) {
        results.text = extractTextFromHtml(html);
      }
      if (extractLinks) {
        results.links = extractLinksFromHtml(html);
      }
      if (extractImages) {
        results.images = extractImagesFromHtml(html);
      }
      return results;
    }
  • src/index.ts:66 (registration)
    Top-level call to registerWebTools(server), which in turn calls registerScrapeHtml to register the scrape_html tool.
    registerWebTools(server);
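The handler above delegates all error handling to wrapToolExecution, whose source is not reproduced on this page. A plausible sketch, assuming it simply awaits the wrapped function and converts any thrown error into an MCP error result (the result shape and option names below are assumptions):

```typescript
// Hypothetical sketch of wrapToolExecution; the real implementation is not
// shown here, so the ToolResult shape and WrapOptions fields are assumptions.
interface ToolResult {
  content: { type: "text"; text: string }[];
  isError?: boolean;
  metadata?: Record<string, unknown>;
}

interface WrapOptions {
  errorCode: string; // e.g. ERROR_CODES.HTTP_REQUEST
  context: string;   // human-readable context for the failure
}

async function wrapToolExecution(
  fn: () => Promise<ToolResult>,
  options: WrapOptions
): Promise<ToolResult> {
  try {
    return await fn();
  } catch (err) {
    const message = err instanceof Error ? err.message : String(err);
    // Surface the failure as an MCP error result instead of rethrowing.
    return {
      content: [
        { type: "text", text: `[${options.errorCode}] ${options.context}: ${message}` },
      ],
      isError: true,
    };
  }
}
```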
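The three specialized extractors called by extractHtmlContent (extractTextFromHtml, extractLinksFromHtml, extractImagesFromHtml) are referenced but not shown. A minimal regex-based sketch of what they might do; the bodies below are assumptions, and a real implementation would more likely use a proper HTML parser:

```typescript
// Hypothetical regex-based extractors; the real helper bodies are not
// reproduced on this page.

// Strip scripts, styles, and tags, then collapse whitespace.
function extractTextFromHtml(html: string): string {
  return html
    .replace(/<script[\s\S]*?<\/script>/gi, " ")
    .replace(/<style[\s\S]*?<\/style>/gi, " ")
    .replace(/<[^>]+>/g, " ")
    .replace(/\s+/g, " ")
    .trim();
}

// Collect href values from anchor tags.
function extractLinksFromHtml(html: string): string[] {
  const links: string[] = [];
  const re = /<a\b[^>]*\bhref=["']([^"']+)["']/gi;
  let m: RegExpExecArray | null;
  while ((m = re.exec(html)) !== null) links.push(m[1]);
  return links;
}

// Collect src values from img tags.
function extractImagesFromHtml(html: string): string[] {
  const images: string[] = [];
  const re = /<img\b[^>]*\bsrc=["']([^"']+)["']/gi;
  let m: RegExpExecArray | null;
  while ((m = re.exec(html)) !== null) images.push(m[1]);
  return images;
}
```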
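registerWebTools itself is not shown; presumably it fans out to the individual per-tool registration functions. A self-contained sketch under that assumption (the McpServer stand-in type and stubbed registerScrapeHtml are illustrations, not the server's real code):

```typescript
// Stand-in for the real MCP server type, so this sketch is self-contained.
type McpServer = {
  tool: (name: string, schema: unknown, handler: unknown) => void;
};

// Stubbed registration for scrape_html; the real version passes the full
// Zod schema and handler documented above.
function registerScrapeHtml(server: McpServer): void {
  server.tool("scrape_html", {}, async () => ({ content: [] }));
}

// Hypothetical: registerWebTools likely just delegates to the per-tool
// registration functions. Only registerScrapeHtml is documented on this page.
function registerWebTools(server: McpServer): void {
  registerScrapeHtml(server);
  // ...other web tool registrations would follow here.
}
```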

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/ishuru/open-mcp'

If you have feedback or need assistance with the MCP directory API, please join our Discord server.