scrape_html
Extract text, links, and images from web pages for data collection and analysis. Specify a URL and choose what content to retrieve.
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| url | Yes | URL of the web page to fetch and scrape. | |
| extractText | No | Whether to extract the page's plain text. | `DEFAULTS.EXTRACT_TEXT` |
| extractLinks | No | Whether to extract links from the page. | `DEFAULTS.EXTRACT_LINKS` |
| extractImages | No | Whether to extract images from the page. | `DEFAULTS.EXTRACT_IMAGES` |
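
For illustration, here is a minimal sketch of how these arguments validate, mirroring the Zod schema shown in the implementation reference below. The boolean values standing in for the `DEFAULTS` constants are assumptions, not taken from the source:

```typescript
import { z } from "zod";

// Sketch of argument validation against the scrape_html input schema.
// The DEFAULTS.* values are assumed; the real constants live elsewhere in the repo.
const scrapeHtmlArgs = z.object({
  url: z.string().url("Valid URL is required"),
  extractText: z.boolean().optional().default(true),   // DEFAULTS.EXTRACT_TEXT (assumed)
  extractLinks: z.boolean().optional().default(true),   // DEFAULTS.EXTRACT_LINKS (assumed)
  extractImages: z.boolean().optional().default(false)  // DEFAULTS.EXTRACT_IMAGES (assumed)
});

// Only `url` is required; omitted flags are filled in with their defaults.
const args = scrapeHtmlArgs.parse({ url: "https://example.com" });
console.log(args);
// -> { url: "https://example.com", extractText: true, extractLinks: true, extractImages: false }

// An invalid URL fails validation with the message "Valid URL is required".
scrapeHtmlArgs.safeParse({ url: "not-a-url" }); // { success: false, ... }
```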
Implementation Reference
- src/tools/web-tools.ts:120-148 (handler): The main handler function. It fetches HTML from the provided URL, extracts the requested content (text, links, and/or images), and returns the results as pretty-printed JSON. Error handling is delegated to `wrapToolExecution`, which is not shown here (see the sketch after this list).

  ```typescript
  async ({ url, extractText, extractLinks, extractImages }) => {
    return wrapToolExecution(async () => {
      const response = await fetch(url);
      if (!response.ok) {
        throw new Error(`HTTP ${response.status}: ${response.statusText}`);
      }
      const html = await response.text();
      const results = extractHtmlContent(html, extractText, extractLinks, extractImages);
      return {
        content: [{ type: "text" as const, text: JSON.stringify(results, null, 2) }],
        metadata: { url, extracted: { text: extractText, links: extractLinks, images: extractImages } }
      };
    }, { errorCode: ERROR_CODES.HTTP_REQUEST, context: "Failed to scrape HTML" });
  }
  ```
- src/tools/web-tools.ts:114-119 (schema): Zod input schema validating the URL and the optional extraction flags for text, links, and images.

  ```typescript
  {
    url: z.string().url("Valid URL is required"),
    extractText: z.boolean().optional().default(DEFAULTS.EXTRACT_TEXT),
    extractLinks: z.boolean().optional().default(DEFAULTS.EXTRACT_LINKS),
    extractImages: z.boolean().optional().default(DEFAULTS.EXTRACT_IMAGES)
  },
  ```
- src/tools/web-tools.ts:112-150 (registration): Registers the scrape_html tool with the MCP server, wiring the schema and handler above together.

  ```typescript
  function registerScrapeHtml(server: McpServer): void {
    server.tool(
      "scrape_html",
      {
        url: z.string().url("Valid URL is required"),
        extractText: z.boolean().optional().default(DEFAULTS.EXTRACT_TEXT),
        extractLinks: z.boolean().optional().default(DEFAULTS.EXTRACT_LINKS),
        extractImages: z.boolean().optional().default(DEFAULTS.EXTRACT_IMAGES)
      },
      async ({ url, extractText, extractLinks, extractImages }) => {
        return wrapToolExecution(async () => {
          const response = await fetch(url);
          if (!response.ok) {
            throw new Error(`HTTP ${response.status}: ${response.statusText}`);
          }
          const html = await response.text();
          const results = extractHtmlContent(html, extractText, extractLinks, extractImages);
          return {
            content: [{ type: "text" as const, text: JSON.stringify(results, null, 2) }],
            metadata: { url, extracted: { text: extractText, links: extractLinks, images: extractImages } }
          };
        }, { errorCode: ERROR_CODES.HTTP_REQUEST, context: "Failed to scrape HTML" });
      }
    );
  }
  ```
- src/tools/web-tools.ts:155-176 (helper): Helper that conditionally extracts text, links, and images from the HTML, depending on the flags, and collects them into an `HtmlExtraction` result (the type is not part of this excerpt; see the sketch after this list).

  ```typescript
  function extractHtmlContent(
    html: string,
    extractText: boolean,
    extractLinks: boolean,
    extractImages: boolean
  ): HtmlExtraction {
    const results: HtmlExtraction = {};
    if (extractText) {
      results.text = extractTextFromHtml(html);
    }
    if (extractLinks) {
      results.links = extractLinksFromHtml(html);
    }
    if (extractImages) {
      results.images = extractImagesFromHtml(html);
    }
    return results;
  }
  ```
- src/tools/web-tools.ts:181-188 (helper): Helper that extracts plain text from HTML by stripping `<script>` and `<style>` blocks, removing the remaining tags, and collapsing whitespace. The companion link and image helpers are not included in this excerpt; a hypothetical sketch follows the list.

  ```typescript
  function extractTextFromHtml(html: string): string {
    return html
      .replace(/<script[^>]*>[\s\S]*?<\/script>/gi, "")
      .replace(/<style[^>]*>[\s\S]*?<\/style>/gi, "")
      .replace(/<[^>]+>/g, " ")
      .replace(/\s+/g, " ")
      .trim();
  }
  ```
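
The `HtmlExtraction` type and the `extractLinksFromHtml` / `extractImagesFromHtml` helpers referenced above are defined elsewhere in src/tools/web-tools.ts and are not part of this excerpt. A minimal sketch, assuming they follow the same regex-based approach as `extractTextFromHtml` and return plain URL strings, might look like this:

```typescript
// Hypothetical sketch only; the actual shapes and implementations may differ.
interface HtmlExtraction {
  text?: string;
  links?: string[];
  images?: string[];
}

// Collect href values from <a> tags (assumed implementation).
function extractLinksFromHtml(html: string): string[] {
  const links: string[] = [];
  const anchorRegex = /<a\b[^>]*\bhref\s*=\s*["']([^"']+)["']/gi;
  let match: RegExpExecArray | null;
  while ((match = anchorRegex.exec(html)) !== null) {
    links.push(match[1]);
  }
  return links;
}

// Collect src values from <img> tags (assumed implementation).
function extractImagesFromHtml(html: string): string[] {
  const images: string[] = [];
  const imgRegex = /<img\b[^>]*\bsrc\s*=\s*["']([^"']+)["']/gi;
  let match: RegExpExecArray | null;
  while ((match = imgRegex.exec(html)) !== null) {
    images.push(match[1]);
  }
  return images;
}
```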
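
`wrapToolExecution` and `ERROR_CODES` are likewise imported from elsewhere in the repository and are not shown here. Purely as an assumption about its shape, a wrapper like this would run the tool body and convert any thrown error into an error result tagged with the given error code and context:

```typescript
// Hypothetical sketch; the real wrapper and error codes are defined elsewhere in the repo.
interface ToolExecutionOptions {
  errorCode: string;
  context: string;
}

async function wrapToolExecution<T>(
  fn: () => Promise<T>,
  options: ToolExecutionOptions
): Promise<T | { content: { type: "text"; text: string }[]; isError: true }> {
  try {
    // Success path: return the tool body's result unchanged.
    return await fn();
  } catch (error) {
    // Failure path: surface the error as a text content block flagged with isError.
    const message = error instanceof Error ? error.message : String(error);
    return {
      content: [{ type: "text" as const, text: `[${options.errorCode}] ${options.context}: ${message}` }],
      isError: true
    };
  }
}
```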