# scrape_html
Extract structured data (text, links, and images) from a web page by providing its URL. This tool helps automate content collection for analysis, research, or integration workflows.
## Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| url | Yes | URL of the web page to scrape; must be a valid URL. | — |
| extractText | No | Whether to extract the page's text content. | `DEFAULTS.EXTRACT_TEXT` |
| extractLinks | No | Whether to extract hyperlinks from the page. | `DEFAULTS.EXTRACT_LINKS` |
| extractImages | No | Whether to extract image references from the page. | `DEFAULTS.EXTRACT_IMAGES` |
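
For illustration, here is a minimal sketch of invoking the tool from a TypeScript MCP client. It assumes the standard `@modelcontextprotocol/sdk` client over a stdio transport; the client name, server command, and URL are placeholders, and the exact JSON shape of the result depends on the extractor implementations.

```typescript
import { Client } from "@modelcontextprotocol/sdk/client/index.js";
import { StdioClientTransport } from "@modelcontextprotocol/sdk/client/stdio.js";

// Connect to the server over stdio (command and args are placeholders).
const client = new Client(
  { name: "example-client", version: "1.0.0" },
  { capabilities: {} }
);
await client.connect(
  new StdioClientTransport({ command: "node", args: ["dist/index.js"] })
);

// Only `url` is required; omitted extract* flags fall back to the DEFAULTS constants.
const result = await client.callTool({
  name: "scrape_html",
  arguments: { url: "https://example.com", extractLinks: true }
});

// The result's content holds the extracted data as pretty-printed JSON text,
// e.g. { "links": [...] } when only link extraction is enabled.
console.log(JSON.stringify(result, null, 2));
```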
## Implementation Reference
- src/tools/web-tools.ts:120-148 (handler) — The main execution handler for the `scrape_html` tool. It fetches the HTML from the given URL, extracts the requested content (text, links, images), and returns the results as JSON wrapped in MCP content format. Error handling is delegated to `wrapToolExecution` (sketched after this list).

  ```typescript
  async ({ url, extractText, extractLinks, extractImages }) => {
    return wrapToolExecution(async () => {
      const response = await fetch(url);
      if (!response.ok) {
        throw new Error(`HTTP ${response.status}: ${response.statusText}`);
      }
      const html = await response.text();
      const results = extractHtmlContent(html, extractText, extractLinks, extractImages);
      return {
        content: [{ type: "text" as const, text: JSON.stringify(results, null, 2) }],
        metadata: { url, extracted: { text: extractText, links: extractLinks, images: extractImages } }
      };
    }, { errorCode: ERROR_CODES.HTTP_REQUEST, context: "Failed to scrape HTML" });
  }
  ```
- src/tools/web-tools.ts:114-119 (schema) — Input schema for the tool parameters, validated with Zod: a required `url`, plus optional boolean flags for extracting text, links, and images, each defaulting to a value from the `DEFAULTS` constants.

  ```typescript
  {
    url: z.string().url("Valid URL is required"),
    extractText: z.boolean().optional().default(DEFAULTS.EXTRACT_TEXT),
    extractLinks: z.boolean().optional().default(DEFAULTS.EXTRACT_LINKS),
    extractImages: z.boolean().optional().default(DEFAULTS.EXTRACT_IMAGES)
  },
  ```
- src/tools/web-tools.ts:112-150 (registration) — Registers the `scrape_html` tool with the MCP server by calling `server.tool()` with the tool name, input schema, and handler function.

  ```typescript
  function registerScrapeHtml(server: McpServer): void {
    server.tool("scrape_html",
      {
        url: z.string().url("Valid URL is required"),
        extractText: z.boolean().optional().default(DEFAULTS.EXTRACT_TEXT),
        extractLinks: z.boolean().optional().default(DEFAULTS.EXTRACT_LINKS),
        extractImages: z.boolean().optional().default(DEFAULTS.EXTRACT_IMAGES)
      },
      async ({ url, extractText, extractLinks, extractImages }) => {
        return wrapToolExecution(async () => {
          const response = await fetch(url);
          if (!response.ok) {
            throw new Error(`HTTP ${response.status}: ${response.statusText}`);
          }
          const html = await response.text();
          const results = extractHtmlContent(html, extractText, extractLinks, extractImages);
          return {
            content: [{ type: "text" as const, text: JSON.stringify(results, null, 2) }],
            metadata: { url, extracted: { text: extractText, links: extractLinks, images: extractImages } }
          };
        }, { errorCode: ERROR_CODES.HTTP_REQUEST, context: "Failed to scrape HTML" });
      }
    );
  }
  ```
- src/tools/web-tools.ts:155-176 (helper) — Core helper that orchestrates HTML content extraction based on the boolean flags, delegating to specialized extractors for text, links, and images (hypothetical sketches of those extractors follow after this list).

  ```typescript
  function extractHtmlContent(
    html: string,
    extractText: boolean,
    extractLinks: boolean,
    extractImages: boolean
  ): HtmlExtraction {
    const results: HtmlExtraction = {};

    if (extractText) {
      results.text = extractTextFromHtml(html);
    }

    if (extractLinks) {
      results.links = extractLinksFromHtml(html);
    }

    if (extractImages) {
      results.images = extractImagesFromHtml(html);
    }

    return results;
  }
  ```
- src/index.ts:66-66 (registration) — Top-level call to `registerWebTools(server)`, which in turn calls `registerScrapeHtml` to register the `scrape_html` tool.

  ```typescript
  registerWebTools(server);
  ```
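
`wrapToolExecution` and `ERROR_CODES` are defined elsewhere in the codebase and are not shown in this reference. A minimal sketch of what such a wrapper might look like, assuming it catches thrown errors and converts them into an MCP error result tagged with the given error code and context:

```typescript
// Hypothetical sketch only; the real wrapToolExecution in this codebase
// may differ in shape, typing, and error formatting.
interface WrapOptions {
  errorCode: string;
  context: string;
}

async function wrapToolExecution<T>(
  fn: () => Promise<T>,
  options: WrapOptions
): Promise<T | { content: { type: "text"; text: string }[]; isError: true }> {
  try {
    return await fn();
  } catch (error) {
    const message = error instanceof Error ? error.message : String(error);
    return {
      content: [
        { type: "text", text: `[${options.errorCode}] ${options.context}: ${message}` }
      ],
      isError: true
    };
  }
}
```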
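
The individual extractors (`extractTextFromHtml`, `extractLinksFromHtml`, `extractImagesFromHtml`) and the `HtmlExtraction` type are referenced by the helper but not shown here. The following is a hypothetical, regex-based sketch of what they could look like; the real implementations may parse the HTML differently and may return richer objects than plain strings.

```typescript
// Assumed shape, inferred from how extractHtmlContent assigns its fields;
// the real HtmlExtraction type may use richer element types.
interface HtmlExtraction {
  text?: string;
  links?: string[];
  images?: string[];
}

function extractTextFromHtml(html: string): string {
  // Drop script/style blocks, strip remaining tags, and collapse whitespace.
  return html
    .replace(/<script[\s\S]*?<\/script>/gi, "")
    .replace(/<style[\s\S]*?<\/style>/gi, "")
    .replace(/<[^>]+>/g, " ")
    .replace(/\s+/g, " ")
    .trim();
}

function extractLinksFromHtml(html: string): string[] {
  // Collect href values from anchor tags.
  return [...html.matchAll(/<a\b[^>]*href=["']([^"']*)["']/gi)].map(m => m[1]);
}

function extractImagesFromHtml(html: string): string[] {
  // Collect src values from img tags.
  return [...html.matchAll(/<img\b[^>]*src=["']([^"']*)["']/gi)].map(m => m[1]);
}
```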
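
`registerWebTools` itself is not shown in this reference. A plausible sketch, assuming it simply fans out to the per-tool registration functions in web-tools.ts:

```typescript
import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";

// Hypothetical sketch; the actual registerWebTools may register additional
// web tools alongside scrape_html.
export function registerWebTools(server: McpServer): void {
  registerScrapeHtml(server);
}
```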