scrape-to-markdown

scrape-to-markdown

Extract meaningful content from websites and convert HTML to high-quality Markdown using Mozilla's Readability engine.

Input Schema

TableJSON Schema

Name	Required	Description	Default
`url`	Yes

Implementation Reference

src/index.ts:74-89 (handler)
The inline handler function for the "scrape-to-markdown" tool that invokes scrapeToMarkdown and returns the markdown content or an error response.
async ({ url }) => { try { const markdown = await scrapeToMarkdown(url); // Return the markdown as the tool result return { content: [{ type: "text", text: markdown }] }; } catch (error: any) { // Handle errors gracefully return { content: [{ type: "text", text: `Error: ${error.message}` }], isError: true }; } }
src/index.ts:73-73 (schema)
Zod schema defining the input parameter: a required valid URL string.
{ url: z.string().url() },
src/index.ts:71-90 (registration)
Registration of the "scrape-to-markdown" tool on the MCP server using server.tool() with name, schema, and handler.
server.tool( "scrape-to-markdown", { url: z.string().url() }, async ({ url }) => { try { const markdown = await scrapeToMarkdown(url); // Return the markdown as the tool result return { content: [{ type: "text", text: markdown }] }; } catch (error: any) { // Handle errors gracefully return { content: [{ type: "text", text: `Error: ${error.message}` }], isError: true }; } } );
src/index.ts:15-59 (helper)
Main helper function that performs the web scraping: fetches HTML, extracts article with Readability/JSDOM, converts to markdown, and adds code block language hints.
export async function scrapeToMarkdown(url: string): Promise<string> { try { // Fetch the HTML content from the provided URL with proper headers const response = await fetch(url, { headers: { 'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36', 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8', 'Accept-Language': 'en-US,en;q=0.5', } }); if (!response.ok) { throw new Error(`Failed to fetch URL: ${response.status}`); } // Get content type to check encoding const contentType = response.headers.get('content-type') || ''; const htmlContent = await response.text(); // Parse the HTML using JSDOM with the URL to resolve relative links const dom = new JSDOM(htmlContent, { url, pretendToBeVisual: true, // This helps with some interactive content }); // Extract the main content using Readability const reader = new Readability(dom.window.document); const article = reader.parse(); if (!article || !article.content) { throw new Error("Failed to parse article content"); } // Convert the cleaned article HTML to Markdown using htmlToMarkdown let markdown = htmlToMarkdown(article.content); // Simple post-processing to improve code blocks with language hints markdown = markdown.replace(/```\n(class|function|import|const|let|var|if|for|while)/g, '```javascript\n$1'); markdown = markdown.replace(/```\n(def|class|import|from|with|if|for|while)(\s+)/g, '```python\n$1$2'); return markdown; } catch (error: any) { throw new Error(`Scraping error: ${error.message}`); } }
src/data_processing.ts:9-19 (helper)
Supporting utility to convert cleaned HTML to Markdown using TurndownService, removing script tags first.
export function htmlToMarkdown(html: string): string { // Remove script tags and their content before conversion const cleanHtml = html.replace(/<script\b[^<]*(?:(?!<\/script>)<[^<]*)*<\/script>/gi, ''); const turndownService = new TurndownService({ codeBlockStyle: 'fenced', emDelimiter: '_' }); return turndownService.turndown(cleanHtml); }

Website Scraper MCP Server

Input Schema

Implementation Reference

Other Tools

Latest Blog Posts

MCP directory API