parse
Extract webpage content into clean, LLM-optimized Markdown by removing ads, navigation, and non-essential elements. Retrieve article title, main content, excerpt, byline, and site name using Mozilla's Readability algorithm.
Instructions
Extracts and transforms webpage content into clean, LLM-optimized Markdown. Returns article title, main content, excerpt, byline and site name. Uses Mozilla's Readability algorithm to remove ads, navigation, footers and non-essential elements while preserving the core content structure.
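
As a rough illustration of how a client might invoke this tool, the sketch below builds the JSON-RPC 2.0 `tools/call` message that MCP uses for tool invocation. The helper name `buildParseRequest` and the example URL are hypothetical and not part of this server.

```javascript
// Hypothetical helper: builds the JSON-RPC message an MCP client would send
// to call the `parse` tool. Only the tool name and the `url` argument come
// from this page; the function name and sample URL are illustrative.
function buildParseRequest(id, url) {
  return {
    jsonrpc: "2.0",
    id,
    method: "tools/call",
    params: {
      name: "parse",
      arguments: { url },
    },
  };
}

const request = buildParseRequest(1, "https://example.com/article");
console.log(JSON.stringify(request, null, 2));
```

In practice an MCP client library would normally construct and send this message for you; the sketch only shows the shape of the payload.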
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| url | Yes | The website URL to parse | |
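
A caller may want to check arguments before sending them. The following is a minimal sketch of such a check, assuming the schema above (a required string `url`). The function name `validateParseArgs` is hypothetical, and the restriction to `http:` and `https:` is an added assumption, not something the schema itself states.

```javascript
// Hypothetical client-side check mirroring the input schema:
// `url` is required and must be a string.
function validateParseArgs(args) {
  if (args === null || typeof args !== "object") {
    return { ok: false, reason: "arguments must be an object" };
  }
  if (typeof args.url !== "string" || args.url.length === 0) {
    return { ok: false, reason: "url is required and must be a non-empty string" };
  }
  let parsed;
  try {
    // The global URL constructor throws on strings that are not absolute URLs.
    parsed = new URL(args.url);
  } catch {
    return { ok: false, reason: "url is not a valid absolute URL" };
  }
  // Assumption: only web URLs are useful to a page parser.
  if (parsed.protocol !== "http:" && parsed.protocol !== "https:") {
    return { ok: false, reason: "url must use http or https" };
  }
  return { ok: true };
}

console.log(validateParseArgs({ url: "https://example.com/post" }));
console.log(validateParseArgs({}));
```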
Implementation Reference
- dist/index.js:16-50 (handler) Core handler function that fetches the webpage, parses it using Readability, extracts main content, converts to Markdown, and returns structured article data.

      async fetchAndParse(url) {
        try {
          // Fetch the webpage
          const response = await axios.get(url, {
            headers: {
              'User-Agent': 'Mozilla/5.0 (compatible; MCPBot/1.0)'
            }
          });
          // Create a DOM from the HTML
          const dom = new JSDOM(response.data, { url });
          const document = dom.window.document;
          // Use Readability to extract main content
          const reader = new Readability(document);
          const article = reader.parse();
          if (!article) {
            throw new Error('Failed to parse content');
          }
          // Convert HTML to Markdown
          const markdown = turndownService.turndown(article.content);
          return {
            title: article.title,
            content: markdown,
            excerpt: article.excerpt,
            byline: article.byline,
            siteName: article.siteName
          };
        } catch (error) {
          throw new Error(`Failed to fetch or parse content: ${error.message}`);
        }
      }

- dist/index.js:82-119 (handler) MCP tool call handler that validates the input, runs the parse logic via WebsiteParser, formats the output as an MCP content block, and handles errors.

      server.setRequestHandler(CallToolRequestSchema, async (request) => {
        const { name, arguments: args } = request.params;
        if (name !== "parse") {
          throw new McpError(ErrorCode.MethodNotFound, `Unknown tool: ${name}`);
        }
        if (!args?.url) {
          throw new McpError(ErrorCode.InvalidParams, "URL is required");
        }
        try {
          const result = await parser.fetchAndParse(args.url);
          return {
            content: [{
              type: "text",
              text: JSON.stringify({
                title: result.title,
                content: result.content,
                metadata: {
                  excerpt: result.excerpt,
                  byline: result.byline,
                  siteName: result.siteName
                }
              }, null, 2)
            }]
          };
        } catch (error) {
          return {
            isError: true,
            content: [{ type: "text", text: `Error: ${error.message}` }]
          };
        }
      });

- dist/index.js:64-79 (registration) Registers the 'parse' tool with the MCP server by defining it in the listTools response, including its name, description, and input schema.

      server.setRequestHandler(ListToolsRequestSchema, async () => ({
        tools: [{
          name: "parse",
          description: "Extracts and transforms webpage content into clean, LLM-optimized Markdown. Returns article title, main content, excerpt, byline and site name. Uses Mozilla's Readability algorithm to remove ads, navigation, footers and non-essential elements while preserving the core content structure.",
          inputSchema: {
            type: "object",
            properties: {
              url: {
                type: "string",
                description: "The website URL to parse"
              }
            },
            required: ["url"]
          }
        }]
      }));

- dist/index.js:68-77 (schema) Input schema for the 'parse' tool: requires a 'url' string.

      inputSchema: {
        type: "object",
        properties: {
          url: {
            type: "string",
            description: "The website URL to parse"
          }
        },
        required: ["url"]
      }

- dist/index.js:15-51 (helper) The WebsiteParser helper class, whose fetchAndParse method is used by the tool handler.

      class WebsiteParser {
        async fetchAndParse(url) {
          try {
            // Fetch the webpage
            const response = await axios.get(url, {
              headers: {
                'User-Agent': 'Mozilla/5.0 (compatible; MCPBot/1.0)'
              }
            });
            // Create a DOM from the HTML
            const dom = new JSDOM(response.data, { url });
            const document = dom.window.document;
            // Use Readability to extract main content
            const reader = new Readability(document);
            const article = reader.parse();
            if (!article) {
              throw new Error('Failed to parse content');
            }
            // Convert HTML to Markdown
            const markdown = turndownService.turndown(article.content);
            return {
              title: article.title,
              content: markdown,
              excerpt: article.excerpt,
              byline: article.byline,
              siteName: article.siteName
            };
          } catch (error) {
            throw new Error(`Failed to fetch or parse content: ${error.message}`);
          }
        }
      }
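
The tool call handler returns its payload as a JSON string inside a single text content block, and signals failures with `isError: true` rather than a thrown error. A client therefore has to unwrap the result itself. The sketch below shows one way to do that; the function name `unwrapParseResult` and the sample result object are hypothetical, made up only to exercise the code.

```javascript
// Hypothetical client-side helper: turns the tool result produced by the
// call handler into a flat object, or throws if the tool reported an error.
function unwrapParseResult(result) {
  const text = result?.content?.[0]?.text ?? "";
  if (result?.isError) {
    // On failure the handler returns the message as plain text ("Error: ...").
    throw new Error(text || "parse tool returned an error");
  }
  // On success the text block is a JSON document: { title, content, metadata }.
  const payload = JSON.parse(text);
  return {
    title: payload.title,
    markdown: payload.content,
    excerpt: payload.metadata?.excerpt,
    byline: payload.metadata?.byline,
    siteName: payload.metadata?.siteName,
  };
}

// Illustrative result, shaped like the handler's success response.
const sampleResult = {
  content: [{
    type: "text",
    text: JSON.stringify({
      title: "Sample title",
      content: "# Sample title\n\nBody text.",
      metadata: { excerpt: "Body text.", byline: null, siteName: "Example" },
    }, null, 2),
  }],
};

const article = unwrapParseResult(sampleResult);
console.log(article.title, article.siteName);
```

Flattening `metadata` into the returned object is a design choice of this sketch; callers that prefer the original nesting can return `payload` directly.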