get_markdown_summary
Extracts main content from web pages and converts it to clean Markdown format, removing navigation menus and peripheral elements for focused reading.
Instructions
Extracts and converts the main content area of a web page to Markdown format, automatically removing navigation menus, headers, footers, and other peripheral content. Perfect for capturing the core content of articles, blog posts, or documentation pages.
Input Schema
TableJSON Schema
| Name | Required | Description | Default |
|---|---|---|---|
| url | Yes | URL of the web page whose main content should be extracted and converted to Markdown. |
Implementation Reference
- src/index.ts:151-158 (handler)Handler for the 'get_markdown_summary' tool. It calls getMarkdownStringFromHtmlByTD with mainOnly=true to get a markdown summary of the main content.case "get_markdown_summary": { return { content: [{ type: "text", text: (await getMarkdownStringFromHtmlByTD(url, true)) }] }; }
- src/index.ts:99-108 (schema)Input schema for the get_markdown_summary tool, requiring a single 'url' string parameter.inputSchema: { type: "object", properties: { url: { type: "string", description: "URL of the web page whose main content should be extracted and converted to Markdown." } }, required: ["url"] }
- src/index.ts:96-109 (registration)Registration of the 'get_markdown_summary' tool in the ListTools response, including name, description, and input schema.{ name: "get_markdown_summary", description: "Extracts and converts the main content area of a web page to Markdown format, automatically removing navigation menus, headers, footers, and other peripheral content. Perfect for capturing the core content of articles, blog posts, or documentation pages.", inputSchema: { type: "object", properties: { url: { type: "string", description: "URL of the web page whose main content should be extracted and converted to Markdown." } }, required: ["url"] } },
- src/index.ts:212-285 (helper)Primary helper implementing Markdown conversion from HTML using Turndown library. Supports mainOnly mode to exclude headers/footers/nav, and custom rules for tables and definition lists. Called by the handler with mainOnly=true.// Helper method to convert HTML to Markdown using Turndown with custom rules for tables and definition lists export async function getMarkdownStringFromHtmlByTD( request_url: string, mainOnly: boolean = false, ) { const htmlString = await getHtmlString(request_url); const turndownService = new Turndown({ headingStyle: 'atx' }); turndownService.remove('script'); turndownService.remove('style'); if (mainOnly) { turndownService.remove('header'); turndownService.remove('footer'); turndownService.remove('nav'); } turndownService.addRule('table', { filter: 'table', // eslint-disable-next-line @typescript-eslint/no-unused-vars replacement: function (content, node, _options) { // Process each row in the table const rows = Array.from(node.querySelectorAll('tr')); if (rows.length === 0) { return ''; } const headerRow = rows[0]; const headerCells = Array.from( headerRow.querySelectorAll('th, td'), ).map((cell) => cell.textContent?.trim() || ''); const separator = headerCells.map(() => '---').join('|'); // Header row and separator line let markdown = `\n| ${headerCells.join(' | ')} |\n|${separator}|`; // Process remaining rows for (let i = 1; i < rows.length; i++) { const row = rows[i]; const rowCells = Array.from(row.querySelectorAll('th, td')).map( (cell) => cell.textContent?.trim() || '', ); markdown += `\n| ${rowCells.join(' | ')} |`; } return markdown + '\n'; }, }); turndownService.addRule('dl', { filter: 'dl', // eslint-disable-next-line @typescript-eslint/no-unused-vars replacement: function (content, node, _options) { let markdown = '\n\n'; const items = Array.from(node.children); let currentDt: string = ''; items.forEach((item) => { if (item.tagName === 'DT') { currentDt = item.textContent?.trim() || ''; if (currentDt) { markdown += `**${currentDt}:**`; } } else if (item.tagName === 'DD') { const ddContent = item.textContent?.trim() || ''; if (ddContent) { markdown += ` ${ddContent}\n`; } } }); return markdown + '\n'; }, }); const markdownString = turndownService.turndown(htmlString); return markdownString; }
- src/index.ts:174-210 (helper)Helper function to fetch fully rendered HTML content from a URL using Playwright Chromium headless browser, essential for dynamic content.async function getHtmlString(request_url: string): Promise<string> { let browser: Browser | null = null; let page: Page | null = null; try { browser = await chromium.launch({ headless: true, // args: ['--single-process'], }); const context = await browser.newContext(); page = await context.newPage(); await page.goto(request_url, { waitUntil: 'domcontentloaded', timeout: TIMEOUT, }); const htmlString = await page.content(); return htmlString; } catch (error) { console.error(`Failed to fetch HTML for ${request_url}:`, error); return ""; } finally { if (page) { try { await page.close(); } catch (e) { console.error("Error closing page:", e); } } if (browser) { try { await browser.close(); } catch (error) { console.error('Error closing browser:', error); } } } }