get_markdown
Convert web pages to structured Markdown while preserving tables, lists, and document hierarchy for clean content extraction.
Instructions
Converts web page content to well-formatted Markdown, preserving structural elements like tables and definition lists. Recommended as the default tool for web content extraction when a clean, readable text format is needed while maintaining document structure.
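To make the definition-list handling concrete: the `dt` and `dd` translators listed under Implementation Reference wrap each term as `**term:** ` and end each definition with a newline. The sketch below reproduces that output shape with plain string handling only. It does not call NodeHtmlMarkdown; the `DefinitionEntry` type and `renderDefinitionList` function are hypothetical names introduced for illustration, and the library's exact whitespace may differ.

```typescript
// Hypothetical stand-in for one <dt>/<dd> pair inside a <dl>.
interface DefinitionEntry {
  term: string;
  definition: string;
}

// Mirrors the translator settings: dt gets prefix "**" and postfix ":** ",
// dd gets postfix "\n". This is an illustration, not the library's code path.
function renderDefinitionList(entries: DefinitionEntry[]): string {
  return entries
    .map(({ term, definition }) => `**${term}:** ${definition}\n`)
    .join("");
}

const rendered = renderDefinitionList([
  { term: "url", definition: "Address of the page to convert" },
  { term: "mainOnly", definition: "Skip header, footer, and nav elements" },
]);
console.log(rendered);
```

Each term ends up in bold on the same line as its definition, which keeps the pairing readable in plain Markdown, a format with no native definition-list syntax.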
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| url | Yes | URL of the web page to convert to Markdown format, supporting various HTML elements and structures. | |
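As a rough illustration of how a client might supply this argument, the sketch below builds a JSON-RPC `tools/call` request in the shape used by the Model Context Protocol. The helper name `buildGetMarkdownCall`, the `id` value, the example URL, and the http(s)-only check are assumptions made for this example; the server code shown on this page does not perform that check.

```typescript
// Shape of a JSON-RPC "tools/call" request targeting get_markdown.
interface GetMarkdownCall {
  jsonrpc: "2.0";
  id: number;
  method: "tools/call";
  params: { name: "get_markdown"; arguments: { url: string } };
}

// Hypothetical helper: rejects anything that does not look like an http(s) URL.
// The restriction to http(s) is an assumption, not something the tool enforces.
function buildGetMarkdownCall(url: string, id: number = 1): GetMarkdownCall {
  if (!/^https?:\/\/\S+$/i.test(url)) {
    throw new Error(`Expected an http(s) URL, got: ${url}`);
  }
  return {
    jsonrpc: "2.0",
    id,
    method: "tools/call",
    params: { name: "get_markdown", arguments: { url } },
  };
}

const request = buildGetMarkdownCall("https://example.com/docs");
console.log(JSON.stringify(request, null, 2));
```

Since `url` is the only required property in the schema, the `arguments` object needs nothing else.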
Implementation Reference
- src/index.ts:82-95 (registration) Registration of the get_markdown tool in the list of tools, including its name, description, and input schema.

  ```typescript
  {
    name: "get_markdown",
    description:
      "Converts web page content to well-formatted Markdown, preserving structural elements like tables and definition lists. Recommended as the default tool for web content extraction when a clean, readable text format is needed while maintaining document structure.",
    inputSchema: {
      type: "object",
      properties: {
        url: {
          type: "string",
          description:
            "URL of the web page to convert to Markdown format, supporting various HTML elements and structures."
        }
      },
      required: ["url"]
    }
  },
  ```
- src/index.ts:85-94 (schema) Input schema definition for the get_markdown tool.

  ```typescript
  inputSchema: {
    type: "object",
    properties: {
      url: {
        type: "string",
        description:
          "URL of the web page to convert to Markdown format, supporting various HTML elements and structures."
      }
    },
    required: ["url"]
  }
  ```
- src/index.ts:143-150 (handler) Handler logic for the get_markdown tool within the CallToolRequestSchema handler. Fetches rendered HTML and converts it to Markdown using getMarkdownStringFromHtmlByNHM.

  ```typescript
  case "get_markdown": {
    return {
      content: [{
        type: "text",
        text: (await getMarkdownStringFromHtmlByNHM(url))
      }]
    };
  }
  ```
- src/index.ts:288-341 (helper) Core helper function that converts fetched HTML to Markdown using the NodeHtmlMarkdown library, with custom translators for dl, dt, and dd elements and optional main-content filtering.

  ```typescript
  export async function getMarkdownStringFromHtmlByNHM(
    request_url: string,
    mainOnly: boolean = false,
  ) {
    const htmlString = await getHtmlString(request_url);
    const customTranslators: TranslatorConfigObject = {
      dl: () => ({
        preserveWhitespace: false,
        surroundingNewlines: true,
      }),
      dt: () => ({
        prefix: '**',
        postfix: ':** ',
        surroundingNewlines: false,
      }),
      dd: () => ({
        postfix: '\n',
        surroundingNewlines: false,
      }),
      Head: () => ({
        postfix: '\n',
        ignore: false,
        postprocess: (ctx) => {
          const titleNode = ctx.node.querySelector('title');
          if (titleNode) {
            return titleNode.textContent || '';
          }
          return '';
        },
        surroundingNewlines: true,
      }),
    };
    if (mainOnly) {
      customTranslators.Header = () => ({
        ignore: true,
      });
      customTranslators.Footer = () => ({
        ignore: true,
      });
      customTranslators.Nav = () => ({
        ignore: true,
      });
    }
    const markdownString = NodeHtmlMarkdown.translate(
      htmlString,
      {},
      customTranslators,
    );
    return markdownString;
  }
  ```
- src/index.ts:174-210 (helper) Supporting helper that launches a headless Chromium browser, navigates to the URL, and returns the page HTML once the `domcontentloaded` event has fired; content injected by scripts that run later may not be included. The result is passed to the Markdown converter. On failure the helper logs the error and returns an empty string.

  ```typescript
  async function getHtmlString(request_url: string): Promise<string> {
    let browser: Browser | null = null;
    let page: Page | null = null;
    try {
      browser = await chromium.launch({
        headless: true,
        // args: ['--single-process'],
      });
      const context = await browser.newContext();
      page = await context.newPage();
      await page.goto(request_url, {
        waitUntil: 'domcontentloaded',
        timeout: TIMEOUT,
      });
      const htmlString = await page.content();
      return htmlString;
    } catch (error) {
      console.error(`Failed to fetch HTML for ${request_url}:`, error);
      return "";
    } finally {
      if (page) {
        try {
          await page.close();
        } catch (e) {
          console.error("Error closing page:", e);
        }
      }
      if (browser) {
        try {
          await browser.close();
        } catch (error) {
          console.error('Error closing browser:', error);
        }
      }
    }
  }
  ```
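Because `getHtmlString` returns an empty string when navigation fails, the empty string is what reaches the Markdown converter, and a caller cannot easily tell a failed fetch from a blank page. One possible guard is sketched below, on the assumption that an empty or whitespace-only result should be treated as a failure. `requireHtml` is a hypothetical helper and is not part of the source.

```typescript
// Hypothetical guard: turns the "empty string on failure" convention
// into an explicit error that a tool handler could report to the client.
function requireHtml(html: string, requestUrl: string): string {
  if (html.trim().length === 0) {
    throw new Error(`No HTML was returned for ${requestUrl}`);
  }
  return html;
}

// Usage: a non-empty document passes through unchanged,
// while an empty result raises an error.
const passedThrough = requireHtml("<p>ok</p>", "https://example.com");
console.log(passedThrough);

try {
  requireHtml("", "https://example.com/unreachable");
} catch (error) {
  console.log((error as Error).message);
}
```

Whether to raise an error or return an explanatory text result is a design choice. The handler shown above does neither and forwards whatever the converter produces.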