get_markdown
Convert web pages to structured Markdown while preserving tables, lists, and document hierarchy for clean content extraction.
Instructions
Converts web page content to well-formatted Markdown, preserving structural elements like tables and definition lists. Recommended as the default tool for web content extraction when a clean, readable text format is needed while maintaining document structure.
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| url | Yes | URL of the web page to convert to Markdown format, supporting various HTML elements and structures. | |
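
A call to this tool supplies a single `url` argument. As a rough sketch, the TypeScript below builds an MCP `tools/call` request for `get_markdown` and checks candidate arguments against the schema in the table. The helper names `isGetMarkdownArgs` and `buildGetMarkdownRequest`, the request id, and the example URL are illustrative assumptions, not part of the server source.

```typescript
// Arguments accepted by get_markdown, per the input schema table.
interface GetMarkdownArgs {
  url: string;
}

// Hypothetical helper (not in the server source): returns true when the
// value is an object carrying a string "url" property.
function isGetMarkdownArgs(value: unknown): value is GetMarkdownArgs {
  return (
    typeof value === "object" &&
    value !== null &&
    typeof (value as { url?: unknown }).url === "string"
  );
}

// Hypothetical helper: builds a JSON-RPC 2.0 "tools/call" request object
// that targets the get_markdown tool.
function buildGetMarkdownRequest(id: number, url: string) {
  return {
    jsonrpc: "2.0",
    id,
    method: "tools/call",
    params: { name: "get_markdown", arguments: { url } },
  };
}

// Example URL is a placeholder.
const request = buildGetMarkdownRequest(1, "https://example.com/docs");
```

The request object can then be serialized with `JSON.stringify(request)` and sent over whichever transport the client uses.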
Implementation Reference
- src/index.ts:82-95 (registration) Registration of the get_markdown tool in the list of tools, including its name, description, and input schema.
```typescript
{
  name: "get_markdown",
  description: "Converts web page content to well-formatted Markdown, preserving structural elements like tables and definition lists. Recommended as the default tool for web content extraction when a clean, readable text format is needed while maintaining document structure.",
  inputSchema: {
    type: "object",
    properties: {
      url: {
        type: "string",
        description: "URL of the web page to convert to Markdown format, supporting various HTML elements and structures."
      }
    },
    required: ["url"]
  }
},
```

- src/index.ts:85-94 (schema) Input schema definition for the get_markdown tool.
```typescript
inputSchema: {
  type: "object",
  properties: {
    url: {
      type: "string",
      description: "URL of the web page to convert to Markdown format, supporting various HTML elements and structures."
    }
  },
  required: ["url"]
}
```

- src/index.ts:143-150 (handler) Handler logic for the get_markdown tool within the CallToolRequestSchema handler. Fetches rendered HTML and converts it to Markdown using getMarkdownStringFromHtmlByNHM.
```typescript
case "get_markdown": {
  return {
    content: [{
      type: "text",
      text: (await getMarkdownStringFromHtmlByNHM(url))
    }]
  };
}
```

- src/index.ts:288-341 (helper) Core helper function that converts fetched HTML to Markdown using the NodeHtmlMarkdown library, with custom translators for dl, dt, and dd elements and optional main content filtering.
```typescript
export async function getMarkdownStringFromHtmlByNHM(
  request_url: string,
  mainOnly: boolean = false,
) {
  const htmlString = await getHtmlString(request_url);

  const customTranslators: TranslatorConfigObject = {
    dl: () => ({
      preserveWhitespace: false,
      surroundingNewlines: true,
    }),
    dt: () => ({
      prefix: '**',
      postfix: ':** ',
      surroundingNewlines: false,
    }),
    dd: () => ({
      postfix: '\n',
      surroundingNewlines: false,
    }),
    Head: () => ({
      postfix: '\n',
      ignore: false,
      postprocess: (ctx) => {
        const titleNode = ctx.node.querySelector('title');
        if (titleNode) {
          return titleNode.textContent || '';
        }
        return '';
      },
      surroundingNewlines: true,
    }),
  };

  if (mainOnly) {
    customTranslators.Header = () => ({
      ignore: true,
    });
    customTranslators.Footer = () => ({
      ignore: true,
    });
    customTranslators.Nav = () => ({
      ignore: true,
    });
  }

  const markdownString = NodeHtmlMarkdown.translate(
    htmlString,
    {},
    customTranslators,
  );
  return markdownString;
}
```

- src/index.ts:174-210 (helper) Supporting helper that launches a headless Chromium browser, navigates to the URL, and returns the page HTML once the `domcontentloaded` event has fired. The result is passed to the Markdown converter. On failure it logs the error and returns an empty string.
```typescript
async function getHtmlString(request_url: string): Promise<string> {
  let browser: Browser | null = null;
  let page: Page | null = null;
  try {
    browser = await chromium.launch({
      headless: true,
      // args: ['--single-process'],
    });
    const context = await browser.newContext();
    page = await context.newPage();
    await page.goto(request_url, {
      waitUntil: 'domcontentloaded',
      timeout: TIMEOUT,
    });
    const htmlString = await page.content();
    return htmlString;
  } catch (error) {
    console.error(`Failed to fetch HTML for ${request_url}:`, error);
    return "";
  } finally {
    if (page) {
      try {
        await page.close();
      } catch (e) {
        console.error("Error closing page:", e);
      }
    }
    if (browser) {
      try {
        await browser.close();
      } catch (error) {
        console.error('Error closing browser:', error);
      }
    }
  }
}
```
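
The custom `dt` and `dd` translators in the core helper give each term a `**` prefix and a `:** ` postfix, and end each description with a newline. The sketch below imitates that output shape in plain TypeScript without calling NodeHtmlMarkdown. `Definition` and `renderDefinitionList` are hypothetical names introduced for illustration, and the real library's whitespace handling around the list may differ.

```typescript
// One term/description pair, standing in for a <dt>/<dd> pair in the HTML.
interface Definition {
  term: string;
  description: string;
}

// Hypothetical helper: mirrors the prefix/postfix settings of the dt and dd
// translators ("**" + term + ":** " + description + "\n").
function renderDefinitionList(items: Definition[]): string {
  return items
    .map(({ term, description }) => `**${term}:** ${description}\n`)
    .join("");
}

const sample = renderDefinitionList([
  { term: "url", description: "Page address" },
  { term: "timeout", description: "Milliseconds to wait" },
]);
// sample === "**url:** Page address\n**timeout:** Milliseconds to wait\n"
```

Rendering terms in bold with a trailing colon keeps each definition on one readable line, since Markdown has no native definition-list syntax.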