get_markdown
Converts web page content into clean, structured Markdown format, preserving elements like tables and lists for readability and document consistency.
Instructions
Converts web page content to well-formatted Markdown, preserving structural elements like tables and definition lists. Recommended as the default tool for web content extraction when a clean, readable text format is needed while maintaining document structure.
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| url | Yes | URL of the web page to convert to Markdown format, supporting various HTML elements and structures. | |
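As a rough illustration of how a client might satisfy this schema, the sketch below validates the single required `url` argument and wraps it in a `tools/call` request. It assumes the server is reached over the Model Context Protocol's JSON-RPC framing, which the tool registration format suggests but this page does not state. The helper names `parseGetMarkdownArgs` and `buildGetMarkdownRequest` are hypothetical and not part of the server.

```typescript
// Argument shape taken from the input schema: one required string, `url`.
interface GetMarkdownArgs {
  url: string;
}

// Hypothetical client-side helper (not part of the server):
// rejects input that does not match the schema.
function parseGetMarkdownArgs(input: unknown): GetMarkdownArgs {
  if (typeof input !== "object" || input === null) {
    throw new TypeError("arguments must be an object");
  }
  const url = (input as Record<string, unknown>).url;
  if (typeof url !== "string") {
    throw new TypeError("'url' is required and must be a string");
  }
  return { url };
}

// Hypothetical helper: builds a JSON-RPC 2.0 request for the tool,
// assuming the MCP `tools/call` method.
function buildGetMarkdownRequest(id: number, args: GetMarkdownArgs) {
  return {
    jsonrpc: "2.0" as const,
    id,
    method: "tools/call" as const,
    params: { name: "get_markdown" as const, arguments: args },
  };
}

const request = buildGetMarkdownRequest(
  1,
  parseGetMarkdownArgs({ url: "https://example.com" }),
);
console.log(JSON.stringify(request));
```

The check only enforces what the schema declares (a string named `url`). Whether the string is a reachable URL is left to the server.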
Implementation Reference
- src/index.ts:82-94 (registration): Registers the 'get_markdown' tool with its description and input schema (requires 'url' string).

  ```typescript
  {
    name: "get_markdown",
    description: "Converts web page content to well-formatted Markdown, preserving structural elements like tables and definition lists. Recommended as the default tool for web content extraction when a clean, readable text format is needed while maintaining document structure.",
    inputSchema: {
      type: "object",
      properties: {
        url: {
          type: "string",
          description: "URL of the web page to convert to Markdown format, supporting various HTML elements and structures."
        }
      },
      required: ["url"]
    }
  }
  ```
- src/index.ts:143-149 (handler): Dispatcher handler for the 'get_markdown' tool that calls the Markdown conversion function and returns the result as text content.

  ```typescript
  case "get_markdown": {
    return {
      content: [{
        type: "text",
        text: (await getMarkdownStringFromHtmlByNHM(url))
      }]
    };
  }
  ```
- src/index.ts:288-341 (handler): Core handler function that fetches HTML using a headless browser and converts it to Markdown using NodeHtmlMarkdown with custom translators for definition lists (dl/dt/dd) and head elements.

  ```typescript
  export async function getMarkdownStringFromHtmlByNHM(
    request_url: string,
    mainOnly: boolean = false,
  ) {
    const htmlString = await getHtmlString(request_url);
    const customTranslators: TranslatorConfigObject = {
      dl: () => ({
        preserveWhitespace: false,
        surroundingNewlines: true,
      }),
      dt: () => ({
        prefix: '**',
        postfix: ':** ',
        surroundingNewlines: false,
      }),
      dd: () => ({
        postfix: '\n',
        surroundingNewlines: false,
      }),
      Head: () => ({
        postfix: '\n',
        ignore: false,
        postprocess: (ctx) => {
          const titleNode = ctx.node.querySelector('title');
          if (titleNode) {
            return titleNode.textContent || '';
          }
          return '';
        },
        surroundingNewlines: true,
      }),
    };
    if (mainOnly) {
      customTranslators.Header = () => ({
        ignore: true,
      });
      customTranslators.Footer = () => ({
        ignore: true,
      });
      customTranslators.Nav = () => ({
        ignore: true,
      });
    }
    const markdownString = NodeHtmlMarkdown.translate(
      htmlString,
      {},
      customTranslators,
    );
    return markdownString;
  }
  ```
- src/index.ts:174-210 (helper): Helper function that fetches a page's HTML from a URL using a headless Playwright Chromium browser, waiting for the 'domcontentloaded' event. Returns an empty string if the fetch fails.

  ```typescript
  async function getHtmlString(request_url: string): Promise<string> {
    let browser: Browser | null = null;
    let page: Page | null = null;
    try {
      browser = await chromium.launch({
        headless: true,
        // args: ['--single-process'],
      });
      const context = await browser.newContext();
      page = await context.newPage();
      await page.goto(request_url, {
        waitUntil: 'domcontentloaded',
        timeout: TIMEOUT,
      });
      const htmlString = await page.content();
      return htmlString;
    } catch (error) {
      console.error(`Failed to fetch HTML for ${request_url}:`, error);
      return "";
    } finally {
      if (page) {
        try {
          await page.close();
        } catch (e) {
          console.error("Error closing page:", e);
        }
      }
      if (browser) {
        try {
          await browser.close();
        } catch (error) {
          console.error('Error closing browser:', error);
        }
      }
    }
  }
  ```
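Because `getHtmlString` logs the error and returns an empty string rather than throwing, a failed fetch presumably reaches the caller as a text result with little or no content, not as an error. The sketch below shows one way a caller might read the result shape returned by the dispatcher and treat a blank result as a failure. The `ToolResult` type and the `extractMarkdown` helper are hypothetical illustrations, not part of the server.

```typescript
// Result shape mirroring what the dispatcher case returns:
// an array of content parts, each with a `type` and (for text) a `text` field.
interface ToolResult {
  content: Array<{ type: string; text?: string }>;
}

// Hypothetical caller-side helper (not part of the server):
// joins all text parts and returns null when nothing usable came back.
function extractMarkdown(result: ToolResult): string | null {
  const text = result.content
    .filter((part) => part.type === "text" && typeof part.text === "string")
    .map((part) => part.text as string)
    .join("\n")
    .trim();
  return text.length > 0 ? text : null;
}

const ok = extractMarkdown({ content: [{ type: "text", text: "# Title" }] });
const failed = extractMarkdown({ content: [{ type: "text", text: "" }] });
console.log(ok, failed);
```

Returning `null` for a blank result lets the caller decide whether to retry or report the failure, since the tool result itself does not distinguish an empty page from a failed fetch.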