extract
Fetch any URL and get clean article markdown using Mozilla Readability, with a plain-text fallback. Content can optionally be truncated, up to 50,000 characters. Private addresses are blocked by default for security.
Instructions
Fetch a URL and return clean article markdown. Uses Mozilla Readability with a text fallback. Best-effort: failures return { error } instead of throwing. Private/loopback addresses blocked unless SURF_ALLOW_PRIVATE=true.
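The private-address blocking works roughly like the sketch below. This is a hypothetical illustration of the kind of check `checkUrl` performs (the real implementation lives elsewhere in `src/` and may cover more cases); the function and helper names here are made up:

```typescript
// Illustrative sketch only: reject non-HTTP(S) schemes and loopback/private
// hosts unless SURF_ALLOW_PRIVATE=true, returning an error string or null.
function isPrivateHost(host: string): boolean {
  if (host === 'localhost' || host === '::1' || host === '[::1]') return true;
  // Loopback, RFC 1918 private ranges, and link-local addresses.
  return /^(127\.|10\.|192\.168\.|169\.254\.|172\.(1[6-9]|2\d|3[01])\.)/.test(host);
}

function parseUrl(s: string): URL | null {
  try { return new URL(s); } catch { return null; }
}

function checkUrlSketch(url: string): string | null {
  const u = parseUrl(url);
  if (!u) return 'invalid url';
  if (u.protocol !== 'http:' && u.protocol !== 'https:') return 'unsupported protocol';
  if (process.env.SURF_ALLOW_PRIVATE !== 'true' && isPrivateHost(u.hostname)) {
    return 'private address blocked';
  }
  return null;
}
```

Returning an error string rather than throwing matches the tool's best-effort contract: the caller folds the string into `{ url, error }`.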
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| url | Yes | URL to fetch | |
| max_chars | No | Truncate content to this many chars (min 200, max 50,000) | 8000 |
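The `max_chars` bounds are enforced at call time. A minimal sketch of the normalization, mirroring the clamp expression in the CallToolRequestSchema handler below (non-numeric input falls back to the default, then the value is clamped to the schema's range):

```typescript
// Default taken from the handler code; the clamp keeps any input in [200, 50000].
const DEFAULT_EXTRACT_MAX_CHARS = 8_000;

function clampMaxChars(input: unknown): number {
  return Math.min(Math.max(Number(input) || DEFAULT_EXTRACT_MAX_CHARS, 200), 50_000);
}

clampMaxChars(undefined); // falls back to the 8000 default
clampMaxChars(50);        // raised to the 200 floor
clampMaxChars(99_999);    // capped at 50000
```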
Implementation Reference
- src/extract.ts:73-144 (handler): The core handler function for the 'extract' tool. It opens a page, applies Mozilla Readability for article extraction, falls back to innerText cleanup, and returns markdown (via Turndown) or plain text. It also handles SSRF checks, HTTP errors, and graceful error recovery.
```typescript
export async function extract(
  ctx: BrowserContext,
  url: string,
  maxChars = 8_000,
  navTimeoutMs = 10_000,
): Promise<ExtractResult> {
  const checkErr = checkUrl(url);
  if (checkErr) return { url, error: checkErr };
  let page: Page | null = null;
  try {
    page = await ctx.newPage();
    const resp = await page.goto(url, { waitUntil: 'domcontentloaded', timeout: navTimeoutMs });
    if (resp && resp.status() >= 400) {
      return { url, error: `http ${resp.status()}` };
    }
    // SPA settle: wait briefly for JS-rendered content
    await page.waitForTimeout(500);
    await page.addScriptTag({ path: READABILITY_PATH }).catch(() => {});
    const article = await page.evaluate(() => {
      try {
        const W = window as unknown as {
          Readability?: new (doc: Document) => { parse: () => ReadabilityOutput | null };
        };
        if (!W.Readability) return null;
        const cloned = document.cloneNode(true) as Document;
        const reader = new W.Readability(cloned);
        return reader.parse();
      } catch {
        return null;
      }
    }) as ReadabilityOutput | null;
    if (article && article.content) {
      const md = turndown.turndown(article.content).slice(0, maxChars);
      return {
        url,
        title: article.title || undefined,
        content: md,
        excerpt: (article.excerpt || article.textContent || '').slice(0, 200).trim() || undefined,
        length: md.length,
      };
    }
    const fallback = await page.evaluate((sel: string[]) => {
      sel.forEach(s => document.querySelectorAll(s).forEach(e => e.remove()));
      const main = document.querySelector('main, article, [role="main"]') || document.body;
      const text = (main as HTMLElement).innerText || '';
      const title = document.title;
      return { title, text: text.replace(/\n{3,}/g, '\n\n').trim() };
    }, NAV_SELECTORS);
    if (!fallback.text) {
      return { url, title: fallback.title || undefined, error: 'no extractable content' };
    }
    const text = fallback.text.slice(0, maxChars);
    return {
      url,
      title: fallback.title || undefined,
      content: text,
      excerpt: text.slice(0, 200),
      length: text.length,
    };
  } catch (e) {
    return { url, error: (e as Error).message.slice(0, 200) };
  } finally {
    await page?.close().catch(() => {});
  }
}
```

- src/extract.ts:16-23 (schema): The `ExtractResult` interface, the output schema for the extract tool, with fields for url, title, content, excerpt, length, and error.
```typescript
export interface ExtractResult {
  url: string;
  title?: string;
  content?: string;
  excerpt?: string;
  length?: number;
  error?: string;
}
```

- src/index.ts:211-222 (registration): Tool registration in the MCP ListToolsRequestSchema handler. Defines the 'extract' tool with name, description, and inputSchema (url required, max_chars optional).
```typescript
{
  name: 'extract',
  description: 'Fetch a URL and return clean article markdown. Uses Mozilla Readability with a text fallback. Best-effort: failures return { error } instead of throwing. Private/loopback addresses blocked unless SURF_ALLOW_PRIVATE=true.',
  inputSchema: {
    type: 'object',
    properties: {
      url: { type: 'string', description: 'URL to fetch' },
      max_chars: {
        type: 'number',
        minimum: 200,
        maximum: 50000,
        description: 'Truncate content to this many chars (default 8000)',
      },
    },
    required: ['url'],
  },
},
```

- src/index.ts:306-330 (registration): CallToolRequestSchema handler for 'extract'. Parses args (url, max_chars), calls pool.extractOne(), wraps it in a timeout, and returns the JSON result with elapsed_ms.
```typescript
if (name === 'extract') {
  const url = String(args?.url || '').trim();
  if (!url) throw new McpError(ErrorCode.InvalidParams, 'url required');
  const maxChars = Math.min(Math.max(Number(args?.max_chars) || DEFAULT_EXTRACT_MAX_CHARS, 200), 50_000);
  const t0 = Date.now();
  try {
    // extract never throws CaptchaError, no fallback needed
    const result = await trackPool(async () => {
      const p = await ensurePool();
      return await withTimeout(p.extractOne(url, maxChars), REQUEST_TIMEOUT_MS, 'extract');
    });
    const failed = !!result.error && !result.content;
    return {
      content: [{
        type: 'text',
        text: JSON.stringify({ ...result, elapsed_ms: Date.now() - t0 }, null, 2),
      }],
      ...(failed ? { isError: true } : {}),
    };
  } catch (e) {
    console.error('[google-surf-mcp] extract error:', e);
    return { content: [{ type: 'text', text: `Error: ${(e as Error).message}` }], isError: true };
  }
}
```

- src/pool.ts:81-89 (helper): SearchPool.extractOne() acquires a worker context from the pool, delegates to the extract() function, and releases the worker afterwards.
```typescript
async extractOne(url: string, maxChars: number, navTimeoutMs?: number): Promise<ExtractResult> {
  if (!this.warmed) await this.warm();
  const w = await this.acquire();
  try {
    return await extract(w.ctx, url, maxChars, navTimeoutMs);
  } finally {
    this.release(w);
  }
}
```
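The acquire/try/finally/release pattern above guarantees the worker returns to the pool even when extraction throws. A self-contained toy version of that pattern (the class and function names here are illustrative, not the real SearchPool API):

```typescript
// Toy pool demonstrating why release() belongs in a finally block:
// the worker is returned whether the work function succeeds or throws.
class ToyPool<W> {
  private free: W[];
  constructor(workers: W[]) { this.free = workers.slice(); }
  acquire(): W {
    const w = this.free.pop();
    if (!w) throw new Error('pool exhausted');
    return w;
  }
  release(w: W): void { this.free.push(w); }
  get available(): number { return this.free.length; }
}

function withWorker<W, T>(pool: ToyPool<W>, fn: (w: W) => T): T {
  const w = pool.acquire();
  try {
    return fn(w);
  } finally {
    pool.release(w); // runs on both the return path and the throw path
  }
}
```

Without the `finally`, a thrown error would leak the worker and eventually exhaust the pool.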