scrape
Extract specific content from web pages using CSS selectors, with JavaScript rendering options for dynamic sites.
Instructions
Intelligently extracts content from a specific URL.
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| url | Yes | URL of the page to scrape | |
| selector | No | CSS selector to extract a specific element | |
| javascript | No | Render JavaScript before extracting | false |
| timeout | No | Timeout in milliseconds | 30000 |
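
As a concrete reference, a fully populated argument object might look like the following sketch (values are illustrative, not from the source):

```typescript
// Hypothetical arguments for the scrape tool; values are illustrative.
const args = {
  url: "https://example.com/article",  // required
  selector: "article .post-body",      // optional: narrow extraction to one element
  javascript: true,                    // optional: render with a headless browser first (default false)
  timeout: 15000,                      // optional: navigation budget in ms (default 30000)
};
```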
Implementation Reference
- src/tools/scrape.ts:23-78 (handler) — Core handler function for the 'scrape' tool. Handles caching, fetches/scrapes HTML using the scrapePage helper, extracts structured content, applies the optional CSS selector, caches results, and returns a ScrapeResult object.

```typescript
export async function scrape(params: ScrapeParams): Promise<ScrapeResult> {
  const { url, selector, javascript = false, timeout = 30000 } = params;

  const cached = getFromCache(url);
  if (cached && !selector) {
    return {
      url,
      title: cached.title,
      content: cached.content,
      markdown: cached.markdown,
      fromCache: true,
      timestamp: new Date(cached.cached_at).toISOString(),
    };
  }

  try {
    const { html } = await scrapePage(url, { javascript, timeout });
    const extracted = await extractContent(html, url);
    if (!extracted) {
      throw new Error("Failed to extract content");
    }

    if (!cached) {
      saveToCache(url, {
        content: extracted.textContent,
        markdown: extracted.markdown,
        title: extracted.title,
      });
    }

    let selectedContent: string | undefined;
    if (selector) {
      try {
        const dom = new JSDOM(html, { url });
        const element = dom.window.document.querySelector(selector);
        selectedContent = element ? element.textContent || undefined : undefined;
      } catch (e) {
        console.error("Selector error:", e);
      }
    }

    return {
      url,
      title: extracted.title,
      content: extracted.textContent,
      markdown: extracted.markdown,
      selectedContent,
      fromCache: false,
      timestamp: new Date().toISOString(),
    };
  } catch (error) {
    console.error("Scrape error:", error);
    throw error;
  }
}
```
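
The cache helpers `getFromCache` and `saveToCache` are not part of this excerpt. A minimal in-memory sketch of the interface the handler appears to rely on follows; this is an assumption, and the real implementation may be backed by disk or a database:

```typescript
// Sketch of the cache layer the handler expects: getFromCache returns an
// entry with content/markdown/title plus a cached_at timestamp that the
// handler reads via new Date(cached.cached_at).
interface CacheEntry {
  content: string;
  markdown: string;
  title: string;
  cached_at: number; // epoch milliseconds
}

const cache = new Map<string, CacheEntry>();

export function getFromCache(url: string): CacheEntry | undefined {
  return cache.get(url);
}

export function saveToCache(
  url: string,
  entry: { content: string; markdown: string; title: string }
): void {
  cache.set(url, { ...entry, cached_at: Date.now() });
}
```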
- src/index.ts:82-109 (registration) — Registers the 'scrape' tool with the MCP server, including description, Zod input schema validation, and a thin wrapper that invokes the scrape handler and formats the response as MCP content.

```typescript
server.tool(
  "scrape",
  "Extrai conteúdo inteligente de uma URL específica",
  {
    url: z
      .string()
      .url()
      .describe("URL da página para fazer scraping"),
    selector: z
      .string()
      .optional()
      .describe("Seletor CSS para extrair elemento específico"),
    javascript: z
      .boolean()
      .default(false)
      .describe("Renderizar JavaScript antes de extrair"),
    timeout: z
      .number()
      .int()
      .default(30000)
      .describe("Timeout em milissegundos"),
  },
  async (params) => {
    const result = await scrape(params);
    return {
      content: [{ type: "text", text: JSON.stringify(result, null, 2) }],
    };
  }
);
```
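
On the client side, the registered tool can be invoked through any MCP client. A sketch using the official TypeScript SDK, assuming a `Client` instance already connected to this server over a transport:

```typescript
import { Client } from "@modelcontextprotocol/sdk/client/index.js";

// Hypothetical caller; `client` is assumed to be connected.
async function fetchArticle(client: Client) {
  const result = await client.callTool({
    name: "scrape",
    arguments: {
      url: "https://example.com/article",
      javascript: true, // render dynamic content before extraction
    },
  });
  // The handler returns the ScrapeResult serialized as a JSON text block.
  console.log(result.content);
}
```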
- src/tools/scrape.ts:6-21 (schema) — TypeScript interfaces defining the input parameters (ScrapeParams) and output structure (ScrapeResult) for the scrape tool handler.

```typescript
interface ScrapeParams {
  url: string;
  selector?: string;
  javascript?: boolean;
  timeout?: number;
}

interface ScrapeResult {
  url: string;
  title: string;
  content: string;
  markdown: string;
  selectedContent?: string;
  fromCache: boolean;
  timestamp: string;
}
```
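
For illustration, a successful non-cached response would be shaped like this (all field values are invented):

```typescript
const example: ScrapeResult = {
  url: "https://example.com/article",
  title: "Example Article",
  content: "Plain-text body of the page...",
  markdown: "# Example Article\n\nPlain-text body of the page...",
  selectedContent: "Text of the element matched by the CSS selector",
  fromCache: false,
  timestamp: "2024-01-01T00:00:00.000Z",
};
```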
- src/lib/scraper.ts:8-51 (helper) — Supporting utility `scrapePage`: fetches HTML from a URL using a simple fetch (no JS) or full browser rendering with Playwright Chromium (with JS). Called by the main handler. Note that this file uses its own `ScrapeResult` shape (`{ html, status }`), distinct from the handler's result type of the same name.

```typescript
export async function scrapePage(
  url: string,
  options: { javascript?: boolean; timeout?: number } = {}
): Promise<ScrapeResult> {
  const { javascript = false, timeout = 30000 } = options;

  if (!javascript) {
    try {
      const response = await fetch(url, {
        headers: {
          "User-Agent":
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
        },
      });
      return {
        html: await response.text(),
        status: response.status,
      };
    } catch (error) {
      console.error("Fetch error:", error);
      throw error;
    }
  }

  const browser = await chromium.launch();
  try {
    const page = await browser.newPage({
      userAgent:
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    });
    await page.goto(url, { waitUntil: "networkidle", timeout });
    const html = await page.content();
    await page.close();
    return { html, status: 200 };
  } catch (error) {
    console.error("Browser scraping error:", error);
    throw error;
  } finally {
    await browser.close();
  }
}
```
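
Called directly, the helper picks a strategy based on the `javascript` flag. A hypothetical wrapper, not part of the source, could try the cheap fetch path first and fall back to browser rendering when the static HTML looks empty:

```typescript
// Fallback sketch: the 1000-character threshold is an arbitrary heuristic,
// not a value taken from the source.
async function scrapeWithFallback(url: string): Promise<string> {
  const staticResult = await scrapePage(url, { javascript: false });
  if (staticResult.html.length > 1000) {
    return staticResult.html;
  }
  // Page is likely client-rendered; pay the cost of a headless browser.
  const rendered = await scrapePage(url, { javascript: true, timeout: 30000 });
  return rendered.html;
}
```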