
scrape

Extract specific content from web pages using CSS selectors, with JavaScript rendering options for dynamic sites.

Instructions

Intelligently extracts content from a specific URL.

Input Schema

Name        Required  Description                                  Default
url         Yes       URL of the page to scrape                    —
selector    No        CSS selector to extract a specific element   —
javascript  No        Render JavaScript before extracting          false
timeout     No        Timeout in milliseconds                      30000
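
Putting the schema together, a minimal example payload for the tool (the URL and selector values are illustrative, not from the source):

```json
{
  "url": "https://example.com/article",
  "selector": "article h1",
  "javascript": true,
  "timeout": 30000
}
```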

Implementation Reference

  • Core handler function for the 'scrape' tool. Handles caching, fetches/scrapes HTML using the scrapePage helper, extracts structured content, applies the optional CSS selector, caches results, and returns a ScrapeResult object.

    export async function scrape(params: ScrapeParams): Promise<ScrapeResult> {
      const { url, selector, javascript = false, timeout = 30000 } = params;

      const cached = getFromCache(url);
      if (cached && !selector) {
        return {
          url,
          title: cached.title,
          content: cached.content,
          markdown: cached.markdown,
          fromCache: true,
          timestamp: new Date(cached.cached_at).toISOString(),
        };
      }

      try {
        const { html } = await scrapePage(url, { javascript, timeout });
        const extracted = await extractContent(html, url);
        if (!extracted) {
          throw new Error("Failed to extract content");
        }

        if (!cached) {
          saveToCache(url, {
            content: extracted.textContent,
            markdown: extracted.markdown,
            title: extracted.title,
          });
        }

        let selectedContent: string | undefined;
        if (selector) {
          try {
            const dom = new JSDOM(html, { url });
            const element = dom.window.document.querySelector(selector);
            selectedContent = element ? element.textContent || undefined : undefined;
          } catch (e) {
            console.error("Selector error:", e);
          }
        }

        return {
          url,
          title: extracted.title,
          content: extracted.textContent,
          markdown: extracted.markdown,
          selectedContent,
          fromCache: false,
          timestamp: new Date().toISOString(),
        };
      } catch (error) {
        console.error("Scrape error:", error);
        throw error;
      }
    }
  • src/index.ts:82-109 (registration)
    Registers the 'scrape' tool with the MCP server, including its description, Zod input-schema validation, and a thin wrapper that invokes the scrape handler and formats the response as MCP content.

    server.tool(
      "scrape",
      "Extrai conteúdo inteligente de uma URL específica",
      {
        url: z
          .string()
          .url()
          .describe("URL da página para fazer scraping"),
        selector: z
          .string()
          .optional()
          .describe("Seletor CSS para extrair elemento específico"),
        javascript: z
          .boolean()
          .default(false)
          .describe("Renderizar JavaScript antes de extrair"),
        timeout: z
          .number()
          .int()
          .default(30000)
          .describe("Timeout em milissegundos"),
      },
      async (params) => {
        const result = await scrape(params);
        return {
          content: [{ type: "text", text: JSON.stringify(result, null, 2) }],
        };
      }
    );
  • TypeScript interfaces defining the input parameters (ScrapeParams) and output structure (ScrapeResult) for the scrape tool handler.

    interface ScrapeParams {
      url: string;
      selector?: string;
      javascript?: boolean;
      timeout?: number;
    }

    interface ScrapeResult {
      url: string;
      title: string;
      content: string;
      markdown: string;
      selectedContent?: string;
      fromCache: boolean;
      timestamp: string;
    }
  • Supporting utility scrapePage: fetches HTML from the URL using a plain fetch (no JavaScript) or full browser rendering with Playwright Chromium (when javascript is true). Called by the main handler.

    export async function scrapePage(
      url: string,
      options: { javascript?: boolean; timeout?: number } = {}
    ): Promise<ScrapeResult> {
      const { javascript = false, timeout = 30000 } = options;

      if (!javascript) {
        try {
          const response = await fetch(url, {
            headers: {
              "User-Agent":
                "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
            },
          });
          return {
            html: await response.text(),
            status: response.status,
          };
        } catch (error) {
          console.error("Fetch error:", error);
          throw error;
        }
      }

      const browser = await chromium.launch();
      try {
        const page = await browser.newPage({
          userAgent:
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
        });
        await page.goto(url, { waitUntil: "networkidle", timeout });
        const html = await page.content();
        await page.close();
        return { html, status: 200 };
      } catch (error) {
        console.error("Browser scraping error:", error);
        throw error;
      } finally {
        await browser.close();
      }
    }
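
A minimal sketch of how the parameter defaults behave in both scrape() and scrapePage(): only url is required, and the handlers fill in the rest via destructuring defaults. The ScrapeParams shape mirrors the interface above; resolveParams is a hypothetical helper for illustration, not part of the source.

```typescript
interface ScrapeParams {
  url: string;
  selector?: string;
  javascript?: boolean;
  timeout?: number;
}

function resolveParams(params: ScrapeParams) {
  // Same destructuring pattern as the handlers: javascript defaults to
  // false (plain fetch path), timeout to 30000 ms.
  const { url, selector, javascript = false, timeout = 30000 } = params;
  return { url, selector, javascript, timeout };
}

const resolved = resolveParams({ url: "https://example.com" });
console.log(resolved.javascript, resolved.timeout); // false 30000
```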
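
One thing worth noting in scrapePage: the fetch branch destructures timeout but never passes it to fetch, so the timeout only takes effect on the Playwright path. A hedged sketch of how the fetch path could honor it as well, using AbortSignal.timeout (Node 17.3+); fetchWithTimeout is a hypothetical helper, not part of the source:

```typescript
// Hypothetical variant of the fetch branch that enforces the timeout.
async function fetchWithTimeout(url: string, timeoutMs: number): Promise<string> {
  const response = await fetch(url, {
    headers: {
      "User-Agent":
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    },
    // Aborts the request automatically once timeoutMs elapses.
    signal: AbortSignal.timeout(timeoutMs),
  });
  return response.text();
}
```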


MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/alucardeht/isis-mcp'

If you have feedback or need assistance with the MCP directory API, please join our Discord server.