extract

Fetch any URL and get clean article markdown using Mozilla Readability with a plain-text fallback. Content can optionally be truncated to at most 50,000 characters. Private addresses are blocked by default for security.

Instructions

Fetch a URL and return clean article markdown. Uses Mozilla Readability with a text fallback. Best-effort: failures return { error } instead of throwing. Private/loopback addresses blocked unless SURF_ALLOW_PRIVATE=true.

Input Schema

| Name      | Required | Description                         | Default |
|-----------|----------|-------------------------------------|---------|
| url       | Yes      | URL to fetch                        | -       |
| max_chars | No       | Truncate content to this many chars | 8000    |
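The schema can be exercised with arguments like the following (the URL here is a hypothetical example, not from the source):

```typescript
// Example arguments for an 'extract' tool call. The URL is illustrative only.
const extractArgs = {
  url: 'https://example.com/blog/some-post', // required
  max_chars: 4_000, // optional; schema allows 200-50000, default 8000
};

console.log(extractArgs.max_chars >= 200 && extractArgs.max_chars <= 50_000); // true
```

Note that, as shown in the handler below the schema, out-of-range max_chars values are clamped rather than rejected.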

Implementation Reference

  • The core handler function for the 'extract' tool. Opens a page, applies Mozilla Readability for article extraction, falls back to innerText cleanup, returns markdown (via Turndown) or plain text. Handles SSRF checks, HTTP errors, and graceful error recovery.
    export async function extract(
      ctx: BrowserContext,
      url: string,
      maxChars = 8_000,
      navTimeoutMs = 10_000,
    ): Promise<ExtractResult> {
      const checkErr = checkUrl(url);
      if (checkErr) return { url, error: checkErr };
    
      let page: Page | null = null;
      try {
        page = await ctx.newPage();
    
        const resp = await page.goto(url, { waitUntil: 'domcontentloaded', timeout: navTimeoutMs });
        if (resp && resp.status() >= 400) {
          return { url, error: `http ${resp.status()}` };
        }
    
        // SPA settle: wait briefly for JS-rendered content
        await page.waitForTimeout(500);
    
        await page.addScriptTag({ path: READABILITY_PATH }).catch(() => {});
    
        const article = await page.evaluate(() => {
          try {
            const W = window as unknown as { Readability?: new (doc: Document) => { parse: () => ReadabilityOutput | null } };
            if (!W.Readability) return null;
            const cloned = document.cloneNode(true) as Document;
            const reader = new W.Readability(cloned);
            return reader.parse();
          } catch {
            return null;
          }
        }) as ReadabilityOutput | null;
    
        if (article && article.content) {
          const md = turndown.turndown(article.content).slice(0, maxChars);
          return {
            url,
            title: article.title || undefined,
            content: md,
            excerpt: (article.excerpt || article.textContent || '').slice(0, 200).trim() || undefined,
            length: md.length,
          };
        }
    
        const fallback = await page.evaluate((sel: string[]) => {
          sel.forEach(s => document.querySelectorAll(s).forEach(e => e.remove()));
          const main = document.querySelector('main, article, [role="main"]') || document.body;
          const text = (main as HTMLElement).innerText || '';
          const title = document.title;
          return { title, text: text.replace(/\n{3,}/g, '\n\n').trim() };
        }, NAV_SELECTORS);
    
        if (!fallback.text) {
          return { url, title: fallback.title || undefined, error: 'no extractable content' };
        }
    
        const text = fallback.text.slice(0, maxChars);
        return {
          url,
          title: fallback.title || undefined,
          content: text,
          excerpt: text.slice(0, 200),
          length: text.length,
        };
      } catch (e) {
        return { url, error: (e as Error).message.slice(0, 200) };
      } finally {
        await page?.close().catch(() => {});
      }
    }
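The checkUrl guard called at the top of extract() is not shown on this page. A minimal sketch of what such an SSRF check might look like, honoring the SURF_ALLOW_PRIVATE override mentioned in the description (the function body here is an assumption, not the project's actual code):

```typescript
// Hypothetical sketch of a checkUrl-style SSRF guard (the real implementation
// is not shown above). Returns an error string, or null when the URL is allowed.
function checkUrl(url: string): string | null {
  let u: URL;
  try {
    u = new URL(url);
  } catch {
    return 'invalid url';
  }
  if (u.protocol !== 'http:' && u.protocol !== 'https:') return 'unsupported protocol';

  const host = u.hostname; // note: IPv6 hostnames keep their brackets, e.g. '[::1]'
  const isPrivate =
    host === 'localhost' ||
    host === '127.0.0.1' ||
    host === '[::1]' ||
    /^10\./.test(host) ||
    /^192\.168\./.test(host) ||
    /^172\.(1[6-9]|2\d|3[01])\./.test(host);

  if (isPrivate && process.env.SURF_ALLOW_PRIVATE !== 'true') {
    return 'private address blocked';
  }
  return null;
}
```

A production check would also resolve DNS before deciding, since a public hostname can point at a private address; the sketch only covers the literal-host cases.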
  • ExtractResult interface - the output schema for the extract tool, with fields for url, title, content, excerpt, length, and error.
    export interface ExtractResult {
      url: string;
      title?: string;
      content?: string;
      excerpt?: string;
      length?: number;
      error?: string;
    }
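The interface leaves error and content non-exclusive, so a best-effort result can carry both. The CallTool handler below treats a result as failed only when error is set and content is absent; that check can be expressed as a small predicate (a hypothetical helper, mirroring the handler's own expression):

```typescript
// ExtractResult repeated here for self-containment.
interface ExtractResult {
  url: string;
  title?: string;
  content?: string;
  excerpt?: string;
  length?: number;
  error?: string;
}

// Mirrors the handler's check: failed = !!result.error && !result.content.
function isFailure(r: ExtractResult): boolean {
  return !!r.error && !r.content;
}

isFailure({ url: 'https://example.com', error: 'http 404' });             // true
isFailure({ url: 'https://example.com', content: '# Title', length: 7 }); // false
```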
  • src/index.ts:211-222 (registration)
    Tool registration in the MCP ListToolsRequestSchema handler. Defines the 'extract' tool with name, description, and inputSchema (url required, max_chars optional).
    {
      name: 'extract',
      description: 'Fetch a URL and return clean article markdown. Uses Mozilla Readability with a text fallback. Best-effort: failures return { error } instead of throwing. Private/loopback addresses blocked unless SURF_ALLOW_PRIVATE=true.',
      inputSchema: {
        type: 'object',
        properties: {
          url: { type: 'string', description: 'URL to fetch' },
          max_chars: { type: 'number', minimum: 200, maximum: 50000, description: 'Truncate content to this many chars (default 8000)' },
        },
        required: ['url'],
      },
    },
  • src/index.ts:306-330 (registration)
    CallToolRequestSchema handler for 'extract'. Parses args (url, max_chars), calls pool.extractOne(), wraps in timeout, returns JSON result with elapsed_ms.
    if (name === 'extract') {
      const url = String(args?.url || '').trim();
      if (!url) throw new McpError(ErrorCode.InvalidParams, 'url required');
      const maxChars = Math.min(Math.max(Number(args?.max_chars) || DEFAULT_EXTRACT_MAX_CHARS, 200), 50_000);
    
      const t0 = Date.now();
      try {
        // extract never throws CaptchaError, no fallback needed
        const result = await trackPool(async () => {
          const p = await ensurePool();
          return await withTimeout(p.extractOne(url, maxChars), REQUEST_TIMEOUT_MS, 'extract');
        });
        const failed = !!result.error && !result.content;
        return {
          content: [{
            type: 'text',
            text: JSON.stringify({ ...result, elapsed_ms: Date.now() - t0 }, null, 2),
          }],
          ...(failed ? { isError: true } : {}),
        };
      } catch (e) {
        console.error('[google-surf-mcp] extract error:', e);
        return { content: [{ type: 'text', text: `Error: ${(e as Error).message}` }], isError: true };
      }
    }
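The max_chars clamp in the handler above folds defaulting and range-limiting into a single expression. Extracted for illustration (DEFAULT_EXTRACT_MAX_CHARS is assumed to be 8000, per the schema description):

```typescript
const DEFAULT_EXTRACT_MAX_CHARS = 8_000; // assumed default, per the schema text

// Same expression as the handler: NaN, 0, or undefined fall back to the
// default, then the value is clamped into [200, 50000].
function clampMaxChars(raw: unknown): number {
  return Math.min(Math.max(Number(raw) || DEFAULT_EXTRACT_MAX_CHARS, 200), 50_000);
}

clampMaxChars(undefined); // 8000
clampMaxChars(50);        // 200
clampMaxChars(1e9);       // 50000
```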
  • SearchPool.extractOne() - acquires a worker context from the pool and delegates to the extract() function, releasing the worker afterwards.
    async extractOne(url: string, maxChars: number, navTimeoutMs?: number): Promise<ExtractResult> {
      if (!this.warmed) await this.warm();
      const w = await this.acquire();
      try {
        return await extract(w.ctx, url, maxChars, navTimeoutMs);
      } finally {
        this.release(w);
      }
    }
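The withTimeout wrapper referenced in the CallTool handler is also not shown on this page. A minimal sketch of such a helper, assuming it rejects with a labeled error after ms milliseconds:

```typescript
// Hypothetical sketch of a withTimeout helper (the project's actual
// implementation is not shown above).
function withTimeout<T>(p: Promise<T>, ms: number, label: string): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error(`${label} timed out after ${ms}ms`)), ms);
    p.then(
      value => { clearTimeout(timer); resolve(value); },
      err => { clearTimeout(timer); reject(err); },
    );
  });
}
```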
Behavior: 4/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

With no annotations, the description discloses key behaviors: uses Mozilla Readability with text fallback, best-effort error handling returning { error }, and private address blocking with env var override. This provides good transparency, though details on redirects or caching are missing.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness: 5/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is three concise sentences, each adding unique value: purpose, method/fallback, and error/security. No redundant or unnecessary information.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness: 4/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Given no output schema, the description adequately covers returns (markdown or error) and security restrictions. It is sufficient for a simple extraction tool, though it could mention pagination or truncation behavior beyond max_chars.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters: 3/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

Schema coverage is 100% with descriptions for both parameters, so the baseline is 3. The description does not add significant semantic detail beyond what the schema provides, though it mentions the output format ('clean article markdown') which indirectly relates to the parameters.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose: 5/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states the verb 'Fetch a URL' and the resource 'clean article markdown', distinguishing it from siblings like search or search_extract. It is specific and unambiguous.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines: 3/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description implies the tool is for extracting article content from a URL, but it does not explicitly say when to use it over alternatives, or when not to use it, and it offers no explicit differentiation from sibling tools.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.


MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/HarimxChoi/google-surf-mcp'

If you have feedback or need assistance with the MCP directory API, please join our Discord server