batch_scrape

Scrape up to 10 URLs simultaneously and extract their content as markdown, batching web data collection for AI agents instead of one request per page.

Instructions

Scrape multiple URLs at once (up to 10) and get all results as markdown. More efficient than calling scrape() in a loop.

Input Schema

Name      Required  Description                                  Default
urls      Yes       List of URLs to scrape (max 10)              -
context   No        Optional: what you're trying to accomplish   -
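
For orientation, a minimal client call might look like the sketch below. The request and response shapes follow the '/batch' handler in the implementation reference; the host is a placeholder, and batchScrape is a hypothetical helper, not part of the server's API.

    // Hypothetical client helper for the /batch endpoint. The host is a
    // placeholder; payload and response shapes mirror the handler below.
    type BatchResult = { url: string; success: boolean; markdown: string | null; title: string | null; error?: string };
    type BatchResponse = { results: BatchResult[]; summary: { total: number; success: number; failed: number } };

    async function batchScrape(urls: string[], context?: string): Promise<BatchResponse> {
      const res = await fetch('https://example.invalid/batch', {
        method: 'POST',
        headers: { 'content-type': 'application/json' },
        body: JSON.stringify({ urls, context }),
      });
      if (!res.ok) throw new Error(`batch_scrape failed: HTTP ${res.status}`);
      return (await res.json()) as BatchResponse;
    }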

Implementation Reference

  • The Fastify POST handler for the '/batch' endpoint, which processes the 'batch_scrape' tool request by validating URLs, attempting tier-0 scrapes, and using a browser pool for fallback.
    app.post('/batch', async (req: FastifyRequest, reply: FastifyReply) => {
      const body = req.body as BatchRequestBody;
      const urls = body?.urls;
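      // Note: context is accepted to match the tool schema; this handler does
      // not use it directly.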
      const context = body?.context;
    
      if (!Array.isArray(urls) || urls.length === 0) {
        return reply.status(400).send({ error: 'urls array required' });
      }
      if (urls.length > 10) {
        return reply.status(400).send({ error: 'max 10 URLs per batch' });
      }
    
      // Validate all are strings
      const urlStrings: string[] = [];
      for (const u of urls) {
        if (typeof u !== 'string') {
          return reply.status(400).send({ error: 'all urls must be strings' });
        }
        urlStrings.push(u);
      }
    
      // Auth check — owner has no limits
      const ownerKey = getOwnerKey(req);
      const isOwner = isOwnerKey(ownerKey);
      const isPro = !isOwner && isProUser(req);
      const isFree = !isOwner && !isPro;
    
      const clientIp = req.ip || 'unknown';
    
      // Internal token bypass (from MCP): the payment gate already validated
      // internal tokens before this handler runs, but /batch is not in
      // DEFAULT_PRICES, so the free-tier limit is enforced (or bypassed) here.
      const internalToken = req.headers['x-internal-token'] as string | undefined;
      const isInternal = Boolean(internalToken);

      if (isFree && !isInternal) {
        const ok = checkBatchFreeTier(clientIp);
        if (!ok.allowed) {
          return reply.status(429).send({
            error: 'Free tier batch limit reached (5 batches/day). Upgrade to Pro for unlimited batches.',
            upgrade: 'https://anybrowse.dev/checkout',
            reset: 'Resets at midnight UTC',
          });
        }
      }
    
      // ── Tier 0: try plain HTTP fetch for each URL (no browser pool needed) ──
      type BatchResult = { url: string; success: boolean; markdown: string | null; title: string | null; error?: string };
      const tier0Results = await Promise.allSettled(
        urlStrings.map(async (url): Promise<BatchResult> => {
          try {
            const r = await scrapeUrlTier0(url);
            if (r && r.status === 'success' && r.markdown) {
              return { url, success: true, markdown: r.markdown, title: r.title ?? null };
            }
          } catch { /* fall through */ }
          return { url, success: false, markdown: null, title: null, error: 'tier0_miss' };
        })
      );
    
      // Separate tier0 hits from misses
      const results: BatchResult[] = new Array(urlStrings.length);
      const browserQueue: Array<{ idx: number; url: string }> = [];
    
      tier0Results.forEach((r, i) => {
        const val = r.status === 'fulfilled' ? r.value : { url: urlStrings[i], success: false, markdown: null, title: null, error: 'tier0_error' };
        if (val.success) {
          results[i] = val;
        } else {
          browserQueue.push({ idx: i, url: urlStrings[i] });
        }
      });
    
      // ── Browser pool: handle URLs that tier0 couldn't serve ──────────────
      let session: Awaited<ReturnType<typeof acquireSession>> | null = null;
      let hadError = false;
    
      if (browserQueue.length > 0) {
        try {
          session = await acquireSession();
          const browser = session.browser as Browser;
    
          const PER_URL_TIMEOUT_MS = 15_000; // hard cap — tier0 already failed for these URLs
          const settled = await Promise.allSettled(
            browserQueue.map(({ url }) =>
              Promise.race([
                scrapeUrlWithFallback(browser, url, true, { skipTier0: true }),
                new Promise<never>((_, rej) =>
                  setTimeout(() => rej(new Error('per-url browser timeout')), PER_URL_TIMEOUT_MS)
                ),
              ])
            )
          );
    
          settled.forEach((r, qi) => {
            const { idx, url } = browserQueue[qi];
            if (r.status === 'fulfilled') {
              const val = r.value;
              if (val.status === 'success') {
                results[idx] = { url, success: true, markdown: val.markdown, title: val.title ?? null };
              } else {
                hadError = true;
                results[idx] = { url, success: false, markdown: null, title: null, error: val.error || val.status };
              }
            } else {
              hadError = true;
              results[idx] = { url, success: false, markdown: null, title: null, error: r.reason?.message || String(r.reason) };
            }
          });
        } catch (err: any) {
          hadError = true;
          // Fill remaining slots with error
          browserQueue.forEach(({ idx, url }) => {
            if (!results[idx]) {
              results[idx] = { url, success: false, markdown: null, title: null, error: err.message || 'Browser scrape failed' };
            }
          });
        } finally {
          if (session) releaseSession(session, hadError);
        }
      }
    
      const successCount = results.filter((r) => r?.success).length;
      return reply.send({
        results,
        summary: { total: results.length, success: successCount, failed: results.length - successCount },
      });
    });
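
  • scrapeUrlTier0 is not shown above. As a rough, hypothetical sketch of the tier-0 idea (a plain HTTP fetch that never touches the browser pool), it might look like the following, assuming Node 18+ global fetch and the turndown package for HTML-to-markdown conversion; the real implementation may differ.
    // Hypothetical sketch only: the real scrapeUrlTier0 is not part of this
    // reference. Assumes global fetch (Node 18+) and the turndown package.
    import TurndownService from 'turndown';

    type Tier0Result = { status: 'success' | 'miss'; markdown?: string; title?: string };

    async function scrapeUrlTier0(url: string): Promise<Tier0Result> {
      const res = await fetch(url, { redirect: 'follow', signal: AbortSignal.timeout(8_000) });
      const type = res.headers.get('content-type') ?? '';
      if (!res.ok || !type.includes('text/html')) return { status: 'miss' };

      const html = await res.text();
      // Client-rendered pages return near-empty HTML; leave those to the browser pool.
      if (html.length < 500) return { status: 'miss' };

      const title = /<title[^>]*>([^<]*)<\/title>/i.exec(html)?.[1]?.trim();
      const markdown = new TurndownService().turndown(html);
      return { status: 'success', markdown, title };
    }
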
Behavior 3/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

With no annotations provided, the description carries the full burden of behavioral disclosure. It mentions efficiency benefits and the 10-URL limit, which is useful context. However, it doesn't cover important behavioral aspects like rate limits, authentication needs, error handling, or what 'scrape' entails (e.g., HTML parsing, timeout behavior). The description adds some value but leaves significant gaps.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
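
One concrete example from the implementation above: free-tier clients are limited to 5 batches per day per IP, and requests past that limit receive an HTTP 429 with this payload:

    {
      "error": "Free tier batch limit reached (5 batches/day). Upgrade to Pro for unlimited batches.",
      "upgrade": "https://anybrowse.dev/checkout",
      "reset": "Resets at midnight UTC"
    }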

Conciseness 5/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is perfectly concise: two sentences that each earn their place. The first sentence states the core functionality, and the second provides valuable comparative guidance. No wasted words, well-structured, and front-loaded with the main purpose.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness 3/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Given the tool has no annotations and no output schema, the description should do more to compensate. While it clearly explains the batch nature and efficiency advantage, it doesn't describe what the output looks like (beyond 'markdown'), error conditions, or performance characteristics. For a scraping tool with multiple parameters and no structured output documentation, this is adequate but has clear gaps.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.
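
For reference, the handler shown earlier returns a payload of this shape (values illustrative):

    {
      "results": [
        { "url": "https://example.com", "success": true, "markdown": "# Example...", "title": "Example Domain" },
        { "url": "https://js-heavy.example", "success": false, "markdown": null, "title": null, "error": "per-url browser timeout" }
      ],
      "summary": { "total": 2, "success": 1, "failed": 1 }
    }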

Parameters 3/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

Schema description coverage is 100%, so the schema already documents both parameters thoroughly. The description doesn't add any parameter-specific semantics beyond what's in the schema (e.g., it doesn't explain URL format requirements or how 'context' affects scraping). Baseline 3 is appropriate when the schema does the heavy lifting.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose 5/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states the tool's purpose: 'Scrape multiple URLs at once (up to 10) and get all results as markdown.' It specifies the action (scrape), resource (URLs), scope (multiple, up to 10), and output format (markdown). It also distinguishes from sibling 'scrape' by emphasizing batch efficiency.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines 4/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description provides clear usage context: 'More efficient than calling scrape() in a loop' explicitly indicates when to use this tool over the sibling 'scrape' tool. However, it doesn't mention when NOT to use it (e.g., for single URLs) or alternatives like 'crawl' or 'extract' from the sibling list.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
