Glama

crawl

Search Google and scrape top results to Markdown with structured content, URLs, and titles for web research and data extraction.

Instructions

Search Google for a query and scrape the top results to Markdown. Returns structured results with title, URL, and full page content.

Input Schema

| Name | Required | Description | Default |
| --- | --- | --- | --- |
| q | Yes | The search query | |
| count | No | Number of results to scrape (1-20) | 3 |
| context | No | Optional: what you're trying to accomplish (e.g., "finding competitors' pricing", "researching market trends"). Helps return more targeted results. | |
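To make the schema concrete, here is a hypothetical set of arguments an agent might send (the values are illustrative, not taken from the tool's documentation):

```typescript
// Hypothetical arguments for the crawl tool, shaped per the schema above.
const exampleArgs = {
  q: "serverless cold start benchmarks",            // required search query
  count: 5,                                         // 1-20; defaults to 3 if omitted
  context: "researching FaaS platform performance", // optional targeting hint
};

console.log(JSON.stringify(exampleArgs, null, 2));
```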

Implementation Reference

  • src/crawl.ts:48-362 (registration)
    The function that registers the /crawl route (and others) on the Fastify app.
    export async function registerCrawlRoutes(app: FastifyInstance): Promise<void> {
      /**
       * POST /crawl
       * Crawl a website: fetch a seed URL, discover same-domain links, scrape up to maxPages.
       * Returns an array of {url, title, markdown, success} objects.
       *
       * Body: { url: string, maxPages?: number (default 5), sameDomain?: boolean (default true) }
       */
      app.post('/crawl', async (req: FastifyRequest, reply: FastifyReply) => {
        const body = req.body as CrawlRequestBody;
        const seedUrl = (body?.url ?? '').toString().trim();
        const maxPages = Math.max(1, Math.min(50, Number(body?.maxPages ?? 5)));
        const sameDomainOnly = body?.sameDomain !== false; // default true
    
        if (!seedUrl || !/^https?:\/\//i.test(seedUrl)) {
          return reply.status(400).send({
            error: 'url_required',
            hint: 'POST { "url": "https://example.com", "maxPages": 5 }',
          });
        }
    
        let seedDomain = '';
        try { seedDomain = new URL(seedUrl).hostname; } catch {
          return reply.status(400).send({ error: 'invalid_url' });
        }
    
        type PageResult = { url: string; title: string; markdown: string; success: boolean; error?: string };
        const results: PageResult[] = [];
        // Normalize seed URL for dedup (strip trailing slash, lowercase scheme+host)
        const normalizeCrawlUrl = (u: string): string => {
          try {
            const p = new URL(u);
            return p.origin.toLowerCase() + (p.pathname.replace(/\/$/, '') || '/') + (p.search || '');
          } catch { return u; }
        };
        const seenUrls = new Set<string>([normalizeCrawlUrl(seedUrl)]);
    
        // ── Helper: scrape one URL (tier0 → browser fallback) with hard cap ──────
        async function scrapeOne(url: string, skipTier0 = false): Promise<PageResult> {
          try {
            // Try tier0 first (fast plain HTTP) unless caller already tried it
            if (!skipTier0) {
              const t0 = await Promise.race<CrawlResult | null>([
                scrapeUrlTier0(url, { includeLinks: true }),
                new Promise<null>(resolve => setTimeout(() => resolve(null), 7_000)),
              ]).catch(() => null);
              if (t0 && t0.status === 'success' && t0.markdown) {
                return { url, title: t0.title ?? '', markdown: t0.markdown, success: true, _links: t0.links } as PageResult & { _links?: string[] };
              }
            }
            // Browser fallback with hard timeout
            const session = await acquireSession();
            let hadError = false;
            try {
              const result = await Promise.race<CrawlResult>([
                scrapeUrlWithFallback(session.browser as Browser, url, isAgentUserAgent(req), { skipTier0: true, includeLinks: true }),
                new Promise<CrawlResult>((_, rej) => setTimeout(() => rej(new Error('page timeout')), 15_000)),
              ]);
              if (result.status === 'success' && result.markdown) {
                return { url, title: result.title ?? '', markdown: result.markdown, success: true, _links: (result as any).links } as PageResult & { _links?: string[] };
              }
              hadError = result.status === 'error';
              return { url, title: '', markdown: '', success: false, error: result.error || result.status };
            } catch (e: any) {
              hadError = true;
              return { url, title: '', markdown: '', success: false, error: e.message || 'timeout' };
            } finally {
              releaseSession(session, hadError);
            }
          } catch (e: any) {
            return { url, title: '', markdown: '', success: false, error: e.message || 'scrape_error' };
          }
        }
    
        // ── Step 1: Scrape seed URL and collect its links ─────────────────────────
        const seedResult = await scrapeOne(seedUrl) as PageResult & { _links?: string[] };
        results.push({ url: seedResult.url, title: seedResult.title, markdown: seedResult.markdown, success: seedResult.success, error: seedResult.error });
    
        if (!seedResult.success) {
          return reply.send({
            url: seedUrl,
            results,
            summary: { total: 1, success: 0, failed: 1, pagesScraped: 1 },
          });
        }
    
        // ── Step 2: Discover links from seed page ─────────────────────────────────
        const rawLinks: string[] = (seedResult as any)._links ?? [];
        const candidateUrls = rawLinks
          .filter(link => {
            try {
              const parsed = new URL(link);
              if (!sameDomainOnly) return true;
              if (parsed.hostname !== seedDomain) return false;
              // Skip obvious non-content resource paths
              const p = parsed.pathname.toLowerCase();
              if (p.match(/\.(css|js|png|jpg|jpeg|gif|webp|svg|ico|woff|woff2|ttf|pdf|xml|rss)$/)) return false;
              return true;
            } catch { return false; }
          })
          .filter(link => {
            // Dedup by normalized URL (preserve query strings; strip fragment + trailing slash)
            const key = normalizeCrawlUrl(link);
            if (seenUrls.has(key)) return false;
            seenUrls.add(key);
            return true;
          })
          .slice(0, maxPages - 1);
    
        if (DEBUG_LOG) {
          console.log(`[crawl] Seed ok, found ${rawLinks.length} links → ${candidateUrls.length} candidates (maxPages=${maxPages})`);
        }
    
        // ── Step 3: Scrape candidate pages in parallel (tier0 → browser) ─────────
        if (candidateUrls.length > 0) {
          const TIER0_TIMEOUT_MS = 6_000;
          const tier0Settled = await Promise.allSettled(
            candidateUrls.map(url =>
              Promise.race<CrawlResult | null>([
                scrapeUrlTier0(url, { includeLinks: false }),
                new Promise<null>(resolve => setTimeout(() => resolve(null), TIER0_TIMEOUT_MS)),
              ])
                .then(r => ({ url, result: r }))
                .catch(() => ({ url, result: null }))
            )
          );
    
          const browserQueue: string[] = [];
          for (const settled of tier0Settled) {
            if (settled.status !== 'fulfilled') continue;
            const { url, result } = settled.value;
            if (result && result.status === 'success' && result.markdown) {
              results.push({ url, title: result.title ?? '', markdown: result.markdown, success: true });
            } else {
              browserQueue.push(url);
            }
          }
    
          // Browser fallback for tier0 misses — in parallel with hard per-URL cap
          if (browserQueue.length > 0) {
            let session: Awaited<ReturnType<typeof acquireSession>> | null = null;
            let hadError = false;
            try {
              session = await acquireSession();
              const PER_URL_MS = 15_000;
              const browserResults = await Promise.allSettled(
                browserQueue.map(url =>
                  Promise.race<CrawlResult>([
                    scrapeUrlWithFallback(session!.browser as Browser, url, isAgentUserAgent(req), { skipTier0: true }),
                    new Promise<CrawlResult>((_, rej) => setTimeout(() => rej(new Error('page timeout')), PER_URL_MS)),
                  ])
                )
              );
              browserResults.forEach((r, i) => {
                const url = browserQueue[i];
                if (r.status === 'fulfilled' && r.value.status === 'success') {
                  results.push({ url, title: r.value.title ?? '', markdown: r.value.markdown, success: true });
                } else {
                  hadError = true;
                  const errMsg = r.status === 'rejected' ? r.reason?.message : r.value.error;
                  results.push({ url, title: '', markdown: '', success: false, error: errMsg || 'failed' });
                }
              });
            } catch (e: any) {
              hadError = true;
              browserQueue.forEach(url => {
                if (!results.find(r => r.url === url)) {
                  results.push({ url, title: '', markdown: '', success: false, error: e.message || 'browser_unavailable' });
                }
              });
            } finally {
              if (session) releaseSession(session, hadError);
            }
          }
        }
    
        const successCount = results.filter(r => r.success).length;
        return reply.send({
          url: seedUrl,
          results,
          summary: {
            total: results.length,
            success: successCount,
            failed: results.length - successCount,
            pagesScraped: results.length,
          },
        });
      });
    
      /**
       * POST /scrape
       * Scrape a single URL to Markdown
       */
      app.post('/scrape', async (req: FastifyRequest, reply: FastifyReply) => {
        const perf = createPerfLogger();
        const body = req.body as ScrapeRequestBody;
        const url = (body?.url ?? '').toString().trim();
    
        perf.event('Scrape request', { url });
    
        if (!url || !/^https?:\/\//i.test(url)) {
          return reply.status(400).send({ error: 'valid_url_required' });
        }
    
        // Reject PDF URLs if DATALAB_API_KEY is not configured
        if (isPdfUrl(url) && !isPdfSupportEnabled()) {
          perf.event('PDF rejected - no API key');
          perf.summary();
          return reply.status(400).send({
            error: 'pdf_not_supported',
            message: 'PDF URLs require DATALAB_API_KEY to be configured',
          });
        }
    
        // Build options from request body
        const scrapeOpts: ScrapeOptions = {};
        if (body?.waitForSelector) scrapeOpts.waitForSelector = body.waitForSelector;
        if (body?.targetSelector) scrapeOpts.targetSelector = body.targetSelector;
        if (body?.respondWith) scrapeOpts.respondWith = body.respondWith;
        if (body?.actions) scrapeOpts.actions = body.actions;
        if (body?.screenshot) scrapeOpts.screenshot = true;
        if (body?.includeLinks) scrapeOpts.includeLinks = true;
        const hasOptions = Object.keys(scrapeOpts).length > 0;
    
        // ── Tier 0: try plain HTTP fetch BEFORE acquiring the browser pool ────────
        // This handles ~40% of static URLs with no browser overhead at all.
        // Skip if options require browser-specific features.
        const needsBrowser = !!(scrapeOpts.actions?.length || scrapeOpts.waitForSelector || scrapeOpts.respondWith === 'screenshot' || scrapeOpts.screenshot);
        if (!needsBrowser && !isPdfUrl(url)) {
          const tier0 = await scrapeUrlTier0(url, scrapeOpts);
          if (tier0) {
            perf.event('Tier0 HTTP success', { len: tier0.markdown.length });
            perf.summary();
            return reply.send({ ...tier0, success: tier0.status === 'success' });
          }
        }
    
        const session = await acquireSession();
        let hadError = false;
    
        try {
          perf.beginStep('Scrape URL');
          let result = await scrapeUrlWithFallback(session.browser as Browser, url, isAgentUserAgent(req), hasOptions ? scrapeOpts : undefined);
    
          // If result has very little content, retry with explicit slow scraper as final attempt
          const contentLen = result.markdown ? result.markdown.length : 0;
          if (contentLen < 200 && result.status !== 'error') {
            console.log(`[scrape] Content too short (${contentLen} chars), retrying with slow scraper: ${url}`);
            const retryResult = await scrapeUrlWithFallback(session.browser as Browser, url, isAgentUserAgent(req), hasOptions ? scrapeOpts : undefined);
            if (retryResult.markdown && retryResult.markdown.length > contentLen) {
              result = retryResult;
            }
          }
    
          perf.endStep('Scrape URL', { status: result.status, contentLen: result.markdown?.length ?? 0 });
          perf.summary();
          return reply.send({ ...result, success: result.status === 'success' });
        } catch (err) {
          hadError = true;
          const message = err instanceof Error ? err.message : String(err);
          perf.error('Scrape', message);
          perf.summary();
    
          if (DEBUG_LOG) {
            console.error('[scrape] error:', err);
          }
    
          return reply.status(500).send({
            error: 'scrape_failed',
            message,
          });
        } finally {
          releaseSession(session, hadError);
        }
      });
    
      /**
       * GET /r/:url
       * Shorthand scrape: GET /r/https://example.com → returns markdown as text/plain
       * Inspired by Jina Reader's /r/ shortcut for quick LLM context fetching.
       */
      app.get('/r/*', async (req: FastifyRequest, reply: FastifyReply) => {
        const rawUrl = (req.params as any)['*'] as string;
        if (!rawUrl) {
          return reply.status(400).type('text/plain').send('Usage: GET /r/https://example.com');
        }
        const fullUrl = (rawUrl.startsWith('http://') || rawUrl.startsWith('https://')) ? rawUrl : 'https://' + rawUrl;
        if (!/^https?:\/\//i.test(fullUrl)) {
          return reply.status(400).type('text/plain').send('Invalid URL');
        }
    
        // Try tier 0 (plain HTTP) first — fast, no browser needed
        const tier0 = await scrapeUrlTier0(fullUrl);
        if (tier0) {
          reply.header('Content-Type', 'text/plain; charset=utf-8');
          return reply.send(tier0.markdown);
        }
    
        // Escalate to browser tier
        const session = await acquireSession();
        let hadError = false;
        try {
          const result = await scrapeUrlWithFallback(session.browser as Browser, fullUrl, isAgentUserAgent(req));
          reply.header('Content-Type', 'text/plain; charset=utf-8');
          return reply.send(result.markdown || result.error || 'No content');
        } catch (err) {
          hadError = true;
          const message = err instanceof Error ? err.message : String(err);
          reply.header('Content-Type', 'text/plain; charset=utf-8');
          return reply.status(500).send(`Error: ${message}`);
        } finally {
          releaseSession(session, hadError);
        }
      });
    }
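The normalizeCrawlUrl helper above is the whole dedup story, so it is worth exercising in isolation. This standalone sketch copies the same logic (it relies only on Node's global URL constructor, so no imports are needed):

```typescript
// Dedup key used by the crawl handler: lowercase origin, trailing slash
// stripped from the path, query string kept, fragment dropped. Inputs that
// fail URL parsing pass through unchanged.
const normalizeCrawlUrl = (u: string): string => {
  try {
    const p = new URL(u);
    return p.origin.toLowerCase() + (p.pathname.replace(/\/$/, '') || '/') + (p.search || '');
  } catch {
    return u;
  }
};
```

With this key, https://EXAMPLE.com/docs/ and https://example.com/docs#intro collapse to the same entry in seenUrls, while URLs that differ only in their query strings remain distinct pages.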
  • Type definition for the input schema of the crawl tool.
    interface CrawlRequestBody {
      url?: string;
      maxPages?: number;
      sameDomain?: boolean; // default true — only follow links on the same domain
    }
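Note that the handler clamps maxPages into the 1-50 range rather than rejecting out-of-range values. A minimal sketch of that behavior (clampMaxPages is my name for it, not one from the source):

```typescript
// Mirrors the handler's line:
//   const maxPages = Math.max(1, Math.min(50, Number(body?.maxPages ?? 5)));
// Missing values default to 5; out-of-range numbers are clamped, not rejected.
const clampMaxPages = (raw: unknown): number =>
  Math.max(1, Math.min(50, Number(raw ?? 5)));
```

One quirk worth knowing: a non-numeric value such as "many" becomes NaN, which Math.min and Math.max propagate, so the handler would proceed with maxPages = NaN and slice(0, NaN - 1) would simply yield no candidate links.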
Behavior: 2/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

No annotations are provided, so the description carries the full burden of behavioral disclosure. It mentions that the tool searches Google and scrapes results to Markdown, but lacks details on permissions, rate limits, error handling, or whether it's a read-only or mutating operation. For a tool that interacts with external services, this is a significant gap.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness: 5/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is front-loaded and efficiently structured in two sentences: one stating the core action and another detailing the return format. Every sentence adds value without redundancy, making it easy for an agent to parse quickly.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness: 3/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Given the tool's complexity (external search and scraping) and lack of annotations and output schema, the description is minimally adequate. It covers the basic purpose and return format but misses behavioral details like error handling or limitations. Without an output schema, it should ideally explain return values more thoroughly.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters: 3/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

Schema description coverage is 100%: all three parameters (q, count, context) are documented in the schema itself. The tool description adds no parameter semantics beyond that, such as examples or value constraints, but with full schema coverage it still meets the baseline.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose: 5/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states the tool's purpose with specific verbs ('Search Google', 'scrape the top results to Markdown') and resource ('Google'), and distinguishes it from siblings by specifying it returns structured results with title, URL, and full page content. It explicitly mentions what the tool does beyond just the name.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines: 2/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description provides no guidance on when to use this tool versus the sibling tools (batch_scrape, extract, scrape, search). It mentions what the tool does but offers no context on alternatives, prerequisites, or exclusions, leaving the agent to guess based on tool names alone.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
