
MCP Server for Crawl4AI

by omgwtfwow

crawl_recursive

Deep crawl websites by following internal links to map entire sites, find all pages, and build comprehensive indexes with configurable depth and page limits.

Instructions

[STATELESS] Deep crawl a website following internal links. Use when: mapping entire sites, finding all pages, building comprehensive indexes. Control with max_depth (default 3) and max_pages (default 50). Note: May need JS execution for dynamic sites. Each page gets a fresh browser. For persistent operations use create_session + crawl.

Input Schema

Name             Required  Default  Description
url              Yes       -        Starting URL to crawl from
max_depth        No        3        Maximum depth to follow links
max_pages        No        50       Maximum number of pages to crawl
include_pattern  No        -        Regex to match URLs to crawl. Example: ".*\/blog\/.*" for blog posts only, ".*\.html$" for HTML pages
exclude_pattern  No        -        Regex to skip URLs. Example: ".*\/(login|admin).*" to avoid auth pages, ".*\.pdf$" to skip PDFs
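
For illustration only, a hypothetical set of arguments (the domain and all values below are placeholders, not taken from the server's documentation) restricting a crawl to blog pages and skipping PDFs might look like this:

    // Hypothetical crawl_recursive arguments: follow only blog URLs on the
    // placeholder domain, skip PDFs, and stop after 2 levels or 20 pages.
    const exampleArgs = {
      url: 'https://example.com/blog/',
      max_depth: 2,
      max_pages: 20,
      include_pattern: '.*/blog/.*',
      exclude_pattern: '.*\\.pdf$',
    };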

Implementation Reference

  • Main handler function for the 'crawl_recursive' tool. Implements recursive crawling with BFS, following internal links up to max_depth (default 3) and stopping after max_pages (default 50), with optional include/exclude regex filters. Uses the /crawl endpoint for each page and collects per-page summaries.
    async crawlRecursive(options: {
      url: string;
      max_depth?: number;
      max_pages?: number;
      include_pattern?: string;
      exclude_pattern?: string;
    }) {
      try {
        const startUrl = new URL(options.url);
        const visited = new Set<string>();
        const toVisit: Array<{ url: string; depth: number }> = [{ url: options.url, depth: 0 }];
        const results: Array<{ url: string; content: string; internal_links_found: number; depth: number }> = [];
        let maxDepthReached = 0;
    
        const includeRegex = options.include_pattern ? new RegExp(options.include_pattern) : null;
        const excludeRegex = options.exclude_pattern ? new RegExp(options.exclude_pattern) : null;
    
        const maxDepth = options.max_depth !== undefined ? options.max_depth : 3;
        const maxPages = options.max_pages || 50;
    
        while (toVisit.length > 0 && results.length < maxPages) {
          const current = toVisit.shift();
          if (!current || visited.has(current.url) || current.depth > maxDepth) {
            continue;
          }
    
          visited.add(current.url);
    
          try {
            // Check URL patterns
            if (excludeRegex && excludeRegex.test(current.url)) continue;
            if (includeRegex && !includeRegex.test(current.url)) continue;
    
            // Crawl the page using the crawl endpoint to get links
            const response = await this.axiosClient.post('/crawl', {
              urls: [current.url],
              crawler_config: {
                cache_mode: 'BYPASS',
              },
            });
    
            const crawlResults = response.data.results || [response.data];
            const result: CrawlResultItem = crawlResults[0];
    
            if (result && result.success) {
              const markdownContent = result.markdown?.fit_markdown || result.markdown?.raw_markdown || '';
              const internalLinksCount = result.links?.internal?.length || 0;
              maxDepthReached = Math.max(maxDepthReached, current.depth);
              results.push({
                url: current.url,
                content: markdownContent,
                internal_links_found: internalLinksCount,
                depth: current.depth,
              });
    
              // Add internal links to crawl queue
              if (current.depth < maxDepth && result.links?.internal) {
                for (const linkObj of result.links.internal) {
                  const linkUrl = linkObj.href || linkObj;
                  try {
                    const absoluteUrl = new URL(linkUrl, current.url).toString();
                    if (!visited.has(absoluteUrl) && new URL(absoluteUrl).hostname === startUrl.hostname) {
                      toVisit.push({ url: absoluteUrl, depth: current.depth + 1 });
                    }
                  } catch (e) {
                    // Skip invalid URLs
                    console.debug('Invalid URL:', e);
                  }
                }
              }
            }
          } catch (error) {
            // Log but continue crawling other pages
            console.error(`Failed to crawl ${current.url}:`, error instanceof Error ? error.message : error);
          }
        }
    
        // Prepare the output text
        let outputText = `Recursive crawl completed:\n\nPages crawled: ${results.length}\nStarting URL: ${options.url}\n`;
    
        if (results.length > 0) {
          outputText += `Max depth reached: ${maxDepthReached} (limit: ${maxDepth})\n\nNote: Only internal links (same domain) are followed during recursive crawling.\n\nPages found:\n${results.map((r) => `- [Depth ${r.depth}] ${r.url}\n  Content: ${r.content.length} chars\n  Internal links found: ${r.internal_links_found}`).join('\n')}`;
        } else {
          outputText += `\nNo pages could be crawled. This might be due to:\n- The starting URL returned an error\n- No internal links were found\n- All discovered links were filtered out by include/exclude patterns`;
        }
    
        return {
          content: [
            {
              type: 'text',
              text: outputText,
            },
          ],
        };
      } catch (error) {
        throw this.formatError(error, 'crawl recursively');
      }
    }
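    To make the pattern semantics concrete, the illustrative helper below (not part of the source) mirrors the two checks in the loop above: when both patterns are supplied, a URL is crawled only if it does not match exclude_pattern and does match include_pattern.
    // Illustrative only; assumes both patterns are provided.
    const include = new RegExp('.*/blog/.*');
    const exclude = new RegExp('.*/(login|admin).*');
    const shouldCrawl = (url: string) => !exclude.test(url) && include.test(url);

    shouldCrawl('https://example.com/blog/post-1');     // true
    shouldCrawl('https://example.com/blog/admin/edit'); // false: matches exclude_pattern
    shouldCrawl('https://example.com/pricing');         // false: does not match include_pattern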
  • Zod input schema validation for the crawl_recursive tool parameters.
    export const CrawlRecursiveSchema = createStatelessSchema(
      z.object({
        url: z.string().url(),
        max_depth: z.number().optional(),
        max_pages: z.number().optional(),
        include_pattern: z.string().optional(),
        exclude_pattern: z.string().optional(),
      }),
      'crawl_recursive',
    );
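    As a minimal sketch (assuming createStatelessSchema returns an ordinary Zod schema that can be parsed directly), malformed arguments are rejected before the handler runs:
    const parsed = CrawlRecursiveSchema.safeParse({ url: 'not-a-url', max_depth: 2 });
    if (!parsed.success) {
      // url must be a valid URL, so validation fails here and the issue
      // messages can be reported back to the caller.
      console.error(parsed.error.issues.map((issue) => issue.message));
    }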
  • src/server.ts:870-873 (registration)
    Tool registration and dispatching in the switch statement of the callToolRequestHandler. Validates args with CrawlRecursiveSchema and delegates to crawlHandlers.crawlRecursive.
    case 'crawl_recursive':
      return await this.validateAndExecute('crawl_recursive', args, CrawlRecursiveSchema, async (validatedArgs) =>
        this.crawlHandlers.crawlRecursive(validatedArgs),
      );
  • src/server.ts:951-952 (registration)
    Protected accessor method in the Crawl4AIServer class that delegates to the crawlHandlers instance.
    protected async crawlRecursive(options: Parameters<CrawlHandlers['crawlRecursive']>[0]) {
      return this.crawlHandlers.crawlRecursive(options);
    }
Behavior: 4/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

With no annotations provided, the description carries the full burden of behavioral disclosure. It effectively describes key traits: stateless operation ('[STATELESS]'), need for JS execution for dynamic sites, fresh browser per page, and limitations (defaults for max_depth and max_pages). It doesn't cover rate limits or error handling, but provides substantial context beyond basic functionality.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness: 5/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is efficiently structured with zero waste: it opens with the stateless hint, states the core purpose, provides usage guidelines, mentions key parameters with defaults, and notes behavioral considerations. Every sentence earns its place, and information is front-loaded appropriately.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness: 4/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Given the tool's complexity (recursive crawling with multiple parameters) and no annotations or output schema, the description does well to cover purpose, usage, key behaviors, and parameter defaults. It could improve by briefly mentioning the return format or error scenarios, but it's largely complete for guiding agent selection and invocation.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters: 3/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

Schema description coverage is 100%, so the schema already documents all 5 parameters thoroughly. The description adds minimal value by mentioning defaults for max_depth and max_pages, but doesn't provide additional semantics beyond what's in the schema. This meets the baseline of 3 when schema coverage is high.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose: 5/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states the tool's purpose with specific verbs ('deep crawl', 'following internal links') and resources ('a website'), distinguishing it from siblings like 'crawl' (likely simpler) and 'smart_crawl' (likely more intelligent). It explicitly mentions use cases like mapping entire sites and building comprehensive indexes.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines: 5/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description provides explicit guidance on when to use ('Use when: mapping entire sites, finding all pages, building comprehensive indexes') and when not to use ('For persistent operations use create_session + crawl'), with clear alternatives named ('create_session + crawl'). It also distinguishes from other tools by mentioning JS execution needs and fresh browser contexts.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
