
MCP Server for Crawl4AI

by omgwtfwow

crawl_recursive

Deep crawl websites by following internal links to map entire sites, find all pages, and build comprehensive indexes with configurable depth and page limits.

Instructions

[STATELESS] Deep crawl a website following internal links. Use when: mapping entire sites, finding all pages, building comprehensive indexes. Control with max_depth (default 3) and max_pages (default 50). Note: May need JS execution for dynamic sites. Each page gets a fresh browser. For persistent operations use create_session + crawl.
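
For orientation, a call from an MCP client might look like the sketch below. The client setup is illustrative only (it assumes the @modelcontextprotocol/sdk client and a stdio launch command for this server); the tool name and argument names match the schema that follows.

  // Hypothetical client-side invocation; adjust the transport command to however
  // you actually launch the server.
  import { Client } from '@modelcontextprotocol/sdk/client/index.js';
  import { StdioClientTransport } from '@modelcontextprotocol/sdk/client/stdio.js';

  const client = new Client({ name: 'example-client', version: '1.0.0' });
  await client.connect(new StdioClientTransport({ command: 'npx', args: ['mcp-crawl4ai-ts'] }));

  const result = await client.callTool({
    name: 'crawl_recursive',
    arguments: {
      url: 'https://example.com',
      max_depth: 2, // follow links at most two levels below the start URL
      max_pages: 25, // stop after 25 pages have been crawled
    },
  });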

Input Schema

url (required): Starting URL to crawl from
max_depth (optional, default 3): Maximum depth to follow links
max_pages (optional, default 50): Maximum number of pages to crawl
include_pattern (optional): Regex to match URLs to crawl. Example: ".*\/blog\/.*" for blog posts only, ".*\.html$" for HTML pages
exclude_pattern (optional): Regex to skip URLs. Example: ".*\/(login|admin).*" to avoid auth pages, ".*\.pdf$" to skip PDFs
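
The include/exclude patterns are plain JavaScript regular expressions tested against each discovered URL. The sketch below is illustrative only, mirroring the filtering order used by the handler further down (exclude is checked first, then include must match):

  // Example URLs and patterns are made up; only the filtering logic mirrors the handler.
  const includeRegex = new RegExp('.*/blog/.*');
  const excludeRegex = new RegExp('.*\\.pdf$');

  const discovered = [
    'https://example.com/blog/post-1',
    'https://example.com/blog/whitepaper.pdf',
    'https://example.com/about',
  ];

  const crawlable = discovered.filter((url) => !excludeRegex.test(url) && includeRegex.test(url));
  // -> ['https://example.com/blog/post-1']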

Implementation Reference

  • Main handler function for the 'crawl_recursive' tool. Implements a breadth-first crawl that follows internal links up to max_depth (default 3) and stops after max_pages (default 50), with optional include/exclude regex filters. Each page is fetched through the /crawl endpoint and a per-page summary is collected.
    async crawlRecursive(options: {
      url: string;
      max_depth?: number;
      max_pages?: number;
      include_pattern?: string;
      exclude_pattern?: string;
    }) {
      try {
        const startUrl = new URL(options.url);
        const visited = new Set<string>();
        const toVisit: Array<{ url: string; depth: number }> = [{ url: options.url, depth: 0 }];
        const results: Array<{ url: string; content: string; internal_links_found: number; depth: number }> = [];
        let maxDepthReached = 0;
    
        const includeRegex = options.include_pattern ? new RegExp(options.include_pattern) : null;
        const excludeRegex = options.exclude_pattern ? new RegExp(options.exclude_pattern) : null;
    
        const maxDepth = options.max_depth !== undefined ? options.max_depth : 3;
        const maxPages = options.max_pages || 50;
    
        while (toVisit.length > 0 && results.length < maxPages) {
          const current = toVisit.shift();
          if (!current || visited.has(current.url) || current.depth > maxDepth) {
            continue;
          }
    
          visited.add(current.url);
    
          try {
            // Check URL patterns
            if (excludeRegex && excludeRegex.test(current.url)) continue;
            if (includeRegex && !includeRegex.test(current.url)) continue;
    
            // Crawl the page using the crawl endpoint to get links
            const response = await this.axiosClient.post('/crawl', {
              urls: [current.url],
              crawler_config: {
                cache_mode: 'BYPASS',
              },
            });
    
            const crawlResults = response.data.results || [response.data];
            const result: CrawlResultItem = crawlResults[0];
    
            if (result && result.success) {
              const markdownContent = result.markdown?.fit_markdown || result.markdown?.raw_markdown || '';
              const internalLinksCount = result.links?.internal?.length || 0;
              maxDepthReached = Math.max(maxDepthReached, current.depth);
              results.push({
                url: current.url,
                content: markdownContent,
                internal_links_found: internalLinksCount,
                depth: current.depth,
              });
    
              // Add internal links to crawl queue
              if (current.depth < maxDepth && result.links?.internal) {
                for (const linkObj of result.links.internal) {
                  const linkUrl = linkObj.href || linkObj;
                  try {
                    const absoluteUrl = new URL(linkUrl, current.url).toString();
                    if (!visited.has(absoluteUrl) && new URL(absoluteUrl).hostname === startUrl.hostname) {
                      toVisit.push({ url: absoluteUrl, depth: current.depth + 1 });
                    }
                  } catch (e) {
                    // Skip invalid URLs
                    console.debug('Invalid URL:', e);
                  }
                }
              }
            }
          } catch (error) {
            // Log but continue crawling other pages
            console.error(`Failed to crawl ${current.url}:`, error instanceof Error ? error.message : error);
          }
        }
    
        // Prepare the output text
        let outputText = `Recursive crawl completed:\n\nPages crawled: ${results.length}\nStarting URL: ${options.url}\n`;
    
        if (results.length > 0) {
          outputText += `Max depth reached: ${maxDepthReached} (limit: ${maxDepth})\n\nNote: Only internal links (same domain) are followed during recursive crawling.\n\nPages found:\n${results.map((r) => `- [Depth ${r.depth}] ${r.url}\n  Content: ${r.content.length} chars\n  Internal links found: ${r.internal_links_found}`).join('\n')}`;
        } else {
          outputText += `\nNo pages could be crawled. This might be due to:\n- The starting URL returned an error\n- No internal links were found\n- All discovered links were filtered out by include/exclude patterns`;
        }
    
        return {
          content: [
            {
              type: 'text',
              text: outputText,
            },
          ],
        };
      } catch (error) {
        throw this.formatError(error, 'crawl recursively');
      }
    }
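
    The handler only relies on a few fields of each /crawl result. A minimal CrawlResultItem shape consistent with the property accesses above is sketched here; the project's actual type is presumably wider.

    // Assumed minimal shape, derived from the property accesses in crawlRecursive,
    // not from the project's real type definition.
    interface CrawlResultItem {
      success: boolean;
      markdown?: {
        raw_markdown?: string;
        fit_markdown?: string;
      };
      links?: {
        // the handler tolerates either link objects or plain href strings
        internal?: Array<{ href: string } | string>;
      };
    }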
  • Zod input schema validation for the crawl_recursive tool parameters.
    export const CrawlRecursiveSchema = createStatelessSchema(
      z.object({
        url: z.string().url(),
        max_depth: z.number().optional(),
        max_pages: z.number().optional(),
        include_pattern: z.string().optional(),
        exclude_pattern: z.string().optional(),
      }),
      'crawl_recursive',
    );
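
    Assuming createStatelessSchema returns an ordinary Zod schema (it appears to wrap the object schema with the tool name for stateless-mode handling), validation is a plain parse call; the values below are illustrative:

    // Illustrative only; createStatelessSchema's exact behaviour is project-specific.
    const validated = CrawlRecursiveSchema.parse({
      url: 'https://example.com',
      max_depth: 2,
      include_pattern: '.*/docs/.*',
    });
    // Invalid input (e.g. a non-URL string for `url`) throws a ZodError instead.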
  • src/server.ts:870-873 (registration)
    Tool registration and dispatching in the switch statement of the callToolRequestHandler. Validates args with CrawlRecursiveSchema and delegates to crawlHandlers.crawlRecursive.
    case 'crawl_recursive':
      return await this.validateAndExecute('crawl_recursive', args, CrawlRecursiveSchema, async (validatedArgs) =>
        this.crawlHandlers.crawlRecursive(validatedArgs),
      );
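
    validateAndExecute itself is not shown here; a plausible minimal shape, assuming it simply parses the arguments with the supplied Zod schema and forwards the result to the handler, is sketched below. The real method may add error formatting or logging.

    // Hypothetical sketch only; signature and behaviour inferred from the call site above.
    private async validateAndExecute<T>(
      toolName: string, // likely used in error messages by the real implementation
      args: unknown,
      schema: z.ZodType<T>,
      handler: (validatedArgs: T) => Promise<unknown>,
    ) {
      const validatedArgs = schema.parse(args); // throws ZodError on invalid input
      return handler(validatedArgs);
    }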
  • src/server.ts:951-952 (accessor)
    Public accessor method in Crawl4AIServer class that delegates to the crawlHandlers instance.
    protected async crawlRecursive(options: Parameters<CrawlHandlers['crawlRecursive']>[0]) {
      return this.crawlHandlers.crawlRecursive(options);
    }
