MCP Server for Crawl4AI

by omgwtfwow

crawl_recursive

Deep crawl websites by following internal links to map entire sites, find all pages, and build comprehensive indexes with configurable depth and page limits.

Instructions

[STATELESS] Deep crawl a website following internal links. Use when: mapping entire sites, finding all pages, building comprehensive indexes. Control with max_depth (default 3) and max_pages (default 50). Note: May need JS execution for dynamic sites. Each page gets a fresh browser. For persistent operations use create_session + crawl.
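
As a concrete illustration, here is a minimal sketch of calling this tool from a TypeScript MCP client built on @modelcontextprotocol/sdk. The transport command, client name, and target URL are placeholder assumptions, not taken from this server's documentation:

    import { Client } from '@modelcontextprotocol/sdk/client/index.js';
    import { StdioClientTransport } from '@modelcontextprotocol/sdk/client/stdio.js';

    // Hypothetical setup: adjust command/args to however you launch the server locally.
    const transport = new StdioClientTransport({ command: 'node', args: ['dist/index.js'] });
    const client = new Client({ name: 'example-client', version: '1.0.0' });
    await client.connect(transport);

    // Stateless call: each crawled page gets a fresh browser on the server side.
    const result = await client.callTool({
      name: 'crawl_recursive',
      arguments: {
        url: 'https://example.com',
        max_depth: 2,  // follow links at most 2 levels deep (default 3)
        max_pages: 20, // stop after 20 pages (default 50)
      },
    });

    console.log(result.content); // a single text block summarizing the crawl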

Input Schema

  • url (required): Starting URL to crawl from
  • max_depth (optional, default 3): Maximum depth to follow links
  • max_pages (optional, default 50): Maximum number of pages to crawl
  • include_pattern (optional): Regex to match URLs to crawl. Example: ".*\/blog\/.*" for blog posts only, ".*\.html$" for HTML pages
  • exclude_pattern (optional): Regex to skip URLs. Example: ".*\/(login|admin).*" to avoid auth pages, ".*\.pdf$" to skip PDFs
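
The patterns are applied as plain JavaScript regular expressions against each discovered URL, with exclude_pattern checked before include_pattern (see the handler below). A small sketch of how the example patterns from the list behave, using made-up URLs:

    // Illustrative only: the same RegExp semantics the handler uses internally.
    const include = new RegExp('.*\\/blog\\/.*');        // crawl blog posts only
    const exclude = new RegExp('.*\\/(login|admin).*');  // skip auth pages

    const candidates = [
      'https://example.com/blog/hello-world',
      'https://example.com/admin/settings',
      'https://example.com/about',
    ];

    for (const url of candidates) {
      const skipped = exclude.test(url);           // exclude wins if both match
      const crawled = !skipped && include.test(url);
      console.log(url, crawled ? 'crawled' : 'skipped');
    }
    // Prints: crawled, skipped, skipped (in that order)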

Implementation Reference

  • Main handler for the 'crawl_recursive' tool. Implements recursive crawling as a breadth-first search (BFS), visiting internal links up to max_depth (default 3) and max_pages (default 50), with optional include/exclude regex filters. Calls the /crawl endpoint for each page and collects per-page summaries. A stripped-down sketch of the traversal bookkeeping appears after this list.
    async crawlRecursive(options: {
      url: string;
      max_depth?: number;
      max_pages?: number;
      include_pattern?: string;
      exclude_pattern?: string;
    }) {
      try {
        const startUrl = new URL(options.url);
        const visited = new Set<string>();
        const toVisit: Array<{ url: string; depth: number }> = [{ url: options.url, depth: 0 }];
        const results: Array<{ url: string; content: string; internal_links_found: number; depth: number }> = [];
        let maxDepthReached = 0;
        const includeRegex = options.include_pattern ? new RegExp(options.include_pattern) : null;
        const excludeRegex = options.exclude_pattern ? new RegExp(options.exclude_pattern) : null;
        const maxDepth = options.max_depth !== undefined ? options.max_depth : 3;
        const maxPages = options.max_pages || 50;

        while (toVisit.length > 0 && results.length < maxPages) {
          const current = toVisit.shift();
          if (!current || visited.has(current.url) || current.depth > maxDepth) {
            continue;
          }
          visited.add(current.url);

          try {
            // Check URL patterns
            if (excludeRegex && excludeRegex.test(current.url)) continue;
            if (includeRegex && !includeRegex.test(current.url)) continue;

            // Crawl the page using the crawl endpoint to get links
            const response = await this.axiosClient.post('/crawl', {
              urls: [current.url],
              crawler_config: {
                cache_mode: 'BYPASS',
              },
            });

            const crawlResults = response.data.results || [response.data];
            const result: CrawlResultItem = crawlResults[0];

            if (result && result.success) {
              const markdownContent = result.markdown?.fit_markdown || result.markdown?.raw_markdown || '';
              const internalLinksCount = result.links?.internal?.length || 0;
              maxDepthReached = Math.max(maxDepthReached, current.depth);

              results.push({
                url: current.url,
                content: markdownContent,
                internal_links_found: internalLinksCount,
                depth: current.depth,
              });

              // Add internal links to crawl queue
              if (current.depth < maxDepth && result.links?.internal) {
                for (const linkObj of result.links.internal) {
                  const linkUrl = linkObj.href || linkObj;
                  try {
                    const absoluteUrl = new URL(linkUrl, current.url).toString();
                    if (!visited.has(absoluteUrl) && new URL(absoluteUrl).hostname === startUrl.hostname) {
                      toVisit.push({ url: absoluteUrl, depth: current.depth + 1 });
                    }
                  } catch (e) {
                    // Skip invalid URLs
                    console.debug('Invalid URL:', e);
                  }
                }
              }
            }
          } catch (error) {
            // Log but continue crawling other pages
            console.error(`Failed to crawl ${current.url}:`, error instanceof Error ? error.message : error);
          }
        }

        // Prepare the output text
        let outputText = `Recursive crawl completed:\n\nPages crawled: ${results.length}\nStarting URL: ${options.url}\n`;
        if (results.length > 0) {
          outputText += `Max depth reached: ${maxDepthReached} (limit: ${maxDepth})\n\nNote: Only internal links (same domain) are followed during recursive crawling.\n\nPages found:\n${results
            .map(
              (r) =>
                `- [Depth ${r.depth}] ${r.url}\n Content: ${r.content.length} chars\n Internal links found: ${r.internal_links_found}`,
            )
            .join('\n')}`;
        } else {
          outputText += `\nNo pages could be crawled. This might be due to:\n- The starting URL returned an error\n- No internal links were found\n- All discovered links were filtered out by include/exclude patterns`;
        }

        return {
          content: [
            {
              type: 'text',
              text: outputText,
            },
          ],
        };
      } catch (error) {
        throw this.formatError(error, 'crawl recursively');
      }
    }
  • Zod input schema validation for the crawl_recursive tool parameters.
    export const CrawlRecursiveSchema = createStatelessSchema(
      z.object({
        url: z.string().url(),
        max_depth: z.number().optional(),
        max_pages: z.number().optional(),
        include_pattern: z.string().optional(),
        exclude_pattern: z.string().optional(),
      }),
      'crawl_recursive',
    );
  • src/server.ts:870-873 (registration)
    Tool registration and dispatching in the switch statement of the callToolRequestHandler. Validates args with CrawlRecursiveSchema and delegates to crawlHandlers.crawlRecursive.
    case 'crawl_recursive':
      return await this.validateAndExecute(
        'crawl_recursive',
        args,
        CrawlRecursiveSchema,
        async (validatedArgs) => this.crawlHandlers.crawlRecursive(validatedArgs),
      );
  • src/server.ts:951-952 (accessor)
    Protected accessor method on the Crawl4AIServer class that delegates to the crawlHandlers instance.
    protected async crawlRecursive(options: Parameters<CrawlHandlers['crawlRecursive']>[0]) {
      return this.crawlHandlers.crawlRecursive(options);
    }
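
To make the queue discipline in the main handler easier to follow, below is a stripped-down sketch of just the traversal bookkeeping: the visited set, the same-hostname check, and the depth/page limits. The fetchLinks stub stands in for the /crawl HTTP call and is hypothetical; the rest mirrors the handler above.

    // Hypothetical stand-in for the /crawl call: returns a page's internal links.
    type FetchLinks = (url: string) => Promise<string[]>;

    async function traverse(startUrl: string, fetchLinks: FetchLinks, maxDepth = 3, maxPages = 50) {
      const origin = new URL(startUrl);
      const visited = new Set<string>();
      const toVisit: Array<{ url: string; depth: number }> = [{ url: startUrl, depth: 0 }];
      const crawled: Array<{ url: string; depth: number }> = [];

      // Breadth-first: shift() removes the oldest entry, so all depth-0 pages
      // are processed before depth-1 pages, and so on.
      while (toVisit.length > 0 && crawled.length < maxPages) {
        const current = toVisit.shift();
        if (!current || visited.has(current.url) || current.depth > maxDepth) continue;
        visited.add(current.url);

        crawled.push({ url: current.url, depth: current.depth });
        if (current.depth >= maxDepth) continue; // do not expand links past the depth limit

        for (const link of await fetchLinks(current.url)) {
          try {
            const absolute = new URL(link, current.url).toString();
            // Only same-hostname links are enqueued, matching the handler's behavior.
            if (!visited.has(absolute) && new URL(absolute).hostname === origin.hostname) {
              toVisit.push({ url: absolute, depth: current.depth + 1 });
            }
          } catch {
            // Ignore links that are not valid URLs.
          }
        }
      }
      return crawled;
    }

Calling traverse('https://example.com', stub) with a stub that returns a fixed link list yields pages in depth order, which is the same ordering reflected in the "Pages found" listing of the tool's text output.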

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/omgwtfwow/mcp-crawl4ai-ts'

If you have feedback or need assistance with the MCP directory API, please join our Discord server.