# crawl_recursive
Deep crawl websites by following internal links to map entire sites, find all pages, and build comprehensive indexes with configurable depth and page limits.
## Instructions
[STATELESS] Deep crawl a website following internal links. Use when: mapping entire sites, finding all pages, building comprehensive indexes. Control with max_depth (default 3) and max_pages (default 50). Note: May need JS execution for dynamic sites. Each page gets a fresh browser. For persistent operations use create_session + crawl.
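As an illustration, here is a hedged sketch of a typical argument payload for this tool. It mirrors the input schema below; the URL and patterns are made-up placeholders, not values from this repository.

```typescript
// Hypothetical crawl_recursive arguments: crawl a docs site, follow links up to
// two levels deep, stop after 30 pages, and skip direct links to PDFs.
// The URL and patterns are illustrative placeholders.
const crawlRecursiveArgs = {
  url: 'https://example.com/docs/', // required: starting point of the crawl
  max_depth: 2,                     // optional: defaults to 3 in the handler
  max_pages: 30,                    // optional: defaults to 50 in the handler
  include_pattern: '.*/docs/.*',    // optional: only crawl URLs under /docs/
  exclude_pattern: '.*\\.pdf$',     // optional: skip PDF links
};
```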
## Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| exclude_pattern | No | Regex to skip URLs. Example: ".*\/(login|admin).*" to avoid auth pages, ".*\.pdf$" to skip PDFs | |
| include_pattern | No | Regex to match URLs to crawl. Example: ".*\/blog\/.*" for blog posts only, ".*\.html$" for HTML pages | |
| max_depth | No | Maximum depth to follow links | 3 |
| max_pages | No | Maximum number of pages to crawl | 50 |
| url | Yes | Starting URL to crawl from | |
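In the handler (see the Implementation Reference below), both patterns are compiled with `new RegExp` and exclude_pattern is tested before include_pattern, so a URL matching both is skipped. A minimal sketch of that filtering logic, using hypothetical patterns and URLs:

```typescript
// Minimal sketch of the URL filtering applied during the crawl: exclude_pattern
// is checked first, then include_pattern. Patterns and URLs are hypothetical.
function shouldCrawl(url: string, includePattern?: string, excludePattern?: string): boolean {
  const includeRegex = includePattern ? new RegExp(includePattern) : null;
  const excludeRegex = excludePattern ? new RegExp(excludePattern) : null;
  if (excludeRegex && excludeRegex.test(url)) return false;  // explicit skips win
  if (includeRegex && !includeRegex.test(url)) return false; // must match include, if given
  return true;
}

shouldCrawl('https://example.com/blog/post-1', '.*/blog/.*');              // true
shouldCrawl('https://example.com/login', undefined, '.*/(login|admin).*'); // false: excluded
shouldCrawl('https://example.com/about', '.*/blog/.*');                    // false: include given, no match
```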
## Implementation Reference
- src/handlers/crawl-handlers.ts:196-293 (handler): The primary handler implementing recursive web crawling. It uses a BFS queue to explore internal links from the starting URL, respecting the max_depth and max_pages limits, with optional regex filtering. Each page is fetched via the /crawl endpoint, and its markdown content and internal links are extracted for further recursion.

  ```typescript
  async crawlRecursive(options: {
    url: string;
    max_depth?: number;
    max_pages?: number;
    include_pattern?: string;
    exclude_pattern?: string;
  }) {
    try {
      const startUrl = new URL(options.url);
      const visited = new Set<string>();
      const toVisit: Array<{ url: string; depth: number }> = [{ url: options.url, depth: 0 }];
      const results: Array<{ url: string; content: string; internal_links_found: number; depth: number }> = [];
      let maxDepthReached = 0;

      const includeRegex = options.include_pattern ? new RegExp(options.include_pattern) : null;
      const excludeRegex = options.exclude_pattern ? new RegExp(options.exclude_pattern) : null;
      const maxDepth = options.max_depth !== undefined ? options.max_depth : 3;
      const maxPages = options.max_pages || 50;

      while (toVisit.length > 0 && results.length < maxPages) {
        const current = toVisit.shift();
        if (!current || visited.has(current.url) || current.depth > maxDepth) {
          continue;
        }
        visited.add(current.url);

        try {
          // Check URL patterns
          if (excludeRegex && excludeRegex.test(current.url)) continue;
          if (includeRegex && !includeRegex.test(current.url)) continue;

          // Crawl the page using the crawl endpoint to get links
          const response = await this.axiosClient.post('/crawl', {
            urls: [current.url],
            crawler_config: {
              cache_mode: 'BYPASS',
            },
          });

          const crawlResults = response.data.results || [response.data];
          const result: CrawlResultItem = crawlResults[0];

          if (result && result.success) {
            const markdownContent = result.markdown?.fit_markdown || result.markdown?.raw_markdown || '';
            const internalLinksCount = result.links?.internal?.length || 0;
            maxDepthReached = Math.max(maxDepthReached, current.depth);

            results.push({
              url: current.url,
              content: markdownContent,
              internal_links_found: internalLinksCount,
              depth: current.depth,
            });

            // Add internal links to crawl queue
            if (current.depth < maxDepth && result.links?.internal) {
              for (const linkObj of result.links.internal) {
                const linkUrl = linkObj.href || linkObj;
                try {
                  const absoluteUrl = new URL(linkUrl, current.url).toString();
                  if (!visited.has(absoluteUrl) && new URL(absoluteUrl).hostname === startUrl.hostname) {
                    toVisit.push({ url: absoluteUrl, depth: current.depth + 1 });
                  }
                } catch (e) {
                  // Skip invalid URLs
                  console.debug('Invalid URL:', e);
                }
              }
            }
          }
        } catch (error) {
          // Log but continue crawling other pages
          console.error(`Failed to crawl ${current.url}:`, error instanceof Error ? error.message : error);
        }
      }

      // Prepare the output text
      let outputText = `Recursive crawl completed:\n\nPages crawled: ${results.length}\nStarting URL: ${options.url}\n`;

      if (results.length > 0) {
        outputText += `Max depth reached: ${maxDepthReached} (limit: ${maxDepth})\n\nNote: Only internal links (same domain) are followed during recursive crawling.\n\nPages found:\n${results.map((r) => `- [Depth ${r.depth}] ${r.url}\n  Content: ${r.content.length} chars\n  Internal links found: ${r.internal_links_found}`).join('\n')}`;
      } else {
        outputText += `\nNo pages could be crawled. This might be due to:\n- The starting URL returned an error\n- No internal links were found\n- All discovered links were filtered out by include/exclude patterns`;
      }

      return {
        content: [
          {
            type: 'text',
            text: outputText,
          },
        ],
      };
    } catch (error) {
      throw this.formatError(error, 'crawl recursively');
    }
  }
  ```
- Zod validation schema defining the input parameters for the crawl_recursive tool, including the required URL and the optional limits and patterns:

  ```typescript
  export const CrawlRecursiveSchema = createStatelessSchema(
    z.object({
      url: z.string().url(),
      max_depth: z.number().optional(),
      max_pages: z.number().optional(),
      include_pattern: z.string().optional(),
      exclude_pattern: z.string().optional(),
    }),
    'crawl_recursive',
  );
  ```
- src/server.ts:870-873 (registration): Registration and dispatching logic in the MCP server switch statement, validating args with CrawlRecursiveSchema and calling the crawlHandlers.crawlRecursive method:

  ```typescript
  case 'crawl_recursive':
    return await this.validateAndExecute(
      'crawl_recursive',
      args,
      CrawlRecursiveSchema,
      async (validatedArgs) => this.crawlHandlers.crawlRecursive(validatedArgs),
    );
  ```
- src/server.ts:309-343 (registration): Tool metadata registration in the MCP server's listTools response, providing the name, description, and inputSchema for the crawl_recursive tool:

  ```typescript
  {
    name: 'crawl_recursive',
    description: '[STATELESS] Deep crawl a website following internal links. Use when: mapping entire sites, finding all pages, building comprehensive indexes. Control with max_depth (default 3) and max_pages (default 50). Note: May need JS execution for dynamic sites. Each page gets a fresh browser. For persistent operations use create_session + crawl.',
    inputSchema: {
      type: 'object',
      properties: {
        url: {
          type: 'string',
          description: 'Starting URL to crawl from',
        },
        max_depth: {
          type: 'number',
          description: 'Maximum depth to follow links',
          default: 3,
        },
        max_pages: {
          type: 'number',
          description: 'Maximum number of pages to crawl',
          default: 50,
        },
        include_pattern: {
          type: 'string',
          description: 'Regex to match URLs to crawl. Example: ".*\\/blog\\/.*" for blog posts only, ".*\\.html$" for HTML pages',
        },
        exclude_pattern: {
          type: 'string',
          description: 'Regex to skip URLs. Example: ".*\\/(login|admin).*" to avoid auth pages, ".*\\.pdf$" to skip PDFs',
        },
      },
      required: ['url'],
    },
  },
  ```
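For context, here is a hedged sketch of invoking the registered tool from the TypeScript MCP SDK client. The transport command, script path, and client name are assumptions for illustration, not values taken from this repository; consult the project's README for the actual connection details.

```typescript
import { Client } from '@modelcontextprotocol/sdk/client/index.js';
import { StdioClientTransport } from '@modelcontextprotocol/sdk/client/stdio.js';

// Sketch only: the server command and script path below are placeholders.
const transport = new StdioClientTransport({
  command: 'node',
  args: ['dist/server.js'],
});

const client = new Client({ name: 'example-client', version: '1.0.0' });
await client.connect(transport);

// Invoke crawl_recursive with the parameters documented above; the tool returns
// a single text content block summarizing pages crawled, depths, and link counts.
const result = await client.callTool({
  name: 'crawl_recursive',
  arguments: {
    url: 'https://example.com/docs/',
    max_depth: 2,
    max_pages: 30,
  },
});

console.log(result.content);
```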