
MCP Server for Crawl4AI

by omgwtfwow

smart_crawl

Automatically detect and crawl web content by analyzing URL types and formats, handling HTML pages, sitemaps, RSS feeds, and text content with adaptive parsing strategies.

Instructions

[STATELESS] Auto-detect and handle different content types (HTML, sitemap, RSS, text). Use when: URL type is unknown, crawling feeds/sitemaps, or want automatic format handling. Adapts strategy based on content. Creates new browser each time. For persistent operations use create_session + crawl.

Input Schema

| Name | Required | Description | Default |
| ---- | -------- | ----------- | ------- |
| url | Yes | The URL to crawl intelligently | |
| max_depth | No | Maximum crawl depth for sitemaps | 2 |
| follow_links | No | For sitemaps/RSS: crawl found URLs (max 10). For HTML: no effect | false |
| bypass_cache | No | Force fresh crawl | false |
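Because the tool is stateless, a single tools/call round trip is all that is needed. Below is a minimal sketch of invoking smart_crawl from a TypeScript MCP client. The @modelcontextprotocol/sdk imports are the official SDK, but the launch command, package name, and argument values are assumptions; adjust them to however you run this server, and point it at a running Crawl4AI instance per the project's README.

```typescript
// Minimal sketch: call smart_crawl over stdio using the official MCP SDK.
// The launch command below is an assumption; adjust to your setup.
import { Client } from '@modelcontextprotocol/sdk/client/index.js';
import { StdioClientTransport } from '@modelcontextprotocol/sdk/client/stdio.js';

async function main() {
  const transport = new StdioClientTransport({
    command: 'npx',
    args: ['mcp-crawl4ai-ts'], // hypothetical launch command
  });

  const client = new Client({ name: 'example-client', version: '1.0.0' });
  await client.connect(transport);

  // Arguments mirror the input schema above; only `url` is required.
  const result = await client.callTool({
    name: 'smart_crawl',
    arguments: {
      url: 'https://example.com/sitemap.xml',
      follow_links: true, // crawl up to 10 URLs found in the sitemap
      bypass_cache: true, // force a fresh crawl
    },
  });

  console.log(result.content);
  await client.close();
}

main().catch(console.error);
```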

Implementation Reference

  • Main handler function that implements the smart_crawl logic: detects the content type (HTML, sitemap, RSS, etc.), performs the crawl, and optionally follows links found in sitemaps/RSS (a standalone sketch of the URL-extraction step follows this list).

```typescript
async smartCrawl(options: { url: string; max_depth?: number; follow_links?: boolean; bypass_cache?: boolean }) {
  try {
    // First, try to detect the content type from URL or HEAD request
    let contentType = '';
    try {
      const headResponse = await this.axiosClient.head(options.url);
      contentType = headResponse.headers['content-type'] || '';
    } catch {
      // If HEAD request fails, continue anyway - we'll detect from the crawl response
      console.debug('HEAD request failed, will detect content type from response');
    }

    let detectedType = 'html';
    if (options.url.includes('sitemap') || options.url.endsWith('.xml')) {
      detectedType = 'sitemap';
    } else if (options.url.includes('rss') || options.url.includes('feed')) {
      detectedType = 'rss';
    } else if (contentType.includes('text/plain') || options.url.endsWith('.txt')) {
      detectedType = 'text';
    } else if (contentType.includes('application/xml') || contentType.includes('text/xml')) {
      detectedType = 'xml';
    } else if (contentType.includes('application/json')) {
      detectedType = 'json';
    }

    // Crawl without the unsupported 'strategy' parameter
    const response = await this.axiosClient.post('/crawl', {
      urls: [options.url],
      crawler_config: {
        cache_mode: options.bypass_cache ? 'BYPASS' : 'ENABLED',
      },
      browser_config: {
        headless: true,
        browser_type: 'chromium',
      },
    });

    const results = response.data.results || [];
    const result = results[0] || {};

    // Handle follow_links for sitemaps and RSS feeds
    if (options.follow_links && (detectedType === 'sitemap' || detectedType === 'rss' || detectedType === 'xml')) {
      // Extract URLs from the content
      const urlPattern = /<loc>(.*?)<\/loc>|<link[^>]*>(.*?)<\/link>|href=["']([^"']+)["']/gi;
      const content = result.markdown || result.html || '';
      const foundUrls: string[] = [];
      let match;
      while ((match = urlPattern.exec(content)) !== null) {
        const url = match[1] || match[2] || match[3];
        if (url && url.startsWith('http')) {
          foundUrls.push(url);
        }
      }

      if (foundUrls.length > 0) {
        // Limit to first 10 URLs to avoid overwhelming the system
        const urlsToFollow = foundUrls.slice(0, Math.min(10, options.max_depth || 10));

        // Crawl the found URLs
        await this.axiosClient.post('/crawl', {
          urls: urlsToFollow,
          max_concurrent: 3,
          bypass_cache: options.bypass_cache,
        });

        return {
          content: [
            {
              type: 'text',
              text: `Smart crawl detected content type: ${detectedType}\n\nMain content:\n${result.markdown?.raw_markdown || result.html || 'No content extracted'}\n\n---\nFollowed ${urlsToFollow.length} links:\n${urlsToFollow.map((url, i) => `${i + 1}. ${url}`).join('\n')}`,
            },
            ...(result.metadata
              ? [
                  {
                    type: 'text',
                    text: `\n\n---\nMetadata:\n${JSON.stringify(result.metadata, null, 2)}`,
                  },
                ]
              : []),
          ],
        };
      }
    }

    return {
      content: [
        {
          type: 'text',
          text: `Smart crawl detected content type: ${detectedType}\n\n${result.markdown?.raw_markdown || result.html || 'No content extracted'}`,
        },
        ...(result.metadata
          ? [
              {
                type: 'text',
                text: `\n\n---\nMetadata:\n${JSON.stringify(result.metadata, null, 2)}`,
              },
            ]
          : []),
      ],
    };
  } catch (error) {
    throw this.formatError(error, 'smart crawl');
  }
}
```
  • Zod schema for validating smart_crawl input parameters: url (required); max_depth, follow_links, and bypass_cache (all optional). A standalone validation sketch also follows this list.

```typescript
export const SmartCrawlSchema = createStatelessSchema(
  z.object({
    url: z.string().url(),
    max_depth: z.number().optional(),
    follow_links: z.boolean().optional(),
    bypass_cache: z.boolean().optional(),
  }),
  'smart_crawl',
);
```
  • src/server.ts:852-855 (registration)
    Tool registration in the switch statement: validates args with SmartCrawlSchema and calls this.crawlHandlers.smartCrawl.

```typescript
case 'smart_crawl':
  return await this.validateAndExecute(
    'smart_crawl',
    args,
    SmartCrawlSchema,
    async (validatedArgs) => this.crawlHandlers.smartCrawl(validatedArgs),
  );
```
  • src/server.ts:244-272 (registration)
    Tool definition in the listTools response: name, description, and inputSchema for smart_crawl.

```typescript
{
  name: 'smart_crawl',
  description:
    '[STATELESS] Auto-detect and handle different content types (HTML, sitemap, RSS, text). Use when: URL type is unknown, crawling feeds/sitemaps, or want automatic format handling. Adapts strategy based on content. Creates new browser each time. For persistent operations use create_session + crawl.',
  inputSchema: {
    type: 'object',
    properties: {
      url: {
        type: 'string',
        description: 'The URL to crawl intelligently',
      },
      max_depth: {
        type: 'number',
        description: 'Maximum crawl depth for sitemaps',
        default: 2,
      },
      follow_links: {
        type: 'boolean',
        description: 'For sitemaps/RSS: crawl found URLs (max 10). For HTML: no effect',
        default: false,
      },
      bypass_cache: {
        type: 'boolean',
        description: 'Force fresh crawl',
        default: false,
      },
    },
    required: ['url'],
  },
},
```
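The follow_links branch of the handler hinges on a single regex that pulls URLs out of sitemap/RSS markup. Here is a standalone sketch of that extraction, using the exact pattern from the handler against an invented sample sitemap:

```typescript
// Standalone illustration of the URL extraction used in the follow_links branch.
// The sitemap snippet is invented sample data.
const sampleSitemap = `
<urlset>
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/docs</loc></url>
  <url><loc>https://example.com/blog</loc></url>
</urlset>`;

const urlPattern = /<loc>(.*?)<\/loc>|<link[^>]*>(.*?)<\/link>|href=["']([^"']+)["']/gi;
const foundUrls: string[] = [];
let match;
while ((match = urlPattern.exec(sampleSitemap)) !== null) {
  const url = match[1] || match[2] || match[3];
  if (url && url.startsWith('http')) {
    foundUrls.push(url);
  }
}

console.log(foundUrls);
// => ['https://example.com/', 'https://example.com/docs', 'https://example.com/blog']
```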
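And a small sketch of how the inner Zod object behaves on its own (createStatelessSchema is a project helper whose wrapping behavior isn't shown in this reference, so only the inner object is exercised here):

```typescript
import { z } from 'zod';

// The inner object from SmartCrawlSchema, shown standalone for illustration.
const smartCrawlArgs = z.object({
  url: z.string().url(),
  max_depth: z.number().optional(),
  follow_links: z.boolean().optional(),
  bypass_cache: z.boolean().optional(),
});

// Valid input: only `url` is required.
smartCrawlArgs.parse({ url: 'https://example.com/feed', follow_links: true });

// Invalid input: not a URL, so safeParse reports failure instead of throwing.
const bad = smartCrawlArgs.safeParse({ url: 'not-a-url' });
console.log(bad.success); // false
```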
