
MCP Server for Crawl4AI

by omgwtfwow

smart_crawl

Automatically detect and crawl web content types including HTML, sitemaps, RSS feeds, and text documents for data extraction and analysis.

Instructions

[STATELESS] Auto-detect and handle different content types (HTML, sitemap, RSS, text). Use when: URL type is unknown, crawling feeds/sitemaps, or want automatic format handling. Adapts strategy based on content. Creates new browser each time. For persistent operations use create_session + crawl.

Input Schema

| Name | Required | Description | Default |
| --- | --- | --- | --- |
| url | Yes | The URL to crawl intelligently | (none) |
| max_depth | No | Maximum crawl depth for sitemaps | 2 |
| follow_links | No | For sitemaps/RSS: crawl found URLs (max 10). For HTML: no effect | false |
| bypass_cache | No | Force fresh crawl | false |
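
For example, to crawl a sitemap and follow the URLs it lists, a client could pass arguments like these (illustrative values; the URL is hypothetical):

    // Illustrative arguments for a smart_crawl tool call
    const args = {
      url: 'https://example.com/sitemap.xml',
      follow_links: true, // crawl up to 10 URLs found in the sitemap
      bypass_cache: true, // force a fresh fetch instead of cached results
    };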

Implementation Reference

  • Core handler function that implements smart_crawl logic: detects content type (HTML, sitemap, RSS, etc.), performs crawl, optionally follows links from sitemaps/RSS, bypasses cache if specified, and returns structured content with metadata.
    async smartCrawl(options: { url: string; max_depth?: number; follow_links?: boolean; bypass_cache?: boolean }) {
      try {
        // First, try to detect the content type from URL or HEAD request
        let contentType = '';
        try {
          const headResponse = await this.axiosClient.head(options.url);
          contentType = headResponse.headers['content-type'] || '';
        } catch {
          // If HEAD request fails, continue anyway - we'll detect from the crawl response
          console.debug('HEAD request failed, will detect content type from response');
        }
    
        let detectedType = 'html';
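        // Note: these URL heuristics take precedence over the Content-Type
        // header, so a URL containing 'sitemap' is treated as a sitemap even
        // if the server reports text/html.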
        if (options.url.includes('sitemap') || options.url.endsWith('.xml')) {
          detectedType = 'sitemap';
        } else if (options.url.includes('rss') || options.url.includes('feed')) {
          detectedType = 'rss';
        } else if (contentType.includes('text/plain') || options.url.endsWith('.txt')) {
          detectedType = 'text';
        } else if (contentType.includes('application/xml') || contentType.includes('text/xml')) {
          detectedType = 'xml';
        } else if (contentType.includes('application/json')) {
          detectedType = 'json';
        }
    
        // Crawl without the unsupported 'strategy' parameter
        const response = await this.axiosClient.post('/crawl', {
          urls: [options.url],
          crawler_config: {
            cache_mode: options.bypass_cache ? 'BYPASS' : 'ENABLED',
          },
          browser_config: {
            headless: true,
            browser_type: 'chromium',
          },
        });
    
        const results = response.data.results || [];
        const result = results[0] || {};
    
        // Handle follow_links for sitemaps and RSS feeds
        if (options.follow_links && (detectedType === 'sitemap' || detectedType === 'rss' || detectedType === 'xml')) {
          // Extract URLs from the content
          const urlPattern = /<loc>(.*?)<\/loc>|<link[^>]*>(.*?)<\/link>|href=["']([^"']+)["']/gi;
          // Use the same markdown shape (raw_markdown) as the response formatting below
          const content = result.markdown?.raw_markdown || result.html || '';
          const foundUrls: string[] = [];
          let match;
    
          while ((match = urlPattern.exec(content)) !== null) {
            const url = match[1] || match[2] || match[3];
            if (url && url.startsWith('http')) {
              foundUrls.push(url);
            }
          }
    
          if (foundUrls.length > 0) {
            // Follow at most 10 URLs; max_depth acts here as an additional cap
            // on the URL count, not as a true crawl depth
            const urlsToFollow = foundUrls.slice(0, Math.min(10, options.max_depth || 10));
    
            // Crawl the found URLs (the response is awaited, but its content is
            // not merged into the result returned below; only the URL list is reported)
            await this.axiosClient.post('/crawl', {
              urls: urlsToFollow,
              max_concurrent: 3,
              bypass_cache: options.bypass_cache,
            });
    
            return {
              content: [
                {
                  type: 'text',
                  text: `Smart crawl detected content type: ${detectedType}\n\nMain content:\n${result.markdown?.raw_markdown || result.html || 'No content extracted'}\n\n---\nFollowed ${urlsToFollow.length} links:\n${urlsToFollow.map((url, i) => `${i + 1}. ${url}`).join('\n')}`,
                },
                ...(result.metadata
                  ? [
                      {
                        type: 'text',
                        text: `\n\n---\nMetadata:\n${JSON.stringify(result.metadata, null, 2)}`,
                      },
                    ]
                  : []),
              ],
            };
          }
        }
    
        return {
          content: [
            {
              type: 'text',
              text: `Smart crawl detected content type: ${detectedType}\n\n${result.markdown?.raw_markdown || result.html || 'No content extracted'}`,
            },
            ...(result.metadata
              ? [
                  {
                    type: 'text',
                    text: `\n\n---\nMetadata:\n${JSON.stringify(result.metadata, null, 2)}`,
                  },
                ]
              : []),
          ],
        };
      } catch (error) {
        throw this.formatError(error, 'smart crawl');
      }
    }
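  • Illustrative sketch (not part of the source): the link-extraction step above can be exercised in isolation by applying the same regex to a sitemap snippet. The extractUrls helper and the sample XML below are hypothetical.
    // Hypothetical helper mirroring the extraction loop in smartCrawl
    function extractUrls(content: string, limit = 10): string[] {
      const urlPattern = /<loc>(.*?)<\/loc>|<link[^>]*>(.*?)<\/link>|href=["']([^"']+)["']/gi;
      const found: string[] = [];
      let match: RegExpExecArray | null;
      while ((match = urlPattern.exec(content)) !== null) {
        const url = match[1] || match[2] || match[3];
        if (url && url.startsWith('http')) {
          found.push(url);
        }
      }
      return found.slice(0, limit);
    }

    // Example: absolute <loc> entries are extracted; relative paths are skipped
    const sitemap = `<urlset>
      <url><loc>https://example.com/a</loc></url>
      <url><loc>https://example.com/b</loc></url>
      <url><loc>/relative/path</loc></url>
    </urlset>`;
    console.log(extractUrls(sitemap)); // ['https://example.com/a', 'https://example.com/b']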
  • Zod schema defining input validation for smart_crawl tool: requires URL, optional max_depth, follow_links, bypass_cache.
    export const SmartCrawlSchema = createStatelessSchema(
      z.object({
        url: z.string().url(),
        max_depth: z.number().optional(),
        follow_links: z.boolean().optional(),
        bypass_cache: z.boolean().optional(),
      }),
      'smart_crawl',
    );
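  • Assuming createStatelessSchema wraps and returns a standard Zod schema, validation behaves as in this minimal sketch (the sample inputs are hypothetical):
    import { z } from 'zod';

    // Same shape as the object passed to createStatelessSchema above
    const schema = z.object({
      url: z.string().url(),
      max_depth: z.number().optional(),
      follow_links: z.boolean().optional(),
      bypass_cache: z.boolean().optional(),
    });

    // A well-formed argument object passes validation
    console.log(schema.safeParse({ url: 'https://example.com/sitemap.xml', follow_links: true }).success); // true

    // An invalid URL is rejected before the handler ever runs
    console.log(schema.safeParse({ url: 'not-a-url' }).success); // false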
  • src/server.ts:852-855 (registration)
    Tool dispatch registration in server request handler: validates arguments using SmartCrawlSchema and delegates to CrawlHandlers.smartCrawl.
    case 'smart_crawl':
      return await this.validateAndExecute('smart_crawl', args, SmartCrawlSchema, async (validatedArgs) =>
        this.crawlHandlers.smartCrawl(validatedArgs),
      );
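  • validateAndExecute is a project-specific helper not shown here; the following is a minimal sketch of the parse-then-delegate pattern it likely follows (all names below are hypothetical):
    // Parse the raw arguments with the tool's Zod schema, then run the handler
    async function validateAndExecute<T>(
      toolName: string,
      args: unknown,
      schema: { parse(input: unknown): T },
      handler: (validated: T) => Promise<unknown>,
    ): Promise<unknown> {
      try {
        const validated = schema.parse(args); // throws ZodError on invalid input
        return await handler(validated);
      } catch (error) {
        throw new Error(`Invalid arguments for ${toolName}: ${String(error)}`);
      }
    }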
  • src/server.ts:244-272 (registration)
    Tool metadata registration in listTools response: defines name, description, and inputSchema for smart_crawl.
      name: 'smart_crawl',
      description:
        '[STATELESS] Auto-detect and handle different content types (HTML, sitemap, RSS, text). Use when: URL type is unknown, crawling feeds/sitemaps, or want automatic format handling. Adapts strategy based on content. Creates new browser each time. For persistent operations use create_session + crawl.',
      inputSchema: {
        type: 'object',
        properties: {
          url: {
            type: 'string',
            description: 'The URL to crawl intelligently',
          },
          max_depth: {
            type: 'number',
            description: 'Maximum crawl depth for sitemaps',
            default: 2,
          },
          follow_links: {
            type: 'boolean',
            description: 'For sitemaps/RSS: crawl found URLs (max 10). For HTML: no effect',
            default: false,
          },
          bypass_cache: {
            type: 'boolean',
            description: 'Force fresh crawl',
            default: false,
          },
        },
        required: ['url'],
      },
    },
  • src/server.ts:30-30 (registration)
    Import of SmartCrawlSchema used for validation.
    SmartCrawlSchema,

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/omgwtfwow/mcp-crawl4ai-ts'

If you have feedback or need assistance with the MCP directory API, please join our Discord server.