firecrawl_crawl
Extract content from multiple related web pages by crawling a website. Use this tool to comprehensively gather information from all pages within the specified depth and page limits.
Instructions
Starts an asynchronous crawl job on a website and extracts content from all pages.
**Best for:** Extracting content from multiple related pages, when you need comprehensive coverage.

**Not recommended for:** Extracting content from a single page (use scrape); when token limits are a concern (use map + batch_scrape); when you need fast results (crawling can be slow).

**Warning:** Crawl responses can be very large and may exceed token limits. Limit the crawl depth and number of pages, or use map + batch_scrape for better control.

**Common mistakes:** Setting limit or maxDepth too high (causes token overflow); using crawl for a single page (use scrape instead).

**Prompt Example:** "Get all blog posts from the first two levels of example.com/blog."

**Usage Example:**
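(Reproduced from the tool's built-in description; see the schema excerpt under Implementation Reference.)

```json
{
  "name": "firecrawl_crawl",
  "arguments": {
    "url": "https://example.com/blog/*",
    "maxDepth": 2,
    "limit": 100,
    "allowExternalLinks": false,
    "deduplicateSimilarURLs": true
  }
}
```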
**Returns:** Operation ID for status checking; use firecrawl_check_crawl_status to check progress.
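Once the job ID comes back, progress is polled with firecrawl_check_crawl_status. A minimal follow-up call might look like the sketch below; the argument name `id` is an assumption for illustration, since that tool's schema is documented separately, and the job ID value is a placeholder.

```json
{
  "name": "firecrawl_check_crawl_status",
  "arguments": {
    "id": "<job-id-from-firecrawl_crawl>"
  }
}
```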
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| allowBackwardLinks | No | Allow crawling links that point to parent directories | |
| allowExternalLinks | No | Allow crawling links to external domains | |
| deduplicateSimilarURLs | No | Remove similar URLs during crawl | |
| excludePaths | No | URL paths to exclude from crawling | |
| ignoreQueryParameters | No | Ignore query parameters when comparing URLs | |
| ignoreSitemap | No | Skip sitemap.xml discovery | |
| includePaths | No | Only crawl these URL paths | |
| limit | No | Maximum number of pages to crawl | |
| maxDepth | No | Maximum link depth to crawl | |
| scrapeOptions | No | Options for scraping each page | |
| url | Yes | Starting URL for the crawl | |
| webhook | No | Webhook URL (string) or object with `url` and optional `headers` to notify when the crawl completes | |
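As an illustration of how these parameters combine, the sketch below restricts a crawl to a documentation subtree, scrapes each page as main-content markdown, and registers an object-form webhook. All field names come from the schema above; the URLs and values are placeholders, not defaults.

```json
{
  "name": "firecrawl_crawl",
  "arguments": {
    "url": "https://example.com/docs",
    "includePaths": ["/docs/"],
    "excludePaths": ["/docs/archive/"],
    "maxDepth": 2,
    "limit": 50,
    "ignoreQueryParameters": true,
    "scrapeOptions": {
      "formats": ["markdown"],
      "onlyMainContent": true,
      "waitFor": 1000
    },
    "webhook": {
      "url": "https://example.com/hooks/crawl-complete",
      "headers": { "Authorization": "Bearer <token>" }
    }
  }
}
```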
Implementation Reference
- src/index.ts:1107-1134 (handler): The main handler for firecrawl_crawl tool execution. It validates input with isCrawlOptions, calls client.asyncCrawlUrl via withRetry, and returns the crawl job ID with instructions to check status.

  ```typescript
  case 'firecrawl_crawl': {
    if (!isCrawlOptions(args)) {
      throw new Error('Invalid arguments for firecrawl_crawl');
    }
    const { url, ...options } = args;
    const response = await withRetry(
      async () =>
        // @ts-expect-error Extended API options including origin
        client.asyncCrawlUrl(url, { ...options, origin: 'mcp-server' }),
      'crawl operation'
    );
    if (!response.success) {
      throw new Error(response.error);
    }
    return {
      content: [
        {
          type: 'text',
          text: trimResponseText(
            `Started crawl for ${url} with job ID: ${response.id}. Use firecrawl_check_crawl_status to check progress.`
          ),
        },
      ],
      isError: false,
    };
  }
  ```
- src/index.ts:255-385 (schema): Tool definition including the name, detailed description, and the complete inputSchema for parameters such as url, maxDepth, limit, and scrapeOptions.

  ```typescript
  const CRAWL_TOOL: Tool = {
    name: 'firecrawl_crawl',
    description: `
  Starts an asynchronous crawl job on a website and extracts content from all pages.

  **Best for:** Extracting content from multiple related pages, when you need comprehensive coverage.
  **Not recommended for:** Extracting content from a single page (use scrape); when token limits are a concern (use map + batch_scrape); when you need fast results (crawling can be slow).
  **Warning:** Crawl responses can be very large and may exceed token limits. Limit the crawl depth and number of pages, or use map + batch_scrape for better control.
  **Common mistakes:** Setting limit or maxDepth too high (causes token overflow); using crawl for a single page (use scrape instead).
  **Prompt Example:** "Get all blog posts from the first two levels of example.com/blog."
  **Usage Example:**
  \`\`\`json
  {
    "name": "firecrawl_crawl",
    "arguments": {
      "url": "https://example.com/blog/*",
      "maxDepth": 2,
      "limit": 100,
      "allowExternalLinks": false,
      "deduplicateSimilarURLs": true
    }
  }
  \`\`\`
  **Returns:** Operation ID for status checking; use firecrawl_check_crawl_status to check progress.
  `,
    inputSchema: {
      type: 'object',
      properties: {
        url: {
          type: 'string',
          description: 'Starting URL for the crawl',
        },
        excludePaths: {
          type: 'array',
          items: { type: 'string' },
          description: 'URL paths to exclude from crawling',
        },
        includePaths: {
          type: 'array',
          items: { type: 'string' },
          description: 'Only crawl these URL paths',
        },
        maxDepth: {
          type: 'number',
          description: 'Maximum link depth to crawl',
        },
        ignoreSitemap: {
          type: 'boolean',
          description: 'Skip sitemap.xml discovery',
        },
        limit: {
          type: 'number',
          description: 'Maximum number of pages to crawl',
        },
        allowBackwardLinks: {
          type: 'boolean',
          description: 'Allow crawling links that point to parent directories',
        },
        allowExternalLinks: {
          type: 'boolean',
          description: 'Allow crawling links to external domains',
        },
        webhook: {
          oneOf: [
            {
              type: 'string',
              description: 'Webhook URL to notify when crawl is complete',
            },
            {
              type: 'object',
              properties: {
                url: {
                  type: 'string',
                  description: 'Webhook URL',
                },
                headers: {
                  type: 'object',
                  description: 'Custom headers for webhook requests',
                },
              },
              required: ['url'],
            },
          ],
        },
        deduplicateSimilarURLs: {
          type: 'boolean',
          description: 'Remove similar URLs during crawl',
        },
        ignoreQueryParameters: {
          type: 'boolean',
          description: 'Ignore query parameters when comparing URLs',
        },
        scrapeOptions: {
          type: 'object',
          properties: {
            formats: {
              type: 'array',
              items: {
                type: 'string',
                enum: [
                  'markdown',
                  'html',
                  'rawHtml',
                  'screenshot',
                  'links',
                  'screenshot@fullPage',
                  'extract',
                ],
              },
            },
            onlyMainContent: {
              type: 'boolean',
            },
            includeTags: {
              type: 'array',
              items: { type: 'string' },
            },
            excludeTags: {
              type: 'array',
              items: { type: 'string' },
            },
            waitFor: {
              type: 'number',
            },
          },
          description: 'Options for scraping each page',
        },
      },
      required: ['url'],
    },
  };
  ```
- src/index.ts:962-973 (registration): Registers CRAWL_TOOL (firecrawl_crawl) in the list of available tools returned for ListToolsRequestSchema.

  ```typescript
  server.setRequestHandler(ListToolsRequestSchema, async () => ({
    tools: [
      SCRAPE_TOOL,
      MAP_TOOL,
      CRAWL_TOOL,
      CHECK_CRAWL_STATUS_TOOL,
      SEARCH_TOOL,
      EXTRACT_TOOL,
      DEEP_RESEARCH_TOOL,
      GENERATE_LLMSTXT_TOOL,
    ],
  }));
  ```
- src/index.ts:803-810 (helper): Type guard used by the handler to validate that the arguments contain a valid `url` string for the crawl tool.

  ```typescript
  function isCrawlOptions(args: unknown): args is CrawlParams & { url: string } {
    return (
      typeof args === 'object' &&
      args !== null &&
      'url' in args &&
      typeof (args as { url: unknown }).url === 'string'
    );
  }
  ```