firecrawl_extract
Extract structured data from web pages using LLM prompts and JSON schemas. Supports cloud and self-hosted AI for web content analysis.
Instructions
Extract structured information from web pages using LLM. Supports both cloud AI and self-hosted LLM extraction.
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| urls | Yes | List of URLs to extract information from | |
| prompt | No | Prompt for the LLM extraction | |
| systemPrompt | No | System prompt for LLM extraction | |
| schema | No | JSON schema for structured data extraction | |
| allowExternalLinks | No | Allow extraction from external links | |
| enableWebSearch | No | Enable web search for additional context | |
| includeSubdomains | No | Include subdomains in extraction | |
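As an illustration of how the parameters in the table fit together, the sketch below builds an arguments object and wraps it in an MCP `tools/call` request. The `ExtractArgs` interface is reconstructed from the table, and the URL, prompt, and schema contents are made-up placeholders rather than values from the Firecrawl documentation.

```typescript
// Argument shape reconstructed from the input schema table above.
// Only `urls` is required; every other field is optional.
interface ExtractArgs {
  urls: string[];
  prompt?: string;
  systemPrompt?: string;
  schema?: Record<string, unknown>;
  allowExternalLinks?: boolean;
  enableWebSearch?: boolean;
  includeSubdomains?: boolean;
}

// Hypothetical example values; the URL and schema are placeholders.
const extractArgs: ExtractArgs = {
  urls: ['https://example.com/products/widget'],
  prompt: 'Extract the product name and price.',
  schema: {
    type: 'object',
    properties: {
      name: { type: 'string' },
      price: { type: 'number' },
    },
    required: ['name', 'price'],
  },
  includeSubdomains: false,
};

// JSON-RPC envelope for an MCP `tools/call` request.
const request = {
  jsonrpc: '2.0' as const,
  id: 1,
  method: 'tools/call',
  params: { name: 'firecrawl_extract', arguments: extractArgs },
};

console.log(JSON.stringify(request, null, 2));
```

Passing a `schema` alongside the `prompt` tells the extraction which fields to return; omitting it leaves the output shape to the prompt alone.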
Implementation Reference
- src/index.ts:1288-1385 (handler): Main handler for 'firecrawl_extract' tool in the switch statement of CallToolRequestSchema. Validates arguments with isExtractOptions, calls client.extract from FirecrawlApp, handles success/error responses, credit tracking, and logging.

  ```typescript
  case 'firecrawl_extract': {
    if (!isExtractOptions(args)) {
      throw new Error('Invalid arguments for firecrawl_extract');
    }

    try {
      const extractStartTime = Date.now();

      safeLog(
        'info',
        `Starting extraction for URLs: ${args.urls.join(', ')}`
      );

      // Log if using self-hosted instance
      if (FIRECRAWL_API_URL) {
        safeLog('info', 'Using self-hosted instance for extraction');
      }

      const extractResponse = await withRetry(
        async () =>
          client.extract(args.urls, {
            prompt: args.prompt,
            systemPrompt: args.systemPrompt,
            schema: args.schema,
            allowExternalLinks: args.allowExternalLinks,
            enableWebSearch: args.enableWebSearch,
            includeSubdomains: args.includeSubdomains,
            origin: 'mcp-server',
          } as ExtractParams),
        'extract operation'
      );

      // Type guard for successful response
      if (!('success' in extractResponse) || !extractResponse.success) {
        throw new Error(extractResponse.error || 'Extraction failed');
      }

      const response = extractResponse as ExtractResponse;

      // Monitor credits for cloud API
      if (!FIRECRAWL_API_URL && hasCredits(response)) {
        await updateCreditUsage(response.creditsUsed || 0);
      }

      // Log performance metrics
      safeLog(
        'info',
        `Extraction completed in ${Date.now() - extractStartTime}ms`
      );

      // Add warning to response if present
      const result = {
        content: [
          {
            type: 'text',
            text: trimResponseText(JSON.stringify(response.data, null, 2)),
          },
        ],
        isError: false,
      };

      if (response.warning) {
        safeLog('warning', response.warning);
      }

      return result;
    } catch (error) {
      const errorMessage =
        error instanceof Error ? error.message : String(error);

      // Special handling for self-hosted instance errors
      if (
        FIRECRAWL_API_URL &&
        errorMessage.toLowerCase().includes('not supported')
      ) {
        safeLog(
          'error',
          'Extraction is not supported by this self-hosted instance'
        );
        return {
          content: [
            {
              type: 'text',
              text: trimResponseText(
                'Extraction is not supported by this self-hosted instance. Please ensure LLM support is configured.'
              ),
            },
          ],
          isError: true,
        };
      }

      return {
        content: [{ type: 'text', text: trimResponseText(errorMessage) }],
        isError: true,
      };
    }
  }
  ```
- src/index.ts:483-523 (schema): Tool definition for 'firecrawl_extract' including name, description, and detailed inputSchema for parameters like urls, prompt, schema, etc.

  ```typescript
  const EXTRACT_TOOL: Tool = {
    name: 'firecrawl_extract',
    description:
      'Extract structured information from web pages using LLM. ' +
      'Supports both cloud AI and self-hosted LLM extraction.',
    inputSchema: {
      type: 'object',
      properties: {
        urls: {
          type: 'array',
          items: { type: 'string' },
          description: 'List of URLs to extract information from',
        },
        prompt: {
          type: 'string',
          description: 'Prompt for the LLM extraction',
        },
        systemPrompt: {
          type: 'string',
          description: 'System prompt for LLM extraction',
        },
        schema: {
          type: 'object',
          description: 'JSON schema for structured data extraction',
        },
        allowExternalLinks: {
          type: 'boolean',
          description: 'Allow extraction from external links',
        },
        enableWebSearch: {
          type: 'boolean',
          description: 'Enable web search for additional context',
        },
        includeSubdomains: {
          type: 'boolean',
          description: 'Include subdomains in extraction',
        },
      },
      required: ['urls'],
    },
  };
  ```
- src/index.ts:960-973 (registration): Registration of the 'firecrawl_extract' tool (as EXTRACT_TOOL) in the list of tools returned by the ListToolsRequestSchema handler.

  ```typescript
  server.setRequestHandler(ListToolsRequestSchema, async () => ({
    tools: [
      SCRAPE_TOOL,
      MAP_TOOL,
      CRAWL_TOOL,
      BATCH_SCRAPE_TOOL,
      CHECK_BATCH_STATUS_TOOL,
      CHECK_CRAWL_STATUS_TOOL,
      SEARCH_TOOL,
      EXTRACT_TOOL,
      DEEP_RESEARCH_TOOL,
      GENERATE_LLMSTXT_TOOL,
    ],
  }));
  ```
- src/index.ts:738-745 (helper): Type guard function 'isExtractOptions' used to validate input arguments for the firecrawl_extract handler, ensuring 'urls' is an array of strings. The guard does not check the array's length, so an empty array also passes.

  ```typescript
  function isExtractOptions(args: unknown): args is ExtractArgs {
    if (typeof args !== 'object' || args === null) return false;
    const { urls } = args as { urls?: unknown };
    return (
      Array.isArray(urls) &&
      urls.every((url): url is string => typeof url === 'string')
    );
  }
  ```
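The `isExtractOptions` guard accepts an empty `urls` array, because `every` returns `true` for an empty array. If empty input should be rejected before it reaches the API, a stricter variant could look like the sketch below. `isNonEmptyExtractOptions` and the local `ExtractArgs` interface are hypothetical names introduced for illustration; they are not part of the server's source.

```typescript
// Minimal local stand-in for the server's ExtractArgs type (assumption).
interface ExtractArgs {
  urls: string[];
}

// Hypothetical stricter guard: additionally requires at least one URL.
function isNonEmptyExtractOptions(args: unknown): args is ExtractArgs {
  if (typeof args !== 'object' || args === null) return false;
  const { urls } = args as { urls?: unknown };
  return (
    Array.isArray(urls) &&
    urls.length > 0 &&
    urls.every((url): url is string => typeof url === 'string')
  );
}

console.log(isNonEmptyExtractOptions({ urls: [] })); // false
console.log(isNonEmptyExtractOptions({ urls: ['https://example.com'] })); // true
console.log(isNonEmptyExtractOptions({ urls: ['https://example.com', 42] })); // false
```

Whether to reject empty lists in the guard or let the API report the problem is a design choice; the stricter guard simply fails earlier and with a local error message.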
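The handler wraps `client.extract` in `withRetry`, whose body is not among the excerpts listed here. The sketch below shows one way such a helper might work, assuming exponential backoff and a fixed attempt limit. The parameter names, defaults, and retry policy are assumptions for illustration, not the server's actual implementation.

```typescript
// Hypothetical retry helper: retries a failing async operation with
// exponential backoff. Names and defaults are illustrative assumptions.
async function withRetry<T>(
  operation: () => Promise<T>,
  context: string,
  maxAttempts = 3,
  initialDelayMs = 1000
): Promise<T> {
  let delayMs = initialDelayMs;
  for (let attempt = 1; ; attempt++) {
    try {
      return await operation();
    } catch (error) {
      // Give up once the attempt budget is exhausted.
      if (attempt >= maxAttempts) throw error;
      console.warn(
        `${context} failed (attempt ${attempt}/${maxAttempts}), retrying in ${delayMs}ms`
      );
      await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
      delayMs *= 2; // double the wait before the next attempt
    }
  }
}
```

This version retries on every error. A production helper would more likely retry only on transient failures such as rate limiting and rethrow everything else immediately.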