tavily-crawl
Initiates a structured web crawl from a specified URL, following internal links to explore site content with configurable depth, breadth, and filtering options for targeted data extraction.
Instructions
A powerful web crawler that initiates a structured web crawl starting from a specified base URL. The crawler expands from that point like a tree, following internal links across pages. You can control how deep and wide it goes, and guide it to focus on specific sections of the site.
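For example, `max_depth: 2` with `max_breadth: 3` lets the crawler visit the base page, up to 3 links from it, and up to 3 links from each of those (roughly 1 + 3 + 3² = 13 pages, depending on how the base page is counted), with `limit` capping the total regardless of the tree's shape.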
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| allow_external | No | Whether to allow following links that go to external domains | false |
| categories | No | Filter URLs using predefined categories (e.g., Documentation, Blog, Developers) | [] |
| extract_depth | No | Advanced extraction retrieves more data, including tables and embedded content, with higher success but may increase latency | basic |
| instructions | No | Natural language instructions for the crawler | |
| limit | No | Total number of links the crawler will process before stopping | 50 |
| max_breadth | No | Max number of links to follow per level of the tree (i.e., per page) | 20 |
| max_depth | No | Max depth of the crawl; defines how far from the base URL the crawler can explore | 1 |
| select_domains | No | Regex patterns to restrict crawling to specific domains or subdomains (e.g., ^docs\.example\.com$) | [] |
| select_paths | No | Regex patterns to select only URLs with specific path patterns (e.g., /docs/.*, /api/v1.*) | [] |
| url | Yes | The root URL to begin the crawl | |
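As an illustration, the following arguments (hypothetical values; only `url` is required) would crawl a documentation subdomain two levels deep while staying on `/docs/` paths:

```typescript
// Illustrative tavily-crawl arguments; the values are hypothetical.
const crawlArgs = {
  url: "https://docs.example.com",             // required: root of the crawl
  max_depth: 2,                                // follow links up to two levels from the root
  max_breadth: 10,                             // follow at most 10 links per page
  limit: 40,                                   // stop after 40 pages total
  select_paths: ["/docs/.*"],                  // only URLs whose path matches this regex
  select_domains: ["^docs\\.example\\.com$"],  // stay on the docs subdomain
  extract_depth: "basic"                       // use "advanced" for tables/embedded content
};
```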
Implementation Reference
- `src/index.ts:764-790` (handler): Main handler logic for the `tavily-crawl` tool. Builds `TavilyCrawlParams` from the tool arguments, invokes `tavilyClient.crawl()`, formats the results with `formatCrawlResults`, and returns the MCP content response.

  ```typescript
  case 'tavily-crawl': {
    const crawlParams: TavilyCrawlParams = {
      url: args?.url as string,
      max_depth: (args?.max_depth as number) || 1,
      max_breadth: (args?.max_breadth as number) || 20,
      limit: (args?.limit as number) || 50,
      instructions: args?.instructions as string,
      select_paths: Array.isArray(args?.select_paths) ? args.select_paths : [],
      select_domains: Array.isArray(args?.select_domains) ? args.select_domains : [],
      allow_external: (args?.allow_external as boolean) || false,
      categories: Array.isArray(args?.categories) ? args.categories : [],
      extract_depth: (args?.extract_depth as 'basic' | 'advanced') || 'basic',
      format: (args?.format as 'markdown' | 'text') || 'markdown',
      include_favicon: (args?.include_favicon as boolean) || false
    };
    const crawlResult = await this.tavilyClient.crawl(crawlParams);
    return {
      content: [
        {
          type: 'text',
          text: formatCrawlResults(crawlResult),
        },
      ],
    };
  }
  ```
- `src/index.ts:489-556` (registration): Registration of the `tavily-crawl` tool in the tools list, including name, description, and the complete `inputSchema` used for MCP tool listing.

  ```typescript
  {
    name: "tavily-crawl",
    description: "A powerful web crawler that initiates a structured web crawl starting from a specified base URL. The crawler expands from that point like a tree, following internal links across pages. You can control how deep and wide it goes, and guide it to focus on specific sections of the site.",
    inputSchema: {
      type: "object",
      properties: {
        url: {
          type: "string",
          description: "The root URL to begin the crawl"
        },
        max_depth: {
          type: "integer",
          description: "Max depth of the crawl. Defines how far from the base URL the crawler can explore.",
          default: 1,
          minimum: 1
        },
        max_breadth: {
          type: "integer",
          description: "Max number of links to follow per level of the tree (i.e., per page)",
          default: 20,
          minimum: 1
        },
        limit: {
          type: "integer",
          description: "Total number of links the crawler will process before stopping",
          default: 50,
          minimum: 1
        },
        instructions: {
          type: "string",
          description: "Natural language instructions for the crawler"
        },
        select_paths: {
          type: "array",
          items: { type: "string" },
          description: "Regex patterns to select only URLs with specific path patterns (e.g., /docs/.*, /api/v1.*)",
          default: []
        },
        select_domains: {
          type: "array",
          items: { type: "string" },
          description: "Regex patterns to select crawling to specific domains or subdomains (e.g., ^docs\\.example\\.com$)",
          default: []
        },
        allow_external: {
          type: "boolean",
          description: "Whether to allow following links that go to external domains",
          default: false
        },
        categories: {
          type: "array",
          items: {
            type: "string",
            enum: ["Careers", "Blog", "Documentation", "About", "Pricing", "Community", "Developers", "Contact", "Media"]
          },
          description: "Filter URLs using predefined categories like documentation, blog, api, etc",
          default: []
        },
        extract_depth: {
          type: "string",
          enum: ["basic", "advanced"],
          description: "Advanced extraction retrieves more data, including tables and embedded content, with higher success but may increase latency",
          default: "basic"
        }
      },
      required: ["url"]
    }
  } as Tool,
  ```
- `src/tavilyClient.ts:281-283` (handler): Core method on `TavilyClient` that executes the crawl by POSTing the parameters to the Tavily API endpoint `https://api.tavily.com/crawl` via `makeRequest` (a hedged sketch of such a helper follows this list).

  ```typescript
  async crawl(params: TavilyCrawlParams): Promise<any> {
    return this.makeRequest(this.baseURLs.crawl, params);
  }
  ```
- `src/tavilyClient.ts:33-46` (schema): TypeScript interface defining the input parameters for the `tavily-crawl` tool. It mirrors the JSON schema and additionally declares `format` and `include_favicon`, which the handler defaults to `'markdown'` and `false`.

  ```typescript
  export interface TavilyCrawlParams {
    url: string;
    max_depth?: number;
    max_breadth?: number;
    limit?: number;
    instructions?: string;
    select_paths?: string[];
    select_domains?: string[];
    allow_external?: boolean;
    categories?: string[];
    extract_depth?: 'basic' | 'advanced';
    format?: 'markdown' | 'text';
    include_favicon?: boolean;
  }
  ```
- `src/index.ts:146-190` (helper): Helper function that formats the raw crawl results from the Tavily API into a string for the tool response, previewing up to 20 pages and capping the total output at 30,000 characters.

  ```typescript
  export function formatCrawlResults(response: any): string {
    try {
      if (!response || typeof response !== 'object') {
        return 'Error: Invalid crawl response format';
      }

      const output = [];
      output.push(`Crawl Results:`);
      output.push(`Base URL: ${sanitizeText(response.base_url, 500)}`);
      output.push('\nCrawled Pages:');

      if (Array.isArray(response.results)) {
        response.results.slice(0, 20).forEach((page: any, index: number) => {
          output.push(`\n[${index + 1}] URL: ${sanitizeText(page.url, 500)}`);
          if (page.raw_content) {
            const cleanContent = sanitizeText(page.raw_content, 1000);
            const contentPreview = cleanContent.length > 300
              ? cleanContent.substring(0, 300) + "... [content truncated]"
              : cleanContent;
            output.push(`Content: ${contentPreview}`);
          }
          if (page.title) {
            output.push(`Title: ${sanitizeText(page.title, 200)}`);
          }
        });
      }

      const result = output.join('\n');
      // Limit total output size
      if (result.length > 30000) {
        return result.substring(0, 30000) + '\n\n... [crawl results too long, truncated]';
      }
      return result;
    } catch (error) {
      console.error('Error formatting crawl results:', error);
      return 'Error: Failed to format crawl results';
    }
  }
  ```
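The body of `makeRequest` is not shown in this reference. Below is a minimal sketch of what such a helper could look like, assuming a JSON POST with Bearer-token authentication; the auth scheme, environment variable, and error handling here are assumptions, not the repository's actual code.

```typescript
// Hypothetical sketch of a Tavily request helper; assumes Bearer auth and JSON bodies.
async function makeRequest(endpoint: string, params: Record<string, unknown>): Promise<any> {
  const response = await fetch(endpoint, {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      // Assumption: API key supplied as a Bearer token from the environment.
      'Authorization': `Bearer ${process.env.TAVILY_API_KEY}`,
    },
    body: JSON.stringify(params),
  });
  if (!response.ok) {
    throw new Error(`Tavily API error: ${response.status} ${response.statusText}`);
  }
  return response.json();
}
```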
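`formatCrawlResults` relies on a `sanitizeText` helper whose implementation is also not shown. A plausible minimal sketch, assuming it enforces a length cap and strips control characters; these details are guesses, not the repository's code.

```typescript
// Hypothetical sanitizeText: cap length and strip control characters.
// The real implementation in src/index.ts may differ.
function sanitizeText(text: unknown, maxLength: number): string {
  if (typeof text !== 'string') return '';
  // Remove ASCII control characters (keeping tab, newline, carriage return)
  // that could corrupt the formatted tool output.
  const cleaned = text.replace(/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]/g, '');
  return cleaned.length > maxLength ? cleaned.slice(0, maxLength) : cleaned;
}
```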