tavily-crawl
Initiate structured web crawls from a base URL, controlling depth, breadth, and path selection. Focus on specific site sections or categories to extract relevant data efficiently.
Instructions
A powerful web crawler that initiates a structured web crawl starting from a specified base URL. The crawler expands from that point like a tree, following internal links across pages. You can control how deep and wide it goes, and guide it to focus on specific sections of the site.
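Conceptually, the crawl is a breadth-limited tree walk over a site's internal links. The sketch below is not Tavily's implementation; it is a minimal, self-contained illustration of how max_depth, max_breadth, and limit each bound such a walk (the toy link graph and the crawlBounded helper are hypothetical).

```typescript
// Illustrative only: a depth/breadth/limit-bounded walk over a toy link graph.
// `linkGraph`, `crawlBounded`, and the example URLs are hypothetical.
const linkGraph: Record<string, string[]> = {
  "https://example.com": ["https://example.com/docs", "https://example.com/blog"],
  "https://example.com/docs": ["https://example.com/docs/api", "https://example.com/docs/guides"],
  "https://example.com/blog": ["https://example.com/blog/post-1"]
};

function crawlBounded(root: string, maxDepth = 1, maxBreadth = 20, limit = 50): string[] {
  const visited: string[] = [];
  let frontier = [root];
  for (let depth = 0; depth <= maxDepth && frontier.length > 0; depth++) {
    const next: string[] = [];
    for (const url of frontier) {
      if (visited.length >= limit) return visited; // total page budget
      if (visited.includes(url)) continue;
      visited.push(url);
      // Follow at most `maxBreadth` links from each page.
      next.push(...(linkGraph[url] ?? []).slice(0, maxBreadth));
    }
    frontier = next;
  }
  return visited;
}

console.log(crawlBounded("https://example.com", 2, 20, 50));
```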
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| allow_external | No | Whether to allow following links that go to external domains | false |
| categories | No | Filter URLs using predefined categories: Careers, Blog, Documentation, About, Pricing, Community, Developers, Contact, or Media | [] |
| extract_depth | No | Advanced extraction retrieves more data, including tables and embedded content, with higher success but may increase latency | basic |
| instructions | No | Natural language instructions for the crawler | |
| limit | No | Total number of links the crawler will process before stopping | 50 |
| max_breadth | No | Max number of links to follow per level of the tree (i.e., per page) | 20 |
| max_depth | No | Max depth of the crawl; defines how far from the base URL the crawler can explore | 1 |
| select_domains | No | Regex patterns to restrict crawling to specific domains or subdomains (e.g., ^docs\.example\.com$) | [] |
| select_paths | No | Regex patterns to select only URLs with specific path patterns (e.g., /docs/.*, /api/v1.*) | [] |
| url | Yes | The root URL to begin the crawl | |
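For example, a documentation-focused crawl might combine select_paths with a small depth and a page budget. All values below are illustrative, not recommended settings:

```typescript
// Illustrative arguments for a docs-focused crawl; every value here is an example.
const crawlArgs = {
  url: "https://example.com",
  max_depth: 2,                 // follow links up to two hops from the root
  max_breadth: 10,              // follow at most 10 links per page
  limit: 40,                    // stop after 40 pages total
  select_paths: ["/docs/.*"],   // only URLs under /docs/
  categories: ["Documentation"],
  extract_depth: "basic",
  instructions: "Collect API reference pages and setup guides"
};
```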
Implementation Reference
- src/index.ts:496-511 (handler): The core handler that executes the tavily-crawl tool by sending a POST request to the Tavily crawl API endpoint with the user's parameters and API key.

  ```typescript
  async crawl(params: any): Promise<TavilyCrawlResponse> {
    try {
      const response = await this.axiosInstance.post(this.baseURLs.crawl, {
        ...params,
        api_key: API_KEY
      });
      return response.data;
    } catch (error: any) {
      if (error.response?.status === 401) {
        throw new Error('Invalid API key');
      } else if (error.response?.status === 429) {
        throw new Error('Usage limit exceeded');
      }
      throw error;
    }
  }
  ```
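  For quick testing outside the MCP server, the same request can be reproduced with a plain axios call. This is a minimal sketch, not the project's code: the endpoint URL https://api.tavily.com/crawl is an assumption (the handler reads it from this.baseURLs.crawl, which is not shown here), and the body mirrors how the handler appends api_key to the parameters.

  ```typescript
  import axios from "axios";

  // Minimal sketch of the request the handler issues.
  // Assumption: the crawl endpoint is https://api.tavily.com/crawl (baseURLs is not shown above).
  async function crawlOnce(apiKey: string) {
    const response = await axios.post("https://api.tavily.com/crawl", {
      url: "https://example.com", // example target
      max_depth: 1,
      limit: 10,
      api_key: apiKey             // the handler spreads params and adds api_key the same way
    });
    return response.data;
  }

  crawlOnce(process.env.TAVILY_API_KEY ?? "")
    .then(data => console.log(data))
    .catch(err => {
      // Mirror the handler's mapping for the two common failure codes.
      if (err.response?.status === 401) console.error("Invalid API key");
      else if (err.response?.status === 429) console.error("Usage limit exceeded");
      else console.error(err.message);
    });
  ```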
- src/index.ts:381-399 (handler): The dispatcher in the CallToolRequestSchema handler that invokes the crawl method with parsed arguments and formats the response using formatCrawlResults.

  ```typescript
  case "tavily-crawl":
    const crawlResponse = await this.crawl({
      url: args.url,
      max_depth: args.max_depth,
      max_breadth: args.max_breadth,
      limit: args.limit,
      instructions: args.instructions,
      select_paths: Array.isArray(args.select_paths) ? args.select_paths : [],
      select_domains: Array.isArray(args.select_domains) ? args.select_domains : [],
      allow_external: args.allow_external,
      categories: Array.isArray(args.categories) ? args.categories : [],
      extract_depth: args.extract_depth
    });
    return {
      content: [{
        type: "text",
        text: formatCrawlResults(crawlResponse)
      }]
    };
  ```
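  This branch is reached by an MCP tools/call request naming the tool. The payload below sketches that request using the standard MCP JSON-RPC envelope; the argument values are examples only.

  ```typescript
  // Sketch of an MCP tools/call request that would reach the "tavily-crawl" case above.
  // Argument values are illustrative.
  const callToolRequest = {
    jsonrpc: "2.0",
    id: 1,
    method: "tools/call",
    params: {
      name: "tavily-crawl",
      arguments: {
        url: "https://example.com",
        max_depth: 1,
        select_paths: ["/blog/.*"]
      }
    }
  };
  ```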
- src/index.ts:220-282 (schema): The input schema defining parameters for the tavily-crawl tool, including url, max_depth, max_breadth, limit, instructions, selectors, and extraction options.

  ```typescript
  inputSchema: {
    type: "object",
    properties: {
      url: {
        type: "string",
        description: "The root URL to begin the crawl"
      },
      max_depth: {
        type: "integer",
        description: "Max depth of the crawl. Defines how far from the base URL the crawler can explore.",
        default: 1,
        minimum: 1
      },
      max_breadth: {
        type: "integer",
        description: "Max number of links to follow per level of the tree (i.e., per page)",
        default: 20,
        minimum: 1
      },
      limit: {
        type: "integer",
        description: "Total number of links the crawler will process before stopping",
        default: 50,
        minimum: 1
      },
      instructions: {
        type: "string",
        description: "Natural language instructions for the crawler"
      },
      select_paths: {
        type: "array",
        items: { type: "string" },
        description: "Regex patterns to select only URLs with specific path patterns (e.g., /docs/.*, /api/v1.*)",
        default: []
      },
      select_domains: {
        type: "array",
        items: { type: "string" },
        description: "Regex patterns to select crawling to specific domains or subdomains (e.g., ^docs\\.example\\.com$)",
        default: []
      },
      allow_external: {
        type: "boolean",
        description: "Whether to allow following links that go to external domains",
        default: false
      },
      categories: {
        type: "array",
        items: {
          type: "string",
          enum: ["Careers", "Blog", "Documentation", "About", "Pricing", "Community", "Developers", "Contact", "Media"]
        },
        description: "Filter URLs using predefined categories like documentation, blog, api, etc",
        default: []
      },
      extract_depth: {
        type: "string",
        enum: ["basic", "advanced"],
        description: "Advanced extraction retrieves more data, including tables and embedded content, with higher success but may increase latency",
        default: "basic"
      }
    },
    required: ["url"]
  }
  ```
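  Because the schema carries default values, a client can validate and normalize arguments before calling the tool. A minimal sketch using Ajv, assuming the object literal above is available as crawlInputSchema:

  ```typescript
  import Ajv from "ajv";

  // Sketch: validate arguments against the tool's input schema and fill in defaults.
  // `crawlInputSchema` is assumed to hold the schema object shown above.
  const ajv = new Ajv({ useDefaults: true });
  const validateCrawlArgs = ajv.compile(crawlInputSchema);

  const args: Record<string, unknown> = { url: "https://example.com" };
  if (validateCrawlArgs(args)) {
    // Defaults such as max_depth: 1 and extract_depth: "basic" are now filled in.
    console.log(args);
  } else {
    console.error(validateCrawlArgs.errors);
  }
  ```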
- src/index.ts:217-284 (registration): The tool registration in the ListToolsRequestSchema handler, defining the name, description, and inputSchema for tavily-crawl. The inputSchema is the object shown above (src/index.ts:220-282).

  ```typescript
  {
    name: "tavily-crawl",
    description: "A powerful web crawler that initiates a structured web crawl starting from a specified base URL. The crawler expands from that point like a tree, following internal links across pages. You can control how deep and wide it goes, and guide it to focus on specific sections of the site.",
    inputSchema: {
      // ... identical to the input schema shown above (src/index.ts:220-282)
    }
  },
  ```
- src/index.ts:569-588 (helper): Helper function that formats the crawl API response into a human-readable string for the MCP content response.

  ```typescript
  function formatCrawlResults(response: TavilyCrawlResponse): string {
    const output: string[] = [];
    output.push(`Crawl Results:`);
    output.push(`Base URL: ${response.base_url}`);
    output.push('\nCrawled Pages:');
    response.results.forEach((page, index) => {
      output.push(`\n[${index + 1}] URL: ${page.url}`);
      if (page.raw_content) {
        // Truncate content if it's too long
        const contentPreview = page.raw_content.length > 200
          ? page.raw_content.substring(0, 200) + "..."
          : page.raw_content;
        output.push(`Content: ${contentPreview}`);
      }
    });
    return output.join('\n');
  }
  ```
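  A quick usage example: given a small response object, the helper produces the text block returned to the MCP client. The interface below only mirrors the fields the formatter actually reads; the real TavilyCrawlResponse may carry more.

  ```typescript
  // Assumed minimal shape, based only on the fields formatCrawlResults uses.
  interface TavilyCrawlResponse {
    base_url: string;
    results: { url: string; raw_content?: string }[];
  }

  const sample: TavilyCrawlResponse = {
    base_url: "https://example.com",
    results: [
      { url: "https://example.com/docs", raw_content: "Getting started with the API..." },
      { url: "https://example.com/blog" } // no raw_content: only the URL is printed
    ]
  };

  console.log(formatCrawlResults(sample));
  // Crawl Results:
  // Base URL: https://example.com
  //
  // Crawled Pages:
  //
  // [1] URL: https://example.com/docs
  // Content: Getting started with the API...
  //
  // [2] URL: https://example.com/blog
  ```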