# tavily-crawl
Crawl websites starting from a base URL to map site structure, follow internal links, and extract content with controlled depth and breadth parameters for comprehensive web analysis.
## Instructions
A powerful web crawler that initiates a structured web crawl starting from a specified base URL. The crawler expands from that point like a tree, following internal links across pages. You can control how deep and wide it goes, and guide it to focus on specific sections of the site.
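For example, a client could steer a shallow, focused crawl with arguments like the following sketch (the URL and values are illustrative, not defaults):

```typescript
// Illustrative tavily-crawl arguments: crawl two levels deep,
// expanding at most 10 links per page and stopping after 30 pages.
const crawlArgs = {
  url: "https://example.com", // root of the crawl tree (required)
  max_depth: 2,               // how far from the base URL to explore
  max_breadth: 10,            // links followed per page
  limit: 30                   // total pages processed before stopping
};
```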
## Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| url | Yes | The root URL to begin the crawl | |
| max_depth | No | Max depth of the crawl; defines how far from the base URL the crawler can explore | 1 |
| max_breadth | No | Max number of links to follow per level of the tree (i.e., per page) | 20 |
| limit | No | Total number of links the crawler will process before stopping | 50 |
| instructions | No | Natural language instructions for the crawler | |
| select_paths | No | Regex patterns to select only URLs with matching paths (e.g., /docs/.*, /api/v1.*) | [] |
| select_domains | No | Regex patterns to restrict crawling to specific domains or subdomains (e.g., ^docs\.example\.com$) | [] |
| allow_external | No | Whether to allow following links that go to external domains | false |
| categories | No | Filter URLs by predefined category: Careers, Blog, Documentation, About, Pricing, Community, Developers, Contact, Media | [] |
| extract_depth | No | Advanced extraction retrieves more data (tables, embedded content) with higher success, but may increase latency | basic |
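Combining the filter parameters, a crawl pinned to a documentation subdomain might look like this sketch (the domain, patterns, and values are invented for illustration):

```typescript
// Hypothetical: crawl only docs.example.com, staying under /docs/,
// never following external links, with richer extraction enabled.
const docsCrawlArgs = {
  url: "https://docs.example.com",
  select_paths: ["/docs/.*"],                 // regex: keep URLs whose path matches /docs/
  select_domains: ["^docs\\.example\\.com$"], // regex: stay on this subdomain
  allow_external: false,
  categories: ["Documentation"],              // one of the predefined category values
  extract_depth: "advanced"                   // more data (tables, embedded content), more latency
};
```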
## Implementation Reference
- `src/index.ts:495-510` (handler): Core handler function that executes the tavily-crawl tool by making a POST request to the Tavily API's crawl endpoint with the user's parameters, mapping authentication and rate-limit failures to descriptive errors.

  ```typescript
  async crawl(params: any): Promise<TavilyCrawlResponse> {
    try {
      const response = await this.axiosInstance.post(this.baseURLs.crawl, {
        ...params,
        api_key: API_KEY
      });
      return response.data;
    } catch (error: any) {
      if (error.response?.status === 401) {
        throw new Error('Invalid API key');
      } else if (error.response?.status === 429) {
        throw new Error('Usage limit exceeded');
      }
      throw error;
    }
  }
  ```
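  A minimal caller sketch, assuming a client instance that exposes this method (the variable names are illustrative):

  ```typescript
  // 401 and 429 surface as the descriptive errors thrown by crawl() above.
  try {
    const res = await client.crawl({ url: "https://example.com" });
    console.log(`Crawled ${res.results.length} pages from ${res.base_url}`);
  } catch (err) {
    console.error(`Crawl failed: ${(err as Error).message}`);
  }
  ```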
- `src/index.ts:216-283` (registration): Registration of the tavily-crawl tool in the ListTools response, including its name, description, and detailed input schema for parameters like url, depth, breadth, limits, and filters.

  ```typescript
  {
    name: "tavily-crawl",
    description: "A powerful web crawler that initiates a structured web crawl starting from a specified base URL. The crawler expands from that point like a tree, following internal links across pages. You can control how deep and wide it goes, and guide it to focus on specific sections of the site.",
    inputSchema: {
      type: "object",
      properties: {
        url: {
          type: "string",
          description: "The root URL to begin the crawl"
        },
        max_depth: {
          type: "integer",
          description: "Max depth of the crawl. Defines how far from the base URL the crawler can explore.",
          default: 1,
          minimum: 1
        },
        max_breadth: {
          type: "integer",
          description: "Max number of links to follow per level of the tree (i.e., per page)",
          default: 20,
          minimum: 1
        },
        limit: {
          type: "integer",
          description: "Total number of links the crawler will process before stopping",
          default: 50,
          minimum: 1
        },
        instructions: {
          type: "string",
          description: "Natural language instructions for the crawler"
        },
        select_paths: {
          type: "array",
          items: { type: "string" },
          description: "Regex patterns to select only URLs with specific path patterns (e.g., /docs/.*, /api/v1.*)",
          default: []
        },
        select_domains: {
          type: "array",
          items: { type: "string" },
          description: "Regex patterns to select crawling to specific domains or subdomains (e.g., ^docs\\.example\\.com$)",
          default: []
        },
        allow_external: {
          type: "boolean",
          description: "Whether to allow following links that go to external domains",
          default: false
        },
        categories: {
          type: "array",
          items: {
            type: "string",
            enum: ["Careers", "Blog", "Documentation", "About", "Pricing", "Community", "Developers", "Contact", "Media"]
          },
          description: "Filter URLs using predefined categories like documentation, blog, api, etc",
          default: []
        },
        extract_depth: {
          type: "string",
          enum: ["basic", "advanced"],
          description: "Advanced extraction retrieves more data, including tables and embedded content, with higher success but may increase latency",
          default: "basic"
        }
      },
      required: ["url"]
    }
  },
  ```
- `src/index.ts:380-398` (handler): Dispatch handler in the CallToolRequestSchema switch statement that invokes the crawl method with parsed arguments and formats the response using formatCrawlResults.

  ```typescript
  case "tavily-crawl":
    const crawlResponse = await this.crawl({
      url: args.url,
      max_depth: args.max_depth,
      max_breadth: args.max_breadth,
      limit: args.limit,
      instructions: args.instructions,
      select_paths: Array.isArray(args.select_paths) ? args.select_paths : [],
      select_domains: Array.isArray(args.select_domains) ? args.select_domains : [],
      allow_external: args.allow_external,
      categories: Array.isArray(args.categories) ? args.categories : [],
      extract_depth: args.extract_depth
    });
    return {
      content: [{
        type: "text",
        text: formatCrawlResults(crawlResponse)
      }]
    };
  ```
- `src/index.ts:568-587` (helper): Helper function that formats the crawl response into a human-readable string, including the base URL and a preview of each crawled page's content.

  ```typescript
  function formatCrawlResults(response: TavilyCrawlResponse): string {
    const output: string[] = [];
    output.push(`Crawl Results:`);
    output.push(`Base URL: ${response.base_url}`);
    output.push('\nCrawled Pages:');
    response.results.forEach((page, index) => {
      output.push(`\n[${index + 1}] URL: ${page.url}`);
      if (page.raw_content) {
        // Truncate content if it's too long
        const contentPreview = page.raw_content.length > 200
          ? page.raw_content.substring(0, 200) + "..."
          : page.raw_content;
        output.push(`Content: ${contentPreview}`);
      }
    });
    return output.join('\n');
  }
  ```
- `src/index.ts:39-46` (schema): TypeScript interface defining the expected response structure from the Tavily crawl API.

  ```typescript
  interface TavilyCrawlResponse {
    base_url: string;
    results: Array<{
      url: string;
      raw_content: string;
    }>;
    response_time: number;
  }
  ```
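  For illustration, a minimal value of this shape run through formatCrawlResults above (the URL and content are invented):

  ```typescript
  const sample: TavilyCrawlResponse = {
    base_url: "https://example.com",
    results: [
      { url: "https://example.com/docs", raw_content: "Getting started with the API..." }
    ],
    response_time: 1.2
  };
  console.log(formatCrawlResults(sample));
  // Prints the base URL followed by a numbered list of pages,
  // truncating each page's content preview at 200 characters.
  ```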