tavily-crawl
Initiate structured web crawls from a base URL, controlling depth, breadth, and path selection. Focus on specific site sections or categories to extract relevant data efficiently.
Instructions
A powerful web crawler that initiates a structured web crawl starting from a specified base URL. The crawler expands from that point like a tree, following internal links across pages. You can control how deep and wide it goes, and guide it to focus on specific sections of the site.
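As a rough illustration of how the depth and breadth controls bound the crawl tree (an assumption about how the two limits combine, not something stated in the source), the sketch below computes the maximum number of pages reachable for hypothetical max_depth and max_breadth values; the overall limit then caps whatever the tree would otherwise allow.

```python
# Hypothetical illustration: upper bound on pages the crawl tree can reach
# for given depth/breadth settings, before the overall `limit` is applied.
def max_pages(max_depth: int, max_breadth: int) -> int:
    # Level 1 holds up to max_breadth pages, level 2 up to max_breadth**2, and so on.
    return sum(max_breadth ** level for level in range(1, max_depth + 1))

print(max_pages(max_depth=2, max_breadth=3))  # 3 + 9 = 12 pages beyond the base URL
```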
Input Schema
Name | Required | Description | Default |
---|---|---|---|
allow_external | No | Whether to allow following links that go to external domains | false |
categories | No | Filter URLs using predefined categories like documentation, blog, api, etc | [] |
extract_depth | No | Advanced extraction retrieves more data, including tables and embedded content, with higher success but may increase latency | basic |
instructions | No | Natural language instructions for the crawler | |
limit | No | Total number of links the crawler will process before stopping | 50 |
max_breadth | No | Max number of links to follow per level of the tree (i.e., per page) | 20 |
max_depth | No | Max depth of the crawl. Defines how far from the base URL the crawler can explore. | 1 |
select_domains | No | Regex patterns to restrict crawling to specific domains or subdomains (e.g., ^docs\.example\.com$) | [] |
select_paths | No | Regex patterns to select only URLs with specific path patterns (e.g., /docs/.*, /api/v1.*) | [] |
url | Yes | The root URL to begin the crawl | |
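
The example below shows a hypothetical set of arguments for this tool, expressed as a Python dict whose keys mirror the schema above; the concrete URL, regex, and numbers are illustrative assumptions, not values from the source.

```python
import json

# Hypothetical tavily-crawl arguments; the field names follow the input schema,
# while the values (docs.example.com, /api/v1.*, etc.) are illustrative only.
crawl_args = {
    "url": "https://docs.example.com",       # required: root URL to begin the crawl
    "max_depth": 2,                          # explore up to two link-hops from the root
    "max_breadth": 10,                       # follow at most 10 links per page
    "limit": 50,                             # stop after processing 50 links in total
    "select_paths": [r"/api/v1.*"],          # only follow URLs matching this path regex
    "allow_external": False,                 # do not follow links to external domains
    "extract_depth": "basic",                # "advanced" also retrieves tables/embedded content
    "categories": ["Documentation"],         # one of the predefined category values
    "instructions": "Focus on API reference pages",  # natural-language guidance
}

print(json.dumps(crawl_args, indent=2))      # JSON form of the arguments payload
```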
Input Schema (JSON Schema)
{
"properties": {
"allow_external": {
"default": false,
"description": "Whether to allow following links that go to external domains",
"type": "boolean"
},
"categories": {
"default": [],
"description": "Filter URLs using predefined categories like documentation, blog, api, etc",
"items": {
"enum": [
"Careers",
"Blog",
"Documentation",
"About",
"Pricing",
"Community",
"Developers",
"Contact",
"Media"
],
"type": "string"
},
"type": "array"
},
"extract_depth": {
"default": "basic",
"description": "Advanced extraction retrieves more data, including tables and embedded content, with higher success but may increase latency",
"enum": [
"basic",
"advanced"
],
"type": "string"
},
"instructions": {
"description": "Natural language instructions for the crawler",
"type": "string"
},
"limit": {
"default": 50,
"description": "Total number of links the crawler will process before stopping",
"minimum": 1,
"type": "integer"
},
"max_breadth": {
"default": 20,
"description": "Max number of links to follow per level of the tree (i.e., per page)",
"minimum": 1,
"type": "integer"
},
"max_depth": {
"default": 1,
"description": "Max depth of the crawl. Defines how far from the base URL the crawler can explore.",
"minimum": 1,
"type": "integer"
},
"select_domains": {
"default": [],
"description": "Regex patterns to select crawling to specific domains or subdomains (e.g., ^docs\\.example\\.com$)",
"items": {
"type": "string"
},
"type": "array"
},
"select_paths": {
"default": [],
"description": "Regex patterns to select only URLs with specific path patterns (e.g., /docs/.*, /api/v1.*)",
"items": {
"type": "string"
},
"type": "array"
},
"url": {
"description": "The root URL to begin the crawl",
"type": "string"
}
},
"required": [
"url"
],
"type": "object"
}
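Because the block above is standard JSON Schema, a client can validate candidate arguments locally before invoking the tool. The sketch below assumes the schema has been saved to a local file named tavily_crawl_schema.json (the file name and the example arguments are assumptions, not part of the source) and uses the jsonschema package.

```python
import json

from jsonschema import ValidationError, validate  # pip install jsonschema

# Load the tool's input schema, assumed to be saved locally under this name.
with open("tavily_crawl_schema.json") as f:
    schema = json.load(f)

# Hypothetical arguments to check before invoking the tavily-crawl tool.
args = {
    "url": "https://docs.example.com",
    "max_depth": 2,
    "select_paths": ["/api/v1.*"],
}

try:
    validate(instance=args, schema=schema)  # raises if any field violates the schema
    print("arguments are valid")
except ValidationError as err:
    print(f"invalid arguments: {err.message}")
```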