crawl_recursive
Deep crawl websites by following internal links to map entire sites, find all pages, and build comprehensive indexes with configurable depth and page limits.
Instructions
[STATELESS] Deep crawl a website following internal links. Use when: mapping entire sites, finding all pages, building comprehensive indexes. Control with max_depth (default 3) and max_pages (default 50). Note: May need JS execution for dynamic sites. Each page gets a fresh browser. For persistent operations use create_session + crawl.
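For example, a minimal call needs only url; max_depth and max_pages can be lowered to keep the crawl small (illustrative values, with example.com standing in for a real site):
{
  "url": "https://example.com",
  "max_depth": 2,
  "max_pages": 20
}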
Input Schema
Name | Required | Description | Default |
---|---|---|---|
exclude_pattern | No | Regex to skip URLs. Example: ".*\/(login|admin).*" to avoid auth pages, ".*\.pdf$" to skip PDFs | |
include_pattern | No | Regex to match URLs to crawl. Example: ".*\/blog\/.*" for blog posts only, ".*\.html$" for HTML pages | |
max_depth | No | Maximum depth to follow links | 3 |
max_pages | No | Maximum number of pages to crawl | 50 |
url | Yes | Starting URL to crawl from | |
Input Schema (JSON Schema)
{
"properties": {
"exclude_pattern": {
"description": "Regex to skip URLs. Example: \".*\\/(login|admin).*\" to avoid auth pages, \".*\\.pdf$\" to skip PDFs",
"type": "string"
},
"include_pattern": {
"description": "Regex to match URLs to crawl. Example: \".*\\/blog\\/.*\" for blog posts only, \".*\\.html$\" for HTML pages",
"type": "string"
},
"max_depth": {
"default": 3,
"description": "Maximum depth to follow links",
"type": "number"
},
"max_pages": {
"default": 50,
"description": "Maximum number of pages to crawl",
"type": "number"
},
"url": {
"description": "Starting URL to crawl from",
"type": "string"
}
},
"required": [
"url"
],
"type": "object"
}
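As a sketch of how the pattern parameters combine, the hypothetical arguments below restrict a crawl to blog pages while skipping auth pages and PDFs, reusing the regexes from the descriptions above (the URL and limits are illustrative):
{
  "url": "https://example.com/blog/",
  "include_pattern": ".*\\/blog\\/.*",
  "exclude_pattern": ".*\\/(login|admin).*|.*\\.pdf$",
  "max_depth": 3,
  "max_pages": 50
}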