crawl_start
Initialize a resumable web crawl job that returns a job ID immediately without network I/O. Use with crawl_status to fetch pages and crawl_cancel to stop.
Instructions
Initialize a resumable crawl job. Returns { jobId, status: "pending" } immediately and performs no network I/O. Drive progress with crawl_status({ jobId, advance: N }), which fetches up to N pages per call. Accepts the same arguments as the legacy crawl tool. Use crawl_cancel to stop the job.
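The start/advance/cancel flow described above can be sketched as a small driver loop. This is only an illustration under stated assumptions: `call_tool` is a hypothetical stand-in for whatever client invokes the tools, the terminal status names are guesses (only "pending" appears in the text), and passing `jobId` to crawl_cancel is assumed rather than documented here.

```python
from typing import Any, Callable, Dict

# Hypothetical transport: call_tool(tool_name, arguments) -> result dict.
ToolCaller = Callable[[str, Dict[str, Any]], Dict[str, Any]]

# Assumed terminal statuses; only "pending" is confirmed by the description.
TERMINAL_STATUSES = {"completed", "cancelled", "failed"}


def run_crawl(call_tool: ToolCaller, url: str, batch_size: int = 5,
              max_calls: int = 100) -> Dict[str, Any]:
    """Start a crawl job, then advance it in batches until it finishes."""
    # crawl_start only registers the job; no pages are fetched yet.
    job = call_tool("crawl_start", {"url": url})
    job_id = job["jobId"]
    result = job
    for _ in range(max_calls):
        # Each crawl_status call fetches up to batch_size pages.
        result = call_tool("crawl_status", {"jobId": job_id, "advance": batch_size})
        if result.get("status") in TERMINAL_STATUSES:
            return result
    # Safety valve: stop the job instead of polling forever.
    # Assumption: crawl_cancel takes the same jobId argument.
    call_tool("crawl_cancel", {"jobId": job_id})
    return result
```

Because the caller chooses `batch_size` and decides when to call again, the crawl can be paused between calls and resumed later with the same job ID.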
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| url | Yes | Starting URL to crawl. | |
| max_depth | No | Max link-follow depth. | 2 |
| max_pages | No | Max pages to crawl. | 20 |
| scope | No | URL glob limiting which URLs to follow. | same origin |
| include_patterns | No | URL globs; follow only links matching at least one. | |
| exclude_patterns | No | URL globs; skip links matching any. | |
| output_format | No | Content format. | markdown |
| onlyMainContent | No | For markdown-clean, remove nav/footer/ads before conversion. | true |
| includeLinks | No | For markdown-clean, include link destinations in markdown. | true |
| respect_robots | No | Whether to obey robots.txt. | true |
| delay_ms | No | Delay between page fetches (ms). | 1000 |
| concurrency | No | Max parallel fetches. | 3 |
| cache_mode | No | Opt-in crawl content cache mode. | disabled |
| cache_ttl_ms | No | Maximum age for enabled/read_only cache hits. Omit for no TTL expiry. | |
| cache_scope | No | Cache namespace/safety scope. | public |
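
As a rough illustration of how the schema might be used, the sketch below builds a crawl_start argument object from the defaults listed in the table. The helper `build_crawl_args` is hypothetical, and sending defaults explicitly is optional since the tool is documented to apply them itself. Treating the pattern parameters as lists of strings is an assumption based on the plural "URL globs".

```python
from typing import Any, Dict

# Defaults copied from the table. "scope" is left out because its default
# ("same origin") is a behavior, not a literal value to send.
DOCUMENTED_DEFAULTS: Dict[str, Any] = {
    "max_depth": 2,
    "max_pages": 20,
    "output_format": "markdown",
    "onlyMainContent": True,
    "includeLinks": True,
    "respect_robots": True,
    "delay_ms": 1000,
    "concurrency": 3,
    "cache_mode": "disabled",
    "cache_scope": "public",
}

# Optional parameters that the table lists without a literal default.
OPTIONAL_WITHOUT_DEFAULT = {"scope", "include_patterns", "exclude_patterns", "cache_ttl_ms"}


def build_crawl_args(url: str, **overrides: Any) -> Dict[str, Any]:
    """Merge caller overrides onto the documented defaults, rejecting unknown keys."""
    if not url:
        raise ValueError("url is required")
    unknown = set(overrides) - set(DOCUMENTED_DEFAULTS) - OPTIONAL_WITHOUT_DEFAULT
    if unknown:
        raise ValueError(f"unknown crawl_start parameters: {sorted(unknown)}")
    return {**DOCUMENTED_DEFAULTS, **overrides, "url": url}


# Example: a larger crawl that skips archive pages (glob list shape is assumed).
args = build_crawl_args(
    "https://example.com/docs",
    max_pages=50,
    exclude_patterns=["*/archive/*"],
)
```

Rejecting unknown keys catches casing slips early, which is easy to make here because the schema mixes snake_case names (max_pages) with camelCase ones (onlyMainContent).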