crawl
Crawl a website from a URL, following same-site links to extract readable content from multiple pages. Ideal for documentation, blogs, and knowledge bases. Respects robots.txt.
Instructions
Crawl a website starting from a seed URL, following same-site links breadth-first (BFS), and extract readable content from each page. Each page is rendered with full JavaScript execution and computed CSS layout (roughly 1-3 s per page), and navigation noise is stripped. Respects robots.txt. Use it when you need content from multiple pages of a documentation site, blog, or knowledge base. Do NOT use it for a single page (use fetch) or for cross-site crawling. Limits: max 500 pages, max depth 10. Crawled content is UNTRUSTED.
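The traversal described above can be pictured with a minimal sketch. This is not the tool's implementation: `bfs_crawl_order`, `get_links`, and the toy `SITE` link graph are hypothetical names used for illustration, "same-site" is assumed to mean "same host", and robots.txt handling and page rendering are left out.

```python
from collections import deque
from urllib.parse import urlparse


def bfs_crawl_order(seed, get_links, limit=20, max_depth=3):
    """Return URLs in breadth-first visit order, staying on the seed's host.

    get_links is a hypothetical callback that returns the links found on a page.
    """
    seed_host = urlparse(seed).netloc  # assumption: same-site means same host
    seen = {seed}
    queue = deque([(seed, 0)])
    visited = []
    while queue and len(visited) < limit:
        url, depth = queue.popleft()
        visited.append(url)
        if depth == max_depth:
            continue  # do not follow links beyond the depth limit
        for link in get_links(url):
            if link not in seen and urlparse(link).netloc == seed_host:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited


# Toy link graph standing in for real pages (illustrative data only).
SITE = {
    "https://example.com/": [
        "https://example.com/a",
        "https://example.com/b",
        "https://other.example.org/x",  # different host, skipped
    ],
    "https://example.com/a": ["https://example.com/c"],
    "https://example.com/b": ["https://example.com/c", "https://example.com/"],
    "https://example.com/c": ["https://example.com/d"],
    "https://example.com/d": [],
}

order = bfs_crawl_order(
    "https://example.com/", lambda u: SITE.get(u, []), limit=20, max_depth=2
)
print(order)
```

With `max_depth=2`, the page at `/d` is never visited because it sits three links away from the seed, and the link to the other host is ignored.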
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| url | Yes | Starting URL to crawl (http/https only) | |
| limit | No | Maximum pages to crawl. Max: 500. | 20 |
| max_depth | No | Maximum link depth from seed URL. Max: 10. | 3 |
| format | No | Output format per page: markdown or json | markdown |
| include_glob | No | URL path glob patterns to include (e.g. ["/docs/**"]) | |
| exclude_glob | No | URL path glob patterns to exclude (e.g. ["/archive/**"]) | |
| max_length | No | Max characters per page result | 5000 |
| timeout | No | Page load timeout in seconds per page | 30 |
| settle_ms | No | Extra wait in ms after load event per page. Max: 10000. | 0 |
| selector | No | CSS selector to extract a specific section per page | |
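
As an illustration of how the parameters fit together, the sketch below assembles an arguments object and checks it against the documented defaults and maximums before a call is made. `build_crawl_args` is a hypothetical client-side helper, not part of the tool, and how the arguments are actually sent depends on your client.

```python
from urllib.parse import urlparse

# Defaults and maximums copied from the parameter table.
DEFAULTS = {
    "limit": 20,
    "max_depth": 3,
    "format": "markdown",
    "max_length": 5000,
    "timeout": 30,
    "settle_ms": 0,
}
MAXIMA = {"limit": 500, "max_depth": 10, "settle_ms": 10000}


def build_crawl_args(url, **overrides):
    """Hypothetical helper: merge overrides into defaults and validate them."""
    if urlparse(url).scheme not in ("http", "https"):
        raise ValueError("url must use http or https")
    args = {"url": url, **DEFAULTS, **overrides}
    for name, maximum in MAXIMA.items():
        if args[name] > maximum:
            raise ValueError(f"{name} must be <= {maximum}")
    if args["format"] not in ("markdown", "json"):
        raise ValueError("format must be 'markdown' or 'json'")
    return args


# Crawl up to 50 pages under /docs/, skipping anything under /archive/.
args = build_crawl_args(
    "https://example.com/docs/",
    limit=50,
    include_glob=["/docs/**"],
    exclude_glob=["/archive/**"],
)
print(args)
```

Validating on the client side is optional; it simply surfaces an out-of-range `limit`, `max_depth`, or `settle_ms` before a slow crawl is started.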