crawl
Recursively crawl a website to extract text and links from multiple pages. Follows site boundaries and respects robots.txt with configurable depth limits.
Instructions
Recursively crawl a website via BFS. Opens pages in new tabs, extracts text and links, follows them up to max_depth. Respects robots.txt and scope constraints.
When to use: Extracting content from multiple pages of a site when the URL structure is not known in advance. When NOT to use: Use crawl_sitemap when the site has a sitemap.xml, or navigate for a single page.
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| url | Yes | Starting URL to crawl | |
| max_depth | No | Maximum link-follow depth (0 = start page only). Default: 2 | |
| max_pages | No | Maximum number of pages to crawl. Default: 20 | |
| cursor | No | Opaque pagination cursor returned as nextCursor from a prior crawl call. Cursoring paginates returned pages after the crawl completes. | |
| scope | No | URL glob pattern limiting which URLs to follow (e.g. "https://docs.example.com/**"). Default: same origin as start URL. | |
| include_patterns | No | URL glob patterns — only follow links matching at least one | |
| exclude_patterns | No | URL glob patterns — skip links matching any of these | |
| output_format | No | Content format per page. "markdown-clean" uses cheerio+turndown to strip nav/footer/ads. Default: markdown | |
| onlyMainContent | No | markdown-clean only: strip nav/header/footer/aside/ads. Default: true. | |
| includeLinks | No | markdown-clean only: preserve <a> as markdown links. Default: true. | |
| content_filter | No | markdown-clean only: deterministic fit_markdown filter. Default: none. | |
| return_raw | No | markdown-clean only: include raw_markdown in each page. Default: false. | |
| return_fit | No | markdown-clean only: include fit_markdown and use it as content when filtering. Default: true when filtered. | |
| respect_robots | No | Whether to fetch and obey robots.txt. Default: true | |
| delay_ms | No | Delay between page fetches in milliseconds. Default: 1000 | |
| concurrency | No | Max parallel tab fetches. Default: 3 | |
| engine | No | Fetch engine: "cdp" (default, opens a Chrome tab per page), "static" (Node fetch only, fails closed on insufficient pages), or "auto" (static first, fall back to CDP when static is insufficient). | |
| include_metrics | No | When true, include approximate output size/token metrics in the JSON result. Default: false. | |
| strategy | No | Crawl traversal strategy. Default: bfs. best_first scores discovered URLs by query/url_score and visits highest-scoring URLs first. | |
| query | No | Optional query terms used by strategy=best_first URL scoring. | |
| url_score | No | Optional strategy=best_first URL scoring hints: keywords, prefer_paths, exclude_paths, same_depth_bias. | |
| dispatcher | No | Crawl concurrency dispatcher. Default: fixed. adaptive reduces concurrency on memory/error pressure and records origin backoff for 429/503 responses. | |
| dispatcher_options | No | dispatcher=adaptive options: min_concurrency, max_concurrency, memory_pressure_mb, origin_backoff_ms, rate_limit_statuses. | |
| cache_mode | No | Opt-in crawl content cache mode. Default: disabled. | |
| cache_ttl_ms | No | Maximum age for enabled/read_only cache hits. Negative or omitted means no TTL expiry. | |
| cache_scope | No | Cache namespace/safety scope. Default: public; session adds a session fingerprint to the key. |