crawl_sitemap
Crawl a website's pages listed in its sitemap.xml. Auto-discovers sitemaps from robots.txt or /sitemap.xml and supports URL filtering for targeted extraction.
Instructions
Crawl a website using its sitemap.xml. Auto-discovers sitemaps from robots.txt or /sitemap.xml. Supports sitemap index files and URL filtering.
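The discovery step described above can be sketched roughly as follows. This is a minimal illustration and not the tool's actual implementation: the helper names `candidate_sitemaps` and `parse_sitemap` are hypothetical, and the ordering (robots.txt first, then the conventional root paths) is an assumption. The functions work on text that has already been fetched, so no network access is involved.

```python
import xml.etree.ElementTree as ET
from typing import List, Optional, Tuple
from urllib.parse import urljoin

# XML namespace defined by the sitemaps.org protocol.
NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def candidate_sitemaps(site_url: str, robots_txt: Optional[str]) -> List[str]:
    """Hypothetical helper: list sitemap URLs to try, in order."""
    found = []
    for line in (robots_txt or "").splitlines():
        # A robots.txt sitemap line looks like "Sitemap: <url>";
        # split on the first colon only, since the URL contains colons too.
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "sitemap" and value.strip():
            found.append(value.strip())
    if found:
        return found
    # Assumed fallback: conventional locations at the site root.
    return [
        urljoin(site_url, "/sitemap.xml"),
        urljoin(site_url, "/sitemap_index.xml"),
    ]


def parse_sitemap(xml_text: str) -> Tuple[str, List[str]]:
    """Hypothetical helper: return ("index" | "urlset", list of <loc> values)."""
    root = ET.fromstring(xml_text)
    locs = [el.text.strip() for el in root.iter(NS + "loc") if el.text]
    # A sitemap index lists child sitemaps; a urlset lists page URLs.
    kind = "index" if root.tag == NS + "sitemapindex" else "urlset"
    return kind, locs
```

When `parse_sitemap` reports `"index"`, each returned location is itself a sitemap that would need to be fetched and parsed in turn before any page URLs are available.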
When to use: Extracting content from many pages of a site that publishes a sitemap.xml.
When NOT to use: If no sitemap exists, use crawl for breadth-first (BFS) discovery; for a single page, use navigate.
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| url | Yes | Website URL (auto-discovers /sitemap.xml, /sitemap_index.xml) | |
| sitemap_url | No | Explicit sitemap URL (skips auto-discovery) | |
| filter | No | URL glob pattern to filter which sitemap URLs to visit | |
| max_pages | No | Maximum number of pages to visit. | 50 |
| output_format | No | Content format per page. "markdown-clean" uses cheerio+turndown to strip nav/footer/ads. | markdown |
| onlyMainContent | No | markdown-clean only: strip nav/header/footer/aside/ads. | true |
| includeLinks | No | markdown-clean only: preserve `<a>` elements as markdown links. | true |
| query | No | markdown-clean only: query terms used when content_filter="bm25". | |
| content_filter | No | markdown-clean only: deterministic fit_markdown filter. | none |
| return_raw | No | markdown-clean only: include raw_markdown in each page. | false |
| return_fit | No | markdown-clean only: include fit_markdown and use it as content when filtering. | true when filtered |
| concurrency | No | Maximum number of concurrent page fetches. | 3 |
| engine | No | Fetch engine: "cdp" (opens a Chrome tab per page), "static" (Node fetch only, fails closed on insufficient pages), or "auto" (static first, falling back to CDP when static is insufficient). | cdp |
| cache_mode | No | Opt-in crawl content cache mode. | disabled |
| cache_ttl_ms | No | Maximum age in milliseconds for enabled/read_only cache hits. Omit for no TTL expiry. | |
| cache_scope | No | Cache namespace/safety scope. | public |
| include_metrics | No | When true, include approximate output size/token metrics in the JSON result. | false |
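
As an illustration of how these parameters fit together, the sketch below shows a hypothetical argument object and a rough model of how `filter` and `max_pages` might narrow the URL list. The parameter names come from the table above; the values, the `select_urls` helper, and the use of Python's `fnmatch` for glob matching are assumptions. The tool's real glob semantics (for example, whether `*` crosses `/`) and the order in which the filter and the page cap are applied may differ.

```python
from fnmatch import fnmatchcase
from typing import List, Optional

# Hypothetical arguments for a crawl_sitemap call (values are illustrative).
args = {
    "url": "https://example.com",
    "filter": "https://example.com/docs/*",
    "max_pages": 20,
    "output_format": "markdown-clean",
    "content_filter": "bm25",
    "query": "authentication tokens",
    "engine": "auto",
    "concurrency": 3,
}


def select_urls(
    sitemap_urls: List[str], pattern: Optional[str], max_pages: int = 50
) -> List[str]:
    """Hypothetical helper: keep URLs matching the glob, then cap the count."""
    # In fnmatch, "*" matches any run of characters, including "/".
    matched = [
        u for u in sitemap_urls if pattern is None or fnmatchcase(u, pattern)
    ]
    # Assumption: the page cap is applied after filtering.
    return matched[:max_pages]
```

A narrow `filter` combined with a modest `max_pages` keeps the crawl targeted, which matters most with the "cdp" engine since it opens a Chrome tab per page.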