deep_crawl_site
Crawl multiple pages from a starting URL with configurable depth and strategy. Optionally save the results as per-URL markdown files plus an index.json metadata index.
Instructions
Crawl multiple pages from a site with configurable depth. Set output_path (an absolute directory path) to persist per-URL markdown files plus index.json; the response is then slimmed to metadata and file paths only.
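As an illustration, a call that persists results to disk might pass arguments like the ones below. This is a minimal sketch: only the parameter names come from the Input Schema, while the starting URL and the directory are placeholder values. How the arguments are actually sent depends on the client used to invoke the tool.

```python
import json

# Hypothetical argument values; only the parameter names are taken from the Input Schema.
arguments = {
    "url": "https://example.com/docs",    # placeholder starting URL
    "max_depth": 2,                       # documented range is 1-2
    "max_pages": 10,                      # documented upper limit is 10
    "crawl_strategy": "bfs",              # one of 'bfs', 'dfs', 'best_first'
    "output_path": "/tmp/crawl-example",  # placeholder absolute directory
}

# Serialize the arguments as JSON, the form most tool-calling clients accept.
payload = json.dumps(arguments, indent=2)
print(payload)
```

With output_path set and include_content_in_response left at its default, the response should carry only metadata and file paths, so the page content has to be read from the written .md files.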
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| url | Yes | Starting URL | |
| max_depth | No | Maximum link depth to follow from the starting URL (1-2) | |
| max_pages | No | Maximum number of pages to crawl (upper limit: 10) | |
| crawl_strategy | No | One of 'bfs', 'dfs', 'best_first' | bfs |
| include_external | No | Follow external links | |
| url_pattern | No | URL filter pattern | |
| score_threshold | No | Minimum relevance score (0-1) | |
| extract_media | No | Extract media | |
| base_timeout | No | Timeout per page | |
| output_path | No | Absolute directory path to persist per-URL markdown files + index.json. Existing regular files at this path are rejected; otherwise the directory is created if missing (dot-containing names like /tmp/run.v1 are fine). When set, the response is slimmed to metadata+file paths. Failed items (success=False) are NOT written as .md but still recorded in index.json with file=null. | |
| include_content_in_response | No | When True (with output_path set), also include per-page content/markdown in the response items. Defaults to False so the response stays token-efficient. | |
| overwrite | No | Overwrite existing per-URL files inside output_path. Defaults to False (existing files cause an output_path_exists error). | |
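The constraints listed in the table can be checked on the client side before a call is made. The following is a rough sketch, not the tool's own validation: the helper name `check_crawl_args` and its messages are hypothetical, and the tool may apply additional or different checks.

```python
import os

# Strategy names as listed in the Input Schema.
ALLOWED_STRATEGIES = {"bfs", "dfs", "best_first"}


def check_crawl_args(args: dict) -> list[str]:
    """Hypothetical pre-flight check; returns a list of problems (empty if none found)."""
    problems = []
    if not args.get("url"):
        problems.append("url is required")

    depth = args.get("max_depth")
    if depth is not None and not 1 <= depth <= 2:
        problems.append("max_depth must be 1 or 2")

    pages = args.get("max_pages")
    if pages is not None and pages > 10:
        problems.append("max_pages must not exceed 10")

    # bfs is the documented default when no strategy is given.
    if args.get("crawl_strategy", "bfs") not in ALLOWED_STRATEGIES:
        problems.append("crawl_strategy must be 'bfs', 'dfs' or 'best_first'")

    score = args.get("score_threshold")
    if score is not None and not 0 <= score <= 1:
        problems.append("score_threshold must be between 0 and 1")

    out = args.get("output_path")
    if out is not None:
        if not os.path.isabs(out):
            problems.append("output_path must be an absolute path")
        elif os.path.isfile(out):
            # Existing regular files at this path are rejected by the tool.
            problems.append("output_path points to an existing regular file")
    return problems


# Example: a depth outside the range and a relative output path are both flagged.
print(check_crawl_args({"url": "https://example.com", "max_depth": 3, "output_path": "out"}))
```

Returning a list rather than raising on the first problem lets a caller report every issue at once before retrying the call.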
Output Schema
| Name | Required | Description | Default |
|---|---|---|---|
| No arguments | | | |