crawl_url_with_fallback
Crawl web pages with automatic fallback strategies to bypass anti-bot protections. Supports pagination via content offsets and limits, and can save full content as markdown to disk.
Instructions
Crawl with fallback strategies for sites that use anti-bot protections. Use content_offset and content_limit to paginate the response. Use output_path to persist the full, unsliced content to disk as markdown and receive a slim response.
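
The pagination described above could be driven by a loop like the following. This is a minimal sketch, not the tool's own client code: `call_tool` is a hypothetical callable standing in for whatever client invokes crawl_url_with_fallback, and it is assumed to return the sliced content as a string. `fake_call_tool` is only a stand-in so the example runs offline.

```python
from typing import Callable, Dict, Iterator


def iter_pages(call_tool: Callable[[Dict], str], url: str,
               page_size: int = 5000) -> Iterator[str]:
    """Yield successive slices of the crawled content.

    Assumption: each call returns at most `content_limit` characters
    starting at `content_offset` (0-indexed), as the parameter table says.
    """
    if page_size <= 0:
        # content_limit=0 means "unlimited", so it cannot be used as a page size.
        raise ValueError("page_size must be positive")
    offset = 0
    while True:
        chunk = call_tool({
            "url": url,
            "content_offset": offset,
            "content_limit": page_size,
        })
        if not chunk:
            break  # nothing left to read
        yield chunk
        if len(chunk) < page_size:
            break  # a short slice means the end was reached
        offset += len(chunk)


# Hypothetical stand-in for a real tool call (no network access needed).
def fake_call_tool(args: Dict) -> str:
    text = "abcdefghij" * 3  # 30 characters of pretend page content
    start = args["content_offset"]
    return text[start:start + args["content_limit"]]


pages = list(iter_pages(fake_call_tool, "https://example.com", page_size=12))
```

Each call here triggers a separate tool invocation, so for long pages it may be cheaper to set output_path once and read the full file from disk instead.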
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| url | Yes | URL to crawl | |
| css_selector | No | CSS selector | |
| extract_media | No | Extract media | |
| take_screenshot | No | Take screenshot | |
| generate_markdown | No | Generate markdown | |
| wait_for_selector | No | Element to wait for | |
| timeout | No | Timeout in seconds | |
| wait_for_js | No | Wait for JavaScript | |
| auto_summarize | No | Auto-summarize content | |
| content_limit | No | Max characters to return (0=unlimited) | |
| content_offset | No | Start position for content (0-indexed) | |
| output_path | No | Absolute file path (auto .md extension) to persist the full unsliced markdown. When set, the response is slimmed to metadata+file path. content_limit/content_offset still affect the response copy but not the on-disk file. | |
| include_content_in_response | No | When True (with output_path set), keep markdown/content in the response too. Note: the response copy is still subject to content_limit/content_offset slicing; only the on-disk file holds the full unsliced payload. | |
| overwrite | No | Overwrite an existing output file at output_path. Defaults to False (existing files are rejected before any fetch). | False |
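
The interaction between output_path, overwrite, and the slicing parameters can be sketched as below. Both helpers are hypothetical and written only for illustration: `build_crawl_args` assembles an argument dictionary and checks the documented "absolute path" requirement on the client side, and `slice_for_response` models one reading of the documented slicing rules (0-indexed offset, limit of 0 meaning unlimited). Neither is the tool's actual implementation, and the helper's default values are choices made for this sketch.

```python
import os
from typing import Any, Dict, Optional


def build_crawl_args(url: str,
                     output_path: Optional[str] = None,
                     include_content_in_response: bool = False,
                     overwrite: bool = False,
                     content_offset: int = 0,
                     content_limit: int = 0) -> Dict[str, Any]:
    """Assemble arguments for the tool (hypothetical client-side helper)."""
    args: Dict[str, Any] = {
        "url": url,
        "content_offset": content_offset,
        "content_limit": content_limit,
    }
    if output_path is not None:
        # The table asks for an absolute path; fail early if it is relative.
        if not os.path.isabs(output_path):
            raise ValueError("output_path must be an absolute path")
        args["output_path"] = output_path
        args["include_content_in_response"] = include_content_in_response
        args["overwrite"] = overwrite
    return args


def slice_for_response(full_markdown: str,
                       content_offset: int = 0,
                       content_limit: int = 0) -> str:
    """Model of the documented slicing applied to the response copy only.

    The on-disk file is described as always holding the full, unsliced text.
    """
    if content_limit == 0:  # 0 = unlimited
        return full_markdown[content_offset:]
    return full_markdown[content_offset:content_offset + content_limit]


args = build_crawl_args("https://example.com",
                        output_path="/tmp/page.md",
                        include_content_in_response=True,
                        content_limit=100)
```

With these arguments, the response would carry at most the first 100 characters of markdown, while the file at output_path would hold the full content.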
Output Schema
| Name | Required | Description | Default |
|---|---|---|---|
| No arguments | | | |