crawl_url
Extract web page content with JavaScript rendering, paginate responses, and save full content to markdown files for data analysis or archiving.
Instructions
Extract web page content with JavaScript support. Use wait_for_js=true for SPAs. Use content_offset/content_limit to paginate the response. Use output_path to persist the full unsliced content to disk as markdown and receive a slim metadata-only response.
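The pagination described above can be sketched as a simple loop over content_offset. This is a minimal illustration, not the tool's own client: `call_tool` is a hypothetical stand-in for whatever function invokes the tool in your environment, and the `"content"` response key is an assumption, since the response shape is not documented here.

```python
from typing import Callable, Iterator


def iter_pages(
    call_tool: Callable[[str, dict], dict],
    url: str,
    page_size: int = 5000,
) -> Iterator[str]:
    """Yield successive slices of a page's content.

    `call_tool` is a hypothetical client callable taking (tool_name, arguments).
    The "content" key read from the response is assumed, not documented.
    """
    if page_size <= 0:
        # content_limit=0 means unlimited, so there would be nothing to page through.
        raise ValueError("page_size must be a positive number of characters")

    offset = 0
    while True:
        response = call_tool(
            "crawl_url",
            {
                "url": url,
                "wait_for_js": True,  # useful for SPAs
                "content_offset": offset,  # 0-indexed start position
                "content_limit": page_size,  # max characters to return
            },
        )
        chunk = response.get("content", "")
        if not chunk:
            break
        yield chunk
        if len(chunk) < page_size:
            # A short slice means the end of the content was reached.
            break
        offset += len(chunk)
```

Each iteration issues a fresh call, so for large pages it may be cheaper to set output_path once and read the file from disk instead.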
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| url | Yes | URL to crawl | |
| css_selector | No | CSS selector for extraction | |
| extract_media | No | Extract images/videos | |
| take_screenshot | No | Take screenshot | |
| generate_markdown | No | Generate markdown | |
| include_cleaned_html | No | Include cleaned HTML | |
| wait_for_selector | No | Wait for element to load | |
| timeout | No | Timeout in seconds | |
| wait_for_js | No | Wait for JavaScript | |
| auto_summarize | No | Auto-summarize large content | |
| use_undetected_browser | No | Bypass bot detection | |
| content_limit | No | Max characters to return (0=unlimited) | |
| content_offset | No | Start position for content (0-indexed) | |
| output_path | No | Absolute file path (auto .md extension) to persist the full unsliced markdown. When set, the response is slimmed to metadata+file path to save tokens. content_limit/content_offset still affect the response copy but not the on-disk file. | |
| include_content_in_response | No | When True (with output_path set), keep markdown/content in the response too. The response copy is still subject to content_limit/content_offset slicing; only the on-disk file holds the full unsliced payload. | False |
| overwrite | No | Overwrite an existing output file at output_path. When False, existing files are rejected before any fetch. | False |
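As an illustration of how the archiving parameters in the table fit together, the sketch below builds an arguments payload for a save-to-disk call. The helper name `build_archive_args` is hypothetical, and the client-side absolute-path check is an assumption added for convenience; the tool may perform its own validation.

```python
import os


def build_archive_args(
    url: str,
    output_path: str,
    overwrite: bool = False,
    include_content: bool = False,
) -> dict:
    """Build a crawl_url arguments dict that persists the full markdown to disk.

    Hypothetical helper: only the parameter names come from the input schema.
    """
    # The schema asks for an absolute file path, so fail early on relative ones.
    if not os.path.isabs(output_path):
        raise ValueError(f"output_path must be absolute, got {output_path!r}")

    return {
        "url": url,
        "output_path": output_path,
        # False means an existing file is rejected before any fetch.
        "overwrite": overwrite,
        # False keeps the response slim (metadata and file path only).
        "include_content_in_response": include_content,
    }
```

Leaving both flags at False gives the most conservative behavior: the call never clobbers an existing file and returns only metadata plus the file path.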
Output Schema
| Name | Required | Description | Default |
|---|---|---|---|
| No arguments | | | |