alterlab_crawl
Initiate an asynchronous website crawl that discovers URLs via sitemaps and link extraction, returning a crawl ID for later status polling. Supports pattern-based scoping and optional JavaScript rendering.
Instructions
Start an asynchronous crawl of an entire website. The crawl discovers URLs via sitemap parsing and link extraction, then scrapes each page. It returns a crawl_id immediately; use alterlab_crawl_status to poll for results. Use include_patterns/exclude_patterns to scope the crawl to specific sections. Set render_js='auto' for sites that mix static and JavaScript-rendered pages to save 30-60% compared with always rendering. Pass extraction_schema to extract structured data from every page.
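As a rough illustration of the instructions above, the sketch below assembles an arguments object for alterlab_crawl. The parameter names and the allowed sitemap and render_js values come from the input schema; the helper `build_crawl_arguments`, the example URL, and the sample extraction schema are hypothetical. How the arguments are actually sent depends on your MCP client and is not shown.

```python
import json

# Allowed sitemap modes as listed in the input schema.
SITEMAP_MODES = {"include", "skip", "only"}


def build_crawl_arguments(url, include_patterns=None, exclude_patterns=None,
                          sitemap="include", render_js="auto",
                          max_pages=None, max_depth=None,
                          extraction_schema=None):
    """Hypothetical helper: assemble an arguments dict for alterlab_crawl."""
    if sitemap not in SITEMAP_MODES:
        raise ValueError(f"sitemap must be one of {sorted(SITEMAP_MODES)}")
    if render_js not in (True, False, "auto"):
        raise ValueError("render_js must be True, False, or 'auto'")

    args = {"url": url, "sitemap": sitemap, "render_js": render_js}
    optional = {
        "include_patterns": include_patterns,
        "exclude_patterns": exclude_patterns,
        "max_pages": max_pages,
        "max_depth": max_depth,
        "extraction_schema": extraction_schema,
    }
    # Leave unset parameters out so the service applies its own defaults.
    args.update({k: v for k, v in optional.items() if v is not None})
    return args


arguments = build_crawl_arguments(
    "https://example.com",  # placeholder start URL
    include_patterns=["/blog/*", "/docs/*"],
    exclude_patterns=["/tag/*", "/author/*"],
    max_pages=50,
    max_depth=2,
    # Illustrative JSON schema; the fields are made up for this example.
    extraction_schema={
        "type": "object",
        "properties": {"title": {"type": "string"}},
    },
)
print(json.dumps(arguments, indent=2))
```

The call returns only a crawl_id, so a client would typically store that ID and pass it to alterlab_crawl_status later.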
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| url | Yes | Start URL for the crawl | |
| max_pages | No | Maximum number of pages to scrape | |
| max_depth | No | Maximum link-following depth from start URL (0 = start page only) | |
| include_patterns | No | Glob patterns — only scrape URLs whose path matches at least one (e.g., ['/blog/*', '/docs/*']) | |
| exclude_patterns | No | Glob patterns — skip URLs whose path matches any (e.g., ['/tag/*', '/author/*']) | |
| sitemap | No | Sitemap mode: include (default), skip (link extraction only), only (sitemap URLs only) | include |
| formats | No | Output formats for each scraped page | |
| extraction_schema | No | JSON schema for structured extraction on each page | |
| extraction_model | No | Per-request LLM model override in provider-specific format (e.g. 'gpt-4o', 'claude-opus-4-5-20251101', 'llama3-70b-8192'). Overrides the model saved in your BYOK key settings for this request only. | |
| render_js | No | Render JavaScript on crawled pages. true=always (Tier 4), false=never, auto=smart detection per page | |
| use_proxy | No | Route all crawl requests through premium proxy | |
| max_concurrency | No | Maximum number of pages to scrape concurrently | |
| respect_robots | No | Respect robots.txt rules for the target domain | |
| include_subdomains | No | Include links to subdomains during discovery | |
| webhook_url | No | Webhook URL to notify on crawl completion | |
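
To preview which URLs a pair of include_patterns and exclude_patterns might admit, the sketch below approximates path scoping with Python's standard fnmatch module. This is an assumption, not the service's documented behavior: the real glob dialect may differ (for instance, fnmatch lets `*` match across `/`), and giving exclude patterns precedence over include patterns is a guess. The function `in_scope` and the example URLs are hypothetical.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlparse


def in_scope(url, include_patterns=(), exclude_patterns=()):
    """Approximate crawl scoping by matching the URL path against globs."""
    path = urlparse(url).path or "/"
    # Assumption: a URL matching any exclude pattern is skipped outright.
    if any(fnmatchcase(path, pattern) for pattern in exclude_patterns):
        return False
    # With include patterns set, the path must match at least one of them.
    if include_patterns:
        return any(fnmatchcase(path, pattern) for pattern in include_patterns)
    return True


include = ["/blog/*", "/docs/*"]
exclude = ["/tag/*", "/author/*"]

for candidate in [
    "https://example.com/blog/post-1",
    "https://example.com/tag/python",
    "https://example.com/pricing",
]:
    print(candidate, in_scope(candidate, include, exclude))
```

Running a check like this against a handful of known URLs before starting a crawl can catch patterns that are too broad or too narrow, though the service's own matching is authoritative.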