alterlab_scrape
Extract web page content as markdown, JSON, or structured data with automatic anti-bot bypass. Supports JavaScript rendering, proxy rotation, and extraction schemas for AI workflows.
Instructions
Scrape a URL and return its content as markdown, text, HTML, JSON, or structured sections. Anti-bot protection is handled automatically with tier escalation. Output is markdown by default, which is optimized for LLM context.

- Supports GET (default) and POST/PUT/PATCH/DELETE/HEAD via the method parameter.
- Use method='POST' with body for GraphQL APIs, REST endpoints, and form submissions. For GraphQL: set body='{"query": "{ ... }"}' and method='POST'.
- Use render_js=true for JavaScript-heavy sites (React, Angular, SPAs).
- Use render_js='auto' for mixed sites to detect JS needs per page (saves 30-60%).
- Use use_proxy=true for geo-restricted or heavily protected sites.
- Use formats=['json_v2'] for a structured section tree (headings + content blocks).
- Use formats=['rag'] for chunked text optimized for RAG pipelines.
- Use formats=['content'] for AI/KB pipelines; it returns body_markdown, content_hash, images, and links.
- Use extraction_schema to extract structured fields from the page using an LLM (add formats=['json'] to retrieve the result in content.json; it is also available in filtered_content).
- Use extraction_prompt for natural language extraction instructions (mutually exclusive with extraction_schema).
- Use extraction_profile to select a pre-built extraction template (product, article, job_posting, etc.).
- Use extraction_provider to select a specific BYOK LLM provider (openai, anthropic, openrouter, groq).
- Use session_id (stored session) or inline cookies for authenticated scraping.
- Use scroll_to_load=true for infinite-scroll pages that lazy-load content.
- Use location.country to scrape geo-targeted content.
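
The parameter combinations described above can be sketched as plain argument dictionaries. This is a minimal illustration, not the tool's own client code: the helper names (`graphql_args`, `extraction_args`, `spa_args`) and the example URLs are hypothetical, and how the arguments reach the tool (for example through an MCP client) is left out. Only the parameter names and values come from the description above.

```python
import json
from typing import Any, Dict, Optional


def graphql_args(url: str, query: str,
                 variables: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Arguments for a GraphQL call: method='POST' plus a JSON-string body."""
    payload: Dict[str, Any] = {"query": query}
    if variables:
        payload["variables"] = variables
    # body is passed as a JSON-encoded string, not as a nested object.
    return {"url": url, "method": "POST", "body": json.dumps(payload)}


def extraction_args(url: str, schema: Dict[str, str]) -> Dict[str, Any]:
    """Arguments for LLM extraction; 'json' in formats exposes content.json."""
    return {"url": url, "formats": ["json"], "extraction_schema": schema}


def spa_args(url: str) -> Dict[str, Any]:
    """Arguments for a JavaScript-heavy, infinite-scroll page."""
    return {"url": url, "render_js": True, "scroll_to_load": True}


if __name__ == "__main__":
    # Hypothetical endpoints, used only to show the resulting payloads.
    print(json.dumps(graphql_args("https://example.com/graphql",
                                  "{ user { id name } }"), indent=2))
    print(json.dumps(extraction_args("https://example.com/item",
                                     {"title": "string", "price": "number"}),
                     indent=2))
    print(json.dumps(spa_args("https://example.com/feed"), indent=2))
```

Building the GraphQL body with `json.dumps` rather than by hand avoids quoting mistakes in the nested query string.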
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| url | Yes | URL to scrape | |
| method | No | HTTP method for the request. Default GET (standard page scraping). Use POST for GraphQL endpoints, form submissions, REST API calls. Use PUT/PATCH for REST API updates. When using POST/PUT/PATCH, provide body with the request payload. | GET |
| body | No | Request body for POST/PUT/PATCH requests. For GraphQL: JSON string with 'query' and optional 'variables' fields (e.g., '{"query": "{ user { id name } }"}'). For REST APIs: JSON-encoded payload string. For form submissions: URL-encoded key=value pairs (e.g., 'name=Alice&email=alice@example.com'). Omit for GET/HEAD/DELETE requests. | |
| mode | No | Scraping mode: auto (recommended), html, js (headless browser), pdf, or ocr | auto |
| formats | No | Output formats. 'markdown' is best for LLM consumption. 'json_v2' returns a structured section tree (headings + content blocks). 'rag' returns chunked text optimized for retrieval-augmented generation. 'content' returns body_markdown + content_hash + images + links for AI/KB pipelines. | markdown |
| extraction_schema | No | JSON schema for structured extraction. The API uses an LLM to extract fields matching this schema from the scraped page. The result is returned in content.json (add 'json' to formats) and in the top-level filtered_content field. Example: { "title": "string", "price": "number", "in_stock": "boolean" } | |
| extraction_model | No | Per-request LLM model override in provider-specific format (e.g. 'gpt-4o', 'claude-opus-4-5-20251101', 'llama3-70b-8192'). Overrides the model saved in your BYOK key settings for this request only. | |
| extraction_provider | No | LLM provider to use for extraction. Selects the matching BYOK key registered at /dashboard/settings/llm-keys. When omitted, the most recently used registered key is used automatically. Requires extraction_schema or extraction_prompt. | |
| extraction_prompt | No | Natural language extraction instruction. Describes what fields to extract from the page. Mutually exclusive with extraction_schema. Example: "Extract the product name, price, and availability". | |
| extraction_profile | No | Pre-built extraction schema template. auto: detect best template. product: e-commerce product details. article: news/blog article fields. job_posting: job listing fields. faq: FAQ entries. recipe: recipe ingredients and instructions. event: event details. ecommerce_homepage: homepage product listings. directory_listing: directory/listing page entries. | |
| render_js | No | Render JavaScript using headless browser (forces Tier 4 minimum — no separate add-on charge). Required for JS-heavy sites. Set to 'auto' for smart detection (probes each page, only renders JS-heavy pages with browser — saves 30-60% on mixed sites). | |
| use_proxy | No | Route through premium proxy (+$0.0002). Helps bypass geo-restrictions and anti-bot protection | |
| proxy_country | No | ISO country code for geo-targeting (e.g., 'US', 'DE'). Requires use_proxy=true | |
| wait_for | No | CSS selector to wait for before extracting content (e.g., '#main-content') | |
| timeout | No | Request timeout in seconds (1-300) | |
| max_response_bytes | No | Soft cap on raw response body size in bytes. When the downloaded HTML exceeds this value it is truncated before extraction. Default: 5 MB (5242880). Set to 0 for no limit. Maximum: 50 MB (52428800). Useful for very large pages where you only need the beginning of the content. | 5242880 |
| include_raw_html | No | Include raw HTML in the response alongside formatted content | |
| session_id | No | UUID of a stored session for authenticated scraping. Use alterlab_list_sessions to find available sessions. The session's cookies will be injected into the request. | |
| cookies | No | Inline cookies as key-value pairs for authenticated scraping (e.g., {"session_token": "abc123"}). Use this for one-off requests; use session_id for reusable sessions. | |
| scroll_to_load | No | Scroll page to trigger lazy-loaded content (requires render_js). Performs explicit viewport-height scrolls to load dynamic content. Adds ~2-3s latency. | |
| scroll_count | No | Number of scroll iterations when scroll_to_load is enabled (1-10) | 3 |
| remove_cookie_banners | No | Remove cookie consent banners from HTML before content extraction (free, enabled by default) | true |
| location | No | Geo-targeting parameters for localized content scraping. Controls proxy country routing, Accept-Language header, and browser locale. | |
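
Several rows in the table state cross-parameter rules and numeric ranges. The sketch below shows one way a caller might check an argument dictionary against them before sending a request. The function name `check_scrape_args` is hypothetical, and the checks encode only what the table says. One point is an assumption: any truthy render_js value, including 'auto', is treated as satisfying the scroll_to_load requirement. The server's actual validation may differ.

```python
from typing import Any, Dict, List

MAX_RESPONSE_BYTES = 52_428_800          # 50 MB ceiling from the table
BODYLESS_METHODS = {"GET", "HEAD", "DELETE"}


def check_scrape_args(args: Dict[str, Any]) -> List[str]:
    """Return a list of rule violations; an empty list means none were found."""
    problems: List[str] = []
    has_schema = "extraction_schema" in args
    has_prompt = "extraction_prompt" in args

    if not args.get("url"):
        problems.append("url is required")
    if has_schema and has_prompt:
        problems.append("extraction_schema and extraction_prompt are mutually exclusive")
    if "extraction_provider" in args and not (has_schema or has_prompt):
        problems.append("extraction_provider requires extraction_schema or extraction_prompt")
    if "proxy_country" in args and args.get("use_proxy") is not True:
        problems.append("proxy_country requires use_proxy=true")
    # Assumption: render_js=True or 'auto' both count as "render_js enabled".
    if args.get("scroll_to_load") and not args.get("render_js"):
        problems.append("scroll_to_load requires render_js")
    if "body" in args and str(args.get("method", "GET")).upper() in BODYLESS_METHODS:
        problems.append("body should be omitted for GET/HEAD/DELETE")

    timeout = args.get("timeout")
    if timeout is not None and not 1 <= timeout <= 300:
        problems.append("timeout must be between 1 and 300 seconds")
    scroll_count = args.get("scroll_count")
    if scroll_count is not None and not 1 <= scroll_count <= 10:
        problems.append("scroll_count must be between 1 and 10")
    max_bytes = args.get("max_response_bytes")
    if max_bytes is not None and not 0 <= max_bytes <= MAX_RESPONSE_BYTES:
        problems.append("max_response_bytes must be between 0 and 52428800")
    return problems


if __name__ == "__main__":
    # Hypothetical request that breaks two of the rules above.
    for problem in check_scrape_args({"url": "https://example.com",
                                      "proxy_country": "DE",
                                      "timeout": 0}):
        print(problem)
```

Returning every violation at once, rather than raising on the first, lets a caller fix an argument set in a single pass.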