web_scraping_web_scraping: POST /
`hasdata_web_scraping_web_scraping_scrapeWebPage`: Scrape any public URL using managed proxies, JS rendering, custom headers, and wait conditions. Extract structured data via CSS or AI rules, capture screenshots, and block resources. Returns HTML, markdown, or JSON for direct integration.
Instructions
Scrape Web Page
Universal web scraper that fetches any public URL through managed proxies (datacenter or residential, with optional geo-targeting). Supports JS rendering, custom headers, wait conditions, jsScenario actions (click, scroll, fill, waitFor), screenshots, resource/ad/URL blocking, and extractRules/aiExtractRules for LLM-driven structured extraction. Returns HTML, text, markdown, and/or JSON along with the status code, extracted emails and links, CSS-selector extractions, and AI-structured fields matching the given schema. Use it as a universal fallback fetcher for sites without a dedicated API, for scraping JS-heavy SPAs, bypassing bot protections, capturing screenshots, or producing clean markdown or structured JSON to feed downstream parsers, RAG pipelines, or data warehouses.
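A minimal call can be sketched as below. The endpoint URL and the `x-api-key` header name are assumptions for illustration; check your HasData account documentation for the real values. Only `url` is required; every other field from the input schema is optional.

```python
import json
import urllib.request

# Hypothetical endpoint and API key; replace with the real values
# from your HasData dashboard before sending any request.
API_URL = "https://api.example.com/scrape/web"
API_KEY = "YOUR_API_KEY"


def build_scrape_request(url, **options):
    """Assemble the JSON body for a Scrape Web Page call.

    `url` is the only required field; any other input-schema field
    (proxyType, jsRendering, outputFormat, ...) can be passed as a
    keyword option and is forwarded verbatim.
    """
    body = {"url": url}
    body.update(options)
    return body


def scrape(url, **options):
    """Send the request. Needs network access and a valid API key."""
    data = json.dumps(build_scrape_request(url, **options)).encode()
    req = urllib.request.Request(
        API_URL,
        data=data,
        headers={"Content-Type": "application/json", "x-api-key": API_KEY},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())


# Build (but do not send) a minimal request body:
payload = build_scrape_request(
    "https://example.com",
    jsRendering=True,
    outputFormat=["markdown"],
)
```

The builder is separated from the sender so the request body can be inspected or logged before a paid API call is made.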
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| url | Yes | The URL of the web page to scrape. | |
| headers | No | Optional custom headers to send with the request. | |
| proxyType | No | Type of proxy to use. | |
| proxyCountry | No | Optional proxy country code. | |
| blockResources | No | Whether to block loading of resources like images and stylesheets. | |
| blockAds | No | Whether to block ads. | |
| blockUrls | No | List of URLs to block. | |
| wait | No | Time in milliseconds to wait after the page loads. | |
| waitFor | No | CSS selector to wait for before scraping. | |
| jsScenario | No | Enables custom JavaScript interactions on the target webpage during scraping. It is an array where each object defines one action or step. Supported actions: `evaluate` (run custom JavaScript code on the page), `click` (click an element specified by a CSS selector), `wait` (pause for a set duration in milliseconds), `waitFor` (delay until a specific element appears), `waitForAndClick` (wait for an element, then click it), `scrollX`/`scrollY` (scroll to specified positions), and `fill` (enter values into input fields identified by CSS selectors). Actions are executed sequentially. | |
| extractRules | No | Rules for extracting specific data from the page. For example: `{ "title": "h1", "link_href": "a#link @href", "page_text": "body" }` | |
| screenshot | No | Whether to take a screenshot of the page. | |
| jsRendering | No | Enable JavaScript rendering. | |
| extractEmails | No | Extract emails from the page. | |
| extractLinks | No | Extract links from the page. | |
| includeOnlyTags | No | The `includeOnlyTags` parameter accepts an array of valid CSS selectors. When specified, only the elements matching these selectors will be included in the response content. Each value must be a valid `querySelectorAll` selector. Useful for extracting specific parts of the document. | |
| excludeTags | No | The `excludeTags` parameter accepts an array of valid CSS selectors. Elements matching these selectors will be removed from the final output. Each value must be a valid `querySelectorAll` selector. This can be used to remove ads, scripts, or other unwanted sections. | |
| removeBase64Images | No | If set to `true`, any images embedded as base64-encoded strings will be removed from the output. Useful for reducing response size or when base64 images are not needed. | |
| outputFormat | No | Specifies the desired response format: `html`, `text`, `markdown`, or `json`. If only one of `html`, `text`, or `markdown` is requested, the API returns the response body in that format. If multiple formats are requested, the API returns a JSON response with a key for each. If `json` is included alongside any other format, the API returns a JSON response with keys for the other requested formats. | |
| aiExtractRules | No | Defines custom rules for AI-based data extraction using LLMs, enabling structured data to be extracted directly from the page HTML. Each key in the object is a desired output field name; its value specifies the field's type and an optional description to guide the AI. Supported types: `string` (plain text), `number`, `boolean`, `list` (an array of values), and `item` (a nested object whose structure is defined under `output`). | |
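The `jsScenario` actions listed in the table can be sketched as an ordered array of one-action steps. The exact per-action payload shape (especially for `fill`) is an assumption here; verify it against the live API before relying on it.

```python
# A hypothetical jsScenario: wait for a cookie banner, dismiss it,
# scroll down, fill a search box, then run a custom script. Each step
# is assumed to be a single-key object naming the action.
js_scenario = [
    {"waitFor": "#cookie-banner"},
    {"click": "#cookie-banner .accept"},
    {"scrollY": 1000},
    {"wait": 500},
    {"fill": {"selector": "input[name=q]", "value": "laptops"}},
    {"evaluate": "document.title"},
]

# Actions execute sequentially, so order matters:
action_order = [next(iter(step)) for step in js_scenario]
```

Because steps run in order, waits should be placed before the interactions that depend on the awaited elements.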
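The two extraction mechanisms differ in shape. `extractRules` maps field names to CSS selectors (with `@attr` selecting an attribute, per the example in the table), while `aiExtractRules` maps field names to typed schema entries. The field names and the nesting of `list` entries under `output` below are illustrative assumptions, not values from the source.

```python
# CSS-based extractRules, following the table's example syntax:
# a bare selector returns text; "selector @attr" returns an attribute.
extract_rules = {
    "title": "h1",
    "link_href": "a#link @href",
    "page_text": "body",
}

# aiExtractRules: each key is an output field; "type" and the optional
# "description" guide the LLM. Nested structure is assumed to live
# under "output" as the table describes for the `item` type.
ai_extract_rules = {
    "product_name": {"type": "string", "description": "Name of the product"},
    "price": {"type": "number", "description": "Price in USD"},
    "in_stock": {"type": "boolean"},
    "reviews": {
        "type": "list",
        "output": {
            "author": {"type": "string"},
            "rating": {"type": "number"},
        },
    },
}
```

CSS rules are cheap and deterministic; the AI rules trade cost and latency for resilience to markup changes, so prefer `extractRules` when selectors are stable.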
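The `outputFormat` resolution rules in the table can be modeled with a small helper. This is a sketch of the documented behavior, not the API's implementation; the tuple-style return shape is invented for illustration.

```python
def expected_response_shape(formats):
    """Model the outputFormat rules: a single html/text/markdown
    request returns a raw body in that format; multiple formats
    return JSON keyed by format; including "json" returns JSON
    keyed by the *other* requested formats."""
    content = [f for f in formats if f != "json"]
    if len(content) == 1 and "json" not in formats:
        return {"kind": "raw", "format": content[0]}
    return {"kind": "json", "keys": content}
```

So `["markdown"]` yields a raw markdown body, while `["json", "html"]` yields a JSON object whose `html` key holds the page.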