smartscraper
Extract structured data from webpages, HTML, or markdown using AI-powered natural language prompts to get specific information like product details, contact methods, or article metadata.
Instructions
Extract structured data from a webpage, HTML, or markdown using AI-powered extraction.
This tool uses advanced AI to understand your natural language prompt and extract specific
structured data from web content. Supports three input modes: URL scraping, local HTML processing, or local markdown processing. Ideal for extracting product information, contact details,
article metadata, or any structured content. Costs 10 credits per page. Read-only operation.
Args:
user_prompt (str): Natural language instructions describing what data to extract.
- Be specific about the fields you want for better results
- Use clear, descriptive language about the target data
- Examples:
* "Extract product name, price, description, and availability status"
* "Find all contact methods: email addresses, phone numbers, and social media links"
* "Get article title, author, publication date, and summary"
* "Extract all job listings with title, company, location, and salary"
- Tips for better results:
* Specify exact field names you want
* Mention data types (numbers, dates, URLs, etc.)
* Include context about where data might be located
website_url (Optional[str]): The complete URL of the webpage to scrape.
- Mutually exclusive with website_html and website_markdown
- Must include protocol (http:// or https://)
- Supports dynamic and static content
- Examples:
* https://example.com/products/item
* https://news.site.com/article/123
* https://company.com/contact
- Default: None (must provide one of the three input sources)
website_html (Optional[str]): Raw HTML content to process locally.
- Mutually exclusive with website_url and website_markdown
- Maximum size: 2MB
- Useful for processing pre-fetched or generated HTML
- Use when you already have HTML content from another source
- Example: "<html><body><h1>Title</h1><p>Content</p></body></html>"
- Default: None
website_markdown (Optional[str]): Markdown content to process locally.
- Mutually exclusive with website_url and website_html
- Maximum size: 2MB
- Useful for extracting from markdown documents or converted content
- Works well with documentation, README files, or converted web content
- Example: "# Title\n\n## Section\n\nContent here..."
- Default: None
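The three input sources are mutually exclusive, so a caller may want to check that rule before sending a request. The sketch below is illustrative only: `pick_input_source` and `MAX_LOCAL_BYTES` are hypothetical names, and it assumes "2MB" means 2 MiB of UTF-8 encoded text, which the documentation does not specify.

```python
from typing import Optional, Tuple

# Assumption: the documented "2MB" limit is read as 2 MiB of UTF-8 bytes.
MAX_LOCAL_BYTES = 2 * 1024 * 1024


def pick_input_source(
    website_url: Optional[str] = None,
    website_html: Optional[str] = None,
    website_markdown: Optional[str] = None,
) -> Tuple[str, str]:
    """Return (parameter_name, value) for the single source that was provided."""
    candidates = {
        "website_url": website_url,
        "website_html": website_html,
        "website_markdown": website_markdown,
    }
    provided = [(name, value) for name, value in candidates.items() if value is not None]
    if len(provided) != 1:
        # Mirrors the documented ValueError for zero or multiple sources.
        raise ValueError(f"Provide exactly one input source, got {len(provided)}")

    name, value = provided[0]
    if name == "website_url":
        if not value.startswith(("http://", "https://")):
            raise ValueError("website_url must include http:// or https://")
    elif len(value.encode("utf-8")) > MAX_LOCAL_BYTES:
        raise ValueError(f"{name} exceeds the 2MB limit")
    return name, value
```

Calling `pick_input_source(website_html="<html><body><h1>Title</h1></body></html>")` returns the pair `("website_html", ...)`, while passing both a URL and HTML raises `ValueError`.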
output_schema (Optional[Union[str, Dict]]): JSON schema defining expected output structure.
- Can be provided as a dictionary or JSON string
- Helps ensure consistent, structured output format
- Optional but recommended for complex extractions
- IMPORTANT: Must include a "required" field (can be empty array [] if no fields are required)
- Examples:
* As dict: {'type': 'object', 'properties': {'title': {'type': 'string'}, 'price': {'type': 'number'}}, 'required': []}
* As JSON string: '{"type": "object", "properties": {"name": {"type": "string"}}, "required": []}'
* For arrays: {'type': 'array', 'items': {'type': 'object', 'properties': {...}, 'required': []}, 'required': []}
* With required fields: {'type': 'object', 'properties': {'name': {'type': 'string'}, 'email': {'type': 'string'}}, 'required': ['name', 'email']}
- Note: If "required" field is missing, it will be automatically added as an empty array []
- Default: None (AI will infer structure from prompt)
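Because `output_schema` accepts either a dictionary or a JSON string and always needs a "required" key, it can help to normalize it on the client side first. The following is a minimal sketch of that idea, not the tool's own implementation: `normalize_output_schema` is a hypothetical helper, and it raises `ValueError` rather than the `ValidationError` the tool documents.

```python
import json
from typing import Any, Dict, Optional, Union


def normalize_output_schema(
    schema: Optional[Union[str, Dict[str, Any]]]
) -> Optional[Dict[str, Any]]:
    """Return the schema as a dict with a "required" key, or None if no schema was given."""
    if schema is None:
        return None
    if isinstance(schema, str):
        try:
            parsed = json.loads(schema)
        except json.JSONDecodeError as exc:
            raise ValueError(f"output_schema is not valid JSON: {exc}") from exc
    else:
        parsed = dict(schema)  # shallow copy so the caller's dict is not mutated
    if not isinstance(parsed, dict):
        raise ValueError("output_schema must be a JSON object")
    # Same effect as the documented behavior: a missing "required" becomes [].
    parsed.setdefault("required", [])
    return parsed
```

With this helper, the JSON string `'{"type": "object", "properties": {"name": {"type": "string"}}}'` and the equivalent dictionary both come back as a dictionary that carries `"required": []`.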
number_of_scrolls (Optional[int]): Number of infinite scrolls to perform before scraping.
- Range: 0-50 scrolls
- Default: 0 (no scrolling)
- Useful for dynamically loaded content (lazy loading, infinite scroll)
- Each scroll waits for content to load before continuing
- Examples:
* 0: Static content, no scrolling needed
* 3: Social media feeds, product listings
* 10: Long articles, extensive product catalogs
- Note: Increases processing time proportionally
total_pages (Optional[int]): Number of pages to process for pagination.
- Range: 1-100 pages
- Default: 1 (single page only)
- Automatically follows pagination links when available
- Useful for multi-page listings, search results, catalogs
- Examples:
* 1: Single page extraction
* 5: First 5 pages of search results
* 20: Comprehensive catalog scraping
- Note: Each page counts toward credit usage (10 credits × pages)
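The credit rule above (10 credits × pages) is simple enough to compute before a run. This is a small sketch under that stated rate; `estimate_credits` is a hypothetical helper, and it assumes every requested page is actually processed.

```python
CREDITS_PER_PAGE = 10  # documented cost per page


def estimate_credits(total_pages: int = 1) -> int:
    """Estimate credit usage for a request, enforcing the documented 1-100 page range."""
    if not 1 <= total_pages <= 100:
        raise ValueError("total_pages must be between 1 and 100")
    return CREDITS_PER_PAGE * total_pages
```

For the examples listed above, 1 page is estimated at 10 credits, 5 pages at 50, and 20 pages at 200.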
render_heavy_js (Optional[bool]): Enable heavy JavaScript rendering for dynamic sites.
- Default: false
- Set to true for Single Page Applications (SPAs), React apps, Vue.js sites
- Increases processing time but captures client-side rendered content
- Use when content is loaded dynamically via JavaScript
- Examples of when to use:
* React/Angular/Vue applications
* Sites with dynamic content loading
* AJAX-heavy interfaces
* Content that appears after page load
- Note: Significantly increases processing time (30-60 seconds vs 5-15 seconds)
stealth (Optional[bool]): Enable stealth mode to avoid bot detection.
- Default: false
- Helps bypass basic anti-scraping measures
- Uses techniques to appear more like a human browser
- Useful for sites with bot detection systems
- Examples of when to use:
* Sites that block automated requests
* E-commerce sites with protection
* Sites that require "human-like" behavior
- Note: May increase processing time and is not 100% guaranteed
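The optional parameters can be assembled and range-checked in one place before a call is made. The sketch below only builds a plain dictionary using the documented names, defaults, and ranges. `build_arguments` is a hypothetical helper, it handles only the URL input mode for brevity, and how the dictionary is passed to the tool depends on your client.

```python
from typing import Any, Dict


def build_arguments(
    user_prompt: str,
    website_url: str,
    number_of_scrolls: int = 0,
    total_pages: int = 1,
    render_heavy_js: bool = False,
    stealth: bool = False,
) -> Dict[str, Any]:
    """Validate the documented ranges and return a dict of tool arguments."""
    if not user_prompt.strip():
        raise ValueError("user_prompt must not be empty")
    if not 0 <= number_of_scrolls <= 50:
        raise ValueError("number_of_scrolls must be between 0 and 50")
    if not 1 <= total_pages <= 100:
        raise ValueError("total_pages must be between 1 and 100")
    return {
        "user_prompt": user_prompt,
        "website_url": website_url,
        "number_of_scrolls": number_of_scrolls,
        "total_pages": total_pages,
        "render_heavy_js": render_heavy_js,
        "stealth": stealth,
    }


# Example: a JavaScript-heavy product listing that loads more items on scroll.
arguments = build_arguments(
    "Extract product name, price, description, and availability status",
    "https://example.com/products/item",
    number_of_scrolls=3,
    render_heavy_js=True,
)
```

Leaving `render_heavy_js` and `stealth` at their `False` defaults keeps processing time down, so it is reasonable to enable them only when a first attempt returns incomplete data.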
Returns:
Dictionary containing:
- extracted_data: The structured data matching your prompt and optional schema
- metadata: Information about the extraction process
- credits_used: Number of credits consumed (10 per page processed)
- processing_time: Time taken for the extraction
- pages_processed: Number of pages that were analyzed
- status: Success/error status of the operation
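A caller will typically read a few of these fields after each run. The sketch below assumes the result is a flat dictionary with the keys listed above. The exact nesting, and the `"success"` status value in the sample, are assumptions, and the sample dictionary is hand-written for illustration, not real tool output.

```python
from typing import Any, Dict

CREDITS_PER_PAGE = 10  # documented cost per processed page


def credits_match_pages(result: Dict[str, Any]) -> bool:
    """Check that credits_used equals 10 x pages_processed."""
    return result.get("credits_used") == CREDITS_PER_PAGE * result.get("pages_processed", 0)


def summarize_result(result: Dict[str, Any]) -> str:
    """Build a one-line summary, tolerating missing keys."""
    status = result.get("status", "unknown")
    pages = result.get("pages_processed", 0)
    credits = result.get("credits_used", 0)
    return f"status={status} pages={pages} credits={credits}"


# Hand-written sample for illustration only (not real output).
sample = {
    "extracted_data": {"title": "Example"},
    "credits_used": 20,
    "pages_processed": 2,
    "status": "success",
}
print(summarize_result(sample))  # prints "status=success pages=2 credits=20"
```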
Raises:
ValueError: If no input source provided or multiple sources provided
HTTPError: If website_url cannot be accessed
TimeoutError: If processing exceeds timeout limits
ValidationError: If output_schema is malformed JSON
Input Schema
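| Name | Required | Description | Default |
|---|---|---|---|
| user_prompt | Yes | Natural language instructions describing what data to extract. | |
| website_url | No | The complete URL of the webpage to scrape. | None |
| website_html | No | Raw HTML content to process locally (maximum 2MB). | None |
| website_markdown | No | Markdown content to process locally (maximum 2MB). | None |
| output_schema | No | JSON schema defining the expected output structure. | None |
| number_of_scrolls | No | Number of infinite scrolls to perform before scraping (0-50). | 0 |
| total_pages | No | Number of pages to process for pagination (1-100). | 1 |
| render_heavy_js | No | Enable heavy JavaScript rendering for dynamic sites. | false |
| stealth | No | Enable stealth mode to avoid bot detection. | false |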
Output Schema
| Name | Required | Description | Default |
|---|---|---|---|
| No arguments | | | |