ScrapeGraphAI

ScrapeGraph MCP Server

Official

smartscraper

Read-only · Idempotent

Extract structured data from webpages, HTML, or markdown using AI-powered natural language prompts to get specific information like product details, contact methods, or article metadata.

Instructions

Extract structured data from a webpage, HTML, or markdown using AI-powered extraction.

This tool uses advanced AI to understand your natural language prompt and extract specific
structured data from web content. Supports three input modes: URL scraping, raw HTML, or markdown content. Ideal for extracting product information, contact details,
article metadata, or any structured content. Costs 10 credits per page. Read-only operation.

Args:
    user_prompt (str): Natural language instructions describing what data to extract.
        - Be specific about the fields you want for better results
        - Use clear, descriptive language about the target data
        - Examples:
          * "Extract product name, price, description, and availability status"
          * "Find all contact methods: email addresses, phone numbers, and social media links"
          * "Get article title, author, publication date, and summary"
          * "Extract all job listings with title, company, location, and salary"
        - Tips for better results:
          * Specify exact field names you want
          * Mention data types (numbers, dates, URLs, etc.)
          * Include context about where data might be located

    website_url (Optional[str]): The complete URL of the webpage to scrape.
        - Mutually exclusive with website_html and website_markdown
        - Must include protocol (http:// or https://)
        - Supports dynamic and static content
        - Examples:
          * https://example.com/products/item
          * https://news.site.com/article/123
          * https://company.com/contact
        - Default: None (must provide one of the three input sources)

    website_html (Optional[str]): Raw HTML content to process locally.
        - Mutually exclusive with website_url and website_markdown
        - Maximum size: 2MB
        - Useful for processing pre-fetched or generated HTML
        - Use when you already have HTML content from another source
        - Example: "<html><body><h1>Title</h1><p>Content</p></body></html>"
        - Default: None

    website_markdown (Optional[str]): Markdown content to process locally.
        - Mutually exclusive with website_url and website_html
        - Maximum size: 2MB
        - Useful for extracting from markdown documents or converted content
        - Works well with documentation, README files, or converted web content
        - Example: "# Title

Section

Content here..." - Default: None

    output_schema (Optional[Union[str, Dict]]): JSON schema defining expected output structure.
        - Can be provided as a dictionary or JSON string
        - Helps ensure consistent, structured output format
        - Optional but recommended for complex extractions
        - IMPORTANT: Must include a "required" field (can be empty array [] if no fields are required)
        - Examples:
          * As dict: {'type': 'object', 'properties': {'title': {'type': 'string'}, 'price': {'type': 'number'}}, 'required': []}
          * As JSON string: '{"type": "object", "properties": {"name": {"type": "string"}}, "required": []}'
          * For arrays: {'type': 'array', 'items': {'type': 'object', 'properties': {...}, 'required': []}, 'required': []}
          * With required fields: {'type': 'object', 'properties': {'name': {'type': 'string'}, 'email': {'type': 'string'}}, 'required': ['name', 'email']}
        - Note: If "required" field is missing, it will be automatically added as an empty array []
        - Default: None (AI will infer structure from prompt)

    number_of_scrolls (Optional[int]): Number of infinite scrolls to perform before scraping.
        - Range: 0-50 scrolls
        - Default: 0 (no scrolling)
        - Useful for dynamically loaded content (lazy loading, infinite scroll)
        - Each scroll waits for content to load before continuing
        - Examples:
          * 0: Static content, no scrolling needed
          * 3: Social media feeds, product listings
          * 10: Long articles, extensive product catalogs
        - Note: Increases processing time proportionally

    total_pages (Optional[int]): Number of pages to process for pagination.
        - Range: 1-100 pages
        - Default: 1 (single page only)
        - Automatically follows pagination links when available
        - Useful for multi-page listings, search results, catalogs
        - Examples:
          * 1: Single page extraction
          * 5: First 5 pages of search results
          * 20: Comprehensive catalog scraping
        - Note: Each page counts toward credit usage (10 credits × pages)

    render_heavy_js (Optional[bool]): Enable heavy JavaScript rendering for dynamic sites.
        - Default: false
        - Set to true for Single Page Applications (SPAs), React apps, Vue.js sites
        - Increases processing time but captures client-side rendered content
        - Use when content is loaded dynamically via JavaScript
        - Examples of when to use:
          * React/Angular/Vue applications
          * Sites with dynamic content loading
          * AJAX-heavy interfaces
          * Content that appears after page load
        - Note: Significantly increases processing time (30-60 seconds vs 5-15 seconds)

    stealth (Optional[bool]): Enable stealth mode to avoid bot detection.
        - Default: false
        - Helps bypass basic anti-scraping measures
        - Uses techniques to appear more like a human browser
        - Useful for sites with bot detection systems
        - Examples of when to use:
          * Sites that block automated requests
          * E-commerce sites with protection
          * Sites that require "human-like" behavior
        - Note: May increase processing time and is not 100% guaranteed

Returns:
    Dictionary containing:
    - extracted_data: The structured data matching your prompt and optional schema
    - metadata: Information about the extraction process
    - credits_used: Number of credits consumed (10 per page processed)
    - processing_time: Time taken for the extraction
    - pages_processed: Number of pages that were analyzed
    - status: Success/error status of the operation

Raises:
    ValueError: If no input source provided or multiple sources provided
    HTTPError: If website_url cannot be accessed
    TimeoutError: If processing exceeds timeout limits
    ValidationError: If output_schema is malformed JSON
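
For illustration, a typical call might supply arguments like the following sketch. The prompt, URL, and schema values are hypothetical, and the exact invocation mechanism depends on your MCP client:

    arguments = {
        "user_prompt": "Extract product name, price, and availability status",
        "website_url": "https://example.com/products/item",
        "output_schema": {
            "type": "object",
            "properties": {
                "name": {"type": "string"},
                "price": {"type": "number"},
                "available": {"type": "boolean"},
            },
            "required": [],
        },
        "total_pages": 1,
    }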

Input Schema

Name               Required   Description   Default
user_prompt        Yes        -             -
website_url        No         -             -
website_html       No         -             -
website_markdown   No         -             -
output_schema      No         -             -
number_of_scrolls  No         -             -
total_pages        No         -             -
render_heavy_js    No         -             -
stealth            No         -             -

Output Schema

No output schema is defined.
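
Although no formal output schema is published, the Returns section above suggests a response shaped roughly like the following. The field names come from that section; the values are made up for illustration:

    {
        "extracted_data": {"name": "Example Product", "price": 19.99, "available": True},
        "metadata": {"source_url": "https://example.com/products/item"},  # structure not documented; hypothetical
        "credits_used": 10,
        "processing_time": 8.4,
        "pages_processed": 1,
        "status": "success",
    }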

Implementation Reference

  • MCP tool handler for 'smartscraper': validates parameters, normalizes output_schema, instantiates ScapeGraphClient, calls the API method, and handles exceptions with error responses.
    @mcp.tool(annotations={"readOnlyHint": True, "destructiveHint": False, "idempotentHint": True})
    def smartscraper(
        user_prompt: str,
        ctx: Context,
        website_url: Optional[str] = None,
        website_html: Optional[str] = None,
        website_markdown: Optional[str] = None,
        output_schema: Optional[Annotated[Union[str, Dict[str, Any]], Field(
            default=None,
            description="JSON schema dict or JSON string defining the expected output structure",
            json_schema_extra={
                "oneOf": [
                    {"type": "string"},
                    {"type": "object"}
                ]
            }
        )]] = None,
        number_of_scrolls: Optional[int] = None,
        total_pages: Optional[int] = None,
        render_heavy_js: Optional[bool] = None,
        stealth: Optional[bool] = None
    ) -> Dict[str, Any]:
        """
        Extract structured data from a webpage, HTML, or markdown using AI-powered extraction.
    
        This tool uses advanced AI to understand your natural language prompt and extract specific
        structured data from web content. Supports three input modes: URL scraping, raw HTML, or markdown content. Ideal for extracting product information, contact details,
        article metadata, or any structured content. Costs 10 credits per page. Read-only operation.
    
        Args:
            user_prompt (str): Natural language instructions describing what data to extract.
                - Be specific about the fields you want for better results
                - Use clear, descriptive language about the target data
                - Examples:
                  * "Extract product name, price, description, and availability status"
                  * "Find all contact methods: email addresses, phone numbers, and social media links"
                  * "Get article title, author, publication date, and summary"
                  * "Extract all job listings with title, company, location, and salary"
                - Tips for better results:
                  * Specify exact field names you want
                  * Mention data types (numbers, dates, URLs, etc.)
                  * Include context about where data might be located
    
            website_url (Optional[str]): The complete URL of the webpage to scrape.
                - Mutually exclusive with website_html and website_markdown
                - Must include protocol (http:// or https://)
                - Supports dynamic and static content
                - Examples:
                  * https://example.com/products/item
                  * https://news.site.com/article/123
                  * https://company.com/contact
                - Default: None (must provide one of the three input sources)
    
            website_html (Optional[str]): Raw HTML content to process locally.
                - Mutually exclusive with website_url and website_markdown
                - Maximum size: 2MB
                - Useful for processing pre-fetched or generated HTML
                - Use when you already have HTML content from another source
                - Example: "<html><body><h1>Title</h1><p>Content</p></body></html>"
                - Default: None
    
            website_markdown (Optional[str]): Markdown content to process locally.
                - Mutually exclusive with website_url and website_html
                - Maximum size: 2MB
                - Useful for extracting from markdown documents or converted content
                - Works well with documentation, README files, or converted web content
                - Example: "# Title\n\n## Section\n\nContent here..."
                - Default: None
    
            output_schema (Optional[Union[str, Dict]]): JSON schema defining expected output structure.
                - Can be provided as a dictionary or JSON string
                - Helps ensure consistent, structured output format
                - Optional but recommended for complex extractions
                - IMPORTANT: Must include a "required" field (can be empty array [] if no fields are required)
                - Examples:
                  * As dict: {'type': 'object', 'properties': {'title': {'type': 'string'}, 'price': {'type': 'number'}}, 'required': []}
                  * As JSON string: '{"type": "object", "properties": {"name": {"type": "string"}}, "required": []}'
                  * For arrays: {'type': 'array', 'items': {'type': 'object', 'properties': {...}, 'required': []}, 'required': []}
                  * With required fields: {'type': 'object', 'properties': {'name': {'type': 'string'}, 'email': {'type': 'string'}}, 'required': ['name', 'email']}
                - Note: If "required" field is missing, it will be automatically added as an empty array []
                - Default: None (AI will infer structure from prompt)
    
            number_of_scrolls (Optional[int]): Number of infinite scrolls to perform before scraping.
                - Range: 0-50 scrolls
                - Default: 0 (no scrolling)
                - Useful for dynamically loaded content (lazy loading, infinite scroll)
                - Each scroll waits for content to load before continuing
                - Examples:
                  * 0: Static content, no scrolling needed
                  * 3: Social media feeds, product listings
                  * 10: Long articles, extensive product catalogs
                - Note: Increases processing time proportionally
    
            total_pages (Optional[int]): Number of pages to process for pagination.
                - Range: 1-100 pages
                - Default: 1 (single page only)
                - Automatically follows pagination links when available
                - Useful for multi-page listings, search results, catalogs
                - Examples:
                  * 1: Single page extraction
                  * 5: First 5 pages of search results
                  * 20: Comprehensive catalog scraping
                - Note: Each page counts toward credit usage (10 credits × pages)
    
            render_heavy_js (Optional[bool]): Enable heavy JavaScript rendering for dynamic sites.
                - Default: false
                - Set to true for Single Page Applications (SPAs), React apps, Vue.js sites
                - Increases processing time but captures client-side rendered content
                - Use when content is loaded dynamically via JavaScript
                - Examples of when to use:
                  * React/Angular/Vue applications
                  * Sites with dynamic content loading
                  * AJAX-heavy interfaces
                  * Content that appears after page load
                - Note: Significantly increases processing time (30-60 seconds vs 5-15 seconds)
    
            stealth (Optional[bool]): Enable stealth mode to avoid bot detection.
                - Default: false
                - Helps bypass basic anti-scraping measures
                - Uses techniques to appear more like a human browser
                - Useful for sites with bot detection systems
                - Examples of when to use:
                  * Sites that block automated requests
                  * E-commerce sites with protection
                  * Sites that require "human-like" behavior
                - Note: May increase processing time and is not 100% guaranteed
    
        Returns:
            Dictionary containing:
            - extracted_data: The structured data matching your prompt and optional schema
            - metadata: Information about the extraction process
            - credits_used: Number of credits consumed (10 per page processed)
            - processing_time: Time taken for the extraction
            - pages_processed: Number of pages that were analyzed
            - status: Success/error status of the operation
    
        Raises:
            ValueError: If no input source provided or multiple sources provided
            HTTPError: If website_url cannot be accessed
            TimeoutError: If processing exceeds timeout limits
            ValidationError: If output_schema is malformed JSON
        """
        try:
            api_key = get_api_key(ctx)
            client = ScapeGraphClient(api_key)
    
            # Parse output_schema if it's a JSON string
            normalized_schema: Optional[Dict[str, Any]] = None
            if isinstance(output_schema, dict):
                normalized_schema = output_schema
            elif isinstance(output_schema, str):
                try:
                    parsed_schema = json.loads(output_schema)
                    if isinstance(parsed_schema, dict):
                        normalized_schema = parsed_schema
                    else:
                        return {"error": "output_schema must be a JSON object"}
                except json.JSONDecodeError as e:
                    return {"error": f"Invalid JSON for output_schema: {str(e)}"}
    
            # Ensure output_schema has a 'required' field if it exists
            if normalized_schema is not None:
                if "required" not in normalized_schema:
                    normalized_schema["required"] = []
    
            return client.smartscraper(
                user_prompt=user_prompt,
                website_url=website_url,
                website_html=website_html,
                website_markdown=website_markdown,
                output_schema=normalized_schema,
                number_of_scrolls=number_of_scrolls,
                total_pages=total_pages,
                render_heavy_js=render_heavy_js,
                stealth=stealth
            )
        except Exception as e:
            return {"error": str(e)}
  • ScapeGraphClient.smartscraper: constructs the API request payload with mutual-exclusion validation for input sources, makes a POST request to the /smartscraper endpoint, and returns the JSON response or raises on error (see the usage sketch after this list).
    def smartscraper(
        self,
        user_prompt: str,
        website_url: str = None,
        website_html: str = None,
        website_markdown: str = None,
        output_schema: Dict[str, Any] = None,
        number_of_scrolls: int = None,
        total_pages: int = None,
        render_heavy_js: bool = None,
        stealth: bool = None
    ) -> Dict[str, Any]:
        """
        Extract structured data from a webpage using AI.
    
        Args:
            user_prompt: Instructions for what data to extract
            website_url: URL of the webpage to scrape (mutually exclusive with website_html and website_markdown)
            website_html: HTML content to process locally (mutually exclusive with website_url and website_markdown, max 2MB)
            website_markdown: Markdown content to process locally (mutually exclusive with website_url and website_html, max 2MB)
            output_schema: JSON schema defining expected output structure (optional)
            number_of_scrolls: Number of infinite scrolls to perform (0-50, default 0)
            total_pages: Number of pages to process for pagination (1-100, default 1)
            render_heavy_js: Enable heavy JavaScript rendering for dynamic pages (default false)
            stealth: Enable stealth mode to avoid bot detection (default false)
    
        Returns:
            Dictionary containing the extracted data
        """
        url = f"{self.BASE_URL}/smartscraper"
        data = {"user_prompt": user_prompt}
    
        # Add input source (mutually exclusive)
        if website_url is not None:
            data["website_url"] = website_url
        elif website_html is not None:
            data["website_html"] = website_html
        elif website_markdown is not None:
            data["website_markdown"] = website_markdown
        else:
            raise ValueError("Must provide one of: website_url, website_html, or website_markdown")
    
        # Add optional parameters
        if output_schema is not None:
            data["output_schema"] = output_schema
        if number_of_scrolls is not None:
            data["number_of_scrolls"] = number_of_scrolls
        if total_pages is not None:
            data["total_pages"] = total_pages
        if render_heavy_js is not None:
            data["render_heavy_js"] = render_heavy_js
        if stealth is not None:
            data["stealth"] = stealth
    
        response = self.client.post(url, headers=self.headers, json=data)
    
        if response.status_code != 200:
            error_msg = f"Error {response.status_code}: {response.text}"
            raise Exception(error_msg)
    
        return response.json()
  • @mcp.tool decorator registers the smartscraper function as an MCP tool with read-only, non-destructive, idempotent hints.
    @mcp.tool(annotations={"readOnlyHint": True, "destructiveHint": False, "idempotentHint": True})
  • Pydantic schema validation for output_schema parameter in MCP tool: accepts str or dict, with oneOf JSON schema extra for MCP compatibility.
    output_schema: Optional[Annotated[Union[str, Dict[str, Any]], Field(
        default=None,
        description="JSON schema dict or JSON string defining the expected output structure",
        json_schema_extra={
            "oneOf": [
                {"type": "string"},
                {"type": "object"}
            ]
        }
    )]] = None,
    number_of_scrolls: Optional[int] = None,
    total_pages: Optional[int] = None,
    render_heavy_js: Optional[bool] = None,
    stealth: Optional[bool] = None
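
As a usage sketch only (not part of the published reference), the client method above could be called directly; the API key and inputs below are placeholders:

    # Minimal sketch, assuming ScapeGraphClient is importable from this server's package
    client = ScapeGraphClient("your-api-key")  # placeholder credential
    result = client.smartscraper(
        user_prompt="Extract article title, author, and publication date",
        website_url="https://news.site.com/article/123",
        output_schema={
            "type": "object",
            "properties": {"title": {"type": "string"}, "author": {"type": "string"}},
            "required": [],
        },
    )

When the same call goes through the MCP tool handler, exceptions are converted into an error dictionary rather than raised; for example, omitting all three input sources returns {"error": "Must provide one of: website_url, website_html, or website_markdown"}.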
Behavior 5/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

The description adds substantial behavioral context beyond what the annotations provide. While the annotations indicate a read-only, idempotent, non-destructive operation, the description adds crucial details: cost (10 credits per page), processing-time implications, input size limits (2MB for HTML/markdown), and specific behavioral traits such as mutual exclusivity of input sources, scroll behavior, pagination handling, and JavaScript rendering options; authentication requirements are not mentioned.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness 3/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is extremely long (over 800 words), with extensive parameter documentation that might be better placed in a separate reference. While well structured with clear sections, it is not front-loaded: the core purpose gets buried in verbose parameter detail. Some sentences could be more concise while maintaining clarity.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness 5/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Given the tool's complexity (9 parameters, AI-powered extraction, multiple input modes) and the presence of an output schema, the description is remarkably complete. It covers all parameters thoroughly, explains the return structure, documents errors and exceptions, provides cost information, and gives practical examples throughout. The existence of an output schema means the description does not need to detail return values, which it appropriately delegates.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters 5/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

With 0% schema description coverage for 9 parameters, the description carries the full burden of explaining parameter semantics and does so comprehensively. Each parameter gets detailed explanations with examples, constraints, defaults, and practical usage guidance. The description transforms what would be opaque parameters into well-understood inputs.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose 4/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states the tool's purpose as 'Extract structured data from a webpage, HTML, or markdown using AI-powered extraction' with specific examples of use cases (product info, contact details, etc.). It distinguishes itself from sibling tools by emphasizing 'AI-powered extraction' and 'structured data', but it does not explicitly differentiate from all sibling tools such as 'scrape' or 'searchscraper'.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines 4/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description provides clear context about when to use this tool (for AI-powered structured extraction from web content) and includes some usage tips. However, it doesn't explicitly state when NOT to use it or name specific alternatives among the sibling tools, though it implies this is for structured extraction vs. other scraping approaches.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
