ScrapeGraphAI / ScrapeGraph MCP Server (Official)

smartscraper

Extract structured data from webpages, HTML, or markdown using AI-powered natural language prompts to get specific information like product details, contact methods, or article metadata.

Instructions

Extract structured data from a webpage, HTML, or markdown using AI-powered extraction.

This tool uses advanced AI to understand your natural language prompt and extract specific
structured data from web content. Supports three input modes: URL scraping, raw HTML processing,
and markdown processing. Ideal for extracting product information, contact details,
article metadata, or any structured content. Costs 10 credits per page. Read-only operation.

Args:
    user_prompt (str): Natural language instructions describing what data to extract.
        - Be specific about the fields you want for better results
        - Use clear, descriptive language about the target data
        - Examples:
          * "Extract product name, price, description, and availability status"
          * "Find all contact methods: email addresses, phone numbers, and social media links"
          * "Get article title, author, publication date, and summary"
          * "Extract all job listings with title, company, location, and salary"
        - Tips for better results:
          * Specify exact field names you want
          * Mention data types (numbers, dates, URLs, etc.)
          * Include context about where data might be located

    website_url (Optional[str]): The complete URL of the webpage to scrape.
        - Mutually exclusive with website_html and website_markdown
        - Must include protocol (http:// or https://)
        - Supports dynamic and static content
        - Examples:
          * https://example.com/products/item
          * https://news.site.com/article/123
          * https://company.com/contact
        - Default: None (must provide one of the three input sources)

    website_html (Optional[str]): Raw HTML content to process locally.
        - Mutually exclusive with website_url and website_markdown
        - Maximum size: 2MB
        - Useful for processing pre-fetched or generated HTML
        - Use when you already have HTML content from another source
        - Example: "<html><body><h1>Title</h1><p>Content</p></body></html>"
        - Default: None

    website_markdown (Optional[str]): Markdown content to process locally.
        - Mutually exclusive with website_url and website_html
        - Maximum size: 2MB
        - Useful for extracting from markdown documents or converted content
        - Works well with documentation, README files, or converted web content
        - Example: "# Title

Section

Content here..." - Default: None

    output_schema (Optional[Union[str, Dict]]): JSON schema defining expected output structure.
        - Can be provided as a dictionary or JSON string
        - Helps ensure consistent, structured output format
        - Optional but recommended for complex extractions
        - IMPORTANT: Must include a "required" field (can be empty array [] if no fields are required)
        - Examples:
          * As dict: {'type': 'object', 'properties': {'title': {'type': 'string'}, 'price': {'type': 'number'}}, 'required': []}
          * As JSON string: '{"type": "object", "properties": {"name": {"type": "string"}}, "required": []}'
          * For arrays: {'type': 'array', 'items': {'type': 'object', 'properties': {...}, 'required': []}, 'required': []}
          * With required fields: {'type': 'object', 'properties': {'name': {'type': 'string'}, 'email': {'type': 'string'}}, 'required': ['name', 'email']}
        - Note: If "required" field is missing, it will be automatically added as an empty array []
        - Default: None (AI will infer structure from prompt)

    number_of_scrolls (Optional[int]): Number of infinite scrolls to perform before scraping.
        - Range: 0-50 scrolls
        - Default: 0 (no scrolling)
        - Useful for dynamically loaded content (lazy loading, infinite scroll)
        - Each scroll waits for content to load before continuing
        - Examples:
          * 0: Static content, no scrolling needed
          * 3: Social media feeds, product listings
          * 10: Long articles, extensive product catalogs
        - Note: Increases processing time proportionally

    total_pages (Optional[int]): Number of pages to process for pagination.
        - Range: 1-100 pages
        - Default: 1 (single page only)
        - Automatically follows pagination links when available
        - Useful for multi-page listings, search results, catalogs
        - Examples:
          * 1: Single page extraction
          * 5: First 5 pages of search results
          * 20: Comprehensive catalog scraping
        - Note: Each page counts toward credit usage (10 credits × pages)

    render_heavy_js (Optional[bool]): Enable heavy JavaScript rendering for dynamic sites.
        - Default: false
        - Set to true for Single Page Applications (SPAs), React apps, Vue.js sites
        - Increases processing time but captures client-side rendered content
        - Use when content is loaded dynamically via JavaScript
        - Examples of when to use:
          * React/Angular/Vue applications
          * Sites with dynamic content loading
          * AJAX-heavy interfaces
          * Content that appears after page load
        - Note: Significantly increases processing time (30-60 seconds vs 5-15 seconds)

    stealth (Optional[bool]): Enable stealth mode to avoid bot detection.
        - Default: false
        - Helps bypass basic anti-scraping measures
        - Uses techniques to appear more like a human browser
        - Useful for sites with bot detection systems
        - Examples of when to use:
          * Sites that block automated requests
          * E-commerce sites with protection
          * Sites that require "human-like" behavior
        - Note: May increase processing time and is not 100% guaranteed

Returns:
    Dictionary containing:
    - extracted_data: The structured data matching your prompt and optional schema
    - metadata: Information about the extraction process
    - credits_used: Number of credits consumed (10 per page processed)
    - processing_time: Time taken for the extraction
    - pages_processed: Number of pages that were analyzed
    - status: Success/error status of the operation

Raises:
    ValueError: If no input source provided or multiple sources provided
    HTTPError: If website_url cannot be accessed
    TimeoutError: If processing exceeds timeout limits
    ValidationError: If output_schema is malformed JSON
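
To make the parameters concrete, here is a minimal sketch of the arguments a client might pass for a single-page extraction. The URL, prompt, and field names are illustrative placeholders, not values from the source.

    # Illustrative argument set for one smartscraper call (values are hypothetical).
    arguments = {
        "user_prompt": "Extract product name, price, and availability status",
        "website_url": "https://example.com/products/item",  # exactly one input source
        "output_schema": {
            "type": "object",
            "properties": {
                "name": {"type": "string"},
                "price": {"type": "number"},
                "availability": {"type": "string"},
            },
            "required": [],  # must be present; an empty list is allowed
        },
        "total_pages": 1,  # 10 credits per page processed
    }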

Input Schema

Name               Required   Description                                                      Default
user_prompt        Yes        Natural language instructions describing what data to extract   -
website_url        No         URL of the webpage to scrape                                     None
website_html       No         Raw HTML content to process locally (max 2MB)                    None
website_markdown   No         Markdown content to process locally (max 2MB)                    None
output_schema      No         JSON schema (dict or JSON string) for the expected output        None
number_of_scrolls  No         Infinite scrolls to perform before scraping (0-50)               None
total_pages        No         Pages to process for pagination (1-100)                          None
render_heavy_js    No         Enable heavy JavaScript rendering for dynamic sites              None
stealth            No         Enable stealth mode to avoid bot detection                       None

Implementation Reference

  • MCP tool handler for 'smartscraper': validates parameters, normalizes output_schema, instantiates ScapeGraphClient, calls the API method, and handles exceptions with error responses.
    @mcp.tool(annotations={"readOnlyHint": True, "destructiveHint": False, "idempotentHint": True})
    def smartscraper(
        user_prompt: str,
        ctx: Context,
        website_url: Optional[str] = None,
        website_html: Optional[str] = None,
        website_markdown: Optional[str] = None,
        output_schema: Optional[Annotated[Union[str, Dict[str, Any]], Field(
            default=None,
            description="JSON schema dict or JSON string defining the expected output structure",
            json_schema_extra={
                "oneOf": [
                    {"type": "string"},
                    {"type": "object"}
                ]
            }
        )]] = None,
        number_of_scrolls: Optional[int] = None,
        total_pages: Optional[int] = None,
        render_heavy_js: Optional[bool] = None,
        stealth: Optional[bool] = None
    ) -> Dict[str, Any]:
        """
        Extract structured data from a webpage, HTML, or markdown using AI-powered extraction.
    
        This tool uses advanced AI to understand your natural language prompt and extract specific
        structured data from web content. Supports three input modes: URL scraping, raw HTML processing,
        and markdown processing. Ideal for extracting product information, contact details,
        article metadata, or any structured content. Costs 10 credits per page. Read-only operation.
    
        Args:
            user_prompt (str): Natural language instructions describing what data to extract.
                - Be specific about the fields you want for better results
                - Use clear, descriptive language about the target data
                - Examples:
                  * "Extract product name, price, description, and availability status"
                  * "Find all contact methods: email addresses, phone numbers, and social media links"
                  * "Get article title, author, publication date, and summary"
                  * "Extract all job listings with title, company, location, and salary"
                - Tips for better results:
                  * Specify exact field names you want
                  * Mention data types (numbers, dates, URLs, etc.)
                  * Include context about where data might be located
    
            website_url (Optional[str]): The complete URL of the webpage to scrape.
                - Mutually exclusive with website_html and website_markdown
                - Must include protocol (http:// or https://)
                - Supports dynamic and static content
                - Examples:
                  * https://example.com/products/item
                  * https://news.site.com/article/123
                  * https://company.com/contact
                - Default: None (must provide one of the three input sources)
    
            website_html (Optional[str]): Raw HTML content to process locally.
                - Mutually exclusive with website_url and website_markdown
                - Maximum size: 2MB
                - Useful for processing pre-fetched or generated HTML
                - Use when you already have HTML content from another source
                - Example: "<html><body><h1>Title</h1><p>Content</p></body></html>"
                - Default: None
    
            website_markdown (Optional[str]): Markdown content to process locally.
                - Mutually exclusive with website_url and website_html
                - Maximum size: 2MB
                - Useful for extracting from markdown documents or converted content
                - Works well with documentation, README files, or converted web content
                - Example: "# Title\n\n## Section\n\nContent here..."
                - Default: None
    
            output_schema (Optional[Union[str, Dict]]): JSON schema defining expected output structure.
                - Can be provided as a dictionary or JSON string
                - Helps ensure consistent, structured output format
                - Optional but recommended for complex extractions
                - IMPORTANT: Must include a "required" field (can be empty array [] if no fields are required)
                - Examples:
                  * As dict: {'type': 'object', 'properties': {'title': {'type': 'string'}, 'price': {'type': 'number'}}, 'required': []}
                  * As JSON string: '{"type": "object", "properties": {"name": {"type": "string"}}, "required": []}'
                  * For arrays: {'type': 'array', 'items': {'type': 'object', 'properties': {...}, 'required': []}, 'required': []}
                  * With required fields: {'type': 'object', 'properties': {'name': {'type': 'string'}, 'email': {'type': 'string'}}, 'required': ['name', 'email']}
                - Note: If "required" field is missing, it will be automatically added as an empty array []
                - Default: None (AI will infer structure from prompt)
    
            number_of_scrolls (Optional[int]): Number of infinite scrolls to perform before scraping.
                - Range: 0-50 scrolls
                - Default: 0 (no scrolling)
                - Useful for dynamically loaded content (lazy loading, infinite scroll)
                - Each scroll waits for content to load before continuing
                - Examples:
                  * 0: Static content, no scrolling needed
                  * 3: Social media feeds, product listings
                  * 10: Long articles, extensive product catalogs
                - Note: Increases processing time proportionally
    
            total_pages (Optional[int]): Number of pages to process for pagination.
                - Range: 1-100 pages
                - Default: 1 (single page only)
                - Automatically follows pagination links when available
                - Useful for multi-page listings, search results, catalogs
                - Examples:
                  * 1: Single page extraction
                  * 5: First 5 pages of search results
                  * 20: Comprehensive catalog scraping
                - Note: Each page counts toward credit usage (10 credits × pages)
    
            render_heavy_js (Optional[bool]): Enable heavy JavaScript rendering for dynamic sites.
                - Default: false
                - Set to true for Single Page Applications (SPAs), React apps, Vue.js sites
                - Increases processing time but captures client-side rendered content
                - Use when content is loaded dynamically via JavaScript
                - Examples of when to use:
                  * React/Angular/Vue applications
                  * Sites with dynamic content loading
                  * AJAX-heavy interfaces
                  * Content that appears after page load
                - Note: Significantly increases processing time (30-60 seconds vs 5-15 seconds)
    
            stealth (Optional[bool]): Enable stealth mode to avoid bot detection.
                - Default: false
                - Helps bypass basic anti-scraping measures
                - Uses techniques to appear more like a human browser
                - Useful for sites with bot detection systems
                - Examples of when to use:
                  * Sites that block automated requests
                  * E-commerce sites with protection
                  * Sites that require "human-like" behavior
                - Note: May increase processing time and is not 100% guaranteed
    
        Returns:
            Dictionary containing:
            - extracted_data: The structured data matching your prompt and optional schema
            - metadata: Information about the extraction process
            - credits_used: Number of credits consumed (10 per page processed)
            - processing_time: Time taken for the extraction
            - pages_processed: Number of pages that were analyzed
            - status: Success/error status of the operation
    
        Raises:
            ValueError: If no input source provided or multiple sources provided
            HTTPError: If website_url cannot be accessed
            TimeoutError: If processing exceeds timeout limits
            ValidationError: If output_schema is malformed JSON
        """
        try:
            api_key = get_api_key(ctx)
            client = ScapeGraphClient(api_key)
    
            # Parse output_schema if it's a JSON string
            normalized_schema: Optional[Dict[str, Any]] = None
            if isinstance(output_schema, dict):
                normalized_schema = output_schema
            elif isinstance(output_schema, str):
                try:
                    parsed_schema = json.loads(output_schema)
                    if isinstance(parsed_schema, dict):
                        normalized_schema = parsed_schema
                    else:
                        return {"error": "output_schema must be a JSON object"}
                except json.JSONDecodeError as e:
                    return {"error": f"Invalid JSON for output_schema: {str(e)}"}
    
            # Ensure output_schema has a 'required' field if it exists
            if normalized_schema is not None:
                if "required" not in normalized_schema:
                    normalized_schema["required"] = []
    
            return client.smartscraper(
                user_prompt=user_prompt,
                website_url=website_url,
                website_html=website_html,
                website_markdown=website_markdown,
                output_schema=normalized_schema,
                number_of_scrolls=number_of_scrolls,
                total_pages=total_pages,
                render_heavy_js=render_heavy_js,
                stealth=stealth
            )
        except Exception as e:
            return {"error": str(e)}
  • ScapeGraphClient.smartscraper: Constructs the API request payload with mutual exclusion validation for input sources, makes POST to /smartscraper endpoint, and returns JSON response or raises on error.
    def smartscraper(
        self,
        user_prompt: str,
        website_url: str = None,
        website_html: str = None,
        website_markdown: str = None,
        output_schema: Dict[str, Any] = None,
        number_of_scrolls: int = None,
        total_pages: int = None,
        render_heavy_js: bool = None,
        stealth: bool = None
    ) -> Dict[str, Any]:
        """
        Extract structured data from a webpage using AI.
    
        Args:
            user_prompt: Instructions for what data to extract
            website_url: URL of the webpage to scrape (mutually exclusive with website_html and website_markdown)
            website_html: HTML content to process locally (mutually exclusive with website_url and website_markdown, max 2MB)
            website_markdown: Markdown content to process locally (mutually exclusive with website_url and website_html, max 2MB)
            output_schema: JSON schema defining expected output structure (optional)
            number_of_scrolls: Number of infinite scrolls to perform (0-50, default 0)
            total_pages: Number of pages to process for pagination (1-100, default 1)
            render_heavy_js: Enable heavy JavaScript rendering for dynamic pages (default false)
            stealth: Enable stealth mode to avoid bot detection (default false)
    
        Returns:
            Dictionary containing the extracted data
        """
        url = f"{self.BASE_URL}/smartscraper"
        data = {"user_prompt": user_prompt}
    
        # Add input source (mutually exclusive)
        if website_url is not None:
            data["website_url"] = website_url
        elif website_html is not None:
            data["website_html"] = website_html
        elif website_markdown is not None:
            data["website_markdown"] = website_markdown
        else:
            raise ValueError("Must provide one of: website_url, website_html, or website_markdown")
    
        # Add optional parameters
        if output_schema is not None:
            data["output_schema"] = output_schema
        if number_of_scrolls is not None:
            data["number_of_scrolls"] = number_of_scrolls
        if total_pages is not None:
            data["total_pages"] = total_pages
        if render_heavy_js is not None:
            data["render_heavy_js"] = render_heavy_js
        if stealth is not None:
            data["stealth"] = stealth
    
        response = self.client.post(url, headers=self.headers, json=data)
    
        if response.status_code != 200:
            error_msg = f"Error {response.status_code}: {response.text}"
            raise Exception(error_msg)
    
        return response.json()
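  • Hypothetical direct usage of ScapeGraphClient outside the MCP handler. The API key and URL are placeholders; construction with a single api_key argument follows the handler code above.
    # Sketch only: assumes ScapeGraphClient from the server module is in scope
    # and that "sgai-xxxxxxxx" is replaced with a valid ScrapeGraph API key.
    client = ScapeGraphClient("sgai-xxxxxxxx")
    result = client.smartscraper(
        user_prompt="Get article title, author, and publication date",
        website_url="https://news.site.com/article/123",
        total_pages=1,
    )
    print(result)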
  • @mcp.tool decorator registers the smartscraper function as an MCP tool with read-only, non-destructive, idempotent hints.
    @mcp.tool(annotations={"readOnlyHint": True, "destructiveHint": False, "idempotentHint": True})
  • Pydantic schema validation for output_schema parameter in MCP tool: accepts str or dict, with oneOf JSON schema extra for MCP compatibility.
    output_schema: Optional[Annotated[Union[str, Dict[str, Any]], Field(
        default=None,
        description="JSON schema dict or JSON string defining the expected output structure",
        json_schema_extra={
            "oneOf": [
                {"type": "string"},
                {"type": "object"}
            ]
        }
    )]] = None,
    number_of_scrolls: Optional[int] = None,
    total_pages: Optional[int] = None,
    render_heavy_js: Optional[bool] = None,
    stealth: Optional[bool] = None
