ScrapeGraphAI

ScrapeGraph MCP Server

Official

smartscraper

Read-only · Idempotent

Extract structured data from webpages, HTML, or markdown using AI-powered natural language prompts to get specific information like product details, contact methods, or article metadata.

Instructions

Extract structured data from a webpage, HTML, or markdown using AI-powered extraction.

This tool uses advanced AI to understand your natural language prompt and extract specific
structured data from web content. Supports three input modes: URL scraping, raw HTML, or markdown content. Ideal for extracting product information, contact details,
article metadata, or any structured content. Costs 10 credits per page. Read-only operation.

Args:
    user_prompt (str): Natural language instructions describing what data to extract.
        - Be specific about the fields you want for better results
        - Use clear, descriptive language about the target data
        - Examples:
          * "Extract product name, price, description, and availability status"
          * "Find all contact methods: email addresses, phone numbers, and social media links"
          * "Get article title, author, publication date, and summary"
          * "Extract all job listings with title, company, location, and salary"
        - Tips for better results:
          * Specify exact field names you want
          * Mention data types (numbers, dates, URLs, etc.)
          * Include context about where data might be located

    website_url (Optional[str]): The complete URL of the webpage to scrape.
        - Mutually exclusive with website_html and website_markdown
        - Must include protocol (http:// or https://)
        - Supports dynamic and static content
        - Examples:
          * https://example.com/products/item
          * https://news.site.com/article/123
          * https://company.com/contact
        - Default: None (must provide one of the three input sources)

    website_html (Optional[str]): Raw HTML content to process locally.
        - Mutually exclusive with website_url and website_markdown
        - Maximum size: 2MB
        - Useful for processing pre-fetched or generated HTML
        - Use when you already have HTML content from another source
        - Example: "<html><body><h1>Title</h1><p>Content</p></body></html>"
        - Default: None

    website_markdown (Optional[str]): Markdown content to process locally.
        - Mutually exclusive with website_url and website_html
        - Maximum size: 2MB
        - Useful for extracting from markdown documents or converted content
        - Works well with documentation, README files, or converted web content
        - Example: "# Title

Section

Content here..." - Default: None

    output_schema (Optional[Union[str, Dict]]): JSON schema defining expected output structure.
        - Can be provided as a dictionary or JSON string
        - Helps ensure consistent, structured output format
        - Optional but recommended for complex extractions
        - IMPORTANT: Must include a "required" field (can be empty array [] if no fields are required)
        - Examples:
          * As dict: {'type': 'object', 'properties': {'title': {'type': 'string'}, 'price': {'type': 'number'}}, 'required': []}
          * As JSON string: '{"type": "object", "properties": {"name": {"type": "string"}}, "required": []}'
          * For arrays: {'type': 'array', 'items': {'type': 'object', 'properties': {...}, 'required': []}, 'required': []}
          * With required fields: {'type': 'object', 'properties': {'name': {'type': 'string'}, 'email': {'type': 'string'}}, 'required': ['name', 'email']}
        - Note: If "required" field is missing, it will be automatically added as an empty array []
        - Default: None (AI will infer structure from prompt)

    number_of_scrolls (Optional[int]): Number of infinite scrolls to perform before scraping.
        - Range: 0-50 scrolls
        - Default: 0 (no scrolling)
        - Useful for dynamically loaded content (lazy loading, infinite scroll)
        - Each scroll waits for content to load before continuing
        - Examples:
          * 0: Static content, no scrolling needed
          * 3: Social media feeds, product listings
          * 10: Long articles, extensive product catalogs
        - Note: Increases processing time proportionally

    total_pages (Optional[int]): Number of pages to process for pagination.
        - Range: 1-100 pages
        - Default: 1 (single page only)
        - Automatically follows pagination links when available
        - Useful for multi-page listings, search results, catalogs
        - Examples:
          * 1: Single page extraction
          * 5: First 5 pages of search results
          * 20: Comprehensive catalog scraping
        - Note: Each page counts toward credit usage (10 credits × pages)

    render_heavy_js (Optional[bool]): Enable heavy JavaScript rendering for dynamic sites.
        - Default: false
        - Set to true for Single Page Applications (SPAs), React apps, Vue.js sites
        - Increases processing time but captures client-side rendered content
        - Use when content is loaded dynamically via JavaScript
        - Examples of when to use:
          * React/Angular/Vue applications
          * Sites with dynamic content loading
          * AJAX-heavy interfaces
          * Content that appears after page load
        - Note: Significantly increases processing time (30-60 seconds vs 5-15 seconds)

    stealth (Optional[bool]): Enable stealth mode to avoid bot detection.
        - Default: false
        - Helps bypass basic anti-scraping measures
        - Uses techniques to appear more like a human browser
        - Useful for sites with bot detection systems
        - Examples of when to use:
          * Sites that block automated requests
          * E-commerce sites with protection
          * Sites that require "human-like" behavior
        - Note: May increase processing time and is not 100% guaranteed

Returns:
    Dictionary containing:
    - extracted_data: The structured data matching your prompt and optional schema
    - metadata: Information about the extraction process
    - credits_used: Number of credits consumed (10 per page processed)
    - processing_time: Time taken for the extraction
    - pages_processed: Number of pages that were analyzed
    - status: Success/error status of the operation

Raises:
    ValueError: If no input source provided or multiple sources provided
    HTTPError: If website_url cannot be accessed
    TimeoutError: If processing exceeds timeout limits
    ValidationError: If output_schema is malformed JSON
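
For illustration, a typical call might supply arguments like the following sketch. The prompt, URL, and schema values are hypothetical, and the exact invocation mechanism depends on your MCP client:

    arguments = {
        "user_prompt": "Extract product name, price, and availability status",
        "website_url": "https://example.com/products/item",
        "output_schema": {
            "type": "object",
            "properties": {
                "name": {"type": "string"},
                "price": {"type": "number"},
                "available": {"type": "boolean"},
            },
            "required": [],
        },
        "total_pages": 1,
    }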

Input Schema

Name               Required   Description   Default
user_prompt        Yes        -             -
website_url        No         -             -
website_html       No         -             -
website_markdown   No         -             -
output_schema      No         -             -
number_of_scrolls  No         -             -
total_pages        No         -             -
render_heavy_js    No         -             -
stealth            No         -             -

Output Schema

No output schema is defined.
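
Although no formal output schema is published, the Returns section above suggests a response shaped roughly like the following. The field names come from that section; the values are made up for illustration:

    {
        "extracted_data": {"name": "Example Product", "price": 19.99, "available": True},
        "metadata": {"source_url": "https://example.com/products/item"},  # structure not documented; hypothetical
        "credits_used": 10,
        "processing_time": 8.4,
        "pages_processed": 1,
        "status": "success",
    }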

Implementation Reference

  • MCP tool handler for 'smartscraper': validates parameters, normalizes output_schema, instantiates ScapeGraphClient, calls the API method, and handles exceptions with error responses.
    @mcp.tool(annotations={"readOnlyHint": True, "destructiveHint": False, "idempotentHint": True})
    def smartscraper(
        user_prompt: str,
        ctx: Context,
        website_url: Optional[str] = None,
        website_html: Optional[str] = None,
        website_markdown: Optional[str] = None,
        output_schema: Optional[Annotated[Union[str, Dict[str, Any]], Field(
            default=None,
            description="JSON schema dict or JSON string defining the expected output structure",
            json_schema_extra={
                "oneOf": [
                    {"type": "string"},
                    {"type": "object"}
                ]
            }
        )]] = None,
        number_of_scrolls: Optional[int] = None,
        total_pages: Optional[int] = None,
        render_heavy_js: Optional[bool] = None,
        stealth: Optional[bool] = None
    ) -> Dict[str, Any]:
        """
        Extract structured data from a webpage, HTML, or markdown using AI-powered extraction.
    
        This tool uses advanced AI to understand your natural language prompt and extract specific
        structured data from web content. Supports three input modes: URL scraping, raw HTML, or markdown content. Ideal for extracting product information, contact details,
        article metadata, or any structured content. Costs 10 credits per page. Read-only operation.
    
        Args:
            user_prompt (str): Natural language instructions describing what data to extract.
                - Be specific about the fields you want for better results
                - Use clear, descriptive language about the target data
                - Examples:
                  * "Extract product name, price, description, and availability status"
                  * "Find all contact methods: email addresses, phone numbers, and social media links"
                  * "Get article title, author, publication date, and summary"
                  * "Extract all job listings with title, company, location, and salary"
                - Tips for better results:
                  * Specify exact field names you want
                  * Mention data types (numbers, dates, URLs, etc.)
                  * Include context about where data might be located
    
            website_url (Optional[str]): The complete URL of the webpage to scrape.
                - Mutually exclusive with website_html and website_markdown
                - Must include protocol (http:// or https://)
                - Supports dynamic and static content
                - Examples:
                  * https://example.com/products/item
                  * https://news.site.com/article/123
                  * https://company.com/contact
                - Default: None (must provide one of the three input sources)
    
            website_html (Optional[str]): Raw HTML content to process locally.
                - Mutually exclusive with website_url and website_markdown
                - Maximum size: 2MB
                - Useful for processing pre-fetched or generated HTML
                - Use when you already have HTML content from another source
                - Example: "<html><body><h1>Title</h1><p>Content</p></body></html>"
                - Default: None
    
            website_markdown (Optional[str]): Markdown content to process locally.
                - Mutually exclusive with website_url and website_html
                - Maximum size: 2MB
                - Useful for extracting from markdown documents or converted content
                - Works well with documentation, README files, or converted web content
                - Example: "# Title\n\n## Section\n\nContent here..."
                - Default: None
    
            output_schema (Optional[Union[str, Dict]]): JSON schema defining expected output structure.
                - Can be provided as a dictionary or JSON string
                - Helps ensure consistent, structured output format
                - Optional but recommended for complex extractions
                - IMPORTANT: Must include a "required" field (can be empty array [] if no fields are required)
                - Examples:
                  * As dict: {'type': 'object', 'properties': {'title': {'type': 'string'}, 'price': {'type': 'number'}}, 'required': []}
                  * As JSON string: '{"type": "object", "properties": {"name": {"type": "string"}}, "required": []}'
                  * For arrays: {'type': 'array', 'items': {'type': 'object', 'properties': {...}, 'required': []}, 'required': []}
                  * With required fields: {'type': 'object', 'properties': {'name': {'type': 'string'}, 'email': {'type': 'string'}}, 'required': ['name', 'email']}
                - Note: If "required" field is missing, it will be automatically added as an empty array []
                - Default: None (AI will infer structure from prompt)
    
            number_of_scrolls (Optional[int]): Number of infinite scrolls to perform before scraping.
                - Range: 0-50 scrolls
                - Default: 0 (no scrolling)
                - Useful for dynamically loaded content (lazy loading, infinite scroll)
                - Each scroll waits for content to load before continuing
                - Examples:
                  * 0: Static content, no scrolling needed
                  * 3: Social media feeds, product listings
                  * 10: Long articles, extensive product catalogs
                - Note: Increases processing time proportionally
    
            total_pages (Optional[int]): Number of pages to process for pagination.
                - Range: 1-100 pages
                - Default: 1 (single page only)
                - Automatically follows pagination links when available
                - Useful for multi-page listings, search results, catalogs
                - Examples:
                  * 1: Single page extraction
                  * 5: First 5 pages of search results
                  * 20: Comprehensive catalog scraping
                - Note: Each page counts toward credit usage (10 credits × pages)
    
            render_heavy_js (Optional[bool]): Enable heavy JavaScript rendering for dynamic sites.
                - Default: false
                - Set to true for Single Page Applications (SPAs), React apps, Vue.js sites
                - Increases processing time but captures client-side rendered content
                - Use when content is loaded dynamically via JavaScript
                - Examples of when to use:
                  * React/Angular/Vue applications
                  * Sites with dynamic content loading
                  * AJAX-heavy interfaces
                  * Content that appears after page load
                - Note: Significantly increases processing time (30-60 seconds vs 5-15 seconds)
    
            stealth (Optional[bool]): Enable stealth mode to avoid bot detection.
                - Default: false
                - Helps bypass basic anti-scraping measures
                - Uses techniques to appear more like a human browser
                - Useful for sites with bot detection systems
                - Examples of when to use:
                  * Sites that block automated requests
                  * E-commerce sites with protection
                  * Sites that require "human-like" behavior
                - Note: May increase processing time and is not 100% guaranteed
    
        Returns:
            Dictionary containing:
            - extracted_data: The structured data matching your prompt and optional schema
            - metadata: Information about the extraction process
            - credits_used: Number of credits consumed (10 per page processed)
            - processing_time: Time taken for the extraction
            - pages_processed: Number of pages that were analyzed
            - status: Success/error status of the operation
    
        Raises:
            ValueError: If no input source provided or multiple sources provided
            HTTPError: If website_url cannot be accessed
            TimeoutError: If processing exceeds timeout limits
            ValidationError: If output_schema is malformed JSON
        """
        try:
            api_key = get_api_key(ctx)
            client = ScapeGraphClient(api_key)
    
            # Parse output_schema if it's a JSON string
            normalized_schema: Optional[Dict[str, Any]] = None
            if isinstance(output_schema, dict):
                normalized_schema = output_schema
            elif isinstance(output_schema, str):
                try:
                    parsed_schema = json.loads(output_schema)
                    if isinstance(parsed_schema, dict):
                        normalized_schema = parsed_schema
                    else:
                        return {"error": "output_schema must be a JSON object"}
                except json.JSONDecodeError as e:
                    return {"error": f"Invalid JSON for output_schema: {str(e)}"}
    
            # Ensure output_schema has a 'required' field if it exists
            if normalized_schema is not None:
                if "required" not in normalized_schema:
                    normalized_schema["required"] = []
    
            return client.smartscraper(
                user_prompt=user_prompt,
                website_url=website_url,
                website_html=website_html,
                website_markdown=website_markdown,
                output_schema=normalized_schema,
                number_of_scrolls=number_of_scrolls,
                total_pages=total_pages,
                render_heavy_js=render_heavy_js,
                stealth=stealth
            )
        except Exception as e:
            return {"error": str(e)}
  • ScapeGraphClient.smartscraper: constructs the API request payload with mutual-exclusion validation for input sources, makes a POST request to the /smartscraper endpoint, and returns the JSON response or raises on error (see the usage sketch after this list).
    def smartscraper(
        self,
        user_prompt: str,
        website_url: str = None,
        website_html: str = None,
        website_markdown: str = None,
        output_schema: Dict[str, Any] = None,
        number_of_scrolls: int = None,
        total_pages: int = None,
        render_heavy_js: bool = None,
        stealth: bool = None
    ) -> Dict[str, Any]:
        """
        Extract structured data from a webpage using AI.
    
        Args:
            user_prompt: Instructions for what data to extract
            website_url: URL of the webpage to scrape (mutually exclusive with website_html and website_markdown)
            website_html: HTML content to process locally (mutually exclusive with website_url and website_markdown, max 2MB)
            website_markdown: Markdown content to process locally (mutually exclusive with website_url and website_html, max 2MB)
            output_schema: JSON schema defining expected output structure (optional)
            number_of_scrolls: Number of infinite scrolls to perform (0-50, default 0)
            total_pages: Number of pages to process for pagination (1-100, default 1)
            render_heavy_js: Enable heavy JavaScript rendering for dynamic pages (default false)
            stealth: Enable stealth mode to avoid bot detection (default false)
    
        Returns:
            Dictionary containing the extracted data
        """
        url = f"{self.BASE_URL}/smartscraper"
        data = {"user_prompt": user_prompt}
    
        # Add input source (mutually exclusive)
        if website_url is not None:
            data["website_url"] = website_url
        elif website_html is not None:
            data["website_html"] = website_html
        elif website_markdown is not None:
            data["website_markdown"] = website_markdown
        else:
            raise ValueError("Must provide one of: website_url, website_html, or website_markdown")
    
        # Add optional parameters
        if output_schema is not None:
            data["output_schema"] = output_schema
        if number_of_scrolls is not None:
            data["number_of_scrolls"] = number_of_scrolls
        if total_pages is not None:
            data["total_pages"] = total_pages
        if render_heavy_js is not None:
            data["render_heavy_js"] = render_heavy_js
        if stealth is not None:
            data["stealth"] = stealth
    
        response = self.client.post(url, headers=self.headers, json=data)
    
        if response.status_code != 200:
            error_msg = f"Error {response.status_code}: {response.text}"
            raise Exception(error_msg)
    
        return response.json()
  • @mcp.tool decorator registers the smartscraper function as an MCP tool with read-only, non-destructive, idempotent hints.
    @mcp.tool(annotations={"readOnlyHint": True, "destructiveHint": False, "idempotentHint": True})
  • Pydantic schema validation for output_schema parameter in MCP tool: accepts str or dict, with oneOf JSON schema extra for MCP compatibility.
    output_schema: Optional[Annotated[Union[str, Dict[str, Any]], Field(
        default=None,
        description="JSON schema dict or JSON string defining the expected output structure",
        json_schema_extra={
            "oneOf": [
                {"type": "string"},
                {"type": "object"}
            ]
        }
    )]] = None,
    number_of_scrolls: Optional[int] = None,
    total_pages: Optional[int] = None,
    render_heavy_js: Optional[bool] = None,
    stealth: Optional[bool] = None
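
As a usage sketch only (not part of the published reference), the client method above could be called directly; the API key and inputs below are placeholders:

    # Minimal sketch, assuming ScapeGraphClient is importable from this server's package
    client = ScapeGraphClient("your-api-key")  # placeholder credential
    result = client.smartscraper(
        user_prompt="Extract article title, author, and publication date",
        website_url="https://news.site.com/article/123",
        output_schema={
            "type": "object",
            "properties": {"title": {"type": "string"}, "author": {"type": "string"}},
            "required": [],
        },
    )

When the same call goes through the MCP tool handler, exceptions are converted into an error dictionary rather than raised; for example, omitting all three input sources returns {"error": "Must provide one of: website_url, website_html, or website_markdown"}.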
Behavior 5/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

The description adds substantial behavioral context beyond what the annotations provide. While the annotations indicate a read-only, idempotent, non-destructive operation, the description adds crucial details: cost (10 credits per page), processing-time implications, input size limits (2MB for HTML/markdown), and specific behavioral traits such as mutual exclusivity of input sources, scroll behavior, pagination handling, and JavaScript rendering options; authentication requirements are not mentioned.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness 3/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is extremely long (over 800 words), with extensive parameter documentation that might be better placed in a separate reference. While well structured with clear sections, it is not front-loaded: the core purpose gets buried in verbose parameter detail. Some sentences could be more concise while maintaining clarity.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness 5/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Given the tool's complexity (9 parameters, AI-powered extraction, multiple input modes) and the presence of an output schema, the description is remarkably complete. It covers all parameters thoroughly, explains the return structure, documents errors and exceptions, provides cost information, and gives practical examples throughout. The existence of an output schema means the description does not need to detail return values, which it appropriately delegates.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters 5/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

With 0% schema description coverage for 9 parameters, the description carries the full burden of explaining parameter semantics and does so comprehensively. Each parameter gets detailed explanations with examples, constraints, defaults, and practical usage guidance. The description transforms what would be opaque parameters into well-understood inputs.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose 4/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states the tool's purpose as 'Extract structured data from a webpage, HTML, or markdown using AI-powered extraction' with specific examples of use cases (product info, contact details, etc.). It distinguishes itself from sibling tools by emphasizing 'AI-powered extraction' and 'structured data', but it does not explicitly differentiate from all sibling tools such as 'scrape' or 'searchscraper'.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines 4/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description provides clear context about when to use this tool (for AI-powered structured extraction from web content) and includes some usage tips. However, it doesn't explicitly state when NOT to use it or name specific alternatives among the sibling tools, though it implies this is for structured extraction vs. other scraping approaches.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
