ScrapeGraphAI

ScrapeGraph MCP Server

Official

smartscraper

Extract structured data, such as product details, contact methods, or article metadata, from webpages, raw HTML, or markdown using AI-powered extraction guided by a natural language prompt.

Instructions

Extract structured data from a webpage, HTML, or markdown using AI-powered extraction.

This tool uses advanced AI to understand your natural language prompt and extract specific structured data from web content. Supports three input modes: URL scraping, raw HTML processing, and markdown processing. Ideal for extracting product information, contact details, article metadata, or any structured content. Costs 10 credits per page. Read-only operation.

Args:
    user_prompt (str): Natural language instructions describing what data to extract.
        - Be specific about the fields you want for better results
        - Use clear, descriptive language about the target data
        - Examples:
            * "Extract product name, price, description, and availability status"
            * "Find all contact methods: email addresses, phone numbers, and social media links"
            * "Get article title, author, publication date, and summary"
            * "Extract all job listings with title, company, location, and salary"
        - Tips for better results:
            * Specify exact field names you want
            * Mention data types (numbers, dates, URLs, etc.)
            * Include context about where data might be located
    website_url (Optional[str]): The complete URL of the webpage to scrape.
        - Mutually exclusive with website_html and website_markdown
        - Must include protocol (http:// or https://)
        - Supports dynamic and static content
        - Examples:
            * https://example.com/products/item
            * https://news.site.com/article/123
            * https://company.com/contact
        - Default: None (must provide one of the three input sources)
    website_html (Optional[str]): Raw HTML content to process locally.
        - Mutually exclusive with website_url and website_markdown
        - Maximum size: 2MB
        - Useful for processing pre-fetched or generated HTML
        - Use when you already have HTML content from another source
        - Example: "<html><body><h1>Title</h1><p>Content</p></body></html>"
        - Default: None
    website_markdown (Optional[str]): Markdown content to process locally.
        - Mutually exclusive with website_url and website_html
        - Maximum size: 2MB
        - Useful for extracting from markdown documents or converted content
        - Works well with documentation, README files, or converted web content
        - Example: "# Title\n\n## Section\n\nContent here..."
        - Default: None
    output_schema (Optional[Union[str, Dict]]): JSON schema defining expected output structure.
        - Can be provided as a dictionary or JSON string
        - Helps ensure consistent, structured output format
        - Optional but recommended for complex extractions
        - IMPORTANT: Must include a "required" field (can be an empty array [] if no fields are required)
        - Examples:
            * As dict: {'type': 'object', 'properties': {'title': {'type': 'string'}, 'price': {'type': 'number'}}, 'required': []}
            * As JSON string: '{"type": "object", "properties": {"name": {"type": "string"}}, "required": []}'
            * For arrays: {'type': 'array', 'items': {'type': 'object', 'properties': {...}, 'required': []}, 'required': []}
            * With required fields: {'type': 'object', 'properties': {'name': {'type': 'string'}, 'email': {'type': 'string'}}, 'required': ['name', 'email']}
        - Note: If the "required" field is missing, it will be automatically added as an empty array []
        - Default: None (AI will infer structure from prompt)
    number_of_scrolls (Optional[int]): Number of infinite scrolls to perform before scraping.
        - Range: 0-50 scrolls
        - Default: 0 (no scrolling)
        - Useful for dynamically loaded content (lazy loading, infinite scroll)
        - Each scroll waits for content to load before continuing
        - Examples:
            * 0: Static content, no scrolling needed
            * 3: Social media feeds, product listings
            * 10: Long articles, extensive product catalogs
        - Note: Increases processing time proportionally
    total_pages (Optional[int]): Number of pages to process for pagination.
        - Range: 1-100 pages
        - Default: 1 (single page only)
        - Automatically follows pagination links when available
        - Useful for multi-page listings, search results, catalogs
        - Examples:
            * 1: Single page extraction
            * 5: First 5 pages of search results
            * 20: Comprehensive catalog scraping
        - Note: Each page counts toward credit usage (10 credits × pages)
    render_heavy_js (Optional[bool]): Enable heavy JavaScript rendering for dynamic sites.
        - Default: false
        - Set to true for Single Page Applications (SPAs), React apps, Vue.js sites
        - Increases processing time but captures client-side rendered content
        - Use when content is loaded dynamically via JavaScript
        - Examples of when to use:
            * React/Angular/Vue applications
            * Sites with dynamic content loading
            * AJAX-heavy interfaces
            * Content that appears after page load
        - Note: Significantly increases processing time (30-60 seconds vs 5-15 seconds)
    stealth (Optional[bool]): Enable stealth mode to avoid bot detection.
        - Default: false
        - Helps bypass basic anti-scraping measures
        - Uses techniques to appear more like a human browser
        - Useful for sites with bot detection systems
        - Examples of when to use:
            * Sites that block automated requests
            * E-commerce sites with protection
            * Sites that require "human-like" behavior
        - Note: May increase processing time and is not 100% guaranteed

Returns:
    Dictionary containing:
    - extracted_data: The structured data matching your prompt and optional schema
    - metadata: Information about the extraction process
    - credits_used: Number of credits consumed (10 per page processed)
    - processing_time: Time taken for the extraction
    - pages_processed: Number of pages that were analyzed
    - status: Success/error status of the operation

Raises:
    ValueError: If no input source provided or multiple sources provided
    HTTPError: If website_url cannot be accessed
    TimeoutError: If processing exceeds timeout limits
    ValidationError: If output_schema is malformed JSON
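As an illustration of the parameters above, a tool call that combines a prompt, a URL, and an output schema might pass arguments shaped like the following sketch. The URL, field names, and schema are made-up examples, not values from the live API.

    # Illustrative smartscraper arguments (all values are hypothetical examples).
    smartscraper_args = {
        "user_prompt": "Extract product name, price, description, and availability status",
        "website_url": "https://example.com/products/item",
        "output_schema": {
            "type": "object",
            "properties": {
                "name": {"type": "string"},
                "price": {"type": "number"},
                "description": {"type": "string"},
                "availability": {"type": "string"},
            },
            # Per the note above, "required" must be present (an empty list is allowed).
            "required": [],
        },
        "number_of_scrolls": 3,  # scroll a few times for lazy-loaded content
        "total_pages": 1,
    }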

Input Schema

Name                 Required   Description                                                       Default
user_prompt          Yes        Natural language instructions describing what data to extract    —
website_url          No         Complete URL of the webpage to scrape                             None
website_html         No         Raw HTML content to process locally (max 2MB)                     None
website_markdown     No         Markdown content to process locally (max 2MB)                     None
output_schema        No         JSON schema (dict or JSON string) defining the output structure  None
number_of_scrolls    No         Number of infinite scrolls to perform (0-50)                      0
total_pages          No         Number of pages to process for pagination (1-100)                 1
render_heavy_js      No         Enable heavy JavaScript rendering for dynamic sites               false
stealth              No         Enable stealth mode to avoid bot detection                        false
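
Since the three content sources are mutually exclusive, each call supplies exactly one of them. The sketch below shows the three minimal argument shapes with illustrative content.

    # Exactly one of website_url, website_html, or website_markdown per call.
    by_url = {
        "user_prompt": "Get article title, author, and publication date",
        "website_url": "https://news.site.com/article/123",
    }
    by_html = {
        "user_prompt": "Extract the page heading and first paragraph",
        "website_html": "<html><body><h1>Title</h1><p>Content</p></body></html>",
    }
    by_markdown = {
        "user_prompt": "Summarize the sections of this document",
        "website_markdown": "# Title\n\n## Section\n\nContent here...",
    }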

Implementation Reference

  • MCP tool handler for 'smartscraper': validates parameters, normalizes output_schema, instantiates ScapeGraphClient, calls the API method, and handles exceptions with error responses.
    @mcp.tool(annotations={"readOnlyHint": True, "destructiveHint": False, "idempotentHint": True})
    def smartscraper(
        user_prompt: str,
        ctx: Context,
        website_url: Optional[str] = None,
        website_html: Optional[str] = None,
        website_markdown: Optional[str] = None,
        output_schema: Optional[Annotated[Union[str, Dict[str, Any]], Field(
            default=None,
            description="JSON schema dict or JSON string defining the expected output structure",
            json_schema_extra={
                "oneOf": [
                    {"type": "string"},
                    {"type": "object"}
                ]
            }
        )]] = None,
        number_of_scrolls: Optional[int] = None,
        total_pages: Optional[int] = None,
        render_heavy_js: Optional[bool] = None,
        stealth: Optional[bool] = None
    ) -> Dict[str, Any]:
        """
        Extract structured data from a webpage, HTML, or markdown using AI-powered extraction.

        Supports three mutually exclusive input sources (website_url, website_html,
        website_markdown). Costs 10 credits per page. Read-only operation. The full
        parameter documentation is reproduced in the Instructions section above.
        """
        try:
            api_key = get_api_key(ctx)
            client = ScapeGraphClient(api_key)

            # Parse output_schema if it's a JSON string
            normalized_schema: Optional[Dict[str, Any]] = None
            if isinstance(output_schema, dict):
                normalized_schema = output_schema
            elif isinstance(output_schema, str):
                try:
                    parsed_schema = json.loads(output_schema)
                    if isinstance(parsed_schema, dict):
                        normalized_schema = parsed_schema
                    else:
                        return {"error": "output_schema must be a JSON object"}
                except json.JSONDecodeError as e:
                    return {"error": f"Invalid JSON for output_schema: {str(e)}"}

            # Ensure output_schema has a 'required' field if it exists
            if normalized_schema is not None:
                if "required" not in normalized_schema:
                    normalized_schema["required"] = []

            return client.smartscraper(
                user_prompt=user_prompt,
                website_url=website_url,
                website_html=website_html,
                website_markdown=website_markdown,
                output_schema=normalized_schema,
                number_of_scrolls=number_of_scrolls,
                total_pages=total_pages,
                render_heavy_js=render_heavy_js,
                stealth=stealth
            )
        except Exception as e:
            return {"error": str(e)}
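
The output_schema handling can be seen in isolation. The standalone sketch below mirrors the handler's normalization step (it is not the module's own code): a JSON-string schema is parsed with json.loads, and a missing "required" key is filled with an empty list.

    import json

    def normalize_schema(output_schema):
        """Mirror of the handler's normalization step (illustrative only)."""
        if isinstance(output_schema, str):
            # The handler wraps this in try/except and returns an error dict on failure.
            output_schema = json.loads(output_schema)
        if isinstance(output_schema, dict) and "required" not in output_schema:
            output_schema["required"] = []  # auto-added when missing
        return output_schema

    print(normalize_schema('{"type": "object", "properties": {"name": {"type": "string"}}}'))
    # {'type': 'object', 'properties': {'name': {'type': 'string'}}, 'required': []}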
  • ScapeGraphClient.smartscraper: constructs the API request payload from the selected input source (raising a ValueError if none is provided), sends a POST request to the /smartscraper endpoint, and returns the JSON response or raises on a non-200 status.
    def smartscraper(
        self,
        user_prompt: str,
        website_url: str = None,
        website_html: str = None,
        website_markdown: str = None,
        output_schema: Dict[str, Any] = None,
        number_of_scrolls: int = None,
        total_pages: int = None,
        render_heavy_js: bool = None,
        stealth: bool = None
    ) -> Dict[str, Any]:
        """
        Extract structured data from a webpage using AI.

        Args:
            user_prompt: Instructions for what data to extract
            website_url: URL of the webpage to scrape (mutually exclusive with website_html and website_markdown)
            website_html: HTML content to process locally (mutually exclusive with website_url and website_markdown, max 2MB)
            website_markdown: Markdown content to process locally (mutually exclusive with website_url and website_html, max 2MB)
            output_schema: JSON schema defining expected output structure (optional)
            number_of_scrolls: Number of infinite scrolls to perform (0-50, default 0)
            total_pages: Number of pages to process for pagination (1-100, default 1)
            render_heavy_js: Enable heavy JavaScript rendering for dynamic pages (default false)
            stealth: Enable stealth mode to avoid bot detection (default false)

        Returns:
            Dictionary containing the extracted data
        """
        url = f"{self.BASE_URL}/smartscraper"
        data = {"user_prompt": user_prompt}

        # Add input source (mutually exclusive)
        if website_url is not None:
            data["website_url"] = website_url
        elif website_html is not None:
            data["website_html"] = website_html
        elif website_markdown is not None:
            data["website_markdown"] = website_markdown
        else:
            raise ValueError("Must provide one of: website_url, website_html, or website_markdown")

        # Add optional parameters
        if output_schema is not None:
            data["output_schema"] = output_schema
        if number_of_scrolls is not None:
            data["number_of_scrolls"] = number_of_scrolls
        if total_pages is not None:
            data["total_pages"] = total_pages
        if render_heavy_js is not None:
            data["render_heavy_js"] = render_heavy_js
        if stealth is not None:
            data["stealth"] = stealth

        response = self.client.post(url, headers=self.headers, json=data)

        if response.status_code != 200:
            error_msg = f"Error {response.status_code}: {response.text}"
            raise Exception(error_msg)

        return response.json()
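
Based on the signature above, calling the client directly might look roughly like the sketch below. The API key value is a placeholder and the import of ScapeGraphClient depends on the package layout, so treat this as illustrative rather than the project's documented usage.

    # Illustrative only: import path and API key are placeholders.
    # from ... import ScapeGraphClient

    client = ScapeGraphClient(api_key="YOUR_SCRAPEGRAPH_API_KEY")
    result = client.smartscraper(
        user_prompt="Find all contact methods: email addresses, phone numbers, and social media links",
        website_url="https://company.com/contact",
        output_schema={
            "type": "object",
            "properties": {
                "emails": {"type": "array", "items": {"type": "string"}},
                "phones": {"type": "array", "items": {"type": "string"}},
            },
            "required": [],
        },
        stealth=True,  # optional: helps on sites with bot detection
    )
    print(result)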
  • @mcp.tool decorator registers the smartscraper function as an MCP tool with read-only, non-destructive, idempotent hints.
    @mcp.tool(annotations={"readOnlyHint": True, "destructiveHint": False, "idempotentHint": True})
  • Pydantic schema validation for output_schema parameter in MCP tool: accepts str or dict, with oneOf JSON schema extra for MCP compatibility.
    output_schema: Optional[Annotated[Union[str, Dict[str, Any]], Field(
        default=None,
        description="JSON schema dict or JSON string defining the expected output structure",
        json_schema_extra={
            "oneOf": [
                {"type": "string"},
                {"type": "object"}
            ]
        }
    )]] = None,
    number_of_scrolls: Optional[int] = None,
    total_pages: Optional[int] = None,
    render_heavy_js: Optional[bool] = None,
    stealth: Optional[bool] = None
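
Because the parameter accepts either a dict or a JSON string, a caller can supply the schema in whichever form is convenient. The sketch below shows two equivalent ways to express the same schema (values are illustrative).

    import json

    schema_as_dict = {
        "type": "object",
        "properties": {"title": {"type": "string"}, "price": {"type": "number"}},
        "required": [],
    }
    schema_as_string = json.dumps(schema_as_dict)  # also accepted, per the oneOf above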
