ScrapeGraph MCP Server

Official

smartscraper

Extract structured data from webpages using AI by providing a natural-language prompt and a URL. Ideal for automating data retrieval, or for returning page content as markdown without AI processing.

Instructions

Extract structured data from a webpage using AI.

Args:
    user_prompt: Instructions for what data to extract
    website_url: URL of the webpage to scrape
    number_of_scrolls: Number of infinite scrolls to perform (optional)
    markdown_only: Whether to return only markdown content without AI processing (optional)

Returns:
    Dictionary containing the extracted data or markdown content

Input Schema

Name               Required  Description
markdown_only      No        Return only markdown content without AI processing (optional)
number_of_scrolls  No        Number of infinite scrolls to perform (optional)
user_prompt        Yes       Instructions for what data to extract
website_url        Yes       URL of the webpage to scrape
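
For orientation, here is a minimal sketch of an argument payload that satisfies this schema; the prompt, URL, and values are illustrative placeholders, not output from the server.

    # Example argument payload for the smartscraper tool, matching the input
    # schema above. The prompt and URL are placeholders.
    import json

    arguments = {
        "user_prompt": "Extract the article title, author, and publication date",
        "website_url": "https://example.com/article/123",
        "number_of_scrolls": 0,   # optional
        "markdown_only": False,   # optional: skip AI processing, return markdown only
    }

    # user_prompt and website_url are the only required fields.
    assert all(key in arguments for key in ("user_prompt", "website_url"))
    print(json.dumps(arguments, indent=2))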

Implementation Reference

  • MCP tool handler function for 'smartscraper'. It validates inputs, normalizes the output schema, retrieves the API key, creates a ScapeGraphClient instance, and delegates to the client's smartscraper method.
    @mcp.tool(annotations={"readOnlyHint": True, "destructiveHint": False, "idempotentHint": True})
    def smartscraper(
        user_prompt: str,
        ctx: Context,
        website_url: Optional[str] = None,
        website_html: Optional[str] = None,
        website_markdown: Optional[str] = None,
        output_schema: Optional[Annotated[Union[str, Dict[str, Any]], Field(
            default=None,
            description="JSON schema dict or JSON string defining the expected output structure",
            json_schema_extra={
                "oneOf": [
                    {"type": "string"},
                    {"type": "object"}
                ]
            }
        )]] = None,
        number_of_scrolls: Optional[int] = None,
        total_pages: Optional[int] = None,
        render_heavy_js: Optional[bool] = None,
        stealth: Optional[bool] = None
    ) -> Dict[str, Any]:
        """
        Extract structured data from a webpage, HTML, or markdown using AI-powered extraction.

        This tool uses advanced AI to understand your natural language prompt and extract
        specific structured data from web content. Supports three input modes: URL scraping,
        local HTML processing, or local markdown processing. Ideal for extracting product
        information, contact details, article metadata, or any structured content.
        Costs 10 credits per page. Read-only operation.

        Args:
            user_prompt (str): Natural language instructions describing what data to extract.
                - Be specific about the fields you want for better results
                - Use clear, descriptive language about the target data
                - Examples:
                    * "Extract product name, price, description, and availability status"
                    * "Find all contact methods: email addresses, phone numbers, and social media links"
                    * "Get article title, author, publication date, and summary"
                    * "Extract all job listings with title, company, location, and salary"
                - Tips for better results:
                    * Specify exact field names you want
                    * Mention data types (numbers, dates, URLs, etc.)
                    * Include context about where data might be located
            website_url (Optional[str]): The complete URL of the webpage to scrape.
                - Mutually exclusive with website_html and website_markdown
                - Must include protocol (http:// or https://)
                - Supports dynamic and static content
                - Examples:
                    * https://example.com/products/item
                    * https://news.site.com/article/123
                    * https://company.com/contact
                - Default: None (must provide one of the three input sources)
            website_html (Optional[str]): Raw HTML content to process locally.
                - Mutually exclusive with website_url and website_markdown
                - Maximum size: 2MB
                - Useful for processing pre-fetched or generated HTML
                - Use when you already have HTML content from another source
                - Example: "<html><body><h1>Title</h1><p>Content</p></body></html>"
                - Default: None
            website_markdown (Optional[str]): Markdown content to process locally.
                - Mutually exclusive with website_url and website_html
                - Maximum size: 2MB
                - Useful for extracting from markdown documents or converted content
                - Works well with documentation, README files, or converted web content
                - Example: "# Title\n\n## Section\n\nContent here..."
                - Default: None
            output_schema (Optional[Union[str, Dict]]): JSON schema defining expected output structure.
                - Can be provided as a dictionary or JSON string
                - Helps ensure consistent, structured output format
                - Optional but recommended for complex extractions
                - Examples:
                    * As dict: {'type': 'object', 'properties': {'title': {'type': 'string'}, 'price': {'type': 'number'}}}
                    * As JSON string: '{"type": "object", "properties": {"name": {"type": "string"}}}'
                    * For arrays: {'type': 'array', 'items': {'type': 'object', 'properties': {...}}}
                - Default: None (AI will infer structure from prompt)
            number_of_scrolls (Optional[int]): Number of infinite scrolls to perform before scraping.
                - Range: 0-50 scrolls
                - Default: 0 (no scrolling)
                - Useful for dynamically loaded content (lazy loading, infinite scroll)
                - Each scroll waits for content to load before continuing
                - Examples:
                    * 0: Static content, no scrolling needed
                    * 3: Social media feeds, product listings
                    * 10: Long articles, extensive product catalogs
                - Note: Increases processing time proportionally
            total_pages (Optional[int]): Number of pages to process for pagination.
                - Range: 1-100 pages
                - Default: 1 (single page only)
                - Automatically follows pagination links when available
                - Useful for multi-page listings, search results, catalogs
                - Examples:
                    * 1: Single page extraction
                    * 5: First 5 pages of search results
                    * 20: Comprehensive catalog scraping
                - Note: Each page counts toward credit usage (10 credits × pages)
            render_heavy_js (Optional[bool]): Enable heavy JavaScript rendering for dynamic sites.
                - Default: false
                - Set to true for Single Page Applications (SPAs), React apps, Vue.js sites
                - Increases processing time but captures client-side rendered content
                - Use when content is loaded dynamically via JavaScript
                - Examples of when to use:
                    * React/Angular/Vue applications
                    * Sites with dynamic content loading
                    * AJAX-heavy interfaces
                    * Content that appears after page load
                - Note: Significantly increases processing time (30-60 seconds vs 5-15 seconds)
            stealth (Optional[bool]): Enable stealth mode to avoid bot detection.
                - Default: false
                - Helps bypass basic anti-scraping measures
                - Uses techniques to appear more like a human browser
                - Useful for sites with bot detection systems
                - Examples of when to use:
                    * Sites that block automated requests
                    * E-commerce sites with protection
                    * Sites that require "human-like" behavior
                - Note: May increase processing time and is not 100% guaranteed

        Returns:
            Dictionary containing:
            - extracted_data: The structured data matching your prompt and optional schema
            - metadata: Information about the extraction process
            - credits_used: Number of credits consumed (10 per page processed)
            - processing_time: Time taken for the extraction
            - pages_processed: Number of pages that were analyzed
            - status: Success/error status of the operation

        Raises:
            ValueError: If no input source provided or multiple sources provided
            HTTPError: If website_url cannot be accessed
            TimeoutError: If processing exceeds timeout limits
            ValidationError: If output_schema is malformed JSON
        """
        try:
            api_key = get_api_key(ctx)
            client = ScapeGraphClient(api_key)

            # Parse output_schema if it's a JSON string
            normalized_schema: Optional[Dict[str, Any]] = None
            if isinstance(output_schema, dict):
                normalized_schema = output_schema
            elif isinstance(output_schema, str):
                try:
                    parsed_schema = json.loads(output_schema)
                    if isinstance(parsed_schema, dict):
                        normalized_schema = parsed_schema
                    else:
                        return {"error": "output_schema must be a JSON object"}
                except json.JSONDecodeError as e:
                    return {"error": f"Invalid JSON for output_schema: {str(e)}"}

            return client.smartscraper(
                user_prompt=user_prompt,
                website_url=website_url,
                website_html=website_html,
                website_markdown=website_markdown,
                output_schema=normalized_schema,
                number_of_scrolls=number_of_scrolls,
                total_pages=total_pages,
                render_heavy_js=render_heavy_js,
                stealth=stealth
            )
        except Exception as e:
            return {"error": str(e)}
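
The handler accepts output_schema either as a dict or as a JSON string and normalizes it before delegating to the client. A minimal standalone sketch of that normalization, with an illustrative function name, looks like this:

    import json
    from typing import Any, Dict, Optional, Union

    def normalize_output_schema(
        output_schema: Optional[Union[str, Dict[str, Any]]]
    ) -> Optional[Dict[str, Any]]:
        # Mirrors the handler above: dicts pass through unchanged, JSON strings
        # are parsed, and anything that is not a JSON object is rejected.
        if output_schema is None or isinstance(output_schema, dict):
            return output_schema
        parsed = json.loads(output_schema)  # raises json.JSONDecodeError on invalid JSON
        if not isinstance(parsed, dict):
            raise ValueError("output_schema must be a JSON object")
        return parsed

    # Both accepted forms normalize to the same dict:
    as_dict = normalize_output_schema({"type": "object", "properties": {"title": {"type": "string"}}})
    as_string = normalize_output_schema('{"type": "object", "properties": {"title": {"type": "string"}}}')
    assert as_dict == as_string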
  • Core helper method on the ScapeGraphClient class. It enforces mutual exclusion of the input sources, builds the API request payload with any optional parameters, makes an HTTP POST to https://api.scrapegraphai.com/v1/smartscraper, and returns the JSON response.
    def smartscraper(
        self,
        user_prompt: str,
        website_url: str = None,
        website_html: str = None,
        website_markdown: str = None,
        output_schema: Dict[str, Any] = None,
        number_of_scrolls: int = None,
        total_pages: int = None,
        render_heavy_js: bool = None,
        stealth: bool = None
    ) -> Dict[str, Any]:
        """
        Extract structured data from a webpage using AI.

        Args:
            user_prompt: Instructions for what data to extract
            website_url: URL of the webpage to scrape (mutually exclusive with website_html and website_markdown)
            website_html: HTML content to process locally (mutually exclusive with website_url and website_markdown, max 2MB)
            website_markdown: Markdown content to process locally (mutually exclusive with website_url and website_html, max 2MB)
            output_schema: JSON schema defining expected output structure (optional)
            number_of_scrolls: Number of infinite scrolls to perform (0-50, default 0)
            total_pages: Number of pages to process for pagination (1-100, default 1)
            render_heavy_js: Enable heavy JavaScript rendering for dynamic pages (default false)
            stealth: Enable stealth mode to avoid bot detection (default false)

        Returns:
            Dictionary containing the extracted data
        """
        url = f"{self.BASE_URL}/smartscraper"
        data = {"user_prompt": user_prompt}

        # Add input source (mutually exclusive)
        if website_url is not None:
            data["website_url"] = website_url
        elif website_html is not None:
            data["website_html"] = website_html
        elif website_markdown is not None:
            data["website_markdown"] = website_markdown
        else:
            raise ValueError("Must provide one of: website_url, website_html, or website_markdown")

        # Add optional parameters
        if output_schema is not None:
            data["output_schema"] = output_schema
        if number_of_scrolls is not None:
            data["number_of_scrolls"] = number_of_scrolls
        if total_pages is not None:
            data["total_pages"] = total_pages
        if render_heavy_js is not None:
            data["render_heavy_js"] = render_heavy_js
        if stealth is not None:
            data["stealth"] = stealth

        response = self.client.post(url, headers=self.headers, json=data)

        if response.status_code != 200:
            error_msg = f"Error {response.status_code}: {response.text}"
            raise Exception(error_msg)

        return response.json()
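
Outside the MCP server, the same endpoint can be called directly. The sketch below approximates the request this method builds, using httpx; the authentication header is not shown in the excerpt above, so the SGAI-APIKEY name used here is an assumption to verify against the ScrapeGraphAI API documentation.

    import httpx

    API_KEY = "your-scrapegraph-api-key"  # placeholder

    payload = {
        "user_prompt": "Extract product name, price, and availability",
        "website_url": "https://example.com/products/item",  # exactly one input source
        "number_of_scrolls": 3,
        "render_heavy_js": True,
    }

    # Assumed header name; the real header lives in ScapeGraphClient.headers,
    # which is not shown in the excerpt above.
    headers = {"SGAI-APIKEY": API_KEY, "Content-Type": "application/json"}

    response = httpx.post(
        "https://api.scrapegraphai.com/v1/smartscraper",
        headers=headers,
        json=payload,
        timeout=120.0,  # heavy JS rendering can take 30-60 seconds per the docstring
    )
    response.raise_for_status()
    print(response.json())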
  • MCP tool registration decorator for the smartscraper handler with annotations indicating read-only, non-destructive, idempotent behavior.
    @mcp.tool(annotations={"readOnlyHint": True, "destructiveHint": False, "idempotentHint": True})
  • Pydantic schema definition for the optional output_schema parameter in the MCP tool, supporting both string (JSON) and dict formats with JSON schema extra for validation.
    output_schema: Optional[Annotated[Union[str, Dict[str, Any]], Field(
        default=None,
        description="JSON schema dict or JSON string defining the expected output structure",
        json_schema_extra={
            "oneOf": [
                {"type": "string"},
                {"type": "object"}
            ]
        }
    )]] = None,
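
To make the two accepted formats concrete, here is an illustrative schema for extracting a list of products, shown as a dict and as the equivalent JSON string; the field names are examples only, not a schema defined by this server.

    import json

    # Illustrative JSON schema requesting an array of product objects.
    products_schema = {
        "type": "array",
        "items": {
            "type": "object",
            "properties": {
                "name": {"type": "string"},
                "price": {"type": "number"},
                "in_stock": {"type": "boolean"},
            },
            "required": ["name", "price"],
        },
    }

    # The same schema as a JSON string; the tool accepts either form.
    products_schema_str = json.dumps(products_schema)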

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/ScrapeGraphAI/scrapegraph-mcp'

If you have feedback or need assistance with the MCP directory API, please join our Discord server.