ScrapeGraph MCP Server (Official)

agentic_scrapper

Execute multi-step web scraping workflows with AI automation: navigate websites, interact with forms, and extract structured data in complex scenarios that require user-interaction simulation.

Instructions

Execute complex multi-step web scraping workflows with AI-powered automation.

This tool runs an intelligent agent that can navigate websites, interact with forms and buttons, follow multi-step workflows, and extract structured data. It is ideal for complex scraping scenarios that require user-interaction simulation, form submissions, or multi-page navigation flows. It supports custom output schemas and step-by-step instructions. Credit cost varies with workflow complexity. The agent can perform actions on the website (non-read-only, non-idempotent).

The agent accepts flexible input formats for steps (list or JSON string) and output_schema (dict or JSON string) to accommodate different client implementations.
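
For example, both argument forms normalize to the same values; a minimal sketch of the equivalence (only the steps/output_schema handling is shown):

    import json

    # steps: a native list and its JSON-array-string form are interchangeable.
    steps_as_list = ["Open navigation menu", "Click on Products", "Extract all product data"]
    steps_as_string = '["Open navigation menu", "Click on Products", "Extract all product data"]'
    assert json.loads(steps_as_string) == steps_as_list

    # output_schema: a dict and its JSON-object-string form are interchangeable.
    schema_as_dict = {"type": "object", "properties": {"results": {"type": "array"}}, "required": []}
    schema_as_string = '{"type": "object", "properties": {"results": {"type": "array"}}, "required": []}'
    assert json.loads(schema_as_string) == schema_as_dict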

Args:

url (str): The target website URL where the agentic scraping workflow should start.
    - Must include protocol (http:// or https://)
    - Should be the starting page for your automation workflow
    - The agent will begin its actions from this URL
    - Examples:
      * https://example.com/search (start at search page)
      * https://shop.example.com/login (begin with login flow)
      * https://app.example.com/dashboard (start at main interface)
      * https://forms.example.com/contact (begin at form page)
    - Considerations:
      * Choose a starting point that makes sense for your workflow
      * Ensure the page is publicly accessible or handle authentication
      * Consider the logical flow of actions from this starting point

user_prompt (Optional[str]): High-level instructions for what the agent should accomplish.
    - Describes the overall goal and desired outcome of the automation
    - Should be clear and specific about what you want to achieve
    - Works in conjunction with the steps parameter for detailed guidance
    - Examples:
      * "Navigate to the search page, search for laptops, and extract the top 5 results with prices"
      * "Fill out the contact form with sample data and submit it"
      * "Login to the dashboard and extract all recent notifications"
      * "Browse the product catalog and collect information about all items"
      * "Navigate through the multi-step checkout process and capture each step"
    - Tips for better results:
      * Be specific about the end goal
      * Mention what data you want extracted
      * Include context about the expected workflow
      * Specify any particular elements or sections to focus on

output_schema (Optional[Union[str, Dict]]): Desired output structure for extracted data.
    - Can be provided as a dictionary or JSON string
    - Defines the format and structure of the final extracted data
    - Helps ensure consistent, predictable output format
    - IMPORTANT: Must include a "required" field (can be empty array [] if no fields are required)
    - Examples:
      * Simple object: {'type': 'object', 'properties': {'title': {'type': 'string'}, 'price': {'type': 'number'}}, 'required': []}
      * Array of objects: {'type': 'array', 'items': {'type': 'object', 'properties': {'name': {'type': 'string'}, 'value': {'type': 'string'}}, 'required': []}, 'required': []}
      * Complex nested: {'type': 'object', 'properties': {'products': {'type': 'array', 'items': {...}}, 'total_count': {'type': 'number'}}, 'required': []}
      * As JSON string: '{"type": "object", "properties": {"results": {"type": "array"}}, "required": []}'
      * With required fields: {'type': 'object', 'properties': {'id': {'type': 'string'}, 'name': {'type': 'string'}}, 'required': ['id']}
    - Note: If the "required" field is missing, it will be automatically added as an empty array [] (see the sketch below)
    - Default: None (agent will infer structure from prompt and steps)
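
The normalization noted above can be reproduced client-side; a minimal sketch of what the server-side wrapper does (see the Implementation Reference below):

    # If "required" is missing, the wrapper adds it as an empty list.
    schema = {"type": "object", "properties": {"title": {"type": "string"}}}
    if "required" not in schema:
        schema["required"] = []
    assert schema == {"type": "object", "properties": {"title": {"type": "string"}}, "required": []}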

steps (Optional[Union[str, List[str]]]): Step-by-step instructions for the agent.
    - Can be provided as a list of strings or JSON array string
    - Provides detailed, sequential instructions for the automation workflow
    - Each step should be a clear, actionable instruction
    - Examples as list:
      * ['Click the search button', 'Enter "laptops" in the search box', 'Press Enter', 'Wait for results to load', 'Extract product information']
      * ['Fill in email field with test@example.com', 'Fill in password field', 'Click login button', 'Navigate to profile page']
    - Examples as JSON string:
      * '["Open navigation menu", "Click on Products", "Select category filters", "Extract all product data"]'
    - Best practices:
      * Break complex actions into simple steps
      * Be specific about UI elements (button text, field names, etc.)
      * Include waiting/loading steps when necessary
      * Specify extraction points clearly
      * Order steps logically for the workflow

ai_extraction (Optional[bool]): Enable AI-powered extraction mode for intelligent data parsing.
    - Default: true (recommended for most use cases)
    - Options:
      * true: Uses advanced AI to intelligently extract and structure data
        - Better at handling complex page layouts
        - Can adapt to different content structures
        - Provides more accurate data extraction
        - Recommended for most scenarios
      * false: Uses simpler extraction methods
        - Faster processing but less intelligent
        - May miss complex or nested data
        - Use when speed is more important than accuracy
    - Performance impact:
      * true: Higher processing time but better results
      * false: Faster execution but potentially less accurate extraction

persistent_session (Optional[bool]): Maintain session state between steps.
    - Default: false (each step starts fresh)
    - Options:
      * true: Keeps cookies, login state, and session data between steps
        - Essential for authenticated workflows
        - Maintains shopping cart contents, user preferences, etc.
        - Required for multi-step processes that depend on previous actions
        - Use for: Login flows, shopping processes, form wizards
      * false: Each step starts with a clean session
        - Faster and simpler for independent actions
        - No state carried between steps
        - Use for: Simple data extraction, public content scraping
    - Examples when to use true (the first is sketched below):
      * Login → Navigate to protected area → Extract data
      * Add items to cart → Proceed to checkout → Extract order details
      * Multi-step form completion with session dependencies
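
A hedged sketch of the first case (login, then extract), using the ScapeGraphClient shown in the Implementation Reference below; the URL, credentials, and step wording are illustrative:

    client = ScapeGraphClient(api_key="sgai-...")  # placeholder key; class defined below
    result = client.agentic_scrapper(
        url="https://app.example.com/login",
        user_prompt="Log in and extract all recent notifications",
        steps=[
            "Fill in email field with test@example.com",
            "Fill in password field",
            "Click login button",
            "Navigate to the notifications page",
            "Extract all recent notifications",
        ],
        persistent_session=True,   # keep cookies and login state between steps
        timeout_seconds=300.0,     # medium-complexity range per the guidance below
    )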

timeout_seconds (Optional[float]): Maximum time to wait for the entire workflow.
    - Default: 120 seconds (2 minutes)
    - Recommended ranges:
      * 60-120: Simple workflows (2-5 steps)
      * 180-300: Medium complexity (5-10 steps)
      * 300-600: Complex workflows (10+ steps or slow sites)
      * 600+: Very complex or slow-loading workflows
    - Considerations:
      * Include time for page loads, form submissions, and processing
      * Factor in network latency and site response times
      * Allow extra time for AI processing and extraction
      * Balance between thoroughness and efficiency
    - Examples:
      * 60.0: Quick single-page data extraction
      * 180.0: Multi-step form filling and submission
      * 300.0: Complex navigation and comprehensive data extraction
      * 600.0: Extensive workflows with multiple page interactions

Returns:
    Dictionary containing:
    - extracted_data: The structured data matching your prompt and optional schema
    - workflow_log: Detailed log of all actions performed by the agent
    - pages_visited: List of URLs visited during the workflow
    - actions_performed: Summary of interactions (clicks, form fills, navigations)
    - execution_time: Total time taken for the workflow
    - steps_completed: Number of steps successfully executed
    - final_page_url: The URL where the workflow ended
    - session_data: Session information if persistent_session was enabled
    - credits_used: Number of credits consumed (varies by complexity)
    - status: Success/failure status with any error details
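
Continuing the sketch above, a short example of consuming this dictionary (field names follow the list; the exact value carried by status is an assumption):

    if result.get("status") == "success":  # assumed success marker
        data = result["extracted_data"]
        print(f"{result.get('steps_completed')} steps completed, "
              f"{len(result.get('pages_visited', []))} pages visited, "
              f"{result.get('credits_used')} credits used")
    else:
        print("Workflow did not complete:", result.get("status"))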

Raises:
    - ValueError: If URL is malformed or required parameters are missing
    - TimeoutError: If the workflow exceeds the specified timeout
    - NavigationError: If the agent cannot navigate to required pages
    - InteractionError: If the agent cannot interact with specified elements
    - ExtractionError: If data extraction fails or returns invalid results

Use Cases:
    - Automated form filling and submission
    - Multi-step checkout processes
    - Login-protected content extraction
    - Interactive search and filtering workflows
    - Complex navigation scenarios requiring user simulation
    - Data collection from dynamic, JavaScript-heavy applications

Best Practices:
    - Start with simple workflows and gradually increase complexity
    - Use specific element identifiers in steps (button text, field labels)
    - Include appropriate wait times for page loads and dynamic content
    - Test with persistent_session=true for authentication-dependent workflows
    - Set realistic timeouts based on workflow complexity
    - Provide clear, sequential steps that build on each other
    - Use output_schema to ensure consistent data structure

Note:
    - This tool can perform actions on websites (non-read-only)
    - Results may vary between runs due to dynamic content (non-idempotent)
    - Credit cost varies based on workflow complexity and execution time
    - Some websites may have anti-automation measures that could affect success
    - Consider using simpler tools (smartscraper, markdownify) for basic extraction needs

Input Schema

Name                 Required
url                  Yes
user_prompt          No
output_schema        No
steps                No
ai_extraction        No
persistent_session   No
timeout_seconds      No
Implementation Reference

  • Core handler logic in ScapeGraphClient: constructs the request payload and makes an HTTP POST to the https://api.scrapegraphai.com/v1/agentic-scrapper endpoint, handling optional parameters and a per-request timeout.
    def agentic_scrapper(
        self,
        url: str,
        user_prompt: Optional[str] = None,
        output_schema: Optional[Dict[str, Any]] = None,
        steps: Optional[List[str]] = None,
        ai_extraction: Optional[bool] = None,
        persistent_session: Optional[bool] = None,
        timeout_seconds: Optional[float] = None,
    ) -> Dict[str, Any]:
        """
        Run the Agentic Scraper workflow (no live session/browser interaction).
    
        Args:
            url: Target website URL
            user_prompt: Instructions for what to do/extract (optional)
            output_schema: Desired structured output schema (optional)
            steps: High-level steps/instructions for the agent (optional)
            ai_extraction: Whether to enable AI extraction mode (optional)
            persistent_session: Whether to keep session alive between steps (optional)
            timeout_seconds: Per-request timeout override in seconds (optional)
        """
        endpoint = f"{self.BASE_URL}/agentic-scrapper"
        payload: Dict[str, Any] = {"url": url}
        if user_prompt is not None:
            payload["user_prompt"] = user_prompt
        if output_schema is not None:
            payload["output_schema"] = output_schema
        if steps is not None:
            payload["steps"] = steps
        if ai_extraction is not None:
            payload["ai_extraction"] = ai_extraction
        if persistent_session is not None:
            payload["persistent_session"] = persistent_session
    
        if timeout_seconds is not None:
            response = self.client.post(endpoint, headers=self.headers, json=payload, timeout=timeout_seconds)
        else:
            response = self.client.post(endpoint, headers=self.headers, json=payload)
        response.raise_for_status()
        return response.json()
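    Note on timeouts: the underlying httpx.Client is constructed with a 120-second default (see the class initialization below); a caller-supplied timeout_seconds overrides it for that single POST only.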
  • MCP tool registration (@mcp.tool) and wrapper handler: normalizes flexible input formats (JSON strings to objects), retrieves the API key from the request context, instantiates the client, and calls the core handler.
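    # Registered as an MCP tool; the exact decorator invocation is assumed
    # from the description above.
    @mcp.tool()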
    def agentic_scrapper(
        url: str,
        ctx: Context,
        user_prompt: Optional[str] = None,
        output_schema: Optional[Annotated[Union[str, Dict[str, Any]], Field(
            default=None,
            description="Desired output structure as a JSON schema dict or JSON string",
            json_schema_extra={
                "oneOf": [
                    {"type": "string"},
                    {"type": "object"}
                ]
            }
        )]] = None,
        steps: Optional[Annotated[Union[str, List[str]], Field(
            default=None,
            description="Step-by-step instructions for the agent as a list of strings or JSON array string",
            json_schema_extra={
                "oneOf": [
                    {"type": "string"},
                    {"type": "array", "items": {"type": "string"}}
                ]
            }
        )]] = None,
        ai_extraction: Optional[bool] = None,
        persistent_session: Optional[bool] = None,
        timeout_seconds: Optional[float] = None
    ) -> Dict[str, Any]:
        """
        Execute complex multi-step web scraping workflows with AI-powered automation.
    
        This tool runs an intelligent agent that can navigate websites, interact with forms and buttons,
        follow multi-step workflows, and extract structured data. Ideal for complex scraping scenarios
        requiring user interaction simulation, form submissions, or multi-page navigation flows.
        Supports custom output schemas and step-by-step instructions. Variable credit cost based on
        complexity. Can perform actions on the website (non-read-only, non-idempotent).
    
        The agent accepts flexible input formats for steps (list or JSON string) and output_schema
        (dict or JSON string) to accommodate different client implementations.
    
        Args:
            url (str): The target website URL where the agentic scraping workflow should start.
                - Must include protocol (http:// or https://)
                - Should be the starting page for your automation workflow
                - The agent will begin its actions from this URL
                - Examples:
                  * https://example.com/search (start at search page)
                  * https://shop.example.com/login (begin with login flow)
                  * https://app.example.com/dashboard (start at main interface)
                  * https://forms.example.com/contact (begin at form page)
                - Considerations:
                  * Choose a starting point that makes sense for your workflow
                  * Ensure the page is publicly accessible or handle authentication
                  * Consider the logical flow of actions from this starting point
    
            user_prompt (Optional[str]): High-level instructions for what the agent should accomplish.
                - Describes the overall goal and desired outcome of the automation
                - Should be clear and specific about what you want to achieve
                - Works in conjunction with the steps parameter for detailed guidance
                - Examples:
                  * "Navigate to the search page, search for laptops, and extract the top 5 results with prices"
                  * "Fill out the contact form with sample data and submit it"
                  * "Login to the dashboard and extract all recent notifications"
                  * "Browse the product catalog and collect information about all items"
                  * "Navigate through the multi-step checkout process and capture each step"
                - Tips for better results:
                  * Be specific about the end goal
                  * Mention what data you want extracted
                  * Include context about the expected workflow
                  * Specify any particular elements or sections to focus on
    
            output_schema (Optional[Union[str, Dict]]): Desired output structure for extracted data.
                - Can be provided as a dictionary or JSON string
                - Defines the format and structure of the final extracted data
                - Helps ensure consistent, predictable output format
                - IMPORTANT: Must include a "required" field (can be empty array [] if no fields are required)
                - Examples:
                  * Simple object: {'type': 'object', 'properties': {'title': {'type': 'string'}, 'price': {'type': 'number'}}, 'required': []}
                  * Array of objects: {'type': 'array', 'items': {'type': 'object', 'properties': {'name': {'type': 'string'}, 'value': {'type': 'string'}}, 'required': []}, 'required': []}
                  * Complex nested: {'type': 'object', 'properties': {'products': {'type': 'array', 'items': {...}}, 'total_count': {'type': 'number'}}, 'required': []}
                  * As JSON string: '{"type": "object", "properties": {"results": {"type": "array"}}, "required": []}'
                  * With required fields: {'type': 'object', 'properties': {'id': {'type': 'string'}, 'name': {'type': 'string'}}, 'required': ['id']}
                - Note: If "required" field is missing, it will be automatically added as an empty array []
                - Default: None (agent will infer structure from prompt and steps)
    
            steps (Optional[Union[str, List[str]]]): Step-by-step instructions for the agent.
                - Can be provided as a list of strings or JSON array string
                - Provides detailed, sequential instructions for the automation workflow
                - Each step should be a clear, actionable instruction
                - Examples as list:
                  * ['Click the search button', 'Enter "laptops" in the search box', 'Press Enter', 'Wait for results to load', 'Extract product information']
                  * ['Fill in email field with test@example.com', 'Fill in password field', 'Click login button', 'Navigate to profile page']
                - Examples as JSON string:
                  * '["Open navigation menu", "Click on Products", "Select category filters", "Extract all product data"]'
                - Best practices:
                  * Break complex actions into simple steps
                  * Be specific about UI elements (button text, field names, etc.)
                  * Include waiting/loading steps when necessary
                  * Specify extraction points clearly
                  * Order steps logically for the workflow
    
            ai_extraction (Optional[bool]): Enable AI-powered extraction mode for intelligent data parsing.
                - Default: true (recommended for most use cases)
                - Options:
                  * true: Uses advanced AI to intelligently extract and structure data
                    - Better at handling complex page layouts
                    - Can adapt to different content structures
                    - Provides more accurate data extraction
                    - Recommended for most scenarios
                  * false: Uses simpler extraction methods
                    - Faster processing but less intelligent
                    - May miss complex or nested data
                    - Use when speed is more important than accuracy
                - Performance impact:
                  * true: Higher processing time but better results
                  * false: Faster execution but potentially less accurate extraction
    
            persistent_session (Optional[bool]): Maintain session state between steps.
                - Default: false (each step starts fresh)
                - Options:
                  * true: Keeps cookies, login state, and session data between steps
                    - Essential for authenticated workflows
                    - Maintains shopping cart contents, user preferences, etc.
                    - Required for multi-step processes that depend on previous actions
                    - Use for: Login flows, shopping processes, form wizards
                  * false: Each step starts with a clean session
                    - Faster and simpler for independent actions
                    - No state carried between steps
                    - Use for: Simple data extraction, public content scraping
                - Examples when to use true:
                  * Login → Navigate to protected area → Extract data
                  * Add items to cart → Proceed to checkout → Extract order details
                  * Multi-step form completion with session dependencies
    
            timeout_seconds (Optional[float]): Maximum time to wait for the entire workflow.
                - Default: 120 seconds (2 minutes)
                - Recommended ranges:
                  * 60-120: Simple workflows (2-5 steps)
                  * 180-300: Medium complexity (5-10 steps)
                  * 300-600: Complex workflows (10+ steps or slow sites)
                  * 600+: Very complex or slow-loading workflows
                - Considerations:
                  * Include time for page loads, form submissions, and processing
                  * Factor in network latency and site response times
                  * Allow extra time for AI processing and extraction
                  * Balance between thoroughness and efficiency
                - Examples:
                  * 60.0: Quick single-page data extraction
                  * 180.0: Multi-step form filling and submission
                  * 300.0: Complex navigation and comprehensive data extraction
                  * 600.0: Extensive workflows with multiple page interactions
    
        Returns:
            Dictionary containing:
            - extracted_data: The structured data matching your prompt and optional schema
            - workflow_log: Detailed log of all actions performed by the agent
            - pages_visited: List of URLs visited during the workflow
            - actions_performed: Summary of interactions (clicks, form fills, navigations)
            - execution_time: Total time taken for the workflow
            - steps_completed: Number of steps successfully executed
            - final_page_url: The URL where the workflow ended
            - session_data: Session information if persistent_session was enabled
            - credits_used: Number of credits consumed (varies by complexity)
            - status: Success/failure status with any error details
    
        Raises:
            ValueError: If URL is malformed or required parameters are missing
            TimeoutError: If the workflow exceeds the specified timeout
            NavigationError: If the agent cannot navigate to required pages
            InteractionError: If the agent cannot interact with specified elements
            ExtractionError: If data extraction fails or returns invalid results
    
        Use Cases:
            - Automated form filling and submission
            - Multi-step checkout processes
            - Login-protected content extraction
            - Interactive search and filtering workflows
            - Complex navigation scenarios requiring user simulation
            - Data collection from dynamic, JavaScript-heavy applications
    
        Best Practices:
            - Start with simple workflows and gradually increase complexity
            - Use specific element identifiers in steps (button text, field labels)
            - Include appropriate wait times for page loads and dynamic content
            - Test with persistent_session=true for authentication-dependent workflows
            - Set realistic timeouts based on workflow complexity
            - Provide clear, sequential steps that build on each other
            - Use output_schema to ensure consistent data structure
    
        Note:
            - This tool can perform actions on websites (non-read-only)
            - Results may vary between runs due to dynamic content (non-idempotent)
            - Credit cost varies based on workflow complexity and execution time
            - Some websites may have anti-automation measures that could affect success
            - Consider using simpler tools (smartscraper, markdownify) for basic extraction needs
        """
        # Normalize inputs to handle flexible formats from different MCP clients
        normalized_steps: Optional[List[str]] = None
        if isinstance(steps, list):
            normalized_steps = steps
        elif isinstance(steps, str):
            parsed_steps: Optional[Any] = None
            try:
                parsed_steps = json.loads(steps)
            except json.JSONDecodeError:
                parsed_steps = None
            if isinstance(parsed_steps, list):
                normalized_steps = parsed_steps
            else:
                normalized_steps = [steps]
    
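        # Normalize output_schema: accept a dict directly, or parse a JSON string.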
        normalized_schema: Optional[Dict[str, Any]] = None
        if isinstance(output_schema, dict):
            normalized_schema = output_schema
        elif isinstance(output_schema, str):
            try:
                parsed_schema = json.loads(output_schema)
                if isinstance(parsed_schema, dict):
                    normalized_schema = parsed_schema
                else:
                    return {"error": "output_schema must be a JSON object"}
            except json.JSONDecodeError as e:
                return {"error": f"Invalid JSON for output_schema: {str(e)}"}
    
        # Ensure output_schema has a 'required' field if it exists
        if normalized_schema is not None:
            if "required" not in normalized_schema:
                normalized_schema["required"] = []
    
        try:
            api_key = get_api_key(ctx)
            client = ScapeGraphClient(api_key)
            return client.agentic_scrapper(
                url=url,
                user_prompt=user_prompt,
                output_schema=normalized_schema,
                steps=normalized_steps,
                ai_extraction=ai_extraction,
                persistent_session=persistent_session,
                timeout_seconds=timeout_seconds,
            )
        except httpx.TimeoutException as timeout_err:
            return {"error": f"Request timed out: {str(timeout_err)}"}
        except httpx.HTTPError as http_err:
            return {"error": str(http_err)}
        except ValueError as val_err:
            return {"error": str(val_err)}
  • Input schema definitions: output_schema and steps are declared with Pydantic Annotated types and Field(json_schema_extra={"oneOf": ...}), so each parameter accepts both its native form (dict/list) and a JSON string for MCP client compatibility, as shown in the wrapper signature above.
  • ScapeGraphClient class initialization: sets up HTTP client with API key authentication and base URL for all API calls, including agentic_scrapper.
    class ScapeGraphClient:
        """Client for interacting with the ScapeGraph API."""
    
        BASE_URL = "https://api.scrapegraphai.com/v1"
    
        def __init__(self, api_key: str):
            """
            Initialize the ScapeGraph API client.
    
            Args:
                api_key: API key for ScapeGraph API
            """
            self.api_key = api_key
            self.headers = {
                "SGAI-APIKEY": api_key,
                "Content-Type": "application/json"
            }
            self.client = httpx.Client(timeout=httpx.Timeout(120.0))
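  • End-to-end usage (a hedged sketch: the API key, URL, prompt, and schema are placeholders, and the module-level imports are assumed from the file above):
    client = ScapeGraphClient(api_key="sgai-...")  # placeholder key
    result = client.agentic_scrapper(
        url="https://example.com/search",
        user_prompt="Search for laptops and extract the top 5 results with prices",
        steps=[
            "Enter 'laptops' in the search box",
            "Press Enter",
            "Wait for results to load",
            "Extract product information",
        ],
        output_schema={
            "type": "object",
            "properties": {"results": {"type": "array"}},
            "required": [],
        },
        ai_extraction=True,
        timeout_seconds=180.0,
    )
    print(result)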
