agentic_scrapper
Execute multi-step web scraping workflows with AI automation: the agent navigates websites, interacts with forms, and extracts structured data for complex scenarios that require simulating a user.
Instructions
Execute complex multi-step web scraping workflows with AI-powered automation.
This tool runs an intelligent agent that can navigate websites, interact with forms and buttons, follow multi-step workflows, and extract structured data. Ideal for complex scraping scenarios requiring user interaction simulation, form submissions, or multi-page navigation flows. Supports custom output schemas and step-by-step instructions. Variable credit cost based on complexity. Can perform actions on the website (non-read-only, non-idempotent).
The agent accepts flexible input formats for steps (list or JSON string) and output_schema (dict or JSON string) to accommodate different client implementations.
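For example, the following two argument sets are treated identically after normalization (a minimal sketch; parameter values are illustrative):

```python
# Both argument styles are accepted; the tool normalizes them to the same form.
args_native = {
    "url": "https://example.com/search",
    "steps": ["Enter 'laptops' in the search box", "Press Enter", "Extract results"],
    "output_schema": {"type": "object", "properties": {"results": {"type": "array"}}, "required": []},
}

args_json_strings = {
    "url": "https://example.com/search",
    "steps": '["Enter \'laptops\' in the search box", "Press Enter", "Extract results"]',
    "output_schema": '{"type": "object", "properties": {"results": {"type": "array"}}, "required": []}',
}
```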
Args:
- **url** (str): The target website URL where the agentic scraping workflow should start.
  - Must include a protocol (http:// or https://)
  - Should be the starting page for your automation workflow; the agent begins its actions from this URL
  - Examples:
    - `https://example.com/search` (start at a search page)
    - `https://shop.example.com/login` (begin with a login flow)
    - `https://app.example.com/dashboard` (start at the main interface)
    - `https://forms.example.com/contact` (begin at a form page)
  - Considerations: choose a starting point that makes sense for your workflow, ensure the page is publicly accessible or handle authentication, and consider the logical flow of actions from this URL
- **user_prompt** (Optional[str]): High-level instructions for what the agent should accomplish. Be specific about the end goal and the data you want extracted; it works together with `steps` for detailed guidance. Example: "Navigate to the search page, search for laptops, and extract the top 5 results with prices."
- **output_schema** (Optional[Union[str, Dict]]): Desired output structure for the extracted data, as a dictionary or JSON string. Must include a `"required"` field (an empty array `[]` is added automatically if it is missing). Default: None (the agent infers the structure from the prompt and steps). Example: `{'type': 'object', 'properties': {'title': {'type': 'string'}, 'price': {'type': 'number'}}, 'required': []}`
- **steps** (Optional[Union[str, List[str]]]): Step-by-step instructions for the agent, as a list of strings or a JSON array string. Break complex actions into simple, sequential steps, name UI elements explicitly (button text, field labels), and include waiting steps where needed. Example: `['Enter "laptops" in the search box', 'Press Enter', 'Wait for results to load', 'Extract product information']`
- **ai_extraction** (Optional[bool]): Enable AI-powered extraction for intelligent data parsing. Default: true (recommended); handles complex layouts more accurately at the cost of processing time. Set to false for faster but simpler extraction.
- **persistent_session** (Optional[bool]): Maintain session state (cookies, login state, cart contents) between steps. Default: false. Set to true for authenticated or multi-step workflows that depend on previous actions, such as login, then navigate, then extract.
- **timeout_seconds** (Optional[float]): Maximum time for the entire workflow. Default: 120 seconds. Recommended ranges: 60-120 for simple workflows (2-5 steps), 180-300 for medium complexity (5-10 steps), 300-600 for complex workflows (10+ steps or slow sites), 600+ for very complex or slow-loading workflows.
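Putting the parameters together, a hypothetical argument set for a login-then-extract workflow might look like this sketch (URLs, credentials, and field names are illustrative):

```python
arguments = {
    "url": "https://shop.example.com/login",
    "user_prompt": "Log in, open the orders page, and extract the five most recent orders",
    "steps": [
        "Fill in the email field with test@example.com",
        "Fill in the password field",
        "Click the 'Sign in' button",
        "Navigate to the Orders page",
        "Extract order id, date, and total for the five most recent orders",
    ],
    "output_schema": {
        "type": "object",
        "properties": {
            "orders": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "id": {"type": "string"},
                        "date": {"type": "string"},
                        "total": {"type": "number"},
                    },
                    "required": [],
                },
            }
        },
        "required": [],
    },
    "ai_extraction": True,        # default per the docs above
    "persistent_session": True,   # login state must survive across steps
    "timeout_seconds": 300.0,     # medium-complexity workflow
}
```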
Returns: Dictionary containing:
- extracted_data: The structured data matching your prompt and optional schema
- workflow_log: Detailed log of all actions performed by the agent
- pages_visited: List of URLs visited during the workflow
- actions_performed: Summary of interactions (clicks, form fills, navigations)
- execution_time: Total time taken for the workflow
- steps_completed: Number of steps successfully executed
- final_page_url: The URL where the workflow ended
- session_data: Session information if persistent_session was enabled
- credits_used: Number of credits consumed (varies by complexity)
- status: Success/failure status with any error details
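A minimal consumer sketch, assuming the field names above (note that the MCP wrapper surfaces failures as an `{"error": ...}` dict rather than raising):

```python
def handle_result(result: dict) -> None:
    # The wrapper returns transport/validation problems under an "error" key.
    if "error" in result:
        print(f"workflow failed: {result['error']}")
        return
    print(f"status: {result.get('status')}")
    print(f"credits used: {result.get('credits_used')}")
    for url in result.get("pages_visited", []):
        print(f"visited: {url}")
    print(f"extracted: {result.get('extracted_data')}")
```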
Raises:
- ValueError: If the URL is malformed or required parameters are missing
- TimeoutError: If the workflow exceeds the specified timeout
- NavigationError: If the agent cannot navigate to required pages
- InteractionError: If the agent cannot interact with specified elements
- ExtractionError: If data extraction fails or returns invalid results
Use Cases:
- Automated form filling and submission
- Multi-step checkout processes
- Login-protected content extraction
- Interactive search and filtering workflows
- Complex navigation scenarios requiring user simulation
- Data collection from dynamic, JavaScript-heavy applications
Best Practices:
- Start with simple workflows and gradually increase complexity
- Use specific element identifiers in steps (button text, field labels)
- Include appropriate wait times for page loads and dynamic content
- Test with persistent_session=true for authentication-dependent workflows
- Set realistic timeouts based on workflow complexity
- Provide clear, sequential steps that build on each other
- Use output_schema to ensure a consistent data structure (see the sketch after this list)
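For the last point, a minimal `output_schema` sketch for a product listing (field names are illustrative):

```python
# The "required" key must be present; the tool adds an empty one
# automatically if it is missing.
product_list_schema = {
    "type": "array",
    "items": {
        "type": "object",
        "properties": {
            "name": {"type": "string"},
            "price": {"type": "number"},
        },
        "required": [],
    },
    "required": [],
}
```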
Note:
- This tool can perform actions on websites (non-read-only)
- Results may vary between runs due to dynamic content (non-idempotent)
- Credit cost varies based on workflow complexity and execution time
- Some websites may have anti-automation measures that could affect success
- Consider simpler tools (smartscraper, markdownify) for basic extraction needs
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| url | Yes | Target website URL where the workflow starts; must include http:// or https:// | |
| user_prompt | No | High-level instructions for what the agent should accomplish | None |
| output_schema | No | Desired output structure as a JSON schema dict or JSON string | None |
| steps | No | Step-by-step instructions as a list of strings or a JSON array string | None |
| ai_extraction | No | Enable AI-powered extraction mode | true |
| persistent_session | No | Maintain session state (cookies, login) between steps | false |
| timeout_seconds | No | Maximum time in seconds for the entire workflow | 120 |
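Invoking the tool from an MCP client might look like the sketch below (the `mcp` Python SDK's `ClientSession.call_tool` is assumed; connection setup is omitted):

```python
from mcp import ClientSession  # assumes the official MCP Python SDK

async def run(session: ClientSession) -> None:
    # Tool arguments follow the input schema above; only "url" is required.
    result = await session.call_tool(
        "agentic_scrapper",
        arguments={
            "url": "https://example.com/search",
            "user_prompt": "Search for 'laptops' and extract the top 5 results with prices",
            "timeout_seconds": 180.0,
        },
    )
    print(result)
```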
Implementation Reference
- `src/scrapegraph_mcp/server.py:248-289` (handler) — Core handler logic in `ScapeGraphClient`: constructs the payload and makes an HTTP POST to the `https://api.scrapegraphai.com/v1/agentic-scrapper` endpoint, handling optional parameters and the timeout.

```python
def agentic_scrapper(
    self,
    url: str,
    user_prompt: Optional[str] = None,
    output_schema: Optional[Dict[str, Any]] = None,
    steps: Optional[List[str]] = None,
    ai_extraction: Optional[bool] = None,
    persistent_session: Optional[bool] = None,
    timeout_seconds: Optional[float] = None,
) -> Dict[str, Any]:
    """
    Run the Agentic Scraper workflow (no live session/browser interaction).

    Args:
        url: Target website URL
        user_prompt: Instructions for what to do/extract (optional)
        output_schema: Desired structured output schema (optional)
        steps: High-level steps/instructions for the agent (optional)
        ai_extraction: Whether to enable AI extraction mode (optional)
        persistent_session: Whether to keep session alive between steps (optional)
        timeout_seconds: Per-request timeout override in seconds (optional)
    """
    endpoint = f"{self.BASE_URL}/agentic-scrapper"

    payload: Dict[str, Any] = {"url": url}
    if user_prompt is not None:
        payload["user_prompt"] = user_prompt
    if output_schema is not None:
        payload["output_schema"] = output_schema
    if steps is not None:
        payload["steps"] = steps
    if ai_extraction is not None:
        payload["ai_extraction"] = ai_extraction
    if persistent_session is not None:
        payload["persistent_session"] = persistent_session

    if timeout_seconds is not None:
        response = self.client.post(endpoint, headers=self.headers, json=payload, timeout=timeout_seconds)
    else:
        response = self.client.post(endpoint, headers=self.headers, json=payload)

    response.raise_for_status()
    return response.json()
```
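Called directly (outside MCP), the handler composes with the client like this sketch, assuming a valid API key:

```python
client = ScapeGraphClient(api_key="sgai-...")  # placeholder key
result = client.agentic_scrapper(
    url="https://example.com/search",
    user_prompt="Search for 'laptops' and extract the top 5 results with prices",
    timeout_seconds=180.0,
)
```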
- `src/scrapegraph_mcp/server.py:2024-2271` (registration) — MCP tool registration (`@mcp.tool`) and wrapper handler: normalizes input formats (JSON strings to objects), gets the API key from the context, instantiates the client, and calls the core handler. The wrapper's docstring repeats the Instructions section above and is elided here.

```python
@mcp.tool()
def agentic_scrapper(
    url: str,
    ctx: Context,
    user_prompt: Optional[str] = None,
    output_schema: Optional[Annotated[Union[str, Dict[str, Any]], Field(
        default=None,
        description="Desired output structure as a JSON schema dict or JSON string",
        json_schema_extra={
            "oneOf": [
                {"type": "string"},
                {"type": "object"}
            ]
        }
    )]] = None,
    steps: Optional[Annotated[Union[str, List[str]], Field(
        default=None,
        description="Step-by-step instructions for the agent as a list of strings or JSON array string",
        json_schema_extra={
            "oneOf": [
                {"type": "string"},
                {"type": "array", "items": {"type": "string"}}
            ]
        }
    )]] = None,
    ai_extraction: Optional[bool] = None,
    persistent_session: Optional[bool] = None,
    timeout_seconds: Optional[float] = None
) -> Dict[str, Any]:
    """Execute complex multi-step web scraping workflows with AI-powered automation.

    (Full docstring omitted; it matches the Instructions section above.)
    """
    # Normalize inputs to handle flexible formats from different MCP clients
    normalized_steps: Optional[List[str]] = None
    if isinstance(steps, list):
        normalized_steps = steps
    elif isinstance(steps, str):
        parsed_steps: Optional[Any] = None
        try:
            parsed_steps = json.loads(steps)
        except json.JSONDecodeError:
            parsed_steps = None
        if isinstance(parsed_steps, list):
            normalized_steps = parsed_steps
        else:
            normalized_steps = [steps]

    normalized_schema: Optional[Dict[str, Any]] = None
    if isinstance(output_schema, dict):
        normalized_schema = output_schema
    elif isinstance(output_schema, str):
        try:
            parsed_schema = json.loads(output_schema)
            if isinstance(parsed_schema, dict):
                normalized_schema = parsed_schema
            else:
                return {"error": "output_schema must be a JSON object"}
        except json.JSONDecodeError as e:
            return {"error": f"Invalid JSON for output_schema: {str(e)}"}

    # Ensure output_schema has a 'required' field if it exists
    if normalized_schema is not None:
        if "required" not in normalized_schema:
            normalized_schema["required"] = []

    try:
        api_key = get_api_key(ctx)
        client = ScapeGraphClient(api_key)
        return client.agentic_scrapper(
            url=url,
            user_prompt=user_prompt,
            output_schema=normalized_schema,
            steps=normalized_steps,
            ai_extraction=ai_extraction,
            persistent_session=persistent_session,
            timeout_seconds=timeout_seconds,
        )
    except httpx.TimeoutException as timeout_err:
        return {"error": f"Request timed out: {str(timeout_err)}"}
    except httpx.HTTPError as http_err:
        return {"error": str(http_err)}
    except ValueError as val_err:
        return {"error": str(val_err)}
```
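The steps normalization can be illustrated in isolation; a sketch mirroring the logic above (a standalone illustration, not an exported helper):

```python
import json
from typing import Any, List, Optional, Union

def normalize_steps(steps: Optional[Union[str, List[str]]]) -> Optional[List[str]]:
    # Mirrors the wrapper: lists pass through; strings are parsed as JSON
    # arrays, and anything unparseable becomes a single-step list.
    if isinstance(steps, list):
        return steps
    if isinstance(steps, str):
        try:
            parsed: Any = json.loads(steps)
        except json.JSONDecodeError:
            parsed = None
        return parsed if isinstance(parsed, list) else [steps]
    return None

assert normalize_steps(["a", "b"]) == ["a", "b"]
assert normalize_steps('["a", "b"]') == ["a", "b"]
assert normalize_steps("click the button") == ["click the button"]
```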
- Input schema definitions use Pydantic `Annotated` types for `output_schema` and `steps`, supporting both dict/list and JSON string formats for MCP client compatibility; see the `Field(...)` declarations in the registration signature above.
- `src/scrapegraph_mcp/server.py:73-92` (helper) — `ScapeGraphClient` class initialization: sets up the HTTP client with API key authentication and the base URL used by all API calls, including `agentic_scrapper`.

```python
class ScapeGraphClient:
    """Client for interacting with the ScapeGraph API."""

    BASE_URL = "https://api.scrapegraphai.com/v1"

    def __init__(self, api_key: str):
        """
        Initialize the ScapeGraph API client.

        Args:
            api_key: API key for ScapeGraph API
        """
        self.api_key = api_key
        self.headers = {
            "SGAI-APIKEY": api_key,
            "Content-Type": "application/json"
        }
        self.client = httpx.Client(timeout=httpx.Timeout(120.0))
```
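A sketch of constructing the client, assuming the API key is supplied via an environment variable (the variable name `SGAI_API_KEY` is an assumption, not confirmed by this file):

```python
import os

api_key = os.environ["SGAI_API_KEY"]  # assumed variable name
client = ScapeGraphClient(api_key)
# All requests carry the SGAI-APIKEY header and share a 120 s default timeout.
```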