agentic_scrapper
Execute multi-step web scraping workflows with AI automation, navigating websites, interacting with forms, and extracting structured data for complex scenarios requiring user simulation.
Instructions
Execute complex multi-step web scraping workflows with AI-powered automation.
This tool runs an intelligent agent that can navigate websites, interact with forms and buttons, follow multi-step workflows, and extract structured data. Ideal for complex scraping scenarios requiring user interaction simulation, form submissions, or multi-page navigation flows. Supports custom output schemas and step-by-step instructions. Variable credit cost based on complexity. Can perform actions on the website (non-read-only, non-idempotent).
The agent accepts flexible input formats for steps (list or JSON string) and output_schema (dict or JSON string) to accommodate different client implementations.
Args:
url (str): The target website URL where the agentic scraping workflow should start.
- Must include protocol (http:// or https://)
- Should be the starting page for your automation workflow
- The agent will begin its actions from this URL
- Examples:
* https://example.com/search (start at search page)
* https://shop.example.com/login (begin with login flow)
* https://app.example.com/dashboard (start at main interface)
* https://forms.example.com/contact (begin at form page)
- Considerations:
* Choose a starting point that makes sense for your workflow
* Ensure the page is publicly accessible or handle authentication
* Consider the logical flow of actions from this starting point
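The protocol requirement for url can be checked on the client before any credits are spent. The following is a minimal sketch using only the Python standard library; the helper name validate_start_url is hypothetical and is not part of the tool.

```python
from urllib.parse import urlparse


def validate_start_url(url: str) -> str:
    """Return the URL unchanged if it has an http(s) scheme and a host."""
    parsed = urlparse(url)
    # The tool description requires an explicit http:// or https:// protocol.
    if parsed.scheme not in ("http", "https"):
        raise ValueError(f"URL must start with http:// or https://: {url!r}")
    # A scheme without a host (e.g. "https://") is not a usable start page.
    if not parsed.netloc:
        raise ValueError(f"URL has no host: {url!r}")
    return url


print(validate_start_url("https://example.com/search"))
```

This only checks the shape of the URL. Whether the page is reachable or requires authentication still has to be handled in the workflow itself.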
user_prompt (Optional[str]): High-level instructions for what the agent should accomplish.
- Describes the overall goal and desired outcome of the automation
- Should be clear and specific about what you want to achieve
- Works in conjunction with the steps parameter for detailed guidance
- Examples:
* "Navigate to the search page, search for laptops, and extract the top 5 results with prices"
* "Fill out the contact form with sample data and submit it"
* "Login to the dashboard and extract all recent notifications"
* "Browse the product catalog and collect information about all items"
* "Navigate through the multi-step checkout process and capture each step"
- Tips for better results:
* Be specific about the end goal
* Mention what data you want extracted
* Include context about the expected workflow
* Specify any particular elements or sections to focus on
output_schema (Optional[Union[str, Dict]]): Desired output structure for extracted data.
- Can be provided as a dictionary or JSON string
- Defines the format and structure of the final extracted data
- Helps ensure consistent, predictable output format
- IMPORTANT: Must include a "required" field (can be empty array [] if no fields are required)
- Examples:
* Simple object: {'type': 'object', 'properties': {'title': {'type': 'string'}, 'price': {'type': 'number'}}, 'required': []}
* Array of objects: {'type': 'array', 'items': {'type': 'object', 'properties': {'name': {'type': 'string'}, 'value': {'type': 'string'}}, 'required': []}, 'required': []}
* Complex nested: {'type': 'object', 'properties': {'products': {'type': 'array', 'items': {...}}, 'total_count': {'type': 'number'}}, 'required': []}
* As JSON string: '{"type": "object", "properties": {"results": {"type": "array"}}, "required": []}'
* With required fields: {'type': 'object', 'properties': {'id': {'type': 'string'}, 'name': {'type': 'string'}}, 'required': ['id']}
- Note: If "required" field is missing, it will be automatically added as an empty array []
- Default: None (agent will infer structure from prompt and steps)
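Because output_schema may arrive as a dictionary or a JSON string, a client can normalize it before the call. The sketch below mirrors the documented behavior of adding an empty "required" array when it is missing. It is an illustration under assumptions, not the tool's own code: the function name is hypothetical, and it only touches the top level of the schema, since the description does not say whether nested objects are filled in as well.

```python
import json
from typing import Any, Dict, Optional, Union


def normalize_output_schema(
    schema: Optional[Union[str, Dict[str, Any]]],
) -> Optional[Dict[str, Any]]:
    """Accept a dict or JSON string and ensure a top-level "required" key."""
    if schema is None:
        return None  # the agent infers structure from prompt and steps
    if isinstance(schema, str):
        schema = json.loads(schema)  # raises ValueError on invalid JSON
    if not isinstance(schema, dict):
        raise ValueError("output_schema must be a dict or a JSON object string")
    normalized = dict(schema)  # shallow copy so the caller's dict is untouched
    normalized.setdefault("required", [])
    return normalized


print(normalize_output_schema('{"type": "object", "properties": {"results": {"type": "array"}}}'))
```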
steps (Optional[Union[str, List[str]]]): Step-by-step instructions for the agent.
- Can be provided as a list of strings or JSON array string
- Provides detailed, sequential instructions for the automation workflow
- Each step should be a clear, actionable instruction
- Examples as list:
* ['Click the search button', 'Enter "laptops" in the search box', 'Press Enter', 'Wait for results to load', 'Extract product information']
* ['Fill in email field with test@example.com', 'Fill in password field', 'Click login button', 'Navigate to profile page']
- Examples as JSON string:
* '["Open navigation menu", "Click on Products", "Select category filters", "Extract all product data"]'
- Best practices:
* Break complex actions into simple steps
* Be specific about UI elements (button text, field names, etc.)
* Include waiting/loading steps when necessary
* Specify extraction points clearly
* Order steps logically for the workflow
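The two accepted forms of steps can be reduced to a single list of strings on the client side. This is a small sketch with a hypothetical helper name; dropping blank entries is an assumption made here for tidiness, not documented tool behavior.

```python
import json
from typing import List, Optional, Union


def normalize_steps(steps: Optional[Union[str, List[str]]]) -> Optional[List[str]]:
    """Accept a list of strings or a JSON array string; return a clean list."""
    if steps is None:
        return None
    if isinstance(steps, str):
        steps = json.loads(steps)  # raises ValueError on invalid JSON
    if not isinstance(steps, list) or not all(isinstance(s, str) for s in steps):
        raise ValueError("steps must be a list of strings or a JSON array string")
    # Assumption: blank steps carry no instruction, so they are dropped.
    return [s.strip() for s in steps if s.strip()]


print(normalize_steps('["Open navigation menu", "Click on Products"]'))
```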
ai_extraction (Optional[bool]): Enable AI-powered extraction mode for intelligent data parsing.
- Default: true (recommended for most use cases)
- Options:
* true: Uses advanced AI to intelligently extract and structure data
- Better at handling complex page layouts
- Can adapt to different content structures
- Provides more accurate data extraction
- Recommended for most scenarios
* false: Uses simpler extraction methods
- Faster processing but less intelligent
- May miss complex or nested data
- Use when speed is more important than accuracy
- Performance impact:
* true: Higher processing time but better results
* false: Faster execution but potentially less accurate extraction
persistent_session (Optional[bool]): Maintain session state between steps.
- Default: false (each step starts fresh)
- Options:
* true: Keeps cookies, login state, and session data between steps
- Essential for authenticated workflows
- Maintains shopping cart contents, user preferences, etc.
- Required for multi-step processes that depend on previous actions
- Use for: Login flows, shopping processes, form wizards
* false: Each step starts with a clean session
- Faster and simpler for independent actions
- No state carried between steps
- Use for: Simple data extraction, public content scraping
- Examples when to use true:
* Login → Navigate to protected area → Extract data
* Add items to cart → Proceed to checkout → Extract order details
* Multi-step form completion with session dependencies
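For a login-dependent workflow, the arguments might be assembled as shown below. The parameter names come from this description; the URL, prompt, and steps are placeholder values, and how the payload is delivered to the tool depends on the client in use.

```python
import json

# Hypothetical login workflow; the URL, prompt, and steps are placeholders.
arguments = {
    "url": "https://app.example.com/dashboard",
    "user_prompt": "Login to the dashboard and extract all recent notifications",
    "steps": [
        "Fill in email field with test@example.com",
        "Fill in password field",
        "Click login button",
        "Extract all recent notifications",
    ],
    # Later steps depend on the login, so session state must be kept.
    "persistent_session": True,
    "timeout_seconds": 180.0,
}

payload = json.dumps(arguments, indent=2)
print(payload)
```

Leaving persistent_session at its default of false here would mean the extraction step runs without the login state created by the earlier steps.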
timeout_seconds (Optional[float]): Maximum time to wait for the entire workflow.
- Default: 120 seconds (2 minutes)
- Recommended ranges:
* 60-120: Simple workflows (2-5 steps)
* 180-300: Medium complexity (5-10 steps)
* 300-600: Complex workflows (10+ steps or slow sites)
* 600+: Very complex or slow-loading workflows
- Considerations:
* Include time for page loads, form submissions, and processing
* Factor in network latency and site response times
* Allow extra time for AI processing and extraction
* Balance between thoroughness and efficiency
- Examples:
* 60.0: Quick single-page data extraction
* 180.0: Multi-step form filling and submission
* 300.0: Complex navigation and comprehensive data extraction
* 600.0: Extensive workflows with multiple page interactions
Returns: Dictionary containing:
- extracted_data: The structured data matching your prompt and optional schema
- workflow_log: Detailed log of all actions performed by the agent
- pages_visited: List of URLs visited during the workflow
- actions_performed: Summary of interactions (clicks, form fills, navigations)
- execution_time: Total time taken for the workflow
- steps_completed: Number of steps successfully executed
- final_page_url: The URL where the workflow ended
- session_data: Session information if persistent_session was enabled
- credits_used: Number of credits consumed (varies by complexity)
- status: Success/failure status with any error details
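A caller will usually want a compact summary of the returned dictionary. The sketch below reads the documented keys defensively with .get(), because the exact value types (for example, the shape of status) are not specified here. The helper name and the sample values are made up for illustration.

```python
from typing import Any, Dict


def summarize_result(result: Dict[str, Any]) -> str:
    """Build a one-line summary from the documented result keys."""
    pages = result.get("pages_visited") or []
    return (
        f"status={result.get('status')!r}, "
        f"steps_completed={result.get('steps_completed')}, "
        f"pages_visited={len(pages)}, "
        f"credits_used={result.get('credits_used')}"
    )


# Made-up sample shaped after the keys listed under Returns.
sample = {
    "status": "success",
    "steps_completed": 4,
    "pages_visited": ["https://example.com/search", "https://example.com/results"],
    "credits_used": 12,
}
print(summarize_result(sample))
```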
Raises:
ValueError: If URL is malformed or required parameters are missing
TimeoutError: If the workflow exceeds the specified timeout
NavigationError: If the agent cannot navigate to required pages
InteractionError: If the agent cannot interact with specified elements
ExtractionError: If data extraction fails or returns invalid results
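One way to act on these errors is to map each one to a remediation hint drawn from the guidance in this description. The sketch below matches on class names rather than importing the classes, because the module that defines NavigationError, InteractionError, and ExtractionError is not stated here. Depending on the client, failures may also surface as an error payload instead of a Python exception. The function and dictionary names are hypothetical.

```python
HINTS = {
    "ValueError": "Check that url includes http:// or https:// and that required parameters are set.",
    "TimeoutError": "Raise timeout_seconds or split the workflow into fewer steps.",
    "NavigationError": "Confirm the start page is reachable and that steps name real links.",
    "InteractionError": "Use exact button text or field labels in steps.",
    "ExtractionError": "Simplify output_schema or make user_prompt more specific.",
}


def hint_for(exc: BaseException) -> str:
    """Return a remediation hint for an exception, matching by class name."""
    # Walk the class hierarchy so subclasses of a listed error also match.
    for cls in type(exc).__mro__:
        if cls.__name__ in HINTS:
            return HINTS[cls.__name__]
    return "Unrecognized error; inspect workflow_log if it is available."


print(hint_for(TimeoutError("workflow exceeded 120 seconds")))
```

Automatic retries are deliberately left out of this sketch: the tool is described as non-idempotent, so re-running a workflow that already submitted a form could repeat the action.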
Use Cases:
- Automated form filling and submission
- Multi-step checkout processes
- Login-protected content extraction
- Interactive search and filtering workflows
- Complex navigation scenarios requiring user simulation
- Data collection from dynamic, JavaScript-heavy applications
Best Practices:
- Start with simple workflows and gradually increase complexity
- Use specific element identifiers in steps (button text, field labels)
- Include appropriate wait times for page loads and dynamic content
- Test with persistent_session=true for authentication-dependent workflows
- Set realistic timeouts based on workflow complexity
- Provide clear, sequential steps that build on each other
- Use output_schema to ensure consistent data structure
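The advice on realistic timeouts can be turned into a simple heuristic based on the ranges recommended for timeout_seconds. The sketch below picks the upper bound of each range. That choice, the function name, and the slow_site flag are assumptions made for illustration; the recommended ranges also overlap at 5 and 10 steps, and this sketch assigns those boundary counts to the lower tier.

```python
def suggest_timeout_seconds(num_steps: int, slow_site: bool = False) -> float:
    """Suggest a timeout from the recommended ranges, using each upper bound."""
    if num_steps < 1:
        raise ValueError("num_steps must be at least 1")
    if slow_site or num_steps > 10:
        return 600.0  # complex workflows (10+ steps or slow sites)
    if num_steps > 5:
        return 300.0  # medium complexity (5-10 steps)
    return 120.0  # simple workflows, which is also the documented default


for n in (3, 8, 12):
    print(n, suggest_timeout_seconds(n))
```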
Note:
- This tool can perform actions on websites (non-read-only)
- Results may vary between runs due to dynamic content (non-idempotent)
- Credit cost varies based on workflow complexity and execution time
- Some websites may have anti-automation measures that could affect success
- Consider using simpler tools (smartscraper, markdownify) for basic extraction needs
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
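| url | Yes | Starting URL for the workflow; must include http:// or https:// | |
| user_prompt | No | High-level instructions for what the agent should accomplish | |
| output_schema | No | Desired output structure, as a dictionary or JSON string | None |
| steps | No | Step-by-step instructions, as a list of strings or JSON array string | |
| ai_extraction | No | Enable AI-powered extraction mode | true |
| persistent_session | No | Maintain session state between steps | false |
| timeout_seconds | No | Maximum time in seconds to wait for the entire workflow | 120 |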
Output Schema
| Name | Required | Description | Default |
|---|---|---|---|
| No arguments | | | |