ScrapeGraph MCP Server (Official)

Server Configuration

Describes the environment variables required to run the server.

Name: SGAI_API_KEY
Required: Yes
Description: Your ScrapeGraph API key, obtained from the ScrapeGraph Dashboard
Default: (none)
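
For orientation, the sketch below shows one way a Python MCP client could launch this server and pass SGAI_API_KEY through the environment. The launch command (uvx scrapegraph-mcp) and the listed-tools check are illustrative assumptions, not taken from this page; adapt them to your install.

```python
# Minimal sketch of launching the server from a Python MCP client.
# The "uvx scrapegraph-mcp" command is an assumption; adjust for your setup.
import asyncio
import os

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client


async def main() -> None:
    server = StdioServerParameters(
        command="uvx",                                      # assumed launcher
        args=["scrapegraph-mcp"],                           # assumed package name
        env={"SGAI_API_KEY": os.environ["SGAI_API_KEY"]},   # required API key
    )
    async with stdio_client(server) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            tools = await session.list_tools()
            print([t.name for t in tools.tools])  # e.g. markdownify, smartscraper, ...


asyncio.run(main())
```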

Tools

Functions exposed to the LLM to take actions

markdownify

Convert a webpage into clean, formatted markdown.

This tool fetches any webpage and converts its content into clean, readable markdown format. Useful for extracting content from documentation, articles, and web pages for further processing. Costs 2 credits per page. Read-only operation with no side effects.

Args:
  website_url (str): The complete URL of the webpage to convert to markdown format.
    - Must include protocol (http:// or https://)
    - Supports most web content types (HTML, articles, documentation)
    - Works with both static and dynamic content
    - Examples:
      * https://example.com/page
      * https://docs.python.org/3/tutorial/
      * https://github.com/user/repo/README.md
    - Invalid examples:
      * example.com (missing protocol)
      * ftp://example.com (unsupported protocol)
      * localhost:3000 (missing protocol)

Returns: Dictionary containing:
  - markdown: The converted markdown content as a string
  - metadata: Additional information about the conversion (title, description, etc.)
  - status: Success/error status of the operation
  - credits_used: Number of credits consumed (always 2 for this operation)

Raises:
  - ValueError: If website_url is malformed or missing protocol
  - HTTPError: If the webpage cannot be accessed or returns an error
  - TimeoutError: If the webpage takes too long to load (>120 seconds)
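
For illustration, a markdownify call payload might look like the following; the URL is a placeholder, and the commented call assumes an MCP client session like the one sketched under Server Configuration.

```python
# Illustrative markdownify arguments; the URL is a placeholder.
import json

markdownify_args = {
    # Must include the protocol (http:// or https://)
    "website_url": "https://docs.python.org/3/tutorial/",
}
print(json.dumps(markdownify_args, indent=2))

# With an initialized MCP session (see the configuration sketch):
# result = await session.call_tool("markdownify", markdownify_args)  # 2 credits
```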

smartscraper

Extract structured data from a webpage, HTML, or markdown using AI-powered extraction.

This tool uses advanced AI to understand your natural language prompt and extract specific structured data from web content. Supports three input modes: URL scraping, raw HTML processing, and markdown processing. Ideal for extracting product information, contact details, article metadata, or any structured content. Costs 10 credits per page. Read-only operation.

Args:
  user_prompt (str): Natural language instructions describing what data to extract.
    - Be specific about the fields you want for better results
    - Use clear, descriptive language about the target data
    - Examples:
      * "Extract product name, price, description, and availability status"
      * "Find all contact methods: email addresses, phone numbers, and social media links"
      * "Get article title, author, publication date, and summary"
      * "Extract all job listings with title, company, location, and salary"
    - Tips for better results:
      * Specify exact field names you want
      * Mention data types (numbers, dates, URLs, etc.)
      * Include context about where data might be located
  website_url (Optional[str]): The complete URL of the webpage to scrape.
    - Mutually exclusive with website_html and website_markdown
    - Must include protocol (http:// or https://)
    - Supports dynamic and static content
    - Examples:
      * https://example.com/products/item
      * https://news.site.com/article/123
      * https://company.com/contact
    - Default: None (must provide one of the three input sources)
  website_html (Optional[str]): Raw HTML content to process locally.
    - Mutually exclusive with website_url and website_markdown
    - Maximum size: 2MB
    - Useful for processing pre-fetched or generated HTML
    - Use when you already have HTML content from another source
    - Example: "<html><body><h1>Title</h1><p>Content</p></body></html>"
    - Default: None
  website_markdown (Optional[str]): Markdown content to process locally.
    - Mutually exclusive with website_url and website_html
    - Maximum size: 2MB
    - Useful for extracting from markdown documents or converted content
    - Works well with documentation, README files, or converted web content
    - Example: "# Title\n\nSection\n\nContent here..."
    - Default: None

  output_schema (Optional[Union[str, Dict]]): JSON schema defining expected output structure.
    - Can be provided as a dictionary or JSON string
    - Helps ensure consistent, structured output format
    - Optional but recommended for complex extractions
    - IMPORTANT: Must include a "required" field (can be empty array [] if no fields are required)
    - Examples:
      * As dict: {'type': 'object', 'properties': {'title': {'type': 'string'}, 'price': {'type': 'number'}}, 'required': []}
      * As JSON string: '{"type": "object", "properties": {"name": {"type": "string"}}, "required": []}'
      * For arrays: {'type': 'array', 'items': {'type': 'object', 'properties': {...}, 'required': []}, 'required': []}
      * With required fields: {'type': 'object', 'properties': {'name': {'type': 'string'}, 'email': {'type': 'string'}}, 'required': ['name', 'email']}
    - Note: If the "required" field is missing, it will be automatically added as an empty array []
    - Default: None (AI will infer structure from prompt)
  number_of_scrolls (Optional[int]): Number of infinite scrolls to perform before scraping.
    - Range: 0-50 scrolls
    - Default: 0 (no scrolling)
    - Useful for dynamically loaded content (lazy loading, infinite scroll)
    - Each scroll waits for content to load before continuing
    - Examples:
      * 0: Static content, no scrolling needed
      * 3: Social media feeds, product listings
      * 10: Long articles, extensive product catalogs
    - Note: Increases processing time proportionally
  total_pages (Optional[int]): Number of pages to process for pagination.
    - Range: 1-100 pages
    - Default: 1 (single page only)
    - Automatically follows pagination links when available
    - Useful for multi-page listings, search results, catalogs
    - Examples:
      * 1: Single page extraction
      * 5: First 5 pages of search results
      * 20: Comprehensive catalog scraping
    - Note: Each page counts toward credit usage (10 credits × pages)
  render_heavy_js (Optional[bool]): Enable heavy JavaScript rendering for dynamic sites.
    - Default: false
    - Set to true for Single Page Applications (SPAs), React apps, Vue.js sites
    - Increases processing time but captures client-side rendered content
    - Use when content is loaded dynamically via JavaScript
    - Examples of when to use:
      * React/Angular/Vue applications
      * Sites with dynamic content loading
      * AJAX-heavy interfaces
      * Content that appears after page load
    - Note: Significantly increases processing time (30-60 seconds vs 5-15 seconds)
  stealth (Optional[bool]): Enable stealth mode to avoid bot detection.
    - Default: false
    - Helps bypass basic anti-scraping measures
    - Uses techniques to appear more like a human browser
    - Useful for sites with bot detection systems
    - Examples of when to use:
      * Sites that block automated requests
      * E-commerce sites with protection
      * Sites that require "human-like" behavior
    - Note: May increase processing time and is not 100% guaranteed

Returns: Dictionary containing:
  - extracted_data: The structured data matching your prompt and optional schema
  - metadata: Information about the extraction process
  - credits_used: Number of credits consumed (10 per page processed)
  - processing_time: Time taken for the extraction
  - pages_processed: Number of pages that were analyzed
  - status: Success/error status of the operation

Raises:
  - ValueError: If no input source is provided or multiple sources are provided
  - HTTPError: If website_url cannot be accessed
  - TimeoutError: If processing exceeds timeout limits
  - ValidationError: If output_schema is malformed JSON
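
As an example, a smartscraper payload with an explicit output_schema (note the "required" field, which must be present even when empty) might look like the following; the URL, prompt, and schema fields are placeholders.

```python
# Illustrative smartscraper arguments; URL, prompt, and schema are placeholders.
import json

smartscraper_args = {
    "user_prompt": "Extract product name, price, and availability status",
    "website_url": "https://example.com/products/item",  # exactly one input source
    "output_schema": {
        "type": "object",
        "properties": {
            "name": {"type": "string"},
            "price": {"type": "number"},
            "availability": {"type": "string"},
        },
        "required": [],  # must be present; [] means no field is mandatory
    },
    "number_of_scrolls": 0,    # static page, no lazy loading
    "render_heavy_js": False,  # flip to True for SPA-style sites
}
print(json.dumps(smartscraper_args, indent=2))
# result = await session.call_tool("smartscraper", smartscraper_args)  # 10 credits/page
```
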
searchscraper

Perform AI-powered web searches with structured data extraction.

This tool searches the web based on your query and uses AI to extract structured information from the search results. Ideal for research, competitive analysis, and gathering information from multiple sources. Each website searched costs 10 credits (default 3 websites = 30 credits). Read-only operation but results may vary over time (non-idempotent).

Args:
  user_prompt (str): Search query or natural language instructions for information to find.
    - Can be a simple search query or detailed extraction instructions
    - The AI will search the web and extract relevant data from found pages
    - Be specific about what information you want extracted
    - Examples:
      * "Find latest AI research papers published in 2024 with author names and abstracts"
      * "Search for Python web scraping tutorials with ratings and difficulty levels"
      * "Get current cryptocurrency prices and market caps for top 10 coins"
      * "Find contact information for tech startups in San Francisco"
      * "Search for job openings for data scientists with salary information"
    - Tips for better results:
      * Include specific fields you want extracted
      * Mention timeframes or filters (e.g., "latest", "2024", "top 10")
      * Specify data types needed (prices, dates, ratings, etc.)
  num_results (Optional[int]): Number of websites to search and extract data from.
    - Default: 3 websites (costs 30 credits total)
    - Range: 1-20 websites (recommended to stay under 10 for cost efficiency)
    - Each website costs 10 credits, so total cost = num_results × 10
    - Examples:
      * 1: Quick single-source lookup (10 credits)
      * 3: Standard research (30 credits), a good balance of coverage and cost
      * 5: Comprehensive research (50 credits)
      * 10: Extensive analysis (100 credits)
    - Note: More results provide broader coverage but increase costs and processing time
  number_of_scrolls (Optional[int]): Number of infinite scrolls per searched webpage.
    - Default: 0 (no scrolling on search result pages)
    - Range: 0-10 scrolls per page
    - Useful when search results point to pages with dynamic content loading
    - Each scroll waits for content to load before continuing
    - Examples:
      * 0: Static content pages, news articles, documentation
      * 2: Social media pages, product listings with lazy loading
      * 5: Extensive feeds, long-form content with infinite scroll
    - Note: Increases processing time significantly (adds 5-10 seconds per scroll per page)

Returns: Dictionary containing:
  - search_results: Array of extracted data from each website found
  - sources: List of URLs that were searched and processed
  - total_websites_processed: Number of websites successfully analyzed
  - credits_used: Total credits consumed (num_results × 10)
  - processing_time: Total time taken for search and extraction
  - search_query_used: The actual search query sent to search engines
  - metadata: Additional information about the search process

Raises:
  - ValueError: If user_prompt is empty or num_results is out of range
  - HTTPError: If search engines are unavailable or return errors
  - TimeoutError: If search or extraction process exceeds timeout limits
  - RateLimitError: If too many requests are made in a short time period

Note:
  - Results may vary between calls due to changing web content (non-idempotent)
  - Search engines may return different results over time
  - Some websites may be inaccessible or block automated access
  - Processing time increases with num_results and number_of_scrolls
  - Consider using smartscraper on specific URLs if you know the target sites
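
By way of illustration, a searchscraper payload for a three-site lookup could be built like this; the query text is a placeholder, and the credit arithmetic follows the num_results × 10 rule above.

```python
# Illustrative searchscraper arguments; the query is a placeholder.
import json

num_results = 3  # each searched website costs 10 credits
searchscraper_args = {
    "user_prompt": "Find Python web scraping tutorials with ratings and difficulty levels",
    "num_results": num_results,
    "number_of_scrolls": 0,  # increase only if the result pages lazy-load content
}
print(json.dumps(searchscraper_args, indent=2))
print(f"Estimated cost: {num_results * 10} credits")
# result = await session.call_tool("searchscraper", searchscraper_args)
```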

smartcrawler_initiate

Initiate an asynchronous multi-page web crawling operation with AI extraction or markdown conversion.

This tool starts an intelligent crawler that discovers and processes multiple pages from a starting URL. Choose between AI Extraction Mode (10 credits/page) for structured data or Markdown Mode (2 credits/page) for content conversion. The operation is asynchronous - use smartcrawler_fetch_results to retrieve results. Creates a new crawl request (non-idempotent, non-read-only).

SmartCrawler supports two modes:

  • AI Extraction Mode: Extracts structured data based on your prompt from every crawled page

  • Markdown Conversion Mode: Converts each page to clean markdown format

Args:
  url (str): The starting URL to begin crawling from.
    - Must include protocol (http:// or https://)
    - The crawler will discover and process linked pages from this starting point
    - Should be a page with links to other pages you want to crawl
    - Examples:
      * https://docs.example.com (documentation site root)
      * https://blog.company.com (blog homepage)
      * https://example.com/products (product category page)
      * https://news.site.com/category/tech (news section)
    - Best practices:
      * Use homepage or main category pages as starting points
      * Ensure the starting page has links to content you want to crawl
      * Consider site structure when choosing the starting URL
  prompt (Optional[str]): AI prompt for data extraction.
    - REQUIRED when extraction_mode is 'ai'
    - Ignored when extraction_mode is 'markdown'
    - Describes what data to extract from each crawled page
    - Applied consistently across all discovered pages
    - Examples:
      * "Extract API endpoint name, method, parameters, and description"
      * "Get article title, author, publication date, and summary"
      * "Find product name, price, description, and availability"
      * "Extract job title, company, location, salary, and requirements"
    - Tips for better results:
      * Be specific about fields you want from each page
      * Consider that different pages may have different content structures
      * Use general terms that apply across multiple page types
  extraction_mode (str): Extraction mode for processing crawled pages.
    - Default: "ai"
    - Options:
      * "ai": AI-powered structured data extraction (10 credits per page)
        - Uses the prompt to extract specific data from each page
        - Returns structured JSON data
        - More expensive but provides targeted information
        - Best for: Data collection, research, structured analysis
      * "markdown": Simple markdown conversion (2 credits per page)
        - Converts each page to clean markdown format
        - No AI processing, just content conversion
        - More cost-effective for content archival
        - Best for: Documentation backup, content migration, reading
    - Cost comparison:
      * AI mode: 50 pages = 500 credits
      * Markdown mode: 50 pages = 100 credits
  depth (Optional[int]): Maximum depth of link traversal from the starting URL.
    - Default: unlimited (will follow links until max_pages or no more links)
    - Depth levels:
      * 0: Only the starting URL (no link following)
      * 1: Starting URL + pages directly linked from it
      * 2: Starting URL + direct links + links from those pages
      * 3+: Continues following links to the specified depth
    - Examples:
      * 1: Crawl blog homepage + all blog posts
      * 2: Crawl docs homepage + category pages + individual doc pages
      * 3: Deep crawling for comprehensive site coverage
    - Considerations:
      * Higher depth can lead to exponential page growth
      * Use with max_pages to control scope and cost
      * Consider site structure when setting depth
  max_pages (Optional[int]): Maximum number of pages to crawl in total.
    - Default: unlimited (will crawl until no more links or depth limit)
    - Recommended ranges:
      * 10-20: Testing and small sites
      * 50-100: Medium sites and focused crawling
      * 200-500: Large sites and comprehensive analysis
      * 1000+: Enterprise-level crawling (high cost)
    - Cost implications:
      * AI mode: max_pages × 10 credits
      * Markdown mode: max_pages × 2 credits
    - Examples:
      * 10: Quick site sampling (20-100 credits)
      * 50: Standard documentation crawl (100-500 credits)
      * 200: Comprehensive site analysis (400-2000 credits)
    - Note: Crawler stops when this limit is reached, regardless of remaining links
  same_domain_only (Optional[bool]): Whether to crawl only within the same domain.
    - Default: true (recommended for most use cases)
    - Options:
      * true: Only crawl pages within the same domain as the starting URL
        - Prevents following external links
        - Keeps crawling focused on the target site
        - Reduces risk of crawling unrelated content
        - Example: Starting at docs.example.com only crawls docs.example.com pages
      * false: Allow crawling external domains
        - Follows links to other domains
        - Can lead to very broad crawling scope
        - May crawl unrelated or unwanted content
        - Use with caution and an appropriate max_pages limit
    - Recommendations:
      * Use true for focused site crawling
      * Use false only when you specifically need cross-domain data
      * Always set max_pages when using false to prevent runaway crawling

Returns: Dictionary containing:
  - request_id: Unique identifier for this crawl operation (use with smartcrawler_fetch_results)
  - status: Initial status of the crawl request ("initiated" or "processing")
  - estimated_cost: Estimated credit cost based on parameters (actual cost may vary)
  - crawl_parameters: Summary of the crawling configuration
  - estimated_time: Rough estimate of processing time
  - next_steps: Instructions for retrieving results

Raises:
  - ValueError: If URL is malformed, prompt is missing for AI mode, or parameters are invalid
  - HTTPError: If the starting URL cannot be accessed
  - RateLimitError: If too many crawl requests are initiated too quickly

Note:
  - This operation is asynchronous and may take several minutes to complete
  - Use smartcrawler_fetch_results with the returned request_id to get results
  - Keep polling smartcrawler_fetch_results until status is "completed"
  - Actual pages crawled may be fewer than max_pages if fewer links are found
  - Processing time increases with max_pages, depth, and extraction_mode complexity
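
To make the cost trade-off concrete, here is an illustrative payload for a markdown-mode crawl of a documentation site; the URL and limits are placeholders.

```python
# Illustrative smartcrawler_initiate arguments (markdown mode); values are placeholders.
import json

max_pages = 50
crawl_args = {
    "url": "https://docs.example.com",  # starting point with links to crawl
    "extraction_mode": "markdown",      # 2 credits/page; use "ai" plus a prompt for 10 credits/page
    "depth": 2,                         # starting URL, its links, and their links
    "max_pages": max_pages,             # hard cap on pages crawled
    "same_domain_only": True,           # stay on docs.example.com
}
print(json.dumps(crawl_args, indent=2))
print(f"Worst-case cost: {max_pages * 2} credits in markdown mode")
# response = await session.call_tool("smartcrawler_initiate", crawl_args)
# request_id = ...  # read it from the response, then poll smartcrawler_fetch_results
```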

smartcrawler_fetch_results

Retrieve the results of an asynchronous SmartCrawler operation.

This tool fetches the results from a previously initiated crawling operation using the request_id. The crawl request processes asynchronously in the background. Keep polling this endpoint until the status field indicates 'completed'. While processing, you'll receive status updates. Read-only operation that safely retrieves results without side effects.

Args:
  request_id: The unique request ID returned by smartcrawler_initiate.
    - Use this to retrieve the crawling results
    - Keep polling until status is 'completed'
    - Example: 'req_abc123xyz'

Returns: Dictionary containing:
  - status: Current status of the crawl operation ('processing', 'completed', 'failed')
  - results: Crawled data (structured extraction or markdown) when completed
  - metadata: Information about processed pages, URLs visited, and processing statistics

Keep polling until status is 'completed' to get final results.
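
A minimal polling sketch, assuming an initialized MCP ClientSession (as in the configuration example), a request_id from smartcrawler_initiate, and that the tool result arrives as a single JSON text block:

```python
# Sketch of a polling helper; assumes an existing ClientSession and request_id.
import asyncio
import json

from mcp import ClientSession


async def wait_for_crawl(session: ClientSession, request_id: str, interval: float = 10.0) -> dict:
    """Poll smartcrawler_fetch_results until the crawl reports 'completed' or 'failed'."""
    while True:
        result = await session.call_tool(
            "smartcrawler_fetch_results", {"request_id": request_id}
        )
        payload = json.loads(result.content[0].text)  # assumes one text content block
        if payload.get("status") in ("completed", "failed"):
            return payload
        await asyncio.sleep(interval)  # back off between status checks
```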

scrape

Fetch raw page content from any URL with optional JavaScript rendering.

This tool performs basic web scraping to retrieve the raw HTML content of a webpage. Optionally enable JavaScript rendering for Single Page Applications (SPAs) and sites with heavy client-side rendering. Lower cost than AI extraction (1 credit/page). Read-only operation with no side effects.

Args:
  website_url (str): The complete URL of the webpage to scrape.
    - Must include protocol (http:// or https://)
    - Returns raw HTML content of the page
    - Works with both static and dynamic websites
    - Examples:
      * https://example.com/page
      * https://api.example.com/docs
      * https://news.site.com/article/123
      * https://app.example.com/dashboard (may need render_heavy_js=true)
    - Supported protocols: HTTP, HTTPS
    - Invalid examples:
      * example.com (missing protocol)
      * ftp://example.com (unsupported protocol)
  render_heavy_js (Optional[bool]): Enable full JavaScript rendering for dynamic content.
    - Default: false (faster, lower cost, works for most static sites)
    - Set to true for sites that require JavaScript execution to display content
    - When to use true:
      * Single Page Applications (React, Angular, Vue.js)
      * Sites with dynamic content loading via AJAX
      * Content that appears only after JavaScript execution
      * Interactive web applications
      * Sites where the initial HTML is mostly empty
    - When to use false (default):
      * Static websites and blogs
      * Server-side rendered content
      * Traditional HTML pages
      * News articles and documentation
      * When you need faster processing
    - Performance impact:
      * false: 2-5 seconds processing time
      * true: 15-30 seconds processing time (waits for JS execution)
    - Cost: Same (1 credit) regardless of the render_heavy_js setting

Returns: Dictionary containing:
  - html_content: The raw HTML content of the page as a string
  - page_title: Extracted page title if available
  - status_code: HTTP response status code (200 for success)
  - final_url: Final URL after any redirects
  - content_length: Size of the HTML content in bytes
  - processing_time: Time taken to fetch and process the page
  - javascript_rendered: Whether JavaScript rendering was used
  - credits_used: Number of credits consumed (always 1)

Raises:
  - ValueError: If website_url is malformed or missing protocol
  - HTTPError: If the webpage returns an error status (404, 500, etc.)
  - TimeoutError: If the page takes too long to load
  - ConnectionError: If the website cannot be reached

Use Cases:
  - Getting raw HTML for custom parsing
  - Checking page structure before using other tools
  - Fetching content for offline processing
  - Debugging website content issues
  - Pre-processing before AI extraction

Note:
  - This tool returns raw HTML without any AI processing
  - Use smartscraper for structured data extraction
  - Use markdownify for clean, readable content
  - Consider render_heavy_js=true if initial results seem incomplete
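
For example, a scrape payload for a JavaScript-heavy dashboard might look like this; the URL is a placeholder.

```python
# Illustrative scrape arguments; the URL is a placeholder.
import json

scrape_args = {
    "website_url": "https://app.example.com/dashboard",
    "render_heavy_js": True,  # SPA-style page; expect 15-30 s instead of 2-5 s
}
print(json.dumps(scrape_args, indent=2))
# result = await session.call_tool("scrape", scrape_args)  # 1 credit either way
```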

sitemap

Extract and discover the complete sitemap structure of any website.

This tool automatically discovers all accessible URLs and pages within a website, providing a comprehensive map of the site's structure. Useful for understanding site architecture before crawling or for discovering all available content. Very cost-effective at 1 credit per request. Read-only operation with no side effects.

Args:
  website_url (str): The base URL of the website to extract the sitemap from.
    - Must include protocol (http:// or https://)
    - Should be the root domain or main section you want to map
    - The tool will discover all accessible pages from this starting point
    - Examples:
      * https://example.com (discover entire website structure)
      * https://docs.example.com (map documentation site)
      * https://blog.company.com (discover all blog pages)
      * https://shop.example.com (map e-commerce structure)
    - Best practices:
      * Use the root domain (https://example.com) for complete site mapping
      * Use a subdomain (https://docs.example.com) for focused mapping
      * Ensure the URL is accessible and doesn't require authentication
    - Discovery methods:
      * Checks for robots.txt and sitemap.xml files
      * Crawls navigation links and menus
      * Discovers pages through internal link analysis
      * Identifies common URL patterns and structures

Returns: Dictionary containing:
  - discovered_urls: List of all URLs found on the website
  - site_structure: Hierarchical organization of pages and sections
  - url_categories: URLs grouped by type (pages, images, documents, etc.)
  - total_pages: Total number of pages discovered
  - subdomains: List of subdomains found (if any)
  - sitemap_sources: Sources used for discovery (sitemap.xml, robots.txt, crawling)
  - page_types: Breakdown of different content types found
  - depth_analysis: URL organization by depth from root
  - external_links: Links pointing to external domains (if found)
  - processing_time: Time taken to complete the discovery
  - credits_used: Number of credits consumed (always 1)

Raises:
  - ValueError: If website_url is malformed or missing protocol
  - HTTPError: If the website cannot be accessed or returns errors
  - TimeoutError: If the discovery process takes too long
  - ConnectionError: If the website cannot be reached

Use Cases:
  - Planning comprehensive crawling operations
  - Understanding website architecture and organization
  - Discovering all available content before targeted scraping
  - SEO analysis and site structure optimization
  - Content inventory and audit preparation
  - Identifying pages for bulk processing operations

Best Practices:
  - Run sitemap before using smartcrawler_initiate for better planning
  - Use results to set appropriate max_pages and depth parameters
  - Check discovered URLs to understand site organization
  - Identify high-value pages for targeted extraction
  - Use for cost estimation before large crawling operations

Note:
  - Very cost-effective at only 1 credit per request
  - Results may vary based on site structure and accessibility
  - Some pages may require authentication and won't be discovered
  - Large sites may have thousands of URLs; consider filtering results
  - Use discovered URLs as input for other scraping tools
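
As a sketch of the "map first, then scrape" workflow recommended above, the payload and commented follow-up below are illustrative; the domain is a placeholder.

```python
# Illustrative sitemap arguments; the domain is a placeholder.
import json

sitemap_args = {"website_url": "https://docs.example.com"}
print(json.dumps(sitemap_args, indent=2))

# With an initialized MCP session (see the configuration sketch):
# sitemap_result = await session.call_tool("sitemap", sitemap_args)  # 1 credit
# discovered = ...  # read discovered_urls from the result, then target specific
#                   # pages with smartscraper, or size max_pages for smartcrawler_initiate
```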

agentic_scrapper

Execute complex multi-step web scraping workflows with AI-powered automation.

This tool runs an intelligent agent that can navigate websites, interact with forms and buttons, follow multi-step workflows, and extract structured data. Ideal for complex scraping scenarios requiring user interaction simulation, form submissions, or multi-page navigation flows. Supports custom output schemas and step-by-step instructions. Variable credit cost based on complexity. Can perform actions on the website (non-read-only, non-idempotent).

The agent accepts flexible input formats for steps (list or JSON string) and output_schema (dict or JSON string) to accommodate different client implementations.

Args:
  url (str): The target website URL where the agentic scraping workflow should start.
    - Must include protocol (http:// or https://)
    - Should be the starting page for your automation workflow
    - The agent will begin its actions from this URL
    - Examples:
      * https://example.com/search (start at search page)
      * https://shop.example.com/login (begin with login flow)
      * https://app.example.com/dashboard (start at main interface)
      * https://forms.example.com/contact (begin at form page)
    - Considerations:
      * Choose a starting point that makes sense for your workflow
      * Ensure the page is publicly accessible or handle authentication
      * Consider the logical flow of actions from this starting point
  user_prompt (Optional[str]): High-level instructions for what the agent should accomplish.
    - Describes the overall goal and desired outcome of the automation
    - Should be clear and specific about what you want to achieve
    - Works in conjunction with the steps parameter for detailed guidance
    - Examples:
      * "Navigate to the search page, search for laptops, and extract the top 5 results with prices"
      * "Fill out the contact form with sample data and submit it"
      * "Login to the dashboard and extract all recent notifications"
      * "Browse the product catalog and collect information about all items"
      * "Navigate through the multi-step checkout process and capture each step"
    - Tips for better results:
      * Be specific about the end goal
      * Mention what data you want extracted
      * Include context about the expected workflow
      * Specify any particular elements or sections to focus on
  output_schema (Optional[Union[str, Dict]]): Desired output structure for extracted data.
    - Can be provided as a dictionary or JSON string
    - Defines the format and structure of the final extracted data
    - Helps ensure consistent, predictable output format
    - IMPORTANT: Must include a "required" field (can be empty array [] if no fields are required)
    - Examples:
      * Simple object: {'type': 'object', 'properties': {'title': {'type': 'string'}, 'price': {'type': 'number'}}, 'required': []}
      * Array of objects: {'type': 'array', 'items': {'type': 'object', 'properties': {'name': {'type': 'string'}, 'value': {'type': 'string'}}, 'required': []}, 'required': []}
      * Complex nested: {'type': 'object', 'properties': {'products': {'type': 'array', 'items': {...}}, 'total_count': {'type': 'number'}}, 'required': []}
      * As JSON string: '{"type": "object", "properties": {"results": {"type": "array"}}, "required": []}'
      * With required fields: {'type': 'object', 'properties': {'id': {'type': 'string'}, 'name': {'type': 'string'}}, 'required': ['id']}
    - Note: If the "required" field is missing, it will be automatically added as an empty array []
    - Default: None (agent will infer structure from prompt and steps)
  steps (Optional[Union[str, List[str]]]): Step-by-step instructions for the agent.
    - Can be provided as a list of strings or a JSON array string
    - Provides detailed, sequential instructions for the automation workflow
    - Each step should be a clear, actionable instruction
    - Examples as list:
      * ['Click the search button', 'Enter "laptops" in the search box', 'Press Enter', 'Wait for results to load', 'Extract product information']
      * ['Fill in email field with test@example.com', 'Fill in password field', 'Click login button', 'Navigate to profile page']
    - Examples as JSON string:
      * '["Open navigation menu", "Click on Products", "Select category filters", "Extract all product data"]'
    - Best practices:
      * Break complex actions into simple steps
      * Be specific about UI elements (button text, field names, etc.)
      * Include waiting/loading steps when necessary
      * Specify extraction points clearly
      * Order steps logically for the workflow
  ai_extraction (Optional[bool]): Enable AI-powered extraction mode for intelligent data parsing.
    - Default: true (recommended for most use cases)
    - Options:
      * true: Uses advanced AI to intelligently extract and structure data
        - Better at handling complex page layouts
        - Can adapt to different content structures
        - Provides more accurate data extraction
        - Recommended for most scenarios
      * false: Uses simpler extraction methods
        - Faster processing but less intelligent
        - May miss complex or nested data
        - Use when speed is more important than accuracy
    - Performance impact:
      * true: Higher processing time but better results
      * false: Faster execution but potentially less accurate extraction
  persistent_session (Optional[bool]): Maintain session state between steps.
    - Default: false (each step starts fresh)
    - Options:
      * true: Keeps cookies, login state, and session data between steps
        - Essential for authenticated workflows
        - Maintains shopping cart contents, user preferences, etc.
        - Required for multi-step processes that depend on previous actions
        - Use for: Login flows, shopping processes, form wizards
      * false: Each step starts with a clean session
        - Faster and simpler for independent actions
        - No state carried between steps
        - Use for: Simple data extraction, public content scraping
    - Examples when to use true:
      * Login → Navigate to protected area → Extract data
      * Add items to cart → Proceed to checkout → Extract order details
      * Multi-step form completion with session dependencies
  timeout_seconds (Optional[float]): Maximum time to wait for the entire workflow.
    - Default: 120 seconds (2 minutes)
    - Recommended ranges:
      * 60-120: Simple workflows (2-5 steps)
      * 180-300: Medium complexity (5-10 steps)
      * 300-600: Complex workflows (10+ steps or slow sites)
      * 600+: Very complex or slow-loading workflows
    - Considerations:
      * Include time for page loads, form submissions, and processing
      * Factor in network latency and site response times
      * Allow extra time for AI processing and extraction
      * Balance between thoroughness and efficiency
    - Examples:
      * 60.0: Quick single-page data extraction
      * 180.0: Multi-step form filling and submission
      * 300.0: Complex navigation and comprehensive data extraction
      * 600.0: Extensive workflows with multiple page interactions

Returns: Dictionary containing:
  - extracted_data: The structured data matching your prompt and optional schema
  - workflow_log: Detailed log of all actions performed by the agent
  - pages_visited: List of URLs visited during the workflow
  - actions_performed: Summary of interactions (clicks, form fills, navigations)
  - execution_time: Total time taken for the workflow
  - steps_completed: Number of steps successfully executed
  - final_page_url: The URL where the workflow ended
  - session_data: Session information if persistent_session was enabled
  - credits_used: Number of credits consumed (varies by complexity)
  - status: Success/failure status with any error details

Raises:
  - ValueError: If URL is malformed or required parameters are missing
  - TimeoutError: If the workflow exceeds the specified timeout
  - NavigationError: If the agent cannot navigate to required pages
  - InteractionError: If the agent cannot interact with specified elements
  - ExtractionError: If data extraction fails or returns invalid results

Use Cases:
  - Automated form filling and submission
  - Multi-step checkout processes
  - Login-protected content extraction
  - Interactive search and filtering workflows
  - Complex navigation scenarios requiring user simulation
  - Data collection from dynamic, JavaScript-heavy applications

Best Practices:
  - Start with simple workflows and gradually increase complexity
  - Use specific element identifiers in steps (button text, field labels)
  - Include appropriate wait times for page loads and dynamic content
  - Test with persistent_session=true for authentication-dependent workflows
  - Set realistic timeouts based on workflow complexity
  - Provide clear, sequential steps that build on each other
  - Use output_schema to ensure consistent data structure

Note:
  - This tool can perform actions on websites (non-read-only)
  - Results may vary between runs due to dynamic content (non-idempotent)
  - Credit cost varies based on workflow complexity and execution time
  - Some websites may have anti-automation measures that could affect success
  - Consider using simpler tools (smartscraper, markdownify) for basic extraction needs
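
Putting the pieces together, an illustrative agentic_scrapper payload for a search-and-extract workflow might look like this; the URL, prompt, steps, and schema fields are placeholders.

```python
# Illustrative agentic_scrapper arguments; URL, prompt, and steps are placeholders.
import json

agentic_args = {
    "url": "https://example.com/search",
    "user_prompt": "Search for laptops and extract the top 5 results with prices",
    "steps": [
        "Enter 'laptops' in the search box",
        "Press Enter and wait for results to load",
        "Extract product name and price for the first 5 results",
    ],
    "output_schema": {
        "type": "array",
        "items": {
            "type": "object",
            "properties": {"name": {"type": "string"}, "price": {"type": "number"}},
            "required": [],
        },
        "required": [],  # must be present even when empty
    },
    "ai_extraction": True,
    "persistent_session": False,  # no login or cart state needed here
    "timeout_seconds": 180.0,     # medium complexity: a few steps plus extraction
}
print(json.dumps(agentic_args, indent=2))
# result = await session.call_tool("agentic_scrapper", agentic_args)
```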

Prompts

Interactive templates invoked by user choice

web_scraping_guide

A comprehensive guide to using ScrapeGraph's web scraping tools effectively. This prompt provides examples and best practices for each tool in the ScrapeGraph MCP server.

quick_start_examples

Quick start examples for common ScrapeGraph use cases. Ready-to-use examples for immediate productivity.

Resources

Contextual data attached and managed by the client

api_status

Current status and capabilities of the ScrapeGraph API server. Provides real-time information about available tools, credit usage, and server health.

common_use_cases

Common use cases and example implementations for ScrapeGraph tools. Real-world examples with expected inputs and outputs.

parameter_reference_guide

Comprehensive parameter reference guide for all ScrapeGraph MCP tools. Complete documentation of every parameter with examples, constraints, and best practices.

tool_comparison_guide

Detailed comparison of ScrapeGraph tools to help choose the right tool for each task. Decision matrix and feature comparison across all available tools.

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/ScrapeGraphAI/scrapegraph-mcp'
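
For reference, a Python equivalent of the curl command above might look like the following; it assumes the endpoint returns JSON.

```python
# Python equivalent of the curl request above; assumes a JSON response body.
import requests

resp = requests.get("https://glama.ai/api/mcp/v1/servers/ScrapeGraphAI/scrapegraph-mcp")
resp.raise_for_status()  # surface HTTP errors
print(resp.json())       # server metadata from the MCP directory API
```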

If you have feedback or need assistance with the MCP directory API, please join our Discord server.