Server Configuration
Describes the environment variables required to run the server.
| Name | Required | Description | Default |
|---|---|---|---|
| SGAI_API_KEY | Yes | Your ScrapeGraph API key, obtained from the ScrapeGraph Dashboard | |
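A launcher script might check for the required variable before starting the server. The following is a minimal sketch only: the helper name `require_api_key` is hypothetical and not part of the server, and the only name taken from the table above is `SGAI_API_KEY`.

```python
import os


def require_api_key(env=None):
    """Return the ScrapeGraph API key, or raise if it is unset or blank.

    Hypothetical helper for illustration; not part of the MCP server.
    """
    # Default to the real process environment; a plain dict can be passed for testing.
    env = os.environ if env is None else env
    key = env.get("SGAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError("SGAI_API_KEY is required but not set")
    return key
```

Failing early like this surfaces a missing key at startup rather than on the first tool call.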
Capabilities
Server capabilities have not been inspected yet.
Tools
Functions exposed to the LLM to take actions
| Name | Description |
|---|---|
| markdownify | Convert a webpage into clean, formatted markdown. This tool fetches any webpage and converts its content into clean, readable markdown format. Useful for extracting content from documentation, articles, and web pages for further processing. Costs 2 credits per page. Read-only operation with no side effects. Args: website_url (str): The complete URL of the webpage to convert to markdown format. - Must include protocol (http:// or https://) - Supports most web content types (HTML, articles, documentation) - Works with both static and dynamic content - Examples: * https://example.com/page * https://docs.python.org/3/tutorial/ * https://github.com/user/repo/README.md - Invalid examples: * example.com (missing protocol) * ftp://example.com (unsupported protocol) * localhost:3000 (missing protocol) Returns: Dictionary containing: - markdown: The converted markdown content as a string - metadata: Additional information about the conversion (title, description, etc.) - status: Success/error status of the operation - credits_used: Number of credits consumed (always 2 for this operation) Raises: ValueError: If website_url is malformed or missing protocol HTTPError: If the webpage cannot be accessed or returns an error TimeoutError: If the webpage takes too long to load (>120 seconds) |
| smartscraper | SectionContent here..." - Default: None |
| searchscraper | Perform AI-powered web searches with structured data extraction. This tool searches the web based on your query and uses AI to extract structured information from the search results. Ideal for research, competitive analysis, and gathering information from multiple sources. Each website searched costs 10 credits (default 3 websites = 30 credits). Read-only operation but results may vary over time (non-idempotent). Args: user_prompt (str): Search query or natural language instructions for information to find. - Can be a simple search query or detailed extraction instructions - The AI will search the web and extract relevant data from found pages - Be specific about what information you want extracted - Examples: * "Find latest AI research papers published in 2024 with author names and abstracts" * "Search for Python web scraping tutorials with ratings and difficulty levels" * "Get current cryptocurrency prices and market caps for top 10 coins" * "Find contact information for tech startups in San Francisco" * "Search for job openings for data scientists with salary information" - Tips for better results: * Include specific fields you want extracted * Mention timeframes or filters (e.g., "latest", "2024", "top 10") * Specify data types needed (prices, dates, ratings, etc.) Returns: Dictionary containing: - search_results: Array of extracted data from each website found - sources: List of URLs that were searched and processed - total_websites_processed: Number of websites successfully analyzed - credits_used: Total credits consumed (num_results × 10) - processing_time: Total time taken for search and extraction - search_query_used: The actual search query sent to search engines - metadata: Additional information about the search process Raises: ValueError: If user_prompt is empty or num_results is out of range HTTPError: If search engines are unavailable or return errors TimeoutError: If search or extraction process exceeds timeout limits RateLimitError: If too many requests are made in a short time period Note: - Results may vary between calls due to changing web content (non-idempotent) - Search engines may return different results over time - Some websites may be inaccessible or block automated access - Processing time increases with num_results and number_of_scrolls - Consider using smartscraper on specific URLs if you know the target sites |
| smartcrawler_initiate | Initiate an asynchronous multi-page web crawling operation with AI extraction or markdown conversion. This tool starts an intelligent crawler that discovers and processes multiple pages from a starting URL. SmartCrawler supports two modes: AI Extraction Mode (10 credits/page) for structured data and Markdown Mode (2 credits/page) for content conversion. The operation is asynchronous - use smartcrawler_fetch_results to retrieve results. Creates a new crawl request (non-idempotent, non-read-only). Args: url (str): The starting URL to begin crawling from. - Must include protocol (http:// or https://) - The crawler will discover and process linked pages from this starting point - Should be a page with links to other pages you want to crawl - Examples: * https://docs.example.com (documentation site root) * https://blog.company.com (blog homepage) * https://example.com/products (product category page) * https://news.site.com/category/tech (news section) - Best practices: * Use homepage or main category pages as starting points * Ensure the starting page has links to content you want to crawl * Consider site structure when choosing the starting URL Returns: Dictionary containing: - request_id: Unique identifier for this crawl operation (use with smartcrawler_fetch_results) - status: Initial status of the crawl request ("initiated" or "processing") - estimated_cost: Estimated credit cost based on parameters (actual cost may vary) - crawl_parameters: Summary of the crawling configuration - estimated_time: Rough estimate of processing time - next_steps: Instructions for retrieving results Raises: ValueError: If URL is malformed, prompt is missing for AI mode, or parameters are invalid HTTPError: If the starting URL cannot be accessed RateLimitError: If too many crawl requests are initiated too quickly Note: - This operation is asynchronous and may take several minutes to complete - Use smartcrawler_fetch_results with the returned request_id to get results - Keep polling smartcrawler_fetch_results until status is "completed" - Actual pages crawled may be fewer than max_pages if fewer links are found - Processing time increases with max_pages, depth, and extraction_mode complexity |
| smartcrawler_fetch_results | Retrieve the results of an asynchronous SmartCrawler operation. This tool fetches the results from a previously initiated crawling operation using the request_id. The crawl request processes asynchronously in the background. Keep polling this endpoint until the status field indicates 'completed'. While processing, you'll receive status updates. Read-only operation that safely retrieves results without side effects. Args: request_id: The unique request ID returned by smartcrawler_initiate. Use this to retrieve the crawling results. Keep polling until status is 'completed'. Example: 'req_abc123xyz' Returns: Dictionary containing: - status: Current status of the crawl operation ('processing', 'completed', 'failed') - results: Crawled data (structured extraction or markdown) when completed - metadata: Information about processed pages, URLs visited, and processing statistics Keep polling until status is 'completed' to get final results |
| scrape | Fetch raw page content from any URL with optional JavaScript rendering. This tool performs basic web scraping to retrieve the raw HTML content of a webpage. Optionally enable JavaScript rendering for Single Page Applications (SPAs) and sites with heavy client-side rendering. Lower cost than AI extraction (1 credit/page). Read-only operation with no side effects. Args: website_url (str): The complete URL of the webpage to scrape. - Must include protocol (http:// or https://) - Returns raw HTML content of the page - Works with both static and dynamic websites - Examples: * https://example.com/page * https://api.example.com/docs * https://news.site.com/article/123 * https://app.example.com/dashboard (may need render_heavy_js=true) - Supported protocols: HTTP, HTTPS - Invalid examples: * example.com (missing protocol) * ftp://example.com (unsupported protocol) Returns: Dictionary containing: - html_content: The raw HTML content of the page as a string - page_title: Extracted page title if available - status_code: HTTP response status code (200 for success) - final_url: Final URL after any redirects - content_length: Size of the HTML content in bytes - processing_time: Time taken to fetch and process the page - javascript_rendered: Whether JavaScript rendering was used - credits_used: Number of credits consumed (always 1) Raises: ValueError: If website_url is malformed or missing protocol HTTPError: If the webpage returns an error status (404, 500, etc.) TimeoutError: If the page takes too long to load ConnectionError: If the website cannot be reached Use Cases: - Getting raw HTML for custom parsing - Checking page structure before using other tools - Fetching content for offline processing - Debugging website content issues - Pre-processing before AI extraction Note: - This tool returns raw HTML without any AI processing - Use smartscraper for structured data extraction - Use markdownify for clean, readable content - Consider render_heavy_js=true if initial results seem incomplete |
| sitemap | Extract and discover the complete sitemap structure of any website. This tool automatically discovers all accessible URLs and pages within a website, providing a comprehensive map of the site's structure. Useful for understanding site architecture before crawling or for discovering all available content. Very cost-effective at 1 credit per request. Read-only operation with no side effects. Args: website_url (str): The base URL of the website to extract sitemap from. - Must include protocol (http:// or https://) - Should be the root domain or main section you want to map - The tool will discover all accessible pages from this starting point - Examples: * https://example.com (discover entire website structure) * https://docs.example.com (map documentation site) * https://blog.company.com (discover all blog pages) * https://shop.example.com (map e-commerce structure) - Best practices: * Use root domain (https://example.com) for complete site mapping * Use subdomain (https://docs.example.com) for focused mapping * Ensure the URL is accessible and doesn't require authentication - Discovery methods: * Checks for robots.txt and sitemap.xml files * Crawls navigation links and menus * Discovers pages through internal link analysis * Identifies common URL patterns and structures Returns: Dictionary containing: - discovered_urls: List of all URLs found on the website - site_structure: Hierarchical organization of pages and sections - url_categories: URLs grouped by type (pages, images, documents, etc.) - total_pages: Total number of pages discovered - subdomains: List of subdomains found (if any) - sitemap_sources: Sources used for discovery (sitemap.xml, robots.txt, crawling) - page_types: Breakdown of different content types found - depth_analysis: URL organization by depth from root - external_links: Links pointing to external domains (if found) - processing_time: Time taken to complete the discovery - credits_used: Number of credits consumed (always 1) Raises: ValueError: If website_url is malformed or missing protocol HTTPError: If the website cannot be accessed or returns errors TimeoutError: If the discovery process takes too long ConnectionError: If the website cannot be reached Use Cases: - Planning comprehensive crawling operations - Understanding website architecture and organization - Discovering all available content before targeted scraping - SEO analysis and site structure optimization - Content inventory and audit preparation - Identifying pages for bulk processing operations Best Practices: - Run sitemap before using smartcrawler_initiate for better planning - Use results to set appropriate max_pages and depth parameters - Check discovered URLs to understand site organization - Identify high-value pages for targeted extraction - Use for cost estimation before large crawling operations Note: - Very cost-effective at only 1 credit per request - Results may vary based on site structure and accessibility - Some pages may require authentication and won't be discovered - Large sites may have thousands of URLs - consider filtering results - Use discovered URLs as input for other scraping tools |
| agentic_scrapper | Execute complex multi-step web scraping workflows with AI-powered automation. This tool runs an intelligent agent that can navigate websites, interact with forms and buttons, follow multi-step workflows, and extract structured data. Ideal for complex scraping scenarios requiring user interaction simulation, form submissions, or multi-page navigation flows. Supports custom output schemas and step-by-step instructions. Variable credit cost based on complexity. Can perform actions on the website (non-read-only, non-idempotent). The agent accepts flexible input formats for steps (list or JSON string) and output_schema (dict or JSON string) to accommodate different client implementations. Args: url (str): The target website URL where the agentic scraping workflow should start. - Must include protocol (http:// or https://) - Should be the starting page for your automation workflow - The agent will begin its actions from this URL - Examples: * https://example.com/search (start at search page) * https://shop.example.com/login (begin with login flow) * https://app.example.com/dashboard (start at main interface) * https://forms.example.com/contact (begin at form page) - Considerations: * Choose a starting point that makes sense for your workflow * Ensure the page is publicly accessible or handle authentication * Consider the logical flow of actions from this starting point Returns: Dictionary containing: - extracted_data: The structured data matching your prompt and optional schema - workflow_log: Detailed log of all actions performed by the agent - pages_visited: List of URLs visited during the workflow - actions_performed: Summary of interactions (clicks, form fills, navigations) - execution_time: Total time taken for the workflow - steps_completed: Number of steps successfully executed - final_page_url: The URL where the workflow ended - session_data: Session information if persistent_session was enabled - credits_used: Number of credits consumed (varies by complexity) - status: Success/failure status with any error details Raises: ValueError: If URL is malformed or required parameters are missing TimeoutError: If the workflow exceeds the specified timeout NavigationError: If the agent cannot navigate to required pages InteractionError: If the agent cannot interact with specified elements ExtractionError: If data extraction fails or returns invalid results Use Cases: - Automated form filling and submission - Multi-step checkout processes - Login-protected content extraction - Interactive search and filtering workflows - Complex navigation scenarios requiring user simulation - Data collection from dynamic, JavaScript-heavy applications Best Practices: - Start with simple workflows and gradually increase complexity - Use specific element identifiers in steps (button text, field labels) - Include appropriate wait times for page loads and dynamic content - Test with persistent_session=true for authentication-dependent workflows - Set realistic timeouts based on workflow complexity - Provide clear, sequential steps that build on each other - Use output_schema to ensure consistent data structure Note: - This tool can perform actions on websites (non-read-only) - Results may vary between runs due to dynamic content (non-idempotent) - Credit cost varies based on workflow complexity and execution time - Some websites may have anti-automation measures that could affect success - Consider using simpler tools (smartscraper, markdownify) for basic extraction needs |
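Three rules recur in the tool descriptions above: URLs must carry an http:// or https:// protocol, crawl cost scales per page by mode, and crawl results are retrieved by polling until the status is "completed". The sketch below illustrates them under stated assumptions. All function names, the mode labels `"ai"` and `"markdown"`, and the `fetch` callable (standing in for whatever client code invokes smartcrawler_fetch_results) are hypothetical; only the credit figures and status values come from the table.

```python
import time
from urllib.parse import urlparse

# Per-page credit costs quoted in the smartcrawler_initiate description.
# The keys "ai" and "markdown" are illustrative labels, not confirmed parameter values.
CREDITS_PER_PAGE = {"ai": 10, "markdown": 2}


def has_supported_protocol(url):
    """Return True only for URLs with an http or https scheme and a host."""
    parts = urlparse(url)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def estimate_crawl_credits(max_pages, mode="ai"):
    """Upper-bound estimate: pages times the per-page cost of the chosen mode."""
    return max_pages * CREDITS_PER_PAGE[mode]


def poll_until_done(fetch, request_id, interval=5.0, max_attempts=60, sleep=time.sleep):
    """Call fetch(request_id) repeatedly until the crawl completes or fails.

    `fetch` is a hypothetical callable returning a dict with a "status" key
    ("processing", "completed", or "failed"), as described for
    smartcrawler_fetch_results.
    """
    for _ in range(max_attempts):
        result = fetch(request_id)
        status = result.get("status")
        if status == "completed":
            return result
        if status == "failed":
            raise RuntimeError(f"crawl {request_id} failed")
        # Still processing: wait before asking again.
        sleep(interval)
    raise TimeoutError(f"crawl {request_id} still processing after {max_attempts} attempts")
```

Bounding the loop with `max_attempts` keeps a stalled crawl from blocking the caller indefinitely, and the estimate is only an upper bound because the crawler may find fewer pages than max_pages.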
Prompts
Interactive templates invoked by user choice
| Name | Description |
|---|---|
| web_scraping_guide | A comprehensive guide to using ScrapeGraph's web scraping tools effectively. This prompt provides examples and best practices for each tool in the ScrapeGraph MCP server. |
| quick_start_examples | Quick start examples for common ScrapeGraph use cases. Ready-to-use examples for immediate productivity. |
Resources
Contextual data attached and managed by the client
| Name | Description |
|---|---|
| api_status | Current status and capabilities of the ScrapeGraph API server. Provides real-time information about available tools, credit usage, and server health. |
| common_use_cases | Common use cases and example implementations for ScrapeGraph tools. Real-world examples with expected inputs and outputs. |
| parameter_reference_guide | Comprehensive parameter reference guide for all ScrapeGraph MCP tools. Complete documentation of every parameter with examples, constraints, and best practices. |
| tool_comparison_guide | Detailed comparison of ScrapeGraph tools to help choose the right tool for each task. Decision matrix and feature comparison across all available tools. |