ScrapeGraph MCP Server

smartcrawler_initiate

Start a multi-page crawl from a starting URL to extract structured data with AI or convert page content to markdown.

Instructions

Initiate an asynchronous multi-page web crawling operation with AI extraction or markdown conversion.

This tool starts an intelligent crawler that discovers and processes multiple pages from a starting URL. Choose between AI Extraction Mode (10 credits/page) for structured data or Markdown Mode (2 credits/page) for content conversion. The operation is asynchronous - use smartcrawler_fetch_results to retrieve results. Creates a new crawl request (non-idempotent, non-read-only).

SmartCrawler supports two modes:

  • AI Extraction Mode: Extracts structured data based on your prompt from every crawled page

  • Markdown Conversion Mode: Converts each page to clean markdown format
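
For illustration, here is what the two modes look like as argument sets for this tool. This is only a sketch: the URLs and prompt are placeholders, and only the parameters documented below are used.

    # Illustrative argument sets for the two modes (values are placeholders).

    # AI Extraction Mode: a prompt is required; each crawled page costs 10 credits.
    ai_mode_args = {
        "url": "https://docs.example.com",
        "prompt": "Extract API endpoint name, method, parameters, and description",
        "extraction_mode": "ai",
        "depth": 2,
        "max_pages": 50,
        "same_domain_only": True,
    }

    # Markdown Conversion Mode: no prompt needed; each crawled page costs 2 credits.
    markdown_mode_args = {
        "url": "https://docs.example.com",
        "extraction_mode": "markdown",
        "max_pages": 50,
        "same_domain_only": True,
    }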

Args:

url (str): The starting URL to begin crawling from.
    - Must include protocol (http:// or https://)
    - The crawler will discover and process linked pages from this starting point
    - Should be a page with links to other pages you want to crawl
    - Examples:
      * https://docs.example.com (documentation site root)
      * https://blog.company.com (blog homepage)
      * https://example.com/products (product category page)
      * https://news.site.com/category/tech (news section)
    - Best practices:
      * Use homepage or main category pages as starting points
      * Ensure the starting page has links to content you want to crawl
      * Consider site structure when choosing the starting URL

prompt (Optional[str]): AI prompt for data extraction.
    - REQUIRED when extraction_mode is 'ai'
    - Ignored when extraction_mode is 'markdown'
    - Describes what data to extract from each crawled page
    - Applied consistently across all discovered pages
    - Examples:
      * "Extract API endpoint name, method, parameters, and description"
      * "Get article title, author, publication date, and summary"
      * "Find product name, price, description, and availability"
      * "Extract job title, company, location, salary, and requirements"
    - Tips for better results:
      * Be specific about fields you want from each page
      * Consider that different pages may have different content structures
      * Use general terms that apply across multiple page types

extraction_mode (str): Extraction mode for processing crawled pages.
    - Default: "ai"
    - Options:
      * "ai": AI-powered structured data extraction (10 credits per page)
        - Uses the prompt to extract specific data from each page
        - Returns structured JSON data
        - More expensive but provides targeted information
        - Best for: Data collection, research, structured analysis
      * "markdown": Simple markdown conversion (2 credits per page)
        - Converts each page to clean markdown format
        - No AI processing, just content conversion
        - More cost-effective for content archival
        - Best for: Documentation backup, content migration, reading
    - Cost comparison:
      * AI mode: 50 pages = 500 credits
      * Markdown mode: 50 pages = 100 credits

depth (Optional[int]): Maximum depth of link traversal from the starting URL.
    - Default: unlimited (will follow links until max_pages or no more links)
    - Depth levels:
      * 0: Only the starting URL (no link following)
      * 1: Starting URL + pages directly linked from it
      * 2: Starting URL + direct links + links from those pages
      * 3+: Continues following links to specified depth
    - Examples:
      * 1: Crawl blog homepage + all blog posts
      * 2: Crawl docs homepage + category pages + individual doc pages
      * 3: Deep crawling for comprehensive site coverage
    - Considerations:
      * Higher depth can lead to exponential page growth
      * Use with max_pages to control scope and cost
      * Consider site structure when setting depth

max_pages (Optional[int]): Maximum number of pages to crawl in total.
    - Default: unlimited (will crawl until no more links or depth limit)
    - Recommended ranges:
      * 10-20: Testing and small sites
      * 50-100: Medium sites and focused crawling
      * 200-500: Large sites and comprehensive analysis
      * 1000+: Enterprise-level crawling (high cost)
    - Cost implications:
      * AI mode: max_pages × 10 credits
      * Markdown mode: max_pages × 2 credits
    - Examples:
      * 10: Quick site sampling (20-100 credits)
      * 50: Standard documentation crawl (100-500 credits)
      * 200: Comprehensive site analysis (400-2000 credits)
    - Note: Crawler stops when this limit is reached, regardless of remaining links

same_domain_only (Optional[bool]): Whether to crawl only within the same domain.
    - Default: true (recommended for most use cases)
    - Options:
      * true: Only crawl pages within the same domain as starting URL
        - Prevents following external links
        - Keeps crawling focused on the target site
        - Reduces risk of crawling unrelated content
        - Example: Starting at docs.example.com only crawls docs.example.com pages
      * false: Allow crawling external domains
        - Follows links to other domains
        - Can lead to very broad crawling scope
        - May crawl unrelated or unwanted content
        - Use with caution and appropriate max_pages limit
    - Recommendations:
      * Use true for focused site crawling
      * Use false only when you specifically need cross-domain data
      * Always set max_pages when using false to prevent runaway crawling
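
As a quick way to budget a crawl, the per-page rates above (10 credits per page for AI extraction, 2 credits per page for markdown conversion) can be multiplied by the page limit. The helper below is only a convenience sketch based on those documented rates; the service's own estimated_cost, and actual usage, may differ (for example when fewer pages are found than max_pages).

    # Rough upper-bound credit estimate from the documented per-page rates.
    CREDITS_PER_PAGE = {"ai": 10, "markdown": 2}

    def estimate_credits(extraction_mode: str, max_pages: int) -> int:
        """Upper-bound cost for a crawl capped at max_pages."""
        return CREDITS_PER_PAGE[extraction_mode] * max_pages

    # Figures quoted above:
    assert estimate_credits("ai", 50) == 500
    assert estimate_credits("markdown", 50) == 100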

Returns: Dictionary containing:
    - request_id: Unique identifier for this crawl operation (use with smartcrawler_fetch_results)
    - status: Initial status of the crawl request ("initiated" or "processing")
    - estimated_cost: Estimated credit cost based on parameters (actual cost may vary)
    - crawl_parameters: Summary of the crawling configuration
    - estimated_time: Rough estimate of processing time
    - next_steps: Instructions for retrieving results

Raises:
    ValueError: If URL is malformed, prompt is missing for AI mode, or parameters are invalid
    HTTPError: If the starting URL cannot be accessed
    RateLimitError: If too many crawl requests are initiated too quickly

Note:
    - This operation is asynchronous and may take several minutes to complete
    - Use smartcrawler_fetch_results with the returned request_id to get results
    - Keep polling smartcrawler_fetch_results until status is "completed"
    - Actual pages crawled may be less than max_pages if fewer links are found
    - Processing time increases with max_pages, depth, and extraction_mode complexity
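
Because the operation is asynchronous, a typical workflow initiates the crawl and then polls for results. The sketch below assumes that smartcrawler_fetch_results accepts the returned request_id and responds with a dictionary carrying the status and request_id fields described above; initiate_crawl and fetch_results are placeholders for however your MCP client invokes the two tools.

    import time

    def crawl_and_wait(initiate_crawl, fetch_results, url, prompt,
                       poll_interval=10.0, timeout=600.0):
        """Initiate a SmartCrawler run and poll until it reports completion."""
        started = initiate_crawl(
            url=url,
            prompt=prompt,
            extraction_mode="ai",   # or "markdown" to skip the prompt
            max_pages=50,
            same_domain_only=True,
        )
        if "error" in started:
            raise RuntimeError(started["error"])

        request_id = started["request_id"]
        deadline = time.monotonic() + timeout
        while time.monotonic() < deadline:
            result = fetch_results(request_id=request_id)
            if result.get("status") == "completed":
                return result
            time.sleep(poll_interval)  # keep polling until status is "completed"
        raise TimeoutError(f"Crawl {request_id} did not complete in {timeout}s")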

Input Schema

Name              Required  Description                                                     Default
url               Yes       Starting URL to crawl
prompt            No        AI prompt for data extraction (required for AI mode)
extraction_mode   No        "ai" for AI extraction or "markdown" for markdown conversion    ai
depth             No        Maximum link traversal depth
max_pages         No        Maximum number of pages to crawl
same_domain_only  No        Whether to crawl only within the same domain

Implementation Reference

  • MCP tool handler for 'smartcrawler_initiate'. Validates inputs via type hints, retrieves API key, instantiates ScapeGraphClient, and calls the client method to initiate the crawl. Returns request ID or error.
    @mcp.tool(annotations={"readOnlyHint": False, "destructiveHint": False, "idempotentHint": False})
    def smartcrawler_initiate(
        url: str,
        ctx: Context,
        prompt: Optional[str] = None,
        extraction_mode: str = "ai",
        depth: Optional[int] = None,
        max_pages: Optional[int] = None,
        same_domain_only: Optional[bool] = None
    ) -> Dict[str, Any]:
        """
        Initiate an asynchronous multi-page web crawling operation with AI extraction or markdown conversion.
    
        This tool starts an intelligent crawler that discovers and processes multiple pages from a starting URL.
        Choose between AI Extraction Mode (10 credits/page) for structured data or Markdown Mode (2 credits/page)
        for content conversion. The operation is asynchronous - use smartcrawler_fetch_results to retrieve results.
        Creates a new crawl request (non-idempotent, non-read-only).
    
        SmartCrawler supports two modes:
        - AI Extraction Mode: Extracts structured data based on your prompt from every crawled page
        - Markdown Conversion Mode: Converts each page to clean markdown format
    
        Args:
            url (str): The starting URL to begin crawling from.
                - Must include protocol (http:// or https://)
                - The crawler will discover and process linked pages from this starting point
                - Should be a page with links to other pages you want to crawl
                - Examples:
                  * https://docs.example.com (documentation site root)
                  * https://blog.company.com (blog homepage)
                  * https://example.com/products (product category page)
                  * https://news.site.com/category/tech (news section)
                - Best practices:
                  * Use homepage or main category pages as starting points
                  * Ensure the starting page has links to content you want to crawl
                  * Consider site structure when choosing the starting URL
    
            prompt (Optional[str]): AI prompt for data extraction.
                - REQUIRED when extraction_mode is 'ai'
                - Ignored when extraction_mode is 'markdown'
                - Describes what data to extract from each crawled page
                - Applied consistently across all discovered pages
                - Examples:
                  * "Extract API endpoint name, method, parameters, and description"
                  * "Get article title, author, publication date, and summary"
                  * "Find product name, price, description, and availability"
                  * "Extract job title, company, location, salary, and requirements"
                - Tips for better results:
                  * Be specific about fields you want from each page
                  * Consider that different pages may have different content structures
                  * Use general terms that apply across multiple page types
    
            extraction_mode (str): Extraction mode for processing crawled pages.
                - Default: "ai"
                - Options:
                  * "ai": AI-powered structured data extraction (10 credits per page)
                    - Uses the prompt to extract specific data from each page
                    - Returns structured JSON data
                    - More expensive but provides targeted information
                    - Best for: Data collection, research, structured analysis
                  * "markdown": Simple markdown conversion (2 credits per page)
                    - Converts each page to clean markdown format
                    - No AI processing, just content conversion
                    - More cost-effective for content archival
                    - Best for: Documentation backup, content migration, reading
                - Cost comparison:
                  * AI mode: 50 pages = 500 credits
                  * Markdown mode: 50 pages = 100 credits
    
            depth (Optional[int]): Maximum depth of link traversal from the starting URL.
                - Default: unlimited (will follow links until max_pages or no more links)
                - Depth levels:
                  * 0: Only the starting URL (no link following)
                  * 1: Starting URL + pages directly linked from it
                  * 2: Starting URL + direct links + links from those pages
                  * 3+: Continues following links to specified depth
                - Examples:
                  * 1: Crawl blog homepage + all blog posts
                  * 2: Crawl docs homepage + category pages + individual doc pages
                  * 3: Deep crawling for comprehensive site coverage
                - Considerations:
                  * Higher depth can lead to exponential page growth
                  * Use with max_pages to control scope and cost
                  * Consider site structure when setting depth
    
            max_pages (Optional[int]): Maximum number of pages to crawl in total.
                - Default: unlimited (will crawl until no more links or depth limit)
                - Recommended ranges:
                  * 10-20: Testing and small sites
                  * 50-100: Medium sites and focused crawling
                  * 200-500: Large sites and comprehensive analysis
                  * 1000+: Enterprise-level crawling (high cost)
                - Cost implications:
                  * AI mode: max_pages × 10 credits
                  * Markdown mode: max_pages × 2 credits
                - Examples:
                  * 10: Quick site sampling (20-100 credits)
                  * 50: Standard documentation crawl (100-500 credits)
                  * 200: Comprehensive site analysis (400-2000 credits)
                - Note: Crawler stops when this limit is reached, regardless of remaining links
    
            same_domain_only (Optional[bool]): Whether to crawl only within the same domain.
                - Default: true (recommended for most use cases)
                - Options:
                  * true: Only crawl pages within the same domain as starting URL
                    - Prevents following external links
                    - Keeps crawling focused on the target site
                    - Reduces risk of crawling unrelated content
                    - Example: Starting at docs.example.com only crawls docs.example.com pages
                  * false: Allow crawling external domains
                    - Follows links to other domains
                    - Can lead to very broad crawling scope
                    - May crawl unrelated or unwanted content
                    - Use with caution and appropriate max_pages limit
                - Recommendations:
                  * Use true for focused site crawling
                  * Use false only when you specifically need cross-domain data
                  * Always set max_pages when using false to prevent runaway crawling
    
        Returns:
            Dictionary containing:
            - request_id: Unique identifier for this crawl operation (use with smartcrawler_fetch_results)
            - status: Initial status of the crawl request ("initiated" or "processing")
            - estimated_cost: Estimated credit cost based on parameters (actual cost may vary)
            - crawl_parameters: Summary of the crawling configuration
            - estimated_time: Rough estimate of processing time
            - next_steps: Instructions for retrieving results
    
        Raises:
            ValueError: If URL is malformed, prompt is missing for AI mode, or parameters are invalid
            HTTPError: If the starting URL cannot be accessed
            RateLimitError: If too many crawl requests are initiated too quickly
    
        Note:
            - This operation is asynchronous and may take several minutes to complete
            - Use smartcrawler_fetch_results with the returned request_id to get results
            - Keep polling smartcrawler_fetch_results until status is "completed"
            - Actual pages crawled may be less than max_pages if fewer links are found
            - Processing time increases with max_pages, depth, and extraction_mode complexity
        """
        try:
            api_key = get_api_key(ctx)
            client = ScapeGraphClient(api_key)
            return client.smartcrawler_initiate(
                url=url,
                prompt=prompt,
                extraction_mode=extraction_mode,
                depth=depth,
                max_pages=max_pages,
                same_domain_only=same_domain_only
            )
        except Exception as e:
            return {"error": str(e)}
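  • The handler never raises; failures come back as an {"error": ...} dictionary. A minimal, hypothetical caller-side check of the returned value (the helper name is illustrative):
    def start_crawl_or_raise(result: dict) -> str:
        """Return the request_id from a smartcrawler_initiate result,
        or raise if the handler reported an error."""
        if "error" in result:
            raise RuntimeError(f"smartcrawler_initiate failed: {result['error']}")
        return result["request_id"]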
  • Core helper method in ScapeGraphClient class that constructs the API request payload based on parameters, handles extraction mode logic, makes HTTP POST to /crawl endpoint, and returns the JSON response containing the request ID.
    def smartcrawler_initiate(
        self, 
        url: str, 
        prompt: str = None, 
        extraction_mode: str = "ai",
        depth: int = None,
        max_pages: int = None,
        same_domain_only: bool = None
    ) -> Dict[str, Any]:
        """
        Initiate a SmartCrawler request for multi-page web crawling.
        
        SmartCrawler supports two modes:
        - AI Extraction Mode (10 credits per page): Extracts structured data based on your prompt
        - Markdown Conversion Mode (2 credits per page): Converts pages to clean markdown
    
        Smartcrawler takes some time to process the request and returns the request id.
        Use smartcrawler_fetch_results to get the results of the request.
        You have to keep polling the smartcrawler_fetch_results until the request is complete.
        The request is complete when the status is "completed".
    
        Args:
            url: Starting URL to crawl
            prompt: AI prompt for data extraction (required for AI mode)
            extraction_mode: "ai" for AI extraction or "markdown" for markdown conversion (default: "ai")
            depth: Maximum link traversal depth (optional)
            max_pages: Maximum number of pages to crawl (optional)
            same_domain_only: Whether to crawl only within the same domain (optional)
    
        Returns:
            Dictionary containing the request ID for async processing
        """
        endpoint = f"{self.BASE_URL}/crawl"
        data = {
            "url": url
        }
        
        # Handle extraction mode
        if extraction_mode == "markdown":
            data["markdown_only"] = True
        elif extraction_mode == "ai":
            if prompt is None:
                raise ValueError("prompt is required when extraction_mode is 'ai'")
            data["prompt"] = prompt
        else:
            raise ValueError(f"Invalid extraction_mode: {extraction_mode}. Must be 'ai' or 'markdown'")
        if depth is not None:
            data["depth"] = depth
        if max_pages is not None:
            data["max_pages"] = max_pages
        if same_domain_only is not None:
            data["same_domain_only"] = same_domain_only
    
        response = self.client.post(endpoint, headers=self.headers, json=data)
    
        if response.status_code != 200:
            error_msg = f"Error {response.status_code}: {response.text}"
            raise Exception(error_msg)
    
        return response.json()
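  • Illustrative direct use of the client method, mirroring what the tool handler does. The API key and URL are placeholders, and reading request_id from the response follows the Returns description above:
    client = ScapeGraphClient("YOUR_API_KEY")  # placeholder API key
    response = client.smartcrawler_initiate(
        url="https://docs.example.com",
        prompt="Extract API endpoint name, method, parameters, and description",
        extraction_mode="ai",
        max_pages=50,
        same_domain_only=True,
    )
    request_id = response["request_id"]  # poll smartcrawler_fetch_results with this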
  • Function signature defines input schema via type hints: required url (str), optional prompt (str), extraction_mode (str default 'ai'), depth/max_pages (int), same_domain_only (bool). Returns Dict[str, Any] typically containing request_id.
    def smartcrawler_initiate(
        url: str,
        ctx: Context,
        prompt: Optional[str] = None,
        extraction_mode: str = "ai",
        depth: Optional[int] = None,
        max_pages: Optional[int] = None,
        same_domain_only: Optional[bool] = None
    ) -> Dict[str, Any]:
  • Registers the smartcrawler_initiate function as an MCP tool with specific annotations indicating it's non-read-only, non-destructive, non-idempotent.
    @mcp.tool(annotations={"readOnlyHint": False, "destructiveHint": False, "idempotentHint": False})
