ScrapeGraph MCP Server (Official)

smartcrawler_initiate

Start a multi-page web crawl from a starting URL to extract structured data with AI or convert page content to markdown.

Instructions

Initiate an asynchronous multi-page web crawling operation with AI extraction or markdown conversion.

This tool starts an intelligent crawler that discovers and processes multiple pages from a starting URL. Choose between AI Extraction Mode (10 credits/page) for structured data or Markdown Mode (2 credits/page) for content conversion. The operation is asynchronous; use smartcrawler_fetch_results to retrieve results. Each call creates a new crawl request (non-idempotent, non-read-only).

SmartCrawler supports two modes:

  • AI Extraction Mode: Extracts structured data based on your prompt from every crawled page

  • Markdown Conversion Mode: Converts each page to clean markdown format

Args:

url (str): The starting URL to begin crawling from.
    - Must include protocol (http:// or https://)
    - The crawler will discover and process linked pages from this starting point
    - Should be a page with links to other pages you want to crawl
    - Examples:
      * https://docs.example.com (documentation site root)
      * https://blog.company.com (blog homepage)
      * https://example.com/products (product category page)
      * https://news.site.com/category/tech (news section)
    - Best practices:
      * Use homepage or main category pages as starting points
      * Ensure the starting page has links to content you want to crawl
      * Consider site structure when choosing the starting URL

prompt (Optional[str]): AI prompt for data extraction.
    - REQUIRED when extraction_mode is 'ai'
    - Ignored when extraction_mode is 'markdown'
    - Describes what data to extract from each crawled page
    - Applied consistently across all discovered pages
    - Examples:
      * "Extract API endpoint name, method, parameters, and description"
      * "Get article title, author, publication date, and summary"
      * "Find product name, price, description, and availability"
      * "Extract job title, company, location, salary, and requirements"
    - Tips for better results:
      * Be specific about fields you want from each page
      * Consider that different pages may have different content structures
      * Use general terms that apply across multiple page types

extraction_mode (str): Extraction mode for processing crawled pages.
    - Default: "ai"
    - Options:
      * "ai": AI-powered structured data extraction (10 credits per page)
        - Uses the prompt to extract specific data from each page
        - Returns structured JSON data
        - More expensive but provides targeted information
        - Best for: Data collection, research, structured analysis
      * "markdown": Simple markdown conversion (2 credits per page)
        - Converts each page to clean markdown format
        - No AI processing, just content conversion
        - More cost-effective for content archival
        - Best for: Documentation backup, content migration, reading
    - Cost comparison:
      * AI mode: 50 pages = 500 credits
      * Markdown mode: 50 pages = 100 credits

depth (Optional[int]): Maximum depth of link traversal from the starting URL.
    - Default: unlimited (will follow links until max_pages or no more links)
    - Depth levels:
      * 0: Only the starting URL (no link following)
      * 1: Starting URL + pages directly linked from it
      * 2: Starting URL + direct links + links from those pages
      * 3+: Continues following links to specified depth
    - Examples:
      * 1: Crawl blog homepage + all blog posts
      * 2: Crawl docs homepage + category pages + individual doc pages
      * 3: Deep crawling for comprehensive site coverage
    - Considerations:
      * Higher depth can lead to exponential page growth
      * Use with max_pages to control scope and cost (see the sketch after this parameter list)
      * Consider site structure when setting depth

max_pages (Optional[int]): Maximum number of pages to crawl in total.
    - Default: unlimited (will crawl until no more links or depth limit)
    - Recommended ranges:
      * 10-20: Testing and small sites
      * 50-100: Medium sites and focused crawling
      * 200-500: Large sites and comprehensive analysis
      * 1000+: Enterprise-level crawling (high cost)
    - Cost implications:
      * AI mode: max_pages × 10 credits
      * Markdown mode: max_pages × 2 credits
    - Examples:
      * 10: Quick site sampling (20-100 credits)
      * 50: Standard documentation crawl (100-500 credits)
      * 200: Comprehensive site analysis (400-2000 credits)
    - Note: Crawler stops when this limit is reached, regardless of remaining links

same_domain_only (Optional[bool]): Whether to crawl only within the same domain.
    - Default: true (recommended for most use cases)
    - Options:
      * true: Only crawl pages within the same domain as starting URL
        - Prevents following external links
        - Keeps crawling focused on the target site
        - Reduces risk of crawling unrelated content
        - Example: Starting at docs.example.com only crawls docs.example.com pages
      * false: Allow crawling external domains
        - Follows links to other domains
        - Can lead to very broad crawling scope
        - May crawl unrelated or unwanted content
        - Use with caution and appropriate max_pages limit
    - Recommendations:
      * Use true for focused site crawling
      * Use false only when you specifically need cross-domain data
      * Always set max_pages when using false to prevent runaway crawling
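
To make the depth and cost guidance above concrete, here is a small illustrative sketch (the branching factors are assumed; the per-page rates come from this description):

    # Worst-case reachable pages when each page links to ~`branching` new pages:
    # 1 + b + b^2 + ... + b^depth, which is why max_pages matters at depth >= 2.
    def worst_case_pages(branching: int, depth: int) -> int:
        return sum(branching ** level for level in range(depth + 1))

    # Per-page credit rates as documented above (AI: 10, markdown: 2).
    CREDITS_PER_PAGE = {"ai": 10, "markdown": 2}

    def estimate_credits(pages: int, extraction_mode: str = "ai") -> int:
        # Upper bound: actual cost is lower if fewer pages are discovered.
        return pages * CREDITS_PER_PAGE[extraction_mode]

    print(worst_case_pages(10, 2))           # 111 pages
    print(worst_case_pages(10, 3))           # 1111 pages
    print(estimate_credits(50, "ai"))        # 500 credits
    print(estimate_credits(50, "markdown"))  # 100 credits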

Returns: Dictionary containing:
  - request_id: Unique identifier for this crawl operation (use with smartcrawler_fetch_results)
  - status: Initial status of the crawl request ("initiated" or "processing")
  - estimated_cost: Estimated credit cost based on parameters (actual cost may vary)
  - crawl_parameters: Summary of the crawling configuration
  - estimated_time: Rough estimate of processing time
  - next_steps: Instructions for retrieving results

Raises:
  - ValueError: If URL is malformed, prompt is missing for AI mode, or parameters are invalid
  - HTTPError: If the starting URL cannot be accessed
  - RateLimitError: If too many crawl requests are initiated too quickly

Note:
  - This operation is asynchronous and may take several minutes to complete
  - Use smartcrawler_fetch_results with the returned request_id to get results
  - Keep polling smartcrawler_fetch_results until status is "completed"
  - Actual pages crawled may be fewer than max_pages if fewer links are found
  - Processing time increases with max_pages, depth, and extraction_mode complexity
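
As a rough sketch of this workflow, assuming a hypothetical call_tool helper that invokes MCP tools by name (the tool names and the request_id/status fields follow this description):

    import time

    # Hypothetical helper that invokes an MCP tool by name and returns its
    # result dictionary; any real MCP client exposes an equivalent call.
    def call_tool(name: str, **arguments: object) -> dict:
        ...

    # Start the crawl in markdown mode (2 credits/page), capped at 10 pages.
    initiated = call_tool(
        "smartcrawler_initiate",
        url="https://docs.example.com",
        extraction_mode="markdown",
        max_pages=10,
    )
    request_id = initiated["request_id"]

    # Poll until the crawl reports completion, as described in the Note above.
    while True:
        result = call_tool("smartcrawler_fetch_results", request_id=request_id)
        if result.get("status") == "completed":
            break
        time.sleep(5)  # back off between polls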

Input Schema

| Name | Required | Description | Default |
| --- | --- | --- | --- |
| url | Yes | | |
| prompt | No | | |
| extraction_mode | No | | ai |
| depth | No | | |
| max_pages | No | | |
| same_domain_only | No | | |
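
For example, a set of tool arguments matching this schema (all values are illustrative):

    # Illustrative arguments for smartcrawler_initiate.
    arguments = {
        "url": "https://docs.example.com",  # required; must include http(s)://
        "prompt": "Extract page title and a one-sentence summary",  # used only in "ai" mode
        "extraction_mode": "ai",            # default "ai"; "markdown" ignores prompt
        "depth": 2,                         # optional link-traversal depth
        "max_pages": 50,                    # optional; caps cost at 50 x 10 = 500 credits here
        "same_domain_only": True,           # optional; stay on the starting domain
    }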

Output Schema

No fields documented.

Implementation Reference

  • MCP tool handler for 'smartcrawler_initiate'. Validates inputs via type hints, retrieves API key, instantiates ScapeGraphClient, and calls the client method to initiate the crawl. Returns request ID or error.
    @mcp.tool(annotations={"readOnlyHint": False, "destructiveHint": False, "idempotentHint": False})
    def smartcrawler_initiate(
        url: str,
        ctx: Context,
        prompt: Optional[str] = None,
        extraction_mode: str = "ai",
        depth: Optional[int] = None,
        max_pages: Optional[int] = None,
        same_domain_only: Optional[bool] = None
    ) -> Dict[str, Any]:
        """
        Initiate an asynchronous multi-page web crawling operation with AI extraction or markdown conversion.
    
        This tool starts an intelligent crawler that discovers and processes multiple pages from a starting URL.
        Choose between AI Extraction Mode (10 credits/page) for structured data or Markdown Mode (2 credits/page)
        for content conversion. The operation is asynchronous - use smartcrawler_fetch_results to retrieve results.
        Creates a new crawl request (non-idempotent, non-read-only).
    
        SmartCrawler supports two modes:
        - AI Extraction Mode: Extracts structured data based on your prompt from every crawled page
        - Markdown Conversion Mode: Converts each page to clean markdown format
    
        Args:
            url (str): The starting URL to begin crawling from.
                - Must include protocol (http:// or https://)
                - The crawler will discover and process linked pages from this starting point
                - Should be a page with links to other pages you want to crawl
                - Examples:
                  * https://docs.example.com (documentation site root)
                  * https://blog.company.com (blog homepage)
                  * https://example.com/products (product category page)
                  * https://news.site.com/category/tech (news section)
                - Best practices:
                  * Use homepage or main category pages as starting points
                  * Ensure the starting page has links to content you want to crawl
                  * Consider site structure when choosing the starting URL
    
            prompt (Optional[str]): AI prompt for data extraction.
                - REQUIRED when extraction_mode is 'ai'
                - Ignored when extraction_mode is 'markdown'
                - Describes what data to extract from each crawled page
                - Applied consistently across all discovered pages
                - Examples:
                  * "Extract API endpoint name, method, parameters, and description"
                  * "Get article title, author, publication date, and summary"
                  * "Find product name, price, description, and availability"
                  * "Extract job title, company, location, salary, and requirements"
                - Tips for better results:
                  * Be specific about fields you want from each page
                  * Consider that different pages may have different content structures
                  * Use general terms that apply across multiple page types
    
            extraction_mode (str): Extraction mode for processing crawled pages.
                - Default: "ai"
                - Options:
                  * "ai": AI-powered structured data extraction (10 credits per page)
                    - Uses the prompt to extract specific data from each page
                    - Returns structured JSON data
                    - More expensive but provides targeted information
                    - Best for: Data collection, research, structured analysis
                  * "markdown": Simple markdown conversion (2 credits per page)
                    - Converts each page to clean markdown format
                    - No AI processing, just content conversion
                    - More cost-effective for content archival
                    - Best for: Documentation backup, content migration, reading
                - Cost comparison:
                  * AI mode: 50 pages = 500 credits
                  * Markdown mode: 50 pages = 100 credits
    
            depth (Optional[int]): Maximum depth of link traversal from the starting URL.
                - Default: unlimited (will follow links until max_pages or no more links)
                - Depth levels:
                  * 0: Only the starting URL (no link following)
                  * 1: Starting URL + pages directly linked from it
                  * 2: Starting URL + direct links + links from those pages
                  * 3+: Continues following links to specified depth
                - Examples:
                  * 1: Crawl blog homepage + all blog posts
                  * 2: Crawl docs homepage + category pages + individual doc pages
                  * 3: Deep crawling for comprehensive site coverage
                - Considerations:
                  * Higher depth can lead to exponential page growth
                  * Use with max_pages to control scope and cost
                  * Consider site structure when setting depth
    
            max_pages (Optional[int]): Maximum number of pages to crawl in total.
                - Default: unlimited (will crawl until no more links or depth limit)
                - Recommended ranges:
                  * 10-20: Testing and small sites
                  * 50-100: Medium sites and focused crawling
                  * 200-500: Large sites and comprehensive analysis
                  * 1000+: Enterprise-level crawling (high cost)
                - Cost implications:
                  * AI mode: max_pages × 10 credits
                  * Markdown mode: max_pages × 2 credits
                - Examples:
                  * 10: Quick site sampling (20-100 credits)
                  * 50: Standard documentation crawl (100-500 credits)
                  * 200: Comprehensive site analysis (400-2000 credits)
                - Note: Crawler stops when this limit is reached, regardless of remaining links
    
            same_domain_only (Optional[bool]): Whether to crawl only within the same domain.
                - Default: true (recommended for most use cases)
                - Options:
                  * true: Only crawl pages within the same domain as starting URL
                    - Prevents following external links
                    - Keeps crawling focused on the target site
                    - Reduces risk of crawling unrelated content
                    - Example: Starting at docs.example.com only crawls docs.example.com pages
                  * false: Allow crawling external domains
                    - Follows links to other domains
                    - Can lead to very broad crawling scope
                    - May crawl unrelated or unwanted content
                    - Use with caution and appropriate max_pages limit
                - Recommendations:
                  * Use true for focused site crawling
                  * Use false only when you specifically need cross-domain data
                  * Always set max_pages when using false to prevent runaway crawling
    
        Returns:
            Dictionary containing:
            - request_id: Unique identifier for this crawl operation (use with smartcrawler_fetch_results)
            - status: Initial status of the crawl request ("initiated" or "processing")
            - estimated_cost: Estimated credit cost based on parameters (actual cost may vary)
            - crawl_parameters: Summary of the crawling configuration
            - estimated_time: Rough estimate of processing time
            - next_steps: Instructions for retrieving results
    
        Raises:
            ValueError: If URL is malformed, prompt is missing for AI mode, or parameters are invalid
            HTTPError: If the starting URL cannot be accessed
            RateLimitError: If too many crawl requests are initiated too quickly
    
        Note:
            - This operation is asynchronous and may take several minutes to complete
            - Use smartcrawler_fetch_results with the returned request_id to get results
            - Keep polling smartcrawler_fetch_results until status is "completed"
            - Actual pages crawled may be less than max_pages if fewer links are found
            - Processing time increases with max_pages, depth, and extraction_mode complexity
        """
        try:
            api_key = get_api_key(ctx)
            client = ScapeGraphClient(api_key)
            return client.smartcrawler_initiate(
                url=url,
                prompt=prompt,
                extraction_mode=extraction_mode,
                depth=depth,
                max_pages=max_pages,
                same_domain_only=same_domain_only
            )
        except Exception as e:
            return {"error": str(e)}
  • Core helper method in ScapeGraphClient class that constructs the API request payload based on parameters, handles extraction mode logic, makes HTTP POST to /crawl endpoint, and returns the JSON response containing the request ID.
    def smartcrawler_initiate(
        self, 
        url: str, 
        prompt: str = None, 
        extraction_mode: str = "ai",
        depth: int = None,
        max_pages: int = None,
        same_domain_only: bool = None
    ) -> Dict[str, Any]:
        """
        Initiate a SmartCrawler request for multi-page web crawling.
        
        SmartCrawler supports two modes:
        - AI Extraction Mode (10 credits per page): Extracts structured data based on your prompt
        - Markdown Conversion Mode (2 credits per page): Converts pages to clean markdown
    
        Smartcrawler takes some time to process the request and returns the request id.
        Use smartcrawler_fetch_results to get the results of the request.
        You have to keep polling the smartcrawler_fetch_results until the request is complete.
        The request is complete when the status is "completed".
    
        Args:
            url: Starting URL to crawl
            prompt: AI prompt for data extraction (required for AI mode)
            extraction_mode: "ai" for AI extraction or "markdown" for markdown conversion (default: "ai")
            depth: Maximum link traversal depth (optional)
            max_pages: Maximum number of pages to crawl (optional)
            same_domain_only: Whether to crawl only within the same domain (optional)
    
        Returns:
            Dictionary containing the request ID for async processing
        """
        endpoint = f"{self.BASE_URL}/crawl"
        data = {
            "url": url
        }
        
        # Handle extraction mode
        if extraction_mode == "markdown":
            data["markdown_only"] = True
        elif extraction_mode == "ai":
            if prompt is None:
                raise ValueError("prompt is required when extraction_mode is 'ai'")
            data["prompt"] = prompt
        else:
            raise ValueError(f"Invalid extraction_mode: {extraction_mode}. Must be 'ai' or 'markdown'")
        if depth is not None:
            data["depth"] = depth
        if max_pages is not None:
            data["max_pages"] = max_pages
        if same_domain_only is not None:
            data["same_domain_only"] = same_domain_only
    
        response = self.client.post(endpoint, headers=self.headers, json=data)
    
        if response.status_code != 200:
            error_msg = f"Error {response.status_code}: {response.text}"
            raise Exception(error_msg)
    
        return response.json()
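
For illustration, a minimal sketch of calling this client method directly (assuming a ScapeGraphClient constructed with a valid API key, as the handler above does; the request_id field follows this page's description):

    # Hypothetical direct use of the client method shown above.
    client = ScapeGraphClient(api_key="sgai-...")  # placeholder key

    # Markdown mode needs no prompt; the method sets markdown_only internally.
    resp = client.smartcrawler_initiate(
        url="https://docs.example.com",
        extraction_mode="markdown",
        max_pages=20,
    )
    request_id = resp.get("request_id")  # poll smartcrawler_fetch_results with this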
  • Function signature defines input schema via type hints: required url (str), optional prompt (str), extraction_mode (str default 'ai'), depth/max_pages (int), same_domain_only (bool). Returns Dict[str, Any] typically containing request_id.
    def smartcrawler_initiate(
        url: str,
        ctx: Context,
        prompt: Optional[str] = None,
        extraction_mode: str = "ai",
        depth: Optional[int] = None,
        max_pages: Optional[int] = None,
        same_domain_only: Optional[bool] = None
    ) -> Dict[str, Any]:
  • Registers the smartcrawler_initiate function as an MCP tool with specific annotations indicating it's non-read-only, non-destructive, non-idempotent.
    @mcp.tool(annotations={"readOnlyHint": False, "destructiveHint": False, "idempotentHint": False})
Behavior: 5/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

The description adds significant behavioral context beyond annotations: it explains the asynchronous nature, credit costs per mode (10 vs 2 credits/page), non-idempotent behavior, estimated processing time, and need for polling with fetch_results. Annotations only indicate it's not read-only/idempotent/destructive, so this provides crucial operational details.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness: 4/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is well-structured with clear sections (overview, modes, args, returns, raises, notes) but is quite lengthy. While every sentence adds value, it could be more front-loaded; the core functionality is clear early, but parameter details are extensive. Still, no wasted text.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness: 5/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Given the tool's complexity (6 parameters, async operation, cost implications) and 0% schema coverage, the description is exceptionally complete. It covers purpose, usage, all parameters, return values, errors, costs, and next steps. The output schema exists but the description still usefully summarizes returns.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters: 5/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

With 0% schema description coverage, the description fully compensates by providing detailed semantics for all 6 parameters: url requirements, prompt usage, extraction_mode options with costs, depth levels, max_pages ranges, and same_domain_only implications. Each parameter includes examples, defaults, and practical guidance.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose: 5/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states the tool 'starts an intelligent crawler that discovers and processes multiple pages from a starting URL' and distinguishes it from siblings by specifying it's for 'asynchronous multi-page web crawling' with AI extraction or markdown conversion, unlike simpler scraping tools like 'scrape' or 'smartscraper'.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines: 5/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description provides explicit guidance on when to use this tool vs alternatives: it specifies this is for 'asynchronous multi-page web crawling' and directs users to 'use smartcrawler_fetch_results to retrieve results'. It also contrasts modes (AI vs markdown) with cost implications and best-use cases.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
