smartcrawler_initiate
Start a multi-page web crawl from a starting URL, extracting structured data with AI or converting page content to markdown.
Instructions
Initiate an asynchronous multi-page web crawling operation with AI extraction or markdown conversion.
This tool starts an intelligent crawler that discovers and processes multiple pages from a starting URL. Choose between AI Extraction Mode (10 credits/page) for structured data and Markdown Mode (2 credits/page) for content conversion. The operation is asynchronous; use smartcrawler_fetch_results to retrieve results. Creates a new crawl request (non-idempotent, non-read-only).
SmartCrawler supports two modes:
- AI Extraction Mode: Extracts structured data based on your prompt from every crawled page
- Markdown Conversion Mode: Converts each page to clean markdown format
Args:
- url (str): The starting URL to begin crawling from.
  - Must include protocol (http:// or https://)
  - The crawler will discover and process linked pages from this starting point
  - Should be a page with links to other pages you want to crawl
  - Examples:
    - https://docs.example.com (documentation site root)
    - https://blog.company.com (blog homepage)
    - https://example.com/products (product category page)
    - https://news.site.com/category/tech (news section)
  - Best practices:
    - Use homepage or main category pages as starting points
    - Ensure the starting page has links to content you want to crawl
    - Consider site structure when choosing the starting URL

Returns: Dictionary containing:
- request_id: Unique identifier for this crawl operation (use with smartcrawler_fetch_results)
- status: Initial status of the crawl request ("initiated" or "processing")
- estimated_cost: Estimated credit cost based on parameters (actual cost may vary)
- crawl_parameters: Summary of the crawling configuration
- estimated_time: Rough estimate of processing time
- next_steps: Instructions for retrieving results

Raises:
- ValueError: If URL is malformed, prompt is missing for AI mode, or parameters are invalid
- HTTPError: If the starting URL cannot be accessed
- RateLimitError: If too many crawl requests are initiated too quickly

Note:
- This operation is asynchronous and may take several minutes to complete
- Use smartcrawler_fetch_results with the returned request_id to get results
- Keep polling smartcrawler_fetch_results until status is "completed" (see the sketch after these notes)
- Actual pages crawled may be less than max_pages if fewer links are found
- Processing time increases with max_pages, depth, and extraction_mode complexity
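A minimal sketch of that initiate-then-poll loop is shown below, assuming the two tools are exposed to your code as plain Python callables (how they are actually invoked depends on your MCP client); the polling interval and timeout are illustrative values, not anything defined by the API.

```python
import time
from typing import Any, Callable, Dict


def crawl_and_wait(
    initiate: Callable[..., Dict[str, Any]],          # stand-in for smartcrawler_initiate
    fetch_results: Callable[[str], Dict[str, Any]],   # stand-in for smartcrawler_fetch_results
    url: str,
    prompt: str,
    max_pages: int = 20,
    poll_interval: float = 15.0,                      # illustrative; crawls can take minutes
    timeout: float = 1800.0,
) -> Dict[str, Any]:
    """Initiate a crawl, then poll until its status is "completed"."""
    start = initiate(
        url=url,
        prompt=prompt,
        extraction_mode="ai",        # or "markdown" (no prompt needed, 2 credits/page)
        depth=2,
        max_pages=max_pages,
        same_domain_only=True,
    )
    request_id = start["request_id"]

    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        result = fetch_results(request_id)
        if result.get("status") == "completed":
            return result
        time.sleep(poll_interval)

    raise TimeoutError(f"Crawl {request_id} did not complete within {timeout:.0f}s")
```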
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| url | Yes | Starting URL to crawl; must include the protocol (http:// or https://) | |
| prompt | No | AI prompt describing the data to extract; required when extraction_mode is "ai", ignored for "markdown" | |
| extraction_mode | No | "ai" for AI-powered structured extraction (10 credits/page) or "markdown" for markdown conversion (2 credits/page) | ai |
| depth | No | Maximum link-traversal depth from the starting URL | unlimited |
| max_pages | No | Maximum total number of pages to crawl | unlimited |
| same_domain_only | No | Whether to crawl only pages within the starting URL's domain | true |
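For orientation, here are two illustrative argument sets, one per extraction mode. The URLs, prompt, and limits are examples drawn from the parameter documentation, not defaults.

```python
# Illustrative argument sets for the two extraction modes (example values, not defaults).
ai_crawl_args = {
    "url": "https://docs.example.com",
    "prompt": "Extract API endpoint name, method, parameters, and description",
    "extraction_mode": "ai",        # 10 credits per page
    "depth": 2,
    "max_pages": 50,                # worst-case cost: 50 x 10 = 500 credits
    "same_domain_only": True,
}

markdown_crawl_args = {
    "url": "https://blog.company.com",
    "extraction_mode": "markdown",  # 2 credits per page; prompt is ignored in this mode
    "max_pages": 50,                # worst-case cost: 50 x 2 = 100 credits
    "same_domain_only": True,
}
```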
Implementation Reference
- src/scrapegraph_mcp/server.py:1665-1818 (handler): MCP tool handler for 'smartcrawler_initiate'. Validates inputs via type hints, retrieves the API key, instantiates ScapeGraphClient, and calls the client method to initiate the crawl. Returns the request ID or an error.

```python
@mcp.tool(annotations={"readOnlyHint": False, "destructiveHint": False, "idempotentHint": False})
def smartcrawler_initiate(
    url: str,
    ctx: Context,
    prompt: Optional[str] = None,
    extraction_mode: str = "ai",
    depth: Optional[int] = None,
    max_pages: Optional[int] = None,
    same_domain_only: Optional[bool] = None
) -> Dict[str, Any]:
    """
    Initiate an asynchronous multi-page web crawling operation with AI extraction
    or markdown conversion.

    This tool starts an intelligent crawler that discovers and processes multiple
    pages from a starting URL. Choose between AI Extraction Mode (10 credits/page)
    for structured data or Markdown Mode (2 credits/page) for content conversion.
    The operation is asynchronous - use smartcrawler_fetch_results to retrieve
    results. Creates a new crawl request (non-idempotent, non-read-only).

    SmartCrawler supports two modes:
    - AI Extraction Mode: Extracts structured data based on your prompt from every crawled page
    - Markdown Conversion Mode: Converts each page to clean markdown format

    Args:
        url (str): The starting URL to begin crawling from.
            - Must include protocol (http:// or https://)
            - The crawler will discover and process linked pages from this starting point
            - Should be a page with links to other pages you want to crawl
            - Examples:
                * https://docs.example.com (documentation site root)
                * https://blog.company.com (blog homepage)
                * https://example.com/products (product category page)
                * https://news.site.com/category/tech (news section)
            - Best practices:
                * Use homepage or main category pages as starting points
                * Ensure the starting page has links to content you want to crawl
                * Consider site structure when choosing the starting URL
        prompt (Optional[str]): AI prompt for data extraction.
            - REQUIRED when extraction_mode is 'ai'
            - Ignored when extraction_mode is 'markdown'
            - Describes what data to extract from each crawled page
            - Applied consistently across all discovered pages
            - Examples:
                * "Extract API endpoint name, method, parameters, and description"
                * "Get article title, author, publication date, and summary"
                * "Find product name, price, description, and availability"
                * "Extract job title, company, location, salary, and requirements"
            - Tips for better results:
                * Be specific about fields you want from each page
                * Consider that different pages may have different content structures
                * Use general terms that apply across multiple page types
        extraction_mode (str): Extraction mode for processing crawled pages.
            - Default: "ai"
            - Options:
                * "ai": AI-powered structured data extraction (10 credits per page)
                    - Uses the prompt to extract specific data from each page
                    - Returns structured JSON data
                    - More expensive but provides targeted information
                    - Best for: Data collection, research, structured analysis
                * "markdown": Simple markdown conversion (2 credits per page)
                    - Converts each page to clean markdown format
                    - No AI processing, just content conversion
                    - More cost-effective for content archival
                    - Best for: Documentation backup, content migration, reading
            - Cost comparison:
                * AI mode: 50 pages = 500 credits
                * Markdown mode: 50 pages = 100 credits
        depth (Optional[int]): Maximum depth of link traversal from the starting URL.
            - Default: unlimited (will follow links until max_pages or no more links)
            - Depth levels:
                * 0: Only the starting URL (no link following)
                * 1: Starting URL + pages directly linked from it
                * 2: Starting URL + direct links + links from those pages
                * 3+: Continues following links to specified depth
            - Examples:
                * 1: Crawl blog homepage + all blog posts
                * 2: Crawl docs homepage + category pages + individual doc pages
                * 3: Deep crawling for comprehensive site coverage
            - Considerations:
                * Higher depth can lead to exponential page growth
                * Use with max_pages to control scope and cost
                * Consider site structure when setting depth
        max_pages (Optional[int]): Maximum number of pages to crawl in total.
            - Default: unlimited (will crawl until no more links or depth limit)
            - Recommended ranges:
                * 10-20: Testing and small sites
                * 50-100: Medium sites and focused crawling
                * 200-500: Large sites and comprehensive analysis
                * 1000+: Enterprise-level crawling (high cost)
            - Cost implications:
                * AI mode: max_pages × 10 credits
                * Markdown mode: max_pages × 2 credits
            - Examples:
                * 10: Quick site sampling (20-100 credits)
                * 50: Standard documentation crawl (100-500 credits)
                * 200: Comprehensive site analysis (400-2000 credits)
            - Note: Crawler stops when this limit is reached, regardless of remaining links
        same_domain_only (Optional[bool]): Whether to crawl only within the same domain.
            - Default: true (recommended for most use cases)
            - Options:
                * true: Only crawl pages within the same domain as starting URL
                    - Prevents following external links
                    - Keeps crawling focused on the target site
                    - Reduces risk of crawling unrelated content
                    - Example: Starting at docs.example.com only crawls docs.example.com pages
                * false: Allow crawling external domains
                    - Follows links to other domains
                    - Can lead to very broad crawling scope
                    - May crawl unrelated or unwanted content
                    - Use with caution and appropriate max_pages limit
            - Recommendations:
                * Use true for focused site crawling
                * Use false only when you specifically need cross-domain data
                * Always set max_pages when using false to prevent runaway crawling

    Returns:
        Dictionary containing:
        - request_id: Unique identifier for this crawl operation (use with smartcrawler_fetch_results)
        - status: Initial status of the crawl request ("initiated" or "processing")
        - estimated_cost: Estimated credit cost based on parameters (actual cost may vary)
        - crawl_parameters: Summary of the crawling configuration
        - estimated_time: Rough estimate of processing time
        - next_steps: Instructions for retrieving results

    Raises:
        ValueError: If URL is malformed, prompt is missing for AI mode, or parameters are invalid
        HTTPError: If the starting URL cannot be accessed
        RateLimitError: If too many crawl requests are initiated too quickly

    Note:
        - This operation is asynchronous and may take several minutes to complete
        - Use smartcrawler_fetch_results with the returned request_id to get results
        - Keep polling smartcrawler_fetch_results until status is "completed"
        - Actual pages crawled may be less than max_pages if fewer links are found
        - Processing time increases with max_pages, depth, and extraction_mode complexity
    """
    try:
        api_key = get_api_key(ctx)
        client = ScapeGraphClient(api_key)
        return client.smartcrawler_initiate(
            url=url,
            prompt=prompt,
            extraction_mode=extraction_mode,
            depth=depth,
            max_pages=max_pages,
            same_domain_only=same_domain_only
        )
    except Exception as e:
        return {"error": str(e)}
```
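A minimal sketch of invoking this handler through the MCP Python SDK over stdio follows; the server launch command and the API-key environment variable name are assumptions, so adjust them to match how your scrapegraph-mcp server is actually started and authenticated.

```python
import asyncio
import os

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client


async def main() -> None:
    # Assumed launch command and env var name; adjust for your installation.
    server = StdioServerParameters(
        command="uvx",
        args=["scrapegraph-mcp"],
        env={"SGAI_API_KEY": os.environ.get("SGAI_API_KEY", "")},
    )
    async with stdio_client(server) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            result = await session.call_tool(
                "smartcrawler_initiate",
                arguments={
                    "url": "https://docs.example.com",
                    "prompt": "Extract article title, author, and summary",
                    "extraction_mode": "ai",
                    "max_pages": 10,
                },
            )
            # The tool's return value (including request_id) comes back as tool content.
            print(result.content)


asyncio.run(main())
```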
- Core helper method in the ScapeGraphClient class: constructs the API request payload from the parameters, handles the extraction-mode logic, makes an HTTP POST to the /crawl endpoint, and returns the JSON response containing the request ID.

```python
def smartcrawler_initiate(
    self,
    url: str,
    prompt: str = None,
    extraction_mode: str = "ai",
    depth: int = None,
    max_pages: int = None,
    same_domain_only: bool = None
) -> Dict[str, Any]:
    """
    Initiate a SmartCrawler request for multi-page web crawling.

    SmartCrawler supports two modes:
    - AI Extraction Mode (10 credits per page): Extracts structured data based on your prompt
    - Markdown Conversion Mode (2 credits per page): Converts pages to clean markdown

    Smartcrawler takes some time to process the request and returns the request id.
    Use smartcrawler_fetch_results to get the results of the request.
    You have to keep polling the smartcrawler_fetch_results until the request is complete.
    The request is complete when the status is "completed".

    Args:
        url: Starting URL to crawl
        prompt: AI prompt for data extraction (required for AI mode)
        extraction_mode: "ai" for AI extraction or "markdown" for markdown conversion (default: "ai")
        depth: Maximum link traversal depth (optional)
        max_pages: Maximum number of pages to crawl (optional)
        same_domain_only: Whether to crawl only within the same domain (optional)

    Returns:
        Dictionary containing the request ID for async processing
    """
    endpoint = f"{self.BASE_URL}/crawl"
    data = {
        "url": url
    }

    # Handle extraction mode
    if extraction_mode == "markdown":
        data["markdown_only"] = True
    elif extraction_mode == "ai":
        if prompt is None:
            raise ValueError("prompt is required when extraction_mode is 'ai'")
        data["prompt"] = prompt
    else:
        raise ValueError(f"Invalid extraction_mode: {extraction_mode}. Must be 'ai' or 'markdown'")

    if depth is not None:
        data["depth"] = depth
    if max_pages is not None:
        data["max_pages"] = max_pages
    if same_domain_only is not None:
        data["same_domain_only"] = same_domain_only

    response = self.client.post(endpoint, headers=self.headers, json=data)

    if response.status_code != 200:
        error_msg = f"Error {response.status_code}: {response.text}"
        raise Exception(error_msg)

    return response.json()
```
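For comparison, the same request can be sketched directly with httpx. BASE_URL and the authentication header name are not shown in this excerpt, so the values below are assumptions; substitute whatever your ScapeGraphClient actually configures.

```python
import httpx

BASE_URL = "https://api.scrapegraphai.com/v1"  # assumed; mirror ScapeGraphClient.BASE_URL
API_KEY = "sgai-..."                            # your ScrapeGraph API key

# Payload mirrors what the helper builds for AI mode; markdown mode would
# instead set {"markdown_only": True} and omit "prompt".
payload = {
    "url": "https://docs.example.com",
    "prompt": "Extract API endpoint name, method, parameters, and description",
    "depth": 2,
    "max_pages": 50,
    "same_domain_only": True,
}

response = httpx.post(
    f"{BASE_URL}/crawl",
    headers={"SGAI-APIKEY": API_KEY},           # assumed header name
    json=payload,
    timeout=60.0,
)
response.raise_for_status()
print(response.json())  # expected to contain the request_id for polling
```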
- Function signature defines the input schema via type hints: required url (str), optional prompt (str), extraction_mode (str, default "ai"), depth and max_pages (int), and same_domain_only (bool). Returns Dict[str, Any], typically containing the request_id.

```python
def smartcrawler_initiate(
    url: str,
    ctx: Context,
    prompt: Optional[str] = None,
    extraction_mode: str = "ai",
    depth: Optional[int] = None,
    max_pages: Optional[int] = None,
    same_domain_only: Optional[bool] = None
) -> Dict[str, Any]:
```
- src/scrapegraph_mcp/server.py:1665-1665 (registration): Registers the smartcrawler_initiate function as an MCP tool, with annotations marking it non-read-only, non-destructive, and non-idempotent.

```python
@mcp.tool(annotations={"readOnlyHint": False, "destructiveHint": False, "idempotentHint": False})
```