ScrapeGraph MCP Server

smartcrawler_initiate

Start a multi-page crawl from a starting URL to extract structured data with AI or convert page content to markdown.

Instructions

Initiate an asynchronous multi-page web crawling operation with AI extraction or markdown conversion.

This tool starts an intelligent crawler that discovers and processes multiple pages from a starting URL. Choose between AI Extraction Mode (10 credits/page) for structured data or Markdown Mode (2 credits/page) for content conversion. The operation is asynchronous - use smartcrawler_fetch_results to retrieve results. Creates a new crawl request (non-idempotent, non-read-only).

SmartCrawler supports two modes:

  • AI Extraction Mode: Extracts structured data based on your prompt from every crawled page

  • Markdown Conversion Mode: Converts each page to clean markdown format
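
For illustration, here is what the two modes look like as argument sets for this tool. This is only a sketch: the URLs and prompt are placeholders, and only the parameters documented below are used.

    # Illustrative argument sets for the two modes (values are placeholders).

    # AI Extraction Mode: a prompt is required; each crawled page costs 10 credits.
    ai_mode_args = {
        "url": "https://docs.example.com",
        "prompt": "Extract API endpoint name, method, parameters, and description",
        "extraction_mode": "ai",
        "depth": 2,
        "max_pages": 50,
        "same_domain_only": True,
    }

    # Markdown Conversion Mode: no prompt needed; each crawled page costs 2 credits.
    markdown_mode_args = {
        "url": "https://docs.example.com",
        "extraction_mode": "markdown",
        "max_pages": 50,
        "same_domain_only": True,
    }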

Args:

url (str): The starting URL to begin crawling from.
    - Must include protocol (http:// or https://)
    - The crawler will discover and process linked pages from this starting point
    - Should be a page with links to other pages you want to crawl
    - Examples:
      * https://docs.example.com (documentation site root)
      * https://blog.company.com (blog homepage)
      * https://example.com/products (product category page)
      * https://news.site.com/category/tech (news section)
    - Best practices:
      * Use homepage or main category pages as starting points
      * Ensure the starting page has links to content you want to crawl
      * Consider site structure when choosing the starting URL

prompt (Optional[str]): AI prompt for data extraction.
    - REQUIRED when extraction_mode is 'ai'
    - Ignored when extraction_mode is 'markdown'
    - Describes what data to extract from each crawled page
    - Applied consistently across all discovered pages
    - Examples:
      * "Extract API endpoint name, method, parameters, and description"
      * "Get article title, author, publication date, and summary"
      * "Find product name, price, description, and availability"
      * "Extract job title, company, location, salary, and requirements"
    - Tips for better results:
      * Be specific about fields you want from each page
      * Consider that different pages may have different content structures
      * Use general terms that apply across multiple page types

extraction_mode (str): Extraction mode for processing crawled pages.
    - Default: "ai"
    - Options:
      * "ai": AI-powered structured data extraction (10 credits per page)
        - Uses the prompt to extract specific data from each page
        - Returns structured JSON data
        - More expensive but provides targeted information
        - Best for: Data collection, research, structured analysis
      * "markdown": Simple markdown conversion (2 credits per page)
        - Converts each page to clean markdown format
        - No AI processing, just content conversion
        - More cost-effective for content archival
        - Best for: Documentation backup, content migration, reading
    - Cost comparison:
      * AI mode: 50 pages = 500 credits
      * Markdown mode: 50 pages = 100 credits

depth (Optional[int]): Maximum depth of link traversal from the starting URL.
    - Default: unlimited (will follow links until max_pages or no more links)
    - Depth levels:
      * 0: Only the starting URL (no link following)
      * 1: Starting URL + pages directly linked from it
      * 2: Starting URL + direct links + links from those pages
      * 3+: Continues following links to specified depth
    - Examples:
      * 1: Crawl blog homepage + all blog posts
      * 2: Crawl docs homepage + category pages + individual doc pages
      * 3: Deep crawling for comprehensive site coverage
    - Considerations:
      * Higher depth can lead to exponential page growth
      * Use with max_pages to control scope and cost
      * Consider site structure when setting depth

max_pages (Optional[int]): Maximum number of pages to crawl in total.
    - Default: unlimited (will crawl until no more links or depth limit)
    - Recommended ranges:
      * 10-20: Testing and small sites
      * 50-100: Medium sites and focused crawling
      * 200-500: Large sites and comprehensive analysis
      * 1000+: Enterprise-level crawling (high cost)
    - Cost implications:
      * AI mode: max_pages × 10 credits
      * Markdown mode: max_pages × 2 credits
    - Examples:
      * 10: Quick site sampling (20-100 credits)
      * 50: Standard documentation crawl (100-500 credits)
      * 200: Comprehensive site analysis (400-2000 credits)
    - Note: Crawler stops when this limit is reached, regardless of remaining links

same_domain_only (Optional[bool]): Whether to crawl only within the same domain.
    - Default: true (recommended for most use cases)
    - Options:
      * true: Only crawl pages within the same domain as starting URL
        - Prevents following external links
        - Keeps crawling focused on the target site
        - Reduces risk of crawling unrelated content
        - Example: Starting at docs.example.com only crawls docs.example.com pages
      * false: Allow crawling external domains
        - Follows links to other domains
        - Can lead to very broad crawling scope
        - May crawl unrelated or unwanted content
        - Use with caution and appropriate max_pages limit
    - Recommendations:
      * Use true for focused site crawling
      * Use false only when you specifically need cross-domain data
      * Always set max_pages when using false to prevent runaway crawling
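
As a quick way to budget a crawl, the per-page rates above (10 credits per page for AI extraction, 2 credits per page for markdown conversion) can be multiplied by the page limit. The helper below is only a convenience sketch based on those documented rates; the service's own estimated_cost, and actual usage, may differ (for example when fewer pages are found than max_pages).

    # Rough upper-bound credit estimate from the documented per-page rates.
    CREDITS_PER_PAGE = {"ai": 10, "markdown": 2}

    def estimate_credits(extraction_mode: str, max_pages: int) -> int:
        """Upper-bound cost for a crawl capped at max_pages."""
        return CREDITS_PER_PAGE[extraction_mode] * max_pages

    # Figures quoted above:
    assert estimate_credits("ai", 50) == 500
    assert estimate_credits("markdown", 50) == 100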

Returns: Dictionary containing:
    - request_id: Unique identifier for this crawl operation (use with smartcrawler_fetch_results)
    - status: Initial status of the crawl request ("initiated" or "processing")
    - estimated_cost: Estimated credit cost based on parameters (actual cost may vary)
    - crawl_parameters: Summary of the crawling configuration
    - estimated_time: Rough estimate of processing time
    - next_steps: Instructions for retrieving results

Raises:
    ValueError: If URL is malformed, prompt is missing for AI mode, or parameters are invalid
    HTTPError: If the starting URL cannot be accessed
    RateLimitError: If too many crawl requests are initiated too quickly

Note:
    - This operation is asynchronous and may take several minutes to complete
    - Use smartcrawler_fetch_results with the returned request_id to get results
    - Keep polling smartcrawler_fetch_results until status is "completed"
    - Actual pages crawled may be less than max_pages if fewer links are found
    - Processing time increases with max_pages, depth, and extraction_mode complexity
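
Because the operation is asynchronous, a typical workflow initiates the crawl and then polls for results. The sketch below assumes that smartcrawler_fetch_results accepts the returned request_id and responds with a dictionary carrying the status and request_id fields described above; initiate_crawl and fetch_results are placeholders for however your MCP client invokes the two tools.

    import time

    def crawl_and_wait(initiate_crawl, fetch_results, url, prompt,
                       poll_interval=10.0, timeout=600.0):
        """Initiate a SmartCrawler run and poll until it reports completion."""
        started = initiate_crawl(
            url=url,
            prompt=prompt,
            extraction_mode="ai",   # or "markdown" to skip the prompt
            max_pages=50,
            same_domain_only=True,
        )
        if "error" in started:
            raise RuntimeError(started["error"])

        request_id = started["request_id"]
        deadline = time.monotonic() + timeout
        while time.monotonic() < deadline:
            result = fetch_results(request_id=request_id)
            if result.get("status") == "completed":
                return result
            time.sleep(poll_interval)  # keep polling until status is "completed"
        raise TimeoutError(f"Crawl {request_id} did not complete in {timeout}s")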

Input Schema

Name              Required  Description                                                     Default
url               Yes       Starting URL to crawl
prompt            No        AI prompt for data extraction (required for AI mode)
extraction_mode   No        "ai" for AI extraction or "markdown" for markdown conversion    ai
depth             No        Maximum link traversal depth
max_pages         No        Maximum number of pages to crawl
same_domain_only  No        Whether to crawl only within the same domain

Implementation Reference

  • MCP tool handler for 'smartcrawler_initiate'. Validates inputs via type hints, retrieves API key, instantiates ScapeGraphClient, and calls the client method to initiate the crawl. Returns request ID or error.
    @mcp.tool(annotations={"readOnlyHint": False, "destructiveHint": False, "idempotentHint": False})
    def smartcrawler_initiate(
        url: str,
        ctx: Context,
        prompt: Optional[str] = None,
        extraction_mode: str = "ai",
        depth: Optional[int] = None,
        max_pages: Optional[int] = None,
        same_domain_only: Optional[bool] = None
    ) -> Dict[str, Any]:
        """
        Initiate an asynchronous multi-page web crawling operation with AI extraction or markdown conversion.
    
        This tool starts an intelligent crawler that discovers and processes multiple pages from a starting URL.
        Choose between AI Extraction Mode (10 credits/page) for structured data or Markdown Mode (2 credits/page)
        for content conversion. The operation is asynchronous - use smartcrawler_fetch_results to retrieve results.
        Creates a new crawl request (non-idempotent, non-read-only).
    
        SmartCrawler supports two modes:
        - AI Extraction Mode: Extracts structured data based on your prompt from every crawled page
        - Markdown Conversion Mode: Converts each page to clean markdown format
    
        Args:
            url (str): The starting URL to begin crawling from.
                - Must include protocol (http:// or https://)
                - The crawler will discover and process linked pages from this starting point
                - Should be a page with links to other pages you want to crawl
                - Examples:
                  * https://docs.example.com (documentation site root)
                  * https://blog.company.com (blog homepage)
                  * https://example.com/products (product category page)
                  * https://news.site.com/category/tech (news section)
                - Best practices:
                  * Use homepage or main category pages as starting points
                  * Ensure the starting page has links to content you want to crawl
                  * Consider site structure when choosing the starting URL
    
            prompt (Optional[str]): AI prompt for data extraction.
                - REQUIRED when extraction_mode is 'ai'
                - Ignored when extraction_mode is 'markdown'
                - Describes what data to extract from each crawled page
                - Applied consistently across all discovered pages
                - Examples:
                  * "Extract API endpoint name, method, parameters, and description"
                  * "Get article title, author, publication date, and summary"
                  * "Find product name, price, description, and availability"
                  * "Extract job title, company, location, salary, and requirements"
                - Tips for better results:
                  * Be specific about fields you want from each page
                  * Consider that different pages may have different content structures
                  * Use general terms that apply across multiple page types
    
            extraction_mode (str): Extraction mode for processing crawled pages.
                - Default: "ai"
                - Options:
                  * "ai": AI-powered structured data extraction (10 credits per page)
                    - Uses the prompt to extract specific data from each page
                    - Returns structured JSON data
                    - More expensive but provides targeted information
                    - Best for: Data collection, research, structured analysis
                  * "markdown": Simple markdown conversion (2 credits per page)
                    - Converts each page to clean markdown format
                    - No AI processing, just content conversion
                    - More cost-effective for content archival
                    - Best for: Documentation backup, content migration, reading
                - Cost comparison:
                  * AI mode: 50 pages = 500 credits
                  * Markdown mode: 50 pages = 100 credits
    
            depth (Optional[int]): Maximum depth of link traversal from the starting URL.
                - Default: unlimited (will follow links until max_pages or no more links)
                - Depth levels:
                  * 0: Only the starting URL (no link following)
                  * 1: Starting URL + pages directly linked from it
                  * 2: Starting URL + direct links + links from those pages
                  * 3+: Continues following links to specified depth
                - Examples:
                  * 1: Crawl blog homepage + all blog posts
                  * 2: Crawl docs homepage + category pages + individual doc pages
                  * 3: Deep crawling for comprehensive site coverage
                - Considerations:
                  * Higher depth can lead to exponential page growth
                  * Use with max_pages to control scope and cost
                  * Consider site structure when setting depth
    
            max_pages (Optional[int]): Maximum number of pages to crawl in total.
                - Default: unlimited (will crawl until no more links or depth limit)
                - Recommended ranges:
                  * 10-20: Testing and small sites
                  * 50-100: Medium sites and focused crawling
                  * 200-500: Large sites and comprehensive analysis
                  * 1000+: Enterprise-level crawling (high cost)
                - Cost implications:
                  * AI mode: max_pages × 10 credits
                  * Markdown mode: max_pages × 2 credits
                - Examples:
                  * 10: Quick site sampling (20-100 credits)
                  * 50: Standard documentation crawl (100-500 credits)
                  * 200: Comprehensive site analysis (400-2000 credits)
                - Note: Crawler stops when this limit is reached, regardless of remaining links
    
            same_domain_only (Optional[bool]): Whether to crawl only within the same domain.
                - Default: true (recommended for most use cases)
                - Options:
                  * true: Only crawl pages within the same domain as starting URL
                    - Prevents following external links
                    - Keeps crawling focused on the target site
                    - Reduces risk of crawling unrelated content
                    - Example: Starting at docs.example.com only crawls docs.example.com pages
                  * false: Allow crawling external domains
                    - Follows links to other domains
                    - Can lead to very broad crawling scope
                    - May crawl unrelated or unwanted content
                    - Use with caution and appropriate max_pages limit
                - Recommendations:
                  * Use true for focused site crawling
                  * Use false only when you specifically need cross-domain data
                  * Always set max_pages when using false to prevent runaway crawling
    
        Returns:
            Dictionary containing:
            - request_id: Unique identifier for this crawl operation (use with smartcrawler_fetch_results)
            - status: Initial status of the crawl request ("initiated" or "processing")
            - estimated_cost: Estimated credit cost based on parameters (actual cost may vary)
            - crawl_parameters: Summary of the crawling configuration
            - estimated_time: Rough estimate of processing time
            - next_steps: Instructions for retrieving results
    
        Raises:
            ValueError: If URL is malformed, prompt is missing for AI mode, or parameters are invalid
            HTTPError: If the starting URL cannot be accessed
            RateLimitError: If too many crawl requests are initiated too quickly
    
        Note:
            - This operation is asynchronous and may take several minutes to complete
            - Use smartcrawler_fetch_results with the returned request_id to get results
            - Keep polling smartcrawler_fetch_results until status is "completed"
            - Actual pages crawled may be less than max_pages if fewer links are found
            - Processing time increases with max_pages, depth, and extraction_mode complexity
        """
        try:
            api_key = get_api_key(ctx)
            client = ScapeGraphClient(api_key)
            return client.smartcrawler_initiate(
                url=url,
                prompt=prompt,
                extraction_mode=extraction_mode,
                depth=depth,
                max_pages=max_pages,
                same_domain_only=same_domain_only
            )
        except Exception as e:
            return {"error": str(e)}
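  • The handler never raises; failures come back as an {"error": ...} dictionary. A minimal, hypothetical caller-side check of the returned value (the helper name is illustrative):
    def start_crawl_or_raise(result: dict) -> str:
        """Return the request_id from a smartcrawler_initiate result,
        or raise if the handler reported an error."""
        if "error" in result:
            raise RuntimeError(f"smartcrawler_initiate failed: {result['error']}")
        return result["request_id"]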
  • Core helper method in ScapeGraphClient class that constructs the API request payload based on parameters, handles extraction mode logic, makes HTTP POST to /crawl endpoint, and returns the JSON response containing the request ID.
    def smartcrawler_initiate(
        self, 
        url: str, 
        prompt: str = None, 
        extraction_mode: str = "ai",
        depth: int = None,
        max_pages: int = None,
        same_domain_only: bool = None
    ) -> Dict[str, Any]:
        """
        Initiate a SmartCrawler request for multi-page web crawling.
        
        SmartCrawler supports two modes:
        - AI Extraction Mode (10 credits per page): Extracts structured data based on your prompt
        - Markdown Conversion Mode (2 credits per page): Converts pages to clean markdown
    
        Smartcrawler takes some time to process the request and returns the request id.
        Use smartcrawler_fetch_results to get the results of the request.
        You have to keep polling the smartcrawler_fetch_results until the request is complete.
        The request is complete when the status is "completed".
    
        Args:
            url: Starting URL to crawl
            prompt: AI prompt for data extraction (required for AI mode)
            extraction_mode: "ai" for AI extraction or "markdown" for markdown conversion (default: "ai")
            depth: Maximum link traversal depth (optional)
            max_pages: Maximum number of pages to crawl (optional)
            same_domain_only: Whether to crawl only within the same domain (optional)
    
        Returns:
            Dictionary containing the request ID for async processing
        """
        endpoint = f"{self.BASE_URL}/crawl"
        data = {
            "url": url
        }
        
        # Handle extraction mode
        if extraction_mode == "markdown":
            data["markdown_only"] = True
        elif extraction_mode == "ai":
            if prompt is None:
                raise ValueError("prompt is required when extraction_mode is 'ai'")
            data["prompt"] = prompt
        else:
            raise ValueError(f"Invalid extraction_mode: {extraction_mode}. Must be 'ai' or 'markdown'")
        if depth is not None:
            data["depth"] = depth
        if max_pages is not None:
            data["max_pages"] = max_pages
        if same_domain_only is not None:
            data["same_domain_only"] = same_domain_only
    
        response = self.client.post(endpoint, headers=self.headers, json=data)
    
        if response.status_code != 200:
            error_msg = f"Error {response.status_code}: {response.text}"
            raise Exception(error_msg)
    
        return response.json()
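  • Illustrative direct use of the client method, mirroring what the tool handler does. The API key and URL are placeholders, and reading request_id from the response follows the Returns description above:
    client = ScapeGraphClient("YOUR_API_KEY")  # placeholder API key
    response = client.smartcrawler_initiate(
        url="https://docs.example.com",
        prompt="Extract API endpoint name, method, parameters, and description",
        extraction_mode="ai",
        max_pages=50,
        same_domain_only=True,
    )
    request_id = response["request_id"]  # poll smartcrawler_fetch_results with this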
  • Function signature defines input schema via type hints: required url (str), optional prompt (str), extraction_mode (str default 'ai'), depth/max_pages (int), same_domain_only (bool). Returns Dict[str, Any] typically containing request_id.
    def smartcrawler_initiate(
        url: str,
        ctx: Context,
        prompt: Optional[str] = None,
        extraction_mode: str = "ai",
        depth: Optional[int] = None,
        max_pages: Optional[int] = None,
        same_domain_only: Optional[bool] = None
    ) -> Dict[str, Any]:
  • Registers the smartcrawler_initiate function as an MCP tool with specific annotations indicating it's non-read-only, non-destructive, non-idempotent.
    @mcp.tool(annotations={"readOnlyHint": False, "destructiveHint": False, "idempotentHint": False})
