smartcrawler_initiate
Start multi-page web crawling to extract structured data with AI or convert content to markdown from a starting URL.
Instructions
Initiate an asynchronous multi-page web crawling operation with AI extraction or markdown conversion.
This tool starts an intelligent crawler that discovers and processes multiple pages from a starting URL. Choose between AI Extraction Mode (10 credits/page) for structured data or Markdown Mode (2 credits/page) for content conversion. The operation is asynchronous; use smartcrawler_fetch_results to retrieve results. Each call creates a new crawl request (non-idempotent, non-read-only).
SmartCrawler supports two modes:
AI Extraction Mode: Extracts structured data based on your prompt from every crawled page
Markdown Conversion Mode: Converts each page to clean markdown format
Args:
url (str): The starting URL to begin crawling from.
- Must include protocol (http:// or https://)
- The crawler will discover and process linked pages from this starting point
- Should be a page with links to other pages you want to crawl
- Examples:
* https://docs.example.com (documentation site root)
* https://blog.company.com (blog homepage)
* https://example.com/products (product category page)
* https://news.site.com/category/tech (news section)
- Best practices:
* Use homepage or main category pages as starting points
* Ensure the starting page has links to content you want to crawl
* Consider site structure when choosing the starting URL
prompt (Optional[str]): AI prompt for data extraction.
- REQUIRED when extraction_mode is 'ai'
- Ignored when extraction_mode is 'markdown'
- Describes what data to extract from each crawled page
- Applied consistently across all discovered pages
- Examples:
* "Extract API endpoint name, method, parameters, and description"
* "Get article title, author, publication date, and summary"
* "Find product name, price, description, and availability"
* "Extract job title, company, location, salary, and requirements"
- Tips for better results:
* Be specific about fields you want from each page
* Consider that different pages may have different content structures
* Use general terms that apply across multiple page types
extraction_mode (str): Extraction mode for processing crawled pages.
- Default: "ai"
- Options:
* "ai": AI-powered structured data extraction (10 credits per page)
- Uses the prompt to extract specific data from each page
- Returns structured JSON data
- More expensive but provides targeted information
- Best for: Data collection, research, structured analysis
* "markdown": Simple markdown conversion (2 credits per page)
- Converts each page to clean markdown format
- No AI processing, just content conversion
- More cost-effective for content archival
- Best for: Documentation backup, content migration, reading
- Cost comparison:
* AI mode: 50 pages = 500 credits
* Markdown mode: 50 pages = 100 credits
depth (Optional[int]): Maximum depth of link traversal from the starting URL.
- Default: unlimited (follows links until max_pages is reached or no links remain)
- Depth levels:
* 0: Only the starting URL (no link following)
* 1: Starting URL + pages directly linked from it
* 2: Starting URL + direct links + links from those pages
* 3+: Continues following links to specified depth
- Examples:
* 1: Crawl blog homepage + all blog posts
* 2: Crawl docs homepage + category pages + individual doc pages
* 3: Deep crawling for comprehensive site coverage
- Considerations:
* Higher depth can lead to exponential page growth
* Use with max_pages to control scope and cost
* Consider site structure when setting depth
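The depth levels above can be illustrated on a toy link graph. This is a minimal sketch, not the crawler's actual implementation: the breadth-first order, the `pages_within_depth` helper, and the `site` dictionary are assumptions made for illustration only.

```python
from collections import deque
from typing import Dict, List, Optional


def pages_within_depth(
    links: Dict[str, List[str]],
    start: str,
    depth: Optional[int] = None,
    max_pages: Optional[int] = None,
) -> List[str]:
    """Breadth-first walk of a toy link graph, honoring depth and max_pages.

    Hypothetical helper: the real crawler's traversal order is not documented.
    """
    visited = [start]          # pages in the order they were discovered
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        page, level = queue.popleft()
        if depth is not None and level >= depth:
            continue  # links on this page would exceed the depth limit
        for target in links.get(page, []):
            if max_pages is not None and len(visited) >= max_pages:
                return visited  # page budget exhausted
            if target not in seen:
                seen.add(target)
                visited.append(target)
                queue.append((target, level + 1))
    return visited


# Toy site: "/" links to two pages, which link further down.
site = {
    "/": ["/a", "/b"],
    "/a": ["/a1", "/a2"],
    "/b": ["/b1"],
    "/a1": ["/deep"],
}

print(pages_within_depth(site, "/", depth=0))
print(pages_within_depth(site, "/", depth=1))
print(pages_within_depth(site, "/", depth=2, max_pages=4))
```

With depth=0 only the starting page is returned; each extra level adds the pages linked from the previous level, and max_pages cuts the walk short regardless of depth.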
max_pages (Optional[int]): Maximum number of pages to crawl in total.
- Default: unlimited (crawls until no links remain or the depth limit is reached)
- Recommended ranges:
* 10-20: Testing and small sites
* 50-100: Medium sites and focused crawling
* 200-500: Large sites and comprehensive analysis
* 1000+: Enterprise-level crawling (high cost)
- Cost implications:
* AI mode: max_pages × 10 credits
* Markdown mode: max_pages × 2 credits
- Examples:
* 10: Quick site sampling (20-100 credits)
* 50: Standard documentation crawl (100-500 credits)
* 200: Comprehensive site analysis (400-2000 credits)
- Note: Crawler stops when this limit is reached, regardless of remaining links
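The cost figures above follow from a per-page rate multiplied by the page count. A minimal sketch of that arithmetic, assuming the documented rates of 10 and 2 credits per page; the `estimate_max_credits` helper is hypothetical and is not part of the tool:

```python
# Per-page rates taken from the extraction_mode description.
CREDITS_PER_PAGE = {"ai": 10, "markdown": 2}


def estimate_max_credits(max_pages: int, extraction_mode: str = "ai") -> int:
    """Upper bound on credits for a crawl capped at max_pages (hypothetical helper)."""
    if extraction_mode not in CREDITS_PER_PAGE:
        raise ValueError(f"unknown extraction_mode: {extraction_mode!r}")
    if max_pages < 0:
        raise ValueError("max_pages must not be negative")
    return max_pages * CREDITS_PER_PAGE[extraction_mode]


print(estimate_max_credits(50, "ai"))        # 500
print(estimate_max_credits(50, "markdown"))  # 100
```

This is an upper bound: the actual cost is lower whenever the crawler finds fewer pages than max_pages.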
same_domain_only (Optional[bool]): Whether to crawl only within the same domain.
- Default: true (recommended for most use cases)
- Options:
* true: Only crawl pages within the same domain as starting URL
- Prevents following external links
- Keeps crawling focused on the target site
- Reduces risk of crawling unrelated content
- Example: Starting at docs.example.com only crawls docs.example.com pages
* false: Allow crawling external domains
- Follows links to other domains
- Can lead to very broad crawling scope
- May crawl unrelated or unwanted content
- Use with caution and appropriate max_pages limit
- Recommendations:
* Use true for focused site crawling
* Use false only when you specifically need cross-domain data
* Always set max_pages when using false to prevent runaway crawling
Returns:
Dictionary containing:
- request_id: Unique identifier for this crawl operation (use with smartcrawler_fetch_results)
- status: Initial status of the crawl request ("initiated" or "processing")
- estimated_cost: Estimated credit cost based on parameters (actual cost may vary)
- crawl_parameters: Summary of the crawling configuration
- estimated_time: Rough estimate of processing time
- next_steps: Instructions for retrieving results
Raises:
- ValueError: If URL is malformed, prompt is missing for AI mode, or parameters are invalid
- HTTPError: If the starting URL cannot be accessed
- RateLimitError: If too many crawl requests are initiated too quickly
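The ValueError conditions can be checked on the client before a request is sent. The sketch below is an assumption about what such a pre-flight check might look like, not the tool's own validation; `validate_crawl_args` is a hypothetical name, and the bounds on depth and max_pages are guesses based on the parameter descriptions.

```python
from typing import Optional
from urllib.parse import urlparse


def validate_crawl_args(
    url: str,
    prompt: Optional[str] = None,
    extraction_mode: str = "ai",
    depth: Optional[int] = None,
    max_pages: Optional[int] = None,
) -> None:
    """Raise ValueError for arguments the documentation describes as invalid."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError("url must include http:// or https:// and a host")
    if extraction_mode not in ("ai", "markdown"):
        raise ValueError("extraction_mode must be 'ai' or 'markdown'")
    if extraction_mode == "ai" and not (prompt and prompt.strip()):
        raise ValueError("prompt is required when extraction_mode is 'ai'")
    # Assumed bounds: depth 0 means "starting URL only"; a crawl needs at least one page.
    if depth is not None and depth < 0:
        raise ValueError("depth must be 0 or greater")
    if max_pages is not None and max_pages < 1:
        raise ValueError("max_pages must be 1 or greater")


# Passes silently: markdown mode does not need a prompt.
validate_crawl_args("https://docs.example.com", extraction_mode="markdown", max_pages=20)
```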
Note:
- This operation is asynchronous and may take several minutes to complete
- Use smartcrawler_fetch_results with the returned request_id to get results
- Keep polling smartcrawler_fetch_results until status is "completed"
- Actual pages crawled may be fewer than max_pages if fewer links are found
- Processing time increases with max_pages, depth, and extraction_mode complexity
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| url | Yes | | |
| prompt | No | | |
| extraction_mode | No | | ai |
| depth | No | | |
| max_pages | No | | |
| same_domain_only | No | | |
Output Schema
| Name | Required | Description | Default |
|---|---|---|---|
| No arguments | | | |