crawl_website
Crawl a website from a specified URL, extracting content up to a defined depth and page limit. Returns structured data for crawled pages.
Instructions
Crawl a website starting from the given URL up to a specified depth and page limit.
Args:
- url: The starting URL to crawl.
- crawl_depth: The maximum depth to crawl relative to the starting URL (default: 1).
- max_pages: The maximum number of pages to scrape during the crawl (default: 5).
Returns: List containing TextContent with a JSON array of results for crawled pages.
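For example, an MCP client could invoke the tool roughly as follows. This is a minimal sketch using the Python MCP SDK; the server launch command (`python main.py`) and the target URL are assumptions, not taken from this project.

```python
# Sketch of a client-side call; the launch command and URL are placeholders.
import asyncio
import json

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client


async def main() -> None:
    # Assumed way to start the server over stdio; adjust to the real entry point.
    server = StdioServerParameters(command="python", args=["main.py"])
    async with stdio_client(server) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            result = await session.call_tool(
                "crawl_website",
                arguments={"url": "https://example.com", "crawl_depth": 1, "max_pages": 5},
            )
            # The tool returns a single TextContent whose text is a JSON payload.
            payload = json.loads(result.content[0].text)
            print(f"crawled {len(payload.get('results', []))} pages")


asyncio.run(main())
```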
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| crawl_depth | No | Maximum depth to crawl relative to the starting URL | 1 |
| max_pages | No | Maximum number of pages to scrape during the crawl | 5 |
| url | Yes | The starting URL to crawl | |
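FastMCP derives this schema from the handler's type annotations and defaults. Roughly, it corresponds to the following JSON Schema, shown here as a Python dict; this is a sketch, not the server's exact generated output, which may include titles or other metadata.

```python
# Approximate input schema inferred from the handler signature (sketch only).
input_schema = {
    "type": "object",
    "properties": {
        "url": {"type": "string"},
        "crawl_depth": {"type": "integer", "default": 1},
        "max_pages": {"type": "integer", "default": 5},
    },
    "required": ["url"],
}
```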
Implementation Reference
- main.py:26-43 (handler): The FastMCP tool handler for 'crawl_website', registered via the @mcp.tool() decorator. It defines the input schema via type annotations and delegates execution to the imported crawl_website_async helper.

```python
@mcp.tool()
async def crawl_website(
    url: str,
    crawl_depth: int = 1,
    max_pages: int = 5,
) -> list[types.TextContent | types.ImageContent | types.EmbeddedResource]:
    """
    Crawl a website starting from the given URL up to a specified depth and page limit.

    Args:
        url: The starting URL to crawl.
        crawl_depth: The maximum depth to crawl relative to the starting URL (default: 1).
        max_pages: The maximum number of pages to scrape during the crawl (default: 5).

    Returns:
        List containing TextContent with a JSON array of results for crawled pages.
    """
    return await crawl_website_async(url, crawl_depth, max_pages)
```
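The handler presupposes a FastMCP server instance and imports that are not shown above. A minimal sketch of that wiring follows; the server name, import paths, and transport are assumptions rather than lines taken from the repository.

```python
# Hypothetical wiring for main.py; the server name and import paths are assumed.
from mcp import types
from mcp.server.fastmcp import FastMCP

from tools.crawl import crawl_website_async

mcp = FastMCP("web-crawler")  # assumed server name


# ... the @mcp.tool() handler shown above goes here ...


if __name__ == "__main__":
    mcp.run()  # defaults to the stdio transport
```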
- tools/crawl.py:13-126 (helper): Core helper function implementing the website crawling logic with the Crawl4AI library, including a BFS deep-crawl strategy, error handling, and processing of results into MCP TextContent format.

```python
async def crawl_website_async(url: str, crawl_depth: int, max_pages: int) -> List[Any]:
    """Crawl a website using crawl4ai.

    Args:
        url: The starting URL to crawl.
        crawl_depth: The maximum depth to crawl.
        max_pages: The maximum number of pages to crawl.

    Returns:
        A list containing TextContent objects with the results as JSON.
    """
    normalized_url = validate_and_normalize_url(url)
    if not normalized_url:
        return [
            types.TextContent(
                type="text",
                text=json.dumps(
                    {
                        "success": False,
                        "url": url,
                        "error": "Invalid URL format",
                    }
                ),
            )
        ]

    try:
        # Use default configurations with minimal customization
        browser_config = BrowserConfig(
            browser_type="chromium",
            headless=True,
            ignore_https_errors=True,
            verbose=False,
            extra_args=[
                "--no-sandbox",
                "--disable-setuid-sandbox",
                "--disable-dev-shm-usage",
            ],
        )

        # 1. Create the deep crawl strategy with depth and page limits
        crawl_strategy = BFSDeepCrawlStrategy(
            max_depth=crawl_depth, max_pages=max_pages
        )

        # 2. Create the run config, passing the strategy
        run_config = CrawlerRunConfig(
            cache_mode=CacheMode.BYPASS,
            verbose=False,
            page_timeout=30 * 1000,  # 30 seconds per page
            deep_crawl_strategy=crawl_strategy,  # Pass the strategy here
        )

        results_list = []
        async with AsyncWebCrawler(config=browser_config) as crawler:
            # 3. Use arun and wrap in asyncio.wait_for for overall timeout
            crawl_results: List[CrawlResult] = await asyncio.wait_for(
                crawler.arun(
                    url=normalized_url,
                    config=run_config,
                ),
                timeout=CRAWL_TIMEOUT_SECONDS,
            )

            # Process results, checking 'success' attribute
            for result in crawl_results:
                if result.success:  # Check .success instead of .status
                    results_list.append(
                        {
                            "url": result.url,
                            "success": True,
                            "markdown": result.markdown,
                        }
                    )
                else:
                    results_list.append(
                        {
                            "url": result.url,
                            "success": False,
                            "error": result.error,  # Assume .error holds the message
                        }
                    )

        # Return a single TextContent with a JSON array of results
        return [
            types.TextContent(
                type="text", text=json.dumps({"results": results_list})
            )
        ]
    except asyncio.TimeoutError:
        return [
            types.TextContent(
                type="text",
                text=json.dumps(
                    {
                        "success": False,
                        "url": normalized_url,
                        "error": f"Crawl operation timed out after {CRAWL_TIMEOUT_SECONDS} seconds.",
                    }
                ),
            )
        ]
    except Exception as e:
        return [
            types.TextContent(
                type="text",
                text=json.dumps(
                    {"success": False, "url": normalized_url, "error": str(e)}
                ),
            )
        ]
```
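For reference, a hedged sketch of calling this helper directly and unpacking its JSON payload; it assumes the script runs from the project root so that tools.crawl is importable, and the URL and limits are illustrative placeholders.

```python
# Illustrative only: the URL, depth, and page limit are placeholders.
import asyncio
import json

from tools.crawl import crawl_website_async


async def main() -> None:
    contents = await crawl_website_async("https://example.com", 2, 10)
    payload = json.loads(contents[0].text)  # single TextContent carrying JSON
    # On validation or timeout errors the payload is an error object without "results".
    for page in payload.get("results", []):
        status = "ok" if page["success"] else f"error: {page.get('error')}"
        print(page["url"], status)


asyncio.run(main())
```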
- tools/utils.py:4-32 (helper): Utility function used by crawl_website_async to validate and normalize the input URL before crawling.

```python
def validate_and_normalize_url(url: str) -> str | None:
    """Validate and normalize a URL.

    Args:
        url: The URL string to validate.

    Returns:
        The normalized URL with https scheme if valid, otherwise None.
    """
    # Simple validation for domains/subdomains with http(s)
    # Allows for optional paths
    url_pattern = re.compile(
        r"^(?:https?://)?(?:(?:[A-Z0-9](?:[A-Z0-9-]{0,61}[A-Z0-9])?\.)+(?:[A-Z]{2,6}\.?|[A-Z0-9-]{2,}\.?)|"  # domain...
        r"localhost|"  # localhost...
        r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})"  # ...or ip
        r"(?::\d+)?"  # optional port
        r"(?:/?|[/?]\S+)$",
        re.IGNORECASE,
    )
    if not url_pattern.match(url):
        return None

    # Add https:// if missing
    if not url.startswith("http://") and not url.startswith("https://"):
        url = f"https://{url}"

    return url
```
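A few illustrative inputs and outputs for the validator, as a sketch; it assumes tools.utils is importable from the project root, and the example URLs are placeholders.

```python
# Illustrative checks of the URL validator's behavior.
from tools.utils import validate_and_normalize_url

assert validate_and_normalize_url("example.com") == "https://example.com"  # scheme added
assert validate_and_normalize_url("http://localhost:8000/docs") == "http://localhost:8000/docs"
assert validate_and_normalize_url("not a url") is None  # spaces fail the pattern
```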