
crawl_website

Crawl a website from a specified URL, extracting content up to a defined depth and page limit. Returns structured data for crawled pages.

Instructions

Crawl a website starting from the given URL up to a specified depth and page limit.

Args:
    url: The starting URL to crawl.
    crawl_depth: The maximum depth to crawl relative to the starting URL (default: 1).
    max_pages: The maximum number of pages to scrape during the crawl (default: 5).

Returns: List containing TextContent with a JSON array of results for crawled pages.
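
For orientation, the following is a minimal sketch of invoking this tool from an MCP client with the official Python SDK. The launch command (python main.py) and the example arguments are assumptions; adjust them to the actual server entry point.

    import asyncio

    from mcp import ClientSession, StdioServerParameters
    from mcp.client.stdio import stdio_client

    async def main() -> None:
        # Assumed launch command for the server; adjust to your setup.
        params = StdioServerParameters(command="python", args=["main.py"])
        async with stdio_client(params) as (read, write):
            async with ClientSession(read, write) as session:
                await session.initialize()
                result = await session.call_tool(
                    "crawl_website",
                    {"url": "https://example.com", "crawl_depth": 1, "max_pages": 5},
                )
                # On success the tool returns a single TextContent whose text
                # is a JSON object of the form {"results": [...]}.
                print(result.content[0].text)

    asyncio.run(main())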

Input Schema

Name          Required  Description                                               Default
crawl_depth   No        The maximum depth to crawl relative to the starting URL   1
max_pages     No        The maximum number of pages to scrape during the crawl    5
url           Yes       The starting URL to crawl
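
FastMCP derives the JSON Schema for these arguments from the handler's type annotations and defaults. As a rough, illustrative sketch only (the generated schema's exact titles and metadata may differ), the shape is approximately:

    # Approximate shape of the generated input schema (illustrative only).
    expected_input_schema = {
        "type": "object",
        "properties": {
            "url": {"type": "string"},
            "crawl_depth": {"type": "integer", "default": 1},
            "max_pages": {"type": "integer", "default": 5},
        },
        "required": ["url"],
    }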

Implementation Reference

  • main.py:26-43 (handler)
    The FastMCP tool handler for 'crawl_website', registered via the @mcp.tool() decorator. It defines the input schema through type annotations and delegates execution to the imported crawl_website_async helper (a minimal registration sketch follows this list).
    @mcp.tool()
    async def crawl_website(
        url: str,
        crawl_depth: int = 1,
        max_pages: int = 5,
    ) -> list[types.TextContent | types.ImageContent | types.EmbeddedResource]:
        """
        Crawl a website starting from the given URL up to a specified depth and page limit.

        Args:
            url: The starting URL to crawl.
            crawl_depth: The maximum depth to crawl relative to the starting URL (default: 1).
            max_pages: The maximum number of pages to scrape during the crawl (default: 5).

        Returns:
            List containing TextContent with a JSON array of results for crawled pages.
        """
        return await crawl_website_async(url, crawl_depth, max_pages)
  • Core helper function implementing the website crawling logic with the Crawl4AI library, including a BFS deep-crawl strategy, error handling, and packaging of results into MCP TextContent (a local-testing sketch follows this list).
    async def crawl_website_async(url: str, crawl_depth: int, max_pages: int) -> List[Any]:
        """Crawl a website using crawl4ai.

        Args:
            url: The starting URL to crawl.
            crawl_depth: The maximum depth to crawl.
            max_pages: The maximum number of pages to crawl.

        Returns:
            A list containing TextContent objects with the results as JSON.
        """
        normalized_url = validate_and_normalize_url(url)
        if not normalized_url:
            return [
                types.TextContent(
                    type="text",
                    text=json.dumps(
                        {
                            "success": False,
                            "url": url,
                            "error": "Invalid URL format",
                        }
                    ),
                )
            ]

        try:
            # Use default configurations with minimal customization
            browser_config = BrowserConfig(
                browser_type="chromium",
                headless=True,
                ignore_https_errors=True,
                verbose=False,
                extra_args=[
                    "--no-sandbox",
                    "--disable-setuid-sandbox",
                    "--disable-dev-shm-usage",
                ],
            )

            # 1. Create the deep crawl strategy with depth and page limits
            crawl_strategy = BFSDeepCrawlStrategy(
                max_depth=crawl_depth, max_pages=max_pages
            )

            # 2. Create the run config, passing the strategy
            run_config = CrawlerRunConfig(
                cache_mode=CacheMode.BYPASS,
                verbose=False,
                page_timeout=30 * 1000,  # 30 seconds per page
                deep_crawl_strategy=crawl_strategy,  # Pass the strategy here
            )

            results_list = []
            async with AsyncWebCrawler(config=browser_config) as crawler:
                # 3. Use arun and wrap in asyncio.wait_for for overall timeout
                crawl_results: List[CrawlResult] = await asyncio.wait_for(
                    crawler.arun(
                        url=normalized_url,
                        config=run_config,
                    ),
                    timeout=CRAWL_TIMEOUT_SECONDS,
                )

                # Process results, checking 'success' attribute
                for result in crawl_results:
                    if result.success:  # Check .success instead of .status
                        results_list.append(
                            {
                                "url": result.url,
                                "success": True,
                                "markdown": result.markdown,
                            }
                        )
                    else:
                        results_list.append(
                            {
                                "url": result.url,
                                "success": False,
                                "error": result.error,  # Assume .error holds the message
                            }
                        )

            # Return a single TextContent with a JSON array of results
            return [
                types.TextContent(
                    type="text", text=json.dumps({"results": results_list})
                )
            ]
        except asyncio.TimeoutError:
            return [
                types.TextContent(
                    type="text",
                    text=json.dumps(
                        {
                            "success": False,
                            "url": normalized_url,
                            "error": f"Crawl operation timed out after {CRAWL_TIMEOUT_SECONDS} seconds.",
                        }
                    ),
                )
            ]
        except Exception as e:
            return [
                types.TextContent(
                    type="text",
                    text=json.dumps(
                        {"success": False, "url": normalized_url, "error": str(e)}
                    ),
                )
            ]
  • Utility function used by crawl_website_async to validate and normalize the input URL before crawling.
    def validate_and_normalize_url(url: str) -> str | None:
        """Validate and normalize a URL.

        Args:
            url: The URL string to validate.

        Returns:
            The normalized URL with https scheme if valid, otherwise None.
        """
        # Simple validation for domains/subdomains with http(s)
        # Allows for optional paths
        url_pattern = re.compile(
            r"^(?:https?://)?(?:(?:[A-Z0-9](?:[A-Z0-9-]{0,61}[A-Z0-9])?\.)+(?:[A-Z]{2,6}\.?|[A-Z0-9-]{2,}\.?)|"  # domain...
            r"localhost|"  # localhost...
            r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})"  # ...or ip
            r"(?::\d+)?"  # optional port
            r"(?:/?|[/?]\S+)$",
            re.IGNORECASE,
        )
        if not url_pattern.match(url):
            return None

        # Add https:// if missing
        if not url.startswith("http://") and not url.startswith("https://"):
            url = f"https://{url}"

        return url
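
For context, the handler above lives on a FastMCP server instance that the listings do not show. A minimal sketch of the typical registration pattern, with the server name and transport as assumptions, looks like this:

    import mcp.types as types
    from mcp.server.fastmcp import FastMCP

    mcp = FastMCP("crawl4ai-mcp")  # assumed server name

    @mcp.tool()
    async def ping() -> list[types.TextContent]:
        """Trivial example tool illustrating the registration pattern."""
        return [types.TextContent(type="text", text="pong")]

    if __name__ == "__main__":
        mcp.run()  # stdio transport by default

The following local-testing sketch (not part of the server) exercises validate_and_normalize_url and crawl_website_async directly and unpacks the JSON payload they build; the example URLs and expected values are inferred from the logic above.

    import asyncio
    import json

    # Expected results inferred from the regex and normalization logic above.
    assert validate_and_normalize_url("example.com") == "https://example.com"
    assert validate_and_normalize_url("not a url") is None

    async def demo() -> None:
        contents = await crawl_website_async("https://example.com", crawl_depth=1, max_pages=3)
        payload = json.loads(contents[0].text)
        if "results" in payload:
            for page in payload["results"]:
                status = "ok" if page["success"] else f"error: {page.get('error')}"
                print(page["url"], status)
        else:
            # Invalid URLs, timeouts, and unexpected errors return a single
            # error object instead of a results array.
            print(payload)

    asyncio.run(demo())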

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/ritvij14/crawl4ai-mcp'

If you have feedback or need assistance with the MCP directory API, please join our Discord server.