crawl_website

Crawl a website from a specified URL, extracting content up to a defined depth and page limit. Returns structured data for crawled pages.

Instructions

Crawl a website starting from the given URL up to a specified depth and page limit.

Args:
  • url: The starting URL to crawl.
  • crawl_depth: The maximum depth to crawl relative to the starting URL (default: 1).
  • max_pages: The maximum number of pages to scrape during the crawl (default: 5).

Returns: List containing TextContent with a JSON array of results for crawled pages.

Input Schema

Name         Required  Description                                                Default
crawl_depth  No        The maximum depth to crawl relative to the starting URL.   1
max_pages    No        The maximum number of pages to scrape during the crawl.    5
url          Yes       The starting URL to crawl.
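
For reference, a minimal client-side call might look like the sketch below. It uses the official MCP Python SDK and assumes the server is launched over stdio with "python main.py"; the launch command and file name are assumptions, not taken from this page.

    import asyncio

    from mcp import ClientSession, StdioServerParameters
    from mcp.client.stdio import stdio_client

    # Assumption: the server is started with `python main.py` over stdio.
    server_params = StdioServerParameters(command="python", args=["main.py"])

    async def main() -> None:
        async with stdio_client(server_params) as (read, write):
            async with ClientSession(read, write) as session:
                await session.initialize()
                # Invoke the crawl_website tool with the inputs described above.
                result = await session.call_tool(
                    "crawl_website",
                    {"url": "https://example.com", "crawl_depth": 1, "max_pages": 5},
                )
                print(result.content[0].text)  # JSON string with per-page results

    asyncio.run(main())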

Implementation Reference

  • main.py:26-43 (handler)
    The FastMCP tool handler for 'crawl_website', registered via the @mcp.tool() decorator. It defines the input schema via type annotations and delegates execution to the imported crawl_website_async helper.
    @mcp.tool()
    async def crawl_website(
        url: str,
        crawl_depth: int = 1,
        max_pages: int = 5,
    ) -> list[types.TextContent | types.ImageContent | types.EmbeddedResource]:
        """
        Crawl a website starting from the given URL up to a specified depth and page limit.
    
        Args:
            url: The starting URL to crawl.
            crawl_depth: The maximum depth to crawl relative to the starting URL (default: 1).
            max_pages: The maximum number of pages to scrape during the crawl (default: 5).
    
        Returns:
            List containing TextContent with a JSON array of results for crawled pages.
        """
        return await crawl_website_async(url, crawl_depth, max_pages)
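    A rough sketch (not part of the repository) of how a caller might unpack the handler's output; the payload shape follows the result-processing code in the next excerpt, where a successful crawl yields a "results" array and a failed crawl yields a single error object.
    import json

    def parse_crawl_output(contents) -> list[dict]:
        """Unpack the single TextContent item returned by crawl_website."""
        payload = json.loads(contents[0].text)
        # Success: {"results": [{"url": ..., "success": true, "markdown": ...}, ...]}
        # Failure: {"success": false, "url": ..., "error": ...}
        return payload.get("results", [payload])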
  • Core helper function implementing the website crawling logic with the Crawl4AI library, including the BFS deep-crawl strategy, error handling, and processing of results into MCP TextContent format.
    async def crawl_website_async(url: str, crawl_depth: int, max_pages: int) -> List[Any]:
        """Crawl a website using crawl4ai.
    
        Args:
            url: The starting URL to crawl.
            crawl_depth: The maximum depth to crawl.
            max_pages: The maximum number of pages to crawl.
    
        Returns:
            A list containing TextContent objects with the results as JSON.
        """
    
        normalized_url = validate_and_normalize_url(url)
        if not normalized_url:
            return [
                types.TextContent(
                    type="text",
                    text=json.dumps(
                        {
                            "success": False,
                            "url": url,
                            "error": "Invalid URL format",
                        }
                    ),
                )
            ]
    
        try:
            # Use default configurations with minimal customization
            browser_config = BrowserConfig(
                browser_type="chromium",
                headless=True,
                ignore_https_errors=True,
                verbose=False,
                extra_args=[
                    "--no-sandbox",
                    "--disable-setuid-sandbox",
                    "--disable-dev-shm-usage",
                ],
            )
    
            # 1. Create the deep crawl strategy with depth and page limits
            crawl_strategy = BFSDeepCrawlStrategy(
                max_depth=crawl_depth, max_pages=max_pages
            )
    
            # 2. Create the run config, passing the strategy
            run_config = CrawlerRunConfig(
                cache_mode=CacheMode.BYPASS,
                verbose=False,
                page_timeout=30 * 1000,  # 30 seconds per page
                deep_crawl_strategy=crawl_strategy,  # Pass the strategy here
            )
    
            results_list = []
            async with AsyncWebCrawler(config=browser_config) as crawler:
                # 3. Use arun and wrap in asyncio.wait_for for overall timeout
                crawl_results: List[CrawlResult] = await asyncio.wait_for(
                    crawler.arun(
                        url=normalized_url,
                        config=run_config,
                    ),
                    timeout=CRAWL_TIMEOUT_SECONDS,
                )
    
                # Process results, checking 'success' attribute
                for result in crawl_results:
                    if result.success:  # Check .success instead of .status
                        results_list.append(
                            {
                                "url": result.url,
                                "success": True,
                                "markdown": result.markdown,
                            }
                        )
                    else:
                        results_list.append(
                            {
                                "url": result.url,
                                "success": False,
                                "error": result.error,  # Assume .error holds the message
                            }
                        )
    
                # Return a single TextContent with a JSON array of results
                return [
                    types.TextContent(
                        type="text", text=json.dumps({"results": results_list})
                    )
                ]
    
        except asyncio.TimeoutError:
            return [
                types.TextContent(
                    type="text",
                    text=json.dumps(
                        {
                            "success": False,
                            "url": normalized_url,
                            "error": f"Crawl operation timed out after {CRAWL_TIMEOUT_SECONDS} seconds.",
                        }
                    ),
                )
            ]
        except Exception as e:
            return [
                types.TextContent(
                    type="text",
                    text=json.dumps(
                        {"success": False, "url": normalized_url, "error": str(e)}
                    ),
                )
            ]
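    The excerpt above relies on module-level imports and a timeout constant that are not shown on this page. A plausible reconstruction, assuming a recent Crawl4AI release and the official MCP Python SDK (exact import paths and the timeout value may differ in the repository):
    import asyncio
    import json
    import re
    from typing import Any, List

    import mcp.types as types
    from crawl4ai import AsyncWebCrawler, BrowserConfig, CacheMode, CrawlerRunConfig, CrawlResult
    from crawl4ai.deep_crawling import BFSDeepCrawlStrategy

    # Assumed overall crawl timeout; the actual value is defined in the repository.
    CRAWL_TIMEOUT_SECONDS = 300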
  • Utility function used by crawl_website_async to validate and normalize the input URL before crawling.
    def validate_and_normalize_url(url: str) -> str | None:
        """Validate and normalize a URL.
    
        Args:
            url: The URL string to validate.
    
        Returns:
            The normalized URL with https scheme if valid, otherwise None.
        """
        # Simple validation for domains/subdomains with http(s)
        # Allows for optional paths
        url_pattern = re.compile(
            r"^(?:https?://)?(?:(?:[A-Z0-9](?:[A-Z0-9-]{0,61}[A-Z0-9])?\.)+(?:[A-Z]{2,6}\.?|[A-Z0-9-]{2,}\.?)|"  # domain...
            r"localhost|"  # localhost...
            r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})"  # ...or ip
            r"(?::\d+)?"  # optional port
            r"(?:/?|[/?]\S+)$",
            re.IGNORECASE,
        )
    
        if not url_pattern.match(url):
            return None
    
        # Add https:// if missing
        if not url.startswith("http://") and not url.startswith("https://"):
            url = f"https://{url}"
    
        return url
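    For illustration, the normalization behaves roughly as follows (based on the regex and scheme handling above):
    # Scheme added when missing; an existing scheme, path, and port are preserved.
    assert validate_and_normalize_url("example.com") == "https://example.com"
    assert validate_and_normalize_url("http://example.com/docs") == "http://example.com/docs"
    # Strings that do not match the pattern are rejected.
    assert validate_and_normalize_url("not a url") is None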
MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/ritvij14/crawl4ai-mcp'

If you have feedback or need assistance with the MCP directory API, please join our Discord server.