
crawl_website

Crawl a website from a specified URL, extracting content up to a defined depth and page limit. Returns structured data for crawled pages.

Instructions

Crawl a website starting from the given URL up to a specified depth and page limit.

Args:
- url: The starting URL to crawl.
- crawl_depth: The maximum depth to crawl relative to the starting URL (default: 1).
- max_pages: The maximum number of pages to scrape during the crawl (default: 5).

Returns: List containing TextContent with a JSON array of results for crawled pages.

Input Schema

Name         Required  Description                                               Default
crawl_depth  No        The maximum depth to crawl relative to the starting URL  1
max_pages    No        The maximum number of pages to scrape during the crawl   5
url          Yes       The starting URL to crawl                                 —
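For illustration, the arguments for a `crawl_website` tool call could be assembled as follows (a minimal sketch using the field names from the schema above; only `url` is required, and the other two values shown are the documented defaults):

```python
import json

# Hypothetical arguments for a crawl_website tool call.
arguments = {
    "url": "https://example.com",
    "crawl_depth": 1,  # documented default
    "max_pages": 5,    # documented default
}

# An MCP client would serialize these into a tools/call request.
payload = json.dumps({"name": "crawl_website", "arguments": arguments})
print(payload)
```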

Implementation Reference

  • main.py:26-43 (handler)
    The FastMCP tool handler for 'crawl_website', registered via @mcp.tool() decorator. Defines input schema via type annotations and delegates execution to the imported crawl_website_async helper.
    @mcp.tool()
    async def crawl_website(
        url: str,
        crawl_depth: int = 1,
        max_pages: int = 5,
    ) -> list[types.TextContent | types.ImageContent | types.EmbeddedResource]:
        """
        Crawl a website starting from the given URL up to a specified depth and page limit.
    
        Args:
            url: The starting URL to crawl.
            crawl_depth: The maximum depth to crawl relative to the starting URL (default: 1).
            max_pages: The maximum number of pages to scrape during the crawl (default: 5).
    
        Returns:
            List containing TextContent with a JSON array of results for crawled pages.
        """
        return await crawl_website_async(url, crawl_depth, max_pages)
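The required/optional split in the input schema falls directly out of the Python signature: `url` has no default, so it is required, while `crawl_depth` and `max_pages` carry defaults. A small standalone sketch using `inspect` shows that mapping (illustrative only, not FastMCP's actual schema-generation code):

```python
import inspect

# Same signature as the handler above, with a stub body.
async def crawl_website(url: str, crawl_depth: int = 1, max_pages: int = 5):
    ...

params = inspect.signature(crawl_website).parameters
schema = {
    name: {
        "required": p.default is inspect.Parameter.empty,
        "default": None if p.default is inspect.Parameter.empty else p.default,
    }
    for name, p in params.items()
}
print(schema)
```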
  • Core helper function implementing the website crawling logic using Crawl4AI library, including BFS deep crawling strategy, error handling, and result processing into MCP TextContent format.
    import asyncio
    import json
    from typing import Any, List

    import mcp.types as types
    from crawl4ai import (  # imports reconstructed from crawl4ai's public API
        AsyncWebCrawler,
        BrowserConfig,
        CacheMode,
        CrawlerRunConfig,
        CrawlResult,
    )
    from crawl4ai.deep_crawling import BFSDeepCrawlStrategy

    # CRAWL_TIMEOUT_SECONDS is a module-level constant defined elsewhere in main.py.

    async def crawl_website_async(url: str, crawl_depth: int, max_pages: int) -> List[Any]:
        """Crawl a website using crawl4ai.
    
        Args:
            url: The starting URL to crawl.
            crawl_depth: The maximum depth to crawl.
            max_pages: The maximum number of pages to crawl.
    
        Returns:
            A list containing TextContent objects with the results as JSON.
        """
    
        normalized_url = validate_and_normalize_url(url)
        if not normalized_url:
            return [
                types.TextContent(
                    type="text",
                    text=json.dumps(
                        {
                            "success": False,
                            "url": url,
                            "error": "Invalid URL format",
                        }
                    ),
                )
            ]
    
        try:
            # Use default configurations with minimal customization
            browser_config = BrowserConfig(
                browser_type="chromium",
                headless=True,
                ignore_https_errors=True,
                verbose=False,
                extra_args=[
                    "--no-sandbox",
                    "--disable-setuid-sandbox",
                    "--disable-dev-shm-usage",
                ],
            )
    
            # 1. Create the deep crawl strategy with depth and page limits
            crawl_strategy = BFSDeepCrawlStrategy(
                max_depth=crawl_depth, max_pages=max_pages
            )
    
            # 2. Create the run config, passing the strategy
            run_config = CrawlerRunConfig(
                cache_mode=CacheMode.BYPASS,
                verbose=False,
                page_timeout=30 * 1000,  # 30 seconds per page
                deep_crawl_strategy=crawl_strategy,  # Pass the strategy here
            )
    
            results_list = []
            async with AsyncWebCrawler(config=browser_config) as crawler:
                # 3. Use arun and wrap in asyncio.wait_for for overall timeout
                crawl_results: List[CrawlResult] = await asyncio.wait_for(
                    crawler.arun(
                        url=normalized_url,
                        config=run_config,
                    ),
                    timeout=CRAWL_TIMEOUT_SECONDS,
                )
    
                # Process results, checking 'success' attribute
                for result in crawl_results:
                    if result.success:  # Check .success instead of .status
                        results_list.append(
                            {
                                "url": result.url,
                                "success": True,
                                "markdown": result.markdown,
                            }
                        )
                    else:
                        results_list.append(
                            {
                                "url": result.url,
                                "success": False,
                                "error": result.error,  # Assume .error holds the message
                            }
                        )
    
                # Return a single TextContent with a JSON array of results
                return [
                    types.TextContent(
                        type="text", text=json.dumps({"results": results_list})
                    )
                ]
    
        except asyncio.TimeoutError:
            return [
                types.TextContent(
                    type="text",
                    text=json.dumps(
                        {
                            "success": False,
                            "url": normalized_url,
                            "error": f"Crawl operation timed out after {CRAWL_TIMEOUT_SECONDS} seconds.",
                        }
                    ),
                )
            ]
        except Exception as e:
            return [
                types.TextContent(
                    type="text",
                    text=json.dumps(
                        {"success": False, "url": normalized_url, "error": str(e)}
                    ),
                )
            ]
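Both success and failure entries come back as JSON text inside a single TextContent item. A consumer could unpack the documented envelope like this (a sketch assuming the `{"results": [...]}` shape shown above, with illustrative data rather than real crawl output):

```python
import json

# Sample payload mirroring the JSON envelope produced by crawl_website_async.
text = json.dumps({
    "results": [
        {"url": "https://example.com", "success": True, "markdown": "# Example"},
        {"url": "https://example.com/missing", "success": False, "error": "404"},
    ]
})

parsed = json.loads(text)
succeeded = [r["url"] for r in parsed["results"] if r["success"]]
failed = [r["url"] for r in parsed["results"] if not r["success"]]
print(succeeded, failed)
```

Note that per-page failures are reported inside the results array, while whole-crawl failures (invalid URL, timeout) come back as a single top-level object with `"success": False`.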
  • Utility function used by crawl_website_async to validate and normalize the input URL before crawling.
    def validate_and_normalize_url(url: str) -> str | None:
        """Validate and normalize a URL.
    
        Args:
            url: The URL string to validate.
    
        Returns:
            The normalized URL with https scheme if valid, otherwise None.
        """
        # Simple validation for domains/subdomains with http(s)
        # Allows for optional paths
        url_pattern = re.compile(
            r"^(?:https?://)?(?:(?:[A-Z0-9](?:[A-Z0-9-]{0,61}[A-Z0-9])?\.)+(?:[A-Z]{2,6}\.?|[A-Z0-9-]{2,}\.?)|"  # domain...
            r"localhost|"  # localhost...
            r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})"  # ...or ip
            r"(?::\d+)?"  # optional port
            r"(?:/?|[/?]\S+)$",
            re.IGNORECASE,
        )
    
        if not url_pattern.match(url):
            return None
    
        # Add https:// if missing
        if not url.startswith("http://") and not url.startswith("https://"):
            url = f"https://{url}"
    
        return url
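To make the validator's behavior concrete, the function can be reproduced standalone and exercised on a few inputs (the function body below is copied from the excerpt above; the sample URLs are illustrative):

```python
import re

# Copy of validate_and_normalize_url from the excerpt above.
def validate_and_normalize_url(url: str):
    url_pattern = re.compile(
        r"^(?:https?://)?(?:(?:[A-Z0-9](?:[A-Z0-9-]{0,61}[A-Z0-9])?\.)+(?:[A-Z]{2,6}\.?|[A-Z0-9-]{2,}\.?)|"
        r"localhost|"
        r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})"
        r"(?::\d+)?"
        r"(?:/?|[/?]\S+)$",
        re.IGNORECASE,
    )
    if not url_pattern.match(url):
        return None
    if not url.startswith("http://") and not url.startswith("https://"):
        url = f"https://{url}"
    return url

print(validate_and_normalize_url("example.com"))              # scheme added
print(validate_and_normalize_url("http://localhost:8080/x"))  # passed through
print(validate_and_normalize_url("not a url"))                # rejected -> None
```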
Behavior 2/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

No annotations are provided, so the description carries the full burden. It mentions crawling behavior but lacks critical details: it doesn't specify what 'crawl' entails (e.g., following links, scraping content), whether it respects robots.txt, potential rate limits, authentication needs, or error handling. For a tool with no annotations, this leaves significant behavioral gaps.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness 5/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is well-structured and appropriately sized. It begins with a clear purpose statement, followed by organized sections for Args and Returns. Each sentence serves a distinct purpose with zero wasted words, making it easy to parse.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness 3/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Given the tool's moderate complexity (3 parameters, no annotations, no output schema), the description covers the basic operation and parameters adequately. However, it lacks information about return format details (beyond 'List containing TextContent with a JSON array'), error conditions, and behavioral constraints that would be needed for robust use.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters 4/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

Schema description coverage is 0%, so the description must compensate. It provides clear semantic explanations for all three parameters: 'url' as the starting URL, 'crawl_depth' as maximum depth relative to starting URL, and 'max_pages' as maximum pages to scrape. Default values are also documented. This adds substantial value beyond the bare schema.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose 4/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states the tool's purpose: 'Crawl a website starting from the given URL up to a specified depth and page limit.' It specifies the verb ('crawl'), resource ('website'), and scope ('starting from the given URL'). However, it doesn't explicitly differentiate from the sibling tool 'scrape_webpage' beyond the crawling aspect.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines 2/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description provides no guidance on when to use this tool versus the sibling 'scrape_webpage' or other alternatives. It mentions the action but lacks context about appropriate use cases, prerequisites, or exclusions. This leaves the agent without clear direction on tool selection.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
