webcrawl-mcp

webcrawl_crawl

Crawl web pages starting from a URL, discovering linked pages up to a specified depth. Control the crawl with limits on pages and URL patterns.

Instructions

Crawl multiple pages starting from a URL.

Uses BFS to discover and fetch pages up to max_depth links away. Respects rate limiting between requests.

Input Schema

Name              Required  Description                                              Default
url               Yes       Starting URL                                             —
max_pages         No        Maximum number of pages to fetch                         10
max_depth         No        Maximum link depth from start                            2
include_patterns  No        Glob patterns for URLs to include (e.g., ["*/docs/*"])   None
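
For illustration, a hedged example of an arguments payload an MCP client might send for this tool; the URL and patterns below are made up:

    # Hypothetical webcrawl_crawl arguments: crawl a docs site, fetching at
    # most 5 pages found within 2 links of the start page.
    arguments = {
        "url": "https://example.com/docs/",   # starting URL (made-up)
        "max_pages": 5,                       # stop after 5 successful fetches
        "max_depth": 2,                       # follow links at most 2 hops out
        "include_patterns": ["*/docs/*"],     # only visit URLs under /docs/
    }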

Output Schema

Name    Required  Description                                                Default
result  Yes       List of crawled pages ({url, title, content, source})     —

Implementation Reference

  • The core `crawl()` function that performs BFS-based multi-page crawling. It fetches pages, extracts titles/content, filters by domain and glob patterns, and respects max_pages/max_depth limits.
    import sys
    from collections import deque

    # fetch_url() and scrape() are module-local helpers; extract_title(),
    # extract_links(), filter_same_domain(), and matches_patterns() are shown below.
    async def crawl(
        url: str,
        max_pages: int = 10,
        max_depth: int = 2,
        include_patterns: list[str] | None = None,
    ) -> list[dict]:
        """Crawl multiple pages using BFS.
    
        Args:
            url: Starting URL
            max_pages: Maximum number of pages to fetch
            max_depth: Maximum link depth from start
            include_patterns: Glob patterns for URLs to include
    
        Returns:
            List of {url, title, content, source} dicts for each crawled page
        """
        print(f"[webcrawl] crawling from {url} (max_pages={max_pages}, max_depth={max_depth})", file=sys.stderr)
    
        results = []
        visited = set()
        # Queue contains (url, depth) tuples
        queue = deque([(url, 0)])
    
        while queue and len(results) < max_pages:
            current_url, depth = queue.popleft()
    
            # Skip if already visited
            if current_url in visited:
                continue
            visited.add(current_url)
    
            # Check pattern filter
            if not matches_patterns(current_url, include_patterns):
                continue
    
            try:
                # Fetch the page
                html = await fetch_url(current_url)
                title = extract_title(html)
    
                # Extract content using scraper (will use cache if available)
                scraped = await scrape(current_url)
    
                results.append({
                    "url": current_url,
                    "title": title,
                    "content": scraped.content,
                    "source": scraped.source,
                })
    
                print(
                    f"[webcrawl] crawled {len(results)}/{max_pages}: {current_url}",
                    file=sys.stderr,
                )
    
                # Add links to queue if not at max depth
                if depth < max_depth:
                    links = extract_links(html, current_url)
                    same_domain = filter_same_domain(links, url)
    
                    for link in same_domain:
                        if link not in visited:
                            queue.append((link, depth + 1))
    
            except Exception as e:
                print(f"[webcrawl] failed to crawl {current_url}: {e}", file=sys.stderr)
                continue
    
        print(f"[webcrawl] crawl complete: {len(results)} pages", file=sys.stderr)
        return results
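  • The reference above calls `fetch_url()` but does not show it. A minimal sketch of what a rate-limited fetcher could look like, assuming `httpx` and a fixed inter-request delay; the actual implementation may differ:
    import asyncio

    import httpx

    RATE_LIMIT_SECONDS = 1.0  # assumed fixed delay between requests

    async def fetch_url(url: str) -> str:
        """Fetch a URL and return its HTML (sketch, not the actual module code)."""
        await asyncio.sleep(RATE_LIMIT_SECONDS)  # crude per-request rate limiting
        async with httpx.AsyncClient(follow_redirects=True, timeout=30.0) as client:
            response = await client.get(url)
            response.raise_for_status()
            return response.text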
  • The `webcrawl_crawl` MCP tool handler that defines the tool interface (url, max_pages, max_depth, include_patterns) and delegates to the `crawl()` function.
    @mcp.tool
    async def webcrawl_crawl(
        url: str,
        max_pages: int = 10,
        max_depth: int = 2,
        include_patterns: list[str] | None = None,
    ) -> list[dict]:
        """Crawl multiple pages starting from a URL.
    
        Uses BFS to discover and fetch pages up to max_depth links away.
        Respects rate limiting between requests.
    
        Args:
            url: Starting URL
            max_pages: Maximum number of pages to fetch (default: 10)
            max_depth: Maximum link depth from start (default: 2)
            include_patterns: Glob patterns for URLs to include (e.g., ["*/docs/*"])
    
        Returns:
            List of {url, title, content, source} for each crawled page
        """
        return await crawl(url, max_pages, max_depth, include_patterns)
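  • Server wiring is not shown in the reference. A minimal sketch of how the `mcp` instance might be created and run, assuming the FastMCP API implied by the bare `@mcp.tool` decorator; the server name is an assumption:
    from fastmcp import FastMCP

    # Hypothetical setup; the real module presumably creates `mcp` before
    # the @mcp.tool-decorated handlers are defined.
    mcp = FastMCP("webcrawl-mcp")

    if __name__ == "__main__":
        mcp.run()  # FastMCP serves registered tools over stdio by default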
  • `extract_links()` helper that parses HTML and extracts absolute URLs from anchor tags.
    from urllib.parse import urljoin, urlparse

    from bs4 import BeautifulSoup

    def extract_links(html: str, base_url: str) -> list[str]:
        """Extract all links from HTML.
    
        Args:
            html: Raw HTML content
            base_url: Base URL for resolving relative links
    
        Returns:
            List of absolute URLs found in the page
        """
        soup = BeautifulSoup(html, "html.parser")
        links = []
    
        for anchor in soup.find_all("a", href=True):
            href = anchor["href"]
    
            # Skip non-http links
            if href.startswith(("javascript:", "mailto:", "tel:", "#")):
                continue
    
            # Resolve relative URLs
            absolute_url = urljoin(base_url, href)
    
            # Remove fragments
            parsed = urlparse(absolute_url)
            clean_url = parsed._replace(fragment="").geturl()
    
            links.append(clean_url)
    
        return links
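  • A short usage sketch for `extract_links()` on made-up HTML, showing relative-link resolution, fragment stripping, and the non-HTTP skip list:
    html = """
    <a href="intro.html">Intro</a>
    <a href="/pricing#plans">Pricing</a>
    <a href="mailto:team@example.com">Email us</a>
    """
    print(extract_links(html, "https://example.com/docs/"))
    # ['https://example.com/docs/intro.html', 'https://example.com/pricing']
    # The mailto: link is skipped and the #plans fragment is removed.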
  • `filter_same_domain()` helper that filters URLs to same domain as base URL.
    from urllib.parse import urlparse

    def filter_same_domain(urls: list[str], base_url: str) -> list[str]:
        """Filter URLs to same domain as base URL.
    
        Args:
            urls: List of URLs to filter
            base_url: Base URL to match domain against
    
        Returns:
            URLs that are on the same domain
        """
        base_domain = urlparse(base_url).netloc.lower()
    
        same_domain = []
        for url in urls:
            url_domain = urlparse(url).netloc.lower()
            if url_domain == base_domain:
                same_domain.append(url)
    
        return same_domain
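  • A usage sketch for `filter_same_domain()` on made-up URLs. The comparison is strict netloc equality, so subdomains such as docs.example.com count as a different domain:
    urls = [
        "https://example.com/about",
        "https://docs.example.com/guide",  # subdomain: filtered out
        "https://other.org/page",          # different domain: filtered out
    ]
    print(filter_same_domain(urls, "https://example.com/"))
    # ['https://example.com/about']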
  • `extract_title()` helper that extracts the page title from HTML.
    from bs4 import BeautifulSoup

    def extract_title(html: str) -> str:
        """Extract page title from HTML.
    
        Args:
            html: Raw HTML content
    
        Returns:
            Page title or empty string if not found
        """
        soup = BeautifulSoup(html, "html.parser")
        title_tag = soup.find("title")
        if title_tag and title_tag.string:
            return title_tag.string.strip()
        return ""
  • `matches_patterns()` helper that checks if a URL matches glob include patterns.
    import fnmatch

    def matches_patterns(url: str, patterns: list[str] | None) -> bool:
        """Check if URL matches any of the glob patterns.
    
        Args:
            url: URL to check
            patterns: List of glob patterns (e.g., ["*/docs/*", "*/api/*"])
    
        Returns:
            True if no patterns specified or URL matches at least one pattern
        """
        if not patterns:
            return True
    
        for pattern in patterns:
            if fnmatch.fnmatch(url, pattern):
                return True
        return False
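  • A usage sketch for `matches_patterns()`. Because `fnmatch`'s `*` also matches `/`, a pattern like `*/docs/*` matches any URL with a /docs/ segment anywhere in it:
    print(matches_patterns("https://example.com/docs/intro", ["*/docs/*"]))  # True
    print(matches_patterns("https://example.com/blog/post", ["*/docs/*"]))   # False
    print(matches_patterns("https://example.com/blog/post", None))           # True (no filter set)
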
Behavior 3/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

Without annotations, the description carries the full burden of disclosure. It discloses BFS traversal and rate limiting, but lacks detail on error handling, the page content that is returned, and what happens when max_pages is reached. Adequate but not thorough.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness 4/5

Is the description appropriately sized, front-loaded, and free of redundancy?

Three concise sentences, front-loaded with purpose. No fluff. It could be slightly more structured, with explicit parameter-usage guidance, but it is efficient overall.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness 3/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Covers basic operation and rate limiting, but is missing details on include_patterns behavior and output specifics (though an output schema exists). There are gaps around which pages are returned and how errors are handled.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters 3/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

Schema coverage is 100%, with clear parameter descriptions. The tool description mentions BFS and depth but adds no meaning beyond what the schema already provides. A baseline score of 3 is appropriate.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose 5/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states that the tool crawls 'multiple pages starting from a URL', which is specific and distinguishable from sibling tools (map, scrape, search). The mention of BFS and depth further clarifies the operation.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines 3/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description implies use for crawling linked pages, but it does not explicitly state when to use this tool over alternatives or when it is not appropriate. There is no guidance on when to pick it over webcrawl_scrape or webcrawl_search.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
