webcrawl_crawl
Crawl web pages starting from a URL, discovering linked pages up to a specified depth. Control the crawl with limits on pages and URL patterns.
Instructions
Crawl multiple pages starting from a URL.
Uses breadth-first search (BFS) to discover and fetch pages up to `max_depth` links away from the start, following only links on the same domain as the starting URL. Rate limiting is respected between requests.
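For illustration, a typical argument set for this tool might look like the sketch below; the values are hypothetical and only the parameter names come from the input schema.

```python
# Hypothetical arguments for webcrawl_crawl; values are illustrative only.
arguments = {
    "url": "https://example.com/docs/",   # starting point of the crawl
    "max_pages": 20,                      # stop after 20 pages
    "max_depth": 2,                       # follow links at most 2 hops from the start
    "include_patterns": ["*/docs/*"],     # only crawl documentation URLs
}
```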
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| url | Yes | Starting URL | |
| max_pages | No | Maximum number of pages to fetch | 10 |
| max_depth | No | Maximum link depth from start | 2 |
| include_patterns | No | Glob patterns for URLs to include (e.g., `["*/docs/*"]`) | |
Output Schema
| Name | Required | Description | Default |
|---|---|---|---|
| result | Yes | List of crawled pages; each entry is a dict with `url`, `title`, `content`, and `source` | |
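As a rough sketch of the result shape (values are invented, not real output), each entry in the returned list is a dict like:

```python
# Illustrative shape of one crawled-page entry; all values are made up.
{
    "url": "https://example.com/docs/intro",
    "title": "Introduction",
    "content": "Extracted page text...",
    "source": "...",  # provenance reported by the scraper; value depends on the scrape() implementation
}
```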
Implementation Reference
- src/webcrawl_mcp/crawler.py:133-203 (handler): The core `crawl()` function that performs BFS-based multi-page crawling. It fetches pages, extracts titles and content, filters by domain and glob patterns, and respects the max_pages/max_depth limits.
```python
async def crawl(
    url: str,
    max_pages: int = 10,
    max_depth: int = 2,
    include_patterns: list[str] | None = None,
) -> list[dict]:
    """Crawl multiple pages using BFS.

    Args:
        url: Starting URL
        max_pages: Maximum number of pages to fetch
        max_depth: Maximum link depth from start
        include_patterns: Glob patterns for URLs to include

    Returns:
        List of {url, title, content} dicts for each crawled page
    """
    print(f"[webcrawl] crawling from {url} (max_pages={max_pages}, max_depth={max_depth})", file=sys.stderr)

    results = []
    visited = set()
    # Queue contains (url, depth) tuples
    queue = deque([(url, 0)])

    while queue and len(results) < max_pages:
        current_url, depth = queue.popleft()

        # Skip if already visited
        if current_url in visited:
            continue
        visited.add(current_url)

        # Check pattern filter
        if not matches_patterns(current_url, include_patterns):
            continue

        try:
            # Fetch the page
            html = await fetch_url(current_url)
            title = extract_title(html)

            # Extract content using scraper (will use cache if available)
            scraped = await scrape(current_url)

            results.append({
                "url": current_url,
                "title": title,
                "content": scraped.content,
                "source": scraped.source,
            })
            print(
                f"[webcrawl] crawled {len(results)}/{max_pages}: {current_url}",
                file=sys.stderr,
            )

            # Add links to queue if not at max depth
            if depth < max_depth:
                links = extract_links(html, current_url)
                same_domain = filter_same_domain(links, url)
                for link in same_domain:
                    if link not in visited:
                        queue.append((link, depth + 1))
        except Exception as e:
            print(f"[webcrawl] failed to crawl {current_url}: {e}", file=sys.stderr)
            continue

    print(f"[webcrawl] crawl complete: {len(results)} pages", file=sys.stderr)
    return results
```
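A minimal usage sketch for calling `crawl()` directly; the import path is assumed from the file location above, and the URL and limits are illustrative.

```python
import asyncio

from webcrawl_mcp.crawler import crawl  # assumed import path based on src/webcrawl_mcp/crawler.py


async def main() -> None:
    # Crawl a hypothetical docs site, staying one hop from the start page.
    pages = await crawl(
        "https://example.com/docs/",
        max_pages=5,
        max_depth=1,
        include_patterns=["*/docs/*"],
    )
    for page in pages:
        print(page["url"], "-", page["title"])


asyncio.run(main())
```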
- src/webcrawl_mcp/server.py:68-89 (handler): The `webcrawl_crawl` MCP tool handler that defines the tool interface (url, max_pages, max_depth, include_patterns) and delegates to the `crawl()` function.

```python
@mcp.tool
async def webcrawl_crawl(
    url: str,
    max_pages: int = 10,
    max_depth: int = 2,
    include_patterns: list[str] | None = None,
) -> list[dict]:
    """Crawl multiple pages starting from a URL.

    Uses BFS to discover and fetch pages up to max_depth links away.
    Respects rate limiting between requests.

    Args:
        url: Starting URL
        max_pages: Maximum number of pages to fetch (default: 10)
        max_depth: Maximum link depth from start (default: 2)
        include_patterns: Glob patterns for URLs to include (e.g., ["*/docs/*"])

    Returns:
        List of {url, title, content} for each crawled page
    """
    return await crawl(url, max_pages, max_depth, include_patterns)
```
- src/webcrawl_mcp/server.py:69-73 (schema): Input type signature for the webcrawl_crawl tool: url (str), max_pages (int = 10), max_depth (int = 2), include_patterns (list[str] | None). Return type is list[dict].

```python
async def webcrawl_crawl(
    url: str,
    max_pages: int = 10,
    max_depth: int = 2,
    include_patterns: list[str] | None = None,
```
- src/webcrawl_mcp/server.py:68-69 (registration): The `@mcp.tool` decorator registers `webcrawl_crawl` as an MCP tool on the FastMCP server instance.

```python
@mcp.tool
async def webcrawl_crawl(
```
- src/webcrawl_mcp/crawler.py:13-42 (helper): `extract_links()` parses HTML and extracts absolute URLs from anchor tags.

```python
def extract_links(html: str, base_url: str) -> list[str]:
    """Extract all links from HTML.

    Args:
        html: Raw HTML content
        base_url: Base URL for resolving relative links

    Returns:
        List of absolute URLs found in the page
    """
    soup = BeautifulSoup(html, "html.parser")
    links = []

    for anchor in soup.find_all("a", href=True):
        href = anchor["href"]

        # Skip non-http links
        if href.startswith(("javascript:", "mailto:", "tel:", "#")):
            continue

        # Resolve relative URLs
        absolute_url = urljoin(base_url, href)

        # Remove fragments
        parsed = urlparse(absolute_url)
        clean_url = parsed._replace(fragment="").geturl()

        links.append(clean_url)

    return links
```
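A small, hypothetical illustration of the link-extraction behavior (assuming `webcrawl_mcp.crawler` is importable):

```python
from webcrawl_mcp.crawler import extract_links  # assumed import path

html = '<a href="/docs/intro">Intro</a> <a href="mailto:a@b.c">Mail</a> <a href="#top">Top</a>'
# mailto: and pure-fragment links are skipped; relative hrefs are resolved
# against the base URL and fragments are stripped.
print(extract_links(html, "https://example.com/start"))
# ['https://example.com/docs/intro']
```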
- src/webcrawl_mcp/crawler.py:45-63 (helper): `filter_same_domain()` filters URLs to the same domain as the base URL.

```python
def filter_same_domain(urls: list[str], base_url: str) -> list[str]:
    """Filter URLs to same domain as base URL.

    Args:
        urls: List of URLs to filter
        base_url: Base URL to match domain against

    Returns:
        URLs that are on the same domain
    """
    base_domain = urlparse(base_url).netloc.lower()
    same_domain = []

    for url in urls:
        url_domain = urlparse(url).netloc.lower()
        if url_domain == base_domain:
            same_domain.append(url)

    return same_domain
```
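A hypothetical illustration of the domain filter (import path assumed):

```python
from webcrawl_mcp.crawler import filter_same_domain  # assumed import path

urls = [
    "https://example.com/docs/intro",
    "https://other.org/page",
    "https://EXAMPLE.com/api",  # hostname comparison is case-insensitive
]
print(filter_same_domain(urls, "https://example.com/"))
# ['https://example.com/docs/intro', 'https://EXAMPLE.com/api']
```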
- src/webcrawl_mcp/crawler.py:98-111 (helper): `extract_title()` extracts the page title from HTML.

```python
def extract_title(html: str) -> str:
    """Extract page title from HTML.

    Args:
        html: Raw HTML content

    Returns:
        Page title or empty string if not found
    """
    soup = BeautifulSoup(html, "html.parser")
    title_tag = soup.find("title")
    if title_tag and title_tag.string:
        return title_tag.string.strip()
    return ""
```
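A quick illustrative example (import path assumed):

```python
from webcrawl_mcp.crawler import extract_title  # assumed import path

print(extract_title("<html><head><title>  Docs Home </title></head></html>"))  # 'Docs Home'
print(extract_title("<html><body>no title element</body></html>"))             # ''
```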
- src/webcrawl_mcp/crawler.py:114-130 (helper): `matches_patterns()` checks whether a URL matches the glob include patterns.

```python
def matches_patterns(url: str, patterns: list[str] | None) -> bool:
    """Check if URL matches any of the glob patterns.

    Args:
        url: URL to check
        patterns: List of glob patterns (e.g., ["*/docs/*", "*/api/*"])

    Returns:
        True if no patterns specified or URL matches at least one pattern
    """
    if not patterns:
        return True
    for pattern in patterns:
        if fnmatch.fnmatch(url, pattern):
            return True
    return False
```
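An illustrative sketch of matching with `fnmatch`-style globs (import path assumed):

```python
from webcrawl_mcp.crawler import matches_patterns  # assumed import path

print(matches_patterns("https://example.com/docs/intro", ["*/docs/*"]))  # True
print(matches_patterns("https://example.com/blog/post", ["*/docs/*"]))   # False
print(matches_patterns("https://example.com/anything", None))            # True: no filter applied
```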