webcrawl_crawl
Crawl web pages starting from a URL, discovering linked pages up to a specified depth. Control the crawl with limits on pages and URL patterns.
Instructions
Crawl multiple pages starting from a URL.
Uses breadth-first search (BFS) to discover and fetch pages up to `max_depth` links away from the start, following only links on the same domain as the starting URL. Rate limiting is respected between requests.
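For illustration, a typical argument set for this tool might look like the sketch below; the values are hypothetical and only the parameter names come from the input schema.

```python
# Hypothetical arguments for webcrawl_crawl; values are illustrative only.
arguments = {
    "url": "https://example.com/docs/",   # starting point of the crawl
    "max_pages": 20,                      # stop after 20 pages
    "max_depth": 2,                       # follow links at most 2 hops from the start
    "include_patterns": ["*/docs/*"],     # only crawl documentation URLs
}
```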
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| url | Yes | Starting URL | |
| max_pages | No | Maximum number of pages to fetch | 10 |
| max_depth | No | Maximum link depth from start | 2 |
| include_patterns | No | Glob patterns for URLs to include (e.g., `["*/docs/*"]`) | |
Output Schema
| Name | Required | Description | Default |
|---|---|---|---|
| result | Yes | List of crawled pages; each entry is a dict with `url`, `title`, `content`, and `source` | |
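As a rough sketch of the result shape (values are invented, not real output), each entry in the returned list is a dict like:

```python
# Illustrative shape of one crawled-page entry; all values are made up.
{
    "url": "https://example.com/docs/intro",
    "title": "Introduction",
    "content": "Extracted page text...",
    "source": "...",  # provenance reported by the scraper; value depends on the scrape() implementation
}
```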
Implementation Reference
- src/webcrawl_mcp/crawler.py:133-203 (handler): The core `crawl()` function that performs BFS-based multi-page crawling. It fetches pages, extracts titles and content, filters by domain and glob patterns, and respects the max_pages/max_depth limits.
```python
async def crawl(
    url: str,
    max_pages: int = 10,
    max_depth: int = 2,
    include_patterns: list[str] | None = None,
) -> list[dict]:
    """Crawl multiple pages using BFS.

    Args:
        url: Starting URL
        max_pages: Maximum number of pages to fetch
        max_depth: Maximum link depth from start
        include_patterns: Glob patterns for URLs to include

    Returns:
        List of {url, title, content} dicts for each crawled page
    """
    print(f"[webcrawl] crawling from {url} (max_pages={max_pages}, max_depth={max_depth})", file=sys.stderr)

    results = []
    visited = set()
    # Queue contains (url, depth) tuples
    queue = deque([(url, 0)])

    while queue and len(results) < max_pages:
        current_url, depth = queue.popleft()

        # Skip if already visited
        if current_url in visited:
            continue
        visited.add(current_url)

        # Check pattern filter
        if not matches_patterns(current_url, include_patterns):
            continue

        try:
            # Fetch the page
            html = await fetch_url(current_url)
            title = extract_title(html)

            # Extract content using scraper (will use cache if available)
            scraped = await scrape(current_url)

            results.append({
                "url": current_url,
                "title": title,
                "content": scraped.content,
                "source": scraped.source,
            })
            print(
                f"[webcrawl] crawled {len(results)}/{max_pages}: {current_url}",
                file=sys.stderr,
            )

            # Add links to queue if not at max depth
            if depth < max_depth:
                links = extract_links(html, current_url)
                same_domain = filter_same_domain(links, url)
                for link in same_domain:
                    if link not in visited:
                        queue.append((link, depth + 1))
        except Exception as e:
            print(f"[webcrawl] failed to crawl {current_url}: {e}", file=sys.stderr)
            continue

    print(f"[webcrawl] crawl complete: {len(results)} pages", file=sys.stderr)
    return results
```
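A minimal usage sketch for calling `crawl()` directly; the import path is assumed from the file location above, and the URL and limits are illustrative.

```python
import asyncio

from webcrawl_mcp.crawler import crawl  # assumed import path based on src/webcrawl_mcp/crawler.py


async def main() -> None:
    # Crawl a hypothetical docs site, staying one hop from the start page.
    pages = await crawl(
        "https://example.com/docs/",
        max_pages=5,
        max_depth=1,
        include_patterns=["*/docs/*"],
    )
    for page in pages:
        print(page["url"], "-", page["title"])


asyncio.run(main())
```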
- src/webcrawl_mcp/server.py:68-89 (handler): The `webcrawl_crawl` MCP tool handler that defines the tool interface (url, max_pages, max_depth, include_patterns) and delegates to the `crawl()` function.

```python
@mcp.tool
async def webcrawl_crawl(
    url: str,
    max_pages: int = 10,
    max_depth: int = 2,
    include_patterns: list[str] | None = None,
) -> list[dict]:
    """Crawl multiple pages starting from a URL.

    Uses BFS to discover and fetch pages up to max_depth links away.
    Respects rate limiting between requests.

    Args:
        url: Starting URL
        max_pages: Maximum number of pages to fetch (default: 10)
        max_depth: Maximum link depth from start (default: 2)
        include_patterns: Glob patterns for URLs to include (e.g., ["*/docs/*"])

    Returns:
        List of {url, title, content} for each crawled page
    """
    return await crawl(url, max_pages, max_depth, include_patterns)
```
- src/webcrawl_mcp/server.py:69-73 (schema): Input type signature for the webcrawl_crawl tool: url (str), max_pages (int = 10), max_depth (int = 2), include_patterns (list[str] | None). Return type is list[dict].

```python
async def webcrawl_crawl(
    url: str,
    max_pages: int = 10,
    max_depth: int = 2,
    include_patterns: list[str] | None = None,
```
- src/webcrawl_mcp/server.py:68-69 (registration): The `@mcp.tool` decorator registers `webcrawl_crawl` as an MCP tool on the FastMCP server instance.

```python
@mcp.tool
async def webcrawl_crawl(
```
- src/webcrawl_mcp/crawler.py:13-42 (helper): `extract_links()` parses HTML and extracts absolute URLs from anchor tags.

```python
def extract_links(html: str, base_url: str) -> list[str]:
    """Extract all links from HTML.

    Args:
        html: Raw HTML content
        base_url: Base URL for resolving relative links

    Returns:
        List of absolute URLs found in the page
    """
    soup = BeautifulSoup(html, "html.parser")
    links = []

    for anchor in soup.find_all("a", href=True):
        href = anchor["href"]

        # Skip non-http links
        if href.startswith(("javascript:", "mailto:", "tel:", "#")):
            continue

        # Resolve relative URLs
        absolute_url = urljoin(base_url, href)

        # Remove fragments
        parsed = urlparse(absolute_url)
        clean_url = parsed._replace(fragment="").geturl()

        links.append(clean_url)

    return links
```
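A small, hypothetical illustration of the link-extraction behavior (assuming `webcrawl_mcp.crawler` is importable):

```python
from webcrawl_mcp.crawler import extract_links  # assumed import path

html = '<a href="/docs/intro">Intro</a> <a href="mailto:a@b.c">Mail</a> <a href="#top">Top</a>'
# mailto: and pure-fragment links are skipped; relative hrefs are resolved
# against the base URL and fragments are stripped.
print(extract_links(html, "https://example.com/start"))
# ['https://example.com/docs/intro']
```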
- src/webcrawl_mcp/crawler.py:45-63 (helper): `filter_same_domain()` filters URLs to the same domain as the base URL.

```python
def filter_same_domain(urls: list[str], base_url: str) -> list[str]:
    """Filter URLs to same domain as base URL.

    Args:
        urls: List of URLs to filter
        base_url: Base URL to match domain against

    Returns:
        URLs that are on the same domain
    """
    base_domain = urlparse(base_url).netloc.lower()
    same_domain = []

    for url in urls:
        url_domain = urlparse(url).netloc.lower()
        if url_domain == base_domain:
            same_domain.append(url)

    return same_domain
```
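A hypothetical illustration of the domain filter (import path assumed):

```python
from webcrawl_mcp.crawler import filter_same_domain  # assumed import path

urls = [
    "https://example.com/docs/intro",
    "https://other.org/page",
    "https://EXAMPLE.com/api",  # hostname comparison is case-insensitive
]
print(filter_same_domain(urls, "https://example.com/"))
# ['https://example.com/docs/intro', 'https://EXAMPLE.com/api']
```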
- src/webcrawl_mcp/crawler.py:98-111 (helper): `extract_title()` extracts the page title from HTML.

```python
def extract_title(html: str) -> str:
    """Extract page title from HTML.

    Args:
        html: Raw HTML content

    Returns:
        Page title or empty string if not found
    """
    soup = BeautifulSoup(html, "html.parser")
    title_tag = soup.find("title")
    if title_tag and title_tag.string:
        return title_tag.string.strip()
    return ""
```
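A quick illustrative example (import path assumed):

```python
from webcrawl_mcp.crawler import extract_title  # assumed import path

print(extract_title("<html><head><title>  Docs Home </title></head></html>"))  # 'Docs Home'
print(extract_title("<html><body>no title element</body></html>"))             # ''
```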
- src/webcrawl_mcp/crawler.py:114-130 (helper): `matches_patterns()` checks whether a URL matches the glob include patterns.

```python
def matches_patterns(url: str, patterns: list[str] | None) -> bool:
    """Check if URL matches any of the glob patterns.

    Args:
        url: URL to check
        patterns: List of glob patterns (e.g., ["*/docs/*", "*/api/*"])

    Returns:
        True if no patterns specified or URL matches at least one pattern
    """
    if not patterns:
        return True
    for pattern in patterns:
        if fnmatch.fnmatch(url, pattern):
            return True
    return False
```
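An illustrative sketch of matching with `fnmatch`-style globs (import path assumed):

```python
from webcrawl_mcp.crawler import matches_patterns  # assumed import path

print(matches_patterns("https://example.com/docs/intro", ["*/docs/*"]))  # True
print(matches_patterns("https://example.com/blog/post", ["*/docs/*"]))   # False
print(matches_patterns("https://example.com/anything", None))            # True: no filter applied
```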