# tool_crawl_docs

Crawl and combine multi-page documentation from a starting URL into a single Markdown document with a table of contents for efficient reference.

## Instructions
Crawl multi-page documentation. Follows same-domain links to build combined docs.

**Args:**

- `root_url`: Starting URL.
- `max_pages`: Max pages to crawl (1-20, default 5).

**Returns:** Combined Markdown with table of contents.
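The same-domain rule can be sketched with the standard library; `is_same_domain` is a hypothetical helper name for illustration, while the real check lives inside `crawl_docs`:

```python
from urllib.parse import urlparse


def is_same_domain(link: str, root_url: str) -> bool:
    # Follow a link only when its host matches the root URL's host
    return urlparse(link).netloc == urlparse(root_url).netloc
```

Under this rule, another page on `docs.python.org` would be crawled, while a link out to `github.com` would be skipped.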
## Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| root_url | Yes | Starting URL for the crawl. | — |
| max_pages | No | Maximum pages to crawl (1-20). | 5 |
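An out-of-range `max_pages` value is clamped into the 1-20 range rather than rejected. A minimal sketch of that clamping (`clamp_max_pages` is a hypothetical helper name; the implementation inlines the same expression):

```python
def clamp_max_pages(value: int, lo: int = 1, hi: int = 20) -> int:
    # Mirrors min(max(max_pages, 1), 20) in crawl_docs
    return min(max(value, lo), hi)
```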
## Implementation Reference
- `src/devlens/tools/scraper.py:49-115` (handler): the core implementation of the crawler logic that fetches and aggregates documentation pages.
```python
async def crawl_docs(
    root_url: str,
    max_pages: int = 5,
    *,
    follow_external: bool = False,
) -> str:
    """Crawl documentation starting from a root URL.

    Follows same-domain links to build a combined document with table of contents.

    Args:
        root_url: Starting URL for crawl.
        max_pages: Maximum pages to crawl (1-20).
        follow_external: Allow following external links (not recommended).

    Returns:
        Combined Markdown with table of contents.

    Example:
        >>> docs = await crawl_docs("https://docs.python.org/3/library/asyncio.html")
    """
    from urllib.parse import urlparse

    max_pages = min(max(max_pages, 1), 20)
    visited: set[str] = set()
    to_visit: list[str] = [root_url]
    pages: list[tuple[str, str, str]] = []  # (url, title, content)
    root_domain = urlparse(root_url).netloc

    while to_visit and len(visited) < max_pages:
        url = to_visit.pop(0)
        if url in visited:
            continue

        # Skip non-documentation URLs
        if any(
            skip in url.lower()
            for skip in ["login", "signup", "download", "print", ".pdf", ".zip"]
        ):
            continue

        try:
            doc = await _adapter.fetch(url, retry=1)  # Fewer retries for crawling
            visited.add(url)
            pages.append((url, doc.title, doc.content))

            # Find more links
            async with asyncio.timeout(10):
                import httpx

                async with httpx.AsyncClient(
                    timeout=10, follow_redirects=True
                ) as client:
                    resp = await client.get(url)
                links = _adapter.get_same_domain_links(resp.text, url)

            # Filter links
            for link in links:
                if link in visited or link in to_visit:
                    continue

                # Check domain restriction
                if not follow_external:
                    link_domain = urlparse(link).netloc
                    if link_domain != root_domain:
                        continue

                # Prioritize docs-like URLs
```

- `src/devlens/server.py:67-80` (registration): the MCP tool registration for `tool_crawl_docs`, the entry-point wrapper around `crawl_docs`.

```python
@mcp.tool()
async def tool_crawl_docs(root_url: str, max_pages: int = 5) -> str:
    """Crawl multi-page documentation.

    Follows same-domain links to build combined docs.

    Args:
        root_url: Starting URL.
        max_pages: Max pages to crawl (1-20, default 5).

    Returns:
        Combined Markdown with table of contents.
    """
    return await crawl_docs(root_url, max_pages)
```
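The referenced handler snippet stops before the assembly step, so the following is only a plausible sketch of how the collected `(url, title, content)` tuples could be merged into the combined Markdown return value; `build_combined_markdown` and the `page-N` anchor scheme are assumptions, not the actual implementation:

```python
def build_combined_markdown(pages: list[tuple[str, str, str]]) -> str:
    """Merge (url, title, content) pages into one Markdown doc with a TOC."""
    toc = ["## Table of Contents"]
    sections = []
    for i, (url, title, content) in enumerate(pages, 1):
        anchor = f"page-{i}"  # positional anchor scheme (assumption)
        toc.append(f"{i}. [{title}](#{anchor})")
        sections.append(
            f'<a id="{anchor}"></a>\n\n## {title}\n\nSource: {url}\n\n{content}'
        )
    return "\n".join(toc) + "\n\n" + "\n\n---\n\n".join(sections)
```

Linking each TOC entry to an explicit anchor keeps the combined document navigable even when page titles collide across crawled pages.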