
tool_crawl_docs

Crawl and combine multi-page documentation from a starting URL into a single Markdown document with table of contents for efficient reference.

Instructions

Crawl multi-page documentation.

Follows same-domain links to build combined docs.

Args:

  • root_url: Starting URL.
  • max_pages: Max pages to crawl (1-20, default 5).

Returns: Combined Markdown with table of contents.
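The same-domain rule mentioned above can be sketched with a `netloc` comparison from `urllib.parse` (a minimal illustration with a hypothetical helper name, not the server's exact code):

```python
from urllib.parse import urlparse


def is_same_domain(link: str, root_url: str) -> bool:
    """True when the link shares the root URL's host (netloc)."""
    return urlparse(link).netloc == urlparse(root_url).netloc
```

Links that fail this check are skipped during the crawl unless the implementation's `follow_external` flag is set.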

Input Schema

| Name | Required | Description | Default |
| --- | --- | --- | --- |
| root_url | Yes | Starting URL for the crawl. | |
| max_pages | No | Maximum pages to crawl (1-20). | 5 |

Implementation Reference

  • The core implementation of the crawler logic that fetches and aggregates documentation pages.
    async def crawl_docs(
        root_url: str, max_pages: int = 5, *, follow_external: bool = False
    ) -> str:
        """Crawl documentation starting from a root URL.
    
        Follows same-domain links to build a combined document with
        table of contents.
    
        Args:
            root_url: Starting URL for crawl.
            max_pages: Maximum pages to crawl (1-20).
            follow_external: Allow following external links (not recommended).
    
        Returns:
            Combined Markdown with table of contents.
    
        Example:
            >>> docs = await crawl_docs("https://docs.python.org/3/library/asyncio.html")
        """
        import asyncio
        from urllib.parse import urlparse
    
        max_pages = min(max(max_pages, 1), 20)
        visited: set[str] = set()
        to_visit: list[str] = [root_url]
        pages: list[tuple[str, str, str]] = []  # (url, title, content)
        root_domain = urlparse(root_url).netloc
    
        while to_visit and len(visited) < max_pages:
            url = to_visit.pop(0)
    
            if url in visited:
                continue
    
            # Skip non-documentation URLs
            if any(
                skip in url.lower()
                for skip in ["login", "signup", "download", "print", ".pdf", ".zip"]
            ):
                continue
    
            try:
            doc = await _adapter.fetch(url, retry=1)  # Fewer retries for crawling
                visited.add(url)
                pages.append((url, doc.title, doc.content))
    
                # Find more links
                async with asyncio.timeout(10):
                    import httpx
    
                    async with httpx.AsyncClient(
                        timeout=10, follow_redirects=True
                    ) as client:
                        resp = await client.get(url)
                        links = _adapter.get_same_domain_links(resp.text, url)
    
                        # Filter links
                        for link in links:
                            if link in visited or link in to_visit:
                                continue
    
                            # Check domain restriction
                            if not follow_external:
                                link_domain = urlparse(link).netloc
                                if link_domain != root_domain:
                                    continue
    
                            # Prioritize docs-like URLs
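The excerpt above breaks off at the link-prioritization step. A minimal standalone sketch of what the remaining steps could look like (hypothetical helper names and heuristics; the actual implementation may differ):

```python
def prioritize_links(links: list[str]) -> list[str]:
    """Hypothetical: move documentation-looking URLs to the front of the queue."""
    hints = ("doc", "guide", "tutorial", "reference", "api")
    docs_like = [l for l in links if any(h in l.lower() for h in hints)]
    other = [l for l in links if l not in docs_like]
    return docs_like + other


def combine_pages(pages: list[tuple[str, str, str]]) -> str:
    """Hypothetical: assemble (url, title, content) tuples into one Markdown
    document with a table of contents, matching the tool's described output."""
    toc = "\n".join(
        f"- [{title}](#{title.lower().replace(' ', '-')})"
        for _, title, _ in pages
    )
    sections = "\n\n".join(
        f"## {title}\n\n*Source: {url}*\n\n{content}"
        for url, title, content in pages
    )
    return f"# Combined Documentation\n\n{toc}\n\n{sections}"
```

Prioritized links would be pushed onto `to_visit`, and once the page budget is exhausted the collected `pages` list is flattened into the single Markdown string the tool returns.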
  • The MCP tool registration for tool_crawl_docs, the entry-point wrapper around the crawl_docs function.
    @mcp.tool()
    async def tool_crawl_docs(root_url: str, max_pages: int = 5) -> str:
        """Crawl multi-page documentation.
    
        Follows same-domain links to build combined docs.
    
        Args:
            root_url: Starting URL.
            max_pages: Max pages to crawl (1-20, default 5).
    
        Returns:
            Combined Markdown with table of contents.
        """
        return await crawl_docs(root_url, max_pages)
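Note that the wrapper passes `max_pages` through unvalidated; the clamp to the documented 1-20 range happens inside `crawl_docs` (the `min(max(...))` expression in the excerpt above). Isolated as a standalone helper (a hypothetical name, for illustration):

```python
def clamp_max_pages(max_pages: int) -> int:
    """Force max_pages into the documented [1, 20] range,
    mirroring the inline expression in crawl_docs."""
    return min(max(max_pages, 1), 20)
```

So a caller supplying an out-of-range value still gets a bounded crawl rather than an error.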

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/Y4NN777/devlens-mcp'
