ScrapeGraphAI

ScrapeGraph MCP Server

Official

sitemap

Read-only · Idempotent

Extract a website's complete sitemap structure to discover all accessible URLs and pages for planning crawls, analyzing architecture, or preparing content audits.

Instructions

Extract and discover the complete sitemap structure of any website.

This tool automatically discovers all accessible URLs and pages within a website, providing a comprehensive map of the site's structure. Useful for understanding site architecture before crawling or for discovering all available content. Very cost-effective at 1 credit per request. Read-only operation with no side effects.

Args:

- website_url (str): The base URL of the website to extract the sitemap from.
  - Must include a protocol (http:// or https://)
  - Should be the root domain or the main section you want to map
  - The tool discovers all accessible pages from this starting point
  - Examples:
    - https://example.com (discover the entire website structure)
    - https://docs.example.com (map a documentation site)
    - https://blog.company.com (discover all blog pages)
    - https://shop.example.com (map an e-commerce structure)
  - Best practices:
    - Use the root domain (https://example.com) for complete site mapping
    - Use a subdomain (https://docs.example.com) for focused mapping
    - Ensure the URL is accessible and doesn't require authentication
  - Discovery methods:
    - Checks for robots.txt and sitemap.xml files
    - Crawls navigation links and menus
    - Discovers pages through internal link analysis
    - Identifies common URL patterns and structures

Returns:

A dictionary containing:
- discovered_urls: List of all URLs found on the website
- site_structure: Hierarchical organization of pages and sections
- url_categories: URLs grouped by type (pages, images, documents, etc.)
- total_pages: Total number of pages discovered
- subdomains: List of subdomains found (if any)
- sitemap_sources: Sources used for discovery (sitemap.xml, robots.txt, crawling)
- page_types: Breakdown of the different content types found
- depth_analysis: URL organization by depth from the root
- external_links: Links pointing to external domains (if found)
- processing_time: Time taken to complete the discovery
- credits_used: Number of credits consumed (always 1)

Raises:

- ValueError: If website_url is malformed or missing a protocol
- HTTPError: If the website cannot be accessed or returns errors
- TimeoutError: If the discovery process takes too long
- ConnectionError: If the website cannot be reached

Use Cases:

- Planning comprehensive crawling operations
- Understanding website architecture and organization
- Discovering all available content before targeted scraping
- SEO analysis and site structure optimization
- Content inventory and audit preparation
- Identifying pages for bulk processing operations

Best Practices:

- Run sitemap before using smartcrawler_initiate for better planning
- Use the results to set appropriate max_pages and depth parameters
- Check the discovered URLs to understand site organization
- Identify high-value pages for targeted extraction
- Use for cost estimation before large crawling operations

Note:

- Very cost-effective at only 1 credit per request
- Results may vary based on site structure and accessibility
- Some pages may require authentication and won't be discovered
- Large sites may have thousands of URLs; consider filtering results
- Use the discovered URLs as input for other scraping tools
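The best practices above can be sketched in a few lines: use the discovered URLs to pick a subset for targeted extraction and to derive a max_pages hint before a crawl. The response dict below is invented for illustration; only the field names mirror the Returns section.

```python
# Hypothetical sitemap response; field names follow the Returns section,
# values are made up for illustration.
result = {
    "discovered_urls": [
        "https://example.com/",
        "https://example.com/blog/post-1",
        "https://example.com/blog/post-2",
        "https://example.com/docs/intro",
    ],
    "total_pages": 4,
    "credits_used": 1,
}

# Keep only blog pages for a follow-up targeted crawl.
blog_urls = [u for u in result["discovered_urls"] if "/blog/" in u]

# Derive a max_pages hint for the crawler from the filtered set.
max_pages_hint = len(blog_urls)
```

The same filtering step is how large sites with thousands of URLs stay affordable: select the high-value subset first, then crawl only that.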

Input Schema

Name        | Required | Description | Default
------------|----------|-------------|--------
website_url | Yes      |             |
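The input table above can be expressed as a minimal JSON Schema. This is a sketch inferred from the table, not the server's published schema, which may carry additional metadata such as a description:

```python
# Minimal JSON Schema inferred from the input table (a sketch, not the
# server's actual published schema).
input_schema = {
    "type": "object",
    "properties": {
        "website_url": {"type": "string"},
    },
    "required": ["website_url"],
}

# A conforming call payload would look like:
payload = {"website_url": "https://example.com"}

# Quick structural check against the sketched schema.
valid = all(key in payload for key in input_schema["required"])
```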

Output Schema


No arguments

Implementation Reference

  • The main handler function for the 'sitemap' MCP tool. It retrieves the API key, creates a ScapeGraphClient instance, and calls the client's sitemap method to POST to the /sitemap API endpoint.
    @mcp.tool(annotations={"readOnlyHint": True, "destructiveHint": False, "idempotentHint": True})
    def sitemap(website_url: str, ctx: Context) -> Dict[str, Any]:
        """
        Extract and discover the complete sitemap structure of any website.
    
        This tool automatically discovers all accessible URLs and pages within a website, providing
        a comprehensive map of the site's structure. Useful for understanding site architecture before
        crawling or for discovering all available content. Very cost-effective at 1 credit per request.
        Read-only operation with no side effects.
    
        Args:
            website_url (str): The base URL of the website to extract sitemap from.
                - Must include protocol (http:// or https://)
                - Should be the root domain or main section you want to map
                - The tool will discover all accessible pages from this starting point
                - Examples:
                  * https://example.com (discover entire website structure)
                  * https://docs.example.com (map documentation site)
                  * https://blog.company.com (discover all blog pages)
                  * https://shop.example.com (map e-commerce structure)
                - Best practices:
                  * Use root domain (https://example.com) for complete site mapping
                  * Use subdomain (https://docs.example.com) for focused mapping
                  * Ensure the URL is accessible and doesn't require authentication
                - Discovery methods:
                  * Checks for robots.txt and sitemap.xml files
                  * Crawls navigation links and menus
                  * Discovers pages through internal link analysis
                  * Identifies common URL patterns and structures
    
        Returns:
            Dictionary containing:
            - discovered_urls: List of all URLs found on the website
            - site_structure: Hierarchical organization of pages and sections
            - url_categories: URLs grouped by type (pages, images, documents, etc.)
            - total_pages: Total number of pages discovered
            - subdomains: List of subdomains found (if any)
            - sitemap_sources: Sources used for discovery (sitemap.xml, robots.txt, crawling)
            - page_types: Breakdown of different content types found
            - depth_analysis: URL organization by depth from root
            - external_links: Links pointing to external domains (if found)
            - processing_time: Time taken to complete the discovery
            - credits_used: Number of credits consumed (always 1)
    
        Raises:
            ValueError: If website_url is malformed or missing protocol
            HTTPError: If the website cannot be accessed or returns errors
            TimeoutError: If the discovery process takes too long
            ConnectionError: If the website cannot be reached
    
        Use Cases:
            - Planning comprehensive crawling operations
            - Understanding website architecture and organization
            - Discovering all available content before targeted scraping
            - SEO analysis and site structure optimization
            - Content inventory and audit preparation
            - Identifying pages for bulk processing operations
    
        Best Practices:
            - Run sitemap before using smartcrawler_initiate for better planning
            - Use results to set appropriate max_pages and depth parameters
            - Check discovered URLs to understand site organization
            - Identify high-value pages for targeted extraction
            - Use for cost estimation before large crawling operations
    
        Note:
            - Very cost-effective at only 1 credit per request
            - Results may vary based on site structure and accessibility
            - Some pages may require authentication and won't be discovered
            - Large sites may have thousands of URLs - consider filtering results
            - Use discovered URLs as input for other scraping tools
        """
        try:
            api_key = get_api_key(ctx)
            client = ScapeGraphClient(api_key)
            return client.sitemap(website_url=website_url)
        except httpx.HTTPError as http_err:
            return {"error": str(http_err)}
        except ValueError as val_err:
            return {"error": str(val_err)}
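The handler above normalizes failures into an `{"error": ...}` payload rather than letting exceptions escape to the MCP runtime. A minimal, self-contained sketch of that pattern, with a fake client standing in for ScapeGraphClient (all names here are illustrative):

```python
from typing import Any, Callable, Dict

def call_sitemap(client_fn: Callable[[str], Dict[str, Any]],
                 website_url: str) -> Dict[str, Any]:
    """Wrap a client call so callers always receive a dict,
    mirroring the try/except shape of the handler above."""
    try:
        return client_fn(website_url)
    except ValueError as val_err:
        return {"error": str(val_err)}

def fake_client(website_url: str) -> Dict[str, Any]:
    """Stand-in for the real API client; validates the protocol
    and returns a canned response instead of making a request."""
    if not website_url.startswith(("http://", "https://")):
        raise ValueError("website_url must include http:// or https://")
    return {"discovered_urls": [f"{website_url}/"], "credits_used": 1}

ok = call_sitemap(fake_client, "https://example.com")
bad = call_sitemap(fake_client, "example.com")
```

Returning the error as data keeps the tool's contract uniform: the agent always receives a dictionary it can inspect, whether the call succeeded or not.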
  • Helper method in ScapeGraphClient class that performs the actual HTTP POST request to the ScrapeGraph API's /sitemap endpoint with the website_url payload.
    def sitemap(self, website_url: str) -> Dict[str, Any]:
        """
        Extract sitemap for a given website.
    
        Args:
            website_url: Base website URL
    
        Returns:
            Dictionary containing sitemap URLs/structure
        """
        url = f"{self.BASE_URL}/sitemap"
        payload: Dict[str, Any] = {"website_url": website_url}
    
        response = self.client.post(url, headers=self.headers, json=payload)
        response.raise_for_status()
        return response.json()
  • The @mcp.tool decorator registers the sitemap function as an MCP tool with specified annotations.
    @mcp.tool(annotations={"readOnlyHint": True, "destructiveHint": False, "idempotentHint": True})
Behavior: 4/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

The description adds valuable behavioral context beyond annotations. Annotations indicate read-only, idempotent, and non-destructive operations, but the description elaborates on cost ('1 credit per request'), discovery methods (e.g., checking robots.txt, crawling links), limitations (pages requiring authentication won't be discovered), and performance considerations (large sites may have thousands of URLs). No contradictions with annotations exist.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness: 4/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is well-structured with clear sections (Args, Returns, Raises, Use Cases, Best Practices, Note), but it is lengthy. While most sentences add value (e.g., explaining cost-effectiveness, discovery methods, limitations), some redundancy exists (e.g., repeating 'cost-effective' in multiple sections), slightly reducing efficiency.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness: 5/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Given the tool's complexity (website mapping), the description is highly complete. It covers purpose, usage, parameters, return values (detailed in Returns section), error handling (Raises), practical applications (Use Cases), and operational notes. With annotations and an output schema present, the description provides all necessary contextual information without gaps.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters: 5/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

With 0% schema description coverage for the single parameter 'website_url', the description fully compensates by providing extensive semantic details. It explains the parameter's purpose, format requirements (must include protocol), usage examples (e.g., root domain vs. subdomain), best practices, and discovery methods, adding significant value beyond the bare schema.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose: 5/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states the tool's purpose: 'Extract and discover the complete sitemap structure of any website.' It specifies the verb ('extract and discover'), resource ('sitemap structure'), and scope ('any website'), and distinguishes it from siblings like 'smartcrawler_initiate' by focusing on comprehensive mapping rather than targeted crawling.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines: 5/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description provides explicit guidance on when to use this tool versus alternatives. It states: 'Useful for understanding site architecture before crawling or for discovering all available content,' and under 'Best Practices' advises: 'Run sitemap before using smartcrawler_initiate for better planning.' This clearly positions it as a preparatory tool for other scraping operations.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
