ScrapeGraphAI

ScrapeGraph MCP Server (Official)

sitemap

Extract a website's complete sitemap structure to discover all accessible URLs and pages for planning crawls, analyzing architecture, or preparing content audits.

Instructions

Extract and discover the complete sitemap structure of any website.

This tool automatically discovers all accessible URLs and pages within a website, providing a comprehensive map of the site's structure. Useful for understanding site architecture before crawling or for discovering all available content. Very cost-effective at 1 credit per request. Read-only operation with no side effects.

Args:

website_url (str): The base URL of the website to extract the sitemap from.
- Must include the protocol (http:// or https://)
- Should be the root domain or the main section you want to map
- The tool will discover all accessible pages from this starting point
- Examples:
  * https://example.com (discover the entire website structure)
  * https://docs.example.com (map a documentation site)
  * https://blog.company.com (discover all blog pages)
  * https://shop.example.com (map an e-commerce structure)
- Best practices:
  * Use the root domain (https://example.com) for complete site mapping
  * Use a subdomain (https://docs.example.com) for focused mapping
  * Ensure the URL is accessible and does not require authentication
- Discovery methods:
  * Checks for robots.txt and sitemap.xml files
  * Crawls navigation links and menus
  * Discovers pages through internal link analysis
  * Identifies common URL patterns and structures
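For example, an MCP client might invoke the tool like this. This is a minimal sketch assuming the standard MCP Python SDK client; the session setup is omitted and only website_url is passed, since it is the sole required argument.

    # Minimal invocation sketch using the MCP Python SDK (assumed; not part of this server's code).
    from mcp import ClientSession

    async def run_sitemap(session: ClientSession) -> None:
        result = await session.call_tool(
            "sitemap",
            arguments={"website_url": "https://example.com"},  # must include http:// or https://
        )
        print(result)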

Returns:

Dictionary containing:
- discovered_urls: List of all URLs found on the website
- site_structure: Hierarchical organization of pages and sections
- url_categories: URLs grouped by type (pages, images, documents, etc.)
- total_pages: Total number of pages discovered
- subdomains: List of subdomains found (if any)
- sitemap_sources: Sources used for discovery (sitemap.xml, robots.txt, crawling)
- page_types: Breakdown of the different content types found
- depth_analysis: URL organization by depth from the root
- external_links: Links pointing to external domains (if found)
- processing_time: Time taken to complete the discovery
- credits_used: Number of credits consumed (always 1)
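An abridged, purely illustrative example of the returned dictionary is shown below; the values are hypothetical and the exact shape is determined by the ScrapeGraph API.

    {
      "discovered_urls": ["https://example.com/", "https://example.com/about"],
      "total_pages": 2,
      "sitemap_sources": ["sitemap.xml"],
      "credits_used": 1
    }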

Raises:
- ValueError: If website_url is malformed or missing the protocol
- HTTPError: If the website cannot be accessed or returns errors
- TimeoutError: If the discovery process takes too long
- ConnectionError: If the website cannot be reached
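Note that the MCP handler shown under Implementation Reference below catches httpx.HTTPError and ValueError and returns them as an error dictionary rather than letting them propagate, so a caller-side check might look like this (illustrative sketch only):

    def report_sitemap_result(result: dict) -> None:
        # The handler returns {"error": "..."} when an HTTP or validation error is caught.
        if "error" in result:
            print(f"Sitemap discovery failed: {result['error']}")
        else:
            print(f"Discovered {result.get('total_pages', 0)} pages")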

Use Cases:
- Planning comprehensive crawling operations
- Understanding website architecture and organization
- Discovering all available content before targeted scraping
- SEO analysis and site structure optimization
- Content inventory and audit preparation
- Identifying pages for bulk processing operations

Best Practices:
- Run sitemap before using smartcrawler_initiate for better planning (see the sketch below)
- Use the results to set appropriate max_pages and depth parameters
- Check the discovered URLs to understand site organization
- Identify high-value pages for targeted extraction
- Use for cost estimation before large crawling operations
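A rough planning sketch is shown below, again assuming the MCP Python SDK client. The argument names passed to smartcrawler_initiate are illustrative only and should be taken from that tool's own documentation; max_pages and depth are mentioned in the best practices above, but their exact spelling for that tool is not documented on this page.

    # Rough workflow sketch: map the site first, then start a bounded crawl.
    from mcp import ClientSession

    async def plan_then_crawl(session: ClientSession, url: str) -> None:
        # Step 1: map the site (1 credit) to gauge its size and structure.
        sitemap_result = await session.call_tool(
            "sitemap", arguments={"website_url": url}
        )
        # In practice, inspect sitemap_result (e.g. total_pages, discovered_urls)
        # before choosing a crawl budget.

        # Step 2: start the crawler with bounded limits (illustrative values and names).
        await session.call_tool(
            "smartcrawler_initiate",
            arguments={"url": url, "max_pages": 50, "depth": 2},
        )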

Note:
- Very cost-effective at only 1 credit per request
- Results may vary based on site structure and accessibility
- Some pages may require authentication and will not be discovered
- Large sites may have thousands of URLs; consider filtering the results
- Use the discovered URLs as input for other scraping tools

Input Schema

Name          Required   Description                                                Default
website_url   Yes        The base URL of the website to extract the sitemap from   -

Implementation Reference

  • The main handler function for the 'sitemap' MCP tool. It retrieves the API key, creates a ScapeGraphClient instance, and calls the client's sitemap method to POST to the /sitemap API endpoint.
    @mcp.tool(annotations={"readOnlyHint": True, "destructiveHint": False, "idempotentHint": True})
    def sitemap(website_url: str, ctx: Context) -> Dict[str, Any]:
        """
        Extract and discover the complete sitemap structure of any website.

        This tool automatically discovers all accessible URLs and pages within a
        website, providing a comprehensive map of the site's structure. Useful for
        understanding site architecture before crawling or for discovering all
        available content. Very cost-effective at 1 credit per request. Read-only
        operation with no side effects.

        ...
        """
        try:
            api_key = get_api_key(ctx)
            client = ScapeGraphClient(api_key)
            return client.sitemap(website_url=website_url)
        except httpx.HTTPError as http_err:
            return {"error": str(http_err)}
        except ValueError as val_err:
            return {"error": str(val_err)}
  • Helper method on the ScapeGraphClient class that performs the actual HTTP POST request to the ScrapeGraph API's /sitemap endpoint with the website_url payload.
    def sitemap(self, website_url: str) -> Dict[str, Any]:
        """
        Extract sitemap for a given website.

        Args:
            website_url: Base website URL

        Returns:
            Dictionary containing sitemap URLs/structure
        """
        url = f"{self.BASE_URL}/sitemap"
        payload: Dict[str, Any] = {"website_url": website_url}
        response = self.client.post(url, headers=self.headers, json=payload)
        response.raise_for_status()
        return response.json()
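    Putting the two pieces together, the client can also be exercised directly. A minimal sketch, assuming a valid ScrapeGraph API key and that ScapeGraphClient is importable from the server module (the exact import path is not shown on this page):

    # Minimal sketch; the import path below is an assumption, not taken from this page.
    from scrapegraph_mcp.server import ScapeGraphClient

    client = ScapeGraphClient("YOUR_SCRAPEGRAPH_API_KEY")
    result = client.sitemap(website_url="https://example.com")
    print(result.get("discovered_urls", []))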
  • The @mcp.tool decorator registers the sitemap function as an MCP tool with specified annotations.
    @mcp.tool(annotations={"readOnlyHint": True, "destructiveHint": False, "idempotentHint": True})
