ScrapeGraphAI

ScrapeGraph MCP Server

Official

sitemap

Read-only · Idempotent

Extract a website's complete sitemap structure to discover all accessible URLs and pages for planning crawls, analyzing architecture, or preparing content audits.

Instructions

Extract and discover the complete sitemap structure of any website.

This tool automatically discovers all accessible URLs and pages within a website, providing a comprehensive map of the site's structure. Useful for understanding site architecture before crawling or for discovering all available content. Very cost-effective at 1 credit per request. Read-only operation with no side effects.

Args:

- website_url (str): The base URL of the website to extract the sitemap from.
  - Must include a protocol (http:// or https://)
  - Should be the root domain or the main section you want to map
  - The tool discovers all accessible pages from this starting point
  - Examples:
    - https://example.com (discover the entire website structure)
    - https://docs.example.com (map a documentation site)
    - https://blog.company.com (discover all blog pages)
    - https://shop.example.com (map an e-commerce structure)
  - Best practices:
    - Use the root domain (https://example.com) for complete site mapping
    - Use a subdomain (https://docs.example.com) for focused mapping
    - Ensure the URL is accessible and doesn't require authentication
  - Discovery methods:
    - Checks for robots.txt and sitemap.xml files
    - Crawls navigation links and menus
    - Discovers pages through internal link analysis
    - Identifies common URL patterns and structures

Returns:

A dictionary containing:
- discovered_urls: List of all URLs found on the website
- site_structure: Hierarchical organization of pages and sections
- url_categories: URLs grouped by type (pages, images, documents, etc.)
- total_pages: Total number of pages discovered
- subdomains: List of subdomains found (if any)
- sitemap_sources: Sources used for discovery (sitemap.xml, robots.txt, crawling)
- page_types: Breakdown of the different content types found
- depth_analysis: URL organization by depth from the root
- external_links: Links pointing to external domains (if found)
- processing_time: Time taken to complete the discovery
- credits_used: Number of credits consumed (always 1)

Raises:

- ValueError: If website_url is malformed or missing a protocol
- HTTPError: If the website cannot be accessed or returns errors
- TimeoutError: If the discovery process takes too long
- ConnectionError: If the website cannot be reached

Use Cases:

- Planning comprehensive crawling operations
- Understanding website architecture and organization
- Discovering all available content before targeted scraping
- SEO analysis and site structure optimization
- Content inventory and audit preparation
- Identifying pages for bulk processing operations

Best Practices:

- Run sitemap before using smartcrawler_initiate for better planning
- Use the results to set appropriate max_pages and depth parameters
- Check the discovered URLs to understand site organization
- Identify high-value pages for targeted extraction
- Use for cost estimation before large crawling operations

Note:

- Very cost-effective at only 1 credit per request
- Results may vary based on site structure and accessibility
- Some pages may require authentication and won't be discovered
- Large sites may have thousands of URLs; consider filtering results
- Use the discovered URLs as input for other scraping tools
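The best practices above can be sketched in a few lines: use the discovered URLs to pick a subset for targeted extraction and to derive a max_pages hint before a crawl. The response dict below is invented for illustration; only the field names mirror the Returns section.

```python
# Hypothetical sitemap response; field names follow the Returns section,
# values are made up for illustration.
result = {
    "discovered_urls": [
        "https://example.com/",
        "https://example.com/blog/post-1",
        "https://example.com/blog/post-2",
        "https://example.com/docs/intro",
    ],
    "total_pages": 4,
    "credits_used": 1,
}

# Keep only blog pages for a follow-up targeted crawl.
blog_urls = [u for u in result["discovered_urls"] if "/blog/" in u]

# Derive a max_pages hint for the crawler from the filtered set.
max_pages_hint = len(blog_urls)
```

The same filtering step is how large sites with thousands of URLs stay affordable: select the high-value subset first, then crawl only that.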

Input Schema

Name        | Required | Description | Default
------------|----------|-------------|--------
website_url | Yes      |             |
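The input table above can be expressed as a minimal JSON Schema. This is a sketch inferred from the table, not the server's published schema, which may carry additional metadata such as a description:

```python
# Minimal JSON Schema inferred from the input table (a sketch, not the
# server's actual published schema).
input_schema = {
    "type": "object",
    "properties": {
        "website_url": {"type": "string"},
    },
    "required": ["website_url"],
}

# A conforming call payload would look like:
payload = {"website_url": "https://example.com"}

# Quick structural check against the sketched schema.
valid = all(key in payload for key in input_schema["required"])
```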

Output Schema


No arguments

Implementation Reference

  • The main handler function for the 'sitemap' MCP tool. It retrieves the API key, creates a ScapeGraphClient instance, and calls the client's sitemap method to POST to the /sitemap API endpoint.
    @mcp.tool(annotations={"readOnlyHint": True, "destructiveHint": False, "idempotentHint": True})
    def sitemap(website_url: str, ctx: Context) -> Dict[str, Any]:
        """
        Extract and discover the complete sitemap structure of any website.
    
        This tool automatically discovers all accessible URLs and pages within a website, providing
        a comprehensive map of the site's structure. Useful for understanding site architecture before
        crawling or for discovering all available content. Very cost-effective at 1 credit per request.
        Read-only operation with no side effects.
    
        Args:
            website_url (str): The base URL of the website to extract sitemap from.
                - Must include protocol (http:// or https://)
                - Should be the root domain or main section you want to map
                - The tool will discover all accessible pages from this starting point
                - Examples:
                  * https://example.com (discover entire website structure)
                  * https://docs.example.com (map documentation site)
                  * https://blog.company.com (discover all blog pages)
                  * https://shop.example.com (map e-commerce structure)
                - Best practices:
                  * Use root domain (https://example.com) for complete site mapping
                  * Use subdomain (https://docs.example.com) for focused mapping
                  * Ensure the URL is accessible and doesn't require authentication
                - Discovery methods:
                  * Checks for robots.txt and sitemap.xml files
                  * Crawls navigation links and menus
                  * Discovers pages through internal link analysis
                  * Identifies common URL patterns and structures
    
        Returns:
            Dictionary containing:
            - discovered_urls: List of all URLs found on the website
            - site_structure: Hierarchical organization of pages and sections
            - url_categories: URLs grouped by type (pages, images, documents, etc.)
            - total_pages: Total number of pages discovered
            - subdomains: List of subdomains found (if any)
            - sitemap_sources: Sources used for discovery (sitemap.xml, robots.txt, crawling)
            - page_types: Breakdown of different content types found
            - depth_analysis: URL organization by depth from root
            - external_links: Links pointing to external domains (if found)
            - processing_time: Time taken to complete the discovery
            - credits_used: Number of credits consumed (always 1)
    
        Raises:
            ValueError: If website_url is malformed or missing protocol
            HTTPError: If the website cannot be accessed or returns errors
            TimeoutError: If the discovery process takes too long
            ConnectionError: If the website cannot be reached
    
        Use Cases:
            - Planning comprehensive crawling operations
            - Understanding website architecture and organization
            - Discovering all available content before targeted scraping
            - SEO analysis and site structure optimization
            - Content inventory and audit preparation
            - Identifying pages for bulk processing operations
    
        Best Practices:
            - Run sitemap before using smartcrawler_initiate for better planning
            - Use results to set appropriate max_pages and depth parameters
            - Check discovered URLs to understand site organization
            - Identify high-value pages for targeted extraction
            - Use for cost estimation before large crawling operations
    
        Note:
            - Very cost-effective at only 1 credit per request
            - Results may vary based on site structure and accessibility
            - Some pages may require authentication and won't be discovered
            - Large sites may have thousands of URLs - consider filtering results
            - Use discovered URLs as input for other scraping tools
        """
        try:
            api_key = get_api_key(ctx)
            client = ScapeGraphClient(api_key)
            return client.sitemap(website_url=website_url)
        except httpx.HTTPError as http_err:
            return {"error": str(http_err)}
        except ValueError as val_err:
            return {"error": str(val_err)}
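The handler above normalizes failures into an `{"error": ...}` payload rather than letting exceptions escape to the MCP runtime. A minimal, self-contained sketch of that pattern, with a fake client standing in for ScapeGraphClient (all names here are illustrative):

```python
from typing import Any, Callable, Dict

def call_sitemap(client_fn: Callable[[str], Dict[str, Any]],
                 website_url: str) -> Dict[str, Any]:
    """Wrap a client call so callers always receive a dict,
    mirroring the try/except shape of the handler above."""
    try:
        return client_fn(website_url)
    except ValueError as val_err:
        return {"error": str(val_err)}

def fake_client(website_url: str) -> Dict[str, Any]:
    """Stand-in for the real API client; validates the protocol
    and returns a canned response instead of making a request."""
    if not website_url.startswith(("http://", "https://")):
        raise ValueError("website_url must include http:// or https://")
    return {"discovered_urls": [f"{website_url}/"], "credits_used": 1}

ok = call_sitemap(fake_client, "https://example.com")
bad = call_sitemap(fake_client, "example.com")
```

Returning the error as data keeps the tool's contract uniform: the agent always receives a dictionary it can inspect, whether the call succeeded or not.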
  • Helper method in ScapeGraphClient class that performs the actual HTTP POST request to the ScrapeGraph API's /sitemap endpoint with the website_url payload.
    def sitemap(self, website_url: str) -> Dict[str, Any]:
        """
        Extract sitemap for a given website.
    
        Args:
            website_url: Base website URL
    
        Returns:
            Dictionary containing sitemap URLs/structure
        """
        url = f"{self.BASE_URL}/sitemap"
        payload: Dict[str, Any] = {"website_url": website_url}
    
        response = self.client.post(url, headers=self.headers, json=payload)
        response.raise_for_status()
        return response.json()
  • The @mcp.tool decorator registers the sitemap function as an MCP tool with specified annotations.
    @mcp.tool(annotations={"readOnlyHint": True, "destructiveHint": False, "idempotentHint": True})
Behavior: 4/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

The description adds valuable behavioral context beyond annotations. Annotations indicate read-only, idempotent, and non-destructive operations, but the description elaborates on cost ('1 credit per request'), discovery methods (e.g., checking robots.txt, crawling links), limitations (pages requiring authentication won't be discovered), and performance considerations (large sites may have thousands of URLs). No contradictions with annotations exist.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness: 4/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is well-structured with clear sections (Args, Returns, Raises, Use Cases, Best Practices, Note), but it is lengthy. While most sentences add value (e.g., explaining cost-effectiveness, discovery methods, limitations), some redundancy exists (e.g., repeating 'cost-effective' in multiple sections), slightly reducing efficiency.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness: 5/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Given the tool's complexity (website mapping), the description is highly complete. It covers purpose, usage, parameters, return values (detailed in Returns section), error handling (Raises), practical applications (Use Cases), and operational notes. With annotations and an output schema present, the description provides all necessary contextual information without gaps.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters: 5/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

With 0% schema description coverage for the single parameter 'website_url', the description fully compensates by providing extensive semantic details. It explains the parameter's purpose, format requirements (must include protocol), usage examples (e.g., root domain vs. subdomain), best practices, and discovery methods, adding significant value beyond the bare schema.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose: 5/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states the tool's purpose: 'Extract and discover the complete sitemap structure of any website.' It specifies the verb ('extract and discover'), resource ('sitemap structure'), and scope ('any website'), and distinguishes it from siblings like 'smartcrawler_initiate' by focusing on comprehensive mapping rather than targeted crawling.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines: 5/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description provides explicit guidance on when to use this tool versus alternatives. It states: 'Useful for understanding site architecture before crawling or for discovering all available content,' and under 'Best Practices' advises: 'Run sitemap before using smartcrawler_initiate for better planning.' This clearly positions it as a preparatory tool for other scraping operations.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
