scrape_documentation

Scrape website documentation using automated sub-agents. Supports CSS selectors and URL filtering for targeted content extraction.

Instructions

Scrape documentation from a website using intelligent sub-agents. Jobs are queued and processed automatically by the background worker. Supports plain string selectors for content extraction.

Input Schema

| Name | Required | Description | Default |
|------|----------|-------------|---------|
| url | Yes | The URL of the website to scrape. Must be a valid HTTP/HTTPS URL. This is the starting point for the scraping process. | |
| name | No | Optional human-readable name for this documentation source. If not provided, the hostname from the URL will be used. | |
| source_type | No | Type of documentation being scraped. Used for optimization and categorization. Choose "api" for API documentation, "guide" for tutorials/guides, "reference" for reference docs, or "tutorial" for step-by-step tutorials. | guide |
| max_pages | No | Maximum number of pages to scrape from the website. Helps prevent runaway scraping. Range: 1-1000 pages. | |
| selectors | No | CSS selectors to target specific content areas on pages. Use standard CSS selector syntax (e.g., "main article", ".content", "#documentation"). If not provided, the entire page content will be extracted. | |
| allow_patterns | No | Legacy pattern support for URL filtering. Use allow_path_segments, allow_url_contains, or other typed parameters instead. Patterns can be glob patterns (*/docs/*), regex patterns (/api\/v[0-9]+\/.*/), or JSON objects with specific matching rules. | |
| ignore_patterns | No | Legacy pattern support for URL exclusion. Use ignore_path_segments, ignore_url_contains, or other typed parameters instead. Patterns can be glob patterns (*/private/*), regex patterns (/login\|admin/), or JSON objects with specific matching rules. | |
| allow_path_segments | No | Array of path segments that URLs must contain to be scraped. For example, ["docs", "api"] will only scrape URLs containing /docs/ or /api/ in their path. | |
| ignore_path_segments | No | Array of path segments to exclude from scraping. For example, ["admin", "private"] will skip URLs containing /admin/ or /private/ in their path. | |
| allow_file_extensions | No | Array of file extensions to include in scraping. For example, ["html", "php"] will only scrape URLs ending with .html or .php. Do not include the dot prefix. | |
| ignore_file_extensions | No | Array of file extensions to exclude from scraping. For example, ["js", "css", "png"] will skip JavaScript, CSS, and image files. Do not include the dot prefix. | |
| allow_url_contains | No | Array of substrings that URLs must contain to be scraped. For example, ["documentation", "guide"] will only scrape URLs containing these terms anywhere in the URL. | |
| ignore_url_contains | No | Array of substrings that will exclude URLs from scraping. For example, ["login", "signup", "404"] will skip URLs containing these terms anywhere in the URL. | |
| allow_url_starts_with | No | Array of URL prefixes that must match for URLs to be scraped. For example, ["https://docs.example.com/v2/"] will only scrape URLs starting with this prefix. | |
| ignore_url_starts_with | No | Array of URL prefixes that will exclude URLs from scraping. For example, ["https://example.com/legacy/"] will skip URLs starting with this prefix. | |
| allow_version_patterns | No | Array of version patterns to include in scraping. Useful for versioned documentation. For example, to scrape only v2.x.x docs, use: [{"prefix": "https://docs.example.com/v", "major": 2}] | |
| ignore_version_patterns | No | Array of version patterns to exclude from scraping. Useful for skipping deprecated versions. For example, to skip v1.x.x docs, use: [{"prefix": "https://docs.example.com/v", "major": 1}] | |
| allow_glob_patterns | No | Array of glob patterns for URLs to include in scraping. Supports wildcards: * (match any characters), ? (match single character), [abc] (match any character in brackets). For example, ["*/docs/*", "*/api/v*"] | |
| ignore_glob_patterns | No | Array of glob patterns for URLs to exclude from scraping. Supports wildcards: * (match any characters), ? (match single character), [abc] (match any character in brackets). For example, ["*/private/*", "*/admin/*"] | |
| allow_regex_patterns | No | Array of regular expressions for URLs to include in scraping. Use standard regex syntax. For example, ["/api/v[0-9]+/", "/docs/[a-z]+/"] will match versioned API paths and alphabetic doc paths. | |
| ignore_regex_patterns | No | Array of regular expressions for URLs to exclude from scraping. Use standard regex syntax. For example, ["/login", "/admin", "/\\.(js\|css\|png\|jpg)$"] will skip login, admin, and static asset URLs. | |
| include_subdomains | No | Whether to include subdomains in the scraping process. If true, links to subdomains (e.g., api.example.com when scraping docs.example.com) will be followed. | |
| force_refresh | No | Whether to force refresh of previously scraped pages. If true, pages will be re-scraped even if they already exist in the database. | |
| agent_id | No | Optional agent ID for tracking and memory storage. If provided, scraping insights and results will be stored in the agent's memory for future reference. | |
| enable_sampling | No | Whether to enable intelligent parameter optimization through website sampling. When enabled, the scraper will analyze the website structure and optimize filtering parameters automatically. | |
| sampling_timeout | No | Timeout in milliseconds for the sampling/optimization process. Only used when enable_sampling is true. | 30000 (30 seconds) |
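
The table above defines each parameter in isolation. As a sketch of how they compose, here is a hypothetical invocation; the parameter names and value shapes come from the schema, but the URL, selector, and all concrete values are illustrative assumptions, not documented defaults:

```jsonc
{
  // Hypothetical arguments for a scrape_documentation call.
  // Only the parameter names and value shapes are taken from the schema;
  // everything else is illustrative.
  "url": "https://docs.example.com/v2/",          // required starting point
  "name": "Example Docs v2",                      // falls back to the hostname if omitted
  "source_type": "reference",                     // api | guide | reference | tutorial
  "max_pages": 200,                               // schema allows 1-1000
  "selectors": "main article",                    // plain CSS selector string
  "allow_path_segments": ["docs", "api"],         // only URLs containing /docs/ or /api/
  "ignore_url_contains": ["login", "signup"],     // skip auth pages
  "ignore_file_extensions": ["js", "css", "png"], // skip static assets (no dot prefix)
  "allow_version_patterns": [                     // keep only v2.x.x documentation
    { "prefix": "https://docs.example.com/v", "major": 2 }
  ],
  "include_subdomains": false,
  "force_refresh": false                          // don't re-scrape pages already stored
}
```

Two things the schema leaves open are worth noting: precedence when allow and ignore filters overlap is unspecified, so it is safest to keep them disjoint; and since jobs are queued for a background worker, the call presumably returns a job handle rather than scraped content, though the tool defines no output schema.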
Behavior: 2/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

With no behavioral annotations, the description itself must disclose these traits, but it only mentions queuing and selectors. It lacks information on side effects (e.g., that scraped pages are written to a persistent database, as the force_refresh parameter implies), authentication needs, rate limits, and what ultimately happens to the scraped data. This is insufficient for a complex scraping tool.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness: 4/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is brief and front-loaded, but for a tool with many parameters, it could benefit from slightly more structure (e.g., grouping filtering options). It earns its place but is a bit too terse.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness: 2/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

With 26 parameters and no output schema, the description is too minimal. It doesn't explain return values, error handling, or the behavior of the background worker. The per-parameter schema descriptions help, but the tool description itself lacks the context an agent needs to use it effectively.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters: 3/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

All 26 parameters have full descriptions in the input schema, so the tool description adds minimal value on top of it. It mentions 'plain string selectors', but the schema already covers that. A baseline score of 3 is appropriate.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.
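
One such relationship is already visible in this schema: sampling_timeout is only honored when enable_sampling is true. A minimal fragment illustrating the interaction (the timeout value is an illustrative assumption):

```jsonc
{
  "enable_sampling": true,   // turn on sampling-based parameter optimization
  "sampling_timeout": 45000  // ms; ignored unless enable_sampling is true (default 30000)
}
```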

Purpose: 4/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states that the tool scrapes documentation using sub-agents, but it does not distinguish the tool from sibling scraping tools such as 'scrape_content' or 'navigate_and_scrape', which an agent would need in order to choose correctly.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines: 2/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description provides no guidance on when to use this tool versus alternatives, such as when to prefer queued scraping over direct scraping. There is no mention of prerequisites or limitations.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
