# scrape_documentation
Scrape website documentation using automated sub-agents. Supports CSS selectors and URL filtering for targeted content extraction.
## Instructions
Scrape documentation from a website using intelligent sub-agents. Jobs are queued and processed automatically by the background worker. Content is extracted using plain CSS selector strings supplied via the selectors parameter.
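A minimal invocation might look like the sketch below. All values are illustrative, only url is required, and selectors is shown as an array of plain CSS selector strings on the assumption that the plural parameter name accepts a list:

```json
{
  "url": "https://docs.example.com/",
  "name": "Example Docs",
  "source_type": "reference",
  "max_pages": 200,
  "selectors": ["main article", ".content"]
}
```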
## Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| url | Yes | The URL of the website to scrape. Must be a valid HTTP/HTTPS URL. This is the starting point for the scraping process. | |
| name | No | Optional human-readable name for this documentation source. If not provided, the hostname from the URL will be used. | |
| source_type | No | Type of documentation being scraped. Used for optimization and categorization. Choose "api" for API documentation, "guide" for tutorials/guides, "reference" for reference docs, or "tutorial" for step-by-step tutorials. | guide |
| max_pages | No | Maximum number of pages to scrape from the website. Helps prevent runaway scraping. Range: 1-1000 pages. | |
| selectors | No | CSS selectors to target specific content areas on pages. Use standard CSS selector syntax (e.g., "main article", ".content", "#documentation"). If not provided, the entire page content will be extracted. | |
| allow_patterns | No | Legacy pattern support for URL filtering; prefer the typed parameters (allow_path_segments, allow_url_contains, etc.) instead. Patterns can be glob patterns (*/docs/*), regex patterns (/api\/v[0-9]+\/.*/), or JSON objects with specific matching rules. Examples of the typed parameters appear after this table. | |
| ignore_patterns | No | Legacy pattern support for URL exclusion; prefer the typed parameters (ignore_path_segments, ignore_url_contains, etc.) instead. Patterns can be glob patterns (*/private/*), regex patterns (/login\|admin/), or JSON objects with specific matching rules. | |
| allow_path_segments | No | Array of path segments that URLs must contain to be scraped. For example, ["docs", "api"] will only scrape URLs containing /docs/ or /api/ in their path. | |
| ignore_path_segments | No | Array of path segments to exclude from scraping. For example, ["admin", "private"] will skip URLs containing /admin/ or /private/ in their path. | |
| allow_file_extensions | No | Array of file extensions to include in scraping. For example, ["html", "php"] will only scrape URLs ending with .html or .php. Do not include the dot prefix. | |
| ignore_file_extensions | No | Array of file extensions to exclude from scraping. For example, ["js", "css", "png"] will skip JavaScript, CSS, and image files. Do not include the dot prefix. | |
| allow_url_contains | No | Array of substrings that URLs must contain to be scraped. For example, ["documentation", "guide"] will only scrape URLs containing these terms anywhere in the URL. | |
| ignore_url_contains | No | Array of substrings that will exclude URLs from scraping. For example, ["login", "signup", "404"] will skip URLs containing these terms anywhere in the URL. | |
| allow_url_starts_with | No | Array of URL prefixes that must match for URLs to be scraped. For example, ["https://docs.example.com/v2/"] will only scrape URLs starting with this prefix. | |
| ignore_url_starts_with | No | Array of URL prefixes that will exclude URLs from scraping. For example, ["https://example.com/legacy/"] will skip URLs starting with this prefix. | |
| allow_version_patterns | No | Array of version patterns to include in scraping. Useful for versioned documentation. For example, to scrape only v2.x.x docs, use: [{"prefix": "https://docs.example.com/v", "major": 2}] (see the version-pattern example after this table). | |
| ignore_version_patterns | No | Array of version patterns to exclude from scraping. Useful for skipping deprecated versions. For example, to skip v1.x.x docs, use: [{"prefix": "https://docs.example.com/v", "major": 1}] | |
| allow_glob_patterns | No | Array of glob patterns for URLs to include in scraping. Supports wildcards: * (match any characters), ? (match single character), [abc] (match any character in brackets). For example, ["*/docs/*", "*/api/v*"] | |
| ignore_glob_patterns | No | Array of glob patterns for URLs to exclude from scraping. Supports wildcards: * (match any characters), ? (match single character), [abc] (match any character in brackets). For example, ["*/private/*", "*/admin/*"] | |
| allow_regex_patterns | No | Array of regular expressions for URLs to include in scraping. Use standard regex syntax. For example, ["/api/v[0-9]+/", "/docs/[a-z]+/"] will match versioned API paths and alphabetic doc paths. | |
| ignore_regex_patterns | No | Array of regular expressions for URLs to exclude from scraping. Use standard regex syntax. For example, ["/login", "/admin", "/\\.(js\|css\|png\|jpg)$"] will skip login, admin, and static asset URLs. | |
| include_subdomains | No | Whether to include subdomains in the scraping process. If true, links to subdomains (e.g., api.example.com when scraping docs.example.com) will be followed. | |
| force_refresh | No | Whether to force refresh of previously scraped pages. If true, pages will be re-scraped even if they already exist in the database. | |
| agent_id | No | Optional agent ID for tracking and memory storage. If provided, scraping insights and results will be stored in the agent's memory for future reference. | |
| enable_sampling | No | Whether to enable intelligent parameter optimization through website sampling. When enabled, the scraper will analyze the website structure and optimize filtering parameters automatically. | |
| sampling_timeout | No | Timeout in milliseconds for the sampling/optimization process. Only used when enable_sampling is true. | 30000 (30 seconds) |
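## Examples

The calls below are illustrative sketches rather than verified tool output; every URL and value is hypothetical, and only url is required. This call narrows the crawl with the typed filtering parameters, keeping documentation and API paths while skipping auth pages and static assets:

```json
{
  "url": "https://docs.example.com/",
  "allow_path_segments": ["docs", "api"],
  "ignore_path_segments": ["admin", "private"],
  "ignore_file_extensions": ["js", "css", "png"],
  "ignore_url_contains": ["login", "signup", "404"]
}
```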
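The same kind of filtering can be expressed with glob or regex patterns. Note that regex strings must be JSON-escaped, so a literal dot written as \. in regex becomes \\. inside the JSON string:

```json
{
  "url": "https://example.com/",
  "allow_glob_patterns": ["*/docs/*", "*/api/v*"],
  "ignore_regex_patterns": ["/login", "/admin", "/\\.(js|css|png|jpg)$"]
}
```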
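For versioned documentation, version patterns can pin the crawl to a single major version. The {"prefix": ..., "major": ...} object shape is taken directly from the parameter descriptions above; whether other keys (for example a minor field) are accepted is not documented. The sampling fields echo the documented 30-second default:

```json
{
  "url": "https://docs.example.com/",
  "allow_version_patterns": [{"prefix": "https://docs.example.com/v", "major": 2}],
  "ignore_version_patterns": [{"prefix": "https://docs.example.com/v", "major": 1}],
  "enable_sampling": true,
  "sampling_timeout": 30000
}
```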