robots_txt
Fetch and parse a target domain's robots.txt to retrieve sitemaps, per-user-agent allow/disallow rules, crawl-delay, and host directive. Use before crawling to honor published site rules.
Instructions
Fetch and parse the target domain's robots.txt to retrieve sitemaps, per-User-agent allow/disallow rules, crawl-delay, and the Host directive. Use BEFORE crawling or scraping a target site (seo_audit, brand_assets, redirect_chain) to honour the site's published rules. A status_code of 404 means no robots.txt exists, which is an implicit allow-all per RFC 9309 §2.4.

ContrastAPI fetches with User-agent: ContrastAPI/<version> (+https://contrastcyber.com/bot) so site operators can identify the bot and opt out via robots.txt; we honour Disallow: / for our UA in seo_audit and brand_assets. A per-target eTLD+1 throttle (60 req/min) prevents weaponising this endpoint against a single site; subdomain rotation collapses into the same bucket. Rate limits: Free 100/hr, Pro 1000/hr.

Returns {domain, fetched_url, status_code, sitemaps, user_agents:{ua:{allow,disallow,crawl_delay}}, host, truncated, summary}. Returns a 502 ErrorResponse if the target rejected the connection (DNS/TCP/TLS failure); the agent should NOT assume "no robots" in that case, since it is an upstream-failure signal, not an allow-all.
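A minimal client sketch of consuming this tool is below; the REST endpoint URL, the X-Api-Key header, and the fetch_robots helper name are assumptions for illustration, while the response fields and the 404/502 semantics follow the description above.

```python
# Minimal sketch of calling the robots_txt tool and interpreting its output.
# The endpoint URL and X-Api-Key header are illustrative assumptions; the
# response fields and status semantics follow the documentation above.
import requests

def fetch_robots(domain: str, api_key: str) -> dict:
    resp = requests.get(
        "https://api.contrastcyber.com/v1/robots_txt",  # hypothetical endpoint path
        params={"domain": domain},
        headers={"X-Api-Key": api_key},
        timeout=30,
    )
    if resp.status_code == 502:
        # Upstream DNS/TCP/TLS failure: do NOT treat this as "no robots.txt".
        raise RuntimeError(f"robots.txt fetch for {domain} failed upstream")
    resp.raise_for_status()
    result = resp.json()["result"]

    if result["status_code"] == 404:
        # No robots.txt published: implicit allow-all per RFC 9309 §2.4.
        return {"allow_all": True, "sitemaps": [], "crawl_delay": None}

    wildcard = result["user_agents"].get("*", {"allow": [], "disallow": [], "crawl_delay": None})
    return {
        "allow_all": not wildcard["disallow"],
        "sitemaps": result["sitemaps"],
        "crawl_delay": wildcard["crawl_delay"],
    }
```

The key design point is that a 502 is surfaced as an error rather than silently mapped to allow-all, so a transient upstream failure never loosens crawl policy.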
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| domain | Yes | Registrable domain to fetch robots.txt for (e.g. 'example.com', 'github.com'). No scheme, no path, no port. Subdomains accepted; the bot fetches https://<domain>/robots.txt with HTTP fallback. | |
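Since the parameter rejects schemes, paths, and ports, a caller starting from a full URL needs to reduce it first; the sketch below is one way to do that with the standard library (the to_domain_param helper is hypothetical, not part of the tool).

```python
# Sketch: normalise a URL or host string into the bare-domain form the
# `domain` parameter expects (no scheme, no path, no port).
from urllib.parse import urlparse

def to_domain_param(raw: str) -> str:
    # urlparse only populates the host when a scheme is present, so add one if missing.
    parsed = urlparse(raw if "://" in raw else f"https://{raw}")
    host = parsed.hostname or ""
    if not host:
        raise ValueError(f"could not extract a host from {raw!r}")
    return host  # subdomains are accepted as-is per the schema above

# e.g. to_domain_param("https://docs.github.com:443/en/") -> "docs.github.com"
```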
Output Schema
| Name | Required | Description | Default |
|---|---|---|---|
| result | Yes | Parsed robots.txt payload: {domain, fetched_url, status_code, sitemaps, user_agents:{ua:{allow,disallow,crawl_delay}}, host, truncated, summary}. | |
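As a usage sketch of the result object: the per-user-agent groups can be applied to a candidate path with longest-match precedence in the spirit of RFC 9309. The may_crawl helper below is hypothetical and uses simplified prefix matching (wildcards ignored); it is not the tool's own matching logic.

```python
# Sketch: apply the parsed result to a candidate path. Longest matching
# rule wins, in the spirit of RFC 9309; this is an illustrative policy.
def may_crawl(result: dict, path: str, ua: str = "ContrastAPI") -> bool:
    groups = result["user_agents"]
    rules = groups.get(ua) or groups.get("*") or {"allow": [], "disallow": []}

    def best_match(prefixes: list) -> int:
        # Longest matching prefix length, -1 if nothing matches.
        # Simplified: plain prefix matching, wildcards are ignored.
        return max((len(p) for p in prefixes if path.startswith(p)), default=-1)

    # The most specific (longest) matching rule wins; ties and
    # "no rule matched" resolve to allow.
    return best_match(rules.get("allow", [])) >= best_match(rules.get("disallow", []))

# e.g. may_crawl(result, "/private/report.pdf") -> False when "Disallow: /private/" applies
```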