crawl_site
Crawl any public website to extract clean content as Markdown or text, stripping navigation and boilerplate. Respects robots.txt and supports JavaScript rendering for SPAs. Outputs structured files for search and analysis.
Instructions
Crawl a website and save extracted content as clean Markdown or plain text.
This tool fetches pages from the given URL, strips navigation, footers,
scripts, and boilerplate, then saves each page as a Markdown file with a
JSONL index (pages.jsonl). It respects robots.txt and uses sitemap-first
discovery when available.
Use this tool when asked to research, read, analyze, or archive a website.
The output_dir from this tool is required by search_pages, read_page,
list_pages, and extract_data.
Typical workflow: crawl_site → list_pages or search_pages → read_page.
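A minimal sketch of that workflow, assuming the tools are callable as Python functions with the argument names documented here; the exact invocation mechanism and the search_pages/read_page signatures are assumptions, not part of this reference:

```python
# Minimal workflow sketch. The crawl_site arguments below are from the
# Args list in this document; the search_pages/read_page signatures are
# assumptions for illustration only.

# 1. Crawl the site and write Markdown files plus pages.jsonl.
crawl_result = crawl_site(
    url="https://docs.example.com/",
    output_dir="./crawl_output",   # reused by the follow-up tools
    max_pages=20,                  # small preview crawl
)

# 2. Search the crawled pages for a topic (hypothetical signature).
hits = search_pages(output_dir="./crawl_output", query="authentication")

# 3. Read one page from the search results (hypothetical signature).
page = read_page(output_dir="./crawl_output", page=hits[0])
```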
Args:
url: The base URL to crawl (e.g. "https://docs.example.com/"). Only
public, non-authenticated pages will be fetched.
output_dir: Directory to save output files. Each crawl creates .md files
and a pages.jsonl index here. Default: ./crawl_output
format: Output format — "markdown" (preserves headings, code blocks,
lists) or "text" (plain text). Default: "markdown".
max_pages: Maximum number of pages to save. Set to 0 for unlimited.
Default: 100. Use lower values (10-20) for quick previews.
include_subdomains: If True, also crawl subdomains (e.g. docs.example.com
when crawling example.com). Default: False.
render_js: If True, use a headless Chromium browser to render JavaScript
before extracting content. Required for React/Vue/Angular sites.
    Slower but necessary for SPAs. Default: False.
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| url | Yes | Base URL to crawl (public, non-authenticated pages only) | |
| output_dir | No | Directory for the .md files and pages.jsonl index | ./crawl_output |
| format | No | "markdown" or "text" | markdown |
| max_pages | No | Maximum pages to save (0 = unlimited) | 100 |
| include_subdomains | No | Also crawl subdomains of the base URL | False |
| render_js | No | Render JavaScript with headless Chromium before extraction | False |
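As a concrete illustration of the input schema, here is a sketch of an argument payload for a JavaScript-heavy documentation site; the URL and limits are illustrative values, not recommendations:

```python
# Illustrative argument payload matching the input schema above.
# The URL, directory, and page limit are example values.
arguments = {
    "url": "https://app.example.com/docs/",
    "output_dir": "./crawl_output/app-docs",
    "format": "markdown",
    "max_pages": 50,
    "include_subdomains": False,
    "render_js": True,  # assume the docs are served as a React SPA
}
```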
Output Schema
| Name | Required | Description | Default |
|---|---|---|---|
| result | Yes | | |
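After a crawl finishes, the pages.jsonl index can be scanned directly. A minimal sketch, assuming each line is one JSON object describing a saved page; the exact field names are not documented here and may differ:

```python
import json
from pathlib import Path

# Read the crawl index; each line is assumed to be one JSON object
# describing a saved page. Field names are not specified by this tool's
# documentation, so only the keys are listed here.
index_path = Path("./crawl_output/pages.jsonl")
with index_path.open(encoding="utf-8") as f:
    pages = [json.loads(line) for line in f if line.strip()]

print(f"{len(pages)} pages crawled")
for page in pages[:5]:
    print(sorted(page.keys()))
```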