# crawl_site
Extract website content into clean Markdown or text files by crawling pages and removing navigation, scripts, and boilerplate. Build searchable archives for research and data analysis.
## Instructions
Crawl a website and save extracted content as clean Markdown or plain text.
This tool fetches pages from the given URL, strips navigation, footers,
scripts, and boilerplate, then saves each page as a Markdown file with a
JSONL index (pages.jsonl). It respects robots.txt and uses sitemap-first
discovery when available.
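Because pages.jsonl is a standard JSON Lines file, it can also be inspected directly outside the companion tools. The sketch below assumes one JSON object per line; the field names shown in the comment (e.g. url, title, file) are illustrative assumptions, not a documented schema.

```python
import json

def load_index(path="./crawl_output/pages.jsonl"):
    """Read a pages.jsonl index: one JSON object per non-empty line.

    Each object describes one saved page; typical fields might be
    url, title, and file, but the exact schema is an assumption here.
    """
    pages = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines defensively
                pages.append(json.loads(line))
    return pages
```

This is only a convenience for ad-hoc inspection; for normal use, search_pages and list_pages read the same index for you.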
Use this tool when asked to research, read, analyze, or archive a website.
The output_dir from this tool is required by search_pages, read_page,
list_pages, and extract_data.
Typical workflow: crawl_site → list_pages or search_pages → read_page.
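The workflow above can be sketched as follows, assuming a generic `call_tool(name, **kwargs)` client function; `call_tool` and the return shapes are assumptions for illustration, not part of this tool's contract.

```python
def research_site(call_tool, url, query):
    """Hypothetical crawl -> search -> read workflow.

    call_tool is an assumed generic tool-invocation client, not part of
    this API; return shapes are likewise assumptions.
    """
    # Crawl a small preview of the site into the shared output_dir.
    call_tool("crawl_site", url=url, output_dir="./crawl_output", max_pages=20)
    # Search the crawled pages for the query; assume a list of page ids back.
    hits = call_tool("search_pages", output_dir="./crawl_output", query=query)
    # Read the first matching page, if any.
    if hits:
        return call_tool("read_page", output_dir="./crawl_output", page=hits[0])
    return None
```

Note that every follow-up call passes the same output_dir the crawl wrote to, which is the link between the tools.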
Args:
- `url`: The base URL to crawl (e.g. "https://docs.example.com/"). Only public, non-authenticated pages will be fetched.
- `output_dir`: Directory to save output files. Each crawl creates .md files and a pages.jsonl index here. Default: ./crawl_output
- `format`: Output format, either "markdown" (preserves headings, code blocks, lists) or "text" (plain text). Default: "markdown".
- `max_pages`: Maximum number of pages to save. Set to 0 for unlimited. Default: 100. Use lower values (10-20) for quick previews.
- `include_subdomains`: If True, also crawl subdomains (e.g. docs.example.com when crawling example.com). Default: False.
- `render_js`: If True, use a headless Chromium browser to render JavaScript before extracting content. Required for React/Vue/Angular sites; slower, but necessary for SPAs. Default: False.

## Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| url | Yes | Base URL to crawl; only public, non-authenticated pages are fetched. | |
| output_dir | No | Directory for the .md files and pages.jsonl index. | ./crawl_output |
| format | No | "markdown" or "text". | markdown |
| max_pages | No | Maximum pages to save; 0 for unlimited. | 100 |
| include_subdomains | No | Also crawl subdomains of the base domain. | False |
| render_js | No | Render JavaScript with headless Chromium before extraction. | False |
## Output Schema
| Name | Required | Description | Default |
|---|---|---|---|
| result | Yes | | |