| crawl_site | Crawl a website and save extracted content as clean Markdown or plain text. This tool fetches pages from the given URL, strips navigation, footers,
scripts, and boilerplate, then saves each page as a Markdown file with a
JSONL index (pages.jsonl). It respects robots.txt and uses sitemap-first
discovery when available.
Use this tool when asked to research, read, analyze, or archive a website.
The pages.jsonl index written to output_dir is required by search_pages,
read_page, list_pages, and extract_data.
Typical workflow: crawl_site → list_pages or search_pages → read_page.
Args:
url: The base URL to crawl (e.g. "https://docs.example.com/"). Only
public, non-authenticated pages will be fetched.
output_dir: Directory to save output files. Each crawl creates .md files
and a pages.jsonl index here. Default: ./crawl_output
format: Output format — "markdown" (preserves headings, code blocks,
lists) or "text" (plain text). Default: "markdown".
max_pages: Maximum number of pages to save. Set to 0 for unlimited.
Default: 100. Use lower values (10-20) for quick previews.
include_subdomains: If True, also crawl subdomains (e.g. docs.example.com
when crawling example.com). Default: False.
render_js: If True, use a headless Chromium browser to render JavaScript
before extracting content. Required for React/Vue/Angular sites.
Slower but necessary for SPAs. Default: False.
|
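The output layout described above (one .md file per page plus a pages.jsonl index) can be sketched as follows. This is a minimal illustration, not the tool's implementation; the record keys (`url`, `title`, `file`, `word_count`) and the slug scheme are assumptions about the schema.

```python
import json
import pathlib
import re

def save_page(output_dir, url, title, markdown):
    """Save one crawled page as a .md file and append a record to pages.jsonl.

    A sketch of the layout crawl_site describes; the exact record keys
    and filename scheme are assumptions, not the tool's actual schema.
    """
    out = pathlib.Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    # Derive a filesystem-safe filename from the URL.
    slug = re.sub(r"[^a-z0-9]+", "-", url.lower()).strip("-") or "index"
    md_path = out / f"{slug}.md"
    md_path.write_text(markdown, encoding="utf-8")
    record = {
        "url": url,
        "title": title,
        "file": md_path.name,
        "word_count": len(markdown.split()),
    }
    # pages.jsonl holds one JSON record per line (JSON Lines format).
    with open(out / "pages.jsonl", "a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")
    return record
```

The downstream tools only need the pages.jsonl path, which is why it is the handle passed to search_pages, read_page, list_pages, and extract_data.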
| search_pages | Search through previously crawled pages by keyword. Performs case-insensitive keyword search across page titles and text content.
Results are ranked by the number of matching query words found. Each result
includes the page URL, title, and a text snippet showing context around the
first match.
This is a read-only operation on local files — no network requests are made.
Requires a prior crawl_site call to have populated the pages.jsonl file.
Args:
query: Search query — one or more keywords separated by spaces. All words
are searched independently (OR logic). Example: "authentication API key".
jsonl_path: Full path to the pages.jsonl file from a previous crawl. If
empty, defaults to <WEBCRAWLER_OUTPUT_DIR>/pages.jsonl.
max_results: Maximum number of results to return. Default: 10. Use lower
values for focused searches, higher for comprehensive surveys.
|
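The ranking described above (case-insensitive OR matching, scored by how many query words hit, with a snippet around the first match) can be sketched in a few lines of Python. The `title` and `text` record keys are assumptions about the pages.jsonl schema:

```python
import json

def search_pages(jsonl_path, query, max_results=10):
    """Keyword search over a pages.jsonl index.

    A sketch of the behaviour search_pages describes: words are matched
    independently (OR logic), and results are ranked by how many distinct
    query words appear in the title or text.
    """
    words = [w.lower() for w in query.split()]
    results = []
    with open(jsonl_path, encoding="utf-8") as fh:
        for line in fh:
            page = json.loads(line)
            haystack = (page.get("title", "") + " " + page.get("text", "")).lower()
            hits = [w for w in words if w in haystack]
            if not hits:
                continue
            # Snippet: a window of context around the first matching word.
            i = haystack.find(hits[0])
            snippet = haystack[max(0, i - 40): i + 40]
            results.append({
                "url": page["url"],
                "title": page.get("title", ""),
                "score": len(hits),
                "snippet": snippet,
            })
    results.sort(key=lambda r: r["score"], reverse=True)
    return results[:max_results]
```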
| read_page | Read the full extracted content of a specific crawled page by its URL. Returns the complete Markdown or text content of a single page, including
its title and source URL. Use this after search_pages to read the full
content of a relevant result.
This is a read-only operation on local files — no network requests are made.
URL matching is case-insensitive and tolerates trailing slashes.
Args:
url: The URL of the page to read. Must match a URL from a previous
crawl; matching is case-insensitive and ignores trailing slashes.
Example: "https://docs.example.com/auth".
jsonl_path: Full path to the pages.jsonl file. If empty, defaults to
<WEBCRAWLER_OUTPUT_DIR>/pages.jsonl.
|
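The tolerant URL matching described above (case-insensitive, trailing slashes ignored) amounts to a normalize-then-compare lookup. A minimal sketch, assuming records carry a `url` key:

```python
import json

def normalize(url):
    """Lower-case the URL and drop any trailing slash, mirroring the
    tolerant matching read_page describes."""
    return url.lower().rstrip("/")

def read_page(jsonl_path, url):
    """Return the index record whose URL matches after normalization,
    or None if no page matches. A sketch only; the real tool returns
    the page's full Markdown or text content."""
    target = normalize(url)
    with open(jsonl_path, encoding="utf-8") as fh:
        for line in fh:
            page = json.loads(line)
            if normalize(page["url"]) == target:
                return page
    return None
```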
| list_pages | List all pages from a previous crawl with their URLs, titles, and word counts. Returns a summary of every page in the crawl index. Use this to get an
overview of available content before searching or reading specific pages.
Word counts help identify content-rich pages vs. thin landing pages.
This is a read-only operation on local files — no network requests are made.
Args:
jsonl_path: Full path to the pages.jsonl file. If empty, defaults to
<WEBCRAWLER_OUTPUT_DIR>/pages.jsonl.
|
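The summary view described above is a straightforward pass over the index. A sketch, assuming each record has a `url` and optionally `title`, `word_count`, or `text` keys (the fallback word count is an assumption, not documented behaviour):

```python
import json

def list_pages(jsonl_path):
    """Summarize every record in the crawl index: URL, title, word count.

    Falls back to counting words in 'text' when the index record lacks a
    precomputed word_count (an assumption about the record shape).
    """
    summary = []
    with open(jsonl_path, encoding="utf-8") as fh:
        for line in fh:
            page = json.loads(line)
            count = page.get("word_count", len(page.get("text", "").split()))
            summary.append({
                "url": page["url"],
                "title": page.get("title", ""),
                "word_count": count,
            })
    return summary
```

Sorting this list by word_count descending is a quick way to surface the content-rich pages the description mentions.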
| extract_data | Extract structured fields from crawled pages using an LLM. Analyzes each crawled page and pulls out specific data fields you define
(e.g. company_name, pricing, features, api_endpoints). If no fields are
specified, the LLM automatically discovers relevant fields by sampling
pages from the crawl.
This tool makes external API calls to OpenAI (requires OPENAI_API_KEY
environment variable). Results are saved to extracted.jsonl and include
LLM attribution metadata.
Use this for competitive research, API documentation analysis, or building
structured datasets from unstructured web content.
Args:
jsonl_path: Full path to the pages.jsonl file. If empty, defaults to
<WEBCRAWLER_OUTPUT_DIR>/pages.jsonl.
fields: Comma-separated field names to extract. Example:
"company_name,pricing,features,api_endpoints". Leave empty to
let the LLM auto-discover the most relevant fields.
context: Description of your analysis goal. Improves auto-field
discovery quality. Example: "competitor pricing analysis" or
"API documentation review". Ignored when fields are specified.
sample_size: Number of pages to sample for auto-field discovery.
Default: 3. Higher values give better field suggestions but
cost more tokens.
|
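Two mechanical parts of the workflow above can be sketched without touching the OpenAI API: parsing the comma-separated `fields` argument, and shaping an extracted.jsonl record with attribution metadata. The record keys and model string here are hypothetical, not the tool's actual schema:

```python
import json

def parse_fields(fields):
    """Turn the comma-separated `fields` argument into a clean list.
    An empty string means 'auto-discover', as extract_data describes."""
    return [f.strip() for f in fields.split(",") if f.strip()]

def build_record(url, extracted, model="gpt-4o-mini"):
    """Shape one extracted.jsonl record with LLM attribution metadata.
    Key names and the model string are assumptions for illustration."""
    return {
        "url": url,
        "fields": extracted,
        "llm": {"provider": "openai", "model": model},
    }

def write_records(jsonl_path, records):
    """Append extraction records to extracted.jsonl, one JSON per line."""
    with open(jsonl_path, "a", encoding="utf-8") as fh:
        for rec in records:
            fh.write(json.dumps(rec) + "\n")
```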