novada_crawl
Crawls up to 20 pages from a domain using BFS or DFS to extract content, ideal for documentation ingestion and knowledge base creation.
Instructions
Use when you need content from multiple pages of a site and don't have the URLs yet. The tool crawls breadth-first (BFS) or depth-first (DFS), up to 20 pages, and extracts content from each page. Use select_paths regex patterns to target specific sections (e.g. "/docs/api/.*").
Best for: Doc site ingestion, competitive content analysis, building knowledge bases from a domain. Not for: A single page (use novada_extract), URL discovery without content extraction (use novada_map — much faster).
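As a rough illustration of the guidance above, the arguments for a focused documentation crawl might look like the following. The parameter names are taken from the Input Schema table; the URL and the chosen values are hypothetical, and how the arguments reach the tool depends on the MCP client in use.

```python
import json

# Hypothetical arguments for a novada_crawl call.
# Parameter names come from the Input Schema; the URL and values are illustrative.
crawl_args = {
    "url": "https://example.com/docs",   # assumed starting point
    "max_pages": 8,                      # kept at or below 10, per the guidance on large sites
    "strategy": "bfs",                   # breadth-first traversal
    "select_paths": ["/docs/api/.*"],    # restrict the crawl to the API section
    "format": "markdown",
    "render": "auto",
}

# Serialize the way a client would typically send tool arguments.
print(json.dumps(crawl_args, indent=2))
```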
Common mistakes:
Do NOT set max_pages > 10 for large sites — crawl time scales linearly (~1.4s/page). At max_pages=20, expect 28s minimum.
Do NOT use novada_crawl to fetch one page — use novada_extract which is faster and simpler.
Do NOT raise max_pages before narrowing the crawl: use select_paths to restrict it to relevant URL patterns first.
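The linear scaling mentioned above can be turned into a quick estimate. This is a minimal sketch that assumes the quoted figure of roughly 1.4 seconds per page; the function name and constants are illustrative, and real timings will vary by site and rendering mode.

```python
SECONDS_PER_PAGE = 1.4   # approximate per-page cost quoted above
MAX_PAGES_LIMIT = 20     # documented upper bound for max_pages


def estimate_crawl_seconds(max_pages: int) -> float:
    """Return a lower-bound estimate of crawl time in seconds."""
    if not 1 <= max_pages <= MAX_PAGES_LIMIT:
        raise ValueError(f"max_pages must be between 1 and {MAX_PAGES_LIMIT}")
    return max_pages * SECONDS_PER_PAGE


for n in (5, 10, 20):
    print(f"max_pages={n}: at least ~{estimate_crawl_seconds(n):.0f}s")
```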
When to use:
You need content from multiple pages on one domain (e.g., all /docs/* pages).
You need BFS discovery of related content under a path prefix.
Not for:
Single-URL extraction — use novada_extract.
Finding all URLs on a site without downloading content — use novada_map.
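The selection rules above can be summarized as a small decision helper. This is only a sketch of the guidance in this section: the function and its parameters are hypothetical, and only the three tool names come from the text.

```python
def choose_tool(page_count: int, need_content: bool) -> str:
    """Suggest a tool name based on the when-to-use guidance."""
    if not need_content:
        # URL discovery without downloading content.
        return "novada_map"
    if page_count <= 1:
        # A single known page.
        return "novada_extract"
    # Content from multiple pages on one domain.
    return "novada_crawl"


print(choose_tool(page_count=12, need_content=True))
```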
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| url | Yes | Starting URL for the crawl. | |
| max_pages | Yes | Maximum number of pages to crawl. Max 20. | |
| strategy | Yes | Crawl traversal order. 'bfs' (default): breadth-first — visits all pages at current depth before going deeper, good for broad discovery. 'dfs': depth-first — follows links deeply before backtracking, good for exploring specific paths. | bfs |
| instructions | No | Natural language hint for which pages to prioritize. E.g. 'only API reference pages', 'skip blog and changelog'. Applied as path-level filtering; semantic filtering is agent-side. | |
| select_paths | No | Regex patterns to restrict crawled URL paths. E.g. ['/docs/.*', '/api/.*']. | |
| exclude_paths | No | Regex patterns for URL paths to skip entirely. E.g. ['/blog/.*', '/changelog/.*']. | |
| format | Yes | Output format. 'markdown': human-readable (default). 'json': structured object for programmatic agent use. | markdown |
| render | Yes | Rendering mode. 'auto': starts with static fetching and switches to rendering once a JS-heavy page is detected. 'static': always static. 'render': always render (slower, handles JS sites). | auto |
| limit | No | Alias for max_pages; prefer the canonical name max_pages. Max 20. | |
| mode | No | Alias for strategy; prefer the canonical name strategy. | |
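To make the interaction between strategy, select_paths, exclude_paths, and max_pages concrete, here is a minimal sketch that simulates the visit order on a toy in-memory link graph. It is not the tool's implementation: the graph, the function names, the choice to anchor patterns at the start of the path with `re.match`, and the choice to always visit the start URL are all assumptions made for illustration.

```python
import re
from collections import deque

# Toy link graph standing in for a site: path -> linked paths (hypothetical data).
LINKS = {
    "/docs": ["/docs/api", "/docs/guide", "/blog/news"],
    "/docs/api": ["/docs/api/auth", "/docs/api/users"],
    "/docs/guide": ["/docs/guide/start"],
}


def allowed(path, select_paths, exclude_paths):
    """Apply exclude patterns first, then select patterns (assumed semantics)."""
    if any(re.match(p, path) for p in exclude_paths):
        return False
    return not select_paths or any(re.match(p, path) for p in select_paths)


def crawl_order(start, strategy="bfs", max_pages=20,
                select_paths=(), exclude_paths=()):
    """Return the order in which pages would be visited."""
    frontier = deque([start])
    seen = {start}
    visited = []
    while frontier and len(visited) < max_pages:
        # BFS takes the oldest queued page; DFS takes the newest.
        path = frontier.popleft() if strategy == "bfs" else frontier.pop()
        visited.append(path)
        links = [link for link in LINKS.get(path, [])
                 if link not in seen and allowed(link, select_paths, exclude_paths)]
        if strategy == "dfs":
            links.reverse()  # so the first link on the page is explored first
        for link in links:
            seen.add(link)
            frontier.append(link)
    return visited


print(crawl_order("/docs", "bfs", exclude_paths=["/blog/.*"]))
print(crawl_order("/docs", "dfs", exclude_paths=["/blog/.*"]))
```

With breadth-first order, both top-level sections are visited before any of their children; with depth-first order, the whole `/docs/api` branch is exhausted before `/docs/guide` is reached. Under the anchoring assumed here, a pattern such as `/docs/api/.*` would not match `/docs/api` itself, which is worth checking against the real tool before relying on it.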