novada_extract
Extract clean content from URLs, including anti-bot pages, with automatic rendering escalation. Supports batch mode for up to 10 pages in parallel. Output as markdown, text, HTML, or JSON.
Instructions
Extract clean content from any URL. Handles Cloudflare, DataDome, and Kasada automatically via auto-escalation (static → JS render → Browser CDP). Batch mode: pass url as an array to extract up to 10 pages in parallel.
Use for: Reading pages, batch-extracting search results, pulling structured fields (price, author, date). Works on anti-bot pages automatically.
Not for: URL discovery (novada_map), multi-page crawl (novada_crawl), platform data like Amazon/LinkedIn (novada_scrape is richer).
Key rule: Leave render="auto" (default). Only set render="render" for known JS-heavy SPAs. Auto mode is 15-100x faster on static sites.
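The escalation order described above (static → JS render → Browser CDP) can be pictured with a small sketch. This is an illustration only, not Novada's implementation: the `Fetcher` type, `extract_with_escalation`, and the demo fetchers are hypothetical names, and treating an empty result as the signal to escalate is an assumption.

```python
from typing import Callable, Dict, Optional, Tuple

# Hypothetical fetcher type: returns extracted content, or None when the
# page is blocked or empty at that rendering tier.
Fetcher = Callable[[str], Optional[str]]

# Tier order taken from the tool description: static -> JS render -> Browser CDP.
ESCALATION_ORDER = ("static", "js", "browser")


def extract_with_escalation(
    url: str, fetchers: Dict[str, Fetcher], render: str = "auto"
) -> Tuple[str, str]:
    """Return (tier_used, content). Illustrative sketch, not the real service."""
    # The schema lists 'render' as an alias for 'js'.
    mode = "js" if render == "render" else render
    # 'auto' walks every tier in order; any other mode tries only that tier.
    tiers = ESCALATION_ORDER if mode == "auto" else (mode,)
    for tier in tiers:
        content = fetchers[tier](url)
        if content:  # assumption: None or "" means "escalate to the next tier"
            return tier, content
    raise RuntimeError(f"all tiers failed for {url}")


# Demo fetchers (hypothetical): the static tier fails, JS rendering succeeds.
demo_fetchers: Dict[str, Fetcher] = {
    "static": lambda url: None,
    "js": lambda url: "rendered content",
    "browser": lambda url: "browser content",
}

tier, content = extract_with_escalation("https://example.com", demo_fetchers)
```

With render="auto" the cheap static attempt always runs first, which is why forcing a heavier mode only makes sense for pages already known to need it.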
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| url | Yes | A single URL, or an array of up to 10 URLs, to extract. Arrays are processed in parallel. For multiple URLs, prefer the urls array param. | |
| urls | No | Array of URLs to extract in parallel (max 10). Alias for url when passing multiple URLs. Use for batch research workflows that extract several pages in one call. Returns a structured markdown document with one labeled section per URL (### [1/N] url). A single url still returns a single markdown document. | |
| format | Yes | Output format. 'markdown' (default): structured readable output. 'text': plain text. 'html': raw HTML (truncated at 10K). 'json': structured JSON object with typed fields — best for programmatic agent consumption. | markdown |
| query | No | Optional query for relevance context. Helps the calling agent focus on relevant sections. | |
| render | Yes | Rendering mode. 'auto' (default): tries static first, escalates if JS-heavy. 'static': static HTML only. 'js' (or 'render'): force JS rendering via Web Unblocker. 'browser': force Browser API CDP (requires NOVADA_BROWSER_WS). | auto |
| fields | No | Specific fields to extract (e.g. ['price', 'author', 'availability', 'rating']). Returns a structured ## Requested Fields block. JSON-LD structured data is checked first; falls back to pattern matching. | |
| max_chars | No | Maximum characters to return (max: 100000). When content exceeds this limit, it is truncated and a notice is appended. Common mistake: do not set max_chars=100000 by default; use 25000 for most pages. | 25000 |
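As a rough sketch of how a caller might assemble arguments that respect the limits in the table, and then split a batch result into per-URL sections: `build_extract_args` and `split_batch_markdown` are hypothetical helpers, not part of the tool. The section-header pattern is inferred from the `### [1/N] url` label in the urls description, and passing the batch as an array in `url` is one reading of the schema (`urls` is the documented alias).

```python
import re
from typing import Any, Dict, List, Optional, Sequence, Union

MAX_BATCH = 10               # "max 10" URLs per call
DEFAULT_MAX_CHARS = 25_000   # documented default
MAX_CHARS_LIMIT = 100_000    # documented maximum
FORMATS = {"markdown", "text", "html", "json"}
RENDER_MODES = {"auto", "static", "js", "render", "browser"}


def build_extract_args(
    url: Union[str, Sequence[str]],
    *,
    format: str = "markdown",
    render: str = "auto",
    fields: Optional[List[str]] = None,
    query: Optional[str] = None,
    max_chars: int = DEFAULT_MAX_CHARS,
) -> Dict[str, Any]:
    """Validate inputs against the documented limits and return an argument dict."""
    if format not in FORMATS:
        raise ValueError(f"unknown format: {format!r}")
    if render not in RENDER_MODES:
        raise ValueError(f"unknown render mode: {render!r}")
    if not 0 < max_chars <= MAX_CHARS_LIMIT:
        raise ValueError(f"max_chars must be between 1 and {MAX_CHARS_LIMIT}")
    if isinstance(url, str):
        target: Union[str, List[str]] = url
    else:
        target = list(url)
        if not 1 <= len(target) <= MAX_BATCH:
            raise ValueError(f"batch mode accepts 1 to {MAX_BATCH} URLs")
    args: Dict[str, Any] = {
        "url": target,  # assumption: arrays go in `url`; `urls` is the alias
        "format": format,
        "render": render,
        "max_chars": max_chars,
    }
    # Optional parameters are included only when set.
    if fields:
        args["fields"] = list(fields)
    if query:
        args["query"] = query
    return args


# Assumed header shape for batch output, based on "### [1/N] url".
SECTION_RE = re.compile(r"^### \[(\d+)/(\d+)\] (\S+)[ \t]*$", re.MULTILINE)


def split_batch_markdown(doc: str) -> Dict[str, str]:
    """Map each URL to the body of its labeled section."""
    matches = list(SECTION_RE.finditer(doc))
    sections: Dict[str, str] = {}
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(doc)
        sections[m.group(3)] = doc[m.end():end].strip()
    return sections
```

Keeping the defaults for render and max_chars in the builder mirrors the guidance above: callers opt in to heavier rendering or larger outputs explicitly rather than by habit.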