scrape_page
Extract readable content from any URL, including web pages, PDFs, DOCX, PPTX, and YouTube transcripts. Returns structured JSON with metadata and citations.
Instructions
Extract readable content from a single URL. Handles web pages (HTML and JS-rendered SPAs), PDFs, DOCX, PPTX, and YouTube transcripts via auto-detected extraction with tiered fallback (markdown, stealth, HTML, headless browser). Returns JSON with fields: url, content, contentType, contentLength, truncated, estimatedTokens, sizeCategory, citation (with formatted APA/MLA), metadata ({title, author}). Content capped at max_length (default 50000 bytes); truncated=true if cut. Mode 'preview' forces max_length to 5000 bytes for quick relevance checks. Max 5 concurrent scrapes; additional calls queue. On failure returns isError with reason (e.g. blocked, timeout, invalid URL). Use search_and_scrape instead to discover and extract in one step; use web_search if you only need URLs. Results cached 1 hour.
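
The length cap described above can be illustrated with a small sketch. This is not the tool's implementation: `cap_content` is a hypothetical helper, and it assumes the byte limit is counted over UTF-8 encoded text. It shows how `mode` and `max_length` interact and when `truncated` would become true.

```python
PREVIEW_MAX_LENGTH = 5000    # documented: preview forces max_length to 5000 bytes
DEFAULT_MAX_LENGTH = 50000   # documented default for max_length


def cap_content(text: str, mode: str = "full",
                max_length: int = DEFAULT_MAX_LENGTH) -> dict:
    """Hypothetical helper mirroring the documented cap (illustration only)."""
    # In preview mode the caller-supplied max_length is overridden.
    limit = PREVIEW_MAX_LENGTH if mode == "preview" else max_length
    raw = text.encode("utf-8")  # assumption: the limit counts UTF-8 bytes
    truncated = len(raw) > limit
    if truncated:
        # Cut at the byte limit and drop any partial multi-byte character.
        text = raw[:limit].decode("utf-8", errors="ignore")
    return {"content": text, "truncated": truncated}
```

Under these assumptions, a caller that sees `truncated` set to true on a preview can repeat the call in full mode if the page looks relevant.
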
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| url | Yes | The HTTP/HTTPS URL to extract content from. Supports web pages, PDFs, DOCX, PPTX, and YouTube video URLs. | |
| mode | No | Extraction depth: full (up to max_length) or preview (first 5000 bytes, faster). Use preview for quick relevance checks. | full |
| max_length | No | Maximum content length in bytes. Reduce for faster responses when you only need a summary. | 50000 |
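
As a minimal sketch of how a client might assemble and check these inputs before calling the tool: `build_scrape_args` is a hypothetical helper, not part of the tool, and the validation rules are only those implied by the table (HTTP/HTTPS URL, mode of full or preview, positive max_length).

```python
from typing import Optional
from urllib.parse import urlparse


def build_scrape_args(url: str, mode: str = "full",
                      max_length: Optional[int] = None) -> dict:
    """Hypothetical client-side helper that builds the arguments object."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"url must be an HTTP/HTTPS URL, got {url!r}")
    if mode not in ("full", "preview"):
        raise ValueError(f"mode must be 'full' or 'preview', got {mode!r}")

    args = {"url": url, "mode": mode}
    if max_length is not None:
        if max_length <= 0:
            raise ValueError("max_length must be a positive number of bytes")
        # Omitted when None so the documented default (50000) applies.
        args["max_length"] = max_length
    return args
```

For example, `build_scrape_args("https://example.com/report.pdf", mode="preview")` yields an arguments object suited to a quick relevance check before a full extraction.
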
Output Schema
| Name | Required | Description | Default |
|---|---|---|---|
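| citation | No | Citation for the source, including formatted APA/MLA. | |
| content | No | Extracted content, capped at max_length. | |
| contentLength | No | | |
| contentType | No | | |
| estimatedTokens | No | | |
| metadata | No | Object with title and author. | |
| sizeCategory | No | | |
| truncated | No | true if the content was cut at max_length. | |
| url | No | | |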
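
Because every output field is listed as optional, a consumer should read them defensively. The following sketch assumes the result arrives as a JSON string with the field names above; `summarize_result` is a hypothetical helper, and the fallback values are illustrative choices.

```python
import json


def summarize_result(raw_json: str) -> str:
    """Hypothetical helper: build a one-line summary of a scrape_page result."""
    result = json.loads(raw_json)
    # metadata may be missing or null, so fall back to an empty object.
    title = (result.get("metadata") or {}).get("title", "untitled")
    tokens = result.get("estimatedTokens", "?")
    note = ""
    if result.get("truncated"):
        # Content was cut at max_length; a larger limit may return more.
        note = " (truncated; consider a larger max_length)"
    return f"{title}: {tokens} tokens{note}"
```

A caller could use the summary to decide whether a truncated preview is worth repeating as a full extraction.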