Scrape Page (+ YouTube, PDF, DOCX, PPTX)
scrape_page
Extract text content from web pages, YouTube videos, and documents (PDF, DOCX, PPTX), with support for JavaScript-rendered pages.
Instructions
Extract text content from a URL. Automatically handles: web pages (static + JavaScript-rendered), YouTube videos (extracts transcript), and documents (PDF, DOCX, PPTX).
When to use:
You already have a specific URL to extract content from
Need content from YouTube videos, PDFs, or Office documents
Want to check page structure before fetching full content (preview mode)
When to use search_and_scrape instead:
Researching a topic across multiple sources
Content size control:
max_length: Limit response size in characters (default: server max of 50KB)
mode: 'full' returns content, 'preview' returns metadata + structure only
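The input schema says that content over max_length is truncated at natural breakpoints, but it does not document how the server picks them. As a rough illustration only, the sketch below assumes the breakpoints are paragraph, line, sentence, and word boundaries, tried in that order. The function name `truncate_at_breakpoint` is hypothetical and is not part of the tool.

```python
from typing import Tuple


def truncate_at_breakpoint(text: str, max_length: int) -> Tuple[str, bool]:
    """Return (content, truncated), cutting at the last breakpoint that fits.

    Assumption: "natural breakpoints" means paragraph, line, sentence, or
    word boundaries. The real server logic is not documented.
    """
    if max_length <= 0:
        raise ValueError("max_length must be positive")
    if len(text) <= max_length:
        return text, False

    window = text[:max_length]
    # Try the coarsest separator first, then fall back to finer ones.
    for sep in ("\n\n", "\n", ". ", " "):
        cut = window.rfind(sep)
        if cut > 0:
            # Keep the period on sentence breaks; drop whitespace separators.
            end = cut + 1 if sep == ". " else cut
            return window[:end], True
    # No breakpoint inside the window: fall back to a hard cut.
    return window, True


content, truncated = truncate_at_breakpoint(
    "First paragraph.\n\nSecond paragraph is longer.", max_length=30
)
```

Here `content` is cut at the paragraph break and `truncated` is true, which corresponds to the truncated and originalLength fields of the output. A production version would probably also refuse breakpoints that fall very early in the window.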
Preview mode benefits:
Check content size before fetching full content
Get page structure (headings) to decide which sections to read
Avoid context exhaustion with very large pages
Caching: Results are cached for 1 hour.
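The preview-then-fetch pattern described above can be sketched as follows. This is a minimal sketch under stated assumptions: `call_tool` is a hypothetical client function taking a tool name and an arguments dict, and a preview response is assumed to report the full content size in contentLength. A stand-in client is included so the example runs on its own.

```python
from typing import Any, Callable, Dict

# Hypothetical client signature: (tool_name, arguments) -> result dict.
ToolCaller = Callable[[str, Dict[str, Any]], Dict[str, Any]]


def scrape_within_budget(call_tool: ToolCaller, url: str, budget_chars: int) -> Dict[str, Any]:
    """Preview first, then fetch full content, capped at budget_chars if needed."""
    preview = call_tool("scrape_page", {"url": url, "mode": "preview"})
    # Assumption: contentLength in a preview response is the full content size.
    size = preview.get("contentLength", 0)

    args: Dict[str, Any] = {"url": url, "mode": "full"}
    if size > budget_chars:
        args["max_length"] = budget_chars
    return call_tool("scrape_page", args)


def fake_call_tool(name: str, args: Dict[str, Any]) -> Dict[str, Any]:
    """Stand-in for a real client, for demonstration only."""
    full_text = "x" * 1000
    if args.get("mode") == "preview":
        return {"url": args["url"], "contentLength": len(full_text)}
    limit = args.get("max_length", len(full_text))
    return {
        "url": args["url"],
        "content": full_text[:limit],
        "truncated": limit < len(full_text),
    }


result = scrape_within_budget(fake_call_tool, "https://example.com/article", budget_chars=200)
```

Because results are cached for 1 hour, the second call for the same URL is likely cheap, though whether preview and full requests share a cache entry is not stated.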
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| url | Yes | The URL to scrape. Supports: web pages (static HTML and JavaScript-rendered SPAs), YouTube videos (extracts transcript automatically), and documents (PDF, DOCX, PPTX - extracts text content). | |
| max_length | No | Maximum content length in characters. Content exceeding this is truncated at natural breakpoints. | server max (50KB) |
| mode | No | 'full' returns content; 'preview' returns metadata and structure without full content. | full |
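As a hedged illustration of the input schema, the helper below assembles an arguments dict and rejects values the schema does not describe. The name `build_scrape_args` and the client-side checks are assumptions for this example. The server may validate differently.

```python
from typing import Any, Dict, Optional

VALID_MODES = ("full", "preview")


def build_scrape_args(url: str, max_length: Optional[int] = None, mode: str = "full") -> Dict[str, Any]:
    """Build scrape_page arguments from the three documented parameters."""
    if not url:
        raise ValueError("url is required")
    if mode not in VALID_MODES:
        raise ValueError(f"mode must be one of {VALID_MODES}, got {mode!r}")

    args: Dict[str, Any] = {"url": url, "mode": mode}
    if max_length is not None:
        if max_length <= 0:
            raise ValueError("max_length must be a positive number of characters")
        args["max_length"] = max_length
    # When max_length is omitted, the server default (50KB) applies.
    return args


full_args = build_scrape_args("https://example.com/report.pdf", max_length=20000)
preview_args = build_scrape_args("https://example.com/report.pdf", mode="preview")
```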
Output Schema
| Name | Required | Description | Default |
|---|---|---|---|
| url | Yes | The URL that was scraped | |
| content | Yes | The extracted text content from the page | |
| contentType | Yes | The type of content that was extracted | |
| contentLength | Yes | Length of the extracted content in characters | |
| truncated | Yes | Whether the content was truncated due to size limits | |
| estimatedTokens | Yes | Estimated token count (~4 chars/token) | |
| sizeCategory | Yes | Size category based on content length | |
| originalLength | No | Original content length before truncation | |
| metadata | No | Additional metadata for documents | |
| citation | No | Citation information with metadata and formatted strings | |
| preview | No | Content preview with structure (when mode=preview) | |
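One way to model this output on the client side is sketched below. The field names come from the table. The Python types are assumptions, since the schema lists no types, and the optional fields are left as `Any`. The token estimate uses the stated ~4 chars/token ratio. Rounding up is a guess at how the server computes it.

```python
import math
from typing import Any, Dict, List, TypedDict


class _RequiredFields(TypedDict):
    url: str
    content: str
    contentType: str      # assumed to be a string label
    contentLength: int
    truncated: bool
    estimatedTokens: int
    sizeCategory: str     # assumed to be a string label


class ScrapeResult(_RequiredFields, total=False):
    originalLength: int
    metadata: Any
    citation: Any
    preview: Any


REQUIRED_FIELDS = tuple(_RequiredFields.__annotations__)


def estimate_tokens(content_length: int) -> int:
    """Approximate token count at ~4 characters per token (rounding up is an assumption)."""
    return math.ceil(content_length / 4)


def missing_required_fields(result: Dict[str, Any]) -> List[str]:
    """List the required output fields absent from a response dict."""
    return [name for name in REQUIRED_FIELDS if name not in result]
```

For example, a response at the 50,000-character mark would come to roughly `estimate_tokens(50_000)` tokens, that is 12,500, which helps when deciding whether to request a smaller max_length.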