extract
Get clean content from web pages or local files. Use batch extraction, deep crawling, file conversion, structured data extraction, multi-step research, or browser interaction to gather information.
Instructions
Read and return full page content from URLs or local files. Use this when you already have a specific URL or file path and need its content. To find URLs first, use the search tool instead.
Actions:
extract: Get clean content from URLs. Example: extract(action="extract", urls=["https://example.com/article"])
batch: Batch extract with per-domain rate limiting (max 50 URLs). Example: extract(action="batch", urls=["https://a.com/1", "https://b.com/2"])
crawl: Deep crawl following links from root URLs. Example: extract(action="crawl", urls=["https://docs.example.com"], depth=2)
map: Discover site URL structure without extracting content. Example: extract(action="map", urls=["https://example.com"])
convert: Convert local files (PDF, DOCX, PPTX, XLSX) to Markdown. Example: extract(action="convert", paths=["/home/user/report.pdf"])
extract_structured: Extract structured data using JSON Schema + LLM. Example: extract(action="extract_structured", urls=["https://example.com/pricing"], schema={"type": "object", "properties": {"price": {"type": "string"}}})
agent: Multi-step research orchestration: searches the web, extracts the top results, and synthesizes a cited Markdown answer. Example: extract(action="agent", query="latest pydantic 2 changes", max_urls=5)
interact: Drive a page with click/fill/submit via patchright. Example: extract(action="interact", url="https://example.com/login", actions=[{"type": "fill", "selector": "#email", "value": "x@y.com"}, {"type": "submit", "selector": "form"}])
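The batch action accepts at most 50 URLs per call, so a longer list has to be split on the caller's side. The following is a minimal sketch of one way to do that. The helper name `build_batch_calls` and the de-duplication step are illustrative assumptions, not part of the tool. It only builds argument dicts and does not invoke the tool itself.

```python
from typing import Any

MAX_BATCH_URLS = 50  # per-call limit documented for the batch action


def build_batch_calls(urls: list[str], fmt: str = "markdown") -> list[dict[str, Any]]:
    """Hypothetical helper: split `urls` into batch argument dicts of at most 50 URLs."""
    # Drop duplicate URLs while keeping first-seen order (an assumption of this sketch).
    unique = list(dict.fromkeys(urls))
    return [
        {"action": "batch", "urls": unique[i:i + MAX_BATCH_URLS], "format": fmt}
        for i in range(0, len(unique), MAX_BATCH_URLS)
    ]
```

Each returned dict can then be passed as keyword arguments to one extract call, for example extract(**call) in a client that exposes the tool that way.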
Key parameters:
urls (required for extract/batch/crawl/map/extract_structured): List of URLs
paths (required for convert): List of local file paths
query (required for agent): Research question to answer
url (required for interact): Page URL to drive
actions (required for interact): List of {type, selector?, description?, value?} ops
max_urls (agent): Max number of URLs to extract (default: 5, hard cap: 20)
synthesis_model (agent): Override LLM model for the synthesis step
token_budget (agent): Max prompt tokens (default: 10000)
session (interact): Persistent session id; reuses browser across calls
screenshot (interact): Capture post-interaction screenshot
format: Output format -- "markdown" (default), "text", "html"
depth: Crawl depth (default: 2, max: 5)
max_pages: Max pages for crawl/map (default: 20, max: 100)
stealth: Enable anti-bot bypass for protected sites (default: false)
schema: JSON Schema dict for extract_structured
Use help tool with tool_name="extract" for full parameter documentation.
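The required parameters and numeric limits listed above can be checked before a call is made. The sketch below is a hedged illustration: `prepare_args`, `REQUIRED`, and `CAPS` are hypothetical names, and clamping out-of-range values on the client side is an assumption (the tool itself may reject or clamp them differently).

```python
from typing import Any

# Required parameters per action, taken from the "Key parameters" list.
REQUIRED: dict[str, tuple[str, ...]] = {
    "extract": ("urls",),
    "batch": ("urls",),
    "crawl": ("urls",),
    "map": ("urls",),
    "extract_structured": ("urls",),
    "convert": ("paths",),
    "agent": ("query",),
    "interact": ("url", "actions"),
}

# Documented maximums: depth 5, max_pages 100, max_urls 20.
CAPS: dict[str, int] = {"depth": 5, "max_pages": 100, "max_urls": 20}


def prepare_args(action: str, **params: Any) -> dict[str, Any]:
    """Hypothetical helper: validate required parameters and clamp capped ones."""
    if action not in REQUIRED:
        raise ValueError(f"unknown action: {action!r}")
    # Treat absent or empty values as missing.
    missing = [name for name in REQUIRED[action] if not params.get(name)]
    if missing:
        raise ValueError(f"{action} requires: {', '.join(missing)}")
    args: dict[str, Any] = {"action": action, **params}
    # Clamp to the documented maximum (assumption: clamp rather than fail).
    for name, ceiling in CAPS.items():
        if name in args:
            args[name] = min(args[name], ceiling)
    return args
```

For example, prepare_args("crawl", urls=["https://docs.example.com"], depth=9) yields a dict with depth reduced to 5, while prepare_args("convert") raises ValueError because paths is missing.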
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| action | Yes | ||
| urls | No | ||
| paths | No | ||
| depth | No | ||
| max_pages | No | ||
| format | No | | markdown |
| stealth | No | ||
| schema | No | ||
| prompt | No | ||
| query | No | ||
| max_urls | No | ||
| synthesis_model | No | ||
| token_budget | No | ||
| actions | No | ||
| session | No | ||
| screenshot | No | ||
| url | No | ||
Output Schema
| Name | Required | Description | Default |
|---|---|---|---|
| result | Yes | ||