Extract Article Content
extract
Retrieve clean article text from any public URL. Supports HTML and academic PDFs, with modes for full content, abstract, or metadata.
Instructions
Fetches one public URL and returns clean article text. HTML is extracted via Mozilla Readability; academic PDFs (arxiv/biorxiv/Nature/OpenReview/NeurIPS/JMLR/PMLR/Springer/PubMed-via-PMC) are auto-detected via Content-Type, the %PDF magic bytes, the citation_pdf_url meta tag, and per-domain URL rules. Depth is tiered: mode="abstract" returns about 1500 chars (PDF page 1 or the HTML meta description), a cheap survey for triaging relevance before paying for the full body; mode="full" (default) returns the whole article. Extraction is best-effort: failures return an errorInfo instead of throwing.
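
The tiered-depth workflow described above can be sketched as a two-step call: survey with mode="abstract", then request mode="full" only when the survey looks relevant. This is a minimal sketch, not the tool's own client. The `extract` callable, the `triage_then_fetch` name, and the keyword-matching rule are hypothetical, and it assumes a successful result carries its text in `content` and leaves `error` empty.

```python
from typing import Any, Callable, Dict, List, Optional

# Hypothetical transport: takes the tool arguments, returns the tool's result dict.
ExtractFn = Callable[[Dict[str, Any]], Dict[str, Any]]


def triage_then_fetch(
    extract: ExtractFn, url: str, keywords: List[str]
) -> Optional[Dict[str, Any]]:
    """Fetch the cheap abstract first; fetch the full body only if it looks relevant."""
    survey = extract({"url": url, "mode": "abstract"})
    # Assumption: failures come back as data in `error`, not as exceptions.
    if survey.get("error"):
        return None

    # Assumption: the abstract text is returned in `content`.
    text = (survey.get("content") or "").lower()
    if not any(keyword.lower() in text for keyword in keywords):
        return None  # not relevant, so skip the expensive full fetch

    full = extract({"url": url, "mode": "full"})
    return None if full.get("error") else full
```

Passing the transport in as a callable keeps the sketch independent of whichever client library actually invokes the tool.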
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| url | Yes | Public http(s) URL. Loopback/private IPs blocked unless SURF_ALLOW_PRIVATE=true. | |
| max_chars | No | Truncate body to this many chars. | 8000 |
| mode | No | Extraction depth. `full` = whole article body (default; uses Playwright if needed). `abstract` = cheap survey: PDF page 1 OR HTML meta description (~1500 chars); use to triage relevance before paying for full text. `metadata` = page count only (PDF). Academic PDFs (arxiv/biorxiv/Nature/OpenReview/NeurIPS/JMLR/PMLR/Springer/PubMed-via-PMC) are auto-detected; abstract mode skips Playwright for them. | full |
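
As an illustration of the input schema, the arguments could be assembled and pre-checked on the client side as follows. This is a sketch under assumptions: `build_extract_args` and `_is_private_literal` are hypothetical helpers, the `allow_private` flag merely mirrors the server-side SURF_ALLOW_PRIVATE setting, and the address check only covers literal IPs and `localhost` (it does not resolve DNS, so it is not a substitute for the tool's own blocking).

```python
import ipaddress
from typing import Any, Dict
from urllib.parse import urlsplit

VALID_MODES = {"full", "abstract", "metadata"}  # modes listed in the schema


def _is_private_literal(host: str) -> bool:
    """True for `localhost` or a literal loopback/private/link-local IP."""
    if host == "localhost":
        return True
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        return False  # a hostname; DNS resolution is out of scope for this sketch
    return ip.is_loopback or ip.is_private or ip.is_link_local


def build_extract_args(
    url: str, mode: str = "full", max_chars: int = 8000, allow_private: bool = False
) -> Dict[str, Any]:
    """Validate inputs against the documented schema and return the argument dict."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or not parts.hostname:
        raise ValueError("url must be an http(s) URL with a host")
    if not allow_private and _is_private_literal(parts.hostname):
        raise ValueError("loopback/private addresses are blocked")
    if mode not in VALID_MODES:
        raise ValueError(f"mode must be one of {sorted(VALID_MODES)}")
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    return {"url": url, "mode": mode, "max_chars": max_chars}
```

For example, `build_extract_args("https://example.org/paper.pdf", mode="abstract")` yields the arguments for a cheap survey request, with `max_chars` left at the documented default.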
Output Schema
| Name | Required | Description | Default |
|---|---|---|---|
| url | No | ||
| title | No | ||
| content | No | ||
| excerpt | No | ||
| length | No | ||
| is_pdf | No | ||
| page_count | No | ||
| extraction_quality | No | ||
| elapsed_ms | No | ||
| error | No | ||
| meta | No | ||
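
Because every output field is optional and failures are reported in `error` rather than thrown, a caller may want to read the result defensively. The sketch below is one way to do that. The field types are assumptions (the schema above does not describe them): `content` and `title` are treated as strings, `is_pdf` as a boolean, and `page_count` as an integer. `describe_result` is a hypothetical helper.

```python
from typing import Any, Dict


def describe_result(result: Dict[str, Any]) -> str:
    """Summarize a tool result in one line, tolerating missing fields."""
    error = result.get("error")
    if error:
        # Best-effort tool: a failure arrives as data in `error`.
        return f"failed: {error}"

    kind = "PDF" if result.get("is_pdf") else "HTML"
    title = result.get("title") or "(untitled)"
    content = result.get("content") or ""
    pages = result.get("page_count")
    suffix = f", {pages} pages" if pages is not None else ""
    return f"{kind}: {title} ({len(content)} chars{suffix})"
```

Checking `error` first keeps the happy path free of exception handling, which matches the best-effort behavior described in the instructions.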