scrape_article
:
Instructions
Call ping first
New session or unsure the extension is online: ping first, then any scrape*. If EXTENSION_NOT_CONNECTED: ping again, then fix WebSocket using error.details.bridge, MCP stderr, and ~/.lionscraper/port, then retry.
lang (optional)
en-US | zh-CN: human-readable errors for this call; omitted → English; Chinese users pass lang: "zh-CN" on each call.
Do not substitute raw HTTP
When you need a real browser DOM, logged-in session cookies, JS-rendered content (SPAs), extension-side pagination or multi-URL scheduling, or structured field extraction, do not use WebFetch, curl, wget, or cookie-less IDE fetch tools instead of this server’s ping + scrape*. Only if the page is fully public and mostly static and the user clearly wants a trivial GET of raw HTML may you consider a plain HTTP client.
Purpose
Extract Markdown body and metadata (title, author, time, …) from single-column long-form pages.
When to use
Default for news, blogs, long docs, long product copy—one main reading flow per URL.
Listing/home with only links: use
scrapeorscrape_urlsfor URLs, thenscrape_article({ url: [...] }).
Returns
MultiUrlResult; success items include body (Markdown), title, quality, method, … in data.
Fallback
If body is empty or clearly wrong, call scrape on the same URL(s).
Parameters
Same common fields as other scrape tools (url, lang, timeouts, waitForScroll, …). waitForScroll.scrollSpeed ≠ top-level scrollSpeed.
Limits
Weak on heavy SPAs; listings need URL discovery first.
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| url | Yes | One http(s) URL or string[] (batch). Max 50 URLs per request (extension-enforced). Server forwards as-is. | |
| lang | No | BCP 47 for human-readable errors this call: en-US | zh-CN. Omitted → English. Pass zh-CN when the user works in Chinese; the Server cannot infer chat language. | |
| delay | No | Milliseconds to wait after load before extraction (default 0). Use for late-rendered DOM. | |
| waitForScroll | No | Scroll the page (or a container) before extraction to trigger lazy-loaded content. Distinct from top-level scrollSpeed. | |
| timeoutMs | No | Per-URL task timeout for the extension (default 60000). Not the MCP WebSocket wait; see bridgeTimeoutMs. | |
| bridgeTimeoutMs | No | MCP Server only: max ms to wait for one tool call on the WebSocket bridge (capped). Omitted → derived from URL count, maxPages, timeoutMs, scrapeInterval. Stripped before forwarding to the extension. | |
| includeHtml | No | If true, include document.documentElement.outerHTML in result meta for that URL. | |
| includeText | No | If true, include document.body.innerText in result meta for that URL. | |
| scrapeInterval | No | Ms between starting tasks in a multi-URL run. Omitted → extension default; extension enforces min 500ms. | |
| concurrency | No | Parallel tabs for multi-URL runs. Omitted → extension default; extension enforces max 3. | |
| scrollSpeed | No | Optional global scroll speed (px) for batch tuning—not the same as waitForScroll.scrollSpeed. |