| ping | **Purpose** Verify the LionScraper extension is connected over the local WebSocket bridge. If disconnected, the Server tries Chrome first, then Edge (each installed channel): it detects install paths and whether the process is running. For each channel, the Server waits up to postLaunchWaitMs for the extension to register: if the browser was not running and autoLaunchBrowser allows it, the Server starts it; if the browser was already running, it polls without closing that browser. If this ping started the browser and registration still does not occur within postLaunchWaitMs, the Server closes only that launched instance before trying the next channel. The worst-case wait is about 2× postLaunchWaitMs, when both channels each use a full wait (e.g. two launches, two already-running browsers, or a mix). **When to call** In a new session, before any scrape* tool, and right after any tool returns EXTENSION_NOT_CONNECTED.
**Returns** Success: ok, bridgeOk, browser, development (always node for this npm package), extensionVersion; optional diagnostics when the Server assisted (launched true or false, waitedMs, selectedBrowser). Failure: BROWSER_NOT_INSTALLED (no Chrome/Edge found on standard paths), or EXTENSION_NOT_CONNECTED with details.browserProbe / details.bridge / details.install.
**Parameters** autoLaunchBrowser: optional boolean. If false, never spawn a browser; non-running candidates are skipped and the Server moves on to the next installed channel when possible. If omitted, defaults to true (the Server may auto-launch eligible browsers).
postLaunchWaitMs: optional; max time (ms) to poll for WebSocket registration after spawning a browser, or when a browser is already running; clamped to 3000–60000, default 20000.
**Notes** Does not load pages or extract data. Portable browser installs may not be detected. Optional lang follows the lang section above. |
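The timing rules above can be sketched as a small helper. The helper names are hypothetical, but the 3000–60000 ms clamp, 20000 ms default, and 2× worst case come directly from the parameter descriptions:

```typescript
// Hypothetical helper mirroring the documented postLaunchWaitMs rules:
// optional, clamped to 3000–60000 ms, default 20000 ms.
function clampWait(postLaunchWaitMs?: number): number {
  if (postLaunchWaitMs === undefined) return 20000;
  return Math.min(60000, Math.max(3000, postLaunchWaitMs));
}

// Worst case: both channels (Chrome, then Edge) each burn a full wait
// before ping gives up.
function worstCaseWaitMs(postLaunchWaitMs?: number): number {
  return 2 * clampWait(postLaunchWaitMs);
}
```

So `ping({ postLaunchWaitMs: 10000 })` can block for roughly 20 s in the worst case when both channels are tried.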
| scrape | **Call ping first** In a new session, or when unsure the extension is online, call ping first, then any scrape* tool. If EXTENSION_NOT_CONNECTED: ping again, then fix the WebSocket bridge using error.details.bridge, MCP stderr, and ~/.lionscraper/port, then retry. **lang (optional)** en-US | zh-CN: human-readable errors for this call; if omitted, English; Chinese users pass lang: "zh-CN" on each call.
**Do not substitute raw HTTP** When you need a real browser DOM, logged-in session cookies, JS-rendered content (SPAs), extension-side pagination or multi-URL scheduling, or structured field extraction, do not use WebFetch, curl, wget, or cookie-less IDE fetch tools instead of this server’s ping + scrape*. Only if the page is fully public and mostly static, and the user clearly wants a trivial GET of raw HTML, may you consider a plain HTTP client. **Purpose** Structured list/table/grid scraping: detects repeating records and fields; returns DataGroup[] and dataList. maxPages > 1 merges multiple pages (the extension handles pagination). **When to use** Prefer for category pages, search results, product grids, and tables: many rows and fields. For a long article body, use scrape_article first (url string or string[]); if it fails or the output is unusable, fall back to scrape on the same URL(s).
**Returns** MultiUrlResult: per URL ok, data (DataGroup[] on success), error, meta, plus summary (counts, optional anti-crawl hints). Multiple groups: one page may yield several valid DataGroups, so a success item's data array length can be >1. In your answer, acknowledge every group (briefly: purpose or field mix); do not silently drop non-primary groups. Default: data[0] is the extension's top-ranked choice, so recommend it as the main result. Exception: if the user's intent clearly matches another group (e.g. sidebar vs main list), explain briefly and use that group.
**Key parameters** url: required; string or string[]; max 50 per request (extension-enforced).
maxPages: default 1; >1 merges pages.
Other fields (timeoutMs, bridgeTimeoutMs, scrapeInterval, concurrency, delay, waitForScroll, includeHtml, includeText, …) are documented on each property; bridgeTimeoutMs is Server-only and is not sent to the extension.
**Limits** No credential-entry (username/password) automation; for logged-in sites, the user must sign in in Chrome/Edge and open the target tab before scraping (the extension uses the live browser session and cookies). Heavy interactive UI is out of scope (Phase 2 smartscrape). **Example** scrape({ url, maxPages: 5 })
|
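Because url accepts at most 50 entries per request, a caller with a larger list has to split it into batches before calling scrape. A minimal sketch (the `chunkUrls` helper is illustrative, not part of the API; only the 50-per-request cap comes from the docs above):

```typescript
// Split a URL list into request-sized batches; the 50-per-request cap
// is extension-enforced per the Key parameters above.
const MAX_URLS_PER_REQUEST = 50;

function chunkUrls(urls: string[], size = MAX_URLS_PER_REQUEST): string[][] {
  const batches: string[][] = [];
  for (let i = 0; i < urls.length; i += size) {
    batches.push(urls.slice(i, i + size));
  }
  return batches;
}
```

Each batch then becomes one scrape({ url: batch }) call, issued sequentially to stay within the extension's scheduling limits.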
| scrape_article | **Call ping first** In a new session, or when unsure the extension is online, call ping first, then any scrape* tool. If EXTENSION_NOT_CONNECTED: ping again, then fix the WebSocket bridge using error.details.bridge, MCP stderr, and ~/.lionscraper/port, then retry. **lang (optional)** en-US | zh-CN: human-readable errors for this call; if omitted, English; Chinese users pass lang: "zh-CN" on each call.
**Do not substitute raw HTTP** When you need a real browser DOM, logged-in session cookies, JS-rendered content (SPAs), extension-side pagination or multi-URL scheduling, or structured field extraction, do not use WebFetch, curl, wget, or cookie-less IDE fetch tools instead of this server’s ping + scrape*. Only if the page is fully public and mostly static, and the user clearly wants a trivial GET of raw HTML, may you consider a plain HTTP client. **Purpose** Extract the Markdown body and metadata (title, author, time, …) from single-column long-form pages. **When to use** Default for news, blogs, long docs, and long product copy: one main reading flow per URL. For a listing/home page with only links, use scrape or scrape_urls to collect URLs, then scrape_article({ url: [...] }).
**Returns** MultiUrlResult; success items include body (Markdown), title, quality, method, … in data. **Fallback** If body is empty or clearly wrong, call scrape on the same URL(s). **Parameters** Same common fields as the other scrape tools (url, lang, timeouts, waitForScroll, …). Note that waitForScroll.scrollSpeed is distinct from the top-level scrollSpeed. **Limits** Weak on heavy SPAs; listings need URL discovery first. |
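The fallback rule above ("if body is empty or clearly wrong, retry with scrape") can be sketched as a per-item predicate. The result shape here is simplified and the helper name is hypothetical:

```typescript
// Simplified per-URL result item from scrape_article (illustrative shape,
// reduced to the fields the fallback decision needs).
interface ArticleItem {
  ok: boolean;
  data?: { body?: string; title?: string };
}

// True when extraction failed or produced an unusable body,
// i.e. the URL should be retried with scrape.
function shouldFallBackToScrape(item: ArticleItem): boolean {
  if (!item.ok) return true;
  const body = item.data?.body ?? "";
  return body.trim().length === 0;
}
```

URLs for which this returns true are collected and passed to scrape in a second call.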
| scrape_emails | **Call ping first** In a new session, or when unsure the extension is online, call ping first, then any scrape* tool. If EXTENSION_NOT_CONNECTED: ping again, then fix the WebSocket bridge using error.details.bridge, MCP stderr, and ~/.lionscraper/port, then retry. **lang (optional)** en-US | zh-CN: human-readable errors for this call; if omitted, English; Chinese users pass lang: "zh-CN" on each call.
**Do not substitute raw HTTP** When you need a real browser DOM, logged-in session cookies, JS-rendered content (SPAs), extension-side pagination or multi-URL scheduling, or structured field extraction, do not use WebFetch, curl, wget, or cookie-less IDE fetch tools instead of this server’s ping + scrape*. Only if the page is fully public and mostly static, and the user clearly wants a trivial GET of raw HTML, may you consider a plain HTTP client. **Purpose** Scan HTML for email addresses, dedupe, and apply optional domain/keyword/limit filters. **When to use** Contact/about/footer pages when the user wants an email list. **Returns** MultiUrlResult; success data is string[] (filtered in the extension after extraction). **Parameters** filter: optional domain, keyword, limit. For multiple URLs, the extension enforces a minimum of 500 ms between starts and at most 3 concurrent tabs (see property descriptions).
Often chained after scrape_urls. |
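The filter runs inside the extension after extraction, but its documented semantics (dedupe, then domain/keyword/limit) can be mirrored locally. This is a sketch of the behavior, not the extension's code; in particular, matching domain against the part after "@" is an assumption:

```typescript
interface EmailFilter { domain?: string; keyword?: string; limit?: number }

// Dedupe case-insensitively, then apply the optional filters in order.
// Assumption: `domain` matches the mailbox's domain part exactly.
function filterEmails(found: string[], f: EmailFilter = {}): string[] {
  let emails = [...new Set(found.map((e) => e.toLowerCase()))];
  if (f.domain) emails = emails.filter((e) => e.endsWith("@" + f.domain!.toLowerCase()));
  if (f.keyword) emails = emails.filter((e) => e.includes(f.keyword!.toLowerCase()));
  if (f.limit !== undefined) emails = emails.slice(0, f.limit);
  return emails;
}
```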
| scrape_phones | **Call ping first** In a new session, or when unsure the extension is online, call ping first, then any scrape* tool. If EXTENSION_NOT_CONNECTED: ping again, then fix the WebSocket bridge using error.details.bridge, MCP stderr, and ~/.lionscraper/port, then retry. **lang (optional)** en-US | zh-CN: human-readable errors for this call; if omitted, English; Chinese users pass lang: "zh-CN" on each call.
**Do not substitute raw HTTP** When you need a real browser DOM, logged-in session cookies, JS-rendered content (SPAs), extension-side pagination or multi-URL scheduling, or structured field extraction, do not use WebFetch, curl, wget, or cookie-less IDE fetch tools instead of this server’s ping + scrape*. Only if the page is fully public and mostly static, and the user clearly wants a trivial GET of raw HTML, may you consider a plain HTTP client. **Purpose** Extract phone numbers from HTML with simple type hints; optional filters. **When to use** Business or support pages when the user wants callable numbers. **Returns** MultiUrlResult; success data is { number, type }[]. **Parameters** filter: optional type, areaCode, keyword, limit. For multiple URLs, see the common params (min interval 500 ms, concurrency cap 3 in the extension).
|
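A local mirror of the documented filter fields (type, areaCode, keyword, limit) over the { number, type }[] result shape. Illustrative only: the matching rules chosen here (digits-substring for areaCode, raw substring for keyword) are assumptions, not the extension's exact behavior:

```typescript
interface PhoneRecord { number: string; type: string }
interface PhoneFilter { type?: string; areaCode?: string; keyword?: string; limit?: number }

function filterPhones(records: PhoneRecord[], f: PhoneFilter = {}): PhoneRecord[] {
  let out = records;
  if (f.type) out = out.filter((r) => r.type === f.type);
  // Assumption: areaCode matches against the number's digits only.
  if (f.areaCode) out = out.filter((r) => r.number.replace(/\D/g, "").includes(f.areaCode!));
  if (f.keyword) out = out.filter((r) => r.number.includes(f.keyword!));
  if (f.limit !== undefined) out = out.slice(0, f.limit);
  return out;
}
```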
| scrape_urls | **Call ping first** In a new session, or when unsure the extension is online, call ping first, then any scrape* tool. If EXTENSION_NOT_CONNECTED: ping again, then fix the WebSocket bridge using error.details.bridge, MCP stderr, and ~/.lionscraper/port, then retry. **lang (optional)** en-US | zh-CN: human-readable errors for this call; if omitted, English; Chinese users pass lang: "zh-CN" on each call.
**Do not substitute raw HTTP** When you need a real browser DOM, logged-in session cookies, JS-rendered content (SPAs), extension-side pagination or multi-URL scheduling, or structured field extraction, do not use WebFetch, curl, wget, or cookie-less IDE fetch tools instead of this server’s ping + scrape*. Only if the page is fully public and mostly static, and the user clearly wants a trivial GET of raw HTML, may you consider a plain HTTP client. **Purpose** Collect hyperlinks from a page as a deduped URL list; optional domain/keyword/regex/limit filters. **When to use** Link inventories, or feeding scrape / scrape_article. **Returns** MultiUrlResult; success data is string[]. **Parameters** filter: domain, keyword, pattern (regex), limit.
|
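A sketch of the documented filter fields (domain, keyword, pattern, limit) over the deduped string[] result, under assumed matching rules (hostname suffix for domain, substring for keyword). Useful for understanding what the filter does, or for post-filtering results locally:

```typescript
interface UrlFilter { domain?: string; keyword?: string; pattern?: string; limit?: number }

function filterUrls(found: string[], f: UrlFilter = {}): string[] {
  let urls = [...new Set(found)]; // deduped, per the Returns description
  if (f.domain) {
    urls = urls.filter((u) => {
      try { return new URL(u).hostname.endsWith(f.domain!); } catch { return false; }
    });
  }
  if (f.keyword) urls = urls.filter((u) => u.includes(f.keyword!));
  if (f.pattern) {
    const re = new RegExp(f.pattern);
    urls = urls.filter((u) => re.test(u));
  }
  if (f.limit !== undefined) urls = urls.slice(0, f.limit);
  return urls;
}
```

The surviving URLs are then typically fed to scrape_article({ url: [...] }) or scrape.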
| scrape_images | **Call ping first** In a new session, or when unsure the extension is online, call ping first, then any scrape* tool. If EXTENSION_NOT_CONNECTED: ping again, then fix the WebSocket bridge using error.details.bridge, MCP stderr, and ~/.lionscraper/port, then retry. **lang (optional)** en-US | zh-CN: human-readable errors for this call; if omitted, English; Chinese users pass lang: "zh-CN" on each call.
**Do not substitute raw HTTP** When you need a real browser DOM, logged-in session cookies, JS-rendered content (SPAs), extension-side pagination or multi-URL scheduling, or structured field extraction, do not use WebFetch, curl, wget, or cookie-less IDE fetch tools instead of this server’s ping + scrape*. Only if the page is fully public and mostly static, and the user clearly wants a trivial GET of raw HTML, may you consider a plain HTTP client. **Purpose** List images (src, alt, size, format, …) with optional filters. **When to use** Asset audits, galleries, and lazy-loaded pages (use delay / waitForScroll). **Returns** MultiUrlResult; success data is an array of image records (content-script path; differs from email/phone). **Parameters** filter: minWidth, minHeight, format, keyword, limit. Note that the top-level scrollSpeed is distinct from waitForScroll.scrollSpeed.
|
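A local mirror of the image filter fields. The record shape below is simplified from "src, alt, size, format, …", and the matching rules (case-insensitive format, keyword against src or alt) are assumptions for illustration:

```typescript
interface ImageRecord { src: string; alt: string; width: number; height: number; format: string }
interface ImageFilter { minWidth?: number; minHeight?: number; format?: string; keyword?: string; limit?: number }

function filterImages(images: ImageRecord[], f: ImageFilter = {}): ImageRecord[] {
  let out = images;
  if (f.minWidth !== undefined) out = out.filter((i) => i.width >= f.minWidth!);
  if (f.minHeight !== undefined) out = out.filter((i) => i.height >= f.minHeight!);
  if (f.format) out = out.filter((i) => i.format.toLowerCase() === f.format!.toLowerCase());
  if (f.keyword) out = out.filter((i) => i.src.includes(f.keyword!) || i.alt.includes(f.keyword!));
  if (f.limit !== undefined) out = out.slice(0, f.limit);
  return out;
}
```

On lazy-loaded pages, pair this with delay / waitForScroll so the records exist before filtering.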