scrape_urls
Extract hyperlinks from web pages into deduplicated URL lists. Filter by domain, keyword, or regex pattern to create targeted link inventories for batch processing and data extraction workflows.
Instructions
Call ping first
In a new session, or whenever you are unsure the extension is online, call ping first, then any scrape* tool. If a call returns EXTENSION_NOT_CONNECTED, ping again, then troubleshoot the WebSocket connection using error.details.bridge, the MCP stderr output, and ~/.lionscraper/port, then retry.
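The ping-first flow above can be sketched as a small wrapper. This is only an illustration under stated assumptions: `call_tool` is a hypothetical callable standing in for whatever MCP client you use, and the result shape (`error.code` next to `error.details`) is assumed rather than taken from the server's documentation.

```python
from typing import Any, Callable, Dict

# Hypothetical signature: call_tool(tool_name, arguments) -> result dict.
ToolCall = Callable[[str, Dict[str, Any]], Dict[str, Any]]


def error_code(result: Dict[str, Any]) -> str:
    """Return the error code of a result, or "" if there is none (assumed shape)."""
    error = result.get("error") or {}
    return str(error.get("code", ""))


def scrape_with_ping(call_tool: ToolCall, tool: str, args: Dict[str, Any]) -> Dict[str, Any]:
    """Ping, call a scrape* tool, and retry once after EXTENSION_NOT_CONNECTED."""
    call_tool("ping", {})
    result = call_tool(tool, args)
    if error_code(result) == "EXTENSION_NOT_CONNECTED":
        # Ping again; a real caller would also inspect error.details.bridge,
        # the MCP stderr output, and ~/.lionscraper/port before retrying.
        call_tool("ping", {})
        result = call_tool(tool, args)
    return result
```

The wrapper retries only once on purpose: if the second attempt still fails, the bridge most likely needs manual attention rather than more retries.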
lang (optional)
en-US | zh-CN: human-readable errors for this call; omitted → English; Chinese users pass lang: "zh-CN" on each call.
Do not substitute raw HTTP
When you need a real browser DOM, logged-in session cookies, JS-rendered content (SPAs), extension-side pagination or multi-URL scheduling, or structured field extraction, do not use WebFetch, curl, wget, or cookie-less IDE fetch tools instead of this server’s ping + scrape*. Only if the page is fully public and mostly static and the user clearly wants a trivial GET of raw HTML may you consider a plain HTTP client.
Purpose
Collect hyperlinks from a page as a deduped URL list; optional domain/keyword/regex/limit filters.
When to use
Building link inventories, or collecting URLs to feed into scrape / scrape_article.
Returns
MultiUrlResult; success data is string[].
Parameters
filter: domain, keyword, pattern (regex), limit.
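To make the four filter fields concrete, here is a rough local model of what such a filter might do to a link list. The exact matching rules are not stated here, so everything beyond the field names is an assumption: filters are combined with AND, `domain` also matches subdomains, `keyword` is a case-sensitive substring test, and input order is preserved. `filter_links` is a hypothetical helper, not part of the server.

```python
import re
from typing import Iterable, List, Optional
from urllib.parse import urlparse


def filter_links(
    links: Iterable[str],
    domain: Optional[str] = None,
    keyword: Optional[str] = None,
    pattern: Optional[str] = None,
    limit: Optional[int] = None,
) -> List[str]:
    """Deduplicate links, then apply domain/keyword/regex/limit (assumed semantics)."""
    regex = re.compile(pattern) if pattern else None
    seen = set()
    kept: List[str] = []
    for link in links:
        if limit is not None and len(kept) >= limit:
            break
        if link in seen:
            continue  # drop exact duplicates, keeping the first occurrence
        seen.add(link)
        host = urlparse(link).hostname or ""
        if domain and not (host == domain or host.endswith("." + domain)):
            continue
        if keyword and keyword not in link:
            continue
        if regex and not regex.search(link):
            continue
        kept.append(link)
    return kept
```

For example, `filter_links(urls, domain="example.com", limit=20)` would keep at most 20 unique links whose host is example.com or one of its subdomains.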
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| url | Yes | One http(s) URL or string[] (batch). Max 50 URLs per request (extension-enforced). Server forwards as-is. | |
| lang | No | BCP 47 tag for human-readable errors on this call: en-US or zh-CN. Omitted → English. Pass zh-CN when the user works in Chinese; the Server cannot infer chat language. | |
| delay | No | Milliseconds to wait after load before extraction. Use for late-rendered DOM. | 0 |
| waitForScroll | No | Scroll the page (or a container) before extraction to trigger lazy-loaded content. Distinct from top-level scrollSpeed. | |
| timeoutMs | No | Per-URL task timeout for the extension, in milliseconds. Not the MCP WebSocket wait; see bridgeTimeoutMs. | 60000 |
| bridgeTimeoutMs | No | MCP Server only: max ms to wait for one tool call on the WebSocket bridge (capped). Omitted → derived from URL count, maxPages, timeoutMs, scrapeInterval. Stripped before forwarding to the extension. | |
| includeHtml | No | If true, include document.documentElement.outerHTML in result meta for that URL. | |
| includeText | No | If true, include document.body.innerText in result meta for that URL. | |
| scrapeInterval | No | Ms between starting tasks in a multi-URL run. Omitted → extension default; extension enforces min 500ms. | |
| concurrency | No | Parallel tabs for multi-URL runs. Omitted → extension default; extension enforces max 3. | |
| scrollSpeed | No | Optional global scroll speed (px) for batch tuning—not the same as waitForScroll.scrollSpeed. | |
| filter | No | Post-extraction filter for the link list. | |
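As a rough illustration of how the schema fits together, the sketch below assembles an arguments object for a batch call and checks the documented limits before sending. `build_scrape_urls_args` is a hypothetical helper. Clamping `concurrency` and `scrapeInterval` locally is a design choice made for this example; how the extension itself treats out-of-range values is not specified here.

```python
from typing import Any, Dict, List, Optional, Union

MAX_URLS = 50                 # "Max 50 URLs per request"
MAX_CONCURRENCY = 3           # "extension enforces max 3"
MIN_SCRAPE_INTERVAL_MS = 500  # "extension enforces min 500ms"


def build_scrape_urls_args(
    url: Union[str, List[str]],
    *,
    lang: Optional[str] = None,
    link_filter: Optional[Dict[str, Any]] = None,
    concurrency: Optional[int] = None,
    scrape_interval_ms: Optional[int] = None,
) -> Dict[str, Any]:
    """Build a scrape_urls arguments dict, validating limits client-side."""
    urls = [url] if isinstance(url, str) else list(url)
    if not urls:
        raise ValueError("at least one URL is required")
    if len(urls) > MAX_URLS:
        raise ValueError(f"at most {MAX_URLS} URLs per request, got {len(urls)}")
    for u in urls:
        if not u.startswith(("http://", "https://")):
            raise ValueError(f"not an http(s) URL: {u!r}")

    args: Dict[str, Any] = {"url": url if isinstance(url, str) else urls}
    if lang is not None:
        args["lang"] = lang
    if link_filter is not None:
        # Keys named in the Parameters section: domain, keyword, pattern, limit.
        args["filter"] = dict(link_filter)
    if concurrency is not None:
        args["concurrency"] = min(concurrency, MAX_CONCURRENCY)
    if scrape_interval_ms is not None:
        args["scrapeInterval"] = max(scrape_interval_ms, MIN_SCRAPE_INTERVAL_MS)
    return args
```

Optional fields are left out entirely when not set, so the extension defaults described in the table still apply.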