scrape_urls

Extract hyperlinks from web pages into deduplicated URL lists. Filter by domain, keyword, or regex pattern to create targeted link inventories for batch processing and data extraction workflows.

Instructions

Call ping first

At the start of a new session, or whenever you are unsure the extension is online, call ping first, then any scrape* tool. If you receive EXTENSION_NOT_CONNECTED: ping again, then fix the WebSocket connection using error.details.bridge, the MCP stderr output, and ~/.lionscraper/port, then retry.
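The ping-first flow above can be sketched as follows. This is a hedged illustration, not the server's implementation: `call_tool` is a hypothetical stand-in for whatever your MCP client uses to invoke tools, and the single-retry policy is an assumption; only the tool names and the EXTENSION_NOT_CONNECTED code come from this document.

```python
def scrape_with_ping(call_tool, tool_name, arguments, max_retries=1):
    """Call ping before a scrape* tool, pinging again if the bridge
    reports EXTENSION_NOT_CONNECTED (hypothetical client sketch)."""
    for attempt in range(max_retries + 1):
        ping = call_tool("ping", {})
        if ping.get("error", {}).get("code") != "EXTENSION_NOT_CONNECTED":
            # Extension is reachable: safe to issue the real scrape* call.
            return call_tool(tool_name, arguments)
        # Bridge is down. In practice, inspect error.details.bridge,
        # MCP stderr, and ~/.lionscraper/port before pinging again.
    return ping
```

The same wrapper works for scrape and scrape_article, since the ping prerequisite applies to every scrape* tool.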

lang (optional)

Accepts en-US | zh-CN and sets the language of human-readable errors for this call. Omitted → English. Chinese users should pass lang: "zh-CN" on each call.

Do not substitute raw HTTP

When you need a real browser DOM, logged-in session cookies, JS-rendered content (SPAs), extension-side pagination or multi-URL scheduling, or structured field extraction, do not use WebFetch, curl, wget, or cookie-less IDE fetch tools in place of this server's ping + scrape*. Consider a plain HTTP client only when the page is fully public, mostly static, and the user clearly wants a trivial GET of raw HTML.

Purpose

Collect hyperlinks from a page as a deduped URL list; optional domain/keyword/regex/limit filters.
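As a rough client-side model of the dedupe-then-filter behavior described above, the sketch below applies the same four filter fields to a URL list. The extension performs this server-side; the exact matching rules here (first-occurrence dedup order, exact netloc match for domain, substring match for keyword) are assumptions for illustration.

```python
import re
from urllib.parse import urlparse

def filter_links(urls, domain=None, keyword=None, pattern=None, limit=None):
    """Dedupe a URL list, then apply optional domain/keyword/regex/limit
    filters (client-side approximation of the tool's semantics)."""
    seen, out = set(), []
    for url in urls:
        if url in seen:
            continue
        seen.add(url)
        if domain and urlparse(url).netloc != domain:
            continue
        if keyword and keyword not in url:
            continue
        if pattern and not re.search(pattern, url):
            continue
        out.append(url)
        if limit is not None and len(out) >= limit:
            break
    return out
```

Running the real tool with filter: {"domain": ..., "pattern": ...} should yield a list shaped like this function's output: deduplicated, filtered, and capped at limit.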

When to use

Building link inventories, or feeding URLs to scrape / scrape_article.

Returns

MultiUrlResult; on success, data is string[].

Parameters

filter: an object with optional domain, keyword, pattern (regex), and limit fields.

Input Schema

url (required): One http(s) URL or string[] (batch). Max 50 URLs per request (extension-enforced). Server forwards as-is.
lang (optional): BCP 47 tag for human-readable errors on this call: en-US | zh-CN. Omitted → English. Pass zh-CN when the user works in Chinese; the server cannot infer chat language.
delay (optional): Milliseconds to wait after load before extraction (default 0). Use for late-rendered DOM.
waitForScroll (optional): Scroll the page (or a container) before extraction to trigger lazy-loaded content. Distinct from top-level scrollSpeed.
timeoutMs (optional): Per-URL task timeout for the extension (default 60000). Not the MCP WebSocket wait; see bridgeTimeoutMs.
bridgeTimeoutMs (optional): MCP server only: max ms to wait for one tool call on the WebSocket bridge (capped). Omitted → derived from URL count, maxPages, timeoutMs, and scrapeInterval. Stripped before forwarding to the extension.
includeHtml (optional): If true, include document.documentElement.outerHTML in the result meta for that URL.
includeText (optional): If true, include document.body.innerText in the result meta for that URL.
scrapeInterval (optional): Ms between starting tasks in a multi-URL run. Omitted → extension default; the extension enforces a minimum of 500 ms.
concurrency (optional): Parallel tabs for multi-URL runs. Omitted → extension default; the extension enforces a maximum of 3.
scrollSpeed (optional): Global scroll speed (px) for batch tuning; not the same as waitForScroll.scrollSpeed.
filter (optional): Post-extraction filter for the link list.
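A hedged example request built from the schema above. The field names and the three enforced limits (50 URLs max, scrapeInterval ≥ 500 ms, concurrency ≤ 3) come from this document; the validation helper itself is illustrative, since in practice the extension enforces these limits server-side.

```python
def validate_scrape_urls_args(args):
    """Pre-flight check of a scrape_urls payload against the
    extension-enforced limits stated in the input schema."""
    urls = args["url"]
    if isinstance(urls, list):
        assert len(urls) <= 50, "max 50 URLs per request"
    if "scrapeInterval" in args:
        assert args["scrapeInterval"] >= 500, "extension enforces min 500 ms"
    if "concurrency" in args:
        assert 1 <= args["concurrency"] <= 3, "extension enforces max 3 tabs"
    return args

request = validate_scrape_urls_args({
    "url": ["https://example.com/blog", "https://example.com/docs"],
    "lang": "en-US",
    "timeoutMs": 60000,
    "scrapeInterval": 500,
    "concurrency": 2,
    "filter": {"domain": "example.com", "pattern": r"/blog/", "limit": 20},
})
```

Catching these limits client-side avoids burning a bridge round-trip on a request the extension would reject anyway.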

Behavior: 4/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

No annotations provided, so description carries full burden. Discloses return type (MultiUrlResult with string[] data), extension-enforced limits (50 URLs, min 500ms interval), connection error handling (EXTENSION_NOT_CONNECTED), and WebSocket bridge behavior. Does not explicitly state read-only nature, but 'Collect' implies non-destructive.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness: 4/5

Is the description appropriately sized, front-loaded, and free of redundancy?

Well-structured with markdown headers front-loading operational prerequisites (ping). Lengthy but every section serves agent decision-making: connection troubleshooting, localization, substitution warnings, and purpose. Could be slightly tighter but earns its length given complexity.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness: 5/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Comprehensive for a 12-parameter tool with nested objects and no output schema. Covers prerequisites, error handling (EXTENSION_NOT_CONNECTED), return format (MultiUrlResult), usage context, and consumer relationship to sibling tools. Fully compensates for missing annotations and output schema.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters: 3/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

Schema coverage is 100% with detailed descriptions for all 12 parameters including nested objects. Description adds minimal beyond schema: specific guidance for 'lang' (Chinese users should pass 'zh-CN') and brief filter field recap. Baseline 3 appropriate since schema does heavy lifting.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose: 5/5

Does the description clearly state what the tool does and how it differs from similar tools?

Clear specific purpose: 'Collect hyperlinks from a page as a deduped URL list'. Explicitly distinguishes from siblings by contrasting with raw HTTP tools (curl/wget) and positioning as input for 'scrape' / 'scrape_article'. Verb+resource+filters are specific.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines: 5/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

Excellent explicit guidance: 'When to use' section specifies link inventories or feeding scrape/scrape_article; 'Do not substitute' section explicitly lists alternatives (WebFetch, curl, wget) with conditions for when NOT to use them; 'Call ping first' provides prerequisite workflow.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
