alterlab-mcp-server

by RapierCraft

alterlab_scrape

Extract web page content as markdown, JSON, or structured data with automatic anti-bot bypass. Supports JavaScript rendering, proxy rotation, and extraction schemas for AI workflows.

Instructions

Scrape a URL and return its content as markdown, text, HTML, JSON, or structured sections. Automatically handles anti-bot protection with tier escalation. Returns markdown by default, which is optimized for LLM context.

Supports GET (default) and POST/PUT/PATCH/DELETE/HEAD via the method parameter. Use method='POST' with body for GraphQL APIs, REST endpoints, and form submissions. For GraphQL: set body='{"query": "{ ... }"}' and method='POST'.

Use render_js=true for JavaScript-heavy sites (React, Angular, SPAs). Use render_js='auto' for mixed sites to detect JS needs per page (saves 30-60%). Use use_proxy=true for geo-restricted or heavily protected sites.

Use formats=['json_v2'] for a structured section tree (headings + content blocks). Use formats=['rag'] for chunked text optimized for RAG pipelines. Use formats=['content'] for AI/KB pipelines; it returns body_markdown, content_hash, images, and links.

Use extraction_schema to extract structured fields from the page using an LLM (add formats=['json'] to retrieve the result in content.json; it is also available in filtered_content). Use extraction_prompt for natural language extraction instructions (mutually exclusive with extraction_schema). Use extraction_profile to select a pre-built extraction template (product, article, job_posting, etc.). Use extraction_provider to select a specific BYOK LLM provider (openai, anthropic, openrouter, groq).

Supports authenticated scraping via session_id (stored session) or inline cookies. Use scroll_to_load=true for infinite-scroll pages that lazy-load content. Use location.country to scrape geo-targeted content.
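As a rough illustration of the GraphQL guidance above, the following sketch builds an arguments object for alterlab_scrape. The helper name build_graphql_args and the example.com endpoint are hypothetical; only the url, method, and body parameters come from the description. How the arguments are sent depends on the MCP client in use and is not shown.

```python
import json


def build_graphql_args(url, query, variables=None):
    """Build an alterlab_scrape arguments dict for a GraphQL POST request.

    Hypothetical helper: it only assembles the documented parameters.
    """
    payload = {"query": query}
    if variables is not None:
        payload["variables"] = variables
    return {
        "url": url,
        "method": "POST",
        # Per the description, body is a JSON *string*, not a nested object.
        "body": json.dumps(payload),
    }


# Placeholder endpoint and query, for illustration only.
args = build_graphql_args("https://example.com/graphql", "{ user { id name } }")
print(json.dumps(args, indent=2))
```

Serializing the payload with json.dumps avoids hand-escaping the quotes inside the query string.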

Input Schema

| Name | Required | Description | Default |
|---|---|---|---|
| url | Yes | URL to scrape | |
| method | No | HTTP method for the request. Default GET (standard page scraping). Use POST for GraphQL endpoints, form submissions, REST API calls. Use PUT/PATCH for REST API updates. When using POST/PUT/PATCH, provide body with the request payload. | GET |
| body | No | Request body for POST/PUT/PATCH requests. For GraphQL: JSON string with 'query' and optional 'variables' fields (e.g., '{"query": "{ user { id name } }"}'). For REST APIs: JSON-encoded payload string. For form submissions: URL-encoded key=value pairs (e.g., 'name=Alice&email=alice@example.com'). Omit for GET/HEAD/DELETE requests. | |
| mode | No | Scraping mode: auto (recommended), html, js (headless browser), pdf, or ocr | auto |
| formats | No | Output formats. 'markdown' is best for LLM consumption. 'json_v2' returns a structured section tree (headings + content blocks). 'rag' returns chunked text optimized for retrieval-augmented generation. 'content' returns body_markdown + content_hash + images + links for AI/KB pipelines. | |
| extraction_schema | No | JSON schema for structured extraction. The API extracts fields matching this schema from the scraped page using an LLM. Result is returned in content.json (add 'json' to formats) and in the top-level filtered_content field. Example: { "title": "string", "price": "number", "in_stock": "boolean" } | |
| extraction_model | No | Per-request LLM model override in provider-specific format (e.g. 'gpt-4o', 'claude-opus-4-5-20251101', 'llama3-70b-8192'). Overrides the model saved in your BYOK key settings for this request only. | |
| extraction_provider | No | LLM provider to use for extraction. Selects the matching BYOK key registered at /dashboard/settings/llm-keys. When omitted, the most recently used registered key is used automatically. Requires extraction_schema or extraction_prompt. | |
| extraction_prompt | No | Natural language extraction instruction. Describes what fields to extract from the page. Mutually exclusive with extraction_schema. Example: "Extract the product name, price, and availability". | |
| extraction_profile | No | Pre-built extraction schema template. auto: detect best template. product: e-commerce product details. article: news/blog article fields. job_posting: job listing fields. faq: FAQ entries. recipe: recipe ingredients and instructions. event: event details. ecommerce_homepage: homepage product listings. directory_listing: directory/listing page entries. | |
| render_js | No | Render JavaScript using a headless browser (forces Tier 4 minimum; no separate add-on charge). Required for JS-heavy sites. Set to 'auto' for smart detection (probes each page and renders only JS-heavy pages with the browser; saves 30-60% on mixed sites). | |
| use_proxy | No | Route through premium proxy (+$0.0002). Helps bypass geo-restrictions and anti-bot protection | |
| proxy_country | No | ISO country code for geo-targeting (e.g., 'US', 'DE'). Requires use_proxy=true | |
| wait_for | No | CSS selector to wait for before extracting content (e.g., '#main-content') | |
| timeout | No | Request timeout in seconds (1-300) | |
| max_response_bytes | No | Soft cap on raw response body size in bytes. When the downloaded HTML exceeds this value it is truncated before extraction. Default: 5 MB (5242880). Set to 0 for no limit. Maximum: 50 MB (52428800). Useful for very large pages where you only need the beginning of the content. | |
| include_raw_html | No | Include raw HTML in the response alongside formatted content | |
| session_id | No | UUID of a stored session for authenticated scraping. Use alterlab_list_sessions to find available sessions. The session's cookies will be injected into the request. | |
| cookies | No | Inline cookies as key-value pairs for authenticated scraping (e.g., {"session_token": "abc123"}). Use this for one-off requests; use session_id for reusable sessions. | |
| scroll_to_load | No | Scroll page to trigger lazy-loaded content (requires render_js). Performs explicit viewport-height scrolls to load dynamic content. Adds ~2-3s latency. | |
| scroll_count | No | Number of scroll iterations when scroll_to_load is enabled (1-10, default 3) | |
| remove_cookie_banners | No | Remove cookie consent banners from HTML before content extraction (free, enabled by default) | |
| location | No | Geo-targeting parameters for localized content scraping. Controls proxy country routing, Accept-Language header, and browser locale. | |

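
Several parameters in the schema constrain one another (mutually exclusive fields, required companions, numeric ranges). The sketch below is a hypothetical client-side pre-flight check, not part of the server; the function name check_scrape_args is an assumption, and the rules are taken only from the descriptions above. The server's actual validation and error messages may differ.

```python
def check_scrape_args(args):
    """Return a list of problems found in an alterlab_scrape arguments dict.

    Hypothetical client-side check based on the documented constraints.
    """
    problems = []
    if not args.get("url"):
        problems.append("url is required")
    has_schema = "extraction_schema" in args
    has_prompt = "extraction_prompt" in args
    if has_schema and has_prompt:
        problems.append("extraction_schema and extraction_prompt are mutually exclusive")
    if "extraction_provider" in args and not (has_schema or has_prompt):
        problems.append("extraction_provider requires extraction_schema or extraction_prompt")
    if "proxy_country" in args and args.get("use_proxy") is not True:
        problems.append("proxy_country requires use_proxy=true")
    # render_js may be True or 'auto'; both are treated as enabled here (an assumption).
    if args.get("scroll_to_load") and not args.get("render_js"):
        problems.append("scroll_to_load requires render_js")
    timeout = args.get("timeout")
    if timeout is not None and not 1 <= timeout <= 300:
        problems.append("timeout must be between 1 and 300 seconds")
    scroll_count = args.get("scroll_count")
    if scroll_count is not None and not 1 <= scroll_count <= 10:
        problems.append("scroll_count must be between 1 and 10")
    max_bytes = args.get("max_response_bytes")
    if max_bytes is not None and not 0 <= max_bytes <= 52428800:
        problems.append("max_response_bytes must be between 0 and 52428800")
    return problems


# Deliberately inconsistent example arguments (placeholder URL).
example = {
    "url": "https://example.com/product/1",
    "formats": ["json"],
    "extraction_schema": {"title": "string", "price": "number", "in_stock": "boolean"},
    "extraction_prompt": "Extract the product name, price, and availability",
    "proxy_country": "DE",
    "timeout": 600,
}
for problem in check_scrape_args(example):
    print(problem)
```

Returning a list rather than raising on the first error lets a caller report every conflict in one pass before spending a request.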
Behavior: 4/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

With no annotations provided, the description carries full burden and discloses important behaviors: automatic anti-bot handling with tier escalation, default markdown output optimized for LLM, and parameter interactions. It does not mention rate limits or cost implications but covers authentication, geo-targeting, and rendering modes.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness: 4/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is long but each sentence is informative. It is front-loaded with core purpose and method usage, then proceeds to other features. While not broken into bullet points, it remains readable and efficient.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness: 4/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Despite 23 parameters and nested objects, the description covers output formats, extraction, anti-bot, authentication, geo-targeting, and rendering. It lacks explanation of error handling or response structure (no output schema), but is otherwise thorough.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters: 4/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

Schema coverage is 100%, so baseline is 3. The description adds significant value beyond schema by explaining combined usage (e.g., extraction_schema with formats=['json'], render_js='auto' saves 30-60%) and providing examples for method and body. This exceeds the baseline.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose: 5/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states the tool's primary purpose: 'Scrape a URL and return its content as markdown, text, HTML, JSON, or structured sections.' It specifies the verb (scrape) and resource (URL), and distinguishes from siblings by emphasizing its general-purpose nature versus crawl/map/search.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines: 4/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description provides explicit usage guidance for various scenarios: using method='POST' for GraphQL, render_js for JS-heavy sites, formats for different pipelines, extraction_schema for structured extraction, etc. It implies when to use each feature but does not explicitly state when not to use certain options or compare to sibling tools.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/RapierCraft/alterlab-mcp-server'
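
For callers who prefer a script to curl, here is a minimal Python sketch of the same GET request using only the standard library. The helper names build_server_request and fetch_server_info are hypothetical, and decoding the response as JSON is an assumption, since the response format is not documented here.

```python
import json
import urllib.request

API_BASE = "https://glama.ai/api/mcp/v1/servers"


def build_server_request(owner, name):
    """Build the GET request for one server entry (hypothetical helper)."""
    return urllib.request.Request(f"{API_BASE}/{owner}/{name}", method="GET")


def fetch_server_info(owner, name, timeout=10):
    """Send the request and decode the body, assuming the API returns JSON."""
    request = build_server_request(owner, name)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Building the request does not touch the network; fetch_server_info does.
request = build_server_request("RapierCraft", "alterlab-mcp-server")
print(request.get_method(), request.full_url)
```

Calling fetch_server_info("RapierCraft", "alterlab-mcp-server") would perform the actual request.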

If you have feedback or need assistance with the MCP directory API, please join our Discord server.