alterlab-mcp-server

by RapierCraft

alterlab_scrape

Scrape URLs into markdown, text, HTML, or JSON using automatic anti-bot protection, JavaScript rendering, and structured extraction.

Instructions

Scrape a URL and return its content as markdown, text, HTML, JSON, or structured sections. Automatically handles anti-bot protection with tier escalation. Returns markdown by default, optimized for LLM context.

Supports GET (default) and POST/PUT/PATCH/DELETE/HEAD via the method parameter. Use method='POST' with body for GraphQL APIs, REST endpoints, and form submissions. For GraphQL: set body='{"query": "{ ... }"}' and method='POST'.

Use render_js=true for JavaScript-heavy sites (React, Angular, SPAs). Use render_js='auto' for mixed sites to detect JS needs per-page (saves 30-60%). Use use_proxy=true for geo-restricted or heavily protected sites.

Use formats=['json_v2'] for a structured section tree (headings + content blocks). Use formats=['rag'] for chunked text optimized for RAG pipelines. Use formats=['content'] for AI/KB pipelines; it returns body_markdown, content_hash, images, links.

Use extraction_schema to extract structured fields from the page using LLM (add formats=['json'] to retrieve result in content.json, also available in filtered_content). Use extraction_prompt for natural language extraction instructions (mutually exclusive with extraction_schema). Use extraction_profile to select a pre-built extraction template (product, article, job_posting, etc.). Use extraction_provider to select a specific BYOK LLM provider (openai, anthropic, openrouter, groq).

Supports authenticated scraping via session_id (stored session) or inline cookies. Use scroll_to_load=true for infinite-scroll pages that lazy-load content. Use location.country to scrape geo-targeted content.
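The body conventions described above can be sketched as two small argument builders. This is only an illustration: the helper names graphql_scrape_args and form_scrape_args and the example.com URLs are hypothetical, and the returned dictionaries would still have to be passed to alterlab_scrape by whatever MCP client is in use. Only the url, method, and body parameter names come from the tool description.

```python
import json
from urllib.parse import urlencode


def graphql_scrape_args(url, query, variables=None):
    """Build alterlab_scrape arguments for a GraphQL POST (hypothetical helper)."""
    payload = {"query": query}
    if variables is not None:
        payload["variables"] = variables
    # The description asks for body as a JSON *string*, not a nested object.
    return {"url": url, "method": "POST", "body": json.dumps(payload)}


def form_scrape_args(url, fields):
    """Build alterlab_scrape arguments for a URL-encoded form submission."""
    # urlencode percent-encodes reserved characters such as '@'.
    return {"url": url, "method": "POST", "body": urlencode(fields)}


gql_args = graphql_scrape_args("https://example.com/graphql", "{ user { id name } }")
form_args = form_scrape_args(
    "https://example.com/signup",
    {"name": "Alice", "email": "alice@example.com"},
)
print(gql_args["body"])
print(form_args["body"])
```

Note that urlencode escapes the '@' in the email address, whereas the example in the schema shows it unescaped; whether the server accepts both forms is not stated in the source.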

Input Schema

| Name | Required | Description | Default |
| --- | --- | --- | --- |
| url | Yes | URL to scrape | |
| body | No | Request body for POST/PUT/PATCH requests. For GraphQL: JSON string with 'query' and optional 'variables' fields (e.g., '{"query": "{ user { id name } }"}'). For REST APIs: JSON-encoded payload string. For form submissions: URL-encoded key=value pairs (e.g., 'name=Alice&email=alice@example.com'). Omit for GET/HEAD/DELETE requests. | |
| mode | No | Scraping mode: auto (recommended), html, js (headless browser), pdf, or ocr | auto |
| method | No | HTTP method for the request. Default GET (standard page scraping). Use POST for GraphQL endpoints, form submissions, REST API calls. Use PUT/PATCH for REST API updates. When using POST/PUT/PATCH, provide body with the request payload. | GET |
| cookies | No | Inline cookies as key-value pairs for authenticated scraping (e.g., {"session_token": "abc123"}). Use this for one-off requests; use session_id for reusable sessions. | |
| formats | No | Output formats. 'markdown' is best for LLM consumption. 'json_v2' returns a structured section tree (headings + content blocks). 'rag' returns chunked text optimized for retrieval-augmented generation. 'content' returns body_markdown + content_hash + images + links for AI/KB pipelines. | |
| timeout | No | Request timeout in seconds (1-300) | |
| location | No | Geo-targeting parameters for localized content scraping. Controls proxy country routing, Accept-Language header, and browser locale. | |
| wait_for | No | CSS selector to wait for before extracting content (e.g., '#main-content') | |
| render_js | No | Render JavaScript using headless browser (forces Tier 4 minimum, with no separate add-on charge). Required for JS-heavy sites. Set to 'auto' for smart detection (probes each page and only renders JS-heavy pages with a browser; saves 30-60% on mixed sites). | |
| use_proxy | No | Route through premium proxy (+$0.0002). Helps bypass geo-restrictions and anti-bot protection | |
| session_id | No | UUID of a stored session for authenticated scraping. Use alterlab_list_sessions to find available sessions. The session's cookies will be injected into the request. | |
| block_images | No | Block image downloads during browser rendering. Reduces proxy bandwidth and speeds up scrapes. Only effective with render_js=true. | |
| scroll_count | No | Number of scroll iterations when scroll_to_load is enabled (1-10, default 3) | |
| proxy_country | No | ISO country code for geo-targeting (e.g., 'US', 'DE'). Requires use_proxy=true | |
| scroll_to_load | No | Scroll page to trigger lazy-loaded content (requires render_js). Performs explicit viewport-height scrolls to load dynamic content. Adds ~2-3s latency. | |
| extraction_model | No | Per-request LLM model override in provider-specific format (e.g. 'gpt-4o', 'claude-opus-4-5-20251101', 'llama3-70b-8192'). Overrides the model saved in your BYOK key settings for this request only. | |
| include_raw_html | No | Include raw HTML in the response alongside formatted content | |
| extraction_prompt | No | Natural language extraction instruction. Describes what fields to extract from the page. Mutually exclusive with extraction_schema. Example: "Extract the product name, price, and availability". | |
| extraction_schema | No | JSON schema for structured extraction. The API extracts fields matching this schema from the scraped page using LLM. Result is returned in content.json (add 'json' to formats) and in the top-level filtered_content field. Example: { "title": "string", "price": "number", "in_stock": "boolean" } | |
| extraction_profile | No | Pre-built extraction schema template. auto: detect best template. product: e-commerce product details. article: news/blog article fields. job_posting: job listing fields. faq: FAQ entries. recipe: recipe ingredients and instructions. event: event details. ecommerce_homepage: homepage product listings. directory_listing: directory/listing page entries. | |
| max_response_bytes | No | Soft cap on raw response body size in bytes. When the downloaded HTML exceeds this value it is truncated before extraction. Default: 5 MB (5242880). Set to 0 for no limit. Maximum: 50 MB (52428800). Useful for very large pages where you only need the beginning of the content. | |
| extraction_provider | No | LLM provider to use for extraction. Selects the matching BYOK key registered at /dashboard/settings/llm-keys. When omitted, the most recently used registered key is used automatically. Requires extraction_schema or extraction_prompt. | |
| remove_cookie_banners | No | Remove cookie consent banners from HTML before content extraction (free, enabled by default) | |

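Several parameters in the schema constrain one another (mutually exclusive extraction inputs, proxy_country needing use_proxy, numeric ranges). A minimal client-side pre-flight check, assuming the arguments are held in a plain dictionary, might look like the sketch below. The function check_scrape_args is hypothetical and is not part of the server; the server's own validation may differ, and treating render_js='auto' as satisfying the scroll_to_load requirement is an assumption.

```python
MAX_RESPONSE_BYTES = 52_428_800  # 50 MB ceiling quoted in the schema


def check_scrape_args(args):
    """Return a list of problems found in an alterlab_scrape argument dict."""
    problems = []
    if not args.get("url"):
        problems.append("url is required")

    has_schema = args.get("extraction_schema") is not None
    has_prompt = args.get("extraction_prompt") is not None
    if has_schema and has_prompt:
        problems.append("extraction_schema and extraction_prompt are mutually exclusive")
    if args.get("extraction_provider") and not (has_schema or has_prompt):
        problems.append("extraction_provider requires extraction_schema or extraction_prompt")

    if args.get("proxy_country") and not args.get("use_proxy"):
        problems.append("proxy_country requires use_proxy=true")
    # Assumption: both render_js=True and render_js='auto' count as enabled.
    if args.get("scroll_to_load") and not args.get("render_js"):
        problems.append("scroll_to_load requires render_js")

    # Numeric ranges as documented in the schema descriptions.
    ranges = (
        ("timeout", 1, 300),
        ("scroll_count", 1, 10),
        ("max_response_bytes", 0, MAX_RESPONSE_BYTES),
    )
    for name, low, high in ranges:
        value = args.get(name)
        if value is not None and not low <= value <= high:
            problems.append(f"{name} must be between {low} and {high}")
    return problems


good = {"url": "https://example.com", "render_js": "auto", "scroll_to_load": True}
bad = {
    "url": "https://example.com",
    "proxy_country": "DE",
    "scroll_to_load": True,
    "timeout": 500,
    "extraction_schema": {"title": "string"},
    "extraction_prompt": "Extract the title",
}
print(check_scrape_args(good))
for problem in check_scrape_args(bad):
    print(problem)
```

Running such a check before the call is one way to reach the "succeed on first attempt" goal without spending a request on an argument combination the schema already rules out.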
Behavior: 4/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

No annotations exist, so the description carries full burden. It discloses anti-bot handling ('Automatically handles anti-bot protection with tier escalation'), proxy cost implication ('+$0.0002'), and performance savings ('saves 30-60%'). However, it lacks explicit failure mode descriptions, rate limits, or cost details beyond proxy. For a 24-parameter tool, this is strong but not complete.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness: 4/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is long but well-organized: it leads with the basic purpose, then groups usage patterns by feature. Each sentence serves a purpose. Minor redundancy exists (e.g., 'method' parameter documented in both description and schema), but given the complexity, the length is justified.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness: 4/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

For a complex tool with 24 parameters and no output schema, the description covers all major aspects: authentication, geo-targeting, JavaScript rendering, extraction, scroll loading, and output formats. It lacks a description of the return structure (e.g., which fields are always present), but that is partially compensated by the schema descriptions of the `formats` parameter.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters: 5/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

Schema coverage is 100% but the description adds substantial meaning beyond the schema's field descriptions. For example, it explains how to use `body` for GraphQL versus form submissions, and how `formats=['rag']` optimizes for RAG pipelines. This goes well beyond the baseline of 3 for full schema coverage.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose: 5/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description opens with a clear verb+resource statement: 'Scrape a URL and return its content as markdown, text, HTML, JSON, or structured sections.' It distinguishes from siblings like alterlab_crawl (multi-page) and alterlab_screenshot (visual capture) by focusing on single-page content extraction with multiple format options.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines: 5/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description provides explicit usage guidance for each major feature: 'Use method='POST' with body for GraphQL APIs... Use render_js=true for JavaScript-heavy sites... Use formats=['json_v2'] for structured section tree... Use extraction_schema to extract structured fields.' It also covers authenticated scraping and infinite-scroll pages. While it doesn't explicitly list when not to use this tool, the guidance is comprehensive and context-specific.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/RapierCraft/alterlab-mcp-server'

If you have feedback or need assistance with the MCP directory API, please join our Discord server.