alterlab-mcp-server

alterlab_scrape

Retrieve data from any URL, automatically bypassing Cloudflare, Akamai, and other anti-bot protections. Turn web pages into clean markdown or structured JSON for LLMs.

Instructions

Get data from any website, bypass Cloudflare and anti-bot protection, scrape JavaScript-rendered pages, or fetch content from dynamic single-page apps. Turn any URL into clean, LLM-ready markdown, or get text, HTML, JSON, and structured sections. Automatically bypasses anti-bot protection (Cloudflare, Akamai, DataDome, PerimeterX, hCaptcha) with intelligent 4-tier escalation; no manual configuration needed. Cost-efficient: starts at $0.0001/page for simple sites and auto-escalates only when protection is detected. Returns markdown by default, optimized for LLM context.

Supports GET (default) and POST via the method parameter. Use method='POST' with body for GraphQL APIs, REST endpoints, and form submissions. Use content_type to set the POST body Content-Type (json, urlencoded, graphql, plain).

Use render_js=true to scrape dynamic pages and JavaScript-heavy sites (React, Angular, Vue, SPAs). Use render_js='auto' on mixed sites to detect JS needs per page (saves 30-60%). Use use_proxy=true for geo-restricted or heavily protected sites.

Use formats=['json_v2'] for a structured section tree (headings + content blocks). Use formats=['rag'] for chunked text optimized for RAG pipelines. Use formats=['raw'] for the raw response body without extraction. Use formats=['content'] for AI/KB pipelines: returns body_markdown, content_hash, images, links.

Use extraction_schema to extract structured fields from the page using an LLM. Use extraction_prompt for natural language extraction instructions. Use extraction_profile for pre-built templates (product, article, job_posting, etc.). Use evidence=true to include source passages alongside extracted fields.

Use cache=true and cache_ttl to enable response caching. Use cost_controls to cap spending, pin a tier, or set a time budget. Supports authenticated scraping via session_id or inline cookies. Use scroll_to_load=true for infinite-scroll pages. Use location.country to scrape geo-targeted content from any region.

Use prefer_cost=true to minimize credit spend (starts from the cheapest tier). Use prefer_speed=true to skip to a fast, reliable tier immediately. Use fail_fast=true to error instead of auto-escalating to expensive tiers. Use force_refresh=true to bypass the cache and always fetch live content. Use promote_schema_org=true to prefer Schema.org JSON-LD over LLM extraction on structured pages. Use estimate_first=true to run a free cost estimate before scraping (prepended to the result).
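The parameter combinations described above can be sketched as plain argument objects. This is a minimal illustration, not the server's own client code: `build_scrape_args` is a hypothetical helper, the example.com URLs are placeholders, and how the arguments reach the tool depends on your MCP client, which is not shown.

```python
import json


def build_scrape_args(url, **options):
    """Assemble the arguments object for an alterlab_scrape call.

    Hypothetical helper: parameter names come from the tool description,
    but this function is not part of the AlterLab server.
    """
    args = {"url": url}
    # Leave unset options out so the documented defaults apply.
    args.update({k: v for k, v in options.items() if v is not None})
    return args


# Plain GET scrape: markdown is returned by default.
simple = build_scrape_args("https://example.com/article")

# Mixed site: per-page JS detection, structured section tree as output.
spa = build_scrape_args(
    "https://example.com/app",
    render_js="auto",
    formats=["json_v2"],
)

# GraphQL over POST: body is a JSON string with a 'query' field.
# content_type is omitted here on the assumption that the JSON default fits.
graphql = build_scrape_args(
    "https://example.com/graphql",
    method="POST",
    body=json.dumps({"query": "{ user { id name } }"}),
)

# Cost-aware scrape: fail instead of escalating, estimate before fetching.
cautious = build_scrape_args(
    "https://example.com/pricing",
    fail_fast=True,
    estimate_first=True,
)
```

Omitting unset options keeps each payload small and avoids overriding defaults by accident.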

Input Schema

Name	Required	Description	Default
`url`	Yes	URL to scrape
`body`	No	Request body for POST requests. For GraphQL: JSON string with 'query' and optional 'variables' fields (e.g., '{"query": "{ user { id name } }"}'). For REST APIs: JSON-encoded payload string. For form submissions: URL-encoded key=value pairs (e.g., 'name=Alice&email=alice@example.com'). Omit for GET requests.
`mode`	No	Scraping mode: auto (recommended), html, js (headless browser), pdf, or ocr	auto
`cache`	No	Enable caching for this request. When true, repeat requests with identical parameters may return cached results. Only use for idempotent requests (GET pages, read-only POSTs).
`method`	No	HTTP method for the request. Default GET (standard page scraping). Use POST for GraphQL endpoints, form submissions, and REST API calls. When using POST, provide body with the request payload. POST costs 1.5x base tier price.	GET
`cookies`	No	Inline cookies as key-value pairs for authenticated scraping (e.g., {"session_token": "abc123"}). Use this for one-off requests; use session_id for reusable sessions.
`formats`	No	Output formats. 'markdown' is best for LLM consumption. 'json_v2' returns a structured section tree (headings + content blocks). 'rag' returns chunked text optimized for retrieval-augmented generation. 'raw' returns the raw response body without extraction. 'content' returns body_markdown + content_hash + images + links for AI/KB pipelines.
`timeout`	No	Request timeout in seconds (1-300)
`evidence`	No	Include provenance/evidence snippets alongside extracted fields. Each extracted value will include the source text passage it was derived from. Requires extraction_schema or extraction_prompt.
`location`	No	Geo-targeting parameters for localized content scraping. Controls proxy country routing, Accept-Language header, and browser locale.
`template`	No	Named extraction template to apply to the scrape result. Accepts standard template names (e.g. 'product', 'article', 'job_posting') or a custom template name registered in your account. When provided, routes the request through template-based extraction.
`wait_for`	No	CSS selector to wait for before extracting content (e.g., '#main-content')
`cache_ttl`	No	Cache TTL in seconds (60–86400). Defaults to 3600 (60 min) when cache=true. Requires cache=true.
`fail_fast`	No	Fail immediately if the page requires an expensive tier (browser/captcha) instead of auto-escalating. Use this to protect against unexpected credit spend on protected pages. Returns an error with the required tier instead of automatically upgrading.
`render_js`	No	Render JavaScript using headless browser (forces Tier 4 minimum — no separate add-on charge). Required for JS-heavy sites. Set to 'auto' for smart detection (probes each page, only renders JS-heavy pages with browser — saves 30-60% on mixed sites).
`use_proxy`	No	Route through premium proxy (+$0.0002). Helps bypass geo-restrictions and anti-bot protection.
`session_id`	No	UUID of a stored session for authenticated scraping. Use alterlab_list_sessions to find available sessions. The session's cookies will be injected into the request.
`prefer_cost`	No	Optimize for lowest cost — try cheaper tiers first before escalating. Best for non-time-sensitive scrapes where minimizing credit spend matters. Mutually exclusive intent with prefer_speed.
`block_images`	No	Block image downloads during browser rendering. Reduces proxy bandwidth and speeds up scrapes. Only effective with render_js=true.
`content_type`	No	Content-Type header for the request body. Defaults to 'application/json' when body is provided. Use 'application/graphql' for raw GraphQL queries. Use 'application/x-www-form-urlencoded' for HTML form submissions. Requires body to be set.
`prefer_speed`	No	Optimize for speed — skip to a reliable tier immediately instead of escalating from Tier 1. Best for time-sensitive scrapes where latency matters more than cost. Mutually exclusive intent with prefer_cost.
`scroll_count`	No	Number of scroll iterations when scroll_to_load is enabled (1-10, default 3)
`cost_controls`	No	Fine-grained cost and tier controls. Use to cap spending, pin a tier, or trade off cost vs speed. Prefer these over the top-level prefer_cost/prefer_speed/fail_fast fields for full control.
`force_refresh`	No	Bypass the cache and always fetch a fresh copy of the page. Use when you need real-time content and a cached result would be stale.
`proxy_country`	No	ISO country code for geo-targeting (e.g., 'US', 'DE'). Requires use_proxy=true
`estimate_first`	No	Run a cost estimate before scraping and include it in the response. Adds one lightweight API call (~50ms) with no credit charge. The estimated tier, cost, and confidence are prepended to the scrape result. Useful for unfamiliar or potentially expensive sites — see cost before committing.
`filter_content`	No	Apply quality filtering to extracted content. When false (default), returns all parsed content without quality thresholds (lossless mode). When true, filters low-quality boilerplate.
`scroll_to_load`	No	Scroll page to trigger lazy-loaded content (requires render_js). Performs explicit viewport-height scrolls to load dynamic content. Adds ~2-3s latency.
`extraction_model`	No	Per-request LLM model override in provider-specific format (e.g. 'gpt-4o', 'claude-opus-4-5-20251101', 'llama3-70b-8192'). Overrides the model saved in your BYOK key settings for this request only.
`include_raw_html`	No	Include raw HTML in the response alongside formatted content
`extraction_prompt`	No	Natural language extraction instruction. Describes what fields to extract from the page. Mutually exclusive with extraction_schema. Example: "Extract the product name, price, and availability".
`extraction_schema`	No	JSON schema for structured extraction. The API extracts fields matching this schema from the scraped page using LLM. Result is returned in extraction_result. Example: { "title": "string", "price": "number", "in_stock": "boolean" }
`extraction_profile`	No	Pre-built extraction schema template. auto: detect best template. product: e-commerce product details. article: news/blog article fields. job_posting: job listing fields. faq: FAQ entries. recipe: recipe ingredients and instructions. event: event details. ecommerce_homepage: homepage product listings. directory_listing: directory/listing page entries.
`max_response_bytes`	No	Soft cap on raw response body size in bytes. When the downloaded HTML exceeds this value it is truncated before extraction. Default: 5 MB (5242880). Set to 0 for no limit. Maximum: 50 MB (52428800). Useful for very large pages where you only need the beginning of the content.
`promote_schema_org`	No	Use Schema.org JSON-LD/Microdata as the primary structured-data source when present. Promotes machine-readable metadata embedded in the page over LLM extraction. Most effective on e-commerce, recipe, and news article pages.
`extraction_provider`	No	LLM provider to use for extraction. Selects the matching BYOK key registered at /dashboard/settings/llm-keys. When omitted, the most recently used registered key is used automatically. Requires extraction_schema or extraction_prompt.
`remove_cookie_banners`	No	Remove cookie consent banners from HTML before content extraction (free, enabled by default)
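Several rows in the table state ranges, prerequisites, and mutually exclusive pairs. The sketch below collects those rules into a client-side check that could run before a call is made. It is illustrative only: `check_scrape_args` and the three rule tables are hypothetical names, the server may enforce these rules differently, and treating render_js='auto' as satisfying the scroll_to_load prerequisite is an assumption.

```python
# Ranges stated in the table (inclusive).
RANGES = {
    "timeout": (1, 300),                  # seconds
    "cache_ttl": (60, 86400),             # seconds
    "scroll_count": (1, 10),              # scroll iterations
    "max_response_bytes": (0, 52428800),  # 0 = no limit, 50 MB maximum
}

# (parameter, parameters of which at least one must also be set)
REQUIRES = [
    ("cache_ttl", ["cache"]),
    ("proxy_country", ["use_proxy"]),
    ("scroll_to_load", ["render_js"]),
    ("content_type", ["body"]),
    ("evidence", ["extraction_schema", "extraction_prompt"]),
    ("extraction_provider", ["extraction_schema", "extraction_prompt"]),
]

# Pairs the table describes as mutually exclusive.
EXCLUSIVE = [
    ("extraction_prompt", "extraction_schema"),
    ("prefer_cost", "prefer_speed"),
]


def check_scrape_args(args):
    """Return a list of problems found in args; an empty list means none."""
    problems = []
    if not args.get("url"):
        problems.append("url is required")
    for name, (low, high) in RANGES.items():
        if name in args and not low <= args[name] <= high:
            problems.append(f"{name} must be between {low} and {high}")
    for name, needed in REQUIRES:
        # Falsy values (False, None, "") count as "not set".
        if args.get(name) and not any(args.get(n) for n in needed):
            problems.append(f"{name} requires {' or '.join(needed)}")
    for first, second in EXCLUSIVE:
        if args.get(first) and args.get(second):
            problems.append(f"{first} and {second} should not be combined")
    return problems
```

Returning a list rather than raising on the first failure lets a caller report every problem in one pass, for example `check_scrape_args({"url": "https://example.com", "proxy_country": "DE"})`.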

Tool Definition Quality

A (3.5/5.0)

Behavior: 4/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

With no annotations, the description carries full responsibility for disclosing behavior. It details intelligent anti-bot escalation, cost mechanics, caching policies, and various parameter effects. However, it does not explicitly state that the tool is read-only, nor does it confirm the absence of side effects or destructive operations; doing so would have earned a 5.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness: 2/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is excessively long (over 300 words) and contains a dense list of parameter usage examples that could be more succinct. While the first sentence is effective, the rest is verbose and mixes crucial behavioral information with parameter-level details, reducing clarity.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness: 4/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Despite the complexity (37 parameters, nested objects, no output schema), the description covers essential behavioral aspects (caching, cost, authentication, geo-targeting, error escalation). It lacks an explicit description of the output format beyond 'markdown by default', but mentions alternatives. Minor gaps: no error handling details or response structure beyond formats.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters: 3/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

Schema description coverage is 100%, so the baseline is 3. The description adds minor extra context like pricing ('starts at $0.0001/page') and high-level feature summaries, but largely repeats or elaborates on what is already in the schema. It does not significantly enhance understanding of parameter semantics beyond the schema.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose: 5/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states the verb ('Get data from any website', 'scrape', 'fetch') and resource ('any website', 'JavaScript-rendered pages', 'dynamic single-page apps'). It distinguishes itself from siblings by emphasizing anti-bot bypass and JavaScript rendering capabilities, which are unique among the listed tools.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines: 2/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description does not provide explicit guidance on when to use alterlab_scrape versus alternatives like alterlab_crawl, alterlab_extract, or alterlab_screenshot. There is no 'use when' or 'consider using X instead' language, leaving the agent to infer usage from the lengthy parameter list.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/RapierCraft/alterlab-mcp-server'

If you have feedback or need assistance with the MCP directory API, please join our Discord server.