Which integrations are available for this server?

The server detects Cloudflare anti-bot challenge screens during scraping and reports them as errors. It supports CSS selectors for identifying and removing specific HTML elements before content extraction. It extracts web page content and converts it to Markdown for use with AI tools and automated workflows.

How do I use web-scrapper-stdio?

1. Click on "Install Server".
2. Wait a few minutes for the server to deploy. Once ready, it will show a "Started" state.
3. In the chat, type @ followed by the MCP server name and your instructions, e.g., "@web-scrapper-stdio scrape https://en.wikipedia.org/wiki/Artificial_intelligence in markdown format".

The server will respond to your query, and you can continue using it as needed.

web-scrapper-stdio

by JustAzul

Web Scrapper Service (MCP Stdin/Stdout & HTTP)

A Python-based MCP server for robust, headless web scraping—extracts main text content from web pages and outputs Markdown, text, or HTML for seamless AI and automation integration.

Key Features

Headless browser scraping (Playwright, BeautifulSoup, Markdownify)
Outputs Markdown, text, or HTML
Designed for MCP (Model Context Protocol) stdio/JSON-RPC integration
Dual transport: stdio (default) and Streamable HTTP for shared service mode
Persistent browser pool: Chromium stays alive across requests for fast scraping
Smart DOM wait: MutationObserver-based content stabilization instead of fixed sleep
Dockerized, with pre-built images
Configurable via environment variables
Robust error handling (timeouts, HTTP errors, Cloudflare, etc.)
Per-domain rate limiting
Easy integration with AI tools and IDEs (Cursor, Claude Desktop, Continue, JetBrains, Zed, etc.)
One-click install for Cursor, interactive installer for Claude

Quick Start

Run with Docker (stdio mode — one container per client)

docker run -i --rm ghcr.io/justazul/web-scrapper-stdio

Run as Shared HTTP Service (one container, multiple clients)

docker run -d --name web-scraper \
  -e MCP_TRANSPORT=streamable-http \
  -e MCP_HTTP_PORT=8080 \
  -e BROWSER_POOL_SIZE=3 \
  -p 8080:8080 \
  --shm-size=3gb \
  ghcr.io/justazul/web-scrapper-stdio

Or with Docker Compose:

docker compose --profile service up -d

Transport Modes

stdio (default)

Each MCP client spawns its own container via docker run -i. Simple, zero configuration, works with any MCP client.

{
  "mcpServers": {
    "web-scrapper-stdio": {
      "command": "docker",
      "args": ["run", "-i", "--rm", "ghcr.io/justazul/web-scrapper-stdio"]
    }
  }
}

Streamable HTTP (shared service)

Run one persistent container that serves multiple MCP clients over HTTP. Saves resources when running multiple AI tool instances (e.g., multiple Claude Code sessions).

Start the service:

docker run -d --name web-scraper \
  -e MCP_TRANSPORT=streamable-http \
  -e MCP_HTTP_PORT=8080 \
  -p 8080:8080 \
  --shm-size=3gb \
  ghcr.io/justazul/web-scrapper-stdio

Connect from your MCP client:

{
  "mcpServers": {
    "web-scrapper": {
      "url": "http://localhost:8080/mcp"
    }
  }
}
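The two transport modes differ only in the server entry a client needs: a command to spawn for stdio, or a URL for Streamable HTTP. The following is a minimal sketch that generates either entry. The helper name `mcp_server_entry` is hypothetical and not part of this project; the output shapes simply mirror the JSON configurations above.

```python
import json

IMAGE = "ghcr.io/justazul/web-scrapper-stdio"


def mcp_server_entry(mode: str, port: int = 8080) -> dict:
    """Return an mcpServers config for the given transport mode.

    Hypothetical helper for illustration only.
    """
    if mode == "stdio":
        # One container per client, spawned by the MCP client itself.
        return {
            "mcpServers": {
                "web-scrapper-stdio": {
                    "command": "docker",
                    "args": ["run", "-i", "--rm", IMAGE],
                }
            }
        }
    if mode == "streamable-http":
        # One shared container; the client only needs its URL.
        return {
            "mcpServers": {
                "web-scrapper": {"url": f"http://localhost:{port}/mcp"}
            }
        }
    raise ValueError(f"unknown transport mode: {mode!r}")


if __name__ == "__main__":
    print(json.dumps(mcp_server_entry("streamable-http"), indent=2))
```

The `port` argument should match the value passed as MCP_HTTP_PORT and published with `-p` when the service was started.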

Integration with AI Tools & IDEs

This service supports integration with a wide range of AI tools and IDEs that implement the Model Context Protocol (MCP). Below are ready-to-use configuration examples for the most popular environments. Replace the image/tag as needed for custom builds.

Cursor IDE

Add to your .cursor/mcp.json (project-level) or ~/.cursor/mcp.json (global):

{
  "mcpServers": {
    "web-scrapper-stdio": {
      "command": "docker",
      "args": [
        "run",
        "-i",
        "--rm",
        "ghcr.io/justazul/web-scrapper-stdio"
      ]
    }
  }
}

Claude Desktop

Add to your Claude Desktop MCP config (typically claude_desktop_config.json):

{
  "mcpServers": {
    "web-scrapper-stdio": {
      "command": "docker",
      "args": [
        "run",
        "-i",
        "--rm",
        "ghcr.io/justazul/web-scrapper-stdio"
      ]
    }
  }
}

Claude Code

Add to your .mcp.json or global ~/.claude.json:

stdio mode (one container per session):

{
  "mcpServers": {
    "web-scrapper-stdio": {
      "command": "docker",
      "args": ["run", "-i", "--rm", "ghcr.io/justazul/web-scrapper-stdio"]
    }
  }
}

HTTP mode (shared service — start the service first):

{
  "mcpServers": {
    "web-scrapper": {
      "url": "http://localhost:8080/mcp"
    }
  }
}

Continue (VSCode/JetBrains Plugin)

Add to your continue.config.json or via the Continue plugin MCP settings:

{
  "mcpServers": {
    "web-scrapper-stdio": {
      "command": "docker",
      "args": [
        "run",
        "-i",
        "--rm",
        "ghcr.io/justazul/web-scrapper-stdio"
      ]
    }
  }
}

IntelliJ IDEA (JetBrains AI Assistant)

Go to Settings > Tools > AI Assistant > Model Context Protocol (MCP) and add a new server. Use:

{
  "command": "docker",
  "args": [
    "run",
    "-i",
    "--rm",
    "ghcr.io/justazul/web-scrapper-stdio"
  ]
}

Zed Editor

Add to your Zed MCP config (see Zed docs for the exact path):

{
  "mcpServers": {
    "web-scrapper-stdio": {
      "command": "docker",
      "args": [
        "run",
        "-i",
        "--rm",
        "ghcr.io/justazul/web-scrapper-stdio"
      ]
    }
  }
}

Usage

MCP Server (Tool/Prompt)

The scraper is exposed as an MCP (Model Context Protocol) tool, so AI models and other automation can call it directly.

Tool: `scrape_web`

Parameters:

url (string, required): The URL to scrape
max_length (integer, optional): Maximum length of returned content (default: unlimited)
timeout_seconds (integer, optional): Timeout in seconds for the page load (default: 30)
user_agent (string, optional): Custom User-Agent string passed directly to the browser (defaults to a random agent)
wait_for_network_idle (boolean, optional): Wait for network activity to settle before scraping (default: true)
custom_elements_to_remove (list of strings, optional): Additional HTML elements (CSS selectors) to remove before extraction
grace_period_seconds (float, optional): Time to wait for JS rendering after navigation. Uses MutationObserver for smart detection. Set to 0 to skip entirely. (default: 0.5)
output_format (string, optional): markdown, text, or html (default: markdown)
click_selector (string, optional): If provided, click the element matching this selector after navigation and before extraction

Returns:

Content extracted from the webpage in the requested output_format (Markdown by default), as a string
Errors are reported as strings starting with [ERROR] ...
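Because success and failure both come back as plain strings, a caller has to inspect the prefix to tell them apart. A minimal sketch of that check follows; `split_result` is a hypothetical client-side helper, and the only thing it relies on is the documented `[ERROR]` prefix.

```python
def split_result(result: str) -> tuple[bool, str]:
    """Return (ok, text) for a scrape_web result string.

    ok is False when the string starts with the documented [ERROR] prefix;
    in that case text is the message with the prefix removed.
    """
    prefix = "[ERROR]"
    stripped = result.strip()
    if stripped.startswith(prefix):
        return False, stripped[len(prefix):].strip()
    return True, result
```

A caller can branch on the first element and log or retry on failure instead of passing an error string to a model as if it were page content.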

Example: Using click_selector and custom_elements_to_remove

{
  "url": "http://uitestingplayground.com/clientdelay",
  "click_selector": "#ajaxButton",
  "grace_period_seconds": 10,
  "custom_elements_to_remove": [".ads-banner", "#popup"],
  "output_format": "markdown"
}
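On the wire, MCP clients send tool arguments like the ones above inside a JSON-RPC 2.0 `tools/call` request. The sketch below builds such a request for `scrape_web`. The function name `build_scrape_call` is hypothetical, and the format check is a client-side convenience assumed here, not something the server is known to require.

```python
import json

ALLOWED_FORMATS = {"markdown", "text", "html"}


def build_scrape_call(request_id: int, url: str, **options) -> str:
    """Serialize a tools/call request for scrape_web as one line of JSON."""
    fmt = options.get("output_format", "markdown")
    if fmt not in ALLOWED_FORMATS:
        raise ValueError(f"output_format must be one of {sorted(ALLOWED_FORMATS)}")
    message = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {
            "name": "scrape_web",
            "arguments": {"url": url, **options},
        },
    }
    # json.dumps without indent emits no raw newlines, so the result is
    # safe to send over a newline-delimited stdio transport.
    return json.dumps(message)


if __name__ == "__main__":
    print(build_scrape_call(
        1,
        "http://uitestingplayground.com/clientdelay",
        click_selector="#ajaxButton",
        grace_period_seconds=10,
    ))
```

A real session must complete the MCP `initialize` handshake before sending this request; most users will let their MCP client or SDK handle both steps.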

Prompt: `scrape`

Parameters:

url (string, required): The URL to scrape
output_format (string, optional): markdown, text, or html (default: markdown)

Returns:

Content extracted from the webpage in the chosen format

Note:

Markdown is returned by default, but text or HTML can be requested via output_format.
The scraper does not check robots.txt and will attempt to fetch any URL provided.
No REST API or CLI tool is included; the server speaks MCP (JSON-RPC) only, over stdio or Streamable HTTP.
The scraper always extracts the full <body> content of web pages, applying only essential noise removal (removing script, style, nav, footer, aside, header, and similar non-content tags).
The scraper detects and handles Cloudflare challenge screens, returning a specific error string.
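To make the noise-removal step concrete, here is a rough sketch that drops the listed tags and keeps the remaining text. It deliberately uses the standard library's `html.parser` instead of BeautifulSoup so that it runs without dependencies; it is not the project's implementation, and it does not support the CSS selectors accepted by custom_elements_to_remove.

```python
from html.parser import HTMLParser

# Tags named in the note above as non-content.
NOISE_TAGS = {"script", "style", "nav", "footer", "aside", "header"}


class BodyTextExtractor(HTMLParser):
    """Collect text that is not nested inside a noise tag (illustrative)."""

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0  # > 0 while inside any noise element
        self._chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self) -> str:
        return "\n".join(self._chunks)


def extract_text(html: str) -> str:
    """Return the visible text of html with noise elements removed."""
    parser = BodyTextExtractor()
    parser.feed(html)
    parser.close()  # flush any buffered trailing text
    return parser.text()
```

Tracking a depth counter rather than a boolean keeps nested noise elements (for example a nav inside a header) from re-enabling text capture too early.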

Configuration

You can override most configuration options using environment variables:

Core Settings

DEFAULT_TIMEOUT_SECONDS: Timeout for page loads and navigation (default: 30)
DEFAULT_MIN_CONTENT_LENGTH: Minimum content length for extracted text (default: 100)
DEFAULT_MIN_CONTENT_LENGTH_SEARCH_APP: Minimum content length for search.app domains (default: 30)
DEFAULT_MIN_SECONDS_BETWEEN_REQUESTS: Minimum delay between requests to the same domain (default: 2)
DEFAULT_GRACE_PERIOD_SECONDS: Default grace period for JS rendering (default: 0.5)
DEBUG_LOGS_ENABLED: Set to true to enable debug-level logs (default: false)
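The core settings follow the usual pattern of an environment variable with a fallback default. The sketch below reads them into a dataclass using the defaults listed above. The class and function names are hypothetical, and the numeric types and case-insensitive boolean parsing are assumptions; the server's own loader may differ.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class CoreSettings:
    # Defaults copied from the list above.
    timeout_seconds: int = 30
    min_content_length: int = 100
    min_content_length_search_app: int = 30
    min_seconds_between_requests: float = 2.0
    grace_period_seconds: float = 0.5
    debug_logs_enabled: bool = False


def load_core_settings(env: Mapping[str, str] = os.environ) -> CoreSettings:
    """Read the documented variables, falling back to documented defaults."""
    d = CoreSettings()
    return CoreSettings(
        timeout_seconds=int(env.get("DEFAULT_TIMEOUT_SECONDS", d.timeout_seconds)),
        min_content_length=int(
            env.get("DEFAULT_MIN_CONTENT_LENGTH", d.min_content_length)
        ),
        min_content_length_search_app=int(
            env.get(
                "DEFAULT_MIN_CONTENT_LENGTH_SEARCH_APP",
                d.min_content_length_search_app,
            )
        ),
        min_seconds_between_requests=float(
            env.get(
                "DEFAULT_MIN_SECONDS_BETWEEN_REQUESTS",
                d.min_seconds_between_requests,
            )
        ),
        grace_period_seconds=float(
            env.get("DEFAULT_GRACE_PERIOD_SECONDS", d.grace_period_seconds)
        ),
        # Assumption: only the literal "true" (any case) enables debug logs.
        debug_logs_enabled=env.get("DEBUG_LOGS_ENABLED", "false").strip().lower()
        == "true",
    )
```

With Docker, each variable is passed with `-e NAME=value`, in the same way MCP_TRANSPORT is passed in the Quick Start commands.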

Browser Pool

BROWSER_POOL_ENABLED: Enable persistent browser pool (default: true). Set to false for per-request browser launch (original behavior).
BROWSER_POOL_SIZE: Number of Chromium instances to keep alive (default: 2). Each instance uses ~100-200MB RAM.
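The idea behind a persistent pool is that a fixed number of long-lived instances are created once and then lent out per request. The sketch below shows that pattern with an `asyncio.Queue`. It is a generic illustration under assumed names (`Pool`, `factory`), using placeholder objects where the server would hold Chromium instances; the project's actual pool is not shown in this document.

```python
import asyncio
from contextlib import asynccontextmanager


class Pool:
    """Fixed-size pool of long-lived resources (stand-ins for browsers)."""

    def __init__(self, factory, size: int = 2) -> None:
        self._queue: asyncio.Queue = asyncio.Queue()
        for _ in range(size):
            # Each resource is created once, up front.
            self._queue.put_nowait(factory())

    @asynccontextmanager
    async def acquire(self):
        resource = await self._queue.get()  # waits if all are busy
        try:
            yield resource
        finally:
            self._queue.put_nowait(resource)  # hand it back for reuse


async def demo() -> int:
    created = []

    def factory():
        created.append(object())
        return created[-1]

    pool = Pool(factory, size=2)
    for _ in range(5):  # five "requests" share two resources
        async with pool.acquire() as resource:
            assert resource in created
    return len(created)


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

The trade-off matches the RAM note above: a larger pool serves more concurrent requests but keeps more instances resident in memory.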

Transport

MCP_TRANSPORT: Transport mode — stdio or streamable-http (default: stdio)
MCP_HTTP_PORT: HTTP server port when using streamable-http transport (default: 8080)
MCP_HTTP_HOST: HTTP server bind address (default: 0.0.0.0)

Cloudflare Bypass

CAPTCHA_API_KEY: API key for captcha solver service. When set, Cloudflare Turnstile challenges are solved automatically. When empty (default), CF-protected pages return an error.
CAPTCHA_PROVIDER: Captcha solver provider — 2captcha, capsolver, or capmonster (default: 2captcha)
CAPTCHA_BASE_URL: Custom solver API endpoint (default: uses provider's official URL)
CAPTCHA_TIMEOUT: Timeout in seconds for captcha solving (default: 120)

Test Settings

DEFAULT_TEST_REQUEST_TIMEOUT: Timeout for test requests (default: 10)
DEFAULT_TEST_NO_DELAY_THRESHOLD: Threshold for skipping artificial delays in tests (default: 0.5)

Error Handling & Limitations

The scraper detects and returns errors for navigation failures, timeouts, HTTP errors (including 404), and Cloudflare anti-bot challenges.
Rate limiting is enforced per domain (default: 2 seconds between requests).
Cloudflare bypass: Patchright (CDP-level anti-detection) provides passive evasion, so most CF-protected sites are scraped without triggering a challenge. When a Turnstile challenge is triggered and CAPTCHA_API_KEY is set, it is solved automatically via a third-party solver API. Without a key, the page returns an error.
Limitations:
- No REST API or CLI tool (MCP JSON-RPC only, over stdio or Streamable HTTP)
- No support for non-HTML content (PDF, images, etc.)
- No authentication or session management for protected pages
- Not intended for scraping at scale or violating site terms
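The per-domain rate limit mentioned above can be pictured as a table of last-request times keyed by host. The following sketch computes how long a caller should wait before the next request to the same domain. The class name and the slot-reservation behaviour are assumptions made for illustration; the server's own limiter may work differently.

```python
import time
from urllib.parse import urlparse


class DomainRateLimiter:
    """Enforce a minimum gap between requests to the same domain."""

    def __init__(self, min_seconds: float = 2.0, clock=time.monotonic) -> None:
        self._min_seconds = min_seconds
        self._clock = clock  # injectable so the logic can be tested
        self._last_request: dict[str, float] = {}

    def seconds_to_wait(self, url: str) -> float:
        """Return the delay needed before requesting url and reserve the slot."""
        domain = urlparse(url).netloc.lower()
        now = self._clock()
        last = self._last_request.get(domain)
        if last is None:
            wait = 0.0
        else:
            wait = max(0.0, last + self._min_seconds - now)
        # Record when this request will actually run, so that several
        # back-to-back callers queue up instead of all waiting the same time.
        self._last_request[domain] = now + wait
        return wait


if __name__ == "__main__":
    limiter = DomainRateLimiter()
    for target in ("https://example.com/a", "https://example.com/b"):
        delay = limiter.seconds_to_wait(target)
        time.sleep(delay)
        print(f"fetch {target} after waiting {delay:.1f}s")
```

Requests to different domains never delay each other, which is why the key is the host rather than a single global timestamp.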

Development & Testing

Running Tests (Docker Compose)

All tests must be run using Docker Compose. Do not run tests outside Docker.

All tests:

docker compose up --build --abort-on-container-exit test

MCP server tests only:

docker compose up --build --abort-on-container-exit test_mcp

Scrapper tests only:

docker compose up --build --abort-on-container-exit test_scrapper

Running Benchmarks

docker compose run --rm benchmark

Results are stored in benchmarks/RESULTS.md.

Contributing

Contributions are welcome! Please open issues or pull requests for bug fixes, features, or improvements. If you plan to make significant changes, open an issue first to discuss your proposal.

License

This project is licensed under the MIT License.
