Scraper MCP
A context-optimized MCP server for web scraping. Reduces LLM token usage by 70-90% through server-side HTML filtering, markdown conversion, and CSS selector targeting.
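The savings come from stripping markup, scripts, and styles server-side before content ever reaches the model. A minimal standard-library sketch of the idea (illustrative only, not the server's actual pipeline):

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> content."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0
    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1
    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())

html = ("<html><head><style>body{color:red}</style></head>"
        "<body><nav>Menu</nav><article>Hello world</article>"
        "<script>track()</script></body></html>")
parser = TextExtractor()
parser.feed(html)
text = " ".join(parser.parts)
savings = 1 - len(text) / len(html)  # fraction of bytes discarded
print(text)
print(f"{savings:.0%} smaller")
```

Even in this tiny example most of the bytes are markup; on real pages, where navigation, scripts, and inline CSS dominate, reductions in the 70-90% range become plausible.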
Quick Start
```bash
# Run with Docker (GitHub Container Registry)
docker run -d -p 8000:8000 --name scraper-mcp ghcr.io/cotdp/scraper-mcp:latest

# Add to Claude Code
claude mcp add --transport http scraper http://localhost:8000/mcp --scope user
```

Try it:

```
> scrape https://example.com
> scrape and filter .article-content from https://blog.example.com/post
```

Endpoints:
MCP: `http://localhost:8000/mcp`
Dashboard: `http://localhost:8000/`
Features
Web Scraping
4 scraping modes: Raw HTML, markdown, plain text, link extraction
JavaScript rendering: Optional Playwright-based rendering for SPAs and dynamic content
CSS selector filtering: Extract only relevant content server-side
Batch operations: Process multiple URLs concurrently
Smart caching: Three-tier cache system (realtime/default/static)
Retry logic: Exponential backoff for transient failures
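The retry behavior can be pictured with a short sketch; the delays, caps, and jitter here are illustrative, not the server's exact policy:

```python
import random
import time

def fetch_with_retry(fetch, url, max_retries=3, base_delay=0.5):
    """Retry a fetch that may fail transiently, doubling the delay each attempt."""
    for attempt in range(max_retries + 1):
        try:
            return fetch(url)
        except ConnectionError:
            if attempt == max_retries:
                raise  # out of retries; surface the error
            # Exponential backoff with a little jitter to avoid thundering herds.
            delay = base_delay * (2 ** attempt) * (1 + random.random() * 0.1)
            time.sleep(delay)

# Demo with a fake fetcher that fails twice, then succeeds.
calls = {"n": 0}
def flaky(url):
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient")
    return f"ok:{url}"

print(fetch_with_retry(flaky, "https://example.com", base_delay=0.01))
```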
Perplexity AI Integration
Web search: AI-powered search with citations (`perplexity` tool)
Reasoning: Complex analysis with step-by-step reasoning (`perplexity_reason` tool)
Requires the `PERPLEXITY_API_KEY` environment variable
Monitoring Dashboard
Real-time request statistics and cache metrics
Interactive API playground for testing tools
Runtime configuration without restarts

See Dashboard Guide for details.
Available Tools
| Tool | Description |
| --- | --- |
|  | HTML converted to markdown (best for LLMs) |
|  | Raw HTML content |
|  | Plain text extraction |
|  | Extract all links with metadata |
| `perplexity` | AI web search with citations |
| `perplexity_reason` | Complex reasoning tasks |
All tools support:
Single URL or batch operations (pass array)
`timeout` and `max_retries` parameters
`css_selector` for targeted extraction
`render_js` for JavaScript rendering (SPAs, dynamic content)
Resources
Note: Resources are disabled by default to reduce context overhead. Enable with the `--enable-resources` flag or the `ENABLE_RESOURCES=true` environment variable.
MCP resources provide read-only data access via URI-based addressing:
| URI | Description |
| --- | --- |
|  | Cache hit rate, size, entry counts |
|  | List of recent request IDs |
|  | Retrieve cached result by ID |
|  | Current runtime configuration |
|  | Timeout, retries, concurrency |
|  | Version, uptime, capabilities |
|  | Request counts, success rates |
Prompts
Note: Prompts are disabled by default to reduce context overhead. Enable with the `--enable-prompts` flag or the `ENABLE_PROMPTS=true` environment variable.
MCP prompts provide reusable workflow templates:
| Prompt | Description |
| --- | --- |
|  | Structured webpage analysis |
|  | Generate content summaries |
|  | Extract specific data types |
|  | Comprehensive SEO check |
|  | Analyze internal/external links |
|  | Multi-source research |
|  | Verify claims across sources |
See API Reference for complete documentation.
JavaScript Rendering
For SPAs (React, Vue, Angular) and pages with dynamic content, enable JavaScript rendering:
```python
# Enable JS rendering with render_js=True
scrape_url(["https://spa-example.com"], render_js=True)

# Combine with CSS selector for targeted extraction
scrape_url(["https://react-app.com"], render_js=True, css_selector=".main-content")
```

When to use:
Single-page applications (SPAs) - React, Vue, Angular, etc.
Sites with lazy-loaded content
Pages requiring JavaScript execution
Dynamic content loaded via AJAX/fetch
When NOT needed:
Static HTML pages (most blogs, news sites, documentation)
Server-rendered content
Simple websites without JavaScript dependencies
How it works:
Uses Playwright with headless Chromium
Single browser instance with pooled contexts (~300MB base + 10-20MB per context)
Lazy initialization (browser only starts when first JS render is requested)
Semaphore-controlled concurrency (default: 5 concurrent contexts)
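The semaphore-controlled limit can be illustrated with a short asyncio sketch; the names and timings are made up, only the pattern matches the description above:

```python
import asyncio

MAX_CONTEXTS = 5  # mirrors the default concurrency limit
peak = {"now": 0, "max": 0}

async def render(url, semaphore):
    """Stand-in for a Playwright page render inside a pooled context."""
    async with semaphore:  # at most MAX_CONTEXTS renders run at once
        peak["now"] += 1
        peak["max"] = max(peak["max"], peak["now"])
        await asyncio.sleep(0.01)  # pretend to load the page
        peak["now"] -= 1
        return f"<html>{url}</html>"

async def main():
    semaphore = asyncio.Semaphore(MAX_CONTEXTS)
    urls = [f"https://example.com/{i}" for i in range(20)]
    return await asyncio.gather(*(render(u, semaphore) for u in urls))

results = asyncio.run(main())
print(f"{len(results)} pages rendered, peak concurrency: {peak['max']}")
```

Because each context costs 10-20MB on top of the browser base, capping concurrency this way bounds memory use regardless of batch size.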
Memory considerations:
Base requests provider: ~50MB
With Playwright active: ~300-500MB depending on concurrent contexts
Recommend minimum 1GB container memory when using JS rendering
Testing JS rendering:
Use the dashboard playground at http://localhost:8000/ to test JavaScript rendering interactively with the toggle switch.
Docker Deployment
Quick Run
```bash
# Using GitHub Container Registry (recommended)
docker run -d -p 8000:8000 --name scraper-mcp ghcr.io/cotdp/scraper-mcp:latest

# With JavaScript rendering (requires more memory)
docker run -d -p 8000:8000 --memory=1g --name scraper-mcp ghcr.io/cotdp/scraper-mcp:latest

# With Perplexity AI
docker run -d -p 8000:8000 -e PERPLEXITY_API_KEY=your_key ghcr.io/cotdp/scraper-mcp:latest
```

Docker Compose
For persistent storage and custom configuration:
```yaml
# docker-compose.yml
services:
  scraper-mcp:
    image: ghcr.io/cotdp/scraper-mcp:latest
    ports:
      - "8000:8000"
    volumes:
      - cache:/app/cache
    environment:
      - PERPLEXITY_API_KEY=${PERPLEXITY_API_KEY:-}
      - PLAYWRIGHT_MAX_CONTEXTS=5
    deploy:
      resources:
        limits:
          memory: 1G  # Recommended for JS rendering
    restart: unless-stopped

volumes:
  cache:
```

```bash
docker-compose up -d
```

Production deployment (pre-built image from GHCR):

```bash
docker-compose -f docker-compose.prod.yml up -d
```

Upgrading
To upgrade an existing deployment to the latest version:
```bash
# Pull the latest image
docker pull ghcr.io/cotdp/scraper-mcp:latest

# Restart with new image (docker-compose)
docker-compose down && docker-compose up -d

# Or for production deployments
docker-compose -f docker-compose.prod.yml pull
docker-compose -f docker-compose.prod.yml up -d

# Or restart a standalone container
docker stop scraper-mcp && docker rm scraper-mcp
docker run -d -p 8000:8000 --name scraper-mcp ghcr.io/cotdp/scraper-mcp:latest
```

Your cache data persists in the named volume across upgrades.
Available Tags
| Tag | Description |
| --- | --- |
| `latest` | Latest stable release |
|  | Latest build from main branch |
|  | Specific version |
Configuration
Create a .env file for custom settings:
```bash
# Perplexity AI (optional)
PERPLEXITY_API_KEY=your_key_here

# JavaScript rendering (optional, requires Playwright)
PLAYWRIGHT_MAX_CONTEXTS=5    # Max concurrent browser contexts
PLAYWRIGHT_TIMEOUT=30000     # Page load timeout in ms
PLAYWRIGHT_DISABLE_GPU=true  # Reduce memory in containers

# MCP features (disabled by default to reduce context overhead)
ENABLE_RESOURCES=true        # Enable MCP resources
ENABLE_PROMPTS=true          # Enable MCP prompts

# Proxy (optional)
HTTP_PROXY=http://proxy.example.com:8080
HTTPS_PROXY=http://proxy.example.com:8080

# ScrapeOps proxy service (optional)
SCRAPEOPS_API_KEY=your_key_here
SCRAPEOPS_RENDER_JS=true
```

See Configuration Guide for all options.
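As an illustration of how boolean and numeric settings like these are typically parsed (hypothetical code, not the server's actual configuration handling):

```python
import os

def env_bool(name, default=False):
    """Interpret common truthy spellings of a boolean env var."""
    return os.environ.get(name, str(default)).strip().lower() in ("1", "true", "yes")

# Simulate a couple of .env entries; ensure the unset case stays unset.
os.environ["ENABLE_RESOURCES"] = "true"
os.environ["PLAYWRIGHT_MAX_CONTEXTS"] = "5"
os.environ.pop("ENABLE_PROMPTS", None)

enable_resources = env_bool("ENABLE_RESOURCES")
enable_prompts = env_bool("ENABLE_PROMPTS")  # unset, falls back to False
max_contexts = int(os.environ.get("PLAYWRIGHT_MAX_CONTEXTS", "5"))
print(enable_resources, enable_prompts, max_contexts)
```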
Claude Desktop
Add to your MCP settings:
```json
{
  "mcpServers": {
    "scraper": {
      "url": "http://localhost:8000/mcp"
    }
  }
}
```

Claude Code Skills
This project includes Agent Skills that provide Claude Code with specialized knowledge for using the scraper tools effectively.
| Skill | Description |
| --- | --- |
| `web-scraping` | CSS selectors, batch operations, retry configuration |
| `perplexity` | AI search, reasoning tasks, conversation patterns |
Install Skills
Copy the skills to your Claude Code skills directory:
```bash
# Clone or download this repo, then:
cp -r .claude/skills/web-scraping ~/.claude/skills/
cp -r .claude/skills/perplexity ~/.claude/skills/
```

Or install directly:
```bash
# web-scraping skill
mkdir -p ~/.claude/skills/web-scraping
curl -o ~/.claude/skills/web-scraping/SKILL.md \
  https://raw.githubusercontent.com/cotdp/scraper-mcp/main/.claude/skills/web-scraping/SKILL.md

# perplexity skill
mkdir -p ~/.claude/skills/perplexity
curl -o ~/.claude/skills/perplexity/SKILL.md \
  https://raw.githubusercontent.com/cotdp/scraper-mcp/main/.claude/skills/perplexity/SKILL.md
```

Once installed, Claude Code will automatically use these skills when performing web scraping or Perplexity AI tasks.
Documentation
| Document | Description |
| --- | --- |
| API Reference | Complete tool documentation, parameters, CSS selectors |
| Configuration Guide | Environment variables, proxy setup, ScrapeOps |
| Dashboard Guide | Monitoring UI, playground, runtime config |
| Development Guide | Local setup, architecture, contributing |
|  | Test suite, coverage, adding tests |
Local Development
```bash
# Install
uv pip install -e ".[dev]"

# Run
python -m scraper_mcp

# Test
pytest

# Lint
ruff check . && mypy src/
```

See Development Guide for details.
License
MIT License
Last updated: December 23, 2025