Scrapy MCP Server
The Scrapy MCP Server is a robust, enterprise-grade web scraping platform that offers comprehensive data extraction capabilities for commercial use.
Core Scraping Capabilities:
Multiple scraping methods: HTTP requests, Scrapy framework, Selenium, or Playwright with intelligent method selection
Concurrent processing: Scrape multiple URLs simultaneously with exponential backoff retry mechanisms
JavaScript support: Fully render dynamic, JavaScript-heavy websites using complete browser rendering
Advanced data extraction: Configure flexible extraction rules using simple or advanced selectors, or automatically extract structured data like contact information, social media links, product details, and addresses
Link extraction: Specialized link extraction with domain filtering and internal/external link options
Form interaction: Automatically fill and submit various form types including text inputs, checkboxes, and file uploads
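The link-extraction item above (domain filtering plus internal/external classification) can be illustrated with a minimal sketch built only on the Python standard library. This is not the server's implementation; `LinkCollector`, `extract_links`, and the `allowed_domains` parameter are hypothetical names chosen for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag in a document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_links(html, base_url, allowed_domains=None):
    """Return (internal, external) lists of absolute http(s) links.

    allowed_domains is a hypothetical filter: when given, links whose
    host is not in the set are dropped.
    """
    collector = LinkCollector()
    collector.feed(html)
    base_host = urlparse(base_url).netloc
    internal, external = [], []
    for href in collector.hrefs:
        absolute = urljoin(base_url, href)  # resolve relative links
        parsed = urlparse(absolute)
        if parsed.scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, and similar
        if allowed_domains is not None and parsed.netloc not in allowed_domains:
            continue
        (internal if parsed.netloc == base_host else external).append(absolute)
    return internal, external


SAMPLE_HTML = """
<a href="/pricing">Pricing</a>
<a href="https://example.com/docs">Docs</a>
<a href="https://other.example.org/blog">Blog</a>
<a href="mailto:hi@example.com">Mail</a>
"""
internal_links, external_links = extract_links(SAMPLE_HTML, "https://example.com/")
print(internal_links)
print(external_links)
```

A link counts as internal here only when its host matches the base URL exactly; treating subdomains as internal would be a separate design decision.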
Anti-Detection & Performance:
Stealth techniques: Bypass anti-bot measures using undetected-chromedriver, Playwright stealth, random User-Agent rotation, and proxy support
Performance optimization: In-memory caching, rate limiting, and intelligent request handling to prevent server overload
Monitoring tools: Track server metrics including request counts, success rates, cache statistics, and detailed performance monitoring
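To make the in-memory caching and cache-statistics items above concrete, here is a small sketch of a time-to-live cache that also counts hits and misses. It only illustrates the general technique; the class name `TTLCache`, its methods, and the injectable `clock` parameter are assumptions and do not describe the server's actual cache.

```python
import time


class TTLCache:
    """In-memory cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.clock = clock  # injectable so expiry can be tested without sleeping
        self._store = {}
        self.hits = 0
        self.misses = 0

    def get(self, key):
        entry = self._store.get(key)
        if entry is not None:
            expires_at, value = entry
            if self.clock() < expires_at:
                self.hits += 1
                return value
            del self._store[key]  # expired: evict lazily on access
        self.misses += 1
        return None

    def set(self, key, value):
        self._store[key] = (self.clock() + self.ttl_seconds, value)

    def clear(self):
        self._store.clear()

    def stats(self):
        total = self.hits + self.misses
        return {
            "hits": self.hits,
            "misses": self.misses,
            "hit_rate": self.hits / total if total else 0.0,
        }


# Demonstration with a fake clock instead of real waiting.
fake_now = [0.0]
cache = TTLCache(ttl_seconds=60, clock=lambda: fake_now[0])
cache.set("https://example.com/", "<html>...</html>")
first = cache.get("https://example.com/")   # served from the cache
fake_now[0] = 61.0
second = cache.get("https://example.com/")  # expired, counted as a miss
print(cache.stats())
```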
Enterprise Features:
Ethical compliance: Check robots.txt files for responsible data collection
Error handling: Robust error classification and handling mechanisms
Cache management: Clear scraping results cache and manage server resources
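The robots.txt check listed above can be sketched with Python's standard `urllib.robotparser`. The rules and the `example-bot` user agent below are invented for illustration, and the rules are parsed from a string so the example runs offline; how the server itself performs the check is not documented here.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real check would download it from the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_parser(robots_text):
    """Parse robots.txt rules supplied as a string."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)
can_fetch_public = parser.can_fetch("example-bot", "https://example.com/pricing")
can_fetch_private = parser.can_fetch("example-bot", "https://example.com/private/data")
delay = parser.crawl_delay("example-bot")
print(can_fetch_public, can_fetch_private, delay)
```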
Provides web scraping capabilities using the Scrapy framework for large-scale data extraction, with support for concurrent requests, custom pipelines, and advanced crawling features.
Enables browser automation and JavaScript-heavy website scraping through Selenium WebDriver, with support for form filling, element waiting, and dynamic content extraction.
Click on "Install Server".
Wait a few minutes for the server to deploy. Once ready, it will show a "Started" state.
In the chat, type @ followed by the MCP server name and your instructions, e.g., "@Scrapy MCP Server scrape the pricing page from example.com and convert it to markdown".
That's it! The server will respond to your query, and you can continue using it as needed.
Here is a step-by-step guide with screenshots.
✨ Why Negentropy Perceives?
In the vast ecosystem of AI agent projects, the "dirty work" of information perception often degenerates into fragile, unmaintainable chaos over time. Grounded in our core engineering philosophy of Orthogonal Decomposition and Entropy Reduction (Negentropy), we completely quarantine the mess of low-level network communications and format deconstruction. We only inject pure, undisputed certainty into your sandbox:
🕵️ Web Page to Markdown: Facing heavily-rendered SPAs and fortified anti-scraping defenses? The engine comes armed with a built-in 5-tier penetration mechanism (ranging from hyper-concurrency to headless stealth browser rotation). "What You See Is What You Get" — tearing through waterfall setups is a walk in the park.
📑 PDF to Markdown: Stop compromising over misaligned tables and mangled characters. Powered by our proprietary "Engine Arena" mechanism, engaging Smart mode summons an LLM as the ultimate referee. It coordinates 7 specialized engines (including Docling, PyMuPDF, etc.) performing concurrent deconstruction to precisely extract LaTeX formulas, gnarly table matrices, and deep layout structures.
🦾 Heavy-Duty Infrastructure: Abandon toy-grade SDK wrappers. Our core is hardwired with resilient exponential backoffs, multi-layered rate-limiting circuit breakers, and aggressive memory caching mechanisms. Riding on full-duplex asyncio, it maxes out the absolute throughput limit of a single node.
🔌 Native MCP Integration: We firmly embrace the pristine Model Context Protocol specification. Leveraging standard HTTP / STDIO / SSE transports, it abandons redundant glue code for seamless, zero-friction injection into Claude Desktop or Cursor environments.
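The exponential backoff mentioned under Heavy-Duty Infrastructure can be sketched as follows. This is a generic asyncio retry helper, not the project's code: `retry_with_backoff`, its parameters, and the simulated `flaky_fetch` are all hypothetical, and the choice of full jitter is an assumption.

```python
import asyncio
import random


async def retry_with_backoff(operation, max_attempts=4, base_delay=0.5,
                             max_delay=8.0, sleep=asyncio.sleep):
    """Await operation(), retrying failures with exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return await operation()
        except Exception:  # a real client would catch only transient errors
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Delay cap doubles each attempt: 0.5s, 1s, 2s, ... up to max_delay.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Full jitter spreads retries from concurrent callers apart.
            await sleep(random.uniform(0, delay))


calls = {"count": 0}


async def flaky_fetch():
    """Simulated request that fails twice and then succeeds."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated transient failure")
    return "ok"


async def no_wait(_seconds):
    """Stand-in for asyncio.sleep so the demonstration finishes instantly."""
    return None


result = asyncio.run(retry_with_backoff(flaky_fetch, sleep=no_wait))
print(result, calls["count"])
```

Passing the sleep function in as a parameter keeps the helper testable; in normal use the default `asyncio.sleep` applies the real delays.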
Quick Start
1. Millisecond Loading
# We recommend using uv (Python 3.13+ required)
uv add negentropy-perceives
2. Ignite the Engine
uv run negentropy-perceives # Defaults to listening on localhost:2992, HTTP mode
💡 Advanced Arsenal: Upon first launch, Negentropy Perceives will auto-generate its configuration at ~/.negentropy/perceives.config.yaml. Hidden inside are the switches for high-tier warfare.
3. Witness True Perception
import asyncio

from negentropy.perceives.sdk import NegentropyPerceivesClient


async def perceive_world():
    async with NegentropyPerceivesClient() as client:
        result = await client.parse_webpage_to_markdown(
            url="https://en.wikipedia.org/wiki/Entropy",
        )
        print("====== Pure Nectar Extracted ======")
        print(result.markdown_content[:250], "......\n")
        print(f"📊 Pure words retrieved from the noise: {result.word_count}")


asyncio.run(perceive_world())
4. Connect the MCP Client
Add the following to your claude_desktop_config.json in Claude Desktop:
{
  "mcpServers": {
    "negentropy-perceives": {
      "type": "http",
      "url": "http://localhost:2992/mcp"
    }
  }
}
Supports three transport modes: STDIO (local dev), HTTP (production-recommended), and SSE (compatibility mode). See the User Guide for the comprehensive configuration.
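If the configuration file already lists other servers, the entry has to be merged in rather than pasted over them. The sketch below shows one way to do that on an in-memory dictionary; `add_server_entry` and the `other-server` entry are hypothetical, and reading or writing the actual claude_desktop_config.json file is deliberately left out.

```python
import json


def add_server_entry(config, name, url):
    """Return a copy of config with an HTTP MCP server entry added."""
    merged = dict(config)
    servers = dict(merged.get("mcpServers", {}))  # copy so the input is untouched
    servers[name] = {"type": "http", "url": url}
    merged["mcpServers"] = servers
    return merged


# Hypothetical existing configuration with one unrelated server.
existing = {"mcpServers": {"other-server": {"command": "other"}}}
updated = add_server_entry(
    existing, "negentropy-perceives", "http://localhost:2992/mcp"
)
print(json.dumps(updated, indent=2))
```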
Core Capabilities
Toolkit Overview
Tool | Function | Use Case |
| Discover webpage links, supports domain filtering | Site map discovery, link audits |
| Inspect page metadata (status code, content type, etc.) | Target page pre-flight check |
| Webpage to Markdown | Granular single-page extraction |
| Batch Webpages to Markdown | Knowledge base building, site archives |
| PDF to Markdown | Academic papers, financial reports |
| Batch PDFs to Markdown | Mass document digitization |
Please adhere to the targeted website's Terms of Service (TOS) and sensibly restrict request frequencies. This tool is intended exclusively for legal and compliant data acquisition.
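One common way to keep batch jobs within a sensible request frequency is to cap how many pages are in flight at once. The following sketch uses an asyncio semaphore for that; `run_batch`, `fake_fetch`, and `max_concurrency` are illustrative names, and the real batch tools may limit load differently.

```python
import asyncio


async def run_batch(urls, fetch_one, max_concurrency=3):
    """Process urls concurrently, with at most max_concurrency in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(url):
        async with semaphore:  # wait here while the limit is reached
            return await fetch_one(url)

    # gather preserves the input order of the results.
    return await asyncio.gather(*(guarded(url) for url in urls))


state = {"in_flight": 0, "peak": 0}


async def fake_fetch(url):
    """Simulated page conversion that records peak concurrency."""
    state["in_flight"] += 1
    state["peak"] = max(state["peak"], state["in_flight"])
    await asyncio.sleep(0.01)
    state["in_flight"] -= 1
    return f"# {url}"


urls = [f"https://example.com/page/{i}" for i in range(10)]
pages = asyncio.run(run_batch(urls, fake_fetch, max_concurrency=3))
print(len(pages), state["peak"])
```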
Web Scraping Strategies
Method | Description |
| Smart selection (Recommended) |
| Standard HTTP request, ideal for static pages |
| Browser rendering, seamlessly executes JS |
| Covert Selenium, shatters anti-scraping blocks |
| Stealth Playwright, lightweight anti-detection |
PDF Engines
Engine | Specialty | GPU Acceleration |
Docling | AI layout analysis, table recognition | CUDA / MPS / XPU |
MinerU | Deep learning structure analysis, LaTeX | CUDA / MLX |
Marker | Academic documents, Nougat model | CUDA |
PyMuPDF | Lightning-fast text extraction | — |
PyPDF | Absolute baseline fallback | — |
In auto mode, the system cascades through a graceful degradation chain: Docling → MinerU → Marker → PyMuPDF → PyPDF. Activating smart mode enlists an LLM to orchestrate a competitive parallel run across engines, ultimately fusing the optimum output.
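The degradation chain described above follows a familiar pattern: try each engine in priority order and return the first result that succeeds. The sketch below shows that pattern with stub functions; `convert_with_fallback`, `docling_stub`, and `pymupdf_stub` are hypothetical stand-ins and do not call the real libraries.

```python
def convert_with_fallback(pdf_path, engines):
    """Try each (name, convert) pair in order; return the first success."""
    errors = []
    for name, convert in engines:
        try:
            return name, convert(pdf_path)
        except Exception as exc:  # an engine may be missing or fail on this file
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all engines failed: " + "; ".join(errors))


# Hypothetical stand-ins for real engine wrappers.
def docling_stub(path):
    raise ImportError("docling not installed")


def pymupdf_stub(path):
    return f"# Text extracted from {path}"


chain = [("docling", docling_stub), ("pymupdf", pymupdf_stub)]
engine_used, markdown = convert_with_fallback("paper.pdf", chain)
print(engine_used)
print(markdown)
```

Collecting the error from each failed engine makes the final exception useful for diagnosing why the whole chain was exhausted.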
Architectural Landscape
graph TD
A["SDK Layer<br/>NegentropyPerceivesClient"] -.->|"HTTP Transport"| T["MCP Tool Layer<br/>6 Tools · @app.tool()"]
T --> P["Pipeline Layer<br/>Stage Orchestration · Competition/Fallback"]
T --> B["Processing Engine Layer<br/>Scraping · PDF · Markdown"]
P --> B
B --> C["Infrastructure Layer<br/>RateLimiter · Cache · Metrics · ErrorHandler · Retry"]
C --> D["Configuration Layer<br/>pydantic-settings · Env Vars"]
style A fill:#4c1d95,stroke:#a78bfa,color:#ffffff
style T fill:#1e3a8a,stroke:#3b82f6,color:#ffffff
style P fill:#b45309,stroke:#f59e0b,color:#ffffff
style B fill:#166534,stroke:#22c55e,color:#ffffff
style C fill:#134e4a,stroke:#14b8a6,color:#ffffff
style D fill:#581c87,stroke:#9333ea,color:#ffffff
A 5-tier orthogonal architecture: SDK → MCP Tools → Pipeline Orchestration → Processing Engines → Infrastructure, with the Configuration Layer interweaving through everything. Featuring a 10-Stage PDF Pipeline and a 12-Stage WebPage Pipeline that strictly enforce both fallback and competitive execution models.
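The competitive execution model can be pictured as running every engine concurrently and keeping the highest-scoring output. The sketch below is one possible shape for such a stage, not the pipeline's actual code: `run_competition`, the stub engines, and `table_aware_score` (a toy heuristic standing in for the LLM referee) are all assumptions.

```python
import asyncio


async def run_competition(source, engines, score):
    """Run all engines concurrently and return (name, output) of the best one."""
    names = list(engines)
    outcomes = await asyncio.gather(
        *(engines[name](source) for name in names),
        return_exceptions=True,  # a failing engine must not cancel the others
    )
    candidates = [
        (name, output)
        for name, output in zip(names, outcomes)
        if not isinstance(output, BaseException)
    ]
    if not candidates:
        raise RuntimeError("every engine failed")
    return max(candidates, key=lambda pair: score(pair[1]))


# Hypothetical engines used only for this demonstration.
async def fast_engine(source):
    return "plain text"


async def rich_engine(source):
    await asyncio.sleep(0)
    return "| a | b |\n|---|---|\n| 1 | 2 |"


async def broken_engine(source):
    raise ValueError("unsupported layout")


def table_aware_score(markdown):
    # Toy heuristic: prefer outputs with more table rows, then longer text.
    return markdown.count("\n|") + len(markdown) / 1000


winner, best_output = asyncio.run(run_competition(
    "report.pdf",
    {"fast": fast_engine, "rich": rich_engine, "broken": broken_engine},
    table_aware_score,
))
print(winner)
```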
Documentation Navigator
Document | Content | Who is it for |
| Deep dive into 6 tools, MCP Server setup, SDK interfaces, advanced tweaks | All Users |
| 5-tier architecture, Pipeline orchestration, engine fallbacks, Smart Mode | Architects / Contributors |
| Environment setup, test framework, CI/CD, PR guidelines | Developers |
| Release history and change logs | Everyone |
Community & Contributions
Beyond the World Wide Web and massive unstructured texts lies an abyss of noise. Only through relentless code evolution can we forge ahead steadily. If you hold the inspiration to pull chaos back into order, please do not hesitate to share:
Before striking your keyboard, flip through the Developer Guide along the way.
Hurl your paradigm-shifting ideas at our Issues or directly submit a Pull Request armed with game-changing power.
MCP directory API
We provide all the information about MCP servers via our MCP API.
curl -X GET 'https://glama.ai/api/mcp/v1/servers/ThreeFish-AI/negentropy-perceives'
If you have feedback or need assistance with the MCP directory API, please join our Discord server