Scrapy MCP Server

by ThreeFish-AI
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

This is a comprehensive web scraping MCP (Model Context Protocol) server built on FastMCP and Scrapy, designed for enterprise-grade web scraping with anti-detection capabilities. The server provides 10 MCP tools for various scraping scenarios, from simple HTTP requests to sophisticated stealth scraping and form automation.

## Development Commands

### Setup and Installation

```bash
# Quick setup using the provided script (recommended)
./scripts/setup.sh

# Manual setup with uv
uv sync

# Install with development dependencies
uv sync --extra dev

# Copy environment configuration
cp .env.example .env
```

### Running the Server

```bash
# Start the MCP server (primary command)
uv run data-extractor

# Alternative: run as a Python module
uv run python -m extractor.server

# Run with environment variables
DATA_EXTRACTOR_ENABLE_JAVASCRIPT=true uv run data-extractor
```

### Code Quality and Testing

```bash
# Format code with Black
uv run black extractor/ examples/

# Lint with flake8
uv run flake8 extractor/

# Type checking with mypy
uv run mypy extractor/

# Run tests
uv run pytest

# Add dependencies
uv add <package-name>
uv add --dev <package-name>

# Update dependencies
uv lock --upgrade
```

## Architecture Overview

### Core Module Structure

The system is built with a layered architecture centered on method auto-selection and enterprise-grade utilities:

**extractor/server.py** - FastMCP server with 10 MCP tools registered via `@app.tool()` decorators. Each tool follows the same pattern: Pydantic request model → method selection → error handling → metrics collection.

**extractor/scraper.py** - Multi-strategy scraping engine with automatic method selection:
- `WebScraper.scrape_url()` orchestrates method selection based on requirements
- Supports simple HTTP requests, the Scrapy framework, and Selenium browser automation
- Method selection logic considers JavaScript detection and anti-bot protection needs

**extractor/advanced_features.py** - Stealth capabilities and form automation:
- `AntiDetectionScraper` using undetected-chromedriver and Playwright
- `FormHandler` for complex form interactions (dropdowns, checkboxes, file uploads)

**extractor/utils.py** - Enterprise utilities with async support:
- `RateLimiter`, `RetryManager`, `CacheManager`, `MetricsCollector`, `ErrorHandler`
- All utilities follow shared patterns for async support, error handling, and metrics

**extractor/config.py** - Pydantic BaseSettings with automatic environment variable mapping using the `DATA_EXTRACTOR_` prefix.

### Key Design Patterns

**Method Auto-Selection**: `WebScraper` chooses the scraping method based on JavaScript requirements, anti-bot protection, and performance needs.

**Layered Error Handling**: Errors are caught at multiple levels, categorized (timeout, connection, anti-bot), and handled with appropriate retry strategies.

**Enterprise Features**: Built-in rate limiting, caching with TTL, comprehensive metrics collection, and proxy support for production deployment.
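
To make the tool pattern above concrete, here is a minimal sketch of what a new `@app.tool()` registration could look like. It assumes the standalone `fastmcp` package; the names `ScrapePageRequest` and `METRICS`, and the direct `httpx` fetch, are illustrative stand-ins rather than the repository's actual classes; the real server delegates to `WebScraper` and the utilities in `extractor/utils.py`.

```python
# Minimal sketch only. ScrapePageRequest, METRICS, and the httpx fetch are
# illustrative assumptions; they are not the repository's actual helpers.
from collections import Counter

import httpx
from fastmcp import FastMCP
from pydantic import BaseModel, Field

app = FastMCP("data-extractor-example")
METRICS: Counter = Counter()  # stand-in for the real MetricsCollector


class ScrapePageRequest(BaseModel):
    url: str = Field(..., description="Page URL to fetch")
    timeout: float = Field(30.0, description="Request timeout in seconds")


@app.tool()
async def scrape_page(request: ScrapePageRequest) -> dict:
    """Fetch a page and return its raw HTML, recording success/failure metrics."""
    try:
        async with httpx.AsyncClient(timeout=request.timeout, follow_redirects=True) as client:
            response = await client.get(request.url)
            response.raise_for_status()
        METRICS["scrape_page.success"] += 1
        return {"success": True, "html": response.text}
    except Exception as exc:  # the real ErrorHandler categorizes errors and retries
        METRICS["scrape_page.error"] += 1
        return {"success": False, "error": str(exc)}


if __name__ == "__main__":
    app.run()
```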
## Configuration System

Environment variables use the `DATA_EXTRACTOR_` prefix (see .env.example).

**Critical Settings:**

- `DATA_EXTRACTOR_ENABLE_JAVASCRIPT` - Enables browser automation globally
- `DATA_EXTRACTOR_USE_RANDOM_USER_AGENT` - Anti-detection feature
- `DATA_EXTRACTOR_CONCURRENT_REQUESTS` - Controls Scrapy concurrency
- `DATA_EXTRACTOR_BROWSER_TIMEOUT` - Browser wait timeout

## Data Extraction Configuration

Flexible extraction configs support simple CSS selectors and complex attribute extraction:

- Simple: `{"title": "h1"}`
- Complex: `{"products": {"selector": ".product", "multiple": true, "attr": "text"}}`
- Attributes: text content, href links, src images, custom attributes

See `examples/extraction_configs.py` for comprehensive examples covering e-commerce, news, job listings, and real estate scenarios.

## Working with the Codebase

**Adding New MCP Tools**: Add to `server.py` using the `@app.tool()` decorator. Follow the existing pattern: Pydantic request model → error handling → metrics collection.

**Extending Scraping Methods**: Modify the `scraper.py` classes. The `WebScraper.scrape_url()` method orchestrates the method selection logic.

**Adding Anti-Detection Features**: Extend `AntiDetectionScraper` in `advanced_features.py`. Consider browser stealth options, behavior simulation, and proxy rotation.

**Configuration Changes**: Add settings to `DataExtractorSettings` in `config.py` using Pydantic `Field` with environment variable mapping (see the illustrative sketch at the end of this file).

**Utility Functions**: Add to `utils.py`, following the existing patterns for async support, error handling, and metrics integration.

## Performance Considerations

- The server uses asyncio for concurrent operations
- Scrapy runs on the Twisted reactor (single-threaded event loop)
- Browser automation (Selenium/Playwright) is resource-intensive
- Caching significantly improves performance for repeated requests
- Rate limiting prevents overwhelming target servers

## Browser Dependencies

Selenium and the stealth features require a Chrome/Chromium browser. Playwright downloads its own browser binaries automatically.

## Security Notes

- Stealth features should be used ethically
- Always check robots.txt using the provided tool
- Proxy support is available, but make sure proxies use HTTPS
- Do not log sensitive data; handle credentials carefully
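
## Example: Adding a Configuration Setting

The sketch below illustrates the configuration pattern referenced under "Working with the Codebase": Pydantic `BaseSettings` with automatic `DATA_EXTRACTOR_` environment variable mapping. It assumes Pydantic v2 with the `pydantic-settings` package; the field names mirror the critical settings listed above, but the defaults and the exact shape of `DataExtractorSettings` are assumptions rather than the repository's actual `config.py`.

```python
# Illustrative sketch, not the repository's actual config.py. Defaults and the
# max_retries field are assumptions; only the DATA_EXTRACTOR_ prefix and the
# BaseSettings approach come from this document.
from pydantic import Field
from pydantic_settings import BaseSettings, SettingsConfigDict


class DataExtractorSettings(BaseSettings):
    # Maps e.g. DATA_EXTRACTOR_ENABLE_JAVASCRIPT=true onto enable_javascript.
    model_config = SettingsConfigDict(env_prefix="DATA_EXTRACTOR_", env_file=".env")

    enable_javascript: bool = Field(default=False, description="Enable browser automation globally")
    concurrent_requests: int = Field(default=8, description="Scrapy concurrency")
    browser_timeout: float = Field(default=30.0, description="Browser wait timeout in seconds")

    # A new setting is just another Field; the env var name follows automatically
    # (here: DATA_EXTRACTOR_MAX_RETRIES).
    max_retries: int = Field(default=3, description="Retry attempts per request")


if __name__ == "__main__":
    print(DataExtractorSettings().model_dump())
```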
