# Development Guide
Guide for local development and contributing to Scraper MCP.
## Prerequisites
- Python 3.12+
- [uv](https://github.com/astral-sh/uv) package manager
- Docker (optional, for container testing)
## Local Setup
```bash
# Clone the repository
git clone https://github.com/cotdp/scraper-mcp.git
cd scraper-mcp
# Install dependencies
uv pip install -e ".[dev]"
# Run the server
python -m scraper_mcp
# Run with specific transport and port
python -m scraper_mcp streamable-http 0.0.0.0 8000
```
## Development Commands
```bash
# Run tests
pytest
# Run tests with coverage
pytest --cov=scraper_mcp --cov-report=html
# Type checking
mypy src/
# Linting
ruff check .
# Auto-fix linting issues
ruff check . --fix
# Format code
ruff format .
```
## Project Structure
```
scraper-mcp/
├── src/scraper_mcp/
│   ├── __init__.py
│   ├── __main__.py
│   ├── server.py                  # Main MCP server entry point
│   ├── admin/                     # Admin API (config, stats, cache)
│   │   ├── router.py              # HTTP endpoint handlers
│   │   └── service.py             # Business logic
│   ├── dashboard/                 # Web dashboard
│   │   ├── router.py              # Dashboard routes
│   │   └── templates/
│   │       └── dashboard.html     # Monitoring UI
│   ├── tools/                     # MCP scraping tools
│   │   ├── router.py              # Tool registration
│   │   └── service.py             # Scraping implementations
│   ├── models/                    # Pydantic data models
│   │   ├── scrape.py              # Scrape request/response models
│   │   └── links.py               # Link extraction models
│   ├── providers/                 # Scraping backend providers
│   │   ├── base.py                # Abstract provider interface
│   │   └── requests_provider.py   # HTTP provider (requests library)
│   ├── core/
│   │   └── providers.py           # Provider registry and selection
│   ├── cache.py                   # Request caching (disk-based)
│   ├── cache_manager.py           # Cache lifecycle management
│   ├── metrics.py                 # Request/retry metrics tracking
│   └── utils.py                   # HTML processing utilities
├── tests/                         # Pytest test suite
│   ├── conftest.py                # Test fixtures
│   ├── test_providers.py
│   ├── test_server.py
│   ├── test_tools.py
│   └── test_utils.py
├── docs/                          # Documentation
├── .github/workflows/
│   ├── ci.yml                     # CI/CD: tests, linting
│   └── docker-publish.yml         # Docker image publishing
├── Dockerfile                     # Multi-stage production build
├── docker-compose.yml             # Local development setup
├── pyproject.toml                 # Python dependencies (uv)
└── .env.example                   # Environment configuration template
```
## Architecture
### Provider Pattern
The server uses an extensible provider architecture for scraping backends:
```
ScraperProvider (abstract)
├── RequestsProvider (default HTTP scraper)
└── Future: PlaywrightProvider, SeleniumProvider, etc.
```
- **ScraperProvider** (`providers/base.py`): Abstract interface with `scrape()` and `supports_url()` methods
- **RequestsProvider** (`providers/requests_provider.py`): Default implementation using the `requests` library with exponential backoff
The `get_provider()` function routes each URL to the appropriate provider based on URL patterns.
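The sketch below illustrates this contract; it is not the actual code in `providers/base.py`, and the `ScrapeResult` model, method signatures, and `get_provider()` arguments are assumptions for illustration.
```python
# Hedged sketch of the provider contract, not the real providers/base.py.
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class ScrapeResult:
    """Minimal stand-in for the real response model (assumed shape)."""
    url: str
    html: str
    status_code: int


class ScraperProvider(ABC):
    """Abstract scraping backend."""

    @abstractmethod
    def supports_url(self, url: str) -> bool:
        """Return True if this provider can handle the given URL."""

    @abstractmethod
    async def scrape(self, url: str) -> ScrapeResult:
        """Fetch the URL and return the raw scrape result."""


def get_provider(url: str, providers: list[ScraperProvider]) -> ScraperProvider:
    """Return the first registered provider that supports the URL."""
    for provider in providers:
        if provider.supports_url(url):
            return provider
    raise ValueError(f"No scraping provider available for {url}")
```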
### Tool Architecture
All MCP tools follow a dual-mode pattern:
1. **Single URL mode**: Returns `ScrapeResponse` directly
2. **Batch mode**: Returns `BatchScrapeResponse` with individual results
Batch operations use `asyncio.Semaphore` for concurrency control.
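The snippet below is a minimal sketch of this batching approach; `scrape_one` and `max_concurrency` are illustrative names, not the project's actual API.
```python
import asyncio


async def scrape_batch(urls: list[str], max_concurrency: int = 5) -> list[str]:
    """Scrape many URLs while keeping at most max_concurrency requests in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def scrape_one(url: str) -> str:
        async with semaphore:  # blocks once max_concurrency tasks are running
            await asyncio.sleep(0.1)  # placeholder for the real provider call
            return f"<html>fetched {url}</html>"

    # return_exceptions=True would let one failing URL report an error
    # without sinking the rest of the batch.
    return await asyncio.gather(*(scrape_one(u) for u in urls))


if __name__ == "__main__":
    results = asyncio.run(scrape_batch(["https://example.com/a", "https://example.com/b"]))
    print(results)
```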
### HTML Processing
Utilities in `utils.py` use BeautifulSoup with the lxml parser:
- `html_to_markdown()`: Converts HTML to Markdown using `markdownify`
- `html_to_text()`: Extracts plain text
- `extract_links()`: Extracts all `<a>` tags with URL resolution
- `extract_metadata()`: Extracts `<title>` and `<meta>` tags
- `filter_html_by_selector()`: CSS selector filtering
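For illustration, link extraction with URL resolution might look like the following BeautifulSoup sketch; this is not the actual `utils.py` code.
```python
# Illustrative only: BeautifulSoup + lxml link extraction with URL resolution.
from urllib.parse import urljoin

from bs4 import BeautifulSoup

html = '<a href="/about">About</a> <a href="https://example.org/x">External</a>'
soup = BeautifulSoup(html, "lxml")

links = [
    {"text": a.get_text(strip=True), "url": urljoin("https://example.com", a["href"])}
    for a in soup.find_all("a", href=True)
]
print(links)  # [{'text': 'About', 'url': 'https://example.com/about'}, ...]
```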
## Building Docker Images
### Build Locally
```bash
docker build -t scraper-mcp:custom .
docker run -p 8000:8000 scraper-mcp:custom
```
### With Docker Compose
```bash
docker-compose build
docker-compose up -d
```
### Multi-Platform Build
```bash
docker buildx build --platform linux/amd64,linux/arm64 -t scraper-mcp:multi .
```
## Adding New Features
### Adding a New Tool
1. Define a Pydantic response model in `models/`
2. Add a utility function to `utils.py` if needed
3. Create the tool function in `tools/service.py`
4. Register the tool in `tools/router.py` (see the sketch after this list)
5. Add tests in `tests/`
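A hedged end-to-end sketch of steps 1-4, collapsed into one file for brevity and using the MCP Python SDK's FastMCP decorator style; the hypothetical `word_count` tool and the exact router/service wiring are illustrative, not the repo's actual layout.
```python
from mcp.server.fastmcp import FastMCP
from pydantic import BaseModel

mcp = FastMCP("scraper-mcp")


class WordCountResponse(BaseModel):        # step 1: Pydantic response model
    url: str
    words: int


async def count_words(url: str) -> WordCountResponse:  # step 3: service logic
    html = "<html>pretend this page was scraped</html>"  # placeholder for provider call
    return WordCountResponse(url=url, words=len(html.split()))


@mcp.tool()                                # step 4: register the tool
async def word_count(url: str) -> WordCountResponse:
    """Count words on a scraped page (hypothetical example tool)."""
    return await count_words(url)
```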
### Adding a New Provider
1. Create a new file in `providers/` (e.g., `playwright_provider.py`)
2. Subclass `ScraperProvider` and implement `scrape()` and `supports_url()` (see the sketch after this list)
3. Update `core/providers.py` to route the relevant URL patterns to the new provider
4. Add provider-specific tests
5. Update `pyproject.toml` dependencies if needed
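A hedged skeleton for step 2; the `ScraperProvider` import path and return type are assumed from `providers/base.py`, and the Playwright calls only sketch the happy path.
```python
# Hypothetical PlaywrightProvider skeleton; error handling and retries omitted.
from playwright.async_api import async_playwright

from scraper_mcp.providers.base import ScraperProvider  # assumed import path


class PlaywrightProvider(ScraperProvider):
    """Renders JavaScript-heavy pages in a headless browser."""

    def supports_url(self, url: str) -> bool:
        # In practice, core/providers.py (step 3) decides which URLs come here.
        return url.startswith(("http://", "https://"))

    async def scrape(self, url: str):
        async with async_playwright() as p:
            browser = await p.chromium.launch()
            page = await browser.new_page()
            await page.goto(url)
            html = await page.content()
            await browser.close()
            return html  # the real provider would wrap this in its response model
```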
## Testing
### Test Structure
- **Unit tests** (`test_utils.py`): HTML processing, conversion, extraction
- **Provider tests** (`test_providers.py`): HTTP scraping, error handling
- **Integration tests** (`test_server.py`): MCP tool functionality
### Running Specific Tests
```bash
# Specific file
pytest tests/test_utils.py
# Specific class
pytest tests/test_providers.py::TestRequestsProvider
# Specific test
pytest tests/test_server.py::TestScrapeUrlTool::test_scrape_url_success
# Verbose output
pytest -v
```
### Test Fixtures
Fixtures in `tests/conftest.py` provide sample HTML for testing:
- `sample_html`: Complex HTML with various elements
- `simple_html`: Minimal HTML for basic tests
- `html_with_links`: HTML with different link types
- `html_with_metadata`: HTML with meta tags and OpenGraph data
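A sketch of how such fixtures might be defined; the actual HTML payloads in `tests/conftest.py` will differ.
```python
import pytest


@pytest.fixture
def simple_html() -> str:
    """Minimal HTML document for basic conversion tests (illustrative payload)."""
    return "<html><head><title>Test</title></head><body><p>Hello</p></body></html>"


@pytest.fixture
def html_with_links() -> str:
    """HTML containing absolute, relative, and anchor links (illustrative payload)."""
    return (
        '<body>'
        '<a href="https://example.com">absolute</a>'
        '<a href="/about">relative</a>'
        '<a href="#section">anchor</a>'
        '</body>'
    )
```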
## Code Style
- **Line length**: 100 characters
- **Type hints**: Required for all functions
- **Docstrings**: Google style
- **Imports**: Sorted with `ruff`
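An illustrative function written to these conventions (Google-style docstring, full type hints, lines within 100 characters):
```python
def truncate_text(text: str, max_length: int = 100) -> str:
    """Truncate text to a maximum length, appending an ellipsis when trimmed.

    Args:
        text: The input string to truncate.
        max_length: Maximum number of characters to keep, including the ellipsis.

    Returns:
        The original string, or a trimmed copy ending in "..." if it was too long.
    """
    if len(text) <= max_length:
        return text
    return text[: max_length - 3] + "..."
```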
## CI/CD
GitHub Actions workflows:
- **ci.yml**: Runs on every PR
  - Python 3.12 tests
  - Type checking (mypy)
  - Linting (ruff)
  - Coverage reporting
- **docker-publish.yml**: Runs on releases
  - Multi-platform builds (amd64, arm64)
  - Pushes to Docker Hub and GHCR
  - Semantic version tags