# Contributing to Scraper MCP

Thank you for your interest in contributing to Scraper MCP! This document provides guidelines and instructions for contributing to the project.

## Table of Contents

- [Code of Conduct](#code-of-conduct)
- [Getting Started](#getting-started)
- [Development Setup](#development-setup)
- [Development Workflow](#development-workflow)
- [Code Standards](#code-standards)
- [Testing](#testing)
- [Submitting Changes](#submitting-changes)
- [Areas for Contribution](#areas-for-contribution)

## Code of Conduct

We are committed to providing a welcoming and inclusive environment. Please be respectful and constructive in all interactions.

## Getting Started

### Prerequisites

- Python 3.12 or higher
- [uv](https://github.com/astral-sh/uv) package manager
- Docker and Docker Compose (for testing deployment)
- Git

### Fork and Clone

1. Fork the repository on GitHub
2. Clone your fork locally:

   ```bash
   git clone git@github.com:YOUR_USERNAME/scraper-mcp.git
   cd scraper-mcp
   ```

3. Add the upstream repository:

   ```bash
   git remote add upstream git@github.com:carrotly-ai/scraper-mcp.git
   ```

## Development Setup

### Install Dependencies

```bash
# Install the package with development dependencies
uv pip install -e ".[dev]"
```

### Run the Server Locally

```bash
# Run with default settings (stdio transport)
python -m scraper_mcp

# Run with HTTP transport
python -m scraper_mcp streamable-http 0.0.0.0 8000
```

### Access the Dashboard

Open `http://localhost:8000/` in your browser to access the monitoring dashboard, playground, and configuration interface.

## Development Workflow

### Create a Feature Branch

```bash
git checkout -b feature/your-feature-name
# or
git checkout -b fix/your-bug-fix
```

### Make Your Changes

1. Write your code following the [Code Standards](#code-standards)
2. Add or update tests as needed
3. Update documentation (README, docstrings, comments)
4. Run tests and linting locally

### Keep Your Branch Updated

```bash
git fetch upstream
git rebase upstream/main
```

## Code Standards

### Python Style

We use [Ruff](https://github.com/astral-sh/ruff) for linting and formatting:

```bash
# Check for linting issues
ruff check .

# Auto-fix linting issues
ruff check . --fix

# Format code
ruff format .
```

### Type Hints

We use type hints throughout the codebase. Run type checking with:

```bash
mypy src/
```

All public functions and methods should have complete type annotations.

### Code Organization

- **Provider Pattern**: New scraping backends should implement the `ScraperProvider` interface
- **Utilities**: HTML processing utilities go in `src/scraper_mcp/utils.py`
- **Models**: Use Pydantic v2 models for all data structures
- **Async/Await**: Use async patterns consistently throughout
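To show how these conventions fit together, here is a minimal sketch of a new backend, assuming an async `scrape(url)` method that returns a Pydantic model with a `status_code` field (consistent with the test example later in this document). The `ScrapeResult` name, its other fields, and the use of `httpx` are illustrative assumptions; check the actual `ScraperProvider` base class in the source tree before implementing.

```python
# Illustrative sketch only -- the real ScraperProvider interface and result
# model live in src/scraper_mcp; the names and fields below are assumptions,
# and httpx may not be a project dependency.
import httpx
from pydantic import BaseModel


class ScrapeResult(BaseModel):
    """Assumed Pydantic v2 result shape for a scrape."""

    url: str
    status_code: int
    content: str


class HttpxProvider:
    """Example backend following the assumed ScraperProvider contract."""

    async def scrape(self, url: str) -> ScrapeResult:
        """Fetch a URL asynchronously and wrap the response in a model."""
        async with httpx.AsyncClient(follow_redirects=True) as client:
            response = await client.get(url)
        return ScrapeResult(
            url=str(response.url),
            status_code=response.status_code,
            content=response.text,
        )
```

Whatever the real interface looks like, keep new providers fully async and return Pydantic models rather than raw dicts.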
### Documentation

- Add docstrings to all public functions, classes, and methods
- Update README.md for user-facing changes
- Update CLAUDE.md for development guidance changes
- Include inline comments for complex logic

### Commit Messages

Use [Conventional Commits](https://www.conventionalcommits.org/) format:

```
feat: add support for JavaScript rendering
fix: resolve timeout issue with slow sites
docs: update proxy configuration examples
refactor: simplify retry logic
test: add tests for batch operations
chore: update dependencies
```

Keep commits focused and atomic. Each commit should represent a single logical change.

## Testing

### Run Tests

```bash
# Run all tests with coverage
pytest

# Run specific test file
pytest tests/test_server.py

# Run specific test class
pytest tests/test_server.py::TestScrapeUrlTool

# Run with verbose output
pytest -v

# Run without coverage report
pytest --no-cov
```

### Writing Tests

- Use `pytest` with `pytest-asyncio` for async tests
- Use `pytest-mock` for mocking
- Place test fixtures in `tests/conftest.py`
- Aim for >90% code coverage
- Test both success and error cases
- Test edge cases and boundary conditions

### Test Structure

```python
import pytest
from unittest.mock import Mock, patch


@pytest.mark.asyncio
async def test_feature_name(provider: RequestsProvider) -> None:
    """Test description."""
    # Arrange
    mock_response = Mock()
    mock_response.status_code = 200

    # Act
    with patch.object(provider.session, "get", return_value=mock_response):
        result = await provider.scrape("https://example.com")

    # Assert
    assert result.status_code == 200
```
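The `provider` argument in the example above is supplied by a pytest fixture. A minimal sketch of such a fixture in `tests/conftest.py`, assuming `RequestsProvider` can be constructed with no arguments (the import path below is also an assumption; mirror the repository's real fixtures):

```python
# tests/conftest.py -- illustrative sketch only; the RequestsProvider import
# path and constructor are assumptions, not the repository's actual code.
import pytest

from scraper_mcp.providers import RequestsProvider  # assumed import path


@pytest.fixture
def provider() -> RequestsProvider:
    """Give each test a fresh RequestsProvider instance."""
    return RequestsProvider()
```

Shared fixtures like this keep individual tests focused on the arrange/act/assert steps rather than provider setup.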
## Submitting Changes

### Before Submitting

1. **Run all checks**:

   ```bash
   # Format code
   ruff format .

   # Fix linting issues
   ruff check . --fix

   # Type check
   mypy src/

   # Run tests
   pytest
   ```

2. **Update documentation** if needed
3. **Add tests** for new features or bug fixes
4. **Rebase on latest main**:

   ```bash
   git fetch upstream
   git rebase upstream/main
   ```

### Create a Pull Request

1. Push your branch to your fork:

   ```bash
   git push origin feature/your-feature-name
   ```

2. Go to the [repository on GitHub](https://github.com/carrotly-ai/scraper-mcp)
3. Click "New Pull Request"
4. Select your fork and branch
5. Fill out the PR template:
   - **Title**: Brief description (50 chars max)
   - **Description**:
     - What does this PR do?
     - Why is this change needed?
     - How was it tested?
     - Any breaking changes?
   - **Link related issues**: Use "Closes #123" or "Fixes #123"

### PR Review Process

- Maintainers will review your PR
- Address feedback and push updates
- Once approved, maintainers will merge your PR
- Your contribution will be credited in the release notes

## Areas for Contribution

We welcome contributions in these areas:

### High Priority

- **New Scraping Providers**: Implement `ScraperProvider` for Playwright, Selenium, or Scrapy
- **Performance Optimizations**: Improve caching, concurrency, or memory usage
- **Documentation**: Improve examples, tutorials, or API documentation
- **Test Coverage**: Add tests for edge cases or untested code paths

### Feature Ideas

- **Authentication Support**: Add support for authenticated requests (OAuth, cookies, headers)
- **Screenshot Capture**: Add tools for capturing page screenshots
- **Rate Limiting**: Implement per-domain rate limiting
- **Request Pooling**: Connection pooling for improved performance
- **Webhook Support**: Trigger scrapes via webhooks
- **Scheduled Scraping**: Cron-like scheduling for periodic scrapes
- **Export Formats**: Add JSON, XML, or CSV export options
- **Browser Fingerprinting**: Advanced anti-detection techniques
- **Sitemap Support**: Parse and scrape from XML sitemaps
- **Mobile User Agents**: Better mobile scraping support

### Bug Fixes

- Check the [Issues](https://github.com/carrotly-ai/scraper-mcp/issues) page for open bugs
- Look for issues labeled `good first issue` or `help wanted`

### Documentation

- Improve README examples
- Add tutorials or guides
- Document common use cases
- Translate documentation (if applicable)

## Questions?

- Open an [Issue](https://github.com/carrotly-ai/scraper-mcp/issues) for questions
- Check existing issues and discussions first
- Be specific and provide context

## License

By contributing, you agree that your contributions will be licensed under the MIT License.

---

Thank you for contributing to Scraper MCP! 🎉
