# Site Crawler MCP
A powerful Model Context Protocol (MCP) server for crawling websites and extracting assets, including images and SEO metadata. Built for e-commerce sites and general web crawling needs.
## Features
- **Comprehensive website analysis**: 12 different extraction modes for complete website insights
- **Multi-mode crawling**: Extract multiple data types in a single pass
- **Smart extraction**: Advanced pattern matching for accurate data extraction
- **Performance optimized**: Concurrent crawling with rate limiting
- **Security analysis**: HTTPS, security headers, SSL/TLS information
- **SEO analysis**: Complete SEO audit including meta tags, structured data, and more
- **Legal compliance**: KVKK (Turkish data protection law), GDPR, and privacy policy detection
- **Business intelligence**: Brand info, references, contact details extraction
## Installation
### From PyPI (when published)
```bash
pip install site-crawler-mcp
```
### From Source (Development)
#### Using uv (Recommended)
```bash
# Clone the repository
git clone https://github.com/AndacGuven/site-crawler-mcp.git
cd site-crawler-mcp
# Create virtual environment with Python 3.12
uv venv --python 3.12
source .venv/bin/activate # On Windows: .venv\Scripts\activate
# Install dependencies and package
uv sync
```
#### Using pip
```bash
# Clone the repository
git clone https://github.com/AndacGuven/site-crawler-mcp.git
cd site-crawler-mcp
# Create virtual environment (recommended)
python -m venv venv
# Activate virtual environment
# On Windows:
venv\Scripts\activate
# On Linux/Mac:
source venv/bin/activate
# Install dependencies
pip install -r requirements.txt
# Install package in development mode
pip install -e .
```
## Usage
### As an MCP Server
Add to your MCP configuration file:
- **Windows**: `%APPDATA%\Claude\claude_desktop_config.json`
- **macOS**: `~/Library/Application Support/Claude/claude_desktop_config.json`
- **Linux**: `~/.config/Claude/claude_desktop_config.json`
#### Using uvx (Recommended)
```json
{
  "mcpServers": {
    "site-crawler": {
      "command": "uvx",
      "args": ["--from", "/path/to/site-crawler-mcp", "site-crawler-mcp"]
    }
  }
}
```
#### Using uv run
```json
{
  "mcpServers": {
    "site-crawler": {
      "command": "uv",
      "args": ["run", "site_crawler"],
      "cwd": "/path/to/site-crawler-mcp"
    }
  }
}
```
#### Using python directly
```json
{
  "mcpServers": {
    "site-crawler": {
      "command": "python",
      "args": ["-m", "site_crawler.server"],
      "cwd": "/path/to/site-crawler-mcp/src",
      "env": {
        "PYTHONPATH": "/path/to/site-crawler-mcp/src"
      }
    }
  }
}
```
**Note**: Replace `/path/to/site-crawler-mcp` with your actual project path. On Windows, use backslashes and drive letters (e.g., `C:\\Users\\YourName\\site-crawler-mcp`).
### Available Tools
#### `site_crawlAssets`
Crawl a website and extract various assets based on the specified modes.
**Parameters:**
- `url` (string, required): The URL to start crawling from
- `modes` (array, required): Array of extraction modes (see below)
- `depth` (number, optional): Crawling depth (default: 1)
- `max_pages` (number, optional): Maximum pages to crawl (default: 50)
**Available Modes:**
- `images`: Extract all images with metadata (alt text, dimensions, format)
- `meta`: Basic SEO metadata (title, description, H1 tags)
- `brand`: Company branding information (logo, name, about pages)
- `seo`: Comprehensive SEO analysis (meta tags, structured data, Open Graph)
- `performance`: Page load metrics and performance indicators
- `security`: Security headers and HTTPS configuration
- `compliance`: Accessibility and regulatory compliance checks
- `infrastructure`: Server technology and CDN detection
- `legal`: Privacy policies, terms, KVKK compliance
- `careers`: Job opportunities and career pages
- `references`: Client testimonials and case studies
- `contact`: Contact information (email, phone, social media, address)
**Example Requests:**
1. Basic image extraction:
```json
{
  "tool": "site_crawlAssets",
  "arguments": {
    "url": "https://example.com",
    "modes": ["images"],
    "depth": 1
  }
}
```
2. Full SEO and security audit:
```json
{
  "tool": "site_crawlAssets",
  "arguments": {
    "url": "https://example.com",
    "modes": ["seo", "security", "performance"],
    "depth": 2
  }
}
```
3. Business intelligence gathering:
```json
{
  "tool": "site_crawlAssets",
  "arguments": {
    "url": "https://example.com",
    "modes": ["brand", "contact", "references", "careers"],
    "depth": 3
  }
}
```
4. Legal compliance check:
```json
{
  "tool": "site_crawlAssets",
  "arguments": {
    "url": "https://example.com",
    "modes": ["legal", "compliance"],
    "depth": 2
  }
}
```
## Development
### Requirements
- Python 3.10+
- BeautifulSoup4
- aiohttp
- MCP SDK
- uv (recommended for development)
### Setup Development Environment
The development setup is identical to [From Source (Development)](#from-source-development) above: clone the repository, create a virtual environment, then install with `uv sync` (uv) or `pip install -r requirements.txt && pip install -e .` (pip).
### Running the Server
#### Using uv
```bash
# Run the MCP server
uv run site_crawler
# or
uv run site-crawler-mcp
# or
uv run python -m site_crawler.server
```
#### Using python directly
```bash
python -m site_crawler.server
```
### Running Tests
```bash
# Using uv
uv run pytest tests/
# Using pip
pytest tests/
```
### Project Structure
```
site-crawler-mcp/
├── README.md
├── requirements.txt
├── pyproject.toml
├── src/
│   └── site_crawler/
│       ├── __init__.py
│       ├── server.py
│       ├── crawler.py
│       └── utils.py
└── tests/
    ├── __init__.py
    └── test_crawler.py
```
## Configuration
### Environment Variables
- `CRAWLER_MAX_CONCURRENT`: Maximum concurrent requests (default: 5)
- `CRAWLER_TIMEOUT`: Request timeout in seconds (default: 30)
- `CRAWLER_USER_AGENT`: Custom user agent string
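A minimal sketch of how these variables might be consumed at startup (the variable names and defaults come from the list above; the helper itself and the fallback user agent are illustrative, not the package's actual configuration code):
```python
import os

# Read the documented crawler settings, falling back to the listed defaults.
MAX_CONCURRENT = int(os.environ.get("CRAWLER_MAX_CONCURRENT", "5"))
TIMEOUT_SECONDS = float(os.environ.get("CRAWLER_TIMEOUT", "30"))
# No default user agent is documented; the fallback here is a placeholder.
USER_AGENT = os.environ.get("CRAWLER_USER_AGENT", "site-crawler-mcp")
```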
### Rate Limiting
The crawler respects `robots.txt` and implements polite crawling:
- 1-2 second delay between requests to the same domain
- Maximum 5 concurrent requests
- Automatic retry with exponential backoff
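The sketch below models this policy under stated assumptions: a shared semaphore capping concurrency at 5, a randomized 1-2 second per-domain delay, and exponential backoff on client errors. It illustrates the behavior described above and is not the package's actual crawl loop:
```python
import asyncio
import random
import time
from urllib.parse import urlparse

import aiohttp

MAX_CONCURRENT = 5   # at most 5 requests in flight
MAX_RETRIES = 3      # retries before giving up on a URL

_semaphore = asyncio.Semaphore(MAX_CONCURRENT)
_last_hit: dict[str, float] = {}  # domain -> time of last request

async def polite_get(session: aiohttp.ClientSession, url: str) -> str:
    domain = urlparse(url).netloc
    async with _semaphore:
        # 1-2 second delay between requests to the same domain
        elapsed = time.monotonic() - _last_hit.get(domain, 0.0)
        delay = random.uniform(1.0, 2.0)
        if elapsed < delay:
            await asyncio.sleep(delay - elapsed)
        for attempt in range(MAX_RETRIES):
            _last_hit[domain] = time.monotonic()
            try:
                async with session.get(url) as resp:
                    resp.raise_for_status()
                    return await resp.text()
            except aiohttp.ClientError:
                # exponential backoff: 1s, 2s, 4s, ...
                await asyncio.sleep(2 ** attempt)
        raise RuntimeError(f"giving up on {url} after {MAX_RETRIES} attempts")
```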
## Use Cases
### E-commerce Analysis
Extract product images, brand information, and contact details:
```
"Analyze the e-commerce site example.com for product images, brand info, and contact details"
```
### SEO and Performance Audit
Comprehensive SEO and performance analysis:
```
"Perform a full SEO audit of example.com including performance metrics and structured data"
```
### Security Assessment
Check security headers and HTTPS configuration:
```
"Analyze the security posture of example.com including headers and SSL configuration"
```
### Legal Compliance Check
Verify KVKK/GDPR compliance and privacy policies:
```
"Check example.com for KVKK compliance, privacy policies, and data protection measures"
```
### Business Intelligence
Gather company information and references:
```
"Extract business information from example.com including company details, references, and career opportunities"
```
### Contact Information Extraction
Find all contact details:
```
"Find all contact information on example.com including emails, phones, social media, and addresses"
```
## Performance Considerations
- Images smaller than 50KB are filtered out by default
- Concurrent crawling limited to 5 pages simultaneously
- Memory-efficient streaming for large sites
- Automatic deduplication of URLs
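For illustration, URL deduplication and the 50KB image filter could be as simple as the sketch below (the constants mirror the defaults above; the helpers are hypothetical):
```python
from urllib.parse import urldefrag

MIN_IMAGE_BYTES = 50 * 1024  # images below this size are skipped

seen: set[str] = set()

def should_visit(url: str) -> bool:
    """Deduplicate URLs, treating fragment-only variants as the same page."""
    canonical, _ = urldefrag(url)  # drop any '#fragment'
    if canonical in seen:
        return False
    seen.add(canonical)
    return True

def keep_image(content_length: int | None) -> bool:
    """Apply the 50KB filter; keep the image when its size is unknown."""
    return content_length is None or content_length >= MIN_IMAGE_BYTES
```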
## Error Handling
The crawler handles various error scenarios gracefully:
- Network timeouts
- Invalid URLs
- Rate limiting (429 responses)
- JavaScript-heavy sites (graceful degradation)
- Memory limits
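A hedged sketch of how these cases might be handled with aiohttp (the function, its return convention, and the fallback delay are illustrative, not the crawler's actual code):
```python
import asyncio

import aiohttp

async def fetch_with_handling(session: aiohttp.ClientSession, url: str) -> str | None:
    """Return the page body, or None when the URL should be skipped or requeued."""
    try:
        async with session.get(url, timeout=aiohttp.ClientTimeout(total=30)) as resp:
            if resp.status == 429:
                # Honor Retry-After if present, else back off briefly.
                await asyncio.sleep(float(resp.headers.get("Retry-After", "5")))
                return None  # caller may requeue the URL
            resp.raise_for_status()
            return await resp.text()
    except asyncio.TimeoutError:
        return None  # network timeout: skip and move on
    except aiohttp.InvalidURL:
        return None  # malformed URL: skip
    except aiohttp.ClientError:
        return None  # other client errors degrade gracefully
```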
## Contributing
Contributions are welcome! Please read our [Contributing Guide](CONTRIBUTING.md) for details on our code of conduct and the process for submitting pull requests.
## License
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
## Acknowledgments
- Built with [MCP SDK](https://github.com/modelcontextprotocol/sdk)
- Inspired by the need for better e-commerce crawling tools
- Thanks to the open-source community
## Support
For issues and feature requests, please use the [GitHub issue tracker](https://github.com/AndacGuven/site-crawler-mcp/issues).