Click on "Install Server".
Wait a few minutes for the server to deploy. Once ready, it will show a "Started" state.
In the chat, type @ followed by the MCP server name and your instructions, e.g., "@Site Crawler MCP Perform a full SEO and security audit on https://example.com"
That's it! The server will respond to your query, and you can continue using it as needed.
Site Crawler MCP
A powerful Model Context Protocol (MCP) server for crawling websites and extracting assets including images and SEO metadata. Built for e-commerce sites and general web crawling needs.
Features
Comprehensive website analysis: 12 different extraction modes for complete website insights
Multi-mode crawling: Extract multiple data types in a single pass
Smart extraction: Advanced pattern matching for accurate data extraction
Performance optimized: Concurrent crawling with rate limiting
Security analysis: HTTPS, security headers, SSL/TLS information
SEO analysis: Complete SEO audit including meta tags, structured data, and more
Legal compliance: KVKK, GDPR, privacy policy detection
Business intelligence: Brand info, references, contact details extraction
Installation
From PyPI (when published)
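Once the package is published, installation should be a standard pip install (the package name is assumed to match the project name):

```bash
pip install site-crawler-mcp
```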
From Source (Development)
Using uv (Recommended)
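A minimal sketch of a from-source install with uv, assuming a standard pyproject-based layout (repository URL omitted):

```bash
git clone <repository-url>
cd site-crawler-mcp
uv venv
uv pip install -e .
```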
Using pip
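The equivalent sketch with a plain virtual environment and pip:

```bash
git clone <repository-url>
cd site-crawler-mcp
python -m venv .venv
source .venv/bin/activate   # Windows: .venv\Scripts\activate
pip install -e .
```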
Usage
As an MCP Server
Add to your MCP configuration file:
Windows: %APPDATA%\Claude\claude_desktop_config.json
macOS: ~/Library/Application Support/Claude/claude_desktop_config.json
Linux: ~/.config/Claude/claude_desktop_config.json
Using uvx (Recommended)
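A minimal sketch, assuming the published package exposes a site-crawler-mcp command that uvx can resolve:

```json
{
  "mcpServers": {
    "site-crawler": {
      "command": "uvx",
      "args": ["site-crawler-mcp"]
    }
  }
}
```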
Using uv run
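A sketch that runs the server from a local checkout with uv (the entry-point name is an assumption):

```json
{
  "mcpServers": {
    "site-crawler": {
      "command": "uv",
      "args": ["--directory", "/path/to/site-crawler-mcp", "run", "site-crawler-mcp"]
    }
  }
}
```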
Using python directly
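A sketch that launches the server module with python; the site_crawler_mcp.server module path and the PYTHONPATH layout are assumptions:

```json
{
  "mcpServers": {
    "site-crawler": {
      "command": "python",
      "args": ["-m", "site_crawler_mcp.server"],
      "env": {
        "PYTHONPATH": "/path/to/site-crawler-mcp"
      }
    }
  }
}
```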
Note: Replace /path/to/site-crawler-mcp with your actual project path. On Windows, use backslashes and drive letters (e.g., C:\\Users\\YourName\\site-crawler-mcp).
Available Tools
site_crawlAssets
Crawl a website and extract various assets based on specified modes.
Parameters:
url (string, required): The URL to start crawling from
modes (array, required): Array of extraction modes (see below)
depth (number, optional): Crawling depth (default: 1)
max_pages (number, optional): Maximum pages to crawl (default: 50)
Available Modes:
images: Extract all images with metadata (alt text, dimensions, format)
meta: Basic SEO metadata (title, description, H1 tags)
brand: Company branding information (logo, name, about pages)
seo: Comprehensive SEO analysis (meta tags, structured data, open graph)
performance: Page load metrics and performance indicators
security: Security headers and HTTPS configuration
compliance: Accessibility and regulatory compliance checks
infrastructure: Server technology and CDN detection
legal: Privacy policies, terms, KVKK compliance
careers: Job opportunities and career pages
references: Client testimonials and case studies
contact: Contact information (email, phone, social media, address)
Example Requests:
Basic image extraction:
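Illustrative site_crawlAssets arguments (the exact request format depends on your MCP client):

```json
{
  "url": "https://example.com",
  "modes": ["images"],
  "depth": 2
}
```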
Full SEO and security audit:
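For example, combining the audit-oriented modes:

```json
{
  "url": "https://example.com",
  "modes": ["seo", "security", "performance"],
  "depth": 1,
  "max_pages": 50
}
```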
Business intelligence gathering:
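For example:

```json
{
  "url": "https://example.com",
  "modes": ["brand", "references", "contact"]
}
```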
Legal compliance check:
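For example:

```json
{
  "url": "https://example.com",
  "modes": ["legal", "compliance"]
}
```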
Development
Requirements
Python 3.10+
BeautifulSoup4
aiohttp
MCP SDK
uv (recommended for development)
Setup Development Environment
Using uv (Recommended)
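A sketch of the development setup with uv, assuming an editable install from the project root:

```bash
uv venv
uv pip install -e .
```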
Using pip
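The same setup with a plain virtual environment; pytest as the test runner is an assumption:

```bash
python -m venv .venv
source .venv/bin/activate   # Windows: .venv\Scripts\activate
pip install -e .
pip install pytest   # test runner (assumed)
```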
Running the Server
Using uv
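Assuming a console-script entry point named after the package:

```bash
uv run site-crawler-mcp
```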
Using python directly
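Assuming the server module path shown earlier:

```bash
python -m site_crawler_mcp.server
```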
Running Tests
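Assuming a pytest-based test suite:

```bash
uv run pytest
```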
Project Structure
Configuration
Environment Variables
CRAWLER_MAX_CONCURRENT: Maximum concurrent requests (default: 5)
CRAWLER_TIMEOUT: Request timeout in seconds (default: 30)
CRAWLER_USER_AGENT: Custom user agent string
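For example (values are illustrative):

```bash
export CRAWLER_MAX_CONCURRENT=3
export CRAWLER_TIMEOUT=60
export CRAWLER_USER_AGENT="MySiteAuditBot/1.0"
```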
Rate Limiting
The crawler respects robots.txt and implements polite crawling:
1-2 second delay between requests to the same domain
Maximum 5 concurrent requests
Automatic retry with exponential backoff
Use Cases
E-commerce Analysis
Extract product images, pricing, and brand information:
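For example, in chat (phrasing is illustrative; the client can map it onto the images, brand, and meta modes):

```
@Site Crawler MCP Crawl https://shop.example.com and extract product images, brand info, and page metadata
```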
SEO and Performance Audit
Comprehensive SEO and performance analysis:
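An illustrative prompt (maps to the seo and performance modes):

```
@Site Crawler MCP Run a full SEO and performance audit on https://example.com
```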
Security Assessment
Check security headers and HTTPS configuration:
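An illustrative prompt (maps to the security mode):

```
@Site Crawler MCP Check the security headers and HTTPS configuration of https://example.com
```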
Legal Compliance Check
Verify KVKK/GDPR compliance and privacy policies:
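An illustrative prompt (maps to the legal and compliance modes):

```
@Site Crawler MCP Check https://example.com for KVKK/GDPR compliance and privacy policies
```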
Business Intelligence
Gather company information and references:
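An illustrative prompt (maps to the brand, references, and contact modes):

```
@Site Crawler MCP Gather brand information, client references, and contact details from https://example.com
```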
Contact Information Extraction
Find all contact details:
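An illustrative prompt (maps to the contact mode):

```
@Site Crawler MCP Find all contact details (email, phone, social media, address) on https://example.com
```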
Performance Considerations
Images smaller than 50KB are filtered out by default
Concurrent crawling limited to 5 pages simultaneously
Memory-efficient streaming for large sites
Automatic deduplication of URLs
Error Handling
The crawler handles various error scenarios gracefully:
Network timeouts
Invalid URLs
Rate limiting (429 responses)
JavaScript-heavy sites (graceful degradation)
Memory limits
Contributing
Contributions are welcome! Please read our Contributing Guide for details on our code of conduct and the process for submitting pull requests.
License
This project is licensed under the MIT License - see the LICENSE file for details.
Acknowledgments
Built with MCP SDK
Inspired by the need for better e-commerce crawling tools
Thanks to the open-source community
Support
For issues and feature requests, please use the GitHub issue tracker.