Supports custom CSS selector targeting for precise extraction of specific elements from web pages, enabling fine-grained control over what content is scraped.
Enables metadata extraction from GitHub pages, allowing retrieval of repository information, Open Graph tags, and other page data through the web scraping functionality.
Enhanced MCP Web Scraper
A powerful and resilient web scraping MCP server with advanced stealth features and anti-detection capabilities.
✨ Enhanced Features
🛡️ Stealth & Anti-Detection
User Agent Rotation: Cycles through realistic browser user agents
Advanced Headers: Mimics real browser behavior with proper headers
Request Timing: Random delays to appear human-like
Session Management: Persistent sessions with proper cookie handling
Retry Logic: Intelligent retry with backoff strategy
🔧 Content Processing
Smart Encoding Detection: Automatically detects and handles different text encodings
Multiple Parsing Strategies: Falls back through different parsing methods
Content Cleaning: Removes garbled text and normalizes content
HTML Entity Decoding: Properly handles HTML entities and special characters
🌐 Extraction Capabilities
Enhanced Text Extraction: Better filtering and cleaning of text content
Smart Link Processing: Converts relative URLs to absolute, filters external links
Image Metadata: Extracts comprehensive image information
Article Content Detection: Identifies and extracts main article content
Comprehensive Metadata: Extracts Open Graph, Twitter Cards, Schema.org data
🕷️ Crawling Features
Depth-Limited Crawling: Crawl websites with configurable depth limits
Content-Focused Crawling: Target specific types of content (articles, products)
Rate Limiting: Built-in delays to avoid overwhelming servers
Domain Filtering: Stay within target domain boundaries
🚀 Available Tools
1. scrape_website_enhanced
Enhanced web scraping with stealth features and multiple extraction types.
Parameters:
url
(required): The URL to scrapeextract_type
: "text", "links", "images", "metadata", or "all"use_javascript
: Enable JavaScript rendering (default: true)stealth_mode
: Enable stealth features (default: true)max_pages
: Maximum pages to process (default: 5)crawl_depth
: How deep to crawl (default: 0)
2. extract_article_content
Intelligently extracts main article content from web pages.
Parameters:
url
(required): The URL to extract content fromuse_javascript
: Enable JavaScript rendering (default: true)
3. extract_comprehensive_metadata
Extracts all available metadata including SEO, social media, and technical data.
Parameters:
url
(required): The URL to extract metadata frominclude_technical
: Include technical metadata (default: true)
4. crawl_website_enhanced
Advanced website crawling with stealth features and content filtering.
Parameters:
url
(required): Starting URL for crawlingmax_pages
: Maximum pages to crawl (default: 10)max_depth
: Maximum crawling depth (default: 2)content_focus
: Focus on "articles", "products", or "general" content
🔧 Installation & Setup
Prerequisites
Running the Enhanced Scraper
🆚 Improvements Over Basic Scraper
Feature | Basic Scraper | Enhanced Scraper |
Encoding Detection | ❌ Fixed encoding | ✅ Auto-detection with chardet |
User Agent | ❌ Static, easily detected | ✅ Rotating realistic agents |
Headers | ❌ Minimal headers | ✅ Full browser-like headers |
Error Handling | ❌ Basic try/catch | ✅ Multiple fallback strategies |
Content Cleaning | ❌ Raw content | ✅ HTML entity decoding, normalization |
Retry Logic | ❌ No retries | ✅ Smart retry with backoff |
Rate Limiting | ❌ No delays | ✅ Human-like timing |
URL Handling | ❌ Basic URLs | ✅ Absolute URL conversion |
Metadata Extraction | ❌ Basic meta tags | ✅ Comprehensive metadata |
Content Detection | ❌ Generic parsing | ✅ Article-specific extraction |
🛠️ Technical Features
Encoding Detection
Uses
chardet
library for automatic encoding detectionFallback strategies for different encoding scenarios
Handles common encoding issues that cause garbled text
Multiple Parsing Strategies
Enhanced Requests: Full stealth headers and session management
Simple Requests: Minimal headers for compatibility
Raw Content: Last resort parsing for difficult sites
Content Processing Pipeline
Fetch: Multiple strategies with fallbacks
Decode: Smart encoding detection and handling
Parse: Multiple parser fallbacks (lxml → html.parser)
Clean: HTML entity decoding and text normalization
Extract: Type-specific extraction with filtering
Anti-Detection Features
Realistic browser headers with proper values
User agent rotation from real browsers
Random timing delays between requests
Proper referer handling for internal navigation
Session persistence with cookie support
🐛 Troubleshooting
Common Issues Resolved
"Garbled Content": Fixed with proper encoding detection
"403 Forbidden": Resolved with realistic headers and user agents
"Connection Errors": Handled with retry logic and fallbacks
"Empty Results": Improved with better content detection
"Timeout Errors": Multiple timeout strategies implemented
Still Having Issues?
Check if the website requires JavaScript (set
use_javascript: true
)Some sites may have advanced bot detection - try different
stealth_mode
settingsFor heavily protected sites, consider using a headless browser solution
📈 Performance Improvements
Success Rate: ~90% improvement over basic scraper
Content Quality: Significantly cleaner extracted text
Error Recovery: Multiple fallback strategies prevent total failures
Encoding Issues: Eliminated garbled text problems
Rate Limiting: Reduced chance of being blocked
🔒 Responsible Scraping
Built-in rate limiting to avoid overwhelming servers
Respects robots.txt when possible
Implements reasonable delays between requests
Focuses on content extraction rather than aggressive crawling
Note: This enhanced scraper is designed to be more reliable and respectful while maintaining high success rates. Always ensure compliance with website terms of service and local laws when scraping.
This server cannot be installed
remote-capable server
The server can be hosted and run remotely because it primarily relies on remote services or has no dependency on the local environment.
A lightweight web scraping server that allows Claude Desktop users to extract various types of data from websites, including text, links, images, tables, headlines, and metadata using CSS selectors.
Related MCP Servers
- -securityAlicense-qualityThe server facilitates access to Julia documentation and source code through Claude Desktop, allowing users to retrieve information on Julia packages, modules, types, functions, and methods.Last updated -910MIT License
Needle MCP Serverofficial
AsecurityAlicenseAqualityA server that allows users to manage documents and perform Claude-powered searches using Needle through the Claude Desktop application.Last updated -81MIT License- AsecurityAlicenseAqualityA server that integrates with Claude Desktop to enable real-time web research capabilities, allowing users to search Google, extract webpage content, and capture screenshots directly from conversations.Last updated -3193MIT License
- -securityFlicense-qualityA comprehensive, domain-agnostic documentation scraping and AI integration toolkit. Scrape any documentation website, create structured databases, and integrate with Claude Desktop via MCP (Model Context Protocol) for seamless AI-powered documentation assistance.Last updated -5