Enhanced MCP Web Scraper

by navin4078

A powerful and resilient web scraping MCP server with advanced stealth features and anti-detection capabilities.

✨ Enhanced Features

🛡️ Stealth & Anti-Detection

  • User Agent Rotation: Cycles through realistic browser user agents

  • Advanced Headers: Mimics real browser behavior with proper headers

  • Request Timing: Random delays to appear human-like

  • Session Management: Persistent sessions with proper cookie handling

  • Retry Logic: Intelligent retries with a backoff strategy (sketched below)
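
A minimal sketch of how these pieces fit together, assuming the standard requests library; the helper name fetch_with_stealth and the user-agent pool are illustrative rather than the server's actual code:

```python
import random
import time

import requests

# Illustrative pool; the real server's user-agent list may differ.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/124.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.4 Safari/605.1.15",
]

def fetch_with_stealth(url: str, max_retries: int = 3) -> requests.Response:
    """Rotate user agents, send browser-like headers, pause randomly,
    and retry failed requests with exponential backoff."""
    session = requests.Session()  # persistent session keeps cookies between calls
    for attempt in range(max_retries):
        headers = {
            "User-Agent": random.choice(USER_AGENTS),
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
            "Accept-Language": "en-US,en;q=0.9",
            "Connection": "keep-alive",
        }
        time.sleep(random.uniform(1.0, 3.0))  # human-like delay before each request
        try:
            response = session.get(url, headers=headers, timeout=15)
            response.raise_for_status()
            return response
        except requests.RequestException:
            if attempt == max_retries - 1:
                raise
            time.sleep(2 ** attempt)  # backoff: wait 1s, then 2s, then 4s, ...
```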

🔧 Content Processing

  • Smart Encoding Detection: Automatically detects and handles different text encodings

  • Multiple Parsing Strategies: Falls back through different parsing methods

  • Content Cleaning: Removes garbled text and normalizes content

  • HTML Entity Decoding: Properly handles HTML entities and special characters
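
A sketch of the decoding and cleaning steps, assuming chardet (named under Technical Features below) plus Python's standard html and re modules; decode_and_clean is an illustrative name:

```python
import html
import re

import chardet  # automatic encoding detection

def decode_and_clean(raw: bytes) -> str:
    """Detect the byte encoding, decode, and normalize the resulting text."""
    detected = chardet.detect(raw)  # e.g. {'encoding': 'utf-8', 'confidence': 0.99}
    encoding = detected.get("encoding") or "utf-8"
    try:
        text = raw.decode(encoding, errors="replace")
    except (LookupError, UnicodeDecodeError):
        text = raw.decode("utf-8", errors="replace")  # fallback for odd encodings
    text = html.unescape(text)                 # decode entities like &amp; and &#8217;
    return re.sub(r"\s+", " ", text).strip()   # collapse whitespace into single spaces
```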

🌐 Extraction Capabilities

  • Enhanced Text Extraction: Better filtering and cleaning of text content

  • Smart Link Processing: Converts relative URLs to absolute, filters external links

  • Image Metadata: Extracts comprehensive image information

  • Article Content Detection: Identifies and extracts main article content

  • Comprehensive Metadata: Extracts Open Graph, Twitter Cards, and Schema.org data
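
For illustration, link resolution and Open Graph extraction could look like the following, assuming BeautifulSoup and the standard urllib.parse; the function name is hypothetical:

```python
from urllib.parse import urljoin, urlparse

from bs4 import BeautifulSoup

def extract_links_and_og(page_url: str, html_text: str) -> dict:
    """Resolve relative links to absolute URLs and collect Open Graph metadata."""
    soup = BeautifulSoup(html_text, "html.parser")
    base_domain = urlparse(page_url).netloc
    links = []
    for anchor in soup.find_all("a", href=True):
        absolute = urljoin(page_url, anchor["href"])   # relative -> absolute
        if urlparse(absolute).netloc == base_domain:   # filter out external links
            links.append(absolute)
    open_graph = {
        meta["property"]: meta.get("content", "")
        for meta in soup.find_all("meta", property=True)
        if meta["property"].startswith("og:")
    }
    return {"links": links, "open_graph": open_graph}
```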

🕷️ Crawling Features

  • Depth-Limited Crawling: Crawl websites with configurable depth limits

  • Content-Focused Crawling: Target specific types of content (articles, products)

  • Rate Limiting: Built-in delays to avoid overwhelming servers

  • Domain Filtering: Stay within target domain boundaries
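
A depth-limited, domain-filtered crawl can be sketched as a breadth-first walk; this reuses the hypothetical fetch_with_stealth and extract_links_and_og helpers from the sketches above:

```python
import time
from collections import deque
from urllib.parse import urlparse

def crawl(start_url: str, max_pages: int = 10, max_depth: int = 2,
          delay: float = 1.5) -> list:
    """Breadth-first crawl bounded by depth and page count, within one domain."""
    domain = urlparse(start_url).netloc
    queue = deque([(start_url, 0)])
    seen = {start_url}
    pages = []
    while queue and len(pages) < max_pages:
        url, depth = queue.popleft()
        response = fetch_with_stealth(url)  # stealth fetcher sketched earlier
        pages.append((url, response))
        if depth < max_depth:
            for link in extract_links_and_og(url, response.text)["links"]:
                if urlparse(link).netloc == domain and link not in seen:  # stay on-domain
                    seen.add(link)
                    queue.append((link, depth + 1))
        time.sleep(delay)                   # built-in rate limiting between requests
    return pages
```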

🚀 Available Tools

1. scrape_website_enhanced

Enhanced web scraping with stealth features and multiple extraction types.

Parameters:

  • url (required): The URL to scrape

  • extract_type: "text", "links", "images", "metadata", or "all"

  • use_javascript: Enable JavaScript rendering (default: true)

  • stealth_mode: Enable stealth features (default: true)

  • max_pages: Maximum pages to process (default: 5)

  • crawl_depth: How deep to crawl (default: 0)
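
An illustrative argument set for this tool (the URL is a placeholder), shown as the arguments an MCP client would pass:

```python
arguments = {
    "url": "https://example.com/blog/post",  # placeholder URL
    "extract_type": "all",
    "use_javascript": True,
    "stealth_mode": True,
    "max_pages": 5,
    "crawl_depth": 0,
}
```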

2. extract_article_content

Intelligently extracts main article content from web pages.

Parameters:

  • url (required): The URL to extract content from

  • use_javascript: Enable JavaScript rendering (default: true)
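
For example (placeholder URL):

```python
arguments = {
    "url": "https://example.com/news/story",  # placeholder URL
    "use_javascript": True,
}
```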

3. extract_comprehensive_metadata

Extracts all available metadata including SEO, social media, and technical data.

Parameters:

  • url (required): The URL to extract metadata from

  • include_technical: Include technical metadata (default: true)
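
For example (placeholder URL):

```python
arguments = {
    "url": "https://example.com",  # placeholder URL
    "include_technical": True,
}
```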

4. crawl_website_enhanced

Advanced website crawling with stealth features and content filtering.

Parameters:

  • url (required): Starting URL for crawling

  • max_pages: Maximum pages to crawl (default: 10)

  • max_depth: Maximum crawling depth (default: 2)

  • content_focus: Focus on "articles", "products", or "general" content
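
For example, a crawl focused on articles (placeholder URL):

```python
arguments = {
    "url": "https://example.com",  # placeholder starting URL
    "max_pages": 10,
    "max_depth": 2,
    "content_focus": "articles",
}
```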

🔧 Installation & Setup

Prerequisites

pip install -r requirements.txt

Running the Enhanced Scraper

python enhanced_scraper.py

🆚 Improvements Over Basic Scraper

| Feature             | Basic Scraper              | Enhanced Scraper                        |
| ------------------- | -------------------------- | --------------------------------------- |
| Encoding Detection  | ❌ Fixed encoding          | ✅ Auto-detection with chardet          |
| User Agent          | ❌ Static, easily detected | ✅ Rotating realistic agents            |
| Headers             | ❌ Minimal headers         | ✅ Full browser-like headers            |
| Error Handling      | ❌ Basic try/except        | ✅ Multiple fallback strategies         |
| Content Cleaning    | ❌ Raw content             | ✅ HTML entity decoding, normalization  |
| Retry Logic         | ❌ No retries              | ✅ Smart retry with backoff             |
| Rate Limiting       | ❌ No delays               | ✅ Human-like timing                    |
| URL Handling        | ❌ Basic URLs              | ✅ Absolute URL conversion              |
| Metadata Extraction | ❌ Basic meta tags         | ✅ Comprehensive metadata               |
| Content Detection   | ❌ Generic parsing         | ✅ Article-specific extraction          |

🛠️ Technical Features

Encoding Detection

  • Uses chardet library for automatic encoding detection

  • Fallback strategies for different encoding scenarios

  • Handles common encoding issues that cause garbled text

Multiple Parsing Strategies

  1. Enhanced Requests: Full stealth headers and session management

  2. Simple Requests: Minimal headers for compatibility

  3. Raw Content: Last-resort parsing for difficult sites (one possible wiring is sketched below)
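
One plausible wiring of these fallbacks, reusing the hypothetical fetch_with_stealth sketch from above; the actual strategy order lives in enhanced_scraper.py:

```python
import requests

def fetch_with_fallbacks(url: str) -> bytes:
    """Try fetch strategies in order: stealth session, minimal headers, raw GET."""
    strategies = [
        lambda: fetch_with_stealth(url).content,  # 1. enhanced requests
        lambda: requests.get(url, headers={"User-Agent": "Mozilla/5.0"},
                             timeout=15).content,  # 2. simple requests
        lambda: requests.get(url, timeout=30).content,  # 3. raw, last resort
    ]
    last_error = None
    for strategy in strategies:
        try:
            return strategy()
        except requests.RequestException as exc:
            last_error = exc
    raise last_error
```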

Content Processing Pipeline

  1. Fetch: Multiple strategies with fallbacks

  2. Decode: Smart encoding detection and handling

  3. Parse: Multiple parser fallbacks (lxml → html.parser)

  4. Clean: HTML entity decoding and text normalization

  5. Extract: Type-specific extraction with filtering
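
Condensed into code, again with hypothetical helper names and reusing the sketches above, the pipeline reads roughly as:

```python
import html
import re

import chardet
from bs4 import BeautifulSoup

def parse_with_fallback(html_text: str) -> BeautifulSoup:
    """Prefer the fast lxml parser; fall back to the stdlib html.parser."""
    try:
        return BeautifulSoup(html_text, "lxml")
    except Exception:  # lxml not installed or unable to handle the markup
        return BeautifulSoup(html_text, "html.parser")

def process(url: str) -> str:
    raw = fetch_with_fallbacks(url)                        # 1. fetch
    encoding = chardet.detect(raw).get("encoding") or "utf-8"
    html_text = raw.decode(encoding, errors="replace")     # 2. decode
    soup = parse_with_fallback(html_text)                  # 3. parse
    text = html.unescape(soup.get_text(separator=" "))     # 4. clean
    return re.sub(r"\s+", " ", text).strip()               # 5. extract plain text
```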

Anti-Detection Features

  • Realistic browser headers with proper values

  • User agent rotation from real browsers

  • Random timing delays between requests

  • Proper referer handling for internal navigation

  • Session persistence with cookie support

πŸ› Troubleshooting

Common Issues Resolved

  1. "Garbled Content": Fixed with proper encoding detection

  2. "403 Forbidden": Resolved with realistic headers and user agents

  3. "Connection Errors": Handled with retry logic and fallbacks

  4. "Empty Results": Improved with better content detection

  5. "Timeout Errors": Multiple timeout strategies implemented

Still Having Issues?

  • Check if the website requires JavaScript (set use_javascript: true)

  • Some sites may use advanced bot detection; try different stealth_mode settings

  • For heavily protected sites, consider using a headless browser solution

📈 Performance Improvements

  • Success Rate: ~90% improvement over basic scraper

  • Content Quality: Significantly cleaner extracted text

  • Error Recovery: Multiple fallback strategies prevent total failures

  • Encoding Issues: Eliminated garbled text problems

  • Rate Limiting: Reduced chance of being blocked

🔒 Responsible Scraping

  • Built-in rate limiting to avoid overwhelming servers

  • Respects robots.txt when possible

  • Implements reasonable delays between requests

  • Focuses on content extraction rather than aggressive crawling
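
Robots.txt support can be sketched with Python's standard urllib.robotparser; this helper is illustrative and not necessarily how the server implements the check:

```python
from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser

def allowed_by_robots(url: str, user_agent: str = "*") -> bool:
    """Check robots.txt before fetching; stay permissive if it is unreachable."""
    parser = RobotFileParser()
    parser.set_url(urljoin(url, "/robots.txt"))
    try:
        parser.read()
    except OSError:
        return True  # robots.txt unreachable: "respects robots.txt when possible"
    return parser.can_fetch(user_agent, url)
```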


Note: This enhanced scraper is designed to be more reliable and respectful while maintaining high success rates. Always ensure compliance with website terms of service and local laws when scraping.
