The MCP Web Scrape server is a comprehensive web content extraction and analysis tool that converts web pages into clean, agent-friendly formats with smart caching and ethical compliance.
Content Extraction & Transformation: Convert HTML to clean Markdown/text/JSON with citations (extract_content), extract plain text (extract_text_only), summarize content with customizable formats (summarize_content), translate content (translate_content), convert pages to PDF (convert_to_pdf), and generate word clouds (generate_word_cloud).
Structured Data Extraction: Extract links with filtering (extract_links), images with metadata (extract_images), forms with validation rules (extract_forms), tables with export options (extract_tables), social media links (extract_social_media), contact information (extract_contact_info), heading hierarchy (extract_headings), RSS/Atom feeds (extract_feeds), and structured data including JSON-LD, microdata, RDFa, OpenGraph, and schema.org markup (extract_structured_data, extract_schema_markup).
Content Analysis: Search within pages with regex support (search_content), extract keywords (extract_keywords), analyze readability using multiple metrics (analyze_readability), detect language with confidence scores (detect_language), extract named entities (extract_entities), perform sentiment analysis at document/paragraph/sentence level (sentiment_analysis), classify content into categories (classify_content), and compare content between URLs (compare_content).
SEO & Marketing Tools: Analyze competitors for insights (analyze_competitors), generate optimized meta tags (generate_meta_tags), check broken links and redirects (check_broken_links), analyze page speed with Core Web Vitals (analyze_page_speed), validate HTML structure (validate_html), analyze overall performance including SEO and accessibility (analyze_performance), and generate sitemaps (generate_sitemap).
Security & Privacy: Scan for vulnerabilities including XSS and CSRF (scan_vulnerabilities), check SSL certificates with chain details (check_ssl_certificate), analyze cookies for security flags (analyze_cookies), detect tracking scripts (detect_tracking), and check privacy policy compliance with GDPR, CCPA, COPPA, and PIPEDA (check_privacy_policy).
Monitoring & Tracking: Monitor uptime with configurable intervals (monitor_uptime), track content changes with similarity analysis (monitor_changes, track_changes_detailed), analyze traffic patterns (analyze_traffic_patterns), and benchmark performance against competitors (benchmark_performance).
Utility & Management: Process multiple URLs efficiently in batches (batch_extract), validate robots.txt compliance (validate_robots), check URL accessibility (check_url_status), manage cache with statistics and selective clearing (clear_cache, get_cache_stats), and generate comprehensive reports in JSON, HTML, or Markdown formats (generate_reports).
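Batch tools like `batch_extract` typically bound concurrency instead of firing every request at once. A minimal sketch of that pattern (illustrative only, not the server's actual implementation):

```typescript
// Process a list of URLs with a fixed concurrency limit: a small pool of
// workers drains a shared queue, preserving result order by index.
async function batchProcess<T>(
  urls: string[],
  worker: (url: string) => Promise<T>,
  concurrency = 4
): Promise<T[]> {
  const results: T[] = new Array(urls.length);
  let next = 0;
  async function run(): Promise<void> {
    while (next < urls.length) {
      const i = next++; // claim the next index (safe: no await before claiming)
      results[i] = await worker(urls[i]);
    }
  }
  // Start up to `concurrency` workers over the shared queue.
  await Promise.all(
    Array.from({ length: Math.min(concurrency, urls.length) }, run)
  );
  return results;
}
```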
Key Benefits: Provides clean Markdown output optimized for AI agents, citation links for fact verification, deterministic and cached results with ETag/304 support, ethical scraping with robots.txt respect and rate limiting, and supports both STDIO and HTTP/SSE transports.
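The ETag/304 flow behind "deterministic and cached results" works by replaying a cached entry's ETag as an `If-None-Match` header; a `304 Not Modified` response means the cached body is reused without re-downloading. A hedged sketch (the names here are illustrative, not the server's real API):

```typescript
// Conditional revalidation with ETag / If-None-Match.
interface CacheEntry {
  body: string; // previously extracted content
  etag: string; // validator returned by the origin server
}

// Headers to send when refetching a possibly-cached URL.
function conditionalHeaders(entry?: CacheEntry): Record<string, string> {
  return entry ? { "If-None-Match": entry.etag } : {};
}

// Decide what to store/serve given the origin's response.
function resolveResponse(
  entry: CacheEntry | undefined,
  status: number,
  freshBody: string,
  freshEtag: string
): CacheEntry {
  if (status === 304 && entry) return entry; // unchanged: reuse cached body
  return { body: freshBody, etag: freshEtag }; // changed: replace the entry
}
```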
Click on "Install Server".
Wait a few minutes for the server to deploy. Once ready, it will show a "Started" state.
In the chat, type `@` followed by the MCP server name and your instructions, e.g., "@MCP Web Scrape scrape the main text from https://news.ycombinator.com and format it as markdown".
That's it! The server will respond to your query, and you can continue using it as needed.
Here is a step-by-step guide with screenshots.
# 🕷️ MCP Web Scrape

Clean, cached web content for agents: Markdown + citations, robots-aware, ETag/304 caching.
## 📦 Version

Current Version: 1.0.7
## 🚀 Quick Start Demo
```bash
# Extract content from any webpage
npx mcp-web-scrape@1.0.7

# Example: Extract from a news article
> extract_content https://news.ycombinator.com
✅ Extracted 1,247 words with 5 citations
📄 Clean Markdown ready for your AI agent
```

## 🎯 Tool Examples
```bash
# Extract all forms from a webpage
> extract_forms https://example.com/contact
✅ Found 3 forms with 12 input fields

# Parse tables into structured data
> extract_tables https://example.com/data --format json
✅ Extracted 5 tables with 247 rows

# Find social media profiles
> extract_social_media https://company.com
✅ Found Twitter, LinkedIn, Facebook profiles

# Analyze sentiment of content
> sentiment_analysis https://blog.example.com/article
✅ Sentiment: Positive (0.85), Emotional tone: Optimistic

# Extract named entities
> extract_entities https://news.example.com/article
✅ Found 12 people, 8 organizations, 5 locations

# Check for security vulnerabilities
> scan_vulnerabilities https://mysite.com
✅ No XSS vulnerabilities found, 2 header improvements suggested

# Analyze competitor SEO
> analyze_competitors ["https://competitor1.com", "https://competitor2.com"]
✅ Competitor analysis complete: keyword gaps identified

# Monitor uptime and performance
> monitor_uptime https://mysite.com --interval 300
✅ Uptime: 99.9%, Average response: 245ms

# Generate comprehensive report
> generate_reports https://website.com --metrics ["seo", "performance", "security"]
✅ Generated 15-page analysis report
```

## ⚡ Quick Start
```bash
# Install globally
npm install -g mcp-web-scrape@1.0.7

# Try it instantly (latest version)
npx mcp-web-scrape@latest

# Try specific version
npx mcp-web-scrape@1.0.7

# Or start HTTP server
node dist/http.js
```

### ChatGPT Desktop Setup
Add to your `~/Library/Application Support/ChatGPT/config.json`:

```json
{
  "mcpServers": {
    "web-scrape": {
      "command": "npx",
      "args": ["mcp-web-scrape@1.0.7"]
    }
  }
}
```

### Claude Desktop Setup
Add to your `~/Library/Application Support/Claude/claude_desktop_config.json`:

```json
{
  "mcpServers": {
    "web-scrape": {
      "command": "npx",
      "args": ["mcp-web-scrape@1.0.7"]
    }
  }
}
```

## 🛠️ Available Tools
### Core Extraction Tools

| Tool | Description |
|------|-------------|
| `extract_content` | Convert HTML to clean Markdown with citations |
| `summarize_content` | AI-powered content summarization |
| | Extract title, description, author, keywords |
| `extract_links` | Get all links with filtering options |
| `extract_images` | Extract images with alt text and dimensions |
| `search_content` | Search within page content |
| `check_url_status` | Verify URL accessibility |
| `validate_robots` | Check robots.txt compliance |
| `extract_structured_data` | Parse JSON-LD, microdata, RDFa |
| `compare_content` | Compare two pages for changes |
| `batch_extract` | Process multiple URLs efficiently |
| `get_cache_stats` | View cache performance metrics |
| `clear_cache` | Manage cached content |
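Over MCP, each of these tools is invoked with a JSON-RPC `tools/call` request carrying the tool name and its arguments. A minimal sketch of building such a request (the argument name `url` is illustrative; consult the server's tool schema for the exact parameters):

```typescript
// Build a JSON-RPC 2.0 `tools/call` request as an MCP client would.
function buildToolCall(
  id: number,
  name: string,
  args: Record<string, unknown>
) {
  return {
    jsonrpc: "2.0",
    id,
    method: "tools/call",
    params: { name, arguments: args },
  };
}

// Example: ask the server to extract a page as Markdown.
const req = buildToolCall(1, "extract_content", {
  url: "https://news.ycombinator.com",
});
```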
### Advanced Extraction Tools

| Tool | Description |
|------|-------------|
| `extract_forms` | Extract form elements, fields, and validation rules |
| `extract_tables` | Parse HTML tables with headers and structured data |
| `extract_social_media` | Find social media links and profiles |
| `extract_contact_info` | Discover emails, phone numbers, and addresses |
| `extract_headings` | Analyze heading structure (H1-H6) for content hierarchy |
| `extract_feeds` | Discover and parse RSS/Atom feeds |
### Content Transformation Tools

| Tool | Description |
|------|-------------|
| `convert_to_pdf` | Convert web pages to PDF format with customizable settings |
| `extract_text_only` | Extract plain text content without formatting or HTML |
| `generate_word_cloud` | Generate word frequency analysis and word cloud data |
| `translate_content` | Translate web page content to different languages |
| `extract_keywords` | Extract important keywords and phrases from content |
### Advanced Analysis Tools

| Tool | Description |
|------|-------------|
| `analyze_readability` | Analyze text readability using various metrics (Flesch, Gunning-Fog, etc.) |
| `detect_language` | Detect the primary language of web page content |
| `extract_entities` | Extract named entities (people, places, organizations) |
| `sentiment_analysis` | Analyze sentiment and emotional tone of content |
| `classify_content` | Classify content into categories and topics |
### SEO & Marketing Tools

| Tool | Description |
|------|-------------|
| `analyze_competitors` | Analyze competitor websites for SEO and content insights |
| `extract_schema_markup` | Extract and validate schema.org structured data |
| `check_broken_links` | Check for broken links and redirects on pages |
| `analyze_page_speed` | Analyze page loading speed and performance metrics |
| `generate_meta_tags` | Generate optimized meta tags for SEO |
### Security & Privacy Tools

| Tool | Description |
|------|-------------|
| `scan_vulnerabilities` | Scan pages for common security vulnerabilities |
| `check_ssl_certificate` | Check SSL certificate validity and security details |
| `analyze_cookies` | Analyze cookies and tracking mechanisms |
| `detect_tracking` | Detect tracking scripts and privacy concerns |
| `check_privacy_policy` | Analyze privacy policy compliance and coverage |
### Advanced Monitoring Tools

| Tool | Description |
|------|-------------|
| `monitor_uptime` | Monitor website uptime and availability |
| `track_changes_detailed` | Advanced change tracking with similarity analysis |
| `analyze_traffic_patterns` | Analyze website traffic patterns and trends |
| `benchmark_performance` | Benchmark performance against competitors |
| `generate_reports` | Generate comprehensive analysis reports |
### Analysis & Monitoring Tools

| Tool | Description |
|------|-------------|
| `monitor_changes` | Track content changes over time with similarity analysis |
| `analyze_performance` | Measure page performance, SEO, and accessibility metrics |
| `generate_sitemap` | Crawl websites to generate comprehensive sitemaps |
| `validate_html` | Validate HTML structure, accessibility, and SEO compliance |
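Similarity-based change tracking (as in `monitor_changes` and `track_changes_detailed`) is commonly implemented as a set-overlap score over word shingles. A sketch of the general idea, not necessarily the server's exact metric:

```typescript
// Jaccard similarity over 3-word shingles: 1.0 means identical shingle
// sets, 0.0 means no overlap. Useful for flagging meaningful changes
// while ignoring the order-preserving bulk of unchanged text.
function jaccardSimilarity(a: string, b: string): number {
  const shingles = (text: string): Set<string> => {
    const words = text.toLowerCase().split(/\s+/).filter(Boolean);
    const out = new Set<string>();
    for (let i = 0; i + 2 < words.length; i++) {
      out.add(words.slice(i, i + 3).join(" "));
    }
    return out;
  };
  const sa = shingles(a);
  const sb = shingles(b);
  if (sa.size === 0 && sb.size === 0) return 1; // both too short: treat as same
  let inter = 0;
  for (const s of sa) if (sb.has(s)) inter++;
  return inter / (sa.size + sb.size - inter); // |A∩B| / |A∪B|
}
```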
## 🤔 Why Not Just Use Built-in Browsing?

- **Deterministic Results**: Same URL always returns identical content
- **Smart Citations**: Every fact links back to its source
- **Robots Compliant**: Respects robots.txt and rate limits
- **Lightning Fast**: ETag/304 caching + persistent storage
- **Agent-Optimized**: Clean Markdown instead of messy HTML
## 🛡️ Safety First

- ✅ Respects robots.txt by default
- ✅ Rate limiting prevents server overload
- ✅ No paywall bypass: ethical scraping only
- ✅ User-Agent identification for transparency
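The first two safety gates can be illustrated with a minimal sketch, simplified to prefix-matching robots rules and a fixed per-host request gap (the server's real logic is more complete):

```typescript
// Simplified robots.txt gate: honors Disallow rules in the `User-agent: *`
// group via prefix matching (real parsers also handle wildcards, Allow, etc.).
function isAllowed(robotsTxt: string, path: string): boolean {
  let inStarGroup = false;
  const disallows: string[] = [];
  for (const raw of robotsTxt.split("\n")) {
    const line = raw.split("#")[0].trim(); // strip comments
    const sep = line.indexOf(":");
    if (sep < 0) continue;
    const key = line.slice(0, sep).trim().toLowerCase();
    const value = line.slice(sep + 1).trim();
    if (key === "user-agent") inStarGroup = value === "*";
    else if (key === "disallow" && inStarGroup && value) disallows.push(value);
  }
  return !disallows.some((rule) => path.startsWith(rule));
}

// Simplified rate limiter: enforce a minimum gap (ms) between requests
// to the same host by reporting how long the caller should wait.
class HostRateLimiter {
  private nextFree = new Map<string, number>();
  constructor(private intervalMs: number) {}

  delayFor(host: string, nowMs: number): number {
    const free = this.nextFree.get(host) ?? nowMs;
    const wait = Math.max(0, free - nowMs);
    this.nextFree.set(host, nowMs + wait + this.intervalMs);
    return wait;
  }
}
```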
## 📦 Installation

```bash
# Install specific version
npm install -g mcp-web-scrape@1.0.7

# Or use directly (latest)
npx mcp-web-scrape@latest

# Or use specific version
npx mcp-web-scrape@1.0.7
```

## 🔧 Configuration
```bash
# Environment variables
export MCP_WEB_SCRAPE_CACHE_DIR="./cache"
export MCP_WEB_SCRAPE_USER_AGENT="MyBot/1.0"
export MCP_WEB_SCRAPE_RATE_LIMIT="1000"
```

## 🔌 Transports

### STDIO (default)

```bash
mcp-web-scrape
```

### HTTP/SSE

```bash
node dist/http.js --port 3000
```

## 📚 Resources
Access cached content as MCP resources:

- `cache://news.ycombinator.com/path` → Cached page content
- `cache://stats` → Cache statistics
- `cache://robots/news.ycombinator.com` → Robots.txt status

## 🤝 Contributing
We love contributions! See CONTRIBUTING.md for guidelines.

Good First Issues:

- Add new content extractors
- Improve error handling
- Write more tests
- Enhance documentation
## 📄 License

MIT © Mahipal

## ⭐ Star History

Built with ❤️ for the