MCP Web Scrape
The MCP Web Scrape server is a web content extraction and analysis tool that converts web pages into clean, agent-friendly formats, with smart caching and ethical, robots-aware scraping.
Content Extraction & Transformation: Convert HTML to clean Markdown/text/JSON with citations (extract_content), extract plain text (extract_text_only), summarize content with customizable formats (summarize_content), translate content (translate_content), convert pages to PDF (convert_to_pdf), and generate word clouds (generate_word_cloud).
Structured Data Extraction: Extract links with filtering (extract_links), images with metadata (extract_images), forms with validation rules (extract_forms), tables with export options (extract_tables), social media links (extract_social_media), contact information (extract_contact_info), heading hierarchy (extract_headings), RSS/Atom feeds (extract_feeds), and structured data including JSON-LD, microdata, RDFa, OpenGraph, and schema.org markup (extract_structured_data, extract_schema_markup).
Content Analysis: Search within pages with regex support (search_content), extract keywords (extract_keywords), analyze readability using multiple metrics (analyze_readability), detect language with confidence scores (detect_language), extract named entities (extract_entities), perform sentiment analysis at document/paragraph/sentence level (sentiment_analysis), classify content into categories (classify_content), and compare content between URLs (compare_content).
SEO & Marketing Tools: Analyze competitors for insights (analyze_competitors), generate optimized meta tags (generate_meta_tags), check broken links and redirects (check_broken_links), analyze page speed with Core Web Vitals (analyze_page_speed), validate HTML structure (validate_html), analyze overall performance including SEO and accessibility (analyze_performance), and generate sitemaps (generate_sitemap).
Security & Privacy: Scan for vulnerabilities including XSS and CSRF (scan_vulnerabilities), check SSL certificates with chain details (check_ssl_certificate), analyze cookies for security flags (analyze_cookies), detect tracking scripts (detect_tracking), and check privacy policy compliance with GDPR, CCPA, COPPA, and PIPEDA (check_privacy_policy).
Monitoring & Tracking: Monitor uptime with configurable intervals (monitor_uptime), track content changes with similarity analysis (monitor_changes, track_changes_detailed), analyze traffic patterns (analyze_traffic_patterns), and benchmark performance against competitors (benchmark_performance).
Utility & Management: Process multiple URLs efficiently in batches (batch_extract), validate robots.txt compliance (validate_robots), check URL accessibility (check_url_status), manage cache with statistics and selective clearing (clear_cache, get_cache_stats), and generate comprehensive reports in JSON, HTML, or Markdown formats (generate_reports).
Key Benefits: Provides clean Markdown output optimized for AI agents, citation links for fact verification, deterministic and cached results with ETag/304 support, ethical scraping with robots.txt respect and rate limiting, and supports both STDIO and HTTP/SSE transports.
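The ETag/304 behavior behind the caching claim can be modeled in a few lines. This is an illustrative sketch only; the in-memory `cache` dict, the `fetch_cached` helper, and the simulated server are hypothetical, not mcp-web-scrape's actual implementation:

```python
# Illustrative model of ETag/304 revalidation (hypothetical helper,
# not mcp-web-scrape's actual code).

cache = {}  # url -> {"etag": str, "body": str}

def fetch_cached(url, fetch):
    """fetch(url, etag) returns (status, etag, body); body is None on 304."""
    entry = cache.get(url)
    etag = entry["etag"] if entry else None
    status, new_etag, body = fetch(url, etag)
    if status == 304:                 # not modified: reuse the cached body
        return entry["body"]
    cache[url] = {"etag": new_etag, "body": body}  # 200: refresh the cache
    return body

# Simulated origin server: answers 304 once the client presents a matching ETag.
def fake_fetch(url, etag):
    if etag == 'W/"abc"':
        return 304, etag, None
    return 200, 'W/"abc"', "# Page\n\nClean Markdown body"
```

The first call pays for the full download; later calls revalidate with the stored ETag (an If-None-Match request in real HTTP) and reuse the cached body on 304, which is what makes repeated scrapes deterministic and fast.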
Click on "Install Server".
Wait a few minutes for the server to deploy. Once ready, it will show a "Started" state.
In the chat, type @ followed by the MCP server name and your instructions, e.g., "@MCP Web Scrape scrape the main text from https://news.ycombinator.com and format it as Markdown".
That's it! The server will respond to your query, and you can continue using it as needed.
Here is a step-by-step guide with screenshots.
MCP Web Scrape
Clean, cached web content for agents: Markdown + citations, robots-aware, ETag/304 caching.
Version
Current Version: 1.0.7
Quick Start Demo
# Extract content from any webpage
npx mcp-web-scrape@1.0.7
# Example: Extract from a news article
> extract_content https://news.ycombinator.com
✅ Extracted 1,247 words with 5 citations
Clean Markdown ready for your AI agent
Tool Examples
# Extract all forms from a webpage
> extract_forms https://example.com/contact
✅ Found 3 forms with 12 input fields
# Parse tables into structured data
> extract_tables https://example.com/data --format json
✅ Extracted 5 tables with 247 rows
# Find social media profiles
> extract_social_media https://company.com
✅ Found Twitter, LinkedIn, Facebook profiles
# Analyze sentiment of content
> sentiment_analysis https://blog.example.com/article
✅ Sentiment: Positive (0.85), Emotional tone: Optimistic
# Extract named entities
> extract_entities https://news.example.com/article
✅ Found 12 people, 8 organizations, 5 locations
# Check for security vulnerabilities
> scan_vulnerabilities https://mysite.com
✅ No XSS vulnerabilities found, 2 header improvements suggested
# Analyze competitor SEO
> analyze_competitors ["https://competitor1.com", "https://competitor2.com"]
✅ Competitor analysis complete: keyword gaps identified
# Monitor uptime and performance
> monitor_uptime https://mysite.com --interval 300
✅ Uptime: 99.9%, Average response: 245ms
# Generate comprehensive report
> generate_reports https://website.com --metrics ["seo", "performance", "security"]
✅ Generated 15-page analysis report
Quick Start
# Install globally
npm install -g mcp-web-scrape@1.0.7
# Try it instantly (latest version)
npx mcp-web-scrape@latest
# Try specific version
npx mcp-web-scrape@1.0.7
# Or start HTTP server
node dist/http.js
ChatGPT Desktop Setup
Add to your ~/Library/Application Support/ChatGPT/config.json:
{
"mcpServers": {
"web-scrape": {
"command": "npx",
"args": ["mcp-web-scrape@1.0.7"]
}
}
}
Claude Desktop Setup
Add to your ~/Library/Application Support/Claude/claude_desktop_config.json:
{
"mcpServers": {
"web-scrape": {
"command": "npx",
"args": ["mcp-web-scrape@1.0.7"]
}
}
}
Available Tools
Core Extraction Tools
| Tool | Description |
| --- | --- |
| extract_content | Convert HTML to clean Markdown with citations |
| summarize_content | AI-powered content summarization |
|  | Extract title, description, author, keywords |
| extract_links | Get all links with filtering options |
| extract_images | Extract images with alt text and dimensions |
| search_content | Search within page content |
| check_url_status | Verify URL accessibility |
| validate_robots | Check robots.txt compliance |
| extract_structured_data | Parse JSON-LD, microdata, RDFa |
| compare_content | Compare two pages for changes |
| batch_extract | Process multiple URLs efficiently |
| get_cache_stats | View cache performance metrics |
| clear_cache | Manage cached content |
Advanced Extraction Tools
| Tool | Description |
| --- | --- |
| extract_forms | Extract form elements, fields, and validation rules |
| extract_tables | Parse HTML tables with headers and structured data |
| extract_social_media | Find social media links and profiles |
| extract_contact_info | Discover emails, phone numbers, and addresses |
| extract_headings | Analyze heading structure (H1-H6) for content hierarchy |
| extract_feeds | Discover and parse RSS/Atom feeds |
Content Transformation Tools
| Tool | Description |
| --- | --- |
| convert_to_pdf | Convert web pages to PDF format with customizable settings |
| extract_text_only | Extract plain text content without formatting or HTML |
| generate_word_cloud | Generate word frequency analysis and word cloud data |
| translate_content | Translate web page content to different languages |
| extract_keywords | Extract important keywords and phrases from content |
Advanced Analysis Tools
| Tool | Description |
| --- | --- |
| analyze_readability | Analyze text readability using various metrics (Flesch, Gunning-Fog, etc.) |
| detect_language | Detect the primary language of web page content |
| extract_entities | Extract named entities (people, places, organizations) |
| sentiment_analysis | Analyze sentiment and emotional tone of content |
| classify_content | Classify content into categories and topics |
SEO & Marketing Tools
| Tool | Description |
| --- | --- |
| analyze_competitors | Analyze competitor websites for SEO and content insights |
| extract_schema_markup | Extract and validate schema.org structured data |
| check_broken_links | Check for broken links and redirects on pages |
| analyze_page_speed | Analyze page loading speed and performance metrics |
| generate_meta_tags | Generate optimized meta tags for SEO |
Security & Privacy Tools
| Tool | Description |
| --- | --- |
| scan_vulnerabilities | Scan pages for common security vulnerabilities |
| check_ssl_certificate | Check SSL certificate validity and security details |
| analyze_cookies | Analyze cookies and tracking mechanisms |
| detect_tracking | Detect tracking scripts and privacy concerns |
| check_privacy_policy | Analyze privacy policy compliance and coverage |
Advanced Monitoring Tools
| Tool | Description |
| --- | --- |
| monitor_uptime | Monitor website uptime and availability |
| track_changes_detailed | Advanced change tracking with similarity analysis |
| analyze_traffic_patterns | Analyze website traffic patterns and trends |
| benchmark_performance | Benchmark performance against competitors |
| generate_reports | Generate comprehensive analysis reports |
Analysis & Monitoring Tools
| Tool | Description |
| --- | --- |
| monitor_changes | Track content changes over time with similarity analysis |
| analyze_performance | Measure page performance, SEO, and accessibility metrics |
| generate_sitemap | Crawl websites to generate comprehensive sitemaps |
| validate_html | Validate HTML structure, accessibility, and SEO compliance |
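Under the hood, an MCP client invokes any of the tools above with a `tools/call` JSON-RPC request. The sketch below shows that framing; `tools/call` and the `params` shape come from the MCP specification, but the `"url"` argument name is an assumption here; consult each tool's input schema:

```python
import json

# Sketch of the JSON-RPC frame an MCP client sends to invoke a tool.
# The "url" argument name is an assumption; check the tool's schema.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "extract_content",
        "arguments": {"url": "https://news.ycombinator.com"},
    },
}

frame = json.dumps(request)  # sent as a single line over the STDIO transport
print(frame)
```

Your MCP host (Claude Desktop, ChatGPT Desktop, etc.) builds these frames for you; the shape only matters if you are wiring up a custom client.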
Why Not Just Use Built-in Browsing?
Deterministic Results: Same URL always returns identical content
Smart Citations: Every fact links back to its source
Robots Compliant: Respects robots.txt and rate limits
Lightning Fast: ETag/304 caching + persistent storage
Agent-Optimized: Clean Markdown instead of messy HTML
Safety First
✅ Respects robots.txt by default
✅ Rate limiting prevents server overload
✅ No paywall bypass - ethical scraping only
✅ User-Agent identification for transparency
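The robots.txt compliance described above can be checked with Python's standard-library `urllib.robotparser`. A minimal sketch; the robots rules and user agent below are made up for illustration (the server's `validate_robots` tool performs the real fetch and check):

```python
from urllib.robotparser import RobotFileParser

# Parse made-up robots.txt rules and test whether a given
# user agent may fetch specific paths.
rules = """\
User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("MyBot/1.0", "https://example.com/article"))    # allowed
print(rp.can_fetch("MyBot/1.0", "https://example.com/private/x"))  # disallowed
```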
Installation
# Install specific version
npm install -g mcp-web-scrape@1.0.7
# Or use directly (latest)
npx mcp-web-scrape@latest
# Or use specific version
npx mcp-web-scrape@1.0.7
Configuration
# Environment variables
export MCP_WEB_SCRAPE_CACHE_DIR="./cache"
export MCP_WEB_SCRAPE_USER_AGENT="MyBot/1.0"
export MCP_WEB_SCRAPE_RATE_LIMIT="1000"
Transports
STDIO (default)
mcp-web-scrape
HTTP/SSE
node dist/http.js --port 3000
Resources
Access cached content as MCP resources:
cache://news.ycombinator.com/path → Cached page content
cache://stats → Cache statistics
cache://robots/news.ycombinator.com → Robots.txt status
Contributing
We love contributions! See CONTRIBUTING.md for guidelines.
Good First Issues:
Add new content extractors
Improve error handling
Write more tests
Enhance documentation
License
MIT © Mahipal
Built with ❤️ for the Model Context Protocol ecosystem