Skip to main content
Glama

πŸ•·οΈ MCP Web Scrape

Clean, cached web content for agentsβ€”Markdown + citations, robots-aware, ETag/304 caching.

npm version License: MIT GitHub stars

πŸ“¦ Version

Current Version: 1.0.7

🎬 Live Demos

See MCP Web Scrape in action! These demos show real-time extraction and processing:

πŸš€ Quick Start Demo

# Extract content from any webpage npx mcp-web-scrape@1.0.7 # Example: Extract from a news article > extract_content https://news.ycombinator.com βœ… Extracted 1,247 words with 5 citations πŸ“„ Clean Markdown ready for your AI agent

🎯 Tool Examples

# Extract all forms from a webpage > extract_forms https://example.com/contact βœ… Found 3 forms with 12 input fields # Parse tables into structured data > extract_tables https://example.com/data --format json βœ… Extracted 5 tables with 247 rows # Find social media profiles > extract_social_media https://company.com βœ… Found Twitter, LinkedIn, Facebook profiles # Analyze sentiment of content > sentiment_analysis https://blog.example.com/article βœ… Sentiment: Positive (0.85), Emotional tone: Optimistic # Extract named entities > extract_entities https://news.example.com/article βœ… Found 12 people, 8 organizations, 5 locations # Check for security vulnerabilities > scan_vulnerabilities https://mysite.com βœ… No XSS vulnerabilities found, 2 header improvements suggested # Analyze competitor SEO > analyze_competitors ["https://competitor1.com", "https://competitor2.com"] βœ… Competitor analysis complete: keyword gaps identified # Monitor uptime and performance > monitor_uptime https://mysite.com --interval 300 βœ… Uptime: 99.9%, Average response: 245ms # Generate comprehensive report > generate_reports https://website.com --metrics ["seo", "performance", "security"] βœ… Generated 15-page analysis report

⚑ Quick Start

# Install globally npm install -g mcp-web-scrape@1.0.7 # Try it instantly (latest version) npx mcp-web-scrape@latest # Try specific version npx mcp-web-scrape@1.0.7 # Or start HTTP server node dist/http.js

ChatGPT Desktop Setup

Add to your ~/Library/Application Support/ChatGPT/config.json:

{ "mcpServers": { "web-scrape": { "command": "npx", "args": ["mcp-web-scrape@1.0.7"] } } }

Claude Desktop Setup

Add to your ~/Library/Application Support/Claude/claude_desktop_config.json:

{ "mcpServers": { "web-scrape": { "command": "npx", "args": ["mcp-web-scrape@1.0.7"] } } }

πŸ› οΈ Available Tools

Core Extraction Tools

Tool

Description

extract_content

Convert HTML to clean Markdown with citations

summarize_content

AI-powered content summarization

get_page_metadata

Extract title, description, author, keywords

extract_links

Get all links with filtering options

extract_images

Extract images with alt text and dimensions

search_content

Search within page content

check_url_status

Verify URL accessibility

validate_robots

Check robots.txt compliance

extract_structured_data

Parse JSON-LD, microdata, RDFa

compare_content

Compare two pages for changes

batch_extract

Process multiple URLs efficiently

get_cache_stats

View cache performance metrics

clear_cache

Manage cached content

Advanced Extraction Tools

Tool

Description

extract_forms

Extract form elements, fields, and validation rules

extract_tables

Parse HTML tables with headers and structured data

extract_social_media

Find social media links and profiles

extract_contact_info

Discover emails, phone numbers, and addresses

extract_headings

Analyze heading structure (H1-H6) for content hierarchy

extract_feeds

Discover and parse RSS/Atom feeds

Content Transformation Tools

Tool

Description

convert_to_pdf

Convert web pages to PDF format with customizable settings

extract_text_only

Extract plain text content without formatting or HTML

generate_word_cloud

Generate word frequency analysis and word cloud data

translate_content

Translate web page content to different languages

extract_keywords

Extract important keywords and phrases from content

Advanced Analysis Tools

Tool

Description

analyze_readability

Analyze text readability using various metrics (Flesch, Gunning-Fog, etc.)

detect_language

Detect the primary language of web page content

extract_entities

Extract named entities (people, places, organizations)

sentiment_analysis

Analyze sentiment and emotional tone of content

classify_content

Classify content into categories and topics

SEO & Marketing Tools

Tool

Description

analyze_competitors

Analyze competitor websites for SEO and content insights

extract_schema_markup

Extract and validate schema.org structured data

check_broken_links

Check for broken links and redirects on pages

analyze_page_speed

Analyze page loading speed and performance metrics

generate_meta_tags

Generate optimized meta tags for SEO

Security & Privacy Tools

Tool

Description

scan_vulnerabilities

Scan pages for common security vulnerabilities

check_ssl_certificate

Check SSL certificate validity and security details

analyze_cookies

Analyze cookies and tracking mechanisms

detect_tracking

Detect tracking scripts and privacy concerns

check_privacy_policy

Analyze privacy policy compliance and coverage

Advanced Monitoring Tools

Tool

Description

monitor_uptime

Monitor website uptime and availability

track_changes_detailed

Advanced change tracking with similarity analysis

analyze_traffic_patterns

Analyze website traffic patterns and trends

benchmark_performance

Benchmark performance against competitors

generate_reports

Generate comprehensive analysis reports

Analysis & Monitoring Tools

Tool

Description

monitor_changes

Track content changes over time with similarity analysis

analyze_performance

Measure page performance, SEO, and accessibility metrics

generate_sitemap

Crawl websites to generate comprehensive sitemaps

validate_html

Validate HTML structure, accessibility, and SEO compliance

πŸ€” Why Not Just Use Built-in Browsing?

Deterministic Results β†’ Same URL always returns identical content
Smart Citations β†’ Every fact links back to its source
Robots Compliant β†’ Respects robots.txt and rate limits
Lightning Fast β†’ ETag/304 caching + persistent storage
Agent-Optimized β†’ Clean Markdown instead of messy HTML

πŸ”’ Safety First

  • βœ… Respects robots.txt by default

  • βœ… Rate limiting prevents server overload

  • βœ… No paywall bypass - ethical scraping only

  • βœ… User-Agent identification for transparency

πŸ“¦ Installation

# Install specific version npm install -g mcp-web-scrape@1.0.7 # Or use directly (latest) npx mcp-web-scrape@latest # Or use specific version npx mcp-web-scrape@1.0.7

πŸ”§ Configuration

# Environment variables export MCP_WEB_SCRAPE_CACHE_DIR="./cache" export MCP_WEB_SCRAPE_USER_AGENT="MyBot/1.0" export MCP_WEB_SCRAPE_RATE_LIMIT="1000"

🌐 Transports

STDIO (default)

mcp-web-scrape

HTTP/SSE

node dist/http.js --port 3000

πŸ“š Resources

Access cached content as MCP resources:

cache://news.ycombinator.com/path β†’ Cached page content cache://stats β†’ Cache statistics cache://robots/news.ycombinator.com β†’ Robots.txt status

🀝 Contributing

We love contributions! See CONTRIBUTING.md for guidelines.

Good First Issues:

  • Add new content extractors

  • Improve error handling

  • Write more tests

  • Enhance documentation

πŸ“„ License

MIT Β© Mahipal

🌟 Star History

Star History Chart


Built with ❀️ for the

-
security - not tested
A
license - permissive license
-
quality - not tested

Latest Blog Posts

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/mukul975/mcp-web-scrape'

If you have feedback or need assistance with the MCP directory API, please join our Discord server