Supports custom CSS selector targeting for precise extraction of specific elements from web pages, enabling fine-grained control over what content is scraped.
Enables metadata extraction from GitHub pages, allowing retrieval of repository information, Open Graph tags, and other page data through the web scraping functionality.
MCP Web Scraper
A lightweight and efficient web scraping MCP server that communicates directly over STDIO
🚀 Quick Start
Option 1: Automated Setup
Option 2: Manual Setup
⚙️ Claude Desktop Configuration
Step 1: Find Your Paths
Step 2: Configure Claude Desktop
Open your Claude Desktop config file:
macOS:
Windows:
Linux:
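As a hedged aid for this step, the sketch below returns the location where the Claude Desktop config file is conventionally found on each platform. The function name `claude_config_path` is hypothetical, and the paths are the commonly documented defaults rather than values confirmed by this project, so verify them on your machine.

```python
import os
import sys
from pathlib import Path


def claude_config_path(platform: str, home: str, appdata: str = "") -> str:
    """Return the conventional config file location for a sys.platform value.

    Assumption: these are the commonly documented default locations.
    """
    filename = "claude_desktop_config.json"
    if platform == "darwin":
        return f"{home}/Library/Application Support/Claude/{filename}"
    if platform == "win32":
        # appdata is normally the value of the %APPDATA% environment variable
        return f"{appdata}\\Claude\\{filename}"
    # Assumed fallback for Linux and other Unix-like systems
    return f"{home}/.config/Claude/{filename}"


def current_config_path() -> str:
    """Resolve the path for the machine this script runs on."""
    return claude_config_path(
        sys.platform, str(Path.home()), os.environ.get("APPDATA", "")
    )
```

Calling `current_config_path()` gives a starting point; if the file is not there, it may not have been created yet.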
Step 3: Add Configuration
Add this to your config file:
Example:
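A minimal sketch of how such an entry could be generated, assuming the usual Claude Desktop layout of an `mcpServers` object keyed by server name. The helper `add_server_entry`, the script name `server.py`, and both paths are placeholders for illustration; substitute the absolute paths you found in Step 1.

```python
import json


def add_server_entry(config: dict, python_path: str, server_path: str) -> dict:
    """Return a copy of config with a "web-scraper" entry under "mcpServers".

    Existing entries for other servers are preserved.
    """
    updated = dict(config)
    servers = dict(updated.get("mcpServers", {}))
    servers["web-scraper"] = {"command": python_path, "args": [server_path]}
    updated["mcpServers"] = servers
    return updated


def render_example() -> str:
    """Render an entry with placeholder paths (not real locations)."""
    entry = add_server_entry(
        {}, "/absolute/path/to/python", "/absolute/path/to/server.py"
    )
    return json.dumps(entry, indent=2)
```

Printing `render_example()` shows the JSON shape to paste into the config file, with the placeholders replaced by your own paths.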
Step 4: Restart Claude Desktop
- Completely close Claude Desktop (Cmd+Q on Mac)
- Restart the application
- Look for the hammer icon (🔨)
- You should see "web-scraper" in your MCP servers
🛠 Available Tools
scrape_website
Extract data from websites with flexible options:
- extract_type: text, links, images, table
- selector: CSS selector for targeting specific elements
- max_results: Limit number of results (1-50)
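The options above could be validated along these lines. This is a sketch, not the server's actual code: the function name `normalize_scrape_args` and the defaults (`"text"`, `10`) are assumptions; only the allowed types and the 1-50 range come from the list above.

```python
ALLOWED_EXTRACT_TYPES = ("text", "links", "images", "table")


def normalize_scrape_args(url, extract_type="text", selector=None, max_results=10):
    """Validate scrape_website-style arguments and return them as a dict.

    The defaults ("text", 10) are assumptions made for this sketch.
    """
    if extract_type not in ALLOWED_EXTRACT_TYPES:
        raise ValueError(f"extract_type must be one of {ALLOWED_EXTRACT_TYPES}")
    if isinstance(max_results, bool) or not isinstance(max_results, int):
        raise ValueError("max_results must be an integer")
    if not 1 <= max_results <= 50:
        raise ValueError("max_results must be between 1 and 50")
    args = {"url": url, "extract_type": extract_type, "max_results": max_results}
    if selector:
        # Only include the selector when one was actually supplied
        args["selector"] = selector
    return args
```

For instance, `normalize_scrape_args("https://example.com", "links", "nav a", 5)` returns a dict containing all four keys, while an unknown `extract_type` raises `ValueError`.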
extract_headlines
Get all headlines (h1, h2, h3) from a webpage with hierarchy and attributes.
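To illustrate what "hierarchy and attributes" might look like in practice, here is a small sketch using only the standard library's `html.parser`. The project itself lists beautifulsoup4 as its parser, so this is a stand-in technique; `sketch_extract_headlines` and the output shape are hypothetical.

```python
from html.parser import HTMLParser


class HeadlineParser(HTMLParser):
    """Collect h1-h3 elements with their level, text, and attributes."""

    def __init__(self):
        super().__init__()
        self.headlines = []
        self._current = None  # dict for the heading being read, if any

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "h2", "h3"):
            self._current = {"level": int(tag[1]), "text": "", "attrs": dict(attrs)}

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        if self._current is not None and tag == f"h{self._current['level']}":
            # Collapse runs of whitespace before storing the heading text
            self._current["text"] = " ".join(self._current["text"].split())
            self.headlines.append(self._current)
            self._current = None


def sketch_extract_headlines(html):
    parser = HeadlineParser()
    parser.feed(html)
    parser.close()
    return parser.headlines
```

Each result carries a numeric `level`, so a caller can rebuild the outline by nesting level 2 entries under the preceding level 1 entry, and so on.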
extract_metadata
Extract comprehensive metadata:
- Basic: title, description, keywords, author
- Open Graph: og:title, og:description, og:image
- Twitter Cards: twitter:title, twitter:description
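The three groups above can be separated by looking at each `<meta>` tag's `name` or `property` attribute. The sketch below does this with the standard library's `html.parser` as a stand-in for the project's beautifulsoup4-based code; `sketch_extract_metadata` and the group keys (`basic`, `open_graph`, `twitter`) are hypothetical names.

```python
from html.parser import HTMLParser

BASIC_NAMES = ("description", "keywords", "author")


class MetaParser(HTMLParser):
    """Group <title> and <meta> values into basic, Open Graph, and Twitter."""

    def __init__(self):
        super().__init__()
        self.meta = {"basic": {}, "open_graph": {}, "twitter": {}}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            attr = dict(attrs)
            key = attr.get("property") or attr.get("name")
            content = attr.get("content")
            if not key or content is None:
                return
            if key.startswith("og:"):
                self.meta["open_graph"][key] = content
            elif key.startswith("twitter:"):
                self.meta["twitter"][key] = content
            elif key in BASIC_NAMES:
                self.meta["basic"][key] = content

    def handle_data(self, data):
        if self._in_title:
            title = self.meta["basic"].get("title", "") + data
            self.meta["basic"]["title"] = title

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False


def sketch_extract_metadata(html):
    parser = MetaParser()
    parser.feed(html)
    parser.close()
    return parser.meta
```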
get_page_info
Get page structure overview:
- Element counts (paragraphs, headings, links, images, tables)
- Basic metadata
- Page statistics
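The element counts listed above amount to a tally of start tags. A rough sketch, again using the standard library's `html.parser` rather than the project's beautifulsoup4 code, with `sketch_page_info` and the output keys chosen for illustration only:

```python
from collections import Counter
from html.parser import HTMLParser

TRACKED_TAGS = {"p": "paragraphs", "a": "links", "img": "images", "table": "tables"}
HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class ElementCounter(HTMLParser):
    """Count start tags for the element groups of interest."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_TAGS:
            self.counts["headings"] += 1
        elif tag in TRACKED_TAGS:
            self.counts[TRACKED_TAGS[tag]] += 1


def sketch_page_info(html):
    counter = ElementCounter()
    counter.feed(html)
    counter.close()
    keys = ("paragraphs", "headings", "links", "images", "tables")
    # Counter returns 0 for groups that never appeared
    return {key: counter.counts[key] for key in keys}
```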
💡 Usage Examples
Basic Scraping
Advanced Examples
Specific Selectors
📁 Project Structure
🔧 Features
Web Scraping Capabilities
- ✅ Text extraction with CSS selectors
- ✅ Link extraction with full attributes
- ✅ Image extraction with metadata
- ✅ Table data extraction and formatting
- ✅ Comprehensive metadata extraction
- ✅ Headline extraction with hierarchy
- ✅ Custom CSS selector support
- ✅ Configurable result limits
- ✅ Error handling and validation
MCP Integration
- ✅ Direct STDIO protocol (no HTTP needed)
- ✅ Native Claude Desktop integration
- ✅ Automatic server lifecycle management
- ✅ Schema validation and documentation
- ✅ Comprehensive error handling
- ✅ Minimal dependencies
🛡 Security & Best Practices
- Respect robots.txt: Always check robots.txt before scraping
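- Request timeout: Built-in 10-second timeout per request (this bounds how long a single request can take; it does not throttle how often requests are sent)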
- User-Agent: Uses modern browser headers
- Input validation: URL and parameter validation
- Error handling: Graceful error handling and reporting
- Resource limits: Configurable result limits prevent overload
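Two of the practices above, URL validation and the robots.txt check, can be sketched with the standard library alone. The helper names `is_http_url` and `allowed_by_robots` are hypothetical and are not part of this server; the robots.txt text is passed in as a string so the example stays offline.

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser


def is_http_url(url: str) -> bool:
    """Accept only absolute http(s) URLs with a host part."""
    parts = urlparse(url)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def allowed_by_robots(robots_txt: str, user_agent: str, url: str) -> bool:
    """Check a URL against already-downloaded robots.txt content."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

In real use the robots.txt content would be fetched from the target site's `/robots.txt` first, and the check run before any call to a scraping tool.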
🐛 Troubleshooting
MCP Server Not Appearing
Check your paths:
Validate JSON configuration:
- Use a JSON validator to check syntax
- Ensure no trailing commas
- Use absolute paths (not relative)
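The three checks above can be automated with a short script. This is a sketch under assumptions: it expects the `mcpServers` layout used by Claude Desktop, and it treats `command` and every item in `args` as a path, which will produce false positives if an entry passes non-path flags. `find_config_problems` is a hypothetical helper name.

```python
import json
from pathlib import PurePosixPath, PureWindowsPath


def _is_absolute(path: str) -> bool:
    """True if the string is an absolute POSIX or Windows path."""
    return PurePosixPath(path).is_absolute() or PureWindowsPath(path).is_absolute()


def find_config_problems(text: str) -> list:
    """Return a list of human-readable problems found in the config text."""
    try:
        config = json.loads(text)  # rejects trailing commas and other syntax errors
    except json.JSONDecodeError as exc:
        return [f"invalid JSON at line {exc.lineno}, column {exc.colno}: {exc.msg}"]
    servers = config.get("mcpServers") if isinstance(config, dict) else None
    if not isinstance(servers, dict):
        return ['missing "mcpServers" object']
    problems = []
    for name, entry in servers.items():
        # Assumption: command and all args are filesystem paths
        candidates = [entry.get("command", "")] + list(entry.get("args", []))
        for candidate in candidates:
            if not _is_absolute(str(candidate)):
                problems.append(f"{name}: {candidate!r} is not an absolute path")
    return problems
```

An empty list means none of these particular problems were found; it does not prove that the paths exist on disk.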
Permission Issues
Import Errors
Testing the MCP Server
You can test if the server works by running it manually:
The server should start and wait for STDIO input from Claude Desktop.
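If you want to go one step further than watching the process wait, you can feed it a first message by hand. The sketch below builds an `initialize` request in the newline-delimited JSON-RPC form used by the MCP stdio transport. The protocol version string and the client name are assumptions for illustration; the server may negotiate a different version.

```python
import json


def build_initialize_request(request_id: int = 1) -> str:
    """Return one newline-terminated JSON-RPC initialize message."""
    message = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "initialize",
        "params": {
            "protocolVersion": "2024-11-05",  # assumed; adjust to your client
            "capabilities": {},
            "clientInfo": {"name": "manual-test", "version": "0.0.1"},
        },
    }
    # json.dumps emits no raw newlines, so the message stays on one line
    return json.dumps(message) + "\n"
```

Pasting the resulting line into the running server's standard input should produce a single JSON response line if the server is working.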
📚 Dependencies
- requests: HTTP library for web requests
- beautifulsoup4: HTML/XML parsing
- lxml: Fast XML and HTML processor
- mcp: Model Context Protocol library
🤝 Contributing
- Fork the repository
- Create a feature branch
- Test thoroughly with Claude Desktop
- Submit a pull request
📄 License
This project is open source and available under the MIT License.
🔗 Resources
Simple, efficient web scraping for Claude Desktop! 🕷️✨
A lightweight web scraping server that allows Claude Desktop users to extract various types of data from websites, including text, links, images, tables, headlines, and metadata using CSS selectors.