AI-Driven Universal Web Data Extraction Platform
A production-grade, MCP-enabled universal web scraping platform with MongoDB storage and advanced anti-bot (antigravity) mechanisms.
Features
Dual Scraping Engines: Static (Requests + BeautifulSoup) and Dynamic (Playwright)
Auto-Detection: Automatically selects the appropriate scraper based on page content
Anti-Bot Protection: User-Agent rotation, rate limiting, robots.txt compliance, stealth mode
MongoDB Storage: Persists all scraped data with full metadata
MCP Integration: Exposes scraping as tools for LLM invocation
Export Options: JSON and CSV export capabilities
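The auto-detection step can be approximated by a simple heuristic: prefer the static scraper, and fall back to the dynamic one when the raw HTML looks client-side rendered. A minimal stdlib-only sketch (the function name, SPA markers, and threshold are illustrative, not the platform's actual strategy_selector logic):

```python
import re

# Markers that often indicate a client-side-rendered SPA shell,
# where a plain HTTP fetch returns little useful content.
SPA_MARKERS = ('<div id="root"', '<div id="app"', 'window.__INITIAL_STATE__')

def choose_scraper(html: str, min_text_chars: int = 200) -> str:
    """Return 'static' or 'dynamic' based on a crude look at raw HTML."""
    if any(marker in html for marker in SPA_MARKERS):
        return "dynamic"
    # Strip scripts/styles and tags to estimate visible text volume.
    no_scripts = re.sub(r"(?s)<(script|style).*?</\1>", "", html)
    visible_text = " ".join(re.sub(r"<[^>]+>", " ", no_scripts).split())
    return "static" if len(visible_text) >= min_text_chars else "dynamic"
```

A real selector would also weigh response headers and known SPA frameworks, but the shape of the decision is the same.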
Project Structure

d:\mcp\
├── requirements.txt           # Python dependencies
├── config.py                  # Configuration settings
├── main.py                    # FastAPI MCP server entry point
├── scraper/
│   ├── static_scraper.py      # Requests + BeautifulSoup scraper
│   ├── dynamic_scraper.py     # Playwright scraper
│   └── strategy_selector.py   # Auto-detection logic
├── antigravity/
│   ├── user_agents.py         # User-Agent rotation
│   ├── throttle.py            # Request delays & rate limiting
│   ├── robots_validator.py    # robots.txt compliance
│   └── stealth.py             # Playwright stealth configuration
├── database/
│   ├── mongodb.py             # MongoDB connection & operations
│   └── models.py              # Pydantic data models
├── mcp/
│   └── tools.py               # MCP tool definitions
├── utils/
│   ├── normalizer.py          # Data normalization
│   └── exporter.py            # CSV/JSON export
├── tests/                     # Test suite
├── docs/
└── README.md                  # This file

Quick Start
1. Install Dependencies
cd d:\mcp
pip install -r requirements.txt
playwright install chromium

2. Start MongoDB
Ensure MongoDB is running on localhost:27017 (or update MONGODB_URI in config.py).
3. Run the Server
python main.py

The server will start at http://localhost:8000.
4. Test the API
Open http://localhost:8000/docs for interactive Swagger documentation.
API Endpoints
| Endpoint | Method | Description |
|----------|--------|-------------|
| /scrape | POST/GET | Scrape a website |
| | GET | Get scraping statistics |
| | GET | Get recently scraped data |
| | GET | Get scrape logs |
| | POST | Export data to JSON |
| | POST | Export data to CSV |
| | GET | Health check |
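The JSON and CSV export endpoints presumably flatten stored documents into rows. A self-contained sketch of that serialization step, using only the standard library (the helper name is illustrative, not the platform's exporter.py):

```python
import csv
import io
import json

def export_records(records: list[dict], fmt: str = "json") -> str:
    """Serialize scraped records to a JSON or CSV string."""
    if fmt == "json":
        return json.dumps(records, indent=2, default=str)
    if fmt == "csv":
        buf = io.StringIO()
        # Use the union of keys across records, in first-seen order,
        # as the CSV header; missing values become empty cells.
        fields = list(dict.fromkeys(k for r in records for k in r))
        writer = csv.DictWriter(buf, fieldnames=fields)
        writer.writeheader()
        writer.writerows(records)
        return buf.getvalue()
    raise ValueError(f"unsupported format: {fmt}")
```

Nested fields (like the content and metadata sub-documents below) would need flattening before CSV export; JSON handles them as-is.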
Example Scrape Request
curl -X POST "http://localhost:8000/scrape" \
-H "Content-Type: application/json" \
-d '{"url": "https://example.com", "auto_detect": true}'

MCP Tool Usage
The platform exposes a scrape_website tool via MCP:
# Tool Schema
{
"name": "scrape_website",
"parameters": {
"url": "string (required)",
"dynamic": "boolean (default: false)",
"auto_detect": "boolean (default: true)",
"store_in_mongodb": "boolean (default: true)"
}
}

Anti-Bot (Antigravity) Features
User-Agent Rotation: 20+ realistic browser User-Agents
Request Throttling: 1-5 second random delays between requests
Rate Limiting: Max 10 requests per domain per minute
robots.txt Compliance: Respects crawling restrictions
Playwright Stealth Mode: Disables automation detection flags
MongoDB Schema
scraped_data Collection
{
"_id": "ObjectId",
"url": "string",
"scraped_at": "ISO timestamp",
"scraper_type": "static | dynamic",
"content": {
"title": "string",
"text": "string",
"links": ["string"]
},
"metadata": {
"status_code": "number",
"response_time": "number",
"user_agent": "string"
}
}

scrape_logs Collection
{
"url": "string",
"timestamp": "ISO timestamp",
"success": "boolean",
"error": "string | null"
}

Running Tests
cd d:\mcp
pytest tests/ -v

Ethical Considerations
Always respects robots.txt directives
Implements polite crawling with delays
Only scrapes publicly accessible content
Rate limiting prevents server overload
Designed for responsible use
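The robots.txt check can be done with the standard library alone. A minimal offline sketch (separate from the project's robots_validator.py; in production the rules would be fetched from the target domain and cached):

```python
from urllib.robotparser import RobotFileParser

def is_allowed(url: str, robots_txt: str, user_agent: str = "*") -> bool:
    """Check a URL against already-fetched robots.txt rules.

    The robots.txt content is passed in directly so the check stays
    offline; a real validator would download and cache it per domain.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

Calling this before every request, and skipping disallowed URLs, is what the compliance bullet above amounts to in practice.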
Limitations
Cannot bypass authentication or CAPTCHAs
JavaScript-heavy SPAs may require dynamic scraping
Some sites may detect and block scraping despite stealth measures
Rate limiting may slow down bulk operations
Future Scope
Proxy rotation support
CAPTCHA solving integration
Distributed scraping with task queues
Advanced content extraction (structured data, tables)
Scheduled/recurring scrapes
WebSocket real-time updates
License
This project is for educational purposes.