AI-Driven Universal Web Data Extraction Platform

by Barath2812

A production-grade, MCP-enabled universal web scraping platform with MongoDB storage and an anti-bot-detection layer (the project's "antigravity" module).

🎯 Features

  • Dual Scraping Engines: Static (Requests + BeautifulSoup) and Dynamic (Playwright)

  • Auto-Detection: Automatically selects the appropriate scraper based on page content

  • Anti-Bot Protection: User-Agent rotation, rate limiting, robots.txt compliance, stealth mode

  • MongoDB Storage: Persists all scraped data with full metadata

  • MCP Integration: Exposes scraping as tools for LLM invocation

  • Export Options: JSON and CSV export capabilities
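To illustrate the auto-detection idea: fetch the page statically first, then fall back to the Playwright scraper when the HTML looks like a JavaScript-rendered shell. The sketch below is illustrative only; the marker list, threshold, and function name are assumptions, not the actual `strategy_selector.py` code.

```python
import re

# Common client-side-rendering fingerprints (illustrative, not exhaustive).
JS_FRAMEWORK_MARKERS = ("ng-app", "data-reactroot", 'id="root"', 'id="__next"', "window.__NUXT__")

def needs_dynamic_scraping(html: str, min_text_length: int = 200) -> bool:
    """Return True if statically fetched HTML looks like a JS app shell."""
    lowered = html.lower()
    if any(marker.lower() in lowered for marker in JS_FRAMEWORK_MARKERS):
        return True
    # Crudely strip scripts and tags; very little visible text suggests
    # the real content is rendered client-side.
    text = re.sub(r"<script.*?</script>", " ", html, flags=re.S | re.I)
    text = re.sub(r"<[^>]+>", " ", text)
    return len(" ".join(text.split())) < min_text_length
```

A page that trips either check would be routed to the Playwright engine; everything else stays on the cheaper Requests + BeautifulSoup path.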

📁 Project Structure

d:\mcp\
├── requirements.txt          # Python dependencies
├── config.py                 # Configuration settings
├── main.py                   # FastAPI MCP server entry point
├── scraper/
│   ├── static_scraper.py     # Requests + BeautifulSoup scraper
│   ├── dynamic_scraper.py    # Playwright scraper
│   └── strategy_selector.py  # Auto-detection logic
├── antigravity/
│   ├── user_agents.py        # User-Agent rotation
│   ├── throttle.py           # Request delays & rate limiting
│   ├── robots_validator.py   # robots.txt compliance
│   └── stealth.py            # Playwright stealth configuration
├── database/
│   ├── mongodb.py            # MongoDB connection & operations
│   └── models.py             # Pydantic data models
├── mcp/
│   └── tools.py              # MCP tool definitions
├── utils/
│   ├── normalizer.py         # Data normalization
│   └── exporter.py           # CSV/JSON export
├── tests/                    # Test suite
└── docs/
    └── README.md             # This file

🚀 Quick Start

1. Install Dependencies

cd d:\mcp
pip install -r requirements.txt
playwright install chromium

2. Start MongoDB

Ensure MongoDB is running on localhost:27017 (or update MONGODB_URI in config.py).
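A common shape for that setting in config.py is an environment-variable override with a localhost default. This is an assumed sketch of the pattern, not the actual file contents (the variable names `MONGODB_URI` and `MONGODB_DB` beyond what the README states are assumptions):

```python
# Sketch: environment override with localhost fallback (illustrative).
import os

MONGODB_URI = os.environ.get("MONGODB_URI", "mongodb://localhost:27017")
DATABASE_NAME = os.environ.get("MONGODB_DB", "scraper_db")
```

With this pattern, pointing the platform at a remote cluster is a matter of exporting `MONGODB_URI` before starting the server.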

3. Run the Server

python main.py

The server will start at http://localhost:8000.

4. Test the API

Open http://localhost:8000/docs for interactive Swagger documentation.

🔌 API Endpoints

| Endpoint | Method | Description |
|---|---|---|
| /scrape | POST/GET | Scrape a website |
| /stats | GET | Get scraping statistics |
| /recent | GET | Get recently scraped data |
| /logs | GET | Get scrape logs |
| /export/json | POST | Export data to JSON |
| /export/csv | POST | Export data to CSV |
| /health | GET | Health check |

Example Scrape Request

curl -X POST "http://localhost:8000/scrape" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com", "auto_detect": true}'
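The same request can be issued from Python with only the standard library. The sketch below just constructs the request object; the commented-out `urlopen` call is what you would run with the server up.

```python
# Build the /scrape POST request shown in the curl example above.
import json
import urllib.request

payload = json.dumps({"url": "https://example.com", "auto_detect": True}).encode()
req = urllib.request.Request(
    "http://localhost:8000/scrape",
    data=payload,
    headers={"Content-Type": "application/json"},
    method="POST",
)
# response = urllib.request.urlopen(req)  # uncomment with the server running
```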

🧠 MCP Tool Usage

The platform exposes a scrape_website tool via MCP:

# Tool Schema
{
    "name": "scrape_website",
    "parameters": {
        "url": "string (required)",
        "dynamic": "boolean (default: false)",
        "auto_detect": "boolean (default: true)",
        "store_in_mongodb": "boolean (default: true)"
    }
}
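At invocation time, an MCP client supplies the tool name plus arguments, and the server fills in the declared defaults before dispatching to the scraper. A minimal sketch of that default-merging step (the function name `resolve_arguments` is illustrative, not the actual mcp/tools.py code; the defaults mirror the schema above):

```python
# Defaults declared in the tool schema above.
TOOL_DEFAULTS = {"dynamic": False, "auto_detect": True, "store_in_mongodb": True}

def resolve_arguments(arguments: dict) -> dict:
    """Merge caller-supplied arguments over the tool defaults."""
    if "url" not in arguments:
        raise ValueError("'url' is required")
    return {**TOOL_DEFAULTS, **arguments}
```

Caller-supplied values win over defaults, so `{"url": ..., "dynamic": true}` forces the Playwright engine while the other flags keep their schema defaults.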

🛡️ Anti-Bot (Antigravity) Features

  1. User-Agent Rotation: 20+ realistic browser User-Agents

  2. Request Throttling: 1-5 second random delays between requests

  3. Rate Limiting: Max 10 requests per domain per minute

  4. robots.txt Compliance: Respects crawling restrictions

  5. Playwright Stealth Mode: Disables automation detection flags
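Two of these pieces are easy to sketch with the standard library: User-Agent rotation and the per-domain rate limiter. The limit and window below mirror the README's "10 requests per domain per minute"; the class and function names are illustrative, not the actual antigravity module code, and the UA pool is truncated.

```python
import random
import time
from collections import defaultdict, deque
from typing import Optional

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
    "Mozilla/5.0 (X11; Linux x86_64; rv:124.0) Gecko/20100101 Firefox/124.0",
]  # the real pool has 20+ entries

def random_user_agent() -> str:
    return random.choice(USER_AGENTS)

class DomainRateLimiter:
    """Allow at most `limit` requests per domain within `window` seconds."""

    def __init__(self, limit: int = 10, window: float = 60.0):
        self.limit, self.window = limit, window
        self.history = defaultdict(deque)  # domain -> recent request times

    def allow(self, domain: str, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        q = self.history[domain]
        while q and now - q[0] > self.window:  # drop entries outside the window
            q.popleft()
        if len(q) >= self.limit:
            return False
        q.append(now)
        return True
```

A scraper would call `allow()` before each fetch and sleep (the 1-5 second random delay) when it returns False.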

📊 MongoDB Schema

scraped_data Collection

{
  "_id": "ObjectId",
  "url": "string",
  "scraped_at": "ISO timestamp",
  "scraper_type": "static | dynamic",
  "content": {
    "title": "string",
    "text": "string",
    "links": ["string"]
  },
  "metadata": {
    "status_code": "number",
    "response_time": "number",
    "user_agent": "string"
  }
}

scrape_logs Collection

{
  "url": "string",
  "timestamp": "ISO timestamp",
  "success": "boolean",
  "error": "string | null"
}
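The actual models.py uses Pydantic; as a dependency-free illustration, the scrape_logs document shape maps to something like this stdlib dataclass (field names taken from the schema above, everything else assumed):

```python
# Stdlib stand-in for the Pydantic log model (illustrative shape only).
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
from typing import Optional

@dataclass
class ScrapeLog:
    url: str
    success: bool
    error: Optional[str] = None
    timestamp: str = ""

    def __post_init__(self):
        if not self.timestamp:
            self.timestamp = datetime.now(timezone.utc).isoformat()

    def to_document(self) -> dict:
        """Plain dict, ready to hand to collection.insert_one()."""
        return asdict(self)
```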

🧪 Running Tests

cd d:\mcp
pytest tests/ -v

⚖️ Ethical Considerations

  • Always respects robots.txt directives

  • Implements polite crawling with delays

  • Only scrapes publicly accessible content

  • Rate limiting prevents server overload

  • Designed for responsible use

📋 Limitations

  • Cannot bypass authentication or CAPTCHAs

  • JavaScript-heavy SPAs may require dynamic scraping

  • Some sites may detect and block scraping despite stealth measures

  • Rate limiting may slow down bulk operations

🔮 Future Scope

  • Proxy rotation support

  • CAPTCHA solving integration

  • Distributed scraping with task queues

  • Advanced content extraction (structured data, tables)

  • Scheduled/recurring scrapes

  • WebSocket real-time updates

📄 License

This project is for educational purposes.

