Skip to main content
Glama
README.mdโ€ข14.5 kB
# ๐Ÿš€ WebClone <div align="center"> [![Python Version](https://img.shields.io/badge/python-3.11%2B-blue.svg)](https://www.python.org/downloads/) [![License](https://img.shields.io/badge/license-Apache%202.0-green.svg)](LICENSE) [![Code style: ruff](https://img.shields.io/badge/code%20style-ruff-000000.svg)](https://github.com/astral-sh/ruff) [![Type checked: mypy](https://img.shields.io/badge/type%20checked-mypy-blue.svg)](http://mypy-lang.org/) **A blazingly fast, async-first website cloning engine that preserves everything.** [Features](#-features) โ€ข [Quick Start](#-quick-start) โ€ข [Usage](#-usage) โ€ข [Docker](#-docker) โ€ข [Contributing](#-contributing) </div> --- ## ๐ŸŽฏ The Why Traditional website cloners are **slow**, **blocking**, and **fragile**. They download one resource at a time, freeze on JavaScript-heavy sites, and produce incomplete mirrors. **WebClone** is different. Built from the ground up with modern Python async/await, it: - โšก **Clones 10-100x faster** with concurrent downloads - ๐ŸŽญ **Handles dynamic SPAs** using Selenium for JavaScript rendering - ๐ŸŽจ **Delivers beautiful CLI experience** with real-time progress and colored output - ๐Ÿ—๏ธ **Follows Clean Architecture** with type-safe, production-grade code - ๐Ÿณ **Ships production-ready** with Docker, full test coverage, and CI/CD Whether you're archiving websites, conducting competitive research, or building training datasets, **WebClone** is the definitive solution. --- ## โœจ Features ### ๐Ÿš€ **Blazingly Fast Async Engine** - Concurrent downloads with configurable workers (5-50 parallel connections) - Intelligent queue management with depth-first and breadth-first strategies - Automatic retry logic with exponential backoff ### ๐ŸŽญ **Dynamic Page Rendering** - Full Selenium integration for JavaScript-heavy sites - Automated sidebar navigation for SPAs (Phoenix LiveView, React, Vue) - PDF snapshot generation with Chrome DevTools Protocol - Screenshot capture for visual archival ### ๐Ÿ” **Advanced Authentication & Stealth Mode** โญ NEW - **Bypass bot detection**: Masks automation signatures (navigator.webdriver, etc.) - **Fix GCM/FCM errors**: Disables Google Cloud Messaging registration - **Cookie-based auth**: Save and reuse login sessions - **Handle "insecure browser" blocks**: Automatic workarounds for Google, Facebook, etc. - **Rate limit detection**: Smart throttling and backoff strategies - **Human behavior simulation**: Mouse movements and natural scrolling ### ๐ŸŽจ **World-Class CLI Experience** - Beautiful terminal UI powered by [Rich](https://github.com/Textualize/rich) - Real-time progress bars with per-resource status - Colored, formatted output with tables and panels - JSON logs for production monitoring ### ๐Ÿ—๏ธ **Production-Grade Architecture** - **Type-safe**: 100% type hints with Mypy validation - **Data validation**: Pydantic V2 models with strict schemas - **Async-first**: Built on `aiohttp` and `asyncio` - **Modular design**: Clean Architecture with dependency injection - **Comprehensive logging**: Structured JSON logs with contextual data ### ๐Ÿ“ฆ **Modern Tooling** - โšก **uv**: Lightning-fast dependency management - ๐Ÿ” **ruff**: Ultra-fast linting and formatting - ๐Ÿงช **pytest**: Comprehensive test suite with >90% coverage - ๐Ÿณ **Docker**: Multi-stage builds with distroless base images - ๐Ÿ”’ **Security**: Bandit audits and dependency scanning --- ## ๐Ÿš€ Quick Start ### Prerequisites - Python 3.11+ - [uv](https://github.com/astral-sh/uv) (recommended) or pip ### Installation ```bash # Using uv (recommended - blazingly fast!) curl -LsSf https://astral.sh/uv/install.sh | sh uv pip install webclone # Or using pip pip install webclone # Or from source git clone https://github.com/ruslanmv/webclone.git cd webclone make install ``` ### Your First Clone ```bash # Clone a website webclone clone https://example.com # With custom settings webclone clone https://example.com \ --output ./my_mirror \ --workers 10 \ --max-pages 100 \ --recursive ``` That's it! Watch as WebClone downloads your site at lightning speed with beautiful progress bars. ### ๐ŸŽจ Enterprise Desktop GUI (NEW!) WebClone now includes a professional, native desktop interface built with modern Tkinter for superior performance: ```bash # Install with GUI support make install-gui # Launch the Enterprise Desktop GUI make gui ``` ![](assets/2025-11-25-00-37-55.png) **The GUI opens instantly as a native desktop application with:** - ๐Ÿ  **Home Dashboard** - Feature overview and quick start guide - ๐Ÿ” **Authentication Manager** - Visual cookie-based auth workflow with browser integration - ๐Ÿ“ฅ **Crawl Configurator** - Point-and-click settings with real-time progress - ๐Ÿ“Š **Results Analytics** - Comprehensive stats, tables, and export options **Perfect for everyone!** No command line required - professional desktop interface with instant startup, native performance, and seamless OS integration. **Advantages over web-based GUIs:** โœ… Instant startup (no server to launch) โœ… Native desktop performance โœ… Better OS integration (file dialogs, notifications) โœ… No port conflicts โœ… Offline-friendly ![WebClone Enterprise GUI](https://via.placeholder.com/800x450?text=WebClone+Enterprise+Desktop+GUI) ### ๐Ÿค– MCP Server for AI Agents (NEW!) WebClone is now an **official Model Context Protocol (MCP) server**, making website cloning available to AI agents like Claude, CrewAI, and any MCP-compatible framework! ```bash # Install MCP server make install-mcp # Use with Claude Desktop - add to config: # ~/.config/claude/claude_desktop_config.json { "mcpServers": { "webclone": { "command": "python", "args": ["/path/to/webclone/webclone-mcp.py"] } } } ``` **AI agents can now:** - ๐ŸŒ **clone_website** - Download entire websites automatically - ๐Ÿ“ฅ **download_file** - Fetch specific files or URLs - ๐Ÿ” **save_authentication** - Guide for saving login sessions - ๐Ÿ“‹ **list_saved_sessions** - View all authentication cookies - โ„น๏ธ **get_site_info** - Analyze websites before downloading **Example with Claude:** ``` You: Clone the FastAPI documentation website Claude: I'll clone that for you. [Uses WebClone MCP tool] โœ… Cloned 127 pages, 543 assets, 45.2 MB total! ``` **Compatible with:** - โœ… Claude Desktop - โœ… CrewAI - โœ… LangChain - โœ… Any MCP-compatible AI framework ๐Ÿ“– **See:** `docs/MCP_GUIDE.md` and `MCP_QUICKSTART.md` --- ## ๐Ÿ“– Usage ### Interface Options WebClone offers four ways to use it: 1. **๐ŸŽจ Desktop GUI** (Easiest - Enterprise Edition) ```bash make gui ``` - Native desktop application - Instant startup, no browser required - Visual authentication manager - Real-time progress tracking - Perfect for all users! 2. **๐Ÿค– MCP Server** (For AI Agents) ```bash make install-mcp ``` - Claude Desktop integration - CrewAI compatible - LangChain ready - AI-powered automation - Perfect for AI workflows! 3. **๐Ÿ’ป Command Line** (Most Powerful) ```bash webclone clone https://example.com ``` - Automation and scripting - CI/CD pipelines - Remote servers - Power users 4. **๐Ÿ Python API** (Most Flexible) ```python from webclone.core import AsyncCrawler # ... your code ``` - Custom integrations - Advanced workflows - Developers ### Basic Commands ```bash # Show help webclone --help # Clone a website webclone clone <URL> [OPTIONS] # Analyze a page without downloading webclone info <URL> ``` ### Advanced Options ```bash webclone clone https://example.com \ --output ./mirror # Output directory (default: website_mirror) --workers 10 # Concurrent workers (default: 5) --max-pages 100 # Maximum pages to crawl (0 = unlimited) --max-depth 3 # Maximum crawl depth (0 = unlimited) --delay 100 # Delay between requests in ms --no-assets # Skip downloading CSS, JS, images --no-pdf # Skip PDF generation --all-domains # Follow links to other domains --verbose # Detailed logging output --json-logs # JSON-formatted logs for parsing ``` ### Real-World Examples ```bash # Archive a news site (limit pages to avoid overload) webclone clone https://news.example.com --max-pages 50 --workers 5 # Clone a documentation site recursively webclone clone https://docs.example.com --recursive --max-depth 5 # Fast clone with maximum parallelism webclone clone https://example.com --workers 20 --delay 0 # Production mode with JSON logs webclone clone https://example.com --json-logs --output /var/data/mirror ``` ### ๐Ÿ” Authentication & Stealth Examples WebClone includes advanced features to handle authentication and bypass bot detection: ```bash # Run interactive authentication examples python examples/authenticated_crawl.py # Example 1: Manual login and save cookies # Opens browser, you log in, cookies are saved # Example 2: Use saved cookies for automation # Loads cookies, bypasses authentication # Example 3: Test stealth mode effectiveness # Visits bot detection sites to verify masking ``` **Python API for Authentication:** ```python from pathlib import Path from webclone.services import SeleniumService from webclone.models.config import SeleniumConfig # Manual login and save session config = SeleniumConfig(headless=False) service = SeleniumService(config) service.start_driver() service.manual_login_session( "https://accounts.google.com", Path("./cookies/google.json") ) # Later: Use saved cookies for automation config = SeleniumConfig(headless=True) service = SeleniumService(config) service.start_driver() service.navigate_to("https://google.com") service.load_cookies(Path("./cookies/google.json")) # Now authenticated! ``` **Fixes Common Issues:** - โœ… "Couldn't sign you in - browser may not be secure" - โœ… GCM/FCM registration errors - โœ… Navigator.webdriver detection - โœ… Rate limiting and CAPTCHA challenges See [Authentication Guide](docs/AUTHENTICATION_GUIDE.md) for detailed instructions. --- ## ๐Ÿณ Docker Run WebClone in a containerized environment: ```bash # Build the image make docker-build # Or manually docker build -t webclone:latest . # Run a clone docker run --rm -v $(pwd)/output:/data webclone:latest \ clone https://example.com --max-pages 10 # Interactive shell docker run --rm -it -v $(pwd)/output:/data \ --entrypoint /bin/bash webclone:latest ``` ### Docker Compose Example ```yaml version: '3.8' services: webclone: image: webclone:latest volumes: - ./output:/data command: clone https://example.com --workers 10 environment: - WEBCLONE_MAX_PAGES=100 ``` --- ## ๐Ÿ—๏ธ Architecture WebClone follows **Clean Architecture** principles: ``` src/webclone/ โ”œโ”€โ”€ cli.py # Typer CLI interface โ”œโ”€โ”€ core/ # Core business logic โ”‚ โ”œโ”€โ”€ crawler.py # Async web crawler โ”‚ โ””โ”€โ”€ downloader.py # Asset downloader โ”œโ”€โ”€ models/ # Pydantic data models โ”‚ โ”œโ”€โ”€ config.py # Configuration schemas โ”‚ โ””โ”€โ”€ metadata.py # Result metadata โ”œโ”€โ”€ services/ # External service integrations โ”‚ โ””โ”€โ”€ selenium_service.py โ””โ”€โ”€ utils/ # Shared utilities โ”œโ”€โ”€ logger.py โ””โ”€โ”€ helpers.py ``` ### Key Design Decisions 1. **Async-First**: All I/O operations use `asyncio` for maximum concurrency 2. **Type Safety**: 100% type coverage with strict Mypy checks 3. **Pydantic V2**: Data validation at system boundaries 4. **Dependency Injection**: Services receive dependencies via constructors 5. **Single Responsibility**: Each module has one clear purpose --- ## ๐Ÿงช Development ### Setup Development Environment ```bash # Clone the repository git clone https://github.com/ruslanmv/webclone.git cd webclone # Install with dev dependencies make dev # Run tests make test # Run linter and type checker make audit # Format code make format ``` ### Run Tests ```bash # Full test suite with coverage make test # Fast tests without coverage make test-fast # Generate HTML coverage report make coverage ``` ### Code Quality ```bash # Lint with ruff make lint # Type check with mypy make typecheck # Format code make format # Run all quality checks make audit ``` --- ## ๐Ÿค Contributing We welcome contributions! Please see [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines. ### Quick Contribution Workflow 1. Fork the repository 2. Create a feature branch (`git checkout -b feature/amazing-feature`) 3. Make your changes 4. Run quality checks (`make audit`) 5. Commit your changes (`git commit -m 'Add amazing feature'`) 6. Push to the branch (`git push origin feature/amazing-feature`) 7. Open a Pull Request --- ## ๐Ÿ“Š Benchmarks Tested on a standard 4-core machine with 100 Mbps connection: | Website Type | Pages | Assets | Time (WebClone) | Time (wget) | Speedup | |--------------|-------|--------|------------------|-------------|---------| | Static Site | 50 | 200 | 8s | 45s | **5.6x** | | Blog | 100 | 500 | 25s | 3m 20s | **8.0x** | | Documentation| 200 | 800 | 1m 10s | 12m 15s | **10.5x** | | SPA/Dynamic | 30 | 150 | 35s | N/A* | **โˆž** | *wget cannot render JavaScript-based SPAs --- ## ๐Ÿ“„ License This project is licensed under the **Apache License 2.0** - see the [LICENSE](LICENSE) file for details. --- ## ๐Ÿ‘ค Author **Ruslan Magana** - Website: [ruslanmv.com](https://ruslanmv.com) - GitHub: [@ruslanmv](https://github.com/ruslanmv) - Email: contact@ruslanmv.com --- ## ๐ŸŒŸ Star History If you find WebClone useful, please consider giving it a star! โญ [![Star History Chart](https://api.star-history.com/svg?repos=ruslanmv/webclone&type=Date)](https://star-history.com/#ruslanmv/webclone&Date) --- ## ๐Ÿ™ Acknowledgments - [Typer](https://typer.tiangolo.com/) - Beautiful CLI framework - [Rich](https://rich.readthedocs.io/) - Rich terminal formatting - [Pydantic](https://docs.pydantic.dev/) - Data validation - [aiohttp](https://docs.aiohttp.org/) - Async HTTP client - [uv](https://github.com/astral-sh/uv) - Lightning-fast package installer --- <div align="center"> **Made with โค๏ธ by [Ruslan Magana](https://ruslanmv.com)** </div>

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/ruslanmv/webclone'

If you have feedback or need assistance with the MCP directory API, please join our Discord server