# WebClone
<div align="center">
**A blazingly fast, async-first website cloning engine that preserves everything.**
[Features](#-features) • [Quick Start](#quick-start) • [Usage](#usage) • [Docker](#-docker) • [Contributing](#-contributing)
</div>
---
## 🎯 The Why
Traditional website cloners are **slow**, **blocking**, and **fragile**. They download one resource at a time, freeze on JavaScript-heavy sites, and produce incomplete mirrors.
**WebClone** is different. Built from the ground up with modern Python async/await, it:
- ⚡ **Clones roughly 5-10x faster than `wget`** in the benchmarks below, thanks to concurrent downloads
- 🎭 **Handles dynamic SPAs** using Selenium for JavaScript rendering
- 🎨 **Delivers a beautiful CLI experience** with real-time progress and colored output
- 🏗️ **Follows Clean Architecture** with type-safe, production-grade code
- 🐳 **Ships production-ready** with Docker, >90% test coverage, and CI/CD
Whether you're archiving websites, conducting competitive research, or building training datasets, **WebClone** is the definitive solution.
---
## ✨ Features
### **Blazingly Fast Async Engine**
- Concurrent downloads with configurable workers (5-50 parallel connections)
- Intelligent queue management with depth-first and breadth-first strategies
- Automatic retry logic with exponential backoff
### 🎭 **Dynamic Page Rendering**
- Full Selenium integration for JavaScript-heavy sites
- Automated sidebar navigation for SPAs (Phoenix LiveView, React, Vue)
- PDF snapshot generation with Chrome DevTools Protocol
- Screenshot capture for visual archival
### **Advanced Authentication & Stealth Mode** ⭐ NEW
- **Bypass bot detection**: Masks automation signatures (navigator.webdriver, etc.)
- **Fix GCM/FCM errors**: Disables Google Cloud Messaging registration
- **Cookie-based auth**: Save and reuse login sessions
- **Handle "insecure browser" blocks**: Automatic workarounds for Google, Facebook, etc.
- **Rate limit detection**: Smart throttling and backoff strategies
- **Human behavior simulation**: Mouse movements and natural scrolling
### 🎨 **World-Class CLI Experience**
- Beautiful terminal UI powered by [Rich](https://github.com/Textualize/rich)
- Real-time progress bars with per-resource status
- Colored, formatted output with tables and panels
- JSON logs for production monitoring
### 🏗️ **Production-Grade Architecture**
- **Type-safe**: 100% type hints with Mypy validation
- **Data validation**: Pydantic V2 models with strict schemas
- **Async-first**: Built on `aiohttp` and `asyncio`
- **Modular design**: Clean Architecture with dependency injection
- **Comprehensive logging**: Structured JSON logs with contextual data
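
To make "structured JSON logs with contextual data" concrete, here is a minimal sketch using only the standard `logging` module. The `JsonFormatter` class and the `context` field name are assumptions made for illustration; WebClone's own logger may emit a different shape.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Contextual fields arrive via `extra={"context": {...}}`.
        payload.update(getattr(record, "context", {}))
        return json.dumps(payload)


# Log into an in-memory stream so the output can be inspected.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("webclone.demo")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.propagate = False

logger.info(
    "page saved",
    extra={"context": {"url": "https://example.com", "status": 200}},
)
line = stream.getvalue().strip()
print(line)
```

One JSON object per line keeps the output easy to ship to log aggregators or to filter with tools such as `jq`.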
### 📦 **Modern Tooling**
- **uv**: Lightning-fast dependency management
- **ruff**: Ultra-fast linting and formatting
- **pytest**: Comprehensive test suite with >90% coverage
- **Docker**: Multi-stage builds with distroless base images
- **Security**: Bandit audits and dependency scanning
---
## Quick Start
### Prerequisites
- Python 3.11+
- [uv](https://github.com/astral-sh/uv) (recommended) or pip
### Installation
```bash
# Using uv (recommended - blazingly fast!)
curl -LsSf https://astral.sh/uv/install.sh | sh
uv pip install webclone
# Or using pip
pip install webclone
# Or from source
git clone https://github.com/ruslanmv/webclone.git
cd webclone
make install
```
### Your First Clone
```bash
# Clone a website
webclone clone https://example.com
# With custom settings
webclone clone https://example.com \
--output ./my_mirror \
--workers 10 \
--max-pages 100 \
--recursive
```
That's it! Watch as WebClone downloads your site at lightning speed with beautiful progress bars.
### 🎨 Enterprise Desktop GUI (NEW!)
WebClone now includes a professional, native desktop interface built with modern Tkinter for superior performance:
```bash
# Install with GUI support
make install-gui
# Launch the Enterprise Desktop GUI
make gui
```

**The GUI opens instantly as a native desktop application with:**
- **Home Dashboard** - Feature overview and quick start guide
- **Authentication Manager** - Visual cookie-based auth workflow with browser integration
- **Crawl Configurator** - Point-and-click settings with real-time progress
- **Results Analytics** - Comprehensive stats, tables, and export options
**Perfect for everyone!** No command line required - professional desktop interface with instant startup, native performance, and seamless OS integration.
**Advantages over web-based GUIs:**
- ✅ Instant startup (no server to launch)
- ✅ Native desktop performance
- ✅ Better OS integration (file dialogs, notifications)
- ✅ No port conflicts
- ✅ Offline-friendly

### 🤖 MCP Server for AI Agents (NEW!)
WebClone is now an **official Model Context Protocol (MCP) server**, making website cloning available to AI agents like Claude, CrewAI, and any MCP-compatible framework!
```bash
# Install MCP server
make install-mcp
```
To use it with Claude Desktop, add the following to `~/.config/claude/claude_desktop_config.json`:
```json
{
  "mcpServers": {
    "webclone": {
      "command": "python",
      "args": ["/path/to/webclone/webclone-mcp.py"]
    }
  }
}
```
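
If the config file already lists other MCP servers, the entry should be merged in, not pasted over them. The sketch below shows one way to do that with the standard library; `register_mcp_server` is a hypothetical helper (WebClone does not ship it), and the demo writes to a temporary file, not your real Claude Desktop config.

```python
import json
import tempfile
from pathlib import Path


def register_mcp_server(
    config_path: Path, name: str, command: str, args: list[str]
) -> dict:
    """Add or update one entry under "mcpServers", leaving the others untouched."""
    config = json.loads(config_path.read_text()) if config_path.exists() else {}
    config.setdefault("mcpServers", {})[name] = {"command": command, "args": args}
    config_path.parent.mkdir(parents=True, exist_ok=True)
    config_path.write_text(json.dumps(config, indent=2))
    return config


# Demo: start from a config that already contains another (made-up) server.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "claude_desktop_config.json"
    path.write_text(
        json.dumps({"mcpServers": {"other": {"command": "node", "args": []}}})
    )
    result = register_mcp_server(
        path, "webclone", "python", ["/path/to/webclone/webclone-mcp.py"]
    )
    on_disk = json.loads(path.read_text())

print(sorted(result["mcpServers"]))
```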
**AI agents can now:**
- **clone_website** - Download entire websites automatically
- **download_file** - Fetch specific files or URLs
- **save_authentication** - Guide for saving login sessions
- **list_saved_sessions** - View all authentication cookies
- **get_site_info** - Analyze websites before downloading
**Example with Claude:**
```
You: Clone the FastAPI documentation website
Claude: I'll clone that for you.
[Uses WebClone MCP tool]
✅ Cloned 127 pages, 543 assets, 45.2 MB total!
```
**Compatible with:**
- ✅ Claude Desktop
- ✅ CrewAI
- ✅ LangChain
- ✅ Any MCP-compatible AI framework

**See:** `docs/MCP_GUIDE.md` and `MCP_QUICKSTART.md`
---
## Usage
### Interface Options
WebClone offers four ways to use it:
1. **Desktop GUI** (Easiest - Enterprise Edition)
   ```bash
   make gui
   ```
   - Native desktop application
   - Instant startup, no browser required
   - Visual authentication manager
   - Real-time progress tracking
   - Perfect for all users!
2. **MCP Server** (For AI Agents)
   ```bash
   make install-mcp
   ```
   - Claude Desktop integration
   - CrewAI compatible
   - LangChain ready
   - AI-powered automation
   - Perfect for AI workflows!
3. **Command Line** (Most Powerful)
   ```bash
   webclone clone https://example.com
   ```
   - Automation and scripting
   - CI/CD pipelines
   - Remote servers
   - Power users
4. **Python API** (Most Flexible)
   ```python
   from webclone.core import AsyncCrawler
   # ... your code
   ```
   - Custom integrations
   - Advanced workflows
   - Developers
### Basic Commands
```bash
# Show help
webclone --help
# Clone a website
webclone clone <URL> [OPTIONS]
# Analyze a page without downloading
webclone info <URL>
```
### Advanced Options
```bash
# Collect the options in an array so each one can carry a comment
opts=(
  --output ./mirror   # Output directory (default: website_mirror)
  --workers 10        # Concurrent workers (default: 5)
  --max-pages 100     # Maximum pages to crawl (0 = unlimited)
  --max-depth 3       # Maximum crawl depth (0 = unlimited)
  --delay 100         # Delay between requests in ms
  --no-assets         # Skip downloading CSS, JS, images
  --no-pdf            # Skip PDF generation
  --all-domains       # Follow links to other domains
  --verbose           # Detailed logging output
  --json-logs         # JSON-formatted logs for parsing
)
webclone clone https://example.com "${opts[@]}"
```
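
To show how `--max-pages`, `--max-depth`, and `--all-domains` could interact, here is a small breadth-first sketch over an in-memory link graph. It is a model of the idea only: `plan_crawl` and the `SITE` data are made up, and treating the start URL as depth 0 is an assumption about how WebClone counts depth.

```python
from collections import deque
from urllib.parse import urlparse


def plan_crawl(start, links, max_pages=0, max_depth=0, all_domains=False):
    """Return URLs in breadth-first order, honoring the limits (0 = unlimited)."""
    home = urlparse(start).netloc
    queue = deque([(start, 0)])
    seen = {start}
    order = []
    while queue:
        url, depth = queue.popleft()
        order.append(url)
        if max_pages and len(order) >= max_pages:
            break  # page budget exhausted
        if max_depth and depth >= max_depth:
            continue  # visit this page but do not follow its links
        for nxt in links.get(url, []):
            if nxt in seen:
                continue
            if not all_domains and urlparse(nxt).netloc != home:
                continue  # stay on the starting domain
            seen.add(nxt)
            queue.append((nxt, depth + 1))
    return order


# Hypothetical site structure: page URL -> links found on that page.
SITE = {
    "https://example.com/": [
        "https://example.com/a",
        "https://example.com/b",
        "https://other.org/",
    ],
    "https://example.com/a": ["https://example.com/a/1"],
    "https://example.com/b": [],
}

print(plan_crawl("https://example.com/", SITE, max_depth=1))
print(plan_crawl("https://example.com/", SITE, max_pages=2))
```

Swapping `queue.popleft()` for `queue.pop()` would turn the same loop into a depth-first traversal.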
### Real-World Examples
```bash
# Archive a news site (limit pages to avoid overload)
webclone clone https://news.example.com --max-pages 50 --workers 5
# Clone a documentation site recursively
webclone clone https://docs.example.com --recursive --max-depth 5
# Fast clone with maximum parallelism
webclone clone https://example.com --workers 20 --delay 0
# Production mode with JSON logs
webclone clone https://example.com --json-logs --output /var/data/mirror
```
### Authentication & Stealth Examples
WebClone includes advanced features to handle authentication and bypass bot detection:
```bash
# Run interactive authentication examples
python examples/authenticated_crawl.py
# Example 1: Manual login and save cookies
# Opens browser, you log in, cookies are saved
# Example 2: Use saved cookies for automation
# Loads cookies, bypasses authentication
# Example 3: Test stealth mode effectiveness
# Visits bot detection sites to verify masking
```
**Python API for Authentication:**
```python
from pathlib import Path
from webclone.services import SeleniumService
from webclone.models.config import SeleniumConfig
# Manual login and save session
config = SeleniumConfig(headless=False)
service = SeleniumService(config)
service.start_driver()
service.manual_login_session(
    "https://accounts.google.com",
    Path("./cookies/google.json"),
)
# Later: Use saved cookies for automation
config = SeleniumConfig(headless=True)
service = SeleniumService(config)
service.start_driver()
service.navigate_to("https://google.com")
service.load_cookies(Path("./cookies/google.json"))
# Now authenticated!
```
**Fixes Common Issues:**
- ✅ "Couldn't sign you in - browser may not be secure"
- ✅ GCM/FCM registration errors
- ✅ Navigator.webdriver detection
- ✅ Rate limiting and CAPTCHA challenges

See [Authentication Guide](docs/AUTHENTICATION_GUIDE.md) for detailed instructions.
---
## 🐳 Docker
Run WebClone in a containerized environment:
```bash
# Build the image
make docker-build
# Or manually
docker build -t webclone:latest .
# Run a clone
docker run --rm -v "$(pwd)/output:/data" webclone:latest \
  clone https://example.com --max-pages 10
# Interactive shell
docker run --rm -it -v "$(pwd)/output:/data" \
  --entrypoint /bin/bash webclone:latest
```
### Docker Compose Example
```yaml
version: '3.8'
services:
  webclone:
    image: webclone:latest
    volumes:
      - ./output:/data
    command: clone https://example.com --workers 10
    environment:
      - WEBCLONE_MAX_PAGES=100
```
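
The Compose file passes a limit through the `WEBCLONE_MAX_PAGES` environment variable. How WebClone reads it is not documented here, so the following is only a sketch of the usual pattern: parse the variable if it is set, fall back to a default otherwise. `max_pages_from_env` is a hypothetical helper.

```python
import os
from collections.abc import Mapping


def max_pages_from_env(default: int = 0, environ: Mapping[str, str] = os.environ) -> int:
    """Read WEBCLONE_MAX_PAGES, falling back to `default` when it is unset."""
    raw = environ.get("WEBCLONE_MAX_PAGES")
    if raw is None:
        return default
    value = int(raw)  # raises ValueError on non-numeric input
    if value < 0:
        raise ValueError("WEBCLONE_MAX_PAGES must be 0 (unlimited) or positive")
    return value


# Passing a plain dict stands in for the container's environment.
print(max_pages_from_env(environ={"WEBCLONE_MAX_PAGES": "100"}))
print(max_pages_from_env(default=7, environ={}))
```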
---
## 🏗️ Architecture
WebClone follows **Clean Architecture** principles:
```
src/webclone/
├── cli.py                  # Typer CLI interface
├── core/                   # Core business logic
│   ├── crawler.py          # Async web crawler
│   └── downloader.py       # Asset downloader
├── models/                 # Pydantic data models
│   ├── config.py           # Configuration schemas
│   └── metadata.py         # Result metadata
├── services/               # External service integrations
│   └── selenium_service.py
└── utils/                  # Shared utilities
    ├── logger.py
    └── helpers.py
```
### Key Design Decisions
1. **Async-First**: All I/O operations use `asyncio` for maximum concurrency
2. **Type Safety**: 100% type coverage with strict Mypy checks
3. **Pydantic V2**: Data validation at system boundaries
4. **Dependency Injection**: Services receive dependencies via constructors
5. **Single Responsibility**: Each module has one clear purpose
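
Constructor-based dependency injection, as named in the list above, can be sketched in a few lines. The `Renderer`, `Crawler`, and `FakeRenderer` classes are illustrative stand-ins, not WebClone's real classes; the point is that the crawler depends on an interface, so a test can swap in a fake without starting a browser.

```python
from typing import Protocol


class Renderer(Protocol):
    """Anything that can turn a URL into HTML."""

    def render(self, url: str) -> str: ...


class Crawler:
    """Receives its renderer through the constructor and never builds one itself."""

    def __init__(self, renderer: Renderer) -> None:
        self._renderer = renderer

    def snapshot_size(self, url: str) -> int:
        """Return the length of the rendered HTML for `url`."""
        return len(self._renderer.render(url))


class FakeRenderer:
    """Test double that returns canned HTML and touches no network."""

    def render(self, url: str) -> str:
        return f"<html>{url}</html>"


crawler = Crawler(FakeRenderer())
size = crawler.snapshot_size("https://example.com")
print(size)
```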
---
## 🧪 Development
### Setup Development Environment
```bash
# Clone the repository
git clone https://github.com/ruslanmv/webclone.git
cd webclone
# Install with dev dependencies
make dev
# Run tests
make test
# Run linter and type checker
make audit
# Format code
make format
```
### Run Tests
```bash
# Full test suite with coverage
make test
# Fast tests without coverage
make test-fast
# Generate HTML coverage report
make coverage
```
### Code Quality
```bash
# Lint with ruff
make lint
# Type check with mypy
make typecheck
# Format code
make format
# Run all quality checks
make audit
```
---
## 🤝 Contributing
We welcome contributions! Please see [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines.
### Quick Contribution Workflow
1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Make your changes
4. Run quality checks (`make audit`)
5. Commit your changes (`git commit -m 'Add amazing feature'`)
6. Push to the branch (`git push origin feature/amazing-feature`)
7. Open a Pull Request
---
## Benchmarks
Tested on a standard 4-core machine with 100 Mbps connection:
| Website Type | Pages | Assets | Time (WebClone) | Time (wget) | Speedup |
|--------------|-------|--------|------------------|-------------|---------|
| Static Site | 50 | 200 | 8s | 45s | **5.6x** |
| Blog | 100 | 500 | 25s | 3m 20s | **8.0x** |
| Documentation| 200 | 800 | 1m 10s | 12m 15s | **10.5x** |
| SPA/Dynamic | 30 | 150 | 35s | N/A\* | N/A |

\*wget cannot render JavaScript-based SPAs, so no speedup can be computed for this row.
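
The Speedup column is the wget time divided by the WebClone time. The short check below recomputes it from the times in the table, converted to seconds; `speedup` is a helper written for this example.

```python
def speedup(baseline_seconds: float, webclone_seconds: float) -> float:
    """Return how many times faster WebClone was, rounded to one decimal."""
    return round(baseline_seconds / webclone_seconds, 1)


# (wget seconds, WebClone seconds), taken from the table above.
rows = {
    "Static Site": (45, 8),
    "Blog": (3 * 60 + 20, 25),
    "Documentation": (12 * 60 + 15, 1 * 60 + 10),
}

for name, (wget_s, webclone_s) in rows.items():
    print(f"{name}: {speedup(wget_s, webclone_s)}x")
```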
---
## License
This project is licensed under the **Apache License 2.0** - see the [LICENSE](LICENSE) file for details.
---
## 👤 Author
**Ruslan Magana**
- Website: [ruslanmv.com](https://ruslanmv.com)
- GitHub: [@ruslanmv](https://github.com/ruslanmv)
- Email: contact@ruslanmv.com
---
## Star History
If you find WebClone useful, please consider giving it a star! ⭐
---
## Acknowledgments
- [Typer](https://typer.tiangolo.com/) - Beautiful CLI framework
- [Rich](https://rich.readthedocs.io/) - Rich terminal formatting
- [Pydantic](https://docs.pydantic.dev/) - Data validation
- [aiohttp](https://docs.aiohttp.org/) - Async HTTP client
- [uv](https://github.com/astral-sh/uv) - Lightning-fast package installer
---
<div align="center">
**Made with ❤️ by [Ruslan Magana](https://ruslanmv.com)**
</div>