
🚀 WebClone


An async-first website cloning and rendered capture tool for documentation mirrors, AI knowledge bases, and enterprise RAG pipelines.

AI Knowledge Bases • Features • Quick Start • Usage • Docker • Contributing


🎯 Why WebClone

WebClone helps teams turn authorized websites and documentation into reproducible source material for AI systems. It can mirror static pages, render JavaScript pages when needed, and export structured content for downstream chunking, embedding, search, and RAG workflows.

The goal is simple: make it easier for AI assistants and chatbots to answer from trusted documentation instead of guessing.

WebClone is designed for:

  • Documentation mirrors for projects, products, SDKs, and APIs

  • AI knowledge-base generation from approved public or private docs

  • Enterprise RAG ingestion pipelines that need repeatable source captures

  • Auditable archives with saved HTML, assets, metadata, and rendered outputs

  • Polite crawling with conservative defaults, retry/backoff, and explicit opt-ins


🧠 AI Knowledge Bases & Enterprise RAG

WebClone is designed to help AI teams create high-quality, source-grounded knowledge bases from websites they own or are authorized to process.

What WebClone helps you build

  • RAG corpora from documentation sites, internal portals, product manuals, SDK references, and knowledge centers

  • Chatbot grounding data so assistants answer from approved documentation instead of guessing

  • Offline mirrors for compliance, review, audit, and reproducible AI indexing

  • Structured content exports from rendered pages for chunking, embedding, vector databases, and retrieval pipelines

  • Authenticated captures for private enterprise docs using saved browser cookies or session files

Enterprise-friendly capture flow

Authorized website or docs portal
        ↓
WebClone polite crawler / rendered browser capture
        ↓
HTML mirror + assets + structured_content.json + render_debug_report.json
        ↓
Chunking, embeddings, vector database, search index, or RAG pipeline
        ↓
Grounded AI assistants, copilots, support bots, and internal chatbots

One-page rendered knowledge capture

Use clone-knowledge-page when a page must be rendered like a browser before extracting structured sections:

webclone clone-knowledge-page "https://docs.python.org/3/tutorial/index.html" \
  --render-js \
  --wait-for ".body" \
  --item-selector ".body" \
  --item-text-selector "h1" \
  --detail-selector "p, li, pre" \
  --output ./output/docs-knowledge-page

This writes:

page.rendered.html          # final browser-rendered DOM
structured_content.json     # generic item/detail/label records for ingestion
render_debug_report.json    # counts, final URL, auth-likelihood diagnostics
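
As a rough illustration of how these artifacts might feed a chunking step, the sketch below turns item/detail records into text chunks. The record shape (a list of objects with `item` and `details` keys) and the function name are assumptions made for this example, not WebClone's documented schema; inspect the structured_content.json from your own run and adapt the field names.

```python
from typing import Iterable

# Hypothetical record shape: {"item": "Heading text", "details": ["para", ...]}.
# The real structured_content.json schema may differ.


def records_to_chunks(records: Iterable[dict], max_chars: int = 500) -> list[str]:
    """Group each record's details into chunks of roughly max_chars characters.

    Every chunk is prefixed with the item heading so it stays self-describing.
    A single detail longer than max_chars is kept whole rather than split.
    """
    chunks: list[str] = []
    for record in records:
        heading = str(record.get("item", "")).strip()
        current = heading
        for detail in record.get("details", []):
            detail = str(detail).strip()
            if not detail:
                continue
            candidate = f"{current}\n{detail}" if current else detail
            if len(candidate) > max_chars and current != heading:
                # Close the current chunk and start a new one under the same heading.
                chunks.append(current)
                current = f"{heading}\n{detail}" if heading else detail
            else:
                current = candidate
        # Records with a heading but no details produce no chunk.
        if current and current != heading:
            chunks.append(current)
    return chunks
```

Repeating the heading in each chunk is one simple way to keep retrieval results readable on their own; other chunking strategies may suit your corpus better.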

Documentation-site crawl for RAG

For a normal documentation site, start with polite limits and expand deliberately:

webclone clone "https://docs.python.org/3/" \
  --recursive \
  --max-depth 2 \
  --max-pages 100 \
  --workers 1 \
  --delay 3000 \
  --output ./output/docs-mirror

Use the generated mirror as the reproducible source of truth for your indexing and embedding jobs.
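
One way to make a mirror usable as a reproducible source of truth is to record a content hash per file and re-index only what changed between captures. The sketch below uses only the Python standard library; the function names are hypothetical and it makes no assumption about WebClone's output layout beyond "a directory of files".

```python
import hashlib
from pathlib import Path


def build_manifest(mirror_dir: Path) -> dict[str, str]:
    """Map each file path (relative to mirror_dir) to its SHA-256 hex digest."""
    manifest: dict[str, str] = {}
    for path in sorted(mirror_dir.rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            manifest[path.relative_to(mirror_dir).as_posix()] = digest
    return manifest


def changed_files(old: dict[str, str], new: dict[str, str]) -> list[str]:
    """Return paths that are new or whose content hash differs from the old manifest."""
    return sorted(path for path, digest in new.items() if old.get(path) != digest)
```

Storing the manifest next to the embeddings lets an indexing job state exactly which capture it was built from.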


✨ Features

🚀 Polite Async Crawl Engine

  • Concurrent downloads with configurable workers and conservative defaults

  • Intelligent queue management with duplicate URL suppression

  • Retry logic with exponential backoff, jitter, and Retry-After handling

  • Stop-after-429 protections to respect target rate limits
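
To make the retry bullet concrete, here is a minimal sketch of exponential backoff with "full jitter" that also honors a Retry-After value. It is an illustration of the general technique, not WebClone's actual implementation; the function name and default values are assumptions, and only the numeric-seconds form of Retry-After is handled (the header may also carry an HTTP date).

```python
import random
from typing import Optional


def retry_delay(
    attempt: int,
    base: float = 1.0,
    cap: float = 60.0,
    retry_after: Optional[float] = None,
    rng: Optional[random.Random] = None,
) -> float:
    """Return the number of seconds to wait before retry number `attempt` (0-based)."""
    if rng is None:
        rng = random.Random()
    # Exponential growth, capped so late retries do not wait without bound.
    ceiling = min(cap, base * (2 ** attempt))
    # Full jitter: pick uniformly in [0, ceiling] so clients do not retry in lockstep.
    delay = rng.uniform(0.0, ceiling)
    # A server-supplied Retry-After (in seconds) is treated as a lower bound.
    if retry_after is not None:
        delay = max(delay, retry_after)
    return delay
```

Treating Retry-After as a floor rather than a suggestion is the conservative choice for a polite crawler.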

🎭 Dynamic Page Rendering & Structured Capture

  • Full Selenium integration for JavaScript-heavy sites

  • Authenticated cookie loading for authorized private documentation

  • Selector waits and configured clicks before saving the final DOM

  • Generic structured content extraction for RAG and chatbot knowledge bases

  • PDF snapshot generation with Chrome DevTools Protocol

  • Screenshot capture for visual archival

🔐 Authentication & Responsible Browser Sessions

  • Cookie-based auth: save and reuse authorized browser sessions

  • Rendered private docs: capture pages that require an authenticated browser session

  • Browser configuration: practical Selenium defaults for dynamic pages

  • Rate-limit awareness: retry/backoff and stop thresholds for polite operation

  • Audit-friendly outputs: final URL, counts, and auth-likelihood diagnostics

🎨 World-Class CLI Experience

  • Beautiful terminal UI powered by Rich

  • Real-time progress bars with per-resource status

  • Colored, formatted output with tables and panels

  • JSON logs for production monitoring

🏗️ Production-Grade Architecture

  • Type-safe: 100% type hints with Mypy validation

  • Data validation: Pydantic V2 models with strict schemas

  • Async-first: Built on aiohttp and asyncio

  • Modular design: Clean Architecture with dependency injection

  • Comprehensive logging: Structured JSON logs with contextual data

📦 Modern Tooling

  • ⚡ uv: Lightning-fast dependency management

  • 🔍 ruff: Ultra-fast linting and formatting

  • 🧪 pytest: Comprehensive test suite with >90% coverage

  • 🐳 Docker: Multi-stage builds with distroless base images

  • 🔒 Security: Bandit audits and dependency scanning


🔒 Authorized Testing & Security Defaults

WebClone is designed for legitimate archiving, classroom labs, and authorized security research. Before crawling a target, confirm that you own the system or have written permission to test it.

Security-oriented defaults now include:

  • Public web targets only by default: localhost, loopback, link-local, private, and reserved IP targets are blocked to reduce SSRF-style misuse and accidental internal-network crawling.

  • Same-domain crawling by default: recursive crawls stay on the starting domain unless you explicitly pass --all-domains.

  • Bounded concurrency and pacing: workers and request delay are configurable so authorized assessments can minimize operational impact.

  • Per-asset size limits: --max-asset-bytes prevents unexpectedly large assets from exhausting disk or memory.

  • Fragment normalization: URL fragments are stripped before crawling to reduce duplicate requests.
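
The checks described in this list can be sketched with the standard library. This is an illustrative approximation, not WebClone's code: the function names are hypothetical, and the address check only inspects IP literals and the name `localhost`, whereas a real guard would also resolve hostnames and test every returned address.

```python
import ipaddress
from urllib.parse import urldefrag, urlsplit


def is_blocked_address(host: str) -> bool:
    """Return True for loopback, link-local, private, or reserved IP literals."""
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        # Not an IP literal; only the well-known local name is rejected here.
        return host.lower() == "localhost"
    return ip.is_loopback or ip.is_link_local or ip.is_private or ip.is_reserved


def normalize(url: str) -> str:
    """Strip the fragment so /page#a and /page#b count as one URL."""
    return urldefrag(url).url


def same_domain(url: str, start_url: str) -> bool:
    """Return True when url is on the same host as the crawl's starting URL."""
    return urlsplit(url).hostname == urlsplit(start_url).hostname
```

For example, `normalize("https://example.com/a?x=1#top")` yields `https://example.com/a?x=1`, so both fragment variants share one queue entry.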

For an isolated lab or a private documentation server, opt in deliberately. For example, if you are running a local test site at http://127.0.0.1:8000:

webclone clone http://127.0.0.1:8000 \
  --allow-private-networks \
  --max-pages 25 \
  --workers 1 \
  --delay 3000

🚀 Quick Start

Prerequisites

  • Python 3.11+

  • uv (recommended) or pip

Installation

# Using uv (recommended)
curl -LsSf https://astral.sh/uv/install.sh | sh
uv pip install webclone

# Or using pip
pip install webclone

# Or from source
git clone https://github.com/ruslanmv/webclone.git
cd webclone
make install
make run  # verifies the CLI using the project .venv/src checkout

Your First Knowledge Capture

# Clone a single page with safe defaults
webclone clone https://example.com

# Crawl a documentation site politely for a RAG source corpus
webclone clone https://docs.python.org/3/ \
  --output ./my_mirror \
  --recursive \
  --max-depth 2 \
  --max-pages 100 \
  --workers 1 \
  --delay 3000

# Render one authorized knowledge page and export structured JSON
webclone clone-knowledge-page https://docs.python.org/3/tutorial/index.html \
  --render-js \
  --wait-for ".body" \
  --item-selector ".body section" \
  --item-text-selector "h1, h2" \
  --detail-selector "p, li, pre"

That's it! WebClone creates reproducible mirrors and structured capture artifacts you can feed into chunking, embedding, search, and RAG pipelines.

🎨 Enterprise Desktop GUI (NEW!)

WebClone now includes a professional, native desktop interface built with modern Tkinter for superior performance:

# Install with GUI support
make install-gui

# Launch the Enterprise Desktop GUI
make gui

The GUI opens instantly as a native desktop application with:

  • 🏠 Home Dashboard - Feature overview and quick start guide

  • 🔐 Authentication Manager - Visual cookie-based auth workflow with browser integration

  • 🛡️ Download-resistance audits - Role/access, JavaScript-rendered preview, HAR/API, content-leak, and bulk-fetch checks for owned gated content

  • 📥 Crawl Configurator - Point-and-click settings with real-time progress

  • 📊 Results Analytics - Comprehensive stats, tables, and export options

No command line is required: the desktop interface offers instant startup, native performance, and OS integration.

Advantages over web-based GUIs:

  • ✅ Instant startup (no server to launch)

  • ✅ Native desktop performance

  • ✅ Better OS integration (file dialogs, notifications)

  • ✅ No port conflicts

  • ✅ Offline-friendly

🤖 MCP Server for AI Agents (NEW!)

WebClone is now an official Model Context Protocol (MCP) server, making website cloning available to AI agents like Claude, CrewAI, and any MCP-compatible framework!

# Install MCP server
make install-mcp

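# Use with Claude Desktop - add to its config file, for example:
# ~/.config/claude/claude_desktop_config.json
# (on macOS: ~/Library/Application Support/Claude/claude_desktop_config.json)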
{
  "mcpServers": {
    "webclone": {
      "command": "python",
      "args": ["/path/to/webclone/webclone-mcp.py"]
    }
  }
}
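
If the config file already lists other MCP servers, the entry has to be merged rather than pasted over them. The sketch below shows one way to do that on the parsed JSON; the helper name is hypothetical, and the script path is the same placeholder used above, which you would replace with your own checkout location.

```python
import copy
import json


def add_mcp_server(config: dict, name: str, command: str, args: list[str]) -> dict:
    """Return a copy of the config with one server entry added under "mcpServers"."""
    merged = copy.deepcopy(config)  # leave the caller's dict untouched
    merged.setdefault("mcpServers", {})[name] = {"command": command, "args": args}
    return merged


if __name__ == "__main__":
    existing = {"mcpServers": {"other": {"command": "node", "args": ["server.js"]}}}
    updated = add_mcp_server(
        existing, "webclone", "python", ["/path/to/webclone/webclone-mcp.py"]
    )
    print(json.dumps(updated, indent=2))
```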

AI agents can now:

  • 🌐 clone_website - Download entire websites automatically

  • 📥 download_file - Fetch specific files or URLs

  • 🔐 save_authentication - Guide for saving login sessions

  • 📋 list_saved_sessions - View all authentication cookies

  • ℹ️ get_site_info - Analyze websites before downloading

Example with Claude:

You: Clone the FastAPI documentation website

Claude: I'll clone that for you.
[Uses WebClone MCP tool]

✅ Cloned 127 pages, 543 assets, 45.2 MB total!

Compatible with:

  • ✅ Claude Desktop

  • ✅ CrewAI

  • ✅ LangChain

  • ✅ Any MCP-compatible AI framework

📖 See: docs/MCP_GUIDE.md and MCP_QUICKSTART.md


📖 Usage

Interface Options

WebClone offers four ways to use it:

  1. 🎨 Desktop GUI (Easiest - Enterprise Edition)

    make gui
    • Native desktop application

    • Instant startup, no browser required

    • Visual authentication manager

    • Real-time progress tracking

    • Perfect for all users!

  2. 🤖 MCP Server (For AI Agents)

    make install-mcp
    • Claude Desktop integration

    • CrewAI compatible

    • LangChain ready

    • AI-powered automation

    • Perfect for AI workflows!

  3. 💻 Command Line (Most Powerful)

    webclone clone https://example.com
    • Automation and scripting

    • CI/CD pipelines

    • Remote servers

    • Power users

  4. 🐍 Python API (Most Flexible)

    from webclone.core import AsyncCrawler
    # ... your code
    • Custom integrations

    • Advanced workflows

    • Developers

Basic Commands

# Show help
webclone --help

# Clone a website
webclone clone <URL> [OPTIONS]

# Analyze a page without downloading
webclone info <URL>

Advanced Options

webclone clone https://example.com [OPTIONS]

  --output ./mirror           # Output directory (default: website_mirror)
  --recursive                 # Follow discovered links (default: off)
  --workers 1                 # Concurrent workers (default: 1)
  --max-pages 100             # Maximum pages to crawl (0 = unlimited)
  --max-depth 3               # Maximum crawl depth (0 = unlimited)
  --delay 3000                # Delay between requests in ms
  --no-assets                 # Skip downloading CSS, JS, images
  --no-pdf                    # Skip PDF generation
  --all-domains               # Follow links to other domains
  --verbose                   # Detailed logging output
  --json-logs                 # JSON-formatted logs for parsing
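
For scripted or CI use, the options above can be assembled into an argument list from Python. This is a small sketch using only flags listed in this section; the helper name and its defaults are hypothetical, and it builds the command without running it.

```python
def build_clone_command(
    url: str,
    output: str,
    *,
    recursive: bool = False,
    workers: int = 1,
    max_pages: int = 100,
    max_depth: int = 3,
    delay_ms: int = 3000,
    json_logs: bool = False,
) -> list[str]:
    """Build an argv list for `webclone clone` from keyword settings."""
    argv = [
        "webclone", "clone", url,
        "--output", output,
        "--workers", str(workers),
        "--max-pages", str(max_pages),
        "--max-depth", str(max_depth),
        "--delay", str(delay_ms),
    ]
    # Boolean switches are appended only when enabled.
    if recursive:
        argv.append("--recursive")
    if json_logs:
        argv.append("--json-logs")
    return argv
```

The resulting list can be passed to `subprocess.run(argv, check=True)`; passing a list rather than a shell string avoids quoting problems with URLs.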

For rendered knowledge-page extraction:

webclone clone-knowledge-page https://docs.python.org/3/tutorial/index.html \
  --render-js \
  --wait-for ".body" \
  --item-selector ".body" \
  --item-text-selector "h1" \
  --detail-selector "p, li, pre" \
  --output ./knowledge-page

Real-World Examples

# Archive a blog or news site politely (limit pages to avoid overload)
webclone clone https://www.python.org/blogs/ --recursive --max-pages 50 --workers 1 --delay 3000

# Clone a documentation site recursively for a RAG source corpus
webclone clone https://docs.python.org/3/ --recursive --max-depth 3 --max-pages 250 --delay 3000

# Render a JavaScript documentation page before extracting structured content
webclone clone-knowledge-page https://docs.python.org/3/tutorial/index.html \
  --render-js \
  --wait-for ".body" \
  --item-selector ".body" \
  --item-text-selector "h1" \
  --detail-selector "p, li, pre"

# Production mode with JSON logs
webclone clone https://example.com --json-logs --output /var/data/mirror
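
When logs are emitted as JSON, a monitoring job can summarize them line by line. The sketch below assumes one JSON object per line and a field named `level`; both are assumptions for illustration, since the actual log schema is not documented here, so the field name is left configurable.

```python
import json
from collections import Counter
from typing import Iterable


def count_by_field(lines: Iterable[str], field: str = "level") -> Counter:
    """Count JSON log records by the value of one field, tolerating bad lines."""
    counts: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            counts["<unparsed>"] += 1
            continue
        if isinstance(record, dict):
            counts[str(record.get(field, "<missing>"))] += 1
        else:
            # Valid JSON that is not an object is counted as unparsed.
            counts["<unparsed>"] += 1
    return counts
```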

🔐 Authenticated Browser Sessions

For private documentation that you are authorized to access, save a browser session once and reuse its cookies for later rendered captures.

# Run the interactive authentication examples
python examples/authenticated_crawl.py

Python API for saved sessions:

from pathlib import Path
from webclone.models.config import SeleniumConfig
from webclone.services import SeleniumService

# Open a visible browser and save cookies after manual sign-in.
config = SeleniumConfig(headless=False)
service = SeleniumService(config)
service.start_driver()
service.manual_login_session(
    "https://example.com",
    Path("./cookies/example.json"),
)

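# Later, reuse the cookies for an authorized browser session.
# Selenium can only add cookies for the domain of the page currently
# loaded, so navigate to the site first and then load the saved cookies.
config = SeleniumConfig(headless=True)
service = SeleniumService(config)
service.start_driver()
service.navigate_to("https://example.com")
service.load_cookies(Path("./cookies/example.json"))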

See Authentication Guide for detailed instructions.
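
Before reusing a saved session it can help to drop cookies that have expired or belong to another site. The sketch below assumes the saved file holds a list of cookie dictionaries in the shape Selenium's `get_cookies()` returns (`name`, `value`, `domain`, optional `expiry` in epoch seconds); that file format and the helper name are assumptions, not something this README specifies.

```python
import time
from typing import Optional


def usable_cookies(
    cookies: list[dict], host: str, now: Optional[float] = None
) -> list[dict]:
    """Keep cookies that are unexpired and whose domain matches the target host."""
    now = time.time() if now is None else now
    host = host.lower()
    kept: list[dict] = []
    for cookie in cookies:
        expiry = cookie.get("expiry")
        if expiry is not None and expiry <= now:
            continue  # expired; session cookies have no expiry and are kept
        domain = str(cookie.get("domain", "")).lstrip(".").lower()
        if domain and not (host == domain or host.endswith("." + domain)):
            continue  # cookie belongs to a different site
        kept.append(cookie)
    return kept
```

If the filtered list comes back empty, the saved session has probably lapsed and a fresh manual sign-in is needed.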


🐳 Docker

Run WebClone in a containerized environment:

# Build the image
make docker-build

# Or manually
docker build -t webclone:latest .

# Run a clone (the volume path is quoted so directories with spaces work)
docker run --rm -v "$(pwd)/output:/data" webclone:latest \
  clone https://example.com --max-pages 10

# Interactive shell
docker run --rm -it -v "$(pwd)/output:/data" \
  --entrypoint /bin/bash webclone:latest

Docker Compose Example

version: '3.8'
services:
  webclone:
    image: webclone:latest
    volumes:
      - ./output:/data
    command: clone https://example.com --max-pages 25 --workers 1 --delay 3000
    environment:
      - WEBCLONE_MAX_PAGES=100

🏗️ Architecture

WebClone follows Clean Architecture principles:

src/webclone/
├── cli.py              # Typer CLI interface
├── core/               # Core business logic
│   ├── crawler.py           # Async web crawler
│   ├── downloader.py        # Asset downloader
│   ├── rendered_fetcher.py  # Selenium rendered capture
│   └── content_extractor.py # Structured content extraction for RAG
├── models/             # Pydantic data models
│   ├── config.py       # Configuration schemas
│   └── metadata.py     # Result metadata
├── services/           # External service integrations
│   └── selenium_service.py
└── utils/              # Shared utilities
    ├── logger.py
    └── helpers.py

Key Design Decisions

  1. Async-First: All I/O operations use asyncio for maximum concurrency

  2. Type Safety: 100% type coverage with strict Mypy checks

  3. Pydantic V2: Data validation at system boundaries

  4. Responsible crawling: safer defaults, Retry-After handling, and explicit opt-ins for broader crawls

  5. RAG-ready outputs: rendered HTML plus structured JSON for downstream chunking, embeddings, and retrieval

  6. Dependency Injection: Services receive dependencies via constructors

  7. Single Responsibility: Each module has one clear purpose
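
The async-first and dependency-injection decisions can be illustrated together. In the sketch below, a hypothetical `PoliteFetcher` class (not part of WebClone's API) receives its fetch function through the constructor and uses an `asyncio.Semaphore` to bound concurrency, with an optional pause to pace requests. Injecting the fetcher makes the class testable without network access.

```python
import asyncio
from typing import Awaitable, Callable

# Any coroutine function that maps a URL to a response body.
Fetcher = Callable[[str], Awaitable[str]]


class PoliteFetcher:
    """Hypothetical example class showing bounded, paced, injectable fetching."""

    def __init__(self, fetch: Fetcher, workers: int = 1, delay_s: float = 0.0) -> None:
        self._fetch = fetch  # injected dependency
        self._workers = workers
        self._delay_s = delay_s

    async def fetch_all(self, urls: list[str]) -> dict[str, str]:
        """Fetch every URL with at most `workers` requests in flight."""
        semaphore = asyncio.Semaphore(self._workers)

        async def fetch_one(url: str) -> tuple[str, str]:
            async with semaphore:
                body = await self._fetch(url)
                # Hold the slot during the pause so requests stay paced.
                await asyncio.sleep(self._delay_s)
                return url, body

        pairs = await asyncio.gather(*(fetch_one(url) for url in urls))
        return dict(pairs)
```

In a test, `fetch` can be a stub coroutine; in production it would wrap a real HTTP client call.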


🧪 Development

Setup Development Environment

# Clone the repository
git clone https://github.com/ruslanmv/webclone.git
cd webclone

# Install with dev dependencies
make dev

# Run tests
make test

# Run linter and type checker
make audit

# Format code
make format

Run Tests

# Full test suite with coverage
make test

# Fast tests without coverage
make test-fast

# Generate HTML coverage report
make coverage

Code Quality

# Lint with ruff
make lint

# Type check with mypy
make typecheck

# Format code
make format

# Run all quality checks
make audit

🤝 Contributing

We welcome contributions! Please see CONTRIBUTING.md for guidelines.

Quick Contribution Workflow

  1. Fork the repository

  2. Create a feature branch (git checkout -b feature/amazing-feature)

  3. Make your changes

  4. Run quality checks (make audit)

  5. Commit your changes (git commit -m 'Add amazing feature')

  6. Push to the branch (git push origin feature/amazing-feature)

  7. Open a Pull Request


📊 Benchmarks

Tested on a standard 4-core machine with 100 Mbps connection:

Website Type    Pages   Assets   Time (WebClone)   Time (wget)   Speedup
Static Site     50      200      8s                45s           5.6x
Blog            100     500      25s               3m 20s        8.0x
Documentation   200     800      1m 10s            12m 15s       10.5x
SPA/Dynamic     30      150      35s               N/A*          ∞

*wget cannot render JavaScript-based SPAs
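
The Speedup column is the wget time divided by the WebClone time. A short sketch of that arithmetic, with hypothetical helper names, parses the durations as written in the table:

```python
import re


def to_seconds(text: str) -> int:
    """Parse durations such as '8s', '3m 20s', or '12m 15s' into seconds."""
    match = re.fullmatch(r"\s*(?:(\d+)m)?\s*(?:(\d+)s)?\s*", text)
    if not match or not (match.group(1) or match.group(2)):
        raise ValueError(f"unrecognized duration: {text!r}")
    minutes = int(match.group(1) or 0)
    seconds = int(match.group(2) or 0)
    return minutes * 60 + seconds


def speedup(webclone_time: str, wget_time: str) -> float:
    """Return how many times faster the first duration is than the second."""
    return to_seconds(wget_time) / to_seconds(webclone_time)
```

For the static-site row this gives 45 / 8 = 5.625, which the table reports as 5.6x; the documentation row gives 735 / 70 = 10.5.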


📄 License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.


👤 Author

Ruslan Magana


🌟 Star History

If you find WebClone useful, please consider giving it a star! ⭐



🙏 Acknowledgments

  • Typer - Beautiful CLI framework

  • Rich - Rich terminal formatting

  • Pydantic - Data validation

  • aiohttp - Async HTTP client

  • uv - Lightning-fast package installer


Made with ❤️ by Ruslan Magana
