# WebClone Enterprise Desktop GUI - Complete User Guide
## Overview
WebClone's Enterprise Desktop GUI provides a professional, native desktop interface for website cloning and archival. Built with modern Tkinter (ttkbootstrap), it offers superior performance, instant startup, and seamless OS integration.
---
## Quick Start
### Installation
```bash
# Clone or navigate to WebClone directory
cd webclone
# Install with GUI support
make install-gui
```
### Launch
```bash
# Start the GUI
make gui
```
The Enterprise Desktop GUI will open immediately as a native application.
---
## Interface Guide
### Application Layout
The WebClone Enterprise GUI features:
- **Left Sidebar**: Navigation with 4 main sections
- **Content Area**: Dynamic pages with forms, controls, and results
- **Modern Theme**: Professional "cosmo" theme with blue accents
- **Native Components**: OS-integrated dialogs and controls
---
### 1. Home Dashboard
The home page provides:
#### Feature Cards
Four prominent cards highlighting key capabilities:
1. **Blazingly Fast**
- Async-first architecture
- 10-100x faster than traditional crawlers
- Concurrent downloads with semaphore control
2. **Advanced Authentication**
- Stealth mode bypasses bot detection
- Cookie-based persistent sessions
- Support for 2FA and complex logins
3. **Smart Crawling**
- Intelligent link discovery
- Asset extraction (CSS, JS, images, fonts)
- Respects robots.txt and rate limits
4. **Comprehensive Analytics**
- Detailed metrics and reports
- Real-time progress tracking
- Export capabilities
#### Quick Start Guide
Step-by-step workflow overview:
1. (Optional) Set up authentication
2. Navigate to 'Crawl Website'
3. Enter URL and configure
4. Start crawl and monitor
5. View results
#### Action Buttons
- **Set Up Authentication** - Jump to auth page
- **Start Crawling Now** - Jump to crawl page
**Use Case:** First-time users should start here to understand the workflow.
---
### 2. Authentication Manager
Manage authenticated sessions for protected websites using a tabbed interface.
#### Tab 1: New Login
Create new authenticated sessions:
**Login Configuration**
1. **Login URL**
- Enter the site's login page URL
- Example: `https://accounts.google.com`
- Example: `https://www.facebook.com/login`
- Must be a valid HTTP/HTTPS URL
2. **Session Name**
- Choose a memorable, descriptive name
- Good: `google_work`, `facebook_personal`, `github_private`
- Bad: `session1`, `temp`, `test`
- Used as filename: `cookies/[session_name].json`
**Workflow**
1. Fill in Login URL and Session Name
2. Click **"Open Browser for Login"**
- Chrome browser opens with stealth mode enabled
- Navigator.webdriver is masked
- Google Cloud Services disabled
- Bot detection bypassed
3. Manually log in to the site
- Enter username/password
- Complete 2FA if required
- Solve CAPTCHAs if needed
- Navigate to your target page
4. Click **"Save Session & Cookies"**
- All cookies are saved to `./cookies/[session_name].json`
- Browser closes automatically
- Session ready for use!
**Status Messages**
- "Opening browser..." - Browser launching
- "โ
Browser opened!" - Ready for login
- "โ
Session saved!" - Complete
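Once a session is saved, it can be selected from the Authentication dropdown on the crawl page, or reused from the command line. A minimal sketch, assuming the `--cookie-file` flag shown in the CLI Quick Reference later in this guide and the example session name `google_work`:
```bash
# Reuse the cookies captured above for an authenticated crawl
# (session name and URL are examples; substitute your own)
webclone clone https://example.com/private \
  --cookie-file ./cookies/google_work.json
```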
#### Tab 2: Saved Sessions
View and manage existing cookie files:
**Session List (Treeview)**
Columns displayed:
- **Session Name** - Your chosen identifier
- **Created Date** - When cookies were saved
- **Domain** - Primary domain of cookies
**Actions**
- **Refresh List** - Reload saved sessions
- **Delete Selected** - Remove unwanted sessions
- Prompts for confirmation
- Permanently deletes the cookie file
- **Open Cookies Folder** - View in file explorer
- Windows: Opens in Explorer
- macOS: Opens in Finder
- Linux: Opens in file manager
**Use Case:** Organize multiple accounts (work, personal, test environments)
#### Tab 3: Help
Comprehensive authentication guide including:
- Why authenticate
- How it works
- Supported sites
- Security considerations
- Troubleshooting tips
#### Supported Sites
WebClone's stealth mode works with:
- ✅ **Google** (Gmail, Drive, Docs, Photos, etc.)
- ✅ **Facebook** (profiles, groups, pages)
- ✅ **LinkedIn** (profiles, connections)
- ✅ **Twitter/X** (timelines, profiles)
- ✅ **GitHub** (private repos, gists)
- ✅ **GitLab** (private projects)
- ✅ **Bitbucket** (private repos)
- ✅ **Any cookie-based authentication system**
---
### 3. Crawl Website
The main interface for configuring and running crawls.
#### Basic Configuration
**Website URL**
- The starting point for your crawl
- Must be a valid HTTP/HTTPS URL
- Example: `https://example.com`
- Example: `https://docs.python.org/3/`
**Output Directory**
- Where downloaded files will be saved
- Default: `./website_mirror`
- Created automatically if it doesn't exist
- Subdirectories created:
- `pages/` - HTML pages
- `assets/` - CSS, JS, images, fonts
- `pdfs/` - PDF snapshots (if enabled)
- `reports/` - Crawl metadata
- Use **Browse...** button for native file dialog
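For orientation, this is roughly what the output location looks like after a crawl, assuming the default `./website_mirror` directory and the subdirectories listed above:
```
website_mirror/
├── pages/      # HTML pages
├── assets/     # CSS, JS, images, fonts
├── pdfs/       # PDF snapshots (only if enabled)
└── reports/    # Crawl metadata
```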
#### Advanced Settings (Expandable)
**Crawling Options (Left Column)**
1. **Recursive Crawl** (toggle)
- ✅ Enabled: Follow links to other pages
- ❌ Disabled: Only download start URL
- Recommended: Enabled for full site mirrors
2. **Same Domain Only** (toggle)
- ✅ Enabled: Only crawl URLs on same domain
- ❌ Disabled: Follow external links too
- Recommended: Enabled to avoid downloading entire internet
3. **Include Assets** (toggle)
- ✅ Enabled: Download CSS, JS, images, fonts
- ❌ Disabled: HTML pages only
- Recommended: Enabled for complete offline viewing
4. **Generate PDFs** (toggle)
- ✅ Enabled: Create PDF snapshots of each page
- ❌ Disabled: Skip PDF generation
- Note: Slower crawl, larger storage
- Recommended: Disabled for most use cases
**Performance Options (Right Column)**
| Setting | Range | Description | Recommended |
|---------|-------|-------------|-------------|
| **Workers** | 1-50 | Concurrent download threads | 5-10 |
| **Delay (ms)** | 0-5000 | Pause between requests | 100-1000 |
| **Max Depth** | 0-10 | How deep to follow links | 0 (unlimited) |
| **Max Pages** | 0-10000 | Maximum pages to download | 0 (unlimited) |
**Configuration Tips:**
- **Fast crawl, gentle site**: Workers=10, Delay=100ms
- **Slow crawl, avoid detection**: Workers=3, Delay=2000ms
- **Test run**: Max Pages=10, Max Depth=2
- **Full mirror**: All defaults (unlimited)
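The same performance settings map onto CLI flags if you later want to script a crawl; a sketch using the flags shown in the CLI Quick Reference later in this guide:
```bash
# Roughly equivalent to Workers=5, Delay=1000ms, Max Depth=3, Max Pages=200 in the GUI
webclone clone https://docs.python.org/3/ \
  --workers 5 \
  --delay-ms 1000 \
  --max-depth 3 \
  --max-pages 200
```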
#### Authentication (Optional)
**Cookie File Selection**
- Dropdown shows all saved sessions
- Default: "None" (no authentication)
- Select a session to use saved cookies
- Click **Refresh** to reload list after creating new sessions
**When to Use:**
- ✅ Site requires login
- ✅ Content behind paywall
- ✅ Private repositories
- ✅ Personal accounts
- ❌ Public websites (leave as "None")
#### Control Buttons
**Start Crawl** (Large green button)
- Begins the crawl operation
- Disabled during active crawl
- Validates configuration first
- Starts background thread
**Stop Crawl** (Large red button)
- Stops crawl immediately
- Enabled during active crawl
- Saves partial results
- Cleans up resources
#### Progress Section
**Progress Bar**
- Visual indicator of completion
- Updates in real-time
- Striped animation shows activity
**Status Text**
- Current operation description
- Examples:
- "Ready to start crawling..."
- "Initializing crawl..."
- "Crawling: https://example.com/page1"
- "โ
Crawl completed successfully!"
- "โ Error: [error message]"
**Live Statistics**
Four real-time counters:
1. **Pages: X** - HTML pages downloaded
2. **Assets: X** - CSS, JS, images downloaded
3. **Size: X MB** - Total data downloaded
4. **Time: Xs** - Elapsed time
Updates every 500ms during crawl.
---
### 4. Results & Analytics
View detailed statistics and download reports.
#### When No Results
If you haven't completed a crawl:
- Shows a friendly placeholder message
- The **Start a Crawl** button navigates to the crawl page
#### Summary Cards
Four large metric cards at the top:
1. **Pages Crawled**
- Total HTML pages downloaded
- Large blue number display
2. **Assets Downloaded**
- CSS, JS, images, fonts, etc.
- Large info-colored number
3. **Total Size**
- Combined size in MB (with 2 decimals)
- Large green number
4. **Duration**
- Total crawl time in seconds
- Large yellow number
#### Detailed Results (Tabbed Interface)
**Tab 1: Pages**
Sortable table (Treeview) with columns:
- **URL** - Full page URL (400px wide)
- **Status** - HTTP status code (80px)
- **Depth** - Link depth from start URL (80px)
- **Links** - Number of links found (80px)
- **Assets** - Number of assets found (80px)
Scrollable for large result sets.
**Tab 2: Assets**
Asset distribution by type:
- Grouped count display
- Shows: HTML, CSS, JAVASCRIPT, IMAGE, FONT, etc.
- Monospace font for alignment
- Example output:
```
Asset Distribution:
CSS: 15
HTML: 42
IMAGE: 127
JAVASCRIPT: 23
```
**Tab 3: Export**
Export options:
1. **Export as JSON**
- Opens native "Save As" dialog
- Saves complete metadata as JSON
- Includes all pages, assets, timestamps
- Default filename: `[domain]_crawl_results.json`
2. **Open Output Directory**
- Opens file explorer at output location
- Quick access to downloaded files
- OS-specific:
- Windows: `explorer`
- macOS: `open`
- Linux: `xdg-open`
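The exported JSON is plain text and can be sanity-checked with standard tools; for example, pretty-printing the first lines with Python's built-in `json.tool` (the filename below is a placeholder for whatever you chose in the Save As dialog):
```bash
# Pretty-print the beginning of the exported metadata
python -m json.tool example.com_crawl_results.json | head -n 40
```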
---
## Usage Examples
### Example 1: Simple Public Website
**Scenario:** Archive a blog for offline reading
```
1. Launch GUI: make gui
2. Navigate to "Crawl Website"
3. URL: https://myblog.com
4. Settings:
   ✅ Recursive
   ✅ Same Domain Only
   ✅ Include Assets
   Workers: 10
   Delay: 100ms
5. Click "Start Crawl"
6. Monitor progress in real-time
7. Click "Results & Analytics" when done
8. Review statistics and export if desired
```
**Expected Time:** 30 seconds - 5 minutes (depends on site size)
### Example 2: Authenticated Site (Google Drive)
**Scenario:** Download documents from Google Drive
```
Phase 1: Authentication (One Time)
1. Launch GUI: make gui
2. Navigate to "Authentication"
3. Tab: "New Login"
4. Login URL: https://accounts.google.com
5. Session Name: google_drive_personal
6. Click "Open Browser for Login"
7. Log in to Google (username, password, 2FA)
8. Navigate to Google Drive
9. Click "Save Session & Cookies"

Phase 2: Crawl (Every Time)
1. Navigate to "Crawl Website"
2. URL: https://drive.google.com/drive/folders/[YOUR_FOLDER_ID]
3. Authentication: Select "google_drive_personal"
4. Settings:
   ✅ Recursive
   Max Depth: 3
   Workers: 5 (gentle on Google)
   Delay: 500ms
5. Click "Start Crawl"
6. View results in "Results & Analytics"
```
**Expected Time:** 2-10 minutes (depends on content)
### Example 3: Large Documentation Site
**Scenario:** Clone Python docs without getting blocked
```
1. Launch GUI: make gui
2. Navigate to "Crawl Website"
3. URL: https://docs.python.org/3/
4. Advanced Settings:
   ✅ Recursive
   ✅ Same Domain Only
   ✅ Include Assets
   ❌ Generate PDFs (too slow)
   Max Pages: 200 (limit scope)
   Workers: 5 (be gentle)
   Delay: 1000ms (1 second between requests)
5. Click "Start Crawl"
6. Go get coffee
7. Monitor in "Results & Analytics"
```
**Expected Time:** 10-30 minutes
### Example 4: Test Run Before Full Crawl
**Scenario:** Preview what will be downloaded
```
1. Launch GUI: make gui
2. Navigate to "Crawl Website"
3. URL: https://unknownsite.com
4. Settings (Conservative):
   ✅ Recursive
   Max Depth: 2 (only go 2 links deep)
   Max Pages: 10 (stop after 10 pages)
   Workers: 3
   Delay: 500ms
5. Click "Start Crawl"
6. Review results
7. If good, run again with unlimited settings
```
**Expected Time:** 30 seconds - 2 minutes
---
## Configuration
### Environment Variables
Create `.env` file in project root:
```bash
# Selenium Configuration
WEBCLONE_SELENIUM_HEADLESS=false # Show browser (for debugging)
WEBCLONE_SELENIUM_TIMEOUT=30
# Crawler Configuration
WEBCLONE_WORKERS=5
WEBCLONE_DELAY_MS=100
WEBCLONE_MAX_PAGES=0 # 0 = unlimited
```
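One way to create this file from the shell, using the same values as above (adjust them to taste):
```bash
# Write a .env file with the settings shown above into the project root
cat > .env <<'EOF'
WEBCLONE_SELENIUM_HEADLESS=false
WEBCLONE_SELENIUM_TIMEOUT=30
WEBCLONE_WORKERS=5
WEBCLONE_DELAY_MS=100
WEBCLONE_MAX_PAGES=0
EOF
```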
### Cookie Storage
Location: `./cookies/` directory
**File Format:** JSON
```json
[
{
"name": "session_id",
"value": "abc123...",
"domain": ".google.com",
"path": "/",
"expires": 1735689600,
"secure": true,
"httpOnly": true
}
]
```
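To see what a saved session actually contains (domains, cookie names, expiry), you can inspect it with a small one-liner; this is a sketch that assumes the JSON layout above and an example session file named `google_work.json`:
```bash
# List domain, name, and expiry for each cookie in a saved session
python -c '
import json, datetime
for c in json.load(open("cookies/google_work.json")):
    exp = c.get("expires")
    when = datetime.datetime.fromtimestamp(exp).isoformat() if exp else "session cookie"
    print(c["domain"], c["name"], "expires:", when)
'
```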
**Security:**
- ⚠️ Never commit to git (already in `.gitignore`)
- ⚠️ Treat like passwords
- ⚠️ Don't share or email
- ⚠️ Store securely
---
## Troubleshooting
### GUI Won't Start
**Problem:** ImportError or module not found
**Solutions:**
```bash
# Reinstall GUI dependencies
make install-gui
# Or manually
pip install "ttkbootstrap>=1.10.1"
# Verify installation
python -c "import ttkbootstrap; print(ttkbootstrap.__version__)"
```
### GUI Looks Broken/Ugly
**Problem:** Missing ttkbootstrap, falling back to basic Tkinter
**Solution:**
```bash
# Install ttkbootstrap for modern theme
pip install "ttkbootstrap>=1.10.1"
```
**Verify:** The window title should read "WebClone" and the interface should use the blue "cosmo" theme
### Browser Won't Open for Auth
**Problem:** "Chrome not found" or "WebDriver error"
**Solutions:**
1. Install Chrome/Chromium:
```bash
# Ubuntu/Debian
sudo apt install chromium-browser
# Fedora
sudo dnf install chromium
# macOS
brew install --cask google-chrome
# Windows
# Download from: https://www.google.com/chrome/
```
2. Update WebDriver:
```bash
pip install --upgrade selenium webdriver-manager
```
3. Verify Chrome installation:
```bash
google-chrome --version
# or
chromium --version
```
### Cookies Not Working
**Problem:** Still getting "Login required" or access denied
**Possible Causes & Solutions:**
1. **Session Expired**
- Cookie lifetime depends on site (1-30 days typical)
- Solution: Delete old cookie file, re-authenticate
2. **Domain Mismatch**
- Cookies saved for `accounts.google.com`
- Trying to access `drive.google.com`
- Solution: Make sure to navigate to target domain during login
3. **IP Address Changed**
- Some sites invalidate cookies on IP change
- Solution: Re-authenticate from new IP
4. **2FA Required Again**
- Site requires periodic 2FA verification
- Solution: Complete 2FA during manual login
5. **Cookie File Corrupted**
- JSON parse error
- Solution: Delete and recreate session
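Before deleting a suspect session, you can confirm whether the file is valid JSON (the session name below is an example):
```bash
# If the file is corrupted, json.tool reports a parse error and the confirmation is skipped
python -m json.tool cookies/my_session.json > /dev/null && echo "cookie file is valid JSON"
```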
### Slow Crawling
**Problem:** Taking too long
**Optimization Steps:**
1. **Increase Workers**
```
Workers: 10-20 (from 5)
```
2. **Reduce Delay**
```
Delay: 50-100ms (from 500ms)
```
3. **Limit Scope**
```
Max Pages: 100
Max Depth: 3
```
4. **Disable PDF Generation**
```
❌ Generate PDFs (uncheck)
```
5. **Disable Asset Download** (HTML only)
```
❌ Include Assets (uncheck)
```
### High Memory Usage
**Problem:** GUI freezes or system slows down
**Solutions:**
1. **Reduce Workers**
```
Workers: 3 (from 10)
```
2. **Increase Delay**
```
Delay: 1000ms (1 second)
```
3. **Limit Pages**
```
Max Pages: 50-100
```
4. **Close Other Applications**
5. **Increase System RAM** (if possible)
### Crawl Stops/Fails
**Problem:** Error message or unexpected stop
**Common Errors:**
1. **Network Error**
- Check internet connection
- Try again with higher delay
2. **Permission Denied**
- Output directory not writable
- Choose different output location
3. **Rate Limited**
- Site detected high request rate
- Increase delay_ms
- Reduce workers
4. **Invalid URL**
- Check URL format
- Must start with `http://` or `https://`
---
## Security & Privacy
### Cookie Security
**Storage:**
- Cookies stored locally in `./cookies/` directory
- Plain JSON format (not encrypted)
- Readable by your user account; restrict file permissions if the machine is shared (see the example below)
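On Linux and macOS, a minimal example of locking the saved sessions down to your user only:
```bash
# Restrict the cookies directory and files to the owner only
chmod 700 cookies
chmod 600 cookies/*.json
```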
**Transmission:**
- Never sent to remote servers
- WebClone has no telemetry or analytics
- All processing is local
**Expiry:**
- Cookies expire per site's policy
- Typically 1-30 days
- Must re-authenticate when expired
### Best Practices
1. **Don't Share Cookie Files**
- Contain authentication tokens
- Same as sharing your password
- Can grant full account access
2. **Use Dedicated Accounts**
- Create separate accounts for scraping
- Don't use primary personal accounts
- Reduces risk if compromised
3. **Rotate Sessions Regularly**
- Re-authenticate periodically
- Delete old cookie files
- Keep organized
4. **Check Terms of Service**
- Ensure scraping is allowed
- Some sites prohibit automated access
- Respect site policies
5. **Respect Rate Limits**
- Use appropriate delays
- Don't overwhelm servers
- Be a good internet citizen
6. **Secure Your Computer**
- Cookie files are only as secure as your system
- Use disk encryption
- Strong user password
---
## Support
### Getting Help
- **Documentation**: `docs/` directory
- **Quick Start**: `GUI_QUICKSTART.md`
- **Authentication**: `docs/AUTHENTICATION_GUIDE.md`
- **GitHub Issues**: https://github.com/ruslanmv/webclone/issues
- **Email**: contact@ruslanmv.com
- **Website**: ruslanmv.com
### Reporting Bugs
When reporting issues, include:
1. **Environment**
- OS: Windows/macOS/Linux + version
- Python version: `python --version`
- WebClone version: Check `pyproject.toml`
2. **Steps to Reproduce**
- Exact sequence of actions
- Configuration used
- URL being crawled (if not private)
3. **Error Messages**
- Complete error text
- Stack traces if available
- Screenshots of GUI (if relevant)
4. **Expected vs Actual**
- What you expected to happen
- What actually happened
### Feature Requests
Open a GitHub issue with:
- Clear description of desired feature
- Use case / motivation
- Proposed implementation (if applicable)
---
## Tips & Tricks
### Performance Optimization
1. **Start Small**
- Test with `Max Pages: 10` first
- Verify configuration works
- Then expand to full crawl
2. **Tune Workers**
- More isn't always better
- CPU-bound: workers ≈ CPU cores
- I/O-bound: workers = 5-10 optimal
- Experiment to find sweet spot
3. **Use Delays Strategically**
- Public sites: 100-500ms sufficient
- Protected sites: 1000-2000ms safer
- Prevents rate limiting
- Reduces server load
4. **Same Domain Only**
- Prevents downloading external assets
- CDNs, analytics, ads
- Significantly faster
- More focused content
### Authentication Tips
1. **Save Multiple Sessions**
- One per site/account
- Example: `google_work`, `google_personal`
- Easy to switch between
2. **Use Descriptive Names**
- ✅ Good: `github_company_repos`, `facebook_marketing`
- ❌ Bad: `session1`, `temp`, `test123`
3. **Test Cookies Immediately**
- After saving, do a small test crawl
- Verify authentication works
- Catch issues early
4. **Refresh Periodically**
- Sessions expire (1-30 days)
- Re-authenticate weekly for active use
- Delete stale sessions
### Content Selection
**HTML Only** (fastest, smallest):
```
❌ Include Assets
❌ Generate PDFs
Max Pages: Unlimited
```
**Full Mirror** (complete, large):
```
✅ Include Assets
❌ Generate PDFs (unless needed)
✅ Recursive
Max Pages: Unlimited
```
**Quick Preview** (test run):
```
✅ Include Assets
Max Pages: 10
Max Depth: 2
```
**Documentation Archive** (text-focused):
```
✅ Include Assets (for CSS/readability)
❌ Generate PDFs
Same Domain Only
```
---
## Desktop GUI vs CLI
### When to Use Desktop GUI
✅ **Visual workflow preferred**
✅ **First-time users**
✅ **Point-and-click configuration**
✅ **Real-time progress monitoring**
✅ **Managing multiple cookie sessions**
✅ **Interactive authentication**
✅ **Windows users (GUI is easier)**
### When to Use CLI
✅ **Automation/scripting**
✅ **CI/CD pipelines**
✅ **Remote servers (SSH)**
✅ **Batch operations**
✅ **Configuration files (reproducible)**
✅ **Power users**
✅ **Headless environments**
### CLI Quick Reference
```bash
# Basic crawl
webclone clone https://example.com
# With options
webclone clone https://example.com \
--output ./mirror \
--workers 10 \
--delay-ms 100 \
--max-depth 3 \
--recursive
# With authentication
webclone clone https://protected.com \
--cookie-file ./cookies/my_session.json
# Help
webclone --help
webclone clone --help
```
---
## Advanced Usage
### Batch Crawling Multiple Sites
Use the GUI for each site individually, or automate with CLI:
```bash
#!/bin/bash
# batch_crawl.sh
sites=(
"https://site1.com"
"https://site2.com"
"https://site3.com"
)
for site in "${sites[@]}"; do
webclone clone "$site" --max-pages 50
done
```
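A small variant of the script above, using the `--output` flag from the CLI Quick Reference, gives each site its own output directory so the mirrors don't overwrite each other:
```bash
#!/bin/bash
# batch_crawl_separate.sh - one output directory per site
sites=(
  "https://site1.com"
  "https://site2.com"
  "https://site3.com"
)
for site in "${sites[@]}"; do
  # Derive a folder name from the URL, e.g. "site1.com"
  name=$(echo "$site" | sed -E 's|https?://||; s|/|_|g')
  webclone clone "$site" --max-pages 50 --output "./mirrors/$name"
done
```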
### Scheduled Crawls
**Linux/Mac (cron):**
```bash
# Edit crontab
crontab -e
# Run daily at 2 AM
0 2 * * * cd /path/to/webclone && webclone clone https://example.com
```
**Windows (Task Scheduler):**
- Open Task Scheduler
- Create Basic Task
- Trigger: Daily
- Action: Start Program
- Program: `python`
- Arguments: `-m webclone.cli clone https://example.com`
- Start in: `C:\path\to\webclone`
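If you prefer the command prompt over the Task Scheduler UI, an equivalent daily task can be registered with `schtasks` (task name and paths below are examples):
```
schtasks /Create /SC DAILY /ST 02:00 /TN "WebClone Daily" /TR "cmd /c cd /d C:\path\to\webclone && python -m webclone.cli clone https://example.com"
```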
### Custom Configuration Files
Create `crawl_config.json`:
```json
{
"start_url": "https://example.com",
"output_dir": "./mirror",
"recursive": true,
"workers": 10,
"delay_ms": 100
}
```
Load via CLI:
```bash
webclone clone --config crawl_config.json
```
---
## Advantages of Enterprise Desktop GUI
### vs Web-Based GUIs (Streamlit, Flask, etc.)
✅ **Instant Startup** - No server to launch, opens in <1 second
✅ **Native Performance** - No browser overhead, direct system calls
✅ **Better Responsiveness** - No HTTP roundtrips
✅ **No Port Conflicts** - Doesn't need localhost ports
✅ **Offline Friendly** - Works without network (except crawling)
✅ **Native File Dialogs** - OS-integrated file/folder selection
✅ **System Integration** - Taskbar, notifications, multi-monitor
✅ **Resource Efficient** - Lower memory and CPU usage
✅ **Professional Appearance** - Modern theme matches OS
✅ **Easier Distribution** - Single executable possible
### vs Command-Line Interface
✅ **Visual Configuration** - See all options at once
✅ **No Syntax Errors** - Point-and-click vs typing commands
✅ **Real-Time Feedback** - Progress bars, live stats
✅ **Easier Authentication** - Integrated browser workflow
✅ **Session Management** - Visual cookie file selection
✅ **Results Viewing** - Tables, cards, charts (vs plain text)
✅ **Lower Learning Curve** - Intuitive for non-technical users
✅ **Discoverability** - All features visible in UI
---
**Author**: Ruslan Magana
**Website**: [ruslanmv.com](https://ruslanmv.com)
**License**: Apache 2.0
---
*Happy Crawling with WebClone Enterprise Desktop GUI!*