# π€ MCP Auto Dataset Processor - Complete Guide
## π― **What We Built**
You now have a **fully automated dataset onboarding system** that watches your Google Drive folder and processes new files automatically - no manual file IDs needed!
## π **Quick Start (3 Steps)**
### **Step 1: Start the Auto-Processor**
```bash
python start_auto_processor.py
```
### **Step 2: Upload Files**
- Upload any CSV or Excel file to your `MCP_server` Google Drive folder
- That's it! No file IDs, no manual commands
### **Step 3: Watch the Magic**
- Files are automatically detected within 30 seconds
- Complete processing pipeline runs automatically
- All artifacts saved in organized folders
## π **Monitoring & Management**
### **Real-Time Dashboard**
```bash
# View current status
python processor_dashboard.py
# Live monitoring (auto-refreshing)
python processor_dashboard.py --live
# Detailed analytics
python processor_dashboard.py --stats
```
### **Manual Controls**
```bash
# Run single check
python auto_processor.py --once
# List processed files
python auto_processor.py --list
# Custom check interval
python auto_processor.py --interval 60
# Reset processed files log
python auto_processor.py --reset
```
## π§ **How It Works**
### **Intelligent File Detection**
- β
Monitors Google Drive folder continuously
- β
Only processes supported formats (CSV, Excel)
- β
Ignores already processed files
- β
Waits for upload completion before processing
- β
Handles multiple files efficiently
### **Automatic Processing Pipeline**
1. **File Detection** β New file uploaded to Google Drive
2. **Download** β File retrieved automatically
3. **Analysis** β Metadata extraction and statistics
4. **Quality Rules** β Intelligent DQ rule generation
5. **Documentation** β Excel contracts and reports
6. **Organization** β Structured folder creation
7. **Tracking** β Processing log updated
### **Smart Features**
- **Duplicate Prevention**: Won't process the same file twice
- **Error Recovery**: Handles failures gracefully
- **Batch Processing**: Can handle multiple files at once
- **Progress Tracking**: Maintains detailed logs
- **Resource Efficient**: Minimal system impact
## π **Output Structure**
Each processed dataset gets its own organized folder:
```
processed_datasets/
βββ your_dataset_name/
βββ original_file.csv # Original dataset
βββ dataset_metadata.json # Column info & stats
βββ dataset_contract.xlsx # Professional contract
βββ dataset_dq_report.json # Quality assessment
βββ README.md # Human-readable summary
```
## ποΈ **Configuration**
### **Default Settings**
- **Check Interval**: 30 seconds
- **File Age Threshold**: 1 minute (prevents processing during upload)
- **Supported Formats**: CSV, Excel (.xlsx, .xls)
- **Max Files Per Cycle**: 5
### **Customization**
Edit `auto_config.py` to adjust:
- Check frequency
- File age requirements
- Supported formats
- Logging levels
- Output folders
## π **Troubleshooting**
### **No Files Being Processed?**
1. Check Google Drive folder permissions
2. Verify service account has access
3. Ensure files are supported formats
4. Check `processed_files.json` for duplicates
### **Processing Errors?**
1. Check Google Drive connectivity
2. Verify file formats are valid
3. Check disk space for output folders
4. Review error logs in console
### **Dashboard Not Showing Data?**
1. Ensure `processed_files.json` exists
2. Check Google Drive API access
3. Verify folder IDs in `.env` file
## π **Benefits**
### **Before (Manual)**
- β Find file ID manually
- β Run commands for each file
- β Track processed files yourself
- β Organize outputs manually
- β Monitor progress constantly
### **After (Automated)**
- β
Just upload files to Google Drive
- β
Everything happens automatically
- β
Smart duplicate detection
- β
Organized output structure
- β
Real-time monitoring dashboard
## π **Production Deployment**
### **Run as Service (Linux)**
```bash
# Create systemd service
sudo nano /etc/systemd/system/mcp-auto-processor.service
[Unit]
Description=MCP Auto Dataset Processor
After=network.target
[Service]
Type=simple
User=your-user
WorkingDirectory=/path/to/mcp
ExecStart=/usr/bin/python3 start_auto_processor.py
Restart=always
[Install]
WantedBy=multi-user.target
# Enable and start
sudo systemctl enable mcp-auto-processor
sudo systemctl start mcp-auto-processor
```
### **Run as Service (Windows)**
Use Task Scheduler or Windows Service Wrapper to run `start_auto_processor.py` automatically.
### **Docker Deployment**
```bash
# Build image
docker build -t mcp-auto-processor .
# Run with auto-processor
docker run -d \
-v /path/to/service-account.json:/app/keys/service-account.json \
-v /path/to/processed_datasets:/app/processed_datasets \
-e GOOGLE_SERVICE_ACCOUNT_KEY_PATH=/app/keys/service-account.json \
-e MCP_SERVER_FOLDER_ID=your_server_folder_id \
-e MCP_CLIENT_FOLDER_ID=your_client_folder_id \
mcp-auto-processor python start_auto_processor.py
```
## π― **Use Cases**
### **Data Teams**
- Automatic ingestion of daily reports
- Continuous data quality monitoring
- Self-service data onboarding
### **Business Users**
- Upload spreadsheets for instant analysis
- Automated documentation generation
- Quality-checked data delivery
### **Data Engineers**
- Hands-off data pipeline integration
- Automated metadata cataloging
- Quality rule enforcement
## π **You Now Have**
A **production-ready, fully automated dataset onboarding system** that:
- β
Requires zero manual intervention
- β
Processes files within 30 seconds of upload
- β
Generates professional documentation
- β
Maintains organized data catalogs
- β
Provides real-time monitoring
- β
Scales to handle multiple files
- β
Recovers from errors gracefully
**Just upload files to Google Drive and walk away!** π