MCP Dataset Onboarding Server

by Magenta91
AUTOMATION_GUIDE.md
# 🤖 MCP Auto Dataset Processor - Complete Guide

## 🎯 **What We Built**

You now have a **fully automated dataset onboarding system** that watches your Google Drive folder and processes new files automatically - no manual file IDs needed!

## 🚀 **Quick Start (3 Steps)**

### **Step 1: Start the Auto-Processor**

```bash
python start_auto_processor.py
```

### **Step 2: Upload Files**

- Upload any CSV or Excel file to your `MCP_server` Google Drive folder
- That's it! No file IDs, no manual commands

### **Step 3: Watch the Magic**

- Files are automatically detected within 30 seconds
- The complete processing pipeline runs automatically
- All artifacts are saved in organized folders

## 📊 **Monitoring & Management**

### **Real-Time Dashboard**

```bash
# View current status
python processor_dashboard.py

# Live monitoring (auto-refreshing)
python processor_dashboard.py --live

# Detailed analytics
python processor_dashboard.py --stats
```

### **Manual Controls**

```bash
# Run a single check
python auto_processor.py --once

# List processed files
python auto_processor.py --list

# Custom check interval
python auto_processor.py --interval 60

# Reset the processed-files log
python auto_processor.py --reset
```

## 🔧 **How It Works**

### **Intelligent File Detection**

- ✅ Monitors the Google Drive folder continuously
- ✅ Only processes supported formats (CSV, Excel)
- ✅ Ignores already-processed files
- ✅ Waits for upload completion before processing
- ✅ Handles multiple files efficiently

### **Automatic Processing Pipeline**

1. **File Detection** → New file uploaded to Google Drive
2. **Download** → File retrieved automatically
3. **Analysis** → Metadata extraction and statistics
4. **Quality Rules** → Intelligent DQ rule generation
5. **Documentation** → Excel contracts and reports
6. **Organization** → Structured folder creation
7. **Tracking** → Processing log updated

### **Smart Features**

- **Duplicate Prevention**: Won't process the same file twice
- **Error Recovery**: Handles failures gracefully
- **Batch Processing**: Can handle multiple files at once
- **Progress Tracking**: Maintains detailed logs
- **Resource Efficient**: Minimal system impact

A sketch of the polling loop behind these features appears after the **Configuration** section below.

## 📁 **Output Structure**

Each processed dataset gets its own organized folder:

```
processed_datasets/
└── your_dataset_name/
    ├── original_file.csv       # Original dataset
    ├── dataset_metadata.json   # Column info & stats
    ├── dataset_contract.xlsx   # Professional contract
    ├── dataset_dq_report.json  # Quality assessment
    └── README.md               # Human-readable summary
```

## 🎛️ **Configuration**

### **Default Settings**

- **Check Interval**: 30 seconds
- **File Age Threshold**: 1 minute (prevents processing a file mid-upload)
- **Supported Formats**: CSV, Excel (.xlsx, .xls)
- **Max Files Per Cycle**: 5

### **Customization**

Edit `auto_config.py` (sketched below) to adjust:

- Check frequency
- File age requirements
- Supported formats
- Logging levels
- Output folders

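For illustration, such a settings module might look like the sketch below. The identifiers are hypothetical stand-ins consistent with the defaults listed above, not necessarily the names the project's `auto_config.py` actually uses:

```python
# auto_config.py - hypothetical settings sketch; identifiers are illustrative.

CHECK_INTERVAL_SECONDS = 30   # how often to poll the Drive folder
MIN_FILE_AGE_SECONDS = 60     # skip files that may still be uploading
MAX_FILES_PER_CYCLE = 5       # batch limit per polling cycle
SUPPORTED_EXTENSIONS = {".csv", ".xlsx", ".xls"}
OUTPUT_ROOT = "processed_datasets"  # where per-dataset folders are created
LOG_LEVEL = "INFO"
```
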
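To make the **How It Works** flow concrete, here is a minimal sketch of the kind of polling loop such a processor runs. It assumes the official Google API Python client (`google-api-python-client`) with a service-account key, the hypothetical settings above, and a `processed_files.json` log (stored here as a plain JSON list of file IDs, which may differ from the project's actual format); the commented-out `run_pipeline` stands in for the download/analysis/documentation steps:

```python
import json
import time
from datetime import datetime, timezone
from pathlib import Path

from google.oauth2 import service_account
from googleapiclient.discovery import build

from auto_config import (CHECK_INTERVAL_SECONDS, MAX_FILES_PER_CYCLE,
                         MIN_FILE_AGE_SECONDS)

FOLDER_ID = "your_server_folder_id"  # MCP_SERVER_FOLDER_ID from .env
SUPPORTED_MIME_TYPES = {
    "text/csv",
    "application/vnd.ms-excel",
    "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
}
LOG_PATH = Path("processed_files.json")


def file_age_seconds(drive_file):
    """Seconds since creation, from Drive's RFC 3339 createdTime."""
    created = datetime.fromisoformat(
        drive_file["createdTime"].replace("Z", "+00:00"))
    return (datetime.now(timezone.utc) - created).total_seconds()


def new_files(service, processed):
    """Return supported, fully uploaded, not-yet-processed files."""
    resp = service.files().list(
        q=f"'{FOLDER_ID}' in parents and trashed = false",
        fields="files(id, name, mimeType, createdTime)",
    ).execute()
    fresh = [
        f for f in resp.get("files", [])
        if f["mimeType"] in SUPPORTED_MIME_TYPES          # format filter
        and f["id"] not in processed                      # duplicate prevention
        and file_age_seconds(f) >= MIN_FILE_AGE_SECONDS   # upload has settled
    ]
    return fresh[:MAX_FILES_PER_CYCLE]                    # batch limit


creds = service_account.Credentials.from_service_account_file(
    "keys/service-account.json",
    scopes=["https://www.googleapis.com/auth/drive.readonly"],
)
service = build("drive", "v3", credentials=creds)
processed = set(json.loads(LOG_PATH.read_text())) if LOG_PATH.exists() else set()

while True:
    for f in new_files(service, processed):
        # run_pipeline(f)  # download, analyze, generate DQ rules and docs
        processed.add(f["id"])
        LOG_PATH.write_text(json.dumps(sorted(processed)))  # progress tracking
    time.sleep(CHECK_INTERVAL_SECONDS)
```

The actual `auto_processor.py` presumably layers error recovery and the dashboard's state on top of a core loop like this; treat the snippet as a mental model rather than the project's code.
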
## 🔍 **Troubleshooting**

### **No Files Being Processed?**

1. Check Google Drive folder permissions
2. Verify the service account has access
3. Ensure files are in supported formats
4. Check `processed_files.json` for duplicates

### **Processing Errors?**

1. Check Google Drive connectivity
2. Verify file formats are valid
3. Check disk space for output folders
4. Review error logs in the console

### **Dashboard Not Showing Data?**

1. Ensure `processed_files.json` exists
2. Check Google Drive API access
3. Verify the folder IDs in the `.env` file

## 🎉 **Benefits**

### **Before (Manual)**

- ❌ Find file IDs manually
- ❌ Run commands for each file
- ❌ Track processed files yourself
- ❌ Organize outputs manually
- ❌ Monitor progress constantly

### **After (Automated)**

- ✅ Just upload files to Google Drive
- ✅ Everything happens automatically
- ✅ Smart duplicate detection
- ✅ Organized output structure
- ✅ Real-time monitoring dashboard

## 🚀 **Production Deployment**

### **Run as a Service (Linux)**

Create a systemd unit at `/etc/systemd/system/mcp-auto-processor.service`:

```ini
[Unit]
Description=MCP Auto Dataset Processor
After=network.target

[Service]
Type=simple
User=your-user
WorkingDirectory=/path/to/mcp
ExecStart=/usr/bin/python3 start_auto_processor.py
Restart=always

[Install]
WantedBy=multi-user.target
```

Then enable and start it:

```bash
sudo systemctl enable mcp-auto-processor
sudo systemctl start mcp-auto-processor
```

### **Run as a Service (Windows)**

Use Task Scheduler or a Windows service wrapper to run `start_auto_processor.py` automatically.

### **Docker Deployment**

```bash
# Build the image
docker build -t mcp-auto-processor .

# Run the auto-processor
docker run -d \
  -v /path/to/service-account.json:/app/keys/service-account.json \
  -v /path/to/processed_datasets:/app/processed_datasets \
  -e GOOGLE_SERVICE_ACCOUNT_KEY_PATH=/app/keys/service-account.json \
  -e MCP_SERVER_FOLDER_ID=your_server_folder_id \
  -e MCP_CLIENT_FOLDER_ID=your_client_folder_id \
  mcp-auto-processor python start_auto_processor.py
```

## 🎯 **Use Cases**

### **Data Teams**

- Automatic ingestion of daily reports
- Continuous data quality monitoring
- Self-service data onboarding

### **Business Users**

- Upload spreadsheets for instant analysis
- Automated documentation generation
- Quality-checked data delivery

### **Data Engineers**

- Hands-off data pipeline integration
- Automated metadata cataloging
- Quality rule enforcement

## 🏆 **You Now Have**

A **production-ready, fully automated dataset onboarding system** that:

- ✅ Requires zero manual intervention
- ✅ Processes files within 30 seconds of upload
- ✅ Generates professional documentation
- ✅ Maintains organized data catalogs
- ✅ Provides real-time monitoring
- ✅ Scales to handle multiple files
- ✅ Recovers from errors gracefully

**Just upload files to Google Drive and walk away!** 🚀
