Enables automated dataset onboarding by using Google Drive folders as input sources for raw CSV/Excel files and as catalog storage for processed datasets with metadata, quality reports, and documentation.
MCP Dataset Onboarding Server
A FastAPI-based MCP (Model Context Protocol) server that automates dataset onboarding, using Google Drive as both the input source and a mock catalog.
SECURITY FIRST - READ THIS BEFORE SETUP
This repository contains template files only. You MUST configure your own credentials before use.
Read
Never commit service account keys or real folder IDs to version control!
Features
Automated Dataset Processing: Complete workflow from raw CSV/Excel files to cataloged datasets
Google Drive Integration: Uses Google Drive folders as input source and catalog storage
Metadata Extraction: Automatically extracts column information, data types, and basic statistics
Data Quality Rules: Suggests DQ rules based on data characteristics
Contract Generation: Creates Excel contracts with schema and DQ information
Mock Catalog: Publishes processed artifacts to a catalog folder
Automated Processing: Watches folders and processes files automatically
Multiple Interfaces: FastAPI server, MCP server, CLI tools, and dashboards
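To make the "Data Quality Rules" feature concrete, here is a minimal sketch of how rules could be suggested from data characteristics. It is not this project's implementation: the function name `suggest_dq_rules`, the rule labels, and the sample data are all hypothetical, and only the Python standard library is used.

```python
import csv
import io


def suggest_dq_rules(rows):
    """Suggest simple data-quality rules per column (hypothetical helper).

    `rows` is a list of dicts as produced by csv.DictReader.
    """
    if not rows:
        return {}
    rules = {}
    for column in rows[0].keys():
        values = [row[column] for row in rows]
        present = [v for v in values if v not in ("", None)]
        column_rules = []
        # No missing values observed: propose a not-null rule.
        if len(present) == len(values):
            column_rules.append("not_null")
        # Every present value is distinct: propose a uniqueness rule.
        if present and len(set(present)) == len(present):
            column_rules.append("unique")
        # All present values parse as numbers: propose a range rule.
        numbers = []
        for v in present:
            try:
                numbers.append(float(v))
            except ValueError:
                numbers = None
                break
        if numbers:
            column_rules.append(f"between({min(numbers)}, {max(numbers)})")
        rules[column] = column_rules
    return rules


# Illustrative sample data (not from the project).
SAMPLE = "id,city,score\n1,Oslo,7.5\n2,Oslo,9\n3,,8\n"
rows = list(csv.DictReader(io.StringIO(SAMPLE)))
rules = suggest_dq_rules(rows)
```

With this sample, `id` and `score` get not-null, uniqueness, and range suggestions, while `city` gets none because it has a missing value, a repeated value, and non-numeric text.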
Project Structure
Quick Start
1. Security Setup (REQUIRED)
2. Installation
3. Choose Your Interface
Fully Automated (Recommended)
API Server
LLM Integration (MCP)
Command Line
Usage Scenarios
Scenario 1: Set-and-Forget Automation
python start_auto_processor.py
Upload files to Google Drive
Files processed automatically within 30 seconds
Monitor with python processor_dashboard.py --live
Scenario 2: LLM-Powered Data Analysis
Configure MCP server in Claude Desktop
Chat: "Analyze the dataset I just uploaded"
Claude uses MCP tools to process and explain your data
Scenario 3: API Integration
python main.py
Integrate with your data pipelines via REST API
Programmatic dataset onboarding
What You Get
For each processed dataset:
Original File: Preserved in organized folder
Metadata JSON: Column info, types, statistics
Excel Contract: Professional multi-sheet contract
Quality Report: Data quality assessment
README: Human-readable summary
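As an illustration of what the Metadata JSON artifact might contain, the sketch below derives column names, inferred types, null counts, and basic statistics from CSV text. The function names, the JSON field names (`row_count`, `columns`, `null_count`, and so on), and the sample data are assumptions for this example; the server's actual output schema may differ.

```python
import csv
import io
import json
import statistics


def infer_type(values):
    """Return 'integer', 'float', or 'string' for a list of non-empty strings."""
    for caster, name in ((int, "integer"), (float, "float")):
        try:
            for v in values:
                caster(v)
            return name
        except ValueError:
            continue
    return "string"


def extract_metadata(csv_text):
    """Build a JSON-serializable metadata dict (hypothetical schema)."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    columns = []
    for name in (rows[0].keys() if rows else []):
        present = [r[name] for r in rows if r[name] not in ("", None)]
        info = {
            "name": name,
            "type": infer_type(present) if present else "unknown",
            "null_count": len(rows) - len(present),
        }
        # Basic statistics only make sense for numeric columns.
        if info["type"] in ("integer", "float"):
            nums = [float(v) for v in present]
            info.update(min=min(nums), max=max(nums), mean=statistics.mean(nums))
        columns.append(info)
    return {"row_count": len(rows), "columns": columns}


# Illustrative sample data (not from the project).
SAMPLE = "id,city,score\n1,Oslo,7.5\n2,Bergen,9\n3,,8\n"
meta = extract_metadata(SAMPLE)
meta_json = json.dumps(meta, indent=2)
```

Writing `meta_json` to a file next to the original dataset would give a reviewer the column summary without opening the raw data.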
Available Tools
FastAPI Endpoints
/tool/extract_metadata - Analyze dataset structure
/tool/apply_dq_rules - Generate quality rules
/process_dataset - Complete workflow
/health - System health check
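The endpoints above can be called from any HTTP client. The following sketch uses only the Python standard library. Several details are assumptions, since this README does not specify them: the base URL `http://localhost:8000`, the use of GET for `/health` and POST for `/process_dataset`, and the `file_id` payload field. Check the running server's API documentation for the real request shapes.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumption: adjust to your deployment


def build_request(path, payload=None, base_url=BASE_URL):
    """Build a GET request, or a JSON POST request when a payload is given."""
    url = base_url.rstrip("/") + path
    if payload is None:
        return urllib.request.Request(url, method="GET")
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def call(request, timeout=10):
    """Send the request and decode a JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Requests are only built here; nothing is sent until call() is invoked.
health_request = build_request("/health")
process_request = build_request(
    "/process_dataset",
    {"file_id": "YOUR_DRIVE_FILE_ID"},  # hypothetical payload field
)
```

With the API server running, `call(health_request)` would return the decoded health response.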
MCP Tools (for LLMs)
extract_dataset_metadata - Dataset analysis
generate_data_quality_rules - Quality assessment
process_complete_dataset - Full pipeline
list_catalog_files - Catalog browsing
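MCP clients invoke tools with JSON-RPC 2.0 messages using the `tools/call` method. An MCP client such as Claude Desktop builds these messages for you, so the sketch below is only meant to show what travels over the wire. The helper name `tool_call` is hypothetical, and the arguments each tool accepts are not documented in this README, so the example calls `list_catalog_files` with no arguments as an assumption.

```python
import itertools
import json

# JSON-RPC requests carry an id so responses can be matched to requests.
_request_ids = itertools.count(1)


def tool_call(name, arguments=None):
    """Build an MCP tools/call request as a plain dict (hypothetical helper)."""
    return {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments or {}},
    }


message = tool_call("list_catalog_files")
wire = json.dumps(message)  # the serialized form sent to the MCP server
```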
CLI Commands
dataset_manager.py list - Show processed datasets
auto_processor.py --once - Single check cycle
processor_dashboard.py --live - Real-time monitoring
Configuration
Environment Variables (.env)
Auto-Processor Settings (auto_config.py)
Check interval: 30 seconds
Supported formats: CSV, Excel
File age threshold: 1 minute
Max files per cycle: 5
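The settings above can be read as a filter applied on each check cycle. The sketch below is one possible interpretation, not the contents of auto_config.py: the class and field names, the `.xlsx`/`.xls` extensions for "Excel", the reading of the age threshold as a minimum age (presumably so that files still being uploaded are skipped), and the oldest-first ordering are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AutoConfig:
    """Hypothetical mirror of the auto-processor settings listed above."""
    check_interval_seconds: int = 30
    supported_extensions: tuple = (".csv", ".xlsx", ".xls")
    min_file_age_seconds: int = 60
    max_files_per_cycle: int = 5


def select_files(files, now, config=AutoConfig()):
    """Pick the files to process in one cycle.

    `files` is an iterable of (name, modified_timestamp) pairs and `now`
    is the current timestamp, both in seconds.
    """
    eligible = [
        (name, modified)
        for name, modified in files
        if name.lower().endswith(config.supported_extensions)
        and now - modified >= config.min_file_age_seconds
    ]
    # Assumption: process the oldest files first.
    eligible.sort(key=lambda item: item[1])
    return [name for name, _ in eligible[: config.max_files_per_cycle]]


# Illustrative input: b.xlsx is too fresh and c.txt is not a supported format.
files = [
    ("a.csv", 900), ("b.xlsx", 990), ("c.txt", 100), ("d.csv", 100),
    ("e.csv", 200), ("f.csv", 300), ("g.csv", 400), ("h.csv", 500),
]
selected = select_files(files, now=1000)
```

Here six files are eligible, so the per-cycle cap leaves `a.csv` for the next check 30 seconds later.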
Monitoring & Analytics
Docker Deployment
Troubleshooting
Common Issues
No files detected: Check Google Drive permissions
Processing errors: Verify service account access
MCP not working: Check Claude Desktop configuration
Debug Commands
Contributing
Fork the repository
Create a feature branch
Never commit sensitive data
Test your changes
Submit a pull request
Documentation
SECURITY_SETUP.md - Security configuration
AUTOMATION_GUIDE.md - Automation features
MCP_INTEGRATION_GUIDE.md - LLM integration
License
MIT License
What Makes This Special
Security First: Proper credential management
True Automation: Zero manual intervention
LLM Integration: Natural language data processing
Professional Output: Enterprise-ready documentation
Multiple Interfaces: API, CLI, MCP, Dashboard
Real-time Monitoring: Live processing status
Perfect Organization: Structured output folders
Transform your messy data files into professional, documented, quality-checked datasets automatically!