Enables automated dataset onboarding by using Google Drive folders as input sources for raw CSV/Excel files and as catalog storage for processed datasets with metadata, quality reports, and documentation
🤖 MCP Dataset Onboarding Server
A FastAPI-based MCP (Model Context Protocol) server that automates dataset onboarding, using Google Drive as both the input source and a mock catalog.
🔒 SECURITY FIRST - READ THIS BEFORE SETUP
⚠️ This repository contains template files only. You MUST configure your own credentials before use.
📖 Read SECURITY_SETUP.md for complete security instructions.
🚨 Never commit service account keys or real folder IDs to version control!
Features
- Automated Dataset Processing: Complete workflow from raw CSV/Excel files to cataloged datasets
- Google Drive Integration: Uses Google Drive folders as input source and catalog storage
- Metadata Extraction: Automatically extracts column information, data types, and basic statistics
- Data Quality Rules: Suggests DQ rules based on data characteristics
- Contract Generation: Creates Excel contracts with schema and DQ information
- Mock Catalog: Publishes processed artifacts to a catalog folder
- 🤖 Automated Processing: Watches folders and processes files automatically
- 🌐 Multiple Interfaces: FastAPI server, MCP server, CLI tools, and dashboards
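To make the metadata extraction feature concrete, here is a minimal sketch using only the Python standard library. It is not the project's implementation: the function names (`infer_type`, `extract_metadata`) and the shape of the returned dictionary are assumptions chosen for illustration, and it handles CSV text only.

```python
import csv
import io
import statistics


def infer_type(values):
    """Return 'integer', 'float', or 'string' for a list of non-empty strings."""
    for caster, name in ((int, "integer"), (float, "float")):
        try:
            for value in values:
                caster(value)
            return name
        except ValueError:
            continue
    return "string"


def extract_metadata(csv_text):
    """Summarize each column: inferred type, null count, distinct count, basic stats.

    Hypothetical helper; the output schema is an assumption, not the project's format.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    metadata = {"row_count": len(rows), "columns": {}}
    for col in reader.fieldnames or []:
        raw = [row[col] for row in rows]
        # Empty strings and missing trailing cells (None) both count as nulls.
        present = [v for v in raw if v not in ("", None)]
        col_type = infer_type(present) if present else "string"
        info = {
            "type": col_type,
            "null_count": len(raw) - len(present),
            "distinct_count": len(set(present)),
        }
        if col_type in ("integer", "float"):
            numbers = [float(v) for v in present]
            info.update(
                min=min(numbers), max=max(numbers), mean=statistics.fmean(numbers)
            )
        metadata["columns"][col] = info
    return metadata


sample = "id,name,score\n1,Ada,9.5\n2,Grace,\n3,Ada,7.5\n"
meta = extract_metadata(sample)
print(meta["columns"]["score"])
```

A real pipeline would likely rely on a dataframe library to cover Excel files and richer type inference; the sketch only shows the kind of per-column summary the feature list describes.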
Project Structure
🚀 Quick Start
1. Security Setup (REQUIRED)
2. Installation
3. Choose Your Interface
🤖 Fully Automated (Recommended)
🌐 API Server
🧠 LLM Integration (MCP)
🖥️ Command Line
🎯 Usage Scenarios
Scenario 1: Set-and-Forget Automation
python start_auto_processor.py
- Upload files to Google Drive
- Files processed automatically within 30 seconds
- Monitor with python processor_dashboard.py --live
Scenario 2: LLM-Powered Data Analysis
- Configure MCP server in Claude Desktop
- Chat: "Analyze the dataset I just uploaded"
- Claude uses MCP tools to process and explain your data
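For the first step of this scenario, Claude Desktop reads MCP server definitions from the `mcpServers` section of its `claude_desktop_config.json` file. The sketch below builds such an entry in Python. The server name `dataset-onboarding` and the script path `/absolute/path/to/mcp_server.py` are placeholders, not names confirmed by this repository; see MCP_INTEGRATION_GUIDE.md for the actual values.

```python
import json


def mcp_server_entry(script_path, python_command="python"):
    """Build one `mcpServers` entry that launches a Python script.

    Hypothetical helper: both arguments are placeholders to adapt locally.
    """
    return {"command": python_command, "args": [script_path]}


config = {
    "mcpServers": {
        # Assumed server name and script path; replace with your own.
        "dataset-onboarding": mcp_server_entry("/absolute/path/to/mcp_server.py")
    }
}
print(json.dumps(config, indent=2))
```

The printed JSON can be merged into an existing configuration file by hand. Keep credentials out of it, in line with the security notes above.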
Scenario 3: API Integration
python main.py
- Integrate with your data pipelines via REST API
- Programmatic dataset onboarding
📊 What You Get
For each processed dataset:
- 📄 Original File: Preserved in organized folder
- 📋 Metadata JSON: Column info, types, statistics
- 📊 Excel Contract: Professional multi-sheet contract
- 🔍 Quality Report: Data quality assessment
- 📖 README: Human-readable summary
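As an illustration of how a quality report could follow from the metadata JSON, the sketch below derives a few rule suggestions from per-column statistics. The metadata layout, the rule names (`not_null`, `unique`, `range`), and the function `suggest_dq_rules` are all assumptions made for this example, not the project's actual schema.

```python
import json


def suggest_dq_rules(metadata):
    """Derive simple rule suggestions from per-column metadata (hypothetical schema)."""
    rules = []
    row_count = metadata["row_count"]
    for name, info in metadata["columns"].items():
        no_nulls = info["null_count"] == 0
        if no_nulls:
            rules.append({"column": name, "rule": "not_null"})
        # A fully populated column with all-distinct values is a key candidate.
        if no_nulls and row_count > 0 and info["distinct_count"] == row_count:
            rules.append({"column": name, "rule": "unique"})
        # Numeric columns carry min/max, which suggests a range check.
        if "min" in info and "max" in info:
            rules.append(
                {"column": name, "rule": "range", "min": info["min"], "max": info["max"]}
            )
    return rules


metadata = {
    "row_count": 3,
    "columns": {
        "id": {"type": "integer", "null_count": 0, "distinct_count": 3, "min": 1, "max": 3},
        "name": {"type": "string", "null_count": 0, "distinct_count": 2},
        "score": {"type": "float", "null_count": 1, "distinct_count": 2, "min": 7.5, "max": 9.5},
    },
}
rules = suggest_dq_rules(metadata)
print(json.dumps(rules, indent=2))
```

Rules derived this way are suggestions only: a range observed in one file is not necessarily a business constraint, so they should be reviewed before being enforced.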
🛠️ Available Tools
FastAPI Endpoints
- /tool/extract_metadata - Analyze dataset structure
- /tool/apply_dq_rules - Generate quality rules
- /process_dataset - Complete workflow
- /health - System health check
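The sketch below shows how a client might prepare requests for these endpoints with the standard library. Only the paths come from the list above. The base URL `http://localhost:8000`, the HTTP methods, and the `file_id` request body are assumptions, so check the running server's interactive API docs for the real request schema.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumption: server running locally on port 8000


def build_request(path, payload=None):
    """Build (but do not send) a request: GET without a payload, JSON POST with one."""
    url = f"{BASE_URL}{path}"
    if payload is None:
        return urllib.request.Request(url, method="GET")
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url, data=body, method="POST", headers={"Content-Type": "application/json"}
    )


health = build_request("/health")
# Hypothetical body: the real field names may differ.
process = build_request("/process_dataset", {"file_id": "YOUR_FILE_ID"})

print(health.get_method(), health.full_url)
print(process.get_method(), process.full_url)
```

To send a prepared request, pass it to `urllib.request.urlopen(request, timeout=30)` and read the response body.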
MCP Tools (for LLMs)
- extract_dataset_metadata - Dataset analysis
- generate_data_quality_rules - Quality assessment
- process_complete_dataset - Full pipeline
- list_catalog_files - Catalog browsing
CLI Commands
- dataset_manager.py list - Show processed datasets
- auto_processor.py --once - Single check cycle
- processor_dashboard.py --live - Real-time monitoring
🔧 Configuration
Environment Variables (.env)
Auto-Processor Settings (auto_config.py)
- Check interval: 30 seconds
- Supported formats: CSV, Excel
- File age threshold: 1 minute
- Max files per cycle: 5
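One way to read these settings together is as a per-cycle filter: each check considers only supported files that are old enough, then caps the batch. The sketch below models that logic. The class and field names are hypothetical and not the contents of auto_config.py, and the extension list (`.csv`, `.xlsx`, `.xls`) is an assumed interpretation of "CSV, Excel".

```python
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass(frozen=True)
class AutoConfig:
    """Hypothetical mirror of the documented defaults."""

    check_interval_seconds: int = 30
    supported_extensions: tuple = (".csv", ".xlsx", ".xls")  # assumed extensions
    min_file_age_seconds: int = 60
    max_files_per_cycle: int = 5


def select_files(files, now, config=AutoConfig()):
    """Pick the files to process in one cycle.

    files: list of (name, modified_timestamp) pairs; now: current timestamp.
    Returns names of supported, sufficiently old files, oldest first, capped
    at max_files_per_cycle.
    """
    eligible = [
        (modified, name)
        for name, modified in files
        if PurePosixPath(name).suffix.lower() in config.supported_extensions
        and now - modified >= config.min_file_age_seconds
    ]
    eligible.sort()  # oldest modification time first
    return [name for _, name in eligible[: config.max_files_per_cycle]]


files = [
    ("sales.csv", 900.0),    # 100 s old: eligible
    ("fresh.csv", 990.0),    # 10 s old: too recent
    ("notes.txt", 100.0),    # unsupported format
    ("budget.XLSX", 800.0),  # 200 s old: eligible, extension check ignores case
]
print(select_files(files, now=1000.0))
```

The age threshold presumably exists so that files still being uploaded are not picked up half-written. With it in place, a new file waits at least one minute before a check cycle considers it.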
📈 Monitoring & Analytics
🐳 Docker Deployment
🔍 Troubleshooting
Common Issues
- No files detected: Check Google Drive permissions
- Processing errors: Verify service account access
- MCP not working: Check Claude Desktop configuration
Debug Commands
🤝 Contributing
- Fork the repository
- Create a feature branch
- Never commit sensitive data
- Test your changes
- Submit a pull request
📚 Documentation
- SECURITY_SETUP.md - Security configuration
- AUTOMATION_GUIDE.md - Automation features
- MCP_INTEGRATION_GUIDE.md - LLM integration
📄 License
MIT License
🎉 What Makes This Special
- 🔒 Security First: Proper credential management
- 🤖 True Automation: Zero manual intervention
- 🧠 LLM Integration: Natural language data processing
- 📊 Professional Output: Enterprise-ready documentation
- 🔧 Multiple Interfaces: API, CLI, MCP, Dashboard
- 📈 Real-time Monitoring: Live processing status
- 🗂️ Perfect Organization: Structured output folders
Transform your messy data files into professional, documented, quality-checked datasets automatically! 🚀