Enables automated dataset onboarding, using Google Drive folders both as input sources for raw CSV/Excel files and as catalog storage for processed datasets with metadata, quality reports, and documentation.
MCP Dataset Onboarding Server
A FastAPI-based MCP (Model Context Protocol) server that automates dataset onboarding, using Google Drive as both the input source and a mock catalog.
SECURITY FIRST - READ THIS BEFORE SETUP
This repository contains template files only. You MUST configure your own credentials before use.
Read
Never commit service account keys or real folder IDs to version control!
Features
Automated Dataset Processing: Complete workflow from raw CSV/Excel files to cataloged datasets
Google Drive Integration: Uses Google Drive folders as input source and catalog storage
Metadata Extraction: Automatically extracts column information, data types, and basic statistics
Data Quality Rules: Suggests DQ rules based on data characteristics
Contract Generation: Creates Excel contracts with schema and DQ information
Mock Catalog: Publishes processed artifacts to a catalog folder
Automated Processing: Watches folders and processes files automatically
Multiple Interfaces: FastAPI server, MCP server, CLI tools, and dashboards
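To make the Metadata Extraction and Data Quality Rules features more concrete, here is a minimal sketch using only the Python standard library. It is not the project's implementation: the function names (`infer_type`, `extract_metadata`, `suggest_dq_rules`), the rule wording, and the shape of the metadata dictionary are all assumptions made for illustration.

```python
import csv
import io
from statistics import mean


def infer_type(values):
    """Return 'integer', 'float', or 'string' for a list of non-empty strings."""
    for caster, name in ((int, "integer"), (float, "float")):
        try:
            for value in values:
                caster(value)
            return name
        except ValueError:
            continue
    return "string"


def extract_metadata(csv_text):
    """Collect per-column type, null count, distinct count, and numeric stats."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    columns = {}
    for name in (rows[0].keys() if rows else []):
        raw = [row[name] for row in rows]
        present = [v for v in raw if v not in ("", None)]
        col_type = infer_type(present) if present else "string"
        info = {
            "type": col_type,
            "null_count": len(raw) - len(present),
            "distinct_count": len(set(present)),
        }
        if col_type in ("integer", "float"):
            numbers = [float(v) for v in present]
            info.update(min=min(numbers), max=max(numbers), mean=mean(numbers))
        columns[name] = info
    return {"row_count": len(rows), "columns": columns}


def suggest_dq_rules(metadata):
    """Suggest simple rules (hypothetical wording) from observed characteristics."""
    rules = []
    for name, info in metadata["columns"].items():
        if info["null_count"] == 0:
            rules.append(f"{name}: not_null")
        if info["distinct_count"] == metadata["row_count"]:
            rules.append(f"{name}: unique")
        if "min" in info:
            rules.append(f"{name}: between {info['min']} and {info['max']}")
    return rules


sample = "id,name,score\n1,Ada,9.5\n2,Grace,\n3,Linus,7.0\n"
metadata = extract_metadata(sample)
rules = suggest_dq_rules(metadata)
print(rules)
```

In this sample, `score` has one missing value, so no `not_null` rule is suggested for it, while `id` and `name` are flagged as both complete and unique. Rules derived this way are only suggestions and should be reviewed before they are enforced.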
Project Structure
Quick Start
1. Security Setup (REQUIRED)
2. Installation
3. Choose Your Interface
Fully Automated (Recommended)
API Server
LLM Integration (MCP)
Command Line
Usage Scenarios
Scenario 1: Set-and-Forget Automation
python start_auto_processor.py
Upload files to Google Drive
Files processed automatically within 30 seconds
Monitor with python processor_dashboard.py --live
Scenario 2: LLM-Powered Data Analysis
Configure MCP server in Claude Desktop
Chat: "Analyze the dataset I just uploaded"
Claude uses MCP tools to process and explain your data
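For the first step of Scenario 2, Claude Desktop reads MCP servers from the `mcpServers` section of its `claude_desktop_config.json` file. The sketch below generates such an entry. The server name `dataset-onboarding`, the script path `mcp_server.py`, and the `GOOGLE_APPLICATION_CREDENTIALS` variable are placeholders assumed for illustration; substitute the values from your own setup.

```python
import json


def build_claude_config(server_script, python_cmd="python", env=None):
    """Build the `mcpServers` section for claude_desktop_config.json."""
    entry = {"command": python_cmd, "args": [server_script]}
    if env:
        entry["env"] = dict(env)
    # "dataset-onboarding" is an arbitrary label chosen for this example.
    return {"mcpServers": {"dataset-onboarding": entry}}


config = build_claude_config(
    "/path/to/repo/mcp_server.py",  # hypothetical script path
    env={"GOOGLE_APPLICATION_CREDENTIALS": "/path/to/service-account.json"},
)
config_json = json.dumps(config, indent=2)
print(config_json)
```

Paste the printed JSON into the Claude Desktop configuration file and restart the application. Keep the service account key outside the repository so it cannot be committed by accident.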
Scenario 3: API Integration
python main.py
Integrate with your data pipelines via REST API
Programmatic dataset onboarding
What You Get
For each processed dataset:
Original File: Preserved in organized folder
Metadata JSON: Column info, types, statistics
Excel Contract: Professional multi-sheet contract
Quality Report: Data quality assessment
README: Human-readable summary
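As a rough illustration of the per-dataset output, the sketch below writes two of the artifacts listed above, the metadata JSON and a human-readable summary, into a folder named after the dataset. It writes to a local temporary directory rather than Google Drive, and the file names, folder layout, and metadata shape are assumptions, not the project's actual conventions.

```python
import json
import tempfile
from pathlib import Path


def write_artifacts(catalog_dir, dataset_name, metadata):
    """Write a metadata JSON file and a plain-text README into a per-dataset folder."""
    dataset_dir = Path(catalog_dir) / dataset_name
    dataset_dir.mkdir(parents=True, exist_ok=True)

    # Hypothetical file names; the real project may name its artifacts differently.
    metadata_path = dataset_dir / f"{dataset_name}_metadata.json"
    metadata_path.write_text(json.dumps(metadata, indent=2), encoding="utf-8")

    lines = [f"Dataset: {dataset_name}", f"Rows: {metadata['row_count']}", "Columns:"]
    lines += [f"- {name} ({info['type']})" for name, info in metadata["columns"].items()]
    readme_path = dataset_dir / "README.txt"
    readme_path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return metadata_path, readme_path


example_metadata = {
    "row_count": 3,
    "columns": {"id": {"type": "integer"}, "name": {"type": "string"}},
}
with tempfile.TemporaryDirectory() as catalog:
    metadata_file, readme_file = write_artifacts(catalog, "customers", example_metadata)
    readme_text = readme_file.read_text(encoding="utf-8")
    saved_metadata = json.loads(metadata_file.read_text(encoding="utf-8"))

print(readme_text)
```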
Available Tools
FastAPI Endpoints
/tool/extract_metadata - Analyze dataset structure
/tool/apply_dq_rules - Generate quality rules
/process_dataset - Complete workflow
/health - System health check
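A pipeline can call these endpoints with any HTTP client. The sketch below uses only the standard library and makes several assumptions that are not confirmed by this README: the server listens on `http://localhost:8000`, `/health` answers GET requests, `/process_dataset` accepts a JSON POST body, and that body carries a `file_id` field. Check the running server's interactive API docs for the actual request schema.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumption: local server on port 8000


def build_request(path, payload=None, base_url=BASE_URL):
    """Build a GET request, or a JSON POST request when a payload is given."""
    url = base_url.rstrip("/") + path
    if payload is None:
        return urllib.request.Request(url, method="GET")
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


def call(request, timeout=10):
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


health_request = build_request("/health")
# "file_id" is a hypothetical field name used only for illustration.
process_request = build_request("/process_dataset", {"file_id": "YOUR_DRIVE_FILE_ID"})

# With the server running, send the requests like this:
# print(call(health_request))
# print(call(process_request))
```

Building the request separately from sending it keeps the example testable without a live server and makes it easy to add retries or authentication headers later.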
MCP Tools (for LLMs)
extract_dataset_metadata - Dataset analysis
generate_data_quality_rules - Quality assessment
process_complete_dataset - Full pipeline
list_catalog_files - Catalog browsing
CLI Commands
dataset_manager.py list - Show processed datasets
auto_processor.py --once - Single check cycle
processor_dashboard.py --live - Real-time monitoring
Configuration
Environment Variables (.env)
Auto-Processor Settings (auto_config.py)
Check interval: 30 seconds
Supported formats: CSV, Excel
File age threshold: 1 minute
Max files per cycle: 5
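The settings above can be read as a simple polling policy. The following sketch shows one way they might interact; it is not the contents of `auto_config.py`. The class and function names are hypothetical, the Excel extensions `.xlsx` and `.xls` are assumed, and the file age threshold is interpreted as a minimum age (so files that are still uploading are skipped), which the README does not state explicitly.

```python
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class AutoConfig:
    check_interval_seconds: int = 30
    supported_extensions: tuple = (".csv", ".xlsx", ".xls")  # assumed extensions
    min_file_age_seconds: int = 60  # assumed meaning of "file age threshold"
    max_files_per_cycle: int = 5


def select_files(files, config, now):
    """Pick files for one cycle from (name, modified_timestamp) pairs."""
    eligible = [
        (name, modified)
        for name, modified in files
        if name.lower().endswith(config.supported_extensions)
        and now - modified >= config.min_file_age_seconds
    ]
    eligible.sort(key=lambda item: item[1])  # oldest files first
    return [name for name, _ in eligible[: config.max_files_per_cycle]]


def run(list_files, process, config, once=False):
    """Poll for files; with once=True, perform a single check cycle and return."""
    while True:
        for name in select_files(list_files(), config, time.time()):
            process(name)
        if once:
            break
        time.sleep(config.check_interval_seconds)


config = AutoConfig()
listing = [("sales.csv", 900.0), ("notes.txt", 100.0), ("new.xlsx", 990.0)]
print(select_files(listing, config, now=1000.0))
```

Here `new.xlsx` is skipped because it was modified only 10 seconds before `now`, and `notes.txt` is skipped because its format is unsupported. Under this reading, a file uploaded just now would be picked up on a later cycle once it passes the age threshold.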
Monitoring & Analytics
Docker Deployment
Troubleshooting
Common Issues
No files detected: Check Google Drive permissions
Processing errors: Verify service account access
MCP not working: Check Claude Desktop configuration
Debug Commands
Contributing
Fork the repository
Create a feature branch
Never commit sensitive data
Test your changes
Submit a pull request
Documentation
SECURITY_SETUP.md - Security configuration
AUTOMATION_GUIDE.md - Automation features
MCP_INTEGRATION_GUIDE.md - LLM integration
License
MIT License
What Makes This Special
Security First: Proper credential management
True Automation: Zero manual intervention
LLM Integration: Natural language data processing
Professional Output: Enterprise-ready documentation
Multiple Interfaces: API, CLI, MCP, Dashboard
Real-time Monitoring: Live processing status
Perfect Organization: Structured output folders
Transform your messy data files into professional, documented, quality-checked datasets automatically!