🤖 MCP Dataset Onboarding Server
A FastAPI-based MCP (Model Context Protocol) server for automating dataset onboarding, using Google Drive as both the input source and a mock catalog.
🔒 SECURITY FIRST - READ THIS BEFORE SETUP
⚠️ This repository contains template files only. You MUST configure your own credentials before use.
📖 Read SECURITY_SETUP.md before doing anything else.
🚨 Never commit service account keys or real folder IDs to version control!
Features
Automated Dataset Processing: Complete workflow from raw CSV/Excel files to cataloged datasets
Google Drive Integration: Uses Google Drive folders as input source and catalog storage
Metadata Extraction: Automatically extracts column information, data types, and basic statistics
Data Quality Rules: Suggests DQ rules based on data characteristics
Contract Generation: Creates Excel contracts with schema and DQ information
Mock Catalog: Publishes processed artifacts to a catalog folder
🤖 Automated Processing: Watches folders and processes files automatically
🔌 Multiple Interfaces: FastAPI server, MCP server, CLI tools, and dashboards
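The metadata-extraction step can be sketched roughly as follows. This is a minimal, stdlib-only illustration; the actual logic in utils.py / dataset_processor.py may differ, and the output field names here are assumptions:

```python
import csv
import io

def extract_metadata(csv_text: str) -> dict:
    """Sketch: column names, inferred types, and basic per-column stats."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if not rows:
        return {"row_count": 0, "columns": {}}
    columns = {}
    for name in rows[0].keys():
        # Treat empty strings as nulls
        values = [r[name] for r in rows if r[name] not in ("", None)]
        numeric = []
        for v in values:
            try:
                numeric.append(float(v))
            except ValueError:
                break  # first non-numeric value: fall back to string type
        if values and len(numeric) == len(values):
            columns[name] = {
                "type": "numeric",
                "min": min(numeric),
                "max": max(numeric),
                "null_count": len(rows) - len(values),
            }
        else:
            columns[name] = {
                "type": "string",
                "distinct": len(set(values)),
                "null_count": len(rows) - len(values),
            }
    return {"row_count": len(rows), "columns": columns}
```

In the real server this dictionary would be what lands in the `[dataset]_metadata.json` artifact.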
Project Structure
├── main.py                  # FastAPI server and endpoints
├── mcp_server.py            # True MCP protocol server for LLM integration
├── utils.py                 # Google Drive helpers and DQ functions
├── dataset_processor.py     # Centralized dataset processing logic
├── auto_processor.py        # 🤖 Automated file monitoring
├── start_auto_processor.py  # 🚀 Easy startup for auto-processor
├── processor_dashboard.py   # 📊 Monitoring dashboard
├── dataset_manager.py       # CLI tool for managing datasets
├── local_test.py            # Local processing script
├── auto_config.py           # ⚙️ Configuration management
├── requirements.txt         # Python dependencies
├── Dockerfile               # Container configuration
├── .env.template            # Environment variables template
├── .gitignore               # Security: excludes sensitive files
├── SECURITY_SETUP.md        # 🔒 Security configuration guide
├── processed_datasets/      # Organized output folder
│   └── [dataset_name]/      # Individual dataset folders
│       ├── [dataset].csv    # Original dataset
│       ├── [dataset]_metadata.json
│       ├── [dataset]_contract.xlsx
│       ├── [dataset]_dq_report.json
│       └── README.md        # Dataset summary
└── README.md                # This file

🚀 Quick Start
1. Security Setup (REQUIRED)
# 1. Read the security guide
cat SECURITY_SETUP.md
# 2. Set up your Google service account (outside this repo)
# 3. Configure your environment variables
cp .env.template .env
# Edit .env with your actual values
# 4. Verify no sensitive files will be committed
git status

2. Installation
# Install dependencies
pip install -r requirements.txt
# Test the setup
python local_test.py

3. Choose Your Interface
🤖 Fully Automated (Recommended)
# Start auto-processor - upload files and walk away!
python start_auto_processor.py

🌐 API Server
# Start FastAPI server
python main.py

🧠 LLM Integration (MCP)
# Start MCP server for Claude Desktop, etc.
python mcp_server.py

🖥️ Command Line
# Manual dataset management
python dataset_manager.py list
python dataset_manager.py process YOUR_FILE_ID

🎯 Usage Scenarios
Scenario 1: Set-and-Forget Automation
python start_auto_processor.py

Upload files to Google Drive
Files processed automatically within 30 seconds
Monitor with
python processor_dashboard.py --live
Scenario 2: LLM-Powered Data Analysis
Configure MCP server in Claude Desktop
Chat: "Analyze the dataset I just uploaded"
Claude uses MCP tools to process and explain your data
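For Scenario 2, the Claude Desktop registration might look like the sketch below, which goes in the `mcpServers` section of `claude_desktop_config.json`. The server name, paths, and folder IDs are placeholders for your own setup:

```json
{
  "mcpServers": {
    "dataset-onboarding": {
      "command": "python",
      "args": ["/path/to/mcp_server.py"],
      "env": {
        "GOOGLE_SERVICE_ACCOUNT_KEY_PATH": "/secure/path/to/key.json",
        "MCP_SERVER_FOLDER_ID": "your_input_folder_id",
        "MCP_CLIENT_FOLDER_ID": "your_output_folder_id"
      }
    }
  }
}
```

Restart Claude Desktop after editing the config so the new server is picked up.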
Scenario 3: API Integration
python main.py

Integrate with your data pipelines via REST API
Programmatic dataset onboarding
📦 What You Get
For each processed dataset:
📁 Original File: Preserved in organized folder
📋 Metadata JSON: Column info, types, statistics
📊 Excel Contract: Professional multi-sheet contract
✅ Quality Report: Data quality assessment
📖 README: Human-readable summary
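The quality report rests on suggested DQ rules. A minimal sketch of how rules could be derived from column statistics (the stat and rule names here are illustrative, not the server's actual schema):

```python
def suggest_dq_rules(column_stats: dict) -> list:
    """Sketch: propose DQ rules from per-column statistics."""
    rules = []
    for name, stats in column_stats.items():
        # No observed nulls: propose a not-null constraint
        if stats.get("null_count", 0) == 0:
            rules.append({"column": name, "rule": "not_null"})
        if stats.get("type") == "numeric":
            # Bound future values by the observed range
            rules.append({"column": name, "rule": "range",
                          "min": stats["min"], "max": stats["max"]})
        elif stats.get("distinct") == stats.get("non_null_count"):
            # Every observed value distinct: candidate uniqueness rule
            rules.append({"column": name, "rule": "unique"})
    return rules
```

Suggestions like these would feed both the `[dataset]_dq_report.json` artifact and the DQ sheet of the Excel contract.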
🛠️ Available Tools
FastAPI Endpoints
/tool/extract_metadata - Analyze dataset structure
/tool/apply_dq_rules - Generate quality rules
/process_dataset - Complete workflow
/health - System health check
MCP Tools (for LLMs)
extract_dataset_metadata - Dataset analysis
generate_data_quality_rules - Quality assessment
process_complete_dataset - Full pipeline
list_catalog_files - Catalog browsing
CLI Commands
dataset_manager.py list - Show processed datasets
auto_processor.py --once - Single check cycle
processor_dashboard.py --live - Real-time monitoring
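The dataset_manager.py CLI surface above could be wired up with argparse along these lines (a sketch; the actual argument names in the repo may differ):

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """Sketch of the dataset_manager.py command-line interface."""
    parser = argparse.ArgumentParser(prog="dataset_manager.py")
    sub = parser.add_subparsers(dest="command", required=True)
    # `list`: no arguments, just shows processed datasets
    sub.add_parser("list", help="Show processed datasets")
    # `process`: takes the Google Drive file ID to onboard
    proc = sub.add_parser("process", help="Process one Drive file")
    proc.add_argument("file_id", help="Google Drive file ID")
    return parser
```

For example, `build_parser().parse_args(["process", "abc123"])` yields a namespace with `command="process"` and `file_id="abc123"`, which a dispatcher can route to the processing pipeline.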
🔧 Configuration
Environment Variables (.env)
GOOGLE_SERVICE_ACCOUNT_KEY_PATH=path/to/your/key.json
MCP_SERVER_FOLDER_ID=your_input_folder_id
MCP_CLIENT_FOLDER_ID=your_output_folder_id

Auto-Processor Settings (auto_config.py)
Check interval: 30 seconds
Supported formats: CSV, Excel
File age threshold: 1 minute
Max files per cycle: 5
📊 Monitoring & Analytics
# Current status
python processor_dashboard.py
# Live monitoring (auto-refresh)
python processor_dashboard.py --live
# Detailed statistics
python processor_dashboard.py --stats
# Processing history
python auto_processor.py --list

🐳 Docker Deployment
# Build
docker build -t mcp-dataset-server .
# Run (mount your service account key securely)
docker run -p 8000:8000 \
-v /secure/path/to/key.json:/app/keys/key.json \
-e GOOGLE_SERVICE_ACCOUNT_KEY_PATH=/app/keys/key.json \
-e MCP_SERVER_FOLDER_ID=your_folder_id \
mcp-dataset-server

🔍 Troubleshooting
Common Issues
No files detected: Check Google Drive permissions
Processing errors: Verify service account access
MCP not working: Check Claude Desktop configuration
Debug Commands
# Test Google Drive connection
python -c "from utils import get_drive_service; print('✅ Connected')"
# Check auto-processor status
python auto_processor.py --once
# Verify MCP server
python test_mcp_server.py

🤝 Contributing
Fork the repository
Create a feature branch
Never commit sensitive data
Test your changes
Submit a pull request
📚 Documentation
SECURITY_SETUP.md - Security configuration
AUTOMATION_GUIDE.md - Automation features
MCP_INTEGRATION_GUIDE.md - LLM integration
📄 License
MIT License
🌟 What Makes This Special
🔒 Security First: Proper credential management
🤖 True Automation: Zero manual intervention
🧠 LLM Integration: Natural language data processing
📊 Professional Output: Enterprise-ready documentation
🔧 Multiple Interfaces: API, CLI, MCP, Dashboard
📈 Real-time Monitoring: Live processing status
🗂️ Perfect Organization: Structured output folders
Transform your messy data files into professional, documented, quality-checked datasets automatically! 🚀