🤖 MCP Dataset Onboarding Server

by Magenta91

A FastAPI-based MCP (Model Context Protocol) server for automating dataset onboarding, using Google Drive as both the input source and a mock catalog.

🔒 SECURITY FIRST - READ THIS BEFORE SETUP

⚠️ This repository contains template files only. You MUST configure your own credentials before use.

📖 Read SECURITY_SETUP.md for complete setup instructions.

🚨 Never commit service account keys or real folder IDs to version control!

Features

  • Automated Dataset Processing: Complete workflow from raw CSV/Excel files to cataloged datasets

  • Google Drive Integration: Uses Google Drive folders as input source and catalog storage

  • Metadata Extraction: Automatically extracts column information, data types, and basic statistics

  • Data Quality Rules: Suggests DQ rules based on data characteristics

  • Contract Generation: Creates Excel contracts with schema and DQ information

  • Mock Catalog: Publishes processed artifacts to a catalog folder

  • 🤖 Automated Processing: Watches folders and processes files automatically

  • 🌐 Multiple Interfaces: FastAPI server, MCP server, CLI tools, and dashboards
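To make the metadata-extraction feature concrete, here is a simplified, standard-library-only sketch of what extracting column names, inferred types, and basic statistics from a CSV can look like. The function name and output shape are illustrative assumptions; the server's actual logic lives in utils.py and dataset_processor.py.

```python
import csv
import io

def extract_metadata(csv_text: str) -> dict:
    """Return column names, naively inferred types, and a row count for a CSV.

    A stand-in for the server's metadata extraction step, not its real code.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    columns = reader.fieldnames or []

    def infer_type(values):
        # Treat a column as numeric only if every non-empty value parses as a float.
        non_empty = [v for v in values if v != ""]
        if not non_empty:
            return "empty"
        try:
            for v in non_empty:
                float(v)
            return "numeric"
        except ValueError:
            return "string"

    return {
        "row_count": len(rows),
        "columns": {col: infer_type([r[col] for r in rows]) for col in columns},
    }

sample = "id,name,score\n1,alice,9.5\n2,bob,8.0\n"
meta = extract_metadata(sample)
```

The real server builds richer statistics (null counts, ranges, etc.) on top of this kind of schema pass before suggesting DQ rules.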

Project Structure

```
├── main.py                  # FastAPI server and endpoints
├── mcp_server.py            # True MCP protocol server for LLM integration
├── utils.py                 # Google Drive helpers and DQ functions
├── dataset_processor.py     # Centralized dataset processing logic
├── auto_processor.py        # 🤖 Automated file monitoring
├── start_auto_processor.py  # 🚀 Easy startup for auto-processor
├── processor_dashboard.py   # 📊 Monitoring dashboard
├── dataset_manager.py       # CLI tool for managing datasets
├── local_test.py            # Local processing script
├── auto_config.py           # ⚙️ Configuration management
├── requirements.txt         # Python dependencies
├── Dockerfile               # Container configuration
├── .env.template            # Environment variables template
├── .gitignore               # Security: excludes sensitive files
├── SECURITY_SETUP.md        # 🔒 Security configuration guide
├── processed_datasets/      # Organized output folder
│   └── [dataset_name]/      # Individual dataset folders
│       ├── [dataset].csv    # Original dataset
│       ├── [dataset]_metadata.json
│       ├── [dataset]_contract.xlsx
│       ├── [dataset]_dq_report.json
│       └── README.md        # Dataset summary
└── README.md                # This file
```

🚀 Quick Start

1. Security Setup (REQUIRED)

```shell
# 1. Read the security guide
cat SECURITY_SETUP.md

# 2. Set up your Google service account (outside this repo)

# 3. Configure your environment variables
cp .env.template .env
# Edit .env with your actual values

# 4. Verify no sensitive files will be committed
git status
```

2. Installation

```shell
# Install dependencies
pip install -r requirements.txt

# Test the setup
python local_test.py
```

3. Choose Your Interface

🤖 Fully Automated (Recommended)

```shell
# Start auto-processor - upload files and walk away!
python start_auto_processor.py
```

🌐 API Server

```shell
# Start FastAPI server
python main.py
```

🧠 LLM Integration (MCP)

```shell
# Start MCP server for Claude Desktop, etc.
python mcp_server.py
```
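To register the server with Claude Desktop, an entry along these lines goes in `claude_desktop_config.json` (the server name, paths, and key location here are placeholders; consult Claude Desktop's MCP documentation for the config file's exact location on your platform):

```json
{
  "mcpServers": {
    "dataset-onboarding": {
      "command": "python",
      "args": ["/path/to/MCP/mcp_server.py"],
      "env": {
        "GOOGLE_SERVICE_ACCOUNT_KEY_PATH": "/secure/path/to/key.json"
      }
    }
  }
}
```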

🖥️ Command Line

```shell
# Manual dataset management
python dataset_manager.py list
python dataset_manager.py process YOUR_FILE_ID
```

🎯 Usage Scenarios

Scenario 1: Set-and-Forget Automation

  1. python start_auto_processor.py

  2. Upload files to Google Drive

  3. Files processed automatically within 30 seconds

  4. Monitor with python processor_dashboard.py --live
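The watcher's per-cycle selection logic presumably resembles the sketch below: skip already-processed files, skip unsupported formats, wait until a file is old enough to be fully uploaded, and cap the batch size. All names and constants here are hypothetical stand-ins mirroring the auto-processor settings listed later; the real logic and values live in auto_processor.py and auto_config.py.

```python
import time

# Hypothetical defaults; the actual values are configured in auto_config.py.
MIN_FILE_AGE_S = 60        # skip files that may still be uploading
MAX_FILES_PER_CYCLE = 5
SUPPORTED_EXTENSIONS = (".csv", ".xlsx", ".xls")

def select_files(listing, processed_ids, now=None):
    """Pick which Drive files to process this cycle.

    `listing` is a list of dicts like {"id": ..., "name": ..., "modified_ts": ...},
    a stand-in for a Drive API file listing.
    """
    now = now if now is not None else time.time()
    candidates = [
        f for f in listing
        if f["id"] not in processed_ids
        and f["name"].lower().endswith(SUPPORTED_EXTENSIONS)
        and now - f["modified_ts"] >= MIN_FILE_AGE_S  # old enough to be complete
    ]
    return candidates[:MAX_FILES_PER_CYCLE]

files = [
    {"id": "a", "name": "sales.csv", "modified_ts": 0},
    {"id": "b", "name": "notes.txt", "modified_ts": 0},    # unsupported format
    {"id": "c", "name": "fresh.csv", "modified_ts": 990},  # too new
]
picked = select_files(files, processed_ids=set(), now=1000)
```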

Scenario 2: LLM-Powered Data Analysis

  1. Configure MCP server in Claude Desktop

  2. Chat: "Analyze the dataset I just uploaded"

  3. Claude uses MCP tools to process and explain your data

Scenario 3: API Integration

  1. python main.py

  2. Integrate with your data pipelines via REST API

  3. Programmatic dataset onboarding

📊 What You Get

For each processed dataset:

  • 📄 Original File: Preserved in organized folder

  • 📋 Metadata JSON: Column info, types, statistics

  • 📊 Excel Contract: Professional multi-sheet contract

  • 🔍 Quality Report: Data quality assessment

  • 📖 README: Human-readable summary

🛠️ Available Tools

FastAPI Endpoints

  • /tool/extract_metadata - Analyze dataset structure

  • /tool/apply_dq_rules - Generate quality rules

  • /process_dataset - Complete workflow

  • /health - System health check
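The endpoints above can be driven from any HTTP client. Below is a minimal standard-library sketch of calling the server from Python; the base URL assumes the default local port 8000, and the `{"file_id": ...}` payload shape is an assumption — check main.py for the exact request schema.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumed default; see main.py for the actual port

def endpoint_url(path: str) -> str:
    """Build a full URL for one of the server's endpoints."""
    return f"{BASE_URL}/{path.lstrip('/')}"

def post_json(path: str, payload: dict) -> dict:
    """POST a JSON payload and decode the JSON response."""
    req = urllib.request.Request(
        endpoint_url(path),
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

if __name__ == "__main__":
    # Requires the FastAPI server to be running locally.
    print(post_json("/process_dataset", {"file_id": "YOUR_FILE_ID"}))
```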

MCP Tools (for LLMs)

  • extract_dataset_metadata - Dataset analysis

  • generate_data_quality_rules - Quality assessment

  • process_complete_dataset - Full pipeline

  • list_catalog_files - Catalog browsing
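Conceptually, an MCP server exposes each of these tools by name and dispatches incoming tool calls to a handler. The stripped-down sketch below shows only that dispatch pattern with stub handlers; it is not the actual mcp_server.py implementation, whose handlers wrap the Drive and DQ logic in utils.py.

```python
# Stub handlers standing in for the real tool implementations.
def extract_dataset_metadata(file_id: str) -> dict:
    return {"tool": "extract_dataset_metadata", "file_id": file_id}

def list_catalog_files() -> dict:
    return {"tool": "list_catalog_files", "files": []}

# Name-to-handler registry: the core pattern behind MCP tool calls.
TOOLS = {
    "extract_dataset_metadata": extract_dataset_metadata,
    "list_catalog_files": list_catalog_files,
}

def call_tool(name: str, **kwargs) -> dict:
    """Route a tool call by name, as an MCP server does for LLM requests."""
    if name not in TOOLS:
        raise ValueError(f"Unknown tool: {name}")
    return TOOLS[name](**kwargs)

result = call_tool("extract_dataset_metadata", file_id="abc123")
```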

CLI Commands

  • dataset_manager.py list - Show processed datasets

  • auto_processor.py --once - Single check cycle

  • processor_dashboard.py --live - Real-time monitoring

🔧 Configuration

Environment Variables (.env)

```shell
GOOGLE_SERVICE_ACCOUNT_KEY_PATH=path/to/your/key.json
MCP_SERVER_FOLDER_ID=your_input_folder_id
MCP_CLIENT_FOLDER_ID=your_output_folder_id
```
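A sketch of loading and validating these variables with only the standard library (the repo may instead use python-dotenv to read .env; the function name here is illustrative):

```python
import os

def load_settings() -> dict:
    """Read the server's settings from the environment,
    failing fast when a required variable is missing."""
    required = [
        "GOOGLE_SERVICE_ACCOUNT_KEY_PATH",
        "MCP_SERVER_FOLDER_ID",
        "MCP_CLIENT_FOLDER_ID",
    ]
    missing = [name for name in required if not os.environ.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {name: os.environ[name] for name in required}
```

Failing fast at startup makes misconfiguration obvious before any Drive calls are attempted.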

Auto-Processor Settings (auto_config.py)

  • Check interval: 30 seconds

  • Supported formats: CSV, Excel

  • File age threshold: 1 minute

  • Max files per cycle: 5

📈 Monitoring & Analytics

```shell
# Current status
python processor_dashboard.py

# Live monitoring (auto-refresh)
python processor_dashboard.py --live

# Detailed statistics
python processor_dashboard.py --stats

# Processing history
python auto_processor.py --list
```

🐳 Docker Deployment

```shell
# Build
docker build -t mcp-dataset-server .

# Run (mount your service account key securely)
docker run -p 8000:8000 \
  -v /secure/path/to/key.json:/app/keys/key.json \
  -e GOOGLE_SERVICE_ACCOUNT_KEY_PATH=/app/keys/key.json \
  -e MCP_SERVER_FOLDER_ID=your_folder_id \
  mcp-dataset-server
```
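For repeatable deployments, the same run command can be expressed as a Compose file. This is a hypothetical docker-compose.yml, not one shipped with the repo; service name and host paths are placeholders:

```yaml
services:
  mcp-dataset-server:
    build: .
    ports:
      - "8000:8000"
    environment:
      GOOGLE_SERVICE_ACCOUNT_KEY_PATH: /app/keys/key.json
      MCP_SERVER_FOLDER_ID: your_input_folder_id
      MCP_CLIENT_FOLDER_ID: your_output_folder_id
    volumes:
      - /secure/path/to/key.json:/app/keys/key.json:ro
```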

πŸ” Troubleshooting

Common Issues

  • No files detected: Check Google Drive permissions

  • Processing errors: Verify service account access

  • MCP not working: Check Claude Desktop configuration

Debug Commands

```shell
# Test Google Drive connection
python -c "from utils import get_drive_service; print('✅ Connected')"

# Check auto-processor status
python auto_processor.py --once

# Verify MCP server
python test_mcp_server.py
```

🤝 Contributing

  1. Fork the repository

  2. Create a feature branch

  3. Never commit sensitive data

  4. Test your changes

  5. Submit a pull request

📚 Documentation

📄 License

MIT License

🎉 What Makes This Special

  • 🔒 Security First: Proper credential management

  • 🤖 True Automation: Zero manual intervention

  • 🧠 LLM Integration: Natural language data processing

  • 📊 Professional Output: Enterprise-ready documentation

  • 🔧 Multiple Interfaces: API, CLI, MCP, Dashboard

  • 📈 Real-time Monitoring: Live processing status

  • 🗂️ Perfect Organization: Structured output folders

Transform your messy data files into professional, documented, quality-checked datasets automatically! 🚀
