MCP Dataset Onboarding Server

by Magenta91
# 🤖 MCP Dataset Onboarding Server

A FastAPI-based MCP (Model Context Protocol) server for automating dataset onboarding, using Google Drive as both the input source and a mock catalog.

## 🔒 SECURITY FIRST - READ THIS BEFORE SETUP

⚠️ **This repository contains template files only. You MUST configure your own credentials before use.**

📖 **Read [SECURITY_SETUP.md](SECURITY_SETUP.md) for complete security instructions.**

🚨 **Never commit service account keys or real folder IDs to version control!**

## Features

- **Automated Dataset Processing**: Complete workflow from raw CSV/Excel files to cataloged datasets
- **Google Drive Integration**: Uses Google Drive folders as the input source and catalog storage
- **Metadata Extraction**: Automatically extracts column information, data types, and basic statistics
- **Data Quality Rules**: Suggests DQ rules based on data characteristics
- **Contract Generation**: Creates Excel contracts with schema and DQ information
- **Mock Catalog**: Publishes processed artifacts to a catalog folder
- **🤖 Automated Processing**: Watches folders and processes files automatically
- **🌐 Multiple Interfaces**: FastAPI server, MCP server, CLI tools, and dashboards

## Project Structure

```
├── main.py                  # FastAPI server and endpoints
├── mcp_server.py            # True MCP protocol server for LLM integration
├── utils.py                 # Google Drive helpers and DQ functions
├── dataset_processor.py     # Centralized dataset processing logic
├── auto_processor.py        # 🤖 Automated file monitoring
├── start_auto_processor.py  # 🚀 Easy startup for auto-processor
├── processor_dashboard.py   # 📊 Monitoring dashboard
├── dataset_manager.py       # CLI tool for managing datasets
├── local_test.py            # Local processing script
├── auto_config.py           # ⚙️ Configuration management
├── requirements.txt         # Python dependencies
├── Dockerfile               # Container configuration
├── .env.template            # Environment variables template
├── .gitignore               # Security: excludes sensitive files
├── SECURITY_SETUP.md        # 🔒 Security configuration guide
├── processed_datasets/      # Organized output folder
│   └── [dataset_name]/      # Individual dataset folders
│       ├── [dataset].csv    # Original dataset
│       ├── [dataset]_metadata.json
│       ├── [dataset]_contract.xlsx
│       ├── [dataset]_dq_report.json
│       └── README.md        # Dataset summary
└── README.md                # This file
```

## 🚀 Quick Start

### 1. Security Setup (REQUIRED)

```bash
# 1. Read the security guide
cat SECURITY_SETUP.md

# 2. Set up your Google service account (outside this repo)

# 3. Configure your environment variables
cp .env.template .env
# Edit .env with your actual values

# 4. Verify no sensitive files will be committed
git status
```

### 2. Installation

```bash
# Install dependencies
pip install -r requirements.txt

# Test the setup
python local_test.py
```

### 3. Choose Your Interface

#### 🤖 Fully Automated (Recommended)

```bash
# Start the auto-processor - upload files and walk away!
python start_auto_processor.py
```

#### 🌐 API Server

```bash
# Start the FastAPI server
python main.py
```

#### 🧠 LLM Integration (MCP)

```bash
# Start the MCP server for Claude Desktop, etc.
python mcp_server.py
```

#### 🖥️ Command Line

```bash
# Manual dataset management
python dataset_manager.py list
python dataset_manager.py process YOUR_FILE_ID
```

## 🎯 Usage Scenarios

### Scenario 1: Set-and-Forget Automation

1. `python start_auto_processor.py`
2. Upload files to Google Drive
3. Files are processed automatically within 30 seconds
4. Monitor with `python processor_dashboard.py --live`

### Scenario 2: LLM-Powered Data Analysis

1. Configure the MCP server in Claude Desktop
2. Chat: "Analyze the dataset I just uploaded"
3. Claude uses MCP tools to process and explain your data

### Scenario 3: API Integration

1. `python main.py`
2. Integrate with your data pipelines via the REST API
3. Programmatic dataset onboarding

## 📊 What You Get

For each processed dataset:

- **📄 Original File**: Preserved in an organized folder
- **📋 Metadata JSON**: Column info, types, statistics
- **📊 Excel Contract**: Professional multi-sheet contract
- **🔍 Quality Report**: Data quality assessment
- **📖 README**: Human-readable summary

## 🛠️ Available Tools

### FastAPI Endpoints

- `/tool/extract_metadata` - Analyze dataset structure
- `/tool/apply_dq_rules` - Generate quality rules
- `/process_dataset` - Complete workflow
- `/health` - System health check

### MCP Tools (for LLMs)

- `extract_dataset_metadata` - Dataset analysis
- `generate_data_quality_rules` - Quality assessment
- `process_complete_dataset` - Full pipeline
- `list_catalog_files` - Catalog browsing

### CLI Commands

- `dataset_manager.py list` - Show processed datasets
- `auto_processor.py --once` - Single check cycle
- `processor_dashboard.py --live` - Real-time monitoring

## 🔧 Configuration

### Environment Variables (.env)

```env
GOOGLE_SERVICE_ACCOUNT_KEY_PATH=path/to/your/key.json
MCP_SERVER_FOLDER_ID=your_input_folder_id
MCP_CLIENT_FOLDER_ID=your_output_folder_id
```

### Auto-Processor Settings (auto_config.py)

- Check interval: 30 seconds
- Supported formats: CSV, Excel
- File age threshold: 1 minute
- Max files per cycle: 5

## 📈 Monitoring & Analytics

```bash
# Current status
python processor_dashboard.py

# Live monitoring (auto-refresh)
python processor_dashboard.py --live

# Detailed statistics
python processor_dashboard.py --stats

# Processing history
python auto_processor.py --list
```

## 🐳 Docker Deployment

```bash
# Build
docker build -t mcp-dataset-server .
```
```bash
# Run (mount your service account key securely)
docker run -p 8000:8000 \
  -v /secure/path/to/key.json:/app/keys/key.json \
  -e GOOGLE_SERVICE_ACCOUNT_KEY_PATH=/app/keys/key.json \
  -e MCP_SERVER_FOLDER_ID=your_folder_id \
  mcp-dataset-server
```

## 🔍 Troubleshooting

### Common Issues

- **No files detected**: Check Google Drive permissions
- **Processing errors**: Verify service account access
- **MCP not working**: Check the Claude Desktop configuration

### Debug Commands

```bash
# Test the Google Drive connection
python -c "from utils import get_drive_service; print('✅ Connected')"

# Check auto-processor status
python auto_processor.py --once

# Verify the MCP server
python test_mcp_server.py
```

## 🤝 Contributing

1. Fork the repository
2. Create a feature branch
3. **Never commit sensitive data**
4. Test your changes
5. Submit a pull request

## 📚 Documentation

- [SECURITY_SETUP.md](SECURITY_SETUP.md) - Security configuration
- [AUTOMATION_GUIDE.md](AUTOMATION_GUIDE.md) - Automation features
- [MCP_INTEGRATION_GUIDE.md](MCP_INTEGRATION_GUIDE.md) - LLM integration

## 📄 License

MIT License

## 🎉 What Makes This Special

- **🔒 Security First**: Proper credential management
- **🤖 True Automation**: Zero manual intervention
- **🧠 LLM Integration**: Natural-language data processing
- **📊 Professional Output**: Enterprise-ready documentation
- **🔧 Multiple Interfaces**: API, CLI, MCP, Dashboard
- **📈 Real-time Monitoring**: Live processing status
- **🗂️ Perfect Organization**: Structured output folders

Transform your messy data files into professional, documented, quality-checked datasets automatically! 🚀
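To make the API-integration scenario concrete, here is a minimal client sketch. Only the endpoint paths (`/health`, `/process_dataset`) and port 8000 come from this README; the request body (`file_id`) and the JSON response shape are assumptions, so check `main.py` for the actual contract:

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # default port from the Docker example above


def health_url() -> str:
    """URL of the /health endpoint listed under FastAPI Endpoints."""
    return f"{BASE_URL}/health"


def process_dataset(file_id: str) -> dict:
    """POST to /process_dataset for a Google Drive file ID.

    NOTE: the JSON body ({"file_id": ...}) and the response shape are
    assumptions, not documented in this README.
    """
    req = urllib.request.Request(
        f"{BASE_URL}/process_dataset",
        data=json.dumps({"file_id": file_id}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


if __name__ == "__main__":
    print(health_url())
    # process_dataset("YOUR_FILE_ID")  # requires the server to be running
```

The same calls can be issued from any HTTP client, so the server slots into existing pipelines without a Python dependency.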

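Scenario 2 above requires registering `mcp_server.py` with Claude Desktop, which reads its MCP servers from `claude_desktop_config.json`. A sketch of such an entry (the server name, paths, and environment values here are illustrative; see [MCP_INTEGRATION_GUIDE.md](MCP_INTEGRATION_GUIDE.md) for the project's actual instructions):

```json
{
  "mcpServers": {
    "dataset-onboarding": {
      "command": "python",
      "args": ["/path/to/mcp_server.py"],
      "env": {
        "GOOGLE_SERVICE_ACCOUNT_KEY_PATH": "/secure/path/to/key.json"
      }
    }
  }
}
```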
## MCP directory API

We provide all the information about MCP servers via our MCP API.

```bash
curl -X GET 'https://glama.ai/api/mcp/v1/servers/Magenta91/MCP'
```
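The same lookup, sketched in Python with the standard library (the URL comes from the curl example above; the JSON response shape is not documented here, so it is returned as-is):

```python
import json
import urllib.request


def server_api_url(owner: str, repo: str) -> str:
    """Build the directory API URL used in the curl example above."""
    return f"https://glama.ai/api/mcp/v1/servers/{owner}/{repo}"


def fetch_server_info(owner: str, repo: str) -> dict:
    """GET a server's directory entry (requires network access)."""
    with urllib.request.urlopen(server_api_url(owner, repo)) as resp:
        return json.load(resp)


if __name__ == "__main__":
    print(server_api_url("Magenta91", "MCP"))
```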

If you have feedback or need assistance with the MCP directory API, please join our Discord server.