# πŸ€– MCP Dataset Onboarding Server A FastAPI-based MCP (Model-Compatible Protocol) server for automating dataset onboarding using Google Drive as both input source and mock catalog. ## πŸ”’ **SECURITY FIRST - READ THIS BEFORE SETUP** ⚠️ **This repository contains template files only. You MUST configure your own credentials before use.** πŸ“– **Read [SECURITY_SETUP.md](SECURITY_SETUP.md) for complete security instructions.** 🚨 **Never commit service account keys or real folder IDs to version control!** ## Features - **Automated Dataset Processing**: Complete workflow from raw CSV/Excel files to cataloged datasets - **Google Drive Integration**: Uses Google Drive folders as input source and catalog storage - **Metadata Extraction**: Automatically extracts column information, data types, and basic statistics - **Data Quality Rules**: Suggests DQ rules based on data characteristics - **Contract Generation**: Creates Excel contracts with schema and DQ information - **Mock Catalog**: Publishes processed artifacts to a catalog folder - **πŸ€– Automated Processing**: Watches folders and processes files automatically - **🌐 Multiple Interfaces**: FastAPI server, MCP server, CLI tools, and dashboards ## Project Structure ``` β”œβ”€β”€ main.py # FastAPI server and endpoints β”œβ”€β”€ mcp_server.py # True MCP protocol server for LLM integration β”œβ”€β”€ utils.py # Google Drive helpers and DQ functions β”œβ”€β”€ dataset_processor.py # Centralized dataset processing logic β”œβ”€β”€ auto_processor.py # πŸ€– Automated file monitoring β”œβ”€β”€ start_auto_processor.py # πŸš€ Easy startup for auto-processor β”œβ”€β”€ processor_dashboard.py # πŸ“Š Monitoring dashboard β”œβ”€β”€ dataset_manager.py # CLI tool for managing datasets β”œβ”€β”€ local_test.py # Local processing script β”œβ”€β”€ auto_config.py # βš™οΈ Configuration management β”œβ”€β”€ requirements.txt # Python dependencies β”œβ”€β”€ Dockerfile # Container configuration β”œβ”€β”€ .env.template # Environment variables template β”œβ”€β”€ .gitignore # Security: excludes sensitive files β”œβ”€β”€ SECURITY_SETUP.md # πŸ”’ Security configuration guide β”œβ”€β”€ processed_datasets/ # Organized output folder β”‚ └── [dataset_name]/ # Individual dataset folders β”‚ β”œβ”€β”€ [dataset].csv # Original dataset β”‚ β”œβ”€β”€ [dataset]_metadata.json β”‚ β”œβ”€β”€ [dataset]_contract.xlsx β”‚ β”œβ”€β”€ [dataset]_dq_report.json β”‚ └── README.md # Dataset summary └── README.md # This file ``` ## πŸš€ Quick Start ### 1. Security Setup (REQUIRED) ```bash # 1. Read the security guide cat SECURITY_SETUP.md # 2. Set up your Google service account (outside this repo) # 3. Configure your environment variables cp .env.template .env # Edit .env with your actual values # 4. Verify no sensitive files will be committed git status ``` ### 2. Installation ```bash # Install dependencies pip install -r requirements.txt # Test the setup python local_test.py ``` ### 3. Choose Your Interface #### πŸ€– Fully Automated (Recommended) ```bash # Start auto-processor - upload files and walk away! python start_auto_processor.py ``` #### 🌐 API Server ```bash # Start FastAPI server python main.py ``` #### 🧠 LLM Integration (MCP) ```bash # Start MCP server for Claude Desktop, etc. python mcp_server.py ``` #### πŸ–₯️ Command Line ```bash # Manual dataset management python dataset_manager.py list python dataset_manager.py process YOUR_FILE_ID ``` ## 🎯 Usage Scenarios ### Scenario 1: Set-and-Forget Automation 1. `python start_auto_processor.py` 2. Upload files to Google Drive 3. 
## 🎯 Usage Scenarios

### Scenario 1: Set-and-Forget Automation

1. Run `python start_auto_processor.py`
2. Upload files to Google Drive
3. Files are processed automatically within 30 seconds
4. Monitor with `python processor_dashboard.py --live`

### Scenario 2: LLM-Powered Data Analysis

1. Configure the MCP server in Claude Desktop
2. Chat: "Analyze the dataset I just uploaded"
3. Claude uses the MCP tools to process and explain your data

### Scenario 3: API Integration

1. Run `python main.py`
2. Integrate with your data pipelines via the REST API
3. Onboard datasets programmatically (a minimal client sketch closes this README)

## 📊 What You Get

For each processed dataset:

- **📄 Original File**: Preserved in an organized folder
- **📋 Metadata JSON**: Column info, types, statistics (an illustrative sample follows this list)
- **📊 Excel Contract**: Professional multi-sheet contract
- **🔍 Quality Report**: Data quality assessment
- **📖 README**: Human-readable summary
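As a rough illustration of the Metadata JSON artifact, here is one plausible shape for a small CSV. The field names are assumptions based on the feature list above (column info, types, basic statistics), not the server's actual schema; check `dataset_processor.py` for the real output:

```json
{
  "dataset_name": "customers",
  "row_count": 1250,
  "column_count": 3,
  "columns": [
    {"name": "customer_id", "dtype": "int64", "null_count": 0, "unique_count": 1250},
    {"name": "email", "dtype": "object", "null_count": 12, "unique_count": 1238},
    {"name": "signup_date", "dtype": "datetime64[ns]", "null_count": 0, "unique_count": 312}
  ]
}
```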
## 🛠️ Available Tools

### FastAPI Endpoints

- `/tool/extract_metadata` - Analyze dataset structure
- `/tool/apply_dq_rules` - Generate quality rules
- `/process_dataset` - Complete workflow
- `/health` - System health check

### MCP Tools (for LLMs)

- `extract_dataset_metadata` - Dataset analysis
- `generate_data_quality_rules` - Quality assessment
- `process_complete_dataset` - Full pipeline
- `list_catalog_files` - Catalog browsing

### CLI Commands

- `dataset_manager.py list` - Show processed datasets
- `auto_processor.py --once` - Single check cycle
- `processor_dashboard.py --live` - Real-time monitoring

## 🔧 Configuration

### Environment Variables (.env)

```env
GOOGLE_SERVICE_ACCOUNT_KEY_PATH=path/to/your/key.json
MCP_SERVER_FOLDER_ID=your_input_folder_id
MCP_CLIENT_FOLDER_ID=your_output_folder_id
```

### Auto-Processor Settings (auto_config.py)

- Check interval: 30 seconds
- Supported formats: CSV, Excel
- File age threshold: 1 minute
- Max files per cycle: 5

## 📈 Monitoring & Analytics

```bash
# Current status
python processor_dashboard.py

# Live monitoring (auto-refresh)
python processor_dashboard.py --live

# Detailed statistics
python processor_dashboard.py --stats

# Processing history
python auto_processor.py --list
```

## 🐳 Docker Deployment

```bash
# Build
docker build -t mcp-dataset-server .

# Run (mount your service account key securely)
docker run -p 8000:8000 \
  -v /secure/path/to/key.json:/app/keys/key.json \
  -e GOOGLE_SERVICE_ACCOUNT_KEY_PATH=/app/keys/key.json \
  -e MCP_SERVER_FOLDER_ID=your_folder_id \
  mcp-dataset-server
```

## 🔍 Troubleshooting

### Common Issues

- **No files detected**: Check Google Drive permissions
- **Processing errors**: Verify service account access
- **MCP not working**: Check your Claude Desktop configuration

### Debug Commands

```bash
# Test the Google Drive connection
python -c "from utils import get_drive_service; print('✅ Connected')"

# Check auto-processor status
python auto_processor.py --once

# Verify the MCP server
python test_mcp_server.py
```

## 🤝 Contributing

1. Fork the repository
2. Create a feature branch
3. **Never commit sensitive data**
4. Test your changes
5. Submit a pull request

## 📚 Documentation

- [SECURITY_SETUP.md](SECURITY_SETUP.md) - Security configuration
- [AUTOMATION_GUIDE.md](AUTOMATION_GUIDE.md) - Automation features
- [MCP_INTEGRATION_GUIDE.md](MCP_INTEGRATION_GUIDE.md) - LLM integration

## 📄 License

MIT License

## 🎉 What Makes This Special

- **🔒 Security First**: Proper credential management
- **🤖 True Automation**: Zero manual intervention
- **🧠 LLM Integration**: Natural language data processing
- **📊 Professional Output**: Enterprise-ready documentation
- **🔧 Multiple Interfaces**: API, CLI, MCP, Dashboard
- **📈 Real-time Monitoring**: Live processing status
- **🗂️ Perfect Organization**: Structured output folders

Transform your messy data files into professional, documented, quality-checked datasets automatically! 🚀
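Finally, to make Scenario 3 concrete: a minimal Python client sketch, assuming the FastAPI server is running locally on port 8000 (the port used in the Docker example) and that `/process_dataset` takes a Google Drive file ID. The request payload is an assumption; consult `main.py` for the exact parameters:

```python
import requests

BASE_URL = "http://localhost:8000"  # port from the Docker example above

# Confirm the server is healthy before submitting work
requests.get(f"{BASE_URL}/health", timeout=10).raise_for_status()

# Run the complete onboarding workflow for one Google Drive file.
# The payload shape is assumed - check main.py for the real parameters.
resp = requests.post(
    f"{BASE_URL}/process_dataset",
    json={"file_id": "YOUR_FILE_ID"},
    timeout=300,
)
resp.raise_for_status()
print(resp.json())  # processing summary returned by the server
```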
