# MCP Dataset Onboarding Server
A FastAPI-based MCP (Model Context Protocol) server that automates dataset onboarding, using Google Drive as both the input source and a mock catalog.
## **SECURITY FIRST - READ THIS BEFORE SETUP**
**This repository contains template files only. You MUST configure your own credentials before use.**
**Read [SECURITY_SETUP.md](SECURITY_SETUP.md) for complete security instructions.**
**Never commit service account keys or real folder IDs to version control!**
## Features
- **Automated Dataset Processing**: Complete workflow from raw CSV/Excel files to cataloged datasets
- **Google Drive Integration**: Uses Google Drive folders as input source and catalog storage
- **Metadata Extraction**: Automatically extracts column information, data types, and basic statistics
- **Data Quality Rules**: Suggests DQ rules based on data characteristics
- **Contract Generation**: Creates Excel contracts with schema and DQ information
- **Mock Catalog**: Publishes processed artifacts to a catalog folder
- **Automated Processing**: Watches folders and processes files automatically
- **Multiple Interfaces**: FastAPI server, MCP server, CLI tools, and dashboards
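To make the "Data Quality Rules" feature concrete, here is a minimal sketch of how rules might be suggested from a column's values. It is not the project's actual logic in `utils.py`: the function name `suggest_dq_rules`, the rule labels, and the heuristics are all assumptions chosen for illustration, and it uses only the standard library.

```python
import csv
import io


def _is_number(value: str) -> bool:
    """Return True if the string parses as a float."""
    try:
        float(value)
        return True
    except ValueError:
        return False


def suggest_dq_rules(csv_text: str) -> dict[str, list[str]]:
    """Suggest simple per-column rules (hypothetical heuristics, not the real ones)."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if not rows:
        return {}
    rules: dict[str, list[str]] = {}
    for column in rows[0].keys():
        values = [row[column] or "" for row in rows]
        non_empty = [v for v in values if v != ""]
        column_rules = []
        # No missing values observed: suggest a not-null constraint.
        if len(non_empty) == len(values):
            column_rules.append("not_null")
        # All observed non-empty values are distinct: suggest uniqueness.
        if non_empty and len(set(non_empty)) == len(non_empty):
            column_rules.append("unique")
        # All non-empty values are numeric: suggest the observed range.
        if non_empty and all(_is_number(v) for v in non_empty):
            numbers = [float(v) for v in non_empty]
            column_rules.append(f"range[{min(numbers)}, {max(numbers)}]")
        rules[column] = column_rules
    return rules


sample = "id,name,score\n1,Ann,3.5\n2,Bob,\n3,Ann,4.0\n"
rules = suggest_dq_rules(sample)
```

Rules derived this way describe only the sample that was seen, so they are best treated as suggestions for a human to review rather than enforced constraints.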
## Project Structure
```
├── main.py                     # FastAPI server and endpoints
├── mcp_server.py               # True MCP protocol server for LLM integration
├── utils.py                    # Google Drive helpers and DQ functions
├── dataset_processor.py        # Centralized dataset processing logic
├── auto_processor.py           # Automated file monitoring
├── start_auto_processor.py     # Easy startup for auto-processor
├── processor_dashboard.py      # Monitoring dashboard
├── dataset_manager.py          # CLI tool for managing datasets
├── local_test.py               # Local processing script
├── auto_config.py              # Configuration management
├── requirements.txt            # Python dependencies
├── Dockerfile                  # Container configuration
├── .env.template               # Environment variables template
├── .gitignore                  # Security: excludes sensitive files
├── SECURITY_SETUP.md           # Security configuration guide
├── processed_datasets/         # Organized output folder
│   └── [dataset_name]/         # Individual dataset folders
│       ├── [dataset].csv       # Original dataset
│       ├── [dataset]_metadata.json
│       ├── [dataset]_contract.xlsx
│       ├── [dataset]_dq_report.json
│       └── README.md           # Dataset summary
└── README.md                   # This file
```
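The output layout under `processed_datasets/` can be expressed as a small path helper. This is only a sketch of the naming convention shown in the tree above; the function name `artifact_paths` and the dictionary keys are hypothetical, and the original file is assumed to be a CSV.

```python
from pathlib import Path


def artifact_paths(dataset_name: str, root: str = "processed_datasets") -> dict[str, Path]:
    """Return the expected artifact paths for one dataset (hypothetical helper)."""
    folder = Path(root) / dataset_name
    return {
        "original": folder / f"{dataset_name}.csv",
        "metadata": folder / f"{dataset_name}_metadata.json",
        "contract": folder / f"{dataset_name}_contract.xlsx",
        "dq_report": folder / f"{dataset_name}_dq_report.json",
        "readme": folder / "README.md",
    }


paths = artifact_paths("sales")
for label, path in paths.items():
    print(f"{label}: {path.as_posix()}")
```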
## Quick Start
### 1. Security Setup (REQUIRED)
```bash
# 1. Read the security guide
cat SECURITY_SETUP.md
# 2. Set up your Google service account (outside this repo)
# 3. Configure your environment variables
cp .env.template .env
# Edit .env with your actual values
# 4. Verify no sensitive files will be committed
git status
```
### 2. Installation
```bash
# Install dependencies
pip install -r requirements.txt
# Test the setup
python local_test.py
```
### 3. Choose Your Interface
#### Fully Automated (Recommended)
```bash
# Start auto-processor - upload files and walk away!
python start_auto_processor.py
```
#### API Server
```bash
# Start FastAPI server
python main.py
```
#### LLM Integration (MCP)
```bash
# Start MCP server for Claude Desktop, etc.
python mcp_server.py
```
#### Command Line
```bash
# Manual dataset management
python dataset_manager.py list
python dataset_manager.py process YOUR_FILE_ID
```
## Usage Scenarios
### Scenario 1: Set-and-Forget Automation
1. `python start_auto_processor.py`
2. Upload files to Google Drive
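3. Files are processed automatically on the next check cycle (every 30 seconds by default, subject to the file age threshold listed under Configuration)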
4. Monitor with `python processor_dashboard.py --live`
### Scenario 2: LLM-Powered Data Analysis
1. Configure MCP server in Claude Desktop
2. Chat: "Analyze the dataset I just uploaded"
3. Claude uses MCP tools to process and explain your data
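Step 1 usually means adding an entry under `mcpServers` in Claude Desktop's `claude_desktop_config.json`. The sketch below generates such an entry; the server name `dataset-onboarding`, the script path, and the choice to pass credentials through `env` are assumptions, so check MCP_INTEGRATION_GUIDE.md for the values this project expects.

```python
import json


def claude_desktop_entry(script_path: str, env: dict[str, str]) -> dict:
    """Build an mcpServers entry that launches mcp_server.py (name is hypothetical)."""
    return {
        "mcpServers": {
            "dataset-onboarding": {
                "command": "python",
                "args": [script_path],
                "env": env,
            }
        }
    }


config = claude_desktop_entry(
    "/absolute/path/to/mcp_server.py",  # placeholder path
    {
        "GOOGLE_SERVICE_ACCOUNT_KEY_PATH": "/secure/path/to/key.json",
        "MCP_SERVER_FOLDER_ID": "your_input_folder_id",
        "MCP_CLIENT_FOLDER_ID": "your_output_folder_id",
    },
)
print(json.dumps(config, indent=2))
```

Merge the printed JSON into the existing config file rather than overwriting it, then restart Claude Desktop so it picks up the new server.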
### Scenario 3: API Integration
1. `python main.py`
2. Integrate with your data pipelines via REST API
3. Programmatic dataset onboarding
## What You Get
For each processed dataset:
- **Original File**: Preserved in an organized folder
- **Metadata JSON**: Column info, types, statistics
- **Excel Contract**: Professional multi-sheet contract
- **Quality Report**: Data quality assessment
- **README**: Human-readable summary
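As a rough illustration of what the metadata JSON might contain, the following sketch derives column names, inferred types, and basic counts from a CSV using only the standard library. The field names (`row_count`, `null_count`, `distinct_count`) and the type labels are assumptions; the real schema is whatever `dataset_processor.py` writes.

```python
import csv
import io
import json


def infer_type(values: list[str]) -> str:
    """Infer a coarse type label from string values, ignoring empty cells."""
    non_empty = [v for v in values if v != ""]
    if not non_empty:
        return "empty"
    for caster, label in ((int, "integer"), (float, "float")):
        try:
            for value in non_empty:
                caster(value)
            return label
        except ValueError:
            continue
    return "string"


def extract_metadata(csv_text: str) -> dict:
    """Summarize a CSV: row count plus per-column type and counts (illustrative schema)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    columns = []
    for name in reader.fieldnames or []:
        values = [row[name] or "" for row in rows]
        columns.append({
            "name": name,
            "type": infer_type(values),
            "null_count": values.count(""),
            "distinct_count": len(set(values) - {""}),
        })
    return {"row_count": len(rows), "columns": columns}


metadata = extract_metadata("id,city,temp\n1,Oslo,3.5\n2,,4\n")
print(json.dumps(metadata, indent=2))
```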
## Available Tools
### FastAPI Endpoints
- `/tool/extract_metadata` - Analyze dataset structure
- `/tool/apply_dq_rules` - Generate quality rules
- `/process_dataset` - Complete workflow
- `/health` - System health check
### MCP Tools (for LLMs)
- `extract_dataset_metadata` - Dataset analysis
- `generate_data_quality_rules` - Quality assessment
- `process_complete_dataset` - Full pipeline
- `list_catalog_files` - Catalog browsing
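MCP clients invoke tools like these with a JSON-RPC 2.0 `tools/call` request. The sketch below builds such a message for `extract_dataset_metadata`; the `file_id` argument name is an assumption, since the tool's input schema is not documented here (a client can discover it with a `tools/list` request).

```python
import json


def tools_call_message(request_id: int, tool_name: str, arguments: dict) -> dict:
    """Build a JSON-RPC 2.0 request that calls one MCP tool."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }


message = tools_call_message(
    1,
    "extract_dataset_metadata",
    {"file_id": "YOUR_FILE_ID"},  # hypothetical argument name
)
print(json.dumps(message))
```

In practice an MCP client such as Claude Desktop sends these messages for you, after an `initialize` handshake, so this is mainly useful for debugging the server by hand.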
### CLI Commands
- `dataset_manager.py list` - Show processed datasets
- `auto_processor.py --once` - Single check cycle
- `processor_dashboard.py --live` - Real-time monitoring
## Configuration
### Environment Variables (.env)
```env
GOOGLE_SERVICE_ACCOUNT_KEY_PATH=path/to/your/key.json
MCP_SERVER_FOLDER_ID=your_input_folder_id
MCP_CLIENT_FOLDER_ID=your_output_folder_id
```
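A fail-fast check for these three variables might look like the sketch below. The `Settings` class and `load_settings` function are hypothetical names, and the code assumes the variables are already present in the process environment (it does not parse the `.env` file itself).

```python
import os
from dataclasses import dataclass

REQUIRED_VARIABLES = (
    "GOOGLE_SERVICE_ACCOUNT_KEY_PATH",
    "MCP_SERVER_FOLDER_ID",
    "MCP_CLIENT_FOLDER_ID",
)


@dataclass(frozen=True)
class Settings:
    key_path: str
    input_folder_id: str
    output_folder_id: str


def load_settings(env=None) -> Settings:
    """Read the required variables, raising if any is missing or empty."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARIABLES if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return Settings(
        key_path=env["GOOGLE_SERVICE_ACCOUNT_KEY_PATH"],
        input_folder_id=env["MCP_SERVER_FOLDER_ID"],
        output_folder_id=env["MCP_CLIENT_FOLDER_ID"],
    )


example = load_settings({
    "GOOGLE_SERVICE_ACCOUNT_KEY_PATH": "path/to/your/key.json",
    "MCP_SERVER_FOLDER_ID": "your_input_folder_id",
    "MCP_CLIENT_FOLDER_ID": "your_output_folder_id",
})
```

Calling `load_settings()` with no argument reads from `os.environ`, which makes a misconfigured deployment fail at startup instead of on the first Drive request.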
### Auto-Processor Settings (auto_config.py)
- Check interval: 30 seconds
- Supported formats: CSV, Excel
- File age threshold: 1 minute
- Max files per cycle: 5
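One way to read these settings is as filters applied on each check cycle. The sketch below is an interpretation, not the code in `auto_config.py`: the class and field names are hypothetical, the Excel extensions are guessed, and "file age threshold" is assumed to mean a file must be at least one minute old before it is picked up (for example, to avoid reading a file that is still uploading).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AutoConfig:
    check_interval_seconds: int = 30
    supported_extensions: tuple = (".csv", ".xlsx", ".xls")  # assumed extensions
    min_file_age_seconds: int = 60
    max_files_per_cycle: int = 5


def select_files(files, now: float, config: AutoConfig = AutoConfig()) -> list[str]:
    """Pick files to process this cycle.

    files is a list of (name, modified_timestamp) pairs; now is the current timestamp.
    """
    eligible = [
        (name, modified)
        for name, modified in files
        if name.lower().endswith(config.supported_extensions)
        and now - modified >= config.min_file_age_seconds
    ]
    eligible.sort(key=lambda item: item[1])  # oldest first
    return [name for name, _ in eligible[: config.max_files_per_cycle]]


candidates = [("a.csv", 900), ("b.xlsx", 990), ("c.txt", 100), ("d.CSV", 800)]
selected = select_files(candidates, now=1000)
print(selected)
```

Here `b.xlsx` is skipped because it is only 10 seconds old and `c.txt` because of its extension; both would be reconsidered on a later cycle if they became eligible.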
## Monitoring & Analytics
```bash
# Current status
python processor_dashboard.py
# Live monitoring (auto-refresh)
python processor_dashboard.py --live
# Detailed statistics
python processor_dashboard.py --stats
# Processing history
python auto_processor.py --list
```
## Docker Deployment
```bash
# Build
docker build -t mcp-dataset-server .
# Run (mount your service account key securely)
docker run -p 8000:8000 \
  -v /secure/path/to/key.json:/app/keys/key.json:ro \
  -e GOOGLE_SERVICE_ACCOUNT_KEY_PATH=/app/keys/key.json \
  -e MCP_SERVER_FOLDER_ID=your_input_folder_id \
  -e MCP_CLIENT_FOLDER_ID=your_output_folder_id \
  mcp-dataset-server
```
## Troubleshooting
### Common Issues
- **No files detected**: Check Google Drive permissions
- **Processing errors**: Verify service account access
- **MCP not working**: Check Claude Desktop configuration
### Debug Commands
```bash
# Test Google Drive connection
python -c "from utils import get_drive_service; print('Connected')"
# Check auto-processor status
python auto_processor.py --once
# Verify MCP server
python test_mcp_server.py
```
## Contributing
1. Fork the repository
2. Create a feature branch
3. **Never commit sensitive data**
4. Test your changes
5. Submit a pull request
## Documentation
- [SECURITY_SETUP.md](SECURITY_SETUP.md) - Security configuration
- [AUTOMATION_GUIDE.md](AUTOMATION_GUIDE.md) - Automation features
- [MCP_INTEGRATION_GUIDE.md](MCP_INTEGRATION_GUIDE.md) - LLM integration
## License
MIT License
## What Makes This Special
- **Security First**: Proper credential management
- **True Automation**: Zero manual intervention
- **LLM Integration**: Natural language data processing
- **Professional Output**: Enterprise-ready documentation
- **Multiple Interfaces**: API, CLI, MCP, Dashboard
- **Real-time Monitoring**: Live processing status
- **Perfect Organization**: Structured output folders
Transform your messy data files into professional, documented, quality-checked datasets automatically!