# MCP Dataset Onboarding Server
A FastAPI-based MCP (Model Context Protocol) server for automating dataset onboarding, using Google Drive as both the input source and a mock catalog.
## SECURITY FIRST - READ THIS BEFORE SETUP
**This repository contains template files only. You MUST configure your own credentials before use.**
**Read [SECURITY_SETUP.md](SECURITY_SETUP.md) for complete security instructions.**
**Never commit service account keys or real folder IDs to version control!**
## Features
- **Automated Dataset Processing**: Complete workflow from raw CSV/Excel files to cataloged datasets
- **Google Drive Integration**: Uses Google Drive folders as input source and catalog storage
- **Metadata Extraction**: Automatically extracts column information, data types, and basic statistics
- **Data Quality Rules**: Suggests DQ rules based on data characteristics (see the sketch after this list)
- **Contract Generation**: Creates Excel contracts with schema and DQ information
- **Mock Catalog**: Publishes processed artifacts to a catalog folder
- **Automated Processing**: Watches folders and processes files automatically
- **Multiple Interfaces**: FastAPI server, MCP server, CLI tools, and dashboards
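To make the rule-suggestion idea concrete, here is a minimal sketch of how DQ rule candidates could be derived from a DataFrame's characteristics. The real logic lives in `utils.py` and `dataset_processor.py`; the function name and rule shapes below are illustrative assumptions, not the project's actual API.

```python
import pandas as pd

def suggest_dq_rules(df: pd.DataFrame) -> list[dict]:
    """Illustrative sketch only: derive simple DQ rule candidates from data.

    Names and rule shapes are assumptions for demonstration; the project's
    actual implementation may differ.
    """
    rules = []
    for col in df.columns:
        series = df[col]
        if series.notna().all():
            rules.append({"column": col, "rule": "not_null"})
        if series.is_unique:
            rules.append({"column": col, "rule": "unique"})
        if pd.api.types.is_numeric_dtype(series):
            rules.append({
                "column": col,
                "rule": "range",
                "min": float(series.min()),
                "max": float(series.max()),
            })
    return rules

if __name__ == "__main__":
    sample = pd.DataFrame({"id": [1, 2, 3], "amount": [9.5, 12.0, 3.25]})
    print(suggest_dq_rules(sample))
```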
## Project Structure
```
├── main.py                    # FastAPI server and endpoints
├── mcp_server.py              # True MCP protocol server for LLM integration
├── utils.py                   # Google Drive helpers and DQ functions
├── dataset_processor.py       # Centralized dataset processing logic
├── auto_processor.py          # Automated file monitoring
├── start_auto_processor.py    # Easy startup for auto-processor
├── processor_dashboard.py     # Monitoring dashboard
├── dataset_manager.py         # CLI tool for managing datasets
├── local_test.py              # Local processing script
├── auto_config.py             # Configuration management
├── requirements.txt           # Python dependencies
├── Dockerfile                 # Container configuration
├── .env.template              # Environment variables template
├── .gitignore                 # Security: excludes sensitive files
├── SECURITY_SETUP.md          # Security configuration guide
├── processed_datasets/        # Organized output folder
│   └── [dataset_name]/        # Individual dataset folders
│       ├── [dataset].csv      # Original dataset
│       ├── [dataset]_metadata.json
│       ├── [dataset]_contract.xlsx
│       ├── [dataset]_dq_report.json
│       └── README.md          # Dataset summary
└── README.md                  # This file
```
## Quick Start
### 1. Security Setup (REQUIRED)
```bash
# 1. Read the security guide
cat SECURITY_SETUP.md
# 2. Set up your Google service account (outside this repo)
# 3. Configure your environment variables
cp .env.template .env
# Edit .env with your actual values
# 4. Verify no sensitive files will be committed
git status
```
### 2. Installation
```bash
# Install dependencies
pip install -r requirements.txt
# Test the setup
python local_test.py
```
### 3. Choose Your Interface
#### Fully Automated (Recommended)
```bash
# Start auto-processor - upload files and walk away!
python start_auto_processor.py
```
#### API Server
```bash
# Start FastAPI server
python main.py
```
#### LLM Integration (MCP)
```bash
# Start MCP server for Claude Desktop, etc.
python mcp_server.py
```
#### Command Line
```bash
# Manual dataset management
python dataset_manager.py list
python dataset_manager.py process YOUR_FILE_ID
```
## Usage Scenarios
### Scenario 1: Set-and-Forget Automation
1. `python start_auto_processor.py`
2. Upload files to Google Drive
3. Files processed automatically within 30 seconds
4. Monitor with `python processor_dashboard.py --live`
### Scenario 2: LLM-Powered Data Analysis
1. Configure the MCP server in Claude Desktop (a sample config sketch follows this list)
2. Chat: "Analyze the dataset I just uploaded"
3. Claude uses MCP tools to process and explain your data
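For step 1, Claude Desktop reads MCP servers from its `claude_desktop_config.json`. A minimal sketch might look like the following; the server name and the paths are placeholders you would adapt to your setup, and [MCP_INTEGRATION_GUIDE.md](MCP_INTEGRATION_GUIDE.md) covers the full details:

```json
{
  "mcpServers": {
    "dataset-onboarding": {
      "command": "python",
      "args": ["/path/to/repo/mcp_server.py"],
      "env": {
        "GOOGLE_SERVICE_ACCOUNT_KEY_PATH": "/secure/path/to/key.json"
      }
    }
  }
}
```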
### Scenario 3: API Integration
1. `python main.py`
2. Integrate with your data pipelines via REST API
3. Programmatic dataset onboarding
## What You Get
For each processed dataset:
- **Original File**: Preserved in an organized folder
- **Metadata JSON**: Column info, types, and basic statistics (a sample sketch follows this list)
- **Excel Contract**: Professional multi-sheet contract
- **Quality Report**: Data quality assessment
- **README**: Human-readable summary
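The exact schema of `[dataset]_metadata.json` is defined by the processing code; as a rough, illustrative sketch (all field names here are assumptions), it captures column-level information like this:

```json
{
  "dataset": "sales_2024.csv",
  "row_count": 1250,
  "columns": [
    {"name": "order_id", "dtype": "int64", "null_count": 0, "unique_count": 1250},
    {"name": "amount", "dtype": "float64", "null_count": 3, "min": 4.99, "max": 1899.0}
  ]
}
```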
## Available Tools
### FastAPI Endpoints
- `/tool/extract_metadata` - Analyze dataset structure
- `/tool/apply_dq_rules` - Generate quality rules
- `/process_dataset` - Complete workflow (see the example call after this list)
- `/health` - System health check
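A quick way to exercise the endpoints from Python, assuming the server is listening on port 8000 as in the Docker example below. The request body shape (a `file_id` field) is an assumption for illustration; check `main.py` for the actual parameter names.

```python
import requests

BASE_URL = "http://localhost:8000"

# Health check
print(requests.get(f"{BASE_URL}/health").json())

# Run the complete workflow for one Drive file.
# The "file_id" field name is an assumption; see main.py for the real schema.
resp = requests.post(f"{BASE_URL}/process_dataset", json={"file_id": "YOUR_FILE_ID"})
resp.raise_for_status()
print(resp.json())
```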
### MCP Tools (for LLMs)
- `extract_dataset_metadata` - Dataset analysis
- `generate_data_quality_rules` - Quality assessment
- `process_complete_dataset` - Full pipeline
- `list_catalog_files` - Catalog browsing
### CLI Commands
- `dataset_manager.py list` - Show processed datasets
- `auto_processor.py --once` - Single check cycle
- `processor_dashboard.py --live` - Real-time monitoring
## Configuration
### Environment Variables (.env)
```env
GOOGLE_SERVICE_ACCOUNT_KEY_PATH=path/to/your/key.json
MCP_SERVER_FOLDER_ID=your_input_folder_id
MCP_CLIENT_FOLDER_ID=your_output_folder_id
```
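Before starting any of the servers, you can sanity-check that these variables are visible to Python. The snippet below uses only the standard library and assumes the values are already exported in your shell; the project's own scripts may load the `.env` file differently.

```python
import os

# Variables named in this README's .env template.
REQUIRED_VARS = [
    "GOOGLE_SERVICE_ACCOUNT_KEY_PATH",
    "MCP_SERVER_FOLDER_ID",
    "MCP_CLIENT_FOLDER_ID",
]

missing = [name for name in REQUIRED_VARS if not os.environ.get(name)]
if missing:
    raise SystemExit(f"Missing environment variables: {', '.join(missing)}")
print("All required environment variables are set.")
```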
### Auto-Processor Settings (auto_config.py)
- Check interval: 30 seconds
- Supported formats: CSV, Excel
- File age threshold: 1 minute
- Max files per cycle: 5
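Expressed as code, these defaults could live in `auto_config.py` roughly as follows. The constant names are illustrative assumptions; only the values come from the list above.

```python
# Illustrative sketch of auto-processor settings; actual names may differ.
CHECK_INTERVAL_SECONDS = 30                        # how often to poll the input folder
SUPPORTED_EXTENSIONS = (".csv", ".xlsx", ".xls")   # CSV and Excel formats
MIN_FILE_AGE_SECONDS = 60                          # skip files newer than 1 minute
MAX_FILES_PER_CYCLE = 5                            # cap work per polling cycle
```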
## Monitoring & Analytics
```bash
# Current status
python processor_dashboard.py
# Live monitoring (auto-refresh)
python processor_dashboard.py --live
# Detailed statistics
python processor_dashboard.py --stats
# Processing history
python auto_processor.py --list
```
## Docker Deployment
```bash
# Build
docker build -t mcp-dataset-server .
# Run (mount your service account key securely)
docker run -p 8000:8000 \
-v /secure/path/to/key.json:/app/keys/key.json \
-e GOOGLE_SERVICE_ACCOUNT_KEY_PATH=/app/keys/key.json \
-e MCP_SERVER_FOLDER_ID=your_folder_id \
mcp-dataset-server
```
## Troubleshooting
### Common Issues
- **No files detected**: Check Google Drive permissions
- **Processing errors**: Verify service account access
- **MCP not working**: Check Claude Desktop configuration
### Debug Commands
```bash
# Test Google Drive connection
python -c "from utils import get_drive_service; print('Connected')"
# Check auto-processor status
python auto_processor.py --once
# Verify MCP server
python test_mcp_server.py
```
## Contributing
1. Fork the repository
2. Create a feature branch
3. **Never commit sensitive data**
4. Test your changes
5. Submit a pull request
## Documentation
- [SECURITY_SETUP.md](SECURITY_SETUP.md) - Security configuration
- [AUTOMATION_GUIDE.md](AUTOMATION_GUIDE.md) - Automation features
- [MCP_INTEGRATION_GUIDE.md](MCP_INTEGRATION_GUIDE.md) - LLM integration
## License
MIT License
## What Makes This Special
- **Security First**: Proper credential management
- **True Automation**: Zero manual intervention
- **LLM Integration**: Natural language data processing
- **Professional Output**: Enterprise-ready documentation
- **Multiple Interfaces**: API, CLI, MCP, Dashboard
- **Real-time Monitoring**: Live processing status
- **Perfect Organization**: Structured output folders

Transform your messy data files into professional, documented, quality-checked datasets automatically!