DataBeak


AI-Powered CSV Processing via Model Context Protocol

Transform how AI assistants work with CSV data. DataBeak provides 40+ specialized tools for data manipulation, analysis, and validation through the Model Context Protocol (MCP).

Features

  • 🔄 Complete Data Operations - Load, transform, and analyze CSV data from URLs and string content

  • 📊 Advanced Analytics - Statistics, correlations, outlier detection, and data profiling

  • ✅ Data Validation - Schema validation, quality scoring, anomaly detection

  • 🎯 Stateless Design - Clean MCP architecture with external context management

  • ⚡ High Performance - Async I/O, streaming downloads, chunked processing

  • 🔒 Session Management - Multi-user support with isolated sessions

  • 🛡️ Web-Safe - No file system access; designed for secure web hosting

  • 🌟 Code Quality - Zero ruff violations, 100% mypy compliance, complete MCP tool documentation, comprehensive test coverage

Getting Started

The fastest way to use DataBeak is with uvx (no installation required):

For Claude Desktop

Add this to your MCP Settings file:

```json
{
  "mcpServers": {
    "databeak": {
      "command": "uvx",
      "args": [
        "--from",
        "git+https://github.com/jonpspri/databeak.git",
        "databeak"
      ]
    }
  }
}
```

For Other AI Clients

DataBeak works with Continue, Cline, Windsurf, and Zed. See the installation guide for specific configuration examples.

HTTP Mode (Advanced)

For HTTP-based AI clients or custom deployments:

```shell
# Run in HTTP mode
uv run databeak --transport http --host 0.0.0.0 --port 8000

# Access server at http://localhost:8000/mcp
# Health check at http://localhost:8000/health
```

Quick Test

Once configured, ask your AI assistant:

  • "Load this CSV data: name,price\nWidget,10.99\nGadget,25.50"

  • "Load CSV from URL: https://example.com/data.csv"

  • "Remove duplicate rows and show me the statistics"

  • "Find outliers in the price column"
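Under the hood, prompts like these translate into DataFrame operations. A rough pandas sketch of what the dedup and outlier prompts amount to (this is illustrative, not DataBeak's actual code; the IQR rule is one common outlier heuristic):

```python
import io

import pandas as pd

# CSV supplied as string content, as in the first prompt above
csv_text = "name,price\nWidget,10.99\nGadget,25.50\nGadget,25.50"
df = pd.read_csv(io.StringIO(csv_text))

# "Remove duplicate rows and show me the statistics"
deduped = df.drop_duplicates()
stats = deduped["price"].describe()

# "Find outliers in the price column" -- simple 1.5x IQR fences
q1, q3 = deduped["price"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = deduped[
    (deduped["price"] < q1 - 1.5 * iqr) | (deduped["price"] > q3 + 1.5 * iqr)
]

print(len(deduped), len(outliers))  # → 2 0
```

With only two distinct prices, neither falls outside the IQR fences, so no outliers are reported.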

Documentation

📚 Complete Documentation

Environment Variables

Configure DataBeak behavior with environment variables (all use DATABEAK_ prefix):

| Variable | Default | Description |
| --- | --- | --- |
| `DATABEAK_SESSION_TIMEOUT` | 3600 | Session timeout (seconds) |
| `DATABEAK_MAX_DOWNLOAD_SIZE_MB` | 100 | Maximum URL download size (MB) |
| `DATABEAK_MAX_MEMORY_USAGE_MB` | 1000 | Max DataFrame memory (MB) |
| `DATABEAK_MAX_ROWS` | 1,000,000 | Max DataFrame rows |
| `DATABEAK_URL_TIMEOUT_SECONDS` | 30 | URL download timeout (seconds) |
| `DATABEAK_HEALTH_MEMORY_THRESHOLD_MB` | 2048 | Health monitoring memory threshold (MB) |

See settings.py for complete configuration options.
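The prefix-plus-default pattern above can be sketched with the stdlib alone; DataBeak's own settings.py may use a settings library, but the variable names and defaults here are taken from the table:

```python
import os


def databeak_setting(name: str, default: int) -> int:
    """Read a DATABEAK_-prefixed integer setting, falling back to a default."""
    raw = os.environ.get(f"DATABEAK_{name}")
    return int(raw) if raw is not None else default


# Defaults from the table above apply when the variables are unset
session_timeout = databeak_setting("SESSION_TIMEOUT", 3600)
max_download_mb = databeak_setting("MAX_DOWNLOAD_SIZE_MB", 100)
```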

Known Limitations

DataBeak is designed for interactive CSV processing with AI assistants. Be aware of these constraints:

  • Data Loading: URLs and string content only (no local file system access for web hosting security)

  • Download Size: Maximum 100MB per URL download (configurable via DATABEAK_MAX_DOWNLOAD_SIZE_MB)

  • DataFrame Size: Maximum 1GB memory and 1M rows per DataFrame (configurable)

  • Session Management: Maximum 100 concurrent sessions, 1-hour timeout (configurable)

  • Memory: Large datasets may require significant memory; monitor with health_check tool

  • CSV Dialects: Assumes standard CSV format; complex dialects may require pre-processing

  • Concurrency: Async I/O for concurrent URL downloads; parallel sessions supported

  • Data Types: Automatic type inference; complex types may need explicit conversion

  • URL Loading: HTTPS only; blocks private networks (127.0.0.1, 192.168.x.x, 10.x.x.x) for security

For production deployments with larger datasets, adjust environment variables and monitor resource usage with health_check and get_server_info tools.
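The HTTPS-only and private-network restrictions above can be approximated with the stdlib `ipaddress` module. A sketch of such a check (DataBeak's actual validation logic may differ, e.g. by also resolving hostnames before deciding):

```python
import ipaddress
from urllib.parse import urlparse


def url_is_allowed(url: str) -> bool:
    """Reject non-HTTPS URLs and literal IPs in private/loopback ranges."""
    parsed = urlparse(url)
    if parsed.scheme != "https":
        return False
    host = parsed.hostname or ""
    try:
        addr = ipaddress.ip_address(host)
    except ValueError:
        # Hostname, not a literal IP; a real check would resolve it and
        # re-validate the resulting addresses to prevent SSRF via DNS.
        return True
    return not (addr.is_private or addr.is_loopback or addr.is_link_local)
```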

Contributing

We welcome contributions! Please:

  1. Fork the repository

  2. Create a feature branch (git checkout -b feature/amazing-feature)

  3. Make your changes with tests

  4. Run tests and quality checks: uv run -m pytest, uv run ruff check, uv run mypy src/databeak/

  5. Submit a pull request

Note: All changes must go through pull requests. Direct commits to main are blocked by pre-commit hooks.

Development

```shell
# Set up the development environment
git clone https://github.com/jonpspri/databeak.git
cd databeak
uv sync

# Run the server locally
uv run databeak

# Run tests
uv run -m pytest tests/unit/   # Unit tests (primary)
uv run -m pytest               # All tests

# Run quality checks
uv run ruff check
uv run mypy src/databeak/
```

Testing Structure

DataBeak implements comprehensive unit and integration testing:

  • Unit Tests (tests/unit/) - 940+ fast, isolated module tests

  • Integration Tests (tests/integration/) - 43 FastMCP Client-based protocol tests across 7 test files

  • E2E Tests (tests/e2e/) - Planned: Complete workflow validation

Test Execution:

```shell
uv run pytest -n auto tests/unit/          # Run unit tests (940+ tests)
uv run pytest -n auto tests/integration/   # Run integration tests (43 tests)
uv run pytest -n auto --cov=src/databeak   # Run with coverage analysis
```

See Testing Guide for comprehensive testing details.

License

Apache 2.0 - see LICENSE file.
