DataBeak


AI-Powered CSV Processing via Model Context Protocol

Transform how AI assistants work with CSV data. DataBeak provides 40+ specialized tools for data manipulation, analysis, and validation through the Model Context Protocol (MCP).

Features

  • 🔄 Complete Data Operations - Load, transform, and analyze CSV data from URLs and string content

  • 📊 Advanced Analytics - Statistics, correlations, outlier detection, and data profiling

  • ✅ Data Validation - Schema validation, quality scoring, anomaly detection

  • 🎯 Stateless Design - Clean MCP architecture with external context management

  • ⚡ High Performance - Async I/O, streaming downloads, chunked processing

  • 🔒 Session Management - Multi-user support with isolated sessions

  • 🛡️ Web-Safe - No file system access; designed for secure web hosting

  • 🌟 Code Quality - Zero ruff violations, 100% mypy compliance, complete MCP tool documentation, comprehensive test coverage

Getting Started

The fastest way to use DataBeak is with uvx (no installation required):

For Claude Desktop

Add this to your MCP Settings file:

```json
{
  "mcpServers": {
    "databeak": {
      "command": "uvx",
      "args": [
        "--from",
        "git+https://github.com/jonpspri/databeak.git",
        "databeak"
      ]
    }
  }
}
```

For Other AI Clients

DataBeak works with Continue, Cline, Windsurf, and Zed. See the installation guide for specific configuration examples.

HTTP Mode (Advanced)

For HTTP-based AI clients or custom deployments:

```shell
# Run in HTTP mode
uv run databeak --transport http --host 0.0.0.0 --port 8000

# Access server at http://localhost:8000/mcp
# Health check at http://localhost:8000/health
```

Quick Test

Once configured, ask your AI assistant:

  • "Load this CSV data: name,price\nWidget,10.99\nGadget,25.50"

  • "Load CSV from URL: https://example.com/data.csv"

  • "Remove duplicate rows and show me the statistics"

  • "Find outliers in the price column"
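Under the hood, prompts like these translate into DataFrame operations. A rough pandas sketch of what the dedup and outlier prompts amount to (this is illustrative, not DataBeak's actual code; the IQR rule is one common outlier heuristic):

```python
import io

import pandas as pd

# CSV supplied as string content, as in the first prompt above
csv_text = "name,price\nWidget,10.99\nGadget,25.50\nGadget,25.50"
df = pd.read_csv(io.StringIO(csv_text))

# "Remove duplicate rows and show me the statistics"
deduped = df.drop_duplicates()
stats = deduped["price"].describe()

# "Find outliers in the price column" -- simple 1.5x IQR fences
q1, q3 = deduped["price"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = deduped[
    (deduped["price"] < q1 - 1.5 * iqr) | (deduped["price"] > q3 + 1.5 * iqr)
]

print(len(deduped), len(outliers))  # → 2 0
```

With only two distinct prices, neither falls outside the IQR fences, so no outliers are reported.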

Documentation

📚 Complete Documentation

Environment Variables

Configure DataBeak behavior with environment variables (all use DATABEAK_ prefix):

| Variable | Default | Description |
| --- | --- | --- |
| `DATABEAK_SESSION_TIMEOUT` | 3600 | Session timeout (seconds) |
| `DATABEAK_MAX_DOWNLOAD_SIZE_MB` | 100 | Maximum URL download size (MB) |
| `DATABEAK_MAX_MEMORY_USAGE_MB` | 1000 | Max DataFrame memory (MB) |
| `DATABEAK_MAX_ROWS` | 1,000,000 | Max DataFrame rows |
| `DATABEAK_URL_TIMEOUT_SECONDS` | 30 | URL download timeout (seconds) |
| `DATABEAK_HEALTH_MEMORY_THRESHOLD_MB` | 2048 | Health monitoring memory threshold (MB) |

See settings.py for complete configuration options.
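The prefix-plus-default pattern above can be sketched with the stdlib alone; DataBeak's own settings.py may use a settings library, but the variable names and defaults here are taken from the table:

```python
import os


def databeak_setting(name: str, default: int) -> int:
    """Read a DATABEAK_-prefixed integer setting, falling back to a default."""
    raw = os.environ.get(f"DATABEAK_{name}")
    return int(raw) if raw is not None else default


# Defaults from the table above apply when the variables are unset
session_timeout = databeak_setting("SESSION_TIMEOUT", 3600)
max_download_mb = databeak_setting("MAX_DOWNLOAD_SIZE_MB", 100)
```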

Known Limitations

DataBeak is designed for interactive CSV processing with AI assistants. Be aware of these constraints:

  • Data Loading: URLs and string content only (no local file system access for web hosting security)

  • Download Size: Maximum 100MB per URL download (configurable via DATABEAK_MAX_DOWNLOAD_SIZE_MB)

  • DataFrame Size: Maximum 1GB memory and 1M rows per DataFrame (configurable)

  • Session Management: Maximum 100 concurrent sessions, 1-hour timeout (configurable)

  • Memory: Large datasets may require significant memory; monitor with health_check tool

  • CSV Dialects: Assumes standard CSV format; complex dialects may require pre-processing

  • Concurrency: Async I/O for concurrent URL downloads; parallel sessions supported

  • Data Types: Automatic type inference; complex types may need explicit conversion

  • URL Loading: HTTPS only; blocks private networks (127.0.0.1, 192.168.x.x, 10.x.x.x) for security

For production deployments with larger datasets, adjust environment variables and monitor resource usage with health_check and get_server_info tools.
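The HTTPS-only and private-network restrictions above can be approximated with the stdlib `ipaddress` module. A sketch of such a check (DataBeak's actual validation logic may differ, e.g. by also resolving hostnames before deciding):

```python
import ipaddress
from urllib.parse import urlparse


def url_is_allowed(url: str) -> bool:
    """Reject non-HTTPS URLs and literal IPs in private/loopback ranges."""
    parsed = urlparse(url)
    if parsed.scheme != "https":
        return False
    host = parsed.hostname or ""
    try:
        addr = ipaddress.ip_address(host)
    except ValueError:
        # Hostname, not a literal IP; a real check would resolve it and
        # re-validate the resulting addresses to prevent SSRF via DNS.
        return True
    return not (addr.is_private or addr.is_loopback or addr.is_link_local)
```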

Contributing

We welcome contributions! Please:

  1. Fork the repository

  2. Create a feature branch (git checkout -b feature/amazing-feature)

  3. Make your changes with tests

  4. Run tests and quality checks: uv run -m pytest, uv run ruff check, uv run mypy src/databeak/

  5. Submit a pull request

Note: All changes must go through pull requests. Direct commits to main are blocked by pre-commit hooks.

Development

```shell
# Set up the development environment
git clone https://github.com/jonpspri/databeak.git
cd databeak
uv sync

# Run the server locally
uv run databeak

# Run tests
uv run -m pytest tests/unit/   # Unit tests (primary)
uv run -m pytest               # All tests

# Run quality checks
uv run ruff check
uv run mypy src/databeak/
```

Testing Structure

DataBeak implements comprehensive unit and integration testing:

  • Unit Tests (tests/unit/) - 940+ fast, isolated module tests

  • Integration Tests (tests/integration/) - 43 FastMCP Client-based protocol tests across 7 test files

  • E2E Tests (tests/e2e/) - Planned: Complete workflow validation

Test Execution:

```shell
uv run pytest -n auto tests/unit/          # Run unit tests (940+ tests)
uv run pytest -n auto tests/integration/   # Run integration tests (43 tests)
uv run pytest -n auto --cov=src/databeak   # Run with coverage analysis
```

See Testing Guide for comprehensive testing details.

License

Apache 2.0 - see LICENSE file.
