Server Configuration
Describes the environment variables used to configure the server. All variables are optional; defaults are listed in the table, and a launch example follows it.
| Name | Required | Description | Default |
|---|---|---|---|
| DATABEAK_MAX_ROWS | No | Max DataFrame rows | 1000000 |
| DATABEAK_SESSION_TIMEOUT | No | Session timeout (seconds) | 3600 |
| DATABEAK_MAX_MEMORY_USAGE_MB | No | Max DataFrame memory (MB) | 1000 |
| DATABEAK_URL_TIMEOUT_SECONDS | No | URL download timeout | 30 |
| DATABEAK_MAX_DOWNLOAD_SIZE_MB | No | Maximum URL download size (MB) | 100 |
| DATABEAK_HEALTH_MEMORY_THRESHOLD_MB | No | Health monitoring memory threshold | 2048 |
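These limits can be tightened or relaxed per deployment by setting the variables before the server starts. The snippet below is a minimal sketch: the variable names come from the table above, but the `databeak` launch command and the chosen values are assumptions — substitute however you actually start the server.

```python
import os
import subprocess

# Minimal sketch: override a few limits, then launch the server.
# Variable names are from the table above; the "databeak" command and the
# values chosen here are assumptions for illustration only.
env = os.environ.copy()
env.update({
    "DATABEAK_MAX_ROWS": "250000",           # lower the 1,000,000-row default
    "DATABEAK_SESSION_TIMEOUT": "1800",      # 30-minute sessions instead of 3600 s
    "DATABEAK_MAX_DOWNLOAD_SIZE_MB": "25",   # shrink the 100 MB URL download ceiling
})

subprocess.run(["databeak"], env=env, check=True)
```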
Schema
Prompts
Interactive templates invoked by user choice. A client-side invocation sketch follows the table.
| Name | Description |
|---|---|
| analyze_csv_prompt | Generate a prompt to analyze CSV data. |
| data_cleaning_prompt | Generate a prompt for data cleaning suggestions. |
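As a rough illustration of how a client could request one of these prompts, the sketch below uses the official MCP Python SDK (`mcp` package). The `uvx databeak` launch command and the absence of prompt arguments are assumptions; check the server's actual prompt schemas before relying on this.

```python
import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client


async def main() -> None:
    # Assumed launch command -- replace with how you actually run the server.
    server = StdioServerParameters(command="uvx", args=["databeak"])
    async with stdio_client(server) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            prompts = await session.list_prompts()           # discover available prompts
            print([p.name for p in prompts.prompts])
            # Request the analysis prompt; its argument schema is not documented
            # here, so none are passed in this sketch.
            result = await session.get_prompt("analyze_csv_prompt")
            print(result.messages)


asyncio.run(main())
```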
Resources
Contextual data attached and managed by the client
| Name | Description |
|---|---|
| *(none)* | This server currently exposes no resources. |
Tools
Functions exposed to the LLM to take actions. A combined usage sketch follows the table.
| Name | Description |
|---|---|
| health_check | Check DataBeak server health and availability with memory monitoring. Returns server status, session capacity, memory usage, and version information. Use before large operations to verify system readiness and resource availability. |
| get_server_info | Get DataBeak server capabilities and supported operations. Returns server version, available tools, supported file formats, and resource limits. Use to discover what operations are available before planning workflows. |
| load_csv_from_url | Load CSV file from URL into DataBeak session. Downloads and parses CSV data with security validation. Returns session ID and data preview for further operations. |
| load_csv_from_content | Load CSV data from string content into DataBeak session. Parses CSV data directly from string with validation. Returns session ID and data preview for further operations. |
| get_session_info | Get comprehensive information about a specific session. Returns session metadata, data status, and configuration. Essential for session management and workflow coordination. |
| get_cell_value | Get value of specific cell with coordinate targeting. Supports column name or index targeting. Returns value with coordinates and data type information. |
| set_cell_value | Set value of specific cell with coordinate targeting. Supports column name or index, tracks old and new values. Returns operation result with coordinates and data type. |
| get_row_data | Get data from specific row with optional column filtering. Returns complete row data or filtered by column list. Converts pandas types for JSON serialization. |
| get_column_data | Get data from specific column with optional row range slicing. Supports row range filtering for focused analysis. Returns column values with range metadata. |
| insert_row | Insert new row at specified index with multiple data formats. Supports dict, list, and JSON string input with null value handling. Returns insertion result with before/after statistics. |
| delete_row | Delete row at specified index with comprehensive tracking. Captures deleted data for undo operations. Returns operation result with before/after statistics. |
| update_row | Update specific columns in row with selective updates. Supports partial column updates with change tracking. Returns old/new values for updated columns. |
| get_statistics | Get a statistical summary of numerical columns. Computes descriptive statistics (count, mean, standard deviation, min/max, and 25th/50th/75th percentiles) for all or selected numerical columns. Returns per-column summaries useful for data understanding, distribution assessment, and setting outlier-detection thresholds. Examples: `stats = await get_statistics("session_123")` (all numeric columns; percentiles always included); `stats = await get_statistics("session_123", columns=["price", "quantity"])` (specific columns only). |
| get_column_statistics | Get detailed statistical analysis for a single column, including the detected pandas data type, non-null count, and a full numerical summary when applicable. Useful for column-level quality assessment, modeling decisions, and validating transformations. Examples: `stats = await get_column_statistics(ctx, "price")`; `stats = await get_column_statistics(ctx, "category")`. |
| get_correlation_matrix | Calculate a correlation matrix for numerical columns. Computes pairwise correlations using Pearson (linear, default, assumes normality), Spearman (rank-based, monotonic), or Kendall (concordant/discordant pairs, robust for small samples). Useful for feature selection, multicollinearity detection, and understanding variable relationships. Examples: `corr = await get_correlation_matrix(ctx)`; `corr = await get_correlation_matrix(ctx, columns=["price", "rating", "sales"], method="spearman")`; `corr = await get_correlation_matrix(ctx, min_correlation=0.5)` (filter correlations above a threshold). |
| get_value_counts | Get the frequency distribution of values in a column: raw counts or normalized percentages, a configurable top-N limit, and summary statistics (total values, unique count). Useful for categorical analysis, encoding decisions, and spotting rare values. Examples: `counts = await get_value_counts(ctx, "category")`; `counts = await get_value_counts(ctx, "status", normalize=True, top_n=10)`; `counts = await get_value_counts(ctx, "grade", ascending=True)`. |
| detect_outliers | Detect outliers in numerical columns using Z-score (standard deviations), IQR (interquartile range, robust to distribution), or Isolation Forest (ML-based, suited to high-dimensional data). Returns outlier locations and severity scores; useful for data cleaning, anomaly detection, and ML preprocessing. Examples: `outliers = await detect_outliers(ctx, ["price", "quantity"])`; `outliers = await detect_outliers(ctx, ["sales"], method="iqr", threshold=2.5)`. |
| profile_data | Generate a comprehensive data profile: per-column data types, null patterns, uniqueness, statistical summaries, optional correlations and outlier detection, and memory usage. Useful for initial exploration, automated quality reporting, and planning preprocessing. Examples: `profile = await profile_data(ctx)`; `profile = await profile_data(ctx, include_correlations=False, include_outliers=False)` (skip the expensive computations). |
| group_by_aggregate | Group data and compute aggregations per group, with multiple aggregation functions per column (count, mean, median, sum, min, max, std, var, first, last, nunique). Useful for segmentation analysis, reporting, and understanding group-level patterns. Examples: `result = await group_by_aggregate(ctx, group_by=["region"], aggregations={"sales": ["sum", "mean", "count"]})`; `result = await group_by_aggregate(ctx, group_by=["category", "region"], aggregations={"price": ["mean", "std"], "quantity": ["sum", "count"]})`. |
| find_cells_with_value | Find all cells containing a specific value. Supports type-aware exact matching or substring search for string columns, optionally restricted to selected columns. Returns the row/column coordinates of each match plus summary statistics. Useful for error detection, validation, and data-cleaning guidance. Examples: `results = await find_cells_with_value(ctx, "ERROR")`; `results = await find_cells_with_value(ctx, "john", columns=["name", "email"], exact_match=False)`. |
| get_data_summary | Get a structural overview of the dataset: dimensions (rows, columns, shape), data type distribution, memory usage breakdown, and an optional row preview. A useful first step for exploration, analysis planning, and initial quality assessment. Examples: `summary = await get_data_summary(ctx)`; `summary = await get_data_summary(ctx, include_preview=False)`. |
| inspect_data_around | Inspect the data surrounding a specific cell. Returns a contextual view of the values within a configurable radius around the given row/column coordinate. Useful for error investigation, local pattern recognition, and validating transformations. Examples: `context = await inspect_data_around(ctx, row=50, column_name="price", radius=3)`; `context = await inspect_data_around(ctx, row=10, column_name="status", radius=1)`. |
| validate_schema | Validate data against a schema definition using the Pandera validation framework. The schema is dynamically converted to Pandera format and applied to the DataFrame for comprehensive validation coverage; see the Pandera documentation for details on its capabilities. Returns a `ValidateSchemaResult` with validation status and detailed error information. |
| check_data_quality | Check data quality against predefined or custom rules. Returns a `DataQualityResult` with a comprehensive quality assessment. |
| find_anomalies | Find anomalies in the data using multiple detection methods. Returns a `FindAnomaliesResult` with comprehensive anomaly detection results. |
| filter_rows | Filter rows using flexible conditions, with full null-value and text-matching support. Supports multiple operators and AND/OR logical combinations; optimized for AI-driven data analysis. Examples: `filter_rows(ctx, [{"column": "age", "operator": ">", "value": 25}])`; `filter_rows(ctx, [{"column": "name", "operator": "contains", "value": "Smith"}, {"column": "email", "operator": "is_not_null"}], mode="and")`; `filter_rows(ctx, [{"column": "status", "operator": "==", "value": "active"}, {"column": "priority", "operator": "==", "value": "high"}], mode="or")`. |
| sort_data | Sort data by one or more columns, each with its own direction, with comprehensive error handling for mixed data types. Examples: `sort_data(ctx, ["age"])`; `sort_data(ctx, [{"column": "department", "ascending": True}, {"column": "salary", "ascending": False}])`; `sort_data(ctx, [SortColumn(column="name", ascending=True), SortColumn(column="age", ascending=False)])` (type-safe `SortColumn` objects). |
| remove_duplicates | Remove duplicate rows, optionally based on a subset of columns, with configurable keep strategies and detailed deduplication statistics. Examples: `remove_duplicates(ctx)`; `remove_duplicates(ctx, subset=["email", "name"])`; `remove_duplicates(ctx, subset=["id"], keep="last")`; `remove_duplicates(ctx, subset=["email"], keep="none")` (drop all duplicates). |
| fill_missing_values | Fill or drop missing values using multiple strategies, including statistical imputation, with validation of strategy compatibility against column types. Examples: `fill_missing_values(ctx, strategy="drop")`; `fill_missing_values(ctx, strategy="fill", value=0)`; `fill_missing_values(ctx, strategy="forward", columns=["price", "quantity"])`; `fill_missing_values(ctx, strategy="mean", columns=["age", "salary"])`. |
| select_columns | Select specific columns from dataframe, removing all others. Validates column existence and reorders by selection order. Returns selection details with before/after column counts. |
| rename_columns | Rename columns using a mapping of old names to new names. Returns a dict with rename details. Examples: `rename_columns(ctx, {"old_col1": "new_col1", "old_col2": "new_col2"})`; `rename_columns(ctx, {"FirstName": "first_name", "LastName": "last_name", "EmailAddress": "email"})`. |
| add_column | Add a new column with a constant value, a list of values, or a computed formula. Returns a `ColumnOperationResult` with operation details. Examples: `add_column(ctx, "status", "active")`; `add_column(ctx, "scores", [85, 90, 78, 92, 88])`; `add_column(ctx, "total", formula="price * quantity")`; `add_column(ctx, "full_name", formula="first_name + ' ' + last_name")`. |
| remove_columns | Remove one or more columns from the dataframe. Returns a `ColumnOperationResult` with removal details. Examples: `remove_columns(ctx, ["temp_column"])`; `remove_columns(ctx, ["col1", "col2", "col3"])`; `remove_columns(ctx, ["_temp", "_backup", "old_value"])` (clean up after analysis). |
| change_column_type | Change the data type of a column (e.g., int, float, datetime, bool), with optional coercion of conversion errors to NaN. Returns a `ColumnOperationResult` with conversion details. Examples: `change_column_type(ctx, "age", "int")`; `change_column_type(ctx, "price", "float", errors="coerce")`; `change_column_type(ctx, "date", "datetime")`; `change_column_type(ctx, "is_active", "bool")`. |
| update_column | Update values in a column using discriminated-union operations (replace, map, fillna); the legacy `operation` format is still supported. Returns a `ColumnOperationResult` with update details. Examples: `update_column(ctx, "status", {"type": "replace", "pattern": "N/A", "replacement": "Unknown"})`; `update_column(ctx, "code", {"type": "map", "mapping": {"A": "Alpha", "B": "Beta"}})`; `update_column(ctx, "score", {"type": "fillna", "value": 0})`; `update_column(ctx, "score", {"operation": "fillna", "value": 0})` (legacy format). |
| replace_in_column | Replace patterns in a column with replacement text, using regex or literal matching. Returns a `ColumnOperationResult` with replacement details. Examples: `replace_in_column(ctx, "name", r"Mr\.", "Mister")`; `replace_in_column(ctx, "phone", r"\D", "", regex=True)` (strip non-digits); `replace_in_column(ctx, "status", "N/A", "Unknown", regex=False)`; `replace_in_column(ctx, "description", r"\s+", " ")` (collapse repeated whitespace). |
| extract_from_column | Extract patterns from a column using regex capturing groups, optionally expanding matches into multiple columns. Returns a `ColumnOperationResult` with extraction details. Examples: `extract_from_column(ctx, "email", r"(.+)@(.+)")`; `extract_from_column(ctx, "product_code", r"([A-Z]{2})-(\d+)")`; `extract_from_column(ctx, "full_name", r"(\w+)\s+(\w+)", expand=True)`; `extract_from_column(ctx, "date", r"\d{4}")` (extract year from a date string). |
| split_column | Split column values by a delimiter, keeping a single part or expanding into multiple columns. Returns a `ColumnOperationResult` with split details. Examples: `split_column(ctx, "full_name", " ", part_index=0)`; `split_column(ctx, "email", "@", part_index=1)`; `split_column(ctx, "address", ",", expand_to_columns=True)`; `split_column(ctx, "name", " ", expand_to_columns=True, new_columns=["first_name", "last_name"])`. |
| transform_column_case | Transform the case of text in a column (upper, lower, title, capitalize). Returns a `ColumnOperationResult` with transformation details. Examples: `transform_column_case(ctx, "code", "upper")`; `transform_column_case(ctx, "name", "title")`; `transform_column_case(ctx, "email", "lower")`; `transform_column_case(ctx, "description", "capitalize")`. |
| strip_column | Strip whitespace or specified characters from column values. Returns a `ColumnOperationResult` with strip details. Examples: `strip_column(ctx, "name")` (trim leading/trailing whitespace); `strip_column(ctx, "phone", "()")`; `strip_column(ctx, "price", "$,")` (clean currency values); `strip_column(ctx, "quoted_text", "'\"")` (remove quotes). |
| fill_column_nulls | Fill null/NaN values in a specific column with a given value. Returns a `ColumnOperationResult` with fill details. Examples: `fill_column_nulls(ctx, "name", "Unknown")`; `fill_column_nulls(ctx, "age", 0)`; `fill_column_nulls(ctx, "status", "pending")`; `fill_column_nulls(ctx, "score", -1)`. |
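Taken together, the tools above support a load → inspect → clean → analyze flow. The sketch below chains several of them in the same calling style as the per-tool examples (`ctx` stands for the session/context handle those examples pass first; a few examples use a session ID string instead). The CSV content, column names, and the exact `load_csv_from_content` parameters are illustrative assumptions, not documented signatures.

```python
# Illustrative workflow sketch; `ctx`, the CSV text, and the column names are
# assumptions made up for this example -- consult each tool's schema for the
# authoritative parameters.
csv_text = "region,price,quantity\nEU,10.5,3\nUS,N/A,7\nUS,12.0,\n"

await load_csv_from_content(ctx, csv_text)                       # parse CSV into a session
summary = await get_data_summary(ctx)                            # dimensions, dtypes, memory

await replace_in_column(ctx, "price", "N/A", "", regex=False)    # clear sentinel values
await change_column_type(ctx, "price", "float", errors="coerce")
await fill_missing_values(ctx, strategy="mean", columns=["price", "quantity"])

stats = await get_statistics(ctx, columns=["price", "quantity"])
by_region = await group_by_aggregate(
    ctx,
    group_by=["region"],
    aggregations={"price": ["mean"], "quantity": ["sum", "count"]},
)
```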