get_column_statistics
Analyze column data characteristics including data types, null values, and statistical summaries to assess data quality and understand feature distributions.
Instructions
Get detailed statistical analysis for a single column.
Provides focused statistical analysis for a specific column including data type information, null value handling, and comprehensive numerical statistics when applicable.
Returns: Detailed statistical analysis for the specified column
Column Analysis:
- Data Type: Detected pandas data type
- Statistics: Complete statistical summary for numeric columns
- Non-null Count: Number of valid (non-null) values
- Distribution: Statistical distribution characteristics
Examples:

```python
# Analyze a price column
stats = await get_column_statistics(ctx, "price")
```
AI Workflow Integration:
1. Deep dive analysis for specific columns of interest
2. Data quality assessment for individual features
3. Understanding column characteristics for modeling
4. Validation of data transformations
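The numeric-versus-categorical split this tool applies can be sketched with plain pandas. The data below is illustrative (the column names `price` and `category` are hypothetical), but the calls mirror the ones the handler makes:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "price": [9.99, 19.99, None, 19.99],
        "category": ["a", "b", "a", None],
    }
)

# Numeric column: the full statistical summary (mean, std, quartiles) applies
assert pd.api.types.is_numeric_dtype(df["price"])
print(df["price"].count())    # 3 -- non-null values
print(df["price"].nunique())  # 2 -- unique values

# Categorical column: mode and frequency are reported instead of mean/std
print(df["category"].mode().iloc[0])               # "a" -- most frequent value
print(int(df["category"].value_counts().iloc[0]))  # 2 -- its frequency
```

This is the same branch the handler takes: `is_numeric_dtype` (excluding booleans) selects the numeric path, everything else gets mode/frequency statistics.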
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| column | Yes | Name of the column to analyze in detail | (none) |
Implementation Reference
- The main handler function implementing the `get_column_statistics` tool logic. It computes detailed statistics for the specified column, handling both numeric and non-numeric data types with pandas:

```python
async def get_column_statistics(
    ctx: Annotated[Context, Field(description="FastMCP context for session access")],
    column: Annotated[str, Field(description="Name of the column to analyze in detail")],
) -> ColumnStatisticsResult:
    """Get detailed statistical analysis for a single column.

    Provides focused statistical analysis for a specific column including data
    type information, null value handling, and comprehensive numerical
    statistics when applicable.

    Returns:
        Detailed statistical analysis for the specified column

    Column Analysis:
        - Data Type: Detected pandas data type
        - Statistics: Complete statistical summary for numeric columns
        - Non-null Count: Number of valid (non-null) values
        - Distribution: Statistical distribution characteristics

    Examples:
        # Analyze a price column
        stats = await get_column_statistics(ctx, "price")

        # Analyze a categorical column
        stats = await get_column_statistics(ctx, "category")

    AI Workflow Integration:
        1. Deep dive analysis for specific columns of interest
        2. Data quality assessment for individual features
        3. Understanding column characteristics for modeling
        4. Validation of data transformations
    """
    # Get session_id from FastMCP context
    session_id = ctx.session_id
    _session, df = get_session_data(session_id)  # Only need df, not session

    if column not in df.columns:
        raise ColumnNotFoundError(column, df.columns.tolist())

    col_data = df[column]
    dtype = str(col_data.dtype)
    count = int(col_data.count())
    unique_count = int(col_data.nunique())

    # Helper function to safely convert pandas scalars to float
    def safe_float(value: Any) -> float:
        """Safely convert pandas scalar to float."""
        try:
            return float(value) if not pd.isna(value) else 0.0
        except (TypeError, ValueError):
            return 0.0

    # No longer recording operations (simplified MCP architecture)
    # Build StatisticsSummary directly
    if pd.api.types.is_numeric_dtype(col_data) and not pd.api.types.is_bool_dtype(col_data):
        # Numeric columns - calculate all statistics
        col_data_non_null = col_data.dropna()
        percentile_25 = (
            float(col_data_non_null.quantile(0.25)) if len(col_data_non_null) > 0 else None
        )
        percentile_50 = (
            float(col_data_non_null.quantile(0.50)) if len(col_data_non_null) > 0 else None
        )
        percentile_75 = (
            float(col_data_non_null.quantile(0.75)) if len(col_data_non_null) > 0 else None
        )
        stats_summary = StatisticsSummary(
            count=count,
            mean=safe_float(col_data.mean()),
            std=safe_float(col_data.std()),
            min=safe_float(col_data.min()),
            percentile_25=percentile_25,
            percentile_50=percentile_50,
            percentile_75=percentile_75,
            max=safe_float(col_data.max()),
            unique=unique_count,
        )
    else:
        # For non-numeric columns, populate categorical statistics
        # Calculate most frequent value for categorical columns
        most_frequent_val: str | None = None
        most_frequent_count: int | None = None
        if count > 0:
            mode_result = col_data.mode()
            if len(mode_result) > 0:
                mode_val = mode_result.iloc[0]
                if mode_val is not None and not pd.isna(mode_val):
                    most_frequent_val = str(mode_val)
                    most_frequent_count = int(col_data.value_counts().iloc[0])
        stats_summary = StatisticsSummary(
            count=count,
            mean=None,
            std=None,
            min=None,
            percentile_25=None,
            percentile_50=None,
            percentile_75=None,
            max=None,
            unique=unique_count,
            top=most_frequent_val,
            freq=most_frequent_count,
        )

    # Map dtype to expected literal type
    dtype_map: dict[
        str,
        Literal["int64", "float64", "object", "bool", "datetime64", "category"],
    ] = {
        "int64": "int64",
        "float64": "float64",
        "object": "object",
        "bool": "bool",
        "datetime64[ns]": "datetime64",
        "category": "category",
    }
    data_type: Literal["int64", "float64", "object", "bool", "datetime64", "category"] = (
        dtype_map.get(dtype, "object")
    )

    return ColumnStatisticsResult(
        column=column,
        statistics=stats_summary,
        data_type=data_type,
        non_null_count=count,
    )
```
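The `safe_float` helper in the handler exists because pandas reductions can return NaN (for example, the mean of an all-null or empty column), which would otherwise leak into the response model. Its behavior can be checked standalone:

```python
from typing import Any

import pandas as pd


def safe_float(value: Any) -> float:
    """Convert a pandas scalar to float, mapping NaN and conversion errors to 0.0."""
    try:
        return float(value) if not pd.isna(value) else 0.0
    except (TypeError, ValueError):
        return 0.0


s = pd.Series([1.0, None, 3.0])
print(safe_float(s.mean()))  # 2.0 -- Series.mean() skips NaN values
print(safe_float(pd.Series([], dtype=float).mean()))  # 0.0 -- mean of an empty series is NaN
```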
- Pydantic model defining the output response schema for the `get_column_statistics` tool:

```python
class ColumnStatisticsResult(BaseToolResponse):
    """Response model for individual column statistical analysis."""

    column: str = Field(description="Name of the analyzed column")
    statistics: StatisticsSummary = Field(description="Statistical summary for the column")
    data_type: Literal["int64", "float64", "object", "bool", "datetime64", "category"] = Field(
        description="Pandas data type of the column",
    )
    non_null_count: int = Field(description="Number of non-null values in the column")
```
- Pydantic model used within `ColumnStatisticsResult` for the detailed statistical summary of the column:

```python
class StatisticsSummary(BaseModel):
    """Statistical summary for a single column."""

    model_config = ConfigDict(populate_by_name=True)

    count: int = Field(description="Total number of non-null values")
    mean: float | None = Field(default=None, description="Arithmetic mean (numeric columns only)")
    std: float | None = Field(default=None, description="Standard deviation (numeric columns only)")
    min: float | str | None = Field(default=None, description="Minimum value in the column")
    percentile_25: float | None = Field(
        default=None,
        alias="25%",
        description="25th percentile value (numeric columns only)",
    )
    percentile_50: float | None = Field(
        default=None,
        alias="50%",
        description="50th percentile/median value (numeric columns only)",
    )
    percentile_75: float | None = Field(
        default=None,
        alias="75%",
        description="75th percentile value (numeric columns only)",
    )
    max: float | str | None = Field(default=None, description="Maximum value in the column")

    # Categorical statistics fields
    unique: int | None = Field(
        None,
        description="Number of unique values (categorical columns only)",
    )
    top: str | None = Field(
        None,
        description="Most frequently occurring value (categorical columns only)",
    )
    freq: int | None = Field(
        None,
        description="Frequency of the most common value (categorical columns only)",
    )
```
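The `"25%"`/`"50%"`/`"75%"` aliases on the percentile fields mirror the row labels pandas itself produces, so a `describe()` result can populate the model by alias. The labels can be confirmed with plain pandas:

```python
import pandas as pd

desc = pd.Series([1.0, 2.0, 3.0, 4.0]).describe()
# describe() labels the quartile rows "25%", "50%", "75%" -- the same names
# the StatisticsSummary aliases map onto percentile_25/50/75.
print(list(desc.index))
# ['count', 'mean', 'std', 'min', '25%', '50%', '75%', 'max']
```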
- src/databeak/servers/statistics_server.py:511 (registration): registers the `get_column_statistics` handler as an MCP tool on the `statistics_server` FastMCP instance.

```python
statistics_server.tool(name="get_column_statistics")(get_column_statistics)
```