get_column_statistics
Analyze column data characteristics including data types, null values, and statistical summaries to assess data quality and understand feature distributions.
Instructions
Get detailed statistical analysis for a single column.
Provides focused statistical analysis for a specific column including data type information, null value handling, and comprehensive numerical statistics when applicable.
Returns: Detailed statistical analysis for the specified column
Column Analysis:
- Data Type: Detected pandas data type
- Statistics: Complete statistical summary for numeric columns
- Non-null Count: Number of valid (non-null) values
- Distribution: Statistical distribution characteristics
Examples:

```python
# Analyze a price column
stats = await get_column_statistics(ctx, "price")
```
AI Workflow Integration:
1. Deep dive analysis for specific columns of interest
2. Data quality assessment for individual features
3. Understanding column characteristics for modeling
4. Validation of data transformations
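The numeric-versus-categorical split this tool applies can be sketched with plain pandas. The data below is illustrative (the column names `price` and `category` are hypothetical), but the calls mirror the ones the handler makes:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "price": [9.99, 19.99, None, 19.99],
        "category": ["a", "b", "a", None],
    }
)

# Numeric column: the full statistical summary (mean, std, quartiles) applies
assert pd.api.types.is_numeric_dtype(df["price"])
print(df["price"].count())    # 3 -- non-null values
print(df["price"].nunique())  # 2 -- unique values

# Categorical column: mode and frequency are reported instead of mean/std
print(df["category"].mode().iloc[0])               # "a" -- most frequent value
print(int(df["category"].value_counts().iloc[0]))  # 2 -- its frequency
```

This is the same branch the handler takes: `is_numeric_dtype` (excluding booleans) selects the numeric path, everything else gets mode/frequency statistics.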
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| column | Yes | Name of the column to analyze in detail | (none) |
Implementation Reference
- The main handler function implementing the `get_column_statistics` tool logic. It computes detailed statistics for the specified column, handling both numeric and non-numeric data types with pandas:

```python
async def get_column_statistics(
    ctx: Annotated[Context, Field(description="FastMCP context for session access")],
    column: Annotated[str, Field(description="Name of the column to analyze in detail")],
) -> ColumnStatisticsResult:
    """Get detailed statistical analysis for a single column.

    Provides focused statistical analysis for a specific column including data
    type information, null value handling, and comprehensive numerical
    statistics when applicable.

    Returns:
        Detailed statistical analysis for the specified column

    Column Analysis:
        - Data Type: Detected pandas data type
        - Statistics: Complete statistical summary for numeric columns
        - Non-null Count: Number of valid (non-null) values
        - Distribution: Statistical distribution characteristics

    Examples:
        # Analyze a price column
        stats = await get_column_statistics(ctx, "price")

        # Analyze a categorical column
        stats = await get_column_statistics(ctx, "category")

    AI Workflow Integration:
        1. Deep dive analysis for specific columns of interest
        2. Data quality assessment for individual features
        3. Understanding column characteristics for modeling
        4. Validation of data transformations
    """
    # Get session_id from FastMCP context
    session_id = ctx.session_id
    _session, df = get_session_data(session_id)  # Only need df, not session

    if column not in df.columns:
        raise ColumnNotFoundError(column, df.columns.tolist())

    col_data = df[column]
    dtype = str(col_data.dtype)
    count = int(col_data.count())
    unique_count = int(col_data.nunique())

    # Helper function to safely convert pandas scalars to float
    def safe_float(value: Any) -> float:
        """Safely convert pandas scalar to float."""
        try:
            return float(value) if not pd.isna(value) else 0.0
        except (TypeError, ValueError):
            return 0.0

    # No longer recording operations (simplified MCP architecture)
    # Build StatisticsSummary directly
    if pd.api.types.is_numeric_dtype(col_data) and not pd.api.types.is_bool_dtype(col_data):
        # Numeric columns - calculate all statistics
        col_data_non_null = col_data.dropna()
        percentile_25 = (
            float(col_data_non_null.quantile(0.25)) if len(col_data_non_null) > 0 else None
        )
        percentile_50 = (
            float(col_data_non_null.quantile(0.50)) if len(col_data_non_null) > 0 else None
        )
        percentile_75 = (
            float(col_data_non_null.quantile(0.75)) if len(col_data_non_null) > 0 else None
        )
        stats_summary = StatisticsSummary(
            count=count,
            mean=safe_float(col_data.mean()),
            std=safe_float(col_data.std()),
            min=safe_float(col_data.min()),
            percentile_25=percentile_25,
            percentile_50=percentile_50,
            percentile_75=percentile_75,
            max=safe_float(col_data.max()),
            unique=unique_count,
        )
    else:
        # For non-numeric columns, populate categorical statistics
        # Calculate most frequent value for categorical columns
        most_frequent_val: str | None = None
        most_frequent_count: int | None = None
        if count > 0:
            mode_result = col_data.mode()
            if len(mode_result) > 0:
                mode_val = mode_result.iloc[0]
                if mode_val is not None and not pd.isna(mode_val):
                    most_frequent_val = str(mode_val)
                    most_frequent_count = int(col_data.value_counts().iloc[0])
        stats_summary = StatisticsSummary(
            count=count,
            mean=None,
            std=None,
            min=None,
            percentile_25=None,
            percentile_50=None,
            percentile_75=None,
            max=None,
            unique=unique_count,
            top=most_frequent_val,
            freq=most_frequent_count,
        )

    # Map dtype to expected literal type
    dtype_map: dict[
        str,
        Literal["int64", "float64", "object", "bool", "datetime64", "category"],
    ] = {
        "int64": "int64",
        "float64": "float64",
        "object": "object",
        "bool": "bool",
        "datetime64[ns]": "datetime64",
        "category": "category",
    }
    data_type: Literal["int64", "float64", "object", "bool", "datetime64", "category"] = (
        dtype_map.get(dtype, "object")
    )

    return ColumnStatisticsResult(
        column=column,
        statistics=stats_summary,
        data_type=data_type,
        non_null_count=count,
    )
```
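The `safe_float` helper in the handler exists because pandas reductions can return NaN (for example, the mean of an all-null or empty column), which would otherwise leak into the response model. Its behavior can be checked standalone:

```python
from typing import Any

import pandas as pd


def safe_float(value: Any) -> float:
    """Convert a pandas scalar to float, mapping NaN and conversion errors to 0.0."""
    try:
        return float(value) if not pd.isna(value) else 0.0
    except (TypeError, ValueError):
        return 0.0


s = pd.Series([1.0, None, 3.0])
print(safe_float(s.mean()))  # 2.0 -- Series.mean() skips NaN values
print(safe_float(pd.Series([], dtype=float).mean()))  # 0.0 -- mean of an empty series is NaN
```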
- Pydantic model defining the output response schema for the `get_column_statistics` tool:

```python
class ColumnStatisticsResult(BaseToolResponse):
    """Response model for individual column statistical analysis."""

    column: str = Field(description="Name of the analyzed column")
    statistics: StatisticsSummary = Field(description="Statistical summary for the column")
    data_type: Literal["int64", "float64", "object", "bool", "datetime64", "category"] = Field(
        description="Pandas data type of the column",
    )
    non_null_count: int = Field(description="Number of non-null values in the column")
```
- Pydantic model used within `ColumnStatisticsResult` for the detailed statistical summary of the column:

```python
class StatisticsSummary(BaseModel):
    """Statistical summary for a single column."""

    model_config = ConfigDict(populate_by_name=True)

    count: int = Field(description="Total number of non-null values")
    mean: float | None = Field(default=None, description="Arithmetic mean (numeric columns only)")
    std: float | None = Field(default=None, description="Standard deviation (numeric columns only)")
    min: float | str | None = Field(default=None, description="Minimum value in the column")
    percentile_25: float | None = Field(
        default=None,
        alias="25%",
        description="25th percentile value (numeric columns only)",
    )
    percentile_50: float | None = Field(
        default=None,
        alias="50%",
        description="50th percentile/median value (numeric columns only)",
    )
    percentile_75: float | None = Field(
        default=None,
        alias="75%",
        description="75th percentile value (numeric columns only)",
    )
    max: float | str | None = Field(default=None, description="Maximum value in the column")

    # Categorical statistics fields
    unique: int | None = Field(
        None,
        description="Number of unique values (categorical columns only)",
    )
    top: str | None = Field(
        None,
        description="Most frequently occurring value (categorical columns only)",
    )
    freq: int | None = Field(
        None,
        description="Frequency of the most common value (categorical columns only)",
    )
```
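The `"25%"`/`"50%"`/`"75%"` aliases on the percentile fields mirror the row labels pandas itself produces, so a `describe()` result can populate the model by alias. The labels can be confirmed with plain pandas:

```python
import pandas as pd

desc = pd.Series([1.0, 2.0, 3.0, 4.0]).describe()
# describe() labels the quartile rows "25%", "50%", "75%" -- the same names
# the StatisticsSummary aliases map onto percentile_25/50/75.
print(list(desc.index))
# ['count', 'mean', 'std', 'min', '25%', '50%', '75%', 'max']
```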
- src/databeak/servers/statistics_server.py:511 (registration): registers the `get_column_statistics` handler as an MCP tool on the `statistics_server` FastMCP instance.

```python
statistics_server.tool(name="get_column_statistics")(get_column_statistics)
```