get_statistics
Compute descriptive statistics for numerical columns including count, mean, standard deviation, min/max values, and percentiles to analyze data distribution and quality.
Instructions
Get a comprehensive statistical summary of numerical columns.
Computes descriptive statistics for all numeric columns, or only for the columns you specify: count, mean, standard deviation, min/max values, and percentiles. The output is structured for use in AI workflows.
Returns: A statistical analysis with one summary per numeric column.
Statistical Metrics:
- Count: Number of non-null values
- Mean: Average value
- Std: Standard deviation (measure of spread)
- Min/Max: Minimum and maximum values
- Percentiles: 25th, 50th (median), and 75th percentiles (the quartiles)
Examples:

    # Get statistics for all numeric columns
    stats = await get_statistics("session_123")
AI Workflow Integration:
1. Essential for data understanding and quality assessment
2. Identifies data distribution and potential issues
3. Guides feature engineering and analysis decisions
4. Provides context for outlier detection thresholds
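
The metrics listed above can be illustrated with a small standard-library sketch. This is not the tool's implementation (which uses pandas); `summarize` is a hypothetical helper, and the 0.0 fallback for undefined values is an assumption taken from the handler shown in the Implementation Reference. The `method="inclusive"` setting interpolates linearly between data points, which should agree with pandas' default quantile behavior for quartiles.

```python
import math
import statistics


def summarize(values):
    """Summarize one column, skipping None and NaN entries (hypothetical helper)."""
    data = sorted(
        v for v in values
        if v is not None and not (isinstance(v, float) and math.isnan(v))
    )
    n = len(data)
    if n >= 2:
        # Quartiles by linear interpolation between neighbouring data points.
        q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
    elif n == 1:
        q1 = q2 = q3 = float(data[0])
    else:
        q1 = q2 = q3 = 0.0  # assumed fallback for an empty column
    return {
        "count": n,
        "mean": statistics.fmean(data) if n else 0.0,
        # Sample standard deviation needs at least two points.
        "std": statistics.stdev(data) if n > 1 else 0.0,
        "min": float(data[0]) if n else 0.0,
        "max": float(data[-1]) if n else 0.0,
        "25%": q1,
        "50%": q2,
        "75%": q3,
    }


summary = summarize([10.0, 20.0, None, 30.0, 40.0, 50.0])
print(summary)
```

The `None` entry is excluded from `count`, so the summary describes five values rather than six.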
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| columns | No | List of specific columns to analyze (None = all numeric columns) | None |
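
To make the effect of the `columns` parameter concrete, here is a minimal pandas sketch of the selection step. `select_numeric` and the sample DataFrame are hypothetical, and `KeyError` stands in for the project's `ColumnNotFoundError`; the sketch only assumes that selection relies on `DataFrame.select_dtypes`, as the handler in the Implementation Reference does.

```python
import numpy as np
import pandas as pd


def select_numeric(df, columns=None):
    """Return the numeric sub-frame for the requested columns (hypothetical helper)."""
    if columns:
        missing = [c for c in columns if c not in df.columns]
        if missing:
            # Stand-in for the project's ColumnNotFoundError.
            raise KeyError(missing[0])
        df = df[columns]
    # Keep only columns with a numeric dtype.
    return df.select_dtypes(include=[np.number])


df = pd.DataFrame(
    {"price": [9.5, 12.0, 7.25], "quantity": [3, 1, 4], "sku": ["a", "b", "c"]}
)

all_numeric = list(select_numeric(df).columns)
requested = list(select_numeric(df, ["price", "sku"]).columns)
print(all_numeric)
print(requested)
```

Under these assumptions, a requested column that exists but is not numeric (`sku` here) is dropped silently, while a column that does not exist raises an error.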
Implementation Reference
- The core handler function that implements the get_statistics tool logic. It retrieves the session data, selects numeric columns, computes descriptive statistics (count, mean, std, min, max, quartiles), and returns a StatisticsResult object.

      async def get_statistics(
          ctx: Annotated[Context, Field(description="FastMCP context for session access")],
          *,
          columns: Annotated[
              list[str] | None,
              Field(description="List of specific columns to analyze (None = all numeric columns)"),
          ] = None,
      ) -> StatisticsResult:
          """Get comprehensive statistical summary of numerical columns.

          Computes descriptive statistics for all or specified numerical columns including
          count, mean, standard deviation, min/max values, and percentiles. Optimized for AI
          workflows with clear statistical insights and data understanding.

          Returns:
              Comprehensive statistical analysis with per-column summaries

          Statistical Metrics:
              - Count: Number of non-null values
              - Mean: Average value
              - Std: Standard deviation (measure of spread)
              - Min/Max: Minimum and maximum values
              - Percentiles: 25th, 50th (median), 75th quartiles

          Examples:
              # Get statistics for all numeric columns
              stats = await get_statistics("session_123")

              # Analyze specific columns only
              stats = await get_statistics("session_123", columns=["price", "quantity"])

              # Analyze all numeric columns (percentiles always included)
              stats = await get_statistics("session_123")

          AI Workflow Integration:
              1. Essential for data understanding and quality assessment
              2. Identifies data distribution and potential issues
              3. Guides feature engineering and analysis decisions
              4. Provides context for outlier detection thresholds
          """
          # Get session_id from FastMCP context
          session_id = ctx.session_id
          _session, df = get_session_data(session_id)  # Only need df, not session

          # Select numeric columns
          if columns:
              missing_cols = [col for col in columns if col not in df.columns]
              if missing_cols:
                  raise ColumnNotFoundError(missing_cols[0], df.columns.tolist())

              numeric_df = df[columns].select_dtypes(include=[np.number])

              # Return empty results if no numeric columns found when specific columns requested
              if numeric_df.empty:
                  return StatisticsResult(
                      statistics={},
                      column_count=0,
                      numeric_columns=[],
                      total_rows=len(df),
                  )
          else:
              numeric_df = df.select_dtypes(include=[np.number])

              # Return empty results if no numeric columns
              if numeric_df.empty:
                  return StatisticsResult(
                      statistics={},
                      column_count=0,
                      numeric_columns=[],
                      total_rows=len(df),
                  )

          # Calculate statistics
          stats_dict = {}
          for col in numeric_df.columns:
              col_data = numeric_df[col].dropna()

              # Create StatisticsSummary directly
              # Calculate statistics, using 0.0 for undefined values
              col_stats = StatisticsSummary.model_validate(
                  {
                      "count": int(col_data.count()),
                      "mean": float(col_data.mean())
                      if len(col_data) > 0 and not pd.isna(col_data.mean())
                      else 0.0,
                      "std": float(col_data.std())
                      if len(col_data) > 1 and not pd.isna(col_data.std())
                      else 0.0,
                      "min": float(col_data.min())
                      if len(col_data) > 0 and not pd.isna(col_data.min())
                      else 0.0,
                      "max": float(col_data.max())
                      if len(col_data) > 0 and not pd.isna(col_data.max())
                      else 0.0,
                      "25%": float(col_data.quantile(0.25)) if len(col_data) > 0 else 0.0,
                      "50%": float(col_data.quantile(0.50)) if len(col_data) > 0 else 0.0,
                      "75%": float(col_data.quantile(0.75)) if len(col_data) > 0 else 0.0,
                  },
              )

              stats_dict[col] = col_stats

          # No longer recording operations (simplified MCP architecture)

          return StatisticsResult(
              statistics=stats_dict,
              column_count=len(stats_dict),
              numeric_columns=list(stats_dict.keys()),
              total_rows=len(df),
          )
- Pydantic model defining the output schema for the get_statistics tool response, including per-column statistics and dataset metadata.

      class StatisticsResult(BaseToolResponse):
          """Response model for dataset statistical analysis."""

          statistics: dict[str, StatisticsSummary] = Field(
              description="Statistical summary for each column",
          )
          column_count: int = Field(description="Total number of columns analyzed")
          numeric_columns: list[str] = Field(description="Names of numeric columns that were analyzed")
          total_rows: int = Field(description="Total number of rows in the dataset")
- Pydantic model used within StatisticsResult for individual column statistical summaries, supporting both numeric and categorical statistics.

      class StatisticsSummary(BaseModel):
          """Statistical summary for a single column."""

          model_config = ConfigDict(populate_by_name=True)

          count: int = Field(description="Total number of non-null values")
          mean: float | None = Field(default=None, description="Arithmetic mean (numeric columns only)")
          std: float | None = Field(default=None, description="Standard deviation (numeric columns only)")
          min: float | str | None = Field(default=None, description="Minimum value in the column")
          percentile_25: float | None = Field(
              default=None,
              alias="25%",
              description="25th percentile value (numeric columns only)",
          )
          percentile_50: float | None = Field(
              default=None,
              alias="50%",
              description="50th percentile/median value (numeric columns only)",
          )
          percentile_75: float | None = Field(
              default=None,
              alias="75%",
              description="75th percentile value (numeric columns only)",
          )
          max: float | str | None = Field(default=None, description="Maximum value in the column")

          # Categorical statistics fields
          unique: int | None = Field(
              None,
              description="Number of unique values (categorical columns only)",
          )
          top: str | None = Field(
              None,
              description="Most frequently occurring value (categorical columns only)",
          )
          freq: int | None = Field(
              None,
              description="Frequency of the most common value (categorical columns only)",
          )
- src/databeak/servers/statistics_server.py:510-510 (registration): Registration of the get_statistics handler as an MCP tool on the statistics_server FastMCP instance.

      statistics_server.tool(name="get_statistics")(get_statistics)