get_statistics
Analyze numerical data columns to compute descriptive statistics including count, mean, standard deviation, min/max values, and percentiles for data quality assessment and distribution analysis.
Instructions
Get comprehensive statistical summary of numerical columns.
Computes descriptive statistics for all or specified numerical columns including count, mean, standard deviation, min/max values, and percentiles. Optimized for AI workflows with clear statistical insights and data understanding.
Returns: Comprehensive statistical analysis with per-column summaries
Statistical Metrics:
- Count: Number of non-null values
- Mean: Average value
- Std: Standard deviation (measure of spread)
- Min/Max: Minimum and maximum values
- Percentiles: 25th, 50th (median), 75th quartiles
Examples:

```python
# Get statistics for all numeric columns
stats = await get_statistics("session_123")
```
AI Workflow Integration:
1. Essential for data understanding and quality assessment
2. Identifies data distribution and potential issues
3. Guides feature engineering and analysis decisions
4. Provides context for outlier detection thresholds
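The metrics listed above can be reproduced with plain pandas; a minimal standalone sketch of the per-column computation (the handler applies the same calls to each numeric column after dropping nulls):

```python
import pandas as pd

# Sample column with one null value; nulls are excluded before computing stats
col = pd.Series([10.0, 20.0, 30.0, None]).dropna()

stats = {
    "count": int(col.count()),              # non-null values
    "mean": float(col.mean()),              # average
    "std": float(col.std()),                # sample standard deviation
    "min": float(col.min()),
    "max": float(col.max()),
    "25%": float(col.quantile(0.25)),       # first quartile
    "50%": float(col.quantile(0.50)),       # median
    "75%": float(col.quantile(0.75)),       # third quartile
}
# stats -> {'count': 3, 'mean': 20.0, 'std': 10.0, 'min': 10.0, 'max': 30.0,
#           '25%': 15.0, '50%': 20.0, '75%': 25.0}
```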
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| columns | No | List of specific columns to analyze (None = all numeric columns) | None |
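A sketch of how the `columns` parameter interacts with dtype filtering, using standalone pandas to mirror the handler's selection logic (the DataFrame contents here are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"price": [9.99, 14.50], "name": ["widget", "gadget"]})

columns = ["price", "name"]  # caller-requested subset
missing = [c for c in columns if c not in df.columns]
assert not missing  # the real handler raises ColumnNotFoundError here

# Requested columns that are non-numeric are silently dropped by the dtype filter
numeric_df = df[columns].select_dtypes(include=[np.number])
# numeric_df has only the "price" column
```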
Implementation Reference
- The main handler function that executes the get_statistics tool logic: loads session data, selects numeric columns, computes descriptive statistics (count, mean, std, min, max, quartiles), handles edge cases like empty data or missing columns, and returns a structured StatisticsResult.

```python
async def get_statistics(
    ctx: Annotated[Context, Field(description="FastMCP context for session access")],
    *,
    columns: Annotated[
        list[str] | None,
        Field(description="List of specific columns to analyze (None = all numeric columns)"),
    ] = None,
) -> StatisticsResult:
    """Get comprehensive statistical summary of numerical columns.

    Computes descriptive statistics for all or specified numerical columns
    including count, mean, standard deviation, min/max values, and percentiles.
    Optimized for AI workflows with clear statistical insights and data
    understanding.

    Returns:
        Comprehensive statistical analysis with per-column summaries

    Statistical Metrics:
        - Count: Number of non-null values
        - Mean: Average value
        - Std: Standard deviation (measure of spread)
        - Min/Max: Minimum and maximum values
        - Percentiles: 25th, 50th (median), 75th quartiles

    Examples:
        # Get statistics for all numeric columns
        stats = await get_statistics("session_123")

        # Analyze specific columns only
        stats = await get_statistics("session_123", columns=["price", "quantity"])

        # Analyze all numeric columns (percentiles always included)
        stats = await get_statistics("session_123")

    AI Workflow Integration:
        1. Essential for data understanding and quality assessment
        2. Identifies data distribution and potential issues
        3. Guides feature engineering and analysis decisions
        4. Provides context for outlier detection thresholds
    """
    # Get session_id from FastMCP context
    session_id = ctx.session_id
    _session, df = get_session_data(session_id)  # Only need df, not session

    # Select numeric columns
    if columns:
        missing_cols = [col for col in columns if col not in df.columns]
        if missing_cols:
            raise ColumnNotFoundError(missing_cols[0], df.columns.tolist())
        numeric_df = df[columns].select_dtypes(include=[np.number])
        # Return empty results if no numeric columns found when specific columns requested
        if numeric_df.empty:
            return StatisticsResult(
                statistics={},
                column_count=0,
                numeric_columns=[],
                total_rows=len(df),
            )
    else:
        numeric_df = df.select_dtypes(include=[np.number])
        # Return empty results if no numeric columns
        if numeric_df.empty:
            return StatisticsResult(
                statistics={},
                column_count=0,
                numeric_columns=[],
                total_rows=len(df),
            )

    # Calculate statistics
    stats_dict = {}
    for col in numeric_df.columns:
        col_data = numeric_df[col].dropna()

        # Create StatisticsSummary directly, using 0.0 for undefined values
        col_stats = StatisticsSummary.model_validate(
            {
                "count": int(col_data.count()),
                "mean": float(col_data.mean())
                if len(col_data) > 0 and not pd.isna(col_data.mean())
                else 0.0,
                "std": float(col_data.std())
                if len(col_data) > 1 and not pd.isna(col_data.std())
                else 0.0,
                "min": float(col_data.min())
                if len(col_data) > 0 and not pd.isna(col_data.min())
                else 0.0,
                "max": float(col_data.max())
                if len(col_data) > 0 and not pd.isna(col_data.max())
                else 0.0,
                "25%": float(col_data.quantile(0.25)) if len(col_data) > 0 else 0.0,
                "50%": float(col_data.quantile(0.50)) if len(col_data) > 0 else 0.0,
                "75%": float(col_data.quantile(0.75)) if len(col_data) > 0 else 0.0,
            },
        )
        stats_dict[col] = col_stats

    # No longer recording operations (simplified MCP architecture)
    return StatisticsResult(
        statistics=stats_dict,
        column_count=len(stats_dict),
        numeric_columns=list(stats_dict.keys()),
        total_rows=len(df),
    )
```
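The `len(col_data) > 1` guard on `std` exists because pandas returns NaN for the sample standard deviation of a single observation. A standalone sketch of that edge case, isolated from the handler:

```python
import pandas as pd

# A single observation: pandas' sample std (ddof=1) is NaN,
# so the handler's guard substitutes 0.0 instead of emitting NaN.
col_data = pd.Series([42.0])
std = float(col_data.std()) if len(col_data) > 1 and not pd.isna(col_data.std()) else 0.0
# std == 0.0
```

The same pattern protects `mean`, `min`, and `max` against fully-null columns, where every aggregate would otherwise be NaN.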
- Pydantic models defining the input/output schema for get_statistics: StatisticsSummary for individual column statistics and StatisticsResult for the overall response containing statistics for multiple columns.

```python
class StatisticsSummary(BaseModel):
    """Statistical summary for a single column."""

    model_config = ConfigDict(populate_by_name=True)

    count: int = Field(description="Total number of non-null values")
    mean: float | None = Field(default=None, description="Arithmetic mean (numeric columns only)")
    std: float | None = Field(default=None, description="Standard deviation (numeric columns only)")
    min: float | str | None = Field(default=None, description="Minimum value in the column")
    percentile_25: float | None = Field(
        default=None,
        alias="25%",
        description="25th percentile value (numeric columns only)",
    )
    percentile_50: float | None = Field(
        default=None,
        alias="50%",
        description="50th percentile/median value (numeric columns only)",
    )
    percentile_75: float | None = Field(
        default=None,
        alias="75%",
        description="75th percentile value (numeric columns only)",
    )
    max: float | str | None = Field(default=None, description="Maximum value in the column")

    # Categorical statistics fields
    unique: int | None = Field(
        None,
        description="Number of unique values (categorical columns only)",
    )
    top: str | None = Field(
        None,
        description="Most frequently occurring value (categorical columns only)",
    )
    freq: int | None = Field(
        None,
        description="Frequency of the most common value (categorical columns only)",
    )


class StatisticsResult(BaseToolResponse):
    """Response model for dataset statistical analysis."""

    statistics: dict[str, StatisticsSummary] = Field(
        description="Statistical summary for each column",
    )
    column_count: int = Field(description="Total number of columns analyzed")
    numeric_columns: list[str] = Field(description="Names of numeric columns that were analyzed")
    total_rows: int = Field(description="Total number of rows in the dataset")
```
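Because of the `25%`/`50%`/`75%` aliases, the wire-format keys differ from the Python attribute names (`percentile_25`, etc.); Pydantic's `populate_by_name=True` allows either form on input. A dependency-free sketch of the output-side renaming (the `to_wire` helper and `ALIASES` table are illustrative, standing in for `model_dump(by_alias=True)`):

```python
# Field-name -> JSON-alias mapping mirroring StatisticsSummary's percentile aliases
ALIASES = {"percentile_25": "25%", "percentile_50": "50%", "percentile_75": "75%"}


def to_wire(stats: dict) -> dict:
    """Rename percentile fields to their wire aliases; other keys pass through."""
    return {ALIASES.get(key, key): value for key, value in stats.items()}


payload = to_wire({"count": 3, "percentile_50": 20.0})
# payload == {"count": 3, "50%": 20.0}
```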
- src/databeak/servers/statistics_server.py:502-513 (registration): creates the FastMCP statistics_server instance and registers get_statistics (along with related tools) as MCP tools.

```python
# Create Statistics server
statistics_server = FastMCP(
    "DataBeak-Statistics",
    instructions="Statistics and correlation analysis server for DataBeak with comprehensive numerical analysis capabilities",
)

# Register the statistical analysis functions directly as MCP tools
statistics_server.tool(name="get_statistics")(get_statistics)
statistics_server.tool(name="get_column_statistics")(get_column_statistics)
statistics_server.tool(name="get_correlation_matrix")(get_correlation_matrix)
statistics_server.tool(name="get_value_counts")(get_value_counts)
```