# get_data_summary
Analyze dataset structure, dimensions, data types, and memory usage to understand data characteristics for exploration and analysis planning.
## Instructions
Get comprehensive data overview and structural summary.
Provides a high-level overview of dataset structure, dimensions, data types, and memory usage. An essential first step in data exploration and analysis planning workflows.
Returns: Comprehensive data overview with structural information
Summary Components:

- Dimensions: rows, columns, shape information
- Data Types: column type distribution and analysis
- Memory Usage: resource consumption breakdown
- Preview: sample rows for quick data understanding (optional)
- Overview: high-level dataset characteristics
Examples:

```python
# Full data summary with preview
summary = await get_data_summary(ctx)

# Structure summary without preview data
summary = await get_data_summary(ctx, include_preview=False)
```
AI Workflow Integration:

1. Initial data exploration and understanding
2. Planning analytical approaches based on data structure
3. Resource planning for large dataset processing
4. Data quality initial assessment
## Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| include_preview | No | Include sample data rows in summary | `true` |
| max_preview_rows | No | Maximum number of preview rows to include | `10` |
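To make the documented components concrete, the sketch below approximates the same summary (shape, dtypes, missing percentage, memory usage, optional preview) with plain pandas. The `summarize` helper is hypothetical and illustrative only; it is not part of the DataBeak API.

```python
import pandas as pd

def summarize(df: pd.DataFrame, include_preview: bool = True, max_preview_rows: int = 10) -> dict:
    """Hypothetical sketch of the summary components this tool reports."""
    total_cells = len(df) * len(df.columns)
    total_missing = int(df.isna().sum().sum())
    return {
        "shape": {"rows": len(df), "columns": len(df.columns)},
        "data_types": {str(c): str(t) for c, t in df.dtypes.items()},
        # Guard against division by zero for empty frames
        "missing_percentage": round(total_missing / total_cells * 100, 2) if total_cells else 0.0,
        "memory_usage_mb": round(df.memory_usage(deep=True).sum() / (1024 * 1024), 2),
        "preview": df.head(max_preview_rows).to_dict("records") if include_preview else None,
    }

df = pd.DataFrame({"a": [1, 2, None], "b": ["x", "y", "z"]})
result = summarize(df, max_preview_rows=2)
```

Passing `include_preview=False` drops the sample rows, which keeps the response small for wide or long datasets.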
## Implementation Reference
- The main handler function that executes the `get_data_summary` tool. It retrieves the session data, then computes shape, column info, data types, missing-data statistics, memory usage, and an optional data preview.

```python
async def get_data_summary(
    ctx: Annotated[Context, Field(description="FastMCP context for session access")],
    *,
    include_preview: Annotated[
        bool,
        Field(description="Include sample data rows in summary"),
    ] = True,
    max_preview_rows: Annotated[
        int,
        Field(description="Maximum number of preview rows to include"),
    ] = 10,
) -> DataSummaryResult:
    """Get comprehensive data overview and structural summary.

    Provides a high-level overview of dataset structure, dimensions, data
    types, and memory usage. Essential first step in data exploration and
    analysis planning workflows.

    Returns:
        Comprehensive data overview with structural information

    Summary Components:
        - Dimensions: rows, columns, shape information
        - Data Types: column type distribution and analysis
        - Memory Usage: resource consumption breakdown
        - Preview: sample rows for quick data understanding (optional)
        - Overview: high-level dataset characteristics

    Examples:
        # Full data summary with preview
        summary = await get_data_summary(ctx)

        # Structure summary without preview data
        summary = await get_data_summary(ctx, include_preview=False)

    AI Workflow Integration:
        1. Initial data exploration and understanding
        2. Planning analytical approaches based on data structure
        3. Resource planning for large dataset processing
        4. Data quality initial assessment
    """
    # Get session_id from FastMCP context
    session_id = ctx.session_id
    _session, df = get_session_data(session_id)

    # Describe the coordinate system used for row/column addressing
    coordinate_system = {
        "row_indexing": f"0 to {len(df) - 1} (0-based)",
        "column_indexing": "Use column names or 0-based indices",
    }

    shape = {"rows": len(df), "columns": len(df.columns)}

    # Build a DataTypeInfo object for each column
    columns_info = {}
    for col in df.columns:
        col_dtype = str(df[col].dtype)
        # Map pandas dtypes to the Pydantic model's literal type names
        if "int" in col_dtype:
            mapped_dtype = "int64"
        elif "float" in col_dtype:
            mapped_dtype = "float64"
        elif "bool" in col_dtype:
            mapped_dtype = "bool"
        elif "datetime" in col_dtype:
            mapped_dtype = "datetime64"
        elif "category" in col_dtype:
            mapped_dtype = "category"
        else:
            mapped_dtype = "object"
        columns_info[str(col)] = DataTypeInfo(
            type=cast(
                "Literal['int64', 'float64', 'object', 'bool', 'datetime64', 'category']",
                mapped_dtype,
            ),
            nullable=bool(df[col].isna().any()),
            unique_count=int(df[col].nunique()),
            null_count=int(df[col].isna().sum()),
        )

    # Categorize columns by broad type (column names converted to strings)
    data_types = {
        "numeric": [str(col) for col in df.select_dtypes(include=["number"]).columns],
        "text": [str(col) for col in df.select_dtypes(include=["object"]).columns],
        "datetime": [str(col) for col in df.select_dtypes(include=["datetime"]).columns],
        "boolean": [str(col) for col in df.select_dtypes(include=["bool"]).columns],
    }

    # Missing-data statistics (guarding against division by zero for empty frames)
    total_missing = int(df.isna().sum().sum())
    missing_by_column = {str(col): int(df[col].isna().sum()) for col in df.columns}
    total_cells = len(df) * len(df.columns)
    missing_percentage = round(total_missing / total_cells * 100, 2) if total_cells > 0 else 0.0
    missing_data = MissingDataInfo(
        total_missing=total_missing,
        missing_by_column=missing_by_column,
        missing_percentage=missing_percentage,
    )

    # Optional data preview
    if include_preview:
        preview_data = create_data_preview_with_indices(df, num_rows=max_preview_rows)
        preview = DataPreview(
            rows=preview_data.get("records", []),
            row_count=preview_data.get("total_rows", 0),
            column_count=preview_data.get("total_columns", 0),
            truncated=preview_data.get("preview_rows", 0) < preview_data.get("total_rows", 0),
        )
    else:
        preview = None

    # Memory usage in megabytes
    memory_usage_mb = round(df.memory_usage(deep=True).sum() / (1024 * 1024), 2)

    return DataSummaryResult(
        coordinate_system=coordinate_system,
        shape=shape,
        columns=columns_info,
        data_types=data_types,
        missing_data=missing_data,
        memory_usage_mb=memory_usage_mb,
        preview=preview,
    )
```
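The handler's dtype branching can be isolated into a small pure function for testing. The sketch below mirrors that substring-matching logic; `map_dtype` is a hypothetical helper name, not part of the DataBeak source.

```python
def map_dtype(col_dtype: str) -> str:
    """Hypothetical sketch: map a pandas dtype string to a schema literal.

    Mirrors the handler's branching: substring checks, with "object"
    as the fallback for anything unrecognized.
    """
    if "int" in col_dtype:
        return "int64"
    if "float" in col_dtype:
        return "float64"
    if "bool" in col_dtype:
        return "bool"
    if "datetime" in col_dtype:
        return "datetime64"
    if "category" in col_dtype:
        return "category"
    return "object"
```

Note that the substring checks widen narrow dtypes: `int32` and `uint8` both map to `int64`, and `float32` to `float64`, because the response schema only admits the six literal names.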
- The Pydantic model defining the response structure for the `get_data_summary` tool, covering shape, columns, data types, missing data, memory usage, and the optional preview.

```python
class DataSummaryResult(BaseToolResponse):
    """Response model for data overview and summary."""

    coordinate_system: dict[str, str]
    shape: dict[str, int]
    columns: dict[str, DataTypeInfo]
    data_types: dict[str, list[str]]
    missing_data: MissingDataInfo
    memory_usage_mb: float
    preview: DataPreview | None = None
```
- `src/databeak/servers/discovery_server.py:855` (registration): FastMCP tool registration decorator that registers the `get_data_summary` function as a tool named `get_data_summary` on the `discovery_server`.

```python
discovery_server.tool(name="get_data_summary")(get_data_summary)
```
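The registration line applies the decorator factory as a plain call rather than with `@` syntax, which lets the handler be defined once and registered on a server elsewhere. The sketch below illustrates that pattern with a hypothetical registry class; it is not FastMCP itself.

```python
from typing import Callable

class ToolRegistry:
    """Hypothetical stand-in for a FastMCP server's tool registration."""

    def __init__(self) -> None:
        self.tools: dict[str, Callable] = {}

    def tool(self, name: str) -> Callable[[Callable], Callable]:
        # Decorator factory: returns a decorator that records the function
        def register(fn: Callable) -> Callable:
            self.tools[name] = fn
            return fn
        return register

discovery_server = ToolRegistry()

def get_data_summary():
    return "summary"

# Same decorator-as-plain-call pattern as the registration line above
discovery_server.tool(name="get_data_summary")(get_data_summary)
```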