# profile_data
Generate comprehensive data profiles with statistical insights, column characteristics, and analytical summaries to understand dataset structure and quality for data analysis workflows.
## Instructions
Generate comprehensive data profile with statistical insights.
Creates a complete analytical profile of the dataset including column characteristics, data types, null patterns, and statistical summaries. Provides holistic data understanding for analytical workflows.
Returns: Comprehensive data profile with multi-dimensional analysis
Profile Components:
- Column Profiles: data types, null patterns, uniqueness
- Statistical Summaries: numerical column characteristics
- Correlations: inter-variable relationships (optional)
- Outliers: anomaly detection across columns (optional)
- Memory Usage: resource consumption analysis
Examples:

```python
# Full data profile
profile = await profile_data(ctx)
```
AI Workflow Integration:
1. Initial data exploration and understanding
2. Automated data quality reporting
3. Feature engineering guidance
4. Data preprocessing strategy development
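The per-column components listed above reduce to a handful of pandas operations. A minimal standalone sketch (toy DataFrame; the column names are illustrative, not part of the tool's API):

```python
import pandas as pd

# Toy DataFrame standing in for a loaded session's data.
df = pd.DataFrame({
    "city": ["Oslo", "Oslo", None, "Bergen"],
    "temp": [3.1, 4.0, 4.0, None],
})

profile = {}
for col in df.columns:
    s = df[col]
    counts = s.value_counts(dropna=False)  # include nulls so they can be the mode
    profile[col] = {
        "data_type": str(s.dtype),
        "null_count": int(s.isna().sum()),
        "null_percentage": round(s.isna().sum() / len(df) * 100, 2),
        "unique_count": int(s.nunique()),  # nunique() excludes nulls
        "most_frequent": counts.index[0] if len(counts) else None,
        "frequency": int(counts.iloc[0]) if len(counts) else None,
    }

# Memory footprint in MB, matching the handler's deep accounting
memory_mb = round(df.memory_usage(deep=True).sum() / (1024 * 1024), 2)
```

Note the asymmetry the real handler also has: `value_counts(dropna=False)` counts nulls, while `nunique()` ignores them.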
## Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| _No arguments_ | | | |
## Implementation Reference
- The handler function that implements the core logic of the `profile_data` tool. It retrieves the session data, computes profiling statistics for each column (nulls, uniques, most frequent value), calculates memory usage, and returns a `ProfileResult`.

```python
async def profile_data(
    ctx: Annotated[Context, Field(description="FastMCP context for session access")],
) -> ProfileResult:
    """Generate comprehensive data profile with statistical insights.

    Creates a complete analytical profile of the dataset including column
    characteristics, data types, null patterns, and statistical summaries.
    Provides holistic data understanding for analytical workflows.

    Returns:
        Comprehensive data profile with multi-dimensional analysis

    Profile Components:
        - Column Profiles: data types, null patterns, uniqueness
        - Statistical Summaries: numerical column characteristics
        - Correlations: inter-variable relationships (optional)
        - Outliers: anomaly detection across columns (optional)
        - Memory Usage: resource consumption analysis

    Examples:
        # Full data profile
        profile = await profile_data(ctx)

        # Quick profile without expensive computations
        profile = await profile_data(ctx, include_correlations=False, include_outliers=False)

    AI Workflow Integration:
        1. Initial data exploration and understanding
        2. Automated data quality reporting
        3. Feature engineering guidance
        4. Data preprocessing strategy development
    """
    # Get session_id from FastMCP context
    session_id = ctx.session_id
    _session, df = get_session_data(session_id)

    # Create ProfileInfo for each column (simplified to match model)
    profile_dict = {}
    for col in df.columns:
        col_data = df[col]

        # Get the most frequent value and its frequency
        value_counts = col_data.value_counts(dropna=False)
        most_frequent = None
        frequency = None
        if len(value_counts) > 0:
            most_frequent = value_counts.index[0]
            frequency = int(value_counts.iloc[0])

        # Handle various data types for most_frequent
        if most_frequent is None or pd.isna(most_frequent):
            most_frequent = None
        elif not isinstance(most_frequent, str | int | float | bool):
            most_frequent = str(most_frequent)

        profile_info = ProfileInfo(
            column_name=col,
            data_type=str(col_data.dtype),
            null_count=int(col_data.isna().sum()),
            null_percentage=round(col_data.isna().sum() / len(df) * 100, 2),
            unique_count=int(col_data.nunique()),
            unique_percentage=round(col_data.nunique() / len(df) * 100, 2),
            most_frequent=most_frequent,
            frequency=frequency,
        )
        profile_dict[col] = profile_info

    # Note: Correlation and outlier analysis have been simplified
    # since the ProfileResult model doesn't include them

    memory_usage_mb = round(df.memory_usage(deep=True).sum() / (1024 * 1024), 2)

    return ProfileResult(
        profile=profile_dict,
        total_rows=len(df),
        total_columns=len(df.columns),
        memory_usage_mb=memory_usage_mb,
    )
```
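The handler's `most_frequent` normalization (nulls collapse to `None`, non-primitive values fall back to `str()`) can be exercised in isolation. A pure-stdlib sketch that mirrors, rather than imports, that logic (`normalize_most_frequent` is an illustrative name, not part of DataBeak):

```python
import math
from datetime import date

def normalize_most_frequent(value):
    """Mirror of the handler's normalization: keep primitives, stringify the rest."""
    # NaN/None modes are reported as None
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return None
    # Anything that is not a JSON-friendly primitive is stringified
    if not isinstance(value, (str, int, float, bool)):
        return str(value)
    return value
```

This is why, for example, a `datetime.date` mode arrives in the response as an ISO-formatted string rather than a raw object.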
- `src/databeak/servers/discovery_server.py:852-852` (registration). The registration of the `profile_data` tool on the FastMCP `discovery_server` instance using the tool decorator.

```python
discovery_server.tool(name="profile_data")(profile_data)
```
- Pydantic model defining the structure for individual column profile information used in the `profile_data` response.

```python
class ProfileInfo(BaseModel):
    """Data profiling information for a column."""

    column_name: str = Field(description="Name of the profiled column")
    data_type: str = Field(description="Pandas data type of the column")
    null_count: int = Field(description="Number of null/missing values")
    null_percentage: float = Field(description="Percentage of null values (0-100)")
    unique_count: int = Field(description="Number of unique values")
    unique_percentage: float = Field(description="Percentage of unique values (0-100)")
    most_frequent: CsvCellValue = Field(None, description="Most frequently occurring value")
    frequency: int | None = Field(None, description="Frequency count of most common value")
```
- Pydantic model defining the output response structure for the `profile_data` tool.

```python
class ProfileResult(BaseToolResponse):
    """Response model for comprehensive data profiling."""

    profile: dict[str, ProfileInfo]
    total_rows: int
    total_columns: int
    memory_usage_mb: float
    include_correlations: bool = True
    include_outliers: bool = True
```