profile_data
Generate comprehensive data profiles with statistical insights to understand dataset characteristics, identify patterns, and support analytical workflows.
Instructions
Generate a comprehensive data profile with statistical insights.
Creates a complete analytical profile of the dataset including column characteristics, data types, null patterns, and statistical summaries. Provides holistic data understanding for analytical workflows.
Returns: Comprehensive data profile with multi-dimensional analysis
Profile Components:

- Column Profiles: Data types, null patterns, uniqueness
- Statistical Summaries: Numerical column characteristics
- Correlations: Inter-variable relationships (optional)
- Outliers: Anomaly detection across columns (optional)
- Memory Usage: Resource consumption analysis
Examples:

```python
# Full data profile
profile = await profile_data(ctx)
```
AI Workflow Integration:

1. Initial data exploration and understanding
2. Automated data quality reporting
3. Feature engineering guidance
4. Data preprocessing strategy development
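As a sketch of how an MCP client might invoke this tool end to end, assuming the FastMCP client API (`Client`, `call_tool`) and an illustrative server URL:

```python
import asyncio

from fastmcp import Client


async def main() -> None:
    # The transport target is illustrative; use however your DataBeak server is exposed.
    async with Client("http://localhost:8000/mcp") as client:
        result = await client.call_tool("profile_data", {})  # tool takes no arguments
        print(result)


asyncio.run(main())
```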
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| *No arguments* | | | |
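Because the tool takes no arguments, a raw MCP `tools/call` request simply passes an empty `arguments` object. The request shape, sketched here as a Python dict (the `id` value is arbitrary):

```python
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {"name": "profile_data", "arguments": {}},
}
```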
Implementation Reference
- The main execution logic for the `profile_data` tool. It retrieves the current session data, computes profiling statistics for each column (data type, nulls, uniques, most frequent value), calculates memory usage, and returns a `ProfileResult`.

```python
async def profile_data(
    ctx: Annotated[Context, Field(description="FastMCP context for session access")],
) -> ProfileResult:
    """Generate a comprehensive data profile with statistical insights.

    Creates a complete analytical profile of the dataset including column
    characteristics, data types, null patterns, and statistical summaries.
    Provides holistic data understanding for analytical workflows.

    Returns:
        Comprehensive data profile with multi-dimensional analysis

    Profile Components:
        - Column Profiles: Data types, null patterns, uniqueness
        - Statistical Summaries: Numerical column characteristics
        - Correlations: Inter-variable relationships (optional)
        - Outliers: Anomaly detection across columns (optional)
        - Memory Usage: Resource consumption analysis

    Examples:
        # Full data profile
        profile = await profile_data(ctx)

    AI Workflow Integration:
        1. Initial data exploration and understanding
        2. Automated data quality reporting
        3. Feature engineering guidance
        4. Data preprocessing strategy development
    """
    # Get session_id from FastMCP context
    session_id = ctx.session_id
    _session, df = get_session_data(session_id)

    # Create ProfileInfo for each column (simplified to match model)
    profile_dict = {}
    for col in df.columns:
        col_data = df[col]

        # Get the most frequent value and its frequency
        value_counts = col_data.value_counts(dropna=False)
        most_frequent = None
        frequency = None
        if len(value_counts) > 0:
            most_frequent = value_counts.index[0]
            frequency = int(value_counts.iloc[0])

        # Handle various data types for most_frequent
        if most_frequent is None or pd.isna(most_frequent):
            most_frequent = None
        elif not isinstance(most_frequent, str | int | float | bool):
            most_frequent = str(most_frequent)

        profile_info = ProfileInfo(
            column_name=col,
            data_type=str(col_data.dtype),
            null_count=int(col_data.isna().sum()),
            null_percentage=round(col_data.isna().sum() / len(df) * 100, 2),
            unique_count=int(col_data.nunique()),
            unique_percentage=round(col_data.nunique() / len(df) * 100, 2),
            most_frequent=most_frequent,
            frequency=frequency,
        )
        profile_dict[col] = profile_info

    # Note: Correlation and outlier analysis have been simplified
    # since the ProfileResult model doesn't include them

    memory_usage_mb = round(df.memory_usage(deep=True).sum() / (1024 * 1024), 2)

    return ProfileResult(
        profile=profile_dict,
        total_rows=len(df),
        total_columns=len(df.columns),
        memory_usage_mb=memory_usage_mb,
    )
```
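The per-column statistics above reduce to a handful of pandas calls. A minimal standalone sketch of the same computations, without the DataBeak session machinery (the sample DataFrame is illustrative):

```python
import pandas as pd

df = pd.DataFrame({"city": ["NYC", "LA", "NYC", None], "temp": [21.5, 25.0, 19.0, 21.5]})

for col in df.columns:
    s = df[col]
    counts = s.value_counts(dropna=False)  # include nulls so they can rank as most frequent
    print(
        col,
        str(s.dtype),                              # data_type
        int(s.isna().sum()),                       # null_count
        round(s.isna().sum() / len(df) * 100, 2),  # null_percentage
        int(s.nunique()),                          # unique_count (nunique excludes nulls)
        counts.index[0],                           # most_frequent
        int(counts.iloc[0]),                       # frequency
    )
```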
- Pydantic model defining the output schema of the `profile_data` tool response, including per-column profiles and dataset summary metrics. Note that `include_correlations` and `include_outliers` default to `True`, but the current implementation does not populate correlation or outlier results (see the note in the function body above).

```python
class ProfileResult(BaseToolResponse):
    """Response model for comprehensive data profiling."""

    profile: dict[str, ProfileInfo]
    total_rows: int
    total_columns: int
    memory_usage_mb: float
    include_correlations: bool = True
    include_outliers: bool = True
```
- Pydantic model for individual column profiling information used within the `profile_data` tool's response.

```python
class ProfileInfo(BaseModel):
    """Data profiling information for a column."""

    column_name: str = Field(description="Name of the profiled column")
    data_type: str = Field(description="Pandas data type of the column")
    null_count: int = Field(description="Number of null/missing values")
    null_percentage: float = Field(description="Percentage of null values (0-100)")
    unique_count: int = Field(description="Number of unique values")
    unique_percentage: float = Field(description="Percentage of unique values (0-100)")
    most_frequent: CsvCellValue = Field(None, description="Most frequently occurring value")
    frequency: int | None = Field(None, description="Frequency count of most common value")
```
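A hypothetical construction of these models, assuming Pydantic v2 (`model_dump_json`) and that `CsvCellValue` accepts plain strings; the values mirror the small DataFrame sketch above:

```python
info = ProfileInfo(
    column_name="city",
    data_type="object",
    null_count=1,
    null_percentage=25.0,
    unique_count=2,
    unique_percentage=50.0,
    most_frequent="NYC",
    frequency=2,
)
result = ProfileResult(
    profile={"city": info},
    total_rows=4,
    total_columns=1,
    memory_usage_mb=0.01,
)
print(result.model_dump_json(indent=2))  # the JSON payload the MCP client receives
```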
- `src/databeak/servers/discovery_server.py:852` (registration): Registers the `profile_data` function as an MCP tool named `profile_data` on the `discovery_server` FastMCP instance.

```python
discovery_server.tool(name="profile_data")(profile_data)
```
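`FastMCP.tool(...)` returns a decorator, so the call form above is equivalent to decorating a function at definition time. A minimal sketch of that pattern, assuming the FastMCP API (the server and tool names here are illustrative, not DataBeak's):

```python
from fastmcp import FastMCP

mcp = FastMCP("demo")  # illustrative server instance


# Decorator form: identical to mcp.tool(name="greet")(greet)
@mcp.tool(name="greet")
def greet(name: str) -> str:
    """Say hello to the given name."""
    return f"Hello, {name}!"
```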