profile_data

Generate comprehensive data profiles with statistical insights to understand dataset characteristics, identify patterns, and support analytical workflows.

Instructions

Generate comprehensive data profile with statistical insights.

Creates a complete analytical profile of the dataset including column characteristics, data types, null patterns, and statistical summaries. Provides holistic data understanding for analytical workflows.

Returns: Comprehensive data profile with multi-dimensional analysis

Profile Components:
    πŸ“Š Column Profiles: Data types, null patterns, uniqueness
    πŸ“ˆ Statistical Summaries: Numerical column characteristics
    πŸ”— Correlations: Inter-variable relationships (optional)
    🎯 Outliers: Anomaly detection across columns (optional)
    πŸ’Ύ Memory Usage: Resource consumption analysis

Examples:

    # Full data profile
    profile = await profile_data(ctx)

Note that the tool accepts no arguments beyond the FastMCP context; correlation and outlier options are not part of the current signature (see Input Schema below).

AI Workflow Integration:
    1. Initial data exploration and understanding
    2. Automated data quality reporting
    3. Feature engineering guidance
    4. Data preprocessing strategy development
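For workflow step 2 (automated quality reporting), the profile output can be scanned programmatically. The sketch below is hypothetical: it uses plain dicts whose keys mirror the `ProfileInfo` fields (`null_percentage`, `unique_count`), and the thresholds are illustrative, not part of the tool.

```python
# Hypothetical consumer of a profile_data result for quality reporting.
# Dict keys mirror ProfileInfo fields; the 50% threshold is illustrative.

def flag_quality_issues(profile: dict, null_threshold: float = 50.0) -> list[str]:
    """Return human-readable warnings for columns that look problematic."""
    warnings = []
    for name, info in profile.items():
        if info["null_percentage"] > null_threshold:
            warnings.append(f"{name}: {info['null_percentage']}% nulls")
        if info["unique_count"] == 1:
            warnings.append(f"{name}: constant column (one unique value)")
    return warnings

sample_profile = {
    "age": {"null_percentage": 2.5, "unique_count": 37},
    "notes": {"null_percentage": 81.0, "unique_count": 12},
    "source": {"null_percentage": 0.0, "unique_count": 1},
}
print(flag_quality_issues(sample_profile))
# ['notes: 81.0% nulls', 'source: constant column (one unique value)']
```

Keeping the checks separate from the profiling itself means thresholds can vary per dataset without touching the tool.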

Input Schema

No arguments

Implementation Reference

  • The main execution logic for the profile_data tool. It retrieves the current session data, computes profiling statistics for each column (data type, nulls, uniques, most frequent), calculates memory usage, and returns a ProfileResult.
    # Imports assumed by this excerpt; get_session_data, ProfileResult, and
    # ProfileInfo are DataBeak internals (their module paths are not shown here).
    import pandas as pd
    from typing import Annotated
    from fastmcp import Context
    from pydantic import Field
    async def profile_data(
        ctx: Annotated[Context, Field(description="FastMCP context for session access")],
    ) -> ProfileResult:
        """Generate comprehensive data profile with statistical insights.
    
        Creates a complete analytical profile of the dataset including column
        characteristics, data types, null patterns, and statistical summaries.
        Provides holistic data understanding for analytical workflows.
    
        Returns:
            Comprehensive data profile with multi-dimensional analysis
    
        Profile Components:
            πŸ“Š Column Profiles: Data types, null patterns, uniqueness
            πŸ“ˆ Statistical Summaries: Numerical column characteristics
            πŸ”— Correlations: Inter-variable relationships (optional)
            🎯 Outliers: Anomaly detection across columns (optional)
            πŸ’Ύ Memory Usage: Resource consumption analysis
    
        Examples:
            # Full data profile
            profile = await profile_data(ctx)
    
        AI Workflow Integration:
            1. Initial data exploration and understanding
            2. Automated data quality reporting
            3. Feature engineering guidance
            4. Data preprocessing strategy development
    
        """
        # Get session_id from FastMCP context
        session_id = ctx.session_id
        _session, df = get_session_data(session_id)
    
        # Create ProfileInfo for each column (simplified to match model)
        profile_dict = {}
    
        for col in df.columns:
            col_data = df[col]
    
            # Get the most frequent value and its frequency
            value_counts = col_data.value_counts(dropna=False)
            most_frequent = None
            frequency = None
            if len(value_counts) > 0:
                most_frequent = value_counts.index[0]
                frequency = int(value_counts.iloc[0])
    
                # Handle various data types for most_frequent
                if most_frequent is None or pd.isna(most_frequent):
                    most_frequent = None
                elif not isinstance(most_frequent, str | int | float | bool):
                    most_frequent = str(most_frequent)
    
            profile_info = ProfileInfo(
                column_name=col,
                data_type=str(col_data.dtype),
                null_count=int(col_data.isna().sum()),
                null_percentage=round(col_data.isna().sum() / len(df) * 100, 2),
                unique_count=int(col_data.nunique()),
                unique_percentage=round(col_data.nunique() / len(df) * 100, 2),
                most_frequent=most_frequent,
                frequency=frequency,
            )
    
            profile_dict[col] = profile_info
    
        # Note: Correlation and outlier analysis have been simplified
        # since the ProfileResult model doesn't include them
    
        memory_usage_mb = round(df.memory_usage(deep=True).sum() / (1024 * 1024), 2)
    
        return ProfileResult(
            profile=profile_dict,
            total_rows=len(df),
            total_columns=len(df.columns),
            memory_usage_mb=memory_usage_mb,
        )
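The per-column statistics above can be exercised outside the server. This is a minimal standalone sketch of the same computations (value counts with `dropna=False`, null and unique percentages) on a toy DataFrame, with the FastMCP session plumbing and Pydantic models stripped out.

```python
import pandas as pd

# Standalone sketch of the per-column statistics the tool computes.
def profile_columns(df: pd.DataFrame) -> dict:
    out = {}
    for col in df.columns:
        s = df[col]
        vc = s.value_counts(dropna=False)  # keep NaN so it can count as most frequent
        out[col] = {
            "data_type": str(s.dtype),
            "null_count": int(s.isna().sum()),
            "null_percentage": round(float(s.isna().sum()) / len(df) * 100, 2),
            "unique_count": int(s.nunique()),  # nunique() excludes NaN
            "most_frequent": vc.index[0],
            "frequency": int(vc.iloc[0]),
        }
    return out

toy = pd.DataFrame({"city": ["NY", "NY", None, "LA"], "score": [1.0, 2.0, 2.0, None]})
result = profile_columns(toy)
print(result["city"])
# {'data_type': 'object', 'null_count': 1, 'null_percentage': 25.0,
#  'unique_count': 2, 'most_frequent': 'NY', 'frequency': 2}
```

Note the asymmetry the real implementation also has: `value_counts(dropna=False)` includes nulls, while `nunique()` excludes them.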
  • Pydantic model defining the output schema of the profile_data tool response, including per-column profiles and dataset summary metrics.
    class ProfileResult(BaseToolResponse):
        """Response model for comprehensive data profiling."""
    
        profile: dict[str, ProfileInfo]
        total_rows: int
        total_columns: int
        memory_usage_mb: float
        include_correlations: bool = True
        include_outliers: bool = True
  • Pydantic model for individual column profiling information used within the profile_data tool's response.
    class ProfileInfo(BaseModel):
        """Data profiling information for a column."""
    
        column_name: str = Field(description="Name of the profiled column")
        data_type: str = Field(description="Pandas data type of the column")
        null_count: int = Field(description="Number of null/missing values")
        null_percentage: float = Field(description="Percentage of null values (0-100)")
        unique_count: int = Field(description="Number of unique values")
        unique_percentage: float = Field(description="Percentage of unique values (0-100)")
        most_frequent: CsvCellValue = Field(None, description="Most frequently occurring value")
        frequency: int | None = Field(None, description="Frequency count of most common value")
  • Registers the profile_data function as an MCP tool named 'profile_data' on the discovery_server FastMCP instance.
    discovery_server.tool(name="profile_data")(profile_data)
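`discovery_server.tool(name=...)` is a decorator factory: calling `tool(...)` returns a decorator that registers the function under the given name. A minimal stdlib sketch of that pattern (a toy registry, not the actual FastMCP API):

```python
from typing import Callable

class ToolServer:
    """Toy registry illustrating the decorator-factory registration style."""
    def __init__(self) -> None:
        self.tools: dict[str, Callable] = {}

    def tool(self, name: str) -> Callable[[Callable], Callable]:
        def register(fn: Callable) -> Callable:
            self.tools[name] = fn
            return fn  # returned unchanged so the function stays directly callable
        return register

server = ToolServer()

def profile_data():
    return "profile"

# Same shape as the FastMCP call above: factory applied explicitly, not via @.
server.tool(name="profile_data")(profile_data)
print(sorted(server.tools))  # ['profile_data']
```

Applying the factory explicitly (rather than with `@server.tool(...)`) lets the project define the function in one module and register it on a server instance elsewhere.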
