MCP Tabular Data Analysis Server

by K02D

describe_dataset

Generate comprehensive statistics for tabular datasets to analyze structure, column types, numeric summaries, missing values, and data previews.

Instructions

Generate comprehensive statistics for a tabular dataset.

Args:
    file_path: Path to CSV or SQLite file
    include_all: If True, include statistics for all columns (not just numeric)

Returns:
    Dictionary containing:
    - shape: (rows, columns)
    - columns: List of column names with their types
    - numeric_stats: Descriptive statistics for numeric columns
    - missing_values: Count of missing values per column
    - sample: First 5 rows as preview

Input Schema

Name         Required  Description  Default
file_path    Yes
include_all  No
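
As a rough illustration of how these two arguments reach the server, the sketch below builds the JSON-RPC `tools/call` request an MCP client would send. Only the tool name and argument names come from this page; the helper `build_call`, the path `data/sales.csv`, and the request id are made-up values for the example.

```python
import json


def build_call(file_path: str, include_all: bool = False, request_id: int = 1) -> dict:
    """Build an MCP tools/call request for describe_dataset (illustrative only)."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {
            "name": "describe_dataset",
            "arguments": {"file_path": file_path, "include_all": include_all},
        },
    }


# Hypothetical dataset path; include_all falls back to False when omitted.
request = build_call("data/sales.csv")
print(json.dumps(request, indent=2))
```

In practice an MCP client library would normally assemble this envelope; the sketch only shows which fields carry `file_path` and `include_all`.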

Output Schema

No arguments

Implementation Reference

  • Main handler function for the 'describe_dataset' tool. Loads dataset from CSV/SQLite, computes comprehensive statistics including shape, column types, missing values, numeric descriptive stats (with skew/kurtosis), categorical summaries, and a data sample.
    @mcp.tool()
    def describe_dataset(file_path: str, include_all: bool = False) -> dict[str, Any]:
        """
        Generate comprehensive statistics for a tabular dataset.
        
        Args:
            file_path: Path to CSV or SQLite file
            include_all: If True, include statistics for all columns (not just numeric)
        
        Returns:
            Dictionary containing:
            - shape: (rows, columns)
            - columns: List of column names with their types
            - numeric_stats: Descriptive statistics for numeric columns
            - missing_values: Count of missing values per column
            - sample: First 5 rows as preview
        """
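        # NOTE: include_all is accepted but never read below; categorical
        # summaries are always produced regardless of its value.
        df = _load_data(file_path)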
        
        # Basic info
        result = {
            "shape": {"rows": len(df), "columns": len(df.columns)},
            "columns": {
                col: str(df[col].dtype) for col in df.columns
            },
            "missing_values": df.isnull().sum().to_dict(),
        }
        
        # Numeric statistics
        numeric_cols = _get_numeric_columns(df)
        if numeric_cols:
            stats_df = df[numeric_cols].describe()
            # Add additional stats
            stats_df.loc["median"] = df[numeric_cols].median()
            stats_df.loc["skew"] = df[numeric_cols].skew()
            stats_df.loc["kurtosis"] = df[numeric_cols].kurtosis()
            result["numeric_stats"] = stats_df.to_dict()
        
        # Categorical columns info
        cat_cols = df.select_dtypes(include=["object", "category"]).columns.tolist()
        if cat_cols:
            result["categorical_columns"] = {
                col: {
                    "unique_values": df[col].nunique(),
                    "top_values": df[col].value_counts().head(5).to_dict()
                }
                for col in cat_cols
            }
        
        # Sample data
        result["sample"] = df.head(5).to_dict(orient="records")
        
        return result
  • Helper function to identify numeric columns, used in describe_dataset for statistics computation.
    def _get_numeric_columns(df: pd.DataFrame) -> list[str]:
        """Get list of numeric column names."""
        return df.select_dtypes(include=[np.number]).columns.tolist()
  • Core helper function to load datasets from CSV or SQLite files, handling path resolution and table selection for databases. Called by describe_dataset.
    def _load_data(file_path: str) -> pd.DataFrame:
        """Load data from CSV or SQLite file."""
        path = _resolve_path(file_path)
        
        if not path.exists():
            raise FileNotFoundError(
                f"File not found: {file_path}\n"
                f"Resolved to: {path}\n"
                f"Project root: {_PROJECT_ROOT}\n"
                f"Current working directory: {Path.cwd()}"
            )
        
        suffix = path.suffix.lower()
        
        if suffix == ".csv":
            return pd.read_csv(str(path))
        elif suffix in (".db", ".sqlite", ".sqlite3"):
            # For SQLite, load the first table listed in sqlite_master
            conn = sqlite3.connect(str(path))
            try:
                tables = pd.read_sql_query(
                    "SELECT name FROM sqlite_master WHERE type='table'", conn
                )
                if tables.empty:
                    raise ValueError(f"No tables found in SQLite database: {file_path}")
                first_table = tables.iloc[0]["name"]
                # Quote the identifier so names with spaces or keywords still work
                quoted = '"' + first_table.replace('"', '""') + '"'
                return pd.read_sql_query(f"SELECT * FROM {quoted}", conn)
            finally:
                # Close the connection even if a query raises
                conn.close()
        else:
            raise ValueError(f"Unsupported file format: {suffix}. Use .csv or .db/.sqlite")
  • Helper to resolve relative file paths to absolute paths based on project root, used by _load_data.
    def _resolve_path(file_path: str) -> Path:
        """
        Resolve file path relative to project root if it's a relative path.
        
        Args:
            file_path: Absolute or relative file path
        
        Returns:
            Resolved absolute Path
        """
        path = Path(file_path)
        
        # If absolute path, use as-is
        if path.is_absolute():
            return path
        
        # Otherwise, resolve relative to project root
        resolved = _PROJECT_ROOT / path
        return resolved.resolve()
  • MCP tool registration decorator for the describe_dataset function.
    @mcp.tool()
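
The statistics step of the handler can be tried without an MCP server or a file on disk. The sketch below is a minimal approximation that applies the same pandas calls to an in-memory DataFrame; `describe_frame` and the sample data are hypothetical, and the categorical summary is left out for brevity. Note that, as in the handler, `shape` comes back as a dict with `rows` and `columns` keys rather than a tuple.

```python
from typing import Any

import numpy as np
import pandas as pd


def describe_frame(df: pd.DataFrame) -> dict[str, Any]:
    """Approximate describe_dataset's statistics, minus file loading."""
    result: dict[str, Any] = {
        "shape": {"rows": len(df), "columns": len(df.columns)},
        "columns": {col: str(df[col].dtype) for col in df.columns},
        # Cast counts to plain ints so the result serializes cleanly.
        "missing_values": {col: int(n) for col, n in df.isnull().sum().items()},
    }

    numeric_cols = df.select_dtypes(include=[np.number]).columns.tolist()
    if numeric_cols:
        stats_df = df[numeric_cols].describe()
        # Extra rows appended to the describe() table, as in the handler.
        stats_df.loc["median"] = df[numeric_cols].median()
        stats_df.loc["skew"] = df[numeric_cols].skew()
        stats_df.loc["kurtosis"] = df[numeric_cols].kurtosis()
        result["numeric_stats"] = stats_df.to_dict()

    result["sample"] = df.head(5).to_dict(orient="records")
    return result


# Made-up data: one numeric column with a gap, one text column.
df = pd.DataFrame(
    {"price": [10.0, 12.5, None, 9.0, 11.0], "city": ["A", "B", "A", "C", "A"]}
)
summary = describe_frame(df)
print(summary["shape"])
print(summary["missing_values"])
print(sorted(summary["numeric_stats"]["price"]))
```

Skew and kurtosis can come back as NaN for very short columns, so a caller that serializes the result to strict JSON may need to handle that case.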
Behavior: 3/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

With no annotations provided, the description carries the full burden of behavioral disclosure. It partially succeeds by describing the return format in detail, but fails to mention critical behaviors like performance implications for large datasets, memory usage, error handling for invalid files, or whether the operation is read-only (implied but not stated). The description adds some context but leaves significant gaps.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.
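
The error handling that the description leaves out can be read from the implementation reference: a path that does not exist raises `FileNotFoundError`, and an unrecognized suffix raises `ValueError`. The sketch below reproduces only that dispatch logic under simplified assumptions. `check_loadable` is a hypothetical name, and the real `_load_data` also resolves relative paths against a project root, which is omitted here.

```python
import tempfile
from pathlib import Path

SUPPORTED = {".csv", ".db", ".sqlite", ".sqlite3"}


def check_loadable(file_path: str) -> str:
    """Return the normalized suffix, or raise the way the loader would."""
    path = Path(file_path)
    if not path.exists():
        raise FileNotFoundError(f"File not found: {file_path}")
    suffix = path.suffix.lower()  # suffix matching is case-insensitive
    if suffix not in SUPPORTED:
        raise ValueError(f"Unsupported file format: {suffix}. Use .csv or .db/.sqlite")
    return suffix


with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "data.CSV"
    csv_path.write_text("a,b\n1,2\n")
    txt_path = Path(tmp) / "notes.txt"
    txt_path.write_text("hello")

    outcomes = {}
    for candidate in (csv_path, txt_path, Path(tmp) / "missing.csv"):
        try:
            outcomes[candidate.name] = check_loadable(str(candidate))
        except (FileNotFoundError, ValueError) as exc:
            outcomes[candidate.name] = type(exc).__name__

print(outcomes)
```

An agent calling the tool would presumably see these exceptions surfaced as tool errors, though how the server reports them is not shown on this page.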

Conciseness: 5/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is perfectly structured and front-loaded: the first sentence states the core purpose, followed by clearly labeled sections for arguments and returns. Every sentence earns its place by providing essential information without redundancy, making it highly scannable and efficient.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness: 4/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Given the tool's moderate complexity (2 parameters, no annotations, but with output schema), the description is reasonably complete. The output schema exists, so the description appropriately doesn't need to explain return values in schema terms, but it usefully summarizes the dictionary structure. However, it misses some context like performance considerations or error cases.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters: 4/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

Schema description coverage is 0%, so the description must compensate. It effectively explains both parameters: 'file_path' is clarified as 'Path to CSV or SQLite file', and 'include_all' is explained as controlling whether statistics cover all columns or just numeric ones. This adds meaningful semantics beyond the bare schema, though it doesn't detail file path format or validation rules.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose: 5/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states the specific action ('Generate comprehensive statistics') and resource ('tabular dataset'), distinguishing it from sibling tools like 'data_quality_report' or 'statistical_test' by focusing on descriptive statistics rather than quality assessment or hypothesis testing. The verb 'generate' and scope 'comprehensive statistics' precisely define what the tool does.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines: 2/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description provides no guidance on when to use this tool versus alternatives like 'data_quality_report' or 'auto_insights', nor does it mention prerequisites such as file format requirements or data size limitations. It lacks explicit when/when-not statements or named alternatives, leaving usage context implied at best.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/K02D/mcp-tabular'
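
The same request could be issued from Python with the standard library. This is a sketch only: `server_url` and `fetch_server` are hypothetical helpers, and the assumption that the endpoint returns a JSON object is not confirmed by this page.

```python
import json
import urllib.request

API_BASE = "https://glama.ai/api/mcp/v1/servers"


def server_url(owner: str, name: str) -> str:
    """Build the directory URL for one server."""
    return f"{API_BASE}/{owner}/{name}"


def fetch_server(owner: str, name: str, timeout: float = 10.0) -> dict:
    """GET the server record; assumes the response body is a JSON object."""
    request = urllib.request.Request(server_url(owner, name), method="GET")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


print(server_url("K02D", "mcp-tabular"))
# fetch_server("K02D", "mcp-tabular") would perform the actual network request.
```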

If you have feedback or need assistance with the MCP directory API, please join our Discord server.