MCP Tabular Data Analysis Server

by K02D

compute_correlation

Calculate correlation matrices between numeric columns in CSV or SQLite files to identify relationships in tabular data using Pearson, Spearman, or Kendall methods.

Instructions

Compute correlation matrix between numeric columns.

Args:
    file_path: Path to CSV or SQLite file
    columns: List of columns to include (default: all numeric columns)
    method: Correlation method - 'pearson' (default), 'spearman', or 'kendall'

Returns:
    Dictionary containing:
    - method: Correlation method used
    - correlation_matrix: Full correlation matrix
    - top_correlations: Top 10 strongest correlations (excluding self-correlations)
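
As a rough illustration of how the `method` options can differ, the sketch below calls pandas `DataFrame.corr` directly on a made-up in-memory table instead of going through the tool. The column names and values are assumptions chosen for the example. `kendall` is left out only because pandas typically needs SciPy installed to compute it.

```python
import pandas as pd

# Toy data (hypothetical): y is a monotonic but non-linear function of x,
# and z decreases linearly as x increases.
df = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0],
    "y": [1.0, 8.0, 27.0, 64.0, 125.0],
    "z": [5.0, 4.0, 3.0, 2.0, 1.0],
})

# Pearson measures linear association; Spearman works on ranks.
pearson = df.corr(method="pearson")
spearman = df.corr(method="spearman")

# Same rounding and dict conversion as the correlation_matrix field.
print(pearson.round(4).to_dict())
print(spearman.round(4).to_dict())
```

Because `y` rises whenever `x` rises, the rank-based Spearman coefficient for that pair is 1, while the Pearson coefficient is somewhat lower since the relationship is not a straight line.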

Input Schema

Name       Required  Description  Default
file_path  Yes
columns    No
method     No                     pearson

Output Schema

No arguments

Implementation Reference

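  • The primary handler function for the 'compute_correlation' tool. Loads the dataset, checks that any requested columns exist and that at least two numeric columns remain, computes the correlation matrix using pandas DataFrame.corr() with the specified method, collects every pairwise correlation from the upper triangle (excluding self-correlations and NaN values), labels each with a strength category, and returns a structured dictionary with the method, the columns analyzed, the rounded matrix, and the 10 strongest pairs by absolute value.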
    @mcp.tool()
    def compute_correlation(
        file_path: str,
        columns: list[str] | None = None,
        method: str = "pearson",
    ) -> dict[str, Any]:
        """
        Compute correlation matrix between numeric columns.
        
        Args:
            file_path: Path to CSV or SQLite file
            columns: List of columns to include (default: all numeric columns)
            method: Correlation method - 'pearson' (default), 'spearman', or 'kendall'
        
        Returns:
            Dictionary containing:
            - method: Correlation method used
            - correlation_matrix: Full correlation matrix
            - top_correlations: Top 10 strongest correlations (excluding self-correlations)
        """
        df = _load_data(file_path)
        
        # Get numeric columns
        if columns:
            # Validate provided columns
            invalid = [c for c in columns if c not in df.columns]
            if invalid:
                raise ValueError(f"Columns not found: {invalid}")
            numeric_df = df[columns].select_dtypes(include=[np.number])
        else:
            numeric_df = df.select_dtypes(include=[np.number])
        
        if len(numeric_df.columns) < 2:
            raise ValueError("Need at least 2 numeric columns for correlation")
        
        # Compute correlation matrix
        corr_matrix = numeric_df.corr(method=method)
        
        # Find top correlations (excluding diagonal)
        correlations = []
        for i, col1 in enumerate(corr_matrix.columns):
            for j, col2 in enumerate(corr_matrix.columns):
                if i < j:  # Upper triangle only
                    corr_value = corr_matrix.loc[col1, col2]
                    if not np.isnan(corr_value):
                        correlations.append({
                            "column1": col1,
                            "column2": col2,
                            "correlation": round(float(corr_value), 4),
                            "strength": _interpret_correlation(abs(corr_value))
                        })
        
        # Sort by absolute correlation
        correlations.sort(key=lambda x: abs(x["correlation"]), reverse=True)
        
        return {
            "method": method,
            "columns_analyzed": corr_matrix.columns.tolist(),
            "correlation_matrix": corr_matrix.round(4).to_dict(),
            "top_correlations": correlations[:10],
        }
  • Helper function called by compute_correlation to categorize the strength of correlations into 'very_strong', 'strong', 'moderate', 'weak', or 'negligible' based on absolute value thresholds.
    def _interpret_correlation(value: float) -> str:
        """Interpret correlation strength."""
        if value >= 0.9:
            return "very_strong"
        elif value >= 0.7:
            return "strong"
        elif value >= 0.5:
            return "moderate"
        elif value >= 0.3:
            return "weak"
        else:
            return "negligible"
  • Helper function that returns the names of the numeric columns in a dataframe. The compute_correlation handler shown above applies the same select_dtypes filter inline rather than calling this helper.
    def _get_numeric_columns(df: pd.DataFrame) -> list[str]:
        """Get list of numeric column names."""
        return df.select_dtypes(include=[np.number]).columns.tolist()
  • Shared helper function used by compute_correlation to load tabular data from CSV or SQLite files into a pandas DataFrame.
    def _load_data(file_path: str) -> pd.DataFrame:
        """Load data from CSV or SQLite file."""
        path = _resolve_path(file_path)
        
        if not path.exists():
            raise FileNotFoundError(
                f"File not found: {file_path}\n"
                f"Resolved to: {path}\n"
                f"Project root: {_PROJECT_ROOT}\n"
                f"Current working directory: {Path.cwd()}"
            )
        
        suffix = path.suffix.lower()
        
        if suffix == ".csv":
            return pd.read_csv(str(path))
        elif suffix in (".db", ".sqlite", ".sqlite3"):
            # For SQLite, load the first table listed in sqlite_master
            conn = sqlite3.connect(str(path))
            try:
                tables = pd.read_sql_query(
                    "SELECT name FROM sqlite_master WHERE type='table'", conn
                )
                if tables.empty:
                    raise ValueError(f"No tables found in SQLite database: {file_path}")
                # Quote the identifier so names with spaces or keywords still work
                first_table = str(tables.iloc[0]["name"]).replace('"', '""')
                return pd.read_sql_query(f'SELECT * FROM "{first_table}"', conn)
            finally:
                # Always release the connection, even if a query fails
                conn.close()
        else:
            raise ValueError(
                f"Unsupported file format: {suffix}. Use .csv, .db, .sqlite, or .sqlite3"
            )
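
The pair-extraction step in the handler can also be written without the nested loop. The following is a minimal sketch, not the server's code: `strength` copies the thresholds from `_interpret_correlation`, while `top_pairs` and the demo table are hypothetical names and data introduced for illustration. It uses `numpy.triu_indices` to visit each unordered pair once.

```python
import numpy as np
import pandas as pd


def strength(value: float) -> str:
    """Same thresholds as _interpret_correlation in the listing above."""
    if value >= 0.9:
        return "very_strong"
    if value >= 0.7:
        return "strong"
    if value >= 0.5:
        return "moderate"
    if value >= 0.3:
        return "weak"
    return "negligible"


def top_pairs(corr: pd.DataFrame, limit: int = 10) -> list[dict]:
    """Hypothetical helper: strongest off-diagonal pairs of a correlation matrix."""
    names = corr.columns.tolist()
    # k=1 selects indices strictly above the diagonal, so no self-correlations.
    rows, cols = np.triu_indices(len(names), k=1)
    values = corr.to_numpy()[rows, cols]
    pairs = [
        {
            "column1": names[i],
            "column2": names[j],
            "correlation": round(float(v), 4),
            "strength": strength(abs(float(v))),
        }
        for i, j, v in zip(rows, cols, values)
        if not np.isnan(v)  # constant columns yield NaN and are skipped
    ]
    pairs.sort(key=lambda p: abs(p["correlation"]), reverse=True)
    return pairs[:limit]


# Made-up data: b is a multiple of a, c is loosely related, d is constant.
demo = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [2, 4, 6, 8],
    "c": [1, 3, 2, 4],
    "d": [7, 7, 7, 7],
})
result = top_pairs(demo.corr(method="pearson"))
for pair in result:
    print(pair)
```

The constant column `d` has zero variance, so its correlations are NaN and never reach the result, which matches the `np.isnan` guard in the handler.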
Behavior: 3/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

With no annotations provided, the description carries the full burden of behavioral disclosure. It describes the return structure in detail, which is helpful, but lacks information on error handling, performance characteristics, or data size limitations. It adequately covers the core operation but misses advanced behavioral traits.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness: 5/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is well-structured and front-loaded with the core purpose, followed by clear sections for arguments and returns. Every sentence adds value, with no redundant or verbose language, making it efficient and easy to parse.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness: 4/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Given the tool's moderate complexity, no annotations, and an output schema that likely covers return values, the description is mostly complete. It details parameters and returns well, but could improve by addressing usage guidelines or edge cases. The presence of an output schema reduces the need for return value explanation.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters: 5/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

The description adds significant meaning beyond the input schema, which has 0% description coverage. It explains that 'file_path' is for CSV or SQLite files, 'columns' defaults to all numeric columns, and 'method' includes specific options like 'pearson', 'spearman', or 'kendall' with a default. This fully compensates for the schema's lack of descriptions.
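
One `file_path` detail that the description leaves implicit is that, per the `_load_data` listing, a SQLite file is read from a single table: the first row returned by the `sqlite_master` query. The sketch below reproduces that lookup on a throwaway database. The file name, table names, and values are invented for the example, and which table comes back first is up to SQLite, since the query has no `ORDER BY`.

```python
import sqlite3
import tempfile
from pathlib import Path

import pandas as pd

with tempfile.TemporaryDirectory() as tmp:
    db_path = Path(tmp) / "demo.sqlite"  # hypothetical file name
    conn = sqlite3.connect(str(db_path))
    try:
        # Two made-up tables; only one of them will be loaded.
        conn.execute("CREATE TABLE measurements (x REAL, y REAL)")
        conn.executemany(
            "INSERT INTO measurements VALUES (?, ?)",
            [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)],
        )
        conn.execute("CREATE TABLE notes (note TEXT)")
        conn.execute("INSERT INTO notes VALUES ('ignored?')")
        conn.commit()

        # Same lookup as in _load_data: list tables, take the first row.
        tables = pd.read_sql_query(
            "SELECT name FROM sqlite_master WHERE type='table'", conn
        )
        first_table = tables.iloc[0]["name"]
        loaded = pd.read_sql_query(f'SELECT * FROM "{first_table}"', conn)
    finally:
        conn.close()

print(f"tables found: {tables['name'].tolist()}")
print(f"table actually loaded: {first_table}")
print(loaded)
```

For databases with several tables, it may therefore be safer to export the table of interest to CSV, or to keep it as the only table in the file, before calling the tool.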

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose: 5/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states the verb 'compute' and the resource 'correlation matrix between numeric columns,' making the purpose specific and unambiguous. It distinguishes this tool from siblings like 'statistical_test' or 'describe_dataset' by focusing specifically on correlation analysis.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines: 2/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description provides no guidance on when to use this tool versus alternatives. It does not mention sibling tools like 'statistical_test' for other analyses or 'describe_dataset' for basic statistics, leaving the agent without context for tool selection.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/K02D/mcp-tabular'