Glama

get_data_info

Get descriptive statistics and a data preview for Stata, CSV, or Excel files. Understand variable details and optionally view head rows to explore a dataset without prior knowledge.

Instructions

Get descriptive statistics and a data preview for a data file (dta, csv, xlsx). Returns overview, variable details, and optional head rows filtered by requested variables. Use when you need to understand a dataset or have no prior knowledge of the data.

Input Schema

Name        Required  Description  Default
data_path   Yes       -            -
vars_list   No        -            -
encoding    No        -            utf-8
head        No        -            -
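
As a rough illustration of how these inputs fit together, the arguments for a call might be assembled as below. The file path and variable names are hypothetical; only the parameter names and the utf-8 default come from the schema, and the name/arguments envelope assumes a standard MCP tool-call request.

```python
import json

# Hypothetical arguments for a get_data_info call; the path and variable
# names are illustrative, not taken from the project.
arguments = {
    "data_path": "~/data/survey.csv",   # required
    "vars_list": ["age", "income"],     # optional: restrict to these variables
    "encoding": "utf-8",                # optional: the schema default
    "head": 5,                          # optional: number of preview rows
}

# Serialized form, as it could appear in the params of a tool-call request.
payload = json.dumps({"name": "get_data_info", "arguments": arguments})
print(payload)
```

Omitting vars_list, encoding, and head leaves only the required data_path.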

Output Schema

Name    Required  Description  Default
result  Yes       -            -
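
The handlers listed under Implementation Reference return a JSON string on success and a plain message (for example, one starting with "Unsupported file extension now:") on failure, so a caller may want to tell the two apart. The sketch below shows one way to do that; parse_result is a hypothetical helper, not part of the project, and it makes no assumption about which keys the JSON contains.

```python
import json


def parse_result(result: str) -> dict:
    """Hypothetical helper: classify a get_data_info result as data or error."""
    try:
        parsed = json.loads(result)
    except json.JSONDecodeError:
        # Failure messages are plain text, not JSON.
        return {"ok": False, "error": result}
    if not isinstance(parsed, dict):
        # Assumption: a successful summary is a JSON object.
        return {"ok": False, "error": result}
    return {"ok": True, "data": parsed}


print(parse_result("Unsupported file extension now: txt"))
```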

Implementation Reference

  • Primary handler function for the get_data_info tool. Accepts data_path, optional vars_list, encoding, and config_file. Resolves the path, determines file extension, fetches the appropriate data handler class, instantiates it, and returns JSON-serialized dataset info.
    def get_data_info(
        data_path: str,
        vars_list: List[str] | None = None,
        encoding: str = "utf-8",
        config_file: str | Path | None = None,
    ) -> str:
        """Return descriptive statistics for a supported dataset."""
        runtime = create_runtime_context(config_file=config_file)
        resolved_data_path = Path(data_path).expanduser().resolve()
        data_extension = resolved_data_path.suffix.lower().strip(".")
    
        data_info_cls = get_data_handler(data_extension)
        if not data_info_cls:
            return f"Unsupported file extension now: {data_extension}"
    
        data_info = data_info_cls(
            resolved_data_path,
            vars_list,
            encoding=encoding,
            cache_dir=runtime.tmp_base_path,
        )
        try:
            return json.dumps(data_info.info, ensure_ascii=False)
        except Exception as error:
            return f"Failed to generate data summary for {resolved_data_path}: {error}"
  • Alternative (legacy/mcp-server) handler for get_data_info. Same core logic but uses config.STATA_MCP_FOLDER.TMP for caching and supports a head parameter (row preview). Also includes logging and cache awareness.
    def get_data_info(
            data_path: str,
            vars_list: List[str] | None = None,
            encoding: str = "utf-8",
            head: int = 0,
    ) -> str:
        """
        Return descriptive statistics for a supported data file.
    
        Args:
            data_path (str): Absolute path to .dta, .csv, .xlsx, .xls, .sav file.
            vars_list (List[str] | None): Optional variable subset (default: all variables).
            encoding (str): File encoding (ignored for .dta).
            head (int): Number of preview rows (0 = disabled).
    
        Returns:
            str: JSON string with overview, variable details, and config.
    
        Examples:
            >>> get_data_info("/Applications/Stata/auto.dta")
            >>> get_data_info("/Applications/Stata/auto.dta", vars_list=["price", "mpg"], head=5)
        """
        data_path = Path(data_path).expanduser().resolve()
        data_extension = data_path.suffix.lower().strip(".")
    
        # Lazy import: pandas/numpy/requests are heavy, only load when needed
        from .data_info import get_data_handler
    
        # Get the appropriate data handler class from the registry
        data_info_cls = get_data_handler(data_extension)
    
        if not data_info_cls:
            logging.error(f"Unsupported file extension: {data_extension} for data file: {data_path}")
            return f"Unsupported file extension now: {data_extension}"
    
        data_info = data_info_cls(data_path, vars_list, encoding=encoding, cache_dir=config.STATA_MCP_FOLDER.TMP, head=head)
        try:
            info = data_info.info
            if data_info.is_cache:
                saved_path = info.get("saved_path", None)
                logging.info(f"Successfully generated data summary for {data_path}, saved to {saved_path}")
            else:
                logging.info(f"Successfully generated data summary for {data_path}")
            return json.dumps(info, ensure_ascii=False)
        except Exception as e:
            logging.error(f"Failed to generate data summary for {data_path}: {str(e)}")
            return f"Failed to generate data summary for {data_path}: {str(e)}"
  • Registration entry for get_data_info in the _TOOL_REGISTRY dict. Maps the tool name to its description, the handler function, and the profiles ('core', 'all') under which it is registered.
    "get_data_info": {
        "description": (
            "Get descriptive statistics and a data preview for a data file "
            "(dta, csv, xlsx). Returns overview, variable details, "
            "and optional head rows filtered by requested variables. "
            "Use when you need to understand a dataset or have no prior knowledge of the data."
        ),
        "func": get_data_info,
        "profiles": {"core", "all"},
    },
  • Re-export of get_data_info from the api package, making it available via from ..api import get_data_info.
    from .get_data_info import get_data_info
    from .read_log import read_log
    from .stata_do import stata_do
    from .stata_help import stata_help
    from .write_dofile import write_dofile
    
    __all__ = [
        "RuntimeContext",
        "create_runtime_context",
        "ado_package_install",
        "get_data_info",
        "read_log",
        "stata_do",
        "stata_help",
        "write_dofile",
    ]
  • CLI parser definition flagging get_data_info as a 'core' tool in the --core argument's help text.
    help="Register only core tools (stata_do, get_data_info, help)",
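
Both handlers above derive the file extension the same way and then look it up in a handler registry. The following is a minimal sketch of that pattern, not the project's code: the _HANDLERS dict and the two placeholder classes are hypothetical stand-ins for the real registry, and dispatch is an illustrative wrapper.

```python
from pathlib import Path


# Hypothetical placeholder handler classes; the real ones live in the project.
class _CsvInfo:
    pass


class _DtaInfo:
    pass


# Hypothetical registry keyed by lowercase extension without the dot.
_HANDLERS = {"csv": _CsvInfo, "dta": _DtaInfo}


def get_data_handler(extension: str):
    """Return the handler class for an extension, or None if unsupported."""
    return _HANDLERS.get(extension)


def normalized_extension(data_path: str) -> str:
    """Mirror the handlers' normalization: expand ~, resolve, lowercase, drop dots."""
    return Path(data_path).expanduser().resolve().suffix.lower().strip(".")


def dispatch(data_path: str) -> str:
    """Illustrative wrapper: name the chosen handler or report the extension."""
    extension = normalized_extension(data_path)
    handler_cls = get_data_handler(extension)
    if not handler_cls:
        return f"Unsupported file extension now: {extension}"
    return handler_cls.__name__


print(dispatch("/tmp/Auto.DTA"))
print(dispatch("/tmp/notes.txt"))
```

Because the suffix is lowercased before the lookup, Auto.DTA and auto.dta resolve to the same handler, and a path with no suffix yields an empty extension that falls through to the unsupported branch.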
Behavior: 3/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

No annotations are provided, so the description carries the full burden. It describes the returned overview, variable details, and optional head rows filtered by the requested variables. It lacks details on side effects, permissions, and performance, which is typical for a read-only info tool.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness: 5/5

Is the description appropriately sized, front-loaded, and free of redundancy?

Two sentences, no redundancy. First sentence defines function, second adds returns and usage guidance. Efficiently conveys essential information.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness: 4/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Output schema exists, so return values are covered. Description mentions overview, variable details, head rows. Missing encoding parameter explanation, but overall complete for an info tool. Siblings unrelated.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters: 2/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

Schema description coverage is 0%, so the description must compensate. It mentions 'filtered by requested variables' for vars_list and 'optional head rows' for head, but does not explain data_path or encoding. Coverage is partial: two of the four parameters are left undocumented.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose: 5/5

Does the description clearly state what the tool does and how it differs from similar tools?

Description clearly states specific verb ('Get descriptive statistics and a data preview') and resource ('data file dta, csv, xlsx'). Returns overview and variable details. Sibling tools are unrelated, so no differentiation needed.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines: 4/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

Explicitly says 'Use when you need to understand a dataset or have no prior knowledge of the data.' No exclusions or alternatives mentioned, but siblings are distinct, making guidance adequate.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/SepineTam/stata-mcp'