Glama

validate_schema

Validate data against schema definitions using Pandera framework to ensure data quality and compliance with specified rules.

Instructions

Validate data against a schema definition using Pandera validation framework.

This function leverages Pandera's comprehensive validation capabilities to provide robust data validation. The schema is dynamically converted to Pandera format and applied to the DataFrame for maximum validation coverage and reliability.

For more information on Pandera validation capabilities, see:

  • Pandera Documentation: https://pandera.readthedocs.io/

  • Check API: https://pandera.readthedocs.io/en/stable/reference/generated/pandera.api.checks.Check.html

Returns: ValidateSchemaResult with validation status and detailed error information
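As a sketch of how the `schema` argument might be structured (the column names and thresholds below are hypothetical illustrations, not taken from the tool's own docs), each key names a column and maps to a dict of Pandera-compatible rules:

```python
# Hypothetical schema payload for validate_schema: each key is a column
# name, each value a subset of the ColumnValidationRules fields.
example_schema = {
    "age": {
        "nullable": False,
        "in_range": {"min": 0, "max": 120},   # becomes Check.in_range
    },
    "email": {
        "unique": True,
        "str_matches": r"^[^@\s]+@[^@\s]+$",  # becomes Check.str_matches
    },
    "status": {
        "isin": ["active", "inactive"],       # becomes Check.isin
    },
}

# Structural sanity checks mirroring the model validators in
# ColumnValidationRules: range dicts allow only 'min'/'max', min <= max.
for rules in example_schema.values():
    for key in ("in_range", "str_length"):
        rng = rules.get(key)
        if rng is not None:
            assert set(rng) <= {"min", "max"}
            if "min" in rng and "max" in rng:
                assert rng["min"] <= rng["max"]
```

Columns present in the data but absent from the schema are ignored (the handler builds the Pandera schema with `strict=False`), while schema columns missing from the data are reported as `column_missing` errors.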

Input Schema

| Name   | Required | Description                                    | Default |
|--------|----------|------------------------------------------------|---------|
| schema | Yes      | Schema definition with column validation rules | —       |

Output Schema

| Name              | Required | Description                              | Default |
|-------------------|----------|------------------------------------------|---------|
| valid             | Yes      | Whether validation passed overall        | —       |
| errors            | Yes      | All validation errors found              | —       |
| summary           | Yes      | Summary of validation results            | —       |
| validation_errors | Yes      | Validation errors grouped by column name | —       |
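A minimal sketch of consuming the result shape above (field values are illustrative, not real tool output): `validation_errors` groups errors per column, while `errors` is the same set flattened, and possibly truncated by the violation limit:

```python
# Illustrative ValidateSchemaResult payload (not real tool output).
result = {
    "valid": False,
    "summary": {"total_columns": 2, "valid_columns": 1, "invalid_columns": 1,
                "missing_columns": [], "extra_columns": ["notes"]},
    "validation_errors": {
        "age": [
            {"error": "pandera_in_range(0, 120)",
             "message": "Pandera validation failed: in_range(0, 120) - 150"},
        ],
    },
    "errors": [],
}

# The flat `errors` list is just the grouped errors concatenated, which is
# what the handler does before applying violation limits.
result["errors"] = [
    e for errs in result["validation_errors"].values() for e in errs
]

assert result["valid"] is False
assert len(result["errors"]) == 1
```

Note that `errors` may be shorter than the union of `validation_errors` when the configured `max_validation_violations` limit truncates the flat list.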

Implementation Reference

  • Main handler function implementing the validate_schema tool. Loads session data, builds dynamic Pandera DataFrameSchema from input rules, performs validation, collects and limits errors, returns detailed results.
    def validate_schema(
        ctx: Annotated[Context, Field(description="FastMCP context for session access")],
        schema: Annotated[
            ValidationSchema,
            Field(description="Schema definition with column validation rules"),
        ],
    ) -> ValidateSchemaResult:
        """Validate data against a schema definition using Pandera validation framework.
    
        This function leverages Pandera's comprehensive validation capabilities to provide
        robust data validation. The schema is dynamically converted to Pandera format
        and applied to the DataFrame for maximum validation coverage and reliability.
    
        For more information on Pandera validation capabilities, see:
        - Pandera Documentation: https://pandera.readthedocs.io/
        - Check API: https://pandera.readthedocs.io/en/stable/reference/generated/pandera.api.checks.Check.html
    
        Returns:
            ValidateSchemaResult with validation status and detailed error information
    
        """
        session_id = ctx.session_id
        _session, df = get_session_data(session_id)
        settings = get_settings()
        validation_errors: dict[str, list[ValidationError]] = {}
    
        parsed_schema = schema.root
    
        # Apply resource management for large datasets
        logger.info("Validating schema for %d rows, %d columns", len(df), len(df.columns))
        if len(df) > settings.max_anomaly_sample_size:
            logger.warning(
                "Large dataset (%d rows), using sample of %d for validation",
                len(df),
                settings.max_anomaly_sample_size,
            )
            df = sample_large_dataset(df, settings.max_anomaly_sample_size, "Schema validation")
    
        # Convert validation_summary to ValidationSummary
        validation_summary = ValidationSummary(
            total_columns=len(parsed_schema),
            valid_columns=0,
            invalid_columns=0,
            missing_columns=[],
            extra_columns=[],
        )
    
        # Check for missing and extra columns
        schema_columns = set(parsed_schema.keys())
        df_columns = set(df.columns)
    
        validation_summary.missing_columns = list(schema_columns - df_columns)
        validation_summary.extra_columns = list(df_columns - schema_columns)
    
        # Build Pandera schema dynamically from our validation rules
        pandera_columns = {}
    
        for col_name, rules_model in parsed_schema.items():
            if col_name not in df.columns:
                # Handle missing columns separately
                validation_errors[col_name] = [
                    ValidationError(
                        error="column_missing",
                        message=f"Column '{col_name}' not found in data",
                    ),
                ]
                validation_summary.invalid_columns += 1
                continue
    
            # Convert ColumnValidationRules to Pandera checks
            checks = []
            rules = rules_model.model_dump(exclude_none=True)
            ignore_na = rules.get("ignore_na", True)
    
            # Build Pandera checks from validation rules
            if rules.get("equal_to") is not None:
                checks.append(Check.equal_to(rules["equal_to"], ignore_na=ignore_na))
            if rules.get("not_equal_to") is not None:
                checks.append(Check.not_equal_to(rules["not_equal_to"], ignore_na=ignore_na))
            if rules.get("greater_than") is not None:
                checks.append(Check.greater_than(rules["greater_than"], ignore_na=ignore_na))
            if rules.get("greater_than_or_equal_to") is not None:
                checks.append(
                    Check.greater_than_or_equal_to(
                        rules["greater_than_or_equal_to"], ignore_na=ignore_na
                    )
                )
            if rules.get("less_than") is not None:
                checks.append(Check.less_than(rules["less_than"], ignore_na=ignore_na))
            if rules.get("less_than_or_equal_to") is not None:
                checks.append(
                    Check.less_than_or_equal_to(rules["less_than_or_equal_to"], ignore_na=ignore_na)
                )
            if rules.get("in_range") is not None:
                range_dict = rules["in_range"]
                checks.append(Check.in_range(range_dict["min"], range_dict["max"], ignore_na=ignore_na))
            if rules.get("isin") is not None:
                checks.append(Check.isin(rules["isin"], ignore_na=ignore_na))
            if rules.get("notin") is not None:
                checks.append(Check.notin(rules["notin"], ignore_na=ignore_na))
            if rules.get("str_contains") is not None:
                checks.append(Check.str_contains(rules["str_contains"], ignore_na=ignore_na))
            if rules.get("str_endswith") is not None:
                checks.append(Check.str_endswith(rules["str_endswith"], ignore_na=ignore_na))
            if rules.get("str_startswith") is not None:
                checks.append(Check.str_startswith(rules["str_startswith"], ignore_na=ignore_na))
            if rules.get("str_matches") is not None:
                checks.append(Check.str_matches(rules["str_matches"], ignore_na=ignore_na))
            if rules.get("str_length") is not None:
                length_dict = rules["str_length"]
                min_len = length_dict.get("min")
                max_len = length_dict.get("max")
                checks.append(Check.str_length(min_len, max_len, ignore_na=ignore_na))
    
            # Create Pandera Column with checks
            pandera_columns[col_name] = Column(
                nullable=rules.get("nullable", True),
                unique=rules.get("unique", False),
                coerce=rules.get("coerce", False),
                checks=checks,
                name=col_name,
            )
    
        # Create and apply Pandera DataFrameSchema
        pandera_schema = DataFrameSchema(
            columns=pandera_columns,
            strict=False,  # Allow extra columns not in schema
            name="DataBeak_Validation_Schema",
        )
    
        # Validate using Pandera
        try:
            pandera_schema.validate(df, lazy=True)
            # If validation succeeds, update summary
            validation_summary.valid_columns = len(pandera_columns)
            validation_summary.invalid_columns = len(validation_errors)  # Only missing columns
    
        except pandera.errors.SchemaErrors as schema_errors:
            # Process Pandera validation errors
            for error_data in schema_errors.failure_cases.to_dict("records"):
                col_name = str(error_data.get("column", "unknown"))
                check_name = str(error_data.get("check", "unknown"))
                failure_case = error_data.get("failure_case", "unknown")
    
                if col_name not in validation_errors:
                    validation_errors[col_name] = []
    
                validation_errors[col_name].append(
                    ValidationError(
                        error=f"pandera_{check_name}",
                        message=f"Pandera validation failed: {check_name} - {failure_case}",
                        check_name=check_name,
                        failure_case=str(failure_case),
                    )
                )
    
            validation_summary.invalid_columns = len(validation_errors)
            validation_summary.valid_columns = (
                len(parsed_schema)
                - validation_summary.invalid_columns
                - len(validation_summary.missing_columns)
            )
    
        is_valid = len(validation_errors) == 0 and len(validation_summary.missing_columns) == 0
    
        # No longer recording operations (simplified MCP architecture)
    
        # Flatten all validation errors with resource limits
        all_errors = []
        for error_list in validation_errors.values():
            all_errors.extend(error_list)
    
        # Apply violation limits to prevent resource exhaustion
        limited_errors, was_truncated = apply_violation_limits(
            all_errors, settings.max_validation_violations, "Schema validation"
        )
    
        if was_truncated:
            logger.warning(
                "Validation found %d errors, limited to %d",
                len(all_errors),
                settings.max_validation_violations,
            )
    
        return ValidateSchemaResult(
            valid=is_valid,
            errors=limited_errors,
            summary=validation_summary,
            validation_errors=validation_errors,
        )
  • Output Pydantic model defining the structure of the validation result returned by the tool.
    class ValidateSchemaResult(BaseModel):
        """Response model for schema validation operations."""
    
        valid: bool = Field(description="Whether validation passed overall")
        errors: list[ValidationError] = Field(description="All validation errors found")
        summary: ValidationSummary = Field(description="Summary of validation results")
        validation_errors: dict[str, list[ValidationError]] = Field(
            description="Validation errors grouped by column name",
        )
  • Input Pydantic RootModel wrapping the schema dictionary of column validation rules.
    class ValidationSchema(RootModel[dict[str, ColumnValidationRules]]):
        """Schema definition for data validation."""
  • Detailed input model defining validation rules for each column, supporting Pandera-compatible checks like ranges, patterns, uniqueness, etc.
    class ColumnValidationRules(BaseModel):
        """Column validation rules based on Pandera Field and Check validation capabilities.
    
        This class implements comprehensive column validation using rules compatible with
        Pandera's validation system. It leverages Pandera's robust validation framework
        for maximum data quality assurance.
    
        For complete documentation on validation behaviors and options, see:
        - Pandera Field API: https://pandera.readthedocs.io/en/stable/reference/generated/pandera.api.pandas.model_components.Field.html
        - Pandera Check API: https://pandera.readthedocs.io/en/stable/reference/generated/pandera.api.checks.Check.html
        - Pandas validation guide: https://pandas.pydata.org/docs/user_guide/basics.html#validation
    
        The validation rules are organized by category to match Pandera's Check API for
        maximum compatibility and comprehensive data validation coverage.
        """
    
        # Core Field Properties (Pandera Field parameters)
        nullable: bool = Field(
            default=True, description="Allow null/NaN values in the column (Pandera nullable parameter)"
        )
        unique: bool = Field(
            default=False, description="Ensure all column values are unique (Pandera unique parameter)"
        )
        coerce: bool = Field(
            default=False, description="Attempt automatic type conversion (Pandera coerce parameter)"
        )
    
        # Equality Checks (Pandera Check.equal_to/not_equal_to)
        equal_to: int | float | str | bool | None = Field(
            default=None, description="All values must equal this exact value (Pandera Check.equal_to)"
        )
        not_equal_to: int | float | str | bool | None = Field(
            default=None, description="No values may equal this value (Pandera Check.not_equal_to)"
        )
    
        # Numeric Range Checks (Pandera Check comparison methods)
        greater_than: int | float | None = Field(
            default=None,
            description="All numeric values must be strictly greater than this (Pandera Check.greater_than)",
        )
        greater_than_or_equal_to: int | float | None = Field(
            default=None,
            description="All numeric values must be >= this value (Pandera Check.greater_than_or_equal_to)",
        )
        less_than: int | float | None = Field(
            default=None,
            description="All numeric values must be strictly less than this (Pandera Check.less_than)",
        )
        less_than_or_equal_to: int | float | None = Field(
            default=None,
            description="All numeric values must be <= this value (Pandera Check.less_than_or_equal_to)",
        )
        in_range: dict[str, int | float] | None = Field(
            default=None,
            description="Numeric range constraints as {'min': num, 'max': num} (Pandera Check.in_range)",
        )
    
        # Set Membership Checks (Pandera Check.isin/notin)
        isin: list[str | int | float | bool] | None = Field(
            default=None,
            description="Values must be in this list of allowed values (Pandera Check.isin)",
        )
        notin: list[str | int | float | bool] | None = Field(
            default=None,
            description="Values must not be in this list of forbidden values (Pandera Check.notin)",
        )
    
        # String-specific Validation (Pandera Check string methods)
        str_contains: str | None = Field(
            default=None, description="Strings must contain this substring (Pandera Check.str_contains)"
        )
        str_endswith: str | None = Field(
            default=None, description="Strings must end with this suffix (Pandera Check.str_endswith)"
        )
        str_startswith: str | None = Field(
            default=None,
            description="Strings must start with this prefix (Pandera Check.str_startswith)",
        )
        str_matches: str | None = Field(
            default=None,
            description="Strings must match this regex pattern (Pandera Check.str_matches)",
        )
        str_length: dict[str, int] | None = Field(
            default=None,
            description="String length constraints as {'min': int, 'max': int} (Pandera Check.str_length)",
        )
    
        # Validation Control Parameters (Pandera behavior controls)
        ignore_na: bool = Field(
            default=True,
            description="Ignore null values during validation checks (Pandera ignore_na parameter)",
        )
        raise_warning: bool = Field(
            default=False,
            description="Raise warning instead of exception on validation failure (Pandera raise_warning parameter)",
        )
    
        @field_validator("str_matches")
        @classmethod
        def validate_regex_pattern(cls, v: str | None) -> str | None:
            """Validate that str_matches pattern is a valid regular expression."""
            if v is not None:
                re.compile(v)
            return v
    
        @field_validator("str_length", "in_range")
        @classmethod
        def validate_range_dict(cls, v: dict[str, int | float] | None) -> dict[str, int | float] | None:
            """Validate range constraint dictionaries for str_length and in_range."""
            if v is None:
                return v
    
            if not isinstance(v, dict):
                msg = "Range constraint must be a dictionary with 'min' and/or 'max' keys"
                raise TypeError(msg)
    
            allowed_keys = {"min", "max"}
            invalid_keys = set(v.keys()) - allowed_keys
            if invalid_keys:
                msg = f"Range constraint contains invalid keys: {invalid_keys}. Allowed: {allowed_keys}"
                raise ValueError(msg)
    
            # Validate min/max relationship
            if "min" in v and "max" in v and v["min"] > v["max"]:
                msg = f"Range constraint min ({v['min']}) cannot be greater than max ({v['max']})"
                raise ValueError(msg)
    
            return v
  • FastMCP server instance creation and tool registration for validate_schema and related validation tools.
    validation_server = FastMCP(
        "DataBeak-Validation",
        instructions="Data validation server for DataBeak",
    )
    
    # Register the validation functions as MCP tools
    validation_server.tool(name="validate_schema")(validate_schema)
    validation_server.tool(name="check_data_quality")(check_data_quality)
    validation_server.tool(name="find_anomalies")(find_anomalies)
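The rule-to-check dispatch in the handler above can be sketched without Pandera at all. This framework-free version (a simplification for illustration, not DataBeak code; the real tool delegates each rule to `pandera.Check`) maps each supported rule name to a predicate and collects the values that fail:

```python
# Framework-free sketch of the handler's rule-to-check mapping: each
# supported rule becomes a predicate over a single value, and validation
# collects the values failing any predicate. Only three rules shown.
def build_checks(rules: dict) -> list:
    checks = []
    if rules.get("greater_than") is not None:
        checks.append(("greater_than",
                       lambda v, t=rules["greater_than"]: v > t))
    if rules.get("in_range") is not None:
        lo, hi = rules["in_range"]["min"], rules["in_range"]["max"]
        checks.append(("in_range", lambda v, lo=lo, hi=hi: lo <= v <= hi))
    if rules.get("isin") is not None:
        allowed = set(rules["isin"])
        checks.append(("isin", lambda v, a=allowed: v in a))
    return checks

def validate_column(values: list, rules: dict) -> dict:
    """Return {check_name: [failing values]} for one column."""
    ignore_na = rules.get("ignore_na", True)
    failures: dict[str, list] = {}
    for name, pred in build_checks(rules):
        for v in values:
            if v is None and ignore_na:
                continue  # mirrors Pandera's ignore_na behavior
            if not pred(v):
                failures.setdefault(name, []).append(v)
    return failures

failures = validate_column([5, 150, None], {"in_range": {"min": 0, "max": 120}})
# 150 is outside the range; None is skipped because ignore_na defaults to True.
```

The same accumulate-then-report pattern is what `lazy=True` gives the real handler: Pandera gathers all failure cases into `SchemaErrors.failure_cases` instead of raising on the first one.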
Behavior: 2/5

Does the description disclose side effects, auth requirements, rate limits, or destructive behavior?

With no annotations provided, the description carries full burden for behavioral disclosure. It mentions that the tool 'leverages Pandera's comprehensive validation capabilities' and returns 'validation status and detailed error information', but doesn't specify important behavioral aspects: whether this is a read-only operation, what happens on validation failure (exceptions vs. warnings), performance characteristics, or data size limitations. The description adds some context about Pandera framework but misses critical operational details.

Agents need to know what a tool does to the world before calling it. Descriptions should go beyond structured annotations to explain consequences.

Conciseness: 3/5

Is the description appropriately sized, front-loaded, and free of redundancy?

The description is moderately concise but includes unnecessary promotional language ('comprehensive validation capabilities', 'maximum validation coverage and reliability') and external documentation links that don't help the AI agent. The core purpose is stated upfront, but the second paragraph and documentation links add bulk without operational value. The 'Returns:' section is useful but could be more integrated.

Shorter descriptions cost fewer tokens and are easier for agents to parse. Every sentence should earn its place.

Completeness: 3/5

Given the tool's complexity, does the description cover enough for an agent to succeed on first attempt?

Given the tool's complexity (data validation framework integration), no annotations, and the presence of an output schema, the description is minimally adequate. It identifies the framework and return type but misses important context: what data format is expected (presumably pandas DataFrame based on Pandera reference), how data is provided to the tool (not mentioned in parameters), error handling behavior, and performance considerations. The output schema existence reduces but doesn't eliminate the need for more operational context.

Complex tools with many parameters or behaviors need more documentation. Simple tools need less. This dimension scales expectations accordingly.

Parameters: 3/5

Does the description clarify parameter syntax, constraints, interactions, or defaults beyond what the schema provides?

Schema description coverage is 100%, so the schema already fully documents the single 'schema' parameter with extensive validation rule details. The description adds no parameter-specific information beyond what is in the schema: it doesn't explain how to structure the schema parameter, provide examples, or clarify the relationship between the schema parameter and the data being validated. A baseline of 3 is appropriate when the schema does all the work.

Input schemas describe structure but not intent. Descriptions should explain non-obvious parameter relationships and valid value ranges.

Purpose: 4/5

Does the description clearly state what the tool does and how it differs from similar tools?

The description clearly states the tool's purpose: 'Validate data against a schema definition using Pandera validation framework.' It specifies the verb (validate), resource (data), and framework (Pandera). However, it doesn't explicitly distinguish this from sibling tools like 'check_data_quality' or 'profile_data', which might have overlapping functionality in data validation contexts.

Agents choose between tools based on descriptions. A clear purpose with a specific verb and resource helps agents select the right tool.

Usage Guidelines: 2/5

Does the description explain when to use this tool, when not to, or what alternatives exist?

The description provides no guidance on when to use this tool versus alternatives. It mentions Pandera's capabilities but doesn't specify scenarios where this validation tool is appropriate compared to sibling tools like 'check_data_quality' or 'profile_data'. There's no mention of prerequisites, data format requirements, or when-not-to-use conditions.

Agents often have multiple tools that could apply. Explicit usage guidance like "use X instead of Y when Z" prevents misuse.
