validate_schema

Validate data against schema definitions using the Pandera framework to ensure data quality and compliance with specified rules.

Instructions

Validate data against a schema definition using Pandera validation framework.

This function leverages Pandera's comprehensive validation capabilities to provide robust data validation. The schema is dynamically converted to Pandera format and applied to the DataFrame for maximum validation coverage and reliability.

For more information on Pandera validation capabilities, see:

  • Pandera Documentation: https://pandera.readthedocs.io/

  • Check API: https://pandera.readthedocs.io/en/stable/reference/generated/pandera.api.checks.Check.html

Returns: ValidateSchemaResult with validation status and detailed error information
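Based on the ValidateSchemaResult and ValidationSummary models shown under Implementation Reference, a result serializes to JSON roughly as follows (column names and counts below are illustrative):

```json
{
  "valid": false,
  "errors": [
    {"error": "column_missing", "message": "Column 'age' not found in data"}
  ],
  "summary": {
    "total_columns": 2,
    "valid_columns": 1,
    "invalid_columns": 1,
    "missing_columns": ["age"],
    "extra_columns": []
  },
  "validation_errors": {
    "age": [
      {"error": "column_missing", "message": "Column 'age' not found in data"}
    ]
  }
}
```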

Input Schema

Name     Required   Description                                       Default
schema   Yes        Schema definition with column validation rules    (none)
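For illustration, a hypothetical `schema` argument mapping column names to validation rules (the column names and rule values here are made up; the rule keys come from the ColumnValidationRules model below):

```json
{
  "age": {"nullable": false, "in_range": {"min": 0, "max": 120}},
  "status": {"isin": ["active", "inactive"]},
  "email": {"str_matches": "^[^@]+@[^@]+$", "str_length": {"min": 6}}
}
```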

Implementation Reference

  • Main handler function implementing the validate_schema tool. Loads session data, builds dynamic Pandera DataFrameSchema from input rules, performs validation, collects and limits errors, returns detailed results.
    def validate_schema(
        ctx: Annotated[Context, Field(description="FastMCP context for session access")],
        schema: Annotated[
            ValidationSchema,
            Field(description="Schema definition with column validation rules"),
        ],
    ) -> ValidateSchemaResult:
        """Validate data against a schema definition using Pandera validation framework.
    
        This function leverages Pandera's comprehensive validation capabilities to provide
        robust data validation. The schema is dynamically converted to Pandera format
        and applied to the DataFrame for maximum validation coverage and reliability.
    
        For more information on Pandera validation capabilities, see:
        - Pandera Documentation: https://pandera.readthedocs.io/
        - Check API: https://pandera.readthedocs.io/en/stable/reference/generated/pandera.api.checks.Check.html
    
        Returns:
            ValidateSchemaResult with validation status and detailed error information
    
        """
        session_id = ctx.session_id
        _session, df = get_session_data(session_id)
        settings = get_settings()
        validation_errors: dict[str, list[ValidationError]] = {}
    
        parsed_schema = schema.root
    
        # Apply resource management for large datasets
        logger.info("Validating schema for %d rows, %d columns", len(df), len(df.columns))
        if len(df) > settings.max_anomaly_sample_size:
            logger.warning(
                "Large dataset (%d rows), using sample of %d for validation",
                len(df),
                settings.max_anomaly_sample_size,
            )
            df = sample_large_dataset(df, settings.max_anomaly_sample_size, "Schema validation")
    
        # Convert validation_summary to ValidationSummary
        validation_summary = ValidationSummary(
            total_columns=len(parsed_schema),
            valid_columns=0,
            invalid_columns=0,
            missing_columns=[],
            extra_columns=[],
        )
    
        # Check for missing and extra columns
        schema_columns = set(parsed_schema.keys())
        df_columns = set(df.columns)
    
        validation_summary.missing_columns = list(schema_columns - df_columns)
        validation_summary.extra_columns = list(df_columns - schema_columns)
    
        # Build Pandera schema dynamically from our validation rules
        pandera_columns = {}
    
        for col_name, rules_model in parsed_schema.items():
            if col_name not in df.columns:
                # Handle missing columns separately
                validation_errors[col_name] = [
                    ValidationError(
                        error="column_missing",
                        message=f"Column '{col_name}' not found in data",
                    ),
                ]
                validation_summary.invalid_columns += 1
                continue
    
            # Convert ColumnValidationRules to Pandera checks
            checks = []
            rules = rules_model.model_dump(exclude_none=True)
            ignore_na = rules.get("ignore_na", True)
    
            # Build Pandera checks from validation rules
            if rules.get("equal_to") is not None:
                checks.append(Check.equal_to(rules["equal_to"], ignore_na=ignore_na))
            if rules.get("not_equal_to") is not None:
                checks.append(Check.not_equal_to(rules["not_equal_to"], ignore_na=ignore_na))
            if rules.get("greater_than") is not None:
                checks.append(Check.greater_than(rules["greater_than"], ignore_na=ignore_na))
            if rules.get("greater_than_or_equal_to") is not None:
                checks.append(
                    Check.greater_than_or_equal_to(
                        rules["greater_than_or_equal_to"], ignore_na=ignore_na
                    )
                )
            if rules.get("less_than") is not None:
                checks.append(Check.less_than(rules["less_than"], ignore_na=ignore_na))
            if rules.get("less_than_or_equal_to") is not None:
                checks.append(
                    Check.less_than_or_equal_to(rules["less_than_or_equal_to"], ignore_na=ignore_na)
                )
            if rules.get("in_range") is not None:
                range_dict = rules["in_range"]
                # The rules model allows 'min' and/or 'max'; default a missing
                # bound to +/- infinity so Check.in_range always receives both
                checks.append(
                    Check.in_range(
                        range_dict.get("min", float("-inf")),
                        range_dict.get("max", float("inf")),
                        ignore_na=ignore_na,
                    )
                )
            if rules.get("isin") is not None:
                checks.append(Check.isin(rules["isin"], ignore_na=ignore_na))
            if rules.get("notin") is not None:
                checks.append(Check.notin(rules["notin"], ignore_na=ignore_na))
            if rules.get("str_contains") is not None:
                checks.append(Check.str_contains(rules["str_contains"], ignore_na=ignore_na))
            if rules.get("str_endswith") is not None:
                checks.append(Check.str_endswith(rules["str_endswith"], ignore_na=ignore_na))
            if rules.get("str_startswith") is not None:
                checks.append(Check.str_startswith(rules["str_startswith"], ignore_na=ignore_na))
            if rules.get("str_matches") is not None:
                checks.append(Check.str_matches(rules["str_matches"], ignore_na=ignore_na))
            if rules.get("str_length") is not None:
                length_dict = rules["str_length"]
                min_len = length_dict.get("min")
                max_len = length_dict.get("max")
                checks.append(Check.str_length(min_len, max_len, ignore_na=ignore_na))
    
            # Create Pandera Column with checks
            pandera_columns[col_name] = Column(
                nullable=rules.get("nullable", True),
                unique=rules.get("unique", False),
                coerce=rules.get("coerce", False),
                checks=checks,
                name=col_name,
            )
    
        # Create and apply Pandera DataFrameSchema
        pandera_schema = DataFrameSchema(
            columns=pandera_columns,
            strict=False,  # Allow extra columns not in schema
            name="DataBeak_Validation_Schema",
        )
    
        # Validate using Pandera
        try:
            pandera_schema.validate(df, lazy=True)
            # If validation succeeds, update summary
            validation_summary.valid_columns = len(pandera_columns)
            validation_summary.invalid_columns = len(validation_errors)  # Only missing columns
    
        except pandera.errors.SchemaErrors as schema_errors:
            # Process Pandera validation errors
            for error_data in schema_errors.failure_cases.to_dict("records"):
                col_name = str(error_data.get("column", "unknown"))
                check_name = str(error_data.get("check", "unknown"))
                failure_case = error_data.get("failure_case", "unknown")
    
                if col_name not in validation_errors:
                    validation_errors[col_name] = []
    
                validation_errors[col_name].append(
                    ValidationError(
                        error=f"pandera_{check_name}",
                        message=f"Pandera validation failed: {check_name} - {failure_case}",
                        check_name=check_name,
                        failure_case=str(failure_case),
                    )
                )
    
        validation_summary.invalid_columns = len(validation_errors)
        # validation_errors already includes entries for missing columns, so
        # subtracting len(missing_columns) again would double-count them
        validation_summary.valid_columns = (
            len(parsed_schema) - validation_summary.invalid_columns
        )
    
        is_valid = len(validation_errors) == 0 and len(validation_summary.missing_columns) == 0
    
        # No longer recording operations (simplified MCP architecture)
    
        # Flatten all validation errors with resource limits
        all_errors = []
        for error_list in validation_errors.values():
            all_errors.extend(error_list)
    
        # Apply violation limits to prevent resource exhaustion
        limited_errors, was_truncated = apply_violation_limits(
            all_errors, settings.max_validation_violations, "Schema validation"
        )
    
        if was_truncated:
            logger.warning(
                "Validation found %d errors, limited to %d",
                len(all_errors),
                settings.max_validation_violations,
            )
    
        return ValidateSchemaResult(
            valid=is_valid,
            errors=limited_errors,
            summary=validation_summary,
            validation_errors=validation_errors,
        )
  • Output Pydantic model defining the structure of the validation result returned by the tool.
    class ValidateSchemaResult(BaseModel):
        """Response model for schema validation operations."""
    
        valid: bool = Field(description="Whether validation passed overall")
        errors: list[ValidationError] = Field(description="All validation errors found")
        summary: ValidationSummary = Field(description="Summary of validation results")
        validation_errors: dict[str, list[ValidationError]] = Field(
            description="Validation errors grouped by column name",
        )
  • Input Pydantic RootModel wrapping the schema dictionary of column validation rules.
    class ValidationSchema(RootModel[dict[str, ColumnValidationRules]]):
        """Schema definition for data validation."""
  • Detailed input model defining validation rules for each column, supporting Pandera-compatible checks like ranges, patterns, uniqueness, etc.
    class ColumnValidationRules(BaseModel):
        """Column validation rules based on Pandera Field and Check validation capabilities.
    
        This class implements comprehensive column validation using rules compatible with
        Pandera's validation system. It leverages Pandera's robust validation framework
        for maximum data quality assurance.
    
        For complete documentation on validation behaviors and options, see:
        - Pandera Field API: https://pandera.readthedocs.io/en/stable/reference/generated/pandera.api.pandas.model_components.Field.html
        - Pandera Check API: https://pandera.readthedocs.io/en/stable/reference/generated/pandera.api.checks.Check.html
        - Pandas validation guide: https://pandas.pydata.org/docs/user_guide/basics.html#validation
    
        The validation rules are organized by category to match Pandera's Check API for
        maximum compatibility and comprehensive data validation coverage.
        """
    
        # Core Field Properties (Pandera Field parameters)
        nullable: bool = Field(
            default=True, description="Allow null/NaN values in the column (Pandera nullable parameter)"
        )
        unique: bool = Field(
            default=False, description="Ensure all column values are unique (Pandera unique parameter)"
        )
        coerce: bool = Field(
            default=False, description="Attempt automatic type conversion (Pandera coerce parameter)"
        )
    
        # Equality Checks (Pandera Check.equal_to/not_equal_to)
        equal_to: int | float | str | bool | None = Field(
            default=None, description="All values must equal this exact value (Pandera Check.equal_to)"
        )
        not_equal_to: int | float | str | bool | None = Field(
            default=None, description="No values may equal this value (Pandera Check.not_equal_to)"
        )
    
        # Numeric Range Checks (Pandera Check comparison methods)
        greater_than: int | float | None = Field(
            default=None,
            description="All numeric values must be strictly greater than this (Pandera Check.greater_than)",
        )
        greater_than_or_equal_to: int | float | None = Field(
            default=None,
            description="All numeric values must be >= this value (Pandera Check.greater_than_or_equal_to)",
        )
        less_than: int | float | None = Field(
            default=None,
            description="All numeric values must be strictly less than this (Pandera Check.less_than)",
        )
        less_than_or_equal_to: int | float | None = Field(
            default=None,
            description="All numeric values must be <= this value (Pandera Check.less_than_or_equal_to)",
        )
        in_range: dict[str, int | float] | None = Field(
            default=None,
            description="Numeric range constraints as {'min': num, 'max': num} (Pandera Check.in_range)",
        )
    
        # Set Membership Checks (Pandera Check.isin/notin)
        isin: list[str | int | float | bool] | None = Field(
            default=None,
            description="Values must be in this list of allowed values (Pandera Check.isin)",
        )
        notin: list[str | int | float | bool] | None = Field(
            default=None,
            description="Values must not be in this list of forbidden values (Pandera Check.notin)",
        )
    
        # String-specific Validation (Pandera Check string methods)
        str_contains: str | None = Field(
            default=None, description="Strings must contain this substring (Pandera Check.str_contains)"
        )
        str_endswith: str | None = Field(
            default=None, description="Strings must end with this suffix (Pandera Check.str_endswith)"
        )
        str_startswith: str | None = Field(
            default=None,
            description="Strings must start with this prefix (Pandera Check.str_startswith)",
        )
        str_matches: str | None = Field(
            default=None,
            description="Strings must match this regex pattern (Pandera Check.str_matches)",
        )
        str_length: dict[str, int] | None = Field(
            default=None,
            description="String length constraints as {'min': int, 'max': int} (Pandera Check.str_length)",
        )
    
        # Validation Control Parameters (Pandera behavior controls)
        ignore_na: bool = Field(
            default=True,
            description="Ignore null values during validation checks (Pandera ignore_na parameter)",
        )
        raise_warning: bool = Field(
            default=False,
            description="Raise warning instead of exception on validation failure (Pandera raise_warning parameter)",
        )
    
        @field_validator("str_matches")
        @classmethod
        def validate_regex_pattern(cls, v: str | None) -> str | None:
            """Validate that str_matches pattern is a valid regular expression."""
            if v is not None:
                re.compile(v)
            return v
    
        @field_validator("str_length", "in_range")
        @classmethod
        def validate_range_dict(cls, v: dict[str, int | float] | None) -> dict[str, int | float] | None:
            """Validate range constraint dictionaries for str_length and in_range."""
            if v is None:
                return v
    
            if not isinstance(v, dict):
                msg = "Range constraint must be a dictionary with 'min' and/or 'max' keys"
                raise TypeError(msg)
    
            allowed_keys = {"min", "max"}
            invalid_keys = set(v.keys()) - allowed_keys
            if invalid_keys:
                msg = f"Range constraint contains invalid keys: {invalid_keys}. Allowed: {allowed_keys}"
                raise ValueError(msg)
    
            # Validate min/max relationship
            if "min" in v and "max" in v and v["min"] > v["max"]:
                msg = f"Range constraint min ({v['min']}) cannot be greater than max ({v['max']})"
                raise ValueError(msg)
    
            return v
  • FastMCP server instance creation and tool registration for validate_schema and related validation tools.
    validation_server = FastMCP(
        "DataBeak-Validation",
        instructions="Data validation server for DataBeak",
    )
    
    # Register the validation functions as MCP tools
    validation_server.tool(name="validate_schema")(validate_schema)
    validation_server.tool(name="check_data_quality")(check_data_quality)
    validation_server.tool(name="find_anomalies")(find_anomalies)
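The handler above translates each rule into a Pandera Check and collects the failing values. As a rough, dependency-free sketch of the semantics for three of those rules (`in_range`, `isin`, `str_length`) — illustrative only, not DataBeak's actual code:

```python
def run_checks(values, rules):
    """Apply a subset of the rule semantics and return the failing values,
    loosely mirroring how Pandera reports failure_case entries.

    `values` is a list of cell values (None stands in for NaN); `rules` is a
    dict using the same keys as ColumnValidationRules.
    """
    failures = []
    for v in values:
        if v is None:
            # ignore_na defaults to True, matching the handler above
            if not rules.get("ignore_na", True):
                failures.append(v)
            continue
        r = rules.get("in_range")
        if r is not None and not (r["min"] <= v <= r["max"]):
            failures.append(v)
            continue
        allowed = rules.get("isin")
        if allowed is not None and v not in allowed:
            failures.append(v)
            continue
        sl = rules.get("str_length")
        if sl is not None:
            n = len(v)
            if n < sl.get("min", 0) or n > sl.get("max", n):
                failures.append(v)
    return failures


print(run_checks([5, 150, None], {"in_range": {"min": 0, "max": 120}}))  # [150]
```

Pandera performs the real work vectorized over pandas Series and with lazy evaluation (`lazy=True`), which is why the handler can collect all failures in one pass rather than stopping at the first error.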
