detect_outliers
Identify data points that deviate from normal patterns using statistical and machine-learning methods, supporting data quality assessment and anomaly detection in analytical workflows.
Instructions
Detect outliers in numerical columns using various algorithms.
Identifies data points that deviate significantly from the normal pattern using statistical and machine learning methods. Essential for data quality assessment and anomaly detection in analytical workflows.
Returns: Detailed outlier analysis with locations and severity scores
Detection Methods:
- 📊 Z-Score: Statistical method based on standard deviations
- 📈 IQR: Interquartile range method (robust to distribution)
- 🤖 Isolation Forest: ML-based method for high-dimensional data
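The method determines how the threshold is interpreted, so the same value can flag very different points. As a self-contained illustration (not the tool's code), the sketch below applies the z-score and IQR rules to a toy pandas Series; on such a small sample, a z-score cutoff of 3 misses the obvious spike that the default IQR rule flags.

```python
# Illustration only: how the z-score and IQR rules behave on a tiny series.
import pandas as pd

values = pd.Series([10, 12, 11, 13, 12, 11, 95], name="price")

# Z-score rule: flag points more than `threshold` standard deviations from the mean.
z_threshold = 3.0
z_scores = (values - values.mean()).abs() / values.std()
print(values[z_scores > z_threshold].to_dict())  # {} -- the spike inflates the std itself

# IQR rule: flag points outside [Q1 - threshold*IQR, Q3 + threshold*IQR].
iqr_threshold = 1.5
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
mask = (values < q1 - iqr_threshold * iqr) | (values > q3 + iqr_threshold * iqr)
print(values[mask].to_dict())  # {6: 95} -- the spike is flagged
```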
Examples:

```python
# Basic outlier detection
outliers = await detect_outliers(ctx, ["price", "quantity"])

# Use IQR method with custom threshold
outliers = await detect_outliers(ctx, ["sales"], method="iqr", threshold=2.5)
```
AI Workflow Integration:
1. Data quality assessment and cleaning
2. Anomaly detection for fraud/error identification
3. Data preprocessing for machine learning
4. Understanding data distribution characteristics
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| columns | No | List of numerical columns to analyze for outliers (None = all numeric) | None |
| method | No | Detection algorithm: zscore, iqr, or isolation_forest | iqr |
| threshold | No | Sensitivity threshold (higher = less sensitive) | 1.5 |
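Note that the default threshold of 1.5 is calibrated for the IQR rule (the standard boxplot whisker); for the z-score rule, which counts standard deviations, values of 2-3 are more typical. A quick sanity check on synthetic normal data (illustration only, not part of the tool) shows how differently the same number behaves:

```python
# Illustration: fraction of a standard normal sample flagged at threshold=1.5.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100_000)

q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
iqr_rate = np.mean((x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr))  # ~0.7%

z_rate = np.mean(np.abs((x - x.mean()) / x.std()) > 1.5)  # ~13%

print(f"IQR(1.5) flags {iqr_rate:.2%}; z-score(1.5) flags {z_rate:.2%}")
```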
Implementation Reference
- The core handler function implementing outlier detection logic for numerical columns using IQR or Z-score methods. Processes data from the session, identifies outliers, and returns structured results.

```python
async def detect_outliers(
    ctx: Annotated[Context, Field(description="FastMCP context for session access")],
    columns: Annotated[
        list[str] | None,
        Field(description="List of numerical columns to analyze for outliers (None = all numeric)"),
    ] = None,
    method: Annotated[
        str,
        Field(description="Detection algorithm: zscore, iqr, or isolation_forest"),
    ] = "iqr",
    threshold: Annotated[
        float,
        Field(description="Sensitivity threshold (higher = less sensitive)"),
    ] = 1.5,
) -> OutliersResult:
    """Detect outliers in numerical columns using various algorithms.

    Identifies data points that deviate significantly from the normal pattern using
    statistical and machine learning methods. Essential for data quality assessment
    and anomaly detection in analytical workflows.

    Returns:
        Detailed outlier analysis with locations and severity scores

    Detection Methods:
        📊 Z-Score: Statistical method based on standard deviations
        📈 IQR: Interquartile range method (robust to distribution)
        🤖 Isolation Forest: ML-based method for high-dimensional data

    Examples:
        # Basic outlier detection
        outliers = await detect_outliers(ctx, ["price", "quantity"])

        # Use IQR method with custom threshold
        outliers = await detect_outliers(ctx, ["sales"], method="iqr", threshold=2.5)

    AI Workflow Integration:
        1. Data quality assessment and cleaning
        2. Anomaly detection for fraud/error identification
        3. Data preprocessing for machine learning
        4. Understanding data distribution characteristics
    """
    # Get session_id from FastMCP context
    session_id = ctx.session_id
    _session, df = get_session_data(session_id)

    # Select numeric columns
    if columns:
        missing_cols = [col for col in columns if col not in df.columns]
        if missing_cols:
            raise ColumnNotFoundError(missing_cols[0], df.columns.tolist())
        numeric_df = df[columns].select_dtypes(include=[np.number])
    else:
        numeric_df = df.select_dtypes(include=[np.number])

    if numeric_df.empty:
        raise InvalidParameterError(
            "columns",  # noqa: EM101
            columns if columns else "auto-detected",
            "at least one numeric column",
        )

    outliers_by_column = {}
    total_outliers_count = 0

    if method == "iqr":
        for col in numeric_df.columns:
            q1 = numeric_df[col].quantile(0.25)
            q3 = numeric_df[col].quantile(0.75)
            iqr = q3 - q1
            lower_bound = q1 - threshold * iqr
            upper_bound = q3 + threshold * iqr
            outlier_mask = (numeric_df[col] < lower_bound) | (numeric_df[col] > upper_bound)
            outlier_indices = df.index[outlier_mask]

            # Create OutlierInfo objects for each outlier
            outlier_infos = []
            for idx in outlier_indices[:100]:  # Limit to first 100
                raw_value = df.loc[idx, col]
                try:
                    value = float(cast("Any", raw_value))
                except (ValueError, TypeError):
                    continue  # Skip non-numeric values

                # Calculate IQR score (distance from nearest bound relative to IQR)
                if value < lower_bound:
                    iqr_score = float((lower_bound - value) / iqr) if iqr > 0 else 0.0
                else:
                    iqr_score = float((value - upper_bound) / iqr) if iqr > 0 else 0.0

                outlier_infos.append(
                    OutlierInfo(row_index=int(idx), value=value, iqr_score=iqr_score),
                )

            outliers_by_column[col] = outlier_infos
            total_outliers_count += len(outlier_indices)

    elif method == "zscore":
        for col in numeric_df.columns:
            col_mean = numeric_df[col].mean()
            col_std = numeric_df[col].std()
            z_scores = np.abs((numeric_df[col] - col_mean) / col_std)
            outlier_mask = z_scores > threshold
            outlier_indices = df.index[outlier_mask]

            # Create OutlierInfo objects for each outlier
            outlier_infos = []
            for idx in outlier_indices[:100]:  # Limit to first 100
                raw_value = df.loc[idx, col]
                try:
                    value = float(cast("Any", raw_value))
                except (ValueError, TypeError):
                    continue  # Skip non-numeric values

                z_score = float(abs((value - col_mean) / col_std)) if col_std > 0 else 0.0
                outlier_infos.append(
                    OutlierInfo(row_index=int(idx), value=value, z_score=z_score),
                )

            outliers_by_column[col] = outlier_infos
            total_outliers_count += len(outlier_indices)

    else:
        raise InvalidParameterError(
            "method",  # noqa: EM101
            method,
            "zscore, iqr, or isolation_forest",
        )

    # Map method names to match Pydantic model expectations
    if method == "zscore":
        pydantic_method = "zscore"
    elif method == "iqr":
        pydantic_method = "iqr"
    else:
        pydantic_method = "isolation_forest"

    return OutliersResult(
        outliers_found=total_outliers_count,
        outliers_by_column=outliers_by_column,
        method=cast("Literal['zscore', 'iqr', 'isolation_forest']", pydantic_method),
        threshold=threshold,
    )
```
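The returned OutliersResult (defined below) is a plain Pydantic object, so downstream code can read it field by field. A minimal sketch of a hypothetical consumer, assuming an async context whose session already has a CSV loaded:

```python
# Hypothetical helper, not part of the server: summarizes detect_outliers output.
async def summarize_outliers(ctx) -> None:
    result = await detect_outliers(ctx, columns=["price", "quantity"], method="iqr", threshold=1.5)
    print(f"{result.outliers_found} outliers via {result.method} (threshold={result.threshold})")
    for column, infos in result.outliers_by_column.items():
        for info in infos:
            # Exactly one of iqr_score / z_score is populated, depending on the method.
            score = info.iqr_score if info.iqr_score is not None else info.z_score
            print(f"  {column}[{info.row_index}] = {info.value} (score={score})")
```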
- Pydantic model defining the structure for individual outlier information used in the response.

```python
class OutlierInfo(BaseModel):
    """Information about a detected outlier."""

    row_index: int = Field(description="Row index where outlier was detected")
    value: float = Field(description="Outlier value found")
    z_score: float | None = Field(default=None, description="Z-score if using z-score method")
    iqr_score: float | None = Field(default=None, description="IQR score if using IQR method")
```
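Since OutlierInfo extends BaseModel directly, its JSON shape can be previewed in isolation (assuming Pydantic v2's model_dump):

```python
# Illustration of the per-outlier payload shape.
info = OutlierInfo(row_index=42, value=987.5, iqr_score=3.2)
print(info.model_dump())
# {'row_index': 42, 'value': 987.5, 'z_score': None, 'iqr_score': 3.2}
```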
- Pydantic response model for the detect_outliers tool output, including total outliers, per-column details, method, and threshold.

```python
class OutliersResult(BaseToolResponse):
    """Response model for outlier detection analysis."""

    outliers_found: int = Field(description="Total number of outliers detected")
    outliers_by_column: dict[str, list[OutlierInfo]] = Field(
        description="Outliers grouped by column name",
    )
    method: Literal["zscore", "iqr", "isolation_forest"] = Field(
        description="Detection method used",
    )
    threshold: float = Field(description="Threshold value used for detection")
```
- src/databeak/servers/discovery_server.py:851 (registration): registers the detect_outliers handler as an MCP tool named 'detect_outliers' on the discovery_server FastMCP instance.

```python
discovery_server.tool(name="detect_outliers")(detect_outliers)
```
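For completeness, a rough sketch of exercising the registered tool in-process. It assumes fastmcp's in-memory Client transport and that a CSV has already been loaded into the client's session (not shown), so treat it as a starting point rather than a working test:

```python
# Rough sketch only: assumes the fastmcp Client API and an already-loaded session.
import asyncio

from fastmcp import Client

from databeak.servers.discovery_server import discovery_server


async def main() -> None:
    async with Client(discovery_server) as client:
        result = await client.call_tool(
            "detect_outliers",
            {"columns": ["price"], "method": "iqr", "threshold": 1.5},
        )
        print(result)


asyncio.run(main())
```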