clean_dataset
Clean datasets by handling missing values and outliers using customizable strategies like imputation, removal, or capping. Prepare data for analysis with methods such as KNN imputation and IQR or Z-score outlier detection.
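To make the "capping" and "IQR" terms concrete, here is a minimal sketch of IQR-based capping in plain Python. It is not the tool's actual implementation: the helper names `quantile` and `cap_outliers_iqr`, the linear-interpolation quantile rule, and the multiplier of 1.5 are all assumptions chosen for illustration (1.5 is the conventional value; the tool's own default for `outlier_iqr_multiplier` is not documented here).

```python
def quantile(sorted_vals, q):
    """Linear-interpolation quantile of a pre-sorted list (0 <= q <= 1)."""
    pos = q * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    frac = pos - lo
    return sorted_vals[lo] + frac * (sorted_vals[hi] - sorted_vals[lo])


def cap_outliers_iqr(values, multiplier=1.5):
    """Clamp values to [Q1 - m*IQR, Q3 + m*IQR] instead of dropping them.

    Hypothetical helper: the multiplier default is an assumption,
    not a documented default of the clean_dataset tool.
    """
    ordered = sorted(values)
    q1 = quantile(ordered, 0.25)
    q3 = quantile(ordered, 0.75)
    iqr = q3 - q1
    lower = q1 - multiplier * iqr
    upper = q3 + multiplier * iqr
    # Capping keeps every row; only the extreme values are pulled to the fences.
    return [min(max(v, lower), upper) for v in values]


capped = cap_outliers_iqr([10, 12, 11, 13, 12, 95])
print(capped)  # the 95 is pulled down to the upper fence
```

Capping preserves the row count, which is often preferable to removal when the dataset is small or when the other columns in an outlier row are still valid.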
Instructions
Clean a dataset by handling missing values and outliers.
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| dataset_name | Yes | Name of the dataset to clean | |
| missing_strategy | No | Strategy for handling missing values (supports both full names and short aliases) | |
| outlier_strategy | No | Strategy for handling outliers | cap |
| outlier_method | No | Method for outlier detection (supports both full names and short aliases) | iqr |
| missing_constant_value | No | Value to use when missing_strategy is fill_constant | |
| missing_drop_threshold | No | Proportion of missing values above which to drop columns/rows | |
| missing_knn_neighbors | No | Number of neighbors for KNN imputation | |
| missing_max_iter | No | Maximum iterations for iterative imputation | |
| missing_random_state | No | Random seed for reproducible imputation | |
| outlier_z_threshold | No | Z-score threshold for outlier detection | |
| outlier_iqr_multiplier | No | IQR multiplier for outlier detection | |
| outlier_contamination | No | Expected contamination ratio for isolation forest and LOF | |
| outlier_percentile_lower | No | Lower percentile bound for percentile-based outlier detection | |
| outlier_percentile_upper | No | Upper percentile bound for percentile-based outlier detection | |
| outlier_dbscan_eps | No | DBSCAN epsilon parameter | |
| outlier_dbscan_min_samples | No | DBSCAN minimum samples parameter | |
| handle_missing_first | No | Whether to handle missing values before outlier detection | |
| preserve_original | No | Whether to preserve the original dataset alongside the cleaned version | |
| output_name | Yes | Name for the cleaned dataset | |
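As an illustration of how the schema above might be filled in, the sketch below assembles an argument set and checks that the two required fields are present. The helper `build_clean_dataset_args` and the dataset names `sales_raw` and `sales_clean` are hypothetical; how the arguments are actually sent to the tool depends on the client and is not shown. The strategy values used are the ones named in the table (`fill_constant`, `cap`, `iqr`).

```python
import json

# Fields marked "Yes" in the Required column of the schema.
REQUIRED = ("dataset_name", "output_name")


def build_clean_dataset_args(**kwargs):
    """Return the arguments unchanged after checking required fields.

    Hypothetical client-side helper, not part of the tool itself.
    """
    missing = [name for name in REQUIRED if name not in kwargs]
    if missing:
        raise ValueError(f"missing required arguments: {', '.join(missing)}")
    return kwargs


args = build_clean_dataset_args(
    dataset_name="sales_raw",          # hypothetical dataset name
    output_name="sales_clean",         # hypothetical output name
    missing_strategy="fill_constant",  # value named in the schema
    missing_constant_value=0,          # used because of fill_constant
    outlier_strategy="cap",            # documented default
    outlier_method="iqr",              # documented default
)
print(json.dumps(args, indent=2))
```

Omitted optional parameters are left out of the payload entirely so that the tool can apply its own defaults.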