clean_dataset
Apply cleaning operations to a dataset (dropping duplicates, filling missing values, renaming columns, stripping whitespace, standardizing dates, dropping rows with nulls) and write a cleaned copy without modifying the original.
Instructions
Apply cleaning operations to a dataset and write a new file.
NEVER modifies the original file. Always writes to output_path.
Supported operations:
- "drop_duplicates" — remove exact duplicate rows
- "drop_columns:[col1:col2:...]" — remove specified columns
- "fill_na:[col:value]" — fill nulls in col with value
- "rename_column:[old_name:new_name]" — rename a column
- "strip_whitespace" — strip leading/trailing spaces from all string columns
- "standardize_dates:[col:format]" — parse col as date (format: 'auto' or strftime)
- "drop_na_rows:[col]" — drop rows where col is null
- "drop_na_rows_any" — drop rows with ANY null value
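As an illustration of how operation strings of this shape could be split into a name and its arguments, here is a minimal sketch. The helper `parse_operation` is hypothetical and not part of the tool. It assumes the square brackets are optional delimiters and that only `drop_columns` takes more than two arguments, so that a strftime format or fill value containing a colon stays in one piece. The tool's real parsing rules may differ.

```python
def parse_operation(op: str) -> tuple[str, list[str]]:
    """Split an operation string into (name, arguments).

    Hypothetical helper: the bracket handling and splitting rules
    below are assumptions, not the tool's documented behavior.
    """
    name, sep, rest = op.partition(":")
    name = name.strip()
    if not sep:
        # Operations such as "drop_duplicates" take no arguments.
        return name, []
    rest = rest.strip()
    # Assumption: surrounding brackets, if present, are just delimiters.
    if rest.startswith("[") and rest.endswith("]"):
        rest = rest[1:-1]
    if name == "drop_columns":
        # Any number of column names, separated by colons.
        parts = rest.split(":")
    else:
        # At most two arguments; split once so the second may contain colons.
        parts = rest.split(":", 1)
    return name, [part.strip() for part in parts]


operations = [
    "drop_duplicates",
    "drop_columns:[notes:internal_id]",
    "fill_na:[age:0]",
    "standardize_dates:[created_at:auto]",
]
parsed = [parse_operation(op) for op in operations]
```

The column names in the sample list are placeholders chosen for the example.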
Args:
path: Absolute local path to the source dataset.
operations: List of operation strings (see above).
output_path: Where to write the cleaned file. If empty, appends '_cleaned'
before the extension (e.g. data.csv → data_cleaned.csv).
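The default naming rule for output_path can be sketched with `pathlib`. The function name `default_output_path` is hypothetical. How the tool treats files with several extensions (for example `.tar.gz`) is not stated, so this sketch only handles the last suffix.

```python
from pathlib import Path


def default_output_path(path: str) -> str:
    """Insert '_cleaned' before the file extension of `path`.

    Hypothetical helper mirroring the documented rule
    (data.csv -> data_cleaned.csv); only the last suffix is considered.
    """
    source = Path(path)
    return str(source.with_name(f"{source.stem}_cleaned{source.suffix}"))
```

A caller that wants a different location can pass output_path explicitly instead of relying on this default.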
Returns JSON with: output_path, original_shape, cleaned_shape, row_delta,
col_delta, operations_applied, operations_skipped.
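The shape-related fields can be related to each other as in the sketch below. The helper `shape_deltas` is hypothetical, and the sign convention (cleaned minus original, so removed rows give a negative delta) is an assumption; the tool may report the difference the other way around.

```python
def shape_deltas(
    original_shape: tuple[int, int], cleaned_shape: tuple[int, int]
) -> dict[str, int]:
    """Compute row and column deltas from (rows, columns) shapes.

    Assumption: delta = cleaned - original, so dropping rows or
    columns yields negative values.
    """
    return {
        "row_delta": cleaned_shape[0] - original_shape[0],
        "col_delta": cleaned_shape[1] - original_shape[1],
    }
```

Checking operations_skipped in the returned JSON is a reasonable way to confirm that every requested operation was actually applied.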
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
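| path | Yes | Absolute local path to the source dataset. | |
| operations | Yes | List of operation strings (see the supported operations above). | |
| output_path | No | Where to write the cleaned file. If empty, '_cleaned' is appended before the extension (e.g. data.csv → data_cleaned.csv). | |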
Output Schema
| Name | Required | Description | Default |
|---|---|---|---|
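| result | Yes | JSON with output_path, original_shape, cleaned_shape, row_delta, col_delta, operations_applied, operations_skipped. | |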