clean_dataset
Remove duplicates, fill nulls, and rename columns to clean a dataset, saving results to a new file without altering the original.
Instructions
Apply cleaning operations to a dataset and write a new file.
NEVER modifies the original file. Always writes to output_path.
Supported operations:
- "drop_duplicates" — remove exact duplicate rows
- "drop_columns:[col1:col2:...]" — remove specified columns
- "fill_na:[col:value]" — fill nulls in col with value
- "rename_column:[old_name:new_name]" — rename a column
- "strip_whitespace" — strip leading/trailing spaces from all string columns
- "standardize_dates:[col:format]" — parse col as date (format: 'auto' or strftime)
- "drop_na_rows:[col]" — drop rows where col is null
- "drop_na_rows_any" — drop rows with ANY null value
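The operation strings above share one shape: a name, optionally followed by a colon and a bracketed, colon-separated argument list. The sketch below shows one way such strings might be split into a name and arguments. It is not the tool's actual parser. `parse_operation` and `MULTI_ARG_OPS` are hypothetical names, and it assumes the square brackets are literal characters in the string.

```python
# Hypothetical helper: the tool's real parser is not shown in the source.
# Assumption: brackets are literal, and arguments are colon-separated.
# Only drop_columns takes an open-ended list; the other operations are split
# once, so a value that itself contains a colon (e.g. "%H:%M") stays intact.
MULTI_ARG_OPS = {"drop_columns"}


def parse_operation(op: str) -> tuple[str, list[str]]:
    """Split an operation string into (name, arguments)."""
    name, sep, rest = op.partition(":")
    if not sep:
        # Bare operations such as "drop_duplicates" carry no arguments.
        return name, []
    rest = rest.strip()
    if rest.startswith("[") and rest.endswith("]"):
        rest = rest[1:-1]
    if name in MULTI_ARG_OPS:
        return name, rest.split(":")
    return name, rest.split(":", 1)


operations = [
    "drop_duplicates",
    "strip_whitespace",
    "fill_na:[age:0]",
    "rename_column:[fname:first_name]",
]
parsed = [parse_operation(op) for op in operations]
```

Note that every argument comes back as a string. Whether the real tool casts a `fill_na` value such as `0` to the column's type is not stated in the source.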
Args:
path: Absolute local path to the source dataset.
operations: List of operation strings (see above).
output_path: Where to write the cleaned file. If empty, '_cleaned' is inserted
before the source file's extension (e.g. data.csv → data_cleaned.csv).
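The default naming rule for output_path can be illustrated with a short sketch. `default_output_path` is a hypothetical helper, not part of the tool. It assumes the fallback file is written next to the source file, which the description implies but does not state.

```python
from pathlib import Path


def default_output_path(path: str, output_path: str = "") -> str:
    """Return output_path if given, else insert '_cleaned' before the extension."""
    if output_path:
        return output_path
    src = Path(path)
    # Assumption: the cleaned file is placed in the same directory as the source.
    return str(src.with_name(f"{src.stem}_cleaned{src.suffix}"))
```

Under this rule only the final suffix is treated as the extension, so a name such as `data.tar.gz` would become `data.tar_cleaned.gz`. The source does not say how the tool handles multi-part extensions.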
Returns JSON with: output_path, original_shape, cleaned_shape, row_delta,
col_delta, operations_applied, operations_skipped.
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
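| path | Yes | Absolute local path to the source dataset. | |
| operations | Yes | List of operation strings (see Supported operations). | |
| output_path | No | Where to write the cleaned file. If empty, '_cleaned' is inserted before the extension. | |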
Output Schema
| Name | Required | Description | Default |
|---|---|---|---|
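| result | Yes | JSON with output_path, original_shape, cleaned_shape, row_delta, col_delta, operations_applied, operations_skipped. | |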
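As a rough illustration of the fields listed under Returns, the sketch below assembles a result of that shape. `build_result` is a hypothetical helper. It assumes shapes are (rows, columns) pairs and that each delta is the cleaned value minus the original, so removed rows give a negative row_delta. The source does not confirm either convention.

```python
import json


def build_result(output_path, original_shape, cleaned_shape, applied, skipped):
    """Assemble a JSON summary with the documented field names."""
    return json.dumps({
        "output_path": output_path,
        "original_shape": list(original_shape),
        "cleaned_shape": list(cleaned_shape),
        # Assumption: delta = cleaned - original (negative when data is removed).
        "row_delta": cleaned_shape[0] - original_shape[0],
        "col_delta": cleaned_shape[1] - original_shape[1],
        "operations_applied": applied,
        "operations_skipped": skipped,
    })


summary = build_result(
    "/tmp/data_cleaned.csv",
    (100, 5),
    (98, 4),
    ["drop_duplicates", "drop_columns:[notes]"],
    [],
)
```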