remove_duplicates
Eliminate duplicate rows from dataframes with flexible column selection and keep strategies. Provides validation and statistics for data cleaning.
Instructions
Remove duplicate rows from the dataframe, validating the inputs first.
Use subset to restrict the duplicate check to specific columns, and keep to choose which occurrence in each duplicate group survives: the first, the last, or none. Edge cases are handled, and the result includes statistics about the deduplication, such as the number of rows removed.
Examples:
# Remove exact duplicate rows
remove_duplicates(ctx)
# Remove duplicates based on specific columns
remove_duplicates(ctx, subset=["email", "name"])
# Keep last occurrence instead of first
remove_duplicates(ctx, subset=["id"], keep="last")
# Remove all duplicates (keep none)
remove_duplicates(ctx, subset=["email"], keep="none")
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| subset | No | Columns to consider for duplicates (None = all columns) | |
| keep | No | Which duplicates to keep: first, last, or none | first |
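The three keep strategies are easiest to compare on a small example. The sketch below is not the tool's implementation: the underlying dataframe library is not stated here, so it models rows as plain Python dicts, and `dedupe_rows` is a hypothetical helper written only to illustrate how `subset` and `keep` interact.

```python
from collections import Counter

def dedupe_rows(rows, subset=None, keep="first"):
    """Illustrative only: rows is a list of dicts, one dict per row."""
    if keep not in ("first", "last", "none"):
        raise ValueError(f"keep must be 'first', 'last', or 'none', got {keep!r}")
    if not rows:
        return []
    # subset=None means every column takes part in the comparison.
    columns = list(subset) if subset is not None else list(rows[0].keys())
    missing = [c for c in columns if c not in rows[0]]
    if missing:
        raise KeyError(f"unknown columns: {missing}")

    # One hashable key per row, built from the compared columns.
    keys = [tuple(row[c] for c in columns) for row in rows]

    if keep == "none":
        # Drop every member of a duplicate group.
        counts = Counter(keys)
        return [row for row, key in zip(rows, keys) if counts[key] == 1]

    # For "last", scan from the end so the final occurrence is seen first.
    order = range(len(rows)) if keep == "first" else range(len(rows) - 1, -1, -1)
    seen, kept = set(), []
    for i in order:
        if keys[i] not in seen:
            seen.add(keys[i])
            kept.append(i)
    # Restore the original row order.
    return [rows[i] for i in sorted(kept)]


rows = [
    {"id": 1, "email": "a@x.com"},
    {"id": 2, "email": "b@x.com"},
    {"id": 3, "email": "a@x.com"},
    {"id": 4, "email": "c@x.com"},
]

ids_first = [r["id"] for r in dedupe_rows(rows, subset=["email"])]
ids_last = [r["id"] for r in dedupe_rows(rows, subset=["email"], keep="last")]
ids_none = [r["id"] for r in dedupe_rows(rows, subset=["email"], keep="none")]
ids_exact = [r["id"] for r in dedupe_rows(rows)]
```

With `subset=["email"]`, rows 1 and 3 form the only duplicate group: `first` keeps row 1, `last` keeps row 3, and `none` drops both. With no subset, every row is unique because the ids differ, so nothing is removed.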
Output Schema
| Name | Required | Description | Default |
|---|---|---|---|
| success | No | Whether operation completed successfully | |
| operation | Yes | Type of operation performed | |
| transform | No | Transform description | |
| part_index | No | Part index for split operations | |
| nulls_filled | No | Number of null values filled | |
| rows_removed | No | Number of rows removed (for remove_duplicates) | |
| rows_affected | Yes | Number of rows affected by operation | |
| values_filled | No | Number of values filled (for fill_missing_values) | |
| updated_sample | No | Sample values after operation | |
| original_sample | No | Sample values before operation | |
| columns_affected | Yes | Names of columns affected | |
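As a rough picture of how the required fields might be populated for a deduplication, here is a minimal sketch. It rests on several assumptions not confirmed by the schema: that `operation` carries the tool name, that `rows_affected` equals `rows_removed` for this operation, and that `columns_affected` lists the compared columns. `summarize_dedup` is a hypothetical helper, not part of the tool.

```python
def summarize_dedup(rows_before, rows_after, subset=None, all_columns=()):
    """Build a result dict shaped like the output schema (illustrative only)."""
    removed = rows_before - rows_after
    return {
        "success": True,
        "operation": "remove_duplicates",  # assumed value
        "rows_affected": removed,          # assumed equal to rows_removed
        "rows_removed": removed,
        # Assumed: the columns compared, i.e. subset or every column.
        "columns_affected": list(subset) if subset is not None else list(all_columns),
    }


result = summarize_dedup(4, 3, subset=["email"], all_columns=["id", "email"])
```

Fields that belong to other operations, such as `nulls_filled` or `values_filled`, are simply omitted here since the schema marks them as optional.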