remove_duplicates
Remove duplicate rows from dataframes with flexible column selection and keep strategies. Choose which duplicates to retain while getting detailed deduplication statistics.
Instructions
Remove duplicate rows from the dataframe with comprehensive validation.
Provides flexible duplicate removal with options for column subset selection and different keep strategies. Handles edge cases and provides detailed statistics about the deduplication process.
Examples:
# Remove exact duplicate rows
remove_duplicates(ctx)
# Remove duplicates based on specific columns
remove_duplicates(ctx, subset=["email", "name"])
# Keep last occurrence instead of first
remove_duplicates(ctx, subset=["id"], keep="last")
# Remove every row that has a duplicate (keep none of them)
remove_duplicates(ctx, subset=["email"], keep="none")
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| subset | No | Columns to consider when identifying duplicates (None = all columns) | None |
| keep | No | Which duplicates to keep: first, last, or none | first |
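
To make the three `keep` strategies concrete, here is a minimal pure-Python sketch of the semantics the table describes. It is not the tool's implementation: the `dedupe` function and the list-of-dicts row representation are hypothetical stand-ins for a dataframe, and the reading of `keep="none"` as "drop every row that has a duplicate" is an assumption based on the example comment above.

```python
from collections import Counter


def dedupe(rows, subset=None, keep="first"):
    """Illustrative only: rows is a list of dicts standing in for a dataframe."""
    if keep not in ("first", "last", "none"):
        raise ValueError(f"keep must be 'first', 'last', or 'none', got {keep!r}")

    def key(row):
        # subset=None means every column takes part in the comparison.
        cols = subset if subset is not None else sorted(row)
        return tuple(row[c] for c in cols)

    if keep == "none":
        # Assumed meaning: retain only rows whose key occurs exactly once.
        counts = Counter(key(r) for r in rows)
        return [r for r in rows if counts[key(r)] == 1]

    # For "last", scan in reverse so the final occurrence is seen first.
    ordered = rows if keep == "first" else list(reversed(rows))
    seen = set()
    kept = []
    for row in ordered:
        k = key(row)
        if k not in seen:
            seen.add(k)
            kept.append(row)
    # Restore the original row order after a reversed scan.
    return kept if keep == "first" else list(reversed(kept))


rows = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": "b@example.com"},
    {"id": 3, "email": "a@example.com"},
]

print([r["id"] for r in dedupe(rows, subset=["email"], keep="first")])  # [1, 2]
print([r["id"] for r in dedupe(rows, subset=["email"], keep="last")])   # [2, 3]
print([r["id"] for r in dedupe(rows, subset=["email"], keep="none")])   # [2]
```

With no `subset`, all three rows differ on `id`, so nothing would be removed. This is why choosing the right columns matters more than the `keep` strategy in most cases.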
Input Schema (JSON Schema)
{
  "properties": {
    "keep": {
      "default": "first",
      "description": "Which duplicates to keep: first, last, or none",
      "enum": [
        "first",
        "last",
        "none"
      ],
      "type": "string"
    },
    "subset": {
      "anyOf": [
        {
          "items": {
            "type": "string"
          },
          "type": "array"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "description": "Columns to consider for duplicates (None = all columns)"
    }
  },
  "type": "object"
}
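
A client may want to check arguments against this schema before calling the tool. The sketch below hand-codes the schema's constraints using only the standard library. The `normalize_args` helper is hypothetical and not part of the tool. It assumes unknown keys can be ignored, since the schema does not set `additionalProperties`.

```python
ALLOWED_KEEP = ("first", "last", "none")


def normalize_args(args):
    """Hypothetical helper: validate arguments against the schema and fill in defaults."""
    if not isinstance(args, dict):
        raise TypeError("arguments must be a JSON object")

    # "keep" is an enum of three strings and defaults to "first".
    keep = args.get("keep", "first")
    if keep not in ALLOWED_KEEP:
        raise ValueError(f"keep must be one of {ALLOWED_KEEP}, got {keep!r}")

    # "subset" is either null or an array of strings and defaults to null.
    subset = args.get("subset")
    if subset is not None:
        if not isinstance(subset, list) or not all(isinstance(c, str) for c in subset):
            raise ValueError("subset must be null or an array of strings")

    return {"keep": keep, "subset": subset}


print(normalize_args({}))
print(normalize_args({"subset": ["email", "name"], "keep": "last"}))
```

A general-purpose JSON Schema validator would also work. The hand-written checks are used here only to keep the example dependency-free.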