
remove_duplicates

Remove duplicate rows from dataframes with flexible column selection and keep strategies. Choose which duplicates to retain while getting detailed deduplication statistics.

Instructions

Remove duplicate rows from the dataframe with comprehensive validation.

Provides flexible duplicate removal with options for column subset selection and different keep strategies. Handles edge cases and provides detailed statistics about the deduplication process.

Examples:

    # Remove exact duplicate rows
    remove_duplicates(ctx)

    # Remove duplicates based on specific columns
    remove_duplicates(ctx, subset=["email", "name"])

    # Keep last occurrence instead of first
    remove_duplicates(ctx, subset=["id"], keep="last")

    # Remove all duplicates (keep none)
    remove_duplicates(ctx, subset=["email"], keep="none")
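
For reference, the three keep strategies map directly onto pandas.DataFrame.drop_duplicates. The sketch below is plain pandas, not part of DataBeak, and shows how each strategy behaves on a small dataframe; the tool's keep="none" corresponds to pandas keep=False.

    import pandas as pd

    df = pd.DataFrame(
        {"email": ["a@x.com", "a@x.com", "b@x.com"], "name": ["Ann", "Ann B.", "Bob"]}
    )

    df.drop_duplicates(subset=["email"], keep="first")  # keeps rows 0 and 2
    df.drop_duplicates(subset=["email"], keep="last")   # keeps rows 1 and 2
    df.drop_duplicates(subset=["email"], keep=False)    # keeps only row 2 (the tool's keep="none")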

Input Schema

Name      Required  Description                                               Default
subset    No        Columns to consider for duplicates (None = all columns)  None
keep      No        Which duplicates to keep: first, last, or none            first
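
For illustration, a client invoking this tool over MCP would supply these parameters as JSON tool arguments, along the lines of:

    {
      "subset": ["email", "name"],
      "keep": "last"
    }

The exact request envelope depends on the MCP client being used.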

Implementation Reference

  • The core handler function that executes the remove_duplicates tool logic using pandas drop_duplicates on the session dataframe (design notes follow this list).
    def remove_duplicates(
        ctx: Annotated[Context, Field(description="FastMCP context for session access")],
        subset: Annotated[
            list[str] | None,
            Field(description="Columns to consider for duplicates (None = all columns)"),
        ] = None,
        keep: Annotated[
            Literal["first", "last", "none"],
            Field(description="Which duplicates to keep: first, last, or none"),
        ] = "first",
    ) -> ColumnOperationResult:
        """Remove duplicate rows from the dataframe with comprehensive validation.

        Provides flexible duplicate removal with options for column subset selection
        and different keep strategies. Handles edge cases and provides detailed
        statistics about the deduplication process.

        Examples:
            # Remove exact duplicate rows
            remove_duplicates(ctx)

            # Remove duplicates based on specific columns
            remove_duplicates(ctx, subset=["email", "name"])

            # Keep last occurrence instead of first
            remove_duplicates(ctx, subset=["id"], keep="last")

            # Remove all duplicates (keep none)
            remove_duplicates(ctx, subset=["email"], keep="none")
        """
        session_id = ctx.session_id
        session, df = get_session_data(session_id)
        rows_before = len(df)

        # Validate subset columns if provided
        if subset:
            missing_cols = [col for col in subset if col not in df.columns]
            if missing_cols:
                msg = f"Columns not found in subset: {missing_cols}"
                raise ToolError(msg)

        # Convert keep parameter for pandas
        keep_param: Literal["first", "last"] | Literal[False] = keep if keep != "none" else False

        # Remove duplicates
        session.df = df.drop_duplicates(subset=subset, keep=keep_param).reset_index(drop=True)
        rows_after = len(session.df)
        rows_removed = rows_before - rows_after

        # No longer recording operations (simplified MCP architecture)
        return ColumnOperationResult(
            operation="remove_duplicates",
            rows_affected=rows_after,
            columns_affected=subset if subset else df.columns.tolist(),
            rows_removed=rows_removed,
        )
  • Registers the remove_duplicates function as an MCP tool on the transformation_server.
    transformation_server.tool(name="remove_duplicates")(remove_duplicates)
  • Pydantic response model (ColumnOperationResult) shared by remove_duplicates and other column operations; it includes the rows_removed field that this tool populates (a serialization sketch follows this list).
    class ColumnOperationResult(BaseToolResponse):
        """Response model for column operations (add, remove, rename, etc.)."""

        operation: str = Field(description="Type of operation performed")
        rows_affected: int = Field(description="Number of rows affected by operation")
        columns_affected: list[str] = Field(description="Names of columns affected")
        original_sample: list[CsvCellValue] | None = Field(
            default=None,
            description="Sample values before operation",
        )
        updated_sample: list[CsvCellValue] | None = Field(
            default=None,
            description="Sample values after operation",
        )

        # Additional fields for specific operations
        part_index: int | None = Field(default=None, description="Part index for split operations")
        transform: str | None = Field(default=None, description="Transform description")
        nulls_filled: int | None = Field(default=None, description="Number of null values filled")
        rows_removed: int | None = Field(
            default=None,
            description="Number of rows removed (for remove_duplicates)",
        )
        values_filled: int | None = Field(
            default=None,
            description="Number of values filled (for fill_missing_values)",
        )
  • Lists 'remove_duplicates' in the data_manipulation capabilities of the server info.
    "remove_duplicates",
