remove_duplicates
Eliminates duplicate rows in CSV data by comparing specified columns, keeping the first occurrence, the last occurrence, or dropping every copy of a duplicated row.
Instructions
Remove duplicate rows.
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| session_id | Yes | Session identifier for the loaded CSV data | |
| subset | No | Column names to consider when detecting duplicates (all columns if omitted) | |
| keep | No | Which duplicates to keep: 'first', 'last', or 'none' to drop all occurrences | first |
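For orientation, here is a minimal sketch of calling the tool from a client. It assumes the FastMCP `Client.call_tool` API; the session id and column names are hypothetical placeholders, not values from this project.

```python
# Sketch only: assumes a FastMCP Client connected to the csv_editor server.
# "abc123" and the column names are hypothetical placeholders.
from fastmcp import Client

async def dedupe(server):
    async with Client(server) as client:
        result = await client.call_tool(
            "remove_duplicates",
            {
                "session_id": "abc123",       # required: an existing session
                "subset": ["email", "name"],  # optional: columns to compare
                "keep": "first",              # optional: 'first', 'last', or 'none'
            },
        )
        return result
```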
Output Schema
| Name | Required | Description | Default |
|---|---|---|---|
| No arguments | | | |
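Although no structured output schema is declared, the implementation shown below returns a plain dict. A representative success payload might look like this; the numbers are illustrative only.

```python
# Illustrative shape of a successful response (values are made up).
{
    "success": True,
    "rows_before": 1200,
    "rows_after": 1150,
    "duplicates_removed": 50,
    "subset": ["email", "name"],
    "keep": "first",
}
```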
Implementation Reference
- Core implementation of the remove_duplicates tool. Gets session data, validates subset columns, uses pandas drop_duplicates() to remove duplicates, records the operation, and returns success/result info.
```python
async def remove_duplicates(
    session_id: str,
    subset: list[str] | None = None,
    keep: str = "first",
    ctx: Context = None,
) -> dict[str, Any]:
    """
    Remove duplicate rows.

    Args:
        session_id: Session identifier
        subset: Column names to consider for duplicates (None for all)
        keep: Which duplicates to keep ('first', 'last', False to drop all)
        ctx: FastMCP context

    Returns:
        Dict with success status and duplicate info
    """
    try:
        manager = get_session_manager()
        session = manager.get_session(session_id)

        if not session or session.df is None:
            return {"success": False, "error": "Invalid session or no data loaded"}

        df = session.df
        rows_before = len(df)

        if subset:
            missing_cols = [col for col in subset if col not in df.columns]
            if missing_cols:
                return {"success": False, "error": f"Columns not found: {missing_cols}"}

        # Convert keep parameter
        keep_param = keep if keep != "none" else False

        session.df = df.drop_duplicates(subset=subset, keep=keep_param).reset_index(drop=True)
        rows_after = len(session.df)

        session.record_operation(
            OperationType.REMOVE_DUPLICATES,
            {"subset": subset, "keep": keep, "rows_removed": rows_before - rows_after},
        )

        return {
            "success": True,
            "rows_before": rows_before,
            "rows_after": rows_after,
            "duplicates_removed": rows_before - rows_after,
            "subset": subset,
            "keep": keep,
        }

    except Exception as e:
        logger.error(f"Error removing duplicates: {e!s}")
        return {"success": False, "error": str(e)}
```
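The `keep` argument maps directly onto pandas: the string "none" is converted to `False`, which drops every copy of a duplicated row. A standalone pandas sketch of the three modes:

```python
# Standalone illustration of the pandas behavior the tool delegates to.
import pandas as pd

df = pd.DataFrame({"email": ["a@x.com", "a@x.com", "b@y.com"], "n": [1, 2, 3]})

df.drop_duplicates(subset=["email"], keep="first")  # keeps row 0 and row 2
df.drop_duplicates(subset=["email"], keep="last")   # keeps row 1 and row 2
df.drop_duplicates(subset=["email"], keep=False)    # keeps only row 2 ('none')
```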
- src/csv_editor/server.py:263-268 (registration): MCP tool registration of remove_duplicates via the @mcp.tool decorator. Defines the public API with session_id, subset, and keep parameters, delegating to the implementation.

```python
@mcp.tool
async def remove_duplicates(
    session_id: str,
    subset: list[str] | None = None,
    keep: str = "first",
    ctx: Context = None,
) -> dict[str, Any]:
    """Remove duplicate rows."""
    return await _remove_duplicates(session_id, subset, keep, ctx)
```
- OperationType enum defining REMOVE_DUPLICATES = 'remove_duplicates', used to record the operation in session history.

```python
REMOVE_DUPLICATES = "remove_duplicates"
GROUP_BY = "group_by"
VALIDATE = "validate"
PROFILE = "profile"
QUALITY_CHECK = "quality_check"
ANOMALY_DETECTION = "anomaly_detection"
```
- Data quality validation helper that recommends the remove_duplicates tool when duplicate rows are detected.

```python
quality_results["recommendations"].append(
    "Consider removing duplicate rows using the remove_duplicates tool"
)
```
"remove_duplicates", ],