# remove_duplicates
Eliminate duplicate rows from CSV files to maintain data integrity and accuracy in datasets.
## Instructions
Remove duplicate rows.
## Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| session_id | Yes | Session identifier for the loaded CSV data | |
| subset | No | Column names to consider when identifying duplicates (all columns if omitted) | |
| keep | No | Which duplicates to keep: `first`, `last`, or `none` to drop all occurrences | `first` |
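For reference, a call might pass arguments like the ones below. The field names in the response mirror the dictionary returned by the handler shown under Implementation Reference; the identifiers and numbers here are purely illustrative placeholders:

```python
# Illustrative arguments for the remove_duplicates tool
# ("sess-123" and the column names are placeholders, not real values).
arguments = {
    "session_id": "sess-123",        # required: identifies the session with data loaded
    "subset": ["email", "user_id"],  # optional: only these columns define a duplicate
    "keep": "first",                 # optional: "first" (default), "last", or "none"
}

# Shape of a successful response (values made up for illustration):
result = {
    "success": True,
    "rows_before": 1000,
    "rows_after": 950,
    "duplicates_removed": 50,
    "subset": ["email", "user_id"],
    "keep": "first",
}
```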
## Implementation Reference
- Core handler function (in `transformations.py`) that performs the duplicate removal with `pandas.DataFrame.drop_duplicates`, updates the session dataframe, records the operation in session history, and returns statistics on the rows removed:

  ```python
  async def remove_duplicates(
      session_id: str,
      subset: Optional[List[str]] = None,
      keep: str = "first",
      ctx: Context = None
  ) -> Dict[str, Any]:
      """
      Remove duplicate rows.

      Args:
          session_id: Session identifier
          subset: Column names to consider for duplicates (None for all)
          keep: Which duplicates to keep ('first', 'last', or 'none' to drop all)
          ctx: FastMCP context

      Returns:
          Dict with success status and duplicate info
      """
      try:
          manager = get_session_manager()
          session = manager.get_session(session_id)

          if not session or session.df is None:
              return {"success": False, "error": "Invalid session or no data loaded"}

          df = session.df
          rows_before = len(df)

          if subset:
              missing_cols = [col for col in subset if col not in df.columns]
              if missing_cols:
                  return {"success": False, "error": f"Columns not found: {missing_cols}"}

          # Convert keep parameter: the string "none" maps to pandas' keep=False (drop all occurrences)
          keep_param = keep if keep != "none" else False

          session.df = df.drop_duplicates(subset=subset, keep=keep_param).reset_index(drop=True)
          rows_after = len(session.df)

          session.record_operation(OperationType.REMOVE_DUPLICATES, {
              "subset": subset,
              "keep": keep,
              "rows_removed": rows_before - rows_after
          })

          return {
              "success": True,
              "rows_before": rows_before,
              "rows_after": rows_after,
              "duplicates_removed": rows_before - rows_after,
              "subset": subset,
              "keep": keep
          }

      except Exception as e:
          logger.error(f"Error removing duplicates: {str(e)}")
          return {"success": False, "error": str(e)}
  ```
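  The `keep` handling follows pandas semantics: `"first"` and `"last"` keep one row per duplicate group, while the tool's `"none"` maps to `keep=False`, which drops every occurrence. A minimal standalone sketch of that mapping (the sample data is made up):

  ```python
  import pandas as pd

  df = pd.DataFrame({
      "email": ["a@x.com", "a@x.com", "b@x.com"],
      "score": [1, 2, 3],
  })

  # keep="first": one row per duplicate group survives -> 2 rows remain
  print(len(df.drop_duplicates(subset=["email"], keep="first")))  # 2

  # keep=False (the tool's "none"): every duplicated row is dropped -> 1 row remains
  print(len(df.drop_duplicates(subset=["email"], keep=False)))    # 1
  ```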
- `src/csv_editor/server.py:280-288` (registration): MCP tool registration decorator and wrapper function that delegates to the core implementation in `transformations.py`, defining the tool schema via its type annotations:

  ```python
  @mcp.tool
  async def remove_duplicates(
      session_id: str,
      subset: Optional[List[str]] = None,
      keep: str = "first",
      ctx: Context = None
  ) -> Dict[str, Any]:
      """Remove duplicate rows."""
      return await _remove_duplicates(session_id, subset, keep, ctx)
  ```
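  For in-process testing, FastMCP's client can call the registered tool directly. This is a sketch under the assumption that the FastMCP 2.x `Client` API is available and that `mcp` is the server instance defined in `server.py`; the import path and session id are placeholders inferred from the file path above:

  ```python
  import asyncio
  from fastmcp import Client
  from csv_editor.server import mcp  # assumed import path, inferred from src/csv_editor/server.py

  async def main():
      # In-memory transport against the server object, no subprocess or network needed.
      async with Client(mcp) as client:
          result = await client.call_tool("remove_duplicates", {
              "session_id": "sess-123",  # placeholder; must refer to a session with data loaded
              "keep": "last",
          })
          print(result)

  asyncio.run(main())
  ```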
- `OperationType` enum value used to record the remove_duplicates operation in session history:

  ```python
  REMOVE_DUPLICATES = "remove_duplicates"
  ```
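  The surrounding enum is not reproduced here; a plausible shape, purely as a hypothetical illustration (the project's actual class may differ), is a string-valued `Enum` so the member serializes cleanly into the recorded operation history:

  ```python
  from enum import Enum

  class OperationType(str, Enum):  # hypothetical sketch; actual definition may differ
      REMOVE_DUPLICATES = "remove_duplicates"
      # ... other operation types defined by the project
  ```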