find_duplicates_resource
Detect duplicate rows in datasets to improve data quality. Specify columns, apply filters, and get duplicate groups sorted by frequency.
Instructions
Find rows whose values appear more than once on the given columns (or on all columns when none are given).
Returns duplicate groups sorted by frequency, descending. Useful for detecting data-quality issues in payroll, census, and registry datasets. The first call downloads and caches the file; subsequent calls reuse the cache.
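The grouping described above can be illustrated with a minimal sketch. This is not the tool's implementation, only one plausible way to count repeated rows and sort the groups by frequency. The function name `find_duplicate_groups`, the list-of-dicts row format, and the sample data are assumptions made for illustration.

```python
from collections import Counter


def find_duplicate_groups(rows, columns=None, limit=None):
    """Return (key_values, count) pairs for keys seen more than once.

    rows    -- list of dicts, one per row (assumed input shape)
    columns -- column names to compare; None compares all columns
    limit   -- optional cap on the number of groups returned
    """
    if not rows:
        return []
    keys = columns if columns is not None else sorted(rows[0].keys())
    # Count how often each combination of values occurs.
    counts = Counter(tuple(row.get(k) for k in keys) for row in rows)
    # Keep only combinations that occur more than once.
    groups = [(dict(zip(keys, key)), n) for key, n in counts.items() if n > 1]
    # Most frequent groups first.
    groups.sort(key=lambda group: group[1], reverse=True)
    return groups[:limit]


# Hypothetical sample data, invented for this example.
rows = [
    {"Nombre": "Ana", "Cedula": "1"},
    {"Nombre": "Ana", "Cedula": "1"},
    {"Nombre": "Luis", "Cedula": "2"},
    {"Nombre": "Ana", "Cedula": "1"},
    {"Nombre": "Luis", "Cedula": "2"},
    {"Nombre": "Eva", "Cedula": "3"},
]
groups = find_duplicate_groups(rows, columns=["Nombre", "Cedula"])
```

Here `groups` holds two entries: the "Ana" key with a count of 3, followed by the "Luis" key with a count of 2. The unique "Eva" row is excluded.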
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| url | Yes | Direct URL to the file (CKAN resource 'url' field). | |
| format | Yes | Format declared in CKAN. Accepts: csv, tsv, xlsx, json. | |
| columns | No | Columns to check for duplication. Omit (None) to check all columns. Example: ['Nombre', 'Cedula'] finds rows that share the same name and ID. | |
| filters | No | Same filter syntax as filter_resource. Applied before duplicate check. | |
| limit | No | Max duplicate groups to return (1–500). | |
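As a rough illustration of the input schema, the sketch below assembles an arguments object and checks it against the constraints listed in the table (accepted formats, `limit` between 1 and 500). The helper `build_arguments` and the example URL are hypothetical. How the arguments are actually sent depends on the client in use and is not shown. The `filters` parameter is left out because its syntax is defined by filter_resource.

```python
ALLOWED_FORMATS = {"csv", "tsv", "xlsx", "json"}


def build_arguments(url, fmt, columns=None, limit=None):
    """Build an arguments dict for find_duplicates_resource (hypothetical helper)."""
    if fmt not in ALLOWED_FORMATS:
        raise ValueError(f"format must be one of {sorted(ALLOWED_FORMATS)}")
    if limit is not None and not 1 <= limit <= 500:
        raise ValueError("limit must be between 1 and 500")
    args = {"url": url, "format": fmt}
    # Optional parameters are included only when the caller supplies them.
    if columns is not None:
        args["columns"] = list(columns)
    if limit is not None:
        args["limit"] = limit
    return args


# Placeholder URL; a real call would use the CKAN resource 'url' field.
args = build_arguments(
    "https://example.org/data/payroll.csv",
    "csv",
    columns=["Nombre", "Cedula"],
    limit=20,
)
```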
Output Schema
| Name | Required | Description | Default |
|---|---|---|---|
| error | No | ||
| hint | No | ||
| source_url | No | ||
| format | No | ||
| cache | No | ||
| columns_checked | No | ||
| duplicate_groups_found | No | ||
| groups_returned | No | ||
| total_duplicate_rows | No | ||
| columns | No | ||
| rows | No | ||
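The output schema lists field names without descriptions, so the following is only a sketch of how a caller might read a result. It assumes that `error` and `hint` are strings present on failure and that `duplicate_groups_found`, `groups_returned`, and `total_duplicate_rows` are integer counts. None of this is confirmed by the schema. The function `summarize` and both sample results are invented for illustration.

```python
def summarize(result):
    """Summarize a result dict in one line (field meanings are assumed)."""
    if result.get("error"):
        hint = result.get("hint")
        message = f"error: {result['error']}"
        # Append the hint only when one is provided.
        return f"{message} (hint: {hint})" if hint else message
    found = result.get("duplicate_groups_found")
    returned = result.get("groups_returned")
    total = result.get("total_duplicate_rows")
    return f"{returned} of {found} duplicate groups returned, {total} duplicate rows in total"


# Made-up results, shaped after the field names in the table above.
ok_summary = summarize(
    {"duplicate_groups_found": 12, "groups_returned": 10, "total_duplicate_rows": 31}
)
error_summary = summarize({"error": "download failed", "hint": "check the url"})
```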