qc_duplicate_accessions
Identify duplicate or clonal accessions by computing pairwise identity-by-state (IBS) similarity and grouping pairs at or above a similarity threshold, so that mislabelled duplicates and clones can be cleaned up.
Instructions
Detect duplicate / clonal accessions via pairwise identity-by-state (IBS).
Computes IBS allele-sharing similarity between every pair of samples and groups
pairs at or above similarity_threshold into duplicate sets — the core
genebank "cleaning" check for mislabelled duplicates and clones. By default
subsamples to max_markers evenly-spaced markers for speed (set to 0/None to
use all). Writes duplicate_pairs.csv and duplicate_groups.csv. For large
sets pass method="allelematrix" to fetch the marker subset without a full export.
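
The pairwise check described above can be sketched as follows. This is a minimal illustration, not the tool's actual implementation: it assumes diploid biallelic genotypes encoded as alternate-allele dosages (0, 1, 2, with None for missing calls), scores each marker as 1 - |a - b| / 2, and forms groups as connected components of the above-threshold pairs. The function names `ibs_similarity` and `duplicate_groups` and the sample data are hypothetical.

```python
from itertools import combinations


def ibs_similarity(a, b):
    """Mean IBS allele sharing over markers called in both samples.

    Assumption: genotypes are alt-allele dosages (0, 1, 2); None is missing.
    """
    shared = [1 - abs(x - y) / 2 for x, y in zip(a, b)
              if x is not None and y is not None]
    return sum(shared) / len(shared) if shared else float("nan")


def duplicate_groups(genotypes, similarity_threshold):
    """Return (pairs, groups) for samples at or above the threshold.

    Assumption: groups are connected components (single linkage) of the
    duplicate pairs, tracked with a small union-find structure.
    """
    parent = {name: name for name in genotypes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    pairs = []
    for s1, s2 in combinations(sorted(genotypes), 2):
        sim = ibs_similarity(genotypes[s1], genotypes[s2])
        if sim >= similarity_threshold:  # NaN never passes this comparison
            pairs.append((s1, s2, sim))
            parent[find(s2)] = find(s1)

    groups = {}
    for name in genotypes:
        groups.setdefault(find(name), []).append(name)
    # Only sets with at least two members count as duplicate groups.
    return pairs, sorted(sorted(g) for g in groups.values() if len(g) > 1)


# Hypothetical toy data: ACC1 and ACC2 agree at every marker called in both.
genotypes = {
    "ACC1": [0, 2, 1, 0, 2, None],
    "ACC2": [0, 2, 1, 0, 2, 1],
    "ACC3": [2, 0, 1, 2, 0, 1],
    "ACC4": [0, 2, 1, 0, 1, 1],
}
pairs, groups = duplicate_groups(genotypes, similarity_threshold=0.95)
```

Under the connected-components assumption, lowering the threshold can merge groups transitively: two accessions may land in the same set without being directly similar to each other, as long as a chain of above-threshold pairs links them.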
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| method | No | Genotype source: 'vcf' (full export, cached) or 'allelematrix' (paged, server-side subset). | vcf |
| region | No | Restrict analysis to a genomic window: 'chrom' or 'chrom:start-end' (1-based). | |
| output_dir | No | Directory for the output CSV(s). | ./gigwa_results/<module>/ |
| max_markers | No | Cap the number of markers analysed (evenly-spaced subsample); omit to use all. | |
| variant_set_db_id | Yes | BrAPI variantSetDbId identifying the run (MODULE§project§run); from list_variant_sets / list_content. | |
| similarity_threshold | No | IBS similarity (0-1) at/above which accessions are grouped as duplicates. | |
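
To make the region and max_markers semantics concrete, here is a small sketch of how the two values might be interpreted. Both helpers (`parse_region`, `evenly_spaced_indices`) are hypothetical and written only for illustration; in particular, the exact marker indices the tool selects when subsampling are an assumption.

```python
def parse_region(region):
    """Split 'chrom' or 'chrom:start-end' into (chrom, start, end).

    start and end are 1-based integers, or None when only a chromosome is given.
    """
    chrom, sep, span = region.partition(":")
    if not sep:
        return chrom, None, None
    start, _, end = span.partition("-")
    return chrom, int(start), int(end)


def evenly_spaced_indices(n_markers, max_markers):
    """Pick up to max_markers indices spread evenly over range(n_markers).

    A max_markers of 0 or None keeps every marker.
    """
    if not max_markers or max_markers >= n_markers:
        return list(range(n_markers))
    step = n_markers / max_markers
    return [int(i * step) for i in range(max_markers)]


window = parse_region("chr1:1000-2000")
subset = evenly_spaced_indices(10, 4)
```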
Output Schema
| Name | Required | Description | Default |
|---|---|---|---|
| result | Yes | | |