audit_import_quality
Scan Gigwa databases for genotype-encoding artifacts. Flags runs with miscalled heterozygotes or lost hom-alt classes (BROKEN) and suspiciously complete or monomorphic data (SUSPECT).
Instructions
Scan a Gigwa instance for databases imported with genotype-encoding artifacts.
With no variant_set_db_id this audits every run on the instance; pass one to
audit a single variant set. For each run it pulls a bounded genotype sample (up to
max_markers markers × max_samples callsets) via paged BrAPI
search/allelematrix — cheap and constant-cost regardless of how large the variant
set is, so it is safe to run across a whole production instance without exporting
multi-GB VCFs. The aggregate genotype-class fractions it needs are estimated tightly
from the sample (a true zero hom-alt class stays zero; a rare-but-real one shows up).
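The input schema describes the marker cap as an evenly-spaced subsample. The following is a minimal sketch of one way to choose such positions, assuming markers can be addressed by index. The helper name `evenly_spaced_indices` is hypothetical and is not taken from the tool itself.

```python
from typing import List, Optional


def evenly_spaced_indices(n_total: int, cap: Optional[int]) -> List[int]:
    """Return at most `cap` indices spread evenly over range(n_total).

    Hypothetical helper: illustrates the idea of a bounded, evenly-spaced
    marker subsample; the real tool may select markers differently.
    """
    if cap is None or cap >= n_total:
        # No cap, or the cap is not binding: use every marker.
        return list(range(n_total))
    if cap <= 0:
        return []
    # Integer arithmetic keeps indices strictly increasing and in bounds.
    return [(i * n_total) // cap for i in range(cap)]
```

Because the number of returned indices never exceeds the cap, the amount of genotype data requested stays bounded however many markers the variant set holds.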
It flags two import failure modes plus three weaker signals:
BROKEN: cohort mean observed heterozygosity (Ho) above het_threshold (DArT 2-row mis-call), or homozygous-alt genotypes far below their HWE expectation given the alt-allele frequency (lost hom-alt class; the HWE test avoids false positives on low-MAF / mostly-monomorphic panels where near-zero hom-alt is genuine).
SUSPECT: call rate above complete_call_rate (no missing data, often missing forced to 0/0), monomorphic fraction above monomorphic_threshold, or AD/DP depth fields present but uniformly zero (a VCF synthesised from genotype calls with fabricated depth/likelihoods; the same converter often miscalls GT too).
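The genotype-based checks above can be sketched as follows. This is an illustration under stated assumptions, not the tool's implementation: genotypes are taken to be alt-allele counts (0, 1, 2, or `None` for missing), Ho is pooled over all calls rather than averaged per sample, the depth-field check is omitted, and every default value, along with `min_hom_alt_ratio` and `min_expected_hom_alt`, is a hypothetical placeholder because the source does not state the real defaults.

```python
from typing import List, Optional, Sequence, Tuple

Genotype = Optional[int]  # alt-allele count 0/1/2; None = missing call


def classify_run(
    matrix: Sequence[Sequence[Genotype]],  # rows = markers, columns = callsets
    het_threshold: float = 0.5,            # hypothetical default
    complete_call_rate: float = 0.999,     # hypothetical default
    monomorphic_threshold: float = 0.9,    # hypothetical default
    min_hom_alt_ratio: float = 0.1,        # hypothetical: observed/expected floor
    min_expected_hom_alt: float = 5.0,     # hypothetical: skip test on low-MAF panels
) -> Tuple[str, List[str]]:
    total = called = hets = hom_alt = markers = monomorphic = 0
    expected_hom_alt = 0.0
    for row in matrix:
        calls = [g for g in row if g is not None]
        total += len(row)
        called += len(calls)
        if not calls:
            continue
        markers += 1
        n_het, n_alt = calls.count(1), calls.count(2)
        hets += n_het
        hom_alt += n_alt
        q = (n_het + 2 * n_alt) / (2 * len(calls))  # alt-allele frequency
        expected_hom_alt += q * q * len(calls)      # HWE expectation q^2 * n
        if q == 0.0 or q == 1.0:                    # only one allele observed
            monomorphic += 1
    if called == 0:
        return "NO_DATA", []

    broken, suspect = [], []
    if hets / called > het_threshold:
        broken.append("Ho above het_threshold")
    # Only test hom-alt loss when HWE predicts a meaningful number of them.
    if (expected_hom_alt >= min_expected_hom_alt
            and hom_alt < min_hom_alt_ratio * expected_hom_alt):
        broken.append("hom-alt below HWE expectation")
    if called / total > complete_call_rate:
        suspect.append("call rate above complete_call_rate")
    if monomorphic / markers > monomorphic_threshold:
        suspect.append("monomorphic fraction above monomorphic_threshold")

    if broken:
        return "BROKEN", broken + suspect
    if suspect:
        return "SUSPECT", suspect
    return "OK", []
```

The guard on `min_expected_hom_alt` reflects the rationale given above: on a low-MAF panel the HWE expectation for hom-alt is itself close to zero, so a missing hom-alt class is not evidence of a bad import.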
Writes import_quality_scan.csv (one row per run) under output_dir (default
./gigwa_results/) and returns a summary ranked worst-first. Read-only — it never
modifies Gigwa.
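A minimal sketch of the output step, assuming each run has already been reduced to one row. Only the file name `import_quality_scan.csv` and the worst-first ordering come from the description above; the column names and the `write_scan_csv` helper are hypothetical.

```python
import csv
from pathlib import Path
from typing import Dict, List, Tuple

# Lower number = worse; unknown statuses sort last.
SEVERITY = {"BROKEN": 0, "SUSPECT": 1, "OK": 2}
FIELDS = ["variant_set_db_id", "status", "reasons"]  # hypothetical columns


def write_scan_csv(
    rows: List[Dict[str, str]], output_dir: str = "./gigwa_results/"
) -> Tuple[Path, List[Dict[str, str]]]:
    """Write one row per run and return (csv_path, rows ranked worst-first)."""
    ranked = sorted(rows, key=lambda r: SEVERITY.get(r["status"], len(SEVERITY)))
    out_dir = Path(output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)  # touches only the local filesystem
    path = out_dir / "import_quality_scan.csv"
    with path.open("w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(ranked)
    return path, ranked
```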
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| output_dir | No | Directory for the output CSV(s) (default ./gigwa_results/<module>/). | |
| max_markers | No | Cap the number of markers analysed (evenly-spaced subsample); omit to use all. | |
| max_samples | No | Cap the number of samples/callsets sampled (allelematrix path). | |
| het_threshold | No | Mean observed-heterozygosity above which a run is flagged BROKEN (mis-called heterozygotes). | |
| variant_set_db_id | No | BrAPI variantSetDbId identifying the run (MODULE§project§run); from list_variant_sets / list_content. | |
| complete_call_rate | No | Call-rate above which a run is flagged as suspiciously complete (no missing data). | |
| monomorphic_threshold | No | Monomorphic-marker fraction above which a run is flagged for low informativeness. | |
Output Schema
| Name | Required | Description | Default |
|---|---|---|---|
| result | Yes | | |