diversity_structure
Identifies population structure by performing PCA on genotype data followed by K-means clustering, selecting the optimal number of clusters using the Calinski-Harabasz index.
Instructions
Lightweight population-structure clustering (PCA + K-means), computed in Python.
Reduces the alt-dosage matrix with PCA (Patterson scaling), then runs K-means for
each K in k_min..k_max and picks the K with the highest Calinski-Harabasz index
(pseudo-F, the ratio of between-cluster to within-cluster variance). The index shows
a clear maximum when groups are well separated.
Writes structure_clusters.csv (sample, assigned cluster at the best K, PC coordinates)
and reports the chosen K with cluster sizes. No external ADMIXTURE binary is required:
everything is computed in Python, consistent with the rest of the analysis layer.
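The pipeline described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the tool's actual implementation: the function names (`patterson_pca`, `kmeans`, `calinski_harabasz`, `choose_k`) are hypothetical, the K-means is a plain Lloyd iteration with k-means++ style seeding, and the scaling divides by sqrt(p(1-p)), which is one common form of Patterson scaling. The genotypes are simulated for three populations purely for demonstration.

```python
import numpy as np

def patterson_pca(dosage, n_components=2):
    """PCA of a samples x markers alt-dosage matrix (0/1/2), Patterson-scaled."""
    g = np.asarray(dosage, dtype=float)
    p = g.mean(axis=0) / 2.0                    # alt-allele frequency per marker
    keep = (p > 0.0) & (p < 1.0)                # drop monomorphic markers
    g, p = g[:, keep], p[keep]
    scaled = (g - 2.0 * p) / np.sqrt(p * (1.0 - p))
    u, s, _ = np.linalg.svd(scaled, full_matrices=False)
    return u[:, :n_components] * s[:n_components]   # sample PC coordinates

def sq_dist(x, centers):
    """Squared Euclidean distance from every point to every center."""
    return ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)

def init_centers(x, k, rng):
    """k-means++ style seeding: spread the starting centers apart."""
    centers = [x[rng.integers(len(x))]]
    while len(centers) < k:
        d2 = sq_dist(x, np.array(centers)).min(axis=1)
        centers.append(x[rng.choice(len(x), p=d2 / d2.sum())])
    return np.array(centers)

def kmeans(x, k, n_init=10, n_iter=100, seed=0):
    """Lloyd's algorithm; returns the labels of the lowest-inertia restart."""
    rng = np.random.default_rng(seed)
    best_labels, best_inertia = None, np.inf
    for _restart in range(n_init):
        centers = init_centers(x, k, rng)
        for _step in range(n_iter):
            labels = sq_dist(x, centers).argmin(axis=1)
            updated = np.array([
                x[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
                for j in range(k)
            ])
            if np.allclose(updated, centers):
                break
            centers = updated
        d2 = sq_dist(x, centers)
        labels = d2.argmin(axis=1)
        inertia = d2[np.arange(len(x)), labels].sum()
        if inertia < best_inertia:
            best_labels, best_inertia = labels, inertia
    return best_labels

def calinski_harabasz(x, labels):
    """Pseudo-F: (between SS / (k - 1)) / (within SS / (n - k))."""
    ids = np.unique(labels)
    n, k = len(x), len(ids)
    if k < 2:
        return float("-inf")                    # undefined for a single cluster
    overall = x.mean(axis=0)
    between = within = 0.0
    for j in ids:
        members = x[labels == j]
        center = members.mean(axis=0)
        between += len(members) * ((center - overall) ** 2).sum()
        within += ((members - center) ** 2).sum()
    return (between / (k - 1)) / (within / (n - k))

def choose_k(pcs, k_min=2, k_max=6):
    """Run K-means for each K and keep the K with the highest pseudo-F."""
    results = {}
    for k in range(k_min, k_max + 1):
        labels = kmeans(pcs, k)
        results[k] = (calinski_harabasz(pcs, labels), labels)
    best_k = max(results, key=lambda k: results[k][0])
    scores = {k: score for k, (score, _labels) in results.items()}
    return best_k, results[best_k][1], scores

# Simulated data (assumption): 3 populations x 20 samples, 300 markers.
rng = np.random.default_rng(42)
freqs = rng.uniform(0.05, 0.95, size=(3, 300))
dosage = np.vstack([rng.binomial(2, f, size=(20, 300)) for f in freqs])

pcs = patterson_pca(dosage, n_components=2)
best_k, labels, scores = choose_k(pcs, k_min=2, k_max=6)
print("chosen K:", best_k, "cluster sizes:", np.bincount(labels))
```

The pseudo-F is undefined for a single cluster, so the sketch assumes k_min is at least 2. The number of PCs kept before clustering is also an assumption here; the source does not state how many components the tool retains.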
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| k_max | No | Largest number of clusters (K) to evaluate. | |
| k_min | No | Smallest number of clusters (K) to evaluate. | |
| method | No | Genotype source: 'vcf' (full export, cached) or 'allelematrix' (paged, server-side subset). | vcf |
| region | No | Restrict analysis to a genomic window: 'chrom' or 'chrom:start-end' (1-based). | |
| output_dir | No | Directory for the output CSV(s). | ./gigwa_results/<module>/ |
| max_markers | No | Cap the number of markers analysed (evenly-spaced subsample); omit to use all. | |
| variant_set_db_id | Yes | BrAPI variantSetDbId identifying the run (MODULE§project§run); from list_variant_sets / list_content. | |
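As a rough illustration of how the region and max_markers parameters could be interpreted, the sketch below parses a 'chrom' or 'chrom:start-end' string and picks evenly spaced marker indices. Both helper names (`parse_region`, `evenly_spaced_indices`) are hypothetical, and treating the end coordinate as inclusive is an assumption; the tool's own handling may differ.

```python
import re

def parse_region(region):
    """Parse 'chrom' or 'chrom:start-end' (1-based) into (chrom, start, end).

    start and end are None when only a chromosome name is given.
    The end coordinate is assumed to be inclusive.
    """
    match = re.fullmatch(r"([^:]+)(?::(\d+)-(\d+))?", region.strip())
    if match is None:
        raise ValueError(f"expected 'chrom' or 'chrom:start-end', got {region!r}")
    chrom, start, end = match.groups()
    if start is None:
        return chrom, None, None
    start, end = int(start), int(end)
    if start < 1 or end < start:
        raise ValueError(f"invalid 1-based window in {region!r}")
    return chrom, start, end

def evenly_spaced_indices(n_markers, max_markers=None):
    """Indices of an evenly spaced subsample; all markers when no cap applies."""
    if max_markers is None or max_markers >= n_markers:
        return list(range(n_markers))
    if max_markers < 1:
        raise ValueError("max_markers must be a positive integer")
    step = n_markers / max_markers
    return [int(i * step) for i in range(max_markers)]

print(parse_region("chr1:100-200"))
print(evenly_spaced_indices(10, 5))
```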
Output Schema
| Name | Required | Description | Default |
|---|---|---|---|
| result | Yes | | |