Genotype–phenotype prediction requires diverse data, including genetic variants, functional annotations (FA), linkage disequilibrium, polygenic risk scores (PRS), covariates, and genome-wide association studies (GWAS). We present a framework to optimize data integration for phenotype prediction. Using UK Biobank genotypes (733 samples), two GWAS for migraine, three for depression, publicly available FA, and four PRS tools, we tested various combinations of datasets for migraine prediction.
The analysis proceeded in three main steps: (1) we assessed the suitability of GWAS files from the catalog for PRS calculation using the GWASPoker tool; (2) we benchmarked 46 PRS tools and selected the top four to generate PRS; and (3) we systematically evaluated combinations of all datasets to identify the best predictor of migraine.
The best-performing individual dataset achieved a test AUC of 0.64 (±0.14). We then evaluated dataset combinations under two configurations: configuration 1 (migraine-related data sources) and configuration 2 (migraine- and depression-related data sources). In configuration 1, combining covariates, PRS-PLINK, and weighted annotated genotype data achieved a test AUC of 0.69 (±0.13). In configuration 2, combining unweighted annotated genotype data and PRS-LDAK achieved a test AUC of 0.66 (±0.04).
We observed that including covariates, PRS (including those from AnnoPred and LDAK), and annotated genotype data improved prediction performance. This study proposes a framework that integrates diverse datasets, optimizing data combinations and selection strategies to improve the accuracy of genotype–phenotype prediction.
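
As an illustration of the third step, the systematic evaluation of dataset combinations can be sketched as follows. This is a minimal sketch under stated assumptions, not the study's pipeline: the feature blocks are synthetic, the block names (`covariates`, `prs`, `annotated_genotypes`) and the helper functions are hypothetical, and a toy mean-difference scorer stands in for whatever models the framework actually trains. It shows only the general pattern: enumerate every non-empty combination of feature blocks, concatenate their columns, and report the cross-validated test AUC as mean and standard deviation.

```python
import itertools
import random
import statistics


def auc(labels, scores):
    """Rank-based AUC: chance that a random case outscores a random control."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def mean_difference_scores(X_train, y_train, X_test):
    """Toy linear scorer (an assumption): project onto case mean minus control mean."""
    cases = [x for x, y in zip(X_train, y_train) if y == 1]
    ctrls = [x for x, y in zip(X_train, y_train) if y == 0]
    w = [
        statistics.fmean(c[j] for c in cases) - statistics.fmean(c[j] for c in ctrls)
        for j in range(len(X_train[0]))
    ]
    return [sum(wj * xj for wj, xj in zip(w, x)) for x in X_test]


def stratified_folds(y, n_folds, seed):
    """Split sample indices into folds that each contain cases and controls."""
    rng = random.Random(seed)
    folds = [[] for _ in range(n_folds)]
    for label in (0, 1):
        members = [i for i, yi in enumerate(y) if yi == label]
        rng.shuffle(members)
        for k in range(n_folds):
            folds[k].extend(members[k::n_folds])
    return folds


def evaluate_combinations(blocks, y, n_folds=5, seed=0):
    """Return {combination: (mean test AUC, SD)} for every non-empty block subset."""
    n = len(y)
    folds = stratified_folds(y, n_folds, seed)
    names = sorted(blocks)
    results = {}
    for r in range(1, len(names) + 1):
        for combo in itertools.combinations(names, r):
            # Concatenate the columns of the selected blocks for each sample.
            X = [[v for name in combo for v in blocks[name][i]] for i in range(n)]
            fold_aucs = []
            for test_idx in folds:
                held_out = set(test_idx)
                train_idx = [i for i in range(n) if i not in held_out]
                scores = mean_difference_scores(
                    [X[i] for i in train_idx],
                    [y[i] for i in train_idx],
                    [X[i] for i in test_idx],
                )
                fold_aucs.append(auc([y[i] for i in test_idx], scores))
            results[combo] = (statistics.fmean(fold_aucs), statistics.stdev(fold_aucs))
    return results


# Synthetic data for illustration only; not UK Biobank data.
rng = random.Random(42)
y = [i % 2 for i in range(200)]
blocks = {
    "covariates": [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in y],
    "prs": [[2.0 * yi + rng.gauss(0, 1)] for yi in y],
    "annotated_genotypes": [[0.3 * yi + rng.gauss(0, 1) for _ in range(3)] for yi in y],
}
results = evaluate_combinations(blocks, y)
best = max(results, key=lambda combo: results[combo][0])
print(best, results[best])
```

Reporting the standard deviation across folds alongside the mean mirrors the "AUC (± SD)" format used above and makes it easier to judge whether one combination's gain over another exceeds fold-to-fold variation.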