Federated and transfer learning methods for cross-ancestry and cross-phenotype integration of genomic datasets

Abstract

This proposal aims to develop advanced data integration methods for improving genetic risk prediction in under-represented non-European populations. Genome-wide association studies (GWAS) have yielded important biological insights into the heritable basis of many complex traits and diseases, and polygenic risk scores (PRS) have shown promising potential for disease risk stratification. However, because the vast majority of participants in large-scale genomic datasets are of European ancestry (EA), current PRS perform much worse in non-EA populations than in EA populations, which may exacerbate existing health disparities. Despite some recent inclusive data collection efforts, current risk prediction methods cannot effectively address the heavily imbalanced sample sizes across populations. Robust data integration methods are needed to leverage similarities in genetic architecture across ancestral populations, phenotypic correlations and pleiotropy, and variant functional annotations, while accounting for different sources of heterogeneity. Moreover, as national and institutional biobanks become available, efficient and privacy-preserving information-sharing strategies are needed to combine data across biobanks and thereby improve both sample diversity and sample size. This proposal will address these needs by developing a methodological framework that uses advanced transfer learning (TL) and federated learning (FL) techniques to integrate multiple sources of data and narrow the gap in risk prediction performance across populations.
Specifically, in Aim 1, we will develop a TL method that integrates ancestrally diverse data using high-dimensional models with a distance-based regularization that characterizes the similarities across populations, together with a communication-efficient FL algorithm that jointly fits the TL model across multiple biobanks using only summary-level statistics. In Aim 2, we will develop methods that enable joint analyses of multiple phenotypes in association tests and risk prediction models. We will develop an FL algorithm that combines data from multiple biobanks for cross-phenotype association tests, and a TL method with an angle-based regularization that leverages genetic correlations among mixed types of phenotypes in risk prediction. In Aim 3, we will develop knowledge-graph-based TL methods that leverage the shared latent spaces between phenotype-genotype knowledge graphs constructed from different ancestral populations and enable the incorporation of functional annotations. In Aim 4, we will develop open-access statistical software that implements the proposed methods in both offline and cloud computing environments, and we will apply the proposed methods to the analysis of major depressive disorder and cardiovascular diseases using data from the All of Us Research Program, eMERGE, and the UK Biobank.
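
As a purely illustrative sketch of the idea behind Aim 1, the following toy example shows how a distance-regularized transfer estimator could be fitted from summary-level statistics pooled across sites. Everything in it is an assumption made for illustration: the squared Euclidean distance to a source-population coefficient vector stands in for the unspecified distance-based regularization, the model is a low-dimensional linear model rather than a high-dimensional one, and the function names (`local_summary`, `aggregate`, `transfer_fit`) are hypothetical. It is not the proposed method.

```python
from typing import List, Tuple

Matrix = List[List[float]]
Vector = List[float]


def local_summary(X: Matrix, y: Vector) -> Tuple[Matrix, Vector]:
    """Computed inside one biobank: returns X^T X and X^T y only.

    No individual-level rows of X or y leave the site.
    """
    p = len(X[0])
    xtx = [[sum(row[j] * row[k] for row in X) for k in range(p)] for j in range(p)]
    xty = [sum(row[j] * yi for row, yi in zip(X, y)) for j in range(p)]
    return xtx, xty


def aggregate(summaries: List[Tuple[Matrix, Vector]]) -> Tuple[Matrix, Vector]:
    """Coordinator step: element-wise sum of the per-site summaries."""
    p = len(summaries[0][1])
    xtx = [[sum(s[0][j][k] for s in summaries) for k in range(p)] for j in range(p)]
    xty = [sum(s[1][j] for s in summaries) for j in range(p)]
    return xtx, xty


def solve(A: Matrix, b: Vector) -> Vector:
    """Gaussian elimination with partial pivoting (assumes A is nonsingular)."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def transfer_fit(xtx: Matrix, xty: Vector, beta_source: Vector, lam: float) -> Vector:
    """Minimize ||y - X b||^2 + lam * ||b - beta_source||^2.

    The minimizer solves (X^T X + lam * I) b = X^T y + lam * beta_source,
    so only the pooled summary statistics are needed.
    """
    p = len(xty)
    A = [[xtx[j][k] + (lam if j == k else 0.0) for k in range(p)] for j in range(p)]
    rhs = [xty[j] + lam * beta_source[j] for j in range(p)]
    return solve(A, rhs)


# Toy usage: two sites holding target-population data, one source estimate.
site_a = local_summary([[1.0, 0.0], [0.0, 1.0]], [1.0, 2.0])
site_b = local_summary([[1.0, 1.0], [1.0, -1.0]], [3.0, -1.0])
pooled_xtx, pooled_xty = aggregate([site_a, site_b])
beta_target = transfer_fit(pooled_xtx, pooled_xty, beta_source=[0.0, 0.0], lam=3.0)
```

In this sketch, a larger `lam` pulls the target-population estimate toward the source-population coefficients, while `lam = 0` reduces to ordinary least squares on the pooled target data. Because the per-site contributions enter only through sums of `X^T X` and `X^T y`, a single round of communication suffices in this simplified linear setting.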