This proposal aims to develop advanced and scalable statistical methods for integrative analysis of large-scale
Whole Genome Sequencing (WGS) studies and biobanks of common diseases, such as heart and lung
diseases. Genome-Wide Association Studies (GWAS) have revealed thousands of genetic variants associated
with many common diseases, but have largely been limited to common variants and to individuals of predominantly
European ancestry. Large-scale multi-ethnic WGS studies and biobanks have emerged rapidly to overcome
these limitations, and to study the genetic underpinnings of complex diseases and traits in both coding and
non-coding rare variants across populations. Examples include the NHLBI Trans-Omics for Precision Medicine
Program (TOPMed), the NHGRI Genome Sequencing Program (GSP), the UK Biobank, and All of Us. Various
omics data are also available in TOPMed. Full use of these datasets can fuel genetic discoveries applicable
to genetically understudied populations. These studies consist of hundreds of millions of rare variants (RVs),
and their analysis faces several challenges. First, although several methods have been developed for RV
analysis, they have limited power for analysis of non-coding RVs, as their functions are unknown or cell-type
specific. There is a pressing need to empower RV Association Tests (RVATs) for non-coding variants by
developing more powerful statistical learning methods using integrative analysis and incorporating
cell-type-specific variant functional annotations. Second, the large sample sizes of WGS studies, together with the
data privacy considerations of many national and institutional biobanks with unbalanced case-control ratios, call for
distributed WGS analyses. Third, it is of substantial interest to develop polygenic risk scores using both
common and rare variants in WGS studies, and to investigate the causal effects of biomarkers and omics markers
on diseases through Mendelian Randomization (MR) with both common and rare variants as instrumental
variables. This proposal addresses these needs through four aims. First, we will develop statistical-learning-based
ensemble RVATs to boost power. This ensemble RVAT framework will be extended to use
cell-type-specific functional annotations calculated from single-cell assays, and to perform meta-analysis.
Second, we will develop distributed methods for important tasks in the analysis of large WGS and federated
biobank data: estimating population structure via distributed fast principal component analysis, distributed
methods for fitting generalized linear mixed models, and distributed RVATs. Third, we will develop methods for
constructing polygenic risk scores (PRS) using both common and rare variants in WGS studies, and develop Mendelian
Randomization methods for studying the causal effects of biomarkers and omics markers on diseases by using
WGS-based PRS as instrumental variables. Fourth, we will develop open-access statistical software capable of
implementing our proposed methods in both offline and cloud computing environments. We will apply the
proposed methods to the analysis of TOPMed, GSP, and biobank data.