PROJECT SUMMARY
Genome-wide association studies (GWAS) have associated tens of thousands of common variants with human
diseases and traits. The rapid expansion of whole-genome sequencing (WGS) studies and biobanks offers
great potential to understand the physiologic and pathophysiologic associations of both common and rare
variants. The Impact of Genomic Variation on Function (IGVF) Consortium aims to systematically study the
functional and phenotypic effects of genomic variation; it is not feasible, however, to experimentally
characterize the vast number of candidate variants of interest. Computational models that can accurately
predict the context-specific effects of variants are therefore essential for designing targeted research. We
propose an approach anchored on a framework of
high-confidence regulatory elements (REs), from which we will develop methods to learn RE-gene links,
perform rare variant association tests, and fine-map causal common and rare variants. We aim to make all our
results, methods, and tools available to the community through a public portal and the NHGRI and NHLBI Data
Commons. Our proposal has four aims: (1) Develop a core framework of REs from open chromatin regions on
which to anchor our models, improving on past approaches by producing higher-resolution predictions of
functional base-pairs, producing novel RE subclassifications using functional characterization datasets from
IGVF and other sources, and harnessing single-cell datasets to delineate lineage- and stimulus-specific
elements. (2) Use this framework to predict the roles of variants in molecular phenotypes, specifically gene
expression and cellular response to stimuli. We will build statistical and machine-learning methods to predict
context-specific links between REs and their target genes, using three-dimensional conformation data
produced by the IGVF Consortium and external sources. We will apply this method across many cell types and
perform feature selection to build a catalog of high-confidence RE-gene links and regulatory networks. (3)
Develop statistical methods to perform cell type-specific rare variant association tests (cellSTAAR) in WGS
studies, and a latent variable model to prioritize candidate functional variants for traits and diseases, using
results from Aims 1 and 2. We will apply these methods to analyze various metabolic, immune-mediated, and
psychiatric disorders in the multi-ethnic WGS data of the NHLBI Trans-Omic Precision Medicine Program
(TOPMed) and the NHGRI Genome Sequencing Program (GSP) to identify candidate causal
disease-associated variants. (4) Make all the results publicly available by substantially expanding the FAVOR
Portal to include whole-genome variant functional annotations for all three billion genomic positions as well as
cell type-specific annotations. We will implement both FAVOR and cellSTAAR in the AnVIL (NHGRI) and
BioData Catalyst (NHLBI) Data Commons so that researchers can use them to analyze new datasets in a
scalable cloud computing environment. We will work closely with other centers and the Data Analysis
Coordinating Center (DACC) of the IGVF Consortium on joint analyses and on building the IGVF Variant Catalog.