ABSTRACT
Noncoding genetic variation that alters gene regulation is of paramount importance for health, disease, and
evolution. Diseases ranging in incidence from the most common to the rarest all carry substantial risk
associated with regulatory variation, and most of the genetic differences between closely related species are
noncoding. Whole genome sequencing can directly identify that variation, but realizing its potential to elucidate
the genetic determinants of health and disease will require accurate annotation of this noncoding variation for
functionality. In coding sequence, the genetic code allows variants to be annotated to a rough hierarchy of likely
functional effects and pathogenicity. In noncoding sequence such annotation is less clear. Perturbation assays,
i.e., assays that modify genetic or epigenetic states and measure the effect of those perturbations on regulatory
endpoints, offer a possible path to annotating noncoding variation. However, to fully leverage these data, novel
and sophisticated statistical and machine learning approaches are required to extract useful information from
those assays, to integrate that information across regulatory endpoints, and to extrapolate findings so that
annotation of previously unobserved (unperturbed) variation in diverse cell types is possible. The goal of the
Duke Prediction Center is to develop the analytic approaches and tools that will allow for the routine
annotation of noncoding variation for functionality and ultimately pathogenicity. Aim 1 is to establish best
practices in perturbation assay design and analysis. This will allow IGVF characterization centers to design their
experiments so that, when coupled with optimized analyses, the data produced will be maximally informative for
subsequent predictive modeling. Aim 2 is to develop novel mechanistic machine learning approaches for
predicting the effect of noncoding variation on function in diverse cell types. Aim 3 is to identify
noncoding genomic regions that are subject to functional constraint, which will be leveraged in prioritizing variants
for pathogenicity. The expected outcomes of this project will be (i) robust estimates of optimal experimental
design parameters and recommendations for analysis tools and best practices for the various assays used within
the IGVF consortium, (ii) predicted functional effects of observed variation to be shared through the IGVF
variant/phenotype catalog as well as a state-of-the-art machine learning method (and associated tools) that can
identify previously unknown interactions among genomic variants, both observed and novel, and predict their
functional impact in diverse cell types, and (iii) a list of regulatory elements subject to functional constraint shared
through the IGVF variant/phenotype catalog and a principled prioritization framework (and associated tools) for
interpreting variation within patient genomes for pathogenicity. The considerable success of genetics has pointed
to thousands of regulatory causes of disease that remain unknown. Each of those causes is an opportunity to improve
treatment, diagnostics, or prevention. This project will be a major advance towards unlocking that potential.