PROJECT SUMMARY / ABSTRACT
A central goal of genomics is to understand the relationship between genotype and phenotype. In recent years,
the ability to quantitatively study genotype-phenotype maps has been revolutionized by the development of
multiplex assays of variant effect (MAVEs), which measure molecular phenotypes for thousands to millions of
genotypic variants in parallel. MAVE is an umbrella term that includes massively parallel reporter assays for
studies of DNA or RNA regulatory sequences, as well as deep mutational scanning assays of proteins or
structural RNAs. The rapid adoption of MAVE techniques across multiple genomic disciplines has created an
acute need for computational methods that can robustly and reproducibly infer quantitative genotype-
phenotype (G-P) maps from the large datasets that MAVEs produce. Here we propose a unified conceptual
and computational framework for quantitatively modeling G-P maps from MAVE data. This proposal is
motivated by our realization that accounting for the noise and nonlinearities that are omnipresent in MAVE
experiments requires explicit modeling of both the MAVE measurement process and the G-P map of interest.
This joint inference strategy is more computationally demanding than most MAVE analysis methods, but it is
feasible using modern deep learning frameworks. Our extensive preliminary data show that this modeling
strategy can recover high-precision G-P maps even in the presence of major confounding effects, and
thus has the potential to benefit MAVE studies in multiple areas of genomics. Aim 1 will develop methods for
modeling the measurement processes that arise in diverse MAVE experimental designs. Aim 2 will develop
general methods for modeling genetic interactions within G-P maps, and will use these methods in conjunction
with new experiments to elucidate the molecular mechanism of a recently approved drug that targets
alternative mRNA splicing. Aim 3 will develop methods for inferring G-P maps that reflect biophysical models
of gene regulation, including both thermodynamic (i.e., quasi-equilibrium) and kinetic (i.e., non-equilibrium
steady-state) models. These methods will then be used, in conjunction with new MAVE experiments, to
develop a biophysical model for how a pleiotropic transcription factor regulates gene expression throughout the
Escherichia coli genome. Aim 4 will study gauge freedoms and sloppy modes in the above classes of models
and develop methods for treating them, thereby facilitating the comparison, interpretation, and exploration of
inferred G-P maps. All of the computational techniques we develop will be incorporated into a robust and easy-
to-use Python package called MAVE-NN. We will benchmark MAVE-NN on a diverse array of MAVE datasets,
including published datasets and data generated as part of this project. In all, this work will fill a major need in
the analysis of MAVE experiments, yielding a robust, flexible, and scalable computational platform that will help
accelerate the use of MAVEs for understanding the effects of human genetic variation at the genomic scale.