Understanding how many different proteins coordinate their binding to DNA and RNA provides
mechanistic insights into cellular regulatory functions. Recent developments in deep neural networks (DNNs)
have greatly enhanced our ability to accurately predict experimental measurements in regulatory genomics. Despite
their impressive performance compared to traditional methods in computational genomics, their low interpretability
has earned them a reputation as black boxes. To address this gap, post hoc interpretation methods are being
increasingly used to gain mechanistic insight into black-box predictions. While many of the current
interpretation methods are useful, their findings often disagree with one another. These methods
have also been shown to have specific strengths, as well as blind spots in areas that are essential for gene
regulation. Despite the promise of these methods, using them to decipher the complex cellular regulatory functions
learned by a DNN remains challenging. Here we propose two complementary aims that
serve to enhance the biological insights gained from genomic DNNs. Together, the work from these aims will
create a surrogate modeling framework, which uses simplified mathematical models trained on a sequence
library to approximate the corresponding sequence–function relationships learned by a DNN. In Aim 1, we will
develop and implement a set of surrogate modeling strategies for interpreting genomic DNNs. In Aim 2, we will
develop and implement computational methods to design refined sequence libraries for improved surrogate
modeling of genomic DNNs. As the number of deep learning applications in genomics is rapidly increasing, the
biomedical community will greatly benefit from our surrogate modeling framework. This framework will be made
publicly available in a software package called SQuID (Surrogate Quantitative Interpretability for Deepnets),
providing user-friendly computational tools to characterize functional relationships learned by any DNN trained
on functional genomics assays. This capability will drive new discoveries in functional genomics for any task
where deep learning has been applied, as well as for future applications. My current position as a joint postdoc
working with Peter Koo and Justin Kinney at Cold Spring Harbor Laboratory (CSHL) provides an ideal
environment for carrying out the proposed research, with the mentorship and training I need to transition into the
field of computational genomics. Dr. Koo develops DNN architectures and interpretation methods for functional
genomics, while Dr. Kinney develops multiplex assay of variant effect (MAVE) technologies, as well as quantitative
methods for analyzing the data
these technologies produce. I will also take advantage of the many training resources offered by CSHL, including
career development workshops offered at the School of Biological Sciences, as well as exposure to cutting-edge
science offered by the CSHL Meetings & Courses Program. Together, the research project and training plan
proposed here will equip me well to establish an independent research program focused on understanding
genomic regulatory mechanisms through the lens of modern machine learning methods.
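
As an illustration of the surrogate modeling idea described above, the following is a minimal sketch and not the SQuID implementation. It assumes the sequence library is built by random mutagenesis of a single wild-type sequence and that the simplified model is additive (one effect per position and base), fit by least squares. The function names, the toy wild-type sequence, and `toy_black_box`, which stands in for a trained DNN, are all hypothetical.

```python
import numpy as np

ALPHABET = "ACGT"

def one_hot(seq):
    """Encode a DNA string as a flat one-hot vector of length 4 * len(seq)."""
    x = np.zeros((len(seq), 4))
    x[np.arange(len(seq)), [ALPHABET.index(c) for c in seq]] = 1.0
    return x.ravel()

def mutagenize(wild_type, n_seqs, mut_rate, rng):
    """Build a sequence library by randomly mutating a wild-type sequence."""
    wt = np.array([ALPHABET.index(c) for c in wild_type])
    library = np.tile(wt, (n_seqs, 1))
    mask = rng.random(library.shape) < mut_rate
    # A shift of 1..3 (mod 4) guarantees that a mutated base differs from wild type.
    shifts = rng.integers(1, 4, size=library.shape)
    library[mask] = (library[mask] + shifts[mask]) % 4
    return ["".join(ALPHABET[i] for i in row) for row in library]

def toy_black_box(seq):
    """Hypothetical stand-in for a DNN: counts matches to a motif at positions 3..9."""
    motif = "TGACTCA"
    return float(sum(a == b for a, b in zip(seq[3:10], motif)))

def fit_additive_surrogate(seqs, scores):
    """Fit an additive model (per-position base effects plus intercept) by least squares."""
    X = np.stack([one_hot(s) for s in seqs])
    X = np.hstack([X, np.ones((len(seqs), 1))])  # intercept column
    theta, *_ = np.linalg.lstsq(X, scores, rcond=None)
    return theta[:-1].reshape(-1, 4), theta[-1]

def predict(seqs, effects, intercept):
    """Evaluate the fitted additive surrogate on a list of sequences."""
    return np.array([one_hot(s) @ effects.ravel() + intercept for s in seqs])

rng = np.random.default_rng(0)
wild_type = "ACGTGACTCATACG"  # toy sequence; positions 3..9 carry the motif
library = mutagenize(wild_type, n_seqs=2000, mut_rate=0.1, rng=rng)
scores = np.array([toy_black_box(s) for s in library])  # query the black box
effects, intercept = fit_additive_surrogate(library, scores)
```

Because the one-hot encoding is redundant, only differences between effects at the same position are meaningful in this sketch. In practice the black box would be a trained DNN, and the surrogate might include nonlinearities or pairwise terms that an additive model cannot capture.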