PROJECT SUMMARY
Understanding how transcription factors coordinate their binding to non-coding DNA provides mechanistic
insights into transcriptional regulation. Recent developments in deep neural networks (DNNs) have
revolutionized our ability to study regulatory genomics. While they have demonstrated improved predictions
compared to previous methods based on traditional computational genomics, their low interpretability has earned
them a reputation as a black box. To address this gap, post hoc model interpretability methods have emerged
to interrogate important features that the network has learned. Of these, attribution maps have demonstrated
promise, providing importance scores for each nucleotide in a given sequence; these have a natural
interpretation as single-nucleotide variant effects. In principle, attribution maps should contain information to
identify motifs that are important for cell-type specific regulatory functions and annotate their positions at base-
resolution. However, attribution maps are often noisy in practice; in addition to motifs, they contain spurious
importance scores for arbitrary nucleotides for reasons that are not well understood. As a result,
interpreting a DNN through attribution maps remains challenging despite their promise. Here we propose three
complementary aims to maximize the biological insight that we can gain from attribution maps for genomic DNNs. In
Aim 1, we will develop a model selection framework to identify, from a set of candidate DNNs, the model
that yields both high generalization performance and interpretable attribution maps. In Aim 2, we will develop robust
training strategies based on regularization and data augmentations tailored for genomics, with the broader goal
of ensuring that DNNs both generalize well and yield high-quality attribution maps. In Aim 3, we will develop and
employ interpretable computational methods to directly analyze attribution maps to facilitate discovery of
functional motifs and annotate their positions. Each aim will be implemented as open-source software in
TensorFlow and PyTorch. As the number of deep learning applications in genomics is rising quickly, the
biomedical community will benefit greatly from these user-friendly computational tools, which will enable the
deployment of robust training and interpretability analysis for any DNN trained on functional genomics assays.
This, in turn, will drive new discoveries in cis-regulatory biology across the many biological systems to which deep
learning has already been applied and in the new applications that continue to emerge.