PROJECT SUMMARY
Understanding transcriptional regulation remains a major challenge in molecular biology. Enhancers are
genetic elements that regulate when, where, and at what level genes are expressed. These elements
are hard to discover because their locations and orientations are not constrained with respect to their target genes.
Mutations and variants in enhancers have been linked to several diseases and to disease susceptibility. Multiple
experimental and computational methods have been developed for locating enhancers. Computational methods
are better suited to handling the large number of genomes now being sequenced because they are faster, cheaper,
and less labor-intensive than experimental methods. Despite the many available computational tools, we lack a
sophisticated tool that can measure the similarity in enhancer activity between a pair of sequences. Here we propose
using deep artificial neural networks (DANNs) to develop such a tool. The long-term objective of this project is
to decipher the code governing gene regulation with the following specific aims: (i) design a computational tool for
measuring enhancer-enhancer similarity, (ii) validate up to 96 putative enhancers experimentally, (iii) understand
enhancer grammar, and (iv) annotate enhancers in more than 50 insect genomes. To achieve these aims, we propose a
novel application of DANNs. Current tools use DANNs to answer a yes-no question: does a sequence
have activity similar to that of the tissue-specific enhancers making up a particular training set of known enhancers?
These approaches require training a separate network for each tissue, leading to inconsistent performance across
tissues. Instead, we use a DANN to answer a related but different question: does this sequence
have enhancer activity similar to that of a single known tissue-specific enhancer? This deep network should perform
consistently across cell types because it is trained on pairs of sequences (not on individual sequences, as
the available tools are) representing all tissues for which there are known enhancers. The DANN is trained
to recognize sequence pairs with similar enhancer activities and pairs with dissimilar activities, including (i) two
enhancers active in two different tissues, (ii) one enhancer and a random genomic sequence, and (iii) two random
genomic sequences. The tool outputs a score between 0 and 1, indicating how similar the enhancer activities
of the two sequences are. Using a machine learning algorithm much simpler than a DANN, we demonstrate that
pairs with similar enhancer activities can be separated from pairs of random genomic sequences and from pairs of
one enhancer and a random genomic sequence with very high accuracy. The new tool has many important
potential applications, including consistent annotation of enhancers across cell types and related species. Our tool
can annotate enhancers active in a cell type for which only a few enhancers are known, and it can annotate
enhancers in related genomes when a set of known enhancers has been demarcated in one of them. Discovering
new transcription factor binding sites is another potential application. The proposed tool can also facilitate the study
of enhancer “design principles” and of the effects of variants. Such applications will advance our understanding of gene regulation.
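
As an illustration only, the pair-based training set described above could be assembled along the following lines. This is a minimal Python sketch under stated assumptions, not the project's implementation: the function name build_pairs, its parameters, and the dictionary layout are hypothetical, sequences are treated as plain strings, and each known enhancer is assumed to be listed under a single tissue. Label 1 marks a pair with similar enhancer activity; label 0 marks the three kinds of dissimilar pairs.

```python
import random
from itertools import combinations


def build_pairs(enhancers_by_tissue, random_seqs, n_negative, seed=0):
    """Return (seq_a, seq_b, label) triples; 1 = similar activity, 0 = dissimilar.

    Hypothetical helper: needs at least two tissues and two random sequences.
    """
    rng = random.Random(seed)
    pairs = []

    # Similar pairs: two enhancers active in the same tissue.
    for seqs in enhancers_by_tissue.values():
        pairs.extend((a, b, 1) for a, b in combinations(seqs, 2))

    tissues = sorted(enhancers_by_tissue)
    all_enhancers = [s for t in tissues for s in enhancers_by_tissue[t]]

    for _ in range(n_negative):
        # (i) two enhancers active in two different tissues
        t1, t2 = rng.sample(tissues, 2)
        pairs.append((rng.choice(enhancers_by_tissue[t1]),
                      rng.choice(enhancers_by_tissue[t2]), 0))
        # (ii) one enhancer and one random genomic sequence
        pairs.append((rng.choice(all_enhancers), rng.choice(random_seqs), 0))
        # (iii) two random genomic sequences
        a, b = rng.sample(random_seqs, 2)
        pairs.append((a, b, 0))

    return pairs
```

A network trained on such triples with a sigmoid output would produce a score between 0 and 1 for any sequence pair; the ratio of similar to dissimilar pairs, set here by n_negative, is a design choice that the sketch leaves open.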