Exploiting Natural Genetic and Organismic Variation to Identify the DNA Motifs Regulating Transcription

Understanding how non-coding DNA regulates gene expression is critical to addressing myriad problems in
biotechnology and human health. This endeavor, however, has proven a major challenge, in part because the
genome appears to encode many highly convoluted regulatory networks. Particularly in vertebrates, regulatory
elements such as promoters or enhancers are highly diverse and contain dozens of DNA motifs separated by
intervening sequences. Traditional sequence analysis strategies that focus on conservation are ineffective at
singling out the functional motifs because regulatory DNA evolves rapidly and often even relocates.
The transcription start site (TSS) is a landmark of gene regulation. Accurate TSS mapping makes it possible to define DNA motifs
functionally associated with transcription. TSSs further allow anchoring and comparing the regulatory regions
of orthologous genes across evolution, independent of direct sequence conservation. Although distantly
related organisms typically lack homologous regulatory DNA, it remains to be explored to what extent specific
sequence motifs are selectively conserved to drive gene expression. I therefore developed capped small RNA-seq
(csRNA-seq), which accurately maps the TSSs of both stable (protein-coding and non-coding RNAs) and
unstable transcripts (enhancer RNAs, divergent transcripts) to reveal active regulatory elements genome-wide.
csRNA-seq only requires total RNA as starting material, thus enabling TSS profiling in virtually any eukaryotic
organism from which RNA can be extracted. Eukarya, from unicellular protists to humans, vary in organismic,
genetic and regulatory complexity. I hypothesize that this spectrum of diversity, combined with TSS mapping
(csRNA-seq), can be exploited to uncover the key DNA motifs and subsequently the transcription factor (TF) networks that regulate
gene expression across the Eukarya. Analogous to the work of an archeologist at prehistoric sites, mapping
TSSs along the tree of life ‘excavates’ ancestral, less convoluted states of gene regulation. These insights
should also be instrumental in better interpreting the human genome. To explore this central hypothesis, I seek to
1) implement tools for the analysis and visualization of csRNA-seq and facilitate the comparative analysis of
annotated regulatory features across species, 2) identify the TF binding sites mediating transcription initiation
across Eukarya, and 3) trace the evolution and usage of TF binding sites and their spatial organization in regulatory
elements of orthologous genes or sets of genes. In preparation, I have generated data for 42 Eukarya
spanning over 2 billion years of transcriptome evolution and joined an exceptional bioinformatics group, which
also provides a unique opportunity for training critical to my successful transition to independence.
This proposal, if successful, will reveal the major DNA motifs mediating transcription and markedly expand our
mechanistic understanding of eukaryotic gene regulation. Furthermore, it will provide a novel method to
capture nascent TSSs (csRNA-seq), a free software suite to facilitate analysis, a data portal for easy data
access and browsing, and a unique dataset to the greater scientific community.