PROJECT SUMMARY:
There is a need for integrative data analyses that anchor transcriptomic research in contexts predictive of
human health, as illustrated by growing awareness of disease-associated synonymous transcript variants
and RNA biotechnologies such as mRNA vaccines. To help uncover sequence features that are important
for RNA regulation, we present context-dependent models of translational efficiency, a key metric of
transcript function. We show that position-dependent codon usage bias (PDCUB) identifies start codons
among AUGs more consistently than the Kozak sequence, and that high-PDCUB transcripts are enriched for
medically important genes tied to human development and neural function. Attention-based transformer
networks and interpretation techniques will independently predict translational efficiency in human
transcripts, with comparison to ribosome profiling and RNA abundance data in multiple human cell lines,
to characterize how PDCUB and other sequence features guide translational efficiency across health-
critical contexts. Transfection assays will validate the roles of predicted sequence features.
Beyond sequence, higher-order structures also drive RNA function and stability, including
translational regulation and interactions with microRNAs and RNA-binding proteins (RBPs). A new RNA
structural alignment method and associated clustering will uncover structural domains and group them
by mutual similarity to find common structural motifs that impact RNA structure-function relationships,
improving our understanding of the role of transcript structure in pathogenesis. Evaluation will consist of
clustering RNA families in our previously built RNA structure meta-database, bpRNA-1m, with identified
structural domains analyzed in the context of ribosome profiling data to characterize the role of these
domains in regulating translation. Meanwhile, clustering structures according to RNA-protein crosslinking
data will let us identify motifs involved in the binding of RBPs.
Finally, a comprehensive transcriptome browser and meta-database will integrate transcriptomic
data for known and new transcript-level features, including those described above. Easy to access and
use, this resource will enable scientific and medical researchers to find and define RNA sequence features
and structural motifs. By cohesively cataloging the complex facets of transcript-level interactions, along
with sequence and structural features relevant for transcript regulation, our transcriptome browser will
help researchers visualize ribosomal occupancy, examine RNA structures and microRNA and RBP binding,
catalog splice variants, and understand the sequence features that drive transcript interactions. Allelic
variants mapped to RNA transcript positions will be combined with our annotations, along with feature-based
machine learning predictions incorporated into the browser, to assist researchers in generating first-pass
predictions of transcript variants and interpreting their outcomes in the context of human health.