SUMMARY
This project will contribute a novel pre-trained DNA Bidirectional Encoder Representations from Transformers
model, called DNABERT, and associated deep-learning tools to decipher the language of non-coding DNA and to
facilitate integration of gene regulatory information from rapidly accumulating sequence data with NLM's genetic
databases (for example, dbSNP, dbGaP, and ClinVar), which serve both scientists and public health by
helping identify the genetic components of disease. While the genetic code explaining how DNA is translated
into proteins is universal, the regulatory code that determines when and how genes are expressed varies
across cell types and organisms. From a language-modeling perspective, non-coding DNA is highly complex
because it exhibits polysemy and distant semantic relationships. Deep-learning methods have recently been
used to unravel the gene regulatory code, but they have failed to model such language features globally and
robustly across the genome, especially in data-scarce scenarios. To address this challenge, we propose DNABERT
to model DNA as a language, by adapting the idea of Bidirectional Encoder Representations from Transformers
(BERT). Based on recent observations in natural language processing research, we hypothesize that pre-trained
transformer-based neural network models offer a promising, yet not fully explored, deep-learning approach
for a variety of sequence prediction tasks in the analysis of non-coding DNA. Our preliminary results showed
that DNABERT pre-trained on the human genome achieved state-of-the-art performance on promoter and splice-site
prediction tasks after straightforward fine-tuning on small task-specific datasets (Ji et al., 2020).
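To make the pre-train/fine-tune workflow concrete, the sketch below shows how a pre-trained DNABERT-style checkpoint might be fine-tuned for a binary promoter-prediction task using the Hugging Face transformers library. It is a minimal illustration, not the project's implementation: the checkpoint name "dnabert-6mer", the toy sequences, and the helper `to_kmers` are placeholders we introduce here for exposition.

```python
# Minimal sketch of fine-tuning a pre-trained DNABERT-style model on a small
# task-specific dataset. "dnabert-6mer" is a placeholder, not a released model ID.
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

def to_kmers(seq: str, k: int = 6) -> str:
    """Represent a DNA sequence as overlapping k-mers, DNABERT's input unit."""
    return " ".join(seq[i:i + k] for i in range(len(seq) - k + 1))

tokenizer = AutoTokenizer.from_pretrained("dnabert-6mer")
model = AutoModelForSequenceClassification.from_pretrained("dnabert-6mer",
                                                           num_labels=2)

# Toy task-specific data: (sequence, label) pairs, e.g. promoter vs. non-promoter.
examples = [("ACGTAGCTAGCTAGGCTA", 1), ("TTTTAGCGCGATATATCG", 0)]
train_ds = [
    dict(tokenizer(to_kmers(seq), truncation=True, padding="max_length",
                   max_length=32), labels=label)
    for seq, label in examples
]

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="dnabert-promoter", num_train_epochs=3),
    train_dataset=train_ds,
)
trainer.train()
```

Because the expensive pre-training step is already done, only this lightweight fine-tuning loop runs on the small task-specific data, which is what makes the approach attractive in data-scarce settings.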
The goal of our proposed research is to develop DNABERT for a variety of sequence prediction tasks and to
benchmark it against existing state-of-the-art deep-learning methods. The specific aims are to (1) develop
novel deep-learning methods by adapting BERT; (2) apply the proposed methods specifically to non-coding DNA
sequence analyses and predictions; and (3) predict and validate functional non-coding genetic variants by
applying DNABERT prediction models. A major contribution of the proposed research is the development of the
pre-trained DNABERT model and prediction algorithms, which provide powerful new methods for the analysis
and prediction of DNA sequences.
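As one illustration of Aim 3, a fine-tuned model can score a candidate variant by comparing its predictions on the reference and alternate alleles (in silico mutagenesis). The sketch below assumes the `model`, `tokenizer`, and `to_kmers` from the fine-tuning sketch above; the function name and scoring convention are hypothetical, not part of a released DNABERT API.

```python
# Hypothetical sketch: score a non-coding variant as the change in the
# fine-tuned classifier's predicted regulatory probability.
import torch

def variant_effect_score(ref_seq: str, pos: int, alt_base: str) -> float:
    """Predicted-probability shift caused by substituting alt_base at pos."""
    alt_seq = ref_seq[:pos] + alt_base + ref_seq[pos + 1:]
    probs = []
    for seq in (ref_seq, alt_seq):
        inputs = tokenizer(to_kmers(seq), return_tensors="pt", truncation=True)
        with torch.no_grad():
            logits = model(**inputs).logits
        probs.append(torch.softmax(logits, dim=-1)[0, 1].item())
    # Positive score: the variant increases predicted regulatory activity.
    return probs[1] - probs[0]
```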
Since pre-training DNABERT is resource-intensive, we will provide the source code and the pre-trained model
on GitHub for future academic research. We will also develop an integrated web server that (1) deploys the
DNABERT model, (2) hosts a database storing the identified sequence features and predictions, and (3) provides
tutorials to help users apply DNABERT to their specific research problems. We anticipate that DNABERT will
bring new advances and insights to the bioinformatics community by introducing an advanced language-modeling
perspective to gene regulation analyses.