SUMMARY
This project will contribute a novel pre-trained DNA Bidirectional Encoder Representations from Transformers
model, called DNABERT, and associated deep-learning tools to decipher the language of non-coding DNA and to
facilitate integration of gene regulatory information from rapidly accumulating sequence data with NLM's genetic
databases (for example, dbSNP, dbGaP, and ClinVar), which serve both scientists and public health by
helping identify the genetic components of disease. While the genetic code explaining how DNA is translated
into proteins is universal, the regulatory code that determines when and how genes are expressed varies
across cell types and organisms. From a language-modeling perspective, non-coding DNA is highly complex
because it exhibits polysemy and distant semantic relationships. Deep-learning methods have recently been
used to unravel the gene regulatory code, but they have failed to model such language features globally and
robustly across the genome, especially in data-scarce scenarios. To address this challenge, we propose DNABERT
to model DNA as a language, by adapting the idea of Bidirectional Encoder Representations from Transformers
(BERT). Based on recent observations in natural language processing research, we hypothesize that pre-trained
transformer-based neural network models offer a promising, yet not fully explored, deep-learning approach
for a variety of sequence prediction tasks in the analysis of non-coding DNA. Our preliminary results showed
that DNABERT pre-trained on the human genome achieved state-of-the-art performance on promoter and splice-site
prediction tasks after straightforward fine-tuning on small task-specific datasets (Ji et al., 2020).
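To make the pre-train/fine-tune workflow concrete, the sketch below shows how a pre-trained DNABERT-style checkpoint might be fine-tuned for a binary promoter-prediction task using the Hugging Face transformers library. It is a minimal illustration, not the project's implementation: the checkpoint name "dnabert-6mer", the toy sequences, and the helper `to_kmers` are placeholders we introduce here for exposition.

```python
# Minimal sketch of fine-tuning a pre-trained DNABERT-style model on a small
# task-specific dataset. "dnabert-6mer" is a placeholder, not a released model ID.
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

def to_kmers(seq: str, k: int = 6) -> str:
    """Represent a DNA sequence as overlapping k-mers, DNABERT's input unit."""
    return " ".join(seq[i:i + k] for i in range(len(seq) - k + 1))

tokenizer = AutoTokenizer.from_pretrained("dnabert-6mer")
model = AutoModelForSequenceClassification.from_pretrained("dnabert-6mer",
                                                           num_labels=2)

# Toy task-specific data: (sequence, label) pairs, e.g. promoter vs. non-promoter.
examples = [("ACGTAGCTAGCTAGGCTA", 1), ("TTTTAGCGCGATATATCG", 0)]
train_ds = [
    dict(tokenizer(to_kmers(seq), truncation=True, padding="max_length",
                   max_length=32), labels=label)
    for seq, label in examples
]

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="dnabert-promoter", num_train_epochs=3),
    train_dataset=train_ds,
)
trainer.train()
```

Because the expensive pre-training step is already done, only this lightweight fine-tuning loop runs on the small task-specific data, which is what makes the approach attractive in data-scarce settings.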
The goal of our proposed research is to develop DNABERT for a variety of sequence prediction tasks and to
benchmark it against existing state-of-the-art deep-learning methods. The specific aims are to (1) develop
novel deep-learning methods by adapting BERT; (2) apply the proposed methods specifically to non-coding DNA
sequence analyses and predictions; and (3) predict and validate functional non-coding genetic variants by
applying DNABERT prediction models. A major contribution of the proposed research is the development of the
pre-trained DNABERT model and prediction algorithms, which provide powerful new methods for the analysis
and prediction of DNA sequences.
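As one illustration of Aim 3, a fine-tuned model can score a candidate variant by comparing its predictions on the reference and alternate alleles (in silico mutagenesis). The sketch below assumes the `model`, `tokenizer`, and `to_kmers` from the fine-tuning sketch above; the function name and scoring convention are hypothetical, not part of a released DNABERT API.

```python
# Hypothetical sketch: score a non-coding variant as the change in the
# fine-tuned classifier's predicted regulatory probability.
import torch

def variant_effect_score(ref_seq: str, pos: int, alt_base: str) -> float:
    """Predicted-probability shift caused by substituting alt_base at pos."""
    alt_seq = ref_seq[:pos] + alt_base + ref_seq[pos + 1:]
    probs = []
    for seq in (ref_seq, alt_seq):
        inputs = tokenizer(to_kmers(seq), return_tensors="pt", truncation=True)
        with torch.no_grad():
            logits = model(**inputs).logits
        probs.append(torch.softmax(logits, dim=-1)[0, 1].item())
    # Positive score: the variant increases predicted regulatory activity.
    return probs[1] - probs[0]
```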
Since pre-training DNABERT is resource-intensive, we will provide the source code and the pre-trained model
on GitHub for future academic research. We will also develop an integrated web server that (1) deploys the
DNABERT model, (2) hosts a database storing the identified sequence features and predictions, and (3) provides
tutorials to help users apply DNABERT to their specific research problems. We anticipate that DNABERT will
bring new advances and insights to the bioinformatics community by introducing an advanced language-modeling
perspective to gene regulation analyses.