Thursday, December 18, 2025

Generating a full-length reference transcriptome for human protein-coding genes

Award Number: U24HG011451
ORGANIZATION: NATIONAL HUMAN GENOME RESEARCH INSTITUTE
OPDIV: NIH
AWARD CLASS: COOPERATIVE AGREEMENT
AWARD ACTIVITY TYPE: SCIENTIFIC/HEALTH RESEARCH (INCLUDES SURVEYS)
PERIOD OF PERFORMANCE START DATE: 08/22/2022
PERIOD OF PERFORMANCE END DATE: 06/30/2027

Abstract: Elucidating the coding potential of the genome has benefited from accurate genome sequences and extensive transcriptome sequencing to allow detailed models for protein-coding sequences (CDSs) or open reading frames (ORFs). Although at least one reliable full-length transcript model has been assigned for every protein-coding gene, the majority of alternative isoforms remain uncharacterized due to i) vast differences in expression levels between isoforms expressed from common genes, and ii) the difficulty of obtaining full-length (FL) transcript sequences. Furthermore, there remains a large discrepancy between the total number of transcripts in annotation databases and the number for which there is an annotated FL transcript with experimental evidence.

The spectrum of encoded transcripts comprises a vast but finite “isoform-space” with multiple dimensions: i) genes, ii) tissues and cell types, iii) development and time, iv) disease, and v) response to stimuli. Just as expression levels vary across cells and tissues, so can the relative abundance of alternatively spliced transcripts. Full, functional understanding of the human genome will not be possible without empirical knowledge and complete annotation of the entire complement of encoded functional proteins.

Historically, gene annotation was supported predominantly by ESTs and mRNAs from INSDC databases, while automated approaches to annotation are now being applied to whole genomes and transcriptomes. However, current automated annotation does not provide the same quality of data as manual annotation. Sensitivity and specificity are reduced, less functional annotation is captured, and all automated methods lack the capacity of a manual annotator to introduce additional orthogonal data types and interpretation of the scientific literature. Manual annotation, however, is highly labor-intensive. GENCODE release v36 represents the interpretation of nearly 10 million EST, cDNA and protein homologies. Given the anticipated volumes of data, with single experiments producing more data than the entire INSDC catalogue, current methods of manual annotation do not scale.

The emergence of long transcriptomic sequencing methods provides for the replacement of historical data types to the benefit of gene and transcript annotation. However, the massively greater data volumes already being deposited in public data archives exceed manual curation capability, demanding implementation of automated solutions without compromising annotation quality. Furthermore, as untargeted sequencing approaches are very inefficient in their discovery of less abundant transcripts, the majority of sequence data generated gives us very little insight into discoverable transcript diversity. To overcome these challenges, our two respective groups have joined forces to increase the catalog of fully experimentally verified full-length human protein-coding transcripts. This proposal focuses on the integration of experimental approaches that will provide a comprehensive enumeration of human protein-coding transcripts, a “Reference Human Transcriptome”, with the development of an automated annotation pipeline to allow the integration of this resource into GENCODE gene annotation.


Issue Date FY	Funding FY	Legal Entity Name	Legal Entity Address	Legal Entity City	Legal Entity State	Legal Entity Zip Code	Legal Entity County	Legal Entity Country	Assistance Listing	Award Code	Budget Year	Action Date	Action Type	Action Amount

Issue Date FY: 2025 ( Subtotal = $640,208 )
2025	2025	DANA-FARBER CANCER INSTITUTE, INC.	450 BROOKLINE AVE	BOSTON	MA	02215	SUFFOLK	USA	Human Genome Research	001	4	9/8/2025	NON-COMPETING CONTINUATION	$640,208
2025	2024	DANA-FARBER CANCER INSTITUTE, INC.	450 BROOKLINE AVE	BOSTON	MA	02215	SUFFOLK	USA	Human Genome Research	000	3	11/6/2024	NON-COMPETING CONTINUATION	$0

Issue Date FY: 2024 ( Subtotal = $647,290 )
2024	2024	DANA-FARBER CANCER INSTITUTE, INC.	450 BROOKLINE AVE	BOSTON	MA	02215	SUFFOLK	USA	Human Genome Research	000	3	7/16/2024	NON-COMPETING CONTINUATION	$647,290

Issue Date FY: 2023 ( Subtotal = $663,518 )
2023	2023	DANA-FARBER CANCER INSTITUTE, INC.	450 BROOKLINE AVE	BOSTON	MA	02115	SUFFOLK	USA	Human Genome Research	000	2	7/5/2023	NON-COMPETING CONTINUATION	$663,518

Issue Date FY: 2022 ( Subtotal = $757,593 )
2022	2022	DANA-FARBER CANCER INSTITUTE, INC.	450 BROOKLINE AVE	BOSTON	MA	02115	SUFFOLK	USA	Human Genome Research	000	1	8/22/2022	NEW	$681,834
2022	2022	DANA-FARBER CANCER INSTITUTE, INC.	450 BROOKLINE AVE	BOSTON	MA	02115	SUFFOLK	USA	Human Genome Research	001	1	9/15/2022	NEW	$75,759

Grand Total All Awards = $2,708,609
