Abstract
Elucidating the coding potential of the genome has benefited from accurate genome sequences and extensive
transcriptome sequencing to allow detailed models for protein-coding sequences (CDSs) or open reading frames
(ORFs). Although at least one reliable full-length transcript model has been assigned for every protein-coding
gene, the majority of alternative isoforms remain uncharacterized due to i) vast differences in expression levels
between isoforms expressed from common genes, and ii) the difficulty of obtaining full-length (FL) transcript
sequences. Furthermore, there remains a large discrepancy between the total number of transcripts in
annotation databases and the number for which there is an annotated FL transcript with experimental evidence.
The spectrum of encoded transcripts comprises a vast but finite “isoform-space” with multiple dimensions: i)
genes, ii) tissues and cell types, iii) development and time, iv) disease, and v) response to stimuli. Just as
expression levels vary across cells and tissues, so can the relative abundance of alternatively spliced transcripts.
Full, functional understanding of the human genome will not be possible without empirical knowledge and
complete annotation of the entire complement of encoded functional proteins.
Historically, gene annotation was supported predominantly by ESTs and mRNAs from INSDC databases, whereas
automated approaches to annotation are now being applied to whole genomes and transcriptomes. However, current
automated annotation does not provide data of the same quality as manual annotation: sensitivity and
specificity are reduced, less functional annotation is captured, and automated methods lack the capacity of a
manual annotator to introduce additional orthogonal data types and to interpret the scientific literature. Manual
annotation, in turn, is highly labor-intensive. GENCODE release v36 represents the interpretation of nearly 10
million EST, cDNA and protein homologies. Given the anticipated volumes of data, with single experiments
producing more data than the entire INSDC catalogue, current methods of manual annotation do not scale. The
emergence of long-read transcriptomic sequencing methods allows the replacement of historical data types, to
the benefit of gene and transcript annotation. However, the massively greater data volumes already being
deposited in public data archives exceed manual curation capability, demanding implementation of automated
solutions without compromising annotation quality. Furthermore, as untargeted sequencing approaches are very
inefficient in their discovery of less abundant transcripts, the majority of sequence data generated gives us very
little insight into discoverable transcript diversity. To overcome these challenges, our two groups have
joined forces to expand the catalog of fully experimentally verified, full-length human protein-coding transcripts.
This proposal focuses on the integration of experimental approaches that will provide a comprehensive
enumeration of human protein-coding transcripts, a “Reference Human Transcriptome”, together with the development of
an automated annotation pipeline to allow the integration of this resource into GENCODE gene annotation.