Generating a full-length reference transcriptome for human protein-coding genes - Abstract Efforts to elucidate the coding potential of the genome have benefited from accurate genome sequences and extensive transcriptome sequencing, which together allow detailed models of protein-coding sequences (CDSs) or open reading frames (ORFs). Although at least one reliable full-length transcript model has been assigned to every protein-coding gene, the majority of alternative isoforms remain uncharacterized due to i) vast differences in expression levels between isoforms expressed from the same gene, and ii) the difficulty of obtaining full-length (FL) transcript sequences. Furthermore, there remains a large discrepancy between the total number of transcripts in annotation databases and the number for which there is an annotated FL transcript with experimental evidence. The spectrum of encoded transcripts comprises a vast but finite “isoform-space” with multiple dimensions: i) genes, ii) tissues and cell types, iii) development and time, iv) disease, and v) response to stimuli. Just as expression levels vary across cells and tissues, so can the relative abundance of alternatively spliced transcripts. A full functional understanding of the human genome will not be possible without empirical knowledge and complete annotation of the entire complement of encoded functional proteins. Historically, gene annotation was supported predominantly by ESTs and mRNAs from INSDC databases, whereas automated approaches to annotation are now being applied to whole genomes and transcriptomes. However, current automated annotation does not provide data of the same quality as manual annotation. Sensitivity and specificity are reduced, less functional annotation is captured, and all automated methods lack the capacity of a manual annotator to incorporate additional orthogonal data types and to interpret the scientific literature. Manual annotation, however, is highly labor-intensive.
GENCODE release v36 represents the interpretation of nearly 10 million EST, cDNA and protein homologies. Given the anticipated volumes of data, with single experiments producing more data than the entire INSDC catalog, current methods of manual annotation do not scale. The emergence of long-read transcriptomic sequencing methods makes it possible to replace historical data types, to the benefit of gene and transcript annotation. However, the massively greater data volumes already being deposited in public data archives exceed manual curation capacity and demand automated solutions that do not compromise annotation quality. Furthermore, as untargeted sequencing approaches are very inefficient at discovering less abundant transcripts, the majority of the sequence data generated gives very little insight into discoverable transcript diversity. To overcome these challenges, our two groups have joined forces to expand the catalog of fully experimentally verified full-length human protein-coding transcripts. This proposal focuses on integrating experimental approaches that will provide a comprehensive enumeration of human protein-coding transcripts, a “Reference Human Transcriptome”, with the development of an automated annotation pipeline that will allow this resource to be integrated into GENCODE gene annotation.