Exploring new approaches for enhanced human gene annotation - Project Summary

This project will support our work on computational methods for gene finding, with the specific goal of improving human gene annotation. The human gene catalog has a tremendous impact on biomedical research, and a large and ever-growing number of genetic studies depend on it. Until we have a complete and accurate catalog of all human genes and transcripts, including variants that appear in abnormal tissues and tissue states, we will be unable to discover all the genetic causes of disease. Over the years we have developed many innovative and efficient systems to identify genes, and during the last decade our focus has shifted toward using data produced by RNA sequencing (RNA-seq), a technology that has revolutionized gene discovery. The most widely used of our systems is StringTie, a genome-guided transcriptome assembler that significantly improves on competing methods in both accuracy and efficiency. Over the last few years we have adapted StringTie to new technology and released several major updates, including a version that handles mixed transcriptomic data and achieves better assembly and quantification results than long-read, short-read, or corrected long-read assemblies alone. StringTie has an enormous user community, with more than 75,000 new users and software downloads over the past three years. The StringTie papers, combined, have accumulated more than 13,100 citations, and this number continues to grow. Because StringTie is so efficient, we were able to use a massive RNA-seq database to build CHESS, a new human gene catalog that represents a comprehensive, reproducible, and open approach to annotating the human genome.
Our work on CHESS has revealed a vast amount of transcriptional noise, which causes systematic errors when leading computational methods assemble and quantify the genes and transcripts in RNA-seq experiments; these errors in turn affect downstream assessments of variant impact. In our future work, we will address the exponential growth of RNA-seq data by developing efficient algorithms that can process and summarize enormous data sets, making large-scale transcriptome analyses faster and much more accurate. Our efforts will focus on improving gene annotation by developing methods to aggregate transcriptomic evidence across thousands of RNA-seq data sets. Our goals include identifying transcript features that are conserved across samples and creating maps of functional transcriptional activity characteristic of different cell and tissue types, developmental stages, and disease types. As a result, we envision creating a database of clinically relevant transcription profiles, for which we will then build new and robust methods to identify differential expression signals across all the conditions it covers. As in the past, we will also continue to adapt our methods and create new systems in response to new technology.