Methods for improved detection of activated molecular pathways in cancer - PROJECT SUMMARY/ABSTRACT

Studying tumors by quantifying gene expression via RNA-sequencing (RNA-seq) has proven crucial to elucidating their active biological pathways and processes, how they differ from normal tissue, and how they might be targeted for therapy. Furthermore, new single-cell RNA-seq (scRNA-seq) techniques are beginning to uncover the heterogeneity of tumors by profiling them at single-cell resolution. Deriving knowledge of pathway activity from expression data requires methods such as Gene Set Enrichment Analysis (GSEA), a community standard for assessing the coordinate up- or down-regulation of pathways, processes, and phenotypes represented by groups of genes, or ‘gene sets’. Because GSEA requires high-quality, well-annotated gene sets for a robust analysis, the Mesirov lab maintains and freely distributes the Molecular Signatures Database (MSigDB), which contains multiple collections of gene sets to accompany our GSEA software. Ideally, this database would consist of coherent gene sets, that is, sets whose member genes show coordinate up-regulation or coordinate down-regulation and specifically indicate activation or repression of a pathway or process relevant to a particular cell type or disease phenotype. However, because of how some gene sets in MSigDB were collected, e.g., by curation from scientific publications or extraction from canonical pathway databases, some of them lack coherence. In addition, users of our GSEA implementations are beginning to input new scRNA-seq data, and we have identified statistical problems arising from the sparsity of these data that make standard GSEA results uninterpretable. To address these concerns, we propose the following aims.

Aim 1: We will develop a data-driven refinement approach for the gene sets in MSigDB. Our approach will leverage large-scale compendia of expression datasets and protein-protein interaction networks, using existing gene sets as starting points from which to construct refined gene sets.
Aim 2: We will use the refinement method from Aim 1 to assemble a new Hallmark collection of refined gene sets for use in GSEA.
Aim 3: We will develop and validate an approach to pathway enrichment detection that accounts for the sparsity of scRNA-seq data.

Following the completion of these aims, we will have released a new, freely available collection of gene sets that enable more robust GSEA, as well as a new method that allows these, or any, gene sets to be used to test for enrichment in scRNA-seq data.
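
For readers unfamiliar with how GSEA scores a gene set, the following is a minimal sketch of the classic weighted running-sum enrichment statistic as it is commonly described. It is illustrative only: it is not the GSEA software itself, and it does not represent the refinement method of Aim 1 or the scRNA-seq approach of Aim 3. The function name `enrichment_score` and the weighting exponent `p` are assumptions made for this example, and permutation-based significance testing and score normalization are omitted.

```python
import numpy as np

def enrichment_score(ranked_genes, scores, gene_set, p=1.0):
    """Running-sum enrichment score for one gene set (illustrative sketch).

    ranked_genes: gene names ordered by decreasing ranking metric
    scores:       ranking metric for each gene, in the same order
    gene_set:     collection of gene names to test
    p:            exponent weighting each hit by |score| (hypothetical default)
    """
    scores = np.asarray(scores, dtype=float)
    members = set(gene_set)
    in_set = np.array([g in members for g in ranked_genes])

    n_hit = int(in_set.sum())
    n_miss = len(ranked_genes) - n_hit
    if n_hit == 0 or n_miss == 0:
        raise ValueError("gene set must overlap the ranked list only partially")

    # Genes in the set step the sum up in proportion to |score|**p;
    # genes outside the set step it down by a constant amount.
    hit_weights = np.where(in_set, np.abs(scores) ** p, 0.0)
    total = hit_weights.sum()
    if total == 0:
        raise ValueError("all gene-set members have a zero ranking metric")

    steps = np.where(in_set, hit_weights / total, -1.0 / n_miss)
    running = np.cumsum(steps)

    # The score is the running sum's maximum deviation from zero, sign kept.
    return float(running[np.argmax(np.abs(running))])
```

A set whose members cluster at the top of the ranked list yields a positive score and one whose members cluster at the bottom yields a negative score. When many genes have zero measured expression, as in sparse scRNA-seq profiles, the ranking contains extensive ties, which is one way to see why a rank-based statistic of this kind can become hard to interpret.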