PROJECT SUMMARY/ABSTRACT
Studying tumors by quantifying gene expression via RNA-sequencing (RNA-seq) has proven crucial
to elucidating their active biological pathways and processes, how they differ from normal tissue, and
how they might be targeted for therapy. Furthermore, new single-cell RNA-seq (scRNA-seq)
techniques are beginning to uncover the heterogeneity of tumors by profiling them at single-cell
resolution. Deriving knowledge of pathway activity from expression data requires the application of
methods such as Gene Set Enrichment Analysis (GSEA), which is a community standard for
assessing the coordinate up- or down-regulation of pathways, processes, and phenotypes
represented by groups of genes or ‘gene sets’. As GSEA requires high-quality and well-annotated
gene sets for a robust analysis, the Mesirov lab maintains and freely distributes the Molecular
Signatures Database (MSigDB), which contains multiple collections of gene sets to accompany our
GSEA software. Ideally, this database would consist of coherent gene sets, that is, sets whose
member genes show coordinate up-regulation or coordinate down-regulation that specifically indicates
activation or repression of a given pathway or process relevant to a particular cell type or disease
phenotype. However, due to the manner of collection of some gene sets in MSigDB, e.g., curation
from scientific publications or extraction from canonical pathway databases, some of the gene sets
lack coherence. In addition, users of our GSEA implementations are beginning to input new
scRNA-seq data, yet we have identified statistical problems arising from the sparsity of
these data that make standard GSEA results uninterpretable. To address these concerns, we
propose the following aims.
Aim 1: We will develop a data-driven refinement approach for the gene sets in the MSigDB.
Our approach will leverage large-scale compendia of expression datasets and protein-protein
interaction networks, using existing gene sets as starting points from which to construct refined gene sets.
Aim 2: We will use the refinement method from Aim 1 to assemble a new Hallmark collection
of refined gene sets for use in GSEA.
Aim 3: We will develop and validate an approach to pathway enrichment detection that
accounts for the sparsity of scRNA-seq data.
Upon completion of these aims, we will have released a new, freely available collection of
gene sets that enable more robust GSEA, as well as a new method that allows these or any other
gene sets to be tested for enrichment in scRNA-seq data.