PROJECT SUMMARY / ABSTRACT
RNA-seq is a powerful tool for studying molecular biology. However, without cell sorting (or related techniques),
conventional RNA-seq applied to tissue samples cannot determine gene expression in underlying cell-types.
This is problematic because differential gene expression observed at the tissue level is not necessarily reflected
in underlying cell-types, which obscures biological insight. For example, Schmiedel et al. recently applied RNA-seq
to 13 purified blood cell-types from 106 individuals¹, which uncovered the molecular basis of sex-specific
differences in immune response. However, these effects were obscured when they applied RNA-seq to whole blood alone.
Single-cell RNA-seq is the obvious candidate to probe cell-type-specific effects more broadly. However, for most
tissues, single-cell RNA-seq has been restricted to small sample sizes, due to specialized dissociation protocols
and cost. Thus, only bulk-tissue RNA-seq data are available for large sample sizes. Crucially, many of these
bulk datasets are paired with enormous stores of informative clinical phenotypic data and additional -omics data. These
datasets include large NIH initiatives such as GTEx, TCGA, and All of Us, which have collected data on genetics,
disease status, outcome, drug treatments, ethnicity, sex, and much more. The critical gap is that we cannot
currently study the relationship between cell-type level gene expression and any of these phenotypes.
To overcome this limitation, we will develop computational tools for estimating cell-type-specific differential
expression from bulk RNA-seq data, when a small reference single-cell RNA-seq dataset is available from the
same tissue-type. This will allow us to study the cell-type-specific differences in expression that drive human
phenotypes and diseases, unlocking the tens of thousands of bulk RNA-seq samples paired with phenotypic data.
The basis for this research program is a previous study where we developed a method to recover the cell-type-
specific effects of inherited genetic variation on gene expression in bulk breast-tumor RNA-seq data. This method
allowed us to discover a novel breast cancer risk gene that was obscured when using conventional methods.
Here, we posit that a similar mathematical framework can be adapted to recover any cell-type-specific effect
from bulk-tissue RNA-seq. Hence, we can develop specific tools to perform multiple commonly applied analyses
at cell-type-specific resolution from bulk-tissue RNA-seq by leveraging matched single-cell data, including
differential expression, correlative, and gene set enrichment analyses.
Finally, new spatial transcriptomics technologies are emerging that enable spatially resolved gene expression to
be measured directly in tissue sections. These platforms quantify gene expression in situ in ~100 μm barcoded
spots. Each spot captures a small cluster of cells, akin to a miniaturized bulk-tissue RNA-seq experiment.
Hence, the same abstract mathematical framework can be used to identify effects such as cell-type-specific
spatial variation in gene expression. Computational tools for these data are evolving quickly; thus, this award will
also allow us to develop methods that meet the changing needs of these new gene expression platforms.