A knowledge-guided analysis approach to recovering rare signals from single-cell transcriptomic data

PROJECT SUMMARY

This innovative project aims to develop computational methods that overcome a major limitation of existing methods for analyzing single-cell transcriptomic data. Single-cell transcriptomics has enabled the profiling of gene expression in individual cells. This is particularly useful for studying rare cells, because their signals are largely invisible when mixed with those of other cells in a bulk sample. There are many examples of biologically important rare cells, such as tissue stem cells, senescent cells, endothelial progenitors, and tumor-initiating cells. Ironically, existing analysis methods for single-cell transcriptomic data often ignore rare cells. This is because dimensionality reduction, a standard step in all mainstream analysis pipelines, can easily discard rare signals in order to preserve the most prominent ones. As a result, rare cells usually do not cluster together in the reduced data, which in turn makes it difficult to identify and study them.

To tackle this problem, we propose the novel concept of knowledge-guided single-cell data analysis. Taking marker genes of the cells of interest as externally supplied knowledge, our algorithm will be instructed to retain information about these genes during dimensionality reduction. As a result, the rare cells will be much more likely to cluster together in the reduced data. Another important application of our methods is separating highly similar cell sub-populations. When the differentially expressed genes (DEGs) between such sub-populations are supplied as knowledge input, the sub-populations will become better separated in the reduced data.

In Aim 1, we will design and implement the computational methods. We will use the autoencoder artificial neural network framework, which has proven useful for single-cell data, and introduce novel components that take in and use the external knowledge.
A key aspect of our methods is that they will respect both the external knowledge and the data itself: the dimensionality reduction process will pay attention to the marker genes/DEGs only if the most prominent signals in the data can be preserved at the same time.

In Aim 2, we will use published data sets to systematically test how effectively our methods identify rare cells and separate highly similar cell sub-populations. We will define cell populations from independent data, such as the cell surface protein measurements in CITE-seq, and use these definitions to quantitatively assess how well our methods cluster rare cells and separate different cell sub-populations. We will benchmark our methods against state-of-the-art single-cell data analysis methods.

In Aim 3, we will assess the effects of noisy and incomplete knowledge inputs. A noisy input contains genes that are not specifically expressed in a rare cell type or not differentially expressed between cell sub-populations. An incomplete input omits genes that are specifically or differentially expressed. We will artificially include noisy genes and exclude informative genes to study the tradeoffs between comprehensive yet noisy and precise yet incomplete knowledge inputs.

Overall, this project will produce computational methods and open-source software that will propel the study of important rare signals in single-cell transcriptomic data.
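
As a rough illustration of the ideas behind Aims 1 and 3, the Python sketch below shows one possible way a reconstruction loss could give extra weight to supplied marker genes, and how a knowledge input could be made noisy or incomplete for testing. It is not the project's implementation. The function names, the `marker_weight` parameter, the placeholder gene labels, and the choice of a weighted mean squared error are all assumptions made for this example.

```python
import random


def knowledge_weighted_loss(original, reconstructed, marker_idx, marker_weight=5.0):
    """Weighted mean squared reconstruction error (illustrative assumption).

    original, reconstructed: lists of per-cell expression vectors of equal shape.
    marker_idx: set of gene (column) indices supplied as external knowledge.
    marker_weight: hypothetical hyperparameter; 1.0 gives the ordinary MSE.
    """
    total, weight_sum = 0.0, 0.0
    for x_row, y_row in zip(original, reconstructed):
        for g, (x, y) in enumerate(zip(x_row, y_row)):
            w = marker_weight if g in marker_idx else 1.0
            total += w * (x - y) ** 2
            weight_sum += w
    return total / weight_sum


def perturb_knowledge(markers, all_genes, n_noise=0, n_drop=0, seed=0):
    """Return a noisy and/or incomplete copy of a marker gene list."""
    rng = random.Random(seed)
    markers = list(markers)
    # Incomplete input: randomly drop n_drop informative genes.
    kept = rng.sample(markers, len(markers) - n_drop)
    # Noisy input: randomly add n_noise genes that are not true markers.
    marker_set = set(markers)
    candidates = [g for g in all_genes if g not in marker_set]
    noise = rng.sample(candidates, n_noise)
    return kept + noise


# Toy data: 2 cells x 3 genes; only gene 2 of the first cell is reconstructed badly.
cells = [[1.0, 0.0, 3.0], [2.0, 1.0, 0.0]]
recon = [[1.0, 0.0, 1.0], [2.0, 1.0, 0.0]]

plain = knowledge_weighted_loss(cells, recon, marker_idx=set())
guided = knowledge_weighted_loss(cells, recon, marker_idx={2})
print(plain, guided)  # the error on gene 2 costs more once it is a marker

# Placeholder gene labels, not real marker genes.
print(perturb_knowledge(["g1", "g2", "g3"], ["g1", "g2", "g3", "g4", "g5"],
                        n_noise=1, n_drop=1))
```

With `marker_weight=1.0` the loss reduces to the ordinary mean squared error, so in this sketch the weight controls how strongly the supplied knowledge competes with the dominant signals in the data.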