Discovering potential drugs and treatments for many diseases depends heavily on identifying differentially
expressed (DE) genes in disease conditions within individual cell types. While cells of individual types can
be sorted experimentally for DE analysis, computationally leveraging bulk tissue data has the advantages of
greater availability, lower cost, and less human handling. A critical step in this direction is to (completely)
deconvolute cell-type-specific gene expression from heterogeneous bulk tissue. Complete deconvolution can be
viewed as a nonnegative matrix factorization (NMF) problem; however, NMF is strongly ill-posed, and its
non-separable solutions pose great challenges for data interpretability. These challenges vary across
applications, so without special treatment, results from complete deconvolution of gene expression data
will make accurate DE analysis almost
impossible. This proposal will establish a mathematical model and associated computational algorithms
for bulk tissue RNA-seq analysis, with the goal of better data interpretability, reliability, and efficiency.
To tackle this challenge, the geometric structure of the given bulk tissue data set will first be explored
to identify marker genes for the constituent cell types. The model will then be established by (1) enforcing
the weak solvability condition of NMF (weak because of noise) and (2) imposing geometric constraints on the
data space of the known quantities. This work is motivated by a common characteristic of many biological
data sets, in which expression levels across sample tissues exhibit strong correlations among certain
genes. For massive amounts of biological data, fast stochastic computational algorithms will be developed.
After validation and benchmarking, the proposed model will be applied to DE analysis for various datasets.
The proposed model is important for deciphering cellular transcriptional alterations in many diseases. In
modeling strategy, this research provides a new perspective: observing the topological/geometric
structures of data and enforcing the corresponding constraints to enhance problem solvability and data
interpretability. In computation, this research develops nonlinear graph Laplacian regularized optimization
combined with stochastic compression algorithms, which can process massive data with low storage
requirements and low complexity, and which are well suited to modern computer hardware architectures.
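
To make the NMF view of complete deconvolution concrete, the following minimal sketch factors a nonnegative bulk expression matrix X (genes by samples) into a cell-type signature matrix S (genes by cell types) and a mixing matrix P (cell types by samples). It uses the standard multiplicative-update scheme for unconstrained NMF, not the model proposed here: the solvability condition, the geometric constraints, and the graph Laplacian regularization are all omitted. The function name, the argument names, and the symbols X, S, and P are illustrative assumptions and do not come from the proposal.

```python
import numpy as np


def nmf_deconvolve(X, k, n_iter=300, eps=1e-10, seed=0):
    """Plain NMF sketch: approximate X (genes x samples) by S @ P.

    S is (genes x k) and P is (k x samples), both nonnegative.
    Hypothetical helper for illustration only; it does not include the
    constraints or regularization described in the proposal.
    """
    rng = np.random.default_rng(seed)
    n_genes, n_samples = X.shape
    # Random strictly positive initialization of both factors.
    S = rng.random((n_genes, k)) + eps
    P = rng.random((k, n_samples)) + eps
    for _ in range(n_iter):
        # Multiplicative updates for the Frobenius-norm objective;
        # they keep every entry of S and P nonnegative.
        P *= (S.T @ X) / (S.T @ S @ P + eps)
        S *= (X @ P.T) / (S @ (P @ P.T) + eps)
    return S, P
```

Because the factorization is not unique, different seeds can return different pairs (S, P) with similar reconstruction error, which is one way to see the ill-posedness and interpretability problem discussed above.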
As