Statistical modeling of cross-sample variation and learning of latent structures in microbiome sequencing data - PROJECT ABSTRACT
The bacterial communities (microbiota) residing on the human body have been linked to a variety of acute and
chronic diseases and conditions, such as obesity, inflammatory bowel disorders, Type 2 diabetes, depression,
and urinary tract infections (UTIs), as well as to the host’s response to a variety of treatments and health
interventions for these diseases and conditions. As the critical role played by the microbiota has become
increasingly recognized, microbiome sequencing data sets are now routinely generated under ever more
sophisticated experimental designs and survey strategies. While such data share many features and
challenges with other modern big data, such as high dimensionality and sparsity, they also possess characteristics
peculiar to the microbiota, including (i) the explicit and latent contextual relationships among the bacterial species,
such as their evolutionary and functional relationships; and (ii) the substantial heterogeneity across samples and
complex structure in the sample-to-sample variation. Effective analysis of modern microbiome studies calls for
new statistical methodology that incorporates these important characteristics in the data generative mechanism.
This project’s objective is to develop a suite of statistical models, methods, algorithms, and software that meet
this urgent need. An initial aim is to develop a multi-scale probabilistic framework for modeling microbiome
compositions that effectively characterizes the high dimensionality, sparsity, and substantial cross-sample
variation in microbiome sequencing data, and accommodates common features of experimental designs, such as
covariates, batch effects, and multiple time points, while striking a balance among flexibility, analytical parsimony, and
computational tractability. An additional focus is to develop latent variable models for microbiome compositional
data for the purpose of identifying latent structures such as sample clusters and species subcommunities. A final
aim is to produce user-friendly, open-source software that implements all of the proposed methods for the
analysis of microbiome sequencing data. All of the models and methods developed are informed by two ongoing
collaborative projects of PI Ma and his team. One is on the identification of microbial communities
associated with UTIs in aging women, and the other on the study of longitudinal changes in the microbiome of
cancer patients undergoing hematopoietic stem cell transplantation. These studies will serve as testbeds for all
development. The models, methods, and software developed will not only improve prediction of health
outcomes in these and other microbiome studies but also help decipher the roles of the microbiome in various
diseases and biomedical processes, with the ultimate goal of enabling personalized interventions on patients’
microbiome compositions that improve health.