New data science approaches to visualize and understand the impact of the microbiome on risk of graft-versus-host disease

Project Summary/Abstract

Allogeneic stem cell transplantation is a life-saving therapy for a variety of blood disorders, but its use is limited by a high rate of serious side effects, including the development of graft-versus-host disease (GVHD). The gut microbiome, the community of microorganisms populating the digestive tract, plays a key role in triggering this inflammatory response, and there is an urgent need to analyze patient microbiome profiles to both predict and mitigate risk of GVHD. However, the high dimensionality, heterogeneity across subjects, and complex phylogenetic relationships of microbiome data pose statistical challenges that existing methods do not address. In this proposal, we will develop new data science approaches to make sense of microbiome data, providing insight that can guide the development of future interventions aimed at reducing GVHD incidence. We will develop accurate and efficient methods for microbiome data analysis and make them available in user-friendly formats. We focus on novel methods for visualization and prediction using microbiome data, as detailed in the following specific aims:

Specific Aim 1: To develop and evaluate advanced tools for visualization of microbiome data. The high dimensionality and unique structure of microbiome data present challenges to effective data visualization. In this aim, we will develop approaches for both unsupervised and supervised visualization of microbiome data, along with an RShiny app and QIIME2 plug-in that will make these tools accessible to both clinicians and bioinformaticians.
The methods and software resulting from this aim will give researchers robust ways to visualize global microbiome heterogeneity across their study population, enhancing data exploration and the identification of potential confounding factors or outliers.

Specific Aim 2: To develop predictive modeling approaches for binary and survival outcomes. In this aim, we will focus on selection of predictive microbiome features in the context of regression. We will make key advances that enable the effective application of sparse modeling to predict GVHD risk: novel statistical approaches to handle binary and time-to-event outcomes, including those with competing risks, and computationally efficient implementations, which will be made freely available as both an R package and an RShiny application.

Specific Aim 3: To develop methods for understanding the impact of rare features. Current microbiome profiling methods allow for very fine resolution of the strains present in each sample. In this aim, we propose two methods to understand the impact of rare features. We will first develop a method that provides insight into kernel association results by estimating effect sizes for individual microbiome features. We will then develop an approach for nonparametric clustering of the regression coefficients, which allows flexible aggregation of the observed rare features.

Successful completion of this work will result in new statistical and computational approaches that provide insights into microbiome data, generating hypotheses that can guide the development of future strategies to predict and mitigate GVHD. These methods will be disseminated through easy-to-use and efficient cloud-based software implementations.