In the past decade, there has been an explosion of data collected from biological and biomedical systems, in
terms of both type and volume. Mining these high-dimensional, heterogeneous, and often dynamic datasets to
make biologically or medically important inferences or develop predictive models requires new, sophisticated
data analytics methods. New machine learning methods have begun to meet this need, but most of them
generate “black box” models that lack clear interpretability. Additionally, these methods are associative and
thus incapable of teasing out the complex cause-effect relationships among features in the dataset. Directed
causal graphical models (DCGMs) are a powerful tool for addressing both limitations. DCGMs, learned from observational
datasets, can represent causal relationships between variables. This makes DCGMs useful for generating mechanistic
hypotheses and constructing parsimonious, causally informed predictive models. However, biomedical datasets
often have features that make it difficult to construct causal graphical models over the full dataset. Examples
include data type heterogeneity, high dimensionality, multicollinearity, cyclicity, and nonstationarity. To address
these problems, I propose to develop methods for learning causal graphs in datasets characterized by (1) a
heterogeneous mixture of continuous, categorical, and censored variables, (2) high dimensionality and
multicollinearity, and (3) cyclicity and nonstationarity. In Aim 1, I will develop a new causal discovery algorithm
that accommodates continuous, categorical, and censored (e.g., survival) variables. In Aim 2, I will test and
compare matrix decomposition and dimensionality reduction methods on their ability to learn a
meaningful low-dimensional latent feature space for use in graph learning methods. In Aim 3, I will develop
a new method for causal discovery in dynamic, possibly cyclic gene regulatory networks at single-cell resolution.
In all cases, testing and validation will be performed on synthetic and real-world, publicly available datasets. These
methodological improvements constitute important steps forward in the field of causal discovery, and they can
be utilized together or independently to provide a flexible and powerful platform for analysis of a wide range of
biomedical datasets. Once made available, they will enable researchers to make inferences about causal
mechanisms, generate hypotheses, and build robust, parsimonious predictive models.