Abstract
With growing interest in personalized medicine and the rise of machine learning, constructing good risk
prediction and prognostic models has been drawing renewed attention. In this development, much effort
is concentrated on identifying good predictors of patient outcomes, although the same level of rigor is often
absent in improving the outcome side of prediction. Most popular supervised techniques (e.g.,
regularized logistic regression and its variations), which can be readily applied in risk model development,
assume that the prediction target is a single, clearly defined outcome measured at a single time point. In clinical
reality, patient outcomes are often complex, multivariate, and measured with error. Even when a target
is a relatively clear univariate outcome (e.g., death, cancer, or diabetes), the process that leads to this
ultimate outcome often involves complex intermediate outcomes, where predicting and understanding this
intermediate process can be crucial in providing effective care and preventing negative ultimate outcomes.
The situation calls for a flexible learning framework that can easily incorporate this important but neglected
aspect in model development - better characterizing and constructing prediction targets before building
prediction models.
Focusing on risk labels as prediction targets, we propose a pragmatic 3-stage learning approach,
where we sequentially 1) generate latent labels, 2) validate them using explicit validators, and 3) perform
supervised learning with the labeled data. Latent variable (LV) strategies used in Stage 1 have great
potential for handling complex outcome information. The unsupervised nature of LV strategies makes
highly flexible data synthesis and organization possible. The same nature, however, can also be seen
as esoteric and subjective, which is undesirable in situations where transparency and reproducibility are
of great concern, such as risk prediction. As a practical solution to this problem, we propose the use
of explicit clinical validators, which not only makes LV-based labels closely aligned with contemporary
science and clinical practice, but also makes it possible to automatically validate and narrow a large
pool of candidate labels. With the goal of developing a practical and transparent system of learning
and inference for clinical research and practice, we formed a highly interdisciplinary team of researchers
with expertise in latent variable modeling, machine learning, psychometrics, and causal inference, along
with clinical and substantive expertise. Our streamlined learning framework focuses on direct and transparent
validation of latent variable solutions to ensure clear communication across risk model developers, clinical
researchers and practitioners. The project ultimately aims to improve personalized treatment and care by
improving risk prediction.
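
To make the 3-stage idea concrete, the following is a minimal sketch on synthetic data, not the project's actual implementation. Everything in it is an illustrative assumption: k-means stands in for a latent variable model in Stage 1, the "validator" is a hypothetical binary clinical event, the validation score is a simple difference in validator rates, and Stage 3 uses a one-predictor logistic regression fitted by gradient descent. All function and variable names are hypothetical.

```python
import math
import random

def kmeans(rows, k, iters=50, seed=0):
    """Stage 1 stand-in: group multivariate outcomes into k candidate latent labels."""
    rng = random.Random(seed)
    centers = rng.sample(rows, k)
    labels = [0] * len(rows)
    for _ in range(iters):
        # Assign each patient to the nearest center, then recompute the centers.
        labels = [min(range(k), key=lambda c: math.dist(r, centers[c])) for r in rows]
        for c in range(k):
            members = [r for r, l in zip(rows, labels) if l == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def validate(labels, validator):
    """Stage 2: pick the class with the highest validator rate and score its gap to the rest."""
    classes = sorted(set(labels))
    rate = {c: sum(v for l, v in zip(labels, validator) if l == c) / labels.count(c)
            for c in classes}
    high = max(classes, key=rate.get)
    rest = [v for l, v in zip(labels, validator) if l != high]
    gap = rate[high] - sum(rest) / len(rest) if rest else 0.0
    return high, gap

def fit_logistic(xs, ys, lr=0.1, epochs=500):
    """Stage 3: one-predictor logistic regression fitted by batch gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        gw = gb = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))
            gw += (p - y) * x
            gb += p - y
        w -= lr * gw / n
        b -= lr * gb / n
    return w, b

# Synthetic cohort (assumption): a hidden risk status drives everything else.
rng = random.Random(42)
n = 200
z = [1 if rng.random() < 0.4 else 0 for _ in range(n)]
outcomes = [[rng.gauss(4 * zi, 1), rng.gauss(4 * zi, 1)] for zi in z]       # intermediate outcomes
validator = [1 if rng.random() < (0.8 if zi else 0.1) else 0 for zi in z]  # explicit clinical validator
baseline = [rng.gauss(2 * zi, 1) for zi in z]                              # baseline predictor

# Stage 1: generate a pool of candidate labelings (here, different numbers of classes).
candidates = {k: kmeans(outcomes, k) for k in (2, 3)}

# Stage 2: score every candidate against the validator and keep the best one.
checked = {k: validate(lab, validator) for k, lab in candidates.items()}
best_k = max(checked, key=lambda k: checked[k][1])
high, gap = checked[best_k]
risk = [1 if l == high else 0 for l in candidates[best_k]]

# Stage 3: supervised learning of the validated risk label from the baseline predictor.
w, b = fit_logistic(baseline, risk)
pred = [1 if w * x + b > 0 else 0 for x in baseline]
acc = sum(p == r for p, r in zip(pred, risk)) / n
print(f"classes={best_k} validator gap={gap:.2f} training accuracy={acc:.2f}")
```

The point of the sketch is the ordering: the prediction target is constructed and checked against an explicit validator before any supervised model is fitted, so that the choice among candidate labels is automated and inspectable rather than subjective.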