PROJECT SUMMARY
Integrating high-dimensional and heterogeneous biomedical data, such as electronic health
records (EHRs), molecular data, imaging, and free text, is a key challenge for making robust
discoveries that transform healthcare. Current work in the literature commonly analyzes
biomedical data types separately, focuses on small disease-related cohorts of patients, and relies on
domain experts and manual clinical feature selection in an ad hoc manner. Although
appropriate in some situations, supervised definitions of the feature space scale poorly, do not
generalize well, include inherent biases, and miss opportunities to discover novel patterns and
features. To address these issues, we will develop novel methods based on unsupervised
machine learning to derive low-dimensional vector-based representations, i.e., “embeddings”, of
medical concepts and patient clinical histories from large-scale, multi-modal and domain-free
biomedical datasets. These pre-computed representations aim to overcome common biases due to
population, supervised labeling, and specific hospital operation processes. These multi-modal
embeddings can be fine-tuned and applied to a number of specific predictive tasks,
improving scalability, generalizability and effectiveness of machine learning models in
healthcare. In particular, we will first develop methods based on unsupervised learning to create
multi-modal embeddings of medical concepts using heterogeneous EHRs, linked biobanks and
electrocardiogram waveform data, from the diverse population of five hospitals within the Mount
Sinai Health System in New York, NY, and publicly available medical knowledge. We will then
create a scalable framework to compute unsupervised multi-modal embeddings that can
summarize patient clinical histories and lead to subtyping and patient stratification. We will also
develop a federated learning system to share, visualize, and combine embeddings generated
separately at different medical institutes to capture a larger and more diverse population and
clinical landscape. We will apply embeddings to advance methods for EHR-based disease
phenotyping, onset prediction, and subtyping. While tested on EHRs, genetic and waveform data
from linked repositories, and medical knowledge, the proposed approaches will be easily
extendable to include other data, such as clinical images. This project will represent a step
towards the next generation of machine learning in healthcare that can (i) scale to billions of patients, (ii)
embed complex relationships of multi-modal data, and (iii) create less biased disease
representations by securely learning from patients across institutions via federated learning.
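As an illustrative sketch only, and not the project's actual pipeline, unsupervised concept embeddings of the kind described above can be derived from co-occurrence statistics of medical codes. The visit data, codes, and embedding dimensionality below are hypothetical toy choices; a classical co-occurrence-plus-SVD technique stands in for the project's large-scale multi-modal methods:

```python
import numpy as np

# Hypothetical toy data: each patient visit is a short sequence of
# medical concept codes (ICD-10-like strings). Real EHR corpora are
# vastly larger and multi-modal.
visits = [
    ["I10", "E11", "E78"],   # hypertension, type 2 diabetes, hyperlipidemia
    ["I10", "E78", "I25"],   # hypertension, hyperlipidemia, ischemic heart disease
    ["E11", "E78", "I10"],
    ["J45", "J30"],          # asthma, allergic rhinitis
    ["J45", "J30", "J45"],
]

vocab = sorted({code for visit in visits for code in visit})
idx = {code: i for i, code in enumerate(vocab)}

# Count within-visit co-occurrences (the "context window" is the visit).
n = len(vocab)
cooc = np.zeros((n, n))
for visit in visits:
    for a in visit:
        for b in visit:
            if a != b:
                cooc[idx[a], idx[b]] += 1

# Unsupervised low-dimensional embeddings via truncated SVD of the
# log-smoothed co-occurrence matrix.
u, s, _ = np.linalg.svd(np.log1p(cooc))
dim = 2
emb = u[:, :dim] * s[:dim]  # one dim-dimensional vector per concept

def cosine(x, y):
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y) + 1e-12))

# Cardiometabolic codes co-occur in the toy data, so their embeddings end
# up closer to each other than to the respiratory codes.
sim_metabolic = cosine(emb[idx["I10"]], emb[idx["E11"]])
sim_cross = cosine(emb[idx["I10"]], emb[idx["J45"]])
```

In the proposed work, such code-level representations would instead be learned at scale from multi-modal EHR, biobank, and waveform data, then fine-tuned for downstream predictive tasks.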