Multi-modal unsupervised embeddings to advance machine learning in healthcare

PROJECT SUMMARY

Integrating high-dimensional and heterogeneous biomedical data, such as electronic health records (EHRs), molecular data, imaging, and free text, is a key challenge for making robust discoveries that transform healthcare. Current work in the literature commonly analyzes biomedical data types separately, focuses on small disease-specific patient cohorts, and relies on domain experts and ad hoc manual clinical feature selection. Although appropriate in some situations, supervised definitions of the feature space scale poorly, do not generalize well, carry inherent biases, and miss opportunities to discover novel patterns and features.

To address these issues, we will develop novel methods based on unsupervised machine learning to derive low-dimensional vector-based representations, i.e., “embeddings”, of medical concepts and patient clinical histories from large-scale, multi-modal, and domain-free biomedical datasets. These pre-computed representations aim to overcome common biases arising from population composition, supervised labeling, and hospital-specific operational processes. The multi-modal embeddings can then be fine-tuned and applied to a range of specific predictive tasks, improving the scalability, generalizability, and effectiveness of machine learning models in healthcare.

In particular, we will first develop unsupervised learning methods to create multi-modal embeddings of medical concepts from heterogeneous EHRs, linked biobank and electrocardiogram waveform data from the diverse population of five hospitals within the Mount Sinai Health System in New York, NY, and publicly available medical knowledge. We will then create a scalable framework to compute unsupervised multi-modal embeddings that summarize patient clinical histories and support patient subtyping and stratification. We will also develop a federated learning system to share, visualize, and combine embeddings generated separately at different medical institutions, capturing a larger and more diverse population and clinical landscape. Finally, we will apply these embeddings to advance methods for EHR-based disease phenotyping, onset prediction, and subtyping. While tested on EHRs, genetic and waveform data from linked repositories, and medical knowledge, the proposed approaches will be readily extensible to other data types, such as clinical images.

This project represents a step towards the next generation of machine learning in healthcare: models that can (i) scale to billions of patients, (ii) embed complex relationships across multi-modal data, and (iii) create less biased disease representations by securely learning from patients across institutions via federated learning.
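The first aim, unsupervised embeddings of medical concepts, can be made concrete with a minimal word2vec-style sketch: treating each patient's time-ordered sequence of codes as a "sentence" lets a skip-gram model place co-occurring concepts near each other in vector space. The toy corpus, code identifiers, and hyperparameters below are hypothetical placeholders, not the project's actual pipeline.

```python
# Minimal sketch: word2vec-style embeddings of medical concepts.
# Each "sentence" is one patient's time-ordered sequence of codes
# (diagnoses, medications, labs); concepts that co-occur across many
# patient timelines end up close together in the embedding space.
from gensim.models import Word2Vec

# Hypothetical toy corpus; in practice this would stream millions of
# patient timelines from the EHR warehouse.
patient_sequences = [
    ["ICD10:E11.9", "RX:metformin", "LOINC:4548-4"],          # diabetes visit
    ["ICD10:I10", "RX:lisinopril", "ICD10:E11.9"],            # hypertension + diabetes
    ["ICD10:I50.9", "PROC:echocardiogram", "RX:furosemide"],  # heart failure
]

model = Word2Vec(
    sentences=patient_sequences,
    vector_size=128,   # embedding dimensionality
    window=5,          # co-occurrence context within a timeline
    min_count=1,       # keep rare codes in this toy example
    sg=1,              # skip-gram
    workers=4,
)

# Concepts that co-occur across patients get similar vectors.
print(model.wv.most_similar("ICD10:E11.9", topn=3))
```

Because training is unsupervised, no manual feature selection or outcome labels are needed; the geometry of the space is learned entirely from co-occurrence statistics.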
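For the second aim, summarizing patient clinical histories, one simple baseline (assumed here purely for illustration) is to average the concept embeddings appearing in a patient's record and then cluster the resulting patient vectors for stratification. The embedding lookup, patient records, and cluster count are invented; the project's actual methods would go beyond this averaging baseline.

```python
# Minimal sketch: summarize each patient as the mean of their concept
# embeddings, then cluster those summaries into candidate subtypes.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
DIM = 128

# Hypothetical concept-embedding lookup; in practice, the output of the
# unsupervised concept-embedding step.
concept_vectors = {
    code: rng.normal(size=DIM)
    for code in ["ICD10:E11.9", "RX:metformin", "ICD10:I50.9",
                 "RX:furosemide", "LOINC:4548-4"]
}

def embed_patient(codes):
    """Average the embeddings of the codes in a patient's history."""
    vecs = [concept_vectors[c] for c in codes if c in concept_vectors]
    return np.mean(vecs, axis=0) if vecs else np.zeros(DIM)

patients = {
    "patient_a": ["ICD10:E11.9", "RX:metformin"],
    "patient_b": ["ICD10:I50.9", "RX:furosemide"],
    "patient_c": ["ICD10:E11.9", "LOINC:4548-4"],
}

X = np.stack([embed_patient(codes) for codes in patients.values()])

# Unsupervised stratification: cluster labels are candidate subtypes.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(dict(zip(patients, labels)))
```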
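The federated-learning aim can be sketched as a FedAvg-style weighted average: each institution trains embeddings locally and shares only parameter matrices, which a coordinator combines in proportion to site size. The site names and patient counts below are invented, and the sketch assumes all sites share an aligned concept vocabulary; a real system would also need secure aggregation and embedding-space alignment.

```python
# Minimal sketch: FedAvg-style combination of embedding matrices trained
# at separate institutions. Only model parameters leave each site,
# never patient-level data.
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 1000, 128

# Hypothetical per-site embedding matrices and patient counts; rows are
# assumed aligned to a shared concept vocabulary.
sites = {
    "hospital_a": (rng.normal(size=(vocab_size, dim)), 50_000),
    "hospital_b": (rng.normal(size=(vocab_size, dim)), 20_000),
    "hospital_c": (rng.normal(size=(vocab_size, dim)), 80_000),
}

def federated_average(site_params):
    """Weight each site's parameters by its patient count (FedAvg)."""
    total = sum(n for _, n in site_params.values())
    return sum(w * (n / total) for w, n in site_params.values())

global_embeddings = federated_average(sites)
print(global_embeddings.shape)  # (1000, 128)
```

Weighting by patient count lets larger sites contribute proportionally while still letting smaller institutions broaden the population the shared embeddings reflect.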
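Finally, the downstream-application aim, using pre-computed embeddings for phenotyping and onset prediction, amounts to fitting a supervised model on top of the patient vectors. The features and onset labels below are synthetic stand-ins to show the shape of such a task, not project data or results.

```python
# Minimal sketch: use pre-computed patient embeddings as features for a
# supervised onset-prediction task (e.g., predicting future disease onset).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical stand-ins: 500 patient embeddings and binary onset labels.
X = rng.normal(size=(500, 128))
y = rng.integers(0, 2, size=500)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(f"held-out accuracy: {clf.score(X_test, y_test):.2f}")
```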