Progression Subtyping and Drug Target Identification for Parkinson's Disease with Integrative Machine Learning
ABSTRACT
Parkinson's Disease (PD) is the second most prevalent neurodegenerative disorder worldwide, affecting an estimated 2-3% of people older than 65 years. The underlying etiology and pathophysiology of PD remain unclear to date. Furthermore, PD patients show great heterogeneity in disease progression, a critical factor that hinders therapeutic development and complicates the search for effective disease-modifying treatments or prevention strategies. To address these challenges, extensive resources for PD research have become available, including clinical, multi-omics, and neuroimaging data generated from well-designed research initiatives such as the Parkinson's Progression Markers Initiative (PPMI) and the Parkinson Disease Biomarkers Program (PDBP); general data sources in biology such as protein-protein interactome network data and functional genomic data; a comprehensive biomedical knowledge graph (BKG) we built; and a continuously increasing volume of real-world patient data (RWD). Integrative analysis of these massive and heterogeneous data poses considerable challenges to conventional computational approaches for deriving valuable and reliable insights. Despite numerous efforts to develop novel machine learning (ML) algorithms for analyzing these data, existing methods have typically focused on one or a few data types. Therefore, there is a critical need to develop ML methods that perform integrative and effective analyses of heterogeneous PD data sources to derive comprehensive insights. This project will build an integrative ML pipeline to meet this need through three specific aims. Aim 1 identifies progression subtypes of PD through integrative modeling of longitudinal clinical, transcriptomic, and neuroimaging data of participants in the PPMI and PDBP cohorts.
Aim 2 identifies gene modules that govern differential PD progression through integrative analysis of multi-omics data with network medicine and ML. Aim 3 evaluates and validates the gene modules as drug targets through in-silico drug repurposing using multi-omics data, our BKG, and real-world patient data. In sum, this pipeline will perform integrative analysis of longitudinal clinical, transcriptomic, and neuroimaging data, as well as whole-genome/exome sequencing data, from the PPMI and PDBP cohorts; publicly available human interactome data; functional genomic data; drug-perturbation multi-omics data; our BKG; and real-world electronic health record (EHR) data from the INSIGHT network (covering ~12 million patients across New York City's five health systems and the greater metropolitan area), the Cleveland Clinic EHR database (covering ~11 million patients extracted from IBM Explorys), and the Temple Health EHR (covering ~1.2 million patients).