Development and implementation of statistical machine learning methods to shorten rare disease odysseys - Patients with rare diseases (RDs) face tremendous physical, psychosocial, and economic suffering in their protracted journeys toward diagnosis and therapy. These journeys, known as diagnostic and therapeutic odysseys, are riddled with diagnostic delays and difficulties finding effective treatment strategies. The Undiagnosed Diseases Network (UDN) at the NIH was established to diagnose individuals who are living with the often dire consequences of an RD. Despite the UDN’s comprehensive diagnostic approach, 70% of patients remain undiagnosed, highlighting the need for novel diagnostic strategies. The diagnostic approach at the UDN currently relies on manual extraction of RD phenotypes from clinical notes in electronic health records (EHRs), which is laborious and time-consuming. A promising alternative is to leverage natural language processing (NLP) models, which can automatically extract fine-grained RD phenotypes from clinical notes, to support timely diagnosis at the UDN. Existing general NLP models, however, are not suitable for supporting diagnosis at the UDN. Furthermore, NLP models have limited impact on diagnosis due to scarce infrastructure for delivering them to the clinic, highlighting the need to bridge the implementation gap between NLP and practice. Even after diagnosis, patients often undergo therapeutic odysseys. Despite advancements in gene therapy, evidence shows that genetics alone do not account for the wide diversity in RD phenotypes. Exposures also play a critical role, but less is known about how their causal effects vary across individuals. This knowledge gap underscores the need to elucidate the complex phenome-genome-exposome interplay on an individual-level basis, which is crucial in informing personalized disease management strategies.
The overall objective of this proposal is to develop and implement advanced statistical machine learning (ML) methods aimed at shortening RD odysseys. During the K99 phase, I will develop a novel NLP system to extract RD phenotypes from clinical notes (Aim 1) and implement it using REDCap at the Vanderbilt UDN (Aim 2). During the R00 phase, I will leverage phenomic, genomic, and exposomic data from All of Us and build a causal inference framework that uses modern statistical ML techniques to estimate personalized causal effects of exposures on RD phenotypes (Aim 3). The expected outcomes are a novel, open-source NLP system for RDs, an implementation framework using REDCap to support timely diagnosis at the Vanderbilt UDN, and an advanced, reproducible causal inference framework to elucidate the complex phenome-genome-exposome interplay underlying RDs on an individual-level basis. During the K99 phase, the PI will be mentored by experts in NLP, REDCap, EHR phenotyping, and RDs at Vanderbilt, and develop competencies in those areas. This proposal will yield results for subsequent studies on data-driven approaches aimed at shortening RD odysseys. This award will provide the necessary training to supplement the PI’s expertise in statistical ML and causal inference and help her transition into an independent career in biomedical data science.