Patients with rare diseases (RDs) face tremendous physical, psychosocial, and economic suffering in their
protracted journeys toward diagnosis and therapy. These journeys, known as diagnostic and therapeutic
odysseys, are riddled with diagnostic delays and difficulties finding effective treatment strategies.
The Undiagnosed Diseases Network (UDN) at the NIH was established to diagnose individuals who are living with
the often dire consequences of an RD. Despite the UDN’s comprehensive diagnostic approach, 70% of
patients remain undiagnosed, highlighting the need for novel diagnostic strategies. The diagnostic approach at
the UDN currently relies on manual extraction of RD phenotypes from clinical notes in electronic health records
(EHR), which is laborious and time-consuming. A promising alternative is to leverage natural language
processing (NLP) models, which can automatically extract fine-grained RD phenotypes from clinical notes, to
support timely diagnosis at the UDN. Existing general NLP models, however, are not suitable for supporting
diagnosis at the UDN. Furthermore, NLP models have limited impact on diagnosis due to scarce infrastructure
for delivering them to the clinic, highlighting the need to bridge the implementation gap between NLP and
practice. Even after diagnosis, patients often undergo therapeutic odysseys. Despite advancements in gene
therapy, evidence shows that genetics alone does not account for the wide diversity in RD phenotypes.
Exposures also play a critical role, but less is known about how their causal effects vary across individuals.
This knowledge gap underscores the need to elucidate the complex phenome-genome-exposome interplay on
an individual-level basis, which is crucial in informing personalized disease management strategies. The
overall objective of this proposal is to develop and implement advanced statistical machine learning
(ML) methods aimed at shortening RD odysseys. During the K99 phase, the PI will develop a novel NLP system
to extract RD phenotypes from clinical notes (Aim 1) and implement it using REDCap at the Vanderbilt UDN
(Aim 2). During the R00 phase, the PI will leverage phenomic, genomic, and exposomic data from All of Us and
build a causal inference framework that uses modern statistical ML techniques to estimate personalized causal
effects of exposures on RD phenotypes (Aim 3). The expected outcomes are a novel, open-source NLP
system for RDs, an implementation framework using REDCap to support timely diagnosis at the Vanderbilt
UDN, and an advanced, reproducible causal inference framework to elucidate the complex phenome-genome-
exposome interplay underlying RDs on an individual-level basis. During the K99 phase, the PI will be mentored
by experts in NLP, REDCap, EHR phenotyping, and RDs at Vanderbilt, and develop competencies in those
areas. This proposal will yield results that inform subsequent studies on data-driven approaches aimed at shortening
RD odysseys. This award will provide the necessary training to supplement the PI’s expertise in statistical ML
and causal inference and help her transition into an independent career in biomedical data science.