PROJECT SUMMARY
Randomized controlled trials (RCTs) are the gold standard in clinical research but are subject to many
limitations including high costs, limited generalizability, and small sample sizes in patient subgroups. By
contrast, electronic health records (EHRs) are widely available and contain information on large and
representative patient cohorts. However, because they capture the uncontrolled observations of many
clinicians, they are highly susceptible to bias. The recent availability of raw data from RCTs has created a
unique opportunity to integrate these data with EHR data, and to develop new methods that exploit the distinct
advantages of each dataset.
We propose to identify the zone of overlap between these two data sources and build bridges between their data representations.
These bridges could enable us to better emulate randomized trials using EHR data and measure the same
effects seen in the trials. Consequently, they would allow us to study subgroups that were excluded from the
pivotal trials associated with new drug approvals by the FDA.
We will test these ideas in the context of ulcerative colitis (UC) and extend them to other diseases in future work. We
have obtained access to the raw data from 12 RCTs in UC (N=6,226). These data contain timed and structured
measurements of disease activity including the Mayo score, a composite score of patient symptoms and
endoscopic severity. We have also obtained access to the EHR data of 3,270 UC patients treated at the
University of California, San Francisco. These records contain information similar to that in the RCTs, but largely in unstructured
form. In addition, EHR assessments of disease activity tend to be less complete than those in trials because of the cost and invasiveness of
some tests. We will address this problem of unharmonized and incomplete EHR data in three aims.
In Aim 1, we will harmonize the RCT data into an analysis-ready format. We will also develop text
classification tools to transform free-text EHR data into Mayo subscores, and validate these tools against
data from a second center. In Aim 2, we will integrate the RCT and EHR data, train algorithms to impute RCT-
based representations of the patient state from partial measurements made in EHRs, and test them under
conditions typifying real-world data capture. In Aim 3, we will use these algorithms to harmonize EHR data,
validate them as a tool to recover the same effects as RCTs, and study new patient subgroups.
The applicant will carry out these aims while training in biostatistics, natural language processing, and machine
learning, and pursuing broader career development. With the help of his mentors, he will launch a career dedicated to
developing and disseminating methods for learning from complex clinical data, and in so doing, promote a
future of better healthcare for all patients.