PROJECT SUMMARY
Electronic health record (EHR) data represent a huge opportunity for cost-efficient clinical and public health
research, especially when a randomized trial or a prospective observational study is not feasible or ethical. EHR
systems, however, are typically developed to support clinical and/or billing activities. As such, substantial care
is needed when using EHR data to address a particular scientific question. One important potential threat
to validity is missing data. Because EHR data are not collected for any particular research question, it
will often be the case that measurements that are critical to answering the question will be unavailable in the
record of some patients. This, in turn, requires researchers to contend with the potential for selection bias and
compromised generalizability.
Towards addressing issues of missing data in an EHR, researchers could, in principle, appeal to a vast
statistical literature and use standard methods such as multiple imputation (MI), inverse-probability weighting
(IPW) or doubly-robust (DR) estimation. These methods, however, have generally been developed outside of the
EHR context. As such, they typically fail to acknowledge the complexity of the EHR data, in particular the many
decisions made by patients and health care providers that give rise to `complete data' in the EHR, known as
the data provenance. Because of the disconnect between this complexity and the settings for which most missing
data methods are developed, the application of standard missing data methods to EHR-based studies will often
fail to resolve selection bias and generalizability will remain compromised.
Unfortunately, in contrast to confounding bias, very little attention has been paid to developing methods for
missing data that are specifically tailored to the complexity of EHR-based studies. We will begin to address this
gap by developing, implementing and evaluating a suite of novel statistical tools, including: Aim 1: A
unified framework for robust causal inference in unmatched and matched EHR-based cohort studies with missing
confounder data; Aim 2: A formal, robust framework for causal inference in emulated target trials based on EHR
data; Aim 3: A novel blended analysis framework for missing data in EHR-based studies that combines MI and
IPW in an innovative and unique way; Aim 4: A novel double-sampling strategy for when the EHR data are
suspected to be missing-not-at-random.
The proposed aims are motivated by challenges the investigative team has faced in a series of EHR-based
studies of long-term outcomes among patients who have undergone bariatric surgery. Throughout this research,
we will use data from one of these studies, the DURABLE study, which has rich demographic and longitudinal
clinical information from three Kaiser Permanente health systems on ≈45,000 patients who underwent bariatric
surgery between 1997 and 2015, as well as on ≈1,636,000 non-surgical enrollees during that time period.