Project Summary/Abstract
The adoption of electronic health records (EHR) in healthcare has resulted in a hugely promising source of
data for public health and medical research. Because EHR include rich data on large populations at relatively
low cost, many researchers have turned to observational studies using EHR as an alternative to conducting
randomized studies that are often prohibitively expensive and time-consuming. However, EHR data are not
collected with research questions in mind, meaning data necessary for statistical analysis are frequently missing.
Two commonly utilized study designs in observational settings are the target trial emulation and matched
cohort designs. A critical component in each of these study designs is determining the population of patients
eligible for inclusion in the study. Missing data in variables that define eligibility criteria thus present a major
challenge for researchers. In practice, patients with incomplete eligibility data are frequently excluded from analysis,
despite the possibility of selection bias, where subjects with observed eligibility data may be fundamentally
different from excluded subjects. Few works have acknowledged that missing eligibility data pose the risk of
selection bias. What little work exists does not consider the problem in the above study designs, examine diverse
types of outcomes, or provide expansive guidelines on the clinical settings in which this bias arises.
An inverse probability weighting (IPW) framework to address selection bias will be developed in a manner
tailored towards sequential target trial emulations examining time-to-event endpoints. Estimation and inferential
procedures under this framework will be established, and methods will be evaluated on a complex simulation
infrastructure that adequately captures the intricacies of EHR data. This will enable detailed characterization of
clinical settings where bias arises in practice. IPW fails to produce consistent estimates when weight models
are misspecified. Influence function-based estimators will be derived, which will be robust to forms of model
misspecification and allow for estimation via flexible machine learning methods. This class of estimators will be
developed for the matched cohort design when interest lies in continuous or longitudinal outcomes.
The methods described in these aims will be applied to EHR-derived data that include long-term health
outcomes among 45,000 individuals who underwent bariatric surgery between 1997 and 2015, and over 1.6
million non-surgical patients eligible for bariatric surgery during that time frame. Specifically, this research will
answer open questions about the efficacy and safety of bariatric surgery in the treatment of patients with obesity
and type 2 diabetes, and will consider how rates of micro- and macrovascular complications associated with
diabetes differ between patients undergoing bariatric surgery and those who do not. Robust software will be developed
that provides researchers with valid, practical, and user-friendly tools for the identification, characterization, and
control of selection bias in EHR-based research.