Robust methods for missing data in electronic health records-based studies - PROJECT SUMMARY Electronic health record (EHR) data represent a huge opportunity for cost-efficient clinical and public health research, especially when a randomized trial or a prospective observational study is not feasible or ethical. EHR systems, however, are typically developed to support clinical and/or billing activities. As such, substantial care is needed when using EHR data to address a particular scientific question. An important potential threat to validity in this setting is missing data. Because EHR data are not collected for any particular research question, measurements that are critical to answering the question will often be unavailable in the records of some patients. This, in turn, requires researchers to contend with the potential for selection bias and compromised generalizability. To address missing data in an EHR, researchers could, in principle, appeal to a vast statistical literature and use standard methods such as multiple imputation (MI), inverse-probability weighting (IPW) or doubly-robust (DR) estimation. These methods, however, have generally been developed outside of the EHR context. As such, they typically fail to acknowledge the complexity of EHR data, in particular the many decisions made by patients and health care providers that give rise to "complete data" in the EHR, known as the data provenance. Because of the disconnect between this complexity and the settings for which most missing data methods were developed, applying standard missing data methods to EHR-based studies will often fail to resolve selection bias, and generalizability will remain compromised. Unfortunately, in contrast to confounding bias, very little attention has been paid to developing methods for missing data that are specifically tailored to the complexity of EHR-based studies.
We will begin to address this gap by developing, implementing and evaluating a suite of novel statistical tools, including: Aim 1: A unified framework for robust causal inference in unmatched and matched EHR-based cohort studies with missing confounder data; Aim 2: A formal, robust framework for causal inference in emulated target trials based on EHR data; Aim 3: A novel blended analysis framework for missing data in EHR-based studies that combines MI and IPW; Aim 4: A novel double-sampling strategy for settings in which the EHR data are suspected to be missing-not-at-random. The proposed aims are motivated by challenges the investigative team has faced in a series of EHR-based studies of long-term outcomes among patients who have undergone bariatric surgery. Throughout this research, we will use data from one of these studies, the DURABLE study, which has rich demographic and longitudinal clinical information from three Kaiser Permanente health systems on ≈45,000 patients who underwent bariatric surgery between 1997 and 2015, as well as on ≈1,636,000 non-surgical enrollees during that time period.