Improving missing data analysis in distributed research networks - ABSTRACT
Electronic health record (EHR) databases collect data that reflect routine clinical care. These databases are
increasingly used in comparative effectiveness research, patient-centered outcomes research, quality
improvement assessment, and public health surveillance to generate actionable evidence that improves patient
care. It is often necessary to analyze multiple databases that cover large and diverse populations to improve
the statistical power of the study or generalizability of the findings. A common approach to analyzing multiple
databases is the use of a distributed research network (DRN) architecture, in which data remain under the
physical control of data partners. Although EHRs are generally thought to contain rich clinical information, the
information is not uniformly collected. Certain information is available only for some patients, and only at some
time points for a given patient. There are generally two types of missing information in EHRs. The first is the
conventionally understood, obvious kind, in which some data fields (e.g., body mass index) are incomplete
because, for example, the clinician did not collect the information or the patient chose not to provide it.
The second is less obvious: the data field is not empty, but the recorded value may be incorrect because the
underlying information was never captured. For example, EHRs generally do not have complete data for care that
occurs in a different delivery system. A medical condition (e.g., asthma) may be coded as “no” when the true
value would have been “yes” had more complete data been available, for instance from claims data, because the
other delivery system would submit a claim to the patient’s health plan for the care provided. In other words, one
may incorrectly treat “absence of evidence” as “evidence of absence”. EHRs hold great promise, but we must
address outstanding methodological challenges inherent in these databases, particularly missing data.
Addressing missing data is more challenging in DRNs because missing data mechanisms differ across
databases. The specific aims of the study are: (1) Apply and assess missing data methods developed in
single-database settings to handle obvious and well-recognized missing data in DRNs; (2) Apply and assess
machine learning and predictive modeling techniques to address less obvious and under-recognized missing
data for select variables in DRNs; and (3) Apply and assess a comprehensive analytic approach that combines
conventional missing data methods and machine learning techniques to address missing data in DRNs. The
analytic methods developed in this project, including the extension of existing missing data methods to DRNs,
the innovative use of machine learning techniques to address missing data, and their integration with privacy-
protecting analytic methods, will have a direct impact on the design and analysis of future comparative
effectiveness and safety studies, and patient-centered outcomes research conducted in DRNs.