PROJECT SUMMARY
Nationwide adoption of electronic health records (EHRs) has led to the increasing availability of large clinical
datasets. With statistical modeling and machine learning, these datasets have been used in a wide range of
applications, including diagnosis, decision support, cost reduction, and personalized medicine. However,
because the same patient may be treated at multiple health care institutions, data from any single EHR might
not contain that patient's complete medical history, and critical events may be missing entirely. A common
approach to addressing this problem is to apply data checks that filter the EHR for patients whose data appear
to be more “complete”. Examples of filters include requiring at least one visit per year or ensuring that age, sex,
and race are all recorded.
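As an illustrative sketch only (Python, with hypothetical table and column names that are not the schema of any dataset in this project), two such filters reduce to simple patient-level predicates:

    import pandas as pd

    def passes_completeness_filters(patients: pd.DataFrame,
                                    visits: pd.DataFrame) -> pd.Series:
        """Flag patients who (1) have at least one visit in every calendar
        year between their first and last visit and (2) have age, sex, and
        race recorded. All column names here are hypothetical."""
        # Demographic filter: age, sex, and race must all be non-missing.
        demo_ok = patients[["age", "sex", "race"]].notna().all(axis=1)

        # Visit-regularity filter: the number of distinct visit years must
        # equal the length of the span from first to last visit year.
        year = pd.to_datetime(visits["visit_date"]).dt.year
        per_patient = year.groupby(visits["patient_id"]).agg(["min", "max", "nunique"])
        full_span = per_patient["nunique"] == (per_patient["max"] - per_patient["min"] + 1)
        visit_ok = patients["patient_id"].map(full_span).fillna(False).astype(bool)

        return demo_ok & visit_ok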
However, in a previous study using EHR data from seven institutions, we showed that these filters can greatly
reduce the sample size and introduce unexpected biases, both by selecting sicker patients who seek care more
often and by changing the demographics of the resulting cohorts. This project extends this
prior research by implementing an expanded set of data completeness filters and testing their accuracy and
potential biases using a combination of national claims data and EHR data from dozens of hospitals and
health care centers across the country. This will enable us to understand how data completeness varies across
different EHRs and to quantify the tradeoffs of different approaches to correcting for gaps in patients' records. First,
we will develop and measure the accuracy of data completeness filters using national claims data. This provides
a “gold standard” of longitudinal data where patients' complete medical histories are known during the periods
in which they were enrolled in the insurance plan. After partitioning the data by provider groups to model gaps
in EHR data, we will test how well data completeness filters, individually and in combined machine learning
models, select patients with fewer gaps.
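One way to picture this step, as a sketch under stated assumptions (hypothetical claims fields, scikit-learn as one possible toolkit, and keeping each patient's most frequent provider group as the simulated single-institution view):

    import pandas as pd
    from sklearn.linear_model import LogisticRegression

    def simulate_ehr_view(claims: pd.DataFrame) -> pd.DataFrame:
        """Keep only each patient's most frequent provider group, mimicking
        the partial view a single institution's EHR would contain.
        Assumes hypothetical columns patient_id and provider_group."""
        top = claims.groupby("patient_id")["provider_group"] \
                    .agg(lambda s: s.mode().iat[0])
        return claims[claims["provider_group"] == claims["patient_id"].map(top)]

    def gap_fraction(full: pd.DataFrame, observed: pd.DataFrame) -> pd.Series:
        """Fraction of each patient's claims missing from the simulated view."""
        n_full = full.groupby("patient_id").size()
        n_obs = observed.groupby("patient_id").size() \
                        .reindex(n_full.index, fill_value=0)
        return 1.0 - n_obs / n_full

    # X holds one row per patient with the binary outputs of the individual
    # completeness filters (computed on the simulated view); y marks patients
    # whose gap_fraction falls below a chosen cutoff. A combined model then
    # weighs the filters jointly, e.g.:
    #     model = LogisticRegression().fit(X, y)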
We will then test whether the filters introduce biases by selecting sicker patients (more diagnoses, more
visits, etc.) or by changing their demographic characteristics (age, sex, and zip code).
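A bias check of this kind can be sketched as a comparison of the two cohorts a filter creates (again with hypothetical column names; Welch's t-test and the chi-squared test stand in for whatever statistics the final analysis plan specifies):

    import pandas as pd
    from scipy import stats

    def cohort_shift(patients: pd.DataFrame, passes: pd.Series) -> pd.DataFrame:
        """Compare patients who pass a completeness filter against those who
        fail it, on illness burden and demographics (hypothetical columns)."""
        rows = []
        for col in ["n_diagnoses", "n_visits", "age"]:
            kept, dropped = patients.loc[passes, col], patients.loc[~passes, col]
            _, p = stats.ttest_ind(kept, dropped, equal_var=False)  # Welch's t
            rows.append({"variable": col, "mean_kept": kept.mean(),
                         "mean_dropped": dropped.mean(), "p_value": p})
        # Categorical shift (e.g., sex): chi-squared test of independence.
        chi2, p, _, _ = stats.chi2_contingency(pd.crosstab(passes, patients["sex"]))
        rows.append({"variable": "sex", "mean_kept": None,
                     "mean_dropped": None, "p_value": p})
        return pd.DataFrame(rows)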
Then, we will test the filters on EHR data, first at a single large medical center and then across a national
network of 57 institutions representing different geographic regions, patient populations, numbers of years of
data, and types of health care facilities. We will evaluate the filters by measuring whether they improve the
performance of a machine learning model for predicting hospital admissions.
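As a sketch of this evaluation (assuming a generic gradient-boosted classifier and AUROC as the metric; the project's actual model and metrics may differ):

    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split

    def admission_auc(X, y):
        """Held-out AUROC of an admission-prediction model on one cohort."""
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                                  random_state=0, stratify=y)
        model = GradientBoostingClassifier().fit(X_tr, y_tr)
        return roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])

    # A filter helps if the model does better on the cohort it selects
    # (passes is the boolean filter mask from the earlier sketch):
    #     auc_all      = admission_auc(X, y)
    #     auc_filtered = admission_auc(X[passes], y[passes])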
Our ultimate goals are to (a) help researchers balance the need for complete data against the biases it may
introduce into their models, and (b) help them predict how well models trained on one EHR dataset might
perform on other EHRs with different data completeness profiles.