Large-Scale Nationally Representative Patient-generated Health Data for Development of Generalizable
Data Science Methodologies for Precision Public Health

Racial and ethnic minorities, socioeconomically disadvantaged groups, and other underserved populations
experience disproportionate adverse health outcomes despite decades of research linking social
determinants (SDs) to variations in health outcomes. Many public
health approaches use population averages to create “one-size-fits-all” interventions to increase the probability
of achieving the best outcomes for the average person, but are limited by population heterogeneity in number,
magnitude, interplay, and amplification of SDs. Precision public health (PPH) emerged to use digital technologies
(DTs) to develop interventions targeting the unique needs of specific populations to improve health and reduce
disparities. Analysis of voluminous, precise, continuous, and longitudinal data generated by DTs holds great
promise for PPH as smartphones, Internet of Things devices, and wearable sensors become ubiquitous, generating
data on environment, transportation, geolocation, diet, exercise, social interactions, and daily activities. These
person-generated health data (PGHD) have unprecedented potential to add rich insight on everyday human
behaviors to traditional health research. Though clinical PGHD applications are in early stages, there is rapid
progress in development of digital indicators of health, offering virtually limitless potential. Because PGHD are
typically captured outside of controlled research settings, they suffer from challenges of non-traditional data that
impede their acceptance and use across the healthcare ecosystem. First, PGHD are vulnerable to input biases
as users of consumer DTs are a self-selected group. Second, PGHD suffer from poor internal data quality because
completeness varies widely, for reasons that are not always equally distributed across individuals (e.g.,
connectivity issues, battery life, user forgetfulness, user error). Together, input bias and poor data quality lead to
poor external validity, where analytics derived from PGHD are not generalizable to the broader population. The
objective of this partnership between the RAND Corporation and Evidation Health is to improve generalizability
of data science methods for PGHD, allowing for representation of all population groups, including the historically
underserved. We will accomplish this goal via three aims: (i) generate PGHD from a nationally representative
probability sample of Americans to understand the social distribution of user engagement with health DTs and
poor sleep health; (ii) develop a methodology that characterizes missing data within PGHD and selects
appropriate imputation strategies (existing and novel) optimized for reduction in model bias and
socio-demographic input disparities; and (iii) create a propensity-score-based statistical weighting methodology to
improve the effectiveness and applicability of methods derived from non-random, self-selected, and/or already
collected PGHD in underserved populations. This work will enable future identification and application of digital
indicators for health interventions that account for all populations, a critical first step for digital PPH.
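
To illustrate the general idea behind aim (iii), the following is a minimal sketch of propensity-score weighting, not the project's actual methodology. It assumes a self-selected "convenience" sample and a reference sample that stands in for the probability sample, with each record reduced to a tuple of categorical covariates. The function names (propensity_weights, weighted_mean) and the toy values are hypothetical. The propensity of appearing in the convenience sample is estimated per covariate cell from the stacked samples, which is equivalent to fitting a saturated model, and each convenience record receives the inverse odds as its weight. The sketch also assumes the reference sample is unweighted; a real probability panel would carry design weights.

```python
from collections import Counter


def propensity_weights(convenience, reference):
    """Return one pseudo-weight per record in the convenience sample.

    Each record is a tuple of categorical covariates (a "cell").
    The propensity p of being in the convenience sample, given the cell,
    is estimated from the stacked samples as n_conv / (n_conv + n_ref).
    The weight is the inverse odds (1 - p) / p, rescaled to average 1.
    """
    conv_counts = Counter(convenience)
    ref_counts = Counter(reference)

    raw = []
    for cell in convenience:
        n_conv = conv_counts[cell]
        n_ref = ref_counts.get(cell, 0)
        p = n_conv / (n_conv + n_ref)  # estimated propensity for this cell
        raw.append((1 - p) / p)        # inverse odds of self-selection

    mean_raw = sum(raw) / len(raw)
    if mean_raw == 0:
        raise ValueError("no covariate cell overlaps with the reference sample")
    return [w / mean_raw for w in raw]


def weighted_mean(values, weights):
    """Weighted average of values using the supplied weights."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Toy, made-up values: the convenience sample over-represents younger users.
convenience = [("younger",)] * 8 + [("older",)] * 2
reference = [("younger",)] * 5 + [("older",)] * 5
sleep_hours = [7.0] * 8 + [6.0] * 2  # hypothetical outcome per convenience record

weights = propensity_weights(convenience, reference)
unweighted = sum(sleep_hours) / len(sleep_hours)
reweighted = weighted_mean(sleep_hours, weights)
print(unweighted, reweighted)
```

In this toy case the unweighted mean reflects the 80/20 age mix of the self-selected users, while the reweighted mean moves toward the value implied by the 50/50 mix in the reference sample. Cells present in the reference sample but absent from the convenience sample cannot be recovered by weighting alone, which is one reason data collection in underrepresented groups, as in aim (i), matters.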