ABSTRACT
Precision medicine aims to accurately classify patients to improve diagnosis, intervention
selection, and prognosis. The All of Us Research Program (AoURP) collects a diverse array of
data types from participants, including surveys, electronic health records (EHRs), physical
measurements, wearable devices, and biosamples, offering valuable insights into health
trajectories. However, certain aspects of a patient’s life remain unrepresented in the collected
data, which can limit the accuracy of research and care. To address this gap, we propose the
creation of the All of Us Center for Linkage and Acquisition of Data (CLAD) to supplement
existing data sources using passive data streams and deploy integration strategies to "put the
patient back together again." This team brings together collective experience leading large
initiatives involving data acquisition, linkage, harmonization, quality assurance, pipelines and
platforms, governance, and security.
We will design and implement a data collection, linkage, and integration strategy that lays a
foundation for a variety of AoURP data linkages for identified and de-identified data
integration, including person-level linkages to mortality, residential history, and
administrative claims data, as well as geocoded data pipelines that enable linkages with the Environmental
Justice Index. The CLAD will acquire and process new data linkages and geocoded data in a
cloud-based Data Linkage Platform (DLP), guided by our experience formulating
researcher-ready datasets with scientific utility. Our CLAD team will perform data quality
assurance, repair, and standardization checks to ensure the accuracy and robustness of data-driven
research. This endeavor will align data with interoperability standards and clinical terminologies,
extend them where necessary, and create a data quality dashboard that tracks every data change,
along with data health check (Data Quality) reports for each of the sources and sites. We will also explore new
methods of clinical data acquisition from health information networks (HINs) to mitigate data missingness, with a focus on
underrepresented populations by comparing AoURP participant-linked ambulatory EHR data
from OCHIN, which includes Medicaid and uninsured patients, with EHR data from health
systems served by Datavant. Diverse CLAD sources and novel analytical methods, such as
probabilistic models, will be used to reveal patterns of care and potential interventions for
communities underrepresented in biomedical research.