Project Summary/Abstract
Phenotypic variability across demographically diverse populations is driven by environmental factors. The
overall goal of this proposal is to deploy data science approaches to drive discovery of associations between
exposures (E) and phenotypes (P) in demographically diverse populations. We lack data science methods to
associate, replicate, and prioritize exposure variables of the exposome (E) in phenotypes (P) and disease
incidence (D), which are required for the delivery of precision medicine. Observational studies are fraught with four
unsolved data science challenges. First, (1) E-based studies are limited to associating a few hypothesized
exposure-phenotype pairs (E-P) at a time, leading to a fragmented literature of environmental associations. Second,
machine learning (ML) approaches for feature selection and prediction hold promise; however, (2) most extant E-based
cohorts contain missing data, challenging the use of ML to detect complex E-P associations. Third, (3) biases,
such as confounding and study design, influence associations and hinder translation. Fourth, (4) there are few
well-powered data resources that systematically document longitudinal E-P and E-D associations across
massive precision medicine cohorts. It is a challenge to systematically associate a number of exposures with multiple
phenotypes and replicate these associations across cohorts (Aim 1). The “vibration of effects”, or the degree
to which associations change as a function of study design (e.g., analytic method, sample size) and model
choice, is a hidden bias in observational studies (Aim 2). Finally, an outstanding question is the degree to which
environmental differences lead to health disparities (Aim 3). To address these challenges and gaps, we propose three aims.
Aim 1: Develop and test machine learning methods to associate multiple environmental exposure indicators with
multiple phenotypes: EP-WAS. We hypothesize that exposures will explain a significant amount of variation in
phenotype in populations, and we will deposit all data and models in a novel EP-WAS Catalog. Aim 2: Quantitate
how study design influences associations between exposure biomarkers and phenotype. We will scale up,
extend, and test a method called “vibration of effects” (VoE) to measure how study criteria influence the
stability of associations (how reproducible associations are as a function of analytic choice). Aim 3: Leverage
EP-WAS and VoE to disentangle biological, demographic, and environmental influences on phenotypic
disparities in hypercholesterolemia. We will deploy EP-WAS and VoE packaged libraries in the largest cohort
study to partition phenotypic variation across demographic groups in factors for hypercholesterolemia. We will
equip the biomedical community with data science approaches for robust data-driven discovery and
interpretation of exposure-phenotype factors in observational datasets, required for the identification of
environmental health disparities. For the first time, investigators will ascertain the collective role of the
environment in heart disease at scale just in time for the All of Us program.