Abstract
Observational studies based on big data from electronic medical records (EMRs) have been conducted recently
in many areas of medical research [1, 2]. These studies provide high-impact information on rare events or rare
diseases that otherwise would not be available in non-EMR studies with much smaller sample sizes. Many EMR
studies have also been done in ophthalmology [3], especially using data from the Intelligent Research in Sight
(IRIS®) Registry [4]. These studies face several challenges that could affect the validity of their results. First,
EMR data may not fully represent the background population, especially minority groups.
This may bias disease estimates for underrepresented groups and lead to invalid conclusions if the observed
differences are not taken into account during data analysis. Second, when estimating prevalence
and incidence of target diseases and their associated risk factors, either the entire cohort without the primary
disease or a group of healthy individuals with similar sample sizes could serve as the control group (IRIS
contains over 70 million records). Optimal sampling methods that adjust for related risk factors, which are
currently unavailable, are essential for selecting equally informative study groups with much smaller sample
sizes and higher computational efficiency. Third, in about half of the published IRIS studies to date, the primary outcomes are
rare events. Moreover, when we combine classes from categorical variables, unbalanced subgroups
with much smaller sample sizes often appear, which may lead to unreliable estimates with much wider
confidence intervals. The results become even less trustworthy when the recombination involves a rare
disease outcome. Other big data EMR studies are likely to face the same challenges. In addition,
these issues may have significant financial consequences: e.g., lengthy running times are costly when
the EMR is hosted in a secure cloud environment, and the situation becomes even worse when statistical software
crashes after long runs without giving any meaningful results. To address these challenges, we propose,
through a collaborative effort between Wills Eye Hospital and the University of Connecticut that combines
theoretical and applied statistical expertise, to develop and evaluate novel subsampling and optimal
analysis methods that, to the best of our knowledge, do not yet exist. This application proposes to achieve
the following aims: 1) Derive optimal subsampling probabilities, invariant to the measurement scales of
numerical covariates, for both rare and non-rare events data with categorical and numerical covariates; 2)
Design an effect-balancing approach for covariates with rare category combinations that better includes
underrepresented subgroups, prevents potential disparities in analysis results, and protects against health
disparities; and 3) Develop optimal
sampling strategies to adjust for selection bias in EMR studies. Most importantly, we will create user-friendly
software packages for optimal subsampling that practitioners can apply in similar settings in medical
research.