Molecular features associated with time-to-event outcomes, such as overall or disease-free survival, may be
prognostically relevant or potential therapeutic targets. Therefore, analyzing data from high-throughput genomic
assays with clinical follow-up data has been of growing interest. The Cancer Genome Atlas (TCGA) Project has
collected baseline demographic and clinical characteristics and follow-up data for 11,125 patients across 32 different
cancer types, and the corresponding tissue samples were processed to examine SNPs, copy number, methylation,
miRNA expression, and mRNA expression. Because the number of variables (P) exceeds the sample size (N),
one strategy frequently employed when associating molecular features with survival data is to fit univariable
Cox proportional hazards (PH) models followed by adjustment for multiple hypothesis tests using a false discovery
rate approach. However, most chronic conditions and diseases, including cancer, are likely caused by multiple
dysregulated genes or mutations. It is therefore critical to fit multivariable models in the presence of a high-
dimensional covariate space. Traditional statistical methods cannot be used when the number of features exceeds
the sample size (i.e., P > N), whereas penalized methods perform automatic variable selection and accommodate
the P > N scenario. Penalized approaches including LASSO, smoothly clipped absolute deviation (SCAD),
adaptive LASSO, and Bayesian LASSO have all been extended to Cox's PH model for handling high-dimensional
covariate spaces. However, when modeling survival or other time-to-event outcomes, the Cox PH model assumes
that all subjects will eventually experience the event of interest, an assumption that is violated when a subset of
subjects is cured. In that setting, mixture cure models should be fit instead. Although mixture
cure models have been described for traditional settings where the number of samples exceeds the number
of covariates, few variable selection methods and no methods for high-dimensional model fitting currently
exist for mixture cure models. Therefore, this project will overcome a critical barrier to progress in this field
by developing penalized parametric and semi-parametric mixture cure models applicable to high-dimensional
datasets. The specific aims of this application are to: (1) develop penalized parametric mixture cure models
for high-dimensional datasets; (2) develop a penalized semi-parametric proportional hazards mixture cure
model for high-dimensional datasets; and (3) identify molecular features associated with cure and survival using
our large, unique acute myeloid leukemia (AML) dataset from the Alliance for Clinical Trials in Oncology and assess
the robustness of findings using AML datasets from Gene Expression Omnibus and The Cancer Genome Atlas project.
For aims (1) and (2) we will characterize the performance of the methods using extensive simulation studies, develop
software, and distribute R packages to CRAN. This research will fill a critical gap as there are currently no mixture cure
models for high-dimensional data. We anticipate that applying our methods to our AML data will enhance the existing
risk stratification systems used in daily clinical practice to determine treatment intensity and modality.