Penalized mixture cure models for identifying genomic features associated with outcome in acute myeloid leukemia - Molecular features associated with time-to-event outcomes, such as overall or disease-free survival, may be prognostically relevant or may represent potential therapeutic targets. Therefore, analyzing data from high-throughput genomic assays together with clinical follow-up data has been of growing interest. The Cancer Genome Atlas (TCGA) Project has collected baseline demographic data, clinical characteristics, and follow-up data for 11,125 patients across 32 different cancer types, and the corresponding tissue samples were processed to examine SNPs, copy number, methylation, miRNA expression, and mRNA expression. Because the number of variables (P) exceeds the sample size (N), one strategy frequently employed when associating molecular features with survival data is to fit univariable Cox proportional hazards (PH) models and then adjust for multiple hypothesis tests using a false discovery rate approach. However, most chronic conditions and diseases, including cancer, are likely caused by multiple dysregulated genes or mutations. It is therefore critical to fit multivariable models in the presence of a high-dimensional covariate space. Traditional statistical methods cannot be used when the number of features exceeds the sample size (i.e., P > N), whereas penalized methods perform automatic variable selection and accommodate the P > N scenario. Penalized approaches including LASSO, smoothly clipped absolute deviation (SCAD), adaptive LASSO, and Bayesian LASSO have all been extended to Cox's PH model for handling high-dimensional covariate spaces. However, when modeling survival or other time-to-event outcomes, the Cox PH model assumes that all subjects will eventually experience the event of interest, an assumption that is violated when a subset of subjects is cured. When a subset of subjects in the data is cured, mixture cure models should be fit instead.
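For orientation, the standard mixture cure formulation can be written as follows. The notation is an assumption introduced here for illustration and is not taken from the application: z and x denote covariates for the incidence (cured versus susceptible) and latency (time to event among susceptibles) components, \pi(z) is the probability of being susceptible, S_u is the survival function among susceptibles, \delta_i is the event indicator, and \lambda_b, \lambda_\beta are tuning parameters of an illustrative LASSO-type penalty.

```latex
% Population survival: a mixture of cured subjects (who never fail)
% and susceptible subjects with survival function S_u.
S_{\mathrm{pop}}(t \mid x, z) = 1 - \pi(z) + \pi(z)\, S_u(t \mid x),
\qquad
\pi(z) = \frac{\exp(b_0 + z^\top b)}{1 + \exp(b_0 + z^\top b)}

% Observed-data likelihood for (t_i, \delta_i), i = 1, \dots, N,
% with f_u the density among susceptibles.
L(b_0, b, \beta) = \prod_{i=1}^{N}
  \bigl[\pi(z_i)\, f_u(t_i \mid x_i)\bigr]^{\delta_i}
  \bigl[1 - \pi(z_i) + \pi(z_i)\, S_u(t_i \mid x_i)\bigr]^{1 - \delta_i}

% One possible penalized objective (LASSO on both components).
\ell_{\mathrm{pen}}(b_0, b, \beta) = \log L(b_0, b, \beta)
  - \lambda_b \lVert b \rVert_1 - \lambda_\beta \lVert \beta \rVert_1
```

Because 1 - \pi(z) > 0 for cured subjects, S_pop(t) plateaus above zero as t grows, which is exactly the behavior a Cox PH model cannot represent.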
Although mixture cure models have been described for traditional settings where the number of samples exceeds the number of covariates, few variable selection methods and no methods for high-dimensional model fitting currently exist for mixture cure models. Therefore, this project will overcome a critical barrier to progress in this field by developing penalized parametric and semi-parametric mixture cure models applicable to high-dimensional datasets. The specific aims of this application are to: (1) develop penalized parametric mixture cure models for high-dimensional datasets; (2) develop a penalized semi-parametric proportional hazards mixture cure model for high-dimensional datasets; and (3) identify molecular features associated with cure and survival using our large, unique AML dataset from the Alliance for Clinical Trials in Oncology and assess the robustness of the findings using AML datasets from Gene Expression Omnibus and The Cancer Genome Atlas project. For Aims 1 and 2, we will characterize the performance of the methods using extensive simulation studies, develop software, and distribute R packages through CRAN. This research will fill a critical gap, as there are currently no mixture cure models for high-dimensional data. We anticipate that applying our methods to our AML data will enhance the existing risk stratification systems used in daily clinical practice to determine treatment intensity and modality.
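To make the idea of a penalized parametric mixture cure model more concrete, the following is a minimal sketch of one possible objective function. It assumes a Weibull proportional hazards latency component, a logistic incidence component, and LASSO penalties on both coefficient vectors. The function name, argument names, and the choice of Weibull and LASSO are illustrative assumptions, not the methods or software proposed in the application.

```python
import math


def _dot(a, b):
    """Inner product of two equal-length sequences."""
    return sum(ai * bi for ai, bi in zip(a, b))


def penalized_neg_loglik(times, events, X, Z, alpha, lam, beta, b0, b,
                         pen_beta=0.0, pen_b=0.0):
    """L1-penalized negative log-likelihood of a Weibull mixture cure model.

    Hypothetical illustration only.
    times, events : observed times (> 0) and event indicators (1 = event)
    X, Z          : per-subject latency and incidence covariate vectors
    alpha, lam    : Weibull shape and scale (both > 0)
    beta          : latency coefficients; b0, b : incidence intercept, slopes
    pen_beta, pen_b : LASSO tuning parameters (assumed names)
    """
    loglik = 0.0
    for t, d, x, z in zip(times, events, X, Z):
        eta = _dot(x, beta)
        # Cumulative hazard among susceptibles: H(t) = lam * t^alpha * exp(eta)
        cum_haz = lam * t ** alpha * math.exp(eta)
        # Probability of being susceptible (uncured)
        pi = 1.0 / (1.0 + math.exp(-(b0 + _dot(z, b))))
        if d == 1:
            # Event: subject is susceptible and fails at t -> pi * f_u(t)
            log_haz = math.log(alpha * lam) + (alpha - 1.0) * math.log(t) + eta
            loglik += math.log(pi) + log_haz - cum_haz
        else:
            # Censored: either cured, or susceptible and still event-free
            loglik += math.log(1.0 - pi + pi * math.exp(-cum_haz))
    penalty = (pen_beta * sum(abs(v) for v in beta)
               + pen_b * sum(abs(v) for v in b))
    return -loglik + penalty
```

Because the L1 penalty is not differentiable at zero, minimizing an objective of this kind generally calls for a method designed for such penalties (for example, coordinate descent or a proximal gradient scheme) rather than a generic smooth optimizer.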