Abstract
With recent advancements in screening, diagnosis and treatment, many diseases are identified at an early stage and
a significant proportion of patients suffering from these diseases are clinically cured. That is, these patients will never
experience recurrence, metastasis or death due to the primary disease. Among patients with early-stage diseases,
it is clinically important to identify cured patients early, based on their pre-treatment characteristics, so that these
patients can be protected from the additional risks of high-intensity treatments. Similarly, identifying uncured patients
early is also important so that they can be treated in a timely manner, before their diseases progress to advanced stages for which
therapeutic options are rather limited. Such identification is also crucial for clinical trials to develop effective adjuvant
therapies. Thus, there is an immense need for a predictive model that can take patient survival data and any available
information on patient-related characteristics (or features) as simple inputs and predict the cured or uncured status of
patients with high accuracy. Existing state-of-the-art models capable of such prediction have several drawbacks
that make them ill-suited to the growing demands of advanced applications. These include a lack of biological
motivation and restrictive model assumptions, non-robustness and global convergence problems with the associated
estimation procedures, an inability to efficiently handle high-dimensional data, which reduces the accuracy of
cured/uncured predictions, and the unavailability of the models and the associated methods as ready-to-use software
packages, with most of them requiring substantial programming experience for successful implementation. The proposed
research seeks to address the aforementioned issues by developing a next-generation model, with reduced
complexity and lower computational cost, for highly accurate prediction of cured or uncured status in the presence of
high-dimensional data. The novel idea here is to integrate machine learning with a modern predictive statistical model
to capture complex patterns in the data. We hypothesize that capturing such complex patterns will greatly improve
the predictive accuracy of cure and will also result in improved prediction of the survival distribution of the uncured
patients. In particular, the following specific aims are proposed. Aim 1: To develop a novel support vector machine-based
predictive model that can capture the patient population as a mixture of cured and uncured patients; Aim 2: To
develop new computationally efficient estimation and feature selection methods that can handle high-dimensional data;
Aim 3: To develop a new method for validating the proposed model using existing patient survival data and develop an R
software package for free and non-profit use. Successful completion of this research will aid treatment assignment
and the development of effective adjuvant therapies for the overall benefit of patients.