Cancer is a major morbidity and mortality burden throughout the world. While much progress has been
made, the elimination of cancer has not yet been achieved. In the currently funded grant, we have developed
statistical methods for genome-wide association analysis of cancer and studied cancer by the site of origin.
However, even within a site, cancer can have distinct mutational profiles across patients. Pooling all cancer
cases occurring at one site as one disease may miss important clinical and etiological insights. Recent
technological advances have made it possible to characterize somatic mutations in great detail in large numbers
of tumors, providing a unique opportunity to study tumor heterogeneity. The objective of this competitive
renewal is to continue our statistical methods development for association analyses of tumor heterogeneity
with clinical outcomes, and for studying the underlying genetic and environmental etiology.
There are challenges in analyzing somatic mutation data. First, a somatic mutation may exist in only a
subset of a patient's tumor cells, a phenomenon known as intra-tumor heterogeneity. Although our application focuses on
tumor heterogeneity across patients, intra-tumor heterogeneity can also affect clinical outcomes, and
important insights could be missed if it is not accounted for. The goal of Aim 1 is to develop statistical
methods to account for intra-tumor heterogeneity when assessing the association of somatic mutations with
clinical outcomes. Second, it is of great interest to discover germline-somatic mutation links; however, although
tumor studies are considerably larger than before owing to technological advances, the power for discovering
such links remains limited because of moderate genetic effects and the burden of correcting for multiple
comparisons when testing millions of variants. The goal of Aim 2 is to develop novel screening strategies for
prioritizing genetic variants in testing genome-wide association with tumor heterogeneity. We will achieve
optimal power by using the weighted hypothesis testing framework, allowing for correlated genetic variants and
continuous screening statistics. Third, tumor blocks can usually be retrieved from only a
subset of cases, so tumor sequencing data are available only for this subset. Meanwhile, extensive risk
factor information has already been collected for the larger study. The goal of Aim 3 is to develop a robust and
efficient approach that incorporates summary statistics from the larger study to characterize the
effects of genetic and environmental risk factors on the risk of developing cancer with specific tumor features.
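
As an illustration of the weighted hypothesis testing idea referred to in Aim 2, the following is a minimal sketch of a weighted Bonferroni procedure, not the method proposed in this application. It assumes each variant's weight is derived from a screening statistic that is independent of the association test under the null hypothesis; the function name `weighted_bonferroni` and the example numbers are hypothetical.

```python
def weighted_bonferroni(pvalues, weights, alpha=0.05):
    """Weighted Bonferroni test (illustrative sketch only).

    Variant i is rejected when p_i <= alpha * w_i / m, where the weights
    are rescaled to average 1 so that the family-wise error rate stays
    at or below alpha. Assumes the weights do not depend on the p-values
    under the null hypothesis.
    """
    m = len(pvalues)
    if m == 0 or len(weights) != m:
        raise ValueError("pvalues and weights must be non-empty and of equal length")
    if any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative")
    total = sum(weights)
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    # Rescale so the weights average to 1 (i.e., sum to m).
    scaled = [w * m / total for w in weights]
    return [p <= alpha * w / m for p, w in zip(pvalues, scaled)]


# Hypothetical example: four variants, the first prioritized by screening.
pvalues = [0.02, 0.001, 0.3, 0.6]
weights = [2.4, 0.8, 0.4, 0.4]

weighted = weighted_bonferroni(pvalues, weights)
unweighted = weighted_bonferroni(pvalues, [1, 1, 1, 1])
print(weighted)    # [True, True, False, False]
print(unweighted)  # [False, True, False, False]
```

In this toy example the up-weighted first variant passes its relaxed threshold (0.03 instead of 0.0125), while the down-weighted variants face stricter thresholds. This is the trade-off that a screening strategy would aim to exploit.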
The methods will be applied to the Genetics and Epidemiology of Colorectal Cancer Consortium
(GECCO, PI: Ulrike Peters; Lead Biostatistician: Li Hsu), which includes over 125,000 colorectal cancer cases
and controls, all with GWAS data, and, additionally, tumor sequencing data for 7,000 tumors. As our methods are also
applicable to other cancer studies, we will implement them in computationally efficient and user-friendly
software packages and disseminate them to the community through R/CRAN, R/Bioconductor, or GitHub.