An integrated, multi-cohort approach for cancer health disparities and risk assessment - SUMMARY

Of the 8,314 drug indications that entered Phase II clinical trials from 2011–2020, 85% failed. This failure rate could have been greatly reduced if each drug had been administered only to the people most likely to respond. At the same time, more than a million women per year in the US alone receive unnecessary treatment due to breast cancer overdiagnosis. Meanwhile, 30–50% of patients with non-small cell lung cancer (NSCLC) develop recurrence and die after standard curative resection, suggesting that many of them would have benefited from more aggressive treatment at an early stage. These problems exist not only in breast and lung cancers but also in many other cancers. In all of the cases outlined above, the problem lies in our inability to distinguish among populations at risk (low-risk vs. high-risk), patient subgroups (drug resistance vs. response), and disparities in each cancer. The main goal of this project is to establish a new methodology to identify populations experiencing cancer health disparities and their associated risk factors, and to develop a Bayesian model that accurately predicts patient outcomes. The hypothesis is that populations at risk and their associated risk factors can only be identified by an integrative analysis of multiple large patient cohorts with different demographic distributions and geographic locations. This is because findings and conclusions obtained from small, controlled studies often do not generalize to broader populations, leading to unsuccessful randomized clinical trials and inaccurate diagnoses. The innovation of this work is a novel approach that integrates clinical variables with pathway expression (computed from multi-omics data using a method proposed by the parent U01 project). Fundamental to this approach is the capability to explain why patients with similar molecular profiles (mRNA, miRNA, methylation, mutation, etc.)
but different backgrounds can differ greatly in cancer evolution and response to treatment. The goal of this project will be achieved through three specific aims: 1) identify population subgroups and associated risk factors through a comprehensive analysis of millions of patient records from the Million Veteran Program, NCI SEER, All-of-Us, UK Biobank, cBioPortal, TCGA/GDC, and NCBI GEO; 2) combine key risk factors (clinical variables and background) with pathway activities in a Bayesian model to accurately predict patient outcomes; and 3) validate the proposed methods by leveraging large patient cohorts from public repositories, as well as data available at UPMC Hillman Cancer Center, Houston Methodist Hospital, David Grant USAF Medical Center, MD Anderson Cancer Center, Viet-Duc Hospital (Vietnam), and Vingroup Big Data Institute (Vietnam). The significance of the proposed work lies in its potential to provide new methods and tools that advance research in cancer health disparities and improve cancer management and prognosis. The methods will be made available through a CRAN R package.