Assessing feasibility of gastric cancer screening in the US - ABSTRACT
In the United States, gastric cancer (GC) is commonly diagnosed at an advanced stage, with a poor overall 5-year survival of 31%. Disparities in GC incidence and mortality affect every minority group and exceed those seen in lung, breast, and colon cancer. GC incidence is also rising in younger patients, with a 10% increase over the past decade. Thus, there is an urgent and unmet need for an effective strategy to detect GC early and improve clinical outcomes for patients with this disease. GC incidence is equivalent to that of cervical cancer (7 per 100,000) and approximately 25% that of colorectal cancer, both cancers for which screening is established. Early detection affects survival: 5-year survival is 95-99% in localized GC, compared with 7% in metastatic GC. Given the relatively low prevalence of GC in the general population, a novel targeted approach of screening asymptomatic high-risk individuals, rather than universal screening, would be required to form the basis of a GC screening program in the US. Known risk factors for GC include sex, age, race, ethnicity, family history, smoking, and H. pylori infection, variables readily available in the electronic health record (EHR). To date, models that predict GC risk have mainly been studied in Asia, demonstrate limited accuracy, and are not generalizable to the US population. Machine learning (ML) methods have been explored in cancer prediction modeling, but no studies have applied an EHR-based ML model to an average-risk population to identify high-risk individuals appropriate for GC screening. A future randomized controlled trial (RCT) assessing the effectiveness of GC screening in a high-risk population would require robust preliminary data. Mathematical modeling is commonly used to support clinical decision-making because it can simultaneously consider the impact of varied factors (e.g., patient characteristics, adverse events, cost) on the risk/benefit ratio of treatment.
The overall goal of this study is to establish the feasibility of GC screening in diverse high-risk US populations. The Specific Aims are to: 1) Develop and validate a predictive model, based on structured and unstructured EHR data, that accurately identifies individuals at high risk for GC; and 2) Develop a mathematical model of the natural history, diagnosis, treatment, and outcomes of GC and project the potential benefits, harms, and cost-effectiveness of GC screening in high-risk individuals. To achieve these Aims, we will refine our preliminary EHR-based ML model using approximately 5 million patient records from diverse populations at the Cleveland Clinic and Columbia University. We will externally validate the prediction model using an independent sample of ~3 million patients from the University of California, San Diego. We will then develop a mathematical model to determine the optimal use of GC screening by evaluating benefits versus harms and cost-effectiveness. Results from these studies will address a major unmet health need and improve survival in at-risk populations for a cancer that is preventable and curable.
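As an illustration of the kind of cost-effectiveness comparison described in Aim 2, the following minimal sketch computes an incremental cost-effectiveness ratio (ICER) between two strategies. The abstract does not specify the model structure or outcome measure, so the use of quality-adjusted life-years (QALYs), the `Strategy` class, and the `icer` function are assumptions made for illustration only, and the numbers in the usage note are arbitrary placeholders rather than study results.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Strategy:
    """Hypothetical summary of one strategy as produced by a simulation model."""
    name: str
    cost: float   # mean lifetime cost per person (assumed unit: US dollars)
    qalys: float  # mean quality-adjusted life-years per person (assumed outcome)


def icer(reference: Strategy, comparator: Strategy) -> Optional[float]:
    """Incremental cost per QALY gained of comparator versus reference.

    Returns None when the comparator yields no QALY gain, since the ratio
    is not meaningful in that case.
    """
    delta_cost = comparator.cost - reference.cost
    delta_qalys = comparator.qalys - reference.qalys
    if delta_qalys <= 0:
        return None
    return delta_cost / delta_qalys
```

For example, with placeholder inputs `Strategy("no screening", 1000.0, 20.0)` and `Strategy("targeted screening", 6000.0, 20.25)`, `icer` returns 20000.0, which would then be compared against a chosen willingness-to-pay threshold.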