Gastroshare: Integrating Electronic Health Systems to Improve the Management of Gastric Precursors across Multiethnic Populations. - ABSTRACT Gastric precursors are precancerous, intermediary states between normal tissue and gastric cancer (GC). Precursors originate through a carcinogenic cascade (Correa’s cascade) from a source of chronic inflammatory insult such as infection by Helicobacter pylori (Hp). Although these lesions are prevalent (found in 5-10% of all endoscopies performed), their management has proven controversial in the United States. There are 1) sparse data from multiethnic populations on the natural history of precursor progression to GC, 2) no validated clinical risk stratification algorithms, and 3) no clinical trial data showing that endoscopic screening of precursors provides a mortality benefit. Centralized pathology registries that could yield adequately powered cohorts are lacking, and linking pathology databases to cancer registries of sufficient geographic coverage is difficult. In this project, we will perform a multi-institutional linkage of electronic health records (EHR; including clinical notes, pathology databases, and endoscopic records) from 1) a large academic healthcare system (Stanford Health Care) and 2) an integrated healthcare network serving Northern California (Sutter Health, comprising eight hospitals, >200 clinics), together with a comprehensive state-level tumor registry with legally mandated reporting (California Cancer Registry, CCR). Our central hypothesis is that a unique and large cohort of gastric precursors (Gastroshare) with detailed EHR phenotyping can be created, comprehensively linked to cancer occurrence, and utilized to answer key questions in progression, screening, and outcomes. 
In Aim 1, we will determine whether a hybrid modeling approach that combines structured data extraction with natural language processing methods, including generative large language model-based methods, can enhance characterization and reduce missingness in this integrated database. In Aim 2, we will leverage high-dimensional competing-risk EHR data to develop dynamic risk prediction models for GC, employing a temporal landmark framework. This method will be applied to predict dynamic GC risk by capturing evolving EHR data measured after precursor diagnosis, such as Hp eradication, medication use, and changes in smoking behavior. In Aim 3, we will evaluate the feasibility and utility of a novel causal inference method that explicitly emulates a hypothetical randomized surveillance trial. Using this emulation framework, we will evaluate the effect of endoscopic surveillance of precursors on GC mortality. Successful completion of this proposal will provide key clinical data on the natural history of precursors, assess clinical risk stratification tools, and evaluate the efficacy of secondary prevention strategies in this enhanced-risk population.
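
To illustrate the temporal landmark framework named in Aim 2, the following is a minimal sketch of how a landmark dataset might be assembled; it is not the project's implementation. The `Patient` fields, the landmark times, the two-year horizon, and the three example records are all hypothetical and chosen only for illustration. At each landmark time, patients still under follow-up contribute one row, a time-varying covariate (here, Hp eradication) is frozen at its value as of the landmark, and the outcome is administratively censored at the end of the prediction horizon, with non-GC death kept as a competing event.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Patient:
    # All fields are hypothetical illustrations, not the project's schema.
    pid: str
    followup_years: float  # years from precursor diagnosis to GC, death, or censoring
    event: str             # "gc", "death" (competing event), or "censored"
    hp_eradicated_at: Optional[float] = None  # years after diagnosis; None if never


def landmark_rows(patients, landmarks, horizon):
    """Build one row per (patient, landmark) for patients still at risk."""
    rows = []
    for s in landmarks:
        for p in patients:
            if p.followup_years <= s:
                continue  # event or censoring occurred before the landmark
            # Covariate is frozen at its value as of the landmark time.
            hp_done = p.hp_eradicated_at is not None and p.hp_eradicated_at <= s
            if p.followup_years <= s + horizon:
                status, time = p.event, p.followup_years - s
            else:
                # Administrative censoring at the end of the horizon.
                status, time = "censored", horizon
            rows.append({
                "pid": p.pid,
                "landmark": s,
                "hp_eradicated": hp_done,
                "time_from_landmark": time,
                "status": status,
            })
    return rows


# Hypothetical example records.
patients = [
    Patient("A", followup_years=1.5, event="gc", hp_eradicated_at=0.5),
    Patient("B", followup_years=6.0, event="censored", hp_eradicated_at=2.0),
    Patient("C", followup_years=3.0, event="death"),
]
rows = landmark_rows(patients, landmarks=[0, 1, 2], horizon=2)
for r in rows:
    print(r)
```

The stacked rows could then be passed to a competing-risk model fitted per landmark or jointly across landmarks; the choice of model is left open here because the abstract does not specify it.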