Genomic and exposomic factors in the cause and rise of autism - ABSTRACT
This collaboration between the Children’s Hospital of Philadelphia (CHOP) and the University of Pennsylvania (Penn) directly addresses the key priorities set out for the Autism Data Science Initiative: 1) to identify genetic and environmental factors that lead to an autism diagnosis and 2) to determine how these factors have either contributed to or are reflected in the rising prevalence. Most existing research on environmental contributions to ASD focuses on isolated exposures, neglecting the systemic nature of real-world environmental interactions. This gap limits the development of actionable risk models for early diagnosis and personalized intervention. By integrating multi-level exposome data (geocoded structural factors, individual social determinants, and perinatal exposures) with genomic and omic profiles, this study will advance understanding of how environmental contexts interact with biological susceptibility to shape ASD phenotypes, offering a foundation for tailored clinical management and novel preventive strategies. While publicly available research datasets are critical to this effort, an integrated database on a large, well-characterized clinical population is essential. CHOP successfully implemented universal early autism screening in its primary care network in 2011 and has since studied early screening, diagnostic outcomes, and prevalence through electronic health record (EHR) data. This cohort now includes 104,405 children born between 2008 and 2017 who were screened for autism at an 18- or 24-month well-child visit and had at least one additional visit within the CHOP network at 4 years of age or older. Thus, this is a well-characterized and research-ready dataset of ~4,000 children with autism and ~100,000 without.
Our proposed three-year project will combine clinical and genomic data from CHOP’s EHR and clinical and research biorepositories with Penn’s EHR data on pregnancy, maternal, and birth outcomes, and research databases (ROA Task 1). We will also generate geocoded exposomic data, including daily air and water components, greenspace, built environment, and the Childhood Opportunity Index, during pregnancy, birth, and childhood (ROA Task 2). We will then use advanced data science methods to generate hypotheses about relationships among variables to predict autism diagnosis and explain the increased prevalence over time (ROA Task 3). With this resource and the predictive power of our machine learning analytic plan, this project will have the statistical power to identify potential causes of autism in specific populations (e.g., starting with genetic and phenotypic subgroups). We will develop a prediction model that incorporates child-, family-, neighborhood-, and community-level clinical, genomic, and exposomic data to predict autism in our cohort of 104,405 children, all of whom were screened for autism as toddlers and are now aged 8 to 17 years. We will also identify genomic, exposomic, and clinical-practice factors, and their interactions, that contribute to the increase in autism prevalence and to the heterogeneity of the autism phenotype, including increases in medical and genetic risks; changes in environmental exposures; and critical changes in diagnostic practice, criteria, and service availability. Our multi-disciplinary team represents the best clinical, informatics, genetics, data science, and community engagement expertise for developing machine learning methods and tools for large-scale, high-resolution, meaningful, and actionable autism research.
If successfully implemented, this study will create an unprecedented data resource for researchers to derive specific, testable causal hypotheses and definitively answer questions regarding causes of autism in a way that was not previously possible. Furthermore, the results will parse the variance across broad domains of genomic and exposomic factors that may cause autism, which is critical for setting future research priorities.