Modeling Substance Abuse via a Behavioral Foundation Model Trained on Large-Scale Survey Data

Project Summary/Abstract

Substance use disorders (SUD) pose a major public health crisis that exacts heavy tolls on communities and healthcare systems, yet current survey data remain underutilized due to limitations in conventional analytic methods. This project proposes to develop a novel behavioral foundation model that transforms qualitative epidemiological survey responses into robust, quantitative latent representations of substance use behaviors. By harmonizing data from NESARC-III, NSDUH, and UK Biobank, we will “textualize” both structured and free-text responses into unified narratives that capture the nuanced details of individual experiences. Our approach leverages advanced natural language processing to convert diverse survey data into coherent, machine-interpretable inputs, and fine-tunes state-of-the-art, open-source large language models (LLMs) with integrated demographic tokens to enhance subgroup-specific predictions. We will rigorously validate the model’s performance against established machine learning techniques using metrics such as area under the ROC curve, calibration, and cross-dataset generalizability. Downstream applications include precise risk stratification for SUD outcomes, latent clustering to identify distinct risk and resilience profiles, and data-driven survey instrument optimization. Open-access dissemination of our tools will empower precision public health initiatives, enhance early identification of high-risk groups, and support targeted interventions to reduce the societal burden of substance use disorders.
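
The "textualization" step with demographic tokens can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the project: the variable names (`alc_freq`, `cig_ever`), the codebook wording, the `<field=value>` token format, and the helper names `demographic_tokens` and `textualize` are all hypothetical. The sketch only shows one plausible way a coded survey record plus a free-text comment might be rendered as a single narrative string suitable as LLM input.

```python
from typing import Any, Mapping

# Hypothetical codebook (not from any real survey): maps a variable name to
# a question stem and a lookup from coded answer to a readable label.
CODEBOOK = {
    "alc_freq": (
        "How often the respondent drank alcohol in the past year",
        {1: "never", 2: "monthly or less", 3: "weekly", 4: "daily or almost daily"},
    ),
    "cig_ever": (
        "Whether the respondent has ever smoked cigarettes",
        {0: "no", 1: "yes"},
    ),
}

# Hypothetical demographic fields rendered as special tokens.
DEMOGRAPHIC_FIELDS = ("sex", "age_band")


def demographic_tokens(record: Mapping[str, Any]) -> str:
    """Render available demographic fields as '<field=value>' tokens."""
    return " ".join(
        f"<{field}={record[field]}>"
        for field in DEMOGRAPHIC_FIELDS
        if record.get(field) is not None
    )


def textualize(record: Mapping[str, Any]) -> str:
    """Turn one coded survey record into a single narrative string."""
    sentences = []
    for var, (stem, labels) in CODEBOOK.items():
        value = record.get(var)
        if value is None:
            continue  # skip unanswered or missing items
        sentences.append(f"{stem}: {labels.get(value, 'unknown')}.")

    free_text = record.get("free_text")
    if free_text and free_text.strip():
        sentences.append(f'Respondent comment: "{free_text.strip()}"')

    prefix = demographic_tokens(record)
    return " ".join(part for part in [prefix, *sentences] if part)


example = {
    "sex": "female",
    "age_band": "30-44",
    "alc_freq": 3,
    "cig_ever": 0,
    "free_text": "  Drinks mostly at social events. ",
}

if __name__ == "__main__":
    print(textualize(example))
```

In a sketch like this, harmonizing several surveys would amount to maintaining one codebook per source that maps each source's variables onto shared question stems, so that records from different datasets yield comparable narratives.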