PROJECT SUMMARY The evolution of data science, encompassing statistical machine learning (SML) and
artificial intelligence (AI), has been underway since the 1990s. The pace of the SML/AI field has accelerated
significantly in recent years, as exemplified by the launch of ChatGPT (Chat Generative Pre-trained
Transformer) by OpenAI in November 2022. This development underscores the growing potential to
harness data science for training the next generation of clinical investigators, empowering them to utilize
available computing algorithms and tools in the rapidly expanding data science field to address novel questions
in changing research environments. The addiction research field faces several challenges, including small data sets for
low-base-rate behaviors (e.g., epidemiologic studies), high-dimensional and sparse data (p >> n; e.g., addiction
neuroscience, genetics, mHealth), non-Gaussian outcome data distributions (e.g., intervention and treatment
trials), and limited engagement in Open Science practices. The proposed R25 research education grant,
“Preparing the Next Generation of Addiction Researchers in Computational AI/ML Techniques,” responds to
RFA-DA-24-025, NIDA REI: Training a Diverse Data Science Workforce for Addiction Research. PI Mun and
her long-term collaborative team have a track record of engaging in and promoting Open Science and were
recognized as one of the finalist teams for the NIH DATAWorks! Prize Challenge in 2022. The proposed
NextGen Research Education program will be 12 months long and interdisciplinary. It will be delivered primarily
online to four cohorts of predoctoral and postdoctoral trainees over five years. We anticipate 8-10
trainees per cohort recruited from the pool of trainees receiving support from institutional training grants such
as T32 and R25 as well as from national research societies and their diversity networks (e.g., the Society for
Prevention Research, the Research Society on Alcohol) and the National Research Mentoring Network of the
Advance Health Equity and Researcher Diversity (AIM-AHEAD) program. The NextGen Research Education
program will encompass (1) up to 18 modules of SML/AI training, (2) biweekly research seminars and
conference participation, and (3) hands-on research experience culminating in papers guided by a team of
program faculty (mentors). The SML/AI training will cover AI-assisted programming in R and Python, advanced
statistics and SML, deep learning and cloud computing, the FAIR principles, Open Science practices, and
research ethics, including responsible conduct in research. The outcomes of the program will include a series
of online learning modules that will be publicly accessible and free. The NextGen Research Education program
will make all data, code, and packages publicly accessible. Program effectiveness will be monitored for
improvement, with success gauged annually by publications and their impact on the field, measured collectively
across all cohorts of trainees and program faculty. This program will result in well-trained SML/AI
clinical investigators contributing to cumulative and reproducible addiction science.