PROJECT SUMMARY
Early onset of alcohol use during adolescence is associated with increased probability of later alcohol
dependence, polydrug abuse, victimization, conduct problems, psychiatric comorbidities, and delayed
achievement of adult milestones. Methods that yield rapid, accurate, and reliable predictions of which children
and teens are at risk for early onset can improve the targeting of prevention interventions and enable the
concentration of resources on the most debilitating and costly cases. One promising and untapped approach
to this prediction problem is machine learning (also called “statistical learning,” “data mining,” or “predictive
modeling”), a class of techniques arising from statistics, computer science, and engineering that seeks to build
data-driven predictive algorithms. These techniques differ most visibly from “traditional”
statistical methods (e.g., ordinary least squares regression) in their emphasis on predicting future
cases rather than explaining the current data, and thus they may offer substantial advantages over
traditional approaches to identifying which children and teens will develop early onset alcohol use. This
proposal will explore the potential contribution of machine learning methods by directly comparing their
predictive performance to that of the traditional approach in a large-scale, multisite longitudinal study of the
development of early onset alcohol use (N = 731). If machine learning methods do significantly outperform the
traditional approach, future directions might include the development and implementation of machine-learning-
based screening methods for real-world use. On the other hand, if machine learning methods do not
outperform the traditional approach, this will suggest that, at least in the context of the present study (i.e., these
predictors, timeline, and outcome), machine learning does not improve the prediction of early onset alcohol
use. Analyses will investigate whether the performance of machine learning methods varies with the nature
of the predictor variables used, the age span covered, and the outcome to be predicted. Thus, the current proposal
uses an extant longitudinal dataset to carry out two specific aims: (1) Train five different machine learning
algorithms and one traditional algorithm (ordinary logistic regression) to predict early onset alcohol
use, using a training subset (70%) of the data. (2) Test these six predictive algorithms on the remaining 30% of
the data and directly compare their predictive performance in multiple contexts.
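
The train/test design in the two aims can be illustrated with a minimal sketch. It is not the proposal's analysis code. The function names (split_70_30, auc), the random seed, and the choice of area under the ROC curve as the performance measure are assumptions made for illustration; the summary does not name a specific metric. Only the sample size (N = 731) and the 70%/30% split come from the text above.

```python
import random


def split_70_30(ids, train_fraction=0.7, seed=0):
    """Shuffle participant IDs and split them into training and test sets."""
    rng = random.Random(seed)  # fixed seed (assumed) so the split is reproducible
    shuffled = list(ids)
    rng.shuffle(shuffled)
    n_train = round(train_fraction * len(shuffled))
    return shuffled[:n_train], shuffled[n_train:]


def auc(labels, scores):
    """Area under the ROC curve.

    Computed as the share of (positive, negative) pairs in which the
    positive case receives the higher score; ties count as half.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Aim 1: hold out 30% of the N = 731 participants before any model is trained.
train_ids, test_ids = split_70_30(range(731))

# Aim 2: score held-out cases only. Toy labels and scores, not study data.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = auc(toy_labels, toy_scores)
```

Under this sketch, each of the six trained algorithms would produce one score per test-set participant, and the same auc function would be applied to every algorithm's scores so that the comparison rests on identical held-out cases.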