Using machine learning to accelerate our understanding of risks for early substance use among child-welfare and community youth - PROJECT SUMMARY/ABSTRACT The economic toll of substance use and abuse is estimated at over $740 billion annually, resulting from accidents, health care, homelessness, unemployment, and criminal activity. Adolescence is a critical time for intervention, as 90% of adults who meet the criteria for addiction initiated alcohol or drug use in adolescence. Yet prevention efforts have been hindered by minimal substance use screening by primary care providers and by low rates of disclosure by adolescents in medical settings. These challenges necessitate new approaches to detect key risk factors and enhance screening methods. Importantly, adolescents who have experienced child maltreatment are more susceptible to early substance use and more likely to progress from experimentation to addiction than non-maltreated youth. Accumulating evidence and our preliminary data suggest that the top predictors of early substance use are not relevant for child welfare (CW) youth, requiring new studies of the risk factors relevant to this vulnerable population. The proposed study addresses these gaps by using cutting-edge machine learning models to provide vital new evidence regarding risk factors specific to the CW population as well as risks for early substance use that may be common to both CW and non-CW youth. We will use two unique data sources to accomplish this. First, our primary data will come from the electronic health records (EHR) of Kaiser Permanente Southern California (KPSC) members (estimated sample size of 3.4 million children, 2007-2020). We will use diagnosis codes for maltreatment to identify the CW sample, on the reasonable assumption that such a diagnosis entails referral to child welfare. Risk factors will be obtained from diagnosis codes and abstractable progress notes in the EHR of children and parents, as well as from county crime and geographic income data.
Second, to address the limitations of EHR data in capturing more detailed psychosocial information, we will use an existing longitudinal dataset of 454 youth, 303 referred from child welfare and 151 in a comparison group (YAP study). Participants were seen at mean ages of 11, 13, 15, and 18 years and are racially/ethnically diverse. Collected data include measures of child-level, parent-level, family-level, and neighborhood-level risk factors, as well as CW case records. These two data sources will allow us to: 1) produce critical new knowledge regarding the relevant predictors of early substance use for CW versus non-CW youth and 2) use intensive survey data (YAP) to identify risk factors that are not currently collected in EHR data (KPSC) and that may inform the development of new screening questions. Lastly, our predictive model has translational potential to advance screening methods for adolescent substance use risk in pediatric primary care through risk scores integrated into clinical decision support tools. These findings, if implemented in clinical care settings, would allow medical providers to identify those at risk more accurately and to trigger stratification into different treatment pathways to prevent substance abuse.