The primary goal of this project is to bring powerful data mining and analytics methods, as well as the supporting
computing technology and software, to the substance abuse and HIV research communities so that anyone,
regardless of expertise, can model multimodal data for disease prediction, with the end goal of
clinical decision support. Our working hypothesis is that automated machine learning (AutoML) will accelerate
the development of innovative strategies for translation of research findings to clinical use by enabling everyone
to analyze biomedical data using data mining methods. This project builds on our user-friendly, open-source
Tree-based Pipeline Optimization Tool (TPOT), one of the first and most widely used
AutoML methods. A major benefit of this approach is that it makes machine learning accessible to
novice users because it takes the guesswork and complexity out of picking, running, tuning, and optimizing
machine learning algorithms and the various pre- and post-processing methods. Bringing this technology to the
clinical and translational research communities will open the door to broad adoption of data mining methods that
can capture the complex relationships between multimodal substance abuse biomarkers and clinical
outcomes such as HIV progression and severity. We propose here novel algorithms to adapt and extend TPOT
for the large volumes of clinical data that are being collected on patients infected with HIV at Cedars-Sinai
Medical Center in Los Angeles. Specifically, we will first develop an ontology-based Addiction KnowledgeBase
(AddictionKB) tailored to HIV endpoints and to clinical data derived from electronic health records (EHRs), including
their relationships with HIV infection and outcomes, in order to inform the machine learning algorithms and assist with
interpretation (AIM 1). We will then develop a large language model (AddictionLLM) using BLOOM that supports
natural language queries of AddictionKB for knowledge-guided feature selection (AIM 2). We will extend
our TPOT AutoML (AddictionML) with dedicated operators that call AddictionLLM for automated
knowledge-guided feature selection within machine learning pipelines (AIM 3). We will apply AddictionKB,
AddictionLLM, and AddictionML to the identification of substance abuse disorders and other clinical measures
that are predictive of HIV progression and severity (AIM 4). Finally, we will distribute and support AddictionKB,
AddictionLLM, and AddictionML as open-source software (AIM 5).
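
To make the idea of knowledge-guided feature selection concrete, the following is a minimal, dependency-free sketch and not the project's implementation. Everything in it is an assumption made for illustration: the toy dictionary standing in for AddictionKB, the `query_knowledge` function standing in for a natural language query answered by AddictionLLM, and the feature and outcome names. In an actual AutoML pipeline, a step like this would presumably be wrapped as a pipeline operator rather than called directly.

```python
from typing import Callable, Dict, List, Sequence

# Hypothetical toy stand-in for AddictionKB: outcome -> related EHR features.
# The feature names and links below are invented for illustration only.
TOY_KB: Dict[str, List[str]] = {
    "hiv_progression": ["opioid_use", "alcohol_use", "cd4_count"],
    "hiv_severity": ["stimulant_use", "viral_load"],
}


def query_knowledge(outcome: str) -> List[str]:
    """Stand-in for a natural language query answered by AddictionLLM."""
    return list(TOY_KB.get(outcome, []))


def select_features(
    columns: Sequence[str],
    outcome: str,
    query: Callable[[str], List[str]] = query_knowledge,
) -> List[str]:
    """Keep only the dataset columns that the knowledge source links to the outcome.

    Falls back to all columns when nothing matches, so that downstream
    modeling steps always receive a non-empty feature set.
    """
    relevant = set(query(outcome))
    kept = [c for c in columns if c in relevant]
    return kept if kept else list(columns)


def project(rows: Sequence[Dict[str, object]], kept: Sequence[str]) -> List[Dict[str, object]]:
    """Restrict each record (a dict of column -> value) to the kept columns."""
    return [{c: row[c] for c in kept} for row in rows]
```

The selection step takes the query function as a parameter so that the toy lookup can be swapped for a real knowledge source without changing the rest of the pipeline; the fallback to all columns is one possible design choice, not a requirement of the approach.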