Development and Validation of a Novel Software Platform for Decentralized Clinical Trials in Substance Use Disorder Incorporating Large Language Models

Abstract

Substance use disorder (SUD) clinical trials traditionally perform recruitment, enrollment, interventions, and information collection at clinical trial sites. In-person site visits are used for screening, assignment to interventions, treatment or other intervention, follow-up, and collection of information from participants for entry into a database. Requiring participants to be physically present at the study location may limit the diversity of trial participants because of factors such as travel time to the center, lost time from work, transportation issues, and child and elder care responsibilities. The perceived stigma associated with participating in a SUD trial can also prevent participation. These factors often hamper engagement, recruitment, screening, and retention in trials. In a survey of clinical trial participants, the most commonly disliked aspects were the location, the length of the visit, and the time commitment. Decentralized clinical trials (DCTs) allow most visits to be converted into telehealth visits and/or phone calls. Monitoring, when needed, is done with equipment delivered to the patient's home or at local clinical laboratories and/or health care providers. Wearable devices have demonstrated sufficient accuracy in many monitoring tasks, such as step counting and heart rate measurement. This approach may solve the commuting problem and reduce lost time from work and home care issues. However, gaps remain that prevent the use of DCTs in SUD, mainly the heterogeneity of the data and the difficulty clinical trial coordinators face in reviewing every piece of data to ensure the accuracy of each entry.
Current clinical trial platforms do not offer good support for DCTs in SUD. Meanwhile, large language models (LLMs) have shown significant potential to perform at a human-like level in natural language processing (NLP) tasks, including semantic analysis and text generation. An existing LLM fine-tuned with domain-specific knowledge, such as SUD, can perform better in that domain and could potentially automate many communication and data collection tasks that are traditionally performed by human experts such as clinical trial coordinators. In this project, we aim to develop and validate a novel software platform that specifically addresses the challenges of DCTs in the context of SUD and leverages cutting-edge LLM technology to improve the efficiency and accuracy of data collection. In the Phase I period of this project, we will perform feasibility testing with the following two aims: 1) develop a novel software platform for DCTs in SUD with a fine-tuned LLM, and 2) validate the LLM accuracy and software usability in an emulated SUD trial using retrospective data. By addressing the technical and practical challenges of DCTs and leveraging cutting-edge technology, this project has the potential to shift the clinical paradigm for SUD clinical trials, encourage participation from people of all backgrounds, and lead to improved SUD treatments for all patient populations.