Developing large language models for drug safety and effectiveness causal analysis - Project Summary

Because randomized controlled trials often severely underrepresent frail and complex patients, it is pivotal to inform physicians' treatment choices with drug safety and effectiveness studies based on real-world data. Electronic health records (EHR) contain rich clinical information and are among the most commonly used real-world data sources for estimating the causal effects of pharmacotherapies. However, much of the essential data is embedded in free-text clinical notes and reports (unstructured EHR). Traditional natural language processing (NLP) approaches require a labor-intensive process of knowledge acquisition and training dataset creation for each phenotype, so they do not scale to the large numbers of outcome phenotypes, risk stratification factors, and potential confounders (often >200) that need to be created for a typical pharmacoepidemiologic study. In contrast, Large Language Models (LLMs) offer a more scalable approach because they can be used to predict phenotypes not defined during the training stage. Yet existing LLMs were not tailored for determining the phenotypes essential for causal effect estimation of pharmacotherapies. Our objective is to build an LLM-based causal analytical platform for drug safety and effectiveness using two large multi-center EHR systems linked with Centers for Medicare & Medicaid Services (CMS) utilization, clinical assessment, and pharmacy dispensing data covering >1.3 million lives from 2000-2024. Our central working hypothesis is that our novel LLMs will perform robustly in determining a wide variety of clinical phenotypes, including those not originally targeted during the training stage, and that they can be used to reduce missing data in pharmacoepidemiologic causal analysis.
In Aim 1, we will train novel LLMs, building on existing general-purpose LLMs, for phenotypes commonly used in drug safety and effectiveness causal analysis. The reference standard for the target phenotypes will be provided by large-scale annotation based on structured data in the linked external clinical data. The targeted phenotypes include cognitive function, mental and functional status, pain levels, mood symptoms, adherence to chronic medications, and healthcare utilization outside the study EHR. In Aim 2, we will assess the generalizability of the novel LLMs by predicting eight new categories of phenotypes (not already targeted in Aim 1) in an independent dataset. We will further optimize the LLMs based on their performance in the validation dataset. In Aim 3, we will determine the impact of LLM-derived features, in terms of bias and variance reduction, on causal effect estimation in three categories of highly relevant empirical drug safety and effectiveness studies. This LLM-based causal analytical platform can be used to generate a wide range of high-validity clinical features that enable causal effect estimation with adequate patient outcome phenotyping, confounding adjustment, and treatment effect heterogeneity evaluation, which is required for high-quality evidence for individualized prescribing.