FullMouth: Enhancing Dental Clinical Data and Reducing Disparities through Innovative ML Approaches - Project Abstract/Summary The vast amount of health data created in the United States may hold the key to understanding disease, improving quality, and lowering healthcare costs. Electronic health records (EHRs), digital collections of patient healthcare events and observations, are now ubiquitous in medicine and critical to healthcare delivery, operations, and research. EHR data are often classified as structured or unstructured. Structured EHR data include standardized diagnoses, medications, and laboratory values in fixed numerical or categorical fields. Structured data are frequently missing, incomplete, or inconsistent. Unstructured data, in contrast, refer to free-form text written by healthcare providers, such as clinical notes and discharge summaries, and represent about 60% of all EHR data. Dental care providers often write detailed findings, diagnoses, treatment plans, and prognostic factors in free-text format for clinical care purposes. While this information is easily accessible during patient care, extracting it to generate meaningful insights in secondary analysis is challenging. Using these records requires manual review by domain experts, which is time-consuming and costly, particularly for large numbers of patient records. Recently, Large Language Models (LLMs) and newer deep learning approaches to Natural Language Processing (NLP) have made considerable advances, outperforming traditional statistical and rule-based systems on a variety of tasks. To fully realize the promise of health information technology in dentistry, it is important to address data missingness and disparities in missingness. Through a periodontal use case, this proposal will tackle two challenges: missing structured clinical data and ‘technically’ inaccessible unstructured clinical data.
Periodontal disease (advanced gum disease) is pervasive, and unlike caries, whose prevalence has steadily declined over the past four decades, the disease burden and tooth loss secondary to periodontal disease remain intractable. In preliminary work at two dental institutions, we observed that most patients seen for a comprehensive oral evaluation had missing or incomplete documentation of clinical periodontal indices and diagnoses, demographics, and health-related behaviors, all of which are critical in diagnosing and treating periodontal disease. These gaps significantly limit our ability to learn and improve. Aim 1 will use LLM-based NLP approaches to convert unstructured note entries into structured, machine-readable information. Aim 2 will use imputation techniques to fill in missing structured clinical data entries. Aim 3 will then evaluate the impact of reduced clinical data missingness on both clinical and research applications. This project builds on our prior work developing the BigMouth Dental Data Repository, which contains regularly updated structured data on 4.6 million patients. We will be supported by the collective strength of the 11 core BigMouth institutions and other allied dental institutions that currently share and/or contribute data to the repository.