Enhancing AI-based Speech Therapy through Acoustic-Derived Articulatory Feedback

Project Summary/Abstract

This proposal aims to leverage advances in artificial intelligence (AI) to develop clinically impactful tools that support progress in speech therapy for children with Speech Sound Disorder (SSD) while alleviating the pressing issue of overburdened speech-language pathologist (SLP) caseloads [5]. SSD affects a significant portion of school-aged children [1, 13], leading to social and emotional challenges that can persist into adolescence and adulthood [14, 16, 17]. This project aligns with NIH's health-related mission by improving access to effective speech therapy, thereby enhancing the quality of life of individuals with SSD. Research suggests that roughly 5,000 accurate speech productions are necessary for speech sound generalization [4], motivating the development of AI systems to supplement therapy. However, a critical challenge in automating therapy is providing accurate qualitative feedback, especially for complex sounds such as the American English rhotic /ɹ/. Ultrasound biofeedback has shown promise in therapy [24], but cost and training barriers limit its adoption.

The research design applies state-of-the-art machine learning algorithms to a multimodal corpus of children's speech collected in our team's previous research. Acoustic and ultrasound data will be processed and segmented to extract relevant features, which will then be used to train and evaluate machine learning models. The project will incorporate innovative techniques such as cross-modal embedding and retrieval [11] to integrate information from multiple modalities and improve the accuracy of pattern recognition.

More specifically, the first aim focuses on training a classifier to identify tongue shapes within the class of perceptually accurate /ɹ/ productions from acoustic data. Ground truth labels were obtained from previous human coding of articulatory patterns [9] and will be supplemented with additional labeling. Supervised machine learning methods will be employed to differentiate between tongue shapes, with high precision and recall as the evaluation targets.

The second aim focuses on using semi-supervised machine learning methods to distinguish clinically relevant acoustic-articulatory patterns within both accurate and inaccurate /ɹ/ productions. Ground truth labels provided by certified SLPs will inform the training of models that use cross-modal embeddings to recognize patterns with high precision and recall in a high-dimensional multimodal dataset of audio and ultrasound data [11].

The study is expected to yield classifiers that differentiate tongue shapes within accurate and inaccurate /ɹ/ productions, with potential applications in enhancing clinician cueing and supporting at-home practice for children with SSD. The long-term goal is to integrate these classifiers into freely available clinical speech software that provides accurate, customized articulatory feedback for /ɹ/ as a first step, with the potential to expand to other speech sounds in the future.
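As an illustration of the Aim 1 workflow, the sketch below trains and evaluates a supervised tongue-shape classifier on placeholder acoustic features with scikit-learn. The feature representation (MFCC-style vectors), the SVM model, and the binary bunched/retroflex labels are assumptions made for illustration only, not the project's committed design.

```python
# Minimal sketch of Aim 1: a supervised classifier that labels perceptually
# accurate /r/ tokens by tongue shape from acoustic features, evaluated with
# precision and recall. Data here is random placeholder data.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import precision_score, recall_score

rng = np.random.default_rng(0)

# Placeholder data: one row per /r/ token, columns are acoustic features
# (e.g., MFCC means or formant values); labels 0 = bunched, 1 = retroflex.
X = rng.normal(size=(500, 13))
y = rng.integers(0, 2, size=500)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
clf.fit(X_train, y_train)
pred = clf.predict(X_test)

# Precision and recall are the evaluation targets named in Aim 1.
print("precision:", precision_score(y_test, pred))
print("recall:", recall_score(y_test, pred))
```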
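For Aim 2, the following is a minimal sketch of cross-modal embedding, assuming a contrastive (InfoNCE-style) objective that pulls time-aligned acoustic and ultrasound frames together in a shared space. The PyTorch encoders, feature dimensions, and temperature are illustrative assumptions rather than the proposal's specified architecture.

```python
# Sketch of cross-modal embedding: two small encoders map acoustic and
# ultrasound features into one normalized space; a symmetric contrastive
# loss treats time-aligned pairs as positives. All inputs are placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    """Maps one modality's feature vector into the shared embedding space."""
    def __init__(self, in_dim, emb_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 128), nn.ReLU(), nn.Linear(128, emb_dim)
        )

    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)

acoustic_enc = Encoder(in_dim=13)   # e.g., MFCC frame features (assumed)
ultra_enc = Encoder(in_dim=128)     # e.g., flattened ultrasound features (assumed)

opt = torch.optim.Adam(
    list(acoustic_enc.parameters()) + list(ultra_enc.parameters()), lr=1e-3
)

# Placeholder batch of time-aligned acoustic/ultrasound pairs.
acoustic = torch.randn(32, 13)
ultra = torch.randn(32, 128)

for step in range(10):
    a = acoustic_enc(acoustic)
    u = ultra_enc(ultra)
    logits = a @ u.t() / 0.07          # similarity of every acoustic/ultrasound pair
    targets = torch.arange(a.size(0))  # matching pairs lie on the diagonal
    loss = (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2
    opt.zero_grad()
    loss.backward()
    opt.step()
```

In such a setup, the learned shared space could let patterns identified in the ultrasound modality be retrieved or classified from acoustics alone, which is the practical requirement for at-home feedback without an ultrasound probe.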