A Structure-based orthology approach to predict protein function in eukaryotic parasites

PROJECT SUMMARY

Eukaryotic parasites are a diverse group of organisms that cause a wide range of infectious diseases in humans. These diseases have a significant impact on global health, affecting millions of people annually. Affordable genome sequencing has transformed the study of pathogens for therapeutic development. However, functional annotation of the proteins encoded by these genomes has not kept pace with advances in sequencing technology. Traditional methods based on sequence orthology have failed to fully annotate a third of all proteins in VEuPathDB, a sequence database dedicated to eukaryotic parasites and one of the Bioinformatics Resource Centers (BRCs) funded by the National Institute of Allergy and Infectious Diseases (NIAID).

To overcome this challenge, we propose to apply a novel structure-based orthology approach to predict protein function. This approach combines AlphaFoldDB, a large repository of precomputed protein structure models, and Foldseek, a fast and accurate structure alignment algorithm, with OrthoMCL sequence orthology.

Successful completion of this project could substantially advance infectious disease research. Functionally annotating thousands of uncharacterized proteins in VEuPathDB will provide a valuable resource for identifying candidate targets for further functional studies and therapeutic development. Furthermore, automating this process is a significant improvement over the current manual process, which requires extensive structural biology knowledge.

Specific Aims:
1) Define Domain-based Structure Orthology Groups (DSOGs) at scale by leveraging AlphaFoldDB and Foldseek to identify structural orthologs and rank them based on the conservation of positions in structure-based sequence alignments.
2) Predict the function of DSOGs at scale with natural language processing techniques, such as ProtNLM, to generate unified names and functional annotations for proteins with similar functions within DSOGs. We will collaborate with VEuPathDB to update existing official product names and annotations.

By completing these specific aims, our project will deliver annotations for thousands of uncharacterized proteins in VEuPathDB, as well as an automated pipeline that streamlines the laborious process of protein functional annotation. The annotations will provide insight into the biology of parasitic organisms, help identify candidate targets for further functional studies, and facilitate the development of novel therapeutic interventions for infectious diseases.
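
As an illustration of the ranking step described in Aim 1, the following is a minimal sketch and not the project's actual pipeline. It assumes that the structure-based alignment of a candidate group is already available as equal-length gapped sequences, and it scores each member by the fraction of conserved alignment columns at which it carries the consensus residue. The function names, the `min_fraction` threshold, and the toy sequences are all hypothetical choices made for this example.

```python
from collections import Counter


def conserved_columns(aligned, min_fraction=0.8):
    """Map column index -> consensus residue for conserved columns.

    `aligned` maps a protein name to its gapped sequence ("-" marks a gap).
    A column counts as conserved when its most common residue is present in
    at least `min_fraction` of all sequences (hypothetical criterion).
    """
    sequences = list(aligned.values())
    n_sequences = len(sequences)
    consensus = {}
    for col in range(len(sequences[0])):
        residues = [seq[col] for seq in sequences if seq[col] != "-"]
        if not residues:
            continue  # column made only of gaps
        residue, count = Counter(residues).most_common(1)[0]
        if count / n_sequences >= min_fraction:
            consensus[col] = residue
    return consensus


def rank_by_conservation(aligned, min_fraction=0.8):
    """Rank proteins by the share of conserved columns matching the consensus."""
    consensus = conserved_columns(aligned, min_fraction)
    scores = {}
    for name, seq in aligned.items():
        matches = sum(1 for col, res in consensus.items() if seq[col] == res)
        scores[name] = matches / len(consensus) if consensus else 0.0
    # Highest score first; ties keep their input order.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy alignment (made-up sequences, for illustration only).
toy_alignment = {
    "query": "MKTAYIAK",
    "hit_a": "MKTAYIAK",
    "hit_b": "MKTGYLAK",
    "hit_c": "MR-AFIGK",
}

for name, score in rank_by_conservation(toy_alignment, min_fraction=0.75):
    print(f"{name}\t{score:.2f}")
```

In practice the gapped sequences would be derived from the structural alignments, and the conservation criterion would need to be tuned and validated against curated ortholog groups.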