Machine Learning Models for Studying Protein Interactions in the Context of Immune Receptors - Project Summary / Abstract

Protein interactions are the fundamental basis of all cellular processes. Proteins interact with each other to form complexes that carry out a wide range of functions, from signal transduction and gene regulation to DNA repair. Disruptions in protein interactions are implicated in a wide range of human diseases. Wet-lab assays for studying protein interactions remain indispensable, but with advances in algorithms and machine learning, computational methods for predicting protein interactions have the potential to revolutionize our understanding of cellular processes, identify new drug targets, and lead to more effective therapies. Our research applies domain knowledge from biological sequence analysis, structural biology, and machine learning to computationally predict whether given protein complexes will interact and to generate novel protein receptors that may recognize target ligands. These computational algorithms and machine learning models can be used to 1) develop new therapeutic molecules to treat infectious diseases or cancer, and 2) produce new diagnostic tools to detect abnormalities in cells. To provide biological sequences (such as protein sequences) as input to these computational methods, one must first express each sequence as a fixed-size numeric vector, often referred to as an embedding of the input sequence. However, the mainstream embedding techniques for biological sequences are simple adaptations of embedding techniques from the field of natural language processing. Biological sequences are highly complex and structured, and their units of information are less apparent than those of natural languages.
The two primary goals of the proposed research are: 1) pinpointing the determinants of an effective embedding of biological sequences, in order to derive general principles for designing protein language models for a specific family of proteins, and 2) applying these embedding techniques to better generate immune receptors, such as T cell receptors (TCRs) and B cell receptors (BCRs), that interact with a target epitope. Both research goals build on our previous TCR embedding model, which improves downstream model performance by a wide margin on TCR-epitope binding prediction and on clustering of TCR repertoires. The outcome of this project will be a unified computational framework for predicting protein interactions and designing novel TCRs and BCRs, which will have a profound impact on human health.