Integrative deep learning algorithms for understanding protein sequence-structure-function relationships: representation, prediction, and discovery

Abstract

Understanding the sequence-structure-function relationship of proteins is of vital importance to protein biology, biomedicine, and bioengineering. Recent advances in biotechnology have been generating rich datasets that characterize proteins, such as next-generation sequencing data, three-dimensional (3D) structures, ontology annotations, and measurements of functional activities, yet how to use these datasets computationally to fully unveil the structural and functional mechanisms of proteins remains a significant challenge. Existing computational methods often struggle with the size, high dimensionality, heterogeneity, incompleteness, and intrinsic noise of these data, limiting our ability to study protein biology from a holistic, integrated systems view. The goal of this research is to develop new artificial intelligence (AI) methods for effectively integrating and intelligently modeling heterogeneous protein-related datasets and to advance our understanding of the mechanistic connections between proteins’ sequence, structure, and function. This project not only represents timely research that leverages the unprecedented opportunities offered by recent AI breakthroughs such as AlphaFold, but also extends these efforts beyond protein structure prediction to systematic analyses of protein biology, unlocking analytic frameworks that could not be realized previously. Specifically, we will first develop novel machine learning methods to learn statistical representations that are grounded in the sequence and structure of proteins and reflect their functional properties. The learned representations will allow us to characterize how the composition of amino acids and the 3D shape of protein structure determine the function of a protein.
Second, we will develop unified, biology-guided deep learning frameworks that integrate domain knowledge, such as structural properties and evolutionary relationships, and use them to study several key problems in characterizing protein function, including genome-scale function annotation and variant effect prediction. These efforts will shift the classic sequence-first paradigm of previous studies to a new integrative paradigm and provide accurate, robust, and interpretable predictions of protein function. Finally, we will develop a computational platform that combines data-efficient AI models, uncertainty-guided exploration algorithms, and deep learning-based generative models for AI-aided directed evolution and sequence-structure co-design of proteins, which will assist and accelerate the discovery and design of functional proteins. Overall, this proposal will study the sequence-structure-function relationship of proteins from an integrative perspective, provide new state-of-the-art AI algorithms with applications to fundamental problems in understanding protein function and human disease, and generate new actionable biological hypotheses for the discovery and design of novel functional proteins. The resulting software and data resources will be made publicly available through open-access platforms.