Abstract
Understanding the sequence-structure-function relationship of proteins is of vital importance to protein biology,
biomedicine, and bioengineering. Recent advances in biotechnology have been generating rich datasets to
characterize proteins, such as next-generation sequencing data, three-dimensional (3D) structures, ontology
annotations, and measurements of functional activities, yet how to use these datasets computationally
to uncover the structural or functional mechanisms of proteins remains a significant challenge. Existing
computational methods often struggle with the size, high dimensionality, heterogeneity, incompleteness, and
intrinsic noise of these data, limiting our ability to study protein biology from a holistic, integrated systems view.
The goal of this research is to develop new artificial intelligence (AI) methods for effectively integrating and
intelligently modeling heterogeneous protein-related datasets and to advance our understanding of the
mechanistic connections between protein sequence, structure, and function. This project not only represents
timely research that leverages the unprecedented opportunities offered by recent AI breakthroughs such as
AlphaFold, but also extends these efforts beyond protein structure prediction to systematic analyses of protein
biology, unlocking new analytic frameworks that could not be realized previously. Specifically, we will first
develop novel machine learning methods to learn statistical representations that are grounded in the sequence
and structure of proteins and reflect their functional properties. The learned representations will allow us to
characterize how the amino acid composition and the 3D structure of a protein determine the function
of a protein. Second, we will develop unified, biology-guided deep learning frameworks to integrate domain
knowledge, such as structural properties and evolutionary relationships, and study several key problems for
characterizing protein functions, including genome-scale function annotation and variant effect prediction. These
efforts will shift the field from the classic sequence-first paradigm of previous studies to a new integrative paradigm and provide
accurate, robust, and interpretable predictions of protein functions. Finally, we will develop a computational
platform that combines data-efficient AI models, uncertainty-guided exploration algorithms, and deep learning-
based generative models for AI-aided directed evolution and sequence-structure co-design of proteins, which
will assist and accelerate the discovery and design of functional proteins. Overall, this proposal will study the
sequence-structure-function relationship of proteins from an integrative perspective, provide new state-of-the-art
AI algorithms with applications in fundamental problems for understanding protein function and human disease,
and generate new actionable biological hypotheses for the discovery and design of novel functional proteins.
The resulting software and data resources will be made publicly available through open-access platforms.