Project Summary
Rapidly growing genomic and phenotypic data from large-scale biobank efforts are increasingly associating
genetic variants with the predisposition, onset, and progression of human diseases. However, knowledge
remains limited about the mechanisms underlying those associations, and many more
variants are of uncertain clinical significance. The widening gap between variant data and mechanistic
knowledge further hinders the delivery of prognostics, diagnostics, and therapeutics to meet growing
healthcare demands.
In response to the knowledge gap, the PI’s long-term research goal is twofold: (1) to unravel how a
genetic change ripples across atomic, molecular, and cellular levels to cause
human diseases or confer drug resistance; and (2) to translate learned mechanistic knowledge into effective
therapeutic strategies for human diseases and drug resistance. Toward this long-term goal, this
project focuses on coding variants that lead to protein mutations, builds upon our recent progress in
physics-driven protein design and data-driven machine learning for mutational effects, and proposes to
broaden and deepen the study of disease-associated protein mutations along three directions. The first
two directions involve forward prediction and causal inference of hierarchical mutational phenotypes
across molecular and cellular levels, which will generate mechanistic hypotheses while predicting disease
phenotypes. The third direction involves inverse design of perturbation experiments, including protein
mutagenesis and ligand perturbation, to test the generated mechanistic hypotheses directly and rationally.
To advance along the three directions over the next five years, we will integrate molecular physics, systems
knowledge, and emerging large-scale functional data into a systematic and rigorous framework to
probabilistically predict, explain, and design phenotypes of protein mutations across biological scales.
To this end, we will fuse molecular modeling, network analysis, multimodal machine learning, graph learning, and
conditional generative models, while continuing our experimental and clinical collaborations in
teams and communities. The expected contributions of the project, besides the computational methods,
predicted phenotypes, hypothesized mechanisms, and designed experiments, also include an integrated
data platform suited to cross-disciplinary machine learning and a resource and discovery platform
that promotes clinical feedback loops.