The drug discovery process has a high failure rate, which leaves a significant number of diseases
untreatable. Often, drug therapies are limited by our incomplete understanding of the biological pathway
causing the disease. While some key pathways have been identified, and drug therapies have been designed to
modulate them, we expect that there are large numbers of uncharacterized pathways that must be discovered
for effective drug development. Biologists commonly conduct time-intensive experiments focused on well-
studied proteins and thus miss novel connections between proteins and associated phenotypes. Existing
computational methods often rely too much on known protein functional relationships. This reliance limits the
potential for discovery of novel protein interactions and pathways that modulate understudied phenotypes. The
increase in high-throughput databases of protein data collected at scale allows us to overcome these biases.
High-throughput datasets describe protein sequence, interaction, structure, and expression, all of
which can inform predictions about proteins and their likely functional interactions. However, adequate algorithms
do not yet exist to combine these high-throughput data sources in a biologically coherent manner in order to
predict protein interactions and pathways of biological response. Current pathway prediction algorithms use
known examples to generate a fitness function that is used to predict the likelihood of a novel pathway; this
approach may miss pathways that cannot be traversed with the heuristics used to define the fitness function.
This project addresses these issues by (1) creating a pathway representation method that incorporates these
heterogeneous data sources, using an attention mechanism to identify the most discriminative features, and (2)
implementing reinforcement learning algorithms that balance exploration and optimization to learn trajectories in
the protein network that correspond to pathway function. We will then (3) use the learned representation to
identify the phenotype associated with the newly proposed pathway and assess the impact of variations in that
pathway on disease. Applying this framework to the human proteome will enable a better understanding of
the pathways responsible for psychopathologies, thus improving the specificity of potential drug targets and the
efficiency of drug development.
This project will take place in the Helix Lab at Stanford University, advised by Dr. Russ Altman. The
training plan is designed to support development into an independent researcher who builds computational
methods for molecular proteomics. Dr. Altman has an excellent record of mentorship, and Stanford University
provides a diverse range of resources and collaborators. The Helix Lab has a strong history in both computational
protein characterization and drug response research, providing access to domain experts. Beyond research, the
training plan includes seminar and conference attendance, collaborations, coursework, and teaching.