Exploring Understudied Proteins to Predict Novel Pathways and Associations to Disease - The drug discovery process has a high failure rate, which leaves a significant number of diseases untreatable. Often, drug therapies are limited by our incomplete understanding of the biological pathway causing the disease. While some key pathways have been identified, and drug therapies have been designed to modulate them, we expect that large numbers of uncharacterized pathways must still be discovered for effective drug development. Biologists commonly conduct time-intensive experiments focused on well-studied proteins and thus miss novel connections between proteins and associated phenotypes. Existing computational methods often rely too heavily on known protein functional relationships. This reliance limits the potential for discovery of novel protein interactions and pathways that modulate understudied phenotypes. The growth of high-throughput databases of protein data collected at scale allows us to overcome these biases. High-throughput datasets provide information about sequence, interaction, structure, and expression, all of which can inform predictions about proteins and their likely functional interactions. However, adequate algorithms do not yet exist to combine these high-throughput data sources in a biologically coherent manner in order to predict protein interactions and pathways of biological response. Current pathway prediction algorithms use known examples to generate a fitness function that is used to predict the likelihood of a novel pathway; this approach may miss pathways that cannot be traversed with the heuristics used to define the fitness function.
This project addresses these issues by (1) creating a representation method for pathways that incorporates the heterogeneous data sources, specifically using attention to identify the most discriminative features, and (2) implementing reinforcement learning algorithms that balance exploration and optimization to learn trajectories in the protein network that correspond to pathway function. We will then (3) use the learned representation to identify the phenotype associated with a newly proposed pathway and determine the impact of variations in this pathway on disease. Applying this framework to the human proteome will enable better understanding of the pathways responsible for psychopathologies, thus improving the specificity of potential drug targets and the efficiency of drug development. This project will take place in the Helix Lab at Stanford University, advised by Dr. Russ Altman, and the training plan is designed with the goal of becoming an independent researcher developing computational methods for molecular proteomics. Dr. Altman has an excellent record of mentorship, and Stanford University provides a diverse range of resources and collaborators. The Helix Lab has a strong history in both computational protein characterization and drug response research, providing access to domain experts. Beyond research, the training plan includes attending seminars and conferences, collaborations, coursework, and teaching.