Robust and Scalable Machine Learning and Statistical Methods for Genomics - PROJECT SUMMARY

The Song lab consists of a diverse team of computer scientists, statisticians, and mathematicians dedicated to advancing biology. Their work focuses on developing efficient computational tools, robust statistical methods, and innovative machine learning models for understanding evolution and fundamental biological processes. Their goal is to facilitate the broader biomedical community’s research while also getting deeply involved in data analysis and interpretation to help make new biological discoveries.

A central goal of biology is to unravel the wealth of information contained in the genome. Achieving this would enable the integration of personal genome interpretation into healthcare, aiding in disease diagnosis, personalized therapeutic regimens based on individual genetic makeup, and improved treatments with reduced side effects. The Song lab’s research program aims to help realize this transformative vision by addressing critical computational, statistical, and modeling challenges.

Recent advances in machine learning (ML) and artificial intelligence (AI) have profoundly impacted diverse scientific fields, transforming approaches to model development, experimental design, data analysis, interpretation, and discovery. Applying AI/ML to biology holds enormous potential, but fully realizing this potential and making it accessible to the broader biomedical community requires addressing several key challenges. For instance, incorporating biological knowledge and insight into AI/ML models is crucial for training effective models. Achieving this requires innovation in curating suitable training data, which demands a deep understanding of the specific application domain; developing appropriate model architectures and tuning their hyperparameters; and designing effective learning objectives.
Evaluating and interpreting the trained models also require a substantial amount of rigorous work. Furthermore, training advanced AI/ML models typically demands large computational resources, which can seriously limit model development, exploration, and utility; thus, novel approaches are needed to train models more efficiently. This project aims to tackle these challenges and help bridge AI/ML with basic biomedical research.

Over the next five years, the Song lab will investigate several basic research problems in evolutionary biology and genomics, and develop a suite of robust, scalable AI/ML models and statistical methods to benefit the broader community. In particular, they will develop innovative models to learn complex probability distributions over biological sequences (DNA, RNA, and proteins), to decipher the intricate information contained in them and to understand the functional constraints they entail. These efforts will have wide-ranging applications, including phylogenetic tree reconstruction, viral evolution prediction, variant effect prediction, transfer learning in genomics, and protein design. In parallel, the lab will also develop novel computational methods for genomics and continue collaborating with biologists to tackle basic research questions. Lastly, this project will integrate research with education to train a generation of researchers capable of developing cutting-edge AI/ML models for biology.