Chemical language models for peptide engineering - Summary

Therapeutic peptides have emerged as a compelling and versatile class of molecules with immense promise in the clinic. The growing interest in therapeutic peptides underscores the urgent need for robust computational methodologies to support their discovery, optimization, and application. And yet, the recently developed deep-learning language models, which have shown monumental success in the analysis of proteins and small molecules, lack a reliable format for peptide discovery and design. Existing biological language models have been trained on datasets consisting either of large, multi-domain proteins or of small molecules. To date, this has produced models ill-suited for modeling peptides, which are larger than small molecules but smaller than typical proteins. Therefore, we see a critical need to develop peptide foundation models that can be fine-tuned for various downstream tasks. For maximum usefulness, such models will need to be able to encode peptide backbone modifications, hundreds of non-canonical amino acids (natural and unnatural), and various side-chain modifications and cyclizations. Here, we will develop such a modeling framework, building on and extending prior work for small molecules and adapting it to peptides. The proposing team of PI Wilke and co-I Davies jointly has extensive experience in peptide biochemistry, machine learning, and LLMs, and UT Austin is uniquely positioned to provide the computational resources required for this project, through both the Texas Advanced Computing Center and the Center for Generative AI. We have three aims: to (1) develop chemical language models for peptides, (2) validate our models on the tasks of engineering membrane diffusion and cellular entry, and (3) develop and validate language models for protein–peptide interactions. In aggregate, this project will develop a robust platform for investigating peptide biochemistry.
The models we develop will open up several avenues of discovery, including natural peptide identification, biochemical characterization of functional peptides, and classification of peptide activities. Ultimately, this work will enable predictive modeling of peptide-based macromolecules, with applications in natural peptide discovery, drug development, and rational peptide design.