Mapping the sequence-function landscape of a small reactive protein in ultra-high-throughput - Project Summary/Abstract Understanding how a protein’s amino acid sequence controls biophysical properties like stability, folding, and reactivity is key to designing better enzymes. However, this task remains challenging due to the complexity of protein conformational ensembles and the scarcity of functional data across sequence space. As a result, computational enzyme design is difficult, and most designed enzymes fail to function with no clear indication of why. Large-scale experiments can help uncover how specific residues and higher-order interactions affect function, offering the basis for improved computational models and better strategies for enzyme design. However, most surveys of sequence space either cover relatively few variants or sample only a very narrow region of sequence space deeply. Combining high-throughput functional mapping with de novo design of new enzymes has been out of reach due to the size and complexity of the enzymes typically targeted in such experiments. To address this gap, we propose to map the sequence-function landscape of SpyCatcher, a small reactive protein that forms a covalent bond with its substrate SpyTag, investigating both natural and de novo sequences. While not a true enzyme, SpyCatcher’s small size (80 residues) and simple reaction mechanism make it a tractable minimal model for enzyme-like reactivity. In Aim 1, we will use cDNA display-based assays to measure the reactivity, substrate binding kinetics, and stability of ~1 million natural SpyCatcher variants. The unprecedented scale of these functional measurements on a uniquely simple system will allow us to rigorously explore additive and non-additive sequence-function relationships and identify predictive patterns using machine-learning models.
In Aim 2, we will leverage de novo design to explore further reaches of sequence space and test different design strategies at high throughput, creating SpyCatcher-like proteins with sequences far from natural ones. These designs (three rounds of ~500,000 sequences) will be characterized with the same high-throughput assays to iteratively inform our design process and improve criteria for designing proteins with functional reactivity. We aim to generate high-quality, large-scale datasets spanning both natural and unnatural sequence space in order to uncover fundamental rules of reactive protein design, which can inform future engineering efforts. By producing interpretable data that link sequence to biophysical properties, we seek to enhance our ability to design proteins with tailored functions.