ECOD: Large scale classification of predicted and experimental protein structures - Project Summary
Classification of protein domains has historically served to contextualize the 3D structural data generated by experimental structure determination methods such as X-ray crystallography, nuclear magnetic resonance spectroscopy, and electron microscopy. Our database, Evolutionary Classification of protein Domains (ECOD), has served the biological community for seven years by cataloguing evolutionary relationships between domains from experimental structures. The recent advent of high-accuracy structure prediction methods, such as AlphaFold (AF) and RoseTTAFold (RF), and the consequent release of 1 million predicted structures in the AlphaFold Database (AFDB) herald a paradigm shift in structural biology and domain classification. The rate of structure deposition is expected to jump a hundred- to a thousand-fold. We propose to take advantage of this revolution and transform ECOD into a comprehensive classification of the entire protein universe using sequence, structure, and functional evidence. By simultaneously classifying experimental and predicted structures of proteins from model organisms and human pathogens, our classification will help the scientific community critically evaluate structure models and use the evolutionary information to discover and experimentally characterize protein function. Classifying AF models challenges the ECOD pipeline with a 50-fold increase in workload and with the significant fraction of non-globular and low-quality regions in the models. Thus, our first Aim is to upgrade ECOD's infrastructure, develop methods to identify single domains in AF models, and integrate sequence, structure, and functional site similarities into our automatic classification.
Compared to the current ECOD workflow, which relies on human experts for structure- and function-based classification, these improvements will drastically decrease the need for manual curation and will allow us to achieve our second Aim: classifying the domains of over 1 million released AF models into ECOD via a combination of computational pipelines and minimal manual effort (0.25% to 1% of cases). Using the deluge of AF models, the new automatic pipeline, and the expertise of human curators, we expect both to significantly improve ECOD and to evaluate the quality of AF models by (1) covering all known protein families in Pfam, (2) confirming remote homology via evolutionary intermediates, (3) comparing evolutionarily related experimental and predicted structures, and (4) resolving errors and inconsistencies through periodic quality checks. Finally, in our third Aim, we will take the lead in making functional discoveries for biomedically important proteins classified by ECOD, studying virulence factors (VFs) in bacterial pathogens that are modeled in AFDB or studied by our experimental collaborators, the Orth lab. Fast-evolving VFs have been a challenge for both structure prediction and sequence-based functional inference. We will identify candidate VFs in two dozen bacterial pathogens, obtain their structure models, and infer their function from similarities to known proteins in structure and functional sites. Promising hypotheses will be tested experimentally in the Orth lab through biochemical and genetic assays.