Software for the complete characterization of antibody repertoires: from germline and mRNA sequence assembly to deep learning predictions of their protein structures and targets - The B cell population in each individual produces an estimated 10^10 different antibodies, collectively known as the antibody repertoire. This extraordinary diversity is essential for responding to the unique history of infections, vaccinations and cancer encountered over an individual’s lifetime. Conversely, regulatory errors in the system play a pivotal role in a host of autoimmune diseases. Antibodies are composed of two proteins, a heavy chain and a light chain, each containing a variable region (VH and VL, respectively); together these regions confer antigen binding specificity. Diversity is initiated through differential recombination at the three V region encoding loci to produce the naïve repertoire. Upon antigen exposure, B cells expressing an antibody specific to that antigen undergo clonal expansion and concentrated somatic hypermutation (SHM) of the V region sequences that code for the antigen recognition domain. Those clonally derived B cells (clonotypes) each express a different sequence, and thereby a different structural variant, of the initial unmutated antibody. Cells expressing higher affinity variants are selected for in a process known as affinity maturation. In this way, the mature repertoire is built from the history of antigenic encounters by that individual. Efficient deciphering of that history could contribute to improving human health in numerous ways, from better clinical decision making to improved diagnostics and therapeutics. Toward that goal, ongoing technological advances in DNA/RNA sequencing, protein structure modeling software and high-performance scalable computer hardware are making virtual, repertoire-scale antibody structure prediction and antigen screening attainable in the not-too-distant future.
In this Direct to Phase II application, we propose to build a software suite that bridges the gap between genomics and structural biology, enabling antibody repertoires to be deciphered and mined in exquisite detail. To do so, we first leverage our highly extensible sequence assembler, XNG, to produce haplotype-phased and annotated sequences of the germline IG loci, from which the naïve repertoire can be simulated (Aim 1). Next, XNG is used to assemble and annotate bulk BCR-seq data, producing the linear VH and VL encoding sequences of the mature repertoire (Aim 2). Translated repertoire sequences are then used as input for our protein modeling software, NovaFold-Ab and NovaFold-AI, which predict high-accuracy 3D antibody structures (Aim 3). Those antibody structure libraries are then used in virtual screens with our protein interaction modeling program, NovaDock, to identify members that bind to a target antigen (Aim 4). Screens can also be restricted to specific epitopes of interest, for example, those known to elicit neutralizing antibodies. If realized, these capabilities will create significant commercial opportunities by complementing existing technology for improving clinical care and personalized medicine, and by aiding the development of faster, more cost-effective diagnostics and therapeutics.