Software for the complete characterization of antibody repertoires: from germline and mRNA sequence assembly to deep learning predictions of their protein structures and targets

The B cell population in each individual produces an estimated 10^10 different antibodies, collectively known as
the antibody repertoire. This extraordinary diversity is essential for responding to the unique history of
infections, vaccinations and cancer encountered over an individual’s lifetime. Conversely, regulatory errors in
the system play a pivotal role in a host of autoimmune diseases. Antibodies are composed of two proteins, a
heavy and light chain, each containing a variable region, VH and VL, which together confer antigen binding
specificity. Diversity is initiated through differential recombination at the three V region encoding loci to produce
the naïve repertoire. Upon antigen exposure, B cells expressing an antibody specific to that antigen undergo
clonal expansion and concentrated somatic hypermutation (SHM) of V region sequences that code for the
antigen recognition domain. Those clonally derived B cells (clonotypes) each express a different sequence, and
thereby a different structural variant, of the initial unmutated antibody. Cells expressing higher-affinity variants are selected
for in a process known as affinity maturation. In this way, the mature repertoire is built from the history of
antigenic encounters by that individual. Efficient deciphering of that history could contribute to improving
human health in numerous ways, from better clinical decision making to improved diagnostics and therapeutics.
Toward that goal, ongoing technological advances in DNA/RNA sequencing, protein structure modeling
software, and high-performance scalable computer hardware are making virtual, repertoire-scale antibody
structure prediction and antigen screening attainable in the not-too-distant future.
In this Direct to Phase II application, we propose to build a software suite that bridges the gap between
genomics and structural biology, enabling antibody repertoires to be deciphered and mined in exquisite detail.
To do so, we first leverage our highly extensible sequence assembler, XNG, to produce haplotype-phased and
annotated sequences of the germline IG loci from which the naïve repertoire can be simulated (Aim 1). Next,
XNG is used to assemble and annotate bulk BCR-seq data, producing the linear VH- and VL-encoding
sequences of the mature repertoire (Aim 2). Translated repertoire sequences are then used as input for our
protein modeling software, NovaFold-Ab and NovaFold-AI, where high-accuracy 3D antibody structures are
predicted (Aim 3). Those antibody structure libraries are then screened virtually with our protein interaction
modeling program, NovaDock, to identify members that bind to a target antigen (Aim 4). Screens can also be
refined to specific epitopes of interest, for example, those known to elicit neutralizing antibodies. If realized,
these capabilities will create significant commercial opportunities, complementing existing technology for
improving clinical care and personalized medicine and aiding the development of faster, more
cost-effective diagnostics and therapeutics.
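
The hand-off between sequence assembly (Aim 2) and structure prediction (Aim 3) relies on translating assembled V region nucleotide sequences into protein sequences. A minimal sketch of that translation step is shown below, assuming the standard genetic code and an already known reading frame. The function name `translate` and the example sequence are hypothetical illustrations; they are not part of XNG or NovaFold, whose interfaces are not described here.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), with amino acids listed
# in TCAG codon order: TTT, TTC, TTA, TTG, TCT, ... GGG.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(nt_seq, frame=0):
    """Translate a nucleotide sequence into a one-letter protein string.

    Hypothetical helper for illustration only. Stop codons are written
    as '*', and codons containing ambiguous bases (e.g. N) as 'X'.
    Trailing bases that do not fill a complete codon are ignored.
    """
    seq = nt_seq.upper().replace("U", "T")  # accept RNA or DNA input
    residues = []
    for i in range(frame, len(seq) - 2, 3):
        residues.append(CODON_TABLE.get(seq[i:i + 3], "X"))
    return "".join(residues)


# Illustrative fragment only, not taken from any real repertoire data set.
example_vh_fragment = "GAGGTGCAGCTGGTGGAG"
print(translate(example_vh_fragment))
```

In practice the reading frame would come from the V region annotation rather than being supplied by hand, and sequences containing stop codons or ambiguous residues would likely be filtered out before structure modeling.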