ABSTRACT
One of the “holy grails” in immunology is to be able to directly predict tight-binding variable chain antibody
sequences in silico against foreign or non-self “antigenic” proteins. Immunoglobulin chain rearrangement can
potentially encode approximately 10^16 different variants of antibody heavy and light chain sequences. However,
only a small fraction of the sequence space is generally accessed for evolving antibodies against foreign proteins.
The computational challenge is to go from a model of the structure of an antigen to predicting a set of antibody
chain sequences that can bind tightly to the antigen. If this is solved, it might be possible to move in less than 24
hours from the first cryo-electron microscopy structure of a novel viral protein to a set of potent antibody-like
molecular candidates for testing. Towards solving this problem, this project aims to develop a deep learning
architecture that will take as input thermodynamic, quantum mechanical (density functional), and local structure-
based network topographical features of the antigens and their cognate antibodies, and will output their
respective binding affinity constants.
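
As a rough illustration of the input/output contract described above (antigen and antibody feature vectors in, a
binding affinity estimate out), a minimal sketch might look like the following. It is not the proposed architecture:
the feature dimensions, the function names init_params and predict_log_kd, the single hidden layer, and the choice
of log10(Kd) as the regression target are all assumptions made for illustration, and the weights are random rather
than trained.

```python
import numpy as np


def init_params(n_in, n_hidden, seed=0):
    """Randomly initialise a one-hidden-layer regressor (hypothetical, untrained)."""
    rng = np.random.default_rng(seed)
    return {
        "W1": rng.normal(0.0, 0.1, size=(n_in, n_hidden)),
        "b1": np.zeros(n_hidden),
        "W2": rng.normal(0.0, 0.1, size=(n_hidden, 1)),
        "b2": np.zeros(1),
    }


def predict_log_kd(params, antigen_feats, antibody_feats):
    """Map paired feature vectors to one scalar per pair.

    antigen_feats:  (n_pairs, d_antigen) array, e.g. thermodynamic, quantum
                    mechanical, and structural descriptors (assumed layout).
    antibody_feats: (n_pairs, d_antibody) array for the cognate antibodies.
    Returns an (n_pairs,) array interpreted here as log10(Kd); that
    interpretation is an assumption of this sketch.
    """
    # Concatenate the two feature sets so each row describes one antigen-antibody pair.
    x = np.concatenate([antigen_feats, antibody_feats], axis=1)
    # Hidden layer with ReLU activation.
    h = np.maximum(0.0, x @ params["W1"] + params["b1"])
    # Linear output layer: one regression value per pair.
    return (h @ params["W2"] + params["b2"]).ravel()
```

With 6 antigen features and 5 antibody features per pair, for example, init_params(11, 8) followed by
predict_log_kd(params, antigen_feats, antibody_feats) returns one value per antigen-antibody pair. In practice the
weights would be fitted to measured affinities, and the feature extraction step is left out of this sketch entirely.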
We will design a generative adversarial network (GAN), which we think is uniquely suited for regression-based
ML approaches for the immune system, to discover associations between the epitope and the variable chain
features. This approach requires a large data stream of antigen and cognate antibody sequences, which until
recently was difficult to obtain. A recently described single B-cell receptor (BCR) specific tagging method coupled
with single cell deep sequencing (“linking B cell receptor to antigen specificity through sequencing” or LIBRA-
seq) can rapidly isolate and sequence the BCR variable chain coding regions that can bind with high selectivity
to antigenic epitopes.
Towards the specific project goals, in Task 1, LIBRA-seq will be used to rapidly identify and generate candidate
immunoglobulin coding sequences raised against specific linear and nonlinear epitopes (against controls) that are
injected into a mouse model, producing large training sets. The epitopes will be chosen through
computational/molecular modeling, with priority given to (but not restricted to) SARS-CoV-2 Spike protein
epitopes. In Task 2, these training sets, along with other data sets already available in public databases, will be
used to derive a series of structural features (described above), which will in turn be used to train the GAN. In
Task 3, the predicted epitope-antibody interactions will be validated by direct experiments with synthetic antibody
and phage-display systems. Thus, the proposed strategy combines foundational principles in evolutionary biology,
genomics, structural chemistry, and computer science to solve a general biological engineering problem.
Results from this project are expected to lay the foundations for a rigorously tested and fully automated
machine-learning system that could rapidly generate synthetic antibody candidates from the structure of a novel
virus protein, which could strengthen the rapid response to a future pandemic. If the project succeeds, it would
also substantially improve the ability to develop targeted antibody therapies against non-infectious or chronic
diseases and to produce antibody-based industrial enzymes.
The team: The team-leads of this multi-institutional research project comprise a computer scientist, a protein
crystallographer, an immunologist, and a molecular biologist.