Immune-repertoire sequence, which consists of an individual's millions of unique antibody and T-cell receptor
(TCR) genes, encodes a dynamic and highly personalized record of that individual's state of health. Our
long-term goal is to develop the computational models and tools necessary to read this record, so that one day
diverse infections, autoimmune diseases, cancers, and other conditions can be diagnosed directly from repertoire
sequence. The key problem is how to find patterns of specific diseases in repertoire sequence when repertoires
are so complex. Our hypothesis is that a combination of bottom-up (sequence-level) and top-down (systems-
level) modeling can reveal these patterns, by encoding repertoires as simple but highly informative models that
can be used to build highly sensitive and specific disease classifiers. In preliminary studies, we introduced
two new modeling approaches for this purpose: (i) statistical biophysics (bottom-up) and (ii) functional diversity
(top-down), and showed their ability to elucidate patterns related to vaccination status (97% accuracy), viral
infection, and aging. Building on these studies, we will test our hypothesis through two specific aims: (1) we
will develop models and classifiers based on the bottom-up approach, statistical biophysics; and (2) we will
develop the top-down approach, functional diversity, to improve these classifiers. To achieve these aims, we will
use our extensive collection of public immune-repertoire datasets, beginning with 391 antibody and TCR
datasets we have characterized previously. Our team has deep and complementary expertise in developing
computational tools for finding patterns in immune repertoires (Dr. Arnaout) and in the mathematics that
underlies these tools (Dr. Altschul), with additional advice available as needed regarding machine learning (Dr.
AlQuraishi). This proposal is highly innovative in how our two new approaches address previous issues in the
field. (i) Statistical biophysics uses a powerful machine-learning method called maximum-entropy modeling
(MaxEnt), improving on past work by tailoring MaxEnt to learn patterns encoded in the biophysical properties
(e.g., size and charge) of the amino acids that make up antibodies/TCRs; these properties ultimately determine
what targets antibodies/TCRs can bind, and therefore which sequences are present in different diseases. (ii)
Functional diversity fills a key gap in how immunological diversity has been measured thus far, by factoring in
whether different antibodies/TCRs are likely to bind the same target. This proposal is highly significant for (i)
developing an efficient, accurate, generative, and interpretable machine-learning method for finding diagnostic
patterns in repertoire sequence; (ii) applying a robust mathematical framework to the measurement of
immunological diversity; (iii) impacting clinical diagnostics; and (iv) adding a valuable new tool for integrative/big-data
medicine. The expected outcome of this proposal is an integrated pair of robust and well-validated new
tools/models for classifying specific disease exposures directly from repertoire sequence. This proposal
includes plans to make these tools widely available, to maximize their positive impact across medicine.
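
As an illustration of the functional-diversity idea described above (counting antibodies/TCRs that are likely to
bind the same target as partly redundant), the following is a minimal sketch, not the proposal's implementation.
It assumes one common similarity-sensitive formulation, in which each clonotype's frequency is weighted by its
summed similarity to the rest of the repertoire before an effective number of clonotypes is computed. The function
name, the toy frequencies, and the similarity matrix are all hypothetical; in practice the similarity values
would have to come from a binding-similarity model that this sketch does not provide.

```python
import numpy as np


def similarity_weighted_diversity(freqs, similarity, q=1.0):
    """Effective number of functionally distinct clonotypes (illustrative only).

    freqs      : clonotype frequencies or counts (normalized internally)
    similarity : square matrix; entry [i, j] in [0, 1] is an ASSUMED estimate of
                 how likely clonotypes i and j are to bind the same target,
                 with 1.0 on the diagonal
    q          : order of the diversity index (larger q emphasizes common clonotypes)
    """
    p = np.asarray(freqs, dtype=float)
    p = p / p.sum()
    z = np.asarray(similarity, dtype=float)

    # Drop absent clonotypes so that logs and negative powers stay finite.
    keep = p > 0
    p = p[keep]
    z = z[np.ix_(keep, keep)]

    # For each clonotype, the total frequency of everything functionally similar to it.
    zp = z @ p

    if np.isclose(q, 1.0):
        # q = 1 is the limiting (Shannon-like) case.
        return float(np.exp(-np.sum(p * np.log(zp))))
    return float(np.sum(p * zp ** (q - 1.0)) ** (1.0 / (1.0 - q)))


# Toy repertoire: four equally frequent clonotypes (hypothetical values).
uniform = [0.25, 0.25, 0.25, 0.25]

# No functional overlap: diversity is about 4, the raw clonotype count.
no_overlap = np.eye(4)

# Two pairs of clonotypes that bind the same target: diversity drops to about 2.
paired = np.array([
    [1.0, 1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 1.0, 1.0],
])

print(similarity_weighted_diversity(uniform, no_overlap))
print(similarity_weighted_diversity(uniform, paired))
```

With an identity similarity matrix the measure reduces to a conventional effective-number diversity index, so
the toy comparison isolates the effect of accounting for shared binding targets.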