Project Summary/Abstract
The T cell receptor (TCR) repertoire of a subject resembles a huge book with millions of records, with each
record being a TCR. Some of these records are generated by a random process and do not contain any clinically
relevant information. Some other records, however, encode current or past history of immune-related diseases
or exposure to pathogens. Accurate decoding of a TCR book gives us immensely valuable information on the
health record of the corresponding subject. Such information can be collected through a single blood draw, which
is much more convenient than conducting separate tests for different diseases or exposures, and thus is ideal
as a screening or monitoring tool for a large population. In addition, T cells are sensitive enough to detect even very
small amounts of antigen, and T cell memory lasts many years after the initial immune response. These features
make the TCR repertoire an attractive source for constructing biomarkers for many immune-related diseases.
An important factor that affects the TCRs in a TCR repertoire is the Human Leukocyte Antigen (HLA) of the
corresponding subject. Each person carries up to 16 distinct HLA alleles, drawn from the thousands of HLA alleles
in the human population. Ignoring HLA information reduces the accuracy of TCR-based biomarkers,
particularly for subjects with relatively rare HLA alleles. However, previous work has ignored HLA because
HLA alleles are highly polymorphic. We propose a statistical framework that combines powerful computational tools,
such as neural networks, with rigorous statistical models to fill this critical unmet need. Our methods deliver
HLA-specific associations between TCRs and disease status and use TCRs to predict the disease status of a subject
while conditioning on that subject's HLA alleles. We will evaluate our methods by predicting infection with
cytomegalovirus or SARS-CoV-2, although the methods are general and can be applied to many other
diseases or conditions that induce a T cell response.