Abstract
Recent advances in high-throughput sequencing technologies enable cost-effective characterization of the
immune system and provide novel opportunities to study the adaptive immune receptor repertoire (AIRR) at the
population scale. In particular, AIRR analysis provides essential insight into the complexity of the immune system
across a large variety of human diseases, including infectious diseases, cancer, autoimmune conditions, and
neurodegenerative diseases. A commonly used assay-based approach (i.e. AIRR-Seq) provides a detailed view
of the adaptive immune system by leveraging the deep sequencing of amplified DNA or RNA from the variable
region of the T and B cell receptor (TCR and BCR) loci. However, the limited number of samples probed by the
AIRR-Seq approach restricts the ability to detect novel population-specific V(D)J gene alleles across ethnically
diverse and admixed populations. Non-targeted next-generation sequencing (NGS), such as whole-genome sequencing (WGS), promises to fill
the existing data gap by providing hundreds of thousands of NGS datasets across various ancestry groups.
However, reliable and scalable bioinformatics algorithms that use non-targeted NGS data to assemble novel
population-specific alleles, and thereby support studies of effect-size heterogeneity across ancestries, have yet
to be developed. Comprehensive population-specific allelic immunogenomics reference databases are also lacking.
This void exacerbates existing health disparities, as discoveries in medical immunogenomics continue to be a
privilege and benefit for populations of European ancestry. The current state-of-the-art databases were built on
genetic data from individuals of European ancestry and thus fail to capture allelic variation across
diverse populations. Ongoing initiatives by the Adaptive Immune Receptor Repertoire Community (AIRR-C) to
improve the representation of diverse populations in reference databases (e.g. OGRDB and VDJbase) remain
limited and so far incorporate only an extremely small number of individuals of non-European
descent. We propose to use a data science approach to study the variation of the human adaptive immune
system at a truly global scale, improving studies of immunological health and diseases, and reducing health
disparities. In this study, we will develop robust and scalable bioinformatics tools and databases that leverage
the largest datasets covering individuals of various ancestries, comprising over half a million NGS samples
spanning AIRR-Seq, RNA-Seq, and WGS technologies. We will perform rigorous benchmarking of the
developed bioinformatics methods based on both simulated and real data to demonstrate the feasibility of using
NGS-based approaches to assemble novel V(D)J alleles. The availability of large and ethnically diverse sets of
samples will allow us to discover novel population-specific V(D)J alleles, which will enrich existing
immunogenomics databases. To promote the dissemination of the
results, the novel alleles and assembled receptor sequences will be shared in an easy-to-use database
with a rich set of functionalities.