ABSTRACT
Genetic variations in HLA genes in humans are associated with over 200 diseases, and large-scale genomic
sequencing projects are now generating data on HLA genes from millions of individuals. Despite the immense
clinical relevance of these genes, computational inference of short variants (SNPs or insertions/deletions)
and copy number variations in HLA genes from next-generation sequencing data is difficult because of the genes' highly polymorphic
nature, inter-HLA gene similarity, and strong linkage disequilibrium. Existing tools for HLA variant detection are
error-prone, not designed for scalability, and not interoperable across sequencing formats, and their developers have
no formal mechanisms to provide support after publication. The objective of this application is to develop highly
accurate, robust, scalable, and deployment-ready pipelines for identifying germline and somatic variants in HLA
genes through integration and enhancement of our previously developed tools. To achieve this goal, we aim to
(1) develop tools for detecting short germline and somatic HLA variants by enhancing our Polysolver tool for
allele inference across all HLA genes, and further developing the Mutect3 pipeline for mutation detection; (2)
establish computational approaches for detecting germline and somatic copy number variation in HLA genes by
integrating Polysolver with the GATK-gCNV and ModelSegments tools, respectively; and (3) use the widely adopted
GATK4 framework and Workflow Definition Language (WDL) to create and disseminate robust, scalable and
well-supported HLA variant detection pipelines. This will be the first such comprehensive HLA analysis toolkit,
which we expect will be widely used by both individual researchers and sequencing consortia in multiple disease
communities. Mutect3, which internally employs a “deep sets” architecture, will be the first mutation detection
tool capable of jointly calling germline and somatic short variants and handling multiple references at a genomic
locus. If successful, this project will unlock the hitherto untapped potential of rapidly growing sequencing datasets
by enabling discovery of new HLA alleles, variations in known HLA alleles, and novel HLA-disease associations
that can be directly harnessed for personalized preventive and therapeutic applications.