The human genome reference sequence is one of the foundations of genome sciences, especially in the context
of next-generation sequencing (NGS) analysis. The reference has enabled discoveries in biomedical research
and been particularly instrumental in human disease gene identification. However, the human genome reference
is limited by its static and linear nature. Specifically, the current reference lacks the featural and contextual
flexibility to represent the breadth of human variation. Important elements of individual genomes are either
missed or incorrectly represented. As a solution that will bridge the next generation of reference assemblies with
population genome sequencing studies, we have developed a K-mer-based indexing approach. This method is
more efficient computationally, provides accurate representation in the context of populations and facilitates the
analysis of diverse human genomes. Our goal is to use this strategy in developing a robust computational
architecture that will encode and annotate large collections of genomes in the context of a pan-genome.
First, we plan to develop a scalable, efficient K-mer representation of a large collection of haplotype/phased
reference genomes, by 1) generating an index of all K-mers in human reference genome GRCh38 in a manner
that can efficiently store variant information as metadata, and then 2) incrementally updating the K-mer index to
include all novel K-mers derived from ongoing population sequencing efforts, while 3) developing schemes for
directly analyzing compressed genomic data.
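
As an illustration of steps 1) and 2) above, the following is a minimal sketch of a K-mer index that stores per-K-mer metadata and can be updated incrementally. It is not the project's implementation: the class name KmerIndex, the metadata fields, the toy sequences and the small K are all assumptions chosen for readability, and a production index would use a compressed, disk-backed structure rather than an in-memory dictionary.

```python
def kmers(seq, k):
    """Yield (offset, k-mer) for every window of length k in seq."""
    for i in range(len(seq) - k + 1):
        yield i, seq[i:i + k]


class KmerIndex:
    """Toy K-mer index: each K-mer maps to illustrative metadata (hypothetical layout)."""

    def __init__(self, k):
        self.k = k
        self.index = {}  # k-mer -> {"positions": [...], "sources": set()}

    def add_sequence(self, source, contig, seq):
        """Index every K-mer of seq and return how many were novel to the index."""
        novel = 0
        for offset, kmer in kmers(seq.upper(), self.k):
            if kmer not in self.index:
                novel += 1
                self.index[kmer] = {"positions": [], "sources": set()}
            entry = self.index[kmer]
            entry["positions"].append((source, contig, offset))
            entry["sources"].add(source)
        return novel


# Step 1: index a toy stand-in for the reference (not real GRCh38 sequence).
idx = KmerIndex(k=4)
novel_ref = idx.add_sequence("GRCh38", "chr_toy", "ACGTACGTGA")      # 6 distinct K-mers

# Step 2: add a sample that differs by one base; only K-mers spanning it are new.
novel_sample = idx.add_sequence("sample1", "chr_toy", "ACGTACCTGA")  # 4 novel K-mers
```

In this sketch the second call adds only the four K-mers that overlap the substituted base, which is the property that lets an index grow with novel variation rather than with the total number of genomes added.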
Second, we plan to apply K-mer representation to genomic analysis by 1) providing the entirety of known
human genetic variation in an aggregated index that is computationally efficient and easy to understand, 2)
developing functions for our pan-genomic index that support ultra-rapid queries, such as for clinically important
variants, and 3) linking conventional coordinate information to the K-mer metadata in the pan-genome index so
that genetic variation can be annotated against a particular genome reference.
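
The query and coordinate-linking ideas in 2) and 3) can be sketched as follows. This is a hedged illustration only: the function names (build_index, snv_kmers, carriers), the tuple layout of the metadata and the toy sequences are assumptions, and the presence-based lookup shown here stands in for whatever query scheme the pan-genome index ultimately uses.

```python
def build_index(genomes, k):
    """Map each K-mer to the (genome, contig, offset) tuples where it occurs."""
    index = {}
    for genome, contig, seq in genomes:
        for i in range(len(seq) - k + 1):
            index.setdefault(seq[i:i + k], []).append((genome, contig, i))
    return index


def snv_kmers(ref_seq, pos, alt, k):
    """K-mers carrying allele `alt` that overlap 0-based position `pos`."""
    alt_seq = ref_seq[:pos] + alt + ref_seq[pos + 1:]
    start = max(0, pos - k + 1)
    end = min(pos, len(alt_seq) - k)
    return [alt_seq[i:i + k] for i in range(start, end + 1)]


def carriers(index, ref_seq, pos, alt, k):
    """Genomes that contain every K-mer spanning the queried allele."""
    hits = None
    for kmer in snv_kmers(ref_seq, pos, alt, k):
        genomes = {g for g, _, _ in index.get(kmer, [])}
        hits = genomes if hits is None else hits & genomes
    return hits or set()


# Toy data: a stand-in reference and one sample with a G>C change at position 6.
ref = "ACGTACGTGA"
index = build_index(
    [("GRCh38", "chr_toy", ref), ("sample1", "chr_toy", "ACGTACCTGA")], k=4
)

with_alt = carriers(index, ref, 6, "C", 4)   # {"sample1"}
coords = index["TACC"]                       # [("sample1", "chr_toy", 3)]
```

Each lookup is a dictionary access per K-mer, so the cost of a query depends on K and the variant, not on how many genomes have been indexed. The stored offsets are what tie a K-mer hit back to conventional coordinates on a chosen reference.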
Third, we will create an online web portal for the pan-genome, using cloud computing, to maximize the utility
of our approach, to promote community engagement and to enable contributions from the research community.
We expect that completion of these aims will provide: a scalable computational architecture which incorporates
the continuous addition of variant information without loss of resolution or accuracy; rapid query speeds that will
remain nearly constant as the database grows; and a universally accessible portal using cloud computing.
This work will help resolve the problems that arise from working with multiple reference assemblies. It will improve
researchers' ability to understand the relationship between variants and disease, while also providing substantial
long-term savings in infrastructure and computing costs.