The current human reference genome, GRCh38, plays a central role in medical and population human genetics.
It primarily models a single human individual and is missing hundreds of thousands of large structural variations
segregating in the population. This underrepresentation of genetic diversity leads to various artifacts in data
analysis and significantly hampers our understanding of the functional and medical relevance of these large
human variations, which may collectively have a pervasive impact. To address this issue, we will extend our
previous work on sequence graphs and alignment algorithms and construct a pan-genome reference graph from
hundreds of long-read human assemblies that more completely represent genetic diversity. Specifically, we will
(1) design a reference graph model with a stable coordinate system compatible with GRCh38 and develop
toolkits and libraries to interact with this model; (2) develop minimizer-based sequence-to-graph alignment
algorithms for short and long sequences; (3) incrementally construct a reference graph by mapping assemblies
to an existing graph and updating the graph; and (4) develop a graph-based genotyping algorithm and apply it
to short-read-based projects to call structural variations missed by the current pipelines. Upon completion, the
resources from the proposed project could replace current practices based on a linear genome and will enable the profiling and
study of complex human variations missed in most current research.
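
As an illustration of the minimizer idea named in aim (2), the following is a minimal sketch and not the proposed algorithm. It selects the smallest k-mer in each window of w consecutive k-mers and uses shared minimizers as candidate anchors between two sequences. The function names, the default values of k and w, the lexicographic ordering of k-mers, and the restriction to plain linear strings on the forward strand are all simplifying assumptions made for this example; a sequence-to-graph aligner would index the sequences stored in graph segments instead.

```python
def minimizers(seq, k=5, w=4):
    """Return sorted (position, k-mer) pairs chosen as window minimizers.

    Assumption for illustration: k-mers are compared lexicographically and
    only the forward strand is considered.
    """
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    picked = set()
    # Slide a window over w consecutive k-mers and keep the smallest one;
    # ties are broken by taking the leftmost k-mer in the window.
    for start in range(len(kmers) - w + 1):
        window = kmers[start:start + w]
        best = min(range(w), key=lambda j: (window[j], j))
        picked.add((start + best, window[best]))
    return sorted(picked)


def shared_anchors(query, target, k=5, w=4):
    """Pair up positions where query and target share a minimizer k-mer."""
    index = {}
    for pos, kmer in minimizers(target, k, w):
        index.setdefault(kmer, []).append(pos)
    return [(qpos, tpos)
            for qpos, kmer in minimizers(query, k, w)
            for tpos in index.get(kmer, [])]
```

Because only one k-mer per window is kept, the index is much smaller than a full k-mer index, while any two sequences that share a long enough exact match still share at least one minimizer. The anchors returned here would then be chained and extended in a full aligner, a step this sketch omits.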