Project Summary: Enabling Comparative Pangenomics
To many in the field, it is clear that we are moving rapidly toward a golden age of vertebrate comparative
genomics in which thousands of high-quality genomes of different species are publicly available and used in
understanding the human genome. Despite the opportunity presented by the growth in available genomes,
the software used to compare complete genomes has largely stagnated: most existing tools are old and
limited in capability. To remedy this situation, we will create a hardened toolkit for
genome comparison and annotation that can be robustly applied to thousands of vertebrate genomes. To
demonstrate this toolkit and deliver its results to the broader genomics community, we will apply it to create a
resource within the existing UCSC and Ensembl Genome Browsers that will incorporate thousands of
vertebrate genomes. Large, well-organized consortia have coalesced to take on the challenge of sequencing
and assembling vertebrate genomes. Our alignments will form a backbone of these projects’ analysis, and our
synthesis of their data will create a resource that is much greater than the sum of what might otherwise be a
series of smaller, fragmented and not directly comparable efforts. We will gather more than 600
vertebrate genomes into our proposed resource in the first year of the project, rapidly delivering results.
Paralleling the growth in available reference genomes, the last decade has been marked by an
explosion in population sequencing projects. Although much of the cataloged human variation has a very
recent evolutionary origin, there is a tremendous opportunity to combine and so better understand intra- and
inter-species change using models from population genetics. We will create pangenome software to (i) avoid
reference bias in species comparisons (i.e., avoiding assumptions about which alleles are fixed when
comparing between species, which is important in quasi-species such as cichlids), (ii) allow ancestral alleles to
be comprehensively estimated, including those that are part of structural variation, and (iii) more easily enable
the study of balancing selection. To demonstrate the utility of comprehensive variation integration, we will
create a prototype pangenome graph for the apes. We will use this graph to identify ancestral alleles and to
dynamically convert annotations between species and assembly versions, and, via population mapping
experiments, we will demonstrate its power for typing segregating but ancient variation. Using knowledge of
ape evolution, we will ultimately extend this graph to adequately model the most complex regions of the human
genome.