Project summary
The human reference pangenome, which represents a collection of genome sequences in a single data
structure, has the potential to transform human genetics applications. Compared to a traditional linear
reference genome, pangenomes enable analysis of megabases of genetic sequence that were previously
ignored, reduce bias when analyzing diverse genomes, and provide dramatically improved genotyping of
structurally complex regions of the genome. These complex regions likely harbor medically relevant variants
contributing to a range of human traits. However, pangenomes have yet to be integrated into medical genetics
and complex trait workflows due to a lack of analysis and visualization tools that are accessible to non-experts.
Our central hypothesis is that pangenomes can be used to improve fine-mapping of trait associations
and detection of pathogenic variants in complex regions by identifying particular paths enriched in individuals
with a phenotype of interest. We focus on developing and applying tools that leverage pangenomes to identify,
visualize, and fine-map genomic loci associated with complex traits. The tools proposed below are motivated
by two major challenges identified through our own efforts to this end. First, visualizing and browsing pangenome
subgraphs for loci of interest, a critical step in exploring and understanding complex genomic regions,
is currently a cumbersome and time-consuming process involving multiple command-line tools geared toward
bioinformatics experts. Second, there is a lack of tools for integrating the reference pangenome with existing
biobank datasets for which both genotype and phenotype data are available for complex trait analysis.
Our proposal integrates multiple large datasets encompassing a range of technologies and builds on
existing pangenome resources and the computational infrastructure developed by the Human Pangenome Reference
Consortium (HPRC). In particular, we use genotype data and whole-genome sequencing (WGS) datasets available for
hundreds of thousands of individuals from a range of ancestries in the UK Biobank and All of Us, as well as thousands of phenotypes
available for these samples. A key goal is to enable backwards compatibility with existing biobank-scale
datasets that have been mapped to linear reference genomes, which will facilitate more immediate use of the
pangenome reference. We additionally use near-complete long-read assemblies and the reference
pangenomes (primarily Minigraph-Cactus) released by the HPRC. Further, our tools are designed to integrate with
the current pangenome computational ecosystem by incorporating existing file formats (e.g., rGFA) and toolkits
(e.g., vg). To this end, we will develop a web-based pangenome browser that integrates with existing data based
on linear genomes (Aim 1), develop metrics to quantify local graph complexity and use these metrics to
characterize existing GWAS signals (Aim 2), and integrate pangenomes with existing biobank datasets to
perform fine-mapping and visualization of individual trait-associated loci (Aim 3).