Integrating the reference pangenome with biobank-scale data for complex trait analysis - Project summary

The human reference pangenome, which represents a collection of genome sequences in a single data structure, has the potential to transform human genetics applications. Compared to a traditional linear reference genome, pangenomes enable analysis of megabases of genetic sequence that were previously ignored, reduce bias when analyzing diverse genomes, and provide dramatically improved genotyping of structurally complex regions of the genome. These complex regions likely harbor medically relevant variants contributing to a range of human traits. However, pangenomes have yet to be integrated into medical genetics and complex trait workflows due to a lack of analysis and visualization tools that are accessible to non-experts. Our central hypothesis is that pangenomes can be used to improve fine-mapping of trait associations and detection of pathogenic variants in complex regions by identifying particular paths through the pangenome graph that are enriched in individuals with a phenotype of interest. We focus on developing and applying tools that leverage pangenomes to identify, visualize, and fine-map genomic loci associated with complex traits. The tools proposed below are motivated by two major challenges identified through our own efforts toward this goal. First, visualizing and browsing pangenome subgraphs for loci of interest, a critical step in exploring and understanding complex genomic regions, is currently a cumbersome and time-consuming process involving multiple command-line tools geared toward bioinformatics experts. Second, there is a lack of tools for integrating the reference pangenome with existing biobank datasets for which both genotype and phenotype data are available for complex trait analysis. Our proposal integrates multiple large datasets encompassing a range of technologies and builds on existing pangenome resources and the computational infrastructure developed by the Human Pangenome Reference Consortium (HPRC). 
In particular, we use genotype data and whole-genome sequencing (WGS) datasets available for hundreds of thousands of individuals from a range of ancestries in the UK Biobank and All of Us, as well as thousands of phenotypes available for these samples. A key goal is to enable backwards compatibility with existing biobank-scale datasets that have been mapped to linear reference genomes, which will facilitate more immediate use of the pangenome reference. We additionally use near-complete long-read assemblies and the reference pangenomes (primarily Minigraph-Cactus) released by the HPRC. Further, our tools are designed to integrate with the current pangenome computational ecosystem by incorporating existing file formats (e.g., rGFA) and toolkits (e.g., vg). To this end, we will develop a web-based pangenome browser that integrates with existing data based on linear genomes (Aim 1), develop metrics to quantify local graph complexity and use these metrics to characterize existing GWAS signals (Aim 2), and integrate pangenomes with existing biobank datasets to perform fine-mapping and visualization of individual trait-associated loci (Aim 3).