A copy number variant discovery pipeline for integrated genome-exome sequencing

Abstract

Copy number variants (CNVs) are deletions and duplications of genomic segments spanning more than 50 base pairs. They represent one of the most penetrant sources of pathogenic variation in neuropsychiatric disorders and affect many other human phenotypes as well. However, the relative impact of CNVs at the resolution of individual genes, exons, or functional categories, and especially across diverse global populations, has never been systematically assessed at scale in neuropsychiatric disorders. This gap can be attributed to technical barriers in CNV discovery and to the lack of large-scale, diverse neuropsychiatric cohorts. Traditional cytogenetic methods for CNV detection, such as chromosomal microarrays (CMA), are relatively low-resolution and have largely precluded gene- and exon-resolution analyses. Recent advances in whole exome sequencing (ES) and whole genome sequencing (GS) have dramatically improved resolution, enabling the discovery of exon-level and sub-exon-level CNVs. However, neither GS nor ES is perfect. GS can interrogate the whole spectrum of CNVs across frequency and size, but it is expensive. ES, though affordable, can only query the coding portion of the genome for rare CNVs. Promisingly, the blended genome exome (BGE) sequencing approach has recently undergone heavy development and rapid adoption in a number of large-scale, diverse neuropsychiatric sequencing efforts, including the Populations Underrepresented in Mental Illness Association Studies (PUMAS) project, NeuroDev, and Akili studies. BGE combines high-coverage ES (~30x) with a low-coverage GS backbone (~2-3x), at a cost comparable to traditional exome sequencing.
With this blend, BGE has married the affordability of ES with the full range of variant detection of GS when used to detect single nucleotide variants (SNVs) across the entire genome. Drawing on our expertise in developing computational methods for CNV detection and association across GS and ES, we believe that BGE is also well suited to capturing the full range of CNVs across the genome, with: significantly improved resolution compared to CMA and ES; significantly lower reference bias compared to CMA; and dramatically lower cost compared to GS. To achieve this, we will extend our GATK-gCNV pipeline for rare CNV detection, together with our ancestry-aware structural variant (SV) imputation pipeline, for use with BGE data. Preliminary results have already shown great promise. We will apply this pipeline to the more than 110,000 available BGE samples across PUMAS, NeuroDev, and Akili to generate a large-scale, diverse CNV callset. These variants will be made publicly available and can immediately be leveraged to advance our understanding of the genetic architecture of neuropsychiatric conditions, especially in the context of diverse genetic ancestry groups.