Refining Mendelian disease analysis via detection of clinically relevant repeat variants - PROJECT SUMMARY

Whole genome sequencing (WGS) has the potential to profile all clinically relevant genetic variants simultaneously. However, clinical variant discovery pipelines have focused largely on coding single nucleotide variants (SNVs), and to a lesser extent on regulatory SNVs and small indels, ignoring more complex classes of pathogenic variants such as repeats or structural rearrangements. Repeats can take many forms; we consider three classes: short tandem repeats (STRs), variable number tandem repeats (VNTRs), and low-copy repeats or segmental duplications, which together account for more than 8% of the human genome. These variant classes have been implicated in a number of Mendelian diseases. More than 30 disorders, primarily neurodegenerative, are caused by STR expansions, including Huntington’s Disease, Fragile X Syndrome, ALS/FTD, and hereditary ataxias. Similarly, VNTRs have been implicated in a range of psychiatric and other traits, including medullary cystic kidney disease and type 1 diabetes. In many cases, disease progression is correlated with germline repeat counts, but sequence variation within individual repeat units and somatic instability of repeat length have also been shown to be pathogenic in some cases. Finally, mutations in more than 100 duplicated genes have been implicated in rare Mendelian disorders and cancer, including PMS2 in Lynch Syndrome and STRC in hearing loss. Taken together, diseases associated with these repeat classes affect millions of individuals worldwide. Despite their relevance to disease, these repeat types are typically absent from sequence analysis pipelines because of the bioinformatics challenges they present. Over the last several years, we and others have made significant progress in developing methods to analyze clinically relevant repeats from short reads.
However, important challenges remain, including genotyping long, complex, imperfect, or GC-rich repeats, inferring clinically relevant somatic variation, and reducing the computational burden of existing methods. Further, existing frameworks for predicting the pathogenicity of individual SNVs or indels are not applicable to most repeats, so prioritization methods are needed to predict the impact of new repeat variants. The goal of this project is to make repeat analysis a standard component of existing Mendelian variant calling pipelines. To this end, we will develop novel methods for profiling repeat variants from long reads (Aim 1), extend our existing methods for short reads to consider more complex variant types (Aim 2), and establish a framework for prioritization of pathogenic repeat mutations (Aim 3).