Abstract
Short tandem repeats (STRs) are highly polymorphic DNA sequences made up of repeated 1–6 bp motifs. Expansions in
dozens of STRs are associated with genetic disease. However, STRs are challenging to sequence and interpret,
meaning that individuals with STR disease often go undiagnosed. In rare disease studies, it is now standard to
prioritize candidate pathogenic SNVs, indels and SVs by excluding variants that have high allele frequencies in
population-scale databases such as gnomAD. However, there is no such genome-wide database available for
large STR expansions. I will produce a publicly available STR variation community resource, stratified by
ancestry, to enable prioritization of candidate pathogenic STR expansions.
Long-read sequencing technologies from PacBio and Oxford Nanopore have been heralded as the solution to accurately
genotype long repeats because their reads can span the repetitive region. However, there are several challenges
when genotyping STRs in long reads that are not adequately addressed by existing approaches. I will develop
a method to genotype STRs from long-read Oxford Nanopore sequencing data. It will identify informative reads
by combining alignment with detection of repetitive regions within reads. It will then infer the genotype by
integrating evidence from multiple reads, informed by my investigation of biases in these technologies.
Drawing together new short- and long-read computational approaches to calling STR expansions and my
population-scale STR catalog, with an emphasis on diverse and under-served populations, this proposal will
establish a genetic diagnosis for hundreds of patients, while searching for new STR disease loci. I will analyze
patient cohorts from the Undiagnosed Diseases Network (UDN), University of Washington, Harry Perkins Institute
of Medical Research and Children’s Mercy Hospital, each enriched for STR-associated phenotypes, to solve cases
and discover new disease-associated STRs in both short- and long-read sequencing data.