Deep learning methods for genotyping structural variants in human genomes - PROJECT SUMMARY:
Structural variants (SVs) play a causal role in numerous diseases. However, our ability to detect and analyze
disease-causing SVs, particularly de novo SVs, in short-read genome sequencing data is limited by inaccurate
genotyping (determining zygosity). A substantial gap exists between the genotyping accuracy for small
variants, e.g., single nucleotide variants, and that for SVs. Improving the accuracy of SV genotyping will increase the
rates of molecular diagnosis, improve our understanding of multiple diseases, and expand our knowledge of
human genetic variation. Our aim is to develop more accurate tools for genotyping SVs in short-read genome
sequencing data by incorporating the specific genomic context, sequencing instrument, and analysis pipeline
into the genotyping model. Instead of attempting to develop a parametric model for those complex and
interconnected processes, we generate estimates of the expected evidence using simulation. Our goals are to:
1. Develop a deep learning-based SV genotyper that automatically learns informative features shared by the
real and simulated data in an image-based representation of the SV. Treating SV genotyping as an image
similarity problem will enable us to more accurately genotype the many different SVs that might exist, not just
those observed previously.
2. Deploy our new method to generate accurate genotypes for an ensemble of short- and long-read-derived SV
call sets in thousands of human genomes. The resulting dataset will increase our understanding of the
spectrum of structural variation across diverse populations.
3. Leverage our similarity model to automatically correct otherwise imprecise or incorrect SV descriptions;
doing so will increase genotyping accuracy, improve the integration of different SV call sets, and enable
more sensitive SV discovery in the future.
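
As an illustration of the image-similarity framing in goal 1, the following minimal sketch assigns the genotype whose simulated image lies closest to the real image in an embedding space. It is not the project's implementation: the summary does not specify the image encoding, the network, or the distance measure. The function names `embed` and `genotype_by_similarity`, the placeholder embedding (flatten and L2-normalize, standing in for a learned network), and the use of Euclidean distance are all assumptions made for illustration.

```python
import math


def embed(image):
    """Hypothetical placeholder embedding: flatten the pixel grid and L2-normalize.

    A trained deep network would take the place of this function.
    """
    flat = [float(px) for row in image for px in row]
    norm = math.sqrt(sum(px * px for px in flat))
    return [px / norm for px in flat] if norm > 0 else flat


def genotype_by_similarity(real_image, simulated_images, embed_fn=embed):
    """Return the genotype whose simulated image is nearest to the real image.

    simulated_images maps a genotype label (e.g. "0/1") to an image generated
    by simulation for the same SV under that genotype (assumed input format).
    """
    real_vec = embed_fn(real_image)
    distances = {}
    for genotype, sim_image in simulated_images.items():
        # Euclidean distance between embeddings (an assumed similarity measure)
        distances[genotype] = math.dist(real_vec, embed_fn(sim_image))
    best = min(distances, key=distances.get)
    return best, distances
```

In this sketch an image is a list of pixel rows and each candidate genotype contributes one simulated image. Because the comparison is made against simulations of the specific SV rather than against previously observed examples, the same procedure applies to an SV that has not been seen before.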