A reference-free computational algorithm for comprehensive somatic mosaic mutation detection

ABSTRACT

Somatic mosaicism (SM), i.e., the presence of cells with somatically acquired mutations, is a driving feature of cancer and several developmental diseases. However, whereas we now have a detailed understanding and predictive models of benign and pathogenic inherited polymorphisms, germline de novo mutations, and tumor mutations, we have only limited knowledge of the burden, allele frequency spectrum, clonal patterns, and mutational signatures of somatic mosaicism in healthy tissues. Recognizing that this missing knowledge is critical for informing experimental design in future studies of mosaicism’s biological and clinical consequences, NIH is launching an ambitious initiative, the Somatic Mosaicism across Human Tissues (SMaHT) project, to construct a comprehensive atlas of human somatic mosaicism. As part of this initiative, funding announcement RFA-RM-22-011 calls for Tool Development Projects to develop “approaches that significantly improve the sensitivity, accuracy, and threshold of detection of all types of somatic variants across the complete genome”. Such comprehensive detection is currently challenging because somatic mosaic mutations span a wide range of mutation types and lengths, yet most of today’s variant detection tools have low sensitivity for larger, structural events. Furthermore, somatic mutations are typically present at very low allele frequency (<1%), and accurate detection of such low-frequency variation is beyond the capabilities of most current tools. We have pioneered a unique k-mer guided detection approach in our RUFUS tool, which was designed for germline de novo mutation detection. This approach focuses on identifying the novel DNA sequence created by a mutation, which allows the same underlying algorithm, with uniform algorithmic behavior and sensitivity, to be applied across the full range of mutation types.
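The unique k-mer idea described above can be illustrated with a minimal Python sketch. This is a toy illustration under stated assumptions, not the RUFUS implementation: the function names (`count_kmers`, `unique_kmers`, `reads_with_unique_kmers`), the use of a separate set of control reads, the strand-independent canonical k-mer form, and the `min_count` threshold are all hypothetical choices made for clarity. The sketch counts k-mers in subject and control reads, keeps those seen repeatedly in the subject but never in the control, and then selects the subject reads that carry them.

```python
from collections import Counter

# Translation table for reverse-complementing DNA (illustrative helper).
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    rc = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, rc)


def count_kmers(reads, k):
    """Count canonical k-mers across reads, skipping k-mers with non-ACGT bases."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if set(kmer) <= set("ACGT"):
                counts[canonical(kmer)] += 1
    return counts


def unique_kmers(subject_reads, control_reads, k, min_count=2):
    """K-mers seen at least min_count times in the subject and never in the control.

    min_count is an assumed, simplistic guard against k-mers created by
    sequencing errors; a real tool would need a more careful error model.
    """
    subject = count_kmers(subject_reads, k)
    control = count_kmers(control_reads, k)
    return {km for km, n in subject.items() if n >= min_count and km not in control}


def reads_with_unique_kmers(reads, unique, k):
    """Select reads that contain at least one of the unique k-mers."""
    return [
        read for read in reads
        if any(canonical(read[i:i + k]) in unique for i in range(len(read) - k + 1))
    ]


# Toy data: the "mutant" read differs from the reference read by one base.
reference_read = "ACGTTGCATGCCAGT"
mutant_read = "ACGTTGCTTGCCAGT"
control_reads = [reference_read, reference_read]
subject_reads = [reference_read, mutant_read, mutant_read]

novel = unique_kmers(subject_reads, control_reads, k=5)
candidate_reads = reads_with_unique_kmers(subject_reads, novel, k=5)
```

Because the selection step depends only on sequence that is new in the subject, the same logic would apply whether that sequence came from a single-base change, an insertion or deletion, or a structural breakpoint. In this sketch the fixed `min_count` threshold also shows why very low allele frequencies are difficult: a mutation supported by too few reads cannot be separated from sequencing error by a simple count cutoff.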
RUFUS has been validated for accurate detection of germline de novo mutations in large discovery datasets and rare-disease diagnostic studies. Our preliminary analyses also indicate that RUFUS has high sensitivity across the full range of somatic mutations. This application proposes to adapt the RUFUS algorithm for somatic mosaic mutation detection with high sensitivity and specificity across the entire spectrum of mutation types, mutation lengths, and allele frequencies, and thus to contribute substantially to the construction of a comprehensive mosaicism atlas. To achieve this overall goal, in the first (UG3) phase of the project we will focus on algorithmic development to improve low-frequency allele detection, empirically characterize RUFUS’s sensitivity and specificity, and ready the tool for adoption into the SMaHT Network’s central analysis pipelines. In the second (UH3) phase of the project, we will integrate RUFUS into the central analysis workflow of the SMaHT consortium, and optimize and extend its performance for analyzing the vast SMaHT somatic mosaicism dataset. We anticipate that RUFUS will contribute substantially to the SMaHT Initiative's goal of comprehensively mapping human somatic mosaicism across individuals, organs, and tissues.