Improved genomic sketching for MUMmer and metagenomics - PROJECT SUMMARY

Increasing the efficiency of computational methods has been instrumental to extracting insight from genomic data. Fast aligners such as MUMMER, fast k-mer counters such as JELLYFISH, fast expression quantifiers such as SAILFISH and SALMON, and high-quality, efficient genome assemblers such as MASURCA have been crucial to unlocking the potential of genomic and metagenomic data. Nevertheless, computation remains a time and cost bottleneck in many application areas. Algorithmic sketching methods, such as minimizer schemes, have been a useful technique for improving computational efficiency. However, despite their importance, these sketching techniques are understudied from a theoretical perspective and underused from a practical perspective. We propose to design, implement, test, and validate new sketching approaches based on significant extensions to the successful minimizer sketching schemes, greatly increasing the flexibility of these approaches, expanding their use into new areas including the handling of high-variance or highly repetitive sequences, and providing a new, standard sketching toolkit for genomic method designers and software implementors. These extensions, collectively referred to as marker selection schemes, will enable faster alignment, clustering, and assembly of genomic sequences, and will spur further computational innovation in genomic applications. To inform and validate this algorithmic work, we propose to enhance three important and broad areas of genomic computational methods. First, we will extend the widely used MUMMER aligner with a number of application-specific “modes” that exploit these new and existing sketching schemes to achieve enhanced efficiency and greater sensitivity. This will ensure continued development of this important computational tool and its enhancement for additional applications.
Second, we will enhance the MASURCA genome assembler by updating its integration with the new MUMMER. Third, we will use the developed marker selection schemes, together with additional algorithmic ideas based on geometric embedding of sequences, to develop fast, more accurate estimators of distances between genomic sequences. These approximate distance estimators are essential for a number of metagenomic applications including species classification, clustering, and search. We will advance the computational accuracy of these tasks through these improved estimators. This project will result in a deeper toolbox of genomic sketching and distance estimation algorithms, software libraries encoding these new algorithms for wider use by the community, and an improved suite of genomic software, including enhancements to a widely used aligner and assembler and improved accuracy in existing and new metagenomic software.