Sequence transformations and microbiology: theory, tools, & discovery - Project Summary
String similarity is one of the foundational problems in computational biology: it is what allows us to piece together a genome from short sequenced reads, to determine that humans share a more recent common ancestor with chimpanzees than with octopuses, and to detect the horizontal transfer of genetic material across bacterial species, one of our current application focuses. Finding exact sequence matches between genomes is a long-solved problem, but in biology, because both mutation and sequencing error are prevalent, we care more about approximate matches, which are much harder to find and characterize. One of the major recent advances in bioinformatics has been the advent of increasingly sophisticated string transformations (sketching, k-mer-ization, alphabet reductions) that change the distribution of exact matches on the transformed strings, allowing for the development of faster software. My nascent research lab has been one of the pioneers both in developing rigorous mathematical theory to understand sequence transformations and in engineering that theory into usable bioinformatics software. In prior work, we gave the first rigorous proofs that k-mer sketching works with alignment [Shaw & Yu, Genome Research, 2023] and translated that theoretical understanding into a new metagenomics tool, “skani” [Shaw & Yu, Nature Methods, 2023], for computing pairwise average nucleotide identity (ANI), a standard measure of genome similarity. Skani is both more accurate and more than an order of magnitude (20x) faster than the state of the art.
Building on skani, we further produced skandiver [Zhang et al., Bioinformatics, 2024], a tool for detecting large intercellular mobile genetic elements (MGEs) by comparing every chunk of a sequence against every whole genome in a database, without needing a dedicated reference database of mobile genetic elements. Over the next five years, my research program will continue to straddle the line between advancing string algorithm theory and using that theory to build bioinformatics tools. On the theory side, we want to combine ideas from k-mer sketching theory with alphabet reductions to extend the utility of sketching to more dissimilar sequences (such as those found in protein databases). On the applied side, we will push forward string similarity tools for microbial analysis, focusing on better characterizing mobile genetic elements: our proof-of-concept skandiver performs only long all-to-all sequence similarity comparisons, and its design does not handle intra-species MGEs or small MGEs, nor does it annotate the boundaries of the MGEs it does find.