PROJECT SUMMARY / ABSTRACT
In the modern era of genomics and proteomics, the vast amounts of biological data generated present both a
challenge and an opportunity. Central to this proposal is the innovative use of kmers, short nucleic acid or peptide
sequences, as a tool to navigate and interpret these data. Kmers are used in a variety of genomics and proteomics
applications, including genome assembly and alignment, genomic variant detection, and metagenomics. As
sequencing technology continues to advance, kmer-based methods are poised to play an increasingly important
role in genomic and proteomic research.
Quasi-primes are kmers found in a single species and absent from all others. We have recently developed algorithms to efficiently
identify quasi-primes across every available genome and proteome. In humans, quasi-prime loci are primarily
found in brain-expressed genes associated with cognition and are enriched for quantitative trait loci, indicating
their significance in the development of species-specific traits. Over the next five years, we will examine quasi-
primes in populations of diverse ancestries, in archaic hominins and in primate and mammalian evolution to
improve our understanding of their functional and evolutionary significance. Additionally, we will leverage our
expertise in performing large-scale analyses to expand upon our findings and characterize the functions of these
kmers across every sequenced organism and taxonomic group. This will allow us to investigate the underlying
mechanisms that enable species to develop new traits and adapt to their environment.
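To make the quasi-prime definition concrete, the following is a minimal sketch under stated assumptions, not the algorithm developed for this project. It assumes each species is represented by a small list of sequences held in memory; the names `kmers`, `quasi_primes`, and `toy_genomes` are hypothetical and chosen for illustration only. Analyses across every available genome and proteome would require far more memory-efficient data structures than the plain dictionary used here.

```python
from collections import defaultdict


def kmers(sequence, k):
    """Return the set of all length-k substrings of a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def quasi_primes(genomes, k):
    """Map each species to the kmers found in that species and in no other.

    `genomes` is a hypothetical in-memory mapping: species name -> list of sequences.
    """
    # Record which species contain each kmer.
    owners = defaultdict(set)
    for species, sequences in genomes.items():
        for seq in sequences:
            for kmer in kmers(seq, k):
                owners[kmer].add(species)

    # Keep only the kmers owned by exactly one species.
    result = {species: set() for species in genomes}
    for kmer, species_set in owners.items():
        if len(species_set) == 1:
            (only,) = species_set
            result[only].add(kmer)
    return result


# Toy input, invented purely for illustration.
toy_genomes = {
    "species_a": ["ACGTAC"],
    "species_b": ["CGTAGG"],
    "species_c": ["TTACGT"],
}
qp = quasi_primes(toy_genomes, 3)
```

In this toy input, every 3-mer of `species_a` also occurs in another species, so it has no quasi-primes, whereas `species_b` and `species_c` each retain the kmers that no other species shares.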
The composition of organismal genomes depends on a variety of factors, including genome size, genomic
instability, and biological processes, such as transcription and translation. We aim to investigate how these
factors shape the composition of genomes in every species and across all taxonomic groups. We will integrate
different types of genomic and proteomic data, including kmer frequency profiles, codon usage tables, and
transcription and translation annotations. Our goal is to deconvolute the relative contributions of different factors
shaping the composition and evolution of organismal genomes. Building on this, we plan to incorporate these
findings into generative artificial intelligence models to create improved simulated genomes that will have
significant applications as synthetic controls for bioinformatics analyses.
Finally, we will provide well-documented, open-source software tools and integrate the data from our projects
into accessible databases, in line with the FAIR (Findable, Accessible, Interoperable, and Reusable) principles. In doing so, we aim not only to advance research in
our specific areas of focus but also equip other researchers with tools and datasets they can utilize in their distinct
domains of expertise.
In summary, our approach uses kmers to study the composition and evolution of genomes and proteomes
and provides the scientific community with computational resources that foster collaboration and
innovation in basic and biomedical research.