Advanced analytics for uncovering virus dynamics and functional potential - Project Summary/Abstract

The human virome, i.e. the collection of viruses found in or on humans, is a complex set of viral communities whose diversity is only now starting to be explored and described. Along with viruses that infect humans and cause disease, the human virome includes many viruses that infect microorganisms of the human microbiome, as well as transient viruses originating from sources such as food or drink. Most of these viruses are currently known only through metagenomics, i.e. assays in which the genomes of viruses present in a given sample are sequenced directly, without the need for laboratory cultivation or isolation. Because this broad diversity of viral genomes is challenging to analyze, however, the biological information extracted from these metagenome-assembled viral genomes remains limited. As the field of viromics has become established, most effort so far has focused on developing methods to comprehensively identify the genomes of known and novel viruses in metagenomes. This has resulted in multiple efficient tools for viral sequence detection and in large-scale catalogs of genomes from different parts of the human virome. Understanding the biology of these viruses, however, will depend on our ability to classify them in a robust taxonomic framework, link them to their host(s), and functionally annotate a majority of the genes they encode. Some approaches have been proposed to address these questions, but current tools are inadequate in terms of resolution, accuracy, and/or throughput. Moreover, some of the most promising methods, such as the use of innovative AI for genome annotation, are available only as experimental software and are not ready for large-scale application.
Here, we aim to establish the tools, curated databases, and integrated pipelines needed to enable any researcher with a set of viral genomes to (i) classify these viruses into quasi-taxa, enabling robust comparison with similar studies at multiple ranks, (ii) predict the most likely host taxa and/or strains infected by these viruses, and (iii) identify the genetic potential, and thus the likely role(s) and impact(s), of these viruses. We will build this work on our previous experience in developing advanced viromics tools and databases, as well as on recent developments in viral taxonomy, large-scale comparative genomics, and machine learning for sequence analysis. Specifically, we intend to develop new genome comparison and clustering approaches to provide a comprehensive genome-based viral taxonomy database and an associated toolkit; expand current host prediction tools by integrating large-scale CRISPR detection and viral phenotype prediction in a virus-host network framework to enable virus-host linkage at the strain level; and establish a new functional annotation pipeline leveraging protein structure prediction and genomic neighborhood. We intend to develop these new tools in close collaboration with members of the Human Virome Program, building a robust viromics toolkit that enables researchers to thoroughly investigate the direct and indirect impact(s) of these viruses on humans.