Project Summary/Abstract
The human virome, i.e. the collection of viruses found in or on humans, is a complex set of viral
communities whose diversity is only now starting to be explored and described. Along with viruses infecting
humans and causing diseases, the human virome includes many viruses infecting microorganisms that are part of the
human microbiome, as well as transient viruses originating from, e.g., food or drinking sources. Most of these
viruses are currently known only through metagenomics, i.e. assays through which the genomes of viruses
present in a given sample are sequenced directly without the need for laboratory cultivation or isolation.
Because of challenges in analyzing this broad diversity of viral genomes, however, the biological information
extracted from these metagenome-assembled viral genomes remains limited at this stage.
As the field of viromics has been established, most of the effort so far has focused on the development
of methods to comprehensively identify the genomes of known and novel viruses in metagenomes. This
resulted in multiple efficient tools for viral sequence detection, and the creation of large-scale catalogs of
genomes from different parts of the human virome. Critical for understanding the biology of these viruses,
however, will be our ability to classify these new viruses in a robust taxonomic framework, link these
viruses to their host(s), and functionally annotate a majority of the genes they encode. Some approaches
have been proposed to address these questions, but current tools are inadequate in terms of resolution,
accuracy, and/or throughput. Moreover, some of the most promising methods, such as the use of innovative AI
for genome annotation, are only available as experimental software and are not ready for large-scale application.
Here, we aim to establish the necessary tools, curated databases, and integrated pipelines to enable any
researcher with a set of viral genomes to (i) classify these viruses into quasi-taxa, enabling robust comparison with
other similar studies at multiple ranks, (ii) predict the most likely host taxa and/or strains infected by these
viruses, and (iii) identify the genetic potential and thus potential role(s) and impact(s) of these viruses. We will
build this work on our previous experience in developing advanced viromics tools and databases, as well as
recent developments in viral taxonomy, large-scale comparative genomics, and machine learning for sequence
analysis. Specifically, we intend to develop new genome comparison and clustering approaches to provide a
comprehensive genome-based viral taxonomy database and an associated toolkit; expand the current host
prediction tools by integrating large-scale CRISPR detection and viral phenotype prediction in a virus-host
network framework to enable virus-host linkage at the strain level; and establish a new functional annotation
pipeline leveraging protein structure prediction and genomic neighborhood. We intend to develop these new
tools in close collaboration with members of the Human Virome Program to build a robust viromics toolkit
enabling researchers to thoroughly investigate the direct and indirect impact(s) of these viruses on humans.