Project Summary/Abstract
The overall goal of this project is to use approaches from statistics and computer science to solve significant
challenges in the analysis of metabarcode and metagenomic data. Metagenomics, the study of the combined genomes
of organisms present in a single community, is an emerging, highly interdisciplinary field that combines genomics,
bioinformatics, and systems biology, among other areas. Metagenomics has many applications to public health,
especially in the areas of pathogen detection, human microbiome analysis, and biodiversity monitoring. The larger
objective of this proposal is to leverage the open-source software tronko, a fast approximate-likelihood
phylogenetic placement method that I developed for taxonomic classification. Tronko is the first phylogenetic
placement method that truly enables the use of large-scale reference databases with next-generation sequencing
data as queries. Tronko will be used to solve fundamental problems in the analysis of metabarcode and metagenomic
data and to develop an application for analyzing severe acute respiratory syndrome coronavirus
2 (SARS-CoV-2) sequences that will greatly enhance the utility of environmental monitoring of SARS-CoV-2. The
specific aims of this proposal are to (1) solve an important theoretical problem by applying rigorous species
delineation to taxonomic assignment, (2) apply tronko to solve the important practical problem of estimating the
composition of SARS-CoV-2 lineages in wastewater surveillance samples, and (3) develop a rapid custom reference
database builder for analyzing metabarcode and metagenomic data. For Aim 1, because sequence variability differs
among phylogenetic groups in different parts of the tree, I plan to use Bayesian methods to estimate effective
population sizes locally and thereby establish appropriate cut-off thresholds for species assignments in different parts
of the phylogeny. Current methods use arbitrary thresholds for delineation of taxonomic groups, and this method
would provide an elegant solution to a long-standing limitation in species classification. For Aim 2, SARS-CoV-2
monitoring of wastewater is an effective strategy for early detection of outbreaks. I plan to build a pipeline, and
subsequently a web portal for researchers, that uses tronko to first detect the virus within a wastewater sample
and then uses an expectation-maximization algorithm to estimate the proportions of viral strains. This
aim would greatly aid public health researchers in assessing and managing the pandemic, since no established
methods are currently available for this type of analysis. For Aim 3, current custom reference database builders
require weeks, if not months, of continuous computational time as well as access to a large amount of data
storage. I propose to build a method that can construct such a database within a day. The method will perform in silico
amplification with the chosen primers and subsequently use the amplified fragments in a k-mer-based approach to identify
relevant sequences within a nucleotide database, which can be queried either over a network connection or from a local
copy. Execution of these aims will solve important theoretical, practical, and computational problems in the
field of metagenomics.
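
As a minimal illustration of the expectation-maximization step described for Aim 2, the Python sketch below estimates mixture proportions from a matrix of per-read likelihoods. It is not the tronko pipeline: the function name em_proportions and the input lik (where lik[i][j] is the likelihood of read i under lineage j) are hypothetical, and the sketch assumes that such likelihoods have already been computed by an upstream placement step.

```python
def em_proportions(lik, n_iter=200, tol=1e-9):
    """Estimate lineage proportions from per-read likelihoods with EM.

    lik[i][j] is an assumed, precomputed likelihood of read i under lineage j.
    Returns a list of proportions that sums to 1.
    """
    n_lineages = len(lik[0])
    pi = [1.0 / n_lineages] * n_lineages  # start from uniform proportions
    for _ in range(n_iter):
        # E-step: accumulate the posterior probability that each read
        # originated from each lineage, given the current proportions.
        counts = [0.0] * n_lineages
        for row in lik:
            weights = [p * l for p, l in zip(pi, row)]
            total = sum(weights)
            if total == 0.0:
                continue  # read has zero likelihood everywhere; skip it
            for j, w in enumerate(weights):
                counts[j] += w / total
        norm = sum(counts)
        if norm == 0.0:
            return pi  # no informative reads; keep the current estimate
        # M-step: new proportions are the normalized expected read counts.
        new_pi = [c / norm for c in counts]
        converged = max(abs(a - b) for a, b in zip(pi, new_pi)) < tol
        pi = new_pi
        if converged:
            break
    return pi


# Toy input: three reads match only lineage 0, one read matches only lineage 1.
toy = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
estimate = em_proportions(toy)
```

With the toy input, every read is unambiguous, so the estimate reduces to the observed read fractions. Real wastewater reads would typically be compatible with several lineages at once, which is the case the iterative reweighting is meant to handle.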