Phylogenetic and computational methods for accurate and efficient analyses of large-scale metagenomics datasets - Project Summary/Abstract
The overall goal of this project is to use approaches from statistics and computer science to solve significant challenges in the analysis of metabarcode and metagenomics data. Metagenomics, the study of the combined genomes of organisms present in a single community, is an emerging, highly interdisciplinary field that combines genomics, bioinformatics, and systems biology, among other areas. Metagenomics has many applications to public health, especially in the areas of pathogen detection, human microbiome analysis, and biodiversity monitoring. The larger objective of this proposal is to leverage the open-source software tronko, a fast approximate-likelihood phylogenetic placement method that I developed for taxonomic classification. Tronko is the first phylogenetic placement method that truly enables the use of large-scale reference databases with next-generation sequencing data as queries. Tronko will be used to solve fundamental problems in analyses of metabarcode and metagenomic data, and to develop an application for analyzing severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) sequences that will greatly enhance the utility of environmental monitoring of SARS-CoV-2. The specific aims of this proposal are (1) to solve an important theoretical problem by applying rigorous species delineation to taxonomic assignment, (2) to apply tronko to the important practical problem of estimating the composition of SARS-CoV-2 lineages in wastewater surveillance samples, and (3) to develop a rapid custom reference database builder for analyzing metabarcode and metagenomics data.
For Aim 1, because phylogenetic groups differ in their variability across the tree, I plan to use Bayesian methods to estimate effective population sizes locally and thereby establish appropriate cut-off thresholds for species assignment in different parts of the phylogeny. Current methods use arbitrary thresholds to delineate taxonomic groups, and this method would provide an elegant solution to that long-standing limitation in species classification. For Aim 2, SARS-CoV-2 monitoring of wastewater is an effective strategy for early detection of outbreaks. I plan to build a pipeline, and subsequently a web portal for researchers, that uses tronko to first detect the virus within a wastewater sample and then uses an expectation-maximization algorithm to estimate the proportions of viral strains. This aim would greatly aid public health researchers in assessing and managing the pandemic, since no established methods are currently available for this type of analysis. For Aim 3, current custom reference database builders require weeks, if not months, of continuous computation in addition to access to a large amount of data storage. I propose to build a method that can be completed within a day. The method will perform in silico amplification with the chosen primers and then use the amplified fragments in a k-mer-based approach to identify relevant sequences within a nucleotide database, working either against a remote database over a network connection or against a local database. Execution of these aims will solve important theoretical, practical, and computational problems in the field of metagenomics.
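
As an illustration of the expectation-maximization step named in Aim 2, the following is a minimal sketch, not the proposed pipeline. It assumes that each read already has a likelihood under each candidate lineage; the input matrix `read_likelihoods`, the function name `estimate_lineage_proportions`, its parameters, and the toy data are all hypothetical, and how tronko would supply such likelihoods is not specified here.

```python
def estimate_lineage_proportions(read_likelihoods, n_iter=200, tol=1e-9):
    """EM for mixture proportions with fixed per-read, per-lineage likelihoods.

    read_likelihoods[i][k] is assumed to be P(read i | lineage k).
    Returns a list of estimated lineage proportions that sums to 1.
    """
    n_lineages = len(read_likelihoods[0])
    proportions = [1.0 / n_lineages] * n_lineages  # uniform starting point

    for _ in range(n_iter):
        # E-step: posterior responsibility of each lineage for each read,
        # accumulated into expected read counts per lineage.
        expected_counts = [0.0] * n_lineages
        for row in read_likelihoods:
            weighted = [p * lik for p, lik in zip(proportions, row)]
            norm = sum(weighted)
            if norm == 0.0:
                continue  # read is incompatible with every lineage; skip it
            for k, w in enumerate(weighted):
                expected_counts[k] += w / norm

        # M-step: new proportions are the normalized expected read counts.
        used = sum(expected_counts)
        if used == 0.0:
            raise ValueError("no read is compatible with any lineage")
        updated = [c / used for c in expected_counts]

        converged = max(abs(a - b) for a, b in zip(updated, proportions)) < tol
        proportions = updated
        if converged:
            break
    return proportions


# Toy data (assumed): three reads match only lineage 0, one matches only lineage 1.
toy = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
estimated = estimate_lineage_proportions(toy)
```

On this toy input, where no read is shared between lineages, the estimate reduces to the read fractions, 0.75 and 0.25. Real wastewater reads would typically be compatible with several lineages at once, which is the case where the iterative reweighting matters.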