Continued Development and Maintenance of the MG-RAST Metagenomics Pipeline -
DESCRIPTION (provided by applicant): Metagenomics, the study of microbial populations sampled directly from the environment, affords avenues for discovering novel enzymes via microbial profiling; using microbial shifts as predictors for health; or gauging the sustainabilityof human operations like mineral mining. However, the volume of metagenomic data is large (e.g., the metagenome of a human's gut microbiota is about 1 Gigabasepairs in size) and the processing that needs to be done to extract meaning out of the large datasets is significant, such as to identify what organisms' genomes are in the sample (taxonomic annotation) and what are they doing (functional annotation) via comparisons with continually updated knowledge databases. These numbers are only growing as experimentalists demand more and more metagenomic analysis runs. Borne out of this need, our MG-RAST (Metagenomics-Rapid Annotation) portal, an open-source, high-throughput, metagenomics service, has been a major community resource since 2008, housing over 160K datasets and 40K users. However, since its original design, MG-RAST has witnessed the frenetic development of next-generation sequencing technologies, drastically altered computing landscape (both in hardware and software), changed requirements in terms of number of users and datasets' volumes and diversity, increasing complexity of pipeline components, and requirements for higher throughput. To adapt to this, MG-RAST has been continually modified. Modifications included upgrading the pipeline components with several algorithmic improvements; deploying a customized data and workflow management system - the SHOCK object store and AWE workflow manager; and porting MG-RAST to a cloud-based distributed architecture. Notwithstanding our continual, albeit ad-hoc system improvements, our pilot studies have indicated the need for a comprehensive redesign of MG-RAST to keep pace with the needs of the rapidly advancing field of metagenomics. Our proposed enhancements are based on expressed user requirements, new usage patterns, and flexibility to incorporate new tools, especially for the compute-intensive similarity analysis for queried sequences. Through this project, we propose to accomplish MG-RAST's transformation via (i) improving its functionality and data reproducibility; (ii) improving its software quality and performance through automated monitoring and generation of test suites; and (iii) moving toward a federated infrastructure for metagenomics data. Overall, the successful accomplishment of our aims will support alternate metagenomics service models through federation of services and data and result in a robust state-of-the-art metagenomics resource. Federation in biomedical pipelines is in general a powerful direction to leverage the expertise of diverse user-bases and, reciprocally, benefit its users. Thus, MG-RAST, as a state- of-the-art pipeline, will be capable of supporting an ever increasing user-base, handling larger and more varied datasets, and evolving in concert with new genomics technologies. This, with the ultimate goal, to accelerate advances in end-user applications, e.g., personalized medicine, tailored to the patient's microbiome.