PROJECT SUMMARY
Title: Exploration of cloud computing for CAZyme research
Our R01 parent project (R01GM140370) intends to develop four bioinformatics tools for automated annotation
of CAZymes (Carbohydrate Active Enzymes) and CAZyme Gene Clusters (CGCs) in the human gut microbiome.
These automated tools will enhance: (i) the basic biomedical science to characterize new polysaccharide (or
glycan) metabolic enzymes and polysaccharide utilization loci (PULs, gene clusters with known carbohydrate
substrates) in the human gut microbiome, and (ii) the emerging personalized nutrition practice (e.g., using gut
microbiome sequencing to infer if a person is a responder to certain dietary glycans or prebiotics).
In the past two years, we have developed dbCAN3 (Aim 1 of R01) and dbCAN-seq (Aim 3 of R01), one web
server and one online database, to allow users to submit genomic data from any microbiome for automated
CAZyme, CGC, and glycan substrate annotation. Both websites are now hosted on our lab’s standalone
desktop server (a six-year-old computer with a 16-core/32-thread CPU), which is neither a secure nor a sustainable
solution and cannot meet the increasing demand from users who routinely submit jobs to our servers. For
example, our popular dbCAN2 web server (the second version of dbCAN that started in 2012) processed over
35,000 user-submitted jobs in 2022, all on this one desktop computer.
Consequently, there are challenges and risks that may disrupt the popular service we provide to tens of thousands
of microbiome users all over the world, and additional support is requested to explore moving dbCAN3
to a cloud computing platform, e.g., Amazon Web Services (AWS). These challenges include: (i) our local
desktop server can no longer keep up with the continuously growing number of job submissions, (ii) the server, built in 2017,
is already out of warranty, and (iii) the server has been frequently reported by the University IT service department
to have numerous security vulnerabilities due to its outdated operating system and software.
Therefore, the major goal of this supplement grant proposal is to explore and test the application of
Amazon Web Services (AWS) to support the CAZyme bioinformatics tool development objective of our R01
parent project. In this one-year project, we aim to test the deployment of our dbCAN3 website on AWS by
taking advantage of AWS web hosting services and AWS Batch for automatic workload distribution, and to
compare it with two on-premises solutions in terms of computational efficiency.
To achieve these goals, we have assembled a multi-disciplinary research team including three faculty members, one
computer specialist, and one graduate student. We have all the necessary expertise in bioinformatics, cloud
computing, and high-performance computing. The successful completion of this cloud exploration project will
significantly increase our knowledge of using AWS for our R01 CAZyme bioinformatics tool development.