PROJECT SUMMARY
Bioinformatic analysis of large genomic datasets is a critical barrier for many biologists, especially
those at smaller research institutions. Leveraging our team's bioinformatics experience, our goal is to
develop an interactive web application that can be used to easily translate RNA sequencing data
into biological insights. We hypothesize that an integrated tool for reproducible, in-depth analysis of
expression data will democratize access to high-throughput technologies and help biologists pinpoint
molecular pathways from large datasets. To this end, we aim to build a carefully designed, user-friendly pipeline
with rich data visualization capacity. As a proof of concept, the team developed a prototype called iDEP
(integrated Differential Expression and Pathway analysis) for the analysis of summarized expression
matrices. Its unique features include (1) comprehensive analytic functionality based on 63 R and
Bioconductor packages, covering exploratory data analysis, clustering, differential gene expression and
pathway analysis; (2) a massive knowledgebase for automatic gene ID conversion, annotation, and
pathway analysis for over 2000 archaeal, bacterial and eukaryotic species; (3) reproducibility of some
core steps by generating R and R Markdown notebooks; (4) application programming interfaces (APIs)
for retrieval of protein-protein interaction networks and KEGG pathway diagrams; and (5) easy access
to about 13000 processed public RNA-seq datasets in 9 species. Compared with existing tools, the key
innovation is the emphasis on deep integration (tools, annotation, pathways, and public datasets), user-
friendliness, and reproducibility. Even with limited features, iDEP is beginning to be adopted by
researchers from diverse fields.
In this proposal, the team plans to complete the development of iDEP. The goal of Specific Aim 1 is
to (a) rewrite iDEP in a modular, object-oriented fashion, (b) make an R package for generating fully
reproducible R Markdown notebooks, and (c) add essential functionalities such as bias correction (batch
effect, GC content, gene length, expression level), time-course analysis, supervised classification, and
additional methods for existing functional modules. We will also enable gene ontology enrichment
analysis for unannotated species using Blast2GO. Specific Aim 2 focuses on (a) substantially
expanding the pathway database for frequently studied species and (b) collecting more uniformly
processed RNA-seq and DNA microarray datasets to facilitate the re-analysis and meta-analysis of
public expression data. In Specific Aim 3, the team will carry out a hardware upgrade, rigorous testing,
code review, documentation, and community integration. The development of iDEP can help make
standard RNA-seq analysis accessible to a broad community of researchers.