A Genome Data Analysis Center Focused on Batch Effect Analysis and Data Integration - * * * * PROJECT SUMMARY * * * * Abstract: Technical batch effects pose a fundamental challenge to quality control and reproducibility even in single-laboratory research projects, and the possibilities for serious error are greatly magnified in complex, multi-institutional enterprises such as the cancer molecular profiling projects being undertaken by the NCI Center for Cancer Genomics (CCG). To aid in the detection, quantitation, interpretation, and (when appropriate) correction of technical batch effects in such data, we have developed the MBatch software system. MBatch has proved indispensable for quality-control “surveillance” of data in The Cancer Genome Atlas (TCGA) and in ongoing CCG projects. But detecting and quantitating batch effects (or trend effects or statistical outliers) are just the first steps in a process. The next steps involve detective work in collaboration with those who generated the data, drawing upon expertise in integrative analysis across data types, pathways, and systems-level biology. That detective work usually succeeds in diagnosing the cause of a batch effect as technical or biological. If the cause is technical, computational methods to ameliorate the batch effect can be applied (judiciously). The primary aim of the proposed Genome Data Analysis Center (GDAC) is to continue to translate that successful quality-control model to the CCG’s other current and future large-scale molecular profiling projects. We will be ready to do that on Day 1. We will continue to enhance and extend the power of MBatch and will incorporate into it a number of new algorithms, tools, and interactive visualizations (OmicPioneer-sc, MutBatch, CarDEC, and CorNet). Evaluating and correcting batch effects is a complex process, so we will collaborate with other GDACs and data-generating centers to determine the influence of artifacts on any analysis results they produce.
The second aim is to contribute and enhance additional competencies. We are prepared to (i) provide integrated cluster solutions to segregate cases into biologically relevant groups; (ii) provide tools and expertise for high-level visualization of omic data (including single-cell data); and (iii) analyze RPPA proteomic data from the subset of projects that generate such data. Our final aim is to communicate results and distribute corrected data back to other network members, project stakeholders, and the scientific community. We bring a number of assets to the table, including multidisciplinary expertise in bioinformatics, biostatistics, software engineering, cancer biology, and cancer medicine; PIs with a combined 40+ years of experience in molecular profiling of cancers; expertise gained in 10 years of batch-effect surveillance for TCGA and other CCG projects; a highly professional software engineering team with a track record of producing high-end bioinformatics tools; extensive computing resources, including one of the most powerful academic clusters in the world; and close working relationships with first-class basic, translational, and clinical researchers across MD Anderson, one of the foremost cancer centers in the U.S. The bottom-line mission of the GDAC will be to aid the research community’s effort to understand cancer and to prevent, detect, diagnose, and treat it more effectively.