Research Summary/Abstract
The generation of biological data is rapidly presenting us with one of the most demanding data analysis
challenges the world has ever faced, not only in terms of storage and accessibility, but perhaps more critically
in terms of its extensive heterogeneity and variability. In this proposal, we present a new approach to these
challenges, which we call “Deep Curation”: a large-scale, integrated modeling approach to simultaneously
cross-evaluate millions of heterogeneous data points against one another. The word “deep” reflects the multiple
layers of curation we perform, including layers not only for data, but also for parameters derived from these
data, the mathematical equations, the unified model, and the simulation output. Thus, the deeply curated
model is an invaluable tool for processing, curating and analyzing data automatically. Our proposed efforts in
Deep Curation are based on a computer model of Escherichia coli that accounts for the function of roughly
40% of the well-annotated genes, and is based on an extensive set of diverse measurements compiled from
thousands of reports (currently in the second round of review at Science). The goal of this proposal is to expand this
model to enable Deep Curation of data related to growth in >100 currently unincorporated environments. We
can then assess the cross-consistency of the datasets simultaneously, as a unified whole, identifying critical
areas in which datasets are not cross-consistent and where further experimental investigation is therefore needed.
The Significance of this proposal is that Deep Curation represents a first-of-its-kind leap forward in our
ability to exploit massively heterogeneous, variable and complex biological datasets; that it automates and
accelerates transformative biomedical discovery; that we will create a bi-directional pipeline between EcoCyc,
the most comprehensive database on any organism, and the most complex biological model in existence; and
that whole-cell modeling is a rapidly growing field with transformative potential as it advances towards more
complex cells and groups of cells. The Innovation associated with this proposal is that Deep Curation is a
brand-new approach that is not currently available to any other lab in the world; that the
proposed work will produce a dramatically expanded whole-cell model of previously unseen complexity, as well
as novel modeling technology; that we include explicit curation of knowledge regarding
mechanism in addition to data; and that the automated communication between the EcoCyc database and the
E. coli model will dramatically expand the capacity, scope and visibility of both in a synergistic way. Our
Specific Aims are: Aim 1 (Curation), build the Data and Parameter layers related to E. coli growth in diverse
environments; Aim 2 (Modeling), implement the Equation, Model and Simulation layers; Aim 3 (Deep Curation),
use the integrated model to cross-evaluate the unified dataset at the whole-organism scale; and Aim 4
(Distribution), make the model available to the broader community via GitHub (software tools), EcoCyc (data
and parameters), and Google Cloud (simulations and interactive visualizations).