The NIH and other agencies are funding high-throughput genomics (‘omics) experiments that deposit
digital samples of data into the public domain at breakneck speeds. This high-quality data measures the
‘omics of diseases, drugs, cell lines, model organisms, etc. across the complete gamut of experimental factors
and conditions. The importance of these digital samples of data is further illustrated in linked peer-reviewed
publications that demonstrate their scientific value. However, meta-data for digital samples is recorded as free
text without the biocuration necessary for in-depth downstream scientific inquiry.
Deep learning is a revolutionary machine intelligence paradigm that allows an algorithm to program
itself, thereby removing the need to explicitly specify rules or logic. Whereas physicians and scientists once
needed to first understand a problem in order to program computers to solve it, deep learning algorithms tune
themselves from data to solve problems. Given enough example data to train on, deep learning machine
intelligence can outperform humans on a variety of tasks. Today, deep learning delivers state-of-the-art
performance for image classification and, most importantly for this proposal, for natural language processing.
This proposal is about engineering Crowd Assisted Deep Learning (CrADLe) machine intelligence to
rapidly scale the digital curation of public digital samples. We will first use our NIH BD2K-funded Search Tag
Analyze Resource for Gene Expression Omnibus (STARGEO.org) to crowd-source human annotation of open
digital samples. We will then develop and train deep learning algorithms for STARGEO digital curation based
on learning the free-text meta-data associated with each digital sample. Given the ongoing deluge of biomedical
data in the public domain, CrADLe may be the only way to scale digital curation towards a
precision medicine ideal.
Finally, we will demonstrate the biological utility of leveraging CrADLe for digital curation with two
large-scale and independent molecular datasets: 1) The Cancer Genome Atlas (TCGA), and 2) The Accelerating
Medicines Partnership-Alzheimer’s Disease (AMP-AD). We posit that CrADLe digital curation of open samples
will augment these two distinct disease projects with a host of big data to fuel the discovery of potential
biomarkers and gene targets. Therefore, successful funding and completion of this work may greatly reduce the burden of
disease on patients by enhancing the efficiency and effectiveness of digital curation for biomedical big data.