New technologies afford the acquisition of dense “data clouds” of individual humans. However,
the heterogeneity, dimensionality and multiscale nature of such data (genomes, transcriptomes,
clinical variables, etc.) pose a new challenge: How can one query such dense data clouds of
mixed data as an integrated set (as opposed to variable by variable) against multiple knowledge
bases, and translate the joint molecular information into the clinical realm? Current lexical mapping
and brute-force data mining seek to make heterogeneous data interoperable and accessible
but their output is fragmented and requires expertise to assemble into coherent actionable
information. We propose DeepTranslate, an innovative approach that incorporates the known
actual physical organization of biological entities that are the substrate of pathogenesis into (i)
networks (data graphs) and (ii) hierarchies of concepts that span the multiscale space from
molecule to clinic. Organizing data sources along such natural structures will allow translation of
burgeoning high-dimensional data sets into concepts familiar to clinicians, while capturing
mechanistic relationships. DeepTranslate will take a hybrid approach to learn and organize its
content from both (i) existing generic comprehensive knowledge sources (GO, KEGG, ICD, etc.)
and (ii) newly measured instances of individual data clouds from two demonstration projects: (1)
ISB’s Pioneer 100 and (2) St. Jude Lifetime cancer survivors. We will focus on diabetes as a
test case. These two studies cover a deep biological scale-space and thus can test the full extent
of the multiscale capacity of DeepTranslate in a focused application.
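As a minimal illustration of how a concept hierarchy could translate molecular findings into clinician-level terms, the sketch below rolls flagged variables up to their top-level concept and counts them. The hierarchy, the entity names, and the functions `path_to_root` and `summarize` are hypothetical and are not part of DeepTranslate; the sketch also assumes each entity has a single parent, whereas real ontologies such as GO allow several.

```python
from collections import Counter

def path_to_root(parent, node):
    """Follow child -> parent links from node up to a top-level concept."""
    path = [node]
    while path[-1] in parent:
        nxt = parent[path[-1]]
        if nxt in path:
            raise ValueError(f"cycle detected at {nxt!r}")
        path.append(nxt)
    return path

def summarize(parent, flagged):
    """Count flagged variables under each top-level concept."""
    return Counter(path_to_root(parent, v)[-1] for v in flagged)

# Hypothetical toy hierarchy (assumption: one parent per entity).
parent = {
    "INS": "insulin secretion",
    "GCK": "glucose sensing",
    "insulin secretion": "glucose homeostasis",
    "glucose sensing": "glucose homeostasis",
    "LDLR": "lipid metabolism",
}

summary = summarize(parent, ["INS", "GCK", "LDLR"])
for concept, count in summary.items():
    print(concept, count)
```

In this toy case two of the three flagged genes are summarized under a single concept, which is the kind of condensed view a clinician could read without inspecting each variable.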
1. TYPES OF RESEARCH QUESTION ENABLED. How can a clinician find out that the dozens
of “out of range” variables observed in a patient’s data cloud form a connected set with respect
to pathophysiological pathways, from gene to clinical variable? How can the high-dimensional
data of studies that measure for each individual 100+ data points of various types
(“personal data clouds”) be analyzed as one set in an integrated fashion (as opposed to variable
by variable) against existing knowledge bases and also be used to improve the databases?
DeepTranslate addresses these two types of questions and will thereby accelerate translation of
future personal data clouds into (A) care decisions and (B) hypotheses on new disease mechanisms
/ treatments, benefiting providers as well as researchers.
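The first question, whether a patient's out-of-range variables form a connected set, can be phrased as a connected-components check on the part of a network restricted to the flagged variables. The sketch below is only an illustration under stated assumptions: the edge list, the variable names, and the function `flagged_components` are hypothetical, and a real knowledge graph would carry typed, directed and weighted relations.

```python
from collections import deque

def flagged_components(edges, flagged):
    """Group flagged variables into connected components of the
    subgraph induced by the flagged nodes (edges treated as undirected)."""
    flagged = set(flagged)
    adj = {n: set() for n in flagged}
    for a, b in edges:
        if a in flagged and b in flagged:
            adj[a].add(b)
            adj[b].add(a)
    seen = set()
    components = []
    for start in sorted(flagged):
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        comp = set()
        while queue:  # breadth-first search from this start node
            node = queue.popleft()
            comp.add(node)
            for nb in adj[node]:
                if nb not in seen:
                    seen.add(nb)
                    queue.append(nb)
        components.append(comp)
    return components

# Hypothetical toy network spanning gene, protein, metabolite, clinical variable.
edges = [
    ("GCK", "glucokinase"),
    ("glucokinase", "glucose"),
    ("glucose", "HbA1c"),
    ("LDLR", "LDL cholesterol"),
]
flagged = ["GCK", "glucokinase", "glucose", "HbA1c", "LDL cholesterol"]

components = flagged_components(edges, flagged)
for comp in components:
    print(sorted(comp))
```

Here four flagged variables fall into one component that runs from gene to clinical variable, while the fifth stays isolated because its only neighbor in the toy network is not flagged.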
2. USE OF EXPERTISE AND RESOURCES.
• ISB: pioneer in personalized, big-data driven medicine (Demo Project 1); biomedical content
expertise; multiscale omics and molecular pathogenesis, big data analysis, housing of databases
for public access; query engine designs, GUI.
• UCSD: leader in biomedical data integration; automated assembly of molecular and clinical
data into hierarchical structures; translation between data types.
• U Montreal: biomedical database curation from literature and construction of
gene/protein/drug interaction networks; machine learning, open resource database.
• St. Jude CRH: cancer monitoring (Demo Project 2), cancer patient data analytics.
3. POTENTIAL DATA AND INFRASTRUCTURE CHALLENGES. (a) Existing comprehensive
clinical data sources are not uniform and not explicitly based on biological networks; cross-mapping
is being performed at NLM based on lexical relationships: HPO (phenotypes) vs.
SNOMED CT (for EMR) vs. ICD or Merck Manual (for diseases). Careful selection of these
sources in close collaboration with NLM is needed. (b) Existing molecular pathway databases
are static, based on averages of heterogeneous non-stratified populations, while the newly
measured high-dimensional data clouds are varied due to intra-individual temporal fluctuation
and inter-individual variation. How this will affect the building of ontotypes in our hybrid approach,
and how large cohorts of data clouds must be to offer statistical power, are yet to be determined.
Our two Demonstration Projects with their uniquely deep (high-dimensional and multiscale) data
in cohorts of limited but growing size are thus crucial first steps in a long journey of collective
learning in the TRANSLATOR community.