Project Summary/Abstract
Many diseases are understudied because they are rare or of little public interest. The effect of each understudied
disease may be limited, but the cumulative effects of all these diseases could be profound. One common
research challenge for these diseases is that the resources allocated to each are often limited. For instance,
large-scale screening of drugs is often challenging, if not impossible, in small labs. The decreasing cost of
next-generation sequencing makes it feasible to generate gene expression profiles of understudied disease
samples. Integrating these expression profiles with other open data provides tremendous opportunities to gain
insights into disease mechanisms and identify new therapeutics for understudied diseases. We have utilized a
systems-based approach that employs gene expression profiles of disease samples and drug-induced gene
expression profiles from cancer cell lines to predict new therapeutic candidates for hepatocellular carcinoma,
Ewing sarcoma and basal cell carcinoma. All these candidates were successfully validated in preclinical models.
The success of this approach relies on multiscale procedures, such as quality control of disease samples,
selection of appropriate reference tissues, evaluation of disease signatures, and weighting of cell lines. A
plethora of relevant datasets and analysis modules are publicly available, yet they are isolated in distinct silos,
making it tedious to implement this approach in translational research. A centralized informatics system that
allows prediction of therapeutics for further experimental validation is thus of great interest to researchers
working on understudied diseases. Accordingly, we propose four specific aims: 1) developing novel deep
learning methods to select precise reference normal tissues for disease signature creation, 2) developing
computational methods to reuse drug profiles from other disease models for drug prediction, 3) integrating open
efficacy data to identify new targets from the systems-based approach, and 4) developing a centralized platform
and promoting it in the scientific community. This proposal will reuse several large open databases (e.g.,
TCGA, TARGET, GTEx, GEO, LINCS, CTRP, GDSC) and employ cutting-edge informatics methods (e.g., deep
learning). To demonstrate the scalability of the system, we will investigate three representative understudied
diseases: multiple organ dysfunction syndrome (Aim 1), diffuse intrinsic pontine glioma (Aim 2) and
hepatocellular carcinoma (Aim 3). A successful implementation of the systems-based approach can serve as a
model for using other large open omics data (e.g., proteins, metabolites) to discover therapeutics for diseases with unmet
needs. This proposal will bring together experts in informatics, statistics, and computer science, as well as physicians, from
Michigan State University, Stanford University, UC Berkeley and Spectrum Health. All data and code will be
released to the public for continuing development. The system will be deployed to our OCTAD portal
(http://octad.org), an open workplace for therapeutic discovery.