PROJECT SUMMARY
Untargeted metabolomics using tandem mass spectrometry (MS) has attained substantial success in the
discovery of biomarkers and advancing our understanding of cellular metabolism. Despite this success, only a
small fraction of measured spectra can currently be annotated (assigned a chemical identity). This bottleneck
can be attributed to the limitations of current annotation tools that have not yet exploited advances in deep
learning and available data modalities (spectra, peaks, molecules, and fragments). The goal of this application
is to advance the interpretation of spectra collected through untargeted metabolomics. We focus on annotating
data collected through liquid or gas chromatography followed by MS, or MS/MS, as these three tandem
technologies have become dominant. Over the next five years, the plan is to harness deep learning
to address three problems: 1) annotation, 2) translation between spectra measured under different instrument
settings, and 3) explainable models for annotation, where explainability arises from connecting peaks to their
respective molecular fragments.
The Hassoun lab has extensive, relevant deep learning experience to effectively tackle these problems.
The lab also has experience in dealing with the nuances of metabolomics datasets. The lab recently developed
a novel deep learning annotation model that achieves 41% and 30% performance improvement over multi-layer
neural networks and graph neural networks, respectively. Additionally, our lab has developed an ontology-
traversal algorithm that yields correct-by-construction molecular substructures that can be assigned to peaks,
thus giving rise to datasets that can be used to train explainable annotation models.
The Significance of this research is that it addresses fundamental barriers that hinder the development of deep
learning annotation models. Our models and datasets will be released on GitHub to benefit biological and
biomedical applications and metabolomics research. Because of their expected high accuracy and explainability,
the models will expedite the interpretation of experiments, improve our understanding of cellular metabolism,
and facilitate data sharing among labs. The innovation lies in maximally learning from data modalities and in
creating models that exploit the learned representations. Further, the annotation and translation problems are
formulated as a bidirectional mapping between domains, in contrast to current annotation models that assume
unimodal mappings. These innovations are necessary to advance metabolomics research, and they will open
new research horizons in the field.