A machine-learning platform to illuminate the chemical dark matter in mass spectrometry-based metabolomics
7. PROJECT SUMMARY/ABSTRACT
The human body contains thousands of small molecules, and is exposed to thousands more during daily life. This complex chemical ecosystem reflects both the endogenous metabolism of human cells and xenobiotic exposures from our diets, our gut flora, and our natural and built environments. At present, however, the vast majority of these small molecules remain unknown. Remarkably, this gap is not due to a lack of appropriate experimental technology: mass spectrometry-based metabolomics routinely detects thousands of distinct chemical signals in any biological sample. However, only a small fraction of these signals are routinely identified. The remaining profusion of unidentified chemical entities has been dubbed the “dark matter” of the metabolome. Computational tools to shed light on this chemical dark matter could transform our understanding of disease pathobiology, open new avenues for personalized medicine, and increase the scope and efficiency of any metabolomic study. At the same time, true chemical dark matter must be differentiated from the variety of technical artefacts, contaminants, and redundant forms of the same biomolecules that are also detected by mass spectrometry. This project proposes to establish a suite of computational tools that will dramatically advance our ability to interpret mass spectrometry-based metabolomic datasets, and thereby begin to unlock the dark metabolome. These tools will apply emerging techniques from the field of natural language processing, including the same large language model (LLM) architectures that power tools like ChatGPT, to address two of the most important unmet needs in small molecule mass spectrometry.
In Aim 1, we will develop DecipherMS, a computational tool for de novo annotation of both known and unknown chemical structures from MS/MS spectra. Despite decades of work in computational mass spectrometry, de novo annotation of unknown molecules remains a critical gap, with virtually all existing tools designed to search databases of known structures. DecipherMS will close this gap by using language models to decode unknown chemical structures directly from MS/MS spectra, using a novel data augmentation strategy to learn effectively from limited training data. In Aim 2, we will develop FoundationMS, a foundation model for mass spectrometry-based metabolomics. FoundationMS will standardize the data preprocessing workflows needed to determine which mass spectrometric signals should be brought forward for annotation in the first place. It will do so by learning from a repository-scale corpus of metabolomic data in a self-supervised manner. The resulting model will be fine-tuned to perform common preprocessing tasks including peak picking, retention time alignment, adduct removal, and chemical formula assignment. Both DecipherMS and FoundationMS will be rigorously benchmarked using appropriate datasets. Implementing these approaches in well-documented, user-friendly, and computationally efficient software will address central gaps in our ability to measure small molecules and shift existing paradigms in metabolomic data analysis.