Project Abstract/Summary
More than 75% of the data generated from mass spectrometry (MS)-based omics experiments are wasted due to the
inefficiency of existing algorithmic methods that deduce peptides. The peptides that do get identified by existing
computational methods usually come from abundant proteins – and hence recent calls by scientists to study
overlooked proteins are gaining traction. These non-abundant and overlooked proteins might have the same (or
more) importance in human systems biology, health, and disease. Yet, all downstream analyses and conclusions –
related to human health – are based on suboptimal and incomplete peptide deductions, indicating that formal investigation
is warranted and urgently needed. In the past decade, advances in machine-learning (ML) models have provided
a critical step and have made it possible to develop more accurate and deeper pipelines for MS data analysis. Our
preliminary work and experiments suggest that the limited training search space offered by labelled spectral libraries
undermines the robustness and generalizability of existing ML models, which may not work effectively on
real-world data. The overall objective of my research lab using this MIRA mechanism is to design and develop robust,
reliable, and generalizable machine-learning models for peptide deduction from MS data from omics experiments.
Our proposed work addresses four key knowledge gaps in the development of ML models pursued via this MIRA grant that, if
filled, will lead to superior computational techniques capable of inferring both abundant and non-abundant peptides.
Our general strategy will involve the design and development of generative models, self-learning models, biologically
inspired models, and methods for uncertainty quantification. In addition, we will focus on two key gaps in
the adaptation of ML models, which will be filled by developing ML-ready workflows and easy-to-use software
infrastructure that can be used by scientists. This effort, supported by the MIRA grant mechanism, will fill a critical gap in our
understanding of, and ability to deduce, novel peptides and will contribute a fundamental tool for studying complex
communities in proteomics and meta-proteomics data. At the end of this grant funding cycle, it is our expectation that
we will have designed and developed a highly accurate ML peptide deduction engine that is capable of end-to-end analysis of
MS-based omics data and is robust, generalizable, and more accurate than its algorithmic counterparts. Our
proposed work will facilitate reproducibility by developing ML models that perform well irrespective of underlying MS
data quality or completeness, which will be a highly impactful outcome. This proposed work will also serve as the
foundation for analysis of more complex data sets related to meta-proteomics and proteogenomics, one of our
long-term goals that we hope to achieve using this MIRA grant mechanism.