PROJECT SUMMARY / ABSTRACT
Proteins play vital functional roles in essentially all biological systems, factoring into the complex expression of
phenotypes and diseases observed in human populations. The quantitative study of all proteins, i.e.
proteomics, has the potential to directly assess how protein dynamics vary across individuals, treatments, and
exposures, ideally in an unbiased fashion without requiring pre-specified, targeted candidates. Historically,
proteomics approaches have been constrained by the limitations of early mass spectrometry (MS)
technology. Transcriptomics has often been used in place of proteomics, though notably, the
regulation of proteins can be decoupled from that of their transcripts, rendering transcripts imperfect proxies.
Rapid advances in MS technology have made accurate and reliable proteomics increasingly feasible. Currently,
however, the statistical tools for proteomics lag behind, impeding the full use of these rich data
resources.
MS proteomics data possess a number of unique and challenging features that need to be addressed in their
statistical analysis. Proteins are not directly measured, but are instead first fragmented into smaller peptides. A
protein's abundance must then be reconstructed from its component peptides. Complications in this process
include peptides that possess coding variants (~10% of peptides in one of our data sets), peptides that map
to multiple proteins (~50%), and high levels of peptides that are unobserved in at least one of the samples
(~50%). Design features of the MS experiment, such as the use of isobaric labels, can influence the observed
pattern of missing data as well as the extent of technical sources of variation, motivating the need for flexible
analytical tools. To meet this need, I will use Bayesian approaches to model MS proteomics data, flexibly
incorporating multiple sources of error and addressing these challenging features of the MS experimental
procedure. The resulting statistical software will be applied to multiple large proteomics data sets from
genetically diverse mouse populations with levels of genetic variability similar to those of human populations.
With the improved protein abundance estimates from my software, I will then perform genetic analyses to
identify novel genetic regulators of the abundance of proteins, their complexes, and their interaction networks.
Specific to the experimental context of each data set, I will connect these regulatory signatures to important
biological processes, such as aging in the kidney and heart and glucose metabolism in pancreatic islet cells.
This project will produce new statistical tools that will increase the utility of MS proteomics data and the power
of downstream genetic analyses, which will be demonstrated in real data. Novel genetic regulatory
relationships underlying protein dynamics and functional networks will be identified. These tools and
approaches will be relevant to diverse research communities, including those focused on humans, model
organisms, and specific diseases.