Abstract
Proteoforms are key mediators of biological phenotypes. However, there is no systematic way to uniquely identify
these chemical entities and no database to catalog proteoforms for future reference and use. To make
proteoforms findable, accessible, interoperable, and reusable (FAIR), experimentally verified proteoforms
need to be uniquely identified and stored in an open framework for use by the scientific community. If a
proteoform is easily recognized and linked to known biological metadata, then future researchers can link their
discoveries with previous ones and formulate new hypotheses. This is important to the members of the
Consortium for Top-Down Proteomics (CTDP), a non-profit organization established to promote top-down
proteomics.
Here, we propose to create a scalable, two-tiered informatic framework for the organization and storage of
experimentally verified proteoforms. The system will have a central database, which stores a minimal set of
information regarding each proteoform, and a flexible framework for creating individual proteoform
knowledgebases. Interest in developing such a resource is high among the top-down proteomics community,
software developers, and leading bioinformaticians (see 17 Letters of Support). This includes a strong desire
from UniProt to use experimentally verified proteoforms to bolster their leading protein knowledgebase. After the
granting period is over, we believe that the central database should be community-owned and curated by the
CTDP, and the knowledgebase framework should be open-source and maintained by the top-down community.
Therefore, this proposal is split between the development of deliverable software and the expansion of
existing community-centered collaborations for software dissemination.
The Specific Aims focus on: 1) establishing norms for communicating proteoforms; 2) developing public
proteoform databases and the domain-specific proteoform knowledgebase framework; and 3) engaging the
scientific community to promote their use. The success of this project will be measured through its dissemination.
Upon completion of this grant, we will have established and prepared a self-governing body to oversee the
development and maintenance of bioinformatic software for the storage and dissemination of experimentally
verified proteoforms. This body, managed by the CTDP, will have the initial tools to create public proteoform
databases and have a sustainable governance system in compliance with FAIR principles.