SUMMARY
Proteins are responsible for much of the structure and function of all cells. Subtle changes in expression of
various protein forms are critical for proper growth and development, but irregularities can cause deleterious
cellular effects or large-scale biological dysfunction. Sequencing samples with both high- and low-abundance
proteins could greatly accelerate research into protein function and biology, but there is currently no efficient
and cost-effective strategy to sequence mixtures of unknown protein molecules at single-amino-acid resolution.
Two methods are commercially available for protein sequencing. The first method, “Edman degradation”,
requires purification of the individual target protein. Bulk quantities of whole protein or purified fragments are
sequenced by cleaving off the first (N-terminal) amino acid and chemically identifying it. The second method,
based on mass spectrometry, requires enzymatically degrading a single protein or mixture of proteins into
small fragments, then analyzing the mass-to-charge ratio of each fragment. This information is
compared to that of known protein sequences to infer the identity of the input proteins. Both of these methods
require ~1 million molecules of each protein for detection. Currently, Edman degradation cannot be used on
heterogeneous protein mixtures, further limiting its utility.
Single-molecule protein sequencing is hindered by the number and diversity of amino acids, as well as by
interactions between amino acids that interfere with chemical identification of their side chains. Identifying an
N-terminal amino acid that is still attached to the rest of the protein is hindered by the N-1 (and N-2) amino
acids, with the interference proportional to the bulk of the side chain. Harsh denaturation agents can mitigate
some of these issues. However, these reagents can compromise the biomolecule-based identification systems
themselves and do not fully remove the steric hindrance that limits access to the N-terminal amino acid.
Glyphic Biotechnologies has developed a novel “Next-Generation” protein sequencing strategy, in which DNA
barcodes associate each round of cleaved N-terminal amino acids with a protein-specific barcode. Each of the 20
different amino acids will first be cleaved (circumventing the steric hindrance of the N-1 amino acid) and then
captured by specific antibodies. Each amino acid will then be associated with two barcodes, indicating (1) the
originating protein and (2) the sequential position at which this amino acid is found. After next-generation DNA
sequencing of all conjugated barcodes, this information can be deconvoluted, placing each amino acid into the
correct position within the correct protein.
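
The deconvolution step described above can be illustrated with a minimal Python sketch. It is not Glyphic's actual pipeline. It assumes each sequencing read has already been decoded into a hypothetical tuple of (protein barcode, position, amino acid). The function name, the tuple layout, the majority vote over repeated reads, and the "X" placeholder for unobserved positions are all illustrative assumptions.

```python
from collections import Counter, defaultdict


def deconvolute(reads):
    """Group decoded reads by protein barcode and order residues by position.

    `reads` is an iterable of (protein_barcode, position, amino_acid) tuples.
    This layout is an assumption for illustration; positions are 1-based.
    """
    # votes[protein][position] counts how often each amino acid was observed.
    votes = defaultdict(lambda: defaultdict(Counter))
    for protein_barcode, position, amino_acid in reads:
        votes[protein_barcode][position][amino_acid] += 1

    sequences = {}
    for protein_barcode, by_position in votes.items():
        length = max(by_position)  # highest position seen for this protein
        residues = []
        for pos in range(1, length + 1):
            if pos in by_position:
                # Assumed error handling: keep the most frequently read residue.
                residues.append(by_position[pos].most_common(1)[0][0])
            else:
                # No read recovered for this position (hypothetical placeholder).
                residues.append("X")
        sequences[protein_barcode] = "".join(residues)
    return sequences


# Made-up reads arriving in arbitrary order, as they would from a pooled run.
example_reads = [
    ("P2", 1, "G"),
    ("P1", 2, "K"),
    ("P1", 1, "M"),
    ("P1", 3, "V"),
    ("P2", 3, "A"),
    ("P1", 2, "K"),
]
result = deconvolute(example_reads)
print(result)
```

Because every read carries both barcodes, the reads can be pooled and sequenced in any order. The protein barcode selects the molecule and the position barcode selects the slot within it.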
This approach has the potential to be scaled to sequence millions to billions of single molecules simultaneously
in hours. Developing this technology will revolutionize protein analysis by making large-scale protein
sequencing feasible, inexpensive, and routine.