Deep learning methods to accelerate discovery of drugs targeting gene regulatory proteins - SUMMARY

To evaluate how a drug candidate affects cells, researchers often study how the abundance or behavior of a specific set of proteins is changed by treatment with each compound. However, it is not currently possible to test the effect of every possible drug compound (>500,000) on every human protein (~20,000) in hundreds of different types of cells. Even the most advanced protein analysis systems available today could only measure and process a tiny fraction of these combinations in a feasible timeframe. One method of measuring the abundance of all the proteins in a cell sample is mass spectrometry, but available instruments can only analyze several samples per day.

To increase the throughput of these mass spectrometry experiments, in Aim 1 of the proposed project we will develop a machine learning algorithm that will reconstruct the peptide composition of a large number of samples from measurements of a smaller number of mixtures of those samples. This technique, called “compressed sensing,” was developed for digital imaging to reduce (compress) the file size of an image. Importantly, it can also “decompress” a small amount of collected information to reconstruct an image with surprisingly high detail. Similarly, we will develop a compressed sensing algorithm to extract the individual protein profiles from mixtures of multiple combined samples. Initially, this approach will analyze 1,000 samples from 250 measurements of mixtures of those samples, providing a 4-fold increase in speed. Ultimately, with a much larger number of samples, it may allow a 100-fold increase in the number of samples analyzed.

To accelerate interpretation of this type of data for drug discovery, we will create a machine learning algorithm to simplify complex patterns of interactions between test compounds and the proteins within various types of cells.
Previously acquired data will be modeled to learn the effects of individual compounds on various proteins. By learning from a large number of these data sets, which describe interactions between specific compounds and proteins in many different cell types, the model will be able to predict the effect of untested compounds on proteins within various types of cells. In addition, it will be able to indicate which future experiments would be most useful for obtaining information on classes of compounds or proteins that are lacking in the current data sets. The combination of these two techniques has the potential to greatly accelerate development of novel drugs by providing a large increase in the throughput of protein abundance measurements, along with a powerful method to predict how drugs will alter the expression of proteins in cells.
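
The pooled-measurement idea described for Aim 1 (1,000 samples recovered from 250 mixtures) can be illustrated with a toy sketch. This is not the proposed algorithm, which the summary does not specify. It assumes, purely for illustration, that the abundance change of a single peptide is nonzero in only a few samples, that each mixture is a random half of the samples, and that measurements are noiseless. The recovery method, orthogonal matching pursuit, is a standard compressed sensing technique swapped in as a stand-in. All names (`omp`, `pooling`, `x_true`) are hypothetical.

```python
import numpy as np

def omp(A, y, n_nonzero):
    """Orthogonal matching pursuit: greedy sparse solution of y = A @ x."""
    norms = np.linalg.norm(A, axis=0)
    residual = y.copy()
    support = []
    coef = np.zeros(0)
    for _ in range(n_nonzero):
        # Pick the column most correlated with the current residual.
        scores = np.abs(A.T @ residual) / norms
        if support:
            scores[support] = 0.0  # never pick the same sample twice
        support.append(int(np.argmax(scores)))
        # Refit by least squares on all columns chosen so far.
        coef, *_ = np.linalg.lstsq(A[:, support], y, rcond=None)
        residual = y - A[:, support] @ coef
    x = np.zeros(A.shape[1])
    x[support] = coef
    return x

rng = np.random.default_rng(0)
n_samples, n_mixtures, n_changed = 1000, 250, 5  # sparsity level is an assumption

# Hypothetical ground truth: one peptide's abundance change in each sample,
# nonzero in only a handful of samples.
x_true = np.zeros(n_samples)
changed = rng.choice(n_samples, size=n_changed, replace=False)
x_true[changed] = rng.uniform(1.0, 3.0, n_changed) * rng.choice([-1.0, 1.0], n_changed)

# pooling[i, j] = 1 if sample j is included in mixture i (random design, assumed).
pooling = rng.integers(0, 2, size=(n_mixtures, n_samples)).astype(float)
y = pooling @ x_true  # one simulated measurement per mixture

# Centering columns and measurements preserves y = A @ x exactly and removes
# the shared offset that makes 0/1 pooling columns look alike.
A_c = pooling - pooling.mean(axis=0)
y_c = y - y.mean()

x_hat = omp(A_c, y_c, n_changed)
print("fold reduction in measurements:", n_samples / n_mixtures)  # 4.0
print("max reconstruction error:", np.max(np.abs(x_hat - x_true)))
```

In this sketch the 4-fold figure is simply the ratio of samples to mixtures. How far that ratio can be pushed depends on how sparse or structured the real peptide profiles are and on measurement noise, neither of which the toy model captures.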