Decode polyadenylation in health and disease - SUMMARY The 3’ ends of almost all eukaryotic mRNAs are formed by cleavage and polyadenylation (CPA), a process that is finely controlled through the interplay between cis elements at the poly(A) site (PAS) and a large number of trans-acting factors, including the core CPA machinery and regulatory factors. CPA is not only an essential step in mRNA maturation but also directly impacts transcription termination and mRNA export, translation, and stability. Moreover, ~70% of human genes produce multiple mRNA isoforms by using alternative PASs, a process called alternative polyadenylation (APA). Different APA isoforms from the same gene can encode different proteins or can be differentially regulated. CPA is highly regulated during development, and aberrant CPA has been causally linked to a broad range of diseases, from neurological disorders to cancer. For example, genetic variants at PASs have been shown to cause IPEX syndrome, thrombophilia, alpha/beta thalassaemia and complex immunodeficiency diseases. Despite the importance of CPA, it remains a major challenge to predict CPA efficiency from PAS sequences, or to predict the impact of genetic variants on CPA and gene expression, because PAS sequences are highly variable. CPSF73, a core component of the CPA machinery and the endonuclease responsible for the cleavage step, has emerged as a drug target for treating many diseases. For example, the anti-cancer and anti-inflammatory compound JTE-607 directly targets CPSF73 and inhibits its activity. Although CPSF73 is required for CPA in all cell types, JTE-607 preferentially kills myeloid leukemia and Ewing’s sarcoma cells. The mechanism underlying this specificity remains unknown. Our recent publication revealed that JTE-607 blocks mRNA 3’ processing in a sequence-dependent manner, suggesting that the sequence of a PAS determines not only its overall CPA efficiency but also its drug sensitivity.
To fully understand the role of CPA in health and disease, we thus need to expand the poly(A) code to capture the interplay between sequence, CPA efficiency and drug sensitivity. Such a poly(A) code will be important not only for understanding the fundamental mechanisms of gene expression, but also for identifying disease-causing mutations, especially in the non-coding regions of the human genome. Massively parallel reporter assays (MPRAs) coupled with machine learning have emerged as a powerful tool for studying sequence-function relationships in a wide variety of biological contexts. Here we propose to apply this approach to understand how PAS sequence determines CPA efficiency and drug sensitivity. We will then apply the resulting code to identify disease-causing genetic variants in the human genome. Successful completion of the proposed studies will improve our understanding of the fundamental mechanisms of gene regulation and facilitate the discovery of disease-causing genetic variants and novel anti-cancer therapies.