Statistical Approaches to Unlock Protein Function from Deep Mutational Scans - Project Summary/Abstract

Understanding how genetic variants impact protein function is essential for unraveling the mechanisms underlying both basic biology and disease, particularly for rare genetic variants. Of the 4.6 million missense variants found in large population studies, only about 2% have clinical interpretations. Because they are rare, these variants are exceptionally challenging to study through observational methods. Deep Mutational Scanning (DMS) offers a high-throughput alternative: it tests thousands of protein variants by generating a mutant library and obtaining a phenotypic readout for each mutation in a single sequencing assay. Initially focused on fitness-based readouts, DMS has expanded to include fluorescence-based methods for protein profiling, binding assays, and more. It has been crucial for studying SARS-CoV-2 proteins, BRCA1, and drug transporters such as OCT1. More than 1,000 protein datasets are publicly available, and a recent study highlights the pace of technical advances by independently assaying over 500 additional proteins at once. Unfortunately, the development of statistical methods to analyze and interpret data from these technologies has not kept pace. For example, DMS with fluorescence-activated cell sorting (DMS-FACS), which has been used for nearly a decade to measure protein abundance and other functional phenotypes, still lacks dedicated analysis methods. As a result, analyses are often ad hoc, and small sample sizes (typically three replicates) make standard statistical methods unsuitable. Our recent work demonstrates that naive approaches both miss many real effects and produce many false discoveries. We propose advances in three statistical areas to improve DMS analysis and interpretation: accurate sample comparisons, epistasis analysis, and causal inference.
First, we will develop methods to analyze DMS-FACS data, both to assess how genetic variants affect the molecular phenotype targeted by FACS and to enable precise comparisons between experimental conditions. Second, we will develop methods to improve the analysis and interpretation of genetic interactions (epistasis) within proteins, allowing us to ask which protein regions act in concert. Third, we will open a new area of research for DMS, aiming to identify the causal impact of variants, acting through measured pathways, on phenotypes including complex traits. In summary, we will close the analysis gap for DMS-FACS and epistasis DMS, and causally link DMS data to phenotypes through structural causal models. Leveraging our expertise in DMS data and small-sample statistics, we will create reliable, robust tools for common workflows while also enabling new types of analyses that improve the interpretation of DMS, epistasis, and phenotypic relationships. With strong collaborations with assay developers and DMS experts, along with a proven track record in developing tools for high-throughput sequencing in small-sample contexts, we are well positioned to lead this effort.