Scaling up computational genomics with tree sequences

Project Summary/Abstract

Increasing sample size is a tremendously important factor in building our understanding of the genetics of human disease. As we discover that more and more diseases have a complex web of genetic causes, we need ever larger genetic datasets to disentangle them and, ultimately, to produce successful therapies. Driven in part by this need, the community is now assembling vast collections of human genome sequences, and millions of samples will soon be commonplace. There is a profound problem, however: our computational methods for storing, processing, and analyzing genomic data are lagging far behind. The algorithms and data structures underlying today’s computational methods were designed for thousands of samples, not millions. Without fundamental change in how we store and process genomic data, we will either fail to tap the full potential of the data we collect, or the computational costs will be astronomical – or both.

Nonhuman datasets, with applications in epidemiology, ecology, evolution, and agriculture, may not reach these sample sizes soon, but here we nevertheless face a related barrier. Simulation is increasingly important for tasks ranging from hypothesis generation to parameter inference. However, current simulation methods scale only to tens or hundreds of thousands of individuals, which is inadequate for many species of interest (e.g., mosquitoes). This matters because evolution and ecology in large populations differ from those in small ones, in ways that cannot be sidestepped by mathematical tricks (such as rescaling).

Our proposal addresses these critical needs by focusing on a new data structure, the “tree sequence”, which encodes genetic variation data in terms of the population genetic processes that produced it, representing variation among contemporary samples via the underlying genealogical trees.
This yields extraordinary levels of data compression, with file sizes hundreds of times smaller than current community standards. Since the tree sequence was introduced in 2016, it has brought performance gains of 2–4 orders of magnitude in genome simulation, calculation of statistics, and ancestry inference. Such sudden leaps in computational performance are vanishingly rare, and are only possible through deep algorithmic advances.

Our research plan builds on the extraordinary successes of tree sequence methods so far, scaling up three crucial layers of computational genomics: analysis, simulation, and inference. First, we will continue developing highly efficient tree-sequence-based methods for fundamental operations in statistical and population genetics. Second, we will scale up genome simulations by integrating tree sequence methods into complex forward-time simulations and exploiting modern multicore processors. Third, we will combine efficient simulations and the rich information contained in the tree sequence with cutting-edge deep-learning techniques to develop new inference methods. Together, we aim to revolutionize the way we work with and learn from population genetic variation data.