Increasing sample size is a tremendously important factor in building our understanding of the genetics of
human disease. As we discover that more and more diseases have a complex web of genetic causation, we
need larger and larger genetic datasets to disentangle them, and to ultimately produce successful therapies.
Driven in part by this need, the community is now assembling vast collections of human genome sequences,
and millions of samples will soon be commonplace. There is a profound problem, however: our computational
methods for storing, processing, and analyzing genomic data are lagging far behind. The algorithms and data
structures underlying today’s computational methods were designed for thousands of samples, not millions.
Without fundamental change in how we store and process genomic data, we will either not fully tap the
potential of the data we collect, or the computational costs will be astronomical – or both.
Nonhuman datasets, with applications in epidemiology, ecology, evolution, and agriculture, may not reach
these sample sizes soon, but they nevertheless face a related barrier. Simulation is increasingly important
for tasks from hypothesis generation to parameter inference, yet current simulation methods scale only to
tens or hundreds of thousands of individuals, far too few for many species of interest (e.g., mosquitoes).
This is crucial, since evolution and ecology in large populations differ from those in small ones, in ways
that cannot be avoided by mathematical tricks (such as rescaling).
Our proposal addresses these critical needs by focusing on a new data structure: the “tree sequence”,
which encodes genetic variation data in terms of the population genetic processes that produced it,
representing variation among contemporary samples through the underlying genealogical trees. This yields
extraordinary levels of data compression, with file sizes hundreds of times smaller than current community
standards. Since the tree sequence was introduced in 2016, it has led to performance increases of 2–4 orders
of magnitude in genome simulation, calculation of statistics, and ancestry inference. Such sudden leaps in
computational performance are vanishingly rare, and only possible through deep algorithmic advances.
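To make the idea concrete, here is a toy sketch of how a tree sequence achieves its compression; this is an illustration of the principle, not the actual tskit encoding, and the edge table and `tree_at` helper are invented for this example. Each edge records a parent–child relationship together with the genomic interval over which it holds, so adjacent local trees along the genome share most of their edges rather than being stored independently.

```python
# Toy tree-sequence sketch (illustrative only, not the tskit format).
# Three samples (nodes 0-2) and three ancestors (nodes 3-5) on a 10 kb
# region with two local trees separated by a recombination at 6 kb.
# Each edge is (left, right, parent, child): the relationship holds on
# the half-open genomic interval [left, right).
edges = [
    (0, 10000, 4, 0),     # sample 0 attaches to node 4 everywhere
    (0, 6000,  4, 1),     # sample 1 is under node 4 only on [0, 6000)
    (6000, 10000, 3, 1),  # ...and under node 3 on [6000, 10000)
    (0, 6000,  5, 2),
    (6000, 10000, 3, 2),
    (0, 6000,  5, 4),
    (6000, 10000, 4, 3),
]

def tree_at(position):
    """Recover the parent map of the local tree covering `position`."""
    return {child: parent
            for left, right, parent, child in edges
            if left <= position < right}

print(tree_at(1000))  # first local tree
print(tree_at(8000))  # second local tree; note the shared edge for sample 0
```

Because the edge for sample 0 spans the whole region, it is stored once but appears in both local trees; in real data, where adjacent genealogies are highly correlated, this sharing is what produces file sizes hundreds of times smaller than storing the variant matrix directly.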
Our research plan builds on the extraordinary successes of tree sequence methods so far, scaling up three
crucial layers of computational genomics: analysis, simulation, and inference. First, we will continue our
development of highly efficient tree-sequence-based methods for fundamental operations in statistical and
population genetics. Second, we will scale up genome simulations by integrating tree sequence methods
into complex forward-time simulations and utilizing modern, multicore processors. Third, we will combine
efficient simulations and the rich information contained in the tree sequence with cutting-edge deep-learning
techniques to develop new inference methods. Together, we aim to revolutionize the way we work with and
learn from population genetic variation data.