Next-Generation Algorithms in Statistical Genetics Based on Modern Machine Learning - PROJECT SUMMARY/ABSTRACT
Advances in technology are enabling the collection of massive datasets of millions of human genomes. The
long-term vision underlying this proposal is to leverage modern datasets and machine learning (ML) to
understand how genetics and the environment determine traits and outcomes important to improving human health.
Modern ML thrives on vast datasets of millions of unstructured datapoints (genomes, clinical notes, images),
and stands to greatly impact statistical genetics, the field which studies the genotype-phenotype link.
Improvements in statistical genetics have in turn the potential to elucidate the genetic basis of disease and
support personalized medical therapies.
This proposal advances the above vision via two thrusts: (1) developing novel ML algorithms
motivated by problems in statistical genetics; (2) creating open-source software systems for scientists
and clinicians based on these algorithms. Specifically, we describe a plan for the development of
computational methods in three broad areas in statistical genetics: modeling linkage disequilibrium and the structure of
genetic variation, analyzing genome-wide association study data, and predicting risk from genetic and
environmental factors. Within each area, we aim to develop open-source software for key applied problems
including genetic imputation, haplotyping, low-pass sequencing, causal variant identification, and risk scoring.
Our research seeks to establish a foundation for statistical genetics based on modern ML and also advance
ML in directions that may not be pursued in other application domains. Our methods will support technologies
that have immediate applications in healthcare and that help reveal novel genetic factors that influence
disease; improve the accuracy of genomic prediction in domains from preventive medicine to pharmacogenomics;
significantly reduce the cost of genomic sequencing assays; and ultimately improve human health.