Project Summary
Modern human genomes are mosaics of variation from numerous archaic non-human hominins, often termed
“ghost” populations. However, our understanding of the evolutionary history of “ghost” variation is still developing.
Importantly, computational methods to address missing “ghost” variation are still nascent, and not accounting for
the presence of “ghosts” often leads to erroneous inference. Here I propose a series of software
developments to improve inference of evolutionary history from modern human genomes while
accounting for gene flow from archaic “ghosts”. In AIM 1, I propose to develop a parallelized, maximum-likelihood
statistical framework for estimating population genetic structure from multi-allelic, multi-locus genomic data that
incorporates sequencing and imputation errors in data considered missing due to gene flow from archaic “ghost”
populations. This method will be implemented in a computationally efficient program called p-MULTICLUST, a
multi-threaded, parallelized tool which extends the popular “admixture” model implemented in tools like STRUCTURE
and ADMIXTURE to account for missing multi-allelic human genomic data. AIM 2 will involve a two-pronged approach
to estimate evolutionary history and
population structure in the presence of gene flow from an archaic “ghost” under the Isolation with Migration (IM)
model. We will (a) develop extensions to the IMa3/IMa2p suite of tools to incorporate joint estimation of
population structure and demographic history from genomic data, and (b) train undergraduate students in
developing simulation models for the stdpopsim consortium under two important models of human history – (1)
archaic “ghost” gene flow in native Africans, and (2) multiple epochs of admixture into Asians/Oceanians. In AIM
3, I propose to quantify the selection landscape of “ghost” variation across diverse human genomes that results
from ancestral gene flow from now-extinct “ghost” populations. In this aim, we will focus on (a) improvements to the
MigSelect program to quantify linked selection effects due to gene flow from “ghost” populations under the IM
model, and (b) a larger, more encompassing study of functional genomic variation across diverse human
populations, including high-quality genomes from Africa, supplemented with more complete Neanderthal and
other non-human hominin genomes, which will help us delineate patterns of human evolutionary history and
understand the functional consequences of archaic gene flow. These discoveries also have direct consequences
for understanding modern human ancestry and disease allele evolution. Importantly, this R15 will train numerous
underrepresented undergraduate and graduate students in genomics and bioinformatics, towards careers in
the biomedical and data sciences.