PROJECT SUMMARY
Advances in omics technology have the power to provide integrative models of disease risk and influence
health outcomes. However, the utility of these models has so far been limited to non-African populations, due
to biases in available datasets. Further, efforts to identify medically relevant genetic variants have included only
a subset of known genetic variants and have had limited focus on phenotypes most relevant to Africa. Newly
available genomic datasets from the African continent provide a rich opportunity to begin addressing this gap.
Most large genomics efforts in both Africans and non-Africans have focused on single nucleotide
polymorphisms (SNPs), excluding a large fraction of more complex and ancestry-specific variant types such as
genomic repeats. Here, we consider multiple complex variant types, focusing on tandem repeats (TRs). TRs
are well known to contribute to human disease. For example, large repeat expansions are implicated in
Huntington’s disease and other disorders, and stepwise variation in repeat copy number at TRs has been
implicated in a variety of complex traits. Although their role in human phenotypes is well established, discovery
efforts in repeat regions have been largely limited to datasets and phenotypes dominated by non-Africans.
We hypothesize that detailed analysis of repeat variants in Africa will identify novel disease-associated
loci including pathogenic repeat expansions, as well as improve the utility of risk prediction models,
ultimately leading to improved diagnosis and health outcomes. Our proposal leverages existing and novel data
analysis approaches to interrogate technically challenging repetitive regions and integrates diverse genomics
datasets from across the African continent, including (1) whole-genome sequencing (WGS) from more than
1,000 individuals, (2) SNP array data from more than 10,000 individuals, and (3) health outcome information
related to trypanosomiasis, HIV status, chronic kidney disease, cancer risk, and cardiometabolic traits with high
prevalence in African populations. We will further incorporate existing biobanks containing tens of thousands of
diverse genomes (admixed Africans from All of Us and UK Biobank) to validate findings and improve power.
The overall goal of this proposal is to improve health outcomes in Africa using innovative data analysis
and machine learning techniques. Specifically, we will characterize genome-wide TR variation in African
individuals (Aim 1), identify signals of positive and negative selection at these regions (Aim 2), and identify TRs
associated with medically relevant phenotypes and generate improved ancestry-specific polygenic risk scores
(Aim 3). We bring together a diverse team spanning Africa (headed by MPIs Adebiyi and Jjingo) and the US
(MPI Gymrek), which has already initiated a fruitful collaboration. Further, analyses will be performed primarily
using existing African supercomputing infrastructure and led by new and early-stage African investigators and
trainees. Overall, the proposed aims will likely identify novel medically relevant genetic variants and continue to
foster data science capabilities within Africa.