Genome-wide characterization of complex variants and their phenotypic effects in African populations - Advances in omics technology have the power to provide integrative models of disease risk and influence health outcomes. However, the utility of these models has so far been limited to a subset of individuals because the available datasets are limited. Further, efforts to identify medically relevant genetic variants have included only a subset of the known classes of genetic variation. Newly available genomic datasets from the African continent provide a rich opportunity to begin addressing this gap by covering a far broader scope of global genetic variation, comparable to the diversity represented in the USA, than existing datasets that have focused heavily on individuals of European descent. Most large genomics efforts in both Africans and non-Africans have focused on single nucleotide polymorphisms (SNPs), excluding a large fraction of more complex and population-specific variant types such as genomic repeats. Here, we consider multiple complex variant types that are well known to contribute to human disease: tandem repeats (TRs), comprising short tandem repeats (STRs) and variable number tandem repeats (VNTRs), and structural variants (SVs). For example, large repeat expansions are implicated in Huntington’s disease and other disorders, and stepwise variation in repeat copy number at TRs has been implicated in a variety of complex traits. Although the role of repeats in human phenotypes is well established, discovery efforts in repeat regions have been largely confined to datasets and phenotypes drawn from a narrow set of populations. We hypothesize that detailed analysis of repeat variants in diverse populations, such as those of the USA and Africa, will identify novel disease-associated loci, including pathogenic repeat expansions, and will improve the utility of risk prediction models across world populations, ultimately leading to improved diagnosis and health outcomes.
Our proposal leverages existing and novel analysis approaches to interrogate technically challenging repetitive regions and integrates datasets from across the African continent, including (1) whole-genome sequencing (WGS) data from more than 1,000 African individuals (H3Africa-Baylor, TrypanoGen, CAfGEN, and various other African genomics cohorts), (2) SNP data from more than 10,000 individuals (AWI-Gen), and (3) health outcome information related to trypanosomiasis, HIV status, chronic kidney disease, cancer risk, and cardiometabolic traits with high prevalence in populations of African ancestry. We will also incorporate existing biobanks containing tens of thousands of genomes from different populations, including admixed African individuals from All of Us and UK Biobank, to validate our findings and improve power. The overall goal of this proposal is to improve health outcomes using innovative data analysis and machine learning techniques. We bring together a multi-disciplinary team spanning Africa (headed by MPIs Oyelade and Jjingo and consultant Senior Scientist Adebiyi) and the US (MPI Gymrek) that has already initiated a fruitful collaboration. Specifically, we will characterize genome-wide TR variation (Aim 1), identify signals of positive and negative selection at these regions (Aim 2), and identify TRs associated with medically relevant phenotypes and generate improved population-specific polygenic risk scores (Aim 3).