PROJECT SUMMARY
Genetic variation affecting gene expression level and splicing accounts for a large proportion of phenotypic
variation between humans, including health and disease. The variants that underlie these phenotypic changes
are often discovered by associating individuals’ gene expression data with their genotypes. These methods
can be confounded by population structure in the sample, which leads to both false positives and false negatives. As
such, samples are often selected from relatively homogeneous populations. However, this limits the applicability
of results to populations not included in the study, and limits the resolution at which potentially causal variants
can be identified. Previous work has shown that controlling for population structure locally across the genome
in association studies of diverse samples reduces these errors. However, these methods assign individuals to
one of a few ancestral populations and do not fully capture the relatedness between included samples.
To extend the results of association studies to diverse cohorts, I will develop a method to control for
local relatedness between samples in association studies. The Ancestral Recombination Graph (ARG) is a
data structure that encodes the genealogical relationships between samples at each locus along the
genome. In Aim 1, I will develop a linear mixed model approach for association mapping that utilizes a
similarity matrix derived from the ARG to control for local relatedness between samples.
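As a rough illustration of the kind of model Aim 1 refers to, the sketch below fits a generic linear mixed model for a single variant, assuming a local similarity matrix K is already available. It is not the proposed method: how K would be derived from the ARG is not specified here, so K is treated as a hypothetical input, and the function name, the grid search over the variance ratio h2, and the maximum-likelihood fit are all simplifying assumptions.

```python
import numpy as np

def lmm_association(y, g, K, h2_grid=None):
    """Test one variant g against phenotype y under the model
        y = mu + g * beta + u + e,  u ~ N(0, sg2 * K),  e ~ N(0, se2 * I).
    K is a hypothetical n x n local similarity matrix (assumed given)."""
    if h2_grid is None:
        h2_grid = np.linspace(0.0, 0.99, 100)  # candidate sg2 / (sg2 + se2)
    n = len(y)
    X = np.column_stack([np.ones(n), g])       # intercept + genotype
    # Rotate by the eigenvectors of K so the covariance becomes diagonal.
    s, U = np.linalg.eigh(K)
    s = np.clip(s, 0.0, None)                  # guard against tiny negative eigenvalues
    yr, Xr = U.T @ y, U.T @ X
    best = None
    for h2 in h2_grid:
        d = h2 * s + (1.0 - h2)                # per-component variance, up to scale
        w = 1.0 / d
        XtWX = Xr.T @ (w[:, None] * Xr)
        beta = np.linalg.solve(XtWX, Xr.T @ (w * yr))  # generalized least squares
        r = yr - Xr @ beta
        sigma2 = np.sum(w * r**2) / n          # ML estimate of the variance scale
        ll = -0.5 * (n * np.log(2 * np.pi * sigma2) + np.sum(np.log(d)) + n)
        if best is None or ll > best[0]:
            se = np.sqrt(sigma2 * np.linalg.inv(XtWX)[1, 1])
            best = (ll, h2, beta[1], se)
    _, h2, b, se = best
    return {"h2": h2, "beta": b, "se": se, "z": b / se}
```

When K is the identity matrix, the random effect is indistinguishable from noise and the estimate reduces to ordinary least squares; a structured K down-weights signal that is shared among closely related samples at that locus.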
One barrier to extending the results of association studies investigating gene expression is that most
currently available data come from individuals of European descent. To address this limitation, I
recently generated gene expression data for a large, globally diverse human sample. In Aim 2, I will use the
method developed in Aim 1 to map expression level- and splicing-associated variation in this sample. I will then
investigate enrichment of epigenomic features near associated variants to determine the functional
mechanisms by which they may be driving transcription differences, and I will intersect my findings with
previously discovered disease associations. Using this globally diverse dataset, I will also explore the diversity
and evolution of human gene expression, elucidating the extent to which patterns of gene expression are
partitioned within versus between populations and the sources of such stratification.
Extending association studies to diverse cohorts requires not only diverse datasets, but also tools that
can appropriately control for patterns of population structure within those datasets; the research proposed here
addresses both goals. Doing so will allow the discovery of associations in previously underrepresented groups and
will improve confidence in identifying causal variants. Together, this proposed work will
characterize the functional mechanisms linking genetic variation and phenotypic differences in a globally
diverse human cohort.