Our ability to identify genetic sequence variation in humans has thus far outstripped the field’s ability to
interpret these mutations. Genome-wide association studies have identified hundreds of thousands of genomic
loci associated with disease risk and human phenotypic traits, yet in few instances do we know the identity of
the exact causal mutation or the molecular mechanism behind its function. Much of this limitation is due to a
large portion of this variation residing in cis-regulatory elements (CREs), where our inability to identify a variant’s
regulatory impact or target gene(s) presents a major hurdle. Better understanding of regulatory grammar,
the complex logic of how sequence content in CREs controls transcription, is a crucial next step for genomics,
but requires a vast expansion of well-characterized regulatory mutations.
To achieve this goal, we will employ a multi-pronged approach to build a large-scale functional catalog of
regulatory variants. We will focus on CREs harboring genetically fine-mapped, likely causal variants from global
populations for a variety of metabolic traits and diseases (Aim 1). We will first identify CRE-gene interactions
using highly sensitive and scalable endogenous CRISPR approaches. This large-scale mapping effort will
inform our understanding of the CRE-gene targeting logic of regulatory grammar. We will use these data to map
the transcriptional architecture of metabolic complex traits. We then propose to interrogate sequence
determinants of regulatory grammar for hundreds of trait-associated CREs at their endogenous location in the
genome (Aim 2). We will first develop an endogenous saturation mutagenesis system to generate hundreds of
thousands of nucleotide changes in these CREs. We will then assay the regulatory architecture of these
changes using multiplexed amplicon ChIP-sequencing to identify epigenetic changes, and HCR-FlowFISH to
detect transcriptional changes. In addition to identifying causal variants for a variety of metabolic diseases, this
proposal will generate a repertoire of 300,000+ functionally characterized regulatory variants. This variant
impact catalog will serve as an ideal training set to model regulatory grammar with our powerful machine
learning approaches. We will incorporate endogenous saturation mutagenesis data into our variant effect
prediction models (VEPs). Importantly, such models will find utility across global populations because they will
capture a universal regulatory code of the human genome and thus enable interpretation of population-specific
variation. We will then apply these VEPs to understudied variation and understudied populations.
Overall, this proposal is structured to generate a functional characterization catalog at multiple levels:
first, providing molecular mechanisms and gene targets for thousands of causal variants; second, building
a comprehensive understanding of the genomic etiology of phenotypically related complex traits; and finally,
providing the scale of endogenous data necessary to improve VEPs. Our approach combines our group’s
unique expertise spanning functional genomics, CRISPR screens, statistical genetics, and machine learning.