Project Summary
A person’s genome typically contains millions of variants, which represent the differences between that
genome and the reference human genome. Interpreting how these variants cause disease and
understanding the mechanism(s) behind their statistical associations with phenotype are crucial problems in
computational biology and genetics. These problems are difficult to address because over 90% of
disease-associated variants lie in non-coding regions whose regulatory functions are highly specific to cellular
context and remain poorly understood. The long-term goal of this project is to explain
mechanistically how non-coding genetic variants affect cellular context-dependent gene regulatory
networks and influence phenotypes. Expression quantitative trait locus (eQTL) mapping and gene
regulatory networks (GRNs) are two common approaches for interpreting the regulatory mechanisms of genetic
variants. eQTL mapping connects variants in non-coding regions to genes through population-based association
analysis. GRNs provide information on the cis-regulatory elements that control context-specific expression of target
genes, and information about the transcription factors that act on these elements. GRN-based variant
interpretation is complementary to eQTL mapping and has the potential to overcome the limitations of eQTL
mapping: (1) it is biased toward common alleles; (2) it cannot distinguish
variants in strong linkage disequilibrium; and (3) its power to detect trans-eQTLs is low. Most previous regulatory
analysis research based on ENCODE data did not include personal genotyping data, and most eQTL mapping
research did not include regulatory information. Joint modelling of eQTLs and GRNs would enable high-accuracy
and mechanistic variant interpretation. However, the dataset required for such analysis (matched gene
expression, epigenome, and genotyping data from the same individuals) is not available for a large human
sample. The available datasets are cross-individual paired genotyping and gene expression data, such as GTEx
data, and cross-cellular-context paired gene expression and epigenomics data, such as ENCODE data. These
two types of paired data are also available at the single-cell level. To achieve our long-term goal, we will develop
statistical methods to integrate these unmatched datasets (either bulk or single-cell) from different sources to (1)
infer high-accuracy context-specific GRNs to connect variants, transcription factors, cis-regulatory elements, and
target genes; and (2) detect trans-eQTLs that regulate target genes. These methods can be extended to interpret
disease-associated variants, identify causal variants, and infer personalized drug response to provide guidance
for precision medicine. This project is foundational to that goal, and it will increase our understanding
of how genetic variants contribute to phenotype.