PROJECT SUMMARY/ABSTRACT
Nominating candidate risk genes and gene sets underlying disease-critical processes is of utmost importance
for developing drug targets and informing CRISPR screening experiments. To this end, large-scale single-cell
genomic and epigenomic data (from RNA-seq, ATAC-seq, Perturb-seq) can be integrated with genome-wide
association studies (GWAS) to enhance our understanding of the genetic architecture of human complex
diseases and traits. In this proposal, I plan to develop new computational approaches to integrate single-
cell functional genomic and epigenomic data with GWAS data for complex diseases and traits to identify
and rank disease-critical genes and gene sets characterizing functional processes, as well as pinpoint
short genomic regions linked to these disease-associated genes. My K99 training will be conducted at the
Harvard T.H. Chan School of Public Health, as well as the Broad Institute, under the mentorship of Dr. Alkes
Price. The key areas of my training will be to develop and evaluate approaches for characterizing the gene-level
and gene set-level functional architecture of diseases and traits and for the integrative analysis of single-cell and
bulk functional
genomics data with human disease genetics. My proposed approaches will attempt to bridge the gap between
functional genomics, human genetics, and downstream clinical drug/gene intervention experiments. The
long-term goal of this research is to produce a set of computational tools that identify and rank top disease-critical
genes, top disease-critical gene sets characterizing cell types or cellular processes, and gene-linked genomic
regions for each disease/trait. These approaches will reshape our understanding of the functional architecture
of human diseases at the cellular level and will inform future drug perturbation and CRISPR screening experiments.
The first aim of this proposal is to develop methods to identify and rank disease-critical genes by integrating
common and rare variant disease associations with gene-level functional information derived from single-cell
genomics experiments. Here, I will develop, compare, and contrast multiple gene prioritization strategies that differ
in how they annotate SNPs for a gene, how they aggregate variant-level associations at the gene level, and how they
use functional data in performing the gene prioritization. The second aim of this proposal is to develop new
computational strategies to assess disease information in sets of genes that underlie a cell type or a cellular
process active within or across cell types in a tissue. The third aim of this proposal is to pinpoint and prioritize
short genomic regions that are either proximally or functionally linked (for example, as an enhancer) to
disease-critical genes and gene sets from Aims 1 and 2. Here, I plan to integrate GWAS signal near these
gene-linked regions with single-cell ATAC-seq data and with deep learning models that can infer allelic effects
at base-pair resolution. All disease-critical genes, gene sets, and gene-linked regions, along with the relevant
computational tools, will be distributed publicly to the scientific community.