Project Summary/Abstract
The Human Genome Project (HGP) completed the first draft human genome sequence two decades ago. The
HGP revealed that human complexity arises from only approximately 20,000 coding genes, roughly the same
number as in much simpler organisms such as nematodes. Intricate patterns of transcriptional regulation mediated
by non-coding regulatory elements specify the myriad cell types and states required for human complexity.
Genome-wide association studies have subsequently identified thousands of disease-associated variants, many
of which impair the function of these non-coding elements and thereby disrupt transcriptional regulation. Thus,
to better understand human physiology and pathophysiology, comprehensive atlases of regulatory elements are
essential. Many previous efforts, including the International Human Epigenome Consortium (IHEC), the
FANTOM Consortium, the Roadmap Epigenomics Project, and the ENCODE Project, have aimed to build
comprehensive collections of regulatory elements, as well as computational models to better predict regulatory
activity and understand the sequence features underlying regulatory function. ENCODE (2003-2022) is a
large-scale consortium effort that aims to annotate every functional non-coding element of the human genome;
during our work on the project, we built a Registry of approximately 1 million human candidate cis-regulatory
elements (cCREs). We further developed deep-learning approaches which model the transcription factor motif
syntax that underlies element function at base-pair resolution and built two web-based resources, SCREEN and
Factorbook, to make our results accessible to the scientific community. Here, we propose to extend this
framework to build the Community Resource for Transcriptional Regulation (CRTR), a comprehensive atlas of
non-coding regulatory elements and machine-learning models which will encompass community and consortium
deep-sequencing data, both bulk and single-cell, across a broad array of cell types and states. Our project has
five aims. First, we aim to curate community and consortium data for inclusion in CRTR and perform uniform
processing and quality control. Second, we aim to train deep-learning sequence models on bulk epigenetic
datasets to identify transcription factor motif syntax driving regulatory element activity in distinct tissues and cell
types. Third, we aim to train sequence models on single-cell datasets to identify transcription factor motif syntax
driving transcriptional regulation in high-resolution cell states and during cell state transitions. Fourth, we aim to
use the aforementioned results to build comprehensive benchmark datasets and machine-learning model
collections, which will aid future analysts in designing new models to predict regulatory readouts. Fifth, we aim
to build a state-of-the-art web-based user interface to enable users to perform integrative analyses and in silico
experimentation with CRTR, and to hold workshops and conduct other outreach to maximize the impact of the resource and
its accessibility to the broader scientific community.