Project Summary. Single-cell RNA-sequencing (scRNAseq) technologies measure transcriptome-wide gene
expression at the single-cell level. In contrast to bulk RNA-sequencing, scRNAseq can elucidate dynamic
expression patterns between different cellular populations. A key problem in scRNAseq studies is the inability
to directly transfer knowledge between independent sequencing studies. As a result, researchers have had to
spend significant time and resources generating massive datasets to enable meaningful analyses, a process
that is costly and often not reproducible. Another transformative technology is
spatial transcriptomics (ST), which provides genetic profiles of cells while retaining the positional information
of each sequenced cell. ST has the potential to expand our understanding of cellular heterogeneity, interactions,
and pathology; however, ST is still an emerging technology and is not yet widely available.
This proposal will fulfill the unmet need for scalable algorithms that transfer knowledge from existing datasets
to new studies and leverage learned representations to construct the sequenced tissue's spatial information. I
propose to achieve these goals through the following aims: (1) transfer knowledge from existing public single-cell
data to new experimental data using a deep neural-attention network, and (2) develop the first spatially informed
model for generating realistic scRNAseq data. In Aim 1, I will use attention mechanisms (which have
revolutionized many fields in computer science) to learn complex gene dependencies and important biological
features (e.g., marker genes) in a fully self-supervised manner, providing much-needed biological
interpretability. Such a model can be used in many tasks and for datasets with
relatively few samples. The learned knowledge obtained from Aim 1 will be used directly in Aim 2. In Aim 2, I
will build upon our state-of-the-art generative model to generate synthetic data that contains spatial information
(coordinates) of sequenced cells, even when no atlas is available. This model will allow researchers to produce
synthetic data with spatial information and augment sparse and noisy datasets for more robust and accurate
analyses, all without the need for additional costly experiments.
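
For readers unfamiliar with the attention mechanism named in Aim 1, the following is a minimal sketch of generic scaled dot-product self-attention applied to a small set of gene embeddings. It is an illustration only and not the proposed model: the function name `self_attention`, the dimensions, and the randomly generated embeddings and projection matrices are all assumptions made for this example.

```python
import numpy as np


def self_attention(x, w_q, w_k, w_v):
    """Generic scaled dot-product self-attention (illustrative sketch).

    x:   (n_genes, d_model) array, one hypothetical embedding per gene
    w_*: (d_model, d_head) projection matrices (random here, learned in practice)
    Returns the attended features and the (n_genes, n_genes) attention weights.
    """
    q = x @ w_q  # queries
    k = x @ w_k  # keys
    v = x @ w_v  # values

    # Pairwise gene-gene similarity, scaled by sqrt of the head dimension.
    scores = q @ k.T / np.sqrt(k.shape[-1])

    # Numerically stable softmax over each row.
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)

    return weights @ v, weights


# Toy inputs: 5 genes, 8-dimensional embeddings, 4-dimensional head (all assumed).
rng = np.random.default_rng(0)
n_genes, d_model, d_head = 5, 8, 4
x = rng.normal(size=(n_genes, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))

out, attn = self_attention(x, w_q, w_k, w_v)
```

Each row of `attn` is a probability distribution over genes, so it can be read as how strongly one gene's representation draws on the others. Weights of this kind are one quantity that could be inspected when looking for interpretable gene dependencies.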
This proposal will support my dissertation research, which will be the foundational body of work for my career
as a researcher in computational genomics. During the tenure of this award, I will receive specialized training
in the underlying mathematics and biology needed for developing frameworks for scRNAseq analysis. I will
contribute to the existing literature by developing novel methodology and creating open-source software,
making our tools and models easily accessible to the broader scientific community. Achieving the proposed
aims will significantly enhance scRNAseq pipelines and analysis, making them more robust and accurate. This
will additionally facilitate the study of smaller datasets, potentially reducing the number of patients and animals
necessary in initial studies.