Project Summary
With the rise of high-throughput sequencing and multiplexed biotechnologies enabling single-cell multi-omics
and massively parallel CRISPR experiments, the biomedical community is generating a monumental amount of
data. These data promise to reveal new biology and drive personalized and precision medicine. However, the sheer
volume of genomic data is overwhelming current computational resources, requiring prohibitively high compute
time, memory usage, and storage. My lab has been at the forefront of solving big data challenges in genomics,
designing novel algorithms that enable efficient and secure analyses that were previously computationally
infeasible, and that reveal novel structural, cellular, and systems biology. Drawing upon our expertise in
developing scalable and insightful algorithms for analyzing genomic, transcriptomic, and proteomic data, we aim
to tackle two key data-driven challenges facing the biological community: 1) efficient, accurate, and robust
characterization of tissues at the single-cell level, and 2) translating high-throughput datasets into biological
discoveries via machine learning-based prediction. To solve the first challenge, we will leverage our discovery
that seemingly high-dimensional sequencing data often lie on low-dimensional manifolds that capture the
underlying biological state of interest. We will design algorithms that generate these compact, meaningful
manifold representations of single-cell omics datasets. This will enable several key applications, including
characterizing co-expression and gene modules that define healthy and pathologic cell states; integrating
multi-modal single-cell omics datasets to more richly characterize cellular diversity; and investigating the
mechanisms underlying transcriptomic diversity across tissues and developmental states. To solve the second
challenge, we will take a two-pronged approach. First, we will design novel machine learning frameworks that
provide a measure of confidence when predicting in unfamiliar biological states, enabling prediction that is robust
to “out-of-distribution” (unobserved) examples. Second, we will work with our experimental collaborators and
contract research organizations (CROs) to rapidly validate model-based predictions experimentally, returning the
experimental results to the model to further improve performance. This will enable an “active learning” feedback
loop that efficiently explores a complex biological space for outcomes of interest. We will use this uncertainty-powered
active learning approach to address several pressing biological problems, including the identification of
small-molecule compounds with enzymatic or whole-cell growth inhibitory properties, efficient design of
spatial-transcriptomic experiments, computationally guided CRISPR perturbation experiments, and identification of
functional non-coding mutations. This project will 1) produce numerous software tools with wide utility that
efficiently analyze massive biological datasets and guide complex experimentation, and 2) reveal biological
insights, especially into biomolecular interactions and cellular heterogeneity.
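
The uncertainty-guided feedback loop described above can be illustrated with a minimal sketch. It is not the frameworks proposed here; it rests on several simplifying assumptions. A bootstrap ensemble of ridge regressors stands in for the uncertainty-aware model, the spread of ensemble predictions stands in for the confidence measure, and candidates are ranked by predicted mean plus `beta` times predicted standard deviation. The function names and the `oracle` callable, which plays the role of an experimental measurement, are hypothetical.

```python
import numpy as np


def fit_ridge(X, y, lam=1e-2):
    """Closed-form ridge regression weights."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)


def ensemble_predict(X_train, y_train, X_pool, n_models=20, rng=None):
    """Mean and standard deviation of predictions from a bootstrap ensemble."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(y_train)
    preds = []
    for _ in range(n_models):
        idx = rng.integers(0, n, size=n)  # bootstrap resample of labeled data
        w = fit_ridge(X_train[idx], y_train[idx])
        preds.append(X_pool @ w)
    preds = np.stack(preds)
    return preds.mean(axis=0), preds.std(axis=0)


def active_learning(X, oracle, n_init=5, n_rounds=10, beta=1.0, seed=0):
    """Repeatedly 'measure' the candidate with the highest mean + beta * std."""
    rng = np.random.default_rng(seed)
    labeled = [int(i) for i in rng.choice(len(X), size=n_init, replace=False)]
    y = {i: oracle(X[i]) for i in labeled}  # initial "experiments"
    for _ in range(n_rounds):
        pool = np.array([i for i in range(len(X)) if i not in y])
        X_train = X[labeled]
        y_train = np.array([y[i] for i in labeled])
        mean, std = ensemble_predict(X_train, y_train, X[pool], rng=rng)
        pick = int(pool[np.argmax(mean + beta * std)])
        y[pick] = oracle(X[pick])  # return the new result to the model
        labeled.append(pick)
    return labeled, y
```

In this sketch a larger `beta` favors exploring candidates the ensemble disagrees on, while `beta = 0` reduces to greedy selection on the predicted mean. In practice `oracle` would be replaced by an assay or screen, and the ridge ensemble by whichever uncertainty-aware model suits the data.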