PROJECT SUMMARY / ABSTRACT
Advances in sequencing technologies provide new opportunities to interrogate biological systems from multiple
perspectives. However, the introduction of new technologies highlights a problem many researchers face:
missing data. Missing observations across technologies and biological states are a common problem in
computational biology. This missingness can result from limitations in the technology, the rarity of a
biological state, or limited adoption of the technology. While one technology may have
high sparsity in biological observations, there is an opportunity to leverage existing, complementary data from
an established technology to impute the missing biological observations.
We address these issues by utilizing new methodological advances in machine learning, primarily focusing on
domain adaptation techniques. These techniques learn patterns in one dataset that can be adapted to another
dataset, enabling cross-technology information sharing. Our proposal introduces a general framework in which
domain adaptation techniques can be used to unite an emerging technology with a different, but complementary, technology. To
highlight the broad utility of this approach, we apply this framework to three biomedical applications: 1) predicting
cell-type-specific perturbation response in rheumatoid arthritis; 2) predicting tissue of origin from cell-free DNA
(cfDNA); and 3) predicting progenitor-specific gene signatures from cfDNA in acute myeloid leukemia (AML).
The proposed aims not only unite existing and emerging sequencing technologies but also enable the discovery of
new biology that is difficult or infeasible to observe directly.
The proposed research builds on my experience applying statistical approaches to transcriptomic data. During
the K99 phase, I will require further training from my mentoring team in deep generative modeling (Dr. Casey
Greene), modeling of single-cell data (Dr. Fan Zhang), and modeling of cfDNA and chromatin accessibility (Dr.
Srinivas Ramachandran). The research will be conducted at the University of Colorado, Anschutz Medical
Campus, in the Center for Health AI. At this institution, I will have access to the Colorado Clinical and
Translational Sciences Institute and the RNA Bioscience Initiative, which provide resources for building an
interdisciplinary and translational research program. With this training and available institutional resources, I
will have a solid foundation on which to build an independent research program focused on domain adaptation
applications for high-throughput sequencing technologies.