SUMMARY
While single-cell methods for analyzing gene expression are becoming a standard tool for unpacking cellular
heterogeneity and understanding complex tissues in health and disease, assays of other molecular features,
especially open chromatin landscapes via ATAC-seq, but also surface protein abundance and the presence of
CRISPR guides, are rapidly expanding in application. Indeed, commercial platforms for generating diverse
single-cell data sets have led to an immense increase in the scale of these data, and split-and-pool-based
assay methods together with decreasing sequencing costs presage an exponentially increasing corpus of future
large-scale
datasets. We developed ArchR, an analysis infrastructure specifically designed for large-scale
single-cell (sc) ATAC-seq data sets that enables a diverse suite of complex analyses (including QC, doublet removal,
iterative TF-IDF clustering, approximation methods for large-scale data sets, trajectory analysis, RNA-seq
integration, track visualization, marker peak identification, etc.), all with minimal computing hardware
requirements. We estimate that ArchR has thousands of active users and is rapidly becoming the “go-to” analysis
software for large scATAC-seq data sets. To further extend the utility of ArchR for analyzing multi-omic data sets,
we will first engineer substantial improvements to the efficiency of the underlying single-cell computational
infrastructure. To do this, we will (1) implement our fundamental matrix operations in C++ to enable
streaming access to data matrices, reducing memory requirements and effectively “lifting the cap” on the number
of cells that can be analyzed by computing diverse operations rapidly on the fly, and (2) implement and
benchmark efficient on-disk storage using bitpacking algorithms. These data structures and atomic operation
libraries will be shared with the genomics community (and are being integrated into the popular Seurat package),
allowing others to reuse these performance improvements. Second, we will develop, implement, and benchmark
powerful analytical tools for large, diverse, and/or multi-omic datasets. We will support diverse data types,
whether acquired independently or simultaneously (multi-omic), including RNA-seq, ATAC-seq, ADT (CITE-seq),
and data from CRISPR-based perturbation methods. We will develop accurate methods for cross-manifold
data linkage for distinct data sets, forced-projection and regression analysis, multi-modality cell clustering, joint
analysis of single-cell molecular data sets with CRISPR-based perturbations, single-cell inference of enhancer
function via correlation and the “ABC” model, and identification of continuous differentiation trajectories and
chromatin “potential.” Finally, we will develop plug-and-play cell type-specific deep learning models that learn
single-cell chromatin accessibility profiles from DNA sequence and use them to predict the cell type-specific
regulatory effects of noncoding sequence changes.
We will create a user-friendly system for training, deployment, and sharing sequence-based models of cell type-
specific chromatin accessibility, bringing cutting-edge machine learning for functional genomics to wide use.
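
As a minimal illustration of the bitpacking idea behind the proposed on-disk storage, the sketch below packs small
non-negative integers (such as per-cell fragment counts) into a fixed number of bits per value and reads them back.
It is written in Python for brevity and is not the C++ implementation described above; the function names
`min_width`, `pack_bits`, and `unpack_bits` are hypothetical and chosen only for this example.

```python
def min_width(values):
    """Smallest bit width (at least 1) that can hold every value in `values`."""
    return max(1, max(values, default=0).bit_length())


def pack_bits(values, width):
    """Pack non-negative integers into bytes, using `width` bits per value."""
    if width <= 0:
        raise ValueError("width must be positive")
    acc = 0      # bit accumulator; new values are appended above the pending bits
    nbits = 0    # number of pending bits currently held in `acc`
    out = bytearray()
    for v in values:
        if v < 0 or v >> width:
            raise ValueError(f"value {v} does not fit in {width} bits")
        acc |= v << nbits
        nbits += width
        while nbits >= 8:          # flush every complete byte
            out.append(acc & 0xFF)
            acc >>= 8
            nbits -= 8
    if nbits:                      # flush the final partial byte, zero-padded
        out.append(acc & 0xFF)
    return bytes(out)


def unpack_bits(data, width, count):
    """Recover `count` integers of `width` bits each from packed bytes."""
    mask = (1 << width) - 1
    acc = 0
    nbits = 0
    pos = 0
    values = []
    for _ in range(count):
        while nbits < width:       # refill the accumulator one byte at a time
            if pos >= len(data):
                raise ValueError("packed data is too short")
            acc |= data[pos] << nbits
            pos += 1
            nbits += 8
        values.append(acc & mask)
        acc >>= width
        nbits -= width
    return values


# Example: eight small counts need 2 bits each, so they fit in 2 bytes
# instead of the 32 bytes that eight 32-bit integers would occupy.
counts = [1, 2, 1, 1, 3, 1, 2, 1]
width = min_width(counts)
packed = pack_bits(counts, width)
restored = unpack_bits(packed, width, len(counts))
```

Because single-cell count matrices are dominated by very small values, a narrow fixed width of this kind can shrink
storage substantially. A production implementation would likely pack values in blocks and choose a width per block.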