Fast, powerful, scalable, usable, and distributable methods for multi-modal single cell analyses - SUMMARY

While single-cell methods for analyzing gene expression are becoming a standard tool for unpacking cellular heterogeneity and understanding complex tissues in health and disease, assays of other molecular features are rapidly expanding in their application, especially open chromatin landscapes via ATAC-seq, but also surface protein abundance and the presence of CRISPR guides. Indeed, commercial platforms for generating diverse single-cell data sets have led to an immense increase in the scale of these data, and split-and-pool based assays and decreasing sequencing costs together presage an exponentially growing corpus of large-scale data sets. We developed ArchR, an analysis infrastructure specifically designed for large-scale single-cell (sc) ATAC-seq data sets that enables a diverse suite of complex analyses (including QC, doublet removal, iterative TF-IDF clustering, approximation methods for large-scale data sets, trajectory analysis, RNA-seq integration, track visualization, marker peak identification, etc.), all with minimal computing hardware requirements. We estimate that ArchR has thousands of active users and is rapidly becoming the “go-to” analysis software for large scATAC-seq data sets. To further extend the utility of ArchR for analyzing multi-omic data sets, we will first engineer substantial improvements to the computational efficiency of the underlying single-cell computational infrastructure. To do this, we will (1) encode our fundamental matrix operations in C++ to enable streaming access to data matrices, which reduces memory requirements and effectively “lifts the cap” on the number of cells that can be analyzed by computing diverse operations rapidly on the fly, and (2) implement and benchmark efficient on-disk storage using bitpacking algorithms.
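To illustrate the idea behind bitpacked storage, the following is a minimal sketch in Python, not the planned C++ implementation. It assumes the simplest variant: each block of non-negative integer counts is stored with the smallest fixed bit width that fits its largest value. The function names `pack_block` and `unpack_block` are hypothetical, and production schemes typically add refinements (for example fixed block sizes and delta encoding of sorted indices) that are omitted here.

```python
def pack_block(values):
    """Pack non-negative integers into bytes using one fixed bit width.

    Returns (width, count, payload). Hypothetical helper for illustration.
    """
    if any(v < 0 for v in values):
        raise ValueError("bitpacking sketch only supports non-negative integers")
    # Smallest width (at least 1 bit) that can hold the largest value.
    width = max(1, max((v.bit_length() for v in values), default=0))
    acc = 0
    for i, v in enumerate(values):
        # Place value i in bits [i * width, (i + 1) * width).
        acc |= v << (i * width)
    n_bytes = (len(values) * width + 7) // 8  # round up to whole bytes
    return width, len(values), acc.to_bytes(n_bytes, "little")


def unpack_block(width, count, payload):
    """Invert pack_block: recover the original list of integers."""
    acc = int.from_bytes(payload, "little")
    mask = (1 << width) - 1
    return [(acc >> (i * width)) & mask for i in range(count)]


# Eight small counts, as might appear in one stretch of a sparse count matrix.
counts = [0, 1, 3, 2, 1, 0, 7, 1]
width, n, payload = pack_block(counts)
restored = unpack_block(width, n, payload)
print(width, len(payload), restored == counts)
```

In this example the largest value is 7, so each count needs 3 bits and the eight counts occupy 3 bytes, compared with 32 bytes if each were stored as a 32-bit integer. Because single-cell count matrices are dominated by small values, savings of this kind are the motivation for benchmarking such encodings on disk.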
These data structures and atomic operation libraries will be shared with the genomics community (and are being integrated into the popular Seurat package), so that others can reuse these performance improvements. Second, we will develop, implement, and benchmark powerful analytical tools for the analysis of large, diverse, and/or multi-omic data sets. We will enable the handling of diverse data types, whether acquired independently or simultaneously (multi-omic), including RNA-seq, ATAC-seq, ADT (CITE-seq), and CRISPR-based perturbation methods. We will develop accurate methods for cross-manifold data linkage between distinct data sets, forced-projection and regression analysis, multi-modality cell clustering, joint analysis of single-cell molecular data sets with CRISPR-based perturbations, single-cell inference of enhancer function via correlation and the “ABC” model, and identification of continuous differentiation trajectories and chromatin “potential.” Finally, we will develop plug-and-play cell type-specific deep learning models that learn single-cell chromatin accessibility profiles from DNA sequence in order to predict the cell type-specific regulatory effects of noncoding sequence changes. We will create a user-friendly system for training, deployment, and sharing of sequence-based models of cell type-specific chromatin accessibility, bringing cutting-edge machine learning for functional genomics to wide use.
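As a rough illustration of the enhancer-scoring step, the sketch below assumes that “ABC” refers to the Activity-by-Contact model, in which the score of a candidate element for a gene is its activity multiplied by its contact frequency with the gene's promoter, normalized by the same product summed over all candidate elements near that gene. The function name `abc_scores` and the input values are hypothetical; how activity and contact would be estimated from single-cell data is not specified in this summary and is not modeled here.

```python
def abc_scores(activity, contact):
    """Activity-by-Contact style scores for candidate elements of one gene.

    activity[i]: accessibility-derived activity of element i (assumed input).
    contact[i]:  contact frequency between element i and the gene's promoter
                 (assumed input).
    Returns each element's share of the summed activity * contact.
    """
    if len(activity) != len(contact):
        raise ValueError("activity and contact must have the same length")
    weights = [a * c for a, c in zip(activity, contact)]
    total = sum(weights)
    if total == 0:
        # No element has both activity and contact; assign zero everywhere.
        return [0.0] * len(weights)
    return [w / total for w in weights]


# Three hypothetical candidate elements near one gene.
scores = abc_scores(activity=[10.0, 5.0, 5.0], contact=[0.5, 1.0, 0.0])
print(scores)  # [0.5, 0.5, 0.0]
```

The normalization means a score reflects an element's contribution relative to the other candidates for the same gene, so a highly accessible element with no promoter contact (the third element above) receives a score of zero.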