Accurate, fast, and distributed atlas-scale integration of scRNAseq data - Project Summary

Single-cell transcriptomics has revolutionized biomedical research over the last decade and is now widely used in both industry and academia. Today, more than 100 million profiled cells are available in the public domain, and many of these datasets were generated through Common Fund projects such as HuBMAP, GTEx, BRAIN, and SCORCH. These and other projects aim to serve as reference datasets that can be reused for comparisons with other datasets; such resources are often referred to as cell atlases. The ability to integrate cell atlases with each other and with external datasets is therefore crucial to ensure their utility. However, combining datasets can be challenging due to batch effects, experimental artifacts resulting from variability in sample processing. If not accounted for, batch effects can be mistaken for biological signals. In recent years, several computational methods to identify and correct batch effects have been developed. Nonetheless, challenges remain, particularly in scaling up to millions of cells from thousands of batches. Additionally, users need to download large volumes of data, which can be costly and time-consuming. One of the most popular methods for batch correction is Harmony, which we published in 2019. Harmony is transparent, intuitive, fast, and accurate, as demonstrated by independent benchmarks. Briefly, Harmony first uses principal component analysis (PCA) to project cells into a latent space. It then iteratively performs k-means clustering and ridge regression to identify robust clusters that are not confounded by artifacts. Here, we propose two aims to add new functionality to Harmony to eliminate some of the remaining bottlenecks. In many use cases, a researcher will want to integrate their own data with a subset of an atlas, such as datasets containing the same tissue type.
To speed up this process, we will leverage our pre-calculated global integration of the atlas to allow for fast comparisons. This will be achieved by building a tree structure in which nodes represent cell types at various levels of granularity. New cells are then projected into the same latent space, and by traversing the tree, we can quickly identify which existing cell types they most closely resemble. With cell atlases containing millions of cells, downloading and storing the data can become an issue. One way to overcome this challenge is to perform computations where the data resides, rather than transferring the data first. As previously mentioned, Harmony relies on three classic algorithms (PCA, k-means, and ridge regression), all of which can be executed in a distributed manner. In this context, we assume that different parts of the data will reside on distinct servers, and Distributed Harmony will require only summary statistics to be transferred between servers. We envision that the proposed methods can be deployed in collaboration with organizations maintaining cell atlases. Additionally, the distributed algorithms and infrastructure we develop will be valuable for many other applications in computational genomics, such as population genetics and microbiome research.
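
To illustrate how an algorithm can run across servers while exchanging only summary statistics, the following is a minimal sketch of one distributed k-means update. It is not the Distributed Harmony implementation: it uses plain hard-assignment k-means rather than Harmony's clustering, and the function names (`local_summary`, `combine`, `distributed_kmeans_step`) and the toy data are assumptions made for illustration. Each server reports only per-cluster coordinate sums and counts, and a coordinator merges them into new centroids.

```python
from typing import List, Sequence, Tuple

Point = Sequence[float]
Summary = Tuple[List[List[float]], List[int]]


def _nearest(point: Point, centroids: List[List[float]]) -> int:
    """Index of the centroid closest to `point` (squared Euclidean distance)."""
    dists = [sum((p - c) ** 2 for p, c in zip(point, cen)) for cen in centroids]
    return dists.index(min(dists))


def local_summary(points: List[Point], centroids: List[List[float]]) -> Summary:
    """Runs on one server: returns per-cluster coordinate sums and counts.

    The raw points never leave this function.
    """
    k, d = len(centroids), len(centroids[0])
    sums = [[0.0] * d for _ in range(k)]
    counts = [0] * k
    for p in points:
        j = _nearest(p, centroids)
        counts[j] += 1
        for i in range(d):
            sums[j][i] += p[i]
    return sums, counts


def combine(summaries: List[Summary], centroids: List[List[float]]) -> List[List[float]]:
    """Runs on the coordinator: merges summaries into updated centroids."""
    k, d = len(centroids), len(centroids[0])
    updated = []
    for j in range(k):
        n = sum(counts[j] for _, counts in summaries)
        if n == 0:
            # No cell was assigned to this cluster; keep its old centroid.
            updated.append(list(centroids[j]))
        else:
            updated.append(
                [sum(sums[j][i] for sums, _ in summaries) / n for i in range(d)]
            )
    return updated


def distributed_kmeans_step(
    shards: List[List[Point]], centroids: List[List[float]]
) -> List[List[float]]:
    """One k-means update in which each shard contributes only its summary."""
    return combine([local_summary(shard, centroids) for shard in shards], centroids)


# Toy example: two hypothetical servers, each holding three 2-D cells.
server_a = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0)]
server_b = [(2.0, 0.0), (2.0, 2.0), (12.0, 10.0)]
centroids = [[1.0, 1.0], [9.0, 9.0]]

distributed = distributed_kmeans_step([server_a, server_b], centroids)
pooled = distributed_kmeans_step([server_a + server_b], centroids)
```

In this toy example the distributed update matches the update computed on the pooled data, because cluster means depend on the data only through sums and counts. The same idea applies to PCA and ridge regression, whose inputs can be accumulated from per-server cross-product sums.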