Sunday, August 17, 2025

Accurate, fast, and distributed atlas-scale integration of scRNAseq data

Award Number: R03OD039974
ORGANIZATION: OFFICE OF THE DIRECTOR, NIH
OPDIV: NIH
AWARD CLASS: DISCRETIONARY
AWARD ACTIVITY TYPE: SCIENTIFIC/HEALTH RESEARCH (INCLUDES SURVEYS)
PERIOD OF PERFORMANCE START DATE: 06/10/2025
PERIOD OF PERFORMANCE END DATE: 05/31/2026

Project Summary: Single-cell transcriptomics has revolutionized biomedical research over the last decade and is now widely used in both industry and academia. Today more than 100 million profiled cells are available in the public domain, and many of these datasets were generated through Common Fund projects such as HuBMAP, GTEx, BRAIN, and SCORCH. These and other projects aim to serve as reference datasets that can be reused for comparisons with other datasets; such resources are often referred to as cell atlases. The ability to integrate cell atlases with each other and with external datasets is therefore crucial to their utility.

However, combining datasets can be challenging because of batch effects, which are experimental artifacts resulting from variability in sample processing. If not accounted for, batch effects can be mistaken for biological signals. In recent years, several computational methods to identify and correct batch effects have been developed. Nonetheless, challenges remain, particularly in scaling up to millions of cells from thousands of batches. Additionally, users need to download large volumes of data, which can be costly and time-consuming.

One of the most popular methods for batch correction is Harmony, which we published in 2019. Harmony is transparent, intuitive, fast, and accurate, as demonstrated by independent benchmarks. Briefly, Harmony first uses principal component analysis (PCA) to project cells into a latent space. It then iteratively performs k-means clustering and ridge regression to identify robust clusters that are not confounded by artifacts.

Here, we propose two aims to add new functionality to Harmony and eliminate some of the remaining bottlenecks. In many use cases, a researcher will want to integrate their own data with a subset of an atlas, such as datasets containing the same tissue type.
To speed up this process, we will leverage our pre-calculated global integration to allow for fast comparisons. This will be achieved by building a tree structure in which nodes represent cell types at various levels of granularity. New cells are then projected into the same latent space, and by traversing the tree, we can quickly identify which existing cell types they most closely resemble.

With cell atlases containing millions of cells, downloading and storing the data can become an issue. One way to overcome this challenge is to perform computations where the data resides, rather than transferring the data first. As previously mentioned, Harmony relies on three classic algorithms (PCA, k-means, and ridge regression), all of which can be executed in a distributed manner. In this context, we assume that different parts of the data will reside on distinct servers, and Distributed Harmony will require only summary statistics to be transferred between servers.

We envision that the proposed methods can be deployed in collaboration with organizations maintaining cell atlases. Additionally, the distributed algorithms and infrastructure we develop will be valuable for many other applications in computational genomics, such as population genetics and microbiome research.
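
As an illustration of how a clustering step can run with only summary statistics leaving each server, the following is a minimal sketch and not the Harmony implementation. It assumes plain hard-assignment k-means, NumPy arrays standing in for the per-server data shards, and hypothetical helper names (`local_summary`, `aggregate`). Each server reports per-cluster sums and counts, and a coordinator combines them into updated centroids.

```python
import numpy as np

def local_summary(X, centroids):
    """Runs on one server (hypothetical helper): assign local cells to the
    nearest centroid and return per-cluster sums and counts only."""
    # Squared Euclidean distances, shape (n_cells, k)
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    k, dim = centroids.shape
    sums = np.zeros((k, dim))
    np.add.at(sums, labels, X)              # accumulate rows per cluster
    counts = np.bincount(labels, minlength=k)
    return sums, counts

def aggregate(summaries, old_centroids):
    """Runs on the coordinator (hypothetical helper): combine the
    per-server summaries into updated centroids."""
    total_sums = sum(s for s, _ in summaries)
    total_counts = sum(c for _, c in summaries)
    new_centroids = old_centroids.copy()
    nonempty = total_counts > 0             # keep old centroid if a cluster is empty
    new_centroids[nonempty] = total_sums[nonempty] / total_counts[nonempty, None]
    return new_centroids

# Toy data: three "servers", each holding cells already projected
# into a shared 5-dimensional latent space (assumed for illustration).
rng = np.random.default_rng(0)
shards = [rng.normal(size=(n, 5)) for n in (40, 60, 50)]
centroids = rng.normal(size=(3, 5))

for _ in range(10):
    summaries = [local_summary(X, centroids) for X in shards]
    centroids = aggregate(summaries, centroids)
```

In this sketch, each round transfers only a k-by-d matrix of sums and a length-k vector of counts per server, regardless of how many cells the server holds. One round gives the same centroids as a k-means update on the pooled data, up to floating-point rounding.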


Issue Date FY	Funding FY	Legal Entity Name	Legal Entity Address	Legal Entity City	Legal Entity State	Legal Entity Zip Code	Legal Entity COUNTY	Legal Entity COUNTRY	Assistance Listing	Award Code	Budget Year	Action Date	Action Type	Action Amount

Issue Date FY: 2025 ( Subtotal = $358,000 )
2025	2025	BRIGHAM & WOMENS HOSPITAL INC	75 FRANCIS ST	BOSTON	MA	02115	SUFFOLK	USA	Trans-NIH Research Support	000	1	6/10/2025	NEW	$358,000
														Subtotal = $358,000

Grand Total All Awards = $358,000
