Project Summary/Abstract
The NIH Common Fund has supported the generation, management, and sharing of single cell genomic data
from millions of cells through several large international consortia with the goal of building a comprehensive
reference of healthy cells across multiple organs in the human body. We will use single cell/nucleus RNA-
sequencing (scRNA-seq) data from the Common Fund-supported Human BioMolecular Atlas Program (HuBMAP)
and Genotype-Tissue Expression (GTEx) consortia to prototype a cell type harmonization protocol for
constructing a cross-consortia cell census meta-atlas. The HuBMAP consortium provides organ-specific cell
atlases for multiple organs, while GTEx provides an integrated cross-organ single cell atlas. Our group has
developed and extensively validated computational algorithms, NS-Forest and FR-Match, for biomarker
identification and robust cell type matching using scRNA-seq data. Our algorithms utilize Random Forest
machine learning and minimum spanning tree graphical modeling, which provide superior classification
performance while maintaining high explainability and interpretability for biological applications. In Specific Aim
1, rigorous data quality control approaches will be applied for dataset selection and preparation. The NS-Forest
algorithm will then be used to identify optimal biomarker combinations for characterization of organ-specific cell
types of individual organs in HuBMAP and cross-organ cell types in GTEx. In Specific Aim 2, we will focus on
human lung, as an exemplar organ, to prototype the assembly of a cross-consortia meta-atlas by developing a
robust cell type harmonization approach using our validated and benchmarked FR-Match algorithm and
HuBMAP-Lung, GTEx lung subset, and other publicly available Human Lung Cell Atlas (HLCA) datasets. We
will compare and benchmark FR-Match against two other popular methods, Azimuth and CellTypist, for cell type
matching and validate the matching results across all three methods. We will also form a domain expert panel to review
and validate the cell type harmonization results using domain knowledge and literature information for community
approval. We will build a strategy for capturing sample metadata, anatomic structure information, cell type
nomenclature, and biomarker-based definitions in an ontological representation for the meta-atlas and add
this content to the Provisional Cell Ontology. In Specific Aim 3, we will disseminate our results to key
stakeholder communities, including the HuBMAP Anatomical Structures, Cell Types and Biomarkers (ASCT+B)
Working Group and the GTEx Multi-Gene Single Cell Query platform. We will present the project and participate
in the Common Fund Data Ecosystem Spring Meeting to engage the community and solicit feedback.
Beyond the pilot phase, the cell type harmonization framework established in this project will be generally
applicable to integrating single cell-based cell type datasets across Common Fund and other data resources.