Coordination of gene expression and molecular function in known pathways - SUMMARY Transcriptomics generates massive amounts of data that we aim to interpret physiologically, in terms of how cells respond to different conditions by regulating the activity levels of pathways. Curators encode our knowledge in pathway databases such as KEGG, Reactome, and, more recently, GO Causal Activity Models (GO-CAMs), which provide detailed qualitative depictions of pathways. However, a major component of our knowledge is missing: a description of how gene expression varies as a pathway is differentially regulated by cells. For example, when glycolysis is differentially regulated, is it always the same subset of genes that is turned up or down? Knowing which genes vary together (and which are uninformative), and having the context of the range of gene expression across cell types and in response to conditions, would be invaluable, because we would then know which genes to rely on and which to ignore when determining whether a pathway is up- or downregulated. This raises the question: Can the diversity of cell types, or the responses of cells to various conditions, be described by differential regulation of pathways instead of differential expression of 20,000 genes? If so, we could infer the physiology of cell types and the regulation of pathways by studying patterns of gene coordination within them. This project will use human single-cell RNAseq atlases and GO-CAM pathway models to test the hypothesis that gene expression within pathways is coordinated in a stereotyped manner and that these coordination strategies are largely shared across cell types but differentially tuned. To develop scoring metrics and references for transcriptional regulation of pathways, the genes whose expression levels are most informative of each pathway’s regulatory state will be identified through Weighted Gene Co-expression Network Analysis (WGCNA).
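As a rough illustration of how informative genes within a pathway might be ranked, the sketch below computes a WGCNA-style soft-thresholded connectivity in Python with NumPy. It is not the WGCNA R package and not the project's pipeline; the simulated expression matrix, the gene names G1 to G5, the function name, and the soft-threshold power of 6 are all assumptions made for illustration.

```python
import numpy as np

def soft_threshold_connectivity(expr, beta=6):
    """Per-gene connectivity for a cells x genes matrix of one pathway's genes.

    beta is the soft-threshold power (an assumed value, not tuned here).
    """
    corr = np.corrcoef(expr, rowvar=False)   # gene-by-gene Pearson correlation
    adjacency = np.abs(corr) ** beta         # unsigned soft-thresholded adjacency
    np.fill_diagonal(adjacency, 0.0)         # ignore self-connections
    return adjacency.sum(axis=1)             # connectivity = summed adjacency

# Simulated data (hypothetical): three genes track a latent pathway activity,
# two genes vary independently of it.
rng = np.random.default_rng(0)
n_cells = 500
activity = rng.normal(size=n_cells)
coordinated = np.column_stack(
    [w * activity + 0.3 * rng.normal(size=n_cells) for w in (1.0, 0.8, 1.2)]
)
unrelated = rng.normal(size=(n_cells, 2))
expr = np.hstack([coordinated, unrelated])
genes = ["G1", "G2", "G3", "G4", "G5"]

connectivity = soft_threshold_connectivity(expr)
ranking = [genes[i] for i in np.argsort(-connectivity)]
print(ranking)
```

In this toy setting the coordinated genes receive much higher connectivity than the unrelated ones, which is the sense in which a gene could be called informative of the pathway's state. Real single-cell data would also need normalization and handling of dropout, which are omitted here.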
These coordination patterns may be tissue- or cell-type-specific, so an extension of k-means clustering to k-affine subspaces (i.e. points, lines, and planes in 3D) will be used to model subpopulations that coordinate gene expression differently, and cell type labels will be used to understand how different cell types scale gene expression along these subspaces. These results will be made available as a database and tool. Lastly, experimentalists interpret their RNAseq results by considering the molecular function (MF) identities of pathway steps (e.g. receptors, enzymes, intracellular messengers). To determine whether such high-level logic holds across pathways, the integration of GO-CAMs with the Gene Ontology (GO) will be used to test whether MF or the type of causal edge between steps influences transcriptional co-regulation. These aims will advance our understanding of how gene expression is coordinated within pathways, how this coordination is specific to or shared by cell types, and whether motifs for transcriptional regulation are shared across pathways. Together, they will improve our ability to interpret RNAseq data in a physiologically meaningful manner.
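The extension of k-means to affine subspaces mentioned above can be sketched as an alternating procedure, often called k-flats: assign each cell to its nearest subspace, then refit each subspace by PCA on its assigned cells. The Python example below is a minimal sketch under stated assumptions, not the project's implementation; the function names, the simulated three-gene data, and the choice of two lines (k=2, q=1) are hypothetical.

```python
import numpy as np

def fit_flat(points, q):
    """Least-squares q-dimensional affine subspace: centroid plus top-q PCA directions."""
    center = points.mean(axis=0)
    _, _, vt = np.linalg.svd(points - center, full_matrices=False)
    return center, vt[:q]                      # rows of the basis are orthonormal

def sq_dist_to_flat(points, flat):
    """Squared orthogonal distance from each point to the flat."""
    center, basis = flat
    centered = points - center
    projected = centered @ basis.T @ basis     # projection onto the flat's directions
    return ((centered - projected) ** 2).sum(axis=1)

def k_flats(points, k, q, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    labels = rng.permutation(len(points)) % k  # random start, no empty cluster
    flats = [fit_flat(points[labels == j], q) for j in range(k)]
    history = []                               # total cost after each assignment step
    for _ in range(n_iter):
        dists = np.column_stack([sq_dist_to_flat(points, f) for f in flats])
        new_labels = dists.argmin(axis=1)      # assign each cell to its nearest flat
        history.append(float(dists.min(axis=1).sum()))
        converged = np.array_equal(new_labels, labels)
        labels = new_labels
        if converged:
            break
        for j in range(k):                     # refit; keep the old flat if a cluster empties
            members = points[labels == j]
            if len(members) > 0:
                flats[j] = fit_flat(members, q)
    return labels, flats, history

# Simulated data (hypothetical): two subpopulations coordinate three genes
# along different lines in expression space.
rng = np.random.default_rng(1)
t = rng.uniform(0, 5, size=(200, 1))
pop_a = t * np.array([1.0, 1.0, 0.2]) + rng.normal(scale=0.1, size=(200, 3))
pop_b = (np.array([4.0, 0.0, 1.0]) + t * np.array([0.1, 1.0, 1.0])
         + rng.normal(scale=0.1, size=(200, 3)))
points = np.vstack([pop_a, pop_b])

labels, flats, history = k_flats(points, k=2, q=1)
one_flat_cost = float(sq_dist_to_flat(points, fit_flat(points, 1)).sum())
print(history[-1], one_flat_cost)              # two lines vs. a single-line baseline
```

Each iteration can only lower or keep the total squared distance, but the result depends on the random start and may be a local optimum, so several restarts would be advisable in practice. Once subspaces are fitted, each cell's coordinate along its line (the projection onto the basis) is one way to express how strongly a cell type scales expression along that subspace.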