PROJECT SUMMARY/ABSTRACT
Mapping the gene regulatory networks driving human disease enables the design of network-correcting
treatments that target the core disease mechanism rather than merely managing symptoms. I previously
developed a framework that maps disease-dependent gene networks to enable network-based screening,
leveraging machine learning and human induced pluripotent stem cell modeling. This framework identified a
promising network-correcting therapy for cardiac valve disease that is currently progressing towards clinical
trial, as reported in Cell [1] and Science [2]. However, computationally inferring the network map requires large amounts of
transcriptomic data to learn the connections between genes, which impedes network-correcting drug discovery
in settings with limited data, including rare diseases and diseases affecting clinically inaccessible tissues.
Although data remains limited in these settings, recent advances in sequencing technologies have driven a
rapid expansion in the amount of transcriptomic data available from human tissues more broadly. Recently, the
concept of transfer learning has revolutionized fields such as natural language understanding and computer
vision by leveraging deep learning models pretrained on large-scale general datasets, which can then be
fine-tuned towards a vast array of downstream tasks using application-specific data that would be too limited
to yield meaningful predictions in isolation. To test whether an analogous approach could enable gene network
predictions with limited data, I developed a novel deep learning model, Geneformer, and pretrained it on a
large-scale corpus that I assembled, comprising ~30 million human single-cell transcriptomes, thereby
generating a pretrained checkpoint from which fine-tuning towards a broad range of downstream applications
could be pursued to accelerate discovery of key network regulators and candidate network-correcting
therapies. Geneformer consistently boosted predictive accuracy in a diverse panel of downstream tasks using
just a limited set of task-specific training examples. I now propose to leverage Geneformer’s learned
understanding of contextual gene network dynamics to address two major challenges in cardiac biology. In Aim
1, I will identify novel dosage-sensitive gene combinations and their context dependency in cardiac cell
types, thereby generating a map of contextual dosage sensitivity for genes, individually or in combination, that
has the potential to dramatically improve the interpretation of copy number variants in the genetic diagnosis of
cardiac disease. In Aim 2, I will map the dysregulated gene network and discover candidate network-correcting
therapeutics in hypertrophic cardiomyopathy, a prototypical rare disease affecting clinically inaccessible tissue
where progress has been impeded by limited data, to accelerate the discovery of a much-needed targeted
therapeutic for this life-threatening progressive disease. Overall, my novel deep learning model, Geneformer,
pretrained on large-scale single-cell transcriptomic data, has the potential to revolutionize the field of
network biology through transfer learning, accelerating discovery in settings with limited data.