Project Summary
Identification of the genetic basis for congenital heart disease (CHD) has benefitted from advances in exome
sequencing (ES) and genome sequencing (GS) pipelines. Large cohort studies, such as the NHLBI-funded
Pediatric Cardiovascular Genomics Consortium (PCGC), have sequenced the exomes or genomes of nearly
3000 CHD patients and identified variants with a high likelihood of contributing to CHD. Approaches that
identified rare variants enriched in CHD patient populations, together with damaging-effect prediction algorithms
that supported pathogenicity, yielded a list of potentially pathogenic variants. In further support of
pathogenicity, these variants are found in genes which have prior association with human CHD or have been
implicated in heart development in animal models. While this approach has aided in identification of novel
variants, more than one candidate variant is identified in many cases, which makes follow-up analyses difficult.
In the proposed exploratory grant, we will investigate the use of machine learning to integrate data obtained from
transcriptomic analysis of both mouse and induced pluripotent stem cell (iPSC) models of CHD. Rather than
building a common analytical pipeline by including all possible candidate genes for all CHDs, we will use genes
differentially regulated in CHD model systems that display phenotypes observed in the patient to prioritize
variants. To achieve this, the patient’s diagnosis will be used as input to identify RNA-seq datasets from
mouse/iPSC models with similar diagnoses from the Gene Expression Omnibus (GEO) database. The genes
differentially expressed in these datasets will carry additional weight in the prioritization pipeline. Simultaneously,
we will examine the expression of the genes in single-cell RNA-seq datasets from developing human embryonic
hearts. This will allow us to evaluate a gene’s expression in relevant cell types that contribute to normal heart
development. Variants in genes that are affected in multiple patients with overlapping subtypes of CHD will be
prioritized. This analysis pipeline will not exclude any genetic variant from consideration as a
candidate but will use expression analysis in CHD-model systems and single-cell transcriptomic data to rank the
variants. The output of this pipeline will be a ranked list of variants for each patient, ordered using the
information from the datasets mentioned above and current standards of variant prioritization, such as minor
allele frequency and predicted damaging effect. As a direct consequence, we expect to discover novel candidate
genes for CHD and identify genes with a higher burden in a subset of CHD cases. The creation, training and
testing of the machine learning algorithm will provide a platform for variant prioritization in patients with CHD, and
this model has the potential to be extended to other congenital malformations.
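
The ranking step described above can be illustrated with a minimal sketch. It is not the proposed machine learning model: a hand-weighted sum stands in for whatever the trained algorithm would learn, and the field names, weights, and 0-to-1 scaling are all assumptions made for illustration. The sketch shows only the intended behavior, namely that every variant is retained and ordered by combined evidence from minor allele frequency, predicted damaging effect, differential expression in a matching CHD model, and expression in relevant embryonic heart cell types.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical weights for illustration only; in the proposed work the
# contribution of each evidence type would be learned, not fixed by hand.
WEIGHTS = {"rarity": 1.0, "damage": 1.0, "model_de": 0.5, "cell_expr": 0.5}


@dataclass
class Variant:
    gene: str
    minor_allele_freq: float   # population minor allele frequency, 0..1
    damage_score: float        # predicted damaging effect, assumed scaled to 0..1
    de_in_model: bool          # gene differentially expressed in a matching mouse/iPSC dataset
    cardiac_cell_expr: float   # expression in relevant embryonic heart cell types, assumed 0..1


def score(v: Variant) -> float:
    """Combine the evidence types into one illustrative score (higher ranks first)."""
    rarity = 1.0 - min(v.minor_allele_freq, 1.0)  # rarer variants score higher
    return (
        WEIGHTS["rarity"] * rarity
        + WEIGHTS["damage"] * v.damage_score
        + WEIGHTS["model_de"] * (1.0 if v.de_in_model else 0.0)
        + WEIGHTS["cell_expr"] * v.cardiac_cell_expr
    )


def rank_variants(variants: List[Variant]) -> List[Variant]:
    """Return all variants ordered by descending score; none is filtered out."""
    return sorted(variants, key=score, reverse=True)
```

Because the function sorts rather than filters, a variant with weak expression evidence still appears in the output, only lower in the list, which matches the stated aim of not excluding any variant from consideration.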