Whole-genome sequencing (WGS) is revolutionizing the diagnosis of rare diseases. However, at present, even
the most powerful approaches to etiological discovery typically fail to find a genetic cause in a majority of participants
(Turro et al., Nature 2020). There are a number of reasons for this. Firstly, rare disease studies are typically
composed of small sets of unresolved cases, each sharing a different genetic etiology, which constrains statistical
power when only WGS and clinical phenotype data are available on participants. Secondly, the unknown causal
variants may have molecular consequences that are challenging to predict computationally, such as disruptions to
the regulatory elements (REs) of a gene or the introduction of a cryptic splice site. Thirdly, some types of causal
mutations, such as structural variants, are prone to being missed by WGS. Systematic transcriptomic profiling of
homogeneous cell populations taken from rare disease patients has the potential to overcome these limitations.
We have access to a collection of ~1,000 comprehensively phenotyped rare disease study participants with WGS
and RNA-seq of platelets, neutrophils, monocytes and CD4+ T-cells. Here, we present a research program of
statistical, computational and experimental approaches to uncover novel etiologies of rare diseases that exploits
the high dimensionality and the hierarchical nature of these data. We will concentrate on the etiologies underlying
~300 cases with a rare platelet disorder (RPD), exploiting our expertise in blood genomics. In Aim 1, we
will develop a Bayesian method for identifying rare variants in REs that cause rare diseases, treating expression as a
molecular mediator of genetic etiology. Our approach models the causal path between rare variants that overlap
cell type-specific REs, the corresponding cell type-specific changes in expression, and the consequent alteration
in rare disease risk. To include a recently discovered class of enhancer marked by H3K122ac but not H3K27ac
in our hypothesis search space, we will generate H3K122ac data on the relevant cell types from healthy donors.
In Aim 2, we will apply several approaches for identifying pathogenic changes in transcript sequences. For ex-
ample, we will apply recently developed methodology for identifying splicing outliers within the cohort. To ensure
these outliers are extreme in the wider population, we will compute splicing frequency spectra in large RNA-seq
datasets such as GTEx. These spectra will capture the population distribution of the within-individual proportion
of RNA-seq reads for a gene that include a given splice junction. We will also exploit the joint availability of WGS
and RNA-seq in patients to identify extreme allelic imbalances at WGS-called heterozygote sites. The candidate
variants that we identify will be validated in cell lines and primary samples. Rare diseases collectively affect one
in 20 people, but current etiological knowledge cannot resolve half of patients by WGS alone. The modeling and
analysis of large-scale, patient-derived RNA-seq data on multiple cell types as molecular mediators of disease
risk can fill this gap. The methodological and etiological output of our research program will ultimately boost the
diagnostic power of WGS and broaden the scope of precision medicine.
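
As a rough illustration of two quantities named in Aim 2, the sketch below computes the within-individual proportion of a gene's RNA-seq reads that include a given splice junction, places a patient's value within a reference distribution of such proportions, and tests a heterozygous site for allelic imbalance. It is a minimal sketch, not the methodology proposed here: the function names are hypothetical, the empirical tail fraction stands in for whatever outlier criterion is eventually used, and the plain binomial null of balanced expression is an assumption that ignores overdispersion and reference-mapping bias.

```python
from math import comb


def junction_usage(junction_reads: int, gene_reads: int) -> float:
    """Proportion of a gene's reads that include a given splice junction."""
    if gene_reads <= 0:
        raise ValueError("gene_reads must be positive")
    return junction_reads / gene_reads


def tail_fraction(reference: list[float], observed: float) -> float:
    """Fraction of reference individuals at least as extreme as `observed`.

    Takes the smaller of the two one-sided empirical tails, so both unusually
    high and unusually low junction usage count as extreme.
    """
    if not reference:
        raise ValueError("reference must not be empty")
    n = len(reference)
    upper = sum(r >= observed for r in reference) / n
    lower = sum(r <= observed for r in reference) / n
    return min(upper, lower)


def allelic_imbalance_p(alt_reads: int, total_reads: int) -> float:
    """Exact two-sided binomial p-value against balanced (0.5) expression.

    Assumption: each read carries either allele with probability 0.5, so an
    outcome with k alternate reads has probability C(n, k) / 2**n. The p-value
    sums the outcomes that are no more probable than the observed one.
    """
    if not 0 <= alt_reads <= total_reads:
        raise ValueError("alt_reads must lie between 0 and total_reads")
    observed = comb(total_reads, alt_reads)
    extreme = sum(
        c for c in (comb(total_reads, k) for k in range(total_reads + 1))
        if c <= observed
    )
    return extreme / 2 ** total_reads


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    reference = [junction_usage(j, 200) for j in (18, 20, 22, 19, 21)]
    patient = junction_usage(90, 200)
    print(tail_fraction(reference, patient))
    print(allelic_imbalance_p(alt_reads=2, total_reads=40))
```

In this framing, a junction would be a candidate only if the patient's usage is extreme both within the cohort and relative to a large reference such as GTEx, and an imbalanced heterozygous site would then be followed up against the WGS calls.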