ABSTRACT
The role of germline genetic variation and viral infection in disease development and progression has been studied
extensively in adult tumors and autoimmune disease. Less attention has been paid to the interaction of these
factors with birth defects and pediatric malignancies, particularly acute myeloid leukemia (AML), which in the
youngest patients is driven almost exclusively by structural variants (SVs) with poorly understood etiology. The
prevalence of gene fusion transcripts associated with leukemia at live birth is 10x to 100x greater than the
incidence of childhood leukemia, which suggests other risk factors must interact with SVs. One possible
candidate for interaction is the timing of viral infection, either in parents or children. Clinical trials of gene
therapy with viral vectors failed in part due to viral integrations activating oncogenes such as MECOM. Recent
work has shown a direct mechanism for derivative chromosome formation at the most common breakpoints in
leukemia, and human herpesviruses, including CMV, are among the greatest risk factors for
chromosomal birth defects. Elsewhere, we and others have documented germline and somatic copy number
and short sequence variants affecting E26 transformation-specific (ETS) factors, which participate in high-risk
gene fusions seen in both solid and liquid tumors. These factors and their binding sites determine
developmental fates across tissues, yet their motifs are short tandem repeats -- the single most variable class
of features in the human genome. Small changes in dosage, as created by haploinsufficiency or variation in
ramp sequences, may be sufficient to predispose individuals to disease. The primary obstacle to studying
either of these mechanisms has long been the small sample sizes and biased coverage of cohorts assembled
for rare and childhood diseases. The vast quantity of whole-genome, whole-transcriptome, and long-read
sequencing data provided by the Gabriella Miller Kids First! (GMKF) Consortium and the INCLUDE cohorts
removes this obstacle. With the addition of the thousands of pediatric clinical trial participants in Project: Every
Child and the forthcoming X01 Long Read Pilot Project for omics-cold pediatric leukemia patients, we posit
that both the computational infrastructure and the sample sizes required for progress are now in place. We
propose to characterize germline regulatory, splicing, and structural variants as catalysts of risk for
leukemia and related preleukemic conditions, noting that triplication of the ETS factor ERG is an inherent
feature of Down syndrome-driven disease (a 500x multiplier for risk over the general population). With
colleagues, we have identified non-coding variants with strong effects in Down syndrome. This suggests that
the combination of sample size, cohort diversity, and representation of the most common structural variants in
human disease within the GMKF Consortium presents an ideal opportunity to address this urgent need and
determine if predisposition risk can be mitigated by screening or prophylaxis.