PROJECT SUMMARY
The exploration and interpretation of large, complex datasets is vital to discovery in genomics. However,
researchers now confront a fundamental limitation: unprecedented experiments are possible thanks to modern
DNA sequencing technologies, yet existing “genome arithmetic” algorithms and data formats for comparing
and dissecting the resulting datasets cannot keep pace with the inexorable growth in dataset size and
complexity. Genome arithmetic (GA) represents a powerful and widely used set of techniques that allow one to
explore relationships among sets of genome features (e.g., a gene, sequence alignment, ChIP-seq peak, or
anything that can be described with chromosome coordinates). GA is used for a broad spectrum of analyses,
including the detection of intersecting/overlapping features (e.g., sequence alignments and exons), the description
of feature coverage among datasets, and the merging, subtraction, and complementation of feature datasets. GA
functionality is used by all genome browsers and data visualization tools, and by analysis software such as
GATK and SAMTOOLS. Our BEDTOOLS software has become a staple of genomics research and is used in
a broad range of genomic analyses. However, continuous support and development have also revealed key
limitations in its current functionality, including crucial ones that hinder analytical flexibility. We argue that
innovations in genome arithmetic algorithms, data formats and user-friendly software are needed to: (1)
empower researchers to conduct large-scale analyses with simple, flexible tools; (2) improve analysis tools to
keep pace with the scale of modern datasets; and (3) visualize and quantify relationships among genome
datasets.
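As a purely illustrative sketch of the genome arithmetic operations described above, the following Python code intersects and merges features represented as chromosome coordinates. It is not BEDTOOLS code: the function names, the tuple representation, and the brute-force pairwise comparison are assumptions made for brevity.

```python
from typing import List, Tuple

# A feature is (chromosome, start, end). Half-open coordinates are assumed here.
Interval = Tuple[str, int, int]


def intersect(a: List[Interval], b: List[Interval]) -> List[Interval]:
    """Return the overlapping portion of every pair of features from a and b.

    Brute-force sketch: every feature in a is compared with every feature in b.
    """
    hits = []
    for chrom_a, start_a, end_a in a:
        for chrom_b, start_b, end_b in b:
            if chrom_a != chrom_b:
                continue
            start, end = max(start_a, start_b), min(end_a, end_b)
            if start < end:  # a non-empty overlap exists
                hits.append((chrom_a, start, end))
    return hits


def merge(features: List[Interval]) -> List[Interval]:
    """Collapse overlapping or directly adjacent features into single features."""
    merged: List[Interval] = []
    for chrom, start, end in sorted(features):
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            prev = merged[-1]
            merged[-1] = (chrom, prev[1], max(prev[2], end))
        else:
            merged.append((chrom, start, end))
    return merged


# Hypothetical example data, not drawn from any real dataset.
exons = [("chr1", 100, 200), ("chr1", 150, 300), ("chr2", 50, 80)]
alignments = [("chr1", 180, 220), ("chr2", 10, 60)]

overlaps = intersect(exons, alignments)
merged_exons = merge(exons)
```

The pairwise loop in this sketch performs a number of comparisons proportional to the product of the two dataset sizes, which gives a sense of why naive approaches struggle as datasets grow.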
Therefore, the overall objective of this proposal is to provide the genomics community with innovative
new algorithms and software that keep pace with modern genomics experiments and facilitate future
discoveries. The Specific Aims are to: (1) Develop a refined suite of genome arithmetic algorithms and a
programming interface for scalable analysis with BEDTOOLS. (2) Create new algorithms and genome
interval sketching approaches to enable large-scale dataset comparisons. (3) Enable large-scale
visualization and statistical analyses grounded in our recent advances in devising scalable new data
formats. These innovations will yield scalable new algorithms, data structures and formats that will
empower thousands of genomics researchers around the world.