Scalable post-assembly editing software for finishing and annotating personal genomes

We are entering a new era of personal genomics where an individual's genome sequence will be used to
identify disease susceptibility, improve diagnosis and better treat illnesses, and will be combined across
cohorts and populations to identify new biomarkers and causal mutations underlying any phenotype. Despite
the tremendous success of mapping short-read next-generation sequencing (NGS) data onto a reference
genome (resequencing) in identifying genetic variation in a new genome, the inherent lack of long-range
connectivity, together with reference-induced biases, makes obtaining complete haplotype-phased genomes
exceedingly difficult. Emerging long-read technologies are beginning to address this critical shortcoming by
enabling direct de novo assembly of an individual's genome. However, initial de novo assemblies typically consist of
many thousands of unordered contigs that require extensive post-assembly processing to produce finished
sequences that can be effectively mined for genetic content and variation. Thus, there is an urgent need for
integrated, scalable post-assembly software that 1) automatically organizes, joins and phases the initial contigs
into complete haplotype sequences, 2) supports optional NGS and/or manual polishing and 3) provides initial
automated annotation of those sequences. Currently, no such software exists, and users must instead
cobble together a confusing array of difficult-to-use, task-specific open-source programs.
DNASTAR's post-assembly editing program, SeqMan Pro (SMP), has a proven history in finishing bacterial-sized
genomes, although it currently lacks the scalability and some of the functionality needed to tackle
human-genome-sized problems. The primary goal of this Fast Track proposal is to create a fully scalable version of
SMP for the automated finishing and annotation of de novo assembled large eukaryotic genomes while also
providing a manual editing platform when needed. During Phase I, we will develop two key prototypes: 1) a
new assembly file format, eBAM, which is interconvertible with the BAM format but is also editable like our
SQD files, and 2) a rapid reference-assisted contig scaffolding tool adapted from our proprietary Disk Sort
Alignment (DSA) algorithm. With that foundation, we will complete the transformation of SMP in Phase II by: 1)
refining the eBAM format for optimal editing performance, 2) building a new 64-bit version of the SMP editing
engine that incorporates the additional functionality necessary for post-assembly finishing of large eukaryotic
genomes including automated DSA-based scaffolding and phase-aware gap filling, contig joining and
haplotype refinement, 3) creating a new DSA-based genome aligner for rapidly aligning a finished sequence to
an annotated reference genome and 4) adding a new feature transfer and analysis module. Together, the aligner
and the module will permit initial annotation of the finished genome along with a catalog of variants and
their impact in both native and
reference coordinates. Inclusion of the reference coordinates allows variants in the new genome to be easily
associated with the wealth of information available through numerous online knowledge base resources.