Project Summary / Abstract
Repetitive DNA, especially that derived from transposable elements (TEs), makes up a large fraction of many genomes.
Thorough and accurate annotation of repetitive content in genomes depends on a comprehensive database of
known TEs, along with robust statistical and procedural methods for recognizing decayed instances of elements
and disentangling their complex relationships.
Annotation of TE instances is usually performed using our RepeatMasker software, which compares a genome
to a database containing representations of known repeat families. These have historically been consensus
sequences, which generally approximate the sequences of the original TEs. The largest repository of such
consensus sequences is Repbase, whose restrictive license and limited interface for curators have led to a lack of
input from third parties and the creation of many unaffiliated, often organism-specific open databases. The parallel
existence of these many databases has led to a divergence in nomenclature and repeat definition.
Our Dfam database is an open access collection of repetitive DNA families, in which each family is represented
by a multiple sequence alignment and a profile hidden Markov model (HMM). We have demonstrated that profile
HMMs support improved annotation sensitivity, and Dfam provides numerous aids to both curators of TE families
and those who make use of the resulting annotations. In this proposal, we describe a plan to develop the
infrastructure of Dfam to scale to thousands of genomes, and to establish a self-sustaining TE Data Commons
that requires only limited centralized curation. We further describe plans to improve the quality of repeat annotation
through development of methods for more reliable alignment adjudication, to expand approaches to visualization
of this complex data type, and to improve the modeling of TE subfamilies.
By further developing this open access database, we will provide a strong disincentive for the proliferation of
unaffiliated non-standard repeat datasets and ease the burden of data management for those developing TE