Project Summary / Abstract
Repetitive DNA, especially that derived from transposable elements (TEs), makes up a large fraction of many genomes.
Thorough and accurate annotation of repetitive content in genomes depends on a comprehensive database of
known TEs, along with robust statistical and procedural methods for recognizing decayed instances of elements
and disentangling their complex relationships.
Annotation of TE instances is usually performed using our RepeatMasker software, which compares a genome
to a database containing representations of known repeat families. These have historically been consensus
sequences, which generally approximate the sequences of the original TEs. Our Dfam database is an open-access
collection of repetitive DNA families, in which each family is represented by a multiple sequence
alignment and a profile hidden Markov model (HMM). We have demonstrated that profile HMMs support
improved annotation sensitivity, and Dfam provides numerous aids to both curators of TE families and those who
make use of the resulting annotations.
During the life of this grant, the database has grown to include families belonging to more than 1000 species
(from a baseline of 5). This growth has introduced a number of scale-related pressures, which in some cases
have forced us to reduce Dfam functionality and in other cases have highlighted ways that the resource
can better meet the needs of the community. Our proposed work largely addresses these pressures while continuing
to expand and diversify the resource.