Enabling data quality assessment of organelle genomes archived on GenBank through novel open-source software tools - Project Summary/Abstract
The project aims to enable scientists from various biomedical disciplines to make informed, evidence-based
decisions on the reuse of archived organelle genomes. To that end, it gives scientists the computational means to
evaluate data quality among the thousands of mitochondrial and plastid genomes stored on the sequence
database GenBank. Organelle genomes archived on GenBank are retrieved and employed in many
biomedical investigations, including in human genetics, microbiology, environmental health, toxicology,
and forensics. However, many studies overlook the fact that a considerable proportion of mitochondrial and plastid
genomes on GenBank exhibit signs of incorrect genome assembly, incomplete sequence annotation, or
both. Indications of low data quality are even found among organellar genome records with reference
genome status. Hence, new computational methods are needed to assess the data quality of GenBank-
archived organelle genomes so that only accurate and reliable genome records are selected and integrated
into new analyses. The proposed project develops such methods. It generates novel software tools that
enable scientists to assess GenBank-archived organelle genomes from various eukaryotic lineages in
an automated, standardized fashion. The new tools enable users to evaluate, quantify, and visualize
those aspects of organellar genome records on GenBank that are applicable across all such records and
indicative of their genome accuracy and completeness. A total of four software tools are developed. Tool
#1 automatically links organellar genome records to their short-read data in the database SRA. Tool #2
assesses the quality of organelle genomes by measuring sequencing coverage and sequencing evenness.
Tool #3 assesses genome quality by comparing the genome sequence of a given record with its de novo
re-assembly using modern assemblers. Tool #4 assesses genome quality by comparing the gene annotations
of a given record with those of closely related individuals or species. Quality assessments at different scales
are enabled by implementing features that support the integration of each tool into automated analysis
pipelines. The tools are tested on large and diverse sets of GenBank-archived organelle genomes, including
a data set of thousands of human mitochondrial genomes. Each tool is written in a common scripting
language and distributed as an open-source application to encourage wide reuse by other scientists. As a
result, the project expands the existing computational toolkit for organelle genomics and allows scientists
across different biomedical disciplines to use quality metrics to decide which mitochondrial and plastid
genomes on GenBank to reuse. Taking place at a predominantly undergraduate institution, the project
actively includes student researchers with the goal of training them in genome assembly, annotation, and
bioinformatics tool development.
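
As an illustration of the kind of metric that Tool #2 is meant to report, the following minimal Python sketch summarizes per-base sequencing depth for a single genome record. It is not the project's implementation: the function name `coverage_summary`, the use of the coefficient of variation as an evenness measure, and the 20% low-coverage threshold are all assumptions made for this example. The depth values are assumed to have been computed beforehand by mapping the linked short reads to the archived genome.

```python
from statistics import mean, pstdev


def coverage_summary(depths, low_fraction=0.2):
    """Summarize per-base sequencing depth for one genome record.

    depths: list of read depths, one value per genome position
            (assumed to come from a prior read-mapping step).
    low_fraction: hypothetical cutoff; positions with depth below
                  this fraction of the mean count as poorly covered.
    """
    if not depths:
        raise ValueError("depths must contain at least one position")

    avg = mean(depths)

    # Coefficient of variation as a simple evenness proxy:
    # 0 means perfectly even coverage, larger values mean less even.
    cv = pstdev(depths) / avg if avg > 0 else float("inf")

    # Share of positions whose depth falls below the cutoff.
    threshold = low_fraction * avg
    low = sum(1 for d in depths if d < threshold) / len(depths)

    return {"mean_depth": avg, "cv": cv, "low_coverage_fraction": low}


# Two toy records with the same mean depth but different evenness.
even = coverage_summary([100, 100, 100, 100])
uneven = coverage_summary([190, 190, 10, 10])
```

In this toy comparison both records have the same mean depth, but only the second shows a high coefficient of variation and a large share of poorly covered positions, which is the kind of signal that could flag a record for closer inspection.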