Enabling data quality assessment of organelle genomes archived on GenBank through novel open-source software tools
Project Summary/Abstract
The project aims to enable scientists from various biomedical disciplines to make informed, evidence-based decisions on the reuse of archived organelle genomes. Specifically, it aims to give scientists the computational means to evaluate data quality among the thousands of mitochondrial and plastid genomes stored in the sequence database GenBank. Organelle genomes archived on GenBank are retrieved and employed in many biomedical investigations, including those in human genetics, microbiology, environmental health, toxicology, and forensics. However, many studies overlook that a considerable proportion of mitochondrial and plastid genomes on GenBank exhibit signs of incorrect genome assembly, incomplete sequence annotation, or both. Indications of low data quality are found even among organellar genome records with reference genome status. Hence, new computational methods are needed to assess the data quality of GenBank-archived organelle genomes so that only accurate and reliable genome records are selected and integrated into new analyses. The proposed project develops such methods. It generates novel software tools that enable scientists to assess GenBank-archived organelle genomes from various eukaryotic lineages in an automated, standardized fashion. The new tools enable users to evaluate, quantify, and visualize those aspects of organellar genome records on GenBank that are applicable across all such records and indicative of their genome accuracy and completeness. A total of four software tools are developed. Tool #1 automatically links organellar genome records to their short-read data in the Sequence Read Archive (SRA). Tool #2 assesses the quality of organelle genomes by measuring sequencing coverage and sequencing evenness.
Tool #3 assesses genome quality by comparing the genome sequence of a given record to its de novo re-assembly with modern assemblers. Tool #4 assesses genome quality by comparing the gene annotations of a given record with those of closely related individuals or species. Quality assessments at different scales are enabled by implementing features that support the integration of each tool into automated analysis pipelines. The tools are tested on large and diverse sets of GenBank-archived organelle genomes, including a data set of thousands of human mitochondrial genomes. Each tool is written in a common scripting language and distributed as an open-source application to encourage wide reuse by other scientists. As a result, the project expands the existing computational toolkit for organelle genomics and allows scientists across different biomedical disciplines to use quality metrics to decide which mitochondrial and plastid genomes on GenBank to reuse. Taking place at a predominantly undergraduate institution, the project actively includes student researchers with the goal of training them in genome assembly, annotation, and bioinformatics tool development.
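
To illustrate the kind of metric Tool #2 is described as reporting, the following minimal Python sketch summarizes per-base read depth for a single genome record. It is not the project's implementation. The function name `coverage_summary`, the use of the coefficient of variation as an evenness measure, and the 20% low-coverage threshold are all assumptions made for illustration; the summary above does not specify how sequencing evenness is defined. The input is assumed to be a list of per-position depths, such as the third column of `samtools depth -a` output.

```python
from statistics import mean, pstdev


def coverage_summary(depths, low_fraction=0.2):
    """Summarize per-base read depth for one genome record.

    depths: per-position read depths (hypothetical input format).
    low_fraction: assumed threshold; positions with depth below
        low_fraction * mean depth are counted as low coverage.
    """
    if not depths:
        raise ValueError("no depth values supplied")

    avg = mean(depths)
    if avg == 0:
        # No reads mapped at all: treat the record as maximally uneven.
        return {"mean_depth": 0.0, "cv": float("inf"),
                "low_coverage_fraction": 1.0}

    # Coefficient of variation (assumed evenness proxy): 0 means all
    # positions have identical depth; larger values mean less even coverage.
    cv = pstdev(depths) / avg

    # Share of positions whose depth falls below the assumed threshold.
    threshold = low_fraction * avg
    low = sum(1 for d in depths if d < threshold) / len(depths)

    return {"mean_depth": float(avg), "cv": cv, "low_coverage_fraction": low}


if __name__ == "__main__":
    # Toy depth vectors, not real data.
    print(coverage_summary([10, 10, 10, 10]))
    print(coverage_summary([0, 0, 20, 20]))
```

Under these assumptions, two records with the same mean depth can be told apart by their evenness: a record whose reads pile up on half of the genome has a higher coefficient of variation and a larger low-coverage fraction than a record with uniform depth.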