DEVELOPMENT OF DNA SEQUENCE DATA-QUALITY METRICS FOR PERSONAL GENOMICS
DESCRIPTION (provided by applicant): In June 2011, the FDA hosted a public meeting: "Ultra High Throughput Sequencing for Clinical Diagnostic Applications - Approaches to Assess Analytical Validity" (FDA Public Meeting, 2011). The background documentation for this meeting noted that "In order to effectively utilize new sequencing technologies for clinical applications, appropriate evaluation tools (e.g., standards, well established criteria) are needed to determine the accuracy of the results." Achieving excellent data quality from next-generation sequencing technologies and understanding when the results may be in error are of clear importance, whether the results are being viewed by a clinician or a consumer. For this application, 23andMe will focus on the analysis of the accuracy of next-generation sequencing technologies using approximately 150 exomes (including 50 new exomes sequenced for this project) and 100 whole genomes, specifically with reference to false positive and false negative rates for variants located in known disease genes. 23andMe has genotyped over 125,000 individuals and reported data back to them on hundreds of disease-associated variants. This experience has shown us that many important disease genes are difficult to assay with a genotyping chip, whether because of pseudogenes (e.g., GBA), paralogs (e.g., SMN1, CYP2D6), or unknown reasons (e.g., APOE). We have also noted differences in genotyping accuracy between blood and saliva. For these reasons, we expend significant resources validating the results of our genotyping chip using positive controls derived from the 23andMe customer database. The 50 exomes we will sequence for this project will be chosen to carry Sanger sequencing-validated disease-associated variants in the disease genes listed above. This project is a crucial first step toward our goal of creating a pipeline for next-generation sequence annotation that combines (a) stringent QC based on genotyping array and Sanger sequencing data; (b) manually curated data from the human genetics literature; and (c) computationally derived variant assessment for variants of unknown significance, in order to produce a report that will be returned to a consumer for a personalized health assessment.
PUBLIC HEALTH RELEVANCE: Before we can achieve broad adoption of novel sequencing technologies in the clinic, we must understand when their results are accurate. This project will investigate error rates from next-generation sequencing technologies in clinically relevant disease genes. This will help us define data quality metrics and technical specifications for a sequencing-based Personal Genome Service(R).
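As an illustration of the kind of error-rate comparison described above, the following is a minimal sketch, not the project's actual pipeline. It assumes that sequencing calls and Sanger-validated variants are each available as sets of (chromosome, position, reference allele, alternate allele) tuples, and that the validated set is complete for the region assessed, so that any call outside it counts as a false positive. The function name `compare_calls`, the `ErrorRates` type, and the coordinates are hypothetical. Because true negatives are not counted here, the false positive burden is reported as a proportion of all calls (a false discovery rate) rather than a strict false positive rate.

```python
from typing import NamedTuple, Set, Tuple

# Hypothetical variant key: (chromosome, position, reference allele, alternate allele)
Variant = Tuple[str, int, str, str]


class ErrorRates(NamedTuple):
    true_positives: int
    false_positives: int
    false_negatives: int
    false_discovery_rate: float  # FP / (TP + FP)
    false_negative_rate: float   # FN / (TP + FN)


def compare_calls(called: Set[Variant], validated: Set[Variant]) -> ErrorRates:
    """Compare sequencing calls against a validated truth set.

    Assumption: `validated` is complete for the assessed region, so any
    called variant that is absent from it is treated as a false positive.
    """
    tp = len(called & validated)   # called and validated
    fp = len(called - validated)   # called but not validated
    fn = len(validated - called)   # validated but missed by sequencing
    fdr = fp / (tp + fp) if (tp + fp) else 0.0
    fnr = fn / (tp + fn) if (tp + fn) else 0.0
    return ErrorRates(tp, fp, fn, fdr, fnr)


# Made-up coordinates for illustration only; these are not real variants.
validated = {("chr1", 100, "A", "G"), ("chr1", 200, "C", "T"), ("chr2", 50, "G", "A")}
called = {("chr1", 100, "A", "G"), ("chr2", 50, "G", "A"), ("chr2", 75, "T", "C")}

rates = compare_calls(called, validated)
print(rates)
```

In practice such a comparison would likely be restricted to the disease genes of interest and stratified by gene, since the difficulty of assaying a locus (for example, one with a pseudogene or paralog) is expected to vary.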