Generating better-targeted training data for computational enhancer discovery in vector insects - Vector insects are responsible for over 700,000 deaths annually and hundreds of billions of dollars in associated economic impact. Although many vector insect species have had their genomes sequenced, the regulatory component of these genomes is largely undefined. Characterizing regulatory sequences, in particular “enhancers,” is critical for understanding the organization of gene regulatory networks and how the genome informs phenotype. Enhancers play important roles in mediating insecticide resistance, pathogen transmission, host recognition, mating success, and other critical vector-relevant biological processes. Enhancers also serve as major components of biotechnology tools that require precisely targeted gene expression, which adds to the importance of identifying and characterizing them. Because enhancers are so important, yet few insect enhancers are known outside the fruit fly, Drosophila melanogaster, there is a pressing need for efficient enhancer discovery methods that can be applied to a wide range of vector species. We previously developed “SCRMshaw,” a computational method for enhancer discovery, and used it to predict enhancers in over 36 insect species, including vector mosquitoes. SCRMshaw uses known enhancers as “training data” to guide its search for unknown enhancers with related function. We have demonstrated that Drosophila enhancers, of which many (~38,000) are well-characterized, can serve as training data to discover similar enhancers in other insect species. Unfortunately, despite this large number of known Drosophila enhancers, training enhancers are not available for many significant cell types. This is especially true for non-embryonic stages of the life cycle and for cells of particular interest for vector biology, e.g., those involved in insecticide resistance, pathogen transmission, and reproduction.
This proposal addresses this major shortcoming by developing a new way to generate SCRMshaw training data from scATAC-seq data. This will open up a broad potential source of training data for currently under-studied cell types and enable prediction of enhancers in species at greater evolutionary distances from Drosophila. It will also allow for significantly improved estimation of true- and false-positive enhancer prediction rates. The proposed approach is rapid and inexpensive, and although it requires empirical data from only a single representative organism, it can be applied to dozens to hundreds of loosely related organisms. It will allow a single set of quality scATAC-seq experiments in Drosophila or a vector mosquito to be leveraged for enhancer prediction in the majority of vector insect species. This work will therefore provide immediate, important outcomes by enabling functional regulatory annotation of an expanded set of relevant cell types for a significant portion of sequenced vector insects. It will have a major long-term impact on our ability to address fundamental questions of vector biology and to improve strategies for vector control.