Project Summary
We will continue to develop WormBase, a widely used knowledgebase, often consulted daily, of information about the
C. elegans genome, genes, sequence features, gene function, gene interactions, and related information. C.
elegans is a premier research organism with about 1500 registered laboratories worldwide that use its short
generation time, complete genome, efficient genome editing, and defined anatomy and neuroanatomy to study a
wide range of biomedical and fundamental topics. WormBase also curates, stores, and displays information
about nine other nematode genomes of biomedical importance. We will continue to develop necessary
ontologies and gene nomenclature to support systematic annotation of the genome and gene function and
expression. After 20 years of independent infrastructure development, we will now use the Alliance of Genome
Resources infrastructure for data ingest, storage, efficient curation, and presentation via download, API, and
web portal. We will complete the migration of the software infrastructure by the second year. This project will
focus on curation of genome scale datasets and individual experiments from the literature as well as storage and
display of C. elegans- or nematode-specific data. A major challenge is the growing volume of published data and
datasets combined with decreased staff, which we will address proactively by streamlining our systems and making
them more automated and higher throughput. Our main strategies for scaling curation are increased automation,
namely machine learning (ML) and artificial intelligence (AI), and community input powered by ML/AI and incentivized
by microPublication-based reviews of pathways and genes. As we scale while maintaining our very
high-quality data collection (which is re-used by many other bioinformatic resources), professional biocurators
with a deep understanding of the biology and researchers' needs will increasingly focus on data modeling, quality
control, development and training of automated systems, and supporting community curation. We will curate
information directly tied to nucleic acid sequence including the genome sequence; sequence features such as
gene structure models, regulatory regions, variants, sequence-based reagents, genome-scale experiments; and
gene expression including reporter gene assays, RNA-seq, and sc-RNA-seq. We will curate information centered
on gene function including phenotype of variants and perturbations, disease models, genetic and physical
interactions, Gene Ontology (GO) annotations, and pathways using GO-Causal Activity Models. After we
transition computational infrastructure to the Alliance, we will continue to curate datasets unique to C. elegans
and add them to the Alliance infrastructure. We will support researchers through a 24/7 help desk, which provides
advice and often analysis; curation, storage, and display of worm-specific datasets; provision of customized
analysis tools; and a community forum. For new data, we will specify software requirements for development at
the Alliance.