The AnVIL Data Ecosystem
Project Summary / Abstract
In this proposal, we bring together a unified team with a strong track record of developing secure and scalable
software systems to support flagship scientific efforts, such as the All of Us Research Program, the Genomic
Data Commons (GDC), and the Human Cell Atlas (HCA). Our group will leverage these experiences, and the
software developed for them, to create an ecosystem of applications that will both serve the needs of the
AnVIL and interoperate with other NIH data resources. We will accomplish this through the following Aims:
• Aim 1 (Software Engineering): Leverage existing software capabilities to create tools for storing,
sharing, and analyzing AnVIL datasets at unlimited scale. During the past five years, our groups
have created a suite of modular, open-source software capabilities that address key needs in
genomic data science. We will leverage these existing capabilities and extend them in novel directions
to address AnVIL-specific scientific goals relating to human genetics and functional genomics.
• Aim 2 (Data Engineering): Curate data and metadata resources so that they are easily
accessible. The AnVIL will not only be a suite of software services, but also a vast repository of
genotypic and phenotypic information. For this resource to be usable by the community, it must be
organized, curated, and made accessible. We will accomplish this by processing genomic datasets
using a consistent set of best-practices pipelines, and mapping phenotypes to a common data model.
• Aim 3 (Operations): Stand up and support a data environment for the AnVIL community, and
integrate it with other NIH resources as part of a federated NIH-wide genomic data commons.
The modular components of Aim 1 are critical building blocks, but they alone are not enough to meet
the needs of the AnVIL; they must also be stood up as services and integrated into a coherent entity,
which we call a “data environment.” We propose to create an AnVIL data environment that will enable
researchers to access datasets in a secure, compliant, and facile manner.
The guiding principle of these efforts is that progress in genomic science will happen most rapidly if there is a
diversity of solutions created by a plurality of groups. To that end, our approach to engineering the
software components of Aim 1, curating the datasets of Aim 2, and operating the software services of Aim 3 is
to catalyze an ecosystem of activity around the AnVIL. Our proposal focuses not only on creating and
operating software services ourselves, but also on incorporating third-party solutions. We propose to
accomplish this by architecting the AnVIL data environment according to the following principles: (i) modularity,
(ii) openness, (iii) community engagement, (iv) standardization, and (v) interoperability.