Dockstore - A Platform for Sharing Tools & Workflows on the Cloud Commons - Project Abstract
Across all sectors of the Internet, security, regulatory, and privacy concerns, coupled with bandwidth limitations,
are shifting Big Data technologies away from data transfer and toward algorithms that analyze data in situ.
Biomedicine can take advantage of general technologies like Docker that have evolved to meet the needs of
this shift. Already Docker, which allows a new level of lightweight portability for computer code, has
significantly penetrated bioinformatics. No recent project illustrates this better than the large, international
Pan-Cancer Analysis of Whole Genomes (PCAWG, https://dcc.icgc.org/pcawg) collaboration. This effort saw the
creation of common analytical pipelines that were uniformly applied to the whole genome sequences of over
2,800 cancer donors in 14 disparate HPC and cloud computing environments, making extensive use of Docker
container technology. Critical to this effort was a rethink of the way algorithms were developed, packaged, and
moved from environment to environment. The net outcome was the creation of the Dockstore project
(http://dockstore.org). Dockstore facilitates the sharing and mobility of biomolecular analysis tools and
workflows. It allows bioinformaticians to bring together individual tools and entire workflows packaged in
portable Docker images (containers), described using either the Common Workflow Language (CWL) or
Workflow Description Language (WDL). In this way, Dockstore standardizes computational analyses, making
them precisely reproducible and runnable in any environment that supports Docker.
The work proposed here supports the extended development, hardening, content addition, cloud integration
and dissemination of the Dockstore to the wider biomedical research community. Most importantly, it supports
full federation, under the auspices of the Global Alliance for Genomics and Health (GA4GH), of the original
Dockstore with other similar projects worldwide through an API that makes it possible to search for containers
and workflows across a global network. The federated network will allow groups and individual projects to
create not only individual analyses, but entire analysis repositories that are institutionally branded and shared
with the rest of the world under a common GA4GH index and set of interoperability standards. The result will
be an integrated network providing portable, securely signed, easily deployed workflows and tools covering the
spectrum of biomedical analyses. It will make finding, testing, and applying these analyses to new data far less
time-consuming and error-prone, and reduce redundant reimplementation of key bioinformatic tasks. In
contrast to the approach taken previously by influential efforts like Galaxy, which resulted in push-button
methods that proved hard to scale to large datasets, the focus on portable, scalable workflow standards that
can be run on a variety of platforms makes this the right basis for a broad biomedical analysis commons.
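
The federated search described above, in which one query is answered by several independently run registries, can be illustrated with a minimal sketch. It is not the Dockstore or GA4GH API: the Entry and Registry classes, the registry names, the entry fields, and the Docker image references are all hypothetical, and each registry is an in-memory stand-in for a remote service.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    """One published tool or workflow (hypothetical schema, for illustration only)."""
    registry: str    # name of the registry that published the entry
    name: str        # tool or workflow name
    descriptor: str  # descriptor language, "CWL" or "WDL"
    image: str       # Docker image reference (made-up values below)


class Registry:
    """In-memory stand-in for one institutionally branded registry."""

    def __init__(self, name, entries):
        self.name = name
        self._entries = [Entry(name, *fields) for fields in entries]

    def search(self, term):
        """Return entries whose name contains the term, ignoring case."""
        term = term.lower()
        return [e for e in self._entries if term in e.name.lower()]


def federated_search(registries, term):
    """Send the same query to every registry and merge the hits."""
    hits = [e for r in registries for e in r.search(term)]
    # Sort for a stable, registry-independent presentation order.
    return sorted(hits, key=lambda e: (e.registry, e.name))


registries = [
    Registry("registry-a", [
        ("bwa-mem-aligner", "CWL", "example/bwa-mem:1.0"),
        ("variant-caller", "WDL", "example/caller:2.1"),
    ]),
    Registry("registry-b", [
        ("bwa-mem-workflow", "WDL", "example/bwa-wf:0.3"),
    ]),
]

for hit in federated_search(registries, "bwa"):
    print(hit.registry, hit.name, hit.descriptor, hit.image)
```

In a real federation each Registry.search call would be a network request to a separately operated service that implements a shared interface. The sketch shows only the merge step, which is what lets a user search many branded repositories through a single index.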