Histotools: scaling digital pathology curation tools for quality control, annotation, labeling, and dataset identification - ABSTRACT: With the recent approval of whole slide scanners for primary diagnosis, wherein routine glass histopathology slides are digitized and presented to clinical pathologists for diagnosis on computer monitors, a wealth of new untapped data is being created in routine clinical practice and placed in growing data lakes. In digital format, these whole slide images (WSIs) can be subjected to digital pathomics, i.e., the process of extracting quantitative image features associated with the morphology, attributes, and relationships of histologic objects in WSIs. These features can subsequently be employed for discovery in many domains such as histogenomics, which associates phenotypic presentations with biological pathways and gene ontologies. Additionally, low-cost, non-tissue-destructive, image-based companion diagnostic assays (CDx) can be developed for predicting prognosis and treatment response of patients. Unfortunately, unprocessed large data lakes (e.g., TCGA) are not sufficient on their own for pathomics, and often require an intractable amount of human curation effort in (i) performing meticulous quality control of WSIs (i.e., avoiding “garbage-in, garbage-out”) and subsequently (ii) precisely annotating (e.g., cell boundary) and labeling (e.g., cell type) histologic objects.
To address these major limiting factors in curating data lakes, we propose extending our small-scale HistoTools prototypes to employ computing clusters, enabling them to function at the scale of large digital slide repositories (DSRs): (i) HistoQC for robust, reproducible quality control of WSIs by identifying artifacts (e.g., blurriness) and outliers (e.g., poorly stained slides) for avoidance in downstream analyses, (ii) CohortFinder for identification and compensation of batch effects, (iii) Quick Annotator for rapid computer-aided annotation generation via a combination of active learning and machine learning, and (iv) PatchSorter for improving sub-typing of histologic objects with machine learning. We will evaluate HistoTools for improvement of quality control and the efficiency of both segmenting and labeling histologic objects of interest via (a) onsite curation and release of the 14k WSIs used during our internal validation and (b) supported external curation of at least 100k WSIs via 24 clinical affiliates from every continent except Antarctica, who together have access to over 20 million WSIs during this proposal. Our validation use cases are designed to expedite existing onsite projects in the CDx space, consisting of 4 organs (breast, lung, heart, kidney), 3 diseases (cancer, kidney disease, and organ rejection), and WSIs collected from >70 sites. These cohort characteristics will help ensure the generalizability of our tools for curated data lake creation, with open-source and usability study approaches employed to obtain feedback from collaborators and the larger research community. Dissemination through consortia (ITCR, NEPTUNE) and websites (GitHub, TCIA) will improve visibility and adoption. The tools and well-curated data sets we release are anticipated to bootstrap researcher-initiated CDx discovery projects, along with the creation of their own onsite curated data lakes. Together, this proposal will engender digital pathology-based precision medicine research.