Project Summary and Abstract
Narratives of electronic health records (EHRs) contain useful information that is difficult to automatically extract,
index, search, or interpret. Natural language processing (NLP) technologies can extract this information and
convert it into a structured format that is more readily accessible to computerized systems. However, the
development of NLP systems is contingent on access to relevant data, and EHRs are notoriously difficult to obtain
for privacy reasons. Despite the recent efforts to de-identify and release narrative EHRs for research,
these data are still very rare. As a result, clinical NLP, as a field, has lagged behind. To address this problem,
we have organized thirteen shared tasks since 2006, accompanied by workshops and journal publications. Twelve
of these shared tasks have focused on the development of clinical NLP systems and the remaining one on the
usability of these systems. We have covered both depth and breadth in terms of shared tasks, preparing tasks
that study cutting-edge NLP problems on a variety of EHR data from multiple institutions. Our shared tasks are
the longest-running series of clinical NLP shared tasks, with ever-growing EHR data sets, tasks, and participation.
Our three most popular data sets have been cited 495 (2010 data), 284 (2006 de-id data), and 274 (2009 data)
times, respectively, representing hundreds of articles that have come out of these three data sets alone. Our
goal in this proposal is to continue the efforts we started in 2006 under i2b2 shared task challenges (i2b2, NIH
NLM U54LM008748, PI: Kohane and R13 LM011411, PI: Uzuner) to de-identify EHRs, annotate them with gold-
standard annotations for clinical NLP tasks, and release them to the research community for the development
and head-to-head comparison of clinical NLP systems, for the advancement of the state of the art. Continuing
our efforts under the National NLP Clinical Challenges (n2c2), based at the Health Data Science program of the
newly established Department of Biomedical Informatics at Harvard Medical School, we aim to form partnerships
with the community to grow the shared task efforts in several ways: (1) grow the available de-identified EHR data
sets through partnerships that can contribute to the volume and variety of the data, and (2) grow the available
gold-standard annotations in terms of depth and breadth of NLP tasks. Given these aims and partnerships, we
plan to hold a series of shared tasks. We will complement these shared tasks with workshops that meet in
conjunction with the Fall Symposium of the American Medical Informatics Association and with journal special
issues so that the state of the art can advance more quickly and future generations can build on past work.