Project Summary
As United States healthcare seeks to address inconsistent quality and overwhelming cost, data and
technology have become central to all suggested approaches. With newly available electronic health
data and massive growth in processing power, the hardest challenges in using clinical data are becoming
clear.
Big data holds the potential to enable personalized patient care, population health management, and
value-based payment models. However, it also creates challenges in discriminating accurate data from
inaccurate or incomplete information. One of the greatest areas of data inaccuracy is the patient
phenotype, or clinical description of the patient. Every clinical decision support tool, population health
management system, and payment reform product relies on accurate electronic patient descriptions as
its source data.
But these descriptions are not accurate, most notably in terms of completeness and granularity. Recall
often falls below 50% in describing a patient’s medical conditions, such as heart failure and cancer.
Detailed descriptions such as low ejection fraction heart failure or stage III breast cancer, needed for
downstream analytics, are lacking in the discrete record. Poor data puts care delivery, payment reform,
and population health efforts in peril. The time is right for technology to proactively define the clinical
phenotype from source data, without reliance on current manual approaches. This will require
overcoming challenges in harmonizing discrepant narrative and discrete data, inferring when a
characteristic such as cough is a primary condition versus a symptom of another condition, and
separating signal from noise in voluminous narrative text.
This Small Business Innovation Research (SBIR) Phase I project will include the following specific aims:
1. Create the components required to define an accurate and comprehensive clinical phenotype.
This includes: (i) extracting problem, medication, procedure, and lab features from clinical data
using natural language processing (NLP) and ontologic mapping, (ii) building a large knowledge
database of associated clinical conditions, and (iii) assessing extracted features against the
knowledge database to accurately distinguish symptoms from diseases and surface relevant active
diseases in a candidate problem list.
2. Validate the clinical phenotyping components using de-identified longitudinal clinical data for
10,000 patients.
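
As a rough illustration of aim 1(iii) and of the recall measure that validation in aim 2 could use, the sketch below checks already-extracted features against a tiny knowledge base. Everything in it is an assumption made for illustration: the KNOWLEDGE_BASE contents, the function names, and the rule that a symptom is attributed to any active disease known to cause it. It is not the project's actual method, and it omits NLP extraction and ontologic mapping entirely.

```python
# Hypothetical toy knowledge base: disease -> symptoms it can explain.
# A real knowledge database would be far larger and ontology-coded.
KNOWLEDGE_BASE = {
    "heart failure": {"cough", "dyspnea", "edema"},
    "pneumonia": {"cough", "fever"},
}


def candidate_problem_list(features):
    """Split extracted features into diseases, explained symptoms,
    and unexplained findings that may be primary conditions."""
    found = set(features)
    diseases = sorted(found & set(KNOWLEDGE_BASE))

    # Map each observed symptom to the active diseases that could explain it.
    explained = {}
    for disease in diseases:
        for symptom in KNOWLEDGE_BASE[disease] & found:
            explained.setdefault(symptom, []).append(disease)

    # Anything left over is neither a known disease nor explained by one.
    unexplained = sorted(found - set(diseases) - set(explained))
    return {
        "diseases": diseases,
        "symptoms": explained,
        "unexplained": unexplained,
    }


def recall(predicted, reference):
    """Fraction of reference conditions that appear in the predicted set."""
    reference = set(reference)
    if not reference:
        raise ValueError("reference set must not be empty")
    return len(reference & set(predicted)) / len(reference)
```

Under these assumptions, a cough recorded alongside heart failure is attributed to heart failure, while a cough recorded alone stays in the unexplained group as a possible primary condition. The recall function reflects the completeness problem described above: a problem list naming only one of two true conditions scores 0.5.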
The goal, dependent on Phase I success, is to create an automated, accurate, and robust clinical
phenotyping engine to enable personalized patient care, population health management, and value-
based payment models.