Project Summary
Carbapenem-resistant Enterobacterales (CRE) cause more than 13,000 infections in U.S. inpatients annually,
with mortality rates that can exceed 50%. In hospitals, asymptomatic colonization is emerging as a critical
target for CRE infection prevention. Epidemiologically, early identification of colonized patients can reduce
intra-hospital spread; clinically, colonized inpatients face significantly higher, but potentially modifiable,
risks of CRE infection. Due to diagnostic limitations, however, wide-scale CRE colonization screening
remains impractical for most U.S. hospitals. Prediction models offer alternative strategies for identifying
patients at high risk of colonization and of subsequent infection. However, such models face two methodological
obstacles that limit their wider utility: (1) strong colonization risk factors are “locked” in electronic health record
(EHR) free-text that is unavailable for model-building unless records are reviewed manually; and (2) due to low
CRE prevalence, most statistical models can only evaluate limited numbers of candidate variables.
We propose to exploit state-of-the-art machine learning and natural language processing (NLP) techniques to
improve identification of CRE-colonized and infected patients. We will apply these methods to EHRs from
more than 21,000 patients screened for CRE at the University of Maryland and the Johns Hopkins hospitals. In Aim 1,
we will build and validate NLP algorithms on admission histories to detect pre-admission exposures that are
strong colonization risk factors but poorly captured in structured EHR data fields. NLP is a cutting-edge
computational technique for “unlocking” these types of unstructured data. We will also use text-mining
approaches to identify potential new or local CRE risk factors. In Aims 2 and 3, we will build and validate
models from NLP-derived variables and other EHR data to predict colonization at admission (Aim 2) and
progression to infection (Aim 3) using machine learning algorithms that excel on high-dimensional data.
Taken together, this work will help hospitals identify patients at high risk of CRE colonization and infection
early, when deleterious patient outcomes are still preventable. Because NLP is automated, successful models
could be exported to other hospitals and integrated into EHRs; all algorithms resulting from this work will be
made freely available. This will be the first study to deploy NLP for bacterial carriage screening and the largest
U.S. study to follow CRE-colonized inpatients for infection. As a PhD-trained epidemiologist with a background in
CRE and machine learning who previously practiced FDA law, I am drawn to rigorous, interdisciplinary
approaches and policies for reducing the toll of antibiotic resistance in hospitalized patients. In the short term,
Career Development Award support would allow me to build experience using sophisticated computational
approaches for EHR-based information extraction and predictive modeling. In the long term, the skills I acquire
would position me as a leader in leveraging novel health information technology tools to design and test new
strategies for responding to the threat of antibiotic resistance and other emerging pathogens in U.S. hospitals.