We propose to build a knowledge provider that will seek out, integrate and provide AI-ready,
BioLink-compatible models via high-performance text-mining of the biomedical literature.
Problems with Translator’s current mining of the biomedical literature that we intend to
solve include: (1) weaknesses in framework extensibility and benchmarking that make
integrating and validating new text-mining approaches difficult; (2) problematic licensing of
software, terminologies and other resources that do not adequately support FAIR (and TLC)
best practices; (3) processing only PubMed titles and abstracts, not full-text publications; (4)
Translator’s use of older NLP technology with relatively poor performance; (5) lack of a
mechanism for community feedback regarding errors and other problems; (6) lack of continuous
updates to add knowledge from new publications; (7) output knowledge representation that is
simplistic and vague, failing to reflect the richness of what is expressed in scientific documents.
Plan for implementation: Our team has a long history of productive NLP research,
successful open source software projects, effective benchmarking and broad community
engagement. We will build on the results of NLM-funded work in information extraction, our
gold-standard Colorado Richly Annotated Full Text (CRAFT) corpus, a recent BioNLP Open
Shared Task (BioNLP-OST) that we organized, and recent advances in state-of-the-art NLP.
For Segment 1, we will: (1) Demonstrate BioStacks, an extensible, cloud-based text-mining
framework that produces knowledge graphs grounded in the Open Biomedical Ontologies
(OBOs). This BioStacks demo will include state-of-the-art tools for OBO concept recognition
across multiple ontologies, semantic relationship prediction, and structural analysis. All
generated assertions will have provenance metadata linking the
assertion to a particular text span in a document specified by PMCID. (2) Demonstrate CRAFTST,
a cloud-based text-mining evaluation system that evaluates the performance of text-mining
systems against the CRAFT gold standard. (3) Demonstrate an adaptive machine learning
process illustrating how to efficiently create tools to extract BioLink association types.
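To illustrate the provenance requirement in item (1), an assertion might carry a record like the one sketched below. This is a minimal sketch under stated assumptions, not the BioStacks data model: the class and field names, the `EX:` identifiers, the example sentence and the PMCID are all hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextSpan:
    """Character offsets into one document identified by PMCID (hypothetical layout)."""
    pmcid: str
    start: int  # inclusive character offset
    end: int    # exclusive character offset

    def __post_init__(self):
        if not self.pmcid.startswith("PMC"):
            raise ValueError(f"not a PMCID: {self.pmcid!r}")
        if not 0 <= self.start < self.end:
            raise ValueError("span must satisfy 0 <= start < end")


@dataclass(frozen=True)
class Assertion:
    """A subject-predicate-object statement plus the text span that supports it."""
    subject: str    # ontology identifier (placeholder values below)
    predicate: str
    object: str
    evidence: TextSpan

    def supporting_text(self, document_text: str) -> str:
        """Return the exact text the assertion was extracted from."""
        return document_text[self.evidence.start:self.evidence.end]


# Invented document text and identifiers, for illustration only.
doc = "We found that drug X activates protein Y in vitro."
phrase = "drug X activates protein Y"
start = doc.index(phrase)
span = TextSpan(pmcid="PMC0000000", start=start, end=start + len(phrase))
assertion = Assertion("EX:drugX", "EX:activates", "EX:proteinY", span)
```

Storing offsets rather than a copy of the sentence keeps the record small and lets a reviewer recover the supporting text from the source document with `assertion.supporting_text(doc)`.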
For Segment 2, we propose to extend the text-mining and evaluation frameworks to align
with BioLink and the Translator community, improve text-mining quality and expand the
collection of source documents mined. Specifically, we propose to target 10 long-term
milestones: (1) Align CRAFT to BioLink. (2) Develop new tools for extracting associations from
text. (3) Develop and manage a community engagement process on text-mining for Translator.
(4) Extend benchmarking. (5) Improve recall. (6) Improve precision. (7) Improve computational
efficiency. (8) Expand BioStacks to include all available full-text biomedical journal articles. (9)
Expand document collections to include patents and regulatory filings. (10) Develop a scientist-based
movement to improve document access for text-mining from non-open publishers.
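Milestones (4) to (6) presuppose a way of scoring system output against gold-standard annotations. A minimal sketch of exact-match scoring follows. It is not the CRAFTST or BioNLP-OST scoring procedure, which this summary does not describe; the function name, the annotation tuple layout and every identifier are invented for illustration.

```python
def score(predicted: set, gold: set) -> dict:
    """Exact-match precision, recall and F1 of predicted annotations against gold."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Each annotation: (document id, start offset, end offset, concept id). All invented.
gold = {
    ("PMC0000000", 0, 4, "EX:1"),
    ("PMC0000000", 10, 17, "EX:2"),
    ("PMC0000000", 30, 35, "EX:3"),
    ("PMC0000000", 40, 48, "EX:4"),
}
predicted = {
    ("PMC0000000", 0, 4, "EX:1"),     # matches gold
    ("PMC0000000", 10, 17, "EX:2"),   # matches gold
    ("PMC0000000", 30, 36, "EX:3"),   # wrong end offset, so no exact match
}
result = score(predicted, gold)
```

Under exact matching, an annotation with the right concept but a boundary that is off by one character counts as both a false positive and a false negative, which is one reason benchmark designers sometimes prefer more forgiving measures.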
The types of questions the resulting knowledge graph can be used to address are
extremely broad, as it is generated by mining a large part of the biomedical literature.
Questions that can be answered include those about specific assertions (e.g. is this drug an
agonist-activator of this protein?), general relations (are these two proteins often mentioned
together?), and documents (which publications mention this gene, mutation and drug?).
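The three kinds of question above can be illustrated over a toy set of extracted assertions. This is a sketch under assumptions: the tuple layout, the `EX:` identifiers, the PMCIDs and the helper function names are all invented, and a real knowledge graph would be queried through a service interface rather than an in-memory list.

```python
from collections import Counter
from itertools import combinations

# Each entry: (subject, predicate, object, source PMCID). All values are invented.
ASSERTIONS = [
    ("EX:drugA", "EX:agonist_of", "EX:protein1", "PMC0000001"),
    ("EX:protein1", "EX:interacts_with", "EX:protein2", "PMC0000001"),
    ("EX:protein1", "EX:interacts_with", "EX:protein2", "PMC0000002"),
    ("EX:drugA", "EX:treats", "EX:disease1", "PMC0000002"),
]


def is_asserted(subject, predicate, obj):
    """Specific assertion: is this exact statement present in any document?"""
    return any((s, p, o) == (subject, predicate, obj) for s, p, o, _ in ASSERTIONS)


def mentions_by_document():
    """Map each PMCID to the set of entities appearing in its assertions."""
    docs = {}
    for s, _, o, pmcid in ASSERTIONS:
        docs.setdefault(pmcid, set()).update((s, o))
    return docs


def co_mention_counts():
    """General relation: in how many documents does each entity pair co-occur?"""
    counts = Counter()
    for entities in mentions_by_document().values():
        counts.update(combinations(sorted(entities), 2))
    return counts


def documents_mentioning(*entities):
    """Document question: which documents mention all of the given entities?"""
    wanted = set(entities)
    return sorted(p for p, found in mentions_by_document().items() if wanted <= found)
```

For example, `is_asserted("EX:drugA", "EX:agonist_of", "EX:protein1")` answers a specific-assertion question, `co_mention_counts()[("EX:protein1", "EX:protein2")]` a co-mention question, and `documents_mentioning("EX:drugA", "EX:disease1")` a document question.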
Integration: We are long-time contributors to the open-science community and have
longstanding collaborations with existing awardees; we were participants in the NIH Data
Commons Pilot. We propose to align the output of text-mining tools to the BioLink model via
OBO terms. We propose to implement our frameworks in NIH Cloud Computing environments.
We propose to adopt the CD2H Contributor Attribution Model to foreground community
contributions. We plan to coordinate with the NLM’s nascent benchmarking activities and the
SmartAPI effort to build Translator standard interfaces.
Challenges and gaps: High-performance mining of rich, contextualized knowledge from the
literature remains a difficult task, and is unlikely to be solved in the next five years. Many
important publications remain inaccessible to text-mining due to restrictive licensing.