Project Summary/Abstract
Two new virtual chemistry technologies will be added to the NCATS ASPIRE project as separate modules. The
first module will use cutting-edge (deep) machine learning technology to model and select new chemistries
from the latest structure/activity data taken directly from instruments. The second module
will be a novel informatics system for capturing chemistry-rich data in a semantic template as
machine-readable reactions, which will increase the utility of chemical reactions in electronic lab notebooks and
allow more precise interrogation and automation of reaction analyses (and their corresponding reaction
products).
The deep learning technology in module 1 is based on our new chemically rich vector (CRV) methodology,
which is able to compress information about chemical structures into a vector of 64 numbers with an efficiency
that allows the encoding process to be reversed: not only can a CRV be converted back into its original
structure with high success (>90% exact match), but a modified CRV can be converted into a structure that is
representative of that point in chemical space. CRVs make excellent descriptors for SAR/QSAR iteration
because they pack much more chemical information into a small space than conventional descriptors do,
which streamlines the automation of structure-activity models. The resulting models will
be used to explore the multi-dimensional space either via an interactive visual interface (human-directed) or
via a back-end algorithm that constantly searches for new and better structures (machine-directed). Both the
interactive and automated processes will be connected back into the ASPIRE automation cycle so that the
proposed structures can be synthesized and measured (hypothesis evaluation and iterative optimization).
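The machine-directed search described above can be illustrated with a minimal sketch. It is not the CRV implementation: the scoring function is a hypothetical stand-in for a trained structure-activity model, the hill-climbing strategy is an assumption chosen for brevity, and the step that converts a vector back into a structure is omitted entirely. Only the vector length of 64 comes from the text.

```python
import random

CRV_DIM = 64  # the text describes a CRV as a vector of 64 numbers


def toy_activity_model(crv):
    """Hypothetical stand-in for a trained structure-activity model.

    Scores a vector higher the closer it lies to an arbitrary target point.
    """
    target = [0.5] * CRV_DIM
    return -sum((x - t) ** 2 for x, t in zip(crv, target))


def machine_directed_search(start, score, steps=200, step_size=0.05, seed=0):
    """Simple hill climb: perturb the vector and keep only improvements."""
    rng = random.Random(seed)
    best, best_score = list(start), score(start)
    for _ in range(steps):
        # Propose a nearby point in the 64-dimensional space.
        candidate = [x + rng.gauss(0.0, step_size) for x in best]
        candidate_score = score(candidate)
        if candidate_score > best_score:
            best, best_score = candidate, candidate_score
    return best, best_score


start_crv = [0.0] * CRV_DIM
best_crv, best_score = machine_directed_search(start_crv, toy_activity_model)
```

In a real cycle, each accepted vector would be decoded into a candidate structure and passed on for synthesis and measurement, and the measured activities would be used to retrain the model.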
The second module, machine-readable reactions, draws on our extensive experience developing the
BioHarmony Annotator (formerly BioAssay Express), which uses natural language models to assign semantic
ontology terms to biological assay protocols, turning them from unstructured text into machine-readable data.
Extracting the full content of reactions from protocols and chemical structure diagrams is remarkably difficult
given the unstructured nature of text and the abbreviations, shortcuts and assumptions that go into diagrams. It is
further complicated by the need to connect the materials in the scheme with the reaction text description (e.g.
reagents, solvents, the sequence of steps in the recipe, reaction workup, and product characterization). As an
alternative, we will modularize the CDD stoichiometric sketcher, which will allow us to extract this data. We will
work with NCATS to identify important fields to capture, creating a machine-readable chemical reaction
template.
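To make the idea of a machine-readable reaction template concrete, the sketch below shows one possible shape for such a record. Every class and field name here is hypothetical: the actual fields are to be identified together with NCATS, and this is not the format of the CDD stoichiometric sketcher. The categories (reagents, solvents, recipe steps, workup, product characterization) are those listed in the text.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import List, Optional


@dataclass
class Component:
    """One material in the reaction scheme (hypothetical field names)."""
    name: str
    role: str  # e.g. "reactant", "reagent", "solvent", "product"
    smiles: str = ""
    equivalents: Optional[float] = None


@dataclass
class ReactionRecord:
    """Illustrative template linking scheme materials to the written recipe."""
    components: List[Component] = field(default_factory=list)
    steps: List[str] = field(default_factory=list)
    workup: str = ""
    characterization: List[str] = field(default_factory=list)

    def by_role(self, role):
        """Return every component that plays the given role."""
        return [c for c in self.components if c.role == role]


record = ReactionRecord(
    components=[
        Component("acetic acid", "reactant", "CC(=O)O", 1.0),
        Component("ethanol", "reactant", "CCO", 1.2),
        Component("sulfuric acid", "reagent", "OS(=O)(=O)O", 0.05),
        Component("ethyl acetate", "product", "CCOC(C)=O"),
    ],
    steps=["Combine reactants", "Add acid catalyst", "Heat under reflux"],
    workup="Neutralize, separate the organic layer, dry and distil.",
    characterization=["1H NMR", "GC-MS"],
)

# Serialize to JSON so that other tools can query the record.
serialized = json.dumps(asdict(record), indent=2)
```

Because roles and recipe steps are explicit fields rather than free text, a query such as `record.by_role("solvent")` becomes a simple lookup instead of a text-mining task.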