Current-generation molecular simulation models are insufficiently accurate, and current-generation tools for building
those models are limited, not automated, and based on aging infrastructure.
Our original R01, “Open Data-driven Infrastructure for Building Biomolecular Force Fields for Predictive Biophysics
and Drug Design,” aims to solve these problems, producing a modern infrastructure for building, applying, and
improving accurate molecular mechanics force fields. As part of our NIH-funded project, we have collaborated
closely with the Molecular Sciences Software Institute (MolSSI) to use the QCArchive ecosystem to generate
and continuously expand very large quantum chemical datasets relevant to biomolecular systems
on a variety of supercomputing resources. QCArchive now contains over 42M quantum chemical calculations
for over 39M molecules, and has become widely used, with over 1.79M accesses per month.
Large quantum chemical datasets relevant to biomolecular systems are highly valuable to the AI/ML
community. Data is the key element needed both for fundamental research into ML architectures and for constructing
predictive models for downstream use. Unfortunately, quantum chemical datasets are very expensive to
generate, limiting in-house generation of large, useful datasets needed to drive AI/ML research to a few large
companies and researchers with access to sufficient computing resources. While AI/ML quantum chemical
methods have shown immense promise for biomolecular systems, the limited access to large, curated
datasets has greatly hindered researchers from making rapid progress in this area.
We aim to bridge this gap by working closely with MolSSI QCArchive developers to address robustness,
scalability, and data delivery challenges to meet the needs of the biomolecular AI/ML community requiring access
to large quantum chemistry datasets (Aim 1). Additional software developers will enable improvements to the
QCArchive infrastructure to meet the rapidly growing demands of the AI/ML community. As QCArchive is primarily
maintained by a single MolSSI Software Scientist, additional developers are necessary for the AI/ML
community to take full advantage of the wealth of data generated directly by our NIH-funded project, as well as the
data now rapidly populating QCArchive through the tools our project has engineered for distributed, fault-tolerant
quantum chemistry. We will additionally develop interfaces and dashboards to enable
facile discovery, retrieval, and import of quantum chemical datasets within popular machine learning frameworks
(Aim 2). To ensure our tools are specifically useful for the most promising AI/ML applications, we will collaborate
directly with AI researchers in the OpenMM, TorchMD, and SchNetPack communities actively developing and
deploying quantum machine learning (QML) potentials for biomolecular simulation, with the goal of producing
generally useful tools suitable for the wider community yet capable of driving these high-priority applications.