Project Summary/Abstract
Drug discovery is one of the most challenging tasks in biological sciences; it takes about 10-15 years and
$2 billion on average to discover a new drug. The main goal in drug discovery is identifying drug-like compounds
(ligands) capable of modulating specific biological targets (proteins). One key feature of protein-ligand
interactions is the binding free energy change, ΔG, that accompanies the binding of the ligand to the protein.
This physicochemical quantity largely determines how strongly a protein and ligand interact and
is particularly useful to understand for drug design. While wet-lab experiments measure ΔG accurately, they
are slow, costly, and laborious. On the other hand, computational simulations enable significantly
faster estimation of ΔG and shed light on the binding mechanisms of structures that would otherwise be
difficult to examine. The implicit solvent framework, which treats the solvent as a continuum
with the dielectric and non-polar properties of water, offers much more efficient estimation of ΔG compared
to other computational methodologies, such as alchemical free energy methods. Despite noticeable progress
in implicit solvent modeling, serious concerns remain about its accuracy, which stem from the underlying physical
approximations. This research will employ modern machine learning techniques to bridge the accuracy gap
between a physics-based implicit solvent model and experimental reference values of ΔG. In
particular, experimental data will be integrated into a generalized Born (GB) implicit solvent model so that
new structural features can improve accuracy while adhering to the physical model. In addition to the model
accuracy, it is essential to retain interpretability (which reflects model simplicity) and transferability (which
ensures consistent performance across different datasets). To this end, a novel multi-objective loss function will be
introduced that takes “accuracy”, “interpretability”, and “transferability” into consideration. Standard protein-ligand
databases, benchmarks, and datasets will be used for designing the proposed hybrid model, including host-guest
systems, SAMPL challenge benchmarks, PDBbind, and BindingDB. While some of these sources contain clean
data, many require further processing before the GB model can be run on them. Careful data preparation will
be performed by following standard protocols and via popular web services. The modular characteristics of the
proposed physics-data model will allow for testing various flavors of implicit solvent (physics-based model) and
modifications to the proposed Graph Convolutional Network (data-driven model). This flexibility of the hybrid
model facilitates new interdisciplinary research between the classical physics-based and the modern data-driven
ends. The final source code and parameterized datasets will be available freely to the public. They could be
incorporated into the high-throughput virtual screening of candidate drugs in the early stages of drug discovery.
The outcome of this research will benefit the biomolecular modeling community by providing an approach to build
novel, accurate, and efficient computational models for studying protein-ligand interactions.