Improving the Accuracy of Implicit Solvents with a Physics-Guided Neural Network - Project Summary/Abstract
Drug discovery is one of the most challenging tasks in the biological sciences; it takes about 10-15 years and $2 billion on average to discover a new drug. The main goal in drug discovery is to identify drug-like compounds (ligands) capable of modulating specific biological targets (proteins). One key feature of protein-ligand interactions is the binding free energy change, ΔG, that occurs when the ligand attaches to the protein. This physicochemical feature largely dictates how strongly a protein and ligand interact, and understanding it is particularly useful for drug design. Wet-lab experiments measure ΔG accurately, but they are slow, costly, and laborious. Computational simulations, on the other hand, estimate ΔG significantly faster and shed light on the binding mechanisms of structures that would otherwise be difficult to examine. The implicit solvent framework, which treats the solvent as a continuum with the dielectric and non-polar properties of water, offers much more efficient estimation of ΔG than other computational methodologies, such as alchemical free energy methods. Despite noticeable progress in implicit solvent modeling, serious concerns about its accuracy remain, and they stem from the underlying physical approximations. This research will employ modern machine learning techniques to bridge the accuracy gap between a physics-based implicit solvent model and experimental references in ΔG calculations. In particular, experimental data will be integrated into a generalized Born (GB) implicit solvent model so that, while adhering to the physical model, new structural features can improve its accuracy.
In addition to model accuracy, it is essential to retain interpretability (which reflects model simplicity) and transferability (which ensures consistent performance across different datasets). To this end, a novel multi-objective loss function will be introduced that takes “accuracy”, “interpretability”, and “transferability” into consideration. Standard protein-ligand databases, benchmarks, and datasets will be used to design the proposed hybrid model, including host-guest systems, SAMPL challenge benchmarks, PDBbind, and BindingDB. While some of these sources contain clean data, many require further post-processing before the GB model can be run on them. Careful data preparation will be performed by following standard protocols and using popular web services. The modular design of the proposed physics-data model will allow for testing various flavors of implicit solvent (the physics-based model) and modifications to the proposed Graph Convolutional Network (the data-driven model). This flexibility of the hybrid model facilitates new interdisciplinary research that connects classical physics-based and modern data-driven approaches. The final source code and parameterized datasets will be made freely available to the public. They could be incorporated into the high-throughput virtual screening of candidate drugs in the early stages of drug discovery. The outcome of this research will benefit the biomolecular modeling community by providing an approach to building novel, accurate, and efficient computational models for studying protein-ligand interactions.
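
As an illustration only, a multi-objective loss of the kind described above could be sketched as a weighted sum of three terms. The summary does not specify the functional form, so everything here is an assumption: the function name `multi_objective_loss`, the term definitions (mean squared error against experimental ΔG for accuracy, an L1 penalty on the data-driven parameters as a proxy for simplicity, and the variance of per-dataset errors as a proxy for transferability), and the weights `alpha`, `beta`, and `gamma` are all hypothetical.

```python
from statistics import mean, pvariance


def multi_objective_loss(pred, exp, groups, params,
                         alpha=1.0, beta=0.01, gamma=0.1):
    """Hypothetical weighted sum of accuracy, simplicity, and consistency terms.

    pred, exp : predicted and experimental binding free energies (same units)
    groups    : dataset label for each sample (e.g. "SAMPL", "PDBbind")
    params    : parameters of the data-driven correction model
    """
    sq_err = [(p - e) ** 2 for p, e in zip(pred, exp)]

    # Accuracy (assumed form): mean squared error against experiment.
    accuracy = mean(sq_err)

    # Interpretability (assumed proxy): L1 penalty that favors sparse,
    # simpler corrections to the physics-based model.
    interpretability = sum(abs(w) for w in params)

    # Transferability (assumed proxy): spread of the per-dataset errors,
    # which is small when performance is consistent across datasets.
    per_group = {}
    for g, s in zip(groups, sq_err):
        per_group.setdefault(g, []).append(s)
    group_mse = [mean(v) for v in per_group.values()]
    transferability = pvariance(group_mse) if len(group_mse) > 1 else 0.0

    return alpha * accuracy + beta * interpretability + gamma * transferability
```

In such a sketch, raising `beta` or `gamma` trades some raw accuracy for a simpler correction or for more uniform errors across datasets; the actual balance would have to be chosen empirically.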