C ABSTRACT
Myc Transcription Factor Inhibitor Design: Integrating Atomic and Mesoscale with Semi-Supervised
Generative Deep Learning Models
Inhibition of master regulators such as Myc has attracted considerable interest because their removal can reverse
the oncogenic state. Adding to the mystique is the technical challenge of targeting a protein that possesses
large regions of disorder. Though widely considered “undruggable”, the library of hits that disrupt Myc function
continuously grows. The chemical features of a hit are difficult to deduce beyond high molecular weight,
aromaticity, rigidity, and hydrophobicity. Understanding the more specific features of a protein-protein interaction
(PPI) inhibitor is considerably more difficult. To sidestep this question, machine learning methods
have been applied to expand the library of experimentally determined hits in hopes of finding an improved inhibitor
nearby in chemical space. Recently, the natural application of generative deep learning techniques to this
problem has been reported. This proposal describes a protocol for a semi-supervised expansion of small molecules
which inhibit various reactions in the Myc transactivation pathway. The PPI inhibitors from three publicly available
databases make up the training set (n=9516) while the known Myc inhibitors are the test set (n=100). In order to
surpass the effectiveness of the test set, all known Myc inhibitors are removed from the training set. A set of
latent variables sufficient to recreate the training set is then solved for. These variables represent the general
structural properties of PPI inhibitors, which may be associated with activities at various binding sites. The efficient
calculation of activities is crucial to obtaining good performance. Therefore, a well-tempered ensemble of target
configurations is pre-calculated at all-atom resolution. Additionally, to incorporate the population-level
behavior of multiple Myc molecules into inhibitor design, mesoscale coarse-grain simulations are performed in
various solvents which drive liquid-liquid phase separation. To identify interactions which correlate with phase
response, various points in coarse-grain phase space are converted to all-atom resolution, further refined, and
converted into contact maps. When evaluating a new lead, ensemble-based docking calculations are used, which
compute an average of averages: the scores of a ligand are averaged over its poses for each target conformation,
and then over conformations randomly drawn from the ensembles. Reinforcement learning is applied to significantly
reduce the time spent docking batches of
leads while maintaining confidence in the result. Once new molecules are generated, these new leads are also
optimized using absolute and relative binding free energy methods. Ultimately, this study will test the limits of
generative models in integrating data across multiple scales and in developing potent inhibitors of
intrinsically disordered proteins.
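
The "average of averages" used in the ensemble-based docking step can be illustrated with a minimal sketch. It assumes one plausible reading of the abstract: for each randomly drawn target conformation, the ligand's score is averaged over its poses, and those per-conformation means are then averaged. The function name `ensemble_docking_score`, the callable `score_pose`, and the parameters `n_draws` and `seed` are hypothetical and introduced only for illustration; no particular docking program or scoring function is implied.

```python
import random
from statistics import mean


def ensemble_docking_score(score_pose, ligand_poses, ensembles, n_draws=10, seed=0):
    """Average of averages over poses and randomly drawn target conformations.

    score_pose   : hypothetical callable (pose, conformation) -> float score
    ligand_poses : candidate poses of a single ligand
    ensembles    : list of ensembles, each a list of target conformations
    n_draws      : number of conformations drawn at random (assumed parameter)
    """
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    per_conformation_means = []
    for _ in range(n_draws):
        # Draw an ensemble at random, then a conformation from that ensemble.
        ensemble = rng.choice(ensembles)
        conformation = rng.choice(ensemble)
        # Inner average: mean score of the ligand over all of its poses.
        pose_mean = mean(score_pose(pose, conformation) for pose in ligand_poses)
        per_conformation_means.append(pose_mean)
    # Outer average: mean over the drawn conformations.
    return mean(per_conformation_means)


if __name__ == "__main__":
    # Toy stand-ins: poses and conformations are plain numbers, and the
    # "score" is their sum. Real inputs would be structures and a docking score.
    toy_score = lambda pose, conformation: pose + conformation
    print(ensemble_docking_score(toy_score, [1.0, 2.0, 3.0], [[10.0, 12.0], [11.0]]))
```

Drawing conformations with replacement keeps the cost fixed at `n_draws` times the number of poses, regardless of ensemble size; whether the proposal weights ensembles or conformations differently is not stated in the abstract.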