Brian Hie, Bryan D. Bryson, Bonnie Berger (2021)
Highlighted by Jan Jensen
Part of the TOC figure. (c) The authors. Reproduced under the CC-BY-NC-ND license
It is rare to see successful ML studies with training sets of 72 molecules, but this is one such study.
The data set consists of 72 compounds with measured binding affinities to 442 different kinases, i.e. 72 × 442 ≈ 32K data points. The Kds span a range of 0.1 nM to 10 μM. This data is used to train several ML models, some of which include uncertainty estimation and some of which do not. The main finding is that the ML model is better at separating active from inactive compounds for points with low uncertainty. Interestingly, compounds with low uncertainty have extreme Kds.
The models (retrained on the whole dataset) are then used to screen a set of 10,833 purchasable compounds. The top five candidates for each model were purchased and checked against four different kinases (i.e. 20 ligand-kinase pairs per model). For the uncertainty-based ML models the top candidates are molecules with both low Kd and low uncertainty, while for the other models the decision was made solely based on Kd.
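To make the difference between the two selection rules concrete, here is a minimal sketch of one common way to combine a predicted Kd with its uncertainty: rank by mean plus a multiple of the standard deviation, so that confident predictions of tight binding come first. The `Prediction` class, the `beta` weight, and the toy numbers are illustrative assumptions, not the scoring function or data used in the paper.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    compound: str        # hypothetical compound label
    kinase: str          # hypothetical kinase label
    mean_log_kd: float   # predicted log10(Kd / nM); lower means tighter binding
    std_log_kd: float    # predictive standard deviation in the same units


def rank_by_mean(preds, k):
    """Pick the k pairs with the lowest predicted Kd, ignoring uncertainty."""
    return sorted(preds, key=lambda p: p.mean_log_kd)[:k]


def rank_with_uncertainty(preds, k, beta=1.0):
    """Pick the k pairs with the lowest (mean + beta * std).

    This penalises uncertain predictions; beta is an assumed trade-off
    weight, not a value taken from the paper.
    """
    return sorted(preds, key=lambda p: p.mean_log_kd + beta * p.std_log_kd)[:k]


# Toy predictions (made up for illustration only).
preds = [
    Prediction("cpd_A", "kinase_1", mean_log_kd=0.5, std_log_kd=2.0),
    Prediction("cpd_B", "kinase_1", mean_log_kd=1.0, std_log_kd=0.2),
    Prediction("cpd_C", "kinase_1", mean_log_kd=2.5, std_log_kd=0.1),
]

top_mean = rank_by_mean(preds, 1)[0]
top_uncertainty_aware = rank_with_uncertainty(preds, 1)[0]
print(top_mean.compound, top_uncertainty_aware.compound)
```

In this toy case the mean-only rule picks `cpd_A`, whose low predicted Kd comes with a large error bar, while the uncertainty-aware rule prefers `cpd_B`, which is predicted to bind slightly less tightly but with much higher confidence.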
None of the molecules picked solely based on Kd showed a Kd below 10 μM, whereas for the uncertainty-based models 18, 10, and 2 ligand-kinase pairs had Kds lower than 10 μM, 100 nM, and 1 nM, respectively.
Some ML details: The molecules were featurised using a graph convolutional junction tree approach (JTNN-VAE), which was found to work better than fingerprints (data not shown). The uncertainty is predicted using four different approaches: GP regression (GP), GP regression of the MLP error (MLP+GP), Bayesian multilayer perceptron (BMLP) and an ensemble of MLPs that each emits a Gaussian distribution (GMLPE). Nigam et al. recently published a very nice overview of such methods.
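As an illustration of how an ensemble of Gaussian-emitting models can yield a single prediction with an error bar, the sketch below treats the ensemble as an equally weighted mixture of Gaussians and computes the mixture's mean and variance. This is a standard way to aggregate such ensembles; whether the paper's GMLPE combines its members in exactly this way is an assumption, and the function name and numbers are made up for the example.

```python
def combine_gaussian_ensemble(means, variances):
    """Mean and variance of an equally weighted mixture of Gaussians.

    means[i] and variances[i] are the outputs of ensemble member i.
    The mixture variance is the average member variance plus the
    spread of the member means (disagreement between members).
    """
    if not means or len(means) != len(variances):
        raise ValueError("means and variances must be non-empty and equal length")
    n = len(means)
    mu = sum(means) / n
    # E[x^2] of the mixture, averaged over members.
    second_moment = sum(v + m * m for m, v in zip(means, variances)) / n
    # Clamp at zero to guard against tiny negative values from rounding.
    var = max(second_moment - mu * mu, 0.0)
    return mu, var


# Three hypothetical ensemble members predicting log10(Kd / nM).
mu, var = combine_gaussian_ensemble([1.0, 2.0, 3.0], [0.1, 0.1, 0.1])
print(mu, var)
```

Here the members agree on their own noise level (variance 0.1) but disagree on the mean, so most of the combined variance comes from the disagreement term.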
This work is licensed under a Creative Commons Attribution 4.0 International License.