Friday, September 30, 2022

Active Learning for Small Molecule pKa Regression; a Long Way To Go

Paul G. Francoeur, Daniel Peñaherrera, and David R. Koes (2022)
Highlighted by Jan Jensen

Parts of Figures 5 and 6. (c) The authors 2022. Reproduced under the CC-BY licence

One approach to active learning is to grow the training set with the molecules for which the current model has the highest uncertainties. However, according to this study, this approach does not seem to work for small-molecule pKa prediction: active learning and random selection give the same results (within the relatively high standard deviations) for three different uncertainty estimates.
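
The selection step being compared can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, the numbers are made up, and the ensemble standard deviation is only one common uncertainty estimate, assumed here for concreteness.

```python
import random
import statistics


def ensemble_uncertainty(predictions_per_model):
    """Per-molecule uncertainty as the standard deviation across ensemble members.

    predictions_per_model[m][i] is the prediction of model m for pool molecule i.
    """
    n_items = len(predictions_per_model[0])
    return [
        statistics.pstdev([model[i] for model in predictions_per_model])
        for i in range(n_items)
    ]


def select_by_uncertainty(uncertainties, batch_size):
    """Indices of the batch_size pool molecules with the highest uncertainty."""
    ranked = sorted(
        range(len(uncertainties)), key=lambda i: uncertainties[i], reverse=True
    )
    return ranked[:batch_size]


def select_random(n_pool, batch_size, rng):
    """Baseline: indices of batch_size pool molecules drawn uniformly at random."""
    return rng.sample(range(n_pool), batch_size)


# Made-up pKa predictions from three ensemble members for three pool molecules.
predictions_per_model = [
    [7.0, 4.1, 9.0],
    [7.2, 4.0, 6.0],
    [6.8, 4.2, 12.0],
]
uncertainties = ensemble_uncertainty(predictions_per_model)
picked_active = select_by_uncertainty(uncertainties, batch_size=2)
picked_random = select_random(len(uncertainties), batch_size=2, rng=random.Random(0))
```

In a full loop, the selected molecules would be moved from the pool to the training set, the model retrained, and the test error recorded for both strategies.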

The authors show that there are molecules in the pool that can increase the initial accuracy drastically, but that the uncertainties don't seem to help identify these molecules. The green curve above is obtained by exhaustively training a new model for every molecule in the pool during each step of the active learning loop and selecting the molecule that gives the largest increase in accuracy for the test set. Note that the accuracy decreases towards the end, meaning that including some molecules in the training set diminishes the performance.

The authors offer the following explanation for their observations: "We propose that the reason active learning failed in this pKa prediction task is that all of the molecules are informative."

That's certainly not hard to imagine given the small size of the initial training set (50). It would have been very instructive to see the distribution of uncertainties for the initial models. Does every molecule have roughly the same (high) uncertainty? If so, the uncertainties would indeed not be informative.

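Also, uncertainties only correlate with (random) errors on average, so uncertainty-based selection is more likely to pay off for large batches than for individual molecules. The authors did try adding molecules in batches, but the batch size was only 10.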

It would have been interesting to see the performance if one used the actual error, rather than the uncertainties, to select molecules. That would test the case where uncertainties correlate perfectly with the errors.
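
Such an "oracle" selection is easy to sketch, assuming the true pKa values of the pool molecules are available for the experiment. The function name and the numbers below are hypothetical.

```python
def select_by_true_error(predictions, true_values, batch_size):
    """Indices of the batch_size pool molecules with the largest absolute error.

    This needs the true labels of the pool, so it is only a diagnostic:
    it mimics an uncertainty estimate that ranks molecules exactly as
    their errors do.
    """
    errors = [abs(p - t) for p, t in zip(predictions, true_values)]
    ranked = sorted(range(len(errors)), key=lambda i: errors[i], reverse=True)
    return ranked[:batch_size]


# Made-up predicted and experimental pKa values for four pool molecules.
predicted = [7.0, 4.1, 9.0, 10.5]
experimental = [7.5, 4.0, 6.0, 10.4]

picked_oracle = select_by_true_error(predicted, experimental, batch_size=2)
```

If even this selection did no better than random, the problem would lie with the selection criterion itself rather than with the quality of the uncertainty estimates.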



This work is licensed under a Creative Commons Attribution 4.0 International License.