Computational Chemistry Highlights: Applying statistical modeling strategies to sparse datasets in synthetic chemistry

Wednesday, January 29, 2025

Brittany C. Haas, Dipannita Kalyani, Matthew S. Sigman (2025)
Highlighted by Jan Jensen

A few weeks ago I listened to an online talk by Matt Sigman and one thing that really surprised me was the remarkable success he had with very, very simple decision trees (sometimes just with a single node!). I wanted to learn more about it and luckily for me he has just published this excellent perspective.

First of all, these methods are used out of necessity because the data sets are relatively small (typically fewer than 1000 data points and often fewer than 100). So why do they work so well? Each application is on one particular organometallic catalyst and reaction type, and the experimental data usually comes from the same lab. The descriptors are usually obtained by DFT calculations and carry a lot of high-quality chemical information. In fact, the authors make the point that if the approach fails they look for new descriptors rather than new ML methods. Indeed, you can view the single-node decision tree as an automated way of finding the single best descriptor in a collection.
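To make the "single best descriptor" view concrete, here is a minimal sketch of a single-node regression tree (a decision stump) written from scratch. It is not the authors' code: the descriptor names and all the numbers are made-up toy values, and the function name `best_stump` is just something I chose for illustration. The stump tries every descriptor and every midpoint threshold and keeps the split that minimises the squared error of the two group means.

```python
from typing import Dict, List, Tuple


def sse(values: List[float]) -> float:
    """Sum of squared deviations from the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_stump(
    descriptors: Dict[str, List[float]], y: List[float]
) -> Tuple[str, float, float]:
    """Return (descriptor name, threshold, error) of the best single split.

    Each candidate threshold is the midpoint between two adjacent distinct
    descriptor values; the error is the summed SSE of the two sides.
    """
    best = None
    for name, x in descriptors.items():
        levels = sorted(set(x))
        for lo, hi in zip(levels, levels[1:]):
            threshold = (lo + hi) / 2
            left = [yi for xi, yi in zip(x, y) if xi <= threshold]
            right = [yi for xi, yi in zip(x, y) if xi > threshold]
            error = sse(left) + sse(right)
            if best is None or error < best[2]:
                best = (name, threshold, error)
    if best is None:
        raise ValueError("need at least one descriptor with two distinct values")
    return best


# Toy data (invented for illustration): two hypothetical DFT-derived
# descriptors for six ligands, and a made-up yield (%) for each.
toy_descriptors = {
    "percent_buried_volume": [28.0, 30.5, 31.0, 35.5, 36.0, 38.5],
    "homo_energy_ev": [-6.1, -5.4, -6.0, -5.5, -6.2, -5.6],
}
toy_yields = [12.0, 15.0, 10.0, 82.0, 88.0, 85.0]

name, threshold, error = best_stump(toy_descriptors, toy_yields)
print(f"best descriptor: {name}, split at {threshold}, SSE = {error:.2f}")
```

In this toy set the steric descriptor separates low-yield from high-yield ligands cleanly, so the stump picks it. If no descriptor gives a clean split, the stump cannot do better with a cleverer algorithm, which is the point about looking for new descriptors.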

You could of course ask: if you are doing DFT calculations, why not simply compute the barriers of interest rather than using ML? The difficulty is that problems like yield optimisation translate to very small changes (on the order of 0.5-2 kcal/mol) in barrier height, and even CCSD(T) is simply not up to the task for systems of this size, where conformational sampling/Boltzmann averaging, explicit solvent effects, and anharmonic effects become important.
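To get a feel for why such small barrier changes matter, here is a rough back-of-the-envelope sketch. It assumes simple transition state theory with identical prefactors for the two competing pathways, so that the rate ratio is exp(ΔΔG‡/RT), and a temperature of 298.15 K; both are my assumptions rather than anything taken from the paper.

```python
import math

# Gas constant in kcal/(mol K)
R_KCAL = 1.987204e-3


def rate_ratio(ddg_kcal: float, temperature: float = 298.15) -> float:
    """Ratio of two rate constants whose barriers differ by ddg_kcal.

    Assumes transition state theory with equal prefactors, so that
    k_fast / k_slow = exp(ddG / RT).
    """
    return math.exp(ddg_kcal / (R_KCAL * temperature))


for ddg in (0.5, 1.0, 2.0):
    print(f"ddG = {ddg:.1f} kcal/mol -> rate ratio of about {rate_ratio(ddg):.1f}")
```

Under these assumptions a 0.5 kcal/mol difference is already a bit more than a factor of 2 in rate, and 2 kcal/mol is close to a factor of 30, so an error of 1 kcal/mol in a computed barrier is enough to wash out the effect one is trying to predict.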

Could this approach be applied to other problems, like drug discovery? The application would probably have to be something like IC50 prediction for lead optimisation where the molecules share a common core. One main difference to organometallic catalysis is that the activity of the catalyst is a function of a relatively localised molecular region compared to ligand-protein binding, which has contributions from most of the molecule. It thus seems difficult to find one or a few descriptors that capture the binding, and that can be computed reliably by DFT calculations - especially if no experimental protein-ligand structures are available.

However, this paper suggests that it may be better to focus on developing such descriptors instead of new ML methods.

This work is licensed under a Creative Commons Attribution 4.0 International License.