Friday, February 28, 2025

GOAT: A Global Optimization Algorithm for Molecules and Atomic Clusters

Bernardo de Souza (2025)
Highlighted by Jan Jensen


If you want to predict accurate reaction energies and barrier heights for typical organic molecules, then you are spending a significant portion of your CPU time on the conformational search. While generating a large number of random starting structures often works OK for smaller molecules (with fewer than, say, 15 rotatable bonds), it fails for larger molecules, where the odds of randomly generating the global minimum quickly approach zero. You thus need methods that focus the search around low-energy regions of the PES.
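As a rough back-of-the-envelope illustration of why random sampling breaks down (assuming roughly three low-energy rotamers per rotatable bond, a common rule of thumb and not a number from the paper):

```python
# Rough illustration: assuming ~3 low-energy rotamers per rotatable bond,
# the conformer count grows exponentially, so the chance that any single
# random structure happens to be the global minimum shrinks accordingly.
for n_rot in (5, 10, 15, 20, 25):
    n_conformers = 3 ** n_rot
    print(f"{n_rot:2d} rotatable bonds -> ~{n_conformers:.1e} conformers "
          f"(hit probability per random guess ~{1 / n_conformers:.1e})")
```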

In essence, the algorithm walks uphill in some random direction, detects when a conformational barrier has been crossed, minimises the energy, and decides whether a new conformer has been found. New conformers are then included in the ensemble using a Monte Carlo criterion with simulated annealing. The process is repeated until no new low-energy conformers are found.
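A minimal sketch of this kind of uphill-walk/simulated-annealing loop is given below. This is a toy illustration, not the actual GOAT implementation: `energy`, `uphill_step`, and `minimise` are placeholder functions standing in for a real potential-energy surface, a random barrier-crossing displacement, and a local geometry optimiser.

```python
import math
import random

KB = 0.0019872  # Boltzmann constant in kcal/(mol*K); assumes energies in kcal/mol


def conformer_search(x0, energy, uphill_step, minimise,
                     n_steps=500, T_start=3000.0, T_end=300.0):
    """Toy uphill-walk + simulated-annealing conformer search (not GOAT itself).

    `energy`, `uphill_step`, and `minimise` are placeholders for a real
    potential-energy surface, a random uphill displacement that crosses a
    conformational barrier, and a local geometry optimiser, respectively.
    """
    x_current = minimise(x0)
    e_current = energy(x_current)
    ensemble = [(e_current, x_current)]

    for step in range(n_steps):
        # Simple geometric cooling schedule for the annealing temperature.
        T = T_start * (T_end / T_start) ** (step / max(n_steps - 1, 1))

        # Walk uphill in a random direction, then relax into the nearest minimum.
        x_new = minimise(uphill_step(x_current))
        e_new = energy(x_new)

        # Metropolis criterion: downhill moves are always accepted, uphill moves
        # with Boltzmann probability at the current annealing temperature.
        if e_new <= e_current or random.random() < math.exp(-(e_new - e_current) / (KB * T)):
            x_current, e_current = x_new, e_new
            # Crude energy-based duplicate check, for illustration only.
            if all(abs(e_new - e) > 1e-4 for e, _ in ensemble):
                ensemble.append((e_new, x_new))

    return sorted(ensemble, key=lambda pair: pair[0])
```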

GOAT is better than or very similar to CREST for all but one of the organic molecules tested. For organometallic complexes GOAT is better than or similar to CREST, except for three cases where CREST fails in some way. For small molecules GOAT is a bit slower than CREST, but for large molecules GOAT is usually considerably faster.

GOAT is thus a valuable addition to the computational chemistry toolbox.


This work is licensed under a Creative Commons Attribution 4.0 International License.



Wednesday, January 29, 2025

Applying statistical modeling strategies to sparse datasets in synthetic chemistry

Brittany C. Haas, Dipannita Kalyani, Matthew S. Sigman (2025)
Highlighted by Jan Jensen



A few weeks ago I listened to an online talk by Matt Sigman, and one thing that really surprised me was the remarkable success he had with very, very simple decision trees (sometimes with just a single node!). I wanted to learn more about it, and luckily for me he has just published this excellent perspective.

First of all, these methods are used out of necessity because the data sets are relatively small (typically <1000 and often <100 data points). So why do they work so well? Each application focuses on one particular organometallic catalyst and reaction type, and the experimental data usually comes from the same lab. The descriptors are usually obtained from DFT calculations and carry a lot of high-quality chemical information. In fact, the authors make the point that if the approach fails, they look for new descriptors rather than new ML methods. You can view the single-node decision tree as an automated way of finding the single best descriptor in a collection.
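A single-node decision tree (a "decision stump") is trivial to set up with standard tooling. Here is a minimal sketch using scikit-learn, where the descriptor matrix X and the outcome y are random placeholders standing in for DFT-derived descriptors and measured yields or selectivities:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

# Placeholder data: rows are reactions/catalysts, columns are DFT-derived
# descriptors (e.g. a buried volume, an NBO charge, a vibrational frequency).
rng = np.random.default_rng(0)
X = rng.normal(size=(80, 5))
y = 50 + 30 * (X[:, 2] > 0.3) + rng.normal(scale=5, size=80)  # synthetic "yield"

# A single-node tree (max_depth=1) simply finds the one descriptor and
# threshold that best separates high from low outcomes.
stump = DecisionTreeRegressor(max_depth=1).fit(X, y)
print(export_text(stump, feature_names=[f"descriptor_{i}" for i in range(5)]))
```

The printed tree consists of a single split, i.e. the best descriptor and its cutoff value, which is exactly the kind of interpretable rule described in the perspective.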

You could of course ask: if you are doing DFT calculations anyway, why not simply compute the barriers of interest rather than using ML? The problem is that problems like yield optimisation translate to very small changes (on the order of 0.5-2 kcal/mol) in barrier height, and even CCSD(T) is simply not up to the task for systems of this size, where conformational sampling/Boltzmann averaging, explicit solvent effects, and anharmonic effects become important.
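To put those numbers in perspective, a barrier difference of ΔΔG‡ corresponds to a rate (and hence selectivity) ratio of roughly exp(ΔΔG‡/RT), so even sub-kcal/mol errors matter. A quick back-of-the-envelope calculation at room temperature:

```python
import math

R = 0.0019872  # gas constant in kcal/(mol*K)
T = 298.15     # K

# Rate ratio corresponding to a given difference in barrier height.
for ddg in (0.5, 1.0, 2.0):  # kcal/mol
    ratio = math.exp(ddg / (R * T))
    print(f"ddG‡ = {ddg:.1f} kcal/mol -> rate ratio ~ {ratio:.1f}")
```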

Could this approach be applied to other problems, like drug discovery? The application would probably have to be something like IC50 prediction for lead optimisation, where the molecules share a common core. One main difference to organometallic catalysis is that the activity of the catalyst is a function of a relatively localised molecular region, whereas ligand-protein binding has contributions from most of the molecule. It thus seems difficult to find one or a few descriptors that capture the binding and that can be computed reliably by DFT calculations, especially if no experimental protein-ligand structures are available.

However, this paper suggests that it may be better to focus on developing such descriptors instead of new ML methods.



This work is licensed under a Creative Commons Attribution 4.0 International License.