Monday, March 31, 2025

Designing Target-specific Data Sets for Regioselectivity Predictions on Complex Substrates

Jules Schleinitz, Alba Carretero-Cerdán, Anjali Gurajapu, Yonatan Harnik, Gina Lee, Amitesh Pandey, Anat Milo, and Sarah Reisman (2025)
Highlighted by Jan Jensen



Have you ever looked at a poorly performing ML model and thought: "Hmm, maybe I should make my training set smaller"? Me neither. 

Well, this paper shows an example where that actually works. Specifically, the authors show cases where regioselectivity predictions for certain molecules improve when only a subset of the available data is used for training. Starting from a very small training set, the model predicts the wrong reaction site in the molecule; as more data is added, it often arrives at the right prediction eventually. However, if you keep adding data, the model starts making the wrong prediction again! In other words, they get a much better classification model if they make a bespoke training set for each molecule.
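The paper's actual data-selection procedure isn't reproduced here, but one simple way to build a bespoke training set per molecule is to keep only the k pool entries closest to the query in some descriptor space. A minimal sketch, with entirely hypothetical descriptor vectors and site labels:

```python
import math

def bespoke_training_set(query, pool, k):
    """Return the k pool entries whose descriptor vectors lie closest
    (Euclidean distance) to the query molecule's descriptors."""
    ranked = sorted(pool, key=lambda entry: math.dist(query, entry[0]))
    return ranked[:k]

# Toy pool: (descriptor vector, reactive-site label) pairs -- made up for illustration.
pool = [
    ([0.1, 1.2], "C3"),
    ([0.2, 1.1], "C3"),
    ([2.5, 0.3], "C5"),
    ([2.4, 0.4], "C5"),
    ([1.3, 0.8], "C3"),
]
subset = bespoke_training_set([0.15, 1.15], pool, k=2)
print([label for _, label in subset])  # the two most similar molecules: ['C3', 'C3']
```

A model trained only on `subset` would then make the prediction for this one query molecule; whether similarity in descriptor space is the right selection criterion is exactly the open question 2 below.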

This raises two important questions: 1) Does this apply to all datasets and properties? and 2) How do you figure out which data points to include in your training set for a particular molecule when you don't know the right answer?

If the answer to the first question is "no" (and it probably is), then how do we figure out when this is a good strategy, other than by trial and error? I suspect that predictions of local properties (such as the reactivity of an atom) are more likely to benefit from bespoke training sets than global properties such as solubility. But that is just a guess.

Another guess is that this will apply mostly to small, inhomogeneous datasets. If so, we could easily generate bespoke models for each individual prediction on the fly, if that would lead to better predictions. But we need to figure out the answer to question 2 first.

I also think that if we can understand how additional data can hurt a model's performance, it would give us some valuable insight into how ML models learn.



This work is licensed under a Creative Commons Attribution 4.0 International License.



Friday, February 28, 2025

GOAT: A Global Optimization Algorithm for Molecules and Atomic Clusters

Bernardo de Souza (2025)
Highlighted by Jan Jensen


If you want to predict accurate reaction energies and barrier heights of typical organic molecules, then you are spending a significant portion of CPU time on the conformational search. While generating a large number of random starting structures often works OK for smaller molecules (with fewer than, say, 15 rotatable bonds), it fails for larger molecules, where the odds of randomly generating the global minimum quickly approach zero. You thus need methods that focus the search around low-energy regions of the PES.

In essence, the algorithm walks uphill in some random direction, detects when a conformational barrier has been crossed, minimises the energy, and decides whether a new conformer has been found. New conformers are then added to the ensemble using a Monte Carlo criterion with simulated annealing. The process is repeated until no new low-energy conformers are found.
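This is not the actual GOAT implementation, but the overall loop can be illustrated on a toy 1D double-well "PES", compressing the uphill walk into a single random displacement followed by local minimisation and a Metropolis acceptance test with a decreasing temperature:

```python
import math, random

def energy(x):
    # Toy double-well surface standing in for a conformational PES;
    # the +0.3*x term makes the well near x = -1 the global minimum.
    return (x**2 - 1)**2 + 0.3 * x

def minimise(x, step=0.01, iters=500):
    # Crude steepest-descent local optimisation via a numerical gradient.
    for _ in range(iters):
        grad = (energy(x + 1e-6) - energy(x - 1e-6)) / 2e-6
        x -= step * grad
    return x

def conformer_search(x0, n_steps=200, seed=1):
    random.seed(seed)
    ensemble = [minimise(x0)]
    current = ensemble[0]
    temp = 2.0                                                  # annealing temperature
    for _ in range(n_steps):
        trial = minimise(current + random.uniform(-1.5, 1.5))   # hop over a barrier, relax
        dE = energy(trial) - energy(current)
        if dE < 0 or random.random() < math.exp(-dE / temp):    # Metropolis criterion
            current = trial
            if all(abs(trial - c) > 1e-3 for c in ensemble):
                ensemble.append(trial)                          # record a new conformer
        temp *= 0.98                                            # simulated annealing
    return min(ensemble, key=energy)

best = conformer_search(x0=1.0)
print(round(best, 2))  # lands in the global-minimum well near x = -1
```

The real algorithm of course works in the full-dimensional coordinate space, uses proper uphill walks to detect barrier crossings, and a more sophisticated stopping criterion; the sketch only captures the hop/minimise/accept structure described above.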

GOAT performs better than or very similarly to CREST for all but one of the organic molecules tested. For organometallic complexes GOAT is better than or similar to CREST, except for three cases where CREST fails in some way. For small molecules GOAT is a bit slower than CREST, but for large molecules GOAT is usually considerably faster.

GOAT is thus a valuable addition to the computational chemistry toolbox.


This work is licensed under a Creative Commons Attribution 4.0 International License.



Wednesday, January 29, 2025

Applying statistical modeling strategies to sparse datasets in synthetic chemistry

Brittany C. Haas, Dipannita Kalyani, Matthew S. Sigman (2025)
Highlighted by Jan Jensen



A few weeks ago I listened to an online talk by Matt Sigman and one thing that really surprised me was the remarkable success he had with very, very simple decision trees (sometimes just with a single node!). I wanted to learn more about it and luckily for me he has just published this excellent perspective.

First of all, these methods are used out of necessity because the data sets are relatively small (typically <1000 and often <100). So why do they work so well? Each application is on one particular organometallic catalyst and reaction type, and the experimental data usually comes from the same lab. The descriptors are usually obtained by DFT calculations and carry a lot of high-quality chemical information. In fact, the authors make the point that if the approach fails they look for new descriptors rather than new ML methods. Indeed, you can view the single-node decision tree as an automated way of finding the single best descriptor in a collection.
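That last point can be made concrete: fitting a one-node tree amounts to an exhaustive search over every descriptor and every split threshold for the one split that classifies best. A minimal pure-Python sketch (the descriptor names and data are invented for illustration):

```python
def best_stump(X, y):
    """Fit a single-node decision tree by brute force: for every descriptor
    and every candidate threshold, score the split and keep the best.
    Returns (descriptor index, threshold, accuracy)."""
    n = len(X)
    best = (None, None, 0.0)
    for j in range(len(X[0])):
        values = sorted({row[j] for row in X})
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2
            pred = [row[j] > thr for row in X]
            hits = sum(p == t for p, t in zip(pred, y))
            acc = max(hits, n - hits) / n      # allow either side to be "True"
            if acc > best[2]:
                best = (j, thr, acc)
    return best

# Toy data: descriptor 1 (say, a computed buried volume) separates the two
# outcome classes cleanly; descriptor 0 is noise.
X = [[0.9, 1.1], [0.2, 1.3], [0.5, 2.8], [0.7, 3.1], [0.1, 2.9]]
y = [False, False, True, True, True]
j, thr, acc = best_stump(X, y)
print(j, acc)  # picks descriptor 1 with accuracy 1.0
```

The selected descriptor index and threshold are then directly interpretable, which is presumably a large part of the appeal for mechanistic work.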

You could of course ask: if you are doing DFT calculations anyway, why not simply compute the barriers of interest rather than using ML? The problem is that problems like yield optimisation translate to very small changes (on the order of 0.5 - 2 kcal/mol) in barrier height, and even CCSD(T) is simply not up to the task for systems of this size, where conformational sampling/Boltzmann averaging, explicit solvent effects, and anharmonic effects become important.
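To see why the relevant energy window is so small, the standard transition-state-theory relation ΔΔG‡ = RT ln(ratio) converts a product ratio between two competing pathways into a free-energy difference:

```python
import math

R = 1.987e-3  # gas constant in kcal/(mol K)

def ddg_from_ratio(ratio, T=298.15):
    """Free-energy difference between two competing transition states
    implied by a product (or rate) ratio: ddG = RT ln(ratio)."""
    return R * T * math.log(ratio)

# Even a 90:10 product ratio at room temperature corresponds to only
# about 1.3 kcal/mol -- well below routine DFT (or even CCSD(T)) accuracy
# for systems of this size.
print(round(ddg_from_ratio(90 / 10), 2))  # 1.3
```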

Could this approach be applied to other problems, like drug discovery? The application would probably have to be something like IC50 prediction for lead optimisation, where the molecules share a common core. One main difference from organometallic catalysis is that the activity of the catalyst is a function of a relatively localised molecular region, whereas ligand-protein binding has contributions from most of the molecule. It thus seems difficult to find one or a few descriptors that capture the binding and that can be computed reliably by DFT calculations - especially if no experimental protein-ligand structures are available.

However, this paper suggests that it may be better to focus on developing such descriptors instead of new ML methods.



This work is licensed under a Creative Commons Attribution 4.0 International License.