Friday, May 30, 2025

Repurposing quantum chemical descriptor datasets for on-the-fly generation of informative reaction representations: application to hydrogen atom transfer reactions

Javier E. Alfonso-Ramos, Rebecca M. Neeser, Thijs Stuyver (2024)
Highlighted by Jan Jensen



If you have very little data, the single most useful thing you can do is find good descriptors. Sigman, Doyle, and others have shown this very nicely for reactivity predictions of transition-metal-containing catalysts, but there's less systematic work for other types of reactions. 

In this paper, Stuyver and co-workers suggest a descriptor set for barriers of hydrogen atom transfer (HAT) reactions that is based on valence bond (VB) theory. In practice this translates to computing the bond dissociation energies (BDEs) without relaxing the geometry and combining them with the bond dissociation free energies (BDFEs, where ΔBDFE corresponds to ΔGrp). In addition, atomic Mulliken charges, spin densities, and buried volumes are also added. All descriptors are predicted by surrogate models to avoid QM-based calculations.

Using these descriptors they get significantly better barrier predictions than with fingerprint or graph-convolution representations, even using simple models such as linear regression. Even the simple Bell-Evans-Polanyi model (a linear model based solely on ΔGrp) outperforms the models using fingerprints and graph convolution, with an R2 of 0.71 compared to 0.65 for graph convolution. For comparison, the R2s for the VB-based descriptors are 0.80-0.85, depending on the ML model.
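To make the comparison concrete, here is a minimal sketch of what such a Bell-Evans-Polanyi baseline looks like in practice. The numbers are synthetic (a made-up slope, intercept, and noise level), not the paper's data; the point is only that the model is a one-descriptor linear fit of the barrier against ΔGrp.

```python
# Minimal sketch (synthetic numbers, not the paper's data): a Bell-Evans-Polanyi
# model is just a one-descriptor linear fit of the barrier against dG_rp.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
dG_rp = rng.uniform(-20, 20, size=100)                 # toy reaction free energies (kcal/mol)
barrier = 15.0 + 0.5 * dG_rp + rng.normal(0, 2, 100)   # toy BEP relationship + noise

bep = LinearRegression().fit(dG_rp.reshape(-1, 1), barrier)
print(bep.coef_[0], bep.intercept_)                    # recovers roughly 0.5 and 15
print(bep.score(dG_rp.reshape(-1, 1), barrier))        # R2 on the toy data
```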

I wonder what other approximate chemical methods might inspire new descriptors?


This work is licensed under a Creative Commons Attribution 4.0 International License.



Wednesday, April 30, 2025

Known Unknowns: Out-of-Distribution Property Prediction in Materials and Molecules

Nofit Segal, Aviv Netanyahu, Kevin P. Greenman, Pulkit Agrawal, Rafael Gómez-Bombarelli (2025)
Highlighted by Jan Jensen

Pat Walters recently wrote a blog post called Why Don't Machine Learning Models Extrapolate? where he showed that ML models trained on low-MW molecules cannot accurately predict the MW of larger molecules, while linear regression appears to do OK. Here I want to highlight a recently proposed ML method by Segal et al. that does appear to be able to extrapolate with decent accuracy. 

Before diving into the paper, let me try to explain why some traditional ML methods, such as tree-based methods and NNs, might struggle with extrapolation, using MW as the property of interest.

Linear Regression. Let's start with linear regression (I'll skip the bias for simplicity),  

$$y_{pred} = X_1w_1+X_2w_2+... +X_nw_n$$

If $X_i$ is the number of atoms of type $i$ in the molecule and $w_i$ is the atomic weight of atom type $i$, then you have a perfect model that is able to extrapolate (assuming all atom types are represented in the training data). However, if $X$ is a binary fingerprint then you obviously won't get good extrapolated values, so the molecular representation also impacts whether a model can extrapolate accurately (I'll return to this point below). Pat uses count fingerprints, which contain the heavy-atom count, so the model "just" has to learn to infer the H-atom count from the remaining fragments, and it does a pretty good job of that.
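A minimal sketch of this argument, using made-up atom counts rather than real molecules: linear regression on an atom-count representation recovers the atomic weights and therefore extrapolates essentially perfectly to molecules much larger than anything in the training set.

```python
# Minimal sketch: linear regression on atom counts recovers the atomic weights
# and extrapolates MW to molecules far larger than the training set.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
atomic_weights = np.array([12.011, 1.008, 15.999])   # C, H, O

X_train = rng.integers(1, 6, size=(50, 3))           # "small" molecules: few atoms
y_train = X_train @ atomic_weights                   # exact molecular weights

X_test = rng.integers(20, 40, size=(10, 3))          # much larger molecules
y_test = X_test @ atomic_weights

model = LinearRegression().fit(X_train, y_train)
print(model.coef_)                                   # ~ [12.011, 1.008, 15.999]
print(np.allclose(model.predict(X_test), y_test))    # True: perfect extrapolation
```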

Random Forest. In RF each tree $i$ predicts one of the $y$ values in the training set and the prediction is simply the average over the $N$ trees:  

$$y_{pred} = (y_1 + y_2 + ... +y_N)/N $$

So the largest possible value of $y_{pred}$ is the maximum $y$ value in the training set, $\max(y)$. Thus, RF is fundamentally incapable of any extrapolation.
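The same toy setup as above shows the cap directly; the data are again synthetic and only meant to illustrate the averaging argument.

```python
# Minimal sketch: a random forest cannot predict above the largest target value
# seen during training, no matter how large the test molecules are.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
atomic_weights = np.array([12.011, 1.008, 15.999])

X_train = rng.integers(1, 6, size=(200, 3))
y_train = X_train @ atomic_weights

X_test = rng.integers(20, 40, size=(10, 3))          # much heavier molecules

rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_train, y_train)
print(rf.predict(X_test).max() <= y_train.max())     # True: predictions are capped at max(y)
```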

LGBM. In gradient-boosted tree methods each tree $i$ tries to predict the deviation from the mean of the training set ($lr$ is the learning rate, which is typically a small number):

$$ y_{pred} = \langle y \rangle +lr(\Delta y_1 + \Delta y_2 + ... + \Delta y_N) $$

The largest possible value of $\Delta y_i$ is the largest deviation from the mean found in the training set, but one can imagine a combination of $lr$, $N$, and $\Delta y_i$'s where the test-set $y_{pred}$ could be larger than any $y$ value in the training set. However, it is unlikely to be significantly bigger, since it is tied to the mean of the training set. This is indeed what Pat finds.
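Again with the same synthetic atom counts (not Pat's count fingerprints), a gradient-boosted model illustrates the point: its test-set predictions stay close to the training-set maximum rather than tracking the true, much larger MWs.

```python
# Minimal sketch: gradient boosting is anchored to the training-set mean, so its
# predictions for much larger molecules hover near the training-set maximum.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
atomic_weights = np.array([12.011, 1.008, 15.999])

X_train = rng.integers(1, 6, size=(200, 3))
y_train = X_train @ atomic_weights

X_test = rng.integers(20, 40, size=(10, 3))          # true MWs are several times larger

gbr = GradientBoostingRegressor(learning_rate=0.1, n_estimators=500,
                                random_state=0).fit(X_train, y_train)
print(y_train.max())                                 # training-set maximum
print(gbr.predict(X_test).max())                     # barely above it, far below the true values
```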

Feed Forward Neural Networks. For an NN the prediction is a linear combination of the outputs from the activation functions from the last hidden layer:

$$y_{pred} = a_1 w_1+a_2 w_2+... +a_n w_n$$

If the activation function is something like sigmoid or tanh, then the maximum value of $a_i$ is 1, no matter the molecule. So the model is clearly going to have a very hard time extrapolating at all, in analogy with linear regression using binary fingerprints.

With the ReLU activation function the model has some theoretical chance of extrapolating. In fact, if the input is the simple atom-count vector discussed above, one could arrive at a perfect model by setting most of the weights to 0, a few to 1, and some of the weights in the last layer to the atomic weights. However, the odds of finding that global error minimum by traditional training techniques are very small. But note that this is a practical rather than a fundamental limitation, for this particular property and molecular representation. As before, if you use a binary fingerprint it will be impossible to make a method capable of extrapolating, no matter what activation function you use. However, a ReLU NN can predict values larger than $\max(y)$, given the right molecular representation. Which brings us to ...
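Here is a minimal sketch of that hand-built "perfect" ReLU network, just to show the construction is possible in principle; whether standard training would ever find these weights is a separate question.

```python
# Minimal sketch: a hand-built ReLU network that reproduces MW exactly from atom
# counts, illustrating that the limitation is practical, not fundamental.
import numpy as np

atomic_weights = np.array([12.011, 1.008, 15.999])   # C, H, O

def relu(x):
    return np.maximum(x, 0.0)

# One hidden layer: identity weights pass the (non-negative) atom counts straight
# through ReLU; the output layer holds the atomic weights.
W_hidden = np.eye(3)         # "a few weights set to 1, the rest to 0"
w_out = atomic_weights       # last-layer weights = atomic weights

def predict_mw(atom_counts):
    return relu(atom_counts @ W_hidden) @ w_out

x = np.array([6, 12, 6])     # glucose: C6H12O6
print(predict_mw(x))         # ~180.16, the MW of glucose, with no training involved
```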

Graph Neural Networks. In GNNs (e.g. ChemProp) the molecular representation that is fed to the NN is some combination of atomic descriptor vectors. The most popular way to combine the atomic vectors is to average them, which results in a molecular descriptor vector that is roughly independent of molecular size. This will make accurate extrapolation harder, even when using ReLU.  
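A minimal sketch of why mean-pooling hurts, using a made-up atomic embedding: averaging identical atom vectors gives the same molecular vector for a 5-atom and a 50-atom molecule, whereas sum-pooling retains the size information.

```python
# Minimal sketch: mean-pooling atomic vectors gives a molecular representation
# that is nearly independent of molecular size, while sum-pooling keeps it.
import numpy as np

atom_vec = np.array([1.0, 0.5])            # pretend every atom gets this embedding

small_mol = np.tile(atom_vec, (5, 1))      # 5 atoms
large_mol = np.tile(atom_vec, (50, 1))     # 50 atoms

print(small_mol.mean(axis=0), large_mol.mean(axis=0))  # identical: size information is lost
print(small_mol.sum(axis=0), large_mol.sum(axis=0))    # differ by a factor of 10
```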

The approach by Segal et al. Segal et al. have developed a method called bilinear transduction, which optimises a bilinear function with contributions from one model that depends on $X$ and one that depends on $\Delta X$, i.e. the difference between the molecule of interest and another (e.g. training-set) molecule:

$$ y_{pred} = f_1(\Delta X) g_1(X) + f_2(\Delta X) g_2(X) + ... + f_n(\Delta X) g_n(X)$$

The basic idea is that while the training set may not contain, say, molecules with 15 C atoms, it contains molecules with 10 C atoms, and molecule pairs that differ by 5 C atoms. So if you combine that knowledge you should be able to make a reasonable extrapolation to 15 C atoms. 
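Below is a minimal sketch of the bilinear form in the equation above, written in PyTorch. It is not the authors' implementation (the network sizes, features, and class name are made up); it only shows how a prediction can be built from an embedding of $X$ and an embedding of $\Delta X$.

```python
# Minimal sketch of the bilinear form y_pred = sum_i f_i(dX) * g_i(X);
# not the authors' implementation, just an illustration of the structure.
import torch
import torch.nn as nn

class BilinearTransductionSketch(nn.Module):
    def __init__(self, n_features, n_hidden=16):
        super().__init__()
        self.g = nn.Sequential(nn.Linear(n_features, n_hidden), nn.ReLU(),
                               nn.Linear(n_hidden, n_hidden))   # g(X): anchor embedding
        self.f = nn.Sequential(nn.Linear(n_features, n_hidden), nn.ReLU(),
                               nn.Linear(n_hidden, n_hidden))   # f(dX): difference embedding

    def forward(self, x_anchor, delta_x):
        # prediction = dot product of the two embeddings
        return (self.f(delta_x) * self.g(x_anchor)).sum(dim=-1)

model = BilinearTransductionSketch(n_features=3)
x = torch.tensor([[10., 22., 2.]])        # e.g. atom counts of an anchor molecule
dx = torch.tensor([[5., 10., 0.]])        # difference to the query molecule
print(model(x, dx).shape)                 # torch.Size([1])
```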

This turns out to work quite well, as you can see from the rightmost panel of the corresponding figure in the paper (for the FreeSolv data set).


This work is licensed under a Creative Commons Attribution 4.0 International License.

Monday, March 31, 2025

Designing Target-specific Data Sets for Regioselectivity Predictions on Complex Substrates

Jules Schleinitz, Alba Carretero-Cerdán, Anjali Gurajapu, Yonatan Harnik, Gina Lee, Amitesh Pandey, Anat Milo, and Sarah Reisman (2025)
Highlighted by Jan Jensen



Have you ever looked at a poorly performing ML model and thought: "Hmm, maybe I should make my training set smaller"? Me neither. 

Well, this paper shows an example where that actually works. In particular, they show examples where regioselectivity predictions for some molecules are improved by using only some of the available data. They show that if you start with a very small training set you get the wrong prediction of where in a molecule the reaction occurs, and when you add more data the model often makes the right prediction eventually. However, if you keep adding data the model starts making a wrong prediction again! In other words, they get a much better classification model if they make a bespoke training set for each molecule.

This raises two important questions: 1) Does this apply to all datasets and properties? and 2) How do you figure out which data points to include in your training set for a particular molecule when you don't know the right answer?

If the answer to the first question is "no" (and it probably is), then how do we figure out when this is a good strategy (other than by trial and error)? I suspect that predictions of local properties (such as the reactivity of an atom) are more likely to benefit from bespoke training sets than global properties such as solubility. But that is just a guess.

Another guess is that this will apply mostly to small, inhomogeneous datasets. If so, we could easily generate bespoke models for each individual prediction on the fly if that would lead to better predictions. But we need to figure out the answer to question 2 first. 

I also think that if we can understand how additional data can hurt a model's performance, it would give us some valuable insights into how ML models learn.



This work is licensed under a Creative Commons Attribution 4.0 International License.



Friday, February 28, 2025

GOAT: A Global Optimization Algorithm for Molecules and Atomic Clusters

Bernardo de Souza (2025)
Highlighted by Jan Jensen


If you want to predict accurate reaction energies and barrier heights of typical organic molecules, then you are spending a significant portion of CPU time on the conformational search. While generating a large number of random starting structures often works OK for smaller molecules (with fewer than, say, 15 rotatable bonds), it fails for larger molecules, where the odds of randomly generating the global minimum quickly approach zero. You thus need methods that focus the search around low-energy regions of the PES.

In essence, the algorithm walks uphill in some random direction, detects when a conformational barrier has been crossed, minimises the energy, and decides whether a new conformer has been found. New conformers are then included in the ensemble using a Monte Carlo criterion with simulated annealing. The process is repeated until no new low-energy conformers are found.
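To make the loop concrete, here is a toy, basin-hopping-style sketch on a one-dimensional model potential. It is emphatically not GOAT's implementation (the potential, step sizes, and cooling schedule are all made up); it only illustrates the perturb, minimise, and Metropolis-accept-with-annealing cycle described above.

```python
# Toy basin-hopping-style sketch of a perturb/minimise/accept loop on a 1-D
# model potential; not GOAT's algorithm or energy function.
import math
import random

def energy(x):                        # toy multi-well "potential energy surface"
    return math.cos(3 * x) + 0.1 * x * x

def minimize(x, step=1e-3, n=5000):   # crude gradient descent to the nearest minimum
    for _ in range(n):
        grad = (energy(x + 1e-6) - energy(x - 1e-6)) / 2e-6
        x -= step * grad
    return round(x, 3)                # rounding merges near-identical "conformers"

def search(x0, temperature=1.0, n_iter=200):
    ensemble = {minimize(x0)}
    for _ in range(n_iter):
        x = random.choice(sorted(ensemble))
        x += random.choice([-1, 1]) * random.uniform(0.5, 1.5)   # hop over a barrier
        candidate = minimize(x)
        d_e = energy(candidate) - min(energy(c) for c in ensemble)
        if random.random() < math.exp(-max(d_e, 0.0) / temperature):
            ensemble.add(candidate)                              # Metropolis acceptance
        temperature *= 0.98                                      # simulated annealing
    return sorted(ensemble, key=energy)

print(search(0.0)[:3])                # lowest-energy "conformers" found
```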

GOAT is better than or very similar to CREST for all but one of the organic molecules tested. For organometallic complexes GOAT is better than or similar to CREST, except for three cases where CREST fails in some way. For small molecules GOAT is a bit slower than CREST, but for large molecules GOAT is usually considerably faster.

GOAT is thus a valuable addition to the computational chemistry toolbox.


This work is licensed under a Creative Commons Attribution 4.0 International License.



Wednesday, January 29, 2025

Applying statistical modeling strategies to sparse datasets in synthetic chemistry

Brittany C. Haas, Dipannita Kalyani, Matthew S. Sigman (2025)
Highlighted by Jan Jensen



A few weeks ago I listened to an online talk by Matt Sigman and one thing that really surprised me was the remarkable success he had with very, very simple decision trees (sometimes just with a single node!). I wanted to learn more about it and luckily for me he has just published this excellent perspective.

First of all, these methods are used out of necessity because the data sets are relatively small (typically <1000 and often <100 points). So why do they work so well? Each application is on one particular organometallic catalyst and reaction type, and the experimental data usually come from the same lab. The descriptors are usually obtained from DFT calculations and carry a lot of high-quality chemical information. In fact, the authors make the point that if the approach fails they look for new descriptors rather than new ML methods. Indeed, you can view the single-node decision tree as an automated way of finding the single best descriptor in a collection.
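To see what a single-node decision tree buys you, here is a minimal sketch on synthetic data (five made-up descriptors, only one of which actually controls the response): the depth-1 tree automatically picks out that descriptor and a sensible threshold.

```python
# Minimal sketch (synthetic data): a depth-1 decision tree ("stump") effectively
# selects the single most informative descriptor and its best split threshold.
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

rng = np.random.default_rng(0)
X = rng.normal(size=(80, 5))                                    # 5 made-up descriptors
y = (X[:, 2] > 0.3).astype(float) * 20 + rng.normal(0, 1, 80)   # only descriptor 2 matters

stump = DecisionTreeRegressor(max_depth=1).fit(X, y)
print(export_text(stump, feature_names=[f"descr_{i}" for i in range(5)]))
# The single split lands on descr_2, near the true threshold of 0.3
```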

You could of course ask: if you are doing DFT calculations anyway, why not simply compute the barriers of interest rather than using ML? The problem is that problems like yield optimisation translate to very small changes (on the order of 0.5-2 kcal/mol) in barrier height, and even CCSD(T) is simply not up to the task for systems of this size, where conformational sampling/Boltzmann averaging, explicit solvent effects, and anharmonic effects become important. 

Could this approach be applied to other problems, like drug discovery? The application would probably have to be something like IC50 prediction for lead optimisation, where the molecules share a common core. One main difference from organometallic catalysis is that the activity of the catalyst is governed by a relatively localised molecular region, whereas ligand-protein binding has contributions from most of the molecule. It thus seems difficult to find one or a few descriptors that capture the binding and that can be computed reliably by DFT calculations - especially if no experimental protein-ligand structures are available. 

However, this paper suggests that it may be better to focus on developing such descriptors instead of new ML methods.



This work is licensed under a Creative Commons Attribution 4.0 International License.



Sunday, December 29, 2024

dxtb —An efficient and fully differentiable framework for extended tight-binding

Marvin Friede, Christian Hölzer, Sebastian Ehlert, and Stefan Grimme (2024)
Highlighted by Jan Jensen



When I first saw this paper on SoMe I was incredibly excited, because I thought it was the release of the long-anticipated g-xTB method (I have a bad habit of reading very superficially and seeing what I want to see). But then I skimmed the abstract, saw my mistake, and promptly forgot about the paper until I saw Jan-Michael Mewes' recent Bluesky thread.

The paper describes a fully differentiable, Python-based PyTorch implementation of GFN1-xTB. In the paper they use it to compute some new molecular properties, but the real strength will be in developing new xTB methods for specific applications, i.e. a physics-based alternative to ML potentials. Jan gives an illustrative example of this in his thread.

While this application is mentioned in the paper, the paper doesn't contain an actual example of it. It remains to be seen how fiddly the actual retraining will be compared to MLPs, but the hope is that bespoke xTB methods will require significantly less training data and be more broadly applicable than MLPs.

That's assuming that g-xTB doesn't solve all our problems, which is very much my expectation based on Grimme's talks about it (but keep in mind that my listening skills are even worse than my reading skills).



This work is licensed under a Creative Commons Attribution 4.0 International License.



Wednesday, November 27, 2024

The vDZP Basis Set Is Effective For Many Density Functionals

Corin C. Wagen and Jonathon E. Vandezande (2024)
Highlighted by Jan Jensen



While this is an interesting paper, a cursory reading (like the one I did initially) can leave you with the wrong impression. The paper shows that the vDZP basis set that Grimme and co-workers developed as part of the ωB97X-3c method gives good results with other functionals. The results are always better than with other DZ basis sets, and sometimes better than with TZ or even QZ basis sets, depending on the property! That's the good news.

The bad news is that the vDZP basis set is about 40% more computationally expensive than a TZ basis set (at least for typical organic molecules). The reason is that vDZP consists of more primitives than a typical TZ basis set (but considerably fewer than a typical QZ basis set).

So, for me, the main take-home message is that there is a basis set somewhere between TZ and QZ in cost that may be worth trying if the TZ results are not acceptable but QZ is too expensive. However, the paper doesn't show any convincing examples of this. Yes, for isomerization reactions, B97-D3BJ/vDZP is more accurate than B97-3c (which uses the mTZVP basis set) and even B97-D3BJ/def2-QZVP. But you get much better (and faster) results by using r2SCAN-3c (which uses the mTZVPP basis set).

One exception is if you are working with molecules containing a lot of heavier atoms (post-C row); then vDZP may be faster than TZ basis sets because it uses ECPs.



This work is licensed under a Creative Commons Attribution 4.0 International License.