Wednesday, April 30, 2025

Known Unknowns: Out-of-Distribution Property Prediction in Materials and Molecules

Nofit Segal, Aviv Netanyahu, Kevin P. Greenman, Pulkit Agrawal, Rafael Gómez-Bombarelli (2025)
Highlighted by Jan Jensen

Pat Walters recently wrote a blog post called Why Don’t Machine Learning Models Extrapolate? where he showed that ML models trained on low-MW molecules cannot accurately predict the MW of larger molecules, while linear regression appears to do OK. Here I want to highlight a recently proposed ML method by Segal et al. that does appear to be able to extrapolate with decent accuracy.

Before diving into the paper, let me try to explain why some traditional ML methods, such as tree-based methods and NNs, might struggle with extrapolation, using MW as the property of interest.

Linear Regression. Let's start with linear regression (I'll skip the bias for simplicity),  

$$y_{pred} = X_1w_1+X_2w_2+... +X_nw_n$$

If $X_i$ is the number of atoms of type $i$ in the molecule and $w_i$ is the atomic weight of atom type $i$, then you have a perfect model that is able to extrapolate (assuming all atom types are represented in the training data). However, if $X$ is a binary fingerprint then you obviously won't get good extrapolated values, so the molecular representation also impacts whether a model can extrapolate accurately (I'll return to this point below). Pat uses count fingerprints, which contain the heavy-atom count, so the model "just" has to learn to infer the H-atom count from the remaining fragments, and it does a pretty good job of that.
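
Here is a minimal sketch (my own toy example, not from Pat's post) of that first point: fit linear regression on atom-count vectors and check that it extrapolates to a "molecule" far larger than anything in the training set.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

atomic_weights = np.array([12.011, 1.008, 15.999])   # C, H, O
rng = np.random.default_rng(0)

# Training "molecules": small atom-count vectors [n_C, n_H, n_O]
X_train = rng.integers(1, 10, size=(200, 3))
y_train = X_train @ atomic_weights                   # exact molecular weights

model = LinearRegression().fit(X_train, y_train)

# Test molecule far outside the training size range
X_test = np.array([[50, 102, 10]])
print(model.predict(X_test), X_test @ atomic_weights)  # both ~863 g/mol
```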

Random Forest. In RF each tree $i$ predicts one of the $y$ values in the training set and the prediction is simply the average over the $N$ trees:

$$y_{pred} = (y_1 + y_2 + ... +y_N)/N $$

So the largest possible value of $y_{pred}$ is the maximum $y$ value in the training set, $\max(y)$. Thus, RF is fundamentally incapable of any extrapolation.
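
A minimal sketch of this ceiling (same toy MW data as above, my own illustrative example): a random forest trained on small molecules cannot predict anything above $\max(y)$ of its training set.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

atomic_weights = np.array([12.011, 1.008, 15.999])
rng = np.random.default_rng(0)

X_train = rng.integers(1, 10, size=(500, 3))
y_train = X_train @ atomic_weights

rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_train, y_train)

X_big = np.array([[50, 102, 10]])           # true MW ~ 863
print(rf.predict(X_big), y_train.max())     # prediction cannot exceed max(y_train)
```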

LGBM. In gradient-boosted tree methods each tree $i$ tries to predict the deviation from the mean of the training set ($lr$ is the learning rate, which is typically a small number):

$$ y_{pred} = \langle y \rangle +lr(\Delta y_1 + \Delta y_2 + ... + \Delta y_N) $$

The largest possible value of $\Delta y_i$ is the largest deviation from the mean found in the training set, and one can imagine a combination of $lr$, $N$, and $\Delta y_i$'s where the test-set $y_{pred}$ is larger than any $y$ value in the training set. However, it is unlikely to be significantly bigger, since the prediction is tied to the mean of the training set. This is indeed what Pat finds.
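
A minimal sketch of this behaviour, using scikit-learn's GradientBoostingRegressor as a stand-in for LGBM (the argument is the same for both): the boosted prediction starts from the training mean and adds learning-rate-scaled corrections, so it stays tied to the training range.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

atomic_weights = np.array([12.011, 1.008, 15.999])
rng = np.random.default_rng(0)

X_train = rng.integers(1, 10, size=(500, 3))
y_train = X_train @ atomic_weights

gbt = GradientBoostingRegressor(n_estimators=300, learning_rate=0.1,
                                random_state=0).fit(X_train, y_train)

X_big = np.array([[50, 102, 10]])                    # true MW ~ 863
print(gbt.predict(X_big), y_train.mean(), y_train.max())
# the prediction stays close to the training range, far below the true ~863
```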

Neural Networks. For an NN the prediction is a linear combination of the outputs of the activation functions in the last hidden layer:

$$y_{pred} = a_1 w_1+a_2 w_2+... +a_n w_n$$

If the activation function is something like sigmoid or tanh, then the maximum value of $a_i$ is 1, no matter the molecule, and the model is clearly going to have a very hard time extrapolating at all, in analogy with linear regression on binary fingerprints.
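
A minimal sketch of the bound (an untrained toy network with random weights, purely for illustration): with tanh activations the last hidden layer's outputs lie in (-1, 1), so the output can never exceed $\sum_i |w_i|$, no matter how large the input molecule is.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 16)), rng.normal(size=16)
w2 = rng.normal(size=16)

def predict(x):
    a = np.tanh(x @ W1 + b1)          # bounded in (-1, 1)
    return a @ w2

print(predict(np.array([5, 10, 2])))        # small molecule
print(predict(np.array([500, 1000, 200])))  # 100x larger input, similar output
print(np.abs(w2).sum())                     # hard ceiling on |prediction|
```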

With ReLU the model has some theoretical chance of extrapolating. In fact, if the input is the simple atom-count vector discussed above, one could arrive at a perfect model by setting most of the weights to 0, a few to 1, and some of the weights in the last layer to the atomic weights. However, the odds of finding that global error minimum with traditional training techniques are very small. Note that this is a practical rather than a fundamental limitation, for this particular property and molecular representation. As before, if you use a binary fingerprint it will be impossible to make a method capable of extrapolating, no matter what activation function you use. However, a ReLU NN can predict values larger than $\max(y)$, given the right molecular representation. Which brings us to ...
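
To make the "theoretical chance" concrete, here is a hand-built ReLU network (a toy construction with weights set by hand, not a trained model) that reproduces MW exactly from atom counts: an identity-like first layer and atomic weights in the output layer. Gradient descent is unlikely to find this solution on its own.

```python
import numpy as np

atomic_weights = np.array([12.011, 1.008, 15.999])   # C, H, O

W1 = np.eye(3)                      # pass each atom count through unchanged
b1 = np.zeros(3)
w2 = atomic_weights                 # output weights = atomic weights

def predict(x):
    a = np.maximum(0, x @ W1 + b1)  # ReLU; atom counts are non-negative anyway
    return a @ w2

print(predict(np.array([50, 102, 10])))   # ~863.4, well outside any training range
```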

Graph Neural Networks. In GNNs (e.g. ChemProp) the molecular representation that is fed to the NN is some combination of atomic descriptor vectors. The most popular way to combine the atomic vectors is to average them, which results in a molecular descriptor vector that is roughly independent of molecular size, and that makes accurate extrapolation harder, even when using ReLU.
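
A minimal sketch of why the readout matters (independent of any specific GNN library, with made-up per-atom vectors): mean pooling erases molecular size, while sum pooling preserves it.

```python
import numpy as np

rng = np.random.default_rng(0)
atom_vecs = rng.normal(size=(10, 8))           # 10 atoms, 8-dim learned embeddings
doubled   = np.vstack([atom_vecs, atom_vecs])  # "molecule" twice the size

print(np.allclose(atom_vecs.mean(0), doubled.mean(0)))  # True: identical readout
print(np.allclose(atom_vecs.sum(0),  doubled.sum(0)))   # False: sum sees the size
```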

The approach by Segal et al. Segal et al. have developed a method called bilinear transduction, which optimises a bilinear function with contributions from one model that depends on $X$ and one that depends on $\Delta X$, the difference between the representation of the query molecule and that of a molecule in the training set.

$$ y_{pred} = f_1(\Delta X) g_1(X) + f_2(\Delta X) g_2(X) + ... + f_n(\Delta X) g_n(X)$$

The basic idea is that while the training set may not contain, say, molecules with 15 C atoms, it does contain molecules with 10 C atoms, and pairs of molecules that differ by 5 C atoms. Combining that knowledge, you should be able to make a reasonable extrapolation to 15 C atoms.
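
Here is a deliberately simple numpy sketch of that idea (my own toy example, not the authors' implementation): each "molecule" is just its carbon count, a bilinear model $y_{pred} = f(\Delta X) \cdot g(X)$ with linear $f$ and $g$ is fit on pairs of training molecules, and a 15-carbon query is then predicted by anchoring it to a 10-carbon training molecule. The pair difference of 5 is covered by the training pairs even though no training molecule has 15 carbons.

```python
import numpy as np

mw_per_c = 14.027                        # toy: MW grows by one CH2 unit per carbon
x_train = np.arange(1, 11, dtype=float)  # training molecules: 1-10 carbons
y_train = mw_per_c * x_train

# All ordered pairs (anchor i, target j) from the training set
i, j = np.meshgrid(np.arange(10), np.arange(10), indexing="ij")
xa = x_train[i.ravel()]
dx = x_train[j.ravel()] - xa
yj = y_train[j.ravel()]

# Bilinear model: y = [dx, 1]^T M [xa, 1], which is linear in the entries of M
features = np.column_stack([dx * xa, dx, xa, np.ones_like(dx)])
M, *_ = np.linalg.lstsq(features, yj, rcond=None)

# Query: 15 carbons (outside the training range), anchored to the 10-carbon molecule
x_query, x_anchor = 15.0, 10.0
dxq = x_query - x_anchor
y_pred = M @ np.array([dxq * x_anchor, dxq, x_anchor, 1.0])
print(y_pred, mw_per_c * x_query, y_train.max())   # ~210.4 vs 210.4 vs 140.3
```

The prediction lands well above $\max(y)$ of the training set because the extrapolation is carried by $\Delta X$, which stays inside the range of pair differences seen during training.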

This turns out to work quite well, as you can see from the rightmost panel here (for the FreeSolv data set).


This work is licensed under a Creative Commons Attribution 4.0 International License.