Friday, February 28, 2025

GOAT: A Global Optimization Algorithm for Molecules and Atomic Clusters

Bernardo de Souza (2025)
Highlighted by Jan Jensen


If you want to predict accurate reaction energies and barrier heights of typical organic molecules then you are spending a significant portion of your CPU time on the conformational search. While generating a large number of random starting structures often works OK for smaller molecules (with less than, say, 15 rotatable bonds), it fails for larger molecules, where the odds of randomly generating the global minimum quickly approach zero. You thus need methods that focus the search around low-energy regions of the PES.

In essence, the algorithm walks uphill in some random direction, detects when a conformational barrier has been crossed, minimises the energy, and decides whether a new conformer has been found. New conformers are then included in the ensemble using a Monte Carlo criterion with simulated annealing. The process is repeated until no new low-energy conformers are found.
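
Schematically, the loop might look something like the sketch below. This is only my reading of the description above, not de Souza's actual implementation; energy, minimise, and perturb_uphill are placeholders, and the temperature schedule is made up.

import math
import random

def goat_like_search(x0, energy, minimise, perturb_uphill,
                     n_steps=100, T0=1.0, cooling=0.95, tol=1e-6):
    """Schematic GOAT-like conformer search (illustration only).

    x0               : starting geometry
    energy(x)        : energy of geometry x
    minimise(x)      : relax x to the nearest local minimum
    perturb_uphill(x): walk x uphill along a random direction until a
                       barrier appears to have been crossed
    """
    current = minimise(x0)
    ensemble = [current]
    T = T0
    for _ in range(n_steps):
        trial = minimise(perturb_uphill(current))   # cross a barrier, then relax
        dE = energy(trial) - energy(current)
        # Metropolis acceptance with a simulated-annealing temperature
        if dE < 0 or random.random() < math.exp(-dE / T):
            current = trial
            # count it as new if its energy is not already in the ensemble
            if all(abs(energy(trial) - energy(c)) > tol for c in ensemble):
                ensemble.append(trial)
        T *= cooling                                # lower the temperature
    return ensemble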

GOAT is better than or very similar to CREST for all but one of the organic molecules tested. For the organometallic complexes GOAT is better than or similar to CREST, except for three cases where CREST fails in some way. For small molecules GOAT is a bit slower than CREST, but for large molecules GOAT is usually considerably faster.

GOAT is thus a valuable addition to the computational chemistry toolbox.


This work is licensed under a Creative Commons Attribution 4.0 International License.



Wednesday, January 29, 2025

Applying statistical modeling strategies to sparse datasets in synthetic chemistry

Brittany C. Haas, Dipannita Kalyani, Matthew S. Sigman (2025)
Highlighted by Jan Jensen



A few weeks ago I listened to an online talk by Matt Sigman and one thing that really surprised me was the remarkable success he had with very, very simple decision trees (sometimes just with a single node!). I wanted to learn more about it and luckily for me he has just published this excellent perspective.

First of all, these methods are used out of necessity because the data sets are relatively small (typically <1000 and often <100 points). So why do they work so well? Each application is on one particular organometallic catalyst and reaction type, and the experimental data usually comes from the same lab. The descriptors are usually obtained from DFT calculations and carry a lot of high-quality chemical information. In fact, the authors make the point that if the approach fails they look for new descriptors rather than new ML methods. You can even view the single-node decision tree as an automated way of finding the single best descriptor in a collection, as sketched below.
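
Here is a minimal illustration of that last point with scikit-learn; the file name, descriptor columns, and target are hypothetical. A depth-1 tree ("decision stump") simply picks the single descriptor, and the threshold on it, that best splits the data.

import pandas as pd
from sklearn.tree import DecisionTreeRegressor

df = pd.read_csv("reactions.csv")                       # hypothetical dataset
X = df[["buried_volume", "homo_energy", "nbo_charge"]]  # hypothetical DFT descriptors
y = df["ddG_selectivity"]                               # hypothetical target (kcal/mol)

stump = DecisionTreeRegressor(max_depth=1).fit(X, y)    # single-node decision tree
best_descriptor = X.columns[stump.tree_.feature[0]]     # descriptor chosen at the root
print(f"best single descriptor: {best_descriptor}, "
      f"split at {stump.tree_.threshold[0]:.2f}")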

You could of course ask: if you are doing DFT calculations anyway, why not simply compute the barriers of interest rather than using ML? The problem is that problems like yield optimisation translate to very small changes (on the order of 0.5-2 kcal/mol) in barrier height, and even CCSD(T) is simply not up to the task for systems of this size, where conformational sampling/Boltzmann averaging, explicit solvent effects, and anharmonic effects become important.
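
To put those numbers in perspective, here is the standard transition-state-theory arithmetic (not from the paper): the relative rate of two competing pathways scales as exp(ΔΔG‡/RT), so at room temperature even half a kcal/mol is more than a factor of two.

import math

R = 1.987e-3   # gas constant in kcal/(mol K)
T = 298.15     # K
for ddG in (0.5, 1.0, 2.0):   # barrier difference in kcal/mol
    print(f"ddG = {ddG:.1f} kcal/mol -> rate ratio = {math.exp(ddG / (R * T)):.1f}")
# prints roughly 2.3, 5.4 and 29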

Could this approach be applied to other problems, like drug discovery? The application would probably have to be something like IC50 prediction for lead optimisation, where the molecules share a common core. One main difference from organometallic catalysis is that the activity of a catalyst is a function of a relatively localised molecular region, whereas ligand-protein binding has contributions from most of the molecule. It thus seems difficult to find one or a few descriptors that capture the binding and that can be computed reliably by DFT calculations - especially if no experimental protein-ligand structures are available.

However, this paper suggests that it may be better to focus on developing such descriptors instead of new ML methods.



This work is licensed under a Creative Commons Attribution 4.0 International License.



Sunday, December 29, 2024

dxtb —An efficient and fully differentiable framework for extended tight-binding

Marvin Friede, Christian Hölzer, Sebastian Ehlert, and Stefan Grimme (2024)
Highlighted by Jan Jensen



When I first saw this paper on SoMe I was incredibly excited because I thought it was the release of the long-anticipated g-xTB method (I have a bad habit of reading very superficially and seeing what I want to see). But then I skimmed the abstract, saw my mistake, and promptly forgot about the paper until I saw Jan-Michael Mewes' recent Bluesky thread.

The paper describes a fully differentiable Python-based PyTorch implementation of GFN1-xTB. In the paper the authors use it to compute some new molecular properties, but the real strength will be in developing new xTB methods for specific applications, i.e. a physics-based alternative to ML potentials. Jan gives an illustrative example of this in his thread.
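
To illustrate why differentiability matters for refitting, here is a toy sketch of tuning empirical parameters against reference energies by gradient descent. The pair-repulsion "energy" and the reference data are made up, and nothing here uses dxtb's actual API.

import torch

def tb_like_energy(coords, alpha, k):
    # toy stand-in for a differentiable tight-binding energy:
    # a simple exponential pair repulsion with two trainable parameters
    dists = torch.cdist(coords, coords)
    mask = ~torch.eye(len(coords), dtype=torch.bool)
    return 0.5 * (k * torch.exp(-alpha * dists[mask])).sum()

alpha = torch.nn.Parameter(torch.tensor(1.0))
k = torch.nn.Parameter(torch.tensor(10.0))
opt = torch.optim.Adam([alpha, k], lr=0.05)

# made-up reference data: (coordinates, reference energy) pairs
reference_data = [(torch.rand(5, 3) * 3.0, torch.tensor(4.0)) for _ in range(8)]

for epoch in range(100):
    for coords, e_ref in reference_data:
        opt.zero_grad()
        loss = (tb_like_energy(coords, alpha, k) - e_ref) ** 2
        loss.backward()   # autograd gives d(loss)/d(parameters) for free
        opt.step()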

While this application is mentioned in the paper, it doesn't contain an actual example of it. It remains to be seen how fiddly the actual retraining will be compared to MLPs, but the hope is that the bespoke xTB methods will require significantly less training data and be more broadly applicable than MLPs.

That's assuming that g-xTB doesn't solve all our problems, which is very much my expectation based on Grimme's talks about it (but keep in mind that my listening skills are even worse than my reading skills).



This work is licensed under a Creative Commons Attribution 4.0 International License.



Wednesday, November 27, 2024

The vDZP Basis Set Is Effective For Many Density Functionals

Corin C. Wagen and Jonathon E. Vandezande (2024)
Highlighted by Jan Jensen



While this is an interesting paper, a cursory reading (like the one I did initially) can leave you with the wrong impression. The paper shows that the vDZP basis set that Grimme and co-workers developed as part of the ωB97X-3c method gives good results with other functionals. The results are always better than using other DZ basis sets and sometimes better than using TZ or even QZ basis sets, depending on the property! That's the good news.

The bad news is that the vDZP basis set is about 40% more expensive than a TZ basis set (at least for typical organic molecules). The reason is that vDZP consists of more primitives than a typical TZ basis set (but considerably fewer than a typical QZ basis set).

So, for me, the main take-home message is that there is a basis set that is somewhere between TZ and QZ in cost, which may be worth trying if the TZ results are not acceptable but QZ is too expensive. However, the paper doesn't show any convincing examples of this. Yes, for isomerization reactions, B97-D3BJ/vDZP is more accurate than B97-3c (which uses the mTZVP basis set) and even B97-D3BJ/def2-QZVP. But you get much better (and faster) results by using r2SCAN-3c (which uses the mTZVPP basis set).

One exception is if you are working with molecules containing a lot of heavy atoms (beyond the C row), in which case vDZP may be faster than TZ basis sets because it uses ECPs.



This work is licensed under a Creative Commons Attribution 4.0 International License.

Thursday, October 31, 2024

Lifelong Machine Learning Potentials

Marco Eckhoff and Markus Reiher (2023)
Highlighted by Jan Jensen

While machine learning potentials (MLPs) can give you DFT accuracy at FF costs, they also come with some practical problems: they often need to be retrained from scratch when adding new data to avoid catastrophic forgetting, and most structural descriptors struggle to efficiently represent a large number of different chemical elements.

This paper presents solutions to some of these problems. It introduces element-embracing atom-centered symmetry functions (eeACSFs) that incorporate periodic-table trends, enabling efficient multi-element handling, and it proposes a lifelong learning framework that combines continual learning strategies, the continual resilient (CoRe) optimizer, and uncertainty quantification to let MLPs adapt to new data incrementally without losing prior knowledge.

The eeACSFs differ from conventional ACSFs by integrating element information based on periodic table trends rather than creating separate descriptors for each element combination, which allows them to efficiently handle systems with multiple elements without a combinatorial increase in descriptor size.

The CoRe optimizer is designed to balance efficient convergence with stability, adapting dynamically to the learning context. It combines the robustness of RPROP (resilient backpropagation) with the performance benefits of Adam. Specifically, the optimizer adjusts learning rates based on gradient history, which allows for faster convergence initially and a more stable final accuracy. Additionally, it includes a plasticity factor that selectively freezes parameters critical to prior knowledge while allowing other parameters to adapt. This prevents the “catastrophic forgetting” problem common in continual learning, where new learning can overwrite prior knowledge.

The lifelong learning approach includes adaptive selection factors, where each data point has a selection factor that is updated based on its contribution to the loss function. If a point is well represented in training, its selection factor decreases, reducing its likelihood of being chosen in future training epochs. Conversely, data that are underrepresented get higher selection factors, ensuring they are revisited. In addition, redundant data (those with low loss contributions) are excluded from the training set, reducing the memory and computational load. Data points that the model consistently fails to learn are also excluded, which improves training efficiency and prevents model instability from conflicting data.
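
Schematically, the selection-factor bookkeeping might look something like the sketch below. This is my own paraphrase with a made-up update rule and cutoffs, not the authors' actual scheme.

import numpy as np

def update_selection_factors(sel, losses, lr=0.1,
                             redundant_cut=1e-4, fail_cut=1e2):
    # move each selection factor toward the point's relative loss contribution:
    # well-fitted points drift down, underrepresented points drift up
    losses = np.asarray(losses)
    sel = (1 - lr) * sel + lr * losses / (losses.mean() + 1e-12)
    # drop redundant points (negligible loss) and unlearnable ones (huge loss)
    keep = (losses > redundant_cut) & (losses < fail_cut)
    return sel, keep

def sample_batch(rng, sel, keep, batch_size):
    # points with larger selection factors are more likely to be revisited
    p = np.where(keep, sel, 0.0)
    return rng.choice(len(sel), size=batch_size, replace=False, p=p / p.sum())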

The paper acknowledges that further refinement is needed for scenarios involving the addition of new chemical systems. Although the lMLP can expand its conformation space efficiently, accuracy still falls slightly below training on a single large dataset. Additionally, the method’s application to other MLP architectures and addressing consistency across different electronic states and computational methods remain areas for future work. The authors also suggest that larger and more diverse datasets will be necessary to fully realize the potential of lMLPs in simulating complex chemical systems.



This work is licensed under a Creative Commons Attribution 4.0 International License.



Sunday, September 29, 2024

Toy Models of Superposition

Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, Tom Henighan, Shauna Kravec, Zac Hatfield-Dodds, Robert Lasenby, Dawn Drain, Carol Chen, Roger Grosse, Sam McCandlish, Jared Kaplan, Dario Amodei, Martin Wattenberg, Christopher Olah (2022)
Highlighted by Jan Jensen



Most NNs are notoriously hard to interpret. While there are a few cases, mostly in image classification, where some features (like lines or corners) can be assigned to particular neurons, in general it seems like every part of the NN contributes to every prediction. This paper provides some powerful insight into why this is, by analysing simple toy models.

The study builds on the idea that the output of a hidden layer is an N-dimensional embedding vector (V) that encodes features of the data (N is the number of neurons in the layer). You might have seen this famous example from language models: V("king") - V("man") + V("woman") = V("queen").

Naively, one would expect that an N-neuron layer can encode N different features, since there are only N orthogonal directions. However, the paper points out that the number of almost orthogonal vectors (say, with pairwise angles between 89° and 91°) increases exponentially with N, so NNs can represent many more features than they have dimensions, which the authors term "superposition".
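
A quick numerical illustration of this (my own, not from the paper): draw a few thousand random unit vectors in a few hundred dimensions, and essentially every pair ends up within a few degrees of 90°.

import numpy as np

rng = np.random.default_rng(0)
N, n_vectors = 512, 2000
V = rng.normal(size=(n_vectors, N))
V /= np.linalg.norm(V, axis=1, keepdims=True)        # random unit vectors

cos = (V @ V.T)[np.triu_indices(n_vectors, k=1)]     # all pairwise cosines
angles = np.degrees(np.arccos(np.clip(cos, -1, 1)))
print(f"mean angle: {angles.mean():.1f} deg, "
      f"fraction between 85 and 95 deg: {(np.abs(angles - 90) < 5).mean():.1%}")
# in 512 dimensions virtually all pairs are almost orthogonal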

Since most features are stored in such almost orthogonal vectors, they necessarily have non-zero contributions from many neurons and thus cannot be assigned to a specific neuron. The authors further show that superposition is driven by data sparsity, i.e. few examples of a particular input feature: more data sparsity, more superposition, less interpretability.

The paper is very thorough and there are many more insights that I have skipped. But I hope this highlight has made you curious enough to have a look at the paper. I can also recommend this brilliant introduction to superposition by 3Blue1Brown to get you started.

Now, it's important to note that these insights are obtained by analysing simple toy problems. It will be interesting to see if and how they apply to real-world applications, including chemistry. 


This work is licensed under a Creative Commons Attribution 4.0 International License.



Wednesday, August 28, 2024

Variational Pair-Density Functional Theory: Dealing with Strong Correlation at the Protein Scale

Mikael Scott, Gabriel L. S. Rodrigues, Xin Li, and Mickael G. Delcey (2024)
Highlighted by Jan Jensen

As I've said before, one of the big problems in quantum chemistry is that we still can't routinely predict the reactivity of TM-containing compounds with the same degree of accuracy as we can for organic molecules. This paper might offer a solution by combining CASSCF with DFT in a variational way.

While such a combination has been done before, that implementation basically computes the DFT energy based on the CASSCF density. If you haven't heard of this approach, it's probably because it didn't work very well.

This paper presents a variational implementation, where you minimise the energy of a CASSCF wavefunction subject to an exchange-correlation density functional, and the results are significantly better - in some cases approaching chemical accuracy! This is pretty impressive given that they used off-the-shelf GGA functionals (BLYP and PBE), so further improvements in accuracy with bespoke functionals are quite likely.
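
For context, the energy being minimised is, schematically, the standard pair-density functional expression (my paraphrase, not copied from the paper):

E_{\text{PDFT}} = V_{nn} + \sum_{pq} h_{pq} D_{pq} + \tfrac{1}{2}\sum_{pqrs} g_{pqrs} D_{pq} D_{rs} + E_{\text{ot}}[\rho, \Pi]

where D is the one-electron density matrix of the CASSCF wavefunction, ρ the density, Π the on-top pair density, and E_ot an on-top functional translated from an ordinary functional such as BLYP or PBE. The variational twist is that this energy is minimised directly with respect to the orbital and CI coefficients, rather than being evaluated once on a converged CASSCF wavefunction.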

Oh, and one of the applications presented in the paper is a multiconfigurational calculation on an entire metallo-protein!



This work is licensed under a Creative Commons Attribution 4.0 International License.