Sunday, December 29, 2024

dxtb —An efficient and fully differentiable framework for extended tight-binding

Marvin Friede, Christian Hölzer, Sebastian Ehlert, and Stefan Grimme (2024)
Highlighted by Jan Jensen



When I first saw this paper on SoMe I was incredibly excited because I thought it was the release of the long-anticipated g-xTB method (I have a bad habit of reading very superficially and seeing what I want to see). But then I skimmed the abstract, saw my mistake, and promptly forgot about the paper until I saw Jan-Michael Mewes' recent Bluesky thread.

The paper describes a fully differentiable, Python-based PyTorch implementation of GFN1-xTB. In the paper they use it to compute some new molecular properties, but the real strength will be in developing new xTB methods for specific applications, i.e. a physics-based alternative to ML potentials. Jan gives an illustrative example of this in his thread.
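
To make "fully differentiable" concrete, here is a minimal sketch of what automatic differentiation gives you: the derivative of an energy with respect to a model parameter, without deriving it by hand. Everything in it is an assumption for illustration. The energy is a toy Lennard-Jones pair term rather than an xTB Hamiltonian, and the small `Dual` class is a dependency-free stand-in for PyTorch's autograd, not the dxtb API.

```python
from dataclasses import dataclass


@dataclass
class Dual:
    """Value plus derivative, propagated together (forward-mode autodiff)."""
    val: float
    der: float = 0.0

    @staticmethod
    def lift(x):
        return x if isinstance(x, Dual) else Dual(float(x))

    def __add__(self, other):
        o = Dual.lift(other)
        return Dual(self.val + o.val, self.der + o.der)

    def __sub__(self, other):
        o = Dual.lift(other)
        return Dual(self.val - o.val, self.der - o.der)

    def __mul__(self, other):
        o = Dual.lift(other)
        return Dual(self.val * o.val, self.der * o.val + self.val * o.der)

    __rmul__ = __mul__

    def __truediv__(self, other):
        o = Dual.lift(other)
        return Dual(self.val / o.val,
                    (self.der * o.val - self.val * o.der) / o.val ** 2)

    def __pow__(self, n):  # integer powers only
        return Dual(self.val ** n, n * self.val ** (n - 1) * self.der)


def toy_energy(r, eps, sigma):
    """Lennard-Jones pair energy; a toy stand-in, not an xTB Hamiltonian."""
    x = (sigma / r) ** 6
    return 4.0 * eps * (x * x - x)


def energy_and_sigma_gradient(r, eps, sigma):
    """Return (E, dE/dsigma) by seeding the derivative of sigma with 1."""
    out = toy_energy(r, eps, Dual(sigma, 1.0))
    return out.val, out.der


energy, grad = energy_and_sigma_gradient(r=1.2, eps=1.0, sigma=1.0)
print(f"E = {energy:.6f}, dE/dsigma = {grad:.6f}")
```

A parameter gradient like `dE/dsigma` is exactly what a gradient-based refit against reference data needs, which is why differentiability matters for building bespoke methods.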

While this application is mentioned in the paper, the paper doesn't contain an actual example of it. It remains to be seen how fiddly the actual retraining will be compared to MLPs, but the hope is that the bespoke xTB methods will require significantly less training data and be more broadly applicable than MLPs.

That's assuming that g-xTB doesn't solve all our problems, which is very much my expectation based on Grimme's talks about it (but keep in mind that my listening skills are even worse than my reading skills).



This work is licensed under a Creative Commons Attribution 4.0 International License.



Wednesday, November 27, 2024

The vDZP Basis Set Is Effective For Many Density Functionals

Corin C. Wagen and Jonathon E. Vandezande (2024)
Highlighted by Jan Jensen



While this is an interesting paper, a cursory reading (like the one I did initially) can leave you with the wrong impression. The paper shows that the vDZP basis set that Grimme and co-workers developed as part of the ωB97X-3c method gives good results with other functionals. The results are always better than using other DZ basis sets and sometimes better than using TZ or even QZ basis sets, depending on the property! That's the good news.

The bad news is that the vDZP basis set is about 40% more expensive than a TZ basis set (at least for typical organic molecules). The reason is that vDZP consists of more primitives than a typical TZ basis set (but considerably fewer than a typical QZ basis set).

So, for me, the main take-home message is that there is a basis set that is somewhere between TZ and QZ in cost, that may be worth trying if the TZ results are not acceptable but QZ is too expensive. However, the paper doesn't show any convincing examples of this. Yes, for isomerization reactions, B97-D3BJ/vDZP is more accurate than B97-3c (which uses the mTZVP basis set) and even B97-D3BJ/def2-QZVP.  But you get much better (and faster) results by using r2SCAN-3c (which uses the mTZVPP basis set).

One exception: if you are working with molecules that contain a lot of heavy atoms (beyond the carbon row), vDZP may be faster than TZ basis sets because it uses ECPs.




Thursday, October 31, 2024

Lifelong Machine Learning Potentials

Marco Eckhoff and Markus Reiher (2023)
Highlighted by Jan Jensen

While machine learning potentials (MLPs) can give you DFT accuracy at FF costs, they also come with some practical problems: they often need to be retrained from scratch when adding new data to avoid catastrophic forgetting, and most structural descriptors struggle to efficiently represent a large number of different chemical elements.

This paper presents solutions to some of these problems by introducing element-embracing atom-centered symmetry functions (eeACSFs) that incorporate periodic table trends, enabling efficient multi-element handling, and by proposing a lifelong learning framework that includes continual learning strategies, the continual resilient (CoRe) optimizer, and uncertainty quantification to allow MLPs to adapt to new data incrementally without losing prior knowledge.

The eeACSFs differ from conventional ACSFs by integrating element information based on periodic table trends rather than creating separate descriptors for each element combination, which allows them to efficiently handle systems with multiple elements without a combinatorial increase in descriptor size.

The CoRe optimizer is designed to balance efficient convergence with stability, adapting dynamically to the learning context. It combines the robustness of RPROP (resilient backpropagation) with the performance benefits of Adam. Specifically, the optimizer adjusts learning rates based on gradient history, which allows for faster convergence initially and a more stable final accuracy. Additionally, it includes a plasticity factor that selectively freezes parameters critical to prior knowledge while allowing other parameters to adapt. This prevents the “catastrophic forgetting” problem common in continual learning, where new learning can overwrite prior knowledge.

The lifelong learning approach includes adaptive selection factors, where each data point has a selection factor that updates based on its contribution to the loss function. If a point is well-represented in training, its selection factor decreases, reducing its likelihood of being chosen in future training epochs. Conversely, data that are underrepresented have higher selection factors, ensuring they are revisited. In addition, redundant data (those with low loss contributions) are excluded from the training set, reducing the memory and computational load. Data points that the model consistently fails to learn are also excluded, which improves training efficiency and prevents model instability from conflicting data.
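
The selection-factor idea can be sketched in a few lines. This is only an illustration of the bookkeeping described above, not the update rule from the paper: the multiplicative `down`/`up` steps, the `floor` and `ceiling` thresholds, and the comparison against the mean loss are all hypothetical choices.

```python
import random


def update_selection_factors(factors, losses, down=0.9, up=1.1,
                             floor=0.05, ceiling=20.0):
    """Return updated factors; None marks a point dropped from training.

    All constants are hypothetical and chosen only for illustration.
    """
    active = [l for f, l in zip(factors, losses) if f is not None]
    mean_loss = sum(active) / len(active)
    updated = []
    for factor, loss in zip(factors, losses):
        if factor is None:
            updated.append(None)
            continue
        # Well-learned points are picked less often, poorly learned ones more.
        factor *= down if loss < mean_loss else up
        # Drop redundant points (too low) and unlearnable ones (too high).
        if factor < floor or factor > ceiling:
            factor = None
        updated.append(factor)
    return updated


def draw_batch(factors, batch_size, rng):
    """Sample indices of active points, weighted by selection factor."""
    indices = [i for i, f in enumerate(factors) if f is not None]
    weights = [factors[i] for i in indices]
    return rng.choices(indices, weights=weights, k=batch_size)


factors = update_selection_factors([1.0, 1.0, 0.05, 19.0],
                                   [0.1, 2.0, 0.1, 2.0])
print(factors)
print(draw_batch(factors, batch_size=5, rng=random.Random(0)))
```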

The paper acknowledges that further refinement is needed for scenarios involving the addition of new chemical systems. Although the lMLP can expand its conformation space efficiently, accuracy still falls slightly below training on a single large dataset. Additionally, the method’s application to other MLP architectures and addressing consistency across different electronic states and computational methods remain areas for future work. The authors also suggest that larger and more diverse datasets will be necessary to fully realize the potential of lMLPs in simulating complex chemical systems.






Sunday, September 29, 2024

Toy Models of Superposition

Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, Tom Henighan, Shauna Kravec, Zac Hatfield-Dodds, Robert Lasenby, Dawn Drain, Carol Chen, Roger Grosse, Sam McCandlish, Jared Kaplan, Dario Amodei, Martin Wattenberg, Christopher Olah (2022)
Highlighted by Jan Jensen



Most NNs are notoriously hard to interpret. While there are a few cases, mostly in image classification, where some features (like lines or corners) can be assigned to particular neurons, in general it seems like every part of the NN contributes to every prediction. This paper provides some powerful insight into why this is, by analysing simple toy models.

The study builds on the idea that the output of a hidden layer is an N-dimensional embedding vector (V) that encodes a feature of the data (N is the number of neurons in the layer). You might have seen this famous example from language models: V("king") - V("man") + V("woman") = V("queen").

Naively, one would expect that an N-neuron layer can encode N different features, since there are N different (i.e. orthogonal) vectors. However, the paper points out that the number of almost orthogonal vectors (say, with angles between 89° and 91°) increases exponentially with N, so that NNs can represent many more features than they have dimensions, which they term "superposition".
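
A quick numerical experiment gives a feel for why there is so much room for almost orthogonal vectors. The sketch below is my own illustration, not an analysis from the paper: it draws pairs of random Gaussian vectors and measures how far the angle between them is from 90° on average. The dimensions, the number of pairs, and the seed are arbitrary choices.

```python
import math
import random


def random_angle_deg(dim, rng):
    """Angle in degrees between two random Gaussian vectors of length dim."""
    a = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    b = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    cosine = max(-1.0, min(1.0, dot / norm))  # guard against rounding
    return math.degrees(math.acos(cosine))


def mean_deviation_from_90(dim, n_pairs=200, seed=0):
    """Average |angle - 90 degrees| over n_pairs random vector pairs."""
    rng = random.Random(seed)
    total = sum(abs(random_angle_deg(dim, rng) - 90.0) for _ in range(n_pairs))
    return total / n_pairs


for dim in (3, 30, 300, 3000):
    print(dim, round(mean_deviation_from_90(dim), 2))
```

The deviation shrinks steadily as the dimension grows, so in a wide layer even randomly chosen directions interfere with each other very little.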

Since most features are stored in almost orthogonal vectors, they will necessarily have many non-zero contributions and thus cannot be assigned to a specific neuron. The authors further show that the superposition is driven by data sparsity, i.e. few examples of a particular input feature: more data sparsity, more superposition, less interpretability.

The paper is very thorough and there are many more insights that I have skipped. But I hope this highlight has made you curious enough to have a look at the paper. I can also recommend this brilliant introduction to superposition by 3Blue1Brown to get you started.

Now, it's important to note that these insights are obtained by analysing simple toy problems. It will be interesting to see if and how they apply to real-world applications, including chemistry. 





Wednesday, August 28, 2024

Variational Pair-Density Functional Theory: Dealing with Strong Correlation at the Protein Scale

Mikael Scott, Gabriel L. S. Rodrigues, Xin Li, and Mickael G. Delcey (2024)
Highlighted by Jan Jensen

As I've said before, one of the big problems in quantum chemistry is that we still can't routinely predict the reactivity of TM-containing compounds with the same degree of accuracy as we can for organic molecules. This paper might offer a solution by combining CASSCF with DFT in a variational way.

While such a combination has been done before, that implementation basically computes the DFT energy based on the CASSCF density. If you haven't heard of this approach, it's probably because it didn't work very well.

This paper presents a variational implementation, where you minimise the energy of a CASSCF wavefunction subject to an exchange-correlation density functional, and the results are significantly better - in some cases approaching chemical accuracy! This is pretty impressive given that they used off-the-shelf GGA functionals (BLYP and PBE), so further improvements in accuracy with bespoke functionals are quite likely.

Oh, and one of the applications presented in the paper is a multiconfigurational calculation on an entire metallo-protein!






Tuesday, July 30, 2024

Reproducing Reaction Mechanisms with Machine Learning Models Trained on a Large-Scale Mechanistic Dataset

Joonyoung F. Joung, Mun Hong Fong, Jihye Roh, Zhengkai Tu, John Bradshaw, and Connor Wilson Coley (2024)
Highlighted by Jan Jensen

Figure 1 from the paper. (c) the authors 2024

If you don't follow this particular subject, you might be surprised to learn that there isn't a large database of elementary reactions relevant to organic synthesis. Until now. 

While datasets such as Reaxys contain millions of reactions, they are typically multistep reactions. That's mostly fine for training retrosynthesis algorithms (although the authors discuss some disadvantages), but presents a challenge if you want to use more physically based methods such as QM to predict reactivity. For example, while there are some databases of transition states (TSs), they are typically for synthetically irrelevant reactions. So, while very promising methods have been developed for TS prediction, they have been trained on these datasets and thus have limited practical applicability to synthesis.

This paper is an important step towards fixing this:

"We  identified the most popular 86 reaction types in Pistachio and curated elementary reaction templates (Figure 1c) for each of these 86 reaction types with 175 different reaction conditions (e.g., types of mechanisms). ... By applying these expert elementary reaction templates to the reactants in Pistachio, we obtained the recorded products as well as unreported  byproducts and side  products. We systematically  selected  and  preserved  the  mechanistic  pathways leading to the formation of the recorded product for  each  entry,  resulting in a comprehensive dataset comprising 1.3 million overall reactions and 5.8 million elementary reactions."

The next step is now to use this data to obtain TSs for these elementary reactions - a difficult but important challenge to the CompChem community.






Sunday, June 30, 2024

Using GNN property predictors as molecule generators

Félix Therrien, Edward H. Sargent, and Oleksandr Voznyy (2024)
Highlighted by Jan Jensen

Figure 1 from the paper. (c) 2024 the authors

Now this is a very neat idea. Normally, we use back propagation to alter the weights in order to minimise the difference between the output and the ground truth. Instead, the authors use back propagation to alter the input to minimise the difference between the output and a desired value. In this case the input is the molecular adjacency matrix and the result is a molecule with the desired property.
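
Stripped of all the chemistry, the trick looks like the sketch below: the model parameters stay frozen and gradient descent is run on the input until the prediction hits a target value. The tiny `tanh` model, its made-up weights, the learning rate, and the target are all assumptions for illustration. The paper uses a trained GNN and a relaxed adjacency matrix with validity constraints, none of which is reproduced here.

```python
import math

WEIGHTS = [0.5, -1.0, 2.0]  # frozen "trained" parameters (made up)
BIAS = 0.1


def predict(x):
    """Toy property predictor standing in for a trained GNN."""
    return sum(w * math.tanh(xi) for w, xi in zip(WEIGHTS, x)) + BIAS


def optimise_input(target, x0, lr=0.05, steps=2000):
    """Gradient descent on the input x; the weights are never touched."""
    x = list(x0)
    for _ in range(steps):
        err = predict(x) - target
        # d(err**2)/dx_i = 2 * err * w_i * (1 - tanh(x_i)**2)
        grad = [2.0 * err * w * (1.0 - math.tanh(xi) ** 2)
                for w, xi in zip(WEIGHTS, x)]
        x = [xi - lr * g for xi, g in zip(x, grad)]
    return x


x_opt = optimise_input(target=1.5, x0=[0.0, 0.0, 0.0])
print(x_opt, predict(x_opt))
```

In a framework with automatic differentiation the hand-written gradient line disappears, since the same backward pass that normally updates the weights can return gradients with respect to the input instead.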

It's one of those "why didn't I think of this?" ideas, but, in practice, there are a few tricky problems to overcome. These include recasting the integer adjacency matrix as a smooth float matrix, finding the right constraints to yield valid molecules, and finding the right loss function. The authors manage to find clever solutions to all these problems and show that this simple idea actually works quite well. As I read it, the current implementation is limited to HCNOF molecules, but generalising it should not be an insurmountable task.

Even if this approach doesn't turn out to be the best generative model, it is one of these obvious (in hindsight) methods that have to be tested to justify more complicated approaches.   






Thursday, May 30, 2024

FragGT: Fragment-based evolutionary molecule generation with gene types

Joshua Meyers and Nathan Brown (2024)
Highlighted by Jan Jensen


Figure 1 from the paper. (c) The authors. Reproduced under the CC-BY license

Genetic algorithms (GAs) that make changes at the atom level (as opposed to molecular fragments) allow for a very fine-grained search of chemical space. However, some of the resulting molecules are not chemically sensible and one usually has to include a synthetic accessibility constraint in the scoring function.

However, another approach is to use fragments and include synthetic accessibility in the fragmentation scheme, which is what this study did. Specifically they use the BRICS fragmentation scheme implemented in RDKit and the corresponding combination rules to turn the genes into molecules. 

The authors do indeed find that the resulting molecules look more reasonable (though it is not quantified). However, the authors note that the method is a "relatively inefficient explorer of chemical space", requiring a large number of scoring function evaluations.

The problem is probably the short-chromosome/many-genes problem. GAs do best at optimizing long chromosomes made of only a few different genes, while the opposite is the case here: there are 211,388 unique BRICS fragments and each molecule contains only around 10 fragments. So you need to run a lot to make sure that all (reasonably) possible genes have been sampled at each position.
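
Some back-of-the-envelope arithmetic shows the scale of the problem. The sketch below rests on two simplifying assumptions of mine: every fragment is allowed at every position (the BRICS combination rules are ignored), and fragments are drawn uniformly at random. Under those assumptions the search space is on the order of 10^53 fragment sequences, and the coupon-collector estimate says you need roughly 2.7 million random draws just to see every fragment once at a single position.

```python
import math

N_FRAGMENTS = 211_388        # unique BRICS fragments quoted above
FRAGMENTS_PER_MOLECULE = 10  # "around 10" per molecule

# Size of the search space as a power of ten (ordered fragment sequences).
log10_space = FRAGMENTS_PER_MOLECULE * math.log10(N_FRAGMENTS)

# Coupon collector: expected uniform draws until every fragment has been
# seen at one position is n * H_n, with H_n ~ ln(n) + gamma + 1/(2n).
EULER_GAMMA = 0.5772156649
harmonic = math.log(N_FRAGMENTS) + EULER_GAMMA + 1.0 / (2 * N_FRAGMENTS)
expected_draws = N_FRAGMENTS * harmonic

print(f"search space ~ 10^{log10_space:.1f} sequences")
print(f"expected draws per position ~ {expected_draws:,.0f}")
```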

It presents a very interesting open challenge to the community.





Tuesday, April 30, 2024

Invalid SMILES are beneficial rather than detrimental to chemical language models

Michael A. Skinnider (2024)
Highlighted by Jan Jensen

Figure 3c from the paper. (c) The author. Reproduced under the CC-BY License

Language models (LMs) don't always produce valid SMILES and while for modern methods the percentage of invalid SMILES tends to be relatively small, much effort has been expended on making it as small as possible. SELFIES was invented as a way to make this percentage 0, since SELFIES is designed to always produce valid SMILES.

However, several studies have shown that SMILES-based LMs tend to produce molecular distributions that are closer to the training set, compared to SELFIES. This paper has figured out the reason and it turns out to be both trivial and profound at the same time.

It turns out that the main difference between the molecules produced using SMILES and SELFIES is that the former have a much larger proportion of aromatic atoms. Furthermore, this difference goes away if the SELFIES-based method is allowed to make molecules with pentavalent carbons, which are then subsequently discarded when converted from SELFIES to SMILES.

The reason for this is that in order to generate a valid SMILES or SELFIES string for an aromatic molecule you have to get the sequence of letters exactly right. If it goes wrong for SMILES it is discarded, but if it goes wrong for SELFIES it is usually turned into a valid non-aromatic molecule, i.e. the mistake is not discarded. 

For example, the correct SMILES string for benzene is "c1ccccc1", and generated strings with one more or one less "c" character ("c1cccccc1" and "c1cccc1") are invalid and will be removed. The corresponding SELFIES string for benzene is "[C][=C][C][=C][C][=C][Ring1][=Branch1]", but generated strings with one more or one less [C] character will result in non-aromatic molecules with SMILES strings like "C=C1C=CC=CC1" and "C1=CC=CC=1".
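
The effect of discarding versus keeping mistakes can be captured in a toy probability model. This is my own simplification, not a calculation from the paper: a fraction `train_fraction` of generation attempts aim at an aromatic molecule, each such attempt goes wrong with probability `error_rate`, and non-aromatic attempts never fail. Both numbers are made up.

```python
def aromatic_fraction(train_fraction, error_rate, discard_errors):
    """Fraction of aromatic molecules among the molecules that are kept."""
    ok = train_fraction * (1.0 - error_rate)  # aromatic, written correctly
    failed = train_fraction * error_rate      # aromatic attempt gone wrong
    other = 1.0 - train_fraction              # non-aromatic attempts
    if discard_errors:
        # SMILES-like: failed strings are invalid and thrown away.
        return ok / (ok + other)
    # SELFIES-like: failed strings survive as non-aromatic molecules.
    return ok / (ok + failed + other)


train = 0.6
smiles_like = aromatic_fraction(train, error_rate=0.2, discard_errors=True)
selfies_like = aromatic_fraction(train, error_rate=0.2, discard_errors=False)
print(train, round(smiles_like, 3), round(selfies_like, 3))
```

In this toy model both outputs contain fewer aromatic molecules than the training set, but discarding the mistakes keeps the distribution closer to it, which is the qualitative point of the paper.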

There are a lot of ML papers that simply observe what works best, but very few that determine why. This is one of them and it is very refreshing!






Sunday, March 31, 2024

An evolutionary algorithm for interpretable molecular representations

Philipp M. Pflüger, Marius Kühnemund, Felix Katzenburg, Herbert Kuchen, and Frank Glorius (2024)
Highlighted by Jan Jensen

Parts of Figures 2 and 6 combined. (c) 2024 Elsevier, Inc

This paper presents a very novel approach to XAI that allows for direct comparison with chemical intuition. The authors define molecular fingerprints (either binary or count) using randomly generated SMARTS patterns and then use a genetic algorithm to find the optimum fingerprint of a certain length. Here the optimum is defined as the one giving the lowest error when used with CatBoost. The GA search requires many thousands of models, so the approach is not practical for more computationally expensive ML models.

Nevertheless, the authors show that CatBoost is competitive with more sophisticated ML models even when using FP lengths as low as 256 (or even 32 in some cases). One can then analyse the SMARTS patterns to gain chemical insights. 

Even more interestingly, one can use the approach to directly compare to chemical intuition. The authors did this by asking five groups of chemists to come up with the 16 most important structural features that explain the Doyle-Dreher dataset of 3,960 Buchwald-Hartwig cross-coupling yields. ML models based on the corresponding FPs tended to perform worse than the 16-bit FPs found by the GA. However, there were also many similarities between the FPs, indicating that the method can extract features that are in agreement with chemical intuition.





Wednesday, February 28, 2024

AiZynth Impact on Medicinal Chemistry Practice at AstraZeneca

Jason D. Shields, Rachel Howells, Gillian Lamont, Yin Leilei, Andrew Madin, Christopher E. Reimann, Hadi Rezaei, Tristan Reuillon, Bryony Smith, Clare Thomson, Yuting Zheng, and Robert E. Ziegler (2024)
Highlighted by Jan Jensen

Figure 3 from this paper (c) the authors 2020. Reproduced under the CC-BY license

This is one of the rare papers where experimental chemists talk candidly about their experiences using ML models developed by others. In this case it is AiZynthFinder, which is developed at AstraZeneca Gothenburg and predicts retrosynthetic paths, while the users are mostly synthetic chemists at AstraZeneca in the UK, US, and China. The paper is really well written and well worth reading. I'll just include a few quotes below to whet your appetite.

"New users of AI tools in general are often disappointed by the failure of AI to live up to their expectations, and chemists' interaction with AiZynth is no exception. The first molecule that most new users test is one that they have personally synthesised recently, and AiZynthFinder rarely replicates their route exactly. Due in part to our self-imposed requirement to run fast searches, AiZynthFinder often gets close to a good route. Thus, experienced users seek inspiration from AiZynth rather than perfection."

"Common problems include proposals that would lead to undesired regioselectivity, functional group incompatibility, or overgeneralisation of precedented reactions to an inappropriate context."

"Early problems also included protection/deprotection cycles, which had to be intentionally penalised in order to focus AiZynth on productive chemistry. We have found that protecting group strategy is still best decided by the chemist. Thus, the AI proposals discussed in the case studies do not make heavy use of protecting groups, whereas several of the laboratory syntheses do."






Wednesday, January 31, 2024

TS-Tools: Rapid and Automated Localization of Transition States Based on a Textual Reaction SMILES Input

Thijs Stuyver (2024)
Highlighted by Jan Jensen


Figure 2 from the paper. (c) the author 2024 reproduced under the CC-BY-NC-ND licence

This paper caught my eye for several reasons. It's an open source implementation of Maeda's AFIR method, but modified for double-ended TS searches. The setup is completely automated and interfaced to xTB, so it is fast. It's applied to really challenging problems such as solvent-assisted bimolecular reactions and uncovers some important shortcomings of the xTB method.

