Tuesday, July 30, 2024

Reproducing Reaction Mechanisms with Machine Learning Models Trained on a Large-Scale Mechanistic Dataset

Joonyoung F. Joung, Mun Hong Fong, Jihye Roh, Zhengkai Tu, John Bradshaw, and Connor Wilson Coley (2024)
Highlighted by Jan Jensen

Figure 1 from the paper. (c) the authors 2024

If you don't follow this particular subject, you might be surprised to learn that there isn't a large database of elementary reactions relevant to organic synthesis. Until now. 

While datasets such as Reaxys contain millions of reactions, these are typically multistep reactions. That's mostly fine for training retrosynthesis algorithms (although the authors discuss some disadvantages), but it presents a challenge if you want to use more physically based methods such as QM to predict reactivity. For example, while there are some databases of transition states (TSs), they are typically for synthetically irrelevant reactions. So, while very promising methods have been developed for TS prediction, they have been trained on these datasets and thus have limited practical applicability to synthesis.

This paper is an important step towards fixing this:

"We identified the most popular 86 reaction types in Pistachio and curated elementary reaction templates (Figure 1c) for each of these 86 reaction types with 175 different reaction conditions (e.g., types of mechanisms). ... By applying these expert elementary reaction templates to the reactants in Pistachio, we obtained the recorded products as well as unreported byproducts and side products. We systematically selected and preserved the mechanistic pathways leading to the formation of the recorded product for each entry, resulting in a comprehensive dataset comprising 1.3 million overall reactions and 5.8 million elementary reactions."

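To make the idea in the quote a bit more concrete, here is a toy sketch of "apply elementary templates, then keep only the pathways that reach the recorded product". It is not the authors' code: species are plain string labels rather than molecules, and the `Template` class, the function names, and the four toy templates (including a lumped, non-elementary hydrolysis step used only as a competing side path) are all made up for illustration.

```python
from typing import FrozenSet, List, NamedTuple, Optional, Tuple


class Template(NamedTuple):
    """Hypothetical stand-in for an elementary reaction template."""
    name: str
    reactants: FrozenSet[str]
    products: FrozenSet[str]


def apply_template(state: FrozenSet[str], template: Template) -> Optional[FrozenSet[str]]:
    """Return the new set of species, or None if the template does not match."""
    if not template.reactants <= state:
        return None
    return (state - template.reactants) | template.products


def enumerate_pathways(start, templates, max_steps) -> List[Tuple[List[str], FrozenSet[str]]]:
    """Depth-first enumeration of every template sequence up to max_steps."""
    pathways = []

    def walk(state, steps):
        pathways.append((steps, state))
        if len(steps) == max_steps:
            return
        for template in templates:
            new_state = apply_template(state, template)
            if new_state is not None:
                walk(new_state, steps + [template.name])

    walk(frozenset(start), [])
    return pathways


def pathways_to_product(start, templates, recorded_product, max_steps=4):
    """Keep only pathways whose final species set contains the recorded product."""
    return [
        (steps, state)
        for steps, state in enumerate_pathways(start, templates, max_steps)
        if recorded_product in state
    ]


# Toy templates (assumed labels): amide formation from an acyl chloride,
# plus a lumped hydrolysis step that competes for the acyl chloride.
TOY_TEMPLATES = [
    Template("addition", frozenset({"acyl_chloride", "amine"}),
             frozenset({"tetrahedral_intermediate"})),
    Template("elimination", frozenset({"tetrahedral_intermediate"}),
             frozenset({"protonated_amide", "chloride"})),
    Template("deprotonation", frozenset({"protonated_amide", "chloride"}),
             frozenset({"amide", "HCl"})),
    Template("hydrolysis_lumped", frozenset({"acyl_chloride", "water"}),
             frozenset({"carboxylic_acid", "HCl"})),
]

kept = pathways_to_product({"acyl_chloride", "amine", "water"}, TOY_TEMPLATES, "amide")
for steps, final_state in kept:
    print(steps, sorted(final_state))
```

In this toy case only the addition, elimination, deprotonation sequence survives the filter, and the final species set still lists HCl, which plays the role of an unreported byproduct. A real implementation would of course match templates against molecular graphs rather than string labels.
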
The next step is now to use this data to obtain TSs for these elementary reactions - a difficult but important challenge to the CompChem community.



This work is licensed under a Creative Commons Attribution 4.0 International License.