James M. Stevenson, Leif D. Jacobson, Yutong Zhao, Chuanjie Wu, Jon Maple, Karl Leswing, Edward Harder, and Robert Abel (2019)
Highlighted by Jan Jensen
There are a lot of ML models trained on HCNO data sets. This is fine for proof-of-concept, but severely limits applications to real-world problems. For example, HCNO-only molecules comprise only 46% of the molecules in the ChEMBL database of drug-like molecules.
The main problem with extending these methods to other elements is that the size of the chemical space grows non-linearly with respect to the number of different elements. Furthermore, there aren't any generally available and comprehensive sets of molecules similar to QMx or GDB-x.
The current study extends the ANI-1 method to H, C, N, O, S, F, Cl, and P, which covers 94% of ChEMBL. The authors use a combination of stochastic sampling and extensive pre-screening to distill the training/validation set to only 10 million DFT single-point calculations on relatively small molecules, which took "just a few days and at a very reasonable compute cost".
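To make the coverage figures concrete, here is a minimal sketch of how such a percentage could be computed: a molecule counts as covered when every element it contains is in the supported set. The function name `coverage` and the toy molecules are hypothetical and purely illustrative; this is not the authors' procedure, and no real ChEMBL data is used.

```python
# Elements supported by the eight-element model described above.
SUPPORTED = frozenset({"H", "C", "N", "O", "S", "F", "Cl", "P"})


def coverage(molecules, supported=SUPPORTED):
    """Return the fraction of molecules whose elements all lie in `supported`.

    `molecules` is an iterable of element-symbol sets, one per molecule.
    """
    mols = list(molecules)
    if not mols:
        raise ValueError("need at least one molecule")
    covered = sum(1 for elements in mols if set(elements) <= supported)
    return covered / len(mols)


# Toy data (hypothetical): element sets for four made-up molecules.
toy = [
    {"C", "H", "O"},
    {"C", "H", "N", "Cl"},
    {"C", "H", "Br"},        # Br is outside the eight-element set
    {"C", "H", "O", "S"},
]

print(coverage(toy))                              # eight-element set
print(coverage(toy, frozenset({"H", "C", "N", "O"})))  # HCNO only
```

On a real database the element sets would come from parsing each structure with a cheminformatics toolkit, which is left out here to keep the sketch self-contained.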
The main focus was on relative conformer energies, since the bulk of CPU time for many studies is typically spent on conformational searches. The RMSE for these energies is 0.70 kcal/mol relative to DFT, which is quite impressive.
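As an illustration of what an RMSE on relative conformer energies measures, the sketch below compares model and reference energies for the conformers of one molecule after subtracting the energy of a common reference conformer. Choosing the lowest-energy conformer at the reference level as the zero point is an assumption made here; the paper may define relative energies differently. The function name `conformer_rmse` and the numbers are hypothetical.

```python
import math


def conformer_rmse(e_model, e_ref):
    """RMSE between model and reference relative conformer energies.

    Both lists hold absolute energies (same units, same conformer order).
    Energies are taken relative to the conformer that is lowest in `e_ref`
    (an assumed convention), so any constant offset between the two
    methods cancels.
    """
    if not e_ref or len(e_model) != len(e_ref):
        raise ValueError("need two non-empty lists of equal length")
    i0 = min(range(len(e_ref)), key=e_ref.__getitem__)  # reference conformer
    errors = [
        (m - e_model[i0]) - (r - e_ref[i0])
        for m, r in zip(e_model, e_ref)
    ]
    return math.sqrt(sum(d * d for d in errors) / len(errors))


# Made-up energies in kcal/mol for three conformers of one molecule.
e_dft = [0.0, 1.0, 2.5]
e_nn = [10.0, 11.5, 12.0]   # shifted by a constant that drops out

print(conformer_rmse(e_nn, e_dft))
```

Note that the reference conformer contributes an error of exactly zero under this convention, so the value depends on whether it is included in the average; the sketch includes it.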
As the name suggests, the work was done at Schrödinger, so the method is not open sourced. However, an earlier version for 4 elements is available here. More importantly, the methodology behind the dataset generation is well described and appears to be practically feasible for academic labs.
This work is licensed under a Creative Commons Attribution 4.0 International License.