Figure 3 from the paper. © The authors 2020, reproduced under the CC BY-NC-ND 4.0 license.
The red and green columns show the accuracy of regioselectivity prediction as a function of training set size (N) for two ML models: one based on QM descriptors and the other on a graph neural network (GNN). For N = 200, QM outperforms GNN by 9%, but the accuracy of QM improves by no more than 1.5% for larger training sets. GNN does improve and ends up outperforming QM by 2.5% for large training sets.
Combining QM and GNN (QM-GNN) gives roughly the same accuracy as QM for small training sets and as GNN for large ones. To remove the cost of the QM calculations, a separate GNN model for the QM descriptors is developed and combined with the GNN model of regioselectivity (ml-QM-GNN), which gives roughly the same results much faster. Note that this GNN descriptor model is trained on a different, and much larger, data set (since no experimental data is needed) and can be used to augment other types of predictions.
The fact that ml-QM-GNN outperforms QM-GNN for N = 200 indicates that the accuracies are reliable to no better than about ±1%, since a model built on ML-predicted descriptors would presumably not beat one built on the QM descriptors it approximates. The slightly better performance of ml-QM-GNN compared to GNN for N = 2000 is therefore likely not real. So ml-QM only enhances the accuracy for N below roughly 1000 for this particular property, but it is definitely worth doing for problems with only a few hundred data points, especially now that the ml-QM model has already been developed.
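
As a rough sanity check on how much noise to expect in such accuracy numbers, one can estimate the binomial standard error of an accuracy measured on a finite test set. The sketch below is not the paper's analysis: the function names are made up for illustration, the test-set size of 900 is a hypothetical value (the actual size is not given here), and it treats the two accuracies as independent estimates, which is a simplification when both models are scored on the same test set.

```python
import math


def accuracy_standard_error(accuracy: float, n_test: int) -> float:
    """Binomial standard error of an accuracy measured on n_test predictions."""
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be a fraction between 0 and 1")
    if n_test <= 0:
        raise ValueError("n_test must be positive")
    return math.sqrt(accuracy * (1.0 - accuracy) / n_test)


def difference_is_resolved(acc_a: float, acc_b: float, n_test: int,
                           z: float = 1.96) -> bool:
    """True if |acc_a - acc_b| exceeds z combined standard errors.

    Assumes the two accuracies are independent estimates (a simplification).
    """
    se_a = accuracy_standard_error(acc_a, n_test)
    se_b = accuracy_standard_error(acc_b, n_test)
    se_diff = math.sqrt(se_a ** 2 + se_b ** 2)
    return abs(acc_a - acc_b) > z * se_diff


if __name__ == "__main__":
    # Hypothetical numbers, chosen only to illustrate the scale of the noise.
    n_test = 900
    print(accuracy_standard_error(0.90, n_test))
    print(difference_is_resolved(0.90, 0.91, n_test))
    print(difference_is_resolved(0.80, 0.90, n_test))
```

With these assumed numbers the standard error comes out at about one percentage point, so a 1% gap between two models would not be resolved, while a gap of around 10% would be.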