Philipp M. Pflüger, Marius Kühnemund, Felix Katzenburg, Herbert Kuchen, and Frank Glorius (2024)
Highlighted by Jan Jensen
This paper presents a novel approach to explainable AI (XAI) that allows for direct comparison with chemical intuition. Molecular fingerprints (FPs; either binary or count) are defined using randomly generated SMARTS patterns, and a genetic algorithm (GA) is then used to find the optimum fingerprint of a given length. Here the optimum is defined as the fingerprint giving the lowest error when used with CatBoost. The GA search requires training many thousands of models, so the approach is not practical for more computationally expensive ML models.
Nevertheless, the authors show that CatBoost is competitive with more sophisticated ML models even when using FP lengths as low as 256 (or even 32 in some cases). One can then analyse the SMARTS patterns to gain chemical insights.
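To make the search concrete, here is a minimal, self-contained sketch of the idea. It is not the authors' implementation, and every name and number in it is an assumption made for illustration: plain substring matching on SMILES stands in for SMARTS substructure matching (which would normally be done with a cheminformatics toolkit), a simple group-mean predictor stands in for CatBoost, the error is measured on the training data, and the molecules and target values are made up.

```python
import random

# Toy data: SMILES strings with made-up target values (illustrative only).
MOLECULES = ["CCO", "CCN", "c1ccccc1O", "c1ccccc1N",
             "CC(=O)O", "CC(=O)N", "CCCl", "c1ccccc1Cl"]
TARGETS = [1.0, 2.0, 3.0, 4.0, 1.5, 2.5, 0.5, 2.5]

# Candidate patterns: plain substrings standing in for SMARTS patterns.
PATTERNS = ["O", "N", "Cl", "c1ccccc1", "C(=O)", "CC", "=", "("]


def fingerprint(smiles, pattern_ids, counts=False):
    """Binary (or count) fingerprint with one entry per selected pattern."""
    values = [smiles.count(PATTERNS[i]) for i in pattern_ids]
    return tuple(values if counts else [int(v > 0) for v in values])


def fitness(pattern_ids):
    """Mean absolute error of a group-mean predictor (stand-in for CatBoost)."""
    groups = {}
    for smi, y in zip(MOLECULES, TARGETS):
        groups.setdefault(fingerprint(smi, pattern_ids), []).append(y)
    error = 0.0
    for ys in groups.values():
        mean = sum(ys) / len(ys)
        error += sum(abs(y - mean) for y in ys)
    return error / len(TARGETS)


def mutate(individual, rng):
    """Swap one selected pattern for a currently unused one."""
    child = list(individual)
    unused = [i for i in range(len(PATTERNS)) if i not in child]
    child[rng.randrange(len(child))] = rng.choice(unused)
    return tuple(sorted(child))


def crossover(a, b, rng, length):
    """Draw a fixed-length child from the union of both parents' patterns."""
    pool = sorted(set(a) | set(b))
    return tuple(sorted(rng.sample(pool, length)))


def run_ga(length=3, pop_size=12, generations=20, seed=0):
    """Search for the fixed-length pattern set with the lowest error."""
    rng = random.Random(seed)
    population = [tuple(sorted(rng.sample(range(len(PATTERNS)), length)))
                  for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fitness)
        parents = population[: pop_size // 2]  # keep the better half
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            children.append(mutate(crossover(a, b, rng, length), rng))
        population = parents + children
    best = min(population, key=fitness)
    return best, fitness(best)


if __name__ == "__main__":
    best, error = run_ga()
    print([PATTERNS[i] for i in best], error)
```

Every fitness evaluation corresponds to fitting one model, which is why a cheap learner matters: even this toy run scores a population of 12 candidates in each of 20 generations. The patterns in the winning individual are what one would then inspect for chemical insight.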
Even more interestingly, one can use the approach to make a direct comparison with chemical intuition. The authors did this by asking five groups of chemists to come up with the 16 most important structural features that explain the Doyle-Dreher dataset of 3,960 Buchwald-Hartwig cross-coupling yields. ML models based on the corresponding FPs tended to perform worse than the 16-bit FPs found by the GA. However, there were also many similarities between the FPs, indicating that the method can extract features that agree with chemical intuition.
This work is licensed under a Creative Commons Attribution 4.0 International License.