Monday, March 31, 2025

Designing Target-specific Data Sets for Regioselectivity Predictions on Complex Substrates

Jules Schleinitz, Alba Carretero-Cerdán, Anjali Gurajapu, Yonatan Harnik, Gina Lee, Amitesh Pandey, Anat Milo, and Sarah Reisman (2025)
Highlighted by Jan Jensen



Have you ever looked at a poorly performing ML model and thought: "Hmm, maybe I should make my training set smaller"? Me neither. 

Well, this paper shows an example where that actually works. In particular, the authors show cases where regioselectivity predictions for some molecules improve when only part of the available data is used. If you start with a very small training set, the model predicts the wrong reaction site in the molecule; as you add more data, the model often arrives at the right prediction eventually. However, if you keep adding data, the model starts making a wrong prediction again! In other words, they get a much better classification model if they make a bespoke training set for each molecule.

This raises two important questions: 1) Does this apply to all datasets and properties? and 2) How do you figure out which data points to include in your training set for a particular molecule when you don't know the right answer?

If the answer to the first question is "no" (and it probably is), then how do we figure out when this is a good strategy, other than by trial and error? I suspect that predictions of local properties (such as the reactivity of an atom) are more likely to benefit from bespoke training sets than global properties such as solubility. But that is just a guess.

Another guess is that this will apply mostly to small, inhomogeneous datasets. If so, we could easily generate bespoke models for each individual prediction on-the-fly, if that would lead to better predictions. But we need to figure out the answer to question 2 first.
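To make the on-the-fly idea concrete, here is a minimal sketch of what such a workflow might look like. This is not the method from the paper; it is an illustrative toy in which, for each query, we keep only the k training points most similar to it (here by Euclidean distance in some descriptor space) and then make a prediction from that bespoke subset. All function names and data are hypothetical.

```python
# Illustrative sketch of a "bespoke training set per prediction" workflow.
# Not the paper's method: similarity metric, subset size, and the final
# model (a simple majority vote here) are all stand-in assumptions.
from math import dist

def bespoke_subset(query, X, y, k=3):
    """Return the k training points nearest to the query (Euclidean distance)."""
    ranked = sorted(zip(X, y), key=lambda pair: dist(query, pair[0]))
    return ranked[:k]

def predict(query, X, y, k=3):
    """Majority vote over the bespoke subset (a stand-in for any ML model
    retrained on-the-fly for each individual prediction)."""
    subset = bespoke_subset(query, X, y, k)
    votes = [label for _, label in subset]
    return max(set(votes), key=votes.count)

# Toy data: two reactive-site classes in a 2-D descriptor space.
X = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
y = ["site_A", "site_A", "site_A", "site_B", "site_B", "site_B"]

print(predict((0.05, 0.1), X, y))  # query near the first cluster
```

The open question (question 2 above) is precisely how to choose the subset well: here the nearest-k rule is pure assumption, and in practice the right selection criterion is exactly what we don't yet know.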

I also think that if we can understand how additional data can hurt a model's performance, it would give us some valuable insights into how ML models learn.



This work is licensed under a Creative Commons Attribution 4.0 International License.