Michael A. Skinnider (2024)
Highlighted by Jan Jensen
Figure 3c from the paper. (c) The author. Reproduced under the CC-BY License
Language models (LMs) don't always produce valid SMILES, and while the percentage of invalid SMILES tends to be relatively small for modern methods, much effort has been expended on making it as small as possible. SELFIES was invented as a way to make this percentage 0, since SELFIES is designed so that every string converts to a valid SMILES.
However, several studies have shown that SMILES-based LMs tend to produce molecular distributions that are closer to the training set than those produced by SELFIES-based LMs. This paper has figured out the reason, and it turns out to be both trivial and profound at the same time.
It turns out that the main difference between the molecules produced using SMILES and SELFIES is that the former have a much larger proportion of aromatic atoms. Furthermore, this difference goes away if the SELFIES-based method is allowed to make molecules with pentavalent carbons, which are then discarded when converted from SELFIES to SMILES.
The reason for this is that in order to generate a valid SMILES or SELFIES string for an aromatic molecule, you have to get the sequence of characters exactly right. If it goes wrong for SMILES, the string is invalid and is discarded, but if it goes wrong for SELFIES, the string is usually turned into a valid non-aromatic molecule, i.e. the mistake is not discarded.
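To make the argument concrete, here is a toy back-of-the-envelope model (mine, not from the paper). It assumes a hypothetical generator that botches aromatic strings more often than other strings. The function name and all of the rates are made up for illustration, and the model further assumes that a botched non-aromatic SELFIES string stays non-aromatic.

```python
def aromatic_fraction(p_aromatic, err_aromatic, err_other, discard_errors):
    """Fraction of aromatic molecules among the samples that are kept.

    p_aromatic     -- fraction of aromatic molecules the model tries to write
    err_aromatic   -- chance that an aromatic string comes out wrong
    err_other      -- chance that a non-aromatic string comes out wrong
    discard_errors -- True mimics SMILES (wrong strings are invalid and dropped),
                      False mimics SELFIES (wrong aromatic strings decode to
                      some valid non-aromatic molecule and are kept)
    """
    aromatic_ok = p_aromatic * (1 - err_aromatic)
    other_ok = (1 - p_aromatic) * (1 - err_other)
    if discard_errors:
        # Only correctly written strings survive, so renormalise over them.
        return aromatic_ok / (aromatic_ok + other_ok)
    # Every sample survives; failed aromatic attempts now count as non-aromatic.
    return aromatic_ok


# Illustrative, made-up rates: 60% aromatic targets, 20% vs 5% error rates.
smiles_like = aromatic_fraction(0.6, 0.2, 0.05, discard_errors=True)
selfies_like = aromatic_fraction(0.6, 0.2, 0.05, discard_errors=False)
print(f"SMILES-like:  {smiles_like:.3f}")
print(f"SELFIES-like: {selfies_like:.3f}")
```

With these made-up rates, the SMILES-like sample ends up closer to the intended 60% aromatic fraction than the SELFIES-like one, because the failed attempts are removed instead of being relabelled as non-aromatic molecules.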
For example, the correct SMILES string for benzene is "c1ccccc1", and generated strings with one more or one less "c" character ("c1cccccc1" and "c1cccc1") are invalid and will be removed. The corresponding SELFIES string for benzene is "[C][=C][C][=C][C][=C][Ring1][=Branch1]", but generated strings with one more or one less [C] character will result in non-aromatic molecules with SMILES strings like "C=C1C=CC=CC1" and "C1=CC=CC=1".
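The SMILES half of this example can be checked without a cheminformatics toolkit. The sketch below is a toy stand-in for the kekulization step that a real SMILES parser performs: an aromatic carbon ring is only accepted if alternating double bonds can be placed so that every ring atom takes part in exactly one double bond. The helper names are hypothetical, and the parser only handles a single ring of aromatic carbons like the ones above. Anything more general needs a real toolkit.

```python
import re


def ring_size(smiles: str) -> int:
    """Count the atoms in a single aromatic carbon ring such as 'c1ccccc1'."""
    if not re.fullmatch(r"c1c+c1", smiles):
        raise ValueError("toy parser only handles strings like 'c1ccccc1'")
    return smiles.count("c")


def has_kekule_structure(n_atoms: int) -> bool:
    """Check whether double bonds can be placed around a ring of n_atoms
    carbons so that each atom has exactly one double bond."""
    ring_bonds = [(i, (i + 1) % n_atoms) for i in range(n_atoms)]

    def assign(unmatched: frozenset) -> bool:
        if not unmatched:
            return True  # every atom has received its double bond
        atom = min(unmatched)
        for a, b in ring_bonds:
            # Try each ring bond at this atom whose other end is still free.
            if atom in (a, b) and a in unmatched and b in unmatched:
                if assign(unmatched - {a, b}):
                    return True
        return False  # this atom cannot be given a double bond

    return assign(frozenset(range(n_atoms)))


for smiles in ["c1ccccc1", "c1cccccc1", "c1cccc1"]:
    ok = has_kekule_structure(ring_size(smiles))
    print(smiles, "valid" if ok else "invalid (cannot be kekulized)")
```

For a single ring this boils down to the ring size being even, which is why adding or dropping one "c" from benzene breaks the string.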
There are a lot of ML papers that simply observe what works best, but very few that determine why. This is one of the few, and it is very refreshing!
This work is licensed under a Creative Commons Attribution 4.0 International License.
In my opinion, the real take-home message of this paper is not that invalid SMILES are better - the message is that SELFIES is simply an inferior line notation for most purposes and that its "100% robustness" guarantee causes more problems than the one it "solves". Even in the original SELFIES paper this is fairly clear: the authors there show what happens when single tokens are removed, which usually leads to molecules with half of the atoms dropped, or to polyenes and macrocycles.
We shouldn't forget that the underlying truth of a chemical line notation is the chemical graph (and thus, valence bond theory). An invalid line notation does not have any chemical meaning, but in the case of SELFIES a "valid" line notation is not necessarily as chemically meaningful as you would expect, in the sense that the string similarity (e.g. Levenshtein) of two SELFIES strings will frequently be pretty far from the chemical graph similarity (via ECFP Tanimoto, graph edit distance, etc.). In fact, this relation is a bit better for SMILES, and one reason is that there are no special tricks to make a string parsable at all costs even if it leads to nonsense molecules.
I have to laud the author of this paper for clear and extended experiments, but to me, the conclusion that invalid SMILES are beneficial, while correct, is a distraction - this work to me is a call to action: line notation grammar is overhead that complicates the interpretation of chemical graphs, and clever grammar tricks to "square the circle" like those in SELFIES can complicate this even more. There has to be a better way!