Michael A. Skinnider (2024)
Highlighted by Jan Jensen
Figure 3c from the paper. (c) The author. Reproduced under the CC-BY License
Language models (LMs) don't always produce valid SMILES, and while the percentage of invalid SMILES tends to be relatively small for modern methods, much effort has been expended on making it as small as possible. SELFIES was invented as a way to make this percentage 0, since SELFIES is designed so that every string converts to a valid SMILES.
However, several studies have shown that SMILES-based LMs tend to produce molecular distributions that are closer to the training set than those produced by SELFIES-based LMs. This paper has figured out the reason, and it turns out to be both trivial and profound at the same time.
It turns out that the main difference between the molecules produced using SMILES and SELFIES is that the former contain a much larger proportion of aromatic atoms. Furthermore, this difference goes away if the SELFIES-based method is allowed to make molecules with pentavalent carbons, which are then discarded when converted from SELFIES to SMILES.
The reason for this is that in order to generate a valid SMILES or SELFIES string for an aromatic molecule you have to get the sequence of characters exactly right. If it goes wrong for SMILES, the string is invalid and is discarded, but if it goes wrong for SELFIES, it is usually turned into a valid non-aromatic molecule, i.e. the mistake is kept rather than discarded.
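To see why discarding mistakes matters, here is a minimal toy model in Python. It is not from the paper: the function name `aromatic_fraction` and the numbers (80% aromatic molecules in the training set, a 10% chance of botching an aromatic string) are made-up values chosen only for illustration, and the model assumes that non-aromatic molecules are never generated incorrectly.

```python
def aromatic_fraction(p_aromatic, p_error, discard_errors):
    """Fraction of aromatic molecules among the kept samples (toy model).

    p_aromatic     -- fraction of aromatic molecules the model tries to make
    p_error        -- chance that an aromatic attempt goes wrong (hypothetical)
    discard_errors -- True: failed attempts are thrown away (SMILES-like)
                      False: failed attempts survive as non-aromatic
                             molecules (SELFIES-like)
    """
    correct_aromatic = p_aromatic * (1 - p_error)
    non_aromatic = 1 - p_aromatic
    if discard_errors:
        # Failed aromatic attempts vanish, so renormalise over what is kept.
        return correct_aromatic / (correct_aromatic + non_aromatic)
    # Failed aromatic attempts are kept and counted as non-aromatic,
    # so the total number of kept samples is unchanged.
    return correct_aromatic


# Hypothetical illustration values, not taken from the paper.
training_fraction = 0.8
error_rate = 0.1

smiles_like = aromatic_fraction(training_fraction, error_rate, discard_errors=True)
selfies_like = aromatic_fraction(training_fraction, error_rate, discard_errors=False)
print(f"training {training_fraction:.3f}  "
      f"SMILES-like {smiles_like:.3f}  SELFIES-like {selfies_like:.3f}")
```

In this toy model both values fall below the training fraction, but the one with discarding is always at least as close to it, which matches the qualitative picture described above.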
For example, the correct SMILES string for benzene is "c1ccccc1", and generated strings with one more or one less "c" character ("c1cccccc1" and "c1cccc1") are invalid and will be removed. The corresponding SELFIES string for benzene is "[C][=C][C][=C][C][=C][Ring1][=Branch1]", but generated strings with one more or one less [C] character will result in non-aromatic molecules with SMILES strings like "C=C1C=CC=CC1" and "C1=CC=CC=1".
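The SMILES half of this example can be checked with a few lines of plain Python. The sketch below is not how a cheminformatics toolkit such as RDKit works internally. It assumes a single ring written only with lowercase "c" atoms, each of which needs exactly one double bond, and the helper names are hypothetical. Under those assumptions, the string is valid only if the ring atoms can be paired up along ring bonds, i.e. if alternating single and double bonds can be assigned.

```python
def ring_bonds(n_atoms):
    """Bonds of a single ring with atoms numbered 0..n_atoms-1."""
    return [(i, (i + 1) % n_atoms) for i in range(n_atoms)]


def has_kekule_structure(n_atoms, bonds):
    """True if every atom can take part in exactly one double bond."""
    neighbours = {i: set() for i in range(n_atoms)}
    for a, b in bonds:
        neighbours[a].add(b)
        neighbours[b].add(a)

    def match(unpaired):
        # Backtracking search for a perfect matching of the unpaired atoms.
        if not unpaired:
            return True
        atom = min(unpaired)
        for other in neighbours[atom] & unpaired:
            if match(unpaired - {atom, other}):
                return True
        return False

    return match(frozenset(range(n_atoms)))


def simple_aromatic_ring_is_valid(smiles):
    """Assumes `smiles` is one ring made only of aromatic 'c' atoms."""
    n_atoms = smiles.count("c")
    return has_kekule_structure(n_atoms, ring_bonds(n_atoms))


for smiles in ("c1ccccc1", "c1cccccc1", "c1cccc1"):
    print(smiles, simple_aromatic_ring_is_valid(smiles))
```

Only the six-membered ring passes. With one "c" too many or too few, the ring has an odd number of atoms, so one atom is always left without a double bond and the string is rejected.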
There are a lot of ML papers that simply observe what works best, but very few that determine why. This is one of the few, and it is very refreshing!
This work is licensed under a Creative Commons Attribution 4.0 International License.