It makes sense that allowing for things many consider incorrect or impossible, such as by removing valence restrictions, improves prediction quality. As an example, pentavalent carbon is very much real, so disallowing it means whatever chemical system the model learns is less aligned with reality. It doesn't matter that pentavalent carbon only occurs under very specific conditions; many novel and interesting molecules do too.
No excuses necessary, I think that's a great question. Yes, like in the article, you can remove things that are extremely unlikely to be real, whether because they are extremely conditional (pentavalent carbon, for example) or simply implausible. The point is that by preventing the model from learning a distribution that captures the extremely rare though real instances of things, you end up with a distribution that departs from whatever the reality of the situation is. You end up with a biased model, and that is highly unlikely to be a good thing.
Bottom line, it is better to let a model learn the full distribution and then clip predictions that come from the extreme ends of that distribution, rather than preventing the model from considering those extremes in the first place. Either way you don't get predictions from the extremes, but the model that was allowed to consider them is more likely to do interesting things.
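Concretely, what I have in mind is something like this (a minimal sketch; `model.sample()` and `is_plausible()` are hypothetical placeholders, not anything from the paper):

```python
# Sample-then-clip: the model is left unconstrained during training and
# decoding, and implausible or extreme outputs are filtered out afterwards.
def generate_filtered(model, n, is_plausible):
    kept = []
    while len(kept) < n:
        candidate = model.sample()       # unconstrained generation
        if is_plausible(candidate):      # post-hoc clipping of the extremes
            kept.append(candidate)
    return kept
```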
I think I may have heard of a similar problem in a different field: computational thermodynamics. Solutions to a set of equations may be real or complex. The nonlinear solvers worked with real numbers exclusively and from time to time would not converge to any solution. Apparently, if one used a solver that worked with complex numbers, it would reach a solution in real space more often than the exclusively real solvers did, although solving was of course much more expensive. Your explanation seems to fit here too.
Interesting, thanks for sharing! Do you have any papers or other publicly available works to share that discuss this? I'd be very interested in reading about it.
> The point is that by preventing the model from learning a distribution that captures the extremely rare though real instances of things, you end up with a distribution that departs from whatever the reality of the situation is. You end up with a biased model, and that is highly unlikely to be a good thing.
I (think I) understand. On top of that, a model able to generate under conditions that bring about a "materially" (in realizable fact) unlikely outcome might yield interesting results precisely because the conditions it is contingent upon are so hard to reach, or so unlikely in actual fact. Makes sense.
In the article they talk about syntactically invalid SMILES formulas (mismatched parentheses and the like, see figure 1b); impossible molecules can still be expressed with syntactically valid SMILES (see the quick check sketched after the quote). But nonetheless your explanation appears valid to my non-chemist eyes. From the paper:
> Structural biases limit generalization: An ideal generative model would sample evenly from the chemical space surrounding the molecules in the training set. The observation of structural biases in the outputs of language models trained on SELFIES is at odds with this goal.
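The distinction is easy to see with RDKit, if anyone wants to poke at it (a quick sketch assuming a recent RDKit; not code from the paper):

```python
from rdkit import Chem

# Syntactically invalid SMILES: unmatched parenthesis, cannot even be parsed.
print(Chem.MolFromSmiles("CC(C"))  # -> None (parse error)

# Syntactically valid SMILES of an "impossible" molecule: pentavalent carbon.
# It parses fine if sanitization (which includes valence checking) is skipped...
print(Chem.MolFromSmiles("C(C)(C)(C)(C)C", sanitize=False))  # -> Mol object

# ...but is rejected by the default valence check.
print(Chem.MolFromSmiles("C(C)(C)(C)(C)C"))  # -> None (valence error)
```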
I am not convinced. All models are trained with teacher forcing, so the model never sees invalid sequences during training. It's only during inference that this difference appears, as shown by the experiment with SELFIES without valency constraints. To me, it seems like this is an artifact of how sampling from a model is done, rather than something about the models or training procedures themselves.
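To make the contrast concrete, here is a rough sketch in PyTorch (assuming a hypothetical autoregressive `model(token_ids) -> logits` of shape `(batch, seq_len, vocab)`; none of this is from the paper):

```python
import torch
import torch.nn.functional as F

def teacher_forced_loss(model, token_ids):
    # Training: every prediction is conditioned on the ground-truth prefix,
    # so the model is never exposed to its own invalid partial strings.
    logits = model(token_ids[:, :-1])
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        token_ids[:, 1:].reshape(-1),
    )

def sample(model, bos_id, eos_id, max_len=100):
    # Inference: each step conditions on the model's own previous samples,
    # which is the first time invalid sequences can actually arise.
    seq = torch.tensor([[bos_id]])
    for _ in range(max_len):
        logits = model(seq)[:, -1, :]
        next_id = torch.multinomial(F.softmax(logits, dim=-1), 1)
        seq = torch.cat([seq, next_id], dim=1)
        if next_id.item() == eos_id:
            break
    return seq
```

The training loss only ever grades the next token given a valid ground-truth prefix; whatever happens on invalid prefixes is purely a sampling-time phenomenon.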
I'm no chemist, but from a comp sci perspective the title claim seems overly strong. One can show that certain models of one category outperform certain models of another on certain benchmarks, or first demonstrate some impressive result with a model based on such a contrary insight. But it's very, very hard to prove that a whole category - e.g. models eliminating invalid SMILES - won't lead to competitive results.
As a chemist, it does make sense to allow 'invalid' things. You're learning a distribution of numbers, and it makes no sense to only allow the numbers you expect; if you already know what to expect, why do you need a model?
Imagine if we forced people to think only in valid math.
It would make proofs by contradiction impossible to even imagine.
Alternatively, think of computer programming. Imagine your editor forced your code to be compilable at every keystroke. It would suck (anyone who has used paredit knows the struggle ;) ).
So it's perfectly possible that making "invalid" results expressible in the language your model uses might be beneficial.
Proofs by contradiction are still expressed in valid math. Invalid math is when you start giving answers like “1.3(4æ87+^%<[{“, that is, completely disregarding the syntactic and consistency rules of the language you are using to express yourself.
> going from one to another valid expression is often faster if you can go through invalid expressions.
I still don’t think you really comprehend what an invalid expression is in this case. Ñ¥æd huiåp xiip ææææOP. It won’t make anything “faster” to “go through” these, as demonstrated rather cleanly and quickly by that last statement, which is written in invalid English.
“Invalid expression” doesn’t mean “this mathematical expression expresses an inconsistency”, like 3/0; it means “this expression isn’t a mathematical expression at all because it doesn’t follow the basic rules of the language”, like “$/(!!@?3.3.3”.