While this is indeed clickbait, as others have mentioned, I am consistently shocked by how rarely cross-validation, the most common technique for checking that a model you trained works on unseen data, is actually used in the real world.
I had it drilled into my brain that I really shouldn't trust anything except the average validation score of a K-fold cross-validated model (preferably with a high K) when trying to get an idea of how well my ML algorithm performs on unseen data; see the sketch below. Apparently most people in my field (NLP) did not have this drilled into their heads. This is partly why NLP is filled with unreproducible scores: the magic score, if it was ever found at all, was only found on the seed #3690398 train/test split.
As far as I'm concerned, if you didn't cross-validate, the test set score is basically useless.
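For anyone who hasn't seen the workflow: a minimal sketch of the "average over K folds" idea with scikit-learn. The dataset and the logistic regression here are just placeholders for whatever model and data you actually have.

    # Minimal K-fold cross-validation sketch: report the mean score across folds,
    # not the score from whichever single split happened to look best.
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import KFold, cross_val_score

    X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
    model = LogisticRegression(max_iter=1000)

    cv = KFold(n_splits=10, shuffle=True, random_state=0)  # reasonably high K
    scores = cross_val_score(model, X, y, cv=cv)

    print("per-fold scores:", np.round(scores, 3))
    print(f"mean +/- std: {scores.mean():.3f} +/- {scores.std():.3f}")

The mean and spread across folds is the number worth reporting; a single lucky split tells you very little.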
The point of the article is more that even if all of your testing and validation is rigorous and the performance looks great, trivial changes in the production data can break your model anyway.
My view is that all high-value production models should include out-of-distribution detection, uncertainty quantification, and other model-specific safeguards (like self-consistency) to confirm that the model is only being used to predict on data it is competent to handle.
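To make the OOD-detection piece concrete, here is a hedged sketch of one simple safeguard: flag inputs whose feature representation sits unusually far (in Mahalanobis distance) from the training data. The feature extractor, the 99th-percentile threshold, and the function names are illustrative assumptions, not a standard recipe.

    # Sketch of a simple out-of-distribution check on feature vectors.
    # `train_features` stands in for whatever embedding the production model uses.
    import numpy as np

    def fit_ood_detector(train_features, quantile=0.99):
        mean = train_features.mean(axis=0)
        cov_inv = np.linalg.pinv(np.cov(train_features, rowvar=False))  # pinv handles ill-conditioned covariances

        def mahalanobis(x):
            d = np.atleast_2d(x) - mean
            return np.sqrt(np.einsum("ij,jk,ik->i", d, cov_inv, d))

        threshold = np.quantile(mahalanobis(train_features), quantile)

        def is_ood(x):
            # True for rows that look unlike anything seen during training.
            return mahalanobis(x) > threshold

        return is_ood

In production the point would be to refuse to serve a prediction (or route it to a human) whenever the check fires.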
All that is only needed because incremental learning algorithms don't really work all that well. It's a dirty secret in the field that we still don't have good answers for catastrophic forgetting in neural networks (the best candidate incremental learner as of right now), and the other alternatives are far worse.
This is good to have but it doesn't really address the problem of predictive accuracy in the presence of nonstationarity. The safeguards just help us switch off the model at the right time. We're still stuck with no capability in the new environment.
I think knowing what you don't know is still a pretty big win. It can help people trust the models in cases where they do work, and it can serve as a diagnostic for why they fail in certain circumstances.
This sounds an awful lot like Gaussian processes, which are fairly common in research environments. I don't know how common it is to deploy Gaussian processes in the real world, but I see published papers integrating them into other models all the time. The gist is that instead of input -> prediction you get input -> prediction + sigma (every prediction is given as a Gaussian distribution).
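A rough illustration of that "prediction + sigma" behavior with scikit-learn's GaussianProcessRegressor; the kernel and the toy sine data are purely illustrative, and real deployments need much more careful kernel and noise modelling.

    # Toy GP regression: each prediction comes back as a mean plus a standard deviation.
    import numpy as np
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import RBF, WhiteKernel

    rng = np.random.default_rng(0)
    X_train = rng.uniform(-3, 3, size=(50, 1))
    y_train = np.sin(X_train).ravel() + 0.1 * rng.normal(size=50)

    gp = GaussianProcessRegressor(kernel=RBF() + WhiteKernel(), normalize_y=True)
    gp.fit(X_train, y_train)

    X_test = np.array([[0.5], [10.0]])  # one point in range, one far outside it
    mean, sigma = gp.predict(X_test, return_std=True)
    for x, m, s in zip(X_test.ravel(), mean, sigma):
        print(f"x={x:5.1f}  prediction={m:6.3f}  sigma={s:.3f}")

The sigma for the far-away point should revert towards the prior, which is exactly the kind of "I don't know" signal being discussed here.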
For common neural network models, the output probability has no meaning for out-of-distribution inputs, so you need an ensemble or some other method to get at the actual model confidence (a rough sketch of the ensemble approach is below). I don't know enough about Gaussian processes to know if they have any limitations like that.
But it's an interesting point: if a CNN works better on in-distribution data, while a Gaussian process is better at providing a confidence estimate for OOD points (if it is), a hybrid model is possible.
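For what it's worth, the cheapest version of the ensemble idea mentioned above is a deep ensemble: train several copies of the model with different seeds and treat their disagreement as an uncertainty signal. A hedged sketch, where `models` is a hypothetical list of already-trained classifiers exposing a predict_proba-style interface:

    # Average the members' predicted probabilities; use their variance as a crude
    # confidence signal (high variance means the members disagree).
    import numpy as np

    def ensemble_predict(models, X):
        probs = np.stack([m.predict_proba(X) for m in models])  # (n_models, n_samples, n_classes)
        mean_probs = probs.mean(axis=0)
        disagreement = probs.var(axis=0).mean(axis=-1)
        return mean_probs.argmax(axis=-1), mean_probs, disagreement

One might then refuse to act whenever disagreement exceeds a threshold calibrated on held-out, in-distribution data.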
No, this does not solve the problem that he describes in the article. You can have a great cross-validation score and still struggle on unseen data if that data is relatively dissimilar from your train set, like X-ray scans produced by a different machine. There are numerous other examples. Image CNNs, for example, are famously known to fall apart on images with added white noise (which look the same to a human).
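One cheap way to probe that particular failure mode before shipping is to re-run the evaluation on noise-corrupted copies of the test set and watch how fast accuracy falls off. A sketch, assuming a generic trained image classifier `model` and a test set with pixel values in [0, 1] (both placeholders):

    # Re-evaluate the same model under increasing Gaussian pixel noise.
    import numpy as np

    def noisy_accuracy(model, X_test, y_test, sigmas=(0.0, 0.05, 0.1, 0.2)):
        rng = np.random.default_rng(0)
        results = {}
        for sigma in sigmas:
            X_noisy = np.clip(X_test + rng.normal(0.0, sigma, X_test.shape), 0.0, 1.0)
            results[sigma] = float((model.predict(X_noisy) == y_test).mean())
        return results

If the curve collapses at noise levels a human barely notices, cross-validation on clean data was never going to catch it.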
I think cross-validation is a powerful tool, but it is not always necessary and is prone to abuse, such as overfitting to the test set.
A skilled modeller can reduce variance using domain-specific tricks that are more effective at variance reduction than cross-validation. But still, cross-validation is usually good to use as well.