When the author says "if a researcher runs 400 experiments on the same train-test splits", then depending on what he means by 'test' set, that researcher is doing it wrong. In pretty much all the machine learning literature I've come across, it's drilled into you that you never look at your held-out test set until the very end. Hyperparameter optimisation and/or model selection happens on the training set, and only once you've tuned your hyperparameters and selected your best model do you run it on the test set to see how it performs.
Once you've run the model on the test set once, you can't go back to tweak your model because you're introducing bias and you no longer have any data left that your model's never seen before.
To avoid overfitting, you can use cross-validation to effectively re-use your training set and create multiple training/validation splits. (As an aside, I find it frustrating how liberally different sources switch between 'validation set' and 'test set'; it's really confusing.)
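To make that concrete, here's a minimal sketch in Python with scikit-learn (the dataset, classifier, and hyperparameter range are just placeholders): cross-validate on the training data to pick the hyperparameter, then score the untouched test set exactly once.

```python
# Minimal sketch: tune on the training data only, touch the test set once at the end.
from sklearn.datasets import load_breast_cancer            # placeholder dataset
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)

# Set aside the held-out test set and don't look at it while tuning.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Cross-validation re-uses the training set as multiple training/validation splits.
for C in [0.01, 0.1, 1.0, 10.0]:                           # candidate hyperparameter values
    model = LogisticRegression(C=C, max_iter=5000)
    scores = cross_val_score(model, X_train, y_train, cv=5)
    print(f"C={C}: mean CV accuracy = {scores.mean():.3f}")

# Only after choosing the best C (suppose it was 1.0) do we refit on the full
# training set and score the test set, once.
best = LogisticRegression(C=1.0, max_iter=5000).fit(X_train, y_train)
print("test accuracy:", best.score(X_test, y_test))
```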
Your second and third paragraphs are also exactly correct. I attempted to make those points in the post, but you've done so more effectively here.
It really seems like there should be more of a theory around these issues. Even a dreadfully abstract and/or terse VC-dimension-scariness level of a theory.
I keenly await the author's next post.
There's definitely a lot of theory, but it hasn't yet been turned into readily usable R libraries.
In Python this is easy enough with scikit-learn, and in R the caret package makes semi-automatic tuning really easy.
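For instance, scikit-learn's GridSearchCV wraps the cross-validated search in one call (a rough sketch; the estimator, dataset, and grid are just illustrative, and caret's train() plays a similar role in R):

```python
# Semi-automatic tuning: GridSearchCV runs the cross-validated grid search for you.
from sklearn.datasets import load_breast_cancer            # placeholder dataset
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

param_grid = {"C": [0.1, 1, 10], "gamma": ["scale", 0.01, 0.001]}  # illustrative grid
search = GridSearchCV(SVC(), param_grid, cv=5)              # 5-fold CV on the training set only
search.fit(X_train, y_train)

print("best params:", search.best_params_)
print("test accuracy:", search.best_estimator_.score(X_test, y_test))
```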
So the question becomes: how far do we really need to go to create business value, and does it actually make sense to go all Kaggle on the problem?
Won't trying different combinations of hyperparameters/lambda (over a small range) help us arrive at a better model than tuning them manually? Or is that what the author meant by manual tuning?
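For what it's worth, this is the kind of thing I have in mind, as a minimal sketch with scikit-learn's RidgeCV (the dataset and grid are placeholders; scikit-learn calls lambda 'alpha'):

```python
# Sweep lambda over a small range and let cross-validation pick it,
# rather than tuning it by hand.
import numpy as np
from sklearn.datasets import load_diabetes                  # placeholder regression dataset
from sklearn.linear_model import RidgeCV

X, y = load_diabetes(return_X_y=True)

alphas = np.logspace(-3, 3, 13)                             # small log-spaced grid for lambda
model = RidgeCV(alphas=alphas, cv=5).fit(X, y)

print("lambda chosen by cross-validation:", model.alpha_)
```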
As I understand it, one of the pitfalls of automatic tuning is that it becomes hard to account for things like seasonality, and you will likely end up with useless parameters. For instance, a customer ID is rarely a good variable to optimize on, even as a categorical variable, except in very specific cases; it is probably a proxy for one or more other variables that you need to tease out of the rest of the data.
(Warning: potentially me talking nonsense coming up.) Automatic tuning is no substitute for a talented analyst who knows the data well and understands the goal. But if you've got hundreds to millions of parameters, you may not really have another choice.