
At the root of this is something I'm constantly saying to other members in my research lab, arguing with reviewers, and passionately teaching my students:

datasets are proxies, measurements are fuzzy.

This is something that is drilled into students learning statistics, and I can't for the life of me figure out why it got lost in ML (best guess: we don't require statistics courses).

Datasets are proxies: they represent the world, but they aren't the world. They should generally be seen as narrow subsets too. Your dataset's quality and type matter a lot! Things like medical image datasets also have tons of correlated factors that can easily invalidate all your results without you being aware of it. There are simple datasets we use to prove a concept (toys, mnist, cifar, etc). There are large scale datasets that have internal inconsistencies (imagenet, flowers). There are huge datasets that haven't been properly filtered/deduplicated (LAION). (There are also just shitty datasets (HumanEval).)

Thinking of datasets as proxies helps internalize the frustration that literally every production engineer faces (even outside ML and software): real-world results are inconsistent with lab results. Dataset engineering is an underappreciated art that is extremely difficult. But everyone needs to internalize that datasets are just a map, not the territory, and your navigation will only be as good as the map (many are poorly drawn maps, often on purpose).
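
To make the "haven't been properly filtered/deduplicated" point concrete, here's a minimal sketch of the cheapest possible sanity check: look for byte-identical files shared between splits. The paths are hypothetical, and real near-duplicate detection needs perceptual hashing or embedding similarity; this only catches exact copies.

    # Flag byte-identical files shared between a train and a test split by hashing
    # raw file contents. Real near-duplicate detection (the LAION-scale problem)
    # needs perceptual hashes or embedding similarity; this is just a first check.
    import hashlib
    from pathlib import Path

    def file_hashes(directory):
        """Map md5 digest -> file path for every file under `directory`."""
        hashes = {}
        for path in Path(directory).rglob("*"):
            if path.is_file():
                hashes[hashlib.md5(path.read_bytes()).hexdigest()] = str(path)
        return hashes

    train = file_hashes("data/train")   # hypothetical paths
    test = file_hashes("data/test")

    leaked = set(train) & set(test)
    print(f"{len(leaked)} byte-identical files appear in both splits")
    for digest in list(leaked)[:10]:
        print(train[digest], "<->", test[digest])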

Measurements are fuzzy: Benchmarkism is running rampant in the ML world, and it baffles me that a field whose bottom-line objective deals with alignment can't align on how we evaluate ourselves. No measurement is perfect, and many are far from it. You can train two language models to the same NLL and one might sample well while the other outputs garbage. You can train two image models to identical FIDs and one samples cleanly while the other doesn't. Likelihood also doesn't guarantee sharpness, and I could go on. You must think about the limitations of your measurements and know them in depth. This also seems to have gotten away from us; people just run the metric libraries and call it a day. We've reached a point where ImageNet classification accuracy has decoupled from downstream performance (object detection and segmentation), and things like this are confusing to production people because taking the model with the highest score doesn't always give the best-performing model for their work (even before we consider things like throughput and memory usage). It is a Goodhart problem through and through.
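
To make the FID point concrete, here's a toy sketch (my own construction, just for illustration): FID is a Fréchet distance between Gaussians fitted to feature statistics, so two sample sets with the same mean and covariance are indistinguishable to it no matter how differently they're actually distributed.

    # Toy illustration that an FID-style score only sees the first two moments of
    # the feature distribution. A standard Gaussian and a +/-1 coin-flip
    # distribution share mean 0 and unit covariance, so their Frechet distance is
    # ~0 even though samples from them look nothing alike.
    import numpy as np
    from scipy.linalg import sqrtm

    def frechet_distance(x, y):
        """Frechet distance between Gaussians fit to two sample sets (rows = samples)."""
        mu_x, mu_y = x.mean(axis=0), y.mean(axis=0)
        cov_x, cov_y = np.cov(x, rowvar=False), np.cov(y, rowvar=False)
        covmean = sqrtm(cov_x @ cov_y)
        if np.iscomplexobj(covmean):
            covmean = covmean.real
        return float(np.sum((mu_x - mu_y) ** 2) + np.trace(cov_x + cov_y - 2 * covmean))

    rng = np.random.default_rng(0)
    n, dim = 50_000, 8
    model_a = rng.standard_normal((n, dim))           # Gaussian "features"
    model_b = rng.choice([-1.0, 1.0], size=(n, dim))  # very different "features"

    print(frechet_distance(model_a, model_b))  # ~0: the metric can't tell them apart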

ML is at a serious point where we've gotten away from our basic stats training. That, not AGI, is going to pose a real danger to society. It's like handing power tools to chimps who don't know how power tools work; it won't end well. But that is happening because we've shifted focus to meeting targets, not measuring our work. Targets are easy, science is hard. Unless we bring these nuances back into how we evaluate work, we are just handing power tools to chimps without any quality assurance.




I remember working with several ML teams specifically on inference performance (latency, memory usage, etc.), and it's not a surprise to see some object detection performance variance depending on the scene.

Sometimes even capping the model architecture to keep us from exceeding performance thresholds is non-trivial in itself, but convincing "some" researchers why p99 inference latency, for example, matters more than the p50 case in a safety-critical system... that's surprisingly several orders of magnitude harder.
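
A minimal sketch of what that argument looks like in practice: time many single-inference calls and report the tail, not just the median. `run_inference` here is a hypothetical stand-in for the real model call, with an artificial rare slow path.

    # Why the p50 number hides what a safety-critical system actually has to survive.
    import time
    import numpy as np

    def run_inference(frame):
        # placeholder: a fast common path plus a rare slow path (GC, cache miss, ...)
        time.sleep(0.01 + (0.2 if np.random.rand() < 0.02 else 0.0))

    latencies = []
    for _ in range(500):
        frame = np.random.rand(3, 640, 480)    # fake input frame
        start = time.perf_counter()
        run_inference(frame)
        latencies.append(time.perf_counter() - start)

    p50, p99 = np.percentile(latencies, [50, 99])
    print(f"p50 = {p50 * 1000:.1f} ms, p99 = {p99 * 1000:.1f} ms")
    # A model that looks fine at p50 can still blow its deadline at p99.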


Most, if not all (I can't think of one), research datasets have enough problems that once you reach a certain threshold your model will start to overfit despite no divergence between training and test performance. But this is hard to explain to many people, since they only understand overfitting as that divergence. But maybe worse is that it is status quo to tune hyper-parameters on test data results[0] (instead of validation), which causes information leakage and encourages overfitting. But damned if you do, damned if you don't. (A sketch of the cleaner protocol is below the footnotes.) And don't get me started on generative models, where there aren't even test datasets[1].

[0] I blame the status quo of lazy reviewers and rejecting works based on benchmarks as well as being uninformed.

[1] I'll admit this is a sore spot right now as a reviewer took my paper's discussions about the limitations of FID and asked why I didn't present a new metric.
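
The cleaner protocol is nothing exotic: tune hyper-parameters on a validation split and touch the test split exactly once at the end. A toy sketch (sklearn just as a stand-in; the point is the data flow, not the model):

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
    X_trval, X_test, y_trval, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
    X_train, X_val, y_train, y_val = train_test_split(X_trval, y_trval, test_size=0.25, random_state=0)

    best_C, best_val = None, -np.inf
    for C in [0.01, 0.1, 1.0, 10.0]:          # hyper-parameter search uses val only
        model = LogisticRegression(C=C, max_iter=1000).fit(X_train, y_train)
        acc = model.score(X_val, y_val)
        if acc > best_val:
            best_C, best_val = C, acc

    final = LogisticRegression(C=best_C, max_iter=1000).fit(X_trval, y_trval)
    print(f"chosen C={best_C}, test accuracy={final.score(X_test, y_test):.3f}")  # reported once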


>> But maybe worse is that it is status quo to tune hyper-parameters on test data results[0] (instead of validation), which causes information leakage and encourages overfitting.

This is extremely hard to convince people about. I'm starting to doubt that even veteran researchers realise how unreliable their error estimates become if they're tuning their models on the test set, which, like you say, is standard practice. And yet, people will stand on that rotten practice and talk about the amazing generalisation ability of neural nets, and how over-parameterising and over-training defies statistical learning theory. That's why we have such gems as this paper:

The unreasonable effectiveness of deep learning in artificial intelligence

https://www.pnas.org/doi/10.1073/pnas.1907373117

Or "the grokking paper" and so on. Machine learning is starting to look more and more like the social sciences, where people simply pick and choose results from the (mostly non-peer reviewed) literature, just because they like the claim in a paper (or because it has a catchy title, or it went viral on twitter), and not because they make any serious attempt to check the results themselves.

P.S. Sorry about your review. It's a good idea to avoid any discussion that goes beyond the central claim of a paper. It can only confuse reviewers and distract them from the meat and potatoes of the work. Unfortunately, sticking to that advice makes for dry, boring papers, which also reduces the chances of acceptance.


Oh yeah, those are great papers, which I wish more people read. (I'll never not laugh at the journal name.) I'll add this one[0] too, since it seems to be widely missed. But as for validation sets, I have a hard time convincing anyone I talk to that tuning on test data is information leakage. I've also had trouble discussing with people (even outside my lab, even at big schools/labs) what "uncurated samples" means. I say "you sample one batch and show that"; they say you can sample multiple batches and pick the best one... It's the same kind of thinking, and it shows why evaluating generative papers is so difficult.
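
The "pick the best batch" habit is the same selection bias in miniature. A tiny numeric sketch (made-up scores, just to show the shape of the problem):

    # Why "sample a few batches and show the best one" is not an uncurated sample:
    # the maximum over k draws is systematically better than a typical draw.
    # Here "quality" is just a made-up scalar score per batch.
    import numpy as np

    rng = np.random.default_rng(0)
    scores = rng.standard_normal((100_000, 8))   # 8 candidate batches per "figure"

    typical = scores[:, 0].mean()                # honest: show the first batch
    curated = scores.max(axis=1).mean()          # cherry-pick the best of 8
    print(f"typical batch: {typical:+.2f}, best-of-8: {curated:+.2f}")
    # The expected max of 8 standard normals is about +1.4 sigma above the mean.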

Fwiw, I blame the conferences for this. We have too many papers to review, too few people to review them, and a system that isn't even good at drawing from the reviewer pool (I got 0 reviews to do, my coworker got 6 ¯\_(ツ)_/¯). Quality control on reviewers is non-existent and ACs don't check that they follow the reviewing rules. The result is that I can't think of a case where it isn't to your advantage to be an evil reviewer: reject everything, be lazy about it. So we promote benchmarkism and abstract away anything and everything so that nothing looks novel (and that's before we talk about collusion and ethics violations). Without a radical change I don't think the system will work anymore. There's too much incentive to cheat and play dirty now.

For my paper, all that stuff was in the appendix fwiw. Since it is a generative paper, I took my chance to do a deep sample analysis (even inventing a new technique) to analyze the biases in different models, noting key indicators that were visible to the eye. So of course I had a small discussion about how FID is limited (see [1, 2]; [2] doesn't go deep enough though) and will not capture these differences. These differences matter when you're pushing up against the best achievable FID on the dataset, which is not 0 as many people think[3]. I do feel like it is my duty as a researcher to stick my flag in the ground and point out how we need to do things better. (I do think it also made the paper much clearer, and I got good feedback from my colleagues fwiw.) That is what research is, after all. Research requires nuance, and if we're being honest, that shouldn't be something I have to say. People should deeply understand their metrics (not just for evaluating models, but for evaluating works). The system is just too noisy right now to be meaningful imo.

[0] A note on the evaluation of generative models: http://arxiv.org/abs/1511.01844

[1] Improved Precision and Recall Metric for Assessing Generative Models: https://arxiv.org/abs/1904.06991 (original is good too: https://arxiv.org/abs/1806.00035)

[2] The Role of ImageNet Classes in Fréchet Inception Distance: https://arxiv.org/abs/2203.06026

[3] Fwiw, train vs test set on CIFAR10 has an FID of 3.15, and FFHQ256 top 10k vs bottom 60k is 2.25 (which the current top paper beats, but that's 50k generated samples vs 50k random dataset samples). These of course carry biases since the set sizes are unequal, but since FID is distributional it still gives us some strong clues about the variance within the datasets. My paper didn't go this far though.
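
For anyone who wants to see their own dataset's floor, a rough sketch of the idea in [3]: compute FID between two disjoint sets of real features (e.g. Inception embeddings of train vs test images). That number, not 0, is the best any generator could honestly score at that sample size. The `.npy` file here is a hypothetical precomputed (N, D) array of embeddings.

    import numpy as np
    from scipy.linalg import sqrtm

    def frechet_distance(x, y):
        mu_x, mu_y = x.mean(axis=0), y.mean(axis=0)
        cov_x, cov_y = np.cov(x, rowvar=False), np.cov(y, rowvar=False)
        covmean = sqrtm(cov_x @ cov_y)
        if np.iscomplexobj(covmean):
            covmean = covmean.real
        return float(np.sum((mu_x - mu_y) ** 2) + np.trace(cov_x + cov_y - 2 * covmean))

    features = np.load("inception_features.npy")   # hypothetical real-image embeddings
    rng = np.random.default_rng(0)
    idx = rng.permutation(len(features))
    half_a, half_b = features[idx[:50_000]], features[idx[50_000:100_000]]
    print("estimated FID floor:", frechet_distance(half_a, half_b))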


The funny thing is that all this is clearly spelled out in PAC learning and statistical learning theory, but the trend in most machine learning courses is to avoid all that hairy stuff and concentrate on the practical tasks of training classifiers, using popular libraries, and so on.

Btw, in the above comment:

FID = Fréchet Inception Distance

NLL = Negative Log-Likelihood


Yeah, this is what bugs me. If these people took a single stats class it would be drilled into them. Doesn't matter if it's ISLR, PAC-LSLT, SR, or ROS; it will be there. But it isn't. I push for as much of this as I can when I teach my ML course, but my advisor pushes back for it being "too mathy." I swear, ML people are averse to math (though they like adding meaningless math equations to papers to make them look technical, but that's probably aligned with the former point). As the current environment stands, I do not think many ML people understand concepts like data leakage, Bayes, or even likelihood. Part of the problem is that the popular way to teach ML is the way you teach software: through coding. That doesn't work when you need statistical theory to understand your results.



