
>> But maybe worse, is that it is status quo to tune hyper-parameters on test data results[0] (instead of validation), which causes information leakage and helps support overfitting.

This is extremely hard to convince people about. I'm starting to suspect that even veteran researchers don't realise how unreliable their error estimates become when they tune their models on the test set, which, like you say, is standard practice. And yet people will stand on that rotten practice and talk about the amazing generalisation ability of neural nets, and how over-parameterising and over-training defy statistical learning theory. That's why we have such gems as this paper:

The unreasonable effectiveness of deep learning in artificial intelligence

https://www.pnas.org/doi/10.1073/pnas.1907373117

Or "the grokking paper" and so on. Machine learning is starting to look more and more like the social sciences, where people simply pick and choose results from the (mostly non-peer reviewed) literature, just because they like the claim in a paper (or because it has a catchy title, or it went viral on twitter), and not because they make any serious attempt to check the results themselves.

P.S. Sorry about your review. It's a good idea to avoid any discussion that goes beyond the central claim of a paper. It can only confuse reviewers and distract them from the meat and potatoes of the work. Unfortunately, sticking to that advice makes for dry papers that are boring to read, which also reduces the chances of acceptance.




Oh yeah, those are great papers, which I wish more people read. (I'll never not laugh at the journal name) I'll add this too[0] since it seems to be widely overlooked. But as far as validation sets go, I have a hard time convincing anyone I talk to that tuning on test data is information leakage. I've also had trouble explaining to people (even outside my lab, even at big schools/labs) what "uncurated samples" means. I say "you sample a batch and show that"; they say you can sample multiple batches and pick the best one... It's the same kind of thinking, and it shows why evaluating generative papers is so difficult.
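To make the leakage point concrete, here's a minimal sketch (scikit-learn on a toy dataset; all names and numbers are made up for illustration) of the discipline in question: every hyper-parameter choice is scored on the validation split, and the test split is touched exactly once at the very end. The moment you go back and re-tune after seeing the test number, the test set has quietly become a second validation set.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    # Toy stand-in data; 60/20/20 train/validation/test split.
    X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
    X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
    X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

    best_C, best_val_acc = None, -np.inf
    for C in (0.01, 0.1, 1.0, 10.0):                 # hyper-parameter search
        model = LogisticRegression(C=C, max_iter=1000).fit(X_train, y_train)
        val_acc = model.score(X_val, y_val)          # selection signal: validation only
        if val_acc > best_val_acc:
            best_C, best_val_acc = C, val_acc

    # One final evaluation on the held-out test set, never fed back into tuning.
    final = LogisticRegression(C=best_C, max_iter=1000).fit(X_train, y_train)
    print(f"best C={best_C}  val acc={best_val_acc:.3f}  test acc={final.score(X_test, y_test):.3f}")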

Fwiw, I blame the conferences for this. We have too many papers to review, too few people to review them, and a system that isn't even good at assigning people from the reviewer pool (I got 0 reviews to do, my coworker got 6 ¯\_(ツ)_/¯). Quality control on reviewers is non-existent and ACs don't check that they follow the reviewing rules. The result is a situation where I can't think of a case in which it isn't to your advantage to be an evil reviewer: reject everything and be lazy about it. So we promote benchmarkism and abstract away anything and everything until nothing looks novel (and that's before we even talk about collusion and ethics violations). Without a radical change I don't think the system will keep working. There's too much incentive to cheat and play dirty now.

For my paper, all that stuff was in the appendix fwiw. Since it is a generative paper, I took the chance to do a deep sample analysis (even inventing a new technique) to analyze the biases in different models, noting key indicators that were visible to the eye. So of course I had a small discussion about how FID is limited (see [1,2]; [2] doesn't go deep enough though) and will not capture these differences. These differences matter when you're pushing up against the best achievable FID on the dataset, which is not 0 as many people think[3].

I do feel it is my duty as a researcher to stick my flag in the ground and point out how we need to do things better. (I do think it also made the paper much clearer, and it got good feedback from my colleagues fwiw.) That is what research is, after all. Research requires nuance, and if we're being honest, that shouldn't be something I have to say. People should deeply understand their metrics (not just for evaluating models, but for evaluating work). The system is just too noisy right now to be meaningful imo.

[0] A note on the evaluation of generative models: http://arxiv.org/abs/1511.01844

[1] Improved Precision and Recall Metric for Assessing Generative Models: https://arxiv.org/abs/1904.06991 (original is good too: https://arxiv.org/abs/1806.00035)

[2] The Role of ImageNet Classes in Fréchet Inception Distance: https://arxiv.org/abs/2203.06026

[3] Fwiw, the train-vs-test FID on CIFAR-10 is 3.15, and FFHQ256 top 10k vs bottom 60k is 2.25 (which the current top paper beats, but that's 50k generated samples vs 50k random dataset samples). These numbers have their own biases since the set sizes are unequal, but because FID is distributional they still give us strong clues about the variance within the datasets. My paper didn't go this far though.
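To illustrate footnote [3]: below is a rough sketch of the FID computation itself (the Fréchet distance between two Gaussians fitted to feature sets), run on random stand-in features rather than real Inception activations. It isn't meant to reproduce the 3.15 or 2.25 numbers, only the point that two finite samples from the same distribution already score above 0, so the floor on a real dataset isn't 0 either.

    import numpy as np
    from scipy.linalg import sqrtm

    def fid(feats_a, feats_b):
        """Frechet distance between Gaussians fitted to two feature sets."""
        mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
        cov_a = np.cov(feats_a, rowvar=False)
        cov_b = np.cov(feats_b, rowvar=False)
        covmean = sqrtm(cov_a @ cov_b)
        if np.iscomplexobj(covmean):          # sqrtm can return tiny imaginary parts
            covmean = covmean.real
        return float(np.sum((mu_a - mu_b) ** 2) + np.trace(cov_a + cov_b - 2.0 * covmean))

    # Two independent draws from the *same* Gaussian, standing in for "train"
    # and "test" features (real FID would use 2048-d Inception-V3 activations).
    rng = np.random.default_rng(0)
    a = rng.standard_normal((10_000, 64))
    b = rng.standard_normal((10_000, 64))
    print(fid(a, b))  # small, but not 0: finite samples never match exactly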



