
>> But maybe worse, is that it is status quo to tune hyper-parameters on test data results[0] (instead of validation), which causes information leakage and helps support overfitting.

This is extremely hard to convince people about. I'm starting to suspect that even veteran researchers don't realise how unreliable their error estimates become when they tune their models on the test set, which, like you say, is standard practice. And yet people will stand on that rotten practice and talk about the amazing generalisation ability of neural nets, and how over-parameterising and over-training defy statistical learning theory. That's why we have such gems as this paper:

The unreasonable effectiveness of deep learning in artificial intelligence

https://www.pnas.org/doi/10.1073/pnas.1907373117

Or "the grokking paper" and so on. Machine learning is starting to look more and more like the social sciences, where people simply pick and choose results from the (mostly non-peer reviewed) literature, just because they like the claim in a paper (or because it has a catchy title, or it went viral on twitter), and not because they make any serious attempt to check the results themselves.

P.S. Sorry about your review. It's a good idea to avoid any discussion that goes beyond the central claim of a paper. It can only confuse reviewers and distract them from the meat and potatoes of the work. Unfortunately, sticking to that advice makes for dry papers that are boring to read, which also reduces the chances of acceptance.




Oh yeah, those are great papers, which I wish more people read. (I'll never not laugh at the journal name) I'll add this too[0] since it seems to be widely overlooked. But as far as validation sets go, I have a hard time convincing anyone I talk to that tuning on test data is information leakage. I've also had trouble explaining to people (even outside my lab, even at big schools/labs) what "uncurated samples" means. I say "you sample a batch and show that"; they say you can sample multiple batches and pick the best one... It's the same kind of thinking, and it shows why evaluating generative papers is so difficult.
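To make the leakage point concrete, here's a minimal sketch (scikit-learn on a toy dataset; all names and numbers are made up for illustration) of the discipline in question: every hyper-parameter choice is scored on the validation split, and the test split is touched exactly once at the very end. The moment you go back and re-tune after seeing the test number, the test set has quietly become a second validation set.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    # Toy stand-in data; 60/20/20 train/validation/test split.
    X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
    X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
    X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

    best_C, best_val_acc = None, -np.inf
    for C in (0.01, 0.1, 1.0, 10.0):                 # hyper-parameter search
        model = LogisticRegression(C=C, max_iter=1000).fit(X_train, y_train)
        val_acc = model.score(X_val, y_val)          # selection signal: validation only
        if val_acc > best_val_acc:
            best_C, best_val_acc = C, val_acc

    # One final evaluation on the held-out test set, never fed back into tuning.
    final = LogisticRegression(C=best_C, max_iter=1000).fit(X_train, y_train)
    print(f"best C={best_C}  val acc={best_val_acc:.3f}  test acc={final.score(X_test, y_test):.3f}")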

Fwiw, I blame the conferences for this. We have too many papers to review, too few people to review them, and a system that isn't even good at assigning people from the reviewer pool (I got 0 reviews to do, my coworker got 6 ¯\_(ツ)_/¯). Quality control on reviewers is non-existent and ACs don't check that they follow the reviewing rules. The result is a situation where I can't think of a case in which it isn't to your advantage to be an evil reviewer: reject everything and be lazy about it. So we promote benchmarkism and abstract away anything and everything until nothing looks novel (and that's before we even talk about collusion and ethics violations). Without a radical change I don't think the system will keep working. There's too much incentive to cheat and play dirty now.

For my paper, all that stuff was in the appendix fwiw. Since it is a generative paper, I took the chance to do a deep sample analysis (even inventing a new technique) to analyze the biases in different models, noting key indicators that were visible to the eye. So of course I had a small discussion about how FID is limited (see [1,2]; [2] doesn't go deep enough though) and will not capture these differences. These differences matter when you're pushing up against the best achievable FID on the dataset, which is not 0 as many people think[3].

I do feel it is my duty as a researcher to stick my flag in the ground and point out how we need to do things better. (I do think it also made the paper much clearer, and it got good feedback from my colleagues fwiw.) That is what research is, after all. Research requires nuance, and if we're being honest, that shouldn't be something I have to say. People should deeply understand their metrics (not just for evaluating models, but for evaluating work). The system is just too noisy right now to be meaningful imo.

[0] A note on the evaluation of generative models: http://arxiv.org/abs/1511.01844

[1] Improved Precision and Recall Metric for Assessing Generative Models: https://arxiv.org/abs/1904.06991 (original is good too: https://arxiv.org/abs/1806.00035)

[2] The Role of ImageNet Classes in Fréchet Inception Distance: https://arxiv.org/abs/2203.06026

[3] Fwiw, the train-vs-test FID on CIFAR-10 is 3.15, and FFHQ256 top 10k vs bottom 60k is 2.25 (which the current top paper beats, but that's 50k generated samples vs 50k random dataset samples). These numbers have their own biases since the set sizes are unequal, but because FID is distributional they still give us strong clues about the variance within the datasets. My paper didn't go this far though.
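To illustrate footnote [3]: below is a rough sketch of the FID computation itself (the Fréchet distance between two Gaussians fitted to feature sets), run on random stand-in features rather than real Inception activations. It isn't meant to reproduce the 3.15 or 2.25 numbers, only the point that two finite samples from the same distribution already score above 0, so the floor on a real dataset isn't 0 either.

    import numpy as np
    from scipy.linalg import sqrtm

    def fid(feats_a, feats_b):
        """Frechet distance between Gaussians fitted to two feature sets."""
        mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
        cov_a = np.cov(feats_a, rowvar=False)
        cov_b = np.cov(feats_b, rowvar=False)
        covmean = sqrtm(cov_a @ cov_b)
        if np.iscomplexobj(covmean):          # sqrtm can return tiny imaginary parts
            covmean = covmean.real
        return float(np.sum((mu_a - mu_b) ** 2) + np.trace(cov_a + cov_b - 2.0 * covmean))

    # Two independent draws from the *same* Gaussian, standing in for "train"
    # and "test" features (real FID would use 2048-d Inception-V3 activations).
    rng = np.random.default_rng(0)
    a = rng.standard_normal((10_000, 64))
    b = rng.standard_normal((10_000, 64))
    print(fid(a, b))  # small, but not 0: finite samples never match exactly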



