
At the root of this is something I'm constantly saying to other members in my research lab, arguing with reviewers, and passionately teaching my students:

datasets are proxies, measurements are fuzzy.

This is something that is drilled into students learning statistics, and I can't for the life of me figure out why it got lost in ML (best guess: we don't require statistics courses).

Datasets are proxies: they represent the world, but they aren't the world. They should generally be seen as narrow subsets too. Your dataset's quality and type matter a lot! Things like medical image datasets also have tons of correlated factors that can easily invalidate all your results without you being aware of it. There are simple datasets we use to prove a concept (toys, mnist, cifar, etc). There are large scale datasets that have internal inconsistencies (imagenet, flowers). There are huge datasets that haven't been properly filtered/deduplicated (LAION). (There are also just shitty datasets (HumanEval).)

Thinking of datasets as proxies helps internalize the frustration that literally every production engineer faces (even outside ML and software): real-world results are inconsistent with lab results. Dataset engineering is an underappreciated art that is extremely difficult. But everyone needs to internalize that datasets are just a map, not the territory, and your navigation will only be as good as the map (many are poorly drawn maps, often on purpose).
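
To make the "haven't been properly filtered/deduplicated" point concrete, here's a minimal sketch of the cheapest possible sanity check: look for byte-identical files shared between splits. The paths are hypothetical, and real near-duplicate detection needs perceptual hashing or embedding similarity; this only catches exact copies.

    # Flag byte-identical files shared between a train and a test split by hashing
    # raw file contents. Real near-duplicate detection (the LAION-scale problem)
    # needs perceptual hashes or embedding similarity; this is just a first check.
    import hashlib
    from pathlib import Path

    def file_hashes(directory):
        """Map md5 digest -> file path for every file under `directory`."""
        hashes = {}
        for path in Path(directory).rglob("*"):
            if path.is_file():
                hashes[hashlib.md5(path.read_bytes()).hexdigest()] = str(path)
        return hashes

    train = file_hashes("data/train")   # hypothetical paths
    test = file_hashes("data/test")

    leaked = set(train) & set(test)
    print(f"{len(leaked)} byte-identical files appear in both splits")
    for digest in list(leaked)[:10]:
        print(train[digest], "<->", test[digest])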

Measurements are fuzzy: Benchmarkism is running rampant in the ML world, and it baffles me that a field whose bottom-line objective deals with alignment can't align on how we evaluate ourselves. No measurement is perfect, and many are far from it. You can train two language models to the same NLL and one might sample well while the other outputs garbage. You can train two image models to identical FIDs and one samples cleanly while the other doesn't. Likelihood also doesn't guarantee sharpness, and I could go on. You must think about the limitations of your measurements and know them in depth. This also seems to have gotten away from us; people just run the metric libraries and call it a day. We've reached a point where ImageNet classification accuracy has decoupled from downstream performance (object detection and segmentation), and things like this are confusing to production people because taking the model with the highest score doesn't always give the best-performing model for their work (even before we consider things like throughput and memory usage). It is a Goodhart problem through and through.
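
To make the FID point concrete, here's a toy sketch (my own construction, just for illustration): FID is a Fréchet distance between Gaussians fitted to feature statistics, so two sample sets with the same mean and covariance are indistinguishable to it no matter how differently they're actually distributed.

    # Toy illustration that an FID-style score only sees the first two moments of
    # the feature distribution. A standard Gaussian and a +/-1 coin-flip
    # distribution share mean 0 and unit covariance, so their Frechet distance is
    # ~0 even though samples from them look nothing alike.
    import numpy as np
    from scipy.linalg import sqrtm

    def frechet_distance(x, y):
        """Frechet distance between Gaussians fit to two sample sets (rows = samples)."""
        mu_x, mu_y = x.mean(axis=0), y.mean(axis=0)
        cov_x, cov_y = np.cov(x, rowvar=False), np.cov(y, rowvar=False)
        covmean = sqrtm(cov_x @ cov_y)
        if np.iscomplexobj(covmean):
            covmean = covmean.real
        return float(np.sum((mu_x - mu_y) ** 2) + np.trace(cov_x + cov_y - 2 * covmean))

    rng = np.random.default_rng(0)
    n, dim = 50_000, 8
    model_a = rng.standard_normal((n, dim))           # Gaussian "features"
    model_b = rng.choice([-1.0, 1.0], size=(n, dim))  # very different "features"

    print(frechet_distance(model_a, model_b))  # ~0: the metric can't tell them apart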

ML is at a serious point where we've gotten away from our basic stats training. That, not AGI, is going to pose a real danger to society. It's like handing power tools to chimps who don't know how power tools work; it won't end well. But that is happening because we've shifted focus to meeting targets, not measuring our work. Targets are easy, science is hard. Unless we bring these nuances back into how we evaluate work, we are just handing power tools to chimps without any quality assurance.




I remember working with several ML teams specifically on inference performance (latency, memory usage, etc.), and it's not a surprise to see some object detection performance variance depending on the scene.

Sometimes even capping the model architecture to keep us from exceeding performance thresholds is non-trivial in itself, but convincing "some" researchers why p99 inference latency, for example, matters more than the p50 case in a safety-critical system... that's surprisingly several orders of magnitude harder.
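
A minimal sketch of what that argument looks like in practice: time many single-inference calls and report the tail, not just the median. `run_inference` here is a hypothetical stand-in for the real model call, with an artificial rare slow path.

    # Why the p50 number hides what a safety-critical system actually has to survive.
    import time
    import numpy as np

    def run_inference(frame):
        # placeholder: a fast common path plus a rare slow path (GC, cache miss, ...)
        time.sleep(0.01 + (0.2 if np.random.rand() < 0.02 else 0.0))

    latencies = []
    for _ in range(500):
        frame = np.random.rand(3, 640, 480)    # fake input frame
        start = time.perf_counter()
        run_inference(frame)
        latencies.append(time.perf_counter() - start)

    p50, p99 = np.percentile(latencies, [50, 99])
    print(f"p50 = {p50 * 1000:.1f} ms, p99 = {p99 * 1000:.1f} ms")
    # A model that looks fine at p50 can still blow its deadline at p99.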


Most, if not all (I can't think of one), research datasets have enough problems that once you reach a certain threshold your model will start to overfit despite no divergence between training and test performance. But this is hard to explain to many people, since they only understand overfitting as that divergence. But maybe worse is that it is status quo to tune hyper-parameters on test data results[0] (instead of validation), which causes information leakage and encourages overfitting. But damned if you do, damned if you don't. (A sketch of the cleaner protocol is below the footnotes.) And don't get me started on generative models, where there aren't even test datasets[1].

[0] I blame the status quo of lazy reviewers and rejecting works based on benchmarks as well as being uninformed.

[1] I'll admit this is a sore spot right now as a reviewer took my paper's discussions about the limitations of FID and asked why I didn't present a new metric.
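
The cleaner protocol is nothing exotic: tune hyper-parameters on a validation split and touch the test split exactly once at the end. A toy sketch (sklearn just as a stand-in; the point is the data flow, not the model):

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
    X_trval, X_test, y_trval, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
    X_train, X_val, y_train, y_val = train_test_split(X_trval, y_trval, test_size=0.25, random_state=0)

    best_C, best_val = None, -np.inf
    for C in [0.01, 0.1, 1.0, 10.0]:          # hyper-parameter search uses val only
        model = LogisticRegression(C=C, max_iter=1000).fit(X_train, y_train)
        acc = model.score(X_val, y_val)
        if acc > best_val:
            best_C, best_val = C, acc

    final = LogisticRegression(C=best_C, max_iter=1000).fit(X_trval, y_trval)
    print(f"chosen C={best_C}, test accuracy={final.score(X_test, y_test):.3f}")  # reported once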


>> But maybe worse is that it is status quo to tune hyper-parameters on test data results[0] (instead of validation), which causes information leakage and encourages overfitting.

This is extremely hard to convince people about. I'm starting to doubt that even veteran researchers realise how unreliable their error estimates become if they're tuning their models on the test set, which, like you say, is standard practice. And yet, people will stand on that rotten practice and talk about the amazing generalisation ability of neural nets, and how over-parameterising and over-training defies statistical learning theory. That's why we have such gems as this paper:

The unreasonable effectiveness of deep learning in artificial intelligence

https://www.pnas.org/doi/10.1073/pnas.1907373117

Or "the grokking paper" and so on. Machine learning is starting to look more and more like the social sciences, where people simply pick and choose results from the (mostly non-peer reviewed) literature, just because they like the claim in a paper (or because it has a catchy title, or it went viral on twitter), and not because they make any serious attempt to check the results themselves.

P.S. Sorry about your review. It's a good idea to avoid any discussion that goes beyond the central claim of a paper. It can only confuse reviewers and distract them from the meat and potatoes of the work. Unfortunately, sticking to that advice makes for dry, boring papers, which also reduces the chances of acceptance.


Oh yeah, those are great papers, which I wish more people read. (I'll never not laugh at the journal name.) I'll add this one[0] too, since it seems to be widely missed. But as for validation sets, I have a hard time convincing anyone I talk to that tuning on test data is information leakage. I've also had trouble discussing with people (even outside my lab, even at big schools/labs) what "uncurated samples" means. I say "you sample one batch and show that"; they say you can sample multiple batches and pick the best one... It's the same kind of thinking, and it shows why evaluating generative papers is so difficult.
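
The "pick the best batch" habit is the same selection bias in miniature. A tiny numeric sketch (made-up scores, just to show the shape of the problem):

    # Why "sample a few batches and show the best one" is not an uncurated sample:
    # the maximum over k draws is systematically better than a typical draw.
    # Here "quality" is just a made-up scalar score per batch.
    import numpy as np

    rng = np.random.default_rng(0)
    scores = rng.standard_normal((100_000, 8))   # 8 candidate batches per "figure"

    typical = scores[:, 0].mean()                # honest: show the first batch
    curated = scores.max(axis=1).mean()          # cherry-pick the best of 8
    print(f"typical batch: {typical:+.2f}, best-of-8: {curated:+.2f}")
    # The expected max of 8 standard normals is about +1.4 sigma above the mean.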

Fwiw, I blame the conferences for this. We have too many papers to review, too few people to review them, and a system that isn't even good at drawing from the reviewer pool (I got 0 reviews to do, my coworker got 6 ¯\_(ツ)_/¯). Quality control on reviewers is non-existent and ACs don't check that they follow the reviewing rules. The result is that I can't think of a case where it isn't to your advantage to be an evil reviewer: reject everything, be lazy about it. So we promote benchmarkism and abstract away anything and everything so that nothing looks novel (and that's before we talk about collusion and ethics violations). Without a radical change I don't think the system will work anymore. There's too much incentive to cheat and play dirty now.

For my paper, all that stuff was in the appendix fwiw. Since it is a generative paper, I took my chance to do a deep sample analysis (even inventing a new technique) to analyze the biases in different models, noting key indicators that were visible to the eye. So of course I had a small discussion about how FID is limited (see [1, 2]; [2] doesn't go deep enough though) and will not capture these differences. These differences matter when you're pushing up against the best achievable FID on the dataset, which is not 0 as many people think[3]. I do feel like it is my duty as a researcher to stick my flag in the ground and point out how we need to do things better. (I do think it also made the paper much clearer, and I got good feedback from my colleagues fwiw.) That is what research is, after all. Research requires nuance, and if we're being honest, that shouldn't be something I have to say. People should deeply understand their metrics (not just for evaluating models, but for evaluating works). The system is just too noisy right now to be meaningful imo.

[0] A note on the evaluation of generative models: http://arxiv.org/abs/1511.01844

[1] Improved Precision and Recall Metric for Assessing Generative Models: https://arxiv.org/abs/1904.06991 (original is good too: https://arxiv.org/abs/1806.00035)

[2] The Role of ImageNet Classes in Fréchet Inception Distance: https://arxiv.org/abs/2203.06026

[3] Fwiw, train vs test set on CIFAR10 has an FID of 3.15, and FFHQ256 top 10k vs bottom 60k is 2.25 (which the current top paper beats, but that's 50k generated samples vs 50k random dataset samples). These of course carry biases since the set sizes are unequal, but since FID is distributional it still gives us some strong clues about the variance within the datasets. My paper didn't go this far though.
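
For anyone who wants to see their own dataset's floor, a rough sketch of the idea in [3]: compute FID between two disjoint sets of real features (e.g. Inception embeddings of train vs test images). That number, not 0, is the best any generator could honestly score at that sample size. The `.npy` file here is a hypothetical precomputed (N, D) array of embeddings.

    import numpy as np
    from scipy.linalg import sqrtm

    def frechet_distance(x, y):
        mu_x, mu_y = x.mean(axis=0), y.mean(axis=0)
        cov_x, cov_y = np.cov(x, rowvar=False), np.cov(y, rowvar=False)
        covmean = sqrtm(cov_x @ cov_y)
        if np.iscomplexobj(covmean):
            covmean = covmean.real
        return float(np.sum((mu_x - mu_y) ** 2) + np.trace(cov_x + cov_y - 2 * covmean))

    features = np.load("inception_features.npy")   # hypothetical real-image embeddings
    rng = np.random.default_rng(0)
    idx = rng.permutation(len(features))
    half_a, half_b = features[idx[:50_000]], features[idx[50_000:100_000]]
    print("estimated FID floor:", frechet_distance(half_a, half_b))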


The funny thing is that all this is clearly spelled out in PAC learning and statistical learning theory, but the trend in most machine learning courses is to avoid all that hairy stuff and concentrate on the practical tasks of training classifiers, using popular libraries, and so on.

Btw, in the above comment:

FID = Fréchet Inception Distance

NLL = Negative Log-Likelihood


Yeah, this is what bugs me. If these people took a single stats class it would be drilled into them. Doesn't matter if it's ISLR, PAC-LSLT, SR, or ROS; it will be there. But it isn't. I push for as much of this as I can when I teach my ML course, but my advisor pushes back for it being "too mathy." I swear, ML people are averse to math (though they like adding meaningless math equations to papers to make them look technical, but that's probably aligned with the former point). As the current environment stands, I do not think many ML people understand concepts like data leakage, Bayes, or even likelihood. Part of the problem is that the popular way to teach ML is the way you teach software: through coding. That doesn't work when you need statistical theory to understand your results.



