
The convention should be to use 1337 as your seed, and disclose that in your publication.
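
Concretely, a minimal sketch of what that could look like (assuming PyTorch and NumPy; the specific calls matter less than picking one seed and disclosing it):

    # Seed every RNG in play with the disclosed seed (1337 here).
    import random

    import numpy as np
    import torch

    SEED = 1337  # the value you disclose in the publication

    random.seed(SEED)
    np.random.seed(SEED)
    torch.manual_seed(SEED)           # seeds CPU and all CUDA devices
    torch.cuda.manual_seed_all(SEED)  # explicit, for older PyTorch versions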



This is never going to happen given the need to chase SOTA. I say "need" because reviewers care a lot about this (rather, they are looking for any reason to reject, and lack of SOTA is a common one). Fwiw, the checkpoints I release with my works include a substantial amount of information, including the random seed and rng_state. My current project tracks a lot more. The reason I do this is both selfish and for promoting good science: it is not uncommon to forget what arguments or parameters you used in a particular run, and the checkpoint is a great place to store that information, ensuring it is always tied to the model and can never be lost.
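
To give a rough idea, here is a sketch of bundling that information into the checkpoint itself (model, optimizer, and args are placeholders for whatever your training script actually uses):

    # Store the run config and RNG state inside the checkpoint so they can
    # never be separated from the model weights.
    import random

    import numpy as np
    import torch

    checkpoint = {
        "model_state": model.state_dict(),
        "optimizer_state": optimizer.state_dict(),
        "args": vars(args),  # all hyperparameters / CLI arguments for the run
        "seed": args.seed,
        "python_rng_state": random.getstate(),
        "numpy_rng_state": np.random.get_state(),
        "torch_rng_state": torch.get_rng_state(),
        "cuda_rng_state": torch.cuda.get_rng_state_all(),
    }
    torch.save(checkpoint, "checkpoint.pt")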

You could also use the deterministic mode in pytorch to create a standard. But I actually don't believe we should do this. The solution space is quite large, and it would be unsurprising if certain seeds made certain models perform really well while causing others to fail. Ironically, standardizing seeds can increase the noise in our ability to evaluate! Instead I think we should evaluate multiple times and place a standard error (variance) on the results. This depends on the metric of course, but metrics that take subsets (such as FID or other sampling-based measurements) especially should have these. Unfortunately, it is not standard to report them, and doing so can even draw reviewer critique. It can also be computationally expensive (especially if we're talking about training parameters), so I wouldn't require it, but it is definitely helpful. Everyone reports the best result, just like they tend to show off the best samples. I don't think there is an inherent problem with showing off the best samples, since in practice we would select them anyway, but I do think it is problematic that reviewers make substantial evaluations based on them, as they are highly biased.
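
As a sketch of the kind of reporting I mean (evaluate here is a placeholder for whatever metric you compute, e.g. FID or accuracy; only the aggregation is the point):

    # Report mean ± standard error across seeds instead of a single number.
    import numpy as np

    def evaluate_with_error(evaluate, seeds=(0, 1, 2, 3, 4)):
        scores = np.array([evaluate(seed=s) for s in seeds])
        mean = scores.mean()
        stderr = scores.std(ddof=1) / np.sqrt(len(scores))  # standard error of the mean
        return mean, stderr

    # e.g. report "FID 12.3 ± 0.4 (n=5 seeds)" rather than the best single run.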

Noise is inherent to ML, and rather than trying to get rid of it, I'd prefer we embrace it. It is good to know how stable your network is to random seeds. It is good to know how good it can get. It is good to have metrics. But all these evaluations are guides, not measures. Evaluation is fucking hard, and there's no way around this. Getting lazy in evaluation just results in reward hacking/Goodhart's Law. The irony is that the laziness is built on over-reliance on metrics, in an attempt to create meritocratic and less subjective evaluations, but this ends up introducing more noise than if we had just used metrics as a guide. There is no absolute metric; all metrics have biases and limitations, and metrics are not always perfectly aligned with our goals.

To be clear, the starting seed isn't related to what I was talking about in the previous comment; that comment was exclusively about sampling, not training.



