Hacker News
Lessons from the trenches on reproducible evaluation of language models (arxiv.org)
42 points by veryluckyxyz 5 months ago | 3 comments



One point they don’t spend much time on is the difficulty of reproducing outputs from closed-source models. Setting temperature to 0 and fixing seeds isn’t always enough to get exactly the same result for a given prompt.


Are there other parameters that affect the output?


Library versions with slightly different numeric rounding behavior, alternative kernel implementations, runtimes, and hardware variation can all make exact reproduction difficult.
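The rounding point is easy to demonstrate in plain Python (no particular framework assumed): floating-point addition is not associative, so two implementations that accumulate the same numbers in a different order can round differently at each step and return slightly different values.

```python
# Floating-point addition is not associative: the same three numbers
# summed in a different order round differently at each step.
x = (0.1 + 0.2) + 0.3  # 0.6000000000000001
y = 0.1 + (0.2 + 0.3)  # 0.6
print(x == y)  # False
```

A parallel reduction on a GPU and a sequential sum on a CPU are exactly this kind of reordering, which is one reason results can drift across hardware.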

There is no obligation for TensorFlow’s sigmoid implementation to exactly match PyTorch’s output. The same is true for NVIDIA vs. AMD hardware.
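A minimal sketch of why that matters even at temperature 0: the logit values below are made up, but a one-ulp difference between two backends is enough to flip the argmax that greedy decoding selects.

```python
# Hypothetical logit values (not from any real model) that differ by
# one ulp between two backends. Greedy decoding (temperature 0) takes
# the argmax, so even this tiny discrepancy flips the chosen token.
def pick(logits):
    """Index of the largest logit, i.e. the greedy-decoded token."""
    return max(range(len(logits)), key=lambda i: logits[i])

backend_a = [1.0000000000000002, 1.0]  # token 0 wins by one ulp
backend_b = [1.0, 1.0000000000000002]  # token 1 wins by one ulp
print(pick(backend_a), pick(backend_b))  # 0 1
```

And once one token differs, it enters the context for every later step, so the two generations can diverge completely from that point on.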



