Lessons from the trenches on reproducible evaluation of language models (arxiv.org)
42 points by veryluckyxyz 7 months ago | 3 comments



One point they don’t seem to spend much time on is the difficulty of reproducing outputs from closed-source models. Setting temperature to 0 and fixing the seed doesn’t always seem to be enough to get exactly the same results for a given prompt.


Are there other parameters that affect the output?


Library versions with slightly different numeric rounding, alternative implementations, different runtimes, and hardware variation can all make results hard to reproduce.
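To make the rounding point concrete, here is a minimal sketch (pure Python, no ML library involved) of how accumulation order alone can flip a greedy, temperature-0 choice. The names `accumulate`, `greedy_token`, and `rival` are made up for illustration, and the two-token "vocabulary" is a toy assumption, not how any particular model or runtime works.

```python
from functools import reduce


def accumulate(terms):
    """Sum terms strictly left to right, like a simple sequential kernel."""
    return reduce(lambda acc, t: acc + t, terms, 0.0)


def greedy_token(logits):
    """Temperature-0 decoding: index of the largest logit (first one wins ties)."""
    return max(range(len(logits)), key=logits.__getitem__)


# The same three contributions to one logit, accumulated in two orders.
forward = accumulate([0.1, 0.2, 0.3])
backward = accumulate([0.3, 0.2, 0.1])
print(forward == backward)  # float addition is not associative

# Hypothetical competing token whose logit sits right at the boundary.
rival = 0.6
print(greedy_token([rival, forward]), greedy_token([rival, backward]))
```

The two sums differ only in the last bit, but that is enough to change which token wins when two logits are nearly tied, and every later token then depends on that choice.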

There is no obligation for TensorFlow’s sigmoid implementation to exactly match PyTorch’s output. The same is true for NVIDIA vs. AMD hardware.
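As a rough illustration of the sigmoid point, the sketch below compares two mathematically equivalent formulas using only Python's `math` module. Neither is claimed to be what TensorFlow or PyTorch actually ships; `sigmoid_exp` and `sigmoid_tanh` are hypothetical names chosen for this example.

```python
import math


def sigmoid_exp(x):
    """Textbook form: 1 / (1 + e^-x)."""
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_tanh(x):
    """Equivalent identity: 0.5 * (1 + tanh(x / 2))."""
    return 0.5 * (1.0 + math.tanh(0.5 * x))


# Compare the two forms on a grid of inputs in [-5, 5].
xs = [i / 100.0 for i in range(-500, 501)]
mismatches = [x for x in xs if sigmoid_exp(x) != sigmoid_tanh(x)]
max_gap = max(abs(sigmoid_exp(x) - sigmoid_tanh(x)) for x in xs)

# How many inputs disagree bit-for-bit depends on the platform's math library.
print(len(mismatches), max_gap)
```

Any disagreement here should be tiny, on the order of the last bit or two, and the exact count may vary with the platform's libm. The point is only that "same function" does not imply "same bits".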



