One point they don’t seem to spend much time on is the difficulty of reproducing outputs from closed-source models. Setting the temperature to 0 and fixing the seed doesn’t always seem to be enough to get exactly the same results for a given prompt.
Library versions with slightly different numeric rounding errors, alternative implementations, runtimes, and hardware variation could all lead to reproduction challenges.
There is no obligation for TensorFlow’s sigmoid implementation to exactly match PyTorch’s output. The same is true for NVIDIA vs. AMD hardware.
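
To make the rounding point concrete, here is a minimal sketch in plain Python (no ML library involved). It assumes two hypothetical "backends" that add the same three numbers in a different order, which is the kind of thing that can differ between kernels or hardware. The logit values and the `greedy_pick` helper are made up for illustration; this is not how any particular model or framework computes its outputs.

```python
# Floating-point addition is not associative, so the order in which values
# are accumulated can change the last bits of the result.
left_to_right = (0.1 + 0.2) + 0.3   # one accumulation order
right_to_left = 0.1 + (0.2 + 0.3)   # same numbers, different order

print(left_to_right == right_to_left)   # the two sums are not equal
print(left_to_right, right_to_left)


def greedy_pick(logits):
    """Return the index of the largest logit, which is what temperature 0
    amounts to. On an exact tie, max() returns the first maximal index."""
    return max(range(len(logits)), key=lambda i: logits[i])


# Hypothetical two-token vocabulary: token 0 has a fixed logit of 0.6,
# token 1's logit is the sum above as computed by each "backend".
logits_backend_1 = [0.6, left_to_right]
logits_backend_2 = [0.6, right_to_left]

print(greedy_pick(logits_backend_1))   # token 1 wins by one rounding step
print(greedy_pick(logits_backend_2))   # exact tie, so token 0 is picked
```

The difference between the two sums is a single rounding step, but once two candidates are that close, greedy decoding can pick a different token, and every later token is then conditioned on a different prefix.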