This is such a great piece.
tldr: old benchmarks saturated, methodology was liable to a lot of subtle biases. as she mentions on the pod, they're already working on leaderboard v3.