My experience with image classification benchmarks was that they approached human-level performance only because the scoring counts how often the model gets the answer "right" and doesn't penalize completely whack answers as much as it should (like getting full credit for being pretty sure a picture of a dog is either a dog or an alligator). I suspect there's something similar going on in these language benchmarks.
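To make the scoring point concrete, here's a minimal toy sketch (my own example, with made-up class names and probabilities, standing in for ImageNet-style top-5 scoring over 1000 classes): a lenient top-k metric gives full credit to a model that hedges between "dog" and "alligator", while a proper scoring rule like log loss penalizes that same hedge.

    import math

    CLASSES = ["dog", "alligator", "cat", "truck", "mushroom"]

    def top_k_accuracy(probs: dict[str, float], true_label: str, k: int) -> float:
        """Full credit if the true label appears anywhere in the model's top-k guesses."""
        top_k = sorted(probs, key=probs.get, reverse=True)[:k]
        return 1.0 if true_label in top_k else 0.0

    def log_loss(probs: dict[str, float], true_label: str) -> float:
        """Penalty grows as the probability assigned to the true label shrinks."""
        return -math.log(probs[true_label])

    # A model that is "pretty sure it's either a dog or an alligator".
    hedged = {"dog": 0.45, "alligator": 0.45, "cat": 0.05, "truck": 0.03, "mushroom": 0.02}
    # A model that is confidently (and correctly) sure it's a dog.
    confident = {"dog": 0.95, "alligator": 0.02, "cat": 0.01, "truck": 0.01, "mushroom": 0.01}

    # k=2 over 5 toy classes plays the role of top-5 over 1000 ImageNet classes.
    for name, probs in [("hedged", hedged), ("confident", confident)]:
        print(name,
              "top-2 acc:", top_k_accuracy(probs, "dog", k=2),
              "log loss: %.3f" % log_loss(probs, "dog"))

    # Both predictions get 1.0 on the top-k metric, but log loss separates them:
    # roughly 0.80 for the hedged model versus 0.05 for the confident one.

The numbers above are invented purely to illustrate how the choice of metric hides or exposes the "dog or alligator" kind of error; actual benchmark scoring pipelines differ in the details.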
