First few questions for those who don't care to download. Most just seem to be about niche facts:
Who received the IEEE Frank Rosenblatt Award in 2010?
Who was awarded the Oceanography Society's Jerlov Award in 2018?
What's the name of the women's liberal arts college in Cambridge, Massachusetts?
In whose honor was the Leipzig 1877 tournament organized?
According to Karl Küchler, what did Empress Elizabeth of Austria's favorite sculpture depict, which was made for her villa Achilleion at Corfu?
How much money, in euros, was the surgeon held responsible for Stella Obasanjo's death ordered to pay her son?
A little glossed over, but they do point out that the most important improvement o1 has over gpt-4o is not its "correct" score improving from 38% to 42% but its "not attempted" rate going from 1% to 9%. The improvement is even more stark for o1-mini vs gpt-4o-mini: 1% to 28%.
They don't really describe what "success" would look like, but it seems to me like the primary goal is to minimize "incorrect" rather than to maximize "correct". The mini models would get there by maximizing "not attempted", with the larger models having much higher "correct". Then both model sizes could hopefully reach 90%+ "correct" when given access to external lookup tools.
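To put rough numbers on that, here is a quick sketch using only the gpt-4o and o1 percentages quoted above. It assumes the three grades (correct, incorrect, not attempted) sum to 100%, so "incorrect" is derived rather than quoted. The `breakdown` helper and the "correct given attempted" ratio are my own illustration, not necessarily how the benchmark authors score things.

```python
# Percentages quoted in the comment above. "incorrect" is NOT quoted there;
# it is derived under the assumption that the three grades sum to 100%.
scores = {
    "gpt-4o": {"correct": 38, "not_attempted": 1},
    "o1": {"correct": 42, "not_attempted": 9},
}


def breakdown(correct, not_attempted):
    """Return (incorrect %, fraction correct among attempted questions)."""
    incorrect = 100 - correct - not_attempted
    attempted = 100 - not_attempted
    return incorrect, correct / attempted


results = {name: breakdown(**s) for name, s in scores.items()}

for name, (incorrect, precision) in results.items():
    print(f"{name}: incorrect={incorrect}%, correct-given-attempted={precision:.1%}")
```

Under that assumption, "incorrect" drops from 61% to 49%, a 12-point change next to a 4-point gain in "correct", which is why the abstention rate looks like the bigger story.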
Not surprising that this would be on a list of questions at least one model got wrong, since I think the real answer is "there isn't one anymore, but from 1879 to 1999 the answer would have been Radcliffe College".
I was wondering about that too. The direction of high-quality shading is not uniform: https://i.imgur.com/Y8hIWAD.png (taken from Fig. 3 in the paper)