This is a good start, but given the breadth of applications, this would hardly give us enough to compare, since the goal of these models isn't simply to recite Wikipedia articles. What about language translation? Content summarization? Code generation? Turing test performance?
Both models were trained on Wikipedia, so that's a particularly bad choice. But yes, in practice this is what people tend to do. Take the results with a very large grain of salt, though, as the domain of the prompts you feed them makes a huge difference.
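For what it's worth, the "just feed both models the same prompts" approach usually looks something like the toy harness below. This is only a sketch: the model names and the prompt set are placeholders I picked for illustration, not a recommendation, and the actual comparison step is still you eyeballing the outputs, which is exactly why the grain of salt is needed.

```python
# Toy side-by-side comparison harness (sketch, not a benchmark).
# Model names and prompts are placeholders; swap in whatever you're comparing.
from transformers import pipeline

MODELS = ["gpt2", "distilgpt2"]  # hypothetical stand-ins for the two models
PROMPTS = {
    "translation": "Translate to French: The weather is nice today.",
    "summarization": "Summarize in one sentence: Wikipedia is a free online encyclopedia.",
    "code": "Write a Python function that reverses a string.",
}

for name in MODELS:
    generate = pipeline("text-generation", model=name)
    for domain, prompt in PROMPTS.items():
        # Greedy decoding so reruns are comparable across models.
        out = generate(prompt, max_new_tokens=50, do_sample=False)
        print(f"[{name} / {domain}]\n{out[0]['generated_text']}\n")
```

Even with domains spread out like this, the results only tell you how the models behave on *your* prompts, so whoever writes the prompt set is implicitly choosing the winner.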