One thing I like about this effort is their attempt to factor out the caching of prior answers that comes from having been asked a similar question before. Given the nearly eidetic memoization ability of LLMs, no cognitive benchmark can be meaningful unless the LLM's question history can somehow be voided after each query. I think this is especially true when measuring reasoning, which will surely benefit greatly from caching answers to earlier questions into a working set that strengthens the model's associations on future similar questions. That only looks like reasoning.