> As long as an LLM is capable of inserting "9.99 > 10.01?" into an evaluation tool, we're on a good way

ChatGPT will switch to Python for some arithmetic, with the result that you get floating-point errors on problems an 8-year-old would get right. I think "switch to a tool" still requires understanding which tool gives a reliable result, which in turn means understanding the problem. It's an interesting issue.
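
To make the floating-point point concrete, here is a minimal sketch in plain Python. The function names are made up for illustration, and the choice of the stdlib `decimal` module as the "right tool" is my assumption, not something ChatGPT is known to do:

```python
from decimal import Decimal


def float_sum_is_exact() -> bool:
    # Binary floats cannot represent 0.1, 0.2 or 0.3 exactly,
    # so the sum picks up a tiny rounding error.
    return 0.1 + 0.2 == 0.3


def decimal_sum_is_exact() -> bool:
    # Decimal values built from strings keep the base-10 digits as written.
    return Decimal("0.1") + Decimal("0.2") == Decimal("0.3")


print(float_sum_is_exact())    # False
print(decimal_sum_is_exact())  # True
```

Both versions count as "switching to a tool", but only the second matches what a person doing the sum on paper would expect, and picking it requires knowing the problem is about decimal quantities.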