
My benchmark is giving it novel logical puzzles (i.e. ones guaranteed not to be in the training set) that require actual reasoning ability, and seeing how it performs.

By that benchmark, GPT-4 significantly outperforms both LLaMA 3 and Claude in my personal experience.
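For concreteness, a minimal sketch of that kind of harness (Python; the puzzle, checker, and use of the openai client are illustrative assumptions, not the commenter's actual setup):

    # Sketch of a puzzle benchmark (illustrative, not the commenter's setup).
    # Assumes the `openai` Python client; the puzzle and checker are toy stand-ins.
    from openai import OpenAI

    client = OpenAI()

    puzzles = [
        # Hand-written puzzle/answer pairs, so they can't appear in training data.
        ("Alice is taller than Bob, and Bob is taller than Carol. Who is shortest?",
         "carol"),
    ]

    def ask(model: str, prompt: str) -> str:
        resp = client.chat.completions.create(
            model=model, messages=[{"role": "user", "content": prompt}]
        )
        return resp.choices[0].message.content or ""

    def accuracy(model: str) -> float:
        hits = sum(ans in ask(model, q).lower() for q, ans in puzzles)
        return hits / len(puzzles)

    print(accuracy("gpt-4"))  # run the same loop per model to compare
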




That's happening because you're giving it weak prompts, like I said. GPT-4 has been trained to do things like chain of thought by default, whereas you have to explicitly tell Llama/Claude to do some of that. If you update your prompts to suggest reasoning strategies and tell the model to perform some chain of thought beforehand, the difference between models should disappear.
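To make that concrete, here is one hypothetical way the toy puzzle above could be reworded to force chain of thought (the exact phrasing is an assumption, not a recipe from this comment):

    # Hypothetical prompt rewrite: a bare question vs. one that asks the model
    # to lay out its reasoning before committing to an answer.
    weak_prompt = ("Alice is taller than Bob, and Bob is taller than Carol. "
                   "Who is shortest?")

    cot_prompt = (
        "Solve this puzzle. First restate the given facts, then derive any "
        "intermediate conclusions step by step, checking each one against the "
        "facts. Only then give your final answer as 'Answer: <name>'.\n\n"
        "Puzzle: Alice is taller than Bob, and Bob is taller than Carol. "
        "Who is shortest?"
    )

The weak prompt is what a bare harness sends by default; the CoT version is the kind of rewrite being suggested here.
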


You are assuming a great deal. No, you can absolutely come up with puzzles where no amount of forced CoT will make the other models perform at GPT-4's level.

Hell, there are puzzles where you can literally point out where the answer is wrong and ask the model to correct itself, and it will just keep walking in circles, making the same mistakes over and over.
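For illustration, the correction loop being described might look like this (the model name, checker, and feedback wording are all hypothetical stand-ins):

    # Hypothetical correction loop: point out the error and ask for a retry,
    # then watch whether the model fixes the step or repeats the same mistake.
    from openai import OpenAI

    client = OpenAI()

    def looks_correct(answer: str) -> bool:
        return "carol" in answer.lower()  # stand-in checker for the toy puzzle

    messages = [{"role": "user", "content":
                 "Alice is taller than Bob, and Bob is taller than Carol. "
                 "Who is shortest? Think step by step."}]
    for _ in range(3):
        reply = client.chat.completions.create(model="gpt-4", messages=messages)
        answer = reply.choices[0].message.content or ""
        messages.append({"role": "assistant", "content": answer})
        if looks_correct(answer):
            break
        # Point at the specific wrong step, as described above.
        messages.append({"role": "user", "content":
                         "That's wrong: your ordering step contradicts the "
                         "given facts. Re-examine that step and answer again."})
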



