Writing tests won't help you here; this problem is the same as in other generation tasks. If the test passes, everything seems okay, right? Consider this: you now have a 50-line function just to display 'hello world'. It outputs 'hello world', so it scores well, but it's hardly efficient. Then there's a function that runs in exponential time where any sensible programmer would have written a polynomial-time solution. It passes the tests, so it gets a high score. You also have assembly embedded in C code via 'asm'. It works for that particular case and passes the test, but the average C programmer can't tell what the code is doing, whether it's secure, and so on. Lastly, tests written by AI might not cover all cases. They can even fail to test what you intended, because the model hallucinates scenarios (I've experienced this many times). Programming has the same issues as other generation tasks for the current generation of large language models, though to a slightly lesser extent.
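To make the exponential-time point concrete, here is a minimal sketch in Python. Everything in it is my own illustration, not taken from any real benchmark: `fib_slow`, `fib_fast`, and the tiny `passes_tests` suite are hypothetical names. The suite only compares outputs, so it cannot tell the two implementations apart.

```python
def fib_slow(n: int) -> int:
    """Naive recursion: the number of calls grows exponentially with n."""
    if n < 2:
        return n
    return fib_slow(n - 1) + fib_slow(n - 2)


def fib_fast(n: int) -> int:
    """Iterative version: a linear number of additions."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a


def passes_tests(fib) -> bool:
    """Hypothetical test suite: it checks outputs only, never the cost."""
    expected = {0: 0, 1: 1, 2: 1, 10: 55, 20: 6765}
    return all(fib(n) == want for n, want in expected.items())


# Both implementations get the same "score" from the suite.
print(passes_tests(fib_slow), passes_tests(fib_fast))  # True True
```

Both versions pass, so a pass/fail metric rates them equally, even though `fib_slow` becomes unusable once `n` grows much past the small inputs the suite happens to check.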