Because I have, and it's not been the experience you're describing. The LLM hallucinated the error message it was testing for (that it had itself written 5 minutes earlier, in the file it had been given as the source to test).
I don't think this can be solved with the current methodology we're using to build these assistants. I remain arguably highly intelligent, and definitely convinced that LLMs need to evolve further before they surpass humans as coders.