I've rarely seen completely wrong answers. When I did see them, it was because I was being too prescriptive.
Like, I might ask "using this library, implement that feature" in the hope that it has learned some way to do a thing I haven't been able to figure out. In those cases I see it hallucinate, which I assume means it's just combining information from multiple distinct environments.
If I'm not too specific, it does a pretty good job.
IMO its biggest fault is that it's not good at admitting it doesn't know something. If they could enforce a minimum confidence on the values used to guess the next token (or whatever mechanism they actually use), maybe we'd see better results.
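To make that concrete, here's a rough sketch of what I mean by a confidence floor on next-token sampling. The threshold value and the "abstain" token are made up for illustration; real inference stacks expose different knobs (temperature, top-p, etc.), and this isn't how any particular API works.

```python
import numpy as np

def sample_with_confidence_floor(logits, threshold=0.3, abstain_token_id=None, rng=None):
    """Sample the next token, but abstain when the model isn't confident enough.

    `threshold` and `abstain_token_id` are illustrative knobs, not parameters
    of any real inference API.
    """
    rng = rng or np.random.default_rng()

    # Convert raw logits into a probability distribution over the vocabulary.
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()

    # If even the most likely token falls below the confidence floor,
    # treat that as "the model doesn't know" instead of guessing.
    if probs.max() < threshold:
        return abstain_token_id

    # Otherwise sample from the distribution as usual.
    return int(rng.choice(len(probs), p=probs))
```

The catch, of course, is that a hallucinated answer can still be produced from high-probability tokens, so a floor like this helps with "obviously unsure" cases more than with confident nonsense.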