It's not that it did a bad job. It's more that you couldn't fully rely on what it had produced, or on it having corrected the list, without first having it reanalyse the "corrected" list.
It's still extremely helpful. I just found it strange that on such a seemingly simple task, something that has been fed millions of documents would still give some incorrect results - especially AFTER it had analysed its own results and found some noun articles to be incorrect.
I've found you still can't rely on LLMs to do anything 100% correctly without human oversight, unless you spend a lot of time on prompt engineering and testing. Even then you might not get as close to 100% as you'd like.
But as you say, they are still extremely helpful anyway.
> Erstelle eine Liste mit 10 Nomen, die den Artikel "der" haben. (English: Create a list of 10 nouns that take the article "der".)
Maybe "reliably" is doing a lot of the heavy lifting?