
My experience is that querying an LLM usually does OK if the only knowledge it needs is directly included in the prompt (including for summarization). It’s not perfect, but its reliability is within the realm of human error, so it isn’t significantly worse than me skimming the text myself.
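To make that distinction concrete, here’s a minimal sketch of the two patterns. The OpenAI client and model name are just illustrative placeholders for whichever chat API you actually use:

    # Sketch of the two prompting patterns. The OpenAI Python client and
    # the model name are illustrative assumptions, not a recommendation.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def ask(prompt: str) -> str:
        resp = client.chat.completions.create(
            model="gpt-4o-mini",  # any chat model works here
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content

    def summarize_grounded(article_text: str) -> str:
        # Everything the model needs is in the prompt, so the task is
        # mostly compression of supplied text: relatively reliable.
        return ask(
            "Summarize the following article in three sentences, "
            "using only the text provided:\n\n" + article_text
        )

    def explain_from_memory(term: str) -> str:
        # No source text supplied: the answer has to come from whatever
        # the model absorbed in training, which is where it gets shaky.
        return ask(f"Explain what '{term}' means and how it works.")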

But it’s really unreliable when it’s leveraging “knowledge” from its training. Asking what something in an article means, for example, is pretty risky. Open-ended questions that trigger knowledge dumps are very risky, and the longer the answer, the riskier. Even text translation is risky, despite being ostensibly a simple transformation, because it has to draw the word mappings from training. And god help you if you ask a leading question without realizing it. It’ll often just parrot your assumption back to you.

To your point, though, using an LLM as an oracle/teacher would be a lot less dangerous if it were always laughably wrong, but it’s usually not. It’s usually 90% right, and that’s what makes it so insidious. Picking out the 10% that isn’t is really hard, especially if you don’t know enough to second-guess it. The wrong stuff usually looks plausible, and it’s surrounded by enough valid information that what little you do know seems to corroborate the rest.



