Problem:
1) we want to train on GitHub repos
2) most evaluation datasets are spoiled. Training on GitHub would definitely spoil them
Solution:
Hand write new problems!!!
... leetcode style ....
... and we'll check if it passes tests
Example:
What's the decimal part of this float?
Surely in all of GitHub such code doesn't exist!
Surely in all of GitHub we can filter such code out by n-gram!
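To make the sarcasm concrete, here is a rough sketch of what such a hand-written, leetcode-style problem and its n-gram decontamination check look like, in a HumanEval-like "prompt plus hidden tests" format. The exact format and the 13-token window are my assumptions, not a quote of any particular benchmark:
```
# Sketch of a hand-written, leetcode-style benchmark problem in a
# HumanEval-like "prompt + hidden tests" format (format assumed).

def truncate_number(number: float) -> float:
    """Given a positive float, return its decimal (fractional) part."""
    return number - int(number)

def check(candidate) -> None:
    # Hidden unit tests: a completion "passes" iff these assertions hold.
    assert abs(candidate(3.5) - 0.5) < 1e-9
    assert abs(candidate(1.25) - 0.25) < 1e-9
    assert abs(candidate(123.0) - 0.0) < 1e-9

def ngram_overlap(problem_text: str, corpus_doc: str, n: int = 13) -> bool:
    """Naive decontamination check: does any n-gram of the problem text
    appear verbatim in a training document?"""
    toks = problem_text.split()
    grams = {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}
    return any(g in corpus_doc for g in grams)

check(truncate_number)
```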
Maybe my favorite part is that it has 60 authors and became the de facto benchmark for a while.
In general my smoke test for this kind of thing is whether the company (or whoever) will gladly accept full liability for the AI's usage.
Cases like:
- The AI replaces a salesperson, but the sales are not binding or final, in case a client talks the chatbot into a $0 bargain.
- It replaces drivers but it disengages 1 second before hitting a tree to blame the human.
- Support wants you to press cancel so the reports say "client canceled" and not "self-driving is doing laps around a patch of grass".
- AI is better than doctors at diagnosis, but in any case of misdiagnosis the blame is shifted to the doctor because "AI is just a tool".
- AI is better at coding than old meat devs, but when the unmaintainable security hole goes to production, the downtime and breaches cannot be blamed on the AI company producing the code; it was the old meat devs' fault.
AI companies want to have their cake and eat it too. Until I see them eating the liability, I know, and I know they know, that it's not ready for the things they say it is.
Most doctors have insurance for covering their mistakes. We might expect an AI medical startup to pay analogous premiums when it’s paid analogous fees.
The obvious next step is not that LLMs replace doctors; it's that LLMs become part of the 'standard of care', a component of the triage process. You go to the emergency room, and an LLM assessment becomes routine, if not required. This study shows that doing that will significantly increase accurate diagnoses, for a start. Everyone wins.
That's completely missing the point. The LLM scored substantially higher than the clinicians. Statistically this means the clinicians will produce many more misdiagnoses.
The point is that clinicians don't really get sued for misdiagnoses most of the time anyway. With AI, all one has to do is open up a new chat, tell the AI that its last diagnosis isn't really helping, and it will eagerly give an updated assessment. Compared to a clinician, the AI dramatically lowers the bar for iteratively working with it to address an issue.
As for drug prescriptions, they are to be processed through an interactions checker anyway.
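For what it's worth, that checker step is mechanical. Here is a toy sketch with a purely hypothetical interaction table; the drug names and pairs are placeholders, not medical facts, and a real system would query a maintained database:
```
from itertools import combinations

# Purely hypothetical interaction table; the names and pairs are placeholders,
# not medical claims. A real checker would query a maintained drug database.
KNOWN_INTERACTIONS = {
    frozenset({"drug_a", "drug_b"}),
    frozenset({"drug_a", "drug_c"}),
}

def flag_interactions(prescription: list[str]) -> list[tuple[str, str]]:
    """Return every prescribed pair that appears in the interaction table."""
    return [
        (x, y)
        for x, y in combinations(sorted(set(prescription)), 2)
        if frozenset({x, y}) in KNOWN_INTERACTIONS
    ]

print(flag_interactions(["drug_a", "drug_b", "drug_d"]))  # [('drug_a', 'drug_b')]
```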
If you tell an LLM that its last effort was bad, it won't give you a better outcome. It will get worse at whatever you asked for.
The reason is simple. They are trained as plausibility engines. It's more plausible that a bad diagnostician gives you a worse outcome than a good one, and you have literally just prompted it that it's bad at diagnosis.
Sure, you might get another text completion. Will it be correct, actionable, reliable, safe? Even a stopped clock. Good luck rolling those dice with your health.
In summary, do not iterate with prompts for declining competence.
No, that's a gross frequentist assessment. In reality, the Bayesian assessment is contingent on the first response not helping, and is therefore more likely to be correct, not less. The second response is a conditional response that benefits from new information provided by the user. Accordingly, it's very possible that the LLM will suggest further diagnostic tests to sort out the situation. The same technique also works for code reviews, with stunning effect.
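To spell out the conditioning argument with made-up numbers (purely illustrative, not real diagnostic statistics): once "the first diagnosis didn't help" is observed, probability mass shifts to the remaining candidates, so the follow-up answer is conditioned on strictly more information than the first one was.
```
# Made-up numbers, purely to illustrate the conditioning mechanics.
priors = {"A": 0.6, "B": 0.3, "C": 0.1}   # hypothetical prior over candidate diagnoses

# New evidence: treating for A didn't help, so (in this toy) A is ruled out
# and the remaining mass is renormalised.
posterior = {d: p for d, p in priors.items() if d != "A"}
total = sum(posterior.values())
posterior = {d: p / total for d, p in posterior.items()}

print(posterior)  # {'B': 0.75, 'C': 0.25} -- the follow-up starts from more information
```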
This recommendation isn't about prompts that include notes of "what didn't work". I'm talking about prompts that directly inform the model, "you are modelling an idiot".
The former is reasonable to include when iterating. The latter is a recipe for outcome degradation. GP above gave the latter form. That activates attention from parts of the model guiding towards confabulation and loss of faithfulness.
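A rough sketch of the two prompt forms being contrasted, assuming a generic chat-message structure; the wording and the message format are illustrative, not any vendor's API:
```
# Illustrative only: a generic chat-message structure, not any vendor's API.
history = [
    {"role": "user", "content": "Patient: three weeks of fatigue, low-grade fever..."},
    {"role": "assistant", "content": "Most consistent with X; consider treating with Y."},
]

# The former: an informative follow-up that notes what didn't work since the last answer.
informative_followup = history + [
    {"role": "user", "content": "Two weeks of Y with no improvement; new labs attached. "
                                "What should we consider or test for next?"},
]

# The latter: a degrading follow-up that only tells the model it is bad at the task.
degrading_followup = history + [
    {"role": "user", "content": "You are terrible at diagnosis. Try again."},
]
```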
The model doesn't know what is true, only what is plausible to emit. The hypothesis that plausibility converges with scale towards truth and faithfulness remains very far from proven. Bear in mind that the training data includes large swathes of arbitrary text from the Internet, from real life, and from fiction, which includes plenty of examples of people being wrong, stupid, incompetent, repetitive, whimsical, phony, capricious, manipulative, disingenuous, argumentative, and mendacious. In the right context these are plausible human-like textual interactions, and the only things really holding the model back from completing in such directions are careful training and the system prompt. Worst case scenario, perhaps the corpus included parliamentary proceedings from around the world. "Suppose you were an idiot. And suppose you were a member of Congress. But I repeat myself." - Mark Twain
I assume you ignored "teleology" because you concede the point, otherwise feel free to take it.
" Is there an “inventiveness test” that humans can pass but LLMs don’t?"
Of course: any topic where there is no training data available and that cannot be extrapolated by simply mixing the existing data. Admittedly, that is harder to test on current unknowns and unknown unknowns.
But it is trivial to test on retrospective knowledge. Just train the AI on text up to, say, 1800 and see if it can come up with antibiotics and general relativity, or if it will simply repeat outdated theories of disease and Newtonian gravity.
I don't think it would settle things even if we did manage to train a sufficiently large 1800-era LLM.
LLMs are blank slates (like an uncultured, primitive human being; an LLM does come with built-in knowledge, but that built-in knowledge is mostly irrelevant here). LLM output is purely a function of the input (context), so an agentic system's capabilities do not equal the underlying LLM's capabilities.
If you ask such an LLM "overturn Newtonian physics, come up with a better theory", of course the LLM won't give you relativity just like that, the same way an uneducated human has no chance of coming up with relativity either.
However, ask it this:
```
You are Einstein ...
<omitted: 10 million tokens establishing Einstein's early life and learnings>
... Recent experiments have put these ideas to doubt, ...<another bunch of tokens explaining the Michelson–Morley experiment>... Any idea why this occurs?
```
and provide it with tools to find books, speak with others, run experiments, etc. Conceivably, the result will be different.
Again, we pretty much see this play out in coding agents:
Claude the LLM has no prior knowledge of my codebase so of course it has zero chance of solving a bug in it. Claude 4 is a blank slate.
Claude Code the agentic system can:
- look at a screenshot.
- know what the overarching goal is from past interactions & various documentation it has generated about the codebase, as well as higher-level docs describing the company and products.
- realize the screenshot is showing a problem with the program.
- form hypotheses / ideate about why the bug occurs.
- verify hypotheses by observing the world ("the world" to Claude Code is the codebase it lives in, so by "observing" I mean it reads the code).
- run experiments: modify code then run a type check or unit test (although usually the final observation is outsourced to me, so I am the AI's tool as much as the other way around.)
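A minimal sketch of that observe/experiment loop, assuming a hypothetical `call_llm` and a pytest-based project; this is not Claude Code's actual implementation, just the shape of the loop:
```
import subprocess

def call_llm(context: str) -> str:
    """Placeholder for the underlying model: its output is purely a function of context."""
    raise NotImplementedError

def run_tests() -> str:
    """The 'experiment': run the unit tests and capture the outcome as an observation."""
    result = subprocess.run(["pytest", "-q"], capture_output=True, text=True)
    return result.stdout + result.stderr

def agent_step(context: str) -> str:
    """One iteration: the agent proposes, the world answers, and the answer becomes
    part of the next context. The capability lives in this loop plus the accumulated
    context, not in the bare model."""
    proposal = call_llm(context)   # e.g. a patch, a hypothesis, or a tool request
    observation = run_tests()      # observe the world (here: the test suite)
    return context + "\n" + proposal + "\n" + observation
```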
Problem is, paraphrasing Scott Kilmer, corporations are dead from the neck up. The conclusion for them was not that AI will help juniors; it's that they will not hire juniors and will ask seniors for the magic "10x" with the help of AI. Even some seniors are getting the boot, because AI.
Just look at recent news: layoff after layoff from big tech, mid tech, and small tech.