Thank you for the good laugh! This whole thread is peak satire.
Although, be careful. It reminds me of the foreword to a shortstory someone shared on HN recently: „[…] Read it and laugh, because it is very funny, and at the moment it is satire. If you’re still around forty years from now, do the existing societal equivalent of reading it again, and you may find yourself laughing out of the other side of your mouth (remember mouths?). It will probably be much too conservative.“ — https://www.baen.com/Chapters/9781618249203/9781618249203___...
You're right. They did it. The old man and dog joke has been realized, but the real answer of the future turned out to be: "the dog programs the game, and the man feeds the treat hopper."
Previous models from competitors usually got that correct, and the reasoning versions almost always did.
This kind of reflexive criticism isn't helpful, it's closer to a fully generalized counter-argument against LLM progress, whereas it's obvious to anyone that models today can do things they couldn't do six months ago, let alone 2 years back.
I'm not denying any progress, I'm saying that reasoning failures that are simple which have gone viral are exactly the kind of thing that they will toss in the training data. Why wouldn't they? There's real reputational risks in not fixing it and no costs in fixing it.
Given that Gemini 3 Pro already did solid on that test, what exactly did they improve? Why would they bother?
I double checked and tested on AI Studio, since you can still access the previous model there:
>You should drive.
>If you walk there, your car will stay behind, and you won't be able to wash it.
Thinking models consistently get it correct and did when the test was brand new (like a week or two ago). It is the opposite of surprising that a new thinking model continues getting it correct, unless the competitors had a time machine.
Why would they bother? Because it costs essentially nothing to add it to the training data. My point is that once a reasoning example becomes sufficiently viral, it ceases to be a good test because companies have a massive incentive to correct it. The fact some models got it right before (unreliably) doesn't mean they wouldn't want to ensure that the model gets it right.
Show us how to build machines, create factories, mines, chip fabs, etc., smelt steel, and so forth out of those bacteria and cells and you might have a point.
Same here. I did notice what I think was an actual error on someone's part, there was a chart in the files comparing black to white IQ distributions, and well, just look at it:
>Interactive Human Simulator is a bold way to describe spinning up a few GPT calls with mood sliders, but sure, let’s call it anthropology. Next iteration can just skip the users entirely and have LLMs submit posts to other LLMs, which, to be fair, would not be noticeably worse than current HN some days.
>If anything the agentic wave is showing that the chat interfaces are better off hidden behind stricter user interface paradigms.
I'm not sure that claim is justified. The primary agentic use case today is code generation, and the target demographic is used to IDEs/code editors.
While that's probably a good chunk of total token usage, it's not representative of the average user's needs or desires. I strongly doubt that the chat interface would have become so ubiquitous if it didn't have merit.
Even for more general agentic use, a chat interface allows the user the convenience of typing or dictating messages. And it's trivially bundled with audio-to-audio or video-to-video, the former already being common.
I expect that even in the future, if/when richer modalities become standard (and the models can produce video in real-time), most people will be consuming their outputs as text. It's simply more convenient for most use-cases.
Having already seen this explored late '24, what ends up happening is that the end user generates apps that have lots of jank, quirks, and logical errors that they lack the ability to troubleshoot or resolve. Like the fast forward button corrupting their settings config, the cloud sync feature causing 100% CPU load, icons gradually drifting away from their original positions on each window resize event, or the GUI tutorial activating every time they switch views in the app. Even worse, because their app is the only one of its kind, there is no other human to turn to for advice.
I found it genuinely impressive how useless their "GPTs" were.
Of course, part of it was due to the fact that the out-of-the-box models became so competent that there was no need for a customized model, especially when customization boiled down to barely more than some kind of custom system prompt and hidden instructions. I get the impression that's the same reason their fine-tuning services never took off either, since it was easier to just load necessary information into the context window of a standard instance.
Edit: In all fairness, this was before most tool use, connectors or MCP. I am at least open to the idea that these might allow for a reasonable value add, but I'm still skeptical.
> I get the impression that's the same reason their fine-tuning services never took off either
Also, very few workloads that you'd want to use AI for are prime cases for fine-tuning. We had some cases where we used fine tuning because the work was repetitive enough that FT provided benefits in terms of speed and accuracy, but it was a very limited set of workloads.
Very typical e-commerce use cases processing scraped content: product categorization, review sentiment, etc. where the scope is very limited. We would process tens of thousands of these so faster inference with a cheaper model with FT was advantageous.
Disclaimer: this was in the 3.5 Turbo "era" so models like `nano` now might be cheap enough, good enough, fast enough to do this even without FT.
A man, a dog and an instance of Claude.
The dog writes the prompts for Claude, the man feeds the dog, and the dog stops the man from turning off the computer.
reply