One fundamental challenge to me is that as each training run becomes more and more expensive, the time it takes to learn what works/doesn't work widens. Half a billion dollars for training a model is already nuts, but if it takes 100 iterations to perfect it, you've cumulatively spent 50 billion dollars... Smaller models may actually be where rapid innovation continues, simply because of tighter feedback loops. o3 may be an example of this.
AGI will arrive like self-driving cars. It's not that you will wake up one day and we have it. Cars gained auto-braking, parallel parking, cruise control assist, and over a long time you get to something like Waymo, which is still location dependent. I think AGI will take decades, but sooner we'll get special cases that are effectively the same thing.
Interesting idea. The concept of The Singularity would seem to go against this, but I do feel that seems unlikely and that a gradual transition is more likely.
However, is that AGI, or is it just ubiquitous AI? I’d agree that, like self driving cars, we’re going to experience a decade or so transition into AI being everywhere. But is it AGI when we get there? I think it’ll be many different systems each providing an aspect of AGI that together could be argued to be AGI, but in reality it’ll be more like the internet, just a bunch of non-AGI models talking to each other to achieve things with human input.
I don’t think it’s truly AGI until there’s one thinking entity able to perform at or above human level in everything.
The idea of the singularity presumes that running the AGI is either free or trivially cheap compared to what it can do, so we are fine expanding compute to let the AGI improve itself. That may eventually be true, but it's unlikely to be true for the first generation of AGI.
The first AGI will be a research project that's completely uneconomical to run for actual tasks because humans will just be orders of magnitude cheaper. Over time humans will improve it and make it cheaper, until we reach some tipping point where letting the AGI improve itself is more cost effective than paying humans to do it
The Singularity is caused by AI being able to design better AI. There's probably some AI startup trying to work on this at the moment, but I don't think any of the big boys are working on how to get an LLM to design a better LLM.
I still like the analogy of this being a really smart lawn mower, and we're expecting it to suddenly be able to do the laundry because it gets so smart at mowing the lawn.
I think LLMs are going to get smarter over the next few generations, but each generation will be less of a leap than the previous one, while the cost gets exponentially higher. In a few generations it just won't make economic sense to train a new generation.
Meanwhile, the economic impact of LLMs in business and government will cause massive shifts - yet more income shifting from labour to capital - and we will be too busy dealing with that as a society to be able to work on AGI properly.
I think this whole “AGI” thing is so badly defined that we may as well say we already have it. It already passes the Turing test and does well on tons of subjects.
What we can start to build now is agents and integrations. Building blocks like panel of experts agents gaming things out, exploring space in a Monte Carlo Tree Search way, and remembering what works.
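To make the MCTS bit concrete, here's roughly the loop I have in mind (a toy sketch only; `expand` and `evaluate` are stand-ins for whatever agent proposals and judging you actually plug in, not any real framework):

```python
# Hypothetical sketch of MCTS-style exploration over agent actions.
# expand() proposes candidate next states, evaluate() scores an outcome
# (an LLM judge, a test suite, etc.) -- both are assumptions here.
import math
import random

class Node:
    def __init__(self, state, parent=None):
        self.state = state
        self.parent = parent
        self.children = []
        self.visits = 0
        self.value = 0.0

def ucb(node, c=1.4):
    # Upper Confidence Bound: balance exploitation vs. exploration.
    if node.visits == 0:
        return float("inf")
    return node.value / node.visits + c * math.sqrt(
        math.log(node.parent.visits) / node.visits
    )

def mcts(root, expand, evaluate, iterations=100):
    for _ in range(iterations):
        # 1. Select: walk down the tree by UCB.
        node = root
        while node.children:
            node = max(node.children, key=ucb)
        # 2. Expand: ask the agent for candidate next actions.
        for state in expand(node.state):
            node.children.append(Node(state, parent=node))
        # 3. Simulate: score one candidate (or the node itself).
        leaf = random.choice(node.children) if node.children else node
        reward = evaluate(leaf.state)
        # 4. Backpropagate: remember what worked.
        while leaf is not None:
            leaf.visits += 1
            leaf.value += reward
            leaf = leaf.parent
    return max(root.children, key=lambda n: n.visits)
```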
Robots are only constrained by mechanical servos now. When they can do something, they’ll be able to do everything. It will happen gradually then all at once. Because all the tasks (cooking, running errands) are trivial for LLMs. Only moving the limbs and navigating the terrain safely is hard. That’s the only thing left before robots do all the jobs!
It's not contradictory. It can happen over a decade and still be a dramatically sloped S curve with tremendous change happening in a relatively short time.
AGI is the holy grail of technology. A technology so advanced that not only does it subsume all other technology, but it is able to improve itself.
Truly general intelligence like that will either exist or not. And the instant it becomes public, the world will have changed overnight (maybe the span of a year)
Note: I don’t think statistical models like these will get us there.
There may well be an upper limit on cognition (we are not really sure what cognition is - even as we do it) and it may be that human minds are close to it.
LLMs have no real sense of truth or hard evidence of logical thinking. Even the latest models still trip up on very basic tasks. I think they can be very entertaining, sure, but not practical for many applications.
The autoregressive transformer LLMs aren't even the only way to do text generation. There are now diffusion based LLMs, StripedHyena based LLMs, and flow matching based LLMs.
There's a wide amount of research into other sorts of architectures.
LLMs are a key piece of understanding that token sequences can trigger actions in the real world. AGI is here. You can trivially spin up a computer-using agent to improve itself into a competent office worker.
Tokens don't need to be text either; you can move to higher-level "take_action" semantics where "stream back 1 character to session#117" is a single function call. Training cheap models that can do things in the real world is going to change a huge amount of present capabilities over the next 10 years.
Says who? And more importantly, is this the boulder? All I (and many others here) see is that people engage others to sponsor pushing some boulder, screaming promises which aren’t even that consistent with intermediate results that come out. This particular boulder may be on a wrong mountain, and likely is.
It all feels like doubling down on astrology because good telescopes aren’t there yet. I’m pretty sure that when 5 comes out, it will show some amazing benchmarks but shit itself in the third paragraph as usual in a real task. Cause that was constant throughout GPT evolution, in my experience.
even if it kills us
Full-on sci-fi, in reality it will get stuck around a shell error message and either run out of money to exist or corrupt the system into no connectivity.
The buzzkill when you fire up the latest most powerful model only for it to tell you that peanut is not typically found in peanut butter and jelly sandwiches.
There's no doubt been progress on the way to AGI, but ultimately it's still a search problem, and one that will rely on human ingenuity at least until we solve it. LLMs are such a vast improvement in showing intelligent-like behavior that we've become tantalized by it. So now we're possibly focusing our search in the wrong place for the next innovation on the path to AGI. Otherwise, it's just a lack of compute, and then we just have to wait for the capacity to catch up.
I don't think AI will be what kills us. The paperclip machine is already here - it's capitalism. It doesn't need AI, it has unthinking, powerless people already tied to optimising for bad metrics. Everything else just makes it more efficient at killing us.
I think you're both right and wrong. You're right that capitalism has become a paperclip machine, but capitalism also wants AI so it can cheaply and at scale replace the human components of the machine with something that has more work capacity for fewer demands.
The problem is that the people in power will want to maintain the status quo. So the end of human labor won't naturally result in UBI – or any kind of welfare – to compensate for the loss of income, let alone afford any social mobility. But wealthy people will be able to leverage AGI to defend themselves from any uprising by the plebs.
We're too busy trying to make humans irrelevant, and not asking what exactly we, as a species of 10+ billion individuals, do afterwards. There are some excited discussions about a rebirth of culture, but I'm not sure what that means when machines can do anything humans can do but better. Perhaps we just tinker around with our hobbies until we die? I honestly don't think it will play out well for us.
Machines can’t have fun for us. They can’t dance to a beat, they can’t experience altered states of mind. They can’t create a sense of belonging through culture and ritual. Yes we have lost a lot in the last 100 years but there are still pockets of resistance that carry old knowledge that “we the people” will be glad of in the coming century.
The problem is that the "we" who are busy trying to make humans irrelevant seem to be completely unconcerned with the effects on the "we" who will be superfluous afterwards.
It seems to me that given how AI is likely to continuously increase capitalism's efficiency, your argument actually supports the claim you're trying to dispute.
I am working at an AI company that is not OpenAI. We have found ways to modularize training so we can test on narrower sets before training is "completely done". That said, I am sure there are plenty of ways others are innovating to solve the long training time problem.
Perhaps the real issue is that learning takes time and that there may not be a shortcut. I'll grant you that argument's analogue was complete wank when comparing say the horse and cart to a modern car.
However, we are not comparing cars to horses but computers to a human.
I do want "AI" to work. I am not a luddite. The current efforts that I've tried are not very good. On the surface they offer a lot, but the lustre comes off very quickly.
(1) How often do you find yourself arguing with someone about a "fact"? Your fact may be fiction for someone else.
(2) LLMs cannot reason
A next token guesser does not think. I wish you all the best. Rome was not burned down within a day!
I can sit down with you and discuss ideas about what constitutes truth and cobblers (rubbish/false). I have indicated via parenthesis (brackets in en_GB) another way to describe something and you will probably get that but I doubt that your programme will.
This is literally just the scaling laws, "Scaling laws predict the loss of a target machine learning model by extrapolating from easier-to-train models with fewer parameters or smaller training sets. This provides an efficient way for practitioners and researchers alike to compare pretraining decisions involving optimizers, datasets, and model architectures"
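For concreteness, the usual fitted form looks something like the Chinchilla-style law below (N is parameter count, D is training tokens; A, B, E, alpha and beta are constants fitted from the smaller runs, written here only as symbols):

```latex
% Predicted loss as a function of parameters N and training tokens D:
L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
```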
Until you get to a point where the LLM is smart enough to look at real world data streams and prune its own training set out of it. At that point it will self-improve to AGI.
But if the scaling law holds true, more dollars should at some point translate into AGI, which is priceless. We haven't reached the limits yet of that hypothesis.
a) There is evidence e.g. private data deals that we are starting to hit the limitations of what data is available.
b) There is no evidence that LLMs are the roadmap to AGI.
c) Continued investment hinges on there being a large enough cohort of startups that can leverage LLMs to generate outsized returns. There is no evidence yet that this is the case.
"There is no evidence that LLMs are the roadmap to AGI." - There's plenty of evidence. What do you think the last few years have been all about? Hell, GPT-4 would already have qualified as AGI about a decade ago.
>What do you think the last few years have been all about?
Next token language-based predictors with no more intelligence than brute force GIGO which parrot existing human intelligence captured as text/audio and fed in the form of input data.
4o agrees:
"What you are describing is a language model or next-token predictor that operates solely as a computational system without inherent intelligence or understanding. The phrase captures the essence of generative AI models, like GPT, which rely on statistical and probabilistic methods to predict the next piece of text based on patterns in the data they’ve been trained on"
He probably didn't need petabytes of reddit posts and millions of gpu-hours to parrot that though.
I still don't buy the we do the same as LLMs. Of course one could hypothesize the brain language centers may have some similarities, but the differences in resource usage and how those resources are used to train between humans and LLMs are remarkable and may indicate otherwise.
>Everything you said is parroting data you’ve trained on
"Just like" an LLM, yeah sure...
Like how the brain was "just like" a hydraulic system (early industrial era), like a clockwork with gears and differentiation (mechanical engineering), "just like" an electric circuit (Edison's time), "just like" a computer CPU (21st century), and so on...
You have described something but you haven't explained why the description of the thing defines its capability. This is a tautology, or possibly a begging of the question, which takes as true the premise of something (that token based language predictors cannot be intelligent) and then uses that premise to prove an unproven point (that language models cannot achieve intelligence).
You did nothing at all to demonstrate why you cannot produce an intelligent system from a next token language based predictor.
What GPT says about this is completely irrelevant.
>You did nothing at all to demonstrate why you cannot produce an intelligent system from a next token language based predictor
Sorry, but the burden of proof is on your side...
The intelligence is in the corpus the LLM was fed with. Using statistics to pick from it and re-arrange it gives new intelligent results because the information was already produced by intelligent beings.
If somebody gives you an excerpt of a book, it doesn't mean they have the intelligence of the author - even if you have taught them a mechanical statistical method to give back a section matching a query you make.
Kids learn to speak and understand language at 3-4 years old (among tons of other concepts), and can reason by themselves in a few years with less than 1 billionth the input...
>What GPT says about this is completely irrelevant.
On the contrary, it's using its very real intelligence, about to reach singularity any time now, and this is its verdict!
Why would you say it's irrelevant? That would be as if it merely statistically parroted combinations of its training data unconnected to any reasoning (except of that the human creators of the data used to create them) or objective reality...
> If somebody gives you an excerpt of a book, it doesn't mean they have the intelligence of the author
A closely related rant of my own: The fictional character we humans infer from text is not the author-machine generating that text, not even if they happen to share the same name. Assuming that the author-machine is already conscious and choosing to insert itself is begging the question.
Person 1: rockets could be a method of putting things into Earth orbit
Person 2: rockets cannot get things into orbit because they use a chemical reaction which causes an equal and opposite force reaction to produce thrust
Does person 1 have the burden of proof that rockets can be used to put things in orbit? Sure, but that doesn't make the reasoning used by person 2 valid to explain why person 1 is wrong.
BTW thanks for adding an entire chapter to your comment in edit so it looks like I am ignoring most of it. What I replied to was one sentence that said 'the burden of proof is on you'. Though it really doesn't make much difference because you are doing the same thing but more verbose this time.
None of the things you mentioned preclude intelligence. You are telling us again how it operates but not why that operation is restrictive in producing an intelligent output. There is no law that says that intelligence requires anything but a large amount of data and computation. If you can show why these things are not sufficient, I am eager to read about it. A logical explanation would be great, step by step please, without making any grand unproven assumptions.
In response to the person below... again, whether or not person 1 is right or wrong does not make person 2's argument valid.
No, GPT-4 would have been classified as it is today: a (good) generator of natural language. While this is a hard classical NLP task, it's a far cry from intelligence.
For an industry that spun off of a research field that basically revolves around recursive descent in one form or another, there's a pretty silly amount of willful ignorance about the basic principles of how learning and progress happens.
The default assumption should be that this is a local maximum, with evidence required to demonstrate that it's not. But the hype artists want us all to take the inevitability of LLMs for granted—"See the slope? Slopes lead up! All we have to do is climb the slope and we'll get to the moon! If you can't see that you're obviously stupid or have your head in the sand!"
Sure they’ve hit the wall with obvious conversations and blog articles that humans produced, but data is a by product of our environment. Surely there’s more. Tons more.
This also isn't true. It'll clearly have a price to run. Even if it's very intelligent, if the price to run it is too high it'll just be a 24/7 intelligent person that few can afford to talk to. No?
Computers will be the size of data centres, they'll be so expensive we'll queue up jobs to run on them days in advance, each taking our turn... history echoes into the future...
Yea, and those statements were true. For a time. If you want to say "AGI will be priceless some unknown time into the future" then I'd be on board lol. But to imply it'll be immediately priceless? As in any amount spent today would be immediately repaid once AGI exists? Nonsense.
Maybe if it was _extremely_ intelligent and its ROI would be all the drugs it would instantly discover or w/e. But let's not imply that General Intelligence requires knowing everything.
So at best we're talking about an AI that is likely close to human level intelligence. Which is cool, because we have 7+ billion of those things.
This isn't an argument against it. Just to say that AGI isn't "priceless" in the implementation we'd likely see out of the gate.
"OpenAI’s is called GPT-4, the fourth LLM the company has developed since its 2015 founding." - that sentence doesn't fill me with confidence in the quality of the rest of the article, sadly.
> At best, they say, Orion performs better than OpenAI’s current offerings, but hasn’t advanced enough to justify the enormous cost of keeping the new model running.
If you offer an API you need to dedicate servers to it that keep the model loaded in GPU memory. Unless you don't care about latency at all.
Though I wouldn't be surprised if the bigger reason is the PR cost of releasing with an exciting name but unexciting results. The press would immediately declare the end of the AI growth curve
No, I'm complaining that just because GPT-4 is called GPT-4 doesn't mean it's the fourth LLM from OpenAI.
Off the top of my head: GPT-2, Codex, GPT-3 in three different flavors (babbage, curie, davinci), GPT-3.5.
Suggesting that GPT-4 was "fourth" simply isn't credible.
Just the other day they announced a jump from o1 to o3, skipping o2 purely because it's already the name of a major telecommunications brand in Europe. Deriving anything from the names of OpenAI's products doesn't make sense.
If we're generous the article considers versions that were significant improvements. 4o is hardly better on real-world usage (benchmarks are gamed to death) than the original 4.
The UK is part of Europe. It's technically, geographically, politically, historically, linguistically, tectonically and socially correct. In what ways is it not?
Are Cuba or Haiti part of North America? A lot of British people feel like their civilization is meaningfully distinct from “Europe”, even though they’re part of it in a technical geographical sense.
In general yes, but it depends on whether you consider Central America its own continent, whether you include them there, and how you delineate North/South America. Groupings differ based on your education.
I think the thing that makes the UK different is that there is no other option besides them being a separate thing/continent. Are you suggesting that the UK is its own continent? Would that be with the Faroese and the Greenlanders?
The UK might feel different, but they are not separate. The French feel different from the Bulgarians, but that does not mean they are on a separate continent, politically or geographically.
EDIT:
> A lot of British people feel like their civilization is meaningfully distinct
This is, to borrow a word, "balderdash". Looking at the influence the Vikings, Romans and Normans have had, that is a rubbish argument. Just like other countries in Europe, British culture is built on the stones of other cultures, and just like many other countries they subsumed other cultures because of kings or other political dominance.
The point was that any closeby landmass besides europe is either in europe or in north america, and I have a hard time seeing the argument for UK being in North America or America at all.
But I'm guessing we can agree that any major landmass is generally belonging to a continent? Like we all agree that greenland, new zealand, japan, etc generally belong to a continent?
So to what continent do those british people think they belong?
If you asked someone directly “what continent is Britain part of”, they would surely say Europe, even if they would be unlikely to describe themselves as European. Language is funny that way.
Technically…? Does anyone here believe that the EU and Europe are the same thing? Would you find it weird if someone said that a Norwegian company was in Europe?
Parent is suggesting it would be weird for Europeans to say the UK is in Europe, which, as a European, I can tell you is preposterous. That’s the kind of nonsense you used to hear from Brexiters. They will have no sympathy from me.
While I’m sure it’s unintentional, that amounts to nitpicking. I can easily find three to include and pass over the rest. Face value turns out to be a decent approximation.
The thing is, I think it could be an optimal way of saying it. Should we not put it into the context of making a particular LLM? Why count three versions of GPT-3 as three LLMs? They made it hard to choose the one that makes up for not having a GPT-1. GPT-3.5 and Codex are both good candidates. And of course calling GPT-4 the third or the fifth could be considered as well.
That doesn’t resolve the problem of whether third or fifth is better than fourth. I have yet to be convinced that their wording here shows that they fail to grasp the pace of the development.
The issue isn't the grammar. It is that there are 5 distinct LLMs from OpenAI that you can use right now as well as 4 others that were deprecated in 2024.
The article definitely has issues, but to me what's relevant is where it's published. The smart money and experts without a vested interest have been well aware that LLMs are an expensive dead end for over a year and have been saying as much (Gary Marcus for instance). That this is starting to enter mainstream consciousness is what's newsworthy.
I've been messing around with base (not instruction tuned) LLMs; they often evade AI detectors and I wouldn't be surprised if they evade this kind of detection too, at least with a high temperature
"Orion’s problems signaled to some at OpenAI that the more-is-more strategy, which had driven much of its earlier success, was running out of steam."
So LLMs finally hit the wall. For a long time, more data, bigger models, and more compute to drive them worked. But that's apparently not enough any more.
Now someone has to have a new idea. There's plenty of money available if someone has one.
The current level of LLM would be far more useful if someone could get a conservative confidence metric out of the internals of the model. This technology desperately needs to output "Don't know" or "Not sure about this, but ..." when appropriate.
The new idea is inference-time scaling, as seen in o1 (and o3 and Qwen's QwQ and DeepSeek's DeepSeek-R1-Lite-Preview and Google's gemini-2.0-flash-thinking-exp).
Is it "eerie"? LeCun has been talking about it for some time, and may also be OpenAI's rumored q-star, mentioned shortly after Noam Brown (diplomacybot) joining OpenAI. You can't hill climb tokens, but you can climb manifolds.
I wasn’t aware of others attempting manifolds for this before - just something I stumbled upon independently. To me the “eerie” part is the thought of an LLM no longer using human language to reason - it’s like something out of a sci fi movie where humans encounter an alien species that thinks in a way that humans cannot even comprehend due to biological limitations.
I am hopeful that progress in mechanistic interpretability will serve as a healthy counterbalance to this approach when it comes to explainability, though I kinda worry that at a certain point it may be that something resembling a scaling law puts an upper bound on even that.
I imagine he means that when you reason in latent space the final answer is a smooth function of the parameters, which means you can use gradient descent to directly optimize the model to produce a desired final output without knowing the correct reasoning steps to get there.
When you reason in token space (like everyone is doing now) you are executing nonlinear functions when you sample after each token, so you have to use some kind of reinforcement learning algorithm to learn the weights.
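A toy illustration of that difference, assuming a stand-in `nn.Linear` in place of a real transformer block (a sketch only, not anyone's actual training code):

```python
# Toy, simplified illustration: why latent-space "reasoning" can be trained
# with plain gradient descent while token-space reasoning cannot.
import torch
import torch.nn as nn

vocab, hidden = 100, 32
embed = nn.Embedding(vocab, hidden)
step = nn.Linear(hidden, hidden)   # stand-in for one "reasoning step"
head = nn.Linear(hidden, vocab)    # stand-in for the output head

h = embed(torch.tensor([5]))       # some starting state

# Token-space: sample a token at each step. torch.multinomial is not
# differentiable, so the gradient chain is cut at every sampled token;
# crediting the intermediate steps needs RL-style methods.
logits = head(step(h))
tok = torch.multinomial(logits.softmax(-1), 1)   # gradient stops here
h_token = embed(tok.squeeze(0))

# Latent-space: feed the hidden state straight back in. The final loss is a
# smooth function of all parameters, so backprop reaches every step.
h_latent = step(step(h))
loss = head(h_latent).sum()
loss.backward()                    # gradients flow through both steps
```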
What wall? Not a week has gone by in recent years without an LLM breaking new benchmarks. There is little evidence to suggest it will all come to a halt in 2025.
Sure, but "benchmarks" here seems roughly as useful as "benchmarks" for GPUs or CPUs, which don't much translate to what the makers of GPT need, which is 'money making use cases.'
o3 has demonstrated that OpenAI needs 1,000,000% more inference-time compute to score 50% higher on benchmarks. If o3-high costs about $350k an hour to operate, that would mean making o4 score 50% higher would cost $3.5B (!!!) an hour. That's the scaling wall.
I’m convinced they’re getting good at gaming the benchmarks since 4 has deteriorated via ChatGPT, in fact I’ve used 4-0125 and 4-1106 via the API and find them far superior to o1 and o1-mini at coding problems. GPT4 is an amazing tool but the true capabilities are being hidden from the public and/or intentionally neutered.
> I’ve used 4-0125 and 4-1106 via the API and find them far superior to o1 and o1-mini at coding problems
Just chiming in to say you're not alone. This has been my experience as well. The o# line of models just don't do well at coding, regardless of what the benchmarks say.
I used to run a lot of monte carlo simulations where the error is proportional to the inverse square root. There was a huge advantage of running for an hour vs a few minutes, but you hit the diminishing returns depressingly quickly. It would not surprise me at all if llms end up having similar scaling properties.
Yeah, any situation where you need O(n^2) runtime to obtain n bits of output (or bits of accuracy, in the Monte Carlo case) is pure pain. At every point, it's still within your means to double the amount of output (by running it 3x longer than you have so far), but it gradually becomes more and more painful, instead of there being a single point where you can call it off.
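A quick toy illustration of how brutal 1/sqrt(n) convergence is (estimating pi, nothing LLM-specific; each extra digit of accuracy costs roughly 100x more samples):

```python
# Monte Carlo estimate of pi: error shrinks like 1/sqrt(n).
import random

def estimate_pi(n):
    hits = sum(1 for _ in range(n)
               if random.random() ** 2 + random.random() ** 2 <= 1.0)
    return 4.0 * hits / n

for n in (1_000, 100_000, 10_000_000):
    est = estimate_pi(n)
    print(f"n={n:>10,}  estimate={est:.5f}  error~{abs(est - 3.14159265):.5f}")
```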
Even assuming that past rates of inference cost scaling hold up, we would only expect a 2 OoM decrease after about a year or so.
And 1% of 3.5b is still a very large number.
Not really. o3-low compute still stomps the benchmarks and isn't anywhere near that expensive, and o3-mini seems better than o1 while being cheaper.
Combine that with the fact that LLM inference costs have dropped by orders of magnitude over the last few years, and harping on about the inference costs of a new release seems a bit silly.
Not necessarily. And this is the problem with ARC that people seem to forget.
- It's just a suite of visual puzzles. It's not like say GSM8K where proficiency in it gives some indication on Math proficiency in general.
- It's specifically a suite of puzzles that LLMs have shown particular difficulty in.
Basically how much compute it takes to handle a task in this benchmark does not correlate with how much it will take LLMs to compute tasks that people actually want to use LLMs for.
If the benchmark is not representative of normal usage* then the benchmark and the plot being shown are not useful at all from a user/business perspective and the focus on the breakthrough scores of o3-low and o3-high in ARC-AGI would be highly misleading. And also the "representative" point is really moot from the discussion perspective (i.e. saying o3 stomps benchmarks, but the benchmarks aren't representative).
*I don't think that is the case as you can at least make relative conclusions (i.e. o3 vs o1 series, o3-low is 4x to 20x the cost for ~3x the perf). Even if it is pure marketing they expect people to draw conclusions using the perf/cost plot from Arc.
PS: I know there are more benchmarks like SWE-Bench and Frontier Math, but this is the only one showing data about o3-low/high costs without considering the CodeForces plot that includes o3-mini (that one does look interesting, though right now is vaporware) but does not separate between compute scale modes.
If you are talking about the ARC benchmark, then o3-low doesn't look that special if you take into account that plenty of finetuned models with much smaller resources achieved 40-50% results on the private set (not the semi-private one like o3-low).
- I'm not just talking about ARC. On frontier Math, we have 2 scores, one with pass@1 and another with consensus vote with 64 samples. Both scores are much better than previous Sota.
- Also apparently, ARC wasn't a special fine-tune; rather, some of the ARC training set was included in the pre-training corpus.
>that result is not verifiable, not reproducible, unknown if it was leaked and how it was measured. It's kinda hype science.
It will be verifiable when the model is released. OpenAI hasn't released any benchmark scores that were later shown to be falsified, so unless you have an actual reason to believe they're outright lying, it's not something to take seriously.
Frontier Math is a private benchmark; of its highest tier of difficulty, Terence Tao says:
“These are extremely challenging. I think that in the near term basically the only way to solve them, short of having a real domain expert in the area, is by a combination of a semi-expert like a graduate student in a related field, maybe paired with some combination of a modern AI and lots of other algebra packages…”
Unless you have a reason to believe answers were leaked then again, not interested in baseless speculation.
>its private for outsiders, but it was developed in "collaboration" with OAI, and GPT was tested in the past on it, so they have it in logs somewhere.
They have logs of the questions probably but that's not enough. Frontier Math isn't something that can be fully solved without gathering top experts at multiple disciplines. Even Tao says he only knows who to ask for the most difficult set.
Basically, what you're suggesting at least with this benchmark in particular is far more difficult than you're implying.
>If you think this entire conversation is pointless, then why do you continue?
There's no point arguing about how efficient the models are (the original point) if you won't even accept the results of the benchmarks. Why am I continuing? For now, it's only polite to clarify.
> Frontier Math isn't something that can be fully solved without gathering top experts
Tao's quote above referred to the hardest 20% of problems; they have 3 levels of difficulty, and presumably the first level is much easier. Also, as I mentioned, OAI collaborated on creating the benchmark, so they could have access to all the solutions too.
> There's no point arguing
Lol, let me ask again: why are you arguing then? Yes, I have strong, reasonable (imo) doubt that those results are valid.
Not really. Throwing a bunch of unfiltered garbage at the pretraining dataset, throwing in RLHF of questionable quality during post-training, and other current hacks - none of that was expected to last forever. There is so much low-hanging fruit that OpenAI left untouched and I'm sure they're still experimenting with the best pre-training and post-training setups.
One thing researchers are seeing is resistance to post-training alignment in larger models, but that's almost the opposite of a wall, they're figuring it out as well.
> Now someone has to have a new idea
OpenAI already has a few, namely the o* series in which they discovered a way to bake Chain of Thought into the model via RL. Now we have reasoning models that destroy benchmarks that they previously couldn't touch.
Anthropic has a post-training technique, RLAIF, which supplants RLHF, and it works amazingly well. Combined with countless other tricks we don't know about in their training pipeline, they've managed to squeeze so much performance out of Sonnet 3.5 for general tasks.
Gemini is showing a lot of promise with their new Flash 2.0 and Flash 2.0-Thinking models. They're the first models to beat Sonnet at many benchmarks since April. The new Gemini Pro (or Ultra? whatever they call it now) is probably coming out in January.
> The current level of LLM would be far more useful if someone could get a conservative confidence metric out of the internals of the model. This technology desperately needs to output "Don't know" or "Not sure about this, but ..." when appropriate.
You would probably enjoy this talk [0], it's by an independent researcher who IIRC is a former employee of Deepmind or some other lab. They're exploring this exact idea. It's actually not hard to tell when a model is "confused" (just look at the probability distribution of likely tokens), the challenge is in steering the model to either get back to the right track or give up and say "you know what, idk"
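The core idea is cheap to sketch. The logits below are made up for illustration; in practice you'd take the per-token probabilities/logprobs the model itself reports:

```python
# Toy sketch: flag "confused" generations by the entropy of the next-token
# distribution. The logits are invented; a real system would read them from
# the model (e.g. the per-token logprobs an API can return).
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)

confident = softmax([9.0, 1.0, 0.5, 0.2])   # one clearly dominant token
confused  = softmax([2.1, 2.0, 1.9, 1.8])   # several near-equal candidates

for name, dist in [("confident", confident), ("confused", confused)]:
    h = entropy(dist)
    flag = "OK" if h < 1.0 else "maybe say 'not sure about this, but ...'"
    print(f"{name}: entropy={h:.2f} -> {flag}")
```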
> Now someone has to have a new idea. There's plenty of money available if someone has one.
I honestly do claim to have some ideas where I see evidence that they might work (and I do attempt to work privately on a prototype if only out of curiosity and to see whether I am right). The bad news: these ideas very likely won't be helpful for these LLM companies because they are not useful for their agenda, and follow a very different approach.
So no money for me. :-(
Let me put it this way:
Have you ever talked to a person whose intelligence is miles above yours? It can easily become very exhausting. Thus an "insanely intelligent" AI would not be of much use for most people - it would think "too different" from such people.
There do exist tasks in commerce for which an insane amount of intelligence would make a huge difference (in the sense of being positive regarding some important KPIs), but these are rare. I can imagine some applications of such (fictional) "super-intelligent" AIs in finance and companies doing some bleeding-edge scientific research - but these are niche applications (though potentially very lucrative ones).
If OpenAI, Anthropic & Co were really attempting to develop some "super-smart" AI, they would be working on those very lucrative niche applications where an insane amount of intelligence makes a huge difference, and where you can assume and train the AI operator to have "Fields-medal level" intelligence.
Anecdotally Claude is just as bad as every other LLM.
Step into more niche areas e.g. I am trying to use it with Scala macros and at least 90% of the time it is giving code that either (a) fails to compile or (b) is just complete gibberish.
And at no point ever has it said it didn't know something.
Yep, get into any sufficiently deep niche (i.e. actually almost any non-trivial app) and the LLM magic fades off.
Yeah sure, you can make a pong clone in html/js, and that's mainly because the internet is full of pong clone demos. Ask how to constrain a statsmodels linear model in some non-standard way? It will gaslight you about how it is possible and make you lose time in the process.
To output "don't know" a system needs to "know" too. Random token generator can't know. It can guess better and better, maybe it can even guess 99.99% of time, but it can't know, it can't decide or reason (not even o1 can "reason").
What we can reasonably assume from statements made by insiders:
- They want a 10x improvement from scaling and a 10x improvement from data and algorithmic changes
- The sources of public data are essentially tapped
- Algorithmic changes will be an unknown to us until they release, but from published research this remains a steady source of improvement
- Scaling seems to stall if data is limited
So with all of that taken together, the logical step is to figure out how to turn compute into better data to train on. Enter strawberry / o1, and now o3
They can throw money, time, and compute at thinking about and then generating better training data. If the belief is that N billion new tokens of high quality training data will unlock the leap in capabilities they’re looking for, then it makes sense to delay the training until that dataset is ready
With o3 now public knowledge, imagine how long it’s been churning out new thinking at expert level across every field. OpenAI’s next moat may be the best synthetic training set ever.
At this point I would guess we get 4.5 with a subset of this - some scale improvement, the algorithmic pickups since 4 was trained, and a cleaned and improved core data set but without risking leakage of the superior dataset
When 5 launches, we get to see what a fully scaled version looks like with training data that outstrips average humans in almost every problem space
Then the next o-model gets to start with that as a base and reason? It's likely to be remarkable
Great improvements and all, but they are still no closer (as of 4o regular) to having a system that can be responsible for work. In math problems, it forgets which variable represents what, in coding questions it invents library fns.
I was watching a YouTube interview with a "trading floor insider". They said they were really being paid for holding risk. The bank has a position in a market, and it's their ass on the line if it tanks.
ChatGPT (as far as I can tell) is no closer to being accountable or responsible for anything it produces. If they don't solve that (and the problem is probably inherent to the architecture), they are, in some sense, polishing a turd.
If an LLM can't be left to do mowing by itself, but a human will have to closely monitor and intervene at every its steps, then it's just a super fast predictive keyboard, no?
Obviously not. I want legislation which imposes liability on OpenAI and similar companies if they actively market their products for use in safety-critical fields and their product doesn’t perform as advertised.
If a system is providing incorrect medical diagnoses, or denying services to protected classes due to biases in the training in the training data, someone should be held accountable.
They would want to, if they thought they could, because doing so would unblock a ton of valuable use cases. A tax preparation or financial advisor AI would do huge numbers for any company able to promise that its advice can be trusted.
"With o3 now public knowledge, imagine how long it’s been churning out new thinking at expert level across every field."
I highly doubt that. o3 is many orders of magnitude more expensive than paying subject matter experts to create new data. It just doesn't make sense to pay six figures in compute to get o3 to make data a human could make for a few hundred dollars.
Yes, I think they had to push this reveal forward because their investors were getting antsy with the lack of visible progress to justify continuing rising valuations. There is no other reason a confident company making continuous rapid progress would feel the need to reveal a product that 99% of companies worldwide couldn't use at the time of the reveal.
That being said, if OpenAI is burning cash at lightspeed and doesn't have to publicly reveal the revenue they receive from certain government entities, it wouldn't come as a surprise if they let the government play with it early on in exchange for some much needed cash to set on fire.
EDIT: The fact that multiple sites seem to be publishing GPT-5 stories similar to this one leads one to conclude that the o3 benchmark story was meant to counter the negativity from this and other similar articles that are just coming out.
Seems to me o3 prices would be what the consumer pays, not what OpenAI pays. That would mean o3 could be more efficient in-house than paying subject-matter experts.
For every consumer there will be a period where they need both the SME and the o3 model for initial calibration and eventual handoff for actually getting those efficiencies in whichever processes they want to automate.
In other words, if you are diligent enough, you should at least validate your o3 solution with an actual expert for some time. You wouldn't just blindly trust OpenAI with your business-critical processes, would you? I would expect at least 3-6 months for large corps, and even more considering change management, re-upskilling, etc.
With all those considerations I really don't see the value prop at those prices and in those situations right now. Maybe if costs decrease ~1-3 orders of magnitude more for o3-low, depending on the processes being automated.
Unless the quality of the human data is extraordinary, it seems, according to TFA, that it's not that easy:
> The process is painfully slow. GPT-4 was trained on an estimated 13 trillion tokens. A thousand people writing 5,000 words a day would take months to produce a billion tokens.
And if the human-generated data were so qualitatively good that it could be smaller by three orders of magnitude, then I can assume it would be at least as expensive as o3.
I don't think OAI has any moat at all. If you look around, QwQ from Alibaba is already pushing o1-preview performance. I think OAI is only ahead by 3~6 months at most.
If their AGI dreams would come true it might be more than enough to have 3 months head start. They probably won't, but it's interesting to ponder what the next few hours, days, weeks would be for someone that would wield AGI.
Like let's say you have a few datacenters of compute at your disposal and the ability to instantiate millions of AGI agents - what do you have them do?
I wonder if the USA already has a secret program for this under national defense. But it is interesting that once you do control an actual AGI you'd want to speed-run a bunch of things. In opposition to that, how do you detect an adversary already has / is using it and what to do in that case.
How many important problems are there where a 3 month head start on the data side is enough to win permanently and retain your advantage in the long run?
I'm struggling to think of a scenario where "I have AGI in January and everyone else has it in April" is life-changing. It's a win, for sure, and it's an advantage, but success in business requires sustainable growth and manageable costs.
If (random example) the bargain OpenAI strikes is "we spend every cent of our available capital to get AGI 3 months before the other guys do" they've now tapped all the resources they would need to leverage AGI and turn it into profitable, scalable businesses, while the other guys can take it slow and arrive with full pockets. I don't think their leadership is stupid enough to burn all their resources chasing AGI but it does seem like operating and training costs are an ongoing problem for them.
History is littered with first-movers who came up with something first and then failed to execute on it, only for someone else to follow up and actually turn the idea into a success. I don't see any reason to assume that the "first AGI" is going to be the only successful AGI on the market, or even a success at all. Even if you've developed an AGI that can change the world you need to keep it running so it can do that.
Consider it this way: Sam Altman & his ilk have been talking up how dangerous OpenAI's technology is. Are risk-averse businessmen and politicians going to be lining up to put their livelihood or even their lives in the hands of "dangerous technology"? Or are they going to wait 3-6 months and adopt the "safe" AGI from somebody else instead?
Well, that's the thought exercise. Is there something you can do with almost unlimited "brains" of roughly human capability but much faster, within a few days / weeks / months? Let's say you can instantiate 1 million agents for 3 months, and each of them is roughly 100x faster than a human; that means you have the equivalent of 100 million human-brain-hours to dump into whatever you want. As long as your plans don't require building too many real-world things that actually require moving atoms around, I think you could do some interesting things. You could potentially dump a few million hours into "better than AGI AI" to start off, for example, then go to other things. If they are good enough, you might be able to find enough zero-days to disable any adversary through software, among other interesting things.
Where does "almost unlimited" come into the picture though? I see people talking like AGI will be unlimited when it will be limited by available compute resources, and like I suggested, being 'first' might come at the cost of the war chest you'd need to access those resources.
What does it take to instantiate 1 million agents? Who has that kind of money and hardware? Would they still have it if they burn everything in the tank to be first?
> Where does "almost unlimited" come into the picture though
>> Like let's say you have a few datacenters of compute at your disposal and the ability to instantiate millions of AGI agents - what do you have them do?
> has that kind of money and hardware?
Any hyperscaler plus most geopolitical main players. So the ones who matter.
Synthetic data is fine if you can ground the model somehow. That's why o1/o3's improvements are mostly in reasoning, maths, etc., because you can easily tell whether the data is wrong or not.
Everyone's obsessed with new training tokens... It doesn't need to be more knowledgeable, it just needs to practice more. Ask any student: practice is synthetic data.
And who will tell the model whether its practice results are correct or not? Students practice against external evaluators, it’s not a self-contained system.
Overfitting can be caused by a lot of different things. Having an over abundance of one kind of data in a training set is one of those causes.
It’s why many pre-processing steps for image training pipelines will add copies of images at weird rotations, amounts of blur, and different cropping.
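Something like this, to give one concrete flavour (torchvision here is just an example library; the parameters are arbitrary):

```python
# Sketch of the kind of augmentation pipeline described above, using
# torchvision as one concrete example. Exact parameters are illustrative.
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomRotation(degrees=15),        # weird rotations
    transforms.GaussianBlur(kernel_size=3),       # amounts of blur
    transforms.RandomResizedCrop(size=224),       # different cropping
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])
# Applying `augment` to each image at load time effectively multiplies the
# dataset, which helps counter over-representation of any single view.
```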
> The more concepts the model manages to grok, the more nonlinear its capabilities will be
These kinds of hand-wavy statements like “practice,” “grok,” and “nonlinear its capabilities will be” are not very constructive, as they don’t have solid meaning wrt language models.
So earlier when I was referring to compounding bias in synthetic data I was referring to a bias that gets trained on over and over and over again.
So, here's my hypothesis, as someone who is adjacent ML but haven't trained DNNs directly:
We don't understand how they work, because we didn't build them. They built themselves.
At face value this can be seen as an almost spiritual position, but I am not a religious person and I don't think there's any magic involved. Unlike traditional models, the behavior of DNNs is based on random changes that failed up. We can reason about their structure, but only loosely about their functionality. When they get better at drawing, it isn't because we taught them to draw. When they get better at reasoning, it isn't because the engineers were better philosophers. Given this, there will not be a direct correlation between inputs and capabilities, but some arrangements do work better than others.
If this is the case, high order capabilities should continue to increase with training cycles, as long as they are performed in ways that don't interfere with what has been successfully learned. People lamented the loss of capability that GPT 4 suffered as they increased safety. I think Anthropic has avoided this by choosing a less damaging way to tune a well performing model.
> We don't understand how they work, because we didn't build them. They built themselves.
We do understand how they work, we did build them.
The mathematical foundation of these models are sound. The statistics behind them are well understood.
What we don’t exactly know is which parameters correspond to what results as it’s different across models.
We work backwards to see which parts of the network seem to relate to what outcomes.
> When they get better at drawing, it isn't because we taught them to draw. When they get better at reasoning, it isn't because the engineers were better philosophers.
Isn’t this the exact opposite of reality?
They get better at drawing because we improve their datasets, topologies, and their training methods and in doing so, teach them to draw.
They get better at reasoning because the engineers and data scientists building training sets do get better at philosophy.
They study what reasoning is and apply those learnings to the datasets and training methods.
> We do understand how they work, we did build them. The mathematical foundation of these models are sound. The statistics behind them are well understood.
We don't understand how they work in the sense that we can't extract the algorithms they're using to accomplish the interesting/valuable "intellectual" labor they're doing. i.e. we cannot take GPT-4 and write human-legible code that faithfully represents the "heavy lifting" GPT-4 does when it writes code (or pick any other task you might ask it to do).
That inability makes it difficult to reliably predict when they'll fail, how to improve them in specific ways, etc.
The only way in which we "understand" them is that we understand the training process which created them (and even that's limited to reproducible open-source models), which is about as accurate as saying that we "understand" human cognition because we know about evolution. In reality, we understand very little about human cognition, certainly not enough to reliably reproduce it in silico or intervene on it without a bunch of very expensive (and failure-prone) trial-and-error.
> We don't understand how they work in the sense that we can't extract the algorithms they're using to accomplish the interesting/valuable "intellectual" labor they're doing. i.e. we cannot take GPT-4 and write human-legible code that faithfully represents the "heavy lifting" GPT-4 does when it writes code (or pick any other task you might ask it to do).
I think English is being a little clumsy here. At least I’m finding it hard to express what we do and don’t know.
We know why these models work. We know precisely how, physically, they come to their conclusions (it’s just processor instructions as with all software)
We don’t know precisely how to describe what they do in a formalized general way.
That is still very different from say an organic brain, where we barely even know how it works, physically.
My opinions:
I don’t think they are doing much mental “labor.” My intuition likens them to search.
They seem to excel at retrieving information encoded in their weights through training and in the context.
They are not good at generalizing.
They also, obviously, are able to accurately predict tokens such that the resulting text is very readable.
Larger models have a larger pool of information, and that information is at a higher resolution, so to speak, since the larger, better-performing models have more parameters.
I think much of this talk of “consciousness” or “AGI” is very much a product of human imagination, personification bias, and marketing.
>We know why these models work. We know precisely how, physically, they come to their conclusions (it’s just processor instructions as with all software)
I don't know why you would classify this as knowing much of anything. Processor instructions ? Really?
If the average user is given unfettered access to the entire source code of his/her favorite app, does he suddenly understand it ? That seems like a ridiculous assertion.
In reality, it's even worse. We can't pinpoint which weights are contributing, how, and in what instances, to basic things like whether a word should be preceded by 'the' or 'a', and it only gets more intractable as models get bigger and bigger.
Sure, you could probably say we understand these NNs better than brains but it's not by much at all.
> If the average user is given unfettered access to the entire source code of his/her favorite app, does he suddenly understand it ? That seems like a ridiculous assertion.
And one that I didn’t make.
I don’t think when we say “we understand” we’re talking about your average Joe.
I mean “we” as in all of human knowledge.
> We can't pinpoint what weights, how and in what ways and instances are contributing exactly to basic things like whether a word should be preceded by 'the' or 'a' and it only gets more intractable as models get bigger and bigger.
There is research coming out on this subject. I read a paper recently about how llama’s weights seemed to be grouped by concept like “president” or “actors.”
But just the fact that we know that information encoded in weights affects outcomes and we know the underlying mechanisms involved in the creation of those weights and the execution of the model shows that we know much more about how they work than an organic brain.
The whole organic brain thing is kind of a tangent anyway.
My point is that it’s not correct to say that we don’t know how these systems work. We do. It’s not voodoo.
We just don’t have a high level understanding of the form in which information is encoded in the weights of any given model.
> With o3 now public knowledge, imagine how long it’s been churning out new thinking at expert level across every field. OpenAI’s next moat may be the best synthetic training set ever.
Even taking OpenAI and the benchmark authors at their word, they said it consumes at least tens of dollars per task to hit peak performance; how much would it cost to have it produce a meaningfully large training set?
There is no public API for o3 yet, those are the numbers they revealed in the ARC-AGI announcement. Even if they were public API prices we can't assume they're making a profit on those for as long as they're billions in the red overall every year, its entirely possible that the public API prices are less than what OpenAI is actually paying.
The value of synthetic data relies on having non-zero signal about which generated data is "better" or "worse". In a sense, this what reinforcement learning is about. Ie, generate some data, have that data scored by some evaluator, and then feed the data back into the model with higher weight on the better stuff and lower weight on the worse stuff.
The basic loop is: (i) generate synthetic data, (ii) rate synthetic data, (iii) update model to put more probability on better data and less probability on worse data, then go back to (i).
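In hedged pseudocode, with everything model-specific stubbed out (`model.generate`, `rate`, and `finetune` are placeholders, not a real API):

```python
# Sketch of the generate -> rate -> update loop described above.
def synthetic_data_loop(model, rate, finetune, rounds=3, samples=1000):
    for _ in range(rounds):
        # (i) generate synthetic data
        candidates = [model.generate() for _ in range(samples)]
        # (ii) rate it with an evaluator that carries non-zero signal
        scored = [(rate(c), c) for c in candidates]
        # (iii) upweight the better data, downweight (here: drop) the worse
        scored.sort(reverse=True, key=lambda sc: sc[0])
        keep = [c for _, c in scored[: samples // 4]]
        model = finetune(model, keep)
    return model
```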
But who rates the synthetic data? If it is humans, I can understand that this is another way to get human knowledge into it, but if it's rated by AI, isn't it just a convoluted way of copying the rating AI's knowledge?
Many things are more easily scored than produced. Like it's trivial to tell whether a poem rhymes, but writing one is a comparatively slow and difficult task. So hopefully, since scoring is easier/more-discerning than generating, the idea is you can generate stuff, classify it as good or bad, and then retrain on the good stuff. It's kind of an article of faith for a lot of AI companies/professionals as well, since it prevents you from having to face a data wall, and is analogous to a human student practicing and learning in an appealing way.
As far as I know it doesn't work very well so far. It is prone to overfitting, where it ranks highly some trivial detail of the output, e.g. "if a summary starts with a byline of the author it's a sign of quality", and then starts looping on itself over and over, increasing the frequency and size of bylines until it has run off to infinity and is just repeating a short phrase endlessly. Humans have good baselines and common sense that these ML systems lack; if you've ever seen one of those "deep dream" images, it's the same kind of idea. The "most possible dog" image can look almost nothing like a dog, in the same way that the "most possible poem" may look nothing like a poem.
> This technique, the "Self-Taught Reasoner" (STaR), relies on a simple loop: generate rationales to answer many questions, prompted with a few rationale examples; if the generated answers are wrong, try again to generate a rationale given the correct answer; fine-tune on all the rationales that ultimately yielded correct answers; repeat. We show that STaR significantly improves performance on multiple datasets compared to a model fine-tuned to directly predict final answers
But there are a few others. In general, good data is good data. We're definitely learning more about how to produce good synthetic versions of it.
One issue with that is that the model may learn to smuggle data. You as a human think that the plain reading of the words is what is doing the reasoning, but (part of) the processing is done by the exact comma placement and synonym choice etc.
Data smuggling is a known phenomenon in similar tasks.
There is an enormous "iceberg" of untapped non-public data locked behind paywalls or licensing agreements. The next frontier will be spending money and human effort to get access to that data, then transform it into something useful for training.
Meanwhile, the biggest opportunity lies not in whatever next thing OpenAI releases, but the rest of the enormous software industry actually integrating this technology and realizing the value it can deliver.
Counterpoint: o1-Pro is insanely good -- subjectively, it's as far above GPT4 as GPT4 was above 3. It's almost too good. Use it properly for an extended period of time, and one begins to worry about the future of one's children and the utility of their schooling.
o3, by all accounts, is better still.
Seems to me that things are progressing quickly enough.
Not sure what you are using it for, but it is terrible for me for coding; Claude beats it always and hands down. o1 just thinks forever, only to come up with stuff it already tried the previous time.
People say that's just prompting without pointing to real million line+ repositories or realistic apps to show how that can be improved. So I say they are making todo and hello world apps and yes, there it works really well. Claude still beats it, every.. single.. time..
And yes, I use the Pro tier of them all, and yes, I do assume coding as a profession is done for most people. Become a plumber or electrician or carpenter.
That's so weird; it seems like everybody here prefers Claude.
I’ve been using Claude and OpenAI in Copilot, and I find even 4o seems to understand the problem better. o1 definitely seems to get it right more often for me.
I try to sprinkle 'for us/me' everywhere as much as I can; we work mostly on LoB/ERP apps. These are small frontends to massive multi-million-line backends. We carved out a niche by building these frontends live at the client's office: a business consultant of ours solves UX issues for the client on top of a large ERP by using our tool and prompting. Everything looks modern, fresh and nice, unlike basically all the competitors in this space. It's fast and no frontend people are needed for it; the backend is another system we built, which of course takes a lot longer, as those are complex business rules. Both Claude and o1 turn up something that looks similar, but only the Claude version will work and, after less prompting, be correct. I don't have shares in either and I want open source to win; we have more open solutions running all the same queries and we evaluate them all, but Claude just wins. We even managed big wins with OpenAI davinci in 2022 (or so; before ChatGPT), but this is a massive boost. It lets us upgrade most people to business consultant and have them build with clients in real time, while the tech guys, including me, manually add tests and proofs (where needed) to know whether we are actually fine. It works so much better than the old slog with clients; people are so bad at explaining what they need that it was slowly driving me insane after doing it for 30+ years.
They're both okay for coding, though for my use cases (which are niche and involve quite a lot of mathematics and formal logic) o1/o1-Pro is better. It seems to have a better native grasp of mathematical concepts, and it can even answer very difficult questions from vague inputs, e.g.: https://chatgpt.com/share/676020cb-8574-8005-8b83-4bed5b13e1...
Different languages maybe? I find Sonnet v2 to be lacking in Rust knowledge compared to 4o 11-20, but excelling at Python and JS/TS. o1's strong side seems to be complex or quirky puzzle-like coding problems that can be answered in a short manner; it's meh at everything else, especially considering the price. Which is understandable given its purpose and training, but I have no use for it, as that's exactly the sort of problem I wouldn't trust an LLM to solve.
Sonnet v2 in particular seems to be a bit broken with its reasoning (?) feature. The one where it detects it might be hallucinating (what's even the condition?) and reviews the reply, reflecting on it. It can make it stop halfway into the reply and decide it wrote enough, or invent some ridiculous excuse to output a worse answer. Annoying, although it doesn't trigger too often.
We do the same automatically for our research (all requests go to o1, Sonnet and Gemini, and we store the results to compare later): Claude always wins, even with specific prompting on both platforms. Especially for frontend, o1 really seems terrible.
Exactly. The previous version of o1 did actually worse in the coding benchmarks, so I would expect it to be worse in real life scenarios.
The new version released a few days ago on the other hand is better in the benchmarks, so it would seem strange that someone used it and is saying that it’s worse than Claude.
Every time I try Gemini, it's really subpar. I found that qwen2.5-coder-32b-instruct can be better.
Also, for me it's 50/50 between Sonnet and o1, though I'm not 100% sure about it; I think o1 is better with longer and more complicated (C++) code and debugging, at least from my brief testing. Also, OpenAI models seem to be more verbose. Sometimes that's better, e.g. where I'd like additional explanation of chosen fields in a SQL schema; sometimes it's too much.
EDIT: Just asked both o1 and Sonnet 3.5 the same QML coding question, and Sonnet 3.5 succeeded, o1 failed.
Very anecdotal but I’ve found that for things that are well spec’d out with a good prompt Sonnet 3.5 is far better. For problems where I might have introduced a subtle logical error o1 seems to catch it extremely well. So better reasoning might be occurring but reasoning is only a small part of what we would consider intelligence.
Wins? What does this mean? Do you have any results? I see the claims that Claude is better for coding a lot but using it and using Gemini 2.0 flash and o1 and it sure doesn't seem like it.
I keep reading this on HN so I believe it has to be true in some ways, but I don't really feel like there is any difference in my limited use (programming questions or explaining some concepts).
If anything I feel like it's all been worse compared to the first release of ChatGPT, but I might be wearing rose colored glasses.
It’s the same for me. I genuinely don’t understand how I can be having such a completely different experience from the people who rave about ChatGPT. Every time I’ve tried it’s been useless.
How can some people think it’s amazing and has completely changed how they work, while for me it makes mistakes that a static analyser would catch? It’s not like I’m doing anything remarkable, for the past couple of months I’ve been doing fairly standard web dev and it can’t even fix basic problems with HTML. It will suggest things that just don’t work at all and my IDE catches, it invents APIs for packages.
One guy I work with uses it extensively and what it produces is essentially black boxes. If I find a problem with something “he” (or rather ChatGPT) has produced it takes him ages to commune with the machine spirit again to figure out how to fix it, and then he still doesn’t understand it.
I can’t help but see this as a time-bomb, how much completely inscrutable shite are these tools producing? In five years are we going to end up with a bunch of “senior engineers” who don’t actually understand what they’re doing?
Before people cry “o tempora o mores” at me and make parallels with the introduction of high-level languages, at least in order to write in a high-level language you need some basic understanding of the logic that is being executed.
> How can some people think it’s amazing and has completely changed how they work, while for me it makes mistakes that a static analyser would catch?
There are a lot of code monkeys working on boilerplate code. These people used to rely on Stack Overflow, and now that ChatGPT is here it's a huge improvement for them.
If you work on anything remotely complex, or anything that hasn't been solved 10 times on Stack Overflow, ChatGPT isn't remotely as useful.
I work on very complex problems. Some of my solutions have small, standard substeps that I can now reliably outsource to ChatGPT. Here are a few just from last week (a rough sketch of the first one follows the list):
- write cvxpy code to find the chromatic number of a graph, and an optimal coloring, given its adjacency matrix.
- given an adjacency matrix, write numpy code that enumerates all triangle-free vertex subsets.
- please port this old code from tensorflow to pytorch: ...
- in pytorch, i'd like to code a tensor network defining a 3-tensor of shape (d, d, d). my tensor consists of first projecting all three of its d-dimensional inputs to a k-dimensional vector, typically k=d/10, and then applying a (k, k, k) 3-tensor to contract these to a single number.
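For the first request, the kind of answer you'd hope to get back looks roughly like the standard assignment-based MILP formulation below. This is a sketch, not verified model output, and it assumes a MILP-capable solver (e.g. GLPK_MI, CBC, or SCIP) is installed for cvxpy to call:

    import cvxpy as cp
    import numpy as np

    def chromatic_number(adj):
        """Minimum coloring of a graph given its 0/1 adjacency matrix, as a small MILP."""
        n = adj.shape[0]
        x = cp.Variable((n, n), boolean=True)  # x[v, c] = 1 iff vertex v gets color c
        y = cp.Variable(n, boolean=True)       # y[c] = 1 iff color c is used at all

        constraints = [cp.sum(x, axis=1) == 1]          # each vertex gets exactly one color
        for c in range(n):
            constraints.append(x[:, c] <= y[c])         # can only use colors that are "on"
        for u in range(n):
            for v in range(u + 1, n):
                if adj[u, v]:
                    constraints.append(x[u, :] + x[v, :] <= 1)  # neighbors get different colors

        problem = cp.Problem(cp.Minimize(cp.sum(y)), constraints)
        problem.solve()  # needs a mixed-integer solver available
        return int(round(problem.value)), np.argmax(x.value, axis=1)

    adj = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]])  # a triangle: needs 3 colors
    print(chromatic_number(adj))

Minimizing the sum of the y variables counts how many colors are actually used, which is the chromatic number, and taking the argmax over each row of x reads off one optimal coloring.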
To be honest, these don’t sound like hard problems. These sound like they have very specific answers that I might find in the more specialized stackoverflow sections. These are also the kind of questions (not in this domain) that I’ve found yield the best results from LLMs.
In comparison asking an LLM a more project specific question “this code has a race condition where is it” while including some code usually is a crapshoot and really depends if you were lucky enough to give it the right context anyway.
Sure, these are standard problems, I’ve said so myself. My point is that my productivity is multiplied by ChatGPT, even if it can only solve standard problems. This is because, although I work on highly non-standard problems (see https://arxiv.org/abs/2311.10069 for an example), I can break them down into smaller, standard components, which ChatGPT can solve in seconds. I never ask ChatGPT "where's the race condition" kind of questions.
I think the difference comes down to interacting with it like IDE autocomplete vs. interacting with it like a colleague.
It sounds like you're doing the former -- and yeah, it can make mistakes that autocomplete wouldn't or generate code that's wrong or overly complex.
On the other hand, I've found that if you treat it more like a colleague, it works wonderfully. Ask it to do something, then read the code and ask follow-up questions. If you see something that's wrong or just seems off, tell it, and ask it to fix it. If you don't understand something, ask for an explanation. I've found that this process generates great code that I often understand better than if I had written it from scratch, and in a fraction of the time.
It also sounds like you're asking it to do basic tasks that you already know how to do. I find that it's most useful in tackling things that I don't know how to do. It'll already have read all of the documentation and know the right way to call whatever APIs, etc, and -- this is key -- you can have a conversation with it to clear up anything that's confusing.
This takes a big shift in mindset if you've been using IDEs all your life and have expectations of LLMs being a fancy autocomplete. And you really have to unlearn a lot of stuff to get the most out of them.
I'm in the same boat as the person you're responding to. I really don't understand how to get anything helpful out of ChatGPT, or more than anything basic out of Claude.
> I've found that if you treat it more like a colleague, it works wonderfully.
This is what I've been trying to do. I don't use LLM code completion tools. I'll ask anything from how to do something "basicish" with html & css, and it'll always output something that doesn't work as expected. Question it and I'll get into a loop of the same response code, regardless of how I explain that it isn't correct.
On the other end of the scale, I'll ask about an architectural or design decision. I'll often get a response that is in the realm of what I'd expect. When drilling down and asking specifics however, the responses really start to fall apart. I inevitably end up in the loop of asking if an alternative is [more performant/best practice/the language idiomatic way] and getting the "Sorry, you're correct" response. The longer I stay in that loop, the more it contradicts itself, and the less cohesive the answers get.
I _wish_ I could get the results from LLMs that so many people seem to. It just doesn't happen for me.
The first time I tried it, I asked it to find bugs in a piece of very well tested C code.
It introduced an off-by-one error by miscounting the number of arguments in an sprintf call, breaking the program. And then proceeded to fail to find that bug that it introduced.
> How can some people think it’s amazing and has completely changed how they work, while for me it makes mistakes that a static analyser would catch? It’s not like I’m doing anything remarkable, for the past couple of months I’ve been doing fairly standard web dev and it can’t even fix basic problems with HTML.
Part of this is, I think, anchoring and expectation management: you hear people say it's amazing and wonderful, and then you see it fall over and you're naturally disappointed.
My formative years started off with Commodore 64 basic going "?SYNTAX ERROR" from most typos plus a lot of "I don't know what that means" from the text adventures, then Metrowerks' C compiler telling me there were errors on every line *after but not including* the one where I forgot the semicolon, then surprises in VisualBasic and Java where I was getting integer division rather than floats, then the fantastic oddity where accidentally leaning on the option key on a mac keyboard while pressing minus turns the minus into an n-dash which looked completely identical to a minus on the Xcode default font at the time and thus produced a very confusing compiler error…
So my expectations have always been low for machine generated output. And it has wildly exceeded those low expectations.
But the expectation management goes both ways, especially when the comparison is "normal humans" rather than "best practices". I've seen things you wouldn't believe...
Entire files copy-pasted line for line, "TODO: deduplicate" and all,
20 minute app starts passed off as "optimized solutions."
FAQs filled with nothing but Bob Ross quotes,
a zen garden of "happy little accidents."
I watched iOS developers use UI tests
as a complete replacement for storyboards,
bi-weekly commits, each a sprawling novel of despair,
where every change log was a tragic odyssey.
Google Spreadsheets masquerading as bug trackers,
Swift juniors not knowing their ! from their ?,
All those hacks and horrors… lost in time,
Time to deploy.
(All true, and all pre-dating ChatGPT).
> It will suggest things that just don’t work at all and my IDE catches, it invents APIs for packages.
Aye. I've even had that with models forgetting the APIs they themselves have created, just outside the context window.
To me, these are tools. They're fantastic tools, but they're not something you can blindly fire-and-forget…
…fortunately for me, because my passive income is not quite high enough to cover mortgage payments, and I'm looking for work.
> In five years are we going to end up with a bunch of “senior engineers” who don’t actually understand what they’re doing?
Yes, if we're lucky.
If we're not, the models keep getting better and we don't have any "senior engineers" at all.
The ones who use it extensively are the same that used to hit up stackoverflow as the first port of call for every trivial problem that came their way. They're not really engineers, they just want to get stuff done.
Same here. On every release from OpenAI or Anthropic I keep reading how the new model is so much better (insert hyperbole here) than the previous one, yet when I use it I feel like they are mostly the same as last year.
One use-case: They help with learning things quickly by having a chat and asking questions. And they never get tired or emotional. Tutoring 24/7.
They also generate small code or scripts, as well as automate small things, when you're not sure how, but you know there's a way. You need to ensure you have a way to verify the results.
They do language tasks like grammar-fixing, perfect translation, etc.
They're 100 times easier and faster than search engines, if you limit your uses to that.
They can't help you learn what they don't know themselves.
I'm trying to use them to read historical handwritten documents in old Norwegian (Danish, pretty much). Not only do they not handle the German-style handwriting, but what they spit out looks like the sort of thing GPT-2 would produce if you asked it to write Norwegian (only slightly better than the Muppet Swedish Chef's Swedish). It seems the experimental tuning has made it worse at the task I most desperately want to use it for.
And when you think about it, how could it not overfit in some sense, when trained on its own output? No new information is coming in, so it pretty much has to get worse at something to get better at all the benchmarks.
Hah, no. They're good, but they definitely make stuff up when the context gets too long. Always check their output, just the same as you already note they need for small code and scripts.
If you've ever used any enterprise software for long enough, you know the exact same song and dance.
They release version Grand Banana. Purported to be approximately 30% faster with brand new features like Algorithmic Triple Layering and Enhanced Compulsory Alignment. You open the app. Everything is slower, things are harder to find and it breaks in new, fun ways. Your organization pays a couple hundred more per person for these benefits. Their stock soars, people celebrate the release and your management says they can't wait to see the improvement in workflows now that they've been able to lay off a quarter of your team.
Has there been improvements in LLMs over time? Somewhat, most of it concentrated at the beginning (because they siphoned up a bunch of data in a dubious manner). Now it's just part of their sales cycle, to keep pumping up numbers while no one sees any meaningful improvement.
I had a 30 min argument with o1-pro where it was convinced it had solved the halting problem. Tried to gaslight me into thinking I just didn’t understand the subtlety of the argument. But it’s susceptible to appeal to authority and when I started quoting snippets of textbooks and mathoverflow it finally relented and claimed there had been a “misunderstanding”. It really does argue like a human though now...
I had a similar experience with regular o1 about an integral that was divergent. It was adamant that it wasn't, and would respond to any attempt at persuasion with variants of "it's a standard integral" with a "subtle cancellation". When I asked for any source for this standard integral, it produced references that existed but didn't actually contain the integral. When I told it the references didn't have the result, it backpedalled (gaslighting!) to "I never told you they were in there". When I pointed out that in fact it did, it insisted this was just a "misunderstanding". It only relented when I told it Mathematica agreed the integral was divergent. It still insisted it never said that the books it pointed to contained this (false, nonsensical) result.
This was new behaviour for me to see in an LLM. Usually the problem is these things would just fold when you pushed back. I don't know which is better, but being this confidently wrong (and "lying" when confronted with it) is troubling.
The troubling part is that the references themselves existed -- one was an obscure Russian text that is difficult to find (but is exactly where you'd expect to find this kind of result, if it existed).
I want AI to help me in the physical world: folding my laundry, cooking and farming healthy food, cleaning toilets. Training data is not lying around on the internet for free, but it's also not impossible. How much data do you need? A dozen warehouses full of robots folding and unfolding laundry 24/7 for a few months?
I think it would be many decades before I'd trust a robot like that around small children or pets. Robots with that kind of movement capability, as well as the ability to pick up and move things around, will be heavy enough that a small mistake could easily kill a small child or pet.
That's a solved problem for small devices. And we effectively have "robots" like that all over the place. Sliding doors in shops/trains/elevators have been around for ages and they include sensors for resistance. Unless there's 1. extreme cost cutting, or 2. bug in the hardware, devices like that wouldn't kill children these days.
Even for adults, a robot that would likely have to be close to as massive as a human being, in order to do laundry and the like, would spook me out, moving freely through my place.
That's the point being made. It has transformed robotics research, yes, but it remains to be seen whether it will have a truly transformative effect on the field as experienced by people outside academia (I think this is quite probable), and more pointedly when.
I think it's impossible to spend a lot of time with these models without believing robotics is fundamentally about to transform. Even the most sophisticated versions of robotic logic pre-LLM/VLM feel utterly trivial compared to what even rudimentary applications of these large models can accomplish.
I think this is an opinion borne out of weariness with constant promises that amazing robots are right around the corner (as they have been for 20 odd years now). For anyone who is close to the front line, I think the resounding consensus is clear - this time is different, unbelievably different, and capability development is going to accelerate dramatically.
Laundry folding is an instructive example. Machines have been capable of home-scale laundry folding for over a decade, with two companies Foldimate and Laundroid building functional prototypes. The challenge is making it cost-competitive in a world where most people don't even purchase a $10 folding board.
I would guess that most cooking and cleaning tasks are in basically the same space. You don't need fine motor control to clean a toilet bowl, but you've gotta figure out how to get people to buy the well-proven premisting technology before you'll be able to sell them a toilet-cleaning robot.
Counterexample: Everyone uses dishwashers. Yet I don’t think we’ll have a robot doing the dishes human-style, or even just filling up and clearing out a dishwasher, within the next decade or two, regardless of price.
Part of the tradeoff there is efficiency. I like my dishwasher because it's as good at getting things clean as I am, but it does it using less water and less soap, and at scale it takes less time too. It's just a great use case for machine automation because you can do clever stuff with a dishwasher that's hard to replicate outside of that closed environment.
I struggle to imagine a scenario where a 1-2 person household would get the same benefits from something like a laundry-folding robot. I hate folding my laundry and I still can't imagine buying one since I simply don't do laundry that often. If I really wanted to spend less time doing laundry, I could spend the cost of that laundrybot on a larger collection of clothing to wear, for that matter.
Robot vacuums are a good comparison point since vacuuming is something you (ideally) do frequently that is time and labor intensive. I do own one of those, and if it got better at dealing with obstacles thanks to "AI" I would definitely like that.
I think it would have to be a general-purpose robot, and doing the laundry would just be one of many things it can do, similar to how running a particular program is only one of many things a computer can do. More than that, I believe it would actually require a general-purpose robot to handle all contingencies that can arise in doing laundry.
As someone who does laundry about twice a week, it would certainly be nice. But it’s a pie in the sky at this time even just on the technological side.
There's plenty of machines which are expensive, bulky, single purpose and yet commercially successful. The average American household has a kitchen range, refrigerator, dishwasher, laundry machine, dryer, television, furnace, and air conditioner. Automatic coffee machines and automatic vacuums are less universal but still have household penetration in the millions. I really think the household tasks with no widely available automation are simply the ones that nobody cares enough about doing to pay for automation.
A robot servant that does literally 100% of chores would be a game changer, and I expect we'll get there at some point, but it will probably have to be a one-shot from a consumer perspective. A clever research idea to reach 25% or 50% coverage still isn't going to lead to a commercially viable product.
For a company that sees itself as the undisputed leader and that wants to raise $7 trillion to build fabs, they deserve some of the heaviest levels of scrutiny in the world.
If OpenAI's investment prospectus relies on them reaching AGI before the tech becomes commoditized, everyone is going to look for that weakness.
"I was on an airplane and there was high-speed Internet on the airplane. That's the newest thing that I know exists. And I'm sitting on the plane and they go, open up your laptop, you can go on the Internet.
And it's fast, and I'm watching YouTube clips. It's amazing. I'm on an airplane! And then it breaks down. And they apologize, the Internet's not working. And the guy next to me goes, 'This is bullshit.' I mean, how quickly does the world owe him something that he knew existed only 10 seconds ago?"
Soon, all the middle class jobs will be converted to profits for the capital/data center owners, so they have to spend while they can before the economy crashes due to lack of spending.
Not invariably. Some of those people are the ones who want to draw 7 red lines all perpendicular, some with green ink, some with transparent and one that looks like a kitten.
No, people who say "it's bullshit" and then do something to fix the bullshit are the ones that push technology forward. Most people who say "it's bullshit" instantly when something isn't perfect for exactly what they want right now are just whingers and will never contribute anything except unconstructive criticism.
There's someone with this comment in every thread. Meanwhile, no one answers this because they are getting value. Please take the time to learn, it will give you value.
I’m a consultant. Having looked at several enterprises, there’s a lot of work being done to make a lot of things that don’t really work.
The bigger the ambition, the harder they’re failing. Some well designed isolated use cases are ok. Mostly things about listening and summarizing text to aid humans.
I have yet to see a successful application that is generating good content. IMO replacing the first draft of content creation and having experts review and fix it is, like, the stupidest strategy you can pursue. The people you replace are the people at the bottom of the pyramid who are supposed to do this work to upskill and become domain experts so they can later review stuff. If they’re no longer needed, you’re going to lose your reviewers one day, and with them, the ability to assess your generated drafts. It’s a footgun.
I mean, no, not generally. But the success rate of other tools is much higher.
A lot of companies are trying to build these general-purpose bots that just magically know everything about the company and have these big knowledge bases, but they just don’t work.
I'm someone who generally was a "doubter", but I've dramatically softened my stance on this topic.
Two things:
I was casually watching Andreas Kling's streams on Ladybird development (where he was developing a JIT compiler for JS) and was blown away at the accuracy of completions (and the frequency of those completions)
Prior to this, I'd only ever copypasta'd code from ChatGPT output on occasion.
I started adopting the IDE/Editor extensions and prototyping small projects.
There are now small tools and utilities I've written that I'd not have written otherwise, or that would have taken twice the time invested had I not used these tools.
With that said, they'd be of no use without oversight, but as a productivity enhancement, the benefits are enormous.
For my mental health I’ve stopped replying to comments where it’s clear the author has no intention of having a discussion and instead wants to share their opinion and have it reinforced by others.
No, we don’t have AGI or anything close to it. Yes, AI has come a long way in the past decade and many people find it useful in their day-to-day lives.
It’s difficult to know where AI will be in 10 years, but the current rate of improvement is staggering.
> Meanwhile, no one answers this because they are getting value.
You're literally doing the same thing you're accusing of. Every HN thread is full of AI boosters claiming AI to be the future with no backing evidence.
Riddle me this. If all these people are "getting value", why are all these companies losing horrendous amounts of money? Why has nobody figured out how to be profitable?
> Please take the time to learn, it will give you value.
Yeah, yeah, just prompt engineer harder. That'll make the stochastic parrot useful. Anyone who has criticism just does so because they're dumb and you're smart. Same as it always was. Everyone opposed to the metaverse just didn't get it bro. You didn't get NFTs bro. You didn't get blockchain bro.
None of these previous bubbles had money in it (beyond scamming idiots), if AI wants to prove it's not another empty tech bubble, pay up. Show me the money. Should be easy, if it's automating so many expensive man-hours of labour. People would be lining up to pay OpenAI.
Think of all the search engines - AlltheWeb, Yahoo, AltaVista, ... - where so much money got poured in, and in the end there was just one winner taking it all. That's the race OpenAI is trying to win now. The competition is fierce, we get to play with all kinds of models for free, and we do nothing but complain.
> Riddle me this. If all these people are "getting value", why are all these companies losing horrendous amounts of money? Why has nobody figured out how to be profitable?
While I agree that LLMs are not currently working great for most envisioned use cases, this premise is not a good argument. Large LLM providers are not trying to be profitable at the moment. They’re trying to grow, and that’s pretty sensible.
Uber was the poster child of this, and for all the mockery, Uber is now an unambiguously profitable company.
I'm not sure I would call it sensible to incinerate $11B a year, to the point where you need to do one of the biggest raises ever and it doesn't even buy you a year of runway.
> Why has nobody figured out how to be profitable?
From what I've seen claimed about OpenAI finances, this is easy: It's a Red Queen's race — "it takes all the running you can do, to keep in the same place".
If their financial position was as simple as "we run this API, we charge X, the running cost is Y", then they're already at X > Y.
But if that was all OpenAI were actually doing, they'd have stopped developing new versions or making the existing models more efficient some time back, while the rest of the industry kept improving their models and lowering their prices, and they'd be irrelevant.
> People would be lining up to pay OpenAI.
They are.
Not that this is either sufficient or necessary to actually guarantee anything about real value. For lack of sufficiency: people collectively paid a lot for cryptocurrencies and NFTs, too (and, before then and outside tech, homeopathic tinctures and sub-prime mortgages). For lack of necessity: there are plenty of free-to-download models.
I get a huge benefit even just from the free chat models. I could afford to pay for better models, but why bother when free is so good? Every time a new model comes out, the old paid option becomes the new free option.
• Build toys that would otherwise require me to learn new APIs (I can read python, but it's not my day job)
• Learn new things like OpenSCAD
• To improve my German
• Learn about the world by allowing me to take photos of things in this world that I don't understand and ask them a question about the content, e.g. why random trees have bands or rectangles of white paint on them
• Help me with shopping, by taking a photo of the supermarket that I happen to be in at the time and asking them where I should look for some item I can't find
• Help with meal prep, by allowing me to get a recipe based on what food and constraints I've got at hand rather than the traditional method of "if you want x, buy y ingredients"
Even if they're just an offline version of Wikipedia or Google, they're already a more useful interface for the same actual content.
That's what puzzles me now. Everyone with a semblance of expertise in engineering knows that if you start with a tool and try to find a problem it could solve, you are doing it wrong. The right way is the opposite: you start with a problem and find the best tool to solve it, and if that's the new shiny tool, so be it, but most of the time it's not.
Except the whole tech world starting with the CEOs seems to do it the "wrong" way with LLMs. People and whole companies are encouraged to find what these things might be actually useful for.
GPT-5 is not behind schedule. GPT-5 is called GPT-4o and it was already released half a year ago. It was not revolutionary enough to be called 5, and prophet saint Altman was probably afraid to release a new generation that wasn't an exponential improvement, so it was rebranded at the last moment. It's speculation of course, but it is kinda obvious speculation.
This is the first I have heard of this in particular. Do you know of any article or source for more on the efforts to train GPT-5 and the decision to call it GPT-4o?
I think my biggest pet peeve is when someone shares an insight which is unmistakably based on intuition, inference, critical thinking, etc (all mental faculties we are allowed to use to come to conclusions in the face of information asymmetry btw)
...and then gets hit deadpan with the good old "Source?", like it's some sort of gotcha.
I think people have started to confuse "making logical conclusions without perfect info" with "misinformation"
-
Before certain people start acting like this is advocating for misinformation (which would be an incredible irony...) it's not.
I'm saying if you disagree with what someone posits, just state so directly. Don't wrap it in a disingenuous query for a source.
It's reasonable to ask for sources when an opinion is phrased as a fact, as GGP did. I don't see how you got that it was _unmistakably_ an opinion from that comment.
There is no way to deduce by intuition alone that GPT-5 == GPT-4o. So either that person has some information the rest of us aren't privy to, or it's an opinion phrased as a fact. In either case, it deserves clarification.
On a second read I see that the comment notes it is intended as speculation, but it still seems rather confident in its own accuracy. I'm not even sure it's wrong; I'm just looking for something that warrants the confidence.
I wrote my comment that way based on my personal memories of the news cycle between GPT-4 and GPT-4o, and the claims OpenAI made about GPT-4o. The hype before the 4o release was overwhelming, people expected the same step up as between 3 and 4, and there were constant "leaks" from supposed insiders that GPT-5 was just over the horizon and would come out soon. And then they released 4o, which was a big standalone release, not some fine-tuning like turbo or whatever else they made before.
Looking at the benchmarks, it was also very much expected in my opinion. Sure, the absolute results are/were sky high, but the gains relative to the previous generation were not exponential; they were comparatively smaller than between 2 and 3, or 3 and 4. So I'm guessing they invested and worked through 2023-2024 on a brand new model, and branded it according to the model's results.
That was clearly phrased like a fact, which may or may not be correct. If it had been phrased like an opinion we wouldn't be having this conversation...
The problem is once you believe their fact is wrong, just say "I think you're wrong <insert rest of comment>". Innocently asking for a source as if you're still on the fence is just performative and leads to these conversations where both sides just end up talking past each other:
A source for one underpinning of the incorrect fact comes up, then "well but that only proves X part of it, can you prove Y" and so on.
tl;dr I just find the quality of discourse is much higher when people are direct.
> I just find the quality of discourse is much higher when people are direct.
Well this certainly is a lot of work to make a mountain out of a mole hill, and I'm not sure it increases the quality of discussion either.
In any case, I think saying bold shit followed up with "it's speculation, but it's OBVIOUS speculation" is worth asking for some evidence. Obvious speculation implies it's sourced from something other than personal gut feeling.
To echo a sibling comment:
> Every time someone says their speculation is "obvious" it rings every possible alarm bell for someone who has completely lost grasp of the ability to distinguish between facts and speculation.
I think it's okay to make logical conclusions but you must base them in evidence, not just suppositions. Intuition is a good start to begin generating hypothesis, but it doesn't render conclusions. I interpreted the GP asking for sources as "can you give me some evidence that would help me reach the same conclusions you've reached". I think that's much preferable to just accepting random things people say at face value.
Even with evidence, a logical conclusion can still be a supposition (aka an uncertain belief), and often is in the face of the kind of information asymmetry inherent to any outsider commenting on a private company's internal roadmap... but I digress.
My point is simply that we can skip the passive-aggressiveness and just say "can you give me some more evidence that would help me reach the same conclusions you've reached".
Otherwise you're not actually asking for a source, you're just saying "I disagree" in a very roundabout way.
It doesn't even look like 4o is scaled up parameter-wise from 4, and it was released closer in time to its predecessor than either 3 or 4 were to theirs, at a time when the scaling required for these next-gen iterations has only gotten more difficult.
Critical thinking? Lol, it's just blind speculation.
If you disagree with their reasoning then you explain that.
You don't do this passive aggressive "source???" thing.
It's a bit like starting a Slack conversation with "Hi?": we all know you have a secondary objective, but now you're inserting an extra turn of phrase into the mix
Not everyone keeps up with LLM development enough to know how far apart the release dates for these models are, how much scaling (roughly) has been done on each iteration and a decent ballpark for how much open ai might try to scale up a next gen model.
To me, OP's speculation reads as obvious nonsense, but that might not be the case for everybody. Asking for sources for what is entirely speculation is perfectly valid, and personally that comment doesn't ring as passive-aggressive to me, but maybe it's just me.
Just because someone doesn't know enough to refute the reasoning doesn't mean they must take whatever they read at face value.
If we're making this about the innocent bystanders now, that's all the more reason to be direct and say "I disagree." rather than indirectly expressing negative feelings (aka being passive aggressive) and asking for a source.
If anything just breezily asking for a source would imply to people who don't know better that this is a rather even keeled take and just needs some more evidence on top. "I disagree and here's why" nips that in the bud directly.
How is "I disagree" any more direct than "I've not heard anything like this. any source that would point at that?" Moreover who's to say this person even disagrees? Personally i don't always ask for them because of a disagreement.
I think the hanging point seems to be that you found the comment passive aggressive but i genuinely didn't.
My sister got taken in by drone conspiracy theories, because for her it was just "obvious" that nobody would ever mistake a plane for a drone.
Meanwhile, aeronautics experts whose job it is to know about this have created an entire lexicon for the various perceptual illusions we experience relating to flight and airborne objects, precisely because it involves conditions where our intuitions fail. Many of them have to do with inability to orient depth, distance, or motion for lights at night.
Every time someone says their speculation is "obvious" it rings every possible alarm bell for someone who has completely lost grasp of the ability to distinguish between facts and speculation.
The road to misinformation is paved with overconfident declarations of the form: "it's so obvious, who needs sources!"
Everyone's comparing o1 and claude, but neither really work well enough to justify paying for them in my experience for coding. What I really want is a mode where they ask clarifying questions, ideally many of them, before spitting out an answer. This would greatly improve utility of producing something with more value than an auto-complete.
Just tell it to do that and it will. Whenever I ask an AI for something and I'm pretty sure it doesn't have all the context I literally just say "ask me clarifying questions until you have enough information to do a great job on this."
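A minimal sketch of baking that instruction in as a system prompt with the OpenAI Python client (the model name and wording here are just placeholders; the same trick works in the chat UI by pasting the instruction at the top of your message):

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    messages = [
        {"role": "system", "content": (
            "Before writing any code, ask me clarifying questions until you have "
            "enough information to do a great job. Only give the final answer "
            "once I have answered your questions."
        )},
        {"role": "user", "content": "Add rate limiting to my API."},
    ]

    response = client.chat.completions.create(model="gpt-4o", messages=messages)
    print(response.choices[0].message.content)  # should come back with questions, not code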
And this chain of prompts, combined with the improved CoT reasoner, would produce much better results - more in line with what the coming agentic era promises.
Yes. You can only do so much with the information you get in. The ability to ask good questions, not just of itself in internal monologue style, but actually of the user, would fundamentally make it better since it can get more information in.
As it is now, it has a bad habit of, if it can't answer the question you asked, instead answering a similar-looking question which it thinks you may have meant. That is of course a great strategy for benchmarks, where you don't earn any points for saying you don't know. But it's extremely frustrating for real users, who didn't read their question from a test suite.
I know multiple people who carefully prompt to get that done. The model outputs tokens in order and can't turn around, so you need to make sure the questions strictly come before the answer; otherwise the system can and will come up with post-hoc "reasoning".
Just today I got Claude to convert a company’s PDF protocol specification into an actual working python implementation of that protocol. It would have been uncreative drudge work for a human, but I would have absolutely paid a week of junior dev time for it. Instead I wrote it alongside AI and it took me barely more than an hour.
The best part is, I’ve never written any (substantial) python code before.
I have to agree. It's still a bit hit or miss, but the hits are a huge time and money saver especially in refactoring. And unlike what most of the rather demeaning comments in those HN threads state, I am not some 'grunt' doing 'boilerplate work'. I mostly do geometry/math stuff, and the AIs really do know what they're talking about there sometimes. I don't have many peers I can talk to most of the time, and Claude is really helping me gather my thoughts.
That being said, I definitely believe it's only useful for isolated problems. Even with Copilot, I feel like the AIs just lack a bigger context of the projects.
Another thing that helped me was designing an initial prompt that really works for me. I think most people just expect to throw in their issue and get a tailored solution, but that's just not how it works in my experience.
It would seem you don't care too much about verifying its output or about its correctness. If you did, it wouldn't take you just an hour. I guess you'll let correctness be someone else's problem.
I don't know the OP here, but in my experience a junior dev at an average company would likely not do much more than the AI would. These aren't your grandfather's engineers, after all.
The results of this article are going to be fascinating. Realistically, WSJ has a far wider audience than the tech echo chamber, and the general public is only aware of GPT, not o1/o3.
Outsiders will likely read this article and think, “AI is running out of steam”, because GPT-5 is behind.
Those closer to this know of the huge advancements o3 just made yesterday, and will have a complete opposite conclusion.
It will be interesting to see people’s take away from this. I think WSJ missed the mark here with the headline and the takeaway their audience will get from the article.
Maybe you were dog-piled because OpenAI will ship a successor to GPT-4o someday, whatever it's called.
In any case, the "behind schedule" rumors are themselves based on other rumors. GPT-2→GPT-3 took 5 quarters, GPT-3→GPT-4 took 11 quarters, so obviously GPT-5 (or its equivalent) will be released in Q4'2025.
The lack of tech literacy in this article is a bit concerning:
>Some researchers take this so seriously they won’t work on planes, coffee shops or anyplace where someone could peer over their shoulder and catch a glimpse of their work.
I'm almost certain that originally this was meant to be a reference to public wifi networks, as planes and coffee shops are often the frequently cited prototypical examples. They made it literally into a matter of someone looking over their shoulder, which loses so much in translation it's almost how you would write this as a joke to illustrate someone missing the point.
>OpenAI and its brash chief executive, Sam Altman
This also strikes me as nonsense. It's the first I've ever heard of someone describing Sam Altman as brash. The only way I can see them getting there is (1) tech executives are often brash (2) Altman is a tech executive (3) let's just go ahead and call him brash.
Nevertheless if this history of GPT5 and/or o3 training is accurate, it strikes me as significant news, but perhaps a missed opportunity to say more about the pertinent dynamics that explain why the training isn't working and/or to talk in interestingly specific ways about strategies for training, synthetic data, or other such things.
This entire industry is something I feel like I understand 2% of and every time I make progress to get to 10% (3 months later) some massive change happens and all the terminology changes.
What I find odd is that o1 doesn't support attaching text documents to chats the way 4o does. For a model that specializes in reasoning, reading long documents seems like a natural feature to have.
If Sama ever reads this: I have no idea why no users seem to focus on this, but it would be really good to prioritise being able to select which model you can use with custom GPTs. I know this may be hard or not possible without recreating them, but as far as I can tell it still isn't possible.
I don't think most customers realise how much better the models work with custom GPTs.
They hyped them like crazy and haven't discussed them once since then. I agree that the inability to change the model is pretty absurd when the whole point was to "supercharge" specific tasks.
There was even talk of some sort of profit sharing with creators which clearly never happened. I just think the premise is too confusing for many and can still be served by using a custom system prompt via the API.
"When using custom instructions or files, only GPT-4o is available". Straight out of the ChatGPT web interface when you try to select which model you want to use.
o3 is actually orthogonal to AGI and ASI in a cartesian sense. My SaaS startup led multiple qualified teams where our RAG implementations on synthetic data originated positive inference in line with the literature (1).
(1) Sparks of AGI paper
I’m not smart enough or interesting enough to be hired by OpenAI to expertly solve problems and explain how to the AI. However, I like to think there isn’t enough money in the world for me to sell out my colleagues like that.
In my intuition it makes sense that there is going to be some significant friction in LLM development going forward. We're talking about models that will cost upwards of $1bn to train. Save for a technological breakthrough, GPT-6/7 will probably have to wait for hardware to catch up.
I think the main bottleneck right now is training data - they've basically exhausted all public sources of data, so they have to either pay humans to generate new data from scratch or pay for the reasoning models to generate (less useful) synthetic training data. The next bottleneck is hardware, and the least important bottleneck is money.
Considering how evasive they've been, it might also be YouTube.
> When pressed on what data OpenAI used to train Sora, Murati didn’t get too specific and seemed to dodge the question. “I’m not going to go into the details of the data that was used, but it was publicly available or licensed data,” she says. Murati also says she isn’t sure whether it used videos from YouTube, Facebook, and Instagram. She only confirmed to the Journal that Sora uses content from Shutterstock, with which OpenAI has a partnership.
Train for what? For making videos? Train on people’s comments? There’s a lot of garbage and AI slop on YouTube; how would this be sifted out? I think there’s more value here on HN in terms of training data, but even that, to what end?
From what I read, OpenAI is having trouble because there isn't enough data.
If you think about it, any video on YouTube of real-world data contributes to its understanding of physics at a minimum. From what I gather, they do pre-training on tons of unstructured content first, and that contributes to overall smartness.
YouTube is such a great multimodal dataset—videos, auto-generated captions, and real engagement data all in one place. That’s a strong starting point for training, even before you filter for quality. Microsoft’s Phi-series models already show how focusing on smaller, high-quality datasets, like textbooks, can produce great results. You could totally imagine doing the same thing with YouTube by filtering for high-quality educational videos.
Down the line, I think models will start using video generation as part of how they “think.” Picture a version of GPT that works frame by frame—ask it to solve a geometry problem, and it generates a sequence of images to visualize the solution before responding. YouTube’s massive library of visual content could make something like that possible.
Did you read the article? All it basically says is that OpenAI faced struggles this past year -- specifically with GPT-5 aka Orion. And now they have o3, and other labs have made huge strides. So, yes, show me AI progress is slowing down!
How about just an updated GPT-4o with newer data? It would go a long way. Currently it doesn't know anything after Oct 2023 (without having to do a web search).