One fundamental challenge to me is that as each training run becomes more and more expensive, the time it takes to learn what works/doesn't work widens. Half a billion dollars for training a model is already nuts, but if it takes 100 iterations to perfect it, you've cumulatively spent 50 billion dollars... Smaller models may actually be where rapid innovation continues, simply because of tighter feedback loops. o3 may be an example of this.
AGI will arrive like self-driving cars. It's not that you will wake up one day and we have it. Cars gained auto-braking, parallel parking, cruise control assist, and over a long time you get to something like Waymo, which is still location dependent. I think AGI will take decades, but sooner we'll get special cases that are effectively the same thing.
Interesting idea. The concept of The Singularity would seem to go against this, but I do feel that seems unlikely and that a gradual transition is more likely.
However, is that AGI, or is it just ubiquitous AI? I’d agree that, like self driving cars, we’re going to experience a decade or so transition into AI being everywhere. But is it AGI when we get there? I think it’ll be many different systems each providing an aspect of AGI that together could be argued to be AGI, but in reality it’ll be more like the internet, just a bunch of non-AGI models talking to each other to achieve things with human input.
I don’t think it’s truly AGI until there’s one thinking entity able to perform at or above human level in everything.
The idea of the singularity presumes that running the AGI is either free or trivially cheap compared to what it can do, so we are fine expanding compute to let the AGI improve itself. That may eventually be true, but it's unlikely to be true for the first generation of AGI.
The first AGI will be a research project that's completely uneconomical to run for actual tasks because humans will just be orders of magnitude cheaper. Over time humans will improve it and make it cheaper, until we reach some tipping point where letting the AGI improve itself is more cost effective than paying humans to do it
The Singularity is caused by AI being able to design better AI. There's probably some AI startup trying to work on this at the moment, but I don't think any of the big boys are working on how to get an LLM to design a better LLM.
I still like the analogy of this being a really smart lawn mower, and we're expecting it to suddenly be able to do the laundry because it gets so smart at mowing the lawn.
I think LLMs are going to get smarter over the next few generations, but each generation will be less of a leap than the previous one, while the cost gets exponentially higher. In a few generations it just won't make economic sense to train a new generation.
Meanwhile, the economic impact of LLMs in business and government will cause massive shifts - yet more income shifting from labour to capital - and we will be too busy dealing with that as a society to be able to work on AGI properly.
I think this whole “AGI” thing is so badly defined that we may as well say we already have it. It already passes the Turing test and does well on tons of subjects.
What we can start to build now is agents and integrations. Building blocks like panel of experts agents gaming things out, exploring space in a Monte Carlo Tree Search way, and remembering what works.
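To make the MCTS bit concrete, here's roughly the loop I have in mind (a toy sketch only; `expand` and `evaluate` are stand-ins for whatever agent proposals and judging you actually plug in, not any real framework):

```python
# Hypothetical sketch of MCTS-style exploration over agent actions.
# expand() proposes candidate next states, evaluate() scores an outcome
# (an LLM judge, a test suite, etc.) -- both are assumptions here.
import math
import random

class Node:
    def __init__(self, state, parent=None):
        self.state = state
        self.parent = parent
        self.children = []
        self.visits = 0
        self.value = 0.0

def ucb(node, c=1.4):
    # Upper Confidence Bound: balance exploitation vs. exploration.
    if node.visits == 0:
        return float("inf")
    return node.value / node.visits + c * math.sqrt(
        math.log(node.parent.visits) / node.visits
    )

def mcts(root, expand, evaluate, iterations=100):
    for _ in range(iterations):
        # 1. Select: walk down the tree by UCB.
        node = root
        while node.children:
            node = max(node.children, key=ucb)
        # 2. Expand: ask the agent for candidate next actions.
        for state in expand(node.state):
            node.children.append(Node(state, parent=node))
        # 3. Simulate: score one candidate (or the node itself).
        leaf = random.choice(node.children) if node.children else node
        reward = evaluate(leaf.state)
        # 4. Backpropagate: remember what worked.
        while leaf is not None:
            leaf.visits += 1
            leaf.value += reward
            leaf = leaf.parent
    return max(root.children, key=lambda n: n.visits)
```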
Robots are only constrained by mechanical servos now. When they can do something, they’ll be able to do everything. It will happen gradually then all at once. Because all the tasks (cooking, running errands) are trivial for LLMs. Only moving the limbs and navigating the terrain safely is hard. That’s the only thing left before robots do all the jobs!
It's not contradictory. It can happen over a decade and still be a dramatically sloped S curve with tremendous change happening in a relatively short time.
AGI is the holy grail of technology. A technology so advanced that not only does it subsume all other technology, but it is able to improve itself.
Truly general intelligence like that will either exist or not. And the instant it becomes public, the world will have changed overnight (maybe the span of a year)
Note: I don’t think statistical models like these will get us there.
There may well be an upper limit on cognition (we are not really sure what cognition is - even as we do it) and it may be that human minds are close to it.
LLMs have no real sense of truth or hard evidence of logical thinking. Even the latest models still trip up on very basic tasks. I think they can be very entertaining, sure, but not practical for many applications.
The autoregressive transformer LLMs aren't even the only way to do text generation. There are now diffusion based LLMs, StripedHyena based LLMs, and flow matching based LLMs.
There's a wide amount of research into other sorts of architectures.
LLMs are a key piece of understanding that token sequences can trigger actions in the real world. AGI is here. You can trivially spin up a computer-using agent to improve itself into a competent office worker.
Tokens don't need to be text either; you can move to higher-level "take_action" semantics where "stream back 1 character to session#117" is a single function call. Training cheap models that can do things in the real world is going to change a huge amount of present capabilities over the next 10 years.
Says who? And more importantly, is this the boulder? All I (and many others here) see is that people engage others to sponsor pushing some boulder, screaming promises which aren’t even that consistent with intermediate results that come out. This particular boulder may be on a wrong mountain, and likely is.
It all feels like doubling down on astrology because good telescopes aren’t there yet. I’m pretty sure that when 5 comes out, it will show some amazing benchmarks but shit itself in the third paragraph as usual in a real task. Cause that was constant throughout GPT evolution, in my experience.
even if it kills us
Full-on sci-fi, in reality it will get stuck around a shell error message and either run out of money to exist or corrupt the system into no connectivity.
The buzzkill when you fire up the latest most powerful model only for it to tell you that peanut is not typically found in peanut butter and jelly sandwiches.
There's no doubt been progress on the way to AGI, but ultimately it's still a search problem, and one that will rely on human ingenuity at least until we solve it. LLMs are such a vast improvement in showing intelligent-like behavior that we've become tantalized by it. So now we're possibly focusing our search in the wrong place for the next innovation on the path to AGI. Otherwise, it's just a lack of compute, and then we just have to wait for the capacity to catch up.
I don't think AI will be what kills us. The paperclip machine is already here - it's capitalism. It doesn't need AI, it has unthinking, powerless people already tied to optimising for bad metrics. Everything else just makes it more efficient at killing us.
I think you're both right and wrong. You're right that capitalism has become a paperclip machine, but capitalism also wants AI so it can cheaply and at scale replace the human components of the machine with something that has more work capacity for fewer demands.
The problem is that the people in power will want to maintain the status quo. So the end of human labor won't naturally result in UBI – or any kind of welfare – to compensate for the loss of income, let alone afford any social mobility. But wealthy people will be able to leverage AGI to defend themselves from any uprising by the plebs.
We're too busy trying to make humans irrelevant, and not asking what exactly we, as a species of 10+ billion individuals, do afterwards. There are some excited discussions about a rebirth of culture, but I'm not sure what that means when machines can do anything humans can do but better. Perhaps we just tinker around with our hobbies until we die? I honestly don't think it will play out well for us.
Machines can’t have fun for us. They can’t dance to a beat, they can’t experience altered states of mind. They can’t create a sense of belonging through culture and ritual. Yes we have lost a lot in the last 100 years but there are still pockets of resistance that carry old knowledge that “we the people” will be glad of in the coming century.
The problem is that the "we" who are busy trying to make humans irrelevant seem to be completely unconcerned with the effects on the "we" who will be superfluous afterwards.
It seems to me that given how AI is likely to continuously increase capitalism's efficiency, your argument actually supports the claim you're trying to dispute.
I am working at an AI company that is not OpenAI. We have found ways to modularize training so we can test on narrower sets before training is "completely done". That said, I am sure there are plenty of ways others are innovating to solve the long training time problem.
Perhaps the real issue is that learning takes time and that there may not be a shortcut. I'll grant you that argument's analogue was complete wank when comparing say the horse and cart to a modern car.
However, we are not comparing cars to horses but computers to a human.
I do want "AI" to work. I am not a luddite. The current efforts that I've tried are not very good. On the surface they offer a lot, but the lustre comes off very quickly.
(1) How often do you find yourself arguing with someone about a "fact"? Your fact may be fiction for someone else.
(2) LLMs cannot reason
A next token guesser does not think. I wish you all the best. Rome was not burned down within a day!
I can sit down with you and discuss ideas about what constitutes truth and cobblers (rubbish/false). I have indicated via parenthesis (brackets in en_GB) another way to describe something and you will probably get that but I doubt that your programme will.
This is literally just the scaling laws, "Scaling laws predict the loss of a target machine learning model by extrapolating from easier-to-train models with fewer parameters or smaller training sets. This provides an efficient way for practitioners and researchers alike to compare pretraining decisions involving optimizers, datasets, and model architectures"
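For concreteness, the usual fitted form looks something like the Chinchilla-style law below (N is parameter count, D is training tokens; A, B, E, alpha and beta are constants fitted from the smaller runs, written here only as symbols):

```latex
% Predicted loss as a function of parameters N and training tokens D:
L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
```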
Until you get to a point where the LLM is smart enough to look at real world data streams and prune its own training set out of it. At that point it will self-improve to AGI.
But if the scaling law holds true, more dollars should at some point translate into AGI, which is priceless. We haven't reached the limits yet of that hypothesis.
a) There is evidence e.g. private data deals that we are starting to hit the limitations of what data is available.
b) There is no evidence that LLMs are the roadmap to AGI.
c) Continued investment hinges on there being a large enough cohort of startups that can leverage LLMs to generate outsized returns. There is no evidence yet that this is the case.
"There is no evidence that LLMs are the roadmap to AGI." - There's plenty of evidence. What do you think the last few years have been all about? Hell, GPT-4 would already have qualified as AGI about a decade ago.
>What do you think the last few years have been all about?
Next token language-based predictors with no more intelligence than brute force GIGO which parrot existing human intelligence captured as text/audio and fed in the form of input data.
4o agrees:
"What you are describing is a language model or next-token predictor that operates solely as a computational system without inherent intelligence or understanding. The phrase captures the essence of generative AI models, like GPT, which rely on statistical and probabilistic methods to predict the next piece of text based on patterns in the data they’ve been trained on"
He probably didn't need petabytes of reddit posts and millions of gpu-hours to parrot that though.
I still don't buy the we do the same as LLMs. Of course one could hypothesize the brain language centers may have some similarities, but the differences in resource usage and how those resources are used to train between humans and LLMs are remarkable and may indicate otherwise.
>Everything you said is parroting data you’ve trained on
"Just like" an LLM, yeah sure...
Like how the brain was "just like" a hydraulic system (early industrial era), like a clockwork with gears and differentiation (mechanical engineering), "just like" an electric circuit (Edison's time), "just like" a computer CPU (21st century), and so on...
You have described something but you haven't explained why the description of the thing defines its capability. This is a tautology, or possibly a begging of the question, which takes as true the premise of something (that token based language predictors cannot be intelligent) and then uses that premise to prove an unproven point (that language models cannot achieve intelligence).
You did nothing at all to demonstrate why you cannot produce an intelligent system from a next token language based predictor.
What GPT says about this is completely irrelevant.
>You did nothing at all to demonstrate why you cannot produce an intelligent system from a next token language based predictor
Sorry, but the burden of proof is on your side...
The intelligence is in the corpus the LLM was fed with. Using statistics to pick from it and re-arrange it gives new intelligent results because the information was already produced by intelligent beings.
If somebody gives you an excerpt of a book, it doesn't mean they have the intelligence of the author - even if you have taught them a mechanical statistical method to give back a section matching a query you make.
Kids learn to speak and understand language at 3-4 years old (among tons of other concepts), and can reason by themselves in a few years with less than 1 billionth the input...
>What GPT says about this is completely irrelevant.
On the contrary, it's using its very real intelligence, about to reach singularity any time now, and this is its verdict!
Why would you say it's irrelevant? That would be as if it merely statistically parroted combinations of its training data unconnected to any reasoning (except of that the human creators of the data used to create them) or objective reality...
> If somebody gives you an excerpt of a book, it doesn't mean they have the intelligence of the author
A closely related rant of my own: The fictional character we humans infer from text is not the author-machine generating that text, not even if they happen to share the same name. Assuming that the author-machine is already conscious and choosing to insert itself is begging the question.
Person 1: rockets could be a method of putting things into Earth orbit
Person 2: rockets cannot get things into orbit because they use a chemical reaction which causes an equal and opposite force reaction to produce thrust
Does person 1 have the burden of proof that rockets can be used to put things in orbit? Sure, but that doesn't make the reasoning used by person 2 valid to explain why person 1 is wrong.
BTW thanks for adding an entire chapter to your comment in edit so it looks like I am ignoring most of it. What I replied to was one sentence that said 'the burden of proof is on you'. Though it really doesn't make much difference because you are doing the same thing but more verbose this time.
None of the things you mentioned preclude intelligence. You are telling us again how it operates but not why that operation is restrictive in producing an intelligent output. There is no law that says that intelligence requires anything but a large amount of data and computation. If you can show why these things are not sufficient, I am eager to read about it. A logical explanation would be great, step by step please, without making any grand unproven assumptions.
In response to the person below... again, whether or not person 1 is right or wrong does not make person 2's argument valid.
No, GPT-4 would have been classified as it is today: a (good) generator of natural language. While this is a hard classical NLP task, it's a far cry from intelligence.
For an industry that spun off of a research field that basically revolves around recursive descent in one form or another, there's a pretty silly amount of willful ignorance about the basic principles of how learning and progress happens.
The default assumption should be that this is a local maximum, with evidence required to demonstrate that it's not. But the hype artists want us all to take the inevitability of LLMs for granted—"See the slope? Slopes lead up! All we have to do is climb the slope and we'll get to the moon! If you can't see that you're obviously stupid or have your head in the sand!"
Sure they’ve hit the wall with obvious conversations and blog articles that humans produced, but data is a by product of our environment. Surely there’s more. Tons more.
This also isn't true. It'll clearly have a price to run. Even if it's very intelligent, if the price to run it is too high it'll just be a 24/7 intelligent person that few can afford to talk to. No?
Computers will be the size of data centres, they'll be so expensive we'll queue up jobs to run on them days in advance, each taking our turn... history echoes into the future...
Yea, and those statements were true. For a time. If you want to say "AGI will be priceless some unknown time into the future" then I'd be on board lol. But to imply it'll be immediately priceless? As in any amount spent today would be immediately repaid once AGI exists? Nonsense.
Maybe if it was _extremely_ intelligent and its ROI would be all the drugs it would instantly discover or w/e. But let's not imply that General Intelligence requires knowing everything.
So at best we're talking about an AI that is likely close to human level intelligence. Which is cool, because we have 7+ billion of those things.
This isn't an argument against it. Just to say that AGI isn't "priceless" in the implementation we'd likely see out of the gate.
"OpenAI’s is called GPT-4, the fourth LLM the company has developed since its 2015 founding." - that sentence doesn't fill me with confidence in the quality of the rest of the article, sadly.
> At best, they say, Orion performs better than OpenAI’s current offerings, but hasn’t advanced enough to justify the enormous cost of keeping the new model running.
If you offer an API you need to dedicate servers to it that keep the model loaded in GPU memory. Unless you don't care about latency at all.
Though I wouldn't be surprised if the bigger reason is the PR cost of releasing with an exciting name but unexciting results. The press would immediately declare the end of the AI growth curve
No, I'm complaining that just because GPT-4 is called GPT-4 doesn't mean it's the fourth LLM from OpenAI.
Off the top of my head: GPT-2, Codex, GPT-3 in three different flavors (babbage, curie, davinci), GPT-3.5.
Suggesting that GPT-4 was "fourth" simply isn't credible.
Just the other day they announced a jump from o1 to o3, skipping o2 purely because it's already the name of a major telecommunications brand in Europe. Deriving anything from the names of OpenAI's products doesn't make sense.
If we're generous the article considers versions that were significant improvements. 4o is hardly better on real-world usage (benchmarks are gamed to death) than the original 4.
The UK is part of Europe. It's technically, geographically, politically, historically, linguistically, tectonically and socially correct. In what ways is it not?
Are Cuba or Haiti part of North America? A lot of British people feel like their civilization is meaningfully distinct from “Europe”, even though they’re part of it in a technical geographical sense.
In general yes, but it depends on whether you consider Central America its own continent, whether you include them there, and how you delineate North/South America. Groupings differ based on your education.
I think the thing that makes the UK different is that there is no other option besides them being a separate thing/continent. Are you suggesting that the UK is its own continent? Would that be with the Faroese and the Greenlanders?
The UK might feel different, but they are not separate. The French feel different from the Bulgarians, but that does not mean they are on a separate continent, politically or geographically.
EDIT:
> A lot of British people feel like their civilization is meaningfully distinct
This is, to borrow a word, "balderdash". Looking at the influence the Vikings, Romans and Normans have had, that is a rubbish argument. Just like other countries in Europe, British culture is built on the stones of other cultures, and just like many other countries they subsumed other cultures because of kings or other political dominance.
The point was that any closeby landmass besides europe is either in europe or in north america, and I have a hard time seeing the argument for UK being in North America or America at all.
But I'm guessing we can agree that any major landmass is generally belonging to a continent? Like we all agree that greenland, new zealand, japan, etc generally belong to a continent?
So to what continent do those british people think they belong?
If you asked someone directly “what continent is Britain part of”, they would surely say Europe, even if they would be unlikely to describe themselves as European. Language is funny that way.
Technically…? Does anyone here believe that the EU and Europe are the same thing? Would you find it weird if someone said that a Norwegian company was in Europe?
Parent is suggesting it would be weird for Europeans to say the UK is in Europe, which, as a European, I can tell you is preposterous. That’s the kind of nonsense you used to hear from Brexiters. They will have no sympathy from me.
While I’m sure it’s unintentional, that amounts to nitpicking. I can easily find three to include and pass over the rest. Face value turns out to be a decent approximation.
The thing is, I think it could be an optimal way of saying it. Should we not put it into the context of making a particular LLM? Why count three versions of GPT-3 as three LLMs? They made it hard to choose the one that makes up for not having a GPT-1. GPT-3.5 and Codex are both good candidates. And of course calling GPT-4 the third or the fifth could be considered as well.
That doesn’t resolve the problem of whether third or fifth is better than fourth. I have yet to be convinced that their wording here shows that they fail to grasp the pace of the development.
The issue isn't the grammar. It is that there are 5 distinct LLMs from OpenAI that you can use right now as well as 4 others that were deprecated in 2024.
The article definitely has issues, but to me what's relevant is where it's published. The smart money and experts without a vested interest have been well aware that LLMs are an expensive dead end for over a year and have been saying as much (Gary Marcus for instance). That this is starting to enter mainstream consciousness is what's newsworthy.
I've been messing around with base (not instruction tuned) LLMs; they often evade AI detectors and I wouldn't be surprised if they evade this kind of detection too, at least with a high temperature
"Orion’s problems signaled to some at OpenAI that the more-is-more strategy, which had driven much of its earlier success, was running out of steam."
So LLMs finally hit the wall. For a long time, more data, bigger models, and more compute to drive them worked. But that's apparently not enough any more.
Now someone has to have a new idea. There's plenty of money available if someone has one.
The current level of LLM would be far more useful if someone could get a conservative confidence metric out of the internals of the model. This technology desperately needs to output "Don't know" or "Not sure about this, but ..." when appropriate.
The new idea is inference-time scaling, as seen in o1 (and o3 and Qwen's QwQ and DeepSeek's DeepSeek-R1-Lite-Preview and Google's gemini-2.0-flash-thinking-exp).
Is it "eerie"? LeCun has been talking about it for some time, and may also be OpenAI's rumored q-star, mentioned shortly after Noam Brown (diplomacybot) joining OpenAI. You can't hill climb tokens, but you can climb manifolds.
I wasn’t aware of others attempting manifolds for this before - just something I stumbled upon independently. To me the “eerie” part is the thought of an LLM no longer using human language to reason - it’s like something out of a sci fi movie where humans encounter an alien species that thinks in a way that humans cannot even comprehend due to biological limitations.
I am hopeful that progress in mechanistic interpretability will serve as a healthy counterbalance to this approach when it comes to explainability, though I kinda worry that at a certain point it may be that something resembling a scaling law puts an upper bound on even that.
I imagine he means that when you reason in latent space the final answer is a smooth function of the parameters, which means you can use gradient descent to directly optimize the model to produce a desired final output without knowing the correct reasoning steps to get there.
When you reason in token space (like everyone is doing now) you are executing nonlinear functions when you sample after each token, so you have to use some kind of reinforcement learning algorithm to learn the weights.
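A toy illustration of that difference, assuming a stand-in `nn.Linear` in place of a real transformer block (a sketch only, not anyone's actual training code):

```python
# Toy, simplified illustration: why latent-space "reasoning" can be trained
# with plain gradient descent while token-space reasoning cannot.
import torch
import torch.nn as nn

vocab, hidden = 100, 32
embed = nn.Embedding(vocab, hidden)
step = nn.Linear(hidden, hidden)   # stand-in for one "reasoning step"
head = nn.Linear(hidden, vocab)    # stand-in for the output head

h = embed(torch.tensor([5]))       # some starting state

# Token-space: sample a token at each step. torch.multinomial is not
# differentiable, so the gradient chain is cut at every sampled token;
# crediting the intermediate steps needs RL-style methods.
logits = head(step(h))
tok = torch.multinomial(logits.softmax(-1), 1)   # gradient stops here
h_token = embed(tok.squeeze(0))

# Latent-space: feed the hidden state straight back in. The final loss is a
# smooth function of all parameters, so backprop reaches every step.
h_latent = step(step(h))
loss = head(h_latent).sum()
loss.backward()                    # gradients flow through both steps
```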
What wall? Not a week has gone by in recent years without an LLM breaking new benchmarks. There is little evidence to suggest it will all come to a halt in 2025.
Sure, but "benchmarks" here seems roughly as useful as "benchmarks" for GPUs or CPUs, which don't much translate to what the makers of GPT need, which is 'money making use cases.'
o3 has demonstrated that OpenAI needs 1,000,000% more inference-time compute to score 50% higher on benchmarks. If o3-high costs about $350k an hour to operate, that would mean making o4 score 50% higher would cost $3.5B (!!!) an hour. That's the scaling wall.
I’m convinced they’re getting good at gaming the benchmarks since 4 has deteriorated via ChatGPT, in fact I’ve used 4-0125 and 4-1106 via the API and find them far superior to o1 and o1-mini at coding problems. GPT4 is an amazing tool but the true capabilities are being hidden from the public and/or intentionally neutered.
> I’ve used 4-0125 and 4-1106 via the API and find them far superior to o1 and o1-mini at coding problems
Just chiming in to say you're not alone. This has been my experience as well. The o# line of models just don't do well at coding, regardless of what the benchmarks say.
I used to run a lot of monte carlo simulations where the error is proportional to the inverse square root. There was a huge advantage of running for an hour vs a few minutes, but you hit the diminishing returns depressingly quickly. It would not surprise me at all if llms end up having similar scaling properties.
Yeah, any situation where you need O(n^2) runtime to obtain n bits of output (or bits of accuracy, in the Monte Carlo case) is pure pain. At every point, it's still within your means to double the amount of output (by running it 3x longer than you have so far), but it gradually becomes more and more painful, instead of there being a single point where you can call it off.
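A quick toy illustration of how brutal 1/sqrt(n) convergence is (estimating pi, nothing LLM-specific; each extra digit of accuracy costs roughly 100x more samples):

```python
# Monte Carlo estimate of pi: error shrinks like 1/sqrt(n).
import random

def estimate_pi(n):
    hits = sum(1 for _ in range(n)
               if random.random() ** 2 + random.random() ** 2 <= 1.0)
    return 4.0 * hits / n

for n in (1_000, 100_000, 10_000_000):
    est = estimate_pi(n)
    print(f"n={n:>10,}  estimate={est:.5f}  error~{abs(est - 3.14159265):.5f}")
```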
Even assuming that past rates of inference cost scaling hold up, we would only expect a 2 OoM decrease after about a year or so.
And 1% of 3.5b is still a very large number.
Not really. o3-low compute still stomps the benchmarks and isn't anywhere near that expensive, and o3-mini seems better than o1 while being cheaper.
Combine that with the fact that LLM inference costs have dropped by orders of magnitude over the last few years, and harping on about the inference costs of a new release seems a bit silly.
Not necessarily. And this is the problem with ARC that people seem to forget.
- It's just a suite of visual puzzles. It's not like say GSM8K where proficiency in it gives some indication on Math proficiency in general.
- It's specifically a suite of puzzles that LLMs have shown particular difficulty in.
Basically how much compute it takes to handle a task in this benchmark does not correlate with how much it will take LLMs to compute tasks that people actually want to use LLMs for.
If the benchmark is not representative of normal usage* then the benchmark and the plot being shown are not useful at all from a user/business perspective and the focus on the breakthrough scores of o3-low and o3-high in ARC-AGI would be highly misleading. And also the "representative" point is really moot from the discussion perspective (i.e. saying o3 stomps benchmarks, but the benchmarks aren't representative).
*I don't think that is the case as you can at least make relative conclusions (i.e. o3 vs o1 series, o3-low is 4x to 20x the cost for ~3x the perf). Even if it is pure marketing they expect people to draw conclusions using the perf/cost plot from Arc.
PS: I know there are more benchmarks like SWE-Bench and Frontier Math, but this is the only one showing data about o3-low/high costs without considering the CodeForces plot that includes o3-mini (that one does look interesting, though right now is vaporware) but does not separate between compute scale modes.
If you are talking about the ARC benchmark, then o3-low doesn't look that special if you take into account that plenty of finetuned models with much smaller resources achieved 40-50% results on the private set (not the semi-private one like o3-low).
- I'm not just talking about ARC. On frontier Math, we have 2 scores, one with pass@1 and another with consensus vote with 64 samples. Both scores are much better than previous Sota.
- Also apparently, ARC wasn't a special fine-tune; rather, some of the ARC training set was included in the pre-training corpus.
>that result is not verifiable, not reproducible, unknown if it was leaked and how it was measured. It's kinda hype science.
It will be verifiable when the model is released. OpenAI hasn't released any benchmark scores that were later shown to be falsified, so unless you have an actual reason to believe they're outright lying, it's not something to take seriously.
Frontier Math is a private benchmark; of its highest tier of difficulty, Terence Tao says:
“These are extremely challenging. I think that in the near term basically the only way to solve them, short of having a real domain expert in the area, is by a combination of a semi-expert like a graduate student in a related field, maybe paired with some combination of a modern AI and lots of other algebra packages…”
Unless you have a reason to believe answers were leaked then again, not interested in baseless speculation.
>its private for outsiders, but it was developed in "collaboration" with OAI, and GPT was tested in the past on it, so they have it in logs somewhere.
They have logs of the questions probably but that's not enough. Frontier Math isn't something that can be fully solved without gathering top experts at multiple disciplines. Even Tao says he only knows who to ask for the most difficult set.
Basically, what you're suggesting at least with this benchmark in particular is far more difficult than you're implying.
>If you think this entire conversation is pointless, then why do you continue?
There's no point arguing about how efficient the models are (the original point) if you won't even accept the results of the benchmarks. Why am I continuing? For now, it's only polite to clarify.
> Frontier Math isn't something that can be fully solved without gathering top experts
Tao's quote above referred to the hardest 20% of problems; they have 3 levels of difficulty, and presumably the first level is much easier. Also, as I mentioned, OAI collaborated on creating the benchmark, so they could have access to all the solutions too.
> There's no point arguing
Lol, let me ask again: why are you arguing then? Yes, I have strong, reasonable (imo) doubt that those results are valid.
Not really. Throwing a bunch of unfiltered garbage at the pretraining dataset, throwing in RLHF of questionable quality during post-training, and other current hacks - none of that was expected to last forever. There is so much low-hanging fruit that OpenAI left untouched and I'm sure they're still experimenting with the best pre-training and post-training setups.
One thing researchers are seeing is resistance to post-training alignment in larger models, but that's almost the opposite of a wall, they're figuring it out as well.
> Now someone has to have a new idea
OpenAI already has a few, namely the o* series in which they discovered a way to bake Chain of Thought into the model via RL. Now we have reasoning models that destroy benchmarks that they previously couldn't touch.
Anthropic has a post-training technique, RLAIF, which supplants RLHF, and it works amazingly well. Combined with countless other tricks we don't know about in their training pipeline, they've managed to squeeze so much performance out of Sonnet 3.5 for general tasks.
Gemini is showing a lot of promise with their new Flash 2.0 and Flash 2.0-Thinking models. They're the first models to beat Sonnet at many benchmarks since April. The new Gemini Pro (or Ultra? whatever they call it now) is probably coming out in January.
> The current level of LLM would be far more useful if someone could get a conservative confidence metric out of the internals of the model. This technology desperately needs to output "Don't know" or "Not sure about this, but ..." when appropriate.
You would probably enjoy this talk [0], it's by an independent researcher who IIRC is a former employee of Deepmind or some other lab. They're exploring this exact idea. It's actually not hard to tell when a model is "confused" (just look at the probability distribution of likely tokens), the challenge is in steering the model to either get back to the right track or give up and say "you know what, idk"
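The core idea is cheap to sketch. The logits below are made up for illustration; in practice you'd take the per-token probabilities/logprobs the model itself reports:

```python
# Toy sketch: flag "confused" generations by the entropy of the next-token
# distribution. The logits are invented; a real system would read them from
# the model (e.g. the per-token logprobs an API can return).
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)

confident = softmax([9.0, 1.0, 0.5, 0.2])   # one clearly dominant token
confused  = softmax([2.1, 2.0, 1.9, 1.8])   # several near-equal candidates

for name, dist in [("confident", confident), ("confused", confused)]:
    h = entropy(dist)
    flag = "OK" if h < 1.0 else "maybe say 'not sure about this, but ...'"
    print(f"{name}: entropy={h:.2f} -> {flag}")
```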
> Now someone has to have a new idea. There's plenty of money available if someone has one.
I honestly do claim to have some ideas where I see evidence that they might work (and I do attempt to work privately on a prototype if only out of curiosity and to see whether I am right). The bad news: these ideas very likely won't be helpful for these LLM companies because they are not useful for their agenda, and follow a very different approach.
So no money for me. :-(
Let me put it this way:
Have you ever talked to a person whose intelligence is miles above yours? It can easily become very exhausting. Thus an "insanely intelligent" AI would not be of much use for most people - it would think "too different" from such people.
There do exist tasks in commerce for which an insane amount of intelligence would make a huge difference (in the sense of being positive regarding some important KPIs), but these are rare. I can imagine some applications of such (fictional) "super-intelligent" AIs in finance and companies doing some bleeding-edge scientific research - but these are niche applications (though potentially very lucrative ones).
If OpenAI, Anthropic & Co were really attempting to develop some "super-smart" AI, they would be working on those very lucrative niche applications where an insane amount of intelligence makes a huge difference, and where you can assume and train the AI operator to have "Fields-medal level" intelligence.
Anecdotally Claude is just as bad as every other LLM.
Step into more niche areas e.g. I am trying to use it with Scala macros and at least 90% of the time it is giving code that either (a) fails to compile or (b) is just complete gibberish.
And at no point ever has it said it didn't know something.
Yep, get into any sufficiently deep niche (i.e. actually almost any non-trivial app) and the LLM magic fades off.
Yeah sure, you can make a pong clone in html/js, and that's mainly because the internet is full of pong clone demos. Ask how to constrain a statsmodels linear model in some non-standard way? It will gaslight you about how it is possible and make you lose time in the process.
To output "don't know" a system needs to "know" too. Random token generator can't know. It can guess better and better, maybe it can even guess 99.99% of time, but it can't know, it can't decide or reason (not even o1 can "reason").
What we can reasonably assume from statements made by insiders:
- They want a 10x improvement from scaling and a 10x improvement from data and algorithmic changes
- The sources of public data are essentially tapped
- Algorithmic changes will be an unknown to us until they release, but from published research this remains a steady source of improvement
- Scaling seems to stall if data is limited
So with all of that taken together, the logical step is to figure out how to turn compute into better data to train on. Enter strawberry / o1, and now o3
They can throw money, time, and compute at thinking about and then generating better training data. If the belief is that N billion new tokens of high quality training data will unlock the leap in capabilities they’re looking for, then it makes sense to delay the training until that dataset is ready
With o3 now public knowledge, imagine how long it’s been churning out new thinking at expert level across every field. OpenAI’s next moat may be the best synthetic training set ever.
At this point I would guess we get 4.5 with a subset of this - some scale improvement, the algorithmic pickups since 4 was trained, and a cleaned and improved core data set but without risking leakage of the superior dataset
When 5 launches, we get to see what a fully scaled version looks like with training data that outstrips average humans in almost every problem space
Then the next o-model gets to start with that as a base and reason? It's likely to be remarkable
Great improvements and all, but they are still no closer (as of 4o regular) to having a system that can be responsible for work. In math problems, it forgets which variable represents what, in coding questions it invents library fns.
I was watching a YouTube interview with a "trading floor insider". They said they were really being paid for holding risk. The bank has a position in a market, and it's their ass on the line if it tanks.
ChatGPT (as far as I can tell) is no closer to being accountable or responsible for anything it produces. If they don't solve that (and the problem is probably inherent to the architecture), they are, in some sense, polishing a turd.
If an LLM can't be left to do mowing by itself, but a human will have to closely monitor and intervene at every its steps, then it's just a super fast predictive keyboard, no?
Obviously not. I want legislation which imposes liability on OpenAI and similar companies if they actively market their products for use in safety-critical fields and their product doesn’t perform as advertised.
If a system is providing incorrect medical diagnoses, or denying services to protected classes due to biases in the training in the training data, someone should be held accountable.
They would want to, if they thought they could, because doing so would unblock a ton of valuable use cases. A tax preparation or financial advisor AI would do huge numbers for any company able to promise that its advice can be trusted.
"With o3 now public knowledge, imagine how long it’s been churning out new thinking at expert level across every field."
I highly doubt that. o3 is many orders of magnitude more expensive than paying subject matter experts to create new data. It just doesn't make sense to pay six figures in compute to get o3 to make data a human could make for a few hundred dollars.
Yes, I think they had to push this reveal forward because their investors were getting antsy with the lack of visible progress to justify continuing rising valuations. There is no other reason a confident company making continuous rapid progress would feel the need to reveal a product that 99% of companies worldwide couldn't use at the time of the reveal.
That being said, if OpenAI is burning cash at lightspeed and doesn't have to publicly reveal the revenue they receive from certain government entities, it wouldn't come as a surprise if they let the government play with it early on in exchange for some much needed cash to set on fire.
EDIT: The fact that multiple sites seem to be publishing GPT-5 stories similar to this one leads one to conclude that the o3 benchmark story was meant to counter the negativity from this and other similar articles that are just coming out.
Seems to me o3 prices would be what the consumer pays, not what OpenAI pays. That would mean o3 could be more efficient in-house than paying subject-matter experts.
For every consumer there will be a period where they need both the SME and the o3 model for initial calibration and eventual handoff for actually getting those efficiencies in whichever processes they want to automate.
In other words, if you are diligent enough, you should at least validate your o3 solution with an actual expert for some time. You wouldn't just blindly trust OpenAI with your business-critical processes, would you? I would expect at least 3-6 months for large corps, and even more considering change management, re-upskilling, etc.
With all those considerations I really don't see the value prop at those prices and in those situations right now. Maybe if costs decrease ~1-3 orders of magnitude more for o3-low, depending on the processes being automated.
Unless the quality of the human data is extraordinary, it seems, according to TFA, that it's not that easy:
> The process is painfully slow. GPT-4 was trained on an estimated 13 trillion tokens. A thousand people writing 5,000 words a day would take months to produce a billion tokens.
And if the human-generated data were so qualitatively good that it could be smaller by three orders of magnitude, then I can assume it would be at least as expensive as o3.
I don't think OAI has any moat at all. If you look around, QwQ from Alibaba is already pushing o1-preview performance. I think OAI is only ahead by 3~6 months at most.
If their AGI dreams would come true it might be more than enough to have 3 months head start. They probably won't, but it's interesting to ponder what the next few hours, days, weeks would be for someone that would wield AGI.
Like let's say you have a few datacenters of compute at your disposal and the ability to instantiate millions of AGI agents - what do you have them do?
I wonder if the USA already has a secret program for this under national defense. But it is interesting that once you do control an actual AGI you'd want to speed-run a bunch of things. In opposition to that, how do you detect an adversary already has / is using it and what to do in that case.
How many important problems are there where a 3 month head start on the data side is enough to win permanently and retain your advantage in the long run?
I'm struggling to think of a scenario where "I have AGI in January and everyone else has it in April" is life-changing. It's a win, for sure, and it's an advantage, but success in business requires sustainable growth and manageable costs.
If (random example) the bargain OpenAI strikes is "we spend every cent of our available capital to get AGI 3 months before the other guys do" they've now tapped all the resources they would need to leverage AGI and turn it into profitable, scalable businesses, while the other guys can take it slow and arrive with full pockets. I don't think their leadership is stupid enough to burn all their resources chasing AGI but it does seem like operating and training costs are an ongoing problem for them.
History is littered with first-movers who came up with something first and then failed to execute on it, only for someone else to follow up and actually turn the idea into a success. I don't see any reason to assume that the "first AGI" is going to be the only successful AGI on the market, or even a success at all. Even if you've developed an AGI that can change the world you need to keep it running so it can do that.
Consider it this way: Sam Altman & his ilk have been talking up how dangerous OpenAI's technology is. Are risk-averse businessmen and politicians going to be lining up to put their livelihood or even their lives in the hands of "dangerous technology"? Or are they going to wait 3-6 months and adopt the "safe" AGI from somebody else instead?
Well, that's the thought exercise. Is there something you can do with almost unlimited "brains" of roughly human capability but much faster, within a few days / weeks / months? Let's say you can instantiate 1 million agents for 3 months, and each of them is roughly 100x faster than a human; that means you have the equivalent of 100 million human-brain-hours to dump into whatever you want. As long as your plans don't require building too many real-world things that actually require moving atoms around, I think you could do some interesting things. You could potentially dump a few million hours into "better than AGI AI" to start off, for example, then go to other things. If they are good enough, you might be able to find enough zero-days to disable any adversary through software, among other interesting things.
Where does "almost unlimited" come into the picture though? I see people talking like AGI will be unlimited when it will be limited by available compute resources, and like I suggested, being 'first' might come at the cost of the war chest you'd need to access those resources.
What does it take to instantiate 1 million agents? Who has that kind of money and hardware? Would they still have it if they burn everything in the tank to be first?
> Where does "almost unlimited" come into the picture though
>> Like let's say you have a few datacenters of compute at your disposal and the ability to instantiate millions of AGI agents - what do you have them do?
> has that kind of money and hardware?
Any hyperscaler plus most geopolitical main players. So the ones who matter.
Synthetic data is fine if you can ground the model somehow. That's why o1/o3's improvements are mostly in reasoning, maths, etc., because you can easily tell whether the data is wrong or not.
Everyone's obsessed with new training tokens... It doesn't need to be more knowledgeable, it just needs to practice more. Ask any student: practice is synthetic data.
And who will tell the model whether its practice results are correct or not? Students practice against external evaluators, it’s not a self-contained system.
Overfitting can be caused by a lot of different things. Having an over abundance of one kind of data in a training set is one of those causes.
It’s why many pre-processing steps for image training pipelines will add copies of images at weird rotations, amounts of blur, and different cropping.
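Something like this, to give one concrete flavour (torchvision here is just an example library; the parameters are arbitrary):

```python
# Sketch of the kind of augmentation pipeline described above, using
# torchvision as one concrete example. Exact parameters are illustrative.
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomRotation(degrees=15),        # weird rotations
    transforms.GaussianBlur(kernel_size=3),       # amounts of blur
    transforms.RandomResizedCrop(size=224),       # different cropping
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])
# Applying `augment` to each image at load time effectively multiplies the
# dataset, which helps counter over-representation of any single view.
```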
> The more concepts the model manages to grok, the more nonlinear its capabilities will be
These kinds of hand-wavy statements like “practice,” “grok,” and “nonlinear its capabilities will be” are not very constructive, as they don’t have solid meaning wrt language models.
So earlier when I was referring to compounding bias in synthetic data I was referring to a bias that gets trained on over and over and over again.
So, here's my hypothesis, as someone who is adjacent ML but haven't trained DNNs directly:
We don't understand how they work, because we didn't build them. They built themselves.
At face value this can be seen as an almost spiritual position, but I am not a religious person and I don't think there's any magic involved. Unlike traditional models, the behavior of DNNs is based on random changes that failed up. We can reason about their structure, but only loosely about their functionality. When they get better at drawing, it isn't because we taught them to draw. When they get better at reasoning, it isn't because the engineers were better philosophers. Given this, there will not be a direct correlation between inputs and capabilities, but some arrangements do work better than others.
If this is the case, high order capabilities should continue to increase with training cycles, as long as they are performed in ways that don't interfere with what has been successfully learned. People lamented the loss of capability that GPT 4 suffered as they increased safety. I think Anthropic has avoided this by choosing a less damaging way to tune a well performing model.
> We don't understand how they work, because we didn't build them. They built themselves.
We do understand how they work, we did build them.
The mathematical foundation of these models are sound. The statistics behind them are well understood.
What we don’t exactly know is which parameters correspond to what results as it’s different across models.
We work backwards to see which parts of the network seem to relate to what outcomes.
> When they get better at drawing, it isn't because we taught them to draw. When they get better at reasoning, it isn't because the engineers were better philosophers.
Isn’t this the exact opposite of reality?
They get better at drawing because we improve their datasets, topologies, and their training methods and in doing so, teach them to draw.
They get better at reasoning because the engineers and data scientists building training sets do get better at philosophy.
They study what reasoning is and apply those learnings to the datasets and training methods.
> We do understand how they work, we did build them. The mathematical foundation of these models are sound. The statistics behind them are well understood.
We don't understand how they work in the sense that we can't extract the algorithms they're using to accomplish the interesting/valuable "intellectual" labor they're doing. i.e. we cannot take GPT-4 and write human-legible code that faithfully represents the "heavy lifting" GPT-4 does when it writes code (or pick any other task you might ask it to do).
That inability makes it difficult to reliably predict when they'll fail, how to improve them in specific ways, etc.
The only way in which we "understand" them is that we understand the training process which created them (and even that's limited to reproducible open-source models), which is about as accurate as saying that we "understand" human cognition because we know about evolution. In reality, we understand very little about human cognition, certainly not enough to reliably reproduce it in silico or intervene on it without a bunch of very expensive (and failure-prone) trial-and-error.
> We don't understand how they work in the sense that we can't extract the algorithms they're using to accomplish the interesting/valuable "intellectual" labor they're doing. i.e. we cannot take GPT-4 and write human-legible code that faithfully represents the "heavy lifting" GPT-4 does when it writes code (or pick any other task you might ask it to do).
I think English is being a little clumsy here. At least I’m finding it hard to express what we do and don’t know.
We know why these models work. We know precisely how, physically, they come to their conclusions (it’s just processor instructions as with all software)
We don’t know precisely how to describe what they do in a formalized general way.
That is still very different from say an organic brain, where we barely even know how it works, physically.
My opinions:
I don’t think they are doing much mental “labor.” My intuition likens them to search.
They seem to excel at retrieving information encoded in their weights through training and in the context.
They are not good at generalizing.
They also, obviously, are able to accurately predict tokens such that the resulting text is very readable.
Larger models have a larger pool of information, and that information is at a higher resolution, so to speak, since the larger, better-performing models have more parameters.
I think much of this talk of “consciousness” or “AGI” is very much a product of human imagination, personification bias, and marketing.
>We know why these models work. We know precisely how, physically, they come to their conclusions (it’s just processor instructions as with all software)
I don't know why you would classify this as knowing much of anything. Processor instructions ? Really?
If the average user is given unfettered access to the entire source code of his/her favorite app, does he suddenly understand it ? That seems like a ridiculous assertion.
In reality, it's even worse. We can't pinpoint which weights are contributing, how, and in what instances, to basic things like whether a word should be preceded by 'the' or 'a', and it only gets more intractable as models get bigger and bigger.
Sure, you could probably say we understand these NNs better than brains but it's not by much at all.
> If the average user is given unfettered access to the entire source code of his/her favorite app, does he suddenly understand it ? That seems like a ridiculous assertion.
And one that I didn’t make.
I don’t think when we say “we understand” we’re talking about your average Joe.
I mean “we” as in all of human knowledge.
> We can't pinpoint what weights, how and in what ways and instances are contributing exactly to basic things like whether a word should be preceded by 'the' or 'a' and it only gets more intractable as models get bigger and bigger.
There is research coming out on this subject. I read a paper recently about how llama’s weights seemed to be grouped by concept like “president” or “actors.”
But just the fact that we know that information encoded in weights affects outcomes and we know the underlying mechanisms involved in the creation of those weights and the execution of the model shows that we know much more about how they work than an organic brain.
The whole organic brain thing is kind of a tangent anyway.
My point is that it’s not correct to say that we don’t know how these systems work. We do. It’s not voodoo.
We just don’t have a high level understanding of the form in which information is encoded in the weights of any given model.
> With o3 now public knowledge, imagine how long it’s been churning out new thinking at expert level across every field. OpenAI’s next moat may be the best synthetic training set ever.
Even taking OpenAI and the benchmark authors at their word, they said it consumes at least tens of dollars per task to hit peak performance; how much would it cost to have it produce a meaningfully large training set?
There is no public API for o3 yet, those are the numbers they revealed in the ARC-AGI announcement. Even if they were public API prices we can't assume they're making a profit on those for as long as they're billions in the red overall every year, its entirely possible that the public API prices are less than what OpenAI is actually paying.
The value of synthetic data relies on having non-zero signal about which generated data is "better" or "worse". In a sense, this what reinforcement learning is about. Ie, generate some data, have that data scored by some evaluator, and then feed the data back into the model with higher weight on the better stuff and lower weight on the worse stuff.
The basic loop is: (i) generate synthetic data, (ii) rate synthetic data, (iii) update model to put more probability on better data and less probability on worse data, then go back to (i).
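In hedged pseudocode, with everything model-specific stubbed out (`model.generate`, `rate`, and `finetune` are placeholders, not a real API):

```python
# Sketch of the generate -> rate -> update loop described above.
def synthetic_data_loop(model, rate, finetune, rounds=3, samples=1000):
    for _ in range(rounds):
        # (i) generate synthetic data
        candidates = [model.generate() for _ in range(samples)]
        # (ii) rate it with an evaluator that carries non-zero signal
        scored = [(rate(c), c) for c in candidates]
        # (iii) upweight the better data, downweight (here: drop) the worse
        scored.sort(reverse=True, key=lambda sc: sc[0])
        keep = [c for _, c in scored[: samples // 4]]
        model = finetune(model, keep)
    return model
```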
But who rates the synthetic data? If it is humans, I can understand that this is another way to get human knowledge into it, but if it's rated by AI, isn't it just a convoluted way of copying the rating AI's knowledge?
Many things are more easily scored than produced. Like it's trivial to tell whether a poem rhymes, but writing one is a comparatively slow and difficult task. So hopefully, since scoring is easier/more-discerning than generating, the idea is you can generate stuff, classify it as good or bad, and then retrain on the good stuff. It's kind of an article of faith for a lot of AI companies/professionals as well, since it prevents you from having to face a data wall, and is analogous to a human student practicing and learning in an appealing way.
As far as I know it doesn't work very well so far. It is prone to overfitting, where it ranks highly some trivial detail of the output, e.g. "if a summary starts with a byline of the author it's a sign of quality", and then starts looping on itself over and over, increasing the frequency and size of bylines until it has run off to infinity and is just repeating a short phrase endlessly. Humans have good baselines and common sense that these ML systems lack; if you've ever seen one of those "deep dream" images, it's the same kind of idea. The "most possible dog" image can look almost nothing like a dog, in the same way that the "most possible poem" may look nothing like a poem.
> This technique, the "Self-Taught Reasoner" (STaR), relies on a simple loop: generate rationales to answer many questions, prompted with a few rationale examples; if the generated answers are wrong, try again to generate a rationale given the correct answer; fine-tune on all the rationales that ultimately yielded correct answers; repeat. We show that STaR significantly improves performance on multiple datasets compared to a model fine-tuned to directly predict final answers
But there are a few others. In general, good data is good data. We're definitely learning more about how to produce good synthetic versions of it.
One issue with that is that the model may learn to smuggle data. You as a human think that the plain reading of the words is what is doing the reasoning, but (part of) the processing is done by the exact comma placement and synonym choice etc.
Data smuggling is a known phenomenon in similar tasks.
There is an enormous "iceberg" of untapped non-public data locked behind paywalls or licensing agreements. The next frontier will be spending money and human effort to get access to that data, then transform it into something useful for training.
Meanwhile, the biggest opportunity lies not in whatever next thing OpenAI releases, but the rest of the enormous software industry actually integrating this technology and realizing the value it can deliver.
Counterpoint: o1-Pro is insanely good -- subjectively, it's as far above GPT4 as GPT4 was above 3. It's almost too good. Use it properly for an extended period of time, and one begins to worry about the future of one's children and the utility of their schooling.
o3, by all accounts, is better still.
Seems to me that things are progressing quickly enough.
Not sure what you are using it for, but it is terrible for me for coding; Claude beats it always and hands down. o1 just thinks forever, only to come up with stuff it already tried the previous time.
People say that's just prompting without pointing to real million line+ repositories or realistic apps to show how that can be improved. So I say they are making todo and hello world apps and yes, there it works really well. Claude still beats it, every.. single.. time..
And yes, I use the Pro tier of them all, and yes, I do assume coding as a profession is done for most people. Become a plumber or electrician or carpenter.
That's so weird; it seems like everybody here prefers Claude.
I’ve been using Claude and OpenAI in Copilot, and I find even 4o seems to understand the problem better. o1 definitely seems to get it right more often for me.
I try to sprinkle 'for us/me' everywhere as much as I can; we work mostly on LoB/ERP apps. These are small frontends to massive multi-million-line backends. We carved out a niche by building these frontends live at the client's office: a business consultant of ours solves UX issues for the client on top of a large ERP by using our tool and prompting. Everything looks modern, fresh and nice, unlike basically all the competitors in this space. It's fast and no frontend people are needed for it; the backend is another system we built, which of course takes a lot longer, as those are complex business rules. Both Claude and o1 turn up something that looks similar, but only the Claude version will work and, after less prompting, be correct. I don't have shares in either and I want open source to win; we have more open solutions running all the same queries and we evaluate them all, but Claude just wins. We even managed big wins with OpenAI davinci in 2022 (or so; before ChatGPT), but this is a massive boost. It lets us upgrade most people to business consultant and have them build with clients in real time, while the tech guys, including me, manually add tests and proofs (where needed) to know whether we are actually fine. It works so much better than the old slog with clients; people are so bad at explaining what they need that it was slowly driving me insane after doing it for 30+ years.
They're both okay for coding, though for my use cases (which are niche and involve quite a lot of mathematics and formal logic) o1/o1-Pro is better. It seems to have a better native grasp of mathematical concepts, and it can even answer very difficult questions from vague inputs, e.g.: https://chatgpt.com/share/676020cb-8574-8005-8b83-4bed5b13e1...
Different languages maybe? I find Sonnet v2 to be lacking in Rust knowledge compared to 4o 11-20, but excelling at Python and JS/TS. o1's strong side seems to be complex or quirky puzzle-like coding problems that can be answered in a short manner; it's meh at everything else, especially considering the price. Which is understandable given its purpose and training, but I have no use for it, as that's exactly the sort of problem I wouldn't trust an LLM to solve.
Sonnet v2 in particular seems to be a bit broken with its reasoning (?) feature. The one where it detects it might be hallucinating (what's even the condition?) and reviews the reply, reflecting on it. It can make it stop halfway into the reply and decide it wrote enough, or invent some ridiculous excuse to output a worse answer. Annoying, although it doesn't trigger too often.
We do the same automatically for our research (all requests go to o1, Sonnet and Gemini, and we store the results to compare later): Claude always wins, even with specific prompting on both platforms. Especially for frontend, o1 really seems terrible.
Exactly. The previous version of o1 did actually worse in the coding benchmarks, so I would expect it to be worse in real life scenarios.
The new version released a few days ago on the other hand is better in the benchmarks, so it would seem strange that someone used it and is saying that it’s worse than Claude.
Every time I try Gemini, it's really subpar. I found that qwen2.5-coder-32b-instruct can be better.
Also, for me it's 50/50 between Sonnet and o1, though I'm not 100% sure about it; I think o1 is better with longer and more complicated (C++) code and debugging, at least from my brief testing. Also, OpenAI models seem to be more verbose. Sometimes that's better, e.g. where I'd like additional explanation of chosen fields in a SQL schema; sometimes it's too much.
EDIT: Just asked both o1 and Sonnet 3.5 the same QML coding question, and Sonnet 3.5 succeeded, o1 failed.
Very anecdotal but I’ve found that for things that are well spec’d out with a good prompt Sonnet 3.5 is far better. For problems where I might have introduced a subtle logical error o1 seems to catch it extremely well. So better reasoning might be occurring but reasoning is only a small part of what we would consider intelligence.
Wins? What does this mean? Do you have any results? I see the claims that Claude is better for coding a lot but using it and using Gemini 2.0 flash and o1 and it sure doesn't seem like it.
I keep reading this on HN so I believe it has to be true in some ways, but I don't really feel like there is any difference in my limited use (programming questions or explaining some concepts).
If anything I feel like it's all been worse compared to the first release of ChatGPT, but I might be wearing rose colored glasses.
It’s the same for me. I genuinely don’t understand how I can be having such a completely different experience from the people who rave about ChatGPT. Every time I’ve tried it’s been useless.
How can some people think it’s amazing and has completely changed how they work, while for me it makes mistakes that a static analyser would catch? It’s not like I’m doing anything remarkable, for the past couple of months I’ve been doing fairly standard web dev and it can’t even fix basic problems with HTML. It will suggest things that just don’t work at all and my IDE catches, it invents APIs for packages.
One guy I work with uses it extensively and what it produces is essentially black boxes. If I find a problem with something “he” (or rather ChatGPT) has produced it takes him ages to commune with the machine spirit again to figure out how to fix it, and then he still doesn’t understand it.
I can’t help but see this as a time-bomb, how much completely inscrutable shite are these tools producing? In five years are we going to end up with a bunch of “senior engineers” who don’t actually understand what they’re doing?
Before people cry “o tempora o mores” at me and make parallels with the introduction of high-level languages, at least in order to write in a high-level language you need some basic understanding of the logic that is being executed.
> How can some people think it’s amazing and has completely changed how they work, while for me it makes mistakes that a static analyser would catch?
There are a lot of code monkeys working on boilerplate code. These people used to rely on Stack Overflow, and now that ChatGPT is here it's a huge improvement for them.
If you work on anything remotely complex, or anything that hasn't been solved 10 times on Stack Overflow, ChatGPT isn't remotely as useful.
I work on very complex problems. Some of my solutions have small, standard substeps that I can now reliably outsource to ChatGPT. Here are a few just from last week (a rough sketch of the first one follows the list):
- write cvxpy code to find the chromatic number of a graph, and an optimal coloring, given its adjacency matrix.
- given an adjacency matrix, write numpy code that enumerates all triangle-free vertex subsets.
- please port this old code from tensorflow to pytorch: ...
- in pytorch, i'd like to code a tensor network defining a 3-tensor of shape (d, d, d). my tensor consists of first projecting all three of its d-dimensional inputs to a k-dimensional vector, typically k=d/10, and then applying a (k, k, k) 3-tensor to contract these to a single number.
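For the first request, the kind of answer you'd hope to get back looks roughly like the standard assignment-based MILP formulation below. This is a sketch, not verified model output, and it assumes a MILP-capable solver (e.g. GLPK_MI, CBC, or SCIP) is installed for cvxpy to call:

    import cvxpy as cp
    import numpy as np

    def chromatic_number(adj):
        """Minimum coloring of a graph given its 0/1 adjacency matrix, as a small MILP."""
        n = adj.shape[0]
        x = cp.Variable((n, n), boolean=True)  # x[v, c] = 1 iff vertex v gets color c
        y = cp.Variable(n, boolean=True)       # y[c] = 1 iff color c is used at all

        constraints = [cp.sum(x, axis=1) == 1]          # each vertex gets exactly one color
        for c in range(n):
            constraints.append(x[:, c] <= y[c])         # can only use colors that are "on"
        for u in range(n):
            for v in range(u + 1, n):
                if adj[u, v]:
                    constraints.append(x[u, :] + x[v, :] <= 1)  # neighbors get different colors

        problem = cp.Problem(cp.Minimize(cp.sum(y)), constraints)
        problem.solve()  # needs a mixed-integer solver available
        return int(round(problem.value)), np.argmax(x.value, axis=1)

    adj = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]])  # a triangle: needs 3 colors
    print(chromatic_number(adj))

Minimizing the sum of the y variables counts how many colors are actually used, which is the chromatic number, and taking the argmax over each row of x reads off one optimal coloring.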
To be honest, these don’t sound like hard problems. These sound like they have very specific answers that I might find in the more specialized stackoverflow sections. These are also the kind of questions (not in this domain) that I’ve found yield the best results from LLMs.
In comparison asking an LLM a more project specific question “this code has a race condition where is it” while including some code usually is a crapshoot and really depends if you were lucky enough to give it the right context anyway.
Sure, these are standard problems, I’ve said so myself. My point is that my productivity is multiplied by ChatGPT, even if it can only solve standard problems. This is because, although I work on highly non-standard problems (see https://arxiv.org/abs/2311.10069 for an example), I can break them down into smaller, standard components, which ChatGPT can solve in seconds. I never ask ChatGPT "where's the race condition" kind of questions.
I think the difference comes down to interacting with it like IDE autocomplete vs. interacting with it like a colleague.
It sounds like you're doing the former -- and yeah, it can make mistakes that autocomplete wouldn't or generate code that's wrong or overly complex.
On the other hand, I've found that if you treat it more like a colleague, it works wonderfully. Ask it to do something, then read the code and ask follow-up questions. If you see something that's wrong or just seems off, tell it, and ask it to fix it. If you don't understand something, ask for an explanation. I've found that this process generates great code that I often understand better than if I had written it from scratch, and in a fraction of the time.
It also sounds like you're asking it to do basic tasks that you already know how to do. I find that it's most useful in tackling things that I don't know how to do. It'll already have read all of the documentation and know the right way to call whatever APIs, etc, and -- this is key -- you can have a conversation with it to clear up anything that's confusing.
This takes a big shift in mindset if you've been using IDEs all your life and have expectations of LLMs being a fancy autocomplete. And you really have to unlearn a lot of stuff to get the most out of them.
I'm in the same boat as the person you're responding to. I really don't understand how to get anything helpful out of ChatGPT, or more than anything basic out of Claude.
> I've found that if you treat it more like a colleague, it works wonderfully.
This is what I've been trying to do. I don't use LLM code completion tools. I'll ask anything from how to do something "basicish" with html & css, and it'll always output something that doesn't work as expected. Question it and I'll get into a loop of the same response code, regardless of how I explain that it isn't correct.
On the other end of the scale, I'll ask about an architectural or design decision. I'll often get a response that is in the realm of what I'd expect. When drilling down and asking specifics however, the responses really start to fall apart. I inevitably end up in the loop of asking if an alternative is [more performant/best practice/the language idiomatic way] and getting the "Sorry, you're correct" response. The longer I stay in that loop, the more it contradicts itself, and the less cohesive the answers get.
I _wish_ I could get the results from LLMs that so many people seem to. It just doesn't happen for me.
The first time I tried it, I asked it to find bugs in a piece of very well tested C code.
It introduced an off-by-one error by miscounting the number of arguments in an sprintf call, breaking the program. And then proceeded to fail to find that bug that it introduced.
> How can some people think it’s amazing and has completely changed how they work, while for me it makes mistakes that a static analyser would catch? It’s not like I’m doing anything remarkable, for the past couple of months I’ve been doing fairly standard web dev and it can’t even fix basic problems with HTML.
Part of this is, I think, anchoring and expectation management: you hear people say it's amazing and wonderful, and then you see it fall over and you're naturally disappointed.
My formative years started off with Commodore 64 basic going "?SYNTAX ERROR" from most typos plus a lot of "I don't know what that means" from the text adventures, then Metrowerks' C compiler telling me there were errors on every line *after but not including* the one where I forgot the semicolon, then surprises in VisualBasic and Java where I was getting integer division rather than floats, then the fantastic oddity where accidentally leaning on the option key on a mac keyboard while pressing minus turns the minus into an n-dash which looked completely identical to a minus on the Xcode default font at the time and thus produced a very confusing compiler error…
So my expectations have always been low for machine generated output. And it has wildly exceeded those low expectations.
But the expectation management goes both ways, especially when the comparison is "normal humans" rather than "best practices". I've seen things you wouldn't believe...
Entire files copy-pasted line for line, "TODO: deduplicate" and all,
20 minute app starts passed off as "optimized solutions."
FAQs filled with nothing but Bob Ross quotes,
a zen garden of "happy little accidents."
I watched iOS developers use UI tests
as a complete replacement for storyboards,
bi-weekly commits, each a sprawling novel of despair,
where every change log was a tragic odyssey.
Google Spreadsheets masquerading as bug trackers,
Swift juniors not knowing their ! from their ?,
All those hacks and horrors… lost in time,
Time to deploy.
(All true, and all pre-dating ChatGPT).
> It will suggest things that just don’t work at all and my IDE catches, it invents APIs for packages.
Aye. I've even had that with models forgetting the APIs they themselves have created, just outside the context window.
To me, these are tools. They're fantastic tools, but they're not something you can blindly fire-and-forget…
…fortunately for me, because my passive income is not quite high enough to cover mortgage payments, and I'm looking for work.
> In five years are we going to end up with a bunch of “senior engineers” who don’t actually understand what they’re doing?
Yes, if we're lucky.
If we're not, the models keep getting better and we don't have any "senior engineers" at all.
The ones who use it extensively are the same that used to hit up stackoverflow as the first port of call for every trivial problem that came their way. They're not really engineers, they just want to get stuff done.
Same here. On every release from OpenAI or Anthropic I keep reading how the new model is so much better (insert hyperbole here) than the previous one, yet when I use it I feel like they are mostly the same as last year.
One use-case: They help with learning things quickly by having a chat and asking questions. And they never get tired or emotional. Tutoring 24/7.
They also generate small code or scripts, as well as automate small things, when you're not sure how, but you know there's a way. You need to ensure you have a way to verify the results.
They do language tasks like grammar-fixing, perfect translation, etc.
They're 100 times easier and faster than search engines, if you limit your uses to that.
They can't help you learn what they don't know themselves.
I'm trying to use them to read historical handwritten documents in old Norwegian (Danish, pretty much). Not only do they not handle the German-style handwriting, but what they spit out looks like the sort of thing GPT-2 would produce if you asked it to write Norwegian (only slightly better than the Muppet Swedish Chef's Swedish). It seems the experimental tuning has made it worse at the task I most desperately want to use it for.
And when you think about it, how could it not overfit in some sense, when trained on its own output? No new information is coming in, so it pretty much has to get worse at something to get better at all the benchmarks.
Hah, no. They're good, but they definitely make stuff up when the context gets too long. Always check their output, just the same as you already note they need for small code and scripts.
If you've ever used any enterprise software for long enough, you know the exact same song and dance.
They release version Grand Banana. Purported to be approximately 30% faster with brand new features like Algorithmic Triple Layering and Enhanced Compulsory Alignment. You open the app. Everything is slower, things are harder to find and it breaks in new, fun ways. Your organization pays a couple hundred more per person for these benefits. Their stock soars, people celebrate the release and your management says they can't wait to see the improvement in workflows now that they've been able to lay off a quarter of your team.
Has there been improvements in LLMs over time? Somewhat, most of it concentrated at the beginning (because they siphoned up a bunch of data in a dubious manner). Now it's just part of their sales cycle, to keep pumping up numbers while no one sees any meaningful improvement.
I had a 30 min argument with o1-pro where it was convinced it had solved the halting problem. Tried to gaslight me into thinking I just didn’t understand the subtlety of the argument. But it’s susceptible to appeal to authority and when I started quoting snippets of textbooks and mathoverflow it finally relented and claimed there had been a “misunderstanding”. It really does argue like a human though now...
I had a similar experience with regular o1 about an integral that was divergent. It was adamant that it wasn't, and would respond to any attempt at persuasion with variants of "it's a standard integral" with a "subtle cancellation". When I asked for any source for this standard integral, it produced references that existed but didn't actually contain the integral. When I told it the references didn't have the result, it backpedalled (gaslighting!) to "I never told you they were in there". When I pointed out that in fact it did, it insisted this was just a "misunderstanding". It only relented when I told it Mathematica agreed the integral was divergent. It still insisted it never said that the books it pointed to contained this (false, nonsensical) result.
This was new behaviour for me to see in an LLM. Usually the problem is these things would just fold when you pushed back. I don't know which is better, but being this confidently wrong (and "lying" when confronted with it) is troubling.
The troubling part is that the references themselves existed -- one was an obscure Russian text that is difficult to find (but is exactly where you'd expect to find this kind of result, if it existed).
I want AI to help me in the physical world: folding my laundry, cooking and farming healthy food, cleaning toilets. Training data is not lying around on the internet for free, but it's also not impossible. How much data do you need? A dozen warehouses full of robots folding and unfolding laundry 24/7 for a few months?
I think it would be many decades before I'd trust a robot like that around small children or pets. Robots with that kind of movement capability, as well as the ability to pick up and move things around, will be heavy enough that a small mistake could easily kill a small child or pet.
That's a solved problem for small devices. And we effectively have "robots" like that all over the place. Sliding doors in shops/trains/elevators have been around for ages and they include sensors for resistance. Unless there's 1. extreme cost cutting, or 2. bug in the hardware, devices like that wouldn't kill children these days.
Even for adults, a robot that would likely have to be close to as massive as a human being, in order to do laundry and the like, would spook me out, moving freely through my place.
That's the point being made. It has transformed robotics research, yes, but it remains to be seen whether it will have a truly transformative effect on the field as experienced by people outside academia (I think this is quite probable), and more pointedly when.
I think it's impossible to spend a lot of time with these models without believing robotics is fundamentally about to transform. Even the most sophisticated versions of robotic logic pre-LLM/VLM feel utterly trivial compared to what even rudimentary applications of these large models can accomplish.
I think this is an opinion borne out of weariness with constant promises that amazing robots are right around the corner (as they have been for 20 odd years now). For anyone who is close to the front line, I think the resounding consensus is clear - this time is different, unbelievably different, and capability development is going to accelerate dramatically.
Laundry folding is an instructive example. Machines have been capable of home-scale laundry folding for over a decade, with two companies Foldimate and Laundroid building functional prototypes. The challenge is making it cost-competitive in a world where most people don't even purchase a $10 folding board.
I would guess that most cooking and cleaning tasks are in basically the same space. You don't need fine motor control to clean a toilet bowl, but you've gotta figure out how to get people to buy the well-proven premisting technology before you'll be able to sell them a toilet-cleaning robot.
Counterexample: Everyone uses dishwashers. Yet I don’t think we’ll have a robot doing the dishes human-style, or even just filling up and clearing out a dishwasher, within the next decade or two, regardless of price.
Part of the tradeoff there is efficiency. I like my dishwasher because it's as good at getting things clean as I am, but it does it using less water and less soap, and at scale it takes less time too. It's just a great use case for machine automation because you can do clever stuff with a dishwasher that's hard to replicate outside of that closed environment.
I struggle to imagine a scenario where a 1-2 person household would get the same benefits from something like a laundry-folding robot. I hate folding my laundry and I still can't imagine buying one since I simply don't do laundry that often. If I really wanted to spend less time doing laundry, I could spend the cost of that laundrybot on a larger collection of clothing to wear, for that matter.
Robot vacuums are a good comparison point since vacuuming is something you (ideally) do frequently that is time and labor intensive. I do own one of those, and if it got better at dealing with obstacles thanks to "AI" I would definitely like that.
I think it would have to be a general-purpose robot, and doing the laundry would just be one of many things it can do, similar to how running a particular program is only one of many things a computer can do. More than that, I believe it would actually require a general-purpose robot to handle all contingencies that can arise in doing laundry.
As someone who does laundry about twice a week, it would certainly be nice. But it’s a pie in the sky at this time even just on the technological side.
There's plenty of machines which are expensive, bulky, single purpose and yet commercially successful. The average American household has a kitchen range, refrigerator, dishwasher, laundry machine, dryer, television, furnace, and air conditioner. Automatic coffee machines and automatic vacuums are less universal but still have household penetration in the millions. I really think the household tasks with no widely available automation are simply the ones that nobody cares enough about doing to pay for automation.
A robot servant that does literally 100% of chores would be a game changer, and I expect we'll get there at some point, but it will probably have to be a one-shot from a consumer perspective. A clever research idea to reach 25% or 50% coverage still isn't going to lead to a commercially viable product.
For a company that sees itself as the undisputed leader and that wants to raise $7 trillion to build fabs, they deserve some of the heaviest levels of scrutiny in the world.
If OpenAI's investment prospectus relies on them reaching AGI before the tech becomes commoditized, everyone is going to look for that weakness.
"I was on an airplane and there was high-speed Internet on the airplane. That's the newest thing that I know exists. And I'm sitting on the plane and they go, open up your laptop, you can go on the Internet.
And it's fast, and I'm watching YouTube clips. It's amazing. I'm on an airplane! And then it breaks down. And they apologize, the Internet's not working. And the guy next to me goes, 'This is bullshit.' I mean, how quickly does the world owe him something that he knew existed only 10 seconds ago?"
Soon, all the middle class jobs will be converted to profits for the capital/data center owners, so they have to spend while they can before the economy crashes due to lack of spending.
Not invariably. Some of those people are the ones who want to draw 7 red lines all perpendicular, some with green ink, some with transparent and one that looks like a kitten.
No, people who say "it's bullshit" and then do something to fix the bullshit are the ones that push technology forward. Most people who say "it's bullshit" instantly when something isn't perfect for exactly what they want right now are just whingers and will never contribute anything except unconstructive criticism.
There's someone with this comment in every thread. Meanwhile, no one answers this because they are getting value. Please take the time to learn, it will give you value.
I’m a consultant. Having looked at several enterprises, there’s a lot of work being done to make a lot of things that don’t really work.
The bigger the ambition, the harder they’re failing. Some well designed isolated use cases are ok. Mostly things about listening and summarizing text to aid humans.
I have yet to see a successful application that is generating good content. IMO replacing the first draft of content creation and having experts review and fix it is, like, the stupidest strategy you can pursue. The people you replace are the people at the bottom of the pyramid who are supposed to do this work to upskill and become domain experts so they can later review stuff. If they’re no longer needed, you’re going to lose your reviewers one day, and with them, the ability to assess your generated drafts. It’s a footgun.
I mean, no, not generally. But the success rate of other tools is much higher.
A lot of companies are trying to build these general-purpose bots that just magically know everything about the company and have these big knowledge bases, but they just don’t work.
I'm someone who generally was a "doubter", but I've dramatically softened my stance on this topic.
Two things:
I was casually watching Andreas Kling's streams on Ladybird development (where he was developing a JIT compiler for JS) and was blown away at the accuracy of completions (and the frequency of those completions)
Prior to this, I'd only ever copypasta'd code from ChatGPT output on occasion.
I started adopting the IDE/Editor extensions and prototyping small projects.
There are now small tools and utilities I've written that I'd not have written otherwise, or that would have taken twice the time invested had I not used these tools.
With that said, they'd be of no use without oversight, but as a productivity enhancement, the benefits are enormous.
For my mental health I’ve stopped replying to comments where it’s clear the author has no intention of having a discussion and instead wants to share their opinion and have it reinforced by others.
No, we don’t have AGI or anything close to it. Yes, AI has come a long way in the past decade and many people find it useful in their day-to-day lives.
It’s difficult to know where AI will be in 10 years, but the current rate of improvement is staggering.
> Meanwhile, no one answers this because they are getting value.
You're literally doing the same thing you're accusing of. Every HN thread is full of AI boosters claiming AI to be the future with no backing evidence.
Riddle me this. If all these people are "getting value", why are all these companies losing horrendous amounts of money? Why has nobody figured out how to be profitable?
> Please take the time to learn, it will give you value.
Yeah, yeah, just prompt engineer harder. That'll make the stochastic parrot useful. Anyone who has criticism just does so because they're dumb and you're smart. Same as it always was. Everyone opposed to the metaverse just didn't get it bro. You didn't get NFTs bro. You didn't get blockchain bro.
None of these previous bubbles had money in it (beyond scamming idiots), if AI wants to prove it's not another empty tech bubble, pay up. Show me the money. Should be easy, if it's automating so many expensive man-hours of labour. People would be lining up to pay OpenAI.
Think of all the search engines - AlltheWeb, Yahoo, AltaVista, ... - where so much money got poured in, and in the end there was just one winner taking it all. That's the race OpenAI is trying to win now. The competition is fierce, we get to play with all kinds of models for free, and we do nothing but complain.
> Riddle me this. If all these people are "getting value", why are all these companies losing horrendous amounts of money? Why has nobody figured out how to be profitable?
While I agree that LLMs are not currently working great for most envisioned use cases, this premise is not a good argument. Large LLM providers are not trying to be profitable at the moment. They’re trying to grow, and that’s pretty sensible.
Uber was the poster child of this, and for all the mockery, Uber is now an unambiguously profitable company.
I'm not sure I would call it sensible to incinerate $11B a year, to the point where you need to do one of the biggest raises ever and it doesn't even buy you a year of runway.
> Why has nobody figured out how to be profitable?
From what I've seen claimed about OpenAI finances, this is easy: It's a Red Queen's race — "it takes all the running you can do, to keep in the same place".
If their financial position was as simple as "we run this API, we charge X, the running cost is Y", then they're already at X > Y.
But if that was all OpenAI were actually doing, they'd have stopped developing new versions or making the existing models more efficient some time back, while the rest of the industry kept improving their models and lowering their prices, and they'd be irrelevant.
> People would be lining up to pay OpenAI.
They are.
Not that this is either sufficient or necessary to actually guarantee anything about real value. For lack of sufficiency: people collectively paid a lot for cryptocurrencies and NFTs, too (and, before then and outside tech, homeopathic tinctures and sub-prime mortgages). For lack of necessity: there are plenty of free-to-download models.
I get a huge benefit even just from the free chat models. I could afford to pay for better models, but why bother when free is so good? Every time a new model comes out, the old paid option becomes the new free option.
• Build toys that would otherwise require me to learn new APIs (I can read python, but it's not my day job)
• Learn new things like OpenSCAD
• To improve my German
• Learn about the world by allowing me to take photos of things in this world that I don't understand and ask them a question about the content, e.g. why random trees have bands or rectangles of white paint on them
• Help me with shopping, by taking a photo of the supermarket that I happen to be in at the time and asking them where I should look for some item I can't find
• Help with meal prep, by allowing me to get a recipe based on what food and constraints I've got at hand rather than the traditional method of "if you want x, buy y ingredients"
Even if they're just an offline version of Wikipedia or Google, they're already a more useful interface for the same actual content.
That's what puzzles me now. Everyone with a semblance of expertise in engineering knows that if you start with a tool and try to find a problem it could solve, you are doing it wrong. The right way is the opposite: you start with a problem and find the best tool to solve it, and if that's the new shiny tool, so be it, but most of the time it's not.
Except the whole tech world starting with the CEOs seems to do it the "wrong" way with LLMs. People and whole companies are encouraged to find what these things might be actually useful for.
GPT-5 is not behind schedule. GPT-5 is called GPT-4o and it was already released half a year ago. It was not revolutionary enough to be called 5, and prophet saint Altman was probably afraid to release a new generation that wasn't an exponential improvement, so it was rebranded at the last moment. It's speculation of course, but it is kinda obvious speculation.
This is the first I have heard of this in particular. Do you know of any article or source for more on the efforts to train GPT-5 and the decision to call it GPT-4o?
I think my biggest pet peeve is when someone shares an insight which is unmistakably based on intuition, inference, critical thinking, etc (all mental faculties we are allowed to use to come to conclusions in the face of information asymmetry btw)
...and then gets hit deadpan with the good old "Source?", like it's some sort of gotcha.
I think people have started to confuse "making logical conclusions without perfect info" with "misinformation"
-
Before certain people start acting like this is advocating for misinformation (which would be an incredible irony...) it's not.
I'm saying if you disagree with what someone posits, just state so directly. Don't wrap it in a disingenuous query for a source.
It's reasonable to ask for sources when an opinion is phrased as a fact, as GGP did. I don't see how you got that it was _unmistakably_ an opinion from that comment.
There is no way to deduce by intuition alone that GPT-5 == GPT-4o. So either that person has some information the rest of us aren't privy to, or it's an opinion phrased as a fact. In either case, it deserves clarification.
On a second read I see that the comment notes it is intended as speculation, but it still seems rather confident in its own accuracy. I'm not even sure it's wrong; I'm just looking for something that warrants the confidence.
I wrote my comment that way based on my personal memories of the news cycle between GPT-4 and GPT-4o, and the claims OpenAI made about GPT-4o. The hype before the 4o release was overwhelming, people expected the same step up as between 3 and 4, and there were constant "leaks" from supposed insiders that GPT-5 was just over the horizon and would come out soon. And then they released 4o, which was a big standalone release, not some fine-tuning like turbo or whatever else they made before.
Looking at the benchmarks, it was also very much expected in my opinion. Sure, the absolute results are/were sky high, but the gains relative to the previous generation were not exponential; they were comparatively smaller than between 2 and 3, or 3 and 4. So I'm guessing they invested and worked through 2023-2024 on a brand new model, and branded it according to the model's results.
That was clearly phrased like a fact, which may or may not be correct. If it had been phrased like an opinion we wouldn't be having this conversation...
The problem is once you believe their fact is wrong, just say "I think you're wrong <insert rest of comment>". Innocently asking for a source as if you're still on the fence is just performative and leads to these conversations where both sides just end up talking past each other:
A source for one underpinning of the incorrect fact comes up, then "well but that only proves X part of it, can you prove Y" and so on.
tl;dr I just find the quality of discourse is much higher when people are direct.
> I just find the quality of discourse is much higher when people are direct.
Well this certainly is a lot of work to make a mountain out of a mole hill, and I'm not sure it increases the quality of discussion either.
In any case, I think saying bold shit followed up with "it's speculation, but it's OBVIOUS speculation" is worth asking for some evidence. Obvious speculation implies it's sourced from something other than personal gut feeling.
To echo a sibling comment:
> Every time someone says their speculation is "obvious" it rings every possible alarm bell for someone who has completely lost grasp of the ability to distinguish between facts and speculation.
I think it's okay to make logical conclusions but you must base them in evidence, not just suppositions. Intuition is a good start to begin generating hypothesis, but it doesn't render conclusions. I interpreted the GP asking for sources as "can you give me some evidence that would help me reach the same conclusions you've reached". I think that's much preferable to just accepting random things people say at face value.
Even with evidence, a logical conclusion can still be a supposition (aka an uncertain belief), and often is in the face of the kind of information asymmetry inherent to any outsider commenting on a private company's internal roadmap... but I digress.
My point is simply that we can skip the passive-aggressiveness and just say "can you give me some more evidence that would help me reach the same conclusions you've reached".
Otherwise you're not actually asking for a source, you're just saying "I disagree" in a very roundabout way.
It doesn't even look like 4o is scaled up parameter-wise from 4, and it was released closer in time to its predecessor than either 3 or 4 were to theirs, at a time when the scaling required for these next-gen iterations has only gotten more difficult.
Critical thinking? Lol, it's just blind speculation.
If you disagree with their reasoning then you explain that.
You don't do this passive aggressive "source???" thing.
It's a bit like starting a Slack conversation with "Hi?": we all know you have a secondary objective, but now you're inserting an extra turn of phrase into the mix
Not everyone keeps up with LLM development enough to know how far apart the release dates for these models are, how much scaling (roughly) has been done on each iteration and a decent ballpark for how much open ai might try to scale up a next gen model.
To me, OP's speculation reads as obvious nonsense, but that might not be the case for everybody. Asking for sources for what is entirely speculation is perfectly valid, and personally that comment doesn't ring as passive-aggressive to me, but maybe it's just me.
Just because someone doesn't know enough to refute the reasoning doesn't mean they must take whatever they read at face value.
If we're making this about the innocent bystanders now, that's all the more reason to be direct and say "I disagree." rather than indirectly expressing negative feelings (aka being passive aggressive) and asking for a source.
If anything just breezily asking for a source would imply to people who don't know better that this is a rather even keeled take and just needs some more evidence on top. "I disagree and here's why" nips that in the bud directly.
How is "I disagree" any more direct than "I've not heard anything like this. any source that would point at that?" Moreover who's to say this person even disagrees? Personally i don't always ask for them because of a disagreement.
I think the hanging point seems to be that you found the comment passive aggressive but i genuinely didn't.
My sister got taken in by drone conspiracy theories, because for her it was just "obvious" that nobody would ever mistake a plane for a drone.
Meanwhile, aeronautics experts whose job it is to know about this have created an entire lexicon for the various perceptual illusions we experience relating to flight and airborne objects, precisely because it involves conditions where our intuitions fail. Many of them have to do with inability to orient depth, distance, or motion for lights at night.
Every time someone says their speculation is "obvious" it rings every possible alarm bell for someone who has completely lost grasp of the ability to distinguish between facts and speculation.
The road to misinformation is paved with overconfident declarations of the form: "it's so obvious, who needs sources!"
Everyone's comparing o1 and claude, but neither really work well enough to justify paying for them in my experience for coding. What I really want is a mode where they ask clarifying questions, ideally many of them, before spitting out an answer. This would greatly improve utility of producing something with more value than an auto-complete.
Just tell it to do that and it will. Whenever I ask an AI for something and I'm pretty sure it doesn't have all the context I literally just say "ask me clarifying questions until you have enough information to do a great job on this."
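A minimal sketch of baking that instruction in as a system prompt with the OpenAI Python client (the model name and wording here are just placeholders; the same trick works in the chat UI by pasting the instruction at the top of your message):

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    messages = [
        {"role": "system", "content": (
            "Before writing any code, ask me clarifying questions until you have "
            "enough information to do a great job. Only give the final answer "
            "once I have answered your questions."
        )},
        {"role": "user", "content": "Add rate limiting to my API."},
    ]

    response = client.chat.completions.create(model="gpt-4o", messages=messages)
    print(response.choices[0].message.content)  # should come back with questions, not code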
And this chain of prompts, combined with the improved CoT reasoner, would produce much better results - more in line with what the coming agentic era promises.
Yes. You can only do so much with the information you get in. The ability to ask good questions, not just of itself in internal monologue style, but actually of the user, would fundamentally make it better since it can get more information in.
As it is now, it has a bad habit of, if it can't answer the question you asked, instead answering a similar-looking question which it thinks you may have meant. That is of course a great strategy for benchmarks, where you don't earn any points for saying you don't know. But it's extremely frustrating for real users, who didn't read their question from a test suite.
I know multiple people who carefully prompt to get that done. The model outputs tokens in order and can't turn around, so you need to make sure the questions strictly come before the answer; otherwise the system can and will come up with post-hoc "reasoning".
Just today I got Claude to convert a company’s PDF protocol specification into an actual working python implementation of that protocol. It would have been uncreative drudge work for a human, but I would have absolutely paid a week of junior dev time for it. Instead I wrote it alongside AI and it took me barely more than an hour.
The best part is, I’ve never written any (substantial) python code before.
I have to agree. It's still a bit hit or miss, but the hits are a huge time and money saver especially in refactoring. And unlike what most of the rather demeaning comments in those HN threads state, I am not some 'grunt' doing 'boilerplate work'. I mostly do geometry/math stuff, and the AIs really do know what they're talking about there sometimes. I don't have many peers I can talk to most of the time, and Claude is really helping me gather my thoughts.
That being said, I definitely believe it's only useful for isolated problems. Even with Copilot, I feel like the AIs just lack a bigger context of the projects.
Another thing that helped me was designing an initial prompt that really works for me. I think most people just expect to throw in their issue and get a tailored solution, but that's just not how it works in my experience.
It would seem you don't care too much about verifying its output or about its correctness. If you did, it wouldn't take you just an hour. I guess you'll let correctness be someone else's problem.
I don't know the OP here, but in my experience a junior dev at an average company would likely not do much more than the AI would. These aren't your grandfather's engineers, after all.
The results of this article are going to be fascinating. Realistically, WSJ has a far wider audience than the tech echo chamber, and the general public is only aware of GPT, not o1/o3.
Outsiders will likely read this article and think, “AI is running out of steam”, because GPT-5 is behind.
Those closer to this know of the huge advancements o3 just made yesterday, and will have a complete opposite conclusion.
It will be interesting to see people’s take away from this. I think WSJ missed the mark here with the headline and the takeaway their audience will get from the article.
Maybe you were dog-piled because OpenAI will ship a successor to GPT-4o someday, whatever it's called.
In any case, the "behind schedule" rumors are themselves based on other rumors. GPT-2→GPT-3 took 5 quarters, GPT-3→GPT-4 took 11 quarters, so obviously GPT-5 (or its equivalent) will be released in Q4'2025.
The lack of tech literacy in this article is a bit concerning:
>Some researchers take this so seriously they won’t work on planes, coffee shops or anyplace where someone could peer over their shoulder and catch a glimpse of their work.
I'm almost certain that originally this was meant to be a reference to public wifi networks, as planes and coffee shops are often the frequently cited prototypical examples. They made it literally into a matter of someone looking over their shoulder, which loses so much in translation it's almost how you would write this as a joke to illustrate someone missing the point.
>OpenAI and its brash chief executive, Sam Altman
This also strikes me as nonsense. It's the first I've ever heard of someone describing Sam Altman as brash. The only way I can see them getting there is (1) tech executives are often brash (2) Altman is a tech executive (3) let's just go ahead and call him brash.
Nevertheless if this history of GPT5 and/or o3 training is accurate, it strikes me as significant news, but perhaps a missed opportunity to say more about the pertinent dynamics that explain why the training isn't working and/or to talk in interestingly specific ways about strategies for training, synthetic data, or other such things.
This entire industry is something I feel like I understand 2% of and every time I make progress to get to 10% (3 months later) some massive change happens and all the terminology changes.
What I find odd is that o1 doesn't support attaching text documents to chats the way 4o does. For a model that specializes in reasoning, reading long documents seems like a natural feature to have.
If Sama ever reads this: I have no idea why no users seem to focus on this, but it would be really good to prioritise being able to select which model you can use with custom GPTs. I know this may be hard or not possible without recreating them, but as far as I can tell it still isn't possible.
I don't think most customers realise how much better the models work with custom GPTs.
They hyped them like crazy and haven't discussed them once since then. I agree that the inability to change the model is pretty absurd when the whole point was to "supercharge" specific tasks.
There was even talk of some sort of profit sharing with creators which clearly never happened. I just think the premise is too confusing for many and can still be served by using a custom system prompt via the API.
"When using custom instructions or files, only GPT-4o is available". Straight out of the ChatGPT web interface when you try to select which model you want to use.
o3 is actually orthogonal to AGI and ASI in a cartesian sense. My SaaS startup led multiple qualified teams where our RAG implementations on synthetic data originated positive inference in line with the literature (1).
(1) Sparks of AGI paper
I’m not smart enough or interesting enough to be hired by OpenAI to expertly solve problems and explain how to the AI. However, I like to think there isn’t enough money in the world for me to sell out my colleagues like that.
In my intuition it makes sense that there is going to be some significant friction in LLM development going forward. We're talking about models that will cost upwards of $1bn to train. Save for a technological breakthrough, GPT-6/7 will probably have to wait for hardware to catch up.
I think the main bottleneck right now is training data - they've basically exhausted all public sources of data, so they have to either pay humans to generate new data from scratch or pay for the reasoning models to generate (less useful) synthetic training data. The next bottleneck is hardware, and the least important bottleneck is money.
Considering how evasive they've been, it might also be YouTube.
> When pressed on what data OpenAI used to train Sora, Murati didn’t get too specific and seemed to dodge the question. “I’m not going to go into the details of the data that was used, but it was publicly available or licensed data,” she says. Murati also says she isn’t sure whether it used videos from YouTube, Facebook, and Instagram. She only confirmed to the Journal that Sora uses content from Shutterstock, with which OpenAI has a partnership.
Train for what? For making videos? Train on people’s comments? There’s a lot of garbage and AI slop on YouTube; how would this be sifted out? I think there’s more value here on HN in terms of training data, but even that, to what end?
From what I read, OpenAI is having trouble because there isn't enough data.
If you think about it, any video on YouTube of real-world data contributes to its understanding of physics at a minimum. From what I gather, they do pre-training on tons of unstructured content first, and that contributes to overall smartness.
YouTube is such a great multimodal dataset—videos, auto-generated captions, and real engagement data all in one place. That’s a strong starting point for training, even before you filter for quality. Microsoft’s Phi-series models already show how focusing on smaller, high-quality datasets, like textbooks, can produce great results. You could totally imagine doing the same thing with YouTube by filtering for high-quality educational videos.
Down the line, I think models will start using video generation as part of how they “think.” Picture a version of GPT that works frame by frame—ask it to solve a geometry problem, and it generates a sequence of images to visualize the solution before responding. YouTube’s massive library of visual content could make something like that possible.
Did you read the article? All it basically says is that OpenAI faced struggles this past year -- specifically with GPT-5 aka Orion. And now they have o3, and other labs have made huge strides. So, yes, show me AI progress is slowing down!
How about just an updated GPT-4o with newer data? It would go a long way. Currently it doesn't know anything after Oct 2023 (without having to do a web search).