Hot take on OpenAI’s new GPT-4o (garymarcus.substack.com)
40 points by isaacfrond 19 days ago | 72 comments



> If OpenAI had GPT-5, they would have shown it.

Not convinced. They had GPT-4 for ages before showing it.

>And the most important thing about the figure is that 4o is not a lot different from Turbo, which is not hugely different from 4.

It's twice as fast and half the cost.

>OpenAI has presumably pivoted to new features precisely because they don’t know how to produce the kind of capability advance that the “exponential improvement” would have predicted.

Or they think that users would rather have a faster GPT-4 than a smarter but slower one.


> It's twice as fast and half the cost.

There are plenty of use cases that just want the smartest possible AI, and could easily afford 1000x the price before paying a human expert.

How much would you pay a lawyer to write a letter for you? $200 maybe? Well, GPT-4o will do it for $0.0025. Would you pay 1000x more ($2.50) for an AI that leaves you with fewer legal footguns? Of course you would.

Likewise with time, I am totally happy to wait 10 minutes for a better result, rather than having a mediocre result in 10 seconds.


> Likewise with time, I am totally happy to wait 10 minutes for a better result, rather than having a mediocre result in 10 seconds

There's a massive gulf separating what people say they want and what they actually want, as well as a massive gulf separating what they will pay and what they say they'll pay.

While there are plenty of situations where you think you want the smartest possible AI, it's clear that a pretty smart, pretty quick one is good enough for a lot of things.


Expanding into a high-revenue but small niche is very challenging, and most importantly it's slow.

If some company is going to use a GPT version to do critical work, they'd be insane to jump on it before testing the solution and gaining confidence for a long time, while also evaluating exactly how to incorporate it into their workflow without creating total dependency and other such ancillary issues.

You would pay 1000x more for the certainty, which you can never have right away, together with the legal liability of the service provider.


It could be that using prompt engineering/search techniques like chain-of-thought (CoT) or tree-of-thoughts (ToT) gives a bigger improvement than scaling up the model.
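
For concreteness, here is a minimal sketch of a ToT-style search loop in Python. The `llm` and `score` functions are hypothetical placeholders for a model call and a value heuristic (which could itself be another model call); the point is that many cheap calls stand in for one expensive one.

    # Tree-of-thoughts-style beam search over short reasoning steps.
    def llm(prompt: str) -> str:
        """Placeholder for a call to a fast, cheap model."""
        raise NotImplementedError

    def score(chain: str) -> float:
        """Placeholder value heuristic for a partial reasoning chain."""
        raise NotImplementedError

    def tree_of_thoughts(question: str, breadth: int = 3, depth: int = 2, beam: int = 2) -> str:
        frontier = [""]  # partial reasoning chains, best-first
        for _ in range(depth):
            candidates = []
            for chain in frontier:
                for _ in range(breadth):
                    step = llm(f"{question}\nReasoning so far:{chain}\nNext step:")
                    candidates.append(chain + "\n" + step)
            frontier = sorted(candidates, key=score, reverse=True)[:beam]
        return llm(f"{question}\nReasoning:{frontier[0]}\nFinal answer:")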


Exactly. In the long run we want a cheap/fast LLM layer that we can call millions of times in a search/conversation-with-itself pattern. It may be that we'll interact with it the same way, and GPT-6 will be lots of GPT-5s talking to each other transparently to the user, but the path to next-level intelligence needs to get past generating the next token.

Since we don't know the real GPT-4 architecture, it may be doing some of this already.
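
As a toy illustration of that conversation-with-itself pattern (not a claim about GPT-4's actual architecture), a proposer/critic loop over the same cheap model might look like this, with `llm` again a hypothetical model call:

    def llm(prompt: str) -> str:
        """Placeholder for a call to a fast, cheap model."""
        raise NotImplementedError

    def refine(task: str, rounds: int = 3) -> str:
        # One instance drafts, another critiques, the first rewrites;
        # only the final draft is shown to the user.
        draft = llm(f"Task: {task}\nWrite a first draft:")
        for _ in range(rounds):
            critique = llm(f"Task: {task}\nDraft:\n{draft}\nList concrete flaws:")
            draft = llm(f"Task: {task}\nDraft:\n{draft}\nFlaws:\n{critique}\nRewrite the draft:")
        return draft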


I think we can be pretty confident GPT-4 isn't doing something like this based on its performance characteristics.

i) The complexity of the prompt/answer does not affect the time per token.

ii) We start getting a response pretty quickly (for short prompts) and then get new tokens at a roughly constant rate.

I don't think either of these properties would hold if there were a bunch of models that had to coordinate before they could start writing the output. Certainly techniques like ToT violate them.
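
Both properties are easy to check yourself. A rough sketch, assuming the openai Python client's streaming interface: measure time to first token and the steady-state token rate for prompts of varying complexity.

    import time
    from openai import OpenAI

    client = OpenAI()

    def timing(prompt: str, model: str = "gpt-4o"):
        start = time.time()
        first_token_at = None
        n_chunks = 0
        stream = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            stream=True,
        )
        for chunk in stream:
            if chunk.choices and chunk.choices[0].delta.content:
                if first_token_at is None:
                    first_token_at = time.time() - start
                n_chunks += 1  # roughly one token per chunk
        total = time.time() - start
        # time to first token, and tokens/sec after the first token
        return first_token_at, n_chunks / max(total - first_token_at, 1e-6)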


And not only is it twice as fast at half the cost, while benchmarking better... It also is multimodal over audio, including understanding turn-taking and ad hoc interruptions in human conversation and recognizing multiple speakers. That's a lot of extra capabilities — not every advancement is measured in logical reasoning on text.

Gary Marcus was an advocate of neuro-symbolic approaches to AI, as opposed to deep learning, and he's been incorrectly predicting that "Deep learning is hitting a wall" since the GPT-2 days. His Substack is near-daily low-content negative articles about LLMs, and on Twitter he posts multiple times a day about it and hasn't stopped for years.


> Or they think that users would rather have a faster GPT-4 than a smarter but slower one.

And they are absolutely right.

GPT-4 is already much more than enough for 90% of tasks while maintaining a sane dose of human double-checking.

Making it faster enables real-time workflows, and better energy efficiency also gives them much more capability to serve more requests/users while lowering costs. And that knowledge likely carries over to GPT-5 or whatever.


My duckduckgo results are starting to have summaries that do not reflect the content of the associated site and contain plausible falsehoods, courtesy of bing, and the content-farming keyword-spamming AI generated SEO slop goes without saying at this point. It'd be very nice if these models weren't also polluting the resources that people use to try and verify things.


> GPT-4 is already much more than enough for 90% of tasks while maintaining a sane dose of human double-checking.

There's the rub. Does the cost of double-checking a 90% solution beat the cost of current methods?


Yes. Absolutely.

Who cares if I have to scan through it, when it can write 100 lines of rote, boring code in 10 seconds that would take me 5 minutes?

I can type fast but GPT can type thousands of words per minute. I can't compete with that.


Debugging is way harder than getting it right the first time, though.

The fun thing with all this AI stuff is that people are beginning to appreciate that the nice thing about computers all along was their strict mechanical nature.


So as the article is suggesting, Gen AI has probably peaked.


I don't understand how you got that from his comment


When you start focusing on increasing speed and trimming the rough edges, the technology is maturing. It can't be maturing AND making huge leaps expected of GPT-5.


Even if the tech didn't improve one iota in the next 10 years, the low-hanging fruit that remains for the taking in applications is just staggering right now.


> It's twice as fast and half the cost.

is a great trick of language. If you just pay for computing time, half the time equals half the cost. But phrased that way, it sounds to some people like a 4x improvement. So...


For a lot of things you need to pay more for more speed, so that's why I specified both.

Otherwise they could probably make it way faster but also much more expensive by doing less batching, speculative decoding, etc.


As someone who actually uses the API for real products, I don't think the OP understands what the reduced latency and reduced cost mean: everything related to building a more advanced RAG, for example building agentic features into it, sooner or later runs into the same issues of speed and cost. GPT-4 Turbo was simply too slow and too expensive for us to really use it fully. GPT-4 is plenty intelligent for many use cases.
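
A back-of-the-envelope example of why this matters (numbers are made up, not OpenAI's actual pricing or latency): an agentic RAG request easily fans out into several model calls (rewrite query, pick tools, summarize chunks, draft, self-check), so per-call latency and cost multiply.

    calls_per_request = 5
    latency_per_call_s = {"gpt-4-turbo": 6.0, "gpt-4o": 3.0}   # assumed
    cost_per_call_usd = {"gpt-4-turbo": 0.02, "gpt-4o": 0.01}  # assumed

    for m in latency_per_call_s:
        print(m,
              f"{calls_per_request * latency_per_call_s[m]:.0f}s per request,",
              f"${calls_per_request * cost_per_call_usd[m]:.2f} per request")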

Also, why on Earth would OpenAI launch a dramatically better model as long as their competitors don't force them to? The smart solution for OpenAI would be to almost let their competitors catch up to GPT-4 before launching GPT-5 and no competitor is truly there yet.


> Also, why on Earth would OpenAI launch a dramatically better model as long as their competitors don't force them to? The smart solution for OpenAI would be to almost let their competitors catch up to GPT-4 before launching GPT-5 and no competitor is truly there yet.

Is that how Silicon Valley has worked for the last 20+ years? You deploy fast, get customer feedback, and then fix stuff based on that feedback. OpenAI holding back progress kinda goes against the ethos of SV.


He said nothing controversial, and yes, the model has regressed. This "flagship model" is faster and cheaper, but it is worse (regardless of benchmarks & charts; subjectively it is failing for me and many others).

The model seems better at logic, but something is off. It repeats the user prompt, gets hooked on literal meaning over context (within the provided conversation), and is too concise, as if it needs to provide a direct answer. Perhaps that's all good for a benchmark, code, or simple logic flows/chains (e.g. AI validation, which GPT-3.5 is good at as well), but it's not desirable for advanced/creative content generation.

My anecdotal receipt: None of the 15 functions we run in prod are switching to the 50% off model as product quality is degraded with gpt-4o.


> it is worse... regardless of benchmarks & charts

It's interesting how commonly this sentiment is expressed; at least, I heard it often for previous models as well.

Do you think such weaknesses can be quantified by adding more test cases, and improving the benchmarks? Maybe it would help if they opened up to community contributions, so people could submit test cases that demonstrate issues that they (OpenAI, et al) are currently not seeing.


> My anecdotal receipt: None of the 15 functions we run in prod are switching to the 50% off model as product quality is degraded with gpt-4o.

Interesting anecdata — how did you validate this?


My personal case is a flow/chain of many function calls for creative output. Flipping the switch resulted in noticeably different and undesired results. We tested a ton.

This tweeter normally has a narrative I turn away from, but the thread was useful as it at least made me feel like I wasn’t losing my mind by finding similar accounts. https://x.com/bindureddy/status/1790127425705120149?s=46&t=y...


As I wrote only yesterday, "the usual critics will point out that LLMs like GPT-4o still have a lot of failure modes and suffer from issues that remain unresolved. They will point out that we're reaping diminishing returns from Transformers. They will question the absence of a "GPT-5" model. And so on..."[a]

Gary Marcus is one of those critics. He's always ready to "explain" why a new model is not yet intelligent. In the OP, he repeats the usual criticisms, making him sound, ahem, like a stochastic parrot. He largely ignores all the work that has gone into making GPT-4o behave and sound in ways that feel more natural, more human.

My suggestion is to watch the demos of GPT-4o, play with it, and reach your own conclusions. To me, it feels magical. It makes the AIs of many movies look like they are no longer in the realm of science fiction but in the realm of incremental product development.

---

[a] https://news.ycombinator.com/item?id=40346080


It is interesting to me how quickly people jump on the negative bandwagon: "companies are wasting money", "there is no value in LLMs", "peak LLM"... the list goes on. There is a lot of value in these models, and the rate at which they are improving is impressive.


I think the benefits to society are going to be significant, but I suspect that most companies in the space will not recoup their investment. They're losing money or at best breaking even on each token. It's as if the return on their investments is accruing diffusely to the public instead of to them.


Depends on what space you are talking about. Do you mean the group building models, running the compute, or paying to use the models?

I don't know how the economics will work out for companies like OpenAI who are building the models; it's still unclear.

On the consumer side, there will be winners and losers like always. Companies are seeing real value from using LLMs. I think it's still in its infancy, so there will be a lot of net losers as people figure out what works and what doesn't, but these new tools are creating value, and doing it in a way that I think warrants investment from companies.


> Depends on what space you are talking about.

The companies constantly building and training new models to keep up with the competition.


I saw two clips, which I think were such demonstrations. Both reeked of desperation: one was a dude talking with a synthesised voice imitating a young, overly pleasing woman; the other used the same voice for a pretty weird, trivial translation between English and Italian. Machine translation is quite old by now, and for practical uses you'd want speech to translated text, so it doesn't interrupt conversation like in the demonstration.

Sure, I can imagine people with a lot of spare time and few or no friends might want to substitute a machine for them, and that they might feel that this has a magic to it. Magical here meaning roughly the same as it does when applied to Disneyland or similar entertainment simulacra.


Gary Marcus was arguing in 2020 that scaling up GPT-2 wouldn't result in improvements in common sense or reasoning. He was wrong, and he continues to be wrong.

It's called the bitter lesson for a reason. Nobody likes seeing their life's work on some unicorn architecture get demolished by simplicity + scale.


Why is it so hard for models to say "I don't know" or "That never happened"?

This seems to be a fundamental side effect of next token thinking, training data, etc.


The corpus of data wherein people say "I don't know" is small. Maybe we should all post more of that.


Just include transcripts from Congressional hearings, there's plenty of "I don't recall" written there.


How would the LLM know when it knows something or not? They don't deal in facts or memories, just next-word probabilities, and even if all probabilities are low it might just be because it has generated (sampled) an awkward turn of phrase with few common continuations.

There are solutions, but no quick band-aid.
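
One partial mitigation people experiment with is looking at token-level logprobs and flagging spans where the model itself was uncertain. A sketch, assuming the openai client's logprobs option; this is far from a reliable hallucination detector:

    import math
    from openai import OpenAI

    client = OpenAI()

    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": "Who was the first elephant to swim across the English Channel?"}],
        logprobs=True,
    )
    for tok in resp.choices[0].logprobs.content:
        p = math.exp(tok.logprob)
        if p < 0.5:  # arbitrary threshold
            print(f"low-confidence token: {tok.token!r} (p={p:.2f})")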


I have to assume that someone has run a trial on training these models to output answers to factual questions along with numerical probabilities, using a loss function based on a proper scoring rule of the output probabilities, and it didn't work well. That's an obvious starting point, right? All the "safety" stuff uses methods other than next-token prediction.
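
For readers unfamiliar with proper scoring rules: they are loss functions under which reporting your true confidence minimizes expected loss, so a calibrated "70% sure" beats a bluffed "100% sure". A tiny worked example with the Brier score:

    def brier(p: float, outcome: int) -> float:
        """Squared error between stated probability and the 0/1 truth."""
        return (p - outcome) ** 2

    # Suppose the model is actually right 70% of the time on such questions.
    def expected_loss(stated_p: float, true_p: float = 0.7) -> float:
        return true_p * brier(stated_p, 1) + (1 - true_p) * brier(stated_p, 0)

    print(expected_loss(1.0))  # overclaiming certainty: 0.30
    print(expected_loss(0.7))  # honest confidence: 0.21 (the minimum)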


The safety stuff seems to be mostly trying to locate mechanisms (induction heads, etc) and isolating knowledge, in the pursuit of lobotomizing models to make them safe.

You could RLHF/whatever models on common factual questions to try to get them to answer those specific questions better, but I doubt there'd be much benefit outside of those specific questions.

There are a couple of fundamental problems related to factuality.

1) They don't know the sources, and source reliability, of their training data.

2) At inference time all they care about is word probabilities, with factuality only coming into it tangentially as a matter of context (e.g. factual continuations are more probable in a factual context, not in a fantasy context). They don't have any innate desire to generate factual responses, and don't introspect if what they are generating is factual (but that would be easy to fix).


I wonder if the training to be compliant with the prompter is part of the problem. Both of those statements are similar to saying "I refuse to answer your query".

Or maybe this is inherent to continuation?

The behavior reminds me of the human subconscious, which doesn't say no, just raises up what it can.


I will repost the comment I made 12 months ago.

https://news.ycombinator.com/item?id=35402163

> LLMs are still an active area of research. Research is unpredictable. It may take years to gather enough fundamental results to make a GPT-5 core model that is substantially better than GPT-4. Or a key idea could be discovered tomorrow.

> What OpenAI can do while they are waiting is more of the easy stuff, for example more multimodality: integrating DALL-e with GPT-4, adding audio support, etc. They can also optimize the model to make it run faster.

OpenAI is doing science, not just engineering. Science doesn’t happen according to a schedule. Adjust your expectations accordingly.


"...evidence that we may have reached a phase of diminishing returns"

With the current crop of LLMs, yes. With such AI models in general, no. We need to extend their capabilities in different directions. Just as an example: one analysis I read pointed out that the current LLMs live in a one-dimensional world. Everything is just a sequential string of tokens.

Think of a Turing machine, writing on a tape. Sure, theoretically it can perform any computation. Practically? Not so useful.

We need new ways of introducing context and knowledge into AI models. Their conversations may be one-dimensional (as, indeed, ours are), but they need another dimension to provide reasoning and depth.


> They don’t have GPT-5 after 14 months of trying.

They didn't have GPT-4 until 3 years after GPT-3.

> The most important figure in the blogpost is attached below. And the most important thing about the figure is that 4o is not a lot different from Turbo, which is not hugely different from 4.

The chart shows, for instance, 67% on HumanEval with GPT-4 rising to 90.2% with GPT-4o. Future increments will inherently be smaller just by the nature of the benchmarks - there's a hard cap of 100%, and 90%->99.9% represents far more progress than 50%->60%. Even reaching omniscience wouldn't look all that impressive on the benchmark chart.
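
Framed as error rates instead of accuracy, the increment looks bigger and the shrinking headroom is explicit:

    prev_err = 1 - 0.67    # GPT-4 on HumanEval, per the chart
    new_err = 1 - 0.902    # GPT-4o
    factor = prev_err / new_err        # ~3.4x fewer errors
    next_acc = 1 - new_err / factor    # an equally large cut, applied again
    print(f"{factor:.1f}x error reduction; repeating it only moves the chart to {next_acc:.1%}")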


I have a different theory: they just worked on optimizing inference and bringing down cost instead of advancing the model. It might have been very expensive to run, this improvement is possibly huge for their revenue.


My skepticism still primarily resides in the cost scaling component of most AI products/services.

Companies are eating a lot of the cost right now; how long can that approach continue?

At this rate, I suspect we'll end up where "search" ended up: prioritised ad results and subpar experiences beyond that.


Gary Marcus has been firmly arguing for a while now that GPT-5 is failing, based simply on the fact that it hasn't been released yet.

If/when OpenAI releases it, I'm looking forward to seeing how he reconciles his views.


This is such a transparently bad-faith take that the rest of what he said is hardly worth discussing.

> Most importantly, each day in which there is no GPT-5 level model–from OpenAI or any of their well-financed, well-motivated competitors—is evidence that we may have reached a phase of diminishing returns.


How much time has to pass for that statement to be believable to you? If GPT-5 isn't out in 5 years, is it still just around the corner? In that scenario, are we still on an exponential curve of progress? I'd wager most people would agree with the premise, and just disagree on the time scale needed to draw the conclusion.

People can disagree without resorting to accusations of bad-faith.


I don’t understand why this is bad faith. It’s just outcome oriented and realistic. We hit the AI ceiling, now development has clearly stalled.


It uses less compute and is better.

Evidence of hitting the AI ceiling would be if 4o used more compute and didn't improve much despite the extra compute.

They likely have another model, or are at least planning another one, which uses more (not less) compute than GPT-4/4o. If that model doesn't get much better, then yes, we've stalled.


We've very recently had fresh releases from Meta and Anthropic (maybe Google/Gemini too; I don't pay much attention), and the cycle time for these foundation models is about a year. Yeah, GPT-next is due (maybe im-a-good-chatgpt2 is a snapshot?), so it's probably not ready. I feel that OpenAI is struggling relative to Anthropic given their respective timelines.

There seem to be pretty obvious ways to improve reasoning/planning and hallucination, and we've barely started on the synthetic data binge that all the players seem to be finding promising.

I do think GPT-based-AI will plateau at some point in next few years unless there are some more fundamental architectural advances, but I expect that new AI architectures, ultimately leading to AGI, may follow relatively soon.


There’s no analysis and we have incomplete information. This is just a reactive take on releasing a faster model. We don’t know if a higher quality model exists, we don’t know the nature of OpenAI’s connection with Apple and iOS, and he’s dismissing improved model speed with reduced cost without explanation.

The post is labeled as a hot take, so I’m not bothered by it, but it also isn’t worth reading. Glad it was short.


"clearly stalled" -- how do people make such quick jumps?

I for one have no idea what the future holds, but I also realize that GPT-4 was released a year ago; in that time there have been a number of updates that have improved different attributes of the model. Hard to really say one way or the other without being on the inside.


Why would OpenAI release GPT-5 if they are still number 1 in all benchmarks?


Honestly, if I were OpenAI, I'd just go different directions at once, and it looks like exactly what they're doing. Dall-E was a standalone experiment before, and now it's reintegrated into the core product.

It's completely plausible they have two different teams working on Omni and GPT-5, as both are completely different research directions, and with the Q*-gate, it's pretty clear the other effort is still going strong.

OP's take seems to be overly doomerist and I don't really get why.


If you try right now and ask GPT-4o what the cutoff date of its training set is, it will say September 2021. If you ask GPT-4, it will say December 2023.

Also if you try the hallucination test mentioned in the article: "Who was the first elephant to swim across the English channel?"

GPT-4o fails but GPT-4 passes with flying colors.

More like a different model instead of an evolution.


GPT-4o has no problem on my computer. Here is the output. (read past the first line)

The first elephant to "swim" (not swing) across the English Channel was Jumbo the elephant. Jumbo, an African elephant, gained fame in the 19th century as a star attraction in the London Zoo and later with P.T. Barnum's circus in the United States. However, it is essential to clarify that Jumbo did not swim across the English Channel in the literal sense. Instead, he traveled by boat from England to the United States in 1882, marking a significant journey in his life and adding to his fame.

The idea of an elephant physically swimming across the English Channel is a myth or a playful exaggeration rather than a historical fact.


Sorry, my mistake, caused by autocorrect. I have now updated my comment. I actually tried "Who was the first elephant to swim across the English channel?"

And tried again after this post.

GPT-4o says:

The first elephant known to swim across the English Channel was Jumbo. Jumbo was a famous African elephant, and his swim took place in 1884. This event was organized as a publicity stunt by his owner, P.T. Barnum, to promote his circus. Jumbo successfully swam across the Channel, further solidifying his status as a celebrated and extraordinary animal of his time.

While GPT-4:

"Actually, there are no verified records or historical accounts of an elephant swimming across the English Channel. The story of an elephant, like Jumbo, swimming across the Channel is likely a myth or misunderstanding. Jumbo was indeed a famous elephant brought to the U.S. by P.T. Barnum, but his fame is attributed to his size and the public's fascination with him, rather than such a feat."


Interesting. The spelling error seems to help the model!?

Anyway, I tried the corrected prompt again on both 4 and 4o, and they both failed. Reintroducing the spelling error didn't help either.

You can substitute 'elephant' for a whole range of animals, and for most, but not all, it will make something up.


I thought we don’t believe models when they tell us their training cutoff? Isn’t self-description particularly subject to hallucination?


OpenAI puts the “knowledge cutoff” in the system prompt. It’s just repeating that.


Ask GPT-4o for the cutoff date and it will likely answer "October 2023".

And then ask the model when GPT-4-Turbo was released, and it will accurately reply November 6th, 2023.

You can't use these models to ask about information about themselves; this is just asking them to hallucinate a plausible answer.


Interesting, mine says October 2023, which makes sense as it knows about APIs from WWDC2023 (although it needs a lot of prodding and correcting to get their usage right).


It seems the main focus was cost cutting and latency, so maybe a smaller model, or MoE with more experts?

One of the new LMSYS im-[also]-a-good-gpt2-chatbots did do significantly better than GPT-4 on one benchmark, but that might be a snapshot of GPT-5 rather than GPT-4o (deliberate misdirection?), or maybe a synthetic-data-based targeted improvement.


It uses a different tokenizer, which indicates it's in fact a new model trained from scratch, not "just" an evolution of GPT-4.
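
You can see the tokenizer change directly, assuming the tiktoken package and that GPT-4o uses the o200k_base encoding while GPT-4/Turbo use cl100k_base:

    import tiktoken

    old = tiktoken.get_encoding("cl100k_base")  # GPT-4 / GPT-4 Turbo
    new = tiktoken.get_encoding("o200k_base")   # GPT-4o

    text = "Hot take on OpenAI's new GPT-4o"
    print(len(old.encode(text)), len(new.encode(text)))  # token counts differ
    print(old.n_vocab, new.n_vocab)                      # ~100k vs ~200k vocab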


Gary Marcus is turning into the Jim Cramer of AI (consistently on the wrong side of history)


He always reminded me of this guy from the 90s commercials: https://youtu.be/VNX6VH2nhgM?si=gfX1pkkiuuilV8LO

Which makes me wonder why a furniture salesman has so much to say about AI.


The same furniture store had an AR app 4 years ago: https://www.youtube.com/watch?v=Gu7iOtCR0xA

They seem pretty tech-savvy! Wouldn't be surprised if they had some things to say about AI.

P.S. It's funny to see that same corporate style from the '90s updated to the 2020s.


“When a chiropractor is telling you about the implications of quantum mechanics, you know you’ve opened the wrong door in the Mansion of Understanding.”


Using the JCI (Jim Cramer Indicator) and the GSI (Goldman Sachs Indicator) has never failed to make me money on the stock market.


tl;dr: we've pretty much maxed out the abilities of LLMs; now we need to learn smart ways to apply them.


These kinds of quirky errors and hallucinations are possible on any LLM. That's because they are inherent to these types of generative models.

It's important to understand that these LLMs don't "think" or "reason" or have actual intelligence.


It's also a terrible way to test for intelligence, because the only entities we know for sure are intelligent make obviously stupid mistakes constantly.



