
They are admitting[1] that the new model is the gpt2-chatbot that we have seen before[2]. As many highlighted there, the model is not an improvement like GPT3->GPT4. I tested a bunch of programming stuff and it was not that much better.

It's interesting that OpenAI is highlighting the Elo score instead of showing results for the many, many benchmarks on which all models are stuck at 50-70% success.

[1] https://twitter.com/LiamFedus/status/1790064963966370209

[2] https://news.ycombinator.com/item?id=40199715




I think the live demo that happened on the livestream is best to get a feel for this model[0].

I don't really care whether it's stronger than gpt-4-turbo or not. The direct real-time video and audio capabilities are absolutely magical and stunning. The responses in voice mode are now instantaneous, you can interrupt the model, you can talk to it while showing it a video, and it understands (and uses) intonation and emotion.

Really, just watch the live demo. I linked directly to where it starts.

Importantly, this makes the interaction a lot more "human-like".

[0]: https://youtu.be/DQacCB9tDaw?t=557


The demo is impressive but personally, as a commercial user, for my practical use cases, the only thing I care about is how smart it is, how accurate its answers are and how vast its knowledge is. These haven't changed much since GPT-4, yet they should have, as IMHO it is still borderline in its abilities to be really that useful.


But that's not the point of this update


I know, and I know my comment is dismissive of the incredible work shown here, as we're shown sci-fi level tech. But I feel like I have this kettle that boils water in 10 minutes, and it really should boil it in 1, but instead it is now voice operated.

I hope the next version delivers on being smarter, as this update, instead of making me excited, makes me feel they've reached a plateau on the improvement of the core value and are distracting us with fluff instead.


Everything is amazing & Nobody is happy: https://www.youtube.com/watch?v=PdFB7q89_3U


GPT-4 isn't quite "amazing" in terms of commercial use. GPT-4 is often good, and also often mediocre or bad. It's not going to change the world; it needs to get better.


Near real-time voice feedback isn't amazing? Has the bar risen this high?

I already know an application for this, and AFAIK it's being explored in the SaaS space: guided learning experiences and tutoring for individuals.

My kids, for instance, love to hammer Alexa with random questions. They would spend a huge amount of time using a better interface, esp. with quick feedback, that provided even deeper insight and responses to them.

Taking this and tuning it to specific audiences would make it a great tool for learning.


"My kids, for instance, love to hammer Alexa with random questions. They would spend a huge amount of time using a better interface, esp. with quick feedback, that provided even deeper insight and responses to them."

Great, using GPT-4 the kids will be getting a lot of hallucinated facts returned to them. There are good use cases for transformers currently, but they're not at the "impact company earnings or country GDP" stage, which is the promise the whole industry has raised/spent 100+B dollars on. Facebook alone is spending 40B on AI. I believe in the AI future, but the only thing that matters for now is that the models improve.


I always double-check even the most obscure facts returned by GPT-4 and have yet to see a hallucination (as opposed to Claude Opus, which sometimes made up historical facts). I doubt stuff interesting to kids would be so far out of the data distribution as to return a fake answer.

Compared to YouTube and Google SEO trash, or Google Home / Alexa (which do search + wiki retrieval), at the moment GPT-4 and Claude are unironically safer for kids: no algorithmic manipulation, no ads, no affiliated trash blogs, and so on. A bonus is that they can explain at a level of complexity the child will understand for their age.



My kids get erroneous responses from Alexa. This happens all the time. The built-in web search doesn't provide correct answers, or is confusing outright. That's when they come to me or their Mom and we provide a better answer.

I still see this as a cool application. Anything that provides easier access to knowledge and improved learning is a boon.

I'd rather worry about the potential economic impact than worry about possible hallucinations from fun questions like "how big is the sun?" or "what is the best videogame in the world?", etc.

There's a ton you can do here, IMO.

Take a look at mathacademy.com, for instance. Now slap a voice interface on it, provide an ability for kids/participants to ask questions back and forth, etc. Boom: you've got a math tutor that guides you based on your current ability.

What if we could get to the same style of learning for languages? For instance, I'd love to work on Spanish. It'd be far more accessible if I could launch a web browser and chat through my mic in short spurts, rather than crack open Anki and go through flash cards, or wait on a Discord server for others to participate in immersive conversation.

Tons of cool applications here, all learning-focused.


People should be more worried about how much this will be exploited by scammers. This thing is miles ahead of the crap fraudsters use to scam MeeMaw out of her life savings.


It's an impressive demo, it's not (yet) an impressive product.

It seems like the people who are oohing and ahhing at the former and the people who are frustrated that this kind of thing is unbelievably impractical to productize will be doomed to talk past one another forever. The text generation models, image generation models, speech-to-text and text-to-speech have reached impressive product stages. Multi-modal hasn't gotten there because no one is really sure what to actually do with the thing outside of making cool demos.


Multi-modal isn't there because "this is an image of a green plant" is viable in a demo, but it's not commercially viable. "This is an image of a monstera deliciosa" is commercially viable, but not yet demoable. The models need to improve to be usable.


Sure, but "not enough, I want moar" is a trivial demand. So trivial that it goes unsaid.


It's equivalent to "nothing to see here" which is exactly the TLDR I was looking for.


Watch the last few minutes of that linked video; Mira strongly hints that there's another update coming for paid users and seems to make clear that GPT-4o is aimed more at free tier users (even though it is obviously a huge improvement in many features for everyone).


There is room for more than one use case and large language model type.

I predict there will be a zoo (more precisely a tree, as in "family tree") of models and derived models for particular application purposes, and there will be continued development of enhanced "universal"/foundational models as well. Some will focus on minimizing memory, others on minimizing pre-training or fine-tuning energy consumption, some need high accuracy, others hard realtime speed, yet others multimodality like GPT-4o, some multilinguality, and so on.

Previous language models that encoded dictionaries for spellcheckers etc. never got standardized (for instance, compare aspell dictionaries to the ones from LibreOffice to the language model inside CMU PocketSphinx) so that you could use them across applications or operating systems. As these models are becoming more common, it would be interesting to see this aspect improve this time around.

https://www.rev.com/blog/resources/the-5-best-open-source-sp...


I disagree, transfer learning and generalization are hugely powerful and specialized models won't be as good because their limited scope limits their ability to generalize and transfer knowledge from one domain to another.

I think people who emphasize specialized models are operating under a false assumption that by focusing the model it'll be able to go deeper in that domain. However, the opposite seems to be true.

Granted, specialized models like AlphaFold are superior in their domain but I think that'll be less true as models become more capable at general learning.


They say it's twice as fast/cheap, which might matter for your use case.


It's twice as fast/cheap relative to GPT-4-turbo, which is still expensive compared to GPT-3.5-turbo and Claude Haiku.

https://openai.com/api/pricing/


For commercial use at scale, of course cost matters.

For the average Joe programmer like me, GPT4 is already "dirt cheap". My typical monthly bill is $0-3 using it as much as I like.

The one time it was high was when I had it take 90+ hours of Youtube video transcripts, and had it summarize each video according to the format I wanted. It produced about 250 pages of output.

That month I paid $12-13. Well worth it, given the quality of the output. And now it'll be less than $7.

For the average Joe, it's not expensive. Fast food is.


but better afaik


But it may not be better enough to warrant the cost difference. LLM cost economics are complicated.


I’d much rather have it be slower, more expensive, but smarter


Depends what you want it for. I'm still holding out for a decent enough open model, Llama 3 is tantalisingly close, but inference speed and cost are serious bottlenecks for any corpus-based use case.


I think that might come with the next GPT version.

OpenAI seems to build in cycles. First they focus on capabilities, then they work on driving the price down (occasionally at some quality degradation)


Then the current offering should suffice, right?


I understand your point, and agree that it is "borderline" in its abilities — though I would instead phrase it as "it feels like a junior developer or an industrial placement student, and assume it is of a similar level in all other skills", as this makes it clearer when it is or isn't a good choice, and it also manages expectations away from both extremes I frequently encounter (that it's either Cmdr Data already, or that it's a no good terrible thing only promoted by the people who were previously selling Bitcoin as a solution to all of economics).

That said, given the price tag, when AI becomes genuinely expert then I'm probably not going to have a job and neither will anyone else (modulo how much electrical power those humanoid robots need, as the global electricity supply is currently only 250 W/capita).

In the meantime, making it a properly real-time conversational partner… wow. Also, that's kinda what you need for real-time translation, because: «be this, that different languages the word order totally alter and important words at entirely different places in the sentence put», and real-time "translation" (even when done by a human) therefore requires having a good idea what the speaker was going to say before they get there, and being able to back-track when (as is inevitable) the anticipated topic was actually something completely different and so the "translation" wasn't.


I guess I feel like I’ll get to keep my job a while longer and this is strangely disappointing…

A real time translator would be a killer app indeed, and it seems not so far away, but note how you have to prompt the interaction with ‘Hey ChatGPT’; it does not interject on its own. It is also unclear if it is able to understand if multiple people are speaking and who’s who. I guess we’ll see soon enough :)


> It is also unclear if it is able to understand if multiple people are speaking and who’s who. I guess we’ll see soon enough :)

Indeed; I would be pleasantly surprised if it can both notice and separate multiple speakers, but only a bit surprised.


One thing I've noticed is that the more context, and the more precise the context, I give it, the "smarter" it is. There are limits to it of course. But I cannot help but think that's where the next barrier will be brought down: an agent (or several) that tags along with everything I do throughout the day to have the full context. That way, I'll get smarter and more to-the-point help, as well as not spend much time explaining the context. But that will open a dark can that I'm not sure people will want to open: having an AI track everything you do all the time (even if only in certain contexts like business hours / env).


There are definitely multiple dimensions these things are getting better in. The popular focus has been on the big expensive training runs, but inference, context size, algorithms, etc. are all getting better fast.


I have a few LLM benchmarks that were extracted from real products.

GPT-4o got slightly better overall. Ability to reason improved more than the rest.


It's faster, smarter and cheaper over the API. Better than a kick in the teeth.


Absolutely agree.

This model isn't about benchmark chasing or being a better code generator; it's explicitly focused on pushing prior results into the frame of multi-modal interaction.

It's still a WIP, most of the videos show awkwardness where its capacity to understand the "flow" of human speech is still vestigial. It doesn't understand how humans pause and give one another space for such pauses yet.

But it has some indeed magic ability to share a deictic frame of reference.

I have been waiting for this specific advance, because it is going to significantly quiet the "stochastic parrot" line of wilfully-myopic criticism.

It is very hard to make blustery claims about "glorified Markov token generation" when using language in a way that requires both a shared world model and an understanding of interlocutor intent, focus, etc.

This is edging closer to the moment when it becomes very hard to argue that the system does not have some form of self-model and a world model within which self, other, and other objects and environments exist with inferred and explicit relationships.

This is just the beginning. It will be very interesting to see how strong its current abilities are in this domain; it's one thing to have object classification—another thing entirely to infer "scripts, plans, goals..." and things like intent and deixis. E.g. how well does it now understand "us" and "them" and "this" vs "that"?

Exciting times. Scary times. Yee hawwwww.


What part of this makes you think GPT-4 suddenly developed a world model? I find this comment ridiculous and bizarre. Do you seriously think snappy response time + fake emotions is an indicator of intelligence? It seems like you are just getting excited and throwing out a bunch of words without even pretending to explain yourself:

> using language in a way that requires both a shared world model

Where? What example of GPT-4o requires a shared world model? The customer support example?

The reason GPT-4 does not have any meaningful world model (in the sense that rats have meaningful world models) is that it freely believes contradictory facts without being confused, freely confabulates without having brain damage, and it has no real understanding of quantity or causality. Nothing in GPT-4o fixes that, and gpt2-chatbot certainly had the same problems with hallucinations and failing the same pigeon-level math problems that all other GPTs fail.


One of the most interesting things about the advent of LLMs is people bringing out all sorts of "reasons" GPT doesn't have true 'insert property' but all those reasons freely occur in humans as well

>that it freely believes contradictory facts without being confused,

Humans do this. You do this. I guess you don't have a meaningful world model.

>freely confabulates without having brain damage

Humans do this

>and it has no real understanding of quantity or causality.

Well this one is just wrong.


So many even here on HN have a near-religious belief that intelligence is unique to humans and animals, and somehow a fundamental phenomenon that cannot ever be created using other materials.


It reminds me of the geocentric mindset.


It's a defensive response to an emerging threat to stability and current social tiers.


>>and it has no real understanding of quantity or causality.

>Well this one is just wrong.

Is it?

--

Me: how many characters are in: https://google.com

ChatGPT: The URL "https://google.com" has 12 characters, including the letters, dots, and slashes.

--

What is it counting there? 12 is wrong no matter how you dice that up.

Part of the reason is it has no concept of the actual string. That URL breaks into four different tokens in 3.5 and 4: "http", "://", "google" and ".com".

It's not able to figure out the total length, or even the length of its parts and add them together.
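You can see the split yourself; a minimal sketch using the tiktoken library (the exact token boundaries depend on the encoding, so treat the printed chunks as illustrative):

  import tiktoken

  # cl100k_base is the encoding used by the GPT-3.5/GPT-4 family
  enc = tiktoken.get_encoding("cl100k_base")
  tokens = enc.encode("https://google.com")
  # The model "sees" these chunks, not individual characters
  print([enc.decode([t]) for t in tokens])
  print(len("https://google.com"))  # 18 characters, counted outside the model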

I ask it to double check, and it tells me 13 and then 14. I tell it the answer and suddenly it's able...

---

Me: I think its 18

ChatGPT: Let's recount together:

"https://" has 8 characters. "google" has 6 characters. ".com" has 4 characters. Adding these up gives a total of 8 + 6 + 4 = 18 characters. You're correct! My apologies for the oversight earlier.

---

Count me out.


LLMs process text, but only after it was converted to a stream of tokens. As a result, LLMs are not very good at answering questions about letters in the text. That information was lost during the tokenization.

Humans process photons, but only after converting them into nerve impulses via photoreceptor cells in the retina, which are sensitive to wavelengths ranges described as "red", "green" or "blue".

As a result, humans are not very good at distinguishing different spectra that happen to result in the same nerve impulses. That information was lost by the conversion from photons to nerve impulses. Sensors like the AS7341 that have more than 3 color channels are much better at this task.


Yet I can learn there is a distinction between different spectra that happen to result in the same nerve impulses. I know if I have a certain impulse, that I can't rely on it being a certain photon. I know to use tools, like the AS7341, to augment my answer. I know to answer "I don't know" to those types of questions.

I am a strong proponent of LLM's, but I just don't agree with the personification and trust we put into its responses.

Everyone in this thread is defending that ChatGPT can't count for _reasons_ and how it's okay, but... how can we trust this? Is this the sane world we live in?

"The AGI can't count letters in a sentence, but any day not he singularity will happen, the AI will escape and take over the world."

I do like to use it for opinion related questions. I have a specific taste in movies and TV shows and by just listing what I like and going back and forth about my reasons for liking or not liking its suggestions, I've been able to find a lot of gems I would have never heard of before.


That URL breaks into four different tokens in 3.5 and 4: "http", "://", "google" and ".com".

Except that "http" should be "https". Silly humans, claiming to be intelligent when they can't even tokenize strings correctly.


A wee typo.


How much of your own sense of quantity is visual, do you think? How much of your ability to count the lengths of words depends on your ability to sound them out and spell?

I suspect we might find that adding in the multimodal visual and audio aspects to the model gives these models a much better basis for mental arithmetic and counting.


I'd counter by pasting a picture of an emoji here, but HN doesn't allow that, as a means to show the confusion that can be caused by characters versus symbols.

Most LLMs can just pass the string to a tool to count it, to bypass their built-in limitations.
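For example, with the OpenAI chat API's function calling you can hand the counting off to ordinary code. A rough sketch (count_chars is a made-up helper, and the model is not guaranteed to call the tool on every run):

  import json
  from openai import OpenAI

  client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

  def count_chars(text: str) -> int:
      # Exact counting happens here, outside the model
      return len(text)

  tools = [{
      "type": "function",
      "function": {
          "name": "count_chars",
          "description": "Count the characters in a string exactly",
          "parameters": {
              "type": "object",
              "properties": {"text": {"type": "string"}},
              "required": ["text"],
          },
      },
  }]

  resp = client.chat.completions.create(
      model="gpt-4o",
      messages=[{"role": "user",
                 "content": "How many characters are in https://google.com?"}],
      tools=tools,
  )

  call = resp.choices[0].message.tool_calls[0]
  args = json.loads(call.function.arguments)
  print(count_chars(args["text"]))  # result would be sent back as a "tool" message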


It seems you're already aware LLMs receive tokens not words.

Does a blind man not understand quantity because you asked him how many apples are in front of him and he failed ?


I do, but I think it shows its limitations.

I don't think that test determines his understanding of quantity at all, he has other senses like touch to determine the correct answer. He doesn't make up a number and then give justification.

GPT was presented with everything it needed to answer the question.


Nobody said GPT was perfect. Everything has limitations.

>he has other senses like touch to determine the correct answer

And? In my hypothetical, you're not allowing him to use touch.

>I don't think that test determines his understanding of quantity at all

Obviously

>GPT was presented with everything it needed to answer the question.

No, it was not.


How was it not? It's a text interface. It was given text.

The blind example now is like asking GPT "What am I pointing at?"


Please try to actually understand what og_kalu is saying instead of being obtuse about something any grade-schooler intuitively grasps.

Imagine a legally blind person, they can barely see anything; just general shapes flowing into one another. In front of them is a table onto which you place a number of objects. The objects are close together and small enough such that they merge into one blurred shape for our test person.

Now when you ask the person how many objects are on the table, they won't be able to tell you! But why would that be? After all, all the information is available to them! The photons emitted from the objects hit the retina of the person, the person has a visual interface and they were given all the visual information they need!

Information lies within differentiation, and if the granularity you require is higher than the granularity of your interface, then it won't matter whether or not the information is technically present; you won't be able to access it.


I think we agree. ChatGPT can't count, as the granularity that requires is higher than the granularity ChatGPT provides.

Also the blind person wouldn't confidently answer. A simple "the objects blur together" would be a good answer. I had ChatGPT telling me 5 different answers back to back above.


No, think about it. The granularity of the interface (the tokenizer) is the problem, the actual model could count just fine.

If the legally blind person never had had good vision or corrective instruments, had never been told that their vision is compromised and had no other avenue (like touch) to disambiguate and learn, then they would tell you the same thing ChatGPT told you. "The objects blur together" implies that there is already an understanding of the objects being separate present.

You can even see this in yourself. If you did not get an education in physics and were asked to describe how many things a steel cube is made up of, you wouldn't answer that you can't tell. You would just say one, because you don't even know that atoms are a thing.


I agree, but I don't think that changes anything, right?

ChatGPT can't count, the problem is the tokenizer.

I do find it funny we're trying to chat with an AI that is "equivalent to a legally blind person with no correction"

> You would just say one, because you don't even know that atoms are a thing.

My point also. I wouldn't start guessing "10" and then "11" and then "12" when asked to double check, only to capitulate when told the correct answer.


You consistently refuse to take the necessary reasoning steps yourself. If your next reply also requires me to lead you every single millimeter to the conclusion you should have reached on your own, then I won't reply again.

First of all, it obviously changes everything. A shortsighted person requires prescription glasses, someone that is fundamentally unable to count is incurable from our perspective. LLMs could do all of these things if we either solve tokenization or simply adapt the tokenizer to relevant tasks. This is already being done for program code, it's just that aside from gotcha arguments, nobody really cares about letter counting that much.

Secondly, the analogy was meant to convey that the intelligence of a system is not at all related to the problems at its interface. No one would say that legally blind people are less insightful or intelligent, they just require you to transform input into representations accounting for their interface problems.

Thirdly, as I thought was obvious, the tokenizer is not a uniform blur. For example, a word like "count" could be tokenized as "c|ount" or " coun|t" (note the space) or ". count" depending on the surrounding context. Each of these versions will have tokens of different lengths, and associated different letter counts. If you've been told that the cube had 10, 11 or 12 trillion constituent parts by various people depending on the random circumstances you've talked to them in, then you would absolutely start guessing through the common answers you've been given.


I do agree I've been obtuse, apologies. I think I was just being too literal or something, as I do agree with you.

Apologies from me as well. I've been unnecessarily aggressive in my comments. Seeing very uninformed but smug takes on AI here over the last year has made me very wary of interactions like this, but you've been very calm in your replies and I should have been so as well.

Its first answer of 12 is correct, there are 12 _unique_ characters in https://google.com.


The unique characters are:

h t p s : / g o l e . c m

There are 13 unique characters.
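Easy enough to check outside the model with plain Python:

  url = "https://google.com"
  print(len(url))       # 18 characters in total
  print(len(set(url)))  # 13 unique characters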


OK neither GPT-4o nor myself is great at counting apparently


I agree. The interesting lesson I take from the seemingly strong capabilities of LLMs is not how smart they are but how dumb we are. I don't think LLMs are anywhere near as smart as humans yet, but it feels each new advance is bringing the finish line closer rather than the other way round.


Moravec's paradox states that, for AI, the hard stuff is easiest and the easy stuff is hardest. But there's no easy or hard; there's only what the network was trained to do.

The stuff that comes easy to us, like navigating 3D space, was trained by billions of years of evolution. The hard stuff, like language and calculus, is new stuff we've only recently become capable of, seemingly by evolutionary accident, and aren't very naturally good at. We need rigorous academic training at it that's rarely very successful (there's only so many people with the random brain creases to be a von Neumann or Einstein), so we're impressed by it.


If someone found a way to put an actual human brain into SW, but no one knew it was a real human brain -- I'm certain most of HN would claim it wasn't AGI. "Kind of sucks at math", "Knows weird facts about Tik Tok celebrities, but nothing about world events", "Makes lots of grammar mistakes", "scores poorly on most standardized tests, except for one area that he seems to do well in", and "not very creative".


What is a human brain without the rest of its body? Humans aren't brains. Our nervous systems aren't just the brain either.


It's meant to explore a point. Unless your point is that AGI can only exist with a human body too.


It's an open question as to whether AGI needs a (robot) body. It's also a big question whether the human brain can function in a meaningful capacity kept alive without a body.


I don't think making the same mistakes as a human counts as a feature. I see that a lot when people point out a flaw with an LLM; the response is always "well a human would make the same mistake!". That's not much of an excuse: computers exist because they do the things humans can't do very well, like following long repetitive lists of instructions. Further, upthread, there's discussion about adding emotions to an LLM. An emotional computer that makes mistakes sometimes is pretty worthless as a "computer".


It's not about counting as a feature. It's the blatant logical fallacy. If a trait isn't a reason humans don't have a certain property then it's not a reason for machines either. Can't eat your cake and have it.

>That's not much of an excuse, computers exist because they do the things humans can't do very well like following long repetitive lists of instructions.

Computers exist because they are useful, nothing more and nothing less. If they were useful in a completely different way, they would still exist and be used.


It's objectively true that LLMs do not have bodies. To the extent general intelligence relies on being embodied (allowing you to manipulate the world and learn from that), it's a legitimate thing to point out.


>But it has some indeed magic ability to share a deictic frame of reference.

They really Put That There!

https://www.youtube.com/watch?v=RyBEUyEtxQo

Oh, shit.


In my view, this was in response to the machine being colourblind haha


I expect the really solid use case here will be voice interfaces to applications that don't suck. Something I am still surprised at is that vendors like Apple have yet to allow me to train the voice to text model so that it only responds to me and not someone else.

So local modelling (completely offline but per speaker aware and responsive), with a really flexible application API. Sort of the GTK or QT equivalent for voice interactions. Also custom naming, so instead of "Hey Siri" or "Hey Google" I could say, "Hey idiot" :-)

Definitely some interesting tech here.


I assume (because they don't address it or look at all fazed) the audio cutting in and out is just an artefact of the stream?


Haven’t tried it but from work I’ve done on voice interaction this happens a lot when you have a big audience making noise. The interruption feature will likely have difficulty in noisy environments.


Yeah, that was actually my first thought (though no professional experience with it/on that side) - it's just that the commenter I replied to was so hyped about it and how fluid & natural it was, and I thought that made it really jar.


Interesting that they decided to keep the horrible ChatGPT tone ("wow you're doing a live demo right now?!"). It comes across just so much worse in voice. I don't need my "AI" speaking to me like I'm a toddler.


It is cringe-level overenthusiastic, but proper instructions / a system prompt will mostly fix that.


You can tell it not to talk like this using custom prompts.


One of the linked demos is it being sarcastic, so maybe you can make it remember to be a little more edgy.


tell it to speak to you differently

with a GPT you can modify the system prompt
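Via the API the equivalent is a system message; a minimal sketch (the prompt wording is just an example, and results vary):

  from openai import OpenAI

  client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

  resp = client.chat.completions.create(
      model="gpt-4o",
      messages=[
          {"role": "system",
           "content": "Answer plainly and concisely. No exclamation points, no cheerleading."},
          {"role": "user", "content": "Summarize what a system prompt does."},
      ],
  )
  print(resp.choices[0].message.content)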


It still refuses to go outside the deeply sanitised tone that "alignment" enforces on you.


It should be possible to imitate any voice you want, like your actual parents', soon enough.


That won't be Black Mirror levels of creepy /s


Did you miss the part where they simply asked it to change its manner of speaking and the amount of emotion it used?


Call me overly paranoid/skeptical, but I'm not convinced that this isn't a human reading (and embellishing) a script. The "AI" responses in the script may well have actually been generated by their LLM, providing a defense against it being fully fake, but I'm just not buying some of these "AI" voices.

We'll have to see when end users actually get access to the voice features "in the coming weeks".


It's weird that the "airplane mode" seems to be ON on the phone during the entire presentation.


This was on purpose - it appears they connected it to the internet via a USB-C cable, for a consistent connection instead of having it switch to WiFi.

Probably some kinks there they are working out


> Probably some kinks there they are working out

Or just a good idea for a live demo on a congested network/environment with a lot of media present, at least one live video stream (the one we're watching the recording of), etc.

At least that's how I understood it, not that they had a problem with it (consistently or under regular conditions, or specific to their app).


That's very common practice for live demos. To avoid situations like this:

https://www.youtube.com/watch?v=6lqfRx61BUg


And eliminate the chance of some prankster affecting the demo by attacking the WiFi.


They mention at the beginning of the video that they are using hardwired internet for reliability reasons.


You would want to make sure that it is always going over WiFi for the demo and doesn't start using the cellular network for a random reason.


You can turn off mobile data. They probably just wanted wired internet.


This is going straight into 'Her' territory


Hectic!

Thanks for this.


"not that much better" is extremely impressive, because it's a much smaller and much faster model. Don't worry, GPT-5 is coming and it will be better.


Chalmers: "GPT-5? A vastly-improved model that somehow reduces the compute overhead while providing better answers with the same hardware architecture? At this time of year? In this kind of market?"

Skinner: "Yes."

Chalmers: "May I see it?"

Skinner: "No."


It has only been a little over one year since GPT-4 was announced, and it was at the time the largest and most expensive model ever trained. It might still be.

Perhaps it's worth taking a beat and looking at the incredible progress in that year, and acknowledge that whatever's next is probably "still cooking".

Even Meta is still baking their 400B parameter model.


As Altman said (paraphrasing): GPT-4 is the _worst_ model you will ever have to deal with in your life (or something to that effect).


I found this statement by Sam quite amusing. It transmits exactly zero information (it's a given that models will improve over time), yet it sounds profound and ambitious.


I got the same vibe from him on the All In podcast. For every question, he would answer with a vaguely profound statement, talking in circles without really saying anything. On multiple occasions he would answer like 'In some ways yes, in some ways no...' and then just change the subject.


Yep. I'm not quite sure what he's up to. He takes all these interviews and basically says nothing. What's his objective?

My guess is he wants OpenAI to become a household name, and so he optimizes for exposure.


and boy did the stockholders like that one.


What stockholders. They’re investors at this point. I wish I could get in on it.


They're rollercoaster riders, being told lustrous stories by gold-panners while the shovel salesman counts his money and leaves.


There are no shovels or shovel sellers. It’s heavily accredited investors with millions of dollars buying in. It’s way above our pay grade, our pleb sayings don’t apply.


I think you could pretty easily call Nvidia a shovel-seller in this context.


You’re right.


Why should I believe anything he says?


I will believe it when I see it. People like to point at the first part of a logistic curve and go "behold! an exponential".


Ah yes my favorite was the early covid numbers, some of the "smartest" people in the SF techie scene were daily on Facebook thought-leadering about how 40% of people were about to die in the likely case.


Let's be honest, everyone was speculating. Nobody knew what the future would bring, not even you.


The difference is some people were talking a whole lot confidently, and some weren’t.


Legit love progress


GPT-3 was released in 2020 and GPT-4 in 2023. Now we all expect 5 sooner than that but you're acting like we've been waiting years lol.


The increased expectations are a direct result of LLM proponents continually hyping exponential capabilities increase.


So if not exponential, what would you call adding voice and image recognition, function calling, greatly increased token generation speed, reduced cost, massive context window increases and then shortly after combining all of that in a truly multi modal model that is even faster and cheaper while adding emotional range and singing in… checks notes …14 months?! Not to mention creating and improving an API, mobile apps, a marketplace and now a desktop app. OpenAI ships and they are doing so in a way that makes a lot of business sense (continue to deliver while reducing cost). Even if they didn’t have another flagship model in their back pocket I’d be happy with this rate of improvement but they are obviously about to launch another one given the teasers Mira keeps dropping.


All of that is awesome, and makes for a better product. But it’s also primarily an engineering effort. What matters here is an increase in intelligence. And we’re not seeing that aside from very minor capability increases.

We’ll see if they have another flagship model ready to launch. I seriously doubt it. I suspect that this was supposed to be called GPT-5, or at the very least GPT-4.5, but they can’t meet expectations so they can’t use those names.


Isn't one of the reasons for the Omni model that text-based learning has a limit of source material? If it's just as good at audio, that opens a whole other set of data - and an interesting UX for users.


I believe you’re right. You can easily transcribe audio but the quality of the text data is subpar to say the least. People are very messy when they speak and rely on the interlocutor to fill in the gaps. Training a model to understand all of the nuances of spoken dialogue opens that source of data up. What they demoed today is a model that to some degree understands tone, emotion and surprisingly a bit of humour. It’s hard to get much of that in text so it makes sense that audio is the key to it. Visual understanding of video is also promising especially for cause and effect and subsequently reasoning.


The time for the research, training, testing and deploying of a new model at frontier scales doesn't change depending on how hyped the technology is. I just think the comment i was replying to lacks perspective.


Pay attention to the signal, ignore the noise.


People who buy into hype deserve to be disappointed. Or burned, as the case may be.


Incidentally, this dialogue works equally well, if not better, with David Chalmers versus B.F. Skinner, as with the Simpsons characters.


Agnes (voice): "SEYMOUR, THE HOUSE IS ON FIRE!"

Skinner (looking up): No, mother, it's just the Nvidia GPUs.


"Seymour, the house is on fire!"

"No, mother, that's just the H100s."


Obviously given enough time there will always be better models coming.

But I am not convinced it will be another GPT-4 moment. It seems like a big focus on tacking together multi-modal clever tricks vs. straight-up better intelligence.

Hope they prove me wrong!


The problem with "better intelligence" is that OpenAI is running out of human training data to pillage. Training AI on the output of AI smooths over the data distribution, so all the AIs wind up producing same-y output. So OpenAI stopped scraping text back in 2021 or so - because that's when the open web turned into an ocean of AI piss. I've heard rumors that they've started harvesting closed captions out of YouTube videos to try and make up the shortfall of data, but that seems like a way to stave off the inevitable[0].

Multimodal is another way to stave off the inevitable, because these AI companies already are training multiple models on different piles of information. If you have to train a text model and an image model, why split your training data in half when you could train a combined model on a combined dataset?

[0] For starters, most YouTube videos aren't manually captioned, so you're feeding GPT the output of Google's autocaptioning model, so it's going to start learning artifacts of what that model can't process.


>harvesting closed captions out of YouTube videos

I'd bet a lot of YouTubers are using LLMs to write and/or edit content. So we pass that through a human presentation, then introduce some errors in the form of transcription, then feed the output in as part of a training corpus ... we plateaued real quick.

It seems like it's hard to get past the level of human intelligence for which there's a large enough corpus of training data or trainers?

Anyone know of any papers on breaking this limit to push machine learning models to super-human intelligence levels?


If a model is average human intelligence in pretty much everything, is that super-human or not? Simply put, we as individuals aren't average at everything; we have what we're good at and a great many things we're not. We average out by looking at broad population trends. That's why most of us in the modern age spend a lot of time on specialization for whatever we work in. Which brings us to the likely next place for data: a Manna (the story) like data collection program where companies hoover up everything they can on their above-average employees, till we're at the point where most models are well above the human average in most categories.


>[0] For starters, most YouTube videos aren't manually captioned, so you're feeding GPT the output of Google's autocaptioning model, so it's going to start learning artifacts of what that model can't process.

Whisper models are better than anything Google has. In fact the higher quality Whisper models are better than humans when it comes to transcribing text with punctuation.


Why do you think they’re using Google auto-captioning?

I would expect they're using their own speech-to-text, which is still a model but way better quality and potentially customizable to better suit their needs.


At some point, algorithms for reasoning and long-term planning will be figured out. Data won’t be the holy grail forever, and neither will asymptotically approaching human performance in all domains.


I don't think a bigger model would make sense for OpenAI: it's much more important for them to keep driving inference cost down, because there's no viable business model if they don't.

Improving the instruction tuning, the RLHF step, increasing the training set size, working on multilingual capabilities, etc. make sense as ways to improve quality, but I think increasing model size doesn't. Being able to advertise a big breakthrough may make sense in terms of marketing, but I don't believe it's going to happen, for two reasons:

- you don't release intermediate steps when you want to be able to advertise big gains, because it raises the baseline and reduces the effectiveness of your "big gains" in terms of marketing.

- I don't think they would benefit from an arms race with Meta, trying to keep a significant edge. Meta is likely to be able to catch up eventually on performance, but they are not so much of a threat in terms of business. Focusing on keeping a performance edge instead of making their business viable would be a strategic blunder.


What is OpenAI's business model if their models are second-best? Why would people pay them and not Meta/Google/Microsoft - who can afford to sell at very low margins, since they have existing very profitable businesses that keep them afloat?


That's the question OpenAI needs to find an answer to if they want to end up viable.

They have the brand recognition (for ChatGPT) and that's a good start, but that's not enough. Providing a best in class user experience (which seems to be their focus now, with multimodality), a way to lock down their customers in some kind of walled garden, building some kind of network effect (what they tried with their marketplace for community-built “GPTs” last fall but I'm not sure it's working), something else?

At the end of the day they have no technological moat, so they'll need to build a business one, or perish.

For most tasks, pretty much every models from their competitors is more than good enough already, and it's only going to get worse as everyone improves. Being marginally better on 2% of tasks isn't going to be enough.


I know it is super crazy, but maybe they could become a non-profit and dedicate themselves to producing open source AI in an effort to democratize it and make it safe (as in, not walled behind a giant for-profit corp that will inevitably enshittify it).

I don't know why they didn't think about doing that earlier, could have been a game changer, but there is still an opportunity to pivot.


And how can one be so sure of that?

Seems to me that performance is converging and we might not see a significant jump until we have another breakthrough.


> Seems to me that performance is converging

It doesn't seem that way to me. But even if it did, video generation also seemed kind of stagnant before Sora.

In general, I think The Bitter Lesson is the biggest factor at play here, and compute power is not stagnating.


Compute power is not stagnating, but the availability of training data is. It's not like there's a second Stack Overflow or Reddit to scrape.


No: soon the wide wild world itself becomes training data. And for much more than just an LLM. LLM plus reinforcement learning—this is where the capacity of our in silico children will engender much parental anxiety.


This may create a market for surveillance camera data and phone calls.

"This conversation may be recorded and used for training purposes" now takes on a new meaning.

Can car makers sell info from everything that happens in their cars?


Well, this is a massively horrifying possibility.


Agree.

However, I think the most cost-effective way to train for real world is to train in a simulated physical world first. I would assume that Boston Dynamics does exactly that, and I would expect integrated vision-action-language models to first be trained that way too.


That's how everyone in robotics is doing it these days.

You take a bunch of mo-cap data and simulate it with your robot body. Then as much testing as you can with the robot and feed the behavior back in to the model for fine tuning.

Unitree gives an example of the simulation versus what the robot can do in their latest video

https://www.youtube.com/watch?v=GzX1qOIO1bE


I don't think training data is the limiting factor for current models.


It is a limiting factor, due to diminishing returns. A model trained on double the data will be 10% better, if that!

When it comes to multi-modality, training data is not limited, because of the many different combinations of language, images, video, sound etc. Microsoft did some research on that, teaching spatial recognition to an LLM using synthetic images, with good results. [1]

When someone states that there is not enough training data, they usually mean code, mathematics, physics, logical reasoning etc. On the open internet right now, there is not enough code to make a model 10x better, 100x better and so on.

Synthetic data will be produced of course, scarcity of data is the least worrying scarcity of all.

Edit: citation added,

[1] VoT by MS https://medium.com/@multiplatform.ai/microsoft-researchers-p...


> A model trained on double the data, will be 10% better, if that!

If the other attributes of the model do not improve, sure.


Soon these models will be cheap enough to learn in the real world. Reduced costs allow for usage at massive scale.

Releasing models where users can record video means more data. Users conversing with AI is also additional data.

Another example is models that code, and then debug the code and learn from that.

This will be anywhere, and these models will learn from anything we do/publish online/discuss. Scary.

Pretty soon– OpenAI will have access to


It isn’t clear that we are running out of training data, and it is becoming increasingly clear that AI-generated training data actually works.

For the skeptical, consider that humans can be trained on material created by less intelligent humans.


> humans can be trained on material created by less intelligent humans.

For the skeptics, "AI models" are not intelligent at all so this analogy makes no sense.

You can teach lots of impressive tricks to dogs, but there is no amount of training that will teach them basic algebra.


> video generation also seemed kind of stagnant before Sora

I take the opposite view. I don't think video generation was stagnating at all, and was in fact probably the area of generative AI that was seeing the biggest active strides. I'm highly optimistic about the future trajectory of image and video models.

By contrast, text generation has not improved significantly, in my opinion, for more than a year now, and even the improvement we saw back then was relatively marginal compared to GPT-3.5 (that is, for most day-to-day use cases we didn't really go from "this model can't do this task" to "this model can now do this task". It was more just "this model does these pre-existing tasks, in somewhat more detail".)

If OpenAI really is secretly cooking up some huge reasoning improvements for their text models, I'll eat my hat. But for now I'm skeptical.


> By contrast, text generation has not improved significantly, in my opinion, for more than a year now

With less than $800 worth of hardware including everything but the monitor, you can run an open weight model more powerful than GPT 3.5 locally, at around 6 - 7T/s[0]. I would say that is a huge improvement.

[0] https://www.reddit.com/r/LocalLLaMA/comments/1cmmob0/p40_bui...


Yeah. There are lots of things we can do with existing capabilities, but in terms of progressing beyond them all of the frontier models seem like they're a hair's breadth from each other. That is not what one would predict if LLMs had a much higher ceiling than we are currently at.

I'll reserve judgment until we see GPT5, but if it becomes just a matter of who best can monetize existing capabilities, OAI isn't the best positioned.


Exactly. People like to point at the start of a logistic curve and go "behold! an exponential"


The use of AI in the research of AI accelerates everything.


I'm not sure of this. The jury is still out on most ai tools. Even if it is true, it may be in a kind of strange reverse way: people innovating by asking what ai can't do and directing their attention there.


There is an increasing amount of evidence that using AI to train other AI is a viable path forward. E.g. using LLMs to generate training data or tune RL policies


I bet this will also cause model regressions.


I really hope GPT5 is good. GPT4 sucks at programming.


It's excellent at programming if you actually know the problem you're trying to solve and the technology. You need to guide it with actual knowledge you have. Also, you have to adapt your communication style to get good results. Once you 'crack the pattern' you'll have a massive productivity boost


In my experience 3.5 was better at programming than 4, and I don't know why.


It's better than at least 50% of the developers I know.


A developer that just pastes in code from gpt-4 without checking what it wrote is a horror scenario, I don't think half of the developers you know are really that bad.


What kind of people are you working with?


It's not better than any of the developers I work with.

Trying to talk it into writing anything other than toy code is an exercise in banging my head against the wall.


Look to a specialized model instead of a general purpose one


Any suggestions? Thanks

I have tried Phind and anything beyond mega junior tier questions it suffers as well and gives bad answers.


You have to think of the LLMs as more of a better search engine than something that can actually write code for you. I use phind for writing obscure regexes, or shell syntax, but I always verify the answer. I've been very pleased with the results. I think anyone disappointed with it is setting the bar too high and won't be fully satisfied until LLMs can effectively replace a Sr dev (which, let's be real, is only going to happen once we reach AGI)


Yea, I use them daily and that’s my issue as well. You have to learn what to ask or you spend more time debugging their junk than being productive, at least for me. Devv.ai is my recent try, and so far it’s been good but library changes quickly cause it to lose accuracy. It is not able to understand what library version you’re on and what it is referencing, which wastes a lot of time.

I like LLMs for general design work, but I’ve found accuracy to be atrocious in this area.


> library changes quickly cause it to lose accuracy

Yup, this is why an LLM-only solution will not work. You need to provide extra context crafted from the language or library resources (docs, code, help, chat).

This is the same thing humans do. We go to the project resources to help know what code to write
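A minimal sketch of that kind of context stuffing (the docs file and prompt wording are hypothetical; a real setup would retrieve only the relevant excerpt rather than paste in a whole file):

  from openai import OpenAI

  client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

  # Hypothetical: notes for the exact library version the project uses
  docs_excerpt = open("migration_notes_v0.13.txt").read()

  resp = client.chat.completions.create(
      model="gpt-4o",
      messages=[
          {"role": "system",
           "content": "Answer using only the provided documentation excerpt."},
          {"role": "user",
           "content": f"Docs:\n{docs_excerpt}\n\nHow do I spawn a sprite in this version?"},
      ],
  )
  print(resp.choices[0].message.content)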


Fwiw that's what Devv.ai claims to do (in my summation from the Devv.ai announcement, at least). Regardless of how true the claims of Devv.ai are, their library versioning support seems very poor. At least for the one library I tested it on (Rust's Bevy).


kapa.ai is another SaaS focused on per-project LLMs

As a developer, you would want something like this, which has access to all the languages / libraries you actually use


It will be a system, not a single model, and will depend on what programming task you want to perform

probably need routers, RAG, and reranking

I think there is a role for LLM + deterministic code gen as well (https://github.com/hofstadter-io/hof/blob/_dev/flow/chat/pro...)


Interesting. I was hoping for something with a UI like chat gpt or phind.

Something that I can just use as easily as copilot. Unfortunately every single one sucks.

Or maybe that's just how programming is - it's easy at the surface/iceberg level and below is just massive amounts of complexity. Then again, I'm not doing menial stuff so maybe I'm just expecting too much.


I think a more IDE native experience is better than a chat UI

I don't want to have to copy & paste between applications, just let me highlight some sections and then run some LLM operation on it

i.e. a VS Code extension with keyboard shortcuts


I think this comment is easily misread as implying that this GPT4o model is based on some old GPT2 chatbot - that’s very much not what you meant to say, though.

This model has been tested under the code name 'gpt2-chatbot', but it is very much a new GPT-4+ level model, with new multimodal capabilities - and apparently some impressive work around inference speed.

Highlighting so people don’t get the impression this is just OpenAI slapping a new label on something a generation out of date.


I agree. I tried a few programming problems that, let's say, seem to be out of the distribution of their training data and which GPT-4 failed to solve before. The model couldn't find a similar pattern and failed to solve them again. What's interesting is that one of these problems was solved by Opus, which seems to indicate that the majority of progress in the last months should be attributed to the quality/source of the training data.


Useless anecdata, but I find the new model very frustrating, often completely ignoring what I say in follow-up queries. It's giving me serious Siri vibes.

(text input in web version)

Maybe it's programmed to completely ignore swearing, but how could I not swear after it repeatedly gave me info about you.com when I tried to address it in the second person.


> As many highlighted there, the model is not an improvement like GPT3->GPT4.

The improvements they seem to be hyping are in multimodality and speed (also price – half that of GPT-4 Turbo – though that’s their choice and could be promotional, but I expect it’s at least in part, like speed, a consequence of greater efficiency), not so much producing better output for the same pure-text inputs.


The model scores 60 points higher on LMSYS than the best GPT-4 Turbo model from April; that's still a pretty significant jump in text capability.


I tested a few use cases in the chat, and it's not particularly more intelligent, but they seem to have solved laziness. I had to categorize my expenses to do some budgeting for the family, and in GPT-4 I had to go ten by ten, confirm the suggested category, and download the file; it took two days as I was constantly hitting the limit. GPT-4o did most of the grunt work, then communicated anomalies in bulk, asked for suggestions for these, and provided a downloadable link in two answers, calling the code interpreter multiple times and working toward the goal on its own.

And the prompt wasn't a monstrosity, and it wasn't even that good, it was just one line, "I need help to categorize these expenses", and off it went. I hope it won't get enshittified like Turbo, because this finally feels as great as 3.5 was for goal seeking.


Heh - I'm using ChatGPT for the same thing! Works 10X better than Rocket Money, which was supposed to be an improvement on Mint but meh.


They are admitting that it is the im-also-a-good-gpt2-chatbot. There were 3... Don't ask me why.

The "gpt2-chatbot" was the worst of the three.



