
The argument about AGI from LLMs is not based on the current state of LLMs, but on the rate of progress over the last 5+ years or so. It wasn't very long ago that almost nobody outside of a few niche circles seriously thought LLMs could do what they do right now.

That said, my personal hypothesis is that AGI will emerge from video generation models rather than text generation models. A model that takes an arbitrary real-time video input feed and must predict the next, say, 60 seconds of video would have to have a deep understanding of the universe, humanity, language, culture, physics, humor, laughter, problem solving, etc. This pushes the fidelity of both input and output far beyond anything that can be expressed in text, but also creates extraordinarily high computational barriers.




> The argument about AGI from LLMs is not based on the current state of LLMs, but on the rate of progress over the last 5+ years or so.

And what I'm saying is that I find that argument to be incredibly weak. I've seen it time and time again, and honestly at this point it just feels like a "humans should be a hundred feet tall based on their rate of change in their early years" argument.

While I've also been amazed at the past progress in LLMs, I don't see any reason to expect that rate to continue in the future. What I do see, the more I use the SOTA models, is fundamental limitations in what LLMs are capable of.


Expecting the rate of progress to drop off so abruptly, after realistically just a few years of serious work on the problem, seems to me like a grander and less reasonable prediction than expecting it to continue at its current pace for even just 5 more years.


The problem is that the rate of progress over the past 5/10/15 years has not been linear at all, and it's been pretty easy to point out specific inflection points that have allowed that progress to occur.

That is, the real breakthrough that allowed such rapid progress was transformers in 2017. Since then, the vast majority of the progress has come from throwing more data at the problem and making the models bigger (and, to emphasize, transformers are what made that scale possible in the first place). I don't mean to denigrate this approach - if anything, OpenAI deserves a lot of credit for making the bet that spending hundreds of millions on model training would give discontinuous results.

However, there are loads of reasons to believe that "more scale" is going to give diminishing returns, and a lot of very smart people in the field have been making this argument (at least quietly). Even more specifically, there are good reasons to believe that more scale is not going to come anywhere close to solving the types of problems that have become evident in LLMs now that they have massive scale.

So the big thing I'm questioning is that I see a sizable subset of both AI researchers (and more importantly VC types) believing that, essentially, more scale will lead to AGI. I think the smart money believes that there is something fundamentally different about how humans approach intelligence (and this difference leads to important capabilities that aren't possible from LLMs).


Could it be argued that transformers are only possible because of Moore's law and the amount of processing power that can do these computations in a reasonable time? How complex is the transformer network, really? Every lay explanation I've seen basically describes it as a kind of parallelized access to the input string. That sounds like a hardware problem, because the algorithmic advances still need to run on reasonable hardware.
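For what it's worth, the core mechanism is genuinely small. Here's a minimal, purely illustrative sketch of scaled dot-product attention in plain numpy (toy shapes, not taken from any real library) - the "parallelized access" is essentially a couple of matrix multiplies over the whole sequence at once, which is exactly the kind of work GPUs are built for:

    import numpy as np

    def attention(Q, K, V):
        # Every query position scores against every key position in one shot:
        # a (seq_len x seq_len) weight matrix computed as a single matmul.
        d_k = K.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)                     # (seq, seq)
        weights = np.exp(scores - scores.max(-1, keepdims=True))
        weights /= weights.sum(-1, keepdims=True)           # row-wise softmax
        return weights @ V                                   # (seq, d_v)

    rng = np.random.default_rng(0)
    Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))   # 4 tokens, 8 dims each
    out = attention(Q, K, V)   # each output row mixes information from all 4 positions

So the algorithm itself is tiny; what the hardware made practical was running it over long sequences, huge batches, and billions of parameters.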


Transformers in 2017 were the basis, but the quantization-emergence link - found as a grad student project using spare time on ridiculously large A100 clusters in 2021/2022 - is what finally brought about this present moment.

I feel it is fair to say that neither of these was a natural extrapolation from prior successful models. There is no indication we are anywhere near another nonlinearity, if we even knew how to look for one.

Blind faith in extrapolation is a finance regime, not an engineering regime. Engineers encounter nonlinearities regularly. Financiers are used to compound interest.


I don’t see why it’s unreasonable. Training a model that is an order of magnitude bigger requires (at least) an order of magnitude more data, an order of magnitude more time, hardware, energy, and money.
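To put rough, purely illustrative numbers on that: using the common approximation that training cost is about 6 * params * tokens FLOPs, plus the Chinchilla-style heuristic that compute-optimal token count grows roughly in proportion to parameter count, a 10x bigger model wants roughly 10x the data and roughly 100x the compute:

    # Back-of-envelope, assuming train FLOPs ~ 6 * params * tokens and
    # compute-optimal token count scaling roughly linearly with params.
    def train_flops(params, tokens):
        return 6 * params * tokens

    base   = train_flops(params=70e9,  tokens=1.4e12)   # roughly Chinchilla-scale
    scaled = train_flops(params=700e9, tokens=14e12)    # 10x params, 10x tokens

    print(f"base: {base:.1e} FLOPs")
    print(f"10x:  {scaled:.1e} FLOPs ({scaled / base:.0f}x the compute)")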

Getting an order of magnitude more data isn’t easy anymore. From GPT-2 to GPT-3 we (only) had to scale up to the internet. Now? You can look at other sources like video and audio, but those are inherently more expensive. So your data acquisition costs aren’t linear anymore; they’re something like 50x or 100x. Your quality will also dip, because most speech (for example) isn’t high-quality prose: it contains lots of fillers, rambling, and transcription inaccuracies.

And this still doesn’t fix fundamental long-tail issues. If you have a concept that the model needs to see 10x to understand, you might think scaling your data 10x will fix it. But if the concept is rare, your data might not contain it 10x - it might instead contain 9 other things that each appear once. So your model won’t learn it.
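A toy way to see the long-tail problem, assuming (purely for illustration) that concept frequencies are Zipf-like: the corpus size you need before a given concept shows up 10 times grows in direct proportion to how rare it is, so the tail always outruns any fixed scale-up.

    import numpy as np

    # Toy model: the r-th most common concept appears with probability
    # ~ 1 / (r * H) per token, over n_concepts distinct concepts (all made up).
    n_concepts = 100_000_000
    H = np.log(n_concepts) + 0.5772          # harmonic number approximation

    def tokens_to_see(rank, times=10):
        p = 1.0 / (rank * H)                 # per-token probability of this concept
        return times / p                     # expected corpus size to see it `times` times

    for rank in (1_000, 1_000_000, 50_000_000):
        print(f"rank {rank:>11,}: ~{tokens_to_see(rank):.1e} tokens needed to see it 10x")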


10 years of progress is the blink of an eye on the scale of human progress. The first deep learning models that worked appeared in 2012. That was like yesterday. You are completely underestimating the rate of change we are witnessing. Compute scaling is not at all similar to biological scaling.


Happy to review this in 5 years


If it's true that predicting the next word can be turned into predicting the next pixel, and that you could run a zillion hours of video feed into that, then I agree. It seems that the basic algorithm is there. Video is much less information dense than text, but if the scale of compute can reach the tens of billions of dollars, or more, you have to expect that AGI is achievable. I think we will see it in our lifetimes. It's probably 5 years away.
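That intuition can be made concrete: an autoregressive model doesn't care what its integer tokens stand for, only that there is a next one to predict. A hypothetical PyTorch sketch (the codebook size is made up, and a real model would put a causal transformer between the embedding and the output head) - whether the ids come from a text tokenizer or from a learned codebook of video patches, the training objective is the same next-token cross-entropy:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    # "Tokens" here could be word pieces (~50k vocab) or entries in a learned
    # codebook of video patches; the objective below doesn't change either way.
    vocab_size = 8192                      # illustrative video-patch codebook size
    d_model = 64

    embed = nn.Embedding(vocab_size, d_model)
    head = nn.Linear(d_model, vocab_size)  # stand-in for embed -> transformer -> head

    tokens = torch.randint(0, vocab_size, (1, 16))    # a toy clip as token ids
    logits = head(embed(tokens[:, :-1]))              # predict token t+1 from token t
                                                      # (a real model attends to the full prefix)
    loss = F.cross_entropy(logits.reshape(-1, vocab_size),
                           tokens[:, 1:].reshape(-1)) # same loss as for text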


I feel like that's already been demonstrated with the first-generation video generation models we're seeing. Early research already shows video generation models can become world simulators. There frankly just isn't enough compute yet to train models large enough to do this for all general phenomena and then make it available to general users. It's also unclear if we have enough training data.

Video is not necessarily less information dense than text, because when considered in its entirety it contains text and language generation as special cases. Video generation includes predicting continuations of complex verbal human conversations as well as continuations of videos of text exchanges, someone flipping through notes or a book, someone taking a university exam through their perspective, etc.





