
As an ML vision researcher, I find these scaling hypothesis claims quite ridiculous. I understand that the NLP world has made large strides by adding more attention layers, but I'm not an NLP person and I suspect there's more to it than just more layers. We won't even talk about the human brain; let's just address the "scaling is sufficient" hypothesis.

With vision, pointing to Parti and DALL-E as evidence for scaling is quite dumb. They perform similarly but are DRASTICALLY different in size. Parti has configurations with 350M, 750M, 3B, and 20B parameters. DALL-E 2 has 3.5B. Imagen uses T5-XXL, which alone has 11B parameters, just for the text part.

Not only that, there are major architecture changes. If scaling were all you needed, then all these networks would still be using CNNs. But we shifted to transformers. THEN we shifted to diffusion-based models. Not to mention that Parti, DALL-E, and Imagen have different architectures. It isn't just about scale. Architecture matters here.

And to address concerns: diffusion (proposed years before it took off) didn't start working because we just scaled it up. It worked because of engineering. It was largely ignored previously because no one could get it to work better than GANs. I think this lesson should really stand out: we need to consider the advantages and disadvantages of different architectures and learn how to make ALL of them work effectively. That way we can combine them in ideal ways. Even LeCun is coming around to this point of view despite previously being on the scaling side.

But maybe you NLP folks disagree. The experience in vision, at least, is far richer than just scaling.




I agree - personally, I think scaling laws and the scaling hypothesis are quite distinct. The scaling hypothesis is "just go bigger with what we have and we'll get AGI", whereas scaling laws are "for these tasks and these model types, these are the empirical trends in performance we see". I think scaling laws are still really valuable for vision research, but as you say, we should not abandon thinking about things beyond scaling even if we observe good scaling trends.
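
For concreteness, here's a minimal sketch of what a scaling-law fit looks like in practice (the code and numbers are mine and purely illustrative, not from any of the papers discussed): fit a power law L(N) ≈ a * N^(-b) to loss versus parameter count for a fixed task and model family, then extrapolate the trend.

    import numpy as np

    # hypothetical (parameter count, validation loss) pairs -- made up for illustration
    n_params = np.array([1e8, 3e8, 1e9, 3e9, 1e10])
    losses   = np.array([3.10, 2.85, 2.62, 2.45, 2.31])

    # fit log L = log a - b * log N, i.e. L(N) ~= a * N^(-b)
    slope, intercept = np.polyfit(np.log(n_params), np.log(losses), 1)
    a, b = np.exp(intercept), -slope
    print(f"fit: L(N) ~= {a:.2f} * N^(-{b:.4f})")

    # the "scaling law" use case: extrapolate the empirical trend to a bigger
    # model in the same family -- a trend line, not a promise of new capabilities
    print(f"predicted loss at 1e11 params: {a * 1e11 ** (-b):.2f}")

The point is that this kind of curve describes a trend for one setup; by itself it says nothing about whether a different architecture would shift the whole curve, which is exactly the distinction from the scaling hypothesis.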


Yeah, I agree with this position. It is also what I see in my own research, where I also see the vast importance of architecture search. That may not be what the public sees, but I think it is well known to the research community, or to anyone with hands-on experience with these types of models.


this is well articulated. another key point: dall-e 2 uses roughly 70% fewer parameters than dall-e 1 while offering far higher quality.

from wikipedia (https://en.wikipedia.org/wiki/DALL-E):

DALL-E's model is a multimodal implementation of GPT-3 with 12 billion parameters which "swaps text for pixels", trained on text-image pairs from the Internet. DALL-E 2 uses 3.5 billion parameters, a smaller number than its predecessor.
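
(sanity-checking the figure above against the numbers in that quote: (12B - 3.5B) / 12B ≈ 71%, so "roughly 70% fewer parameters" checks out.)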



