The scaling laws of large language models are very specific to language models and the way they're trained. The important thing LLMs demonstrate is that transformers are capable of this kind of scale, where other approaches have not been.
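For anyone who hasn't seen them written out, the scaling laws people usually mean here are the Kaplan/Chinchilla-style fits of loss against parameter count and token count. A rough sketch in Python (the constants are approximately the published Chinchilla fit and are only illustrative):

  # Chinchilla-style parametric loss fit (Hoffmann et al. 2022):
  # L(N, D) ~ E + A / N^alpha + B / D^beta
  # N = parameter count, D = training tokens.
  def chinchilla_loss(N, D, E=1.69, A=406.4, B=410.7, alpha=0.34, beta=0.28):
      return E + A / N**alpha + B / D**beta

  # e.g. a 70B-parameter model trained on 1.4T tokens:
  print(chinchilla_loss(70e9, 1.4e12))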

In the RL space, a sufficiently complex, stochastic environment is effectively a data generator.
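A minimal sketch of what I mean, with a made-up toy environment (the class, dynamics, and reward are purely illustrative): every rollout samples fresh transitions, so there is no fixed dataset to exhaust.

  import random

  class NoisyWalk:
      """Toy stochastic environment: a 1-D walk with noisy transitions."""
      def reset(self):
          self.pos = 0
          return self.pos

      def step(self, action):  # action is -1 or +1
          self.pos += action + random.choice([-1, 0, 1])  # stochastic dynamics
          return self.pos, -abs(self.pos)  # reward for staying near the origin

  def rollout(env, policy, steps=1000):
      """Each call generates a brand-new trajectory of (obs, action, reward, next_obs)."""
      obs, traj = env.reset(), []
      for _ in range(steps):
          action = policy(obs)
          next_obs, reward = env.step(action)
          traj.append((obs, action, reward, next_obs))
          obs = next_obs
      return traj

  data = rollout(NoisyWalk(), policy=lambda obs: random.choice([-1, 1]))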

I'm not sure I agree that there is a distinction between scaling laws and a model being "capable of scale." RNNs are Turing complete, so from that perspective they should in theory be sufficient for AGI. But of course they are not, because their scaling with regard to network depth and sequence length is abysmal. LLMs do scale with depth and sequence length, but if their scaling laws with regard to dataset size prevent us from training them adequately, then we are stuck nonetheless.

I haven't heard of any groups that are studying data-constrained learning in the context of LLMs, but that will probably change as models get bigger. And at that point, architectures with better scaling laws may be right around the corner, or they may not. That's the pain of trying to project these things into the future.


The scaling laws for LLMs depend heavily on the quality of data. For example, if you add an additional 100 GB of data but it only contains the same repeating word, that will hurt the model. If you add 100 GB of completely random words, that will also hurt the model. Between these two extremes (low and high entropy), human language has a certain amount of natural entropy that helps the model gauge the true co-occurrence frequency of words in a sentence. The scaling laws for LLMs aren't just a reflection of the model but also of the conditional entropy of human-generated sentences.
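A toy way to see the two extremes (at the unigram level, ignoring context): the repeated-word corpus has zero entropy per token, uniformly random words sit at the maximum, and natural language lands in between with co-occurrence structure the model can actually learn. A quick sketch, purely illustrative:

  import math
  import random
  from collections import Counter

  def unigram_entropy(tokens):
      """Per-token Shannon entropy in bits, estimated from unigram counts."""
      counts = Counter(tokens)
      total = len(tokens)
      return -sum((c / total) * math.log2(c / total) for c in counts.values())

  # One repeated word: 0 bits per token, nothing for the model to learn.
  print(unigram_entropy(["the"] * 100_000))

  # Uniformly random words from a 1,000-word vocabulary: close to
  # log2(1000) ~ 10 bits per token, but the co-occurrence statistics are pure noise.
  vocab = [f"w{i}" for i in range(1_000)]
  print(unigram_entropy([random.choice(vocab) for _ in range(100_000)]))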

RL is such a different field that you can't apply these scaling laws directly. E.g., agents playing tic-tac-toe and checkers would stop scaling at a very low ceiling.


One possible risk I see is that, with the amount of model-generated text out there, we will at some point inevitably end up feeding the output of one model into another unless the source of the text is meticulously traced. (My assumption is that that would hurt the model you are trying to train as well.)