GPT-3 was trained on about half a trillion words (Common Crawl, WebText, two book corpora, and Wikipedia, IIRC). At about 100 words per minute, that's almost ten thousand years of continuous speech. By my estimate it's probably a few thousand times what people actually hear in a lifetime. We don't experience anywhere near the volume of language that it did.
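For anyone who wants to check the back-of-the-envelope math (every input below is a rough assumption on my part, not a measured figure):

```python
# Rough back-of-envelope: GPT-3's corpus expressed as continuous speech.
corpus_words = 5e11          # ~half a trillion words (approximate)
words_per_minute = 100       # typical conversational speaking rate

minutes = corpus_words / words_per_minute
years = minutes / (60 * 24 * 365)
print(f"~{years:,.0f} years of continuous speech")   # ~9,500 years

# Lifetime comparison: assume a person hears on the order of a couple
# hundred million words over a lifetime (an order-of-magnitude guess).
lifetime_words = 2e8
print(f"~{corpus_words / lifetime_words:,.0f}x a lifetime of listening")  # ~2,500x
```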
Why, then, the continued obsession with building single-media models?
Is focusing on the Turing test and language proficiency taking us further away from the goal of genuine intelligence?
I would argue "yes", which was my original comment. At no point in our attempts to replicate what an adult sounds like have we demonstrated anything remotely like the IQ of a small child. And there's this big gap where some imply the process goes 1) sound like an adult -> 2) think like an adult, which seems to be missing the boat, imo. (There's logically an intermediate step where we'd have this adult-sounding monster AI child.)
If we could constrain the vocabulary to what a child might be exposed to, the correlative trickery of these models would be more obvious. The (exceptionally good) quality of these curve fits wouldn't trick us with vocabulary and syntax that look like something we'd say. The dumb things would sound dumb, and the smart things would sound smart. And maybe, probably even, that would require fusing in all sorts of other experiential models to make it happen.
> Why, then, the continued obsession with building single-media models?
I think it's literally just working with the available data. By some back-of-the-envelope math, GPT-3's training corpus is thousands of lifetimes of language heard. All else equal, I'm sure the ML community would almost unanimously agree that thousands of lifetimes of other data with many modes of interaction and different media would be better. But collecting that would take forever and cost insane amounts of money. Some kinds of labels are relatively cheap, and some data don't need labels at all, like this internet text corpus. I think that explains the obsession with single-media models. There's a lot more work to do, and this is, believe it or not, still the low-hanging fruit.
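To make the "no labels needed" point concrete, here's a minimal sketch (a toy model, nothing like GPT-3's actual architecture): in next-token prediction, the targets are just the input text shifted by one position, so raw scraped text supervises itself.

```python
import torch
import torch.nn as nn

vocab_size, d_model, seq_len = 1000, 64, 32

class TinyLM(nn.Module):
    """Toy language model: embed tokens, run a GRU, predict the next token."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.rnn = nn.GRU(d_model, d_model, batch_first=True)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):
        hidden, _ = self.rnn(self.embed(tokens))
        return self.head(hidden)

model = TinyLM()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# The "labels" materialize out of unlabeled token ids for free:
tokens = torch.randint(0, vocab_size, (8, seq_len))  # stand-in for real text
inputs, targets = tokens[:, :-1], tokens[:, 1:]      # targets = inputs shifted by one

opt.zero_grad()
logits = model(inputs)                               # (batch, seq-1, vocab)
loss = loss_fn(logits.reshape(-1, vocab_size), targets.reshape(-1))
loss.backward()
opt.step()
```

No annotator ever touches the data; that's why web text is the cheap modality, and why everything else (interaction, other media) costs so much more per example.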
> thousands of lifetimes of other data with many modes of interaction and different media would be better.
But why not just 1 lifetime of different kinds of data? Heck, why not an environment of 3 years of multi-media data that a child would experience? That wouldn't cost insane amounts of money (or probably anything even close to what we've spent on deep learning as a species).
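As a rough sanity check on the cost claim (again, every number here is an order-of-magnitude assumption, not a measurement):

```python
# Order-of-magnitude guess: raw storage for ~3 years of the kind of
# continuous audio/video stream a young child experiences.
waking_hours_per_day = 12                        # assumed
days = 3 * 365
total_hours = waking_hours_per_day * days        # ~13,000 hours

gb_per_hour = 1.0                                # assumed: modestly compressed A/V
total_tb = total_hours * gb_per_hour / 1000
print(f"~{total_hours:,} hours, ~{total_tb:.0f} TB of raw data")  # ~13 TB
```

A few tens of terabytes is trivially cheap to store by modern standards; the expense would be in capturing and interacting with the stream, not holding it.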
A corpus limited to the experiences of a single agent would make a very compelling case for intelligence if, at the end of that training, there were something that sounded and acted smart. It couldn't "jump the gun", as it were, by looking up some very intelligent statement made somewhere else. It would imply the agent was creatively generating new models rather than finding pre-existing ones. It'd even be generous to plain-ol'-AI as well as deep learning, because it would leave room both for causal models that explain learned explicit knowledge (symbolic) and for interesting tacit behavior (empirical ML).
> But why not just 1 lifetime of different kinds of data? Heck, why not an environment of 3 years of multi-media data that a child would experience? That wouldn't cost insane amounts of money (or probably anything even close to what we've spent on deep learning as a species).
How would you imagine creating such an environment in a way that allows you to train models quickly?