
GPT-3 and Scaling Trends - luu
https://nostalgebraist.tumblr.com/post/619672884731904000/gpt-3-and-scaling-trends
======
mordymoop
This blogger implies that GPT-3 could be within an order of magnitude of a
threshold in data efficiency: a GPT model with >1 trillion parameters would
begin to see a reduction in data efficiency. I have read all the GPT-related
papers and I’m frankly not sure what nostalgebraist thinks this would mean,
practically speaking. All complex problem domains see a drop-off in data
efficiency once the “easy” structure is successfully learned. The lesson here
might simply be that GPT-type models are close (within an order of magnitude,
so “close” is subjective) to having learned all the “obvious” regularities in
massive language datasets, leaving only the increasingly subtle regularities
whose discovery might require learned abstractions that are very specific,
very hard to reach through ravines of local minima, or very “big.” Since GPT-3
can already do some rather incredible things in the zero-shot case, to say
nothing of the few-shot case, this doesn’t suddenly make me dissatisfied with
the performance of large transformer models.

------
sillysaurusx
Does anyone know of research into chatbot memory? I found this one:
https://deepai.org/publication/a-proposal-for-intelligent-agents-with-episodic-memory

GPT-2 was a solution to a certain kind of problem: is it possible to throw out
the idea of explicitly representing language and still get good performance on
a variety of language tasks? But there doesn't seem to be anything like that
for "memory," which seems separate and distinct from "tasks."

In concrete terms, I'm interested in models with memory in the sense of "when
you run inference on the model, you leave a lasting impact on it." Inference
should cause a change in the model's parameters. Yet most models currently
assume that the parameters can be frozen without losing anything essential to
the end goal.

We're very task-oriented. But none of these chatbots can remember my name, or
anything about me, and it's always bugged me. GPT-2 (and now GPT-3) punts the
problem to a sufficiently clever programmer: just figure out how to encode all
the "memory" into the context window, and out pop the results you want. But
that feels rather like arguing "just come up with a technique that works, and
it will work." Perhaps it's true, but it's not very helpful.
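
For illustration, here's that context-window punt in its simplest form; this
is just a toy sketch, and the names and facts in it are made up:

    # A toy sketch of the "encode memory into the context window" approach.
    # Everything the bot "remembers" must be re-serialized into text and
    # prepended to every query; the facts below are hypothetical.
    memory = [
        "The user's name is Alice.",
        "Alice mentioned she plays chess on weekends.",
    ]
    user_input = "Do you remember my name?"

    prompt = "\n".join(memory) + "\nUser: " + user_input + "\nBot:"
    # `prompt` is what actually gets fed to the model; the "memory" lives
    # entirely in this string, and the model's weights are never touched.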

If you hear the same name a few times, you'll remember it for a long time and
start associating it with someone's face. It seems like language models could
do something similar. I don't know precisely how; maybe someone here does.

You could designate parts of the model as long-term memory, short-term memory,
and so on. Inference could then cause larger updates in the short-term areas
than in the long-term ones (equivalent to giving them a higher learning rate).
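
A minimal sketch of that in PyTorch, assuming a toy model split into
hypothetical "short_term" and "long_term" parameter groups, where inference
doubles as a small training step:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class MemoryLM(nn.Module):
        def __init__(self, vocab=1000, dim=64):
            super().__init__()
            self.long_term = nn.Embedding(vocab, dim)  # slow-changing "knowledge"
            self.short_term = nn.Linear(dim, vocab)    # fast-changing "context"

        def forward(self, tokens):
            return self.short_term(self.long_term(tokens))

    model = MemoryLM()
    opt = torch.optim.SGD([
        {"params": model.short_term.parameters(), "lr": 1e-2},  # updates quickly
        {"params": model.long_term.parameters(), "lr": 1e-5},   # nearly frozen
    ])

    def infer_and_remember(tokens, targets):
        # Run inference, but also take one gradient step so the
        # interaction leaves a lasting trace in the weights.
        logits = model(tokens)
        loss = F.cross_entropy(logits, targets)
        opt.zero_grad()
        loss.backward()
        opt.step()
        return logits.detach()

The split into exactly two groups is arbitrary; the point is only that
different regions of the model can update at different rates.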

~~~
nmfisher
Are you aware of the Neural Turing Machine (https://arxiv.org/abs/1410.5401)
or the Differentiable Neural Computer
(https://deepmind.com/blog/differentiable-neural-computers/)?

They're not exactly what you describe, which I suppose is a truly online model
that knows how/when to update its own parameters.

But those two models do incorporate a similar concept of external memory,
whereby the controller is trained via backprop to read from and write to a
tensor (essentially a form of soft-addressable memory available at inference
time).
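
To make "soft-addressable" concrete, here is a minimal sketch of NTM/DNC-style
content-based addressing over a memory matrix M; the write here is a
simplified blend (the actual NTM splits it into erase and add steps):

    import torch
    import torch.nn.functional as F

    def content_read(M, key, beta=10.0):
        # Soft read: cosine similarity between the key and every memory
        # slot, sharpened by beta, then a weighted sum over the slots.
        sim = F.cosine_similarity(M, key.unsqueeze(0), dim=1)  # (slots,)
        w = F.softmax(beta * sim, dim=0)                       # read weights
        return w @ M                                           # (width,)

    def content_write(M, key, value, beta=10.0):
        # Simplified soft write: blend the new value into each slot in
        # proportion to the same addressing weights.
        sim = F.cosine_similarity(M, key.unsqueeze(0), dim=1)
        w = F.softmax(beta * sim, dim=0).unsqueeze(1)          # (slots, 1)
        return M * (1 - w) + w * value.unsqueeze(0)

Because every operation is differentiable, gradients flow through the
addressing itself, which is what lets the controller be trained end to end.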

As far as I recall, these were never applied beyond toy problems, and it seems
this line of research hasn't been very active (at least since the
Transformer's "memorize all the things" approach started performing
exceptionally well on all the benchmarks). I haven't read the paper you linked
just yet - it may well be relevant.

~~~
albertzeyer
For updating its own weights, there is the idea of fast weights
(https://arxiv.org/abs/1610.06258).

This idea of a model updating its own weights is not new. Schmidhuber has done
some work on it
(http://people.idsia.ch/~juergen/deep-learning-miraculous-year-1990-1991.html).
There the main idea is that the model can even modify itself, rather than
needing two separate nets (see Schmidhuber, "Steps towards 'self-referential'
learning").
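
A minimal sketch of the fast-weights mechanism from the paper above, with the
Hebbian outer-product update A <- lam*A + eta*h*h^T; the constants are
illustrative, and the slow weights (trained by backprop as usual) are omitted:

    import torch

    dim = 64
    A = torch.zeros(dim, dim)  # fast weights, initially empty
    lam, eta = 0.95, 0.5       # decay and fast learning rate (illustrative)

    def fast_weight_step(h):
        # Decay the fast weights, imprint the current hidden state h via
        # its outer product, then read h back through the fast weights.
        # Recent activity is thus stored in weights, not in activations.
        global A
        A = lam * A + eta * torch.outer(h, h)
        return A @ h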

Online learning / continual learning is yet another (orthogonal) topic. This
is the setting where new training data becomes available all the time and the
model should use it, i.e. every input (including inputs seen at inference) is
used to update and train the model further. This can be done with standard
backpropagation; the problem is usually overcoming catastrophic forgetting.
See for example:
https://deepmind.com/blog/article/enabling-continual-learning-in-neural-networks
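
The post linked above describes Elastic Weight Consolidation (EWC). A minimal
sketch of its loss, assuming you have saved dicts of the old parameter values
and a diagonal Fisher-information estimate after training on the previous
task:

    import torch

    def ewc_loss(model, task_loss, old_params, fisher, lam=1.0):
        # EWC: a quadratic penalty that anchors each parameter to its old
        # value, weighted by its estimated importance (diagonal Fisher),
        # so weights important to the old task resist being overwritten.
        penalty = 0.0
        for name, p in model.named_parameters():
            penalty = penalty + (fisher[name] * (p - old_params[name]) ** 2).sum()
        return task_loss + (lam / 2) * penalty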

------
ladberg
The author mentions that the shape of the curve from 117M to 175B parameters
is interesting, but doesn't show it. Does anyone have the graph?

~~~
Tarean
Presumably it's one of these graphs?
https://i.imgur.com/r7xJfi1.jpg
https://lh6.googleusercontent.com/VHmmdKYio39kz017ECfPwCxGPT_l7FYl7KDvlXM-ruC5ttiAYJXz5hsfPiWkrxkVKCU5fL-24XRXtw7gEe9KwJXqWW1bLzukITyL1Xz-YFWg0BCP5-RrBXOAdDK2paV7HpT1F4AM

There are a lot of graphs in the paper, though.

------
perl4ever
I read some stuff about GPT-3, and what I noticed was that although it was
doing some amazing things, there were a few tests it did very badly on.
Something like 30%, I think?

It would be interesting to read an article that focused on that, for contrast,
and to give insight into what is still lacking.

------
alpineidyll3
Mine goes to 11.

------
riskable
I hate to be the critic, but this article reads like it was generated by an
algorithm similar to the one that "wrote" those fake scientific papers that
were accepted by predatory publishers...

http://news.mit.edu/2015/how-three-mit-students-fooled-scientific-journals-0414

There are a whole lot of acronyms and basically zero context. The language is
so generic that you could swap out many of the terms and acronyms for random
ones and it would still _seem_ to make sense.

When I read the title, my first thought was, "Wait: when did GUID Partition
Table format _version 2_ come out? They're already talking about version 3‽"

~~~
drusepth
I don't think the audience for this article is meant to include people who
aren't familiar with GPT-2/GPT-3.

