
How GPT3 Works – Visualizations and Animations - dsr12
https://jalammar.github.io/how-gpt3-works-visualizations-animations/
======
AcerbicZero
I enjoy a good visualization, but at best they're high level graphical
powerpoints, and in this case I found the animations more distracting than
useful.

Also, if you're going to do a 30k foot view of a technical topic, you might
want to tell people what GPT3 is somewhere in there.

~~~
m3at
I agree that in this case the animated parts of the graphics were not needed,
it's an easy pitfall to be distracted by the beautiful aspects of
visualisations when crafting them.

I feel the need to defend the author though, it's hard to make research
accessible while still distilling valuable insight. I think his post on
transformer networks [1] did a good job for example, and you'll appreciate the
lack of animations.

[1] [https://jalammar.github.io/illustrated-transformer/](https://jalammar.github.io/illustrated-transformer/)

~~~
ypcx
Yes this seems like an early work in progress, compared to Jay's previous
Transformer articles.

In addition to your link, I've found a really good Transformer explanation
here (backed by a Github repo w/ lively Issues talk):
[http://www.peterbloem.nl/blog/transformers](http://www.peterbloem.nl/blog/transformers)

Additionally, there's a paper on visualizing self-attention:
[https://arxiv.org/pdf/1904.02679.pdf](https://arxiv.org/pdf/1904.02679.pdf)

~~~
m3at
That's a good complement, thank you for the links.

------
whywhywhywhy
Please don't use terms like "magic" when trying to explain things to people.
They never point out where the "magic" part lines up to any of their other
explanation.

~~~
Quarrelsome
a lot of maths is basically just "number magic". Apply the formula get the
desired output.

~~~
rhizome
That is absolutely not "magic!"

~~~
Quarrelsome
Depends how good you are at reading maths proofs. It's like how hardware is
magical to some of us because we're too spooked to grab a soldering iron.

------
sirpunch
I love the amount of effort Jay puts into his posts to develop intuitions.
And I wonder if there are some open source projects out there to help make
simple animations for researchers who like to blog.

~~~
iwintermute
there's [https://github.com/3b1b/manim](https://github.com/3b1b/manim) created
and used by [https://www.3blue1brown.com/](https://www.3blue1brown.com/)

------
torusenthusiast
I'm curious, what can I, as a full-stack developer, do to prepare for things
like GPT-X eventually making a lot of the work I do obsolete in the next 10-20
years?

Seeing all these demonstrations is starting to make me a little bit nervous
and I feel it is time for a long term plan.

~~~
hpen
I'm actually looking forward to more code generation tools. Things like wiring
up a button aren't stimulating and I wouldn't mind that level of programming
becoming automated.

~~~
freehunter
That’s what I loved about Visual Basic. You could just draw your user
interface and specify actions and then just fill in the one or two lines of
code that need to run when that button is pressed.

I’m surprised React doesn’t have something like that. At least not that I’m
aware of. Is there a GUI interface builder for React?

~~~
pkage
There are a handful of projects out there, such as BuilderX[0] and React
Studio[1].

[0] [https://builderx.io/](https://builderx.io/)

[1] [https://reactstudio.com/](https://reactstudio.com/)

------
blueboo
What am I missing? How are any of these visualizations GPT-3 specific and not,
say, a deep learning LSTM from years ago?

~~~
not2b
AFAIK, the only thing new about GPT-3 is its massive size; the architecture is
completely conventional, the same as those you've seen from a few years ago.

------
aeternum
The visualizations seem to show non-recurrent networks whereas my
understanding is that one of the important differences between GPT1 and GPT2 &
3 is the use of recurrent networks.

This allows the output to loop backwards, providing a rudimentary form of
memory / context beyond just the input vector.

~~~
sdrg822
While models such as XLNet incorporate recurrence, GPT-{2,3} is mostly just a
plain decoder-only transformer model.[1]

[1] [https://arxiv.org/abs/2005.14165](https://arxiv.org/abs/2005.14165)

[2] [https://d4mucfpksywv.cloudfront.net/better-language-models/language-models.pdf](https://d4mucfpksywv.cloudfront.net/better-language-models/language-models.pdf)

------
srg0
Just curious. What languages (human languages) were used in the training data
set of GPT3? Is it trained only on English texts and grammar, or is it
transcending language barriers?

~~~
supermatt
The vast majority (>93%) is English (by document):
[https://github.com/openai/gpt-3/blob/master/dataset_statistics/languages_by_document_count.csv](https://github.com/openai/gpt-3/blob/master/dataset_statistics/languages_by_document_count.csv)

------
2bitencryption
My (very limited) understanding of AI models is the input "shape" has to be
well defined.

I.e. a vision network expects 1 input per pixel (or more for encoding color)
and so it's up to you to "format" your given image into what the model
expects.

But what about GPT-3, which takes in "free text?" The animations in the post
show 2048 input nodes, does this mean it can only take in a maximum of 2048
tokens? Or will it somehow scale beyond that?
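The 2048 figure from the post is indeed a hard cap on token slots per input, so longer inputs have to be truncated or windowed. A toy sketch of that, assuming a whitespace split as a stand-in tokenizer (GPT-3 really uses byte pair encodings, not words):

```python
# Fixed context window: the model only ever sees a fixed number of
# token slots, so longer inputs must be truncated (or fed in windows).
MAX_TOKENS = 2048  # the context size mentioned in the post

def prepare_input(text, max_tokens=MAX_TOKENS):
    tokens = text.split()       # stand-in for a real tokenizer
    return tokens[:max_tokens]  # anything past the window is dropped

short = prepare_input("the cat sat on the mat")
long = prepare_input("word " * 5000)

print(len(short))  # 6
print(len(long))   # 2048
```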

~~~
unixhero
Is GPT-3 even a computer vision AI model?

~~~
minimaxir
No, but there ain't no rule about flattening pixels and using it as input
data.
[https://news.ycombinator.com/item?id=23554944](https://news.ycombinator.com/item?id=23554944)
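A minimal sketch of what "flattening pixels" means here, with a made-up 2x3 grayscale "image":

```python
# A 2-D image becomes one long sequence of values, the same
# shape-massaging any fixed-input model requires.
image = [
    [0, 255, 0],
    [255, 0, 255],
]

flat = [pixel for row in image for pixel in row]
print(flat)  # [0, 255, 0, 255, 0, 255]
```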

------
pkaye
So can something like this answer a query like "give me the names of countries
with population exceeding a million" How would it go about doing that?

~~~
hervature
The answer is yes and no. First, it will produce an output for any input. What
you really mean is answer a query correctly.

It goes about doing that the same way it works in general which is memorizing
sequences that are similar and outputting the corresponding sequence that
follows. For example, if the training data has something like "These are the
countries that have over a million people: <countries>" I would not be
surprised if it returned <countries> for your query. However, if your query
was "less than a million" I would be very surprised if it would return the
other countries.
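A caricature of that "memorize similar sequences and output what follows" behavior as a fuzzy lookup table; real models interpolate in a learned representation rather than doing literal lookups, and the snippets below are made up:

```python
import difflib

# Toy "training data": remembered prompts and their continuations.
memorized = {
    "These are the countries that have over a million people:":
        "China, India, ...",
    "The capital of France is":
        "Paris",
}

def complete(prompt):
    # Match the query against remembered snippets by surface
    # similarity and emit the continuation of the closest one.
    best = difflib.get_close_matches(prompt, memorized, n=1, cutoff=0.0)[0]
    return memorized[best]

print(complete("countries that have over a million people"))
# Close surface match, plausible continuation.
print(complete("countries with less than a million people"))
# Still matches the "over a million" snippet: the failure mode
# described above.
```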

~~~
minimaxir
Just tested the "less than a million" prompt: it works, kinda.

[https://gist.github.com/minimaxir/14c4de89ed9ecc7a170e7e1ca0094263](https://gist.github.com/minimaxir/14c4de89ed9ecc7a170e7e1ca0094263)

~~~
mrfusion
Why did it output so many answers? Also it missed Iceland.

How are you able to use it? Can I try it out?

~~~
minimaxir
Each answer is an independent generation from the model.

I have access to the API.
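A sketch of what "independent generation" means: each call re-samples from the model's output distribution, so repeated runs of the same prompt can differ. The tokens and logits below are made up for illustration:

```python
import math
import random

def sample(logits, temperature=0.7, rng=random):
    """Draw one index from a softmax over temperature-scaled logits."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    r = rng.random()
    acc = 0.0
    for i, p in enumerate(probs):
        acc += p
        if r <= acc:
            return i
    return len(probs) - 1

tokens = ["Monaco", "Iceland", "Malta"]  # hypothetical candidates
logits = [2.0, 1.0, 0.5]                 # hypothetical model scores
rng = random.Random(0)

# Five independent generations: each draw is a fresh sample.
draws = [tokens[sample(logits, rng=rng)] for _ in range(5)]
print(draws)
```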

~~~
mrfusion
Is it hard to get access?

------
Tempest1981
Can someone explain this - what is meant by "complete"?

> Training is the process of exposing the model to lots of text. It has been
> done once and complete.

~~~
Plough_Jogger
I think this is a typo. Perhaps the author means to say that the training
process has been completed?

~~~
visarga
Yes, from now on we can only control it by prompting. But OpenAI announced
plans for further fine-tuning on client data.

------
iandanforth
This is fun, but IMO too simplified. For example, it's really important to know
that GPT-3 does _not_ see "words"; it sees byte pair encodings, which are for
the most part smaller than words but larger than individual characters. This
has immediate implications for what GPT-3 can and cannot do. It can reverse a
sentence (This cat is cute -> cute is cat this) but it cannot reliably reverse
a word (allegorical -> lacirogella).

[https://www.gwern.net/GPT-3#bpes](https://www.gwern.net/GPT-3#bpes)
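A toy illustration of the BPE point. The vocabulary and greedy longest-match segmentation below are made-up stand-ins (GPT-3's real BPE vocabulary has roughly 50k learned entries), but they show why the model sees chunks rather than characters:

```python
# Hypothetical subword vocabulary covering our example words.
VOCAB = ["alleg", "or", "ical", "la", "cir", "og", "ella"]

def bpe_split(word, vocab=VOCAB):
    """Greedy longest-match segmentation, standing in for real BPE."""
    pieces, i = [], 0
    while i < len(word):
        for length in range(len(word) - i, 0, -1):
            chunk = word[i:i + length]
            if chunk in vocab or length == 1:
                pieces.append(chunk)
                i += length
                break
    return pieces

print(bpe_split("allegorical"))  # ['alleg', 'or', 'ical']
# Reversing the word character by character produces chunks the model
# has rarely, if ever, seen as tokens:
print(bpe_split("lacirogella"))  # ['la', 'cir', 'og', 'ella']
```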

~~~
nmfisher
Interesting to consider whether this limitation of BPE points to a more
fundamental issue with the model. Does GPT-3 "fail" when BPE is replaced with
the conventional English alphabet as input symbols (for various definitions of
"fail")?

If so, wouldn't this be evidence that the model is using its mind-blowingly
large latent space to memorize surface patterns that bear no real relationship
to the underlying language (as most people suspect)?

I suppose this comes back to my question about Transformer models in general -
the use of a very large attention window of BPE tokens.

When I finish reading a paragraph, I can probably use my own words to explain
it. But there's no chance I could even try to recreate the sentences using the
exact words I just read. So I doubt our brains are keeping some running stack
of the last XXXX words, or even some smaller distributed representation
thereof.

It's more plausible that we're using some kind of natural hierarchical
compression/comprehension mechanism operating at the
character/word/sentence/paragraph level.

It certainly _feels_ like GPT-3 is using a huge parameter space to bypass this
mechanism and simply learn a "reconstitutable" representation.

Either way, I'd be really interested to see how it handles character-level
input symbols.

~~~
nullc
> Does GPT-3 "fail" when BPE is replaced with the conventional English
> alphabet as input symbols (for various definitions of "fail")?

The attention mechanism is quadratic cost in the number of input symbols.
Restricting it to a tiny alphabet would radically blow up the model cost, so
it's difficult to make an apples to apples comparison.
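Back-of-the-envelope on that quadratic cost, assuming a rough 4-characters-per-BPE-token ratio for English (an approximation, not an exact figure):

```python
def attention_pairs(n_tokens):
    # Self-attention scores every token pair, so cost grows as n^2.
    return n_tokens ** 2

chars_per_bpe_token = 4  # rough assumption for English text
doc_chars = 8192         # an example document length

bpe_len = doc_chars // chars_per_bpe_token  # 2048 BPE tokens
char_len = doc_chars                        # 8192 character tokens

print(attention_pairs(bpe_len))   # 4194304
print(attention_pairs(char_len))  # 67108864
# Character-level input is 4x longer, so attention is 16x more expensive.
print(attention_pairs(char_len) // attention_pairs(bpe_len))  # 16
```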

> When I finish reading a paragraph, I can probably use my own words to
> explain it. But there's no chance I could even try to recreate the sentences
> using the exact words I just read.

Sure you could: you could look it up and copy it, which is an ability GPT-3
also needs to model if it's to successfully learn from the internet, where
people do that all the time. :)

> the use of a very large attention window of BPE

You're also able to remember chunks from not long before. You just don't
remember all of them. I'm sure people working on transformers would _prefer_
to not have it remember everything for a window (and instead spend those
resource costs elsewhere), but it's necessary that the attention mechanism be
differentiable for training, and that excludes obvious constructions. (E.g.
you can't just bolt a nearest-key->value database on the side and simply
expect it to learn to use it).

~~~
nmfisher
> The attention mechanism is quadratic cost in the number of input symbols.
> Restricting it to a tiny alphabet would radically blow up the model cost, so
> it's difficult to make an apples to apples comparison.

That's ultimately my point. If an alphabet-based model can't achieve nearly
the same results as a BPE-based model (even if appropriately scaled up to
accommodate the expanded cost), doesn't that suggest that Transformers really
are just a neat memorization hack?

> Sure you could, you could look up and copy it which is an ability GPT-3 also
> needs to model if its to successfully learn from the internet where people
> do that all the time. :)

That's right - but then we're just talking about memorization and
regurgitation. Sure, it's impressive when done on a large scale, but is it
really a research direction worth throwing millions of dollars at?

> I'm sure people working on transformers would _prefer_ to not have it
> remember everything for a window (and instead spend those resource costs
> elsewhere), but it's necessary that the attention mechanism be
> differentiable for training, and that excludes obvious constructions.

Of course, but all of my whinging about Transformers is a roundabout way of
saying "I'm not convinced that the One True AI will unquestionably use some
variant of differentiation/backpropagation".

~~~
nullc
> That's ultimately my point. If an alphabet-based model can't achieve nearly
> the same results as a BPE-based model (even if appropriately scaled up to
> accommodate the expanded cost), doesn't that suggest that Transformers
> really are just a neat memorization hack?

BPEs aren't even words for the most part. Are all native Chinese authors non-
conscious memorization hacks? :)

------
arensc
Hey how did you do those animations?

~~~
jalammar
Apple Keynote

------
mrfusion
Could I give gpt-3 a legal contract or a terms of service and then ask it
questions about it?

~~~
GreenHeuristics
Yes, it will give you answers: each answer will be a blend of answers to
similar questions asked before.

~~~
mrfusion
But say I give it a new piece of text it has never seen before. Can it answer
questions about that or it won’t really care about what’s in that text?

~~~
not2b
It will give an answer, whatever is formed by extending the input you give it.
The answer will be based on the text you provide, so in that sense it "cares"
about it. Whether the answer is any good is another matter. But maybe it will
find something based on its training data that relates.

------
rafaelturk
Looks like a big rainbow table generated by AI. Thanks Jay for creating this
amazing chart.

------
quasarj
It has such a confusing name.. I always think this is something interesting,
and then it's just more ML bunk.

------
thatwasunusual
I find it funny that the first two comments have to do with how they did the
_animations_.

~~~
arensc
Yeah, mainly because it's generally getting more difficult to distill ideas
and keep people interested; highly applicable to communication in engineering.

------
faldore
"OpenAI" isn't "Open" therefore it's worthless. Next.

