
Is GPT-2's architecture any different from GPT-3's?


Not hugely, but yes. I tend to think of GPT as a style of architecture with consistent themes and major features, but varying minor features and implementation details. Off the top of my head, I believe the most important difference is that GPT-3 alternates global and local attention while GPT-2 is all global attention.
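Concretely, the difference shows up in the attention mask each layer uses. Here's a toy NumPy sketch of the two mask types (sequence length, window size, and names are just illustrative, not GPT-3's actual values):

    import numpy as np

    def global_causal_mask(seq_len):
        # Global (dense) attention: every position can attend to
        # all earlier positions and itself.
        return np.tril(np.ones((seq_len, seq_len), dtype=bool))

    def local_causal_mask(seq_len, window):
        # Local attention: every position only attends to the most
        # recent `window` positions (including itself).
        mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))
        mask &= ~np.tril(np.ones((seq_len, seq_len), dtype=bool), k=-window)
        return mask

    # GPT-2-style stack: global attention in every layer.
    # GPT-3-style stack: alternate global and local layers.
    n_layers, seq_len, window = 12, 1024, 256
    layer_masks = [
        global_causal_mask(seq_len) if i % 2 == 0
        else local_causal_mask(seq_len, window)
        for i in range(n_layers)
    ]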

The two published GPT-Neo models follow GPT-3's lead, but the repo lets the user pick whether to use global or local attention layers.
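If I remember the repo's config format right (treat the exact field name and shape below as approximate), the per-layer choice is written as repeated patterns that expand to one attention type per layer, e.g.:

    # GPT-Neo-style spec: [pattern, repeat_count] pairs.
    # The published models alternate global and local layers; using
    # ["global"] as the pattern would give a GPT-2-style stack instead.
    attention_types = [[["global", "local"], 12]]  # -> 24 layers

    def expand_attention_types(spec):
        # Flatten [[pattern, n], ...] into one attention type per layer.
        layers = []
        for pattern, n in spec:
            layers.extend(pattern * n)
        return layers

    print(expand_attention_types(attention_types))
    # ['global', 'local', 'global', 'local', ...]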



