* "Attention Is All You Need" introduced the Transformer, a non-recurrent architecture for NMT (https://arxiv.org/abs/1706.03762)
* OpenAI GPT modified the original Transformer by changing the architecture (a single network instead of an encoder/decoder pair) and using different hyperparameters, which seems to work best (https://s3-us-west-2.amazonaws.com/openai-assets/research-co...)
* BERT used GPT's architecture but trained it in a different way. Instead of training a plain language model, they made the model predict masked-out words in the text and whether two sentences follow one another (https://arxiv.org/abs/1810.04805); see the sketch after this list.
* OpenAI GPT-2 achieved a new state of the art in language modeling (https://d4mucfpksywv.cloudfront.net/better-language-models/l...)
* The paper in the top post found that if we fine-tune on several tasks in the same way as in BERT, we get an improvement on each of the fine-tuned tasks.
* OpenAI GPT adopted the idea of fine-tuning a language model for a specific NLP task, which was introduced with the ELMo model.
* BERT built a bigger model (12 layers in GPT vs 24 layers in BERT-large), showing that larger Transformer models increase performance.
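To make the BERT bullet above concrete, here is a toy sketch (my own illustration, not the paper's code) of how its two pre-training objectives are set up: mask a fraction of tokens and ask the model to recover them, plus a binary label for whether the second sentence really followed the first. The real recipe also replaces some masked positions with random or unchanged tokens, which I omit here.

    import random

    def make_bert_example(sent_a, sent_b, mask_prob=0.15):
        """Toy illustration of BERT's two pre-training objectives:
        (1) masked LM: hide ~15% of tokens and predict the originals;
        (2) next-sentence prediction: a binary label saying whether
            sent_b actually followed sent_a in the corpus."""
        tokens = ["[CLS]"] + sent_a + ["[SEP]"] + sent_b + ["[SEP]"]
        inputs, targets = [], []
        for tok in tokens:
            if tok not in ("[CLS]", "[SEP]") and random.random() < mask_prob:
                inputs.append("[MASK]")   # the model must predict the original token here
                targets.append(tok)
            else:
                inputs.append(tok)
                targets.append(None)      # no loss on unmasked positions
        is_next = random.random() < 0.5   # in real data: 50% true pairs, 50% random pairs
        return inputs, targets, is_next

    a = "the cat sat on the mat".split()
    b = "it purred quietly".split()
    print(make_bert_example(a, b))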
The idea of transfer learning of deep representations for NLP tasks existed before, but nobody managed to make it work before ELMo.
If we are being pedantic we can include the whole word2vec line of work. It's shallow transfer learning.
Maybe Microsoft doesn't feel like making a big splash about their research?
The architecture is a derivative of the PyTorch BERT implementation, with an MTL (multi-task learning) loss function on top.
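For intuition, a minimal PyTorch sketch of that setup, with a stand-in encoder instead of the actual pre-trained BERT and made-up class/parameter names: one shared encoder, one small head per task, and a per-task loss computed on whichever task the current batch comes from.

    import torch
    import torch.nn as nn

    class SharedEncoderMTL(nn.Module):
        """Sketch of the MT-DNN idea as I read it: a shared BERT-style
        encoder plus one task-specific head per task, trained jointly.
        `encoder` is a stand-in; in the paper it is pre-trained BERT."""
        def __init__(self, encoder, hidden_size, num_labels_per_task):
            super().__init__()
            self.encoder = encoder                      # shared across all tasks
            self.heads = nn.ModuleList(
                [nn.Linear(hidden_size, n) for n in num_labels_per_task]
            )

        def forward(self, input_ids, task_id):
            pooled = self.encoder(input_ids)            # [batch, hidden] sentence representation
            return self.heads[task_id](pooled)          # task-specific logits

    # Toy usage with a stand-in encoder (embedding + mean pool), two tasks:
    class ToyEncoder(nn.Module):
        def __init__(self, vocab=1000, hidden=64):
            super().__init__()
            self.emb = nn.Embedding(vocab, hidden)
        def forward(self, ids):
            return self.emb(ids).mean(dim=1)

    model = SharedEncoderMTL(ToyEncoder(), hidden_size=64, num_labels_per_task=[2, 3])
    logits = model(torch.randint(0, 1000, (8, 16)), task_id=0)
    loss = nn.CrossEntropyLoss()(logits, torch.randint(0, 2, (8,)))  # per-task loss; batches alternate between tasks during training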
>Our implementation of MT-DNN is based on the PyTorch implementation of BERT. We used Adamax (Kingma and Ba, 2014) as our optimizer with a learning rate of 5e-5 and a batch size of 32. The maximum number of epochs was set to 5. A linear learning rate decay schedule with warm-up over 0.1 was used, unless stated otherwise. Following (Liu et al., 2018a), we set the number of steps to 5 with a dropout rate of 0.1. To avoid the exploding gradient problem, we clipped the gradient norm within 1. All the texts were tokenized using wordpieces, and were chopped to spans no longer than 512 tokens.
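If you wanted to reproduce just the optimizer setup from that quote in PyTorch, it would look roughly like this (a sketch under my reading of the quote, not the authors' code; `model` and `total_steps` are assumed to exist and the helper name is made up):

    import torch
    from torch.optim import Adamax
    from torch.optim.lr_scheduler import LambdaLR

    def make_optimizer(model, total_steps, lr=5e-5, warmup=0.1):
        """Adamax at lr 5e-5, linear warm-up over the first 10% of steps,
        then linear decay to zero, per the quoted setup."""
        optimizer = Adamax(model.parameters(), lr=lr)
        warmup_steps = int(warmup * total_steps)
        def lr_lambda(step):
            if step < warmup_steps:
                return step / max(1, warmup_steps)                                   # linear warm-up
            return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))  # linear decay
        return optimizer, LambdaLR(optimizer, lr_lambda)

    # inside the training loop, after loss.backward():
    #     torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # clip gradient norm at 1
    #     optimizer.step(); scheduler.step(); optimizer.zero_grad()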
You won't be able to train BERT in 3 epochs.
>We train with batch size of 256 sequences (256 sequences * 512 tokens = 128,000 tokens/batch) for 1,000,000 steps, which is approximately 40 epochs over the 3.3 billion word corpus.
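Quick sanity check of those numbers (tokens and words are conflated here, but it shows where the ~40 epochs comes from):

    seqs_per_batch = 256
    tokens_per_seq = 512
    steps          = 1_000_000
    corpus_words   = 3.3e9

    tokens_per_batch = seqs_per_batch * tokens_per_seq   # 131,072 (~128k, as quoted)
    total_tokens     = tokens_per_batch * steps           # ~1.3e11 tokens seen in pre-training
    print(total_tokens / corpus_words)                     # ~40 passes over the corpus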