* 2017 - the Attention Is All You Need paper proposes transformers, suggests language translation as their primary application
* 2019 - GPT-2 is presented to the public, demonstrating transformer-based LLMs and their emergent capabilities
* 2020 - GPT-3 is released and shows that throwing more compute at LLMs yields significantly better LLMs
* 2022 - ChatGPT is openly released to the public, showcasing the versatility of an LLM-based chatbot
In my experience transformers have been all the rage in the researcher/enthusiast scene since 2019. The technology has just gradually matured enough to become viable for consumer use, which is why you see the industry rushing to adopt it. ChatGPT was the watershed moment for the tech because suddenly anyone in the world could sign up for free, open a chat dialogue and start getting legible LLM output without needing to understand the tech or prompt engineering.
The technology has been a while coming: language models have long been a research area within machine learning, with recurrent models such as vanilla RNNs and LSTMs being an earlier approach, since they allow the model to process a (language) sequence of arbitrary length.
Problems and limitations of recurrent models led to other approaches being tried, using "attention" as a way to let earlier parts of a sequence influence later predictions, culminating in the 2017 "Attention Is All You Need" paper, which introduced the "Transformer" architecture that all the current LLMs are based on.
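For the curious, the core operation that paper introduced, scaled dot-product attention, fits in a few lines. A minimal NumPy sketch (the dimensions and toy inputs are illustrative, not from any real model):

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d)) V.
    Q, K, V: (seq_len, d) arrays."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)        # (seq, seq) pairwise similarities
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V                   # weighted mix of value vectors

# toy example: 4 tokens, model dimension 8, self-attention (Q = K = V)
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
out = scaled_dot_product_attention(X, X, X)
print(out.shape)  # (4, 8)
```

Each output position is a learned weighted average over every input position, which is exactly the "let earlier parts of the sequence impact prediction" idea, done without any recurrence.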
From there it was a matter of scale - scaling up both the models and the amount of data they were trained on. Nobody knew how well this Transformer architecture would perform at scale, but early signs were promising enough to keep pushing to see how much better it could get. OpenAI in particular have been very aggressive in pursuing this scaling with their GPT-N (N=1/2/3..) models. They themselves expressed some surprise at the capabilities of GPT-2, which led to the much larger GPT-3 that is the basis of ChatGPT.
Both OpenAI and others had been leery of publicly releasing these very capable LLMs for fear of ways they might be misused, but eventually OpenAI released GPT-3 (with a bit of human-feedback polish) in the guise of the chatbot ChatGPT, which was the first time the public had seen what the tech was capable of.
The sudden impact of ChatGPT belies the incremental improvements that brought us to this point. It seems to have hit so hard largely because the public had never seen or experienced the steps that got us here, partly because of the highly accessible packaging of the tech as a web-based chatbot, and perhaps partly because it was released without much explanation from OpenAI as to what it was or how it works - they seem quite happy for the public to do what they've done and anthropomorphise it as an AI assistant.
Transformers aren't really a wonderful architecture in the sense of a great fit between the architecture and what we know about the task. (For comparison, I think convolutional networks are.)
What makes Transformers great is:
1. They can handle long sequences without a large increase in the number of parameters to be trained.
2. They parallelize better than previous sequence models, i.e. LSTMs. If we could train LSTMs of the same size and on the same amount of data as current Transformers, they'd probably be just as good.
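To make point 2 concrete, here's a toy sketch (hypothetical weights, NumPy) contrasting the sequential loop a recurrent model is forced into with the single matrix product self-attention uses:

```python
import numpy as np

rng = np.random.default_rng(1)
seq_len, d = 6, 4
X = rng.normal(size=(seq_len, d))       # input sequence of 6 token vectors
Wx = rng.normal(size=(d, d)) * 0.1      # toy recurrent weights
Wh = rng.normal(size=(d, d)) * 0.1

# Recurrent model: each hidden state depends on the previous one,
# so the time steps MUST run one after another.
h = np.zeros(d)
states = []
for x in X:
    h = np.tanh(x @ Wx + h @ Wh)
    states.append(h)

# Self-attention: every position attends to every other position via
# batched matrix products -- there is no sequential dependency, so all
# positions can be computed at once (and in parallel on a GPU).
scores = X @ X.T / np.sqrt(d)
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
out = weights @ X                        # (seq_len, d), computed in one go
```

The RNN loop has a data dependency between iterations; the attention path is just matrix multiplies over the whole sequence, which is what makes training at scale so much cheaper per token.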
Transformers were used to train models on text without needing labeled data. People realized that simply scaling up the data and the models meant better performance. When they scaled even further, emergent capabilities started appearing and the models began dominating every known task. Now everyone wants an LLM.
Transformers are currently the state of the art, i.e. the problems that transformers are currently solving can't be solved better by any other known technique/algorithm. That's the main reason why they are so popular at the moment. LLMs are popular right now because a recent transformer-based neural network has proven to be fun and useful, i.e. GPT-3/ChatGPT - a lot more useful than previous language models, at least.