There was this research: https://www.lesswrong.com/posts/AcKRB8wDpdaN6v6ru/interpreti...
Turns out, GPT-like architectures appear to use the same representation throughout all the layers. So you can use the final head layer as a lens to see what words the network is thinking of as it ... "thinks". That's a bit contrary to what I imagined a GPT-like architecture was doing. I would have assumed that its embedding of ideas changed throughout the network, only reaching a sensible embedding near the end. At least from the perspective of the head layer, that doesn't appear to be the case.
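A minimal sketch of that "lens" idea, with made-up shapes and random weights (nothing here comes from a real GPT checkpoint): project each layer's hidden state through the final unembedding matrix and see which token wins.

```python
import numpy as np

# Toy "logit lens": reuse the *final* unembedding matrix W_U to decode
# every intermediate layer's hidden state. Shapes and names are
# illustrative, not taken from any real model.
rng = np.random.default_rng(0)
d_model, vocab, n_layers = 16, 50, 4

W_U = rng.normal(size=(d_model, vocab))           # final head / unembedding
hiddens = [rng.normal(size=d_model) for _ in range(n_layers)]

for layer, h in enumerate(hiddens):
    logits = h @ W_U                              # same lens at every depth
    print(f"layer {layer}: top token id = {int(np.argmax(logits))}")
```

The surprising empirical finding is that on a trained GPT this decoding produces sensible words at intermediate layers too, not just at the end.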
Facebook's research demonstrates that the MLP layer can be dropped in favor of more attention over learned knowledge vectors: https://ai.facebook.com/blog/making-transformer-networks-sim...
> Reading new Transformer papers makes me feel that training these models requires something akin to black magic when determining the best learning rate schedule, warmup strategy and decay settings.
I found GPT-*'s schedule straightforward. In fact, most training schedules these days are "boring": Adam with warmup and either a linear or cosine decay. CNNs have been doing that for a while, and the hyperparameters are now fairly robust. If you get within an order of magnitude, you'll land within a few percent of optimal accuracy.
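For illustration, here's roughly what such a "boring" schedule looks like; the specific peak LR, warmup length, and step count below are placeholders, not GPT's actual settings.

```python
import math

def lr_schedule(step, max_lr=6e-4, warmup_steps=2000, total_steps=100_000):
    """Linear warmup followed by cosine decay. All constants are
    illustrative placeholders, not any particular model's values."""
    if step < warmup_steps:
        return max_lr * step / warmup_steps                # linear warmup
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return max_lr * 0.5 * (1 + math.cos(math.pi * progress))  # cosine decay

print(lr_schedule(0), lr_schedule(2000), lr_schedule(100_000))
```

Swap the cosine branch for `max_lr * (1 - progress)` and you have the linear-decay variant; in practice the two land within noise of each other.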
The OpenAI paper Scaling Laws for Neural Language Models does a good job of exploring the hyperparameters of GPT-like networks. It's a fascinating read.
That paper suggests another thing about Transformers that we don't understand. Beyond some minimums, the layer count, embedding size, and number of attention heads _aren't important_. The most important factor in the performance of a model is simply the total number of parameters.
That's quite unusual, as the classic intuition is that adding more layers to a model improves performance. Yet for GPT-like architectures that isn't the case. You can get the same performance gains by just increasing the embedding size. 4 heads? 12 heads? Doesn't matter. Weird.
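The paper fits this as a power law in parameter count alone, roughly L(N) = (N_c / N)^α_N. A quick sketch; the constants are the paper's approximate fits, so treat them as illustrative:

```python
def loss_from_params(n_params, n_c=8.8e13, alpha_n=0.076):
    """Approximate scaling-law fit from the paper: L(N) = (N_c / N)**alpha_N.
    Constants are rough published fits; shape matters more than the values."""
    return (n_c / n_params) ** alpha_n

# Doubling the parameter count shrinks loss by a constant factor,
# regardless of how those parameters are split across layers/heads.
ratio = loss_from_params(2e9) / loss_from_params(1e9)
print(ratio)  # 2**-0.076, about 0.949
```

Notice that nothing about depth or head count appears in the formula; that's the whole point.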
> Initially introduced for machine translation, Transformers have gradually replaced RNNs in mainstream NLP. The architecture takes a fresh approach to representation learning: Doing away with recurrence entirely
The more I study Transformers, the more I suspect that their success has more to do with our utter failure to train RNNs. In theory, RNNs have infinite attention. In practice, the only tool we have for optimizing models is backprop, and so to train an RNN we have to use BPTT. This de facto creates a learning horizon. There is absolutely no training signal that tells an RNN to remember something beyond the BPTT horizon. So why would it?
Our training loop for RNNs involves giving them, for example, 1024 tokens, running them for 1024 iterations, and computing loss on 1024 predictions.
A Transformer's training is nearly identical. They get 1024 tokens of context, make 1024 predictions, and compute loss.
The difference? Consider a scenario where the last prediction should be a copy of the first word in the sequence. Very simple, right? Yet for an RNN to learn to do that, it needs to backprop the loss through 1024 iterations of itself. Even a simple one-layer RNN effectively becomes a 1024-layer model. Vanishing gradients turned up to 11!
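You can see the problem with a toy calculation. In a linear scalar "RNN" h_t = w * h_{t-1} + x_t, the gradient of the last output with respect to the first input is just w raised to the 1023rd power:

```python
# Why BPTT over 1024 steps is brutal: in the linear scalar recurrence
# h_t = w * h_{t-1} + x_t, d(h_1024)/d(x_1) = w**1023.
def grad_first_input(w, steps=1024):
    return w ** (steps - 1)

print(grad_first_input(0.99))   # tiny: the training signal is effectively gone
print(grad_first_input(1.01))   # huge: or it explodes instead
```

Gating (LSTM/GRU) softens this but doesn't eliminate it; the gradient path is still a 1024-step chain of multiplications.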
For a Transformer this is dead simple. Its attention mechanism allows the last column of the Transformer to directly use knowledge from the entire sequence. Not only can it see the first token in the sequence, it can see all the computations it previously performed on the sequence, all at once. All with easy backprop. No vanishing gradients here.
So is it really any surprise that Transformers have dominated RNNs? Given the exact same compute, memory, parameters, etc., Transformers make much more efficient use of the resources (during training, at least).
To put it another way, a Transformer is exactly like an RNN. It just uses attention to access history rather than recurrence. And we know that backprop and recurrence are incompatible. So Transformers win.
Of course, that's a huge problem. We have a temporary win. A HUGE win. But Transformers haven't solved what RNNs were meant to solve. By the end of a book, human brains can remember things from the beginning of the book. RNNs trained with BPTT cannot. Transformers cannot. Even Transformers with linear attention mechanisms cannot. There's another leap here left to do.
We need some kind of long-term memory mechanism. A big Transformer model can probably approximate the active parts of a human brain. The image-gpt model and papers using Transformers in place of CNNs for classification tasks have shown that Transformers are a generic substrate. But the big missing puzzle piece is long-term memory. Give a Transformer some method to query a bank of memory and I think we're going to see that next big leap. Plus, with access to a memory bank, a Transformer can free up all the resources it's using today to behave _like_ long-term memory and instead use them for more "thinking".
Yet we have absolutely no mechanisms available to us today to build such a thing. To teach any kind of model to act as long-term memory requires some way to show it an example from its distant past, see how well it remembers it, and then ... backprop that. But we can't, because we can't backprop across an entire book, let alone WikiText or WebText2.
We're really back at square one. We need some way to make the theory of RNNs a reality. Transformers haven't bought us that. They've only bought us a short-term cheat in performance.
EDIT: Just so it's clear, none of my comment is meant as a criticism of the linked article. I actually thought the OP was great and sources a lot of research in trying to understand the mechanisms of Transformers. Really I just springboarded off the article to dump some of my own musings.
[This is a post to Hacker news and I'm making a point to explain a gimmick for giving GPT3 self generated longer term 'memory'.]
Most obvious forms of memory have a problem: they aren't differentiable, so you can't train with them in place. This idea works around the issue because English text contains things like running commentary at times, so a model trained on it already has some idea of how to use it.
[This is a post to Hacker news and I'm making a point to explain a gimmick for giving GPT3 self generated longer term 'memory' and the limitations of other approaches.]
I've had some success at getting this to help generate better text. I wonder, though, if it would be effective to generate a new training corpus this way. E.g., get GPT3 to generate annotations for arbitrary input text using some summarization prompt, then augment the entire training corpus with the summaries injected inline, like virtual thought bubbles delimited by beginning and ending symbols that don't occur in the training material. Then the network is retrained on this augmented data and can generate its own prompts.
Bonus: the operator could be given access to the otherwise-hidden "internal monologue" text, to increase control over the output or learn more about the state of the model.
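A sketch of that augmentation step. Here `summarize` is a hypothetical stand-in for a call to GPT3 with a summarization prompt, and the delimiter tokens are made-up symbols chosen not to occur in the training data:

```python
# Hypothetical corpus augmentation: prepend a machine-generated summary,
# wrapped in sentinel tokens, to each passage before retraining.
THOUGHT_OPEN, THOUGHT_CLOSE = "<|thought|>", "<|/thought|>"

def summarize(passage: str) -> str:
    # Placeholder for "ask GPT3 to summarize this passage" -- here we
    # just take the first sentence so the sketch is runnable.
    return passage.split(".")[0] + "."

def augment(passage: str) -> str:
    note = summarize(passage)
    return f"{THOUGHT_OPEN}{note}{THOUGHT_CLOSE}{passage}"

doc = "The butler was in the library. Later, the detective returned."
print(augment(doc))
```

At generation time the model would emit its own `<|thought|>` spans, which the sampler could hide from the end user but keep in context.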
You can't differentiate across the different executions, due to sampling. But perhaps you don't need to... GPT3 doesn't do any gradient descent to perform one-shot learning, either.
I am guessing this must not work at scale, because it's an obvious enough idea. A similar approach for database access (e.g., have the model generate keywords from the text, inject tokens encoding some text search results for those keywords into the stream, and skip over them in training while keeping them as context, thus training a model that can use a search to improve its results) must have been tried, but I've never heard anyone report it working.
Memory is just association. When Foo is at the input, memory must bring up Bar, Baz, etc., which are associated with Foo, as separate input. Better still if the kind of association (before, after, inside, together, opposite, same, etc.) is stored and retrieved by memory too. Not a hard task by today's standards.
However, long-term memory is orthogonal to AI training. It's a kind of "self-attention" mechanism, because LTM needs to watch the _training process_ and then decide what, when, and how to put input into LTM, and how to associate it with other things already stored there. In short, LTM requires meta-training: watching a lot of training sessions to learn that. It will be hard to define a proper loss function for LTM, so it may be better to implement LTM as a simple non-AI algorithm first. IMHO, the rate of training convergence could be used as the loss function for meta-training the LTM itself.
BTW, LTM also needs a way to translate between input encodings, or a single input encoding must be used for all trainings.
Also, when bringing up associations (memories) for Foo, LTM can also bring up associations for Bar, Baz, etc. For example, LTM can bring up 10 direct (tier-1) associations for Foo, then 3 main tier-2 associations each for Bar, Baz, etc., then 1 tier-3 association for each tier-2 association, and so on, up to e.g. 7 tiers. Beware: it can lead to an "inner monologue" of the machine. :-)
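A toy version of that tiered association scheme; the concept names, relation kinds, and fan-out limits are all illustrative:

```python
# Typed association graph with per-tier fan-out limits: tier 1 retrieves
# up to 10 associations, tier 2 up to 3, tier 3 up to 1. All data invented.
MEMORY = {
    "Foo": [("before", "Bar"), ("opposite", "Baz"), ("inside", "Qux")],
    "Bar": [("together", "Quux")],
    "Baz": [("same", "Corge")],
}
FANOUT = [10, 3, 1]

def recall(key, tier=0, seen=None):
    """Return (tier, kind, source, target) tuples, expanding tier by tier."""
    seen = set() if seen is None else seen
    if tier >= len(FANOUT) or key in seen:
        return []
    seen.add(key)
    out = []
    for kind, other in MEMORY.get(key, [])[:FANOUT[tier]]:
        out.append((tier + 1, kind, key, other))
        out.extend(recall(other, tier + 1, seen))
    return out

for hit in recall("Foo"):
    print(hit)
```

The recursive expansion is exactly where the "inner monologue" risk comes from: each retrieved memory triggers further retrievals until the tier limit cuts it off.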
Actually, the way I see it, the Transformer is a direct descendant of memory-based architectures (NTM, MemNet, stack-based RNNs...) that is both expressive and easy to train.
But you are right: they are trained on next-word prediction, so there's no long-term memory. I imagine people are working on Transformers with a memory bank. But RNNs seem to be the brute-force solution here... what I am guessing is that you need to maintain some kind of index to decide where to backprop. If it hasn't been discovered yet, I bet it will be some kind of Bloom filter.
Because they generalize. Char-RNNs learn to balance parentheses separated by a longer distance than the BPTT window because they've learned that counting parentheses is useful for prediction on parenthetical statements shorter than the BPTT window.