There was this research: https://www.lesswrong.com/posts/AcKRB8wDpdaN6v6ru/interpreti...
Turns out, GPT-like architectures appear to use the same representation throughout all the layers. So you can use the final head layer as a lens to see what words the network is thinking of as it ... "thinks". That's a bit contrary to what I imagined a GPT-like architecture was doing. I would have assumed that its embedding of ideas changed throughout the network, only reaching a sensible embedding near the end. At least from the perspective of the head layer, that doesn't appear to be the case.
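A minimal sketch of that "lens" idea, with made-up shapes and random weights (nothing here comes from a real GPT checkpoint): project each layer's hidden state through the final unembedding matrix and see which token wins.

```python
import numpy as np

# Toy "logit lens": reuse the *final* unembedding matrix W_U to decode
# every intermediate layer's hidden state. Shapes and names are
# illustrative, not taken from any real model.
rng = np.random.default_rng(0)
d_model, vocab, n_layers = 16, 50, 4

W_U = rng.normal(size=(d_model, vocab))           # final head / unembedding
hiddens = [rng.normal(size=d_model) for _ in range(n_layers)]

for layer, h in enumerate(hiddens):
    logits = h @ W_U                              # same lens at every depth
    print(f"layer {layer}: top token id = {int(np.argmax(logits))}")
```

The surprising empirical finding is that on a trained GPT this decoding produces sensible words at intermediate layers too, not just at the end.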
Facebook's research demonstrates that the MLP layer can be dropped in favor of more attention over learned knowledge vectors: https://ai.facebook.com/blog/making-transformer-networks-sim...
> Reading new Transformer papers makes me feel that training these models requires something akin to black magic when determining the best learning rate schedule, warmup strategy and decay settings.
I found GPT-*'s schedule straightforward. In fact, most training schedules these days are "boring": Adam with warmup and either a linear or cosine decay. CNNs have been doing that for a while, and the hyperparameters are now fairly robust. If you get within an order of magnitude, you'll land within a few percent of optimal accuracy.
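For illustration, here's roughly what such a "boring" schedule looks like; the specific peak LR, warmup length, and step count below are placeholders, not GPT's actual settings.

```python
import math

def lr_schedule(step, max_lr=6e-4, warmup_steps=2000, total_steps=100_000):
    """Linear warmup followed by cosine decay. All constants are
    illustrative placeholders, not any particular model's values."""
    if step < warmup_steps:
        return max_lr * step / warmup_steps                # linear warmup
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return max_lr * 0.5 * (1 + math.cos(math.pi * progress))  # cosine decay

print(lr_schedule(0), lr_schedule(2000), lr_schedule(100_000))
```

Swap the cosine branch for `max_lr * (1 - progress)` and you have the linear-decay variant; in practice the two land within noise of each other.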
The OpenAI paper Scaling Laws for Neural Language Models does a good job of exploring the hyperparameters of GPT-like networks. It's a fascinating read.
That paper suggests another thing about Transformers that we don't understand. Beyond some minimums, the layer count, embedding size, and number of attention heads _aren't important_. The most important factor in the performance of a model is simply the total number of parameters.
That's quite unusual, as the classic intuition is that adding more layers to a model improves performance. Yet for GPT-like architectures that isn't the case. You can get the same performance gains by just increasing the embedding size. 4 heads? 12 heads? Doesn't matter. Weird.
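The paper fits this as a power law in parameter count alone, roughly L(N) = (N_c / N)^α_N. A quick sketch; the constants are the paper's approximate fits, so treat them as illustrative:

```python
def loss_from_params(n_params, n_c=8.8e13, alpha_n=0.076):
    """Approximate scaling-law fit from the paper: L(N) = (N_c / N)**alpha_N.
    Constants are rough published fits; shape matters more than the values."""
    return (n_c / n_params) ** alpha_n

# Doubling the parameter count shrinks loss by a constant factor,
# regardless of how those parameters are split across layers/heads.
ratio = loss_from_params(2e9) / loss_from_params(1e9)
print(ratio)  # 2**-0.076, about 0.949
```

Notice that nothing about depth or head count appears in the formula; that's the whole point.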
> Initially introduced for machine translation, Transformers have gradually replaced RNNs in mainstream NLP. The architecture takes a fresh approach to representation learning: Doing away with recurrence entirely
The more I study Transformers, the more I suspect that their success has more to do with our utter failure to train RNNs. In theory, RNNs have infinite attention. In practice, the only tool we have for optimizing models is backprop, and so to train an RNN we have to use BPTT. This de facto creates a learning horizon. There is absolutely no training signal that tells an RNN to remember something beyond the BPTT horizon. So why would it?
Our training loop for RNNs involves giving them, for example, 1024 tokens, running them for 1024 iterations, and computing loss on 1024 predictions.
A Transformer's training is nearly identical. They get 1024 tokens of context, make 1024 predictions, and compute loss.
The difference? Consider a scenario where the last prediction should be a copy of the first word in the sequence. Very simple, right? Yet for an RNN to learn to do that, it needs to backprop the loss through 1024 iterations of itself. Even a simple one-layer RNN effectively becomes a 1024-layer model. Vanishing gradients turned up to 11!
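You can see the problem with a toy calculation. In a linear scalar "RNN" h_t = w * h_{t-1} + x_t, the gradient of the last output with respect to the first input is just w raised to the 1023rd power:

```python
# Why BPTT over 1024 steps is brutal: in the linear scalar recurrence
# h_t = w * h_{t-1} + x_t, d(h_1024)/d(x_1) = w**1023.
def grad_first_input(w, steps=1024):
    return w ** (steps - 1)

print(grad_first_input(0.99))   # tiny: the training signal is effectively gone
print(grad_first_input(1.01))   # huge: or it explodes instead
```

Gating (LSTM/GRU) softens this but doesn't eliminate it; the gradient path is still a 1024-step chain of multiplications.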
For a Transformer this is dead simple. Its attention mechanism allows the last column of the Transformer to directly use knowledge from the entire sequence. Not only can it see the first token in the sequence, it can see all the computations it previously performed on the sequence, all at once. All with easy backprop. No vanishing gradients here.
So is it really any surprise that Transformers have dominated RNNs? Given the exact same compute, memory, parameters, etc., Transformers make much more efficient use of the resources (during training, at least).
To put it another way, a Transformer is exactly like an RNN. It just uses attention to access history rather than recurrence. And we know that backprop and recurrence are incompatible. So Transformers win.
Of course, that's a huge problem. We have a temporary win. A HUGE win. But Transformers haven't solved what RNNs were meant to solve. By the end of a book, human brains can remember things from the beginning of the book. RNNs trained with BPTT cannot. Transformers cannot. Even Transformers with linear attention mechanisms cannot. There's another leap here left to do.
We need some kind of long-term memory mechanism. A big Transformer model can probably approximate the active parts of a human brain. The image-gpt model and papers using Transformers in place of CNNs for classification tasks have shown that Transformers are a generic substrate. But the big missing puzzle piece is long-term memory. Give a Transformer some method to query a bank of memory and I think we're going to see that next big leap. Plus, with access to a memory bank, a Transformer can free up all the resources it's using today to behave _like_ long-term memory and instead use them for more "thinking".
Yet we have absolutely no mechanisms available to us today to build such a thing. To teach any kind of model to act as long-term memory requires some way to show it an example from its distant past, see how well it remembers it, and then ... backprop that. But we can't, because we can't backprop across an entire book, let alone WikiText or WebText2.
We're really back at square one. We need some way to make the theory of RNNs a reality. Transformers haven't bought us that. They've only bought us a short-term cheat in performance.
EDIT: Just so it's clear, none of my comment is meant as a criticism of the linked article. I actually thought the OP was great and sources a lot of research in trying to understand the mechanisms of Transformers. Really I just springboarded off the article to dump some of my own musings.
[This is a post to Hacker news and I'm making a point to explain a gimmick for giving GPT3 self generated longer term 'memory'.]
Most obvious forms of memory have a problem: they aren't differentiable, so you can't train with them in place. This idea works around the issue because English text contains things like running commentary at times, so a model trained on it already has some idea of how to use it.
[This is a post to Hacker news and I'm making a point to explain a gimmick for giving GPT3 self generated longer term 'memory' and the limitations of other approaches.]
I've had some success at getting this to help generate better text. I wonder, though, if it would be effective to generate a new training corpus this way. E.g., get GPT3 to generate annotations for arbitrary input text using some summarization prompt, then augment the entire training corpus with the summaries injected inline, like virtual thought bubbles delimited by beginning and ending symbols that don't occur in the training material. Then the network is retrained on this augmented data and can generate its own prompts.
Bonus: the operator could be given access to the otherwise-hidden "internal monologue" text, to increase control over the output or learn more about the state of the model.
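A sketch of that augmentation step. Here `summarize` is a hypothetical stand-in for a call to GPT3 with a summarization prompt, and the delimiter tokens are made-up symbols chosen not to occur in the training data:

```python
# Hypothetical corpus augmentation: prepend a machine-generated summary,
# wrapped in sentinel tokens, to each passage before retraining.
THOUGHT_OPEN, THOUGHT_CLOSE = "<|thought|>", "<|/thought|>"

def summarize(passage: str) -> str:
    # Placeholder for "ask GPT3 to summarize this passage" -- here we
    # just take the first sentence so the sketch is runnable.
    return passage.split(".")[0] + "."

def augment(passage: str) -> str:
    note = summarize(passage)
    return f"{THOUGHT_OPEN}{note}{THOUGHT_CLOSE}{passage}"

doc = "The butler was in the library. Later, the detective returned."
print(augment(doc))
```

At generation time the model would emit its own `<|thought|>` spans, which the sampler could hide from the end user but keep in context.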
You can't differentiate across the different executions, due to sampling. But perhaps you don't need to... GPT3 doesn't do any gradient descent to perform one-shot learning, either.
I am guessing this must not work at scale, because it's an obvious enough idea. A similar approach for database access (e.g., have the model generate keywords from the text, inject tokens encoding some text search results for those keywords into the stream, and skip over them in training while keeping them as context, thus training a model that can use a search to improve its results) must have been tried, but I've never heard anyone report it working.
Memory is just association. When Foo is at the input, memory must bring up Bar, Baz, etc., which are associated with Foo, as separate input. Better still if the kind of association (before, after, inside, together, opposite, same, etc.) is stored and retrieved by memory too. Not a hard task by today's standards.
However, long-term memory is orthogonal to AI training. It's a kind of "self-attention" mechanism, because LTM needs to watch the _training process_ and then decide what, when, and how to put input into LTM, and how to associate it with other things already stored there. In short, LTM requires meta-training: watching a lot of training sessions to learn that. It will be hard to define a proper loss function for LTM, so it may be better to implement LTM as a simple non-AI algorithm first. IMHO, the rate of training convergence could be used as the loss function for meta-training the LTM itself.
BTW, LTM also needs a way to translate between input encodings, or a single input encoding must be used for all trainings.
Also, when bringing up associations (memories) for Foo, LTM can also bring up associations for Bar, Baz, etc. For example, LTM can bring up 10 direct (tier-1) associations for Foo, then 3 main tier-2 associations each for Bar, Baz, etc., then 1 tier-3 association for each tier-2 association, and so on, up to e.g. 7 tiers. Beware: it can lead to an "inner monologue" of the machine. :-)
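A toy version of that tiered association scheme; the concept names, relation kinds, and fan-out limits are all illustrative:

```python
# Typed association graph with per-tier fan-out limits: tier 1 retrieves
# up to 10 associations, tier 2 up to 3, tier 3 up to 1. All data invented.
MEMORY = {
    "Foo": [("before", "Bar"), ("opposite", "Baz"), ("inside", "Qux")],
    "Bar": [("together", "Quux")],
    "Baz": [("same", "Corge")],
}
FANOUT = [10, 3, 1]

def recall(key, tier=0, seen=None):
    """Return (tier, kind, source, target) tuples, expanding tier by tier."""
    seen = set() if seen is None else seen
    if tier >= len(FANOUT) or key in seen:
        return []
    seen.add(key)
    out = []
    for kind, other in MEMORY.get(key, [])[:FANOUT[tier]]:
        out.append((tier + 1, kind, key, other))
        out.extend(recall(other, tier + 1, seen))
    return out

for hit in recall("Foo"):
    print(hit)
```

The recursive expansion is exactly where the "inner monologue" risk comes from: each retrieved memory triggers further retrievals until the tier limit cuts it off.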
Actually, the way I see it, the Transformer is a direct descendant of memory-based architectures (NTM, MemNet, stack-based RNNs...) that is both expressive and easy to train.
But you are right: they are trained on next-word prediction, so there's no long-term memory. I imagine people are working on Transformers with a memory bank. But RNNs seem to be the brute-force solution here... what I am guessing is that you need to maintain some kind of index to decide where to backprop. If it hasn't been discovered yet, I bet it will be some kind of Bloom filter.
Because they generalize. Char-RNNs learn to balance parentheses separated by a longer distance than the BPTT window because they've learned that counting parentheses is useful for prediction on parenthetical statements shorter than the BPTT window.