GPT-2 was a solution to a certain kind of problem: is it possible to throw out the idea of representing language while still having good performance on a variety of language tasks? But there doesn't seem to be anything like that for "memory," which seems separate and distinct from "tasks."
In concrete terms, I'm interested in models that have memory in the sense of "When you inference from the model, you leave a lasting impact on the model." Inferencing from the model should cause a change in the model parameters. Yet most models currently seem to assume that model parameters can be frozen without losing something essential to the end goal.
We're very task-oriented. But none of these chat bots can remember my name, or anything about me, and it's always bugged me. GPT-2 (and now GPT-3) punts the problem to a sufficiently clever programmer: just figure out how to encode all the "memory" into the context window, and then out pops the results you want. But that feels rather like arguing "Just come up with a technique that works, and it will work." Perhaps it's true, but not too helpful.
If you hear the same name a few times, you'll remember it a long time, and start associating it with someone's face. It seems like language models could do something similar. I don't know precisely what; maybe someone here does.
You could designate part of the model as long-term memory, short-term, etc. Inferencing from the model could cause larger effects in the short-term area than the long-term area (equivalent to a higher learning rate).
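That split can be sketched in a few lines of plain Python. Everything here (the `MemoryBank` class, the learning-rate constants) is made up for illustration; the point is just that the same "gradient" moves the short-term bank a lot and the long-term bank barely at all:

```python
# Hypothetical sketch: two parameter banks updated at different rates on
# every inference, so "short-term" memory changes fast and "long-term" slowly.
STM_LR = 0.5   # short-term bank: large learning rate, quick to change
LTM_LR = 0.01  # long-term bank: small learning rate, slow to change

class MemoryBank:
    def __init__(self, size, lr):
        self.weights = [0.0] * size
        self.lr = lr

    def update(self, gradient):
        # plain SGD step; imagine every inference triggering one of these
        self.weights = [w - self.lr * g for w, g in zip(self.weights, gradient)]

stm = MemoryBank(4, STM_LR)
ltm = MemoryBank(4, LTM_LR)

grad = [1.0, -1.0, 0.5, 0.0]  # pretend gradient from one inference step
stm.update(grad)
ltm.update(grad)
# stm.weights[0] is now -0.5 (moved a lot), ltm.weights[0] is -0.01 (barely)
```

In a real model the "banks" would be parameter groups with per-group learning rates, but the asymmetry is the same.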
They're not exactly what you describe, which I suppose is a truly online model that knows how/when to update its own parameters.
But those two models do incorporate a similar concept of external memory, whereby the controller is trained via backprop to read/write to a tensor (essentially a form of soft-addressable memory available at inference time).
As far as I recall, these were never applied beyond toy problems, and it seems this line of research hasn't been very active (at least since the Transformer's "memorize all the things" approach started performing exceptionally well on all the benchmarks). I haven't read the paper you linked just yet - it may well be relevant.
This idea of a model updating its own weights is not new. Schmidhuber has done some work on this (http://people.idsia.ch/~juergen/deep-learning-miraculous-yea...). The main idea there is that the model can even modify itself, so there aren't two separate nets (see Schmidhuber, "Steps towards `self-referential' learning").
Online learning / continual learning is yet another (orthogonal) topic. This is the setting where new (training) data becomes available all the time, and the model should use that data, i.e. all input (from inference) is used to update and train the model further. This can be done by standard backpropagation. The problem is usually to overcome catastrophic forgetting in this case. See for example: https://deepmind.com/blog/article/enabling-continual-learnin...
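The approach in that DeepMind post (elastic weight consolidation) boils down to a quadratic penalty that anchors each weight to its old value, scaled by how important the weight was for the old task. A toy sketch, with all names (`ewc_penalty`, `fisher`, `lam`) my own rather than any library's API:

```python
# EWC-style penalty against catastrophic forgetting: while training on
# task B, add lam/2 * sum_i F_i * (w_i - w*_i)^2, where w* are the weights
# after task A and F_i estimates how important weight i was for task A.
def ewc_penalty(weights, old_weights, fisher, lam):
    return 0.5 * lam * sum(
        f * (w - w_old) ** 2
        for f, w, w_old in zip(fisher, weights, old_weights)
    )

old = [1.0, -2.0, 0.5]       # weights after learning task A
fisher = [10.0, 0.1, 1.0]    # weight 0 mattered a lot for task A
new = [1.1, 0.0, 0.5]        # candidate weights while learning task B

# moving weight 0 (important) is penalized heavily; moving weight 1
# (unimportant) is nearly free, even though it moved 20x further
penalty = ewc_penalty(new, old, fisher, lam=1.0)
```

Training on new data then minimizes `task_loss + penalty`, so the weights that encoded the old task resist being overwritten.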
I'd imagine if the toolchain problem was solved we'd see a lot more research in this direction.
 Commonsense Knowledge Aware Conversation Generation with Graph Attention https://www.ijcai.org/Proceedings/2018/0643.pdf
 A Knowledge-Grounded Neural Conversation Model https://arxiv.org/pdf/1702.01932.pdf
 Conversing by Reading: Contentful Neural Conversation with On-demand Machine Reading https://arxiv.org/pdf/1906.02738.pdf
The main problem with recurrent models is that it's hard to train them with backprop. For example, GPT-3 can handle sequences up to ~2000 tokens; I'm not sure what the largest sequence LSTMs could be trained on, but it was probably less.
> I'm sure somebody somewhere is working to make a transformer with recurrency. The neural turing machine mentioned in another comment is such an example but it seems to have been abandoned.
Yeah, there's a bunch of Transformer variants which either use recurrency, compression for long-range, or efficient attention approximation for windows so large as to obviate recurrency. The NTM hasn't been shown useless so much as alternatives like Transformers proven to be way easier to implement & scale up to get similar performance, but it pops up occasionally; a particularly surprising recent appearance was Nvidia's GameGAN which uses a NTM-like memory module for learning to model Pac-Man: https://nv-tlabs.github.io/gameGAN/
In your specific chatbot use case the obvious external memory to present to the bot is the full chat history. Alternatively you can manually extract features from the full chat history and present those instead.
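A minimal sketch of the first option, prepending the transcript to every prompt and trimming from the oldest turn when it outgrows the context budget (the word-count token estimate and the 2000 budget are placeholder choices, not how a real tokenizer works):

```python
# "The external memory is just the chat history": every new prompt carries
# the tail of the full transcript, so earlier facts stay in context.
def build_prompt(history, new_message, max_tokens=2000):
    lines = [f"{speaker}: {text}" for speaker, text in history]
    lines.append(f"User: {new_message}")
    lines.append("Bot:")
    # crude token estimate: drop the oldest turns until the prompt fits
    while len(" ".join(lines).split()) > max_tokens and len(lines) > 2:
        lines.pop(0)
    return "\n".join(lines)

history = [("User", "My name is Ada."), ("Bot", "Nice to meet you, Ada!")]
prompt = build_prompt(history, "What's my name?")
# the model now sees "My name is Ada." inside its context window
```

The second option (manual feature extraction) replaces the raw transcript with a handful of distilled facts, which is much cheaper in context space.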
Language models are just always trying to predict the next character.
Language models are kind of a jack of all trades. They are generic, and given enough parameters, enough data, and enough compute, they will learn to solve all tasks simultaneously. But they are not incentivized to solve the task you are interested in; they are just trying to learn how to best predict the next character with the finite capacity they have, and will only learn memory insofar as it helps them achieve that task.
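To make "predict the next character" concrete, here is the smallest possible version of that objective, a character bigram model that counts successors and predicts the most frequent one. Real LMs do this with billions of parameters and far more context, but the objective is the same:

```python
from collections import Counter, defaultdict

def train_bigram(text):
    # count which character follows which in the training text
    nxt = defaultdict(Counter)
    for a, b in zip(text, text[1:]):
        nxt[a][b] += 1
    return nxt

def predict_next(model, char):
    # predict the most frequent successor seen in training
    return model[char].most_common(1)[0][0]

model = train_bigram("banana bandana")
predict_next(model, "a")  # 'n' follows 'a' more often than ' ' does
```

Everything a big LM "knows", names included, is whatever statistical structure happens to help with this one prediction game.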
If you are interested in a specific task you can inject your specific desires either into the structure of the model, or in the structure of the dataset.
There are various lines of thought when you want to achieve a specific task.
-The GPT approach is to make the model bigger and bigger and solve tasks via one-shot or zero-shot learning: not touching the structure of the model anymore, but training it on tasks encoded as structured text.
-You can find some literature on question-answering tasks. You can have a transformer "encode" a big document and answer a question by extracting just the relevant information.
-This means, for example, that if you want your chatbot to remember your name or any information you gave it before, one quick way is to give it your whole chat history (it's probably not that big). It's akin to making the context window big. You can use tricks like LSH transformers, or some form of hierarchical memory, to avoid being too memory-constrained by the attention.
-You can also encode "manually": train a separate neural network to answer the questions you are interested in from the chat history, like "What is his name?", "How old is he?", ... and save the answers as context information that will be presented as input knowledge for your specific chatbot training task.
-You can also update the weights continuously, as is done in reinforcement learning. Fixed-size neural networks need to be presented the information multiple times before being able to ingest it, so you'll need to use some kind of replay memory. It's also not a great idea to update all billions of parameters every time you want to retain a piece of information, so you will probably need to use sparse operations such that only a few parameters are updated at a time. If you look hard enough you'll probably notice that most traditional database operations can be encoded as sparse neural network operations. If you look even harder you'll probably notice that a lot of information retrieval algorithms are just gradient descent on some form of sparsely encoded neural network operations.
-You can also use GANs for text, so that the loss function being optimized is closer to the task at hand.
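The "manual encoding" option above can be sketched end to end. Here a trivial regex plays the role of the separate QA network (in practice you would swap in a trained extractive QA model; `extract_name` and the `context` dict are illustrative names, not a real API):

```python
import re

# Stand-in for "train a separate network to answer questions from the
# chat history": extract a fact once, save it as a context feature that
# the chatbot receives as input knowledge on every later turn.
def extract_name(chat_history):
    for line in chat_history:
        m = re.search(r"\bmy name is (\w+)", line, re.IGNORECASE)
        if m:
            return m.group(1)
    return None

chat_history = ["hello there", "My name is Ada", "what can you do?"]
context = {"user_name": extract_name(chat_history)}
# the bot's prompt now includes user_name without carrying the whole history
```

The win is that a few distilled facts cost almost no context space compared to replaying the entire transcript.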
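The continuous-update option above combines two mechanisms, a replay memory and sparse updates, and both fit in a short sketch. All names here are illustrative, not any particular library's API:

```python
import random

# (1) Replay buffer: keep recent examples and re-present them alongside
# new data, so a fixed-size net sees each piece of information many times.
class ReplayBuffer:
    def __init__(self, capacity):
        self.capacity = capacity
        self.items = []

    def add(self, example):
        if len(self.items) >= self.capacity:
            self.items.pop(0)          # drop the oldest example
        self.items.append(example)

    def sample(self, k):
        return random.sample(self.items, min(k, len(self.items)))

# (2) Sparse update: touch only the k parameters with the largest
# gradient magnitude instead of all billions of them.
def sparse_update(weights, gradient, lr, k):
    top = sorted(range(len(gradient)), key=lambda i: abs(gradient[i]))[-k:]
    for i in top:
        weights[i] -= lr * gradient[i]
    return weights

buf = ReplayBuffer(capacity=3)
for ex in ["a", "b", "c", "d"]:
    buf.add(ex)                        # "a" falls out once capacity is hit

w = [0.0, 0.0, 0.0, 0.0]
g = [0.01, 5.0, -4.0, 0.1]
sparse_update(w, g, lr=0.1, k=2)       # only indices 1 and 2 move
```

The top-k selection here is the simplest possible sparsity rule; the database analogy in the comment is that such an update behaves like a keyed write touching a few rows rather than a full-table rewrite.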
There are a lot of graphs in the paper, though.
The GPT-3 paper (https://arxiv.org/pdf/2005.14165.pdf) has a very large number of graphs showing parameter count on the horizontal axis and some kind of prediction quality metric on the vertical axis. Most of the interesting ones are in Appendix H.
"The point isn’t the performance at 175B, but the shape of the curve as it passes from 117M to 175B" was referring to a general point about how to interpret any/all of those graphs, not a particular one of them.
It would be interesting to read an article that focused on that, for contrast, and to give insight into what is still lacking.
There's a whole lot of acronyms and basically zero context. The language is so generic you could swap out many of the terms and acronyms with random ones and it would still seem like it makes sense.
When I read the title my first thought was, "Wait: When did GUID Partition Table format version 2 come out? They're already talking about version 3‽"