The authors have devised a clever dual-masking-plus-caching mechanism to induce an attention-based model to learn to predict tokens from all possible permutations of the factorization order of all other tokens in the same input sequence. In expectation, the model learns to gather information from all positions on both sides of each token in order to predict the token.
For example, if the input sequence has four tokens, ["The", "cat", "is", "furry"], in one training step the model will try to predict "is" after seeing "The", then "cat", then "furry". In another training step, the model might see "furry" first, then "The", then "cat". Note that the original sequence order is always retained, e.g., the model always knows that "furry" is the fourth token.
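Since the permutation only changes the factorization order and never the positions, one toy way to picture it is as an attention mask derived from a sampled order. Below is a minimal sketch of that masking idea in plain Python; it is not the paper's actual two-stream attention or caching scheme, just an illustration of which tokens each target gets to see:

```python
# Toy sketch (not the paper's two-stream implementation): build the
# attention mask implied by one sampled factorization order. Each
# target may only attend to tokens that come *earlier* in the sampled
# order, while every token keeps its original position index.
import random

tokens = ["The", "cat", "is", "furry"]          # original order is never changed
order = list(range(len(tokens)))
random.shuffle(order)                            # e.g. [3, 0, 1, 2] -> "furry" first

rank = {pos: r for r, pos in enumerate(order)}   # rank of each position in the sampled order

# mask[i][j] == True  ->  the prediction at position i may attend to position j
mask = [[rank[j] < rank[i] for j in range(len(tokens))]
        for i in range(len(tokens))]

for i, tok in enumerate(tokens):
    visible = [tokens[j] for j in range(len(tokens)) if mask[i][j]]
    print(f"predict {tok!r} (position {i}) from {visible}")
```

Averaged over many sampled orders, each token ends up being predicted from contexts drawn from both sides, which is the "in expectation" claim above.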
The masking-and-caching algorithm that accomplishes this does not seem trivial to me.
The improvements to SOTA performance in a range of tasks are significant -- see tables 2, 3, 4, 5, and 6 in the paper.
The big value of BERT is that it is publicly available, and can be modified and improved.
Given that we still don't understand the properties of deep nets anywhere near as well as we might like to think, it gives me the gnawing feeling that we're missing something.
This isn’t people banging GPUs together in desperation because nothing works. This is what happens when everything works better and better the more money you throw at it with no apparent end.
To boil it down a little, the point I'm making is that the incremental gain from new ideas seems, in my view, to be slowly decreasing - which is perfectly natural! It makes sense that this would happen.
Where that thought leads me, though, is the idea that we may be slowly moving towards the exploitation half of the explore/exploit choice. I'm not saying greenfield ML research is gone by any means, only that a lot of work presented as such may in fact not be, because in many cases it doesn't really fundamentally re-think the problem being addressed.
> it is obvious that XLNet always learns more dependency pairs given the same target and contains “denser” effective training signals
BERT is only masking 15% of the tokens, so isn't the number of dependency pairs only something like 18% higher at most?
That's a small but significant difference, which lets them improve performance by a correspondingly modest but real amount.
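To make that back-of-the-envelope figure concrete, here is a rough count under the assumption that each prediction target contributes one dependency pair for every token it's allowed to condition on, and that XLNet predicts the same 15% of targets but each one can (in expectation) see all other tokens. The numbers are illustrative only, not the paper's exact accounting, and the 512 sequence length is just a placeholder:

```python
# Rough (target, context) pair count per sequence under the stated assumptions.
T = 512                      # hypothetical sequence length
mask_rate = 0.15             # fraction of tokens BERT masks and predicts

bert_pairs = (mask_rate * T) * ((1 - mask_rate) * T)   # BERT targets see only the unmasked 85%
xlnet_pairs = (mask_rate * T) * (T - 1)                # same targets, but each can see all other tokens

print(xlnet_pairs / bert_pairs)   # ~1.17-1.18, i.e. roughly 18% more pairs
```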