
XLNet: Generalized Autoregressive Pretraining for Language Understanding - asparagui
https://arxiv.org/abs/1906.08237
======
cs702
This is NOT "just throwing more compute" at the problem.

The authors have devised a clever dual-masking-plus-caching mechanism that
induces an attention-based model to learn to predict each token from _all
possible permutations_ of the factorization order of the other tokens in the
same input sequence. In expectation, the model learns to gather information
from all positions on both sides of each token in order to predict it.

For example, if the input sequence has four tokens, ["The", "cat", "is",
"furry"], in one training step the model will try to predict "is" after seeing
"The", then "cat", then "furry". In another training step, the model might see
"furry" first, then "The", then "cat". Note that the original sequence order
is always retained, e.g., the model always knows that "furry" is the fourth
token.
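
To make the mechanics concrete, here is a toy Python sketch (my own
illustration, not the paper's actual two-stream attention) of how a sampled
factorization order can be turned into an attention mask over the original
positions:

    # Toy permutation-LM mask, not XLNet's real implementation.
    # perm[k] is the position predicted at step k. A position may
    # attend only to positions that come earlier in the sampled
    # order, while positional encodings still reflect the original
    # sequence order (so "furry" is always known to be 4th).
    def permutation_attention_mask(perm):
        rank = {pos: step for step, pos in enumerate(perm)}
        n = len(perm)
        return [[rank[j] < rank[i] for j in range(n)] for i in range(n)]

    tokens = ["The", "cat", "is", "furry"]
    perm = [0, 1, 3, 2]  # predict "is" last: after "The", "cat", "furry"
    for i, row in enumerate(permutation_attention_mask(perm)):
        print(f"{tokens[i]:>5}", ["x" if v else "." for v in row])

Averaged over many sampled permutations, every position gets to attend to
every other position, which is the "both sides in expectation" property
described above.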

The masking-and-caching algorithm that accomplishes this does not seem trivial
to me.

The improvements to SOTA performance across a range of tasks are significant
-- see Tables 2 through 6 in the paper.

~~~
riku_iki
They used more compute than BERT and much more data for training. They also
still underperform compared to the best BERT-based models on the SQuAD 2.0
benchmark.

The big value of BERT is that it is publicly available, and can be modified
and improved.

~~~
roseway4
An English pretrained model is available. Also, the SQuAD 2.0 EM numbers are
higher than the BERT results published in the BERT paper. Subsequent work has
improved on these and no doubt XLNet will see similar activity.

------
s_Hogg
It's nice to see people managing to push BERT further and get SOTA on stuff,
but I feel like a fair amount of this sort of thing is really just throwing
more and more compute at a problem.

Given that we still don't fundamentally understand the properties of deep nets
anywhere near as well as we might like to think, it gives me the gnawing
feeling that we're missing something.

~~~
goodside
This concern is vague, frequently raised, and not especially useful. What are
you advocating, specifically? That the amount of compute being thrown at
problems is excessive relative to the economic benefit that Google (or
whoever) derives from training models as long as they do? This is an empirical
question, and it’s well understood how much compute is useful. Are you saying
that advances in ML theory are coming too slowly because we focus too much
on hardware? I spend maybe a third of my day job reading journal articles, and
the rate of ideas coming out is blistering.

This isn’t people banging GPUs together in desperation because nothing works.
This is what happens when everything works better and better the more money
you throw at it with no apparent end.

~~~
s_Hogg
I accept all of what you're saying and also spend a lot of my day reading ML
papers :)

To boil it down a little, perhaps: the point I'm making is that the
incremental gain from new ideas seems, in my view, to be slowly decreasing -
which is perfectly natural! It obviously makes sense that that would happen.

Where that thought leads me, though, is the idea that we may be slowly moving
towards the exploitation half of the explore/exploit trade-off. I'm not saying
greenfield ML research is gone by any means, only that I think a lot of work
presented as such may in fact not be, because in many cases it doesn't really
rethink the problem being addressed at a fundamental level.

------
jamesblonde
What impresses me most about this paper is that it is primarily from CMU, with
5 authors there and 1 from Google (the legend Quoc Le from Google; Ruslan
Salakhutdinov is also excellent). Even though the CMU team may have had the
original ideas, including the dual-masking-plus-caching contributions, I guess
the golden-age-of-ML adage still holds: "it's a golden age so long as you have
access to massive amounts of compute and storage".

------
jeremysalwen
I don't get the point they are trying to make about BERT not learning
dependencies between masked words. Isn't the mask randomly chosen each time,
so it has a chance to learn with all possible words unmasked?
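
Concretely, I mean something like this per-step masking (a toy sketch that
ignores BERT's 80/10/10 mask/replace/keep rule):

    import random

    # A fresh random mask is drawn every training step, so across
    # steps each token appears both masked and unmasked.
    def sample_masked_input(tokens, mask_prob=0.15):
        corrupted = list(tokens)
        masked_positions = []
        for i in range(len(tokens)):
            if random.random() < mask_prob:
                corrupted[i] = "[MASK]"
                masked_positions.append(i)
        return corrupted, masked_positions

    # If "New" and "York" are both masked in the same step, BERT
    # predicts each from the shared context alone; neither prediction
    # conditions on the other masked token.
    print(sample_masked_input(["New", "York", "is", "a", "city"]))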

~~~
psb217
Most of the statements they make regarding orderless autoregression, including
statements about the "independence assumption" made by BERT, are misleading at
best.
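
For reference, here are the two training objectives, roughly as the paper
writes them; the \approx step in BERT's objective is the "independence
assumption" the argument rests on:

    % BERT: masked tokens \bar{x} are predicted independently given
    % the corrupted input \hat{x}; m_t = 1 iff x_t is masked.
    \max_\theta \; \log p_\theta(\bar{x} \mid \hat{x})
        \approx \sum_{t=1}^{T} m_t \log p_\theta(x_t \mid \hat{x})

    % XLNet: expected autoregressive likelihood over sampled
    % factorization orders z, with no independence assumption.
    \max_\theta \; \mathbb{E}_{z \sim \mathcal{Z}_T}
        \Big[ \sum_{t=1}^{T} \log p_\theta(x_{z_t} \mid x_{z_{<t}}) \Big]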

~~~
jeremysalwen
Thank you, it seemed that way to me, but I thought maybe there was something I
wasn't understanding.

