XLNet: Generalized Autoregressive Pretraining for Language Understanding (arxiv.org)
79 points by asparagui 30 days ago | 15 comments

This is NOT "just throwing more compute" at the problem.

The authors have devised a clever dual-masking-plus-caching mechanism to induce an attention-based model to learn to predict tokens from all possible permutations of the factorization order of all other tokens in the same input sequence. In expectation, the model learns to gather information from all positions on both sides of each token in order to predict the token.

For example, if the input sequence has four tokens, ["The", "cat", "is", "furry"], in one training step the model will try to predict "is" after seeing "The", then "cat", then "furry". In another training step, the model might see "furry" first, then "The", then "cat". Note that the original sequence order is always retained, e.g., the model always knows that "furry" is the fourth token.
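A toy sketch of that schedule (my own illustration, not the paper's two-stream attention implementation): sample a factorization order over the positions, then predict each token from the tokens that precede it in that order, while keeping every token's original position attached so the model never loses the true sequence order.

```python
import random

# Toy illustration of permutation-order prediction. The factorization order
# is shuffled, but each context entry carries its original position, so the
# model always knows, e.g., that "furry" is the fourth token.
tokens = ["The", "cat", "is", "furry"]
positions = list(range(len(tokens)))

random.seed(0)
order = positions[:]
random.shuffle(order)  # one sampled factorization order, e.g. [2, 0, 3, 1]

schedule = []
for step, pos in enumerate(order):
    # Context = (token, original_position) pairs seen earlier in this order.
    context = [(tokens[p], p) for p in order[:step]]
    schedule.append((tokens[pos], pos, context))
    print(f"predict {tokens[pos]!r} at position {pos} given {context}")
```

In expectation over many sampled orders, every token is predicted from every subset of the other tokens, which is the "both sides" property described above.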

The masking-and-caching algorithm that accomplishes this does not seem trivial to me.

The improvements to SOTA performance in a range of tasks are significant -- see tables 2, 3, 4, 5, and 6 in the paper.

They used more compute than BERT and much more training data. Also, they still underperform compared to the best BERT-based model on the SQuAD 2.0 benchmark.

The big value of BERT is that it is publicly available, and can be modified and improved.

An English pretrained model is available. Also, the SQuAD2.0 EM numbers are higher than the BERT results published in the BERT paper. Subsequent work has improved on these and no doubt XLNet will see similar activity.

It's nice to see people managing to push BERT further and get SOTA on stuff, but I feel like a fair amount of this sort of thing is really just throwing more and more compute at a problem.

Given that we still don't fundamentally understand the properties of deep nets anywhere near as well as we might like to think, it gives me the gnawing feeling that we're missing something.

This concern is vague, frequently raised, and not especially useful. What are you advocating, specifically? That the amount of compute being thrown at problems is excessive relative to the economic benefit that Google (or whoever) derives from training models as long as they do? This is an empirical question, and it’s well understood how much compute is useful. Are you saying that advances in M.L. theory are coming too slowly because we focus too much on hardware? I spend maybe a third of my day job reading journal articles, and the rate of ideas coming out is blistering.

This isn’t people banging GPUs together in desperation because nothing works. This is what happens when everything works better and better the more money you throw at it with no apparent end.

I accept all of what you're saying and also spend a lot of my day reading ML papers :)

To boil it down a little perhaps, the point I'm making is that the incremental gain from new ideas coming out seems to be slowly decreasing, in my view - which is perfectly natural! It obviously makes sense that that would happen.

Where that thought leads me, though, is the idea that we may be slowly moving towards the exploitation half of the explore/exploit choice. I'm not saying greenfield ML research is gone by any means, only that a lot of work presented as such may in fact not be, because in many cases it doesn't fundamentally re-think the problem being addressed.

Deep NNs surely give people a lot of room to come up with ideas or combine ideas from other domains. The problem is that people take what a NN can do for granted without really looking into the details. As long as you can invent some fancy architecture and interpret your loss, no one cares about the black-box part. The higher-than-usual share of systems that lack theoretical foundations is indeed worrying.

What I am most impressed by in this paper is that it is primarily from CMU, with five authors there and one from Google (the legendary Quoc Le; Ruslan Salakhutdinov is also excellent). Even though the CMU team may have had the original ideas, including the dual-masking-plus-caching contribution, I guess the golden-age-of-ML adage still holds: "it's a golden age so long as you have access to massive amounts of compute and storage".

I don't get the point they are trying to make about BERT not learning dependencies between masked words. Isn't the mask randomly chosen each time, so it has a chance to learn with all possible words unmasked?
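A quick simulation of that intuition (my own sketch, assuming the standard 15% masking rate): across many steps, every token does get predicted many times with varied unmasked contexts. The paper's objection concerns what happens within a single step, where the masked tokens are predicted independently of one another.

```python
import random

# BERT-style masking, sketched: each training step masks a fresh random 15%
# of positions. Within one step, masked tokens cannot see each other's
# identities; across many steps, every position is predicted repeatedly
# under different masks.
random.seed(1)
seq_len, mask_rate, steps = 20, 0.15, 1000
times_masked = [0] * seq_len

for _ in range(steps):
    n_mask = max(1, int(seq_len * mask_rate))  # 3 of 20 positions per step
    for p in random.sample(range(seq_len), n_mask):
        times_masked[p] += 1

# Every position ends up masked (hence predicted) many times over training.
print(min(times_masked), max(times_masked))
```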

Most of the statements they make regarding orderless autoregression, including statements about the "independence assumption" made by BERT, are misleading at best.

Thank you, it seemed that way to me, but I thought maybe there was something I wasn't understanding.

See the section "Comparison with BERT" which has a nice worked example that is reasonably understandable without necessarily fully understanding the rest of the paper.
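For anyone skimming, here is my paraphrase of that worked example (the rendering is mine, as I read the paper): with the sequence [New, York, is, a, city] and targets "New" and "York", BERT masks both targets and predicts each one from the unmasked words only, while XLNet's autoregressive factorization lets a later target condition on an earlier one.

```python
# Paraphrase of the paper's "Comparison with BERT" example.
# BERT: both targets are masked, so neither sees the other.
bert_terms = [
    ("New", ["is", "a", "city"]),
    ("York", ["is", "a", "city"]),          # cannot see "New": it is masked too
]
# XLNet: under a factorization order where "New" precedes "York",
# the dependency pair (New -> York) is part of the training signal.
xlnet_terms = [
    ("New", ["is", "a", "city"]),
    ("York", ["New", "is", "a", "city"]),
]
for target, ctx in xlnet_terms:
    print(f"log p({target} | {', '.join(ctx)})")
```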

That section you say is reasonably understandable is the part that confused me.

For example:

> it is obvious that XLNet always learns more dependency pairs given the same target and contains “denser” effective training signals

BERT is only masking 15% of the tokens, so isn't the amount of dependency pairs like 18% higher at most?
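The back-of-the-envelope counting I had in mind (my own convention, not the paper's; the exact figure depends on how you count pairs):

```python
# Count target-context "dependency pairs" in a 100-token sequence with
# 15 masked targets. BERT targets see only the 85 unmasked tokens; XLNet
# targets can, in expectation, see all 99 other tokens.
seq_len = 100
masked = 15

bert_pairs = masked * (seq_len - masked)   # 15 * 85
xlnet_pairs = masked * (seq_len - 1)       # 15 * 99
increase = xlnet_pairs / bert_pairs - 1
print(f"{increase:.1%}")  # → 16.5% under this counting
```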

> BERT is only masking 15% of the tokens, so isn't the amount of dependency pairs like 18% higher at most?

That's a small but significant difference, allowing them to improve performance by a small but significant amount.

I think the argument behind it is that XLNet retains the full context regardless. Ultimately, it is the performance that speaks.
