
MT-DNN Achieves Human Performance in General Language Understanding Benchmark - apsec112
https://blogs.msdn.microsoft.com/stevengu/2019/06/20/microsoft-achieves-human-performance-estimate-on-glue-benchmark/
======
SmooL
It seems that since submitting, they are no longer the leader on the GLUE
leaderboard:
[https://gluebenchmark.com/leaderboard/](https://gluebenchmark.com/leaderboard/)

~~~
yorwba
In fact, the article was published on June 20, while the XLNet submission that
dethroned them had already gone up on June 19. I guess their publishing
pipeline doesn't allow last-minute amendments.

~~~
dweekly
The relentless pace of ML remains breathtaking.

Microsoft: We beat human performance and lapped everyone else!

XLNet: Hold my beer.

Code: [https://github.com/zihangdai/xlnet](https://github.com/zihangdai/xlnet)

Description: [https://towardsdatascience.com/what-is-xlnet-and-why-it-outperforms-bert-8d8fce710335](https://towardsdatascience.com/what-is-xlnet-and-why-it-outperforms-bert-8d8fce710335)

Paper:
[https://arxiv.org/pdf/1906.08237.pdf](https://arxiv.org/pdf/1906.08237.pdf)

~~~
m0zg
Notably, this result would be very hard for Microsoft to achieve, or even to
reproduce fully in-house, because it requires more memory than GPUs have. The
GitHub repo mentions that TPUs are pretty much table stakes for training this.

~~~
riku_iki
> because it requires more memory than GPUs have

Could you elaborate? A TPU v3 unit has 16 GB of memory, and the older V100
also has 16 GB. Plus, a TPU has extra memory consumption from mandatory tensor
padding, which a GPU doesn't have.

~~~
m0zg
The XLNet GitHub repo elaborates on this pretty well; no need to duplicate it here:
[https://github.com/zihangdai/xlnet](https://github.com/zihangdai/xlnet)

~~~
riku_iki
There is no elaboration of your statement there. It says it is hard to
reproduce the results on a single 16 GB GPU, but they didn't use a single TPU
for their result either. They explicitly state that a large number of GPUs is
required: "Therefore, a large number (ranging from 32 to 128, equal to
batch_size) of GPUs are required to reproduce many results in the paper." And
they used around 200 TPUs themselves, according to their publication.

~~~
m0zg
Yes, it boils down to either using a single machine with multiple TPUs (and
therefore doing things "the easy way" and relatively quickly) or having to use
128 GPUs (up to 8 per machine) and working with a single sample per GPU,
really, really slowly. Given that a single model often requires dozens, if not
hundreds, of training runs to figure out the hyperparameters that achieve a
SOTA result, with TPUs you can do this, whereas with GPUs you aren't even
going to bother training something this big because it will take forever, and
someone with a TPU will figure out something better by the time you're done.
Which, it looks like, is what happened in this case.
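
To make the arithmetic concrete, here's a back-of-the-envelope sketch (my own
illustration, not from the paper or the repo) of how per-device memory pins
the device count under plain data parallelism: if XLNet-Large only fits one
example per 16 GB GPU, the global batch size dictates the GPU count.

    # Back-of-the-envelope sketch of the data-parallel arithmetic. The
    # one-example-per-16GB-GPU figure is an assumption for illustration,
    # not a measured number from the XLNet repo.
    def devices_needed(global_batch_size: int, examples_per_device: int) -> int:
        """Number of data-parallel devices needed to hold one global batch."""
        assert global_batch_size % examples_per_device == 0
        return global_batch_size // examples_per_device

    # The XLNet README quotes batch sizes from 32 to 128 for the paper's results.
    for batch in (32, 64, 128):
        print(f"batch {batch:>3} -> {devices_needed(batch, 1)} GPUs at 1 example/GPU")

At a batch size of 128, that's 128 GPUs (sixteen full 8-GPU machines) just to
hold a single batch.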

------
daenz
One of their test sentences was interesting:

>For example, the task provides the sentence: “The city councilmen refused the
demonstrators a permit because they [feared/advocated] violence.” If the word
“feared” is selected, then “they” refers to the city council. If “advocated”
is selected, then “they” presumably refers to the demonstrators.
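
For illustration, here's a minimal, hypothetical sketch of how such an item
can be scored automatically: substitute each candidate referent for the
pronoun and compare pseudo-log-likelihoods under a pretrained masked language
model. The model choice and the scoring scheme are my own assumptions, not how
MT-DNN's GLUE submission actually handles the task.

    # Hypothetical sketch: resolve a Winograd-style pronoun by swapping each
    # candidate referent in for it and comparing masked-LM pseudo-log-likelihoods.
    # Model choice ("bert-base-uncased") is an assumption, not from the article.
    import torch
    from transformers import AutoTokenizer, AutoModelForMaskedLM

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
    model.eval()

    def pseudo_log_likelihood(sentence: str) -> float:
        """Sum of log-probabilities of each token when it is masked in turn."""
        ids = tokenizer(sentence, return_tensors="pt")["input_ids"][0]
        total = 0.0
        for i in range(1, len(ids) - 1):           # skip [CLS] and [SEP]
            masked = ids.clone()
            masked[i] = tokenizer.mask_token_id
            with torch.no_grad():
                logits = model(masked.unsqueeze(0)).logits[0, i]
            total += torch.log_softmax(logits, dim=-1)[ids[i]].item()
        return total

    template = ("The city councilmen refused the demonstrators a permit "
                "because {referent} {verb} violence.")
    for verb in ("feared", "advocated"):
        scores = {ref: pseudo_log_likelihood(template.format(referent=ref, verb=verb))
                  for ref in ("the city councilmen", "the demonstrators")}
        print(verb, "->", max(scores, key=scores.get))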

~~~
aisofteng
This sort of sentence is exactly the point of the article. See:
[https://en.m.wikipedia.org/wiki/Winograd_Schema_Challenge](https://en.m.wikipedia.org/wiki/Winograd_Schema_Challenge)

------
p1esk
Seems like this is largely due to improved Winograd Schema results. I wonder
if those questions made it into the training set in some form.

------
TheIronYuppie
Disclosure: I work at Azure on Machine Learning

Hi all! Please let me know if you have any questions - happy to direct them to
the right people! Thanks!

(Email in my profile)

