
ALBERT: A Lite BERT for Self-Supervised Learning of Language Representations - agluszak
https://ai.googleblog.com/2019/12/albert-lite-bert-for-self-supervised.html
======
aglionby
Paper here:
[https://arxiv.org/abs/1909.11942](https://arxiv.org/abs/1909.11942)

Really impressed at the speed with which Hugging Face ported this to their
transformers library -- Google released the model and source code Oct 21 [1]
and it was available in the library just 8 days later [2].

[1] [https://github.com/google-research/google-research/commit/b5...](https://github.com/google-research/google-research/commit/b5b95447720276feb3f6c7d1782207ab3f037d3e)

[2]
[https://github.com/huggingface/transformers/commit/c0c208833...](https://github.com/huggingface/transformers/commit/c0c2088333e2e8ce2b24d0c7f4bf071dcccbd7ea)
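
For reference, a minimal usage sketch against the ported model (the
"albert-base-v2" checkpoint name and the output handling are assumptions;
check the library docs for the identifiers actually published):

    import torch
    from transformers import AlbertTokenizer, AlbertModel

    # Load a pretrained ALBERT checkpoint via the transformers library.
    tokenizer = AlbertTokenizer.from_pretrained("albert-base-v2")
    model = AlbertModel.from_pretrained("albert-base-v2")

    inputs = tokenizer("ALBERT shares parameters across layers.",
                       return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    # First element is the final hidden states: (batch, seq_len, hidden_size).
    print(outputs[0].shape)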

------
danieldk
This is nice work. Summarized: by untying the word piece embedding size from
the hidden layer size + sharing parameters between layers, the number of
parameters in the model is drastically reduced. They use this headroom to make
the hidden layer sizes larger and, given that layers share parameters, they
can also use more layers without increasing model size.

Another interesting finding is that this speeds up training due to the smaller
number of parameters.
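
A rough back-of-the-envelope of where the savings come from, assuming
base-ish sizes (30k word pieces, hidden size 768, factorized embedding size
128, 12 layers) and a simplified per-layer count that ignores biases and
layer norms:

    # Illustrative parameter counts only; exact numbers differ from the paper.
    V, H, E, L = 30_000, 768, 128, 12

    tied_embeddings = V * H                   # BERT-style:   ~23.0M
    factorized_embeddings = V * E + E * H     # ALBERT-style:  ~3.9M

    # Rough cost of one transformer block: attention projections + FFN.
    per_layer = 4 * H * H + 2 * H * (4 * H)   # ~7.1M
    bert_layers = L * per_layer               # 12 separate layers: ~85M
    albert_layers = per_layer                 # one shared layer:   ~7.1M

    print(f"embeddings: {tied_embeddings / 1e6:.1f}M -> "
          f"{factorized_embeddings / 1e6:.1f}M")
    print(f"layers:     {bert_layers / 1e6:.1f}M -> "
          f"{albert_layers / 1e6:.1f}M")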

However, my worry is that those of us who do not readily have access to TPUs
will get even slower models when using CPUs for prediction, due to the
additional and wider hidden layers. (Of course, one could use ALBERT base,
which still has 12 layers and a hidden layer size of 768, at a small loss.)
Did anyone measure the CPU prediction performance of ALBERT models compared to
BERT?

Edit: I guess one solution would be to use a pretrained ALBERT and finetune to
get the initial model and then use model distillation to get a smaller, faster
model.
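
For what it's worth, that distillation step usually boils down to training the
smaller student on a mix of the teacher's softened logits and the hard labels;
a minimal sketch (function name and hyperparameters are just placeholders):

    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, labels,
                          T=2.0, alpha=0.5):
        # Soft-target term: KL between temperature-scaled distributions.
        soft = F.kl_div(
            F.log_softmax(student_logits / T, dim=-1),
            F.softmax(teacher_logits / T, dim=-1),
            reduction="batchmean",
        ) * (T * T)
        # Hard-label term: ordinary cross-entropy on the task labels.
        hard = F.cross_entropy(student_logits, labels)
        return alpha * soft + (1 - alpha) * hard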

~~~
nl
_Summarized: by untying the word piece embedding size from the hidden layer
size + sharing parameters between layers, the number of parameters in the
model is drastically reduced. They use this headroom to make the hidden layer
sizes larger and, given that layers share parameters, they can also use more
layers without increasing model size._

This summary is wrong.

It's true that they use parameter sharing to reduce the number of parameters.

But for any given "class" of model (base/large/xlarge) ALBERT has the _same_
size hidden layer and the _same_ number of layers.

If you try to compare by model size (measured by number of parameters) then
ALBERT xxlarge (235M parameters) has _fewer_ layers than BERT large (334M
parameters) - 12 vs 24 - a larger hidden layer (4096 vs 1024) and a smaller
embedding size (128 vs 1024).

~~~
nl
_they can also use more layers without increasing model size._

Additionally, in section 4.9 they compare more layers and find "The difference
between 12-layer and 24-layer ALBERT-xxlarge configurations in terms of
downstream accuracy is negligible, with the Avg score being the same. We
conclude that, when sharing all cross-layer parameters (ALBERT-style), there
is no need for models deeper than a 12-layer configuration"

------
nl
_RACE test accuracy of 89.4. The latter appears to be a particularly strong
improvement, a jump of +17.4% absolute points over BERT (Devlin et al., 2019),
+7.6% over XLNet (Yang et al., 2019), +6.2% over RoBERTa (Liu et al., 2019),
and 5.3% over DCMN+ (Zhang et al., 2019), an ensemble of multiple models
specifically designed for reading comprehension tasks. Our single model
achieves an accuracy of 86.5%, which is still 2.4% better than the
state-of-the-art ensemble model_

This is an amazing result.

RACE has a performance ceiling of 94.5% set by inaccuracies in the data.
Mechanical Turk performance is 73.3%.

It's a hard reading comprehension test where you can't just extract spans
from the text and match them against the answers. Section 2 of
[https://www.aclweb.org/anthology/D17-1082.pdf](https://www.aclweb.org/anthology/D17-1082.pdf)
has a sample.

------
iceIX
Some interesting discussion between the authors and the ICLR 2020 reviewers
here:
[https://openreview.net/forum?id=H1eA7AEtvS](https://openreview.net/forum?id=H1eA7AEtvS)

------
misterman0
The progress academics continuously make in NLP takes us closer and closer to
a local maximum that, when we reach it, will mark the longest and coldest AI
winter ever experienced by man, because of how far we are from the global
maximum. Progress made by academically untrained researchers will, in the end,
be what melts the snow, because of how "out-of-the-current-AI-box" they are in
their theories about language and intelligence in general.

~~~
heyitsguay
We hear that opinion all the time. As someone working in neural net-based
computer vision I'd basically agree that the current approaches are tending
towards a non-AGI local maximum, but I'd note that as compared to the 80s,
this is an economically productive local maximum, which will likely help fuel
new developments more efficiently than in previous waves. The next
breakthrough may be made by someone academically untrained, but you can bet
they'll have learned a whole lot of math, computer science, data science, and
maybe neuroscience first.

~~~
spinningslate
Agreed. I'm nowhere near expert enough to opine on how far state-of-the-art is
from some global maximum.

I'd contend that, for the most part, it doesn't matter. It's a bit like the
whole ML vs AGI debate ("but ML is just curve fitting, it's not _real_
intelligence"). The more pertinent question for human society is the impact it
has - positive or negative. ML, with all its real or perceived weaknesses, is
having a significant impact on the economy specifically and society generally.

It'll be little consolation for white collar workers who lose their jobs that
the bot replacing them isn't "properly intelligent". Equally, few people using
Siri to control their room temperature or satnav will care that the underlying
"intelligence" isn't as clever as we like to think we are.

Maybe current approaches will prove to have a cliff-edge limitation like
previous AI approaches did. That will be interesting from a scientific
progress perspective. But even in its current state, contemporary ML has
plenty of scope to bring about massive changes in society (and already is
doing so). We should be careful not to lose sight of that in criticising
current limitations.

~~~
rumticular
This perspective makes sense pragmatically, but in philosophical terms it’s a
little absurd.

Going back to Turing, the argument was for true, human creativity. The claim
was that there is no theoretical reason a machine cannot write a compelling
_sonnet_.

After spending the better part of a century on _that_ problem, we have made
essentially zero progress. We still believe that there is no theoretical
reason a machine cannot write a compelling sonnet. We still have zero models
for how that could actually work.

If you are a non-technical person who has been reading popular reporting about
ML, you might well have been given the impression that something like GPT-2
reflects progress on the sonnet problem. Some very technical people seem to
believe this too? Which seems like an issue, because there’s just no evidence
for it.

 _Maybe_ a larger/deeper/more recurrent ML approach will magically solve the
problem in the next twenty years.

And _maybe_ the first machine built in the 20th century that could work out
symbolic logic faster than all of the human computers in the world would have
magically solved it.

There was no systematic model for the problem, so there was no reason to
conclude one way or another, just as there isn’t any today.

ML is a powerful metaprogramming technique, probably the most productive one
developed yet. And that matters.

It’s just still not at all what we’ve been proposing to the public for a
hundred years. To the best of our understanding, it’s not even meaningfully
closer. And that matters too, even if Siri still works fine.

~~~
p1esk
Re the sonnet problem: we can use GPT-2 to generate 10k sonnets, then choose
the best one (say by popular vote, expert opinion, etc.); it's quite likely to
be "compelling", or at least on par with an average published sonnet. Do you
agree? If yes, then with some further deep learning research, more training
data, and bigger models, we will probably be able to shrink the output space
to 1k, 100, and eventually maybe just 10 sonnets to choose from, to get
similar quality. Would this be considered "progress on that problem" in your
opinion?
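
A rough sketch of that generate-many-then-select idea with GPT-2 via the
transformers library (the prompt and the scoring stub, standing in for
popular vote or expert judgement, are assumptions):

    import torch
    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2")

    input_ids = tokenizer.encode("Shall I compare thee", return_tensors="pt")

    # Sample many candidate continuations (scale num_return_sequences up
    # towards 10k for the thought experiment).
    samples = model.generate(
        input_ids,
        do_sample=True,
        top_p=0.9,
        max_length=120,
        num_return_sequences=10,
        pad_token_id=tokenizer.eos_token_id,
    )
    candidates = [tokenizer.decode(s, skip_special_tokens=True)
                  for s in samples]

    def score(text):
        # Placeholder for the human selection step (votes, expert opinion...).
        return len(set(text.split()))

    print(max(candidates, key=score))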

