ALBERT: A Lite BERT for Self-Supervised Learning of Language Representations

aglionby · on Jan 8, 2020

Paper here: https://arxiv.org/abs/1909.11942

Really impressed at the speed with which Hugging Face ported this to their transformers library -- Google released the model and source code Oct 21 [1] and it was available in the library just 8 days later [2].

[1] https://github.com/google-research/google-research/commit/b5...

[2] https://github.com/huggingface/transformers/commit/c0c208833...

danieldk · on Jan 8, 2020

This is nice work. Summarized: by untying word piece embedding sizes from the hidden layer size + sharing parameters between layers, the number of parameters in the model are drastically reduced. They use this headroom to make the hidden layer sizes larger and given that layers share parameters, they can also use more layers without increasing model size.

Another interesting finding is that this speeds up training due to the smaller number of parameters.

However, my worry is that those of use that do not readily have access to TPUs will get even slower models when using CPUs for prediction due to the additional and wider hidden layers. (Of course, one could use ALBERT base, which still have 12 layers and a hidden layer size of 768, at a small loss.) Did anyone measure the CPU prediction performance of ALBERT models compared to BERT?

Edit: I guess one solution would be to use a pretrained ALBERT and finetune to get the initial model and then use model distillation to get a smaller, faster model.

samcodes · on Jan 8, 2020

> I guess one solution would be to use a pretrained ALBERT and finetune to get the initial model and then use model distillation to get a smaller, faster model.

IME this is the way to go; take an ensemble of big, accurate models, then distill them down to the smallest model you can get away with. There are really good tricks in this paper

https://www.aclweb.org/anthology/D19-5632/

nl · on Jan 9, 2020

Summarized: by untying word piece embedding sizes from the hidden layer size + sharing parameters between layers, the number of parameters in the model are drastically reduced. They use this headroom to make the hidden layer sizes larger and given that layers share parameters, they can also use more layers without increasing model size.

This summary is wrong.

It's true that they use parameter sharing to reduce the number of parameters.

But for any given "class" of model (base/large/xlarge) ALBERT has the same size hidden layer and the same number of layers.

If you try to compare by model size (measured by number of parameters) then ALBERT xxlarge (235M parameters) has less layers than BERT large (334M parameters) - 12 vs 24 - a larger hidden layer (4096 vs 1024) and smaller embedding size (128 vs 1024).

nl · on Jan 9, 2020

they can also use more layers without increasing model size.

Additionally, in section 4.9 they compare more layers and find "The difference between 12-layer and 24-layerALBERT-xxlarge configurations in terms of downstream accuracy is negligible, with the Avg score being the same. We conclude that, when sharing all cross-layer parameters (ALBERT-style), there is no need for models deeper than a 12-layer configuration"

danieldk · on Jan 9, 2020

The summary is only wrong if you want to compare the categories. I am not sure why the category names are important, except for identifying individual models.

Nimitz14 · on Jan 8, 2020

Yeah, trading off embedding dim vs hidden layer dim is a common trick that effectively means trading size on disk (and in memory) for inference speed. It's obvious the model will need to be more powerful if the embedding carry less information. Still cool they got the size down by 90%.

nl · on Jan 8, 2020

I've been using ALBERT (the HuggingFace port) for a few weeks. It works fine on GPUs, and it isn't noticeably slower for inference than other large models on CPUs.

It's worth noting that TPUs are available for free on Google Colab.

danieldk · on Jan 9, 2020

It's worth noting that TPUs are available for free on Google Colab.

Yes, and you can also get a research grant, which gives you several TPUs for a month. But that does not mean that you can easily deploy TPUs in your own infrastructure, unless you use Google Cloud and suck up the costs (which may not be possible in academia).

ZeroCool2u · on Jan 8, 2020

I'm sorry, but why would you run a model like ALBERT on a CPU in the first place?

It's common knowledge not to bother running a deep model like ALBERT, BERT, or XLNet, etc without at least a GPU. Training and inference with models of this size on CPU's is typically considered to be intractable.

Obviously, this paper is from Google, so they have free access to TPU's, which is arguably optimal for an arbitrary TensorFlow model in terms of performance, but if you don't have access to TPU's, the clear choice is to use GPU's instead. It won't be quite as fast, but it is sufficient and GPU's are relatively cheap and available.

nl · on Jan 8, 2020

I'm sorry, but why would you run a model like ALBERT on a CPU in the first place?

It's pretty common to run inference on CPUs. There are lots of operational and cost reasons why this makes sense in at least some cases.

peterjussi · on Jan 9, 2020

I would even say: in every case, except where a GPU or TPU is necessary to achieve a certain speed. Unless there are very specific reasons for it, GPU/TPU is just unnecessarily cost inefficient.

danieldk · on Jan 9, 2020

It's common knowledge not to bother running a deep model like ALBERT, BERT, or XLNet, etc without at least a GPU. Training and inference with models of this size on CPU's is typically considered to be intractable.

I am not sure if this is common knowledge, and if it is, it is wrong. With read-ahead and sorting batches by length, we can easily reach 100 sentences per second on modern CPUs (8 threads) with BERT base. We use BERT in a multi-task setup, typically annotating 5 layers at the same time. This is many times faster than old HPSG parsers (which typically had exponential time complexity) and as fast as other neural methods used in a pipelined setup.

zuzun · on Jan 8, 2020

Is there a speed-up? In their paper in table 3, once you compare each ALBERT model with the smaller BERT model, you're looking at similar accuracies and longer training times.

nl · on Jan 9, 2020

They are comparing the speed to execute training to 125K steps, not speed to a given accuracy.

In section 4.8 they compare accuracy at the same amount of training time for the biggest of each model and show that ALBERT is substantially better.

halflings · on Jan 9, 2020

> Edit: I guess one solution would be to use a pretrained ALBERT and finetune to get the initial model and then use model distillation to get a smaller, faster model.

Huggingface did just that: https://medium.com/huggingface/distilbert-8cf3380435b5

mapleshamrock · on Jan 8, 2020

On CPU, assuming inference is compute bound rather than bandwidth bound, the compute time will scale quadratically with the size of the FC layers (which account for almost all compute time in these networks). So if the hidden size was 768 in BERT-Base, and 4096 in ALBERT, inference will approximately be 28.4x slower... yikes.

nl · on Jan 9, 2020

RACE test accuracy of 89.4. The latter appears to be a particularly strong improvement,a jump of +17.4% absolute points over BERT (Devlin et al., 2019), +7.6% over XLNet (Yang et al.,2019), +6.2% over RoBERTa (Liu et al., 2019), and 5.3% over DCMI+ (Zhang et al., 2019), an ensemble of multiple models specifically designed for reading comprehension tasks. Our single model achieves an accuracy of86.5%, which is still2.4%better than the state-of-the-art ensemble model

This is an amazing result.

RACE has a performance ceiling of 94.5% set by inaccuracies in the data. Mechanical Turk performance is 73.3%.

It's a hard, reading comprehension test where you can't just extract spans from the text and match against answers. Section 2 of https://www.aclweb.org/anthology/D17-1082.pdf has a sample.

iceIX · on Jan 8, 2020

Some interesting discussion between the authors and the ICLR 2020 reviewers here: https://openreview.net/forum?id=H1eA7AEtvS

misterman0 · on Jan 8, 2020

The progress academics continuously make in NLP take us closer and closer to a local maxima that, when we reach it will be the mark of the longest and coldest AI winter ever experienced by man, because of how far we are from a global maxima and, progress made by academically untrained researchers will, in the end, be what melts the snow, because of how "out-of-the-current-AI-box" they are in their theories about language and intelligence in general.

heyitsguay · on Jan 8, 2020

We hear that opinion all the time. As someone working in neural net-based computer vision I'd basically agree that the current approaches are tending towards a non-AGI local maximum, but I'd note that as compared to the 80s, this is an economically productive local maximum, which will likely help fuel new developments more efficiently than in previous waves. The next breakthrough may be made by someone academically untrained, but you can bet they'll have learned a whole lot of math, computer science, data science, and maybe neuroscience first.

spinningslate · on Jan 8, 2020

Agreed. I'm nowhere near expert enough to opine on how far state-of-the-art is from some global maxima.

I'd contend that, for the most part, it doesn't matter. It's a bit like the whole ML vs AGI debate ("but ML is just curve fitting, it's not real intelligence"). The more pertinent question for human society is the impact it has - positive or negative. ML, with all its real or perceived weaknesses, is having a significant impact on the economy specifically and society generally.

It'll be little consolation for white collar workers who lose their jobs that the bot replacing them isn't "properly intelligent". Equally, few people using Siri to control their room temperature or satnav will care that the underlying "intelligence" isn't as clever as we like to think we are.

Maybe current approaches will prove to have a cliff-edge limitation like previous AI approaches did. That will be interesting from a scientific progress perspective. But even in its current state, contemporary ML has plently scope to bring about massive changes in society (and already is). We should be careful not to miss that in criticising current limitations.

bjornsing · on Jan 8, 2020

> as clever as we like to think we are.

Word. I think we’re actually at a level now that we’ll soon start questioning how intelligent people really are, and how much of human intelligence is just an uncanny ability to hide incompetence/lack of deeper comprehension.

(Of course we’re a hell of a long way from A.I. with deep comprehension, and may remain so for hundreds of years. It’s impossible to predict that kind of quantum leap IMHO.)

the8472 · on Jan 8, 2020

The questions are already being asked.

> Humans Who Are Not Concentrating Are Not General Intelligences

https://srconstantin.wordpress.com/2019/02/25/humans-who-are...

rumticular · on Jan 8, 2020

This perspective makes sense pragmatically, but in philosophical terms it’s a little absurd.

Going back to Turing, the argument was for true, human creativity. The claim was that there is no theoretical reason a machine cannot write a compelling sonnet.

After spending the better part of a century on that problem, we have made essentially zero progress. We still believe that there is no theoretical reason a machine cannot write a compelling sonnet. We still have zero models for how that could actually work.

If you are a non-technical person who has been reading popular reporting about ML, you might well have been given the impression that something like GPT2 reflects progress on the sonnet problem. Some very technical people seem to believe this too? Which seems like an issue, because there’s just no evidence for it.

Maybe a larger/deeper/more recurrent ML approach will magically solve the problem in the next twenty years.

And maybe the first machine built in the 20th century that could work out symbolic logic faster than all of the human computers in the world would have magically solved it.

There was no systematic model for the problem, so there was no reason to conclude one way or another, just as there isn’t any today.

ML is a powerful metaprogramming technique, probably the most productive one developed yet. And that matters.

It’s just still not at all what we’ve been proposing to the public for a hundred years. To the best of our understanding, it’s not even meaningfully closer. And that matters too, even if Siri still works fine.

p1esk · on Jan 8, 2020

Re sonnet problem: we can use GPT-2 to generate 10k sonnets, then choose the best one (say by popular vote, or expert opinion, etc), it's quite likely to be "compelling" or at least on par with an average published sonnet. Do you agree? If yes, then with some further deep learning research, more training data, and bigger models, we will probably be able to eventually shrink the output space to 1k, 100, and eventually maybe just 10 sonnets to choose from, to get similar quality. Would this be considered "progress for that problem" in your opinion?

visarga · on Jan 8, 2020

> After spending the better part of a century on that problem, we have made essentially zero progress.

I dunno, I heard music composed with neural nets that is above what the average human could achieve [1]. Not on par with the greatest composers, but over average human level.

In the same line of thought, I have seen models do symbolic math better than automatic solvers, generate paintings better than average humans could paint, even translate better than average second language learners.

I would rate current level in AI at 50% of human intelligence on average, and most of that was accomplished in the recent years.

[1] Google Music Transformer - https://magenta.tensorflow.org/music-transformer

nl · on Jan 8, 2020

Going back to Turing, the argument was for true, human creativity.

That's not true. The Turing test is that one can't tell the difference between a human and a machine intelligence by communicating with it. That's it.

The claim was that there is no theoretical reason a machine cannot write a compelling sonnet.

And that's absolutely not true. I can't write a compelling sonnet.

If you are a non-technical person who has been reading popular reporting about ML, you might well have been given the impression that something like GPT2 reflects progress on the sonnet problem. Some very technical people seem to believe this too? Which seems like an issue, because there’s just no evidence for it.

I work in the field of NLP and I believe it does reflect progress, and I think there is evidence for it.

  The gods are they who came to earth
  And set the seas ablaze with gold.
  There is a breeze upon the sea,
  A sea of summer in its folds,
  A salt, enchanted breeze that mocks
  The scents of life, from far away
  Comes slumbrous, sad, and quaint, and quaint.
  The mother of the gods, that day,
  With mortal feet and sweet voice speaks,
  And smiles, and speaks to men:  "My Sweet,
  I shall not weary of thy pain."

GPT2 small generated poetry.

  For the youth, who, long ago,
  Came up the long and winding way
  Beneath my father's roof, in sorrow—
  Sorrow that I would not bless
  With his very tears. Oh,
  My son the sorrowing,
  Sorrow's child. God keep thy head,
  Where it is dim with age,
  Gentle in her death!

GPT2 1.5B large generated poetry.

Both samples from https://www.gwern.net/GPT-2

mrec · on Jan 8, 2020

> The next breakthrough may be made by someone academically untrained, but you can bet they'll have learned a whole lot of math, computer science, data science, and maybe neuroscience first.

I found this sentence particularly intriguing given that John Carmack recently announced that he was switching his main focus to AI.

The_rationalist · on Jan 8, 2020

Where can we follow his work?

dna_polymerase · on Jan 8, 2020

> [..] this is an economically productive local maximum, which will likely help fuel new developments more efficiently than in previous waves.

This is exactly it. Every previously seen AI Winter had in common that funding was cut back. However, Google or other companies in the realm could approach a point where further investment wouldn't make sense to them. Until then there won't be a Winter, maybe Autumn, because smaller players disappear.

fergal_reid · on Jan 8, 2020

The progress that has been made so far is already good enough to deliver tons of real business value. The tech is way ahead of application, as the tech has jumped forward so much in the last 5 years, and progress continues to be rapid.

I've direct evidence of that from my day job (building NLP chat bots for Intercom).

That business value will increase as we NLP progresses, even if we're moving towards a local optimum.

Even if we do get stuck, real products and real revenue powered by NLP will help fund research on successive generations.

Of course theres tons of hype about AI. But theres also a big virtuous cycle which just wasn't present in the setup which created previous AI winters.

screye · on Jan 8, 2020

People thought that with LSTMs, and then we got transformers. People thought that with CNNs, and then we got Res-nets.

Progress is always that way. It plateaus, then suddenly jumps and then plateaus again.

If you complaint is about the general move away from statistics and deep learning becoming the norm, then there are a pretty decent number of labs who are working on coming up with whatever the next deep learning is. There is probabilistic programming and there models are some models with newer biologically inspired computation structures.

Even inside ML and deep learning, people are trying to come up with ways to better leverage unsupervised learning and building large common sense representations of the world.

There is certainly an oversupply of applied deeplearning practitioners, but there are other approaches being explored in the AI/ML community too.

teshier-A · on Jan 8, 2020

Like the local maxima that the GLUE benchmark was for a few weeks (months?) before SuperGLUE got released ? This field is moving so fast, it's probably wiser to hold off over the top ominous predictions for a little while.

octbash · on Jan 8, 2020

This sounds like some sort of perverse inverse of "This is good for bitcoin."

Making progress of NLP benchmarks? Must be a sign that we're moving even closer to an even longer and more bitter AI winter.

tanilama · on Jan 8, 2020

It is not a summer/winter choice only if you chose to think this way. Such construction is superficial and drama oriented while reflecting very little what reality actually presents.

The current A.I. B.O.O.M is due to end or it is ending already, but this only means now we are equipped with really powerful approximators previous generation of researchers would not even dream of, that we left us with a really tantalizing question:

What is the right question to ask?

We have undoubtedly proved machine are superior to fit, now we need to make them curious.

317070 · on Jan 8, 2020

Yeah, some people even started giving it a name [0]: Schmidhuber's (local) optimum. It is a bit tongue-in-cheek, but the idea is that as long as Schmidhuber says he did it before, we are probably in the same basin of attraction as we were in the nineties.

The open question is whether AGI is the same as Schmidhuber's optimum, or even lies within Schmidhuber's basin.

[0] Cambridge style debate on the topic at Neurips 2019.

tjansen · on Jan 8, 2020

But why does it have to be the longest AI winter? I would agree that current NLP approaches do not get us any closer to NLU. They won't hurt either though. They may even help to motivate people. I started working on NLU because the current state of voice assistants is so frustrating...

The_rationalist · on Jan 8, 2020

But why does it have to be the longest AI winter? Because we have explored both paradigms (symbolic, subsymbolic and hybrid AI) The research has explored both existing paradigms and no other paradigm exist. Curve fitting (subsymbolic) is inherently limited. Maybe we need to reinvent symbolic AI, but almost nobody is working on it, and I'm not aware of any promising research paths/ideas for symbolic AI.

tjansen · on Jan 8, 2020

My feeling is that symbolic AI hasn't really been explored that much with modern tools and modern computing capacity. It just needs that one breakthrough to show that for certain areas (IMO NLU) it is far superior. There's no way of predicting when this will happen, but given the current interest in AI I don't think it will take that long, even if currently there are far more people working on subsymbolic AI.

visarga · on Jan 8, 2020

Isn't physics in the same situation? The theory is useful for countless applications but is ultimately flawed, and researchers are aware of that, of course. But we don't hear it every time there is a new application of physics. Why should the ultimate high standard be invoked so often in discussion only for ML? Other fields like psychology or economics are probably in an even worse position with their theories vs the reality.

ML is an empirical science, or a craft if you want, with useful applications. It's not the ultimate theory of intelligence.

Iv · on Jan 8, 2020

Personally, seeing that connexionitsts and symbolists start talking with each other gives me hope that there won't be another AI winter before the AI singularity.

2sk21 · on Jan 8, 2020

Agreed. I should mention that reading Gary Marcus' new book really drives home your point.

slumdev · on Jan 8, 2020

> progress made by academically untrained researchers will, in the end, be what melts the snow, because of how "out-of-the-current-AI-box" they are in their theories

This is an unnecessarily uncharitable view of academia.

"Outside the box thinking" is frequently just ignorance and Dunning-Kruger.

laretluval · on Jan 8, 2020

Current academic NLP would have been considered quite out-of-the-current-box 10 years ago. Most academic progress is driven by young graduate students who think similarly to you.