A Primer in BERTology: What We Know About How BERT Works

mobilio · on Nov 10, 2020

" BERT is a method of pre-training language representations, meaning that we train a general-purpose "language understanding" model on a large text corpus (like Wikipedia), and then use that model for downstream NLP tasks that we care about (like question answering). BERT outperforms previous methods because it is the first unsupervised, deeply bidirectional system for pre-training NLP."

https://github.com/google-research/bert

niea_11 · on Nov 10, 2020

Can anyone please explain (in layman's terms, if possible) how the researchers came up with the method in the first place, if the process by which the method finds its answers is not understood?

alquemist · on Nov 10, 2020

Intelligent trial and error.

1. Transformers are an extension of the attention mechanism, a well-known LSTM extension that worked well for machine translation (2014, https://arxiv.org/abs/1409.0473). A transformer model essentially builds a multi-headed attention module, analogous to CNNs, then stacks several layers of them, analogous to CNNs / stacked LSTMs (1998, http://yann.lecun.com/exdb/publis/pdf/lecun-01a.pdf).

2. Transformers use residual blocks, which were introduced by the ResNet CNN architecture (2015, https://arxiv.org/abs/1512.03385). At the time, ResNet was topping the ImageNet benchmarks. This technique helps prevent the vanishing gradient problem during training.

3. Transformers use normalization extensively: layer normalization and attention normalization. This helps keep internal vectors in the neighborhood of 1 and prevents training from collapsing due to vanishing gradients.

4. Correct initialization of the network vectors also helps prevent the vanishing gradient problem.

5. Unsupervised pretraining was one of the first tricks to make deep networks work (2009, https://www.jmlr.org/papers/volume11/erhan10a/erhan10a.pdf)

6. Pretraining was used extensively in the vision community for transfer learning, i.e. reusing the weights of a network trained on ImageNet and replacing the top layers / loss function to tackle a different problem.

7. Finally, language modelling, that is, predicting the next word in a sentence, was a well-known technique to make machine translation work better. Researchers were looking for better language modelling techniques using large corpora (2013, https://arxiv.org/abs/1312.3005, https://www.kaggle.com/c/billion-word-imputation).
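
To make points 2 and 3 concrete, here is a minimal sketch of a residual block followed by layer normalization, in plain Python. It is only an illustration under stated assumptions: `toy_sublayer` is a hypothetical stand-in for an attention or feed-forward layer, the layer norm has no learned gain or bias, and the "post-norm" ordering LayerNorm(x + Sublayer(x)) is assumed.

```python
import math

def layer_norm(x, eps=1e-5):
    """Rescale a vector to zero mean and (roughly) unit variance.
    Simplified: no learned gain/bias parameters."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def toy_sublayer(x):
    """Hypothetical stand-in for attention or a feed-forward layer."""
    return [0.5 * v + 1.0 for v in x]

def residual_block(x, sublayer):
    """Post-norm residual block: LayerNorm(x + sublayer(x)).
    The `x +` term is the skip connection from ResNet."""
    summed = [a + b for a, b in zip(x, sublayer(x))]
    return layer_norm(summed)

# Start with badly scaled activations and push them through 12 stacked blocks.
x = [100.0, -50.0, 3.0, 7.0]
for _ in range(12):
    x = residual_block(x, toy_sublayer)

mean = sum(x) / len(x)
var = sum((v - mean) ** 2 for v in x) / len(x)
print("mean:", mean, "variance:", var)
```

However many blocks are stacked, the activations come out with mean near 0 and variance near 1, which is the "keep internal vectors in the neighborhood of 1" effect. The skip connection also gives gradients a direct path around each sublayer.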

PaulHoule · on Nov 10, 2020

Cave men discovered fire before humans discovered phlogiston, which they discovered before they discovered there was no such thing as phlogiston but rather oxygen instead.

BERT was motivated by the discovery that neural nets will try to learn functions you show them and some ideas of how to route information so it forms bottlenecks that force the network to learn general representations, but there was not a mature theory behind it and judging by that review paper there still isn't.

That whole paper reads to me like an account of wandering in the dark. If I read a paper that long about liquid rocket fuels, I'd learn that 99% of the things I might want to use as a rocket fuel won't work and that I really have a choice of hydrogen, methane, or RP-1, with oxygen as the oxidizer.

If you were out to "build a better BERT", even a slightly better BERT, that paper doesn't give clear guidelines about what you should do.

It's got all the trappings of a field which is preparadigmatic but could be mistaken for paradigmatic because of the sheer volume of researchers, conferences, papers, etc.

JacobiX · on Nov 10, 2020

Throughout human history, you can find many discoveries that were made before anyone understood why they worked. We know exactly how neural networks work, but we don't know why they are so effective, perhaps because we lack a theoretical understanding of deep learning and complex neural networks. In this particular case, BERT is the result of incremental enhancements to existing architectures and training procedures. Fundamentally, BERT is a sequence prediction algorithm, and historically, sequence prediction models were based on complex recurrent or convolutional neural networks. Experiments showed that the best-performing models were those with an attention mechanism (the concept of directing the focus onto some words or sentences). Some researchers then proposed a new, simple network architecture based solely on attention mechanisms, without complex recurrence or convolutions. They showed that some of these models achieve state-of-the-art performance while being more parallelizable and requiring significantly less time to train.
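
For the curious, "directing the focus onto some words" can be sketched in a few lines of plain Python as scaled dot-product attention. This is a toy illustration, not BERT's actual code: the 2-dimensional vectors are made up, there is a single head, and the learned projection matrices are left out.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(query, keys, values):
    """Scaled dot-product attention for one query vector.
    Returns the attention weights and the weighted sum of the values."""
    d = len(query)
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
        for key in keys
    ]
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, output

# Made-up vectors standing in for three words in a sentence.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
query = [2.0, 0.0]

weights, output = attention(query, keys, values)
print("weights:", weights)
print("output:", output)
```

The query lines up with the first and third keys, so those two words get equal, larger weights and the second word gets less. Each query is computed independently of the others, with no step-by-step recurrence, which is where the parallelism comes from.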

taneq · on Nov 10, 2020

Because some significant proportion of ML/NN research involves throwing mud at a wall and seeing what sticks.

marcinzm · on Nov 10, 2020

How can you write a computer program if you don't understand how every library you're using works down to the smallest detail? Humans are fairly good at working with building blocks that they don't fully understand.

niea_11 · on Nov 10, 2020

I agree with you on the fact that humans can cope with lack of detailed knowledge to do things. But I think your programming analogy doesn't apply to this situation.

In your analogy, there is me, the programmer, and the library's programmer. If the library behaves consistently, I only need to know how to use it; I don't need to know how it works internally. But the library's programmer needs to know how it works. He may not need to know how the building blocks he uses in his library work, but he does need to know how they interact in order to deliver new features and bug fixes. I think he can't add features and fix bugs by trial and error (at least not all the time :)).

So in your analogy, I think the researchers who came up with the BERT model are the library's programmer, not the programmer.

niea_11 · on Nov 10, 2020

Thank you all for your replies!

taneq · on Nov 10, 2020

For anyone who, like me, isn't a BERTologist, BERT is a neural network architecture.

jeffrallen · on Nov 10, 2020

He's also Ernie's best friend.

kleiba · on Nov 10, 2020

Let's not forget Elmo!

nullsense · on Nov 10, 2020

Oh hey Ernie! Oh hey Bert!

godelmachine · on Nov 10, 2020

Ernie who?

cinntaile · on Nov 10, 2020

I can't hear you Bert, I've got a banana in my ear.

hallqv · on Nov 10, 2020

Any new information in the paper since the first version came out in March? Otherwise, a 6-month-old meta-study seems kind of dated given the rate of progress in NLP atm.