Hacker News
A Primer in BERTology: What We Know About How BERT Works (arxiv.org)
81 points by whym 18 days ago | hide | past | favorite | 16 comments

" BERT is a method of pre-training language representations, meaning that we train a general-purpose "language understanding" model on a large text corpus (like Wikipedia), and then use that model for downstream NLP tasks that we care about (like question answering). BERT outperforms previous methods because it is the first unsupervised, deeply bidirectional system for pre-training NLP."


Can anyone please explain (in layman's terms, if possible) how the researchers came up with the method in the first place, if the process by which the method finds its answers is not understood?

Intelligent trial and error.

1. Transformers are an extension of the attention mechanism, a well-known LSTM extension that worked well for machine translation (2014, https://arxiv.org/abs/1409.0473). A transformer model essentially builds a multi-headed attention module (analogous to the multiple filters of a CNN layer), then stacks several such layers, analogous to deep CNNs / stacked LSTMs. (1998, http://yann.lecun.com/exdb/publis/pdf/lecun-01a.pdf)
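To make the mechanism concrete, here is a minimal numpy sketch of single-head scaled dot-product attention; the sequence length, dimensions, and projection matrices are made up for illustration and are not BERT's actual parameters:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Scaled dot-product attention: each query position produces a
    # weighted mix of the value vectors, weighted by query-key similarity.
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    return softmax(scores) @ V

rng = np.random.default_rng(0)
seq_len, d = 4, 8
x = rng.normal(size=(seq_len, d))
# One "head" = attention run through its own learned projections.
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out = attention(x @ Wq, x @ Wk, x @ Wv)
print(out.shape)  # (4, 8)
```

Multi-head attention runs several such heads in parallel, each with its own projections, and concatenates the results; a transformer then stacks many of these layers.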

2. Transformers use residual blocks, which were introduced by the ResNet CNN architecture (2015, https://arxiv.org/abs/1512.03385). At the time, ResNet was topping the ImageNet benchmarks. This technique helps prevent the vanishing gradient problem during training.
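A rough sketch of the residual idea (the weights here are hypothetical, not ResNet's actual layers): the block computes a correction F(x) and adds it back to its input, so the identity path gives gradients a shortcut around the nonlinearity.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0)

def residual_block(x, W1, W2):
    # The block learns a correction F(x) = W2 @ relu(W1 @ x) and adds
    # it to the input; gradients flow through the identity path unchanged.
    return x + W2 @ relu(W1 @ x)

d = 8
x = np.ones(d)
# With zero weights the block is an exact identity, which is why very
# deep stacks stay trainable: layers can start out "doing nothing".
W1 = W2 = np.zeros((d, d))
print(np.array_equal(residual_block(x, W1, W2), x))  # True
```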

3. Transformers use normalization extensively: layer normalization and attention normalization. This helps keep the internal vectors in the neighborhood of 1 and prevents training collapse from vanishing gradients.
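Layer normalization is simple to sketch: each vector is rescaled to zero mean and unit variance along its feature dimension (the learned gain/bias parameters of the real layer are omitted here for brevity).

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each vector to zero mean and unit variance, keeping
    # activations in a well-scaled range through deep stacks.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

x = np.array([[1.0, 2.0, 3.0, 4.0]])
y = layer_norm(x)
print(y.mean(), y.std())  # ~0.0, ~1.0
```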

4. Correct initialization of the network weights also helps prevent the vanishing gradient problem.
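One common scheme for this is Xavier/Glorot initialization (my example, not one the comment names): weights are scaled so that activation variance is roughly preserved from layer to layer. The fan sizes below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

def xavier_init(fan_in, fan_out):
    # Scale weights so signal variance is preserved layer to layer:
    # too small -> vanishing activations, too large -> exploding ones.
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

W = xavier_init(256, 256)
x = rng.normal(size=256)
# The output's scale stays near 1 instead of collapsing or blowing up.
print(np.std(x @ W))
```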

5. Unsupervised pretraining was one of the first tricks to make deep networks work (2010, https://www.jmlr.org/papers/volume11/erhan10a/erhan10a.pdf).

6. Pretraining was used extensively in the vision community for transfer learning, i.e. reusing the weights of a network trained on ImageNet and replacing the top layers / loss function to tackle a different problem.
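A toy sketch of that transfer-learning recipe (the "pretrained" backbone here is just random weights standing in for an ImageNet network, and the closed-form head fit is purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "pretrained" backbone: frozen weights from a previous
# task, standing in for an ImageNet-trained feature extractor.
W_backbone = rng.normal(size=(32, 16))

def features(x):
    return np.tanh(x @ W_backbone)  # frozen: never updated

# Transfer learning: keep the backbone, train only a fresh head
# on the new task's labels.
X = rng.normal(size=(100, 32))
y = (X[:, 0] > 0).astype(float)
F = features(X)
w_head, *_ = np.linalg.lstsq(F, y, rcond=None)  # fit the new head
pred = F @ w_head > 0.5
print((pred == y).mean())  # training accuracy of the new head
```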

7. Finally, language modelling, that is, predicting the next word in a sentence, was a well-known technique for making machine translation work better. Researchers were looking for better language-modelling techniques using large corpora (2013, https://arxiv.org/abs/1312.3005, https://www.kaggle.com/c/billion-word-imputation).
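Next-word prediction can be illustrated with the simplest possible language model, a bigram counter over a toy corpus (BERT's masked-word training objective is a bidirectional variant of the same "predict the missing word" idea):

```python
from collections import Counter, defaultdict

# Toy stand-in for the large corpora mentioned above.
corpus = "the cat sat on the mat the cat ate".split()

# Count which word follows which: a bigram language model.
counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

def predict_next(word):
    # Most frequent continuation seen after `word` in the corpus.
    return counts[word].most_common(1)[0][0]

print(predict_next("the"))  # cat
```

Neural language models replace the counts with learned representations, but the training signal, predicting words from context, is the same.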

Cave men discovered fire before humans discovered phlogiston, which they discovered before they discovered there was no such thing as phlogiston but rather oxygen instead.

BERT was motivated by the discovery that neural nets will try to learn the functions you show them, plus some ideas about how to route information so it forms bottlenecks that force the network to learn general representations. But there was no mature theory behind it, and judging by that review paper there still isn't.

That whole paper reads to me like an account of wandering in the dark. If I read a paper that long about liquid rocket fuels, I'd learn that 99% of the things I might want to use as a rocket fuel won't work and I really have a choice of Hydrogen, Methane, or RP-1, plus Oxygen.

If you were out to "build a better BERT", even a slightly better BERT, that paper doesn't give clear guidelines about what you should do.

It's got all the trappings of a field which is preparadigmatic but could be mistaken for paradigmatic because of the sheer volume of researchers, conferences, papers, etc.

Throughout human history, you can find many discoveries that were made before anyone understood why they work. We know exactly how neural networks compute, but we don't know why they are so effective, perhaps because we lack a theoretical understanding of deep learning and complex neural networks.

For this particular case, BERT is the result of incremental enhancements to existing architectures and training procedures. Fundamentally, BERT is a sequence prediction model, and historically, sequence prediction models were based on complex recurrent or convolutional neural networks. Experience showed that the best-performing models were those with an attention mechanism (the concept of directing focus onto particular words or sentences). Some researchers then proposed a new, simple network architecture based solely on attention mechanisms, without recurrence or convolutions, and showed that such models achieve state-of-the-art performance while being more parallelizable and requiring significantly less time to train.

Because some significant proportion of ML/NN research involves throwing mud at a wall and seeing what sticks.

How can you write a computer program if you don't understand how every library you're using works down to the smallest detail? Humans are fairly good at working with building blocks that they don't fully understand.

I agree with you on the fact that humans can cope with lack of detailed knowledge to do things. But I think your programming analogy doesn't apply to this situation.

In your analogy, there are two people: me, the programmer, and the library's programmer. If the library behaves consistently, I only need to know how to use it; I don't need to know how it works internally. But the library's programmer does need to know how it works. He may not need to know how the building blocks he uses in his library work, but he does need to know how they interact in order to deliver new features and bugfixes. (I think) he can't add features and fix bugs by trial and error (at least not all the time :)).

So in your analogy, I think the researchers who came up with the BERT model are the library's programmer, not the programmer using it.

Thank you all for your replies!

For anyone who, like me, isn't a BERTologist, BERT is a neural network architecture.

He's also Ernie's best friend.

Let's not forget Elmo!

Oh hey Ernie! Oh hey Bert!

Ernie who?

I can't hear you Bert, I've got a banana in my ear.

Any new information in the paper since the first version came out in March? Otherwise a six-month-old meta-study seems kind of dated given the rate of progress in NLP atm.
