The State of Transfer Learning in NLP (ruder.io)
122 points by amrrs 61 days ago | 11 comments

What I have seen is that one of the main issues in NLP is actually finding annotated text. Especially for more specialized tasks, annotation can be a very costly process. It is not as easy to annotate, e.g., an argument as it is to label a bus or a person. I believe that unsupervised pre-training can help a lot with this issue: it is just not feasible to find big labelled corpora to train a model from scratch for every NLP task.

Another issue I've seen is that there is huge inertia in academia in fields where (deep) NLP is really needed (e.g. law or education). Most academics in those fields just cannot follow the developments, and many of the quantitative folks who can seem stuck in the SVM-plus-a-heckton-of-features approach. I am glad the trend is towards tools like BERT that can act as the initial go-to for NLP development, even if it's not exactly transfer learning as done by CV people.

Academic incentives are awkward. Computer scientists and other quants find it difficult to build careers in this stuff as it's often either too derivative to be cutting-edge ML ('you applied last year's tools to an education problem we don't care about, yay?') or too reductive to be useful in the target discipline ('you've got an automated system to tell us how newspaper articles are framed? Lovely, but we've been studying this with grad students reading them for 30 years and already have a much more nuanced view than your experimental system can give us').

That, and it's difficult to sell results to people who don't understand your methods, which by definition is practically everybody when making the first new applications in a field.

I'm a social scientist who can code, and even stuff like SVMs can be a hard sell outside the community of methodologists.

You're also spot on about annotated training data. The framing example above is my current problem, and there is one (1) existing annotated dataset, annotated with a version of frames which is useless for my substantive interest. Imagenet this is not.

Creating annotations is the main issue. It is what will separate those who commercialize NLP vs those who talk about it.

As a practitioner I don't think BERT and friends are that great. They get you from 88% accuracy on some of those tasks to 93% accuracy. They will get better over time, but they are approaching an asymptote -- that article points out several major limitations of the approach.

Thus a "simple classifier" plus a "heckton of features" can be effective for that work, particularly because you don't have to train and retrain complex models and can instead spend the effort on building up training sets and doing feature engineering.

Maybe I don't understand correctly, but my point is that BERT and friends just let you skip a lot of feature engineering, which is actually pretty useless once you shift the corpus a bit. That is the biggest contribution for me: I don't care about a couple of percentage points of better results in the end, but I do care that I don't need to spend all my time devising new features, which almost always requires an expert on that particular topic.

Yeah I agree. Deep learning mostly eliminates feature engineering which is a huge win. However BERT and other transformer models take too damn long to train. I still prefer using FastText for transfer learning because it can be trained on large corpora quickly. Combine that with supervised deep learning, and you often get good enough accuracy in practice.
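The recipe described here -- frozen pretrained vectors plus a light supervised model on top -- can be sketched in a few lines. The word vectors below are toy stand-ins for real fastText vectors, and the nearest-centroid classifier is a placeholder for whatever supervised model you'd actually use:

```python
import math
from collections import defaultdict

# Toy "pretrained" 3-d word vectors; real ones would come from fastText.
VECS = {
    "great": (0.9, 0.1, 0.0), "good": (0.8, 0.2, 0.1),
    "awful": (0.0, 0.9, 0.1), "bad":  (0.1, 0.8, 0.2),
}
DIM = 3

def embed(text):
    """Average the pretrained vectors of the in-vocabulary words."""
    hits = [VECS[w] for w in text.lower().split() if w in VECS]
    if not hits:
        return (0.0,) * DIM
    return tuple(sum(v[i] for v in hits) / len(hits) for i in range(DIM))

def train(examples):
    """Nearest-centroid classifier: one mean embedding per label."""
    sums = defaultdict(lambda: [0.0] * DIM)
    counts = defaultdict(int)
    for text, label in examples:
        v = embed(text)
        for i in range(DIM):
            sums[label][i] += v[i]
        counts[label] += 1
    return {lab: tuple(x / counts[lab] for x in sums[lab]) for lab in sums}

def predict(centroids, text):
    """Assign the label whose centroid is closest to the text embedding."""
    v = embed(text)
    return min(centroids, key=lambda lab: math.dist(v, centroids[lab]))
```

The point of the sketch is the division of labour: the embeddings are trained once (unsupervised, on a large corpus) and frozen, while the cheap supervised layer is the only thing retrained per task.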

Sure, but the feature engineer could escape the asymptote that BERT is converging towards and get better accuracy, which could make the product-market fit work.

As an NLP person who had to work on some object detection stuff recently for an unrelated matter, it's amazing what a difference transfer learning makes. Whole task becomes brainless.

You could argue that BERT was a first go at it, but until transfer learning stops equating to "throw compute at it because we're Google/OpenAI", we're nowhere near having solved this.

I would like to know: yes, it takes a lot of compute and time to train e.g. BERT (probably 64 V100s and 3 days of training). But once it is trained and you want to use it in your application, inference usually takes a few ms -- is it far longer with BERT? And can a modern smartphone easily run such inference? I've heard about MobileNets, but they sacrifice too much accuracy, so I really hope BERT can run today on a 7nm Snapdragon plus its mini TPU. I can't find such data on the web, but this is an elementary question, necessary for the complete success of NLP.
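For what it's worth, the inference-cost half of that question is easy to measure for any model you can call from code. A generic timing harness looks something like this (`fn` is a placeholder for the model's forward pass, not BERT specifically):

```python
import time

def mean_latency_ms(fn, batch, warmup=3, runs=20):
    """Average wall-clock time per call to fn(batch), in milliseconds."""
    for _ in range(warmup):          # warm caches/allocators before timing
        fn(batch)
    start = time.perf_counter()      # monotonic, high-resolution clock
    for _ in range(runs):
        fn(batch)
    return (time.perf_counter() - start) * 1000.0 / runs
```

On a phone you'd run the same loop against the exported model (e.g. a TFLite build) on the device itself, since desktop numbers don't transfer.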

"throw compute at it because we're Google/OpenAI" -- sorry, but in terms of training, neural network deep learning is far from a "smart" paradigm. It is essentially statistical brute force plus a few clever math tricks. This is part of the answer to how to create an artificial general intelligence. But where is the research on creating a causal reasoning system understanding natural language? It mostly died in the AI winter, and except for a few hipsters like me, OpenCog, or Cyc, it is dead. I wonder how many decades will be needed for firms like Google to realize such an obvious thing (that real intelligence is statistical AND causal).

We have had "good" (maybe not BERT/XLNet-ish levels of quality) results using ULMFiT. I.e., on almost all problems we got better results than with our previous best approaches (mostly LSTM/CNN and self-attention à la https://www.cs.cmu.edu/~./hovy/papers/16HLT-hierarchical-att...).

Thus, we've seen real value out of transfer learning that doesn't require too much compute power (and could actually even be run on free Colab instances, I think).

That said, I agree that the problem is still very far from being "solved". In particular, I fear that most recent advances might be traced back to gigantic models memorizing things (instead of doing something that could at least vaguely be seen as a sort of understanding of text) to slightly improve GLUE scores.

Still, I am highly optimistic about transfer learning for NLP in general.

There are some much smaller BERT models available on TF-Hub and elsewhere if you wish.

Most people also distill BERT to reduce the computation cost.
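For reference, the core of distillation is just training the small model to match the big model's softened output distribution. A minimal sketch of that loss in plain Python (with temperature T as in the usual formulation; the logits are hypothetical):

```python
import math

def softmax(logits, T=1.0):
    """Temperature-scaled softmax; higher T gives a softer distribution."""
    m = max(z / T for z in logits)                 # subtract max for stability
    exps = [math.exp(z / T - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distill_loss(student_logits, teacher_logits, T=2.0):
    """Cross-entropy between the teacher's softened distribution and the
    student's; minimized exactly when the student matches the teacher."""
    p = softmax(teacher_logits, T)                 # soft targets
    q = softmax(student_logits, T)                 # student predictions
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))
```

The high temperature is the interesting design choice: it exposes the teacher's relative confidences across wrong answers, which carry more signal than the hard labels alone.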

Great write up. Sebastian has a deep learning NLP email newsletter that I highly recommend. I read two articles this morning that were mentioned in the newsletter that he sent out last night. Good stuff.
