As an NLP person who recently had to work on some object detection for an unrelated project, it's amazing what a difference transfer learning makes. The whole task becomes almost brainless.
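To give a flavor of how little is involved, here's a rough sketch of the kind of thing I mean, using torchvision (the model choice, class count, and the version-dependent weights argument are placeholders, not my actual setup):

    import torchvision
    from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

    # Load a detector pre-trained on COCO (torchvision >= 0.13 API).
    model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")

    # Swap the box-prediction head for our own classes; the count is a placeholder.
    num_classes = 3  # e.g. 2 object classes + background
    in_features = model.roi_heads.box_predictor.cls_score.in_features
    model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)

    # From here it's a standard fine-tuning loop on a (possibly small) labeled
    # set; the pretrained backbone does most of the work.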
You could argue that BERT was a first go at it, but until transfer learning no longer equates to "throw compute at it because we're Google/OpenAI", we're nowhere near having solved this.
I would like to know: yes, it takes a lot of compute and time to train e.g. BERT
(probably 64 V100s and 3 days of training).
BUT once it is trained and you want to use it in your application:
inference usually takes a few ms; is it far longer with BERT?
And can a modern smartphone easily run such inference?
I've heard about MobileNets, but they sacrifice too much accuracy, so I really hope BERT can run today on a 7nm Snapdragon plus its mini TPU.
I can't find such data on the web, but this is an elementary question, and answering it is necessary for the complete success of NLP.
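For anyone who wants a quick ballpark on their own hardware, something like this measures per-sentence latency for a BERT-base forward pass on CPU (using the Hugging Face transformers library; the model name and repetition count are arbitrary choices for illustration, and phone-class chips will differ from a desktop CPU):

    import time
    import torch
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    # Rough CPU latency check: one short sentence, batch size 1.
    tok = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
    model.eval()

    inputs = tok("Is on-device inference fast enough?", return_tensors="pt")
    with torch.no_grad():
        model(**inputs)                      # warm-up pass
        start = time.perf_counter()
        for _ in range(20):
            model(**inputs)
    elapsed_ms = (time.perf_counter() - start) / 20 * 1000
    print(f"mean latency per forward pass: {elapsed_ms:.1f} ms")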
> throw compute at it because we're Google/OpenAI
Sorry, but in terms of how it learns, deep learning is far from a "smart" paradigm.
It is essentially statistical brute force plus a few clever math tricks.
That is part of the answer to how to create an artificial general intelligence.
But where is the research on building causal reasoning systems that understand natural language?
It mostly died in the AI winter, and apart from a few hipsters like me, OpenCog, or Cyc, it is dead.
I wonder how many decades it will take for firms like Google to realize such an obvious thing: that real intelligence is statistical AND causal.
We have had "good" (maybe not BERT/XLNet-ish levels of quality) results using ULMFiT: on almost all problems we got better results than with our previous best approaches (mostly LSTM/CNN and self-attention à la https://www.cs.cmu.edu/~./hovy/papers/16HLT-hierarchical-att...).
Thus, we've seen real value from transfer learning that doesn't require that much compute power (and could, I think, even be run on free Colab instances; see the sketch at the end of this comment).
That said, I agree that the problem is still very far from being "solved". In particular, I fear that many recent advances might be traced back to gigantic models memorizing things (instead of doing something that could at least vaguely be seen as a sort of understanding of text) to slightly improve GLUE scores.
Still, I am highly optimistic about transfer learning for NLP in general.
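For reference, what we did is roughly the standard two-stage ULMFiT procedure. A minimal fastai-style sketch (the DataFrame layout, column names, and training schedules here are illustrative, not our exact setup):

    from fastai.text.all import *

    # Assumption: a pandas DataFrame df with 'text' and 'label' columns.

    # Stage 1: fine-tune the pretrained AWD-LSTM language model on our corpus.
    dls_lm = TextDataLoaders.from_df(df, text_col='text', is_lm=True)
    lm_learn = language_model_learner(dls_lm, AWD_LSTM)
    lm_learn.fine_tune(3, 2e-2)
    lm_learn.save_encoder('ft_enc')

    # Stage 2: reuse that encoder in a downstream text classifier.
    dls_clf = TextDataLoaders.from_df(df, text_col='text', label_col='label',
                                      text_vocab=dls_lm.vocab)
    clf_learn = text_classifier_learner(dls_clf, AWD_LSTM, metrics=accuracy)
    clf_learn.load_encoder('ft_enc')
    clf_learn.fine_tune(4, 1e-2)

The key design point is that stage 1 only needs unlabeled text from your domain, so the labeled set for stage 2 can be fairly small.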