We have had "good" (maybe not BERT/XLNet-ish levels of quality) results using ULMFiT. On almost all of our problems it beat our previous best approaches (mostly LSTM/CNN and self-attention à la https://www.cs.cmu.edu/~./hovy/papers/16HLT-hierarchical-att...).
So we've seen real value from transfer learning that doesn't require much compute (I think it could even be run on free Colab instances).
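For concreteness, a minimal sketch of the kind of ULMFiT workflow I mean, roughly following the fastai v1 text tutorial (the exact class names depend on your fastai version, and 'texts.csv' with a label column followed by a text column is just a placeholder dataset):

    from fastai.text import *

    # 1) Fine-tune the pretrained AWD-LSTM language model on the target corpus
    data_lm = TextLMDataBunch.from_csv('.', 'texts.csv')
    lm = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.3)
    lm.fit_one_cycle(1, 1e-2)
    lm.save_encoder('ft_enc')   # keep the fine-tuned encoder weights

    # 2) Reuse that encoder for the downstream classifier, unfreezing gradually
    data_clas = TextClasDataBunch.from_csv('.', 'texts.csv', vocab=data_lm.vocab, bs=32)
    clf = text_classifier_learner(data_clas, AWD_LSTM, drop_mult=0.5)
    clf.load_encoder('ft_enc')
    clf.fit_one_cycle(1, 1e-2)
    clf.freeze_to(-2)           # unfreeze the last two layer groups
    clf.fit_one_cycle(1, slice(5e-3 / 2., 5e-3))

That whole loop fits comfortably on a single modest GPU, which is the point: most of the benefit comes from the pretrained language model, not from raw compute.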
That said, I agree that the problem is still very far from being "solved". In particular, I worry that many recent advances may trace back to gigantic models memorizing things to slightly improve GLUE scores, rather than doing anything that could even vaguely be called understanding text.
Still, I am highly optimistic about transfer learning for NLP in general.