
Improving Language Understanding with Unsupervised Learning - gdb
https://blog.openai.com/language-unsupervised/
======
jph00
This paper is really important: it shows that transfer learning can be applied
to a wide variety of NLP problems with great success. They show state-of-the-art
results on nearly every major class of NLP problem.

The basic approach is the same as our ULMFiT
([http://nlp.fast.ai/classification/2018/05/15/introducting-ulmfit.html](http://nlp.fast.ai/classification/2018/05/15/introducting-ulmfit.html))
model - pre-train a language model (a model that predicts the next word in a
sequence) on a large corpus, and then modify the language model slightly for
whatever task you wish to do (e.g. text classification). Finally, fine-tune
that model using your target corpus (e.g. texts labeled with classes).
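
To make that recipe concrete, here's a rough PyTorch sketch; the class names, the LSTM backbone, and the sizes are placeholders of my own rather than the actual ULMFiT or OpenAI code:

```python
import torch.nn as nn

# Step 1: pre-train this on a large unlabeled corpus with a next-word loss.
# (An LSTM stands in here; OpenAI's model uses a transformer instead.)
class LanguageModel(nn.Module):
    def __init__(self, vocab_size, hidden_size=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.encoder = nn.LSTM(hidden_size, hidden_size, batch_first=True)
        self.next_token = nn.Linear(hidden_size, vocab_size)

    def forward(self, tokens):                        # tokens: (batch, seq)
        hidden, _ = self.encoder(self.embed(tokens))
        return self.next_token(hidden)                # next-word logits

# Step 2: keep the pre-trained encoder, swap the LM head for a task head.
# Step 3: fine-tune the whole thing on your small labeled corpus.
class TextClassifier(nn.Module):
    def __init__(self, pretrained_lm, num_classes):
        super().__init__()
        self.lm = pretrained_lm                       # reuse pre-trained weights
        self.head = nn.Linear(self.lm.next_token.in_features, num_classes)

    def forward(self, tokens):
        hidden, _ = self.lm.encoder(self.lm.embed(tokens))
        return self.head(hidden[:, -1])               # classify from final state
```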

This new paper has two significant leaps over ULMFiT:

- Replace the RNN with a transformer model.

- Apply it to many more types of problems.

Note that although the original language model takes them a long time to train
(a month on 8 GPUs), there's almost no reason for anyone else to create their
own model from scratch, except if you need to use this approach on a language
that doesn't have a pre-trained model yet. The transfer learning fine-tuning
doesn't take anywhere close to as long as the language model pre-training, and
you can just use the existing pre-trained weights.
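
In code terms, that just means loading the released weights and running a short fine-tune. Continuing the sketch above, with a made-up filename and dummy data standing in for your labeled corpus:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

lm = LanguageModel(vocab_size=40000)
lm.load_state_dict(torch.load("pretrained_lm.pt"))    # illustrative path only

clf = TextClassifier(lm, num_classes=2)
optimizer = torch.optim.Adam(clf.parameters(), lr=1e-4)
loss_fn = torch.nn.CrossEntropyLoss()

# Stand-in for your (small) labeled target corpus.
labeled_loader = DataLoader(
    TensorDataset(torch.randint(0, 40000, (64, 50)),  # fake token ids
                  torch.randint(0, 2, (64,))),        # fake labels
    batch_size=16)

for tokens, labels in labeled_loader:                 # hours, not a month
    optimizer.zero_grad()
    loss_fn(clf(tokens), labels).backward()
    optimizer.step()
```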

The previous HN discussion on ULMFit may also be of interest:
[https://news.ycombinator.com/item?id=17076222](https://news.ycombinator.com/item?id=17076222)

~~~
andreyk
Should be noted this idea is not that novel though - it's just replacing word
vectors with a pre-trained model. Interesting that it works so well but not
very surprising.

~~~
laichzeit0
It may not be novel, but then why do we still have commercial APIs
(Microsoft, Google, IBM Watson, etc.) where there is pretty much no way to
"fine-tune" them to your domain with a small set of supervised examples? We
all know domain adaptation is a real problem.

Instead you either have to roll your own models in-house (which defeats the
whole point of using a ready-made cloud solution) or deal with whatever
accuracy you happen to get from those APIs.

IMHO this is an area where you can make some serious competitive headway in
commoditised AI/ML. Do all the heavy lifting of pretraining and give your
customers an API to "fine-tune" with. Who is currently doing this?
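
For what it's worth, the customer-facing side could be as small as a single call. The endpoint and fields below are purely hypothetical, just to sketch the idea:

```python
import requests

# Hypothetical hosted "fine-tune the pretrained model on my examples" API.
resp = requests.post(
    "https://api.example.com/v1/text-models/fine-tune",
    json={
        "base_model": "pretrained-lm-large",
        "examples": [
            {"text": "invoice overdue, please remit payment", "label": "billing"},
            {"text": "cannot log in after password reset", "label": "support"},
        ],
    },
)
model_id = resp.json()["model_id"]   # then classify new text against model_id
```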

~~~
pwaai
Well, I hope that [http://js.fo](http://js.fo) will be that, as I add more
AI/ML libraries, although it's currently focused on the web area. I hope to
expand into other areas.

------
cs702
This is fabulous, great work, with broad applicability.

Train this transformer model on a good amount of text (or grab a pretrained
model), and then, _with minimal fuss and very little tweaking_, you can
repurpose it to obtain state-of-the-art (or near state-of-the-art) results in
a wide range of tasks, from document classification to textual entailment to
semantic similarity.
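
The "minimal fuss" is largely because every task is reduced to a single token sequence fed to the same pretrained network, with only a small head trained per task. Roughly like the sketch below, where the special tokens and the helper are my own illustration rather than the paper's exact scheme:

```python
START, DELIM, EXTRACT = "<s>", "<$>", "<e>"   # illustrative special tokens

def pack(task, **fields):
    """Render a task's inputs as one token sequence for the shared model."""
    if task == "classification":                  # single document
        return [START, *fields["text"], EXTRACT]
    if task == "entailment":                       # premise then hypothesis
        return [START, *fields["premise"], DELIM,
                *fields["hypothesis"], EXTRACT]
    if task == "similarity":                       # sentence pair
        return [START, *fields["text_a"], DELIM,
                *fields["text_b"], EXTRACT]
    raise ValueError(f"unknown task: {task}")

print(pack("entailment",
           premise="a man is sleeping".split(),
           hypothesis="the man is awake".split()))
```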

This stands in contrast to prior approaches that involve much more tweaking
and/or careful discriminative fine-tuning of the pretrained model, such as
Jeremy Howard and Sebastian Ruder's also-impressive ULMFit.[a]

The main downside to this new approach is that pretraining takes a long time.

Anyone working on ML/DL/AI with text should take a look at this, right now.

UPDATE: See Jeremy Howard's comment here:
[https://news.ycombinator.com/item?id=17288320](https://news.ycombinator.com/item?id=17288320)

[a] [https://arxiv.org/abs/1801.06146](https://arxiv.org/abs/1801.06146)

------
laichzeit0
What's really great about these OpenAI papers is that they release the source
code as well. I've read so many papers where people have tried to reproduce
the results and failed, and you're left wondering whether they screwed up the
implementation somehow or maybe didn't initialise something "just right".
That's a huge amount of time wasted for no reason at all. It really should
become a standard requirement that you at the very least release your code.

Another excellent example I came across recently (it also happens to be about
unsupervised pretraining and transfer learning):
[https://github.com/bfelbo/DeepMoji](https://github.com/bfelbo/DeepMoji)

It's an absolute joy working with such papers, and I suspect it's one of the
best ways to get people to actually pay attention to your work in an era of
Arxiv Sanity Preserver.

------
nl
Also notable this week is
[https://arxiv.org/abs/1806.02847](https://arxiv.org/abs/1806.02847) from
Google where they got an 11% improvement on Winograd schema performance using
a similar idea.

------
radarsat1
> Our approach requires an expensive pre-training step - 1 month on 8 GPUs.

Wow! I enjoy playing with neural networks but this kind of thing reminds me
that I'm not really doing deep learning...

I have no idea how researchers could have the patience and confidence to wait
that long for a result. In my own (small-data) work, I get frustrated if it
doesn't converge in half an hour. I constantly end up Ctrl-C'ing and tweaking
things if it doesn't behave as expected or doesn't appear to be improving.

~~~
backpropaganda
You pretty much always need at least two GPUs: one to keep running jobs of 30
minutes or so for debugging, and the other for longer jobs. It also takes a
lot of patience to only make ONE change at a time. Often, changes that feel
intuitive actually hurt performance, so it's important to verify that each
new change really does improve things.

------
mark_l_watson
Very nice. I just requested the ROCStories data so I can experiment with this.

I enjoyed several talks at NAACL 2016 that referenced the ROCStories data, but
didn't really have the (personal, not work) compute power to do much. OpenAI's
nice contribution fixes that.

------
boxy310
Fascinating. I've used my own hacked-together unsupervised learning to
facilitate topic labeling, and this seems to address the bottleneck problem
for supervised learning. Will have to dig in deeper though to pick out
specifics.

Also interesting that one of the fundamental problems the authors note is "The
limits and bias of learning about the world through text", which is
essentially a Gödelian incompleteness problem. One could say the reverse also
applies to embodied/visual data, and it's a good argument for studying the
established literature in the abstract.

