
NLP's ImageNet Moment: From Shallow to Deep Pre-Training - stablemap
https://thegradient.pub/nlp-imagenet/
======
narrator
I was at a deep learning conference recently. The topic of how AI can improve
healthcare came up. One panelist said that a startup they were working with
wants to help doctors use AI to use NLP to send claims to insurance companies
in a way that won't be rejected. Another panelist said that he was working
with another startup that wants to use AI and NLP to help insurance companies
reject claims.

I think in the future we'll see their AI fighting against our AI in an arms
race similar to the spam wars. The one with the most computing power and
biggest dataset will win and humans will be at their mercy.

~~~
jahabrewer
> One panelist said that a startup they were working with wants to help
> doctors use AI to use NLP to send claims to insurance companies in a way
> that won't be rejected. Another panelist said that he was working with
> another startup that wants to use AI and NLP to help insurance companies
> reject claims.

Sounds like GAN in meatspace.

~~~
vokep
Yep, they compete endlessly, while we enjoy hyper-accurate decisions on these
things, leading to greater efficiency for both.

~~~
bigiain
Yes.

But quite possibly "greater efficiency" according to a fitness function that's
not accurately mapped onto "keeping humans alive"...

I wonder if this'll end up in an equivalent state to the "tank detection
neural net" which learned with 100% accuracy that the researchers/trainers had
always taken pictures of tanks on cloudy days and pictures without tanks on
sunny days? ([https://www.jefftk.com/p/detecting-tanks](https://www.jefftk.com/p/detecting-tanks))

Who'd bet against the doctor/insurer neural net training ending up approving
all procedures where, say, the doctor ends up with a kickback from a drug
company - instead of optimising for maximum human health benefit?

~~~
Rainymood
> But quite possibly "greater efficiency" according to a fitness function
> that's not accurately mapped onto "keeping humans alive"...

Since when was this ever the case? Especially in America? The US healthcare
system is NOT built around providing adequate care for everyone, as far as
I've read/heard.

Full disclosure: West-EU citizen here

------
rusbus
For more detail plus working code, lesson 4 of the fast.ai course uses this
technique to obtain what was (at the time of writing) a state-of-the-art
result on the IMDb dataset:

[http://course.fast.ai/lessons/lesson4.html](http://course.fast.ai/lessons/lesson4.html)

By training a language model on the dataset, then fine-tuning that model for
the sentiment classification task, they were able to achieve 94.5% accuracy.
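As a toy illustration of that two-stage recipe (this is not the actual ULMFiT code, and the miniature corpus, words, and nearest-centroid classifier below are all invented for the sketch), one can "pretrain" word representations on unlabeled text and then reuse them for sentiment classification with almost no labels:

```python
# Toy sketch of the pretrain-then-transfer recipe (NOT the real ULMFiT
# algorithm): stage 1 learns word representations from unlabeled text via
# co-occurrence counts; stage 2 reuses them to classify sentiment from
# just two labeled words. All data below is invented for illustration.

from collections import Counter, defaultdict

# --- Stage 1: "pretraining" on unlabeled text ---
unlabeled = [
    "great film great acting",
    "wonderful film wonderful acting",
    "terrible plot terrible pacing",
    "awful plot awful pacing",
]

# Represent each word by the counts of words appearing next to it.
cooc = defaultdict(Counter)
for sent in unlabeled:
    toks = sent.split()
    for i, w in enumerate(toks):
        for j in (i - 1, i + 1):
            if 0 <= j < len(toks):
                cooc[w][toks[j]] += 1

def doc_vec(text):
    """Sum the pretrained co-occurrence 'vectors' of a document's words."""
    v = Counter()
    for w in text.split():
        v.update(cooc[w])
    return v

def sim(u, v):
    return sum(u[k] * v[k] for k in u)

# --- Stage 2: nearest-centroid classifier from two labeled examples ---
pos, neg = doc_vec("great"), doc_vec("terrible")

def classify(text):
    v = doc_vec(text)
    return "pos" if sim(v, pos) >= sim(v, neg) else "neg"

# "wonderful" and "awful" were never labeled, but pretraining placed them
# near "great" and "terrible" respectively, so the transfer works:
print(classify("wonderful acting"))  # pos
print(classify("awful pacing"))      # neg
```

The point of the sketch is only the division of labor: the representations come from cheap unlabeled text, and the labeled task piggybacks on them.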

~~~
jph00
Well spotted - this is where I first created the algorithm that became ULMFiT!
I wanted to show an example of transfer learning outside of computer vision
for the course but couldn't find anything compelling. I was pretty sure a
language model would work really well in NLP so tried it out, and was totally
shocked when the very first model beat the previous state of the art!

Sebastian (author of this article) saw the lesson, and was kind enough to
complete lots of experiments to test out the approach more carefully, and did
a great job of writing up the results in a paper, which was then accepted by
the ACL.

------
cs702
The title is a little too click-baity for my taste ("has arrived," huh?), but
I think the OP is onto something.

It is now possible to grab a pretrained model and start producing state-of-
the-art NLP results in a wide range of tasks with relatively little effort.

This will likely enable much more tinkering with NLP, all around the world...
which will lead to new SOTA results in a range of tasks.

~~~
zawerf
Do you have links for these pretrained models? The only one I am aware of is
OpenAI's, where they fine-tuned a Transformer architecture that was
pre-trained for a month on 8 GPUs:

[https://blog.openai.com/language-unsupervised/](https://blog.openai.com/language-unsupervised/)

[https://github.com/openai/finetune-transformer-lm](https://github.com/openai/finetune-transformer-lm)

~~~
sebastianruder
You can find ELMo here:
[https://github.com/allenai/allennlp/blob/master/tutorials/ho...](https://github.com/allenai/allennlp/blob/master/tutorials/how_to/elmo.md)

And ULMFiT here:
[http://nlp.fast.ai/category/classification.html](http://nlp.fast.ai/category/classification.html)

~~~
cs702
For those who don't know, Sebastian Ruder is a coauthor of the ULMFiT paper:
[https://arxiv.org/abs/1801.06146](https://arxiv.org/abs/1801.06146)

------
acganesh
Pretrained models have enabled so much in CV; I'm excited to see similar
shifts in the language world.

A great supplement is Sebastian’s NLP progress repo:
[https://github.com/sebastianruder/NLP-progress](https://github.com/sebastianruder/NLP-progress)

------
neuromantik8086
Not to be too obtuse, but isn't WordNet (you know, the project that inspired
the creation of ImageNet) "an ImageNet for language"? It seems kind of weird
to bring up ImageNet within the context of NLP and not mention WordNet once.

~~~
sebastianruder
WordNet (as you probably know) is a database that groups English words into
sets of synonyms. If you consider WordNet as a clustering of high-level
classes, then you could argue that ImageNet is the "WordNet for vision",
meaning the clustering of object classes. The article uses a different meaning
of ImageNet, namely ImageNet as a pretraining task that can be used to learn
representations that will likely be beneficial for many other tasks in the
problem space. In this sense, you could use WordNet as an "ImageNet for
language", e.g. by learning word representations based on the WordNet
definitions. This is something people have done, but there are far more
effective approaches. I hope this helped and was not too convoluted.
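To make the definition-based idea concrete, here is a toy sketch (the glosses below are invented stand-ins, not real WordNet data): each word is represented by the bag of words in its definition, so words with similar glosses end up with similar representations.

```python
# Toy "word representations from dictionary definitions" sketch.
# The glosses are made up for illustration; real WordNet glosses
# would be looked up from the database instead.

from collections import Counter

glosses = {
    "cat": "small domesticated feline animal kept as a pet",
    "dog": "domesticated canine animal kept as a pet",
    "car": "wheeled motor vehicle used for transport",
}

def rep(word):
    """Bag-of-words representation of a word's definition."""
    return Counter(glosses[word].split())

def overlap(w1, w2):
    """Crude similarity: number of shared gloss words."""
    return sum((rep(w1) & rep(w2)).values())

print(overlap("cat", "dog"))  # 6 -- shared: domesticated animal kept as a pet
print(overlap("cat", "car"))  # 0 -- no gloss words in common
```

Even this crude scheme captures that "cat" is more like "dog" than like "car", which is the intuition behind definition-based representations; the more effective approaches mentioned above learn representations from usage statistics instead.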

~~~
neuromantik8086
Does WordNet know that the word "ImageNet" refers to both a database and a
pretraining task? :)

~~~
wodenokoto
No, it does not know that, or anything else about "ImageNet".

[http://wordnetweb.princeton.edu/perl/webwn?c=8&sub=Change&o2...](http://wordnetweb.princeton.edu/perl/webwn?c=8&sub=Change&o2=&o0=1&o8=1&o1=1&o7=&o5=&o9=&o6=&o3=&o4=&i=-1&h=&s=imagenet)

------
andreyk
TLDR: the standard practice of using 'word vectors' (numeric vector
representations of words) may soon be superseded by just using entire
pretrained neural nets, as is standard in CV, and we have both conceptual and
empirical reasons to believe language modeling is how it'll happen.

Helped edit this piece, think it is spot on - exciting times for NLP.
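The difference is easy to see in a toy contrast (the two-dimensional vectors and the crude neighbor-mixing encoder below are made up for illustration, nothing like a real pretrained net such as ELMo): a static word-vector table gives "bank" one fixed representation, while even a trivial context-sensitive encoder produces different representations in different sentences.

```python
# Toy contrast: static word vectors vs. a context-sensitive encoder.
# Vectors and the mixing rule are invented for illustration only.

static = {
    "bank":  [1.0, 0.0],
    "river": [0.0, 1.0],
    "money": [0.5, 0.5],
}

def contextual(tokens, i):
    """Representation of tokens[i], shifted toward its neighbors."""
    v = list(static[tokens[i]])
    for j in (i - 1, i + 1):
        if 0 <= j < len(tokens):
            n = static[tokens[j]]
            v = [a + 0.5 * b for a, b in zip(v, n)]
    return v

# Static lookup: "bank" is identical in both sentences.
# Contextual encoding: the two occurrences of "bank" differ.
print(contextual(["river", "bank"], 1))  # [1.0, 0.5]
print(contextual(["money", "bank"], 1))  # [1.25, 0.25]
```

A pretrained network carries this kind of context-dependence (plus everything else it learned) into the downstream task, which is what a fixed word-vector table cannot do.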

------
JPKab
Definitely excited by this, but wish the article was a bit more detailed.

~~~
jph00
The ULMFiT, ELMo, and OpenAI Transformer papers are all quite readable and
linked from the article. Sebastian and I also wrote an introduction to ULMFiT
here: [http://nlp.fast.ai/classification/2018/05/15/introducting-ulmfit.html](http://nlp.fast.ai/classification/2018/05/15/introducting-ulmfit.html)

~~~
JPKab
Thanks!

------
YeGoblynQueenne
>> In order to predict the most probable next word in a sentence, a model is
required not only to be able to express syntax (the grammatical form of the
predicted word must match its modifier or verb) but also model semantics. Even
more, the most accurate models must incorporate what could be considered world
knowledge or common sense.

So, the first sentence in this passage is a huge assumption. For a model to
predict the next token (word or character) in a string, all it has to do is
predict the next token in a string. In other words, it needs to model
structure. Modelling semantics is not required.

Indeed, there exists a wide variety of models that can predict the most
likely next token in a string. The simplest of these are n-gram models, which
can do this task reasonably well. Maybe what that first sentence above is
trying to say is that to predict the next token with good accuracy, modelling
of semantics is required, but that is still a great, big, huge leap of
reasoning. Again: structure is probably sufficient. A very accurate model of
structure is still only a model of structure.
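A minimal bigram model (over an invented toy corpus) makes this point concrete: it predicts the most likely next token purely from surface co-occurrence statistics, with no notion of meaning anywhere.

```python
# Minimal bigram next-token predictor of the kind mentioned above.
# The corpus is a made-up toy; the model knows only which token most
# often followed which -- pure structure, no semantics.

from collections import Counter, defaultdict

corpus = "the cat sat on the mat . the dog sat on the rug .".split()

# Count, for each token, what came immediately after it.
following = defaultdict(Counter)
for a, b in zip(corpus, corpus[1:]):
    following[a][b] += 1

def predict_next(word):
    """Most frequent token observed after `word` in the corpus."""
    return following[word].most_common(1)[0][0]

print(predict_next("sat"))  # on
```

The prediction is right for the wrong reasons, so to speak: "on" follows "sat" in the counts, and that is the entirety of what the model knows.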

It's important to consider what we mean when we talk about modelling language
probabilistically. When humans generate (or recognise) speech, we don't do it
stochastically, by choosing the most likely utterance from a distribution.
Instead, we -very deterministically- say what we want to say.

Unfortunately, it is impossible to observe "what we want to say" (i.e. our
motivation for emitting an utterance). We are left with observing -and
modelling- only what we actually say. The result is models that can capture
the structure of utterances, but are completely incapable of generating new
language that makes any sense - i.e., they produce gibberish.

It is also worth considering how semantic modelling tasks are evaluated (e.g.
machine translation). Basically, a source string is matched to an arbitrary
target string meant to capture the source string's intended meaning.
"Arbitrary" because there may be an infinite number of strings that carry the
same meaning. So what, exactly, are we measuring when we evaluate a model's
ability to map between two of those infinite strings, chosen just because we
like them best?

Language inference and comprehension benchmarks like the ones noted in the
article are particularly egregious in this regard. They are basically
classification tasks, where a mapping must be found between a passage and a
multiple-choice spread of "correct" labels, meant to represent its meaning.
It's very hard to see how a model that does well in this sort of task is
"incorporating world knowledge" let alone "common sense"!

Maybe NLP _will_ have its ImageNet moment- but that will only be in terms of
benchmarks. Don't expect to see machines understanding language and holding
reasonable conversations any time soon.

~~~
DoctorOetker
I fully agree and while you probably word it much better than me, I made a
somewhat similar argument at
[https://news.ycombinator.com/item?id=16961233](https://news.ycombinator.com/item?id=16961233)
if you are interested...

