
State-of-the-art text classification with universal language models - jph00
http://nlp.fast.ai/classification/2018/05/15/introducting-ulmfit.html
======
jph00
The post mentions that there's work ongoing to make this available in
languages other than English. If you'd like to help contribute a language,
there's a discussion thread available with 20 languages under development so
far - we'd love to have more folks join us: [http://forums.fast.ai/t/language-
model-zoo-gorilla/14623](http://forums.fast.ai/t/language-model-zoo-
gorilla/14623)

------
pixelHD
I'm glad we're once again concentrating on language models.

Curious how it'll perform compared to fasttext when used as an encoding
network in larger tasks. I can't help but notice the trend of going back to
simpler models with smarter optimization and regularization to achieve better
results.

This is a frequent question of mine, which I ask everyone using RNNs: what do
you think of the idea that CNNs will be able to replace RNNs for sequence
tasks [0]? CNNs are also less computationally expensive, so there's a definite
benefit to switching if the performance is on par.

[0]:
[https://twitter.com/lmthang/status/989261575482560513](https://twitter.com/lmthang/status/989261575482560513)

~~~
jph00
fasttext is just an encoding of the first layer of a model (the word
embeddings - or subword embeddings). Full multi-layer pre-trained models are
able to do a lot more. For instance, on IMDb sentiment our method is about
twice as accurate as fasttext.
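
To make the distinction concrete, here's a minimal PyTorch sketch (toy sizes,
random stand-in vectors) of what fasttext-style transfer gives you - just a
pre-trained first layer:

```python
import torch
import torch.nn as nn

vocab, dim = 30000, 300
vectors = torch.randn(vocab, dim)  # stand-in for real fasttext vectors

# fasttext-style transfer: only the first layer (the word embeddings) is
# pre-trained; everything above it starts from random weights.
embed = nn.Embedding.from_pretrained(vectors, freeze=False)
rnn = nn.LSTM(input_size=dim, hidden_size=256)  # trained from scratch

tokens = torch.randint(0, vocab, (35, 8))  # (seq_len, batch) of token ids
output, _ = rnn(embed(tokens))  # only layer 1 benefited from pre-training
```

With a full pre-trained language model, the recurrent layers above the
embeddings would be loaded from the pre-trained checkpoint too, so the
higher-level representations transfer as well.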

As to whether CNNs can replace RNNs in general, the jury is still out. Over
the last couple of years there have been some sequence tasks where CNNs are
state of the art, some where RNNs are. Note that with stuff like QRNNs the
assumption that CNNs are less computationally expensive is no longer
necessarily true: [https://github.com/salesforce/pytorch-
qrnn](https://github.com/salesforce/pytorch-qrnn)
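
For reference, usage looks roughly like this (lightly adapted from that
repo's README; note the fused kernel path assumes a GPU):

```python
import torch
from torchqrnn import QRNN

seq_len, batch_size, hidden_size = 7, 20, 256
X = torch.rand(seq_len, batch_size, hidden_size).cuda()

# Two stacked QRNN layers - the interface is a near drop-in for nn.LSTM
qrnn = QRNN(hidden_size, hidden_size, num_layers=2, dropout=0.4)
qrnn.cuda()

output, hidden = qrnn(X)
print(output.size(), hidden.size())
```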

I'd be surprised if CNNs win out in the end for tasks that require long-term
state (like sentiment analysis on large docs), since RNNs are specifically
designed to be stateful - especially with the addition of an attention layer.

~~~
Radim
_For instance, on IMDb sentiment our method is about twice as accurate as
fasttext._

Seeing as fasttext accuracy is 90%+, does this mean your method achieves 180%?

I'm nitpicking of course, but lately I've seen claims like "20% improvement in
accuracy", where on closer inspection, the authors mean error rate dropped
from 5% to 4%.

Which is not bad of course, but in the grand scheme of things, a 1% absolute
improvement may not be such a game-changer, especially if it comes at the cost
of other relevant metrics like model complexity, developer sanity or
performance.

(haven't read your paper yet, just a general sigh/rant)

~~~
PeterisP
This generally _is_ the metric you care about - a difference of one percentage
point can be an improvement of twenty percent, since it means the total number
of "bad events" you expect when running the system drops by 20%. And it's
quite reasonable to assume that here, as in almost all other domains, "x%
improvement" means the percentage difference (multiplicative), not the
percentage-point difference (subtractive). For pretty much every percentage
quantity - defect ratios, recidivism rates, financial interest rates - a "20%
increase" never means an increase of 20 percentage points, but an increase of
20 percent of the starting value. If we're nitpicking, "1% absolute
improvement" is the inaccurate statement; the improvement should be described
as 1pp (or 20%), not 1%.

Especially for more well defined problems, going from 98.5% to 99.5% is "just"
1pp of absolute improvement, but the fact that you now make a third as many
mistakes can well justify a more complex model that requires ten times more
hardware. The metric you'd actually care about is often something like "number
of hours required to correct the mistakes" or "number of lost sales due to
mistakes", and those track the relative percentage change.
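
The arithmetic, spelled out:

```python
# Percentage points vs. relative change, using the numbers from this thread.
def relative_error_reduction(acc_before, acc_after):
    err_before, err_after = 1 - acc_before, 1 - acc_after
    return (err_before - err_after) / err_before

# 95% -> 96% accuracy: 1pp absolute, but a 20% relative error reduction.
print(relative_error_reduction(0.95, 0.96))    # 0.2

# 98.5% -> 99.5%: also 1pp absolute, but errors drop 3x (~67% reduction).
print(relative_error_reduction(0.985, 0.995))  # 0.666...
```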

~~~
Radim
Yes, that's what I was getting at.

Your note on "more well defined problems" is spot on. Chasing single percent
improvements and SOTA is indeed the name of the game there.

But defining the problem in the first place, figuring out the cost matrix and
solution constraints, is typically the bigger challenge in highly innovative
projects. Once you know what to chase, 80% of the job is done.

Disclosure: building commercial ML systems for the past 11 years, using deep
learning and otherwise. What you call "metric you care about" is often not the
metric you care about. This is why people coming from academia are sometimes
taken by surprise that logistic regression, linear models, or heck, even rule-
based systems (!) are still so popular. Model simplicity, developer sanity and
performance do matter, too.

------
jph00
Jeremy here (co-author of this paper). Let me know if you have any questions!

~~~
Abundnce10
_This method dramatically improves over previous approaches to text
classification, and the code and pre-trained models allow anyone to leverage
this new approach to better solve problems such as: Finding documents relevant
to a legal case; Identifying spam, bots, and offensive comments; Classifying
positive and negative reviews of a product; Grouping articles by political
orientation;_

I'm starting a new project where I'm given many recipes, and I need to take in
free-form text for recipe ingredients (e.g. "1/2 cup diced onions", "two
potatoes, cut into 1-inch cubes", etc.) and build a program that identifies
the ingredient (e.g. onion, potato) as well as the quantity (e.g. 0.5 cup,
2.0 units). Could I use something like Fast.ai to tackle this problem?

~~~
ioot17
CRF works quite well - it's what I use right now for recipe parsing on
[https://cookalo.com/](https://cookalo.com/). It's based on CRFsuite with
Python bindings, trained on already-labeled recipes. If you build your own app
and want to do a comparison, feel free to run some benchmarks against it.
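
For the curious, the general shape of a CRF ingredient tagger looks something
like this sketch using sklearn-crfsuite (a Python binding for CRFsuite); the
feature functions and tag set below are illustrative, not necessarily what
cookalo actually uses:

```python
import sklearn_crfsuite

def token_features(tokens, i):
    """Simple per-token features; a real system would use richer ones."""
    t = tokens[i]
    return {
        'lower': t.lower(),
        'is_qty': t.replace('/', '').isdigit(),  # catches "1/2", "1"
        'prev': tokens[i - 1].lower() if i > 0 else '<s>',
        'next': tokens[i + 1].lower() if i < len(tokens) - 1 else '</s>',
    }

# Tiny hand-labeled examples (tags mirror the qty/unit/name/comment fields).
sents = [["1/2", "cup", "diced", "onions"],
         ["two", "potatoes", ",", "cut", "into", "1-inch", "cubes"]]
tags = [["QTY", "UNIT", "COMMENT", "NAME"],
        ["QTY", "NAME", "COMMENT", "COMMENT", "COMMENT", "COMMENT", "COMMENT"]]

X = [[token_features(s, i) for i in range(len(s))] for s in sents]
crf = sklearn_crfsuite.CRF(algorithm='lbfgs', max_iterations=50)
crf.fit(X, tags)

test = ["1", "1/2", "cups", "seedless", "red", "grapes"]
print(crf.predict_single([token_features(test, i) for i in range(len(test))]))
```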

~~~
Abundnce10
Very cool! It sounds like you followed a similar approach to the one the NY
Times used for recipe parsing, correct?

How does your API handle ingredients with multiple options (e.g. "1 1/2 cups
seedless red or green grapes")?

~~~
ioot17
Yes, that's correct - it's similar to the mechanism the NY Times guys were
using, and I've been focusing on the datasets to feed the CRF, since that's
what drives the whole thing. This is the output I got for your example:

    [ { "unit": "cup",
        "input": "1 1/2 cups seedless red or green grapes",
        "name": "red grapes",
        "qty": "1 1/2",
        "comment": "seedless or green" } ]

Don't hesitate to try the API out by pasting some examples to the white box on
the site and pressing the "Try it out!" button, it's interactive :)

~~~
Abundnce10
_Don't hesitate to try the API out by pasting some examples to the white box
on the site and pressing the "Try it out!" button, it's interactive_

Sweet, I didn't realize it was interactive. I'll give it a try!

------
dreaminvm
Link to paper for those that are curious:
[https://arxiv.org/abs/1801.06146](https://arxiv.org/abs/1801.06146)

------
eddotman
Wow - great paper! Very readable / accessible. I'm working on some stuff for
NLP in materials science academic literature, but we haven't tried anything
beyond the usual word embedding -> supervised classifier approach. I'll have
to give this a try!

------
saulberardo
Jeremy, after a quick first reading, I believe the following can be improved
in the paper:

1. In the introduction, there seem to be contradictory statements about the
status of inductive transfer for NLP. It is first stated that it "has had a
large impact in practice", then in the next paragraph that "it has been
unsuccessful for NLP". How can it have had a large impact and at the same time
have been unsuccessful?

2. In the introduction, it is stated that "Research in NLP focused mostly on
transductive transfer". Perhaps this statement was valid back in 2007, but it
seems outdated to me. Recently, most transfer learning in NLP has involved
using pre-trained embeddings in an inductive transfer setting.

3. At the beginning of the "2 Related Work" section, in the excerpt "Features
in deep neural networks in CV have been observed to transition from task-
specific to general from the first to the last layer", I believe "first to the
last" should read "last to the first", since the last layers have the more
task-specific features and the first layers the more general ones.

------
jfaucett
Man, that video at the bottom is great. I just finished up research on an
image classification task in medical imaging, and that tool would have helped
a lot with debugging and interpreting results - especially when you're working
with image datasets of objects you aren't used to (like medical datasets where
different tissue textures are important and only medical experts can
distinguish between them).

------
web64
I'm looking forward to testing this out!

I noticed the "NLP classification page" links are broken. I assume they should
go to:
[http://nlp.fast.ai/category/classification.html](http://nlp.fast.ai/category/classification.html)

~~~
jph00
Oops - forgot the leading '/'. Fixed now. Many thanks for letting us know.

~~~
amarsharma
Also, clicking on "Universal Language Model Fine-tuning for Text
Classification" sends me to [https://arxiv.org/](https://arxiv.org/) not to
the paper.

~~~
jph00
D'oh! Many thanks - fixed that one too.

------
ozcoder56
Hi Jeremy, I noticed you were active around the Kaggle toxic comments
challenge, though you didn't participate. Did you apply this model to that
problem, and if so, how were the results?

------
beders
I've yet to see data where this beats an SVM on a 4-character shingle approach
(which also doesn't require tons of data to train).
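
For anyone who wants to run that comparison themselves, the baseline is a few
lines of scikit-learn (a sketch with toy data; the "shingles" here are
character 4-grams):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Character 4-gram ("shingle") TF-IDF features feeding a linear SVM.
baseline = make_pipeline(
    TfidfVectorizer(analyzer='char_wb', ngram_range=(4, 4)),
    LinearSVC(),
)

texts = ["loved every minute of it", "utterly dreadful film"]  # toy data
labels = [1, 0]
baseline.fit(texts, labels)
print(baseline.predict(["a dreadful bore"]))
```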

~~~
jph00
It beats it easily - see the paper and its citations for comparisons.

~~~
gargarplex
Is there an easily accessible API? And is this robust to bad labels, i.e.
imperfect training data? I have a huge corpus of labeled job descriptions and
I want to categorize them as 'programming' or 'not programming'. The accuracy
of my manual labeling is around 95%. Can that be used to train a classifier
using this newly published technique?

------
bitL
Fantastic! Thanks for sharing this! Can't wait to draw inspiration from the
paper for my own work! ;-)

------
ramanan
I understand that you do mention the pre-training / transfer learning approach
clearly, but isn't it disingenuous to claim that you provide better
performance based on (only) 100 labeled examples, when the pre-training
dataset (Wikitext-103) actually contains 103M words?

~~~
jph00
Of course not. The use of pre-training on a large unlabeled corpus and
subsequent fine-tuning _is_ what the paper is about. It is stated repeatedly
in the paper and the post.

It is totally correct and in no way misleading to say we need only 100 labeled
examples. Anyone can get similar results on their own datasets without even
needing to train their own wikitext model, since we've made the pre-trained
model available.
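
For readers who want the shape of the recipe, it's roughly this (a bare
PyTorch skeleton, not the actual fastai code; the layer sizes loosely follow
the paper's AWD-LSTM, and the checkpoint path here is just illustrative):

```python
import torch
import torch.nn as nn

class LM(nn.Module):
    """Language model: 400-dim embeddings, 3 LSTM layers of 1150 units."""
    def __init__(self, vocab=30000, emb=400, hid=1150, layers=3):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        self.rnn = nn.LSTM(emb, hid, num_layers=layers)
        self.decode = nn.Linear(hid, vocab)

    def forward(self, x):
        out, _ = self.rnn(self.embed(x))
        return self.decode(out)

lm = LM()
# Stage 1: load the weights pre-trained on Wikitext-103 (released alongside
#          the paper).
# lm.load_state_dict(torch.load('wt103_lm.pth'))  # illustrative path
# Stage 2: fine-tune the LM on *unlabeled* target-domain text - ordinary
#          next-word prediction, so no labels are needed.
# Stage 3: swap the decoder for a classifier head and train that on the
#          small labeled set (e.g. the 100 examples).
lm.decode = nn.Linear(1150, 2)  # e.g. binary sentiment head
```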

(BTW, I see you work at a company that sells something that claims to
"categorize SKUs to a standard taxonomy using neural networks." This seems
like something you maybe could have mentioned.)

~~~
ramanan
Got it. I was looking for input on how generalizable the weights are (their
ability to change/adapt) when the labeled training data is 100x smaller than
the initial pre-training dataset.

Also, I don't understand the need to be so defensive, or what my employer has
to do with my post.

~~~
JPKab
When you use the word disingenuous, you invited the response you got. Totally
uncalled for to write that.

His response on your employer was likely driven by an assumption that you
viewed this as free, open source competition to your product, and thus the
negative comment.

To the OP: I've done a lot of NLP, and this is phenomenal work.

