
What Kagglers Are Using for Text Classification - homarp
https://mlwhiz.com/blog/2018/12/17/text_classification/
======
minimaxir
It should be noted that CNNs and LSTMs are an _order of magnitude_ slower than
things like bag-of-words/fasttext unless you're using an expensive GPU, and
the accuracy benefit _if any_ may be marginal in practice.

Kaggle prioritizes chasing a metric, but real-world data science has more
considerations.

~~~
995533
Nobody cares how long it takes to train a model. What matters is prediction
speed, which is comparable (and NLP workloads are less likely to require
high-frequency serving, where a few extra milliseconds matter).

Besides that, the accuracy gains are not marginal anymore (BoW can't compete
like it used to, especially with pre-trained models).

~~~
Cybiote
> Nobody cares how long it takes to train a model.

This isn't true. It depends on your priorities and goals. Machine learning
that spends most of its time unable to learn is not real AI. Some of us are
interested in sample- and energy-efficient learning capable of on-line
incremental updates and immune to catastrophic forgetting. Not just because
this is truer to actual learning, but because it moves us away from depending
on a handful of companies to do the actual training.

Anticipating some replies: no, transfer learning and meta-learning methods
don't really avoid this. In the case of transfer learning, you still have that
high coupling to a handful of sources. The downsides of this are a discussion
of their own. In addition, there are times when the ability to extract local
relations is dulled by the dominant Wikipedia and Common Crawl
representations. Meta-learning gets you fast updates, but you still cannot
stray too far from the domains seen at training time.

> What matters is prediction speeds

I'm not a fan of bag-of-words models either, but a simple dot product is
always going to be faster than many matrix multiplies and/or convolutions. The
implementor should always try these as a baseline and decide whether the
performance/accuracy trade-off is worth it for them.
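The baseline described here really is just one sparse dot product per prediction. A toy sketch in Python (the vocabulary, weights, and "complaint" task are all made up for illustration; a real model would learn the weights):

```python
# A bag-of-words prediction is one sparse dot product:
# score = bias + sum of learned weights for the tokens present.
# Weights below are invented, not trained.

weights = {"refund": 1.8, "broken": 1.2, "great": -2.1, "love": -1.7}
bias = -0.3

def predict_complaint(text):
    """Score > 0 means 'complaint' under this toy linear model."""
    score = bias
    for token in text.lower().split():
        score += weights.get(token, 0.0)  # unseen tokens contribute nothing
    return score > 0

print(predict_complaint("broken item please refund"))  # True
print(predict_complaint("love it great product"))      # False
```

Prediction cost is linear in the number of tokens, with no matrix multiplies at all, which is the point being made above.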

~~~
995533
Nobody in business cares if you are doing proper AI or dumb curve fitting.
What matters is the complexity (engineering debt) and the performance
(accuracy, robustness).

Online learning and sample/energy efficiency are unrelated to training times.
Like I said: nobody cares if you ran Vowpal Wabbit for 1 hour or 100 hours, as
long as you are not constantly babysitting it and calling that paid work (or
have the unusual requirement of daily retraining while using an online model).

> simple dot product is always going to be faster than many matrix multiplies

If you care about this (because it is profitable), you rewrite it in a
lower-level language or predict on a cloud GPU (which will be at least
comparable in speed to a simple dot product, while gaining accuracy).

~~~
dna_polymerase
> Nobody in business cares if you are doing proper AI or dumb curve fitting.

What is proper AI? It's all dumb curve fitting right now.

------
biomodel
Not sure why anyone would use 2D CNNs for processing text when there is no
spatial correlation across the embedding features. Recent work such as
[https://arxiv.org/abs/1803.01271](https://arxiv.org/abs/1803.01271) shows
that for most tasks, 1D CNNs outperform recurrent architectures while being
faster to train.
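For concreteness, a 1D convolution slides a kernel along the sequence axis only and contracts the entire embedding dimension at each step, so no spatial structure is assumed across the embedding features. A framework-free sketch with made-up shapes:

```python
def conv1d(seq, kernel):
    """seq: list of T embedding vectors (each of length d).
    kernel: list of k weight vectors (each of length d).
    Returns T - k + 1 scalar activations: the kernel slides along the
    time axis only, and the embedding axis is fully contracted, so no
    'spatial' structure is assumed across embedding features."""
    k = len(kernel)
    out = []
    for t in range(len(seq) - k + 1):
        acc = 0.0
        for i in range(k):
            acc += sum(w * x for w, x in zip(kernel[i], seq[t + i]))
        out.append(acc)
    return out

# Toy sequence of four 3-dim embeddings, kernel of width 2.
seq = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 0]]
kernel = [[1, 1, 1], [1, 1, 1]]
print(conv1d(seq, kernel))  # [2.0, 2.0, 3.0]
```

A 2D convolution would additionally slide a window across the embedding dimension, which treats adjacent embedding components as related when they are not.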

~~~
soraki_soladead
This is just a bug in their code. The paper they cite uses 1D convolutions.
Though, I suppose having an unused dimension only really hurts efficiency.

~~~
gnulinux
> Though, I suppose having an unused dimension only really hurts efficiency.

That might not be true, as it might increase bias and thus require more
careful hyperparameter tuning to avoid overfitting.

------
elyase
We have done extensive testing in the context of chatbot intent
classification, and on our particular problem nothing (including CNNs, LSTMs,
and fasttext, plus LUIS, Watson, and other proprietary classifiers) has been
able to beat a simple linear model trained on char n-gram features.
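As a rough illustration of the approach (not elyase's actual pipeline), here is a char n-gram + linear model in pure Python; the toy intents, examples, and perceptron hyperparameters are all invented. In practice you would more likely reach for something like scikit-learn's `TfidfVectorizer(analyzer="char_wb")` feeding `LogisticRegression`:

```python
from collections import defaultdict

def char_ngrams(text, n=3):
    text = f" {text.lower()} "          # pad so word boundaries become features
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def train_perceptron(examples, epochs=20):
    """examples: list of (text, label) pairs with label in {+1, -1}."""
    w = defaultdict(float)
    for _ in range(epochs):
        for text, label in examples:
            feats = char_ngrams(text)
            score = sum(w[f] for f in feats)
            if label * score <= 0:      # mistake-driven update
                for f in feats:
                    w[f] += label
    return w

def predict(w, text):
    return 1 if sum(w[f] for f in char_ngrams(text)) > 0 else -1

# Toy intents: +1 = book_flight, -1 = check_weather (data invented)
data = [
    ("book a flight to paris", 1),
    ("i need a plane ticket", 1),
    ("book me a flight tomorrow", 1),
    ("what is the weather today", -1),
    ("will it rain tomorrow", -1),
    ("weather forecast for paris", -1),
]
w = train_perceptron(data)
print([predict(w, t) for t, _ in data])
```

The whole model is one weight per observed n-gram, which is why such classifiers are cheap to train, fast at prediction time, and easy to inspect.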

~~~
briga
I've seen the same thing in the models I've built. For basic intent
classification, simpler models seem to be more accurate, not to mention they
train faster and require less memory. There seems to be a lot of emphasis on
shiny, complex neural network architectures, even when simple models work just
fine.

~~~
FridgeSeal
> There seems to be a lot of emphasis on shiny complex neural network
> architectures, even when simple models work just fine.

It's resume-driven-development for data scientists.

I've never seen an interviewer impressed by the fact that a job was done
without deep learning, but say that you used deep learning (however spurious
the choice) and they light up like it's Christmas.

------
wenc
I wonder how FastText (essentially word2vec + word & char n-grams + other
stuff) stacks up against these algorithms.

In my own tests on my own corpora, CPU-based FastText is faster to train and
produces significantly better results (precision/recall) than the GPU-bound
CNN algorithms that I've tried, though I have not compared it against RNN
techniques.
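The subword part of FastText can be sketched quickly: a word's vector is built from its character n-grams (with boundary markers), so rare or misspelled words still get sensible representations. A toy illustration, where the "learned" n-gram embeddings are replaced by deterministic hash-derived stand-ins:

```python
import hashlib

def subword_ngrams(word, n=3):
    """FastText-style: pad with boundary markers, take char n-grams."""
    w = f"<{word}>"
    return [w[i:i + n] for i in range(len(w) - n + 1)]

def toy_vector(ngram, dim=4):
    """Deterministic stand-in for a learned n-gram embedding."""
    h = hashlib.md5(ngram.encode()).digest()
    return [b / 255 for b in h[:dim]]

def word_vector(word, dim=4):
    """Word vector = average of its subword n-gram vectors."""
    grams = subword_ngrams(word)
    sums = [0.0] * dim
    for g in grams:
        for i, v in enumerate(toy_vector(g, dim)):
            sums[i] += v
    return [s / len(grams) for s in sums]

print(subword_ngrams("where"))  # ['<wh', 'whe', 'her', 'ere', 're>']
```

Because out-of-vocabulary words share n-grams with known words, this composition is a big part of why FastText holds up well on noisy text without a GPU.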

~~~
physicsyogi
I’ve found that some CNNs consistently beat fasttext in terms of model
quality. But I’ve beaten those CNNs and fasttext by doing transfer learning
with ULMFiT and fastai. If we’re talking training speed, though, fasttext is
indeed aptly named.

------
platz
Three methods, and no idea how I would choose between them aside from randomly
trying each one and measuring performance.

(Not for winning kaggle but for an actual problem)

~~~
platz
"The information you'd need to choose is included in there. If you're doing
this professionally, you should strive to have enough of a high-level
understanding of NLP to be able to make these decisions without having a
rubric handed to you on a silver platter. In a nutshell, though: Strive to use
the simplest model that will get the job done. Less elaborate models are
easier to understand and (usually) less prone to things like overfitting, so
they'll be more tractable to work with in a business context. To that end: Use
a convolutional net when you can get away with a small, fixed-size context
window. Use an LSTM when you need long-term memory. Attention can be
expensive, so you use it when you have cause to believe you can gain a lot by
giving selective attention to features, and have both a lot of training data
and a lot of computing resources.

It's also worth considering that you might be best off going with none of
these options. Cool as deep learning is, I've personally never actually been
able to justify using it in a professional setting. Simpler models such as
logistic regression and decision trees have characteristics that are near-
useless for getting you to the top of a Kaggle leaderboard, but can be
indispensable when working on many real-world business problems"

- anonymous comment reply

~~~
platz
This is the kind of context that is very helpful.

------
SpaceManNabs
Seems like the article is mistaken on one part.

Attention was first coined in this paper (as far as I know):
[https://arxiv.org/pdf/1409.0473v7.pdf](https://arxiv.org/pdf/1409.0473v7.pdf)

The second page of the introduction of the "Hierarchical Attention Networks
for Document Classification" paper mentioned in the article even cites it.

------
rundigen12
I was really hoping to see a summary comparison of the performance(s) of the
different models at the end, e.g. accuracy vs. complexity vs. execution time,
etc.

Here's a summary from the end of each section...

1. TextCNN: "This kernel scored around 0.661 on the public leaderboard."

2. BiDirectional RNN: 0.671

3. Attention Models: 0.682

~~~
ykevinator
Thank you, that's exactly what I was scrolling through the bickering to find.

------
PaulHoule
It would be nice to see how these methods compare to the classical methods
based on word occurrences.

~~~
nerdponx
Kaggle is a pretty serious natural-selection environment for machine learning
algorithms. Basically, if bag-of-words worked better, the contest winners
would still use it.

~~~
PaulHoule
One issue is the kind of problem.

I remember getting 95%-ish accuracy with BoW and an SVM circa 2004 on
questions like "is this paper about astrophysics or organic chemistry?"

In that case you have a distinct vocabulary for each topic, and it is hard to
beat BoW.

Sentiment analysis, on the other hand, is where BoW goes to die, since "not
good" means something very different from "good". Even simple heuristics like
treating "not X" as a term distinct from "X" give limited gains, because
negation is often expressed through constructions like "I don't believe that
is good", and there is no k-word window that reliably catches negation, since
there is no limit on how complex sentences can be.
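The failure mode is easy to demonstrate: under a unigram bag of words the two sentences differ only by the generic token "not", and a bigram window catches adjacent negation but misses it as soon as the construction grows. A toy illustration (sentences invented):

```python
def bow(text):
    """Unigram bag of words: just the sorted tokens."""
    return sorted(text.lower().split())

def ngrams(text, n):
    toks = text.lower().split()
    return [" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)]

a = "the movie was good"
b = "the movie was not good"

# The unigram representations differ only by the generic token "not":
print(set(bow(b)) - set(bow(a)))          # {'not'}

# A bigram window does catch the adjacent negation here...
print("not good" in ngrams(b, 2))         # True

# ...but not when negation is expressed further away:
c = "i do not believe the movie was good"
print("not good" in ngrams(c, 2))         # False
```

No fixed n rescues the last case in general, which is why sequence models tend to win on sentiment.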

There is also the question of "is the improvement from method A to method B
worth it?" For instance, the Netflix Prize was much celebrated because some
brilliant people busted their asses to go from 92% to 95% accuracy on movie
recommendations. In the end the algorithm proved too complex for the value it
created. (E.g., who would notice getting 8 bad recommendations out of a
hundred instead of 5? An extra half of a bad recommendation out of 10?)

The real "Netflix optimization problem" is how to spend as little as possible
on acquiring content while motivating people to keep their subscriptions, and
that is something Netflix will keep close to their chest rather than run a
public competition on. (E.g., if it were valuable, why would they let
competitors know about it?)

~~~
nerdponx
Good point, I should have written "for machine learning algorithms on problems
that the industry is currently interested in".

------
ScoutOrgo
No mention of ULMFiT?
[http://nlp.fast.ai/classification/2018/05/15/introducting-ulmfit.html](http://nlp.fast.ai/classification/2018/05/15/introducting-ulmfit.html)

------
eggie5
Why not just use a 1d conv over the sequence of embeddings?

------
brian_herman__
This was really interesting and insightful thanks!

------
nixpulvis
I already don't like humans parsing my words half the time; I'm confident I'll
hate most algorithms.

Nothing inherently wrong with these methods, just a lot of possibilities for
misuse in my eyes.

