
Bag of Tricks for Efficient Text Classification
https://arxiv.org/abs/1607.01759
======
Smerity
For anyone who is interested in efficiently classifying text, I can't
recommend Vowpal Wabbit[1] (VW) enough. It's blazingly fast and has been used
in both production and research. It also has a billion options out of the box
for various different set-ups.

Other researchers have noted[2] that, with a set of command line flags, vw is
almost the same as the system described in the paper: specifically, "vw
--ngrams 2 --log_multi [K] --nn 10".

Behind the speed of both methods are the use of ngrams^, the feature hashing
trick (think a Bloom filter, except for features) that has been the basis of VW
since it began, hierarchical softmax (think finding an item in O(log n) via a
balanced binary tree instead of an O(n) array traversal), and the use of a
shallow rather than a deep model.
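
As a rough illustration of the hashing trick (not VW's actual implementation;
the bucket count, CRC32 hash, and function names below are all made up),
features are hashed straight into a fixed-size weight table, so no vocabulary
ever needs to be stored:

    
    
        import zlib
        
        NUM_BUCKETS = 2 ** 18  # fixed-size weight table; hash collisions are simply tolerated
        
        def hashed_features(tokens):
            # unigrams plus bigrams, mapped straight to bucket indices -- no vocabulary dict
            feats = list(tokens) + [" ".join(pair) for pair in zip(tokens, tokens[1:])]
            return [zlib.crc32(f.encode("utf-8")) % NUM_BUCKETS for f in feats]
        
        weights = [0.0] * NUM_BUCKETS  # learned weights live in one flat array
        
        def score(tokens):
            return sum(weights[i] for i in hashed_features(tokens))
        
        score("the cat sat on the mat".split())  # => 0.0 until the weights are trained
    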

I am still interested in the more detailed insights the team from Facebook AI
Research may provide, but the initial paper is a little light and they're still
in the process of releasing the source code.

^ Illustrating ngrams: "the cat sat on the mat" => "the cat", "cat sat", "sat
on", "on the", "the mat" - you lose complex positional and ordering
information, but for many text classification tasks that's fine.

[1]: https://github.com/JohnLangford/vowpal_wabbit/wiki

[2]: https://twitter.com/haldaume3/status/751208719145328640

~~~
tadkar
Thanks for the detailed comment! It's interesting that simple and classical
techniques are so competitive for text, but not for images. What do you think
is different about text that makes simple methods so effective, or
equivalently what is different about images that yields such massive
improvements from using convolutional deep neural nets?

~~~
syllogism
Imagine looking at an image as a "bag of pixels": shuffle them all up, and
look at the resulting image. What do you see? Nothing useful, right?

Now look at a bag-of-words view of a movie review:

    
    
        set(['and', 'predecessor,', 'immersive;', 'script,', 'is', 'an', 'engaging', 'as', 'home', 'still', 'its', 'film', 'identity.', 'puts', 'dazzling', 'issues', 'visually', 'colorful', 'While', 'not', 'spin', 'on', 'of', 'while', 'the', 'predictable,'])
    

It's not certain, but there's definitely a lot more information there.

The predict loop of a linear model works like this (written with sparse
vectors, implemented as dictionaries):

    
    
        def predict(classes, weights, features):
            # weights: {feature: {class: weight}}; features unseen in training contribute nothing
            scores = {clas: 0.0 for clas in classes}
            for feature in features:
                for clas, weight in weights.get(feature, {}).items():
                    scores[clas] += weight
            return max(scores, key=lambda clas: scores[clas])
    

This function is the same for Naive Bayes, Maximum Entropy, a linear-kernel
SVM, the Averaged Perceptron, and so on.
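
For instance, with made-up weights attached to a few of the features from the
review above, a call might look like:

    
    
        weights = {
            "engaging":     {"pos": 2.0, "neg": -1.0},
            "dazzling":     {"pos": 1.5, "neg": -0.5},
            "predictable,": {"pos": -1.0, "neg": 1.0},
        }
        predict(classes=["pos", "neg"], weights=weights,
                features=["engaging", "dazzling", "predictable,"])
        # => 'pos'
    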

All you get to do is attach a weight to each feature, which you'll sum. You
can make the features larger or smaller slices of the input, but that's all
the structure you get.

Note that linear models have massive disadvantages for speech recognition,
too. Linear models don't work very well if you have an analog signal.

~~~
rcpt
I think bag of pixels would be more analogous to a bag of characters. A bag of
words is more like a bag of SIFT features.

~~~
msandford
Sure, but n-gram feature extraction is what, five lines of code? It's a trivial
transform compared to SIFT.
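
For example (a throwaway sketch, not the paper's or VW's code), bigram
extraction is just:

    
    
        def ngrams(tokens, n=2):
            # slide a window of length n across the token list
            return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        
        ngrams("the cat sat on the mat".split())
        # => ['the cat', 'cat sat', 'sat on', 'on the', 'the mat']
    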

If you don't do SIFT manually prior to classification then your NN has to
evolve something "similar" in order to work. Which is why it needs to be deep.

------
lqdc13
I don't understand why they're doing this on the Yelp/IMDB datasets. Here's
the paper they're citing that does the same thing with gated neural nets:
http://www.emnlp2015.org/proceedings/EMNLP/pdf/EMNLP167.pdf

However, that paper also has the polar score, which is a 2-class classifier:
1 star + 2 star vs. 3 star + 4 star.

Basically, the rating scores are clearly ordinal, but they're completely
disregarding that information, as if a 4-star Yelp restaurant review were a
completely different kind of review from a 5-star one, or a 9/10 IMDB review
were different from an 8/10.

A more useful estimate of success would be how close they got to the real
rating, possibly with a disproportionate penalty when a model gets the rating
completely wrong. Something like the mean squared error of the true score vs.
the predicted score.

I understand that these are just used to benchmark the algorithms against each
other, but why not use something more relevant, like topic classification?
That is a real n-class problem.

General disregard for domain-specific information is very prevalent in machine
learning papers, and it makes real-world applications difficult: real-world
evaluation metrics are different, so classifiers that are marked as inferior
in such evaluations might actually be better.

In both of their evaluations, why not try

    
    
        F(y_j) = sum(exp(y_i) for i in range(1, j+1)) / sum(exp(y_i) for i in range(1, n+1))
        p_1 = F(y_1)
        p_j = F(y_j) - F(y_{j-1}), for j ≥ 2
    

for the softmax? In other words, subtract the cumulative probability of the
previous category.

Or use a kappa metric?

At the very least, the naive bag-of-words comparison classifier should have
used ordinal logistic regression instead of an n-class logistic classifier.
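
For what it's worth, a minimal sketch of an ordinal (proportional-odds)
formulation looks like the following; the cutpoint thresholds and the example
score are made up, and this is not what the paper does:

    
    
        import math
        
        def sigmoid(z):
            return 1.0 / (1.0 + math.exp(-z))
        
        def ordinal_probs(score, thresholds):
            # Proportional-odds model: P(Y <= j) = sigmoid(theta_j - score),
            # with cutpoints theta_1 < ... < theta_{K-1} learned during training.
            cum = [0.0] + [sigmoid(t - score) for t in thresholds] + [1.0]
            return [cum[j] - cum[j - 1] for j in range(1, len(cum))]  # p_j = P(Y<=j) - P(Y<=j-1)
        
        # e.g. 5 ordered star ratings; score and thresholds here are illustrative only
        ordinal_probs(score=0.3, thresholds=[-2.0, -0.5, 0.8, 2.2])
        # => probabilities for 1..5 stars that respect the ordering of the classes
    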

------
kwrobel
How is a sequence of words fed to a non-recurrent network?

~~~
_delirium
They represent the sequence as a bag of n-grams, and feed that into the
classifier rather than feeding the sequence directly. The paper basically
combines variants on a few old techniques (although a few of the variants are
significant and recent), but the interesting result is that, put together in
the right way and tweaked a little, they're competitive in accuracy with
state-of-the-art deep neural network models, at least on some problems, while
being much faster to train. Section 2 of the paper, although pretty brief, is
where this info is.

~~~
SomewhatLikely
Specifically, the bag of n-grams can be viewed as a very sparse vector with
non-zero entries corresponding to the n-grams in the bag. As a result, n-grams
not seen during training need to be ignored.
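
As an illustration (the `vocab` mapping and function name here are
hypothetical, not from the paper), building that sparse vector and dropping
unseen n-grams might look like:

    
    
        def to_sparse_vector(ngrams, vocab):
            # vocab: {n-gram string -> column index}, fixed at training time
            vec = {}
            for g in ngrams:
                idx = vocab.get(g)
                if idx is None:          # n-gram never seen during training: ignore it
                    continue
                vec[idx] = vec.get(idx, 0) + 1
            return vec                   # {column index: count} -- a very sparse vector
        
        to_sparse_vector(["the cat", "cat sat", "cat zorps"],
                         vocab={"the cat": 0, "cat sat": 1})
        # => {0: 1, 1: 1}; "cat zorps" is silently dropped
    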

