
Abusing hash kernels for wildly unprincipled machine learning - jhoon
http://jeremydhoon.github.com/2013/03/19/abusing-hash-kernels-for-wildly-unprincipled-machine-learning/
======
fiatmoney
Essentially random projection, which has good theoretical justification and
comes in handy quite often, for instance in SVMs [1]. I'd be concerned about
using something as naive as CRC32, though; ideally they'd use a
cryptographic-strength hash function.

There's also an entire area of research around "semantic hashing" and local
embedding that starts with such a random projection and tries to improve the
mapping for a particular task while still remaining low-dimensional.

[1] <http://arxiv.org/pdf/1211.6085v2.pdf>

~~~
rm999
I think this is actually quite different from random projection, which creates
linear combinations of several continuous variables; RP is much more similar
to PCA in that respect. OP's method works on a single categorical variable: it
essentially shuffles the possible values, then 'folds' them down to an
arbitrary size (the number of hash buckets).
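
A minimal sketch of that folding (bucket count and names illustrative, not
OP's actual code):

    import zlib

    def fold(value, n_buckets=1024):
        # Hash the categorical value, then 'fold' it into one of
        # n_buckets slots via modulo. Collisions are possible by design.
        return zlib.crc32(value.encode("utf-8")) % n_buckets

    print(fold("state=CA"), fold("state=NY"))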

>ideally they'd be using a cryptographic-strength hash fn

Cryptographic strength adds nothing; there is no danger of someone reversing
the hash (and if there is, you may have bigger problems). Any hash function
that redistributes the values suitably randomly should suffice.

~~~
fiatmoney
It's a sparse random matrix, but definitely in the same family of techniques.

I was more concerned about the distributional properties of different hash
functions than about reversibility. Checksums-as-hashes often work, but it's
not much work to swap in something a little more robust. If you want a good
output distribution, pick something where that's a goal of the algorithm.

------
rm999
This is pretty similar to an approach that many predictive modelers already
use: encoding data as 'dummy variables', where each variable represents
some attribute. For example, you could convert a state variable into 51
dummy variables: one for California, one for DC, etc. Hashing makes the
programming a little easier and helps avoid throwing out data in the long
tail of the distribution when the number of possible values is very high.
It's common in NLP exactly for this reason.
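
For concreteness, a hypothetical sketch contrasting the two encodings (all
names illustrative):

    import zlib

    records = [{"state": "CA"}, {"state": "DC"}, {"state": "NY"}]

    # Explicit dummies: one column per observed value, fixed vocabulary.
    vocab = sorted({r["state"] for r in records})
    dummies = [[1 if r["state"] == v else 0 for v in vocab] for r in records]

    # Hashed features: no vocabulary needed, so long-tail values that were
    # never enumerated up front still map to some bucket.
    N = 8  # arbitrary bucket count
    hashed = []
    for r in records:
        row = [0] * N
        row[zlib.crc32(("state=" + r["state"]).encode()) % N] += 1
        hashed.append(row)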

Hashing comes with costs, though. It's much harder to interpret a model with
hashed variables because the variables lose meaning. Also, some information
may get thrown out; as OP mentions, integers may have continuous meaning, and
categorizing them can really damage the model. If you have a very large
dataset it probably won't matter, because most modeling methods asymptotically
converge (i.e. with more and more data the model can learn that 333 and 334
are similar just by seeing enough examples where this is the case), but if
you don't, you could be throwing out valuable information. In a case like this
I suppose the modeler could manually go in and convert his integers to floats
to 'tell' the algorithm the data is continuous.
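
A tiny hypothetical preprocessing step along those lines, for a featurizer
that dispatches on type (ints categorical, floats continuous):

    def mark_continuous(record, fields):
        # Cast fields known to be continuous into floats so a
        # type-dispatching featurizer treats them as such.
        return {k: (float(v) if k in fields else v)
                for k, v in record.items()}

    mark_continuous({"age": 33, "zip": 94110}, fields={"age"})
    # -> {'age': 33.0, 'zip': 94110}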

~~~
jhoon
Your point regarding lost meaning is quite salient. It can be useful and
enlightening when a learner reports a measure of feature importance (such as
in random forest models:
<http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#varimp>).

When you say "hashing makes the programming a little easier," I think you hit
the nail on the head. I'm not trying to improve classification accuracy -- my
goal was just to make it as easy as possible to learn on arbitrary structured
data.

~~~
rm999
>my goal was just to make it as easy as possible to learn on arbitrary
structured data

I'd be very careful about throwing arbitrary data at your learner, at least
if you don't understand your data well. Oftentimes the predictors and the
response are not separated the way they will be during real-world usage (for
example, in time); this leads to target leaks, where your model is
effectively cheating by using data it won't have in production.

Target leaks are obvious when the classifier performs suspiciously well on
in-sample test data, but sometimes the symptoms are subtler yet still very
damaging in a production environment.
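
One common guard is to split train and test by time rather than at random; a
hypothetical sketch:

    def time_split(records, cutoff):
        # Everything before the cutoff trains the model; everything after
        # evaluates it, mimicking how data will arrive in production.
        train = [r for r in records if r["timestamp"] < cutoff]
        test = [r for r in records if r["timestamp"] >= cutoff]
        return train, test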

~~~
mhluongo
Hm, couldn't a hybrid approach deal with this? E.g. hash all the data except
a few dimensions you think are vital, and append those to the resulting
hashed array?

~~~
mkmkmmmmm
Or locality-sensitive hashing?

------
mattj
This is very similar to the hashing trick used in vowpal wabbit
(<http://hunch.net/~vw/>). I haven't used this python module, but vw is
absolutely incredible to use.

~~~
jhoon
Great point! VW was actually an inspiration for this blog post. I don't
believe it handles structured data quite the same way as hashkernel, but it
does a great job of accepting both categorical and continuous features. We use
VW models to classify all sorts of badness on my team at Facebook
(<http://www.newscientist.com/article/dn21095-inside-facebooks-massive-cybersecurity-system.html>).

------
textminer
I love this entry. Two months ago, I wrote a perhaps more-pedantically-
technical entry about the hash trick and its use in building a simple
classifier, even making it play nice with SciPy's sparse linear algebra
(eventually writing my own CSR-times-CSC multiplication method in Cython):
<http://blog.newsle.com/2013/02/01/text-classification-and-feature-hashing-sparse-matrix-vector-multiplication-in-cython/>
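
For anyone curious what that setup looks like, a minimal SciPy sketch (the
linked post implements the actual multiplication kernel in Cython; names here
are illustrative):

    import zlib
    from scipy import sparse

    def hashed_csr(docs, n_buckets=2**18):
        # Build a CSR matrix of hashed token counts.
        rows, cols, vals = [], [], []
        for i, doc in enumerate(docs):
            for tok in doc.split():
                rows.append(i)
                cols.append(zlib.crc32(tok.encode()) % n_buckets)
                vals.append(1.0)
        return sparse.csr_matrix((vals, (rows, cols)),
                                 shape=(len(docs), n_buckets))

    X = hashed_csr(["the cat sat", "the dog ran"])
    w = sparse.random(2**18, 1, density=1e-4, format="csc")
    scores = X @ w  # CSR times CSC, as in the post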

------
jimdanz
Nice writeup, jhoon!

In case anyone else was straining his/her eyes trying to figure out where the
"likes" field is actually sanitized out of the training data set, it's the
item.pop("likes") inside the function "is_liked":
<https://github.com/jeremydhoon/hashkernel/blob/master/fblearn.py#L53>

... Maybe I like long names too much, but my vote would be something longer
like "check_if_liked_and_remove_likes", since "is_liked" sounds like a
predicate that won't be modifying its input.

------
datahipster
Meh. Diving a little deeper into how well the hash kernel algorithm did on
the UCI Adult data set is mildly disappointing. Take a look at the results
(<http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.names>)
and you'll see that hash kernels rank 14th of 17.

To be fair, I would definitely like to see how this algorithm does with other
data sets as well.

All I see this algorithm doing is basically projecting a high-dimensional
feature set onto a random n-dimensional subspace via a hashing function. In
other words, it's not clear to me how an optimal classification boundary can
be constructed from this random projection. I feel comfortable with the
performance characteristics of techniques such as SVMs or vector
quantization, since both focus on algorithms that optimally reduce the
dimensionality of the feature space. Random projections, however, are, in my
humble opinion, overrated.

Also, there are two additional sources of parameterization that might make
this tool difficult to use: the selection of a hash function (and accounting
for the information lost in the random projection) and the arbitrary hash
kernel array size. That makes it somewhat difficult to train and validate a
model.

However! I really do like the ideas presented here and look forward to
diving into this and other variants. I really appreciate the OP taking the
time to piece together this cool tool and compare its performance with other
algorithms.

------
ashish01
Anyone interested in the article should give VW a try. You can get good
baseline performance in a fraction of the time compared to traditional
approaches like liblinear et al.

<https://github.com/JohnLangford/vowpal_wabbit/wiki>

------
342
You are one of my favorite online authors merely for this post. Do you write
often?

I'm going to explore using this to fill parts of a vector in an otherwise
normal regression algorithm, and see how it does on my dataset against linear
regression alone and the hash kernel alone.

------
timothybone
For someone (me) not particularly knowledgeable on the subject, I'm curious
to understand the contrast between this and Vowpal Wabbit. Could someone offer
insight into when it would be useful to use this project as opposed to VW?

~~~
jhoon
First, please don't use hashkernel in production. It's a toy designed to
showcase how hash kernels can help us (easily) learn on structured data.

VW is faster, smarter, and more featureful. However, it doesn't know how to
learn on structured data.

------
bcoates
I don't know anything about this space, but I have some real-world datasets
and love throwing algorithms I don't understand at them. Right now I'm
writing classifiers by hand; it'd be awesome to outsource my job to some AI.
Does anyone have a good starting point for figuring out which modern
classification algorithms are relevant, or even possible, given my dataset
and the computing power available?

~~~
rm999
This is a pretty decent flowchart to finding appropriate machine learning
algorithms for your problem: <http://i.imgur.com/iVo4AAw.png>

------
brendano
I don't understand why this is "wildly unprincipled." Doesn't it have all the
guarantees that the Weinberger et al. and Shi et al. papers describe?

~~~
jhoon
This is a great question. The "unprincipled" part comes in the application of
hashkernel. For example, any integer-valued fields encountered in structured
data are treated as categorical features (whereas floating-point numbers are
treated as continuous features). Of course, it's possible that some of these
integer-valued fields should be treated as continuous features. When using
techniques in ways they were never intended to be used, like treating
continuous features as categorical, many of the assumptions that make
something "principled" no longer apply.
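
A sketch of that dispatch (simplified and illustrative rather than
hashkernel's actual code):

    import zlib

    def featurize(record, n_buckets=2**16):
        x = [0.0] * n_buckets
        for key, value in record.items():
            if isinstance(value, float):
                # Continuous: the hashed slot receives the value itself.
                x[zlib.crc32(key.encode()) % n_buckets] += value
            else:
                # Categorical (ints, strings, ...): hash key and value
                # together into an indicator feature.
                x[zlib.crc32(f"{key}={value}".encode()) % n_buckets] += 1.0
        return x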

~~~
brendano
eh, it's just binning :)

------
SamuelKillin
Writes one blog post - front page of HN

