

The Curse of Dimensionality in Classification - lucasrp
http://www.visiondummy.com/2014/04/curse-dimensionality-affect-classification/

======
languagehacker
Another good dimensionality reduction technique to consider is Latent
Dirichlet Allocation (LDA). I use this approach for natural language and other
"bursty" data sets. "Bursty" data sets are characterized by a Zipfian
distribution over features, where certain long-tail features have a higher
probability of being observed again once they have appeared in an instance.
For example, "armadillo" is relatively rare, but an article that mentions an
armadillo once has a high chance of mentioning it again.

A cool thing about LDA is that it lets you express the latent characteristics
of a given document as a point in Euclidean space. This gives you the ability
to use spatial similarity measures such as cosine distance to express document
similarity. I specifically use this for recommending large-scale UGC
communities based on their latent characteristics. Furthermore, since you've
turned your language data into spatial data, you're able to use spatial
classifiers such as SVMs more effectively over natural language data, which is
normally a bit better suited to Bayesian classifiers.

I'm a huge fan of Gensim for its LDA library. It's even capable of distributed
computing using Pyro4. It's relatively trivial to deploy an LDA pipeline for
extremely large datasets using EC2 and the Boto AWS library.
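
Here's a minimal sketch of that pipeline in Gensim, assuming pre-tokenized
documents; the toy corpus and num_topics value are placeholders:

```python
from gensim import corpora, models, matutils

# Toy stand-ins for real tokenized documents.
documents = [["armadillo", "burrow", "texas"],
             ["puppy", "kitten", "classifier"],
             ["armadillo", "armadillo", "roadside"]]

dictionary = corpora.Dictionary(documents)
corpus = [dictionary.doc2bow(doc) for doc in documents]

# Fit LDA; each document becomes a sparse vector of (topic_id, weight) pairs.
lda = models.LdaModel(corpus, id2word=dictionary, num_topics=2)
topic_vecs = [lda[bow] for bow in corpus]

# Cosine similarity between two documents in the latent topic space.
print(matutils.cossim(topic_vecs[0], topic_vecs[2]))
```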

Edit: If you haven't heard of it, scikit-learn is an awesome Python library
for highly performant machine learning using Python's C extensions for
numerical computing (scipy, numpy). It's easy to take the data you get above
and perform learning on it using the classifiers provided.
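
As a rough sketch of that handoff, with random arrays standing in for the LDA
topic vectors and labels:

```python
# Placeholder topic vectors: in practice, densify the Gensim LDA output
# (e.g., with gensim.matutils.corpus2dense) to get an (n_docs, n_topics) array.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

X = np.random.rand(100, 50)             # 100 documents, 50 topics (placeholder)
y = np.random.randint(0, 2, size=100)   # placeholder binary labels

clf = SVC(kernel="rbf")
print(cross_val_score(clf, X, y, cv=5).mean())
```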

~~~
amitdeshwar
Can you provide a link to an article discussing how you can treat the latent
characteristics as a point in Euclidean space?

~~~
languagehacker
Here's some basic background:
[http://faculty.washington.edu/jwilker/559/SteyversGriffiths.pdf](http://faculty.washington.edu/jwilker/559/SteyversGriffiths.pdf)

------
vonnik
Neural nets are a great way to reduce dimensionality, in particular deep
autoencoders. Here's an old Hinton paper on it:
[http://www.cs.toronto.edu/~hinton/science.pdf](http://www.cs.toronto.edu/~hinton/science.pdf)
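
For a flavor of the idea, here's a minimal single-hidden-layer autoencoder in
plain numpy (the paper's deep version stacks and pre-trains several such
layers); the shapes and hyperparameters are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))            # placeholder data: 200 points, 20 dims
d, k, lr = X.shape[1], 3, 0.01            # compress to a 3-unit bottleneck

W1 = rng.normal(scale=0.1, size=(d, k)); b1 = np.zeros(k)   # encoder
W2 = rng.normal(scale=0.1, size=(k, d)); b2 = np.zeros(d)   # decoder

for _ in range(2000):
    H = 1.0 / (1.0 + np.exp(-(X @ W1 + b1)))    # sigmoid code layer
    Xhat = H @ W2 + b2                          # linear reconstruction
    dXhat = 2.0 * (Xhat - X) / len(X)           # gradient of mean squared error
    dH = dXhat @ W2.T
    dZ1 = dH * H * (1.0 - H)                    # back through the sigmoid
    W2 -= lr * (H.T @ dXhat); b2 -= lr * dXhat.sum(axis=0)
    W1 -= lr * (X.T @ dZ1);   b1 -= lr * dZ1.sum(axis=0)

codes = 1.0 / (1.0 + np.exp(-(X @ W1 + b1)))    # the reduced 3-D representation
```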

------
alceufc
I really like the idea of the Web site as a whole: explaining concepts from
computer vision in a simple way.

When I was starting my master's course I was interested in learning what the
concept of bag of words in computer vision was all about. Although it is a
straightforward technique, there are few examples on the Web explaining how to
implement it (clustering the feature vectors, etc.).
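
For reference, the clustering step looks roughly like this; random arrays
stand in for real local descriptors such as SIFT:

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in for per-image local descriptors (e.g., 128-D SIFT vectors).
rng = np.random.default_rng(0)
descriptors_per_image = [rng.normal(size=(rng.integers(50, 200), 128))
                         for _ in range(10)]

# 1. Build the visual vocabulary by clustering all descriptors together.
vocab_size = 32
kmeans = KMeans(n_clusters=vocab_size, n_init=10, random_state=0)
kmeans.fit(np.vstack(descriptors_per_image))

# 2. Represent each image as a normalized histogram of visual-word assignments.
def bow_histogram(desc):
    words = kmeans.predict(desc)
    hist = np.bincount(words, minlength=vocab_size).astype(float)
    return hist / hist.sum()

X = np.array([bow_histogram(d) for d in descriptors_per_image])
```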

~~~
apu
Although not widely known, the best way to learn about specific computer
vision topics in detail is usually through the websites for tutorials held at
the premier computer vision conferences: CVPR, ICCV, and ECCV. For example,
here are the tutorials held at ICCV 2013:
[http://www.iccv2013.org/tutorials.php](http://www.iccv2013.org/tutorials.php)

If you click through, you'll see that most of them have links to slides, and
some even have video coverage and/or links to software as well. They're also
usually presented by experts in that area.

Finally, note that even if the most recent conferences don't have relevant
tutorials for what you're looking for, you can still often find good material
by going back further in time (e.g., tutorials presented 2 or more years ago).
Although the state of the art may have advanced since then, the fundamentals
usually don't change very frequently.

------
Malarkey73
I'm familiar with this idea, but it's nice to see it explained with cute
little puppies and kittens.

------
therobot24
One counter-example: face recognition using 100k features
([http://research.microsoft.com/pubs/192106/HighDimFeature.pdf](http://research.microsoft.com/pubs/192106/HighDimFeature.pdf))

~~~
cortexman
Not really. The article mentions that using linear methods (e.g., LIBLINEAR)
is one way to avoid the curse. LIBLINEAR is specifically designed for
situations in which you have many features and relatively few training
instances. When using a linear classifier, it may make sense to simply
generate as many features as you can and then use something like lasso
regression to do feature selection.
[http://www.csie.ntu.edu.tw/~cjlin/liblinear/](http://www.csie.ntu.edu.tw/~cjlin/liblinear/)
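
As a sketch, scikit-learn's LogisticRegression exposes LIBLINEAR as a solver,
so an L1-penalized fit on a wide, short data set looks roughly like this (the
data set shape and C value are illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Many features, few training instances: the regime LIBLINEAR targets.
X, y = make_classification(n_samples=80, n_features=10000,
                           n_informative=10, random_state=0)

clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.5)
clf.fit(X, y)

# The L1 penalty drives most coefficients to exactly zero,
# effectively selecting a small subset of the 10,000 features.
print("non-zero coefficients:", np.count_nonzero(clf.coef_))
```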

~~~
kmike84
But the article used a linear model to demonstrate the curse, and the model
was overfit with just 3 dimensions. There is clearly something missing: for
text data, for example, it is not uncommon to have thousands or hundreds of
thousands of dimensions, and the algorithms work fine.

I think the missing piece is regularisation. Regularisation doesn't have to do
feature selection and actually reduce the number of dimensions, but you're
right that using L1 for such data is usually a good idea.

~~~
ced
The article had very few data points; that's why it overfit with just 3
dimensions. The deciding factor is how N (the effective number of data points)
compares with p (the effective number of features).

------
nraynaud
Is there some kind of test to know if we are past the optimal number of
dimensions? I guess overfitting could be detected by the ratio between the
volume and the area of the classification boundary.

~~~
RK
You could make a plot like Figure 1 and look for the turning point (do some
calculus if you can, e.g., solve d(perf)/d(dim) = 0).

~~~
christopheraden
Derivatives require continuity. It would be sufficient to simply look at which
number of dimensions gave you the best cross-validated classification rate.
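
A rough sketch of that search (the synthetic data set and model are
placeholders):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=100, n_features=50,
                           n_informative=5, random_state=0)

# Score each candidate dimensionality by cross-validated accuracy.
dims = range(1, X.shape[1] + 1)
scores = [cross_val_score(make_pipeline(SelectKBest(f_classif, k=k),
                                        LinearSVC()),
                          X, y, cv=5).mean()
          for k in dims]

best_k = dims[int(np.argmax(scores))]
print("best number of dimensions:", best_k)
```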

------
jbert
Can someone explain Figure 6 to me, please? How does projecting the 3D space
(and the 2D green plane) lead to the regions around the cats' heads?

~~~
ronaldx
I will try:

Imagine flattening the 3D space (Figure 5) down onto the 2D space of the
floor.

Then, the image is trying to illustrate which sections of the 2D space have
been selected as 'cat' by the classifier (green plane).

Visually, you can imagine that the cat heads have each pushed down a chunk of
the green plane (the dog heads have held the green plane up off the ground).

This illustrates that the third dimension was important for separating out the
cats: they certainly aren't separated by a clean line in those two dimensions.

However, it also illustrates that the separation might be slightly contrived:
viewed in 2D, the green plane seems to have plucked out the odd cats
correctly but without any obvious logic.

One goal is to show that using a high number of dimensions will guarantee that
you can separate dogs and cats, but that this is just an over-fitted solution
to the data set that you have: it will not continue to work when you apply it
to further data.

~~~
jbert
> Visually, you can imagine that the cat heads have each pushed down a chunk
> of the green plane (the dog heads have held the green plane up off the
> ground).

Thank you, that helps. I was mentally projecting the whole plane down and
couldn't see why it wasn't all green (or all not).

