
Curse of dimensionality - boh
http://en.wikipedia.org/wiki/Curse_of_dimensionality
======
dnquark
What finally made the curse of dimensionality "click" for me was recognizing
that the volume of an n-sphere goes to zero as n grows large. In 2D, if I want
to see which of my data points lie close to some reference point, I can draw a
circle around my reference point and count the proportion of the data points
inside. In 3D, I could do the same with a sphere. But in 20D, the volume of my
hypersphere will be minuscule, and it will probably not even contain any
points, so this procedure becomes useless for clustering. Note that I can't
just pick a hypersphere of larger radius: suppose all my points are
contained inside a unit n-cube. The volume of the largest n-sphere I can
inscribe inside that n-cube (one of diameter 1) will _still_ be approximately
zero!

This result is pretty cool and somewhat surprising, and there's a neat and
short derivation of it; in fact (shameless plug) I have a youtube video where
I derive it in 3 minutes: <http://www.youtube.com/watch?v=QkMn_5QpsX8> \--
feel free to check it out!
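A quick numerical sanity check (a sketch of my own in Python, using the
standard n-ball volume formula; the inscribed ball has radius 1/2):

    import math

    def ball_volume(n, r):
        # Standard formula: volume of an n-dimensional ball of radius r.
        return math.pi ** (n / 2) / math.gamma(n / 2 + 1) * r ** n

    # Fraction of the unit n-cube filled by its inscribed ball (radius 1/2).
    for n in (2, 3, 10, 20):
        print(n, ball_volume(n, 0.5))
    # roughly 0.785, 0.524, 0.0025, 2.5e-8 -- essentially zero by n = 20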

~~~
mokus
Very nice video. For those less able to follow the calculus, another way to
get at least a little intuition about this result is to think about how far
the farthest points in a unit cube are from the origin - in 20 dimensions, the
point (1,1,...,1) is sqrt(20) ~= 4.47 units from the origin, so a unit-radius
sphere centered at the origin only "reaches" 1/sqrt(20) ~= 0.22 of the way to
the far corner of the 20-cube.

------
stiff
Suppose you want to build a system that, using some M features of the current
weather (humidity, temperature, pressure, etc.), will predict whether it will
be sunny in 3 hours - to this end you manually collect a dataset of N
measurements (samples) of those features together with the outcome 3 hours
after taking the measurement. This dataset can be used to find a function that
will map the current features of weather to a real variable from 0.0 to 1.0 -
the likelihood of it being sunny in 3 hours given the samples we collected. Of
course, since we cannot manually collect measurements of all possible
combinations of features, we want to extend the domain of this function beyond
just the combinations of features that occurred in our measurements - we want
to "generalize". The values of the function for those "generalized" inputs are
our guesses about how the likelihood function behaves for measurements other
than the ones we have taken. The guess for a given measurement is much more
likely to be correct if we at least have _similar_ measurements in our
collection.

Now, if there are too few features (M is too low), it might not be possible to
make a good prediction. The basic aspect of the "curse of dimensionality" is
that as the number M of features grows (linearly), the number of all possible
combinations of the features grows exponentially and hence the number N of
measurements you need to do to get a good representation of the whole space of
possible feature combinations also grows exponentially. So either you have to
collect many more measurements, or the inferred function will have to
"guess" much more often, without having good evidence from the dataset in the
form of similar cases.
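A toy way to see the blow-up (my own sketch, assuming we crudely discretize
each feature into 10 levels):

    # If each of M features is discretized into `bins` levels, the number of
    # distinct feature combinations -- and hence the number of samples needed
    # to see even one example of each -- grows exponentially in M.
    bins = 10
    for m in (1, 2, 5, 10, 20):
        print("M = %2d features -> %d possible combinations" % (m, bins ** m))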

What I refer to as "all possible combinations of features" can be thought of
as a high-dimensional space, and hence it is customary and sometimes useful to use
geometric language in analysing the problem.

------
arjunnarayan
Here's my best attempt at a simple.wikipedia.org article. In accordance with
both simple.wikipedia.org rules and the oft-repeated Einstein/Feynman/Michael
Scott quotes about explaining things to 5-year-olds being the best sign of
understanding something yourself, I've kept the vocabulary simple. Feel free
to remix and rework the text to improve it.

"Machine learning" is often the problem of taking a bunch of data and coming
up with a simple summary. For example, you might want a machine to look at
pictures (which are made up of millions of data points called "pixels") and
come up with a single word definition of the subject (say "dog", or "cat", or
"human", or "spoon", or "chair"). Now we will pick an even simpler problem to
use as an example. We have collected data on two objects. We don't know what
the objects are, but we do know that there are two distinct objects in our
data. We want our machine to _cluster_ (this is a specific important machine
learning problem called _clustering_) the given input and tell us which ones
are object 1 and which ones are object 2. So, for example, we could give it
lots of data on dogs and cats, and we want it to separate the dog-data from
the cat-data.

However, sometimes when our data has a lot of features, our algorithms stop
working well. This is called the "curse of dimensionality". Consider a more
abstract version of the problem than the one above. You see a group of points
in two dimensions and you want to draw a single line that separates the
points. For example, see this picture
<http://en.wikipedia.org/wiki/File:KMeans-density-data.svg> I will refer to
this picture many times, so keep it open on a tab to the side. The algorithm
(the one used in the picture is called "K-Means" but we can ignore the
specifics. The point here is that _any_ clustering algorithm will face the
same issues, so just focus on the problem itself) has separated the points
into two groups (red and blue) and drawn a line between them. The idea is we
now have detected that the blues are different from the reds in some pattern
of the "features" (which is what we call the 2 axes --- we have 2 "features").

Now we could try and "generalize" this algorithm --- that is, come up with
some modification that works when we have lots of features. So let's consider
three dimensions. Now first you will notice that a line doesn't "split" three
dimensions. You need a "plane". A plane is a generalization of a line into
3-space. Visualize a room. A single pipe doesn't split the room in two. But a
very large piece of stiff cardboard will. Similarly, when we are in higher
dimensions we will talk of "hyperplanes". Notice that a plane sort of "exists"
in a 2-d world of its own (like the stiff cardboard), even though the
cardboard "lives" in 3-d. This is not a coincidence. We can, in principle, use
an (n-1)-dimensional structure to cleanly separate an n-dimensional space
into two parts. While you cannot visualize this in higher dimensions, the
(n-1) separation principle is important. [2]

So since we are looking to do just that --- "cluster" our data into two
separate parts separated by a single "hyperplane" --- we will try and draw one
in 3-dimensions. So maybe it is easy. From the picture above, imagine that the
points are in a 3-dimensional room. So we have the two dimensions as in the
picture, but now there is a third one, say that is just completely useless ---
so the points are randomly distributed on that dimension. To visualize,
imagine you laid that figure on the floor in your room and the points were all
at different random heights off the floor, suspended in the room. The
important thing is that if you viewed the room from directly above (and from
far enough away that you couldn't see the heights at all), you would see the
original 2-d picture from the k-means example. But from any other angle, you
would see a bunch of extra noise clouding the clean separation that is visible
from that one specific from-the-ceiling angle.

So already you can see when you have a single useless dimension it makes life
difficult. You have to somehow figure out the right angle to see things so
that it doesn't cloud your judgment and you can "see" the separation. This
intuition is important, because that is what will happen to the algorithm.

Now we will talk about k-means, the specific algorithm used in that example to
do clustering. K-means is a powerful algorithm. It works by comparing the
distance between the points and some "centers" that it thinks
"cluster" the data. So it guesses two centers, and then draws a dividing line
between the two (at the mid point). You can see this in the example picture.
There is a blue center and a red center and a line that divides.

K-means then computes an "average" (it uses a slightly fancier metric, but
"average" will suffice for our intuition here) of the distances of all the
points on its side of the line to the center. Basically, it is trying to see
how far all the points are from the center. If the points are all far away,
then it is not really the center. In fact, the center will _minimize_ the
distance from all the points --- that's why we like centers. So it repeats
this a few times until it finds the right center. Importantly, you can also
use this calculation to "point" you in the right direction. I found a nice
picture that shows this step by step
<http://itee.uq.edu.au/~comp4702/lectures/k-means_bis_1.jpg> as it _figures
out_ where the centers should be, at each step redrawing the lines.
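For readers who want to see the loop itself, here is a minimal sketch of that
assign-then-recenter idea in Python/NumPy (my own simplification, not the
exact code behind the picture):

    import numpy as np

    def kmeans(points, k, steps=20, seed=0):
        """Tiny k-means sketch: assign each point to its nearest center,
        then move each center to the average of the points assigned to it."""
        rng = np.random.default_rng(seed)
        points = np.asarray(points, dtype=float)
        centers = points[rng.choice(len(points), size=k, replace=False)]
        for _ in range(steps):
            # squared distance from every point to every center
            d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            labels = d2.argmin(axis=1)
            for j in range(k):
                if np.any(labels == j):
                    centers[j] = points[labels == j].mean(axis=0)
        return centers, labels

The distance computation in the middle of that loop is exactly where the
extra, useless dimensions will creep in below.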

Now, the important thing here is that we use the distances from the points to
the center to calculate how "good" the center is. So I will unfortunately have
to introduce a little bit of math here to show that.

When we have two points x and y in n dimensions, we can compute the (squared)
distance between them as (x_1 - y_1)^2 + (x_2 - y_2)^2 + (x_3 - y_3)^2 + … +
(x_n - y_n)^2.

Now suppose in our example we have only 2 dimensions that "matter", so n = 2,
and our distance metric is (x_1 - y_1)^2 + (x_2 - y_2)^2. Now let's
assume that x_1 and x_2 tell us important things about the data. Maybe, for
example, x_1 and x_2 are key properties of the objects we want to cluster.
(weight and number of teeth). Suppose we are clustering between dogs and cats.
Clearly weight and number of teeth will tell us important things that help us
separate the dogs from the cats. So we call these "features" _informative_.
They are valuable! Running k-means will give us a good separation.

Now suppose, in addition to that, we have some extra dimensions that are
useless. Say "color", "date of birth", "first name of owner", "number of kids
in the house where it lives". Maybe these "features" really are just random
noise. They're just getting in the way. So we will assume that the data is
"randomly distributed with respect to these features".

Now our distance metric is (x_1 - y_1)^2 + (x_2 - y_2)^2 + (x_3 - y_3)^2 + … +
(x_n - y_n)^2. Since we assumed for k > 2 that x_k is random, the distance
becomes (x_1 - y_1)^2 + (x_2 - y_2)^2 + (random - random)^2 + … +
(random - random)^2.

Now we will avoid some math here, but the important point is that the "random
- random" terms start to drown out the first two dimensions as we increase the
number of randomly distributed features.[3]
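If you'd rather see this numerically than take my word for it, here's a small
sketch (my own, with made-up numbers): two points that differ by 1.0 in each
of the two informative dimensions, plus more and more purely random ones.

    import numpy as np

    rng = np.random.default_rng(0)
    for noise_dims in (0, 10, 100, 1000):
        # two informative coordinates that really differ, plus random noise
        dog = np.concatenate(([0.0, 0.0], rng.uniform(0, 1, noise_dims)))
        cat = np.concatenate(([1.0, 1.0], rng.uniform(0, 1, noise_dims)))
        sq = (dog - cat) ** 2
        noise_share = sq[2:].sum() / sq.sum()
        print("%4d noise dims: noise is %.0f%% of the squared distance"
              % (noise_dims, 100 * noise_share))
    # roughly 0%, 45%, 90%, 99% -- the informative dimensions get drowned out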

Intuitively, unless you know to "see" from the exact right angle (like the top
of the room in the example) you will be lost in the sea of points in the
space. But how do you decide which is the right angle? That is itself a hard
problem. And oftentimes, it is _the_ hard problem. Deciding that "first name
of the owner" wasn't as important as "weight" _is_ the hard problem. So
k-means doesn't work as well.

This problem is often called the "curse" of dimensionality, because in a very
high-dimensional space, "all points are equally far".

[1] A proper simple.wikipedia article should use a different word here.
Suggestions?

[2] Side distraction: We can actually, with a little effort, visualize a 4-d
space. Say the 4 dimensions are our usual 3 plus the extra time dimension. We can
have a 3-d object separate us into two separate 4-d worlds. So we're waiting
waiting waiting as time passes, and then suddenly, the entire world is made of
concrete. It obliterates everything in the universe. Then, in a flash, just as
soon as it arrived, the concrete disappears. That concrete barrier is the 3-d
hyperplane in the 4-d space.

Now, to make it convenient to visualize, we made our 3-d object
"axis-aligned", but the answer we want won't always be conveniently
axis-aligned. Imagine a 2-d concrete barrier that starts at one end of the
universe and "moves" through the universe as time passes. It crushes
everything in its path. You can't get past it; it will eventually get to you
because it is moving perpendicular to its own surface, and eventually it will
destroy everything. So you still have a 3-d concrete barrier (it exists in
2 spatial dimensions and 1 time dimension) separating the 4-d space without
being axis-aligned.

[3] I'm trying to avoid having to say that while the expected value of
(random - random) is 0, its variance is nonzero and positive: for independent
draws, Var(N(mu, sigma) - N(mu, sigma)) = 2*sigma^2. So as we add more random
features, the variance of their total contribution to the distance grows
linearly with the number of such features.

~~~
pacala
Why not cluster one dimension at a time? K-means will meaningfully cluster for
2 of these dimensions and fail for the other 998. Measure how good the
separation is and stick with the best X dimensions.

~~~
arjunnarayan
Yes that will work. But, there are 1000C2 such combinations (that is, O(2^n)),
and you have to try them all since you don't know which two dimensions are
needed. And you cannot do each of the 100 separately in one-dimension and
later hope to "combine" them together. In fact, you might not even know that
it is 2 of the dimensions that are special. Maybe it is 7. So you have to try
each of the 2^100 combinations separately. The algorithm you propose is not
polynomial in the number of features.

~~~
tveita
I may be misunderstanding you, but the number of ways to pick 2 out of _n_
dimensions is Θ( _n_ ²).

Even if we looked at up to d dimensions, (bruteforceable at Θ( _n_ ^ _d_ )),
that doesn't imply we need to generalize to do all 1000.

The post you replied to actually suggested clustering just one dimension at a
time. I think that's a reasonable solution given the situation you posited,
where you have some dimensions that do cluster well, even individually, and
some that are mostly noise.

I would say that the curse of dimensionality is more fundamental than the
problem you describe. The problem doesn't lie just in separating noise from
signal - in a truly high-dimensional space there may not be a "right angle" to
see it from.

For a simple example, take randomly distributed points. Even with a uniform
random distribution, it is relatively easy to, for instance, index
two-dimensional points in such a way that you can later locate points close
to any particular spot. For 20-dimensional points, this is very hard.

------
colinhevans
The number of parameters to be fit or estimated in a statistical or machine
learning model will always be a function of the dimensionality. For example,
fitting a Gaussian (a Bell Curve) with a fixed variance of 1.0 to an
n-dimensional space requires estimating the mean, which is n parameters. If we
want to estimate an n×n covariance matrix for our n-dimensional Gaussian as
well, we need on the order of n + n^2 parameters (n + n(n+1)/2 once you
account for the matrix being symmetric). For linear or logistic regression,
we'll need to estimate n+1 parameters, and so on.

The curse of dimensionality is that if you add dimensions but don't add data,
your functions will overfit the data because you don't have sufficient samples
to estimate your model parameters. The worst-case estimate is that for n
dimensions, you will need on the order of 2^n samples. This comes from the
combinatorial increase of relative "distances" as dimensions increase.

This 2^n bound assumes that your data has a high degree of uncorrelated
variance across all dimensions. In practice, the curse of dimensionality often
isn't a problem. This is because most high-dimensionality data residing in an
n-dimensional space actually doesn't have uniform variance across all n
dimensions, and can be mapped or otherwise transformed into a k-dimensional
subspace where k is much smaller than n with a minimal loss of variance.

Dimensionality reduction approaches include principal component analysis
(PCA), minimum message length (MML) methods, various feature selection
approaches, and virtually every clustering algorithm. Anything that removes
dimensions while retaining the essential information content will do the
trick.
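As a concrete, hedged example of the PCA idea, here is a bare-bones projection
onto the top-k principal directions using plain NumPy SVD (my own sketch, and
the data below is made up):

    import numpy as np

    def pca_project(X, k):
        """Project rows of X (n_samples x n_dims) onto its top-k principal
        components, keeping as much of the variance as possible."""
        Xc = X - X.mean(axis=0)                  # center the data
        _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
        return Xc @ Vt[:k].T                     # n_samples x k

    # Made-up example: 500 points that really live near a 3-d subspace of R^50.
    rng = np.random.default_rng(1)
    latent = rng.normal(size=(500, 3))
    X = latent @ rng.normal(size=(3, 50)) + 0.01 * rng.normal(size=(500, 50))
    Z = pca_project(X, 3)   # 50 dimensions down to 3, with little variance lost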

------
tylerneylon
When it comes to finding approximate neighbors, there is a nice simple trick
to help address the curse of dimensionality.

An easy algorithm to find nearby neighbors is to split space into the
generalization of squares / cubes, aka hypercubes. Then look at all the
surrounding cubes and compile a list of all the neighbors in them. You can do
this with a precomputed hash table mapping cubes to lists of points in that
cube.
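Here's roughly what that hypercube-hash approach looks like in Python (a
sketch of my own, not code from the paper); note the 3^n loop in the lookup,
which is exactly the problem described next:

    import itertools
    from collections import defaultdict

    def build_grid(points, cell):
        """Hash each point into the grid cube (side length `cell`) containing it."""
        grid = defaultdict(list)
        for p in points:
            grid[tuple(int(x // cell) for x in p)].append(p)
        return grid

    def candidates_near(grid, q, cell):
        """Collect points from the query's cube and its 3^n - 1 neighboring cubes."""
        key = tuple(int(x // cell) for x in q)
        out = []
        for offset in itertools.product((-1, 0, 1), repeat=len(key)):
            out.extend(grid.get(tuple(k + o for k, o in zip(key, offset)), []))
        return out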

This suffers the curse because a cube in n dimensions has 3^n - 1 neighbors
(it is in the middle of a 3x3x...x3 hypercube). You can make it a little
better with a hash table from every cube corner to a list of the points in
all cubes touching that corner, which brings the 3^n lookups down to 2^n --
still exponential.

Simple idea: Split up space using triangles instead of squares (aka simplices
instead of hypercubes). We just went from 2^n lookups down to n+1 (one for
each corner) -- pretty good.

The questions are then: (1) how do you tile n-space with n-dim'l
triangles/pyramids (simplices), and (2) how do we get highly regular simplices
to do that? [Regularity is good because you usually want a sphere-like radius-
based lookup.]

It turns out there's a vertex-transitive simplex that tiles any dimension, and
as a bonus is easy to work with computationally. ("Vertex-transitive" is a
type of regularity where basically every vertex "looks the same" as every
other.) So if you care about point neighbors within a certain approximate
radius, these tricks are extremely helpful.

Reference:
[http://www.siam.org/proceedings/soda/2010/SODA10_094_neylont...](http://www.siam.org/proceedings/soda/2010/SODA10_094_neylont.pdf)

------
ColinWright
More information about this in the discussion of this item:
<http://news.ycombinator.com/item?id=3995615>

There it talks about thinking of high-dimensional spheres as "spikey" to help
avoid the common traps your 3D intuition will lead you into.

------
ntumlin
There's no article on simple.wikipedia.org; could someone please explain this
in a bit simpler terms?

~~~
i2pi
There are two things going on here. First, as other commenters have noted, as
you increase the number of dimensions, the search space grows exponentially.
The second, and deeper, problem is that our intuitions about
'volume' fail for higher dimensional problems. As you increase the number of
dimensions, the outer shell of any hypercube holds much more volume than the
inner portion of the cube. This means that if you were to distribute points
with a Gaussian distribution in 2D, most of the mass is near the mean point of
the distribution. As you increase the number of dimensions, more and more mass
is contained within 'outliers'. In high dimensional space, things that seem
unlikely in our usual 3D world become far more probable.
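One way to convince yourself of that (my own quick sketch): sample standard
Gaussians in increasing dimension and look at how far the typical sample
actually lands from the mean.

    import numpy as np

    rng = np.random.default_rng(0)
    for n in (2, 20, 200):
        radii = np.linalg.norm(rng.standard_normal((20_000, n)), axis=1)
        print("n = %3d: typical distance from the mean ~ %.1f (spread %.2f)"
              % (n, radii.mean(), radii.std()))
    # The typical distance grows like sqrt(n) while the spread stays roughly
    # constant, so nearly all the mass sits in a thin shell far from the mean.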

~~~
aktau
I find this (and the other comment referencing the volumes of hypercubes) to
be interesting, perhaps exactly because it is so unintuitive. Maybe I could
reason it out myself, but what exactly do you see as the "volume" of the inner
portion and the "volume" of the shell? Are we talking about the same units
here? Are there generic formulae?

------
sageikosa
I seem to remember F-tests or something like that when I did multi-dimensional
variance analysis as part of my statistical geography courses back in the day.

Even though you may have data along 7 or 8 dimensions, you may be able to
develop a statistical model that captures 95% of the predictive value using
only 3 or 4 dimensions (though those 3 or 4 dimensions do not correspond
perfectly to any of the 7 or 8 dimensions you've collected); instead, each
predictive dimension is based on part of each of the others.

Or, to summarize: the underlying (statistically useful) dimensionality of a
dataset can be smaller than the number of variables collected, since many of
the variables correlate with each other.

~~~
scottfr
Principal Component Analysis is the main example of this:

<http://en.wikipedia.org/wiki/Principal_component_analysis>

------
kylebrown
Another reference to PRML I presume?

To this page, I'd add a direct link to Mahalanobis Distance. I was just
harping about MD in some HN comments the other day on a Bayes article. There's
no pictorial representation on the wiki page, but it is intuitive, and it's
one of those simple concepts with foundational implications.

MD rests on the assumption of a Gaussian/normal population. The math is
straightforward, starting with the covariance matrix. The original (Rubine
1991) gesture recognizer used the MD of some 16 feature vectors between an
input gesture and the collection of trained gestures.
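For reference, the distance itself is only a few lines of NumPy (my own
sketch, not Rubine's code):

    import numpy as np

    def mahalanobis(x, data):
        """Distance from x to the mean of `data`, rescaled by the covariance,
        so that directions with more spread count for less."""
        mu = data.mean(axis=0)
        cov = np.cov(data, rowvar=False)
        diff = np.asarray(x, dtype=float) - mu
        return float(np.sqrt(diff @ np.linalg.inv(cov) @ diff))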

------
anusinha
Another big problem with studying high-dimensional spaces is that you run into
many configurations that simply do not make sense in three dimensions, or that
correspond to completely nonphysical shapes. This makes it difficult to study
real (3-D) structures using many parameters, because results in
higher-dimensional spaces do not necessarily correspond to a real shape.

