
Gaussian Distributions Are Soap Bubbles - lebek
http://www.inference.vc/high-dimensional-gaussian-distributions-are-soap-bubble/
======
cs702
Of course. As others here point out, the hypervolume inside an n-dimensional
hypersphere grows as the nth power of the radius. In high dimensions, a tiny
increase in radius makes the hypervolume grow by more than 100%. The
concentration of hypervolume is always highest at the edge.[0]
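
A back-of-the-envelope sketch (mine, not from the thread) makes that concrete: the fraction of an n-ball's volume sitting in its outer 1% shell is 1 - 0.99^n.

    # fraction of an n-ball's volume in the outer 1% shell
    for n in (2, 10, 100, 1000):
        print(n, 1 - 0.99 ** n)
    # n=2: 0.02, n=10: 0.10, n=100: 0.63, n=1000: 0.99996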

The theoretical tools (and intuitions) we have today for making sense of the
distribution of data, developed over the past three centuries, _break down_ in
high dimensions. The fact that in high dimensions Gaussian distributions are
not "clouds" but actually "soap bubbles" is a perfect example of this
breakdown. Can you imagine trying to model a cloud of high-dimensional points
lying on or near a lower-dimensional manifold with soap bubbles?

If the data is not only high-dimensional but also non-linearly entangled, we
don't yet have "mental tools" for reasoning about it:

* [https://medium.com/intuitionmachine/why-probability-theory-s...](https://medium.com/intuitionmachine/why-probability-theory-should-be-thrown-under-the-bus-36e5d69a34c9)

* [https://news.ycombinator.com/item?id=15620794](https://news.ycombinator.com/item?id=15620794)

[0] See kgwgk's comment below.

~~~
kgwgk
> Density is always highest at the edge.

More precisely: it is the mass that is “concentrated” at the edge, not the
density. In the Gaussian case the distribution “gets more and more dense in
the middle” regardless of the number of dimensions. However, in high
dimensions the volume in the middle is so low that essentially all the mass is
close to the surface of the hypersphere.
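
A minimal simulation (my sketch, assuming numpy) of that split: the density always peaks at the origin, yet the fraction of samples landing anywhere near the origin collapses as n grows.

    import numpy as np

    rng = np.random.default_rng(0)
    for n in (1, 10, 100):
        x = rng.standard_normal((100_000, n))
        r = np.linalg.norm(x, axis=1)
        print(n, (r < 1).mean())   # fraction of samples within 1 of the origin
    # n=1: ~0.68, n=10: ~0.0002, n=100: ~0.0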

~~~
ks1723
What I found quite surprising in that context is that the volume of the
n-dimensional ball for any fixed finite radius goes to zero as n goes to
infinity (see, for example, the "High dimensions" section of
[https://en.m.wikipedia.org/wiki/Volume_of_an_n-ball](https://en.m.wikipedia.org/wiki/Volume_of_an_n-ball)).
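
That follows from the closed form V_n(r) = pi^(n/2) r^n / Gamma(n/2 + 1); a quick sketch (mine, using scipy's log-gamma to avoid overflow):

    import numpy as np
    from scipy.special import gammaln

    def log_ball_volume(n, r=1.0):
        # log V_n(r), where V_n(r) = pi**(n/2) * r**n / Gamma(n/2 + 1)
        return (n / 2) * np.log(np.pi) + n * np.log(r) - gammaln(n / 2 + 1)

    for n in (2, 5, 10, 100, 1000):
        print(n, log_ball_volume(n))
    # the log-volume peaks near n = 5, then plunges: the volume goes to zero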

~~~
cs702
There are multiple good (intuitive) explanations for this here:
[https://math.stackexchange.com/questions/67039/why-does-volume-go-to-zero](https://math.stackexchange.com/questions/67039/why-does-volume-go-to-zero)

------
woopwoop
I was going to comment that what's going on here doesn't have much to do with
the Gaussian distribution. In high dimensions, almost all of the volume of the
unit ball is concentrated near the unit sphere. In the first comment, Frank
Morgan makes the same remark, pointing out that you get the same effect with
the uniform distribution on the unit cube in high dimensions.

High dimensions are weird.
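
Frank Morgan's cube version can be checked with arithmetic alone (my sketch): the fraction of the unit cube within epsilon of its boundary is 1 - (1 - 2*epsilon)^n.

    # fraction of [0, 1]**n within 0.01 of the boundary
    for n in (3, 100, 1000):
        print(n, 1 - (1 - 0.02) ** n)
    # n=3: 0.06, n=100: 0.87, n=1000: ~1.0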

~~~
paulgb
It's true! I just made a colab notebook to visualize the effect:
[https://colab.research.google.com/notebook#fileId=1znFHwemxa...](https://colab.research.google.com/notebook#fileId=1znFHwemxaNf2x8iAHWonBa_2LsKSGaZD)

(seems to require a Google account, sorry in advance)

~~~
semi-extrinsic
Instead of plotting the cumulative mass distribution, why not just plot the
mass distribution itself? I.e. the derivatives of the curves you've plotted?
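
That mass distribution has a closed form, the chi distribution with n degrees of freedom, so one could plot it directly instead of differentiating. A sketch (mine, not from paulgb's notebook):

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.stats import chi

    r = np.linspace(0, 40, 500)
    for n in (1, 10, 100, 1000):
        plt.plot(r, chi(df=n).pdf(r), label=f"n = {n}")   # density of ||x||
    plt.xlabel("||x||"); plt.ylabel("mass density"); plt.legend(); plt.show()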

------
smallnamespace
Isn't the unsuitability of the high-dimensional Gaussian intimately related to
the fact that for most realistic problem spaces, we believe the effective
dimensionality is really far lower than the N >> 1 dimensions we measured?

A uniform (isotropic) Gaussian presupposes that the variates are either
uncorrelated or all have the same linear interaction with each other (in the
case of a single fixed positive correlation).

If your actual problem has dimension 20, but you've measured it with N
dimensions, then that means there are strong interactions between your
measured variates, and moreover the intervariate interactions do not have a
single _fixed_ interaction strength (like a single Gaussian correlation), but
probably vary like a random matrix.

This might be related to the Tracy-Widom[1] distribution somehow. Perhaps the
distribution you use to replace the Gaussian should really be something like:
first generate a random positive semi-definite covariance matrix C, then
generate random data based on different random choices of C.
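
A rough sketch of that two-stage recipe (dimensions and names are my own illustration, not a worked-out model):

    import numpy as np

    rng = np.random.default_rng(0)
    n, N = 20, 100                          # "true" dimension vs. measured dimension
    A = rng.standard_normal((N, n))
    C = A @ A.T + 1e-6 * np.eye(N)          # random PSD covariance, rank ~ n
    data = rng.multivariate_normal(np.zeros(N), C, size=1000)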

[1]
[https://en.wikipedia.org/wiki/Tracy%E2%80%93Widom_distributi...](https://en.wikipedia.org/wiki/Tracy%E2%80%93Widom_distribution)

------
tgb
I won't dispute the main point of the article, but a couple of minor errors bug
me. First, he kept referring to a Gaussian distribution as concentrating on the
_unit_ sphere, when of course the radius depends on the parameters of the
Gaussian: for standard deviation sigma in n dimensions, the shell sits at
radius about sigma*sqrt(n). Otherwise the claim wouldn't be invariant under a
change of units. A bizarre mistake to repeat many times throughout the
article.

Less importantly, the last paragraph says that the probability that two
samples are orthogonal is "very high". Being precisely orthogonal is
technically a probability-zero event. The author means "very close to
orthogonal."
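
Concretely (my sketch): the cosine between two independent Gaussian samples concentrates around 0 with spread on the order of 1/sqrt(n), so "very close to orthogonal" but never exactly.

    import numpy as np

    rng = np.random.default_rng(0)
    u, v = rng.standard_normal(1000), rng.standard_normal(1000)
    cos = u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
    print(cos)   # on the order of 1/sqrt(1000) ~ 0.03, not exactly zero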

There was a good discussion about this problem in the context of Monte Carlo
simulations in (1).

(1) [https://arxiv.org/abs/1701.02434](https://arxiv.org/abs/1701.02434)

~~~
conjectures
On that point, my teeth were grinding because the article assumes an identity
covariance matrix; i.e., the bubble needn't even be spherical.

The second point is that the squared norm has a chi-squared distribution.
There's no point simulating it: you can just plot the pdf and get all kinds of
facts about its mean, variance, entropy, etc. in closed form. Also, IIRC,
Shannon had something to say about this.

However, I do think these facts are worth a reminder.
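
For reference, a sketch of the chi-squared point (scipy calls, mine): the squared norm of n standard normals is chi-squared with n degrees of freedom, and all the facts come in closed form.

    from scipy.stats import chi2

    n = 1000
    d = chi2(df=n)
    print(d.mean(), d.var(), d.entropy())   # mean = n, variance = 2n
    # mass within three standard deviations of the mean, no simulation needed:
    print(d.cdf(n + 3 * (2 * n) ** 0.5) - d.cdf(n - 3 * (2 * n) ** 0.5))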

~~~
jjoonathan
I don't (on the first point). Everyone with the background to understand the
problem under discussion and appreciate the explanation already understands
that Gaussians are parametrized. I challenge you to find a counterexample. The
specifics of non-isotropic parametrizations are even less relevant to the
discussion than scalar parametrization.

On the second point, I agree that the approximation deserves a mention.

------
Bromskloss
Those images [0], inputs that were optimised to maximise a certain
classification response, were cool! Instead of going to this peak of the
response function, is there a way to explore the shell where the actual images
reside? Would such images look, to our eyes, more like real input than the
optimised input? I suspect they wouldn't, but I still would like to see what is
between the dogs!

[0] [http://www.inference.vc/content/images/2017/11/Screen-Shot-2017-11-09-at-2.12.44-PM.png](http://www.inference.vc/content/images/2017/11/Screen-Shot-2017-11-09-at-2.12.44-PM.png)

~~~
lebek
This paper is sort of about that:
[https://arxiv.org/abs/1710.11381](https://arxiv.org/abs/1710.11381)

They use a gamma distribution, which has more probability density near the
origin, so samples around the origin and interpolations between samples look
more like real input.

------
snippyhollow
Obligatory "Spikey Spheres" notebook:
[http://nbviewer.jupyter.org/urls/gist.githubusercontent.com/...](http://nbviewer.jupyter.org/urls/gist.githubusercontent.com/syhw/9025964/raw/441645b476a2a997f27f5993e4da2988febe1ef3/SpikeySpheres)

~~~
cs702
Yes. t-SNE is probably the only algorithm that _might_ produce "sensible" 2D
mappings.

BTW, matplotlib has a nicer facility than add_subplot() for making grid plots:

    
    
      import matplotlib.pyplot as plt

      # figdims and MAX_DIM as defined in the notebook
      fig, axes = plt.subplots(nrows=figdims, ncols=figdims)
      for dim, ax in zip(range(2, MAX_DIM), axes.flatten()[:MAX_DIM - 2]):
          ...

------
amluto
In information theory, there's a related concept, the "typical set": the set
of sequences of samples from a distribution whose probability (or probability
density) is very close to the expected probability. If you draw a sequence of
samples, you are overwhelmingly likely to get a typical outcome as opposed to,
say, anything resembling the single most likely outcome.

As a concrete example, if you have a coin that gets heads 99% of the time and
you flip it 1M times, you are overwhelmingly likely to get around 10k tails,
even though the individual sequences with many fewer tails are each far
likelier than the typical sequences.
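
A quick check of those numbers (my sketch, assuming scipy): the all-heads sequence is the single most likely outcome, yet essentially all the mass sits near 10k tails.

    import math
    from scipy.stats import binom

    n, p = 1_000_000, 0.01                  # p = chance of tails
    tails = binom(n, p)
    print(tails.logpmf(0) / math.log(10))   # log10 P(zero tails): about -4365
    print(tails.cdf(10_300) - tails.cdf(9_700))   # ~0.997 of the mass near 10k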

------
Scene_Cast2
So I have a feeling that he's looking at the wrong histogram. If you plot the
distribution of the vector magnitudes, you'll get a spike around some large
number (sqrt(n), for a standard Gaussian) and a very sharp falloff to the
right and left.

However, it's not a "bubble" in the intuitive sense. He's looking at the
magnitude distribution of dots over the entire space, implicitly integrating
out the angular coordinates (discarding angle and looking at just the
magnitude).

If you look at the distribution of dots per unit volume (or rather, per unit
of R^N hypervolume), then you'll still have the highest concentration in the
center, with no "bubble".

------
strainer
Maybe it goes without saying, but what I found distinctive about the Gaussian
distribution in multiple dimensions is that it seems to be the only
distribution which produces a smooth radial pattern when plotted co-linearly
(yet not radially). All other distributions I tested exhibit a bias along the
main axes when a number of (variateA, variateB) pairs are plotted. The
Gaussian seems to be the only one, fundamentally, which shows no sign of the
orientation of the axes it is plotted along.

This comes in handy for plotting a radially smooth 'star cluster' without
doing polar coordinates and trig: just plot a load of
(x=a_gauss, y=another_gauss, z=another_gauss) points and you have a radially
smooth object. I don't think any other distribution can do that; it seems to
me there is something mathematically profound about it, which I'm sure some
mathemagicians have a proper grasp of.
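
That profundity has a name, I believe: the Herschel-Maxwell theorem, which says the Gaussian is the only distribution for which independent coordinates yield a rotationally symmetric joint density. A sketch (mine) of the axis bias you describe:

    import numpy as np

    rng = np.random.default_rng(0)

    def angle_lumpiness(sampler, m=200_000):
        x, y = sampler(m), sampler(m)          # m (variateA, variateB) pairs
        counts, _ = np.histogram(np.arctan2(y, x), bins=36)
        return counts.max() / counts.min()     # 1.0 means radially uniform

    print(angle_lumpiness(rng.standard_normal))              # ~1: no axis bias
    print(angle_lumpiness(lambda m: rng.uniform(-1, 1, m)))  # ~2: biased toward corners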

The 'co-linear' distortions of other distributions can be seen here in some
plots in the test page for my random distribution lib:

[http://strainer.github.io/Fdrandom.js/](http://strainer.github.io/Fdrandom.js/)

------
bglazer
I was recently reading the section on importance sampling in David MacKay's
"Information Theory, Inference, and Learning Algorithms", pages 373-376 in the
linked pdf
([http://www.inference.org.uk/itprnn/book.pdf](http://www.inference.org.uk/itprnn/book.pdf)).

He shows that importance sampling will likely fail in high dimensions
precisely because samples from a high dimensional Gaussian can be _very
different_ than those from a uniform distribution on the unit sphere.

Consider the ratio of densities, evaluated at the same sampled point, between
a 1000-D Gaussian and a uniform distribution over a 1000-D sphere. If you
sample enough times, the median ratio and the largest ratio will differ by a
factor of 10^19. Basically, most samples from the Gaussian will be fairly
similar to the uniform ones; a few will be wildly different.
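
A crude sketch of that blow-up (mine, not MacKay's exact setup): even between two barely different 1000-D Gaussians, the importance weights already spread over many orders of magnitude, and the gap grows with dimension and mismatch.

    import numpy as np

    rng = np.random.default_rng(0)
    n, sigma_q = 1000, 1.1                 # proposal just 10% wider than target
    x = sigma_q * rng.standard_normal((10_000, n))
    # log importance weight: log N(x; 0, I) - log N(x; 0, sigma_q**2 I)
    log_w = -0.5 * (x ** 2).sum(1) * (1 - 1 / sigma_q ** 2) + n * np.log(sigma_q)
    print((log_w.max() - np.median(log_w)) / np.log(10))   # several decades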

Perhaps I'm misunderstanding both the post and MacKay's book. I'd be happy to
be corrected.

~~~
kgwgk
If you sample from a 1000D Gaussian, most of the points will be "close" to the
hypersphere of radius sqrt(1000). The distance to the center for 99.5% of the
points is between 31.6-2 and 31.6+2. 99.997% will be between 31.6-3 and
31.6+3, 99.999997% will be between 31.6-4 and 31.6+4, etc.
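
Those figures check out against the chi distribution (a scipy sketch, mine):

    from scipy.stats import chi

    r = chi(df=1000)          # distribution of ||x|| for x ~ N(0, I) in 1000-D
    c = 1000 ** 0.5           # ~31.6
    for k in (2, 3, 4):
        print(k, r.cdf(c + k) - r.cdf(c - k))
    # roughly 0.995, 0.99998, 0.99999998 -- in line with the figures above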

This is what he means when he says "practically indistinguishable from uniform
distributions on the [unit] sphere." As tgb remarked in another comment, the
"unit" bit is incorrect.

------
srs70187
This is an interesting take and kudos to the author for relaying a helpful way
to think about high dimensional distributions.

I really like and often come back to this talk by Michael Betancourt where the
theme is quite similar:
[https://youtu.be/pHsuIaPbNbY](https://youtu.be/pHsuIaPbNbY)

------
andrewflnr
This reminds me of the story summed up in this quote:

    
    
      There was no such thing as an average pilot. If you’ve designed a cockpit to fit
      the average pilot, you’ve actually designed it to fit no one.
    

Good enough source here:
[http://wmbriggs.com/post/18291/](http://wmbriggs.com/post/18291/)

Humans form a very high-dimensional space. I'm not sure what to make of the
point about orthogonality in that regard.

------
brianjoseff
Having trouble understanding a lot of the specifics of this, though the
broader concepts are grok-able. Before I go blindly googling around to get up
to speed:

Any recommended foundational texts to begin with?

Any recommended learning trajectory to get to where this is understandable?

------
m3kw9
Can we derive some type of optimization algorithm from this?

