
Choosing 3M points on a 300-dimensional sphere - CarolineW
https://mikesmathpage.wordpress.com/2016/07/23/choosing-3-million-points-on-a-300-dimensional-sphere/
======
grizzly2panam
I think this is a pretty correct approximation. An even better way is to
assume vectors are uniformly distributed on the sphere. The dot product has
expectation 0 and the variance is 1/300 (cf
[http://stats.stackexchange.com/questions/85916/distribution-...](http://stats.stackexchange.com/questions/85916/distribution-
of-a-scalar-product-of-two-random-unit-vectors-in-mathbbrd)).

Chebyshev's inequality shows that the probability that the dot product be
greater than 0.3 in absolute value is less than 4%.

------
murbard2
If you want to generate vectors uniformly on a n-dimensional unit sphere,
generate them as i.i.d vectors of n standard normal variates and renormalize.

    
    
        x = np.random.randn(300)
        x /= np.sqrt(np.sum(x*x))
    

To see why consider that the joint pdf depends solely on the vector's distance
from the origin.

~~~
tintor
That is not going to be uniform.

~~~
keithlfrost
uh... yes, yes it is. see for example
[http://mathworld.wolfram.com/HyperspherePointPicking.html](http://mathworld.wolfram.com/HyperspherePointPicking.html)

~~~
dahart
It's uniform if you reject points with length > 1 before normalizing (your
link mentions this), but not uniform as written. It's a simple adjustment, but
you do have to cull.

Edit: yeah oops, I goofed and I'm wrong. I didn't notice it was randn -
normally distributed. No culling necessary.

~~~
murbard2
If you try rejection sampling in 300 dimensions, you're gonna have a bad time.

~~~
dahart
Is it bad for a unit sphere specifically, or just a generally bad idea? I was
just thinking the volume ratio of the hyper sphere to the hyper cube is always
over 0.5 in any number of dimensions? But I'm not sure about that.

Or do you mean more that it sucks pretty bad to generate 300 rands and then
throw them away, even once?

~~~
asafira
Nope, I was wrong --- the volume of the hyper sphere to the hyper cube is not
always over 0.5.

~~~
murbard2
It in fact decreases exponentially

~~~
dahart
Yeah, wild, I didn't know this.
[https://en.m.wikipedia.org/wiki/Volume_of_an_n-
ball](https://en.m.wikipedia.org/wiki/Volume_of_an_n-ball)

Jeez, my intuition stinks. I initially thought it was obvious why the 3D
sphere to unit cube ratio is > 0.5, by analogy to the 2d diamond embedded in a
square, but in 3D the embedded octagon volume isn't 0.5. Obvious now, but
that'll teach me not to guess.

Indeed, rejection sampling the 300 dimensional unit sphere will suck really
bad.

------
skybrian
Here's a rough analogy to get intuition going:

Say we forget about dimensions and hyperspheres for a moment and just talk
about quizzes. Specifically, a quiz of 300 true-false questions.

We have two students fill out the quiz at random, by flipping a coin. What's
the probability that they answer all the questions the same way? 2^300 -
vanishingly small. Similarly for answering all the questions in exactly the
opposite way.

If we only have two or three dimensions, that's like a quiz with only two or
three questions. If the possible answers to the questions are constrained,
it's fairly likely that some students would get similar results just by
chance. But the number of combinations goes up exponentially as we add more
questions.

~~~
skybrian
Oops, that should be 1/(2^300).

------
jobigoud
So, I'm not understanding something. If you take a vector formed only of ±1
coordinates, it's not going to be within the n-dimensional ball, nor on the
n-1-dimensional sphere, it's going to be at the corner of the unit hypercube,
right?

Secondly, I don't understand by which measure the distribution of these
vectors would be uniformly distributed or random. Normalizing the vectors to
get the coordinates on the unit hypersphere is only ever going to land on 2^n
different coordinates. Clearly in low dimension this wouldn't be a good way to
select random points on the sphere…

Does his approach make sense only because 2^300 is so much larger than 3
million?

~~~
Dylan16807
> If you take a vector formed only of ±1 coordinates, it's not going to be
> within the n-dimensional ball, nor on the n-1-dimensional sphere

The points are going to be on a hypersphere, just not the unit one. The radius
will be about 17, right?

> Does his approach make sense only because 2^300 is so much larger than 3
> million?

It has to be larger, I don't think it needs to be crushingly larger. As long
as you're still sparse, all those extra digits on your random numbers don't
matter. You can round it all the way to a single bit. Each dimension is only a
tiny factor on your position on the sphere, so you get perfectly good coverage
for something like dot products.

~~~
jobigoud
Ah, thanks, got it now. Points are on the circumscribed sphere, radius
sqrt(300).

------
javajosh
300 dimensions seems like way too many to have any kind of spatial intuition
about. For me, it's easier to think of a n-tuple where n is 300. I'm not sure
really what the cardinality of each dimension in this problem is, but I'll
assume its a) the same for every dimension and b) something like a subset of
the natural numbers (like, one int per 100k english words or something).

The sphere part is a little confusing at first, but it's really just a fancy
(and yes, elegant) way of specifying a constraint on the n-tuple. Given some
radius R, it defines the constraint sum(n_i^2) = R^2 for i=1..300.

It might be useful to normalize R, and it is convenient to select R=300 so
that the average contribution of each n_i^2 term is 1. At this point, you just
have to pick 3M n-tuples. I would probably use a random number generator to a)
fill in the n-tuple, b) adjust a constant factor to fit with real world R, and
c) repeat 3M times.

------
blackholesRhot
This general area is called "concentration of measure" in case you want to
Google around and find other examples. A simple but informative calculation is
to compute the volume of a unit n-sphere and the volume contained between
radius .9-1, and then see how the ratio of these scales with dimension. You'll
see that that almost all of the volume is contained close to the boundary
(exponential in dimension.)

------
orasis
This is called random projection. Here is the paper:

[https://users.soe.ucsc.edu/~optas/papers/jl.pdf](https://users.soe.ucsc.edu/~optas/papers/jl.pdf)

------
Jabbles
Are the points on the surface of the sphere as implied by the title, or in the
unit sphere, as implied by the quote?

~~~
Dylan16807
The size of the sphere doesn't matter because he normalizes the dot product in
step 7.

