
Executing gradient descent on the earth - gballan
https://fosterelli.co/executing-gradient-descent-on-the-earth
======
ajtulloch
One big problem with the conclusion is that intuitions from low-dimensional
spaces often don’t carry over to high-dimensional spaces. E.g. a common
example is how the ratio of the volume of the unit hypersphere to that of the
enclosing hypercube goes to zero. One funny thing I saw once was something
like “the real curse of dimensionality is how different optimization in high
dimensional spaces is compared to low dimensional spaces”.
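That ratio is easy to check numerically. A minimal sketch (here "unit hypersphere" is taken as the ball inscribed in the unit cube, i.e. radius 1/2):

```python
import math

def inscribed_ball_fraction(d):
    """Fraction of the unit d-cube's volume occupied by its inscribed ball
    of radius 1/2: V = pi^(d/2) / Gamma(d/2 + 1) * (1/2)^d."""
    return math.pi ** (d / 2) / math.gamma(d / 2 + 1) / 2 ** d

for d in (2, 3, 10, 100):
    print(d, inscribed_ball_fraction(d))
```

At d=2 the ball fills about 79% of the square; by d=100 the fraction is below 10^-60.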

~~~
Terr_
> One big problem with the conclusion is that intuitions from low-dimensional
> spaces often don’t carry over to high-dimensional spaces. E.g. a common
> example is how the ratio of the volume of the unit hypersphere to that of
> the enclosing hypercube goes to zero.

Is that un-intuitive, or am I looking at it the "wrong" way? I'm imagining the
overlap of a 2D square/unit-circle versus the overlap of a 3D cube/unit-
sphere, and from those two data points there's already a downward trend.

I mean, the 2D case is really just the single cross-sectional slice through
the middle of the 3D case, the one that has the _greatest_ possible overlap.
All other possible slices will have less overlap.

Following that logic, the 3D sphere/cube overlap is likewise the "middle
cross-section of greatest overlap" for a 4D scenario, etc.

~~~
Strilanc
Try this one.

I give you a megabox: a million dimensional hypercube spanning one meter along
each axis. I also provide a bunch of megaspheres, million-dimensional
hyperspheres scaled to be 0.99 meters in diameter. How many megaspheres can
you fit into the megabox?

~~~
jstanley
I'd love to hear the correct answer to this, with explanation.

~~~
OscarCunningham
I can't tell you the exact answer, but it's at least 10^100000.

The coordinates of the centre of each megasphere have to lie in the range
[0.49,0.51]. Let's just think about megaspheres which are pushed as far as
they will go into some corner, so their coordinates are each either 0.49 or
0.51. For each of the one million directions, the distance between the centres
of two megaspheres along that dimension will be either 0.02 or 0, depending on
whether their coordinates differ or not. So by Pythagoras' Theorem the total
distance between their centres will be 0.02*sqrt(n), where n is the number of
dimensions in which their coordinates differ. In order for them to fit the
distance must be at least 0.99, so we need 0.02*sqrt(n) >= 0.99, which means
we need n >= 2451. But we have 1000000 coordinates to choose from, so
it's not hard at all to make sure that at least 2451 of them differ. We can
start with the sphere whose centre has all coordinates equal to 0.49. Then the
sphere with coordinates 0.51 in the first 2451 dimensions, and all others
0.49. Then the sphere with coordinate 0.51 in the next 2451 dimensions. And so
on.

That squeezes in at least 1000000/2451 ≈ 407 spheres, but even after that
there's loads more room. How about the sphere that has coordinates 0.49 and
0.51 alternating? Or changing every 3? In fact if we just assign coordinates
at random then each dimension has a 50/50 chance of being different, so we
would expect them to be different in 500000 dimensions. The probability of
them being different in only 2450 dimensions is about (1000000 C
2450)/(2^1000000) ≈ 10^-293570. The number of pairs of spheres is quadratic in
the number of spheres. So if we just shove the spheres into corners at random
we can expect to place about sqrt(1/10^-293570) = 10^146785 spheres before two
of them collide.

And that still leaves loads of room away from the corners! The point at
(0.5,...,0.5) is still a distance of 0.01*sqrt(1000000) = 10 away from the
centre of any sphere we've placed so far, so there's still plenty of space for
more spheres.

EDIT: The coordinates should actually be 0.495 and 0.505, but the principle is
the same.
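The arithmetic above checks out numerically. A quick sketch, using the comment's original 0.49/0.51 figures (the corrected 0.495/0.505 values change the constants but not the conclusion):

```python
import math

# Corner spheres differ by 0.02 in each dimension where their coordinates
# differ, so their centres are 0.02*sqrt(n) apart; they fit when
# 0.02*sqrt(n) >= 0.99, i.e. n >= (0.99/0.02)^2.
n_min = math.ceil((0.99 / 0.02) ** 2)

# Log10 of (1000000 C 2450) / 2^1000000, the (approximate) probability that
# two random corner spheres differ in too few coordinates, via lgamma.
log_comb = math.lgamma(1_000_001) - math.lgamma(2_451) - math.lgamma(997_551)
log10_p = (log_comb - 1_000_000 * math.log(2)) / math.log(10)

print(n_min, log10_p)
```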

------
OscarCunningham
The actual global minimum isn't the ocean but rather the Dead Sea. I imagine
this is rather hard to find, since you have to be fairly close to it before
the gradient leads toward it rather than the ocean.

In fact we can be very precise. The Dead Sea has a catchment area of 41650
km^2, which is 0.0000816 of the Earth's surface. So we need on average 12200
random initializations in order to find the Dead Sea by gradient descent.
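A quick check of that arithmetic (assuming roughly 510 million km^2 for the Earth's total surface area):

```python
catchment_km2 = 41_650           # Dead Sea catchment area
earth_km2 = 510_072_000          # total surface area of the Earth
p = catchment_km2 / earth_km2    # chance one uniform random start lands inside

# The number of independent runs until the first success is geometric,
# with mean 1/p.
expected_runs = 1 / p
print(p, expected_runs)
```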

~~~
mirimir
Well, the Dead Sea is certainly below mean ocean level. But the deepest
oceanic trenches are far deeper (~11 km below mean ocean level) than the
bottom of the Dead Sea (~0.74 km below mean ocean level).

~~~
sparky_z
Except it's pretty clear that for this problem, as posed, we're considering
the surface of the ocean to be the "ground".

~~~
mirimir
OK, sure. But the site talks about energetic boulders. And they wouldn't stop
at sea level.

------
madisfun
Finding the ocean is easy: water and gravity have formed the surface to have
few local minima. It would be much more interesting to find peaks: there are
many local maxima.

And the most important maxima are often surrounded by lots of local maxima.
They may also have some steep slopes so gradient methods can easily overshoot.
Sometimes the surface function is not even continuous (overhanging walls).
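The overshoot point can be seen even in one dimension. A toy sketch (nothing to do with the actual elevation data): on f(x) = x^2, a fixed step size that works on gentle curvature diverges once the step exceeds the stability limit.

```python
def descend(lr, x=1.0, steps=20):
    """Gradient descent on f(x) = x^2 (gradient 2x) with a fixed step size."""
    for _ in range(steps):
        x -= lr * 2 * x
    return abs(x)

print(descend(0.4))   # shrinks toward the minimum at 0
print(descend(1.1))   # each step overshoots and lands farther away
```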

~~~
iainmerrick
_water and gravity have formed the surface to have few local minima_

From the article: "It turns out that the earth is _filled_ with local minima"

(I suspect the deal is that you're right that water always flows downhill to
the ocean; but that it often happens at scales too small to capture in the GIS
data, or underground.)

~~~
chrisfosterelli
Yeah you got it! Water moves in ways that are not captured by the resolution
of the elevation dataset. Also, water moves into/through the ground.

------
ucaetano
Well, at a large scale, performing gradient descent on the earth should
always end at an ocean or at an endorheic basin.

Sure, when you increase the "resolution" to meters or smaller, there will be
an endless number of mini-endorheic basins. For example, the former Collect
Pond in NYC:

[https://en.wikipedia.org/wiki/Collect_Pond](https://en.wikipedia.org/wiki/Collect_Pond)

~~~
mdturnerphys
Any lake or pond is a local minimum. If it has an outlet it's only because the
water level has risen to the point that it can escape.

~~~
sitkack
If we want to continue this analogy: when GD finds a local minimum, it needs
to start filling that basin so it can escape.
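One standard way to act on that idea is basin hopping: descend to a minimum, perturb the point, re-descend, and keep the best result. A toy 1-D sketch (the double-well function and all constants here are made up for illustration):

```python
import random

random.seed(0)

def f(x):
    # Double well: a local minimum near x = +1, the global one near x = -1.
    return (x * x - 1) ** 2 + 0.2 * x

def grad(x):
    return 4 * x * (x * x - 1) + 0.2

def descend(x, lr=0.01, steps=2000):
    for _ in range(steps):
        x -= lr * grad(x)
    return x

stuck = descend(2.0)          # plain descent settles in the wrong basin

best = stuck
for _ in range(50):           # hop: perturb the minimum and re-descend
    candidate = descend(best + random.gauss(0, 2.0))
    if f(candidate) < f(best):
        best = candidate
```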

------
hodgesrm
OK, I didn't run through the math on this but I have spent a lot of time
walking around the Olympics and the computed path from the summit of Olympus
looks extremely inefficient. Also there's a point east of Forks that looks as
if the path was close to getting trapped. The example does not seem compelling
as a proof of convergence, especially if you reduce the momentum component to
a point where you can't just hop over 4000' ridges as this path does.

For now I'm sticking with the path down the Hoh River. ;)

edit:typo

------
make3
Maybe use a different dataset with the levels of the oceans, and see if you
can find the lowest point on earth (Challenger Deep) from the peak of
Everest? :P

And now, do Adagrad, RMSProp, RMSProp + Nesterov and Adam, and maybe Newton,
BFGS, L-BFGS and conjugate gradients, and then coordinate ascent :)

It would be a pretty good educational tool to teach people the different
gradient descent methods, though it's probably too simple of a problem for
these methods to be at all useful.
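As a starting point for such a tool, the update rules themselves are only a few lines each. A sketch of two of them (momentum and Adam) minimizing a toy 1-D function; the constants are arbitrary:

```python
import math

f = lambda x: (x - 3.0) ** 2        # toy objective, minimum at x = 3
g = lambda x: 2 * (x - 3.0)         # its gradient

def momentum(x=0.0, lr=0.1, beta=0.9, steps=200):
    v = 0.0
    for _ in range(steps):
        v = beta * v + g(x)         # accumulate a decaying velocity
        x -= lr * v
    return x

def adam(x=0.0, lr=0.1, b1=0.9, b2=0.999, eps=1e-8, steps=200):
    m = v = 0.0
    for t in range(1, steps + 1):
        gr = g(x)
        m = b1 * m + (1 - b1) * gr          # first-moment estimate
        v = b2 * v + (1 - b2) * gr * gr     # second-moment estimate
        m_hat = m / (1 - b1 ** t)           # bias corrections
        v_hat = v / (1 - b2 ** t)
        x -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return x
```

Swapping the scalar for a 2-D elevation lookup (with finite-difference gradients) would give exactly the side-by-side comparison described above.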

------
wavefunction
>The earth should actually be a very easy function to optimize. Since the
earth is primarily covered by oceans, more than two thirds of possible valid
inputs for this function return the optimal cost function value.

I guess if you're going to count the surface of the oceans as part of the
descent, you have to remember that the surface of the ocean is not uniform in
altitude, due to tidal forces and other factors.

------
edejong
“We have success!”

Ehm, no. The lowest point should have been somewhere 413 meters below sea
level (the Dead Sea depression).

Earth is actually a very difficult place to perform gradient descent on, due
to its very large plateau (oceans).

(Edit: lowest point is 413 meters)

~~~
chrisfosterelli
The lowest point would be in the ocean, if you want to consider elevations
below 0. In this implementation everything under 0 is rounded to 0. Otherwise,
you'll have negative loss.

~~~
edejong
Obviously I considered the lowest point on land.

------
placebo
While there are good reasons to use gradient descent on neural networks,
finding the global minimum of the surface of the earth would not be my first
choice. Various metaheuristics would do much better.

------
subroutine
"We have success! It’s interesting watching the behaviour of the optimizer, it
seems to fall into a valley and “roll off” each of the sides on it’s way down
the mountain"

Anyone sitting at a computer right now have a few minutes to make an animation
of this? I'd love to see it in action.

~~~
camtarn
The last image in the article is an animation with a tiny red dot following
the path.

