
Understanding gradient descent - ingve
http://eli.thegreenplace.net/2016/understanding-gradient-descent/
======
chestervonwinch
I've always thought that this was a fun way to look at it:

Suppose you solve the differential equation,

    
    
        x'(t) = -f'(x(t))                    (1)
    

Then,

    
    
        d/dt f(x(t)) = f'(x) x'(t) = - [f'(x)]^2 <= 0
    

In other words, if `x(t)` follows a path that solves (1), then `x(t)` follows
a path that decreases the value of `f(x)`.

The gradient descent algorithm is a numerical approximation to solving (1)
using the forward Euler method:

    
    
        (x(t_(n+1)) - x(t_n)) / dt = -f'(x(t_n))

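A minimal sketch of that correspondence in code (assuming the concrete example f(x) = 0.5*x^2, so f'(x) = x; the function names are mine): one forward Euler step on (1) is exactly one gradient descent update, with dt playing the role of the step size.

```python
# Forward Euler on x'(t) = -f'(x(t)) gives x_{n+1} = x_n - dt * f'(x_n),
# which is exactly the gradient descent update with step size dt.
def grad_descent(fprime, x0, dt=0.1, steps=100):
    x = x0
    for _ in range(steps):
        x = x - dt * fprime(x)  # one Euler step = one GD update
    return x

# Example: f(x) = 0.5 * x**2, so f'(x) = x; the minimum is at x = 0.
x_min = grad_descent(lambda x: x, x0=5.0)
```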
~~~
dxbydt
Perhaps an example might help.

Say x(t) = exp(-t)

f(x(t)) = (x^2)/2

Then it satisfies conditions imposed by parent, namely x' = -x = -f'

~~~
chestervonwinch
Nice. I think you mean f(x) = 0.5*x^2 though.

------
jordigh
I need to write something about conjugate gradients. The first thing we did in
my optimisation class at CIMAT was gradient descent, and we saw how slow it
was compared to everything else. In practical applications, there's way too
much magic in choosing the so-called "learning rate" (known as "step size" in
numerical optimisation). Even choosing an optimal step size per iteration (by
doing a line search), gradient descent is still slow. Conjugate gradients are
a bit more difficult to understand, but they're not _that_ hard to implement
if you just blindly copy a few formulae, and they are a _dramatic_ speedup.
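Not from the article, but a sketch of the claim on a toy quadratic f(x) = 0.5*x'Ax - b'x (the matrix A, vector b, and function names here are illustrative): on an ill-conditioned problem, linear conjugate gradients finish an n-dimensional quadratic in at most n steps, while steepest descent with an exact line search is still far away.

```python
import numpy as np

# Toy ill-conditioned quadratic: f(x) = 0.5 x'Ax - b'x, with gradient Ax - b.
A = np.diag([1.0, 100.0])
b = np.array([1.0, 1.0])
x_star = np.linalg.solve(A, b)  # true minimizer

def steepest_descent(A, b, x0, iters):
    x = x0.copy()
    for _ in range(iters):
        r = b - A @ x                  # negative gradient
        alpha = (r @ r) / (r @ A @ r)  # exact line search along r
        x = x + alpha * r
    return x

def conjugate_gradient(A, b, x0, iters):
    x = x0.copy()
    r = b - A @ x
    p = r.copy()
    for _ in range(iters):
        Ap = A @ p
        alpha = (r @ r) / (p @ Ap)
        x = x + alpha * p
        r_new = r - alpha * Ap
        beta = (r_new @ r_new) / (r @ r)  # Fletcher-Reeves-style update
        p = r_new + beta * p
        r = r_new
    return x

x0 = np.zeros(2)
x_cg = conjugate_gradient(A, b, x0, 2)  # exact after n = 2 iterations
x_sd = steepest_descent(A, b, x0, 2)    # still far from x_star
```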

All of these blog posts about gradient descent feel like everyone keeps going
on about what a great algorithm bubblesort is because it's so easy to
understand and implement.

~~~
eliben
Claiming that gradient descent is a great algorithm certainly wasn't the goal
of this article. But GD is the basis from which most other iterative
algorithms stem, and it's worthwhile having a good understanding of its inner
workings as a prerequisite for more advanced algorithms.

Personally I found
[http://sebastianruder.com/optimizing-gradient-descent/](http://sebastianruder.com/optimizing-gradient-descent/)
interesting - it goes into the advanced variants like Nesterov, Adadelta, etc.

~~~
jordigh
I realise, but since a lot more gets written about steepest descent than about
everything else, it also gets implemented and used a lot more often, even when
other alternatives are readily available and already implemented. For example,
Matlab's trainscg method is relatively obscure, and is therefore rarely used
or re-implemented.

There's an implicit endorsement in these blog posts. People wouldn't be
spilling all this ink nowadays on plain adalines, even if they're building
blocks for backprop networks, right? So by writing so much about it, people
get the impression that it has to be studied and implemented very carefully,
to the exclusion of better methods.

~~~
maxerize
As the above poster said, if you can introduce people to a topic by explaining
a simplified or 'naive' solution/algorithm, then that can be a good
springboard to learn more about the topic. That's why there should be more
comments presenting improvements and alternatives, rather than criticising
what is obviously meant as a primer on optimisation. From the intro:

 _Gradient descent is a standard tool for optimizing complex functions
iteratively within a computer program. Its goal is: given some arbitrary
function, find a minima.

For some small subset of functions - those that are convex - there's just a
single minima which also happens to be global. For most realistic functions,
there may be many minima, so most minima are local.

Making sure the optimization finds the "best" minima and doesn't get stuck in
sub-optimal minima is out of the scope of this article.

Here we'll just be dealing with the core gradient descent algorithm for
finding some minima from a given starting point._

~~~
theoh
The way the author uses the word "minima" as if it were both singular and
plural is a really discouraging sign.

------
pmarreck
If only there were a faster known way to get the inverse of a matrix...
Gradient descent seems like an ugly hack compared to that (having taken an ML
class)

It reminds me of taking the derivative by manually measuring the slope between
2 points on the curve, instead of, you know, directly getting the derivative

~~~
plg
[http://www.johndcook.com/blog/2010/01/19/dont-invert-that-matrix/](http://www.johndcook.com/blog/2010/01/19/dont-invert-that-matrix/)
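The gist of that post in a sketch (the random test matrix here is just for illustration): to compute A^(-1)b you solve the linear system directly instead of ever forming the inverse, which is both cheaper and more numerically stable.

```python
import numpy as np

# To compute A^{-1} b, don't form A^{-1}; solve the system A x = b instead.
rng = np.random.default_rng(0)
A = rng.standard_normal((100, 100))
b = rng.standard_normal(100)

x_solve = np.linalg.solve(A, b)  # factorize A, then triangular solves
x_inv = np.linalg.inv(A) @ b     # forms the full inverse: slower, less stable

assert np.allclose(x_solve, x_inv)
```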

~~~
Sherlock
This is the clearest signal for telling apart people who have studied applied
math from people who haven't.

------
gulpahum
What about the minimum of a non-differentiable function? I've been using something
like Nelder–Mead method (downhill simplex method / amoeba method) [1], but it
gets slow near the minimum.

[1]
[https://en.wikipedia.org/wiki/Nelder–Mead_method](https://en.wikipedia.org/wiki/Nelder–Mead_method)
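For reference, a minimal sketch of the method via SciPy's implementation (the objective f and the tolerances here are made up for illustration): since Nelder–Mead is derivative-free, it copes with kinked objectives where gradient descent has no gradient to follow.

```python
import numpy as np
from scipy.optimize import minimize

# A non-differentiable objective: kinked at its minimizer (1, -2).
def f(x):
    return abs(x[0] - 1.0) + abs(x[1] + 2.0)

# Nelder-Mead uses only function values, never derivatives.
res = minimize(f, x0=np.array([5.0, 5.0]), method='Nelder-Mead',
               options={'xatol': 1e-8, 'fatol': 1e-8})
```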

