
Why Momentum Works - m_ke
http://distill.pub/2017/momentum/
======
throwaway71958
Some of the multi-author articles on Distill have a very important (IMO)
innovation: they quantify precisely the contribution each author has made to
the article. I would like to see this become the norm in scientific papers, so
that on the one hand it'd be clear whom to ask if questions arise, and on the
other the various dignitaries won't get their honorary spot on the author
list of papers they were barely involved with scientifically.

~~~
colah3
As others mentioned, this is common in other fields, it's just not done in
machine learning.

We've been reading through the policies of lots of journals that seem
thoughtfully run and borrowing good ideas. :P

------
tempodox
I was fully expecting an article about some braindead product that nobody
needs, called Momentum. Imagine my surprise finding physics and a healthily
low percentage of BS.

~~~
dnautics
it's not really physics so much as numerical methods.

~~~
malmsteen
your contribution is appreciated

------
tzs
I'm curious about the method chosen to give short term memory to the gradient.
The most common way I've seen when people have a time sequence of values X[i]
and they want to make a short term memory version Y[i] is to do something of
this form:

    Y[i+1] = B * Y[i] + (1-B) * X[i+1]

where 0 <= B <= 1.

Note that if the sequence X becomes a constant after some point, the sequence
Y will converge to that constant (as long as B != 1).

For giving the gradient short term memory, the article's approach is of the
form:

    Y[i+1] = B * Y[i] + X[i+1]

Note that if X becomes constant, Y converges to X/(1-B), as long as B in
[0,1).

Short term memory doesn't really seem to describe what this is doing. There is
a memory effect in there, but there is also a multiplier effect when in
regions where the input is not changing. So I'm curious how much of the
improvement is from the memory effect, and how much from the multiplier
effect? Does the more usual approach (the B and 1-B weighting as opposed to a
B and 1 weighting) also help with gradient descent?
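
A quick numerical check of the two recurrences (a throwaway sketch, not from the article):

```python
# Compare the two smoothing schemes on a constant input X = 1.0 with B = 0.9.
B = 0.9
X = 1.0

y_ema = 0.0   # Y[i+1] = B * Y[i] + (1-B) * X[i+1]  (the usual EMA weighting)
y_mom = 0.0   # Y[i+1] = B * Y[i] + X[i+1]          (the article's accumulation)
for _ in range(200):
    y_ema = B * y_ema + (1 - B) * X
    y_mom = B * y_mom + X

print(y_ema)  # -> ~1.0   (converges to the constant X)
print(y_mom)  # -> ~10.0  (converges to X / (1 - B), the multiplier effect)
```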

~~~
halflings
I assume that multiplying by a given factor shouldn't matter, since you still
have the learning rate as a factor (which itself multiplies the gradient).
This might just mean that the learning rate should be lower or higher with
this method.

~~~
im3w1l
The question is then really about which method makes it easier to tune
parameters or which helps intuition the most.

------
poppingtonic
I'm really loving the choice of articles, especially since you're just getting
started.

Edit: I'm referring to the journal, not the author.

~~~
colah3
Thanks! We're just lucky to have authors like Gabe (@gabrielgoh) come to us
with incredible articles. :)

~~~
stared
How do I know whether an article would fit there? For example, I was thinking
about adjusting my
[http://p.migdal.pl/2017/01/06/king-man-woman-queen-why.html](http://p.migdal.pl/2017/01/06/king-man-woman-queen-why.html)
(already with some interactive components) or writing about RoI pooling
([https://deepsense.io/region-of-interest-pooling-explained/](https://deepsense.io/region-of-interest-pooling-explained/)
- by my colleague, but more interactive).

Would it be on-topic? (After changing the style accordingly.)

~~~
colah3
Please check out our journal policies page:
[http://distill.pub/journal/](http://distill.pub/journal/)

In brief, Distill needs to see three things to publish an article: outstanding
communication, advancing the dialogue, and scientific integrity. Distill often
works with authors to help them bring their articles up to our standards.

Additionally, as a primary publication, Distill will not republish content
already published elsewhere, or publish "translations" of papers where someone
rewrites the content of a previous paper. (This relates to advancing the
dialogue.)

If you reach out to editors@distill.pub, we're happy to discuss pre-submission
inquiries about journal scope and related topics.

~~~
stared
Thank you! Of course I read it; it's just that (as this is a new thing) I'm
still guessing what is a good fit and what isn't.

(For some reason I thought that the t-SNE article was published elsewhere. Now
I see that it was on Distill, just before its big start.)

~~~
colah3
Yep, we're still clarifying our policies. If you have questions, please email
us!

(Distill needs to be extra careful about a lot of this stuff because we're
trying to build legitimacy for a kind of work that many people are inclined to
not treat as academic contributions. So on some things, like being a primary
publication, we may end up taking a more defensive posture than we would in an
ideal world.)

~~~
stared
I wish all the best to this wonderful initiative! (As a side note, my co-
advisor dreamt about this style of communication in science (though, in
physics):
[https://physicsnapkins.wordpress.com/2012/04/16/a-personal-dream-jpi/](https://physicsnapkins.wordpress.com/2012/04/16/a-personal-dream-jpi/).)

I totally understand that you need to set a high reference level at the very
beginning, even at the cost of being a bit "conservative" (I am not sure that
is the best word here).

------
lutorm
It would be nice if the introduction made clear that the "Momentum" that
"works" is some algorithm and not at all the physical concept "momentum".

~~~
justinpombrio
At least it wasn't some random software project. My expectation was about
50/50 between "something to do with actual momentum" and "yet another
Javascript framework".

------
wonderous
Reminds me of this paper, "Causal Entropic Forces":

[http://math.mit.edu/~freer/papers/PhysRevLett_110-168702.pdf](http://math.mit.edu/~freer/papers/PhysRevLett_110-168702.pdf)

~~~
RangerScience
This is one of my favorite things ever. You should also look up "Jeremy
England's Entropic Life" for a good companion thought.

~~~
wonderous
A New Physics Theory of Life:
[https://www.quantamagazine.org/20140122-a-new-physics-theory-of-life/](https://www.quantamagazine.org/20140122-a-new-physics-theory-of-life/)

How Life (and Death) Spring From Disorder:
[https://www.wired.com/2017/02/life-death-spring-disorder/](https://www.wired.com/2017/02/life-death-spring-disorder/)

~~~
RangerScience
Yep! I encountered both of these within a few weeks of each other, after
having come to a similar (but way less educated and rigorous) conclusion to
Prof England's.

------
colmvp
I have to say that the dynamics-of-momentum diagram is a thing of real beauty.
The whole paper felt a little NYTimes-like, and then of course I see that Shan
Carter helped a little bit with it!

------
mark212
I can't follow the math but the presentation is gorgeous (Safari on a MacBook
retina display). Really great, keep up the good work!

~~~
hardmaru
Which part of the math did you find difficult to follow?

~~~
ABCLAW
Better question: What background is the reader expected to have?

Until Xi, not a single variable is defined prior to its inclusion in an
expression. Even Alpha and Beta are defined only in the header diagram rather
than in the body of the text. Also, why are the iteration indices in the
superscript rather than the subscript?

And before someone chimes in and says that rudiments are necessary to
understand this work, no they aren't. The logical steps here are exceptionally
simple (and intuitive - as the introduction might lead you to believe) once
you get past the delivery. This is a fantastic article that could become very
accessible with the proper notation housekeeping.

~~~
gabrielgoh
These are valid criticisms, thank you very much for this.

------
simplynot
But simply not on Firefox

~~~
colah3
Thanks for pointing that out! We fixed the diagram bug in Firefox. There's
still bad performance for ~30s after page load -- we're looking into why
that's happening -- but after that the page seems to work well.

~~~
detaro
Generating the nice formulas from TeX syntax seems to eat most of those 30s
for me - maybe you could do that during the build instead of pushing that work
to the client?

------
amelius
Curious, has this method been used for solving linear systems? How would it
perform e.g. against conjugate gradient?

And how would it perform for non-positive-definite systems?

~~~
gabrielgoh
Author here - yes! It can be used to solve symmetric PSD systems, as the
solution of Ax = b is also the minimizer of 0.5*x'Ax - b'x. Conjugate gradient
can be seen as a special form of momentum with adaptive alphas and betas.
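
A tiny numerical sketch of this (all values made up): momentum gradient descent on a 2x2 SPD system recovers the solution of Ax = b.

```python
import numpy as np

# Minimize 0.5*x'Ax - b'x with the momentum iteration; its minimizer solves Ax = b.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])   # symmetric positive definite
b = np.array([1.0, 1.0])

alpha, beta = 0.3, 0.5       # step size and momentum, chosen by hand
x = np.zeros(2)
z = np.zeros(2)              # accumulated gradient (the momentum term)
for _ in range(100):
    grad = A @ x - b         # gradient of 0.5*x'Ax - b'x
    z = beta * z + grad
    x = x - alpha * z

print(x)                     # close to np.linalg.solve(A, b) -> [0.2, 0.4]
```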

~~~
chestervonwinch
> Conjugate gradient can be seen as a special form of momentum

Just to be clear, though, CG doesn't use the negative gradients as search
directions, as steepest descent would.

------
Paul-ish
For those who don't read materials about optimization as much as they maybe
should, what is "w⋆"? It is used without introduction and I don't know what it
is. Perhaps this is a convention I am not aware of?

~~~
blt
Optimal w. It is a convention but still ought to be introduced.

------
hatsunearu
Pretty cool - it brings in ideas from classical control theory. I always kind
of missed that classical controls weren't really brought up in discussions of
gradient descent.

------
PDoyle
People, listen. "Damping" means to reduce the amplitude of an oscillation.
"Dampening" means to make something wetter.

</rant>

------
pizza
Related? Flat minima

[http://www.bioinf.jku.at/publications/older/3304.pdf](http://www.bioinf.jku.at/publications/older/3304.pdf)

~~~
gabrielgoh
This is indeed related. See the section on polynomial regression!

------
Animats
Hm. So that helps with high-frequency noise. Any progress on what to do when
the dimensions are of vastly different scales? I have an old physics engine
which had to solve about 20-value nonlinear differential equations. During a
collision, the equations go stiff, and some dimensions may be 10 orders of
magnitude steeper than others. Gradient descent then faces very steep knife
edges. This is called a "stiff system" numerically.

~~~
gabrielgoh
Author here - I believe the problem of a "stiff system" you're referring to is
exactly the problem of pathological curvature!

Some points not touched on in the article. If the individual dimensions are of
different scales, this problem can be easily fixed with a diagonal
preconditioner. Even something like ADAM or Adagrad (unconventional, I know,
in this domain) can be used.

There's also a small industry around more sophisticated preconditioners for
the linear systems in PDEs, see Multigrid, for example, or preconditioned
conjugate gradient.
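
A toy sketch of the diagonal-preconditioner point (the scales here are made up):

```python
import numpy as np

# A badly scaled quadratic: dimensions differ by six orders of magnitude.
scales = np.array([1.0, 1e6])
grad = lambda w: scales * w            # gradient of 0.5 * sum(scales * w**2)

w = np.array([1.0, 1.0])
D_inv = 1.0 / scales                   # diagonal preconditioner
for _ in range(50):
    w = w - 0.5 * D_inv * grad(w)      # preconditioned step; both coordinates
                                       # now contract at the same rate (0.5)
print(w)                               # both coordinates near 0
```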

~~~
Animats
The stiffness may be local. It definitely is in a physical simulation for hard
collisions. Machine learning data is usually normalized into [0..1], so if you
get a really steep slope, something is pathological.

------
fabmilo
If you are curious to see the code used to produce the post, you can check it
out here:
[https://github.com/distillpub/post--momentum](https://github.com/distillpub/post--momentum).
I was surprised to see that each post has its own HTML page and JavaScript
library; I was expecting some form of rendering engine and a common JavaScript
library.

------
codekilla
It would be really, really great if you could somehow hook this up to
Discourse so people could comment on and ask questions about the article.
Allowing people to ask questions and having others answer like MathOverflow
would I think bring a lot more clarity. Many different kinds of people want to
understand material like this but may need the math unpacked in different
ways.

~~~
ktta
I don't see how Discourse will be better than something like HN or reddit.
There's also a submission to Reddit[1].

With discourse, I think there will be more noise and a lot of time will be
wasted scrolling through unnecessary replies. What's good about the thread-
like nature of HN/Reddit is that you'll have proper context and the rating
system does its job so everyone's time won't be wasted.

Questions can be answered here and on Reddit too. I think Reddit can be
sometimes more helpful than HN when it comes to answering 'easy' questions so
I would go there if you have any simple questions.

[1]: [https://www.reddit.com/r/MachineLearning/comments/63f3uk/r_why_momentum_really_works/](https://www.reddit.com/r/MachineLearning/comments/63f3uk/r_why_momentum_really_works/)

~~~
codekilla
I disagree. The most useful replies can be upvoted to deal with noise, and
using mathematical formulas to help explain responses is absolutely crucial. I
can't tell you how helpful it has been to be looking at an obscure proof, post
a question to Math Overflow, and have the answer explained in an intuitive way
with reference to the symbols and notation used.

These articles on distill I believe could greatly benefit from this. Let the
community help distill.

~~~
ktta
I see where you're coming from. But maintaining another service, with the
added expenses, community managers, spam control, etc., might be a bit much
for something that is intended to be a publishing platform.

And if there is a question that requires more control, like math formatting,
etc. I would actually suggest posting to cross validated[1] and then linking
it here.

[1]: [http://stats.stackexchange.com/](http://stats.stackexchange.com/)

~~~
codekilla
Understood. It's true that this type of thing could incur additional expenses
and effort, but I think it is truly worth it. It's going to take a push from
the top to create a community around the idea, a community that can distill
the idea to those who do not understand it. I really strongly believe that
everything needs to be in the same place, the tooling needs to be good
(perfect formatting of both code and symbols), etc.

I admire projects like Distill, but I can't help but think that an article
like this suffers from what I will term 'the symbol grounding problem' (yes, a
theft from classical AI). When you write an article like this, for some people
it is incomprehensible because the symbols used are not grounded in concrete
numerical examples. It's been my experience (and just look at some of the
comments on this thread) that when you don't provide many analogies and
examples of concrete computations to illustrate what the mathematical symbols
encode, a very significant portion of readers do not actually take away any
understanding.

I truly do not want this to be the case, and I must strongly advocate that
building infrastructure to help the community pitch in is absolutely critical.
It should not be only on the author to take on the burden; with a community it
can be done much better. It's worth it to build something where you publish an
article and by default it is expected that questions will be asked and answers
will be provided. I work in academia at a technical institute you have
definitely heard of, and I see this problem every day, all day. If someone at
Distill reads this, please consider it carefully.

------
xapata
> The added inertia acts both as a smoother and an accelerator

    momentum = mass * velocity
    force = mass * acceleration
    momentum = (force / acceleration) * velocity

So, it looks to me like momentum is inversely related to acceleration. It
doesn't seem right to call momentum an "accelerator".

~~~
colah3
Hi! This article is about momentum in the mathematical field of optimization.
Acceleration also refers to a phenomenon in optimization. While there are deep
connections to their physical analogues, they aren't quite the same thing.

If you want to make your analogy work, the momentum algorithm adds mass to an
object. In terms of literal acceleration at any point in time it is neutral,
but the added mass causes you to get through difficult areas much faster.

That said, much of this article is moving away from that physical analogy. :)

~~~
xapata
The use of the term "accelerator" is misleading as it's introduced during the
physical metaphor. You should consider editing that sentence to clarify your
meaning:

The added inertia smooths out variation in velocity, dampening oscillations
and causing us to barrel through narrow valleys, small humps and local minima.
Keeping our speed steadier, we arrive at the global optimum faster.

Note the alliteration :-)

> momentum accelerates convergence

Here it's more clear that momentum is accelerating convergence, not the "heavy
ball" itself.

> inertia acts both as a smoother and an accelerator

On the second read, it's more clearly a contradiction. Speed can't be both
held steadier and accelerated simultaneously. If you meant that momentum
alternately smooths and accelerates, then it's even more strange. For that
behavior, some sort of motor would be a more appropriate metaphor.

~~~
arjie
To the author: I found the sentences quoted above quite clear. Please do not
change them. They helped me rapidly comprehend what the article was going to
be about.

~~~
xapata
You don't find the idea that increased momentum causes increased speed a bit
strange?

~~~
xapata
I think I've figured out my confusion. I'm thinking of momentum as mass, not
the product of mass and velocity. To an extent, the authors seem to have the
same confusion.

------
jchrisa
Is this related to the way bias frequency works in analog audio recording?
[https://en.wikipedia.org/wiki/Tape_bias](https://en.wikipedia.org/wiki/Tape_bias)

~~~
olleromam91
That's exactly what came to my mind as well. Can't answer your question
though!

------
andai
The math here is beyond my reasoning, but I loved playing with the sliders!

------
panic
Turn the step size and momentum to maximum for some wonderfully glitchy chaos!
(and a demonstration of why forward Euler integration only works well with
relatively small step sizes)
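
A minimal sketch of that step-size threshold (a toy quadratic, not the article's demo):

```python
# Gradient descent on f(w) = 0.5 * lam * w**2 is forward Euler on w' = -lam*w;
# it diverges once the step size alpha exceeds 2 / lam.
lam = 10.0

def run(alpha, steps=50):
    w = 1.0
    for _ in range(steps):
        w = w - alpha * lam * w    # forward Euler update: w <- (1 - alpha*lam) * w
    return w

print(abs(run(0.15)))  # 0.15 < 2/lam = 0.2: contracts toward 0
print(abs(run(0.3)))   # 0.3 > 2/lam: the update factor has magnitude 2, blows up
```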

------
jacobush
Am I the only one who drew parallels to real life and how "just do stuff"
often works better than the deliberate, slow step by step process?

------
soVeryTired
In your polynomial regression example, I can't follow what you mean by p_i =
\xi \rightarrow \xi^{i-1} when you're setting up the model.

~~~
gabrielgoh
I just mean p_i(\xi) = \xi^{i-1}. The notation is a little cumbersome here,
but it pays off in the second equation

~~~
soVeryTired
More polynomial regression questions :) Are you using the optimal step size
you derived in the previous section? If so, why don't the first and last
eigendirections converge at once? If not, doesn't it suggest that there's a
trade-off between speed of convergence and the ability to stop early?

In general, this is a nicely written article, though. Good work!

~~~
gabrielgoh
Good question! The parameter has been set to a touch below the optimum. Your
observation is accurate: there is indeed such a tradeoff, though it is smaller
than you might think. The qualitative behavior of the system is very sensitive
to changes at that point, and the tiny bit of extra convergence you get by
setting the step-size exactly right is offset dramatically by the risk of
diverging.

------
jxy
It seems to have no improvement for badly conditioned problems, i.e. k >> 1.
The convergence rate is 1 with or without the damped momentum.

------
tnecniv
Figured I'd mention it since I saw the author and an editor in here: in
footnote 4, the LaTeX isn't rendering (Chrome, OS X).

------
c517402
This looks like gradient descent passed through a Kalman filter. Which, on
reflection, seems like a good idea for overcoming ripples.

------
cuca_de_chumbo
isn't this akin, in effect, to successive over-relaxation?
[https://en.wikipedia.org/wiki/Successive_over-relaxation](https://en.wikipedia.org/wiki/Successive_over-relaxation)

under-relax, converge real slowly

over-over-relax, oscillate

just-right-over-relax, get fast convergence

~~~
gabrielgoh
It's not quite the same. SOR is closer to coordinate descent in the way it acts.

------
the8472
Is the decreasing momentum related to the temperature in simulated annealing?

------
retox
Page killed my browser.

------
cee_el123
_So_ much beauty in the presentation.

