Why Momentum Works (distill.pub)
183 points by m_ke 3 hours ago | 30 comments





Hm. So that helps with high-frequency noise. Any progress on what to do when the dimensions are of vastly different scales? I have an old physics engine which had to solve about 20-value nonlinear differential equations. During a collision, the equations go stiff, and some dimensions may be 10 orders of magnitude steeper than others. Gradient descent then faces very steep knife edges. This is called a "stiff system" numerically.

I'm really loving the choice of articles, especially since you're just getting started.

Edit: I'm referring to the journal, not the author.

Likewise, I'm very impressed with the early article selection. The site itself is beautiful, and interactive figures like the one at the top of this link are incredibly helpful.

Overall -- huge fan!

Thanks! We're just lucky to have authors like Gabe (@gabrielgoh) come to us with incredible articles. :)

I'm curious, did the author write the whole article including figures, or did someone else give life to the figures?

I can see this type of interactive journal becoming very popular in other fields, but not if the author has to create the diagrams him/herself.

Author here - I've created all the diagrams, though I've received really helpful editorial input from Shan Carter and Chris Olah. If you feel like doing some archeology, you can see for yourself the really ugly drafts in the github history - it isn't pretty!

I think these visualizations are deceptively easy to create. Javascript is a powerful language with many libraries, and in my experience, it just took a few nudges at exactly the right spots from Shan to go from an idea in my head to a fully fledged diagram in distill. The tricky part has always been figuring out what to visualize, and if you're a researcher with a clever idea for a visualization, I recommend you reach out to the distill team.

As Gabe said, we mostly expect authors to produce diagrams and us to help edit them into outstanding articles. This is one example of the editing:

https://github.com/distillpub/post--momentum/issues/1

We've also had a few designers volunteer to work with researchers on visualization. So, in special cases, we may match-make researchers with designers to produce a great article.

Reminds me of this paper, "Causal Entropic Forces":

http://math.mit.edu/~freer/papers/PhysRevLett_110-168702.pdf

This is one of my favorite things ever. You should also look up "Jeremy England's Entropic Life" for a good companion thought.

It would be really, really great if you could somehow hook this up to Discourse so people could comment on and ask questions about the article. Allowing people to ask questions and having others answer like MathOverflow would I think bring a lot more clarity. Many different kinds of people want to understand material like this but may need the math unpacked in different ways.

I can't follow the math but the presentation is gorgeous (Safari on a MacBook retina display). Really great, keep up the good work!

Which part of the math did you find difficult to follow?

For those that don't read materials about optimization as much as they maybe should, what is "w⋆"? It is used without introduction and I don't know what it is. Perhaps this is a convention I am not aware of?

The commenters below are correct - but I will push a change for this right away! It is my fault for not introducing it.

Optimal w. It is a convention but still ought to be introduced.

In general, something star is the optimal value.

Curious, has this method been used for solving linear systems? How would it perform e.g. against conjugate gradient?

And how would it perform for non-positive-definite systems?

Author here - yes! It can be used to solve symmetric PSD systems, as the solution of Ax = b is also the minimizer of 0.5*x'Ax - b'x. Conjugate gradient can be seen as a special form of momentum with adaptive alphas and betas.
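
To make the point above concrete, here is a minimal sketch (not the article's code) of solving a small linear system by running momentum on 0.5*x'Ax - b'x. The function name `momentum_solve`, the 2x2 matrix, and the iteration count are made up for illustration. The step size and momentum values follow the eigenvalue-based formulas given in the article for a quadratic, and the sketch assumes A is symmetric positive definite so that the minimizer is unique.

```python
import numpy as np

def momentum_solve(A, b, alpha, beta, iters=500):
    """Minimize f(x) = 0.5*x'Ax - b'x with fixed-parameter momentum.

    Hypothetical helper for illustration; assumes A is symmetric
    positive definite.
    """
    x = np.zeros_like(b, dtype=float)
    z = np.zeros_like(b, dtype=float)
    for _ in range(iters):
        grad = A @ x - b       # gradient of f at x
        z = beta * z + grad    # accumulate past gradients
        x = x - alpha * z      # step along the accumulated direction
    return x

# Made-up symmetric positive definite example.
A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])

# Fixed parameters chosen from the extreme eigenvalues of A.
eigs = np.linalg.eigvalsh(A)   # returned in ascending order
lmin, lmax = eigs[0], eigs[-1]
alpha = (2.0 / (np.sqrt(lmin) + np.sqrt(lmax))) ** 2
beta = ((np.sqrt(lmax) - np.sqrt(lmin)) / (np.sqrt(lmax) + np.sqrt(lmin))) ** 2

x = momentum_solve(A, b, alpha, beta)
print(x, np.linalg.solve(A, b))
```

Picking alpha and beta this way needs the extreme eigenvalues, which you usually don't have for a large system; that is one practical reason to reach for conjugate gradient instead.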

> Conjugate gradient can be seen as a special form of momentum

Just to be clear, though, CG doesn't use the negative gradients as search directions, as steepest descent would.
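
A textbook conjugate gradient sketch may help show the difference. It assumes a symmetric positive definite A, and the function name `conjugate_gradient` and the 2x2 example are made up for illustration. The first search direction is the residual (the negative gradient), but every later direction is the new residual plus a multiple of the previous direction, which is where the "adaptive beta" reading comes from.

```python
import numpy as np

def conjugate_gradient(A, b, tol=1e-10):
    """Textbook CG for symmetric positive definite A (illustrative sketch)."""
    x = np.zeros_like(b, dtype=float)
    r = b - A @ x              # residual = negative gradient of 0.5*x'Ax - b'x
    p = r.copy()               # first direction coincides with steepest descent
    directions = []
    for _ in range(len(b)):
        if np.linalg.norm(r) < tol:
            break
        directions.append(p.copy())
        rr = r @ r
        Ap = A @ p
        alpha = rr / (p @ Ap)  # exact line search along p
        x = x + alpha * p
        r = r - alpha * Ap
        beta = (r @ r) / rr    # adaptive "momentum" coefficient
        p = r + beta * p       # residual mixed with the previous direction
    return x, directions

# Made-up symmetric positive definite example.
A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
x, directions = conjugate_gradient(A, b)
print(x)
print(directions[0] @ A @ directions[1])  # conjugacy check, should be close to 0
```

The search directions come out A-conjugate to each other rather than being successive negative gradients, and in exact arithmetic the loop finishes in at most n steps for an n-by-n system.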

Related? Flat minima

http://www.bioinf.jku.at/publications/older/3304.pdf

This is indeed related. See the section on polynomial regression!

But simply not on Firefox

Thanks for pointing that out! We fixed the diagram bug in firefox. There's still bad performance for ~30s after page load -- we're looking into why that's happening -- but after that the page seems to work well.

Cool that you fixed it already! I should have started with some praise: great selection of articles

Wow...pretty amazing how easily that site killed my desktop.

The site killed my Firefox on Ubuntu.

it spun my firefox hard

It seems to have no improvement for badly conditioned problems, i.e. k >> 1. The convergence rate is 1 with or without the damped momentum.
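
For a rough sense of scale, here is a small sketch that evaluates the optimal convex-quadratic rates quoted in the article, (k - 1)/(k + 1) for gradient descent and (sqrt(k) - 1)/(sqrt(k) + 1) for momentum, and turns them into approximate iteration counts. The helper names and the 1e-6 tolerance are made up for illustration, and the estimate assumes the error shrinks by exactly the rate on every step.

```python
import math

def gd_rate(kappa):
    """Optimal gradient descent rate on a quadratic with condition number kappa."""
    return (kappa - 1.0) / (kappa + 1.0)

def momentum_rate(kappa):
    """Optimal momentum rate on the same quadratic."""
    s = math.sqrt(kappa)
    return (s - 1.0) / (s + 1.0)

def iters_to_tol(rate, tol=1e-6):
    """Iterations needed for rate**n <= tol (hypothetical helper)."""
    return math.ceil(math.log(tol) / math.log(rate))

for kappa in (1e2, 1e4, 1e6):
    print(kappa,
          iters_to_tol(gd_rate(kappa)),
          iters_to_tol(momentum_rate(kappa)))
```

Both rates do approach 1 as k grows, but under these assumptions the iteration count scales roughly with k for gradient descent and with sqrt(k) for momentum.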

> The added inertia acts both as a smoother and an accelerator

    momentum = mass * velocity

    force = mass * acceleration

    momentum = (force / acceleration) * velocity
So, it looks to me like momentum is inversely related to acceleration. It doesn't seem right to call momentum an "accelerator".

Hi! This article is about momentum in the mathematical field of optimization. Acceleration also refers to a phenomenon in optimization. While there are deep connections to their physical analogues, they aren't quite the same thing.

If you want to make your analogy work, the momentum algorithm adds mass to an object. In terms of literal acceleration at any point in time it is neutral, but the added mass causes you to get through difficult areas much faster.

That said, much of this article is moving away from that physical analogy. :)

