Reasonable Effectiveness of the Multiplicative Weights Update Algorithm (2017) (jeremykun.com)
130 points by signa11 on July 5, 2021 | 16 comments



+1 for Multiplicative Weights Update as a technique, and kudos for the nice write-up.

Having studied this myself a couple of years ago, I believe an intuitive way to understand its power is that MWU approximately optimizes an l_\infty norm by replacing it with a smooth sum of exponentials (the potential function from the proof). The analogy isn't perfect, but it helped me reason about it.

There is also a tentative link to tropical geometry because of this. When a problem is by its nature "tropical" (built from addition and taking maximums), the MWU method may be useful by translating those operations into the non-tropical world of multiplication and addition, respectively.
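
To make the approximation concrete (my own sketch, not from the article): for x \in R^n and any \eta > 0,

  max_i x_i  <=  (1/\eta) \log \sum_i e^{\eta x_i}  <=  max_i x_i + (\log n)/\eta

so the smooth log-sum-exp overshoots the max by at most (\log n)/\eta; that gap is essentially where the \log n in MWU's regret bound comes from. Exponentiation is also the dictionary for the tropical correspondence: it sends maximums to (approximate) addition and addition to multiplication.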


Multiplicative weights, like gradient descent, is a special case of a more general algorithm called mirror descent. You can indeed view it as optimizing an \ell_\infty norm, though I think the mirror descent perspective is more insightful. I wrote about this in my thesis; see section 2.4 here: http://timonknigge.com/univ/msc-thesis.pdf


Intuitive way?


Yes. Unlike l_\infty, sums of exponentials are smooth everywhere, and smooth functions are easier to optimize. But geometrically, the two can be brought arbitrarily close together by adjusting the exponents with a multiplicative constant.
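
A quick numeric demonstration of that convergence (my own sketch in Python; smooth_max is just the scaled log-sum-exp):

  import math

  def smooth_max(xs, eta):
      # Scaled log-sum-exp: a smooth over-approximation of max(xs)
      # that tightens as eta grows.
      return math.log(sum(math.exp(eta * x) for x in xs)) / eta

  xs = [1.0, 2.0, 3.0]
  for eta in (1.0, 10.0, 100.0):
      print(eta, smooth_max(xs, eta))
  # 1.0   -> ~3.408
  # 10.0  -> ~3.0000045
  # 100.0 -> ~3.0 (to machine precision)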


What is "l_\infty"?


That's TeX for "l" with subscript infinity. It's a vector norm, equal to the largest absolute value among the vector's entries. It's also called the max-norm and the sup-norm (the latter for the infinite-dimensional generalization, e.g., where the vector becomes a real-valued function of a single real variable).


A norm (meaning a definition of "length") where the length of a vector is defined as the largest absolute value of any of its components. It's one of the standard l_p norms; you get a norm for every p in the closed interval [1, infinity]. The standard Euclidean norm is the case p = 2.

https://en.wikipedia.org/wiki/Lp_space#The_p-norm_in_finite_...


It's LaTeX syntax; it doesn't render on HN, but it looks like this: http://www.sciweavers.org/tex2img.php?eq=l_%5Cinfty&bc=White...


A somewhat particular case of this, which is missing from the examples, is the IRLS [0] algorithm for minimizing sums of distances. I say "somewhat" because the random part is missing: you just update all the weights at each iteration.

The idea is that optimizing a weighted sum of squared distances is trivial: the optimum is just the weighted average. Thus you iteratively set the weights to the inverse distances, so that you end up minimizing the actual sum of distances (sketched below).

[0] Iteratively Re-weighted Least Squares
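
A minimal sketch of that iteration (my own, assuming NumPy; this is essentially Weiszfeld's algorithm for the geometric median, and the eps guard against an iterate landing exactly on a data point is my addition):

  import numpy as np

  def geometric_median(points, iters=100, eps=1e-12):
      # IRLS for min_x sum_i ||x - p_i||. Each step solves the
      # weighted *squared* problem exactly: the minimizer of
      # sum_i w_i ||x - p_i||^2 is the weighted average.
      x = points.mean(axis=0)           # start from the plain average
      for _ in range(iters):
          d = np.linalg.norm(points - x, axis=1)
          w = 1.0 / np.maximum(d, eps)  # inverse-distance weights
          x = (w[:, None] * points).sum(axis=0) / w.sum()
      return x

Setting the weights to the inverse distances makes the squared surrogate agree with the true sum-of-distances objective at the current iterate, which is why repeatedly taking the weighted average drives the actual objective down.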


If you combine

>In the more general event that you have rewards for all objects (if not, the reward-producing function can output zero), you would perform this weight update on all objects

with the following justification of its properties, then it kind of seems like the guarantee that it gets close to the maximum reward only holds if:

1. You know the rewards for all options and update the weights accordingly

or

2. You consider the rewards for all options not taken to be 0 (in which case you trivially got the maximum possible reward).

This kind of seems to limit its applicability somewhat, unless I'm missing something?
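
To make case 1 concrete, here is a minimal full-information sketch (my own, in Python, assuming rewards in [0, 1] and the (1 + \eta r) form of the update):

  import random

  def mwu_full_info(reward_rounds, n, eta=0.1):
      # Case 1 above: full information. Every object's reward is
      # observed each round, so every weight gets updated, even
      # though only one object is actually played.
      w = [1.0] * n
      collected = 0.0
      for rewards in reward_rounds:      # one reward per object
          i = random.choices(range(n), weights=w)[0]  # sample by weight
          collected += rewards[i]        # the reward we actually receive
          w = [wi * (1.0 + eta * r) for wi, r in zip(w, rewards)]
      return w, collected

In the partial-information setting, where only the played option's reward is observed, you would need to somehow estimate the unseen rewards, which is a different algorithm; that's the limitation I mean.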


Nice write-up. I don't understand the claim that "it's not gradient descent due to reward adaptiveness", though; most usages of GD are non-vanilla and adaptive.


Yes, the article is beautiful and clearly written, but it seems that this insight is wrong. The algorithm can indeed be regarded as a stochastic gradient descent with approximate descent directions.


Not exactly -- multiplicative weights is a special case of mirror descent (as is gradient descent). It arises from doing prox-steps on the simplex using the negative entropy as a regularizer (rather than the usual Euclidean distance, as is the case with gradient descent). In particular, if you do regular gradient descent then the runtime will be exponentially worse (the \log n in the guarantee will become a \sqrt{n}).

Here's an interesting lecture on the topic: https://nisheethvishnoi.files.wordpress.com/2018/05/lecture4...
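
For reference, the standard derivation behind this (my summary, not from the lecture): the mirror descent prox-step on the probability simplex with negative entropy as the mirror map is

  x_{t+1} = argmin_{x \in \Delta_n} { \eta <g_t, x> + KL(x || x_t) }

and working out the optimality conditions gives the closed form

  x_{t+1, i} \propto x_{t, i} e^{-\eta g_{t, i}}

which is exactly the multiplicative weights update (with losses g_t in place of rewards). Swapping the KL term for squared Euclidean distance recovers ordinary projected gradient descent.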


I thought that's how neural networks and backpropagation work: randomize, increase a weight, check the output, and readjust the weight based on that.


Jeremy Kun's blog is a treasure.


One point against this algorithm is the generic downside of all machine learning techniques - if you apply your algorithm and don't get a satisfactory result ...

- The algorithm is not amenable to this problem

- You have a buggy implementation

Which is true? This is one of the issues that genetic algorithms, a similarly powerful and chronically underutilised tool, also run into.

Being nondeterministic isn't the end of the world, but the payoff has to be much higher than with simple deterministic methods.




