
Introduction to Learning Rates in Machine Learning - thesupercoder_
https://heartbeat.fritz.ai/introduction-to-learning-rates-in-machine-learning-6ed685c16506
======
mochomocha
This is an extremely poor article. I appreciate the efforts towards
democratizing machine learning... But without sounding like a gatekeeper,
there's a limit to this. Reading introductory material on convex
optimization will give you a better understanding of what a learning rate is
than all these hyped "AI for noobs without math" blog posts.

~~~
rspen
I'm totally in agreement with you here. The issue is that many of the people
writing these posts are in the process of learning themselves and don't know
what they don't know (the Dunning-Kruger effect). Because they skip over a lot
of the theoretical details, it is very easy to make missteps. Having a decent
theoretical basis is absolutely necessary to be anything beyond a hobbyist in
ML.

------
kxyvr
Is there any particular reason why we have conversations on learning rates
rather than simply using a line search that finds a point satisfying the
strong Wolfe conditions?

[https://en.wikipedia.org/wiki/Wolfe_conditions](https://en.wikipedia.org/wiki/Wolfe_conditions)

Unless I'm missing something, optimization theory and practice tells us
concretely how to guarantee convergence.
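For readers following along, these are the two conditions in question, stated
in the standard notation (iterate x_k, descent direction p_k, step length
alpha, constants 0 < c_1 < c_2 < 1); this is just a restatement, nothing new:

    f(x_k + \alpha p_k) \le f(x_k) + c_1 \alpha \nabla f(x_k)^\top p_k          % sufficient decrease
    |\nabla f(x_k + \alpha p_k)^\top p_k| \le c_2 \, |\nabla f(x_k)^\top p_k|   % curvature (strong form)

A line search that returns a step length satisfying both is what is being
proposed here in place of a hand-tuned learning rate.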

~~~
mochomocha
Line search is expensive. It costs you one pass over the data just for the
function evaluation. Also, there are only theoretical guarantees in convex
cases (and even in those cases, details get murky if you look at L-BFGS and
its treatment in Nocedal's optimization book, for example). Most modern neural
networks are optimized over non-convex loss functions and through SGD
optimizers, which prove to be extremely effective because they make quick
progress for super cheap (mandatory plug for this classic NIPS paper:
[https://papers.nips.cc/paper/3323-the-tradeoffs-of-large-scale-learning.pdf](https://papers.nips.cc/paper/3323-the-tradeoffs-of-large-scale-learning.pdf)).
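
To make the cost asymmetry concrete, here is a rough toy sketch (the dataset
sizes, batch size, and learning rate are made up for illustration): one
function evaluation for a line search is a full pass over the data, while an
SGD update only touches a small minibatch.

    import numpy as np

    rng = np.random.default_rng(0)
    N, d = 100_000, 50
    X, y = rng.normal(size=(N, d)), rng.normal(size=N)
    w = np.zeros(d)

    def full_loss(w):
        # One function evaluation for a line search = one full pass over all N examples.
        return 0.5 * np.mean((X @ w - y) ** 2)

    def sgd_step(w, lr=0.01, batch=32):
        # One SGD update = gradient on a small random minibatch, O(batch * d) work.
        idx = rng.integers(0, N, size=batch)
        Xb, yb = X[idx], y[idx]
        grad = Xb.T @ (Xb @ w - yb) / batch
        return w - lr * grad

    # A line search probing, say, 5 trial step lengths costs ~5 full passes over
    # the data; for the same budget SGD takes on the order of 5 * N / batch
    # cheap, noisy steps.
    for _ in range(1000):
        w = sgd_step(w)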

~~~
kxyvr
I suppose I don't follow. The strong Wolfe conditions do not require convexity
in order to guarantee convergence. If I recall correctly, the requirement for
convergence is that the gradient be Lipschitz continuous, not convexity, which
should be Theorem 3.2 in Nocedal and Wright's book that you referenced.

As far as expense goes, yes, there is a cost to a line search. However,
machine learning algorithms typically use automatic differentiation
(back-propagation) in order to calculate the derivative. This is far more
expensive than a function evaluation. As such, in optimization, we generally
try to use cheaper computations such as a function evaluation to better
capitalize on expensive calculations like the gradient. Essentially, it seems
like using a learning rate rather than doing a line search means we're wasting
a huge amount of information and doing something that's not guaranteed to
converge even if the underlying functions were convex, which they're not.
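
As a concrete illustration of what I'm suggesting (a toy full-batch problem;
the data, loop length, and constants are made up, and this is only a sketch,
not a recipe for training a neural network): SciPy's line_search helper
enforces the strong Wolfe conditions, so each expensive gradient is paired
with a few cheap function evaluations used to pick the step length.

    import numpy as np
    from scipy.optimize import line_search

    rng = np.random.default_rng(0)
    X, y = rng.normal(size=(500, 10)), rng.normal(size=500)

    def f(w):
        # Full-batch loss: one pass over the data.
        return 0.5 * np.mean((X @ w - y) ** 2)

    def grad(w):
        # Full-batch gradient: the expensive quantity we want to capitalize on.
        return X.T @ (X @ w - y) / len(y)

    w = np.zeros(10)
    for _ in range(50):
        g = grad(w)
        p = -g                                           # steepest-descent direction
        alpha, *_ = line_search(f, grad, w, p, gfk=g, c1=1e-4, c2=0.9)
        if alpha is None:                                # no Wolfe point found
            break
        w = w + alpha * p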

~~~
thorel
There are a few inaccuracies here: using automatic differentiation basically
makes computing the gradient of the objective function as efficient as
computing the function itself. But the main goal of algorithms being used in
machine learning nowadays (stochastic gradient descent and variants thereof)
is to avoid having to compute the objective function or its gradient
altogether: instead, the gradient is computed at a single data point (example)
which provides an approximation of the true gradient.

The important thing is that what is considered expensive is not to compute the
objective function or its gradient, but to compute it _over the entire
dataset_. Line-search would require evaluating the function (and its gradient)
over the entire dataset several times, which completely defeats the purpose of
stochastic gradient descent.

One could imagine using the approximate function and its gradient (coming from
the evaluation at a single data point) as the basis for a line search, but
intuitively it does not make much sense to fine-tune the step size to a single
example, and this would in any case destroy the guarantees provided by the
strong Wolfe conditions.

Finally, there are convergence guarantees for stochastic gradient descent with
fixed learning rate when the objective functions are convex.
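
For concreteness, a minimal sketch of the scheme described above (toy
least-squares data; the learning rate and sizes are purely illustrative): each
update uses the gradient at a single example, and the full objective is never
evaluated during training.

    import numpy as np

    rng = np.random.default_rng(0)
    N, d = 10_000, 20
    X, y = rng.normal(size=(N, d)), rng.normal(size=N)

    w = np.zeros(d)
    lr = 0.01                                  # fixed learning rate
    for step in range(5 * N):
        i = rng.integers(N)                    # draw a single example
        xi, yi = X[i], y[i]
        g = (xi @ w - yi) * xi                 # unbiased estimate of the full gradient
        w -= lr * g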

~~~
kxyvr
I don't believe this to be entirely accurate:

1. Reverse-mode automatic differentiation is not as efficient as the function
evaluation. Even discounting certain costs, and depending on how you count,
the theoretical cost is 4-5 times a function evaluation. Practically speaking,
operator-overloading approaches run somewhere between 20-40 times the cost of
a function evaluation, whereas source-code-transformation tools run at 10-20
times. This is fantastic, but the function evaluation is cheaper.

2. I also don't believe that stochastic gradient descent requires the entire
function and gradient to be re-evaluated in the manner that you describe. One
way to view stochastic gradient descent in the context of least-squares
fitting is through the use of Johnson-Lindenstrauss, which means that the data
set can be randomly projected once per iteration. This means that the gradient
and line-search parameters can be consistently evaluated at the per-iteration
level. Practically speaking, this means we randomly add our data together and
then proceed as normal, changing the randomization each iteration (a rough
sketch of this follows below the list). As such, there should not be an
increase in cost from doing a line search over the already discounted cost.

3. As far as whether the Wolfe conditions are destroyed: kind of, sort of. In
order to guarantee convergence, the amount of reduction that we use must also
be reduced. Meaning, we can't project down the data quite as much if we really
want to achieve convergence. However, practically speaking, I believe it
matters a lot.
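
A rough sketch of point 2 (the sketch dimension k, the sizes, and the +/-1
sketch matrix are only illustrative, and this is a simplification rather than
a full treatment): each iteration compresses the least-squares data with a
fresh random projection, and both the gradient and the Wolfe line search are
evaluated consistently on that sketched objective.

    import numpy as np
    from scipy.optimize import line_search

    rng = np.random.default_rng(0)
    N, d, k = 10_000, 50, 100
    X, y = rng.normal(size=(N, d)), rng.normal(size=N)

    w = np.zeros(d)
    for _ in range(30):
        # Fresh Johnson-Lindenstrauss-style sketch: randomly "add the data together".
        S = rng.choice([-1.0, 1.0], size=(k, N)) / np.sqrt(k)
        Xs, ys = S @ X, S @ y                            # sketched data for this iteration

        f = lambda v: 0.5 * np.sum((Xs @ v - ys) ** 2)   # sketched objective
        g = lambda v: Xs.T @ (Xs @ v - ys)               # its gradient

        gk = g(w)
        alpha, *_ = line_search(f, g, w, -gk, gfk=gk)    # strong Wolfe step on the sketch
        if alpha is None:
            alpha = 1e-3                                 # fall back to a small fixed step
        w = w - alpha * gk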

