Beyond automatic differentiation (googleblog.com)
196 points by shantanu_sharma on April 14, 2023 | 34 comments



> We can use a similar idea to take an existing optimizer such as Adam and convert it to a hyperparameter-free optimizer that is guaranteed to monotonically reduce the loss (in the full-batch setting).

I'm not a huge fan of analyses like this that make one huge assumption that never holds in practice. Nobody uses full-batch gradient descent. It's not possible in practice or even desirable in theory. The stochasticity of SGD is important for performance! There is no reason to expect results with unrealistic assumptions like these to generalize to the cases people actually care about.

> at the cost of a single additional forward pass that increases the wall time for each step by a small factor (about 2x in the example above).

So it doubles the training time and FLOPs. Then the graph's x-axis should be wall time or FLOPs when comparing the methods, instead of steps.


In the paper (https://arxiv.org/pdf/2212.11429.pdf), section 5.5 (page 57), "Future work":

> Promising areas of future work include:
>
> • Mini-batch optimization. Both the theory and experiments in this chapter have been limited to full-batch optimization. However, the MM paradigm can be extended to mini-batch optimization (e.g., [54]). A universal mini-batch MM optimization algorithm may outperform Adam and AdaGrad on certain large-scale machine learning problems.

The current limitation of this algorithm (point 2, page 56) is that reverse-mode bounding is not implemented yet (and related work didn't obtain tight bounds), so you can't yet apply it directly to functions with a huge number of input parameters; for the time being the main usage is limited to optimizing a few hyperparameters.


> So it doubles the training time and FLOPs. Then the graph's x-axis should be wall time or FLOPs when comparing the methods, instead of steps.

Thanks for pointing this out.

Inflating your results by tricky measurement methods has been around forever, but it can be hard to spot if you're not close enough to the field.

Also, unfortunately, the blog post does not contain anything about the key idea used. (Well, I couldn't find it.)

Trust regions have been around forever; that's not the new idea.

It would be nice if somewhere there was a piece about "this is what we do better". Decoding a scientific article to figure out how much of it is fluff seems like an uncertain return on effort.


So what? This isn't some kind of gotcha. They're being explicit about the limitation. They built an interesting thing and showed that it does interesting things in a toy setting. This gets me excited about future work down this path.


I disagree that the blog post was explicit enough. Tacking parenthesized disclaimers onto the end of your sentences is not really very explicit. What they're comparing against isn't even Adam; calling it Adam at all is misleading. The subtitle of the Adam paper is "A Method for Stochastic Optimization". It's not designed or intended for full-batch, non-stochastic training.

What they should do is lead that paragraph with the fact that this is full-batch, note that this is not normal, and explain why they chose to do it anyway. And they shouldn't call it plain "Adam".

The paper is probably good, and it's fine to test things in toy settings, but you need to be upfront with your assumptions and limitations.


With the hype, many people have a difficult time interpreting research, thinking it is all results-oriented. For readers, you need to remember that there's still a lot of theory research that's important. Research is far more than getting benchmark results or flashy images. Benchmarks are only proxies for what we're actually trying to learn and as such need to be evaluated with care (I wrote about this recently). This is especially true for metrics that use parametric models to evaluate, such as FID. It's problematic if we can't do research in a research field. Research isn't production, though the two can overlap.


I also liked the graph that showed only training loss vs Adam. Would have been nice to see validation loss as well; are we just helping Adam overfit the data?


I'm interested in what fdej thinks about this one. arb has recently been merged into flint under his maintainership and is getting a generic rings wrapper and more interfaces with other languages. But maybe its focus is more on algebraic things like modular forms and less on tensor libraries for machine learning and GPGPU.

https://news.ycombinator.com/user?id=fdej

https://news.ycombinator.com/item?id=26054520

https://arblib.org/


This looks quite similar to the "Lagrange models" defined by Joris van der Hoeven in https://hal.science/hal-01188378/document, a version of Taylor models where the constant error term is replaced by an interval multiple of x^n.

Representing functions locally is a very interesting problem: there are lots of possible approaches using intervals, Taylor series, Chebyshev series, etc. Big open design space if you ask me.
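For anyone who wants to poke at the idea, here's a toy Python sketch (my own illustration, not van der Hoeven's notation or any real library): a Lagrange-model-like object is a polynomial plus an interval coefficient on x^n that bounds the truncation error on a small domain. All names and fields below are made up.

  from dataclasses import dataclass

  @dataclass
  class LagrangeModel:
      # p(x) = coeffs[0] + coeffs[1]*x + ... , valid on |x| <= 1,
      # with the remainder contained in [err[0], err[1]] * x**n.
      coeffs: list
      err: tuple
      n: int

      def __add__(self, other):
          assert self.n == other.n, "keep the same remainder degree for simplicity"
          m = max(len(self.coeffs), len(other.coeffs))
          a = self.coeffs + [0.0] * (m - len(self.coeffs))
          b = other.coeffs + [0.0] * (m - len(other.coeffs))
          return LagrangeModel([x + y for x, y in zip(a, b)],
                               (self.err[0] + other.err[0],
                                self.err[1] + other.err[1]), self.n)

  # cos(x) on |x| <= 1: 1 - x^2/2 with the remainder contained in [0, 1/24] * x^4
  cos_model = LagrangeModel([1.0, 0.0, -0.5], (0.0, 1 / 24), 4)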


See my comment above about your linear algebra implementation.


I've just looked at the documentation for eigenvalues and eigenvectors in arblib. I was thinking that:

- Eigenvectors are not computable in regions surrounding eigenvalue clashes. Therefore you should never compute them.

- Let M be a symmetric matrix being eigen-decomposed. The decomposition should be A = P D P^(-1) where only D is interval-valued, and P and P^(-1) are NOT UNDER ANY CIRCUMSTANCE INTERVAL-VALUED, and A should include M.
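To make the shape of that decomposition concrete in plain floating point (no interval library here, just numpy's eigh, not arb's API; the residual hints at the error an interval-valued D would have to account for):

  import numpy as np

  rng = np.random.default_rng(0)
  A = rng.standard_normal((4, 4))
  M = (A + A.T) / 2                     # symmetric test matrix

  w, P = np.linalg.eigh(M)              # M ~= P @ diag(w) @ P.T, P orthogonal
  residual = M - P @ np.diag(w) @ P.T   # floating-point reconstruction error
  print(np.linalg.norm(residual))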


I see what you mean, but say you have such a matrix and there are two eigenvalues that you can't separate from each other. Then wouldn't their corresponding eigenvectors define something like a 2D plane? Couldn't there still be some interval for that plane? Say the plane is defined by linear combinations of two vectors and both vectors have zero on some coordinate, say the x coordinate; the interval could still be helpful by saying that any combination will also be zero there, right?


That makes sense, but what can be done with it?


sorry I'm not sure what you mean


Interesting, but I have questions. The OP shows how SafeRate improves plain-vanilla Adam, but almost no one uses that anymore. The most commonly used optimizer nowadays is probably AdamW. Does SafeRate improve AdamW to the same extent, or at all? How does using SafeRate compare with Bayesian hyperparameter search? How does it compare with rule-of-thumb hyperparameter selection methods? Many people seem to get a lot of mileage out of AdamW and sensible defaults (e.g., a "one cycle" schedule with a max lr of 3e-4, momentum peaking at 0.95, and some weight decay).
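For reference, the kind of baseline I mean, as a rough PyTorch sketch (the tiny model and data are placeholders; the numbers are just the rules of thumb above, nothing from the paper):

  import torch
  from torch import nn

  # Tiny stand-in model and data so the sketch runs end to end.
  model = nn.Linear(10, 2)
  data = torch.randn(256, 10)
  labels = torch.randint(0, 2, (256,))
  loader = torch.utils.data.DataLoader(
      torch.utils.data.TensorDataset(data, labels), batch_size=32)
  epochs = 3

  optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=1e-2)
  scheduler = torch.optim.lr_scheduler.OneCycleLR(
      optimizer, max_lr=3e-4, epochs=epochs, steps_per_epoch=len(loader),
      max_momentum=0.95)  # for Adam-family optimizers this cycles beta1

  for _ in range(epochs):
      for x, y in loader:
          loss = nn.functional.cross_entropy(model(x), y)
          optimizer.zero_grad()
          loss.backward()
          optimizer.step()
          scheduler.step()    # one-cycle schedules step once per batch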


It improves Adam in the full-batch setting, which is not usually done or desired, since minibatch training should regularize and improve the loss.


Batch size can be used for regularisation, but using it for that will limit training performance. From the Google Research Tuning Playbook:

> The batch size governs the training speed and shouldn't be used to directly tune the validation set performance. Often, the ideal batch size will be the largest batch size supported by the available hardware.

> […]

> As long as all hyperparameters are well-tuned (especially the learning rate and regularization hyperparameters) and the number of training steps is sufficient, the same final performance should be attainable using any batch size (see Shallue et al. 2018).

https://github.com/google-research/tuning_playbook#choosing-...

The ideal case is full-batch with tuneable regularisation; it's just that the hardware gets expensive.


The source you cite is talking about "large" batch sizes which are still a minuscule fraction of the training set. Their (good) advice has no relevance at all in a discussion of full-batch training.

Full-batch training is not the ideal case. It's a recipe for overfitting. We don't optimize models for their training set performance. We need them to generalize to the true data, of which our training set may not even be a representative sample. SGD works exceptionally well for that.


Take a look at the linked paper: they train using batches containing about 5% of ImageNet (batch size = 2^16), which is pretty large from my perspective. Overall they show diminishing returns for larger batch sizes after a certain point. That fits the intuition that SGD is an approximation to GD, which gets better as the batch size increases, until the approximation error is negligible.


Ah yes, that too. I forgot to point it out. Thanks!


Lots of people still use Adam and SGD!


Haven't seen anyone use it for large-scale (or largish-scale) learning in a long while...


My main question is whether this is still helpful in an SGD context?

I.e., if you have $f = \sum_{i=1}^n f_i$ and with every batch you use $f_S = \sum_{j\in S} f_j$ for that batch ... being able to take a step which definitely causes $f_S$ to decrease doesn't necessarily cause $f$ to decrease, and the safe window for $f_S$ may be different from the window for $f$?

If it only really makes sense for cases where you can evaluate the whole function of interest at every step, which tend to be lower-dimensional anyway, then I think an interesting comparison would be to quasi-Newton methods like L-BFGS. I.e., if you're involving the Taylor-series machinery, when should you use it to pick the direction of the step (normal QN) vs. when should you use it to pick only the step size (this work)?
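A one-dimensional toy of exactly that worry (nothing to do with the paper's method, just plain gradient steps):

  # A step that provably decreases the loss of one mini-batch (f_S)
  # can still increase the full objective f.
  f1 = lambda w: (w - 1.0) ** 2        # the mini-batch's loss
  f2 = lambda w: (w + 1.0) ** 2        # the rest of the data
  f  = lambda w: f1(w) + f2(w)         # full-batch objective, minimised at w = 0

  w, lr = 0.2, 0.1
  w_new = w - lr * 2 * (w - 1.0)       # gradient step on the batch loss only

  print(f1(w), "->", f1(w_new))        # batch loss goes down: 0.64 -> ~0.41
  print(f(w),  "->", f(w_new))         # full loss goes up:    2.08 -> ~2.26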


Newton's method has the nasty Wada property arise, which is often glossed over in class as being the same as scale-invariant noise, but which is better thought of as a topological feature rather than scale-invariant noise.

It comes from indecomposable continua, not from the imperfect knowledge of initial conditions that chaotic systems often invoke.

Basin boundaries becoming indeterminate is a problem with three or more attractors or exit basins.


Could you provide a link to some more reading about this? I do not know what a Wada property is, nor why it is topological.


The Wada property is when you have 3+ disjoint sets that share the same boundary set, at least for the complete Wada property.

In Newton's fractal, any circle you draw will either contain starting points for a single root or for all of the roots, no matter how small you draw that circle. If your initial conditions are near that boundary, the exit basin you take is indeterminate, because that pointwise boundary point is a boundary of multiple exit basins and isn't simply connected.
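If you want to see it numerically, here's a quick sketch of the standard Newton-fractal computation for z^3 - 1 (my own toy, not from any paper): label each starting point in a window by the root it converges to, and shrink the window onto a boundary region.

  import numpy as np

  roots = np.array([1, np.exp(2j * np.pi / 3), np.exp(-2j * np.pi / 3)])

  def basin(z, iters=50):
      # Newton iteration for z^3 - 1, applied elementwise to an array of starts.
      for _ in range(iters):
          z = z - (z**3 - 1) / (3 * z**2)
      return np.argmin(np.abs(z[..., None] - roots), axis=-1)

  # Window centred near a basin boundary; shrink `width` to zoom in.
  # Typically all three labels keep appearing at every zoom level.
  centre, width, n = 0.25 + 0.45j, 0.2, 400
  xs = np.linspace(centre.real - width, centre.real + width, n)
  ys = np.linspace(centre.imag - width, centre.imag + width, n)
  grid = xs[None, :] + 1j * ys[:, None]
  print("points per basin:", np.bincount(basin(grid).ravel(), minlength=3))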

This differs from other scale-invariant features like self-similar fractal scattering, which will have similar 'noise' at different scales but is somewhat deterministic if treated like noise.

The Wada property arises in several places, like Hamiltonian systems, delayed partial differential equations, and predator-prey models with food and refuge.

If you look for papers on light exiting binary black hole systems, there are some good ones.

This page may help, but the Wada property is counterintuitive.

https://users.math.yale.edu/public_html/People/frame/Fractal...


That's great. Interval arithmetic is an underappreciated area of study, not because it's particularly difficult (anyone can grasp it in short order), but because it's rarely the first, second, or third tool reached for when confronted with a problem. Perhaps it's just not prestigious enough compared to more sophisticated methods.


How does this generally work in high dimensions? In 1d, an interval is a really simple thing. In higher dimensions, it generalizes in a bunch of ways, from a box to an arbitrary convex region. Given that practical problems tend to have the interesting bits constrained around a lower dimensional region, I imagine that the wrong representations (e.g. just an axis-aligned box) could make the intervals meaningless. Is that the case? Is it simple to overcome that (e.g. just align the box along the interesting subspace), or in practice is it harder to have useful high dimensional "intervals?"
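Not a full answer, but one classic failure mode with axis-aligned boxes is the wrapping effect: if the set you care about lives along a rotated/lower-dimensional direction and you re-enclose it in an aligned box at every step, the bounds blow up even though the set itself never grows. A tiny 2-D sketch (a made-up example, not tied to the article):

  import numpy as np

  def rotate(corners, theta):
      c, s = np.cos(theta), np.sin(theta)
      return corners @ np.array([[c, -s], [s, c]]).T

  def enclose(points):
      # corners of the axis-aligned bounding box of a point set
      lo, hi = points.min(axis=0), points.max(axis=0)
      return np.array([[lo[0], lo[1]], [lo[0], hi[1]],
                       [hi[0], lo[1]], [hi[0], hi[1]]])

  box = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]], dtype=float)
  theta = 2 * np.pi / 36                 # one full turn in 36 small rotations
  for _ in range(36):
      box = enclose(rotate(box, theta))  # rotate, then re-wrap in an aligned box

  print(box.max(axis=0))                 # far larger than the original [1, 1]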


Haven't looked in any depth: How does this cope with functions like f(x) = x - x? Interval arithmetic sometimes produces very conservative bounds.
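For anyone unfamiliar, the issue with x - x is the classic dependency problem; a minimal sketch of naive interval subtraction shows it (toy class, not any real interval library):

  from dataclasses import dataclass

  @dataclass
  class Interval:
      lo: float
      hi: float

      def __sub__(self, other):
          # treats the two operands as independent, even when they are the same x
          return Interval(self.lo - other.hi, self.hi - other.lo)

  x = Interval(0.0, 1.0)
  print(x - x)          # Interval(lo=-1.0, hi=1.0), not the exact [0, 0]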

If done well, rigorous interval bounds seem like they could be really useful.


Can you define a hairy version of a function that misleads at every point? I.e.:

Take a well known function: sin(x). Call it G

Now make a function H that makes hairy versions of functions. H takes a function f, a slope m, and a probability p.

H returns a version of the function where every value is close to f(x), and p percent of the points have a derivative of m. The remaining points have the job of undoing the derivative of m and jumping back to f(x).

If you are dealing with real numbers then things can get very hairy without ever deviating from f(x). You could turn p up to 99.999999% and m to a crazy slope.

(I am not a mathematician, I don’t know if these musings are misguided)

My point is that if you had a hairy function, you wouldn't be able to use gradient descent on it, because a derivative takes the slope at a point, and these slopes are misleading. However, you could use a numerical derivative that takes the slope between a point and another point separated by a non-zero distance D. The hairiness would not be detected.


I think for the application here (analogous to autograd on neural networks) you know each individual building block of the function (a well-defined neural network layer), so you can be sure such things will not happen.


Taking g(x) = f(x) + epsilon*cos(bx) adds to the derivatives a function that bounces between -b*epsilon and b*epsilon while g remains within an epsilon envelope of f. Take b large enough to fit your needs, e.g. 100/epsilon times an upper bound on |f'|.
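A quick numerical check of this construction (arbitrary eps and b, with f = sin):

  import numpy as np

  eps, b = 1e-3, 1e5
  f  = np.sin
  df = np.cos
  g  = lambda x: f(x) + eps * np.cos(b * x)
  dg = lambda x: df(x) - eps * b * np.sin(b * x)

  x = np.linspace(0, 2 * np.pi, 100_001)
  print(np.max(np.abs(g(x) - f(x))))    # <= eps: g stays in the epsilon envelope
  print(np.max(np.abs(dg(x) - df(x))))  # ~ eps * b = 100, despite |f'| <= 1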

Add a scaled version of the Weierstrass function instead if you don't want a derivative at all.


So could this be used with something like a trust-region SQP solver to remove the tweakable trust region parameter?


Seems vaguely like McCormick relaxations? https://optimization.cbe.cornell.edu/index.php?title=McCormi...
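For reference, the standard McCormick envelope for a bilinear term w = x*y on a box, as a small sketch (just the textbook inequalities, not claiming this is what the paper does):

  # Two linear under-estimators and two linear over-estimators that bound
  # the product x*y everywhere on the box [xL, xU] x [yL, yU].
  def mccormick_bounds(x, y, xL, xU, yL, yU):
      under = max(xL * y + x * yL - xL * yL,
                  xU * y + x * yU - xU * yU)
      over  = min(xU * y + x * yL - xU * yL,
                  xL * y + x * yU - xL * yU)
      return under, over

  lo, hi = mccormick_bounds(0.5, 0.5, 0.0, 1.0, 0.0, 1.0)
  print(lo, 0.5 * 0.5, hi)   # the true product sits between the two bounds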





