Gradient noise scale would be interesting to compute individually for each weight/bias. I'm curious how it scales as a function of layer depth, for example -- it could well be that different layers require very different treatment in terms of learning rate. Supporting this is the fact that gradient magnitudes are typically very different across layers, although they evolve similarly over time. Here's something I just whipped up in Mathematica:
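The per-layer idea can be sketched in Python (not the Mathematica snippet above): the "simple" noise scale is B_simple = tr(Σ)/|G|², and both quantities can be estimated from the expected squared gradient norm measured at two batch sizes, following the estimators in Appendix A of the paper. Function and variable names here are illustrative, not from the paper:

```python
def simple_noise_scale(g_small_sq, g_big_sq, b_small, b_big):
    """Simple noise scale B_simple = tr(Sigma) / |G|^2, estimated from
    (expected) squared gradient norms measured at two batch sizes.

    g_small_sq: E[ ||g_{b_small}||^2 ], squared grad norm at batch size b_small
    g_big_sq:   E[ ||g_{b_big}||^2 ],   squared grad norm at batch size b_big
    """
    # Unbiased estimate of the true squared gradient norm |G|^2.
    grad_sq = (b_big * g_big_sq - b_small * g_small_sq) / (b_big - b_small)
    # Unbiased estimate of the per-example gradient variance tr(Sigma).
    trace_sigma = (g_small_sq - g_big_sq) / (1.0 / b_small - 1.0 / b_big)
    return trace_sigma / grad_sq

# To do this per layer, restrict the norms to one layer's weights (or one
# bias vector) and call the estimator once per layer, then compare the
# resulting noise scales across depth.
```

For example, if a layer's squared gradient norm averages 11.0 at batch size 10 and 1.1 at batch size 1000, the implied |G|² is 1 and tr(Σ) is 100, giving a noise scale of 100 for that layer.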
We also show that it's possible to compute the noise scale as you train (without any extra overhead), and this can be used to adjust the batch size in real time, so that in principle you can get the batch size right the first time, in a way that adapts over the course of training. However, those experiments are still preliminary (see Appendix D of the paper).
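The no-extra-overhead measurement can be sketched as follows: in data-parallel training, each worker's gradient is already a small-batch estimate and their mean is the large-batch gradient, so one training step yields both squared norms needed by the two-batch-size estimators, and the running noise scale is then the ratio of exponential moving averages. This is a hedged sketch of that idea (names and the EMA coefficient are my own choices, not from the paper):

```python
import numpy as np

def noise_scale_step(worker_grads, b_small, ema=None, alpha=0.9):
    """Update a running noise-scale estimate from one step's per-worker
    gradients, each computed on a batch of size b_small. Their mean is
    the gradient for the full batch b_big = n_workers * b_small, so no
    extra gradient computations are needed."""
    g = np.stack(worker_grads)
    n_workers = g.shape[0]
    b_big = n_workers * b_small
    small_sq = np.mean(np.sum(g ** 2, axis=1))   # avg ||g_i||^2 over workers
    big_sq = np.sum(np.mean(g, axis=0) ** 2)     # ||mean gradient||^2
    # Unbiased single-step estimates of |G|^2 and tr(Sigma).
    grad_sq = (b_big * big_sq - b_small * small_sq) / (b_big - b_small)
    trace_sigma = (small_sq - big_sq) / (1.0 / b_small - 1.0 / b_big)
    # Smooth each quantity separately; the ratio of EMAs is the running
    # noise scale, which a training loop could use to resize the batch.
    if ema is None:
        ema = {"grad_sq": grad_sq, "trace_sigma": trace_sigma}
    else:
        ema["grad_sq"] = alpha * ema["grad_sq"] + (1 - alpha) * grad_sq
        ema["trace_sigma"] = alpha * ema["trace_sigma"] + (1 - alpha) * trace_sigma
    return ema["trace_sigma"] / ema["grad_sq"], ema
```

Averaging |G|² and tr(Σ) separately before taking the ratio (rather than averaging the ratio itself) keeps the estimate well behaved when a single step's |G|² estimate is near zero or negative, which happens since it is unbiased but noisy.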