Another post of the "If it makes you feel any better" type: I'm a relatively est...

Aeolos · on Jan 30, 2017

> The randomness also doesn't help, as running the same model with different random seeds I have noticed that there is huge variance in accuracy.

Set all your random seeds to something predefined, such as 42. Even though the exact randomness is OS-specific, this will at least help you distinguish lucky runs from real hyperparameter improvements.
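A minimal sketch of what fixing the seeds could look like, assuming the randomness comes from Python's `random` module and NumPy. The helper name `set_seeds` is hypothetical, and any deep learning framework in use has its own generator that would need seeding as well.

```python
import random

import numpy as np


def set_seeds(seed: int = 42) -> None:
    """Seed the global Python and NumPy generators (hypothetical helper)."""
    random.seed(seed)
    np.random.seed(seed)


# Re-seeding with the same value replays the same draws in this environment.
set_seeds(42)
first = np.random.rand(3)
set_seeds(42)
second = np.random.rand(3)
print(np.array_equal(first, second))
```

This only makes a single configuration repeatable on one setup; it says nothing about whether 42 is a lucky or unlucky seed for that configuration.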

Al-Khwarizmi · on Jan 30, 2017

I already do that for reproducibility reasons, but I don't really think it takes luck out of the equation. 42 may be a great seed for a model with 400 cells per layer and a terrible seed for a model with 600 cells per layer, as the different layout will lead to a totally different distribution of the weights even if the seed remains the same.
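To illustrate the point, here is a rough sketch assuming weights are drawn from a standard normal with NumPy. The input size of 10 and the names `w_small` and `w_large` are made up for the example.

```python
import numpy as np

SEED = 42
N_IN = 10  # assumed input size, chosen only for illustration

# Same seed, two layer widths: 400 units versus 600 units.
w_small = np.random.default_rng(SEED).standard_normal((N_IN, 400))
w_large = np.random.default_rng(SEED).standard_normal((N_IN, 600))

# The two shapes consume the random stream differently, so the first
# 400 units of the wide layer do not match the narrow layer as a whole.
print(np.array_equal(w_small, w_large[:, :400]))
```

So the seed pins down one run per architecture, but the 400-unit and 600-unit models still start from unrelated points.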

Aeolos · on Jan 30, 2017

Indeed, but if performance is affected so much by the initialisation, then I would avoid random initialisation in the first place. There are various publications exploring different initialisation methods for various problems.

I'm afraid that I cannot go into more specific details right now, but you can get more stable training and faster convergence with a better initialisation strategy.

yorwba · on Jan 30, 2017

Couldn't you use an initialization pattern that includes all the weights of the smaller layer in the larger layer? This would keep the behavior of a subset of units exactly the same, at least at initialization time.
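One way this might be sketched, assuming NumPy and a standard normal draw: generate a single matrix at some maximum size and slice it, so every smaller layer is a sub-block of every larger one. The function `nested_init` and the `max_in` and `max_units` limits are hypothetical, and any scaling by layer size is left out.

```python
import numpy as np


def nested_init(n_in, n_units, seed=42, max_in=1024, max_units=1024):
    """Return an (n_in, n_units) block of one fixed, seeded matrix.

    Hypothetical helper: every smaller shape is the top-left corner of
    every larger shape drawn with the same seed and the same limits.
    """
    if n_in > max_in or n_units > max_units:
        raise ValueError("requested shape exceeds max_in/max_units")
    full = np.random.default_rng(seed).standard_normal((max_in, max_units))
    return full[:n_in, :n_units].copy()


small = nested_init(10, 400)
large = nested_init(10, 600)
# The first 400 units of the wide layer start with the narrow layer's weights.
print(np.array_equal(small, large[:, :400]))
```

The shared units only match at initialization; once training starts, the extra units change the gradients and the two models drift apart.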

cwilkes · on Jan 30, 2017

Maybe I'm missing something here but if the problem is that different starting positions return different answers when they should all converge to the same one -- doesn't that mean that there's a fundamental problem with the actions being taken?

Fixing the starting randomness seems like that old adage: a man with two clocks doesn't know what time it is, so he throws one of them away.

Al-Khwarizmi · on Jan 31, 2017

There is a fundamental problem in general (not in my particular approach) which is that we don't know how to do non-convex optimization. There are many problems where it's just not possible with our current techniques to know if a minimum is local or global.

Of course, with better tuning one can obtain better optima in general (that's what the whole field is doing) and it's possible that I'm not applying the best techniques and I would get less randomness if I had a better model. But as far as I understand, even the best models can converge to different local optima depending on initialization.
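A toy illustration of that last point, not a neural network: plain gradient descent on a one-dimensional non-convex function that has one global and one local minimum. The function, learning rate, and step count are all choices made for this sketch.

```python
import numpy as np


def grad(x):
    """Derivative of the toy objective f(x) = x**4 - 3*x**2 + x."""
    return 4 * x**3 - 6 * x + 1


def descend(x0, lr=0.01, steps=2000):
    """Run plain gradient descent from x0 and return the final point."""
    x = x0
    for _ in range(steps):
        x -= lr * grad(x)
    return x


# Each seed picks a different starting point in [-2, 2]; the run ends in
# whichever basin the start happened to fall into.
for seed in range(5):
    x0 = np.random.default_rng(seed).uniform(-2.0, 2.0)
    print(f"seed={seed} start={x0:+.3f} end={descend(x0):+.3f}")
```

Starts to the left of the hump near x = 0.17 end near x = -1.30 (the global minimum), and starts to the right end near x = 1.13 (a local one), even though the optimizer and its settings are identical.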