I'm training a sequence-to-sequence model and have been tuning hyperparameters for the last 2-3 months. I'm making progress, but painfully slowly, for two reasons. First, training and testing models takes a long time (I have a local Titan X and some Tesla K80s in a remote cluster, to which I can submit models expecting a queue latency of 3-4 days and a throughput of around 4 models running simultaneously on average; probably more than many people can get, but it still feels slow for this purpose). Second, hyperparameter optimization seems to be little more than blind guessing with some very rough rules of thumb. The randomness doesn't help either: running the same model with different random seeds, I have noticed huge variance in accuracy. So sometimes I tweak a parameter and get improvements, but who knows whether they are significant or just luck with the initialization. I would have to run every experiment with a bunch of seeds to be sure, but that would mean waiting even longer for results, and my research would be old before I got to state-of-the-art accuracy.
Maybe I'm just not good at it and I'm a bit bitter, but my feeling is that this DL revolution is turning research in my area from a battle of brainpower and ingenuity into a battle of GPU power and economic means (in fact my brain doesn't do much work in this research project, as it spends most of its time waiting for results from some GPU; fortunately I have plenty of other non-DL research to do in parallel, so it doesn't get bored). Along the same lines, I can't help but notice that most of the top DL NLP papers come from a select few institutions with huge resources (even though there are heroic exceptions). This doesn't happen as much with non-DL papers.
Good thing that there is still plenty of non-DL research to do, and if DL takes over the whole empirical arena, I'm not bad at theoretical research...
Set all your random seeds to something predefined, such as 42. Exact reproducibility can still vary across libraries and hardware (GPU kernels are often non-deterministic), but this will at least help you tell lucky runs apart from real hyperparameter improvements.
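As a concrete starting point, here is a minimal seed-fixing helper. It's a stdlib-only sketch; in a real DL setup you would also seed your framework's RNGs (the NumPy/PyTorch calls in the comments are the usual ones, assuming PyTorch):

```python
import random

def set_seed(seed: int = 42) -> None:
    """Seed every RNG the experiment touches so runs are comparable."""
    random.seed(seed)  # Python's built-in RNG (shuffling, sampling)
    # In a DL framework you would also do, for example:
    #   np.random.seed(seed)
    #   torch.manual_seed(seed)
    # and, for full GPU determinism (at some speed cost):
    #   torch.backends.cudnn.deterministic = True
    #   torch.backends.cudnn.benchmark = False

# Two runs with the same seed now draw identical random numbers:
set_seed(42)
first = [random.random() for _ in range(3)]
set_seed(42)
second = [random.random() for _ in range(3)]
```

With the seeds fixed, any remaining difference between two runs comes from the change you made, not from the draw of initial weights.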
I'm afraid I can't go into more specific details right now, but you can get more stable training and faster convergence with a better initialisation strategy.
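The commenter doesn't say which strategy they mean, but one widely used option is Glorot/Xavier uniform initialisation, which scales the weight range by the layer's fan-in and fan-out so that activation variance stays roughly constant across layers. A stdlib-only sketch:

```python
import math
import random

def xavier_uniform(fan_in: int, fan_out: int, seed: int = 0) -> list:
    """Draw a fan_in x fan_out weight matrix from U(-limit, limit),
    where limit = sqrt(6 / (fan_in + fan_out))."""
    rng = random.Random(seed)
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    return [[rng.uniform(-limit, limit) for _ in range(fan_out)]
            for _ in range(fan_in)]

W = xavier_uniform(512, 256)
```

In practice you would use your framework's built-in version (e.g. `torch.nn.init.xavier_uniform_` in PyTorch) rather than rolling your own.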
Fixing the starting randomness seems like that old adage about the man with two clocks: he doesn't know what time it is, so he throws one of them away.
Of course, with better tuning one can obtain better optima in general (that's what the whole field is doing), and it's possible that I'm not applying the best techniques and would see less randomness with a better model. But as far as I understand, even the best models can converge to different local optima depending on the initialization.
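To make the seed-variance concern concrete: a cheap sanity check is to run each configuration with a handful of seeds and compare the gap between the means against the per-seed spread. The accuracy numbers below are made up purely for illustration:

```python
import statistics

def summarise(scores):
    """Mean and sample std of accuracies from runs that differ only in seed."""
    return statistics.mean(scores), statistics.stdev(scores)

# Hypothetical accuracies from 5 seeds per configuration (illustrative only):
baseline = [0.712, 0.698, 0.705, 0.721, 0.690]
tweaked = [0.715, 0.709, 0.702, 0.718, 0.711]

baseline_mean, baseline_std = summarise(baseline)
tweaked_mean, tweaked_std = summarise(tweaked)

# If the gap between means is small relative to the per-seed std,
# the "improvement" may be nothing more than seed luck.
gap = tweaked_mean - baseline_mean
```

Here the mean improves by about half a point, but that gap is smaller than the baseline's own seed-to-seed standard deviation, which is exactly the "significant or just luck?" situation described above.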